Support search slicing with point-in-time #74457
Conversation
 * readers. It's intended for scenarios where the reader doesn't change, like in
 * a point-in-time search.
 */
public final class DocIdSliceQuery extends SliceQuery {
Some notes:
- In principle we could add support for using this query as a lead iterator. This didn't seem so useful to me since in the common case, we shouldn't be slicing within a shard in a super fine-grained way?
- I considered slicing based on contiguous ranges of document IDs. This had the advantage that slices could skip certain segments completely. However I was worried about the case where sort order was heavily correlated with doc ID order, which would result in a big imbalance across slices (maybe some would even be empty).
I like the idea of contiguous ranges. Slicing is useful for a full scan where fetching the data is most of the time the costly operation. Using sequential doc ids per slice would ensure that a match all query with slices can get the fetch optimization all the time (read stored_fields per block).
I thought more and am not sure my concern even made sense. Slicing is used when scanning over all documents, not to find top hits, and it's fine if some slices skip over all the early results.
I pushed a simple approach where we split the documents into ranges with roughly equivalent size.
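A minimal sketch of what that "split into ranges with roughly equivalent size" can look like (hypothetical names, not the actual `DocIdSliceQuery` code): slice `id` out of `max` gets a contiguous range of the reader's doc ID space, with the remainder spread over the first slices.

```java
// Hypothetical sketch, not the real Elasticsearch implementation: split the
// doc ID space [0, maxDoc) into `max` contiguous ranges of roughly equal size.
public final class SliceRanges {

    // Returns {minDoc, maxDoc} (max exclusive) for slice `id` of `max` slices.
    static int[] range(int id, int max, int maxDoc) {
        int base = maxDoc / max;   // minimum number of docs per slice
        int extra = maxDoc % max;  // the first `extra` slices get one more doc
        int min = id * base + Math.min(id, extra);
        int length = base + (id < extra ? 1 : 0);
        return new int[] { min, min + length };
    }

    public static void main(String[] args) {
        // 10 documents across 3 slices: sizes 4, 3, 3 — contiguous and disjoint.
        System.out.println(java.util.Arrays.toString(range(0, 3, 10))); // [0, 4]
        System.out.println(java.util.Arrays.toString(range(1, 3, 10))); // [4, 7]
        System.out.println(java.util.Arrays.toString(range(2, 3, 10))); // [7, 10]
    }
}
```

Because each slice is a contiguous doc ID range, a slice whose range doesn't intersect a segment can skip that segment entirely, which is what makes the fetch optimization mentioned above possible.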
Pinging @elastic/es-search (Team:Search)
One more step to deprecate scrolls!
I left some comments regarding the splitting of slices.
Thanks @jtibshirani
Lucene document IDs, which are not stable across changes to the index.

NOTE: By default the maximum number of slices allowed per search is limited to 1024.
You can update the `index.max_slices_per_scroll` index setting to bypass this limit.
With this setup, the limit is not very useful. We should remove it when scrolls are gone, yeah!
The name is also a little confusing now (it mentions scrolls). I wonder if we should update the check so it doesn't apply to point-in-time searches.
I ended up removing the limit for slicing with point-in-time.
@Override
public boolean isCacheable(LeafReaderContext ctx) {
    return true;
Should we cache this query? It could be nice to avoid caching entirely, especially if we take the contiguous range approach.
Oops, I agree caching is not a great idea.
@jimczi this is ready for another look when you have the chance.
LGTM. Thanks @jtibshirani.
maximum number of slices is set to 2 the union of the results of the two requests is equivalent
to the results of a scroll query without slicing. By default the splitting is done first on the
shards, then locally on each shard using the `_id` field. The local splitting follows the formula
`slice(doc) = doc.lucene_id % max`.
> `slice(doc) = doc.lucene_id % max`

I think we should use the old formula for sliced scroll (`slice(doc) = floorMod(hashCode(doc._id), max)`)?
Good catch, this was a bad copy-paste!
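For illustration, the two partitioning formulas discussed above side by side, as a sketch with assumed method names (this is not the actual Elasticsearch code):

```java
// Illustrative sketch of the two slice-partitioning formulas.
public final class SliceFormulas {

    // Point-in-time slicing: partition by Lucene doc ID.
    static int sliceByDocId(int luceneDocId, int max) {
        return luceneDocId % max;
    }

    // Sliced scroll: partition by a hash of the document's _id. Math.floorMod
    // keeps the result in [0, max) even when hashCode() is negative.
    static int sliceById(String id, int max) {
        return Math.floorMod(id.hashCode(), max);
    }

    public static void main(String[] args) {
        System.out.println(sliceByDocId(7, 3)); // 1
        int s = sliceById("my-doc", 5);
        System.out.println(s >= 0 && s < 5);    // true
    }
}
```

The `_id`-based formula is stable across readers but has to visit the terms dictionary; the doc-ID formula is cheap but only valid while the reader is fixed, which is why it suits point-in-time searches.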
@@ -206,7 +206,7 @@ public void preProcess(boolean rewrite) {
        }
    }

-       if (sliceBuilder != null) {
+       if (sliceBuilder != null && scrollContext() != null) {
I think we should have a safeguard for sliced point-in-time, although it's less serious than for sliced scroll. We can do it in a follow-up.
What sort of safeguard did you have in mind (and what would it protect against)?
A limit on the maximum number of slices in a search request.
We discussed offline and agreed that it's not critical to apply a limit now, but plan to revisit it when integrating point-in-time searches into reindex.
public int hashCode() {
    return Objects.hash(classHash(), field, id, max);
}

protected abstract boolean doEquals(SliceQuery o);
The semantics of the two new methods `doEquals` and `doHashCode` are not obvious (to me). Maybe use `ShardDocSortField.NAME` for the field name in DocIdSliceQuery.java so we don't have to add these methods? But I am okay if you prefer to keep these.
This is a nice simplification, I ended up using "_doc" as the field name.
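A sketch of that simplification (hypothetical class, not the actual code): with the field name fixed, `equals` and `hashCode` can be defined once over `(field, id, max)` in the base class, so subclasses no longer need `doEquals`/`doHashCode` extension points.

```java
import java.util.Objects;

// Hypothetical sketch: equality defined once in the base slice query,
// removing the need for per-subclass doEquals/doHashCode hooks.
public class SliceQuerySketch {
    private final String field; // e.g. "_doc" for doc-ID based slicing
    private final int id;
    private final int max;

    SliceQuerySketch(String field, int id, int max) {
        this.field = field;
        this.id = id;
        this.max = max;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        SliceQuerySketch that = (SliceQuerySketch) o;
        return id == that.id && max == that.max && Objects.equals(field, that.field);
    }

    @Override
    public int hashCode() {
        // getClass() stands in here for Lucene's classHash()
        return Objects.hash(getClass(), field, id, max);
    }
}
```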
@elasticmachine run elasticsearch-ci/part-1

It looks like "elasticsearch-ci/part-1" passed in CI, but the status isn't being reported. (Maybe this is related to the Jenkins upgrade?) I'm going to merge even though it's not showing green here.
This PR adds support for using the `slice` option in point-in-time searches. By default, the slice query splits documents based on their Lucene ID. This strategy is more efficient than the one used for scrolls, which is based on the `_id` field and must iterate through the whole terms dictionary. When slicing a search, the same point-in-time ID must be used across slices to guarantee the partitions don't overlap or miss documents.

List of individual changes:
- Add a new `SliceQuery` subclass `DocIdSliceQuery` based on Lucene doc IDs
- Update `SliceBuilder` to track the default case when 'field' has not been set (now the field can be `null`; before, it was filled in with `_id`)

Closes #65740.
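As a sanity check on the guarantee stated above, here is a tiny self-contained demonstration (using the assumed modulo formula) that doc-ID based slicing partitions the doc ID space: every document matches exactly one slice, so slices neither overlap nor miss documents.

```java
// Hypothetical demonstration that modulo-based doc-ID slicing
// partitions [0, maxDoc) across `max` slices.
public final class PartitionCheck {

    static int slice(int docId, int max) {
        return docId % max;
    }

    public static void main(String[] args) {
        int maxDoc = 100, max = 4;
        for (int doc = 0; doc < maxDoc; doc++) {
            int matches = 0;
            for (int s = 0; s < max; s++) {
                if (slice(doc, max) == s) {
                    matches++; // doc would be returned by slice s
                }
            }
            if (matches != 1) throw new AssertionError("doc " + doc);
        }
        System.out.println("every doc lands in exactly one slice");
    }
}
```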