
Support search slicing with point-in-time #74457

Merged · 12 commits · Jul 8, 2021

Conversation

@jtibshirani (Contributor) commented Jun 23, 2021:

This PR adds support for using the slice option in point-in-time searches. By
default, the slice query splits documents based on their Lucene ID. This
strategy is more efficient than the one used for scrolls, which is based on the
_id field and must iterate through the whole terms dictionary. When slicing a
search, the same point-in-time ID must be used across slices to guarantee the
partitions don't overlap or miss documents.

List of individual changes:

  • Remove the 'field' variable from the parent class SliceQuery
  • Add a new DocIdSliceQuery based on Lucene doc IDs
  • Update SliceBuilder to track the default case where 'field' has not been set
    (the field can now be null; before, it was filled in with _id)
  • Allow slicing to work with point-in-time searches in addition to scrolls

Closes #65740.
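The default doc-ID strategy described above can be sketched as follows. This is an illustrative demo only (class and method names are made up, not the actual Elasticsearch implementation):

```java
// Illustrative sketch: by default, point-in-time slicing assigns each document
// to a slice by its Lucene doc ID, so no terms dictionary iteration is needed.
public final class DocIdSliceDemo {

    // slice(doc) = doc.lucene_id % max
    static int sliceFor(int luceneDocId, int maxSlices) {
        return luceneDocId % maxSlices;
    }

    public static void main(String[] args) {
        // With 3 slices, consecutive doc IDs round-robin across slices,
        // so the partitions neither overlap nor miss documents.
        for (int docId = 0; docId < 6; docId++) {
            System.out.println("doc " + docId + " -> slice " + sliceFor(docId, 3));
        }
    }
}
```

Because doc IDs are only stable for a fixed reader, this scheme relies on the same point-in-time ID being used across all slices.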

@jtibshirani force-pushed the slice-query branch 3 times, most recently from e7db689 to 58127a2, on June 23, 2021 18:42
* readers. It's intended for scenarios where the reader doesn't change, like in
* a point-in-time search.
*/
public final class DocIdSliceQuery extends SliceQuery {
jtibshirani (Contributor, Author):

Some notes:

  • In principle we could add support for using this query as a lead iterator. This didn't seem very useful to me, since in the common case we shouldn't be slicing within a shard in a very fine-grained way.
  • I considered slicing based on contiguous ranges of document IDs. This had the advantage that slices could skip certain segments completely. However I was worried about the case where sort order was heavily correlated with doc ID order, which would result in a big imbalance across slices (maybe some would even be empty).

jimczi (Contributor):

I like the idea of contiguous ranges. Slicing is useful for a full scan, where fetching the data is usually the costly operation. Using sequential doc IDs per slice would ensure that a match-all query with slices always gets the fetch optimization (reading stored_fields per block).

jtibshirani (Contributor, Author):

I thought about it more and am not sure my concern even made sense. Slicing is used when scanning over all documents, not to find top hits, and it's fine if some slices skip over all the early results.

I pushed a simple approach where we split the documents into contiguous ranges of roughly equal size.
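Splitting doc IDs into ranges of roughly equal size might look like the following sketch (illustrative names and logic, not the actual code from this PR):

```java
// Illustrative sketch: split [0, maxDoc) into `max` contiguous doc-ID ranges
// of roughly equal size. The first (maxDoc % max) slices get one extra doc.
public final class DocIdRanges {

    // Returns {start, end} bounds (inclusive, exclusive) for slice `id`.
    static int[] rangeFor(int id, int max, int maxDoc) {
        int base = maxDoc / max;          // minimum docs per slice
        int extra = maxDoc % max;         // leftover docs spread over first slices
        int start = id * base + Math.min(id, extra);
        int size = base + (id < extra ? 1 : 0);
        return new int[] { start, start + size };
    }
}
```

For example, 10 documents split across 3 slices gives the ranges [0, 4), [4, 7), and [7, 10): contiguous, non-overlapping, and balanced to within one document. Contiguous ranges also keep each slice reading a sequential block of documents, which helps the stored-fields fetch optimization mentioned earlier.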

@jtibshirani added the :Search/Search, >enhancement, v7.14.0, and v8.0.0 labels Jun 23, 2021
@jtibshirani marked this pull request as ready for review June 23, 2021 20:57
@elasticmachine added the Team:Search label Jun 23, 2021
@elasticmachine (Collaborator):
Pinging @elastic/es-search (Team:Search)

@jimczi (Contributor) left a comment:

One more step to deprecate scrolls!
I left some comments regarding the splitting of slices.
Thanks @jtibshirani

Lucene document IDs, which are not stable across changes to the index.

NOTE: By default the maximum number of slices allowed per search is limited to 1024.
You can update the `index.max_slices_per_scroll` index setting to bypass this limit.
jimczi (Contributor):

With this setup, the limit is not very useful. We should remove it when scrolls are gone, yeah!

jtibshirani (Contributor, Author):

The name is also a little confusing now (it mentions scrolls). I wonder if we should update the check so it doesn't apply to point-in-time searches.

jtibshirani (Contributor, Author):

I ended up removing the limit for slicing with point-in-time.


@Override
public boolean isCacheable(LeafReaderContext ctx) {
return true;
jimczi (Contributor):

Should we cache this query? It could be nice to avoid caching entirely, especially if we take the contiguous-range approach.

@jtibshirani (Contributor, Author) commented Jun 24, 2021:

Oops, I agree caching is not a great idea.

@jtibshirani (Contributor, Author):

@jimczi this is ready for another look when you have the chance.

@jtibshirani requested a review from dnhatn July 7, 2021 16:05
@dnhatn (Member) left a comment:

LGTM. Thanks @jtibshirani.

maximum number of slices is set to 2 the union of the results of the two requests is equivalent
to the results of a scroll query without slicing. By default the splitting is done first on the
shards, then locally on each shard using the `_id` field. The local splitting follows the formula
`slice(doc) = doc.lucene_id % max`.
dnhatn (Member):

slice(doc) = doc.lucene_id % max

I think we should use the old formula for sliced scroll (slice(doc) = floorMod(hashCode(doc._id), max))?

jtibshirani (Contributor, Author):

Good catch, this was a bad copy-paste!
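The scroll-era formula referenced above can be sketched as follows (the class name is illustrative, not Elasticsearch code):

```java
// Illustrative sketch of the _id-based scroll slicing formula:
// slice(doc) = floorMod(hashCode(doc._id), max)
public final class IdHashSliceDemo {

    static int sliceFor(String id, int maxSlices) {
        // floorMod keeps the slice non-negative even when hashCode() is negative
        return Math.floorMod(id.hashCode(), maxSlices);
    }
}
```

Unlike doc-ID slicing, this hash-based scheme is stable across readers, but computing it requires visiting the _id terms dictionary.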

@@ -206,7 +206,7 @@ public void preProcess(boolean rewrite) {
}
}

if (sliceBuilder != null) {
if (sliceBuilder != null && scrollContext() != null) {
dnhatn (Member):

I think we should have a safeguard for sliced point-in-time searches, although it's less serious than for sliced scrolls. We can do it in a follow-up.

jtibshirani (Contributor, Author):

What sort of safeguard did you have in mind (and what would it protect against)?

dnhatn (Member):

The maximum number of slices of a search request.

jtibshirani (Contributor, Author):

We discussed offline and agreed that it's not critical to apply a limit now, but plan to revisit it when integrating point-in-time searches into reindex.

public int hashCode() {
return Objects.hash(classHash(), field, id, max);
}
protected abstract boolean doEquals(SliceQuery o);
dnhatn (Member):

The semantics of the two new methods doEquals and doHashCode are not obvious (to me). Maybe use ShardDocSortField.NAME as the field name in DocIdSliceQuery.java so we don't have to add these methods? But I'm okay if you prefer to keep them.

jtibshirani (Contributor, Author):

This is a nice simplification, I ended up using "_doc" as the field name.
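The simplification discussed here, keeping equality entirely in the base class once the doc-ID query uses a fixed sentinel field name, might look like this sketch (an illustrative standalone class, not the actual SliceQuery):

```java
import java.util.Objects;

// Illustrative: with a fixed field name ("_doc" for doc-ID slicing), slice
// identity is just (class, field, id, max), so no doEquals/doHashCode hooks
// are needed in subclasses.
final class SliceKey {
    final String field;
    final int id;
    final int max;

    SliceKey(String field, int id, int max) {
        this.field = field;
        this.id = id;
        this.max = max;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        SliceKey other = (SliceKey) o;
        return id == other.id && max == other.max && Objects.equals(field, other.field);
    }

    @Override
    public int hashCode() {
        return Objects.hash(getClass(), field, id, max);
    }
}
```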

@jtibshirani (Contributor, Author):

@elasticmachine run elasticsearch-ci/part-1

@jtibshirani (Contributor, Author):

It looks like "elasticsearch-ci/part-1" passed in CI, but the status isn't being reported. (Maybe this is related to the Jenkins upgrade?) I'm going to merge even though it's not showing green here.

Labels: >enhancement · :Search/Search · Team:Search · v7.15.0 · v8.0.0-alpha1
Linked issue: Slice query for Point in time readers (PIT)
6 participants