Virtual Sort field for automatic tie-breaking #56828

Closed
jimczi opened this issue May 15, 2020 · 4 comments · Fixed by #68833
Labels: >enhancement, :Search/Search (search-related issues that do not fall into other categories), Team:Search (meta label for search team)

Comments

@jimczi
Contributor

jimczi commented May 15, 2020

Paginating search requests with search_after requires a tiebreaker that is unique per document. Sorted _scroll queries get one automatically by tie-breaking documents on the index/shardId/docID tuple. This tuple is not accessible to normal search requests, so the other option is to copy the _id of the document into a doc value field and use it as a tiebreaker.
This approach is difficult to implement for solutions that do not control indexing.
With the introduction of the search context for requests, we'll be able to paginate over a set of sorted results using search_after with the guarantee of seeing the same documents during the walk. Since the internal document ID wouldn't change between requests, using the tuple that _scroll queries use becomes possible.
This issue proposes to expose a virtual sort field called _tiebreak (or any name that suits better). The field would be accessible as a sort criterion that can be used with a search context to ensure consistent ordering. The field would be composed of:

  • The index UUID
  • The shard ID
  • The internal document ID

The order of the composition should be discussed but the main goal is to allow consistent ordering using search_after without relying on manual operations at index-time.
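
To illustrate why a per-document tiebreaker matters for search_after, here is a minimal in-memory sketch. It is not Elasticsearch code: the `Hit` type, `search_page`, and `paginate` are hypothetical names, and the tuple order (index UUID, shard ID, doc ID) is only one of the possible compositions mentioned above.

```python
from typing import List, NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    sort_value: int   # user-chosen sort key, e.g. a timestamp
    index_uuid: str   # hypothetical stand-ins for the proposed tuple
    shard_id: int
    doc_id: int


def sort_key(hit: Hit) -> Tuple:
    # Primary sort value first, then the tiebreak tuple, so that no two
    # documents ever compare as equal.
    return (hit.sort_value, hit.index_uuid, hit.shard_id, hit.doc_id)


def search_page(hits: List[Hit], size: int,
                search_after: Optional[Tuple] = None) -> List[Hit]:
    # Return the next `size` hits strictly after the `search_after` key.
    ordered = sorted(hits, key=sort_key)
    if search_after is not None:
        ordered = [h for h in ordered if sort_key(h) > search_after]
    return ordered[:size]


def paginate(hits: List[Hit], size: int) -> List[Hit]:
    # Walk all pages, feeding the last hit's sort key into the next request.
    seen: List[Hit] = []
    after: Optional[Tuple] = None
    while True:
        page = search_page(hits, size, after)
        if not page:
            return seen
        seen.extend(page)
        after = sort_key(page[-1])
```

If the key were the sort value alone, a page boundary that falls inside a run of equal values would skip the remaining documents of that run, because the comparison is strictly "greater than".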

jimczi added the >enhancement and :Search/Search labels May 15, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

elasticmachine added the Team:Search label May 15, 2020
@mayya-sharipova
Contributor

This is a great idea!

Is the idea that a user needs to explicitly provide this sort field in their request: "sort": ["my_date", "_tiebreak"]?

Or is it that, when sorting a search that uses search_context, Elasticsearch will automatically rewrite the sort to add this field as a tiebreaker?

@jpountz
Contributor

jpountz commented May 20, 2020

I wonder the same as Mayya: maybe we could have a good tiebreaker by default that wouldn't require exposing a virtual field? Index UUID and shard ID are the same for all documents of a shard, so Lucene's default tie-breaker (docID) would do the right thing within a shard. Maybe we would only have to change how hits are merged on the coordinating node, and we could provide consistent ordering with negligible overhead?
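
A rough sketch of the merge described here, assuming each shard returns hits already sorted by (sort value, docID) and that the coordinating node knows a stable index for each shard. The function name and the tuple layout are illustrative, not Elasticsearch internals.

```python
import heapq
from typing import Iterator, List, Tuple


def merge_shard_hits(
    per_shard_hits: List[List[Tuple[int, int]]],
) -> List[Tuple[int, int, int]]:
    """Merge per-shard results into one globally ordered list.

    per_shard_hits[i] holds shard i's hits as (sort_value, doc_id) pairs,
    already sorted by (sort_value, doc_id), which is what Lucene's default
    docID tie-breaking yields within a single shard.
    """

    def tagged(shard_index: int,
               hits: List[Tuple[int, int]]) -> Iterator[Tuple[int, int, int]]:
        # Insert the shard index between sort value and doc ID so that ties
        # across shards are broken by shard index first, then by doc ID.
        for sort_value, doc_id in hits:
            yield (sort_value, shard_index, doc_id)

    streams = [tagged(i, hits) for i, hits in enumerate(per_shard_hits)]
    # heapq.merge lazily merges already-sorted inputs.
    return list(heapq.merge(*streams))
```

Each shard does no extra work here; only the merge comparison changes, which matches the "negligible overhead" point.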

@jimczi
Contributor Author

jimczi commented May 20, 2020

I agree that it would be nice to add the tiebreaker automatically, but it needs to be materialized in the sort values of the response. This is useful only for search_after queries, so we rely on users to provide this value when they paginate.

matriv self-assigned this May 26, 2020
jimczi added a commit to jimczi/elasticsearch that referenced this issue Nov 24, 2020
This change generates a tiebreaker automatically for sorted queries that are executed
under a PIT (point in time reader). This makes it possible to paginate consistently over the matching documents without
requiring a sort criterion that is unique per document.
The tiebreaker is automatically added as the last sort value of each search hit in the response.
It is then used by `search_after` to ensure that pagination will not miss any documents and that each document
will appear only once.
This commit also allows queries sorted by internal Lucene id (`_doc`) to be optimized, when they are executed
under a PIT, the same way as scroll queries.

Closes elastic#56828
jimczi added a commit to jimczi/elasticsearch that referenced this issue Dec 1, 2020
This change ensures that the shard index that is used to tiebreak documents with identical sort values
remains consistent between two requests that target the same shards. The index is now always computed from the
natural order of the shards in the search request.
This change also adds the consistent shard index to the ShardSearchRequest. That allows the slice builder
to use this information to build more balanced slice queries.

Relates elastic#56828
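
One way to picture a consistent shard index is the sketch below. It assumes, purely for illustration, that a shard is identified by an (index name, shard number) pair and that "natural order" means sorting those pairs; the real ordering used by the change may differ.

```python
from typing import Dict, Iterable, Tuple

ShardKey = Tuple[str, int]  # hypothetical identifier: (index name, shard number)


def assign_shard_indices(shards: Iterable[ShardKey]) -> Dict[ShardKey, int]:
    # Derive each shard's index from a deterministic ordering of the target
    # shards, so the index does not depend on the order in which shards are
    # listed or respond.
    return {shard: i for i, shard in enumerate(sorted(set(shards)))}
```

Two requests that target the same shards then get the same mapping, which is the property a tiebreaker built on the shard index needs.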
jimczi added a commit that referenced this issue Dec 7, 2020
* Adds a consistent shard index to ShardSearchRequest

This change ensures that the shard index that is used to tiebreak documents with identical sort values
remains consistent between two requests that target the same shards. The index is now always computed from the
natural order of the shards in the search request.
This change also adds the consistent shard index to the ShardSearchRequest. That allows the slice builder
to use this information to build more balanced slice queries.

Relates #56828
jimczi added a commit to jimczi/elasticsearch that referenced this issue Dec 9, 2020
This commit introduces a new sort field called `_shard_doc` that
can be used in conjunction with a PIT to consistently tiebreak
identical sort values. The sort value is a numeric long that is
composed of the ordinal of the shard (assigned by the coordinating node)
and the internal Lucene document ID. These two values are consistent within
a PIT, so this sort criterion can be used as the tiebreaker of any search
request.
Since this sort criterion is stable, we'd like to add it automatically to any
sorted search request that uses a PIT, but we also need to expose it explicitly
in order to be able to:
* Reverse the order of the tiebreaking, useful to search "before" `search_after`.
* Force the primary sort to use it in order to benefit from the `search_after` optimization when sorting by index order (to be released in Lucene 8.8).

I plan to add the documentation and the automatic configuration for PIT in a follow up since this change is already big.

Relates elastic#56828
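
The commit message says the `_shard_doc` value is a long built from the shard ordinal and the Lucene document ID, but it does not spell out the layout. The sketch below assumes the shard ordinal occupies the high 32 bits and the document ID the low 32 bits; that layout and both function names are assumptions made for illustration.

```python
from typing import Tuple

MAX_INT32 = 2**31 - 1  # both components are assumed to be non-negative ints


def encode_shard_doc(shard_index: int, doc_id: int) -> int:
    # Pack the shard ordinal into the high bits and the doc ID into the low
    # bits, so numeric order equals (shard_index, doc_id) tuple order.
    if not 0 <= shard_index <= MAX_INT32:
        raise ValueError("shard_index out of range")
    if not 0 <= doc_id <= MAX_INT32:
        raise ValueError("doc_id out of range")
    return (shard_index << 32) | doc_id


def decode_shard_doc(value: int) -> Tuple[int, int]:
    # Recover (shard_index, doc_id) from the packed value.
    return value >> 32, value & 0xFFFFFFFF
```

With this layout a single numeric comparison orders hits by shard first and by document ID within a shard, and sorting in descending order reverses the tiebreak.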
jimczi added a commit that referenced this issue Dec 18, 2020
This commit introduces a new sort field called `_shard_doc` that
can be used in conjunction with a PIT to consistently tiebreak
identical sort values. The sort value is a numeric long that is
composed of the ordinal of the shard (assigned by the coordinating node)
and the internal Lucene document ID. These two values are consistent within
a PIT, so this sort criterion can be used as the tiebreaker of any search
request.
Since this sort criterion is stable, we'd like to add it automatically to any
sorted search request that uses a PIT, but we also need to expose it explicitly
in order to be able to:
* Reverse the order of the tiebreaking, useful to search "before" `search_after`.
* Force the primary sort to use it in order to benefit from the `search_after` optimization when sorting by index order (to be released in Lucene 8.8).

I plan to add the documentation and the automatic configuration for PIT in a follow up since this change is already big.

Relates #56828
jimczi added a commit to jimczi/elasticsearch that referenced this issue Feb 10, 2021
This PR adds the special `_shard_doc` sort tiebreaker automatically to any
search request that uses a PIT. Adding the tiebreaker ensures that any
sorted query can be paginated consistently within a PIT.

Closes elastic#56828
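
The automatic behavior can be pictured as a rewrite of the request body. This is a client-side sketch of the idea rather than the server implementation: `with_tiebreaker` is a hypothetical helper, and it assumes the `sort` section is a list of field names or single-key objects.

```python
import copy
from typing import Any, Dict


def with_tiebreaker(body: Dict[str, Any]) -> Dict[str, Any]:
    # Return a copy of the search body with `_shard_doc` appended to the
    # sort when the request uses a PIT and has no explicit tiebreaker.
    body = copy.deepcopy(body)
    sort = body.get("sort")
    if "pit" not in body or not sort:
        return body  # not a sorted PIT request, leave untouched

    def field_name(clause: Any) -> str:
        # A sort clause is either "field" or {"field": order_or_options}.
        return clause if isinstance(clause, str) else next(iter(clause))

    if not any(field_name(clause) == "_shard_doc" for clause in sort):
        sort.append("_shard_doc")
    return body
```

A request that already lists `_shard_doc`, for example with `"desc"` to page backwards, is left unchanged, which mirrors the reason the earlier commit gives for also exposing the field explicitly.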
jimczi added a commit that referenced this issue Feb 17, 2021
This PR adds the special `_shard_doc` sort tiebreaker automatically to any
search request that uses a PIT. Adding the tiebreaker ensures that any
sorted query can be paginated consistently within a PIT.

Closes #56828
jimczi added a commit that referenced this issue Feb 18, 2021
This PR adds the special `_shard_doc` sort tiebreaker automatically to any
search request that uses a PIT. Adding the tiebreaker ensures that any
sorted query can be paginated consistently within a PIT.

Closes #56828
jimczi added a commit to jimczi/elasticsearch that referenced this issue Feb 22, 2021
This commit ensures that the automatic tiebreaker `_shard_doc` does
not disable sort optimization.

Relates elastic#56828
jimczi added a commit that referenced this issue Feb 22, 2021
This commit ensures that the automatic tiebreaker `_shard_doc` does
not disable sort optimization.

Relates #56828
5 participants