Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shortcut query phase using the results of other shards #51852

Merged
merged 53 commits into from
Mar 17, 2020

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented Feb 4, 2020

This commit, built on top of #51708, allows to modify shard search requests based on informations collected on other shards. It is intended to speed up sorted queries on time-based indices. For queries that are only interested in the top documents.

This change will rewrite the shard queries to match none if the bottom sort value computed in prior shards is better than all values in the shard.
For queries that mix top documents and aggregations this change will reset the size of the top documents to 0 instead of rewriting to match none.
This means that we don't need to keep a search context open for this shard since we know in advance that it doesn't contain any competitive hit.

Closes #49601

This change ensures that the rewrite of the shard request is executed in the network thread or in the refresh listener
when waiting for an active shard. This allows queries that rewrite to match_no_docs to bypass the search thread pool
entirely even if the can_match phase was skipped (pre_filter_shard_size > number of shards). Coordinating nodes
don't have the ability to create empty responses so this change also ensures that at least one shard creates a full empty
response while the other can return null ones. This is needed since creating true empty responses on shards require to create
concrete aggregators which would be too costly to build on a network thread. We should move this functionality to aggregation
builders in a follow up but that would be a much bigger change.
This change is also important for elastic#49601 since we want to add the ability to use the result of other shards to rewrite the request
of subsequent ones. For instance if the first M shards have their top N computed, the top worst document in the global queue can be pass
to subsequent shards that can then rewrite to match_no_docs if they can guarantee that they don't have any document better than the provided
one.
@jimczi jimczi added >enhancement :Search/Search Search-related issues that do not fall into other categories :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 4, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Distributed)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Search)

Copy link
Contributor

@mayya-sharipova mayya-sharipova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, LGTM, I just left a couple of comments.

@@ -72,4 +85,38 @@ protected void onShardGroupFailure(int shardIndex, Exception exc) {
protected SearchPhase getNextPhase(final SearchPhaseResults<SearchPhaseResult> results, final SearchPhaseContext context) {
return new FetchSearchPhase(results, searchPhaseController, context);
}

ShardSearchRequest rewriteShardRequest(ShardSearchRequest request) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not relevant to this PR, but In future, do we want to rewrite also requests without sort ( e.g. a keyword search) that can be shortcut?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I follow. Are you talking of handling queries sorted by _score ? We can probably propagate the global min competitive score up to the query collector so that wouldn't require any rewrite.

jimczi added a commit that referenced this pull request Mar 18, 2020
This change adapts the serialization checks to 7.7.0 in order to cope with #53659.
Note that this commit also disables the bwc tests temporarily in order to be able to
merge #53659 first.

Relates #51852
jimczi added a commit that referenced this pull request Mar 18, 2020
This commit, built on top of #51708, allows to modify shard search requests based on informations collected on other shards. It is intended to speed up sorted queries on time-based indices. For queries that are only interested in the top documents.

This change will rewrite the shard queries to match none if the bottom sort value computed in prior shards is better than all values in the shard.
For queries that mix top documents and aggregations this change will reset the size of the top documents to 0 instead of rewriting to match none.
This means that we don't need to keep a search context open for this shard since we know in advance that it doesn't contain any competitive hit.
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Mar 18, 2020
This commit disables the sort optimization added in elastic#51852 for scroll requests.
Scroll queries keep a state per shard so we cannot modify the request on
the first round (submit).
This bug was introduced in non-released versions which is why this pr
is marked as a non-issue.
jimczi added a commit that referenced this pull request Mar 19, 2020
This commit disables the sort optimization added in #51852 for scroll requests.
Scroll queries keep a state per shard so we cannot modify the request on
the first round (submit).
This bug was introduced in non-released versions which is why this pr
is marked as a non-issue.
jimczi added a commit that referenced this pull request Mar 19, 2020
This commit disables the sort optimization added in #51852 for scroll requests.
Scroll queries keep a state per shard so we cannot modify the request on
the first round (submit).
This bug was introduced in non-released versions which is why this pr
is marked as a non-issue.
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Aug 6, 2020
Collapse search queries that sort by a field can throw
an ArrayStoreException due to a bug in the [sort optimization](elastic#51852)
introduced in 7.7.0. Search collapsing were not supposed to
be eligible for this sort optimization so this change explicitly
filters them from this new feature.
jimczi added a commit that referenced this pull request Aug 6, 2020
Collapse search queries that sort by a field can throw
an ArrayStoreException due to a bug in the [sort optimization](#51852)
introduced in 7.7.0. Search collapsing were not supposed to
be eligible for this sort optimization so this change explicitly
filters them from this new feature.
jimczi added a commit that referenced this pull request Aug 6, 2020
Collapse search queries that sort by a field can throw
an ArrayStoreException due to a bug in the [sort optimization](#51852)
introduced in 7.7.0. Search collapsing were not supposed to
be eligible for this sort optimization so this change explicitly
filters them from this new feature.
jimczi added a commit that referenced this pull request Aug 6, 2020
Collapse search queries that sort by a field can throw
an ArrayStoreException due to a bug in the [sort optimization](#51852)
introduced in 7.7.0. Search collapsing were not supposed to
be eligible for this sort optimization so this change explicitly
filters them from this new feature.
javanna added a commit to javanna/elasticsearch that referenced this pull request Dec 1, 2022
Whenever sorting on a date, numeric or keyword field (as primary sort), the can_match phase retrieves min and max for the field and sorts the shards (asc or desc depending on the sort order) so that they are going to be queried following that order. This allows incremental results to be exposed in that same order when using async search, as well as optimizations built on top of such behaviour (elastic#51852).

For fields with points we call `getMinPackedValue` and `getMaxPackedValue`, while for keyword fields we call `Terms#getMin` and `Terms#getMax`. Elasticsearch uses `FilterTerms` implementations to cancel queries as well as to track field usage. Such filter implementations should delegate their `getMin` and `getMax` calls to the wrapped `Terms` instance, which will leverage info from the block tree that caches min and max, otherwise they are always going to be retrieved from the index, which does I/O and slows the can_match phase down.
javanna added a commit to javanna/rally-tracks that referenced this pull request Dec 19, 2022
… and geonames tracks

We recently found a regression that affected searches sorted by keyword field (elastic/elasticsearch#92026).

Given that we had no benchmarks for sorting by keyword, this commit adds the relevant operations to the http-logs and geonames tracks.

Geonames is a good base but it's good to also make the new challenges part of the many-shards benchmarks as differences can be appreciated
with a high amount of shards involved in a query. This commit adds also specific challenges to verify the effect of elastic/elasticsearch#51852
when a search is sorted by numeric or timestamp.
javanna added a commit that referenced this pull request Jan 11, 2023
Whenever sorting on a date, numeric or keyword field (as primary sort), the can_match phase retrieves min and max for the field and sorts the shards (asc or desc depending on the sort order) so that they are going to be queried following that order. This allows incremental results to be exposed in that same order when using async search, as well as optimizations built on top of such behaviour (#51852).

For fields with points we call `getMinPackedValue` and `getMaxPackedValue`, while for keyword fields we call `Terms#getMin` and `Terms#getMax`. Elasticsearch uses `FilterTerms` implementations to cancel queries as well as to track field usage. Such filter implementations should delegate their `getMin` and `getMax` calls to the wrapped `Terms` instance, which will leverage info from the block tree that caches min and max, otherwise they are always going to be retrieved from the index, which does I/O and slows the can_match phase down.
javanna added a commit to javanna/elasticsearch that referenced this pull request Jan 11, 2023
…#92026)

Whenever sorting on a date, numeric or keyword field (as primary sort), the can_match phase retrieves min and max for the field and sorts the shards (asc or desc depending on the sort order) so that they are going to be queried following that order. This allows incremental results to be exposed in that same order when using async search, as well as optimizations built on top of such behaviour (elastic#51852).

For fields with points we call `getMinPackedValue` and `getMaxPackedValue`, while for keyword fields we call `Terms#getMin` and `Terms#getMax`. Elasticsearch uses `FilterTerms` implementations to cancel queries as well as to track field usage. Such filter implementations should delegate their `getMin` and `getMax` calls to the wrapped `Terms` instance, which will leverage info from the block tree that caches min and max, otherwise they are always going to be retrieved from the index, which does I/O and slows the can_match phase down.
javanna added a commit that referenced this pull request Jan 12, 2023
…#92854)

Whenever sorting on a date, numeric or keyword field (as primary sort), the can_match phase retrieves min and max for the field and sorts the shards (asc or desc depending on the sort order) so that they are going to be queried following that order. This allows incremental results to be exposed in that same order when using async search, as well as optimizations built on top of such behaviour (#51852).

For fields with points we call `getMinPackedValue` and `getMaxPackedValue`, while for keyword fields we call `Terms#getMin` and `Terms#getMax`. Elasticsearch uses `FilterTerms` implementations to cancel queries as well as to track field usage. Such filter implementations should delegate their `getMin` and `getMax` calls to the wrapped `Terms` instance, which will leverage info from the block tree that caches min and max, otherwise they are always going to be retrieved from the index, which does I/O and slows the can_match phase down.
danielmitterdorfer pushed a commit to danielmitterdorfer/elasticsearch that referenced this pull request Jan 12, 2023
…#92026)

Whenever sorting on a date, numeric or keyword field (as primary sort), the can_match phase retrieves min and max for the field and sorts the shards (asc or desc depending on the sort order) so that they are going to be queried following that order. This allows incremental results to be exposed in that same order when using async search, as well as optimizations built on top of such behaviour (elastic#51852).

For fields with points we call `getMinPackedValue` and `getMaxPackedValue`, while for keyword fields we call `Terms#getMin` and `Terms#getMax`. Elasticsearch uses `FilterTerms` implementations to cancel queries as well as to track field usage. Such filter implementations should delegate their `getMin` and `getMax` calls to the wrapped `Terms` instance, which will leverage info from the block tree that caches min and max, otherwise they are always going to be retrieved from the index, which does I/O and slows the can_match phase down.
javanna added a commit to javanna/elasticsearch that referenced this pull request Jan 12, 2023
…#92026)

Whenever sorting on a date, numeric or keyword field (as primary sort), the can_match phase retrieves min and max for the field and sorts the shards (asc or desc depending on the sort order) so that they are going to be queried following that order. This allows incremental results to be exposed in that same order when using async search, as well as optimizations built on top of such behaviour (elastic#51852).

For fields with points we call `getMinPackedValue` and `getMaxPackedValue`, while for keyword fields we call `Terms#getMin` and `Terms#getMax`. Elasticsearch uses `FilterTerms` implementations to cancel queries as well as to track field usage. Such filter implementations should delegate their `getMin` and `getMax` calls to the wrapped `Terms` instance, which will leverage info from the block tree that caches min and max, otherwise they are always going to be retrieved from the index, which does I/O and slows the can_match phase down.
javanna added a commit that referenced this pull request Jan 12, 2023
Whenever sorting on a date, numeric or keyword field (as primary sort), the can_match phase retrieves min and max for the field and sorts the shards (asc or desc depending on the sort order) so that they are going to be queried following that order. This allows incremental results to be exposed in that same order when using async search, as well as optimizations built on top of such behaviour (#51852).

For fields with points we call `getMinPackedValue` and `getMaxPackedValue`, while for keyword fields we call `Terms#getMin` and `Terms#getMax`. Elasticsearch uses `FilterTerms` implementations to cancel queries as well as to track field usage. Such filter implementations should delegate their `getMin` and `getMax` calls to the wrapped `Terms` instance, which will leverage info from the block tree that caches min and max, otherwise they are always going to be retrieved from the index, which does I/O and slows the can_match phase down.
javanna added a commit to elastic/rally-tracks that referenced this pull request Jan 23, 2023
… and geonames tracks (#357)

We recently found a regression that affected searches sorted by a keyword field (elastic/elasticsearch#92026).

Given that we had no benchmarks for sorting by keyword, this commit adds the relevant operations to the http-logs and geonames tracks. We will want to also add similar challenges to the many-shards benchmarks, as the regressions we found can be seen with more than a couple of shards. This commit adds also specific challenges to verify the effect of elastic/elasticsearch#51852 when a search is sorted by numeric or timestamp.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >enhancement release highlight :Search/Search Search-related issues that do not fall into other categories v7.7.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize distributed numeric sort for time-based indices
5 participants