New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support CCS minimize round trips in async search #96012
Support CCS minimize round trips in async search #96012
Conversation
Pinging @elastic/es-search (Team:Search) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me so far, I am curious on why it's marked WIP ;) maybe we should add some indication in the get async search API response that some remote clusters are minimizing roundtrips?
qa/multi-cluster-search/src/test/java/org/elasticsearch/search/CCSDuelIT.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/action/search/SearchProgressListener.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/action/search/SearchProgressListener.java
Outdated
Show resolved
Hide resolved
1477f42
to
82930fd
Compare
193e2af
to
eeb6bfc
Compare
With these changes, I have removed the notifyMinimizeRoundtrips/onMinimizeRoundtrips call to the SearchProgressListener in favor of modifying the The Here's an example of the progression in the POST blogs,remote1:*,remote2:*/_async_search?ccs_minimize_roundtrips=true
{
// ... search body for a slow search not shown
}
// after _async_seach submit
{
"id": "FkF3RE1KZnFSU3M2VUdsWEZjVnlmZFEaQVRnZDdaNjBUZ1N0RU1sUjl0QmphUTo2OTI=",
"is_partial": true,
"is_running": true,
"start_time_in_millis": 1684516160099,
"expiration_time_in_millis": 1684948160099,
"response": {
"took": 806,
"timed_out": false,
"terminated_early": false,
"num_reduce_phases": 0,
"_shards": {
"total": 3, // only shows local shards until remote searches finish (current behavior)
"successful": 0, // no shards completed
"skipped": 0,
"failed": 0
},
"_clusters": {
"total": 3,
"successful": 0, // no clusters completed (new behavior)
"skipped": 0
},
"hits": {
"total": {
"value": 0,
"relation": "gte"
},
"max_score": null,
"hits": []
}
}
}
// while running (local cluster has finished; two remote cluster searches still running
GET /_async_search/FkF3RE1KZnFSU3M2VUdsWEZjVnlmZFEaQVRnZDdaNjBUZ1N0RU1sUjl0QmphUTo2OTI
{
"id": "FkF3RE1KZnFSU3M2VUdsWEZjVnlmZFEaQVRnZDdaNjBUZ1N0RU1sUjl0QmphUTo2OTI=",
"is_partial": true,
"is_running": true,
"start_time_in_millis": 1684516160099,
"expiration_time_in_millis": 1684948160099,
"response": {
"took": 5973,
"timed_out": false,
"terminated_early": false,
"_shards": {
"total": 3,
"successful": 3, // local shards completed
"skipped": 0,
"failed": 0
},
"_clusters": {
"total": 3,
"successful": 1, // local cluster has completed (new behavior)
"skipped": 0
},
"hits": {
"total": {
"value": 142,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
}
// when all clusters completed
GET /_async_search/FkF3RE1KZnFSU3M2VUdsWEZjVnlmZFEaQVRnZDdaNjBUZ1N0RU1sUjl0QmphUTo2OTI
{
"id": "FkF3RE1KZnFSU3M2VUdsWEZjVnlmZFEaQVRnZDdaNjBUZ1N0RU1sUjl0QmphUTo2OTI=",
"is_partial": false,
"is_running": false,
"start_time_in_millis": 1684516160099,
"expiration_time_in_millis": 1684948160099,
"response": {
"took": 10350,
"timed_out": false,
"num_reduce_phases": 4,
"_shards": {
"total": 9,
"successful": 9,
"skipped": 0,
"failed": 0
},
"_clusters": {
"total": 3,
"successful": 3,
"skipped": 0
},
"hits": {
"total": {
"value": 432,
"relation": "eq"
},
"max_score": 17.110811,
"hits": [
... Note that this is NOT the behavior with ccs_minimize_roundtrips=false. There there _clusters output is always the final output where total == successful + skipped: POST blogs,remote1:*,remote2:*/_async_search?ccs_minimize_roundtrips=false
{
// ... search body for a slow search not shown
}
...
"id": "FkF3RE1KZnFSU3M2VUdsWEZjVnlmZFEaQVRnZDdaNjBUZ1N0RU1sUjl0QmphUTo2OTI=",
"is_partial": true,
"is_running": true,
"start_time_in_millis": 1684516160099,
"expiration_time_in_millis": 1684948160099,
"response": {
...
"_clusters": {
"total": 3,
"successful": 3, // shows final state even if no shards/clusters have completed (current behavior)
"skipped": 0
}, |
.../plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/action/search/SearchResponse.java
Outdated
Show resolved
Hide resolved
eeb6bfc
to
aaa9b4e
Compare
I tested a case where the remote clusters are down: // case where skip_unavailable: true is NOT set for a remote cluster and it is down:
POST blogs,remote1:*,remote2:*/_async_search?ccs_minimize_roundtrips=true
...
// response:
{
"id": "FjJJZG9yWUhOU3VLaTJzNXZMalZaTkEaaVhraVNDT2VTenFoUDlod21hYmlKZzo5MTE=",
"is_partial": true,
"is_running": false,
"start_time_in_millis": 1684767109136,
"expiration_time_in_millis": 1685199109136,
"response": {
"took": 10927,
"timed_out": false,
"terminated_early": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"_clusters": {
"total": 3,
"successful": 1,
"skipped": 0
},
"hits": {
"total": {
"value": 167,
"relation": "eq"
},
"max_score": null,
"hits": []
}
},
"error": {
"type": "status_exception",
"reason": "error while executing search",
"caused_by": {
"type": "connect_transport_exception",
"reason": "[][127.0.0.1:9301] connect_exception",
"caused_by": {
"type": "uncategorized_execution_exception",
"reason": "Failed execution",
"caused_by": {
"type": "execution_exception",
"reason": "execution_exception: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9301",
"caused_by": {
"type": "i_o_exception",
"reason": "Connection refused: /127.0.0.1:9301",
"caused_by": {
"type": "i_o_exception",
"reason": "Connection refused"
}
}
}
}
}
}
}
----
// where skip_unavailable:true
POST blogs,remote1:*,remote2:*/_async_search?ccs_minimize_roundtrips=true
...
// response:
{
"id": "FnE0SjZrb2lOVFR5ejFCN3dtX0NHbVEbaVhraVNDT2VTenFoUDlod21hYmlKZzoxMDIz",
"is_partial": false,
"is_running": false,
"start_time_in_millis": 1684767263845,
"expiration_time_in_millis": 1685199263845,
"response": {
"took": 11943,
"timed_out": false,
"num_reduce_phases": 3,
"_shards": {
"total": 6,
"successful": 6,
"skipped": 0,
"failed": 0
},
"_clusters": {
"total": 3,
"successful": 2,
"skipped": 1 // was properly updated at the end (started off as 0)
},
"hits": {
"total": {
"value": 370,
"relation": "eq"
},
"max_score": 16.595804,
"hits": [
{
... |
aaa9b4e
to
2793f5d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left a couple of comments, great progress!!
server/src/main/java/org/elasticsearch/action/search/TransportSearchAction.java
Outdated
Show resolved
Hide resolved
.../plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java
Outdated
Show resolved
Hide resolved
.../plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java
Outdated
Show resolved
Hide resolved
.../plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java
Outdated
Show resolved
Hide resolved
271c63c
to
fba88c4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SHIP IT
1cab8d9
to
a6460bd
Compare
@abdonpijpelink - Would you have time to review the end user docs changes here? You can look at the last commit to see the new changes: a7b7237 |
a6460bd
to
a7b7237
Compare
I just talked with Luca and we are going to change the default setting in a follow-on PR, so the end user docs changes will go into that new PR so you can wait to review it then. I'll be deleting those commits from this PR in a moment. Sorry for the change. |
a7b7237
to
5c20cf5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing! For your follow-up PR: note that the <1>
numbers ("callouts") have to restart from 1 for each code snippet, otherwise the docs won't build.
"successful": 3, <4> | ||
"skipped": 0, | ||
"failed": 0 | ||
}, | ||
"_clusters": { | ||
"total": 3, | ||
"successful": 1, <5> | ||
"skipped": 0 | ||
}, | ||
"hits": { | ||
"total": { | ||
"value": 167, <6> | ||
"relation": "eq" | ||
}, | ||
"max_score": null, | ||
"hits": [] | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
|
||
<4> All the local cluster shards have completed. | ||
<5> The local cluster search has completed, so the "successful" clusters entry | ||
is set to 1. The `_clusters` response section will not be updated for the remote clusters | ||
until all remote searches have finished (either successfully or been skipped). | ||
<6> Number of hits from the local cluster search. Final hits are not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"successful": 3, <4> | |
"skipped": 0, | |
"failed": 0 | |
}, | |
"_clusters": { | |
"total": 3, | |
"successful": 1, <5> | |
"skipped": 0 | |
}, | |
"hits": { | |
"total": { | |
"value": 167, <6> | |
"relation": "eq" | |
}, | |
"max_score": null, | |
"hits": [] | |
} | |
} | |
} | |
-------------------------------------------------- | |
<4> All the local cluster shards have completed. | |
<5> The local cluster search has completed, so the "successful" clusters entry | |
is set to 1. The `_clusters` response section will not be updated for the remote clusters | |
until all remote searches have finished (either successfully or been skipped). | |
<6> Number of hits from the local cluster search. Final hits are not | |
"successful": 3, <1> | |
"skipped": 0, | |
"failed": 0 | |
}, | |
"_clusters": { | |
"total": 3, | |
"successful": 1, <2> | |
"skipped": 0 | |
}, | |
"hits": { | |
"total": { | |
"value": 167, <3> | |
"relation": "eq" | |
}, | |
"max_score": null, | |
"hits": [] | |
} | |
} | |
} | |
-------------------------------------------------- | |
<1> All the local cluster shards have completed. | |
<2> The local cluster search has completed, so the "successful" clusters entry | |
is set to 1. The `_clusters` response section will not be updated for the remote clusters | |
until all remote searches have finished (either successfully or been skipped). | |
<3> Number of hits from the local cluster search. Final hits are not |
This commit makes the smallest set of changes to allow async-search based cross-cluster search to work with the CCS minimize_round_trips feature without changing the internals/architecture of the search action. When ccsMinimizeRoundtrips is set to true on SubmitAsyncSearchRequest, the AsyncSearchTask on the primary CCS coordinator sends a synchronous SearchRequest to all to clusters for a remote coordinator to orchestrate and return the entire result set to the CCS coordinator as a single response. This is the same functionality provided by synchronous CCS search using minimize_round_trips. However, since this is an async search, it means that the async search coordinator has no visibility into search progress on the remote clusters while they are running the search, thus losing one of the key features of async search. However, this is a good first approach for improving overall search latency for cross cluster searches that query a large number of shards on remote clusters, since Kibana does not currently expose incremental progress of an async search to users.
…ul count in Clusters - for async minimizeRoundtrips usage
MutableSearchResponse now has logic in buildResponse about whether to update the Clusters object to indicate that the local cluster has finished.
…tableSearchResponse
…ProgressListener. Modified SearchResponse.Clusters to optionally track remoteClusters and ccsMinimizeRoundtrips. As the original Clusters object was intended to be immutable, I've added a "frozen" flag that is set when the original constructor is called. CCS AsyncSearch minimizeRoundTrips uses this new Clusters variant. onListShards, with the revised impl and usage of SearchResponse.Clusters, now handles notifying the SearchProgressListener in the AsyncSearchTask about the number of remote clusters, whether local shards are being searched and ccsMinimizeRoundtrips status.
Made the Clusters and localClusterStatusUpdated fields to be updated in a synchronized block. in MutableSearchResponse.SearchClustersInfo to make them safe for multi threaded access.
5c20cf5
to
ded776b
Compare
* Support CCS minimize round trips in async search This commit makes the smallest set of changes to allow async-search based cross-cluster search to work with the CCS minimize_round_trips feature without changing the internals/architecture of the search action. When ccsMinimizeRoundtrips is set to true on SubmitAsyncSearchRequest, the AsyncSearchTask on the primary CCS coordinator sends a synchronous SearchRequest to all to clusters for a remote coordinator to orchestrate and return the entire result set to the CCS coordinator as a single response. This is the same functionality provided by synchronous CCS search using minimize_roundtrips. Since this is an async search, it means that the async search coordinator has no visibility into search progress on the remote clusters while they are running the search, thus losing one of the key features of async search. However, this is a good first approach for improving overall search latency for cross cluster searches that query a large number of shards on remote clusters, since Kibana does not currently expose incremental progress of an async search to users. Relates elastic#73971
This commit makes the smallest set of changes to allow async-search based cross-cluster search
to work with the CCS minimize_round_trips feature without changing the internals/architecture of
the search action.
When ccsMinimizeRoundtrips is set to true on SubmitAsyncSearchRequest, the AsyncSearchTask on the
primary CCS coordinator sends a synchronous SearchRequest to all to clusters for a remote coordinator
to orchestrate and return the entire result set to the CCS coordinator as a single response.
This is the same functionality provided by synchronous CCS search using minimize_round_trips.
However, since this is an async search, it means that the async search coordinator has no visibility
into search progress on the remote clusters while they are running the search, thus losing one of
the key features of async search. However, this is a good first approach for improving overall search
latency for cross cluster searches that query a large number of shards on remote clusters, since
Kibana does not currently expose incremental progress of an async search to users.