
Add new x-pack endpoints to track the progress of a search asynchronously #49931

Merged
merged 81 commits into master, Mar 10, 2020

Conversation

jimczi

@jimczi jimczi commented Dec 6, 2019

High level view

This change introduces a new API in x-pack basic that allows users to track the progress of a search.
Users can submit an asynchronous search through a new endpoint called _async_search that
works exactly like the _search endpoint, but instead of blocking and returning the final response when it is available, it returns a response after the provided wait_for_completion time.

# Submit an _async_search and wait up to 100ms for a final response
GET my_index_pattern*/_async_search?wait_for_completion=100ms
{
  "aggs": {
    "date_histogram": {
      "field": "@timestamp",
      "fixed_interval": "1h"
    }
  }
}

If the final response is not available after 100ms, a partial response is included in the body (in the example below, under the response key with is_partial set to true):

{
  "id": "9N3J1m4BgyzUDzqgC15b",
  "version": 1,
  "is_running": true,
  "is_partial": true,
  "response": {
   "_shards": {
       "total": 100,
       "successful": 5,
       "failed": 0
    },
    "total_hits": {
      "value": 1653433,
      "relation": "eq"
    },
    "aggs": {
      ...
    }
  }
}

The partial response contains the total number of requested shards, the number of shards that returned successfully, and the number of shards that failed.
It also contains the total hits as well as partial aggregations computed from the successful shards.
To continue monitoring the progress of the search, users can call the get _async_search API as follows:
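For illustration, the shard counts in the partial response can be turned into a progress figure. This is a Python sketch; the shard_progress helper is hypothetical, not part of any Elasticsearch client:

```python
def shard_progress(partial_response):
    """Fraction of shards that have reported back (successful or failed),
    based on the _shards section of an async search partial response."""
    shards = partial_response["_shards"]
    done = shards["successful"] + shards["failed"]
    return done / shards["total"]

# With the example body above: 5 of 100 shards have reported.
partial = {"_shards": {"total": 100, "successful": 5, "failed": 0}}
print(shard_progress(partial))  # 0.05
```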

GET _async_search/9N3J1m4BgyzUDzqgC15b/?wait_for_completion=100ms

This returns a new response that can contain the same partial response as the previous call if the search hasn't progressed; in that case the returned version
is the same. If new partial results are available, the version is incremented and the partial response contains the updated progress.
Finally, if the response is fully available while or after waiting for completion, the partial response is replaced by a response section that contains the usual _search response:

{
  "id": "9N3J1m4BgyzUDzqgC15b",
  "version": 10,
  "is_running": false,
  "response": {
     "is_partial": false,
     ...
  }
}
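A client-side polling loop over these responses might look like the following Python sketch. Everything here is hypothetical illustration, not a real client API: get_status stands in for an HTTP GET of _async_search/<id>, and the response sequence is simulated.

```python
import time

def poll_async_search(get_status, wait=0.0, max_polls=50):
    """Poll an async search until it completes: repeat the GET while
    is_running is true, and watch the version field to detect progress
    between calls. `get_status` is a stand-in for the real HTTP GET."""
    last_version = -1
    for _ in range(max_polls):
        status = get_status()
        if status["version"] != last_version:
            last_version = status["version"]  # new partial results arrived
        if not status["is_running"]:
            return status["response"]  # final response replaces the partial one
        time.sleep(wait)
    raise TimeoutError("async search did not complete in time")

# Simulated sequence of GET responses for demonstration: no progress between
# the first two calls (same version), then completion.
responses = iter([
    {"version": 1, "is_running": True, "response": {"is_partial": True}},
    {"version": 1, "is_running": True, "response": {"is_partial": True}},
    {"version": 2, "is_running": False, "response": {"is_partial": False}},
])
final = poll_async_search(lambda: next(responses))
print(final)  # {'is_partial': False}
```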

Persistence

Asynchronous searches are stored in a restricted index called .async-search if they survive (are still running) after the initial submit. Each request has a keep_alive that defaults to 5 days, but this value can be changed at any time:

GET my_index_pattern*/_async_search?wait_for_completion=100ms&keep_alive=10d

The default can be changed when submitting the search; the example above raises the keep_alive for this search to 10d.

GET _async_search/9N3J1m4BgyzUDzqgC15b/?wait_for_completion=100ms&keep_alive=10d

The time to live for a specific search can also be extended when getting its progress/result; in the example above we extend the keep_alive by 10 more days.
A background service that runs only on the node holding the first primary shard of the .async-search index is responsible for deleting expired results. It runs every hour, but expiration is also checked by running queries (if they take longer than the keep_alive) and when getting a result.
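The expiration check described above can be sketched as follows. This is an illustrative Python approximation under stated assumptions (a stored timestamp and a keep_alive duration), not the actual server implementation:

```python
from datetime import datetime, timedelta

KEEP_ALIVE_DEFAULT = timedelta(days=5)  # the 5-day default mentioned above

def is_expired(stored_at, keep_alive=KEEP_ALIVE_DEFAULT, now=None):
    """True if a stored async search response has outlived its keep_alive,
    the kind of check the hourly maintenance service would apply."""
    now = now or datetime.utcnow()
    return now - stored_at > keep_alive

t0 = datetime(2020, 3, 1)
# Expired after 6 days with the default 5-day keep_alive:
print(is_expired(t0, now=t0 + timedelta(days=6)))  # True
# Still alive if the keep_alive was extended to 10 days:
print(is_expired(t0, keep_alive=timedelta(days=10), now=t0 + timedelta(days=6)))  # False
```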

Like a normal _search, if the HTTP channel used to submit a request is closed before a response is received, the search is automatically cancelled. Note that this behavior applies only to the submit API; subsequent GET requests will not cancel the search if their channel is closed.

Resiliency

Asynchronous searches are not persistent: if the coordinator node crashes or is restarted during the search, the asynchronous search stops. To know whether the search is still running, the response contains a field called is_running that indicates whether the task is up. It is the responsibility of the user to resume an asynchronous search that didn't reach a final response by re-submitting the query. However, final responses and failures are persisted in a system index, which makes it possible
to retrieve a response even after the task finishes.

A stored search can also be deleted explicitly:

DELETE _async_search/9N3J1m4BgyzUDzqgC15b

The response is also not stored if the initial submit action returns a final response. This avoids adding any overhead for queries that complete within the initial wait_for_completion.

Security

The .async-search index is a restricted index (it should be migrated to a system index in 8.0+) that is accessible only through the async search APIs. These APIs also ensure that only the user who submitted the initial query can retrieve or delete the running search. Note that admins/superusers can still cancel the search task through the task manager like any other task.

Relates #49091

@jimczi jimczi added :Search/Search Search-related issues that do not fall into other categories v8.0.0 WIP labels Dec 6, 2019
@elasticmachine

Pinging @elastic/es-search (:Search/Search)

@jimczi jimczi added the >feature label Dec 6, 2019
@javanna javanna mentioned this pull request Dec 6, 2019
@jimczi

jimczi commented Dec 6, 2019

@elasticmachine run elasticsearch-ci/packaging-sample-matrix


@jpountz jpountz left a comment


Woohoo! This looks great in general, some questions:

  • If I read correctly, it's up to the user to garbage collect responses manually. Should we do this automatically when a final response has been retrieved? We already have a wait_for_completion parameter that allows to reduce the number of roundtrips for fast requests, so it doesn't feel consistent to always require a new roundtrip to delete the response? I'm also a bit biased towards reducing the number of cases when responses need to be garbage-collected via ILM, as you could accumulate a large volume of responses in 5 days?
  • If we moved from time-based ids to true uuids - which can't be guessed, I wonder whether we'd still need to require that the user that views a response is the same as the user who submitted the request. I don't think it would be surprising to users that sharing the id of an async search has pretty much the same consequences as sharing the response of the search request?
  • Since response and partial_response should have mostly the same format, I wonder whether we should use the same response key combined with a partial flag?

@jimczi

jimczi commented Dec 10, 2019

If I read correctly, it's up to the user to garbage collect responses manually. Should we do this automatically when a final response has been retrieved? We already have a wait_for_completion parameter that allows to reduce the number of roundtrips for fast requests, so it doesn't feel consistent to always require a new roundtrip to delete the response? I'm also a bit biased towards reducing the number of cases when responses need to be garbage-collected via ILM, as you could accumulate a large volume of responses in 5 days?

I like the idea; getting the same final response twice is something that our regular caches should handle transparently, so this would also emphasize the fact that these responses are not meant to be used as an additional cache.

If we moved from time-based ids to true uuids - which can't be guessed, I wonder whether we'd still need to require that the user that views a response is the same as the user who submitted the request. I don't think it would be surprising to users that sharing the id of an async search has pretty much the same consequences as sharing the response of the search request?

+1 for true uuids, I agree with the response sharing analogy but since we want to delete final responses when they are reported back I think it would be nice to have this extra layer. It's not a lot of work and something that we already implement in scrolls.

Since response and partial_response should have mostly the same format, I wonder whether we should use the same response key combined with a partial flag?

Are you talking about the REST format or the internal response? I think it's important to keep the distinction internally (for the HLRC) but I agree that the response could look like this:

{
  "id": "9N3J1m4BgyzUDzqgC15b",
  "version": 1,
  "is_running": true,
  "response": {
      "is_partial": true,
      ...
  }
}

Is that what you meant?

@jpountz

jpountz commented Dec 10, 2019

Yes that's what I meant.

@giladgal

This all sounds great. Two suggestions to consider:

  1. If there is a burst of queries whose result sets are not fully retrieved (e.g. because they are lengthy and the user checks them at long intervals), can we crash the system because there are too many result sets? Should we also start deleting based on the size of the results index (not just on time or retrieval)?

  2. Would it make sense to have a limit on the time that the async query runs, e.g. have a default for the environment and allow users to specifically indicate that their query should be killed after a certain time interval (e.g. kill this query if it doesn't conclude after 24 hours)? My concern is that if we don't have a limit, we can end up with zombie queries that queue on a resource-starved system because there is no time-out to kill them. Such a mechanism would be an automated way for the admin to kill queries based on a very lengthy timeout.

@jimczi

jimczi commented Dec 12, 2019

I pushed another iteration that addresses @jpountz's comments.
We now use true random uuids for the async search index, and the document is automatically deleted when the final response (completed or failed) is returned to the user (through submit or get).

We also discussed with @elastic/es-security the best options to secure the system index. The simplest way today would be to add the async search index to RestrictedIndicesNames like we do for the .security index. Only the superuser would be able to access this index directly, which is enough imo. Restricted indices don't support index patterns today, but @albertzaharovits said that it would be trivial to add. The other option would be to use a single index and to replace the ILM garbage collection with a periodic delete-by-query. I am leaning towards the first option (adding support for index patterns in RestrictedIndicesNames) since it is simpler and ensures that we re-create the index periodically, but I am curious to hear what others think.
I'll now focus on securing the async-search APIs to ensure that only the user who submitted the query can retrieve or delete the running search. Note that admins/superusers would still be able to cancel the search task through the task manager like any other task.

@colings86

I wonder if we should add "pending_shards" to the partial_response section so it's explicit that we are expecting those shards to return?

Also I wonder if we should group the shard counts together into its own object?

@mark-vieira

FYI, you'll need to merge in the latest changes from master to fix these CI failures due to the new Java 13 build requirement.


@javanna javanna left a comment


I left a couple more small comments but LGTM. Can you also update the description and remove the mention of the 304 status code which I believe is outdated? Thanks for taking this to the finish line.

ActionRequestValidationException validationException = submit.validate();
if (validationException != null) {
    throw validationException;
}

I still see it, did you mean to remove it?

request.setCcsMinimizeRoundtrips(false);
request.setPreFilterShardSize(1);
request.setBatchedReduceSize(5);
request.requestCache(true);

I am still missing where we reject ccs minimize roundtrips set to false


@Override
public Task createTask(long id, String type, String action, TaskId parentTaskId, Map<String, String> headers) {
return new CancellableTask(id, type, action, "", parentTaskId, headers) {

could you address this too?

"path":"/_async_search",
"methods":[
"GET",
"POST"

One more thing I had missed before: here we may want to remove GET? I tend to think that POST is the only method that suits an API that submits something. Was it here only for consistency with search?

},
"keep_alive": {
"type": "time",
"description": "Specify the time that the request should remain reachable in the cluster."

maybe rephrase to something like "Specify the time interval in which the results (partial or final) for this search will be available"

@jimczi jimczi merged commit 146b2a8 into elastic:master Mar 10, 2020
@jimczi jimczi deleted the async_search branch March 10, 2020 15:33
jimczi added a commit that referenced this pull request Mar 10, 2020
AsyncSearchActionTests#testCleanupOnFailure fails sporadically in CI but not locally.
This commit switches the tests into a SuiteScopeTestCase that creates internal states
once on static members in order to make the tests more reproducible.

Relates #49931
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Mar 12, 2020
Deleting an async search id can throw a ResourceNotFoundException even if the query was successfully cancelled.
We delete the stored response automatically if the query is cancelled so that creates a race with the delete
action that also ensures that the task is removed. This change ensures that we ignore missing async search ids
in the async search index if they were successfully cancelled.

Relates elastic#53360
Relates elastic#49931
jimczi added a commit that referenced this pull request Mar 12, 2020
Deleting an async search id can throw a ResourceNotFoundException even if the query was successfully cancelled.
We delete the stored response automatically if the query is cancelled so that creates a race with the delete action that also ensures that the task is removed. This change ensures that we ignore missing async search ids in the async search index if they were successfully cancelled.

Relates #53360
Relates #49931
jimczi added a commit that referenced this pull request Mar 16, 2020
…usly (#49931) (#53591)

Co-authored-by: Luca Cavanna <javanna@users.noreply.github.com>
@lizozom

lizozom commented Apr 19, 2020

@jimczi

Could you please provide an example on how to update the keep alive time of a currently running async search?

I tried this without success.

Also, the final GET returns all of the data on a single response, without pagination. Is that the way this should always be?

@jimczi

jimczi commented Apr 19, 2020

Could you please provide an example on how to update the keep alive time of a currently running async search?

The response returns the initial keep alive instead of the updated one, I opened #55435 to fix the bug.

Also, the final GET returns all of the data on a single response, without pagination. Is that the way this should always be?

What do you mean by pagination? It should return the same final response as a normal _search.
