
Separate metadata from the actual response in async search index #71223

Closed
Tracked by #88658
jimczi opened this issue Apr 2, 2021 · 15 comments
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@jimczi (Contributor) commented Apr 2, 2021

Getting the status of an async search is fast while the query is running, but can be very costly once the response has been stored.
When the task is gone or finished, we query the async search index to retrieve the full response if it is still available. The metadata and statistics are stored inside a binary field that also contains the actual response, so the cost of retrieving the status depends greatly on the size of the search response.
We should separate the actual response from the metadata so that the two can be queried independently.
That would make the cost of status calls roughly constant, and much cheaper than retrieving the full response each time.

@jimczi jimczi added >enhancement :Search/Search Search-related issues that do not fall into other categories labels Apr 2, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Apr 2, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-search (Team:Search)

@mayya-sharipova (Contributor) commented Apr 6, 2021

Currently _async_search/status for stored searches returns the following:

for successfully completed searches:

{
  "id" : "FmRldE8zREVEUzA2ZVpUeGs2ejJFUFEaMkZ5QTVrSTZSaVN3WlNFVmtlWHJsdzoxMDc=",
  "is_running" : false,
  "is_partial" : false,
  "start_time_in_millis" : 1583945890986,
  "expiration_time_in_millis" : 1584377890986,
  "_shards" : {
      "total" : 562,
      "successful" : 562,
      "skipped" : 0,
      "failed" : 0
  },
 "completion_status" : 200 
}

for unsuccessfully completed searches:

{
  "id" : "FmRldE8zREVEUzA2ZVpUeGs2ejJFUFEaMkZ5QTVrSTZSaVN3WlNFVmtlWHJsdzoxMDc=",
  "is_running" : false,
  "is_partial" : true,
  "start_time_in_millis" : 1583945890986,
  "expiration_time_in_millis" : 1584377890986,
  "_shards" : {
      "total" : 562,
      "successful" : 450,
      "skipped" : 0,
      "failed" : 112
  },
 "completion_status" : 503 
}

Currently all stored responses are expected to be complete, but in the future we may store partial results as well.

@mayya-sharipova (Contributor) commented Apr 6, 2021

Problem

To retrieve status from a stored response, we do:

  • issue a GET request to the .async-search index
  • get the result field from _source (an object encoded in base64)
  • decode result into an AsyncSearchResponse
  • create an AsyncStatusResponse from the AsyncSearchResponse and its SearchResponse.

If the stored result is huge, retrieving and decoding it consumes significant resources, and Kibana may call the status API thousands of times per second.
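The costly path above can be sketched as follows; this is a minimal stand-in where a JSON blob plays the role of the base64-encoded AsyncSearchResponse (the real decoding goes through Elasticsearch's binary wire format, and the field names in the simulated document are illustrative):

```python
import base64
import json

def build_status_from_source(source: dict) -> dict:
    """Decode the whole stored result just to extract a handful of status fields.

    The decode cost grows with the size of the stored response even though the
    status itself is tiny, which is the problem described above.
    """
    blob = base64.b64decode(source["result"])  # the full response, possibly many Mb
    response = json.loads(blob)                # stand-in for the real wire-format decoding
    return {
        "is_running": False,
        "is_partial": response["is_partial"],
        "start_time_in_millis": response["start_time_in_millis"],
        "expiration_time_in_millis": source["expiration_time"],
        "_shards": response["_shards"],
        "completion_status": response["completion_status"],
    }

# A small simulated stored document:
stored_response = {
    "is_partial": False,
    "start_time_in_millis": 1583945890986,
    "_shards": {"total": 562, "successful": 562, "skipped": 0, "failed": 0},
    "completion_status": 200,
    "hits": ["..."],  # in practice this part dominates the document size
}
source = {
    "expiration_time": 1584377890986,
    "result": base64.b64encode(json.dumps(stored_response).encode()).decode(),
}
status = build_status_from_source(source)
```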

The current mapping for .async-search is:

"properties" : {
  "expiration_time" : {
    "type" : "long"
  },
  "headers" : {
    "type" : "object",
    "enabled" : false
  },
  "response_headers" : {
    "type" : "object",
    "enabled" : false
  },
  "result" : {
    "type" : "object",
    "enabled" : false
  }
}

There are several ways we can separate metadata from the actual response:

Proposal 1: add a status field as a base64-encoded, stored-only binary field:

Mapping:

"status": {
  "type" : "binary",
  "doc_values": false,
  "store": true
}

The GET request will retrieve status as a separate stored field and decode it into an AsyncStatusResponse object.
We retrieve and decode only the "status" field, without retrieving and decoding the expensive _source field.

GET .async-search/_doc/<ID>?stored_fields=status
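For illustration, a rough sketch of the proposal 1 roundtrip (JSON stands in for the binary serialization of AsyncStatusResponse; the names are illustrative). The status is encoded once at index time and later decoded on its own, without touching _source:

```python
import base64
import json

def encode_status_field(status: dict) -> str:
    # Value written to the stored-only "status" binary field at index time.
    return base64.b64encode(json.dumps(status).encode()).decode()

def decode_status_field(stored_value: str) -> dict:
    # Value read back via GET ...?stored_fields=status; _source is never loaded,
    # so the cost no longer depends on the size of the stored response.
    return json.loads(base64.b64decode(stored_value))

status = {"is_running": False, "is_partial": False, "completion_status": 200}
roundtrip = decode_status_field(encode_status_field(status))
```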

Upgrade scenario and BWC:

  • We either need to:
    • update the mapping of an existing _async_search index to add the status field (is this part of the upgrade assistant?)
    • or create a new index _async_search2 and use it.
  • Older docs indexed in _async_search don't contain the status stored field, so a GET request with stored_fields will come back with empty stored fields (or the GET request will return not-found if we look up older docs in a new index). In that case, we need to reissue another GET request against _source (or against the older index).

Proposal 2: add status fields as separate stored-only fields:

Mapping:

"status": {
  "type" : "object",
  "properties": {
      "is_running": {
        "type": "boolean",
        "store": true,
        "index": false,
        "doc_values" : false
      },
      "is_partial": {
        "type": "boolean",
        "store": true,
        "index": false,
        "doc_values" : false
      },
      "start_time": {
        "type": "long",
        "store": true,
        "index": false,
        "doc_values" : false
      },
     "expiration_time": {
        "type": "long",
        "store": true,
        "index": false,
        "doc_values" : false
      },
      "shards.total": {
        "type": "integer",
        "store": true,
        "index": false,
        "doc_values" : false
      },
      "shards.successful": {
        "type": "integer",
        "store": true,
        "index": false,
        "doc_values" : false
      },
      "shards.skipped": {
        "type": "integer",
        "store": true,
        "index": false,
        "doc_values" : false
      },
      "shards.failed": {
        "type": "integer",
        "store": true,
        "index": false,
        "doc_values" : false
      },
      "completion_status": {
        "type": "short",
        "store": true,
        "index": false,
        "doc_values" : false
      }
  }    
}

And this GET request will retrieve only these stored fields, without retrieving and decoding the expensive _source field:

GET .async-search/_doc/<ID>?stored_fields=status.is_running,status.is_partial,status.start_time,status.expiration_time,status.shards.total,status.shards.successful,status.shards.skipped,status.shards.failed,status.completion_status
  • Upgrade scenario and BWC are the same as in proposal 1.
  • Avoids the encoding/decoding step needed in proposal 1, but requires managing more fields.
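Under proposal 2 the stored fields come back individually (stored fields arrive as single-element lists in the GET response's fields section), and the status object is assembled without any decoding step. A minimal sketch with made-up values:

```python
# Simulated "fields" section of a GET ...?stored_fields=... response.
fields = {
    "status.is_running": [False],
    "status.is_partial": [False],
    "status.start_time": [1583945890986],
    "status.expiration_time": [1584377890986],
    "status.shards.total": [562],
    "status.shards.successful": [562],
    "status.shards.skipped": [0],
    "status.shards.failed": [0],
    "status.completion_status": [200],
}

def assemble_status(fields: dict) -> dict:
    """Build the status response directly from flat stored fields.

    No base64/binary decoding is involved, at the cost of managing one
    stored field per status attribute.
    """
    get = lambda name: fields[name][0]  # stored field values arrive as lists
    return {
        "is_running": get("status.is_running"),
        "is_partial": get("status.is_partial"),
        "start_time_in_millis": get("status.start_time"),
        "expiration_time_in_millis": get("status.expiration_time"),
        "_shards": {
            "total": get("status.shards.total"),
            "successful": get("status.shards.successful"),
            "skipped": get("status.shards.skipped"),
            "failed": get("status.shards.failed"),
        },
        "completion_status": get("status.completion_status"),
    }

status = assemble_status(fields)
```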

Proposal 3: create a separate index for status updates: .async-search-status

Pros:

  • Upgrade and BWC are easier than in proposals 1 and 2.
  • Retrieval of status will be faster. In proposals 1 and 2, since Lucene keeps all stored fields together, it still needs to read both the stored status and _source from disk.

Cons:

  • Doubles the work of indexing, updating and deleting docs across .async-search and .async-search-status, and the two indices could fall out of sync.

@mayya-sharipova (Contributor) commented Apr 8, 2021

We had a team discussion and decided to go with Proposal 1. The plan is:

  • Introduce status as a single binary field in existing .async-search indices
  • status can be either:
    • a stored only field
    • a doc values only field
  • Do performance benchmarking with around 1000 docs to compare the performance of stored vs doc values fields
    • some docs should have huge responses
    • 1000 requests per second, over 10 seconds
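Not a substitute for the planned benchmark against a real cluster, but the decode-cost asymmetry is visible even locally: decoding a tiny status blob is orders of magnitude cheaper than decoding a multi-megabyte result blob (JSON again stands in for the binary serialization):

```python
import base64
import json
import time

def timed_decode(blob: str) -> float:
    """Return the wall-clock time to base64-decode and parse one blob."""
    start = time.perf_counter()
    json.loads(base64.b64decode(blob))
    return time.perf_counter() - start

# ~10Mb of JSON standing in for a huge stored search response:
big_result = base64.b64encode(
    json.dumps({"hits": ["x" * 100] * 100_000}).encode()
).decode()
# A few dozen bytes standing in for the status metadata:
small_status = base64.b64encode(json.dumps({"completion_status": 200}).encode()).decode()

timed_decode(small_status)           # warm-up
t_big = timed_decode(big_result)
t_small = timed_decode(small_status)
```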

@dnhatn (Member) commented Apr 8, 2021

> Older docs indexed in _async_search don't contain the status stored field, and a GET request with stored_fields will have stored fields empty (or GET request will return unfound if we are trying to find older docs in a new index). In this case, we need to reissue another GET request with _source (or older index).

@mayya-sharipova I think we can add a version to AsyncExecutionId (and the encoded id), then make a decision based on the version, so we can avoid issuing two requests in a mixed cluster.
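A minimal sketch of this idea (the version field, id format, and threshold below are hypothetical; the real AsyncExecutionId encodes different components): bake a version into the encoded id and choose the retrieval path from it, so no second request is needed:

```python
import base64

# Hypothetical: docs written at or after this version carry the stored "status" field.
STATUS_FIELD_VERSION = 2

def encode_id(version: int, doc_id: str) -> str:
    # Illustrative encoding: "version:doc_id", base64-encoded like async search ids.
    return base64.urlsafe_b64encode(f"{version}:{doc_id}".encode()).decode()

def decode_id(encoded: str):
    version, doc_id = base64.urlsafe_b64decode(encoded).decode().split(":", 1)
    return int(version), doc_id

def pick_retrieval_path(encoded_id: str) -> str:
    version, _ = decode_id(encoded_id)
    # Old docs predate the stored status field, so go straight to _source
    # instead of issuing a stored_fields GET that would come back empty.
    return "stored_fields=status" if version >= STATUS_FIELD_VERSION else "_source"

old_id = encode_id(1, "FmRldE8...")
new_id = encode_id(2, "FmRldE8...")
```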

@mayya-sharipova (Contributor)

@dnhatn Thanks, that's a great proposal. I will experiment with this idea

@mayya-sharipova (Contributor)

We had another discussion on this topic today and decided to go with a GET request (with a search request we would also need to worry about the refresh interval).

We need to measure the performance of retrieving doc values fields vs stored fields for status; if retrieving doc values fields turns out to be much faster, we can consider adding support for retrieving doc values fields to the GET API.

@mayya-sharipova (Contributor) commented Apr 20, 2021

I've benchmarked the time needed to retrieve stored fields through a GET request vs doc values fields through a SEARCH request, depending on the size of the result field. Here are the results:

GET request with stored fields vs SEARCH request with doc values fields

number of docs in the index: 1000

| Result size | GET with stored fields (ms) | SEARCH with doc values (ms) | SEARCH with doc values, first request (ms) |
|---|---|---|---|
| 100Kb | 1.1 | 1.9 | 6.2 |
| 1Mb | 2.4 | 3.0 | 6.0 |
| 10Mb | 13.4 | 14.1 | 19.3 |
| 990 docs of 10Mb + 10 docs of 100Mb | 97.6 (100Mb doc), 15 (10Mb doc) | 99.1 (100Mb doc), 18.4 (10Mb doc) | n/a |

It looks like retrieving just doc values fields through a search request doesn't bring significant benefits. One explanation is that while executing a _search request we still need to access the stored _id field, and therefore still read and decompress stored fields.

GET request with _source vs GET request with stored fields

number of docs in the index: 1000
990 docs of 10Mb, 10 docs of 100Mb

| Result size | Full _source (ms) | Only stored status field (ms) |
|---|---|---|
| 10Mb | 30.8 | 13.4 |
| 100Mb | 304.7 | 97.6 |

So the conclusion is to proceed with a GET request where status is a separate stored field, as this makes the request at least 3x faster.

cc @jimczi

@jimczi jimczi changed the title Separate metadata from the actual response in async search Separate metadata from the actual response in async search index Apr 22, 2021
@jimczi (Contributor, Author) commented Apr 22, 2021

Are you sure you're benchmarking the doc values case with "stored_fields": ["_none_"]? I guess that could explain why the doc values responses are not fast. 100ms for a status query is quite a lot, especially when testing in isolation. Is it only slow on the first retrieval, or consistently?

@mayya-sharipova (Contributor) commented Apr 22, 2021

> Are you sure that you're benchmarking the doc values case with "stored_fields": ["_none_"]?

Great question! No, I had only disabled _source; the _id field was still being read from stored fields. With "stored_fields": "_none_", I get much better numbers for doc values:

| Result size | GET with stored fields | SEARCH with doc values | SEARCH with doc values, first request |
|---|---|---|---|
| 10Mb | 17.7 ms | 1.9 ms | 7.8 ms |
| 990 docs of 10Mb + 10 docs of 100Mb | 100 ms (100Mb doc), 11.1 ms (10Mb doc) | 1.7 ms (100Mb doc), 1.2 ms (10Mb doc) | n/a |

@jimczi (Contributor, Author) commented Apr 22, 2021

Considering that responses in the multi-megabyte range are anomalies rather than realistic response sizes, I would still go with the stored field option. It was made for this purpose, so it's logical to use it here. My rough feeling is that big responses fall somewhere between 100Kb and 1Mb, so the numbers shared above for stored fields are good enough. Do you have the numbers for the current full _source solution in that range?

@jtibshirani (Contributor) commented Apr 22, 2021

I'd also be really curious to see a comparison against loading from _source in the 100KB - 1MB range. I tried a similar experiment on metricbeat data and only saw a small improvement when moving from _source to stored fields: #9034 (comment). This made me think that the JSON parsing overhead is low. However, the metricbeat documents are substantially smaller, so it would be helpful to see this data point as well.

@mayya-sharipova (Contributor) commented Apr 22, 2021

Index contains 1000 docs
10 docs: 1Mb
990 docs: 100Kb

| Doc size | GET _source | GET stored fields | SEARCH doc values |
|---|---|---|---|
| 100Kb | 1.3 ms | 1.3 ms | 1.4 ms |
| 1Mb | 4.7 ms | 3.0 ms | 1.4 ms |

@jimczi @jtibshirani Thanks for the comments. Indeed, for smaller documents there is not much difference between retrieving the whole _source from disk and retrieving separate stored fields.

Also, a note: in my experiments the result field is a synthetic base64-encoded field, not a real async search result. In real life there will be extra overhead in the GET _source path for decoding the encoded result field into an AsyncSearchResponse object, but for smaller stored responses that overhead should be minimal.

@mayya-sharipova (Contributor) commented May 3, 2021

We've discussed this again, and considering that the _async_search index should NOT contain big responses, and that for responses under 1Mb retrieving _source vs a separate stored status field makes little difference, we've decided for now NOT to proceed with separating metadata from the actual response.

But we would still like to keep the issue open, as there may be other ways to improve metadata retrieval:

  • Currently the whole _source is sent from the data node to the coordinating node. Could we send only the status (metadata)? Perhaps, if we know that the first X bytes of _source constitute the status, we could send only those first X bytes?
  • Could the status (metadata) be a binary header?
  • Could we add a circuit breaker for when the requested source is too big to load?
  • Related: the issue of breaking a huge async search response into several docs. If a response is split across several docs, perhaps the status could be stored only in the first of them, so we would only need to retrieve and process that one doc?
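The first idea above could work roughly like this (an entirely hypothetical layout, not anything Elasticsearch does today): prefix the stored blob with a small length header so a reader can extract just the status bytes without touching the response payload:

```python
import json
import struct

def pack_with_status_prefix(status: dict, full_response: bytes) -> bytes:
    """Hypothetical layout: 4-byte big-endian status length, status JSON, then the response."""
    status_bytes = json.dumps(status).encode()
    return struct.pack(">I", len(status_bytes)) + status_bytes + full_response

def read_status_only(blob: bytes) -> dict:
    # Only the header and the status bytes are examined; the (potentially huge)
    # response payload after them is never read or decoded.
    (length,) = struct.unpack(">I", blob[:4])
    return json.loads(blob[4:4 + length])

blob = pack_with_status_prefix({"completion_status": 200}, b"x" * 10_000_000)
status = read_status_only(blob)
```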

@javanna (Member) commented Jun 9, 2023

I am closing this issue as we have no concrete plans to work on it.

@javanna javanna closed this as not planned Won't fix, can't repro, duplicate, stale Jun 9, 2023