
Separate metadata from the actual response in async search index #71223

Closed
Tracked by #88658
jimczi opened this issue Apr 2, 2021 · 15 comments
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@jimczi (Contributor) commented Apr 2, 2021

Getting the status of an async search is fast while the query is running, but can be very costly once the response has been stored.
When the task is gone or finished, we query the async search index to retrieve the full response if it is still available. The metadata and statistics are stored inside a binary field that also contains the actual response, so the cost of retrieving the status depends greatly on the size of the search response.
We should separate the actual response from the metadata so that the two can be queried independently.
That would make the cost of status calls roughly constant, and much cheaper than retrieving the full response each time.

@jimczi jimczi added >enhancement :Search/Search Search-related issues that do not fall into other categories labels Apr 2, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Apr 2, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-search (Team:Search)

@mayya-sharipova (Contributor) commented Apr 6, 2021

Currently _async_search/status for stored searches returns the following:

for successfully completed searches:

{
  "id" : "FmRldE8zREVEUzA2ZVpUeGs2ejJFUFEaMkZ5QTVrSTZSaVN3WlNFVmtlWHJsdzoxMDc=",
  "is_running" : false,
  "is_partial" : false,
  "start_time_in_millis" : 1583945890986,
  "expiration_time_in_millis" : 1584377890986,
  "_shards" : {
      "total" : 562,
      "successful" : 562,
      "skipped" : 0,
      "failed" : 0
  },
 "completion_status" : 200 
}

for unsuccessfully completed searches:

{
  "id" : "FmRldE8zREVEUzA2ZVpUeGs2ejJFUFEaMkZ5QTVrSTZSaVN3WlNFVmtlWHJsdzoxMDc=",
  "is_running" : false,
  "is_partial" : true,
  "start_time_in_millis" : 1583945890986,
  "expiration_time_in_millis" : 1584377890986,
  "_shards" : {
      "total" : 562,
      "successful" : 450,
      "skipped" : 0,
      "failed" : 112
  },
 "completion_status" : 503 
}

Currently all stored responses are expected to be complete, but in the future we may store partial results as well.

@mayya-sharipova (Contributor) commented Apr 6, 2021

Problem

To retrieve status from a stored response, we do:

  • issue a GET request to the .async-search index
  • get the result field from _source (an object encoded in base64)
  • decode result into an AsyncSearchResponse
  • create an AsyncStatusResponse from the AsyncSearchResponse and its SearchResponse.

If the stored result is huge, retrieving and decoding it consumes significant resources, and Kibana may call the status API thousands of times per second.
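The costly path above can be sketched as follows; this is a minimal stand-in where a JSON blob plays the role of the base64-encoded AsyncSearchResponse (the real decoding goes through Elasticsearch's binary wire format, and the field names in the simulated document are illustrative):

```python
import base64
import json

def build_status_from_source(source: dict) -> dict:
    """Decode the whole stored result just to extract a handful of status fields.

    The decode cost grows with the size of the stored response even though the
    status itself is tiny, which is the problem described above.
    """
    blob = base64.b64decode(source["result"])  # the full response, possibly many Mb
    response = json.loads(blob)                # stand-in for the real wire-format decoding
    return {
        "is_running": False,
        "is_partial": response["is_partial"],
        "start_time_in_millis": response["start_time_in_millis"],
        "expiration_time_in_millis": source["expiration_time"],
        "_shards": response["_shards"],
        "completion_status": response["completion_status"],
    }

# A small simulated stored document:
stored_response = {
    "is_partial": False,
    "start_time_in_millis": 1583945890986,
    "_shards": {"total": 562, "successful": 562, "skipped": 0, "failed": 0},
    "completion_status": 200,
    "hits": ["..."],  # in practice this part dominates the document size
}
source = {
    "expiration_time": 1584377890986,
    "result": base64.b64encode(json.dumps(stored_response).encode()).decode(),
}
status = build_status_from_source(source)
```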

The current mapping for .async-search is:

"properties" : {
  "expiration_time" : {
    "type" : "long"
  },
  "headers" : {
    "type" : "object",
    "enabled" : false
  },
  "response_headers" : {
    "type" : "object",
    "enabled" : false
  },
  "result" : {
    "type" : "object",
    "enabled" : false
  }
}

There are several ways we can separate metadata from the actual response:

Proposal 1: add a status field as a base64-encoded, stored-only binary field:

Mapping:

"status": {
  "type" : "binary",
  "doc_values": false,
  "store": true
}

The GET request will retrieve status as a separate stored field and decode it into an AsyncStatusResponse object.
We retrieve and decode only the "status" field, without retrieving and decoding the expensive _source field.

GET .async-search/_doc/<ID>?stored_fields=status
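For illustration, a rough sketch of the proposal 1 roundtrip (JSON stands in for the binary serialization of AsyncStatusResponse; the names are illustrative). The status is encoded once at index time and later decoded on its own, without touching _source:

```python
import base64
import json

def encode_status_field(status: dict) -> str:
    # Value written to the stored-only "status" binary field at index time.
    return base64.b64encode(json.dumps(status).encode()).decode()

def decode_status_field(stored_value: str) -> dict:
    # Value read back via GET ...?stored_fields=status; _source is never loaded,
    # so the cost no longer depends on the size of the stored response.
    return json.loads(base64.b64decode(stored_value))

status = {"is_running": False, "is_partial": False, "completion_status": 200}
roundtrip = decode_status_field(encode_status_field(status))
```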

Upgrade scenario and BWC:

  • We either need to:
    • update the mapping of an existing _async_search index to add the status field (is this part of the upgrade assistant?)
    • or create a new index _async_search2 and use it.
  • Older docs indexed in _async_search don't contain the status stored field, so a GET request with stored_fields will come back with empty stored fields (or the GET request will return not-found if we look up older docs in a new index). In that case, we need to reissue another GET request against _source (or against the older index).

Proposal 2: add status fields as separate stored-only fields:

Mapping:

"status": {
  "type" : "object",
  "properties": {
      "is_running": {
        "type": "boolean",
        "store": true,
        "index": false,
        "doc_values" : false
      },
      "is_partial": {
        "type": "boolean",
        "store": true,
        "index": false,
        "doc_values" : false
      },
      "start_time": {
        "type": "long",
        "store": true,
        "index": false,
        "doc_values" : false
      },
     "expiration_time": {
        "type": "long",
        "store": true,
        "index": false,
        "doc_values" : false
      },
      "shards.total": {
        "type": "integer",
        "store": true,
        "index": false,
        "doc_values" : false
      },
      "shards.successful": {
        "type": "integer",
        "store": true,
        "index": false,
        "doc_values" : false
      },
      "shards.skipped": {
        "type": "integer",
        "store": true,
        "index": false,
        "doc_values" : false
      },
      "shards.failed": {
        "type": "integer",
        "store": true,
        "index": false,
        "doc_values" : false
      },
      "completion_status": {
        "type": "short",
        "store": true,
        "index": false,
        "doc_values" : false
      }
  }    
}

And this GET request will retrieve only these stored fields, without retrieving and decoding the expensive _source field:

GET .async-search/_doc/<ID>?stored_fields=status.is_running,status.is_partial,status.start_time,status.expiration_time,status.shards.total,status.shards.successful,status.shards.skipped,status.shards.failed,status.completion_status
  • Upgrade scenario and BWC are the same as in proposal 1.
  • Avoids the encoding/decoding step needed in proposal 1, but requires managing more fields.
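Under proposal 2 the stored fields come back individually (stored fields arrive as single-element lists in the GET response's fields section), and the status object is assembled without any decoding step. A minimal sketch with made-up values:

```python
# Simulated "fields" section of a GET ...?stored_fields=... response.
fields = {
    "status.is_running": [False],
    "status.is_partial": [False],
    "status.start_time": [1583945890986],
    "status.expiration_time": [1584377890986],
    "status.shards.total": [562],
    "status.shards.successful": [562],
    "status.shards.skipped": [0],
    "status.shards.failed": [0],
    "status.completion_status": [200],
}

def assemble_status(fields: dict) -> dict:
    """Build the status response directly from flat stored fields.

    No base64/binary decoding is involved, at the cost of managing one
    stored field per status attribute.
    """
    get = lambda name: fields[name][0]  # stored field values arrive as lists
    return {
        "is_running": get("status.is_running"),
        "is_partial": get("status.is_partial"),
        "start_time_in_millis": get("status.start_time"),
        "expiration_time_in_millis": get("status.expiration_time"),
        "_shards": {
            "total": get("status.shards.total"),
            "successful": get("status.shards.successful"),
            "skipped": get("status.shards.skipped"),
            "failed": get("status.shards.failed"),
        },
        "completion_status": get("status.completion_status"),
    }

status = assemble_status(fields)
```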

Proposal 3: create a separate index for status updates: .async-search-status

Pros:

  • Upgrade and BWC are easier than in proposals 1 and 2.
  • Retrieval of status will be faster. In proposals 1 and 2, since Lucene keeps all stored fields together, it still needs to read both the stored status and _source from disk.

Cons:

  • Doubles the work of indexing, updating and deleting docs across .async-search and .async-search-status, and the two indices could fall out of sync.

@mayya-sharipova (Contributor) commented Apr 8, 2021

We had a team discussion and decided to go with Proposal 1. The plan is:

  • Introduce status as a single binary field in existing .async-search indices
  • status can be either:
    • a stored only field
    • a doc values only field
  • Do performance benchmarking with around 1000 docs to compare the performance of stored vs doc values fields
    • some docs should have huge responses
    • 1000 requests per second, over 10 seconds
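Not a substitute for the planned benchmark against a real cluster, but the decode-cost asymmetry is visible even locally: decoding a tiny status blob is orders of magnitude cheaper than decoding a multi-megabyte result blob (JSON again stands in for the binary serialization):

```python
import base64
import json
import time

def timed_decode(blob: str) -> float:
    """Return the wall-clock time to base64-decode and parse one blob."""
    start = time.perf_counter()
    json.loads(base64.b64decode(blob))
    return time.perf_counter() - start

# ~10Mb of JSON standing in for a huge stored search response:
big_result = base64.b64encode(
    json.dumps({"hits": ["x" * 100] * 100_000}).encode()
).decode()
# A few dozen bytes standing in for the status metadata:
small_status = base64.b64encode(json.dumps({"completion_status": 200}).encode()).decode()

timed_decode(small_status)           # warm-up
t_big = timed_decode(big_result)
t_small = timed_decode(small_status)
```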

@dnhatn (Member) commented Apr 8, 2021

> Older docs indexed in _async_search don't contain the status stored field, and a GET request with stored_fields will have stored fields empty (or GET request will return unfound if we are trying to find older docs in a new index). In this case, we need to reissue another GET request with _source (or older index).

@mayya-sharipova I think we can add a version to AsyncExecutionId (and the encoded id), then make a decision based on the version, so we can avoid issuing two requests in a mixed cluster.
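A minimal sketch of this idea (the version field, id format, and threshold below are hypothetical; the real AsyncExecutionId encodes different components): bake a version into the encoded id and choose the retrieval path from it, so no second request is needed:

```python
import base64

# Hypothetical: docs written at or after this version carry the stored "status" field.
STATUS_FIELD_VERSION = 2

def encode_id(version: int, doc_id: str) -> str:
    # Illustrative encoding: "version:doc_id", base64-encoded like async search ids.
    return base64.urlsafe_b64encode(f"{version}:{doc_id}".encode()).decode()

def decode_id(encoded: str):
    version, doc_id = base64.urlsafe_b64decode(encoded).decode().split(":", 1)
    return int(version), doc_id

def pick_retrieval_path(encoded_id: str) -> str:
    version, _ = decode_id(encoded_id)
    # Old docs predate the stored status field, so go straight to _source
    # instead of issuing a stored_fields GET that would come back empty.
    return "stored_fields=status" if version >= STATUS_FIELD_VERSION else "_source"

old_id = encode_id(1, "FmRldE8...")
new_id = encode_id(2, "FmRldE8...")
```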

@mayya-sharipova (Contributor)

@dnhatn Thanks, that's a great proposal. I will experiment with this idea

@mayya-sharipova (Contributor)

We had another discussion on this topic today and decided to go with a GET request (with a search request we would also need to worry about the refresh interval).

We need to measure the performance of retrieving doc values fields vs stored fields for status; if retrieving doc values fields turns out to be much faster, we can consider adding support for retrieving doc values fields to the GET API.

@mayya-sharipova (Contributor) commented Apr 20, 2021

I've benchmarked the time needed to retrieve stored fields through a GET request vs doc values fields through a SEARCH request, depending on the size of the result field. Here are the results:

GET request with stored fields vs SEARCH request with doc values fields

number of docs in the index: 1000

| Result size | GET with stored fields (ms) | SEARCH with doc values (ms) | SEARCH with doc values, first request (ms) |
|---|---|---|---|
| 100Kb | 1.1 | 1.9 | 6.2 |
| 1Mb | 2.4 | 3.0 | 6.0 |
| 10Mb | 13.4 | 14.1 | 19.3 |
| 990 docs of 10Mb + 10 docs of 100Mb | 97.6 (100Mb doc), 15 (10Mb doc) | 99.1 (100Mb doc), 18.4 (10Mb doc) | n/a |

It looks like retrieving just doc values fields through a search request doesn't bring significant benefits. One explanation is that while executing a _search request we still need to access the stored _id field, and therefore still read and decompress stored fields.

GET request with _source vs GET request with stored fields

number of docs in the index: 1000
990 docs of 10Mb, 10 docs of 100Mb

| Result size | Full _source (ms) | Only stored status field (ms) |
|---|---|---|
| 10Mb | 30.8 | 13.4 |
| 100Mb | 304.7 | 97.6 |

So the conclusion is to proceed with a GET request where status is a separate stored field, as this makes the request at least 3x faster.

cc @jimczi

@jimczi jimczi changed the title Separate metadata from the actual response in async search Separate metadata from the actual response in async search index Apr 22, 2021
@jimczi (Contributor, Author) commented Apr 22, 2021

Are you sure you're benchmarking the doc values case with "stored_fields": ["_none_"]? I guess that could explain why the doc values responses are not fast. 100ms for a status query is quite a lot, especially when testing in isolation. Is it only slow on the first retrieval, or consistently?

@mayya-sharipova (Contributor) commented Apr 22, 2021

> Are you sure that you're benchmarking the doc values case with "stored_fields": ["_none_"]?

Great question! No, I had only disabled _source; the _id field was still being read from stored fields. With "stored_fields": "_none_", I get much better numbers for doc values:

| Result size | GET with stored fields | SEARCH with doc values | SEARCH with doc values, first request |
|---|---|---|---|
| 10Mb | 17.7 ms | 1.9 ms | 7.8 ms |
| 990 docs of 10Mb + 10 docs of 100Mb | 100 ms (100Mb doc), 11.1 ms (10Mb doc) | 1.7 ms (100Mb doc), 1.2 ms (10Mb doc) | n/a |

@jimczi (Contributor, Author) commented Apr 22, 2021

Considering that responses in the multi-megabyte range are anomalies rather than realistic response sizes, I would still go with the stored field option. It was made for this purpose, so it's logical to use it here. My rough feeling is that big responses fall somewhere between 100Kb and 1Mb, so the numbers shared above for stored fields are good enough. Do you have the numbers for the current full _source solution in that range?

@jtibshirani (Contributor) commented Apr 22, 2021

I'd also be really curious to see a comparison against loading from _source in the 100KB - 1MB range. I tried a similar experiment on metricbeat data and only saw a small improvement when moving from _source to stored fields: #9034 (comment). This made me think that the JSON parsing overhead is low. However, the metricbeat documents are substantially smaller, so it would be helpful to see this data point as well.

@mayya-sharipova (Contributor) commented Apr 22, 2021

Index contains 1000 docs
10 docs: 1Mb
990 docs: 100Kb

| Doc size | GET _source | GET stored fields | SEARCH doc values |
|---|---|---|---|
| 100Kb | 1.3 ms | 1.3 ms | 1.4 ms |
| 1Mb | 4.7 ms | 3.0 ms | 1.4 ms |

@jimczi @jtibshirani Thanks for the comments. Indeed, for smaller documents there is not much difference between retrieving the whole _source from disk and retrieving separate stored fields.

Also, a note: in my experiments the result field is a synthetic base64-encoded field, not a real async search result. In real life there will be extra overhead in the GET _source path for decoding the encoded result field into an AsyncSearchResponse object, but for smaller stored responses that overhead should be minimal.

@mayya-sharipova (Contributor) commented May 3, 2021

We've discussed this again, and considering that the _async_search index should NOT contain big responses, and that for responses under 1Mb retrieving _source vs a separate stored status field makes little difference, we've decided for now NOT to proceed with separating metadata from the actual response.

But we would still like to keep the issue open, as there may be other ways to improve metadata retrieval:

  • Currently the whole _source is sent from the data node to the coordinating node. Could we send only the status (metadata)? Perhaps, if we know that the first X bytes of _source constitute the status, we could send only those first X bytes?
  • Could the status (metadata) be a binary header?
  • Could we add a circuit breaker for when the requested source is too big to load?
  • Related: the issue of breaking a huge async search response into several docs. If a response is split across several docs, perhaps the status could be stored only in the first of them, so we would only need to retrieve and process that one doc?
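The first idea above could work roughly like this (an entirely hypothetical layout, not anything Elasticsearch does today): prefix the stored blob with a small length header so a reader can extract just the status bytes without touching the response payload:

```python
import json
import struct

def pack_with_status_prefix(status: dict, full_response: bytes) -> bytes:
    """Hypothetical layout: 4-byte big-endian status length, status JSON, then the response."""
    status_bytes = json.dumps(status).encode()
    return struct.pack(">I", len(status_bytes)) + status_bytes + full_response

def read_status_only(blob: bytes) -> dict:
    # Only the header and the status bytes are examined; the (potentially huge)
    # response payload after them is never read or decoded.
    (length,) = struct.unpack(">I", blob[:4])
    return json.loads(blob[4:4 + length])

blob = pack_with_status_prefix({"completion_status": 200}, b"x" * 10_000_000)
status = read_status_only(blob)
```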

@javanna (Member) commented Jun 9, 2023

I am closing this issue as we have no concrete plans to work on it.

@javanna javanna closed this as not planned Won't fix, can't repro, duplicate, stale Jun 9, 2023