Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent storing huge async search responses #67594

Closed
mayya-sharipova opened this issue Jan 15, 2021 · 6 comments
Closed

Prevent storing huge async search responses #67594

mayya-sharipova opened this issue Jan 15, 2021 · 6 comments
Labels
>bug >enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@mayya-sharipova
Copy link
Contributor

mayya-sharipova commented Jan 15, 2021

We saw an issue when storing a huge async search response lead to OOM.

This issue is about finding ways to prevent this. We've discussed possible ways to prevent this from happening:

  1. Introduce an additional circuit breaker on a coordinating node for fetch phase. In Early detection of circuit breaker exception in the coordinating node #67431, we've introduced a circuit breaker for aggs for a coordinating node, but we still don't have a circuit breaker for top docs on the fetch phase. This circuit breaker will trip when memory used for collected top docs exceeds a Request circuit breaker size.

  2. Introduce a circuit breaker on indexing huge docs.

  3. Store async search response as an object with enabled=false instead of storing in a stored field. When we store async response in a stored field, we first encode it with base 64 encoding which needs additional memory. If we avoid base 64 encoding, we may need much less memory. But here a question is how to preserve the es version when this response was recorded?

@mayya-sharipova mayya-sharipova added >bug >enhancement :Search/Search Search-related issues that do not fall into other categories labels Jan 15, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Jan 15, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@jtibshirani
Copy link
Contributor

From digging into the errors we've seen, there are a couple different sources of problems:

  • We serialize the in-memory search response to a byte array, then base64 encode this to a string. First, we can OOM while allocating the byte array. Second, if the length of the encoded response exceeds ~2GB, then we'll also OOM. We could try to catch these cases to avoid OOMing and instead throw a normal exception (with a clear error message)?
  • We use an update request to write the response to the async search index. Update requests must parse the document source to a map to merge it into the existing document. This can use ~3x the memory of the original request which has caused OOMs. Is there any way to rework the async search code to use an index request instead of update, to avoid the memory overhead?
  • Encoding and writing the search response can consume extra memory -- can we account for this additional memory usage in circuit breaking?

These are short-term fixes that would immediately help with the problems we're seeing. In parallel we may want to step back and generally think through how to be robust to large search responses (we've discussed ideas like circuit breaking the fetch phase, truncating field values, limiting document size at index time...). I don't have a clear idea for this yet.

mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Jun 1, 2021
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Jun 2, 2021
@jtibshirani
Copy link
Contributor

jtibshirani commented Jun 3, 2021

After more thought and discussion with others, I think it'd make the most sense to start with a soft limit on the size of the async search response we attempt to store and return. If coordinating node detects the serialized response would be larger than the limit, we'd throw an error early to avoid memory problems. The limit could be configured through a cluster setting.

The thinking is that even if we addressed the points in the comment above, we could see problems when indexing the document, since to create stored fields, we copy the document source into a BytesRef (which is backed by a byte array). Also having very large documents in the .async_search index can lead to high memory usage down the line, for example while merging stored fields.

Summary of proposal:

  • Introduce a soft limit on the size of the async search response. We'd need to choose the limit carefully to avoid blocking too many searches that were previously possible.
  • As a complement, look into "low hanging fruit" to reduce memory usage on coordinating node and add circuit breaking. (For example, it may be possible to reduce overhead during encoding from 3x to ~2x?)

We could also look into improvements that help make the search response smaller, so that we reject as few searches as possible. We've discussed some ideas like truncating large fields (#72453), and compressing the response before storing it.

dnhatn added a commit that referenced this issue Jun 6, 2021
Today, writing a Writable value to XContent in Base64 format performs 
these steps: (1) create a BytesStreamOutput, (2) write Writable to that
output, (3) encode a copy of bytes from that output stream, (4) create a
string from the encoded bytes, (5) write the encoded string to XContent. 
These steps allocate/use memory 5 times than writing the encode chars
directly to the output of XContent.

This API would help reduce memory usage when storing a large response 
of an async search.

Relates #67594
dnhatn added a commit that referenced this issue Jun 7, 2021
)

This change tries to write an async response directly to XContent in 
Base64 to avoid using multiple buffers.

Relates to #67594
dnhatn added a commit that referenced this issue Jun 9, 2021
This change integrates the circuit breaker in AsyncTaskIndexService to 
make sure that we won't hit OOM when serializing a large response of an
async search.

Related to #67594
Supersedes  #73638

Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Jun 9, 2021
Today, writing a Writable value to XContent in Base64 format performs
these steps: (1) create a BytesStreamOutput, (2) write Writable to that
output, (3) encode a copy of bytes from that output stream, (4) create a
string from the encoded bytes, (5) write the encoded string to XContent.
These steps allocate/use memory 5 times than writing the encode chars
directly to the output of XContent.

This API would help reduce memory usage when storing a large response
of an async search.

Relates elastic#67594
dnhatn added a commit that referenced this issue Jun 9, 2021
Today, writing a Writable value to XContent in Base64 format performs
these steps: (1) create a BytesStreamOutput, (2) write Writable to that
output, (3) encode a copy of bytes from that output stream, (4) create a
string from the encoded bytes, (5) write the encoded string to XContent.
These steps allocate/use memory 5 times than writing the encode chars
directly to the output of XContent.

This API would help reduce memory usage when storing a large response
of an async search.

Relates #67594
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Jun 9, 2021
This change integrates the circuit breaker in AsyncTaskIndexService to
make sure that we won't hit OOM when serializing a large response of an
async search.

Related to elastic#67594
Supersedes  elastic#73638

Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>
dnhatn added a commit that referenced this issue Jun 10, 2021
This change integrates the circuit breaker in AsyncTaskIndexService to
make sure that we won't hit OOM when serializing a large response of an
async search.

Related to #67594
Supersedes  #73638

Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Jun 22, 2021
Add a dynamic transient cluster setting search.max_async_search_response_size
that controls the maximum allowed size for a stored async search
response. The default max size is 10Mb. An attempt to store
an async search response larger than this size will result in error.

Relates to elastic#67594
mayya-sharipova added a commit that referenced this issue Jun 30, 2021
Add a dynamic transient cluster setting search.max_async_search_response_size
that controls the maximum allowed size for a stored async search
response. The default max size is 10Mb. An attempt to store
an async search response larger than this size will result in error.

Relates to #67594
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Jun 30, 2021
Add a dynamic transient cluster setting search.max_async_search_response_size
that controls the maximum allowed size for a stored async search
response. The default max size is 10Mb. An attempt to store
an async search response larger than this size will result in error.

Relates to elastic#67594
dnhatn added a commit that referenced this issue Jun 30, 2021
mayya-sharipova added a commit that referenced this issue Jun 30, 2021
Add a dynamic transient cluster setting search.max_async_search_response_size
that controls the maximum allowed size for a stored async search
response. The default max size is 10Mb. An attempt to store
an async search response larger than this size will result in error.

Relates to #67594
dnhatn added a commit that referenced this issue Jul 1, 2021
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Jul 1, 2021
dnhatn added a commit that referenced this issue Jul 3, 2021
@jtibshirani
Copy link
Contributor

jtibshirani commented Jul 7, 2021

Here are the fixes we made (thanks to @mayya-sharipova and @dnhatn!) In summary, 7.14 will bring stability improvements, but the issue won't be fully addressed until 7.15.

7.14 Check circuit breaker when encoding and decoding response, and avoid unnecessarily copying data.

7.15 Add a soft limit on the size of the response we will attempt to store. Compress the response to reduce size of documents and help avoid hitting the limit.

Follow-ups planned for 7.15:

  • In 7.x the soft limit is disabled by default. We're discussing whether we can enable it by default to a large number like we do on master (10MB).
  • The limit is currently disabled for EQL responses, should we enable it as well?

@mayya-sharipova
Copy link
Contributor Author

@jtibshirani Thanks, that's a great summary!

@julie I've doubled checked our action on the failure of createResponse, and besides recording a failure in the logs, we also return a failure response to a user, so we don't need to do anything extra here.

jtibshirani added a commit that referenced this issue Jul 14, 2021
In 7.14 we started to check the circuit breaker when updating the response in
the async search index. If the circuit breaker is tripped, the async update
call reports a failure, but we never update the response. So it can appear to
clients as if the search is still running. This can be a problem for clients
like Kibana that set a very large keep-alive time.

We started recording failures in master and 7.15 when we introduced the
response size limit. This PR backports the logic from that PR to 7.14.

Relates to #67594.
@jtibshirani
Copy link
Contributor

I added the two follow-ups to the async search meta issue (#88658). Closing this in favor of that issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug >enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team
Projects
None yet
Development

No branches or pull requests

3 participants