Prevent storing huge async search responses #67594
Comments
Pinging @elastic/es-search (Team:Search) |
From digging into the errors we've seen, there are a couple of different sources of problems:
These are short-term fixes that would immediately help with the problems we're seeing. In parallel we may want to step back and generally think through how to be robust to large search responses (we've discussed ideas like circuit breaking the fetch phase, truncating field values, limiting document size at index time...). I don't have a clear idea for this yet. |
After more thought and discussion with others, I think it'd make the most sense to start with a soft limit on the size of the async search response we attempt to store and return. If the coordinating node detects that the serialized response would be larger than the limit, we'd throw an error early to avoid memory problems. The limit could be configured through a cluster setting. The thinking is that even if we addressed the points in the comment above, we could see problems when indexing the document, since to create stored fields, we copy the document source into a Summary of proposal:
We could also look into improvements that help make the search response smaller, so that we reject as few searches as possible. We've discussed some ideas like truncating large fields (#72453), and compressing the response before storing it. |
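The soft limit and the compression idea from the discussion above can be sketched together. This is a minimal Python illustration under stated assumptions, not the Elasticsearch implementation: `MAX_RESPONSE_BYTES`, `ResponseTooLargeError`, and `prepare_for_storage` are hypothetical names, and `zlib` stands in for whatever compression is actually chosen.

```python
import zlib

# Hypothetical default; in the proposal the limit comes from a cluster setting.
MAX_RESPONSE_BYTES = 10 * 1024 * 1024


class ResponseTooLargeError(Exception):
    """Raised before storing when the payload exceeds the soft limit."""


def prepare_for_storage(serialized: bytes, limit: int = MAX_RESPONSE_BYTES) -> bytes:
    """Compress the serialized response, then reject it if it is still too large."""
    compressed = zlib.compress(serialized)
    if len(compressed) > limit:
        raise ResponseTooLargeError(
            f"response is {len(compressed)} bytes, limit is {limit} bytes"
        )
    return compressed
```

A real implementation would more likely count bytes while streaming, so that the oversized payload is never fully materialized. The sketch only shows the order of the checks: compress first so that as few searches as possible are rejected, then fail early.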
Today, writing a Writable value to XContent in Base64 format performs these steps: (1) create a BytesStreamOutput, (2) write the Writable to that output, (3) encode a copy of the bytes from that output stream, (4) create a string from the encoded bytes, (5) write the encoded string to XContent. These steps allocate and use about 5 times more memory than writing the encoded chars directly to the XContent output. This API would help reduce memory usage when storing a large async search response. Relates #67594
This change tries to write an async response directly to XContent in Base64 to avoid using multiple buffers. Relates to #67594
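The difference between the five-step path and writing Base64 directly to the output can be illustrated with a small sketch. It is written in Python purely for illustration and does not use the Elasticsearch `Writable` or `XContent` APIs. `write_buffered` and `write_streaming` are hypothetical names, and the payload is assumed to arrive as an iterable of byte chunks.

```python
import base64
import io


def write_buffered(payload_chunks, out):
    """Five-step path: buffer everything, encode a copy, build a string, write it."""
    buf = io.BytesIO()                           # (1) intermediate byte buffer
    for chunk in payload_chunks:
        buf.write(chunk)                         # (2) serialize into the buffer
    encoded = base64.b64encode(buf.getvalue())   # (3) encode a copy of the bytes
    text = encoded.decode("ascii")               # (4) build a string from the encoded bytes
    out.write(text)                              # (5) write the string to the output


def write_streaming(payload_chunks, out):
    """Encode chunks as they arrive; at most 2 bytes are carried between chunks."""
    carry = b""
    for chunk in payload_chunks:
        data = carry + chunk
        # Base64 maps every 3 input bytes to 4 output chars, so only encode
        # a multiple of 3 bytes and keep the remainder for the next round.
        usable = len(data) - (len(data) % 3)
        out.write(base64.b64encode(data[:usable]).decode("ascii"))
        carry = data[usable:]
    if carry:
        # The final group may be shorter than 3 bytes and gets '=' padding.
        out.write(base64.b64encode(carry).decode("ascii"))
```

Both functions produce the same text. The streaming variant never holds the whole payload, its encoded copy, and the resulting string in memory at the same time.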
This change integrates the circuit breaker in AsyncTaskIndexService to make sure that we won't hit OOM when serializing a large response of an async search. Related to elastic#67594 Supersedes elastic#73638 Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>
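As a rough illustration of what checking a circuit breaker around serialization means, here is a minimal Python sketch. It does not reproduce `AsyncTaskIndexService` or the Elasticsearch breaker API: `MemoryBreaker`, `CircuitBreakingError`, `reserve`, and `store_response` are hypothetical names chosen for this example.

```python
from contextlib import contextmanager


class CircuitBreakingError(Exception):
    """Raised when a reservation would push usage past the limit."""


class MemoryBreaker:
    """Tracks estimated memory use and refuses reservations above a limit."""

    def __init__(self, limit_bytes: int):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    @contextmanager
    def reserve(self, estimate_bytes: int, label: str):
        if self.used_bytes + estimate_bytes > self.limit_bytes:
            raise CircuitBreakingError(
                f"[{label}] would use {self.used_bytes + estimate_bytes} bytes, "
                f"limit is {self.limit_bytes}"
            )
        self.used_bytes += estimate_bytes
        try:
            yield
        finally:
            # Always give the reservation back, even if the write fails.
            self.used_bytes -= estimate_bytes


def store_response(breaker: MemoryBreaker, serialized: bytes, sink: list) -> None:
    """Account for the response size before doing the memory-heavy write."""
    with breaker.reserve(len(serialized), "async_search_response"):
        sink.append(serialized)
```

The point of the pattern is that the request fails with a recoverable error before the allocation happens, instead of the node running out of memory.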
Add a dynamic transient cluster setting search.max_async_search_response_size that controls the maximum allowed size for a stored async search response. The default max size is 10MB. An attempt to store an async search response larger than this size will result in an error. Relates to elastic#67594
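For illustration, the request body that would set this limit through the cluster update settings API (`PUT _cluster/settings`) can be built as below. The setting name comes from the description above. The helper name `max_response_size_settings` is hypothetical, and the value `"10mb"` assumes Elasticsearch's usual byte-size notation.

```python
import json


def max_response_size_settings(size: str = "10mb") -> str:
    """Build a JSON body for the cluster update settings API."""
    body = {
        "transient": {
            # Setting name taken from the change description.
            "search.max_async_search_response_size": size,
        }
    }
    return json.dumps(body)
```

The returned string would be sent as the body of the settings request by whatever HTTP client is in use.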
Here are the fixes we made (thanks to @mayya-sharipova and @dnhatn!). In summary, 7.14 will bring stability improvements, but the issue won't be fully addressed until 7.15.
7.14: Check the circuit breaker when encoding and decoding the response, and avoid unnecessarily copying data.
7.15: Add a soft limit on the size of the response we will attempt to store. Compress the response to reduce the size of documents and help avoid hitting the limit.
Follow-ups planned for 7.15:
|
@jtibshirani Thanks, that's a great summary! @julie I've double-checked our action on the failure of |
In 7.14 we started to check the circuit breaker when updating the response in the async search index. If the circuit breaker is tripped, the async update call reports a failure, but we never update the response. So it can appear to clients as if the search is still running. This can be a problem for clients like Kibana that set a very large keep-alive time. We started recording failures in master and 7.15 when we introduced the response size limit. This PR backports the logic from that PR to 7.14. Relates to #67594.
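The behavior described here, recording a failure so that the search no longer appears to be running, might look roughly like the following sketch. The names `StoredSearch` and `update_stored_search` are hypothetical and the document shape is an assumption. The actual logic lives in the Elasticsearch async search code and may differ.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StoredSearch:
    """Hypothetical stand-in for the stored async search document."""
    is_running: bool = True
    response: Optional[bytes] = None
    error: Optional[str] = None


def update_stored_search(
    doc: StoredSearch,
    serialize: Callable[[], bytes],
    write: Callable[[bytes], None],
) -> None:
    """Store the final response, or record the failure if storing it fails."""
    try:
        payload = serialize()
        write(payload)
        doc.response = payload
    except Exception as exc:  # e.g. a tripped breaker or an exceeded size limit
        # Record the failure so clients stop seeing the search as running.
        doc.error = f"{type(exc).__name__}: {exc}"
    finally:
        doc.is_running = False
```

Without the `except` branch, a failed update would leave `is_running` set to true until the keep-alive expires, which is the problem described for clients with a very large keep-alive.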
I added the two follow-ups to the async search meta issue (#88658). Closing this in favor of that issue. |
We saw an issue where storing a huge async search response led to an OOM.
This issue is about finding ways to prevent this. We've discussed possible ways to prevent this from happening:
Introduce an additional circuit breaker on the coordinating node for the fetch phase. In #67431 (Early detection of circuit breaker exception in the coordinating node), we introduced a circuit breaker for aggs on the coordinating node, but we still don't have a circuit breaker for top docs in the fetch phase. This circuit breaker would trip when the memory used for collected top docs exceeds the request circuit breaker size.
Introduce a circuit breaker on indexing huge docs.
Store the async search response as an object with enabled=false instead of storing it in a stored field. When we store the async response in a stored field, we first encode it with Base64, which needs additional memory. If we avoid Base64 encoding, we may need much less memory. One open question is how to preserve the ES version with which the response was recorded.
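As a sketch of the last option, a mapping with an object field that has `enabled` set to `false` could be built as follows. The field name `response` and the helper `async_response_mapping` are hypothetical, and this is not the mapping of the actual async search index. It only shows the shape of such a mapping.

```python
import json


def async_response_mapping(field: str = "response") -> dict:
    """Build an index mapping where `field` is kept in _source but not parsed or indexed."""
    return {
        "mappings": {
            "properties": {
                # "enabled": false tells Elasticsearch to skip parsing this object.
                field: {"type": "object", "enabled": False},
            }
        }
    }


# Serialized form, as it would appear in a create-index request body.
mapping_json = json.dumps(async_response_mapping())
```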