Prevent storing huge async search responses #67594

mayya-sharipova · 2021-01-15T17:43:25Z

We saw an issue when storing a huge async search response lead to OOM.

This issue is about finding ways to prevent this. We've discussed possible ways to prevent this from happening:

Introduce an additional circuit breaker on a coordinating node for fetch phase. In Early detection of circuit breaker exception in the coordinating node #67431, we've introduced a circuit breaker for aggs for a coordinating node, but we still don't have a circuit breaker for top docs on the fetch phase. This circuit breaker will trip when memory used for collected top docs exceeds a Request circuit breaker size.
Introduce a circuit breaker on indexing huge docs.
Store async search response as an object with enabled=false instead of storing in a stored field. When we store async response in a stored field, we first encode it with base 64 encoding which needs additional memory. If we avoid base 64 encoding, we may need much less memory. But here a question is how to preserve the es version when this response was recorded?

The text was updated successfully, but these errors were encountered:

elasticmachine · 2021-01-15T17:43:28Z

Pinging @elastic/es-search (Team:Search)

jtibshirani · 2021-05-28T01:56:21Z

From digging into the errors we've seen, there are a couple different sources of problems:

We serialize the in-memory search response to a byte array, then base64 encode this to a string. First, we can OOM while allocating the byte array. Second, if the length of the encoded response exceeds ~2GB, then we'll also OOM. We could try to catch these cases to avoid OOMing and instead throw a normal exception (with a clear error message)?
We use an update request to write the response to the async search index. Update requests must parse the document source to a map to merge it into the existing document. This can use ~3x the memory of the original request which has caused OOMs. Is there any way to rework the async search code to use an index request instead of update, to avoid the memory overhead?
Encoding and writing the search response can consume extra memory -- can we account for this additional memory usage in circuit breaking?

These are short-term fixes that would immediately help with the problems we're seeing. In parallel we may want to step back and generally think through how to be robust to large search responses (we've discussed ideas like circuit breaking the fetch phase, truncating field values, limiting document size at index time...). I don't have a clear idea for this yet.

Related to elastic#67594

jtibshirani · 2021-06-03T21:08:39Z

After more thought and discussion with others, I think it'd make the most sense to start with a soft limit on the size of the async search response we attempt to store and return. If coordinating node detects the serialized response would be larger than the limit, we'd throw an error early to avoid memory problems. The limit could be configured through a cluster setting.

The thinking is that even if we addressed the points in the comment above, we could see problems when indexing the document, since to create stored fields, we copy the document source into a BytesRef (which is backed by a byte array). Also having very large documents in the .async_search index can lead to high memory usage down the line, for example while merging stored fields.

Summary of proposal:

Introduce a soft limit on the size of the async search response. We'd need to choose the limit carefully to avoid blocking too many searches that were previously possible.
As a complement, look into "low hanging fruit" to reduce memory usage on coordinating node and add circuit breaking. (For example, it may be possible to reduce overhead during encoding from 3x to ~2x?)

We could also look into improvements that help make the search response smaller, so that we reject as few searches as possible. We've discussed some ideas like truncating large fields (#72453), and compressing the response before storing it.

Today, writing a Writable value to XContent in Base64 format performs these steps: (1) create a BytesStreamOutput, (2) write Writable to that output, (3) encode a copy of bytes from that output stream, (4) create a string from the encoded bytes, (5) write the encoded string to XContent. These steps allocate/use memory 5 times than writing the encode chars directly to the output of XContent. This API would help reduce memory usage when storing a large response of an async search. Relates #67594

) This change tries to write an async response directly to XContent in Base64 to avoid using multiple buffers. Relates to #67594

This change integrates the circuit breaker in AsyncTaskIndexService to make sure that we won't hit OOM when serializing a large response of an async search. Related to #67594 Supersedes #73638 Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>

Today, writing a Writable value to XContent in Base64 format performs these steps: (1) create a BytesStreamOutput, (2) write Writable to that output, (3) encode a copy of bytes from that output stream, (4) create a string from the encoded bytes, (5) write the encoded string to XContent. These steps allocate/use memory 5 times than writing the encode chars directly to the output of XContent. This API would help reduce memory usage when storing a large response of an async search. Relates elastic#67594

Today, writing a Writable value to XContent in Base64 format performs these steps: (1) create a BytesStreamOutput, (2) write Writable to that output, (3) encode a copy of bytes from that output stream, (4) create a string from the encoded bytes, (5) write the encoded string to XContent. These steps allocate/use memory 5 times than writing the encode chars directly to the output of XContent. This API would help reduce memory usage when storing a large response of an async search. Relates #67594

This change integrates the circuit breaker in AsyncTaskIndexService to make sure that we won't hit OOM when serializing a large response of an async search. Related to elastic#67594 Supersedes elastic#73638 Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>

This change integrates the circuit breaker in AsyncTaskIndexService to make sure that we won't hit OOM when serializing a large response of an async search. Related to #67594 Supersedes #73638 Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>

Add a dynamic transient cluster setting search.max_async_search_response_size that controls the maximum allowed size for a stored async search response. The default max size is 10Mb. An attempt to store an async search response larger than this size will result in error. Relates to elastic#67594

Add a dynamic transient cluster setting search.max_async_search_response_size that controls the maximum allowed size for a stored async search response. The default max size is 10Mb. An attempt to store an async search response larger than this size will result in error. Relates to #67594

Add a dynamic transient cluster setting search.max_async_search_response_size that controls the maximum allowed size for a stored async search response. The default max size is 10Mb. An attempt to store an async search response larger than this size will result in error. Relates to elastic#67594

Related to #67594

Add a dynamic transient cluster setting search.max_async_search_response_size that controls the maximum allowed size for a stored async search response. The default max size is 10Mb. An attempt to store an async search response larger than this size will result in error. Relates to #67594

Related to #67594

Related to elastic#67594

Related to #67594

jtibshirani · 2021-07-07T18:12:25Z

Here are the fixes we made (thanks to @mayya-sharipova and @dnhatn!) In summary, 7.14 will bring stability improvements, but the issue won't be fully addressed until 7.15.

7.14 Check circuit breaker when encoding and decoding response, and avoid unnecessarily copying data.

7.15 Add a soft limit on the size of the response we will attempt to store. Compress the response to reduce size of documents and help avoid hitting the limit.

Follow-ups planned for 7.15:

In 7.x the soft limit is disabled by default. We're discussing whether we can enable it by default to a large number like we do on master (10MB).
The limit is currently disabled for EQL responses, should we enable it as well?

mayya-sharipova · 2021-07-08T12:04:42Z

@jtibshirani Thanks, that's a great summary!

@julie I've doubled checked our action on the failure of createResponse, and besides recording a failure in the logs, we also return a failure response to a user, so we don't need to do anything extra here.

In 7.14 we started to check the circuit breaker when updating the response in the async search index. If the circuit breaker is tripped, the async update call reports a failure, but we never update the response. So it can appear to clients as if the search is still running. This can be a problem for clients like Kibana that set a very large keep-alive time. We started recording failures in master and 7.15 when we introduced the response size limit. This PR backports the logic from that PR to 7.14. Relates to #67594.

jtibshirani · 2022-07-20T20:56:01Z

I added the two follow-ups to the async search meta issue (#88658). Closing this in favor of that issue.

mayya-sharipova added >bug >enhancement :Search/Search Search-related issues that do not fall into other categories labels Jan 15, 2021

elasticmachine added the Team:Search Meta label for search team label Jan 15, 2021

mayya-sharipova mentioned this issue May 3, 2021

Separate metadata from the actual response in async search index #71223

Closed

mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Jun 1, 2021

Add circuit breaker to storing async response

a23e641

Related to elastic#67594

mayya-sharipova mentioned this issue Jun 1, 2021

Add circuit breaker to storing async response #73638

Closed

mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Jun 2, 2021

Add circuit breaker to storing async response

5d361b2

Related to elastic#67594

jtibshirani mentioned this issue Jun 5, 2021

Write async response directly to XContent to reduce memory usage #73707

Merged

dnhatn mentioned this issue Jun 6, 2021

Allow build XContent directly from Writable #73804

Merged

dnhatn added a commit that referenced this issue Jun 7, 2021

Write async response directly to XContent to reduce memory usage (#73707

6b7fea0

) This change tries to write an async response directly to XContent in Base64 to avoid using multiple buffers. Relates to #67594

dnhatn mentioned this issue Jun 7, 2021

Integrate circuit breaker in AsyncTaskIndexService #73862

Merged

dnhatn mentioned this issue Jun 9, 2021

Allow build XContent directly from Writable #73960

Merged

dnhatn mentioned this issue Jun 9, 2021

Integrate circuit breaker in AsyncTaskIndexService #73969

Merged

mayya-sharipova mentioned this issue Jun 22, 2021

Set max allowed size for stored async response #74455

Merged

dnhatn mentioned this issue Jun 30, 2021

Compress async search responses before storing #74766

Merged

mayya-sharipova mentioned this issue Jun 30, 2021

Set max allowed size for stored async response #74785

Merged

dnhatn added a commit that referenced this issue Jun 30, 2021

Compress async search responses before storing (#74766)

55175de

Related to #67594

dnhatn added a commit that referenced this issue Jul 1, 2021

Compress async search responses before storing (#74766)

59d4508

Related to #67594

dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Jul 1, 2021

Compress async search responses before storing (elastic#74766)

a49a2fe

Related to elastic#67594

dnhatn mentioned this issue Jul 1, 2021

Compress async search responses before storing (#74766) #74804

Merged

dnhatn added a commit that referenced this issue Jul 3, 2021

Compress async search responses before storing (#74766)

f194818

Related to #67594

jtibshirani mentioned this issue Jul 13, 2021

Record failure when writing async search trips circuit breaker #75270

Merged

jtibshirani mentioned this issue Jul 20, 2022

Enhancing Async Search #88658

Open

11 tasks

jtibshirani closed this as completed Jul 20, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent storing huge async search responses #67594

Prevent storing huge async search responses #67594

mayya-sharipova commented Jan 15, 2021 •

edited

elasticmachine commented Jan 15, 2021

jtibshirani commented May 28, 2021

jtibshirani commented Jun 3, 2021 •

edited

jtibshirani commented Jul 7, 2021 •

edited

mayya-sharipova commented Jul 8, 2021

jtibshirani commented Jul 20, 2022

Prevent storing huge async search responses #67594

Prevent storing huge async search responses #67594

Comments

mayya-sharipova commented Jan 15, 2021 • edited

elasticmachine commented Jan 15, 2021

jtibshirani commented May 28, 2021

jtibshirani commented Jun 3, 2021 • edited

jtibshirani commented Jul 7, 2021 • edited

mayya-sharipova commented Jul 8, 2021

jtibshirani commented Jul 20, 2022

mayya-sharipova commented Jan 15, 2021 •

edited

jtibshirani commented Jun 3, 2021 •

edited

jtibshirani commented Jul 7, 2021 •

edited