Migrations should dynamically adjust batch size to prevent failing on 413 errors from Elasticsearch #107641

Closed
joshdover opened this issue Aug 4, 2021 · 10 comments · Fixed by #109540
Labels: Feature:Migrations, Feature:Saved Objects, project:ResilientSavedObjectMigrations, Team:Core

Comments

joshdover (Member) commented Aug 4, 2021

Today, it's possible for a batch of objects being indexed during the TRANSFORMED_DOCUMENTS_BULK_INDEX step to fail with a 413 error from Elasticsearch, indicating that the uncompressed request body is larger than Elasticsearch's http.max_content_length config (which defaults to 100mb).

We should be able to better handle this scenario by either providing a log message suggesting which setting(s) to adjust, or by automatically shrinking the size of the bulk index requests.

For instance, the TRANSFORMED_DOCUMENTS_BULK_INDEX step could automatically break the batch of documents in half and index it in two separate requests rather than one. I'd imagine this would work by (see the sketch after this list):

  • Making TransformedDocumentsBulkIndex.transformedDocs an array of arrays (Array<Array<SavedObjectsRawDoc>>)
  • On each TRANSFORMED_DOCUMENTS_BULK_INDEX action, only attempting to index the first batch in the transformedDocs array.
    • If it fails with a 413, split the first batch in transformedDocs into two batches of equal size and run the TRANSFORMED_DOCUMENTS_BULK_INDEX step again.
    • If it succeeds, either run TRANSFORMED_DOCUMENTS_BULK_INDEX on the next batch (if any remain) or proceed to OUTDATED_DOCUMENTS_SEARCH_READ (if no batches remain).
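
A minimal sketch of that split-on-413 loop, for illustration only: the `bulkIndex` callback, the `RawDoc` shape, and the error's `statusCode` field are assumptions, not Kibana's actual migration actions.

```ts
type RawDoc = { _id: string; _source: Record<string, unknown> };

async function indexWithSplitOn413(
  bulkIndex: (docs: RawDoc[]) => Promise<void>, // assumed to throw an error carrying statusCode on rejection
  transformedDocs: RawDoc[][]
): Promise<void> {
  let batches = transformedDocs;
  while (batches.length > 0) {
    const [current, ...rest] = batches;
    try {
      await bulkIndex(current);
      batches = rest; // success: move on to the next batch
    } catch (err) {
      const statusCode = (err as { statusCode?: number }).statusCode;
      if (statusCode === 413 && current.length > 1) {
        // Split the failing batch in half and retry both halves.
        const mid = Math.ceil(current.length / 2);
        batches = [current.slice(0, mid), current.slice(mid), ...rest];
      } else {
        throw err; // a single oversized document or an unrelated error
      }
    }
  }
}
```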
joshdover added the Team:Core, Feature:Saved Objects, project:ResilientSavedObjectMigrations, and Feature:Migrations labels on Aug 4, 2021
elasticmachine (Contributor)

Pinging @elastic/kibana-core (Team:Core)

mshustov (Contributor) commented Aug 4, 2021

> We should be able to better handle this scenario by either providing a log message suggesting which setting(s) to adjust, or by automatically shrinking the size of the bulk index requests.

Yeah, I suggested the same approach in CRA, but IMO it's valuable to analyze why we have such large SOs. Is it even a valid use case to store several MB in a single SO? It might lead to "Entity too large" problems in other places: when importing SOs, or when writing a single SO that is larger than http.max_content_length, for example.
It might also lead to OOM crashes when Kibana runs on an instance with a small amount of RAM.
IMO we should have a reasonable limit on the size of an SO.

joshdover (Member, Author) commented Aug 11, 2021

While we haven't been able to reproduce this exactly, we have seen cases where batches approach 85MB, and we suspect this happens with large batches of index-patterns. @mattkime Do we still need to store the field list in the index-pattern saved object? My understanding is that we're now fetching the field list on the fly. Is it possible to add a migration that strips these fields from the SOs?

rudolf (Contributor) commented Aug 16, 2021

In one case this was caused by siem-detection-engine-rule-status documents: when a rule fails, the document includes the failure log (I believe this is the case for other rule status types too). In this specific case the failure log was extremely long (the log line was truncated, so I can't say exactly what size it was), and there were probably many rule status documents. In the past there were also canvas-workpad-template documents that shipped with Kibana and were 10MB. I suspect that if users create a canvas workpad with high-resolution images they could still create similarly sized documents.

rudolf (Contributor) commented Aug 16, 2021

I only saw this after creating my own issue, which I've now closed as a duplicate. From #108708:

Thoughts on implementation...

For simplicity we should continue to read batches of migrations.batchSize documents; theoretically, reading a large batch could exceed Kibana's available heap, but this is currently a low risk.

So the dynamic batch size would only be relevant when indexing documents (the bulkOverwriteTransformedDocuments action). I see two alternatives:

  1. Use a library like json-size to calculate the size of each document as we add it to the current batch (Elasticsearch uses the uncompressed payload size to enforce its limit). Once the batch is "full" or has migrations.batchSize documents, index it and build the next batch.
  2. Index using migrations.batchSize until we hit a 413 error and then halve the payload size before trying again.

Unless (1) adds significant CPU overhead and slows down migrations, it feels like the preferred approach.
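
A rough sketch of option (1), assuming batches are capped both by document count and by serialized size. Buffer.byteLength(JSON.stringify(...)) stands in for a library like json-size and only approximates the uncompressed payload Elasticsearch measures (the bulk API's action metadata lines add a bit on top):

```ts
// Sketch only: build batches capped by both document count and payload size.
type RawDoc = Record<string, unknown>;

function buildBatches(docs: RawDoc[], maxBatchSize: number, maxBatchBytes: number): RawDoc[][] {
  const batches: RawDoc[][] = [[]];
  let currentBytes = 0;
  for (const doc of docs) {
    // Approximate the uncompressed JSON size of this document.
    const docBytes = Buffer.byteLength(JSON.stringify(doc), 'utf8');
    const current = batches[batches.length - 1];
    const isFull =
      current.length >= maxBatchSize ||
      (current.length > 0 && currentBytes + docBytes > maxBatchBytes);
    if (isFull) {
      batches.push([doc]); // start a new batch with this document
      currentBytes = docBytes;
    } else {
      current.push(doc);
      currentBytes += docBytes;
    }
  }
  return batches;
}
```

With the defaults assumed in this thread, this would be called as something like `buildBatches(transformedDocs, 1000, 100 * 1024 * 1024)`.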

mshustov (Contributor)

> Unless (1) adds significant CPU overhead and slows down migrations, it feels like the preferred approach.

What if we talk to the Elasticsearch team about including the actual request size and the max allowed request size in the error response? Then we could use these values to lower the number of objects sent. This approach is quite similar to option (2), but it might be a bit faster than halving the payload.

> In the past there were also canvas-workpad-template documents that shipped with Kibana and were 10MB.

> theoretically, reading a large batch could exceed Kibana's available heap, but this is currently a low risk.

I agree it's rather low risk, but uncontrolled growth of SO size might sooner or later lead to this problem. Shouldn't we have at least an approximate understanding of what the max heap size used by a batch might be? We can use max_allowed_size * batch size to calculate the max possible memory footprint of reading a SO batch.
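
For illustration only, a back-of-the-envelope bound under assumed numbers (a hypothetical 1MB per-object cap; the migrations.batchSize default is 1000):

```ts
// Hypothetical worst-case heap needed to hold one batch of raw documents.
const maxObjectSizeBytes = 1 * 1024 * 1024; // assumed per-SO cap, not an existing limit
const batchSize = 1000;                     // migrations.batchSize default
const worstCaseBatchBytes = maxObjectSizeBytes * batchSize; // ≈ 1 GiB of raw JSON
```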

rudolf (Contributor) commented Aug 17, 2021

> We can use max_allowed_size * batch size to calculate the max possible memory footprint of reading a SO batch.

Yeah I agree, having no upper bound on a saved object size makes it really hard to create a robust system.

> What if we talk to the Elasticsearch team about including the actual request size and the max allowed request size in the error response? Then we could use these values to lower the number of objects sent. This approach is quite similar to option (2), but it might be a bit faster than halving the payload.

Are you suggesting that instead of calculating each batch size like in (1) we only calculate the batch size with json-size after we receive a 413, and then instead of using 100mb as the batch size, we use the value from the Elasticsearch response? I don't see how we could use the actual request size from the ES response though.

I like this approach; that way we're not artificially limiting the batch sizes when an ES cluster might already have a larger http.max_content_length configured.
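
A minimal sketch of that flow, assuming a size-capped splitter like the buildBatches sketch above. The `maxAllowedBytes` field on the error is hypothetical: today the 413 response does not report the limit, so this falls back to the 100mb default.

```ts
// Sketch only: index optimistically and only measure/re-chunk after a 413.
type RawDoc = Record<string, unknown>;

const DEFAULT_LIMIT_BYTES = 100 * 1024 * 1024; // http.max_content_length default

async function indexBatch(
  bulkIndex: (docs: RawDoc[]) => Promise<void>,
  chunkBySize: (docs: RawDoc[], limitBytes: number) => RawDoc[][], // e.g. a size-capped splitter
  batch: RawDoc[]
): Promise<void> {
  try {
    await bulkIndex(batch); // optimistic first attempt, no size bookkeeping
  } catch (err) {
    const e = err as { statusCode?: number; maxAllowedBytes?: number };
    if (e.statusCode !== 413) throw err;
    // Hypothetical: a future ES response could report the limit; fall back to the default.
    const limit = e.maxAllowedBytes ?? DEFAULT_LIMIT_BYTES;
    for (const chunk of chunkBySize(batch, limit)) {
      await bulkIndex(chunk);
    }
  }
}
```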

rudolf changed the title from "Migrations should handle 413 errors from Elasticsearch" to "Migrations should dynamically adjust batch size to prevent failing on 413 errors from Elasticsearch" on Aug 17, 2021
pgayvallet (Contributor)

Should this be added to #104083?

rudolf (Contributor) commented Aug 19, 2021

I've opened elastic/elasticsearch#76705, but for now I'll just continue assuming the limit is set to 100mb.

mshustov (Contributor)

> Are you suggesting that instead of calculating each batch size like in (1) we only calculate the batch size with json-size after we receive a 413, and then instead of using 100mb as the batch size, we use the value from the Elasticsearch response?

Discussed offline: yes, I suggested getting this information from an Elasticsearch response because the size of a data structure might differ between JS (Kibana) and Java (Elasticsearch), so libraries like json-size may provide inaccurate data.

> Yeah I agree, having no upper bound on a saved object size makes it really hard to create a robust system.

We can start by providing a warning about excessive size in dev mode to prevent such cases: #109815
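
A minimal sketch of such a dev-mode warning, with an assumed 1MB threshold and a generic logger interface (not the implementation proposed in #109815):

```ts
// Sketch only: warn in dev mode when a saved object's serialized size looks excessive.
interface Logger {
  warn(message: string): void;
}

const SIZE_WARNING_THRESHOLD_BYTES = 1 * 1024 * 1024; // assumed 1MB threshold

function warnIfOversized(
  logger: Logger,
  type: string,
  id: string,
  attributes: unknown,
  isDevMode: boolean
): void {
  if (!isDevMode) return;
  const sizeBytes = Buffer.byteLength(JSON.stringify(attributes), 'utf8');
  if (sizeBytes > SIZE_WARNING_THRESHOLD_BYTES) {
    logger.warn(
      `Saved object ${type}:${id} is ~${Math.round(sizeBytes / 1024)}KB when serialized; ` +
        `objects this large can cause migration, import, and bulk-index failures.`
    );
  }
}
```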
