Migrations should dynamically adjust batch size to prevent failing on 413 errors from Elasticsearch #107641

Closed
joshdover opened this issue Aug 4, 2021 · 10 comments · Fixed by #109540
Labels: Feature:Migrations, Feature:Saved Objects, project:ResilientSavedObjectMigrations, Team:Core

Comments

joshdover (Member) commented Aug 4, 2021

Today, it's possible for a batch of objects being indexed during the TRANSFORMED_DOCUMENTS_BULK_INDEX step to fail with a 413 error from Elasticsearch, indicating that the uncompressed request body is larger than Elasticsearch's http.max_content_length config (which defaults to 100mb).

We should be able to better handle this scenario by either providing a log message suggesting which setting(s) to adjust, or by automatically shrinking the size of the bulk index requests.

For instance, the TRANSFORMED_DOCUMENTS_BULK_INDEX step could automatically break the batch of documents in half and index it in two separate requests rather than one. I'd imagine this would work by (see the sketch after this list):

  • Making TransformedDocumentsBulkIndex.transformedDocs an array of arrays (Array<Array<SavedObjectsRawDoc>>)
  • On each TRANSFORMED_DOCUMENTS_BULK_INDEX action, only attempting to index the first batch in the transformedDocs array.
    • If it fails with a 413, split the first batch in transformedDocs into two batches of equal size and run the TRANSFORMED_DOCUMENTS_BULK_INDEX step again.
    • If it succeeds, either run TRANSFORMED_DOCUMENTS_BULK_INDEX on the next batch (if any remain) or proceed to OUTDATED_DOCUMENTS_SEARCH_READ (if no batches remain).
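
A minimal sketch of that split-on-413 loop, for illustration only: the `bulkIndex` callback, the `RawDoc` shape, and the error's `statusCode` field are assumptions, not Kibana's actual migration actions.

```ts
type RawDoc = { _id: string; _source: Record<string, unknown> };

async function indexWithSplitOn413(
  bulkIndex: (docs: RawDoc[]) => Promise<void>, // assumed to throw an error carrying statusCode on rejection
  transformedDocs: RawDoc[][]
): Promise<void> {
  let batches = transformedDocs;
  while (batches.length > 0) {
    const [current, ...rest] = batches;
    try {
      await bulkIndex(current);
      batches = rest; // success: move on to the next batch
    } catch (err) {
      const statusCode = (err as { statusCode?: number }).statusCode;
      if (statusCode === 413 && current.length > 1) {
        // Split the failing batch in half and retry both halves.
        const mid = Math.ceil(current.length / 2);
        batches = [current.slice(0, mid), current.slice(mid), ...rest];
      } else {
        throw err; // a single oversized document or an unrelated error
      }
    }
  }
}
```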
joshdover added the Team:Core, Feature:Saved Objects, project:ResilientSavedObjectMigrations, and Feature:Migrations labels on Aug 4, 2021
elasticmachine (Contributor)

Pinging @elastic/kibana-core (Team:Core)

mshustov (Contributor) commented Aug 4, 2021

> We should be able to better handle this scenario by either providing a log message suggesting which setting(s) to adjust, or by automatically shrinking the size of the bulk index requests.

Yeah, I suggested the same approach in CRA, but IMO it's valuable to analyze why we have such large SOs. Is it even a valid use case to store several MB in a single SO? It might lead to "Entity too large" problems in other places: when importing SOs, or when writing a single SO that is larger than http.max_content_length, for example.
It might also lead to OOM crashes when Kibana runs on an instance with a small amount of RAM.
IMO we should have a reasonable limit on the size of an SO.

joshdover (Member, Author) commented Aug 11, 2021

While we haven't been able to reproduce this exactly, we have seen cases where batches approach 85MB, and we suspect this happens with large batches of index-patterns. @mattkime Do we still need to store the field list in the index-pattern saved object? My understanding is that we're now fetching the field list on the fly. Is it possible to add a migration that strips these fields from the SOs?

rudolf (Contributor) commented Aug 16, 2021

In one case this was caused by siem-detection-engine-rule-status documents: when a rule fails, the document includes the failure log (I believe this is the case for other rule status types too). In this specific case the failure log was extremely long (the log line was truncated, so I can't say exactly what size it was), and there were probably many rule status documents. In the past there were also canvas-workpad-template documents that shipped with Kibana and were 10MB. I suspect that if users create a canvas workpad with high-resolution images they could still create similarly sized documents.

rudolf (Contributor) commented Aug 16, 2021

I only saw this after creating my own issue, which I've now closed as a duplicate. From #108708:

Thoughts on implementation...

For simplicity we should continue to read batches of migrations.batchSize documents; theoretically, reading a large batch could exceed Kibana's available heap, but this is currently a low risk.

So the dynamic batch size would only be relevant when indexing documents (the bulkOverwriteTransformedDocuments action). I see two alternatives:

  1. Use a library like json-size to calculate the size of each document as we add it to the current batch (Elasticsearch uses the uncompressed payload size to enforce its limit). Once the batch is "full" or has migrations.batchSize documents, index it and build the next batch.
  2. Index using migrations.batchSize until we hit a 413 error and then halve the payload size before trying again.

Unless (1) adds significant CPU overhead and slows down migrations, it feels like the preferred approach.
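
A rough sketch of option (1), assuming batches are capped both by document count and by serialized size. Buffer.byteLength(JSON.stringify(...)) stands in for a library like json-size and only approximates the uncompressed payload Elasticsearch measures (the bulk API's action metadata lines add a bit on top):

```ts
// Sketch only: build batches capped by both document count and payload size.
type RawDoc = Record<string, unknown>;

function buildBatches(docs: RawDoc[], maxBatchSize: number, maxBatchBytes: number): RawDoc[][] {
  const batches: RawDoc[][] = [[]];
  let currentBytes = 0;
  for (const doc of docs) {
    // Approximate the uncompressed JSON size of this document.
    const docBytes = Buffer.byteLength(JSON.stringify(doc), 'utf8');
    const current = batches[batches.length - 1];
    const isFull =
      current.length >= maxBatchSize ||
      (current.length > 0 && currentBytes + docBytes > maxBatchBytes);
    if (isFull) {
      batches.push([doc]); // start a new batch with this document
      currentBytes = docBytes;
    } else {
      current.push(doc);
      currentBytes += docBytes;
    }
  }
  return batches;
}
```

With the defaults assumed in this thread, this would be called as something like `buildBatches(transformedDocs, 1000, 100 * 1024 * 1024)`.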

mshustov (Contributor)

> Unless (1) adds significant CPU overhead and slows down migrations, it feels like the preferred approach.

What if we talk to the Elasticsearch team about including the actual request size and the max allowed request size in the error response? Then we could use these values to lower the number of objects sent. This approach is quite similar to option (2), but it might be a bit faster than halving the payload.

> In the past there were also canvas-workpad-template documents that shipped with Kibana and were 10MB.

> theoretically, reading a large batch could exceed Kibana's available heap, but this is currently a low risk.

I agree it's rather low risk, but uncontrolled growth of SO size might sooner or later lead to this problem. Shouldn't we have at least an approximate understanding of what the max heap size used by a batch might be? We can use max_allowed_size * batch size to calculate the max possible memory footprint of reading a SO batch.
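
For illustration only, a back-of-the-envelope bound under assumed numbers (a hypothetical 1MB per-object cap; the migrations.batchSize default is 1000):

```ts
// Hypothetical worst-case heap needed to hold one batch of raw documents.
const maxObjectSizeBytes = 1 * 1024 * 1024; // assumed per-SO cap, not an existing limit
const batchSize = 1000;                     // migrations.batchSize default
const worstCaseBatchBytes = maxObjectSizeBytes * batchSize; // ≈ 1 GiB of raw JSON
```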

rudolf (Contributor) commented Aug 17, 2021

> We can use max_allowed_size * batch size to calculate the max possible memory footprint of reading a SO batch.

Yeah I agree, having no upper bound on a saved object size makes it really hard to create a robust system.

> What if we talk to the Elasticsearch team about including the actual request size and the max allowed request size in the error response? Then we could use these values to lower the number of objects sent. This approach is quite similar to option (2), but it might be a bit faster than halving the payload.

Are you suggesting that instead of calculating each batch size like in (1) we only calculate the batch size with json-size after we receive a 413, and then instead of using 100mb as the batch size, we use the value from the Elasticsearch response? I don't see how we could use the actual request size from the ES response though.

I like this approach; that way we're not artificially limiting the batch sizes when an ES cluster might already have a larger http.max_content_length configured.
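
A minimal sketch of that flow, assuming a size-capped splitter like the buildBatches sketch above. The `maxAllowedBytes` field on the error is hypothetical: today the 413 response does not report the limit, so this falls back to the 100mb default.

```ts
// Sketch only: index optimistically and only measure/re-chunk after a 413.
type RawDoc = Record<string, unknown>;

const DEFAULT_LIMIT_BYTES = 100 * 1024 * 1024; // http.max_content_length default

async function indexBatch(
  bulkIndex: (docs: RawDoc[]) => Promise<void>,
  chunkBySize: (docs: RawDoc[], limitBytes: number) => RawDoc[][], // e.g. a size-capped splitter
  batch: RawDoc[]
): Promise<void> {
  try {
    await bulkIndex(batch); // optimistic first attempt, no size bookkeeping
  } catch (err) {
    const e = err as { statusCode?: number; maxAllowedBytes?: number };
    if (e.statusCode !== 413) throw err;
    // Hypothetical: a future ES response could report the limit; fall back to the default.
    const limit = e.maxAllowedBytes ?? DEFAULT_LIMIT_BYTES;
    for (const chunk of chunkBySize(batch, limit)) {
      await bulkIndex(chunk);
    }
  }
}
```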

rudolf changed the title from "Migrations should handle 413 errors from Elasticsearch" to "Migrations should dynamically adjust batch size to prevent failing on 413 errors from Elasticsearch" on Aug 17, 2021
pgayvallet (Contributor)

Should this be added to #104083?

rudolf (Contributor) commented Aug 19, 2021

I've opened elastic/elasticsearch#76705, but for now I'll just continue assuming the limit is set to 100mb.

mshustov (Contributor)

> Are you suggesting that instead of calculating each batch size like in (1) we only calculate the batch size with json-size after we receive a 413, and then instead of using 100mb as the batch size, we use the value from the Elasticsearch response?

Discussed offline: yes, I suggested getting this information from an Elasticsearch response because the size of a data structure might differ between JS (Kibana) and Java (Elasticsearch), so libraries like json-size may provide inaccurate data.

> Yeah I agree, having no upper bound on a saved object size makes it really hard to create a robust system.

We can start by providing a warning about excessive size in dev mode to prevent such cases: #109815
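
A minimal sketch of such a dev-mode warning, with an assumed 1MB threshold and a generic logger interface (not the implementation proposed in #109815):

```ts
// Sketch only: warn in dev mode when a saved object's serialized size looks excessive.
interface Logger {
  warn(message: string): void;
}

const SIZE_WARNING_THRESHOLD_BYTES = 1 * 1024 * 1024; // assumed 1MB threshold

function warnIfOversized(
  logger: Logger,
  type: string,
  id: string,
  attributes: unknown,
  isDevMode: boolean
): void {
  if (!isDevMode) return;
  const sizeBytes = Buffer.byteLength(JSON.stringify(attributes), 'utf8');
  if (sizeBytes > SIZE_WARNING_THRESHOLD_BYTES) {
    logger.warn(
      `Saved object ${type}:${id} is ~${Math.round(sizeBytes / 1024)}KB when serialized; ` +
        `objects this large can cause migration, import, and bulk-index failures.`
    );
  }
}
```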
