
Split large batches of documents if received 413 from Elasticsearch #29778

Closed
Tracked by #16
rdner opened this issue Jan 10, 2022 · 11 comments · Fixed by #34911
Labels
8.6-candidate, estimation:Week (Task that represents a week of work.), Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team), v8.6.0

Comments

@rdner
Member

rdner commented Jan 10, 2022

Describe the enhancement:

Currently, after seeing a 413 response from Elasticsearch, the whole batch is dropped and the error is logged (#29368). Some of our customers would like to preserve at least some of the data from the batch instead of discarding it entirely.

The proposal is:

  • When seeing a 413 response from Elasticsearch, try to split the current batch (for example in two, or based on the size of each document in the batch)
  • If the 413 response is seen again, repeat the process until:
    • either all the smaller (split) batches are successfully sent to Elasticsearch, or
    • the initial batch is reduced to a single document that still cannot fit under the http.max_content_length threshold in Elasticsearch
  • If a batch contains only a single document that cannot be uploaded, drop the batch

Something similar was done in this PR logstash-plugins/logstash-output-elasticsearch#497
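
For illustration, here is a minimal Go sketch of the halving strategy described above. The names publishBatch, publishWithSplit, and errEntityTooLarge are hypothetical stand-ins, not the actual libbeat output API:

```go
// Illustrative only: publishBatch and errEntityTooLarge are hypothetical
// stand-ins, not the actual libbeat output API.
package main

import (
	"errors"
	"log"
)

var errEntityTooLarge = errors.New("413 Request Entity Too Large")

// publishBatch stands in for sending one _bulk request to Elasticsearch.
func publishBatch(docs []string) error {
	// ... encode docs into a _bulk body and send it ...
	return nil
}

// publishWithSplit retries a 413 by halving the batch until either all
// sub-batches are accepted or a single document is still too large,
// in which case that document is dropped and the drop is logged.
func publishWithSplit(docs []string) error {
	if len(docs) == 0 {
		return nil
	}
	err := publishBatch(docs)
	if !errors.Is(err, errEntityTooLarge) {
		return err // success, or an unrelated error handled elsewhere
	}
	if len(docs) == 1 {
		log.Printf("dropping single document that exceeds http.max_content_length")
		return nil
	}
	mid := len(docs) / 2
	if err := publishWithSplit(docs[:mid]); err != nil {
		return err
	}
	return publishWithSplit(docs[mid:])
}

func main() {
	_ = publishWithSplit([]string{`{"message":"a"}`, `{"message":"b"}`})
}
```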

Please ensure that each of these actions is logged, in particular:

When a batch is dropped, please state in the log, for info:

  • that the smaller batch was dropped
  • how many iterations it took to reduce the size (if this is possible)
  • any info from the batch that was dropped (ideally which application or integration it came from)

When a batch is being cut down to size:

  • what the current size is, and what it is being split into
  • what the currently configured bulk_max_size is
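
For illustration only, the split and drop events could carry structured fields along these lines. This sketch uses Go's standard log/slog package, and the field names and values are assumptions rather than an agreed schema:

```go
package main

import "log/slog"

func main() {
	logger := slog.Default()

	// When a batch is split because Elasticsearch returned 413.
	logger.Info("batch too large, splitting",
		"original_events", 2048,
		"split_into", 2,
		"configured_bulk_max_size", 1600, // hypothetical configured value
	)

	// When a single document still exceeds http.max_content_length and is dropped.
	logger.Warn("dropping oversized document",
		"split_iterations", 4,
		"dropped_events", 1,
		"data_stream", "logs-nginx.access-default", // example source/integration info
	)
}
```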

Describe a specific use case for the enhancement or feature:

Some of our clients are more sensitive to data loss than others, and this enhancement would allow us to preserve more data in case of a misconfiguration of http.max_content_length in Elasticsearch or bulk_max_size in Beats. This would improve the situation in most cases, but it would not completely solve the data loss problem.

@botelastic botelastic bot added the needs_team (Indicates that the issue/PR needs a Team:* label) label Jan 10, 2022
@rdner rdner added the Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team) label Jan 10, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team (Indicates that the issue/PR needs a Team:* label) label Jan 10, 2022
@jlind23
Collaborator

jlind23 commented Jan 11, 2022

ping @nimarezainia for prioritization and awareness

@nimarezainia
Contributor

@rdner this sounds like the right thing to do and makes our product more robust. What would be the level of effort involved in getting this done?

Ideally the buffers would match dynamically so we wouldn't hit these issues but I know that is near impossible.

Could you please ensure that each of these actions is logged, in particular:

When a batch is dropped, please state in the log, for info:

  • that the smaller batch was dropped
  • how many iterations it took to reduce the size (if this is possible)
  • any info from the batch that was dropped (ideally which application or integration it came from)

When a batch is being cut down to size:

  • what the current size is, and what it is being split into
  • what the currently configured bulk_max_size is

@rdner
Member Author

rdner commented Jan 17, 2022

@nimarezainia
We definitely need to log every single step; I totally agree with that, and I have copied it into the description.

Regarding the estimation of effort: I'm quite new to the project, so it's hard for me to give a precise estimate. I would ask @faec for help here, since we have already touched on this topic once.

@cmacknz it might be worth considering introducing this kind of behaviour into the shipper design too.

@jlind23
Collaborator

jlind23 commented Mar 21, 2022

@cmacknz Should I "close" this one and focus on the shippers as you have already included it in the V2 implementation?

@cmacknz
Member

cmacknz commented Mar 21, 2022

Let's keep the issue, as it is a good description of the work to do. We could remove the release target and labels, though. This will happen as part of the shipper work at some yet-to-be-determined point in the future.

@Foxboron

Is this not going to be fixed in the current beat implementation?

@jlind23
Collaborator

jlind23 commented Mar 30, 2022

@Foxboron Even though we are talking about fixing this in the shipper here, that doesn't mean it will not be fixed in standalone Beats.
The work will be done under the shipper project, but that doesn't exclude a fix in Beats.

@jlind23 jlind23 changed the title Split large batches of documents if received 413 from Elasticsearch [Design]Split large batches of documents if received 413 from Elasticsearch Jul 5, 2022
@faec faec added the estimation:Week (Task that represents a week of work.) label Jul 14, 2022
@jlind23 jlind23 changed the title [Design]Split large batches of documents if received 413 from Elasticsearch Split large batches of documents if received 413 from Elasticsearch Jul 20, 2022
@cmacknz
Member

cmacknz commented Oct 20, 2022

The current plan is to address this in Beats so the fix is available sooner, and then port it into the shipper afterwards, so we aren't tied to the date when the shipper is ready to be released.

@amitkanfer
Collaborator

> Ideally the buffers would match dynamically so we wouldn't hit these issues but I know that is near impossible.

@nimarezainia why is it near impossible? Is it because Agent/Beat can send to multiple ES clusters with different size limits? I agree with the conclusion, just want to make sure I understand all the reasons for it.

@nimarezainia
Contributor

> Ideally the buffers would match dynamically so we wouldn't hit these issues but I know that is near impossible.

> @nimarezainia why is it near impossible? Is it because Agent/Beat can send to multiple ES clusters with different size limits? I agree with the conclusion, just want to make sure I understand all the reasons for it.

I believe I was told that the ES buffer size is not known to us; this may have changed. If there were an API for us to read it, perhaps our output could be set to match, minimizing drops. Perhaps things have changed since that comment.
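
For illustration, if the byte limit were known to the client (from configuration, or from Elasticsearch if such an API existed), batches could be pre-split by size before sending instead of reacting to 413 responses. A minimal sketch under that assumption, with a hypothetical splitByBytes helper:

```go
package main

import "fmt"

// splitByBytes greedily packs encoded documents into batches whose total
// size stays under maxBytes; documents larger than maxBytes on their own
// are returned separately so the caller can drop and log them.
func splitByBytes(docs [][]byte, maxBytes int) (batches [][][]byte, oversized [][]byte) {
	var current [][]byte
	currentSize := 0
	for _, d := range docs {
		if len(d) > maxBytes {
			oversized = append(oversized, d)
			continue
		}
		if currentSize+len(d) > maxBytes && len(current) > 0 {
			batches = append(batches, current)
			current, currentSize = nil, 0
		}
		current = append(current, d)
		currentSize += len(d)
	}
	if len(current) > 0 {
		batches = append(batches, current)
	}
	return batches, oversized
}

func main() {
	docs := [][]byte{make([]byte, 40), make([]byte, 70), make([]byte, 30), make([]byte, 200)}
	batches, oversized := splitByBytes(docs, 100)
	fmt.Println(len(batches), "batches,", len(oversized), "oversized documents")
}
```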
