Split large batches on error instead of dropping them #34911
Conversation
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
This pull request does not have a backport label. To fixup this pull request, you need to add the backport labels for the needed branches.
Looks good. Can we add two more test cases? One where you split but all messages are too big, so all the retries fail, and a second where you split and one batch works but the other fails because it is too big.
Hmm, some obstacles to meaningfully testing these scenarios:
To the extent that nested success / failure can be tested, it's currently done in the
This pull request is now in conflicts. Could you fix it? 🙏
I can be convinced that this isn't the right place to do the test. But I still think we need to test what happens in these edge cases. Is there an existing publish test that we could expand?
The main difficulty is that meaningfully testing these scenarios requires a full running pipeline (with retryer, output workers, and the rest). I can spend some time looking for a better compromise, but could you say a little more about what specific logic or data flow you want tested? Would you be satisfied by a slightly augmented unit test where the test code itself manually sends the "retried" values through
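For what it's worth, the kind of unit test being discussed might look roughly like the sketch below, with the pipeline's retry path replaced by a plain callback. All names here (`fakeBatch`, the `retry` field) are illustrative stand-ins, not the actual beats test API:

```go
package pipeline

import "testing"

// fakeBatch is a self-contained stand-in for the pipeline batch:
// retries go through a plain callback instead of the real retryer.
type fakeBatch struct {
	events []int
	retry  func(*fakeBatch)
}

// SplitRetry mirrors the semantics under discussion: halve the batch
// and re-enqueue both halves; refuse if there is nothing to split.
func (b *fakeBatch) SplitRetry() bool {
	if len(b.events) < 2 {
		return false
	}
	half := len(b.events) / 2
	b.retry(&fakeBatch{events: b.events[:half], retry: b.retry})
	b.retry(&fakeBatch{events: b.events[half:], retry: b.retry})
	return true
}

func TestSplitRetry(t *testing.T) {
	var retried []*fakeBatch
	retry := func(b *fakeBatch) { retried = append(retried, b) }

	// A multi-event batch splits into two halves covering every event.
	b := &fakeBatch{events: []int{1, 2, 3, 4, 5}, retry: retry}
	if !b.SplitRetry() {
		t.Fatal("expected multi-event batch to split")
	}
	total := 0
	for _, rb := range retried {
		total += len(rb.events)
	}
	if len(retried) != 2 || total != 5 {
		t.Fatalf("expected 2 batches covering 5 events, got %d batches / %d events",
			len(retried), total)
	}

	// A single-event batch refuses to split; callers must drop it instead.
	if (&fakeBatch{events: []int{1}, retry: retry}).SplitRetry() {
		t.Fatal("expected single-event batch to refuse splitting")
	}
}
```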
This looks good to me, pending resolution of Lee's requests for additional tests. I like how small and easy to follow this implementation turned out to be, although I'm sure a lot of thought went into making it this way. Thanks!
Reaping the benefits of old refactors... I spent an ON week a while back rewriting the whole batch retry mechanism to be a lot more straightforward, and that did the heavy lifting for this feature :D
This pull request is now in conflicts. Could you fix it? 🙏
awesome. thank you.
Looks like the failures are unrelated tests that couldn't download from github.com and nodejs.org -- presumably flakiness in the test environment.
/test
What does this PR do?
Add a new method, `SplitRetry`, to `publisher.Batch`, which splits the events of the batch into two subsets and then separately retries each of them. Invoke this method in the Elasticsearch and Shipper outputs when we get an error indicating the batch was too large, instead of dropping the events as before. A sketch of the call-site pattern follows.
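A minimal sketch of that call-site pattern, assuming `SplitRetry` reports whether a split was possible (a single oversized event cannot be split further). The names `handlePublishError` and `errIndicatesBatchTooLarge` are illustrative, not actual beats identifiers; the real size checks are output-specific (an HTTP 413 from Elasticsearch, a gRPC ResourceExhausted from the shipper):

```go
package output

import "github.com/elastic/beats/v7/libbeat/publisher"

// errIndicatesBatchTooLarge stands in for each output's own check that
// the error means "request/message too large"; the real predicates live
// in the respective outputs.
func errIndicatesBatchTooLarge(err error) bool { return false /* output-specific */ }

// handlePublishError sketches the retry-or-drop decision after a publish
// attempt. The function name is illustrative.
func handlePublishError(batch publisher.Batch, err error) error {
	if err == nil {
		batch.ACK()
		return nil
	}
	if errIndicatesBatchTooLarge(err) {
		if batch.SplitRetry() {
			return nil // both halves were re-enqueued for retry
		}
		batch.Drop() // a single event already exceeds the limit on its own
		return err
	}
	batch.Retry() // unrelated errors: retry the whole batch as before
	return err
}
```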
Checklist
- I have made corresponding changes to the documentation
- I have made corresponding changes to the default configuration files
- I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.
How to test this PR locally
Elasticsearch output
- In `elasticsearch.yml`, set `http.max_content_length: 128kb` (anything smaller blocks all connections, since it prevents Filebeat from loading the initial index template).
- In `filebeat.yml`, set `output.elasticsearch.bulk_max_size: 500` (a combined config sketch follows this list).
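Putting those two settings together (a minimal sketch; everything else in both files stays as your normal test configuration):

```yaml
# elasticsearch.yml -- cap the bulk request size the server will accept
http.max_content_length: 128kb

# filebeat.yml -- make Filebeat send batches large enough to exceed that cap
output.elasticsearch:
  bulk_max_size: 500
```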
Then ingest some not-too-small log events. When run at head, this should produce many request-too-large errors and drop many events (in my test it ingested only 1536 out of 1 million log events).
When run with this PR, these errors should be gone, and it should ingest 100% of the events.
Shipper output
Patch the shipper so that its maximum RPC message size is `64 * units.KiB` rather than `64 * units.MiB`.
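For instance, assuming the limit is applied via the standard grpc-go server option (the actual location and names in the shipper code may differ):

```go
package main

import (
	units "github.com/docker/go-units"
	"google.golang.org/grpc"
)

// newTestServer shows the idea: cap inbound gRPC messages at 64 KiB
// (65536 bytes, the "max" in the error message below) instead of 64 MiB.
func newTestServer() *grpc.Server {
	return grpc.NewServer(grpc.MaxRecvMsgSize(64 * units.KiB))
}
```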
When run at head, some events should be dropped, and the logs should show many errors similar to:
"dropping 50 events because of RPC failure: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (118688 vs. 65536)"
When run with this PR, the errors should be gone and ingestion should succeed (as long as no individual event exceeds the RPC limit). For me the typical batch size in the logs drops from 50 to 25 or lower.
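As a back-of-the-envelope check on that halving, using the average event size implied by the error message above (118688 bytes / 50 events ≈ 2374 bytes, an assumption for illustration only):

```go
package main

import "fmt"

func main() {
	const rpcLimit = 65536 // the 64 KiB cap from the test setup
	const avgEvent = 2374  // ~118688 bytes / 50 events, from the log line
	size := 50             // batch size reported before splitting
	for size > 1 && size*avgEvent > rpcLimit {
		size /= 2 // each too-large error halves the batch via SplitRetry
	}
	fmt.Println("batch size after splitting:", size) // prints 25
}
```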
Related issues