
Split large batches on error instead of dropping them #34911

Merged
faec merged 27 commits into elastic:main from batch-size-fix on Mar 31, 2023

Conversation

faec
Contributor

@faec faec commented Mar 23, 2023

What does this PR do?

Add a new method, SplitRetry, to publisher.Batch, which splits the events of the batch into two subsets and then separately retries each of them. Invoke this method in the Elasticsearch and Shipper outputs when we get an error indicating the batch was too large, instead of dropping the batch as we did previously.
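For illustration, here is a minimal sketch of the split-and-retry idea in Go. The batch type, its fields, and the retry callback are assumptions made for this example; they are not the actual publisher.Batch interface.

package main

import "fmt"

// event stands in for a publisher event; its contents don't matter here.
type event struct{}

// batch is a hypothetical stand-in for publisher.Batch.
type batch struct {
    events []event
    retry  func(*batch) // hands a batch back to the pipeline's retryer
}

// SplitRetry splits the batch into two halves and hands each half back to
// the retryer. It reports false if the batch holds fewer than two events
// and therefore cannot be split any further.
func (b *batch) SplitRetry() bool {
    if len(b.events) < 2 {
        return false
    }
    mid := len(b.events) / 2
    b.retry(&batch{events: b.events[:mid], retry: b.retry})
    b.retry(&batch{events: b.events[mid:], retry: b.retry})
    return true
}

func main() {
    // Toy "retryer": just report the size of each retried sub-batch.
    retry := func(sub *batch) { fmt.Printf("retrying %d events\n", len(sub.events)) }
    b := &batch{events: make([]event, 5), retry: retry}
    if !b.SplitRetry() {
        fmt.Println("single-event batch cannot be split; it would have to be dropped")
    }
}

The point of the change is that an output hitting a too-large error can call SplitRetry, and only has to drop a batch once it is already down to a single oversized event.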

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Elasticsearch output

  • In elasticsearch.yml, set http.max_content_length: 128kb (anything smaller blocks all connections, since it prevents filebeat from loading the initial index template)
  • In filebeat.yml, set output.elasticsearch.bulk_max_size: 500 and ingest some not-too-small log events.

When run at head, this should produce many errors of the form:

{"log.level":"error","@timestamp":"2023-03-23T10:51:57.845-0400","log.logger":"elasticsearch","log.origin":{"file.name":"elasticsearch/client.go","file.line":241},"message":"failed to perform any bulk index operations: the bulk payload is too large for the server. Consider to adjust `http.max_content_length` parameter in Elasticsearch or `bulk_max_size` in the beat. The batch has been dropped","service.name":"filebeat","ecs.version":"1.6.0"}

and drops many events (in my test, only 1536 of the 1 million events in the test logs were ingested).

When run with this PR, these errors should be gone, and it should ingest 100% of the events.

Shipper output

  • Rebuild the shipper with an artificially low RPC limit by editing inputhandler.go to use e.g. 64 * units.KiB rather than 64 * units.MiB (see the sketch after this list)
  • Try ingesting into the Shipper (the default Agent config should be enough when pointed at the shipper; the standard metrics are big enough to trigger the issue)
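To picture the effect of that edit, here is a hypothetical standalone sketch of lowering a gRPC server's receive limit using go-units and grpc-go. The real inputhandler.go in the shipper is organized differently; only the 64 * units.KiB / 64 * units.MiB values come from the step above.

package main

import (
    "fmt"

    "github.com/docker/go-units"
    "google.golang.org/grpc"
)

func main() {
    // Normally the limit would be 64 * units.MiB; 64 KiB (65536 bytes) is low
    // enough that ordinary metric batches trigger a ResourceExhausted error.
    maxMsgSize := 64 * units.KiB

    server := grpc.NewServer(grpc.MaxRecvMsgSize(int(maxMsgSize)))
    _ = server // a real program would register services and serve here

    fmt.Printf("gRPC max receive size set to %d bytes\n", maxMsgSize)
}

Note that 64 KiB matches the 65536 in the example error message below.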

When run at head, some events should be dropped, and the logs should show many errors similar to:

"dropping 50 events because of RPC failure: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (118688 vs. 65536)"

When run with this PR, the errors should be gone and ingestion should succeed (as long as no individual event exceeds the RPC limit). For me the typical batch size in the logs drops from 50 to 25 or lower.

Related issues

@faec faec added bug Team:Elastic-Agent Label for the Agent team labels Mar 23, 2023
@faec faec requested a review from leehinman March 23, 2023 19:21
@faec faec requested a review from a team as a code owner March 23, 2023 19:21
@faec faec self-assigned this Mar 23, 2023
@faec faec requested review from ycombinator and removed request for a team March 23, 2023 19:21
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Mar 23, 2023
@mergify
Contributor

mergify bot commented Mar 23, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @faec? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch, where /d is the digit

@elasticmachine
Collaborator

elasticmachine commented Mar 23, 2023

💚 Build Succeeded



Build stats

  • Start Time: 2023-03-31T14:22:53.718+0000

  • Duration: 23 min 55 sec

Test stats 🧪

Test Results
Failed 0
Passed 310
Skipped 0
Total 310

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Contributor

@leehinman leehinman left a comment


Looks good. Can we add two more test cases? One where you split but all the messages are too big, so all the retries fail, and a second where you split and one batch works but the other fails because it is too big.

@faec
Contributor Author

faec commented Mar 27, 2023

Looks good. Can we add two more test cases? One where you split but all the messages are too big, so all the retries fail, and a second where you split and one batch works but the other fails because it is too big.

Hmm, some obstacles to meaningfully testing these scenarios:

  • Nothing in these tests exercises actual size; the mocked RPC just returns simulated success or failure based on test case config. We could expand the tests so the mocked server returns different statuses on subsequent calls, but:
  • Publish itself is stateless and has no way of telling whether the batch it's given is split, or being retried, or otherwise has any sequential significance. The only mechanism of carryover is the retryer, which in a real pipeline would be listening on the retry channel the tests use to verify Publish behavior. But as far as Publish itself is concerned, the only thing we can meaningfully check is whether it handles an individual batch correctly based on the resulting RPC status code, and the tests already check every currently handled combination of status code / SplitRetry success.

To the extent that nested success / failure can be tested, it's currently done in the SplitRetry tests, which check recursive splitting and final batch acknowledgment.
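As a side note, the edge cases in question are easy to picture with a toy model of the recursion. The sketch below is only an illustration under assumed names, not the actual SplitRetry test code.

package main

import "fmt"

// publishTooLarge simulates an output whose payload limit is exceeded: a batch
// over the limit is split in half and each half is retried, until a sub-batch
// either fits or is a single oversized event that can only be dropped.
func publishTooLarge(events []string, limit int) (sent, dropped int) {
    size := 0
    for _, e := range events {
        size += len(e)
    }
    if size <= limit {
        return len(events), 0 // this (sub-)batch fits and is ingested
    }
    if len(events) < 2 {
        return 0, 1 // a single oversized event cannot be split further
    }
    mid := len(events) / 2
    s1, d1 := publishTooLarge(events[:mid], limit)
    s2, d2 := publishTooLarge(events[mid:], limit)
    return s1 + s2, d1 + d2
}

func main() {
    // One event is larger than the whole limit, so it is dropped; the rest
    // end up in sub-batches that fit and are ingested.
    events := []string{"aaaa", "bbbb", "cccccccccccccccc", "dd", "ee"}
    sent, dropped := publishTooLarge(events, 10)
    fmt.Printf("sent %d, dropped %d\n", sent, dropped) // sent 4, dropped 1
}

This covers both cases from the review comment: a split where one half succeeds while the other keeps failing, and the terminal case where an individual event exceeds the limit.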

@mergify
Contributor

mergify bot commented Mar 28, 2023

This pull request now has conflicts. Could you fix them? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b batch-size-fix upstream/batch-size-fix
git merge upstream/main
git push upstream batch-size-fix

@jlind23 jlind23 requested a review from leehinman March 28, 2023 13:17
@leehinman
Contributor

To the extent that nested success / failure can be tested, it's currently done in the SplitRetry tests, which check recursive splitting and final batch acknowledgment.

I can be convinced that this isn't the right place to do the test. But I still think we need to test what happens in these edge cases. Is there an existing publish test that we could expand?

@faec
Contributor Author

faec commented Mar 28, 2023

I can be convinced that this isn't the right place to do the test. But I still think we need to test what happens in these edge cases. Is there an existing publish test that we could expand?

The main difficulty is that meaningfully testing these scenarios requires a full running pipeline (with retryer, output workers, and the rest) -- since Publish itself is ~stateless, what is really being tested here is full-pipeline retry operation, which is closer to an end-to-end test. This gets pretty tricky -- we would need enough of the pipeline running that it has full autonomous retry behavior, but not so much that it creates real connections or introduces timing-based nondeterminism into batch/reply order.

I can spend some time looking for a better compromise, but could you say a little more about what specific logic or data flow you want tested? Would you be satisfied by a slightly augmented unit test where the test code itself manually sends the "retried" values through Publish again, or is the full-pipeline retry behavior part of what you're getting at?

@cmacknz
Member

cmacknz commented Mar 29, 2023

This looks good to me, pending resolution of Lee's requests for additional tests.

I like how small and easy to follow this implementation turned out to be, although I'm sure a lot of thought went into making it this way. Thanks!

@faec
Contributor Author

faec commented Mar 29, 2023

I like how small and easy to follow this implementation turned out to be, although I'm sure a lot of thought went into making it this way.

Reaping the benefits of old refactors... I spent an ON week a while back rewriting the whole batch retry mechanism to be a lot more straightforward, and that did the heavy lifting for this feature :D

@mergify
Contributor

mergify bot commented Mar 30, 2023

This pull request now has conflicts. Could you fix them? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b batch-size-fix upstream/batch-size-fix
git merge upstream/main
git push upstream batch-size-fix

Contributor

@leehinman leehinman left a comment


Awesome, thank you.

@faec
Contributor Author

faec commented Mar 31, 2023

Looks like the failures are unrelated tests that couldn't download from github.com and nodejs.org -- presumably flakiness in the test environment.

@faec
Contributor Author

faec commented Mar 31, 2023

/test

@faec faec merged commit df59745 into elastic:main Mar 31, 2023
82 checks passed
@faec faec deleted the batch-size-fix branch March 31, 2023 14:46