
Split large batches on error instead of dropping them #34911

Merged
faec merged 27 commits into elastic:main from batch-size-fix on Mar 31, 2023

Conversation

faec
Contributor

@faec faec commented Mar 23, 2023

What does this PR do?

Add a new method, SplitRetry, to publisher.Batch, which splits the events of the batch into two subsets and then separately retries each of them. Invoke this method in the Elasticsearch and Shipper outputs when we get an error indicating the batch was too large, instead of dropping the batch as we did previously.
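For illustration, here is a minimal sketch of the split-and-retry idea in Go. The batch type, its fields, and the retry callback are assumptions made for this example; they are not the actual publisher.Batch interface.

package main

import "fmt"

// event stands in for a publisher event; its contents don't matter here.
type event struct{}

// batch is a hypothetical stand-in for publisher.Batch.
type batch struct {
    events []event
    retry  func(*batch) // hands a batch back to the pipeline's retryer
}

// SplitRetry splits the batch into two halves and hands each half back to
// the retryer. It reports false if the batch holds fewer than two events
// and therefore cannot be split any further.
func (b *batch) SplitRetry() bool {
    if len(b.events) < 2 {
        return false
    }
    mid := len(b.events) / 2
    b.retry(&batch{events: b.events[:mid], retry: b.retry})
    b.retry(&batch{events: b.events[mid:], retry: b.retry})
    return true
}

func main() {
    // Toy "retryer": just report the size of each retried sub-batch.
    retry := func(sub *batch) { fmt.Printf("retrying %d events\n", len(sub.events)) }
    b := &batch{events: make([]event, 5), retry: retry}
    if !b.SplitRetry() {
        fmt.Println("single-event batch cannot be split; it would have to be dropped")
    }
}

The point of the change is that an output hitting a too-large error can call SplitRetry, and only has to drop a batch once it is already down to a single oversized event.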

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Elasticsearch output

  • In elasticsearch.yml, set http.max_content_length: 128kb (anything smaller blocks all connections, since it prevents filebeat from loading the initial index template)
  • In filebeat.yml, set output.elasticsearch.bulk_max_size: 500 and ingest some not-too-small log events.

When run at head, this should produce many errors of the form:

{"log.level":"error","@timestamp":"2023-03-23T10:51:57.845-0400","log.logger":"elasticsearch","log.origin":{"file.name":"elasticsearch/client.go","file.line":241},"message":"failed to perform any bulk index operations: the bulk payload is too large for the server. Consider to adjust `http.max_content_length` parameter in Elasticsearch or `bulk_max_size` in the beat. The batch has been dropped","service.name":"filebeat","ecs.version":"1.6.0"}

and drops many events (in my test, only 1536 of the 1 million events in the test logs were ingested).

When run with this PR, these errors should be gone, and it should ingest 100% of the events.

Shipper output

  • Rebuild the shipper with an artificially low RPC limit by editing inputhandler.go to use e.g. 64 * units.KiB rather than 64 * units.MiB (see the sketch after this list)
  • Try ingesting into the Shipper (the default Agent config should be enough when pointed at the shipper; the standard metrics are big enough to trigger the issue)
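To picture the effect of that edit, here is a hypothetical standalone sketch of lowering a gRPC server's receive limit using go-units and grpc-go. The real inputhandler.go in the shipper is organized differently; only the 64 * units.KiB / 64 * units.MiB values come from the step above.

package main

import (
    "fmt"

    "github.com/docker/go-units"
    "google.golang.org/grpc"
)

func main() {
    // Normally the limit would be 64 * units.MiB; 64 KiB (65536 bytes) is low
    // enough that ordinary metric batches trigger a ResourceExhausted error.
    maxMsgSize := 64 * units.KiB

    server := grpc.NewServer(grpc.MaxRecvMsgSize(int(maxMsgSize)))
    _ = server // a real program would register services and serve here

    fmt.Printf("gRPC max receive size set to %d bytes\n", maxMsgSize)
}

Note that 64 KiB matches the 65536 in the example error message below.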

When run at head, some events should be dropped, and the logs should show many errors similar to:

"dropping 50 events because of RPC failure: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (118688 vs. 65536)"

When run with this PR, the errors should be gone and ingestion should succeed (as long as no individual event exceeds the RPC limit). For me the typical batch size in the logs drops from 50 to 25 or lower.

Related issues

@faec faec added bug Team:Elastic-Agent Label for the Agent team labels Mar 23, 2023
@faec faec requested a review from leehinman March 23, 2023 19:21
@faec faec requested a review from a team as a code owner March 23, 2023 19:21
@faec faec self-assigned this Mar 23, 2023
@faec faec requested review from ycombinator and removed request for a team March 23, 2023 19:21
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Mar 23, 2023
@mergify
Contributor

mergify bot commented Mar 23, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @faec? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch, where /d is the digit

@elasticmachine
Collaborator

elasticmachine commented Mar 23, 2023

💚 Build Succeeded



Build stats

  • Start Time: 2023-03-31T14:22:53.718+0000

  • Duration: 23 min 55 sec

Test stats 🧪

Test Results
Failed 0
Passed 310
Skipped 0
Total 310

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Contributor

@leehinman leehinman left a comment


Looks good. Can we add two more test cases? One where you split but all the messages are too big, so all the retries fail, and a second where you split and one batch works but the other fails because it is too big.

@faec
Contributor Author

faec commented Mar 27, 2023

Looks good. Can we add two more test cases? One where you split but all the messages are too big, so all the retries fail, and a second where you split and one batch works but the other fails because it is too big.

Hmm, some obstacles to meaningfully testing these scenarios:

  • Nothing in these tests exercises actual size; the mocked RPC just returns simulated success or failure based on test case config. We could expand the tests so the mocked server returns different statuses on subsequent calls, but:
  • Publish itself is stateless and has no way of telling whether the batch it's given is split, or being retried, or otherwise has any sequential significance. The only mechanism of carryover is the retryer, which in a real pipeline would be listening on the retry channel the tests use to verify Publish behavior. But as far as Publish itself is concerned, the only thing we can meaningfully check is whether it handles an individual batch correctly based on the resulting RPC status code, and the tests already check every currently handled combination of status code / SplitRetry success.

To the extent that nested success / failure can be tested, it's currently done in the SplitRetry tests, which check recursive splitting and final batch acknowledgment.
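As a side note, the edge cases in question are easy to picture with a toy model of the recursion. The sketch below is only an illustration under assumed names, not the actual SplitRetry test code.

package main

import "fmt"

// publishTooLarge simulates an output whose payload limit is exceeded: a batch
// over the limit is split in half and each half is retried, until a sub-batch
// either fits or is a single oversized event that can only be dropped.
func publishTooLarge(events []string, limit int) (sent, dropped int) {
    size := 0
    for _, e := range events {
        size += len(e)
    }
    if size <= limit {
        return len(events), 0 // this (sub-)batch fits and is ingested
    }
    if len(events) < 2 {
        return 0, 1 // a single oversized event cannot be split further
    }
    mid := len(events) / 2
    s1, d1 := publishTooLarge(events[:mid], limit)
    s2, d2 := publishTooLarge(events[mid:], limit)
    return s1 + s2, d1 + d2
}

func main() {
    // One event is larger than the whole limit, so it is dropped; the rest
    // end up in sub-batches that fit and are ingested.
    events := []string{"aaaa", "bbbb", "cccccccccccccccc", "dd", "ee"}
    sent, dropped := publishTooLarge(events, 10)
    fmt.Printf("sent %d, dropped %d\n", sent, dropped) // sent 4, dropped 1
}

This covers both cases from the review comment: a split where one half succeeds while the other keeps failing, and the terminal case where an individual event exceeds the limit.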

@mergify
Contributor

mergify bot commented Mar 28, 2023

This pull request now has conflicts. Could you fix them? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b batch-size-fix upstream/batch-size-fix
git merge upstream/main
git push upstream batch-size-fix

@jlind23 jlind23 requested a review from leehinman March 28, 2023 13:17
@leehinman
Contributor

To the extent that nested success / failure can be tested, it's currently done in the SplitRetry tests, which check recursive splitting and final batch acknowledgment.

I can be convinced that this isn't the right place to do the test. But I still think we need to test what happens in these edge cases. Is there an existing publish test that we could expand?

@faec
Contributor Author

faec commented Mar 28, 2023

I can be convinced that this isn't the right place to do the test. But I still think we need to test what happens in these edge cases. Is there an existing publish test that we could expand?

The main difficulty is that meaningfully testing these scenarios requires a full running pipeline (with retryer, output workers, and the rest) -- since Publish itself is ~stateless, what is really being tested here is full-pipeline retry operation, which is closer to an end-to-end test. This gets pretty tricky -- we would need enough of the pipeline running that it has full autonomous retry behavior, but not so much that it creates real connections or introduces timing-based nondeterminism into batch/reply order.

I can spend some time looking for a better compromise, but could you say a little more about what specific logic or data flow you want tested? Would you be satisfied by a slightly augmented unit test where the test code itself manually sends the "retried" values through Publish again, or is the full-pipeline retry behavior part of what you're getting at?

@cmacknz
Member

cmacknz commented Mar 29, 2023

This looks good to me, pending resolution of Lee's requests for additional tests.

I like how small and easy to follow this implementation turned out to be, although I'm sure a lot of thought went into making it this way. Thanks!

@faec
Contributor Author

faec commented Mar 29, 2023

I like how small and easy to follow this implementation turned out to be, although I'm sure a lot of thought went into making it this way.

Reaping the benefits of old refactors... I spent an ON week a while back rewriting the whole batch retry mechanism to be a lot more straightforward, and that did the heavy lifting for this feature :D

@mergify
Contributor

mergify bot commented Mar 30, 2023

This pull request now has conflicts. Could you fix them? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b batch-size-fix upstream/batch-size-fix
git merge upstream/main
git push upstream batch-size-fix

Contributor

@leehinman leehinman left a comment


Awesome, thank you.

@faec
Contributor Author

faec commented Mar 31, 2023

Looks like the failures are unrelated tests that couldn't download from github.com and nodejs.org -- presumably flakiness in the test environment.

@faec
Contributor Author

faec commented Mar 31, 2023

/test

@faec faec merged commit df59745 into elastic:main Mar 31, 2023
82 checks passed
@faec faec deleted the batch-size-fix branch March 31, 2023 14:46