Shipper output fails on large events/batches #34695

Closed
Tracked by #16
faec opened this issue Feb 28, 2023 · 4 comments · Fixed by elastic/elastic-agent-shipper#281 or #34911

faec (Contributor) commented Feb 28, 2023

When event batches targeting the shipper exceed the RPC limit (currently 4MB), the shipper output drops all the events with an error similar to the following:

{"log.level":"error","@timestamp":"2023-02-28T15:52:14.521Z","message":"failed to publish events: failed to publish the batch to the shipper, none of the 2048 events were accepted: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4692625 vs. 4194304)","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.origin":{"file.line":176,"file.name":"pipeline/client_worker.go"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"publisher_pipeline_output","ecs.version":"1.6.0"}

There are multiple short-term mitigations (increase the RPC size limit; decrease the shipper output's batch size), but both approaches can still drop data unpredictably when events are large. As long as an individual event's size is supported, we should handle this error by splitting up the batch rather than permanently dropping its contents.
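
As a rough illustration of that idea (a hedged sketch, not the shipper output's actual code; the "publish" callback and the event type parameter are stand-ins), the output could check the gRPC status code and retry the batch in halves:

	import (
		"context"

		"google.golang.org/grpc/codes"
		"google.golang.org/grpc/status"
	)

	// publishWithSplit sends a batch and, if the server rejects it for being too
	// large (ResourceExhausted), splits it in half and retries each part instead
	// of dropping or endlessly retrying the whole batch.
	func publishWithSplit[E any](ctx context.Context, events []E, publish func(context.Context, []E) error) error {
		err := publish(ctx, events)
		if err == nil || status.Code(err) != codes.ResourceExhausted || len(events) <= 1 {
			// Success, an unrelated error, or a single oversized event: report as-is.
			return err
		}
		mid := len(events) / 2
		if err := publishWithSplit(ctx, events[:mid], publish); err != nil {
			return err
		}
		return publishWithSplit(ctx, events[mid:], publish)
	}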

faec added the bug and Team:Elastic-Agent labels on Feb 28, 2023
elasticmachine (Collaborator) commented

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

faec (Contributor, Author) commented Feb 28, 2023

Small correction: with the current code, the output will actually retry the batch forever on error, stalling the whole pipeline. Dropping on error is what the code actually intended; that is a separate bug, which I've filed as #34700.

cmacknz (Member) commented Mar 1, 2023

As a short-term fix, we can increase the maximum message size on the shipper gRPC server: https://pkg.go.dev/google.golang.org/grpc#MaxMsgSize. Here is an example of configuring this in the agent:

		server = grpc.NewServer(
			grpc.Creds(creds),
			// Raise the server's receive limit; grpc-go's default is 4 MiB.
			grpc.MaxRecvMsgSize(m.grpcConfig.MaxMsgSize),
		)

We should default this to 100mb to match the default value of Elasticsearch's http.max_content_length setting.

http.max_content_length
(Static, byte value) Maximum size of an HTTP request body. Defaults to 100mb. Configuring this setting to greater than 100mb can cause cluster instability and is not recommended. If you hit this limit when sending a request to the Bulk API, configure your client to send fewer documents in each bulk request. If you wish to index individual documents that exceed 100mb, pre-process them into smaller documents before sending them to Elasticsearch. For instance, store the raw data in a system outside Elasticsearch and include a link to the raw data in the documents that Elasticsearch indexes.
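
For illustration, a minimal sketch of what that default could look like in the server setup above (the constant name here is hypothetical, not existing agent code):

	// 100 MiB, matching Elasticsearch's default http.max_content_length of 100mb.
	const defaultMaxRecvMsgSize = 100 * 1024 * 1024

	server = grpc.NewServer(
		grpc.Creds(creds),
		grpc.MaxRecvMsgSize(defaultMaxRecvMsgSize),
	)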

In the long term, we should automatically split up the batch, hopefully using the solution from #29778. We likely want to keep the gRPC max message size high enough that we rarely incur the overhead of splitting a batch.

faec (Contributor, Author) commented Mar 23, 2023

When I wrote "this PR mitigates but does not fix #34695" in elastic/elastic-agent-shipper#281, GitHub apparently took that to mean that the PR did fix the issue and closed it automatically. It is not fixed yet.

A real fix is impending, though 😜
