
Shipper doesn't support infinite retry / robust failure handling #262

Closed · Tracked by #16
faec opened this issue Feb 28, 2023 · 4 comments · Fixed by #296
Labels: enhancement (New feature or request), Team:Elastic-Agent (Label for the Agent team)

Comments

faec (Contributor) commented Feb 28, 2023

#151 originally called for the Elasticsearch output to default to infinite retry, similar to Beats, but our Elasticsearch ingestion library go-elasticsearch doesn't support this feature. We need to add this feature to the library and/or develop a more robust error handling mechanism to report consistent failures.

Infinite retry is brittle: if we attempt it because we don't realize a particular error type is deterministically fatal, it can block the entire pipeline permanently, which can lead to data loss in many common configurations.

One possible approach is to implement infinite retry for an allow-list of explicit errors that we know are always retryable, while keeping bounded retry for other error types and instead adding better error reporting so permanent failures can be recognized and diagnosed instead of blocking the rest of the pipeline.
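
For illustration, a minimal Go sketch of the allow-list idea, assuming a hypothetical `isAlwaysRetryable` classifier and a caller-supplied `publish` callback rather than the shipper's or go-elasticsearch's actual APIs: retry allow-listed errors forever, bound everything else, and return the final error so it can be reported instead of blocking the pipeline.

```go
package retry

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTooManyRequests stands in for an error we know is always safe to retry
// (e.g. an HTTP 429 from Elasticsearch). The name and value are hypothetical.
var errTooManyRequests = errors.New("elasticsearch: 429 Too Many Requests")

// isAlwaysRetryable reports whether an error is on the allow-list of errors
// known to be transient and therefore safe to retry indefinitely.
func isAlwaysRetryable(err error) bool {
	return errors.Is(err, errTooManyRequests)
}

// publishWithRetry retries allow-listed errors forever and bounds retries for
// everything else, surfacing the last error to the caller.
func publishWithRetry(ctx context.Context, publish func() error, maxOtherRetries int, backoff time.Duration) error {
	otherRetries := 0
	for {
		err := publish()
		if err == nil {
			return nil
		}
		if !isAlwaysRetryable(err) {
			otherRetries++
			if otherRetries > maxOtherRetries {
				return fmt.Errorf("giving up after %d retries: %w", maxOtherRetries, err)
			}
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
	}
}
```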

faec added the enhancement and Team:Elastic-Agent labels on Feb 28, 2023
cmacknz (Member) commented Feb 28, 2023

The guiding principle is that we need to keep Filebeat's "at least once" delivery guarantee: https://www.elastic.co/guide/en/beats/filebeat/current/how-filebeat-works.html#at-least-once-delivery

> One possible approach is to implement infinite retry for an allow-list of explicit errors that we know are always retryable, while keeping bounded retry for other error types and instead adding better error reporting so permanent failures can be recognized and diagnosed instead of blocking the rest of the pipeline.

This feels like the right approach, given that the "at least once" delivery guarantee doesn't apply to data that could never be indexed.
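
As a concrete (hypothetical) illustration of that split, a per-item classifier for Elasticsearch bulk responses might look like the sketch below. Treating 429 and 5xx as retryable and mapping/parsing failures as permanent follows common practice; it is an assumption, not a description of what go-elasticsearch or the shipper does today.

```go
package retry

// bulkItemStatus mirrors the "status" and "error.type" fields reported for
// each item in an Elasticsearch bulk response.
type bulkItemStatus struct {
	HTTPStatus int
	ErrorType  string // e.g. "mapper_parsing_exception"
}

// isRetryableItem decides whether a failed bulk item is worth retrying.
func isRetryableItem(item bulkItemStatus) bool {
	switch {
	case item.HTTPStatus == 429: // queue full / rate limited: transient
		return true
	case item.HTTPStatus >= 500: // server-side problem: worth retrying
		return true
	case item.ErrorType == "mapper_parsing_exception",
		item.ErrorType == "illegal_argument_exception":
		// The document itself can never be indexed; retrying forever
		// would stall the pipeline without ever succeeding.
		return false
	default:
		// Unknown errors get bounded retries plus error reporting.
		return false
	}
}
```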

cmacknz (Member) commented Feb 28, 2023

> This feels like the right approach, given that the "at least once" delivery guarantee doesn't apply to data that could never be indexed.

Thinking about this a bit more, someone could consider stalling the pipeline when data cannot be indexed to be a feature, since it gives the user a chance to add a processor to drop or modify the problematic field and resume, rather than discarding the data and continuing.

This is definitely an extreme edge case though, and assumes that the input source isn't lossy like quickly rotating files or a UDP socket. This is probably better served with a dead letter queue as requested in #245.
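
For reference, the dead-letter idea from #245 amounts to wrapping an unindexable event together with its error and writing it to a separate index instead of blocking or dropping it. A rough sketch, with all names and field mappings hypothetical:

```go
package deadletter

import "time"

// deadLetterEvent is the document that would be written to a dead-letter
// index so a permanent failure is preserved and diagnosable instead of lost.
type deadLetterEvent struct {
	Timestamp   time.Time `json:"@timestamp"`
	Reason      string    `json:"error.message"`
	OriginalDoc string    `json:"event.original"` // the raw event that failed to index
	TargetIndex string    `json:"target_index"`   // the index it was originally destined for
}

// toDeadLetter builds the dead-letter document from the failed event and the
// indexing error returned by Elasticsearch.
func toDeadLetter(raw, target string, indexErr error) deadLetterEvent {
	return deadLetterEvent{
		Timestamp:   time.Now().UTC(),
		Reason:      indexErr.Error(),
		OriginalDoc: raw,
		TargetIndex: target,
	}
}
```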

fearful-symmetry (Contributor) commented

> an allow-list of explicit errors that we know are always retryable

That seems like a list that already has to exist somewhere, but I'm not sure where. A few minutes of googling didn't turn anything up.

I also agree that a dead letter queue/handler/whatever is probably the best fit for this, which isn't something we have in the shipper right now, and kind of feels like a whole project in its own right.

cmacknz (Member) commented Apr 3, 2023

> That seems like a list that already has to exist, somewhere, but not sure where. A few minutes of google didn't turn anything up.

We could make this list configurable, such that if we discover a new non-retryable error in a real deployment we can just update the configuration to move past it before adding it to the initially empty list of defaults.
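
A sketch of what such a configurable list could look like, with hypothetical config keys and simple substring matching purely for illustration:

```go
package retryconfig

import "strings"

// Config would be unpacked from the output configuration, e.g. (hypothetical keys):
//
//   retry:
//     non_retryable_errors:
//       - "mapper_parsing_exception"
type Config struct {
	NonRetryableErrors []string `config:"non_retryable_errors" yaml:"non_retryable_errors"`
}

// IsNonRetryable reports whether the error message matches any configured
// pattern, letting operators mark a newly discovered permanent failure
// without waiting for a code change.
func (c Config) IsNonRetryable(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	for _, pattern := range c.NonRetryableErrors {
		if strings.Contains(msg, pattern) {
			return true
		}
	}
	return false
}
```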

> I also agree that a dead letter queue/handler/whatever is probably the best fit for this, which isn't something we have in the shipper right now, and kind of feels like a whole project in its own right

Agreed, a dead letter index is a better solution and is out of scope for this issue.
