
Shipper doesn't support infinite retry / robust failure handling #262

Closed · Tracked by #16
faec opened this issue Feb 28, 2023 · 4 comments · Fixed by #296
Labels: enhancement (New feature or request), Team:Elastic-Agent (Label for the Agent team)

Comments

faec (Contributor) commented Feb 28, 2023

#151 originally called for the Elasticsearch output to default to infinite retry, similar to Beats, but our Elasticsearch ingestion library go-elasticsearch doesn't support this feature. We need to add this feature to the library and/or develop a more robust error handling mechanism to report consistent failures.

Infinite retry is brittle: if we attempt it because we don't realize a particular error type is deterministically fatal, it can block the entire pipeline permanently, which can lead to data loss in many common configurations.

One possible approach is to implement infinite retry for an allow-list of explicit errors that we know are always retryable, while keeping bounded retry for other error types and instead adding better error reporting so permanent failures can be recognized and diagnosed instead of blocking the rest of the pipeline.
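
For illustration, a minimal Go sketch of the allow-list idea, assuming a hypothetical `isAlwaysRetryable` classifier and a caller-supplied `publish` callback rather than the shipper's or go-elasticsearch's actual APIs: retry allow-listed errors forever, bound everything else, and return the final error so it can be reported instead of blocking the pipeline.

```go
package retry

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTooManyRequests stands in for an error we know is always safe to retry
// (e.g. an HTTP 429 from Elasticsearch). The name and value are hypothetical.
var errTooManyRequests = errors.New("elasticsearch: 429 Too Many Requests")

// isAlwaysRetryable reports whether an error is on the allow-list of errors
// known to be transient and therefore safe to retry indefinitely.
func isAlwaysRetryable(err error) bool {
	return errors.Is(err, errTooManyRequests)
}

// publishWithRetry retries allow-listed errors forever and bounds retries for
// everything else, surfacing the last error to the caller.
func publishWithRetry(ctx context.Context, publish func() error, maxOtherRetries int, backoff time.Duration) error {
	otherRetries := 0
	for {
		err := publish()
		if err == nil {
			return nil
		}
		if !isAlwaysRetryable(err) {
			otherRetries++
			if otherRetries > maxOtherRetries {
				return fmt.Errorf("giving up after %d retries: %w", maxOtherRetries, err)
			}
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
	}
}
```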

faec added the enhancement and Team:Elastic-Agent labels on Feb 28, 2023
cmacknz (Member) commented Feb 28, 2023

The guiding principle is that we need to keep Filebeat's "at least once" delivery guarantee: https://www.elastic.co/guide/en/beats/filebeat/current/how-filebeat-works.html#at-least-once-delivery

> One possible approach is to implement infinite retry for an allow-list of explicit errors that we know are always retryable, while keeping bounded retry for other error types and instead adding better error reporting so permanent failures can be recognized and diagnosed instead of blocking the rest of the pipeline.

This feels like the right approach, given that the "at least once" delivery guarantee doesn't apply to data that could never be indexed.
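
As a concrete (hypothetical) illustration of that split, a per-item classifier for Elasticsearch bulk responses might look like the sketch below. Treating 429 and 5xx as retryable and mapping/parsing failures as permanent follows common practice; it is an assumption, not a description of what go-elasticsearch or the shipper does today.

```go
package retry

// bulkItemStatus mirrors the "status" and "error.type" fields reported for
// each item in an Elasticsearch bulk response.
type bulkItemStatus struct {
	HTTPStatus int
	ErrorType  string // e.g. "mapper_parsing_exception"
}

// isRetryableItem decides whether a failed bulk item is worth retrying.
func isRetryableItem(item bulkItemStatus) bool {
	switch {
	case item.HTTPStatus == 429: // queue full / rate limited: transient
		return true
	case item.HTTPStatus >= 500: // server-side problem: worth retrying
		return true
	case item.ErrorType == "mapper_parsing_exception",
		item.ErrorType == "illegal_argument_exception":
		// The document itself can never be indexed; retrying forever
		// would stall the pipeline without ever succeeding.
		return false
	default:
		// Unknown errors get bounded retries plus error reporting.
		return false
	}
}
```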

cmacknz (Member) commented Feb 28, 2023

> This feels like the right approach, given that the "at least once" delivery guarantee doesn't apply to data that could never be indexed.

Thinking about this a bit more, someone could consider stalling the pipeline when data cannot be indexed to be a feature, since it gives the user a chance to add a processor to drop or modify the problematic field and resume, rather than discarding the data and continuing.

This is definitely an extreme edge case though, and assumes that the input source isn't lossy like quickly rotating files or a UDP socket. This is probably better served with a dead letter queue as requested in #245.
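
For reference, the dead-letter idea from #245 amounts to wrapping an unindexable event together with its error and writing it to a separate index instead of blocking or dropping it. A rough sketch, with all names and field mappings hypothetical:

```go
package deadletter

import "time"

// deadLetterEvent is the document that would be written to a dead-letter
// index so a permanent failure is preserved and diagnosable instead of lost.
type deadLetterEvent struct {
	Timestamp   time.Time `json:"@timestamp"`
	Reason      string    `json:"error.message"`
	OriginalDoc string    `json:"event.original"` // the raw event that failed to index
	TargetIndex string    `json:"target_index"`   // the index it was originally destined for
}

// toDeadLetter builds the dead-letter document from the failed event and the
// indexing error returned by Elasticsearch.
func toDeadLetter(raw, target string, indexErr error) deadLetterEvent {
	return deadLetterEvent{
		Timestamp:   time.Now().UTC(),
		Reason:      indexErr.Error(),
		OriginalDoc: raw,
		TargetIndex: target,
	}
}
```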

fearful-symmetry (Contributor) commented

> an allow-list of explicit errors that we know are always retryable

That seems like a list that already has to exist somewhere, but I'm not sure where. A few minutes of googling didn't turn anything up.

I also agree that a dead letter queue/handler/whatever is probably the best fit for this, which isn't something we have in the shipper right now, and kind of feels like a whole project in its own right.

cmacknz (Member) commented Apr 3, 2023

> That seems like a list that already has to exist, somewhere, but not sure where. A few minutes of google didn't turn anything up.

We could make this list configurable, such that if we discover a new non-retryable error in a real deployment we can just update the configuration to move past it before adding it to the initially empty list of defaults.
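
A sketch of what such a configurable list could look like, with hypothetical config keys and simple substring matching purely for illustration:

```go
package retryconfig

import "strings"

// Config would be unpacked from the output configuration, e.g. (hypothetical keys):
//
//   retry:
//     non_retryable_errors:
//       - "mapper_parsing_exception"
type Config struct {
	NonRetryableErrors []string `config:"non_retryable_errors" yaml:"non_retryable_errors"`
}

// IsNonRetryable reports whether the error message matches any configured
// pattern, letting operators mark a newly discovered permanent failure
// without waiting for a code change.
func (c Config) IsNonRetryable(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	for _, pattern := range c.NonRetryableErrors {
		if strings.Contains(msg, pattern) {
			return true
		}
	}
	return false
}
```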

> I also agree that a dead letter queue/handler/whatever is probably the best fit for this, which isn't something we have in the shipper right now, and kind of feels like a whole project in its own right

Agreed, a dead letter index is a better solution and is out of scope for this issue.
