This repository has been archived by the owner on Sep 21, 2023. It is now read-only.
#151 originally called for the Elasticsearch output to default to infinite retry, similar to Beats, but our Elasticsearch ingestion library go-elasticsearch doesn't support this feature. We need to add this feature to the library and/or develop a more robust error handling mechanism to report consistent failures. Infinite retry is brittle: if we attempt it because we don't realize a particular error type is deterministically fatal, it can block the entire pipeline permanently, which can lead to data loss in many common configurations. One possible approach is to implement infinite retry for an allow-list of explicit errors that we know are always retryable, while keeping bounded retry for other error types and instead adding better error reporting so permanent failures can be recognized and diagnosed instead of blocking the rest of the pipeline.
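The allow-list approach above can be sketched in Go. This is only an illustration of the proposed policy, not the go-elasticsearch API: the `statusError` type, `attemptsFor` function, and the choice of 429/503 as always-retryable statuses are all assumptions for the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// retryForever is a hypothetical allow-list of HTTP statuses we believe are
// always safe to retry: 429 signals backpressure, 503 a transient outage.
var retryForever = map[int]bool{
	http.StatusTooManyRequests:    true, // 429
	http.StatusServiceUnavailable: true, // 503
}

// statusError is a stand-in for a failed indexing request carrying the
// HTTP status Elasticsearch returned.
type statusError struct {
	status int
}

func (e *statusError) Error() string {
	return fmt.Sprintf("elasticsearch returned status %d", e.status)
}

// attemptsFor decides the retry budget for a failure: unbounded (-1) for
// allow-listed statuses, a small bound for everything else so a
// deterministically fatal error (e.g. a 400 mapping conflict) can be
// reported instead of blocking the pipeline forever.
func attemptsFor(err error, boundedMax int) int {
	var se *statusError
	if errors.As(err, &se) && retryForever[se.status] {
		return -1 // retry indefinitely
	}
	return boundedMax
}

func main() {
	fmt.Println(attemptsFor(&statusError{status: 429}, 3)) // -1: retry forever
	fmt.Println(attemptsFor(&statusError{status: 400}, 3)) // 3: bounded, then report
}
```

The key design point is that unknown errors default to the bounded path; an error only earns infinite retry once it is explicitly known to be transient.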
One possible approach is to implement infinite retry for an allow-list of explicit errors that we know are always retryable, while keeping bounded retry for other error types and instead adding better error reporting so permanent failures can be recognized and diagnosed instead of blocking the rest of the pipeline.
This feels like the right approach, given that the "at least once" delivery guarantee doesn't apply to data that could never be indexed.
Thinking about this a bit more, one could consider stalling the pipeline when data cannot be indexed to be a feature: it gives the user a chance to add a processor that drops or modifies the problematic field and then resume, rather than discarding the data and continuing.
This is definitely an extreme edge case though, and it assumes the input source isn't a lossy one like quickly rotating files or a UDP socket. This is probably better served by a dead letter queue, as requested in #245.
an allow-list of explicit errors that we know are always retryable
That seems like a list that already has to exist, somewhere, but not sure where. A few minutes of google didn't turn anything up.
I also agree that a dead letter queue/handler/whatever is probably the best fit for this, which isn't something we have in the shipper right now, and kind of feels like a whole project in its own right.
That seems like a list that already has to exist, somewhere, but not sure where. A few minutes of google didn't turn anything up.
We could make this list configurable, such that if we discover a new always-retryable error in a real deployment we can just update the configuration to handle it, before adding it to the initially empty list of defaults.
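A minimal sketch of that configurable list, assuming the configuration holds the always-retryable statuses and the shipped defaults start empty (the `retryConfig` type and field name are hypothetical, not an existing shipper config option):

```go
package main

import "fmt"

// retryConfig is a hypothetical config shape: the built-in default list is
// empty, and operators extend it in configuration when they encounter a new
// always-retryable error in a real deployment.
type retryConfig struct {
	// e.g. from YAML: retry_forever_statuses: [429, 503]
	RetryForeverStatuses []int
}

// buildAllowList merges the built-in defaults (initially none) with the
// operator-supplied statuses into a single lookup set.
func buildAllowList(defaults []int, cfg retryConfig) map[int]bool {
	allow := make(map[int]bool, len(defaults)+len(cfg.RetryForeverStatuses))
	for _, s := range defaults {
		allow[s] = true
	}
	for _, s := range cfg.RetryForeverStatuses {
		allow[s] = true
	}
	return allow
}

func main() {
	// Defaults start empty; a deployment adds 429 after observing it in practice.
	allow := buildAllowList(nil, retryConfig{RetryForeverStatuses: []int{429}})
	fmt.Println(allow[429], allow[400]) // true false
}
```

Once a configured status has proven itself across deployments, it can graduate into the `defaults` slice in a later release without any config change.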
I also agree that a dead letter queue/handler/whatever is probably the best fit for this, which isn't something we have in the shipper right now, and kind of feels like a whole project in its own right
Agreed, a dead letter index is a better solution and is out of scope for this issue.