In the AWS S3 direct listing input we keep the state of the listed S3 objects in the registry, so that during a listing we can decide whether an S3 object has to be ingested or was already ingested and has to be skipped.
We have some logic to purge ingested S3 objects based on a "commit timestamp", so that the registry does not grow indefinitely.
We compare against the commit timestamp (and possibly purge) only S3 objects that are marked with state.Stored = true.
We overlooked that an S3 object could also be marked with state.Error = true.
As a result, S3 objects for which an error occurred during ingestion are never purged and could be ingested again (potentially not the whole file, but at least some of the events it contains).
We should add an AND condition with the state.Error mark, so that an object is excluded from purging only when it is neither stored nor errored.
> and could be ingested again (potentially not the whole file, but at least some of the events it contains)
I'm curious how the partial read works. Does it retry and resume from a byte offset stored in the registry? Or does it read the whole file and assume that Elasticsearch will deduplicate based on _id?
It's not handled at the moment, so it will read the whole file again, and Elasticsearch will indeed deduplicate based on _id, but that was not really intended. It is something I forgot to manage properly. We could probably store the failing offsets in the registry, do a small number of retries, and then just purge the file. I'm not sure how this aligns with at-least-once delivery: any suggestions?
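A rough sketch of the "store the failing offset, retry a few times, then purge" idea, purely as a discussion aid. Everything here is hypothetical: the `objectState` type, its `Offset` and `Retries` fields, and the `maxRetries` value do not exist in the input today.

```go
package main

import "fmt"

// maxRetries is a hypothetical retry budget per S3 object.
const maxRetries = 3

// objectState is a hypothetical registry entry that remembers
// where ingestion failed and how many times it was attempted.
type objectState struct {
	Key     string
	Offset  int64 // byte offset at which the last attempt failed
	Retries int
	Error   bool
}

// recordFailure stores the failing offset and reports whether the
// retry budget is exhausted, i.e. the entry should now be purged.
func (s *objectState) recordFailure(offset int64) (giveUp bool) {
	s.Error = true
	s.Offset = offset
	s.Retries++
	return s.Retries >= maxRetries
}

// resumeOffset returns where the next attempt should start reading.
func (s *objectState) resumeOffset() int64 {
	if s.Error {
		return s.Offset
	}
	return 0
}

// simulate replays three made-up failures for one object.
func simulate() (attempts int, lastOffset int64, purged bool) {
	s := &objectState{Key: "broken.log"}
	for _, failedAt := range []int64{100, 250, 250} {
		attempts++
		if s.recordFailure(failedAt) {
			return attempts, s.resumeOffset(), true
		}
	}
	return attempts, s.resumeOffset(), false
}

func main() {
	attempts, offset, purged := simulate()
	fmt.Println(attempts, offset, purged) // prints 3 250 true
}
```

Note that giving up after `maxRetries` means any events past the stored offset are never delivered, so this sketch does not by itself answer the at-least-once question raised above.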