
[Filebeat] AWS S3 direct listing input: states registry purged only on "stored" and not on "error" #33513

Closed · opened by aspacca on Nov 1, 2022 · 2 comments · Fixed by #33722
Assignees: aspacca
Labels: bug, Team:Cloud-Monitoring (Label for the Cloud Monitoring team)

Comments

aspacca (Contributor) commented on Nov 1, 2022

In the AWS S3 direct listing input we keep the state of the listed S3 objects in the registry, in order to decide whether an S3 object seen in the current listing has to be ingested or was already ingested and can be skipped.

We have some logic to purge ingested S3 objects based on a "commit timestamp", so that the registry does not grow indefinitely.

We apply the commit timestamp comparison (and eventually purge the entries) only to S3 objects that are marked with state.Stored = true.

We overlooked the fact that an S3 object could instead be marked with state.Error = true.

This means that S3 objects where an error occurred during ingestion won't be purged from the registry and could be ingested again (potentially not the whole file, but at least some of the events it contains).

We should extend the purge condition so that entries marked with state.Error = true are also eligible, i.e. purge when state.Stored OR state.Error is set and the commit timestamp check passes; see the sketch below.
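A minimal sketch of the condition in Go (simplified and paraphrased; the real logic lives in the input's states handling under x-pack/filebeat/input/awss3, and the type and field names here are stand-ins, not the actual Filebeat code):

```go
package main

import (
	"fmt"
	"time"
)

// state is a simplified stand-in for the per-object entry that the AWS S3
// direct listing input keeps in the registry.
type state struct {
	Bucket string
	Key    string
	Stored bool // ingestion completed successfully
	Error  bool // ingestion failed
}

// isPurgeable reports whether a registry entry can be removed, given the
// "commit timestamp" of the listing cycle and the entry's own timestamp.
//
// Buggy version (only successfully stored objects were considered):
//
//	return s.Stored && lastModified.Before(commitTS)
//
// Fixed version: entries that ended in an error are purged as well,
// otherwise they linger in the registry indefinitely.
func isPurgeable(s state, lastModified, commitTS time.Time) bool {
	return (s.Stored || s.Error) && lastModified.Before(commitTS)
}

func main() {
	commitTS := time.Now()
	old := commitTS.Add(-time.Hour)

	failed := state{Bucket: "my-bucket", Key: "logs/app.ndjson", Error: true}
	fmt.Println(isPurgeable(failed, old, commitTS)) // true with the fix, false before it
}
```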

aspacca added the bug label and self-assigned this on Nov 1, 2022
aspacca added the Team:Cloud-Monitoring label on Nov 1, 2022
andrewkroh (Member) commented:

> and could be ingested again (potentially not the whole file, but at least some of the events it contains)

I'm curious how the partial read works. Does it retry and resume from a byte offset stored in the registry? Or does it read the whole file and assume that Elasticsearch will deduplicate based on _id?

aspacca (Contributor, Author) commented on Nov 2, 2022

> I'm curious how the partial read works. Does it retry and resume from a byte offset stored in the registry? Or does it read the whole file and assume that Elasticsearch will deduplicate based on _id?

It's not handled at the moment, so the input will read the whole file again, and Elasticsearch will indeed deduplicate based on _id, but that was not properly intended: it's something I forgot to manage. We could probably store the failing offset in the registry, do a minimal number of retries, and then just purge the file. I'm not sure how this aligns with at-least-once delivery: any suggestions? A sketch of the idea follows.
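A hypothetical sketch of that idea in Go (none of these fields or names exist in Filebeat today; objectState, FailedOffset, Retries, and maxRetries are invented for illustration):

```go
package main

import "fmt"

// objectState is a hypothetical extension of the registry entry; the
// FailedOffset and Retries fields do not exist in Filebeat today.
type objectState struct {
	Bucket       string
	Key          string
	Stored       bool
	Error        bool
	FailedOffset int64 // byte offset reached before the last failure
	Retries      int   // retries attempted so far
}

// maxRetries is an arbitrary retry budget for this sketch.
const maxRetries = 3

// nextAction decides what to do with an object on the next listing cycle.
func nextAction(s objectState) string {
	switch {
	case s.Stored:
		return "skip" // already fully ingested
	case s.Error && s.Retries >= maxRetries:
		return "purge" // retry budget exhausted; drop the registry entry
	case s.Error:
		return "resume" // re-read starting at s.FailedOffset
	default:
		return "ingest" // new object, read from the beginning
	}
}

func main() {
	s := objectState{Key: "logs/app.ndjson", Error: true, FailedOffset: 4096, Retries: 1}
	fmt.Println(nextAction(s)) // "resume"
}
```

Note that resuming from FailedOffset only preserves at-least-once delivery if the offset is advanced strictly after the corresponding events have been acknowledged by the output; otherwise events between the acknowledged point and the recorded offset could be lost.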
