Filestream input duplicating events after every restart #30061

Closed
eedugon opened this issue Jan 27, 2022 · 18 comments · Fixed by #30717
Labels: bug, Team:Elastic-Agent-Data-Plane, v8.2.0

Comments

@eedugon
Contributor

eedugon commented Jan 27, 2022

Under some circumstances the filestream input reprocesses all events after every restart.

For example the following configuration works fine in a Filebeat running on Kubernetes (static input, no autodiscover):

    - type: filestream
      enabled: true
      paths:
       - /var/log/k8sapps/myapp/*.log
      fields:
       app.name: "myapp"
      fields_under_root: true

But if we add a second input (reading from the same disk), Filebeat resends everything after every restart:

    - type: filestream
      enabled: true
      paths:
       - /var/log/k8sapps/myapp/*.log
      fields:
       app.name: "myapp"
      fields_under_root: true

    - type: filestream
      enabled: true
      paths:
       - /var/log/k8sapps/secondapp/*.log
      fields:
       app.name: "secondapp"
      fields_under_root: true

I've tried setting file_identity.inode_marker.path: /var/log/.filebeat-marker, but the result is the same; with a single input everything works as expected.
The inodes of the files do not change across restarts. I don't know the volume UUID because lsblk does not report it (checked from the Filebeat container itself).

Doc reference: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-filestream.html#filestream-file-identity
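
To confirm that file identity really is stable across restarts, the device ID and inode of each log file can be recorded before and after restarting Filebeat. The following is a minimal sketch, assuming the default file identity is derived from the device ID and inode as the linked documentation describes; the helper names `file_identity` and `snapshot` are hypothetical and not part of Filebeat.

```python
import glob
import os


def file_identity(path):
    """Return the (device ID, inode) pair reported by the OS for a path."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def snapshot(paths):
    """Map each path to its current (device ID, inode) pair."""
    return {p: file_identity(p) for p in paths}


if __name__ == "__main__":
    # Glob pattern taken from the configuration in this issue.
    for path, ident in snapshot(glob.glob("/var/log/k8sapps/myapp/*.log")).items():
        print(path, ident)
```

Running it once before and once after a restart and comparing the two outputs shows whether any file changed identity in between.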

@eedugon added the Filebeat label Jan 27, 2022
@botelastic (bot) added the needs_team label Jan 27, 2022
@eedugon added the bug label and removed the needs_team label Jan 27, 2022
@botelastic (bot) added the needs_team label Jan 27, 2022
@jlind23 added the Team:Elastic-Agent-Data-Plane label Jan 27, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic (bot) removed the needs_team label Jan 27, 2022
@jlind23
Collaborator

jlind23 commented Feb 2, 2022

@kvch @belimawr it seems that this one is pretty urgent and ugly.
As you have the best knowledge of filestream, can one of you pick it up right away?
cc @faec as you brought it up

@belimawr self-assigned this Feb 2, 2022
@belimawr
Contributor

belimawr commented Feb 2, 2022

Yes, I'll jump on it right now.

@kvch
Contributor

kvch commented Feb 2, 2022

There is a workaround for this issue. You have to configure an ID for each input, so the input can find the appropriate state in the registry.

    - type: filestream
      enabled: true
      id: id1
      paths:
       - /var/log/k8sapps/myapp/*.log
      fields:
       app.name: "myapp"
      fields_under_root: true

    - type: filestream
      enabled: true
      id: id2
      paths:
       - /var/log/k8sapps/secondapp/*.log
      fields:
       app.name: "secondapp"
      fields_under_root: true
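
A small pre-flight check can catch the problematic configuration before Filebeat starts. This is only a sketch: it assumes the `filebeat.inputs` list has already been parsed into Python dictionaries (for example by a YAML library), and `check_filestream_ids` is a hypothetical helper, not a Filebeat feature. It flags the two situations in which inputs would end up sharing an ID: more than one filestream input without an `id`, and the same `id` used twice.

```python
from collections import Counter


def check_filestream_ids(inputs):
    """Return a list of problems found in a parsed `filebeat.inputs` list."""
    filestreams = [i for i in inputs if i.get("type") == "filestream"]
    ids = [i.get("id") for i in filestreams]
    problems = []

    # Inputs without an explicit id all fall back to the same default ID.
    missing = sum(1 for x in ids if not x)
    if missing > 1:
        problems.append(f"{missing} filestream inputs have no id and would share one")

    # Explicit ids must not be reused either.
    for dup, count in Counter(x for x in ids if x).items():
        if count > 1:
            problems.append(f"id {dup!r} is used by {count} filestream inputs")
    return problems


if __name__ == "__main__":
    inputs = [
        {"type": "filestream", "paths": ["/var/log/k8sapps/myapp/*.log"]},
        {"type": "filestream", "paths": ["/var/log/k8sapps/secondapp/*.log"]},
    ]
    for problem in check_filestream_ids(inputs):
        print("config problem:", problem)
```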

@cmacknz
Member

cmacknz commented Feb 2, 2022

> There is a workaround for this issue. You have to configure an ID for each input, so the input can find the appropriate state in the registry.

Should we require that each input have an ID if it doesn't work without it?

@kvch
Contributor

kvch commented Feb 3, 2022

I do not have a strong opinion on this. I am afraid if we just go with requiring the ID we will never fix this bug.

@cmacknz
Member

cmacknz commented Feb 3, 2022

> I do not have a strong opinion on this. I am afraid if we just go with requiring the ID we will never fix this bug.

Ah, if the root cause of the bug is unrelated to the missing IDs then we should investigate that first before applying a permanent work around.

@rroemer

rroemer commented Feb 8, 2022

Any news on this issue? This is a major problem when using elastic-agent as we are receiving a ton of duplicate events every time we update a policy that triggers a restart of the integrated filebeat.

@jlind23
Collaborator

jlind23 commented Feb 8, 2022

@rroemer Didn't the workaround in #30061 (comment) do the trick?

@kvch
Contributor

kvch commented Feb 9, 2022

@jlind23 Unfortunately, I checked a few days ago and Agent does not pass the configured ID down to the filestream input.

@jlind23
Collaborator

jlind23 commented Feb 9, 2022

@eedugon @rroemer I just created a public issue here: #30300

@ruflin
Member

ruflin commented Feb 23, 2022

I would like to better understand why we require an id. I assume this is related to the way the filestream input writes its state. With the logfile input we had inode + device id as the identifier. I remember we also added filename as an option. What is used for filestream? How does the state of a single file map to an input id if the input specifies a file pattern?

@belimawr
Contributor

belimawr commented Mar 1, 2022

> I would like to better understand why we require an id. I assume this is related to the way the filestream input writes its state. With the logfile input we had inode + device id as the identifier. I remember we also added filename as an option. What is used for filestream? How does the state of a single file map to an input id if the input specifies a file pattern?

The TL;DR is:

  • At startup, filestream removes from the registry all files that no longer belong to the input.
  • A file in the registry is considered to no longer belong to the input if it:
    • has the same input ID as the input (for inputs without an ID in the configuration, the constant .global is used), and
    • has a file path that does not match the current input configuration.
  • Such a file is removed from the registry (in memory).

Because all inputs without a configured ID share the same ID (.global), each input removes the other inputs' files from the registry, so every file stored under the .global input ID ends up removed.

The effect of "remove from registry" is to reset the offset to 0, so the files are re-read from the beginning.

In this comment I added some details of what is happening; let me know if it's not clear enough.
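
The cleanup rule described above can be reproduced with a toy model. The sketch below is not Filebeat's code: the registry layout (a dictionary keyed by input ID and path), the function names, and the use of `fnmatch` for path matching are all assumptions made for illustration. It only shows why two inputs that share the `.global` ID wipe each other's state, while inputs with distinct IDs do not.

```python
import fnmatch

GLOBAL_ID = ".global"  # ID used when an input has no `id` configured


def cleanup(registry, input_id, patterns):
    """Drop entries that carry this input's ID but do not match its paths."""
    kept = {}
    for (owner, path), offset in registry.items():
        same_id = owner == input_id
        matches = any(fnmatch.fnmatch(path, p) for p in patterns)
        if same_id and not matches:
            continue  # removed: the file will be re-read from offset 0
        kept[(owner, path)] = offset
    return kept


def start_inputs(registry, inputs):
    """Run the startup cleanup of every configured input in turn."""
    for cfg in inputs:
        registry = cleanup(registry, cfg.get("id", GLOBAL_ID), cfg["paths"])
    return registry


if __name__ == "__main__":
    paths_a = ["/var/log/k8sapps/myapp/*.log"]
    paths_b = ["/var/log/k8sapps/secondapp/*.log"]
    file_a = "/var/log/k8sapps/myapp/a.log"
    file_b = "/var/log/k8sapps/secondapp/b.log"

    # No IDs: both files are stored under .global and each input evicts the other's file.
    shared = {(GLOBAL_ID, file_a): 120, (GLOBAL_ID, file_b): 300}
    print(start_inputs(shared, [{"paths": paths_a}, {"paths": paths_b}]))

    # Distinct IDs: each input only inspects its own entries, so offsets survive.
    separate = {("id1", file_a): 120, ("id2", file_b): 300}
    print(start_inputs(separate, [{"id": "id1", "paths": paths_a},
                                  {"id": "id2", "paths": paths_b}]))
```

In this model a single input without an ID keeps its state, which matches the observation in the original report that one input works fine.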

@dschweinbenz

dschweinbenz commented Jun 23, 2022

I still have the same problem using just one filestream. Even if I set the id, I get duplicates on each restart.
I am using Docker with a volume to persist /usr/share/filebeat/data/, and have this config:

    filebeat.inputs:
      - type: filestream
        id: watcher
        enabled: true
        encoding: utf-8
        ignore_older: 0
        paths:
          - "/watch/output/XY.csv"
@cmacknz
Member

cmacknz commented Jun 23, 2022

@dschweinbenz can you start a thread on https://discuss.elastic.co/c/elastic-stack/beats/ about this? We'd need your filebeat version, full configuration file, and logs if you can provide them. We'll open a new issue if we confirm it is a new bug.

@dschweinbenz

Setting clean_removed: false seems to help in our case.

In our setup the log file is overwritten with each new data set, and Filebeat seems to have trouble recognizing this. Our apps write files to a different location, and afterwards the file is moved to overwrite the file that Filebeat scans. With type: log everything worked great, but with filestream Filebeat produces duplicates on each restart. The registry folder is mounted as a volume so that it is persistent.
As clean_removed: false seems to work, it looks like Filebeat gets confused after an overwrite has happened. We are using filebeat-oss:7.16.3 as the Docker image.
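
One detail of this move-to-overwrite pattern is easy to check outside Filebeat: moving a freshly written file over an existing path leaves that path pointing at a different inode. The sketch below only demonstrates that operating system behaviour; whether it explains the duplicates reported here is not confirmed anywhere in this thread. The helper name `overwrite_by_move` and the `.tmp` suffix are made up for the example.

```python
import os
import tempfile


def overwrite_by_move(target, data):
    """Write data to a sibling temp file, then move it over target."""
    tmp = target + ".tmp"
    with open(tmp, "w") as f:
        f.write(data)
    os.replace(tmp, target)  # the path now refers to the new file


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "XY.csv")
        with open(target, "w") as f:
            f.write("first data set\n")
        before = os.stat(target).st_ino

        overwrite_by_move(target, "second data set\n")
        after = os.stat(target).st_ino

        print("inode before:", before)
        print("inode after: ", after)
```

Any tool that identifies a file by its inode will treat the path as a brand-new file after such a move.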

@ispringer

I just migrated from log inputs to filestream inputs, and spent half a day trying to figure out why all my logs were resent every time I restarted filebeat. I don't think logging a warning when a filestream has no ID is sufficient. If the service behaves wildly incorrectly when an ID is not specified, then an ID should be required. That means when an ID is not present, refuse to initialize the filestream.

@cmacknz
Member

cmacknz commented Apr 27, 2023

> I don't think logging a warning when a filestream has no ID is sufficient. If the service behaves wildly incorrectly when an ID is not specified, then an ID should be required. That means when an ID is not present, refuse to initialize the filestream.

Generally we agree with this, but it would be a breaking change that could result in data loss (filestream would not be running) instead of data duplication. We haven't made this change for that reason; the trade-off might be acceptable to you, but not to all users.

We definitely regret getting ourselves into this situation, and are still thinking about ways to address it in a better way.
