[Filebeat] Duplicated data when using filestream input #31239

Closed
belimawr opened this issue Apr 11, 2022 · 11 comments
Labels
bug, Filebeat, Team:Elastic-Agent-Data-Plane

Comments

@belimawr
Contributor

belimawr commented Apr 11, 2022

When more than one filestream input has the same ID (or has no ID at all), data is duplicated whenever Filebeat is restarted or a new input with a duplicated ID is started or created.

The root cause is related to the clean up of old entries in the registry.

Possible ways this bug manifests

  • Kubernetes integration (under Elastic-Agent/Fleet), when new pods are added old files are read from the beginning.
  • Kubernetes autodiscover, when new pods are added old files are read from the beginning. (confirmed by #24208 (comment))
  • Filestream inputs without IDs on a standalone Filebeat, when Filebeat restarts it reads the files from the beginning.
  • Filestream inputs with duplicated IDs on a standalone Filebeat, when Filebeat restarts it reads the files from the beginning.

How to detect the issue

The latest versions of Filebeat (v7.17.2 and v8.1.2) log an error when this situation is detected:

filestream input with ID 'xxxxxx' already exists, this will lead to data duplication, please use a different ID

For multiple filestream inputs without IDs the message looks like:

filestream input ID without ID might lead to data  duplication, please add an ID and restart Filebeat

Workaround

Assign a unique ID to every filestream input on standalone Filebeat and standalone Elastic-Agent.
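
As an illustration of this workaround, a minimal sketch of a standalone filebeat.yml with two filestream inputs might look like the following. The IDs and paths are hypothetical placeholders, not values taken from this issue; the only point is that each input carries its own `id` and no two inputs share one.

```yaml
filebeat.inputs:
  # First filestream input: has its own unique ID.
  - type: filestream
    id: app-logs                      # hypothetical ID, pick any unique string
    paths:
      - /var/log/app/*.log            # hypothetical path

  # Second filestream input: a different ID, never reused from the input above.
  - type: filestream
    id: nginx-access-logs             # hypothetical ID, must differ from "app-logs"
    paths:
      - /var/log/nginx/access.log     # hypothetical path
```

Once an ID has been chosen it is best kept stable: the registry state of a filestream input is tied to its ID, so renaming the ID of an existing input is likely to cause its files to be read from the beginning again.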

Related issues

@botelastic (bot) added the needs_team label on Apr 11, 2022
@belimawr added the Team:Elastic-Agent-Data-Plane label on Apr 11, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic (bot) removed the needs_team label on Apr 11, 2022
@bczifra
Member

bczifra commented Apr 27, 2022

@belimawr based on my reading of elastic/kibana#129851 (comment) it appears that

Set clean_removed: false to every filestream input that does not have an ID or share the same ID

is not a recommended workaround. Is my understanding correct?

@belimawr
Contributor Author

@belimawr based on my reading of elastic/kibana#129851 (comment) it appears that

Set clean_removed: false to every filestream input that does not have an ID or share the same ID

is not a recommended workaround. Is my understanding correct?

Indeed. Thanks for catching this. I'll update the issue.

@jlind23
Collaborator

jlind23 commented May 11, 2022

@belimawr closing this one in favour of #31512

@jlind23 jlind23 closed this as completed May 11, 2022
@belimawr
Contributor Author

@belimawr closing this one in favour of #31512

I'd have kept it open as a "meta issue", but it's ok. Having multiple issues open about the same bug can be confusing.

@irislanderos

irislanderos commented Aug 10, 2022

Hi Team,

Cloud AR Operations here. We are working to give credits back to this customer (Org-2878897275, Ubrich) for the increase in usage due to this technical issue. I was provided with a 1:4 to 1:5 ratio of impact.

  1. Is this the closest estimate we have of the impact? Does it affect capacity costs as well as DTS, or just one of the two?
  2. I am aware that the cluster is still bloated. Do we know by roughly how much, so I can calculate the impact of that?

So far I have calculations based on the 1:4 and 1:5 ratio estimates, as well as an estimate of credits based on their July 2022 usage.
The results of the ratio calculations and the July 2022 usage vary greatly so I'm stuck on how to proceed with the credit.
Any direction to get to the right place would be appreciated!

Working Calc Doc:
https://docs.google.com/spreadsheets/d/1CFL3lOjUv5AtXalZq6dQ_92AB9Y7T0x_y3s0C5dGyV4/edit#gid=1575888295

Related case: 00929762

@belimawr @bczifra @jlind23

@irislanderos irislanderos reopened this Aug 10, 2022
@jlind23
Collaborator

jlind23 commented Aug 11, 2022

@irislanderos it is almost impossible to calculate the right ratio. As soon as you have an input without an ID specified, data will be duplicated. If you have X inputs, data may be duplicated X times. Thus having an accurate ratio is clearly impossible. @belimawr thoughts?

@belimawr
Contributor Author

@irislanderos it is almost impossible to calculate the right ratio. As soon as you have an input without an ID specified, data will be duplicated. If you have X inputs, data may be duplicated X times. Thus having an accurate ratio is clearly impossible. @belimawr thoughts?

I agree with @jlind23. It's not possible to have an exact estimation of the data duplication ratio. One thing that plays a role is the number of times Filebeat is restarted. Most of the steps that lead to this bug happen synchronously and the time when they happen will impact the amount of data duplicated.

@irislanderos

Thank you both for the insight. We'll find another way to resolve this. @jlind23 @belimawr

@spalan479

spalan479 commented Sep 19, 2022

Hi All,

Duplicate data is sent from Filebeat.
When I start the Elastic Agent, I can see two Filebeat processes running. If I kill the first one, it restarts automatically; the second one disappears only after I kill it twice. Once the second process is gone, no more duplicate data comes in.
Am I missing any configuration here?

Filebeat.yml

# ============================== Filebeat inputs ===============================

filebeat.inputs:

- type: filestream
  enabled: true
  id: "ecs-app-logging"
  paths:
    - C:\MVD2\Logs\ECS*.log
  json.keys_under_root: true
  json.overwrite_keys: true
  json.add_error_key: true
  json.expand_keys: true

@cmacknz
Member

cmacknz commented Sep 20, 2022

The agent will use a second Filebeat process to ingest logs when log monitoring is enabled. Having two Filebeat processes running can be normal and does not necessarily cause data duplication. The processes started by the agent are controlled through the agent policy, and not the Filebeat configuration file.

Please start a thread on https://discuss.elastic.co/tag/elastic-agent and someone will help you determine if this is a configuration issue or a bug in the agent. If it is a bug we will open a new issue to track it here.
