add container ID as part of Kubernetes Container Logs filestream input #3672
Conversation
This is part of the fix for data duplication that happens when there are multiple filestream inputs configured with the same ID. The container ID and pod name are added to the filestream input ID: the container ID ensures uniqueness and the pod name helps with debugging. Issue: elastic/beats#31512
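As a rough sketch (the variable names are illustrative; the actual template in the package may differ), the rendered stream configuration after this change would look something like:

```yaml
- data_stream:
    dataset: kubernetes.container_logs
    type: logs
  # Pod name for debuggability, container ID for uniqueness (illustrative values).
  id: kubernetes-container-logs-${kubernetes.pod.name}-${kubernetes.container.id}
  paths:
    - /var/log/containers/*${kubernetes.container.id}.log
```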
To my mind, with this we only partially fix the problem: we solve the case that surfaced the problem, not the problem itself. Will we need to adjust every single integration like this in the future if it hits this issue? I think we would like to have the confidence that we won't. So could we verify what the plan is to cover generic cases like the following?

1. Use dynamic variables to define an autodiscover template using a specific integration:

```yaml
- name: nginx
  type: filestream
  use_output: default
  data_stream:
    namespace: default
  streams:
    - data_stream:
        dataset: nginx.access
        type: logs
      paths:
        - '/var/log/containers/*${kubernetes.container.id}.log'
      condition: ${kubernetes.labels.app} == 'nginx'
      parsers:
        - container:
            stream: all
            format: auto
```

2. Use 2 different data_streams to parse the same container's logs but different streams (stdout/stderr):

```yaml
- name: apache-1
  type: filestream
  use_output: default
  meta:
    package:
      name: apache
      version: 1.3.5
  data_stream:
    namespace: default
  condition: ${kubernetes.labels.app} == 'apache'
  streams:
    - data_stream:
        dataset: apache.access
        type: logs
      paths:
        - "/var/log/containers/*${kubernetes.container.id}.log"
      tags:
        - apache-access
      exclude_files:
        - .gz$ # Is this needed ???
      prospector.scanner.symlinks: true
      parsers:
        - container:
            stream: stdout
            format: auto
    - data_stream:
        dataset: apache.error
        type: logs
      paths:
        - "/var/log/containers/*${kubernetes.container.id}.log"
      exclude_files:
        - .gz$
      tags:
        - apache-error
      processors:
        - add_locale: null
      prospector.scanner.symlinks: true
      parsers:
        - container:
            stream: stderr
            format: auto
```
Hello @belimawr, I think we need to agree that although this fix focuses on K8s, it does not cover the rest of the integrations, and it also does not fix the actual problem of filestream handling streams based on dynamic vars without IDs. So are there any additional issues that will deal with this? Are there any documentation updates that will point all customers and our developers to provide IDs, and how to do so in every situation? Moreover, as the cloudnative team we are a little skeptical about whether your change also imposes a change here: https://github.com/elastic/elastic-agent/blob/main/deploy/kubernetes/elastic-agent-standalone-kubernetes.yaml. Do we need to update the IDs on our streams as well, or are those going to be handled by the package, which will introduce a value there? My point with the above questions is not to close this PR and then open something up to users that will again need more discussion.
Hey folks, I'll try to answer all your questions/concerns and explain the root cause of the issues with filestream inputs without unique IDs.

The main issue with filestream is that it uses the ID to identify which states (files) in the registry belong to the input, so if two filestream instances share the same ID, they have access to each other's states. That leads to data duplication when filestream cleans up (removes) the states of the files it's not harvesting any more: one filestream deletes the states of the other. One of the moments this happens is during filestream input startup.

The ideal fix is to make a unique ID required for each input; however, that is a breaking change. Unfortunately, when filestream was developed it was assumed (and assured) that there would always be unique IDs for the different filestream instances, and the presence and uniqueness of the ID were not made mandatory.

The best approach that would solve it in most cases is to have some heuristics to match auto-generated UUIDs to the filestream configuration. A detailed description (alongside the problem explanation) can be found in this issue: elastic/beats#32335. Even if we implement the heuristics, it will not solve the problem in all cases. An example edge case: if a filestream input is configured in standalone mode (either Elastic-Agent or Filebeat), the person configuring it might keep changing some configuration parameters as well as adding/removing files, and the state of the harvested files is expected to remain. That is a case where heuristics would probably fail.

For Filebeat standalone we "solved" the issue with documentation stating the ID is required and needs to be unique, as well as updating all examples to contain unique IDs: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-filestream.html#filestream-input-id. We plan to do the same for Elastic-Agent standalone using the filestream input.

@gizas, yes, https://github.com/elastic/elastic-agent/blob/main/deploy/kubernetes/elastic-agent-standalone-kubernetes.yaml also needs to be updated.
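To make the failure mode concrete, here is a minimal standalone Filebeat sketch (the paths are made up): in the first document neither input sets an ID, so both share registry state and each can delete the other's file states; the second document is the fixed version with unique IDs.

```yaml
# Problematic: no IDs, so both inputs share the same registry state.
filebeat.inputs:
  - type: filestream
    paths:
      - /var/log/app-a/*.log
  - type: filestream
    paths:
      - /var/log/app-b/*.log
---
# Fixed: each input owns its own slice of the registry state.
filebeat.inputs:
  - type: filestream
    id: app-a-logs
    paths:
      - /var/log/app-a/*.log
  - type: filestream
    id: app-b-logs
    paths:
      - /var/log/app-b/*.log
```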
I'm not quite sure what you mean here. Using this PR as an example, is it a change on the package or the stream?
Yes. Most of the filestream ID issue discussions happened within the Data-Plane team. We hoped we could solve it within Filebeat/Elastic-Agent, but after detailed investigations last week we found out there isn't a reliable way to solve it in Filebeat/Elastic-Agent. This is the issue we used to track the effort of fixing it from the Filebeat/Elastic-Agent side: elastic/beats#31512
@ChrsMark for dynamic variables and autodiscover functionality, the best approach is to use an intrinsic unique identifier of the instance being autodiscovered as the ID for the filestream input. In the cases you mentioned I can see possible unique IDs. For Kubernetes/Docker, the container ID is already unique, so it's a perfect candidate.

1. Use dynamic variables to define an autodiscover template using a specific integration

This case can just use the container ID as the filestream ID, ideally with some prefix to make it more human-friendly:

```yaml
- name: nginx
  type: filestream
  use_output: default
  data_stream:
    namespace: default
  streams:
    - data_stream:
        dataset: nginx.access
        type: logs
      id: nginx-${kubernetes.container.id}
      paths:
        - '/var/log/containers/*${kubernetes.container.id}.log'
      condition: ${kubernetes.labels.app} == 'nginx'
      parsers:
        - container:
            stream: all
            format: auto
```

2. Use 2 different data_streams to parse the same container's logs but different streams (stdout/stderr)

Because there are two different data_streams harvesting the same container's logs, each one needs its own unique ID, so the container ID alone is not enough; adding the stream name (stdout/stderr) to the prefix makes each ID unique:

```yaml
- name: apache-1
  type: filestream
  use_output: default
  meta:
    package:
      name: apache
      version: 1.3.5
  data_stream:
    namespace: default
  condition: ${kubernetes.labels.app} == 'apache'
  streams:
    - data_stream:
        dataset: apache.access
        type: logs
      id: apache-1-stdout-${kubernetes.container.id}
      paths:
        - "/var/log/containers/*${kubernetes.container.id}.log"
      tags:
        - apache-access
      exclude_files:
        - .gz$ # Is this needed ???
      prospector.scanner.symlinks: true
      parsers:
        - container:
            stream: stdout
            format: auto
    - data_stream:
        dataset: apache.error
        type: logs
      id: apache-1-stderr-${kubernetes.container.id}
      paths:
        - "/var/log/containers/*${kubernetes.container.id}.log"
      exclude_files:
        - .gz$
      tags:
        - apache-error
      processors:
        - add_locale: null
      prospector.scanner.symlinks: true
      parsers:
        - container:
            stream: stderr
            format: auto
```
@belimawr For case 2 above, what I think you want is to include the data stream's dataset in the ID; you can see we reference this in a few of the hbs files already:

```yaml
- data_stream:
    dataset: apache.error
    type: logs
  id: {{data_stream.dataset}}-${kubernetes.container.id}
- data_stream:
    dataset: apache.access
    type: logs
  id: {{data_stream.dataset}}-${kubernetes.container.id}
```

This would hopefully give you the unique IDs you need.
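For illustration, with a hypothetical container ID of fa3c6a8b, those two streams would render distinct IDs:

```
apache.access-fa3c6a8b
apache.error-fa3c6a8b
```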
@ChrsMark if we use the {{data_stream.dataset}} syntax, it only renders when the policy comes from a package (hbs) template, so it does not apply to a hand-written standalone configuration. But for an integration package it will work as well.
Yes, in this example it works. But whose responsibility is it to add this, as well as the ${kubernetes.container.id} part?
Is there any defined path forward for this? It might indeed be a breaking change, but if it's not fixed at its root it will keep annoying us. The change in this PR might be minor, but it does not give me the confidence, as an integration developer/maintainer, that all integrations will work in the future. In this regard I would like a clear statement of what the path forward is, how it will prevent future failures related to this issue, and how all integration developers will be made aware of it. Sorry if this information is already somewhere, but with a quick look I couldn't find it, so it's worth bringing this up and having it well communicated across the integration developer teams.
Unfortunately we don't have a one-size-fits-all solution. The closest would be the heuristics approach, but it still does not solve all cases and it is a huge effort. So the sad short answer is: no, we do not have a solution aside from documenting the issue well and logging when it happens. We're focusing on fixing the cases that affect the most users and doing our best to make sure everything we (Elastic) develop does not fall into this issue.
I'd say it's the integration developer's responsibility. If the integration uses the filestream input, they should follow what is documented, and currently this unique ID requirement is documented. In the long run we want to improve the situation as much as possible; for example, for the V2 architecture we want to make those unique IDs required and automatically checked. But at the moment those are just ideas, and some might be breaking changes, so we have to be cautious.
The
In the short term, the path is what we're doing now: fixing the integrations. Long term (including the V2 architecture), we want to make it better, at least by making unique IDs required. Maybe even with a breaking change, but it will take some time.
I agree, that's indeed true.
At the moment we have documentation in place for filestream. I don't know much about the integrations development cycle, but I imagine that when a new integration is developed, the developers read the documentation for the Beats features they're using, so at the moment documentation is our best solution. We could also write some integration-specific development documentation. If there is any specific documentation for integration developers, I'd be happy to open a PR adding details and examples on how to circumvent this filestream ID issue.

Another thing is that we have log errors in place that warn the user about duplicated IDs that could lead to data duplication. So once Filebeat is running, if there are log errors about duplicated filestream IDs, then the IDs are not unique. elastic/beats#31239 has some more details. This can even be used in test automation.
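For reference, the warning looks roughly like this (paraphrased from elastic/beats#31239; the exact wording depends on the Filebeat version):

```
filestream input with ID 'my-filestream-id' already exists, this will lead to data duplication, please use a different ID
```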
Aside from all that I said, I cannot at this moment give a sure and clear path forward. However, I can bring it back to the team and try to have a clear path well documented in the near future.
I agree this information is not the easiest to find. I'll gladly communicate it better; could you help me find the best channels for that? Would an addition to the contributing guide be enough? Maybe in the Generic guidelines section, or even a new warning/important/known-issues section?
Thank you @belimawr for taking the time to answer here. My biggest concern is that although we discuss where and how to add those unique IDs in the case of dynamic input integrations, we miss the main problem: identifying why missing or non-unique IDs cause duplication. The point we are trying to raise is that we don't see this action being planned anywhere. Additionally, although we in this thread might know ways to enhance integrations, we don't provide a unified example, documentation, or communication to all the teams on how this can be done. Aren't we responsible for doing so beforehand? Sorry for making such a fuss about merging this; the goal is simply to have the path of action clear from this point on and to know whether your team has identified these concerns.
@gizas We (the agent data plane team) know what the underlying problem is. The simplest explanation is that filestream was designed to use the input ID as the equivalent of a primary key in a database: there is an underlying persistent key-value store with the ID as a key. You can read a bit about this here if you are curious.

Unfortunately, when filestream was originally released in 7.17.x, nobody actually made the ID field of the input configuration mandatory. This is widely known within our team as a gigantic oversight and we are working to address it. Simply making the ID a required field in the configuration would be a breaking change that would cause Filebeat to fail to start, effectively causing a log collection outage for users with large deployments.

We are making relatively simple changes like this one as an immediate workaround for the most common uses of filestream. We are also working in the background trying to come up with a way to solve this in filestream itself without a breaking change, so that every integration does not need to be audited for this problem. My hope is that we can solve this in filestream itself, but if not, we will notify integration developers of the problem and the mitigation they'll need to apply. The majority of integrations are still using the previous log input, which does not have this problem.

As long as this fix correctly accomplishes the goal of adding a unique ID to filestream inputs created for k8s log ingestion, our preference would be to merge this, and we would be happy to continue discussing the wider issues with filestream IDs separately.
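To illustrate the primary-key analogy, here is a hedged sketch of what the registry contents conceptually look like (the key shapes are illustrative, not the exact on-disk format): the input ID is the key prefix, so two inputs configured with the same ID read and clean up the same entries.

```yaml
# Conceptual registry contents; keys are prefixed by the input ID.
"filestream::nginx-logs::native::2764410-64768":
  cursor:
    offset: 13037   # read position in the file
  meta:
    source: /var/log/containers/nginx-abc.log
"filestream::nginx-logs::native::2764411-64768":
  cursor:
    offset: 512
  meta:
    source: /var/log/containers/nginx-def.log
# A second input that also uses the ID "nginx-logs" treats these keys as
# its own: its cleanup pass can delete them, resetting offsets and causing
# the files to be re-ingested (data duplication).
```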
Thank you for the explanation once more @cmacknz! Let's make this discussion more productive in another channel, as we both agree that this PR focuses on part of the problem. Duplicate IDs pointing to different paths, or missing IDs, can be a problem maybe because of the way we create the unique-identifier text. Correct me here (I may be missing info), but a combination of file and path will always be unique, won't it? So why don't we always create the ID based on those text values? I will merge this, and let's talk from next week on about next actions. Sorry for the noise again; it was meant for good purposes.
@belimawr today @MichaelKatsoulis and myself both verified that we hit the following error:

```
{"log.level":"error","@timestamp":"2022-11-17T10:35:46.158Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":544},"message":"failed to render components: invalid 'inputs.9.id', has a duplicate id \"filestream-container-logs-df397d61-6ff8-4553-823d-830bcf23dccd\" (id is required to be unique)","ecs.version":"1.6.0"}
```

This looks weird. Is there anything that changed in Agent that could cause this?
Hey @ChrsMark, to some extent quite a lot has changed with the merge of the V2 control protocol. On a quick glance through the code I can see Elastic-Agent is doing some validation to ensure input IDs are unique. Based on the prefix in the logs, it does not seem directly related to this PR but to another integration; maybe that integration also needs an update to ensure it won't generate duplicated filestream IDs. Could you give me the steps to reproduce it, so I can take a deeper look?
You will see agents not collecting anything and reporting this error.
I found the problem and created an issue to track/discuss it: elastic/elastic-agent#1751
What does this PR do?
This is part of the fix for data duplication that happens when there
are multiple filestream inputs configured with the same ID.
Checklist
- I have added an entry to the changelog.yml file.

Author's Checklist

How to test this PR locally
Related issues
Screenshots