
add container ID as part of Kubernetes Container Logs filestream input #3672

Merged — 1 commit merged into elastic:main on Jul 14, 2022

Conversation

belimawr
Contributor

@belimawr belimawr commented Jul 8, 2022

What does this PR do?

This is part of the fix for data duplication that happens when there
are multiple filestream inputs configured with the same ID.

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.

Author's Checklist

How to test this PR locally

Related issues

Screenshots

@belimawr belimawr added the "bug" label (Something isn't working, use only for issues) on Jul 8, 2022
@belimawr belimawr force-pushed the fix-filestream-id-kubernetes branch from a661d5f to 028b87a on July 8, 2022 16:03
@elasticmachine

elasticmachine commented Jul 8, 2022

💚 Build Succeeded


Build stats

  • Start Time: 2022-07-11T16:43:52.259+0000

  • Duration: 31 min 42 sec

Test stats 🧪

| Test Results | Count |
|---|---|
| Failed | 0 |
| Passed | 90 |
| Skipped | 0 |
| Total | 90 |

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@elasticmachine

elasticmachine commented Jul 8, 2022

🌐 Coverage report

| Name | Metrics % (covered/total) | Diff |
|---|---|---|
| Packages | 100.0% (0/0) 💚 | |
| Files | 100.0% (0/0) 💚 | 3.08 |
| Classes | 100.0% (0/0) 💚 | 3.08 |
| Methods | 94.872% (74/78) 👍 | 5.828 |
| Lines | 100.0% (0/0) 💚 | 9.856 |
| Conditionals | 100.0% (0/0) 💚 | |

@cmacknz cmacknz self-requested a review July 11, 2022 12:55
@belimawr belimawr force-pushed the fix-filestream-id-kubernetes branch 2 times, most recently from ce45806 to d46ef82 on July 11, 2022 16:43
This is part of the fix for data duplication that happens when there
are multiple filestream inputs configured with the same ID.

The container ID and pod name are added to the filestream input
ID. The container ID ensures uniqueness and the pod name helps with
debugging.

Issue: elastic/beats#31512
@belimawr belimawr marked this pull request as ready for review July 11, 2022 16:44
@belimawr belimawr requested a review from a team as a code owner July 11, 2022 16:44
@ChrsMark
Member

To my mind this only partially fixes the problem: it solves the case that surfaced the problem, not the problem itself. Will we need to adjust every integration like this in the future whenever an integration hits this issue? I think we would like the confidence that the filestream input can be used without special hacks when it comes to dynamic variables, no?

So could we clarify what the plan is for covering generic cases like the following?

Use dynamic variables to define an autodiscover template using a specific integration

- name: nginx
  type: filestream
  use_output: default
  data_stream:
    namespace: default
  streams:
    - data_stream:
        dataset: nginx.access
        type: logs
      paths:
        - '/var/log/containers/*${kubernetes.container.id}.log'
      condition: ${kubernetes.labels.app} == 'nginx'
      parsers:
        - container:
            stream: all
            format: auto

Use 2 different data_streams to parse the same container's logs but different streams (stdout/stderr)

  - name: apache-1
    type: filestream
    use_output: default
    meta:
      package:
        name: apache
        version: 1.3.5
    data_stream:
      namespace: default
    condition: ${kubernetes.labels.app} == 'apache'
    streams:
      - data_stream:
          dataset: apache.access
          type: logs
        paths:
          - "/var/log/containers/*${kubernetes.container.id}.log"
        tags:
          - apache-access
        exclude_files:
          - .gz$  # Is this needed ???
        prospector.scanner.symlinks: true
        parsers:
          - container:
              stream: stdout
              format: auto
      - data_stream:
          dataset: apache.error
          type: logs
        paths:
          - "/var/log/containers/*${kubernetes.container.id}.log"
        exclude_files:
          - .gz$
        tags:
          - apache-error
        processors:
          - add_locale: null
        prospector.scanner.symlinks: true
        parsers:
          - container:
              stream: stderr
              format: auto

@gizas
Contributor

gizas commented Jul 12, 2022

Hello @belimawr,

I think we need to agree that although this fix focuses on K8s, it does not cover the rest of the integrations, and it also does not fix the actual problem of filestream handling streams based on dynamic vars without IDs.

So are there any additional issues that will deal with this? Any documentation updates that will point all customers and our developers to provide IDs, and show how to do so in every situation?

Moreover, as the cloud-native team, we are a little unsure whether your change also requires a change here: https://github.com/elastic/elastic-agent/blob/main/deploy/kubernetes/elastic-agent-standalone-kubernetes.yaml

Do we need to update IDs on our streams as well, or are those going to be handled by the package, which will introduce a value there?

My point with the above questions is to avoid closing this PR and then shipping something to users that will again need more discussion. So mainly:

  • What next actions are planned?
  • Do we need a preliminary sync on such issues in the future, at least before starting them, to discuss hidden implications?

@belimawr
Contributor Author

Hey folks, I'll try to answer all your questions/concerns and explain the root cause of the issues with filestream inputs that lack unique IDs.

The main issue with filestream is that it uses the ID to identify which states (files) in the registry belong to the input, so if two filestream instances share the same ID, they have access to each other's states. That leads to data duplication when filestream cleans up (removes) the states of the files it is no longer harvesting: one filestream deletes the states of the other. One of the moments this happens is during filestream input startup.
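
To make the failure mode concrete, here is a contrived standalone Filebeat sketch (paths and the ID are invented for illustration): both inputs share one ID, so they read and clean up each other's registry states, which is exactly what produces the duplicated events described above.

filebeat.inputs:
  - type: filestream
    id: my-filestream-id        # same ID as below
    paths:
      - /var/log/app-a/*.log
  - type: filestream
    id: my-filestream-id        # duplicated ID: both inputs share registry state
    paths:
      - /var/log/app-b/*.log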

The ideal fix is to make a unique ID required for each input; however, that is a breaking change.

Unfortunately, when filestream was developed it was assumed (and assured) that there would always be unique IDs for the different filestream instances, and the ID's presence and uniqueness were not made mandatory.

The approach that would solve it in most cases is to use some heuristics to match auto-generated UUIDs to the filestream configuration. A detailed description (alongside an explanation of the problem) can be found in this issue: elastic/beats#32335.

Even if we implement the heuristics, it will not solve the problem in all cases.

An example edge case: if a filestream input is configured in standalone mode (either Elastic-Agent or Filebeat), the person configuring it might keep changing some of the configuration parameters as well as adding/removing files, while expecting the state of the harvested files to remain. That is a case where heuristics would probably fail.

For Filebeat standalone we "solved" the issue by documenting that the ID is required and needs to be unique, and by updating all examples to contain unique IDs: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-filestream.html#filestream-input-id
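
For reference, a minimal standalone snippet following that documentation might look like this (paths and IDs are illustrative): each filestream input gets its own unique, stable id.

filebeat.inputs:
  - type: filestream
    id: unique-system-logs-id
    paths:
      - /var/log/syslog*
  - type: filestream
    id: unique-app-logs-id
    paths:
      - /var/log/myapp/*.log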

We plan to do the same for Elastic-Agent standalone configurations that use the filestream input.

@gizas, yes, https://github.com/elastic/elastic-agent/blob/main/deploy/kubernetes/elastic-agent-standalone-kubernetes.yaml also needs to be updated
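
As a rough sketch of what that manifest update could look like, following the same idea as this PR (adding the pod name and container ID to the stream ID); the exact names and surrounding fields in the real manifest may differ:

- name: container-log
  type: filestream
  use_output: default
  streams:
    - data_stream:
        dataset: kubernetes.container_logs
        type: logs
      id: kubernetes-container-logs-${kubernetes.pod.name}-${kubernetes.container.id}
      paths:
        - /var/log/containers/*${kubernetes.container.id}.log
      prospector.scanner.symlinks: true
      parsers:
        - container:
            stream: all
            format: auto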

Do we need to udpate ids on our streams as well or those are going to be handled from the package and this will introduce a value there?

I'm not quite sure what you mean here. Using this PR as an example, is it a change to the package or to the stream?

Do we need to have a preliminary sync on such issues in the future at least before starting them to discuss hidden implications?

Yes. Most of the filestream ID discussions happened within the Data-Plane team. We hoped we could solve it within Filebeat/Elastic-Agent, but after detailed investigation last week we found there isn't a reliable way to solve it in Filebeat/Elastic-Agent. This is the issue we used to track the effort of fixing it in Filebeat/Elastic-Agent: elastic/beats#31512

@belimawr
Contributor Author

@ChrsMark for dynamic variables and autodiscover functionality, the best approach is to use an intrinsic unique identifier of the instance being autodiscovered as the ID for the filestream input. In the cases you mentioned I can see possible unique IDs.

For Kubernetes/Docker, the container ID is already unique, so it's a perfect candidate.

1. Use dynamic variables to define an autodiscover template using specific integration

This case can just use the container ID as the filestream ID, ideally with some prefix to make it more human-friendly.

- name: nginx
  type: filestream
  use_output: default
  data_stream:
    namespace: default
  streams:
    - data_stream:
        dataset: nginx.access
        type: logs
      id: nginx-${kubernetes.container.id}
      paths:
        - '/var/log/containers/*${kubernetes.container.id}.log'
      condition: ${kubernetes.labels.app} == 'nginx'
      parsers:
        - container:
            stream: all
            format: auto

2. Use 2 different data_streams to parse same container's logs but different stream(stdout/stderr)

Assuming the stderr and stdout of the container are logged to the same file, that is a very good example of where filestream (assuming there are unique IDs) shines, because that is one of the filestream features: it allows different inputs to harvest the same file without interfering with each other (which is not possible with the log input).

Because there are two different data_streams (which will become two different filestream input instances), everything will work fine as long as they have different IDs. Looking at the example you provided, here is a possible solution for ID uniqueness:

  - name: apache-1
    type: filestream
    use_output: default
    meta:
      package:
        name: apache
        version: 1.3.5
    data_stream:
      namespace: default
    condition: ${kubernetes.labels.app} == 'apache'
    streams:
      - data_stream:
          dataset: apache.access
          type: logs
        id: apache-1-stdout-${kubernetes.container.id}
        paths:
          - "/var/log/containers/*${kubernetes.container.id}.log"
        tags:
          - apache-access
        exclude_files:
          - .gz$  # Is this needed ???
        prospector.scanner.symlinks: true
        parsers:
          - container:
              stream: stdout
              format: auto
      - data_stream:
          dataset: apache.error
          type: logs
        id: apache-1-stderr-${kubernetes.container.id}
        paths:
          - "/var/log/containers/*${kubernetes.container.id}.log"
        exclude_files:
          - .gz$
        tags:
          - apache-error
        processors:
          - add_locale: null
        prospector.scanner.symlinks: true
        parsers:
          - container:
              stream: stderr
              format: auto

@belimawr
Contributor Author

@ChrsMark @gizas do you agree with merging this PR? What are your thoughts on it now?
Let me know if you have any other questions/concerns; I'll be glad to answer them. We can also have a face-to-face chat if that will make things easier.

@cmacknz
Member

cmacknz commented Jul 13, 2022

@belimawr For case 2 above, what I think you want is to include the data stream dataset in the ID; you can see we already reference this in a few of the hbs files:

data_stream:
  dataset: {{data_stream.dataset}}

      - data_stream:
          dataset: apache.error
          type: logs
        id: {{data_stream.dataset}}-${kubernetes.container.id}

      - data_stream:
          dataset: apache.access
          type: logs
        id: {{data_stream.dataset}}-${kubernetes.container.id}

This would hopefully give you the IDs apache.error-${kubernetes.container.id} and apache.access-${kubernetes.container.id}.

@belimawr
Contributor Author

@ChrsMark if we use {{data_stream.dataset}}, would that work in a standalone policy? I focused on using what is available through the Elastic-Agent providers so the snippet can be used in any policy.

But for an integration package it will work as well.

@ChrsMark
Member

Yes, in this example {{data_stream.dataset}} would work, I guess, but I cannot tell for every single case. This is my main concern with this issue: as I mentioned, I see it solved in specific cases but not necessarily once and for all.

But whose responsibility is it to add this, as well as the id, in every integration that uses dynamic variables? As far as I know, dynamic variables can be used everywhere, so a user can even use the custom_logs package to collect logs from dynamically populated file paths.

The ideal fix is to make a unique ID required for each input; however, that is a breaking change.

Unfortunately, when filestream was developed it was assumed (and assured) that there would always be unique IDs for the different filestream instances, and the ID's presence and uniqueness were not made mandatory.

Is there any defined path forward for this? It might indeed be a breaking change, but if it's not fixed in the first place it will keep annoying us.

The change in this PR might be minor, but it does not give me confidence as an integration developer/maintainer that all integrations will work in the future.
What about other integrations that will be developed in the future? How will integration developers know about this?

In this regard I would like a clear statement of what the path forward is, how it will prevent future failures related to this issue, and how all integration developers will be made aware of it. Sorry if this information is already somewhere, but with a quick look I couldn't find it, so it is worth bringing this up and having it well communicated across the integration developer teams.

@belimawr
Contributor Author

as I mentioned, I see it solved in specific cases but not necessarily once and for all.

Unfortunately, we don't have a one-size-fits-all solution. The closest would be the heuristics approach, but it still does not cover all cases and it is a huge effort. So the sad short answer is: no, we do not have a solution aside from documenting the issue well and logging when it happens.

We're focusing on fixing the cases that affect the most users and doing our best to make sure nothing we (Elastic) develop falls into this issue.

But whose responsibility is it to add this, as well as the id, in every integration that uses dynamic variables?

I'd say it's the integration developer's responsibility. If the integration uses the filestream input, they should follow what is documented, and the unique ID requirement is currently documented.

In the long run, we want to improve the situation as much as possible; for example, for the V2 architecture we want to make those unique IDs required and automatically checked. But at the moment those are just ideas, and some might be breaking changes, so we have to be cautious.

As far as I know, dynamic variables can be used everywhere, so a user can even use the custom_logs package to collect logs from dynamically populated file paths.

The custom_logs integration uses the log input, so it does not need a unique ID, but on the other hand, the same file cannot be harvested by different log inputs. Plus the log input has other performance issues that are solved in filestream.

Is there any defined path forward for this? It might indeed be a breaking change, but if it's not fixed in the first place it will keep annoying us.

In the short term, the path is what we're doing now: fixing the integrations. Long term (including the V2 architecture) we want to make it better, at least making the unique IDs required. Maybe even having a breaking change, but it will take some time.

The change in this PR might be minor, but it does not give me confidence as an integration developer/maintainer that all integrations will work in the future.

I agree, that's indeed true.

What about other integrations that will be developed in the future? How will integration developers know about this?

At the moment we have documentation in place for filestream. I don't know much about the integrations development cycle, but I imagine that when a new integration is developed, the developers read the documentation for the Beats/features they're using, so at the moment documentation is our best solution. We could also write some integration-development-specific documentation. If there is any specific documentation for integration developers, I'd be happy to write a PR adding details and examples on how to circumvent this filestream ID issue.

Another thing: we have log errors in place that warn the user about duplicated IDs that could lead to data duplication. Once Filebeat is running, if there are log errors like this:

filestream input with ID 'xxxxxx' already exists, this will lead to data duplication, please use a different ID

Then the IDs are not unique. elastic/beats#31239 has some more details. This can even be used in test automation.

In this regard I would like a clear statement of what the path forward is, how it will prevent future failures related to this issue, and how all integration developers will be made aware of it.

Aside from all that I said, I cannot, at this moment, give a sure and clear path forward. However, I can bring it back to the team and try to have a clear path well documented in the near future.

Sorry if this information is already somewhere, but with a quick look I couldn't find it, so it is worth bringing this up and having it well communicated across the integration developer teams.

I agree this information is not the easiest to find. I'll gladly communicate it better. Could you help me find the best channels for that? Would an addition to the contributing guide be enough? Maybe in the Generic guidelines section, or even a new warning/important/known-issues section?

@ChrsMark
Member

Thanks for the clarifications here @belimawr!

Pinging @gizas and @ruflin here to review the suggested approach as well as to decide how this should be communicated to the *-integrations teams.

@gizas
Contributor

gizas commented Jul 14, 2022

Thank you @belimawr for taking the time to answer here.

My biggest concern here is that although we discuss where and how to add those unique IDs in the case of dynamic-input integrations, we miss the main problem: identifying why missing or non-unique IDs cause duplication. The point we are trying to raise is that we don't see this action being planned anywhere.

Additionally, although those of us in this thread might know ways to enhance integrations, we don't provide a unified example, documentation, or communication to all the teams about how this can be done. Aren't we responsible for doing so beforehand?

Sorry for making such a fuss about merging this; it is really more about making the path of action clear from this point on and whether your team has identified such concerns.

@cmacknz
Member

cmacknz commented Jul 14, 2022

My biggest concern here is that although we discuss where and how to add those unique IDs in the case of dynamic-input integrations, we miss the main problem: identifying why missing or non-unique IDs cause duplication. The point we are trying to raise is that we don't see this action being planned anywhere.

@gizas We (the agent data plane team) know what the underlying problem is. The simplest explanation is that filestream was designed to use the input ID as the equivalent of a primary key in a database. There is an underlying persistent key-value store with the ID as a key. You can read a bit about this here if you are curious.
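
As a simplified illustration of that key-value store (not the exact on-disk format of the Filebeat registry), each entry's key combines the input ID with a file identity, and the value tracks the reading cursor for that file; if two inputs share an ID, they both match the same entries:

# illustrative registry entries, keyed by input ID plus file identity (device/inode)
"filestream::<input id>::native::271185-66305":
  cursor:
    offset: 8192
"filestream::<input id>::native::271186-66305":
  cursor:
    offset: 1024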

Unfortunately when filestream was originally released in 7.17.x nobody actually made the ID field of the input configuration mandatory. This is widely known within our team as a gigantic oversight and we are working to address it. Simply making the ID a required field in the configuration would be a breaking change that would cause Filebeat to fail to start, effectively causing a log collection outage for users with large deployments.

We are making relatively simple changes like this one as an immediate workaround for the most common uses of filestream. We are also working in the background to come up with a way to solve this in filestream itself without a breaking change, so that not every integration needs to be audited for this problem.

My hope is that we can solve this in filestream itself, but if not we will notify integration developers of the problem and the mitigation they'll need to apply. The majority of integrations are still using the previous log input which does not have this problem. Migrating them to filestream is dependent on us fixing this ID problem.

As long as this fix correctly accomplishes the goal of adding a unique ID to filestream inputs created for k8s log ingestion, our preference would be to merge this and we would be happy to continue discussing the wider issues with filestream IDs separately.

@gizas
Contributor

gizas commented Jul 14, 2022

Thank you for the explanation once more @cmacknz!

Let's continue this discussion more productively in another channel, as we both agree that this PR focuses on part of the problem.
I was aware of this link; my point is that I am still missing the next actions and whether we are going to address this as a common effort.

Duplicate IDs pointing to different paths, or missing IDs, can be a problem maybe because of the way we create the unique identifier text. Correct me here, as I may be missing information, but a combination of file and path will always be unique, will it not? So why don't we always create the ID based on those text values?

I will merge this, and let's talk next week about next actions. Sorry for the noise again; it was meant for good purposes.

@gizas gizas merged commit 4069a45 into elastic:main Jul 14, 2022
@belimawr belimawr deleted the fix-filestream-id-kubernetes branch July 14, 2022 15:40
@ChrsMark
Member

@belimawr today @MichaelKatsoulis and I both verified that with 8.6.0-SNAPSHOT we hit the following in the logs:

{"log.level":"error","@timestamp":"2022-11-17T10:35:46.158Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":544},"message":"failed to render components: invalid 'inputs.9.id', has a duplicate id \"filestream-container-logs-df397d61-6ff8-4553-823d-830bcf23dccd\" (id is required to be unique)","ecs.version":"1.6.0"}

This looks weird. Is there anything that changed in Agent that could cause this?

@belimawr
Contributor Author

Hey @ChrsMark, to some extent quite a lot has changed with the merge of the V2 control protocol.

At a quick glance through the code I can see Elastic-Agent is doing some validation to ensure input IDs are unique. Based on the prefix in the logs it does not seem directly related to this PR, but to another integration. Maybe that integration also needs some updates to ensure it won't generate duplicated filestream IDs.

Could you give me the steps to reproduce it, so I can take a deeper look?

@MichaelKatsoulis
Contributor

@belimawr

  1. Create a local k8s cluster with kind: kind create cluster --config kind-cluster.yaml

     kind-cluster.yaml:

     kind: Cluster
     apiVersion: kind.x-k8s.io/v1alpha4
     nodes:
       - role: control-plane
         kubeadmConfigPatches:

  2. Bring up the Elastic stack version 8.6.0-SNAPSHOT with elastic-package:

     elastic-package stack up --version=8.6.0-SNAPSHOT -v -d

  3. Create a policy with the Kubernetes integration and follow the UI instructions for running the agent on k8s.

You will see the agents not collecting anything and reporting this error.

@belimawr
Contributor Author

I found the problem and created an issue to track/discuss it: elastic/elastic-agent#1751
