Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use filestream input as default for hints autodiscover. #36950

Conversation

MichaelKatsoulis
Copy link
Contributor

@MichaelKatsoulis MichaelKatsoulis commented Oct 24, 2023

What does this PR do

This PR is the code resolution of #35984 issue.

  1. It updates filebeat hints autodiscover config and the proposed filebeat k8s manifest(filebeat-kubernetes.yml) to use filestream input instead of container input for the hints.default_config

  2. It allows to continue to use the same co.elastic.logs/* hints inside pods' annotations by

  • mapping co.elastic.logs/json* hints to the ndjson parser in case of filestream.
  • mapping co.elastic.logs/multiline* hints to the multiline parser in case of filestream.

User can still choose container input in hints.default_config. Everything will work as they used to in that case.

Example

User has the following filebeat.yml configuration with hints autodiscover enabled and filestream set as hints.default_config

filebeat.yml: |-
    # To enable hints based autodiscover, remove `filebeat.inputs` configuration and uncomment this:
    filebeat.autodiscover:
     providers:
       - type: kubernetes
         node: ${NODE_NAME}
         hints.enabled: true
         hints.default_config:
           type: filestream
           prospector.scanner.symlinks: true
           id: kubernetes-container-logs-${data.kubernetes.pod.name}-${data.kubernetes.container.id}
           paths:
           - /var/log/containers/*-${data.kubernetes.container.id}.log
           parsers:
           - container: ~

User sets the following hints in the Filebeat pods' annotations

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: kube-system
  labels:
    k8s-app: filebeat
spec:
  selector:
    matchLabels:
      k8s-app: filebeat
  template:
    metadata:
      labels:
        k8s-app: filebeat
      annotations:
        co.elastic.logs/json.target: "json"
        co.elastic.logs/processors.add_fields.target: "project"
        co.elastic.logs/processors.add_fields.fields.name: "myproject"
        co.elastic.logs/json.message_key: "foo"
        co.elastic.logs/multline.pattern: "^test"

The produced configuration for filebeat pod should look like this:

type: filestream
prospector.scanner.symlinks: true
id: kubernetes-container-logs-filebeat-29472-2495909b94a1567459717d.log
paths:
 - /var/log/containers/*-2495909b94a1567459717d.log
parsers:
- container: ~
- ndjson:
     target: "json"
     message_key: "foo"
- multiline:
     pattern: "^test"
processors:
- add_fields:
      target: project
      fields:
        name: myproject   

IMPORTANT NOTE:
Due to the default input type change , a user already running filebeat using container input will experience filebeat state loss. This will lead to all the available files at that moment to be re ingested.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  1. Checkout to this PR
  2. Create kind kubernetes cluster
  3. PLATFORMS=linux/amd64 TYPES=docker mage package
  4. cd build/package/filebeat-oss/filebeat-oss-linux-amd64.docker/docker-build && docker build -t myfilebeat .
  5. kind load docker-image myfilebeat
  6. Configure beats/deploy/kubernetes/filebeat-kubernetes.yaml as in example section.
  7. kubectl apply -f beats/deploy/kubernetes/filebeat-kubernetes.yaml
  8. Check in Kibana discover that json parsers and/or processors defined in hints work as expected

Related issues

Use cases

Screenshots

json parser processor

…c.logs/json* in hints to the ndjson parser of filestream
@MichaelKatsoulis MichaelKatsoulis requested review from a team as code owners October 24, 2023 10:20
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Oct 24, 2023
@MichaelKatsoulis MichaelKatsoulis marked this pull request as draft October 24, 2023 10:20
@mergify
Copy link
Contributor

mergify bot commented Oct 24, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @MichaelKatsoulis? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Copy link
Collaborator

elasticmachine commented Oct 24, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Duration: 77 min 32 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@pierrehilbert pierrehilbert added the Team:Elastic-Agent Label for the Agent team label Oct 24, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Oct 24, 2023
@MichaelKatsoulis MichaelKatsoulis requested review from gizas and ChrsMark and removed request for gsantoro October 24, 2023 11:40
@MichaelKatsoulis MichaelKatsoulis added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label Oct 24, 2023
@MichaelKatsoulis MichaelKatsoulis marked this pull request as ready for review October 24, 2023 13:58
@MichaelKatsoulis MichaelKatsoulis marked this pull request as draft October 25, 2023 11:01
@MichaelKatsoulis MichaelKatsoulis marked this pull request as ready for review October 25, 2023 12:03
Copy link
Member

@ChrsMark ChrsMark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes lgtm!
Consider adding this to a manual testing phase for when the BCs are out.

@@ -123,15 +123,19 @@ data:
logs_path: "/var/log/containers/"

# To enable hints based autodiscover, remove `filebeat.inputs` configuration and uncomment this:
#filebeat.autodiscover:
# filebeat.autodiscover:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this extra space intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default, when I commented in and out the autodiscover block, it added this space. TBH it looks more readable

@@ -19,15 +19,19 @@ data:
logs_path: "/var/log/containers/"

# To enable hints based autodiscover, remove `filebeat.inputs` configuration and uncomment this:
#filebeat.autodiscover:
# filebeat.autodiscover:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same: is this needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say to remove the spaces because you can end up uncommenting this block and this not to have the correct spacing.

@cmacknz cmacknz requested a review from rdner October 27, 2023 18:54
@MichaelKatsoulis
Copy link
Contributor Author

@elastic/elastic-agent-data-plane team as you are the code owners , could you review this PR ?

Copy link
Member

@rdner rdner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're already switching to another input type with a state loss, I'd highly recommend to use fingerprint file identity on Kubernetes. We have a lot of reports that inode values in containerized environments are not stable.

For details please refer to https://www.elastic.co/blog/introducing-filestream-fingerprint-mode

Docs https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-filestream.html#_file_identity_2

@MichaelKatsoulis
Copy link
Contributor Author

@rdner thanks for your comment. Although this new option is officially recommended to be used in cases where a customer is facing data loss or duplication, I understand the value of us setting it as default recommendation.

I got a bit confused from the configuration documentation.
Is this the way to set the fingerprint for filestream input?

type: filestream
            prospector.scanner.symlinks: true
            id: kubernetes-container-logs-${data.kubernetes.pod.name}-${data.kubernetes.container.id}
            paths:
            - /var/log/containers/*-${data.kubernetes.container.id}.log
            parsers:
            - container: ~
            file_identity.native: ~

or should it be under prospector.scanner.fingerprint ?

@rdner
Copy link
Member

rdner commented Oct 31, 2023

@MichaelKatsoulis there are 2 things here:

The correct snippet would be something like this:

            - type: filestream
              id: kubernetes-container-logs-${data.kubernetes.pod.name}-${data.kubernetes.container.id}
              prospector:
                scanner:
                  fingerprint.enabled: true
                  symlinks: true
              file_identity.fingerprint: ~
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log

@MichaelKatsoulis
Copy link
Contributor Author

@rdner I played around with fingerprint using defaults in a local kind cluster and I get constant errors for most of the log files

"message":"cannot create a file descriptor for an ingest target \"/var/log/containers/local-path-provisioner-684f458cdd-w7cgt_local-path-storage_local-path-provisioner-57cb15732f6d64418c56a06a3ec2f9c5ae167a3a2902aee141b80618fdecf654.log\": filesize of \"/var/log/containers/local-path-provisioner-684f458cdd-w7cgt_local-path-storage_local-path-provisioner-57cb15732f6d64418c56a06a3ec2f9c5ae167a3a2902aee141b80618fdecf654.log\" is 763 bytes, expected at least 1024 bytes for fingerprinting","service.name":"filebeat","ecs.version":"1.6.0"}

So TBH I don't know if setting different defaults make sense or leave it on the users to decide if they want this feature or not.

@rdner
Copy link
Member

rdner commented Oct 31, 2023

@MichaelKatsoulis but what's the issue with the message? It clearly communicates what's happening and it will pick up the file once it grows in size.

We're talking about the choice between non-working file identity that leads to data duplication and data loss, and working file identity that addresses this issue.

We're having a quite high amount of support tickets related to this on Kubernetes, the fingerprint file identity was created to address this.

@rdner
Copy link
Member

rdner commented Oct 31, 2023

@MichaelKatsoulis by the way, these messages are not errors. They're warnings, so the customer would know why their files are not being ingested yet.

@MichaelKatsoulis
Copy link
Contributor Author

@rdner It is just that those logs confused me as on top of that I could not find the logs of some test pods that I had running. But it is due to the log file being too small, which was on purpose as it wasn't something that logs if not used (like Redis or nginx).
But I understand now why this happens. I am on board with you about the usefulness of this feature.

I updated my pr accordingly. Could you take a final look?

if inputType == harvester.FilestreamType {
// json options should be under ndjson parser in filestream input
parsersTempCfg := []mapstr.M{}
ndjsonTempCfg := mapstr.M{}
Copy link
Contributor

@gizas gizas Oct 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check if this is empty before calling next line?

Ignore this. I just realsised that those are empty mapstr.M

CHANGELOG.next.asciidoc Outdated Show resolved Hide resolved
json.add_error_key: "true"
-----

NOTE: `keys_under_root` json option of `log` input is replaced with `target` option in filestream input. Read the documentation on how to use it correctly.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should put a link to filestream input here

@MichaelKatsoulis MichaelKatsoulis merged commit 41ab08c into elastic:main Nov 1, 2023
30 checks passed
@@ -112,9 +112,16 @@ metadata:
data:
filebeat.yml: |-
filebeat.inputs:
- type: container
- type: filestream
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichaelKatsoulis
shouldn't here be defined id in input? or in such case it will be automatically generated?

doc:

Each filestream input must have a unique ID. Omitting or changing the filestream ID may cause data duplication. Without a unique ID, filestream is unable to correctly track the state of files.

so for all files that are matching /var/log/containers/*.log we have 1 filestream with unique id, correct? do you know what does it imply in comparison to the autodiscover where it will be created a dedicated filestream per container?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tetianakravchenko Yes an id will automatically get generated. When the filebeat.input is used instead of auto discovery then there will be one stream of filestream input looking at all files in the path. When autodiscovery is used there will be one stream for each discovered container looking at one log file only.

For the metadata in first scenario, the processor is used which requires the matchers log path so it can extract the container id from the log file name, and add the metadata of that container.
In the autodiscovery case the metadata are enriched by the kubernetes provider.

So yes, we have one filestream with one id for all the log collection. Both options work just fine. But with the first approach we cannot enable hints.

Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
* Use filestream input as default for hints autodiscover. Map co.elastic.logs/json* in hints to the ndjson parser of filestream

* Update filebeat-kubernetes.yaml

* Map co.elastic.logs/multiline.* hints to multiline parser of filestream input

* Update documentation


* Use file_identity.fingerprint as default way of file unique id creation
---------

Co-authored-by: Andrew Gizas <andreas.gkizas@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants