Use filestream input as default for hints autodiscover. #36950

MichaelKatsoulis · 2023-10-24T10:20:33Z

What does this PR do

This PR resolves issue #35984.

It updates the Filebeat hints autodiscover configuration and the proposed Filebeat Kubernetes manifest (filebeat-kubernetes.yaml) to use the filestream input instead of the container input for hints.default_config.
The same co.elastic.logs/* hints can still be used in pod annotations, because the PR:

maps co.elastic.logs/json* hints to the ndjson parser when the input is filestream.
maps co.elastic.logs/multiline* hints to the multiline parser when the input is filestream.

Users can still choose the container input in hints.default_config. In that case everything works as before.
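
The hint-to-parser mapping described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `buildParsers`, the `parser` type, and the flat `map[string]string` hint representation are all hypothetical, and plain Go maps stand in for the `mapstr.M` type that Beats uses.

```go
package main

import (
	"fmt"
	"strings"
)

// parser is one entry of a filestream `parsers` list,
// e.g. {"ndjson": {"target": "json"}}. Hypothetical type for illustration.
type parser map[string]map[string]string

// buildParsers is a hypothetical helper: it groups hints whose keys start
// with "json." or "multiline." under the matching filestream parser.
// The container parser comes first, as in hints.default_config.
func buildParsers(hints map[string]string) []parser {
	ndjson := map[string]string{}
	multiline := map[string]string{}
	for key, value := range hints {
		switch {
		case strings.HasPrefix(key, "json."):
			ndjson[strings.TrimPrefix(key, "json.")] = value
		case strings.HasPrefix(key, "multiline."):
			multiline[strings.TrimPrefix(key, "multiline.")] = value
		}
	}

	parsers := []parser{{"container": {}}}
	if len(ndjson) > 0 {
		parsers = append(parsers, parser{"ndjson": ndjson})
	}
	if len(multiline) > 0 {
		parsers = append(parsers, parser{"multiline": multiline})
	}
	return parsers
}

func main() {
	// Assumption: the co.elastic.logs/ prefix has already been stripped.
	hints := map[string]string{
		"json.target":       "json",
		"json.message_key":  "foo",
		"multiline.pattern": "^test",
	}
	for _, p := range buildParsers(hints) {
		fmt.Println(p)
	}
}
```

Hints that belong to neither group (for example processors) are ignored by this sketch; in the real builder they are handled separately.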

Example

The user has the following filebeat.yml configuration, with hints autodiscover enabled and filestream set as the input type in hints.default_config:

filebeat.yml: |-
    # To enable hints based autodiscover, remove `filebeat.inputs` configuration and uncomment this:
    filebeat.autodiscover:
     providers:
       - type: kubernetes
         node: ${NODE_NAME}
         hints.enabled: true
         hints.default_config:
           type: filestream
           prospector.scanner.symlinks: true
           id: kubernetes-container-logs-${data.kubernetes.pod.name}-${data.kubernetes.container.id}
           paths:
           - /var/log/containers/*-${data.kubernetes.container.id}.log
           parsers:
           - container: ~

The user sets the following hints in the Filebeat pod annotations:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: kube-system
  labels:
    k8s-app: filebeat
spec:
  selector:
    matchLabels:
      k8s-app: filebeat
  template:
    metadata:
      labels:
        k8s-app: filebeat
      annotations:
        co.elastic.logs/json.target: "json"
        co.elastic.logs/processors.add_fields.target: "project"
        co.elastic.logs/processors.add_fields.fields.name: "myproject"
        co.elastic.logs/json.message_key: "foo"
        co.elastic.logs/multiline.pattern: "^test"

The produced configuration for filebeat pod should look like this:

type: filestream
prospector.scanner.symlinks: true
id: kubernetes-container-logs-filebeat-29472-2495909b94a1567459717d
paths:
 - /var/log/containers/*-2495909b94a1567459717d.log
parsers:
- container: ~
- ndjson:
     target: "json"
     message_key: "foo"
- multiline:
     pattern: "^test"
processors:
- add_fields:
      target: project
      fields:
        name: myproject

IMPORTANT NOTE:
Due to the change of the default input type, a user already running Filebeat with the container input will experience Filebeat state loss. As a result, all files available at that moment will be re-ingested.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Check out this PR
Create a kind Kubernetes cluster
PLATFORMS=linux/amd64 TYPES=docker mage package
cd build/package/filebeat-oss/filebeat-oss-linux-amd64.docker/docker-build && docker build -t myfilebeat .
kind load docker-image myfilebeat
Configure beats/deploy/kubernetes/filebeat-kubernetes.yaml as in the Example section.
kubectl apply -f beats/deploy/kubernetes/filebeat-kubernetes.yaml
Check in Kibana Discover that the json parsers and/or processors defined in the hints work as expected

Related issues

Relates [Beats] Hints autodiscovery support with Filestream input #35984


mergify · 2023-10-24T10:21:10Z

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @MichaelKatsoulis? 🙏.
For such, you'll need to label your PR with:

The upcoming major version of the Elastic Stack
The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

elasticmachine · 2023-10-24T10:34:03Z

💚 Build Succeeded


Build stats

Duration: 77 min 32 sec

❕ Flaky test report

No test was executed to be analysed.


ChrsMark

Changes lgtm!
Consider adding this to a manual testing phase for when the BCs are out.

ChrsMark · 2023-10-26T12:16:16Z

deploy/kubernetes/filebeat-kubernetes.yaml

@@ -123,15 +123,19 @@ data:
                logs_path: "/var/log/containers/"

    # To enable hints based autodiscover, remove `filebeat.inputs` configuration and uncomment this:
-    #filebeat.autodiscover:
+    # filebeat.autodiscover:


Is this extra space intentional?

MichaelKatsoulis · 2023-10-26

By default, when I commented the autodiscover block in and out, my editor added this space. TBH it looks more readable.

ChrsMark · 2023-10-26T12:16:41Z

deploy/kubernetes/filebeat/filebeat-configmap.yaml

@@ -19,15 +19,19 @@ data:
                logs_path: "/var/log/containers/"

    # To enable hints based autodiscover, remove `filebeat.inputs` configuration and uncomment this:
-    #filebeat.autodiscover:
+    # filebeat.autodiscover:


Same here: is this needed?

I would remove the spaces; otherwise, uncommenting this block can leave it with incorrect indentation.

MichaelKatsoulis · 2023-10-30T08:32:31Z

@elastic/elastic-agent-data-plane team, as you are the code owners, could you review this PR?

rdner

Since we're already switching to another input type with a state loss, I'd highly recommend using the fingerprint file identity on Kubernetes. We have a lot of reports that inode values in containerized environments are not stable.

For details please refer to https://www.elastic.co/blog/introducing-filestream-fingerprint-mode

Docs https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-filestream.html#_file_identity_2

MichaelKatsoulis · 2023-10-30T12:59:13Z

@rdner thanks for your comment. Although this new option is officially recommended for cases where a customer is facing data loss or duplication, I understand the value of making it our default recommendation.

I got a bit confused by the configuration documentation.
Is this the way to set the fingerprint for the filestream input?

type: filestream
prospector.scanner.symlinks: true
id: kubernetes-container-logs-${data.kubernetes.pod.name}-${data.kubernetes.container.id}
paths:
- /var/log/containers/*-${data.kubernetes.container.id}.log
parsers:
- container: ~
file_identity.native: ~

Or should it be under prospector.scanner.fingerprint?

rdner · 2023-10-31T07:58:29Z

@MichaelKatsoulis there are 2 things here:

Prospector's fingerprint mode that deals with certain file operations in a more reliable way
Fingerprint file identity which is using the fingerprints calculated by the prospector as file IDs.

The correct snippet would be something like this:

            - type: filestream
              id: kubernetes-container-logs-${data.kubernetes.pod.name}-${data.kubernetes.container.id}
              prospector:
                scanner:
                  fingerprint.enabled: true
                  symlinks: true
              file_identity.fingerprint: ~
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log

MichaelKatsoulis · 2023-10-31T12:47:00Z

@rdner I played around with fingerprint using defaults in a local kind cluster and I get constant errors for most of the log files

"message":"cannot create a file descriptor for an ingest target \"/var/log/containers/local-path-provisioner-684f458cdd-w7cgt_local-path-storage_local-path-provisioner-57cb15732f6d64418c56a06a3ec2f9c5ae167a3a2902aee141b80618fdecf654.log\": filesize of \"/var/log/containers/local-path-provisioner-684f458cdd-w7cgt_local-path-storage_local-path-provisioner-57cb15732f6d64418c56a06a3ec2f9c5ae167a3a2902aee141b80618fdecf654.log\" is 763 bytes, expected at least 1024 bytes for fingerprinting","service.name":"filebeat","ecs.version":"1.6.0"}

So TBH I don't know whether it makes sense to set different defaults or to leave it to users to decide if they want this feature.

rdner · 2023-10-31T13:48:16Z

@MichaelKatsoulis but what's the issue with the message? It clearly communicates what's happening and it will pick up the file once it grows in size.

We're talking about the choice between non-working file identity that leads to data duplication and data loss, and working file identity that addresses this issue.

We have quite a high number of support tickets related to this on Kubernetes; the fingerprint file identity was created to address it.

rdner · 2023-10-31T13:52:56Z

@MichaelKatsoulis by the way, these messages are not errors. They're warnings, so the customer would know why their files are not being ingested yet.

MichaelKatsoulis · 2023-10-31T14:30:00Z

@rdner Those logs confused me because, on top of that, I could not find the logs of some test pods I had running. That was because their log files were too small, which was intentional: the workloads (like Redis or nginx) do not log unless they are used.
I understand now why this happens, and I agree with you about the usefulness of this feature.

I updated my PR accordingly. Could you take a final look?

gizas · 2023-10-31T14:52:21Z

filebeat/autodiscover/builder/hints/logs.go

+			if inputType == harvester.FilestreamType {
+				// json options should be under ndjson parser in filestream input
+				parsersTempCfg := []mapstr.M{}
+				ndjsonTempCfg := mapstr.M{}


~~Should we check if this is empty before calling next line?~~

Ignore this. I just realised that those are empty mapstr.M values.

gizas · 2023-10-31T14:54:15Z

filebeat/docs/autodiscover-hints.asciidoc

+json.add_error_key: "true"
+-----
+
+NOTE: `keys_under_root` json option of `log` input is replaced with `target` option in filestream input. Read the documentation on how to use it correctly.


We should put a link to filestream input here

Co-authored-by: Andrew Gizas <andreas.gkizas@elastic.co>

tetianakravchenko · 2024-01-04T12:28:03Z

deploy/kubernetes/filebeat-kubernetes.yaml

@@ -112,9 +112,16 @@ metadata:
 data:
  filebeat.yml: |-
    filebeat.inputs:
-    - type: container
+    - type: filestream


@MichaelKatsoulis
Shouldn't an id be defined in the input here? Or will it be generated automatically in this case?

From the docs:

Each filestream input must have a unique ID. Omitting or changing the filestream ID may cause data duplication. Without a unique ID, filestream is unable to correctly track the state of files.

So for all files matching /var/log/containers/*.log we have one filestream with a unique id, correct? Do you know what that implies compared to autodiscover, where a dedicated filestream is created per container?

MichaelKatsoulis · 2024-01-08

@tetianakravchenko Yes, an id is generated automatically. When filebeat.inputs is used instead of autodiscover, a single filestream input watches all files in the path. When autodiscover is used, there is one filestream input per discovered container, each watching a single log file.

For the metadata in the first scenario, the add_kubernetes_metadata processor is used, which requires the logs_path matcher so that it can extract the container id from the log file name and add that container's metadata.
In the autodiscover case the metadata is added by the kubernetes provider.

So yes, we have one filestream with one id for all the log collection. Both options work just fine, but with the first approach we cannot enable hints.
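
As a rough illustration of the first scenario, the sketch below derives a container id from a log file name under a configured logs path. It is not the matcher used by Beats: `containerIDFromPath` is a hypothetical helper, and it assumes the `<pod>_<namespace>_<container>-<id>.log` naming visible in the warning message earlier in this conversation.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// containerIDFromPath is a hypothetical helper. It returns the part of the
// file name after the last "-" and before ".log", provided the file lives
// under logsPath. It assumes <pod>_<namespace>_<container>-<id>.log names.
func containerIDFromPath(logsPath, file string) (string, bool) {
	if !strings.HasPrefix(file, logsPath) || path.Ext(file) != ".log" {
		return "", false
	}
	name := strings.TrimSuffix(path.Base(file), ".log")
	i := strings.LastIndex(name, "-")
	if i < 0 || i == len(name)-1 {
		return "", false
	}
	return name[i+1:], true
}

func main() {
	id, ok := containerIDFromPath(
		"/var/log/containers/",
		"/var/log/containers/local-path-provisioner-684f458cdd-w7cgt_local-path-storage_local-path-provisioner-57cb15732f6d64418c56a06a3ec2f9c5ae167a3a2902aee141b80618fdecf654.log",
	)
	fmt.Println(id, ok)
}
```

With autodiscover this lookup is unnecessary, because each input is created for a container whose id and metadata the provider already knows.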

* Use filestream input as default for hints autodiscover. Map co.elastic.logs/json* in hints to the ndjson parser of filestream
* Update filebeat-kubernetes.yaml
* Map co.elastic.logs/multiline.* hints to multiline parser of filestream input
* Update documentation
* Use file_identity.fingerprint as default way of file unique id creation

---------

Co-authored-by: Andrew Gizas <andreas.gkizas@elastic.co>

Use filestream input as default for hints autodiscover. Map co.elastic.logs/json* in hints to the ndjson parser of filestream

572d2de

MichaelKatsoulis requested review from a team as code owners October 24, 2023 10:20

MichaelKatsoulis requested review from faec, leehinman, gsantoro and tetianakravchenko October 24, 2023 10:20

botelastic bot added the needs_team label Oct 24, 2023

MichaelKatsoulis marked this pull request as draft October 24, 2023 10:20

mergify bot assigned MichaelKatsoulis Oct 24, 2023

pierrehilbert added the Team:Elastic-Agent label Oct 24, 2023

botelastic bot removed the needs_team label Oct 24, 2023

Update filebeat-kubernetes.yaml

621ee6b

MichaelKatsoulis requested review from gizas and ChrsMark and removed request for gsantoro October 24, 2023 11:40

MichaelKatsoulis added the Team:Cloudnative-Monitoring label Oct 24, 2023

MichaelKatsoulis marked this pull request as ready for review October 24, 2023 13:58

MichaelKatsoulis added 2 commits October 25, 2023 10:02

Merge branch 'main' into hints-autodiscovery-filestream-input

b417380

Add changelog entry and lint error

f431c6b

MichaelKatsoulis marked this pull request as draft October 25, 2023 11:01

Map co.elastic.logs/multiline.* hints to multiline parser of filestream input

988c850

MichaelKatsoulis marked this pull request as ready for review October 25, 2023 12:03

Update documentation

85d658d

ChrsMark approved these changes Oct 26, 2023

Fix typo

1f97bbf

cmacknz requested a review from rdner October 27, 2023 18:54

rdner reviewed Oct 30, 2023

Use file_identity.fingerprint as default way of file unique id creation

40c2957

rdner approved these changes Oct 31, 2023

gizas reviewed Oct 31, 2023

gizas reviewed Oct 31, 2023

MichaelKatsoulis and others added 3 commits November 1, 2023 13:00

Update tests

d0ece70

Update CHANGELOG.next.asciidoc

2a3b5b6

Co-authored-by: Andrew Gizas <andreas.gkizas@elastic.co>

Add filestream ndjson link in doc note

9d13102

gizas approved these changes Nov 1, 2023

MichaelKatsoulis merged commit 41ab08c into elastic:main Nov 1, 2023
30 checks passed

tetianakravchenko reviewed Jan 4, 2024
