
Conversation

@belimawr
Contributor

@belimawr belimawr commented Sep 12, 2025

Proposed commit message

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

Author's Checklist

How to test this PR locally

Manual test

  1. Create a log file with at least 1 KB of data
    docker run -it --rm mingrammer/flog -n 20 > /tmp/flog.log
    
  2. Start Filebeat with the following configuration
    filebeat.inputs:
      - type: log
        id: log-as-filestream
        allow_deprecated_use: true
        paths:
          - /tmp/flog.log
    
    output.file:
      path: "${path.home}"
      filename: output
      rotate_on_startup: false  
    
    queue.mem:
      flush.timeout: 0
    
    logging:
      to_stderr: true
      level: debug
      selectors:
        - "*"
    
    #features:
    #  log_input_run_as_filestream:
    #    enabled: true

  3. Look at the logs; you will see some logs from the Log input
    {
      "log.level": "debug",
      "@timestamp": "2025-09-12T12:28:06.899-0400",
      "log.logger": "input.harvester",
      "log.origin": {
        "function": "github.com/elastic/beats/v7/filebeat/input/log.(*Log).Read",
        "file.name": "log/log.go",
        "file.line": 111
      },
      "message": "End of file reached: /tmp/flog.log; Backoff now.",
      "service.name": "filebeat",
      "input_id": "94a20b13-6927-4ff4-8f99-4f750469ed96",
      "source_file": "/tmp/flog.log",
      "state_id": "native::26052-40",
      "finished": false,
      "os_id": "26052-40",
      "harvester_id": "69128be5-d1f4-4493-935a-889d0461c95d",
      "ecs.version": "1.6.0"
    }
  4. Stop Filebeat
  5. Check the number of events published
    % wc -l output-*.ndjson        
    20 output-20250912.ndjson
    
  6. Uncomment the features: section in the configuration

  7. Start Filebeat again; you'll see some logs from the Filestream input
    {
      "log.level": "debug",
      "@timestamp": "2025-09-12T12:31:07.586-0400",
      "log.logger": "input.filestream",
      "log.origin": {
        "function": "github.com/elastic/beats/v7/filebeat/input/filestream.(*logFile).Read",
        "file.name": "filestream/filestream.go",
        "file.line": 139
      },
      "message": "End of file reached: /tmp/flog.log; Backoff now.",
      "service.name": "filebeat",
      "id": "log-as-filestream",
      "source_file": "filestream::log-as-filestream::fingerprint::445d01af94a604742ab7bb9db8b5bceff4b780925c2f8c7729165076319fc016",
      "path": "/tmp/flog.log",
      "state-id": "fingerprint::445d01af94a604742ab7bb9db8b5bceff4b780925c2f8c7729165076319fc016",
      "ecs.version": "1.6.0"
    }
  8. Check the number of events published; it should still be 20
    % wc -l output-*.ndjson        
    20 output-20250912.ndjson
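
The event counts above can also be cross-checked programmatically. The sketch below is illustrative only (it is not part of the PR): it assumes the file output's NDJSON layout with a `message` field, and counts total vs. unique events so a duplication after the switch to Filestream would show up as `total > unique`.

```python
# Illustrative helper (not from this PR): count total and unique events in
# Filebeat's file output to confirm nothing was re-ingested after restart.
import glob
import json

def count_unique_messages(pattern="output-*.ndjson"):
    """Return (total, unique) counts of the 'message' field across output files."""
    messages = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    messages.append(json.loads(line).get("message"))
    return len(messages), len(set(messages))
```

Run it from the Filebeat home directory after stopping Filebeat; for the test above, both numbers should be 20.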
    

Elastic Agent

  1. Create a log file with some lines
    docker run -it --rm mingrammer/flog -n 20 > /tmp/flog.log

  2. Run a standalone Elastic Agent with the following configuration (adjust the output settings as necessary)

    elastic-agent.yml

    outputs:
      default:
        type: elasticsearch
        hosts:
          - https://localhost:9200
        username: "elastic"
        password: "changeme"
        preset: latency
        ssl.verification_mode: none
    
    inputs:
      - type: log
        id: your-input-id
        streams:
          - id: your-log-stream-id
            data_stream:
              dataset: generic
            # run_as_filestream: true
            paths:
              - /tmp/flog.log
    
    agent.monitoring:
      enabled: false
      logs: false
      metrics: false
    
    agent.logging:
      level: debug
      to_stderr: true

  3. Ensure all events have been ingested

  4. Look at the logs; you will see Log input logs as described in the manual test

  5. Stop the Elastic Agent

  6. Uncomment run_as_filestream: true from the configuration

  7. Start the Elastic Agent again

  8. Ensure no more data is added to the output and that there is no data duplication.

  9. Look at the logs, you will see Filestream input logs as described in the manual test

  10. You can also collect the diagnostics and look at the registry

    1. Collect the diagnostics and extract it
    2. Go to components/log-default
    3. Extract the registry: tar -xf registry.tar.gz
    4. cat registry/filebeat/log.json | jq -Sc
    5. You will see the entries from the Filestream input starting with the same offset as the ones from the Log input
      {"id":3,"op":"set"}
      {"k":"filebeat::logs::native::16-50","v":{"FileStateOS":{"device":50,"inode":16},"id":"native::16-50","identifier_name":"native","offset":2113,"prev_id":"","source":"/tmp/flog.log","timestamp":[280186759520503,1762292780],"ttl":-1,"type":"log"}}
      {"id":4,"op":"set"}
      {"k":"filestream::your-log-stream-id::native::16-50","v":{"cursor":{"offset":2113},"meta":{"identifier_name":"native","source":"/tmp/flog.log"},"ttl":-1,"updated":[281470681743360,18446744011573954816]}}                    
      {"id":5,"op":"remove"}
      {"k":"filebeat::logs::native::16-50"}
      {"id":6,"op":"set"}
      {"k":"filestream::your-log-stream-id::native::16-50","v":{"cursor":{"offset":2113},"meta":{"identifier_name":"native","source":"/tmp/flog.log"},"ttl":-1,"updated":[281470681743360,18446744011573954816]}}
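
For reference, the registry operation log shown above alternates action lines (`{"id": N, "op": ...}`) with key/value lines. A minimal sketch of pairing them up (the layout is assumed from the dump above; this is illustrative, not code from this PR) makes it easy to confirm the Filestream entry resumes at the Log input's offset:

```python
# Illustrative parser for Filebeat's registry operation log (log.json),
# assuming each action line is followed by one key/value line.
import json

def parse_registry_log(lines):
    """Pair action lines with key/value lines; return (op, key, value) tuples."""
    entries = []
    for action, data in zip(lines[0::2], lines[1::2]):
        op = json.loads(action)["op"]
        d = json.loads(data)
        entries.append((op, d.get("k"), d.get("v")))
    return entries
```

With the dump above, the `filebeat::logs` entry's `offset` and the `filestream` entry's `cursor.offset` should both read 2113.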

Run the tests

cd filebeat
mage clean
go test -v -count=1 -run=TestRunAsFilestream ./input/logv2
mage BuildSystemTestBinary 
go test -v -count=1 -tags=integration -run=TestLogAsFilestream ./tests/integration 

cd ../x-pack/filebeat
mage clean
mage BuildSystemTestBinary
go test -v -count=1 -tags=integration -run=TestLogAsFilestream ./tests/integration

Related issues

Skipped Flaky tests

Use cases

Screenshots

Logs

@belimawr belimawr self-assigned this Sep 12, 2025
@belimawr belimawr added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Sep 12, 2025
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Sep 12, 2025
@github-actions
Contributor

🤖 GitHub comments


Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@mergify
Contributor

mergify bot commented Sep 12, 2025

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@belimawr belimawr changed the title [WIP] PoC to run Filestream as log input [PoC] Filestream running as log input Sep 12, 2025
@cmacknz
Member

cmacknz commented Nov 14, 2025

> we can only migrate states from Log -> Filestream and the Log input state is removed after the migration/take over.

The more I think about this, the more I don't like it: this is a "one shot" migration with no way back if there is a problem. Before flipping the feature flag, can we make a copy of the registry and allow switching back to the original copy?

One way around this would be to build an agent action that deletes the state and starts from scratch, so that if you turn the feature flag off you also need to do that. Having a way to reset state is already a common-ish ask so this is two birds with one stone.

Another way would be to make a copy of the registry before migrating and allow switching back to it (something like adding something to the registry filename indicating the owning input name?). This would still duplicate data, but likely a lot less data than starting from scratch depending on how far it got.
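
Purely as an illustration of the snapshot idea above (nothing like this exists in the PR; the function names, the filename suffix, and the per-input tagging are all hypothetical), the backup/restore could look roughly like:

```python
# Hypothetical sketch: snapshot the registry directory before migration,
# tagging the copy with the owning input name, and restore it on rollback.
import shutil
from pathlib import Path

def snapshot_registry(registry_dir, input_name):
    """Copy the registry to a sibling directory tagged with the input name."""
    src = Path(registry_dir)
    dst = src.with_name(f"{src.name}.pre-migration.{input_name}")
    shutil.copytree(src, dst)
    return dst

def restore_registry(registry_dir, input_name):
    """Replace the live registry with the pre-migration snapshot."""
    src = Path(registry_dir)
    snapshot = src.with_name(f"{src.name}.pre-migration.{input_name}")
    shutil.rmtree(src)
    shutil.copytree(snapshot, src)
```

As the discussion below notes, the real difficulty is that Filebeat's registry is per process, not per input, so a directory-level copy like this would roll back every input's state at once.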

@belimawr
Contributor Author

> The more I think about this, the more I don't like it: this is a "one shot" migration with no way back if there is a problem.

That is how we designed the feature from the beginning. We have never had a way to roll back the "take over" under Elastic Agent, and on a standalone Filebeat the only way was to manually restore the registry, which affects all inputs, not only the one where "take over" was enabled.

Rolling back would cause data to be re-ingested from the moment of the migration. The Log input has an undocumented behaviour of loading a copy of its state in memory and writing it back during shutdown. If we decided to rely on this behaviour, its state is effectively never fully removed, so disabling the migration would let the state of the file be picked up from where the Log input left off.

If we want to rely on it, I'd like to dive more into this part of the code to make sure we're taking into account all corner cases.

> Before flipping the feature flag, can we make a copy of the registry and allow switching back to the original copy?

Making a copy of the registry is tricky because we designed this feature to be enabled at the input level; by the time this code runs, the registry is already up and running, so it is not as simple as copying the folder.

There is also no "history" of configuration changes from a Filebeat perspective, especially when running under Elastic Agent. So if this backup/rollback is a requirement, I'd say we need to design it pretty much from scratch instead of trying to patch it into the current Log -> Filestream migration.

> One way around this would be to build an agent action that deletes the state and starts from scratch, so that if you turn the feature flag off you also need to do that. Having a way to reset state is already a common-ish ask so this is two birds with one stone.

> Another way would be to make a copy of the registry before migrating and allow switching back to it (something like adding something to the registry filename indicating the owning input name?). This would still duplicate data, but likely a lot less data than starting from scratch depending on how far it got.

Both are interesting ideas and would make things simpler. The biggest challenge I see with them is that our state is per running process rather than per input: users only see the input level, while Filebeat mostly sees the process level.

So the best way to go about it is really to start from scratch: defining the requirements and constraints, then looking into how we can implement this in our current architecture.

Honestly, it might make more sense to wait for other bigger re-designs we've been talking about doing in the registry:

@cmacknz
Member

cmacknz commented Nov 17, 2025

I definitely think we need to go back and think more about the UX of this and how we are going to handle things going wrong. We do not want to be breaking people's audit logging use cases with no recovery mechanism.

"Switch back to the log input from where it left off" is exactly the recovery mechanism you'd want, I think. Imagine a failure where the translation to Filestream options used the wrong configuration and caused data to be parsed incorrectly or similar. Going back and re-ingesting the available data with the Log input is what you'd want to do.

We could take this as a PoC proving the mechanism works, then pause and make changes to the state to make this easier to deal with as the next step, if that's the path we decide on.

I added discussing this more to the data plane team agenda this week.

@belimawr
Contributor Author

> I definitely think we need to go back and think more about the UX of this and how we are going to handle things going wrong. We do not want to be breaking people's audit logging use cases with no recovery mechanism.

What is the recovery mechanism here? Is it about going back to the Log input or the state of the files?

> "Switch back to the log input from where it left off" is exactly the recovery mechanism you'd want, I think. Imagine a failure where the translation to Filestream options used the wrong configuration and caused data to be parsed incorrectly or similar. Going back and re-ingesting the available data with the Log input is what you'd want to do.

Yeah, that makes total sense. It will require designing this from the ground up, as it wasn't part of the original design, and there are a number of layers/steps between an Elastic Agent integration (what users see) and a Filebeat input (the code that actually runs); when we add the store to the mix, separating things gets much more complicated.

> We could take this as a PoC proving the mechanism works, then pause and make changes to the state to make this easier to deal with as the next step, if that's the path we decide on.

I really like this idea. The net new code in this PR for the Log input running as the Log input is very small, simple, and extensively exercised by every Log input integration test, which, once merged, will cover any use of the Log input by Elastic Agent and the integration tests. I see this part of the code as low risk, so we could merge this PR sooner and keep working on the other features we want before we start deploying Log as Filestream.

> I added discussing this more to the data plane team agenda this week.

Thanks!

@belimawr
Contributor Author

The last failed CI run was caused by another test using the testing.T as a logger and logging after the test had ended: #47698

@belimawr belimawr requested a review from cmacknz November 21, 2025 13:59
@pierrehilbert pierrehilbert added the Team:Obs-InfraObs Label for the Observability Infrastructure Monitoring team label Nov 21, 2025
Member

@mauri870 mauri870 left a comment


Codewise it looks good to me. I tested locally following the description and confirmed run_as_filestream switches the input properly. Let's try to get another review from the data plane team.

@rdner rdner requested a review from v1v November 26, 2025 10:24
rdner and others added 2 commits December 2, 2025 13:51
@belimawr belimawr merged commit 96dc1b3 into elastic:main Dec 3, 2025
206 of 209 checks passed

Labels

  • backport-skip: Skip notification from the automated backport with mergify
  • skip-changelog
  • Team:Elastic-Agent-Data-Plane: Label for the Agent Data Plane team
  • Team:Obs-InfraObs: Label for the Observability Infrastructure Monitoring team


9 participants