[Filebeat] Filestream running as Log input under Elastic Agent or feature flag #46587
Conversation
The more I think about this, the more I don't like it: this is a "one shot" migration with no way back if there is a problem. Before flipping the feature flag, can we make a copy of the registry and allow switching back to the original copy? One way around this would be to build an agent action that deletes the state and starts from scratch, so that if you turn the feature flag off you also have to reset the state. Having a way to reset state is already a common-ish ask, so this is two birds with one stone. Another way would be to make a copy of the registry before migrating and allow switching back to it (for example, by adding the owning input name to the registry filename). This would still duplicate data, but likely a lot less than starting from scratch, depending on how far the migration got.
That is how we designed the feature from the beginning. We have never had a way to roll back the "take over" under Elastic Agent, and on standalone Filebeat the only option was to manually restore the registry, which affects all inputs, not only the one where "take over" was enabled. Rolling back would cause data to be re-ingested from the moment of the migration. The Log input has an undocumented behaviour of loading a copy of its state in memory and writing it back during shutdown. If we decided to rely on this behaviour, its state is effectively never fully removed, so disabling the migration would make the files' state be picked up from where the Log input left off. If we want to rely on it, I'd like to dive deeper into this part of the code to make sure we're taking all corner cases into account.
Making a copy of the registry is tricky because we designed this feature to be enabled at the input level: by the time this code runs, the registry is already up and running, so it is not as simple as copying the folder. There is also no 'history' of configuration changes from a Filebeat perspective, especially when running under Elastic Agent. So if this backup/rollback is a requirement, I'd say we need to pretty much design it from scratch instead of trying to patch it into the current Log -> Filestream migration.
Both are interesting ideas and would make things simpler. The biggest challenge I see with them is that our state is per running process rather than per input: users only see the input level, while Filebeat mostly sees the process level. So the best way to go about it is really to start from scratch: define the requirements and constraints, then look into how we can implement this in our current architecture. Honestly, it might make more sense to wait for the other, bigger re-designs we've been talking about doing in the registry.
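For concreteness, the "tag the backup with the owning input name" idea floated above might look roughly like the sketch below. This is purely hypothetical: none of these helpers exist in Beats, and, as noted, the store is already open and shared by all inputs by the time the migration runs, so a plain file copy like this would not be sufficient on its own.

```go
package registrybackup

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// BackupForInput copies the registry log to a sibling file tagged with the
// owning input name, so a later rollback could restore it. Hypothetical
// sketch only: it assumes the store can be safely copied, which is not the
// case while Filebeat is running.
func BackupForInput(registryDir, inputID string) (string, error) {
	src := filepath.Join(registryDir, "filebeat", "log.json")
	dst := filepath.Join(registryDir, "filebeat", fmt.Sprintf("log-%s.bak.json", inputID))

	in, err := os.Open(src)
	if err != nil {
		return "", err
	}
	defer in.Close()

	out, err := os.Create(dst)
	if err != nil {
		return "", err
	}
	defer out.Close()

	if _, err := io.Copy(out, in); err != nil {
		return "", err
	}
	return dst, out.Sync()
}
```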
I definitely think we need to go back and think more about the UX of this and how we are going to handle things going wrong. We do not want to break people's audit-logging use cases with no recovery mechanism. "Switch back to the Log input from where it left off" is exactly the recovery mechanism you'd want, I think: imagine a failure where the translation to Filestream options used the wrong configuration and caused data to be parsed incorrectly, or similar. Going back and re-ingesting the available data with the Log input is what you'd want to do. We could take this PR as a PoC proving the mechanism works, then pause and make changes to the state handling to make this easier to deal with as the next step, if that's the path we decide on. I added discussing this further to the data plane team agenda this week.
What is the recovery mechanism here? Is it about going back to the Log input or the state of the files?
Yeah, that makes total sense. It will require designing this from the ground up, as it wasn't part of the original design and there are a number of layers/steps between an Elastic Agent integration (what users see) and a Filebeat input (the code that actually runs); when we add the store to the mix, separating things gets much more complicated.
I really like this idea. The net new code from this PR for the case where the Log input keeps running as the Log input is very small, simple, and extensively exercised by every Log input integration test; once this is merged, any use of the Log input by Elastic Agent and the integration tests will exercise it as well. I see this part of the code as low risk, so we could merge this PR sooner and keep working on the other features we want before we start deploying Log as Filestream.
Thanks!
The last failed CI run was because of another test using the
mauri870 left a comment:
Codewise it looks good to me. I tested locally following the description and confirmed run_as_filestream switches the input properly. Let's try to get another review from the data plane team.
Co-authored-by: Anderson Queiroz <me@andersonq.me>
Proposed commit message
- I have made corresponding changes to the documentation
- I have made corresponding changes to the default configuration files
- I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc

Disruptive User Impact

Author's Checklist
How to test this PR locally
Manual test
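The standalone Filebeat configuration for this test is not included in the excerpt; a minimal sketch along the following lines should reproduce the log lines below. The input id log-as-filestream and the path /tmp/flog.log are taken from those log lines; everything else (the output settings and the exact key under the features: section) is an assumption.

```yaml
filebeat.inputs:
  - type: log
    # Assumed to carry over as the Filestream id after the migration
    # (the Filestream logs below show "id": "log-as-filestream").
    id: log-as-filestream
    paths:
      - /tmp/flog.log

# The migration is toggled through the `features:` section of the
# configuration (see the step below); the exact key is not reproduced here.

output.elasticsearch:
  hosts: ["https://localhost:9200"]
  api_key: "<your-api-key>"
```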
{ "log.level": "debug", "@timestamp": "2025-09-12T12:28:06.899-0400", "log.logger": "input.harvester", "log.origin": { "function": "github.com/elastic/beats/v7/filebeat/input/log.(*Log).Read", "file.name": "log/log.go", "file.line": 111 }, "message": "End of file reached: /tmp/flog.log; Backoff now.", "service.name": "filebeat", "input_id": "94a20b13-6927-4ff4-8f99-4f750469ed96", "source_file": "/tmp/flog.log", "state_id": "native::26052-40", "finished": false, "os_id": "26052-40", "harvester_id": "69128be5-d1f4-4493-935a-889d0461c95d", "ecs.version": "1.6.0" }features:section in the configuration{ "log.level": "debug", "@timestamp": "2025-09-12T12:31:07.586-0400", "log.logger": "input.filestream", "log.origin": { "function": "github.com/elastic/beats/v7/filebeat/input/filestream.(*logFile).Read", "file.name": "filestream/filestream.go", "file.line": 139 }, "message": "End of file reached: /tmp/flog.log; Backoff now.", "service.name": "filebeat", "id": "log-as-filestream", "source_file": "filestream::log-as-filestream::fingerprint::445d01af94a604742ab7bb9db8b5bceff4b780925c2f8c7729165076319fc016", "path": "/tmp/flog.log", "state-id": "fingerprint::445d01af94a604742ab7bb9db8b5bceff4b780925c2f8c7729165076319fc016", "ecs.version": "1.6.0" }Elastic Agent
Create a log file with some lines
```sh
docker run -it --rm mingrammer/flog -n 20 > /tmp/flog.log
```
Run a standalone Elastic Agent with the following configuration (adjust the output settings as necessary)
elastic-agent.yml
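The elastic-agent.yml contents are collapsed in this excerpt; a minimal sketch of what it might look like is below. The stream id your-log-stream-id and the path /tmp/flog.log come from the registry excerpt further down; the output settings, the input id, and the exact placement of run_as_filestream are assumptions.

```yaml
outputs:
  default:
    type: elasticsearch
    hosts: ["https://localhost:9200"]
    api_key: "<your-api-key>"

inputs:
  # The "logfile" input type is what runs Filebeat's Log input under Agent.
  - id: your-log-input-id   # hypothetical id
    type: logfile
    use_output: default
    streams:
      - id: your-log-stream-id
        data_stream:
          dataset: generic
        paths:
          - /tmp/flog.log
        # Uncomment to switch this Log input to run as Filestream
        # (assumed placement of the flag introduced by this PR):
        # run_as_filestream: true
```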
Ensure all events have been ingested
Look at the logs, you will see Log input logs as described in the manual test
Stop the Elastic Agent
Uncomment run_as_filestream: true from the configuration
Start the Elastic Agent again
Ensure no more data is added to the output (no data duplication).
Look at the logs, you will see Filestream input logs as described in the manual test
You can also collect the diagnostics and look at the registry
In components/log-default:

```sh
tar -xf registry.tar.gz
cat registry/filebeat/log.json | jq -Sc
```

```json
{"id":3,"op":"set"}
{"k":"filebeat::logs::native::16-50","v":{"FileStateOS":{"device":50,"inode":16},"id":"native::16-50","identifier_name":"native","offset":2113,"prev_id":"","source":"/tmp/flog.log","timestamp":[280186759520503,1762292780],"ttl":-1,"type":"log"}}
{"id":4,"op":"set"}
{"k":"filestream::your-log-stream-id::native::16-50","v":{"cursor":{"offset":2113},"meta":{"identifier_name":"native","source":"/tmp/flog.log"},"ttl":-1,"updated":[281470681743360,18446744011573954816]}}
{"id":5,"op":"remove"}
{"k":"filebeat::logs::native::16-50"}
{"id":6,"op":"set"}
{"k":"filestream::your-log-stream-id::native::16-50","v":{"cursor":{"offset":2113},"meta":{"identifier_name":"native","source":"/tmp/flog.log"},"ttl":-1,"updated":[281470681743360,18446744011573954816]}}
```

Run the tests
Related issues
Skipped Flaky tests
Log in goroutine after TestNegativeCases has completed #47698

Use cases

Screenshots

Logs