
Filebeat log input fails when reload config under Elastic-Agent (affects Custom Logs integration) #31929

Closed
belimawr opened this issue Jun 14, 2022 · 9 comments
Labels: bug, Filebeat, Team:Elastic-Agent-Data-Plane, v8.5.0

@belimawr (Contributor)

  • Version: 8.2, main
  • Operating System: Linux (probably any)

When Filebeat is running under Elastic-Agent and a running log input has its configuration updated, the input fails to restart and Filebeat stops harvesting logs. Filebeat keeps running and Elastic-Agent still reports as healthy.

So far I have only managed to reproduce it consistently in one very specific scenario (see below).

The issue seems to be caused by some sort of race condition while updating the configuration. The current workflow (which happens entirely inside Filebeat) is:

  1. Inputs that will be updated are stopped
  2. The new inputs are started
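
For illustration, here is a minimal sketch of that stop-then-start flow, using simplified stand-in names (the real logic lives in Filebeat's cfgfile reloader). The key detail is that stopping a runner only requests the shutdown, so the old harvester's file states can still be unfinished when the replacement input tries to start:

```go
package reload

import (
	"errors"
	"fmt"
)

// Illustrative stand-in for common.ErrInputNotFinished.
var ErrInputNotFinished = errors.New("can only start an input when all related states are finished")

// Runner is a stand-in for a running input.
type Runner interface {
	Start() error
	Stop() // returns once the stop is requested, not once the input has fully finished
}

// Reloader tracks the currently running inputs, keyed by config hash.
type Reloader struct {
	runners map[uint64]Runner
}

// Reload stops outdated runners first, then starts the new ones.
func (r *Reloader) Reload(newRunners map[uint64]Runner) error {
	// Step 1: stop inputs whose configuration changed or was removed.
	for hash, runner := range r.runners {
		if _, stillWanted := newRunners[hash]; !stillWanted {
			runner.Stop() // asynchronous: harvesters may still be flushing their file states
			delete(r.runners, hash)
		}
	}
	// Step 2: start the new inputs. If an old harvester on the same file has
	// not finished yet, Start can fail with ErrInputNotFinished, which is the
	// race described in this issue.
	for hash, runner := range newRunners {
		if _, running := r.runners[hash]; running {
			continue
		}
		if err := runner.Start(); err != nil {
			return fmt.Errorf("Error creating runner from config: %w", err)
		}
		r.runners[hash] = runner
	}
	return nil
}
```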

However, for some reason, the input still fails to start with common.ErrInputNotFinished. This is logged by the centralmgmt.fleet logger and looks like this:

1 error: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::5223-36, Finished: false, Fileinfo: &{c.log 27633120 420 {85357310 63790808914 0x76342c0} {36 5223 1 33188 1000 1000 0 0 27633120 4096 53976 {1655211900 583607419} {1655212114 85357310} {1655212114 85357310} [0 0 0]}}, Source: /tmp/flog/c.log, Offset: 27659552, Timestamp: 2022-06-14 13:10:41.752031002 +0000 UTC m=+128.920081638, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 5223-36}

There also seems to be an issue communicating with Elasticsearch after the update; it is logged by esclientleg and looks like this:

Error reading from connection: read tcp aaaa.bbbb.cccc.dddd:41486->xxxx.yyyy.zzzz.vvvv:443: use of closed network connection
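
For context, "use of closed network connection" is the message of Go's net.ErrClosed, which is returned when code keeps using a connection that has already been closed on the local side. A minimal, standalone reproduction of the error's semantics (not Filebeat code):

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

func main() {
	// A throwaway listener so we have something to dial.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	conn.Close() // close locally, then keep using the connection

	_, err = conn.Read(make([]byte, 1))
	fmt.Println(err)                           // "read tcp ...: use of closed network connection"
	fmt.Println(errors.Is(err, net.ErrClosed)) // true
}
```

This suggests something in the Elasticsearch output path kept reading from a connection that had already been torn down during the reload.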

Steps to Reproduce:

  1. Create an Elastic Stack deployment, including Fleet-Server
  2. Deploy Elastic-Agent
  3. Have something constantly writing to a log file (https://github.com/mingrammer/flog is a good option for testing)
  4. Add the "Custom Logs" integration reading from the file created in step 3
  5. Make sure the data is correctly ingested
  6. Create a dummy ingest pipeline; it can be as simple as [{"set":{"field":"foo","value":"bar"}}]. I'll refer to it as foo (see the example after this list).
  7. Edit the "Custom Logs" integration and, under Custom log file -> Advanced options -> Custom configurations, add the following snippet: pipeline: foo.
  8. Apply the changes.
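
For step 6, the dummy pipeline can be created from Kibana Dev Tools; the name foo is what step 7 references in the custom configuration:

```
PUT _ingest/pipeline/foo
{
  "processors": [
    { "set": { "field": "foo", "value": "bar" } }
  ]
}
```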

The log file will stop being harvested, but Elastic-Agent still reports as healthy. The errors described above are logged and can be found in Kibana or in the log files.

@belimawr belimawr self-assigned this Jun 14, 2022
@belimawr belimawr added bug Filebeat Filebeat Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels Jun 14, 2022
@elasticmachine (Collaborator)

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@cmacknz (Member) commented Jun 14, 2022

My first thought is that this might be related to #30533, which was fixed in 8.1.1; it seems like it might involve the same path between the agent and the beat.

@belimawr (Contributor, Author)

> My first thought is that this might be related to #30533, which was fixed in 8.1.1; it seems like it might involve the same path between the agent and the beat.

I remember working with Anderson on this one for a bit, but I believe the problems now are different. All config blocks are applied and the whole update process happens as expected; however, I'm experiencing two kinds of errors:

  • The new input cannot start because the previous input reading the file does not seem to have finished. @kvch told me that's a known issue, and we just need to ignore a specific error type. Some parts of the code already try to ignore this error, but because it is wrapped in multiple error types, the current check does not work (see the sketch after this list). I believe I managed to fix it.
  • use of closed network connection: that's a new one I discovered after fixing the first.
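
A minimal sketch of why the existing check stops working once the error is wrapped, assuming the old code compared against the sentinel directly (errors.Is follows the wrap chain, plain equality does not):

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for common.ErrInputNotFinished from libbeat.
var ErrInputNotFinished = errors.New("Can only start an input when all related states are finished")

func main() {
	// The error is wrapped on its way up through the runner/reload layers.
	err := fmt.Errorf("Error creating runner from config: %w", ErrInputNotFinished)

	// A direct comparison no longer matches the sentinel...
	fmt.Println(err == ErrInputNotFinished) // false

	// ...but errors.Is unwraps the chain and still does.
	fmt.Println(errors.Is(err, ErrInputNotFinished)) // true
}
```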

> Sometimes a fresh start of Elastic-Agent + Beats also leads to Filebeat not harvesting the files. There seems to be more to it than the configs not being fully applied.

As far as I remember, in all my tests the configs were fully applied, but I'll double-check it in any case.

@belimawr (Contributor, Author)

I did some more testing, including using other inputs, to better assess the extent of the problem. Here is what I found:

1. The use of closed network connection error seems to be the root cause of the problems.

This happens on the main branch as well as on my PR's branch that ignores common.ErrInputNotFinished errors.
After the error occurs, the log input stops working, but other inputs, like filestream, keep working without any problems.

I managed to reliably reproduce it in two cases (both using the custom log integration):

  1. Adding a pipeline configuration to the log input.
  2. Just editing the input configuration by adding the fields entry.

One or more restarts of Filebeat, triggered by switching the policy back and forth to an empty one, solves the problem.

@belimawr (Contributor, Author)

I added all my messy debug code to my PR #31928 on 076e702, just to keep track of it.

@joshdover (Contributor)

> After the error occurs, the log input stops working, but other inputs, like filestream, keep working without any problems.

I'm not familiar with any of this code, but I'll ask the likely obvious/naive question: do we have any idea how filestream handles this? Is there a fundamental reason the log input can't handle it similarly?

@kvch (Contributor) commented Jun 16, 2022

The log input uses probably one of the oldest parts of the codebase to publish events: https://github.com/elastic/beats/blob/main/filebeat/channel/outlet.go. filestream and all newer/migrated inputs use a different, more resilient package for publishing. There is a compatibility layer between the two so they can work together in Filebeat. Maybe there is an issue with the old publisher?
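
For readers unfamiliar with that file, here is a heavily simplified, standalone sketch of the outlet pattern it implements (Event and Client are stand-ins for the real beat.Event and pipeline client, not the actual libbeat API). The notable property is that events published after Close are silently dropped rather than failing loudly, which is one way a reload race could go unnoticed:

```go
package channel

import "sync/atomic"

// Event and Client are simplified stand-ins; the real code lives in
// filebeat/channel/outlet.go and uses the libbeat publisher pipeline.
type Event struct{ Fields map[string]interface{} }

type Client interface {
	Publish(Event)
	Close() error
}

// outlet guards a pipeline client with an open/closed flag.
type outlet struct {
	client Client
	isOpen atomic.Bool
	done   chan struct{}
}

func newOutlet(client Client) *outlet {
	o := &outlet{client: client, done: make(chan struct{})}
	o.isOpen.Store(true)
	return o
}

func (o *outlet) Close() error {
	if o.isOpen.CompareAndSwap(true, false) {
		close(o.done)
		return o.client.Close()
	}
	return nil
}

// OnEvent reports whether the event was accepted. After Close it returns
// false, so a harvester still publishing through a stale outlet after a
// reload loses events quietly instead of erroring out.
func (o *outlet) OnEvent(event Event) bool {
	if !o.isOpen.Load() {
		return false
	}
	o.client.Publish(event)
	return true
}
```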

@joshdover (Contributor) commented Jun 17, 2022

I'm curious whether this or a similar bug could be related to an issue I'm seeing in a support case related to API keys. That cluster was under very high load, was having general ingestion performance problems, and ES was returning many 429 Too Many Requests errors on _bulk requests.

One of the things we noticed is that this created a situation where Fleet Server had not properly invalidated old API keys after a policy change, so we helped the customer manually invalidate these (what should have been unused) API keys.

After we ran the invalidation, we started seeing many Authentication using apikey failed - api key has been invalidated errors up to 24 hours later, followed by Authentication using apikey failed - unable to find apikey with id, which is expected because invalidated API keys are deleted by ES 24 hours after invalidation.

So I'm curious if the following can happen:

  • ES is under very high load, responding to _bulk ingest requests with 429 errors
  • User makes an integration policy change while in this state
  • Fleet Server ships the new policy + new API key to agents
  • Agent sends the new configuration to Filebeat, but we run into some variation of this bug where connecting to ES prevents the log input from shutting down properly and it continues trying to ingest data with the old API key
  • Agent ack's the new policy + API key to Fleet Server, even though Filebeat isn't actually using it
  • Fleet Server attempts to invalidate the key, fails due to cluster load problems, but due to some bug in Fleet Server, it still deletes the API key from the bookkeeping document as if it had been invalidated.

I realize that's a lot of perfect storms, but there's definitely something going on in this case where the new API key is not being properly picked up between Agent <-> Filebeat.

cc @AndersonQ @ph

@belimawr (Contributor, Author)

Closing this issue as it has been solved by #32309.
