
Filebeat log input fails when reload config under Elastic-Agent (affects Custom Logs integration) #31929

Closed
belimawr opened this issue Jun 14, 2022 · 9 comments
Labels: bug, Filebeat, Team:Elastic-Agent-Data-Plane, v8.5.0

@belimawr (Contributor)

  • Version: 8.2, main
  • Operating System: Linux (probably any)

When Filebeat is running under Elastic-Agent and a running log input has its configuration updated, the input fails to restart and Filebeat stops harvesting logs. Filebeat keeps running and Elastic-Agent still reports as healthy.

So far I have only managed to reproduce it consistently in one very specific scenario (see below).

The issue seems to be caused by some sort of race condition while updating the configuration. The current workflow (which happens entirely inside Filebeat) is:

  1. Inputs that will be updated are stopped
  2. The new inputs are started
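
For illustration, here is a minimal sketch of that stop-then-start flow, using simplified stand-in names (the real logic lives in Filebeat's cfgfile reloader). The key detail is that stopping a runner only requests the shutdown, so the old harvester's file states can still be unfinished when the replacement input tries to start:

```go
package reload

import (
	"errors"
	"fmt"
)

// Illustrative stand-in for common.ErrInputNotFinished.
var ErrInputNotFinished = errors.New("can only start an input when all related states are finished")

// Runner is a stand-in for a running input.
type Runner interface {
	Start() error
	Stop() // returns once the stop is requested, not once the input has fully finished
}

// Reloader tracks the currently running inputs, keyed by config hash.
type Reloader struct {
	runners map[uint64]Runner
}

// Reload stops outdated runners first, then starts the new ones.
func (r *Reloader) Reload(newRunners map[uint64]Runner) error {
	// Step 1: stop inputs whose configuration changed or was removed.
	for hash, runner := range r.runners {
		if _, stillWanted := newRunners[hash]; !stillWanted {
			runner.Stop() // asynchronous: harvesters may still be flushing their file states
			delete(r.runners, hash)
		}
	}
	// Step 2: start the new inputs. If an old harvester on the same file has
	// not finished yet, Start can fail with ErrInputNotFinished, which is the
	// race described in this issue.
	for hash, runner := range newRunners {
		if _, running := r.runners[hash]; running {
			continue
		}
		if err := runner.Start(); err != nil {
			return fmt.Errorf("Error creating runner from config: %w", err)
		}
		r.runners[hash] = runner
	}
	return nil
}
```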

However, for some reason, the input still fails to start with common.ErrInputNotFinished. This is logged by the centralmgmt.fleet logger and looks like this:

1 error: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::5223-36, Finished: false, Fileinfo: &{c.log 27633120 420 {85357310 63790808914 0x76342c0} {36 5223 1 33188 1000 1000 0 0 27633120 4096 53976 {1655211900 583607419} {1655212114 85357310} {1655212114 85357310} [0 0 0]}}, Source: /tmp/flog/c.log, Offset: 27659552, Timestamp: 2022-06-14 13:10:41.752031002 +0000 UTC m=+128.920081638, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 5223-36}

There also seems to be an issue communicating with Elasticsearch after the update; it is logged by esclientleg and looks like this:

Error reading from connection: read tcp aaaa.bbbb.cccc.dddd:41486->xxxx.yyyy.zzzz.vvvv:443: use of closed network connection
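
For context, "use of closed network connection" is the message of Go's net.ErrClosed, which is returned when code keeps using a connection that has already been closed on the local side. A minimal, standalone reproduction of the error's semantics (not Filebeat code):

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

func main() {
	// A throwaway listener so we have something to dial.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	conn.Close() // close locally, then keep using the connection

	_, err = conn.Read(make([]byte, 1))
	fmt.Println(err)                           // "read tcp ...: use of closed network connection"
	fmt.Println(errors.Is(err, net.ErrClosed)) // true
}
```

This suggests something in the Elasticsearch output path kept reading from a connection that had already been torn down during the reload.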

Steps to Reproduce:

  1. Create an Elastic Stack deployment, including Fleet-Server
  2. Deploy Elastic-Agent
  3. Have something constantly writing to a log file (https://github.com/mingrammer/flog is a good option for testing)
  4. Add the "Custom Logs" integration reading from the file created in step 3
  5. Make sure the data is correctly ingested
  6. Create a dummy ingest pipeline; it can be as simple as [{"set":{"field":"foo","value":"bar"}}]. I'll refer to it as foo (see the example after this list).
  7. Edit the "Custom Logs" integration and, under Custom log file -> Advanced options -> Custom configurations, add the following snippet: pipeline: foo.
  8. Apply the changes.
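
For step 6, the dummy pipeline can be created from Kibana Dev Tools; the name foo is what step 7 references in the custom configuration:

```
PUT _ingest/pipeline/foo
{
  "processors": [
    { "set": { "field": "foo", "value": "bar" } }
  ]
}
```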

The log file will stop being harvested, but Elastic-Agent still reports as healthy. The errors described above are logged and can be found in Kibana or in the log files.

@belimawr belimawr self-assigned this Jun 14, 2022
@belimawr belimawr added bug Filebeat Filebeat Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels Jun 14, 2022
@elasticmachine (Collaborator)

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@cmacknz (Member) commented Jun 14, 2022

My first thought is that this might be related to #30533, which was fixed in 8.1.1; it seems like it might involve the same path between the agent and the beat.

@belimawr (Contributor, Author)

> My first thought is that this might be related to #30533, which was fixed in 8.1.1; it seems like it might involve the same path between the agent and the beat.

I remember working with Anderson on this one for a bit, but I believe the problems now are different. All config blocks are applied and the whole update process happens as expected; however, I'm experiencing two kinds of errors:

  • The new input cannot start because the previous input reading the file does not seem to have finished. @kvch told me that's a known issue, and we just need to ignore a specific error type. Some parts of the code already try to ignore this error, but because it is wrapped in multiple error types, the current check does not work (see the sketch after this list). I believe I managed to fix it.
  • use of closed network connection: that's a new one I discovered after fixing the first.
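
A minimal sketch of why the existing check stops working once the error is wrapped, assuming the old code compared against the sentinel directly (errors.Is follows the wrap chain, plain equality does not):

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for common.ErrInputNotFinished from libbeat.
var ErrInputNotFinished = errors.New("Can only start an input when all related states are finished")

func main() {
	// The error is wrapped on its way up through the runner/reload layers.
	err := fmt.Errorf("Error creating runner from config: %w", ErrInputNotFinished)

	// A direct comparison no longer matches the sentinel...
	fmt.Println(err == ErrInputNotFinished) // false

	// ...but errors.Is unwraps the chain and still does.
	fmt.Println(errors.Is(err, ErrInputNotFinished)) // true
}
```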

> Sometimes a fresh start of Elastic-Agent + Beats also leads to Filebeat not harvesting the files. There seems to be more to it than the configs not being fully applied.

As far as I remember, in all my tests the configs were fully applied, but I'll double-check it in any case.

@belimawr (Contributor, Author)

I did some more testing, including using other inputs, to better assess the extent of the problem. Here is what I found:

1. The use of closed network connection error seems to be the root cause of the problems.

This happens on the main branch as well as on my PR's branch that ignores common.ErrInputNotFinished errors.
After the error occurs, the log input stops working, but other inputs, like filestream, keep working without any problems.

I managed to reliably reproduce it in two cases (both using the custom log integration):

  1. Adding a pipeline configuration to the log input.
  2. Just editing the input configuration by adding the fields entry.

One or more restarts of Filebeat, triggered by switching the policy back and forth to an empty one, solves the problem.

@belimawr (Contributor, Author)

I added all my messy debug code to my PR #31928 on 076e702, just to keep track of it.

@joshdover (Contributor)

> After the error occurs, the log input stops working, but other inputs, like filestream, keep working without any problems.

I'm not familiar with any of this code, but I'll ask the likely obvious/naive question: do we have any idea how filestream handles this? Is there a fundamental reason the log input can't handle it similarly?

@kvch (Contributor) commented Jun 16, 2022

The log input uses probably one of the oldest parts of the codebase to publish events: https://github.com/elastic/beats/blob/main/filebeat/channel/outlet.go. filestream and all newer/migrated inputs use a different, more resilient package for publishing. There is a compatibility layer between the two so they can work together in Filebeat. Maybe there is an issue with the old publisher?
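
For readers unfamiliar with that file, here is a heavily simplified, standalone sketch of the outlet pattern it implements (Event and Client are stand-ins for the real beat.Event and pipeline client, not the actual libbeat API). The notable property is that events published after Close are silently dropped rather than failing loudly, which is one way a reload race could go unnoticed:

```go
package channel

import "sync/atomic"

// Event and Client are simplified stand-ins; the real code lives in
// filebeat/channel/outlet.go and uses the libbeat publisher pipeline.
type Event struct{ Fields map[string]interface{} }

type Client interface {
	Publish(Event)
	Close() error
}

// outlet guards a pipeline client with an open/closed flag.
type outlet struct {
	client Client
	isOpen atomic.Bool
	done   chan struct{}
}

func newOutlet(client Client) *outlet {
	o := &outlet{client: client, done: make(chan struct{})}
	o.isOpen.Store(true)
	return o
}

func (o *outlet) Close() error {
	if o.isOpen.CompareAndSwap(true, false) {
		close(o.done)
		return o.client.Close()
	}
	return nil
}

// OnEvent reports whether the event was accepted. After Close it returns
// false, so a harvester still publishing through a stale outlet after a
// reload loses events quietly instead of erroring out.
func (o *outlet) OnEvent(event Event) bool {
	if !o.isOpen.Load() {
		return false
	}
	o.client.Publish(event)
	return true
}
```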

@joshdover (Contributor) commented Jun 17, 2022

I'm curious whether this or a similar bug could be related to an issue I'm seeing in a support case related to API keys. That cluster was under very high load, was having general ingestion performance problems, and ES was returning many 429 Too Many Requests errors on _bulk requests.

One of the things we noticed is that this created a situation where Fleet Server had not properly invalidated old API keys after a policy change, so we helped the customer manually invalidate these (what should have been unused) API keys.

After we ran the invalidation, we started seeing many Authentication using apikey failed - api key has been invalidated errors up to 24 hours later, followed by Authentication using apikey failed - unable to find apikey with id, which is expected because invalidated API keys are deleted by ES 24 hours after invalidation.

So I'm curious if the following can happen:

  • ES is under very high load, responding to _bulk ingest requests with 429 errors
  • User makes an integration policy change while in this state
  • Fleet Server ships the new policy + new API key to agents
  • Agent sends the new configuration to Filebeat, but we run into some variation of this bug where connecting to ES prevents the log input from shutting down properly and it continues trying to ingest data with the old API key
  • Agent ack's the new policy + API key to Fleet Server, even though Filebeat isn't actually using it
  • Fleet Server attempts to invalidate the key, fails due to cluster load problems, but due to some bug in Fleet Server, it still deletes the API key from the bookkeeping document as if it had been invalidated.

I realize that's a lot of perfect storms, but there's definitely something going on in this case where the new API key is not being properly picked up between Agent <-> Filebeat.

cc @AndersonQ @ph

@belimawr (Contributor, Author)

Closing this issue as it has been solved by #32309.
