
Filebeat running under Elastic-Agent not harvesting logs after restart #30533

Closed
AndersonQ opened this issue Feb 22, 2022 · 26 comments · Fixed by #30694
Labels: backport-v7.17.0, backport-v8.0.0, backport-v8.1.0, blocker, bug, QA:Validated, Team:Elastic-Agent-Control-Plane, v8.2.0

Comments

@AndersonQ
Member

AndersonQ commented Feb 22, 2022

After a restart, a filebeat running under the elastic-agent doesn't start harvesting logs. Upon restart filebeat receives the config from the elastic-agent and processes it, however only the output block is applied. It happens inconsistently; so far it has been reported on Linux and Windows endpoints.

In a fleet of agents enrolled to the same fleet-server only a few will show this behaviour. Changing the elastic-agent's policy or restarting it fixes the problem.

Even when an agent shows this behaviour, there are no errors in the logs of either filebeat or elastic-agent.

Workarounds:

tl;dr:

  • modify the policy. E.g. add a dummy custom log integration OR
  • swap the policy assigned to the affected agent. E.g. assign a dummy policy and then assign back the proper policy OR
  • restart the elastic-agent

the long version:
The elastic-agent needs to receive a new configuration/policy from fleet-server and forward it to filebeat. Any of the above methods will cause the elastic-agent to receive a new policy and update filebeat's configuration.

Known facts:

  • When a new configuration with changes is sent to the Agent, it will restart sending events.
    This means that the system is correctly configured and sane, and that it is able to recover from the situation.
  • We can confirm the configuration is available: it is retrieved by the diagnostics command.
  • We can confirm in the logs that the output receives the Elasticsearch configuration; this is the only
    block that is modified after a restart.
  • We know that Filebeat is up and running; this means the Agent detects an input size > 0.
  • If there is a problem in a configuration block, the manager in libbeat will just ignore the error and continue.
  • Filebeat loads the registrar state.
  • We know from the periodic metric dump that Filebeat doesn't start any log harvester.
  • We know from the periodic metric dump that Filebeat is not processing any events.
  • It's a random issue: in a fleet of agents, some can be healthy and send events while others are not.
  • It's confirmed to happen on Linux and Windows.

What has been tried:

  • We have tried to reproduce the issue by simulating configuration changes with nginx or a custom log harvester.
    We might have been able to reproduce it once, but we do not have the logs to inspect.

Related PRs:

@AndersonQ AndersonQ added the bug label Feb 22, 2022
@AndersonQ AndersonQ self-assigned this Feb 22, 2022
@botelastic botelastic bot added the needs_team label Feb 22, 2022
@ph ph added the Team:Elastic-Agent-Control-Plane label Feb 22, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@botelastic botelastic bot removed the needs_team label Feb 22, 2022
@jlind23 jlind23 added the backport-v7.17.0, backport-v8.0.0, backport-v8.1.0 and v8.2.0 labels Feb 23, 2022
@vladvasiliu

I've been having this issue for a while now, combined with a different one where the agent wouldn't update its policy, although it was otherwise reported as healthy.

When a new configuration with changes is sent to the Agent, it will restart sending events.

This particular workaround was suggested by Elastic support, and it seemed to help with the update issue, but it won't reliably fix the harvesting issue.

@jlind23
Collaborator

jlind23 commented Feb 28, 2022

@vladvasiliu you are right, this is why we are currently trying to find a proper fix for this issue.

@rlapond

rlapond commented Mar 2, 2022

Hi,

We're also hitting this issue, which is quite problematic since we have 300+ agents connected.

@jlind23
Collaborator

jlind23 commented Mar 2, 2022

@rlapond do you face this issue on some particular agents?
cc @ph

@rlapond

rlapond commented Mar 2, 2022

Not really; we have both 7.14.1 (older agents) and 7.16.3 running on CentOS 7 and Ubuntu, both hitting this issue. Yesterday we upgraded some agents from 7.14.1 to 7.16.3, and they started sending logs again after the upgrade.

Another agent (7.14.3) which we didn't upgrade started sending logs again after changing the log level for the agent to debug.

@AndersonQ
Member Author

@rlapond one alternative is to modify the policy, for example by adding a dummy custom log integration. It'll cause the elastic-agents to receive a new configuration and propagate it to filebeat. I believe it'd be the easiest way to fix several failing filebeats at once.

@ph
Contributor

ph commented Mar 2, 2022

@rlapond @vladvasiliu Would you say this issue is often related to either a machine restart or a service restart?

@rlapond

rlapond commented Mar 2, 2022

@ph Not confirmed on all agents, but on one agent (Linux) it definitely happened because of a machine restart after updating at 04:00 in the morning. On the Elasticsearch side we didn't receive any logs from that exact moment.

After the machine restart Elastic-agent and Filebeat were running with no errors. I had to restart the Elastic-agent service for filebeat to start shipping logs again.

@vladvasiliu

@AndersonQ : this didn't help in my case, running 8.0.0. The fleet server reports the agents are still out of date after more than an hour. Changing policies back and forth didn't help, agents are still out of date (but otherwise healthy).

@ph This seems to be triggered upon restart, yes. Most of my agents are running on POS systems which are turned off at night. I don't restart the agent otherwise. But it seems that restarting only the agent service or the whole system doesn't reliably fix the issue. However, once it starts, and it works, I've never noticed it stop working.

@ph
Contributor

ph commented Mar 2, 2022

Thanks, I am trying to reproduce that exact case.

@ph
Contributor

ph commented Mar 2, 2022

@vladvasiliu @rlapond

I've been trying to reproduce the issue locally with a fleet server under load closer to your current situation (around 1200 agents, where 99% are simulated). I've been killing, restarting gracefully, and changing the configuration between these operations, and the agent always comes back up cleanly, acks the new configuration, and reports the logs. I've looked at the code path and we have added more logging for future versions to give more details.

But I want to get to the end of this story.

Let's start differently: which logs are you monitoring that show you the logs aren't coming in? Are you watching the collected syslog? I want to replicate your situation more closely.

@ph
Contributor

ph commented Mar 2, 2022

If you look at /opt/Elastic/Agent/data/elastic-agent*/filebeat*, is there any error there?

@vladvasiliu

@ph, in my case, the agents are running on Windows 2016 LTSB and Server 2019.

The 2016 ones are the ones that are shut down for the night, and which have the problems. They're collecting JSON logs from a bunch of files, using the custom log integration.

The 2019 ones are collecting only Windows logs, with the system integration.

I can't guarantee it 100%, since those servers rarely reboot, but I would say that the issue doesn't happen on the 2019 / system integration agents. (They do have the policy update not working issue, but I suppose that's different).

At one point, I had the system integration activated on the 2016 agents, too, and it seemed to work even when the custom log didn't.

Another point: The agents are configured to send their own logs. When the custom log integration doesn't work, I usually don't get the agent's logs either.

There are no errors in the filebeat logs. But they don't say anything about starting any inputs, either. Then they go on about non-zero metrics every 30 seconds.

If you have a build with enhanced logging, I'd be happy to deploy it on a bunch of machines and report back.

@mbudge
Contributor

mbudge commented Mar 3, 2022

We were affected by this as soon as we started testing Elastic Agent with the Azure logs integration. We are using v7.16.3

It happened when we either restarted the agent or updated the config, which triggered a restart. The Azure integration dropped, and the only error I can see in the logs is:

{"log.level":"debug","@timestamp":"2022-02-22T10:44:55.535Z","log.logger":"add_cloud_metadata","log.origin":{"file.name":"add_cloud_metadata/providers.go","file.line":166},"message":"add_cloud_metadata: received disposition for azure after 2.434822ms. result=[provider:azure, error=failed with http status code 404, metadata={}]","service.name":"filebeat","ecs.version":"1.6.0"}

Several other integrations fail to start but the agent appears healthy in the Kibana UI.

Contact support if you want to do a screen share. I've provided about 20 diagnostics to support and they all contain the azure logs integration in the config file.

@renekalff

renekalff commented Mar 3, 2022

We're having related issues as well, mostly on Linux, and not only after a restart: we also have endpoints not sending logs after an agent upgrade (about 50 stopped sending logs after an agent upgrade this week). Sometimes changing the policy or the logging level helps, but not always. Most systems are not in our control, but I do have one with this issue right now with debugging enabled and terminal access. If I can help in any way (setting up a call to speed things up is a possibility), let me know.

@renekalff

Some additional information:

  • Agent shows healthy
  • Agent version 7.16.3
  • Metrics are received but no logs
  • CentOS 7

Comparing with a working system:
./elastic-agent inspect | grep -v api_key | md5sum shows the agent is running with the same configuration
curl -XGET --unix-socket ./filebeat.sock 'http:/state?pretty' shows 0 input.count instead of 2 and no output.elasticsearch.cluster_uuid (see the sketch below)
./elastic-agent diagnostics shows no differences except for the ids
cat /proc/<filebeat pid>/cmdline | md5sum shows filebeat is started with the same parameters
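
For anyone who wants to run the same check across a larger fleet, here is a minimal Go sketch of the query the curl command above performs. It assumes the beat's HTTP endpoint is exposed on the unix socket and that the response carries the input.count field referenced above; the socket path is a placeholder to adjust for your install.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"net"
	"net/http"
)

func main() {
	// Socket path is an assumption; point it at the filebeat.sock of the
	// elastic-agent install you are checking.
	const socket = "./filebeat.sock"

	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", socket)
			},
		},
	}

	// Same endpoint the curl command above queries.
	resp, err := client.Get("http://unix/state")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Only the field referenced in the comment above (input.count) is
	// decoded; the full response contains more.
	var state struct {
		Input struct {
			Count int `json:"count"`
		} `json:"input"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&state); err != nil {
		log.Fatal(err)
	}

	if state.Input.Count == 0 {
		fmt.Println("input.count is 0: this filebeat is likely affected")
	} else {
		fmt.Printf("input.count is %d: inputs are configured\n", state.Input.Count)
	}
}
```

On an affected host this should report the same zero input count that the curl command shows.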

@renekalff

A bit more digging in the filebeat logging. I changed the logging level (which causes a config reload) on a working and a defective system. On a working endpoint this results in "centralmgmt" events for filebeat.inputs, filebeat.outputs and filebeat.modules. On the defective endpoint only filebeat.outputs is present, but it does show the correct URL for that in the logs.

Maybe somebody else can scan their logs for "centralmgmt" events and verify if it's the same issue.
filebeat.json.txt

@ph
Contributor

ph commented Mar 3, 2022

@renekalff Can you verify how many filebeat processes run on a problematic host? I was able to reproduce it 2 times.

@renekalff

There are two processes running: one for the agent logs, the other for the system logs.

@renekalff Can you verify how many filebeat processes run on a problematic host? I was able to reproduce it 2 times.

@ph
Contributor

ph commented Mar 3, 2022

{"log.level":"info","@timestamp":"2022-03-03T11:20:41.207+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":109},"message":"Starting fleet management service","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T11:20:41.732+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":150},"message":"Status change to Configuring: Updating configuration","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T11:20:41.733+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":271},"message":"Applying settings for output","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T11:20:41.733+0100","log.logger":"esclientleg","log.origin":{"file.name":"eslegclient/connection.go","file.line":102},"message":"elasticsearch url: https://:443","service.name":"filebeat","ecs.version":"1.6.0"}

This is really good, because this looks like the registry entries for the inputs and modules aren't there.
If I remember correctly, the "Applying (inputs, modules, output)" messages are based on what exists in the registry of reloadable parts.
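
To make that mechanism easier to picture, here is a simplified, hypothetical sketch (not the actual libbeat code; the names are made up): the manager applies only the blocks for which a reloadable handler has already registered itself, so a configuration that arrives before the inputs and modules handlers exist silently updates only the output.

```go
package main

import "fmt"

// Reloadable is a simplified stand-in for a config block handler
// (output, inputs, modules, ...) that can be hot-reloaded.
type Reloadable interface {
	Reload(cfg map[string]interface{}) error
}

// reloadFunc adapts a plain function to the Reloadable interface.
type reloadFunc func(map[string]interface{}) error

func (f reloadFunc) Reload(cfg map[string]interface{}) error { return f(cfg) }

// registry holds the handlers that have registered themselves so far.
var registry = map[string]Reloadable{}

// applyConfig mimics what the manager conceptually does when a policy
// arrives: every block is applied only if a handler is registered for it.
func applyConfig(cfg map[string]map[string]interface{}) {
	for block, blockCfg := range cfg {
		h, ok := registry[block]
		if !ok {
			// No handler yet: the block is skipped without an error, which
			// matches the "only the output is applied" symptom above.
			fmt.Printf("skipping %q: nothing registered for it\n", block)
			continue
		}
		_ = h.Reload(blockCfg)
		fmt.Printf("applying settings for %q\n", block)
	}
}

func main() {
	// Only the output handler has registered so far; the inputs handler is late.
	registry["output"] = reloadFunc(func(map[string]interface{}) error { return nil })

	applyConfig(map[string]map[string]interface{}{
		"output": {"hosts": []string{"https://example.invalid:443"}},
		"inputs": {"type": "log", "paths": []string{"/var/log/*.log"}},
	})
}
```

Running the sketch applies only the output block and skips the inputs block, which mirrors the log excerpt above where only the output settings are applied.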

@ph
Contributor

ph commented Mar 3, 2022

@renekalff Thanks for your help. I know where the problem is and I will work on a fix, but I cannot reproduce it reliably because it's a timing/visibility issue. The problem is that, in some circumstances, the manager that takes care of the reloadable parts of beats is not completely initialized when it becomes ready to accept the configuration.
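
For illustration, a rough sketch of the shape such a fix could take (hypothetical names; this is not the actual change in #30694): make the manager wait until the reloadable parts have registered before it accepts and applies the first configuration, rather than applying whatever happens to be registered at that instant.

```go
package main

import (
	"fmt"
	"time"
)

// registered is closed once all reloadable parts (inputs, modules, output)
// have registered themselves.
var registered = make(chan struct{})

// waitForReloadables blocks until registration has finished or a timeout
// elapses, so the first policy is never applied against a half-initialized
// registry.
func waitForReloadables(timeout time.Duration) error {
	select {
	case <-registered:
		return nil
	case <-time.After(timeout):
		return fmt.Errorf("reloadable parts not registered after %s", timeout)
	}
}

func main() {
	// Simulate the beat finishing its startup slightly after the manager
	// received the policy from the agent.
	go func() {
		time.Sleep(50 * time.Millisecond)
		close(registered)
	}()

	if err := waitForReloadables(5 * time.Second); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("all reloadable parts registered, applying configuration")
}
```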

@renekalff

@renekalff Thanks for your help. I know where the problem is and I will work on a fix, but I cannot reproduce it reliably because it's a timing/visibility issue. The problem is that, in some circumstances, the manager that takes care of the reloadable parts of beats is not completely initialized when it becomes ready to accept the configuration.

Would be awesome if that fixes the issue. We've tried to reproduce this last week and also weren't able to.

@ph
Contributor

ph commented Mar 3, 2022

I will work on a fix and get that released asap.

@ph
Contributor

ph commented Mar 4, 2022

I have PR #30694 open; will get it reviewed and tested asap.

@jlind23 jlind23 linked a pull request Mar 7, 2022 that will close this issue
@ph ph closed this as completed in #30694 Mar 14, 2022
@amolnater-qasource

Hi Team
We have revalidated this issue on an 8.2 Snapshot Kibana cloud environment and found it fixed now.
Host: Windows 10

  • We restarted the Agent and found filebeat and metricbeat logs being generated successfully after the agent restart.

Build details:

BUILD: 51276
COMMIT: 32d08793f02171391ab4b0d3044970c3f14c2d94
Artifact Link: https://snapshots.elastic.co/8.2.0-5cc993c1/downloads/beats/elastic-agent/elastic-agent-8.2.0-SNAPSHOT-windows-x86_64.zip


Logs:
logs.zip

Thanks!
