Fix/change the initialization of management layer #30694

ph · 2022-03-04T20:27:05Z

This fixes an issue in Filebeat that made the start sequence of Filebeat
non-deterministic. It was possible that not all the hooks were
configured correctly before the manager received a configuration
from the Elastic Agent.

This causes an inconsistency between the expected configuration state coming from the Agent
and the actual running state. This situation can show one or more of the following symptoms:

- Filebeat running but not sending any data to Elasticsearch
- Filebeat partially configured, with only some inputs sending data
- Missing logs from the Filebeat collector
- Only Metricbeat running and sending logs
- A problematic process being restarted by the agent

This solves the issue by moving the `Start` and `Stop` of the
manager into the Beats initialization process; each Beat needs to be
adjusted to support this new sequence.

This is indeed a breaking change for Beats authors,
but the bootstrap process of Beats and libbeat cannot easily be
extended to make the change in a single place.

Every Beat has a different code path.
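
To illustrate why the ordering matters, here is a minimal, self-contained sketch. It is not the libbeat code: the `manager` type, its `Register`, `Start` and `Stop` methods, and the two `run...` functions are hypothetical stand-ins, and the race is modeled deterministically as its worst case (the configuration arrives before the inputs hook is registered).

```go
package main

import "fmt"

// manager is a hypothetical stand-in for the management layer. It applies a
// configuration section only if a reload hook is registered for it.
type manager struct {
	hooks   map[string]func(section string)
	applied []string
}

func newManager() *manager {
	return &manager{hooks: make(map[string]func(string))}
}

// Register adds a reload hook for one configuration section.
func (m *manager) Register(section string, hook func(section string)) {
	m.hooks[section] = hook
}

// Start simulates the Agent pushing a configuration as soon as the manager
// is up. Sections without a registered hook are silently skipped.
func (m *manager) Start(sections []string) {
	for _, s := range sections {
		if hook, ok := m.hooks[s]; ok {
			hook(s)
			m.applied = append(m.applied, s)
		}
	}
}

// Stop is a no-op in this sketch.
func (m *manager) Stop() {}

func noop(string) {}

// runEarlyStart models the old sequence: the manager is started before the
// Beat has registered all of its hooks.
func runEarlyStart(m *manager, sections []string) []string {
	m.Register("output", noop)
	m.Start(sections) // configuration arrives here
	defer m.Stop()
	m.Register("filebeat.inputs", noop) // too late, inputs were skipped
	return m.applied
}

// runLateStart models the new sequence: the Beat registers every hook in
// its own run method and only then starts the manager.
func runLateStart(m *manager, sections []string) []string {
	m.Register("output", noop)
	m.Register("filebeat.inputs", noop)
	m.Start(sections)
	defer m.Stop()
	return m.applied
}

func main() {
	sections := []string{"filebeat.inputs", "output"}
	fmt.Println("early start applied:", runEarlyStart(newManager(), sections))
	fmt.Println("late start applied: ", runLateStart(newManager(), sections))
}
```

In the early-start case only the output section is applied, which matches the "outputs only" symptom described above; in the late-start case both sections are applied.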

How it was detected

This was detected in a log file from which expected log events were missing.

Working endpoint.

{"log.level":"info","@timestamp":"2022-03-03T15:21:46.739+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":109},"message":"Starting fleet management service","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T15:21:47.354+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":150},"message":"Status change to Configuring: Updating configuration","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T15:21:47.354+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":271},"message":"Applying settings for filebeat.inputs","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-03-03T15:21:47.354+0100","log.logger":"centralmgmt","log.origin":{"file.name":"cfgfile/list.go","file.line":63},"message":"Starting reload procedure, current runners: 0","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-03-03T15:21:47.355+0100","log.logger":"centralmgmt","log.origin":{"file.name":"cfgfile/list.go","file.line":81},"message":"Start list: 2, Stop list: 0","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-03-03T15:21:47.356+0100","log.logger":"centralmgmt","log.origin":{"file.name":"cfgfile/list.go","file.line":105},"message":"Starting runner: input [type=log]","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-03-03T15:21:47.358+0100","log.logger":"centralmgmt","log.origin":{"file.name":"cfgfile/list.go","file.line":105},"message":"Starting runner: input [type=log]","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T15:21:47.358+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":271},"message":"Applying settings for output","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T15:21:47.358+0100","log.logger":"esclientleg","log.origin":{"file.name":"eslegclient/connection.go","file.line":102},"message":"elasticsearch url: https://<redacted>:443","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T15:21:47.358+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":271},"message":"Applying settings for filebeat.modules","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-03-03T15:21:47.358+0100","log.logger":"centralmgmt","log.origin":{"file.name":"cfgfile/list.go","file.line":63},"message":"Starting reload procedure, current runners: 0","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-03-03T15:21:47.358+0100","log.logger":"centralmgmt","log.origin":{"file.name":"cfgfile/list.go","file.line":81},"message":"Start list: 0, Stop list: 0","service.name":"filebeat","ecs.version":"1.6.0"}

Problematic endpoint

{"log.level":"info","@timestamp":"2022-03-03T11:20:41.207+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":109},"message":"Starting fleet management service","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T11:20:41.732+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":150},"message":"Status change to Configuring: Updating configuration","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T11:20:41.733+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":271},"message":"Applying settings for output","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T11:20:41.733+0100","log.logger":"esclientleg","log.origin":{"file.name":"eslegclient/connection.go","file.line":102},"message":"elasticsearch url: https://<redacted>:443","service.name":"filebeat","ecs.version":"1.6.0"}

The latter log extract only contains information about the outputs (`Applying settings...`) and nothing about the inputs.
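
The comparison of the two extracts can be automated. The following is a rough sketch, assuming the logs are JSON lines with a `message` field as in the extracts above; the `appliedSections` and `hasInputs` helpers are hypothetical names introduced for illustration and are not part of Beats.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

const prefix = "Applying settings for "

// appliedSections returns the configuration sections named in
// "Applying settings for ..." messages, in order of appearance.
func appliedSections(logs string) []string {
	var sections []string
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var entry struct {
			Message string `json:"message"`
		}
		if err := json.Unmarshal(sc.Bytes(), &entry); err != nil {
			continue // skip blank lines and lines that are not JSON
		}
		if strings.HasPrefix(entry.Message, prefix) {
			sections = append(sections, strings.TrimPrefix(entry.Message, prefix))
		}
	}
	return sections
}

// hasInputs reports whether the inputs section was applied.
func hasInputs(sections []string) bool {
	for _, s := range sections {
		if s == "filebeat.inputs" {
			return true
		}
	}
	return false
}

func main() {
	// Trimmed-down version of the problematic extract.
	logs := `{"log.level":"info","message":"Starting fleet management service"}
{"log.level":"info","message":"Applying settings for output"}
{"log.level":"info","message":"elasticsearch url: https://<redacted>:443"}`

	sections := appliedSections(logs)
	fmt.Println("applied sections:", sections)
	fmt.Println("inputs configured:", hasInputs(sections))
}
```

Running the same check against the working extract would report `filebeat.inputs` among the applied sections.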

What does this PR do?

Why is it important?

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~- [ ] I have made corresponding changes to the documentation~~
~~- [ ] I have made corresponding change to the default configuration files~~
~~- [ ] I have added tests that prove my fix is effective or that my feature works~~
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

[ ]

How to test this PR locally

Since the problem is non-deterministic, reproducing this issue is really hard. I was able to reproduce it a few times by simulating load on the agent's virtual machine.

Related issues

Closes Filebeat running under Elastic-Agent not harvesting logs after restart #30533

elasticmachine · 2022-03-04T20:27:18Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

mergify · 2022-03-04T20:27:42Z

This pull request does not have a backport label. Could you fix it @ph? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v./d./d./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

elasticmachine · 2022-03-04T23:07:24Z

💚 Build Succeeded

Build stats

Start Time: 2022-03-09T16:49:31.895+0000
Duration: 120 min 48 sec

Test stats 🧪

Test	Results
Failed	0
Passed	43016
Skipped	3846
Total	46862

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages and run the E2E tests.
/beats-tester : Run the installation tests with beats-tester.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

ph · 2022-03-05T00:57:21Z

/package

ph · 2022-03-05T05:10:58Z

The failure looks valid to me, I will take a look.

ph · 2022-03-07T14:24:46Z

I've looked at the issues. I am going to rebase this PR and have another go at the tests; I've added a changelog entry too.

ph · 2022-03-07T16:29:40Z

I've tested this PR using one of our Vagrant machines (`vagrant up ubuntu2004`). I installed the stress utility and put a really large CPU load on the machine; the machine was really slow, and I did a few restarts of the service. Every time, Filebeat and Metricbeat came back online and sent events to Elasticsearch.

x-pack/libbeat/management/manager.go

ph · 2022-03-07T16:34:28Z

@aleksmaus can you take a look at the osquerybeat part?

ph · 2022-03-09T14:38:02Z

@simitt Thanks, I've fixed the typo and added more information for the SetStopCallback, see https://github.com/elastic/beats/pull/30694/files#diff-fcf0ac1927a6e4a560125bca3691cb6c27006277664735a58500420535fedc27R94

mergify · 2022-03-09T15:39:07Z

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix/change-the-initialization-manager upstream/fix/change-the-initialization-manager
git merge upstream/main
git push upstream fix/change-the-initialization-manager

This reverts commit 4c14f03.

* Ensure that libbeat manager is instantiated after the hooks.

This fixes an issue in Filebeat that made the start sequence of Filebeat non-deterministic. It was possible that not all the hooks were configured correctly before the manager received a configuration from the Elastic Agent.

This causes an inconsistency between the expected configuration state and the actual running state, with the following symptoms:

- Filebeat running but not sending any data to Elasticsearch
- Filebeat partially configured, with only some inputs sending data
- Missing logs from the Filebeat collector
- Only Metricbeat running and sending logs

This solves the issue by moving the `Start` and `Stop` of the manager into the Beats initialization process; each Beat needs to be adjusted to support it.

This is indeed a breaking change for Beats authors, but the bootstrap process of Beats and libbeat cannot easily be extended to make the change in a single place.

(cherry picked from commit 4c14f03)

ph · 2022-03-14T16:24:28Z

It seems I missed the mergify alert about the conflict. I will make a follow-up PR.

ph · 2022-03-14T16:28:19Z

It seems I fixed that last week; time changes are hard. :(

This moves Manager.Start and Manager.Stop into the Beats' run method; this move ensures that the system is configured and ready to receive events.

Having the Manager started and stopped at the libbeat level was causing inconsistency when the Beats were configured by the Elastic Agent. The problem would lead to the following behavior:

- Zombie Beats with only outputs configured
- Beats without any inputs configured
- Beats with only some of the inputs configured

The problem was often caused by restarting the agent while the machine was under significant load.

See elastic/beats#30694 for details.

* Update to elastic/beats@c52699616a8a

* Move Manager.Start() and Manager.Stop() in the beat execution.

This moves Manager.Start and Manager.Stop into the Beats' run method; this move ensures that the system is configured and ready to receive events.

Having the Manager started and stopped at the libbeat level was causing inconsistency when the Beats were configured by the Elastic Agent. The problem would lead to the following behavior:

- Zombie Beats with only outputs configured
- Beats without any inputs configured
- Beats with only some of the inputs configured

The problem was often caused by restarting the agent while the machine was under significant load.

See elastic/beats#30694 for details.

* Update mock Manager implementation

Co-authored-by: apmmachine <infra-root-apmmachine@elastic.co>
Co-authored-by: Pier-Hugues Pellerin <phpellerin@gmail.com>
Co-authored-by: Andrew Wilkins <axw@elastic.co>

* Ensure that libbeat manager is instantiated after the hooks.

This fixes an issue in Filebeat that made the start sequence of Filebeat non-deterministic. It was possible that not all the hooks were configured correctly before the manager received a configuration from the Elastic Agent.

This causes an inconsistency between the expected configuration state and the actual running state, with the following symptoms:

- Filebeat running but not sending any data to Elasticsearch
- Filebeat partially configured, with only some inputs sending data
- Missing logs from the Filebeat collector
- Only Metricbeat running and sending logs

This solves the issue by moving the `Start` and `Stop` of the manager into the Beats initialization process; each Beat needs to be adjusted to support it.

This is indeed a breaking change for Beats authors, but the bootstrap process of Beats and libbeat cannot easily be extended to make the change in a single place.

(cherry picked from commit 4c14f03)

Co-authored-by: Pier-Hugues Pellerin <phpellerin@gmail.com>

* Update to elastic/beats@49a7ebdde9ef

* Move Manager.Start() and Manager.Stop() in the beat execution.

This moves Manager.Start and Manager.Stop into the Beats' run method; this move ensures that the system is configured and ready to receive events.

Having the Manager started and stopped at the libbeat level was causing inconsistency when the Beats were configured by the Elastic Agent. The problem would lead to the following behavior:

- Zombie Beats with only outputs configured
- Beats without any inputs configured
- Beats with only some of the inputs configured

The problem was often caused by restarting the agent while the machine was under significant load.

See elastic/beats#30694 for details.

* Update mock Manager implementation

Co-authored-by: apmmachine <infra-root-apmmachine@elastic.co>
Co-authored-by: Pier-Hugues Pellerin <phpellerin@gmail.com>
Co-authored-by: Andrew Wilkins <axw@elastic.co>

This fixes an issue in Filebeat that made the start sequence of Filebeat non-deterministic. It was possible that not all the hooks were configured correctly before the manager received a configuration from the Elastic Agent.

This causes an inconsistency between the expected configuration state and the actual running state, with the following symptoms:

- Filebeat running but not sending any data to Elasticsearch
- Filebeat partially configured, with only some inputs sending data
- Missing logs from the Filebeat collector
- Only Metricbeat running and sending logs

This solves the issue by moving the `Start` and `Stop` of the manager into the Beats initialization process; each Beat needs to be adjusted to support it.

This is indeed a breaking change for Beats authors, but the bootstrap process of Beats and libbeat cannot easily be extended to make the change in a single place.

(cherry picked from commit 4c14f03)

Co-authored-by: Pier-Hugues Pellerin <phpellerin@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

…ayer (#30805)

* Fix/change the initialization of management layer (#30694)

This fixes an issue in Filebeat that made the start sequence of Filebeat non-deterministic. It was possible that not all the hooks were configured correctly before the manager received a configuration from the Elastic Agent.

This causes an inconsistency between the expected configuration state and the actual running state, with the following symptoms:

- Filebeat running but not sending any data to Elasticsearch
- Filebeat partially configured, with only some inputs sending data
- Missing logs from the Filebeat collector
- Only Metricbeat running and sending logs

This solves the issue by moving the `Start` and `Stop` of the manager into the Beats initialization process; each Beat needs to be adjusted to support it.

This is indeed a breaking change for Beats authors, but the bootstrap process of Beats and libbeat cannot easily be extended to make the change in a single place.

(cherry picked from commit 4c14f03)

Co-authored-by: Pier-Hugues Pellerin <phpellerin@gmail.com>

* Update to elastic/beats@6e046b747c6b

* Move Manager.Start() and Manager.Stop() in the beat execution.

This moves Manager.Start and Manager.Stop into the Beats' run method; this move ensures that the system is configured and ready to receive events.

Having the Manager started and stopped at the libbeat level was causing inconsistency when the Beats were configured by the Elastic Agent. The problem would lead to the following behavior:

- Zombie Beats with only outputs configured
- Beats without any inputs configured
- Beats with only some of the inputs configured

The problem was often caused by restarting the agent while the machine was under significant load.

See elastic/beats#30694 for details.

* Update mock Manager implementation

Co-authored-by: apmmachine <infra-root-apmmachine@elastic.co>
Co-authored-by: Pier-Hugues Pellerin <phpellerin@gmail.com>
Co-authored-by: Andrew Wilkins <axw@elastic.co>

…astic#30806)

This fixes an issue in Filebeat that made the start sequence of Filebeat non-deterministic. It was possible that not all the hooks were configured correctly before the manager received a configuration from the Elastic Agent.

This causes an inconsistency between the expected configuration state and the actual running state, with the following symptoms:

- Filebeat running but not sending any data to Elasticsearch
- Filebeat partially configured, with only some inputs sending data
- Missing logs from the Filebeat collector
- Only Metricbeat running and sending logs

This solves the issue by moving the `Start` and `Stop` of the manager into the Beats initialization process; each Beat needs to be adjusted to support it.

This is indeed a breaking change for Beats authors, but the bootstrap process of Beats and libbeat cannot easily be extended to make the change in a single place.

(cherry picked from commit 48da76f)

Co-authored-by: Pier-Hugues Pellerin <phpellerin@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

This reverts commit 48da76f.

ph requested a review from a team as a code owner March 4, 2022 20:27

botelastic bot added the needs_team label (Indicates that the issue/PR needs a Team:* label) Mar 4, 2022

ph added the Team:Elastic-Agent-Control-Plane label (Label for the Agent Control Plane team) Mar 4, 2022

botelastic bot removed the needs_team label (Indicates that the issue/PR needs a Team:* label) Mar 4, 2022

ph self-assigned this Mar 4, 2022

ph requested review from a team and aleksmaus March 4, 2022 20:27

mergify bot added the backport-skip label (Skip notification from the automated backport with mergify) Mar 4, 2022

ph mentioned this pull request Mar 4, 2022

Filebeat running under Elastic-Agent not harvesting logs after restart #30533

Closed

ph changed the title ~~Fix/change the initialization manager~~ Fix/change the initialization of management layer Mar 4, 2022

jlind23 linked an issue Mar 7, 2022 that may be closed by this pull request

Filebeat running under Elastic-Agent not harvesting logs after restart #30533

Closed

ph requested review from AndersonQ and andrewkroh March 7, 2022 14:19

ph force-pushed the fix/change-the-initialization-manager branch from 1592ac9 to 550fc6e Compare March 7, 2022 14:24

ph requested a review from a team as a code owner March 7, 2022 14:24

ph commented Mar 7, 2022

x-pack/libbeat/management/manager.go (outdated)

ph added 3 commits March 9, 2022 09:36

Should have the appriopriate shutdown sequence here.

482fcca

Osquerybeat

203a32b

Adding docs around SetStopCallback

a244a9f

ph force-pushed the fix/change-the-initialization-manager branch from 60c2de1 to a244a9f Compare March 9, 2022 14:36

Merge branch 'main' into fix/change-the-initialization-manager

d735b18

ph merged commit 4c14f03 into elastic:main Mar 14, 2022

ph added a commit that referenced this pull request Mar 14, 2022

Revert "Fix/change the initialization of management layer (#30694)"

1314f2b

This reverts commit 4c14f03.

ph mentioned this pull request Mar 14, 2022

Revert "Fix/change the initialization of management layer" #30804

Closed

mergify bot mentioned this pull request Mar 14, 2022

[7.17](backport #30694) Fix/change the initialization of management layer #30805

Merged

mergify bot mentioned this pull request Mar 14, 2022

[8.0](backport #30694) Fix/change the initialization of management layer #30806

Merged

mergify bot mentioned this pull request Mar 14, 2022

[8.1](backport #30694) Fix/change the initialization of management layer #30807

Merged

leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023

Revert "Fix/change the initialization of management layer (elastic#30694)"

db39236

This reverts commit 48da76f.
