Fix/change the initialization of management layer #30694

Merged
ph merged 22 commits into elastic:main from fix/change-the-initialization-manager
Mar 14, 2022

Conversation

ph
Contributor

@ph ph commented Mar 4, 2022

This fixes an issue in Filebeat that makes its start sequence
non-deterministic. It was possible that not all the hooks were
configured correctly before the manager received a configuration
from the Elastic Agent.

This causes an inconsistency between the expected configuration state coming from the Agent
and the actual running state. This situation can have one or more of the following symptoms:

  • Filebeat is running but not sending any data to Elasticsearch.
  • Filebeat is partially configured, with only some inputs
    sending data.
  • Logs are missing from the Filebeat collector.
  • Only Metricbeat is running and sending logs.
  • A problematic process is restarted by the agent.

This solves the issue by moving the Start and Stop of the
manager into the beats initialization process; each beat needs to be
adjusted to support this new sequence.

This is indeed a breaking change for beats authors,
but the bootstrap process of beats and libbeat cannot easily be
extended to make the change in a single place.

Every beat has a different code path.
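
The ordering problem described above can be illustrated with a minimal sketch. It is not the libbeat code: `Registry`, `Manager`, and `bootstrap` are hypothetical names invented for this example, and the race is modeled as a simple flag rather than real timing. It only shows why a manager started before every hook is registered can end up applying the output settings and skipping the inputs.

```go
package main

import "fmt"

// Registry is a hypothetical stand-in for the set of reload hooks a beat
// installs during initialization, keyed by configuration section.
type Registry struct {
	hooks map[string]func(cfg string)
}

// Register installs the hook that applies one configuration section.
func (r *Registry) Register(section string, hook func(cfg string)) {
	if r.hooks == nil {
		r.hooks = make(map[string]func(cfg string))
	}
	r.hooks[section] = hook
}

// Manager is a hypothetical stand-in for the management layer.
type Manager struct {
	registry *Registry
}

// Start delivers the agent's configuration to the hooks that exist right
// now and returns the sections that had no hook and were skipped.
func (m *Manager) Start() (skipped []string) {
	// A fixed order keeps the sketch deterministic.
	for _, section := range []string{"output", "filebeat.inputs"} {
		hook, ok := m.registry.hooks[section]
		if !ok {
			skipped = append(skipped, section)
			continue
		}
		hook(fmt.Sprintf("config for %s", section))
	}
	return skipped
}

// bootstrap wires a beat together. With startEarly set, the manager starts
// before the inputs hook is registered, which models the problematic order.
func bootstrap(startEarly bool) (applied, skipped []string) {
	reg := &Registry{}
	mgr := &Manager{registry: reg}
	record := func(section string) func(string) {
		return func(string) { applied = append(applied, section) }
	}

	reg.Register("output", record("output"))
	if startEarly {
		skipped = mgr.Start()
	}
	reg.Register("filebeat.inputs", record("filebeat.inputs"))
	if !startEarly {
		skipped = mgr.Start()
	}
	return applied, skipped
}

func main() {
	applied, skipped := bootstrap(true)
	fmt.Println("manager started early:", applied, "skipped:", skipped)

	applied, skipped = bootstrap(false)
	fmt.Println("manager started last: ", applied, "skipped:", skipped)
}
```

In the early-start case only the output hook runs, which resembles the symptom of a beat that has an output configured but no inputs.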

How it was detected

This was detected in a log where expected log events were actually missing.

Working endpoint.

{"log.level":"info","@timestamp":"2022-03-03T15:21:46.739+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":109},"message":"Starting fleet management service","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T15:21:47.354+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":150},"message":"Status change to Configuring: Updating configuration","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T15:21:47.354+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":271},"message":"Applying settings for filebeat.inputs","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-03-03T15:21:47.354+0100","log.logger":"centralmgmt","log.origin":{"file.name":"cfgfile/list.go","file.line":63},"message":"Starting reload procedure, current runners: 0","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-03-03T15:21:47.355+0100","log.logger":"centralmgmt","log.origin":{"file.name":"cfgfile/list.go","file.line":81},"message":"Start list: 2, Stop list: 0","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-03-03T15:21:47.356+0100","log.logger":"centralmgmt","log.origin":{"file.name":"cfgfile/list.go","file.line":105},"message":"Starting runner: input [type=log]","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-03-03T15:21:47.358+0100","log.logger":"centralmgmt","log.origin":{"file.name":"cfgfile/list.go","file.line":105},"message":"Starting runner: input [type=log]","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T15:21:47.358+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":271},"message":"Applying settings for output","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T15:21:47.358+0100","log.logger":"esclientleg","log.origin":{"file.name":"eslegclient/connection.go","file.line":102},"message":"elasticsearch url: https://<redacted>:443","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T15:21:47.358+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":271},"message":"Applying settings for filebeat.modules","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-03-03T15:21:47.358+0100","log.logger":"centralmgmt","log.origin":{"file.name":"cfgfile/list.go","file.line":63},"message":"Starting reload procedure, current runners: 0","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-03-03T15:21:47.358+0100","log.logger":"centralmgmt","log.origin":{"file.name":"cfgfile/list.go","file.line":81},"message":"Start list: 0, Stop list: 0","service.name":"filebeat","ecs.version":"1.6.0"}

Problematic endpoint

{"log.level":"info","@timestamp":"2022-03-03T11:20:41.207+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":109},"message":"Starting fleet management service","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T11:20:41.732+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":150},"message":"Status change to Configuring: Updating configuration","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T11:20:41.733+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":271},"message":"Applying settings for output","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-03T11:20:41.733+0100","log.logger":"esclientleg","log.origin":{"file.name":"eslegclient/connection.go","file.line":102},"message":"elasticsearch url: https://<redacted>:443","service.name":"filebeat","ecs.version":"1.6.0"}

The latter log extract only contains information about the outputs (`Applying settings...`), nothing about the inputs.
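
The comparison of the two extracts can be automated with a small check. The sketch below is an illustration only: the function names are hypothetical, and it assumes the logs are newline-delimited JSON with a `message` field of the form "Applying settings for <section>", as in the extracts above.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// entry holds the only field of a log line that this check needs.
type entry struct {
	Message string `json:"message"`
}

// appliedSections scans newline-delimited JSON logs and returns the section
// names found in "Applying settings for <section>" messages, in order.
func appliedSections(logs string) []string {
	const prefix = "Applying settings for "
	var sections []string
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var e entry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue // skip lines that are not valid JSON
		}
		if strings.HasPrefix(e.Message, prefix) {
			sections = append(sections, strings.TrimPrefix(e.Message, prefix))
		}
	}
	return sections
}

// inputsApplied reports whether the filebeat.inputs section was applied.
func inputsApplied(sections []string) bool {
	for _, s := range sections {
		if s == "filebeat.inputs" {
			return true
		}
	}
	return false
}

// describe summarizes the result of the check for one log extract.
func describe(sections []string) string {
	return fmt.Sprintf("applied=%v inputs=%t", sections, inputsApplied(sections))
}

func main() {
	// Abbreviated lines modeled on the two extracts; most fields are omitted.
	working := `{"log.level":"info","message":"Applying settings for filebeat.inputs"}
{"log.level":"info","message":"Applying settings for output"}`
	problematic := `{"log.level":"info","message":"Applying settings for output"}`

	fmt.Println("working:    ", describe(appliedSections(working)))
	fmt.Println("problematic:", describe(appliedSections(problematic)))
}
```

A startup log that never reports the inputs section is a hint that the endpoint may be affected, although it is not proof on its own.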

What does this PR do?

Why is it important?

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • [ ]

How to test this PR locally

Since the problem is non-deterministic, reproducing this issue is really hard. I was able to reproduce it a few times by simulating load on the agent virtual machine.

Related issues

@ph ph requested a review from a team as a code owner March 4, 2022 20:27
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Mar 4, 2022
@ph ph added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Mar 4, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Mar 4, 2022
@ph ph self-assigned this Mar 4, 2022
@ph ph requested review from a team and aleksmaus March 4, 2022 20:27
@mergify
Contributor

mergify bot commented Mar 4, 2022

This pull request does not have a backport label. Could you fix it @ph? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Mar 4, 2022
@ph ph added backport-v7.17.0 Automated backport with mergify backport-v8.0.0 Automated backport with mergify backport-v8.1.0 Automated backport with mergify backport-v8.2.0 Automated backport with mergify backport-v8.3.0 Automated backport with mergify and removed backport-skip Skip notification from the automated backport with mergify labels Mar 4, 2022
@ph ph changed the title Fix/change the initialization manager Fix/change the initialization of management layer Mar 4, 2022
@elasticmachine
Collaborator

elasticmachine commented Mar 4, 2022

💚 Build Succeeded


Build stats

  • Start Time: 2022-03-09T16:49:31.895+0000

  • Duration: 120 min 48 sec

Test stats 🧪

Test Results
Failed 0
Passed 43016
Skipped 3846
Total 46862

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@ph
Contributor Author

ph commented Mar 5, 2022

/package

@ph
Contributor Author

ph commented Mar 5, 2022

The failure looks valid to me. I will take a look.

@jlind23 jlind23 linked an issue Mar 7, 2022 that may be closed by this pull request
@ph ph requested review from AndersonQ and andrewkroh March 7, 2022 14:19
@ph ph force-pushed the fix/change-the-initialization-manager branch from 1592ac9 to 550fc6e Compare March 7, 2022 14:24
@ph ph requested a review from a team as a code owner March 7, 2022 14:24
@ph
Contributor Author

ph commented Mar 7, 2022

I've looked at the issues. I am going to rebase this PR and have another go with the tests; I've added a changelog too.

@ph
Contributor Author

ph commented Mar 7, 2022

I've tested this PR using one of our vagrant machines (vagrant up ubuntu2004). I installed the stress utility and put a really large CPU load on the machine; the machine was really slow, and I did a few restarts of the service. Every time, Filebeat and Metricbeat came back online and sent events to Elasticsearch.

@ph
Contributor Author

ph commented Mar 7, 2022

@aleksmaus can you take a look at the osquerybeat part?

@ph ph force-pushed the fix/change-the-initialization-manager branch from 60c2de1 to a244a9f Compare March 9, 2022 14:36
@ph
Contributor Author

ph commented Mar 9, 2022

@simitt Thanks, I've fixed the typo and added more information for the SetStopCallback, see https://github.com/elastic/beats/pull/30694/files#diff-fcf0ac1927a6e4a560125bca3691cb6c27006277664735a58500420535fedc27R94

@mergify
Contributor

mergify bot commented Mar 9, 2022

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix/change-the-initialization-manager upstream/fix/change-the-initialization-manager
git merge upstream/main
git push upstream fix/change-the-initialization-manager

@ph ph merged commit 4c14f03 into elastic:main Mar 14, 2022
ph added a commit that referenced this pull request Mar 14, 2022
mergify bot pushed a commit that referenced this pull request Mar 14, 2022
* Ensure that libbeat manager is instantiated after the hooks.

This fixes an issue in Filebeat that makes the start sequence of Filebeat
non-deterministic. It was possible that not all the hooks were
configured correctly before the manager received a configuration
from the Elastic Agent.

This causes an inconsistency between the expected configuration state
and the actual running state. This includes the following symptoms:

- Filebeat is running but not sending any data to Elasticsearch.
- Filebeat is partially configured, with only some inputs
  sending data.
- Logs are missing from the Filebeat collector.
- Only Metricbeat is running and sending logs.

This solves the issue by moving the `Start` and `Stop` of the
manager into the beats initialization process; each beat needs to be
adjusted to support it. This is indeed a breaking change for beats authors,
but the bootstrap process of beats and libbeat cannot easily be
extended to make the change in a single place.

(cherry picked from commit 4c14f03)
mergify bot pushed a commit that referenced this pull request Mar 14, 2022
mergify bot pushed a commit that referenced this pull request Mar 14, 2022
@ph
Contributor Author

ph commented Mar 14, 2022

It seems I've missed the alert of mergify concerning the conflict. I will make a followup PR.

@ph
Contributor Author

ph commented Mar 14, 2022

It seems I fixed that last week; time changes are hard. :(

axw pushed a commit to elastic/apm-server that referenced this pull request Mar 15, 2022
This moves the Manager.Start and Stop into the Beats' run method; this
move ensures that the system is configured and ready to receive events.

Having the Manager started and stopped at the Libbeat level was causing
inconsistency when configuring the Beats by the Elastic Agent.
The problem would lead to the following behavior:

- Zombie Beats with only outputs configured
- Beats without any inputs configured
- Beats with only some of the inputs configured

The problem was often caused by restarting the agent while the
machine was under a significant load.

See: elastic/beats#30694 for details
axw added a commit to elastic/apm-server that referenced this pull request Mar 15, 2022
* Update to elastic/beats@c52699616a8a

* Move Manager.Start() and Manager.Stop() in the beat execution.


* Update mock Manager implementation

Co-authored-by: apmmachine <infra-root-apmmachine@elastic.co>
Co-authored-by: Pier-Hugues Pellerin <phpellerin@gmail.com>
Co-authored-by: Andrew Wilkins <axw@elastic.co>
ph added a commit that referenced this pull request Mar 16, 2022

Co-authored-by: Pier-Hugues Pellerin <phpellerin@gmail.com>
axw pushed a commit to elastic/apm-server that referenced this pull request Mar 17, 2022
ph added a commit that referenced this pull request Mar 17, 2022
axw added a commit to elastic/apm-server that referenced this pull request Mar 18, 2022
* Update to elastic/beats@49a7ebdde9ef

* Move Manager.Start() and Manager.Stop() in the beat execution.


* Update mock Manager implementation

Co-authored-by: apmmachine <infra-root-apmmachine@elastic.co>
Co-authored-by: Pier-Hugues Pellerin <phpellerin@gmail.com>
Co-authored-by: Andrew Wilkins <axw@elastic.co>
ph added a commit that referenced this pull request Mar 21, 2022

Co-authored-by: Pier-Hugues Pellerin <phpellerin@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
ph added a commit that referenced this pull request Mar 21, 2022
…ayer (#30805)

* Fix/change the initialization of management layer (#30694)



Co-authored-by: Pier-Hugues Pellerin <phpellerin@gmail.com>
axw pushed a commit to elastic/apm-server that referenced this pull request Mar 22, 2022
axw added a commit to elastic/apm-server that referenced this pull request Mar 23, 2022
* Update to elastic/beats@6e046b747c6b

* Move Manager.Start() and Manager.Stop() in the beat execution.


* Update mock Manager implementation

Co-authored-by: apmmachine <infra-root-apmmachine@elastic.co>
Co-authored-by: Pier-Hugues Pellerin <phpellerin@gmail.com>
Co-authored-by: Andrew Wilkins <axw@elastic.co>
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
…astic#30806)



(cherry picked from commit 48da76f)

Co-authored-by: Pier-Hugues Pellerin <phpellerin@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
Labels
  • backport-v7.17.0: Automated backport with mergify
  • backport-v8.0.0: Automated backport with mergify
  • backport-v8.1.0: Automated backport with mergify
  • backport-v8.2.0: Automated backport with mergify
  • backport-v8.3.0: Automated backport with mergify
  • Team:Elastic-Agent-Control-Plane: Label for the Agent Control Plane team
Development

Successfully merging this pull request may close these issues.

Filebeat running under Elastic-Agent not harvesting logs after restart
8 participants