
Fix indefinite memory and CPU consumption when waiting for Fleet Server to be ready #5034

Merged

Conversation

AndersonQ
Member

@AndersonQ AndersonQ commented Jul 2, 2024

What does this PR do?

Fixes the wait for Fleet Server to be ready

Why is it important?

When waiting for Fleet Server to start, the Elastic Agent does not account for the configured timeout while waiting for it to become ready.

Currently, when the timeout is reached, the operation is not interrupted and the goroutine waiting for Fleet Server to be ready gets stuck in an infinite loop with no delay between iterations, repeatedly printing a log entry such as:

{"log.level":"info","@timestamp":"2024-07-02T13:18:59.354Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":812},"message":"Waiting for Elastic Agent to start: rpc error: code = Canceled desc = context canceled","ecs.version":"1.6.0"}

This causes a spike in memory and CPU consumption until the agent is killed by the OS, potentially jeopardising the normal operation of the host.
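In essence, the fix makes the readiness loop observe the enrollment timeout and back off between retries instead of spinning. The sketch below is a minimal illustration of that shape, not the actual elastic-agent implementation; waitForFleetServer and the checkReady callback are hypothetical placeholders for the real readiness check performed in cmd/enroll_cmd.go.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForFleetServer is a hypothetical stand-in for the agent's readiness wait,
// not the real elastic-agent function. checkReady represents the status call
// made against Fleet Server on each attempt.
func waitForFleetServer(ctx context.Context, checkReady func(context.Context) error) error {
	backoff := 500 * time.Millisecond
	const maxBackoff = 10 * time.Second

	for {
		if err := checkReady(ctx); err == nil {
			return nil // Fleet Server reported ready.
		}

		select {
		case <-ctx.Done():
			// The enrollment timeout (or a cancellation) was reached:
			// stop retrying instead of looping forever.
			return fmt.Errorf("fleet server did not become ready: %w", ctx.Err())
		case <-time.After(backoff):
			// Exponential backoff between attempts keeps the loop from
			// busy-spinning and burning CPU while it waits.
			if backoff < maxBackoff {
				backoff *= 2
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	err := waitForFleetServer(ctx, func(context.Context) error {
		return errors.New("not ready yet")
	})
	fmt.Println(err) // returns once the timeout elapses rather than spinning indefinitely
}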

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

Try to reproduce #5033; the issue should not be reproducible with this fix.

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@AndersonQ AndersonQ added the bug Something isn't working label Jul 2, 2024
@AndersonQ AndersonQ self-assigned this Jul 2, 2024
Contributor

mergify bot commented Jul 2, 2024

This pull request does not have a backport label. Could you fix it @AndersonQ? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 8.\d branch, where \d is a digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip label Jul 2, 2024
@AndersonQ AndersonQ force-pushed the 5033-fix-fleet-start-error-handling branch from 027dc1c to 7ff0626 on July 2, 2024 15:35
@AndersonQ AndersonQ marked this pull request as ready for review July 2, 2024 16:26
@AndersonQ AndersonQ requested a review from a team as a code owner July 2, 2024 16:26
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Jul 2, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

Contributor

@blakerouse blakerouse left a comment


Looks good and the test covers the fix. Nice!

@AndersonQ AndersonQ added backport-v8.14.0 Automated backport with mergify and removed backport-skip labels Jul 2, 2024
@pierrehilbert pierrehilbert merged commit 8aa3477 into elastic:main Jul 2, 2024
16 checks passed
mergify bot pushed a commit that referenced this pull request Jul 2, 2024
…ady (#5034)

* exit if timeout is reached while waiting for fleet server to start

* clarify exponential backoff behaviour

* add test

* add changelog

* fix changelog

(cherry picked from commit 8aa3477)
@AndersonQ AndersonQ deleted the 5033-fix-fleet-start-error-handling branch July 3, 2024 14:19
AndersonQ added a commit that referenced this pull request Jul 3, 2024
…ady (#5034) (#5040)

* exit if timeout is reached while waiting for fleet server to start

* clarify exponential backoff behaviour

(cherry picked from commit 8aa3477)

Co-authored-by: Anderson Queiroz <anderson.queiroz@elastic.co>
Labels
  • backport-v8.14.0: Automated backport with mergify
  • bug: Something isn't working
  • Team:Elastic-Agent-Control-Plane: Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

High memory and CPU consumption when fleet-server fails to start during enroll
4 participants