Reuse error channel #2697
Conversation
Would it be possible to add a test case for this?
Good catch! The fix looks good; we should add an integration test, if possible, to simulate the restart scenario.
Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
I was unable to add an explicit test for this bug. EDIT: Managed to add another reload test at the fleet level.
/test
err = srv.Reload(ctx, newCfg)
require.NoError(t, err)

// Run server with bad config - it should fail, then work on the restart?
Can we remove the commented-out code?
What is the problem this PR solves?
SLES instances running fleet-server under the elastic-agent are unhealthy after a restart (fleet-server does not reconfigure properly).
How does this PR solve the problem?
The error channel used by the server struct is created on each iteration of the run loop: https://github.com/elastic/fleet-server/blob/main/internal/pkg/server/fleet.go#L136
However, the runServer call actually happens in a different goroutine: https://github.com/elastic/fleet-server/blob/main/internal/pkg/server/fleet.go#L117.
A config change can therefore be read before runServer has run and reported its error, leaking the channel: the error sent on the old channel is never received, so fleet-server gets stuck; it should signal that it has failed, but never does.
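For illustration, here is a minimal, self-contained Go sketch of that pattern (hypothetical names such as runLoop and runServer, not the actual fleet-server code). The commented-out line inside the loop marks where the channel used to be created; hoisting that make above the loop, so the channel is reused across iterations, is the essence of this change.

package main

import (
	"context"
	"errors"
	"fmt"
)

// runServer stands in for the real server; here it simply fails.
func runServer(ctx context.Context, cfg string) error {
	return errors.New("failed with config " + cfg)
}

func runLoop(ctx context.Context, cfgCh <-chan string, cfg string) error {
	// Fix: create the error channel once and reuse it on every iteration.
	ech := make(chan error, 2)

	started := false
	for {
		// Bug: the channel used to be created here, inside the loop:
		//   ech := make(chan error, 2)
		// so a config update that did not restart the server left the running
		// goroutine holding a channel the loop no longer reads from.

		if !started { // the real loop only (re)starts the server when its config changes
			started = true
			go func(c string) {
				if err := runServer(ctx, c); err != nil {
					ech <- err // with a per-iteration channel this send is never received
				}
			}(cfg)
		}

		select {
		case cfg = <-cfgCh:
			// A config update read here before the server reported its failure
			// used to orphan the old channel; the next iteration then selected
			// on a fresh one and the "Fleet Server failed" path was never hit.
		case err := <-ech:
			fmt.Println("Fleet Server failed:", err)
			return err
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	cfgCh := make(chan string, 1)
	cfgCh <- "updated-config" // a config update arrives around the time the server fails
	_ = runLoop(context.Background(), cfgCh, "initial-config")
}

With the per-iteration channel, a restart that triggered a config update could hit exactly this window, which is why the agent stayed unhealthy without the failure ever being reported.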
How to test this PR locally
Install an agent with the fleet-server integration on a SLES machine and restart the machine.
Without the changes the logs should stop with:
With the changes the logs will have the "Fleet Server failed" message after.

Checklist

I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in ./changelog/fragments using the changelog tool

Related issues