
Reuse error channel #2697

Merged
merged 10 commits into from Jun 20, 2023

Conversation

michel-laterman
Contributor

What is the problem this PR solves?

SLES instances running a fleet-server instance under the elastic-agent are unhealthy after a restart (fleet-server does not reconfigure properly)

How does this PR solve the problem?

The error channel used by the server struct is created on each iteration of the run loop: https://github.com/elastic/fleet-server/blob/main/internal/pkg/server/fleet.go#L136

However, the runServer call actually occurs in a different goroutine: https://github.com/elastic/fleet-server/blob/main/internal/pkg/server/fleet.go#L117.

This means a config change can be read before the runServer method has run, leaking the previous channel; any error sent on that channel is lost, so the fleet-server gets stuck: it should signal that it has failed, but never does.
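The race described above can be sketched in a minimal, self-contained Go program (the names runServer and observeError here are illustrative stand-ins, not the actual fleet-server code): a goroutine that reports its terminal error on a channel must share that channel with the loop that reads it. Creating a fresh channel on each loop iteration means a goroutine started with a previous channel writes into one that nobody reads.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runServer simulates the server goroutine: it blocks until the
// context is cancelled, then reports its terminal error on ech.
func runServer(ctx context.Context, ech chan<- error) {
	<-ctx.Done()
	ech <- fmt.Errorf("server failed")
}

// observeError demonstrates the fixed pattern: the error channel is
// created once and shared with the goroutine, so the failure is seen.
func observeError() error {
	// Buggy shape (paraphrasing the issue in fleet.go): putting this
	// make() inside the run loop means a goroutine started on an
	// earlier iteration writes to a channel nobody selects on.
	ech := make(chan error, 2) // created once, reused across restarts
	ctx, cancel := context.WithCancel(context.Background())
	go runServer(ctx, ech)
	cancel() // simulate the restart/config change
	select {
	case err := <-ech:
		return err // failure is observed, not lost
	case <-time.After(time.Second):
		return nil // would indicate the error was lost
	}
}

func main() {
	fmt.Println(observeError())
}
```

Reusing a single buffered channel (as the PR title says) means every runServer goroutine, old or new, reports into the same place the run loop is reading from.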

How to test this PR locally

Install an agent with the fleet-server integration on a SLES machine and restart the machine.

Without the changes, the logs should stop with:

{"log.level":"info","@timestamp":"2023-06-12T22:16:42.131Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":985},"message":"Unit state changed fleet-server-default-fleet-server-fleet_server-9baba0cd-d2de-47cc-8a9b-d1d2f8c12d66 (STARTING->CONFIGURING): Re-configuring","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default-fleet-server-fleet_server-9baba0cd-d2de-47cc-8a9b-d1d2f8c12d66","type":"input","state":"CONFIGURING","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-06-12T22:16:42.131Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":985},"message":"Unit state changed fleet-server-default (STARTING->CONFIGURING): Re-configuring","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default","type":"output","state":"CONFIGURING","old_state":"STARTING"},"ecs.version":"1.6.0"}

# This error is lost
{"log.level":"error","@timestamp":"2023-06-12T22:16:42.132Z","message":"fail elasticsearch info","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"cluster.maxConnsPersHost":128,"error.message":"dial tcp: lookup 200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co on [::1]:53: read udp [::1]:38166->[::1]:53: read: connection refused","ecs.version":"1.6.0","service.name":"fleet-server","cluster.addr":["200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co:443"],"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-06-12T22:16:42.132Z","message":"Stats endpoint (/opt/Elastic/Agent/data/tmp/fleet-server-default.sock) finished: accept unix /opt/Elastic/Agent/data/tmp/fleet-server-default.sock: use of closed network connection","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","log.logger":"fleet-metrics.api","ecs.version":"1.6.0"}

With the changes, the logs will contain the Fleet Server failed message after the re-configuration.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@michel-laterman michel-laterman added the bug (Something isn't working) and Team:Fleet (Label for the Fleet team) labels Jun 14, 2023
@michel-laterman michel-laterman requested a review from a team as a code owner June 14, 2023 22:22
@elasticmachine
Collaborator

elasticmachine commented Jun 14, 2023

💚 Build Succeeded

Build stats

  • Start Time: 2023-06-20T17:00:09.033+0000

  • Duration: 40 min 23 sec

Test stats 🧪

Test Results: 718 passed, 0 failed, 1 skipped (719 total)

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

Member

@jsoriano jsoriano left a comment


Would it be possible to add a test case for this?

changelog/fragments/1686780651-reuse-error-channel.yaml (outdated, resolved)
@juliaElastic
Contributor

Good catch! The fix looks good, we should add an integration test if possible to simulate the restart scenario.

Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
@michel-laterman
Contributor Author

michel-laterman commented Jun 19, 2023

I was unable to add an explicit test for this bug. We test a config reload with the agent integration tests.

EDIT: Managed to add another reload test at the fleet level. Also stopped the error that occurs when a fleet-server restarts on a config change from flipping the status to failed.

@michel-laterman
Contributor Author

/test

err = srv.Reload(ctx, newCfg)
require.NoError(t, err)

// Run server with bad config - it should fail, then work on the restart?
Contributor


Can we remove the commented-out code?

@michel-laterman michel-laterman enabled auto-merge (squash) June 20, 2023 15:34
@michel-laterman michel-laterman merged commit 080d311 into elastic:main Jun 20, 2023
17 of 18 checks passed
@michel-laterman michel-laterman deleted the fix-2431 branch June 20, 2023 17:43
Labels
bug (Something isn't working), Team:Fleet (Label for the Fleet team)
Development

Successfully merging this pull request may close these issues.

[SLES15]: Fleet-server Agent gets into offline state on machine reboot.
4 participants