
Reuse error channel #2697

Merged
merged 10 commits into from Jun 20, 2023

Conversation

michel-laterman
Contributor

What is the problem this PR solves?

SLES instances running a fleet-server instance under the elastic-agent are unhealthy after a restart (fleet-server does not reconfigure properly)

How does this PR solve the problem?

The error channel used by the server struct is created on each iteration of the run loop: https://github.com/elastic/fleet-server/blob/main/internal/pkg/server/fleet.go#L136

However, the runServer call actually occurs in a different goroutine: https://github.com/elastic/fleet-server/blob/main/internal/pkg/server/fleet.go#L117.

This means a config change can be read before the runServer method has run, leaking the previous channel; any error sent on that channel is lost, so the fleet-server gets stuck: it should signal that it has failed, but never does.
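The race described above can be sketched in a minimal, self-contained Go program (the names runServer and observeError here are illustrative stand-ins, not the actual fleet-server code): a goroutine that reports its terminal error on a channel must share that channel with the loop that reads it. Creating a fresh channel on each loop iteration means a goroutine started with a previous channel writes into one that nobody reads.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runServer simulates the server goroutine: it blocks until the
// context is cancelled, then reports its terminal error on ech.
func runServer(ctx context.Context, ech chan<- error) {
	<-ctx.Done()
	ech <- fmt.Errorf("server failed")
}

// observeError demonstrates the fixed pattern: the error channel is
// created once and shared with the goroutine, so the failure is seen.
func observeError() error {
	// Buggy shape (paraphrasing the issue in fleet.go): putting this
	// make() inside the run loop means a goroutine started on an
	// earlier iteration writes to a channel nobody selects on.
	ech := make(chan error, 2) // created once, reused across restarts
	ctx, cancel := context.WithCancel(context.Background())
	go runServer(ctx, ech)
	cancel() // simulate the restart/config change
	select {
	case err := <-ech:
		return err // failure is observed, not lost
	case <-time.After(time.Second):
		return nil // would indicate the error was lost
	}
}

func main() {
	fmt.Println(observeError())
}
```

Reusing a single buffered channel (as the PR title says) means every runServer goroutine, old or new, reports into the same place the run loop is reading from.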

How to test this PR locally

Install an agent with the fleet-server integration on a SLES machine and restart the machine.

Without the changes, the logs should stop with:

{"log.level":"info","@timestamp":"2023-06-12T22:16:42.131Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":985},"message":"Unit state changed fleet-server-default-fleet-server-fleet_server-9baba0cd-d2de-47cc-8a9b-d1d2f8c12d66 (STARTING->CONFIGURING): Re-configuring","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default-fleet-server-fleet_server-9baba0cd-d2de-47cc-8a9b-d1d2f8c12d66","type":"input","state":"CONFIGURING","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-06-12T22:16:42.131Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":985},"message":"Unit state changed fleet-server-default (STARTING->CONFIGURING): Re-configuring","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default","type":"output","state":"CONFIGURING","old_state":"STARTING"},"ecs.version":"1.6.0"}

# This error is lost
{"log.level":"error","@timestamp":"2023-06-12T22:16:42.132Z","message":"fail elasticsearch info","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"cluster.maxConnsPersHost":128,"error.message":"dial tcp: lookup 200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co on [::1]:53: read udp [::1]:38166->[::1]:53: read: connection refused","ecs.version":"1.6.0","service.name":"fleet-server","cluster.addr":["200dd16c33174ccb94de816bb6fd5c18.us-central1.gcp.qa.cld.elstc.co:443"],"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-06-12T22:16:42.132Z","message":"Stats endpoint (/opt/Elastic/Agent/data/tmp/fleet-server-default.sock) finished: accept unix /opt/Elastic/Agent/data/tmp/fleet-server-default.sock: use of closed network connection","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","log.logger":"fleet-metrics.api","ecs.version":"1.6.0"}

With the changes, the logs will contain the Fleet Server failed message after the re-configuration.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@michel-laterman michel-laterman added the bug (Something isn't working) and Team:Fleet (Label for the Fleet team) labels Jun 14, 2023
@michel-laterman michel-laterman requested a review from a team as a code owner June 14, 2023 22:22
@elasticmachine
Collaborator

elasticmachine commented Jun 14, 2023

💚 Build Succeeded

Build stats

  • Start Time: 2023-06-20T17:00:09.033+0000

  • Duration: 40 min 23 sec

Test stats 🧪

Test Results: 718 passed, 0 failed, 1 skipped (719 total)

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

Member

@jsoriano jsoriano left a comment


Would it be possible to add a test case for this?

changelog/fragments/1686780651-reuse-error-channel.yaml (outdated, resolved)
@juliaElastic
Contributor

Good catch! The fix looks good, we should add an integration test if possible to simulate the restart scenario.

Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
@michel-laterman
Contributor Author

michel-laterman commented Jun 19, 2023

I was unable to add an explicit test for this bug. We test a config reload with the agent integration tests.

EDIT: Managed to add another reload test at the fleet level. Also stopped the error that occurs when a fleet-server restarts on a config change from flipping the status to failed.

@michel-laterman
Contributor Author

/test

err = srv.Reload(ctx, newCfg)
require.NoError(t, err)

// Run server with bad config - it should fail, then work on the restart?
Contributor


Can we remove the commented-out code?

@michel-laterman michel-laterman enabled auto-merge (squash) June 20, 2023 15:34
@michel-laterman michel-laterman merged commit 080d311 into elastic:main Jun 20, 2023
17 of 18 checks passed
@michel-laterman michel-laterman deleted the fix-2431 branch June 20, 2023 17:43
Labels
bug (Something isn't working), Team:Fleet (Label for the Fleet team)
Development

Successfully merging this pull request may close these issues.

[SLES15]: Fleet-server Agent gets into offline state on machine reboot.
4 participants