Fix startup with failing configuration #26126

michalpristas · 2021-06-03T06:43:19Z

What does this PR do?

What is going on is a bit weird race.
Basically we start ok but troubles come on restart.

On restart we apply stored config to start MB (mb1) with failing configuration,
MB does not exit just reports Failed state because it cannot apply system/load on windows.
At this time 10 second timer to recover is started otherwise we plan a restart.

Then we pull config from fleet and decide MB should be started.
We start MB mb2 while mb1 is still running. We are starting it because check checks for terminal statuses and Failed is one of them.
This mb2 starts reports running, timer for killing mb1 is stopped. Because MB is running and it does not differentiate between processes.
mb2 fails on data.path conflict with mb1

watcher detects stopped metricbeat and mb3 is started, mb1 is still running, we dont have any information about mb1 anymore anywhere.
mb3 fails on same thing mb2 failed and exits.

mb1 still running in Failed state and we try mbX over and over again.

This PR adds some closers to watcher, some checks on Failure termination and passing proc so watcher and terminator are killing the process they were designed to kill.

While this fix works, i dont like the whole approach where we handle start/stop/restart from 4-5 places and they can be conflicting. I would like to see this redesigned. But i've spent 5 days chasing this and need some social distancing from the topic so this fix is OK at the moment for me.

How to test:

Build
Create config with fleet-server, system and endpoint integration
Deploy on windows machine
Wait for everything to settle and restart a lot of times to check if you see 3 metricbeat processes running in Task Manager.

How it behaved before the fix: we had 3 metricbeats, 2 with stable PID (one for monitoring, one is mb1) and then thirds MB process kept changing PID (crash-restart loop)

Why is it important?

Fixes #25829

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

elasticmachine · 2021-06-03T06:43:21Z

Pinging @elastic/agent (Team:Agent)

elasticmachine · 2021-06-03T08:05:49Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Build Cause: Pull request #26126 updated
Start Time: 2021-06-08T12:52:34.060+0000
Duration: 91 min 47 sec
Commit: 89a1c35

Test stats 🧪

Test	Results
Failed	0
Passed	6908
Skipped	16
Total	6924

Trends 🧪

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test	Results
Failed	0
Passed	6908
Skipped	16
Total	6924

kvch · 2021-06-03T09:51:31Z

x-pack/elastic-agent/pkg/core/plugin/process/app.go

@@ -192,16 +196,32 @@ func (a *Application) watch(ctx context.Context, p app.Taggable, proc *process.I
 		case ps := <-a.waitProc(proc.Process):
 			procState = ps
 		case <-a.bgContext.Done():
-			a.Stop()
+			return


Just curious: is there any connection between a.bgContext and ctx? Does ctx has to be cancelled when a.bgContext is cancelled?

bgContext is like a cancellation token passed from top of the app and cancelled on exit or unenroll, so agent cleans and backs up everything in a nice manner. we want to avoid just shutting down agent without any cleaning as this may turn out problematic

urso · 2021-06-04T14:50:52Z

x-pack/elastic-agent/pkg/core/plugin/process/app.go

 			return
 		}

 		a.appLock.Lock()
 		if a.state.ProcessInfo != proc {
+			// kill original process if possible
+			if proc != nil && proc.Process != nil {
+				_ = proc.Process.Kill()


shall we wait or check for other errors?

this is best effort

urso · 2021-06-04T14:51:26Z

x-pack/elastic-agent/pkg/core/plugin/process/app.go

+
+		// was already stopped by Stop, do not restart
+		if a.state.Status == state.Stopped {
+			return


are we missing a `a.appLock.Unlock()t here?

checking the complete function body, we always use Unlock for each exit path. Better add the defer unlock right after the lock in order to reduce the chance of introducing new deadlocks in the future.

mergify · 2021-06-04T16:25:44Z

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix/invalid-start upstream/fix/invalid-start
git merge upstream/master
git push upstream fix/invalid-start

blakerouse · 2021-06-08T12:17:31Z

x-pack/elastic-agent/pkg/core/plugin/process/app.go

+	}
+
+	// send stop signal to request stop
+	proc.Process.Signal(os.Interrupt)


Does this work on Windows? I thought process.Info had each OS implementation that is why it was added. Should we just move this logic into the Stop() function of process.Info?

you're right windows does not implement sending interupt. using Stop function instead

…to fix/invalid-start

blakerouse

With the latest change to using Stop this looks good.

EricDavisX · 2021-06-09T21:36:54Z

not sure if the 7.13.2 label helps, but i def would like to see this back ported to 7.13.x - @michalpristas

Fix startup with failing configuration (elastic#26126)

…26261) Cherry-pick #26126 to 7.13: Fix startup with failing configuration (#26261)

…6262) Cherry-pick #26126 to 7.x: Fix startup with failing configuration (#26262)

This reverts commit 5a294a4.

* master: (26 commits) Report total and free CPU for vSphere virtual machines (elastic#26167) [filebeat] Add preserve_original_event option to o365audit input (elastic#26273) Change xml processor names in script processor to match convention (elastic#26263) [Oracle] Fixing default values for paths in config template (elastic#26276) Add more ECS fields to logs (elastic#25998) [Heartbeat] Fix broken invocation of synth package (elastic#26228) rename sqs file name (elastic#26227) Populate the agent action result if there is no matching action handlers (elastic#26152) Add ISO8601 as supported timestamp type (elastic#25564) Move Filebeat azure module to GA (elastic#26168) Filebeat azure module pipeline fixes and changes (elastic#26148) libbeat: monitor version (elastic#26214) Add new parser to filestream input: container (elastic#26115) [Metricbeat] Add state_statefulset replicas.ready (elastic#26088) Disable test processors system test for windows 10 (elastic#26216) Fix startup with failing configuration (elastic#26126) Remove 32 bits version of Elastic Agent. (elastic#25708) Chane fleetmode detection to ony use management.enabled (elastic#26180) Make `filestream` input GA (elastic#26127) libbeat/idxmgmt/ilm: fix alias creation (elastic#26146) ...

all processes must die

a25f2bd

michalpristas added bug needs_backport PR is waiting to be backported to other branches. Team:Elastic-Agent Label for the Agent team v7.13.1 labels Jun 3, 2021

michalpristas self-assigned this Jun 3, 2021

botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jun 3, 2021

kvch reviewed Jun 3, 2021

View reviewed changes

urso reviewed Jun 4, 2021

View reviewed changes

michalpristas added 3 commits June 4, 2021 16:55

missing unlock

1fcaded

graceful kill

3f8aeb5

unlock in defer

6b74838

Merge branch 'master' into fix/invalid-start

12b1455

blakerouse reviewed Jun 8, 2021

View reviewed changes

michalpristas added 3 commits June 8, 2021 14:50

use stop, windows does not implement sending interupt

47db569

Merge branch 'fix/invalid-start' of github.com:michalpristas/beats in…

ae3951d

…to fix/invalid-start

Merge branch 'master' of github.com:elastic/beats into fix/invalid-start

89a1c35

blakerouse approved these changes Jun 8, 2021

View reviewed changes

michalpristas merged commit 5a294a4 into elastic:master Jun 8, 2021

EricDavisX added the v7.13.2 label Jun 9, 2021

michalpristas mentioned this pull request Jun 10, 2021

Investigate cummulating memory when process cannot reach agent #26242

Closed

michalpristas added a commit to michalpristas/beats that referenced this pull request Jun 11, 2021

Fix startup with failing configuration (elastic#26126)

0cd5ee6

Fix startup with failing configuration (elastic#26126)

michalpristas added a commit to michalpristas/beats that referenced this pull request Jun 11, 2021

Fix startup with failing configuration (elastic#26126)

4ca5651

Fix startup with failing configuration (elastic#26126)

michalpristas added a commit that referenced this pull request Jun 11, 2021

Cherry-pick #26126 to 7.13: Fix startup with failing configuration (#…

ad5f805

…26261) Cherry-pick #26126 to 7.13: Fix startup with failing configuration (#26261)

michalpristas added a commit that referenced this pull request Jun 11, 2021

Cherry-pick #26126 to 7.x: Fix startup with failing configuration (#2…

e73b271

…6262) Cherry-pick #26126 to 7.x: Fix startup with failing configuration (#26262)

michalpristas added a commit to michalpristas/beats that referenced this pull request Jun 21, 2021

Revert "Fix startup with failing configuration (elastic#26126)"

b895e44

This reverts commit 5a294a4.

ph mentioned this pull request May 20, 2022

Collect process (beats) stdout and stderr elastic/elastic-agent#455

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix startup with failing configuration #26126

Fix startup with failing configuration #26126

michalpristas commented Jun 3, 2021

elasticmachine commented Jun 3, 2021

elasticmachine commented Jun 3, 2021 •

edited by jenkins-beats-ci bot

Loading

Build stats

Test stats 🧪

Trends 🧪

Test stats 🧪

kvch Jun 3, 2021

michalpristas Jun 3, 2021

urso Jun 4, 2021

michalpristas Jun 4, 2021

urso Jun 4, 2021

michalpristas Jun 4, 2021

urso Jun 4, 2021

mergify bot commented Jun 4, 2021

blakerouse Jun 8, 2021

michalpristas Jun 8, 2021

blakerouse left a comment

EricDavisX commented Jun 9, 2021

Fix startup with failing configuration #26126

Fix startup with failing configuration #26126

Conversation

michalpristas commented Jun 3, 2021

What does this PR do?

Why is it important?

Checklist

elasticmachine commented Jun 3, 2021

elasticmachine commented Jun 3, 2021 • edited by jenkins-beats-ci bot Loading

💚 Build Succeeded

Build stats

Test stats 🧪

Trends 🧪

💚 Flaky test report

Test stats 🧪

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mergify bot commented Jun 4, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

blakerouse left a comment

Choose a reason for hiding this comment

EricDavisX commented Jun 9, 2021

elasticmachine commented Jun 3, 2021 •

edited by jenkins-beats-ci bot

Loading