Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix startup with failing configuration #26126

Merged
merged 8 commits into from
Jun 8, 2021

Conversation

michalpristas
Copy link
Contributor

What does this PR do?

What is going on is a bit weird race.
Basically we start ok but troubles come on restart.

On restart we apply stored config to start MB (mb1) with failing configuration,
MB does not exit just reports Failed state because it cannot apply system/load on windows.
At this time 10 second timer to recover is started otherwise we plan a restart.

Then we pull config from fleet and decide MB should be started.
We start MB mb2 while mb1 is still running. We are starting it because check checks for terminal statuses and Failed is one of them.
This mb2 starts reports running, timer for killing mb1 is stopped. Because MB is running and it does not differentiate between processes.
mb2 fails on data.path conflict with mb1

watcher detects stopped metricbeat and mb3 is started, mb1 is still running, we dont have any information about mb1 anymore anywhere.
mb3 fails on same thing mb2 failed and exits.

mb1 still running in Failed state and we try mbX over and over again.

This PR adds some closers to watcher, some checks on Failure termination and passing proc so watcher and terminator are killing the process they were designed to kill.

While this fix works, i dont like the whole approach where we handle start/stop/restart from 4-5 places and they can be conflicting. I would like to see this redesigned. But i've spent 5 days chasing this and need some social distancing from the topic so this fix is OK at the moment for me.

How to test:

  • Build
  • Create config with fleet-server, system and endpoint integration
  • Deploy on windows machine
  • Wait for everything to settle and restart a lot of times to check if you see 3 metricbeat processes running in Task Manager.

How it behaved before the fix: we had 3 metricbeats, 2 with stable PID (one for monitoring, one is mb1) and then thirds MB process kept changing PID (crash-restart loop)

Why is it important?

Fixes #25829

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@michalpristas michalpristas added bug needs_backport PR is waiting to be backported to other branches. Team:Elastic-Agent Label for the Agent team v7.13.1 labels Jun 3, 2021
@michalpristas michalpristas self-assigned this Jun 3, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/agent (Team:Agent)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jun 3, 2021
@elasticmachine
Copy link
Collaborator

elasticmachine commented Jun 3, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: Pull request #26126 updated

  • Start Time: 2021-06-08T12:52:34.060+0000

  • Duration: 91 min 47 sec

  • Commit: 89a1c35

Test stats 🧪

Test Results
Failed 0
Passed 6908
Skipped 16
Total 6924

Trends 🧪

Image of Build Times

Image of Tests

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 6908
Skipped 16
Total 6924

@@ -192,16 +196,32 @@ func (a *Application) watch(ctx context.Context, p app.Taggable, proc *process.I
case ps := <-a.waitProc(proc.Process):
procState = ps
case <-a.bgContext.Done():
a.Stop()
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious: is there any connection between a.bgContext and ctx? Does ctx has to be cancelled when a.bgContext is cancelled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bgContext is like a cancellation token passed from top of the app and cancelled on exit or unenroll, so agent cleans and backs up everything in a nice manner. we want to avoid just shutting down agent without any cleaning as this may turn out problematic

return
}

a.appLock.Lock()
if a.state.ProcessInfo != proc {
// kill original process if possible
if proc != nil && proc.Process != nil {
_ = proc.Process.Kill()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we wait or check for other errors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is best effort


// was already stopped by Stop, do not restart
if a.state.Status == state.Stopped {
return
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we missing a `a.appLock.Unlock()t here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checking the complete function body, we always use Unlock for each exit path. Better add the defer unlock right after the lock in order to reduce the chance of introducing new deadlocks in the future.

@mergify
Copy link
Contributor

mergify bot commented Jun 4, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix/invalid-start upstream/fix/invalid-start
git merge upstream/master
git push upstream fix/invalid-start

}

// send stop signal to request stop
proc.Process.Signal(os.Interrupt)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work on Windows? I thought process.Info had each OS implementation that is why it was added. Should we just move this logic into the Stop() function of process.Info?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right windows does not implement sending interupt. using Stop function instead

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the latest change to using Stop this looks good.

@michalpristas michalpristas merged commit 5a294a4 into elastic:master Jun 8, 2021
@EricDavisX
Copy link
Contributor

not sure if the 7.13.2 label helps, but i def would like to see this back ported to 7.13.x - @michalpristas

michalpristas added a commit to michalpristas/beats that referenced this pull request Jun 11, 2021
michalpristas added a commit to michalpristas/beats that referenced this pull request Jun 11, 2021
michalpristas added a commit that referenced this pull request Jun 11, 2021
…26261)

Cherry-pick #26126 to 7.13: Fix startup with failing configuration  (#26261)
michalpristas added a commit that referenced this pull request Jun 11, 2021
…6262)

Cherry-pick #26126 to 7.x: Fix startup with failing configuration  (#26262)
michalpristas added a commit to michalpristas/beats that referenced this pull request Jun 21, 2021
mdelapenya added a commit to mdelapenya/beats that referenced this pull request Jun 21, 2021
* master: (26 commits)
  Report total and free CPU for vSphere virtual machines (elastic#26167)
  [filebeat] Add preserve_original_event option to o365audit input (elastic#26273)
  Change xml processor names in script processor to match convention (elastic#26263)
  [Oracle] Fixing default values for paths in config template (elastic#26276)
  Add more ECS fields to logs (elastic#25998)
  [Heartbeat] Fix broken invocation of synth package (elastic#26228)
  rename sqs file name (elastic#26227)
  Populate the agent action result if there is no matching action handlers (elastic#26152)
  Add ISO8601 as supported timestamp type (elastic#25564)
  Move Filebeat azure module to GA (elastic#26168)
  Filebeat azure module pipeline fixes and changes (elastic#26148)
  libbeat: monitor version (elastic#26214)
  Add new parser to filestream input: container (elastic#26115)
  [Metricbeat] Add state_statefulset replicas.ready (elastic#26088)
  Disable test processors system test for windows 10 (elastic#26216)
  Fix startup with failing configuration (elastic#26126)
  Remove 32 bits version of Elastic Agent. (elastic#25708)
  Chane fleetmode detection to ony use management.enabled (elastic#26180)
  Make `filestream` input GA (elastic#26127)
  libbeat/idxmgmt/ilm: fix alias creation (elastic#26146)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug needs_backport PR is waiting to be backported to other branches. Team:Elastic-Agent Label for the Agent team v7.13.1 v7.13.2
Projects
None yet
6 participants