The elastic agent watch command creates a zombie process each time it restarts #2190
@blakerouse does anything come to mind on what could cause this? Clearly the Beats restarting all the time was causing the unusual number of zombies, but the exited processes should not have been hanging around as zombies.
We should always be calling Wait here to prevent this. Clearly there is some interaction we are missing here, though. In 8.6.1 there shouldn't be as many zombie processes because we reduced the number of times the Beats exit, but leaving them behind would still be possible since we haven't changed anything in the process handling on the agent side. Here's the call stack in the agent that leads to calling Wait():

- elastic-agent/pkg/component/runtime/command.go lines 334 to 344 in fefe64f
- elastic-agent/pkg/component/runtime/command.go lines 378 to 389 in fefe64f
- elastic-agent/pkg/core/process/process.go lines 114 to 123 in fefe64f
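To make the mechanics concrete, here is a minimal standalone sketch (not agent code) of why a missing Wait() leaves a zombie on Linux: a child that exits before its parent waits on it stays in the process table as `<defunct>` until it is reaped.

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Start a short-lived child; it exits almost immediately.
	cmd := exec.Command("true")
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// The child has exited but has not been reaped yet. While we sleep here,
	// `ps -o pid,stat,comm --ppid <this pid>` shows it in state Z (<defunct>).
	time.Sleep(5 * time.Second)

	// Wait() reaps the child and removes the zombie entry from the process table.
	if err := cmd.Wait(); err != nil {
		fmt.Println("child exited with error:", err)
	}
	fmt.Println("child reaped; no zombie left behind")
}
```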
Other users are hitting this issue.
Do we have logs or diagnostics available when this is happening?
I reproduced this without trying, just by installing a standalone agent with the default policy.

elastic-agent-diagnostics-2023-10-05T19-56-15Z-00.zip

Diagnostics attached. I think the defunct process might be the upgrade watcher. I see it exiting as expected on startup:

```
{"log.level":"info","@timestamp":"2023-10-05T19:24:29.052Z","log.origin":{"file.name":"cmd/watch.go","file.line":67},"message":"Upgrade Watcher started","process.pid":1866,"agent.version":"8.10.2","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-05T19:24:29.052Z","log.origin":{"file.name":"cmd/watch.go","file.line":75},"message":"update marker not present at '/opt/Elastic/Agent/data'","ecs.version":"1.6.0"}
```

At startup we launch the watcher: elastic-agent/internal/pkg/agent/cmd/run.go lines 214 to 218 in ac2ef57
The way we launch it is: elastic-agent/internal/pkg/agent/application/upgrade/rollback.go lines 109 to 136 in ac2ef57
We call os.Process.Release, but the exec.Cmd.Start documentation says we need to call Wait(). Perhaps this is what is causing this.
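As a point of comparison, here is a hedged, standalone sketch (not the agent's code) contrasting the two calls: on Unix, os.Process.Release only frees the Go-side handle and does not reap the exited child, whereas Wait() does.

```go
package main

import (
	"os/exec"
	"time"
)

func main() {
	// Child exits immediately, but the handle is only Released, never Waited on.
	// It stays in the process table as <defunct> while this parent is alive.
	zombie := exec.Command("true")
	_ = zombie.Start()
	_ = zombie.Process.Release()

	// Same child, but reaped properly: Wait() collects its exit status,
	// so no zombie entry remains.
	reaped := exec.Command("true")
	_ = reaped.Start()
	_ = reaped.Wait()

	// Keep the parent alive long enough to observe the first child with `ps`.
	time.Sleep(30 * time.Second)
}
```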
I can reproduce this by installing the agent in an ubuntu:22.04 container as well. Each time I restart the agent I get another defunct process.
Confirmed this is the watcher: building the agent with the small patch below to stop starting the watcher at startup stops the creation of zombie processes:

```diff
diff --git a/internal/pkg/agent/cmd/run.go b/internal/pkg/agent/cmd/run.go
index 91d470ac0..3c39c98dc 100644
--- a/internal/pkg/agent/cmd/run.go
+++ b/internal/pkg/agent/cmd/run.go
@@ -212,10 +212,10 @@ func run(override cfgOverrider, testingMode bool, fleetInitTimeout time.Duration
 	}
 
 	// initiate agent watcher
-	if err := upgrade.InvokeWatcher(l); err != nil {
-		// we should not fail because watcher is not working
-		l.Error(errors.New(err, "failed to invoke rollback watcher"))
-	}
+	// if err := upgrade.InvokeWatcher(l); err != nil {
+	// 	// we should not fail because watcher is not working
+	// 	l.Error(errors.New(err, "failed to invoke rollback watcher"))
+	// }
 
 	if allowEmptyPgp, _ := release.PGP(); allowEmptyPgp {
 		l.Info("Elastic Agent has been built with security disabled. Elastic Agent will not verify signatures of upgrade artifact.")
```

We need to start the watcher, so this isn't a real solution. We'll probably need to wait on it. I also haven't explored whether we get a zombie process when the upgrade watcher is run across upgrades, since it needs to outlive its parent process and there's no way we could wait on it in that situation.
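One possible direction, sketched below as an assumption rather than the actual fix, is to keep launching the watcher at startup but reap it from a goroutine so the agent never blocks on it. The startWatcher helper and the sleep command are stand-ins for the real watcher launch path, and this only addresses the restart case, not the upgrade case where the watcher must outlive the agent.

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// startWatcher is a hypothetical stand-in for the agent's watcher launch path;
// the real agent would start `elastic-agent watch` here instead of sleep.
func startWatcher() (*exec.Cmd, error) {
	cmd := exec.Command("sleep", "5")
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

func main() {
	cmd, err := startWatcher()
	if err != nil {
		// A failure to start the watcher should not be fatal for the agent.
		log.Printf("failed to invoke rollback watcher: %v", err)
		return
	}

	// Reap the watcher in the background so it never lingers as <defunct>,
	// without blocking the agent's startup or main loop.
	go func() {
		if werr := cmd.Wait(); werr != nil {
			log.Printf("watcher exited with error: %v", werr)
		}
	}()

	// Stand-in for the agent's normal work.
	time.Sleep(10 * time.Second)
}
```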
Hi elastic team! We have detected that on all Red Hat machines with the elastic-agent installed, versions 8.8.1 or 8.10.1, several processes appear in a defunct state. For example, our hosts with elastic-agent version 8.8.1 still have 1 defunct process after 1 restart. The same happens at version 8.10.1. All hosts are ESXi VMs. Is there any solution for this issue?
You will get a defunct process every time the Elastic Agent restarts (and possibly each time it upgrades), as shown in #2190 (comment). This is caused by a bug in the Elastic Agent and requires a code change to fix. Restarting the system or stopping and starting the Elastic Agent service will likely clear these defunct processes, but they will just come back the next time the agent restarts.
@cmacknz what about launching the watcher by forking twice? I didn't try to implement that in Go, but that's how a process "daemonizes" on Linux... In short we would have 3 processes involved in launching the watcher: the agent itself, a short-lived intermediate process the agent can Wait() on, and the watcher started by the intermediate process, which gets reparented to init once the intermediate exits.

I think we could experiment with it and check that we don't leave zombie processes this way... Edit: forgot about the
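A rough sketch of that double-spawn idea, with hypothetical names: Go cannot fork() directly, so the intermediate step would have to re-exec the agent binary with a dedicated subcommand (spawn-watcher below is invented for illustration, and the sleep command stands in for the real `elastic-agent watch` invocation). The agent waits on the intermediate process, the intermediate starts the watcher and exits immediately, and the watcher is reparented to PID 1, which reaps it when it eventually exits.

```go
package main

import (
	"os"
	"os/exec"
)

// launchWatcherDetached starts a short-lived intermediate process and waits on
// it, so the intermediate never becomes a zombie in the agent's process table.
func launchWatcherDetached() error {
	self, err := os.Executable()
	if err != nil {
		return err
	}
	intermediate := exec.Command(self, "spawn-watcher") // hypothetical subcommand
	if err := intermediate.Start(); err != nil {
		return err
	}
	return intermediate.Wait() // the intermediate exits almost immediately
}

// runSpawnWatcher is what the hypothetical "spawn-watcher" subcommand would do:
// start the watcher and exit without waiting, handing the watcher over to init.
func runSpawnWatcher() error {
	watcher := exec.Command("sleep", "60") // placeholder for the real watcher
	return watcher.Start()
}

func main() {
	if len(os.Args) > 1 && os.Args[1] == "spawn-watcher" {
		if err := runSpawnWatcher(); err != nil {
			os.Exit(1)
		}
		return // exiting here reparents the watcher to PID 1
	}
	if err := launchWatcherDetached(); err != nil {
		os.Exit(1)
	}
	// ... the agent would carry on with its normal work here
}
```

The trade-off is that the agent loses its direct handle on the watcher process, which may matter if it ever needs to signal or monitor it.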
Seems like it might be more challenging than expected:
Please see the solution in https://segmentfault.com/a/1190000041466423/en
Both solutions use
I think what we need to determine here is whether there is only a zombie because of re-exec, or if there is always a zombie in the case where the watcher is started and the parent exits. I believe it's only because of the re-exec process, where we re-use the same process space so that the parent -> child reference still exists.
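For completeness, one generic way to handle children inherited across a re-exec (where the process keeps its PID but has no *exec.Cmd handles for children started before the re-exec) would be a SIGCHLD-driven reaper like the Linux-only sketch below. This is an illustration of the technique, not something the agent does today, and a global wait4(-1) loop would race with the agent's own exec.Cmd.Wait calls, so it is not a drop-in fix.

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
	"time"
)

// reapChildren collects any exited child of this process, whoever started it.
func reapChildren(sigc <-chan os.Signal) {
	for range sigc {
		for {
			var status syscall.WaitStatus
			// -1 means "any child"; WNOHANG keeps the loop non-blocking.
			pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
			if pid <= 0 || err != nil {
				break
			}
			log.Printf("reaped child pid=%d", pid)
		}
	}
}

func main() {
	sigc := make(chan os.Signal, 1)
	signal.Notify(sigc, syscall.SIGCHLD)
	go reapChildren(sigc)

	// Simulate a child inherited from before a re-exec: started but never Waited on.
	_ = exec.Command("true").Start()

	// Give the reaper a moment to collect it, then exit.
	time.Sleep(2 * time.Second)
}
```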
Are all versions affected by this? I also noticed this happening in my environments.
Yes, as far as I can tell this behavior has likely been here since the agent was released. We have a child process that outlives its parent process' ability to call Wait() on it.
Just the Linux OS.
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
This will be less frequent in 8.15.0: there will no longer be a defunct process every time the Elastic Agent process restarts. There will likely still be a defunct process each time the Elastic Agent is upgraded, because of the way our upgrade logic is implemented. Fixing that properly is more complex than just calling Wait().
Before we fixed elastic/beats#34178 we had some reports of an excessive number of zombie processes being left on a machine. In one case a single macOS machine was left with hundreds of zombie processes. Example attached below.
ps-aux.txt
It seems like there is a correlation between the Beats restarting themselves and zombie processes being left behind. We should investigate this and see if there is anything we should be doing to ensure these processes are properly cleaned up.
The path the Beats follow to exit in this case is here.