The elastic agent watch command creates a zombie process each time it restarts #2190

Open
cmacknz opened this issue Jan 26, 2023 · 18 comments
Labels: bug, Team:Elastic-Agent, Team:Elastic-Agent-Control-Plane

@cmacknz (Member) commented Jan 26, 2023

Before we fixed elastic/beats#34178, we had some reports of an excessive number of zombie processes being left on a machine. In one case a single macOS machine was left with hundreds of zombie processes. An example is attached below.

ps-aux.txt

It seems like there is a correlation between the Beats restarting themselves and zombie processes being left behind. We should investigate this and see if there is anything we should be doing to ensure these processes are properly cleaned up.

The path the Beats follow to exit in this case is here.

@cmacknz added the bug and Team:Elastic-Agent labels on Jan 26, 2023
@cmacknz (Member Author) commented Jan 26, 2023

@blakerouse does anything come to mind about what could cause this? Clearly the Beats restarting all the time was causing the unusual number of zombies, but the exited processes should not have been hanging around as zombies.

@cmacknz (Member Author) commented Feb 1, 2023

We should always be calling Wait here to prevent this, so clearly there is some interaction we are missing. In 8.6.1 there shouldn't be as many zombie processes, since we reduced the number of times the Beats exit, but leaving them behind is still possible because we haven't changed anything in the process handling on the agent side.

Here's the call stack involved in the agent that leads to calling Wait():

proc, err := process.Start(path,
    process.WithArgs(args),
    process.WithEnv(env),
    process.WithCmdOptions(attachOutErr(c.logStd, c.logErr), dirPath(workDir)))
if err != nil {
    return err
}
c.proc = proc
c.forceCompState(client.UnitStateStarting, fmt.Sprintf("Starting: spawned pid '%d'", c.proc.PID))
c.startWatcher(proc, comm)

func (c *CommandRuntime) startWatcher(info *process.Info, comm Communicator) {
    go func() {
        err := comm.WriteConnInfo(info.Stdin)
        if err != nil {
            c.forceCompState(client.UnitStateFailed, fmt.Sprintf("Failed: failed to provide connection information to spawned pid '%d': %s", info.PID, err))
            // kill instantly
            _ = info.Kill()
        } else {
            _ = info.Stdin.Close()
        }
        ch := info.Wait()

// Wait returns a channel that will send process state once it exits.
func (i *Info) Wait() <-chan *os.ProcessState {
    ch := make(chan *os.ProcessState)
    go func() {
        procState, err := i.Process.Wait()
        if err != nil {
            // process is not a child - some OSs requires process to be child
            externalProcess(i.Process)
        }

@lucabelluccini (Contributor) commented:

Other users are hitting this issue.

@cmacknz (Member Author) commented Oct 3, 2023

Do we have logs or diagnostics available when this is happening?

@cmacknz (Member Author) commented Oct 5, 2023

I reproduced this without even trying, just by installing a standalone agent with the default policy on a fresh Ubuntu 22.04 VM.

craig_mackenzie@cmackenzie-ubuntu-test:~/elastic-agent-8.10.2-linux-x86_64$ ps aux | grep elastic
root        1860  1.5  1.3 1506628 53864 ?       Ssl  19:24   0:01 /opt/Elastic/Agent/elastic-agent
root        1866  0.0  0.0      0     0 ?        Zs   19:24   0:00 [elastic-agent] <defunct>
root        1874  0.4  2.8 1482120 115692 ?      Sl   19:24   0:00 /opt/Elastic/Agent/data/elastic-agent-b0c688/components/metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${METRICBEAT_GOGC:100} -E metricbeat.config.modules.enabled=false -E http.enabled=true -E http.host=unix:///opt/Elastic/Agent/data/tmp/iThI_df0cBKC6YUNGGlKscMkOfz3FBH3.sock -E path.data=/opt/Elastic/Agent/data/elastic-agent-b0c688/run/system/metrics-default
root        1884  0.3  2.7 1387768 108976 ?      Sl   19:24   0:00 /opt/Elastic/Agent/data/elastic-agent-b0c688/components/filebeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${FILEBEAT_GOGC:100} -E filebeat.config.modules.enabled=false -E http.enabled=true -E http.host=unix:///opt/Elastic/Agent/data/tmp/xTEtpJ7117ppc6OYvJCaYHbDW8mLjXGe.sock -E path.data=/opt/Elastic/Agent/data/elastic-agent-b0c688/run/filestream-monitoring
root        1890  0.3  2.8 1408388 113792 ?      Sl   19:24   0:00 /opt/Elastic/Agent/data/elastic-agent-b0c688/components/metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${METRICBEAT_GOGC:100} -E metricbeat.config.modules.enabled=false -E http.enabled=true -E http.host=unix:///opt/Elastic/Agent/data/tmp/Hk6rvk9TDibMPcDvpl0jkLE-qDsHWVYL.sock -E path.data=/opt/Elastic/Agent/data/elastic-agent-b0c688/run/beat/metrics-monitoring
root        1902  0.2  2.8 1481864 115096 ?      Sl   19:24   0:00 /opt/Elastic/Agent/data/elastic-agent-b0c688/components/metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${METRICBEAT_GOGC:100} -E metricbeat.config.modules.enabled=false -E http.enabled=true -E http.host=unix:///opt/Elastic/Agent/data/tmp/akSPbdqgaHaTY0_J01-dsfYK6JpMz2zn.sock -E path.data=/opt/Elastic/Agent/data/elastic-agent-b0c688/run/http/metrics-monitoring
craig_m+    1930  0.0  0.0   7004  2304 pts/0    S+   19:26   0:00 grep --color=auto elastic

elastic-agent-diagnostics-2023-10-05T19-56-15Z-00.zip

Diagnostics attached. I think the defunct process might be the upgrade watcher. I see it exiting as expected on startup:

{"log.level":"info","@timestamp":"2023-10-05T19:24:29.052Z","log.origin":{"file.name":"cmd/watch.go","file.line":67},"message":"Upgrade Watcher started","process.pid":1866,"agent.version":"8.10.2","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-05T19:24:29.052Z","log.origin":{"file.name":"cmd/watch.go","file.line":75},"message":"update marker not present at '/opt/Elastic/Agent/data'","ecs.version":"1.6.0"}

At startup we launch the elastic-agent watch command as a sub-process:

// initiate agent watcher
if err := upgrade.InvokeWatcher(l); err != nil {
    // we should not fail because watcher is not working
    l.Error(errors.New(err, "failed to invoke rollback watcher"))
}

The way we launch it is:

// InvokeWatcher invokes an agent instance using watcher argument for watching behavior of
// agent during upgrade period.
func InvokeWatcher(log *logger.Logger) error {
    if !IsUpgradeable() {
        log.Info("agent is not upgradable, not starting watcher")
        return nil
    }
    cmd := invokeCmd()
    defer func() {
        if cmd.Process != nil {
            log.Infof("releasing watcher %v", cmd.Process.Pid)
            _ = cmd.Process.Release()
        }
    }()
    log.Infow("Starting upgrade watcher", "path", cmd.Path, "args", cmd.Args, "env", cmd.Env, "dir", cmd.Dir)
    if err := cmd.Start(); err != nil {
        return fmt.Errorf("failed to start Upgrade Watcher: %w", err)
    }
    upgradeWatcherPID := cmd.Process.Pid
    agentPID := os.Getpid()
    log.Infow("Upgrade Watcher invoked", "agent.upgrade.watcher.process.pid", upgradeWatcherPID, "agent.process.pid", agentPID)
    return nil
}

We call os.Process.Release, but the exec.Cmd.Start documentation says Wait must be called after a successful Start in order to release the associated system resources. Perhaps this is what is causing this.
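For the restart case, where the agent process keeps running, a minimal sketch of what calling Wait could look like (illustrative only, not the actual fix): drop the Release call and reap the watcher from a goroutine.

// Sketch only: reap the watcher instead of releasing it, so the kernel can
// remove its process table entry once it exits. This assumes the parent
// stays alive for the watcher's lifetime, which is true for restarts but
// not for upgrades.
if err := cmd.Start(); err != nil {
    return fmt.Errorf("failed to start Upgrade Watcher: %w", err)
}
go func() {
    // exec.Cmd.Wait releases the resources Start allocated and reaps the
    // exited child; without it the watcher stays <defunct>.
    if err := cmd.Wait(); err != nil {
        log.Infof("Upgrade Watcher exited with error: %v", err)
    }
}()
return nil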

@cmacknz (Member Author) commented Oct 5, 2023

I can reproduce this by installing the agent in an ubuntu:22.04 container as well. Each time I restart the agent I get another defunct process:

root@55df25f47e9b:/elastic-agent-8.10.2-linux-arm64# elastic-agent restart
root@55df25f47e9b:/elastic-agent-8.10.2-linux-arm64# ps aux | grep elastic-agent
root      2815  2.2  0.3 1653072 52056 pts/0   Sl   20:18   0:02 elastic-agent
root      2828  0.0  0.0      0     0 ?        Zs   20:18   0:00 [elastic-agent] <defunct>
root      2913  1.5  0.0      0     0 ?        Zs   20:20   0:00 [elastic-agent] <defunct>
root      2925  9.5  0.7 1476232 107968 pts/0  Sl   20:20   0:00 /opt/Elastic/Agent/data/elastic-agent-b0c688/components/metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${METRICBEAT_GOGC:100} -E metricbeat.config.modules.enabled=false -E http.enabled=true -E http.host=unix:///opt/Elastic/Agent/data/tmp/iThI_df0cBKC6YUNGGlKscMkOfz3FBH3.sock -E path.data=/opt/Elastic/Agent/data/elastic-agent-b0c688/run/system/metrics-default
...
root@55df25f47e9b:/elastic-agent-8.10.2-linux-arm64# elastic-agent restart
root@55df25f47e9b:/elastic-agent-8.10.2-linux-arm64# ps aux | grep elastic-agent
root      2815  2.3  0.3 1726096 49660 pts/0   Sl   20:18   0:03 elastic-agent
root      2828  0.0  0.0      0     0 ?        Zs   20:18   0:00 [elastic-agent] <defunct>
root      2913  0.0  0.0      0     0 ?        Zs   20:20   0:00 [elastic-agent] <defunct>
root      2987  3.0  0.0      0     0 ?        Zs   20:20   0:00 [elastic-agent] <defunct>
...

@cmacknz (Member Author) commented Oct 5, 2023

Confirmed this is the watcher: building the agent with the small patch below, which stops starting the watcher at startup, stops the creation of zombie processes:

diff --git a/internal/pkg/agent/cmd/run.go b/internal/pkg/agent/cmd/run.go
index 91d470ac0..3c39c98dc 100644
--- a/internal/pkg/agent/cmd/run.go
+++ b/internal/pkg/agent/cmd/run.go
@@ -212,10 +212,10 @@ func run(override cfgOverrider, testingMode bool, fleetInitTimeout time.Duration
        }

        // initiate agent watcher
-       if err := upgrade.InvokeWatcher(l); err != nil {
-               // we should not fail because watcher is not working
-               l.Error(errors.New(err, "failed to invoke rollback watcher"))
-       }
+       // if err := upgrade.InvokeWatcher(l); err != nil {
+       //      // we should not fail because watcher is not working
+       //      l.Error(errors.New(err, "failed to invoke rollback watcher"))
+       // }

        if allowEmptyPgp, _ := release.PGP(); allowEmptyPgp {
                l.Info("Elastic Agent has been built with security disabled. Elastic Agent will not verify signatures of upgrade artifact.")

We need to start the watcher, so this isn't a real solution. We'll probably need to wait on it. I also haven't explored whether we get a zombie process when the upgrade watcher is run across upgrades, since it needs to outlive its parent process and there's no way we could wait on it in that situation.
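For reference, the mechanism can be reproduced outside the agent with a few lines of Go. This is a standalone illustration, not agent code: a child that exits while its parent is still running, and is never waited on, stays in the process table as a zombie.

package main

import (
    "fmt"
    "os/exec"
    "time"
)

// Standalone reproducer: start a short-lived child, release it without ever
// calling Wait, and the exited child remains <defunct> while the parent lives.
func main() {
    cmd := exec.Command("/bin/true") // exits immediately
    if err := cmd.Start(); err != nil {
        panic(err)
    }
    pid := cmd.Process.Pid
    _ = cmd.Process.Release() // mirrors InvokeWatcher: Release instead of Wait

    time.Sleep(time.Second) // give the child time to exit
    fmt.Printf("run 'ps -o pid,stat,comm -p %d' in another shell; the child shows state Z\n", pid)
    time.Sleep(time.Minute) // keep the parent alive so the zombie persists
}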

@Nacho1973 commented:

Hi Elastic team!
We have multiple clusters for several environments, on versions 8.8.1 and 8.10.1.

We have detected that on all Red Hat machines that have an elastic-agent installed (versions 8.8.1 or 8.10.1), several processes appear in a defunct state.

For example, our hosts with elastic-agent version 8.8.1:
[elastic@g99eksa738 ~]$ ps -fea | grep -i defunc
root 1629 1351 0 Jun15 ? 00:00:00 [elastic-agent]
root 1130324 1351 0 Sep19 ? 00:00:00 [elastic-agent]
root 1130372 1351 0 Sep19 ? 00:00:00 [elastic-agent]
elastic 1625525 1625493 0 08:52 pts/0 00:00:00 grep --color=auto -i defunc

After one restart we still have one defunct process:
[elastic@g99eksa738 ~]$ sudo systemctl restart elastic-agent
[elastic@g99eksa738 ~]$ ps -fea | grep -i defunc
root 1625543 1625534 3 08:53 ? 00:00:00 [elastic-agent]
elastic 1625608 1625493 0 08:53 pts/0 00:00:00 grep --color=auto -i defunc

Same at version 8.10.1:
root 604199 604139 0 Jun13 ? 00:00:00 [elastic-agent]
root 2548351 604139 0 Sep19 ? 00:00:00 [elastic-agent]
root 2548423 604139 0 Sep19 ? 00:00:00 [elastic-agent]
elastic 3048770 3048735 0 09:02 pts/0 00:00:00 grep --color=auto defunct
[elastic@g99tnsa723 ~]$ sudo systemctl restart elastic-agent
[elastic@g99tnsa723 ~]$ ps -fea | grep defunct
root 3048801 3048787 2 09:02 ? 00:00:00 [elastic-agent]
elastic 3048814 3048735 0 09:02 pts/0 00:00:00 grep --color=auto defunct

All hosts are ESXi VMs; most of them are Red Hat 8.5. All elastic-agents are Fleet Server managed...

Is there any solution for this issue?
Thanks and best regards

@cmacknz (Member Author) commented Oct 10, 2023

You will get a defunct process every time the Elastic Agent restarts (and possibly each time it upgrades) as shown in #2190 (comment).

This is caused by a bug in the Elastic Agent and requires a code change to fix. Restarting the system or stopping and starting the Elastic Agent service will likely clear these defunct processes, but they will just come back the next time the agent restarts.

@cmacknz changed the title from "Processes left as zombies when they exit independently" to "The elastic agent watch command creates a zombie process each time it restarts" on Mar 14, 2024
@pchila (Member) commented May 9, 2024

@cmacknz what about launching the watcher by forking twice? I didn't try to implement that in Go, but that's how a process "daemonizes" on Linux... In short, we would have 3 processes involved in launching the watcher:

  1. elastic-agent main process: launches child process n.2 and waits for it
  2. elastic-agent "child" process: its only job is to call setsid() to start a new session, then fork again to launch process n.3, and then exit (it will be reaped by the main process)
  3. elastic-agent "grandchild" process: this is the actual watcher process and (in theory) it should be reaped properly by the init/systemd process (PID 1), not leaving a zombie process behind...

I think we could experiment with this and check whether it stops leaving zombie processes...

Edit: forgot about the setsid() call needed to enter a new session
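A rough sketch of that three-process chain in Go, using a hypothetical spawn-watcher subcommand and function name that don't exist in the agent today. The intermediate copy starts in its own session, launches the real watcher, and exits immediately, so the watcher is re-parented to PID 1 and the agent only has to reap the short-lived intermediate process:

package sketch // sketch only; "spawn-watcher" is a hypothetical subcommand

import (
    "os"
    "os/exec"
    "syscall"
)

// invokeDetachedWatcher is process 1: it starts an intermediate copy of the
// agent in a new session and reaps it when it exits.
func invokeDetachedWatcher() error {
    self, err := os.Executable()
    if err != nil {
        return err
    }
    cmd := exec.Command(self, "spawn-watcher")
    cmd.SysProcAttr = &syscall.SysProcAttr{Setsid: true} // new session
    if err := cmd.Start(); err != nil {
        return err
    }
    go func() { _ = cmd.Wait() }() // reap the short-lived intermediate copy
    return nil
}

// The hypothetical "spawn-watcher" subcommand (process 2) would then do
// roughly this before exiting, orphaning the real watcher (process 3) so
// that PID 1 adopts and eventually reaps it:
//
//	watcher := exec.Command(self, "watch")
//	_ = watcher.Start()
//	os.Exit(0)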

@cmacknz (Member Author) commented May 9, 2024

Seems like it might be more challenging than expected:

We cannot use the fork syscall in Go's runtime, because the child process doesn't inherit threads and goroutines in that case. The library uses a simple trick: it runs its own copy with a mark - a predefined environment variable. The presence of the variable in the process environment means the code is executing in the child copy. So if the mark is not set, the library executes the parent's operations and runs its own copy with the mark, and if the mark is set, the library executes the child's operations.
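To make that quoted description concrete, here is a minimal sketch of the mark trick with a made-up environment variable name; it is not the go-daemon API, just the general pattern of re-executing the current binary and branching on the marker:

package sketch // illustration of the re-exec "mark" trick, not go-daemon's API

import (
    "os"
    "os/exec"
)

const mark = "_AGENT_REEXEC_CHILD" // hypothetical marker variable

// isChildCopy reports whether this process is the re-executed copy.
func isChildCopy() bool {
    return os.Getenv(mark) != ""
}

// reexecSelf starts a copy of the current binary with the marker set; the
// copy sees the marker and runs the child code path instead of the parent's.
func reexecSelf() error {
    self, err := os.Executable()
    if err != nil {
        return err
    }
    cmd := exec.Command(self, os.Args[1:]...)
    cmd.Env = append(os.Environ(), mark+"=1")
    if err := cmd.Start(); err != nil {
        return err
    }
    go func() { _ = cmd.Wait() }() // still reap the copy to avoid a zombie
    return nil
}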

@zez3 commented May 9, 2024

Please see the solution in https://segmentfault.com/a/1190000041466423/en
and
https://stackoverflow.com/questions/59410139/run-command-in-golang-and-detach-it-from-process

Both solutions use cmd.Wait().
I hope this helps.

@blakerouse (Contributor) commented:

I think what we need to determine here is whether there is only a zombie because of reexec, or whether there is always a zombie when the watcher is started and the parent exits.

I believe it's only because of the reexec process, where we re-use the same process space so that the parent -> child reference still exists.

@catalindevops commented:

Are all versions affected by this? I have also noticed it happening in my environments.

@cmacknz (Member Author) commented May 14, 2024

Yes, as far as I can tell this behavior has likely been here since the agent was released. We have a child process that outlives its parent process's ability to call wait on it as part of our upgrade design.

@zez3 commented May 14, 2024

Are all versions affected by this? I have also noticed it happening in my environments.

Just on Linux.

@ycombinator added the Team:Elastic-Agent-Control-Plane label on May 21, 2024
@elasticmachine (Contributor) commented:

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz (Member Author) commented Jul 4, 2024

This will be less frequent in 8.15.0: there will no longer be a defunct process every time the Elastic Agent process restarts. There will likely still be a defunct process each time the Elastic Agent is upgraded, because of the way our upgrade logic is implemented. Fixing that properly is more complex than just calling Wait, unfortunately.
