The elastic agent watch command creates a zombie process each time it restarts #2190
@blakerouse does anything come to mind on what could cause this? Clearly the Beats restarting all the time was causing the unusual number of zombies, but the exited processes should not have been hanging around as zombies.
We should always be calling Wait here to prevent this. Clearly there is some interaction we are missing here, though. In 8.6.1 there shouldn't be as many zombie processes because we reduced the number of times the Beats exit, but leaving them behind would still be possible since we haven't changed anything in the process handling on the agent side. Here's the call stack in the agent that leads to calling Wait():

- elastic-agent/pkg/component/runtime/command.go lines 334 to 344 in fefe64f
- elastic-agent/pkg/component/runtime/command.go lines 378 to 389 in fefe64f
- elastic-agent/pkg/core/process/process.go lines 114 to 123 in fefe64f
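To make the mechanics concrete, here is a minimal standalone sketch (not agent code) of why a missing Wait() leaves a zombie on Linux: a child that exits before its parent waits on it stays in the process table as `<defunct>` until it is reaped.

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Start a short-lived child; it exits almost immediately.
	cmd := exec.Command("true")
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// The child has exited but has not been reaped yet. While we sleep here,
	// `ps -o pid,stat,comm --ppid <this pid>` shows it in state Z (<defunct>).
	time.Sleep(5 * time.Second)

	// Wait() reaps the child and removes the zombie entry from the process table.
	if err := cmd.Wait(); err != nil {
		fmt.Println("child exited with error:", err)
	}
	fmt.Println("child reaped; no zombie left behind")
}
```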
Other users are hitting this issue.
Do we have logs or diagnostics available when this is happening?
I reproduced this without trying, just by installing a standalone agent with the default policy.

elastic-agent-diagnostics-2023-10-05T19-56-15Z-00.zip

Diagnostics attached. I think the defunct process might be the upgrade watcher. I see it exiting as expected on startup:

```
{"log.level":"info","@timestamp":"2023-10-05T19:24:29.052Z","log.origin":{"file.name":"cmd/watch.go","file.line":67},"message":"Upgrade Watcher started","process.pid":1866,"agent.version":"8.10.2","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-05T19:24:29.052Z","log.origin":{"file.name":"cmd/watch.go","file.line":75},"message":"update marker not present at '/opt/Elastic/Agent/data'","ecs.version":"1.6.0"}
```

At startup we launch the watcher: elastic-agent/internal/pkg/agent/cmd/run.go lines 214 to 218 in ac2ef57
The way we launch it is: elastic-agent/internal/pkg/agent/application/upgrade/rollback.go lines 109 to 136 in ac2ef57
We call os.Process.Release, but the exec.Cmd.Start documentation says we need to call Wait(). Perhaps this is what is causing this.
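As a point of comparison, here is a hedged, standalone sketch (not the agent's code) contrasting the two calls: on Unix, os.Process.Release only frees the Go-side handle and does not reap the exited child, whereas Wait() does.

```go
package main

import (
	"os/exec"
	"time"
)

func main() {
	// Child exits immediately, but the handle is only Released, never Waited on.
	// It stays in the process table as <defunct> while this parent is alive.
	zombie := exec.Command("true")
	_ = zombie.Start()
	_ = zombie.Process.Release()

	// Same child, but reaped properly: Wait() collects its exit status,
	// so no zombie entry remains.
	reaped := exec.Command("true")
	_ = reaped.Start()
	_ = reaped.Wait()

	// Keep the parent alive long enough to observe the first child with `ps`.
	time.Sleep(30 * time.Second)
}
```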
I can reproduce this by installing the agent in an ubuntu:22.04 container as well. Each time I restart the agent I get another defunct process.
Confirmed this is the watcher: building the agent with the small patch below to stop starting the watcher at startup stops the creation of zombie processes:

```diff
diff --git a/internal/pkg/agent/cmd/run.go b/internal/pkg/agent/cmd/run.go
index 91d470ac0..3c39c98dc 100644
--- a/internal/pkg/agent/cmd/run.go
+++ b/internal/pkg/agent/cmd/run.go
@@ -212,10 +212,10 @@ func run(override cfgOverrider, testingMode bool, fleetInitTimeout time.Duration
 	}
 
 	// initiate agent watcher
-	if err := upgrade.InvokeWatcher(l); err != nil {
-		// we should not fail because watcher is not working
-		l.Error(errors.New(err, "failed to invoke rollback watcher"))
-	}
+	// if err := upgrade.InvokeWatcher(l); err != nil {
+	// 	// we should not fail because watcher is not working
+	// 	l.Error(errors.New(err, "failed to invoke rollback watcher"))
+	// }
 
 	if allowEmptyPgp, _ := release.PGP(); allowEmptyPgp {
 		l.Info("Elastic Agent has been built with security disabled. Elastic Agent will not verify signatures of upgrade artifact.")
```

We need to start the watcher, so this isn't a real solution. We'll probably need to wait on it. I also haven't explored whether we get a zombie process when the upgrade watcher is run across upgrades, since it needs to outlive its parent process and there's no way we could wait on it in that situation.
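One possible direction, sketched below as an assumption rather than the actual fix, is to keep launching the watcher at startup but reap it from a goroutine so the agent never blocks on it. The startWatcher helper and the sleep command are stand-ins for the real watcher launch path, and this only addresses the restart case, not the upgrade case where the watcher must outlive the agent.

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// startWatcher is a hypothetical stand-in for the agent's watcher launch path;
// the real agent would start `elastic-agent watch` here instead of sleep.
func startWatcher() (*exec.Cmd, error) {
	cmd := exec.Command("sleep", "5")
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

func main() {
	cmd, err := startWatcher()
	if err != nil {
		// A failure to start the watcher should not be fatal for the agent.
		log.Printf("failed to invoke rollback watcher: %v", err)
		return
	}

	// Reap the watcher in the background so it never lingers as <defunct>,
	// without blocking the agent's startup or main loop.
	go func() {
		if werr := cmd.Wait(); werr != nil {
			log.Printf("watcher exited with error: %v", werr)
		}
	}()

	// Stand-in for the agent's normal work.
	time.Sleep(10 * time.Second)
}
```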
Hi elastic team! We have detected that on all Red Hat machines with the elastic-agent installed, versions 8.8.1 or 8.10.1, several processes appear in a defunct state. For example, our hosts with elastic-agent version 8.8.1 still have 1 defunct process after 1 restart. The same happens at version 8.10.1. All hosts are ESXi VMs. Is there any solution for this issue?
You will get a defunct process every time the Elastic Agent restarts (and possibly each time it upgrades), as shown in #2190 (comment). This is caused by a bug in the Elastic Agent and requires a code change to fix. Restarting the system or stopping and starting the Elastic Agent service will likely clear these defunct processes, but they will just come back the next time the agent restarts.
@cmacknz what about launching the watcher by forking twice? I didn't try to implement that in Go, but that's how a process "daemonizes" on Linux... In short we would have 3 processes involved in launching the watcher: the agent itself, a short-lived intermediate process the agent can Wait() on, and the watcher started by the intermediate process, which gets reparented to init once the intermediate exits.

I think we could experiment with it and check that we don't leave zombie processes this way... Edit: forgot about the
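A rough sketch of that double-spawn idea, with hypothetical names: Go cannot fork() directly, so the intermediate step would have to re-exec the agent binary with a dedicated subcommand (spawn-watcher below is invented for illustration, and the sleep command stands in for the real `elastic-agent watch` invocation). The agent waits on the intermediate process, the intermediate starts the watcher and exits immediately, and the watcher is reparented to PID 1, which reaps it when it eventually exits.

```go
package main

import (
	"os"
	"os/exec"
)

// launchWatcherDetached starts a short-lived intermediate process and waits on
// it, so the intermediate never becomes a zombie in the agent's process table.
func launchWatcherDetached() error {
	self, err := os.Executable()
	if err != nil {
		return err
	}
	intermediate := exec.Command(self, "spawn-watcher") // hypothetical subcommand
	if err := intermediate.Start(); err != nil {
		return err
	}
	return intermediate.Wait() // the intermediate exits almost immediately
}

// runSpawnWatcher is what the hypothetical "spawn-watcher" subcommand would do:
// start the watcher and exit without waiting, handing the watcher over to init.
func runSpawnWatcher() error {
	watcher := exec.Command("sleep", "60") // placeholder for the real watcher
	return watcher.Start()
}

func main() {
	if len(os.Args) > 1 && os.Args[1] == "spawn-watcher" {
		if err := runSpawnWatcher(); err != nil {
			os.Exit(1)
		}
		return // exiting here reparents the watcher to PID 1
	}
	if err := launchWatcherDetached(); err != nil {
		os.Exit(1)
	}
	// ... the agent would carry on with its normal work here
}
```

The trade-off is that the agent loses its direct handle on the watcher process, which may matter if it ever needs to signal or monitor it.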
Seems like it might be more challenging than expected:
Please see the solution in https://segmentfault.com/a/1190000041466423/en
Both solutions use
I think what we need to determine here is whether there is only a zombie because of re-exec, or if there is always a zombie in the case where the watcher is started and the parent exits. I believe it's only because of the re-exec process, where we re-use the same process space so that the parent -> child reference still exists.
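For completeness, one generic way to handle children inherited across a re-exec (where the process keeps its PID but has no *exec.Cmd handles for children started before the re-exec) would be a SIGCHLD-driven reaper like the Linux-only sketch below. This is an illustration of the technique, not something the agent does today, and a global wait4(-1) loop would race with the agent's own exec.Cmd.Wait calls, so it is not a drop-in fix.

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
	"time"
)

// reapChildren collects any exited child of this process, whoever started it.
func reapChildren(sigc <-chan os.Signal) {
	for range sigc {
		for {
			var status syscall.WaitStatus
			// -1 means "any child"; WNOHANG keeps the loop non-blocking.
			pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
			if pid <= 0 || err != nil {
				break
			}
			log.Printf("reaped child pid=%d", pid)
		}
	}
}

func main() {
	sigc := make(chan os.Signal, 1)
	signal.Notify(sigc, syscall.SIGCHLD)
	go reapChildren(sigc)

	// Simulate a child inherited from before a re-exec: started but never Waited on.
	_ = exec.Command("true").Start()

	// Give the reaper a moment to collect it, then exit.
	time.Sleep(2 * time.Second)
}
```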
Are all versions affected by this? I also noticed this happening in my environments.
Yes, as far as I can tell this behavior has likely been here since the agent was released. We have a child process that outlives its parent process' ability to call Wait() on it.
Just the Linux OS.
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
This will be less frequent in 8.15.0: there will no longer be a defunct process every time the Elastic Agent process restarts. There will likely still be a defunct process each time the Elastic Agent is upgraded, because of the way our upgrade logic is implemented. Fixing that properly is more complex than just calling Wait().
Before we fixed elastic/beats#34178 we had some reports of an excessive number of zombie processes being left on a machine. In one case a single macOS machine was left with hundreds of zombie processes. Example attached below.
ps-aux.txt
It seems like there is a correlation between the Beats restarting themselves and zombie processes being left behind. We should investigate this and see if there is anything we should be doing to ensure these processes are properly cleaned up.
The path the Beats follow to exit in this case is here.