
[Ingest Manager] elastic-agent process is not properly terminated after restart #127

Closed
mdelapenya opened this issue Aug 11, 2020 · 20 comments
Labels
bug (Something isn't working), good first issue (Good for newcomers), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

Comments

@mdelapenya (Contributor) commented Aug 11, 2020

Environment

Steps to Reproduce

  1. Start a Centos:7 docker container: docker run --name centos centos:7 tail -f /dev/null
  2. Enter the container: docker exec -ti centos bash
  3. Download the agent RPM package: curl https://snapshots.elastic.co/8.0.0-3ce083a1/downloads/beats/elastic-agent/elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -o /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm
  4. Install systemctl replacement for Docker: curl https://raw.githubusercontent.com/gdraheim/docker-systemctl-replacement/master/files/docker/systemctl.py -o /usr/bin/systemctl
  5. Install the RPM package with yum: yum localinstall /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -y
  6. Enable service: systemctl enable elastic-agent
  7. Start service: systemctl start elastic-agent
  8. Check processes: top. There should be only one process for the elastic-agent
  9. Restart service: systemctl restart elastic-agent
  10. Check processes: top

Behaviours:

Expected behaviour

After the initial restart, the elastic-agent appears once, not in the Z state.
[Screenshot 2020-08-11 at 17:16:34]

Current behaviour

After the initial restart, the elastic-agent appears twice: one in the Z state and the other in the S state (as shown in the attachment).
[Screenshot 2020-08-11 at 17:15:38]

Other observations

This behavior persists across multiple restarts: the elastic-agent process gets into the zombie state each time it is restarted (note that I restarted it three times, so there are 3 zombie processes):
[Screenshot 2020-08-11 at 17:18:22]

One shot script

docker run -d --name centos centos:7 tail -f /dev/null
docker exec -ti centos bash

Inside the container

curl https://snapshots.elastic.co/8.0.0-3ce083a1/downloads/beats/elastic-agent/elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -o /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm
curl https://raw.githubusercontent.com/gdraheim/docker-systemctl-replacement/master/files/docker/systemctl.py -o /usr/bin/systemctl 
yum localinstall /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -y
systemctl enable elastic-agent
systemctl start elastic-agent
systemctl restart elastic-agent
top
@mdelapenya added the bug label Aug 11, 2020
@elasticmachine (Collaborator)

Pinging @elastic/ingest-management (Team:Ingest Management)

@mdelapenya changed the title from "[Ingest Manager] elastic-agent process is not properly terminated after" to "[Ingest Manager] elastic-agent process is not properly terminated after restart" Aug 11, 2020
@mdelapenya (Contributor, Author) commented Aug 12, 2020

I'm gonna close this issue, as I was not able to reproduce it in a full-blown Centos VM, which includes the whole systemd service manager, so it seems related to the limitations of systemd in Docker.

Steps to reproduce it

  1. create a VM with Centos on Google Cloud
  2. Download the RPM package for the elastic-agent
  3. Install it with yum localinstall
  4. Enable the service
  5. Start the service
  6. Check processes are started
  7. Restart the service using systemctl
  8. Check processes are restarted

One-shot script

Once you SSH'ed into the VM:

sudo su -
cd /
curl https://snapshots.elastic.co/8.0.0-3ce083a1/downloads/beats/elastic-agent/elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -o /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm
yum localinstall /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -y
systemctl enable elastic-agent
systemctl start elastic-agent
ps aux | grep elastic

Restart the service:

systemctl restart elastic-agent
ps aux | grep elastic

There is NO elastic-agent in the Zombie state

@zez3 commented Nov 26, 2021

@mdelapenya
I can reproduce this on Ubuntu: for every policy change a new zombie process is spawned.
Running the 7.15.2 Elastic Agent.

@zez3 commented Nov 26, 2021

Was this fixed in 8.0?

@ruflin added the Team:Elastic-Agent-Control-Plane label Nov 29, 2021
@elasticmachine (Collaborator)

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@ruflin (Member) commented Nov 29, 2021

@zez3 Reopening for further investigation. Can you share how you reproduced it?

@ruflin reopened this Nov 29, 2021
@zez3 commented Nov 29, 2021

@ruflin
hmm, not sure exactly how but it happens:

 ps aux | grep 'Z' | grep defunct
root        4746  0.0  0.0      0     0 ?        Zs   Nov09   0:00 [elastic-agent] <defunct>
root        6166  0.0  0.0      0     0 ?        Zs   Nov09   0:00 [elastic-agent] <defunct>
root       74439  0.0  0.0      0     0 ?        Zs   Nov17   0:00 [elastic-agent] <defunct>
root       74468  0.0  0.0      0     0 ?        Zs   Nov17   0:00 [elastic-agent] <defunct>

During that time (16 Nov -> 18 November) I can see these errors in the logs:


Showing entries from Nov 17, 10:37:48 until Nov 18, 10:31:25:

Nov 17, 10:37:48.831  elastic_agent  [elastic_agent][error] Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: Post "https://x.x.1.196:18315/api/fleet/agents/2d258220-fa3a-4e7b-bf86-dc2cdd2b15b1/checkin": unexpected EOF
Nov 18, 10:31:25.199  elastic_agent  [elastic_agent][error] Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: Post "https://x.x.1.196:18383/api/fleet/agents/2d258220-fa3a-4e7b-bf86-dc2cdd2b15b1/checkin": unexpected EOF
Nov 18, 10:31:25.199  elastic_agent  [elastic_agent][error] Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: Post "https://x.x.1.196:18383/api/fleet/agents/2d258220-fa3a-4e7b-bf86-dc2cdd2b15b1/checkin": unexpected EOF
Nov 18, 10:31:25.199  elastic_agent  [elastic_agent][error] Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: Post "https://x.x.1.196:18383/api/fleet/agents/2d258220-fa3a-4e7b-bf86-dc2cdd2b15b1/checkin": unexpected EOF

other agents have crashed at different times

ps aux | grep 'Z' | grep defunct
root     2263706  0.0  0.0      0     0 ?        Zs   Nov20   0:00 [elastic-agent] <defunct>
root     2263972  0.0  0.0      0     0 ?        Zs   Nov20   0:00 [elastic-agent] <defunct>

it's not something specific to an exact day

 ps aux | grep 'Z' | grep defunct
root        3002  0.0  0.0      0     0 ?        Zs   Nov10   0:00 [elastic-agent] <defunct>
root     3257393  0.0  0.0      0     0 ?        Zs   Nov17   0:00 [elastic-agent] <defunct>
root     3259342  0.0  0.0      0     0 ?        Zs   Nov17   0:00 [elastic-agent] <defunct>

I have to check my other agents.

This is on macOS:

mac:~ $ ps aux | grep 'Z'
USER               PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
root               431   0.0  0.0        0      0   ??  Z    Tue04AM   0:00.00 (elastic-agent)

@zez3 commented Nov 29, 2021

Perhaps I have a hint: I restarted my docker service (one by one) on all my ECE hosts.

On 4 of the hosts where the agent is running I managed to get the zombies. Some were not affected.

 ps aux | grep 'Z' | grep defunct
root      294556  0.0  0.0      0     0 ?        Zs   16:26   0:00 [elastic-agent] <defunct>

ps aux | grep 'Z' | grep defunct
root     2231066  0.0  0.0      0     0 ?        Zs   Nov29   0:00 [elastic-agent] <defunct>

ps aux | grep 'Z' | grep defunct
root      4499  0.0  0.0      0     0 ?        Zs   Nov29   0:00 [elastic-agent] <defunct>

 ps aux | grep 'Z' | grep defunct
root     3244101  0.0  0.0      0     0 ?        Zs   Nov29   0:00 [elastic-agent] <defunct>

Let me know if you need some logs or if we should do a live session. There is also a support ticket open for this.

@ruflin (Member) commented Nov 30, 2021

As you mention ECE, I assume all the Elastic Agents that you got above are running inside a Docker container? These are the hosted elastic agents?

How did you restart the docker service? Does it stop and then start the container? I'm asking because maybe the container restart is causing it.

My general understanding of the <defunct> process is that there should be some parent still around.

To also understand the priority of this issue a bit: the defunct processes are there after a restart, but the system still works as expected?
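
For reference, a minimal standalone Go sketch of that behaviour (illustrative only, not elastic-agent code): a child that has exited stays in the process table as <defunct> for as long as its still-running parent has not waited on it.

package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Start a child that exits immediately ("true" is assumed to be on PATH).
	cmd := exec.Command("true")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	fmt.Printf("started child pid %d without calling Wait()\n", cmd.Process.Pid)

	// While this parent sleeps without reaping the child,
	// `ps -o pid,ppid,stat,comm -p <child pid>` shows it in state Z
	// with this process as its parent.
	time.Sleep(60 * time.Second)

	// Calling Wait() reaps the child and the zombie entry disappears.
	_ = cmd.Wait()
}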

@zez3 commented Nov 30, 2021

Nope, my agents are not inside containers, they are on bare-metal machines. I restarted the docker service on the ECE hosts where Fleet, Kibana and all the other Elastic nodes in my deployments are running. It basically cuts the agents off from Fleet and ES.

Yes, the parent process is still running and operating properly by spawning (restarting) a new child.
If this happens 2-3 times a month, then in one year it would eat a bit of RAM, but nothing critical.

@ruflin (Member) commented Dec 1, 2021

In your scenario you can create the defunct processes if you restart the Elastic Agent running the fleet-server that the other Elastic Agents connect to. This is the bit I missed before, as I assume it is when you restart the Elastic Agents on the edge.

What happens in this scenario is that the Elastic Agents on the edge temporarily lose their connection to the fleet-server, which indicates to me that this is where we should investigate further, and that it is likely is not related in any way to Docker or ECE.

@jlind23 (Contributor) commented Dec 1, 2021

@ruflin what you are saying is that when the Agent loses its connection to the fleet server, then somehow defunct processes are created, right?

@zez3 commented Dec 2, 2021

> I assume it is when you restart the Elastic Agents on the edge.

The agent was not restarted. Only Fleet+Kibana+ES and all the other ECE containers.

> likely is not related in any way to Docker or ECE

Perhaps indirectly, because the Fleet server resides there.

Another hint is that only the agents (the beats/children underneath the parent process) with a high (~2000 eps) load on them caused the defunct processes.

@ruflin (Member) commented Dec 2, 2021

> a high (~2000 eps) load

Very interesting detail. It will help to investigate this further. We should put load on the Elastic Agents (subprocesses) for testing.

@andrewkroh (Member) commented Dec 3, 2021

I don't know the code paths here well (at all), but Stop() looks problematic if it is used without StopWait(). The method does not call exec.Cmd.Wait(), which performs the wait call on Linux that is required to release the resources associated with the child process, and which also cleans up some internal channels and goroutines.

Wait() may be called elsewhere but it's hard to verify/ensure that all code paths lead to it.

https://github.com/elastic/beats/blob/a91bba523d2075272d0aad0bd5e7f006d29cdc84/x-pack/elastic-agent/pkg/core/process/process.go#L69-L72
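
A rough, self-contained Go sketch of the failure mode described above (illustrative only, not the actual elastic-agent code): killing the child via Process.Kill() alone leaves a <defunct> entry, while following it with exec.Cmd.Wait() reaps the child.

package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	cmd := exec.Command("sleep", "300")
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	// Roughly what a Stop() without Wait() would do: the child dies, but it
	// stays in the process table as <defunct> because its exit status was
	// never collected.
	if err := cmd.Process.Kill(); err != nil {
		log.Fatal(err)
	}
	time.Sleep(10 * time.Second) // `ps aux | grep defunct` shows the zombie here

	// Roughly what StopWait() adds: Wait() performs the underlying wait call,
	// releasing the kernel resources; the returned error is the expected
	// "signal: killed".
	if err := cmd.Wait(); err != nil {
		log.Printf("reaped child: %v", err)
	}
}

Under that assumption, a stop path that never reaches Wait() would leave exactly this kind of entry behind after each restart.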

@jlind23 added the good first issue and v8.1.0 labels and removed the 8.1-candidate label Dec 6, 2021
@ph (Contributor) commented Feb 1, 2022

I think @andrewkroh is right here; we need to audit the stop path of the process. It's currently being changed in elastic/beats#29650.

@jlind23 transferred this issue from elastic/beats Mar 7, 2022
@jlind23 removed the v8.1.0 label Mar 9, 2022
@jlind23 (Contributor) commented Mar 9, 2022

Closing it as won't fix. It will be part of the V2 architecture.

@jlind23 closed this as completed Mar 9, 2022
@zez3 commented Mar 9, 2022

@jlind23 Can I track this future V2 architecture somewhere?

@jlind23 (Contributor) commented Mar 10, 2022

@zez3 I've opened a new issue here: #189

@zez3 commented Nov 20, 2023

@jlind23

The zombie issue still persists on 8.11.1.

https://www.howtogeek.com/119815/htg-explains-what-is-a-zombie-process-on-linux/
