
Argocd-reposerver spawns too many git zombie processes in the host node #3611

Closed · 3 tasks done
kanapuli opened this issue May 19, 2020 · 13 comments · Fixed by #3721
Labels
bug/severity:major (Malfunction in one of the core component, impacting a majority of users) · bug (Something isn't working) · component:git (Interaction with GitHub, Gitlab etc)

Comments

@kanapuli commented May 19, 2020

If you are trying to resolve an environment-specific issue or have a one-off question about an edge case that does not require a feature, then please consider asking a question in the ArgoCD Slack channel.

Checklist:

  • I've searched in the docs and FAQ for my answer: http://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

I am running ArgoCD as a pod on an AWS EC2 machine. The host machine where the argocd-repo-server pod runs seems to accumulate a large number of git zombie processes.

To Reproduce
I am not exactly sure how to reproduce this. Just running the argocd-repo-server pod for a couple of days creates the git zombies.

Expected behavior

If a git process is invoked by ArgoCD to check the application status, it has to be properly killed by ArgoCD.

Screenshots
(two screenshots attached, not reproduced here)

Version

v1.4.2+48ecced9

Logs

time="2020-05-19T04:20:34Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:https://cicd-exotel@bitbucket.org/Exotel/exotel_deployments.git,Path:campaignix/integration,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Ksonnet:nil,Directory:nil,Plugin:nil,Chart:,}/ee99bb62f813471bbf01d947dfe0fa8070b9e80b"
time="2020-05-19T04:20:35Z" level=error msg="`git fetch origin --tags --force` failed exit status 128: error: cannot fork() for git-ask-pass.sh: Resource temporarily unavailable\nfatal: could not read Password for 'https://cicd-exotel@bitbucket.org': No such device or address" execID=we6DF
time="2020-05-19T04:20:35Z" level=error msg="finished unary call with code Internal" error="rpc error: code = Internal desc = Failed to fetch git repo: `git fetch origin --tags --force` failed exit status 128: error: cannot fork() for git-ask-pass.sh: Resource temporarily unavailable\nfatal: could not read Password for 'https://cicd-exotel@bitbucket.org': No such device or address" grpc.code=Internal grpc.method=GenerateManifest grpc.request.deadline="2020-05-19T04:21:34Z" grpc.service=repository.RepoServerService grpc.start_time="2020-05-19T04:20:34Z" grpc.time_ms=1241.966 span.kind=server system=grpc
time="2020-05-19T04:20:35Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:https://cicd-exotel@bitbucket.org/Exotel/exotel_deployments.git,Path:reportix/integration,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Ksonnet:nil,Directory:nil,Plugin:nil,Chart:,}/ee99bb62f813471bbf01d947dfe0fa8070b9e80b"
time="2020-05-19T04:20:35Z" level=info msg="git fetch origin --tags --force" dir="/tmp/https:__cicd-exotel@bitbucket.org_exotel_exotel_deployments" execID=um08W
time="2020-05-19T04:20:35Z" level=error msg="finished unary call with code Internal" error="rpc error: code = Internal desc = Failed to fetch git repo: fork/exec /usr/bin/git: resource temporarily unavailable" grpc.code=Internal grpc.method=GenerateManifest grpc.request.deadline="2020-05-19T04:21:34Z" grpc.service=repository.RepoServerService grpc.start_time="2020-05-19T04:20:34Z" grpc.time_ms=1246.9 span.kind=server system=grpc
time="2020-05-19T04:20:35Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:https://cicd-exotel@bitbucket.org/Exotel/exotel_deployments.git,Path:samanval/qa_us3/,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Ksonnet:nil,Directory:nil,Plugin:nil,Chart:,}/ee99bb62f813471bbf01d947dfe0fa8070b9e80b"
time="2020-05-19T04:20:35Z" level=info msg="git fetch origin --tags --force" dir="/tmp/https:__cicd-exotel@bitbucket.org_exotel_exotel_deployments" execID=o79yN
time="2020-05-19T04:20:35Z" level=error msg="finished unary call with code Internal" error="rpc error: code = Internal desc = Failed to fetch git repo: fork/exec /usr/bin/git: resource temporarily unavailable" grpc.code=Internal grpc.method=GenerateManifest grpc.request.deadline="2020-05-19T04:21:34Z" grpc.service=repository.RepoServerService grpc.start_time="2020-05-19T04:20:34Z" grpc.time_ms=1245.643 span.kind=server system=grpc
time="2020-05-19T04:20:35Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:https://cicd-exotel@bitbucket.org/Exotel/exotel_deployments.git,Path:fulliautomatix/integration,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Ksonnet:nil,Directory:nil,Plugin:nil,Chart:,}/ee99bb62f813471bbf01d947dfe0fa8070b9e80b"
time="2020-05-19T04:20:35Z" level=info msg="git fetch origin --tags --force" dir="/tmp/https:__cicd-exotel@bitbucket.org_exotel_exotel_deployments" execID=KslMc

kanapuli added the bug (Something isn't working) label May 19, 2020
alexmt added the bug/severity:major (Malfunction in one of the core component, impacting a majority of users) and component:git (Interaction with GitHub, Gitlab etc) labels May 19, 2020
@kanapuli (Author)

I would like to contribute to this issue. Where should I start?

@jannfis (Member) commented May 21, 2020

Hi @kanapuli, thanks for your bug report and the interest in contributing! Much appreciated!

Please check our documentation at https://argoproj.github.io/argo-cd/developer-guide/contributing/ on how to get started. If you have any questions after reading this document, feel free to ask. Also, if there's something missing or wrongly documented, please let us know!

As for this particular problem, I think the root cause might be found in the package we use for executing external commands, at https://github.com/argoproj/pkg (or more specifically, https://github.com/argoproj/pkg/tree/master/exec); however, it might also be elsewhere (I haven't dug any deeper yet).

@kanapuli (Author)

Thanks @jannfis. Let me go through the code and documentation, and ask here if I have questions.

@asvasyanin

Any updates? I've encountered the same error.

@markbenschop commented Jun 2, 2020

I've seen this issue on multiple occasions. We had to reboot k8s nodes because they became unresponsive due to the enormous number of zombie processes. I assume the argocd-reposerver does not properly kill the git process after execution, or doesn't wait for the child process to return its exit code.

@WaldoFR commented Jun 2, 2020

Same here on version 1.5.3 (duplicate report: #3694).

@jannfis (Member) commented Jun 5, 2020

Hey guys, just a question during my search for the root cause of this issue and while trying to reproduce it reliably: are you making use of Kustomize for (some of) your applications?

@kanapuli (Author) commented Jun 5, 2020

No, I just use plain Kubernetes YAML manifests for my application.

@jannfis (Member) commented Jun 5, 2020

Just a heads-up: It seems I can reliably reproduce it now. I'll be digging for the root cause.

@jannfis (Member) commented Jun 7, 2020

I did some more research, and it turns out that this issue only comes to light when running in Kubernetes. This is most likely due to the fact that in a Kubernetes pod there is no init-like process which reaps terminated processes that no longer have a parent process. The issue is not reproducible when running ArgoCD outside a K8s cluster, no matter whether it is run in a Docker container or not.

I took some time to audit the functions used by ArgoCD to spawn external processes, and in fact we are correctly waiting for any child processes to exit; thus, ArgoCD should not leave unreaped zombie processes behind. But as we can observe, it happens anyway.

One can easily reproduce it by creating a Kustomize application that uses an inaccessible remote base, i.e. a Git repository that requires different authentication than the repository where the original kustomization.yaml resides.
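
For illustration, a minimal kustomization.yaml that triggers this could look like the sketch below; the remote base URL, path, and ref are placeholders for a private repository that the repo-server has no credentials for (depending on the kustomize version, the remote target may need to go under bases: instead of resources:):

    # kustomization.yaml of the application ArgoCD points at
    resources:
      # Remote base in a second, private repository; fetching it fails because
      # the repo-server only holds credentials for the first repository.
      - https://github.com/example-org/private-base//overlays/prod?ref=main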

I think what is going on is the following:

  • ArgoCD spawns a child process A and waits for it to either time out or exit normally
  • Child process A then forks a child process B and exits without waiting for process B; process B becomes an orphan
  • ArgoCD correctly reaps child process A and cleans up its entry in the process table
  • Process B exits and becomes a zombie, because no one will reap it

In the UNIX world, orphaned processes are reassigned to PID 1 as their parent, which is usually init(8). The init process usually has mechanisms to reap the orphaned processes that get assigned to it. In Kubernetes, however, PID 1 will be the command used to start the container; in this case that is argocd-repo-server, which in turn has no mechanism to reap orphan processes it does not know of.
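
A minimal sketch of this mechanism, assuming a repo-server pod without a shared process namespace, a procps-style ps in the image, and a placeholder pod name:

    # Process A (sh) backgrounds process B (sleep) and exits without waiting for it;
    # B is re-parented to PID 1 of the container's PID namespace (argocd-repo-server).
    kubectl -n argocd exec <argocd-repo-server-pod> -- sh -c 'sleep 2 &'

    # Once B exits, nothing reaps it, so it lingers in the process table in state Z:
    kubectl -n argocd exec <argocd-repo-server-pod> -- ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'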

I see three possible solutions to solve this problem:

  1. Implement orphan process reaping in ArgoCD's server processes
  2. Run the container with an init-like wrapper, such as tini
  3. Enable process namespace sharing for our pods in Kubernetes

Solution 1 feels like reinventing the wheel, so I did a small PoC using solution 2, adapting ArgoCD's Docker entry point script to execute argocd-repo-server wrapped through tini. tini is available as an installable package in our debian:10 base image, so it could easily be included in the Docker image without much overhead or adaptation. This works, and orphaned processes get reaped, leaving no zombie processes behind.
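
As a rough sketch of that PoC (the actual entry point script and arguments in the ArgoCD image differ; tini here is assumed to be installed from the debian:10 package):

    #!/bin/sh
    # Run the repo server under tini so that orphaned processes re-parented
    # to PID 1 are reaped instead of lingering as zombies.
    exec tini -- argocd-repo-server "$@"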

Solution 3 seems the most native one; however, it might not be available in early versions of K8s that can still be found in the wild (it became stable with K8s v1.17). I think it was introduced as an alpha feature in v1.10 and has been beta since at least v1.14, from where on it is enabled by default. According to my research, the Kubernetes people adopted tini in the /pause process, which will be PID 1 if shareProcessNamespace is set to true. I think it is safe to assume that this feature is enabled in most, if not all, production clusters out there. However, maybe someone knows of consequences this would have that I'm currently not aware of.

Long story short: the workaround for this issue is to set shareProcessNamespace: true in the pod template of the argocd-repo-server deployment resource. I think we should make this the default in the installation manifests, and also for the other pods running in the argocd namespace.
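
As a sketch, the change amounts to one field in the Deployment's pod template (manifest abbreviated; the labels, image, and command shown here are illustrative):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: argocd-repo-server
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: argocd-repo-server
      template:
        metadata:
          labels:
            app.kubernetes.io/name: argocd-repo-server
        spec:
          # With a shared process namespace, /pause becomes PID 1 of the pod
          # and reaps orphaned child processes left behind by git invocations.
          shareProcessNamespace: true
          containers:
            - name: argocd-repo-server
              image: argoproj/argocd:v1.5.3   # illustrative image/tag
              command: ["argocd-repo-server"]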

@jessesuen (Member)

This was an excellent summary. For option 3, enabling shareProcessNamespace, we would enable the mode under the assumption that the pause container is really a tini process which will reap the orphaned processes? Is it possible that tini might be run in a way where this is a false assumption? For example, can we trust that this is also true for OpenShift?

I think we should make this the default in the installation manifests, also for the other pods running in the argocd namespace.

It sounds like we should do either option 2 or 3 for the repo-server, since it does spawn child processes. But for argocd-server and argocd-application-controller, I might avoid it, since we don't spawn child processes AFAIK.

I'm in favor of doing 2, since 3 makes assumptions about the Kubernetes version and its underlying implementation using tini.

alexmt pushed a commit that referenced this issue Jun 9, 2020

* fix: Reap orphaned ("zombie") processes in argocd-repo-server pod (#3721)
elemental-lf added a commit to elemental-lf/argo-helm that referenced this issue Oct 5, 2020

While uid_entrypoint.sh contains the OpenShift-specific manipulation of
/etc/passwd, it also starts the reposerver via tini and so ensures that any
zombies produced by the reposerver and its descendants are collected.

This matches the behaviour of the manifests included with the main ArgoCD
project. See:

* https://github.com/argoproj/argo-cd/blob/f93da5346c3dfe0ec75549fd78b2d30ce7d5cfad/manifests/base/repo-server/argocd-repo-server-deployment.yaml#L24
* argoproj/argo-cd#3721
* argoproj/argo-cd#3611
seanson pushed a commit to argoproj/argo-helm that referenced this issue Oct 8, 2020
muwon pushed a commit to muwon/argo-helm that referenced this issue Oct 10, 2020

* fix(argocd): Unconditionally start reposerver with uid_entrypoint.sh (argoproj#466)
* chore: Bumping minor semver as this feels like a bit more than a patch change.
@willemm commented Mar 28, 2022

We're currently seeing this issue (well, almost: the defunct processes are mostly 'ssh'). Running the argocd:v2.2.5 image.

@jumarko commented Apr 23, 2022

I don't have anything special to add, but I want to share my experience from debugging a similar problem, completely unrelated to argo-cd (I've never used it).

Our Java application was running as PID 1 in a Docker container and was running git clone commands as sub-processes.
As it turns out, when the git cloning process fails, it doesn't properly reap its child process, which in turn becomes a zombie and is re-parented to PID 1 (our Java app), which doesn't know how to deal with it.

A simple command like this reproduces the problem (if you don't have an SSH key set up for git):

git clone git@github.com:jumarko/poptavka.git

It fails with a permissions error, leaving an sh process behind as a zombie.
This sh process runs the ssh command for git-style remote URLs.

A similar thing happens with https-style URLs, but in that case it leaves git zombies around
(it runs a git remote-https subprocess, which is left behind as a zombie).

It feels like a bug in git, where child processes are not reaped if a git clone subprocess terminates with an error.
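
A quick way to confirm what is left behind (assuming a procps-style ps) is to list processes in zombie state right after the failed clone:

    # Zombie processes have a state beginning with "Z"; after the failed clone
    # above, this shows the leftover sh (or git remote-https) child.
    ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'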
