
Argocd-reposerver spawns too many git zombie processes in the host node #3611

Closed · 3 tasks done
kanapuli opened this issue May 19, 2020 · 13 comments · Fixed by #3721
Labels
bug/severity:major (Malfunction in one of the core component, impacting a majority of users) · bug (Something isn't working) · component:git (Interaction with GitHub, Gitlab etc)

Comments

@kanapuli commented May 19, 2020

If you are trying to resolve an environment-specific issue or have a one-off question about an edge case that does not require a feature, then please consider asking a question in the ArgoCD Slack channel.

Checklist:

  • I've searched in the docs and FAQ for my answer: http://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

I am running ArgoCD as a pod on an AWS EC2 machine. The host machine where the argocd-repo-server pod runs seems to accumulate a large number of git zombie processes.

To Reproduce
I am not exactly sure how to reproduce this. Just running the argocd-repo-server pod for a couple of days creates the git zombies.

Expected behavior

If a git process is invoked by ArgoCD to check the application status, it has to be properly killed by ArgoCD.

Screenshots
(two screenshots attached, not reproduced here)

Version

v1.4.2+48ecced9

Logs

time="2020-05-19T04:20:34Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:https://cicd-exotel@bitbucket.org/Exotel/exotel_deployments.git,Path:campaignix/integration,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Ksonnet:nil,Directory:nil,Plugin:nil,Chart:,}/ee99bb62f813471bbf01d947dfe0fa8070b9e80b"
time="2020-05-19T04:20:35Z" level=error msg="`git fetch origin --tags --force` failed exit status 128: error: cannot fork() for git-ask-pass.sh: Resource temporarily unavailable\nfatal: could not read Password for 'https://cicd-exotel@bitbucket.org': No such device or address" execID=we6DF
time="2020-05-19T04:20:35Z" level=error msg="finished unary call with code Internal" error="rpc error: code = Internal desc = Failed to fetch git repo: `git fetch origin --tags --force` failed exit status 128: error: cannot fork() for git-ask-pass.sh: Resource temporarily unavailable\nfatal: could not read Password for 'https://cicd-exotel@bitbucket.org': No such device or address" grpc.code=Internal grpc.method=GenerateManifest grpc.request.deadline="2020-05-19T04:21:34Z" grpc.service=repository.RepoServerService grpc.start_time="2020-05-19T04:20:34Z" grpc.time_ms=1241.966 span.kind=server system=grpc
time="2020-05-19T04:20:35Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:https://cicd-exotel@bitbucket.org/Exotel/exotel_deployments.git,Path:reportix/integration,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Ksonnet:nil,Directory:nil,Plugin:nil,Chart:,}/ee99bb62f813471bbf01d947dfe0fa8070b9e80b"
time="2020-05-19T04:20:35Z" level=info msg="git fetch origin --tags --force" dir="/tmp/https:__cicd-exotel@bitbucket.org_exotel_exotel_deployments" execID=um08W
time="2020-05-19T04:20:35Z" level=error msg="finished unary call with code Internal" error="rpc error: code = Internal desc = Failed to fetch git repo: fork/exec /usr/bin/git: resource temporarily unavailable" grpc.code=Internal grpc.method=GenerateManifest grpc.request.deadline="2020-05-19T04:21:34Z" grpc.service=repository.RepoServerService grpc.start_time="2020-05-19T04:20:34Z" grpc.time_ms=1246.9 span.kind=server system=grpc
time="2020-05-19T04:20:35Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:https://cicd-exotel@bitbucket.org/Exotel/exotel_deployments.git,Path:samanval/qa_us3/,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Ksonnet:nil,Directory:nil,Plugin:nil,Chart:,}/ee99bb62f813471bbf01d947dfe0fa8070b9e80b"
time="2020-05-19T04:20:35Z" level=info msg="git fetch origin --tags --force" dir="/tmp/https:__cicd-exotel@bitbucket.org_exotel_exotel_deployments" execID=o79yN
time="2020-05-19T04:20:35Z" level=error msg="finished unary call with code Internal" error="rpc error: code = Internal desc = Failed to fetch git repo: fork/exec /usr/bin/git: resource temporarily unavailable" grpc.code=Internal grpc.method=GenerateManifest grpc.request.deadline="2020-05-19T04:21:34Z" grpc.service=repository.RepoServerService grpc.start_time="2020-05-19T04:20:34Z" grpc.time_ms=1245.643 span.kind=server system=grpc
time="2020-05-19T04:20:35Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:https://cicd-exotel@bitbucket.org/Exotel/exotel_deployments.git,Path:fulliautomatix/integration,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Ksonnet:nil,Directory:nil,Plugin:nil,Chart:,}/ee99bb62f813471bbf01d947dfe0fa8070b9e80b"
time="2020-05-19T04:20:35Z" level=info msg="git fetch origin --tags --force" dir="/tmp/https:__cicd-exotel@bitbucket.org_exotel_exotel_deployments" execID=KslMc

kanapuli added the bug (Something isn't working) label May 19, 2020
alexmt added the bug/severity:major (Malfunction in one of the core component, impacting a majority of users) and component:git (Interaction with GitHub, Gitlab etc) labels May 19, 2020
@kanapuli (Author)

I would like to contribute to this issue. Where should I start?

@jannfis (Member) commented May 21, 2020

Hi @kanapuli, thanks for your bug report and the interest in contributing! Much appreciated!

Please check our documentation at https://argoproj.github.io/argo-cd/developer-guide/contributing/ on how to get started. If you have any questions after reading this document, feel free to ask. Also, if there's something missing or wrongly documented, please let us know!

As for this particular problem, I think the root cause might be found in the package we use for executing external commands, at https://github.com/argoproj/pkg (or more specifically, https://github.com/argoproj/pkg/tree/master/exec); however, it might also be elsewhere (I haven't dug any deeper yet).

@kanapuli (Author)

Thanks @jannfis. Let me go through the code and documentation, and ask here if I have questions.

@asvasyanin

Any updates? I've encountered the same error.

@markbenschop commented Jun 2, 2020

I've seen this issue on multiple occasions. We had to reboot k8s nodes because they became unresponsive due to the enormous number of zombie processes. I assume the argocd-reposerver does not properly kill the git process after execution, or doesn't wait for the child process to return its exit code.

@WaldoFR commented Jun 2, 2020

Same here on version 1.5.3 (duplicate report: #3694).

@jannfis (Member) commented Jun 5, 2020

Hey guys, just a question during my search for the root cause of this issue and while trying to reproduce it reliably: are you making use of Kustomize for (some of) your applications?

@kanapuli (Author) commented Jun 5, 2020

No, I just use plain Kubernetes YAML manifests for my application.

@jannfis (Member) commented Jun 5, 2020

Just a heads-up: It seems I can reliably reproduce it now. I'll be digging for the root cause.

@jannfis (Member) commented Jun 7, 2020

I did some more research, and it turns out that this issue only comes to light when running in Kubernetes. This is most likely due to the fact that in a Kubernetes pod there is no init-like process which reaps terminated processes that no longer have a parent process. The issue is not reproducible when running ArgoCD outside a K8s cluster, no matter whether it is run in a Docker container or not.

I took some time to audit the functions used by ArgoCD to spawn external processes, and in fact we are correctly waiting for any child processes to exit; thus, ArgoCD should not leave unreaped zombie processes behind. But as we can observe, it happens anyway.

One can easily reproduce it by creating a Kustomize application that uses an inaccessible remote base, i.e. a Git repository that requires different authentication than the repository where the original kustomization.yaml resides.
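
For illustration, a minimal kustomization.yaml that triggers this could look like the sketch below; the remote base URL, path, and ref are placeholders for a private repository that the repo-server has no credentials for (depending on the kustomize version, the remote target may need to go under bases: instead of resources:):

    # kustomization.yaml of the application ArgoCD points at
    resources:
      # Remote base in a second, private repository; fetching it fails because
      # the repo-server only holds credentials for the first repository.
      - https://github.com/example-org/private-base//overlays/prod?ref=main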

I think what is going on is the following:

  • ArgoCD spawns a child process A and waits for it to either time out or exit normally
  • Child process A then forks a child process B and exits without waiting for process B; process B becomes an orphan
  • ArgoCD correctly reaps child process A and cleans up its entry in the process table
  • Process B exits and becomes a zombie, because no one will reap it

In the UNIX world, orphaned processes are reassigned to PID 1 as their parent, which is usually init(8). The init process usually has mechanisms to reap the orphaned processes that get assigned to it. In Kubernetes, however, PID 1 will be the command used to start the container; in this case that is argocd-repo-server, which in turn has no mechanism to reap orphan processes it does not know of.
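
A minimal sketch of this mechanism, assuming a repo-server pod without a shared process namespace, a procps-style ps in the image, and a placeholder pod name:

    # Process A (sh) backgrounds process B (sleep) and exits without waiting for it;
    # B is re-parented to PID 1 of the container's PID namespace (argocd-repo-server).
    kubectl -n argocd exec <argocd-repo-server-pod> -- sh -c 'sleep 2 &'

    # Once B exits, nothing reaps it, so it lingers in the process table in state Z:
    kubectl -n argocd exec <argocd-repo-server-pod> -- ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'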

I see three possible solutions to solve this problem:

  1. Implement orphan process reaping in ArgoCD's server processes
  2. Run the container with an init-like wrapper, such as tini
  3. Enable process namespace sharing for our pods in Kubernetes

Solution 1 feels like reinventing the wheel, so I did a small PoC using solution 2, adapting ArgoCD's Docker entry point script to execute argocd-repo-server wrapped through tini. tini is available as an installable package in our debian:10 base image, so it could easily be included in the Docker image without much overhead or adaptation. This works, and orphaned processes get reaped, leaving no zombie processes behind.
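
As a rough sketch of that PoC (the actual entry point script and arguments in the ArgoCD image differ; tini here is assumed to be installed from the debian:10 package):

    #!/bin/sh
    # Run the repo server under tini so that orphaned processes re-parented
    # to PID 1 are reaped instead of lingering as zombies.
    exec tini -- argocd-repo-server "$@"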

Solution 3 seems the most native one; however, it might not be available in early versions of K8s that can still be found in the wild (it became stable with K8s v1.17). I think it was introduced as an alpha feature in v1.10 and has been beta since at least v1.14, from where on it is enabled by default. According to my research, the Kubernetes people adopted tini in the /pause process, which will be PID 1 if shareProcessNamespace is set to true. I think it is safe to assume that this feature is enabled in most, if not all, production clusters out there. However, maybe someone knows of consequences this would have that I'm currently not aware of.

Long story short: the workaround for this issue is to set shareProcessNamespace: true in the pod template of the argocd-repo-server deployment resource. I think we should make this the default in the installation manifests, and also for the other pods running in the argocd namespace.
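
As a sketch, the change amounts to one field in the Deployment's pod template (manifest abbreviated; the labels, image, and command shown here are illustrative):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: argocd-repo-server
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: argocd-repo-server
      template:
        metadata:
          labels:
            app.kubernetes.io/name: argocd-repo-server
        spec:
          # With a shared process namespace, /pause becomes PID 1 of the pod
          # and reaps orphaned child processes left behind by git invocations.
          shareProcessNamespace: true
          containers:
            - name: argocd-repo-server
              image: argoproj/argocd:v1.5.3   # illustrative image/tag
              command: ["argocd-repo-server"]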

@jessesuen (Member)

This was an excellent summary. For option 3, enabling shareProcessNamespace, we would enable the mode under the assumption that the pause container is really a tini process which will reap the orphaned processes? Is it possible that tini might be run in a way where this is a false assumption? For example, can we trust that this is also true for OpenShift?

I think we should make this the default in the installation manifests, also for the other pods running in the argocd namespace.

It sounds like we should do either option 2 or 3 for the repo-server, since it does spawn child processes. But for argocd-server and argocd-application-controller, I might avoid it, since we don't spawn child processes AFAIK.

I'm in favor of doing 2, since 3 makes assumptions about the Kubernetes version and its underlying implementation using tini.

alexmt pushed a commit that referenced this issue Jun 9, 2020

* fix: Reap orphaned ("zombie") processes in argocd-repo-server pod (#3721)
elemental-lf added a commit to elemental-lf/argo-helm that referenced this issue Oct 5, 2020

While uid_entrypoint.sh contains the OpenShift-specific manipulation of
/etc/passwd, it also starts the reposerver via tini and so ensures that any
zombies produced by the reposerver and its descendants are collected.

This matches the behaviour of the manifests included with the main ArgoCD
project. See:

* https://github.com/argoproj/argo-cd/blob/f93da5346c3dfe0ec75549fd78b2d30ce7d5cfad/manifests/base/repo-server/argocd-repo-server-deployment.yaml#L24
* argoproj/argo-cd#3721
* argoproj/argo-cd#3611
seanson pushed a commit to argoproj/argo-helm that referenced this issue Oct 8, 2020
muwon pushed a commit to muwon/argo-helm that referenced this issue Oct 10, 2020

* fix(argocd): Unconditionally start reposerver with uid_entrypoint.sh (argoproj#466)
* chore: Bumping minor semver as this feels like a bit more than a patch change.
@willemm commented Mar 28, 2022

We're currently seeing this issue (well, almost: the defunct processes are mostly 'ssh'). Running the argocd:v2.2.5 image.

@jumarko commented Apr 23, 2022

I don't have anything special to add, but I want to share my experience from debugging a similar problem, completely unrelated to argo-cd (I've never used it).

Our Java application was running as PID 1 in a Docker container and was running git clone commands as sub-processes.
As it turns out, when the git cloning process fails, it doesn't properly reap its child process, which in turn becomes a zombie and is re-parented to PID 1 (our Java app), which doesn't know how to deal with it.

A simple command like this reproduces the problem (if you don't have an SSH key set up for git):

git clone git@github.com:jumarko/poptavka.git

It fails with a permissions error, leaving an sh process behind as a zombie.
This sh process runs the ssh command for git-style remote URLs.

A similar thing happens with https-style URLs, but in that case it leaves git zombies around
(it runs a git remote-https subprocess, which is left behind as a zombie).

It feels like a bug in git, where child processes are not reaped if a git clone subprocess terminates with an error.
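
A quick way to confirm what is left behind (assuming a procps-style ps) is to list processes in zombie state right after the failed clone:

    # Zombie processes have a state beginning with "Z"; after the failed clone
    # above, this shows the leftover sh (or git remote-https) child.
    ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'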
