
PreStop Hook exited with 137 blocking clean kubectl delete pod #81

Open · smoke opened this issue Mar 12, 2020 · 4 comments

@smoke (Contributor) commented Mar 12, 2020

The following command hangs for a long time:

smoke@rkirilov-work-pc ~ $ kubectl delete pod -n ci concourse-ci-worker-0 
pod "concourse-ci-worker-0" deleted

When I describe the pod, it is clear that the PreStop hook did not exit cleanly:

smoke@rkirilov-work-pc ~ $ kubectl describe pod -n ci concourse-ci-worker-0 | cat | tail -n 12
Events:
  Type     Reason             Age   From                                  Message
  ----     ------             ----  ----                                  -------
  Normal   Scheduled          79s   default-scheduler                     Successfully assigned ci/concourse-ci-worker-0 to ip-10-200-3-38.ec2.internal
  Normal   Pulled             78s   kubelet, ip-10-200-3-38.ec2.internal  Container image "concourse/concourse:5.8.0" already present on machine
  Normal   Created            78s   kubelet, ip-10-200-3-38.ec2.internal  Created container concourse-ci-worker-init-rm
  Normal   Started            78s   kubelet, ip-10-200-3-38.ec2.internal  Started container concourse-ci-worker-init-rm
  Normal   Pulled             72s   kubelet, ip-10-200-3-38.ec2.internal  Container image "concourse/concourse:5.8.0" already present on machine
  Normal   Created            72s   kubelet, ip-10-200-3-38.ec2.internal  Created container concourse-ci-worker
  Normal   Started            72s   kubelet, ip-10-200-3-38.ec2.internal  Started container concourse-ci-worker
  Normal   Killing            54s   kubelet, ip-10-200-3-38.ec2.internal  Stopping container concourse-ci-worker
  Warning  FailedPreStopHook  11s   kubelet, ip-10-200-3-38.ec2.internal  Exec lifecycle hook ([/bin/bash /pre-stop-hook.sh]) for Container "concourse-ci-worker" in Pod "concourse-ci-worker-0_ci(8688f7aa-6444-11ea-9917-0ad140727ba9)" failed - error: command '/bin/bash /pre-stop-hook.sh' exited with 137: , message: ""

So the only workaround is to force delete the pod:

smoke@rkirilov-work-pc ~ $ kubectl delete pod --force --grace-period=0 -n ci concourse-ci-worker-0 
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "concourse-ci-worker-0" force deleted

Maybe /pre-stop-hook.sh should be patched to handle (trap) the relevant signals (e.g. SIGTERM, SIGINT, SIGHUP) and exit cleanly. I assume that when dumb-init is signaled, it tries on its own to cleanly terminate /pre-stop-hook.sh, and since the script does not terminate cleanly, it gets killed with exit code 137 (128 + SIGKILL), which then blocks K8s.

I will give it a try and will update the ticket, hopefully with a PR.
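For illustration, a minimal sketch of such a trap-based hook (hypothetical and untested against this chart; the shutdownSignal template value is taken from the configmap quoted further down):

#!/bin/bash
# Forward the configured shutdown signal to the worker (PID 1 under dumb-init).
kill -s {{ .Values.concourse.worker.shutdownSignal }} 1
# Exit cleanly on the signals the hook itself may receive, instead of
# being SIGKILLed and reporting 137.
trap 'exit 0' SIGTERM SIGINT SIGHUP
# Wait for the worker process to go away.
while [ -e /proc/1 ]; do sleep 1; done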

Actually, K8s waits for the PreStop hook only for terminationGracePeriodSeconds, then sends SIGTERM to the containers, and then SIGKILLs all remaining processes 2 seconds later, as per kubernetes/kubernetes#39170 (comment) and https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods
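For example, the grace window for a single delete can be lengthened with the standard kubectl flag (600 is just an illustrative value):

kubectl delete pod -n ci concourse-ci-worker-0 --grace-period=600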

However, the strange thing is that the pod is left in a Terminating state for many more minutes and doesn't seem to restart.

So maybe the best course of action would be to use timeout -k {{ .Values.worker.terminationGracePeriodSeconds }} bash -c 'while [ -e /proc/1 ]; do sleep 1; done' or something similar. That way at least the delete command will not be blocked.

Also, it is important to increase .Values.worker.terminationGracePeriodSeconds to something that makes sense for your own pipelines.
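For instance, assuming a release installed from this chart (the release and chart references here are hypothetical), the value can be raised at upgrade time; 3600 is just an illustrative figure, pick one that covers your longest in-flight builds:

helm upgrade my-concourse concourse/concourse --set worker.terminationGracePeriodSeconds=3600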

@taylorsilva (Member) commented

I tried a quick patch with your suggestion:

diff --git a/templates/worker-prestop-configmap.yaml b/templates/worker-prestop-configmap.yaml
index 9d5dd31..9f43a76 100644
--- a/templates/worker-prestop-configmap.yaml
+++ b/templates/worker-prestop-configmap.yaml
@@ -11,5 +11,5 @@ data:
   pre-stop-hook.sh: |
     #!/bin/bash
     kill -s {{ .Values.concourse.worker.shutdownSignal }} 1
-    while [ -e /proc/1 ]; do sleep 1; done
+    timeout -k {{ .Values.worker.terminationGracePeriodSeconds }} {{ .Values.worker.terminationGracePeriodSeconds }} /bin/bash -c 'while [ -e /proc/1 ]; do sleep 1; done'

The script still exits with a non-zero exit code, 124 in this case (timeout's exit status when the command times out):

Warning  FailedPreStopHook       1s     kubelet, gke-topgun-topgun-worker-2c49df4e-qwh6  Exec lifecycle hook ([/bin/bash /pre-stop-hook.sh]) for Container "issue81-worker" in Pod "issue81-worker-0_issue81(4ad690c9-d362-48d8-9e5a-c5e873b5571e)" failed - error: command '/bin/bash /pre-stop-hook.sh' exited with 124: , message: ""

Not sure what a good solution for this one is 🤔
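One possible tweak (an untested sketch, not something the chart ships) would be to treat timeout's 124 as success, so the hook reports a clean exit even when the grace period is exhausted:

#!/bin/bash
kill -s {{ .Values.concourse.worker.shutdownSignal }} 1
# timeout exits 124 when the deadline is hit; map that to 0 so the kubelet
# does not record FailedPreStopHook. Any other non-zero status still fails.
timeout {{ .Values.worker.terminationGracePeriodSeconds }} \
  /bin/bash -c 'while [ -e /proc/1 ]; do sleep 1; done' || [ $? -eq 124 ]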


To reproduce this, I installed the Helm chart with default settings and started this long-running job:

---
jobs:
  - name: simple-job
    plan:
      - task: simple-task
        config:
          platform: linux
          image_resource:
            type: registry-image
            source: {repository: busybox}
          run:
            path: /bin/sh
            args:
              - -c
              - |
                #!/bin/sh
                sleep 1h

I then deleted the pod

$ kubectl delete pod -n issue81 issue81-worker-0

and kept describing the pod until I saw the relevant error:

$ k describe pod -n issue81 issue81-worker-0 | tail -n 10

@smoke (Contributor, Author) commented May 31, 2020

@taylorsilva I confirm your findings, and I don't have a better workaround than increasing the timeout and manually intervening when such things happen :(

@skreddy6673 commented

Having the same issue on Concourse v5.7.1.

@vineethNaroju commented

Hi, I have the same error. I attached a preStop hook script containing a 10-second sleep and deleted the pod. The pre-stop hook script ran, but I still got a FailedPreStopHook event with the same exit code 137. This is on EKS with Kubernetes 1.25.
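For context, 137 is the status a process reports after being killed with SIGKILL (128 + 9), which is what happens when the hook outlives the shutdown sequence. A quick generic demonstration (plain bash, nothing EKS-specific):

bash -c 'sleep 60' &
kill -9 $!
wait $!
echo $?   # prints 137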
