Provide a way to terminate/shutdown workflow pods gracefully #2742
Comments
Here is where we actually delete Pods; we use the K8s API. Also, is this only relevant for pending pods?
@simster7 Unless I am mistaken, the line of code you mentioned is only run when the pod is in PodPending status; I don't believe it applies when it is PodRunning. We are not experiencing the same behavior as kubectl delete with our batch application pod: when running as a Kubernetes Job, delete results in a graceful shutdown, but the terminate command from the Argo CLI or UI, when the pod is part of a workflow, just kills it immediately. I have not checked any other scenarios, just the terminate command. Edit: I think when the pod is running, it skips past those two cases and eventually falls through to the signal-based kill path. This matches up with what I'm seeing in my logs.
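For reference, a minimal sketch of the routing described above (a paraphrase of this reading of the thread, not the actual Argo source):

```go
package main

import corev1 "k8s.io/api/core/v1"

// routeShutdown paraphrases the control flow described above; it is a
// reading of the thread, not the actual Argo source. Pending pods can be
// deleted directly via the API, while running pods fall through to the
// executor's signal-based kill path.
func routeShutdown(pod *corev1.Pod) string {
	switch pod.Status.Phase {
	case corev1.PodPending:
		return "delete via the Kubernetes API"
	case corev1.PodRunning:
		return "signal-based kill via the executor"
	default:
		return "no action"
	}
}
```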
Ah, good catch! If the Pods are running (instead of pending), we implement the shutdown using the old "set the deadline" trick. I think the fix should be to replace this logic with something similar to the pending case, where we first attempt to delete the Pods using the API and only use the deadline trick as a fallback.
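A hypothetical sketch of that proposal in Go, assuming recent client-go signatures; the deadline fallback is left as a callback since it depends on Argo internals:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// terminatePod is a hypothetical sketch of the fix described above (recent
// client-go signatures assumed): first attempt a graceful API delete, and
// only fall back to the old "set the deadline" trick if that fails. The
// fallback is passed in because it depends on Argo internals.
func terminatePod(ctx context.Context, c kubernetes.Interface, ns, name string, deadlineFallback func() error) error {
	grace := int64(30) // seconds the container gets between SIGTERM and SIGKILL
	err := c.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &grace,
	})
	if err == nil {
		return nil
	}
	return deadlineFallback()
}
```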
Turns out the fix was way simpler: #2855
@wktmeow After an internal review I realized that the fix I had proposed will not work. The reason we terminate Workflow pods the way we do is to allow the wait container to gather and upload all artifacts without being restricted by K8s. I'll look for another fix.
It looks like I did not dig deeply enough the first time around. The USR2 signal triggers a reload of the annotations, which I believe is when the executor realizes that the container needs to be killed, and ContainerRuntimeExecutor.Kill is triggered. I think what I actually want is for the kill grace period to be configurable, as our app can take longer than 10 seconds to stop safely. So maybe this issue is actually a duplicate of an existing issue about making that grace period configurable.
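Generically, the pattern under discussion looks like the following Go sketch (not Argo's executor code); the grace argument is exactly the value being asked to be made configurable:

```go
package main

import (
	"os/exec"
	"syscall"
	"time"
)

// killWithGrace is a generic Go sketch of the pattern above, not Argo's
// executor code: send SIGTERM, wait up to a configurable grace period, and
// only then SIGKILL. In the behavior described here, that period is a fixed
// 10 seconds; making it a parameter is the requested change.
func killWithGrace(cmd *exec.Cmd, grace time.Duration) error {
	if err := cmd.Process.Signal(syscall.SIGTERM); err != nil {
		return err
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	select {
	case err := <-done:
		return err // the process exited within the grace period
	case <-time.After(grace):
		return cmd.Process.Kill() // grace period expired: SIGKILL
	}
}
```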
Closed in #3064. If not, feel free to reopen.
@simster7 Sorry if I'm missing the obvious, but I've bounced back and forth a few times between this thread and #3064 plus the docs without finding an answer to the following question: if I have multiple steps/tasks running under a single workflow, how can I gracefully terminate just one of them?
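One possible workaround, sketched below under stated assumptions (the workflows.argoproj.io/workflow label and workflows.argoproj.io/node-name annotation are assumptions, and how Argo reconciles the deleted step is not covered in this thread): delete just the pod backing that step via the Kubernetes API.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// terminateOneStep sketches a possible workaround, not a documented Argo
// feature: enumerate the pods backing a workflow's steps (the
// workflows.argoproj.io/workflow label is an assumption), pick the one for
// the step in question, and delete it gracefully so only that step stops.
func terminateOneStep(ctx context.Context, c kubernetes.Interface, ns, wfName, nodeName string) error {
	pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
		LabelSelector: "workflows.argoproj.io/workflow=" + wfName,
	})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		// Assumption: the step (node) name is recorded in this pod annotation.
		if p.Annotations["workflows.argoproj.io/node-name"] == nodeName {
			grace := int64(60) // give the step's container time to shut down cleanly
			return c.CoreV1().Pods(ns).Delete(ctx, p.Name, metav1.DeleteOptions{GracePeriodSeconds: &grace})
		}
	}
	return fmt.Errorf("no pod found for step %q in workflow %q", nodeName, wfName)
}
```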
Summary
Currently it appears that when a terminate command is sent to a workflow, the pods are killed via the wait container with a kubectl exec kill command, and the pod is not given a chance to shut down gracefully (see workflow/controller/exec_control.go).
Motivation
We are trying to use Argo to orchestrate long-running batch jobs. Sometimes we would like to stop a running job and resume it later, but unless our container receives a SIGTERM, as it would, for example, with a kubectl delete or during a node reboot, our application does not get a chance to stop gracefully and the batch job is left in a non-resumable state.
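For context, a minimal Go sketch of the application-side handling this assumes: trap SIGTERM and checkpoint so the job stays resumable (runBatchJob and checkpoint are hypothetical placeholders):

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// A minimal sketch of the app-side behavior this motivation assumes: trap
// SIGTERM and checkpoint before exiting so the batch job stays resumable.
// runBatchJob and checkpoint are hypothetical stand-ins for the real logic.
func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	go runBatchJob()

	<-sigs // Kubernetes sends SIGTERM when the pod is deleted gracefully
	checkpoint()
	fmt.Println("state checkpointed; exiting cleanly")
}

func runBatchJob() { select {} } // placeholder for the long-running work
func checkpoint()  {}            // placeholder for saving resumable state
```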
Proposal
Delete the pod via the Kubernetes API in the same manner that kubectl delete would. (Sorry, I'm not familiar enough with coding against the Kubernetes API to offer any more detail than that.)
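For completeness, a minimal client-go sketch of what kubectl delete does under the hood (recent client-go signatures assumed):

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deletePodGracefully sketches the call kubectl delete pod makes (recent
// client-go signatures assumed). Leaving GracePeriodSeconds unset lets the
// pod's own terminationGracePeriodSeconds (30s by default) apply: Kubernetes
// sends SIGTERM, waits out the grace period, then sends SIGKILL.
func deletePodGracefully(ctx context.Context, c kubernetes.Interface, ns, name string) error {
	return c.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{})
}
```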
Message from the maintainers: If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.