
Provide a way to terminate/shutdown workflow pods gracefully #2742

Closed
matveybrant opened this issue Apr 17, 2020 · 8 comments

@matveybrant
Contributor

Summary

Currently it appears that when a terminate command is sent to a workflow, the pods are killed via the wait container with a kubectl exec kill command, and the pod is not given a chance to shut down gracefully:

workflow/controller/exec_control.go:

	woc.log.Infof("Signalling %s of updates", podName)
	exec, err := common.ExecPodContainer(
		woc.controller.restConfig, woc.wf.ObjectMeta.Namespace, podName,
		common.WaitContainerName, true, true, "sh", "-c", "kill -s USR2 $(pidof argoexec)",
	)

Motivation

We are trying to use Argo to orchestrate long-running batch jobs. Sometimes we would like to stop a running job and resume it later, but unless our container receives a SIGTERM, as it would with, for example, a kubectl delete or a node reboot, our application does not have a chance to stop gracefully and leaves the batch job in a non-resumable state.

Proposal

Delete the pod via the Kubernetes API in the same manner that kubectl delete would. (Sorry, I'm not familiar enough with coding against the Kubernetes API to offer any more detail than that.)
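
For illustration only, a minimal sketch with modern client-go of what "delete the pod the way kubectl delete does" amounts to (this is not the Argo controller code; the function and package names are made up):

    package podutil

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // deletePodGracefully issues a standard API delete. Leaving GracePeriodSeconds
    // unset means the pod's own terminationGracePeriodSeconds applies, which is
    // what kubectl delete does by default: SIGTERM first, then SIGKILL once the
    // grace period expires.
    func deletePodGracefully(ctx context.Context, clientset kubernetes.Interface, namespace, podName string) error {
        return clientset.CoreV1().Pods(namespace).Delete(ctx, podName, metav1.DeleteOptions{})
    }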


Message from the maintainers:

If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

@matveybrant matveybrant added the type/feature Feature request label Apr 17, 2020
@simster7 simster7 self-assigned this Apr 17, 2020
@simster7
Member

Here is where we actually delete Pods:

https://github.com/argoproj/argo/blob/bac339afe10013a63bc1bb2fb83a883c07e19cb9/workflow/controller/exec_control.go#L34

We use the K8s API Delete request, which should be the same as kubectl delete. Is this not the behavior you're experiencing?

Also, is this only relevant for Stop/Terminate commands? Or for other scenarios?

@matveybrant
Contributor Author

matveybrant commented Apr 24, 2020

@simster7 unless I am mistaken, the line of code you mentioned is only run when the pod is in PodPending status; I don't believe it applies when the pod is PodRunning.

We are not experiencing the same behavior as kubectl delete with our batch application pod. When it runs as a Kubernetes Job, a delete results in a graceful shutdown, but a terminate command from the Argo CLI or UI, when the pod is part of a workflow, just kills it immediately. I have not checked any other scenarios, just the terminate command.

Edit: I think when the pod is running, it skips past those two cases and eventually goes to
https://github.com/argoproj/argo/blob/bac339afe10013a63bc1bb2fb83a883c07e19cb9/workflow/controller/exec_control.go#L96

and then

https://github.com/argoproj/argo/blob/bac339afe10013a63bc1bb2fb83a883c07e19cb9/workflow/controller/exec_control.go#L151

This matches up with what I'm seeing in my logs:

time="2020-04-14T21:49:17Z" level=info msg="Signalling batch-job-vdnk8-1020344310 of updates" namespace=batch-argo-dev workflow
=batch-job-vdnk8
time="2020-04-14T21:49:17Z" level=info msg="https://10.154.0.1:443/api/v1/namespaces/batch-argo-dev/pods/batch-job-vdnk8-102034
4310/exec?command=sh&command=-c&command=kill+-s+USR2+%24%28pidof+argoexec%29&container=wait&stderr=true&stdout=true&tty=false"

@simster7
Member

Ah, good catch!

If the Pods are running (instead of pending), we implement the shutdown using the old "set activeDeadlineSeconds to 0" way, which I think (but I'm not sure) terminates the Pod non-gracefully. See:

https://github.com/argoproj/argo/blob/bac339afe10013a63bc1bb2fb83a883c07e19cb9/workflow/controller/exec_control.go#L76-L79

I think the fix should be to replace this logic and implement something similar to:

https://github.com/argoproj/argo/blob/bac339afe10013a63bc1bb2fb83a883c07e19cb9/workflow/controller/exec_control.go#L30-L43

Where we first attempt to delete the Pods using the API and only use the deadline trick as a fallback.
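
As a rough sketch of that order of operations (hypothetical code, not the actual fix; terminateWorkflowPod and the applyDeadline callback are invented names standing in for the existing controller logic):

    package podutil

    import (
        "context"
        "log"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // terminateWorkflowPod attempts a graceful API delete first and only falls
    // back to the existing deadline mechanism if the delete fails.
    func terminateWorkflowPod(ctx context.Context, clientset kubernetes.Interface, namespace, podName string, applyDeadline func(string) error) error {
        if err := clientset.CoreV1().Pods(namespace).Delete(ctx, podName, metav1.DeleteOptions{}); err != nil {
            log.Printf("graceful delete of pod %s failed, falling back to deadline: %v", podName, err)
            return applyDeadline(podName)
        }
        return nil
    }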

@simster7
Member

Turns out the fix was way simpler: #2855

@simster7
Member

simster7 commented May 7, 2020

@wktmeow After an internal review I realized that the fix I had proposed will not work. The reason we terminate Workflow pods the way we do is to allow the wait container to gather and upload all artifacts without being restricted by K8s. I'll look for another fix.

@matveybrant
Contributor Author

matveybrant commented May 20, 2020

It looks like I did not dig deeply enough the first time around. The USR2 signal triggers a reload of the annotations, which is, I believe, when the executor realizes that the container needs to be killed and ContainerRuntimeExecutor.Kill is triggered:
https://github.com/argoproj/argo/blob/4c3387b273d802419a1552345dfb95dd05b8555b/workflow/executor/executor.go#L1090
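
For context, the general signal-handling pattern involved looks roughly like this (illustrative Go only, not the argoexec source): the process stays alive on SIGUSR2 and re-reads its desired state, so the workload itself receives no termination signal at that point.

    package executorutil

    import (
        "log"
        "os"
        "os/signal"
        "syscall"
    )

    // watchForUpdates registers a SIGUSR2 handler. Instead of terminating, the
    // process re-reads its execution-control state (the annotations mentioned
    // above) whenever the signal arrives.
    func watchForUpdates(reload func()) {
        sigs := make(chan os.Signal, 1)
        signal.Notify(sigs, syscall.SIGUSR2)
        go func() {
            for range sigs {
                log.Println("received SIGUSR2, reloading execution control")
                reload()
            }
        }()
    }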

I think what I actually want is for the following grace period to be configurable, as our app can take longer than 10 seconds to stop safely:
https://github.com/argoproj/argo/blob/4c3387b273d802419a1552345dfb95dd05b8555b/workflow/executor/common/common.go#L22
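
For example, one way the grace period could be made configurable (a hypothetical sketch only; the environment variable name is invented, and the 10-second default mirrors the hardcoded constant linked above):

    package executorutil

    import (
        "os"
        "strconv"
        "time"
    )

    // defaultKillGracePeriod mirrors the hardcoded 10-second value linked above.
    const defaultKillGracePeriod = 10 * time.Second

    // killGracePeriod returns an operator-supplied override if one is set,
    // otherwise the default.
    func killGracePeriod() time.Duration {
        if v := os.Getenv("KILL_GRACE_PERIOD_SECONDS"); v != "" {
            if secs, err := strconv.Atoi(v); err == nil && secs > 0 {
                return time.Duration(secs) * time.Second
            }
        }
        return defaultKillGracePeriod
    }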

So maybe this issue is actually a duplicate of:
#1012

@simster7
Member

Closed in #3064. If not, feel free to reopen.

@JakeColor

JakeColor commented Apr 9, 2021

@simster7 sorry if I'm missing the obvious, but I've bounced back and forth a few times between this thread and #3064 plus the docs without finding an answer to the following question:

If I have multiple steps/tasks running under a single workflow, how can I gracefully terminate one of the steps/tasks?

You can see in the example below that I used kubectl delete pod, but did not get the graceful kill I was looking for:

 ● train-dl13-rgvmq                                                                                                                                           train                                                                            
 └─┬─● train-loop(0:conf_dir:/default,config_id:0)  train-each  train-dl13-rgvmq-3469847994  1d                                      
   ├─⚠ train-loop(1:conf_dir:/default,config_id:1)  train-each  train-dl13-rgvmq-1066836090  1d        pod deleted during operation  
   └─● train-loop(2:conf_dir:/default,config_id:2)  train-each  train-dl13-rgvmq-919645334   1d  
