
Provide a way to terminate/shutdown workflow pods gracefully #2742

Closed
matveybrant opened this issue Apr 17, 2020 · 8 comments

@matveybrant
Contributor

Summary

Currently it appears that when a terminate command is sent to a workflow, the pods are killed via the wait container with a kubectl exec kill command, and the pod is not given a chance to shut down gracefully:

workflow/controller/exec_control.go:

	woc.log.Infof("Signalling %s of updates", podName)
	exec, err := common.ExecPodContainer(
		woc.controller.restConfig, woc.wf.ObjectMeta.Namespace, podName,
		common.WaitContainerName, true, true, "sh", "-c", "kill -s USR2 $(pidof argoexec)",
	)

Motivation

We are trying to use Argo to orchestrate long-running batch jobs. Sometimes we would like to stop a running job and resume it later, but unless our container receives a SIGTERM, as it would with, for example, a kubectl delete or a node reboot, our application does not have a chance to stop gracefully and leaves the batch job in a non-resumable state.

Proposal

Delete the pod via the Kubernetes API in the same manner that kubectl delete would. (Sorry, I'm not familiar enough with coding against the Kubernetes API to offer any more detail than that.)
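
For illustration only, a minimal sketch with modern client-go of what "delete the pod the way kubectl delete does" amounts to (this is not the Argo controller code; the function and package names are made up):

    package podutil

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // deletePodGracefully issues a standard API delete. Leaving GracePeriodSeconds
    // unset means the pod's own terminationGracePeriodSeconds applies, which is
    // what kubectl delete does by default: SIGTERM first, then SIGKILL once the
    // grace period expires.
    func deletePodGracefully(ctx context.Context, clientset kubernetes.Interface, namespace, podName string) error {
        return clientset.CoreV1().Pods(namespace).Delete(ctx, podName, metav1.DeleteOptions{})
    }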


Message from the maintainers:

If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

@matveybrant matveybrant added the type/feature Feature request label Apr 17, 2020
@simster7 simster7 self-assigned this Apr 17, 2020
@simster7
Member

Here is where we actually delete Pods:

https://github.com/argoproj/argo/blob/bac339afe10013a63bc1bb2fb83a883c07e19cb9/workflow/controller/exec_control.go#L34

We use the K8s API Delete request, which should be the same as kubectl delete. Is this not the behavior you're experiencing?

Also, is this only relevant for Stop/Terminate commands? Or for other scenarios?

@matveybrant
Contributor Author

matveybrant commented Apr 24, 2020

@simster7 unless I am mistaken, the line of code you mentioned is only run when the pod is in PodPending status; I don't believe it applies when the pod is PodRunning.

We are not experiencing the same behavior as kubectl delete with our batch application pod. When it runs as a Kubernetes Job, a delete results in a graceful shutdown, but a terminate command from the Argo CLI or UI, when the pod is part of a workflow, just kills it immediately. I have not checked any other scenarios, just the terminate command.

Edit: I think when the pod is running, it skips past those two cases and eventually goes to
https://github.com/argoproj/argo/blob/bac339afe10013a63bc1bb2fb83a883c07e19cb9/workflow/controller/exec_control.go#L96

and then

https://github.com/argoproj/argo/blob/bac339afe10013a63bc1bb2fb83a883c07e19cb9/workflow/controller/exec_control.go#L151

This matches up with what I'm seeing in my logs:

time="2020-04-14T21:49:17Z" level=info msg="Signalling batch-job-vdnk8-1020344310 of updates" namespace=batch-argo-dev workflow
=batch-job-vdnk8
time="2020-04-14T21:49:17Z" level=info msg="https://10.154.0.1:443/api/v1/namespaces/batch-argo-dev/pods/batch-job-vdnk8-102034
4310/exec?command=sh&command=-c&command=kill+-s+USR2+%24%28pidof+argoexec%29&container=wait&stderr=true&stdout=true&tty=false"

@simster7
Member

Ah, good catch!

If the Pods are running (instead of pending), we implement the shutdown using the old "set activeDeadlineSeconds to 0" way, which I think (but I'm not sure) terminates the Pod non-gracefully. See:

https://github.com/argoproj/argo/blob/bac339afe10013a63bc1bb2fb83a883c07e19cb9/workflow/controller/exec_control.go#L76-L79

I think the fix should be to replace this logic and implement something similar to:

https://github.com/argoproj/argo/blob/bac339afe10013a63bc1bb2fb83a883c07e19cb9/workflow/controller/exec_control.go#L30-L43

Where we first attempt to delete the Pods using the API and only use the deadline trick as a fallback.
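
As a rough sketch of that order of operations (hypothetical code, not the actual fix; terminateWorkflowPod and the applyDeadline callback are invented names standing in for the existing controller logic):

    package podutil

    import (
        "context"
        "log"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // terminateWorkflowPod attempts a graceful API delete first and only falls
    // back to the existing deadline mechanism if the delete fails.
    func terminateWorkflowPod(ctx context.Context, clientset kubernetes.Interface, namespace, podName string, applyDeadline func(string) error) error {
        if err := clientset.CoreV1().Pods(namespace).Delete(ctx, podName, metav1.DeleteOptions{}); err != nil {
            log.Printf("graceful delete of pod %s failed, falling back to deadline: %v", podName, err)
            return applyDeadline(podName)
        }
        return nil
    }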

@simster7
Member

Turns out the fix was way simpler: #2855

@simster7
Member

simster7 commented May 7, 2020

@wktmeow After an internal review I realized that the fix I had proposed will not work. The reason we terminate Workflow pods the way we do is to allow the wait container to gather and upload all artifacts without being restricted by K8s. I'll look for another fix.

@matveybrant
Contributor Author

matveybrant commented May 20, 2020

It looks like I did not dig deeply enough the first time around. The USR2 signal triggers a reload of the annotations, which is, I believe, when the executor realizes that the container needs to be killed and ContainerRuntimeExecutor.Kill is triggered:
https://github.com/argoproj/argo/blob/4c3387b273d802419a1552345dfb95dd05b8555b/workflow/executor/executor.go#L1090
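
For context, the general signal-handling pattern involved looks roughly like this (illustrative Go only, not the argoexec source): the process stays alive on SIGUSR2 and re-reads its desired state, so the workload itself receives no termination signal at that point.

    package executorutil

    import (
        "log"
        "os"
        "os/signal"
        "syscall"
    )

    // watchForUpdates registers a SIGUSR2 handler. Instead of terminating, the
    // process re-reads its execution-control state (the annotations mentioned
    // above) whenever the signal arrives.
    func watchForUpdates(reload func()) {
        sigs := make(chan os.Signal, 1)
        signal.Notify(sigs, syscall.SIGUSR2)
        go func() {
            for range sigs {
                log.Println("received SIGUSR2, reloading execution control")
                reload()
            }
        }()
    }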

I think what I actually want is for the following grace period to be configurable, as our app can take longer than 10 seconds to stop safely:
https://github.com/argoproj/argo/blob/4c3387b273d802419a1552345dfb95dd05b8555b/workflow/executor/common/common.go#L22
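
For example, one way the grace period could be made configurable (a hypothetical sketch only; the environment variable name is invented, and the 10-second default mirrors the hardcoded constant linked above):

    package executorutil

    import (
        "os"
        "strconv"
        "time"
    )

    // defaultKillGracePeriod mirrors the hardcoded 10-second value linked above.
    const defaultKillGracePeriod = 10 * time.Second

    // killGracePeriod returns an operator-supplied override if one is set,
    // otherwise the default.
    func killGracePeriod() time.Duration {
        if v := os.Getenv("KILL_GRACE_PERIOD_SECONDS"); v != "" {
            if secs, err := strconv.Atoi(v); err == nil && secs > 0 {
                return time.Duration(secs) * time.Second
            }
        }
        return defaultKillGracePeriod
    }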

So maybe this issue is actually a duplicate of:
#1012

@simster7
Member

Closed in #3064. If not, feel free to reopen.

@JakeColor

JakeColor commented Apr 9, 2021

@simster7 sorry if I'm missing the obvious, but I've bounced back and forth a few times between this thread and #3064 plus the docs without finding an answer to the following question:

If I have multiple steps/tasks running under a single workflow, how can I gracefully terminate one of the steps/tasks?

You can see in the example below that I used kubectl delete pod, but did not get the graceful kill I was looking for:

 ● train-dl13-rgvmq                                                                                                                                           train                                                                            
 └─┬─● train-loop(0:conf_dir:/default,config_id:0)  train-each  train-dl13-rgvmq-3469847994  1d                                      
   ├─⚠ train-loop(1:conf_dir:/default,config_id:1)  train-each  train-dl13-rgvmq-1066836090  1d        pod deleted during operation  
   └─● train-loop(2:conf_dir:/default,config_id:2)  train-each  train-dl13-rgvmq-919645334   1d  
