GKE preemption inconsistently treated as WorkflowNodeError and WorkflowNodeFailed #9880

Closed
1 of 3 tasks
pjo256 opened this issue Oct 21, 2022 · 3 comments
Labels: solution/workaround, type/bug

Comments

pjo256 commented Oct 21, 2022

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

For some context, we run a large number of jobs on preemptible node pools in GKE. We typically don't want to retry application failures, but we always want to retry preemptions. We added a retry policy for OnError, and successfully triggered this policy when manually preempting nodes (e.g. deleting a node in the node pool).

When looking at Argo workflows in the cluster, a manual preemption was recorded as a WorkflowNodeError with a "pod deleted" error message and correctly triggered our retry policy. During an actual preemption, however, the node is recorded as WorkflowNodeFailed with the message "Pod was terminated in response to imminent node shutdown", which does not trigger our retry policy.

After #2881, it seems all preemptions should be treated as WorkflowNodeError. That would let us isolate retries to node-availability problems as opposed to application errors.
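
For reference, the retry policy we added looks roughly like this (a minimal sketch; the limit value is illustrative, not our exact config):

```yaml
# Sketch of the template-level retry policy described above: retry controller
# errors (WorkflowNodeError, e.g. "pod deleted"), but not application failures.
retryStrategy:
  retryPolicy: OnError
  limit: "3"   # illustrative
```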

Version

v3.3.8

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

This should be reproducible with any workflow running on a preemptible GKE node pool; a minimal sketch follows below.
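
A minimal sketch of such a workflow (the image, sleep duration, and node selector are assumptions; any long-running step scheduled on a preemptible/Spot node pool should do):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: preemption-repro-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        retryPolicy: OnError                       # expected to cover preemption
      nodeSelector:
        cloud.google.com/gke-preemptible: "true"   # assumes a preemptible node pool
      container:
        image: alpine:3.18
        command: [sh, -c]
        args: ["sleep 3600"]                       # long-running step so a preemption lands mid-run
```

When GKE preempts the node while the step is running, the node is marked Failed with the imminent-node-shutdown message rather than Errored, so the OnError policy above does not fire.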

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

time="2022-10-19T17:06:08.550Z" level=info msg="Processing workflow" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.557Z" level=info msg="Updated phase  -> Running" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.557Z" level=info msg="DAG node ${workflow} initialized Running" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.557Z" level=info msg="All of node ${workflow}.exit-handler-1 dependencies [] completed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.558Z" level=info msg="DAG node ${workflow}-4115780724 initialized Running" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.558Z" level=info msg="All of node ${workflow}.exit-handler-1.chaptering dependencies [] completed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.559Z" level=info msg="Retry node ${workflow}-3870252585 initialized Running" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.559Z" level=info msg="Pod node ${workflow}-1579747344 initialized Pending" namespace=default workflow=${workflow}
time="2022-10-19T17:06:13.613Z" level=info msg="Created pod: ${workflow}.exit-handler-1.chaptering(0) (${workflow}-1579747344)" namespace=default workflow=${workflow}
time="2022-10-19T17:06:13.613Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=${workflow}
time="2022-10-19T17:06:13.613Z" level=info msg=reconcileAgentPod namespace=default workflow=${workflow}
time="2022-10-19T17:06:13.647Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=19478544 workflow=${workflow}
time="2022-10-19T17:06:23.597Z" level=info msg="Processing workflow" namespace=default workflow=${workflow}
time="2022-10-19T17:06:23.597Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=${workflow}
time="2022-10-19T17:06:23.597Z" level=info msg="node changed" new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=${workflow}-1579747344 old.message= old.phase=Pending old.progress=0/1
time="2022-10-19T17:06:23.599Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=${workflow}
time="2022-10-19T17:06:23.599Z" level=info msg=reconcileAgentPod namespace=default workflow=${workflow}
time="2022-10-19T17:06:23.621Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=19478634 workflow=${workflow}
time="2022-10-19T17:06:33.623Z" level=info msg="Processing workflow" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.623Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=${workflow}
time="2022-10-19T17:06:33.623Z" level=info msg="node changed" new.message="Pod was terminated in response to imminent node shutdown." new.phase=Failed new.progress=0/1 nodeID=${workflow}-1579747344 old.message=PodInitializing old.phase=Pending old.progress=0/1
time="2022-10-19T17:06:33.625Z" level=info msg="node has maxDuration set, setting executionDeadline to: Mon Jan 01 00:00:00 +0000 (a long while ago)" namespace=default node=${workflow}.exit-handler-1.chaptering workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Node not set to be retried after status: Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-3870252585 phase Running -> Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-3870252585 message: Pod was terminated in response to imminent node shutdown." namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-3870252585 finished: 2022-10-19 17:06:33.625741142 +0000 UTC" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Outbound nodes of ${workflow}-4115780724 set to [${workflow}-1579747344]" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-4115780724 phase Running -> Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-4115780724 finished: 2022-10-19 17:06:33.625828677 +0000 UTC" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Checking daemoned children of ${workflow}-4115780724" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Outbound nodes of ${workflow} set to [${workflow}-1579747344]" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow} phase Running -> Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow} finished: 2022-10-19 17:06:33.62593395 +0000 UTC" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Checking daemoned children of ${workflow}" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg=reconcileAgentPod namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.626Z" level=info msg="Running OnExit handler: notifyhandler" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.626Z" level=info msg="Pod node ${workflow}-1999402710 initialized Pending" namespace=default workflow=${workflow}
time="2022-10-19T17:06:38.650Z" level=info msg="Created pod: ${workflow}.onExit (${workflow}-1999402710)" namespace=default workflow=${workflow}
time="2022-10-19T17:06:38.673Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=19478796 workflow=${workflow}
time="2022-10-19T17:06:38.679Z" level=info msg="cleaning up pod" action=labelPodCompleted key=default/${workflow}-1579747344/labelPodCompleted
time="2022-10-19T17:06:48.649Z" level=info msg="Processing workflow" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.649Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=${workflow}
time="2022-10-19T17:06:48.649Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.649Z" level=info msg="node changed" new.message= new.phase=Succeeded new.progress=0/1 nodeID=${workflow}-1999402710 old.message= old.phase=Pending old.progress=0/1
time="2022-10-19T17:06:48.650Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg=reconcileAgentPod namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg="Running OnExit handler: notifyhandler" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg="Updated phase Running -> Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg="Marking workflow completed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg="Checking daemoned children of " namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.655Z" level=info msg="cleaning up pod" action=deletePod key=default/${workflow}-1340600742-agent/deletePod
time="2022-10-19T17:06:48.664Z" level=info msg="Workflow update successful" namespace=default phase=Failed resourceVersion=19478961 workflow=${workflow}
time="2022-10-19T17:06:48.673Z" level=info msg="cleaning up pod" action=labelPodCompleted key=default/${workflow}-1999402710/labelPodCompleted

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

terrytangyuan (Member) commented

Have you tried setting TRANSIENT_ERROR_PATTERN? https://github.com/argoproj/argo-workflows/blob/master/docs/environment-variables.md
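
For example, something like this on the workflow-controller Deployment (a sketch only; the pattern below just matches the message from your logs):

```yaml
# Sketch: mark the GKE graceful-shutdown message as a transient error so it
# can be retried (e.g. together with retryPolicy: OnTransientError).
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            - name: TRANSIENT_ERROR_PATTERN
              value: "Pod was terminated in response to imminent node shutdown"
```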


pjo256 commented Oct 25, 2022

Good point, setting TRANSIENT_ERROR_PATTERN does solve my issue. It would be nice to retry on both OnTransientError (with this pattern) and OnError. I'd like to use conditional retries, but I'm not sure whether expression would have access to the imminent-node-shutdown message on lastRetry; only exitCode, status, and duration are documented in https://argoproj.github.io/argo-workflows/retries/#conditional-retries.
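
For example, with only the documented variables the closest I can get is something like this (a sketch; it still can't tell a preemption apart from any other failure because the node message isn't exposed):

```yaml
retryStrategy:
  limit: "3"
  # Only lastRetry.exitCode, lastRetry.status and lastRetry.duration are
  # documented; there is no lastRetry.message to match
  # "imminent node shutdown" against, so a preemption can't be singled out.
  expression: 'lastRetry.status == "Error"'
```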

sarabala1979 (Member) commented

@pjo256 Can you create an enhancement request for the above ask?

sarabala1979 added the solution/workaround label on Oct 31, 2022.