GKE preemption inconsistently treated as WorkflowNodeError and WorkflowNodeFailed #9880

Closed
1 of 3 tasks
pjo256 opened this issue Oct 21, 2022 · 3 comments
Labels: solution/workaround, type/bug

Comments

pjo256 commented Oct 21, 2022

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

For some context, we run a large number of jobs on preemptible node pools in GKE. We typically don't want to retry application failures, but we always want to retry preemptions. We added a retry policy for OnError, and successfully triggered this policy when manually preempting nodes (e.g. deleting a node in the node pool).

When looking at Argo workflows in the cluster, a manual preemption was recorded as a WorkflowNodeError with a "pod deleted" error message and correctly triggered our retry policy. During an actual preemption, however, the node is recorded as WorkflowNodeFailed with the message "Pod was terminated in response to imminent node shutdown", which does not trigger our retry policy.

After #2881, it seems all preemptions should be treated as WorkflowNodeError. That would let us isolate retries to node-availability problems as opposed to application errors.
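
For reference, the retry policy we added looks roughly like this (a minimal sketch; the limit value is illustrative, not our exact config):

```yaml
# Sketch of the template-level retry policy described above: retry controller
# errors (WorkflowNodeError, e.g. "pod deleted"), but not application failures.
retryStrategy:
  retryPolicy: OnError
  limit: "3"   # illustrative
```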

Version

v3.3.8

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

This should be reproducible with any workflow running on a preemptible GKE node pool; a minimal sketch follows below.
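
A minimal sketch of such a workflow (the image, sleep duration, and node selector are assumptions; any long-running step scheduled on a preemptible/Spot node pool should do):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: preemption-repro-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        retryPolicy: OnError                       # expected to cover preemption
      nodeSelector:
        cloud.google.com/gke-preemptible: "true"   # assumes a preemptible node pool
      container:
        image: alpine:3.18
        command: [sh, -c]
        args: ["sleep 3600"]                       # long-running step so a preemption lands mid-run
```

When GKE preempts the node while the step is running, the node is marked Failed with the imminent-node-shutdown message rather than Errored, so the OnError policy above does not fire.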

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

time="2022-10-19T17:06:08.550Z" level=info msg="Processing workflow" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.557Z" level=info msg="Updated phase  -> Running" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.557Z" level=info msg="DAG node ${workflow} initialized Running" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.557Z" level=info msg="All of node ${workflow}.exit-handler-1 dependencies [] completed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.558Z" level=info msg="DAG node ${workflow}-4115780724 initialized Running" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.558Z" level=info msg="All of node ${workflow}.exit-handler-1.chaptering dependencies [] completed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.559Z" level=info msg="Retry node ${workflow}-3870252585 initialized Running" namespace=default workflow=${workflow}
time="2022-10-19T17:06:08.559Z" level=info msg="Pod node ${workflow}-1579747344 initialized Pending" namespace=default workflow=${workflow}
time="2022-10-19T17:06:13.613Z" level=info msg="Created pod: ${workflow}.exit-handler-1.chaptering(0) (${workflow}-1579747344)" namespace=default workflow=${workflow}
time="2022-10-19T17:06:13.613Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=${workflow}
time="2022-10-19T17:06:13.613Z" level=info msg=reconcileAgentPod namespace=default workflow=${workflow}
time="2022-10-19T17:06:13.647Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=19478544 workflow=${workflow}
time="2022-10-19T17:06:23.597Z" level=info msg="Processing workflow" namespace=default workflow=${workflow}
time="2022-10-19T17:06:23.597Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=${workflow}
time="2022-10-19T17:06:23.597Z" level=info msg="node changed" new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=${workflow}-1579747344 old.message= old.phase=Pending old.progress=0/1
time="2022-10-19T17:06:23.599Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=${workflow}
time="2022-10-19T17:06:23.599Z" level=info msg=reconcileAgentPod namespace=default workflow=${workflow}
time="2022-10-19T17:06:23.621Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=19478634 workflow=${workflow}
time="2022-10-19T17:06:33.623Z" level=info msg="Processing workflow" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.623Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=${workflow}
time="2022-10-19T17:06:33.623Z" level=info msg="node changed" new.message="Pod was terminated in response to imminent node shutdown." new.phase=Failed new.progress=0/1 nodeID=${workflow}-1579747344 old.message=PodInitializing old.phase=Pending old.progress=0/1
time="2022-10-19T17:06:33.625Z" level=info msg="node has maxDuration set, setting executionDeadline to: Mon Jan 01 00:00:00 +0000 (a long while ago)" namespace=default node=${workflow}.exit-handler-1.chaptering workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Node not set to be retried after status: Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-3870252585 phase Running -> Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-3870252585 message: Pod was terminated in response to imminent node shutdown." namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-3870252585 finished: 2022-10-19 17:06:33.625741142 +0000 UTC" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Outbound nodes of ${workflow}-4115780724 set to [${workflow}-1579747344]" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-4115780724 phase Running -> Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow}-4115780724 finished: 2022-10-19 17:06:33.625828677 +0000 UTC" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Checking daemoned children of ${workflow}-4115780724" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Outbound nodes of ${workflow} set to [${workflow}-1579747344]" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow} phase Running -> Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="node ${workflow} finished: 2022-10-19 17:06:33.62593395 +0000 UTC" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="Checking daemoned children of ${workflow}" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.625Z" level=info msg=reconcileAgentPod namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.626Z" level=info msg="Running OnExit handler: notifyhandler" namespace=default workflow=${workflow}
time="2022-10-19T17:06:33.626Z" level=info msg="Pod node ${workflow}-1999402710 initialized Pending" namespace=default workflow=${workflow}
time="2022-10-19T17:06:38.650Z" level=info msg="Created pod: ${workflow}.onExit (${workflow}-1999402710)" namespace=default workflow=${workflow}
time="2022-10-19T17:06:38.673Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=19478796 workflow=${workflow}
time="2022-10-19T17:06:38.679Z" level=info msg="cleaning up pod" action=labelPodCompleted key=default/${workflow}-1579747344/labelPodCompleted
time="2022-10-19T17:06:48.649Z" level=info msg="Processing workflow" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.649Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=${workflow}
time="2022-10-19T17:06:48.649Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.649Z" level=info msg="node changed" new.message= new.phase=Succeeded new.progress=0/1 nodeID=${workflow}-1999402710 old.message= old.phase=Pending old.progress=0/1
time="2022-10-19T17:06:48.650Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg=reconcileAgentPod namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg="Running OnExit handler: notifyhandler" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg="Updated phase Running -> Failed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg="Marking workflow completed" namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.650Z" level=info msg="Checking daemoned children of " namespace=default workflow=${workflow}
time="2022-10-19T17:06:48.655Z" level=info msg="cleaning up pod" action=deletePod key=default/${workflow}-1340600742-agent/deletePod
time="2022-10-19T17:06:48.664Z" level=info msg="Workflow update successful" namespace=default phase=Failed resourceVersion=19478961 workflow=${workflow}
time="2022-10-19T17:06:48.673Z" level=info msg="cleaning up pod" action=labelPodCompleted key=default/${workflow}-1999402710/labelPodCompleted

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

terrytangyuan (Member) commented

Have you tried setting TRANSIENT_ERROR_PATTERN? https://github.com/argoproj/argo-workflows/blob/master/docs/environment-variables.md
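
For example, something like this on the workflow-controller Deployment (a sketch only; the pattern below just matches the message from your logs):

```yaml
# Sketch: mark the GKE graceful-shutdown message as a transient error so it
# can be retried (e.g. together with retryPolicy: OnTransientError).
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            - name: TRANSIENT_ERROR_PATTERN
              value: "Pod was terminated in response to imminent node shutdown"
```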


pjo256 commented Oct 25, 2022

Good point, setting TRANSIENT_ERROR_PATTERN does solve my issue. It would be nice to retry on both OnTransientError (with this pattern) and OnError. I'd like to use conditional retries, but I'm not sure whether expression would have access to the imminent-node-shutdown message on lastRetry; only exitCode, status, and duration are documented in https://argoproj.github.io/argo-workflows/retries/#conditional-retries.
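
For example, with only the documented variables the closest I can get is something like this (a sketch; it still can't tell a preemption apart from any other failure because the node message isn't exposed):

```yaml
retryStrategy:
  limit: "3"
  # Only lastRetry.exitCode, lastRetry.status and lastRetry.duration are
  # documented; there is no lastRetry.message to match
  # "imminent node shutdown" against, so a preemption can't be singled out.
  expression: 'lastRetry.status == "Error"'
```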

sarabala1979 (Member) commented

@pjo256 Can you create an enhancement request for the above ask?

sarabala1979 added the solution/workaround label on Oct 31, 2022.