Pre-requisites
:latest
What happened/what you expected to happen?
For some context, we run a large number of jobs with preemptible node pools in GKE. We typically don't want to retry application failures, but we always want to retry preemptions. We added a retry policy of `OnError` and successfully triggered it by manually preempting nodes (e.g. deleting a node in the node pool).

When looking at Argo workflows in the cluster, manual preemption was recorded as a `WorkflowNodeError` with a `pod deleted` error message and correctly triggered our retry policy. During an actual preemption, we instead see `WorkflowNodeFailed` with `Pod was terminated due to imminent node shutdown`, which does not trigger our retry policy.

After #2881, it seems all preemptions should be treated as `WorkflowNodeError`. This would help isolate retries to node-availability problems as opposed to application errors.
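For concreteness, the retry configuration in question looks roughly like this (a sketch; the limit value is illustrative):

```yaml
retryStrategy:
  limit: 3               # illustrative; any limit works
  retryPolicy: OnError   # retry nodes that end in Error (e.g. pod deleted), but not Failed
```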
Version
v3.3.8
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Any workflow running in GKE on a preemptible node pool should reproduce this.
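For illustration, a minimal sketch of such a workflow; the image, the sleep duration, and the preemptible-pool `nodeSelector` label are assumptions for a typical GKE setup:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: preemption-retry-test-
spec:
  entrypoint: sleep
  templates:
    - name: sleep
      retryStrategy:
        limit: 2
        retryPolicy: OnError
      nodeSelector:
        cloud.google.com/gke-preemptible: "true"  # assumption: schedule onto the preemptible pool
      container:
        image: alpine:3.16
        command: [sh, -c]
        args: ["sleep 3600"]  # long enough for the node to be preempted mid-run
```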
Logs from the workflow controller
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
Logs from your workflow's wait container
kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
Good point, setting `TRANSIENT_ERROR_PATTERN` does solve my issue. It would be nice to retry on both `OnTransientError` with this pattern and `OnError`. I'd like to use conditional retries, but I'm not sure whether `expression` would have access to the imminent node shutdown message on `lastRetry`; I only see `exitCode`, `status`, and `duration` in https://argoproj.github.io/argo-workflows/retries/#conditional-retries.
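For reference, a sketch of that workaround. The pattern string below is an assumption chosen to match the GKE shutdown message quoted above, not a documented default:

```yaml
# workflow-controller Deployment: treat the GKE shutdown message as transient
# (env entry on the controller container; pattern value is an assumption)
env:
  - name: TRANSIENT_ERROR_PATTERN
    value: "imminent node shutdown"
---
# workflow template: retry errors the controller classified as transient
retryStrategy:
  limit: 3
  retryPolicy: OnTransientError
```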