
v2.12: pod deleted + re-apply error = errored workflow #4798

Closed
alexec opened this issue Dec 24, 2020 · 1 comment · Fixed by #4808

alexec (Contributor) commented Dec 24, 2020

Summary

In v2.12, we can get a "pod deleted" error under high load. I believe this is caused by several interacting factors:

  1. The workflow completes successfully.
  2. A pod is then deleted during clean-up, so the workflow is re-queued.
  3. On the next reconciliation, the informer returns the same (and now out-of-date) workflow as the last reconciliation.
  4. The pod has been deleted, so the reconciliation marks the pod as "Error: pod deleted", and the workflow is marked as errored.
  5. The update fails due to the resource version check.
  6. The re-apply overwrites the previously successful workflow with an errored workflow (a minimal sketch of steps 5 and 6 follows this list).
  7. If the pod GC strategy is on-success, then the TTL controller will error.
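To make steps 5 and 6 concrete, here is a minimal, self-contained sketch of the failure mode. The types and the `update` function are illustrative stand-ins for the Workflow object and the API server's resourceVersion precondition, not the controller's actual code:

```go
package main

import "fmt"

// Illustrative stand-ins for the Workflow object and the API server's
// optimistic-concurrency check; this is not the controller's actual code.
type obj struct {
	ResourceVersion int
	Phase           string
}

// update mimics step 5: the API server rejects a write whose resourceVersion
// no longer matches the stored object.
func update(server *obj, desired obj) error {
	if desired.ResourceVersion != server.ResourceVersion {
		return fmt.Errorf("conflict: resourceVersion %d is stale, server has %d",
			desired.ResourceVersion, server.ResourceVersion)
	}
	desired.ResourceVersion = server.ResourceVersion + 1
	*server = desired
	return nil
}

func main() {
	// Step 1: the workflow has already completed successfully on the server.
	server := obj{ResourceVersion: 2, Phase: "Succeeded"}

	// Steps 3-4: the controller reconciles from a stale informer copy and,
	// because the pod is gone, marks the workflow as errored.
	stale := obj{ResourceVersion: 1, Phase: "Error"}

	if err := update(&server, stale); err != nil {
		fmt.Println("step 5:", err)
		// Step 6: a re-apply that does not check what it is overwriting simply
		// pushes the stale status onto the latest copy.
		server.Phase = stale.Phase
	}
	fmt.Println("server phase:", server.Phase) // prints "Error", not "Succeeded"
}
```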

Causes:

  • reapplyUpdate will happily overwrite a completed workflow or node.
  • v2.12 added indexers. I think that when any of these returns an error, the workflow update is lost (a sketch of an error-proof index function follows this list).
  • I think that DEFAULT_REQUEUE_TIME should be longer, up to 10s.
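As a hedged illustration of what "indexers that can never return errors" could look like (one possible approach, not the change that went into #4808), a wrapper around client-go's cache.IndexFunc can swallow indexing errors so that a bad object simply gets no index entries:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
)

// neverError wraps a client-go index function so that it never propagates an
// error: an object that fails to index simply gets no index entries.
func neverError(fn cache.IndexFunc) cache.IndexFunc {
	return func(obj interface{}) ([]string, error) {
		values, err := fn(obj)
		if err != nil {
			return nil, nil // swallow the error instead of failing the store update
		}
		return values, nil
	}
}

func main() {
	// A deliberately flaky index function for demonstration.
	flaky := func(obj interface{}) ([]string, error) {
		s, ok := obj.(string)
		if !ok {
			return nil, fmt.Errorf("expected string, got %T", obj)
		}
		return []string{s}, nil
	}

	safe := neverError(flaky)
	fmt.Println(safe("workflow-1")) // [workflow-1] <nil>
	fmt.Println(safe(42))           // [] <nil>, error swallowed
}
```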

Solution

  • Modify reapplyUpdate to check whether it is overwriting a successful workflow or any successful nodes, and error out if so. This will prevent any future cases of succeeded workflows being marked as errored (see the sketch after this list).
  • Modify the indexers so that they can never return errors. This will prevent the conflict errors.
  • I don't think that the grace period for recently created pods is needed after these changes. Mark it with a TODO to remove.
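A minimal sketch of the proposed reapplyUpdate guard, using illustrative stand-in types rather than the actual Argo Workflow API types (the real change is in #4808):

```go
package main

import "fmt"

// Illustrative stand-ins for the workflow and node status; the real guard
// would operate on the Argo Workflow API types.
type node struct {
	ID    string
	Phase string
}

type workflow struct {
	Name  string
	Phase string
	Nodes map[string]node
}

// checkReapply errors out if the re-applied update would overwrite a
// succeeded workflow or any succeeded node with a different phase.
func checkReapply(current, reapplied workflow) error {
	if current.Phase == "Succeeded" && reapplied.Phase != "Succeeded" {
		return fmt.Errorf("refusing to overwrite succeeded workflow %q with phase %q",
			current.Name, reapplied.Phase)
	}
	for id, cur := range current.Nodes {
		if cur.Phase != "Succeeded" {
			continue
		}
		if re, ok := reapplied.Nodes[id]; ok && re.Phase != "Succeeded" {
			return fmt.Errorf("refusing to overwrite succeeded node %q with phase %q",
				id, re.Phase)
		}
	}
	return nil
}

func main() {
	current := workflow{
		Name:  "my-wf",
		Phase: "Succeeded",
		Nodes: map[string]node{"my-wf": {ID: "my-wf", Phase: "Succeeded"}},
	}
	// The stale reconciliation wants to mark everything as errored.
	reapplied := workflow{
		Name:  "my-wf",
		Phase: "Error",
		Nodes: map[string]node{"my-wf": {ID: "my-wf", Phase: "Error"}},
	}
	fmt.Println(checkReapply(current, reapplied)) // refuses the downgrade
}
```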

Relates to #4795, #4634, #4794

alexec added the type/bug and type/regression labels Dec 24, 2020
alexec added this to the v2.12 milestone Dec 24, 2020
alexec self-assigned this Dec 24, 2020
alexec added a commit to alexec/argo-workflows that referenced this issue Dec 29, 2020
alexec added a commit that referenced this issue Jan 4, 2021
simster7 mentioned this issue Jan 4, 2021
simster7 pushed a commit that referenced this issue Jan 4, 2021
saranyaeu2987 pushed a commit to saranyaeu2987/argo-1 that referenced this issue Jan 5, 2021
simster7 (Member) commented Jan 5, 2021

Fix for this is out on https://github.com/argoproj/argo/releases/tag/v2.12.3
