
v2.12: pod deleted + re-apply error = errored workflow #4798

Closed
alexec opened this issue Dec 24, 2020 · 1 comment · Fixed by #4808

alexec (Contributor) commented Dec 24, 2020

Summary

In v2.12, we can get a "pod deleted" error under high load. I believe this is caused by several interacting factors:

  1. The workflow completes successfully.
  2. A pod is then deleted during clean-up, so the workflow is re-queued.
  3. On the next reconciliation, the informer returns the same (and now out-of-date) workflow as the last reconciliation.
  4. The pod has been deleted, so the reconciliation marks the pod as "Error: pod deleted", and the workflow is marked as errored.
  5. The update fails due to the resource version check.
  6. The re-apply overwrites the previously successful workflow with an errored workflow (a minimal sketch of steps 5 and 6 follows this list).
  7. If the pod GC strategy is on-success, then the TTL controller will error.
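To make steps 5 and 6 concrete, here is a minimal, self-contained sketch of the failure mode. The types and the `update` function are illustrative stand-ins for the Workflow object and the API server's resourceVersion precondition, not the controller's actual code:

```go
package main

import "fmt"

// Illustrative stand-ins for the Workflow object and the API server's
// optimistic-concurrency check; this is not the controller's actual code.
type obj struct {
	ResourceVersion int
	Phase           string
}

// update mimics step 5: the API server rejects a write whose resourceVersion
// no longer matches the stored object.
func update(server *obj, desired obj) error {
	if desired.ResourceVersion != server.ResourceVersion {
		return fmt.Errorf("conflict: resourceVersion %d is stale, server has %d",
			desired.ResourceVersion, server.ResourceVersion)
	}
	desired.ResourceVersion = server.ResourceVersion + 1
	*server = desired
	return nil
}

func main() {
	// Step 1: the workflow has already completed successfully on the server.
	server := obj{ResourceVersion: 2, Phase: "Succeeded"}

	// Steps 3-4: the controller reconciles from a stale informer copy and,
	// because the pod is gone, marks the workflow as errored.
	stale := obj{ResourceVersion: 1, Phase: "Error"}

	if err := update(&server, stale); err != nil {
		fmt.Println("step 5:", err)
		// Step 6: a re-apply that does not check what it is overwriting simply
		// pushes the stale status onto the latest copy.
		server.Phase = stale.Phase
	}
	fmt.Println("server phase:", server.Phase) // prints "Error", not "Succeeded"
}
```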

Causes:

  • reapplyUpdate will happily overwrite a completed workflow or node.
  • v2.12 added indexers. I think that when any of these returns an error, the workflow update is lost (a sketch of an error-proof index function follows this list).
  • I think that DEFAULT_REQUEUE_TIME should be longer, up to 10s.
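As a hedged illustration of what "indexers that can never return errors" could look like (one possible approach, not the change that went into #4808), a wrapper around client-go's cache.IndexFunc can swallow indexing errors so that a bad object simply gets no index entries:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
)

// neverError wraps a client-go index function so that it never propagates an
// error: an object that fails to index simply gets no index entries.
func neverError(fn cache.IndexFunc) cache.IndexFunc {
	return func(obj interface{}) ([]string, error) {
		values, err := fn(obj)
		if err != nil {
			return nil, nil // swallow the error instead of failing the store update
		}
		return values, nil
	}
}

func main() {
	// A deliberately flaky index function for demonstration.
	flaky := func(obj interface{}) ([]string, error) {
		s, ok := obj.(string)
		if !ok {
			return nil, fmt.Errorf("expected string, got %T", obj)
		}
		return []string{s}, nil
	}

	safe := neverError(flaky)
	fmt.Println(safe("workflow-1")) // [workflow-1] <nil>
	fmt.Println(safe(42))           // [] <nil>, error swallowed
}
```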

Solution

  • Modify reapplyUpdate to check whether it is overwriting a successful workflow or any successful nodes, and error out if so. This will prevent any future cases of succeeded workflows being marked as errored (see the sketch after this list).
  • Modify the indexers so that they can never return errors. This will prevent the conflict errors.
  • I don't think that the grace period for recently created pods is needed after these changes. Mark it with a TODO to remove.
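A minimal sketch of the proposed reapplyUpdate guard, using illustrative stand-in types rather than the actual Argo Workflow API types (the real change is in #4808):

```go
package main

import "fmt"

// Illustrative stand-ins for the workflow and node status; the real guard
// would operate on the Argo Workflow API types.
type node struct {
	ID    string
	Phase string
}

type workflow struct {
	Name  string
	Phase string
	Nodes map[string]node
}

// checkReapply errors out if the re-applied update would overwrite a
// succeeded workflow or any succeeded node with a different phase.
func checkReapply(current, reapplied workflow) error {
	if current.Phase == "Succeeded" && reapplied.Phase != "Succeeded" {
		return fmt.Errorf("refusing to overwrite succeeded workflow %q with phase %q",
			current.Name, reapplied.Phase)
	}
	for id, cur := range current.Nodes {
		if cur.Phase != "Succeeded" {
			continue
		}
		if re, ok := reapplied.Nodes[id]; ok && re.Phase != "Succeeded" {
			return fmt.Errorf("refusing to overwrite succeeded node %q with phase %q",
				id, re.Phase)
		}
	}
	return nil
}

func main() {
	current := workflow{
		Name:  "my-wf",
		Phase: "Succeeded",
		Nodes: map[string]node{"my-wf": {ID: "my-wf", Phase: "Succeeded"}},
	}
	// The stale reconciliation wants to mark everything as errored.
	reapplied := workflow{
		Name:  "my-wf",
		Phase: "Error",
		Nodes: map[string]node{"my-wf": {ID: "my-wf", Phase: "Error"}},
	}
	fmt.Println(checkReapply(current, reapplied)) // refuses the downgrade
}
```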

Relates to #4795, #4634, #4794

alexec added the type/bug and type/regression labels Dec 24, 2020
alexec added this to the v2.12 milestone Dec 24, 2020
alexec self-assigned this Dec 24, 2020
alexec added a commit to alexec/argo-workflows that referenced this issue Dec 29, 2020
alexec added a commit that referenced this issue Jan 4, 2021
simster7 mentioned this issue Jan 4, 2021
simster7 pushed a commit that referenced this issue Jan 4, 2021
saranyaeu2987 pushed a commit to saranyaeu2987/argo-1 that referenced this issue Jan 5, 2021
simster7 (Member) commented Jan 5, 2021

Fix for this is out on https://github.com/argoproj/argo/releases/tag/v2.12.3
