Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Nodes with pods deleted out-of-band should be Errored, not Failed #2855

Merged
merged 7 commits into from May 7, 2020

Conversation

simster7
Copy link
Member

@simster7 simster7 commented Apr 27, 2020

This PR contains 3 fixes:

@simster7
Copy link
Member Author

@wktmeow Would it be possible for you to test this? You can build your own images from this branch. Take a look at https://github.com/argoproj/argo/blob/master/docs/running-locally.md

@matveybrant
Copy link
Contributor

@wktmeow Would it be possible for you to test this? You can build your own images from this branch. Take a look at https://github.com/argoproj/argo/blob/master/docs/running-locally.md

Absolutely, I'll try to get to it today. Thank you for the quick fix!

@simster7
Copy link
Member Author

@wktmeow Gentle bump :)

@matveybrant
Copy link
Contributor

It works! Thank you for the quick fix, sorry for the delay in testing.

@simster7 simster7 changed the title fix: Delete running pods with API before restoring to exec control fix: Delete running pods with API before resorting to exec control May 1, 2020
@simster7 simster7 requested a review from jessesuen May 1, 2020 19:50
@simster7 simster7 marked this pull request as ready for review May 1, 2020 19:50
Comment on lines +77 to +80
_, onExitPod := pod.Labels[common.LabelKeyOnExit]
if woc.wf.Spec.Shutdown == wfv1.ShutdownStrategyTerminate || (woc.wf.Spec.Shutdown == wfv1.ShutdownStrategyStop && !onExitPod) {
newDeadline = &time.Time{}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This particular code fixes #2914

Copy link
Member

@sarabala1979 sarabala1979 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@jessesuen jessesuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, we cannot delete Running pods. We rely on the executor to do send the SIGTERM to the main process in order to collect artifacts of the main container.

This is a conditional approval to remove the deletion of running pods, but an approval of other changes.

newPhase = wfv1.NodeFailed
message = "pod termination"
newPhase = wfv1.NodeError
message = "pod deleted during operation"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this needs to be reverted as well since you will be reverting the logic of deletion of running pods..

Copy link
Member Author

@simster7 simster7 May 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is unrelated. This makes sure that Nodes that have pods deleted out-of-band get marked as Error instead of Failed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's simply a node label change

@simster7 simster7 changed the title fix: Delete running pods with API before resorting to exec control fix: Nodes with pods deleted out-of-band should be Errored, not Failed May 7, 2020
@sonarcloud
Copy link

sonarcloud bot commented May 7, 2020

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities (and Security Hotspot 0 Security Hotspots to review)
Code Smell A 0 Code Smells

50.0% 50.0% Coverage
0.0% 0.0% Duplication

@simster7 simster7 merged commit 8a511e1 into argoproj:master May 7, 2020
@sarabala1979
Copy link
Member

#3097

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

argo stop can result in errored workflow Preempted nodes are treated as Fails but should be Errors
4 participants