Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(containerSet): mark container deleted when pod deleted. Fixes: #12210 #12756

Merged
merged 9 commits into from
Mar 25, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 12 additions & 1 deletion workflow/controller/operator.go
Original file line number Diff line number Diff line change
Expand Up @@ -1223,7 +1223,18 @@ func (woc *wfOperationCtx) podReconciliation(ctx context.Context) (error, bool)
node.Daemoned = nil
woc.updated = true
}
woc.markNodePhase(node.Name, wfv1.NodeError, "pod deleted")
woc.markNodeError(node.Name, errors.New("", "pod deleted"))
// Set pod's child(container) error if pod deleted
for _, childNodeID := range node.Children {
childNode, err := woc.wf.Status.Nodes.Get(childNodeID)
if err != nil {
woc.log.Errorf("was unable to obtain node for %s", childNodeID)
continue
}
if childNode.Type == wfv1.NodeTypeContainer {
woc.markNodeError(childNode.Name, errors.New("", "container deleted"))
}
}
}
}
return nil, !taskResultIncomplete
Expand Down
89 changes: 89 additions & 0 deletions workflow/controller/operator_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -10803,3 +10803,92 @@ status:
woc.operate(ctx)
assert.Equal(t, wfv1.WorkflowSucceeded, woc.wf.Status.Phase)
}

var wfHasContainerSet = `
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
name: lovely-rhino
spec:
templates:
- name: init
dag:
tasks:
- name: A
template: run
arguments: {}
- name: run
containerSet:
containers:
- name: main
image: alpine:latest
command:
- /bin/sh
args:
- '-c'
- sleep 9000
resources: {}
- name: main2
image: alpine:latest
command:
- /bin/sh
args:
- '-c'
- sleep 9000
resources: {}
entrypoint: init
arguments: {}
ttlStrategy:
secondsAfterCompletion: 300
podGC:
strategy: OnPodCompletion`

// TestContainerSetDeleteContainerWhenPodDeleted test whether a workflow has ContainerSet error when pod deleted.
func TestContainerSetDeleteContainerWhenPodDeleted(t *testing.T) {
cancel, controller := newController()
defer cancel()
ctx := context.Background()
wfcset := controller.wfclientset.ArgoprojV1alpha1().Workflows("")
wf := wfv1.MustUnmarshalWorkflow(wfHasContainerSet)
wf, err := wfcset.Create(ctx, wf, metav1.CreateOptions{})
assert.Nil(t, err)
wf, err = wfcset.Get(ctx, wf.ObjectMeta.Name, metav1.GetOptions{})
assert.Nil(t, err)
woc := newWorkflowOperationCtx(wf, controller)
woc.operate(ctx)
pods, err := listPods(woc)
assert.Nil(t, err)
assert.Equal(t, 1, len(pods.Items))

// mark pod Running
makePodsPhase(ctx, woc, apiv1.PodRunning)
woc = newWorkflowOperationCtx(woc.wf, controller)
woc.operate(ctx)
for _, node := range woc.wf.Status.Nodes {
if node.Type == wfv1.NodeTypePod {
assert.Equal(t, wfv1.NodeRunning, node.Phase)
}
}

// TODO: Refactor to use local-scoped env vars in test to avoid long wait. See https://github.com/argoproj/argo-workflows/pull/12756#discussion_r1530245007
// delete pod
time.Sleep(10 * time.Second)
Copy link
Member

@tczhao tczhao Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be mindful of how long a test takes, with so many tests in place, the time to run tests could accumulate very quickly

use
_ = os.Setenv("RECENTLY_STARTED_POD_DURATION", "0")
and
set graceperiod to 0 on metav1.DeleteOptions{ in deletePods function

so that we can get rid of time.sleep here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I add it. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use
_ = os.Setenv("RECENTLY_STARTED_POD_DURATION", "0")

This sets it for the whole process though, no? not just this single test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sigle and i remove it after test.

Copy link
Member

@agilgur5 agilgur5 Mar 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not quite sufficient. Env vars are set for an entire process -- meaning it affects any tests that are ran in parallel and use the same env var. But I wasn't sure if Go had any special handling for this given that I've seen this in other tests. According to Go's own docs (for the t.Setenv function), it does not have any special handling and this would indeed affect any parallel tests.

While this is done in other tests in this codebase (50 occurrences) which probably need a larger refactoring, this env var is a bit more global in its effects.

This is one of those why tests shouldn't rely on globals.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed it, and i test sleep 5s is not enough. need 10s

Copy link
Member

@agilgur5 agilgur5 Mar 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, not needing to sleep in the tests is good to have though... @tczhao what do you think?

perhaps we leave one in and then refactor in a separate PR to avoid globals? (note that refactoring the globals might be substantially easier said than done)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed it, and i test sleep 5s is not enough. need 10s

The 10s is due to RECENTLY_STARTED_POD_DURATION defaults to 10s,
podReconciliation will not update the wf status until RECENTLY_STARTED_POD_DURATION elapsed
https://github.com/argoproj/argo-workflows/blob/v3.5.5/workflow/controller/operator.go#L1209


perhaps we leave one in and then refactor in a separate PR to avoid globals?

Sounds good to me

Copy link
Member

@agilgur5 agilgur5 Mar 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make sure to add a // TODO: comment here to modify this in the future. Can reference this thread in the comment: #12756 (comment).
This would be a really easy follow-up to miss (and may not happen soon due to the possible complexity of the refactor), so being explicit is good (and TODOs can be searched for in the codebase easily)

deletePods(ctx, woc)
pods, err = listPods(woc)
assert.Nil(t, err)
assert.Equal(t, 0, len(pods.Items))

// reconcile
woc = newWorkflowOperationCtx(woc.wf, controller)
woc.operate(ctx)
assert.Equal(t, wfv1.WorkflowError, woc.wf.Status.Phase)
for _, node := range woc.wf.Status.Nodes {
assert.Equal(t, wfv1.NodeError, node.Phase)
if node.Type == wfv1.NodeTypePod {
assert.Equal(t, "pod deleted", node.Message)
}
if node.Type == wfv1.NodeTypeContainer {
assert.Equal(t, "container deleted", node.Message)
}
}
}
Loading