refactor: change the logic of delete pod during retry. Fixes: #12538 #12734

Open · wants to merge 26 commits into main from fix/RefactorRetryDeleteLogic

Conversation

@shuangkun shuangkun commented Mar 4, 2024

Fixes: #12538
Refactor the logic of deleting pods during retry to speed up retrying a workflow.

Motivation

Speed up retrying a workflow, so that a large archived workflow (more than 8000 pods) can be successfully retried within 1 minute.

Modifications

As Anton and Joibel suggested, I moved the delete logic to the controller.
Added a retry field to the workflow spec (a rough sketch of the idea is shown below).
Adding a spec field is just one approach; other approaches are also possible, for example passing a label to trigger the retry. If you think a label is better, I will change it.
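
To make this concrete, here is a minimal sketch of the spec-based approach, assuming hypothetical field and type names modeled on how suspend is carried in the spec; the PR's actual definitions may differ:

// Illustrative only: a hypothetical spec field for manual retry.
type RetrySpec struct {
	RestartSuccessful bool     `json:"restartSuccessful,omitempty"`
	NodeFieldSelector string   `json:"nodeFieldSelector,omitempty"`
	Parameters        []string `json:"parameters,omitempty"`
}

// WorkflowSpec would gain an optional pointer, so that an unset field means
// no manual retry is requested:
//	Retry *RetrySpec `json:"retry,omitempty"`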

Verification

e2e and unit tests

References:
#12624
#12419

@shuangkun shuangkun marked this pull request as draft March 4, 2024 09:10
@shuangkun shuangkun force-pushed the fix/RefactorRetryDeleteLogic branch from 65e080f to 57c1676 Compare March 4, 2024 10:37
@shuangkun shuangkun marked this pull request as ready for review March 4, 2024 11:06
@shuangkun shuangkun added the area/retry-manual label Mar 5, 2024
@shuangkun shuangkun changed the title feat: refactor the logic of delete pod during retry feat: refactor the logic of delete pod during retry. Fixes: ##12538 Mar 9, 2024
@shuangkun shuangkun changed the title feat: refactor the logic of delete pod during retry. Fixes: ##12538 feat: refactor the logic of delete pod during retry. Fixes: #12538 Mar 9, 2024
@tczhao tczhao self-assigned this Mar 20, 2024
@@ -827,6 +827,29 @@ func resetConnectedParentGroupNodes(oldWF *wfv1.Workflow, newWF *wfv1.Workflow,
return newWF, resetParentGroupNodes
}

// RetryWorkflow retries a workflow by setting spec.retry; the controller will then perform the retry.
func RetryWorkflow(ctx context.Context, wf *wfv1.Workflow, restartSuccessful bool, nodeFieldSelector string, parameters []string) (*wfv1.Workflow, error) {
Member

We could use a better name here; RetryWorkflow can be confusing.
The word on top of my head is markWorkflowForRetry, but there may be a better term.

Member Author

Changed, Thanks!

Parameters: parameters,
}

delete(wf.Labels, common.LabelKeyCompleted)
Member

Deleting the label looks like part of the formulation of the workflow; I think we can let formulateRetryWorkflow handle this instead, so that the server only:

  • checks the client request for errors
  • and updates the workflow for retry

Member Author
@shuangkun shuangkun Mar 30, 2024

If we delete it from here, reconciliation may not be triggered. Do you think changing the reconciliationNeeded function would be better than deleting the label here? If so, I will modify reconciliationNeeded (current implementation below, with a possible modification sketched after it).

func reconciliationNeeded(wf metav1.Object) bool {
	return wf.GetLabels()[common.LabelKeyCompleted] != "true" || slices.Contains(wf.GetFinalizers(), common.FinalizerArtifactGC)
}
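
For reference, one possible modification along those lines might look roughly like this; the retry label key is hypothetical and only illustrates the idea:

func reconciliationNeeded(wf metav1.Object) bool {
	// Also reconcile completed workflows that carry a pending retry request,
	// so the server no longer has to remove the completed label itself.
	return wf.GetLabels()[common.LabelKeyCompleted] != "true" ||
		slices.Contains(wf.GetFinalizers(), common.FinalizerArtifactGC) ||
		wf.GetLabels()["workflows.argoproj.io/retry-requested"] == "true" // hypothetical label key
}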

Comment on lines 832 to 840
switch wf.Status.Phase {
case wfv1.WorkflowFailed, wfv1.WorkflowError:
case wfv1.WorkflowSucceeded:
if !(restartSuccessful && len(nodeFieldSelector) > 0) {
return nil, errors.Errorf(errors.CodeBadRequest, "To retry a succeeded workflow, set the options restartSuccessful and nodeFieldSelector")
}
default:
return nil, errors.Errorf(errors.CodeBadRequest, "Cannot retry a workflow in phase %s", wf.Status.Phase)
}
Member

I think we can remove this from formulateRetryWorkflow since it is already handled

Member Author

Yes, removed.

@@ -286,19 +285,11 @@ func (w *archivedWorkflowServer) RetryArchivedWorkflow(ctx context.Context, req
_, err = wfClient.ArgoprojV1alpha1().Workflows(req.Namespace).Get(ctx, wf.Name, metav1.GetOptions{})
if apierr.IsNotFound(err) {

wf, podsToDelete, err := util.FormulateRetryWorkflow(ctx, wf, req.RestartSuccessful, req.NodeFieldSelector, req.Parameters)
wf, _, err := util.FormulateRetryWorkflow(ctx, wf, req.RestartSuccessful, req.NodeFieldSelector, req.Parameters)
Member

This should use the new util function, right? Just like the workflow_server, so that the server only "labels" the workflow for retry, and the actual formulation and processing is done by the controller.
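
A minimal sketch of that suggestion, assuming a hypothetical util.MarkWorkflowForRetry helper that only updates the workflow object and leaves formulation and pod deletion to the controller:

// Hypothetical: mark the archived workflow for retry; the controller
// then formulates the retry and deletes the pods.
wf, err := util.MarkWorkflowForRetry(ctx, wf, req.RestartSuccessful, req.NodeFieldSelector, req.Parameters)
if err != nil {
	return nil, err
}
_, err = wfClient.ArgoprojV1alpha1().Workflows(req.Namespace).Create(ctx, wf, metav1.CreateOptions{})
if err != nil {
	return nil, err
}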

return false
}
if woc.IsRetried() {
if woc.controller.podCleanupQueue.Len() == 0 {
Member
@tczhao tczhao Mar 25, 2024

What is the purpose of "if woc.controller.podCleanupQueue.Len() == 0" here?
Could the server set retry to true and then formulateRetryWorkflow set it back to false, so that we only need to check whether retry exists and is true?

Member Author

I just want to confirm that all the pods have been cleaned up. If an "already exists" error is not reported when creating the new pods at this point, this check can be removed.

Member

Do you mean something like: make sure all pods are deleted before processing the retried workflow?

Member Author

Yes, this is what I want.

Member

I think maybe we should check the node statuses instead.
Other workflows would also queue items onto the podCleanupQueue, which could cause issues (one possible check is sketched below).
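
For illustration, a check based on the workflow's own pods rather than the global queue length could look roughly like this; the field and helper names are assumptions about the controller's internals, not the PR's code:

// Hypothetical helper: only proceed with the retried workflow once none of
// the pods scheduled for deletion are still present in the pod informer.
func allRetryPodsDeleted(woc *wfOperationCtx, podsToDelete []string) bool {
	for _, podName := range podsToDelete {
		key := woc.wf.Namespace + "/" + podName
		if _, exists, _ := woc.controller.podInformer.GetStore().GetByKey(key); exists {
			return false
		}
	}
	return true
}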

@shuangkun
Member Author

Thank you for your reply. I would like to confirm one question: do you think it is reasonable to pass the retry parameters through the spec? This increases the amount of code changes, but at the time I thought it was not appropriate to pass them through just a label, so I put them in the spec, following the example of suspend.

@tczhao
Member

tczhao commented Mar 25, 2024

Thank you for your reply. I would like to confirm one question: do you think it is reasonable to pass the retry parameters through the spec? This increases the amount of code changes, but at the time I thought it was not appropriate to pass them through just a label, so I put them in the spec, following the example of suspend.

Here's some context
spec and status
annotation
label

Both annotation and spec make sense in their own way.

  • annotation Directives from the end-user to the implementations to modify behavior or engage non-standard features.
  • The spec is a complete description of the desired state, including configuration settings provided by the user

I would vote for spec, thinking of it this way: when we retry, we update the desired state of the workflow, and then the controller works towards that desired state.

Hi @agilgur5, what are your thoughts on this?

@shuangkun
Member Author

Hi @agilgur5 , can you give me your vote so that I can better solve these comments? Thanks.

@shuangkun shuangkun changed the title feat: refactor the logic of delete pod during retry. Fixes: #12538 refactor: change the logic of delete pod during retry. Fixes: #12538 Mar 30, 2024
@shuangkun shuangkun force-pushed the fix/RefactorRetryDeleteLogic branch 2 times, most recently from 2883edc to bbb619a Compare March 30, 2024 06:47
Member
@agilgur5 agilgur5 left a comment

Adding a spec field is just one approach; other approaches are also possible, for example passing a label to trigger the retry. If you think a label is better, I will change it.

Hi @agilgur5 , can you give me your vote so that I can better solve these comments? Thanks.

Hey sorry, I've had a bit too much on my plate recently. (can you tell by how late I've been up in EDT? 🙃)

So I mentioned using a label in #12538 as we pick up on various labels already in the Controller, so that would be consistent. It would also make it possible for a user to just add a label and not need to go through the Server or CLI (such as in #12027 (comment)), which would also make for a consistent UX.
In #12624 (review), I also suggested against spec changes as those are, well, part of the spec, and so become (much) harder to change.

In particular, per #6490 (comment), we may want to create (optional?) CRs for imperative actions like retry and resubmit. Per #6490 (comment), I also discussed that in SIG Security during two meetings, and they were supportive of that and apparently Argo CD had been thinking of implementing something similar to that.

With that in mind, in particular, I wouldn't want us to implement a spec field that will change and that we will then have trouble removing due to breaking spec changes.

Both annotation and spec make sense in their own way.

  • annotation Directives from the end-user to the implementations to modify behavior or engage non-standard features.
  • The spec is a complete description of the desired state, including configuration settings provided by the user

I would vote for spec, thinking of it this way: when we retry, we update the desired state of the workflow, and then the controller works towards that desired state.

Hi @agilgur5, what are your thoughts on this?

Thanks for summarizing @tczhao.
So a retry is an imperative action, not a declarative state. So "desired state" is not really accurate to describe a retry -- it doesn't describe state at all, it describes an action. It also only affects the status and not the actual Workflow spec.
"non-standard feature" could be considered as a roughly accurate description of how retries currently work, as they are not supported via k8s directly, only via the Argo API. And as they are imperative, that is also non-standard in k8s.
A "directive from the end-user [...] to modify behavior" could be considered indicative of an action and it does modify behavior; it tells the Workflow to reprocess some nodes.

Perhaps an example would be good -- you wouldn't specify a retry within your spec when you submit your Workflow initially, as it's something you do manually after the fact. Whereas you do specify a retryStrategy beforehand for automatic retries. Also retryConfig needing to be disambiguated even more from retryStrategy I think would add confusion (it's a bit confusing even now, but at least it is relatively self-evident that one is manual whereas the other is automatic).

They are not entirely mutually exclusive definitions though, there is certainly some overlap.

implementation details

When I first looked at this implementation (it's been a month since then), my bigger question was actually, "can this even be properly implemented in a label or annotation?" as it's a primitive string -- with a limited length no less -- and not a more complex datatype.
parameters in particular seem a bit non-trivial to marshal.

Although retrying with new parameters actually seems to break the declarative state in general if I'm not mistaken -- and so may be an anti-pattern in and of itself. For instance, you could have some nodes with one set of parameters and others nodes with a different set. Meanwhile there is only one set of parameters in the actual Workflow spec. EDIT: This appears to have been mentioned in #9141 (comment) as well
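
To illustrate the marshaling concern: label values are capped at 63 characters, while annotations allow much larger string values (total annotation size is limited to 256KiB), so a JSON-encoded request would realistically have to live in an annotation. A rough sketch, with a hypothetical annotation key and struct:

// Illustrative only: JSON-encode the retry request into an annotation.
type retryRequest struct {
	RestartSuccessful bool     `json:"restartSuccessful,omitempty"`
	NodeFieldSelector string   `json:"nodeFieldSelector,omitempty"`
	Parameters        []string `json:"parameters,omitempty"`
}

func annotateForRetry(wf *wfv1.Workflow, req retryRequest) error {
	data, err := json.Marshal(req)
	if err != nil {
		return err
	}
	if wf.Annotations == nil {
		wf.Annotations = map[string]string{}
	}
	wf.Annotations["workflows.argoproj.io/retry-request"] = string(data) // hypothetical key
	return nil
}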

@shuangkun shuangkun force-pushed the fix/RefactorRetryDeleteLogic branch from 2c0587d to d6ac764 Compare April 9, 2024 12:59
shuangkun and others added 9 commits April 9, 2024 21:08
Co-authored-by: shuangkun <tsk2013uestc@163.com>
Co-authored-by: AlbeeSo <suyashi1321@163.com>
@shuangkun
Member Author

@agilgur5 Thank you for your reply. I have pushed an updated version. Can you take a look?

@shuangkun shuangkun added the prioritized-review label Apr 16, 2024
assert.Contains(t, output, "hello world")
}
}).
Wait(3*time.Second).
Member

What would be the potential issue without the Wait here?

Comment on lines +593 to +599
case labelBatchDeletePodsCompleted:
// Reaching here means that all pod deletions required for the retry operation have completed.
workflowName := podName
err := wfc.labelWorkflowRetried(ctx, namespace, workflowName)
if err != nil {
return err
}
Member

I think there's a problem here:
the podCleanupQueue is handled by multiple workers,
so there's no guarantee that all pods have been cleaned up when a worker sees labelBatchDeletePodsCompleted.

We may end up in a situation where a retried workflow starts while some of its pods have yet to be cleaned up (one way to avoid this race is sketched below).
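
One way to avoid that race, sketched here purely for illustration (not the PR's code), is to record how many pods a retry still has to delete and only label the workflow as retried when the last cleanup worker finishes:

// Hypothetical bookkeeping: workflow key -> remaining pod deletions, stored
// when the retry's pods are first queued for cleanup.
var pendingRetryDeletions sync.Map

// Called by each cleanup worker after it deletes a pod for a retried
// workflow; returns true only for the worker that removed the final pod.
func retryDeletionDone(wfKey string) bool {
	v, ok := pendingRetryDeletions.Load(wfKey)
	if !ok {
		return false
	}
	return atomic.AddInt64(v.(*int64), -1) == 0
}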

Labels
area/controller, area/retry-manual, area/server, prioritized-review
Development

Successfully merging this pull request may close these issues.

Move retry Pod deletions out of Server and into Controller for proper separation of duties
4 participants