
feat: speed up retry archived workflow #12624

Closed

Conversation

shuangkun
Member

@shuangkun shuangkun commented Feb 5, 2024

#12419
Since I accelerated the deletion of pods during plain workflow retry, which works well and is much faster, I want to speed up the archived workflow retry too.

Motivation

Speed up retrying archived workflows, so that a large archived workflow (more than 8,000 pods) can be successfully retried within 1 minute.

Modifications

Clean up failed nodes in parallel.

Verification

e2e test.

@shuangkun shuangkun marked this pull request as draft February 5, 2024 10:06
@shuangkun shuangkun marked this pull request as ready for review February 5, 2024 10:28
@agilgur5 agilgur5 added area/workflow-archive area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries area/server labels Feb 5, 2024
@tooptoop4
Contributor

🐎 💨

Comment on lines 300 to 325
go func(podName string) {
defer wg.Done()
err := kubeClient.CoreV1().Pods(wf.Namespace).Delete(ctx, podName, metav1.DeleteOptions{})
if err != nil && !apierr.IsNotFound(err) {
errCh <- err
return
}
}(podName)
Member

@agilgur5 do you have any opinions on this? My concern here is that this might very easily hit rate limits.
I see a similar PR has already been merged in, though.

I believe there are some timing-related helper functions that could be used here; maybe that would help?
I am concerned because I think some k8s platforms have relatively small QPS limits, and for a large enough workflow we might hit them easily.
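For context on the QPS concern: client-go throttles every request from a given client through the QPS and Burst fields on its rest.Config, regardless of how many goroutines are issuing deletes. A minimal sketch of that knob (illustrative values, not this PR's code):

// Sketch only: all requests from this client are throttled by QPS/Burst.
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newThrottledClient() (kubernetes.Interface, error) {
	cfg, err := rest.InClusterConfig() // assumes in-cluster credentials
	if err != nil {
		return nil, err
	}
	cfg.QPS = 20   // steady-state requests per second (illustrative)
	cfg.Burst = 50 // short bursts allowed above QPS (illustrative)
	return kubernetes.NewForConfig(cfg)
}

With a budget like that, thousands of parallel deletes would simply queue client-side for several minutes, which is roughly the trade-off being discussed here.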

Member

See my comment below

@@ -291,12 +292,25 @@ func (w *archivedWorkflowServer) RetryArchivedWorkflow(ctx context.Context, req
return nil, sutils.ToStatusError(err, codes.Internal)
}

errCh := make(chan error, len(podsToDelete))
var wg sync.WaitGroup
wg.Add(len(podsToDelete))
for _, podName := range podsToDelete {
Member

It feels like this and the partner could be extracted out to a common helper function.

Member Author

Yes, this is possible. I can extract a function.

@Joibel
Member

Joibel commented Feb 6, 2024

Both this and the other speed up for plain retry feel like they will only be effective speedups in the situation where the controller isn't busy. I share @isubasinghe concerns over this effectively stealing more bandwidth.

I am unsure why all of these pod deletions can't just be put on the pod deletion queue.

@shuangkun
Member Author

Both this and the other speed up for plain retry feel like they will only be effective speedups in the situation where the controller isn't busy. I share @isubasinghe concerns over this effectively stealing more bandwidth.

I am unsure why all of these pod deletions can't just be put on the pod deletion queue.

Yes, I will adjust the concurrency and burst

@agilgur5
Member

agilgur5 commented Feb 10, 2024

Both this and the other speed up for plain retry feel like they will only be effective speedups in the situation where the controller isn't busy. I share @isubasinghe concerns over this effectively stealing more bandwidth.

I am unsure why all of these pod deletions can't just be put on the pod deletion queue.

Regarding these concerns, note that these changes are currently in the Server. The pod deletion logic really shouldn't be in the Server to begin with, see #12538. I mentioned this in the other retry PR as well.

Moving the logic to the Controller will allow for some more advanced techniques, like using the queue. Note that things like retries should probably be prioritized over other GC, as they are interactive with a user and so a long wait would not necessarily be optimal.

That would be the root cause fix here.
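As a rough illustration of that queue idea (the key format and values here are hypothetical, not the controller's actual pod cleanup code): interactive retry deletions could be enqueued directly while background GC goes through the rate-limited path, so user-facing work is not stuck behind bulk cleanup.

// Hypothetical sketch of prioritizing retry deletions on a controller workqueue.
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	// Background GC items go through the rate limiter, so they are paced under load.
	q.AddRateLimited("argo/old-pod/deletePod")

	// Interactive retry items are added directly so they are processed promptly.
	q.Add("argo/failed-pod/deletePod")

	// A worker loop would pop keys and issue the actual Delete calls.
	key, _ := q.Get()
	fmt.Println("processing", key)
	q.Done(key)
}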

@isubasinghe
Member

That would be the root cause fix here.

Agreed, but I think for now we're probably better off not doing this regardless? Or perhaps doing it with exponential backoff.

@shuangkun
Member Author

What should we do now? For a large workflow with more than 4,000 nodes, retry often takes more than a minute; after acceleration, it takes 10 seconds. I think we can adjust the concurrency to 500 if we are afraid of occupying too much bandwidth.

@isubasinghe
Member

What should we do now? For a large workflow with more than 4,000 nodes, retry often takes more than a minute; after acceleration, it takes 10 seconds. I think we can adjust the concurrency to 500 if we are afraid of occupying too much bandwidth.

Have you seen the wait package in the k8s API? I think it might be a good idea to use the exponential backoff from that.
Regardless, it's probably not a good idea to launch this many goroutines; you likely aren't going to see much benefit past the point where you've saturated your networking hardware.
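A minimal sketch of that suggestion (the helper name is assumed, and this is not the code that ended up in the PR): wrap each deletion in a backoff from the k8s wait/retry packages so a throttled or flaky API server is retried politely instead of hammered.

// Sketch: exponential backoff around a single pod deletion.
package main

import (
	"context"
	"time"

	apierr "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

func deletePodWithBackoff(ctx context.Context, kubeClient kubernetes.Interface, namespace, podName string) error {
	backoff := wait.Backoff{Duration: 100 * time.Millisecond, Factor: 2.0, Jitter: 0.1, Steps: 5}
	return retry.OnError(backoff,
		func(err error) bool { return true }, // retry on any error until Steps are exhausted
		func() error {
			err := kubeClient.CoreV1().Pods(namespace).Delete(ctx, podName, metav1.DeleteOptions{})
			if err != nil && !apierr.IsNotFound(err) {
				return err
			}
			return nil // deleted, or already gone
		})
}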

@shuangkun shuangkun force-pushed the feat/speedUpRetryArchivedWorkflow branch 3 times, most recently from 10f196a to cd183a6 on February 19, 2024 05:40
@shuangkun
Member Author

shuangkun commented Feb 19, 2024

Have you seen the wait package in the k8s API? I think it might be a good idea to use the exponential backoff from that. Regardless, it's probably not a good idea to launch this many goroutines; you likely aren't going to see much benefit past the point where you've saturated your networking hardware.

Thanks. Do you mean using exponential backoff to reduce pressure when a pod deletion errors? I added exponential backoff; can you have a look? @isubasinghe

@shuangkun shuangkun force-pushed the feat/speedUpRetryArchivedWorkflow branch 2 times, most recently from 6255641 to 87c5fcd on February 20, 2024 06:25
@shuangkun shuangkun added the prioritized-review For members of the Sustainability Effort label Feb 24, 2024
@juliev0
Contributor

juliev0 commented Feb 25, 2024

Is anyone up for being the Assignee on this one? (I see multiple people have reviewed, and it's a “prioritized-review”.) Thanks

@Joibel
Member

Joibel commented Feb 28, 2024

I think we should put the effort into doing the right thing, and move these deletes to a prioritised queue in the controller as suggested by Anton.

Even with the exponential backoff code this is still attempting to

  • hammer the Kubernetes API, using goroutines to dispatch massive numbers of deletes at once
  • do so in the server rather than the controller

I also really don't understand why an archived workflow still has pods around. I must admit to not having looked into or tried this myself.

Surely, due to owner references, at the point a workflow is archived the workflow is deleted, and the pods it owns should be deleted with it.

@shuangkun
Member Author

I think we should put the effort into doing the right thing, and move these deletes to a prioritised queue in the controller as suggested by Anton.

Even with the exponential backoff code this is still attempting to

  • hammer the Kubernetes API, using goroutines to dispatch massive numbers of deletes at once
  • do so in the server rather than the controller

I also really don't understand why an archived workflow still has pods around. I must admit to not having looked into or tried this myself.

Surely, due to owner references, at the point a workflow is archived the workflow is deleted, and the pods it owns should be deleted with it.

Once "archive: true" is set, the workflow will be archived when completed, regardless of whether it is deleted. We have no control over whether the user uses retryWorkflow or retryArchiveWorkflow.

I don't think we need to worry too much about the concurrency pressure of 500; it's a very simple operation, and we make heavy use of concurrency in our code.

As for moving the code to the controller, it may be the final solution, but it will take time to refactor.
This PR does not introduce new issues, but it solves the problem of retries being too slow for large workflows (> 4,000 nodes).
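To make the "concurrency of 500" idea concrete, here is a sketch (not the exact change in this PR) of bounding the fan-out with errgroup, so at most 500 deletions are in flight at once:

// Sketch: delete pods in parallel with a cap on in-flight requests.
package main

import (
	"context"

	"golang.org/x/sync/errgroup"
	apierr "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func deletePodsBounded(ctx context.Context, kubeClient kubernetes.Interface, namespace string, podsToDelete []string) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(500) // at most 500 concurrent deletes; tune to the cluster's API budget
	for _, podName := range podsToDelete {
		podName := podName // capture loop variable (pre-Go 1.22)
		g.Go(func() error {
			err := kubeClient.CoreV1().Pods(namespace).Delete(ctx, podName, metav1.DeleteOptions{})
			if err != nil && !apierr.IsNotFound(err) {
				return err
			}
			return nil
		})
	}
	return g.Wait() // returns the first non-nil error, if any
}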

@Joibel
Member

Joibel commented Feb 28, 2024

Once "archive: true" is set, the workflow will be archived when completed, regardless of whether it is deleted. We have no control over whether the user uses retryWorkflow or retryArchiveWorkflow.

Indeed, but this PR is called "speed up retry archived workflow". I know the code prior to this commit attempts to delete a lot of pods, but I'm wondering why. Once archiving has happened, why are there any pods in the cluster? I'm questioning why the code exists at all - it will take time to fail to delete 4000 pods that don't exist.

I don't think we need to worry too much about the concurrency pressure of 500; it's a very simple operation, and we make heavy use of concurrency in our code.

I am not at all worried about the use of go routines.

I am worried about being a good kubernetes citizen. Elsewhere in the code we attempt to rate limit how hard we push the kubernetes API. This code change attempts to be selfish and greedy with API usage in order to achieve a goal on behalf of one user. It is doing this at the expense of other users in the system.

Kubernetes API is often deliberately rate limited, and often rate limited by external actors. This code change attempts to use as much of this bandwidth as possible for this one thing. On any busy system this must do that at the expense of other users (whether human or computer) of the system. This is what I mean by being a bad citizen. What I would expect if we adopted this code is that we will get issues raised by people who object to this "denial of service" to other part of their kubernetes system. They have no option not to have this behavior, but I don't think making this optional is the right approach either.

As for moving the code to the controller, it may be the final solution, but it will take time to refactor. This PR does not introduce new issues, but it solves the problem of retries being too slow for large workflows (> 4,000 nodes).

Although making large workflow retries faster is a good goal, I don't think that being selfish about doing it is the right approach. I think we need to take time to refactor, and perhaps alongside this provide better user feedback on the retry.

@shuangkun
Member Author

shuangkun commented Feb 28, 2024

Indeed, but this PR is called "speed up retry archived workflow". I know the code prior to this commit attempts to delete a lot of pods, but I'm wondering why. Once archiving has happened, why are there any pods in the cluster? I'm questioning why the code exists at all - it will take time to fail to delete 4000 pods that don't exist.

After the workflow is archived, there may still be pods in the cluster if the workflow has not been deleted and no pod GC policy is set.

Although making large workflow retries faster is a good goal, I don't think that being selfish about doing it is the right approach. I think we need to take time to refactor, and perhaps alongside this provide better user feedback on the retry.

I think we should put the effort into doing the right thing, and move these deletes to a prioritised queue in the controller as suggested by Anton.

Thanks, maybe you are right, but I still think the problem needs to be fixed step by step; it is difficult to solve all the problems at once. If the majority of people think that solving the problem requires refactoring this piece of code, I will do it. We really need this feature.

In the server, there are still some places that use this method to accelerate. For example:

streamPod(p.DeepCopy())

@Joibel
Member

Joibel commented Feb 28, 2024

After the workflow is archived, there may still be pods in the cluster if the workflow has not been deleted and no pod GC policy is set.

I don't know how you get archiving not to delete the workflow. It would seem very odd not to have it in both the cluster and the archive.

I also don't think pod GC has any effect here, because pods are owned by the workflow, so will be deleted.

I'm asking for proof that what you're saying is possible, I just don't know a way, but I haven't looked in detail.

Thanks, maybe you are right, but I still think the problem needs to be fixed step by step; it is difficult to solve all the problems at once. If the majority of people think that solving the problem requires refactoring this piece of code, I will do it. We really need this feature.

@juliev0, @agilgur5 and @isubasinghe have commented in this thread, so I'll ping them here.

In the server, there are still some places that use this method to accelerate. For example:

streamPod(p.DeepCopy())

I know. In this case it is justified, and is not so evil because it's only hitting live pods.

@shuangkun
Member Author

shuangkun commented Feb 28, 2024

After the workflow is archived, there may still be pods in the cluster if the workflow has not been deleted and no pod GC policy is set.

I don't know how you get archiving not to delete the workflow. It would seem very odd not to have it in both the cluster and the archive.

I also don't think pod GC has any effect here, because pods are owned by the workflow, so will be deleted.

I'm asking for proof that what you're saying is possible, I just don't know a way, but I haven't looked in detail.

I did a test on the latest version.
I can see the workflow and pod in the cluster:

tianshuangkun@B-QGHAMD6M-2124 test % argo list
NAME                    STATUS      AGE   DURATION   PRIORITY
test-workflow-gcszp     Failed      2m    2m         0
sparkly-poochenheimer   Succeeded   8m    42s        0
tianshuangkun@B-QGHAMD6M-2124 test % kubectl get pod
NAME                    READY   STATUS      RESTARTS   AGE
sparkly-poochenheimer   0/2     Completed   0          8m41s
test-workflow-gcszp     0/2     Error       0          2m47s

and in archive:
[screenshot of the archived workflow]
Maybe the logic has changed recently?

In the server, there are still some places that use this method to accelerate. For example:

streamPod(p.DeepCopy())

I know. In this case it is justified, and is not so evil because it's only hitting live pods.

If the pods weren't GC'd, will fetching logs hit all the pods?

@agilgur5
Member

agilgur5 commented Feb 28, 2024

Thanks, maybe you are right, but I still think the problem needs to be fixed step by step; it is difficult to solve all the problems at once. If the majority of people think that solving the problem requires refactoring this piece of code, I will do it. We really need this feature.

@juliev0, @agilgur5 and @isubasinghe have commented in this thread, so I'll ping them here.

Yes, I definitely think it should be refactored and would definitely prefer that, that's why I said it. I did choose the "comment" code review option intentionally though -- I don't really have a strong opinion on a "band-aid" like this, but I generally prefer long-term, root cause fixes rather than adding tech debt. If someone is willing to approve, I don't necessarily mind too much, but I will admit it is a bit concerning when multiple PRs happen to the same place (we're counting at least 3 right now) that should all be refactored -- at that point it may even take the same amount of time to refactor.

@agilgur5
Member

agilgur5 commented Feb 28, 2024

Regarding the Workflow Archive -- there are race conditions when a Workflow is both in the Archive and still a resource in the cluster. Even in the Archive process itself, the Workflow must exist while being Archived.

But also in general, the Archive does not require that a Workflow be deleted, that is up to the TTL/retentionPolicy etc. Workflows are labeled for archiving when they are completed. This label is then picked up by the Informer which kicks off archiving.

@shuangkun shuangkun force-pushed the feat/speedUpRetryArchivedWorkflow branch from 87c5fcd to 1ee04ce on March 3, 2024 16:36
@shuangkun shuangkun changed the title feat: speed up retry archived workflow feat: refactor the logic of delete pod during retry Mar 3, 2024
Member

@agilgur5 left a comment

@shuangkun it looks like you're going to address #12538 and refactor the Server/Controller logic properly? I think that should be a separate PR, as nearly everything here is about the archived retry logic, including your PR description. I would leave this PR in its previous state for posterity and create a new PR.

Refactoring should also modify the manifests -- the Server should no longer need writeable RBAC for Pods, per #12538. The other implementation details listed there are also relevant. Changes to that approach -- especially spec changes -- should be discussed.

@shuangkun shuangkun force-pushed the feat/speedUpRetryArchivedWorkflow branch from 708cc20 to 6255641 on March 4, 2024 08:14
@shuangkun
Member Author

shuangkun commented Mar 4, 2024

OK, thanks, I will open a new PR.

@shuangkun shuangkun changed the title feat: refactor the logic of delete pod during retry feat: speed up retry archived workflow Mar 4, 2024
@shuangkun shuangkun force-pushed the feat/speedUpRetryArchivedWorkflow branch from 6255641 to 9aabe5b on March 4, 2024 08:31
@shuangkun shuangkun closed this Mar 4, 2024
@shuangkun shuangkun reopened this Mar 4, 2024
@shuangkun shuangkun closed this Mar 4, 2024
@shuangkun shuangkun reopened this Mar 4, 2024
@shuangkun shuangkun removed the prioritized-review For members of the Sustainability Effort label Mar 4, 2024
@shuangkun shuangkun closed this Mar 6, 2024