
feat: speed up retry archived workflow #12624

Closed

Conversation

shuangkun
Member

@shuangkun shuangkun commented Feb 5, 2024

#12419
Since I accelerated the deletion of pods during plain workflow retry, which works well and is much faster, I want to speed up the archived workflow retry too.

Motivation

Speed up retrying archived workflows, so that a large archived workflow (more than 8,000 pods) can be successfully retried within 1 minute.

Modifications

Clean up failed nodes in parallel.

Verification

e2e test.

@shuangkun shuangkun marked this pull request as draft February 5, 2024 10:06
@shuangkun shuangkun marked this pull request as ready for review February 5, 2024 10:28
@agilgur5 agilgur5 added area/workflow-archive area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries area/server labels Feb 5, 2024
@tooptoop4
Contributor

🐎 💨

Comment on lines 300 to 325
go func(podName string) {
defer wg.Done()
err := kubeClient.CoreV1().Pods(wf.Namespace).Delete(ctx, podName, metav1.DeleteOptions{})
if err != nil && !apierr.IsNotFound(err) {
errCh <- err
return
}
}(podName)
Member

@agilgur5 do you have any opinions on this? My concern here is that this might very easily hit rate limits.
I see a similar PR has already been merged in, though.

I believe there are some timing-related helper functions that could be used here; maybe that would help?
I am concerned because I think some k8s platforms have relatively small QPS limits, and for a large enough workflow we might hit them easily.
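For context on the QPS concern: client-go throttles every request from a given client through the QPS and Burst fields on its rest.Config, regardless of how many goroutines are issuing deletes. A minimal sketch of that knob (illustrative values, not this PR's code):

// Sketch only: all requests from this client are throttled by QPS/Burst.
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newThrottledClient() (kubernetes.Interface, error) {
	cfg, err := rest.InClusterConfig() // assumes in-cluster credentials
	if err != nil {
		return nil, err
	}
	cfg.QPS = 20   // steady-state requests per second (illustrative)
	cfg.Burst = 50 // short bursts allowed above QPS (illustrative)
	return kubernetes.NewForConfig(cfg)
}

With a budget like that, thousands of parallel deletes would simply queue client-side for several minutes, which is roughly the trade-off being discussed here.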

Member

See my comment below

@@ -291,12 +292,25 @@ func (w *archivedWorkflowServer) RetryArchivedWorkflow(ctx context.Context, req
return nil, sutils.ToStatusError(err, codes.Internal)
}

errCh := make(chan error, len(podsToDelete))
var wg sync.WaitGroup
wg.Add(len(podsToDelete))
for _, podName := range podsToDelete {
Member

It feels like this and the partner could be extracted out to a common helper function.

Member Author

Yes, this is possible. I can extract a function.

@Joibel
Member

Joibel commented Feb 6, 2024

Both this and the other speed up for plain retry feel like they will only be effective speedups in the situation where the controller isn't busy. I share @isubasinghe concerns over this effectively stealing more bandwidth.

I am unsure why all of these pod deletions can't just be put on the pod deletion queue.

@shuangkun
Member Author

Both this and the other speed up for plain retry feel like they will only be effective speedups in the situation where the controller isn't busy. I share @isubasinghe concerns over this effectively stealing more bandwidth.

I am unsure why all of these pod deletions can't just be put on the pod deletion queue.

Yes, I will adjust the concurrency and burst

@agilgur5
Member

agilgur5 commented Feb 10, 2024

Both this and the other speed up for plain retry feel like they will only be effective speedups in the situation where the controller isn't busy. I share @isubasinghe concerns over this effectively stealing more bandwidth.

I am unsure why all of these pod deletions can't just be put on the pod deletion queue.

Regarding these concerns, note that these changes are currently in the Server. The pod deletion logic really shouldn't be in the Server to begin with, see #12538. I mentioned this in the other retry PR as well.

Moving the logic to the Controller will allow for some more advanced techniques, like using the queue. Note that things like retries should probably be prioritized over other GC, as they are interactive with a user and so a long wait would not necessarily be optimal.

That would be the root cause fix here.
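As a rough illustration of that queue idea (the key format and values here are hypothetical, not the controller's actual pod cleanup code): interactive retry deletions could be enqueued directly while background GC goes through the rate-limited path, so user-facing work is not stuck behind bulk cleanup.

// Hypothetical sketch of prioritizing retry deletions on a controller workqueue.
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	// Background GC items go through the rate limiter, so they are paced under load.
	q.AddRateLimited("argo/old-pod/deletePod")

	// Interactive retry items are added directly so they are processed promptly.
	q.Add("argo/failed-pod/deletePod")

	// A worker loop would pop keys and issue the actual Delete calls.
	key, _ := q.Get()
	fmt.Println("processing", key)
	q.Done(key)
}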

@isubasinghe
Member

That would be the root cause fix here.

Agreed, but I think for now we're probably better off not doing this regardless? Or perhaps doing it with exponential backoff.

@shuangkun
Member Author

What should we do now? For a large workflow with more than 4,000 nodes, retry often takes more than a minute; after acceleration, it takes 10 seconds. I think we can adjust the concurrency to 500 if we are afraid of occupying too much bandwidth.

@isubasinghe
Member

What should we do now? For a large workflow with more than 4,000 nodes, retry often takes more than a minute; after acceleration, it takes 10 seconds. I think we can adjust the concurrency to 500 if we are afraid of occupying too much bandwidth.

Have you seen the wait package in the k8s API? I think it might be a good idea to use the exponential backoff from that.
Regardless, it's probably not a good idea to launch this many goroutines; you likely aren't going to see much benefit past the point where you've saturated your networking hardware.
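A minimal sketch of that suggestion (the helper name is assumed, and this is not the code that ended up in the PR): wrap each deletion in a backoff from the k8s wait/retry packages so a throttled or flaky API server is retried politely instead of hammered.

// Sketch: exponential backoff around a single pod deletion.
package main

import (
	"context"
	"time"

	apierr "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

func deletePodWithBackoff(ctx context.Context, kubeClient kubernetes.Interface, namespace, podName string) error {
	backoff := wait.Backoff{Duration: 100 * time.Millisecond, Factor: 2.0, Jitter: 0.1, Steps: 5}
	return retry.OnError(backoff,
		func(err error) bool { return true }, // retry on any error until Steps are exhausted
		func() error {
			err := kubeClient.CoreV1().Pods(namespace).Delete(ctx, podName, metav1.DeleteOptions{})
			if err != nil && !apierr.IsNotFound(err) {
				return err
			}
			return nil // deleted, or already gone
		})
}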

@shuangkun shuangkun force-pushed the feat/speedUpRetryArchivedWorkflow branch 3 times, most recently from 10f196a to cd183a6 on February 19, 2024 05:40
@shuangkun
Member Author

shuangkun commented Feb 19, 2024

Have you seen the wait package in the k8s API? I think it might be a good idea to use the exponential backoff from that. Regardless, it's probably not a good idea to launch this many goroutines; you likely aren't going to see much benefit past the point where you've saturated your networking hardware.

Thanks. Do you mean using exponential backoff to reduce pressure when a pod deletion errors? I added exponential backoff; can you have a look? @isubasinghe

@shuangkun shuangkun force-pushed the feat/speedUpRetryArchivedWorkflow branch 2 times, most recently from 6255641 to 87c5fcd on February 20, 2024 06:25
@shuangkun shuangkun added the prioritized-review For members of the Sustainability Effort label Feb 24, 2024
@juliev0
Contributor

juliev0 commented Feb 25, 2024

Is anyone up for being the Assignee on this one? (I see multiple people have reviewed, and it's a “prioritized-review”.) Thanks

@Joibel
Member

Joibel commented Feb 28, 2024

I think we should put the effort into doing the right thing, and move these deletes to a prioritised queue in the controller as suggested by Anton.

Even with the exponential backoff code this is still attempting to

  • hammer the Kubernetes API, using goroutines to dispatch massive numbers of deletes at once
  • do so in the server rather than the controller

I also really don't understand why an archived workflow still has pods around. I must admit to not having looked into or tried this myself.

Surely, due to owner references, at the point a workflow is archived the workflow is deleted, and the pods it owns should be deleted with it.

@shuangkun
Member Author

I think we should put the effort into doing the right thing, and move these deletes to a prioritised queue in the controller as suggested by Anton.

Even with the exponential backoff code this is still attempting to

  • hammer the Kubernetes API, using goroutines to dispatch massive numbers of deletes at once
  • do so in the server rather than the controller

I also really don't understand why an archived workflow still has pods around. I must admit to not having looked into or tried this myself.

Surely, due to owner references, at the point a workflow is archived the workflow is deleted, and the pods it owns should be deleted with it.

Once "archive: true" is set, the workflow will be archived when completed, regardless of whether it is deleted. We have no control over whether the user uses retryWorkflow or retryArchiveWorkflow.

I don't think we need to worry too much about the concurrency pressure of 500; it's a very simple operation, and we make heavy use of concurrency in our code.

As for moving the code to the controller, it may be the final solution, but it will take time to refactor.
This PR does not introduce new issues, but it solves the problem of retries being too slow for large workflows (> 4,000 nodes).
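To make the "concurrency of 500" idea concrete, here is a sketch (not the exact change in this PR) of bounding the fan-out with errgroup, so at most 500 deletions are in flight at once:

// Sketch: delete pods in parallel with a cap on in-flight requests.
package main

import (
	"context"

	"golang.org/x/sync/errgroup"
	apierr "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func deletePodsBounded(ctx context.Context, kubeClient kubernetes.Interface, namespace string, podsToDelete []string) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(500) // at most 500 concurrent deletes; tune to the cluster's API budget
	for _, podName := range podsToDelete {
		podName := podName // capture loop variable (pre-Go 1.22)
		g.Go(func() error {
			err := kubeClient.CoreV1().Pods(namespace).Delete(ctx, podName, metav1.DeleteOptions{})
			if err != nil && !apierr.IsNotFound(err) {
				return err
			}
			return nil
		})
	}
	return g.Wait() // returns the first non-nil error, if any
}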

@Joibel
Member

Joibel commented Feb 28, 2024

Once "archive: true" is set, the workflow will be archived when completed, regardless of whether it is deleted. We have no control over whether the user uses retryWorkflow or retryArchiveWorkflow.

Indeed, but this PR is called "speed up retry archived workflow". I know the code prior to this commit attempts to delete a lot of pods, but I'm wondering why. Once archiving has happened, why are there any pods in the cluster? I'm questioning why the code exists at all - it will take time to fail to delete 4000 pods that don't exist.

I don't think we need to worry too much about the concurrency pressure of 500; it's a very simple operation, and we make heavy use of concurrency in our code.

I am not at all worried about the use of go routines.

I am worried about being a good kubernetes citizen. Elsewhere in the code we attempt to rate limit how hard we push the kubernetes API. This code change attempts to be selfish and greedy with API usage in order to achieve a goal on behalf of one user. It is doing this at the expense of other users in the system.

Kubernetes API is often deliberately rate limited, and often rate limited by external actors. This code change attempts to use as much of this bandwidth as possible for this one thing. On any busy system this must do that at the expense of other users (whether human or computer) of the system. This is what I mean by being a bad citizen. What I would expect if we adopted this code is that we will get issues raised by people who object to this "denial of service" to other part of their kubernetes system. They have no option not to have this behavior, but I don't think making this optional is the right approach either.

As for moving the code to the controller, it may be the final solution, but it will take time to refactor. This PR does not introduce new issues, but it solves the problem of retries being too slow for large workflows (> 4,000 nodes).

Although making large workflow retries faster is a good goal, I don't think that being selfish about doing it is the right approach. I think we need to take time to refactor, and perhaps alongside this provide better user feedback on the retry.

@shuangkun
Member Author

shuangkun commented Feb 28, 2024

Indeed, but this PR is called "speed up retry archived workflow". I know the code prior to this commit attempts to delete a lot of pods, but I'm wondering why. Once archiving has happened, why are there any pods in the cluster? I'm questioning why the code exists at all - it will take time to fail to delete 4000 pods that don't exist.

After the workflow is archived, there may still be pods in the cluster if the workflow has not been deleted and no pod GC policy is set.

Although making large workflow retries faster is a good goal, I don't think that being selfish about doing it is the right approach. I think we need to take time to refactor, and perhaps alongside this provide better user feedback on the retry.

I think we should put the effort into doing the right thing, and move these deletes to a prioritised queue in the controller as suggested by Anton.

Thanks, maybe you are right, but I still think the problem needs to be fixed step by step; it is difficult to solve all the problems at once. If the majority of people think that solving the problem requires refactoring this piece of code, I will do it. We really need this feature.

In the server, there are still some places that use this method to accelerate. For example:

streamPod(p.DeepCopy())

@Joibel
Member

Joibel commented Feb 28, 2024

After the workflow is archived, there may still be pods in the cluster if the workflow has not been deleted and no pod GC policy is set.

I don't know how you get archiving not to delete the workflow. It would seem very odd not to have it in both the cluster and the archive.

I also don't think pod GC has any effect here, because pods are owned by the workflow, so will be deleted.

I'm asking for proof that what you're saying is possible, I just don't know a way, but I haven't looked in detail.

Thanks, maybe you are right, but I still think the problem needs to be fixed step by step; it is difficult to solve all the problems at once. If the majority of people think that solving the problem requires refactoring this piece of code, I will do it. We really need this feature.

@juliev0, @agilgur5 and @isubasinghe have commented in this thread, so I'll ping them here.

In the server, there are still some places that use this method to accelerate. For example:

streamPod(p.DeepCopy())

I know. In this case it is justified, and is not so evil because it's only hitting live pods.

@shuangkun
Member Author

shuangkun commented Feb 28, 2024

After the workflow is archived, there may still be pods in the cluster if the workflow has not been deleted and no pod GC policy is set.

I don't know how you get archiving not to delete the workflow. It would seem very odd not to have it in both the cluster and the archive.

I also don't think pod GC has any effect here, because pods are owned by the workflow, so will be deleted.

I'm asking for proof that what you're saying is possible, I just don't know a way, but I haven't looked in detail.

I did a test on the latest version.
I can see the workflow and pod in the cluster:

tianshuangkun@B-QGHAMD6M-2124 test % argo list
NAME                    STATUS      AGE   DURATION   PRIORITY
test-workflow-gcszp     Failed      2m    2m         0
sparkly-poochenheimer   Succeeded   8m    42s        0
tianshuangkun@B-QGHAMD6M-2124 test % kubectl get pod
NAME                    READY   STATUS      RESTARTS   AGE
sparkly-poochenheimer   0/2     Completed   0          8m41s
test-workflow-gcszp     0/2     Error       0          2m47s

and in archive:
[screenshot of the archived workflow]
Maybe the logic has changed recently?

In the server, there are still some places that use this method to accelerate. For example:

streamPod(p.DeepCopy())

I know. In this case it is justified, and is not so evil because it's only hitting live pods.

If the pods weren't GC'd, will fetching logs hit all the pods?

@agilgur5
Member

agilgur5 commented Feb 28, 2024

Thanks, maybe you are right, but I still think the problem needs to be fixed step by step; it is difficult to solve all the problems at once. If the majority of people think that solving the problem requires refactoring this piece of code, I will do it. We really need this feature.

@juliev0, @agilgur5 and @isubasinghe have commented in this thread, so I'll ping them here.

Yes, I definitely think it should be refactored and would definitely prefer that, that's why I said it. I did choose the "comment" code review option intentionally though -- I don't really have a strong opinion on a "band-aid" like this, but I generally prefer long-term, root cause fixes rather than adding tech debt. If someone is willing to approve, I don't necessarily mind too much, but I will admit it is a bit concerning when multiple PRs happen to the same place (we're counting at least 3 right now) that should all be refactored -- at that point it may even take the same amount of time to refactor.

@agilgur5
Member

agilgur5 commented Feb 28, 2024

Regarding the Workflow Archive -- there are race conditions when a Workflow is both in the Archive and still a resource in the cluster. Even in the Archive process itself, the Workflow must exist while being Archived.

But also in general, the Archive does not require that a Workflow be deleted, that is up to the TTL/retentionPolicy etc. Workflows are labeled for archiving when they are completed. This label is then picked up by the Informer which kicks off archiving.

@shuangkun shuangkun force-pushed the feat/speedUpRetryArchivedWorkflow branch from 87c5fcd to 1ee04ce on March 3, 2024 16:36
@shuangkun shuangkun changed the title feat: speed up retry archived workflow feat: refactor the logic of delete pod during retry Mar 3, 2024
Member

@agilgur5 left a comment

@shuangkun it looks like you're going to address #12538 and refactor the Server/Controller logic properly? I think that should be a separate PR, as nearly everything here is about the archived retry logic, including your PR description. I would leave this PR in its previous state for posterity and create a new PR.

Refactoring should also modify the manifests -- the Server should no longer need writeable RBAC for Pods, per #12538. The other implementation details listed there are also relevant. Changes to that approach -- especially spec changes -- should be discussed.

@shuangkun shuangkun force-pushed the feat/speedUpRetryArchivedWorkflow branch from 708cc20 to 6255641 on March 4, 2024 08:14
@shuangkun
Member Author

shuangkun commented Mar 4, 2024

OK, thanks, I will open a new PR.

@shuangkun shuangkun changed the title feat: refactor the logic of delete pod during retry feat: speed up retry archived workflow Mar 4, 2024
@shuangkun shuangkun force-pushed the feat/speedUpRetryArchivedWorkflow branch from 6255641 to 9aabe5b on March 4, 2024 08:31
@shuangkun shuangkun closed this Mar 4, 2024
@shuangkun shuangkun reopened this Mar 4, 2024
@shuangkun shuangkun closed this Mar 4, 2024
@shuangkun shuangkun reopened this Mar 4, 2024
@shuangkun shuangkun removed the prioritized-review For members of the Sustainability Effort label Mar 4, 2024
@shuangkun shuangkun closed this Mar 6, 2024