feat: speed up retry archived workflow #12624
Conversation
🐎 💨
```go
go func(podName string) {
	defer wg.Done()
	err := kubeClient.CoreV1().Pods(wf.Namespace).Delete(ctx, podName, metav1.DeleteOptions{})
	if err != nil && !apierr.IsNotFound(err) {
		errCh <- err
		return
	}
}(podName)
```
@agilgur5 do you have any opinions on this? My concern here is that this might very easily hit rate limitations.
I see a similar PR has already been merged in, though.
I believe there are some timing-related helper functions that could be used here; maybe that would help?
I am concerned because some k8s platforms have relatively small QPS limits, and for a large enough workflow we might hit them easily.
See my comment below
```diff
@@ -291,12 +292,25 @@ func (w *archivedWorkflowServer) RetryArchivedWorkflow(ctx context.Context, req
 		return nil, sutils.ToStatusError(err, codes.Internal)
 	}

 	errCh := make(chan error, len(podsToDelete))
 	var wg sync.WaitGroup
 	wg.Add(len(podsToDelete))
 	for _, podName := range podsToDelete {
```
It feels like this and its counterpart could be extracted out to a common helper function.
Yes, this is possible. I can extract a function.
Both this and the other speed-up for plain retry feel like they will only be effective in the situation where the controller isn't busy. I share @isubasinghe's concerns over this effectively stealing more bandwidth. I am unsure why all of these pod deletions can't just be put on the pod deletion queue.
Yes, I will adjust the concurrency and burst.
Regarding these concerns, note that these changes are currently in the Server. The pod deletion logic really shouldn't be in the Server to begin with, see #12538. I mentioned this in the other retry PR as well. Moving the logic to the Controller will allow for some more advanced techniques, like using the queue. Note that things like retries should probably be prioritized over other GC, as they are interactive with a user and so a long wait would not necessarily be optimal. That would be the root cause fix here.
Agreed, but I think for now we're probably better off not doing this regardless? Or perhaps do it with exponential backoff.
What should we do now? For a large workflow with more than 4,000 nodes, retry often takes more than a minute; with this speed-up, it takes 10 seconds. I think we can adjust the concurrency to 500 if we are worried about occupying bandwidth.
Have you seen the wait package in the k8s API? I think it might be a good idea to use the exponential backoff from that?
Thanks. Do you mean using exponential backoff to reduce pressure when a pod deletion errors? I have added exponential backoff; can you have a look? @isubasinghe
Is anyone up for being assignee on this one? (I see multiple people have reviewed.) It's a "prioritized-review". Thanks
I think we should put the effort into doing the right thing, and move these deletes to a prioritised queue in the controller as suggested by Anton. Even with the exponential backoff code this is still attempting to
I also really don't understand why an archived workflow still has pods around. I must admit to not having looked or tried this. Surely, due to owner references, at the point a workflow is archived the workflow is deleted, and the pods it owns should be deleted as well.
Once "archive: true" is set, the workflow will be archived when completed, regardless of whether it is deleted. We have no control over whether the user uses retryWorkflow or retryArchivedWorkflow. I don't think we need to worry too much about the concurrency pressure of 500; it's a very simple operation, and we make heavy use of concurrency in our code. As for moving the code to the controller, that may be the final solution, but it will take time to refactor.
Indeed, but this PR is called "speed up retry archived workflow". I know the code prior to this commit attempts to delete a lot of pods, but I'm wondering why. Once archiving has happened, why are there any pods in the cluster? I'm questioning why the code exists at all - it will take time to fail to delete 4000 pods that don't exist.
I am not at all worried about the use of goroutines. I am worried about being a good Kubernetes citizen. Elsewhere in the code we attempt to rate limit how hard we push the Kubernetes API. This code change attempts to be selfish and greedy with API usage in order to achieve a goal on behalf of one user, and it does so at the expense of other users in the system. The Kubernetes API is often deliberately rate limited, and often rate limited by external actors. This code change attempts to use as much of that bandwidth as possible for this one thing. On any busy system it must do so at the expense of other users (whether human or computer) of the system. This is what I mean by being a bad citizen. What I would expect if we adopted this code is that we would get issues raised by people who object to this "denial of service" to other parts of their Kubernetes system. They have no option to opt out of this behavior, but I don't think making it optional is the right approach either.
Although making large workflow retries faster is a good goal, I don't think that being selfish about it is the right approach. I think we need to take time to refactor, and perhaps alongside this provide better user feedback on the retry.
After the workflow is archived, there may still be pods in the cluster, if the workflow is not deleted and no pod recycling policy is set.
Thanks, maybe you are right, but I still think the problem needs to be fixed step by step; it is difficult to solve all the problems at once. If the majority of people think that solving the problem requires refactoring this piece of code, I will do it. We really need this feature. In the server, there are still some places that use this method to accelerate. For example: argo-workflows/util/logs/pods-logger.go Line 74 in 8830456
I don't know how you get archiving not to delete the workflow. It would seem very odd to have it in both the cluster and the archive. I also don't think pod GC has any effect here, because pods are owned by the workflow, so they will be deleted. I'm asking for proof that what you're saying is possible; I just don't know a way, but I haven't looked in detail.
@juliev0, @agilgur5 and @isubasinghe have commented in this thread, so I'll ping them here.
I know. In this case it is justified, and is not so evil, because it's only hitting live pods.
I didn't test in latest:
and in archive:
If the pods weren't GC'd, will fetching logs hit all pods?
Yes, I definitely think it should be refactored and would definitely prefer that; that's why I said it. I did choose the "comment" code review option intentionally though -- I don't really have a strong opinion on a "band-aid" like this, but I generally prefer long-term, root cause fixes rather than adding tech debt. If someone is willing to approve, I don't necessarily mind too much, but I will admit it is a bit concerning when multiple PRs happen to the same place (we're counting at least 3 right now) that should all be refactored -- at that point it may even take the same amount of time to refactor.
Regarding the Workflow Archive -- there are race conditions when a Workflow is both in the Archive and still a resource in the cluster. Even in the Archive process itself, the Workflow must exist while being Archived. But also in general, the Archive does not require that a Workflow be deleted, that is up to the TTL/
@shuangkun it looks like you're going to address #12538 and refactor the Server/Controller logic properly? I think that should be a separate PR, as nearly everything here is about the archived retry logic, including your PR description. I would leave this PR in its previous state for posterity and create a new PR.
Refactoring should also modify the manifests -- the Server should no longer need writeable RBAC for Pods, per #12538. The other implementation details listed there are also relevant. Changes to that approach -- especially spec changes -- should be discussed.
OK, thanks, I will open a new PR.
Signed-off-by: shuangkun <tsk2013uestc@163.com>
#12419
Since I accelerated pod deletion during plain workflow retry, it works well and is much faster. I want to speed up the retry of archived workflows too.
Motivation
Speed up retry of archived workflows, so that a large archived workflow (more than 8,000 pods) can be successfully retried within 1 minute.
Modifications
Clean up failed nodes in parallel.
Verification
e2e test.