Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: delete pods in parallel to speed up retryworkflow #12419

Merged
merged 1 commit into from
Jan 10, 2024

Conversation

shuangkun
Copy link
Member

I have a big worklfow. When it failed and retry, it will be slowly. Latency increases with the number of pods. Always letting my automated system(interact with argo server use https) time out.

time="2023-12-26T13:02:01.560Z" level=info duration=1m18.730806898s method=PUT path=/api/v1/workflows/default/catch-workflow-8c48s/retry size=194 status=409

I found that the main time spent was deleting pods.

time="2023-12-26T13:02:00.126Z" level=info msg="Deleting pod" podDeleted=catch-workflow-8c48s-read-ligands-from-vswdb-op-bmvou-9dzxr-2538960512
time="2023-12-26T13:02:00.328Z" level=info msg="Deleting pod" podDeleted=catch-workflow-8c48s-read-ligands-from-vswdb-op-bmvou-9dzxr-2824246174
time="2023-12-26T13:02:00.527Z" level=info msg="Deleting pod" podDeleted=catch-workflow-8c48s-read-ligands-from-vswdb-op-bmvou-9dzxr-3817503941
time="2023-12-26T13:02:00.731Z" level=info msg="Deleting pod" podDeleted=catch-workflow-8c48s-read-ligands-from-vswdb-op-bmvou-9dzxr-1699455227
time="2023-12-26T13:02:00.925Z" level=info msg="Deleting pod" podDeleted=catch-workflow-8c48s-read-ligands-from-vswdb-op-bmvou-9dzxr-4168132452
time="2023-12-26T13:02:01.133Z" level=info msg="Deleting pod" podDeleted=catch-workflow-8c48s-read-ligands-from-vswdb-op-bmvou-9dzxr-1982889595

So I want make deleting pods parallel.

Motivation

Modifications

Verification

@shuangkun shuangkun force-pushed the feature/speedUpRetry branch 2 times, most recently from b0861ee to 5ebfed5 Compare December 27, 2023 09:21
Co-authored-by: shuangkun <tsk2013uestc@163.com>
Co-authored-by: sherwinkoo29 <sherwinkoo@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
@shuangkun shuangkun closed this Dec 27, 2023
@shuangkun shuangkun reopened this Dec 27, 2023
@shuangkun shuangkun closed this Dec 27, 2023
@shuangkun shuangkun reopened this Dec 27, 2023
@juliev0 juliev0 added the prioritized-review For members of the Sustainability Effort label Jan 9, 2024
@juliev0 juliev0 self-assigned this Jan 9, 2024
@juliev0
Copy link
Contributor

juliev0 commented Jan 10, 2024

I had no idea the Server deleted pods..that's surprising...anyway, LGTM!

@juliev0 juliev0 merged commit 2bb770e into argoproj:main Jan 10, 2024
68 of 75 checks passed
@shuangkun
Copy link
Member Author

I had no idea the Server deleted pods..that's surprising...anyway, LGTM!

Thank you so much!

@agilgur5 agilgur5 added area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries area/server labels Jan 17, 2024
@agilgur5
Copy link
Member

agilgur5 commented Jan 17, 2024

I had no idea the Server deleted pods..that's surprising...

@juliev0 that makes at least 3 of us as neither Terry nor I expected that either 😅 😕

I mentioned this in #12105 (comment), I don't think this logic should belong in the Server as it breaks the separation of duties between Controller and Server right now; this makes the Server necessary for certain operations (retry in this case). Outside of this functionality, the Server and Controller are independent, with the Server mainly signaling to the Controller (via labels) and not strictly required for any operations.

Bala agreed with me there as well and multiple people have been surprised by the current behavior, so I do think we have consensus that this should be refactored

@juliev0
Copy link
Contributor

juliev0 commented Jan 17, 2024

I had no idea the Server deleted pods..that's surprising...

@juliev0 that makes at least 3 of us as neither Terry nor I expected that either 😅 😕

I mentioned this in #12105 (comment), I don't think this logic should belong in the Server as it breaks the separation of duties between Controller and Server right now; this makes the Server necessary for certain operations (retry in this case). Outside of this functionality, the Server and Controller are independent, with the Server mainly signaling to the Controller (via labels) and not strictly required for any operations.

Bala agreed with me there as well and multiple people have been surprised by the current behavior, so I do think we have consensus that this should be refactored

I completely agree and wanted to make the same judgment actually.

@agilgur5
Copy link
Member

agilgur5 commented Jan 17, 2024

I'll file a tracking issue for it for now as it does require some careful refactoring. EDIT: Created #12538

@shuangkun
Copy link
Member Author

Yes, I agree. I also had some doubts at the time, but at that time I was just focused on solving customer problems.
If there's anything I can do, I'm willing to do it.

@agilgur5
Copy link
Member

agilgur5 commented Jan 26, 2024

Feel free to take on the issue if you'd like, otherwise I might get to it eventually (it's not the highest priority)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries area/server prioritized-review For members of the Sustainability Effort
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants