Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

retryStrategy with nodeAntiAffinity is not working as expected. #9193

Closed
cwichka opened this issue Jul 20, 2022 · 26 comments · Fixed by #12701
Closed

retryStrategy with nodeAntiAffinity is not working as expected. #9193

cwichka opened this issue Jul 20, 2022 · 26 comments · Fixed by #12701
Assignees
Labels
area/controller Controller issues, panics area/retryStrategy Template-level retryStrategy P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority type/bug

Comments

@cwichka
Copy link

cwichka commented Jul 20, 2022

Summary

What happened/what you expected to happen?

I wanted to use retryStrategy with nodeAntiAffinity in order to prevent retrials from running on the same hosts. I was using the following small workflow for testing, but what happens is all retrials were started on the same host (node).

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: random-fail-
spec:
  entrypoint: random
  templates:
  - name: random
    retryStrategy:
      limit: 10
      retryPolicy: "Always"
      affinity:
        nodeAntiAffinity: {}
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import random
        import time
        random.seed(time.time())
        i = random.randint(0, 10)
        print(i)
        exit(i)`

Not sure if this is an expected behavior and in my case i should use RetryNodeAntiAffinity which is, as mentioned in the documentation, is a placeholder for future expansion.

What version are you running?
Tested with v3.2.9 and v3.3.3


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@alexec
Copy link
Contributor

alexec commented Aug 2, 2022

#4679

@alexec
Copy link
Contributor

alexec commented Aug 2, 2022

@goock could you look into this pls?

@goock
Copy link
Contributor

goock commented Aug 3, 2022

@goock could you look into this pls?

I run the example from the ticket, and from above 30K feet, it looks like the controller stopped adding Affinity to the pod's spec, or it's removed at some point, I'll start debugging and find the cause.

@alexec alexec added area/controller Controller issues, panics and removed triage labels Sep 6, 2022
@goock
Copy link
Contributor

goock commented Sep 9, 2022

@alexec I started working on this.
In this function

func FindRetryNode(nodes wfv1.Nodes, nodeID string) *wfv1.NodeStatus {
I'm taking the template from the boundary node and then finding the retry node with the same template, which works for more complicated scenarios but does not work for very simple ones from this ticket - the BoundaryID is empty.

I see four possible solutions to the issue, but I need advice on this:

  1. Modify FindRetryNode; if the BoundaryID is empty, take the template from the node and then find the retry node. I would need confirmation if TemplateName is not empty in all cases or at least when BoundaryID is empty.

  2. Create another traverse function (like this one https://github.com/argoproj/argo-workflows/blob/master/workflow/util/retry/retry.go#L10), walk over the nodes tree and find the retry node. This solution would be clumsy and inefficient.

  3. Add a Parent field to NodeStatus struct and just walk the tree upward to find the retry node. It makes everything super easy, but I'm unsure of possible drawbacks and if we want to introduce another field to the NodeStatus struct.

  4. Another way to find the retry node for the current one which I am not aware of.

@stale

This comment was marked as resolved.

@stale stale bot added the problem/stale This has not had a response in some time label Oct 1, 2022
@cwichka
Copy link
Author

cwichka commented Oct 3, 2022

Hi @goock @alexec, thank you for looking into the issue so far.
I noticed it was automatically marked as stale. Do we have an update on the proposed solutions above please ?

@stale stale bot removed the problem/stale This has not had a response in some time label Oct 3, 2022
@goock
Copy link
Contributor

goock commented Oct 3, 2022

Hi @goock @alexec, thank you for looking into the issue so far. I noticed it was automatically marked as stale. Do we have an update on the proposed solutions above please ?

Hi @cwichka, thanks for replying. We do not have any proposal yet. I do not want to make it too complicated. I would appreciate it if you or someone familiar with the structures could take a look at my comment above and give me some advice and the right direction.

@sarabala1979 sarabala1979 added this to To do in Run The Business (incl. bugs) via automation Oct 20, 2022
@stale

This comment was marked as resolved.

@stale stale bot added the problem/stale This has not had a response in some time label Oct 22, 2022
@remidebette

This comment was marked as resolved.

@stale stale bot removed the problem/stale This has not had a response in some time label Oct 23, 2022
@sarabala1979 sarabala1979 added the P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important label Nov 8, 2022
@sarabala1979
Copy link
Member

@remidebette @cwichka Is it not working for the script template only or all templates (Container)?

@sarabala1979 sarabala1979 added P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority and removed P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important labels Dec 8, 2022
@stale

This comment was marked as resolved.

@stale stale bot added the problem/stale This has not had a response in some time label Dec 31, 2022
@remidebette

This comment was marked as resolved.

@stale stale bot removed the problem/stale This has not had a response in some time label Jan 5, 2023
@stale

This comment was marked as resolved.

@stale stale bot added the problem/stale This has not had a response in some time label Jan 21, 2023
@remidebette
Copy link

Hi, still needed.
Any news on this?

@stale stale bot removed the problem/stale This has not had a response in some time label Jan 30, 2023
@goock
Copy link
Contributor

goock commented Feb 1, 2023

Hi, still needed. Any news on this?

Hey, I can still work on this but need help as described here #9193 (comment)

@caelan-io
Copy link
Member

caelan-io commented Mar 17, 2023

@alexec would you be able to lend us your thoughts on the below questions @goock proposed above?

@3072141364

This comment was marked as resolved.

@stale

This comment was marked as resolved.

@stale stale bot added the problem/stale This has not had a response in some time label Jun 18, 2023
@kevinnoel-be

This comment was marked as resolved.

@stale stale bot removed the problem/stale This has not had a response in some time label Jun 20, 2023
@alexec
Copy link
Contributor

alexec commented Jul 3, 2023

Please don't add new fields to NodeStatus as each new fields needsO(n) extra storage when n is the number of nodes. It is better to traverse status and build a new structure if need. A one-time traverse would bo O(n).

@caelan-io
Copy link
Member

Thanks for the input @alexec !

@goock - do you still have capacity to work on this issue? We're trying to get a sense so we can prioritize it among other issues. Please do let us know if you have more questions or would like further guidance. Thank you.

@goock
Copy link
Contributor

goock commented Jul 5, 2023

@caelan-io I would like to finish this ticket. ATM I'm busy but probably in the next 2-3 weeks I can find a spare time to work on this.
The crucial part of this ticket and implementation is to find an efficient way to find the retry node as in my original question #9193 (comment).
Thanks.

@caelan-io
Copy link
Member

Thanks for the quick update @goock ! That sounds like a plan. Feel free to post if further questions/ clarification needed.

@Interaze
Copy link

Bumping this to hopefully get some headway (also, don't want it to be marked stale again. lol)

@stale

This comment was marked as resolved.

@stale stale bot added the problem/stale This has not had a response in some time label Sep 17, 2023
@terrytangyuan terrytangyuan removed the problem/stale This has not had a response in some time label Sep 20, 2023
@corentingiraud

This comment was marked as spam.

@shuangkun shuangkun self-assigned this Feb 22, 2024
@agilgur5 agilgur5 changed the title nodeAntiAffinity is not working as expected. retryStrategy with nodeAntiAffinity is not working as expected. Feb 26, 2024
@agilgur5 agilgur5 added the area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries label Feb 26, 2024
@agilgur5 agilgur5 added area/retryStrategy Template-level retryStrategy and removed area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries labels Apr 25, 2024
Run The Business (incl. bugs) automation moved this from To do to Done May 17, 2024
terrytangyuan pushed a commit that referenced this issue May 17, 2024
…mpty. Fixes: #9193 (#12701)

Signed-off-by: shuangkun <tsk2013uestc@163.com>
@agilgur5 agilgur5 added this to the v3.5.x patches milestone May 17, 2024
agilgur5 pushed a commit that referenced this issue May 27, 2024
…mpty. Fixes: #9193 (#12701)

Signed-off-by: shuangkun <tsk2013uestc@163.com>
(cherry picked from commit e490d48)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics area/retryStrategy Template-level retryStrategy P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority type/bug