Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: nodeAntiAffinity is not working as expected when boundaryID is empty. Fixes: #9193 #12701

Merged
merged 9 commits into from
May 17, 2024

Conversation

shuangkun
Copy link
Member

@shuangkun shuangkun commented Feb 26, 2024

Fixes #9193

Motivation

The root cause is the workflow's boundaryID is empty. So did't find the retry node to fetch the HostNodeName.
So I want to find the retry node and fetch the HostNodeName as the new pod‘s nodeAntiAffinity.

Modifications

Find the correct retrynode when boundaryID is empty.

Verification

e2e tests and ut

@shuangkun shuangkun marked this pull request as draft February 26, 2024 14:40
@shuangkun shuangkun added area/controller Controller issues, panics area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries and removed area/controller Controller issues, panics labels Feb 26, 2024
@shuangkun shuangkun force-pushed the fix/TakeEffectNodeAntiAffinity branch 2 times, most recently from 78a0641 to 3b5265c Compare February 27, 2024 03:28
@shuangkun shuangkun marked this pull request as ready for review February 27, 2024 05:02
@shuangkun shuangkun added the area/controller Controller issues, panics label Mar 8, 2024
@isubasinghe isubasinghe self-assigned this Mar 11, 2024
@shuangkun shuangkun added the prioritized-review For members of the Sustainability Effort label Apr 1, 2024
github-merge-queue bot pushed a commit to linz/topo-workflows that referenced this pull request Apr 11, 2024
#### Motivation

We want every task within our workflows to retry (twice) automatically
and on a different node/host (network issues) when they fail.

#### Modification

Configure the `retryStrategy` at the `workflowDefaults` level.
Notes: 

- the `nodeAntiAffinity` to prioritize retrying on a different node/host
is not working due to an issue in the Argo Workflows system. A [PR is
opened](argoproj/argo-workflows#12701) in the
Argo repo
- this change will make every task retrying on failure. Some of our
tasks (for example, `stac-validate`, `tileindex-validate`) are expected
to fail if the system invalidate something. They will still retry twice
in that case. This would be handle in a separate PR.

#### Checklist

- [ ] Tests updated NA
- [x] Docs updated 
- [x] Issue linked in Title
@shuangkun shuangkun force-pushed the fix/TakeEffectNodeAntiAffinity branch from f0ddd52 to a8bd167 Compare April 12, 2024 01:31
@shuangkun shuangkun force-pushed the fix/TakeEffectNodeAntiAffinity branch from b1c6679 to f016cba Compare April 12, 2024 03:04
@shuangkun shuangkun closed this Apr 12, 2024
@shuangkun shuangkun reopened this Apr 12, 2024
github-merge-queue bot pushed a commit to linz/topo-workflows that referenced this pull request Apr 18, 2024
…545)

#### Motivation

We've noticed that karpenter is erroring since
#506. Errors makes us
thinking that it could be related. Since the retry on a new node using
`nodeAntiAffinity` [is not
working](argoproj/argo-workflows#12701), it's a
good idea to remove it to see if it solved `karpenter` issues.

#### Modification

Remove the `affinity` in the `retryStrategy`.

#### Checklist

- [ ] Tests updated
- [x] Docs updated
- [x] Issue linked in Title
@agilgur5 agilgur5 added area/retryStrategy Template-level retryStrategy and removed area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries labels Apr 25, 2024
test/e2e/retry_test.go Show resolved Hide resolved
workflow/controller/operator_test.go Outdated Show resolved Hide resolved
@shuangkun shuangkun force-pushed the fix/TakeEffectNodeAntiAffinity branch 5 times, most recently from bf88ddf to 66c932d Compare May 8, 2024 11:44
@shuangkun shuangkun force-pushed the fix/TakeEffectNodeAntiAffinity branch 2 times, most recently from efb2600 to 0e4a9d6 Compare May 9, 2024 08:07
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
@terrytangyuan terrytangyuan merged commit e490d48 into argoproj:main May 17, 2024
28 checks passed
@agilgur5 agilgur5 added this to the v3.5.x patches milestone May 17, 2024
agilgur5 pushed a commit that referenced this pull request May 27, 2024
…mpty. Fixes: #9193 (#12701)

Signed-off-by: shuangkun <tsk2013uestc@163.com>
(cherry picked from commit e490d48)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics area/retryStrategy Template-level retryStrategy prioritized-review For members of the Sustainability Effort
Projects
None yet
Development

Successfully merging this pull request may close these issues.

retryStrategy with nodeAntiAffinity is not working as expected.
5 participants