increase nodeadm network resilience #1699

Merged
merged 1 commit into from Mar 4, 2024

@ndbaker1 ndbaker1 commented Mar 1, 2024

Issue #, if available:

Description of changes:

increase retries and total retry window for imds client

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Testing Done

See this guide for recommended testing for PRs. Some tests may not apply. Completing tests and providing additional validation steps are not required, but it is recommended and may reduce review time and time to merge.

@ndbaker1 ndbaker1 force-pushed the imds-resilience branch 4 times, most recently from 491ee9c to f29fbe1 Compare March 1, 2024 19:21
Comment on lines +36 to +37
so.MaxAttempts = 15
so.MaxBackoff = 1 * time.Second
Member Author

by default imds.New uses the standard retryer but sets MaxBackoff to 1 second and leaves MaxAttempts at its default of 3, so we're just mimicking that and bumping the attempts

ndbaker1 commented Mar 1, 2024

/ci

github-actions bot commented Mar 1, 2024

@ndbaker1 roger that! I've dispatched a workflow. 👍

github-actions bot commented Mar 1, 2024

@ndbaker1 the workflow that you requested has completed. 🎉

| AMI variant | Build | Test |
|---|---|---|
| 1.23 / al2 | success ✅ | failure ❌ |
| 1.23 / al2023 | success ✅ | success ✅ |
| 1.24 / al2 | success ✅ | failure ❌ |
| 1.25 / al2 | success ✅ | success ✅ |
| 1.25 / al2023 | success ✅ | success ✅ |
| 1.26 / al2 | success ✅ | success ✅ |
| 1.26 / al2023 | success ✅ | success ✅ |
| 1.27 / al2 | success ✅ | failure ❌ |
| 1.27 / al2023 | success ✅ | success ✅ |
| 1.28 / al2 | success ✅ | success ✅ |
| 1.28 / al2023 | success ✅ | success ✅ |
| 1.29 / al2 | success ✅ | failure ❌ |
| 1.29 / al2023 | success ✅ | success ✅ |

@awslabs awslabs deleted a comment from github-actions bot Mar 1, 2024
ndbaker1 commented Mar 1, 2024

/ci

github-actions bot commented Mar 1, 2024

@ndbaker1 roger that! I've dispatched a workflow. 👍

github-actions bot commented Mar 1, 2024

@ndbaker1 the workflow that you requested has completed. 🎉

| AMI variant | Build | Test |
|---|---|---|
| 1.23 / al2 | success ✅ | success ✅ |
| 1.23 / al2023 | success ✅ | success ✅ |
| 1.24 / al2 | success ✅ | success ✅ |
| 1.24 / al2023 | success ✅ | success ✅ |
| 1.25 / al2 | success ✅ | success ✅ |
| 1.25 / al2023 | success ✅ | success ✅ |
| 1.26 / al2 | success ✅ | success ✅ |
| 1.26 / al2023 | success ✅ | success ✅ |
| 1.27 / al2 | success ✅ | success ✅ |
| 1.27 / al2023 | success ✅ | success ✅ |
| 1.28 / al2 | success ✅ | success ✅ |
| 1.28 / al2023 | success ✅ | success ✅ |
| 1.29 / al2 | success ✅ | success ✅ |
| 1.29 / al2023 | success ✅ | success ✅ |

ndbaker1 commented Mar 1, 2024

bumped some account limits and reran the ci :)

Comment on lines 10 to 11
After=network-online.service
Wants=network-online.service
Member

I'm curious how much of a bottleneck this will create in the systemd unit tree. Can you grab a before and after?

systemd-analyze plot > tree.svg

Member

I guess I'm mostly interested in whether cloud-init.service already has some kind of dependency on one of the network targets:

systemctl list-dependencies cloud-init

Member Author

silly me, I swapped network-online.target with network-online.service 🙃.
getting the data now, will follow up offline

@cartermckinnon

@Issacwww I don't disagree, we'd have to move the definition for this dimension of the matrix out of the workflow file into a variable passed by the bot. I'll see what I can do 👍

* increase retries and total retry window for imds client

* set nodeadm config step to wait for `network-online`

ndbaker1 commented Mar 4, 2024

network-online.target is already ordered after cloud-init.service (cloud-init must start before the target is reached), so making the nodeadm config step wait on the target introduces an ordering conflict

$ systemctl show network-online.target -p After
After=network.target systemd-networkd-wait-online.service cloud-init.service

so per the network online docs we should just exercise normal retries. this PR should be more resilient than the existing AL2 retry, which is 10 tries with a linear 1s backoff
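The ordering conflict described above can be modeled as a cycle in the `After=` graph. The sketch below is illustrative only: the `network-online.target` edges come from the `systemctl show` output in this comment, while the unit name `nodeadm-config.service` and its ordering before `cloud-init.service` are assumptions that this thread does not state explicitly.

```go
package main

import "fmt"

// hasOrderingCycle reports whether the "After=" graph contains a cycle.
// after[u] lists the units that u must be started after.
func hasOrderingCycle(after map[string][]string) bool {
	const (
		unvisited = iota // zero value for units not yet seen
		inProgress
		done
	)
	state := map[string]int{}
	var visit func(u string) bool
	visit = func(u string) bool {
		switch state[u] {
		case inProgress:
			return true // reached a unit that is still on the current path
		case done:
			return false
		}
		state[u] = inProgress
		for _, dep := range after[u] {
			if visit(dep) {
				return true
			}
		}
		state[u] = done
		return false
	}
	for u := range after {
		if visit(u) {
			return true
		}
	}
	return false
}

func main() {
	after := map[string][]string{
		// Taken from the `systemctl show network-online.target -p After` output.
		"network-online.target": {"network.target", "systemd-networkd-wait-online.service", "cloud-init.service"},
		// Assumption for illustration: cloud-init.service is ordered after the nodeadm config unit.
		"cloud-init.service": {"nodeadm-config.service"},
	}
	fmt.Println("cycle without the new dependency:", hasOrderingCycle(after))

	// The change that was tried: make the nodeadm config unit wait for network-online.target.
	after["nodeadm-config.service"] = []string{"network-online.target"}
	fmt.Println("cycle with the new dependency:   ", hasOrderingCycle(after))
}
```

If the assumed ordering holds, the extra dependency closes a loop (nodeadm config, then cloud-init, then network-online, then nodeadm config again), which is why leaning on client-side retries is the simpler option.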

@ndbaker1 ndbaker1 merged commit 8190adb into awslabs:main Mar 4, 2024
10 checks passed
@ndbaker1 ndbaker1 deleted the imds-resilience branch March 4, 2024 19:44