feat(url_helper): Retry on 503 error #5578

Open
wants to merge 3 commits into
base: main

holmanb
Member

@holmanb holmanb commented Aug 2, 2024

If the server is busy, no need to fail.
Add type hints to adjacent code paths.

Fixes GH-5577

Proposed Commit Message

feat(url_helper): Retry on 503 error

If the server is busy, no need to fail.
Add type hints to adjacent code paths.

Fixes GH-5577

Additional Context

Fixes #5577

Test Steps

Merge type

  • Squash merge using "Proposed Commit Message"
  • Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)

Member

@TheRealFalcon TheRealFalcon left a comment

See the inline comment

"IMDS returned 503 error code. Retrying in %s",
current_sleep_time,
)
elif isinstance(response, UrlResponse):
Member

@TheRealFalcon TheRealFalcon Aug 2, 2024

It looks like the old code was making some bad assumptions...particularly that response is a UrlResponse even though it could also be UrlError if an error occurred. It also looks like the callers of this code aren't really doing anything to handle exceptions. Previously if we did get a UrlError, the old response.contents would throw an error due to contents not being a member of UrlError. It was raising the wrong error obviously, but it still wound up having the same effect as the requests' raise_for_status().

With this change, though, if we get an error that isn't a 503, we'll retry until we time out, even if it's a non-recoverable error. I think we should instead raise the UrlError if one is found and it is not a 503.
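
To make the suggestion concrete, here is a minimal sketch of the behaviour I have in mind. It is not the actual url_helper code: the `UrlError` class below is a stand-in that only models a status code, and `read_with_retry`, `fetch`, `sleep`, and `clock` are hypothetical names used for illustration.

```python
import time
from typing import Callable, Optional


class UrlError(Exception):
    """Stand-in for the real UrlError; only the status code is modelled."""

    def __init__(self, message: str, code: Optional[int] = None):
        super().__init__(message)
        self.code = code


def read_with_retry(
    fetch: Callable[[], bytes],
    max_wait: float,
    sleep_time: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bytes:
    """Call fetch() until it succeeds, retrying only on HTTP 503."""
    deadline = clock() + max_wait
    while True:
        try:
            return fetch()
        except UrlError as error:
            if error.code != 503:
                # Non-recoverable: surface it immediately instead of looping.
                raise
            if clock() + sleep_time > deadline:
                # Out of time: give up and report the last 503.
                raise
            sleep(sleep_time)
```

Injecting `sleep` and `clock` is only there so the loop can be exercised in a test without real delays.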

Member Author

Good point, thanks for catching that @TheRealFalcon!

Member Author

> It looks like the old code was making some bad assumptions...particularly that response is a UrlResponse even though it could also be UrlError if an error occurred.
> ...
> Previously if we did get a UrlError, the old response.contents would throw an error due to contents not being a member of UrlError. It was raising the wrong error obviously, but it still wound up having the same effect as the requests' raise_for_status().

This isn't correct. Previously, only UrlResponse was returned. See the change on line 755.

> It also looks like the callers of this code aren't really doing anything to handle exceptions.

Exceptions are optionally handled by the function passed in via exception_cb. Note the line in the original bug report:

2024-07-24 08:13:06,801 - DataSourceEc2.py[WARNING]: Fatal error while requesting Ec2 IMDSv2 API tokens

Looking into this a bit further, I think we actually need to change the function that emitted this log so that 503 errors are retried.
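
Roughly, the idea would look something like the sketch below. This is only an illustration under assumptions: the callback name, its single-argument signature, its log messages, and the "return True to keep retrying, raise to abort" contract are all hypothetical and are not taken from the real code, and `UrlError` is a stand-in that only carries a status code.

```python
import logging
from typing import Optional

LOG = logging.getLogger(__name__)


class UrlError(Exception):
    """Stand-in for the real UrlError; only the status code is modelled."""

    def __init__(self, message: str, code: Optional[int] = None):
        super().__init__(message)
        self.code = code


def token_exception_cb(error: Exception) -> bool:
    """Hypothetical callback: return True to keep retrying, raise to abort."""
    code = getattr(error, "code", None)
    if code == 503:
        # The server is busy, which is expected to be temporary.
        LOG.debug("Token request returned 503; will retry")
        return True
    if code is not None and code >= 400:
        # Any other HTTP error is treated as fatal in this sketch.
        LOG.warning("Token request failed with HTTP %s; giving up", code)
        raise error
    # No status code at all (e.g. a connection problem): keep retrying.
    return True
```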

@TheRealFalcon TheRealFalcon self-assigned this Aug 2, 2024
Contributor

@aciba90 aciba90 left a comment

Thanks, @holmanb, for fixing this.

503 responses are recommended to contain a Retry-After header containing an estimated time for the recovery of the service.

Do we want to be good citizens and build logic to respect that recommended wait-time if present and not contribute to the server overload?
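
As a rough sketch of what that could look like: Retry-After may carry either a number of seconds or an HTTP date, so both forms need handling. The function name, the `default` and `max_wait` parameters, and the plain-dict header lookup below are assumptions for illustration, not existing cloud-init code.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_after_seconds(
    headers: Mapping[str, str],
    default: float,
    max_wait: float = 300.0,
    now: Optional[datetime] = None,
) -> float:
    """Return how long to sleep before retrying, honouring Retry-After."""
    value = headers.get("Retry-After")
    if value is None:
        return default
    value = value.strip()
    if value.isascii() and value.isdigit():
        # Delay form: a non-negative number of seconds.
        seconds = float(value)
    else:
        # Date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        try:
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return default  # Unparseable header: fall back to our own delay.
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        current = now if now is not None else datetime.now(timezone.utc)
        seconds = (when - current).total_seconds()
    # Never sleep a negative amount, and cap what the server can ask for.
    return min(max(seconds, 0.0), max_wait)
```

The cap is a design choice in this sketch so that a very large value from the server cannot stall boot indefinitely. A real implementation would also want a case-insensitive header lookup.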

Member

TheRealFalcon commented Aug 2, 2024

Is it also possible to get unit tests showing contents return correctly on 200, retry happens on 503, and exception is raised on anything else? I'm guessing one or two of those already exist, but we should test the 503 case.
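
Something along these lines is what I mean, written with unittest.mock. The `fetch_contents` function and the `UrlError` class here are small hypothetical stand-ins so the example runs on its own; the real tests would target the actual function in the PR.

```python
import unittest
from unittest import mock


class UrlError(Exception):
    """Stand-in error type that only carries an HTTP status code."""

    def __init__(self, code):
        super().__init__("HTTP %s" % code)
        self.code = code


def fetch_contents(readurl, sleep, attempts=3):
    """Hypothetical function under test: retry on 503, raise otherwise."""
    for attempt in range(attempts):
        try:
            return readurl().contents
        except UrlError as error:
            if error.code != 503 or attempt == attempts - 1:
                raise
            sleep(1)


class TestFetchContents(unittest.TestCase):
    def test_200_returns_contents(self):
        readurl = mock.Mock(return_value=mock.Mock(contents=b"token"))
        self.assertEqual(b"token", fetch_contents(readurl, sleep=mock.Mock()))
        self.assertEqual(1, readurl.call_count)

    def test_503_is_retried(self):
        ok = mock.Mock(contents=b"token")
        # First call raises a 503, second call succeeds.
        readurl = mock.Mock(side_effect=[UrlError(503), ok])
        sleep = mock.Mock()
        self.assertEqual(b"token", fetch_contents(readurl, sleep))
        self.assertEqual(2, readurl.call_count)
        sleep.assert_called_once_with(1)

    def test_other_error_is_raised(self):
        readurl = mock.Mock(side_effect=UrlError(404))
        sleep = mock.Mock()
        with self.assertRaises(UrlError):
            fetch_contents(readurl, sleep)
        sleep.assert_not_called()
```

Saved as a module, these can be run with `python -m unittest`.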


Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Aug 30, 2024
@github-actions github-actions bot closed this Sep 6, 2024
@holmanb holmanb reopened this Sep 6, 2024
@holmanb holmanb removed the stale-pr Pull request is stale; will be auto-closed soon label Sep 6, 2024

github-actions bot commented Oct 5, 2024

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Oct 5, 2024
@holmanb holmanb added wip Work in progress, do not land and removed stale-pr Pull request is stale; will be auto-closed soon labels Oct 7, 2024
Labels
wip Work in progress, do not land
Development

Successfully merging this pull request may close these issues.

Cloud-init fails on AWS if IMDSv2 returns a 503 error.
4 participants