Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

galaxy: make retryable HTTP status codes configurable #82246

Open
wants to merge 1 commit into
base: devel
Choose a base branch
from

Conversation

nejch
Copy link

@nejch nejch commented Nov 20, 2023

SUMMARY

Extends the current list of server HTTP codes to retry client-side during galaxy installs, some of which have been listed in ansible/galaxy#3086 and the issues linked. This doesn't solve the root cause, which AFAIK should be resolved once Galaxy NG is ready. IMO these still make sense to be included by default but open to ideas.

This list is now similar to what we have in python-gitlab, which we've been extending based on feedback from users over the years:
https://github.com/python-gitlab/python-gitlab/blob/main/gitlab/const.py#L117

ISSUE TYPE
  • Bugfix Pull Request
ADDITIONAL INFORMATION

For the full list of Cloudflare 52x status codes, see:
https://developers.cloudflare.com/support/troubleshooting/cloudflare-errors/troubleshooting-cloudflare-5xx-errors/#error-520-web-server-returns-an-unknown-error

/cc @dlouzan @fgreinacher

@ansibot ansibot added bug This issue/PR relates to a bug. needs_triage Needs a first human triage before being processed. labels Nov 20, 2023
@nejch nejch marked this pull request as ready for review November 20, 2023 09:19
@ansibot ansibot added the needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. label Nov 20, 2023
@bcoca bcoca removed the needs_triage Needs a first human triage before being processed. label Nov 21, 2023
@bcoca
Copy link
Member

bcoca commented Nov 21, 2023

First, not all those codes seem to be wise to add to 'retryable' list.

Second, since that can vary by context, why not make this an option instead (defaulting to currrent list), so the user can configure the codes as they see fit?

@ansibot ansibot removed the needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. label Nov 27, 2023
@ansibot ansibot added the stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. label Dec 11, 2023
@nejch nejch force-pushed the fix/galaxy-retry branch 2 times, most recently from e09338e to f31915c Compare January 26, 2024 10:42
@ansibot ansibot removed the stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. label Jan 26, 2024
Comment on lines +51 to +54
try:
retry_codes = [int(code) for code in C.GALAXY_RETRY_HTTP_ERROR_CODES]
except ValueError as e:
raise AnsibleError("Invalid value for HTTP retry code: %s. Only integer values are supported." % e)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this should be handled here at all or if we assume they'll always be validated before. Just added just in case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this seems like too late for the validation. But if you do keep this, make sure to chain the exception cause explicitly.

Suggested change
try:
retry_codes = [int(code) for code in C.GALAXY_RETRY_HTTP_ERROR_CODES]
except ValueError as e:
raise AnsibleError("Invalid value for HTTP retry code: %s. Only integer values are supported." % e)
raise AnsibleError("Invalid value for HTTP retry code: %s. Only integer values are supported." % e) from e

@nejch
Copy link
Author

nejch commented Jan 26, 2024

First, not all those codes seem to be wise to add to 'retryable' list.

Second, since that can vary by context, why not make this an option instead (defaulting to currrent list), so the user can configure the codes as they see fit?

Thanks @bcoca @webknjaz, I added those initially as I saw 520 was on there which also doesn't necessarily mean retryable but I've removed them now 👍

I started adding a simple config item, would you be able to take another look if this is going in the right direction? Wasn't 100% sure on the config approach.

@nejch nejch changed the title galaxy: extend retryable HTTP status codes galaxy: make retryable HTTP status codes configurable Jan 26, 2024
@webknjaz
Copy link
Member

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@Akasurde Akasurde requested a review from bcoca February 8, 2024 05:22
@ansibot ansibot added the stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. label Feb 8, 2024
@nejch
Copy link
Author

nejch commented Feb 28, 2024

/azp run

@webknjaz CI is green here but I'm not sure if there's anything else to for this PR before the next round of review? 🙇



def test_should_retry_error_with_invalid_code(monkeypatch):
expected = "Invalid value for HTTP retry code"
Copy link
Member

@webknjaz webknjaz Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's a regex, it'd be useful to be extra explicit about it:

Suggested change
expected = "Invalid value for HTTP retry code"
expected = r"^Invalid value for HTTP retry code: invalid\. Only integer values are supported\.$"

Comment on lines +1516 to +1518
- "429" # Too many requests
- "520" # Galaxy rate limit error code (Cloudflare unknown error)
- "502" # Common error from galaxy that may represent any number of transient backend issues
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really necessary to use string literals here?

Suggested change
- "429" # Too many requests
- "520" # Galaxy rate limit error code (Cloudflare unknown error)
- "502" # Common error from galaxy that may represent any number of transient backend issues
- 429 # Too many requests
- 520 # Galaxy rate limit error code (Cloudflare unknown error)
- 502 # Common error from galaxy that may represent any number of transient backend issues

@webknjaz
Copy link
Member

@nejch I'm mostly going through the CI queue to make sure no statuses are stuck and the CI is healthy. Sometimes GitHub thinks it didn't receive some status updates and it needs to be triggered again. Hence the restarts.

You'll have to wait for somebody to do a functional review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue/PR relates to a bug. stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants