
Add exponential backoff and retry functionality to rest_client #904

Closed
wants to merge 14 commits into from

Conversation

jp-harvey
Contributor

While working with the Atlassian Cloud API and attempting to optimize the time it takes to do many operations, it's common to hit the Atlassian API request limits, resulting in annoying negative engineering.

This PR adds a convenience option to the underlying REST client, turned off by default, to retry using exponential backoff.
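
For anyone skimming the thread, here is a minimal sketch of how the option is intended to be used. The backoff parameter names (backoff_and_retry, max_backoff_retries, max_backoff_seconds) come from this PR's diff; the client class, credentials and the default values shown are only illustrative:

    from atlassian import Jira

    jira = Jira(
        url="https://your-instance.atlassian.net",
        username="user@example.com",
        password="api-token",
        cloud=True,
        backoff_and_retry=True,     # off by default; enables exponential backoff on rate-limit errors
        max_backoff_retries=10,     # illustrative value: give up after this many retries
        max_backoff_seconds=60,     # illustrative value: cap on the sleep between retries
    )

    # Requests made through this client now back off and retry on 429 responses.
    me = jira.get("rest/api/2/myself")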

@jp-harvey
Contributor Author

@gonchik I haven't tested this quite enough yet for it to lose the [WIP] tag; I'll update after I've done so.

@codecov-commenter

Codecov Report

Merging #904 (c25ea52) into master (4d90a20) will increase coverage by 0.03%.
The diff coverage is 47.82%.


@@            Coverage Diff             @@
##           master     #904      +/-   ##
==========================================
+ Coverage   36.19%   36.22%   +0.03%     
==========================================
  Files          32       32              
  Lines        6335     6357      +22     
  Branches      978      983       +5     
==========================================
+ Hits         2293     2303      +10     
- Misses       3934     3945      +11     
- Partials      108      109       +1     
Impacted Files Coverage Δ
atlassian/rest_client.py 65.68% <47.82%> (-3.03%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@gonchik
Member

gonchik commented Dec 25, 2021

@jp-harvey I checked a few options; this is a great step for the rest client module :)
Thank you for your initiative, I hope it will be done soon.

@flichtenheld
Contributor

Another change I made was to add these new options to BitbucketBase._new_session_args. At least my assumption was that if you set backoff_and_retry=True when creating the Cloud object you would expect all child objects to inherit the setting.
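
For illustration, a toy sketch of that idea (the real change is in the commit linked later in this thread; this is not that code, and the only attribute names taken from this PR are the backoff options themselves):

    class BitbucketBaseSketch:
        """Stand-in for BitbucketBase, showing only the inheritance of the backoff options."""

        def _new_session_args(self, **kwargs):
            # Pass the backoff settings down so every child object created from this one
            # (e.g. workspaces, repositories) retries the same way as the parent Cloud object.
            kwargs.update(
                backoff_and_retry=self.backoff_and_retry,
                max_backoff_retries=self.max_backoff_retries,
                max_backoff_seconds=self.max_backoff_seconds,
                retry_error_matches=self.retry_error_matches,
            )
            return kwargs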

@flichtenheld
Contributor

The actual functionality looks good to me. I have a Jenkins job that runs into the rate limiting, and this patch together with my modifications seems to solve the problem.

Co-authored-by: Frank Lichtenheld <frank@lichtenheld.com>
@jp-harvey
Contributor Author

Great suggestion @flichtenheld. I've been working with Jira and not with Confluence, so thanks for adding that in; glad it helped you out :-)

@jp-harvey
Contributor Author

> Another change I made was to add these new options to BitbucketBase._new_session_args. At least my assumption was that if you set backoff_and_retry=True when creating the Cloud object you would expect all child objects to inherit the setting.

@flichtenheld do you want to add those as a review to this PR as well?

@flichtenheld
Contributor

> Another change I made was to add these new options to BitbucketBase._new_session_args. At least my assumption was that if you set backoff_and_retry=True when creating the Cloud object you would expect all child objects to inherit the setting.

> @flichtenheld do you want to add those as a review to this PR as well?

Hmm, not sure what would be the best way to do that given that they are in a completely different file that is not touched by the PR. But you can see them here: flichtenheld@f94cef6

atlassian/rest_client.py
files=files,
proxies=self.proxies,
)
responseloop = False
Contributor

Always setting it to False here and then to True in line 295 is confusing. Remove lines 284 and 295.

Contributor Author

@Spacetown I think that would cause an infinite loop if self.backoff_and_retry is set to False, which is the default.

If it's confusing, is there another way we could implement it?

Now that you mention it, I'm also looking at line 288, and I think that could probably be removed, although I put it in there to future-proof any other logic changes relating to responseloop that might change the default (i.e. making the logic explicit), so I would probably vote to leave it in.

Contributor

Then initialize it with True and set it to False on success or on reaching the max backoff retries.
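
Roughly, the structure being suggested (a sketch only, not the PR's code; _matches_retry_error is a hypothetical stand-in for the status/reason comparison against retry_error_matches):

    responseloop = True
    while responseloop:
        response = self._session.request(...)  # unchanged request call
        if not (self.backoff_and_retry and _matches_retry_error(response)):
            responseloop = False  # success, or an error we don't retry on
        elif retries > self.max_backoff_retries:
            log.warning("Hit max backoff retry limit of {0}, no more retries.".format(self.max_backoff_retries))
            responseloop = False  # retry budget spent, give up
        else:
            log.warning("Backing off for {0}s".format(backoff))
            time.sleep(backoff)
            backoff = min(backoff * 2, self.max_backoff_seconds)
            retries += 1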

@mykytoleg

Hi everyone, what are the requirements for this pull request to be merged? Do you have any ETA?

@Spacetown
Contributor

@mykytoleg please check the review remarks.

@jp-harvey jp-harvey changed the title [WIP] Add exponential backoff and retry functionality to rest_client Add exponential backoff and retry functionality to rest_client Aug 1, 2022
@jp-harvey
Contributor Author

@gonchik there has been no additional feedback from reviewers for some time following the changes and responses, so I think this one is probably good to go?

Contributor

@Spacetown left a comment

Oops, forgot to submit the review.

files=files,
proxies=self.proxies,
)
responseloop = False
Contributor

Always setting it to False here and then to True in line 295 is confusing. Remove lines 284 and 295.

Member

@jp-harvey do you have time to update this PR?

atlassian/rest_client.py
files=files,
proxies=self.proxies,
)
responseloop = False
Contributor

Then initialize it with True and set it to False on success or on reaching the max backoff retries.

for em in self.retry_error_matches:
if retries > self.max_backoff_retries:
log.warning("Hit max backoff retry limit of {0}, no more retries.".format(self.max_backoff_retries))
responseloop = False
Contributor

This line isn't needed because you use break to exit the loop.

responseloop = False
if self.backoff_and_retry:
for em in self.retry_error_matches:
if retries > self.max_backoff_retries:
Contributor

This block should be moved in front of the loop.

@gonchik
Member

gonchik commented Aug 23, 2022

@jp-harvey @flichtenheld can we update the PR and merge it?

@jp-harvey
Contributor Author

jp-harvey commented Aug 24, 2022

> @jp-harvey @flichtenheld can we update the PR and merge it?

@gonchik I've made some of the changes, the others I am not comfortable making without doing testing and I don't have a good way to test at the moment. Functionally it works as it is and has been tested many times in real-life scenarios, it's just the code structure that is the concern now.

EDIT: looks like there's a linting issue as well which should be easy enough to sort out.

@Spacetown
Contributor

Spacetown commented Sep 2, 2022

> @jp-harvey @flichtenheld can we update the PR and merge it?

> @gonchik I've made some of the changes, the others I am not comfortable making without doing testing and I don't have a good way to test at the moment. Functionally it works as it is and has been tested many times in real-life scenarios, it's just the code structure that is the concern now.

> EDIT: looks like there's a linting issue as well which should be easy enough to sort out.

It works, but it's weird and not easy to understand. If the max backoff is reached but the response was a success, a warning is logged.

        backoff = 1
        retries = 0
        responseloop = True
        while responseloop:
            response = self._session.request(
                method=method,
                url=url,
                headers=headers,
                data=data,
                json=json,
                timeout=self.timeout,
                verify=self.verify_ssl,
                files=files,
                proxies=self.proxies,
            )
            if self.backoff_and_retry:
                for em in self.retry_error_matches:
                    if response.status_code == em[0] and response.reason == em[1]:
                        if retries > self.max_backoff_retries:
                            log.warning("Hit max backoff retry limit of {0}, no more retries.".format(self.max_backoff_retries))
                            responseloop = False
                        else:
                            log.warning('Backing off due to error "{0}: {1}" for {2}s'.format(em[0], em[1], backoff))
                            time.sleep(backoff)
                            backoff = backoff * 2 if backoff * 2 < self.max_backoff_seconds else self.max_backoff_seconds
                            retries += 1
                        break  # for loop
            else:
                responseloop = False

@Spacetown
Contributor

Spacetown commented Sep 2, 2022

The for loop and the if can also be combined using any() together with comparing tuples.

            if self.backoff_and_retry:
                if any([(response.status_code, response.reason) == em for em in self.retry_error_matches]):
                    if retries > self.max_backoff_retries:
                        log.warning("Hit max backoff retry limit of {0}, no more retries.".format(self.max_backoff_retries))
                        responseloop = False
                    else:
                        # random.random() adds jitter (requires "import random" at module level)
                        current_backoff = backoff + (random.random() * backoff / 10)
                        log.warning('Backing off due to error "{0}: {1}" for {2}s'.format(response.status_code, response.reason, current_backoff))
                        time.sleep(current_backoff)
                        backoff = min(backoff * 2, self.max_backoff_seconds)
                        retries += 1
            else:
                responseloop = False

Contributor

@Spacetown left a comment

I would use only break to finish the loop. At the moment this is mixed: in lines 310-311 the flag requests a new iteration, but a break is used, which doesn't make sense.
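
i.e. something along these lines (a break-only sketch, not the PR's code; _matches_retry_error is again a hypothetical stand-in for the status/reason comparison):

    while True:
        response = self._session.request(...)  # unchanged request call
        if not (self.backoff_and_retry and _matches_retry_error(response)):
            break  # success, or an error we don't retry on
        if retries > self.max_backoff_retries:
            log.warning("Hit max backoff retry limit of {0}, no more retries.".format(self.max_backoff_retries))
            break  # retry budget spent, give up
        time.sleep(backoff)
        backoff = min(backoff * 2, self.max_backoff_seconds)
        retries += 1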

@@ -60,7 +60,46 @@ def __init__(
cloud=False,
proxies=None,
token=None,
backoff_and_retry=False,
retry_error_matches=[(429, "Too Many Requests"),
(429, "Unknown Status Code")],
Contributor

Fix indentation.

for em in self.retry_error_matches:
if retries > self.max_backoff_retries:
log.warning("Hit max backoff retry limit of {0}, no more retries.".format(self.max_backoff_retries))
responseloop = False
Contributor

This line isn't needed since it's set to false directly after the response.

gonchik and others added 3 commits July 29, 2023 00:39
Co-authored-by: Michael Förderer <michael.foerderer@gmx.de>
Co-authored-by: Michael Förderer <michael.foerderer@gmx.de>
Co-authored-by: Michael Förderer <michael.foerderer@gmx.de>
@Spacetown
Contributor

Ok, now I see that there are two nested loops...

Contributor

@Spacetown left a comment

The suggested code only sets responseloop to False if the loop shall be exited.

atlassian/rest_client.py
files=files,
proxies=self.proxies,
)
responseloop = False
Member

@jp-harvey do you have time to update this PR?

@gonchik gonchik assigned gonchik and jp-harvey and unassigned gonchik Aug 18, 2023
@gonchik
Member

gonchik commented Aug 18, 2023

@jp-harvey can we finalize it?

@jp-harvey
Contributor Author

> @jp-harvey can we finalize it?

Hi @gonchik I'm at a point where it's no longer 100% clear to me what the issues are or what to change. I also don't have a good way to test at the moment, so making changes and testing is going to be high-friction.

I've also since discovered that the requests library has backoff and retry capability built into sessions, so that's probably a better option all round.

I think we have the following options:

  1. Abandon this PR
  2. Merge it (mostly?) as-is
  3. Someone other than me modifies the code, tests, and once approved it can be merged
  4. I modify the existing code, test, and once approved it gets merged
  5. I resubmit the PR using the retry capability of sessions

Unfortunately I won't have the bandwidth to do it for a while and don't have an easy way to test it at the moment, so 4 and 5 won't be able to happen soon.

Sorry about this. I know the code works as it is now despite not being perfect, and I respect that it's considered not suitable for merging. By the time the PR was reviewed we had passed the window in which I was able to keep working on it, and I don't want to make blind changes now that may break it. I may be able to in the future, but I would probably do option 5 before 4 because it looks like the more elegant option. I'm also fine with options 1-3 if that's the best thing for this project. Let me know what your preference is.

Comment on lines +64 to +65
retry_error_matches=[(429, "Too Many Requests"),
(429, "Unknown Status Code")],


503 also deserves a retry IMO, since Jira Cloud does throw it from time to time.

Posting it here in case someone wants to update this PR or create a new one using the requests.Session retry mechanism.

Contributor

I included 503 in my new PR #1339 since that is in the default list of urllib3.util.Retry
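
For reference, a minimal sketch of the session-level retry mechanism being discussed; this is the standard requests/urllib3 API rather than anything from this PR, and the particular numbers and status codes are only illustrative:

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util import Retry

    retry = Retry(
        total=10,                          # overall retry budget
        backoff_factor=1,                  # exponential sleep between attempts
        status_forcelist=[429, 503],       # rate limiting plus the 503s mentioned above
        allowed_methods=None,              # None = retry on every HTTP method, including POST
        respect_retry_after_header=True,   # honour Retry-After if the server sends it
    )

    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://", HTTPAdapter(max_retries=retry))

Mounted like this, every request made through the session is retried transparently, which is essentially the approach taken in #1339.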

@flichtenheld
Contributor

I will take a stab at resurrecting this mechanism since I'm currently working on related code anyway. I have been using this code for years at this point and it works fine in production (the original code, not the current code in this PR, which I'm pretty sure has been patched wrongly), but I will try to change the mechanism to use urllib3.util.Retry as suggested, to reduce unnecessary code.

@flichtenheld
Contributor

Since #1339 was merged, I think this PR should be closed.

@jp-harvey
Contributor Author

@flichtenheld thanks for taking this one up and your great work on #1339!

@jp-harvey jp-harvey closed this Feb 27, 2024