
Add API retries for octokit requests #1420

Merged

aeisenberg merged 2 commits into main from robertbrignull/api-retry on Jul 12, 2022

Conversation

@robertbrignull
Contributor

The aim is to add automatic retrying for HTTP requests made with octokit. We've seen a few cases of spurious failures causing problems, and this will go a long way toward avoiding them. We may also want to add the ability to fully retry some operations, like downloading MRVA results, but that's a separate change from this one.

Is this all that's needed to use the @octokit/plugin-retry module and get automatic retrying? Reading https://github.com/octokit/plugin-retry.js this appears to be it. Just not sure how to tell if it's working or not. 🤔
Maybe someone could work with me to write a test.
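For reference, the wiring described above is roughly the following (a sketch based on the plugin-retry README; the auth token shown is a placeholder):

```typescript
import { Octokit } from "@octokit/core";
import { retry } from "@octokit/plugin-retry";

// Build an Octokit class with the retry plugin applied. Once applied,
// requests made through instances of this class are retried
// automatically on transient failures, with exponential backoff.
const RetryingOctokit = Octokit.plugin(retry);

const octokit = new RetryingOctokit({ auth: "personal-access-token" });
```

Since this is only plugin wiring, there is no retry-specific code at the call sites; every request made through `octokit` picks up the behaviour.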

Checklist

  • CHANGELOG.md has been updated to incorporate all user visible changes made by this pull request.
  • Issues have been created for any UI or other user-facing changes made by this pull request.
  • [Maintainers only] If this pull request makes user-facing changes that require documentation changes, open a corresponding docs pull request in the github/codeql repo and add the ready-for-doc-review label there.

@robertbrignull robertbrignull requested a review from a team as a code owner July 1, 2022 11:22
@charisk
Contributor

charisk commented Jul 11, 2022

Thanks for taking a look at this! The default behaviour is that 400, 401, 403, 404, 422 codes won't be retried (according to this). Do you think we might want to retry 404s? I'm wondering if we ever want to protect against something not being there yet.

Contributor

@aeisenberg aeisenberg left a comment


I think this is generally fine, but there's a possibility that the request will be sent and processed correctly, but the response will fail to be received. Since the run remote queries requests are not idempotent, this could cause multiple runs to be started.

I don't think this is a likely scenario, but it's something to consider.

@robertbrignull
Contributor Author

robertbrignull commented Jul 12, 2022

@charisk

The default behaviour is that 400, 401, 403, 404, 422 codes won't be retried (according to this).

That default list of statuses not to retry looks sensible to me. Those should all be user errors where retrying won't make a difference, as opposed to 5xx errors, which indicate something went wrong outside the user's control.
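The split that default list draws, skipping 4xx client errors while retrying transient server failures, can be sketched as a hypothetical decision function (an illustration of the behaviour being discussed, not the plugin's actual code):

```javascript
// Statuses skipped by default: user errors where a retry can't change
// the outcome (list taken from the discussion above).
const DO_NOT_RETRY = new Set([400, 401, 403, 404, 422]);

// Hypothetical sketch of the retry decision: never retry successes,
// skip known user errors, retry other failures (e.g. 5xx errors).
function shouldRetry(status) {
  if (status < 400) return false; // success: nothing to retry
  return !DO_NOT_RETRY.has(status); // 5xx etc. are worth retrying
}
```

So a 502 from a flaky proxy gets retried, while a 404 for a missing controller repository fails immediately.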

Do you think we might want to retry 404s? I'm wondering if we ever want to protect against something not being there yet.

In terms of a 404, that would only mean the controller repository doesn't exist yet, and I don't think we want to retry and wait for that. Repos without databases are filtered out, so the request either succeeds with those repos excluded or fails with a 422 if no repos have databases; again, I don't think we want to wait for them. How long the user is willing to wait and how long the thing will take to become available aren't really compatible time frames.

Does that make sense and is that what you meant?

@robertbrignull
Contributor Author

robertbrignull commented Jul 12, 2022

@aeisenberg

I think this is generally fine, but there's a possibility that the request will be sent and processed correctly, but the response will fail to be received. Since the run remote queries requests are not idempotent, this could cause multiple runs to be started.

I don't think this is a likely scenario, but it's something to consider.

Hmm, that's an interesting point; there are more ways an HTTP request can fail than I'd considered. I think if it fails at DNS or a similarly early stage, it might not get far enough to produce a status code, and so will likely fail with some other kind of exception that we currently don't handle.

Regarding the other ways a request can fail, maybe we could handle more of them, but let's try adding this simple retry mechanism first. It's still a win, and then we can see how frequently people hit other kinds of errors that aren't handled. Is that sensible?

Regarding the request succeeding on the server side but being seen as a failure by the client and getting retried: that would be a waste of resources, but the client doesn't have the response after the first "failure", so there's not much else it can do except retry. If we didn't do it automatically, the user would probably do it manually anyway and just be more annoyed about it.

There is a case I've seen that could be a little dangerous, where the request "times out" and returns a 5xx error, but the processing on the server side actually continues a little longer and completes successfully. Some middleware returns the timeout HTTP response before killing the code handling the request. If this happens, we might retry the request and have it fake-timeout again and again, leading to wasted compute resources. We could discuss not retrying timeouts, though I'm not sure how we would configure that, or just accept that it will only retry 3 times and that this is an acceptable cost.
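On the configuration question: per the plugin-retry README, retry behaviour can be tuned per request through the `request` options, so a sketch of opting a sensitive call out of retries would be (the endpoint and values here are illustrative, not from this codebase):

```typescript
// Disable retries for one non-idempotent request while leaving the
// plugin's defaults in place for everything else.
const { data } = await octokit.request("GET /repos/{owner}/{repo}", {
  owner: "github",
  repo: "vscode-codeql",
  request: {
    retries: 0, // 0 = fail immediately, no automatic retry
  },
});
```

That wouldn't distinguish timeouts from other 5xx errors, but it would give us a lever per call site if the fake-timeout case turns out to matter.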

Contributor

@aeisenberg aeisenberg left a comment


This sounds reasonable.

@aeisenberg aeisenberg merged commit f559b59 into main Jul 12, 2022
@aeisenberg aeisenberg deleted the robertbrignull/api-retry branch July 12, 2022 15:12
3 participants