
Add API retries for octokit requests #1420

Merged

aeisenberg merged 2 commits into main from robertbrignull/api-retry on Jul 12, 2022

Conversation

@robertbrignull
Contributor

The aim is to add automatic retrying for HTTP requests made with octokit. We've seen a few cases of spurious failures causing problems, and this will go a long way toward avoiding them. We may also want to add the ability to fully retry some operations, like downloading MRVA results, but that's a separate change from this one.

Is this all that's needed to use the @octokit/plugin-retry module and get automatic retrying? Reading https://github.com/octokit/plugin-retry.js this appears to be it. Just not sure how to tell if it's working or not. 🤔
Maybe someone could work with me to write a test.
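For reference, the wiring described above is roughly the following (a sketch based on the plugin-retry README; the auth token shown is a placeholder):

```typescript
import { Octokit } from "@octokit/core";
import { retry } from "@octokit/plugin-retry";

// Build an Octokit class with the retry plugin applied. Once applied,
// requests made through instances of this class are retried
// automatically on transient failures, with exponential backoff.
const RetryingOctokit = Octokit.plugin(retry);

const octokit = new RetryingOctokit({ auth: "personal-access-token" });
```

Since this is only plugin wiring, there is no retry-specific code at the call sites; every request made through `octokit` picks up the behaviour.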

Checklist

  • CHANGELOG.md has been updated to incorporate all user visible changes made by this pull request.
  • Issues have been created for any UI or other user-facing changes made by this pull request.
  • [Maintainers only] If this pull request makes user-facing changes that require documentation changes, open a corresponding docs pull request in the github/codeql repo and add the ready-for-doc-review label there.

@robertbrignull robertbrignull requested a review from a team as a code owner July 1, 2022 11:22
@charisk
Contributor

charisk commented Jul 11, 2022

Thanks for taking a look at this! The default behaviour is that 400, 401, 403, 404, 422 codes won't be retried (according to this). Do you think we might want to retry 404s? I'm wondering if we ever want to protect against something not being there yet.

Contributor

@aeisenberg aeisenberg left a comment


I think this is generally fine, but there's a possibility that the request will be sent and processed correctly, but the response will fail to be received. Since the run remote queries requests are not idempotent, this could cause multiple runs to be started.

I don't think this is a likely scenario, but it's something to consider.

@robertbrignull
Contributor Author

robertbrignull commented Jul 12, 2022

@charisk

The default behaviour is that 400, 401, 403, 404, 422 codes won't be retried (according to this).

That default list of statuses not to retry looks sensible to me. Those should all be user errors where retrying won't make a difference, as opposed to 5xx errors, which indicate something went wrong outside the user's control.
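The split that default list draws, skipping 4xx client errors while retrying transient server failures, can be sketched as a hypothetical decision function (an illustration of the behaviour being discussed, not the plugin's actual code):

```javascript
// Statuses skipped by default: user errors where a retry can't change
// the outcome (list taken from the discussion above).
const DO_NOT_RETRY = new Set([400, 401, 403, 404, 422]);

// Hypothetical sketch of the retry decision: never retry successes,
// skip known user errors, retry other failures (e.g. 5xx errors).
function shouldRetry(status) {
  if (status < 400) return false; // success: nothing to retry
  return !DO_NOT_RETRY.has(status); // 5xx etc. are worth retrying
}
```

So a 502 from a flaky proxy gets retried, while a 404 for a missing controller repository fails immediately.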

Do you think we might want to retry 404s? I'm wondering if we ever want to protect against something not being there yet.

In terms of a 404, that would only mean the controller repository doesn't exist yet, and I don't think we want to retry and wait for that. Repos without databases are filtered out, so the request either succeeds with those repos excluded or fails with a 422 if no repos have databases; again, I don't think we want to wait for them. How long the user is willing to wait and how long the thing will take to become available aren't really compatible time frames.

Does that make sense and is that what you meant?

@robertbrignull
Contributor Author

robertbrignull commented Jul 12, 2022

@aeisenberg

I think this is generally fine, but there's a possibility that the request will be sent and processed correctly, but the response will fail to be received. Since the run remote queries requests are not idempotent, this could cause multiple runs to be started.

I don't think this is a likely scenario, but it's something to consider.

Hmm, that's an interesting point; there are more ways an HTTP request can fail than I'd considered. I think if it fails at DNS or a similarly early stage, it might not get far enough to produce a status code, and so will likely fail with some other kind of exception that we currently don't handle.

Regarding the other ways a request can fail, maybe we could handle more of them, but let's try adding this simple retry mechanism first. It's still a win, and then we can see how frequently people hit other kinds of errors that aren't handled. Is that sensible?

Regarding the request succeeding on the server side but being seen as a failure by the client and getting retried: that would be a waste of resources, but the client doesn't have the response after the first "failure", so there's not much else it can do except retry. If we didn't do it automatically, the user would probably do it manually anyway and just be more annoyed about it.

There is a case I've seen that could be a little dangerous, where the request "times out" and returns a 5xx error, but the processing on the server side actually continues a little longer and completes successfully. Some middleware returns the timeout HTTP response before killing the code handling the request. If this happens, we might retry the request and have it fake-timeout again and again, leading to wasted compute resources. We could discuss not retrying timeouts, though I'm not sure how we would configure that, or just accept that it will only retry 3 times and that this is an acceptable cost.
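On the configuration question: per the plugin-retry README, retry behaviour can be tuned per request through the `request` options, so a sketch of opting a sensitive call out of retries would be (the endpoint and values here are illustrative, not from this codebase):

```typescript
// Disable retries for one non-idempotent request while leaving the
// plugin's defaults in place for everything else.
const { data } = await octokit.request("GET /repos/{owner}/{repo}", {
  owner: "github",
  repo: "vscode-codeql",
  request: {
    retries: 0, // 0 = fail immediately, no automatic retry
  },
});
```

That wouldn't distinguish timeouts from other 5xx errors, but it would give us a lever per call site if the fake-timeout case turns out to matter.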

Contributor

@aeisenberg aeisenberg left a comment


This sounds reasonable.

@aeisenberg aeisenberg merged commit f559b59 into main Jul 12, 2022
@aeisenberg aeisenberg deleted the robertbrignull/api-retry branch July 12, 2022 15:12
3 participants