Normalise HTTP status-based retry policies across the agent #3827
Merged
zhming0 previously requested changes, Apr 17, 2026
zhming0 reviewed, Apr 17, 2026
    var retryableStatuses = map[int]bool{
        http.StatusTooManyRequests: true, // 429
        529: true, // Buildkite-specific "still waiting" (used by pipeline upload)
Error casts -> `errors.As`, removing the use of `neterr.Temporary`, typo fixes
zhming0 approved these changes, Apr 20, 2026
Description
Across the agent, there are a bunch of interactions with the Buildkite Agent API that all have their own, bespoke retry policies to retry certain HTTP status codes. Many of these policies are very silly, retrying permission errors and the like.
This PR attempts to normalise these policies, adding a helper in the `api` package called `BreakOnNonRetryableStatus`, which breaks the roko retrier when the HTTP response has a non-retryable status code, and applying this helper to all retried calls to the agent API.
Context
I was working on a thing where, in my local environment, meta-data gets were failing with a permission error, and I had this very silly situation where each failed get was retried for ~5 minutes.
Changes
Where calls to the Buildkite Agent API are retried through roko, ensure that when those calls receive a non-retryable status (any non-2xx that's not one of these), they break immediately and don't attempt to retry.
Testing
- Tests run (`go test ./...`). Buildkite employees may check this if the pipeline has run automatically.
- Code formatted (`go tool gofumpt -extra -w .`)
Disclosures / Credits