Skip to content

Normalise HTTP status-based retry policies across the agent#3827

Merged
moskyb merged 2 commits intomainfrom
http-status-retry-policy
Apr 20, 2026
Merged

Normalise HTTP status-based retry policies across the agent#3827
moskyb merged 2 commits intomainfrom
http-status-retry-policy

Conversation

@moskyb
Copy link
Copy Markdown
Contributor

@moskyb moskyb commented Apr 17, 2026

Description

Across the agent, there are a bunch of interactions with the Buildkite Agent API that all have their own, bespoke retry policies to retry certain HTTP status codes. Many of these policies are very silly, retrying permission errors and the like.

This PR attempts to normalise these policies, adding a helper in the api package called BreakOnNonRetryableStatus, which breaks the roko retrier when the HTTP response has a status code which is non-retryable, and applying this helper across all calls to retried calls to the agent API.

Context

I was working on a thing where in my local environment i was failing meta data gets with a permission error, and had this very silly situation:
CleanShot 2026-04-17 at 14 49 30@2x

where meta data gets failing with a permission error leads to them being retried for ~5 minutes.

Changes

Where calls to the buildkite agent api are being retried through roko, ensure that when those calls receive a status that's non-retryable (any non-2xx that's not one of these)

var retryableStatuses = map[int]bool{
	http.StatusTooManyRequests:     true, // 429
	529:                            true, // Buildkite-specific "still waiting" (used by pipeline upload)
	http.StatusInternalServerError: true, // 500
	http.StatusBadGateway:          true, // 502
	http.StatusServiceUnavailable:  true, // 503
	http.StatusGatewayTimeout:      true, // 504
}

that they break immediately and don't attempt to retry.

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go tool gofumpt -extra -w .)

Disclosures / Credits

@moskyb moskyb requested review from a team as code owners April 17, 2026 05:24
Comment thread core/client.go
Comment thread api/retryable.go Outdated

var retryableStatuses = map[int]bool{
http.StatusTooManyRequests: true, // 429
529: true, // Buildkite-specific "still waiting" (used by pipeline upload)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sigh

Comment thread api/retryable.go Outdated
@zhming0 zhming0 dismissed their stale review April 17, 2026 06:00

I misread this 🤦🏿

moskyb added 2 commits April 20, 2026 10:17
Across the agent, there are a bunch of interactions with the Buildkite Agent API that all have their own, bespoke retry policies to retry certain HTTP status codes. Many of these policies are very silly, retrying permission errors and the like.

This PR attempts to normalise these policies, adding a helper in the `api` package called `BreakOnNonRetryableStatus`, which breaks the roko retrier when the HTTP response has a status code which is non-retryable, and applying this helper across all calls to retried calls to the agent API
Error casts -> `errors.As`, removing the use of `neterr.Temporary`, typo fixes
@moskyb moskyb force-pushed the http-status-retry-policy branch from 9f83a42 to 530fe21 Compare April 20, 2026 00:17
Copy link
Copy Markdown
Contributor

@DrJosh9000 DrJosh9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sick ✅

@moskyb moskyb merged commit 9e5dc5e into main Apr 20, 2026
3 checks passed
@moskyb moskyb deleted the http-status-retry-policy branch April 20, 2026 00:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants