-
Notifications
You must be signed in to change notification settings - Fork 289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry forever to upload job chunks #898
Conversation
We send 400 and 401 responses sometimes. I can't find any 422 responses. 400 Bad Request is sent for "Missing chunk" and "Job can't have chunks" at least. Both are unrecoverable and would halt an agent here. 401 is sent when the agent authorization fails, and is likely unrecoverable. 404 might be sent if a job disappears while chunks are being submitted, and is likely unrecoverable. Perhaps 4XX codes should not infinitely retry, but 5XX should? |
@sj26 excellent point, changed! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice 👍
Good sleuthing, @sj26! I need to remember how ridiculously powerful Insights is! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let’s merge and release? 🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems super sensible!
Currently we wait a max of 50 seconds on each chunk before trying to upload it. If chunks are failing to upload, the main reason would be a problem with Buildkite. Presumably if they're failing to upload, finishing a job and starting a new one will also have the same problem (at which point we're just throwing chunk data away, to get more, and probably throw them away too).
This change tweaks the retry rules so we don't give up on chunks. It does handle cases that BK returns a 422, which "is" a successful chunk upload (whether or not we actually stored it is another thing), but a 422 does mean communications were successful, and Buildkite did "the right thing"