Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make failed job acquisitions return a specific exit code (27) #2762

Merged
merged 1 commit into from
May 3, 2024

Conversation

moskyb
Copy link
Contributor

@moskyb moskyb commented May 2, 2024

Description

When the agent is running in acquire mode - that is, by specifying a job UUID for the agent to pick up, "inverting" the normal dispatch process - sometimes, the job it's been instructed to pick up in unavailable. Perhaps it's already running on another agent, or maybe the job was cancelled, but made its way to the agent anyway.

In these situations, the agent prints the error to stdout, and returns an exit code of 1, indicating failure.

However, acquisition can fail for both recoverable- and non-recoverable reasons, and the agent doesn't differentiate between these in its status code. Users can grep through the logs for things unrecoverable-sounding things, but this is way more work than it needs to be.

This PR makes it so that if acquiring the job fails in such a way that the error is unrecoverable, the agent will return a status code of 27 (chosen arbitrarily, because it's a nice number). This allows consumers of the agent to not try to pick up that job again.

Context

COMP-332
https://docs.google.com/document/d/1qjIXw2gm88iiDggQAKVMQx0qwKntPjtlA4pW84om3xw/edit
Slack convo w/ Namespace

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go fmt ./...)

@moskyb moskyb requested a review from matthewborden May 2, 2024 06:17
clicommand/agent_start.go Outdated Show resolved Hide resolved
@moskyb moskyb requested a review from DrJosh9000 May 2, 2024 06:21
@moskyb moskyb force-pushed the exit-status-from-failed-acquire branch 2 times, most recently from 7a099ea to c907663 Compare May 2, 2024 06:50
@moskyb moskyb changed the title Make failed job acquisitions return a specific exit code (77) Make failed job acquisitions return a specific exit code (27) May 2, 2024
// specific exit code so that the caller can know that this job can't be acquired.

const acquisitionFailedExitCode = 27 // chosen by fair dice roll
return cli.NewExitError(err, acquisitionFailedExitCode)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish I could find a better way to do this, It feels like the job of looking up an error and translating it to an exit code should be simpler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i mean, if the return value of a cli.Command is an ExitCoder (ie, fulfils interface { ExitCode() int }, this would happen automatically, but i actually think that that would be worse than what we have here - it would mean pulling a concern of the CLI into the agent package. i think this is fine as is for now, and if we end up checking a whole bunch of error types, we can refactor as necessary

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% agreed, that's the conclusion I came to, but it was niggling at me

@moskyb moskyb force-pushed the exit-status-from-failed-acquire branch from c907663 to de3e8d1 Compare May 2, 2024 23:44
// If the agent tried to acquire a job, but it couldn't because the job was already taken, we should exit with a
// specific exit code so that the caller can know that this job can't be acquired.

const acquisitionFailedExitCode = 27 // chosen by fair dice roll
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a list somewhere of the other exit codes we have? i.e. aside from 0 and 1 — would 27 be the first "other one" we introduce?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, AFAICT, we only ever return 0 or 1. the reason i didn't go with 2 is that it implies that this is the second most important exit code. because it's one of two non-zero exit codes, it is the second-most important exit code by definition, but were we to add a third, this would likely not be the case.

regardless, potentially it's a good idea to hoist this const definition out from this function and put it somewhere central-ish in the clicommand package?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me

@moskyb moskyb merged commit e8628ce into main May 3, 2024
1 check passed
@moskyb moskyb deleted the exit-status-from-failed-acquire branch May 3, 2024 01:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants