Big process review & refactor #3814

Merged
DrJosh9000 merged 11 commits into main from a-1075-add-hard-interrupt-on-windows-experiment
Apr 22, 2026

Conversation

@DrJosh9000
Contributor

@DrJosh9000 DrJosh9000 commented Apr 13, 2026

Description

Review and refactor of chunks of the process package, because it was getting pretty crusty, but also to solve a specific problem to do with interrupting programs on Windows.

Context

https://linear.app/buildkite/issue/A-1075

The current mechanism used to interrupt processes on Windows is to send a "Ctrl-Break" event (as though someone was typing that key combination on the console window). This can be applied to all processes in a process group (created by an option on the Win32 CreateProcess call) and is understood to be vaguely similar to SIGTERM on *nixes...but it really isn't the same.

Go binaries seem to understand Ctrl-Break as a signal, so we don't need to change how the agent signals the bootstrap. But some processes (PowerShell, ping, probably others) absorb Ctrl-Break and do something other than exit.

Ctrl-C also exists as a concept on Windows and is supported as a generated console event. Unfortunately there are Windows-isms in the way - GenerateConsoleCtrlEvent understands CTRL_C_EVENT, but can't target it to a process group, only the current console. So we have to create the process in a new console, then detach console, attach to the other console, set an event handler, send the Ctrl-C, then detach and reattach to the original console and clean up the event handler. And I can't get that to work!

There's no particular reason we can't support SIGKILL as an "interrupt" signal either. Maybe the user just wants to kill stuff right away? But it isn't a handle-able signal, so we should avoid using it to interrupt the bootstrap (as a subprocess of start or kubernetes-bootstrap), otherwise post-command, pre-exit, artifact etc hooks won't run.

But... interrupting with SIGKILL will let us interrupt processes that absorb Ctrl-Break. So the solution is this PR, plus pass --cancel-signal=SIGKILL.
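To make the split concrete, here is a minimal sketch of the rule described above: the user-configured cancel signal is passed through to the command, but is downgraded to SIGTERM when the agent signals the bootstrap. The `Signal` type and the `bootstrapCancelSignal` helper are hypothetical stand-ins for illustration, not the names used in the agent.

```go
package main

import "fmt"

// Signal is a hypothetical stand-in for the agent's own signal type.
type Signal string

const (
	SIGINT  Signal = "SIGINT"
	SIGTERM Signal = "SIGTERM"
	SIGKILL Signal = "SIGKILL"
)

// bootstrapCancelSignal returns the signal to send to the bootstrap, given
// the user-configured cancel signal. SIGKILL cannot be handled, so sending
// it to the bootstrap would skip post-command and pre-exit hooks; it is
// downgraded to SIGTERM. Every other signal passes through unchanged.
func bootstrapCancelSignal(configured Signal) Signal {
	if configured == SIGKILL {
		return SIGTERM
	}
	return configured
}

func main() {
	for _, s := range []Signal{SIGINT, SIGTERM, SIGKILL} {
		fmt.Printf("configured=%s, bootstrap receives=%s\n", s, bootstrapCancelSignal(s))
	}
}
```

The bootstrap itself would still use the configured signal (including SIGKILL) when interrupting the command it runs.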

Changes

  • Add SIGKILL as an understood cancel signal
  • For Windows, make SIGKILL call terminateProcessGroup - this isn't needed on *nixes because sending an actual SIGKILL to an actual process group is how that works.
  • Refactors:
    • Move process into internal, because it's not an intended API surface
    • Split out setup, start, startWithPTY, startWithoutPTY, copyPTYToStdout, complete from Run
    • Use exec.CommandContext and Command.Cancel to set up the interrupt-grace period-terminate handling, rather than running a goroutine waiting for context cancellation the whole time
    • Use cmp.Or a bit more
    • Remove the waitgroup, instead close the channel from copyPTYToStdout
    • Fix a few wrong comments
    • Fix some weak mutex coverage
    • Make some tests in shell table-driven to exercise the range of interrupt signals
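As a rough illustration of the exec.CommandContext and Command.Cancel item above, here is a sketch using only the standard library (Go 1.22 or later for cmp.Or). It is not the code in this PR: `runWithGrace` and `demo` are hypothetical names, the 10-second default is an assumption, and it assumes a Unix-like system with a `sleep` binary, since os.Interrupt cannot be sent via Process.Signal on Windows.

```go
package main

import (
	"cmp"
	"context"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// runWithGrace runs a command and, when ctx is cancelled, interrupts it
// first and only kills it if it is still running after the grace period.
func runWithGrace(ctx context.Context, grace time.Duration, name string, args ...string) error {
	cmd := exec.CommandContext(ctx, name, args...)
	// Cancel is called by os/exec when ctx is done. The default kills the
	// process; here it sends an interrupt instead.
	cmd.Cancel = func() error {
		return cmd.Process.Signal(os.Interrupt)
	}
	// After ctx is done, os/exec waits up to WaitDelay and then kills the
	// process. cmp.Or substitutes a default when grace is zero.
	cmd.WaitDelay = cmp.Or(grace, 10*time.Second)
	return cmd.Run()
}

// demo cancels a long-running sleep shortly after starting it and reports
// how long the whole run took.
func demo() (time.Duration, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	start := time.Now()
	err := runWithGrace(ctx, time.Second, "sleep", "30")
	return time.Since(start), err
}

func main() {
	elapsed, err := demo()
	fmt.Printf("stopped after %v: %v\n", elapsed.Round(100*time.Millisecond), err)
}
```

With this shape there is no goroutine sitting on the context for the lifetime of the process; os/exec only does the interrupt-then-kill work once the context is actually done.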

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go tool gofumpt -extra -w .)

Disclosures / Credits

I did not use AI tools, but Codex is probably going to review it when I mark it ready whether I want it to or not

@DrJosh9000 DrJosh9000 force-pushed the a-1075-add-hard-interrupt-on-windows-experiment branch 17 times, most recently from 128a058 to 0b53623 Compare April 14, 2026 07:26
@DrJosh9000 DrJosh9000 marked this pull request as ready for review April 14, 2026 07:44
@DrJosh9000 DrJosh9000 requested review from a team as code owners April 14, 2026 07:44
@DrJosh9000 DrJosh9000 force-pushed the a-1075-add-hard-interrupt-on-windows-experiment branch from eb01f35 to 6ec284c Compare April 15, 2026 01:03
@DrJosh9000 DrJosh9000 force-pushed the a-1075-add-hard-interrupt-on-windows-experiment branch from 6ec284c to c7f4ccf Compare April 15, 2026 02:19
@DrJosh9000 DrJosh9000 force-pushed the a-1075-add-hard-interrupt-on-windows-experiment branch from c7f4ccf to da90a4c Compare April 15, 2026 04:48
@DrJosh9000 DrJosh9000 force-pushed the a-1075-add-hard-interrupt-on-windows-experiment branch from da90a4c to 75c3ce9 Compare April 15, 2026 05:01
@DrJosh9000 DrJosh9000 force-pushed the a-1075-add-hard-interrupt-on-windows-experiment branch 5 times, most recently from c16be23 to 5b57ef3 Compare April 20, 2026 23:59
Contributor

@zhming0 zhming0 left a comment

I think I can only proceed after some commit wrangling makes the diff of the core file viewable.

But I am also happy to generate a diff locally if it's easier, lmk 🙏🏿

Comment thread agent/job_runner.go Outdated
Comment on lines +343 to +350
// We don't SIGKILL the bootstrap as a cancel signal, because that
// prevents post-checkout/command hooks running. (Change the signal
// grace period instead.) But the user may want the bootstrap to SIGKILL
// the command that it runs as an "interrupt".
cancelSignal := conf.CancelSignal
if cancelSignal == process.SIGKILL {
	cancelSignal = process.SIGTERM
}
Contributor

The comment does not seem to explain why we reset cancelSignal to SIGTERM when conf.CancelSignal is process.SIGKILL.

The comment seems to suggest that we should allow the SIGKILL setting to be propagated to the child process?

Contributor

After reading a bit more context, I think I get the gist, but I still feel it's going to be a very hard-to-understand piece of logic for future travellers.

Contributor Author

I've changed the comment a bit, let me know what you think!

Comment thread docs/images/agent-start.svg
Comment thread internal/shell/shell_test.go
Comment thread internal/process/process.go
@DrJosh9000 DrJosh9000 force-pushed the a-1075-add-hard-interrupt-on-windows-experiment branch 3 times, most recently from 49fe79e to 0e04f8f Compare April 21, 2026 02:25
@DrJosh9000 DrJosh9000 requested a review from zhming0 April 21, 2026 03:40
Contributor

@zhming0 zhming0 left a comment

LGTM 👍🏿. Please correct me if I've misunderstood. Windows users can set the cancel signal to SIGKILL, which lets agent start simulate a SIGKILL to its subprocesses via windows.CloseHandle(windows.Handle(p.winJobHandle)).

The main benefit is that the cancel signal now won't get swallowed by other mechanisms in Windows.

But what I don't get is if this solves the original issue which is about job lifecycle hooks being missed?

Nonetheless, the change itself looks good, so I will give it a tick.

Comment thread agent/job_runner.go
Comment on lines +343 to +352
// CancelSignal == SIGKILL means the user wants the command to be killed
// instead of signaled more gracefully (SIGTERM, SIGINT, etc).
// We don't send SIGKILL to the bootstrap itself as a cancel signal,
// because that would kill the bootstrap immediately, which would
// prevent capturing the exit status of the command, executing various
// pre-exit hooks, and other cleanup.
cancelSignal := conf.CancelSignal
if cancelSignal == process.SIGKILL {
	cancelSignal = process.SIGTERM
}
Contributor

Should we comment here that the only reasonable use case for specifying SIGKILL is for non-Linux systems?

Contributor Author

I don't see why we should specifically discourage SIGKILL for Linux users; it just means there's effectively no signal/cancel grace period.

Comment thread internal/shell/shell_test.go Outdated
- Move the files
- Cmd-Shift-F, replace "agent/v3/process" with "agent/v3/internal/process"
- bazel run :gazelle

This will enable tests to wait for the process to start or complete, rather than having to sleep.

In the unlikely event that creating the job object or attaching the process to the job object fails, we should fall back to terminating just the process.

- Split Run into setup, start, startWithPTY, startWithoutPTY, copyPTYToStdout, complete, onContextCancel.
- Use CommandContext and Command.Cancel to handle context cancellation
- Fix missing mutex coverage of some fields
@DrJosh9000 DrJosh9000 force-pushed the a-1075-add-hard-interrupt-on-windows-experiment branch from 0e04f8f to 5e739ab Compare April 22, 2026 03:11
@DrJosh9000
Contributor Author

> But what I don't get is if this solves the original issue which is about job lifecycle hooks being missed?

If a command swallows Ctrl-Break and doesn't exit, then the bootstrap keeps waiting until the signal grace period, then kills the process. At the same time, the agent is also waiting for the signal grace period, and then kills the bootstrap. Since the bootstrap is what executes the job lifecycle hooks, and the post-command and pre-exit hooks must run after the command, the bootstrap has no time to execute them before being killed.
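A small model of that timing, as a sketch only: it assumes the agent and the bootstrap start their grace timers at the same instant and that hooks can only begin once the command is gone. `hookBudget` is a hypothetical helper for illustration, not something in the agent.

```go
package main

import (
	"fmt"
	"time"
)

// hookBudget returns how long the bootstrap has to run post-command and
// pre-exit hooks before the agent kills it. commandExitsAfter is how long
// the command takes to exit after being interrupted.
func hookBudget(grace, commandExitsAfter time.Duration) time.Duration {
	// The bootstrap force-kills the command at the grace deadline at the
	// latest, so the command never outlives the grace period.
	commandGone := commandExitsAfter
	if commandGone > grace {
		commandGone = grace
	}
	// The agent kills the bootstrap at the same grace deadline, so only
	// the remainder is available for hooks.
	return grace - commandGone
}

func main() {
	grace := 10 * time.Second
	// The command swallows Ctrl-Break and never exits on its own.
	fmt.Println("swallowed interrupt:", hookBudget(grace, time.Hour))
	// The command is killed immediately (cancel signal SIGKILL).
	fmt.Println("immediate kill:", hookBudget(grace, 0))
}
```

In this model a swallowed interrupt leaves the hooks no time at all, while killing the command straight away leaves them the whole grace period.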

@DrJosh9000 DrJosh9000 force-pushed the a-1075-add-hard-interrupt-on-windows-experiment branch from 5e739ab to c9a1ac0 Compare April 22, 2026 03:29
@zhming0
Contributor

zhming0 commented Apr 22, 2026

It all makes sense now. 👍🏿

@DrJosh9000 DrJosh9000 merged commit 0f9651a into main Apr 22, 2026
3 checks passed
@DrJosh9000 DrJosh9000 deleted the a-1075-add-hard-interrupt-on-windows-experiment branch April 22, 2026 04:03
2 participants