Conversation
128a058 to
0b53623
Compare
eb01f35 to
6ec284c
Compare
6ec284c to
c7f4ccf
Compare
c7f4ccf to
da90a4c
Compare
da90a4c to
75c3ce9
Compare
c16be23 to
5b57ef3
Compare
zhming0
left a comment
There was a problem hiding this comment.
I think I'd need to proceed after some commit wrangling to make the diff viewable in the core file.
But I am also happy to generate a diff locally if it's easier, lmk 🙏🏿
| // We don't SIGKILL the bootstrap as a cancel signal, because that | ||
| // prevents post-checkout/command hooks running. (Change the signal | ||
| // grace period instead.) But the user may want the bootstrap to SIGKILL | ||
| // the command that it runs as an "interrupt". | ||
| cancelSignal := conf.CancelSignal | ||
| if cancelSignal == process.SIGKILL { | ||
| cancelSignal = process.SIGTERM | ||
| } |
There was a problem hiding this comment.
The comment does not seem to explain why we reset cancelSignal to SIGTERM when conf.CancelSignal is process.SIGKILL?
The comment seem to suggest that should allow SIGKILL setting to be propagated to child process?
There was a problem hiding this comment.
After reading a bit more context, I think I get the gist but still feeling it's going to be a very hard to understand piece of logic for future travellers.
There was a problem hiding this comment.
I've changed the comment a bit, let me know what you think!
49fe79e to
0e04f8f
Compare
zhming0
left a comment
There was a problem hiding this comment.
LGTM 👍🏿 . Please correct me if I understood it wrong. For Windows user, they can set cancel signal to be SIGKILL, which will let agent start to simulate a SIGKILL to its subprocesses via windows.CloseHandle(windows.Handle(p.winJobHandle)).
The main benefit is that now cancel signal won't get swallowed by other mechanisms in Windows.
But what I don't get is if this solves the original issue which is about job lifecycle hooks being missed?
Nonetheless the change itself look good, I will give it tick.
| // CancelSignal == SIGKILL means the user wants the command to be killed | ||
| // instead of signaled more gracefully (SIGTERM, SIGINT, etc). | ||
| // We don't send SIGKILL to the bootstrap itself as a cancel signal, | ||
| // because that would kill the bootstrap immediately, which would | ||
| // prevent capturing the exit status of the command, executing various | ||
| // pre-exit hooks, and other cleanup. | ||
| cancelSignal := conf.CancelSignal | ||
| if cancelSignal == process.SIGKILL { | ||
| cancelSignal = process.SIGTERM | ||
| } |
There was a problem hiding this comment.
Should we comment here that the only reasonable use case for specifying SIGKILL is for non-Linux systems?
There was a problem hiding this comment.
I don't see why we should specifically discourage SIGKILL for Linux users, it just means there's effectively no signal/cancel grace period.
- Move the files - Cmd-Shift-F, replace "agent/v3/process" with "agent/v3/internal/process" - bazel run :gazelle
This will enable tests to wait for the process to start or complete, rather than having to sleep.
In the unlikely event that creating the job object or attaching the process to the job object fails, we should fall back to terminating just the process.
- Split Run into setup, start, startWithPTY, startWithoutPTY, copyPTYToStdout, complete, onContextCancel. - Use CommandContext and Command.Cancel to handle context cancellation - Fix missing mutex coverage of some fields
0e04f8f to
5e739ab
Compare
If a command swallows Ctrl-Break and doesn't exit, then the bootstrap keeps waiting until the signal grace period, then kills the process. At the same time, the agent is also waiting for the signal grace period, and then kills the bootstrap. Since the bootstrap is what executes the job lifecycle hooks, and the post-command and pre-exit hooks must run after the command, the bootstrap has no time to execute them before being killed. |
5e739ab to
c9a1ac0
Compare
|
It all make senses how. 👍🏿 |
Description
Review and refactor of chunks of the
processpackage, because it was getting pretty crusty, but also to solve a specific problem to do with interrupting programs on Windows.Context
https://linear.app/buildkite/issue/A-1075
The current mechanism used to interrupt processes on Windows is to send a "Ctrl-Break" event (as though someone was typing that key combination on the console window). This can be applied to all processes in a process group (created by an option on the Win32
CreateProcesscall) and is understood to be vaguely similar toSIGTERMon *nixes...but it really isn't the same.Go binaries seem to understand Ctrl-Break as a signal, so we don't need to change how the agent signals the bootstrap. But some processes (PowerShell, ping, probably others) absorb Ctrl-Break and do something other than exit.
Ctrl-C also exists as a concept on Windows and is supported as a generated console event. Unfortunately there are Windows-isms in the way -
GenerateConsoleCtrlEventunderstandsCTRL_C_EVENT, but can't target it to a process group, only the current console. So we have to create the process in a new console, then detach console, attach to the other console, set an event handler, send the Ctrl-C, then detach and reattach to the original console and clean up the event handler. And I can't get that to work!There's no particular reason we can't support
SIGKILLas an "interrupt" signal either. Maybe the user just wants to kill stuff right away? But it isn't a handle-able signal, so we should avoid using it to interrupt the bootstrap (as a subprocess ofstartorkubernetes-bootstrap), otherwise post-command, pre-exit, artifact etc hooks won't run.But... interrupting with
SIGKILLwill let us interrupt processes that absorb Ctrl-Break. So the solution is this PR, plus pass--cancel-signal=SIGKILL.Changes
SIGKILLas an understood cancel signalSIGKILLcallterminateProcessGroup- this isn't needed on *nixes because sending an actualSIGKILLto an actual process group is how that works.processintointernal, because it's not an intended API surfacesetup,start,startWithPTY,startWithoutPTY,copyPTYToStdout,completefromRunexec.CommandContextandCommand.Cancelto set up the interrupt-grace period-terminate handling, rather than running a goroutine waiting for context cancellation the whole timecmp.Ora bit morecopyPTYToStdoutshelltable-driven to exercise the range of interrupt signalsTesting
go test ./...). Buildkite employees may check this if the pipeline has run automatically.go tool gofumpt -extra -w .)Disclosures / Credits
I did not use AI tools, but Codex is probably going to review it when I mark it ready whether I want it to or not