feat: add BUILDKITE_JOB_TIMED_OUT env var for hooks #3871

Open
mastermanas805 wants to merge 3 commits into buildkite:main from mastermanas805:feat/3592-job-timeout-env

mastermanas805 commented Apr 29, 2026

Description

Adds BUILDKITE_JOB_TIMED_OUT for hooks (and CancelReasonJobTimeout) so post-command hooks can distinguish timeout from other cancellations. Mirrors #3213's BUILDKITE_JOB_CANCELLED design.

Fixes #3592

The agent doesn't enforce timeouts itself. Instead, JobRunner.jobCancellationChecker polls GetJobState and now routes the timing_out/timed_out server states through Cancel(CancelReasonJobTimeout). To carry the signal across processes (agent → bootstrap → hook env), the agent uses a per-job marker file whose path is passed via BUILDKITE_AGENT_JOB_TIMEOUT_FILE and is protected from job-level override. The agent writes the marker before signaling; bootstrap's Cancel reads it and sets BUILDKITE_JOB_TIMED_OUT=true on the hook env.
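
For reviewers, the state routing described above can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: the type `cancelReason`, its constants, and `reasonForState` are hypothetical stand-ins for the agent's real CancelReason type and the logic inside jobCancellationChecker. Only the four state strings come from the description.

```go
package main

import "fmt"

// cancelReason is a hypothetical stand-in for the agent's CancelReason type.
type cancelReason int

const (
	reasonNone         cancelReason = iota // job still running; keep polling
	reasonJobCancelled                     // server reported canceling/canceled
	reasonJobTimeout                       // server reported timing_out/timed_out
)

// String gives a human-readable label (the wording here is an assumption).
func (r cancelReason) String() string {
	switch r {
	case reasonJobCancelled:
		return "job cancelled"
	case reasonJobTimeout:
		return "job timed out"
	default:
		return "none"
	}
}

// reasonForState maps a polled server job state to the reason the job
// should be cancelled with, or reasonNone if no cancellation is needed.
func reasonForState(state string) cancelReason {
	switch state {
	case "canceling", "canceled":
		return reasonJobCancelled
	case "timing_out", "timed_out":
		return reasonJobTimeout
	default:
		return reasonNone
	}
}

func main() {
	for _, s := range []string{"running", "canceling", "timed_out"} {
		fmt.Printf("%s -> %s\n", s, reasonForState(s))
	}
}
```

Keeping the mapping in one place means a timeout and a manual cancel share the same cancellation path and differ only in the reason passed along.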

Testing

  • go test ./... clean
  • go test -race ./agent/ ./internal/job/ ./env/ clean
  • go tool gofumpt -extra -d . clean
  • golangci-lint run 0 issues
  • Added unit tests for executor.Cancel (no-marker / marker-present / bad-path), jobTimeoutFilePath, and CancelReason.String()

Disclosures / Credits

Used Claude Code (Opus 4.7) to design the marker-file plumbing and draft the patch. Reviewed and tested locally.

When a Buildkite job is cancelled because of a job-level timeout, the
post-command hook now sees BUILDKITE_JOB_TIMED_OUT=true alongside the
existing BUILDKITE_JOB_CANCELLED, so hooks can distinguish a timeout
from a manual cancellation without having to do arithmetic on job start
time and timeout configuration.

The job runner already polls GetJobState and cancels on the
canceling/canceled states. It now also recognises the
timing_out/timed_out states and routes those through a new
CancelReasonJobTimeout. Before
sending the interrupt signal, the agent drops a marker file at a path
it has shared with the bootstrap subprocess via
BUILDKITE_AGENT_JOB_TIMEOUT_FILE. The executor's Cancel handler reads
that path and, if the file is present, sets BUILDKITE_JOB_TIMED_OUT on
the shell env, which then propagates into the post-command hook just
like BUILDKITE_JOB_CANCELLED does today (buildkite#3213).
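
The marker-file handshake described above might look roughly like this
sketch. It is illustrative only: writeTimeoutMarker, hookEnvOnCancel and
demoMarker are hypothetical names, the marker file name is made up, and
the "true" value for BUILDKITE_JOB_CANCELLED is an assumption. In the
real agent the path would reach the bootstrap via
BUILDKITE_AGENT_JOB_TIMEOUT_FILE rather than a function argument.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeTimeoutMarker is the agent side (hypothetical helper): drop an
// empty marker file before sending the interrupt signal.
func writeTimeoutMarker(path string) error {
	return os.WriteFile(path, nil, 0o600)
}

// hookEnvOnCancel is the bootstrap side (hypothetical helper): build the
// extra env that hooks see after a cancel. The timeout variable is added
// only if the marker file exists at the shared path.
func hookEnvOnCancel(markerPath string) map[string]string {
	env := map[string]string{"BUILDKITE_JOB_CANCELLED": "true"} // value assumed
	if markerPath == "" {
		return env // no path shared: behave like a plain cancel
	}
	if _, err := os.Stat(markerPath); err == nil {
		env["BUILDKITE_JOB_TIMED_OUT"] = "true"
	}
	return env
}

// demoMarker runs the handshake in a temp dir and returns the hook env
// before and after the agent writes the marker.
func demoMarker() (before, after map[string]string, err error) {
	dir, err := os.MkdirTemp("", "job-env-")
	if err != nil {
		return nil, nil, err
	}
	defer os.RemoveAll(dir) // marker is removed with the temp dir

	marker := filepath.Join(dir, "job-timed-out") // file name is made up
	before = hookEnvOnCancel(marker)
	if err := writeTimeoutMarker(marker); err != nil {
		return nil, nil, err
	}
	after = hookEnvOnCancel(marker)
	return before, after, nil
}

func main() {
	before, after, err := demoMarker()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("timed out without marker:", before["BUILDKITE_JOB_TIMED_OUT"] == "true")
	fmt.Println("timed out with marker:", after["BUILDKITE_JOB_TIMED_OUT"] == "true")
}
```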

The marker file lives in the same temp directory used for the job env
files and is removed during cleanup. Failing to write or remove it is
non-fatal: the cancellation still proceeds, and the hook just sees the
existing BUILDKITE_JOB_CANCELLED.
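
A minimal sketch of the non-fatal behaviour, under the assumption that
the marker write is best-effort and the interrupt is always attempted.
cancelForTimeout and demoNonFatal are hypothetical names, and the
interrupt callback stands in for whatever the agent uses to signal the
bootstrap subprocess.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// cancelForTimeout (hypothetical helper) tries to write the marker, then
// sends the interrupt regardless of whether the write succeeded.
func cancelForTimeout(markerPath string, interrupt func() error) (markerErr, interruptErr error) {
	if err := os.WriteFile(markerPath, nil, 0o600); err != nil {
		markerErr = err // non-fatal: hooks will only see the cancelled var
	}
	return markerErr, interrupt()
}

// demoNonFatal points the marker at a directory that does not exist, so
// the write fails, and reports whether the interrupt still ran.
func demoNonFatal() (markerFailed, interrupted bool, err error) {
	dir, err := os.MkdirTemp("", "job-env-")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	bad := filepath.Join(dir, "missing-subdir", "job-timed-out")
	markerErr, interruptErr := cancelForTimeout(bad, func() error {
		interrupted = true
		return nil
	})
	return markerErr != nil, interrupted, interruptErr
}

func main() {
	markerFailed, interrupted, err := demoNonFatal()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("marker write failed:", markerFailed)
	fmt.Println("interrupt still sent:", interrupted)
}
```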

Fixes buildkite#3592
@mastermanas805 mastermanas805 marked this pull request as ready for review May 5, 2026 05:07
@mastermanas805 mastermanas805 requested review from a team as code owners May 5, 2026 05:07
Development

Successfully merging this pull request may close these issues.

Add a bool env var for when job times out