
[TE-5476] Add bktec backfill-commit-metadata subcommand#471

Merged
gchan merged 44 commits into main from gordon/te-5476-backfill-commit-metadata
Apr 13, 2026

Conversation


@gchan gchan commented Apr 1, 2026

Description

The test prediction training pipeline needs historical changeset data to train models. We don't have access to customer repos, so customers need a tool they can run themselves inside their own checkout.

This PR adds a new subcommand under bktec tools (hidden from bktec --help unless BKTEC_PREVIEW_SELECTION is enabled):

bktec tools backfill-commit-metadata

Collects historical git commit metadata from a customer's repository and uploads it to Buildkite via a presigned S3 POST. The pipeline:

  1. Verify token scopes (read_suites + write_suites) via GET /v2/access-token -- fast-fail before expensive git work. When --output is set, missing write_suites downgrades to a warning.
  2. Fetch commit SHA list from the Buildkite API (GET .../suites/:suite/commits?days=N)
  3. Detect default branch with <remote>/HEAD -> <remote>/main -> <remote>/master fallback
  4. Filter commits that exist locally via git cat-file --batch-check
  5. Fetch missing commits from remote with chunked git fetch + recursive bisect-on-error to isolate unfetchable SHAs
  6. Build a mainline cache (bounded by --since to match the lookback window) for 3-strategy fork-point detection (--fork-point, mainline parent fallback, plain merge-base)
  7. Bulk-fetch commit metadata via git log --no-walk --stdin (single process)
  8. Collect diffs concurrently (configurable worker pool with context cancellation: --name-only, --numstat, full diff, --raw)
  9. Package as tar.gz (files nested inside a backfill-<org>-<suite>-<timestamp> directory: commit-metadata.jsonl + metadata.json with config options and commit date range)
  10. Upload via presigned S3 POST, or write locally with --output

On upload failure, the tarball is retained locally and the path is printed so the user can retry with --upload.

--upload flag

The --upload <file> flag provides an upload-only mode that skips all git work and uploads a previously generated tarball directly. This supports workflows where generation and upload happen in separate steps (e.g. air-gapped environments, split CI pipelines, or retrying a failed upload).

# All-in-one (generate + upload)
bktec tools backfill-commit-metadata

# Generate locally...
bktec tools backfill-commit-metadata --output data.tar.gz
# ...and upload later
bktec tools backfill-commit-metadata --upload data.tar.gz

Architecture

The implementation adds four new packages under internal/, all using stdlib only (no new module dependencies):

  • internal/git/ -- GitRunner interface (with dependency injection), default branch detection, fork-point, bulk metadata, filter, fetch-missing, concurrent diff collection
  • internal/packaging/ -- tar.gz archive creation with JSONL + metadata.json
  • internal/upload/ -- presigned S3 POST with multipart form (context-aware for cancellation)
  • internal/api/ -- FetchCommitList, PresignUpload, VerifyTokenScopes client methods
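The `GitRunner` dependency-injection pattern described above can be sketched like this; the interface shape, method names, and fake are illustrative assumptions, not the repo's exact API:

```go
package main

import (
	"fmt"
	"strings"
)

// GitRunner abstracts git execution: production shells out to git,
// tests inject a fake. Method names here are illustrative.
type GitRunner interface {
	Output(args ...string) (string, error)
}

// FakeGitRunner returns canned responses keyed by the joined git args.
type FakeGitRunner struct {
	Responses map[string]string
}

func (f *FakeGitRunner) Output(args ...string) (string, error) {
	key := strings.Join(args, " ")
	if out, ok := f.Responses[key]; ok {
		return out, nil
	}
	return "", fmt.Errorf("fake: no response for %q", key)
}

// detectDefaultBranch shows how an injected runner keeps the fallback
// chain (<remote>/HEAD -> <remote>/main -> <remote>/master) testable.
func detectDefaultBranch(r GitRunner, remote string) string {
	if out, err := r.Output("symbolic-ref", "refs/remotes/"+remote+"/HEAD"); err == nil {
		return strings.TrimSpace(out)
	}
	for _, b := range []string{"main", "master"} {
		if _, err := r.Output("rev-parse", "--verify", remote+"/"+b); err == nil {
			return remote + "/" + b
		}
	}
	return ""
}

func main() {
	fake := &FakeGitRunner{Responses: map[string]string{
		"rev-parse --verify origin/main": "abc123",
	}}
	fmt.Println(detectDefaultBranch(fake, "origin")) // origin/main
}
```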

Context

Resolves TE-5476

Depends on:

  • TE-5489 -- Server-side commit list API (endpoint live, pending production deployment for end-to-end testing)
  • TE-5441 -- Presigned S3 upload endpoint

Changes

  • Add internal/git package: GitRunner interface with ExecGitRunner production implementation and shared FakeGitRunner for tests, default branch detection with <remote>/HEAD fallback chain, 3-strategy fork-point detection with MainlineCache (bounded by --since to match the lookback window), bulk metadata fetch via git log --no-walk --stdin, local commit filtering via git cat-file --batch-check, chunked missing commit fetch with recursive bisect-on-error, concurrent diff collection with configurable worker pool and context cancellation
  • Add internal/packaging package: tar.gz archive creation with files nested inside a backfill-<org>-<suite>-<timestamp> directory containing commit-metadata.jsonl (one JSON object per line) and metadata.json manifest with config options (days, remote, skip_diffs) and commit date range (min_commit_date, max_commit_date)
  • Add internal/upload package: presigned S3 POST upload with buffered multipart form encoding (Content-Length required by S3), accepts context.Context for cancellation/timeout support
  • Add API client methods: FetchCommitList (text/plain content negotiation for newline-delimited SHAs, URL path segments escaped), PresignUpload (presigned S3 POST form, URL path segments escaped), VerifyTokenScopes (fast-fail scope check via GET /v2/access-token)
  • Add bktec tools backfill-commit-metadata subcommand: full pipeline from commit list fetch through to S3 upload, with --days, --remote, --skip-diffs, --output, --upload, --concurrency flags
  • --upload flag dispatches to an upload-only path (scope check, presign, upload) -- replaces the former separate upload-commit-metadata subcommand
  • Add config fields (Days, Remote, Concurrency, SkipDiffs, Output, UploadFile) and validation that branches on --upload (upload mode only requires API connection fields)
  • Gate tools parent command visibility on BKTEC_PREVIEW_SELECTION (always invocable, hidden from --help unless env var is set)
  • Token scope verification before git work: require read_suites + write_suites, downgrade missing write_suites to warning when --output is set
  • Retain tarball on upload failure so the user can retry with --upload
  • Add docs/commit-metadata-backfill.md guide and README section
  • Use dependency injection for GitRunner (passed as parameter, not a global)
  • Unhide --organization-slug for the backfill command (visible in --help under TEST ENGINE category) since the command is CLI-first, not CI-only like run/plan
  • Use flag-first error messages in ValidateForBackfillCommitMetadata() (e.g. --organization-slug / BUILDKITE_ORGANIZATION_SLUG must not be blank) instead of env-var-only format

Verification

Backed by specs. AI assisted.

# Run all new tests
go test ./internal/git/... ./internal/packaging/... ./internal/upload/... ./internal/command/... ./internal/config/...

# Manual test: generate tarball locally
bktec --debug tools backfill-commit-metadata \
  --access-token "bkua_..." \
  --suite-slug "my-suite" \
  --output /tmp/test.tar.gz

# Manual test: upload previously generated tarball
bktec tools backfill-commit-metadata \
  --access-token "bkua_..." \
  --upload /tmp/test.tar.gz

Verified working e2e locally.

❯ BUILDKITE_TEST_ENGINE_API_ACCESS_TOKEN=bkua_xxxx \
BUILDKITE_ORGANIZATION_SLUG=my_org \
BUILDKITE_TEST_ENGINE_SUITE_SLUG=my_suite \
BKTEC_PREVIEW_SELECTION=1 \
test-engine-client tools backfill-commit-metadata --days=90
+++ Buildkite Test Engine Client: bktec dev

Fetching commit list for suite "my_suite" (last 90 days)...
Server returned 9115 commits
Fetching 105 missing commits from origin...
Processing 9115 commits
Building mainline cache...
Fetching commit metadata...
Collecting diffs...
Processed 9115/9115 commits
Packaging tarball...
Requesting presigned upload URL...
Uploading to S3...
Uploaded to s3://buildkite-te-commit-metadata/backfill/org=abc/20260407T014403Z_019d659c-6ddf-7509-b4c5-9430f9ed7381.tar.gz
Done. 9115 commits exported.

Alternatively, it can be invoked without env vars:

❯ test-engine-client tools backfill-commit-metadata --access-token bkua_xxxx --organization-slug my_org --suite-slug my_suite

Deployment

Low risk. The command is hidden behind the tools parent command which is only visible when BKTEC_PREVIEW_SELECTION is enabled. The command is always registered (invocable directly) but not discoverable without the env var. No changes to existing run or plan behaviour.

Rollback

Yes

@gchan gchan force-pushed the gordon/te-5476-backfill-commit-metadata branch 3 times, most recently from 5f60e10 to 570aed5 Compare April 2, 2026 04:19
gchan added 17 commits April 7, 2026 10:22
Add the git abstraction layer for the backfill-commit-metadata command:

- GitRunner interface with ExecGitRunner production implementation for
  testability via dependency injection
- DetectDefaultBranch with origin/HEAD -> origin/main -> origin/master
  fallback chain
- 3-strategy fork-point detection (--fork-point, mainline parent
  fallback with MainlineCache, plain merge-base) ported from reporummage
- FetchBulkMetadata using git log --no-walk --stdin with ASCII unit/record
  separators for single-call bulk extraction
- FilterExistingCommits via git cat-file --batch-check for efficient
  missing commit detection
- CollectDiffs with 10-goroutine worker pool for concurrent per-commit
  diff collection (--name-only, --numstat, full diff, --raw)
- Comprehensive unit tests using FakeGitRunner for all components
Add JSONL + tar.gz packaging for commit metadata export:

- CommitRecord type with 14 fields matching the export schema (schema_version,
  commit_sha, parent_shas, author/committer info, message, diffs)
- ArchiveMetadata type for the metadata.json manifest
- CreateTarball writes commit-metadata.jsonl and metadata.json to a gzipped
  tar archive in a temp file, using stdlib archive/tar and compress/gzip
- GitDiff and GitDiffRaw use omitempty for smaller output with --skip-diffs
- Unit tests covering structure, JSONL content, metadata, empty records,
  omitempty behaviour, and schema version
Add presigned S3 POST upload using multipart form encoding:

- PresignedUploadForm type matching the Buildkite API response shape
- UploadToS3 streams file content via io.Pipe to avoid buffering the
  entire tarball in memory
- Form data fields are sent before the file field (S3 requirement)
- Fields sent in deterministic order for testability
- No auth headers needed (presigned form fields handle S3 authorisation)
- Unit tests with httptest server verifying field presence, file content,
  field ordering, server errors, and missing file handling
Add FetchCommitList and PresignUpload API client methods, plus config
changes for the backfill-commit-metadata command:

- FetchCommitList requests text/plain from the commits endpoint (TE-5489)
  and parses newline-delimited SHAs. Bypasses DoWithRetry to set custom
  Accept header and handle text responses directly.
- PresignUpload requests a presigned S3 POST form from the backfill
  upload endpoint (TE-5441). Uses DoWithRetry for standard JSON handling.
- Add Days, Output, and SkipDiffs fields to Config struct
- Add ValidateForBackfillCommitMetadata with minimal validation (only
  API connection fields required, no runner/parallelism/build env)
Add the backfill-commit-metadata subcommand, hidden behind
BKTEC_ENABLE_BACKFILL=1 env var (following BKTEC_PREVIEW_SELECTION pattern):

- BackfillCommitMetadata orchestrates the full pipeline: fetch commit
  list from API, detect default branch, build mainline cache, filter
  existing commits, bulk-fetch metadata, collect diffs concurrently,
  assemble JSONL records, package as tar.gz, and upload via presigned S3
- CLI flags: --skip-diffs, --output, --days (reuses existing auth flags)
- buildCommands() conditionally registers the command when env var is set
- ValidateForBackfillCommitMetadata used for minimal config validation
- Progress reporting to stderr (every 100 commits + completion)
- Tests covering happy path, skip-diffs, no commits, all commits missing,
  API error, days parameter, and S3 upload flow
S3 presigned POST uploads require a Content-Length header. The previous
io.Pipe streaming approach produced a body with unknown length, causing
S3 to reject the request with 411 MissingContentLength.

Switch from io.Pipe to bytes.Buffer so the HTTP client can compute and
set Content-Length automatically. This buffers the tarball in memory
which is acceptable for backfill uploads (typically tens of MB).
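The buffered approach can be sketched as below. This is a minimal illustration of the bytes.Buffer pattern, not the repo's actual upload code; the form-field names and filename are assumed:

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
)

// buildUploadRequest encodes the whole multipart form into a
// bytes.Buffer so net/http can derive Content-Length from the buffer,
// avoiding S3's 411 MissingContentLength rejection.
func buildUploadRequest(url string, fields map[string]string, tarball []byte) (*http.Request, error) {
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	// S3 requires the form fields to precede the file field.
	for k, v := range fields {
		if err := w.WriteField(k, v); err != nil {
			return nil, err
		}
	}
	fw, err := w.CreateFormFile("file", "backfill.tar.gz")
	if err != nil {
		return nil, err
	}
	if _, err := fw.Write(tarball); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, url, &buf)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	// Because the body is a *bytes.Buffer, http.NewRequest sets
	// req.ContentLength automatically from buf.Len().
	return req, nil
}

func main() {
	req, err := buildUploadRequest("https://example.invalid", map[string]string{"key": "k"}, []byte("data"))
	if err != nil {
		panic(err)
	}
	fmt.Println(req.ContentLength > 0) // true
}
```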
Remove the os.Rename-with-fallback pattern and always copy the temp
tarball to the output path. The rename only saved a syscall when src
and dst were on the same filesystem, which is the common case but not
worth the extra code path.
…SHAs from filter

Update the internal/git package with three related changes:

- FilterExistingCommits now returns the missing SHA list ([]string)
  instead of just a count, so callers can pass the list to
  FetchMissingCommits
- DetectDefaultBranch accepts a remote parameter instead of hardcoding
  'origin', enabling --remote flag support for non-standard remotes
- Add FetchMissingCommits with chunked fetching (1000 per batch) and
  recursive bisect-on-error to isolate unfetchable SHAs (ported from
  reporummage's fetchCommitsInChunks + fetchCommitsBisectOnError)
- Add TestDetectDefaultBranch_CustomRemote and four FetchMissingCommits
  tests (all succeed, bisect on error, all unfetchable, empty list)
Wire the new git package changes into the CLI and command orchestration:

- Add Remote config field and --remote CLI flag (default 'origin',
  env var BUILDKITE_TEST_ENGINE_BACKFILL_REMOTE)
- Pass cfg.Remote to DetectDefaultBranch
- Insert fetch-missing step between filter and mainline cache: after
  initial filter, attempt to fetch missing commits from the remote,
  then re-filter to pick up newly available commits
- Update SkippedCommits in archive metadata to use len(missingCommits)
- Restore backfillEnabled() and BKTEC_ENABLE_BACKFILL env var gating
  that was lost during unstaged file restoration
After writing the tar.gz, rename the temp file from the random
os.CreateTemp suffix to include a UTC timestamp (e.g.
bktec-commit-metadata-20260402T100000Z.tar.gz). This makes leftover
temp files identifiable if cleanup is skipped. Falls back to the
original temp path if the rename fails.
Add workerCount parameter to CollectDiffs instead of using a hardcoded
constant. Export DefaultWorkerCount (10) for callers that want the
standard pool size. This allows tuning concurrency for environments
with limited resources or when processing very large commit lists.
Add --concurrency flag (default 10, env BUILDKITE_TEST_ENGINE_BACKFILL_CONCURRENCY)
to control the number of concurrent git operations during diff collection.
Useful for tuning on resource-constrained environments or when processing
very large commit lists.
The map stores each commit's first parent only, not multiple parents.
Rename the field to singular to accurately reflect the 1:1 mapping.
Move FakeGitRunner from git_test.go to internal/git/fake_runner.go
(non-test file under internal/) so it can be imported by tests in
other packages. Remove the duplicate fakeGitRunner from the command
tests and use git.FakeGitRunner directly.
De-duplicate Output and OutputWithStdin by extracting a common run
method that handles command setup, debug logging, execution, and error
formatting. The only difference (stdin) is passed as an io.Reader
parameter (nil for Output, strings.NewReader for OutputWithStdin).

Also adds the package-level doc comment for the git package.
Replace the dense single-line format string with named constants for
each git format placeholder (fmtHash, fmtAuthorN, etc.) joined by
the field separator. Each placeholder is documented with its meaning,
making the field order and separators self-evident.
Export metadataFormat as MetadataFormat so tests in other packages can
reference the single source of truth. Remove the hardcoded copy
(testMetadataFormat) from the command tests.
@gchan gchan force-pushed the gordon/te-5476-backfill-commit-metadata branch 2 times, most recently from 21d1044 to 9d19c16 Compare April 7, 2026 01:35
gchan added 6 commits April 7, 2026 13:37
…ELECTION

Replace the BKTEC_ENABLE_BACKFILL env var gating with the existing
BKTEC_PREVIEW_SELECTION env var. The backfill-commit-metadata command
is always registered but hidden from bktec --help unless
BKTEC_PREVIEW_SELECTION is enabled. Users who know about it can invoke
it directly regardless of the env var.

Remove backfillEnvVar, backfillEnabled(), and buildCommands(). The
command list is now a static declaration on the cliCommand var.
Extend ArchiveMetadata with fields reflecting the export configuration
and data provenance:

- Days, Remote, SkippedDiffs: record the config options used, so the
  downstream pipeline knows the parameters of the export
- MinCommitDate, MaxCommitDate: ISO 8601 date range of the commits in
  the archive, computed from CommitterDate during record assembly
  (lexicographic comparison works for ISO 8601 strings)

These help the ingestion pipeline understand what data the archive
contains without needing to scan the JSONL.
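The lexicographic date-range trick works because ISO 8601 timestamps sort the same way as strings; a minimal sketch (function name assumed):

```go
package main

import "fmt"

// dateRange tracks min/max commit dates with plain string comparison,
// valid because ISO 8601 strings sort lexicographically.
func dateRange(committerDates []string) (min, max string) {
	for _, d := range committerDates {
		if min == "" || d < min {
			min = d
		}
		if d > max {
			max = d
		}
	}
	return min, max
}

func main() {
	min, max := dateRange([]string{
		"2026-03-01T10:00:00Z",
		"2026-01-15T09:30:00Z",
		"2026-04-02T23:59:59Z",
	})
	fmt.Println(min, max) // 2026-01-15T09:30:00Z 2026-04-02T23:59:59Z
}
```

Note this only holds when all timestamps share the same zone suffix and precision; mixed offsets would need real time parsing.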
Fast-fail with a clear error if the API token is missing required
scopes, rather than spending minutes on git metadata collection only
to get a 403 at upload time.

- Add VerifyTokenScopes via GET /v2/access-token, checking for
  read_suites (commit list) and write_suites (presigned upload)
- When --output is set, only require read_suites since upload is
  skipped; downgrade missing write_suites to a stderr warning
- Scope check runs immediately after API client creation (step 2),
  before any git commands
- Tests: _ScopeCheckFails (missing write_suites without --output
  returns error), _ScopeCheckWarnsWithOutput (missing write_suites
  with --output warns but succeeds)
On presign or S3 upload failure, keep the temp tarball and print its
path to stderr instead of deleting it. This avoids forcing the user to
re-run the entire git metadata collection (which can take minutes) just
to retry the upload step. The retained file can be uploaded later with
the upload-commit-metadata command.
Move backfill-commit-metadata from a top-level hidden command to
bktec tools backfill-commit-metadata using urfave/cli/v3 native
command nesting. The tools parent is hidden from bktec --help unless
BKTEC_PREVIEW_SELECTION is enabled, but is always invocable directly.

This keeps the top-level namespace clean for run/plan and provides
a natural home for future utility commands (e.g. upload-commit-metadata).
Global flags like --debug propagate via cmd.Root().Bool().
Add a command to upload a previously generated commit metadata tarball
to Buildkite, for cases where generation and upload happen in separate
steps (e.g. air-gapped environments or split CI pipelines).

- UploadCommitMetadata command: verify file exists, check write_suites
  token scope, presign, upload to S3
- UploadFile config field and ValidateForUploadCommitMetadata validator
  (requires API connection fields + --file path, no suite slug needed)
- --file flag (required) under bktec tools upload-commit-metadata
- Shares the existing PresignUpload + UploadToS3 logic with the
  backfill command
- Tests: happy path (end-to-end with mock S3), file not found,
  scope check failure (missing write_suites)
@gchan gchan force-pushed the gordon/te-5476-backfill-commit-metadata branch from 9d19c16 to f18908a Compare April 7, 2026 01:38
@gchan gchan requested a review from pda April 7, 2026 01:43
@gchan gchan marked this pull request as ready for review April 8, 2026 04:32
@gchan gchan requested a review from a team as a code owner April 8, 2026 04:32
gchan added 6 commits April 8, 2026 16:39
Output a human-friendly file size (bytes, KiB, or MiB) to stdout after
creating the tarball, making it easy to see the export size at a glance
or capture programmatically.
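Human-friendly size formatting of this kind is typically a small switch over binary thresholds; a sketch under the assumption the repo uses bytes/KiB/MiB units (exact thresholds and rounding may differ):

```go
package main

import "fmt"

// humanSize formats a byte count as bytes, KiB, or MiB.
func humanSize(n int64) string {
	switch {
	case n >= 1<<20:
		return fmt.Sprintf("%.1f MiB", float64(n)/(1<<20))
	case n >= 1<<10:
		return fmt.Sprintf("%.1f KiB", float64(n)/(1<<10))
	default:
		return fmt.Sprintf("%d bytes", n)
	}
}

func main() {
	fmt.Println(humanSize(512))     // 512 bytes
	fmt.Println(humanSize(2048))    // 2.0 KiB
	fmt.Println(humanSize(5 << 20)) // 5.0 MiB
}
```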
The return value from runner.Output was captured and immediately
discarded. Use _ directly in the assignment instead.
The processed counter is only read and written from the single goroutine
that ranges over resultCh, so atomics are unnecessary.
Replace the parenthetical referencing an internal tool name with a
description of what the symbolic-ref call actually does.
Pass GitRunner as a parameter to BackfillCommitMetadata instead of using
a package-level mutable global. Tests now pass the fake runner directly,
removing the setGitRunnerFactory helper and the implicit coupling via
shared mutable state. Also renumber step comments (4-12) after the
removed step.
Remove the deferred out.Close() and close explicitly in both the error
and success paths. Previously the file was closed twice on success, with
the deferred close silently discarding any error.
@gchan gchan force-pushed the gordon/te-5476-backfill-commit-metadata branch from b672f29 to 26e7b52 Compare April 8, 2026 04:39
…support

Add context.Context parameter to UploadToS3 and use
http.NewRequestWithContext so the S3 upload can be cancelled or timed
out externally.
@gchan gchan force-pushed the gordon/te-5476-backfill-commit-metadata branch from d6240b1 to b38adad Compare April 8, 2026 05:02

pda commented Apr 8, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b38adad311



pda commented Apr 8, 2026

I wonder if bktec tools upload-commit-metadata should just be a flag of bktec tools backfill-commit-metadata rather than a whole separate command 🤔

# all in one
bktec tools backfill-commit-metadata

# or, produce locally…
bktec tools backfill-commit-metadata --output data.tar.gz
# … and then upload:
bktec tools backfill-commit-metadata --upload data.tar.gz

…as --upload flag

Replace the separate upload-commit-metadata subcommand with a --upload
flag on backfill-commit-metadata, per review feedback. The two operations
are the same workflow (generate + upload) split across invocations, so a
single command with mode dispatch is a better UX.

When --upload is set, BackfillCommitMetadata dispatches to a private
uploadOnly helper that skips all git work and goes straight to scope
verification, presign, and S3 upload. Validation branches on UploadFile:
upload mode only requires API connection fields (no suite slug, days,
concurrency).

- Add --upload flag to backfill-commit-metadata (replaces --file on the
  removed upload-commit-metadata subcommand)
- Remove upload-commit-metadata subcommand, handler, validation method,
  flags, and dedicated command/test files
- Update ValidateForBackfillCommitMetadata to branch on UploadFile
- Add uploadOnly private function with the upload-only pipeline
- Move upload tests into backfill test file (3 tests: happy path, file
  not found, scope check failure)
- Update docs and README to reference --upload instead of the separate
  command

gchan commented Apr 8, 2026

I wonder if bktec tools upload-commit-metadata should just be a flag of bktec tools backfill-commit-metadata rather than a whole separate command 🤔

# all in one
bktec tools backfill-commit-metadata

# or, produce locally…
bktec tools backfill-commit-metadata --output data.tar.gz
# … and then upload:
bktec tools backfill-commit-metadata --upload data.tar.gz

I like this and it's been implemented in 3549938

gchan added 2 commits April 9, 2026 10:17
FetchCommitList and UploadToS3 had no per-request timeouts, so a stalled
connection could block the command indefinitely.

- Add 30s context timeout to FetchCommitList before the HTTP request
- Replace http.DefaultClient in UploadToS3 with a client using a 5
  minute timeout (supports up to 1GB uploads at ~30 Mbps)
…llisions

The temp tarball is renamed to include a UTC timestamp for identification,
but second-level granularity means concurrent runs started in the same
second would collide. Add milliseconds to the format string to narrow the
collision window.

pda commented Apr 10, 2026

Without args, bktec tools backfill-commit-metadata fails with:

invalid configuration...
BUILDKITE_ORGANIZATION_SLUG must not be blank
BUILDKITE_TEST_ENGINE_SUITE_SLUG must not be blank

Setting those in env does work, but for a CLI tool I think the more conventional solution is to pass --organization-slug and --suite-slug flags.

These flags both work, but only one of them is mentioned in --help:

$ bktec tools backfill-commit-metadata --help
NAME:
   bktec tools backfill-commit-metadata - Collect historical git commit metadata and upload to Buildkite

USAGE:
   bktec tools backfill-commit-metadata [options]

OPTIONS:
   --help, -h  show help

   BACKFILL

   --concurrency int  Number of concurrent git operations for diff collection (default: 10) [$BUILDKITE_TEST_ENGINE_BACKFILL_CONCURRENCY]
   --days int         Number of days of commit history to export (1-90) (default: 90) [$BUILDKITE_TEST_ENGINE_BACKFILL_DAYS]
   --output string    Write tarball to a local file instead of uploading to Buildkite
   --remote string    Git remote name for fetching missing commits and detecting default branch (default: "origin") [$BUILDKITE_TEST_ENGINE_BACKFILL_REMOTE]
   --skip-diffs       Omit full git diffs from export to reduce upload size [$BUILDKITE_TEST_ENGINE_SKIP_DIFFS]
   --upload string    Upload a previously generated commit metadata tarball instead of collecting new data

   TEST ENGINE

   --access-token string  Buildkite API access token [$BUILDKITE_TEST_ENGINE_API_ACCESS_TOKEN]
   --suite-slug string    Buildkite suite slug [$BUILDKITE_TEST_ENGINE_SUITE_SLUG]


GLOBAL OPTIONS:
   --version  print version information and exit
   --debug    Enable debug output [$BUILDKITE_TEST_ENGINE_DEBUG_ENABLED]

So, ideally:

  • --help would show --organization-slug as an option
  • unspecified org/slug would report missing required args, not required env.


pda commented Apr 10, 2026

I wonder if it's bad manners that our .tar.gz has files at the top level, rather than creating a directory.

ChatGPT reckons it is:

[screenshot of ChatGPT's answer omitted]

gchan added 3 commits April 13, 2026 10:03
Nest commit-metadata.jsonl and metadata.json inside a directory named
backfill-<org>-<suite>-<timestamp> (e.g. backfill-my-org-my-suite-20260402T100000.000Z/)
instead of placing them at the tar root. Having files at the top level
of a tar archive is considered bad practice as extracting without -C
can pollute the working directory.

The directory entry is written as an explicit tar.TypeDir header before
the two file entries. The directory name includes org slug, suite slug,
and a UTC timestamp for uniqueness across multiple archives.

Tests updated to use suffix-based file lookup since the directory name
includes a runtime timestamp. New TestCreateTarball_DirectoryStructure
validates the directory entry type, naming convention, and file ordering.
…rs.go

De-duplicate ReadTarball, FindTarEntry, and HasTarEntry which were
defined independently in both internal/packaging/tarball_test.go and
internal/command/backfill_commit_metadata_test.go.

Export them from internal/packaging/test_helpers.go (non-test file under
internal/, following the FakeGitRunner pattern in internal/git/) so both
packages can import a single implementation.

Also simplifies the SkipDiffs command test by replacing a manual
tar-reading loop with ReadTarball + FindTarEntry, removing the
archive/tar and compress/gzip imports from that file.
…validation errors

The backfill command is CLI-first (run manually outside CI), but
--organization-slug was hidden from --help because the shared flag
variable has Hidden: true for the run/plan commands where BUILDKITE_ORGANIZATION_SLUG
is auto-populated by the Buildkite agent.

Add a non-hidden backfillOrganizationSlugFlag under the TEST ENGINE
category so it appears alongside --access-token and --suite-slug in
bktec tools backfill-commit-metadata --help.

Update ValidateForBackfillCommitMetadata() error messages to reference
flag names first (e.g. "--organization-slug / BUILDKITE_ORGANIZATION_SLUG
must not be blank") instead of env-var-only format. This is consistent
with how --days and --concurrency are already validated in the same
function, and more helpful for CLI users who may not know the env var
names.

Add validation tests for missing org slug, suite slug, access token,
and upload-only mode skipping suite slug. Update docs and README
examples to show flag-based usage.

gchan commented Apr 12, 2026

Without args, bktec tools backfill-commit-metadata fails with:

invalid configuration...
BUILDKITE_ORGANIZATION_SLUG must not be blank
BUILDKITE_TEST_ENGINE_SUITE_SLUG must not be blank

Setting those in env does work, but for a CLI tool I think the more conventional solution is to pass --organization-slug and --suite-slug flags.

bktec is normally executed in a CI environment, so setting env vars is preferred over passing flags. However, this command could be executed in a local dev environment (someone's computer), so updating the error messages to include the flags seems reasonable.

invalid configuration...
--access-token / BUILDKITE_TEST_ENGINE_API_ACCESS_TOKEN must not be blank
--organization-slug / BUILDKITE_ORGANIZATION_SLUG must not be blank
--suite-slug / BUILDKITE_TEST_ENGINE_SUITE_SLUG must not be blank

These flags both work, but only one of them is mentioned in --help:

So, ideally:

* `--help` would show `--organization-slug` as an option

* unspecified org/slug would report missing required args, not required env.

bktec is normally executed in a CI environment where BUILDKITE_ORGANIZATION_SLUG is already set, which could be why --organization-slug has been hidden. I've unhidden this option for the backfill-commit-metadata command specifically.

Now --help makes it clear which options are available:

NAME:
   bktec tools backfill-commit-metadata - Collect historical git commit metadata and upload to Buildkite

USAGE:
   bktec tools backfill-commit-metadata [options]

OPTIONS:
   --help, -h  show help

   BACKFILL

   --concurrency int  Number of concurrent git operations for diff collection (default: 10) [$BUILDKITE_TEST_ENGINE_BACKFILL_CONCURRENCY]
   --days int         Number of days of commit history to export (1-90) (default: 90) [$BUILDKITE_TEST_ENGINE_BACKFILL_DAYS]
   --output string    Write tarball to a local file instead of uploading to Buildkite
   --remote string    Git remote name for fetching missing commits and detecting default branch (default: "origin") [$BUILDKITE_TEST_ENGINE_BACKFILL_REMOTE]
   --skip-diffs       Omit full git diffs from export to reduce upload size [$BUILDKITE_TEST_ENGINE_SKIP_DIFFS]
   --upload string    Upload a previously generated commit metadata tarball instead of collecting new data

   TEST ENGINE

   --access-token string       Buildkite API access token [$BUILDKITE_TEST_ENGINE_API_ACCESS_TOKEN]
   --organization-slug string  Buildkite organization slug [$BUILDKITE_ORGANIZATION_SLUG]
   --suite-slug string         Buildkite suite slug [$BUILDKITE_TEST_ENGINE_SUITE_SLUG]


GLOBAL OPTIONS:
   --version  print version information and exit
   --debug    Enable debug output [$BUILDKITE_TEST_ENGINE_DEBUG_ENABLED]

@gchan gchan requested a review from jameshill April 13, 2026 01:04

@jameshill jameshill left a comment


Low risk — purely additive. New command is hidden behind BKTEC_PREVIEW_SELECTION and touches nothing in the existing run/split/plan paths. Config struct additions are all json:"-" so inert to existing customers. Happy for this to go in for internal usage.

@gchan gchan merged commit a110c6f into main Apr 13, 2026
3 checks passed
@gchan gchan deleted the gordon/te-5476-backfill-commit-metadata branch April 13, 2026 01:14