fix(telemetry): await initTelemetry so command_executed carries repo_id by rhuanbarreto · Pull Request #211 · archgate/cli

rhuanbarreto · 2026-04-15T19:34:50Z

Summary

cli.ts was kicking off initTelemetry() with void, so the async getRepoContext() call was still in flight when Commander's preAction hook fired command_executed. The result: repo_id was always null on that event, and ~25% of command_completed events also missed it when the process exited before the snapshot resolved.
Fix: await initTelemetry() before command handlers run.
Parallelized the three git probes in getRepoContext() to hide the added startup latency. rev-parse --is-inside-work-tree gates first (so non-git dirs still spawn just one subprocess), then git config --get remote.origin.url runs concurrently with resolveDefaultBranch, which itself fires symbolic-ref and rev-parse --abbrev-ref in parallel.
Updated the initTelemetry JSDoc: callers should await before emitting events.

Cost measured on Windows

Windows git spawn is ~25ms per subprocess, so serial calls were the dominant cost.

Scenario	Before	After	Delta
Git repo w/ remote, p50	80ms	63ms	-17ms (-21%)
Git repo w/ remote, p95	99ms	84ms	-15ms
Non-git directory	24ms	25ms	~0 (gate unchanged)

Evidence the race is real (PostHog project 145977, last 90 days, v0.28.0)

Event	Total	With `repo_id`
`command_executed`	90	0
`command_completed`	90	69
`check_completed`	89	69

The 0/90 on command_executed is the race; the 69/90 on the other two is the same race manifesting as process exit before the async resolve lands.

Test plan

Git repo with remote → command_executed carries repo_id (92b8023d174a9222) + repo_host=github
Non-git directory → event fires, repo_id=null, repo_is_git=false
ARCHGATE_TELEMETRY=0 → 0 events captured (disabled path unchanged)
NODE_ENV=test → 0 events captured (test short-circuit unchanged)
bun run validate (lint + typecheck + format + 639 tests + ADR check + build check) passes

🤖 Generated with Claude Code

cloudflare-workers-and-pages · 2026-04-15T19:42:17Z

Deploying archgate-cli with Cloudflare Pages

Latest commit:	`e04264e`
Status:	✅ Deploy successful!
Preview URL:	https://47b17c73.archgate-cli.pages.dev
Branch Preview URL:	https://fix-telemetry-repo-id-race.archgate-cli.pages.dev

View logs

The CLI kicked off initTelemetry() with `void`, so the async getRepoContext() call was still in flight when the Commander preAction hook fired `command_executed` — meaning `repo_id` was always null on that event (0/90 in the last 90 days of v0.28.0 on PostHog) and ~25% of `command_completed` events also missed it when the process exited before the snapshot resolved. Await initTelemetry() before command handlers run, and parallelize the three git probes inside getRepoContext() (rev-parse --is-inside-work-tree gates first, then `git config --get remote.origin.url` runs concurrently with resolveDefaultBranch, which itself fires symbolic-ref and rev-parse --abbrev-ref in parallel). Windows spawn is the dominant cost (~25ms each), so going serial added ~80ms p50 / ~100ms p95 to every CLI invocation; parallelizing drops that to ~63ms p50 / ~84ms p95 (a ~21% reduction on the hot path). Non-git directories are unchanged — the gate stays serial. Verified: - Git repo with remote: `command_executed` carries `repo_id` + `repo_host`. - Non-git directory: event fires, `repo_id=null, repo_is_git=false`. - ARCHGATE_TELEMETRY=0: 0 events (disabled path unchanged). - NODE_ENV=test: 0 events (test short-circuit unchanged). - `bun run validate` passes (639 tests, lint/typecheck/format/ADR/build). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

On Windows (and anywhere the runtime honours timer handles), every command that exits naturally through `main()` returning — i.e. every command except `check` and `upgrade` — used to linger ~3 seconds after its output appeared. Root causes + fixes below. ## Timer leaks (Phase 1) `flushTelemetry`, `gitCredentialFill`, the WSL fallback in `resolveCommand`, and the per-rule timeout in the check runner all race a spawn/shutdown against a `setTimeout`/`Bun.sleep` that's never cancelled when the race winner resolves. The stray timer keeps the event loop alive for its full duration. Each site now captures the timer id and clears it in a `.finally`. `flushSentry` wraps `Sentry.flush` with the same cancellable race pattern as defence-in-depth against leaks inside the SDK. `main()` now finishes with an explicit `process.exit(0)` after flushes complete so a future leaked handle in a third-party SDK can't regress this class of bug. ## Lazy SDK loading (Phase 2 #6) `posthog-node` and `@sentry/node-core/light` are now loaded via dynamic `import()` inside `initTelemetry` / `initSentry`. The `ARCHGATE_TELEMETRY=0` path skips the parse and init cost entirely. Both inits run concurrently via `Promise.all` in `cli.ts`. ## Startup trimming (Phase 2) - `installGit` short-circuits on `Bun.which("git")` before falling back to the async cross-env resolver (avoids a WSL subprocess on Windows in the 99% happy path). - `getRepoContext` now fires all four git probes in parallel instead of gating on `rev-parse --is-inside-work-tree`, saving one serial spawn on git repos. ## Check hot path (Phase 3) - `getGitTrackedFiles` is cached per project root — `resolveScopedFiles` used to spawn `git ls-files` once per ADR (16 spawns in this repo). - `ensureRulesShim` compares existing `rules.d.ts` content and skips the write when unchanged. - New `parseAllAdrs(projectRoot)` caches the readdir + parse of every ADR once per process. `loadRuleAdrs`, `loadAllAdrs` (in context.ts), and `adr list` now share that cache, so `review-context --run-checks` no longer reads the ADR directory twice. - `adr list` uses the shared cache — one readdir instead of two. - `buildSummary` is built once in `check.ts` and passed into every reporter + `getExitCode` instead of being recomputed three times. ## Telemetry payload (Phase 4) - `getCommonProperties` now splits into a cached "static" snapshot (platform, install method, CI, locale) and a fresh "dynamic" overlay (project context, repo snapshot). Cuts redundant `getPlatformInfo` / `detectInstallMethod` / `Intl.DateTimeFormat` calls when multiple events are emitted per command. - The opportunistic update check at startup now uses a 5s fetch timeout (down from 15s) so a slow network never extends exit time. The explicit `archgate upgrade` path keeps the 15s default. ## Measured impact (Windows, this repo) | Command | Before | After | |------------------|---------|---------| | adr list | 3.60 s | 0.82 s | | adr show | 3.60 s | 0.72 s | | doctor | 3.90 s | 1.06 s | | telemetry status | 3.70 s | 0.72 s | | check | 0.88 s | 0.85 s | | adr list (tel off)| 0.59 s | 0.42 s | | doctor (tel off) | 3.50 s | 0.80 s | ## Not changed The `await initTelemetry()` in `main()` — PR #211 added it so `command_executed` / `command_completed` carry `repo_id`. Keeping the await; lazy-loading plus parallel init keeps the added startup cost bounded. ## Validation `bun run validate` passes (lint + typecheck + format + 662 tests + 22/22 ADR rules + build check).

* perf(cli): remove 3s exit tail and trim startup overhead On Windows (and anywhere the runtime honours timer handles), every command that exits naturally through `main()` returning — i.e. every command except `check` and `upgrade` — used to linger ~3 seconds after its output appeared. Root causes + fixes below. ## Timer leaks (Phase 1) `flushTelemetry`, `gitCredentialFill`, the WSL fallback in `resolveCommand`, and the per-rule timeout in the check runner all race a spawn/shutdown against a `setTimeout`/`Bun.sleep` that's never cancelled when the race winner resolves. The stray timer keeps the event loop alive for its full duration. Each site now captures the timer id and clears it in a `.finally`. `flushSentry` wraps `Sentry.flush` with the same cancellable race pattern as defence-in-depth against leaks inside the SDK. `main()` now finishes with an explicit `process.exit(0)` after flushes complete so a future leaked handle in a third-party SDK can't regress this class of bug. ## Lazy SDK loading (Phase 2 #6) `posthog-node` and `@sentry/node-core/light` are now loaded via dynamic `import()` inside `initTelemetry` / `initSentry`. The `ARCHGATE_TELEMETRY=0` path skips the parse and init cost entirely. Both inits run concurrently via `Promise.all` in `cli.ts`. ## Startup trimming (Phase 2) - `installGit` short-circuits on `Bun.which("git")` before falling back to the async cross-env resolver (avoids a WSL subprocess on Windows in the 99% happy path). - `getRepoContext` now fires all four git probes in parallel instead of gating on `rev-parse --is-inside-work-tree`, saving one serial spawn on git repos. ## Check hot path (Phase 3) - `getGitTrackedFiles` is cached per project root — `resolveScopedFiles` used to spawn `git ls-files` once per ADR (16 spawns in this repo). - `ensureRulesShim` compares existing `rules.d.ts` content and skips the write when unchanged. - New `parseAllAdrs(projectRoot)` caches the readdir + parse of every ADR once per process. `loadRuleAdrs`, `loadAllAdrs` (in context.ts), and `adr list` now share that cache, so `review-context --run-checks` no longer reads the ADR directory twice. - `adr list` uses the shared cache — one readdir instead of two. - `buildSummary` is built once in `check.ts` and passed into every reporter + `getExitCode` instead of being recomputed three times. ## Telemetry payload (Phase 4) - `getCommonProperties` now splits into a cached "static" snapshot (platform, install method, CI, locale) and a fresh "dynamic" overlay (project context, repo snapshot). Cuts redundant `getPlatformInfo` / `detectInstallMethod` / `Intl.DateTimeFormat` calls when multiple events are emitted per command. - The opportunistic update check at startup now uses a 5s fetch timeout (down from 15s) so a slow network never extends exit time. The explicit `archgate upgrade` path keeps the 15s default. ## Measured impact (Windows, this repo) | Command | Before | After | |------------------|---------|---------| | adr list | 3.60 s | 0.82 s | | adr show | 3.60 s | 0.72 s | | doctor | 3.90 s | 1.06 s | | telemetry status | 3.70 s | 0.72 s | | check | 0.88 s | 0.85 s | | adr list (tel off)| 0.59 s | 0.42 s | | doctor (tel off) | 3.50 s | 0.80 s | ## Not changed The `await initTelemetry()` in `main()` — PR #211 added it so `command_executed` / `command_completed` carry `repo_id`. Keeping the await; lazy-loading plus parallel init keeps the added startup cost bounded. ## Validation `bun run validate` passes (lint + typecheck + format + 662 tests + 22/22 ADR rules + build check). * test(perf): regression guard for the 3s exit tail Spawns the real CLI via `bun run src/cli.ts --version` / `--help`, asserts the median of 3 runs stays under 4s. The historical regression (un-cancelled `setTimeout` / `Bun.sleep` in the telemetry / Sentry / credential / WSL flush paths) pushed these commands to ~3.5s on Windows; normal runs sit well under 2s even on slow CI. `NODE_ENV=test` suppresses real event delivery but leaves the telemetry + Sentry SDK init/flush path active, which is where the regression lives. `ARCHGATE_TELEMETRY` is intentionally NOT set to 0 so the test exercises the enabled code path. Threshold chosen to catch the regression (always ≥3000ms) with enough headroom that slow runners don't flake.

rhuanbarreto force-pushed the fix/telemetry-repo-id-race branch from f0c516b to e04264e Compare April 15, 2026 19:52

rhuanbarreto merged commit 4ffbb70 into main Apr 15, 2026
16 of 29 checks passed

rhuanbarreto deleted the fix/telemetry-repo-id-race branch April 15, 2026 20:19

github-actions bot mentioned this pull request Apr 15, 2026

chore(release): 0.29.0 #209

Merged

rhuanbarreto mentioned this pull request Apr 16, 2026

perf(cli): remove 3s exit tail and trim startup overhead #213

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(telemetry): await initTelemetry so command_executed carries repo_id#211

fix(telemetry): await initTelemetry so command_executed carries repo_id#211
rhuanbarreto merged 1 commit intomainfrom
fix/telemetry-repo-id-race

rhuanbarreto commented Apr 15, 2026 •

edited

Loading

Uh oh!

cloudflare-workers-and-pages bot commented Apr 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rhuanbarreto commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Cost measured on Windows

Evidence the race is real (PostHog project 145977, last 90 days, v0.28.0)

Test plan

Uh oh!

cloudflare-workers-and-pages bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying archgate-cli with Cloudflare Pages

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rhuanbarreto commented Apr 15, 2026 •

edited

Loading

cloudflare-workers-and-pages bot commented Apr 15, 2026 •

edited

Loading