Skip to content

fix(telemetry): await initTelemetry so command_executed carries repo_id#211

Merged
rhuanbarreto merged 1 commit intomainfrom
fix/telemetry-repo-id-race
Apr 15, 2026
Merged

fix(telemetry): await initTelemetry so command_executed carries repo_id#211
rhuanbarreto merged 1 commit intomainfrom
fix/telemetry-repo-id-race

Conversation

@rhuanbarreto
Copy link
Copy Markdown
Contributor

@rhuanbarreto rhuanbarreto commented Apr 15, 2026

Summary

  • cli.ts was kicking off initTelemetry() with void, so the async getRepoContext() call was still in flight when Commander's preAction hook fired command_executed. The result: repo_id was always null on that event, and ~25% of command_completed events also missed it when the process exited before the snapshot resolved.
  • Fix: await initTelemetry() before command handlers run.
  • Parallelized the three git probes in getRepoContext() to hide the added startup latency. rev-parse --is-inside-work-tree gates first (so non-git dirs still spawn just one subprocess), then git config --get remote.origin.url runs concurrently with resolveDefaultBranch, which itself fires symbolic-ref and rev-parse --abbrev-ref in parallel.
  • Updated the initTelemetry JSDoc: callers should await before emitting events.

Cost measured on Windows

Windows git spawn is ~25ms per subprocess, so serial calls were the dominant cost.

Scenario Before After Delta
Git repo w/ remote, p50 80ms 63ms -17ms (-21%)
Git repo w/ remote, p95 99ms 84ms -15ms
Non-git directory 24ms 25ms ~0 (gate unchanged)

Evidence the race is real (PostHog project 145977, last 90 days, v0.28.0)

Event Total With repo_id
command_executed 90 0
command_completed 90 69
check_completed 89 69

The 0/90 on command_executed is the race; the 69/90 on the other two is the same race manifesting as process exit before the async resolve lands.

Test plan

  • Git repo with remote → command_executed carries repo_id (92b8023d174a9222) + repo_host=github
  • Non-git directory → event fires, repo_id=null, repo_is_git=false
  • ARCHGATE_TELEMETRY=0 → 0 events captured (disabled path unchanged)
  • NODE_ENV=test → 0 events captured (test short-circuit unchanged)
  • bun run validate (lint + typecheck + format + 639 tests + ADR check + build check) passes

🤖 Generated with Claude Code

@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages bot commented Apr 15, 2026

Deploying archgate-cli with  Cloudflare Pages  Cloudflare Pages

Latest commit: e04264e
Status: ✅  Deploy successful!
Preview URL: https://47b17c73.archgate-cli.pages.dev
Branch Preview URL: https://fix-telemetry-repo-id-race.archgate-cli.pages.dev

View logs

The CLI kicked off initTelemetry() with `void`, so the async getRepoContext()
call was still in flight when the Commander preAction hook fired
`command_executed` — meaning `repo_id` was always null on that event (0/90 in
the last 90 days of v0.28.0 on PostHog) and ~25% of `command_completed` events
also missed it when the process exited before the snapshot resolved.

Await initTelemetry() before command handlers run, and parallelize the three
git probes inside getRepoContext() (rev-parse --is-inside-work-tree gates
first, then `git config --get remote.origin.url` runs concurrently with
resolveDefaultBranch, which itself fires symbolic-ref and rev-parse
--abbrev-ref in parallel). Windows spawn is the dominant cost (~25ms each),
so going serial added ~80ms p50 / ~100ms p95 to every CLI invocation;
parallelizing drops that to ~63ms p50 / ~84ms p95 (a ~21% reduction on the
hot path). Non-git directories are unchanged — the gate stays serial.

Verified:
- Git repo with remote: `command_executed` carries `repo_id` + `repo_host`.
- Non-git directory: event fires, `repo_id=null, repo_is_git=false`.
- ARCHGATE_TELEMETRY=0: 0 events (disabled path unchanged).
- NODE_ENV=test: 0 events (test short-circuit unchanged).
- `bun run validate` passes (639 tests, lint/typecheck/format/ADR/build).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rhuanbarreto rhuanbarreto force-pushed the fix/telemetry-repo-id-race branch from f0c516b to e04264e Compare April 15, 2026 19:52
@rhuanbarreto rhuanbarreto merged commit 4ffbb70 into main Apr 15, 2026
16 of 29 checks passed
@rhuanbarreto rhuanbarreto deleted the fix/telemetry-repo-id-race branch April 15, 2026 20:19
@github-actions github-actions bot mentioned this pull request Apr 15, 2026
rhuanbarreto added a commit that referenced this pull request Apr 16, 2026
On Windows (and anywhere the runtime honours timer handles), every
command that exits naturally through `main()` returning — i.e. every
command except `check` and `upgrade` — used to linger ~3 seconds
after its output appeared. Root causes + fixes below.

## Timer leaks (Phase 1)

`flushTelemetry`, `gitCredentialFill`, the WSL fallback in
`resolveCommand`, and the per-rule timeout in the check runner all
race a spawn/shutdown against a `setTimeout`/`Bun.sleep` that's
never cancelled when the race winner resolves. The stray timer
keeps the event loop alive for its full duration. Each site now
captures the timer id and clears it in a `.finally`.

`flushSentry` wraps `Sentry.flush` with the same cancellable race
pattern as defence-in-depth against leaks inside the SDK.

`main()` now finishes with an explicit `process.exit(0)` after
flushes complete so a future leaked handle in a third-party SDK
can't regress this class of bug.

## Lazy SDK loading (Phase 2 #6)

`posthog-node` and `@sentry/node-core/light` are now loaded via
dynamic `import()` inside `initTelemetry` / `initSentry`. The
`ARCHGATE_TELEMETRY=0` path skips the parse and init cost entirely.
Both inits run concurrently via `Promise.all` in `cli.ts`.

## Startup trimming (Phase 2)

- `installGit` short-circuits on `Bun.which("git")` before falling
  back to the async cross-env resolver (avoids a WSL subprocess on
  Windows in the 99% happy path).
- `getRepoContext` now fires all four git probes in parallel
  instead of gating on `rev-parse --is-inside-work-tree`, saving
  one serial spawn on git repos.

## Check hot path (Phase 3)

- `getGitTrackedFiles` is cached per project root — `resolveScopedFiles`
  used to spawn `git ls-files` once per ADR (16 spawns in this repo).
- `ensureRulesShim` compares existing `rules.d.ts` content and skips
  the write when unchanged.
- New `parseAllAdrs(projectRoot)` caches the readdir + parse of every
  ADR once per process. `loadRuleAdrs`, `loadAllAdrs` (in context.ts),
  and `adr list` now share that cache, so `review-context --run-checks`
  no longer reads the ADR directory twice.
- `adr list` uses the shared cache — one readdir instead of two.
- `buildSummary` is built once in `check.ts` and passed into every
  reporter + `getExitCode` instead of being recomputed three times.

## Telemetry payload (Phase 4)

- `getCommonProperties` now splits into a cached "static" snapshot
  (platform, install method, CI, locale) and a fresh "dynamic"
  overlay (project context, repo snapshot). Cuts redundant
  `getPlatformInfo` / `detectInstallMethod` / `Intl.DateTimeFormat`
  calls when multiple events are emitted per command.
- The opportunistic update check at startup now uses a 5s fetch
  timeout (down from 15s) so a slow network never extends exit
  time. The explicit `archgate upgrade` path keeps the 15s default.

## Measured impact (Windows, this repo)

| Command          | Before  | After   |
|------------------|---------|---------|
| adr list         | 3.60 s  | 0.82 s  |
| adr show         | 3.60 s  | 0.72 s  |
| doctor           | 3.90 s  | 1.06 s  |
| telemetry status | 3.70 s  | 0.72 s  |
| check            | 0.88 s  | 0.85 s  |
| adr list (tel off)| 0.59 s  | 0.42 s  |
| doctor (tel off) | 3.50 s  | 0.80 s  |

## Not changed

The `await initTelemetry()` in `main()` — PR #211 added it so
`command_executed` / `command_completed` carry `repo_id`. Keeping
the await; lazy-loading plus parallel init keeps the added startup
cost bounded.

## Validation

`bun run validate` passes (lint + typecheck + format + 662 tests
+ 22/22 ADR rules + build check).
rhuanbarreto added a commit that referenced this pull request Apr 16, 2026
* perf(cli): remove 3s exit tail and trim startup overhead

On Windows (and anywhere the runtime honours timer handles), every
command that exits naturally through `main()` returning — i.e. every
command except `check` and `upgrade` — used to linger ~3 seconds
after its output appeared. Root causes + fixes below.

## Timer leaks (Phase 1)

`flushTelemetry`, `gitCredentialFill`, the WSL fallback in
`resolveCommand`, and the per-rule timeout in the check runner all
race a spawn/shutdown against a `setTimeout`/`Bun.sleep` that's
never cancelled when the race winner resolves. The stray timer
keeps the event loop alive for its full duration. Each site now
captures the timer id and clears it in a `.finally`.

`flushSentry` wraps `Sentry.flush` with the same cancellable race
pattern as defence-in-depth against leaks inside the SDK.

`main()` now finishes with an explicit `process.exit(0)` after
flushes complete so a future leaked handle in a third-party SDK
can't regress this class of bug.

## Lazy SDK loading (Phase 2 #6)

`posthog-node` and `@sentry/node-core/light` are now loaded via
dynamic `import()` inside `initTelemetry` / `initSentry`. The
`ARCHGATE_TELEMETRY=0` path skips the parse and init cost entirely.
Both inits run concurrently via `Promise.all` in `cli.ts`.

## Startup trimming (Phase 2)

- `installGit` short-circuits on `Bun.which("git")` before falling
  back to the async cross-env resolver (avoids a WSL subprocess on
  Windows in the 99% happy path).
- `getRepoContext` now fires all four git probes in parallel
  instead of gating on `rev-parse --is-inside-work-tree`, saving
  one serial spawn on git repos.

## Check hot path (Phase 3)

- `getGitTrackedFiles` is cached per project root — `resolveScopedFiles`
  used to spawn `git ls-files` once per ADR (16 spawns in this repo).
- `ensureRulesShim` compares existing `rules.d.ts` content and skips
  the write when unchanged.
- New `parseAllAdrs(projectRoot)` caches the readdir + parse of every
  ADR once per process. `loadRuleAdrs`, `loadAllAdrs` (in context.ts),
  and `adr list` now share that cache, so `review-context --run-checks`
  no longer reads the ADR directory twice.
- `adr list` uses the shared cache — one readdir instead of two.
- `buildSummary` is built once in `check.ts` and passed into every
  reporter + `getExitCode` instead of being recomputed three times.

## Telemetry payload (Phase 4)

- `getCommonProperties` now splits into a cached "static" snapshot
  (platform, install method, CI, locale) and a fresh "dynamic"
  overlay (project context, repo snapshot). Cuts redundant
  `getPlatformInfo` / `detectInstallMethod` / `Intl.DateTimeFormat`
  calls when multiple events are emitted per command.
- The opportunistic update check at startup now uses a 5s fetch
  timeout (down from 15s) so a slow network never extends exit
  time. The explicit `archgate upgrade` path keeps the 15s default.

## Measured impact (Windows, this repo)

| Command          | Before  | After   |
|------------------|---------|---------|
| adr list         | 3.60 s  | 0.82 s  |
| adr show         | 3.60 s  | 0.72 s  |
| doctor           | 3.90 s  | 1.06 s  |
| telemetry status | 3.70 s  | 0.72 s  |
| check            | 0.88 s  | 0.85 s  |
| adr list (tel off)| 0.59 s  | 0.42 s  |
| doctor (tel off) | 3.50 s  | 0.80 s  |

## Not changed

The `await initTelemetry()` in `main()` — PR #211 added it so
`command_executed` / `command_completed` carry `repo_id`. Keeping
the await; lazy-loading plus parallel init keeps the added startup
cost bounded.

## Validation

`bun run validate` passes (lint + typecheck + format + 662 tests
+ 22/22 ADR rules + build check).

* test(perf): regression guard for the 3s exit tail

Spawns the real CLI via `bun run src/cli.ts --version` / `--help`,
asserts the median of 3 runs stays under 4s. The historical
regression (un-cancelled `setTimeout` / `Bun.sleep` in the
telemetry / Sentry / credential / WSL flush paths) pushed these
commands to ~3.5s on Windows; normal runs sit well under 2s even
on slow CI.

`NODE_ENV=test` suppresses real event delivery but leaves the
telemetry + Sentry SDK init/flush path active, which is where
the regression lives. `ARCHGATE_TELEMETRY` is intentionally NOT
set to 0 so the test exercises the enabled code path.

Threshold chosen to catch the regression (always ≥3000ms) with
enough headroom that slow runners don't flake.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant