fix(telemetry): await initTelemetry so command_executed carries repo_id#211
Merged
rhuanbarreto merged 1 commit intomainfrom Apr 15, 2026
Merged
fix(telemetry): await initTelemetry so command_executed carries repo_id#211rhuanbarreto merged 1 commit intomainfrom
rhuanbarreto merged 1 commit intomainfrom
Conversation
Deploying archgate-cli with
|
| Latest commit: |
e04264e
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://47b17c73.archgate-cli.pages.dev |
| Branch Preview URL: | https://fix-telemetry-repo-id-race.archgate-cli.pages.dev |
The CLI kicked off initTelemetry() with `void`, so the async getRepoContext() call was still in flight when the Commander preAction hook fired `command_executed` — meaning `repo_id` was always null on that event (0/90 in the last 90 days of v0.28.0 on PostHog) and ~25% of `command_completed` events also missed it when the process exited before the snapshot resolved. Await initTelemetry() before command handlers run, and parallelize the three git probes inside getRepoContext() (rev-parse --is-inside-work-tree gates first, then `git config --get remote.origin.url` runs concurrently with resolveDefaultBranch, which itself fires symbolic-ref and rev-parse --abbrev-ref in parallel). Windows spawn is the dominant cost (~25ms each), so going serial added ~80ms p50 / ~100ms p95 to every CLI invocation; parallelizing drops that to ~63ms p50 / ~84ms p95 (a ~21% reduction on the hot path). Non-git directories are unchanged — the gate stays serial. Verified: - Git repo with remote: `command_executed` carries `repo_id` + `repo_host`. - Non-git directory: event fires, `repo_id=null, repo_is_git=false`. - ARCHGATE_TELEMETRY=0: 0 events (disabled path unchanged). - NODE_ENV=test: 0 events (test short-circuit unchanged). - `bun run validate` passes (639 tests, lint/typecheck/format/ADR/build). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
f0c516b to
e04264e
Compare
Merged
rhuanbarreto
added a commit
that referenced
this pull request
Apr 16, 2026
On Windows (and anywhere the runtime honours timer handles), every command that exits naturally through `main()` returning — i.e. every command except `check` and `upgrade` — used to linger ~3 seconds after its output appeared. Root causes + fixes below. ## Timer leaks (Phase 1) `flushTelemetry`, `gitCredentialFill`, the WSL fallback in `resolveCommand`, and the per-rule timeout in the check runner all race a spawn/shutdown against a `setTimeout`/`Bun.sleep` that's never cancelled when the race winner resolves. The stray timer keeps the event loop alive for its full duration. Each site now captures the timer id and clears it in a `.finally`. `flushSentry` wraps `Sentry.flush` with the same cancellable race pattern as defence-in-depth against leaks inside the SDK. `main()` now finishes with an explicit `process.exit(0)` after flushes complete so a future leaked handle in a third-party SDK can't regress this class of bug. ## Lazy SDK loading (Phase 2 #6) `posthog-node` and `@sentry/node-core/light` are now loaded via dynamic `import()` inside `initTelemetry` / `initSentry`. The `ARCHGATE_TELEMETRY=0` path skips the parse and init cost entirely. Both inits run concurrently via `Promise.all` in `cli.ts`. ## Startup trimming (Phase 2) - `installGit` short-circuits on `Bun.which("git")` before falling back to the async cross-env resolver (avoids a WSL subprocess on Windows in the 99% happy path). - `getRepoContext` now fires all four git probes in parallel instead of gating on `rev-parse --is-inside-work-tree`, saving one serial spawn on git repos. ## Check hot path (Phase 3) - `getGitTrackedFiles` is cached per project root — `resolveScopedFiles` used to spawn `git ls-files` once per ADR (16 spawns in this repo). - `ensureRulesShim` compares existing `rules.d.ts` content and skips the write when unchanged. - New `parseAllAdrs(projectRoot)` caches the readdir + parse of every ADR once per process. `loadRuleAdrs`, `loadAllAdrs` (in context.ts), and `adr list` now share that cache, so `review-context --run-checks` no longer reads the ADR directory twice. - `adr list` uses the shared cache — one readdir instead of two. - `buildSummary` is built once in `check.ts` and passed into every reporter + `getExitCode` instead of being recomputed three times. ## Telemetry payload (Phase 4) - `getCommonProperties` now splits into a cached "static" snapshot (platform, install method, CI, locale) and a fresh "dynamic" overlay (project context, repo snapshot). Cuts redundant `getPlatformInfo` / `detectInstallMethod` / `Intl.DateTimeFormat` calls when multiple events are emitted per command. - The opportunistic update check at startup now uses a 5s fetch timeout (down from 15s) so a slow network never extends exit time. The explicit `archgate upgrade` path keeps the 15s default. ## Measured impact (Windows, this repo) | Command | Before | After | |------------------|---------|---------| | adr list | 3.60 s | 0.82 s | | adr show | 3.60 s | 0.72 s | | doctor | 3.90 s | 1.06 s | | telemetry status | 3.70 s | 0.72 s | | check | 0.88 s | 0.85 s | | adr list (tel off)| 0.59 s | 0.42 s | | doctor (tel off) | 3.50 s | 0.80 s | ## Not changed The `await initTelemetry()` in `main()` — PR #211 added it so `command_executed` / `command_completed` carry `repo_id`. Keeping the await; lazy-loading plus parallel init keeps the added startup cost bounded. ## Validation `bun run validate` passes (lint + typecheck + format + 662 tests + 22/22 ADR rules + build check).
8 tasks
rhuanbarreto
added a commit
that referenced
this pull request
Apr 16, 2026
* perf(cli): remove 3s exit tail and trim startup overhead On Windows (and anywhere the runtime honours timer handles), every command that exits naturally through `main()` returning — i.e. every command except `check` and `upgrade` — used to linger ~3 seconds after its output appeared. Root causes + fixes below. ## Timer leaks (Phase 1) `flushTelemetry`, `gitCredentialFill`, the WSL fallback in `resolveCommand`, and the per-rule timeout in the check runner all race a spawn/shutdown against a `setTimeout`/`Bun.sleep` that's never cancelled when the race winner resolves. The stray timer keeps the event loop alive for its full duration. Each site now captures the timer id and clears it in a `.finally`. `flushSentry` wraps `Sentry.flush` with the same cancellable race pattern as defence-in-depth against leaks inside the SDK. `main()` now finishes with an explicit `process.exit(0)` after flushes complete so a future leaked handle in a third-party SDK can't regress this class of bug. ## Lazy SDK loading (Phase 2 #6) `posthog-node` and `@sentry/node-core/light` are now loaded via dynamic `import()` inside `initTelemetry` / `initSentry`. The `ARCHGATE_TELEMETRY=0` path skips the parse and init cost entirely. Both inits run concurrently via `Promise.all` in `cli.ts`. ## Startup trimming (Phase 2) - `installGit` short-circuits on `Bun.which("git")` before falling back to the async cross-env resolver (avoids a WSL subprocess on Windows in the 99% happy path). - `getRepoContext` now fires all four git probes in parallel instead of gating on `rev-parse --is-inside-work-tree`, saving one serial spawn on git repos. ## Check hot path (Phase 3) - `getGitTrackedFiles` is cached per project root — `resolveScopedFiles` used to spawn `git ls-files` once per ADR (16 spawns in this repo). - `ensureRulesShim` compares existing `rules.d.ts` content and skips the write when unchanged. - New `parseAllAdrs(projectRoot)` caches the readdir + parse of every ADR once per process. `loadRuleAdrs`, `loadAllAdrs` (in context.ts), and `adr list` now share that cache, so `review-context --run-checks` no longer reads the ADR directory twice. - `adr list` uses the shared cache — one readdir instead of two. - `buildSummary` is built once in `check.ts` and passed into every reporter + `getExitCode` instead of being recomputed three times. ## Telemetry payload (Phase 4) - `getCommonProperties` now splits into a cached "static" snapshot (platform, install method, CI, locale) and a fresh "dynamic" overlay (project context, repo snapshot). Cuts redundant `getPlatformInfo` / `detectInstallMethod` / `Intl.DateTimeFormat` calls when multiple events are emitted per command. - The opportunistic update check at startup now uses a 5s fetch timeout (down from 15s) so a slow network never extends exit time. The explicit `archgate upgrade` path keeps the 15s default. ## Measured impact (Windows, this repo) | Command | Before | After | |------------------|---------|---------| | adr list | 3.60 s | 0.82 s | | adr show | 3.60 s | 0.72 s | | doctor | 3.90 s | 1.06 s | | telemetry status | 3.70 s | 0.72 s | | check | 0.88 s | 0.85 s | | adr list (tel off)| 0.59 s | 0.42 s | | doctor (tel off) | 3.50 s | 0.80 s | ## Not changed The `await initTelemetry()` in `main()` — PR #211 added it so `command_executed` / `command_completed` carry `repo_id`. Keeping the await; lazy-loading plus parallel init keeps the added startup cost bounded. ## Validation `bun run validate` passes (lint + typecheck + format + 662 tests + 22/22 ADR rules + build check). * test(perf): regression guard for the 3s exit tail Spawns the real CLI via `bun run src/cli.ts --version` / `--help`, asserts the median of 3 runs stays under 4s. The historical regression (un-cancelled `setTimeout` / `Bun.sleep` in the telemetry / Sentry / credential / WSL flush paths) pushed these commands to ~3.5s on Windows; normal runs sit well under 2s even on slow CI. `NODE_ENV=test` suppresses real event delivery but leaves the telemetry + Sentry SDK init/flush path active, which is where the regression lives. `ARCHGATE_TELEMETRY` is intentionally NOT set to 0 so the test exercises the enabled code path. Threshold chosen to catch the regression (always ≥3000ms) with enough headroom that slow runners don't flake.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
cli.tswas kicking offinitTelemetry()withvoid, so the asyncgetRepoContext()call was still in flight when Commander'spreActionhook firedcommand_executed. The result:repo_idwas always null on that event, and ~25% ofcommand_completedevents also missed it when the process exited before the snapshot resolved.await initTelemetry()before command handlers run.getRepoContext()to hide the added startup latency.rev-parse --is-inside-work-treegates first (so non-git dirs still spawn just one subprocess), thengit config --get remote.origin.urlruns concurrently withresolveDefaultBranch, which itself firessymbolic-refandrev-parse --abbrev-refin parallel.initTelemetryJSDoc: callers should await before emitting events.Cost measured on Windows
Windows
gitspawn is ~25ms per subprocess, so serial calls were the dominant cost.Evidence the race is real (PostHog project 145977, last 90 days, v0.28.0)
repo_idcommand_executedcommand_completedcheck_completedThe 0/90 on
command_executedis the race; the 69/90 on the other two is the same race manifesting as process exit before the async resolve lands.Test plan
command_executedcarriesrepo_id(92b8023d174a9222) +repo_host=githubrepo_id=null, repo_is_git=falseARCHGATE_TELEMETRY=0→ 0 events captured (disabled path unchanged)NODE_ENV=test→ 0 events captured (test short-circuit unchanged)bun run validate(lint + typecheck + format + 639 tests + ADR check + build check) passes🤖 Generated with Claude Code