perf(cli): remove 3s exit tail and trim startup overhead by rhuanbarreto · Pull Request #213 · archgate/cli

rhuanbarreto · 2026-04-16T06:04:47Z

Summary

On Windows (and anywhere the runtime honours timer handles), every command that exits naturally through main() returning — i.e. every command except check and upgrade — used to linger ~3 seconds after its output appeared. This PR removes that tail and trims startup overhead across the board.

Measured impact (this repo, Windows):

Command	Before	After
`adr list`	3.60 s	0.82 s
`adr show`	3.60 s	0.72 s
`doctor`	3.90 s	1.06 s
`telemetry status`	3.70 s	0.72 s
`check`	0.88 s	0.85 s
`adr list` (tel off)	0.59 s	0.42 s
`doctor` (tel off)	3.50 s	0.80 s

Root cause — four un-cancelled timers

flushTelemetry, gitCredentialFill, the WSL fallback in resolveCommand, and the per-rule timeout in the check runner all raced a spawn/shutdown against a setTimeout / Bun.sleep that was never cancelled when the race winner resolved. The stray timer kept the event loop alive for its full 3s duration. Each site now captures the id and clearTimeouts in a .finally.

flushSentry got the same cancellable-race treatment as defence-in-depth against leaks inside the SDK, and main() now finishes with an explicit process.exit(0) after flushes complete so future third-party-SDK leaks can't regress this class of bug.

Additional changes

Lazy SDK loading: posthog-node and @sentry/node-core/light load via dynamic import() inside initTelemetry / initSentry. The ARCHGATE_TELEMETRY=0 path skips parse + init entirely. Both inits now run concurrently in cli.ts.
installGit short-circuits on Bun.which(\"git\") before hitting the async cross-env resolver (avoids a WSL subprocess on Windows in the happy path).
getRepoContext fires all four git probes in parallel instead of gating on rev-parse --is-inside-work-tree — one fewer serial spawn.
getGitTrackedFiles cached per project root. resolveScopedFiles used to spawn git ls-files once per ADR (16 spawns in this repo); now once.
ensureRulesShim compares existing rules.d.ts content and skips the write when unchanged.
New parseAllAdrs caches the readdir + parse of every ADR once per process. loadRuleAdrs, loadAllAdrs (context.ts), and adr list share the cache — review-context --run-checks no longer reads the ADR directory twice.
buildSummary is built once in check.ts and reused by every reporter + getExitCode instead of being recomputed three times.
getCommonProperties splits into a cached static snapshot (platform / install method / CI / locale) and a fresh dynamic overlay (project context / repo snapshot), cutting redundant detection calls when multiple events fire per command.
Opportunistic update check uses a 5s fetch timeout (down from 15s); the explicit archgate upgrade path keeps the 15s default.

What's NOT changed

The await initTelemetry() in main() — PR #211 added it because command_executed / command_completed were shipping without repo_id. The await stays; lazy-loading plus parallel initSentry + initTelemetry keeps the added cost bounded.

Test plan

bun run validate passes (lint + typecheck + format + 662 tests + 22/22 ADR rules + build check)
time bun run cli adr list → ~800 ms (was ~3.6 s)
time bun run cli doctor → ~1 s (was ~3.9 s)
time bun run cli check still passes and exits fast
ARCHGATE_TELEMETRY=0 bun run cli adr list → ~420 ms (was ~590 ms) — SDK parse cost avoided
Natural exit path: command output appears and the shell prompt returns immediately (no perceptible tail)
Verify on macOS / Linux runners in CI
Verify PostHog still receives command_executed with repo_id populated after merge

On Windows (and anywhere the runtime honours timer handles), every command that exits naturally through `main()` returning — i.e. every command except `check` and `upgrade` — used to linger ~3 seconds after its output appeared. Root causes + fixes below. ## Timer leaks (Phase 1) `flushTelemetry`, `gitCredentialFill`, the WSL fallback in `resolveCommand`, and the per-rule timeout in the check runner all race a spawn/shutdown against a `setTimeout`/`Bun.sleep` that's never cancelled when the race winner resolves. The stray timer keeps the event loop alive for its full duration. Each site now captures the timer id and clears it in a `.finally`. `flushSentry` wraps `Sentry.flush` with the same cancellable race pattern as defence-in-depth against leaks inside the SDK. `main()` now finishes with an explicit `process.exit(0)` after flushes complete so a future leaked handle in a third-party SDK can't regress this class of bug. ## Lazy SDK loading (Phase 2 #6) `posthog-node` and `@sentry/node-core/light` are now loaded via dynamic `import()` inside `initTelemetry` / `initSentry`. The `ARCHGATE_TELEMETRY=0` path skips the parse and init cost entirely. Both inits run concurrently via `Promise.all` in `cli.ts`. ## Startup trimming (Phase 2) - `installGit` short-circuits on `Bun.which("git")` before falling back to the async cross-env resolver (avoids a WSL subprocess on Windows in the 99% happy path). - `getRepoContext` now fires all four git probes in parallel instead of gating on `rev-parse --is-inside-work-tree`, saving one serial spawn on git repos. ## Check hot path (Phase 3) - `getGitTrackedFiles` is cached per project root — `resolveScopedFiles` used to spawn `git ls-files` once per ADR (16 spawns in this repo). - `ensureRulesShim` compares existing `rules.d.ts` content and skips the write when unchanged. - New `parseAllAdrs(projectRoot)` caches the readdir + parse of every ADR once per process. `loadRuleAdrs`, `loadAllAdrs` (in context.ts), and `adr list` now share that cache, so `review-context --run-checks` no longer reads the ADR directory twice. - `adr list` uses the shared cache — one readdir instead of two. - `buildSummary` is built once in `check.ts` and passed into every reporter + `getExitCode` instead of being recomputed three times. ## Telemetry payload (Phase 4) - `getCommonProperties` now splits into a cached "static" snapshot (platform, install method, CI, locale) and a fresh "dynamic" overlay (project context, repo snapshot). Cuts redundant `getPlatformInfo` / `detectInstallMethod` / `Intl.DateTimeFormat` calls when multiple events are emitted per command. - The opportunistic update check at startup now uses a 5s fetch timeout (down from 15s) so a slow network never extends exit time. The explicit `archgate upgrade` path keeps the 15s default. ## Measured impact (Windows, this repo) | Command | Before | After | |------------------|---------|---------| | adr list | 3.60 s | 0.82 s | | adr show | 3.60 s | 0.72 s | | doctor | 3.90 s | 1.06 s | | telemetry status | 3.70 s | 0.72 s | | check | 0.88 s | 0.85 s | | adr list (tel off)| 0.59 s | 0.42 s | | doctor (tel off) | 3.50 s | 0.80 s | ## Not changed The `await initTelemetry()` in `main()` — PR #211 added it so `command_executed` / `command_completed` carry `repo_id`. Keeping the await; lazy-loading plus parallel init keeps the added startup cost bounded. ## Validation `bun run validate` passes (lint + typecheck + format + 662 tests + 22/22 ADR rules + build check).

cloudflare-workers-and-pages · 2026-04-16T06:05:56Z

Deploying archgate-cli with Cloudflare Pages

Latest commit:	`37bc0a9`
Status:	✅ Deploy successful!
Preview URL:	https://5b959942.archgate-cli.pages.dev
Branch Preview URL:	https://perf-kill-exit-lag.archgate-cli.pages.dev

View logs

Spawns the real CLI via `bun run src/cli.ts --version` / `--help`, asserts the median of 3 runs stays under 4s. The historical regression (un-cancelled `setTimeout` / `Bun.sleep` in the telemetry / Sentry / credential / WSL flush paths) pushed these commands to ~3.5s on Windows; normal runs sit well under 2s even on slow CI. `NODE_ENV=test` suppresses real event delivery but leaves the telemetry + Sentry SDK init/flush path active, which is where the regression lives. `ARCHGATE_TELEMETRY` is intentionally NOT set to 0 so the test exercises the enabled code path. Threshold chosen to catch the regression (always ≥3000ms) with enough headroom that slow runners don't flake.

rhuanbarreto merged commit 9e442d6 into main Apr 16, 2026
9 checks passed

rhuanbarreto deleted the perf/kill-exit-lag branch April 16, 2026 06:19

github-actions bot mentioned this pull request Apr 15, 2026

chore(release): 0.29.0 #209

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(cli): remove 3s exit tail and trim startup overhead#213

perf(cli): remove 3s exit tail and trim startup overhead#213
rhuanbarreto merged 2 commits intomainfrom
perf/kill-exit-lag

rhuanbarreto commented Apr 16, 2026

Uh oh!

cloudflare-workers-and-pages bot commented Apr 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rhuanbarreto commented Apr 16, 2026

Summary

Root cause — four un-cancelled timers

Additional changes

What's NOT changed

Test plan

Uh oh!

cloudflare-workers-and-pages bot commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying archgate-cli with Cloudflare Pages

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cloudflare-workers-and-pages bot commented Apr 16, 2026 •

edited

Loading