Skip to content

perf(cli): remove 3s exit tail and trim startup overhead#213

Merged
rhuanbarreto merged 2 commits intomainfrom
perf/kill-exit-lag
Apr 16, 2026
Merged

perf(cli): remove 3s exit tail and trim startup overhead#213
rhuanbarreto merged 2 commits intomainfrom
perf/kill-exit-lag

Conversation

@rhuanbarreto
Copy link
Copy Markdown
Contributor

Summary

On Windows (and anywhere the runtime honours timer handles), every command that exits naturally through main() returning — i.e. every command except check and upgrade — used to linger ~3 seconds after its output appeared. This PR removes that tail and trims startup overhead across the board.

Measured impact (this repo, Windows):

Command Before After
adr list 3.60 s 0.82 s
adr show 3.60 s 0.72 s
doctor 3.90 s 1.06 s
telemetry status 3.70 s 0.72 s
check 0.88 s 0.85 s
adr list (tel off) 0.59 s 0.42 s
doctor (tel off) 3.50 s 0.80 s

Root cause — four un-cancelled timers

flushTelemetry, gitCredentialFill, the WSL fallback in resolveCommand, and the per-rule timeout in the check runner all raced a spawn/shutdown against a setTimeout / Bun.sleep that was never cancelled when the race winner resolved. The stray timer kept the event loop alive for its full 3s duration. Each site now captures the id and clearTimeouts in a .finally.

flushSentry got the same cancellable-race treatment as defence-in-depth against leaks inside the SDK, and main() now finishes with an explicit process.exit(0) after flushes complete so future third-party-SDK leaks can't regress this class of bug.

Additional changes

  • Lazy SDK loading: posthog-node and @sentry/node-core/light load via dynamic import() inside initTelemetry / initSentry. The ARCHGATE_TELEMETRY=0 path skips parse + init entirely. Both inits now run concurrently in cli.ts.
  • installGit short-circuits on Bun.which(\"git\") before hitting the async cross-env resolver (avoids a WSL subprocess on Windows in the happy path).
  • getRepoContext fires all four git probes in parallel instead of gating on rev-parse --is-inside-work-tree — one fewer serial spawn.
  • getGitTrackedFiles cached per project root. resolveScopedFiles used to spawn git ls-files once per ADR (16 spawns in this repo); now once.
  • ensureRulesShim compares existing rules.d.ts content and skips the write when unchanged.
  • New parseAllAdrs caches the readdir + parse of every ADR once per process. loadRuleAdrs, loadAllAdrs (context.ts), and adr list share the cache — review-context --run-checks no longer reads the ADR directory twice.
  • buildSummary is built once in check.ts and reused by every reporter + getExitCode instead of being recomputed three times.
  • getCommonProperties splits into a cached static snapshot (platform / install method / CI / locale) and a fresh dynamic overlay (project context / repo snapshot), cutting redundant detection calls when multiple events fire per command.
  • Opportunistic update check uses a 5s fetch timeout (down from 15s); the explicit archgate upgrade path keeps the 15s default.

What's NOT changed

The await initTelemetry() in main() — PR #211 added it because command_executed / command_completed were shipping without repo_id. The await stays; lazy-loading plus parallel initSentry + initTelemetry keeps the added cost bounded.

Test plan

  • bun run validate passes (lint + typecheck + format + 662 tests + 22/22 ADR rules + build check)
  • time bun run cli adr list → ~800 ms (was ~3.6 s)
  • time bun run cli doctor → ~1 s (was ~3.9 s)
  • time bun run cli check still passes and exits fast
  • ARCHGATE_TELEMETRY=0 bun run cli adr list → ~420 ms (was ~590 ms) — SDK parse cost avoided
  • Natural exit path: command output appears and the shell prompt returns immediately (no perceptible tail)
  • Verify on macOS / Linux runners in CI
  • Verify PostHog still receives command_executed with repo_id populated after merge

On Windows (and anywhere the runtime honours timer handles), every
command that exits naturally through `main()` returning — i.e. every
command except `check` and `upgrade` — used to linger ~3 seconds
after its output appeared. Root causes + fixes below.

## Timer leaks (Phase 1)

`flushTelemetry`, `gitCredentialFill`, the WSL fallback in
`resolveCommand`, and the per-rule timeout in the check runner all
race a spawn/shutdown against a `setTimeout`/`Bun.sleep` that's
never cancelled when the race winner resolves. The stray timer
keeps the event loop alive for its full duration. Each site now
captures the timer id and clears it in a `.finally`.

`flushSentry` wraps `Sentry.flush` with the same cancellable race
pattern as defence-in-depth against leaks inside the SDK.

`main()` now finishes with an explicit `process.exit(0)` after
flushes complete so a future leaked handle in a third-party SDK
can't regress this class of bug.

## Lazy SDK loading (Phase 2 #6)

`posthog-node` and `@sentry/node-core/light` are now loaded via
dynamic `import()` inside `initTelemetry` / `initSentry`. The
`ARCHGATE_TELEMETRY=0` path skips the parse and init cost entirely.
Both inits run concurrently via `Promise.all` in `cli.ts`.

## Startup trimming (Phase 2)

- `installGit` short-circuits on `Bun.which("git")` before falling
  back to the async cross-env resolver (avoids a WSL subprocess on
  Windows in the 99% happy path).
- `getRepoContext` now fires all four git probes in parallel
  instead of gating on `rev-parse --is-inside-work-tree`, saving
  one serial spawn on git repos.

## Check hot path (Phase 3)

- `getGitTrackedFiles` is cached per project root — `resolveScopedFiles`
  used to spawn `git ls-files` once per ADR (16 spawns in this repo).
- `ensureRulesShim` compares existing `rules.d.ts` content and skips
  the write when unchanged.
- New `parseAllAdrs(projectRoot)` caches the readdir + parse of every
  ADR once per process. `loadRuleAdrs`, `loadAllAdrs` (in context.ts),
  and `adr list` now share that cache, so `review-context --run-checks`
  no longer reads the ADR directory twice.
- `adr list` uses the shared cache — one readdir instead of two.
- `buildSummary` is built once in `check.ts` and passed into every
  reporter + `getExitCode` instead of being recomputed three times.

## Telemetry payload (Phase 4)

- `getCommonProperties` now splits into a cached "static" snapshot
  (platform, install method, CI, locale) and a fresh "dynamic"
  overlay (project context, repo snapshot). Cuts redundant
  `getPlatformInfo` / `detectInstallMethod` / `Intl.DateTimeFormat`
  calls when multiple events are emitted per command.
- The opportunistic update check at startup now uses a 5s fetch
  timeout (down from 15s) so a slow network never extends exit
  time. The explicit `archgate upgrade` path keeps the 15s default.

## Measured impact (Windows, this repo)

| Command          | Before  | After   |
|------------------|---------|---------|
| adr list         | 3.60 s  | 0.82 s  |
| adr show         | 3.60 s  | 0.72 s  |
| doctor           | 3.90 s  | 1.06 s  |
| telemetry status | 3.70 s  | 0.72 s  |
| check            | 0.88 s  | 0.85 s  |
| adr list (tel off)| 0.59 s  | 0.42 s  |
| doctor (tel off) | 3.50 s  | 0.80 s  |

## Not changed

The `await initTelemetry()` in `main()` — PR #211 added it so
`command_executed` / `command_completed` carry `repo_id`. Keeping
the await; lazy-loading plus parallel init keeps the added startup
cost bounded.

## Validation

`bun run validate` passes (lint + typecheck + format + 662 tests
+ 22/22 ADR rules + build check).
@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages bot commented Apr 16, 2026

Deploying archgate-cli with  Cloudflare Pages  Cloudflare Pages

Latest commit: 37bc0a9
Status: ✅  Deploy successful!
Preview URL: https://5b959942.archgate-cli.pages.dev
Branch Preview URL: https://perf-kill-exit-lag.archgate-cli.pages.dev

View logs

Spawns the real CLI via `bun run src/cli.ts --version` / `--help`,
asserts the median of 3 runs stays under 4s. The historical
regression (un-cancelled `setTimeout` / `Bun.sleep` in the
telemetry / Sentry / credential / WSL flush paths) pushed these
commands to ~3.5s on Windows; normal runs sit well under 2s even
on slow CI.

`NODE_ENV=test` suppresses real event delivery but leaves the
telemetry + Sentry SDK init/flush path active, which is where
the regression lives. `ARCHGATE_TELEMETRY` is intentionally NOT
set to 0 so the test exercises the enabled code path.

Threshold chosen to catch the regression (always ≥3000ms) with
enough headroom that slow runners don't flake.
@rhuanbarreto rhuanbarreto merged commit 9e442d6 into main Apr 16, 2026
9 checks passed
@rhuanbarreto rhuanbarreto deleted the perf/kill-exit-lag branch April 16, 2026 06:19
@github-actions github-actions bot mentioned this pull request Apr 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant