perf(cli): remove 3s exit tail and trim startup overhead #213
Merged
rhuanbarreto merged 2 commits into main on Apr 16, 2026
On Windows (and anywhere the runtime honours timer handles), every command that exits naturally through `main()` returning (i.e. every command except `check` and `upgrade`) used to linger ~3 seconds after its output appeared. Root causes + fixes below.

## Timer leaks (Phase 1)

`flushTelemetry`, `gitCredentialFill`, the WSL fallback in `resolveCommand`, and the per-rule timeout in the check runner all race a spawn/shutdown against a `setTimeout`/`Bun.sleep` that's never cancelled when the race winner resolves. The stray timer keeps the event loop alive for its full duration. Each site now captures the timer id and clears it in a `.finally`.

`flushSentry` wraps `Sentry.flush` with the same cancellable race pattern as defence-in-depth against leaks inside the SDK.

`main()` now finishes with an explicit `process.exit(0)` after flushes complete so a future leaked handle in a third-party SDK can't regress this class of bug.

## Lazy SDK loading (Phase 2 #6)

`posthog-node` and `@sentry/node-core/light` are now loaded via dynamic `import()` inside `initTelemetry` / `initSentry`. The `ARCHGATE_TELEMETRY=0` path skips the parse and init cost entirely. Both inits run concurrently via `Promise.all` in `cli.ts`.

## Startup trimming (Phase 2)

- `installGit` short-circuits on `Bun.which("git")` before falling back to the async cross-env resolver (avoids a WSL subprocess on Windows in the 99% happy path).
- `getRepoContext` now fires all four git probes in parallel instead of gating on `rev-parse --is-inside-work-tree`, saving one serial spawn on git repos.

## Check hot path (Phase 3)

- `getGitTrackedFiles` is cached per project root; `resolveScopedFiles` used to spawn `git ls-files` once per ADR (16 spawns in this repo).
- `ensureRulesShim` compares existing `rules.d.ts` content and skips the write when unchanged.
- New `parseAllAdrs(projectRoot)` caches the readdir + parse of every ADR once per process. `loadRuleAdrs`, `loadAllAdrs` (in context.ts), and `adr list` now share that cache, so `review-context --run-checks` no longer reads the ADR directory twice.
- `adr list` uses the shared cache: one readdir instead of two.
- `buildSummary` is built once in `check.ts` and passed into every reporter + `getExitCode` instead of being recomputed three times.

## Telemetry payload (Phase 4)

- `getCommonProperties` now splits into a cached "static" snapshot (platform, install method, CI, locale) and a fresh "dynamic" overlay (project context, repo snapshot). Cuts redundant `getPlatformInfo` / `detectInstallMethod` / `Intl.DateTimeFormat` calls when multiple events are emitted per command.
- The opportunistic update check at startup now uses a 5s fetch timeout (down from 15s) so a slow network never extends exit time. The explicit `archgate upgrade` path keeps the 15s default.

## Measured impact (Windows, this repo)

| Command            | Before | After  |
|--------------------|--------|--------|
| adr list           | 3.60 s | 0.82 s |
| adr show           | 3.60 s | 0.72 s |
| doctor             | 3.90 s | 1.06 s |
| telemetry status   | 3.70 s | 0.72 s |
| check              | 0.88 s | 0.85 s |
| adr list (tel off) | 0.59 s | 0.42 s |
| doctor (tel off)   | 3.50 s | 0.80 s |

## Not changed

The `await initTelemetry()` in `main()`: PR #211 added it so `command_executed` / `command_completed` carry `repo_id`. Keeping the await; lazy-loading plus parallel init keeps the added startup cost bounded.

## Validation

`bun run validate` passes (lint + typecheck + format + 662 tests + 22/22 ADR rules + build check).
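The cancellable race described under "Timer leaks" can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `raceWithTimeout` and its `"timeout"` sentinel are hypothetical, and the real call sites may structure the race differently.

```typescript
// Hypothetical helper: name, signature, and the "timeout" sentinel are
// illustrative assumptions, not taken from the archgate codebase.
async function raceWithTimeout<T>(
  work: Promise<T>,
  ms: number,
): Promise<T | "timeout"> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<"timeout">((resolve) => {
    // Capture the timer id so it can be cancelled later.
    timer = setTimeout(() => resolve("timeout"), ms);
  });

  // Whichever side wins, clear the timer in `.finally` so a pending
  // timeout cannot keep the event loop alive after `work` has settled.
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

Without the `clearTimeout`, a race won by `work` would still leave the timer pending for the full `ms`, which matches the lingering exit described above.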
Deploying archgate-cli with
| Latest commit: | 37bc0a9 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://5b959942.archgate-cli.pages.dev |
| Branch Preview URL: | https://perf-kill-exit-lag.archgate-cli.pages.dev |
Spawns the real CLI via `bun run src/cli.ts --version` / `--help`, asserts the median of 3 runs stays under 4s. The historical regression (un-cancelled `setTimeout` / `Bun.sleep` in the telemetry / Sentry / credential / WSL flush paths) pushed these commands to ~3.5s on Windows; normal runs sit well under 2s even on slow CI. `NODE_ENV=test` suppresses real event delivery but leaves the telemetry + Sentry SDK init/flush path active, which is where the regression lives. `ARCHGATE_TELEMETRY` is intentionally NOT set to 0 so the test exercises the enabled code path. Threshold chosen to catch the regression (always ≥3000ms) with enough headroom that slow runners don't flake.
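The median-of-three measurement described above might look something like this sketch. The helper names `median` and `medianDurationMs` are hypothetical, and the `run` callback stands in for whatever spawns the CLI in the actual test file, which is not shown here.

```typescript
// Hypothetical helpers: the names and shapes are illustrative assumptions,
// not the contents of the regression test added in this PR.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of an empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Odd length: middle element. Even length: mean of the two middle elements.
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

async function medianDurationMs(
  run: () => Promise<void>,
  runs = 3,
): Promise<number> {
  const durations: number[] = [];
  for (let i = 0; i < runs; i++) {
    // Runs are sequential so they do not compete with each other for CPU.
    const start = Date.now();
    await run();
    durations.push(Date.now() - start);
  }
  return median(durations);
}
```

Taking the median rather than the maximum means a single slow run on a busy CI machine does not fail the check, while a regression that affects every run (the leaked timer always added at least 3000 ms) still pushes the median over the threshold.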
## Summary

On Windows (and anywhere the runtime honours timer handles), every command that exits naturally through `main()` returning (i.e. every command except `check` and `upgrade`) used to linger ~3 seconds after its output appeared. This PR removes that tail and trims startup overhead across the board.

Measured impact (this repo, Windows):

| Command            | Before | After  |
|--------------------|--------|--------|
| adr list           | 3.60 s | 0.82 s |
| adr show           | 3.60 s | 0.72 s |
| doctor             | 3.90 s | 1.06 s |
| telemetry status   | 3.70 s | 0.72 s |
| check              | 0.88 s | 0.85 s |
| adr list (tel off) | 0.59 s | 0.42 s |
| doctor (tel off)   | 3.50 s | 0.80 s |

### Root cause: four un-cancelled timers

`flushTelemetry`, `gitCredentialFill`, the WSL fallback in `resolveCommand`, and the per-rule timeout in the check runner all raced a spawn/shutdown against a `setTimeout`/`Bun.sleep` that was never cancelled when the race winner resolved. The stray timer kept the event loop alive for its full 3s duration. Each site now captures the id and `clearTimeout`s in a `.finally`. `flushSentry` got the same cancellable-race treatment as defence-in-depth against leaks inside the SDK, and `main()` now finishes with an explicit `process.exit(0)` after flushes complete so future third-party-SDK leaks can't regress this class of bug.

### Additional changes

- `posthog-node` and `@sentry/node-core/light` load via dynamic `import()` inside `initTelemetry` / `initSentry`. The `ARCHGATE_TELEMETRY=0` path skips parse + init entirely. Both inits now run concurrently in `cli.ts`.
- `installGit` short-circuits on `Bun.which("git")` before hitting the async cross-env resolver (avoids a WSL subprocess on Windows in the happy path).
- `getRepoContext` fires all four git probes in parallel instead of gating on `rev-parse --is-inside-work-tree`: one fewer serial spawn.
- `getGitTrackedFiles` cached per project root. `resolveScopedFiles` used to spawn `git ls-files` once per ADR (16 spawns in this repo); now once.
- `ensureRulesShim` compares existing `rules.d.ts` content and skips the write when unchanged.
- `parseAllAdrs` caches the readdir + parse of every ADR once per process. `loadRuleAdrs`, `loadAllAdrs` (context.ts), and `adr list` share the cache, so `review-context --run-checks` no longer reads the ADR directory twice.
- `buildSummary` is built once in `check.ts` and reused by every reporter + `getExitCode` instead of being recomputed three times.
- `getCommonProperties` splits into a cached static snapshot (platform / install method / CI / locale) and a fresh dynamic overlay (project context / repo snapshot), cutting redundant detection calls when multiple events fire per command.
- The opportunistic update check at startup uses a 5s fetch timeout (down from 15s); the explicit `archgate upgrade` path keeps the 15s default.

### What's NOT changed

The `await initTelemetry()` in `main()`: PR #211 added it because `command_executed` / `command_completed` were shipping without `repo_id`. The await stays; lazy-loading plus parallel `initSentry` + `initTelemetry` keeps the added cost bounded.

### Test plan

- `bun run validate` passes (lint + typecheck + format + 662 tests + 22/22 ADR rules + build check)
- `time bun run cli adr list` → ~800 ms (was ~3.6 s)
- `time bun run cli doctor` → ~1 s (was ~3.9 s)
- `time bun run cli check` still passes and exits fast
- `ARCHGATE_TELEMETRY=0 bun run cli adr list` → ~420 ms (was ~590 ms); SDK parse cost avoided
- `command_executed` with `repo_id` populated after merge
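The per-project-root caching mentioned for `getGitTrackedFiles` and `parseAllAdrs` could be sketched along these lines. This is an assumption about the general shape only: the wrapper name `cachePerRoot` is hypothetical and the PR's actual cache may be keyed or invalidated differently.

```typescript
// Hypothetical wrapper: `cachePerRoot` is an illustrative name, not a
// function from the archgate codebase.
function cachePerRoot<T>(
  load: (projectRoot: string) => Promise<T>,
): (projectRoot: string) => Promise<T> {
  // Store the promise rather than the resolved value so that concurrent
  // callers for the same root share a single in-flight load.
  const cache = new Map<string, Promise<T>>();

  return (projectRoot: string): Promise<T> => {
    let entry = cache.get(projectRoot);
    if (entry === undefined) {
      entry = load(projectRoot);
      cache.set(projectRoot, entry);
    }
    return entry;
  };
}
```

One trade-off of this shape is that a rejected promise stays cached for the lifetime of the process. For a short-lived CLI invocation that is usually acceptable, but a long-running process would likely want to evict failed entries.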