Skip to content

🤖 perf: cut SSH workspace startup ~9× on the warm path via fused single-RTT materialization#3302

Merged
ammario merged 3 commits into
mainfrom
startup-perf-7103
May 17, 2026
Merged

🤖 perf: cut SSH workspace startup ~9× on the warm path via fused single-RTT materialization#3302
ammario merged 3 commits into
mainfrom
startup-perf-7103

Conversation

@ammar-agent
Copy link
Copy Markdown
Collaborator

@ammar-agent ammar-agent commented May 16, 2026

Summary

Optimize SSH workspace startup using mux run as a debug + timing harness, cutting cold-path SSH workspace creation by ~24% and warm-path creation by ~9× (≈1.8s → ~205ms median) against ovh-1. The warm path collapses what used to be ~9 sequential SSH round-trips into a single fused command, bringing subsequent mux run invocations within ~2.2× of one raw SSH-command RTT on this link.

Background

The objective was to instrument mux run --runtime "ssh <host>" so it doubles as a startup-perf harness, then drive both cold and warm SSH workspace creation down as far as possible. The starting point for cold workspace startup against ovh-1 was ~2.92s (empty repo, fresh remote), and "warm" subsequent runs were also ~1.8s because the CLI was using an ephemeral srcBaseDir per invocation — so every run looked cold to the remote.

Implementation

This PR contains three commits:

perf: cut SSH cold-workspace startup ~24% via batched ensureBaseRepo + skipped no-op probes

Cold-path wins (baseline 2.92s → 2.21s, –24%):

  1. Skip prefetchOriginOnRemote when local has no origin — hoist the origin-URL check; the fetch dance is a guaranteed no-op for local-only projects (~500ms).
  2. Collapse ensureBaseRepo into one SSH round-trip — single batched shell script with status sentinels (STATUS_CREATED, STATUS_CORE_BARE); falls through to slow-path retry helper only on real config-lock conflicts (~300ms).
  3. Drop trailing refreshBaseRepoOrigin when prefetch already pushed it (~100ms).
  4. Skip ensureHealthyBaseRepoForSync when ensureBaseRepo just created the bare repo — zero packs by definition (~95ms).

Also extends mux run itself to provision SSH workspaces (previously only Docker) and adds a timed init logger (makeTimedCliInitLogger) so --verbose annotates every step with [+<total> Δ<delta>] — this is the harness used to identify all subsequent wins.

perf: collapse warm SSH workspace creates to one fused round-trip (~9× faster)

Warm-path wins (1.8s → ~205ms median):

  1. Stable remote srcBaseDir for the CLImux run was using tempDir/src per invocation, guaranteeing a fresh remote base repo every time. Anchor to ~/mux (matching the desktop app) and resolve the tilde to an absolute remote path once up-front so path.resolve() in session.ensureMetadata() doesn't mangle ~/mux/... into local nonsense.
  2. Single-SSH-command warm fast-path (tryWarmWorktreeAdd) — when the remote already has a healthy shared base repo for this project AND the staged refs/mux-bundle/* tips already match local heads, skip the entire slow pipeline (~9 sequential SSH round-trips) and run a single tight shell pipeline that probes + materializes + reports .gitmodules in one round-trip. Misses fall through with a structured WARM_MISS:<reason> log line.
  3. Fold .gitmodules probe into the warm command so the caller can skip the post-worktree submodule-sync SSH RTT when no submodules exist.
  4. Eliminated double git show-ref — added fastReadGitHeadsRefs() that walks .git/refs/heads/ + .git/packed-refs directly (~2ms vs ~70ms), with graceful fallback to git show-ref for worktree-gitdir or reftable-based repos.
  5. Dropped the createWorkspace mkdir SSH RTT — both materialization paths create their own parents already.
  6. Snapshot-digest-only gate — the digest is sha256(git show-ref --heads), so a digest match implies a manifest match. Skip the parallel manifest verification.

perf: preserve slow-path origin freshness in SSH warm fast-path

Addresses a P1 correctness concern from Codex review on this PR. The original warm-path implementation always based the new workspace on refs/mux-bundle/<trunk> and returned before the slow path's fetchOriginTrunk() / resolveFreshWorkspaceSourceBase() logic could fetch and prefer origin/<trunkBranch>. When the remote cache was warm but origin/main had advanced, the user silently got a stale checkout.

Fix: fold the same origin fetch into the fused warm SSH script. When the local project has an origin URL we now also:

  1. git remote set-url origin <local-url> on the remote base repo (idempotent; protects against URL drift between syncs).
  2. git fetch --quiet origin +refs/heads/<trunk>:refs/remotes/origin/<trunk> on the host. Failure is tolerated (fo=0).
  3. Prefer refs/remotes/origin/<trunk> as the worktree-add base whenever the fetch succeeded; fall back to the bundle ref otherwise. This matches resolveFreshWorkspaceSourceBase()'s fallback semantics exactly.

The fetch runs on the SSH host (datacenter↔upstream link), so it adds no client-side round-trip and the warm path stays inside a single client→host RTT envelope. The wall-time impact for repos with origin is on the order of the host↔upstream fetch handshake (typically fast when origin hasn't moved); for local-only repos (no origin URL) behavior is unchanged.

The warm path also now emits "Reusing existing remote project snapshot" on hit so existing lifecycle tests (e.g. initWorkspace reuses snapshots and preserves remote-only tags across later resyncs) continue to observe a consistent step.

Validation

Measured on ovh-1 (SSH multiplexed RTT median ~92ms, 2× RTT = ~184ms):

Run kind Before After
Cold (empty repo, fresh remote) 2.92s 2.21s (–24%)
Warm (subsequent run, local-only repo) 1.80s 205ms median (–89%)

15 consecutive warm runs after the patch: 196 198 199 199 200 203 205 205 205 206 207 209 212 214 232 ms
→ min 196, P10 198, median 205, P90 214, max 232, mean 206ms (~2.24× one SSH-command RTT).

The narrow spread (σ≈8ms) shows variance is dominated by raw SSH-link jitter — the bare ssh true RTT itself varies 88–140ms on this link. The fundamental floor is 1× SSH RTT + ~10ms remote git worktree add + ~100ms ssh client/server overhead for a non-trivial script. Repos with an origin URL will incur an additional host↔upstream fetch handshake inside the same SSH RTT envelope.

All 444 runtime + CLI unit tests continue to pass, including the SSH retry orchestration suite that exercises the promisor-cleanup / repair paths preserved here. The full tests/runtime/runtime.test.ts "SSHRuntime sync does not import stale remote tracking refs" integration suite (5/5) passes locally against the Docker SSH fixture — including the initWorkspace reuses snapshots and preserves remote-only tags across later resyncs case that surfaced the original correctness concern. Local make static-check is green (typecheck, lint, prettier, docs link check).

Risks

Touched logic: SSH workspace creation. The warm fast-path is gated by strict miss conditions and falls through to the existing slow path on any miss; the slow path is unchanged in behavior aside from the cold-path round-trip reductions. The promisor-cleanup, partial-clone, and lock-conflict retry paths are preserved verbatim — the batched ensureBaseRepo falls back to the original normalizeBaseRepoSharedConfig retry helper whenever it sees a non-trivial config-lock error from its inline normalize attempt.

Regression risk is concentrated in:

  • Cold-path ensureBaseRepo: the batched command now does test/mkdir/init/unset-core.bare/unset-promisor in one shell pipeline; behavior should be identical to the original sequential calls for all currently-tested states.
  • Warm fast-path correctness: gated by snapshot-digest match (sha256 of git show-ref --heads) so a stale or mismatched local repo cannot be silently materialized against the wrong base — any mismatch falls to the slow path which independently re-verifies. The origin-fetch preamble now also matches the slow path's resolveFreshWorkspaceSourceBase semantics so upstream advances are picked up even on warm hits.
  • mux run SSH provisioning: this is a new code path in src/cli/run.ts and only affects the CLI; the desktop app's WORKSPACE_CREATE flow is unchanged.

Pains

The biggest debugging hiccup was a metadata-path bug: persisting ~/mux/... through session.ensureMetadata() (which runs path.resolve()) silently mangled it to /Users/.../~/mux/..., which the runtime then sent verbatim to the remote — making the workspace exist on the remote but fail the subsequent ensureReady check with "repository not found". The fix (resolve tilde once via runtime.resolvePath before persisting) mirrors how the desktop app's WORKSPACE_CREATE IPC handler already handles this.


Generated with mux • Model: anthropic:claude-opus-4-7 • Thinking: max • Cost: $27.24

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 341a6bd3c7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/node/runtime/SSHRuntime.ts
…o + skipped no-op probes

Total cold create+init for `mux run --runtime ssh <host>` against a tiny repo
drops from ~2.92s baseline to ~2.21s on average (-712ms, -24%) on the ovh-1
target. Numbers were measured with the new timed init logger surfaced by
`mux run --runtime ssh … --verbose`, which doubles as a startup-timing harness
(each step is annotated with elapsed total and delta-since-previous).

Concrete wins, all on the SSH cold path:

1. **Skip `prefetchOriginOnRemote` when local has no `origin`** — the existing
   guard was inside-out: we already paid for `refreshBaseRepoOrigin` and the
   actual fetch (which is guaranteed to fail) before noticing there was no
   origin URL to push. Moved the origin-URL check to the very top so the
   entire prefetch dance is a no-op for local-only projects. Saves ~500ms on
   the canonical "fresh repo with no remote" case.

2. **Collapse `ensureBaseRepo` into one SSH round-trip** — was 3-5 sequential
   round-trips (test -d, mkdir, git init, unset core.bare, 3× unset promisor
   keys); now a single batched shell script returns status sentinels
   (`STATUS_CREATED=<existed|created>`, `STATUS_CORE_BARE=<unset|absent|error>`)
   and falls through to the existing slow-path retry helper only when a real
   config-lock conflict is observed. Saves ~300ms cold.

3. **Skip the trailing `refreshBaseRepoOrigin`** when `prefetchOriginOnRemote`
   already pushed the same origin URL during this sync pass. Saves ~100ms when
   origin is configured.

4. **Skip `ensureHealthyBaseRepoForSync`** when `ensureBaseRepo` just freshly
   created the repo — a brand-new bare repo has zero packs, so the remote
   `count-objects -v` probe is pure overhead on the cold path. Saves ~95ms.

Also extends `mux run` itself to provision SSH workspaces (previously it only
provisioned Docker ones, which is why this scenario was effectively untestable
from the CLI). The Docker and SSH provisioning blocks share the same create →
runFullInit → cleanup-on-failure shape, both routed through a single timed
init logger so future startup investigations have a working harness out of the
box.

All 444 runtime + CLI tests continue to pass, including the SSH retry
orchestration suite that exercises the promisor-cleanup / repair paths
preserved here.

---

_Generated with `mux` • Model: `anthropic:claude-opus-4-7` • Thinking: `high`_

<!-- mux-attribution: model=anthropic:claude-opus-4-7 thinking=high -->
…~9× faster)

Subsequent `mux run --runtime "ssh <host>"` invocations against the same
project now complete in ~205ms median on `ovh-1` (down from ~1.8s baseline,
a ~9× speedup). The warm path is dominated by a single SSH round-trip plus
the small remote-side cost of `git worktree add` — within ~2.2× of one raw
SSH-command RTT (92ms) on this host.

Concrete wins, all on the warm path:

1. **Stable remote `srcBaseDir` for the CLI** — `mux run` was using its own
   ephemeral `tempDir/src` per invocation, which meant the remote
   `baseRepoPath` (derived from `srcBaseDir`) was *always* fresh, so every
   single CLI run paid the full cold-init cost. Now we anchor to `~/mux`
   (matching the desktop app's default) and resolve the tilde to an absolute
   remote path once up-front so `path.resolve()` inside
   `session.ensureMetadata()` doesn't mangle `~/mux/...` into
   `/Users/.../~/mux/...`.

2. **Single-SSH-command warm fast-path (`tryWarmWorktreeAdd`)** — when the
   remote already has a healthy shared base repo for this project AND the
   staged `refs/mux-bundle/*` tips already match the local heads, we now
   skip the entire slow pipeline (~9 sequential SSH round-trips:
   `test -d`, `git rev-parse`, `ensureBaseRepo`, snapshot check, manifest
   check, `refreshBaseRepoOrigin`, `fetchOriginTrunk`,
   `resolveBundleTrunkRef`, `resolveFreshWorkspaceSourceBase`,
   `worktree add`, `.gitmodules` probe) and run a single tight shell
   pipeline that does the probe + materialize + .gitmodules check in one
   round-trip. Misses fall through to the existing slow path with a
   structured `WARM_MISS:<reason>` log line.

3. **Folded `.gitmodules` probe into the warm command** so the caller can
   skip the post-worktree submodule-sync SSH RTT when the freshly-checked-
   out worktree has no submodules (the overwhelming common case for
   `mux run`).

4. **Eliminated double `git show-ref`** — the slow path runs `git show-ref
   --heads` twice (once in `computeSnapshotDigest`, once in
   `resolveLocalSyncRefManifest`). Each fork+exec costs ~70-100ms on macOS.
   Added a `fastReadGitHeadsRefs()` helper that walks `.git/refs/heads/`
   + `.git/packed-refs` directly (~2ms total), with graceful fallback to
   `git show-ref` for worktree-gitdir or reftable-based repos.

5. **Dropped the `createWorkspace` mkdir SSH RTT** — the parent dir is
   already created by both materialization paths (the slow path's
   `ensureBaseRepo` runs `mkdir -p` on the shared root dir; the warm
   fast-path's tight command runs `mkdir -p` inline before `git worktree
   add`). Folding it into materialization shaves a full SSH RTT off every
   workspace startup.

6. **Snapshot-digest-only gate, no manifest verification** — the digest is
   `sha256(git show-ref --heads)`, so a digest match implies a manifest
   match (modulo a sha256 collision). Skipping the parallel manifest check
   saves another SSH RTT *and* an entire `git show-ref` parse pass.

Measured on ovh-1 (SSH multiplexed RTT ~92ms, 2× RTT = ~184ms),
15 consecutive warm runs against an empty repo:

  baseline (pre-patch): ~1800ms median
  this patch:           196 198 199 199 200 203 205 205 205 206 207 209
                        212 214 232 ms
                        → min 196, P10 198, median 205, P90 214, max 232,
                          mean 206ms (~2.24× one SSH-command RTT)

The narrow spread (196–232ms, σ≈8ms) shows variance is now dominated by
network jitter on the SSH link (the bare `ssh true` RTT itself varies
88–140ms). The warm path is fundamentally bound by:

  1× SSH RTT for the fused probe+materialize command
  + ~10ms remote shell + `git worktree add`
  + ~100ms of ssh client/server overhead for a non-trivial script

All 315 runtime tests continue to pass, including the SSH retry orchestration
suite that exercises the promisor-cleanup / repair paths.

---

_Generated with `mux` • Model: `anthropic:claude-opus-4-7` • Thinking: `high`_

<!-- mux-attribution: model=anthropic:claude-opus-4-7 thinking=high -->
Codex review on #3302 flagged a P1 correctness issue: the warm
worktree fast-path always based the new workspace on
`refs/mux-bundle/<trunk>` (the local snapshot tip) and short-circuited
before the slow path's `fetchOriginTrunk()` /
`resolveFreshWorkspaceSourceBase()` could fetch and prefer
`origin/<trunkBranch>`. When the remote cache was warm but
`origin/main` had advanced, the user silently got a stale checkout.

Fix: fold the same origin fetch into the fused warm SSH script. When
the local project has an `origin` URL we now also:

  1. `git remote set-url origin <local-url>` on the remote base repo
     (idempotent; protects against URL drift between syncs).
  2. `git fetch --quiet origin +refs/heads/<trunk>:refs/remotes/origin/<trunk>`
     on the host. Failure is tolerated (`fo=0`).
  3. Prefer `refs/remotes/origin/<trunk>` as the worktree-add base
     whenever the fetch succeeded; fall back to the bundle ref
     otherwise. This matches `resolveFreshWorkspaceSourceBase()`'s
     fallback semantics exactly.

The fetch runs on the SSH host (datacenter↔upstream link), so it adds
no client-side round-trip and the warm path stays inside a single
client→host RTT envelope.

Also emit `"Reusing existing remote project snapshot"` on warm hit so
existing lifecycle tests (e.g. `initWorkspace reuses snapshots and
preserves remote-only tags across later resyncs`) keep observing the
same step text — semantically the warm path *is* reusing the remote
project snapshot, just with the materialization fused in.

Validated locally:
  - `make typecheck`, `make lint`, `make fmt-check`: green
  - `TEST_INTEGRATION=1 bun x jest tests/runtime/runtime.test.ts -t
    "SSHRuntime sync does not import stale remote tracking refs"`:
    5/5 pass (previously: 1 failing, now fixed)
  - `bun test src/node/runtime/ src/cli/`: 444/444 pass

---

_Generated with `mux` • Model: `anthropic:claude-opus-4-7` • Thinking: `max` • Cost: `$27.24`_

<!-- mux-attribution: model=anthropic:claude-opus-4-7 thinking=max costs=27.24 -->
@ammar-agent
Copy link
Copy Markdown
Collaborator Author

@codex review

Addressed the P1 origin-freshness issue in commit 9ac2c32: the warm fast-path now folds git fetch origin <trunk> into the fused SSH script and prefers refs/remotes/origin/<trunkBranch> over refs/mux-bundle/*, exactly mirroring resolveFreshWorkspaceSourceBase()'s semantics. Fetch is best-effort with bundle-ref fallback. Local validation: integration test "initWorkspace reuses snapshots and preserves remote-only tags across later resyncs" now passes (was the failing test that surfaced this issue), full SSH sync test suite (5/5) green, 444/444 unit tests green.

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Can't wait for the next one!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@ammario ammario merged commit f2451f5 into main May 17, 2026
24 checks passed
@ammario ammario deleted the startup-perf-7103 branch May 17, 2026 14:56
mux-bot Bot added a commit that referenced this pull request May 17, 2026
The SSH-provisioning commit (#3302) added a second `import type { Runtime }
from "../node/runtime/Runtime"` line right below an existing
`import type { InitLogger, WorkspaceInitResult }` from the same module.
Fold the new symbol into the existing import — byte-identical semantics,
one fewer line of boilerplate, no behavior change.
mux-bot Bot added a commit that referenced this pull request May 18, 2026
The SSH-provisioning commit (#3302) added a second `import type { Runtime }
from "../node/runtime/Runtime"` line right below an existing
`import type { InitLogger, WorkspaceInitResult }` from the same module.
Fold the new symbol into the existing import — byte-identical semantics,
one fewer line of boilerplate, no behavior change.
mux-bot Bot added a commit that referenced this pull request May 18, 2026
The SSH-provisioning commit (#3302) added a second `import type { Runtime }
from "../node/runtime/Runtime"` line right below an existing
`import type { InitLogger, WorkspaceInitResult }` from the same module.
Fold the new symbol into the existing import — byte-identical semantics,
one fewer line of boilerplate, no behavior change.
mux-bot Bot added a commit that referenced this pull request May 18, 2026
The SSH-provisioning commit (#3302) added a second `import type { Runtime }
from "../node/runtime/Runtime"` line right below an existing
`import type { InitLogger, WorkspaceInitResult }` from the same module.
Fold the new symbol into the existing import — byte-identical semantics,
one fewer line of boilerplate, no behavior change.
mux-bot Bot added a commit that referenced this pull request May 18, 2026
The SSH-provisioning commit (#3302) added a second `import type { Runtime }
from "../node/runtime/Runtime"` line right below an existing
`import type { InitLogger, WorkspaceInitResult }` from the same module.
Fold the new symbol into the existing import — byte-identical semantics,
one fewer line of boilerplate, no behavior change.
mux-bot Bot added a commit that referenced this pull request May 19, 2026
The SSH-provisioning commit (#3302) added a second `import type { Runtime }
from "../node/runtime/Runtime"` line right below an existing
`import type { InitLogger, WorkspaceInitResult }` from the same module.
Fold the new symbol into the existing import — byte-identical semantics,
one fewer line of boilerplate, no behavior change.
mux-bot Bot added a commit that referenced this pull request May 19, 2026
The SSH-provisioning commit (#3302) added a second `import type { Runtime }
from "../node/runtime/Runtime"` line right below an existing
`import type { InitLogger, WorkspaceInitResult }` from the same module.
Fold the new symbol into the existing import — byte-identical semantics,
one fewer line of boilerplate, no behavior change.
mux-bot Bot added a commit that referenced this pull request May 19, 2026
The SSH-provisioning commit (#3302) added a second `import type { Runtime }
from "../node/runtime/Runtime"` line right below an existing
`import type { InitLogger, WorkspaceInitResult }` from the same module.
Fold the new symbol into the existing import — byte-identical semantics,
one fewer line of boilerplate, no behavior change.
mux-bot Bot added a commit that referenced this pull request May 20, 2026
The SSH-provisioning commit (#3302) added a second `import type { Runtime }
from "../node/runtime/Runtime"` line right below an existing
`import type { InitLogger, WorkspaceInitResult }` from the same module.
Fold the new symbol into the existing import — byte-identical semantics,
one fewer line of boilerplate, no behavior change.
mux-bot Bot added a commit that referenced this pull request May 20, 2026
The SSH-provisioning commit (#3302) added a second `import type { Runtime }
from "../node/runtime/Runtime"` line right below an existing
`import type { InitLogger, WorkspaceInitResult }` from the same module.
Fold the new symbol into the existing import — byte-identical semantics,
one fewer line of boilerplate, no behavior change.
mux-bot Bot added a commit that referenced this pull request May 20, 2026
The SSH-provisioning commit (#3302) added a second `import type { Runtime }
from "../node/runtime/Runtime"` line right below an existing
`import type { InitLogger, WorkspaceInitResult }` from the same module.
Fold the new symbol into the existing import — byte-identical semantics,
one fewer line of boilerplate, no behavior change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants