Skip to content

fix: close cross-process keychain race and namespace dev-build nest#1409

Merged
wpfleger96 merged 5 commits into
mainfrom
duncan/dev-nest-keychain-race
Jun 30, 2026
Merged

fix: close cross-process keychain race and namespace dev-build nest#1409
wpfleger96 merged 5 commits into
mainfrom
duncan/dev-nest-keychain-race

Conversation

@wpfleger96

Copy link
Copy Markdown
Collaborator

What

Fixes two independent bugs that together caused Will's dev-build agents to lose their keys when the signed DMG and a just staging build ran concurrently.

D3 — cross-process keychain blob clobber (commit 1)

mutate_blob() in secret_store.rs only re-read the keychain when its per-process cache was cold. Two concurrent processes (DMG + dev build) with warm caches would independently build write candidates from their stale baselines and overwrite each other's keys — last writer wins, dropped keys.

Fix: wrap the entire read-modify-write in an exclusive advisory file lock keyed by service name:

  • Unix: flock(2) on a lockfile in std::env::temp_dir()
  • Windows: named kernel mutex (CreateMutexW / WaitForSingleObject)

Inside the lock, read_blob_raw() is always called (warm cache is not used as the write basis). The cache is advanced after a successful write and cleared on write failure so the next caller re-reads from the keychain.

Both signed DMG and unsigned dev builds can reach the lockfile/mutex — neither uses the macOS app sandbox (com.apple.security.app-sandbox is absent from Entitlements.plist). tauri-plugin-single-instance keys on bundle id, so it does not prevent the two builds from running concurrently over the shared blob.

D2 — DMG + dev-build REPOS/.repos-dir fight (commit 2)

Both builds shared one ~/.buzz nest, one .repos-dir dotfile, and one REPOS inode. write_persisted_repos_dir(None) deletes the dotfile and ensure_repos_symlink(None) reverts the REPOS symlink to a real directory. Whichever instance applied its workspace last would clobber the other's configuration, causing resolve_repos_at_boot() to fail and skip agent restore on the next boot.

Fix: give dev builds their own nest at ~/.buzz-dev. The discriminator is the Tauri app-data directory name starting with xyz.block.buzz.app.dev — the same runtime signal used by reconcile_target_dir() — evaluated once at startup and stored in an OnceLock. Call sites for nest_dir() need no change.

Migration: on first dev-build boot after this lands, migrate_dev_nest() copies accumulated knowledge (RESEARCH/, PLANS/, GUIDES/, WORK_LOGS/, mem_* slugs, AGENTS.md) from ~/.buzz~/.buzz-dev non-destructively. The source is never deleted; prod builds continue using ~/.buzz normally. Both builds still symlink REPOS → ~/Development; they own separate symlinks so there is no more clobbering.

Files changed

File Change
desktop/src-tauri/src/secret_store.rs D3: interprocess lock + always-fresh-read in mutate_blob; new CI-safe lock tests; #[ignore] regression tests for the race
desktop/src-tauri/src/managed_agents/nest.rs D2: NEST_DIR OnceLock, init_nest_dir(is_dev), updated nest_dir(), updated nest-dir tests
desktop/src-tauri/src/migration.rs D2: call init_nest_dir in run_boot_migrations; fix migrate_legacy_nest hardcoded .buzz dest; add migrate_dev_nest
desktop/src-tauri/src/lib.rs D2: call migrate_dev_nest after ensure_nest for dev instances

npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 3 commits June 30, 2026 18:10
Two concurrent Buzz processes (signed DMG + unsigned dev build) share the
same OS keychain blob (service "buzz-desktop", account "secrets"). Each
process holds a per-process in-memory cache backed by an OnceLock. Without
coordination, a warm-cache write in process A builds its candidate from a
stale baseline and silently drops keys written by process B since A's last
read — the exact mechanism that stranded Will's dev-build agent keys.

Fix: wrap the entire read-modify-write in mutate_blob() with an exclusive
advisory file lock keyed by service name:
- Unix: flock(2) on a lockfile in std::env::temp_dir()
- Windows: named kernel mutex (CreateMutexW/WaitForSingleObject)

Inside the lock, always perform a fresh read_blob_raw() — never build the
write candidate from the warm cache. This eliminates the lost-update window
regardless of cache state. The cache is still advanced after a successful
write (reads within the same process remain fast) and cleared on write
failure (next caller re-reads rather than building on a stale state).

Both signed DMG and unsigned dev builds can reach the lockfile/mutex since
neither uses the macOS app sandbox (no com.apple.security.app-sandbox
entitlement). tauri-plugin-single-instance keys on bundle id, so it does
not prevent DMG + dev from running concurrently over the shared blob.

Adds two #[ignore] integration tests (require real OS keychain) and two
CI-safe unit tests for lockfile path correctness and lock acquire/release.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
The signed DMG build (xyz.block.buzz.app) and dev builds via `just staging`
(xyz.block.buzz.app.dev) shared one ~/.buzz nest, one .repos-dir dotfile,
and one REPOS inode. write_persisted_repos_dir(None) deleted the dotfile and
ensure_repos_symlink(None) reverted the REPOS symlink to a real directory
(repos.rs:82-92, 218-221). Last-writer-wins between the two instances meant
resolve_repos_at_boot() would fail and skip agent restore on the next boot.

Fix: give dev builds their own nest at ~/.buzz-dev. The discriminator is the
Tauri app-data directory name starting with xyz.block.buzz.app.dev — the
same runtime signal used by reconcile_target_dir() in migration.rs. This
is evaluated at startup (before any nest I/O) and stored in a process-lifetime
OnceLock so nest_dir() call sites need no AppHandle parameter change.

Changes:
- nest.rs: add NEST_DIR OnceLock + init_nest_dir(is_dev), update nest_dir()
  to read from it (falls back to .buzz when uninitialized, for test isolation)
- migration.rs run_boot_migrations: call init_nest_dir() using the same
  app-data-dir discriminator as reconcile_target_dir()
- migration.rs migrate_legacy_nest(): use nest_dir() instead of hardcoded
  ~/.buzz as the migration destination (fixes migration.rs:223 bypass)
- migration.rs: add migrate_dev_nest() — one-time copy of ~/.buzz → ~/.buzz-dev
  on first dev-build boot after this change; non-destructive, skips when
  destination RESEARCH/PLANS already have content
- lib.rs: call migrate_dev_nest() after ensure_nest() for dev instances so
  accumulated knowledge (RESEARCH/, PLANS/, mem_* slugs) is not lost

Both builds can still symlink REPOS → ~/Development; they own separate
symlinks so they never clobber each other.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
F1: Move Unix lockfile from $TMPDIR to /tmp/<uid>-<service>.lock — a
deterministic per-user path invariant to process environment. launchd
strips TMPDIR for GUI processes; a shell-launched dev build and a signed
DMG could resolve different temp dirs and take different lockfiles,
leaving the lost-update race live. /tmp is a fixed inode shared by both.
Update the test to assert the path starts under /tmp and contains the uid.
Add #[allow(dead_code)] + clarifying comment on BlobLockGuard.file (held
purely for RAII fd-close, not read).

F2: Migrate .repos-dir from ~/.buzz → ~/.buzz-dev before
resolve_repos_at_boot() reads it. run_boot_migrations now returns the
is_dev flag from the OnceLock init block and calls migrate_dev_repos_dir()
immediately — before lib.rs calls resolve_repos_at_boot() at line 305.
Also creates ~/.buzz-dev with create_dir_all before copying, since this
runs before ensure_nest().

F3: Replace the RESEARCH/PLANS file-presence sentinel in migrate_dev_nest
with an explicit .dev-nest-migrated marker file. The old sentinel was
tripped by .sprout migration (which runs earlier and copies into
~/.buzz-dev), causing migrate_dev_nest to skip copying ~/.buzz knowledge.
Also removes the now-dead dir_has_any_file helper.

F4: Delete mutate_blob_skips_write_when_map_unchanged — empty test body
with no assertions, reporting phantom coverage. The behavior is covered
by the two #[ignore] keychain-backed tests that follow.

File-size overrides bumped: nest.rs → 1501, migration.rs → 1575,
secret_store.rs → 1033. All load-bearing fix growth.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
@wpfleger96 wpfleger96 force-pushed the duncan/dev-nest-keychain-race branch from 68f54c0 to ed75910 Compare June 30, 2026 22:10
npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 2 commits June 30, 2026 18:23
The Windows #[cfg(windows)] branch of BlobLockGuard::acquire failed to
compile on x86_64-pc-windows-msvc because HANDLE = *mut c_void in
windows-sys 0.61, not an integer type:

- CreateMutexW(std::ptr::null(), ...) — std::ptr::null() could not
  infer its type to *const SECURITY_ATTRIBUTES without that type in
  scope (Win32_Security feature was not enabled).
- if handle == 0 — HANDLE (*mut c_void) cannot be compared to an
  integer literal; the correct check is handle.is_null().
- if wait_result != 0x80 — replaced with the named constant
  WAIT_ABANDONED from Win32::Foundation for clarity.

Fixes:
- Add Win32_Security to windows-sys features in Cargo.toml so
  SECURITY_ATTRIBUTES can be imported and used in the null ptr call.
- Use std::ptr::null::<SECURITY_ATTRIBUTES>() for the first argument
  to CreateMutexW.
- Use handle.is_null() instead of == 0 for the null-handle check.
- Use windows_sys::Win32::Foundation::WAIT_ABANDONED instead of 0x80.

The Unix (flock) path and all D2 nest-namespace code are unchanged.
Verified: cargo check --target x86_64-pc-windows-msvc shows zero
E0308 errors in secret_store.rs.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
Windows pointer-type fix added 10 lines (multi-line CreateMutexW call +
SECURITY_ATTRIBUTES import + comments). Gate counts via split(/\r?\n/).length
= wc -l + 1 for newline-terminated files, so override is 1043 not 1042.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
@wpfleger96 wpfleger96 merged commit e8adc33 into main Jun 30, 2026
24 checks passed
@wpfleger96 wpfleger96 deleted the duncan/dev-nest-keychain-race branch June 30, 2026 22:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant