fix: close cross-process keychain race and namespace dev-build nest#1409
Merged
Conversation
Two concurrent Buzz processes (signed DMG + unsigned dev build) share the same OS keychain blob (service "buzz-desktop", account "secrets"). Each process holds a per-process in-memory cache backed by an OnceLock. Without coordination, a warm-cache write in process A builds its candidate from a stale baseline and silently drops keys written by process B since A's last read — the exact mechanism that stranded Will's dev-build agent keys. Fix: wrap the entire read-modify-write in mutate_blob() with an exclusive advisory file lock keyed by service name: - Unix: flock(2) on a lockfile in std::env::temp_dir() - Windows: named kernel mutex (CreateMutexW/WaitForSingleObject) Inside the lock, always perform a fresh read_blob_raw() — never build the write candidate from the warm cache. This eliminates the lost-update window regardless of cache state. The cache is still advanced after a successful write (reads within the same process remain fast) and cleared on write failure (next caller re-reads rather than building on a stale state). Both signed DMG and unsigned dev builds can reach the lockfile/mutex since neither uses the macOS app sandbox (no com.apple.security.app-sandbox entitlement). tauri-plugin-single-instance keys on bundle id, so it does not prevent DMG + dev from running concurrently over the shared blob. Adds two #[ignore] integration tests (require real OS keychain) and two CI-safe unit tests for lockfile path correctness and lock acquire/release. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
The signed DMG build (xyz.block.buzz.app) and dev builds via `just staging` (xyz.block.buzz.app.dev) shared one ~/.buzz nest, one .repos-dir dotfile, and one REPOS inode. write_persisted_repos_dir(None) deleted the dotfile and ensure_repos_symlink(None) reverted the REPOS symlink to a real directory (repos.rs:82-92, 218-221). Last-writer-wins between the two instances meant resolve_repos_at_boot() would fail and skip agent restore on the next boot. Fix: give dev builds their own nest at ~/.buzz-dev. The discriminator is the Tauri app-data directory name starting with xyz.block.buzz.app.dev — the same runtime signal used by reconcile_target_dir() in migration.rs. This is evaluated at startup (before any nest I/O) and stored in a process-lifetime OnceLock so nest_dir() call sites need no AppHandle parameter change. Changes: - nest.rs: add NEST_DIR OnceLock + init_nest_dir(is_dev), update nest_dir() to read from it (falls back to .buzz when uninitialized, for test isolation) - migration.rs run_boot_migrations: call init_nest_dir() using the same app-data-dir discriminator as reconcile_target_dir() - migration.rs migrate_legacy_nest(): use nest_dir() instead of hardcoded ~/.buzz as the migration destination (fixes migration.rs:223 bypass) - migration.rs: add migrate_dev_nest() — one-time copy of ~/.buzz → ~/.buzz-dev on first dev-build boot after this change; non-destructive, skips when destination RESEARCH/PLANS already have content - lib.rs: call migrate_dev_nest() after ensure_nest() for dev instances so accumulated knowledge (RESEARCH/, PLANS/, mem_* slugs) is not lost Both builds can still symlink REPOS → ~/Development; they own separate symlinks so they never clobber each other. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
F1: Move Unix lockfile from $TMPDIR to /tmp/<uid>-<service>.lock — a deterministic per-user path invariant to process environment. launchd strips TMPDIR for GUI processes; a shell-launched dev build and a signed DMG could resolve different temp dirs and take different lockfiles, leaving the lost-update race live. /tmp is a fixed inode shared by both. Update the test to assert the path starts under /tmp and contains the uid. Add #[allow(dead_code)] + clarifying comment on BlobLockGuard.file (held purely for RAII fd-close, not read). F2: Migrate .repos-dir from ~/.buzz → ~/.buzz-dev before resolve_repos_at_boot() reads it. run_boot_migrations now returns the is_dev flag from the OnceLock init block and calls migrate_dev_repos_dir() immediately — before lib.rs calls resolve_repos_at_boot() at line 305. Also creates ~/.buzz-dev with create_dir_all before copying, since this runs before ensure_nest(). F3: Replace the RESEARCH/PLANS file-presence sentinel in migrate_dev_nest with an explicit .dev-nest-migrated marker file. The old sentinel was tripped by .sprout migration (which runs earlier and copies into ~/.buzz-dev), causing migrate_dev_nest to skip copying ~/.buzz knowledge. Also removes the now-dead dir_has_any_file helper. F4: Delete mutate_blob_skips_write_when_map_unchanged — empty test body with no assertions, reporting phantom coverage. The behavior is covered by the two #[ignore] keychain-backed tests that follow. File-size overrides bumped: nest.rs → 1501, migration.rs → 1575, secret_store.rs → 1033. All load-bearing fix growth. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
68f54c0 to
ed75910
Compare
The Windows #[cfg(windows)] branch of BlobLockGuard::acquire failed to compile on x86_64-pc-windows-msvc because HANDLE = *mut c_void in windows-sys 0.61, not an integer type: - CreateMutexW(std::ptr::null(), ...) — std::ptr::null() could not infer its type to *const SECURITY_ATTRIBUTES without that type in scope (Win32_Security feature was not enabled). - if handle == 0 — HANDLE (*mut c_void) cannot be compared to an integer literal; the correct check is handle.is_null(). - if wait_result != 0x80 — replaced with the named constant WAIT_ABANDONED from Win32::Foundation for clarity. Fixes: - Add Win32_Security to windows-sys features in Cargo.toml so SECURITY_ATTRIBUTES can be imported and used in the null ptr call. - Use std::ptr::null::<SECURITY_ATTRIBUTES>() for the first argument to CreateMutexW. - Use handle.is_null() instead of == 0 for the null-handle check. - Use windows_sys::Win32::Foundation::WAIT_ABANDONED instead of 0x80. The Unix (flock) path and all D2 nest-namespace code are unchanged. Verified: cargo check --target x86_64-pc-windows-msvc shows zero E0308 errors in secret_store.rs. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
Windows pointer-type fix added 10 lines (multi-line CreateMutexW call + SECURITY_ATTRIBUTES import + comments). Gate counts via split(/\r?\n/).length = wc -l + 1 for newline-terminated files, so override is 1043 not 1042. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Fixes two independent bugs that together caused Will's dev-build agents to lose their keys when the signed DMG and a
just stagingbuild ran concurrently.D3 — cross-process keychain blob clobber (commit 1)
mutate_blob()insecret_store.rsonly re-read the keychain when its per-process cache was cold. Two concurrent processes (DMG + dev build) with warm caches would independently build write candidates from their stale baselines and overwrite each other's keys — last writer wins, dropped keys.Fix: wrap the entire read-modify-write in an exclusive advisory file lock keyed by service name:
flock(2)on a lockfile instd::env::temp_dir()CreateMutexW/WaitForSingleObject)Inside the lock,
read_blob_raw()is always called (warm cache is not used as the write basis). The cache is advanced after a successful write and cleared on write failure so the next caller re-reads from the keychain.Both signed DMG and unsigned dev builds can reach the lockfile/mutex — neither uses the macOS app sandbox (
com.apple.security.app-sandboxis absent fromEntitlements.plist).tauri-plugin-single-instancekeys on bundle id, so it does not prevent the two builds from running concurrently over the shared blob.D2 — DMG + dev-build REPOS/.repos-dir fight (commit 2)
Both builds shared one
~/.buzznest, one.repos-dirdotfile, and oneREPOSinode.write_persisted_repos_dir(None)deletes the dotfile andensure_repos_symlink(None)reverts the REPOS symlink to a real directory. Whichever instance applied its workspace last would clobber the other's configuration, causingresolve_repos_at_boot()to fail and skip agent restore on the next boot.Fix: give dev builds their own nest at
~/.buzz-dev. The discriminator is the Tauri app-data directory name starting withxyz.block.buzz.app.dev— the same runtime signal used byreconcile_target_dir()— evaluated once at startup and stored in anOnceLock. Call sites fornest_dir()need no change.Migration: on first dev-build boot after this lands,
migrate_dev_nest()copies accumulated knowledge (RESEARCH/,PLANS/,GUIDES/,WORK_LOGS/,mem_*slugs,AGENTS.md) from~/.buzz→~/.buzz-devnon-destructively. The source is never deleted; prod builds continue using~/.buzznormally. Both builds still symlinkREPOS → ~/Development; they own separate symlinks so there is no more clobbering.Files changed
desktop/src-tauri/src/secret_store.rsmutate_blob; new CI-safe lock tests;#[ignore]regression tests for the racedesktop/src-tauri/src/managed_agents/nest.rsNEST_DIROnceLock,init_nest_dir(is_dev), updatednest_dir(), updated nest-dir testsdesktop/src-tauri/src/migration.rsinit_nest_dirinrun_boot_migrations; fixmigrate_legacy_nesthardcoded.buzzdest; addmigrate_dev_nestdesktop/src-tauri/src/lib.rsmigrate_dev_nestafterensure_nestfor dev instances