perf(mutation): don't snapshot the whole blob store on every add by KylinMountain · Pull Request #155 · VectifyAI/OpenKB

KylinMountain · 2026-07-01T02:32:32Z

Problem

The serial crash-safe add path from #142 lists .openkb/files (the PageIndex blob store) in both the snapshot path set (_snapshot_add_paths) and hardlink_dirs. So every openkb add hardlinks the entire blob store into a rollback backup — one os.link per existing blob, plus the matching unlink on commit — a cost that scales with total KB size, not with the document being added.

Two concrete downsides:

O(total blobs) per add on the happy path (link + unlink), growing as the KB grows rather than staying O(this doc). Dwarfed by LLM compile on small KBs, but the wrong scaling on large ones.
Full byte-copy fallback: on a filesystem without hardlink support (cross-device staging, some Windows / OneDrive / Dropbox / network mounts), _hardlink_or_copy falls back to shutil.copy2 — i.e. a full copy of the whole blob store on every add (potentially GBs).
Cloud import lists .openkb/files too, but never writes a local blob — pure waste.

Fix

The blob store is append-only by {doc_id}: an add only ever creates the new doc's blob, and that name is assigned during indexing — after the snapshot is taken. So instead of snapshotting the whole tree up front:

Add MutationSnapshot.track_new(paths) — registers paths created after the snapshot with no backup, so both in-process rollback() and crash recovery (recover_pending_journals) remove exactly those paths. It rewrites the active journal, so a crash after the blob lands but before commit still cleans it up.
add: drop .openkb/files from the eager snapshot + hardlink_dirs; call track_new(files/*/<doc_id>*) right after index_long_document returns a doc_id.
cloud import: drop .openkb/files from hardlink_dirs (never touched there).

Everything else in the add snapshot is unchanged — hashes.json, pageindex.db*, wiki/*, concepts/entities (still hardlinked, since those are updated in place via atomic temp+replace) all keep their existing backup semantics.

Behavior notes

Rollback now removes exactly the new doc's <doc_id>{ext} blob and its <doc_id>/images/ subtree.
If an add creates a brand-new collection dir and then rolls back, the now-empty .openkb/files/<collection>/ dir is left behind. This is harmless — PageIndex add_document does mkdir(exist_ok=True) and create_collection rmtrees before reuse.
remove doesn't use the mutation snapshot, so it's unaffected.

Tests

Added to tests/test_mutation.py:

test_track_new_removes_new_blob_on_rollback — rollback removes the new blob + images subtree; a pre-existing blob is untouched (same inode).
test_track_new_persists_to_journal_for_crash_recovery — track_new persists so recover_pending_journals cleans a crashed add.
test_snapshot_add_paths_excludes_blob_store — _snapshot_add_paths no longer lists .openkb/files (but still lists hashes.json).

pytest -q: 906 passed (the 4 pre-existing tests/test_url_ingest.py failures are ModuleNotFoundError: trafilatura, unrelated to this change and also failing on main). ruff introduces no new diagnostics in the changed hunks.

https://claude.ai/code/session_018WiFnTo1YW9mtw47Fzir9K

The serial crash-safe add path (#142) listed `.openkb/files` (the PageIndex blob store) in both the snapshot path set and `hardlink_dirs`, so every add hardlinked the entire store into the rollback backup — one `os.link` per existing blob, plus the matching unlink on commit — a cost that scales with total KB size, not with the document being added. On a filesystem without hardlink support (cross-device staging, some Windows / cloud-sync folders) `_hardlink_or_copy` fell back to a full byte copy of the whole store on *every* add. The blob store is append-only by `{doc_id}`: an add only ever creates the new doc's blob, and that name is assigned during indexing — after the snapshot is taken. So instead of snapshotting the whole tree up front, register just the new blob once indexing has run, via `MutationSnapshot.track_new()`, which records it with no backup and rewrites the active journal so both in-process rollback and crash recovery remove exactly this doc's artifacts. Cloud import never writes a local blob, so `.openkb/files` is dropped from its snapshot entirely (it was pure waste there). - `MutationSnapshot.track_new(paths)`: register post-snapshot-created paths for removal on rollback; persists to the journal for crash recovery. - add: drop `.openkb/files` from the eager snapshot + `hardlink_dirs`; call `track_new(files/*/<doc_id>*)` right after `index_long_document`. - cloud import: drop `.openkb/files` from `hardlink_dirs`. Tests: new-blob rollback removes exactly the doc's blob + images subtree and leaves existing blobs untouched (same inode); track_new persists so recover_pending_journals cleans a crashed add; `_snapshot_add_paths` no longer lists `.openkb/files`. Full suite green (pre-existing trafilatura-missing url_ingest failures aside). Claude-Session: https://claude.ai/code/session_018WiFnTo1YW9mtw47Fzir9K

…bug) Self-review of the previous commit found a regression it introduced: track_new globbed `.openkb/files/*/<doc_id>*` and registered whatever matched for removal on rollback. But PageIndex content-dedups — `add_document` returns an EXISTING doc_id and writes no new blob when the same content is already indexed. If hashes.json and pageindex.db diverge (e.g. a prior `remove` whose PageIndex cleanup failed left the row + blob but dropped the hashes.json entry), re-adding that content makes col.add() return the OLD doc_id, so a subsequent compile failure would roll back and DELETE that prior document's blob. The old whole-store hardlink snapshot did not have this bug (a dedup-hit blob shares the backup inode and is left in place on rollback). Fix: capture the blob set *before* indexing and register only the paths this add actually created (set difference), guarded by `if index_result.doc_id`. That also neutralizes an unexpected empty/falsy doc_id, which would otherwise glob `*/*` and register — then delete on rollback — the entire blob store. Tests (tests/test_add_command.py): - test_long_doc_rollback_removes_only_the_new_blob: a failed long-doc add rolls back its own new blob + images subtree while a pre-existing blob survives. - test_long_doc_dedup_hit_does_not_delete_existing_blob: a dedup hit (existing doc_id, no new blob) must not delete the pre-existing blob on rollback — verified this test FAILS on the pre-fix code. Claude-Session: https://claude.ai/code/session_018WiFnTo1YW9mtw47Fzir9K

KylinMountain · 2026-07-01T03:44:08Z

Self-review (max-effort) turned up a regression this PR's first commit introduced — now fixed in 3749648.

Bug: track_new(glob("*/{doc_id}*")) assumed the globbed blob was created by this add. But PageIndex content-dedups (LocalBackend.add_document returns an existing doc_id and writes no new blob when the same content is already indexed). If hashes.json and pageindex.db diverge — e.g. a prior remove whose PageIndex cleanup failed left the row + blob but dropped the hashes.json entry — re-adding that content makes col.add() return the old doc_id, so a subsequent compile failure would roll back and delete another document's blob. The whole-store hardlink snapshot this PR replaced did not have this bug (a dedup-hit blob shares the backup inode → left in place).

Fix: capture the blob set before indexing and register only paths this add actually created (set difference), guarded by if index_result.doc_id. That also neutralizes an unexpected empty doc_id, which would otherwise glob */* and register — then delete — the whole store. Added two long-doc rollback tests (new-blob removed / pre-existing survives / dedup-hit survives); verified the dedup test fails on the pre-fix code.

Known limitation (documented, not fixed): there's a narrow crash window — if the process is hard-killed between the blob landing during indexing and track_new persisting the journal, the blob is orphaned (recovery won't remove it). It's benign (unreferenced; registry is restored on recovery), and a fuller fix would need the doc_id before the snapshot. Calling it out rather than silently accepting it.

Other review findings were minor (layout-coupling glob shares an assumption with remove's delete path; track_new lacks the kb_dir-containment assert snapshot_paths has) — happy to address if you'd like, otherwise leaving as-is.

KylinMountain added 2 commits July 1, 2026 10:31

KylinMountain merged commit 5053cbe into main Jul 1, 2026
1 check passed

gwokhou mentioned this pull request Jul 1, 2026

feat(mutation): route serial add mutations through coordinator #156

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(mutation): don't snapshot the whole blob store on every add#155

perf(mutation): don't snapshot the whole blob store on every add#155
KylinMountain merged 2 commits into
mainfrom
fix/blob-store-snapshot-cost

KylinMountain commented Jul 1, 2026

Uh oh!

KylinMountain commented Jul 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

KylinMountain commented Jul 1, 2026

Problem

Fix

Behavior notes

Tests

Uh oh!

KylinMountain commented Jul 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant