perf(mutation): don't snapshot the whole blob store on every add #155
The serial crash-safe add path (#142) listed `.openkb/files` (the PageIndex blob store) in both the snapshot path set and `hardlink_dirs`, so every add hardlinked the entire store into the rollback backup (one `os.link` per existing blob, plus the matching unlink on commit), a cost that scales with total KB size, not with the document being added. On a filesystem without hardlink support (cross-device staging, some Windows / cloud-sync folders) `_hardlink_or_copy` fell back to a full byte copy of the whole store on *every* add.

The blob store is append-only by `{doc_id}`: an add only ever creates the new doc's blob, and that name is assigned during indexing, after the snapshot is taken. So instead of snapshotting the whole tree up front, register just the new blob once indexing has run, via `MutationSnapshot.track_new()`, which records it with no backup and rewrites the active journal so both in-process rollback and crash recovery remove exactly this doc's artifacts. Cloud import never writes a local blob, so `.openkb/files` is dropped from its snapshot entirely (it was pure waste there).

- `MutationSnapshot.track_new(paths)`: register post-snapshot-created paths for removal on rollback; persists to the journal for crash recovery.
- add: drop `.openkb/files` from the eager snapshot + `hardlink_dirs`; call `track_new(files/*/<doc_id>*)` right after `index_long_document`.
- cloud import: drop `.openkb/files` from `hardlink_dirs`.

Tests: new-blob rollback removes exactly the doc's blob + images subtree and leaves existing blobs untouched (same inode); `track_new` persists so `recover_pending_journals` cleans a crashed add; `_snapshot_add_paths` no longer lists `.openkb/files`. Full suite green (pre-existing trafilatura-missing url_ingest failures aside).

Claude-Session: https://claude.ai/code/session_018WiFnTo1YW9mtw47Fzir9K
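To make the "register with no backup, persist to the journal" idea concrete, here is a minimal sketch. It is not the project's `MutationSnapshot`: the class name `SnapshotSketch`, the JSON journal layout, and the helpers `remove_paths` and `recover_from_journal` are all assumptions made for illustration.

```python
import json
import os
import shutil
from pathlib import Path


def remove_paths(paths) -> None:
    # Delete files and directory trees; entries that are already gone are skipped.
    for p in map(Path, paths):
        if p.is_dir():
            shutil.rmtree(p, ignore_errors=True)
        else:
            p.unlink(missing_ok=True)


class SnapshotSketch:
    """Hypothetical stand-in for MutationSnapshot (illustration only)."""

    def __init__(self, journal_path: Path) -> None:
        self.journal_path = journal_path
        self.new_paths = []  # paths created after the snapshot; no backup kept
        self._write_journal()

    def _write_journal(self) -> None:
        # Write a temp file, then atomically replace the journal, so a crash
        # never leaves a half-written journal behind.
        tmp = self.journal_path.with_name(self.journal_path.name + ".tmp")
        tmp.write_text(json.dumps({"new_paths": [str(p) for p in self.new_paths]}))
        os.replace(tmp, self.journal_path)

    def track_new(self, paths) -> None:
        # Register paths for removal on rollback and persist them immediately.
        self.new_paths.extend(Path(p) for p in paths)
        self._write_journal()

    def rollback(self) -> None:
        remove_paths(self.new_paths)
        self.journal_path.unlink(missing_ok=True)


def recover_from_journal(journal_path: Path) -> None:
    # Crash recovery: replay the removal list that a dead process left behind.
    if journal_path.exists():
        remove_paths(json.loads(journal_path.read_text())["new_paths"])
        journal_path.unlink()
```

Because the journal is rewritten inside `track_new`, the in-process `rollback()` and a later recovery pass see the same removal list, and neither needs a copy of the pre-existing store.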
…bug)

Self-review of the previous commit found a regression it introduced: `track_new` globbed `.openkb/files/*/<doc_id>*` and registered whatever matched for removal on rollback. But PageIndex content-dedups: `add_document` returns an EXISTING doc_id and writes no new blob when the same content is already indexed. If `hashes.json` and `pageindex.db` diverge (e.g. a prior `remove` whose PageIndex cleanup failed left the row + blob but dropped the `hashes.json` entry), re-adding that content makes `col.add()` return the OLD doc_id, so a subsequent compile failure would roll back and DELETE that prior document's blob. The old whole-store hardlink snapshot did not have this bug (a dedup-hit blob shares the backup inode and is left in place on rollback).

Fix: capture the blob set *before* indexing and register only the paths this add actually created (set difference), guarded by `if index_result.doc_id`. That also neutralizes an unexpected empty/falsy doc_id, which would otherwise glob `*/*` and register (then delete on rollback) the entire blob store.

Tests (tests/test_add_command.py):

- `test_long_doc_rollback_removes_only_the_new_blob`: a failed long-doc add rolls back its own new blob + images subtree while a pre-existing blob survives.
- `test_long_doc_dedup_hit_does_not_delete_existing_blob`: a dedup hit (existing doc_id, no new blob) must not delete the pre-existing blob on rollback; verified that this test FAILS on the pre-fix code.

Claude-Session: https://claude.ai/code/session_018WiFnTo1YW9mtw47Fzir9K
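The set-difference registration described in this commit can be sketched roughly as follows. The helper names `list_blob_paths` and `paths_created_by_add` are hypothetical, and the `files/<collection>/<entry>` layout is inferred from the glob quoted above; the real code may differ.

```python
from pathlib import Path


def list_blob_paths(files_root: Path) -> set:
    # Every entry one level below a collection dir: files/<collection>/<entry>.
    return set(files_root.glob("*/*")) if files_root.is_dir() else set()


def paths_created_by_add(files_root: Path, before: set, doc_id) -> list:
    # Guard: a falsy doc_id registers nothing at all.
    if not doc_id:
        return []
    # Set difference: a dedup hit creates no new entry, so the result is empty
    # and a later rollback cannot touch the pre-existing blob.
    return sorted(list_blob_paths(files_root) - before)
```

In use, `before = list_blob_paths(root)` would be captured just ahead of indexing, and the result of `paths_created_by_add(root, before, doc_id)` handed to `track_new`. Comparing directory listings rather than globbing on the returned id is what keeps a dedup hit from registering another document's blob.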
Self-review (max-effort) turned up a regression this PR's first commit introduced, now fixed in […]

Bug: […]

Fix: capture the blob set before indexing and register only paths this add actually created (set difference), guarded by […]

Known limitation (documented, not fixed): there's a narrow crash window: if the process is hard-killed between the blob landing during indexing and […]

Other review findings were minor (layout-coupling glob shares an assumption with […]
Problem
The serial crash-safe add path from #142 lists `.openkb/files` (the PageIndex blob store) in both the snapshot path set (`_snapshot_add_paths`) and `hardlink_dirs`. So every `openkb add` hardlinks the entire blob store into a rollback backup (one `os.link` per existing blob, plus the matching unlink on commit), a cost that scales with total KB size, not with the document being added.

Two concrete downsides:

- On a filesystem without hardlink support (cross-device staging, some Windows / cloud-sync folders), `_hardlink_or_copy` falls back to `shutil.copy2`, i.e. a full copy of the whole blob store on every add (potentially GBs).
- Cloud import lists `.openkb/files` too, but never writes a local blob: pure waste.

Fix
The blob store is append-only by `{doc_id}`: an add only ever creates the new doc's blob, and that name is assigned during indexing, after the snapshot is taken. So instead of snapshotting the whole tree up front:

- `MutationSnapshot.track_new(paths)`: registers paths created after the snapshot with no backup, so both in-process `rollback()` and crash recovery (`recover_pending_journals`) remove exactly those paths. It rewrites the active journal, so a crash after the blob lands but before commit still cleans it up.
- add: drop `.openkb/files` from the eager snapshot + `hardlink_dirs`; call `track_new(files/*/<doc_id>*)` right after `index_long_document` returns a `doc_id`.
- cloud import: drop `.openkb/files` from `hardlink_dirs` (never touched there).

Everything else in the add snapshot is unchanged: `hashes.json`, `pageindex.db*`, `wiki/*`, concepts/entities (still hardlinked, since those are updated in place via atomic temp+replace) all keep their existing backup semantics.

Behavior notes
- Rollback removes the new `<doc_id>{ext}` blob and its `<doc_id>/images/` subtree.
- The `.openkb/files/<collection>/` dir is left behind. This is harmless: PageIndex `add_document` does `mkdir(exist_ok=True)` and `create_collection` rmtrees before reuse.
- `remove` doesn't use the mutation snapshot, so it's unaffected.

Tests
Added to `tests/test_mutation.py`:

- `test_track_new_removes_new_blob_on_rollback`: rollback removes the new blob + images subtree; a pre-existing blob is untouched (same inode).
- `test_track_new_persists_to_journal_for_crash_recovery`: `track_new` persists so `recover_pending_journals` cleans a crashed add.
- `test_snapshot_add_paths_excludes_blob_store`: `_snapshot_add_paths` no longer lists `.openkb/files` (but still lists `hashes.json`).

`pytest -q`: 906 passed (the 4 pre-existing `tests/test_url_ingest.py` failures are `ModuleNotFoundError: trafilatura`, unrelated to this change and also failing on `main`). `ruff` introduces no new diagnostics in the changed hunks.

https://claude.ai/code/session_018WiFnTo1YW9mtw47Fzir9K
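The "(same inode)" check in the first test can be illustrated with a small helper. This is a sketch of the idea only, not the test's actual code; `inode_of` and `untouched` are hypothetical names, and it assumes a filesystem that reports meaningful `st_ino` values.

```python
import os
from pathlib import Path


def inode_of(path: Path) -> int:
    # The inode number identifies the underlying file, not its name.
    return os.stat(path).st_ino


def untouched(path: Path, inode_before: int) -> bool:
    # An unchanged inode is a reasonable signal that the file was not swapped
    # out, e.g. by a restore-from-backup that replaces it with another file.
    return path.exists() and inode_of(path) == inode_before
```

A test would record `inode_of(blob)` before the failed add and assert `untouched(blob, inode)` after rollback, which catches a rollback that swaps in a restored copy even when the bytes are identical.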