Skip to content

perf(mutation): don't snapshot the whole blob store on every add#155

Merged
KylinMountain merged 2 commits into
mainfrom
fix/blob-store-snapshot-cost
Jul 1, 2026
Merged

perf(mutation): don't snapshot the whole blob store on every add#155
KylinMountain merged 2 commits into
mainfrom
fix/blob-store-snapshot-cost

Conversation

@KylinMountain

Copy link
Copy Markdown
Collaborator

Problem

The serial crash-safe add path from #142 lists .openkb/files (the PageIndex blob store) in both the snapshot path set (_snapshot_add_paths) and hardlink_dirs. So every openkb add hardlinks the entire blob store into a rollback backup — one os.link per existing blob, plus the matching unlink on commit — a cost that scales with total KB size, not with the document being added.

Two concrete downsides:

  • O(total blobs) per add on the happy path (link + unlink), growing as the KB grows rather than staying O(this doc). Dwarfed by LLM compile on small KBs, but the wrong scaling on large ones.
  • Full byte-copy fallback: on a filesystem without hardlink support (cross-device staging, some Windows / OneDrive / Dropbox / network mounts), _hardlink_or_copy falls back to shutil.copy2 — i.e. a full copy of the whole blob store on every add (potentially GBs).
  • Cloud import lists .openkb/files too, but never writes a local blob — pure waste.

Fix

The blob store is append-only by {doc_id}: an add only ever creates the new doc's blob, and that name is assigned during indexing — after the snapshot is taken. So instead of snapshotting the whole tree up front:

  • Add MutationSnapshot.track_new(paths) — registers paths created after the snapshot with no backup, so both in-process rollback() and crash recovery (recover_pending_journals) remove exactly those paths. It rewrites the active journal, so a crash after the blob lands but before commit still cleans it up.
  • add: drop .openkb/files from the eager snapshot + hardlink_dirs; call track_new(files/*/<doc_id>*) right after index_long_document returns a doc_id.
  • cloud import: drop .openkb/files from hardlink_dirs (never touched there).

Everything else in the add snapshot is unchanged — hashes.json, pageindex.db*, wiki/*, concepts/entities (still hardlinked, since those are updated in place via atomic temp+replace) all keep their existing backup semantics.

Behavior notes

  • Rollback now removes exactly the new doc's <doc_id>{ext} blob and its <doc_id>/images/ subtree.
  • If an add creates a brand-new collection dir and then rolls back, the now-empty .openkb/files/<collection>/ dir is left behind. This is harmless — PageIndex add_document does mkdir(exist_ok=True) and create_collection rmtrees before reuse.
  • remove doesn't use the mutation snapshot, so it's unaffected.

Tests

Added to tests/test_mutation.py:

  • test_track_new_removes_new_blob_on_rollback — rollback removes the new blob + images subtree; a pre-existing blob is untouched (same inode).
  • test_track_new_persists_to_journal_for_crash_recoverytrack_new persists so recover_pending_journals cleans a crashed add.
  • test_snapshot_add_paths_excludes_blob_store_snapshot_add_paths no longer lists .openkb/files (but still lists hashes.json).

pytest -q: 906 passed (the 4 pre-existing tests/test_url_ingest.py failures are ModuleNotFoundError: trafilatura, unrelated to this change and also failing on main). ruff introduces no new diagnostics in the changed hunks.

https://claude.ai/code/session_018WiFnTo1YW9mtw47Fzir9K

The serial crash-safe add path (#142) listed `.openkb/files` (the PageIndex
blob store) in both the snapshot path set and `hardlink_dirs`, so every add
hardlinked the entire store into the rollback backup — one `os.link` per
existing blob, plus the matching unlink on commit — a cost that scales with
total KB size, not with the document being added. On a filesystem without
hardlink support (cross-device staging, some Windows / cloud-sync folders)
`_hardlink_or_copy` fell back to a full byte copy of the whole store on
*every* add.

The blob store is append-only by `{doc_id}`: an add only ever creates the new
doc's blob, and that name is assigned during indexing — after the snapshot is
taken. So instead of snapshotting the whole tree up front, register just the
new blob once indexing has run, via `MutationSnapshot.track_new()`, which
records it with no backup and rewrites the active journal so both in-process
rollback and crash recovery remove exactly this doc's artifacts.

Cloud import never writes a local blob, so `.openkb/files` is dropped from its
snapshot entirely (it was pure waste there).

- `MutationSnapshot.track_new(paths)`: register post-snapshot-created paths for
  removal on rollback; persists to the journal for crash recovery.
- add: drop `.openkb/files` from the eager snapshot + `hardlink_dirs`; call
  `track_new(files/*/<doc_id>*)` right after `index_long_document`.
- cloud import: drop `.openkb/files` from `hardlink_dirs`.

Tests: new-blob rollback removes exactly the doc's blob + images subtree and
leaves existing blobs untouched (same inode); track_new persists so
recover_pending_journals cleans a crashed add; `_snapshot_add_paths` no longer
lists `.openkb/files`. Full suite green (pre-existing trafilatura-missing
url_ingest failures aside).

Claude-Session: https://claude.ai/code/session_018WiFnTo1YW9mtw47Fzir9K
…bug)

Self-review of the previous commit found a regression it introduced: track_new
globbed `.openkb/files/*/<doc_id>*` and registered whatever matched for removal
on rollback. But PageIndex content-dedups — `add_document` returns an EXISTING
doc_id and writes no new blob when the same content is already indexed. If
hashes.json and pageindex.db diverge (e.g. a prior `remove` whose PageIndex
cleanup failed left the row + blob but dropped the hashes.json entry),
re-adding that content makes col.add() return the OLD doc_id, so a subsequent
compile failure would roll back and DELETE that prior document's blob. The old
whole-store hardlink snapshot did not have this bug (a dedup-hit blob shares
the backup inode and is left in place on rollback).

Fix: capture the blob set *before* indexing and register only the paths this
add actually created (set difference), guarded by `if index_result.doc_id`.
That also neutralizes an unexpected empty/falsy doc_id, which would otherwise
glob `*/*` and register — then delete on rollback — the entire blob store.

Tests (tests/test_add_command.py):
- test_long_doc_rollback_removes_only_the_new_blob: a failed long-doc add rolls
  back its own new blob + images subtree while a pre-existing blob survives.
- test_long_doc_dedup_hit_does_not_delete_existing_blob: a dedup hit (existing
  doc_id, no new blob) must not delete the pre-existing blob on rollback —
  verified this test FAILS on the pre-fix code.

Claude-Session: https://claude.ai/code/session_018WiFnTo1YW9mtw47Fzir9K
@KylinMountain

Copy link
Copy Markdown
Collaborator Author

Self-review (max-effort) turned up a regression this PR's first commit introduced — now fixed in 3749648.

Bug: track_new(glob("*/{doc_id}*")) assumed the globbed blob was created by this add. But PageIndex content-dedups (LocalBackend.add_document returns an existing doc_id and writes no new blob when the same content is already indexed). If hashes.json and pageindex.db diverge — e.g. a prior remove whose PageIndex cleanup failed left the row + blob but dropped the hashes.json entry — re-adding that content makes col.add() return the old doc_id, so a subsequent compile failure would roll back and delete another document's blob. The whole-store hardlink snapshot this PR replaced did not have this bug (a dedup-hit blob shares the backup inode → left in place).

Fix: capture the blob set before indexing and register only paths this add actually created (set difference), guarded by if index_result.doc_id. That also neutralizes an unexpected empty doc_id, which would otherwise glob */* and register — then delete — the whole store. Added two long-doc rollback tests (new-blob removed / pre-existing survives / dedup-hit survives); verified the dedup test fails on the pre-fix code.

Known limitation (documented, not fixed): there's a narrow crash window — if the process is hard-killed between the blob landing during indexing and track_new persisting the journal, the blob is orphaned (recovery won't remove it). It's benign (unreferenced; registry is restored on recovery), and a fuller fix would need the doc_id before the snapshot. Calling it out rather than silently accepting it.

Other review findings were minor (layout-coupling glob shares an assumption with remove's delete path; track_new lacks the kb_dir-containment assert snapshot_paths has) — happy to address if you'd like, otherwise leaving as-is.

@KylinMountain KylinMountain merged commit 5053cbe into main Jul 1, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant