feat: add serial crash-safe mutation recovery#142
Conversation
5346945 to
4ca33c7
Compare
|
Solid PR — the snapshot/rollback design is sound and the hardlink backup contract holds (every writer into the hardlinked dirs is atomic temp+replace or append-only). One gap worth closing before merge, since it's the PR's own goal: the URL path isn't crash-safe. |
Keep URL ingests on add_single_file's staged conversion path so source artifacts are published only after the mutation snapshot exists and can roll back on failure. Add regression coverage for URL failure cleanup while preserving downloaded raw files for retry, and replace legacy asyncio.run mocks with async compiler stubs to avoid unawaited coroutine warnings in the touched tests.
@KylinMountain Thanks for the CR, gap fixed🫡 |
* perf(mutation): don't snapshot the whole blob store on every add The serial crash-safe add path (#142) listed `.openkb/files` (the PageIndex blob store) in both the snapshot path set and `hardlink_dirs`, so every add hardlinked the entire store into the rollback backup — one `os.link` per existing blob, plus the matching unlink on commit — a cost that scales with total KB size, not with the document being added. On a filesystem without hardlink support (cross-device staging, some Windows / cloud-sync folders) `_hardlink_or_copy` fell back to a full byte copy of the whole store on *every* add. The blob store is append-only by `{doc_id}`: an add only ever creates the new doc's blob, and that name is assigned during indexing — after the snapshot is taken. So instead of snapshotting the whole tree up front, register just the new blob once indexing has run, via `MutationSnapshot.track_new()`, which records it with no backup and rewrites the active journal so both in-process rollback and crash recovery remove exactly this doc's artifacts. Cloud import never writes a local blob, so `.openkb/files` is dropped from its snapshot entirely (it was pure waste there). - `MutationSnapshot.track_new(paths)`: register post-snapshot-created paths for removal on rollback; persists to the journal for crash recovery. - add: drop `.openkb/files` from the eager snapshot + `hardlink_dirs`; call `track_new(files/*/<doc_id>*)` right after `index_long_document`. - cloud import: drop `.openkb/files` from `hardlink_dirs`. Tests: new-blob rollback removes exactly the doc's blob + images subtree and leaves existing blobs untouched (same inode); track_new persists so recover_pending_journals cleans a crashed add; `_snapshot_add_paths` no longer lists `.openkb/files`. Full suite green (pre-existing trafilatura-missing url_ingest failures aside). Claude-Session: https://claude.ai/code/session_018WiFnTo1YW9mtw47Fzir9K * fix(mutation): only track blobs this add created (dedup-hit rollback bug) Self-review of the previous commit found a regression it introduced: track_new globbed `.openkb/files/*/<doc_id>*` and registered whatever matched for removal on rollback. But PageIndex content-dedups — `add_document` returns an EXISTING doc_id and writes no new blob when the same content is already indexed. If hashes.json and pageindex.db diverge (e.g. a prior `remove` whose PageIndex cleanup failed left the row + blob but dropped the hashes.json entry), re-adding that content makes col.add() return the OLD doc_id, so a subsequent compile failure would roll back and DELETE that prior document's blob. The old whole-store hardlink snapshot did not have this bug (a dedup-hit blob shares the backup inode and is left in place on rollback). Fix: capture the blob set *before* indexing and register only the paths this add actually created (set difference), guarded by `if index_result.doc_id`. That also neutralizes an unexpected empty/falsy doc_id, which would otherwise glob `*/*` and register — then delete on rollback — the entire blob store. Tests (tests/test_add_command.py): - test_long_doc_rollback_removes_only_the_new_blob: a failed long-doc add rolls back its own new blob + images subtree while a pre-existing blob survives. - test_long_doc_dedup_hit_does_not_delete_existing_blob: a dedup hit (existing doc_id, no new blob) must not delete the pre-existing blob on rollback — verified this test FAILS on the pre-fix code. Claude-Session: https://claude.ai/code/session_018WiFnTo1YW9mtw47Fzir9K
TL;DR
This PR makes document import fail more cleanly. If
openkb addis interrupted or errors halfway through, OpenKB should no longer leave half-written KB files behind; it can recover the previous stable state before accepting the next mutation.Summary
Adds crash-safe mutation recovery for serial KB mutations without bringing back the concurrent add pipeline from #104.
This PR keeps the scope focused on the first split-out piece from the abandoned PR: journaled snapshot/rollback for
addand PageIndex Cloud import, plus the atomic writer prerequisites needed for hardlink-backed rollback.What changed
Added
openkb/mutation.pywith:Wired recovery into serial mutation paths:
openkb add <file>converts into staging, snapshots final KB paths, publishes, compiles, then commits via registry write + journal markopenkb add --from-pageindex-cloudprepares cloud data read-only, snapshots doc-specific paths, writes/compiles/registers under the same recovery contractKept conversion serial while adding staging support:
convert_document(..., staging_dir=...)writes raw/source artifacts into isolated staging before commitMade wiki writers atomic where hardlink snapshots depend on temp+replace semantics:
Explicitly out of scope
This does not reintroduce the concurrency half of #104:
ThreadPoolExecutorfile_processing_jobsValidation
UV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev pytest -q898 passed, 5 warningsUV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev pytest tests/test_add_command.py::TestImportFromPageindexCloud tests/test_mutation.py -q25 passedUV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev pytest tests/test_converter.py tests/test_add_command.py tests/test_mutation.py -q72 passedUV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev ruff check openkb/converter.py openkb/mutation.py tests/test_mutation.pyNotes
Full-project
ruffandtyare not clean on currentmain; this PR fixes the new diagnostics introduced by the recovery changes but does not widen scope to existing lint/type debt.