feat: add serial crash-safe mutation recovery by gwokhou · Pull Request #142 · VectifyAI/OpenKB

gwokhou · 2026-06-26T10:32:16Z

TL;DR

This PR makes document import fail more cleanly. If openkb add is interrupted or errors halfway through, OpenKB should no longer leave half-written KB files behind; it can recover the previous stable state before accepting the next mutation.

Summary

Adds crash-safe mutation recovery for serial KB mutations without bringing back the concurrent add pipeline from #104.

This PR keeps the scope focused on the first split-out piece from the abandoned PR: journaled snapshot/rollback for add and PageIndex Cloud import, plus the atomic writer prerequisites needed for hardlink-backed rollback.

What changed

Added openkb/mutation.py with:
- mutation snapshots
- active/committed/rolled_back journals
- rollback and recovery of interrupted mutations
- bounded rollback retry
- malformed journal handling
- staged artifact publishing with atomic rename + EXDEV copy fallback
Wired recovery into serial mutation paths:
- openkb add <file> converts into staging, snapshots final KB paths, publishes, compiles, then commits via registry write + journal mark
- openkb add --from-pageindex-cloud prepares cloud data read-only, snapshots doc-specific paths, writes/compiles/registers under the same recovery contract
- exclusive KB lock acquisition drains pending journals before mutating
Kept conversion serial while adding staging support:
- convert_document(..., staging_dir=...) writes raw/source artifacts into isolated staging before commit
- no parallel prepare hook or doc-name override path is introduced
Made wiki writers atomic where hardlink snapshots depend on temp+replace semantics:
- compiler summary/concept/entity/index writes
- lint broken-link cleanup

Explicitly out of scope

This does not reintroduce the concurrency half of #104:

no ThreadPoolExecutor
no file_processing_jobs
no jobs>1 directory add path
no parallel prepare pipeline
no batch doc-name reservation logic

Validation

UV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev pytest -q
- 898 passed, 5 warnings
UV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev pytest tests/test_add_command.py::TestImportFromPageindexCloud tests/test_mutation.py -q
- 25 passed
UV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev pytest tests/test_converter.py tests/test_add_command.py tests/test_mutation.py -q
- 72 passed
UV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev ruff check openkb/converter.py openkb/mutation.py tests/test_mutation.py
- passed

Notes

Full-project ruff and ty are not clean on current main; this PR fixes the new diagnostics introduced by the recovery changes but does not widen scope to existing lint/type debt.

KylinMountain · 2026-06-30T10:16:11Z

Solid PR — the snapshot/rollback design is sound and the hardlink backup contract holds (every writer into the hardlinked dirs is atomic temp+replace or append-only).

One gap worth closing before merge, since it's the PR's own goal: the URL path isn't crash-safe. openkb add <url> calls add_single_file(..., stage=False) (cli.py:964), so staging_dir=None and convert_document (cli.py:444) writes into the live KB before snapshot_paths() (cli.py:461). A crash mid-conversion leaves a partial wiki/sources/<doc>.md with no journal to roll back — exactly the orphan state the PR eliminates for the local/cloud paths. Could the URL path use stage=True and handle the dedup cleanup separately?

Keep URL ingests on add_single_file's staged conversion path so source artifacts are published only after the mutation snapshot exists and can roll back on failure. Add regression coverage for URL failure cleanup while preserving downloaded raw files for retry, and replace legacy asyncio.run mocks with async compiler stubs to avoid unawaited coroutine warnings in the touched tests.

gwokhou · 2026-06-30T12:09:56Z

Solid PR — the snapshot/rollback design is sound and the hardlink backup contract holds (every writer into the hardlinked dirs is atomic temp+replace or append-only).

One gap worth closing before merge, since it's the PR's own goal: the URL path isn't crash-safe. openkb add <url> calls add_single_file(..., stage=False) (cli.py:964), so staging_dir=None and convert_document (cli.py:444) writes into the live KB before snapshot_paths() (cli.py:461). A crash mid-conversion leaves a partial wiki/sources/<doc>.md with no journal to roll back — exactly the orphan state the PR eliminates for the local/cloud paths. Could the URL path use stage=True and handle the dedup cleanup separately?

@KylinMountain Thanks for the CR, gap fixed🫡

KylinMountain

LGTM

* perf(mutation): don't snapshot the whole blob store on every add The serial crash-safe add path (#142) listed `.openkb/files` (the PageIndex blob store) in both the snapshot path set and `hardlink_dirs`, so every add hardlinked the entire store into the rollback backup — one `os.link` per existing blob, plus the matching unlink on commit — a cost that scales with total KB size, not with the document being added. On a filesystem without hardlink support (cross-device staging, some Windows / cloud-sync folders) `_hardlink_or_copy` fell back to a full byte copy of the whole store on *every* add. The blob store is append-only by `{doc_id}`: an add only ever creates the new doc's blob, and that name is assigned during indexing — after the snapshot is taken. So instead of snapshotting the whole tree up front, register just the new blob once indexing has run, via `MutationSnapshot.track_new()`, which records it with no backup and rewrites the active journal so both in-process rollback and crash recovery remove exactly this doc's artifacts. Cloud import never writes a local blob, so `.openkb/files` is dropped from its snapshot entirely (it was pure waste there). - `MutationSnapshot.track_new(paths)`: register post-snapshot-created paths for removal on rollback; persists to the journal for crash recovery. - add: drop `.openkb/files` from the eager snapshot + `hardlink_dirs`; call `track_new(files/*/<doc_id>*)` right after `index_long_document`. - cloud import: drop `.openkb/files` from `hardlink_dirs`. Tests: new-blob rollback removes exactly the doc's blob + images subtree and leaves existing blobs untouched (same inode); track_new persists so recover_pending_journals cleans a crashed add; `_snapshot_add_paths` no longer lists `.openkb/files`. Full suite green (pre-existing trafilatura-missing url_ingest failures aside). Claude-Session: https://claude.ai/code/session_018WiFnTo1YW9mtw47Fzir9K * fix(mutation): only track blobs this add created (dedup-hit rollback bug) Self-review of the previous commit found a regression it introduced: track_new globbed `.openkb/files/*/<doc_id>*` and registered whatever matched for removal on rollback. But PageIndex content-dedups — `add_document` returns an EXISTING doc_id and writes no new blob when the same content is already indexed. If hashes.json and pageindex.db diverge (e.g. a prior `remove` whose PageIndex cleanup failed left the row + blob but dropped the hashes.json entry), re-adding that content makes col.add() return the OLD doc_id, so a subsequent compile failure would roll back and DELETE that prior document's blob. The old whole-store hardlink snapshot did not have this bug (a dedup-hit blob shares the backup inode and is left in place on rollback). Fix: capture the blob set *before* indexing and register only the paths this add actually created (set difference), guarded by `if index_result.doc_id`. That also neutralizes an unexpected empty/falsy doc_id, which would otherwise glob `*/*` and register — then delete on rollback — the entire blob store. Tests (tests/test_add_command.py): - test_long_doc_rollback_removes_only_the_new_blob: a failed long-doc add rolls back its own new blob + images subtree while a pre-existing blob survives. - test_long_doc_dedup_hit_does_not_delete_existing_blob: a dedup hit (existing doc_id, no new blob) must not delete the pre-existing blob on rollback — verified this test FAILS on the pre-fix code. Claude-Session: https://claude.ai/code/session_018WiFnTo1YW9mtw47Fzir9K

feat: add serial mutation recovery

4ca33c7

gwokhou force-pushed the pr/serial-mutation-recovery branch from 5346945 to 4ca33c7 Compare June 26, 2026 10:38

gwokhou marked this pull request as ready for review June 26, 2026 10:50

KylinMountain approved these changes Jun 30, 2026

View reviewed changes

KylinMountain merged commit 4616e49 into VectifyAI:main Jun 30, 2026

KylinMountain mentioned this pull request Jul 1, 2026

perf(mutation): don't snapshot the whole blob store on every add #155

Merged

gwokhou deleted the pr/serial-mutation-recovery branch July 1, 2026 03:01

gwokhou mentioned this pull request Jul 1, 2026

feat(mutation): route serial add mutations through coordinator #156

Open

This was referenced Jul 2, 2026

Fix XLSX ingestion memory spikes with streaming parser #42

Closed

Incrementally resume long-PDF ingestion using cached PageIndex doc_id #43

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add serial crash-safe mutation recovery#142

feat: add serial crash-safe mutation recovery#142
KylinMountain merged 2 commits into
VectifyAI:mainfrom
gwokhou:pr/serial-mutation-recovery

gwokhou commented Jun 26, 2026 •

edited

Loading

Uh oh!

KylinMountain commented Jun 30, 2026 •

edited

Loading

Uh oh!

gwokhou commented Jun 30, 2026

Uh oh!

KylinMountain left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

gwokhou commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TL;DR

Summary

What changed

Explicitly out of scope

Validation

Notes

Uh oh!

KylinMountain commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gwokhou commented Jun 30, 2026

Uh oh!

KylinMountain left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gwokhou commented Jun 26, 2026 •

edited

Loading

KylinMountain commented Jun 30, 2026 •

edited

Loading