Skip to content

feat: add serial crash-safe mutation recovery#142

Merged
KylinMountain merged 2 commits into
VectifyAI:mainfrom
gwokhou:pr/serial-mutation-recovery
Jun 30, 2026
Merged

feat: add serial crash-safe mutation recovery#142
KylinMountain merged 2 commits into
VectifyAI:mainfrom
gwokhou:pr/serial-mutation-recovery

Conversation

@gwokhou

@gwokhou gwokhou commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

TL;DR

This PR makes document import fail more cleanly. If openkb add is interrupted or errors halfway through, OpenKB should no longer leave half-written KB files behind; it can recover the previous stable state before accepting the next mutation.

Summary

Adds crash-safe mutation recovery for serial KB mutations without bringing back the concurrent add pipeline from #104.

This PR keeps the scope focused on the first split-out piece from the abandoned PR: journaled snapshot/rollback for add and PageIndex Cloud import, plus the atomic writer prerequisites needed for hardlink-backed rollback.

What changed

  • Added openkb/mutation.py with:

    • mutation snapshots
    • active/committed/rolled_back journals
    • rollback and recovery of interrupted mutations
    • bounded rollback retry
    • malformed journal handling
    • staged artifact publishing with atomic rename + EXDEV copy fallback
  • Wired recovery into serial mutation paths:

    • openkb add <file> converts into staging, snapshots final KB paths, publishes, compiles, then commits via registry write + journal mark
    • openkb add --from-pageindex-cloud prepares cloud data read-only, snapshots doc-specific paths, writes/compiles/registers under the same recovery contract
    • exclusive KB lock acquisition drains pending journals before mutating
  • Kept conversion serial while adding staging support:

    • convert_document(..., staging_dir=...) writes raw/source artifacts into isolated staging before commit
    • no parallel prepare hook or doc-name override path is introduced
  • Made wiki writers atomic where hardlink snapshots depend on temp+replace semantics:

    • compiler summary/concept/entity/index writes
    • lint broken-link cleanup

Explicitly out of scope

This does not reintroduce the concurrency half of #104:

  • no ThreadPoolExecutor
  • no file_processing_jobs
  • no jobs>1 directory add path
  • no parallel prepare pipeline
  • no batch doc-name reservation logic

Validation

  • UV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev pytest -q

    • 898 passed, 5 warnings
  • UV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev pytest tests/test_add_command.py::TestImportFromPageindexCloud tests/test_mutation.py -q

    • 25 passed
  • UV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev pytest tests/test_converter.py tests/test_add_command.py tests/test_mutation.py -q

    • 72 passed
  • UV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev ruff check openkb/converter.py openkb/mutation.py tests/test_mutation.py

    • passed

Notes

Full-project ruff and ty are not clean on current main; this PR fixes the new diagnostics introduced by the recovery changes but does not widen scope to existing lint/type debt.

@gwokhou gwokhou force-pushed the pr/serial-mutation-recovery branch from 5346945 to 4ca33c7 Compare June 26, 2026 10:38
@gwokhou gwokhou marked this pull request as ready for review June 26, 2026 10:50
@KylinMountain

KylinMountain commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Solid PR — the snapshot/rollback design is sound and the hardlink backup contract holds (every writer into the hardlinked dirs is atomic temp+replace or append-only).

One gap worth closing before merge, since it's the PR's own goal: the URL path isn't crash-safe. openkb add <url> calls add_single_file(..., stage=False) (cli.py:964), so staging_dir=None and convert_document (cli.py:444) writes into the live KB before snapshot_paths() (cli.py:461). A crash mid-conversion leaves a partial wiki/sources/<doc>.md with no journal to roll back — exactly the orphan state the PR eliminates for the local/cloud paths. Could the URL path use stage=True and handle the dedup cleanup separately?

Keep URL ingests on add_single_file's staged conversion path so source artifacts are published only after the mutation snapshot exists and can roll back on failure.

Add regression coverage for URL failure cleanup while preserving downloaded raw files for retry, and replace legacy asyncio.run mocks with async compiler stubs to avoid unawaited coroutine warnings in the touched tests.
@gwokhou

gwokhou commented Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

Solid PR — the snapshot/rollback design is sound and the hardlink backup contract holds (every writer into the hardlinked dirs is atomic temp+replace or append-only).

One gap worth closing before merge, since it's the PR's own goal: the URL path isn't crash-safe. openkb add <url> calls add_single_file(..., stage=False) (cli.py:964), so staging_dir=None and convert_document (cli.py:444) writes into the live KB before snapshot_paths() (cli.py:461). A crash mid-conversion leaves a partial wiki/sources/<doc>.md with no journal to roll back — exactly the orphan state the PR eliminates for the local/cloud paths. Could the URL path use stage=True and handle the dedup cleanup separately?

@KylinMountain Thanks for the CR, gap fixed🫡

@KylinMountain KylinMountain left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@KylinMountain KylinMountain merged commit 4616e49 into VectifyAI:main Jun 30, 2026
@gwokhou gwokhou deleted the pr/serial-mutation-recovery branch July 1, 2026 03:01
KylinMountain added a commit that referenced this pull request Jul 1, 2026
* perf(mutation): don't snapshot the whole blob store on every add

The serial crash-safe add path (#142) listed `.openkb/files` (the PageIndex
blob store) in both the snapshot path set and `hardlink_dirs`, so every add
hardlinked the entire store into the rollback backup — one `os.link` per
existing blob, plus the matching unlink on commit — a cost that scales with
total KB size, not with the document being added. On a filesystem without
hardlink support (cross-device staging, some Windows / cloud-sync folders)
`_hardlink_or_copy` fell back to a full byte copy of the whole store on
*every* add.

The blob store is append-only by `{doc_id}`: an add only ever creates the new
doc's blob, and that name is assigned during indexing — after the snapshot is
taken. So instead of snapshotting the whole tree up front, register just the
new blob once indexing has run, via `MutationSnapshot.track_new()`, which
records it with no backup and rewrites the active journal so both in-process
rollback and crash recovery remove exactly this doc's artifacts.

Cloud import never writes a local blob, so `.openkb/files` is dropped from its
snapshot entirely (it was pure waste there).

- `MutationSnapshot.track_new(paths)`: register post-snapshot-created paths for
  removal on rollback; persists to the journal for crash recovery.
- add: drop `.openkb/files` from the eager snapshot + `hardlink_dirs`; call
  `track_new(files/*/<doc_id>*)` right after `index_long_document`.
- cloud import: drop `.openkb/files` from `hardlink_dirs`.

Tests: new-blob rollback removes exactly the doc's blob + images subtree and
leaves existing blobs untouched (same inode); track_new persists so
recover_pending_journals cleans a crashed add; `_snapshot_add_paths` no longer
lists `.openkb/files`. Full suite green (pre-existing trafilatura-missing
url_ingest failures aside).

Claude-Session: https://claude.ai/code/session_018WiFnTo1YW9mtw47Fzir9K

* fix(mutation): only track blobs this add created (dedup-hit rollback bug)

Self-review of the previous commit found a regression it introduced: track_new
globbed `.openkb/files/*/<doc_id>*` and registered whatever matched for removal
on rollback. But PageIndex content-dedups — `add_document` returns an EXISTING
doc_id and writes no new blob when the same content is already indexed. If
hashes.json and pageindex.db diverge (e.g. a prior `remove` whose PageIndex
cleanup failed left the row + blob but dropped the hashes.json entry),
re-adding that content makes col.add() return the OLD doc_id, so a subsequent
compile failure would roll back and DELETE that prior document's blob. The old
whole-store hardlink snapshot did not have this bug (a dedup-hit blob shares
the backup inode and is left in place on rollback).

Fix: capture the blob set *before* indexing and register only the paths this
add actually created (set difference), guarded by `if index_result.doc_id`.
That also neutralizes an unexpected empty/falsy doc_id, which would otherwise
glob `*/*` and register — then delete on rollback — the entire blob store.

Tests (tests/test_add_command.py):
- test_long_doc_rollback_removes_only_the_new_blob: a failed long-doc add rolls
  back its own new blob + images subtree while a pre-existing blob survives.
- test_long_doc_dedup_hit_does_not_delete_existing_blob: a dedup hit (existing
  doc_id, no new blob) must not delete the pre-existing blob on rollback —
  verified this test FAILS on the pre-fix code.

Claude-Session: https://claude.ai/code/session_018WiFnTo1YW9mtw47Fzir9K
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants