feat(cli): add openkb remove to safely delete a document (closes #41)#51
Conversation
Removing a document used to require manual cleanup across summaries, sources, concepts, the index, and the hash registry. `openkb remove` does it in one step with a plan/confirm flow: - Resolves identifier by exact filename, exact doc_name slug, or fuzzy substring; refuses to act on ambiguous matches. - Prints a DELETE/MODIFY plan and confirms before touching disk; `--dry-run` prints only, `--yes` skips the prompt. - Concept pages whose only source was the removed doc are deleted by default; `--keep-empty-concepts` keeps them with empty sources for the replace-with-newer-version workflow. - `--keep-raw` preserves the original file in raw/. - Runs `lint --fix` afterwards so any stray wikilinks pointing at the removed summary or deleted concept pages get stripped automatically.
Code reviewFound 1 issue:
🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
1. Persist `doc_name` in the hash registry so `remove_by_doc_name`
actually prunes the entry on real KBs. Previously `add_single_file`
wrote only `{"name", "type"}`, so `remove_by_doc_name`'s lookup
silently failed and the registry never shrank — re-adding the same
file would then be wrongly skipped as "already known".
2. Tighten `_remove_section_entry` to strict `- {link}` prefix matching,
dropping the substring fallback. The fallback could wrongly delete
sibling bullets whose brief text mentions the removed wikilink (the
same class of bug that earlier commits 88a7a74 and 3995bc1 fixed for
the insert/contains helpers).
3. Make the CLI plan-builder classify concept pages by frontmatter
`sources:` membership only, so the announced plan reflects what the
executor will actually do. Body-only references (e.g. a stray
`See also:` line a user added by hand) used to be reported as
DELETE but the executor only ever MODIFIED them.
4. Replace the over-greedy `^\s*See also:\s*\[\[…\]\]\s*\n?` regex
with a bounded two-pass match. The previous pattern's `\s` matches
`\n`, so removal collapsed paragraph spacing in surrounding
content. The new pair handles the dominant paragraph-block form
(preserving one trailing newline so spacing survives) plus a
line-anchored fallback for hand-edited inline references.
Adds 6 regression tests covering each fix: end-to-end add → registry
contract → remove, strict prefix matching, body-only reference plan
exclusion, and See also spacing in both mid-file and trailing forms.
Code reviewFunctional completeness pass — does
Out of scope but worth noting (NOT flagged as bug): the body prose of multi-source concept pages is intentionally NOT regenerated on remove, and the 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
Functional completeness gaps surfaced in the second code review:
1. `openkb remove` left `wiki/sources/images/<doc_name>/` orphaned.
For image-heavy PDFs/docx/pptx this leaked tens to hundreds of MB
per add → remove cycle, silently. Plan now lists the images
directory under DELETE actions and the executor shutil.rmtrees it.
2. `openkb remove` left PageIndex's local state orphaned for long
PDFs: the SQLite row in `.openkb/pageindex.db`, the managed PDF
copy at `.openkb/files/<collection>/<doc_id>.pdf`, and the
extracted-images directory. The blocking dependency was that
`add_single_file` never persisted PageIndex's `doc_id`, so even
if remove wanted to call `Collection.delete_document(doc_id)` it
had no handle.
Fix is in two parts:
- Ingest now stores `doc_id` on the long-doc registry entry.
- Remove instantiates `PageIndexClient(storage_path=.openkb)` and
calls `delete_document(doc_id)` when the registry entry says
`type == "long_pdf"` and `.openkb/pageindex.db` exists. Legacy
entries (registered before this fix, no `doc_id`) fall back to
matching by `doc_name` via `list_documents()`; ambiguous
multi-matches are skipped with a WARN rather than guessed.
PageIndex cleanup runs after the wiki-side mutations so that a
partial failure there leaves only PageIndex bloat, not a
half-removed wiki. Cloud-mode KBs (no local `pageindex.db`) skip
PageIndex cleanup entirely so we don't pull in an LLM key check
during remove.
Adds 7 regression tests covering: image dir deletion, dry-run image
preservation, doc_id persistence during ingest, PageIndex delete
with stored doc_id, fallback lookup, ambiguous-match refusal, and
the no-PageIndex-state no-op path.
Code reviewLogical-consistency pass — does the wiki end in a sound state for subsequent
Out of scope (didn't meet the 80-confidence threshold but worth tracking as follow-ups):
🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
The previous executor order removed the registry entry before attempting PageIndex cleanup. If PageIndex's local-mode delete failed (LLM-key check inside PageIndexClient.__init__, network blip in LiteLLM provider lookup, transient SQLite lock), the WARN told the user state was partial — but with the registry entry already gone there was no retry path, and the next `openkb add <same file>` would hit PageIndex's SHA-256 dedup, silently re-bind to the stale doc_id, and feed the OLD parsed structure to the compiler. A user removing a document specifically because the parse was bad would get the same bad parse back without warning. Reorder so the executor reaches the registry write only after PageIndex cleanup succeeds: 1. unlink summary / sources / images / concepts / index entries 2. fix_broken_links (so a retry sees a clean wiki) 3. _cleanup_pageindex — on failure: WARN with retry hint, return 4. registry.remove_by_doc_name ← commit point 5. raw_path.unlink, append_log Every step before the commit point is already idempotent (`unlink(missing_ok=True)`, `shutil.rmtree(ignore_errors=True)`, concept/index helpers that cheap-filter already-clean pages, and PageIndex's own delete using missing_ok internally), so re-running `openkb remove` after a WARN finishes the job once the underlying issue is resolved. Adds 2 regression tests: PageIndex failure preserves registry + doc_id for retry, and a second invocation with a working PageIndex completes the removal cleanly.
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. The latest commit's executor reordering (PageIndex cleanup before the registry commit point) is sound: each pre-commit step is idempotent on retry ( 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
| from pathlib import Path | ||
| from unittest.mock import MagicMock, patch | ||
|
|
||
| import pytest |
Summary
openkb remove <identifier>for safely deleting a document from a knowledge base in one step, with a plan/confirm flow.doc_nameslug → fuzzy substring; ambiguous matches refuse to act.--keep-empty-conceptsretains them with empty sources (useful when replacing the doc with a newer version).--keep-rawpreserves the original file inraw/.--dry-runprints the plan only,--yesskips the confirm prompt.lint --fixafter removal so stray wikilinks pointing at the removed summary or deleted concept pages get stripped automatically.Closes #41.
Test plan
pytest— full suite 287 passed (260 prior + 27 new intests/test_remove.py).index.mdDocuments + deleted-Concepts entries pruned, surviving entries kept--keep-rawand--keep-empty-conceptsflag paths verified--dry-runprints plan without touching disk[[summaries/...]]link in a sibling page is stripped by the autolint --fixpass