
feat(cli): add openkb remove to safely delete a document (closes #41) #51

Merged
KylinMountain merged 5 commits into main from feat/remove-document on May 16, 2026

Conversation

@KylinMountain
Collaborator

Summary

  • Adds openkb remove <identifier> for safely deleting a document from a knowledge base in one step, with a plan/confirm flow.
  • Identifier resolves by exact filename → exact doc_name slug → fuzzy substring; ambiguous matches refuse to act.
  • Concept pages whose only source was the removed doc are deleted by default; --keep-empty-concepts retains them with empty sources (useful when replacing the doc with a newer version). --keep-raw preserves the original file in raw/. --dry-run prints the plan only, --yes skips the confirm prompt.
  • Auto-runs lint --fix after removal so stray wikilinks pointing at the removed summary or deleted concept pages get stripped automatically.

Closes #41.
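The resolution order described above can be sketched as follows; `resolve_identifier` and the `docs` mapping are illustrative stand-ins, not the actual helpers in `cli.py`:

```python
def resolve_identifier(identifier: str, docs: dict[str, str]) -> str:
    """Sketch of the lookup order: exact filename -> exact doc_name slug ->
    fuzzy substring, refusing to act on ambiguous matches.

    docs maps doc_name slug -> original filename.
    """
    # 1. exact filename match
    by_filename = [slug for slug, name in docs.items() if name == identifier]
    if len(by_filename) == 1:
        return by_filename[0]
    # 2. exact doc_name slug match
    if identifier in docs:
        return identifier
    # 3. fuzzy substring over slugs and filenames
    fuzzy = [slug for slug, name in docs.items()
             if identifier in slug or identifier in name]
    if len(fuzzy) > 1:
        raise ValueError(f"ambiguous identifier {identifier!r}: {sorted(fuzzy)}")
    if not fuzzy:
        raise KeyError(f"no document matches {identifier!r}")
    return fuzzy[0]
```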

Test plan

  • pytest — full suite 287 passed (260 prior + 27 new in tests/test_remove.py).
  • End-to-end smoke test on a 2-doc / 3-concept KB:
    • single-source concept is deleted; multi-source concept is edited; unrelated concept untouched
    • index.md Documents + deleted-Concepts entries pruned, surviving entries kept
    • --keep-raw and --keep-empty-concepts flag paths verified
    • --dry-run prints plan without touching disk
    • dangling [[summaries/...]] link in a sibling page is stripped by the auto lint --fix pass
  • Ambiguous and unknown identifiers fail safe (no files modified).

Removing a document used to require manual cleanup across summaries,
sources, concepts, the index, and the hash registry. `openkb remove`
does it in one step with a plan/confirm flow:

- Resolves identifier by exact filename, exact doc_name slug, or fuzzy
  substring; refuses to act on ambiguous matches.
- Prints a DELETE/MODIFY plan and confirms before touching disk;
  `--dry-run` prints only, `--yes` skips the prompt.
- Concept pages whose only source was the removed doc are deleted by
  default; `--keep-empty-concepts` keeps them with empty sources for
  the replace-with-newer-version workflow.
- `--keep-raw` preserves the original file in raw/.
- Runs `lint --fix` afterwards so any stray wikilinks pointing at the
  removed summary or deleted concept pages get stripped automatically.
@KylinMountain
Collaborator Author

Code review

Found 1 issue:

  1. HashRegistry.remove_by_doc_name silently no-ops on real knowledge bases. add_single_file only persists {"name", "type"} to the registry — doc_name is never stored. The new remove_by_doc_name then iterates entries matching meta.get("doc_name") == doc_name, which is always None for KBs built via openkb add, so the hash entry never gets pruned. The plan still prints REGISTRY remove hash entry (h_xxx…), but execution silently fails, and re-adding the same file later is wrongly skipped as "already known". Tests pass only because _seed_two_doc_kb manually injects doc_name into seed metadata.

    add_single_file registry write (missing doc_name):

    OpenKB/openkb/cli.py, lines 209 to 213 in 254f5a2:

    ```python
    # Register hash only after successful compilation
    if result.file_hash:
        doc_type = "long_pdf" if result.is_long_doc else file_path.suffix.lstrip(".")
        registry.add(result.file_hash, {"name": file_path.name, "type": doc_type})
    ```

    remove_by_doc_name lookup that never matches:

    OpenKB/openkb/state.py, lines 43 to 52 in 254f5a2:

    ```python
    def remove_by_doc_name(self, doc_name: str) -> bool:
        """Remove the entry whose metadata['doc_name'] matches. Returns True if removed."""
        for file_hash, meta in list(self._data.items()):
            if meta.get("doc_name") == doc_name:
                del self._data[file_hash]
                self._persist()
                return True
        return False
    ```

    Fix: persist doc_name (and ideally path) in add_single_file (line 212), e.g. registry.add(result.file_hash, {"name": file_path.name, "doc_name": doc_name, "type": doc_type, "path": str(result.raw_path.relative_to(kb_dir))}) — and add a regression test that seeds the registry via openkb add rather than direct injection.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

1. Persist `doc_name` in the hash registry so `remove_by_doc_name`
   actually prunes the entry on real KBs. Previously `add_single_file`
   wrote only `{"name", "type"}`, so `remove_by_doc_name`'s lookup
   silently failed and the registry never shrank — re-adding the same
   file would then be wrongly skipped as "already known".

2. Tighten `_remove_section_entry` to strict `- {link}` prefix matching,
   dropping the substring fallback. The fallback could wrongly delete
   sibling bullets whose brief text mentions the removed wikilink (the
   same class of bug that earlier commits 88a7a74 and 3995bc1 fixed for
   the insert/contains helpers).

3. Make the CLI plan-builder classify concept pages by frontmatter
   `sources:` membership only, so the announced plan reflects what the
   executor will actually do. Body-only references (e.g. a stray
   `See also:` line a user added by hand) used to be reported as
   DELETE but the executor only ever MODIFIED them.

4. Replace the over-greedy `^\s*See also:\s*\[\[…\]\]\s*\n?` regex
   with a bounded two-pass match. The previous pattern's `\s` matches
   `\n`, so removal collapsed paragraph spacing in surrounding
   content. The new pair handles the dominant paragraph-block form
   (preserving one trailing newline so spacing survives) plus a
   line-anchored fallback for hand-edited inline references.
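The bounded two-pass match in item 4 can be sketched as follows; these regexes are illustrative assumptions, not the shipped patterns:

```python
import re

# Pass 1: the dominant paragraph-block form. The replacement keeps a single
# newline so spacing between neighbouring paragraphs survives.
PARA = re.compile(r"\n\nSee also: \[\[[^\]]+\]\]\n(?=\n|$)")
# Pass 2: line-anchored fallback for hand-edited inline references. Note the
# character class is [ \t], not \s, so it can never eat a newline.
LINE = re.compile(r"(?m)^[ \t]*See also: \[\[[^\]]+\]\][ \t]*\n?")

def strip_see_also(text: str) -> str:
    return LINE.sub("", PARA.sub("\n", text))
```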

Adds 6 regression tests covering each fix: end-to-end add → registry
contract → remove, strict prefix matching, body-only reference plan
exclusion, and See also spacing in both mid-file and trailing forms.
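The strict prefix rule in item 2 boils down to a startswith check; `keep_line` is a hypothetical stand-in for the matching inside `_remove_section_entry`:

```python
def keep_line(line: str, link: str) -> bool:
    # Strict: delete only bullets that *start* with "- {link}", never
    # siblings whose brief text merely mentions the link.
    return not line.lstrip().startswith(f"- {link}")

bullets = [
    "- [[concepts/alpha]]: the page being removed",
    "- [[concepts/beta]]: contrast with [[concepts/alpha]]",
]
kept = [b for b in bullets if keep_line(b, "[[concepts/alpha]]")]
# The old substring fallback would have deleted both bullets.
```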
@KylinMountain
Collaborator Author

Code review

Functional completeness pass — does openkb remove clean up everything openkb add produces? Found 2 real gaps.

  1. Per-doc image directory is orphaned. openkb/images.py writes every image into wiki/sources/images/<doc_name>/ (the same doc_name stem used for the summary and source files), but the remove plan/executor never references that directory. For image-heavy PDFs / docx / pptx that's tens to hundreds of MB silently left behind per removed doc, with no --keep-images flag to even hint it was intentional.

    Plan only enumerates summary + sources + concepts + index + raw + registry:

    OpenKB/openkb/cli.py, lines 519 to 535 in c504e26:

    ```python
    # ----- Build the plan (no side effects) -----
    actions: list[tuple[str, str]] = []
    summary_path = wiki_dir / "summaries" / f"{doc_name}.md"
    if summary_path.exists():
        actions.append(("DELETE", str(summary_path.relative_to(kb_dir))))
    source_md = wiki_dir / "sources" / f"{doc_name}.md"
    source_json = wiki_dir / "sources" / f"{doc_name}.json"
    if source_md.exists():
        actions.append(("DELETE", str(source_md.relative_to(kb_dir))))
    if source_json.exists():
        actions.append(("DELETE", str(source_json.relative_to(kb_dir))))
    # Scan concept pages to predict which will be edited vs. deleted.
    # Only frontmatter ``sources:`` membership drives the plan — body-only
    ```

    Images directory written here, never matched on remove:

    OpenKB/openkb/images.py, lines 65 to 75 in c504e26:

    ```python
                pix = None
            except Exception:
                logger.warning("Failed to save image block on page %d", page_num)
                continue
            rel_path = f"sources/images/{doc_name}/(unknown)"
            page_images.setdefault(page_num, []).append(rel_path)
        return page_images


    def convert_pdf_to_pages(pdf_path: Path, doc_name: str, images_dir: Path) -> list[dict]:
    ```

    Fix is small: compute images_dir = wiki_dir / "sources" / "images" / doc_name, add a DELETE action when it exists, and shutil.rmtree(images_dir, ignore_errors=True) in the executor block around line 612.

  2. PageIndex state is orphaned for long PDFs. The long-doc ingest path calls Collection.add(...) and receives a doc_id, but the registry only persists {name, doc_name, type}; doc_id is dropped on the floor. openkb remove never imports pageindex and never calls Collection.delete_document(doc_id), so for every removed long PDF this stays on disk: the SQLite row in .openkb/pageindex.db, the managed copy at .openkb/files/<collection>/<doc_id>.pdf, and any extracted images at .openkb/files/<collection>/<doc_id>/images/. For a 200MB PDF the entire 200MB managed copy leaks per add → remove cycle, and the document still shows up to anything that calls col.list_documents().

    doc_id acquired here:

    OpenKB/openkb/indexer.py, lines 54 to 58 in c504e26:

    ```python
    for attempt in range(1, max_retries + 1):
        try:
            doc_id = col.add(str(pdf_path))
            logger.info("PageIndex added %s → doc_id=%s (attempt %d)", pdf_path.name, doc_id, attempt)
            break
    ```

    Registry write drops doc_id — no handle survives to feed delete_document:

    OpenKB/openkb/cli.py, lines 209 to 216 in c504e26:

    ```python
    # Register hash only after successful compilation
    if result.file_hash:
        doc_type = "long_pdf" if result.is_long_doc else file_path.suffix.lstrip(".")
        registry.add(result.file_hash, {
            "name": file_path.name,
            "doc_name": doc_name,
            "type": doc_type,
        })
    ```

    Fix is in two parts: (a) persist doc_id in the registry alongside doc_name for long-doc entries during ingest, (b) in the remove flow, if meta.get("type") == "long_pdf" and a doc_id is present, call pageindex.Collection(storage_path=str(openkb_dir)).delete_document(doc_id). Existing KBs without doc_id will need a fallback (look up by basename via col.list_documents()) or a one-time migration.

Out of scope but worth noting (NOT flagged as a bug): the body prose of multi-source concept pages is intentionally NOT regenerated on remove, and the `brief:` frontmatter stays as last synthesized. That's a reasonable design choice to avoid an LLM call per remove, but the remove command's help text doesn't say so — worth a one-line note in the docstring at cli.py:470-482.


Functional completeness gaps surfaced in the second code review:

1. `openkb remove` left `wiki/sources/images/<doc_name>/` orphaned.
   For image-heavy PDFs/docx/pptx this leaked tens to hundreds of MB
   per add → remove cycle, silently. Plan now lists the images
   directory under DELETE actions and the executor shutil.rmtrees it.

2. `openkb remove` left PageIndex's local state orphaned for long
   PDFs: the SQLite row in `.openkb/pageindex.db`, the managed PDF
   copy at `.openkb/files/<collection>/<doc_id>.pdf`, and the
   extracted-images directory. The blocking dependency was that
   `add_single_file` never persisted PageIndex's `doc_id`, so even
   if remove wanted to call `Collection.delete_document(doc_id)` it
   had no handle.

   Fix is in two parts:
   - Ingest now stores `doc_id` on the long-doc registry entry.
   - Remove instantiates `PageIndexClient(storage_path=.openkb)` and
     calls `delete_document(doc_id)` when the registry entry says
     `type == "long_pdf"` and `.openkb/pageindex.db` exists. Legacy
     entries (registered before this fix, no `doc_id`) fall back to
     matching by `doc_name` via `list_documents()`; ambiguous
     multi-matches are skipped with a WARN rather than guessed.

PageIndex cleanup runs after the wiki-side mutations so that a
partial failure there leaves only PageIndex bloat, not a
half-removed wiki. Cloud-mode KBs (no local `pageindex.db`) skip
PageIndex cleanup entirely so we don't pull in an LLM key check
during remove.

Adds 7 regression tests covering: image dir deletion, dry-run image
preservation, doc_id persistence during ingest, PageIndex delete
with stored doc_id, fallback lookup, ambiguous-match refusal, and
the no-PageIndex-state no-op path.
@KylinMountain
Collaborator Author

Code review

Logical-consistency pass — does the wiki end in a sound state for subsequent add after a remove? Found 1 issue.

  1. Failed PageIndex cleanup leaves a stale row that silently re-binds on re-add. The executor removes the registry entry before attempting PageIndex cleanup. If the PageIndex call raises (LLM-key check inside PageIndexClient.__init__ fails, network blip in cloud-LiteLLM lookup, transient SQLite lock, etc.), we WARN and continue — but the registry entry that held doc_id is already gone.

    On the next openkb add <same file>:

    • convert_document sees the hash is unknown (we cleared it) and proceeds.
    • The long-doc path calls Collection.add(pdf_path). PageIndex's local backend dedupes by SHA-256 (find_document_by_hash) and returns the stale doc_id without re-parsing.
    • col.get_document(stale_doc_id, include_text=True) then feeds the OLD parsed structure to our compiler.
    • A fresh registry entry gets written pointing at the stale doc_id; the wiki is rebuilt on top of stale parse content.

    Net effect: a user who removes a document because the parse was bad and re-adds it gets the same bad parse back, with no signal anything went wrong. The inline rationale comment at the top of the PageIndex block says we deliberately put cleanup last "to leave only PageIndex bloat, not a half-removed wiki" — but the cost is worse than disk bloat: silent state divergence the user cannot recover from without manually inspecting .openkb/pageindex.db.

    Registry removed first:

    OpenKB/openkb/cli.py, lines 694 to 718 in 5e76130:

    ```python
    remove_doc_from_index(wiki_dir, doc_name, concept_result["deleted"])
    registry.remove_by_doc_name(doc_name)
    if raw_path is not None:
        raw_path.unlink(missing_ok=True)
    # Free PageIndex's local managed state for long PDFs. We do this last
    # because the wiki side is now clean and we want a partial failure
    # here to leave only PageIndex bloat, not a half-removed wiki.
    if cleanup_pageindex:
        try:
            cleaned, msg = _cleanup_pageindex(
                openkb_dir, kb_dir, doc_name, pageindex_doc_id,
            )
            click.echo(f"  PageIndex: {msg}")
        except Exception as exc:
            click.echo(
                f"  [WARN] PageIndex cleanup failed: {exc} "
                f"— .openkb/pageindex.db row and .openkb/files/ may still hold this doc"
            )
            logging.getLogger(__name__).debug(
                "PageIndex cleanup traceback:", exc_info=True,
            )
    ```

    Fix: reorder so _cleanup_pageindex runs before registry.remove_by_doc_name. On PageIndex failure the registry still holds doc_id, so openkb remove is retryable (PageIndex delete is idempotent — find_document_by_hash → row, delete_document unlinks file with missing_ok=True and if doc_dir.exists() guards). The reverse failure mode — PageIndex cleaned but registry remove fails — is also idempotent on retry.

Out of scope (didn't meet the 80-confidence threshold but worth tracking as follow-ups):

  • --keep-empty-concepts retains stale concept body; re-adding the same doc enters the LLM "update" path and blends new content into prose still containing the old doc's contribution — likely to produce duplicate phrasing. (75)
  • Multi-source concept pages may retain markdown image refs ![alt](sources/images/<doc_name>/...) after images_dir is deleted; fix_broken_links only handles [[wikilinks]]. Real if the LLM echoes image refs into concept bodies — depends on actual concept content. (50)
  • PageIndexClient.__init__ validates LLM provider even for delete-only ops; users with rotated keys can't run the cleanup path and get a misleading "PageIndex cleanup failed" WARN. (75)


The previous executor order removed the registry entry before
attempting PageIndex cleanup. If PageIndex's local-mode delete failed
(LLM-key check inside PageIndexClient.__init__, network blip in
LiteLLM provider lookup, transient SQLite lock), the WARN told the
user state was partial — but with the registry entry already gone
there was no retry path, and the next `openkb add <same file>` would
hit PageIndex's SHA-256 dedup, silently re-bind to the stale doc_id,
and feed the OLD parsed structure to the compiler. A user removing a
document specifically because the parse was bad would get the same
bad parse back without warning.

Reorder so the executor reaches the registry write only after
PageIndex cleanup succeeds:

  1. unlink summary / sources / images / concepts / index entries
  2. fix_broken_links (so a retry sees a clean wiki)
  3. _cleanup_pageindex — on failure: WARN with retry hint, return
  4. registry.remove_by_doc_name  ← commit point
  5. raw_path.unlink, append_log

Every step before the commit point is already idempotent
(`unlink(missing_ok=True)`, `shutil.rmtree(ignore_errors=True)`,
concept/index helpers that cheap-filter already-clean pages, and
PageIndex's own delete using missing_ok internally), so re-running
`openkb remove` after a WARN finishes the job once the underlying
issue is resolved.
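The commit-point ordering reduces to a small control-flow skeleton; the function arguments here are placeholders for the real helpers:

```python
def execute_remove(idempotent_steps, cleanup_pageindex, registry_remove, warn):
    for step in idempotent_steps:   # steps 1-2: safe to re-run on retry
        step()
    try:
        cleanup_pageindex()         # step 3: may fail; registry still intact
    except Exception as exc:
        warn(f"PageIndex cleanup failed, rerun `openkb remove` to retry: {exc}")
        return False
    registry_remove()               # step 4: the commit point
    return True
```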

Adds 2 regression tests: PageIndex failure preserves registry +
doc_id for retry, and a second invocation with a working PageIndex
completes the removal cleanly.
@KylinMountain
Collaborator Author

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

The latest commit's executor reordering (PageIndex cleanup before the registry commit point) is sound: each pre-commit step is idempotent on retry (unlink(missing_ok=True), shutil.rmtree(ignore_errors=True), concept/index helpers that cheap-filter already-clean pages, fix_broken_links no-ops when nothing changes, and PageIndex's own delete_document handles missing file/dir gracefully), so the new two-test retry contract holds end-to-end.


KylinMountain merged commit a1867c6 into main May 16, 2026
1 check passed