
feat(cli): add openkb remove to safely delete a document (closes #41) #51

Merged
KylinMountain merged 5 commits into main from feat/remove-document on May 16, 2026

Conversation

@KylinMountain
Collaborator

Summary

  • Adds openkb remove <identifier> for safely deleting a document from a knowledge base in one step, with a plan/confirm flow.
  • Identifier resolves by exact filename → exact doc_name slug → fuzzy substring; ambiguous matches refuse to act.
  • Concept pages whose only source was the removed doc are deleted by default; --keep-empty-concepts retains them with empty sources (useful when replacing the doc with a newer version). --keep-raw preserves the original file in raw/. --dry-run prints the plan only, --yes skips the confirm prompt.
  • Auto-runs lint --fix after removal so stray wikilinks pointing at the removed summary or deleted concept pages get stripped automatically.

Closes #41.
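The resolution order described above can be sketched as follows; `resolve_identifier` and the `docs` mapping are illustrative stand-ins, not the actual helpers in `cli.py`:

```python
def resolve_identifier(identifier: str, docs: dict[str, str]) -> str:
    """Sketch of the lookup order: exact filename -> exact doc_name slug ->
    fuzzy substring, refusing to act on ambiguous matches.

    docs maps doc_name slug -> original filename.
    """
    # 1. exact filename match
    by_filename = [slug for slug, name in docs.items() if name == identifier]
    if len(by_filename) == 1:
        return by_filename[0]
    # 2. exact doc_name slug match
    if identifier in docs:
        return identifier
    # 3. fuzzy substring over slugs and filenames
    fuzzy = [slug for slug, name in docs.items()
             if identifier in slug or identifier in name]
    if len(fuzzy) > 1:
        raise ValueError(f"ambiguous identifier {identifier!r}: {sorted(fuzzy)}")
    if not fuzzy:
        raise KeyError(f"no document matches {identifier!r}")
    return fuzzy[0]
```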

Test plan

  • pytest — full suite 287 passed (260 prior + 27 new in tests/test_remove.py).
  • End-to-end smoke test on a 2-doc / 3-concept KB:
    • single-source concept is deleted; multi-source concept is edited; unrelated concept untouched
    • index.md Documents + deleted-Concepts entries pruned, surviving entries kept
    • --keep-raw and --keep-empty-concepts flag paths verified
    • --dry-run prints plan without touching disk
    • dangling [[summaries/...]] link in a sibling page is stripped by the auto lint --fix pass
  • Ambiguous and unknown identifiers fail safe (no files modified).

Removing a document used to require manual cleanup across summaries,
sources, concepts, the index, and the hash registry. `openkb remove`
does it in one step with a plan/confirm flow:

- Resolves identifier by exact filename, exact doc_name slug, or fuzzy
  substring; refuses to act on ambiguous matches.
- Prints a DELETE/MODIFY plan and confirms before touching disk;
  `--dry-run` prints only, `--yes` skips the prompt.
- Concept pages whose only source was the removed doc are deleted by
  default; `--keep-empty-concepts` keeps them with empty sources for
  the replace-with-newer-version workflow.
- `--keep-raw` preserves the original file in raw/.
- Runs `lint --fix` afterwards so any stray wikilinks pointing at the
  removed summary or deleted concept pages get stripped automatically.
@KylinMountain
Collaborator Author

Code review

Found 1 issue:

  1. HashRegistry.remove_by_doc_name silently no-ops on real knowledge bases. add_single_file only persists {"name", "type"} to the registry — doc_name is never stored. The new remove_by_doc_name then iterates entries matching meta.get("doc_name") == doc_name, which is always None for KBs built via openkb add, so the hash entry never gets pruned. The plan still prints REGISTRY remove hash entry (h_xxx…), but execution silently fails, and re-adding the same file later is wrongly skipped as "already known". Tests pass only because _seed_two_doc_kb manually injects doc_name into seed metadata.

    add_single_file registry write (missing doc_name):

    OpenKB/openkb/cli.py, lines 209 to 213 in 254f5a2:

    ```python
    # Register hash only after successful compilation
    if result.file_hash:
        doc_type = "long_pdf" if result.is_long_doc else file_path.suffix.lstrip(".")
        registry.add(result.file_hash, {"name": file_path.name, "type": doc_type})
    ```

    remove_by_doc_name lookup that never matches:

    OpenKB/openkb/state.py, lines 43 to 52 in 254f5a2:

    ```python
    def remove_by_doc_name(self, doc_name: str) -> bool:
        """Remove the entry whose metadata['doc_name'] matches. Returns True if removed."""
        for file_hash, meta in list(self._data.items()):
            if meta.get("doc_name") == doc_name:
                del self._data[file_hash]
                self._persist()
                return True
        return False
    ```

    Fix: persist doc_name (and ideally path) in add_single_file (line 212), e.g. registry.add(result.file_hash, {"name": file_path.name, "doc_name": doc_name, "type": doc_type, "path": str(result.raw_path.relative_to(kb_dir))}) — and add a regression test that seeds the registry via openkb add rather than direct injection.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

1. Persist `doc_name` in the hash registry so `remove_by_doc_name`
   actually prunes the entry on real KBs. Previously `add_single_file`
   wrote only `{"name", "type"}`, so `remove_by_doc_name`'s lookup
   silently failed and the registry never shrank — re-adding the same
   file would then be wrongly skipped as "already known".

2. Tighten `_remove_section_entry` to strict `- {link}` prefix matching,
   dropping the substring fallback. The fallback could wrongly delete
   sibling bullets whose brief text mentions the removed wikilink (the
   same class of bug that earlier commits 88a7a74 and 3995bc1 fixed for
   the insert/contains helpers).

3. Make the CLI plan-builder classify concept pages by frontmatter
   `sources:` membership only, so the announced plan reflects what the
   executor will actually do. Body-only references (e.g. a stray
   `See also:` line a user added by hand) used to be reported as
   DELETE but the executor only ever MODIFIED them.

4. Replace the over-greedy `^\s*See also:\s*\[\[…\]\]\s*\n?` regex
   with a bounded two-pass match. The previous pattern's `\s` matches
   `\n`, so removal collapsed paragraph spacing in surrounding
   content. The new pair handles the dominant paragraph-block form
   (preserving one trailing newline so spacing survives) plus a
   line-anchored fallback for hand-edited inline references.
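The bounded two-pass match in item 4 can be sketched as follows; these regexes are illustrative assumptions, not the shipped patterns:

```python
import re

# Pass 1: the dominant paragraph-block form. The replacement keeps a single
# newline so spacing between neighbouring paragraphs survives.
PARA = re.compile(r"\n\nSee also: \[\[[^\]]+\]\]\n(?=\n|$)")
# Pass 2: line-anchored fallback for hand-edited inline references. Note the
# character class is [ \t], not \s, so it can never eat a newline.
LINE = re.compile(r"(?m)^[ \t]*See also: \[\[[^\]]+\]\][ \t]*\n?")

def strip_see_also(text: str) -> str:
    return LINE.sub("", PARA.sub("\n", text))
```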

Adds 6 regression tests covering each fix: end-to-end add → registry
contract → remove, strict prefix matching, body-only reference plan
exclusion, and See also spacing in both mid-file and trailing forms.
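The strict prefix rule in item 2 boils down to a startswith check; `keep_line` is a hypothetical stand-in for the matching inside `_remove_section_entry`:

```python
def keep_line(line: str, link: str) -> bool:
    # Strict: delete only bullets that *start* with "- {link}", never
    # siblings whose brief text merely mentions the link.
    return not line.lstrip().startswith(f"- {link}")

bullets = [
    "- [[concepts/alpha]]: the page being removed",
    "- [[concepts/beta]]: contrast with [[concepts/alpha]]",
]
kept = [b for b in bullets if keep_line(b, "[[concepts/alpha]]")]
# The old substring fallback would have deleted both bullets.
```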
@KylinMountain
Collaborator Author

Code review

Functional completeness pass — does openkb remove clean up everything openkb add produces? Found 2 real gaps.

  1. Per-doc image directory is orphaned. openkb/images.py writes every image into wiki/sources/images/<doc_name>/ (the same doc_name stem used for the summary and source files), but the remove plan/executor never references that directory. For image-heavy PDFs / docx / pptx that's tens to hundreds of MB silently left behind per removed doc, with no --keep-images flag to even hint it was intentional.

    Plan only enumerates summary + sources + concepts + index + raw + registry:

    OpenKB/openkb/cli.py, lines 519 to 535 in c504e26:

    ```python
    # ----- Build the plan (no side effects) -----
    actions: list[tuple[str, str]] = []
    summary_path = wiki_dir / "summaries" / f"{doc_name}.md"
    if summary_path.exists():
        actions.append(("DELETE", str(summary_path.relative_to(kb_dir))))
    source_md = wiki_dir / "sources" / f"{doc_name}.md"
    source_json = wiki_dir / "sources" / f"{doc_name}.json"
    if source_md.exists():
        actions.append(("DELETE", str(source_md.relative_to(kb_dir))))
    if source_json.exists():
        actions.append(("DELETE", str(source_json.relative_to(kb_dir))))
    # Scan concept pages to predict which will be edited vs. deleted.
    # Only frontmatter ``sources:`` membership drives the plan — body-only
    ```

    Images directory written here, never matched on remove:

    OpenKB/openkb/images.py, lines 65 to 75 in c504e26:

    ```python
                pix = None
            except Exception:
                logger.warning("Failed to save image block on page %d", page_num)
                continue
            rel_path = f"sources/images/{doc_name}/(unknown)"
            page_images.setdefault(page_num, []).append(rel_path)
        return page_images


    def convert_pdf_to_pages(pdf_path: Path, doc_name: str, images_dir: Path) -> list[dict]:
    ```

    Fix is small: compute images_dir = wiki_dir / "sources" / "images" / doc_name, add a DELETE action when it exists, and shutil.rmtree(images_dir, ignore_errors=True) in the executor block around line 612.

  2. PageIndex state is orphaned for long PDFs. The long-doc ingest path calls Collection.add(...) and receives a doc_id, but the registry only persists {name, doc_name, type}; doc_id is dropped on the floor. openkb remove never imports pageindex and never calls Collection.delete_document(doc_id), so for every removed long PDF this stays on disk: the SQLite row in .openkb/pageindex.db, the managed copy at .openkb/files/<collection>/<doc_id>.pdf, and any extracted images at .openkb/files/<collection>/<doc_id>/images/. For a 200MB PDF the entire 200MB managed copy leaks per add → remove cycle, and the document still shows up to anything that calls col.list_documents().

    doc_id acquired here:

    OpenKB/openkb/indexer.py, lines 54 to 58 in c504e26:

    ```python
    for attempt in range(1, max_retries + 1):
        try:
            doc_id = col.add(str(pdf_path))
            logger.info("PageIndex added %s → doc_id=%s (attempt %d)", pdf_path.name, doc_id, attempt)
            break
    ```

    Registry write drops doc_id — no handle survives to feed delete_document:

    OpenKB/openkb/cli.py, lines 209 to 216 in c504e26:

    ```python
    # Register hash only after successful compilation
    if result.file_hash:
        doc_type = "long_pdf" if result.is_long_doc else file_path.suffix.lstrip(".")
        registry.add(result.file_hash, {
            "name": file_path.name,
            "doc_name": doc_name,
            "type": doc_type,
        })
    ```

    Fix is in two parts: (a) persist doc_id in the registry alongside doc_name for long-doc entries during ingest, (b) in the remove flow, if meta.get("type") == "long_pdf" and a doc_id is present, call pageindex.Collection(storage_path=str(openkb_dir)).delete_document(doc_id). Existing KBs without doc_id will need a fallback (look up by basename via col.list_documents()) or a one-time migration.

Out of scope but worth noting (NOT flagged as a bug): the body prose of multi-source concept pages is intentionally NOT regenerated on remove, and the `brief:` frontmatter stays as last synthesized. That's a reasonable design choice to avoid an LLM call per remove, but the remove command's help text doesn't say so — worth a one-line note in the docstring at cli.py:470-482.


Functional completeness gaps surfaced in the second code review:

1. `openkb remove` left `wiki/sources/images/<doc_name>/` orphaned.
   For image-heavy PDFs/docx/pptx this leaked tens to hundreds of MB
   per add → remove cycle, silently. Plan now lists the images
   directory under DELETE actions and the executor shutil.rmtrees it.

2. `openkb remove` left PageIndex's local state orphaned for long
   PDFs: the SQLite row in `.openkb/pageindex.db`, the managed PDF
   copy at `.openkb/files/<collection>/<doc_id>.pdf`, and the
   extracted-images directory. The blocking dependency was that
   `add_single_file` never persisted PageIndex's `doc_id`, so even
   if remove wanted to call `Collection.delete_document(doc_id)` it
   had no handle.

   Fix is in two parts:
   - Ingest now stores `doc_id` on the long-doc registry entry.
   - Remove instantiates `PageIndexClient(storage_path=.openkb)` and
     calls `delete_document(doc_id)` when the registry entry says
     `type == "long_pdf"` and `.openkb/pageindex.db` exists. Legacy
     entries (registered before this fix, no `doc_id`) fall back to
     matching by `doc_name` via `list_documents()`; ambiguous
     multi-matches are skipped with a WARN rather than guessed.

PageIndex cleanup runs after the wiki-side mutations so that a
partial failure there leaves only PageIndex bloat, not a
half-removed wiki. Cloud-mode KBs (no local `pageindex.db`) skip
PageIndex cleanup entirely so we don't pull in an LLM key check
during remove.

Adds 7 regression tests covering: image dir deletion, dry-run image
preservation, doc_id persistence during ingest, PageIndex delete
with stored doc_id, fallback lookup, ambiguous-match refusal, and
the no-PageIndex-state no-op path.
@KylinMountain
Collaborator Author

Code review

Logical-consistency pass — does the wiki end in a sound state for subsequent add after a remove? Found 1 issue.

  1. Failed PageIndex cleanup leaves a stale row that silently re-binds on re-add. The executor removes the registry entry before attempting PageIndex cleanup. If the PageIndex call raises (LLM-key check inside PageIndexClient.__init__ fails, network blip in cloud-LiteLLM lookup, transient SQLite lock, etc.), we WARN and continue — but the registry entry that held doc_id is already gone.

    On the next openkb add <same file>:

    • convert_document sees the hash is unknown (we cleared it) and proceeds.
    • The long-doc path calls Collection.add(pdf_path). PageIndex's local backend dedupes by SHA-256 (find_document_by_hash) and returns the stale doc_id without re-parsing.
    • col.get_document(stale_doc_id, include_text=True) then feeds the OLD parsed structure to our compiler.
    • A fresh registry entry gets written pointing at the stale doc_id; the wiki is rebuilt on top of stale parse content.

    Net effect: a user who removes a document because the parse was bad and re-adds it gets the same bad parse back, with no signal anything went wrong. The inline rationale comment at the top of the PageIndex block says we deliberately put cleanup last "to leave only PageIndex bloat, not a half-removed wiki" — but the cost is worse than disk bloat: silent state divergence the user cannot recover from without manually inspecting .openkb/pageindex.db.

    Registry removed first:

    OpenKB/openkb/cli.py, lines 694 to 718 in 5e76130:

    ```python
    remove_doc_from_index(wiki_dir, doc_name, concept_result["deleted"])
    registry.remove_by_doc_name(doc_name)
    if raw_path is not None:
        raw_path.unlink(missing_ok=True)
    # Free PageIndex's local managed state for long PDFs. We do this last
    # because the wiki side is now clean and we want a partial failure
    # here to leave only PageIndex bloat, not a half-removed wiki.
    if cleanup_pageindex:
        try:
            cleaned, msg = _cleanup_pageindex(
                openkb_dir, kb_dir, doc_name, pageindex_doc_id,
            )
            click.echo(f"  PageIndex: {msg}")
        except Exception as exc:
            click.echo(
                f"  [WARN] PageIndex cleanup failed: {exc} "
                f"— .openkb/pageindex.db row and .openkb/files/ may still hold this doc"
            )
            logging.getLogger(__name__).debug(
                "PageIndex cleanup traceback:", exc_info=True,
            )
    ```

    Fix: reorder so _cleanup_pageindex runs before registry.remove_by_doc_name. On PageIndex failure the registry still holds doc_id, so openkb remove is retryable (PageIndex delete is idempotent — find_document_by_hash → row, delete_document unlinks file with missing_ok=True and if doc_dir.exists() guards). The reverse failure mode — PageIndex cleaned but registry remove fails — is also idempotent on retry.

Out of scope (didn't meet the 80-confidence threshold but worth tracking as follow-ups):

  • --keep-empty-concepts retains stale concept body; re-adding the same doc enters the LLM "update" path and blends new content into prose still containing the old doc's contribution — likely to produce duplicate phrasing. (75)
  • Multi-source concept pages may retain markdown image refs ![alt](sources/images/<doc_name>/...) after images_dir is deleted; fix_broken_links only handles [[wikilinks]]. Real if the LLM echoes image refs into concept bodies — depends on actual concept content. (50)
  • PageIndexClient.__init__ validates LLM provider even for delete-only ops; users with rotated keys can't run the cleanup path and get a misleading "PageIndex cleanup failed" WARN. (75)


The previous executor order removed the registry entry before
attempting PageIndex cleanup. If PageIndex's local-mode delete failed
(LLM-key check inside PageIndexClient.__init__, network blip in
LiteLLM provider lookup, transient SQLite lock), the WARN told the
user state was partial — but with the registry entry already gone
there was no retry path, and the next `openkb add <same file>` would
hit PageIndex's SHA-256 dedup, silently re-bind to the stale doc_id,
and feed the OLD parsed structure to the compiler. A user removing a
document specifically because the parse was bad would get the same
bad parse back without warning.

Reorder so the executor reaches the registry write only after
PageIndex cleanup succeeds:

  1. unlink summary / sources / images / concepts / index entries
  2. fix_broken_links (so a retry sees a clean wiki)
  3. _cleanup_pageindex — on failure: WARN with retry hint, return
  4. registry.remove_by_doc_name  ← commit point
  5. raw_path.unlink, append_log

Every step before the commit point is already idempotent
(`unlink(missing_ok=True)`, `shutil.rmtree(ignore_errors=True)`,
concept/index helpers that cheap-filter already-clean pages, and
PageIndex's own delete using missing_ok internally), so re-running
`openkb remove` after a WARN finishes the job once the underlying
issue is resolved.
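The commit-point ordering reduces to a small control-flow skeleton; the function arguments here are placeholders for the real helpers:

```python
def execute_remove(idempotent_steps, cleanup_pageindex, registry_remove, warn):
    for step in idempotent_steps:   # steps 1-2: safe to re-run on retry
        step()
    try:
        cleanup_pageindex()         # step 3: may fail; registry still intact
    except Exception as exc:
        warn(f"PageIndex cleanup failed, rerun `openkb remove` to retry: {exc}")
        return False
    registry_remove()               # step 4: the commit point
    return True
```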

Adds 2 regression tests: PageIndex failure preserves registry +
doc_id for retry, and a second invocation with a working PageIndex
completes the removal cleanly.
@KylinMountain
Collaborator Author

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

The latest commit's executor reordering (PageIndex cleanup before the registry commit point) is sound: each pre-commit step is idempotent on retry (unlink(missing_ok=True), shutil.rmtree(ignore_errors=True), concept/index helpers that cheap-filter already-clean pages, fix_broken_links no-ops when nothing changes, and PageIndex's own delete_document handles missing file/dir gracefully), so the new two-test retry contract holds end-to-end.


KylinMountain merged commit a1867c6 into main May 16, 2026
1 check passed