A single openkb remove <doc> run surfaces two independent bugs at once. Reporting them together because
they share a single repro, but they have different root causes and need separate fixes.
Follow-up to #41 — both are regressions in the implementation shipped by PR #51.
Observed: removal "succeeds" but git status / hashes.json show the symptoms below.
Bug 1 — hash entry is not removed for docs ingested before PR #51
cat .openkb/hashes.json still contains the removed doc's entry after openkb remove reports success.
Re-running openkb add <same-file> is then incorrectly treated as a duplicate via the SHA dedup.
Root cause
Commit c504e26 (within this same PR) fixed add_single_file so newly-ingested docs persist doc_name
into the registry. However, entries that already existed in hashes.json before that commit were not
backfilled — they still carry only {name, type}.
HashRegistry.remove_by_doc_name (openkb/state.py:44-51) matches with meta.get("doc_name") == doc_name. For un-backfilled legacy entries the comparison evaluates to None == "<slug>" → always False. The method silently returns False; nothing in the call chain checks the return value.
Meanwhile cli.py:670 (doc_name = meta.get("doc_name") or Path(name).stem) does fall back to the
filename stem to drive every other step, so summary/source/concept/index removal succeeds and the failure
is invisible at the surface.
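The mismatch can be reproduced in isolation. The following is a minimal sketch, not the actual openkb code: ToyRegistry, the "abc123" hash and the "md" type value are hypothetical stand-ins, and only the two comparisons quoted above are taken from the issue.

```python
from pathlib import Path


class ToyRegistry:
    """Hypothetical stand-in for HashRegistry: file_hash -> metadata dict."""

    def __init__(self, entries):
        self.entries = dict(entries)

    def remove_by_doc_name(self, doc_name):
        # Mirrors the quoted comparison: only the "doc_name" key is consulted.
        for file_hash, meta in list(self.entries.items()):
            if meta.get("doc_name") == doc_name:
                del self.entries[file_hash]
                return True
        return False


# Legacy entry: only {name, type}, no doc_name key.
legacy = ToyRegistry({"abc123": {"name": "ollama.md", "type": "md"}})
meta = legacy.entries["abc123"]

# Same fallback expression as the one quoted from cli.py:670.
doc_name = meta.get("doc_name") or Path(meta["name"]).stem

# The CLI-side slug resolves fine, but the registry lookup compares None == "ollama".
removed = legacy.remove_by_doc_name(doc_name)
print(doc_name, removed, "abc123" in legacy.entries)  # prints: ollama False True
```

The slug is resolved correctly on the caller side, yet the registry never sees a match, which is consistent with the entry surviving a "successful" removal.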
Suggested fix
Either of the following — both are robust against un-backfilled legacy data:
Add a fallback in remove_by_doc_name that also matches when Path(meta["name"]).stem == doc_name, OR
Introduce remove_by_hash(file_hash) and call it from cli.py:842 since the CLI already has the
matched hash in hand. Preferred — eliminates the slug round-trip and works regardless of doc_name
presence.
A one-shot migration that backfills doc_name on the next openkb invocation would also clean this up,
but the call-site fix above is sufficient and avoids touching user data on read paths.
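Both options can be sketched against a simplified in-memory model. SketchRegistry and the sample hashes are assumptions made for illustration; the real HashRegistry may store and persist its entries differently, and remove_by_hash is the method proposed above, not an existing one.

```python
from pathlib import Path


class SketchRegistry:
    """Hypothetical in-memory model: file_hash -> metadata dict."""

    def __init__(self, entries):
        self.entries = dict(entries)

    def remove_by_hash(self, file_hash):
        # Preferred option: delete by key, no dependence on doc_name at all.
        return self.entries.pop(file_hash, None) is not None

    def remove_by_doc_name(self, doc_name):
        # Fallback option: also match legacy entries via the filename stem.
        for file_hash, meta in list(self.entries.items()):
            if (meta.get("doc_name") == doc_name
                    or Path(meta["name"]).stem == doc_name):
                del self.entries[file_hash]
                return True
        return False


sample = {
    "h1": {"name": "ollama.md", "type": "md"},                      # legacy
    "h2": {"name": "notes.md", "type": "md", "doc_name": "notes"},  # backfilled
}

by_hash = SketchRegistry(sample)
hash_removed = by_hash.remove_by_hash("h1")

by_name = SketchRegistry(sample)
name_removed = by_name.remove_by_doc_name("ollama")
```

Whichever option is chosen, having the caller check the boolean return value would keep a failed registry removal from passing silently.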
Bug 2 — unrelated wiki pages get reformatted
Removing a single doc produces a sprawling diff. In my repro, removing one ollama.md produced a 39-file / 1254-line diff; 27 of those were concept pages that didn't list ollama as a source.
Example (from a concept page unrelated to ollama):
- **Knowledge access**: agents need curated context such as [[LLM Wiki]]
+ **Knowledge access**: agents need curated context such as LLM Wiki
Root cause
cli.py:815 calls fix_broken_links(wiki_dir) over the entire wiki on every remove. openkb/lint.py:fix_broken_links strips every dangling wikilink in the KB, not only the ones created by
this removal. Pre-existing ghost links (LLM-generated, hand-edited, links to not-yet-added concepts, etc.)
get swept up too.
Impact
Removal commits are unreadable — actual deletion effects are buried under unrelated reformat noise.
Users lose [[wikilinks]] they may want to keep (e.g. links to a concept they plan to add later).
Violates least-surprise: the command name says "remove one doc," but the diff shows wiki-wide
refactoring.
Suggested fix (preferred)
Limit ghost-link stripping to files actually touched by this removal: concept_result["modified"] ∪ {index.md}. Preserves the original PR #49 intent (clean up dangling links the removal just created)
without sweeping the rest of the KB.
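A scoped pass along these lines might look like the sketch below. strip_ghost_links, its parameters, the wikilink regex and the assumption that a link target maps to a page's filename stem are all hypothetical; the real fix_broken_links in openkb/lint.py may resolve links differently.

```python
import re
import tempfile
from pathlib import Path

# Matches [[Target]] and [[Target|alias]].
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def strip_ghost_links(files, known_pages):
    """Unlink [[targets]] missing from known_pages, but only inside `files`."""
    changed = []
    for path in files:
        text = path.read_text(encoding="utf-8")

        def unlink(match):
            target, alias = match.group(1).strip(), match.group(2)
            if target in known_pages:
                return match.group(0)   # link still resolves, keep it
            return alias or target      # ghost link, keep the visible text only

        new_text = WIKILINK.sub(unlink, text)
        if new_text != text:
            path.write_text(new_text, encoding="utf-8")
            changed.append(path)
    return changed


with tempfile.TemporaryDirectory() as tmp:
    wiki = Path(tmp)
    touched = wiki / "local-models.md"   # modified by the removal
    untouched = wiki / "agents.md"       # unrelated page
    touched.write_text("Runs via [[ollama]].", encoding="utf-8")
    untouched.write_text("Context such as [[LLM Wiki]].", encoding="utf-8")

    # Only the touched file is passed in; the rest of the wiki is never read.
    changed = strip_ghost_links([touched], {"local-models", "agents"})
    touched_after = touched.read_text(encoding="utf-8")
    untouched_after = untouched.read_text(encoding="utf-8")
```

Here the pre-existing [[LLM Wiki]] ghost in the unrelated page survives, while the link the removal just orphaned is cleaned up.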
Alternatives
Snapshot the global ghost set before/after the removal and strip only the newly-introduced ghosts.
Make the global pass opt-in via a flag (e.g. --lint), default off.
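The snapshot alternative reduces to a set difference. In this sketch, ghost_targets is a hypothetical helper, and treating a link as dangling when no .md file has a matching stem is an assumption about how the wiki resolves links.

```python
import re
import tempfile
from pathlib import Path

# Matches [[Target]] and [[Target|alias]]; only the target is captured.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]+)?\]\]")


def ghost_targets(wiki_dir):
    """Return every wikilink target that has no matching page in wiki_dir."""
    pages = {p.stem for p in wiki_dir.rglob("*.md")}
    ghosts = set()
    for page in wiki_dir.rglob("*.md"):
        for match in WIKILINK.finditer(page.read_text(encoding="utf-8")):
            target = match.group(1).strip()
            if target not in pages:
                ghosts.add(target)
    return ghosts


with tempfile.TemporaryDirectory() as tmp:
    wiki = Path(tmp)
    (wiki / "ollama.md").write_text("Summary page.", encoding="utf-8")
    (wiki / "agents.md").write_text(
        "See [[ollama]] and [[LLM Wiki]].", encoding="utf-8"
    )

    before = ghost_targets(wiki)     # ghosts that existed prior to the removal
    (wiki / "ollama.md").unlink()    # stand-in for the actual removal steps
    after = ghost_targets(wiki)

    # Only targets that became dangling because of this removal get stripped.
    new_ghosts = after - before
```

In this toy wiki, new_ghosts contains only "ollama"; "LLM Wiki" was already dangling before the removal and is left alone.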
Repro
Start with a KB that contains a doc ingested before PR #51 (its hashes.json entry has only {name, type}, no doc_name key) and a handful of LLM-generated concept pages with pre-existing dangling wikilinks.
Run openkb remove <that-doc> (e.g. openkb remove ollama).
Check git status and .openkb/hashes.json for the symptoms described under Bug 1 and Bug 2.