Skip to content

feat: compile pipeline, query agent, and multimodal improvements#10

Merged
rejojer merged 12 commits intodevfrom
bugfix/compile-clean
Apr 10, 2026
Merged

feat: compile pipeline, query agent, and multimodal improvements#10
rejojer merged 12 commits intodevfrom
bugfix/compile-clean

Conversation

@rejojer
Copy link
Copy Markdown
Member

@rejojer rejojer commented Apr 10, 2026

Summary

  • Concept dedup & compile refactor: concept plan/update/related paths, bidirectional backlinks, shared _compile_concepts
  • Brief system & unified query: per-page JSON sources via pymupdf, doc_type + full_text frontmatter, get_page_content tool
  • Config & environment: default model gpt-5.4-mini, API key warning, warning suppression, test isolation
  • CLI polish: improved init prompts with explicit defaults, American English, output formatting
  • Multimodal query: get_image tool with ToolOutputImage for viewing figures/charts in source documents
  • Image path unification: all image paths use sources/images/ prefix, pymupdf replaces PageIndex for page content extraction

Test plan

  • All 184 tests pass
  • openkb init shows correct prompts with defaults
  • openkb add short doc: clean frontmatter, images at sources/images/
  • openkb add long doc: per-page JSON from pymupdf, correct image paths
  • openkb query on short doc: reads source via read_file
  • openkb query on long doc: uses get_page_content with targeted pages
  • openkb query about figures: calls get_image tool

KylinMountain and others added 6 commits April 10, 2026 08:04
…klinks

- Add concept dedup with briefs and _read_concept_briefs context
- Add concepts plan and update prompt templates with create/update/related paths
- Extract shared _compile_concepts from compile_short_doc and compile_long_doc
- Add bidirectional backlinks between summaries and concepts
- Code review fixes: security, robustness, tests, and CI hardening

Co-authored-by: Ray <mailtangyu@gmail.com>
- Add get_page_content tool and parse_pages helper for page-level access
- Store long doc sources as per-page JSON extracted by pymupdf
- Unify summary frontmatter to doc_type + full_text fields
- Update schema and tree renderer for new frontmatter format
- All image paths use sources/images/ prefix relative to wiki root

Co-authored-by: Ray <mailtangyu@gmail.com>
- Change default model to gpt-5.4-mini
- Warn when no LLM API key found instead of failing silently
- Fix CI publish workflow and test isolation

Co-authored-by: Ray <mailtangyu@gmail.com>
- Move warning suppression after imports to avoid markitdown override
- Improve init prompts with explicit defaults
- Use American English throughout (initialized, normalized, Synthesize)
- Replace unicode ellipsis with ASCII
- Remove empty explorations/reports dirs from init
- Fix test isolation for _find_kb_dir
- Add get_image tool for viewing images referenced in source documents
- Use ToolOutputImage for proper image content in LLM context
- Update prompt: use full_text field, restrict get_page_content to pageindex
- Add self-talk before tool calls, enforce concise answers
- Prevent duplicate frontmatter in LLM-generated content via schema update
- Add convert_pdf_to_pages for per-page content+image extraction
- All image paths use sources/images/ prefix relative to wiki root
- Remove page marker comments from short doc source markdown
@rejojer rejojer force-pushed the bugfix/compile-clean branch from 9f652d0 to 44bf83e Compare April 10, 2026 00:09
@rejojer
Copy link
Copy Markdown
Member Author

rejojer commented Apr 10, 2026

Code review

Found 1 issue:

  1. _write_concept appends LLM-rewritten content instead of replacing the body during concept updates, causing content duplication. The _CONCEPT_UPDATE_USER prompt (line 106) instructs the LLM to "Rewrite the full page incorporating the new information naturally -- do not just append" and return "The rewritten full concept page." However, _write_concept appends the complete rewrite to the existing body (existing += f"\n\n{clean}" at line 341) rather than replacing the body with it. Every concept update with a new source document will duplicate all existing content.

if is_update and path.exists():
existing = path.read_text(encoding="utf-8")
if source_file not in existing:
if existing.startswith("---"):
end = existing.find("---", 3)
if end != -1:
fm = existing[:end + 3]
body = existing[end + 3:]
if "sources:" in fm:
fm = fm.replace("sources: [", f"sources: [{source_file}, ")
else:
fm = fm.replace("---\n", f"---\nsources: [{source_file}]\n", 1)
existing = fm + body
else:
existing = f"---\nsources: [{source_file}]\n---\n\n" + existing
# Strip frontmatter from LLM content to avoid duplicate blocks
clean = content
if clean.startswith("---"):
end = clean.find("---", 3)
if end != -1:
clean = clean[end + 3:].lstrip("\n")
existing += f"\n\n{clean}"
if brief and existing.startswith("---"):

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

rejojer added 5 commits April 10, 2026 08:33
The _CONCEPT_UPDATE_USER prompt asks the LLM for a full rewrite, but
_write_concept was appending the rewrite to the existing body, causing
content duplication on every concept update.
Replace hand-rolled fence stripping with json_repair to handle
malformed JSON, missing fences, and prose-wrapped responses from LLMs.
Also fixes str.index() ValueError on fenced blocks without newlines.
@rejojer
Copy link
Copy Markdown
Member Author

rejojer commented Apr 10, 2026

Code review

Found 1 issue:

  1. _write_concept silently discards LLM-generated content on re-compilation of the same document. When is_update=True and the concept page already exists, the body-replacement logic (lines 339-353) is entirely nested inside if source_file not in existing:. On re-compile, source_file is already present in the frontmatter, so this condition is false and the LLM rewrite is silently dropped -- the API call is made and paid for, but the result is never written. Only the brief field gets updated (lines 354-363, outside the gate). The source_file not in existing check should only guard the "add source to frontmatter" block, not the body replacement.

if is_update and path.exists():
existing = path.read_text(encoding="utf-8")
if source_file not in existing:
if existing.startswith("---"):
end = existing.find("---", 3)
if end != -1:
fm = existing[:end + 3]
body = existing[end + 3:]
if "sources:" in fm:
fm = fm.replace("sources: [", f"sources: [{source_file}, ")
else:
fm = fm.replace("---\n", f"---\nsources: [{source_file}]\n", 1)
existing = fm + body
else:
existing = f"---\nsources: [{source_file}]\n---\n\n" + existing
# Strip frontmatter from LLM content to avoid duplicate blocks
clean = content
if clean.startswith("---"):
end = clean.find("---", 3)
if end != -1:
clean = clean[end + 3:].lstrip("\n")
# Replace body with LLM rewrite (prompt asks for full rewrite, not delta)
if existing.startswith("---"):
end = existing.find("---", 3)
if end != -1:
existing = existing[:end + 3] + "\n\n" + clean
else:
existing = clean
else:
existing = clean
if brief and existing.startswith("---"):

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@rejojer
Copy link
Copy Markdown
Member Author

rejojer commented Apr 10, 2026

Code review

Found 1 issue:

  1. _write_concept silently discards LLM-generated concept rewrites on re-compilation. When is_update=True and source_file already appears in the existing concept page (in sources: [...] frontmatter), the source_file not in existing guard at line 326 causes the entire body-replacement block (lines 327-353) to be skipped. Only the brief frontmatter field is updated; the full page rewrite from _CONCEPT_UPDATE_USER is thrown away. This means re-adding a modified document leaves concept pages stale. (bug: body replacement wrongly gated inside source-dedup check)

if is_update and path.exists():
existing = path.read_text(encoding="utf-8")
if source_file not in existing:
if existing.startswith("---"):
end = existing.find("---", 3)
if end != -1:
fm = existing[:end + 3]
body = existing[end + 3:]
if "sources:" in fm:
fm = fm.replace("sources: [", f"sources: [{source_file}, ")
else:
fm = fm.replace("---\n", f"---\nsources: [{source_file}]\n", 1)
existing = fm + body
else:
existing = f"---\nsources: [{source_file}]\n---\n\n" + existing
# Strip frontmatter from LLM content to avoid duplicate blocks
clean = content
if clean.startswith("---"):
end = clean.find("---", 3)
if end != -1:
clean = clean[end + 3:].lstrip("\n")
# Replace body with LLM rewrite (prompt asks for full rewrite, not delta)
if existing.startswith("---"):
end = existing.find("---", 3)
if end != -1:
existing = existing[:end + 3] + "\n\n" + clean
else:
existing = clean
else:
existing = clean
if brief and existing.startswith("---"):

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@rejojer rejojer merged commit 85eaebf into dev Apr 10, 2026
@rejojer rejojer mentioned this pull request Apr 10, 2026
4 tasks
rejojer added a commit that referenced this pull request Apr 11, 2026
feat: compile pipeline, query agent, and multimodal improvements
@rejojer rejojer deleted the bugfix/compile-clean branch April 11, 2026 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants