feat: compile pipeline, query agent, and multimodal improvements by rejojer · Pull Request #10 · VectifyAI/OpenKB

rejojer · 2026-04-10T00:07:34Z

Summary

Concept dedup & compile refactor: concept plan/update/related paths, bidirectional backlinks, shared _compile_concepts
Brief system & unified query: per-page JSON sources via pymupdf, doc_type + full_text frontmatter, get_page_content tool
Config & environment: default model gpt-5.4-mini, API key warning, warning suppression, test isolation
CLI polish: improved init prompts with explicit defaults, American English, output formatting
Multimodal query: get_image tool with ToolOutputImage for viewing figures/charts in source documents
Image path unification: all image paths use sources/images/ prefix, pymupdf replaces PageIndex for page content extraction

Test plan

All 184 tests pass
openkb init shows correct prompts with defaults
openkb add short doc: clean frontmatter, images at sources/images/
openkb add long doc: per-page JSON from pymupdf, correct image paths
openkb query on short doc: reads source via read_file
openkb query on long doc: uses get_page_content with targeted pages
openkb query about figures: calls get_image tool

…klinks - Add concept dedup with briefs and _read_concept_briefs context - Add concepts plan and update prompt templates with create/update/related paths - Extract shared _compile_concepts from compile_short_doc and compile_long_doc - Add bidirectional backlinks between summaries and concepts - Code review fixes: security, robustness, tests, and CI hardening Co-authored-by: Ray <mailtangyu@gmail.com>

- Add get_page_content tool and parse_pages helper for page-level access - Store long doc sources as per-page JSON extracted by pymupdf - Unify summary frontmatter to doc_type + full_text fields - Update schema and tree renderer for new frontmatter format - All image paths use sources/images/ prefix relative to wiki root Co-authored-by: Ray <mailtangyu@gmail.com>

- Change default model to gpt-5.4-mini - Warn when no LLM API key found instead of failing silently - Fix CI publish workflow and test isolation Co-authored-by: Ray <mailtangyu@gmail.com>

- Move warning suppression after imports to avoid markitdown override - Improve init prompts with explicit defaults - Use American English throughout (initialized, normalized, Synthesize) - Replace unicode ellipsis with ASCII - Remove empty explorations/reports dirs from init - Fix test isolation for _find_kb_dir

- Add get_image tool for viewing images referenced in source documents - Use ToolOutputImage for proper image content in LLM context - Update prompt: use full_text field, restrict get_page_content to pageindex - Add self-talk before tool calls, enforce concise answers - Prevent duplicate frontmatter in LLM-generated content via schema update

- Add convert_pdf_to_pages for per-page content+image extraction - All image paths use sources/images/ prefix relative to wiki root - Remove page marker comments from short doc source markdown

rejojer · 2026-04-10T00:28:56Z

Code review

Found 1 issue:

_write_concept appends LLM-rewritten content instead of replacing the body during concept updates, causing content duplication. The _CONCEPT_UPDATE_USER prompt (line 106) instructs the LLM to "Rewrite the full page incorporating the new information naturally -- do not just append" and return "The rewritten full concept page." However, _write_concept appends the complete rewrite to the existing body (existing += f"\n\n{clean}" at line 341) rather than replacing the body with it. Every concept update with a new source document will duplicate all existing content.

OpenKB/openkb/agent/compiler.py

Lines 319 to 342 in 44bf83e

    
           if is_update and path.exists(): 
        
               existing = path.read_text(encoding="utf-8") 
        
               if source_file not in existing: 
        
                   if existing.startswith("---"): 
        
                       end = existing.find("---", 3) 
        
                       if end != -1: 
        
                           fm = existing[:end + 3] 
        
                           body = existing[end + 3:] 
        
                           if "sources:" in fm: 
        
                               fm = fm.replace("sources: [", f"sources: [{source_file}, ") 
        
                           else: 
        
                               fm = fm.replace("---\n", f"---\nsources: [{source_file}]\n", 1) 
        
                           existing = fm + body 
        
                   else: 
        
                       existing = f"---\nsources: [{source_file}]\n---\n\n" + existing 
        
                   # Strip frontmatter from LLM content to avoid duplicate blocks 
        
                   clean = content 
        
                   if clean.startswith("---"): 
        
                       end = clean.find("---", 3) 
        
                       if end != -1: 
        
                           clean = clean[end + 3:].lstrip("\n") 
        
                   existing += f"\n\n{clean}" 
        
               if brief and existing.startswith("---"):

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

The _CONCEPT_UPDATE_USER prompt asks the LLM for a full rewrite, but _write_concept was appending the rewrite to the existing body, causing content duplication on every concept update.

Replace hand-rolled fence stripping with json_repair to handle malformed JSON, missing fences, and prose-wrapped responses from LLMs. Also fixes str.index() ValueError on fenced blocks without newlines.

rejojer · 2026-04-10T01:07:05Z

Code review

Found 1 issue:

_write_concept silently discards LLM-generated content on re-compilation of the same document. When is_update=True and the concept page already exists, the body-replacement logic (lines 339-353) is entirely nested inside if source_file not in existing:. On re-compile, source_file is already present in the frontmatter, so this condition is false and the LLM rewrite is silently dropped -- the API call is made and paid for, but the result is never written. Only the brief field gets updated (lines 354-363, outside the gate). The source_file not in existing check should only guard the "add source to frontmatter" block, not the body replacement.

OpenKB/openkb/agent/compiler.py

Lines 323 to 354 in 3dd84f3

    
           if is_update and path.exists(): 
        
               existing = path.read_text(encoding="utf-8") 
        
               if source_file not in existing: 
        
                   if existing.startswith("---"): 
        
                       end = existing.find("---", 3) 
        
                       if end != -1: 
        
                           fm = existing[:end + 3] 
        
                           body = existing[end + 3:] 
        
                           if "sources:" in fm: 
        
                               fm = fm.replace("sources: [", f"sources: [{source_file}, ") 
        
                           else: 
        
                               fm = fm.replace("---\n", f"---\nsources: [{source_file}]\n", 1) 
        
                           existing = fm + body 
        
                   else: 
        
                       existing = f"---\nsources: [{source_file}]\n---\n\n" + existing 
        
                   # Strip frontmatter from LLM content to avoid duplicate blocks 
        
                   clean = content 
        
                   if clean.startswith("---"): 
        
                       end = clean.find("---", 3) 
        
                       if end != -1: 
        
                           clean = clean[end + 3:].lstrip("\n") 
        
                   # Replace body with LLM rewrite (prompt asks for full rewrite, not delta) 
        
                   if existing.startswith("---"): 
        
                       end = existing.find("---", 3) 
        
                       if end != -1: 
        
                           existing = existing[:end + 3] + "\n\n" + clean 
        
                       else: 
        
                           existing = clean 
        
                   else: 
        
                       existing = clean 
        
               if brief and existing.startswith("---"):

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

…ocstring

rejojer · 2026-04-10T02:02:15Z

Code review

Found 1 issue:

_write_concept silently discards LLM-generated concept rewrites on re-compilation. When is_update=True and source_file already appears in the existing concept page (in sources: [...] frontmatter), the source_file not in existing guard at line 326 causes the entire body-replacement block (lines 327-353) to be skipped. Only the brief frontmatter field is updated; the full page rewrite from _CONCEPT_UPDATE_USER is thrown away. This means re-adding a modified document leaves concept pages stale. (bug: body replacement wrongly gated inside source-dedup check)

OpenKB/openkb/agent/compiler.py

Lines 323 to 354 in ef60f7d

    
           if is_update and path.exists(): 
        
               existing = path.read_text(encoding="utf-8") 
        
               if source_file not in existing: 
        
                   if existing.startswith("---"): 
        
                       end = existing.find("---", 3) 
        
                       if end != -1: 
        
                           fm = existing[:end + 3] 
        
                           body = existing[end + 3:] 
        
                           if "sources:" in fm: 
        
                               fm = fm.replace("sources: [", f"sources: [{source_file}, ") 
        
                           else: 
        
                               fm = fm.replace("---\n", f"---\nsources: [{source_file}]\n", 1) 
        
                           existing = fm + body 
        
                   else: 
        
                       existing = f"---\nsources: [{source_file}]\n---\n\n" + existing 
        
                   # Strip frontmatter from LLM content to avoid duplicate blocks 
        
                   clean = content 
        
                   if clean.startswith("---"): 
        
                       end = clean.find("---", 3) 
        
                       if end != -1: 
        
                           clean = clean[end + 3:].lstrip("\n") 
        
                   # Replace body with LLM rewrite (prompt asks for full rewrite, not delta) 
        
                   if existing.startswith("---"): 
        
                       end = existing.find("---", 3) 
        
                       if end != -1: 
        
                           existing = existing[:end + 3] + "\n\n" + clean 
        
                       else: 
        
                           existing = clean 
        
                   else: 
        
                       existing = clean 
        
               if brief and existing.startswith("---"):

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

feat: compile pipeline, query agent, and multimodal improvements

KylinMountain and others added 6 commits April 10, 2026 08:04

fix: default model, API key warning, config and CI improvements

0bf7084

- Change default model to gpt-5.4-mini - Warn when no LLM API key found instead of failing silently - Fix CI publish workflow and test isolation Co-authored-by: Ray <mailtangyu@gmail.com>

refactor: unify image paths and add pymupdf per-page extraction

44bf83e

- Add convert_pdf_to_pages for per-page content+image extraction - All image paths use sources/images/ prefix relative to wiki root - Remove page marker comments from short doc source markdown

rejojer force-pushed the bugfix/compile-clean branch from 9f652d0 to 44bf83e Compare April 10, 2026 00:09

rejojer added 5 commits April 10, 2026 08:33

fix: replace concept body on update instead of appending

7ca95f9

The _CONCEPT_UPDATE_USER prompt asks the LLM for a full rewrite, but _write_concept was appending the rewrite to the existing body, causing content duplication on every concept update.

fix: use json_repair for robust LLM JSON parsing

d41588a

Replace hand-rolled fence stripping with json_repair to handle malformed JSON, missing fences, and prose-wrapped responses from LLMs. Also fixes str.index() ValueError on fenced blocks without newlines.

fix: use pdf_path.stem for full_text frontmatter path

7dd70c6

fix: sanitize concept names before links and index

b90f0b4

fix: pass doc_type and doc_brief in early return paths

3dd84f3

fix: sanitize concept name in _gen_update and correct _update_index d…

ef60f7d

…ocstring

rejojer merged commit 85eaebf into dev Apr 10, 2026

rejojer mentioned this pull request Apr 10, 2026

Release: merge dev into main #12

Merged

4 tasks

rejojer added a commit that referenced this pull request Apr 11, 2026

Merge pull request #10 from VectifyAI/bugfix/compile-clean

2b7394f

feat: compile pipeline, query agent, and multimodal improvements

rejojer deleted the bugfix/compile-clean branch April 11, 2026 17:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: compile pipeline, query agent, and multimodal improvements#10

feat: compile pipeline, query agent, and multimodal improvements#10
rejojer merged 12 commits intodevfrom
bugfix/compile-clean

rejojer commented Apr 10, 2026

Uh oh!

rejojer commented Apr 10, 2026

Uh oh!

rejojer commented Apr 10, 2026

Uh oh!

rejojer commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rejojer commented Apr 10, 2026

Summary

Test plan

Uh oh!

rejojer commented Apr 10, 2026

Code review

Uh oh!

rejojer commented Apr 10, 2026

Code review

Uh oh!

rejojer commented Apr 10, 2026

Code review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants