feat: OpenKB MVP — Karpathy's LLM Knowledge Base, powered by PageIndex by KylinMountain · Pull Request #4 · VectifyAI/OpenKB

KylinMountain · 2026-04-06T15:14:24Z

Summary

OpenKB — Karpathy's LLM Knowledge Base workflow as a CLI, powered by PageIndex.

Drop documents in. Get an auto-maintained, cross-linked wiki out.

Features

okb init — Interactive setup
okb add — Short docs (pymupdf) + long PDFs (PageIndex local/cloud)
okb query — Streaming Q&A with PageIndex cloud streaming
okb watch — Auto-compile on file changes
okb lint — Structural + knowledge health checks
okb list / status — Knowledge base overview
Obsidian compatible wiki output

Tech Stack

PageIndex, markitdown, OpenAI Agents SDK, LiteLLM, Click, watchdog

Sets up pyproject.toml (hatchling, direct-refs allowed, Python >=3.11), .gitignore, openkb/__init__.py, a Click CLI stub with all 7 commands (init, add, query, watch, lint, list, status), and tests/conftest.py with kb_dir and sample_tree fixtures. Package installs cleanly in a Python 3.12 venv; okb --help shows all commands; pytest collects 0 tests without error.

Add openkb/config.py (DEFAULT_CONFIG, load_config, save_config), openkb/state.py (HashRegistry with SHA-256 file hashing and JSON persistence), and openkb/schema.py (SCHEMA_MD constant). All 17 tests written first (red) then implemented (green).
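A minimal sketch of what a SHA-256 hash registry with JSON persistence could look like. The class name `HashRegistrySketch`, its method names, and the hash-to-filename mapping are assumptions for illustration; the actual `HashRegistry` in openkb/state.py may be shaped differently.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class HashRegistrySketch:
    """Hypothetical stand-in for HashRegistry: maps content hash -> file name."""

    def __init__(self, store_path: Path) -> None:
        self.store_path = store_path
        self.entries: Dict[str, str] = {}
        if store_path.exists():
            self.entries = json.loads(store_path.read_text(encoding="utf-8"))

    def is_known(self, path: Path) -> bool:
        """True if a file with identical content was registered before."""
        return sha256_of_file(path) in self.entries

    def register(self, path: Path) -> None:
        """Record the file's hash and persist the registry as JSON."""
        self.entries[sha256_of_file(path)] = path.name
        self.store_path.write_text(json.dumps(self.entries, indent=2), encoding="utf-8")
```

Keying on the content hash rather than the path means a renamed copy of an already-registered file is still recognised.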

Creates full KB directory structure (raw/, wiki/sources/images/, wiki/summaries/, wiki/concepts/, wiki/reports/), writes SCHEMA.md, index.md, config.yaml and hashes.json; guards against re-initialisation. Three tests in tests/test_cli.py cover structure, schema content, and the already-initialized guard, all via CliRunner.isolated_filesystem.
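The layout and the re-initialisation guard described above can be sketched roughly as follows. The directory names come from the commit message; the function name `init_kb_sketch` and the use of a root-level config.yaml as the "already initialised" marker are assumptions, not the project's confirmed behaviour.

```python
from pathlib import Path

# Directory names taken from the commit message above.
KB_SUBDIRS = [
    "raw",
    "wiki/sources/images",
    "wiki/summaries",
    "wiki/concepts",
    "wiki/reports",
]


def init_kb_sketch(root: Path) -> bool:
    """Create the KB layout; return False if the KB already exists."""
    marker = root / "config.yaml"  # assumption: this file marks an initialised KB
    if marker.exists():
        return False
    for sub in KB_SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    marker.write_text("# placeholder config\n", encoding="utf-8")
    return True
```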

Implements extract_base64_images and copy_relative_images with full test coverage for single/multiple images, invalid base64, missing files, and URL filtering.
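As an illustration of the base64 case, here is a sketch that pulls inline data-URI images out of Markdown, writes them to disk, and rewrites the link. The function name, the regex, and the `img{n}` file naming are assumptions; only the `images/{doc_name}/...` relative path convention is taken from code quoted elsewhere in this PR.

```python
import base64
import binascii
import re
from pathlib import Path

# Matches ![alt](data:image/<ext>;base64,<payload>)
_DATA_URI = re.compile(
    r"!\[([^\]]*)\]\(data:image/(png|jpeg|gif);base64,([A-Za-z0-9+/=]+)\)"
)


def extract_base64_images_sketch(markdown: str, images_dir: Path, doc_name: str) -> str:
    """Save embedded images to images_dir and point the Markdown at the files."""
    images_dir.mkdir(parents=True, exist_ok=True)
    counter = 0

    def _replace(match):
        nonlocal counter
        alt, ext, payload = match.groups()
        try:
            data = base64.b64decode(payload, validate=True)
        except (binascii.Error, ValueError):
            return match.group(0)  # invalid base64: leave the original text untouched
        counter += 1
        filename = f"img{counter}.{ext}"
        (images_dir / filename).write_bytes(data)
        return f"![{alt}](images/{doc_name}/{filename})"

    return _DATA_URI.sub(_replace, markdown)
```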

Implements ConvertResult dataclass, get_pdf_page_count, and convert_document with hash-dedup, markdown passthrough, PDF long-doc detection, MarkItDown conversion, and image extraction integration.

Implements render_source_md and render_summary_md with YAML frontmatter, recursive heading hierarchy (h1–h6 capped), page ranges, and separate text/summary views for source and summary wiki pages.
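The capped heading hierarchy can be illustrated with a short recursive sketch. The node keys (`title`, `text`, `nodes`) and the function name are assumptions about the tree shape, not the renderer's actual input format.

```python
from typing import Any, Dict, List


def render_tree_sketch(nodes: List[Dict[str, Any]], level: int = 1) -> List[str]:
    """Render a nested section tree as Markdown lines, capping headings at h6."""
    lines: List[str] = []
    for node in nodes:
        hashes = "#" * min(level, 6)  # Markdown has no h7, so deeper levels stay at h6
        lines.append(f"{hashes} {node['title']}")
        if node.get("text"):
            lines.append(node["text"])
        lines.extend(render_tree_sketch(node.get("nodes", []), level + 1))
    return lines
```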

Implements IndexResult dataclass and index_long_document which creates a LocalClient with full node text/summary/description flags, adds the PDF via PageIndex, fetches structure, and writes source and summary wiki pages via the tree renderer.

Implements list_wiki_files, read_wiki_file, and write_wiki_file as plain functions in openkb/agent/tools.py without @function_tool decoration, ready to be wrapped when building the agent. Full test coverage including edge cases for missing files/dirs, filtering to .md only, and parent dir creation.

Implements build_compiler_agent, compile_short_doc, compile_long_doc in openkb/agent/compiler.py with function_tool-wrapped wiki tools and SCHEMA_MD-enriched instructions. Long-doc variant includes get_page_content. Tests mock Runner.run to avoid real LLM calls.

Replaces the add stub with full orchestration: convert_document, index_long_document for long PDFs, and compiler agent calls. Adds SUPPORTED_EXTENSIONS set, _find_kb_dir, _add_single_file helpers. Adds python-dotenv dependency and load_dotenv() at startup.

Implements pageindex_retrieve (structure -> LLM relevance -> page fetch), build_query_agent with list/read/retrieve tools, and run_query coroutine. Wires up `okb query` in cli.py.

Implements DebouncedHandler (collects events, ignores dirs/dotfiles, resets timer on burst) and watch_directory (Observer loop, Ctrl+C safe). Wires up `okb watch` in cli.py.
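The debounce behaviour (collect events, skip directories and dotfiles, restart the timer on each burst) can be sketched with the standard library alone. This is not the watchdog-based `DebouncedHandler` itself: the class name, the `on_event` signature, and the callback shape are assumptions made so the example runs without watchdog installed.

```python
import threading
from pathlib import Path
from typing import Callable, List, Optional, Set


class DebouncerSketch:
    """Collect changed paths and fire one callback after a quiet period."""

    def __init__(self, delay: float, callback: Callable[[List[str]], None]) -> None:
        self.delay = delay
        self.callback = callback
        self._pending: Set[str] = set()
        self._timer: Optional[threading.Timer] = None
        self._lock = threading.Lock()

    def on_event(self, path: str, is_directory: bool = False) -> None:
        # Ignore directory events and dotfiles such as editor swap files.
        if is_directory or Path(path).name.startswith("."):
            return
        with self._lock:
            self._pending.add(path)
            if self._timer is not None:
                self._timer.cancel()  # a new event in the burst resets the countdown
            self._timer = threading.Timer(self.delay, self._flush)
            self._timer.daemon = True
            self._timer.start()

    def _flush(self) -> None:
        with self._lock:
            paths, self._pending = sorted(self._pending), set()
            self._timer = None
        if paths:
            self.callback(paths)
```

A burst of saves to several files then results in a single compile pass over the collected paths instead of one pass per event.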

Implements find_broken_links, find_orphans, find_missing_entries, check_index_sync, and run_structural_lint with full Markdown report. Covers wikilink resolution, orphan detection, raw/wiki entry matching, and index.md sync checking.
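A sketch of the broken-wikilink check, assuming Obsidian-style `[[target]]`, `[[target|alias]]`, and `[[target#heading]]` links that resolve by page name anywhere under the wiki directory. The regex and the resolve-by-stem rule are assumptions; the real `find_broken_links` may resolve links differently.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Captures the target of [[target]], [[target|alias]] and [[target#heading]].
_WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def find_broken_links_sketch(wiki_dir: Path) -> List[Tuple[str, str]]:
    """Return (page, target) pairs whose target matches no .md page name."""
    pages = {p.stem for p in wiki_dir.rglob("*.md")}
    broken: List[Tuple[str, str]] = []
    for page in sorted(wiki_dir.rglob("*.md")):
        text = page.read_text(encoding="utf-8")
        for target in _WIKILINK.findall(text):
            name = Path(target.strip()).name  # compare by page name, ignoring folders
            if name not in pages:
                broken.append((page.relative_to(wiki_dir).as_posix(), target.strip()))
    return broken
```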

Implements build_lint_agent with list/read tools and instructions for semantic quality checks (contradictions, gaps, staleness, redundancy). run_knowledge_lint runs the agent and returns the report string. okb lint combines structural + knowledge lint and writes timestamped report.

Tests verify list shows documents table and concepts, status shows per-directory file counts and total indexed. Both check missing-init guard.

Previously the converter registered the file hash immediately, so if LLM compilation failed the file was marked as "done" and retries would skip it. Now the hash is only registered by the CLI after successful compilation. Also: install markitdown[all] for PDF support, add python-dotenv.
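The ordering fix can be shown in a few lines: the hash is recorded only after the compile step returns without raising. All names here (`add_file_sketch`, `compile_fn`, `hash_fn`, the in-memory `known_hashes` set) are hypothetical and stand in for the CLI, compiler agent, and registry.

```python
from typing import Callable, Set


def add_file_sketch(
    path: str,
    known_hashes: Set[str],
    compile_fn: Callable[[str], None],
    hash_fn: Callable[[str], str],
) -> str:
    """Compile a file and register its hash only if compilation succeeded."""
    digest = hash_fn(path)
    if digest in known_hashes:
        return "skipped"
    compile_fn(path)          # may raise; in that case the hash is never recorded
    known_hashes.add(digest)  # reached only after a successful compile
    return "added"
```

Because a failed compile leaves the registry untouched, running the same command again retries the file instead of skipping it.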

…pport

- Switch from col._backend.get_document_structure() to col.get_document_structure()
- Add 3x retry for PageIndex indexing (stochastic TOC accuracy)
- Fix storage path to use .db extension
- Remove .doc from supported extensions (markitdown only supports .docx)
- Note: col.get_page_content() still missing from PageIndex public API, using col._backend.get_page_content() as workaround
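A generic retry wrapper of the kind the "3x retry" item suggests might look like the sketch below. Whether the commit means three attempts in total or three retries after the first failure is not stated; this sketch assumes three attempts in total, and the helper name `with_retries` is hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], attempts: int = 3, delay: float = 0.0) -> T:
    """Call fn until it succeeds or the attempt budget is exhausted."""
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:  # a non-deterministic step can fail in many ways
            last_error = exc
            if attempt < attempts and delay:
                time.sleep(delay)
    raise RuntimeError(f"failed after {attempts} attempts") from last_error
```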

Replace col._backend.get_page_content(col._name, doc_id, spec) with col.get_page_content(doc_id, spec). Now all PageIndex access uses public API only.

… interactive init, list/status format

…d .db path

Rename CLI command and state dir from okb to openkb

- Hardcode reading LLM_API_KEY env var instead of indirecting through config
- Remove llm_api_key_env from DEFAULT_CONFIG, okb init prompts, and config.yaml
- Provider-specific env vars (OPENAI_API_KEY, etc.) still work via LiteLLM auto-detection
- One less config field, one less okb init step

Simplify LLM API key configuration

The OpenAI Agents SDK requires a litellm/ prefix to route non-OpenAI models through LiteLLM. Without it, models like anthropic/claude-sonnet-4-6 fail with "Unknown prefix". This adds the prefix at all Agent() call sites while keeping litellm.completion() calls unchanged. Also updates README quick start comments and model format docs.
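The prefixing rule can be sketched as a small helper. The function name `agent_model_name` is hypothetical, and applying the prefix unconditionally (while leaving already-prefixed names alone) is an assumption based on the phrase "at all Agent() call sites".

```python
def agent_model_name(model: str) -> str:
    """Return the model string to hand to Agent(); hypothetical helper.

    The unprefixed name is still what would be passed to litellm.completion().
    """
    prefix = "litellm/"
    return model if model.startswith(prefix) else prefix + model
```

For example, `agent_model_name("anthropic/claude-sonnet-4-6")` yields `litellm/anthropic/claude-sonnet-4-6`, and calling it on an already-prefixed name returns the name unchanged.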

Fix: add litellm/ prefix for Agents SDK model routing

…config; update readme

KylinMountain · 2026-04-08T06:18:54Z

Code review

Found 1 issue:

extract_pdf_images and convert_pdf_with_images in images.py open pymupdf documents with explicit .close() instead of context managers. If an exception is raised during page iteration (e.g. corrupt image block, pixmap allocation failure), the PDF file handle leaks. This is the same bug pattern that was already fixed in converter.py:get_pdf_page_count (commit c525455), but images.py was missed. Fix: replace doc = pymupdf.open(...) / doc.close() with with pymupdf.open(...) as doc:.

OpenKB/openkb/images.py

Lines 38 to 74 in 1637697

    
    doc = pymupdf.open(str(pdf_path))
    for page_idx in range(len(doc)):
        page = doc[page_idx]
        page_num = page_idx + 1
        for block in page.get_text("dict")["blocks"]:
            if block["type"] != 1:  # not an image block
                continue
            width = block.get("width", 0)
            height = block.get("height", 0)
            if width < _MIN_IMAGE_DIM or height < _MIN_IMAGE_DIM:
                continue
            image_bytes = block.get("image")
            if not image_bytes:
                continue
            try:
                pix = pymupdf.Pixmap(image_bytes)
                if pix.n > 4:
                    pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
                img_counter += 1
                filename = f"p{page_num}_img{img_counter}.png"
                save_path = images_dir / filename
                pix.save(str(save_path))
                pix = None
            except Exception:
                logger.warning("Failed to save image block on page %d", page_num)
                continue
            rel_path = f"images/{doc_name}/{filename}"
            page_images.setdefault(page_num, []).append(rel_path)
    doc.close()
    return page_images

OpenKB/openkb/images.py

Lines 89 to 125 in 1637697

    
    doc = pymupdf.open(str(pdf_path))
    for page_idx in range(len(doc)):
        page = doc[page_idx]
        page_num = page_idx + 1
        parts.append(f"\n\n<!-- Page {page_num} -->\n")
        for block in page.get_text("dict")["blocks"]:
            if block["type"] == 0:  # text block
                lines = []
                for line in block["lines"]:
                    spans_text = "".join(span["text"] for span in line["spans"])
                    lines.append(spans_text)
                parts.append("\n".join(lines))
            elif block["type"] == 1:  # image block
                width = block.get("width", 0)
                height = block.get("height", 0)
                if width < _MIN_IMAGE_DIM or height < _MIN_IMAGE_DIM:
                    continue
                image_bytes = block.get("image")
                if not image_bytes:
                    continue
                try:
                    pix = pymupdf.Pixmap(image_bytes)
                    if pix.n > 4:
                        pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
                    img_counter += 1
                    filename = f"p{page_num}_img{img_counter}.png"
                    (images_dir / filename).write_bytes(pix.tobytes("png"))
                    pix = None
                    parts.append(f"\n![image](images/{doc_name}/{filename})\n")
                except Exception:
                    logger.warning("Failed to save image block on page %d", page_num)
    doc.close()
    return "\n".join(parts)
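To make the leak concrete without requiring pymupdf, the sketch below uses a stand-in document class with the same context-manager shape. `FakeDoc` and `count_nonempty_pages` are hypothetical; in the real fix the `with` target would be `pymupdf.open(str(pdf_path))`, as the review suggests.

```python
from typing import Callable, List, Optional


class FakeDoc:
    """Stand-in for a pymupdf document so the example runs without pymupdf."""

    def __init__(self, pages: List[Optional[str]]) -> None:
        self.pages = pages
        self.closed = False

    def __enter__(self) -> "FakeDoc":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.close()
        return False  # do not swallow the exception

    def close(self) -> None:
        self.closed = True

    def __len__(self) -> int:
        return len(self.pages)

    def __getitem__(self, idx: int) -> Optional[str]:
        return self.pages[idx]


def count_nonempty_pages(open_doc: Callable[[], FakeDoc]) -> int:
    """Iterate pages inside a with-block so the handle closes even on errors."""
    with open_doc() as doc:
        total = 0
        for idx in range(len(doc)):
            page = doc[idx]
            if page is None:  # simulates a corrupt page raising mid-iteration
                raise ValueError(f"corrupt page {idx + 1}")
            if page:
                total += 1
        return total
```

With an explicit `doc.close()` at the end of the function, the `ValueError` path would skip the close call; with the `with` block, `__exit__` runs on both the normal and the error path.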

🤖 Generated with Claude Code


…dle leaks

rejojer · 2026-04-08T11:41:59Z

Code review

Found 2 issues:

README claims LiteLLM is "pinned to a safe version" but pyproject.toml has no version pin. Line 67 of README.md states LiteLLM is (pinned to a safe version), but pyproject.toml line 17 lists the dependency as bare "litellm" with no version constraint (==, >=, ~=, etc.). Any version -- including potentially insecure ones -- can be installed.

OpenKB/README.md

Lines 66 to 68 in 854294c

    
           OpenKB comes with [multi-LLM support](https://docs.litellm.ai/docs/providers) (e.g., OpenAI, Claude, Gemini) via [LiteLLM](https://github.com/BerriAI/litellm) (pinned to a [safe version](https://docs.litellm.ai/blog/security-update-march-2026)).

OpenKB/pyproject.toml

Lines 16 to 18 in 854294c

    "watchdog>=3.0",
    "litellm",
    "openai-agents",

test_short_pdf_converted_via_markitdown mocks the wrong code path. The test patches openkb.converter.MarkItDown and openkb.converter.pymupdf.open, but converter.py line 99-101 routes short PDFs through convert_pdf_with_images() (from openkb.images), not MarkItDown. The MarkItDown mock is never exercised, and convert_pdf_with_images is not mocked, so the test either fails at runtime or passes for the wrong reasons.

OpenKB/tests/test_converter.py

Lines 83 to 108 in 854294c

    
    class TestConvertDocumentPdfShort:
        def test_short_pdf_converted_via_markitdown(self, kb_dir, tmp_path):
            """PDF under threshold is converted with markitdown."""
            src = tmp_path / "short.pdf"
            src.write_bytes(b"%PDF-1.4 fake content")
            fake_result = MagicMock()
            fake_result.text_content = "# Short PDF\n\nConverted content."
            with (
                patch("openkb.converter.pymupdf.open") as mock_mu,
                patch("openkb.converter.MarkItDown") as mock_mid_cls,
            ):
                fake_doc = MagicMock()
                fake_doc.page_count = 5  # below default threshold of 20
                fake_doc.__enter__ = MagicMock(return_value=fake_doc)
                fake_doc.__exit__ = MagicMock(return_value=False)
                mock_mu.return_value = fake_doc
                mock_mid_cls.return_value.convert.return_value = fake_result
                result = convert_document(src, kb_dir)
            assert result.skipped is False
            assert result.is_long_doc is False
            assert result.source_path is not None
            assert result.source_path.exists()

OpenKB/openkb/converter.py

Lines 98 to 102 in 854294c

    
        markdown = copy_relative_images(markdown, src.parent, doc_name, images_dir)
    elif src.suffix.lower() == ".pdf":
        # Use pymupdf dict-mode for PDFs: text + images inline at correct positions
        markdown = convert_pdf_with_images(src, doc_name, images_dir)
    else:
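The mocking problem can be reproduced in isolation: a patch only matters if the code under test actually looks up the patched name. In this sketch a `SimpleNamespace` stands in for the `openkb.converter` module namespace, and `convert_document_sketch` is a simplified, hypothetical version of the dispatch quoted above.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock, patch

# Hypothetical stand-in for the openkb.converter module namespace.
converter = SimpleNamespace(
    MarkItDown=MagicMock(name="MarkItDown"),
    convert_pdf_with_images=lambda src, doc_name, images_dir: "real conversion",
)


def convert_document_sketch(src: str) -> str:
    """PDFs go through convert_pdf_with_images; everything else through MarkItDown."""
    if src.lower().endswith(".pdf"):
        return converter.convert_pdf_with_images(src, "doc", "images")
    return converter.MarkItDown().convert(src).text_content


def run_short_pdf_check() -> str:
    """Patch the function the PDF branch really calls and assert on both mocks."""
    with patch.object(converter, "MarkItDown") as mock_mid_cls:
        with patch.object(
            converter, "convert_pdf_with_images", return_value="# Short PDF"
        ) as mock_pdf:
            result = convert_document_sketch("short.pdf")
            mock_pdf.assert_called_once()     # the path that is actually exercised
            mock_mid_cls.assert_not_called()  # the MarkItDown mock is never touched
    return result
```

Adding an `assert_called_once()` on whichever converter the test is named after is a cheap way to catch a test that passes for the wrong reasons.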

🤖 Generated with Claude Code


Architectural review (4 parallel Opus auditors) found that the skill_runner core was already generic, but the deck SURFACE was still fused to Editorial Monocle.

Fixed:

* validator: now takes optional `grammar` param (DeckGrammar TypedDict); skill-agnostic by default (only checks file present, parses, ≥5 slides, self-contained). Third-party deck skills (guizang, swiss) now pass validation cleanly. Editorial-specific rules opt-in via `EDITORIAL_MONOCLE_GRAMMAR`. (finding #2)
* skills/openkb-deck-editorial/SKILL.md: declares its grammar + output_path_template under `od:` frontmatter; `run_skill` reads these and applies them post-run.
* run_skill: now honors frontmatter `od.mode`, `od.output_path_template`, `od.deck_grammar`. When mode=="deck" and template is set, the runner injects the path into intent, verifies the file exists post-run, and runs validate_deck with the skill's grammar. Validation result is returned via new SkillRunResult dataclass. (findings #4, #5)
* `openkb deck new --skill <name>`: CLI flag accepts any installed deck skill (default openkb-deck-editorial). guizang and swiss now usable from the scripted CLI, not only freeform chat. (finding #1)
* `/deck new --skill <name>` chat slash: same flag, parsed positionally alongside --critique. (finding #1)
* tests/test_read_kb_file.py: 13 new tests mirroring test_write_kb_file for the read-side allow-list. Pins refusal of `.openkb/config.yaml`, `.env`, `raw/`, `..` traversal, absolute paths. (finding #6)
* Generator deck branch: no longer calls validate_deck directly; just propagates run_deck_create's SkillRunResult.validation up. Validation is now a property of "this skill declared mode=deck", not of "this CLI path was taken".

Existing tests updated:

* tests/test_deck_validator.py: explicit grammar arg on Editorial-specific tests; added test_guizang_shape_passes_generic_mode + test_missing_cover_ignored_in_generic_mode to pin both modes.
* tests/test_deck_creator.py: mocks return SkillRunResult; new test_run_deck_create_honors_skill_name_override for --skill flag.
* tests/test_generator.py: deck dispatch test mocks SkillRunResult.

Below-threshold findings deferred:

* Generator if/else → registry (score 70): works, just not extensible via plugin; future.
* Iteration backup in chat freeform path (score 75): needs write_kb_file hook; separate change.
* run_skill / scan_local_skills / _handle_slash_critique direct tests (scores 60-70): covered indirectly by integration; can add later.

Regression: 538 tests pass (was 523 pre-fix; net +15 = 13 new read_kb_file tests + 2 new validator-mode tests).

KylinMountain and others added 30 commits April 6, 2026 23:13

chore: switch default model to gpt-5.4, fix test, setup uv

46f457f

feat: add image extraction module (Task 4)

9a00027

Implements extract_base64_images and copy_relative_images with full test coverage for single/multiple images, invalid base64, missing files, and URL filtering.

feat: add document converter module (Task 5)

4290d5b

Implements ConvertResult dataclass, get_pdf_page_count, and convert_document with hash-dedup, markdown passthrough, PDF long-doc detection, MarkItDown conversion, and image extraction integration.

feat: add PageIndex tree renderer module (Task 6)

e847162

Implements render_source_md and render_summary_md with YAML frontmatter, recursive heading hierarchy (h1–h6 capped), page ranges, and separate text/summary views for source and summary wiki pages.

feat: add Q&A agent and query command (Task 11)

bbbf4d3

Implements pageindex_retrieve (structure -> LLM relevance -> page fetch), build_query_agent with list/read/retrieve tools, and run_query coroutine. Wires up `okb query` in cli.py.

feat: add watch mode with debounced file handler (Task 12)

69233b4

Implements DebouncedHandler (collects events, ignores dirs/dotfiles, resets timer on burst) and watch_directory (Observer loop, Ctrl+C safe). Wires up `okb watch` in cli.py.

feat: add structural lint module (Task 13)

7cfc14e

Implements find_broken_links, find_orphans, find_missing_entries, check_index_sync, and run_structural_lint with full Markdown report. Covers wikilink resolution, orphan detection, raw/wiki entry matching, and index.md sync checking.

feat: add tests for okb list and okb status commands (Task 15)

ef4c2c2

Tests verify list shows documents table and concepts, status shows per-directory file counts and total indexed. Both check missing-init guard.

refactor: use public PageIndex API for get_page_content

564ed51

Replace col._backend.get_page_content(col._name, doc_id, spec) with col.get_page_content(doc_id, spec). Now all PageIndex access uses public API only.

feat: add log.md, query --save, rename SCHEMA.md to AGENTS.md

4da4fc4

fix: spec alignment — path mismatch, runtime schema, language, retry,…

f76459d

… interactive init, list/status format

refactor: switch from LocalClient to PageIndexClient, remove hardcoded .db path

4629ae7

docs: add README with features, usage, and architecture overview

8579120

chore: add raw/, wiki/, .okb/ to gitignore

a5d4155

fix: include timestamp in log entries for better sorting

44d286a

fix: exclude AGENTS.md, log.md, reports/, sources/ from structural lint

662d8ee

feat: extract embedded images from PDFs via pymupdf

975fde3

feat: show progress counter when adding folder of documents

ff580c0

Update README.md

422123c

revert: undo tree_renderer change, waiting for PageIndex API

e270958

rejojer and others added 23 commits April 8, 2026 03:25

update readme

de5953c

update readme

f94df3c

update readme

c5fcd9a

update readme

4adbdc6

update readme

48d0d94

update readme

8f7f475

Update README.md

7d724be

Update README.md

2ac27c9

update readme

9296c55

rename CLI command and state directory from okb to openkb

0444ab3

Merge pull request #5 from VectifyAI/cli-rename-openkb

98e9892

Rename CLI command and state dir from okb to openkb

fix: remove max_tokens from litellm call for cross-model compatibility

d24590c

docs: add LiteLLM model name examples to README and okb init prompt

55444d3

Merge pull request #6 from VectifyAI/simplify-llm-api-key-config

018342e

Simplify LLM API key configuration

update readme

b4481e7

Merge pull request #7 from VectifyAI/fix/litellm-agent-model-prefix

0117751

Fix: add litellm/ prefix for Agents SDK model routing

update readme

0752290

update readme

771cb60

update readme

6216c2e

update readme

7a40829

simplify PageIndex API key config: remove pageindex_api_key_env from config; update readme

1637697

fix: use context manager for pymupdf.open in images.py to prevent handle leaks

854294c

rejojer added 2 commits April 8, 2026 20:53

switch pageindex dependency from git to PyPI (0.3.0.dev0)

d3ca1a9

add authors, classifiers, keywords, and project urls to pyproject.toml

f0963f6

rejojer merged commit f0963f6 into main Apr 8, 2026

rejojer merged 102 commits into main from dev


3 participants
