feat: OpenKB MVP — Karpathy's LLM Knowledge Base, powered by PageIndex by KylinMountain · Pull Request #2 · VectifyAI/OpenKB

KylinMountain · 2026-04-06T14:08:31Z

Summary

OpenKB is a CLI that implements Karpathy's LLM Knowledge Base workflow — drop documents in, get an auto-maintained, cross-linked wiki out.

Core Features

okb init — Interactive setup with model, language, PageIndex config
okb add — Two indexing paths: markitdown for short docs, PageIndex for long PDFs (local or cloud)
okb query — Streaming Q&A with tool call visibility, PageIndex cloud streaming for long docs
okb watch — Filesystem watcher with debounce for auto-compilation
okb lint — Structural checks (broken links, orphans, index sync) + LLM knowledge checks
okb list / status — Document, summary, concept, and report overview

Architecture

Short docs (PDF < 50 pages, docx, html, etc.) → pymupdf dict-mode conversion with inline images → LLM compiles wiki
Long docs (PDF ≥ 50 pages) → PageIndex tree index with summaries + text → LLM compiles from summaries
Wiki structure: sources/, summaries/, concepts/, explorations/, reports/, index.md, log.md, AGENTS.md
PageIndex Cloud support via PAGEINDEX_API_KEY with streaming query
Obsidian compatible — plain .md files with [[wikilinks]]
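
The short/long split above can be illustrated with a small routing function. This is a minimal sketch, not the project's code: the function name, the constant, and the return labels are hypothetical, and only the 50-page threshold comes from the description.

```python
from __future__ import annotations

from pathlib import Path

# Threshold taken from the PR description: PDFs with 50 or more pages are "long".
LONG_PDF_MIN_PAGES = 50


def choose_indexing_path(filename: str, page_count: int | None = None) -> str:
    """Return "pageindex" for long PDFs and "short" for everything else.

    `page_count` only matters for PDFs; other formats always take the short path.
    """
    is_pdf = Path(filename).suffix.lower() == ".pdf"
    if is_pdf and page_count is not None and page_count >= LONG_PDF_MIN_PAGES:
        return "pageindex"
    return "short"
```

Under these assumptions a 54-page PDF is routed to the PageIndex path, while a docx of any length stays on the short path.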

Tech Stack

PageIndex, markitdown, OpenAI Agents SDK, LiteLLM, Click, watchdog

Test Plan

145 unit tests passing
E2E: okb add short PDF (attention is all you need) — images inline
E2E: okb add long PDF (Introduction to Agents, 54 pages) — PageIndex tree + images
E2E: okb add docx
E2E: okb query --save with streaming output
E2E: okb lint structural + knowledge checks
E2E: PageIndex cloud streaming query

Karpathy's LLM Knowledge Base workflow powered by PageIndex for long document understanding. Covers architecture, two indexing paths (markitdown for short docs, PageIndex for long docs), wiki compilation via single LLM agent session with prompt caching, Q&A, watch mode, linting, CLI commands, and error handling.

Sets up pyproject.toml (hatchling, direct-refs allowed, Python >=3.11), .gitignore, openkb/__init__.py, a Click CLI stub with all 7 commands (init, add, query, watch, lint, list, status), and tests/conftest.py with kb_dir and sample_tree fixtures. Package installs cleanly in a Python 3.12 venv; okb --help shows all commands; pytest collects 0 tests without error.

Add openkb/config.py (DEFAULT_CONFIG, load_config, save_config), openkb/state.py (HashRegistry with SHA-256 file hashing and JSON persistence), and openkb/schema.py (SCHEMA_MD constant). All 17 tests written first (red) then implemented (green).
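
A registry of this kind might look roughly like the following. It is a sketch under assumptions: the method names `is_known` and `register`, the hash-to-filename mapping, and the chunked `sha256_file` helper are illustrative and may differ from openkb/state.py.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class HashRegistry:
    """Hypothetical stand-in for the PR's HashRegistry: maps file hash to file name."""

    def __init__(self, store: Path) -> None:
        self.store = store
        # Load previously persisted entries, if any.
        self.entries: dict[str, str] = (
            json.loads(store.read_text(encoding="utf-8")) if store.exists() else {}
        )

    def is_known(self, path: Path) -> bool:
        return sha256_file(path) in self.entries

    def register(self, path: Path) -> None:
        self.entries[sha256_file(path)] = path.name
        self.store.write_text(json.dumps(self.entries, indent=2), encoding="utf-8")
```

Keying on the content hash rather than the path means a renamed copy of an already indexed file is still recognised.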

Creates full KB directory structure (raw/, wiki/sources/images/, wiki/summaries/, wiki/concepts/, wiki/reports/), writes SCHEMA.md, index.md, config.yaml and hashes.json; guards against re-initialisation. Three tests in tests/test_cli.py cover structure, schema content, and the already-initialized guard, all via CliRunner.isolated_filesystem.
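
The initialisation step could be sketched as below. The directory names come from the description; the function name `init_kb`, the placement of index.md under wiki/, and the use of hashes.json as the "already initialised" marker are assumptions made for illustration.

```python
from pathlib import Path

# Directory names are taken from the commit description above.
KB_DIRS = [
    "raw",
    "wiki/sources/images",
    "wiki/summaries",
    "wiki/concepts",
    "wiki/reports",
]


def init_kb(root: Path) -> bool:
    """Create the KB skeleton; return False if it already exists.

    Treating hashes.json as the re-initialisation guard is an assumption.
    """
    marker = root / "hashes.json"
    if marker.exists():
        return False
    for rel in KB_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)
    (root / "wiki" / "index.md").write_text("# Index\n", encoding="utf-8")
    marker.write_text("{}", encoding="utf-8")
    return True
```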

Implements IndexResult dataclass and index_long_document which creates a LocalClient with full node text/summary/description flags, adds the PDF via PageIndex, fetches structure, and writes source and summary wiki pages via the tree renderer.

Implements list_wiki_files, read_wiki_file, and write_wiki_file as plain functions in openkb/agent/tools.py without @function_tool decoration, ready to be wrapped when building the agent. Full test coverage including edge cases for missing files/dirs, filtering to .md only, and parent dir creation.
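
Plain, undecorated tool functions of this shape might look as follows. The signatures, return messages, and the choice to return a message rather than raise on a missing file are assumptions; only the three function names and the listed edge cases come from the description.

```python
from pathlib import Path


def list_wiki_files(wiki_dir: Path, subdir: str = "") -> list[str]:
    """Return sorted relative paths of .md files; empty list if the dir is missing."""
    base = wiki_dir / subdir
    if not base.is_dir():
        return []
    return sorted(p.relative_to(wiki_dir).as_posix() for p in base.rglob("*.md"))


def read_wiki_file(wiki_dir: Path, rel_path: str) -> str:
    """Return file contents, or a short message if the file does not exist."""
    target = wiki_dir / rel_path
    if not target.is_file():
        return f"File not found: {rel_path}"
    return target.read_text(encoding="utf-8")


def write_wiki_file(wiki_dir: Path, rel_path: str, content: str) -> str:
    """Write content, creating parent directories as needed."""
    target = wiki_dir / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return f"Wrote {rel_path}"
```

Keeping these as plain functions lets them be unit-tested directly and wrapped as agent tools later.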

Implements build_compiler_agent, compile_short_doc, compile_long_doc in openkb/agent/compiler.py with function_tool-wrapped wiki tools and SCHEMA_MD-enriched instructions. Long-doc variant includes get_page_content. Tests mock Runner.run to avoid real LLM calls.

Replaces the add stub with full orchestration: convert_document, index_long_document for long PDFs, and compiler agent calls. Adds SUPPORTED_EXTENSIONS set, _find_kb_dir, _add_single_file helpers. Adds python-dotenv dependency and load_dotenv() at startup.

Implements build_lint_agent with list/read tools and instructions for semantic quality checks (contradictions, gaps, staleness, redundancy). run_knowledge_lint runs the agent and returns the report string. okb lint combines structural + knowledge lint and writes timestamped report.

Previously the converter registered the file hash immediately, so if LLM compilation failed the file was marked as "done" and retries would skip it. Now the hash is only registered by the CLI after successful compilation. Also: install markitdown[all] for PDF support, add python-dotenv.
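
The ordering fix described here can be shown in a few lines. This is a sketch of the control flow only: `add_document`, the status strings, and the use of a plain set as the registry are hypothetical.

```python
from typing import Callable


def add_document(
    doc_hash: str,
    compile_fn: Callable[[], None],
    registry: set[str],
) -> str:
    """Compile a document and record its hash only if compilation succeeds."""
    if doc_hash in registry:
        return "skipped"
    try:
        compile_fn()
    except Exception:
        # Leave the registry untouched so the next `okb add` retries this file.
        return "failed"
    registry.add(doc_hash)
    return "added"
```

Because registration happens after `compile_fn` returns, a failed compilation leaves no trace and the same file is picked up again on the next attempt.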

…pport
- Switch from col._backend.get_document_structure() to col.get_document_structure()
- Add 3x retry for PageIndex indexing (stochastic TOC accuracy)
- Fix storage path to use .db extension
- Remove .doc from supported extensions (markitdown only supports .docx)
- Note: col.get_page_content() still missing from PageIndex public API, using col._backend.get_page_content() as workaround
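
The 3x retry around indexing could be expressed with a small generic helper. This is an illustrative sketch: `with_retries` is a hypothetical name, and the commit does not say which exceptions are retried or whether there is a delay between attempts.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], attempts: int = 3) -> T:
    """Call `fn` up to `attempts` times and return the first successful result."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Exception | None = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:  # broad on purpose: the failure mode is not specified
            last_error = exc
    assert last_error is not None
    raise last_error
```

If every attempt fails, the last exception is re-raised so the caller still sees why indexing failed.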

KylinMountain · 2026-04-06T14:21:04Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Generated with Claude Code

If this code review was useful, please react with 👍. Otherwise, react with 👎.

Architectural review (4 parallel Opus auditors) found that the skill_runner core was already generic, but the deck SURFACE was still fused to Editorial Monocle.

Fixed:

* validator: now takes optional `grammar` param (DeckGrammar TypedDict); skill-agnostic by default (only checks file present, parses, ≥5 slides, self-contained). Third-party deck skills (guizang, swiss) now pass validation cleanly. Editorial-specific rules opt-in via `EDITORIAL_MONOCLE_GRAMMAR`. (finding #2)
* skills/openkb-deck-editorial/SKILL.md: declares its grammar + output_path_template under `od:` frontmatter — `run_skill` reads these and applies them post-run.
* run_skill: now honors frontmatter `od.mode`, `od.output_path_template`, `od.deck_grammar`. When mode=="deck" and template is set, the runner injects the path into intent, verifies the file exists post-run, and runs validate_deck with the skill's grammar. Validation result is returned via new SkillRunResult dataclass. (findings #4, #5)
* `openkb deck new --skill <name>`: CLI flag accepts any installed deck skill (default openkb-deck-editorial). guizang and swiss now usable from the scripted CLI, not only freeform chat. (finding #1)
* `/deck new --skill <name>` chat slash: same flag, parsed positionally alongside --critique. (finding #1)
* tests/test_read_kb_file.py: 13 new tests mirroring test_write_kb_file for the read-side allow-list. Pins refusal of `.openkb/config.yaml`, `.env`, `raw/`, `..` traversal, absolute paths. (finding #6)
* Generator deck branch: no longer calls validate_deck directly; just propagates run_deck_create's SkillRunResult.validation up. Validation is now a property of "this skill declared mode=deck", not of "this CLI path was taken".

Existing tests updated:

* tests/test_deck_validator.py: explicit grammar arg on Editorial-specific tests; added test_guizang_shape_passes_generic_mode + test_missing_cover_ignored_in_generic_mode to pin both modes.
* tests/test_deck_creator.py: mocks return SkillRunResult; new test_run_deck_create_honors_skill_name_override for --skill flag.
* tests/test_generator.py: deck dispatch test mocks SkillRunResult.

Below-threshold findings deferred:

* Generator if/else → registry (score 70) — works, just not extensible via plugin; future.
* Iteration backup in chat freeform path (score 75) — needs write_kb_file hook; separate change.
* run_skill / scan_local_skills / _handle_slash_critique direct tests (scores 60-70) — covered indirectly by integration; can add later.

Regression: 538 tests pass (was 523 pre-fix; net +15 = 13 new read_kb_file tests + 2 new validator-mode tests).

KylinMountain added 30 commits April 4, 2026 23:23

docs: rename project to OpenKB, pip package to openkb

b55c626

docs: add OpenKB implementation plan (16 tasks)

1d0433f

chore: switch default model to gpt-5.4, fix test, setup uv

69fcf20

feat: add image extraction module (Task 4)

27ba2e1

Implements extract_base64_images and copy_relative_images with full test coverage for single/multiple images, invalid base64, missing files, and URL filtering.
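
A possible shape for the base64 extraction step is sketched below. The signature, the `image_N.ext` naming scheme, and the `rel_prefix` parameter are assumptions for illustration; the commit only states that embedded images are extracted and that invalid base64 is handled.

```python
import base64
import binascii
import re
from pathlib import Path

# Matches Markdown images whose target is a base64 data URI.
DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]\(data:image/(?P<ext>[a-zA-Z0-9]+);base64,(?P<data>[^)]+)\)"
)


def extract_base64_images(markdown: str, image_dir: Path, rel_prefix: str = "images") -> str:
    """Write embedded images to image_dir and rewrite links to point at the files.

    Images whose payload is not valid base64 are left untouched.
    """
    image_dir.mkdir(parents=True, exist_ok=True)
    counter = 0

    def replace(match: re.Match) -> str:
        nonlocal counter
        try:
            payload = base64.b64decode(match["data"], validate=True)
        except (binascii.Error, ValueError):
            return match.group(0)
        counter += 1
        name = f"image_{counter}.{match['ext']}"
        (image_dir / name).write_bytes(payload)
        return f"![{match['alt']}]({rel_prefix}/{name})"

    return DATA_URI_IMAGE.sub(replace, markdown)
```

Ordinary URL or relative-path images do not match the data-URI pattern, so they pass through unchanged.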

feat: add document converter module (Task 5)

59ba63e

Implements ConvertResult dataclass, get_pdf_page_count, and convert_document with hash-dedup, markdown passthrough, PDF long-doc detection, MarkItDown conversion, and image extraction integration.

feat: add PageIndex tree renderer module (Task 6)

3fc6cf3

Implements render_source_md and render_summary_md with YAML frontmatter, recursive heading hierarchy (h1–h6 capped), page ranges, and separate text/summary views for source and summary wiki pages.
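
The recursive heading logic might be sketched as follows. The node keys (`title`, `start_index`, `end_index`, `text`, `summary`, `nodes`) and the `render_tree` name are assumed for illustration and may not match the real PageIndex schema or the PR's renderer; YAML frontmatter is omitted.

```python
def render_tree(nodes: list[dict], level: int = 1, use_summary: bool = False) -> str:
    """Render a nested section tree as Markdown headings with page ranges.

    With use_summary=True the node summary is emitted instead of the full text.
    """
    lines: list[str] = []
    for node in nodes:
        hashes = "#" * min(level, 6)  # Markdown has no heading deeper than h6
        start, end = node.get("start_index"), node.get("end_index")
        pages = f" (pp. {start}-{end})" if start is not None and end is not None else ""
        lines.append(f"{hashes} {node['title']}{pages}")
        body = node.get("summary" if use_summary else "text", "")
        if body:
            lines.append(body)
        # Children are rendered one heading level deeper.
        child_md = render_tree(node.get("nodes", []), level + 1, use_summary)
        if child_md:
            lines.append(child_md)
    return "\n\n".join(lines)
```

The same tree can then feed both the source page (full text) and the summary page (summaries only) by flipping one flag.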

feat: add Q&A agent and query command (Task 11)

3c8ebb5

Implements pageindex_retrieve (structure -> LLM relevance -> page fetch), build_query_agent with list/read/retrieve tools, and run_query coroutine. Wires up `okb query` in cli.py.

feat: add watch mode with debounced file handler (Task 12)

1d4a342

Implements DebouncedHandler (collects events, ignores dirs/dotfiles, resets timer on burst) and watch_directory (Observer loop, Ctrl+C safe). Wires up `okb watch` in cli.py.
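
The debounce behaviour can be illustrated without watchdog, using only `threading.Timer`. This is a simplified sketch: the `Debouncer` class and its `add` method are hypothetical, and the real DebouncedHandler presumably receives watchdog events rather than plain paths.

```python
import threading
from pathlib import Path
from typing import Callable, Optional


class Debouncer:
    """Collect paths and fire `callback` once after `delay` seconds of quiet."""

    def __init__(self, delay: float, callback: Callable[[list], None]) -> None:
        self.delay = delay
        self.callback = callback
        self._pending: set = set()
        self._timer: Optional[threading.Timer] = None
        self._lock = threading.Lock()

    def add(self, path: str, is_directory: bool = False) -> None:
        # Skip directories and dotfiles, as the commit description says.
        if is_directory or Path(path).name.startswith("."):
            return
        with self._lock:
            self._pending.add(path)
            if self._timer is not None:
                self._timer.cancel()  # a new event in the burst resets the timer
            self._timer = threading.Timer(self.delay, self._flush)
            self._timer.daemon = True
            self._timer.start()

    def _flush(self) -> None:
        with self._lock:
            batch = sorted(self._pending)
            self._pending.clear()
            self._timer = None
        if batch:
            self.callback(batch)
```

A burst of saves from an editor therefore triggers one compilation for the whole batch instead of one per event.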

feat: add structural lint module (Task 13)

55a56af

Implements find_broken_links, find_orphans, find_missing_entries, check_index_sync, and run_structural_lint with full Markdown report. Covers wikilink resolution, orphan detection, raw/wiki entry matching, and index.md sync checking.
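
A broken-link check over `[[wikilinks]]` could look roughly like this. The resolution rule (match by file stem or by relative path without `.md`) and the return shape are assumptions; the actual module may resolve links differently.

```python
import re
from pathlib import Path

# [[target]], [[target|label]] and [[target#heading]] all resolve to "target".
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def find_broken_links(wiki_dir: Path) -> dict[str, list[str]]:
    """Map each page (relative path) to the wikilink targets that do not resolve."""
    pages = list(wiki_dir.rglob("*.md"))
    # A link resolves if it names a page by stem or by relative path without ".md".
    known = {p.stem for p in pages}
    known |= {p.relative_to(wiki_dir).with_suffix("").as_posix() for p in pages}
    broken: dict[str, list[str]] = {}
    for page in pages:
        text = page.read_text(encoding="utf-8")
        missing = [t.strip() for t in WIKILINK.findall(text) if t.strip() not in known]
        if missing:
            broken[page.relative_to(wiki_dir).as_posix()] = missing
    return broken
```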

feat: add tests for okb list and okb status commands (Task 15)

7608c96

Tests verify list shows documents table and concepts, status shows per-directory file counts and total indexed. Both check missing-init guard.

refactor: use public PageIndex API for get_page_content

14344eb

Replace col._backend.get_page_content(col._name, doc_id, spec) with col.get_page_content(doc_id, spec). Now all PageIndex access uses public API only.

feat: add log.md, query --save, rename SCHEMA.md to AGENTS.md

efd861b

fix: spec alignment — path mismatch, runtime schema, language, retry, interactive init, list/status format

866662f

refactor: switch from LocalClient to PageIndexClient, remove hardcoded .db path

b5e5b46

docs: add README with features, usage, and architecture overview

f94bea5

chore: add raw/, wiki/, .okb/ to gitignore

51dfcbb

fix: include timestamp in log entries for better sorting

2bac910

fix: exclude AGENTS.md, log.md, reports/, sources/ from structural lint

a174877

feat: extract embedded images from PDFs via pymupdf

3f98cc3

KylinMountain added 21 commits April 6, 2026 10:23

fix: rename stream labels to [tool call] and [tool output]

0d118dd

docs: add PageIndex Cloud API info to README

0b963de

Mention PageIndex Cloud API as an option for faster long document indexing, and add pageindex_api_key_env to the configuration example.

docs: add .env.example and config.example.yaml

c2c9fd7

chore: remove redundant config.example.yaml, keep .env.example only

3fd374a

fix: default to PAGEINDEX_API_KEY env var when config field is empty

e785de2
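
The fallback described in this fix might be sketched as below. The config key `pageindex_api_key_env` and the variable name PAGEINDEX_API_KEY appear in this PR; the function name and the choice to return `None` when nothing is set are assumptions.

```python
from __future__ import annotations

import os

DEFAULT_KEY_ENV = "PAGEINDEX_API_KEY"


def resolve_pageindex_api_key(config: dict) -> str | None:
    """Look up the API key via the env var named in config, else the default name."""
    # An empty or missing config field falls back to PAGEINDEX_API_KEY.
    env_name = config.get("pageindex_api_key_env") or DEFAULT_KEY_ENV
    return os.environ.get(env_name) or None
```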

feat: show summaries and reports in okb list

cf6f01f

fix: cloud PageIndex query fallback, disable parallel tool calls, exclude examples from pip

2fe2a25

fix: update tests for cloud PageIndex fallback

1394ac8

fix: cloud PageIndex uses col.query() directly, not fallback

ed7be8a

fix: isolate test env for local PageIndex path

060d68a

fix: route cloud docs (pi- prefix) to col.query(), local docs to structure-based retrieval

bbae3ba

docs: PageIndex cloud bug report — get_document_structure returns empty

03dadf3

docs: update README — Apache 2.0, image support, pageindex.ai link

c0b0294

chore: remove docs/ and .claude/ from git, keep local only

cb20017

docs: update PageIndex Cloud links to pageindex.ai, clarify PAGEINDEX_API_KEY config

aa0539c

docs: point API key links to pageindex.ai/developer

369a0ab

docs: fix API key link to pageindex.dev

ce17efb

feat: use PageIndex streaming query for cloud docs — avoids timeout, shows progress

e9291ba

fix: suppress cloud pageindex streaming output from user view, keep as tool result only

da4946a

initial commit

f140480

fix: restore cloud PageIndex streaming output to stdout for user feedback

cb72284

KylinMountain added 5 commits April 6, 2026 22:45

fix: close pymupdf handle in get_pdf_page_count, use consistent model in long doc compiler

5c83661

fix: update test mocks for pymupdf context manager

1f73a1f

chore: switch PageIndex dependency to VectifyAI/PageIndex@dev

faec35d

ci: add PyPI publish workflow on tag push

6921cc4

Merge branch 'main' into dev

841dd40

KylinMountain closed this Apr 6, 2026

KylinMountain force-pushed the main branch from f140480 to 0d2b230 on April 6, 2026 14:58

KylinMountain wants to merge 70 commits into main from dev

3 participants
