feat: OpenKB MVP — Karpathy's LLM Knowledge Base, powered by PageIndex#2

Closed
KylinMountain wants to merge 70 commits into main from dev

@KylinMountain
Collaborator

Summary

OpenKB is a CLI that implements Karpathy's LLM Knowledge Base workflow — drop documents in, get an auto-maintained, cross-linked wiki out.

Core Features

  • okb init — Interactive setup with model, language, PageIndex config
  • okb add — Two indexing paths: markitdown for short docs, PageIndex for long PDFs (local or cloud)
  • okb query — Streaming Q&A with tool call visibility, PageIndex cloud streaming for long docs
  • okb watch — Filesystem watcher with debounce for auto-compilation
  • okb lint — Structural checks (broken links, orphans, index sync) + LLM knowledge checks
  • okb list / status — Document, summary, concept, and report overview

Architecture

  • Short docs (PDF < 50 pages, docx, html, etc.) → pymupdf dict-mode conversion with inline images → LLM compiles wiki
  • Long docs (PDF ≥ 50 pages) → PageIndex tree index with summaries + text → LLM compiles from summaries
  • Wiki structure: sources/, summaries/, concepts/, explorations/, reports/, index.md, log.md, AGENTS.md
  • PageIndex Cloud support via PAGEINDEX_API_KEY with streaming query
  • Obsidian compatible — plain .md files with [[wikilinks]]

Tech Stack

PageIndex, markitdown, OpenAI Agents SDK, LiteLLM, Click, watchdog

Test Plan

  • 145 unit tests passing
  • E2E: okb add short PDF (attention is all you need) — images inline
  • E2E: okb add long PDF (Introduction to Agents, 54 pages) — PageIndex tree + images
  • E2E: okb add docx
  • E2E: okb query --save with streaming output
  • E2E: okb lint structural + knowledge checks
  • E2E: PageIndex cloud streaming query

Karpathy's LLM Knowledge Base workflow powered by PageIndex for long
document understanding. Covers architecture, two indexing paths
(markitdown for short docs, PageIndex for long docs), wiki compilation
via a single LLM agent session with prompt caching, Q&A, watch mode,
linting, CLI commands, and error handling.

Sets up pyproject.toml (hatchling, direct-refs allowed, Python >=3.11),
.gitignore, openkb/__init__.py, a Click CLI stub with all 7 commands
(init, add, query, watch, lint, list, status), and tests/conftest.py
with kb_dir and sample_tree fixtures. Package installs cleanly in a
Python 3.12 venv; okb --help shows all commands; pytest collects 0
tests without error.

Add openkb/config.py (DEFAULT_CONFIG, load_config, save_config),
openkb/state.py (HashRegistry with SHA-256 file hashing and JSON
persistence), and openkb/schema.py (SCHEMA_MD constant). All 17 tests
were written first (red), then made to pass by the implementation (green).
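A minimal sketch of what a hash registry with SHA-256 hashing and JSON persistence might look like. The class name matches the commit message, but the method names (`hash_file`, `is_known`, `add`) and the on-disk layout are assumptions, not the actual openkb/state.py API.

```python
import hashlib
import json
from pathlib import Path


class HashRegistry:
    """Maps SHA-256 file digests to metadata and persists them as JSON."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.entries: dict[str, dict] = {}
        if path.exists():
            self.entries = json.loads(path.read_text(encoding="utf-8"))

    @staticmethod
    def hash_file(file_path: Path, chunk_size: int = 1 << 16) -> str:
        # Read in chunks so large PDFs are not loaded into memory at once.
        digest = hashlib.sha256()
        with file_path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def is_known(self, file_hash: str) -> bool:
        return file_hash in self.entries

    def add(self, file_hash: str, metadata: dict) -> None:
        # Persist on every write so a crash does not lose registrations.
        self.entries[file_hash] = metadata
        self.path.write_text(json.dumps(self.entries, indent=2), encoding="utf-8")
```

Keying on the content digest rather than the file name means a renamed copy of an already indexed document is still detected as a duplicate.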
Creates full KB directory structure (raw/, wiki/sources/images/,
wiki/summaries/, wiki/concepts/, wiki/reports/), writes SCHEMA.md,
index.md, config.yaml and hashes.json; guards against re-initialisation.
Three tests in tests/test_cli.py cover structure, schema content, and
the already-initialized guard, all via CliRunner.isolated_filesystem.

Implements extract_base64_images and copy_relative_images with full test
coverage for single/multiple images, invalid base64, missing files, and
URL filtering.
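As a rough illustration of the base64 extraction step, the sketch below rewrites inline data-URI images in Markdown to files on disk. The function name `extract_inline_images`, its signature, and the file naming scheme are assumptions for illustration; the real extract_base64_images may differ.

```python
import base64
import binascii
import re
from pathlib import Path

# Matches Markdown images whose target is a base64 data URI.
DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(data:image/(?P<ext>[a-zA-Z0-9]+);base64,(?P<data>[A-Za-z0-9+/=]+)\)"
)


def extract_inline_images(markdown: str, images_dir: Path, prefix: str = "img") -> str:
    """Write each valid inline image to images_dir and link to the file instead."""
    images_dir.mkdir(parents=True, exist_ok=True)
    counter = 0

    def replace(match: re.Match) -> str:
        nonlocal counter
        try:
            payload = base64.b64decode(match.group("data"), validate=True)
        except (binascii.Error, ValueError):
            return match.group(0)  # leave invalid base64 untouched
        counter += 1
        name = f"{prefix}_{counter}.{match.group('ext')}"
        (images_dir / name).write_bytes(payload)
        return f"![{match.group('alt')}]({images_dir.name}/{name})"

    return DATA_URI_IMAGE.sub(replace, markdown)
```

Images referenced by URL never match the data-URI pattern, so they pass through unchanged.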
Implements ConvertResult dataclass, get_pdf_page_count, and
convert_document with hash-dedup, markdown passthrough, PDF long-doc
detection, MarkItDown conversion, and image extraction integration.

Implements render_source_md and render_summary_md with YAML frontmatter,
recursive heading hierarchy (h1–h6 capped), page ranges, and separate
text/summary views for source and summary wiki pages.
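The recursive heading hierarchy with an h6 cap could look roughly like the sketch below. The node keys (`title`, `start_page`, `end_page`, `summary`, `children`) and the function name `render_nodes` are assumptions chosen for illustration and are not taken from the PageIndex output format or the actual renderer.

```python
def render_nodes(nodes: list[dict], depth: int = 1) -> list[str]:
    """Render a document tree as Markdown lines, capping headings at h6."""
    lines: list[str] = []
    for node in nodes:
        hashes = "#" * min(depth, 6)  # Markdown has no heading below h6
        start, end = node["start_page"], node["end_page"]
        pages = f"p. {start}" if start == end else f"pp. {start}-{end}"
        lines.append(f"{hashes} {node['title']} ({pages})")
        if node.get("summary"):
            lines.append(node["summary"])
        # Children are rendered one level deeper than their parent.
        lines.extend(render_nodes(node.get("children", []), depth + 1))
    return lines


tree = [
    {
        "title": "Intro",
        "start_page": 1,
        "end_page": 3,
        "summary": "Overview.",
        "children": [{"title": "Scope", "start_page": 2, "end_page": 2}],
    }
]
print("\n".join(render_nodes(tree)))
```

Nodes nested deeper than six levels keep being rendered, just at h6, so no content is dropped from very deep outlines.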
Implements IndexResult dataclass and index_long_document which creates
a LocalClient with full node text/summary/description flags, adds the
PDF via PageIndex, fetches structure, and writes source and summary
wiki pages via the tree renderer.

Implements list_wiki_files, read_wiki_file, and write_wiki_file as plain
functions in openkb/agent/tools.py without @function_tool decoration,
ready to be wrapped when building the agent. Full test coverage including
edge cases for missing files/dirs, filtering to .md only, and parent dir
creation.

Implements build_compiler_agent, compile_short_doc, compile_long_doc in
openkb/agent/compiler.py with function_tool-wrapped wiki tools and
SCHEMA_MD-enriched instructions. Long-doc variant includes get_page_content.
Tests mock Runner.run to avoid real LLM calls.

Replaces the add stub with full orchestration: convert_document,
index_long_document for long PDFs, and compiler agent calls.
Adds SUPPORTED_EXTENSIONS set, _find_kb_dir, _add_single_file helpers.
Adds python-dotenv dependency and load_dotenv() at startup.

Implements pageindex_retrieve (structure -> LLM relevance -> page fetch),
build_query_agent with list/read/retrieve tools, and run_query coroutine.
Wires up `okb query` in cli.py.

Implements DebouncedHandler (collects events, ignores dirs/dotfiles, resets
timer on burst) and watch_directory (Observer loop, Ctrl+C safe).
Wires up `okb watch` in cli.py.
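The debounce behaviour described for the watcher (collect events, ignore directories and dotfiles, restart the timer on each burst) can be sketched with the standard library alone. This is not the watchdog-based DebouncedHandler itself; `Debouncer`, `on_event`, and `flush` are hypothetical names, and a real handler would call something like `on_event` from its filesystem event callbacks.

```python
import threading
from pathlib import Path
from typing import Callable


class Debouncer:
    """Collects changed paths and delivers them in one batch after a quiet period."""

    def __init__(self, callback: Callable[[list[str]], None], delay: float = 2.0) -> None:
        self.callback = callback
        self.delay = delay
        self._pending: set[str] = set()
        self._timer: threading.Timer | None = None
        self._lock = threading.Lock()

    def on_event(self, path: str, is_directory: bool = False) -> None:
        # Directory events and dotfiles (editor swap files, .DS_Store) are ignored.
        if is_directory or Path(path).name.startswith("."):
            return
        with self._lock:
            self._pending.add(path)
            if self._timer is not None:
                self._timer.cancel()  # a new event in the burst restarts the wait
            self._timer = threading.Timer(self.delay, self.flush)
            self._timer.daemon = True
            self._timer.start()

    def flush(self) -> None:
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
            batch, self._pending = sorted(self._pending), set()
        if batch:
            self.callback(batch)  # called outside the lock to avoid re-entrancy issues
```

Collecting paths in a set means a file saved several times within one burst is compiled only once.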
Implements find_broken_links, find_orphans, find_missing_entries,
check_index_sync, and run_structural_lint with full Markdown report.
Covers wikilink resolution, orphan detection, raw/wiki entry matching,
and index.md sync checking.
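A compact sketch of wikilink resolution and orphan detection, under stated assumptions: links may use either the full relative page name or just the file stem, `[[target|alias]]` and `[[target#heading]]` forms are accepted, and index.md is never reported as an orphan. The function name `scan_wiki` is hypothetical and does not correspond to find_broken_links or find_orphans as implemented.

```python
import re
from pathlib import Path

# Captures the target of [[target]], [[target|alias]] and [[target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def scan_wiki(wiki_dir: Path) -> tuple[dict[str, list[str]], list[str]]:
    """Return (broken links per page, orphan pages) for a wiki directory."""
    pages = {
        p.relative_to(wiki_dir).with_suffix("").as_posix(): p
        for p in sorted(wiki_dir.rglob("*.md"))
    }
    by_stem = {name.rsplit("/", 1)[-1]: name for name in pages}
    broken: dict[str, list[str]] = {}
    linked: set[str] = set()
    for name, path in pages.items():
        for target in WIKILINK.findall(path.read_text(encoding="utf-8")):
            target = target.strip()
            resolved = target if target in pages else by_stem.get(target)
            if resolved is None:
                broken.setdefault(name, []).append(target)
            else:
                linked.add(resolved)
    # A page nobody links to is an orphan; the index is the entry point.
    orphans = [name for name in pages if name not in linked and name != "index"]
    return broken, orphans
```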
Implements build_lint_agent with list/read tools and instructions for
semantic quality checks (contradictions, gaps, staleness, redundancy).
run_knowledge_lint runs the agent and returns the report string.
okb lint combines structural + knowledge lint and writes a timestamped report.

Tests verify that `okb list` shows the documents table and concepts, and
that `okb status` shows per-directory file counts and the total indexed.
Both also check the missing-init guard.

Previously the converter registered the file hash immediately, so if
LLM compilation failed the file was marked as "done" and retries
would skip it. Now the hash is only registered by the CLI after
successful compilation.

Also: install markitdown[all] for PDF support, add python-dotenv.
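The ordering fix described above (register the hash only after compilation succeeds) can be illustrated with a small sketch. `add_document`, the in-memory `known_hashes` set, and the `compile_doc` callable are hypothetical stand-ins for the CLI flow, the hash registry, and the LLM compilation step.

```python
from typing import Callable


def add_document(
    file_hash: str,
    known_hashes: set[str],
    compile_doc: Callable[[], None],
) -> str:
    """Compile a document and record its hash only if compilation succeeded."""
    if file_hash in known_hashes:
        return "skipped"
    try:
        compile_doc()
    except Exception:
        # The hash is NOT registered here, so a later retry is not skipped.
        return "failed"
    known_hashes.add(file_hash)
    return "added"
```

With the old ordering, the equivalent of `known_hashes.add` ran before `compile_doc`, so a failed compilation left the file permanently marked as done.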
…pport

- Switch from col._backend.get_document_structure() to col.get_document_structure()
- Add 3x retry for PageIndex indexing (stochastic TOC accuracy)
- Fix storage path to use .db extension
- Remove .doc from supported extensions (markitdown only supports .docx)
- Note: col.get_page_content() still missing from PageIndex public API,
  using col._backend.get_page_content() as workaround
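The 3x retry mentioned in the list above might be wrapped as follows. This is a generic sketch: `with_retries` is a hypothetical helper, and the commit does not say whether the real code retries on exceptions, on result validation, or both.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(operation: Callable[[], T], attempts: int = 3) -> T:
    """Run operation up to `attempts` times and return the first success."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Exception | None = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as exc:  # assumption: any exception counts as a failed attempt
            last_error = exc
    raise RuntimeError(f"operation failed after {attempts} attempts") from last_error
```

Because the indexing outcome is described as stochastic, simply repeating the same call can succeed where an earlier attempt failed, which is why no backoff is included in this sketch.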
Replace col._backend.get_page_content(col._name, doc_id, spec) with
col.get_page_content(doc_id, spec). Now all PageIndex access uses
public API only.

Mention PageIndex Cloud API as an option for faster long document
indexing, and add pageindex_api_key_env to the configuration example.
@KylinMountain
Collaborator Author

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Generated with Claude Code


KylinMountain added a commit that referenced this pull request May 24, 2026
Architectural review (4 parallel Opus auditors) found that the skill_runner
core was already generic, but the deck SURFACE was still fused to
Editorial Monocle. Fixed:

* validator: now takes optional `grammar` param (DeckGrammar TypedDict);
  skill-agnostic by default (only checks file present, parses, ≥5
  slides, self-contained). Third-party deck skills (guizang, swiss)
  now pass validation cleanly. Editorial-specific rules opt-in via
  `EDITORIAL_MONOCLE_GRAMMAR`. (finding #2)
* skills/openkb-deck-editorial/SKILL.md: declares its grammar +
  output_path_template under `od:` frontmatter — `run_skill` reads
  these and applies them post-run.
* run_skill: now honors frontmatter `od.mode`, `od.output_path_template`,
  `od.deck_grammar`. When mode=="deck" and template is set, the runner
  injects the path into intent, verifies the file exists post-run, and
  runs validate_deck with the skill's grammar. Validation result is
  returned via new SkillRunResult dataclass. (findings #4, #5)
* `openkb deck new --skill <name>`: CLI flag accepts any installed deck
  skill (default openkb-deck-editorial). guizang and swiss now usable
  from the scripted CLI, not only freeform chat. (finding #1)
* `/deck new --skill <name>` chat slash: same flag, parsed positionally
  alongside --critique. (finding #1)
* tests/test_read_kb_file.py: 13 new tests mirroring test_write_kb_file
  for the read-side allow-list. Pins refusal of `.openkb/config.yaml`,
  `.env`, `raw/`, `..` traversal, absolute paths. (finding #6)
* Generator deck branch: no longer calls validate_deck directly; just
  propagates run_deck_create's SkillRunResult.validation up. Validation
  is now a property of "this skill declared mode=deck", not of "this
  CLI path was taken".
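The read-side allow-list pinned by tests/test_read_kb_file.py could be sketched as below. The refused cases (`.openkb/config.yaml`, `.env`, `raw/`, `..` traversal, absolute paths) come from the commit message; the set of allowed top-level directories and the name `resolve_kb_read` are assumptions for illustration only.

```python
from pathlib import Path, PurePosixPath

# Assumption: only the wiki tree is readable; the real allow-list may be broader.
ALLOWED_TOP_LEVEL = {"wiki"}


def resolve_kb_read(kb_root: Path, requested: str) -> Path:
    """Resolve a KB-relative path for reading, refusing anything off the allow-list."""
    rel = PurePosixPath(requested)
    if rel.is_absolute() or ".." in rel.parts:
        raise PermissionError(f"refused: {requested!r}")
    if not rel.parts or rel.parts[0] not in ALLOWED_TOP_LEVEL:
        raise PermissionError(f"refused: {requested!r}")
    root = kb_root.resolve()
    target = root.joinpath(*rel.parts).resolve()
    # Defence in depth: a symlink inside wiki/ must not escape the KB root.
    if not target.is_relative_to(root):
        raise PermissionError(f"refused: {requested!r}")
    return target
```

An allow-list refuses everything by default, so new sensitive files added at the KB root later stay unreadable without any change to the check.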

Existing tests updated:
* tests/test_deck_validator.py: explicit grammar arg on Editorial-
  specific tests; added test_guizang_shape_passes_generic_mode +
  test_missing_cover_ignored_in_generic_mode to pin both modes.
* tests/test_deck_creator.py: mocks return SkillRunResult; new
  test_run_deck_create_honors_skill_name_override for --skill flag.
* tests/test_generator.py: deck dispatch test mocks SkillRunResult.

Below-threshold findings deferred:
* Generator if/else → registry (score 70) — works, just not extensible
  via plugin; future.
* Iteration backup in chat freeform path (score 75) — needs write_kb_file
  hook; separate change.
* run_skill / scan_local_skills / _handle_slash_critique direct tests
  (scores 60-70) — covered indirectly by integration; can add later.

Regression: 538 tests pass (was 523 pre-fix; net +15 = 13 new
read_kb_file tests + 2 new validator-mode tests).

3 participants