Cartographer is a local PDF/EPUB-to-Markdown extraction and content mapping tool designed for high-fidelity document processing and source-grounded knowledge organization.
Cartographer provides three core workflows for document processing:
- Markdown Extraction - Convert PDFs and EPUBs to clean Markdown with intelligent OCR routing
- Content Mapping - Split long Markdown into chapter/section files with deterministic knowledge maps
- Compression - Create source-faithful hierarchical Markdown maps using LLM-powered compression
- Intelligent OCR Routing - Uses pdf-inspector to analyze page-level text quality and route pages to appropriate extraction methods
- Multi-Backend Extraction - Supports MinerU (TXT/OCR), Docling, and other document parsing backends
- Formula Preservation - Maintains LaTeX formulas and mathematical notation through extraction and mapping
- Parallel Processing - Worker-based parallelization for large document processing
- Source-Faithful Output - Designed for information preservation rather than summarization
- Deterministic Mapping - Reproducible content maps with verification stages
- Python: 3.11 or higher
- Operating System: macOS, Linux
- Dependencies: See `pyproject.toml` for the full dependency list
- Codex Binary: Required for `content-map` and `compress` workflows (see Codex Setup)

Key dependencies:

- `docling` - Document parsing and conversion
- `mineru[pipeline]` - PDF/EPUB extraction with OCR capabilities
- `paddleocr` & `paddlepaddle` - OCR engine
- `pdf-inspector` - PDF analysis and OCR routing
- `pydantic` - Data validation
- `pypdf` & `pypdfium2` - PDF processing
- `trafilatura` - Web content extraction
```bash
# Clone the repository
git clone https://github.com/dshap474/cartographer.git
cd cartographer

# Install dependencies
uv sync

# Verify installation
uv run cartographer --help
```

```bash
# Install development dependencies
uv sync --group dev

# Run tests
uv run pytest

# Run linting
uv run ruff check src tests
uv run ruff format src tests

# Type checking
uv run ty
```

The `content-map` and `compress` workflows require a Codex binary for LLM-powered content analysis.
By default, Cartographer looks for `codex` in your `PATH`. Configure via CLI arguments:

- `--codex-bin` - Path to Codex binary (default: `codex`)
- `--model` - Override default model per stage
- `--profile` - Codex profile to use
- `--codex-timeout` - Timeout in seconds (default: 3600)
Cartographer is currently Codex-native. Future releases will add support for:
- Claude exec backend
- Droid exec backend
- Direct OpenAI/Anthropic API integration
Convert PDFs and EPUBs to clean Markdown with intelligent OCR routing:
```bash
# Basic PDF extraction
uv run cartographer markdown-extraction ./book.pdf

# EPUB extraction
uv run cartographer markdown-extraction ./book.epub

# With custom output
uv run cartographer markdown-extraction ./book.pdf \
    --output-dir ./output \
    --output-name my-book

# Parallel processing
uv run cartographer markdown-extraction ./book.pdf --workers 4

# Force OCR mode
uv run cartographer markdown-extraction ./book.pdf --mineru-method ocr

# Remote OCR backend
uv run cartographer markdown-extraction ./book.pdf \
    --ocr-backend remote-pc
```

Options:

- `--output-dir` - Output directory (default: source directory)
- `--output-name` - Output basename without extension
- `--overwrite` - Overwrite existing Markdown (default: true)
- `--mineru-method` - Extraction method: `txt` or `ocr` (default: `txt`)
- `--ocr-backend` - OCR location: `local-mac` or `remote-pc` (default: `local-mac`)
- `--ocr-fallback` - Retry with OCR when TXT extraction fails (default: true)
- `--workers` - Parallel worker count (default: 1)
- PDF Analysis - pdf-inspector analyzes each page for text layer quality
- Intelligent Routing - Pages with usable text layers use MinerU TXT; others use OCR
- Formula Preservation - LaTeX formulas are preserved through extraction
- Quality Control - Automated QC checks for extraction quality
- Cleanup - Temporary artifacts are removed after successful extraction
Output: A single Markdown file beside the source (`book.pdf` → `book.md`)
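The routing decision in step 2 can be sketched as a simple per-page threshold rule. This is illustrative only: the quality score and threshold below are invented stand-ins for whatever pdf-inspector actually reports.

```python
def route_pages(page_quality: dict[int, float],
                threshold: float = 0.8) -> dict[str, list[int]]:
    """Send pages with a usable text layer to MinerU TXT, the rest to OCR.

    Illustrative only: the real quality metric comes from pdf-inspector,
    and the 0.8 threshold here is an assumption.
    """
    routes: dict[str, list[int]] = {"txt": [], "ocr": []}
    for page in sorted(page_quality):
        bucket = "txt" if page_quality[page] >= threshold else "ocr"
        routes[bucket].append(page)
    return routes
```

A scanned page with a damaged text layer (low score) lands in the OCR bucket, while born-digital pages go straight to TXT extraction.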
Split long Markdown into chapter/section files with deterministic knowledge maps:
```bash
# Basic content mapping
uv run cartographer content-map ./book.md --force

# Custom output directory
uv run cartographer content-map ./book.md \
    --output-dir ./mappings \
    --force
```

Options:

- `--output-dir` - Custom mapping output root
- `--force` - Overwrite existing map directory
- Partitioning - Splits Markdown by existing heading structure
- Unit Extraction - Creates compressed bullet maps for each section
- Verification - LLM verifies each unit for information preservation
- Knowledge Map - Generates SVG knowledge graph of content structure
Output Structure:

```
map/
  chapters/
    chapter-1/
      section-1.md
      section-2.md
    chapter-2/
      section-1.md
  knowledge-map.svg
```
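The partitioning stage splits on the document's existing heading structure. A simplified sketch of that idea, assuming a hypothetical `partition_by_headings` helper (the real partitioner also handles nested headings, front matter, and empty sections):

```python
import re


def partition_by_headings(markdown: str, level: int = 2) -> list[tuple[str, str]]:
    """Split Markdown into (title, body) pairs at one heading level.

    Simplified sketch: splits only on a single heading level and
    ignores code fences that might contain '#' lines.
    """
    pattern = re.compile(rf"^{'#' * level} (.+)$", re.MULTILINE)
    matches = list(pattern.finditer(markdown))
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        sections.append((m.group(1), markdown[m.end():end].strip()))
    return sections
```

Because the split points come only from the source's own headings, the same input always yields the same sections, which is what makes the map deterministic.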
Create source-faithful hierarchical Markdown maps:
```bash
# Basic compression
uv run cartographer compress ./book.md --force

# Custom output with parallelization
uv run cartographer compress ./book.md \
    --output-dir ./output \
    --output-name book.compressed \
    --workers 4 \
    --force

# With custom Codex configuration
uv run cartographer compress ./book.md \
    --codex-bin /path/to/codex \
    --model gpt-4 \
    --codex-timeout 7200 \
    --force
```

Options:

- `--output-dir` - Output directory (default: source directory)
- `--output-name` - Output basename without extension
- `--force` - Overwrite existing compressed output
- `--workers` - Parallel worker count (default: 1)
- `--codex-bin` - Path to Codex binary (default: `codex`)
- `--model` - Override default model
- `--profile` - Codex profile
- `--codex-timeout` - Timeout in seconds (default: 3600)
- Temporary Layout - Creates temporary mapping structure
- Parallel Compression - Uses workers to compress sections concurrently
- LLM Verification - Each compressed unit is verified for information preservation
- Rollup - Combines verified units into single hierarchical map
- Cleanup - Removes temporary artifacts, keeping only final output
Output: A single compressed Markdown file (`book.md` → `book.compressed.md`)
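The parallel-compression and rollup steps can be sketched with a thread pool. This is a toy illustration: `compress_section` below just normalizes whitespace as a stand-in for the actual LLM compression call, and the verification step is omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def compress_section(section: str) -> str:
    """Stand-in for the per-section LLM compression call."""
    return " ".join(section.split())


def compress_all(sections: list[str], workers: int = 4) -> str:
    """Compress sections concurrently, then roll them up in input order.

    Sketch of the --workers stage; the real pipeline also verifies
    each compressed unit before the rollup.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        compressed = list(pool.map(compress_section, sections))
    return "\n\n".join(compressed)
```

`Executor.map` preserves input order regardless of which worker finishes first, so the rollup stays deterministic even under parallelism.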
```
cartographer/
├── src/cartographer/
│   ├── markdown_extraction/   # PDF/EPUB extraction layer
│   │   ├── adapter/           # Extraction backends (MinerU, Docling, etc.)
│   │   ├── formulas.py        # Formula/math preservation
│   │   ├── qc.py              # Quality control
│   │   └── extract.py         # Main extraction orchestration
│   ├── content_mapping/       # Content mapping layer
│   │   ├── prompts/           # LLM prompts for mapping stages
│   │   ├── codex.py           # Codex integration
│   │   ├── partition.py       # Content partitioning logic
│   │   └── run.py             # Mapping orchestration
│   ├── api.py                 # Public API surface
│   ├── cli.py                 # Command-line interface
│   └── config.py              # Runtime configuration
├── tests/                     # Test suite
└── pyproject.toml             # Project configuration
```
- Layer Separation - Extraction and mapping layers are strictly separated
- Source Faithfulness - Prioritizes information preservation over summarization
- Deterministic Output - Same inputs produce same outputs
- Verification Gates - Multiple quality checks throughout pipelines
- Parallel Processing - Worker-based parallelization for performance
```bash
# Run all tests
uv run pytest

# Run specific test file
uv run pytest tests/test_cli_api_surface.py

# Run with coverage
uv run pytest --cov=cartographer
```

```bash
# Linting
uv run ruff check src tests

# Formatting
uv run ruff format src tests

# Type checking
uv run ty
```

Extension points:

- Extraction Backends - Add to `src/cartographer/markdown_extraction/adapter/`
- Mapping Prompts - Add to `src/cartographer/content_mapping/prompts/`
- CLI Commands - Extend `src/cartographer/cli.py` and `src/cartographer/api.py`
Edit `src/cartographer/config.py` to modify the default Codex model configurations:

```python
EXTRACT_UNIT_MODEL = "gpt-5.5"
EXTRACT_UNIT_REASONING = ""
VERIFY_PARTITION_MODEL = "gpt-5.5"
VERIFY_PARTITION_REASONING = ""
VERIFY_UNIT_MODEL = "gpt-5.5"
VERIFY_UNIT_REASONING = ""
```

Currently, Codex configuration is handled via CLI arguments. Future versions will support environment variable configuration.
- Codex Dependency - The `content-map` and `compress` workflows require a Codex binary
- Platform Support - Primarily tested on macOS and Linux
- Memory Usage - Large documents may require significant memory
- OCR Quality - Output quality depends on OCR engine capabilities
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Ensure all tests pass and code is formatted
- Submit a pull request
- Follow existing code style (Ruff formatting)
- Add tests for new features
- Update documentation as needed
- Keep the public API surface small and stable
MIT License - see LICENSE for details.
- Add Claude exec backend support
- Add Droid exec backend support
- Direct OpenAI/Anthropic API integration
- Environment variable configuration
- Windows platform support
- Additional document format support (DOCX, PPTX)
- Web UI for document processing
- Enhanced formula/math processing
- Batch processing capabilities
- Built with MinerU for PDF/EPUB extraction
- Uses pdf-inspector for intelligent OCR routing
- Powered by Codex for content mapping