
Cartographer

Cartographer is a local PDF/EPUB-to-Markdown extraction and content mapping tool designed for high-fidelity document processing and source-grounded knowledge organization.

Overview

Cartographer provides three core workflows for document processing:

  • Markdown Extraction - Convert PDFs and EPUBs to clean Markdown with intelligent OCR routing
  • Content Mapping - Split long Markdown into chapter/section files with deterministic knowledge maps
  • Compression - Create source-faithful hierarchical Markdown maps using LLM-powered compression

Key Features

  • Intelligent OCR Routing - Uses pdf-inspector to analyze page-level text quality and route pages to appropriate extraction methods
  • Multi-Backend Extraction - Supports MinerU (TXT/OCR), Docling, and other document parsing backends
  • Formula Preservation - Maintains LaTeX formulas and mathematical notation through extraction and mapping
  • Parallel Processing - Worker-based parallelization for large document processing
  • Source-Faithful Output - Designed for information preservation rather than summarization
  • Deterministic Mapping - Reproducible content maps with verification stages

Requirements

  • Python: 3.11 or higher
  • Operating System: macOS, Linux
  • Dependencies: See pyproject.toml for full dependency list
  • Codex Binary: Required for content-map and compress workflows (see Codex Setup)

Core Dependencies

  • docling - Document parsing and conversion
  • mineru[pipeline] - PDF/EPUB extraction with OCR capabilities
  • paddleocr & paddlepaddle - OCR engine
  • pdf-inspector - PDF analysis and OCR routing
  • pydantic - Data validation
  • pypdf & pypdfium2 - PDF processing
  • trafilatura - Web content extraction

Installation

Using uv (Recommended)

# Clone the repository
git clone https://github.com/dshap474/cartographer.git
cd cartographer

# Install dependencies
uv sync

# Verify installation
uv run cartographer --help

Development Setup

# Install development dependencies
uv sync --group dev

# Run tests
uv run pytest

# Run linting
uv run ruff check src tests
uv run ruff format src tests

# Type checking
uv run ty

Codex Setup

The content-map and compress workflows require a Codex binary for LLM-powered content analysis.

Configuration

By default, Cartographer looks for codex in your PATH. Configure via CLI arguments:

  • --codex-bin - Path to Codex binary (default: codex)
  • --model - Override default model per stage
  • --profile - Codex profile to use
  • --codex-timeout - Timeout in seconds (default: 3600)

Future Backend Support

Cartographer is currently Codex-native. Future releases will add support for:

  • Claude exec backend
  • Droid exec backend
  • Direct OpenAI/Anthropic API integration

Usage

Markdown Extraction

Convert PDFs and EPUBs to clean Markdown with intelligent OCR routing:

# Basic PDF extraction
uv run cartographer markdown-extraction ./book.pdf

# EPUB extraction
uv run cartographer markdown-extraction ./book.epub

# With custom output
uv run cartographer markdown-extraction ./book.pdf \
  --output-dir ./output \
  --output-name my-book

# Parallel processing
uv run cartographer markdown-extraction ./book.pdf --workers 4

# Force OCR mode
uv run cartographer markdown-extraction ./book.pdf --mineru-method ocr

# Remote OCR backend
uv run cartographer markdown-extraction ./book.pdf \
  --ocr-backend remote-pc

Extraction Options

  • --output-dir - Output directory (default: source directory)
  • --output-name - Output basename without extension
  • --overwrite - Overwrite existing Markdown (default: true)
  • --mineru-method - Extraction method: txt or ocr (default: txt)
  • --ocr-backend - OCR location: local-mac or remote-pc (default: local-mac)
  • --ocr-fallback - Retry with OCR when TXT extraction fails (default: true)
  • --workers - Parallel worker count (default: 1)

How It Works

  1. PDF Analysis - pdf-inspector analyzes each page for text layer quality
  2. Intelligent Routing - Pages with usable text layers use MinerU TXT; others use OCR
  3. Formula Preservation - LaTeX formulas are preserved through extraction
  4. Quality Control - Automated QC checks for extraction quality
  5. Cleanup - Temporary artifacts are removed after successful extraction

Output: a single Markdown file beside the source (book.pdf → book.md)
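The routing decision in step 2 can be sketched in a few lines. This is an illustrative assumption, not the actual pdf-inspector API: the per-page coverage scores and the threshold value are hypothetical stand-ins for whatever quality signal pdf-inspector reports.

```python
# Illustrative sketch of per-page OCR routing (NOT the real
# pdf-inspector interface): pages whose text layer covers enough of
# the page go to MinerU TXT extraction; the rest go to OCR.

def route_pages(page_text_coverage, threshold=0.85):
    """Map 1-based page numbers to an extraction method.

    page_text_coverage: one float in [0, 1] per page, a hypothetical
    score for how much of the page the embedded text layer covers.
    """
    routes = {}
    for page_num, coverage in enumerate(page_text_coverage, start=1):
        routes[page_num] = "txt" if coverage >= threshold else "ocr"
    return routes
```

Under this sketch, a scanned page with little or no text layer falls below the threshold and is routed to OCR, while born-digital pages keep their faster TXT path.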

Content Mapping

Split long Markdown into chapter/section files with deterministic knowledge maps:

# Basic content mapping
uv run cartographer content-map ./book.md --force

# Custom output directory
uv run cartographer content-map ./book.md \
  --output-dir ./mappings \
  --force

Mapping Options

  • --output-dir - Custom mapping output root
  • --force - Overwrite existing map directory

How It Works

  1. Partitioning - Splits Markdown by existing heading structure
  2. Unit Extraction - Creates compressed bullet maps for each section
  3. Verification - LLM verifies each unit for information preservation
  4. Knowledge Map - Generates SVG knowledge graph of content structure

Output Structure:

map/
  chapters/
    chapter-1/
      section-1.md
      section-2.md
    chapter-2/
      section-1.md
  knowledge-map.svg
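The partitioning stage (step 1 above) can be approximated with stdlib Python. This is a minimal sketch of heading-based splitting; the real logic in src/cartographer/content_mapping/partition.py may handle more heading levels and edge cases.

```python
# Minimal sketch of heading-based partitioning: split a Markdown
# document into (heading, body) sections at level-2 headings.
# Content before the first heading gets heading=None.
import re

def partition_markdown(text):
    sections = []
    heading = None
    body = []
    for line in text.splitlines():
        if re.match(r"^##\s+", line):
            # Close out the previous section before starting a new one.
            if heading is not None or body:
                sections.append((heading, "\n".join(body).strip()))
            heading = line.lstrip("# ").strip()
            body = []
        else:
            body.append(line)
    sections.append((heading, "\n".join(body).strip()))
    return sections
```

Because the split points are fully determined by the heading structure, the same input always yields the same partition, which is what makes the downstream knowledge map deterministic.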

Compression

Create source-faithful hierarchical Markdown maps:

# Basic compression
uv run cartographer compress ./book.md --force

# Custom output with parallelization
uv run cartographer compress ./book.md \
  --output-dir ./output \
  --output-name book.compressed \
  --workers 4 \
  --force

# With custom Codex configuration
uv run cartographer compress ./book.md \
  --codex-bin /path/to/codex \
  --model gpt-4 \
  --codex-timeout 7200 \
  --force

Compression Options

  • --output-dir - Output directory (default: source directory)
  • --output-name - Output basename without extension
  • --force - Overwrite existing compressed output
  • --workers - Parallel worker count (default: 1)
  • --codex-bin - Path to Codex binary (default: codex)
  • --model - Override default model
  • --profile - Codex profile
  • --codex-timeout - Timeout in seconds (default: 3600)

How It Works

  1. Temporary Layout - Creates temporary mapping structure
  2. Parallel Compression - Uses workers to compress sections concurrently
  3. LLM Verification - Each compressed unit is verified for information preservation
  4. Rollup - Combines verified units into single hierarchical map
  5. Cleanup - Removes temporary artifacts, keeping only final output

Output: a single compressed Markdown file (book.md → book.compressed.md)
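Step 2's worker-based fan-out can be sketched with the stdlib as follows. The real pipeline shells out to the Codex binary per unit, so compress_section here is a hypothetical stand-in for that LLM call, not Cartographer's actual implementation.

```python
# Sketch of the parallel-compression stage: run a per-section
# compression function across a worker pool, preserving input order
# so the rollup step can reassemble the hierarchy deterministically.
from concurrent.futures import ThreadPoolExecutor

def compress_section(section):
    # Placeholder for the LLM-backed compress-and-verify call.
    return section.strip()[:100]

def compress_all(sections, workers=4):
    """Compress sections concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compress_section, sections))
```

ThreadPoolExecutor.map returns results in input order regardless of which worker finishes first, which is why the rollup can simply concatenate the outputs.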

Architecture

Project Structure

cartographer/
├── src/cartographer/
│   ├── markdown_extraction/     # PDF/EPUB extraction layer
│   │   ├── adapter/            # Extraction backends (MinerU, Docling, etc.)
│   │   ├── formulas.py         # Formula/math preservation
│   │   ├── qc.py               # Quality control
│   │   └── extract.py          # Main extraction orchestration
│   ├── content_mapping/        # Content mapping layer
│   │   ├── prompts/            # LLM prompts for mapping stages
│   │   ├── codex.py            # Codex integration
│   │   ├── partition.py        # Content partitioning logic
│   │   └── run.py              # Mapping orchestration
│   ├── api.py                  # Public API surface
│   ├── cli.py                  # Command-line interface
│   └── config.py               # Runtime configuration
├── tests/                      # Test suite
└── pyproject.toml             # Project configuration

Design Principles

  • Layer Separation - Extraction and mapping layers are strictly separated
  • Source Faithfulness - Prioritizes information preservation over summarization
  • Deterministic Output - Same inputs produce same outputs
  • Verification Gates - Multiple quality checks throughout pipelines
  • Parallel Processing - Worker-based parallelization for performance

Development

Running Tests

# Run all tests
uv run pytest

# Run specific test file
uv run pytest tests/test_cli_api_surface.py

# Run with coverage
uv run pytest --cov=cartographer

Code Quality

# Linting
uv run ruff check src tests

# Formatting
uv run ruff format src tests

# Type checking
uv run ty

Adding New Features

  1. Extraction Backends - Add to src/cartographer/markdown_extraction/adapter/
  2. Mapping Prompts - Add to src/cartographer/content_mapping/prompts/
  3. CLI Commands - Extend src/cartographer/cli.py and src/cartographer/api.py
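A new extraction backend might follow an interface shaped like the one below. This Protocol is illustrative only; the repository's actual adapter contract under src/cartographer/markdown_extraction/adapter/ may differ.

```python
# Hypothetical shape of an extraction backend; the names and
# signature are assumptions, not the repository's real contract.
from typing import Protocol

class ExtractionBackend(Protocol):
    name: str

    def extract(self, source_path: str) -> str:
        """Return the extracted Markdown for source_path."""
        ...

class EchoBackend:
    """Trivial backend used only to illustrate the interface."""
    name = "echo"

    def extract(self, source_path: str) -> str:
        return f"# Extracted from {source_path}\n"
```

Keeping backends behind one small interface is what lets the orchestrator in extract.py route pages to MinerU TXT, OCR, or Docling without the caller caring which engine ran.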

Configuration

Runtime Defaults

Edit src/cartographer/config.py to modify default Codex model configurations:

EXTRACT_UNIT_MODEL = "gpt-5.5"
EXTRACT_UNIT_REASONING = ""
VERIFY_PARTITION_MODEL = "gpt-5.5"
VERIFY_PARTITION_REASONING = ""
VERIFY_UNIT_MODEL = "gpt-5.5"
VERIFY_UNIT_REASONING = ""

Environment Variables

Currently, Codex configuration is handled via CLI arguments. Future versions will support environment variable configuration.

Limitations

  • Codex Dependency - Content mapping requires Codex binary
  • Platform Support - Primarily tested on macOS and Linux
  • Memory Usage - Large documents may require significant memory
  • OCR Quality - Output quality depends on OCR engine capabilities

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Ensure all tests pass and code is formatted
  5. Submit a pull request

Development Guidelines

  • Follow existing code style (Ruff formatting)
  • Add tests for new features
  • Update documentation as needed
  • Keep the public API surface small and stable

License

MIT License - see LICENSE for details.

Roadmap

  • Add Claude exec backend support
  • Add Droid exec backend support
  • Direct OpenAI/Anthropic API integration
  • Environment variable configuration
  • Windows platform support
  • Additional document format support (DOCX, PPTX)
  • Web UI for document processing
  • Enhanced formula/math processing
  • Batch processing capabilities

Acknowledgments

  • Built with MinerU for PDF/EPUB extraction
  • Uses pdf-inspector for intelligent OCR routing
  • Powered by Codex for content mapping
