Schema-first document intelligence using hybrid AI (spaCy + local LLMs)
Extract structured data from PDFs (invoices, resumes, contracts, research papers) using a hybrid pipeline:
- ⚡ Fast NLP (spaCy)
- 🧠 Local LLMs (Ollama)
- 📐 Layout-aware parsing (pdfplumber)
- ✅ Schema validation (Pydantic)
🧠 Hybrid Extraction
- Deterministic rules (high accuracy)
- spaCy for fast entity recognition
- LLM fallback for complex fields
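A minimal sketch of the rules-first idea above, assuming a regex table per field and an optional fallback callable. The names `RULES`, `extract_field`, and `SAMPLE` are hypothetical and not part of doc-intel's API; the real rule set is not shown in this README.

```python
import re
from typing import Callable, Optional

# Hypothetical rule table: field name -> regex with one capture group.
RULES = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|#|Number)?\s*[:\-]?\s*(INV-[\w-]+)", re.I
    ),
    "total": re.compile(r"\bTotal\b\s*[:\-]?\s*\$?\s*([\d,]+(?:\.\d+)?)", re.I),
}

SAMPLE = "Invoice No: INV-2024-001\nVendor: Acme Corp\nTotal: $1,250.00"


def extract_field(
    text: str,
    field: str,
    fallback: Optional[Callable[[str, str], Optional[str]]] = None,
) -> Optional[str]:
    """Try the deterministic rule first; call the fallback only on a miss."""
    rule = RULES.get(field)
    match = rule.search(text) if rule else None
    if match:
        return match.group(1)
    # No rule, or the rule did not match: defer to a slower extractor
    # (spaCy or an LLM in the real pipeline; any callable here).
    return fallback(text, field) if fallback else None


print(extract_field(SAMPLE, "invoice_number"))
print(extract_field(SAMPLE, "vendor", fallback=lambda t, f: "needs NLP/LLM"))
```

The fallback is only invoked when the cheap rule fails, which is what keeps the common case fast.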
🏠 Local-first AI
- Works with Ollama (Mistral, LLaMA, etc.)
- No API required (OpenAI optional)
📐 Layout-aware parsing
- Extract tables and line items from PDFs
📊 Schema-driven
- Prebuilt schemas (invoice, resume, research)
- Custom schemas via Pydantic
📦 Batch processing
📁 Export to CSV / Excel
🧪 Built-in benchmarks
⚙️ One-command setup
pip install doc-intel
doc-intel setup --full
This will:
- Install spaCy model
- Check Ollama availability
- Create sample dataset
- Validate pipeline
doc-intel extract invoice.pdf --schema invoice
from doc_intel import extract
result = extract("invoice.pdf", schema="invoice")
print(result)
doc-intel demo examples/invoice_sample.pdf
📄 RAW TEXT (preview)
...
⚡ BEFORE vs AFTER
❌ Unstructured (PDF)
Messy, inconsistent text
✅ Structured Output
invoice_number: INV-001
total: 1250.0
vendor: Acme Corp
🎯 Machine-readable, validated, ready for pipelines
{
  "invoice_number": "INV-2024-001",
  "date": "2024-04-01",
  "total": 1250.0,
  "vendor": "Acme Corp",
  "line_items": [
    {"description": "Consulting", "amount": 1000.0},
    {"description": "Expenses", "amount": 250.0}
  ],
  "confidence": 0.94
}
doc-intel batch ./invoices --schema invoice --output results.csv
doc-intel benchmark
Field Accuracy: 0.92
Line Items Accuracy: 0.90
Combined Score: 0.91
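The README does not say how these scores are computed. The sketch below is one plausible reading, assuming exact-match scoring per field and an unweighted mean for the combined score (which is consistent with 0.92 and 0.90 giving 0.91). All function names are hypothetical.

```python
def field_accuracy(predicted: dict, expected: dict) -> float:
    """Fraction of expected scalar fields whose predicted value matches exactly."""
    keys = [k for k in expected if k != "line_items"]
    if not keys:
        return 0.0
    hits = sum(predicted.get(k) == expected[k] for k in keys)
    return hits / len(keys)


def line_items_accuracy(predicted: dict, expected: dict) -> float:
    """Fraction of expected line items that appear in the prediction."""
    expected_items = expected.get("line_items", [])
    if not expected_items:
        return 0.0
    predicted_items = predicted.get("line_items", [])
    hits = sum(item in predicted_items for item in expected_items)
    return hits / len(expected_items)


def combined_score(field_acc: float, items_acc: float) -> float:
    """Assumption: unweighted mean of the two accuracies."""
    return (field_acc + items_acc) / 2


expected = {
    "invoice_number": "INV-2024-001",
    "date": "2024-04-01",
    "total": 1250.0,
    "vendor": "Acme Corp",
    "line_items": [
        {"description": "Consulting", "amount": 1000.0},
        {"description": "Expenses", "amount": 250.0},
    ],
}
# A prediction with a wrong date and one missing line item.
predicted = {
    "invoice_number": "INV-2024-001",
    "date": "2024-01-04",
    "total": 1250.0,
    "vendor": "Acme Corp",
    "line_items": [{"description": "Consulting", "amount": 1000.0}],
}

fa = field_accuracy(predicted, expected)
la = line_items_accuracy(predicted, expected)
print(fa, la, combined_score(fa, la))
```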
from typing import List
from pydantic import BaseModel
from doc_intel import extract

# Custom schema: each attribute is a field to extract and validate.
class ResearchPaper(BaseModel):
    title: str
    authors: List[str]
    year: int

result = extract("paper.pdf", schema=ResearchPaper)

PDF
↓
Text Extraction (pypdf)
↓
Invoice Engine (rules + regex)
↓
Layout Parsing (tables)
↓
spaCy (NLP fallback)
↓
LLM (only if needed)
↓
Schema Validation
↓
Confidence Scoring
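The flow above can be sketched as a cascade in which each stage only sees the fields still missing, so the LLM stage runs only if earlier stages leave gaps. This is an illustration, not doc-intel's internals: the `Stage` interface, `run_pipeline`, the stub stages, and the coverage-based confidence are all assumptions, and schema validation is left out for brevity.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stage interface: given the text and the fields still
# missing, return whatever subset the stage could extract.
Stage = Callable[[str, List[str]], Dict[str, object]]


def run_pipeline(
    text: str, required: List[str], stages: List[Tuple[str, Stage]]
) -> dict:
    fields: Dict[str, object] = {}
    stages_run: List[str] = []
    for name, stage in stages:
        missing = [f for f in required if f not in fields]
        if not missing:
            break  # everything found; skip the more expensive stages
        stages_run.append(name)
        for key, value in stage(text, missing).items():
            if key in missing and value is not None:
                fields[key] = value
    # Assumption: confidence is simply the share of required fields filled.
    confidence = len(fields) / len(required) if required else 0.0
    return {"fields": fields, "stages_run": stages_run, "confidence": confidence}


# Stub stages standing in for the rules engine, spaCy, and the LLM.
def rules_stage(text: str, missing: List[str]) -> Dict[str, object]:
    return {"invoice_number": "INV-001", "total": 1250.0}


def spacy_stage(text: str, missing: List[str]) -> Dict[str, object]:
    return {"vendor": "Acme Corp"}


def llm_stage(text: str, missing: List[str]) -> Dict[str, object]:
    raise AssertionError("not reached when earlier stages fill every field")


result = run_pipeline(
    "raw invoice text",
    ["invoice_number", "total", "vendor"],
    [("rules", rules_stage), ("spacy", spacy_stage), ("llm", llm_stage)],
)
print(result)
```

Ordering stages from cheapest to most expensive is the design choice that makes the LLM an exception path rather than the default.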
Install Ollama:
Run a model:
ollama run mistral
doc-intel will automatically use it.
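How doc-intel detects Ollama is not described here. As a rough sketch, a local Ollama server can be probed through its `/api/tags` endpoint, which lists the models pulled locally; the default address `http://localhost:11434` is Ollama's standard one. The helper name `list_ollama_models` is hypothetical.

```python
import json
import urllib.request
from typing import List


def list_ollama_models(
    base_url: str = "http://localhost:11434", timeout: float = 2.0
) -> List[str]:
    """Return names of locally available models, or [] if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers a response body that is not valid JSON.
        return []
    if not isinstance(payload, dict):
        return []
    return [m.get("name", "") for m in payload.get("models", [])]


models = list_ollama_models()
print("Ollama models:", models if models else "none found (is the server running?)")
```

An empty list can then be treated as "skip the LLM stage" rather than as an error.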
doc_intel/
├── core/
├── schemas/
├── utils/
├── setup/
└── demo/
- OCR support (scanned PDFs)
- Higher-accuracy invoice parsing (95–98%+)
- Plugin system (contracts, legal, healthcare)
- Confidence per field
- Streamlit / Web UI
- Dataset expansion
PRs welcome!
Before submitting:
doc-intel benchmark
MIT
Most tools are:
- ❌ LLM-only (slow, expensive)
- ❌ Rule-based (fragile)
👉 doc-intel combines both:
Fast + Accurate + Local-first