🚀 doc-intel

Schema-first document intelligence using hybrid AI (spaCy + local LLMs)

Extract structured data from PDFs (invoices, resumes, contracts, research papers) using a hybrid pipeline:

⚡ Fast NLP (spaCy)
🧠 Local LLMs (Ollama)
📐 Layout-aware parsing (pdfplumber)
✅ Schema validation (Pydantic)

✨ Features

🧠 Hybrid Extraction
- Deterministic rules (high accuracy)
- spaCy for fast entity recognition
- LLM fallback for complex fields
🏠 Local-first AI
- Works with Ollama (Mistral, LLaMA, etc.)
- No cloud API required (OpenAI optional)
📐 Layout-aware parsing
- Extract tables and line items from PDFs
📊 Schema-driven
- Prebuilt schemas (invoice, resume, research)
- Custom schemas via Pydantic
📦 Batch processing
📁 Export to CSV / Excel
🧪 Built-in benchmarks
⚙️ One-command setup

⚙️ Installation

pip install doc-intel

🔥 One-Command Setup

doc-intel setup --full

This will:

- Install the spaCy model
- Check Ollama availability
- Create a sample dataset
- Validate the pipeline

🚀 Quick Start

CLI

doc-intel extract invoice.pdf --schema invoice

Python API

from doc_intel import extract

result = extract("invoice.pdf", schema="invoice")
print(result)

🎬 Demo

doc-intel demo examples/invoice_sample.pdf

Example Output

📄 RAW TEXT (preview)
...

⚡ BEFORE vs AFTER

❌ Unstructured (PDF)
Messy, inconsistent text

✅ Structured Output
invoice_number: INV-001
total: 1250.0
vendor: Acme Corp

🎯 Machine-readable, validated, ready for pipelines

📊 Example JSON Output

{
  "invoice_number": "INV-2024-001",
  "date": "2024-04-01",
  "total": 1250.0,
  "vendor": "Acme Corp",
  "line_items": [
    {"description": "Consulting", "amount": 1000.0},
    {"description": "Expenses", "amount": 250.0}
  ],
  "confidence": 0.94
}
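
Because the output is plain JSON, a downstream consumer can sanity-check it before use. The sketch below uses only the standard library and is not part of doc-intel: it parses the example above and checks that the line items add up to the total. The helper name `line_items_match_total` and the tolerance value are assumptions for illustration.

```python
import json
import math

# The example output shown above, embedded as a string for illustration.
EXAMPLE_OUTPUT = """
{
  "invoice_number": "INV-2024-001",
  "date": "2024-04-01",
  "total": 1250.0,
  "vendor": "Acme Corp",
  "line_items": [
    {"description": "Consulting", "amount": 1000.0},
    {"description": "Expenses", "amount": 250.0}
  ],
  "confidence": 0.94
}
"""


def line_items_match_total(record: dict, tolerance: float = 0.01) -> bool:
    """Return True if the line item amounts sum to the invoice total."""
    items_sum = sum(item["amount"] for item in record.get("line_items", []))
    return math.isclose(items_sum, record["total"], abs_tol=tolerance)


record = json.loads(EXAMPLE_OUTPUT)
print(line_items_match_total(record))
```

A check like this is cheap to run on every record and can catch table-parsing mistakes that schema validation alone would let through.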

📂 Batch Processing

doc-intel batch ./invoices --schema invoice --output results.csv
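
If the batch step needs to be driven from Python instead, the loop can be sketched as follows. This is an illustration, not doc-intel's implementation: `batch_to_csv`, the column list, and the `extract_fn` parameter are hypothetical, and `extract_fn` is assumed to return a dict with the scalar fields of the invoice example.

```python
import csv
from pathlib import Path
from typing import Callable

# Hypothetical column set, taken from the invoice example's scalar fields.
COLUMNS = ["file", "invoice_number", "date", "total", "vendor", "confidence"]


def batch_to_csv(folder: str, output: str, extract_fn: Callable[[str], dict]) -> int:
    """Run extract_fn on every PDF in folder and write one CSV row per file.

    Returns the number of rows written. Nested fields such as line items
    are ignored here to keep the CSV flat.
    """
    pdf_paths = sorted(Path(folder).glob("*.pdf"))
    with open(output, "w", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" drops keys that are not listed in COLUMNS.
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, extrasaction="ignore")
        writer.writeheader()
        for path in pdf_paths:
            record = dict(extract_fn(str(path)))
            record["file"] = path.name
            writer.writerow(record)
    return len(pdf_paths)
```

With doc-intel installed, `extract_fn` could be a small wrapper such as `lambda p: extract(p, schema="invoice")`, assuming `extract` returns a dict-like result; adapt the wrapper if it returns a model object.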

🧪 Benchmark

doc-intel benchmark

Example Results

Field Accuracy: 0.92
Line Items Accuracy: 0.90
Combined Score: 0.91
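
The combined score in this example equals the unweighted mean of the two accuracies. Whether doc-intel computes it that way is an assumption; the sketch below only shows that the arithmetic is consistent with the numbers above. `combined_score` is a hypothetical helper.

```python
def combined_score(field_accuracy: float, line_items_accuracy: float) -> float:
    """Unweighted mean of the two accuracies (an assumed formula)."""
    return (field_accuracy + line_items_accuracy) / 2


print(round(combined_score(0.92, 0.90), 2))
```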

🧩 Custom Schema

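from typing import List

from pydantic import BaseModel

from doc_intel import extract

# Fields to extract, declared as a Pydantic model
class ResearchPaper(BaseModel):
    title: str
    authors: List[str]
    year: int

# Pass the model class itself instead of a prebuilt schema name
result = extract("paper.pdf", schema=ResearchPaper)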

🧠 How It Works

PDF
 ↓
Text Extraction (pypdf)
 ↓
Invoice Engine (rules + regex)
 ↓
Layout Parsing (tables)
 ↓
spaCy (NLP fallback)
 ↓
LLM (only if needed)
 ↓
Schema Validation
 ↓
Confidence Scoring
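
One way to read the diagram is as a cascade: each stage fills in only the fields that earlier stages left empty, so the slower stages run only when needed. The sketch below illustrates that control flow with stand-in stages. All names (`run_cascade`, `rules_stage`, `nlp_stage`, `llm_stage`) and the regular expressions are hypothetical and do not reflect doc-intel's internals.

```python
import re
from typing import Callable, Dict, List, Optional

Fields = Dict[str, Optional[str]]
Stage = Callable[[str, List[str]], Fields]

llm_calls: List[List[str]] = []  # records when the (stand-in) LLM stage runs


def rules_stage(text: str, missing: List[str]) -> Fields:
    """Stand-in for the rules + regex stage: finds an invoice number."""
    found: Fields = {}
    if "invoice_number" in missing:
        match = re.search(r"\bINV-[\w-]+", text)
        if match:
            found["invoice_number"] = match.group(0)
    return found


def nlp_stage(text: str, missing: List[str]) -> Fields:
    """Stand-in for the NLP stage: naive 'From:' line lookup for the vendor."""
    found: Fields = {}
    if "vendor" in missing:
        match = re.search(r"^From:\s*(.+)$", text, flags=re.MULTILINE)
        if match:
            found["vendor"] = match.group(1).strip()
    return found


def llm_stage(text: str, missing: List[str]) -> Fields:
    """Stand-in for the LLM stage: only logs which fields it was asked for."""
    llm_calls.append(list(missing))
    return {}


def run_cascade(text: str, wanted: List[str], stages: List[Stage]) -> Fields:
    """Run stages in order, passing each one only the still-missing fields."""
    result: Fields = {name: None for name in wanted}
    for stage in stages:
        missing = [name for name, value in result.items() if value is None]
        if not missing:
            break  # everything is filled, so later (slower) stages are skipped
        for name, value in stage(text, missing).items():
            if name in missing and value is not None:
                result[name] = value
    return result


sample = "From: Acme Corp\nInvoice INV-2024-001\n"
fields = run_cascade(sample, ["invoice_number", "vendor"],
                     [rules_stage, nlp_stage, llm_stage])
print(fields)
print("LLM stage invoked:", bool(llm_calls))
```

In this sample both fields are resolved before the last stage, so the LLM stand-in is never called. That is the property the "only if needed" step in the diagram describes.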

🔌 Local LLM (Ollama)

Install Ollama:

👉 https://ollama.com

Run a model:

ollama run mistral

doc-intel will automatically use it.
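
To confirm that a model is reachable before running an extraction, Ollama's local HTTP API can be queried directly. The sketch below assumes Ollama's default address (`http://localhost:11434`) and its `/api/tags` endpoint, which lists locally available models. How doc-intel itself detects Ollama is not documented here, and the function names are hypothetical.

```python
import json
import urllib.request
from typing import List


def parse_model_names(payload: dict) -> List[str]:
    """Pull model names out of an /api/tags style response body."""
    return [m["name"] for m in payload.get("models", []) if "name" in m]


def list_ollama_models(base_url: str = "http://localhost:11434",
                       timeout: float = 2.0) -> List[str]:
    """Return local model names, or an empty list if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            return parse_model_names(json.load(response))
    except (OSError, ValueError):
        # OSError covers connection failures and timeouts; ValueError covers bad JSON.
        return []


if __name__ == "__main__":
    models = list_ollama_models()
    print(models if models else "Ollama not reachable or no models pulled")
```

An empty list here usually means the Ollama server is not running or no model has been pulled yet.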

📁 Project Structure

doc_intel/
├── core/
├── schemas/
├── utils/
├── setup/
└── demo/

🛣️ Roadmap

- OCR support (scanned PDFs)
- Higher-accuracy invoice parsing (target: 95–98%+)
- Plugin system (contracts, legal, healthcare)
- Per-field confidence scores
- Streamlit / Web UI
- Dataset expansion

🤝 Contributing

PRs welcome!

Before submitting, run the benchmark:

doc-intel benchmark

📄 License

MIT

⭐ Why doc-intel?

Most tools are:

❌ LLM-only (slow, expensive)
❌ Rule-based (fragile)

👉 doc-intel combines both:

Fast + Accurate + Local-first

🔥 Star this repo if you find it useful!
