alexhanganu/doc-intel


🚀 doc-intel

Schema-first document intelligence using hybrid AI (spaCy + local LLMs)

Extract structured data from PDFs such as invoices, resumes, contracts, and research papers using a hybrid pipeline:

  • ⚡ Fast NLP (spaCy)
  • 🧠 Local LLMs (Ollama)
  • 📐 Layout-aware parsing (pdfplumber)
  • ✅ Schema validation (Pydantic)

✨ Features

  • 🧠 Hybrid Extraction

    • Deterministic rules (high accuracy)
    • spaCy for fast entity recognition
    • LLM fallback for complex fields
  • 🏠 Local-first AI

    • Works with Ollama (Mistral, LLaMA, etc.)
    • No API required (OpenAI optional)
  • 📐 Layout-aware parsing

    • Extract tables and line items from PDFs
  • 📊 Schema-driven

    • Prebuilt schemas (invoice, resume, research)
    • Custom schemas via Pydantic
  • 📦 Batch processing

  • 📁 Export to CSV / Excel

  • 🧪 Built-in benchmarks

  • ⚙️ One-command setup


⚙️ Installation

pip install doc-intel

🔥 One-Command Setup

doc-intel setup --full

This will:

  • Install spaCy model
  • Check Ollama availability
  • Create sample dataset
  • Validate pipeline
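The checks listed above can be approximated by hand before running setup. The following is a minimal sketch using only the Python standard library; `preflight` is a hypothetical helper, not part of doc-intel, and it only tests whether the spaCy package and an `ollama` executable are visible in the current environment.

```python
import importlib.util
import shutil


def preflight() -> dict:
    """Report whether the optional dependencies appear to be installed."""
    return {
        # True when the spaCy package can be imported in this environment.
        "spacy": importlib.util.find_spec("spacy") is not None,
        # True when an `ollama` executable is on PATH (it may still not be running).
        "ollama_cli": shutil.which("ollama") is not None,
    }


status = preflight()
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'missing'}")
```

A "missing" entry here does not necessarily mean setup will fail; it only indicates what the current interpreter and shell can see.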

🚀 Quick Start

CLI

doc-intel extract invoice.pdf --schema invoice

Python API

from doc_intel import extract

result = extract("invoice.pdf", schema="invoice")
print(result)

🎬 Demo

doc-intel demo examples/invoice_sample.pdf

Example Output

📄 RAW TEXT (preview)
...

⚡ BEFORE vs AFTER

❌ Unstructured (PDF)
Messy, inconsistent text

✅ Structured Output
invoice_number: INV-001
total: 1250.0
vendor: Acme Corp

🎯 Machine-readable, validated, ready for pipelines

📊 Example JSON Output

{
  "invoice_number": "INV-2024-001",
  "date": "2024-04-01",
  "total": 1250.0,
  "vendor": "Acme Corp",
  "line_items": [
    {"description": "Consulting", "amount": 1000.0},
    {"description": "Expenses", "amount": 250.0}
  ],
  "confidence": 0.94
}
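Structured output like this lends itself to simple downstream checks. As an illustration, the sketch below parses the sample JSON with the standard library and confirms that the line items add up to the total. The helper `line_items_match_total` and its tolerance are assumptions made for this example, not a doc-intel feature.

```python
import json
import math

SAMPLE = """
{
  "invoice_number": "INV-2024-001",
  "date": "2024-04-01",
  "total": 1250.0,
  "vendor": "Acme Corp",
  "line_items": [
    {"description": "Consulting", "amount": 1000.0},
    {"description": "Expenses", "amount": 250.0}
  ],
  "confidence": 0.94
}
"""


def line_items_match_total(record: dict, tolerance: float = 0.01) -> bool:
    """Return True when the line item amounts sum to the stated total."""
    items_sum = sum(item["amount"] for item in record["line_items"])
    return math.isclose(items_sum, record["total"], abs_tol=tolerance)


record = json.loads(SAMPLE)
print(line_items_match_total(record))  # True: 1000.0 + 250.0 == 1250.0
```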

📂 Batch Processing

doc-intel batch ./invoices --schema invoice --output results.csv
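The column layout of `results.csv` is not documented here. For readers who want the same kind of export from the Python API, the sketch below shows one possible approach: one row per document, with nested values such as `line_items` stored as JSON strings. `write_results_csv` and `fake_extract` are hypothetical names; in practice the extractor argument would wrap a call to `extract`.

```python
import csv
import io
import json
from typing import Callable, Iterable, TextIO


def write_results_csv(
    paths: Iterable[str],
    extract_fn: Callable[[str], dict],
    out: TextIO,
) -> int:
    """Run extract_fn on each path and write one CSV row per document."""
    rows = []
    for path in paths:
        record = dict(extract_fn(path))
        # Nested values (e.g. line_items) do not fit a flat cell, so store them as JSON.
        flat = {
            key: json.dumps(value) if isinstance(value, (list, dict)) else value
            for key, value in record.items()
        }
        rows.append({"file": path, **flat})

    # Use the union of keys, in first-seen order, as the header.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


def fake_extract(path: str) -> dict:
    # Stand-in for a real extractor; returns fixed data for illustration.
    return {"invoice_number": "INV-001", "total": 1250.0, "vendor": "Acme Corp"}


buffer = io.StringIO()
count = write_results_csv(["a.pdf", "b.pdf"], fake_extract, buffer)
print(count)
print(buffer.getvalue())
```

Passing the extractor as an argument keeps the CSV logic testable without any PDFs on disk.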

🧪 Benchmark

doc-intel benchmark

Example Results

Field Accuracy: 0.92
Line Items Accuracy: 0.90
Combined Score: 0.91
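How the combined score is derived is not stated. The numbers shown are consistent with an unweighted mean of the two accuracies, so the sketch below assumes that. `field_accuracy_score` is likewise a hypothetical exact-match definition, not doc-intel's benchmark code.

```python
def field_accuracy_score(predicted: dict, expected: dict) -> float:
    """Fraction of expected fields whose predicted value matches exactly."""
    if not expected:
        return 0.0
    hits = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return hits / len(expected)


# Assumption: combined score = unweighted mean of the two reported accuracies.
field_accuracy = 0.92
line_items_accuracy = 0.90
combined = (field_accuracy + line_items_accuracy) / 2
print(round(combined, 2))  # 0.91
```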

🧩 Custom Schema

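from typing import List

from pydantic import BaseModel

from doc_intel import extract

# Custom schema: each attribute is a field to extract from the document.
class ResearchPaper(BaseModel):
    title: str
    authors: List[str]
    year: int

# Pass the model class itself in place of a built-in schema name.
result = extract("paper.pdf", schema=ResearchPaper)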

🧠 How It Works

PDF
 ↓
Text Extraction (pypdf)
 ↓
Invoice Engine (rules + regex)
 ↓
Layout Parsing (tables)
 ↓
spaCy (NLP fallback)
 ↓
LLM (only if needed)
 ↓
Schema Validation
 ↓
Confidence Scoring
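The key idea in the flow above is that cheap deterministic stages run first and slower stages only see the fields that are still missing. The sketch below illustrates that cascade with the standard library. The regex patterns, `rule_stage`, `extract_fields`, and `stub_llm` are all assumptions made for this example and do not reflect doc-intel's internals.

```python
import re
from typing import Callable, Dict, Iterable, Optional

# Illustrative patterns only; real invoices vary widely.
INVOICE_NO = re.compile(r"\bINV-[\w-]+")
TOTAL = re.compile(r"(?i)\btotal\b\D*(\d+(?:\.\d+)?)")

Fallback = Callable[[str, str], Optional[object]]


def rule_stage(text: str) -> Dict[str, Optional[object]]:
    """Deterministic first pass: fill whatever the regexes can find."""
    fields: Dict[str, Optional[object]] = {"invoice_number": None, "total": None}
    if match := INVOICE_NO.search(text):
        fields["invoice_number"] = match.group(0)
    if match := TOTAL.search(text):
        fields["total"] = float(match.group(1))
    return fields


def extract_fields(text: str, fallbacks: Iterable[Fallback] = ()) -> dict:
    """Run the rule stage, then each fallback only for fields still missing."""
    fields = rule_stage(text)
    for fallback in fallbacks:
        missing = [name for name, value in fields.items() if value is None]
        if not missing:
            break  # everything is filled, so slower stages are skipped
        for name in missing:
            fields[name] = fallback(name, text)
    return fields


calls = []


def stub_llm(name: str, text: str) -> str:
    # Stand-in for an NLP or LLM stage; records which fields it was asked for.
    calls.append(name)
    return f"<{name} from fallback>"


print(extract_fields("Invoice INV-001\nTotal: 1250.00", [stub_llm]))
print(calls)  # [] because the rules filled every field
```

When the rules cover every field, the fallback is never called, which is where the speed and cost savings of a hybrid design come from.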

🔌 Local LLM (Ollama)

Install Ollama:

👉 https://ollama.com

Run a model:

ollama run mistral

doc-intel will automatically use it.
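To confirm that Ollama is reachable before running an extraction, its local HTTP API can be queried directly. The sketch below assumes Ollama's default address (`http://localhost:11434`) and its `/api/tags` endpoint, which lists locally available models. The helper names are hypothetical, and this is not necessarily how doc-intel performs its own check.

```python
import json
import urllib.request
from typing import List, Optional


def parse_model_names(payload: dict) -> List[str]:
    """Pull model names out of an /api/tags style response."""
    return [model["name"] for model in payload.get("models", [])]


def list_ollama_models(
    base_url: str = "http://localhost:11434", timeout: float = 2.0
) -> Optional[List[str]]:
    """Return local model names, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return None


if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama is not reachable on the default port")
    else:
        print("Available models:", models)
```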


📁 Project Structure

doc_intel/
├── core/
├── schemas/
├── utils/
├── setup/
└── demo/

🛣️ Roadmap

  • OCR support (scanned PDFs)
  • Higher-accuracy invoice parsing (95–98%+)
  • Plugin system (contracts, legal, healthcare)
  • Confidence per field
  • Streamlit / Web UI
  • Dataset expansion

🤝 Contributing

PRs welcome!

Before submitting:

doc-intel benchmark

📄 License

MIT


⭐ Why doc-intel?

Most tools are:

  • ❌ LLM-only (slow, expensive)
  • ❌ Rule-based (fragile)

👉 doc-intel combines both:

Fast + Accurate + Local-first


🔥 Star this repo if you find it useful!
