# DocSense Lab — Document Intelligence, End-to-End

**Goal:** Turn scanned documents (PDFs/images) into **structured, validated JSON** with **traceable provenance**.

**Pipeline:**
1. **Classify** → invoice/contract/unknown  
2. **Preprocess** → deskew, denoise, binarize  
3. **OCR** → extract text + per-word confidence + bounding boxes  
4. **Layout** → detect table/line structure (lightweight morphology)  
5. **Extract** → schema-driven fields (invoice #, date, total, vendor)  
6. **Validate** → aggregate confidences, flag low-confidence results  
7. **Route** → send high-confidence to systems, otherwise review

**Why this lab?**  
To demonstrate a reproducible, lightweight pattern for **explainable document AI** that runs locally in Jupyter with minimal dependencies.

## Quick Start

### A) Environment
- Install **Tesseract** (OCR engine)  
  - macOS: `brew install tesseract`  
  - Ubuntu: `sudo apt-get install tesseract-ocr`
- (Optional for PDFs) Install **Poppler** (for `pdftoppm`)  
  - macOS: `brew install poppler`  
  - Ubuntu: `sudo apt-get install poppler-utils`

### B) Python Packages
```bash
pip install opencv-python-headless pytesseract pdf2image Pillow numpy



In [3]:
# Environment setup for a regular Jupyter notebook (no Colab commands).
# - Installs Python packages if missing.
# - Warns if system dependencies (Tesseract, Poppler) are not present.

import sys, subprocess, importlib, shutil

def pip_install(pkgs):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--quiet"] + pkgs)

required = {
    "opencv-python-headless": "opencv-python-headless>=4.8.0",  # cv2
    "pytesseract": "pytesseract>=0.3.10",
    "pdf2image": "pdf2image>=1.17.0",   # optional if you won’t handle PDFs
    "Pillow": "Pillow>=10.0.0",
    "numpy": "numpy>=1.23.0",
}

to_install = []
# cv2 module name is "cv2" even if pkg is opencv-python-headless
try:
    import cv2  # noqa: F401
except Exception:
    to_install.append(required["opencv-python-headless"])

for mod, spec in required.items():
    if mod == "opencv-python-headless":
        continue  # already handled via cv2 import
    try:
        importlib.import_module(mod)
    except Exception:
        to_install.append(spec)

if to_install:
    print("Installing missing packages:", to_install)
    pip_install(to_install)

# System dependency checks (helpful messages if missing)
if not shutil.which("tesseract"):
    print("⚠ Tesseract not found. Install it first:")
    print("  • macOS:  brew install tesseract")
    print("  • Ubuntu: sudo apt-get install tesseract-ocr")
if not shutil.which("pdftoppm"):
    print("⚠ Poppler (pdftoppm) not found (needed only for PDF input):")
    print("  • macOS:  brew install poppler")
    print("  • Ubuntu: sudo apt-get install poppler-utils")


Installing missing packages: ['Pillow>=10.0.0']


In [4]:
# All imports required for the pipeline
import os, re, json, shutil
from pathlib import Path

import cv2
import numpy as np
import pytesseract

# pdf2image is optional—if Poppler isn't installed, we gracefully skip PDF input
try:
    from pdf2image import convert_from_path
except Exception:
    convert_from_path = None

from PIL import Image, ImageDraw, ImageFont

# Ensure pytesseract uses the system tesseract binary, if available
tesseract_bin = shutil.which("tesseract")
if tesseract_bin:
    pytesseract.pytesseract.tesseract_cmd = tesseract_bin



## FAQ

**Q: Do I need Poppler installed?**  
A: Only if you process PDFs. For images (PNG/JPG), you don’t.

**Q: Why is the table detector “simple”?**  
A: To keep the lab explainable and dependency-light. You can swap in a stronger table model when needed.

**Q: Where do confidence numbers come from?**  
A: Tesseract’s per-token confidences, aggregated at the field level.


In [5]:
"""
Complete, runnable version of the user's snippet for a regular Jupyter notebook.

Key additions:
- Synthetic invoice image generator so the notebook runs without external files.
- Robust handling when pdf2image/Poppler is unavailable.
- Auto-configuration of Tesseract path.
- Clear, commented main execution that demonstrates the full pipeline.
"""

# -----------------------------
# 1) Classification
# -----------------------------
def classify_document(file_path: str) -> str:
    """Determine document type based on content patterns."""
    image = None
    suffix = Path(file_path).suffix.lower()

    # Convert first page of PDF to image if possible
    if suffix == ".pdf":
        if convert_from_path is None:
            raise RuntimeError("PDF input requested but pdf2image/Poppler not available.")
        images = convert_from_path(file_path, first_page=1, last_page=1)
        image = np.array(images[0])
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    else:
        image = cv2.imread(file_path)

    if image is None:
        raise FileNotFoundError(f"Could not read: {file_path}")

    # Quick OCR to detect document type
    text = pytesseract.image_to_string(image).lower()

    if any(x in text for x in ("invoice", "bill to", "total amount", "amount due")):
        return "invoice"
    elif any(x in text for x in ("contract", "agreement")):
        return "contract"
    else:
        return "unknown"


# -----------------------------
# 2) Preprocessing + OCR
# -----------------------------
def preprocess_image(image: np.ndarray) -> np.ndarray:
    """Enhance image quality before OCR (grayscale, deskew, denoise, binarize)."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Deskew
    coords = np.column_stack(np.where(gray > 0))
    angle = cv2.minAreaRect(coords)[-1]
    angle = -(90 + angle) if angle < -45 else -angle
    (h, w) = gray.shape[:2]
    M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    rotated = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

    # Denoise + binarize
    denoised = cv2.fastNlMeansDenoising(rotated)
    _, enhanced = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return enhanced


def perform_ocr_with_confidence(image: np.ndarray) -> dict:
    """Extract text + per-word confidence and bounding boxes."""
    ocr_data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

    results = {
        "text": pytesseract.image_to_string(image),
        "words": [],
        "average_confidence": 0.0,
    }

    confidences = []
    n = len(ocr_data.get("text", []))
    for i in range(n):
        try:
            conf = float(ocr_data["conf"][i])
        except Exception:
            conf = -1.0
        token = (ocr_data["text"][i] or "").strip()

        if token and conf > 0:
            results["words"].append(
                {
                    "text": token,
                    "confidence": conf / 100.0,
                    "bbox": (
                        int(ocr_data["left"][i]),
                        int(ocr_data["top"][i]),
                        int(ocr_data["width"][i]),
                        int(ocr_data["height"][i]),
                    ),
                }
            )
            confidences.append(conf)

    if confidences:
        results["average_confidence"] = (sum(confidences) / len(confidences)) / 100.0
    return results


# -----------------------------
# 3) Layout parsing (simple)
# -----------------------------
def detect_layout_regions(image: np.ndarray) -> dict:
    """Identify simple table structures using morphological ops."""
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
    vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))

    horizontal_lines = cv2.morphologyEx(image, cv2.MORPH_OPEN, horizontal_kernel)
    vertical_lines = cv2.morphologyEx(image, cv2.MORPH_OPEN, vertical_kernel)

    table_mask = cv2.addWeighted(horizontal_lines, 0.5, vertical_lines, 0.5, 0.0)
    return {"has_table": int(np.sum(table_mask)) > 1000, "table_regions": []}


# -----------------------------
# 4) Field extraction helpers
# -----------------------------
def find_confidence_for_text(target_text: str, ocr_results: dict) -> float:
    """Estimate confidence for a span by averaging token confidences."""
    confidences = []
    target_words = target_text.lower().split()
    for word_data in ocr_results["words"]:
        if word_data["text"].lower() in target_words:
            confidences.append(word_data["confidence"])
    return float(sum(confidences) / len(confidences)) if confidences else 0.5


def extract_invoice_fields(text: str, ocr_results: dict) -> dict:
    """Schema-driven extraction of common invoice fields."""
    out = {
        "invoice_number": None,
        "invoice_date": None,
        "total_amount": None,
        "vendor_name": None,
        "confidence_scores": {},
    }

    # Invoice number
    m = re.search(r"invoice\s*#?\s*:?\s*([A-Z0-9-]+)", text, flags=re.IGNORECASE)
    if m:
        out["invoice_number"] = m.group(1)
        out["confidence_scores"]["invoice_number"] = find_confidence_for_text(m.group(1), ocr_results)

    # Date (DD/MM/YYYY, MM-DD-YY, etc.)
    m = re.search(r"(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})", text)
    if m:
        out["invoice_date"] = m.group(1)
        out["confidence_scores"]["invoice_date"] = find_confidence_for_text(m.group(1), ocr_results)

    # Total
    m = re.search(r"total[:\s]+\$?\s*([\d,]+\.?\d*)", text, flags=re.IGNORECASE)
    if m:
        amt = m.group(1).replace(",", "")
        out["total_amount"] = amt
        out["confidence_scores"]["total_amount"] = find_confidence_for_text(m.group(1), ocr_results)

    # Vendor (first non-empty line)
    first_line = next((ln.strip() for ln in text.splitlines() if ln.strip()), "")
    if first_line:
        out["vendor_name"] = first_line
        out["confidence_scores"]["vendor_name"] = find_confidence_for_text(first_line, ocr_results)

    return out


# -----------------------------
# 5) Validation & routing
# -----------------------------
def validate_and_route(extracted: dict, confidence_threshold: float = 0.85) -> dict:
    if not extracted.get("confidence_scores"):
        extracted["needs_review"] = True
        extracted["review_reason"] = "No confidence scores available"
        extracted["average_confidence"] = 0.0
        return extracted

    scores = list(extracted["confidence_scores"].values())
    min_conf = min(scores)
    avg_conf = sum(scores) / len(scores)

    extracted["average_confidence"] = avg_conf
    extracted["needs_review"] = min_conf < confidence_threshold
    if extracted["needs_review"]:
        low_fields = [k for k, v in extracted["confidence_scores"].items() if v < confidence_threshold]
        extracted["review_reason"] = f"Low confidence in: {', '.join(low_fields)}"
    return extracted


# -----------------------------
# 6) Full pipeline
# -----------------------------
def process_document(file_path: str) -> dict:
    print(f"Processing: {file_path}")

    # Stage 1: Classification
    doc_type = classify_document(file_path)
    print(f"Document type: {doc_type}")

    # Load image (1st page for PDF)
    suffix = Path(file_path).suffix.lower()
    if suffix == ".pdf":
        if convert_from_path is None:
            raise RuntimeError("pdf2image/Poppler not available; cannot read PDFs.")
        images = convert_from_path(file_path, first_page=1, last_page=1)
        image = np.array(images[0])
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    else:
        image = cv2.imread(file_path)
        if image is None:
            raise FileNotFoundError(f"Could not read: {file_path}")

    # Stage 2: Preprocess + OCR
    preprocessed = preprocess_image(image)
    ocr_results = perform_ocr_with_confidence(preprocessed)
    print(f"OCR average confidence: {ocr_results['average_confidence']:.2f}")

    # Stage 3: Layout
    layout = detect_layout_regions(preprocessed)
    print(f"Contains table: {layout['has_table']}")

    # Stage 4: Extraction
    extracted = extract_invoice_fields(ocr_results["text"], ocr_results)

    # Stage 5: Validation
    final_result = validate_and_route(extracted)
    return final_result


# -----------------------------
# Utility: Generate a synthetic invoice image so this notebook runs anywhere
# -----------------------------
def make_synthetic_invoice_image(path: str, seed: int = 42) -> str:
    img = Image.new("RGB", (1654, 2339), color="white")  # ~A4 at 150 dpi
    draw = ImageDraw.Draw(img)

    # Try to use a clean sans-serif font; fall back to default
    try:
        font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 36)
        font_bold = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 56)
    except Exception:
        font = ImageFont.load_default()
        font_bold = font

    # Header / vendor
    draw.text((80, 80), "ACME Corporation", fill=(0, 0, 0), font=font_bold)
    draw.text((80, 160), "123 Industrial Road", fill=(0, 0, 0), font=font)
    draw.text((80, 200), "Metropolis, NY 10001", fill=(0, 0, 0), font=font)

    # Invoice title & meta
    draw.text((1100, 80), "INVOICE", fill=(0, 0, 0), font=font_bold)
    draw.text((1100, 150), "Invoice #: INV-2024-001", fill=(0, 0, 0), font=font)
    draw.text((1100, 190), "Date: 12/31/2024", fill=(0, 0, 0), font=font)

    # Bill To
    draw.text((80, 300), "Bill To:", fill=(0, 0, 0), font=font_bold)
    draw.text((80, 350), "Wayne Enterprises", fill=(0, 0, 0), font=font)
    draw.text((80, 390), "1007 Mountain Drive", fill=(0, 0, 0), font=font)
    draw.text((80, 430), "Gotham, NJ 07001", fill=(0, 0, 0), font=font)

    # A simple table header line
    draw.line([(80, 520), (1570, 520)], fill=(0, 0, 0), width=3)
    draw.text((80, 540), "Description", fill=(0, 0, 0), font=font_bold)
    draw.text((1100, 540), "Qty", fill=(0, 0, 0), font=font_bold)
    draw.text((1250, 540), "Unit Price", fill=(0, 0, 0), font=font_bold)
    draw.text((1450, 540), "Amount", fill=(0, 0, 0), font=font_bold)
    draw.line([(80, 590), (1570, 590)], fill=(0, 0, 0), width=3)

    # A couple of rows
    y = 620
    rows = [
        ("Widget A", "2", "$199.50", "$399.00"),
        ("Gizmo B", "1", "$799.00", "$799.00"),
        ("Service Plan", "1", "$150.00", "$150.00"),
    ]
    for desc, qty, unit, amt in rows:
        draw.text((80, y), desc, fill=(0, 0, 0), font=font)
        draw.text((1100, y), qty, fill=(0, 0, 0), font=font)
        draw.text((1250, y), unit, fill=(0, 0, 0), font=font)
        draw.text((1450, y), amt, fill=(0, 0, 0), font=font)
        y += 60

    draw.line([(80, y + 10), (1570, y + 10)], fill=(0, 0, 0), width=3)
    draw.text((1200, y + 60), "TOTAL      $1,348.00", fill=(0, 0, 0), font=font_bold)

    Path(path).parent.mkdir(parents=True, exist_ok=True)
    img.save(path)
    return path


# -----------------------------
# Main execution
# -----------------------------
def run_demo():
    # 1) Create a synthetic invoice image so the pipeline always has input
    sample_image_path = "data/sample_invoice.png"
    make_synthetic_invoice_image(sample_image_path)

    # 2) Process the synthetic invoice
    result = process_document(sample_image_path)

    # 3) Display results
    print("\nExtracted Data (excluding per-field confidences):")
    print(json.dumps({k: v for k, v in result.items() if k not in ("confidence_scores",)}, indent=2))

    print("\nConfidence Scores:")
    print(json.dumps(result.get("confidence_scores", {}), indent=2))

    if result.get("needs_review"):
        print(f"\n⚠ ROUTING TO HUMAN REVIEW: {result.get('review_reason', 'Unknown reason')}")
    else:
        print("\n✓ HIGH CONFIDENCE: Sending to ERP system")

# Execute demo
run_demo()


Processing: data/sample_invoice.png
Document type: invoice
OCR average confidence: 0.88
Contains table: True

Extracted Data (excluding per-field confidences):
{
  "invoice_number": null,
  "invoice_date": null,
  "total_amount": "13",
  "vendor_name": "Wayne Enterprises",
  "average_confidence": 0.73,
  "needs_review": true,
  "review_reason": "Low confidence in: total_amount"
}

Confidence Scores:
{
  "total_amount": 0.5,
  "vendor_name": 0.96
}

⚠ ROUTING TO HUMAN REVIEW: Low confidence in: total_amount
