darshansatapara/OCR

OCR POC - Multi-Format Document Analysis

Compare Azure Document Intelligence and MinerU Cloud on various document formats.
Measure accuracy, detect document structures (multi-column, invoices, tables), and map OCR confusions.

→ Quick start: HOW_TO_RUN.md
→ Detailed workflows: WORKFLOW_FLOW.md
→ Algorithms & concepts: KNOWLEDGE_BASE.md

New Features ✨

Multi-Format Document Support

  • PDF — Multi-column detection, layout preservation
  • Text Files (.txt) — Plain text input
  • Excel (.xlsx, .xls) — All sheets, table extraction
  • CSV — Auto-delimiter detection, column preservation
  • Word (.docx) — Paragraphs, tables, formatting
  • PowerPoint (.pptx, .ppt) — Slide text extraction
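
One way to approach the CSV auto-delimiter detection listed above is the standard library's csv.Sniffer. This is a minimal sketch, not the code in csv_parser.py. The function name detect_delimiter, the candidate delimiter set, and the comma fallback are assumptions made for illustration.

```python
import csv


# Hypothetical helper; not taken from src/documents/csv_parser.py.
def detect_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the delimiter of a CSV sample, falling back to a comma."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # The sniffer could not decide (for example, single-column data).
        return ","


if __name__ == "__main__":
    text = "name;qty;price\nwidget;2;3.50\ngadget;10;1.25\n"
    print(repr(detect_delimiter(text)))
```

Passing only the first few kilobytes of a file as the sample is usually enough and avoids reading large files twice.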

Document Structure Detection

  • Multi-Column — Detects side-by-side columns and reorders their text into reading order
  • Invoice/Bill — Groups headers, items, totals
  • Table — Identifies and preserves table structure
  • Single-Column — Standard linear text
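
To illustrate what multi-column reordering involves, here is a small sketch. It assumes the extractor yields text blocks with page coordinates and that the page has exactly two columns split at the horizontal midline. TextBlock, reorder_two_columns, and the midline heuristic are hypothetical and may differ from what structures.py does.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    x0: float   # left edge of the block
    y0: float   # top edge of the block (smaller means higher on the page)
    text: str


def reorder_two_columns(blocks: list[TextBlock], page_width: float) -> list[str]:
    """Return block texts in reading order: left column top to bottom, then right."""
    middle = page_width / 2
    left = sorted((b for b in blocks if b.x0 < middle), key=lambda b: b.y0)
    right = sorted((b for b in blocks if b.x0 >= middle), key=lambda b: b.y0)
    return [b.text for b in left + right]


if __name__ == "__main__":
    # Blocks arrive row by row, which interleaves the two columns.
    page = [
        TextBlock(50, 100, "L1"), TextBlock(320, 100, "R1"),
        TextBlock(50, 200, "L2"), TextBlock(320, 200, "R2"),
    ]
    print(reorder_two_columns(page, page_width=600))
```

A fixed midline is the simplest possible split; pages with uneven columns or full-width headings would need gap detection instead.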

Dual-Format Output

  • TXT Baseline — For accuracy comparison
  • MD Baseline — With structure metadata and formatting

Project Layout

ocr_poc/
├── main.py                  # CLI entry point
├── requirements.txt         # Python dependencies
├── .env.example             # API keys template
├── README.md                # This file
├── HOW_TO_RUN.md           # Usage instructions
├── KNOWLEDGE_BASE.md       # Reference material
├── WORKFLOW_FLOW.md        # Architecture documentation
├── samples/                 # Sample documents (PDF, CSV, Excel, etc.)
├── output/                  # Generated reports (gitignored)
└── src/
    ├── cli.py              # Command-line interface
    ├── pipeline.py         # Main OCR pipeline orchestrator
    ├── baseline.py         # Baseline/ground truth extraction
    ├── config.py           # Configuration & environment variables
    ├── report.py           # Report generation (JSON & Markdown)
    ├── documents/          # NEW: Multi-format document parsing
    │   ├── parser.py       # Main orchestrator (auto-detects format)
    │   ├── pdf_parser.py   # PDF text extraction
    │   ├── text_parser.py  # Plain text files
    │   ├── excel_parser.py # Excel workbooks
    │   ├── csv_parser.py   # CSV files
    │   ├── word_parser.py  # Word documents
    │   ├── ppt_parser.py   # PowerPoint presentations
    │   ├── structures.py   # Document structure detection
    │   ├── formatters.py   # Output formatting (TXT + MD)
    │   └── baseline_extractor.py  # Enhanced baseline extraction
    ├── ocr/                # OCR engine integrations
    │   ├── azure.py        # Azure Document Intelligence
    │   └── mineru.py       # MinerU Cloud API
    └── evaluation/         # Accuracy analysis
        ├── accuracy.py     # Similarity, CER, WER metrics
        ├── confusion.py    # Character & word confusion maps
        ├── failures.py     # Issue detection
        └── text.py         # Text normalization

Setup

# Windows PowerShell; the path below is the author's checkout, adjust it to yours
cd D:\7span\projects\poc_ocr_2\ocr_poc
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
copy .env.example .env
# Edit .env with your API keys

Installation Notes

Full format support depends on the following packages, which pip install -r requirements.txt installs automatically:

  • openpyxl — Excel support
  • python-docx — Word support
  • python-pptx — PowerPoint support

Quick Commands

# Verify API keys
python main.py check-keys

# PDF with auto-baseline
python main.py run --pdf samples/invoice.pdf --engine mineru

# Multi-format with ground truth
python main.py run --pdf samples/data.csv --ground-truth samples/data_expected.txt --engine both

# Excel file with enhanced parsing
python main.py run --pdf samples/sheet.xlsx --enhanced-parsing --engine azure

# Both engines, no baseline
python main.py run --pdf samples/document.docx --engine both

Output Files

Each document generates a report folder at output/<document-name>/:

File                              Purpose
--------------------------------  ------------------------------------------------
report.md                         Summary with accuracy metrics & confusion tables
report.json                       Structured data (all metrics)
{engine}_accuracy_details.md      Full accuracy breakdown for each engine
{engine}_confusion_analysis.md    Character & word confusion analysis
{engine}_accuracy.json            Detailed metrics in JSON
{engine}_output.txt               Raw OCR extracted text
baseline.txt                      Ground truth or auto-extracted baseline
baseline.md                       Baseline with structure metadata
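
The file names listed above can be turned into a quick completeness check after a run. The sketch below only builds the expected paths. It assumes that <document-name> is the input file name without its extension and that {engine} takes the values azure and mineru used by the --engine flag; neither is confirmed by this README, and expected_outputs is a hypothetical helper.

```python
from pathlib import Path

# File name templates copied from the list of output files.
PER_ENGINE = [
    "{engine}_accuracy_details.md",
    "{engine}_confusion_analysis.md",
    "{engine}_accuracy.json",
    "{engine}_output.txt",
]
SHARED = ["report.md", "report.json", "baseline.txt", "baseline.md"]


def expected_outputs(document: str, engines: list[str], root: str = "output") -> list[Path]:
    """List the report files a run is expected to produce for one document."""
    # Assumption: the report folder is named after the file stem.
    folder = Path(root) / Path(document).stem
    names = SHARED + [t.format(engine=e) for e in engines for t in PER_ENGINE]
    return [folder / name for name in names]


if __name__ == "__main__":
    for path in expected_outputs("samples/invoice.pdf", ["azure", "mineru"]):
        print(path, "ok" if path.exists() else "missing")
```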

Metrics & Analysis

Accuracy Metrics

  • Similarity % — String similarity score
  • Character Accuracy — Per-character correctness
  • Word Accuracy — Per-word correctness
  • CER (Character Error Rate) — Character-level Levenshtein distance divided by the reference length
  • WER (Word Error Rate) — Word-level edit distance divided by the number of reference words
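
For reference, CER and WER are conventionally computed as an edit distance normalized by the length of the ground truth. The sketch below follows that textbook definition using only the standard library. It is an illustration and not necessarily how accuracy.py computes its numbers; edit_distance, cer, and wer are names chosen here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    print(cer("invoice", "lnvoice"))
    print(wer("total due 100", "total due 1OO"))
```

Both rates can exceed 1.0 when the OCR output contains many insertions, which is why they are reported separately from the bounded accuracy percentages.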

Character Analysis

  • Digit vs Letter Accuracy — Separate metrics
  • Character Confusions — Top confused characters (e.g., l vs. i)
  • Context Examples — Shows how each confusion occurred
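
A character confusion map can be approximated by aligning the baseline with the OCR output and counting substituted characters. The sketch below uses difflib.SequenceMatcher from the standard library for the alignment. This is an assumed approach for illustration, and confusion.py may align text differently. Pairs are stored as (expected, got).

```python
from collections import Counter
from difflib import SequenceMatcher


def char_confusions(reference: str, ocr_text: str) -> Counter:
    """Count (expected, got) character pairs from equal-length replacements."""
    pairs: Counter = Counter()
    matcher = SequenceMatcher(None, reference, ocr_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only aligned substitutions are counted; insertions, deletions and
        # unequal-length replacements are skipped in this sketch.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.update(zip(reference[i1:i2], ocr_text[j1:j2]))
    return pairs


if __name__ == "__main__":
    counts = char_confusions("total 100", "total 1OO")
    for (expected, got), n in counts.most_common(5):
        print(f"{expected!r} read as {got!r}: {n}")
```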

Word Analysis

  • Word Confusion Pairs — Misread words (e.g., lnvoice for invoice)
  • Similarity Scores — How close each mismatch was
  • Word-level Metrics — Precision, Recall, F1 scores
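
Word-level precision, recall, and F1 can be defined in more than one way. The sketch below uses a bag-of-words definition, where word order is ignored and repeated words are matched by count. Whether the project uses this definition is an assumption; word_prf is a hypothetical name.

```python
from collections import Counter


def word_prf(reference: str, ocr_text: str) -> tuple[float, float, float]:
    """Bag-of-words precision, recall and F1 (word order is ignored)."""
    ref, hyp = Counter(reference.split()), Counter(ocr_text.split())
    overlap = sum((ref & hyp).values())   # words matched, respecting counts
    precision = overlap / sum(hyp.values()) if hyp else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f1 = word_prf("total due 100", "total due 1OO extra")
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Precision drops when the engine emits extra words, while recall drops when baseline words are missing or misread.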

Document Structure

  • Multi-Column Detection — Automatically reorders text
  • Invoice Structure — Groups headers, line items, totals
  • Table Detection — Preserves column alignment

API Keys

Service   Environment Variables
--------  ------------------------------------------------------------------
Azure     DOCUMENTINTELLIGENCE_ENDPOINT (URL), DOCUMENTINTELLIGENCE_API_KEY
MinerU    MINERU_TOKEN (get one from mineru.net/apiManage/token)

Note: MinerU has a free mode (--mineru-flash) that doesn't require a token.
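
A pre-flight check for the variables listed above might look like the following. This is a sketch of the idea and not the check-keys implementation. The REQUIRED grouping and the missing_keys helper are assumptions, and the code reads the process environment only, so values kept in .env must already be loaded.

```python
import os
from collections.abc import Mapping

# Variable names come from the API Keys section; the grouping is an assumption.
REQUIRED = {
    "azure": ["DOCUMENTINTELLIGENCE_ENDPOINT", "DOCUMENTINTELLIGENCE_API_KEY"],
    "mineru": ["MINERU_TOKEN"],
}


def missing_keys(engine: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or blank for an engine."""
    return [name for name in REQUIRED[engine] if not env.get(name, "").strip()]


if __name__ == "__main__":
    for engine in REQUIRED:
        missing = missing_keys(engine)
        print(engine, "ok" if not missing else f"missing: {', '.join(missing)}")
```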

Changelog

v2.0 - Multi-Format & Structure Detection

  • ✨ Added document parser supporting PDF, TXT, Excel, CSV, Word, PowerPoint
  • ✨ Automatic document structure detection (multi-column, invoices, tables)
  • ✨ Dual-format baseline output (TXT + Markdown)
  • ✨ Enhanced layout preservation for complex documents
  • 📄 Added src/documents/ module with six format-specific parsers plus an orchestrator (parser.py)

v1.1 - Detailed Accuracy Reports

  • Added *_accuracy_details.md with full metrics
  • Added *_confusion_analysis.md with error analysis
  • Cross-engine comparison reports

v1.0 - Initial Release

  • Azure & MinerU comparison
  • Baseline accuracy evaluation
  • Character & word confusion mapping

Development Notes

Code Structure

  • Parser functions — Modular, format-specific text extraction
  • Structure detection — Identifies document layout patterns
  • Pipeline — Orchestrates: parse → extract → evaluate → report
  • Two-line comments — Each function carries a brief two-line description

Adding New Formats

To add a new document format (e.g., TIFF, JSON):

  1. Create src/documents/{format}_parser.py with an extract_text_from_{format}() function
  2. Add a case for the new format to parse_document() in src/documents/parser.py
  3. Update requirements.txt with any needed dependencies
  4. Add file extension handling to CLI
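
The steps above can be pictured with a small suffix-based dispatcher. The sketch below is hypothetical. PARSERS, register, and parse_any are illustration names, the real parse_document() may dispatch differently, and pretty-printing JSON as its "text" is just one possible choice for a new format.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical registry; the real parse_document() may dispatch differently.
PARSERS: dict[str, Callable[[Path], str]] = {}


def register(*suffixes: str):
    """Decorator that maps file suffixes to a text-extraction function."""
    def wrap(func: Callable[[Path], str]) -> Callable[[Path], str]:
        for suffix in suffixes:
            PARSERS[suffix.lower()] = func
        return func
    return wrap


@register(".json")
def extract_text_from_json(path: Path) -> str:
    # Illustrative choice: pretty-print the JSON so it compares line by line.
    data = json.loads(path.read_text(encoding="utf-8"))
    return json.dumps(data, indent=2, ensure_ascii=False, sort_keys=True)


def parse_any(path: Path) -> str:
    """Pick a parser from the file suffix (case-insensitive) and run it."""
    try:
        parser = PARSERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported format: {path.suffix}") from None
    return parser(path)
```

Keeping the suffix table in one place means the CLI's extension handling (step 4) can be derived from the same registry instead of being maintained separately.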

Testing Structure Detection

from pathlib import Path

from src.documents import parse_document

# Parse a sample and print the detected layout type and its confidence
result = parse_document(Path("samples/multi_col.pdf"))
print(f"Type: {result.structure_analysis.structure_type}")
print(f"Confidence: {result.structure_analysis.confidence:.0%}")

Troubleshooting

Issue: "openpyxl not found" when using Excel
Fix: pip install openpyxl

Issue: Multi-column text still out of order
Fix: Pass an explicit --ground-truth file instead of relying on the auto-extracted baseline

Issue: Azure/MinerU API errors
Fix: Verify .env has correct keys via python main.py check-keys

License

Internal POC — Proof of Concept for OCR evaluation
