Compare Azure Document Intelligence and MinerU Cloud on various document formats.
Measure accuracy, detect document structures (multi-column, invoices, tables), and map OCR confusions.
→ Quick start: HOW_TO_RUN.md
→ Detailed workflows: WORKFLOW_FLOW.md
→ Algorithms & concepts: KNOWLEDGE_BASE.md
- PDF — Multi-column detection, layout preservation
- Text Files (.txt) — Plain text input
- Excel (.xlsx, .xls) — All sheets, table extraction
- CSV — Auto-delimiter detection, column preservation
- Word (.docx) — Paragraphs, tables, formatting
- PowerPoint (.pptx, .ppt) — Slide text extraction
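The CSV auto-delimiter detection listed above could be approximated with the standard library's `csv.Sniffer`. This is only a sketch under that assumption: the helper names `sniff_delimiter` and `parse_rows` are hypothetical and are not taken from `src/documents/csv_parser.py`.

```python
import csv
import io


def sniff_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the delimiter of a CSV sample; fall back to a comma if unsure."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Sniffer could not decide (e.g. a single-column file)
        return ","


def parse_rows(sample: str) -> list[list[str]]:
    """Split the sample into rows and columns using the guessed delimiter."""
    delimiter = sniff_delimiter(sample)
    return list(csv.reader(io.StringIO(sample), delimiter=delimiter))


if __name__ == "__main__":
    text = "name;qty;price\npen;2;1.50\nbook;1;9.99\n"
    print(repr(sniff_delimiter(text)))
    print(parse_rows(text))
```

Falling back to a comma keeps the parser usable on files where detection fails, at the cost of treating such files as single-column.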
- Multi-Column — Detects and reorders left-right text
- Invoice/Bill — Groups headers, items, totals
- Table — Identifies and preserves table structure
- Single-Column — Standard linear text
- TXT Baseline — For accuracy comparison
- MD Baseline — With structure metadata and formatting
ocr_poc/
├── main.py # CLI entry point
├── requirements.txt # Python dependencies
├── .env.example # API keys template
├── README.md # This file
├── HOW_TO_RUN.md # Usage instructions
├── KNOWLEDGE_BASE.md # Reference material
├── WORKFLOW_FLOW.md # Architecture documentation
├── samples/ # Sample documents (PDF, CSV, Excel, etc.)
├── output/ # Generated reports (gitignored)
└── src/
    ├── cli.py # Command-line interface
    ├── pipeline.py # Main OCR pipeline orchestrator
    ├── baseline.py # Baseline/ground truth extraction
    ├── config.py # Configuration & environment variables
    ├── report.py # Report generation (JSON & Markdown)
    ├── documents/ # NEW: Multi-format document parsing
    │   ├── parser.py # Main orchestrator (auto-detects format)
    │   ├── pdf_parser.py # PDF text extraction
    │   ├── text_parser.py # Plain text files
    │   ├── excel_parser.py # Excel workbooks
    │   ├── csv_parser.py # CSV files
    │   ├── word_parser.py # Word documents
    │   ├── ppt_parser.py # PowerPoint presentations
    │   ├── structures.py # Document structure detection
    │   ├── formatters.py # Output formatting (TXT + MD)
    │   └── baseline_extractor.py # Enhanced baseline extraction
    ├── ocr/ # OCR engine integrations
    │   ├── azure.py # Azure Document Intelligence
    │   └── mineru.py # MinerU Cloud API
    └── evaluation/ # Accuracy analysis
        ├── accuracy.py # Similarity, CER, WER metrics
        ├── confusion.py # Character & word confusion maps
        ├── failures.py # Issue detection
        └── text.py # Text normalization
cd D:\7span\projects\poc_ocr_2\ocr_poc
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
copy .env.example .env
# Edit .env with your API keys

For full format support, ensure these are installed (auto-installed by requirements.txt):
- openpyxl — Excel support
- python-docx — Word support
- python-pptx — PowerPoint support
# Verify API keys
python main.py check-keys
# PDF with auto-baseline
python main.py run --pdf samples/invoice.pdf --engine mineru
# Multi-format with ground truth
python main.py run --pdf samples/data.csv --ground-truth samples/data_expected.txt --engine both
# Excel file with enhanced parsing
python main.py run --pdf samples/sheet.xlsx --enhanced-parsing --engine azure
# Both engines, no baseline
python main.py run --pdf samples/document.docx --engine both

Each document generates a report folder at output/<document-name>/:
| File | Purpose |
|---|---|
| `report.md` | Summary with accuracy metrics & confusion tables |
| `report.json` | Structured data (all metrics) |
| `{engine}_accuracy_details.md` | Full accuracy breakdown for each engine |
| `{engine}_confusion_analysis.md` | Character & word confusion analysis |
| `{engine}_accuracy.json` | Detailed metrics in JSON |
| `{engine}_output.txt` | Raw OCR extracted text |
| `baseline.txt` | Ground truth or auto-extracted baseline |
| `baseline.md` | Baseline with structure metadata |
- Similarity % — String similarity score
- Character Accuracy — Per-character correctness
- Word Accuracy — Per-word correctness
- CER (Character Error Rate) — Character-level Levenshtein distance divided by reference length
- WER (Word Error Rate) — Word-level edit distance divided by reference word count
- Digit vs Letter Accuracy — Separate metrics
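As a rough illustration of how the error-rate metrics above are conventionally computed, here is a minimal sketch. It is not taken from `src/evaluation/accuracy.py`; the function names `levenshtein`, `cer`, and `wer` are assumptions chosen for this example, and the project may normalize text differently before scoring.

```python
from typing import Sequence


def levenshtein(ref: Sequence, hyp: Sequence) -> int:
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return levenshtein(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    print(f"CER: {cer('invoice', 'lnvoice'):.3f}")
    print(f"WER: {wer('total amount due', 'total amount'):.3f}")
```

Both rates can exceed 1.0 when the OCR output is much longer than the reference, which is why they are reported separately from the accuracy percentages.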
- Character Confusions — Top confused characters (e.g., `l→i`)
- Context Examples — Shows how each confusion occurred
- Word Confusion Pairs — Misread words (e.g., `lnvoice→invoice`)
- Similarity Scores — How close each mismatch was
- Word-level Metrics — Precision, Recall, F1 scores
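The character confusions and word-level precision, recall, and F1 listed above can be illustrated with the standard library alone. This sketch assumes an alignment based on `difflib.SequenceMatcher` and a bag-of-words overlap; `char_confusions` and `word_prf` are hypothetical names, and `src/evaluation/confusion.py` may align text differently.

```python
from collections import Counter
from difflib import SequenceMatcher


def char_confusions(reference: str, ocr_text: str) -> Counter:
    """Count (expected, got) character pairs from equal-length substitutions."""
    counts: Counter = Counter()
    matcher = SequenceMatcher(None, reference, ocr_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replacements map one character onto another
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for expected, got in zip(reference[i1:i2], ocr_text[j1:j2]):
                counts[(expected, got)] += 1
    return counts


def word_prf(reference: str, ocr_text: str) -> tuple[float, float, float]:
    """Bag-of-words precision, recall, and F1 of OCR words against the reference."""
    ref, hyp = Counter(reference.split()), Counter(ocr_text.split())
    overlap = sum((ref & hyp).values())  # words found in both, with multiplicity
    precision = overlap / sum(hyp.values()) if hyp else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    print(char_confusions("invoice total", "lnvoice tota1").most_common())
    print(word_prf("invoice total 100", "lnvoice total 100"))
```

Unequal-length replacements are skipped here because they cannot be attributed to a single character pair; a fuller analysis would report them as insertions or deletions.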
- Multi-Column Detection — Automatically reorders text
- Invoice Structure — Groups headers, line items, totals
- Table Detection — Preserves column alignment
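To make the multi-column reordering idea concrete, the sketch below splits positioned text blocks at the page midline and reads the left column before the right. It is a simplified heuristic written for illustration only: the `Block` type, the `reorder_two_columns` function, and the `min_share` threshold are all assumptions and do not describe what `src/documents/structures.py` actually does.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x0: float  # left edge of the block
    y0: float  # top edge of the block (smaller means higher on the page)


def reorder_two_columns(blocks: list[Block], page_width: float,
                        min_share: float = 0.25) -> list[Block]:
    """Return blocks in reading order, treating the page as two columns
    when both halves hold at least `min_share` of the blocks."""
    mid = page_width / 2
    left = [b for b in blocks if b.x0 < mid]
    right = [b for b in blocks if b.x0 >= mid]
    is_multi_column = bool(blocks) and min(len(left), len(right)) / len(blocks) >= min_share
    if not is_multi_column:
        # Single column: plain top-to-bottom, left-to-right order
        return sorted(blocks, key=lambda b: (b.y0, b.x0))
    # Two columns: whole left column first, then the right column
    return sorted(left, key=lambda b: b.y0) + sorted(right, key=lambda b: b.y0)


if __name__ == "__main__":
    page = [Block("R1", 320, 50), Block("L1", 40, 50),
            Block("L2", 40, 100), Block("R2", 320, 100)]
    print([b.text for b in reorder_two_columns(page, page_width=600)])
```

A midline split breaks down on pages with full-width headers or three columns, which is one reason an explicit ground-truth file can be more reliable than automatic reordering.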
| Service | Environment Variables |
|---|---|
| Azure | DOCUMENTINTELLIGENCE_ENDPOINT (URL), DOCUMENTINTELLIGENCE_API_KEY |
| MinerU | MINERU_TOKEN — Get from mineru.net/apiManage/token |
Note: MinerU has a free mode (--mineru-flash) that doesn't require a token.
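A quick way to see which of the variables above are missing before running an engine is sketched below. The variable names come from the table; the `REQUIRED` mapping and the `missing_keys` helper are hypothetical and are not a description of how `python main.py check-keys` is implemented.

```python
import os
from typing import Mapping

# Environment variables each engine needs, as listed in the table above
REQUIRED = {
    "azure": ["DOCUMENTINTELLIGENCE_ENDPOINT", "DOCUMENTINTELLIGENCE_API_KEY"],
    "mineru": ["MINERU_TOKEN"],
}


def missing_keys(engine: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or blank for an engine."""
    return [name for name in REQUIRED[engine] if not env.get(name, "").strip()]


if __name__ == "__main__":
    for engine in REQUIRED:
        missing = missing_keys(engine)
        status = "OK" if not missing else "missing " + ", ".join(missing)
        print(f"{engine}: {status}")
```

This only checks that values are present; it does not verify that a key is accepted by the service.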
- ✨ Added document parser supporting PDF, TXT, Excel, CSV, Word, PowerPoint
- ✨ Automatic document structure detection (multi-column, invoices, tables)
- ✨ Dual-format baseline output (TXT + Markdown)
- ✨ Enhanced layout preservation for complex documents
- 📄 Added `src/documents/` module with 7 format-specific parsers
- Added `*_accuracy_details.md` with full metrics
- Added `*_confusion_analysis.md` with error analysis
- Cross-engine comparison reports
- Azure & MinerU comparison
- Baseline accuracy evaluation
- Character & word confusion mapping
- Parser functions — Modular, format-specific text extraction
- Structure detection — Identifies document layout patterns
- Pipeline — Orchestrates: parse → extract → evaluate → report
- Two-liner comments — Quick function documentation throughout
To add a new document format (e.g., TIFF, JSON):
- Create `src/documents/{format}_parser.py` with an `extract_text_from_{format}()` function
- Add a format case to `parse_document()` in `src/documents/parser.py`
- Update `requirements.txt` with any needed dependencies
- Add file extension handling to the CLI
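As an example of the first step, a JSON parser following the `extract_text_from_{format}()` naming pattern might look like the sketch below. The signature (a `Path` in, a plain string out) is an assumption, since the interface that `parse_document()` expects from a parser is not shown here.

```python
import json
from pathlib import Path


def extract_text_from_json(path: Path) -> str:
    """Collect every scalar value in a JSON file, one per line, in document order."""
    data = json.loads(path.read_text(encoding="utf-8"))
    lines: list[str] = []

    def walk(node) -> None:
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        elif node is not None:
            # Strings, numbers, and booleans become text; nulls are skipped
            lines.append(str(node))

    walk(data)
    return "\n".join(lines)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.json"
        sample.write_text('{"title": "Invoice", "items": [{"name": "Pen", "qty": 2}]}',
                          encoding="utf-8")
        print(extract_text_from_json(sample))
```

Keys are dropped and only values are kept in this sketch; whether keys belong in the baseline text depends on what the OCR engines are expected to reproduce.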
from pathlib import Path

from src.documents import parse_document

# Parse a sample PDF and print the detected layout type and its confidence
result = parse_document(Path("samples/multi_col.pdf"))
print(f"Type: {result.structure_analysis.structure_type}")
print(f"Confidence: {result.structure_analysis.confidence:.0%}")

Issue: "openpyxl not found" when using Excel
Fix: pip install openpyxl
Issue: Multi-column text still out of order
Fix: Use explicit --ground-truth file instead of auto-parsing
Issue: Azure/MinerU API errors
Fix: Verify .env has correct keys via python main.py check-keys
Internal POC — Proof of Concept for OCR evaluation