Team T34 - Project 2
Team Members: Aryan & Bhavan
A cost-effective hybrid document processing system that learns from each document to reduce AI costs while maintaining consistent JSON output across all file formats.
Watch Demo Video - Multi-Format Document Parser
Watch our 6-minute demo showcasing the hybrid pipeline, cost optimization, and signature learning in action.
Organizations receive documents in various formats (PDFs, scans, emails, HTML) but need consistent JSON output for downstream systems. Traditional AI-only approaches become prohibitively expensive at scale.
Our Solution: A hybrid pipeline that learns document patterns, creates reusable signatures, and uses AI sparingly - getting smarter and cheaper with every processed document.
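The match-first, AI-on-miss idea can be sketched in a few lines. This is an illustration only, not the project's code: `layout_signature`, `HybridParserSketch`, and `fake_ai_map_labels` are hypothetical names, and fingerprinting a document by its ordered "Label:" prefixes is an assumption chosen for brevity.

```python
import hashlib
from typing import Callable, Dict


def layout_signature(text: str) -> str:
    """Fingerprint a document by its ordered field labels, ignoring the values."""
    labels = [line.split(":", 1)[0].strip().lower()
              for line in text.splitlines() if ":" in line]
    return hashlib.sha256("|".join(labels).encode("utf-8")).hexdigest()[:12]


class HybridParserSketch:
    """Hypothetical stand-in for the pipeline, not the real HybridDocumentParser."""

    def __init__(self, ai_map_labels: Callable[[str], Dict[str, str]]):
        self.ai_map_labels = ai_map_labels  # costly step: text -> {label: field name}
        self.signatures: Dict[str, Dict[str, str]] = {}
        self.ai_calls = 0

    def parse(self, text: str) -> dict:
        sig = layout_signature(text)
        mapping = self.signatures.get(sig)
        if mapping is None:
            # Unknown layout: pay for one AI call, then remember the result.
            mapping = self.ai_map_labels(text)
            self.ai_calls += 1
            self.signatures[sig] = mapping
            strategy, calls = "ai_extraction", 1
        else:
            # Known layout: reuse the learned mapping at no AI cost.
            strategy, calls = "signature_match", 0

        fields = {}
        for line in text.splitlines():
            if ":" not in line:
                continue
            label, value = line.split(":", 1)
            name = mapping.get(label.strip().lower())
            if name:
                fields[name] = value.strip()
        return {
            "fields": fields,
            "processing_info": {
                "strategy": strategy,
                "signature_id": f"sig_{sig}",
                "ai_calls": calls,
            },
        }


def fake_ai_map_labels(text: str) -> Dict[str, str]:
    """Placeholder for a real model call; returns a fixed label-to-field mapping."""
    return {"invoice no": "reference_number", "total": "total"}
```

Parsing two invoices with the same labels through one `HybridParserSketch(fake_ai_map_labels)` instance triggers the placeholder AI step once; the second call is served from the stored signature.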
- Signature Matching: Reuse learned patterns for cost-free processing
- AI Extraction: Dynamic schema generation with automatic signature creation
- Intelligent Fallbacks: Always produces results, never fails silently
- Predictable Costs: Each document type gets cheaper to process over time
- Real-time Tracking: Monitor AI usage and cost savings
- Pattern Reuse: 0-cost processing for recognized document layouts
- Normalized JSON: Same structure regardless of input format
- Smart Field Mapping: Automatic categorization of extracted data
- Table Extraction: Preserves structured data from documents
- Processing Logs: Detailed explanation of every decision
- Confidence Scores: Know how reliable each extraction is
- Strategy Tracking: See which approach was used for each document
- HybridDocumentParser: Main orchestrator managing the pipeline
- SignatureManager: Learns and stores document patterns with versioning
- AIExtractor: Gemini-powered extraction with cost tracking
- Docling Integration: Multi-format text extraction engine
- Python 3.11+
- UV package manager (recommended) or pip
- Google Gemini API Key
- Tesseract OCR Engine (for image/scanned document processing)
- 4GB+ RAM (for processing large documents)
Tesseract is required for processing images and scanned PDFs. Install it before setting up the Python environment.
Option 1: Download Installer (Recommended)
- Go to Tesseract GitHub Releases
- Download the latest Windows installer (e.g., tesseract-ocr-w64-setup-5.3.3.20231005.exe)
- Run the installer with administrator privileges
- Important: During installation, note the installation path (usually C:\Program Files\Tesseract-OCR)
Option 2: Using Chocolatey
# Install Chocolatey first if not installed
# Then run:
choco install tesseract
Option 3: Using Winget
winget install --id UB-Mannheim.TesseractOCR
Using Homebrew (Recommended):
# Install Homebrew if not installed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install Tesseract
brew install tesseract
Using MacPorts:
sudo port install tesseract
Ubuntu/Debian:
sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-eng
CentOS/RHEL/Fedora:
# CentOS/RHEL
sudo yum install tesseract tesseract-langpack-eng
# Fedora
sudo dnf install tesseract tesseract-langpack-eng
Arch Linux:
sudo pacman -S tesseract tesseract-data-eng
Find Tesseract Installation Path:
- Default: C:\Program Files\Tesseract-OCR
- If different, check your installation directory
Add to System PATH:
- Press Win + R, type sysdm.cpl, press Enter
- Click "Environment Variables"
- Under "System Variables", find "Path" and click "Edit"
- Click "New" and add: C:\Program Files\Tesseract-OCR
- Click "OK" to save
Alternative: Add to .env file:
TESSERACT_CMD=C:\Program Files\Tesseract-OCR\tesseract.exe
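How an application might choose between `TESSERACT_CMD` and the system PATH can be sketched as below. `resolve_tesseract` is a hypothetical helper, and giving the explicit variable priority over PATH is an assumption about the intended behavior, not something confirmed by the project.

```python
import os
import shutil
from typing import Mapping, Optional


def resolve_tesseract(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a Tesseract executable path, or None if nothing can be found.

    Assumed precedence: an explicit TESSERACT_CMD wins, otherwise search PATH.
    """
    explicit = env.get("TESSERACT_CMD", "").strip().strip('"')
    if explicit:
        return explicit
    # shutil.which searches the given PATH string for an executable.
    return shutil.which("tesseract", path=env.get("PATH"))


if __name__ == "__main__":
    found = resolve_tesseract()
    print(found or "Tesseract not found: install it or set TESSERACT_CMD")
```

If the OCR layer goes through pytesseract, the resolved path could be assigned to `pytesseract.pytesseract.tesseract_cmd` before the first OCR call.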
Tesseract should be automatically available in PATH after installation. Verify with:
tesseract --version
If not found, add to your shell profile (~/.bashrc, ~/.zshrc):
export PATH="/usr/local/bin:$PATH" # macOS Homebrew (Intel)
# or
export PATH="/opt/homebrew/bin:$PATH" # macOS Homebrew (Apple Silicon)
git clone https://github.com/your-username/multi-format-document-parser.git
cd multi-format-document-parser
Using UV (Recommended):
# Install UV if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment with Python 3.11
uv venv --python 3.11
# Activate the environment
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate
# Install dependencies
uv pip install -r requirements.txt
Using pip (Alternative):
# Create virtual environment
python -m venv venv
# Activate environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Step 1: Get Gemini API Key
- Go to Google AI Studio
- Create a new API key
- Copy the generated key
Step 2: Set Up Environment File
# Copy the sample environment file
cp .env.sample .env
Step 3: Add Your API Key
Edit the .env file and add your Gemini API key:
# Required - Replace with your actual API key
GEMINI_API_KEY=your_actual_gemini_api_key_here
# Optional - Tesseract path (Windows only, if not in PATH)
TESSERACT_CMD=C:\Program Files\Tesseract-OCR\tesseract.exe
# Optional - Storage paths
PARSER_DATA_PATH=parser_data
TEMP_FILES_PATH=temp
# Optional - Processing settings
DEFAULT_CONFIDENCE_THRESHOLD=0.8
MAX_AI_TOKENS=800
LOG_LEVEL=INFO
# Test Tesseract installation
tesseract --version
# Test Python environment
python -c "import docling; print('Docling installed successfully')"
streamlit run app.py
The application will start and automatically open in your browser at http://localhost:8501
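The variables from the .env file could be read into typed settings along these lines. This is a sketch under assumptions: `Settings` and `load_settings` are hypothetical names, the defaults mirror the sample values shown earlier, and the .env file is assumed to have been loaded into the process environment already (for example by a dotenv loader).

```python
import os
from dataclasses import dataclass
from typing import Mapping

PLACEHOLDER_KEY = "your_actual_gemini_api_key_here"


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    parser_data_path: str = "parser_data"
    temp_files_path: str = "temp"
    confidence_threshold: float = 0.8
    max_ai_tokens: int = 800
    log_level: str = "INFO"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, failing early on a missing key."""
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key or key == PLACEHOLDER_KEY:
        raise ValueError("No Gemini API key found: set GEMINI_API_KEY in .env")

    threshold = float(env.get("DEFAULT_CONFIDENCE_THRESHOLD", "0.8"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("DEFAULT_CONFIDENCE_THRESHOLD must be between 0 and 1")

    return Settings(
        gemini_api_key=key,
        parser_data_path=env.get("PARSER_DATA_PATH", "parser_data"),
        temp_files_path=env.get("TEMP_FILES_PATH", "temp"),
        confidence_threshold=threshold,
        max_ai_tokens=int(env.get("MAX_AI_TOKENS", "800")),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )
```

Validating the key and threshold at startup surfaces configuration mistakes before the first document is processed rather than in the middle of a batch.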
multi-format-document-parser/
├── README.md # This comprehensive guide
├── requirements.txt # Python dependencies
├── .env.sample # Environment template
├── .env # Your actual environment (create this)
├── .gitignore # Git ignore rules
├── app.py # Streamlit web interface
├── doc_parser.py # Core parsing engine
├── demo/ # Demo materials (optional)
│   └── sample_documents/ # Test files
├── parser_data/ # Auto-created signature storage
│   └── signatures.pkl # Learned patterns (auto-generated)
└── temp/ # Temporary processing files (auto-created)
- Launch App: Run streamlit run app.py
- Upload Document: Drag and drop any supported file format
- Process: Click "Process Documents"
- View Results: Check the normalized JSON output and processing logs
- Upload: Supports PDF, DOCX, images, HTML, CSV, and more
- Text Extraction: Docling processes the document (with Tesseract OCR for images)
- Smart Processing:
- First document: Uses AI extraction + creates signature
- Similar documents: Uses signature (free!)
- JSON Output: Consistent structured data every time
- Enable "Batch Processing Mode" in sidebar
- Upload multiple files
- Monitor progress and costs in real-time
- Download results individually or in bulk
Every processed document returns this consistent structure:
{
"document_id": "abc123...",
"document_type": "invoice",
"processing_date": "2024-01-15T10:30:00Z",
"sender": {
"name": "ABC Company Ltd",
"contact": "+1-555-0123",
"address": "123 Business St"
},
"recipient": {
"name": "Customer Name",
"contact": "customer@email.com"
},
"metadata": {
"date": "2024-01-10",
"reference_number": "INV-2024-001",
"subject": "Monthly Services"
},
"financial": {
"currency": "USD",
"subtotal": 1000.00,
"tax": 100.00,
"total": 1100.00
},
"line_items": [...],
"tables": [...],
"custom_fields": {...},
"processing_info": {
"strategy": "signature_match",
"confidence": 0.95,
"signature_id": "sig_abc123",
"ai_calls": 0,
"processing_time": 0.45
}
}
- signature_match: Used existing pattern (0 cost)
- ai_extraction: Used AI to extract and learn (small cost, creates signature)
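One way to guarantee the same top-level structure for every document is to merge whatever was extracted into a fixed template. The sketch below is an assumption about how that could be done, not the project's implementation; `EMPTY_RECORD` and `normalize` are hypothetical names, and the keys are copied from the example output above.

```python
import copy
from datetime import datetime, timezone

# Template with every top-level key from the example output.
EMPTY_RECORD = {
    "document_id": None,
    "document_type": "unknown",
    "processing_date": None,
    "sender": {"name": None, "contact": None, "address": None},
    "recipient": {"name": None, "contact": None},
    "metadata": {"date": None, "reference_number": None, "subject": None},
    "financial": {"currency": None, "subtotal": None, "tax": None, "total": None},
    "line_items": [],
    "tables": [],
    "custom_fields": {},
    "processing_info": {},
}


def normalize(extracted: dict) -> dict:
    """Merge extracted data into the template; unknown keys go to custom_fields."""
    record = copy.deepcopy(EMPTY_RECORD)
    record["processing_date"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    for key, value in extracted.items():
        if key not in record:
            record["custom_fields"][key] = value
        elif isinstance(record[key], dict):
            if isinstance(value, dict):
                record[key].update(value)  # fill in the known section
            else:
                record["custom_fields"][key] = value  # keep the section's shape intact
        else:
            record[key] = value
    return record
```

Routing unrecognized keys to `custom_fields` keeps the top-level keys stable, so downstream consumers can rely on them regardless of input format.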
First Document from Vendor A:
- Uses AI extraction (~$0.0001-$0.001)
- Creates signature automatically
Second Document from Vendor A:
- Uses signature matching (FREE!)
- Processing time: ~0.2 seconds
Result: 90-95% cost reduction after learning
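The 90-95% figure is consistent with a simple cost model, sketched here under explicit assumptions: each new layout needs exactly one AI call, every later document with that layout matches its signature at zero cost, and all AI calls cost the same. The function names are illustrative. Under this model the reduction is 1 - layouts/documents, so 90% corresponds to about 10 documents per layout and 95% to about 20.

```python
def ai_only_cost(n_docs: int, cost_per_call: float) -> float:
    """Cost when every document is sent to the AI extractor."""
    return n_docs * cost_per_call


def hybrid_cost(n_layouts: int, cost_per_call: float) -> float:
    """Cost when only the first document of each layout needs an AI call."""
    return n_layouts * cost_per_call


def cost_reduction(n_docs: int, n_layouts: int) -> float:
    """Fraction of AI cost saved by the hybrid approach under the model above."""
    if n_docs < 1 or not 0 <= n_layouts <= n_docs:
        raise ValueError("need n_docs >= 1 and 0 <= n_layouts <= n_docs")
    return 1.0 - n_layouts / n_docs


if __name__ == "__main__":
    price = 0.001  # upper end of the per-extraction estimate quoted above
    for docs, layouts in [(100, 10), (100, 5)]:
        saved = cost_reduction(docs, layouts)
        print(f"{docs} docs, {layouts} layouts: "
              f"AI-only ${ai_only_cost(docs, price):.3f}, "
              f"hybrid ${hybrid_cost(layouts, price):.3f}, "
              f"saved {saved:.0%}")
```

Real savings depend on how often layouts repeat and on how many matches fall below the confidence threshold and go back to AI extraction.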
The app provides detailed cost tracking:
- Current session costs
- Per-document breakdown
- Strategy distribution charts
- Optimization recommendations
- Automatic Learning: Every successful AI extraction creates a reusable signature
- Export/Import: Backup signatures or share across deployments
- Version Control: Signatures are versioned to prevent breaking changes
- Sender ID: Specify for better signature matching
- Force AI Extraction: Override signatures for testing
- Batch Processing: Process multiple files simultaneously
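Exporting and importing versioned signatures might look like the sketch below. It is not the project's format: the project stores signatures in signatures.pkl, while this example uses JSON as a swapped-in interchange format, since pickle files should not be loaded from untrusted sources. The `format_version` field, the per-signature `version` field, and the "newer version wins" merge rule are all assumptions.

```python
import json

FORMAT_VERSION = 1  # hypothetical version of the export file layout


def export_signatures(signatures: dict) -> str:
    """Serialize signatures to a JSON string that can be backed up or shared."""
    payload = {"format_version": FORMAT_VERSION, "signatures": signatures}
    return json.dumps(payload, indent=2, sort_keys=True)


def import_signatures(payload: str, existing: dict) -> dict:
    """Merge imported signatures into a copy of the existing ones.

    An imported signature replaces a local one only if its version is higher.
    """
    data = json.loads(payload)
    if data.get("format_version") != FORMAT_VERSION:
        raise ValueError(f"unsupported export format: {data.get('format_version')!r}")

    merged = dict(existing)
    for sig_id, sig in data["signatures"].items():
        current = merged.get(sig_id)
        if current is None or sig["version"] > current["version"]:
            merged[sig_id] = sig
    return merged
```

Returning a merged copy instead of mutating the existing store makes it easy to inspect the result before replacing the signatures in use.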
Every document processing includes detailed logs:
- Text extraction results
- Signature matching attempts
- AI processing decisions
- Field mapping explanations
- Cost calculations
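A per-document decision log of this kind could be collected as sketched below. `DecisionLog` and `level_from_env` are hypothetical names, and filtering entries by the `LOG_LEVEL` variable is an assumption about how the setting is used.

```python
import logging
import os
from typing import Mapping

_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def level_from_env(env: Mapping[str, str] = os.environ) -> int:
    """Translate LOG_LEVEL into a logging level, defaulting to INFO."""
    return _LEVELS.get(env.get("LOG_LEVEL", "INFO").strip().upper(), logging.INFO)


class DecisionLog:
    """Collects the steps taken for one document so they can be shown with the result."""

    def __init__(self, document_id: str, level: int = logging.INFO):
        self.document_id = document_id
        self.level = level
        self.entries = []

    def record(self, step: str, message: str, level: int = logging.INFO) -> None:
        # Entries below the configured level are dropped, as with a normal logger.
        if level >= self.level:
            self.entries.append({
                "step": step,
                "message": message,
                "level": logging.getLevelName(level),
            })
```

With `LOG_LEVEL=DEBUG`, low-level details such as individual signature match scores would be kept; at the default level only the main decisions are recorded.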
API Key Error
Error: No Gemini API key found
Solution: Check your .env file - ensure GEMINI_API_KEY is set correctly
Tesseract Not Found
Error: TesseractNotFoundError
Solutions:
1. Verify installation: tesseract --version
2. Add to PATH (Windows) or check installation path
3. Set TESSERACT_CMD in .env file
4. Restart terminal/command prompt after PATH changes
Import Error: docling
Solution: pip install docling[pdf] or uv pip install docling[pdf]
Streamlit Port Already in Use
Solution: streamlit run app.py --server.port 8502
Memory Issues with Large Files
Solution: Process smaller batches or increase system memory
Signature Not Matching
Solution: Check interpretation logs for pattern matching scores
Enable debug logging to see detailed matching process
OCR Not Working on Images
Solutions:
1. Ensure Tesseract is installed and in PATH
2. Check image quality (should be clear, high contrast)
3. Try different image formats (PNG, JPEG, TIFF)
4. Verify TESSERACT_CMD path in .env file (Windows)
Enable detailed logging for troubleshooting:
export LOG_LEVEL=DEBUG
streamlit run app.py
Create a test image with text and run:
# Test basic OCR functionality
tesseract test_image.png output_text # Tesseract appends .txt, so this writes output_text.txt
# Check supported languages
tesseract --list-langs
streamlit run app.py
For Issues:
- Check existing GitHub Issues
- Create detailed bug reports
- Include error logs and system information
For Contributions:
- Fork the repository
- Create feature branch
- Submit pull request with clear description
- Docling Team: Excellent multi-format document processing
- Google AI: Gemini API for intelligent extraction
- Tesseract OCR: Robust optical character recognition
- Streamlit: Intuitive web framework
- Competition Organizers: Challenging real-world problem
Built by Team T34 - Aryan & Bhavan
Transforming document processing with intelligence, efficiency, and cost-effectiveness.