This project provides multiple approaches for extracting text from PDF documents using both structured parsing and OCR techniques. It is designed for document preprocessing workflows such as Retrieval-Augmented Generation (RAG), data pipelines, and information extraction systems.
The project includes three main scripts:
- `docling.py` – Structured PDF parsing using Docling
- `tesseract.py` – OCR-based extraction using Tesseract
- `easyocr.py` – OCR-based extraction using EasyOCR
Each script processes PDFs and outputs text in Markdown and/or JSON formats.
project_root/
│
├── input_folder/ # Input PDFs for all scripts
│
├── output_folder/ # Rename or organize outputs here
│
├── docling.py
├── tesseract.py
├── easyocr.py
└── README.md
Before running the scripts, ensure:
- All input directories are renamed to `input_folder`
- All output directories are renamed to `output_folder`
If your scripts still reference older folder names, update them accordingly or rename your folders to match.
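One way to keep the folder names consistent is to define them once and validate them at startup. The sketch below is an assumption about how that could look, not code taken from the scripts; `INPUT_DIR`, `OUTPUT_DIR`, and `prepare_folders` are hypothetical names.

```python
from pathlib import Path

# Hypothetical constants; the folder names match the layout described above.
INPUT_DIR = Path("input_folder")
OUTPUT_DIR = Path("output_folder")


def prepare_folders(input_dir: Path = INPUT_DIR, output_dir: Path = OUTPUT_DIR) -> list:
    """Return the PDFs found in input_dir and make sure output_dir exists."""
    if not input_dir.is_dir():
        raise FileNotFoundError(f"Input folder not found: {input_dir}")
    output_dir.mkdir(parents=True, exist_ok=True)
    # Match the .pdf extension case-insensitively and sort for a stable order.
    return sorted(
        p for p in input_dir.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling `prepare_folders()` at the top of a script fails early when the input folder is missing, rather than partway through a run.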
Uses Docling's document parser to extract structured text directly from PDFs without OCR.
- Preserves document structure
- Outputs clean Markdown
- Fast for machine-readable PDFs
- Input folder: `input_folder/`
- Output: Markdown files saved to your configured `output_folder/`
- Load PDFs from input folder
- Convert using Docling
- Export structured text to Markdown
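The three steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of `docling.py`: it assumes Docling's `DocumentConverter` with default settings, and the helper names `markdown_path_for` and `convert_folder` are hypothetical.

```python
from pathlib import Path


def markdown_path_for(pdf_path: Path, output_dir: Path) -> Path:
    """Map input_folder/report.pdf to output_folder/report.md."""
    return output_dir / (pdf_path.stem + ".md")


def convert_folder(input_dir: Path, output_dir: Path) -> list:
    """Convert every PDF in input_dir to Markdown and return the written paths."""
    # Imported here so the path helper above is usable without Docling installed.
    from docling.document_converter import DocumentConverter

    output_dir.mkdir(parents=True, exist_ok=True)
    converter = DocumentConverter()  # default pipeline options (assumption)
    written = []
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        result = converter.convert(pdf_path)
        markdown = result.document.export_to_markdown()
        target = markdown_path_for(pdf_path, output_dir)
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_folder(Path("input_folder"), Path("output_folder"))` would write one `.md` file per PDF. If the sketch is saved in a file that is itself named `docling.py`, Python resolves `import docling` to that file instead of the installed package, so it would need a different file name.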
Uses Tesseract OCR to extract text from scanned PDFs by converting pages into images.
- Image preprocessing using OpenCV
- Optimized for structured forms
- Outputs both Markdown and JSON
- Page-level text extraction
- Tesseract OCR installed
- Path configured: `pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"` (a raw string, so the `\t` in the path is not read as a tab character)
- Input folder: `input_folder/`
- Output: Markdown and JSON saved to your configured `output_folder/`
- Convert PDF pages to images (300 DPI)
- Preprocess images (grayscale + thresholding)
- Extract text using Tesseract
- Clean and structure output
- Save as Markdown and JSON
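A rough sketch of this workflow is shown below. It rests on several assumptions: pages are rendered with `pdf2image` (which needs Poppler), thresholding uses Otsu's method rather than whatever `tesseract.py` actually applies, and the function names and the per-page record layout are hypothetical.

```python
import json
import re
from pathlib import Path


def preprocess(page_image):
    """Convert a PIL page image to a binarized grayscale array."""
    # Imported lazily so the text helpers work without OpenCV installed.
    import cv2
    import numpy as np

    gray = cv2.cvtColor(np.array(page_image), cv2.COLOR_RGB2GRAY)
    # Otsu picks the threshold automatically; a fixed value is an alternative.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary


def clean_text(raw: str) -> str:
    """Trim each line, collapse runs of spaces and tabs, and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def ocr_pdf(pdf_path: Path) -> list:
    """Return one {"page": n, "text": ...} record per page."""
    import pytesseract
    from pdf2image import convert_from_path

    pages = convert_from_path(str(pdf_path), dpi=300)
    return [
        {"page": number, "text": clean_text(pytesseract.image_to_string(preprocess(image)))}
        for number, image in enumerate(pages, start=1)
    ]


def save_outputs(records: list, stem: str, output_dir: Path) -> None:
    """Write the page records as Markdown and JSON."""
    output_dir.mkdir(parents=True, exist_ok=True)
    markdown = "\n\n".join(f"## Page {r['page']}\n\n{r['text']}" for r in records)
    (output_dir / f"{stem}.md").write_text(markdown, encoding="utf-8")
    (output_dir / f"{stem}.json").write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```

Keeping `clean_text` and `save_outputs` free of OCR imports means they can be checked on plain strings without Tesseract installed.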
Uses EasyOCR for deep learning-based text extraction from PDFs.
- Confidence-based filtering
- Outputs structured JSON with confidence scores
- Good for complex or low-quality scans
- Poppler installed (for PDF rendering)
- Path configured: `POPPLER_PATH = r"C:\Users..."` (use a raw string so the backslashes are not treated as escape sequences)
- Input folder: `input_folder/`
- Output: Markdown and JSON saved to your configured `output_folder/`
- Detect number of pages
- Convert each page to image
- Run EasyOCR text detection
- Filter results by confidence threshold
- Save results to Markdown and JSON
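The page loop and confidence filter could look roughly like the sketch below. It is an illustration under assumptions, not the contents of `easyocr.py`: the `0.5` threshold, the English-only reader, and the names `MIN_CONFIDENCE`, `filter_by_confidence`, and `ocr_pdf` are all hypothetical, and page rendering is assumed to go through `pdf2image`.

```python
from pathlib import Path
from typing import Optional

# Hypothetical default; the threshold used by the script is not stated here.
MIN_CONFIDENCE = 0.5


def filter_by_confidence(detections, min_confidence: float = MIN_CONFIDENCE) -> list:
    """Keep (bbox, text, confidence) detections at or above the threshold."""
    return [
        {"text": text, "confidence": round(float(confidence), 3)}
        for _bbox, text, confidence in detections
        if confidence >= min_confidence
    ]


def ocr_pdf(pdf_path: Path, poppler_path: Optional[str] = None) -> list:
    """Run EasyOCR page by page and return the filtered lines for each page."""
    # Imported lazily so the filter above can be used without the OCR stack.
    import easyocr
    import numpy as np
    from pdf2image import convert_from_path, pdfinfo_from_path

    reader = easyocr.Reader(["en"])  # assumes English documents
    page_count = pdfinfo_from_path(str(pdf_path), poppler_path=poppler_path)["Pages"]
    results = []
    for number in range(1, page_count + 1):
        # Render one page at a time to limit memory use on long PDFs.
        image = convert_from_path(
            str(pdf_path), first_page=number, last_page=number, poppler_path=poppler_path
        )[0]
        detections = reader.readtext(np.array(image))
        results.append({"page": number, "lines": filter_by_confidence(detections)})
    return results
```

The returned list is plain dictionaries, so it can be passed straight to `json.dumps` for the JSON output or joined line by line for the Markdown output.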
| Method | Best For | Output Type | Speed | Accuracy |
|---|---|---|---|---|
| Docling | Digital PDFs (text-based) | Markdown | Fast | High |
| Tesseract | Scanned forms/documents | Markdown + JSON | Medium | Medium |
| EasyOCR | Complex or noisy images | Markdown + JSON | Slower | High |
- Document ingestion for RAG systems
- Converting PDFs into structured datasets
- Extracting text from scanned forms
- Preprocessing for machine learning pipelines
Install the required Python packages: `pip install docling pytesseract pdf2image easyocr opencv-python pillow numpy`
- Tesseract OCR (required for `tesseract.py`)
- Poppler (required for `easyocr.py`)
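A quick way to check whether these two system tools are reachable is to look them up on `PATH`. The helper names below are hypothetical, and the sketch assumes that `pdftoppm` (one of the Poppler command-line utilities) is a reasonable stand-in for "Poppler is installed".

```python
import shutil


def find_system_tools() -> dict:
    """Map each external tool to its location on PATH, or None when not found."""
    # pdftoppm ships with Poppler and is what pdf2image calls to render pages.
    return {name: shutil.which(name) for name in ("tesseract", "pdftoppm")}


def missing_tools(found: dict) -> list:
    """Return the names of tools that were not found."""
    return [name for name, path in found.items() if path is None]
```

A tool reported as missing may still be installed outside `PATH`, which is common on Windows; in that case the explicit path settings described for `tesseract.py` and `easyocr.py` apply.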
python docling.py
python tesseract.py
python easyocr.py
- Ensure `input_folder/` contains PDF files before running scripts
- Ensure `output_folder/` exists or is correctly referenced in scripts
- OCR performance depends heavily on image quality and preprocessing
- Add text chunking for RAG pipelines
- Integrate embeddings and vector databases
- Add support for batch processing with metadata tagging
- Improve layout-aware parsing