Data Processing with OCR

Overview

This project provides multiple approaches for extracting text from PDF documents using both structured parsing and OCR techniques. It is designed for document preprocessing workflows such as Retrieval-Augmented Generation (RAG), data pipelines, and information extraction systems.

The project includes three main scripts:

docling.py – Structured PDF parsing using Docling
tesseract.py – OCR-based extraction using Tesseract
easyocr.py – OCR-based extraction using EasyOCR

Each script processes PDFs and outputs text in Markdown and/or JSON formats.

Project Structure

project_root/
│
├── input_folder/     # Input PDFs for all scripts
│
├── output_folder/    # Rename or organize outputs here
│
├── docling.py
├── tesseract.py
├── easyocr.py
└── README.md

Important Setup Note

Before running the scripts, ensure:

All input directories are renamed to: input_folder
All output directories are renamed to: output_folder

If your scripts still reference older folder names, update them accordingly or rename your folders to match.
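
A small pre-flight check can catch folder naming problems before any script runs. The sketch below assumes the two folder names from this README. The helper name `prepare_folders` is hypothetical and is not part of the project's scripts.

```python
from pathlib import Path


def prepare_folders(root=".", input_name="input_folder", output_name="output_folder"):
    """Return (pdf_paths, output_dir). Hypothetical helper, not from the scripts."""
    root = Path(root)
    input_dir = root / input_name
    output_dir = root / output_name

    if not input_dir.is_dir():
        raise FileNotFoundError(f"Expected input folder at {input_dir}")

    # Create the output folder if it does not exist yet.
    output_dir.mkdir(parents=True, exist_ok=True)

    # Match .pdf and .PDF alike.
    pdfs = sorted(
        p for p in input_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return pdfs, output_dir
```

Calling `prepare_folders()` from the project root returns the PDFs to process and the folder to write into, and fails early with a clear message if `input_folder/` is missing.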

1. Docling Pipeline (`docling.py`)

Description

Uses Docling's document parser to extract structured text directly from PDFs without OCR.

Features

Preserves document structure
Outputs clean Markdown
Fast for machine-readable PDFs

Input

Folder: input_folder/

Output

Markdown files saved to your configured output_folder/

Workflow

Load PDFs from input folder
Convert using Docling
Export structured text to Markdown
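
The three workflow steps above could look roughly like the following. This is a minimal sketch, not the contents of `docling.py`: it assumes Docling's `DocumentConverter` with default settings, and the function names `markdown_path_for` and `convert_folder` are made up for illustration.

```python
from pathlib import Path


def markdown_path_for(pdf_path, output_dir):
    """Map input_folder/report.pdf to output_folder/report.md."""
    return Path(output_dir) / (Path(pdf_path).stem + ".md")


def convert_folder(input_dir="input_folder", output_dir="output_folder"):
    """Convert every PDF in input_dir to Markdown and return the written paths."""
    # Imported lazily so the path helper above works without Docling installed.
    from docling.document_converter import DocumentConverter

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    converter = DocumentConverter()  # default settings (an assumption)

    written = []
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        result = converter.convert(pdf_path)
        markdown = result.document.export_to_markdown()
        target = markdown_path_for(pdf_path, output_dir)
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_folder()` from the project root writes one `.md` file per input PDF, named after the source file.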

2. Tesseract OCR Pipeline (`tesseract.py`)

Description

Uses Tesseract OCR to extract text from scanned PDFs by converting pages into images.

Features

Image preprocessing using OpenCV
Optimized for structured forms
Outputs both Markdown and JSON
Page-level text extraction

Requirements

Tesseract OCR installed
Path configured as a raw string, so that backslash sequences such as \t are not interpreted as escape characters: pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
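
Hard-coding the Windows path ties the script to one machine. One possible alternative, sketched below, looks up the binary in several places. The environment variable name `TESSERACT_CMD` and the helper `resolve_tesseract_cmd` are assumptions made for this example and do not appear in `tesseract.py`.

```python
import os
import shutil

# Raw string so that "\t" in "\tesseract.exe" is not read as a tab character.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(default=DEFAULT_WINDOWS_PATH):
    """Pick the Tesseract binary: env var first, then PATH lookup, then the default."""
    # TESSERACT_CMD is a hypothetical variable name chosen for this sketch.
    return os.environ.get("TESSERACT_CMD") or shutil.which("tesseract") or default
```

The result can then be assigned in the usual way: `pytesseract.pytesseract.tesseract_cmd = resolve_tesseract_cmd()`.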

Input

Folder: input_folder/

Output

Markdown and JSON saved to your configured output_folder/

Workflow

Convert PDF pages to images (300 DPI)
Preprocess images (grayscale + thresholding)
Extract text using Tesseract
Clean and structure output
Save as Markdown and JSON
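
The workflow above can be sketched as follows. This is an illustration under assumptions rather than the actual `tesseract.py`: it uses pdf2image for rendering, Otsu thresholding for binarization (the script may use a fixed threshold instead), and hypothetical function names throughout.

```python
import json
import re


def clean_text(raw):
    """Strip trailing spaces and collapse runs of blank lines."""
    lines = [line.rstrip() for line in raw.splitlines()]
    text = "\n".join(lines)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def tesseract_pages(pdf_path, dpi=300):
    """Return one {"page", "text"} record per PDF page."""
    # Lazy imports: the heavy dependencies are only needed when OCR runs.
    import cv2
    import numpy as np
    import pytesseract
    from pdf2image import convert_from_path

    records = []
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        # pdf2image yields RGB PIL images; convert to a grayscale array.
        gray = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2GRAY)
        # Otsu picks the threshold automatically from the image histogram.
        _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        text = clean_text(pytesseract.image_to_string(binary))
        records.append({"page": number, "text": text})
    return records


def to_markdown(records):
    """Render page records as Markdown with one heading per page."""
    return "\n\n".join(f"## Page {r['page']}\n\n{r['text']}" for r in records)


def to_json(records):
    """Serialize page records as pretty-printed JSON."""
    return json.dumps(records, indent=2, ensure_ascii=False)
```

Keeping the page records as plain dictionaries means the same data feeds both the Markdown and the JSON writer.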

3. EasyOCR Pipeline (`easyocr.py`)

Description

Uses EasyOCR for deep learning-based text extraction from PDFs.

Features

Confidence-based filtering
Outputs structured JSON with confidence scores
Good for complex or low-quality scans

Requirements

Poppler installed (for PDF rendering)
Path configured as a raw string, since "\U" in a regular Python string is read as the start of a Unicode escape: POPPLER_PATH = r"C:\Users..."

Input

Folder: input_folder/

Output

Markdown and JSON saved to your configured output_folder/

Workflow

Detect number of pages
Convert each page to image
Run EasyOCR text detection
Filter results by confidence threshold
Save results to Markdown and JSON
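
A rough sketch of these steps is shown below. It assumes pdf2image for page counting and rendering, an English-only EasyOCR reader, and a threshold of 0.5. None of these values are confirmed by `easyocr.py`, and the function names are hypothetical.

```python
def filter_by_confidence(results, threshold=0.5):
    """Keep EasyOCR detections whose confidence meets the threshold."""
    kept = []
    # Each detection is a (bounding_box, text, confidence) triple.
    for _bbox, text, confidence in results:
        if confidence >= threshold:
            kept.append({"text": text, "confidence": round(float(confidence), 3)})
    return kept


def easyocr_pages(pdf_path, poppler_path=None, threshold=0.5, dpi=200):
    """Return one {"page", "items"} record per PDF page."""
    # Lazy imports so the filter above can be used without the OCR stack.
    import easyocr
    import numpy as np
    from pdf2image import convert_from_path, pdfinfo_from_path

    reader = easyocr.Reader(["en"])  # language list is an assumption
    page_count = pdfinfo_from_path(pdf_path, poppler_path=poppler_path)["Pages"]

    pages = []
    for number in range(1, page_count + 1):
        # Render one page at a time to keep memory use bounded.
        image = convert_from_path(
            pdf_path,
            dpi=dpi,
            first_page=number,
            last_page=number,
            poppler_path=poppler_path,
        )[0]
        detections = reader.readtext(np.array(image))
        pages.append({"page": number, "items": filter_by_confidence(detections, threshold)})
    return pages
```

Raising the threshold trades recall for precision: fewer misread fragments survive, but faint text may be dropped as well.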

Comparison of Approaches

| Method | Best For | Output Type | Speed | Accuracy |
|---|---|---|---|---|
| Docling | Digital PDFs (text-based) | Markdown | Fast | High |
| Tesseract | Scanned forms/documents | Markdown + JSON | Medium | Medium |
| EasyOCR | Complex or noisy images | Markdown + JSON | Slower | High |

Use Cases

Document ingestion for RAG systems
Converting PDFs into structured datasets
Extracting text from scanned forms
Preprocessing for machine learning pipelines

Installation

Install required Python packages:

pip install docling pytesseract pdf2image easyocr opencv-python pillow numpy

Additional System Dependencies

Tesseract OCR (required for tesseract.py)
Poppler (required for easyocr.py; pdf2image relies on it to render PDF pages)

How to Run

Run Docling

python docling.py

Run Tesseract OCR

python tesseract.py

Run EasyOCR

python easyocr.py

Notes

Ensure input_folder/ contains PDF files before running scripts
Ensure output_folder/ exists or is correctly referenced in scripts
OCR performance depends heavily on image quality and preprocessing

Future Improvements

Add text chunking for RAG pipelines
Integrate embeddings and vector databases
Add support for batch processing with metadata tagging
Improve layout-aware parsing
