# PDF OCR Text Conversion Pipeline

**Purpose:**  
Automate OCR-based text extraction from PDF certificates, classify each document as readable or unreadable, and export both the text and a per-document summary of key metrics.

**Inputs:**  
- A folder of `.pdf` documents (e.g. Certificates of Incorporation).

**Outputs:**  
1. A `pandas.DataFrame` summarizing, for each PDF:  
   - File name  
   - Number of characters & words extracted  
   - Mean OCR confidence (0–100)  
   - Composite readability score (0–1)  
   - Quality label (`readable`/`unreadable`)  
2. Text files saved to **readable** and **unreadable** subfolders.  
3. Console/log output of summary metrics.

---

## Table of Contents

1. [Environment Setup](#setup)  
2. [Configuration](#config)  
3. [Preprocessing Functions](#preproc)  
4. [OCR Extraction Functions](#ocr)  
5. [Readability Classification](#readability)  
6. [Processing Pipeline](#pipeline)  
7. [Run Pipeline](#run)  
8. [Results Analysis](#analysis)  
9. [Summary Metrics](#metrics)  
10. [Next Steps & Extensions](#next)


## Environment Setup

In [21]:
# 1. Environment Setup

import os
from pathlib import Path
import re
import logging
from concurrent.futures import ThreadPoolExecutor

import pytesseract
from pdf2image import convert_from_path
from PIL import Image

import pandas as pd
import numpy as np
from nltk.corpus import words

# Configure logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s — %(levelname)s — %(message)s")

# Point pytesseract at your Tesseract installation (adjust path as needed)
pytesseract.pytesseract.tesseract_cmd = r"C:\Users\Owner\AppData\Local\Programs\Tesseract-OCR\tesseract.exe"

# Load English vocabulary for word-list checks
# (requires the NLTK "words" corpus; run nltk.download("words") once if it is missing)
english_vocab = set(words.words())

## Configuration

In [22]:
# 2. Configuration

# Base directory containing your PDF batches
BASE_DIR = Path("D:/vc-research/vc-research")

# Define batch folders and output folders
BATCHES = {
    "Batch1": {
        "input": BASE_DIR / "Batch1",
        "readable": BASE_DIR / "Batch1_text_readable",
        "unreadable": BASE_DIR / "Batch1_text_unreadable"
    },
    "Batch2": {
        "input": BASE_DIR / "Batch2",
        "readable": BASE_DIR / "Batch2_text_readable",
        "unreadable": BASE_DIR / "Batch2_text_unreadable"
    }
}

# Create output directories if they don't exist
for cfg in BATCHES.values():
    cfg["readable"].mkdir(parents=True, exist_ok=True)
    cfg["unreadable"].mkdir(parents=True, exist_ok=True)

# OCR & classification parameters
OCR_DPI = 300                  # DPI for pdf2image conversion
READABILITY_THRESHOLD = 0.6    # Composite score threshold
MAX_WORKERS = os.cpu_count()   # Parallel threads
EXPORT_TEXT = True             # Whether to save extracted text to disk
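
Both batch entries above follow the same naming pattern, so the dictionary could also be generated from a list of batch names. The following is a minimal sketch of that idea; `make_batch_cfg` and `build_batches` are hypothetical helper names, not part of the notebook, and the relative `data` directory in the usage note is only a placeholder.

```python
from pathlib import Path


def make_batch_cfg(base_dir: Path, name: str) -> dict:
    """Build the input/readable/unreadable paths for one batch (hypothetical helper)."""
    return {
        "input": base_dir / name,
        "readable": base_dir / f"{name}_text_readable",
        "unreadable": base_dir / f"{name}_text_unreadable",
    }


def build_batches(base_dir: Path, names: list) -> dict:
    """Map each batch name to its folder configuration."""
    return {name: make_batch_cfg(base_dir, name) for name in names}
```

Calling `build_batches(Path("data"), ["Batch1", "Batch2"])` would give the same structure as `BATCHES`, so adding a third batch only means adding a name to the list.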

## Preprocessing Functions
Simplify images for better OCR accuracy (e.g., grayscale conversion).

In [23]:
# Preprocess page images for better OCR results
def preprocess_image(img: Image.Image) -> Image.Image:
    """
    Convert the input image to grayscale to improve OCR accuracy.
    """
    return img.convert("L")
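
To make the grayscale step concrete: Pillow documents the RGB to `L` conversion as the ITU-R 601-2 luma transform, `L = R * 299/1000 + G * 587/1000 + B * 114/1000`. The sketch below applies that formula to a single pixel in plain Python. `luma` is a hypothetical helper for illustration only, and its rounding may differ slightly from Pillow's internal integer arithmetic.

```python
def luma(rgb: tuple) -> int:
    """Approximate the gray level Pillow assigns to one RGB pixel in mode "L"."""
    r, g, b = rgb
    # Weighted sum of the channels; green contributes most, blue least.
    return round((299 * r + 587 * g + 114 * b) / 1000)
```

Because the weights sum to 1000, pure white `(255, 255, 255)` stays at 255 and pure black stays at 0, while coloured stamps or seals collapse to intermediate grays.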

## OCR Extraction Functions

In [24]:
def get_tesseract_confidence(img: Image.Image) -> float:
    """
    Run Tesseract word‐level OCR on the image and return the average confidence.
    """
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
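    # Tesseract reports -1 for blocks without recognized words. Depending on the
    # pytesseract version, values may arrive as str, int, or float, so compare numerically.
    confs = [float(c) for c in data["conf"] if float(c) >= 0]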
    return float(np.mean(confs)) if confs else 0.0


def extract_text_from_pdf(pdf_path: Path) -> str:
    """
    Convert each page of the PDF to an image, preprocess it, and OCR the text.
    Returns the concatenated text for the entire document.
    """
    text_parts = []
    try:
        pages = convert_from_path(str(pdf_path), dpi=OCR_DPI)
        for page in pages:
            img = preprocess_image(page)
            text_parts.append(pytesseract.image_to_string(img, lang="eng"))
        return "\n".join(text_parts)
    except Exception as e:
        logging.error(f"Failed to OCR {pdf_path.name}: {e}")
        return ""
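
The confidence averaging in `get_tesseract_confidence` can be illustrated without running Tesseract. The sketch below assumes the `conf` column is a list that may mix strings, ints, and floats, with `-1` marking entries that carry no word-level confidence. `mean_confidence` is a hypothetical helper, not a pytesseract function.

```python
from statistics import mean


def mean_confidence(conf_values) -> float:
    """Average the usable confidences, skipping -1 sentinels and unparsable entries."""
    kept = []
    for raw in conf_values:
        try:
            value = float(raw)
        except (TypeError, ValueError):
            continue  # e.g. empty strings or None
        if value >= 0:
            kept.append(value)
    # Fall back to 0.0 when the page produced no scored words.
    return mean(kept) if kept else 0.0
```

For example, `mean_confidence(["-1", 96, "88.5", -1, ""])` averages only 96 and 88.5. Comparing numerically rather than against the string `"-1"` keeps the filter working regardless of how the values are typed.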

## Readability Classification
Combine OCR confidence with English‐word coverage to decide readability.

In [25]:
def is_readable_text(img: Image.Image, text: str, threshold: float = READABILITY_THRESHOLD) -> bool:
    """
    Compute a composite score:
      70% weight → normalized Tesseract confidence (0–1)
      30% weight → fraction of words in English vocab
    Returns True if composite score ≥ threshold.
    """
    # 1) OCR confidence component
    conf_score = get_tesseract_confidence(img) / 100.0

    # 2) English‐word coverage
    tokens = re.findall(r"\b\w+\b", text.lower())
    if not tokens:
        return False
    valid = sum(1 for w in tokens if w in english_vocab) / len(tokens)

    composite = 0.7 * conf_score + 0.3 * valid
    return composite >= threshold
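
As a worked example of the weighting, the sketch below computes the composite score from a mean confidence and a text sample. `composite_score` is a hypothetical helper and the three-word vocabulary in the usage note is a toy stand-in for `english_vocab`. Unlike `is_readable_text`, it returns the score rather than a boolean, and it treats empty text as zero coverage.

```python
import re


def composite_score(mean_conf: float, text: str, vocab: set, conf_weight: float = 0.7) -> float:
    """Blend normalized OCR confidence (0-100 input) with vocabulary coverage."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    # Fraction of tokens found in the vocabulary; 0.0 when there are no tokens.
    coverage = sum(1 for t in tokens if t in vocab) / len(tokens) if tokens else 0.0
    return conf_weight * (mean_conf / 100.0) + (1.0 - conf_weight) * coverage
```

With `vocab = {"certificate", "of", "incorporation"}`, the text `"Certificate of Incorporation xqz 1234"` has coverage 3/5, so a mean confidence of 80 gives 0.7 × 0.80 + 0.3 × 0.60 = 0.74, which is above the 0.6 threshold. By the same arithmetic, a 13-word document with mean confidence 43.29 and 6 vocabulary hits would score about 0.44 and be labeled unreadable.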

## Processing Pipeline
Process each PDF: OCR → metrics → classification → optional text export.

In [26]:
def process_pdf(
    pdf_path: Path,
    readable_dir: Path,
    unreadable_dir: Path,
    threshold: float = READABILITY_THRESHOLD,
    export_text: bool = EXPORT_TEXT
) -> dict:
    """
    For a single PDF:
      • OCR each page
      • Compute metrics: #chars, #words, avg. confidence, composite score
      • Label as 'readable' or 'unreadable'
      • Optionally save text to the corresponding folder
      • Return a dict of metrics
    """
    logging.info(f"Processing {pdf_path.name}")
    metrics = {"file_name": pdf_path.name}

    try:
        pages = convert_from_path(str(pdf_path), dpi=OCR_DPI, poppler_path=r"C:\poppler-24.08.0\Library\bin")
        all_text = []
        confidences = []

        for page in pages:
            img = preprocess_image(page)
            txt = pytesseract.image_to_string(img, lang="eng")
            all_text.append(txt)
            confidences.append(get_tesseract_confidence(img))

        full_text = "\n".join(all_text)
        char_count = len(full_text)
        word_count = len(re.findall(r"\b\w+\b", full_text))
        avg_conf = float(np.mean(confidences)) if confidences else 0.0

        # Re‐apply readability logic on full document
        tokens = re.findall(r"\b\w+\b", full_text.lower())
        valid_pct = (sum(1 for w in tokens if w in english_vocab) / len(tokens)) if tokens else 0.0
        composite = 0.7 * (avg_conf / 100.0) + 0.3 * valid_pct
        label = "readable" if composite >= threshold else "unreadable"

        # Save text file, if desired
        if export_text:
            out_dir = readable_dir if label == "readable" else unreadable_dir
            txt_path = out_dir / f"{pdf_path.stem}.txt"
            txt_path.write_text(full_text, encoding="utf-8")

        # Populate metrics dict
        metrics.update({
            "number_of_characters": char_count,
            "number_of_words": word_count,
            "confidence_level": avg_conf,
            "composite_score": composite,
            "quality": label
        })

    except Exception as e:
        logging.error(f"Error in {pdf_path.name}: {e}")
        # On failure, mark as unreadable with zero metrics
        metrics.update({
            "number_of_characters": 0,
            "number_of_words": 0,
            "confidence_level": 0.0,
            "composite_score": 0.0,
            "quality": "unreadable"
        })

    return metrics

In [27]:
def process_all_pdfs(
    folder: Path,
    readable_dir: Path,
    unreadable_dir: Path,
    threshold: float = READABILITY_THRESHOLD,
    limit: int = None,
    export_text: bool = EXPORT_TEXT
) -> pd.DataFrame:
    """
    Process every PDF in `folder` in parallel, returning a DataFrame of results.
    """
    pdfs = list(folder.glob("*.pdf"))
    if limit:
        pdfs = pdfs[:limit]

    # Map PDFs → metrics dicts (named `executor` to avoid shadowing the built-in `exec`)
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        results = list(executor.map(
            lambda p: process_pdf(p, readable_dir, unreadable_dir, threshold, export_text),
            pdfs
        ))

    return pd.DataFrame(results)
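
The parallel mapping pattern can be tried in isolation. The sketch below uses a hypothetical stand-in worker, `score_file`, that does no OCR, and binds the shared argument with `functools.partial` as an alternative to a lambda. `Executor.map` yields results in the order of its inputs, so rows line up with the file list even when threads finish out of order.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def score_file(name: str, threshold: float) -> dict:
    """Hypothetical stand-in for process_pdf: derives a fake score from the name length."""
    score = (len(name) % 10) / 10.0
    label = "readable" if score >= threshold else "unreadable"
    return {"file_name": name, "quality": label}


def run_parallel(names: list, threshold: float = 0.6, max_workers: int = 4) -> list:
    """Apply the worker to every name in a thread pool, preserving input order."""
    worker = partial(score_file, threshold=threshold)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(worker, names))
```

Threads are a reasonable fit here on the assumption that most of the time is spent waiting on the external Tesseract and Poppler processes rather than in Python code.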

## Run Pipeline
Kick off processing for one batch (adjust name to switch batches).

In [28]:
# Example: run on Batch1
batch_cfg = BATCHES["Batch1"]
results_df = process_all_pdfs(
    folder=batch_cfg["input"],
    readable_dir=batch_cfg["readable"],
    unreadable_dir=batch_cfg["unreadable"]
)

# Display the first few results
results_df.head()

2025-07-14 19:42:25,168 — INFO — Processing 100_2007-02-22_Certificates of Incorporation.pdf
2025-07-14 19:42:25,171 — INFO — Processing 100_2008-12-03_Certificates of Incorporation.pdf
2025-07-14 19:42:25,173 — INFO — Processing 10_2006-09-13_Certificates of Incorporation.pdf
2025-07-14 19:42:25,174 — INFO — Processing 16_2003-07-03_Certificates of Incorporation.pdf
2025-07-14 19:42:25,178 — INFO — Processing 16_2004-01-22_Certificates of Incorporation.pdf
2025-07-14 19:42:25,180 — INFO — Processing 16_2004-07-14_Certificates of Incorporation.pdf
2025-07-14 19:42:25,195 — INFO — Processing 16_2005-05-18_Certificates of Incorporation.pdf
2025-07-14 19:42:25,195 — INFO — Processing 16_2006-03-09_Certificates of Incorporation.pdf
2025-07-14 19:42:25,196 — INFO — Processing 16_2007-05-16_Certificates of Incorporation.pdf
2025-07-14 19:42:25,197 — INFO — Processing 16_2008-03-03_Certificates of Incorporation.pdf
2025-07-14 19:42:25,206 — INFO — Processing 16_2009-01-20_Certificates of Inco

| | file_name | number_of_characters | number_of_words | confidence_level | composite_score | quality |
|---|---|---|---|---|---|---|
| 0 | 100_2007-02-22_Certificates of Incorporation.pdf | 48801 | 7983 | 83.845956 | 0.841901 | readable |
| 1 | 100_2008-12-03_Certificates of Incorporation.pdf | 58477 | 9529 | 83.41506 | 0.834131 | readable |
| 2 | 10_2006-09-13_Certificates of Incorporation.pdf | 86 | 13 | 43.291667 | 0.441503 | unreadable |
| 3 | 16_2003-07-03_Certificates of Incorporation.pdf | 2283 | 372 | 79.729604 | 0.796817 | readable |
| 4 | 16_2004-01-22_Certificates of Incorporation.pdf | 6134 | 980 | 77.905922 | 0.792382 | readable |


## Results Analysis
Quick look at overall distribution of quality labels.

In [38]:
results_df['confidence_level'].mean()

78.28188975688529

In [None]:
# Count of readable vs. unreadable
results_df["quality"].value_counts()

quality
readable      89
unreadable     3
Name: count, dtype: int64

In [30]:
# Sample unreadable documents for manual review
results_df[results_df["quality"] == "unreadable"].head()

| | file_name | number_of_characters | number_of_words | confidence_level | composite_score | quality |
|---|---|---|---|---|---|---|
| 2 | 10_2006-09-13_Certificates of Incorporation.pdf | 86 | 13 | 43.291667 | 0.441503 | unreadable |
| 14 | 16_2015-04-22_Certificates of Incorporation.pdf | 591 | 100 | 60.858209 | 0.597007 | unreadable |
| 36 | 27_2008-11-13_Certificates of Incorporation.pdf | 554 | 113 | 28.209933 | 0.340832 | unreadable |
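
One of these documents scores 0.597007, only just under the 0.6 threshold, which suggests that borderline cases may deserve a separate look from clear failures. The sketch below flags scores within a margin of the threshold. `borderline` is a hypothetical helper and the 0.05 margin is an arbitrary assumption, not a tuned value.

```python
def borderline(rows: list, threshold: float = 0.6, margin: float = 0.05) -> list:
    """Return file names whose composite score lies within `margin` of the threshold."""
    return [name for name, score in rows if abs(score - threshold) <= margin]


# (file_name, composite_score) pairs copied from the unreadable rows above
unreadable_rows = [
    ("10_2006-09-13_Certificates of Incorporation.pdf", 0.441503),
    ("16_2015-04-22_Certificates of Incorporation.pdf", 0.597007),
    ("27_2008-11-13_Certificates of Incorporation.pdf", 0.340832),
]
near_threshold = borderline(unreadable_rows)
```

Here only the `16_2015-04-22` file falls inside the margin. The other two are far enough below the threshold that manual review is more likely to confirm a poor scan.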



## Summary Metrics
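Compute overall readability rates for each batch by counting the text files in each output folder. Because these counts come from files on disk rather than from `results_df`, files left over from earlier runs are included, which may explain why Batch1 shows 90 readable files here against 89 readable labels in the results analysis.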

In [32]:
for name, cfg in BATCHES.items():
    total_pdfs = len(list(cfg["input"].glob("*.pdf")))
    readable_txt = len(list(cfg["readable"].glob("*.txt")))
    unreadable_txt = len(list(cfg["unreadable"].glob("*.txt")))

    print(
        f"{name}: {readable_txt}/{total_pdfs} readable "
        f"({readable_txt/total_pdfs:.1%}), "
        f"{unreadable_txt/total_pdfs:.1%} unreadable"
    )

Batch1: 90/92 readable (97.8%), 3.3% unreadable
Batch2: 948/950 readable (99.8%), 0.2% unreadable
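
An alternative is to derive the rates from the quality labels themselves, for instance from `results_df["quality"].tolist()`, instead of from files on disk. The sketch below is one way to do that with a guard against empty batches. `readability_rates` is a hypothetical helper, and the label counts in the usage note are the Batch1 `value_counts()` figures shown earlier.

```python
from collections import Counter


def readability_rates(labels: list) -> dict:
    """Return the share of 'readable' and 'unreadable' labels, or zeros for an empty list."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {"readable": 0.0, "unreadable": 0.0}
    return {key: counts.get(key, 0) / total for key in ("readable", "unreadable")}
```

With 89 readable and 3 unreadable labels, `readability_rates` gives 89/92 and 3/92, which format as 96.7% and 3.3% with `:.1%`.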


## Next Steps & Extensions

- **Logging to File:** route `INFO`/`ERROR` logs to a timestamped logfile.  
- **Unit Tests:** build `pytest` tests for each function (`preprocess_image`, `get_tesseract_confidence`, etc.).  
- **Error Tracking:** record failed PDFs into a CSV for manual triage.  
- **Advanced Models:** integrate an LLM or fine‐tuned Vision+OCR model for edge‐case pages.  
- **Parallel Tuning:** experiment with chunk sizes or GPU‐accelerated OCR for speed.  
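
The error-tracking item could be sketched as follows. This assumes `process_pdf` were extended to store the exception message under an `error` key in its metrics dict. That key is hypothetical and does not exist in the current code, and `write_failures` is likewise an illustrative name.

```python
import csv


def write_failures(results: list, csv_path: str) -> int:
    """Write file_name/error pairs for failed PDFs to a CSV and return how many were written."""
    # Keep only metrics dicts that carry a non-empty (hypothetical) "error" field.
    failed = [r for r in results if r.get("error")]
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["file_name", "error"], extrasaction="ignore")
        writer.writeheader()
        writer.writerows(failed)
    return len(failed)
```

Recording failures separately would distinguish PDFs that could not be processed at all from those that were processed but scored below the threshold. At present both end up labeled `unreadable`.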