# PDF OCR Text Conversion Pipeline

**Purpose:**  
Automate OCR-based text extraction from PDF certificates, label each document as readable or unreadable, and export both the extracted text and a summary of key metrics.

**Inputs:**  
- A folder of `.pdf` documents (e.g. Certificates of Incorporation).

**Outputs:**  
1. A `pandas.DataFrame` summarizing, for each PDF:  
   - File name  
   - Number of characters & words extracted  
   - Mean OCR confidence level (0 to 100)  
   - Composite readability score (0 to 1)  
   - Quality label (`readable`/`unreadable`)  
2. Text files saved to **readable** and **unreadable** subfolders.  
3. Console/log output of summary metrics.

---

## Table of Contents

1. [Environment Setup](#setup)  
2. [Configuration](#config)  
3. [Preprocessing Functions](#preproc)  
4. [OCR Extraction Functions](#ocr)  
5. [Readability Classification](#readability)  
6. [Processing Pipeline](#pipeline)  
7. [Run Pipeline](#run)  
8. [Results Analysis](#analysis)  
9. [Summary Metrics](#metrics)  
10. [Next Steps & Extensions](#next)


## Environment Setup

In [None]:
# 1. Environment Setup

import os
from pathlib import Path
import re
import logging
from concurrent.futures import ThreadPoolExecutor

import pytesseract
from pdf2image import convert_from_path
from PIL import Image

import pandas as pd
import numpy as np
from nltk.corpus import words

# Configure logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s — %(levelname)s — %(message)s")

# Point pytesseract at your Tesseract installation (adjust path as needed)
pytesseract.pytesseract.tesseract_cmd = "/usr/local/bin/tesseract"

# Load English vocabulary for word-list checks.
# Requires the NLTK "words" corpus; run nltk.download("words") once if it is missing.
english_vocab = set(words.words())

In [None]:
# 2. Configuration

# Base directory containing your PDF batches
BASE_DIR = Path("/Users/alexchen/Downloads/Projects/temp")

# Define batch folders and output folders
BATCHES = {
    "Batch1": {
        "input": BASE_DIR / "Batch1",
        "readable": BASE_DIR / "Batch1_text_readable",
        "unreadable": BASE_DIR / "Batch1_text_unreadable"
    },
    "Batch2": {
        "input": BASE_DIR / "Batch2",
        "readable": BASE_DIR / "Batch2_text_readable",
        "unreadable": BASE_DIR / "Batch2_text_unreadable"
    }
}

# Create output directories if they don't exist
for cfg in BATCHES.values():
    cfg["readable"].mkdir(parents=True, exist_ok=True)
    cfg["unreadable"].mkdir(parents=True, exist_ok=True)

# OCR & classification parameters
OCR_DPI = 300                  # DPI for pdf2image conversion
READABILITY_THRESHOLD = 0.6    # Composite score threshold
MAX_WORKERS = os.cpu_count()   # Parallel threads
EXPORT_TEXT = True             # Whether to save extracted text to disk
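
The `BATCHES` dictionary repeats the same three-path pattern for every batch. As a minimal sketch, the paths could be derived from the batch name instead, assuming the `<name>`, `<name>_text_readable`, `<name>_text_unreadable` naming used above. `make_batch_config` and the `pdf_batches` directory are hypothetical names, not part of the original notebook.

```python
from pathlib import Path


def make_batch_config(base_dir: Path, name: str) -> dict:
    """Build the input/readable/unreadable paths for one batch (hypothetical helper)."""
    return {
        "input": base_dir / name,
        "readable": base_dir / f"{name}_text_readable",
        "unreadable": base_dir / f"{name}_text_unreadable",
    }


# Illustrative base directory; substitute your own
base = Path("pdf_batches")
batches = {name: make_batch_config(base, name) for name in ("Batch1", "Batch2")}
print(batches["Batch2"]["readable"])
```

Adding a third batch then only requires extending the tuple of names.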

## Preprocessing Functions
Simplify images for better OCR accuracy (e.g., grayscale conversion).

In [None]:
# Preprocess page images for better OCR results
def preprocess_image(img: Image.Image) -> Image.Image:
    """
    Convert the input image to grayscale to improve OCR accuracy.
    """
    return img.convert("L")
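
To make the grayscale step concrete: Pillow documents the `"L"` conversion as the ITU-R 601-2 luma transform, `L = R * 299/1000 + G * 587/1000 + B * 114/1000`. The pure-Python sketch below applies that formula to single pixels. The `luma` helper is hypothetical, and Pillow's internal rounding may differ from it by one gray level.

```python
def luma(r: int, g: int, b: int) -> int:
    """Approximate the gray level of one RGB pixel with the ITU-R 601-2 weights."""
    return (r * 299 + g * 587 + b * 114) // 1000


# Green contributes most to perceived brightness, blue the least
for name, rgb in [("white", (255, 255, 255)), ("red", (255, 0, 0)),
                  ("green", (0, 255, 0)), ("blue", (0, 0, 255))]:
    print(name, luma(*rgb))
```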

## OCR Extraction Functions

In [None]:
def get_tesseract_confidence(img: Image.Image) -> float:
    """
    Run Tesseract word‐level OCR on the image and return the average confidence.
    """
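    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    # Tesseract reports -1 for boxes without a recognized word. Depending on the
    # pytesseract version, values may arrive as int, float, or str, so normalize
    # to float before filtering.
    confs = [c for c in map(float, data["conf"]) if c >= 0]
    return float(np.mean(confs)) if confs else 0.0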


def extract_text_from_pdf(pdf_path: Path) -> str:
    """
    Convert each page of the PDF to an image, preprocess it, and OCR the text.
    Returns the concatenated text for the entire document.
    """
    text_parts = []
    try:
        pages = convert_from_path(str(pdf_path), dpi=OCR_DPI)
        for page in pages:
            img = preprocess_image(page)
            text_parts.append(pytesseract.image_to_string(img, lang="eng"))
        return "\n".join(text_parts)
    except Exception as e:
        logging.error(f"Failed to OCR {pdf_path.name}: {e}")
        return ""
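
The `-1` filtering in `get_tesseract_confidence` can be exercised without running Tesseract. The sketch below assumes that confidence values may arrive as strings, ints, or floats depending on the pytesseract version. The `mean_confidence` helper and the sample list are hypothetical and hand-written, not real OCR output.

```python
from statistics import mean


def mean_confidence(conf_values) -> float:
    """Average the non-negative confidences; -1 marks boxes with no recognized word."""
    confs = [c for c in map(float, conf_values) if c >= 0]
    return mean(confs) if confs else 0.0


# Hand-written sample with mixed types and two "no word" entries
sample = ["-1", -1, "96", 88.0, "74.5"]
print(mean_confidence(sample))

# A page with no recognized words falls back to 0.0
print(mean_confidence(["-1", -1]))
```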

## Readability Classification
Combine OCR confidence with English‐word coverage to decide readability.

In [None]:
def is_readable_text(img: Image.Image, text: str, threshold: float = READABILITY_THRESHOLD) -> bool:
    """
    Compute a composite score:
      70% weight → normalized Tesseract confidence (0–1)
      30% weight → fraction of words in English vocab
    Returns True if composite score ≥ threshold.
    """
    # 1) OCR confidence component
    conf_score = get_tesseract_confidence(img) / 100.0

    # 2) English‐word coverage
    tokens = re.findall(r"\b\w+\b", text.lower())
    if not tokens:
        return False
    valid = sum(1 for w in tokens if w in english_vocab) / len(tokens)

    composite = 0.7 * conf_score + 0.3 * valid
    return composite >= threshold
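
A worked example of the composite score, using a three-word toy vocabulary in place of the NLTK word list. The `composite_score` helper is a hypothetical standalone version of the formula. It treats empty text as zero coverage, as `process_pdf` does, whereas `is_readable_text` returns `False` immediately in that case.

```python
import re


def composite_score(avg_conf: float, text: str, vocab: set) -> float:
    """0.7 * normalized confidence + 0.3 * fraction of tokens found in vocab."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    coverage = sum(1 for t in tokens if t in vocab) / len(tokens) if tokens else 0.0
    return 0.7 * (avg_conf / 100.0) + 0.3 * coverage


toy_vocab = {"certificate", "of", "incorporation"}  # stand-in for the NLTK word list

# One OCR misread ("Incorporatlon") leaves 2 of 3 tokens in the vocabulary:
# 0.7 * 0.80 + 0.3 * (2/3) = 0.56 + 0.20 = 0.76
score = composite_score(80.0, "Certificate of Incorporatlon", toy_vocab)
print(round(score, 2), score >= 0.6)  # 0.76 True
```

Because confidence carries 70% of the weight, a document with perfect word coverage still needs a mean confidence of roughly 43 or more to reach the 0.6 threshold.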

## Processing Pipeline
Process each PDF: OCR → metrics → classification → optional text export.

In [None]:
def process_pdf(
    pdf_path: Path,
    readable_dir: Path,
    unreadable_dir: Path,
    threshold: float = READABILITY_THRESHOLD,
    export_text: bool = EXPORT_TEXT
) -> dict:
    """
    For a single PDF:
      • OCR each page
      • Compute metrics: #chars, #words, avg. confidence, composite score
      • Label as 'readable' or 'unreadable'
      • Optionally save text to the corresponding folder
      • Return a dict of metrics
    """
    logging.info(f"Processing {pdf_path.name}")
    metrics = {"file_name": pdf_path.name}

    try:
        pages = convert_from_path(str(pdf_path), dpi=OCR_DPI)
        all_text = []
        confidences = []

        for page in pages:
            img = preprocess_image(page)
            txt = pytesseract.image_to_string(img, lang="eng")
            all_text.append(txt)
            confidences.append(get_tesseract_confidence(img))

        full_text = "\n".join(all_text)
        char_count = len(full_text)
        word_count = len(re.findall(r"\b\w+\b", full_text))
        avg_conf = float(np.mean(confidences)) if confidences else 0.0

        # Re‐apply readability logic on full document
        tokens = re.findall(r"\b\w+\b", full_text.lower())
        valid_pct = (sum(1 for w in tokens if w in english_vocab) / len(tokens)) if tokens else 0.0
        composite = 0.7 * (avg_conf / 100.0) + 0.3 * valid_pct
        label = "readable" if composite >= threshold else "unreadable"

        # Save text file, if desired
        if export_text:
            out_dir = readable_dir if label == "readable" else unreadable_dir
            txt_path = out_dir / f"{pdf_path.stem}.txt"
            txt_path.write_text(full_text, encoding="utf-8")

        # Populate metrics dict
        metrics.update({
            "number_of_characters": char_count,
            "number_of_words": word_count,
            "confidence_level": avg_conf,
            "composite_score": composite,
            "quality": label
        })

    except Exception as e:
        logging.error(f"Error in {pdf_path.name}: {e}")
        # On failure, mark as unreadable with zero metrics
        metrics.update({
            "number_of_characters": 0,
            "number_of_words": 0,
            "confidence_level": 0.0,
            "composite_score": 0.0,
            "quality": "unreadable"
        })

    return metrics

In [None]:
def process_all_pdfs(
    folder: Path,
    readable_dir: Path,
    unreadable_dir: Path,
    threshold: float = READABILITY_THRESHOLD,
    limit: int = None,
    export_text: bool = EXPORT_TEXT
) -> pd.DataFrame:
    """
    Process every PDF in `folder` in parallel, returning a DataFrame of results.
    """
    pdfs = list(folder.glob("*.pdf"))
    if limit:
        pdfs = pdfs[:limit]

    # Map PDFs → metrics dicts
    # `executor` avoids shadowing the built-in `exec`
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        results = list(executor.map(
            lambda p: process_pdf(p, readable_dir, unreadable_dir, threshold, export_text),
            pdfs
        ))

    return pd.DataFrame(results)
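
The fan-out pattern in `process_all_pdfs` can be tried without any OCR dependencies. In this sketch, `fake_process` is a hypothetical stand-in for `process_pdf`, and `functools.partial` replaces the lambda to bind the shared arguments once.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def fake_process(name: str, threshold: float) -> dict:
    """Stand-in for process_pdf: no OCR, just returns a small metrics dict."""
    return {"file_name": name, "threshold": threshold}


names = ["a.pdf", "b.pdf", "c.pdf"]
worker = partial(fake_process, threshold=0.6)  # bind the shared argument once

with ThreadPoolExecutor(max_workers=2) as executor:
    # executor.map yields results in input order, regardless of completion order
    results = list(executor.map(worker, names))

print([r["file_name"] for r in results])
```

Since `map` preserves input order, the rows of the resulting DataFrame line up with the order of the globbed PDF list.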

## Run Pipeline
Kick off processing for one batch (adjust name to switch batches).

In [None]:
# Example: run on Batch1
batch_cfg = BATCHES["Batch1"]
results_df = process_all_pdfs(
    folder=batch_cfg["input"],
    readable_dir=batch_cfg["readable"],
    unreadable_dir=batch_cfg["unreadable"]
)

# Display the first few results
results_df.head()

2024-12-11 22:18:15,125 - Processing /Users/alexchen/Downloads/Projects/temp/Batch1/16_2008-03-03_Certificates of Incorporation.pdf...
2024-12-11 22:18:15,125 - Processing /Users/alexchen/Downloads/Projects/temp/Batch1/27_2009-05-15_Certificates of Incorporation.pdf...
2024-12-11 22:18:15,126 - Processing /Users/alexchen/Downloads/Projects/temp/Batch1/16_2009-01-20_Certificates of Incorporation.pdf...
2024-12-11 22:18:15,126 - Processing /Users/alexchen/Downloads/Projects/temp/Batch1/16_2004-07-14_Certificates of Incorporation.pdf...
2024-12-11 22:18:15,126 - Processing /Users/alexchen/Downloads/Projects/temp/Batch1/27_2002-09-23_Certificates of Incorporation.pdf...
2024-12-11 22:18:15,130 - Processing /Users/alexchen/Downloads/Projects/temp/Batch1/24_2016-04-04_Certificates of Incorporation.pdf...
2024-12-11 22:18:15,131 - Processing /Users/alexchen/Downloads/Projects/temp/Batch1/92_2010-02-23_Certificates of Incorporation.pdf...
2024-12-11 22:18:15,131 - Processing /Users/alexchen/Do

Unnamed: 0,file_name,number_of_characters,number_of_words,confidence_level,composite_score,quality
0,16_2008-03-03_Certificates of Incorporation.pdf,2473,391,73.204728,0.747216,readable
1,27_2009-05-15_Certificates of Incorporation.pdf,4337,744,75.885735,0.769910,readable
2,16_2009-01-20_Certificates of Incorporation.pdf,55638,9112,83.692853,0.843774,readable
3,16_2004-07-14_Certificates of Incorporation.pdf,46327,7572,83.490633,0.841685,readable
4,27_2002-09-23_Certificates of Incorporation.pdf,2498,403,76.769806,0.775602,readable
...,...,...,...,...,...,...
87,59_2006-05-01_Certificates of Incorporation.pdf,48676,8121,82.979610,0.839631,readable
88,24_2009-06-12_Certificates of Incorporation.pdf,2157,343,69.393056,0.713157,readable
89,81_2011-12-22_Certificates of Incorporation.pdf,128487,21449,85.609514,0.863685,readable
90,16_2007-05-16_Certificates of Incorporation.pdf,53604,8777,83.400815,0.840875,readable


## Results Analysis
Quick look at overall distribution of quality labels.

In [None]:
# Count of readable vs. unreadable
results_df["quality"].value_counts()

Unnamed: 0,file_name,number_of_characters,number_of_words,confidence_level,composite_score,quality
0,16_2008-03-03_Certificates of Incorporation.pdf,2473,391,73.204728,0.747216,readable
1,27_2009-05-15_Certificates of Incorporation.pdf,4337,744,75.885735,0.76991,readable
2,16_2009-01-20_Certificates of Incorporation.pdf,55638,9112,83.692853,0.843774,readable
3,16_2004-07-14_Certificates of Incorporation.pdf,46327,7572,83.490633,0.841685,readable
4,27_2002-09-23_Certificates of Incorporation.pdf,2498,403,76.769806,0.775602,readable
5,24_2016-04-04_Certificates of Incorporation.pdf,107743,17803,84.482088,0.855347,readable
6,92_2010-02-23_Certificates of Incorporation.pdf,53431,8819,82.425599,0.83575,readable
7,92_2004-11-23_Certificates of Incorporation.pdf,46879,7765,82.874414,0.836309,readable
8,59_2007-08-15_Certificates of Incorporation.pdf,54151,8999,84.279008,0.849915,readable
9,28_2009-12-07_Certificates of Incorporation.pdf,2882,472,82.341748,0.798214,readable


In [None]:
# Top 10 documents by composite score
results_df.sort_values("composite_score", ascending=False).head(10)

Unnamed: 0,file_name,number_of_characters,number_of_words,confidence_level,composite_score,quality
24,49_2008-06-12_Certificates of Incorporation.pdf,109852,18145,86.084837,0.866518,readable
89,81_2011-12-22_Certificates of Incorporation.pdf,128487,21449,85.609514,0.863685,readable
74,24_2018-04-03_Certificates of Incorporation.pdf,112067,18495,85.499236,0.86289,readable
55,24_2014-08-27_Certificates of Incorporation.pdf,104258,17194,85.258637,0.861199,readable
78,81_2009-12-03_Certificates of Incorporation.pdf,102255,17058,85.186218,0.860408,readable
46,27_2010-10-10_Certificates of Incorporation.pdf,71376,11882,85.197592,0.858763,readable
43,27_2009-08-04_Certificates of Incorporation.pdf,66530,11046,85.158319,0.857352,readable
16,27_2013-09-26_Certificates of Incorporation.pdf,79271,13273,84.97987,0.856661,readable
85,28_2012-03-16_Certificates of Incorporation.pdf,49340,8061,85.193501,0.855528,readable
21,24_2011-05-05_Certificates of Incorporation.pdf,89124,14609,84.577498,0.855366,readable


In [None]:
# Sample unreadable documents for manual review
results_df[results_df["quality"] == "unreadable"].head()

Unnamed: 0,file_name,number_of_characters,number_of_words,confidence_level,composite_score,quality
60,10_2006-09-13_Certificates of Incorporation.pdf,86,13,43.291667,0.441503,unreadable
14,27_2008-11-13_Certificates of Incorporation.pdf,553,113,27.992829,0.339313,unreadable


## Summary Metrics
Compute overall readability rates for each batch.

In [None]:
for name, cfg in BATCHES.items():
    total_pdfs = len(list(cfg["input"].glob("*.pdf")))
    readable_txt = len(list(cfg["readable"].glob("*.txt")))
    unreadable_txt = len(list(cfg["unreadable"].glob("*.txt")))

    # Skip empty or missing batches to avoid dividing by zero
    if total_pdfs == 0:
        print(f"{name}: no PDFs found")
        continue

    print(
        f"{name}: {readable_txt}/{total_pdfs} readable "
        f"({readable_txt/total_pdfs:.1%}), "
        f"{unreadable_txt/total_pdfs:.1%} unreadable"
    )

Number of readable files in Batch1: 90
Number of unreadable files in Batch1: 2
Number of total files in Batch1: 93
Proportion of readable files among Batch1: 0.967741935483871
Number of readable files in Batch2: 948
Number of unreadable files in Batch2: 2
Number of total files in Batch2: 950
Proportion of readable files among Batch2: 0.9978947368421053
Proportion of unreadable files among Batch1 and Batch2: 0.003835091083413231
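
The reported proportions can be re-derived from the counts in the output above. `readable_rate` is a hypothetical helper that returns 0.0 for an empty batch instead of raising `ZeroDivisionError`.

```python
def readable_rate(readable: int, total: int) -> float:
    """Share of readable files; 0.0 for an empty batch."""
    return readable / total if total else 0.0


# Counts copied from the output above
batch1 = readable_rate(90, 93)
batch2 = readable_rate(948, 950)
unreadable_overall = (2 + 2) / (93 + 950)

print(f"{batch1:.1%}, {batch2:.1%}, {unreadable_overall:.2%}")  # 96.8%, 99.8%, 0.38%
```

The Batch1 readable and unreadable counts sum to 92 against 93 PDFs. One possible explanation, not confirmed by the output, is a file that hit the exception path in `process_pdf`, which returns metrics without writing a text file.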


## Next Steps & Extensions

- **Logging to File:** route `INFO`/`ERROR` logs to a timestamped logfile.  
- **Unit Tests:** build `pytest` tests for each function (`preprocess_image`, `get_tesseract_confidence`, etc.).  
- **Error Tracking:** record failed PDFs into a CSV for manual triage.  
- **Advanced Models:** integrate an LLM or fine‐tuned Vision+OCR model for edge‐case pages.  
- **Parallel Tuning:** experiment with chunk sizes or GPU‐accelerated OCR for speed.  
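
The first and third items could start from something like the sketch below, which uses only the standard library. The helper names (`add_file_logging`, `record_failure`), the `ocr_<timestamp>.log` filename pattern, and the two-column CSV layout are all assumptions made for illustration.

```python
import csv
import logging
from datetime import datetime
from pathlib import Path


def add_file_logging(log_dir: Path) -> Path:
    """Attach a FileHandler writing to a timestamped logfile; return its path."""
    log_dir.mkdir(parents=True, exist_ok=True)
    log_path = log_dir / f"ocr_{datetime.now():%Y%m%d_%H%M%S}.log"
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
    logging.getLogger().addHandler(handler)
    return log_path


def record_failure(csv_path: Path, file_name: str, error: str) -> None:
    """Append one failed PDF to a triage CSV, writing the header on first use."""
    is_new = not csv_path.exists()
    with csv_path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        if is_new:
            writer.writerow(["file_name", "error"])
        writer.writerow([file_name, error])
```

In the pipeline, `record_failure` would be called from the `except` branch of `process_pdf`. If several worker threads can fail at the same time, the append should be guarded with a `threading.Lock`.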