# Batch Processing - PDF Text Extraction

This notebook performs batch data extraction from multiple text-based PDFs using:
- `pdfplumber` for text extraction
- `rapidfuzz` for fuzzy field matching
- `pandas` for saving structured output

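The `smart_extract()` function fuzzy-matches each field label against the lines of extracted text, then accepts the best-matching line only if it contains that field's required keywords. It looks for these fields: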
- Policy Number
- Insured Name
- Sum Insured
- Premium
- Policy Start
- Policy End
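
The keyword part of that matching can be sketched in isolation. The helper below uses `passes_keyword_guard`, a hypothetical name that does not appear in this notebook, and mirrors the substring check applied to a fuzzy-matched line. The sample lines are invented for illustration.

```python
def passes_keyword_guard(line, required_keywords):
    """Return True if every required keyword occurs in the line (case-insensitive substring)."""
    lowered = line.lower()
    return all(keyword.lower() in lowered for keyword in required_keywords)


# Hypothetical sample lines, not taken from a real policy document.
print(passes_keyword_guard("Sum Insured: 500,000", ["sum", "insured"]))    # True
print(passes_keyword_guard("Insured Name: Jane Doe", ["sum", "insured"]))  # False
print(passes_keyword_guard("Vendor: Acme Ltd", ["end"]))                   # True
```

The last call shows a limitation of plain substring checks: a short keyword such as "end" also occurs inside unrelated words like "Vendor". Comparing whole words instead would be stricter, at the cost of missing labels written without spaces.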

Every PDF in the input folder is processed in turn, and the extracted fields are saved to a single combined Excel file.

Output:
- An Excel file (`batch_output.xlsx`) with one row per PDF, plus a `Source File` column naming the originating file.

In [None]:
import os
import pdfplumber
import pandas as pd
from rapidfuzz import process, fuzz

In [2]:
def smart_extract(text):
    fields = {
        "Policy Number": ["number"],
        "Insured Name": ["name"],
        "Sum Insured": ["sum", "insured"],
        "Premium": ["premium"],
        "Policy Start": ["start"],
        "Policy End": ["end"]
    }

    lines = text.splitlines()
    parsed_data = {}

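    for label, required_keywords in fields.items():
        # Best fuzzy match for the label among all lines; None if no line reaches the cutoff of 50.
        result = process.extractOne(label, lines, scorer=fuzz.token_sort_ratio, score_cutoff=50)
        value = ""
        if result:
            match_line, _score, _index = result
            # Accept the match only if every required keyword appears in the line.
            if all(keyword.lower() in match_line.lower() for keyword in required_keywords):
                # Split on the first colon only, so values that contain colons
                # (for example times) stay intact; without a colon, keep the whole line.
                _label, sep, after = match_line.partition(":")
                value = after.strip() if sep else match_line.strip()
        parsed_data[label] = value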

    return parsed_data
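
Once a line has been matched, the field value is the text that follows the label. As a minimal sketch of that step, assuming label and value are separated by a colon, the hypothetical helper `value_after_label` (not part of the notebook) splits on the first colon only and falls back to the whole line when there is no colon. The sample lines are invented.

```python
def value_after_label(line):
    """Return the text after the first colon, or the whole line if there is no colon."""
    _label, sep, value = line.partition(":")
    return value.strip() if sep else line.strip()


# Hypothetical sample lines for illustration.
print(value_after_label("Policy Number: PN-001"))            # PN-001
print(value_after_label("Policy Start: 01/01/2024 10:30"))   # 01/01/2024 10:30
print(value_after_label("Premium 1,200"))                    # Premium 1,200
```

Splitting on the first colon only keeps values such as timestamps intact, since anything after the label is treated as one value.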

In [5]:
def process_pdfs(input_dir="demo_pdfs", output_dir="output_excels", combined_filename="batch_output.xlsx"):
    os.makedirs(output_dir, exist_ok=True)
    combined_data = []

    for filename in os.listdir(input_dir):
        if filename.lower().endswith(".pdf"):
            try:
                with pdfplumber.open(os.path.join(input_dir, filename)) as pdf:
                    # Extract each page once; skip pages that yield no text.
                    page_texts = [page.extract_text() for page in pdf.pages]
                    text = "\n".join(t for t in page_texts if t)
                    extracted = smart_extract(text)
                    extracted["Source File"] = filename
                    combined_data.append(extracted)
            except Exception as e:
                print(f"[ERROR] Failed to process {filename}: {e}")

    df = pd.DataFrame(combined_data)
    df.to_excel(os.path.join(output_dir, combined_filename), index=False)
    print(f"Processed {len(combined_data)} PDF(s). Excel saved at: {output_dir}/{combined_filename}")
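
`os.listdir` returns entries in arbitrary order, so the row order of the combined file can vary between runs. If a stable order matters, one option is to sort the file list first. The sketch below uses a hypothetical helper, `list_pdfs`, which is not part of this notebook and relies only on the standard library.

```python
from pathlib import Path


def list_pdfs(input_dir):
    """Return the PDF paths in input_dir, sorted by file name."""
    folder = Path(input_dir)
    return sorted(
        # Case-insensitive suffix check, matching the .lower().endswith(".pdf") test above.
        (p for p in folder.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name,
    )
```

Each returned `Path` can be passed to `pdfplumber.open` directly, and `path.name` gives the value to store in the `Source File` column.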


In [6]:
process_pdfs()

CropBox missing from /Page, defaulting to MediaBox

Processed 4 PDF(s). Excel saved at: output_excels/batch_output.xlsx
