# Scanned PDF OCR + Extraction

This notebook shows how to extract structured data from a **scanned or image-based PDF** using OCR.

Steps included:
1. Convert PDF pages to images using:
   - `pdf2image`

2. Perform OCR to extract text using:
   - `pytesseract`

3. Parse the raw OCR output line-by-line to extract values like:
   - Policy Number
   - Insured Name
   - Sum Insured
   - Premium
   - Policy Start
   - Policy End

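The extraction uses basic string parsing: every line that contains a colon is split at the first `:` into a key and a value. There is no fuzzy matching or correction of OCR errors.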

Output:
- Raw OCR text printed to console
- Parsed fields saved to: `ocr_extracted_output.xlsx`
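
The line-by-line parsing described above can be tried on a plain string before running OCR. The following is a minimal sketch: the helper name `parse_key_values` and the sample text (including every value in it) are made up for illustration and do not come from the demo PDF.

```python
def parse_key_values(text):
    """Collect 'Key: Value' lines into a dict, splitting on the first colon only."""
    data = {}
    for line in text.strip().splitlines():
        if ":" not in line:
            continue  # headings and other lines without a colon are skipped
        key, value = line.split(":", 1)
        data[key.strip()] = value.strip()
    return data


# Hypothetical OCR-like text, for illustration only
sample_text = """
INSURANCE POLICY SCHEDULE
Policy Number: ABC-12345
Insured Name: Jane Doe
Policy Start: 01/01/2024 00:00
"""

fields = parse_key_values(sample_text)
print(fields)
```

Because the split happens only at the first colon, a value that itself contains a colon (such as the time in the hypothetical `Policy Start` line) is kept whole.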


In [4]:
from pdf2image import convert_from_path
import pytesseract
import pandas as pd


In [2]:
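# Step 1: Convert scanned PDF to images (one PIL image per page)
# Forward slashes avoid the invalid "\s" escape sequence and also work on Windows
images = convert_from_path("demo_pdfs/sample_scanned_based.pdf", dpi=300)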
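
`convert_from_path` returns a list with one image per page, so a multi-page scan could be OCRed page by page instead of stopping at the first page. A minimal sketch, assuming a hypothetical helper named `ocr_pages`; its `ocr` parameter is an addition for illustration so the function can be exercised without Tesseract installed, and it falls back to `pytesseract.image_to_string` when omitted.

```python
def ocr_pages(images, ocr=None):
    """Run OCR on every page image and join the page texts with newlines."""
    if ocr is None:
        # Imported lazily so the helper can be tested with a stand-in OCR function
        import pytesseract
        ocr = pytesseract.image_to_string
    return "\n".join(ocr(image) for image in images)


# Stand-in "pages" and OCR function, for illustration only
fake_pages = ["page one", "page two"]
combined = ocr_pages(fake_pages, ocr=str.upper)
print(combined)
```

With real data the call would be `ocr_pages(images)`, and the combined text could then be parsed the same way as the first page.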

In [8]:
# Step 2: OCR the first page and print the raw text
ocr_text = pytesseract.image_to_string(images[0])
print(ocr_text)

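# Step 3: Parse "Key: Value" lines from the OCR output
lines = ocr_text.strip().splitlines()
data = {}

for line in lines:
    # Lines without a colon (headings, blank lines) are skipped
    if ":" in line:
        # Split on the first colon only, so values containing colons stay intact
        key, value = line.split(":", 1)
        data[key.strip()] = value.strip()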
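
The loop above keeps every line that contains a colon, not only the six fields listed in the introduction. If only those fields are wanted, the parsed dict could be filtered before export. A minimal sketch: `EXPECTED_FIELDS` and `select_fields` are hypothetical names, the sample dict is invented, and the match is exact, so a key that OCR misreads would come back as `None`.

```python
# Field names taken from the list in the introduction
EXPECTED_FIELDS = [
    "Policy Number",
    "Insured Name",
    "Sum Insured",
    "Premium",
    "Policy Start",
    "Policy End",
]


def select_fields(data, expected=EXPECTED_FIELDS):
    """Keep only the expected keys, in order; missing keys map to None."""
    return {name: data.get(name) for name in expected}


# Hypothetical parsed output, for illustration only
parsed = {
    "Policy Number": "ABC-12345",
    "Premium": "1,200",
    "Note": "this key is not in the expected list",
}

row = select_fields(parsed)
print(row)
```

Passing `row` instead of `data` to `pd.DataFrame([row])` would give the spreadsheet a fixed set of columns regardless of what else the OCR text contains.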


In [10]:
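# Step 4: Save the parsed fields as a single-row sheet
# (writing .xlsx requires an Excel engine such as openpyxl to be installed)
df = pd.DataFrame([data])
df.to_excel("ocr_extracted_output.xlsx", index=False)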

print("OCR text dynamically extracted and saved to Excel.")

OCR text dynamically extracted and saved to Excel.
