# JFK Document OCR Pipeline – Milestone 1
**Author**: [Your Name]  
**Objective**: Efficiently OCR a subset of JFK files with scalability to 100,000+ pages.  
This notebook serves as a prototype pipeline for processing large historical document archives with preprocessing, parallel OCR, and basic NLP/EDA.

---

In [None]:
!apt-get install -y poppler-utils tesseract-ocr
!pip install pytesseract pdf2image nltk tqdm


## Step 1: Upload a Subset of JFK PDFs
To simulate the full corpus, we'll work with a small subset uploaded as a single ZIP archive (21 PDFs in this run). These PDFs are assumed to be scans requiring OCR.


In [None]:
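import zipfile

from google.colab import files  # Colab-only helper; must be imported before files.upload()

# Upload a ZIP archive of PDFs through the Colab file picker
uploaded = files.upload()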
zip_filename = list(uploaded.keys())[0]

# Load ZIP archive
zf = zipfile.ZipFile(zip_filename)

# List of PDF files in the archive
pdf_files = [f for f in zf.namelist() if f.lower().endswith(".pdf")]
print(f"Found {len(pdf_files)} PDF files in ZIP.")


Saving jfk2023f.zip to jfk2023f.zip
Found 21 PDF files in ZIP.
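
The suffix filter above accepts any member whose name ends in `.pdf`. Archives created on macOS often carry `__MACOSX/` entries and `._` resource-fork files that share that suffix but are not real PDFs. The sketch below shows one way to skip them; `list_pdf_members` and the demo file names are hypothetical, not part of the notebook.

```python
import io
import zipfile


def list_pdf_members(zf):
    """Return names of real PDF members, skipping directories and macOS resource forks."""
    names = []
    for info in zf.infolist():
        name = info.filename
        if info.is_dir():
            continue
        # Resource forks live under __MACOSX/ and have a "._" basename prefix.
        if name.startswith("__MACOSX/") or name.rsplit("/", 1)[-1].startswith("._"):
            continue
        if name.lower().endswith(".pdf"):
            names.append(name)
    return sorted(names)


# Build a tiny archive in memory to exercise the helper (file names are made up).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("docs/a.pdf", b"%PDF-1.4 placeholder")
    demo.writestr("docs/B.PDF", b"%PDF-1.4 placeholder")
    demo.writestr("__MACOSX/docs/._a.pdf", b"resource fork")
    demo.writestr("notes.txt", b"not a pdf")

with zipfile.ZipFile(buffer) as demo:
    demo_pdfs = list_pdf_members(demo)
print(demo_pdfs)
```

In the notebook, `list_pdf_members(zf)` could stand in for the list comprehension that builds `pdf_files`.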


## Step 2: Convert PDFs in ZIP to Images (In Memory)
We'll convert only the first N PDFs for Milestone 1 and collect their pages into a single list. Every page in that list is then OCR'd in parallel in Step 4.

In [None]:
from pdf2image import convert_from_bytes

N = 3  # Number of PDFs to process for milestone
images_all = []

for pdf_name in pdf_files[:N]:
    print(f"Processing {pdf_name}")
    pdf_bytes = zf.read(pdf_name)
    images = convert_from_bytes(pdf_bytes, dpi=200)
    images_all.extend(images)  # Flatten all pages

Processing 104-10105-10271.pdf
Processing 104-10120-10293.pdf
Processing 104-10172-10108.pdf
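
Rendering every page of every PDF at 200 dpi keeps all images in memory at once, which is unlikely to scale to 100,000+ pages. One option is to render in page ranges. The helper below is a minimal sketch of that idea; `page_batches` and the batch size are hypothetical.

```python
def page_batches(total_pages, batch_size):
    """Yield inclusive, 1-indexed (first_page, last_page) ranges covering total_pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


demo_ranges = list(page_batches(total_pages=25, batch_size=10))
print(demo_ranges)
```

Each range could be passed to `convert_from_bytes(pdf_bytes, dpi=200, first_page=first, last_page=last)`, so only one batch of images is held at a time. The page count would have to come from elsewhere, for example `pdfinfo_from_bytes(pdf_bytes)["Pages"]`, assuming that helper is available in the installed pdf2image version.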


## Step 3: Image Preprocessing
Preprocessing can improve OCR speed and quality. We'll convert each page to grayscale and apply a sharpening filter.


In [None]:
from PIL import Image, ImageFilter

def preprocess_image(img):
    gray = img.convert("L")  # Grayscale
    sharpened = gray.filter(ImageFilter.SHARPEN)
    return sharpened

# Use images_all (pages from every PDF); `images` holds only the last PDF's pages
processed_images = [preprocess_image(img) for img in images_all]
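
For intuition about the grayscale step: Pillow documents `convert("L")` on RGB input as the ITU-R 601-2 luma transform, `L = R * 299/1000 + G * 587/1000 + B * 114/1000`. The sketch below applies those weights to single pixels. `luma_601` is a hypothetical helper that uses integer floor division, so its values may differ by one from Pillow's own rounding.

```python
def luma_601(rgb):
    """Approximate the grayscale value of one RGB pixel with ITU-R 601-2 weights."""
    r, g, b = rgb
    return (r * 299 + g * 587 + b * 114) // 1000


demo_pixels = [(255, 255, 255), (0, 0, 0), (255, 0, 0), (0, 255, 0), (0, 0, 255)]
demo_gray = [luma_601(p) for p in demo_pixels]
print(demo_gray)
```

Green contributes most and blue least, so blue ink or stamps on a scan come out comparatively dark after conversion.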

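## Step 4: Parallelized OCR with Tesseract
We use `ThreadPoolExecutor` to speed up OCR and `tqdm` to visualize progress. Threads are sufficient here because `pytesseract` runs the Tesseract binary as a separate process for each call, so the OCR work happens outside the Python interpreter.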


In [None]:
import pytesseract
from concurrent.futures import ThreadPoolExecutor
from tqdm.notebook import tqdm

def ocr_image(img):
    return pytesseract.image_to_string(img, config='--psm 6')

ocr_texts = []
with ThreadPoolExecutor(max_workers=4) as executor:
    ocr_texts = list(tqdm(executor.map(ocr_image, processed_images), total=len(processed_images), desc="OCR Progress"))

full_text = "\n\n".join([f"--- Page {i+1} ---\n{text}" for i, text in enumerate(ocr_texts)])


OCR Progress:   0%|          | 0/201 [00:00<?, ?it/s]
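
`executor.map` yields results in input order, which the page numbering in `full_text` relies on. It also re-raises a worker's exception when that result is consumed, so one unreadable page would abort the whole run. The sketch below keeps the ordering but records errors per page. `run_in_order` and `fake_ocr` are hypothetical stand-ins, and no real OCR is performed.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_order(worker, items, max_workers=4):
    """Apply worker to every item in parallel; return (result, error) pairs in input order."""
    def guarded(item):
        try:
            return worker(item), None
        except Exception as exc:  # keep going if a single page fails
            return None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(guarded, items))


def fake_ocr(page):
    """Stand-in for ocr_image: fails on one 'page' to show the error path."""
    if page == "bad":
        raise ValueError("unreadable page")
    return page.upper()


demo_results = run_in_order(fake_ocr, ["alpha", "bad", "gamma"])
demo_texts = [text if err is None else "" for text, err in demo_results]
print(demo_texts)
```

With the real pipeline, `run_in_order(ocr_image, processed_images)` would return one pair per page, and failed pages could be logged or retried instead of stopping the run.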

## Step 5: Save OCR Output
We save the result as a `.txt` file for downstream NLP.


In [None]:
with open("jfk_ocr_output.txt", "w", encoding="utf-8") as f:  # explicit encoding for non-ASCII OCR output
    f.write(full_text)

from google.colab import files
files.download("jfk_ocr_output.txt")
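
Downstream code that reads the `.txt` file needs to recover page boundaries from the `--- Page N ---` markers written in Step 4. A minimal sketch of that parsing step follows. `split_pages` is a hypothetical helper, and it assumes no OCR'd line happens to match the marker pattern.

```python
import re

PAGE_MARKER = re.compile(r"^--- Page (\d+) ---$", re.MULTILINE)


def split_pages(text):
    """Return {page_number: page_text} from text joined with '--- Page N ---' markers."""
    parts = PAGE_MARKER.split(text)
    # With one capture group, parts = [preamble, number, body, number, body, ...]
    pages = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = body.strip("\n")
    return pages


demo_text = "\n\n".join(
    f"--- Page {i + 1} ---\n{t}" for i, t in enumerate(["first page", "second\npage"])
)
demo_pages = split_pages(demo_text)
print(demo_pages)
```

A structured format such as one JSON object per page would avoid the marker ambiguity altogether, at the cost of a less readable text file.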


## Step 6: Light NLP & EDA
We’ll analyze word frequencies to preview the contents. This also sets the stage for future topic modeling and sentiment analysis.


In [None]:
import nltk
from nltk.corpus import stopwords
from collections import Counter
import re

nltk.download("stopwords")

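# Tokenize (alphabetic tokens of 3+ letters) and drop English stopwords
stop_words = set(stopwords.words("english"))  # build once instead of reloading the list per word
words = re.findall(r'\b[a-zA-Z]{3,}\b', full_text.lower())
filtered_words = [w for w in words if w not in stop_words]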

# Count
word_counts = Counter(filtered_words)
top_words = word_counts.most_common(20)

# Display
for word, freq in top_words:
    print(f"{word}: {freq}")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


david: 229
page: 222
halperin: 213
says: 204
know: 186
see: 184
well: 184
eee: 176
going: 141
get: 124
asks: 119
would: 116
right: 112
talk: 108
yes: 104
one: 100
call: 90
tho: 83
end: 82
take: 80
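
Because `full_text` contains a `--- Page N ---` marker for each of the 201 pages, the count for `page` above includes those markers as well as words from the documents. Counting from the per-page texts avoids that inflation. The sketch below uses a tiny made-up stopword set so that it runs without NLTK; `count_words` and `DEMO_STOPWORDS` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; the notebook uses NLTK's English list instead.
DEMO_STOPWORDS = frozenset({"the", "and", "was", "for"})
TOKEN_RE = re.compile(r"\b[a-zA-Z]{3,}\b")


def count_words(page_texts, stop_words):
    """Count tokens of 3+ letters across pages, skipping stopwords."""
    counts = Counter()
    for text in page_texts:
        counts.update(w for w in TOKEN_RE.findall(text.lower()) if w not in stop_words)
    return counts


demo_page_texts = ["The cable was sent.", "Cable for the station."]
demo_counts = count_words(demo_page_texts, DEMO_STOPWORDS)
print(demo_counts.most_common(1))
```

In the notebook this would be called as `count_words(ocr_texts, stop_words)`, where `stop_words` is a set built from NLTK's English stopword list.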


In [None]:
import numpy as np

# Assume one OCR chunk per page
doc_lengths = [len(doc.split()) for doc in ocr_texts]  # ocr_texts is list of page texts

print(f"Total docs: {len(doc_lengths)}")
print(f"Avg doc length: {np.mean(doc_lengths):.2f} words")
print(f"Min: {np.min(doc_lengths)}, Max: {np.max(doc_lengths)}, Std: {np.std(doc_lengths):.2f}")



Total docs: 201
Avg doc length: 563.35 words
Min: 65, Max: 1373, Std: 267.80


In [None]:
unique_words = set(filtered_words)
hapax_words = [w for w in unique_words if word_counts[w] == 1]
hapax_ratio = len(hapax_words) / len(unique_words)
print(f"Hapax Legomena Ratio: {hapax_ratio:.2f}")


Hapax Legomena Ratio: 0.75
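
The ratio above is computed over distinct word types: the number of types that occur exactly once, divided by the total number of types. A small worked example makes the definition concrete; `hapax_ratio` and the token list are hypothetical.

```python
from collections import Counter


def hapax_ratio(tokens):
    """Share of distinct word types that occur exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)


demo_tokens = ["cable", "cable", "station", "mexico", "city", "city"]
demo_ratio = hapax_ratio(demo_tokens)
print(demo_ratio)
```

Here two of the four types (`station`, `mexico`) occur once, giving 0.5. In OCR output, misrecognized words tend to appear only once, so a high ratio may partly reflect recognition noise rather than vocabulary richness.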


In [None]:
from nltk import bigrams
from collections import defaultdict

bi_counts = defaultdict(int)
for b in bigrams(filtered_words):
    bi_counts[b] += 1

top_bigrams = sorted(bi_counts.items(), key=lambda x: x[1], reverse=True)[:15]
for (w1, w2), freq in top_bigrams:
    print(f"{w1} {w2}: {freq}")


would like: 27
maurice halperin: 27
mexico city: 24
wants know: 20
reproduction issuing: 14
issuing office: 12
office prohibited: 12
page real: 12
prohibited copy: 11
end message: 11
classified message: 10
page page: 10
next week: 10
secret page: 9
halperin lupe: 9
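
The bigrams above are taken from one flat list of filtered words, so a pair can span a page boundary or join words that were separated by a removed stopword (for example `wants know`). Counting within each page removes the first effect. A minimal sketch, with the hypothetical helper `page_bigrams`:

```python
from collections import Counter


def page_bigrams(pages_tokens):
    """Count adjacent token pairs within each page, never across page boundaries."""
    counts = Counter()
    for tokens in pages_tokens:
        counts.update(zip(tokens, tokens[1:]))
    return counts


demo_pages_tokens = [["mexico", "city", "station"], ["mexico", "city"]]
demo_bigrams = page_bigrams(demo_pages_tokens)
print(demo_bigrams.most_common(1))
```

The pair `("station", "mexico")` is never counted because its two words sit on different pages.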


In [None]:
!pip install -q spacy
!python -m spacy download en_core_web_sm

import spacy
from collections import Counter  # imported before first use in this cell

nlp = spacy.load("en_core_web_sm")

doc = nlp(full_text[:20000])  # Limit to first 20k chars for speed

entities = list(doc.ents)  # Keep actual spaCy Span objects
entity_counter = Counter(ent.label_ for ent in entities)
print("Named entity types:", entity_counter)

# Most common named entities among people, organizations, and places

name_counter = Counter([ent.text for ent in entities if ent.label_ in ["PERSON", "ORG", "GPE"]])
print("Top 10 named entities:")
for name, count in name_counter.most_common(10):
    print(f"{name}: {count}")



Named entity types: Counter({'CARDINAL': 180, 'ORG': 170, 'PERSON': 162, 'GPE': 51, 'DATE': 40, 'PRODUCT': 17, 'NORP': 16, 'MONEY': 13, 'WORK_OF_ART': 6, 'QUANTITY': 5, 'FAC': 3, 'LOC': 3, 'PERCENT': 3, 'EVENT': 2, 'TIME': 2, 'LAW': 1})
Top 10 named entities:
Lo: 10
Subj: 4
Se: 3
Subject: 3
ER: 2
Sn: 2
Lugano: 2
Prague: 2
Csech: 2
ST
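
Several of the top entities (`Lo`, `Se`, `ER`, `Sn`) look like OCR fragments rather than names. A simple length and character filter can remove the most obvious ones before counting. The sketch below works on plain `(text, label)` pairs so that it runs without spaCy. `filter_entities` and its `min_length` threshold are hypothetical and would need tuning against real output.

```python
from collections import Counter


def filter_entities(entities, min_length=4, labels=("PERSON", "ORG", "GPE")):
    """Keep (text, label) pairs with an allowed label and a plausible surface form."""
    kept = []
    for text, label in entities:
        cleaned = " ".join(text.split())  # collapse stray newlines and runs of spaces
        if label in labels and len(cleaned) >= min_length and cleaned[:1].isalpha():
            kept.append((cleaned, label))
    return kept


demo_entities = [
    ("Lo", "ORG"),
    ("Prague", "GPE"),
    ("Se", "PERSON"),
    ("Prague", "GPE"),
    ("1963", "DATE"),
]
demo_names = Counter(text for text, _ in filter_entities(demo_entities))
print(demo_names.most_common())
```

With spaCy, the input pairs could be built as `[(ent.text, ent.label_) for ent in doc.ents]`. A length filter alone would still let through header words such as `Subject`, so it is only a first pass.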

In [None]:
import csv
import os

# Create output directory
os.makedirs("output", exist_ok=True)

output_path = "output/nlp_summary_report.csv"

with open(output_path, "w", newline='', encoding='utf-8') as f:
    writer = csv.writer(f)

    # Section 1: Document Statistics
    doc_lengths = [len(doc.split()) for doc in ocr_texts]
    writer.writerow(["Document Length Statistics"])
    writer.writerow(["Metric", "Value"])
    writer.writerow(["Total Pages", len(doc_lengths)])
    writer.writerow(["Average Length (words)", round(np.mean(doc_lengths), 2)])
    writer.writerow(["Min Length", np.min(doc_lengths)])
    writer.writerow(["Max Length", np.max(doc_lengths)])
    writer.writerow(["Std Dev", round(np.std(doc_lengths), 2)])
    writer.writerow([])

    # Section 2: Top Words
    writer.writerow(["Top 20 Most Frequent Words"])
    writer.writerow(["Word", "Frequency"])
    for word, freq in top_words[:20]:
        writer.writerow([word, freq])
    writer.writerow([])

    # Section 3: Top Bigrams (top_bigrams was truncated to 15 entries earlier)
    writer.writerow([f"Top {len(top_bigrams)} Bigrams"])
    writer.writerow(["Bigram", "Frequency"])
    for (w1, w2), freq in top_bigrams:
        writer.writerow([f"{w1} {w2}", freq])
    writer.writerow([])

    # Section 4: Named Entity Types
    writer.writerow(["Named Entity Types"])
    writer.writerow(["Entity Type", "Count"])
    for label, count in entity_counter.most_common(20):
        writer.writerow([label, count])
    writer.writerow([])

    # Section 5: Top Named Entities
    writer.writerow(["Top 20 Named Entities (PERSON, ORG, GPE)"])
    writer.writerow(["Entity", "Mentions"])
    for name, count in name_counter.most_common(20):
        writer.writerow([name, count])
    writer.writerow([])

    # Section 6: Lexical Richness
    hapax_words = [w for w in set(filtered_words) if word_counts[w] == 1]
    hapax_ratio = len(hapax_words) / len(set(filtered_words))
    writer.writerow(["Lexical Richness"])
    writer.writerow(["Metric", "Value"])
    writer.writerow(["Hapax Legomena Ratio", round(hapax_ratio, 4)])

print(f"✅ NLP summary written to {output_path}")


✅ NLP summary written to output/nlp_summary_report.csv
