# OCR of *El Martillo* Newspaper Page

This notebook loads a historical newspaper page from *El Martillo* (Chiclayo, 1903–1919), performs OCR using the Claude API (vision), structures the text, and exports a CSV dataset.

## Steps:
1. Load the scanned page.
2. Send the image to Claude (vision OCR).
3. Parse and structure the extracted content.
4. Export to CSV.
5. Run a simple visualization.


In [None]:
import os
import json
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

## Load the scanned page
Update `image_path` to process a different page.

In [None]:
image_path = "../data/el_martillo/page_01.png"
img = Image.open(image_path)
img

## Claude OCR request

The prompt asks Claude to return the OCR result as a JSON list, with one entry per item found on the page.

In [None]:
prompt = (
    "You are an OCR and historical document analysis assistant. "
    "Extract all articles, headlines, advertisements, and sections from this scanned newspaper page. "
    "Return ONLY a JSON list where each entry contains: date, issue_number, headline, section, type (article/advertisement/other), text_excerpt."
)

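import base64

with open(image_path, "rb") as f:
    image_bytes = f.read()

# The API expects base64-encoded image data, not a hex string.
image_b64 = base64.b64encode(image_bytes).decode("utf-8")

response = client.messages.create(
    # "claude-3-5-sonnet-vision" is not a valid model ID; vision is built into the
    # regular Claude 3.5 Sonnet models. Substitute any vision-capable model you can access.
    model="claude-3-5-sonnet-20241022",
    max_tokens=4096,
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            ],
        }
    ],
)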

raw_output = response.content[0].text
raw_output
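
An image block sent to the Messages API carries base64-encoded data and a media type that matches the file. If pages arrive in mixed formats (PNG, JPEG, and so on), a small helper can build the block from the file path. The following is a minimal sketch; the name `build_image_block` is hypothetical and not part of the Anthropic SDK, and the media type is guessed from the file extension only.

```python
import base64
import mimetypes
from pathlib import Path

# Image media types accepted by the Messages API for image blocks.
SUPPORTED_MEDIA_TYPES = {"image/png", "image/jpeg", "image/gif", "image/webp"}


def build_image_block(path):
    """Return an image content block for a local file (hypothetical helper)."""
    # Guess the media type from the file extension, e.g. ".png" -> "image/png".
    media_type, _ = mimetypes.guess_type(str(path))
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported or unknown image type for {path}: {media_type}")
    # Base64-encode the raw bytes and decode to str so the payload is JSON-serializable.
    data = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

With this helper, the image entry in `content` could be written as `build_image_block(image_path)`.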

## Parse JSON result

In [None]:
structured = json.loads(raw_output)
df = pd.DataFrame(structured)
df
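
Even when asked for JSON only, a model may wrap its answer in a Markdown code fence or add a sentence before or after the list, in which case a bare `json.loads` call raises an error. The sketch below shows one tolerant way to parse the output; `extract_json_list` is a hypothetical helper name, and the fallback of taking the outermost `[...]` span is a heuristic, not a guarantee.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def extract_json_list(raw):
    """Parse a JSON list from model output, tolerating code fences and stray text."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag such as "json")
        # and the closing fence line if present.
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost [...] span in the text.
        start, end = text.find("["), text.rfind("]")
        if start == -1 or end <= start:
            raise
        parsed = json.loads(text[start:end + 1])
    if not isinstance(parsed, list):
        raise ValueError("Expected a JSON list of entries")
    return parsed
```

It can stand in for the direct call, for example `structured = extract_json_list(raw_output)`.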

## Export CSV

In [None]:
df.to_csv("../data/el_martillo/page_01_structured.csv", index=False)
df.head()
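
The prompt asks for six fields per entry, but nothing guarantees that every entry in the response includes all of them or that `type` uses one of the three requested values. Before exporting, it may help to normalize the entries so the CSV always has the same columns. This is a sketch under those assumptions; `normalize_entry` and `normalize_entries` are hypothetical names, and mapping unknown types to `"other"` is one possible policy among several.

```python
# Fields and type values requested in the OCR prompt.
EXPECTED_FIELDS = ["date", "issue_number", "headline", "section", "type", "text_excerpt"]
ALLOWED_TYPES = {"article", "advertisement", "other"}


def normalize_entry(entry):
    """Keep only the expected fields, filling missing ones with None."""
    row = {field: entry.get(field) for field in EXPECTED_FIELDS}
    # Lowercase and trim the type; anything unrecognized becomes "other".
    kind = str(row["type"] or "").strip().lower()
    row["type"] = kind if kind in ALLOWED_TYPES else "other"
    return row


def normalize_entries(entries):
    """Normalize a list of parsed entries, skipping items that are not dicts."""
    return [normalize_entry(e) for e in entries if isinstance(e, dict)]
```

The normalized rows can then be passed to pandas, for example `pd.DataFrame(normalize_entries(structured), columns=EXPECTED_FIELDS)`.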

## Simple visualization

In [None]:
plt.figure(figsize=(6, 4))
df['type'].value_counts().plot(kind='bar')
plt.title('Content Types on the Page')
plt.xlabel('Type')
plt.ylabel('Count')
plt.show()