# Plaques2Gallery Pipeline

This notebook orchestrates the full pipeline for matching museum plaque photographs to artwork images.

**Pipeline Overview:**
1. **OCR Text Extraction** — Extract text from plaque images using Tesseract
2. **Text Cleaning** — Use Gemini AI to parse "Title by Artist" from noisy OCR output
3. **Web Search** — Find candidate URLs using Google Custom Search API
4. **Image Extraction** — Download artwork images from the found URLs

**Prerequisites:**
- Ensure the `.env` file contains your API keys: `GEMINI_API_KEY`, `GOOGLE_SEARCH_API_KEY`, and `GOOGLE_CSE_ID` (see `.env.example`)
- Place plaque images in the `museum_plaques/` directory
- Run cells sequentially within each section
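
The prerequisites above can be verified before any cell is run. The following is a minimal sketch rather than part of the pipeline: `check_prerequisites` is a hypothetical helper, the variable names are the ones later cells read with `os.getenv`, and the image suffixes match the ones the OCR cell accepts.

```python
import os
from pathlib import Path

# Environment variables read by later cells of this notebook
REQUIRED_KEYS = ("GEMINI_API_KEY", "GOOGLE_SEARCH_API_KEY", "GOOGLE_CSE_ID")
# File extensions the OCR step treats as plaque images
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".gif"}


def check_prerequisites(env, plaque_dir="museum_plaques"):
    """Return a list of problems; an empty list means the setup looks ready.

    Hypothetical helper for illustration. `env` is any mapping, e.g. os.environ.
    """
    problems = [
        f"missing environment variable: {key}"
        for key in REQUIRED_KEYS
        if not env.get(key)
    ]
    folder = Path(plaque_dir)
    if not folder.is_dir():
        problems.append(f"directory not found: {plaque_dir}")
    elif not any(p.suffix.lower() in IMAGE_SUFFIXES for p in folder.iterdir()):
        problems.append(f"no plaque images in: {plaque_dir}")
    return problems
```

Calling `check_prerequisites(os.environ)` after `load_dotenv()` would list anything that is still missing.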

## Setup and Imports

Load all required libraries and configure the environment. The `load_dotenv()` call reads API keys from your `.env` file.

In [None]:
import sys
from pathlib import Path
import shutil
from datetime import datetime

# Add parent directory to path so we can import the plaques2gallery package
sys.path.insert(0, str(Path().resolve().parent))

import os
import json
import cv2
import numpy as np
import pytesseract
import re
from dotenv import load_dotenv
from plaques2gallery.ocr_text_extraction import invert, custom_binarize
from plaques2gallery.clean_museum_plaques_text import clean_extracted_text
from plaques2gallery.web_search import google_search_top3
from plaques2gallery.web_scraping import process_a_painting
import google.generativeai as genai
from playwright.async_api import async_playwright
import asyncio
from docx import Document
from docx.shared import Inches
from make_it_pretty import save_results_to_docx

load_dotenv()

# Create required directories
os.makedirs("intermediate_results", exist_ok=True)
os.makedirs("extracted_images", exist_ok=True)


def save_json_with_backup(filepath: str, data: dict) -> None:
    """Save JSON data with automatic backup of existing file."""
    if os.path.exists(filepath):
        backup_dir = "intermediate_results/backups"
        os.makedirs(backup_dir, exist_ok=True)
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = os.path.basename(filepath)
        backup_path = os.path.join(backup_dir, f"{filename}.{timestamp}.bak")
        shutil.copy2(filepath, backup_path)
    
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

## 1. OCR Text Extraction

Extract text from museum plaque images using Tesseract OCR.

**Process:**
1. Load previously processed results (to avoid re-processing)
2. For each new image: convert to grayscale, invert if needed, binarize
3. Run OCR with multi-language support (English, German, Italian, Korean, Simplified Chinese, Japanese)
4. If Otsu thresholding yields no text, retry with `custom_binarize`, which uses adaptive thresholding to maximize OCR confidence

**Output:** `intermediate_results/ocr_extracted_text.json` — mapping of image filenames to raw OCR text
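
For intuition about the binarize step: the cell below delegates to OpenCV's `cv2.THRESH_OTSU`, which picks the threshold that best separates dark and bright pixels. The following pure-Python sketch shows the idea on a flat list of gray levels. It is for illustration only; `otsu_threshold` and `binarize` are hypothetical names, and the pipeline itself relies on OpenCV.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximizes between-class variance.

    Pixels <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low = 0    # number of pixels at or below the candidate threshold
    sum_low = 0  # sum of gray levels at or below the candidate threshold
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue  # no pixels in the lower class yet
        w_high = total - w_low
        if w_high == 0:
            break  # every pixel is in the lower class; nothing left to split
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        var_between = w_low * w_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, t):
    """Map pixels above the threshold to 255 and the rest to 0."""
    return [255 if p > t else 0 for p in pixels]


# Made-up example: dark text pixels (10) on a bright plaque background (200)
sample = [10] * 50 + [200] * 50
threshold = otsu_threshold(sample)
binary_sample = binarize(sample, threshold)
```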

In [None]:
folder_name = 'museum_plaques'

try:
    with open("intermediate_results/ocr_extracted_text.json", "r", encoding="utf-8") as f:
        ocr_extracted_text = json.load(f)
except FileNotFoundError:
    ocr_extracted_text = {}

newly_extracted_text = {}
for filename in os.listdir(folder_name):
    if filename in ocr_extracted_text:
        continue

    if filename.lower().endswith(('.png', '.jpg', '.jpeg', '.tiff', '.bmp', '.gif')):
        img_path = os.path.join(folder_name, filename)

        img = cv2.imread(img_path)
        if img is None:  # cv2.imread returns None for unreadable or corrupt files
            newly_extracted_text[filename] = "Failed to extract text."
            continue
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        inverted = invert(gray)
        t, binary = cv2.threshold(inverted, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

        # Strip before testing so whitespace-only OCR output counts as a failure
        text = pytesseract.image_to_string(binary, lang='eng+deu+ita+kor+chi_sim+jpn').strip()
        if not text:  # Try custom thresholding in the case of a failure
            t, binary = custom_binarize(inverted)
            text = pytesseract.image_to_string(binary, lang='eng+deu+ita+kor+chi_sim+jpn').strip()

        if not text:
            text = "Failed to extract text."

        newly_extracted_text[filename] = text

ocr_extracted_text.update(newly_extracted_text)

In [None]:
save_json_with_backup("intermediate_results/ocr_extracted_text.json", ocr_extracted_text)

## 2. Text Cleaning with Gemini AI

Raw OCR output is often noisy and inconsistent. This step uses Google's Gemini model to extract a clean, standardized format: **"Painting Title by Artist"**.

**Process:**
1. Load the Gemini API key from environment variables
2. For each OCR result, prompt Gemini to extract only the title and artist
3. The model handles OCR artifacts, formatting issues, and multilingual text

**Output:** `intermediate_results/cleaned_names.json` — mapping of image filenames to cleaned "Title by Artist" strings
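
Downstream code can split a cleaned string back into its two parts. The sketch below assumes the model really returns the "Title by Artist" format described above; `split_title_artist` is a hypothetical helper, not part of the `plaques2gallery` package. It splits on the last " by ", which handles titles that contain the word "by" but would misfire on an artist name that contains it.

```python
FAILED_MARKER = "Failed to extract text."  # sentinel stored by the OCR step


def split_title_artist(cleaned):
    """Split 'Title by Artist' into (title, artist); return None if not parseable."""
    if cleaned == FAILED_MARKER:
        return None
    # rpartition splits on the LAST occurrence of the separator
    title, sep, artist = cleaned.rpartition(" by ")
    if not sep or not title.strip() or not artist.strip():
        return None
    return title.strip(), artist.strip()


example = split_title_artist("The Starry Night by Vincent van Gogh")
# → ("The Starry Night", "Vincent van Gogh")
```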

In [None]:
with open("intermediate_results/ocr_extracted_text.json", "r", encoding="utf-8") as f:
    ocr_extracted_text = json.load(f)

In [None]:
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
if not GEMINI_API_KEY:
    raise ValueError("GEMINI_API_KEY environment variable is not set. Please add it to your .env file.")
genai.configure(api_key=GEMINI_API_KEY)
model = genai.GenerativeModel("models/gemini-1.5-pro-latest")

In [None]:
try:
    with open("intermediate_results/cleaned_names.json", "r", encoding="utf-8") as f:
        cleaned_names = json.load(f)
except FileNotFoundError:
    cleaned_names = {}

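# Clean every OCR result that has no cleaned name yet. Iterating over the loaded
# file (rather than `newly_extracted_text`) lets this section run on its own.
newly_cleaned_names = {}
for img_name, ocr_text in ocr_extracted_text.items():
    if img_name in cleaned_names:
        continue
    if ocr_text == "Failed to extract text.":
        newly_cleaned_names[img_name] = "Failed to extract text."
        continue
    newly_cleaned_names[img_name] = clean_extracted_text(ocr_text, model)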

cleaned_names.update(newly_cleaned_names)

In [None]:
save_json_with_backup("intermediate_results/cleaned_names.json", cleaned_names)

## 3. Web Search for Artwork Images

Use Google Custom Search API to find candidate URLs for each artwork.

**Why batching?** The free tier of the Google Custom Search API allows only 100 queries per day. With hundreds of plaques, we split the names into batches of 70 and process one batch per day, which keeps each day's run under the quota.
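
To make the batching arithmetic concrete, here is a small sketch with made-up data. It assumes one API query per painting name, which this notebook does not state explicitly; `split_into_batches` is a hypothetical helper that mirrors the slicing used in the batching cell.

```python
import math

DAILY_QUOTA = 100  # free-tier limit quoted above
BATCH_SIZE = 70    # batch size used in this notebook


def split_into_batches(mapping, batch_size=BATCH_SIZE):
    """Split a dict into a list of dicts with at most `batch_size` items each."""
    items = list(mapping.items())
    return [dict(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


# Made-up example: 250 cleaned names
names = {f"plaque_{i}.jpg": f"Painting {i} by Artist {i}" for i in range(250)}
batches = split_into_batches(names)

batch_sizes = [len(b) for b in batches]              # → [70, 70, 70, 40]
days_needed = math.ceil(len(names) / BATCH_SIZE)     # → 4
spare_queries_per_day = DAILY_QUOTA - BATCH_SIZE     # → 30
```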

**Process:**
1. Split cleaned names into batches of 70
2. Track which batch was last processed via `last_batch_index.txt`
3. For each painting name, search Google and collect top 3 URLs
4. Merge each batch's results into `search_results.json` (backing up the previous file) so results from earlier batches are preserved

**Output:** `intermediate_results/search_results.json` — mapping of painting names to lists of candidate URLs
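
The batch tracking in step 2 boils down to reading and writing one integer. The sketch below restates the logic of the cells that follow as two hypothetical helpers, `next_batch_index` and `mark_batch_done`. The key property is that the index only advances once a batch has been marked as done.

```python
from pathlib import Path


def next_batch_index(idx_file):
    """Return the batch to process next: last completed index + 1, or 1 on first run."""
    path = Path(idx_file)
    if not path.exists():
        return 1
    return int(path.read_text().strip()) + 1


def mark_batch_done(idx_file, batch_idx):
    """Record `batch_idx` as the last completed batch."""
    Path(idx_file).write_text(str(batch_idx))
```

Calling `next_batch_index` repeatedly without `mark_batch_done` in between keeps returning the same index, so an interrupted search can be retried on the same batch.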

In [None]:
with open("intermediate_results/cleaned_names.json", "r", encoding="utf-8") as f:
    imgs_to_names = json.load(f)

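# Keep only images not yet assigned to a batch, so re-runs do not create duplicate batches
already_batched = set()
for batch_file in Path("intermediate_results/batches_for_searching").glob("batch_*.json"):
    already_batched.update(json.loads(batch_file.read_text(encoding="utf-8")))

new_imgs_to_names = {img: name for img, name in imgs_to_names.items() if img not in already_batched}
new_names_to_imgs = {painting_name: img for img, painting_name in new_imgs_to_names.items()}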


In [None]:
batch_size = 70
output_dir = "intermediate_results/batches_for_searching"
os.makedirs(output_dir, exist_ok=True)

# Find the max existing batch index
existing_files = os.listdir(output_dir)
pattern = re.compile(r"batch_(\d+)\.json")
existing_indices = [int(m.group(1)) for f in existing_files if (m := pattern.match(f))]
new_batch_start_idx = max(existing_indices, default=0) + 1

# Create and write new batches
items = list(new_imgs_to_names.items())
for i in range(0, len(items), batch_size):
    batch = dict(items[i:i + batch_size])
    batch_num = new_batch_start_idx + (i // batch_size)
    batch_path = os.path.join(output_dir, f"batch_{batch_num}.json")
    with open(batch_path, "w", encoding="utf-8") as f:
        json.dump(batch, f, ensure_ascii=False, indent=2)

In [None]:
GOOGLE_SEARCH_API_KEY = os.getenv("GOOGLE_SEARCH_API_KEY")
CSE_ID = os.getenv("GOOGLE_CSE_ID")

if not GOOGLE_SEARCH_API_KEY or not CSE_ID:
    raise ValueError("GOOGLE_SEARCH_API_KEY and GOOGLE_CSE_ID environment variables must be set. Please add them to your .env file.")

In [None]:
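# Determine the batch to process: the last completed index plus one.
# The index file only advances after a successful search, so re-running this cell
# before the search completes selects the same batch again.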
idx_file = os.path.join("intermediate_results/batches_for_searching", "last_batch_index.txt")
try:
    with open(idx_file, "r") as f:
        cur_batch_idx = int(f.read().strip()) + 1
except FileNotFoundError:
    cur_batch_idx = 1

In [None]:
cur_batch_file = os.path.join("intermediate_results/batches_for_searching", f"batch_{cur_batch_idx}.json")
if os.path.exists(cur_batch_file):
    with open(cur_batch_file, "r", encoding="utf-8") as f:
        cur_batch = json.load(f)
else:
    raise FileNotFoundError(f"Batch file for index {cur_batch_idx} not found.")

In [None]:
try:
    with open("intermediate_results/search_results.json", "r", encoding="utf-8") as f:
        search_results = json.load(f)
except FileNotFoundError:
    search_results = {}

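new_search_results = {}
for painting_name in cur_batch.values():
    # Skip names that already have results so duplicates do not spend quota
    if painting_name in search_results or painting_name in new_search_results:
        continue
    links = google_search_top3(painting_name, GOOGLE_SEARCH_API_KEY, CSE_ID)
    new_search_results[painting_name] = links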

search_results.update(new_search_results)

with open(idx_file, "w") as f:
    f.write(str(cur_batch_idx))

In [None]:
save_json_with_backup("intermediate_results/search_results.json", search_results)

## 4. Image Extraction with Playwright

Visit each candidate URL and download the artwork image using browser automation.

**Process:**
1. Load the current batch and its search results
2. For each painting, try URLs in priority order: Wikipedia → Museum sites → Other
3. Use Playwright to render the page (handles JavaScript-heavy sites)
4. Select the largest visible image on the page (heuristic for finding the artwork)
5. Handle cookie banners automatically
6. Convert downloaded images to JPEG format

**Output:**
- `extracted_images/` — downloaded artwork images
- `final_results.json` — complete results with image paths, source URLs, and museum info
- `final_results.docx` — formatted document for easy viewing
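
Steps 2 and 4 of the process can be sketched in isolation. This is an illustration under assumptions, not the code inside `process_a_painting`: the museum domain list is a hypothetical example, and the image records are plain dicts standing in for whatever the browser reports about each image on the page.

```python
from urllib.parse import urlparse

# Hypothetical example list; the domains the package actually recognizes are not shown here
MUSEUM_DOMAINS = ("metmuseum.org", "louvre.fr", "rijksmuseum.nl")


def url_priority(url):
    """Rank a URL: 0 for Wikipedia, 1 for a known museum site, 2 for anything else."""
    host = (urlparse(url).hostname or "").lower()
    if host == "wikipedia.org" or host.endswith(".wikipedia.org"):
        return 0
    if any(host == d or host.endswith("." + d) for d in MUSEUM_DOMAINS):
        return 1
    return 2


def prioritize(urls):
    """Order URLs by priority; sorted() is stable, so ties keep their search rank."""
    return sorted(urls, key=url_priority)


def pick_largest_visible(images):
    """Return the visible image with the largest area, or None if none is visible."""
    visible = [img for img in images if img["visible"]]
    return max(visible, key=lambda img: img["width"] * img["height"], default=None)
```

Because the sort is stable, two URLs in the same category stay in the order the search returned them.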

In [None]:
# `cur_batch_idx` is set in the web search section. When running this section in a
# fresh session, assign it manually to the batch whose search results should be processed.
cur_batch_file = os.path.join("intermediate_results/batches_for_searching", f"batch_{cur_batch_idx}.json")
if os.path.exists(cur_batch_file):
    with open(cur_batch_file, "r", encoding="utf-8") as f:
        cur_batch = json.load(f)
else:
    raise FileNotFoundError(f"Batch file for index {cur_batch_idx} not found.")

search_results_file = 'intermediate_results/search_results.json'
with open(search_results_file, "r", encoding="utf-8") as f:
    full_search_results = json.load(f)

# Get the artwork titles (values) from cur_batch
cur_artwork_titles = set(cur_batch.values())

# Filter only entries in full_search_results with matching keys
cur_batch_search_results = {
    title: urls for title, urls in full_search_results.items() if title in cur_artwork_titles
}

In [None]:
try:
    with open("final_results.json", "r", encoding="utf-8") as f:
        final_results = json.load(f)
except FileNotFoundError:
    final_results = {}

cur_batch_results = {}

async_groups_cur_batch = []
cur_async_group = {}
for painting_name, urls in cur_batch_search_results.items():
    if len(cur_async_group) == 3:
        async_groups_cur_batch.append(cur_async_group)
        cur_async_group = {}
    cur_async_group[painting_name] = urls

if cur_async_group:
    async_groups_cur_batch.append(cur_async_group)

cur_batch_names_to_imgs = {painting_name: img for img, painting_name in cur_batch.items()}

async with async_playwright() as p:
    browser = await p.chromium.launch(headless=True)
    for async_group in async_groups_cur_batch:
        tasks = [
            process_a_painting(painting_name, urls, browser, cur_batch_names_to_imgs, cur_batch_results)
            for painting_name, urls in async_group.items()
        ]
        await asyncio.gather(*tasks)
    await browser.close()

final_results.update(cur_batch_results)

In [None]:
save_json_with_backup("final_results.json", final_results)

doc_path = "final_results.docx"
if os.path.exists(doc_path):
    backup_dir = "intermediate_results/backups"
    os.makedirs(backup_dir, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    shutil.copy2(doc_path, os.path.join(backup_dir, f"final_results.docx.{timestamp}.bak"))

save_results_to_docx(doc_path, cur_batch_results)