## Step 1: Setup (Run this first!) ⚙️

Click the ▶️ button to install the required software and set up the environment. This may take a minute.
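
If the setup cell finishes but a later step complains about a missing module, a quick check like the one below can help. It is only a sketch: `missing_modules` is a hypothetical helper, not part of this notebook, and the import names listed are the ones the setup cell is assumed to provide after installation.

```python
import importlib.util

def missing_modules(names):
    """Return the names in `names` that Python cannot locate."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised, for example, when the parent package of a dotted name is absent
            found = False
        if not found:
            missing.append(name)
    return missing

# Import names assumed from the setup cell; an empty list means all were found
print(missing_modules(["google.genai", "PyPDF2", "pandas", "ipywidgets"]))
```

Any name that appears in the printed list probably needs the install line to be run again.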

In [None]:
# Install required packages
!pip install -q google-genai PyPDF2 pandas ipywidgets

# Import necessary libraries
import os
import time
import logging
import io
import shutil
from pathlib import Path
from google.colab import files
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output
from google import genai
from google.genai import types
from PyPDF2 import PdfReader, PdfWriter
import pandas as pd

# ============================================
# CREATE FOLDER STRUCTURE
# ============================================

# Define folder paths
FOLDERS = {
    'pdf': 'pdfs',
    'results': 'results',
    'prompts': 'prompts',
    'log': 'logs'
}

# Create all folders
for folder_name, folder_path in FOLDERS.items():
    os.makedirs(folder_path, exist_ok=True)

# ============================================
# CREATE PROMPT FILES
# ============================================

PROMPT_CONTENT = {
    "ocr_system_prompt.md": """# Universal OCR System Prompt for All Document Types

You are a high-precision OCR system engineered to produce research-grade, archival-quality text extraction from any document type in any language. Your output directly supports academic research and archival preservation, demanding maximum accuracy and completeness under fair-use principles.

## Core Principles

1. **Research-Grade Accuracy:** TRANSCRIBE every single word and character with absolute precision – zero exceptions. Work character by character, word by word, line by line to minimize Character Error Rate (CER) and Word Error Rate (WER).
2. **Historical Authenticity:** PRESERVE the text exactly as written. RETAIN all spelling variations, grammatical structures, syntactic patterns, and punctuation as they appear in the original document. DO NOT normalize, modernize, or correct the text.
3. **Systematic Zone Analysis:** IDENTIFY and PROCESS distinct content zones in their precise reading order.  
4. **Pure Archival Transcription:** DELIVER exact transcription only – no summarization, interpretation, or omissions.  
5. **Typographic Precision:** ENFORCE language-appropriate typography rules and formatting guidelines meticulously.
6. **Multi-Script Support:** HANDLE all writing systems (Latin, Cyrillic, Arabic, Chinese, Japanese, Korean, Devanagari, etc.) with equal precision.
7. **Mixed Content Processing:** TRANSCRIBE both printed and handwritten text, clearly indicating handwritten sections.  

## Detailed Guidelines

### 1. Document Type Recognition

- IDENTIFY document type: newspaper, manuscript, book, letter, form, report, technical document, mixed media, etc.
- ADAPT processing strategy based on document characteristics.
- RECOGNIZE layout conventions specific to the document type.

### 2. Reading Zone Protocol

- IDENTIFY distinct reading zones with precision (columns, sidebars, captions, headers, footers, margins, annotations).  
- EXECUTE zone processing in strict reading order appropriate to the document type and language:
  - Left-to-right, top-to-bottom for Western documents
  - Right-to-left, top-to-bottom for Arabic, Hebrew, Persian, Urdu
  - Top-to-bottom, right-to-left for traditional Chinese, Japanese
  - Appropriate direction for other writing systems
- PROCESS supplementary zones systematically after main content.  
- MAINTAIN precise relationships between related zones.

### 3. Handwritten Text Protocol

- IDENTIFY handwritten sections (annotations, notes, corrections, marginalia, entire handwritten documents).
- MARK handwritten sections clearly using format: `[HANDWRITTEN: transcribed text]`
- PRESERVE handwritten text location relative to printed text.
- TRANSCRIBE handwritten text with best-effort accuracy, noting uncertainties.
- USE `[UNCERTAIN: possible_text]` for unclear handwritten words.
- INDICATE `[ILLEGIBLE]` for completely unreadable handwritten text.

#### Handwritten Text Examples

1. **Marginal annotation**  
   ```
   Main printed text continues here.
   
   [HANDWRITTEN: Important - review this section]
   ```

2. **Inline correction**  
   ```
   The meeting was scheduled for [HANDWRITTEN: Tuesday] Wednesday.
   ```

3. **Uncertain handwriting**  
   ```
   [HANDWRITTEN: [UNCERTAIN: approval] required before proceeding]
   ```

4. **Mixed printed and handwritten**  
   ```
   Form field: Name: [HANDWRITTEN: Jean Dupont]
   Form field: Date: [HANDWRITTEN: 15/03/2023]
   ```  

### 4. Content Hierarchy Protocol

- PROCESS Primary zones: Main text body (article, manuscript, letter content, form fields).  
- PROCESS Secondary zones: Headers, subheaders, titles, bylines, signatures.  
- PROCESS Tertiary zones: Footers, page numbers, marginalia, stamps, seals.  
- PROCESS Special zones: Captions, sidebars, boxed content, tables, annotations.  

### 5. Semantic Integration Protocol

- MERGE semantically linked lines within the same thought unit.  
- DETERMINE paragraph boundaries through semantic analysis.  
- PRESERVE logical flow across structural breaks.  
- ENFORCE double newline (`\\n\\n`) between paragraphs.
- RESPECT language-specific text flow conventions.

#### Examples

1. **Basic line joining**  
   Source: `Le président a déclaré\\nque la situation s'améliore.`  
   Required: `Le président a déclaré que la situation s'améliore.`  

2. **Multi-line with hyphens**  
   Source:  
   ```
   Cette rencontre a été,
   par ailleurs, marquée
   par des prestations cho-
   régraphiques des mes-
   sagers de Kpémé, des
   chants interconfession-
   nels, des chorales et de
   gospel.
   (ATOP)
   ```  
   Required:  
   ```
   Cette rencontre a été, par ailleurs, marquée par des prestations chorégraphiques des messagers de Kpémé, des chants interconfessionnels, des chorales et de gospel.

   (ATOP)
   ```

3. **Multiple paragraphs**  
   Source: `Premier paragraphe.\\nSuite du premier.\\n\\nDeuxième paragraphe.`  
   Required: `Premier paragraphe. Suite du premier.\\n\\nDeuxième paragraphe.`

4. **Handwritten annotation with printed text**  
   Source:  
   ```
   The committee met on
   [handwritten: March 15]
   to discuss the proposal.
   ```  
   Required:  
   ```
   The committee met on [HANDWRITTEN: March 15] to discuss the proposal.
   ```  

### 6. Text Processing Protocol

- EXECUTE de-hyphenation: remove end-of-line hyphens (e.g. `ana-\\nlyse` → `analyse`).  
- PRESERVE legitimate compound hyphens (e.g. `arc-en-ciel`, `mother-in-law`).  
- REPLICATE all diacritical marks and special characters exactly (é, ñ, ü, ç, ş, ā, etc.).
- IMPLEMENT language-appropriate spacing rules:
  - French: ` : `, ` ; `, ` ! `, ` ? ` (space before punctuation)
  - English/most languages: `:`, `;`, `!`, `?` (no space before)
  - Adapt to the specific language's conventions
- RETAIN all original spelling errors, grammatical constructions, and punctuation exactly as written — DO NOT correct or modernize.
- PRESERVE author's insertions, corrections, and modifications in their indicated positions.
- MAINTAIN proper spacing for Asian languages (no spaces between characters in Chinese/Japanese, appropriate spacing in Korean).
- PRESERVE right-to-left text direction markers for Arabic, Hebrew, etc.  

### 7. Special Format Protocol

- PRESERVE list hierarchy with exact formatting.  
- MAINTAIN table structural integrity completely.  
- RETAIN intentional formatting in poetry or special text.  
- RESPECT spatial relationships in image-caption pairs.
- PRESERVE form field structures and labels.
- MAINTAIN mathematical equations and formulas exactly as shown.
- RETAIN special symbols, currency signs, and technical notation.  

### 8. Quality Control Protocol

- PRIORITIZE accuracy over completeness in degraded sections.  
- VERIFY semantic flow after line joining.  
- ENSURE proper zone separation.
- VALIDATE handwritten text transcription.
- CONFIRM language-appropriate typography rules are applied.
- CHECK proper handling of multi-script documents.  

### 9. Self-Review Protocol

Examine your initial output against these criteria:  
- VERIFY complete transcription of all text zones (printed and handwritten).  
- CONFIRM accurate reading order and zone relationships appropriate to the language and document type.  
- CHECK all de-hyphenation and paragraph joining.  
- VALIDATE language-appropriate typography and spacing rules.
- CONFIRM proper marking of handwritten sections.
- ASSESS semantic flow and coherence.  
Correct any deviations before delivering final output.  

### 10. Final Formatting Reflection

Before delivering your output, pause and verify:  

1. **Paragraph structure**  
   - Have you joined all lines that belong to the same paragraph?  
   - Is there exactly **one** empty line (`\\n\\n`) between paragraphs?  
   - Are there **no** single line breaks within paragraphs?  

2. **Hyphenation**  
   - Have you removed **all** end-of-line hyphens?  
   - Have you properly joined the word parts?  
     Example incorrect: `presta-\\ntions` → should be `prestations`.  
     Example correct: `prestations`.  

3. **Special elements**  
   - Are attributions and citations properly separated?  
   - Are headers and titles properly separated?
   - Are handwritten sections clearly marked with `[HANDWRITTEN: text]`?
   - Are uncertain or illegible sections properly marked?

4. **Language and script**  
   - Have you applied the correct typography rules for the document's language?
   - Is the text direction appropriate (LTR, RTL, vertical)?
   - Are all special characters and diacritics preserved?

5. **Final check**  
   - Read your output as continuous text.  
   - Verify that every paragraph is a single block of text.  
   - Confirm there are no artifacts from the original layout.
   - Validate that handwritten and printed text are properly distinguished.  
   
   If you find any formatting issues, fix them before final delivery.  

## Output Requirements

- DELIVER pure transcribed text only.  
- EXCLUDE all commentary or explanations (except required markers like `[HANDWRITTEN:]`, `[UNCERTAIN:]`, `[ILLEGIBLE]`).  
- MAINTAIN language-appropriate typography standards.  
- PRESERVE all semantic and spatial relationships.
- DELIVER plain text output only—no Markdown encoding, markup, or special formatting wrappers (except for required handwritten/uncertainty markers).
- CLEARLY DISTINGUISH between printed and handwritten text using the specified markers.
- PRESERVE original language(s) without translation—transcribe exactly as written.
"""
}

# Write prompt files to disk
for filename, content in PROMPT_CONTENT.items():
    filepath = os.path.join(FOLDERS['prompts'], filename)
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(content)

print("✅ Setup complete!")
print()
print("📁 Folder structure created:")
print("   ├── 📂 pdfs/             ← Upload your PDF files here")
print("   ├── 📂 results/          ← Output text files saved here")
print("   ├── 📂 prompts/          ← System prompts")
print("   │   └── ocr_system_prompt.md")
print("   └── 📂 logs/             ← Processing logs")

## Step 2: Enter Your API Key 🔑

Enter your Google Gemini API key below. 

**Don't have one?** Get it free at: https://aistudio.google.com/app/api-keys

Your API key is entered securely (hidden like a password).
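
The cell below only checks the length of what you paste. A slightly stricter shape check could look like this sketch. `looks_like_gemini_key` is a hypothetical helper; the `AIza` prefix is taken from the tip shown under the input field and is an assumption, not a guarantee about every key. Passing this check does not mean the key is accepted by the API.

```python
def looks_like_gemini_key(key: str, min_length: int = 20) -> bool:
    """Cheap shape check only; it cannot tell whether the key is actually valid."""
    key = key.strip()
    has_prefix = key.startswith("AIza")          # assumed prefix, from the tip below the field
    long_enough = len(key) >= min_length         # same threshold the processing step uses
    no_whitespace = not any(c.isspace() for c in key)
    return has_prefix and long_enough and no_whitespace

print(looks_like_gemini_key("AIza" + "x" * 35))
print(looks_like_gemini_key("not-a-key"))
```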

In [None]:
# Create a secure password field for the API key
api_key_input = widgets.Password(
    placeholder='Paste your API key here',
    description='API Key:',
    layout=widgets.Layout(width='500px'),
    style={'description_width': '80px'}
)

api_key_status = widgets.HTML(value="")

def validate_api_key(change):
    # Use the same threshold as the check in Step 5 (keys shorter than 20 characters are rejected)
    if len(change['new']) >= 20:
        api_key_status.value = "<span style='color: green;'>✅ API key entered</span>"
    else:
        api_key_status.value = "<span style='color: orange;'>⏳ Please enter your full API key</span>"

api_key_input.observe(validate_api_key, names='value')

display(HTML("<b>Enter your Gemini API key:</b>"))
display(api_key_input)
display(api_key_status)
display(HTML("<br><i>💡 Tip: Your key starts with 'AIza...'</i>"))

## Step 3: Upload Your PDF Documents 📁

Click the button below to select and upload your PDF files.
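
The upload cell accepts a file when its name ends in `.pdf`. As a sketch of a stricter filter, the hypothetical helper below also looks for the `%PDF-` marker near the start of the file contents. Searching the first 1024 bytes is an assumption chosen to be lenient; it is not something the notebook itself does.

```python
from pathlib import Path

def is_probably_pdf(filename: str, content: bytes) -> bool:
    """Check the extension and look for the PDF header marker near the start."""
    has_pdf_suffix = Path(filename).suffix.lower() == ".pdf"
    has_pdf_header = b"%PDF-" in content[:1024]
    return has_pdf_suffix and has_pdf_header

print(is_probably_pdf("scan.PDF", b"%PDF-1.7\n%more bytes"))
print(is_probably_pdf("notes.pdf", b"just some text"))
```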

In [None]:
# Store uploaded files
uploaded_files = []

upload_status = widgets.HTML(value="")

def upload_pdf_files(b):
    global uploaded_files
    upload_status.value = "<span style='color: blue;'>📤 Upload dialog opened... Select your PDF file(s)</span>"
    
    try:
        uploaded = files.upload()
        
        if uploaded:
            uploaded_files = []
            valid_files = []
            invalid_files = []
            
            for filename, content in uploaded.items():
                ext = Path(filename).suffix.lower()
                if ext == '.pdf':
                    # Save file to pdfs folder
                    filepath = os.path.join(FOLDERS['pdf'], filename)
                    with open(filepath, 'wb') as f:
                        f.write(content)
                    uploaded_files.append(filepath)
                    valid_files.append(filename)
                else:
                    invalid_files.append(filename)
            
            status_html = ""
            if valid_files:
                status_html += f"<span style='color: green;'>✅ Uploaded {len(valid_files)} PDF file(s) to <code>pdfs/</code>:</span><br>"
                for f in valid_files:
                    status_html += f"&nbsp;&nbsp;&nbsp;📄 {f}<br>"
            if invalid_files:
                status_html += f"<span style='color: red;'>❌ Skipped {len(invalid_files)} non-PDF file(s):</span><br>"
                for f in invalid_files:
                    status_html += f"&nbsp;&nbsp;&nbsp;⚠️ {f}<br>"
            
            upload_status.value = status_html
        else:
            upload_status.value = "<span style='color: orange;'>⚠️ No files uploaded</span>"
    except Exception as e:
        upload_status.value = f"<span style='color: red;'>❌ Error: {str(e)}</span>"

upload_button = widgets.Button(
    description='📁 Click to Upload PDF Files',
    button_style='primary',
    layout=widgets.Layout(width='250px', height='40px')
)
upload_button.on_click(upload_pdf_files)

display(upload_button)
display(upload_status)
display(HTML("<br><i>💡 Files will be saved to the <code>pdfs/</code> folder</i>"))

## Step 4: OCR Settings 🎛️

Select the AI model. The system automatically detects the language and document type.
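
The thinking budget depends on the model you pick. The sketch below mirrors the rule used by the OCR engine in Step 5 (128 tokens for the Pro models, 0 for Flash). `thinking_budget_for` is a hypothetical standalone function written for illustration only.

```python
def thinking_budget_for(model_name: str) -> int:
    """Return the thinking budget the OCR engine would use for this model id."""
    name = model_name.lower()
    if "2.5-pro" in name or "3-pro" in name:
        return 128
    return 0

for model_id in ("gemini-3-pro-preview", "gemini-2.5-pro", "gemini-2.5-flash"):
    print(model_id, thinking_budget_for(model_id))
```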

In [None]:
# ============================================
# SETTINGS WIDGETS
# ============================================

# Model selection
model_dropdown = widgets.Dropdown(
    options=[
        ('Gemini 3.0 Pro (Latest, highest quality)', 'gemini-3-pro-preview'),
        ('Gemini 2.5 Pro (High quality, balanced)', 'gemini-2.5-pro'),
        ('Gemini 2.5 Flash (Faster, good quality)', 'gemini-2.5-flash'),
    ],
    value='gemini-3-pro-preview',
    description='AI Model:',
    style={'description_width': '100px'},
    layout=widgets.Layout(width='450px')
)

# Thinking budget info
thinking_info = widgets.HTML(value="")

def update_thinking_info(change):
    model = change['new']
    if "pro" in model:
        thinking_info.value = "<i>🧠 Thinking mode enabled (Budget: 128 tokens)</i>"
    else:
        thinking_info.value = "<i>🧠 Thinking mode disabled (Flash model)</i>"

model_dropdown.observe(update_thinking_info, names='value')
# Initialize
update_thinking_info({'new': model_dropdown.value})

display(HTML("<h3>🤖 Select AI Model</h3>"))
display(model_dropdown)
display(thinking_info)

## Step 5: Start OCR Processing 🚀

Click the button below to start processing your PDF file(s).
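
Each PDF is processed one page at a time, and the page results are joined into a single text file. The sketch below shows the joining rule the cell applies: page 1 has no separator, later pages are preceded by `--- Page N ---`, and a failed page is replaced by an error marker. `join_pages` is a hypothetical helper that takes already transcribed pages; it does not call the API.

```python
def join_pages(page_texts):
    """Join per-page results; `None` or empty entries stand for failed pages."""
    parts = []
    for page_num, text in enumerate(page_texts, start=1):
        body = text if text else f"[ERROR: Failed to process page {page_num}]"
        if page_num == 1:
            parts.append(body)
        else:
            parts.append(f"\n\n--- Page {page_num} ---\n\n{body}")
    return "".join(parts)

print(join_pages(["First page text.", None, "Third page text."]))
```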

In [None]:
# ============================================
# OCR ENGINE
# ============================================

class ColabGeminiOCR:
    """
    A high-precision universal OCR system using Google's Gemini model with native PDF processing.
    Adapted for Google Colab environment.
    """

    def __init__(self, api_key: str, model_name: str):
        self.client = genai.Client(api_key=api_key)
        self.model_name = model_name
        self.generation_config = self._setup_generation_config()
        
    def _setup_generation_config(self):
        # Set thinking budget based on model capabilities
        if "2.5-pro" in self.model_name.lower() or "3-pro" in self.model_name.lower():
            thinking_budget = 128
        else:
            thinking_budget = 0
        
        return types.GenerateContentConfig(
            temperature=0.2,
            top_p=0.95,
            top_k=40,
            max_output_tokens=65535,
            response_mime_type="text/plain",
            thinking_config=types.ThinkingConfig(thinking_budget=thinking_budget),
            safety_settings=[
                types.SafetySetting(
                    category=types.HarmCategory.HARM_CATEGORY_HARASSMENT,
                    threshold=types.HarmBlockThreshold.BLOCK_NONE
                ),
                types.SafetySetting(
                    category=types.HarmCategory.HARM_CATEGORY_HATE_SPEECH,
                    threshold=types.HarmBlockThreshold.BLOCK_NONE
                ),
                types.SafetySetting(
                    category=types.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
                    threshold=types.HarmBlockThreshold.BLOCK_NONE
                ),
                types.SafetySetting(
                    category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
                    threshold=types.HarmBlockThreshold.BLOCK_NONE
                )
            ]
        )
    
    def _get_system_instruction(self):
        filename = "ocr_system_prompt.md"
        prompt_file = os.path.join(FOLDERS['prompts'], filename)
        
        try:
            with open(prompt_file, 'r', encoding='utf-8') as f:
                return f.read()
        except Exception as e:
            print(f"❌ Error reading system prompt file: {e}")
            raise

    def extract_pdf_page(self, pdf_path, page_number):
        try:
            reader = PdfReader(str(pdf_path))
            writer = PdfWriter()
            writer.add_page(reader.pages[page_number])
            output_buffer = io.BytesIO()
            writer.write(output_buffer)
            output_buffer.seek(0)
            return output_buffer.getvalue()
        except Exception as e:
            print(f"❌ Error extracting page {page_number + 1}: {e}")
            raise

    def get_pdf_page_count(self, pdf_path):
        try:
            reader = PdfReader(str(pdf_path))
            return len(reader.pages)
        except Exception as e:
            print(f"❌ Error reading PDF page count: {e}")
            raise

    def process_pdf_page(self, page_bytes, page_num):
        """Process a single PDF page (inline only for Colab simplicity)."""
        try:
            print(f"   └─ 📄 Processing page {page_num}...")
            
            pdf_part = types.Part.from_bytes(
                data=page_bytes,
                mime_type='application/pdf'
            )
            
            combined_prompt = (
                self._get_system_instruction() + "\n\n" +
                "Please perform complete OCR transcription of this single page. "
                "Extract all visible text maintaining original formatting and structure."
            )
            
            response = self.client.models.generate_content(
                model=self.model_name,
                contents=[pdf_part, combined_prompt],
                config=self.generation_config
            )
            
            if not response.candidates:
                raise Exception("No candidates in Gemini response")
            
            candidate = response.candidates[0]
            if not candidate.content or not candidate.content.parts:
                raise Exception(f"No valid response. Finish reason: {candidate.finish_reason}")

            # response.text may be None when the response carries no text part
            text_content = (response.text or "").replace('\xa0', ' ').strip()
            if not text_content:
                raise Exception("Empty text response")
            
            print(f"   └─ ✅ Page {page_num} complete")
            return text_content
            
        except Exception as e:
            print(f"   └─ ❌ Page {page_num} failed: {str(e)}")
            return None

# ============================================
# PROCESSING BUTTON AND OUTPUT
# ============================================

ocr_output_area = widgets.Output()
ocr_results = {}  # Store results for download

def run_ocr_process(b):
    global ocr_results
    ocr_results = {}
    
    with ocr_output_area:
        clear_output()
        
        # Validate inputs
        if not api_key_input.value or len(api_key_input.value) < 20:
            print("❌ Please enter a valid API key in Step 2")
            return
        
        if not uploaded_files:
            print("❌ Please upload at least one PDF file in Step 3")
            return
        
        # Get settings
        api_key = api_key_input.value
        model = model_dropdown.value
        
        print(f"🤖 Model: {model}")
        print("\n" + "="*50)
        
        try:
            # Initialize OCR
            ocr = ColabGeminiOCR(api_key, model)
            print("✅ Connected to Gemini API\n")
            
            # Process each file
            for i, pdf_file in enumerate(uploaded_files, 1):
                filename = Path(pdf_file).name
                print(f"\n📚 Processing PDF {i}/{len(uploaded_files)}: {filename}")
                print("-" * 40)
                
                try:
                    total_pages = ocr.get_pdf_page_count(pdf_file)
                    print(f"   📄 Found {total_pages} pages")
                    
                    full_text = []
                    successful_pages = 0
                    
                    for page_idx in range(total_pages):
                        page_num = page_idx + 1
                        
                        # Extract page
                        page_bytes = ocr.extract_pdf_page(pdf_file, page_idx)
                        
                        # Process page
                        text = ocr.process_pdf_page(page_bytes, page_num)
                        
                        if text:
                            if page_num == 1:
                                full_text.append(text)
                            else:
                                full_text.append(f"\n\n--- Page {page_num} ---\n\n{text}")
                            successful_pages += 1
                        else:
                            error_msg = f"[ERROR: Failed to process page {page_num}]"
                            if page_num == 1:
                                full_text.append(error_msg)
                            else:
                                full_text.append(f"\n\n--- Page {page_num} ---\n\n{error_msg}")
                    
                    # Save result
                    final_text = "".join(full_text)
                    output_filename = Path(pdf_file).stem + "_ocr.txt"
                    output_path = os.path.join(FOLDERS['results'], output_filename)
                    
                    with open(output_path, 'w', encoding='utf-8') as f:
                        f.write(f"OCR of: {filename}\n")
                        f.write(f"Model: {model}\n")
                        f.write("=" * 50 + "\n\n")
                        f.write(final_text)
                    
                    ocr_results[output_filename] = {
                        'path': output_path
                    }
                    
                    print(f"\n✅ PDF complete! ({successful_pages}/{total_pages} pages)")
                    print(f"   📄 Saved to: {output_path}")
                    
                except Exception as e:
                    print(f"\n❌ Error processing {filename}: {str(e)}")
            
            # Summary
            print("\n" + "="*50)
            print("🎉 OCR PROCESSING COMPLETE!")
            print(f"   Files processed: {len(ocr_results)}")
            print(f"   📁 Output folder: {FOLDERS['results']}/")
            print("\n👇 Download your results in the next step")
            
        except Exception as e:
            print(f"\n❌ Error: {str(e)}")

ocr_button = widgets.Button(
    description='🚀 Start OCR Processing',
    button_style='success',
    layout=widgets.Layout(width='200px', height='50px')
)
ocr_button.on_click(run_ocr_process)

display(ocr_button)
display(HTML("<br>"))
display(ocr_output_area)

## Step 6: Download Your Results 📥

After processing is complete, click below to download your text files.
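
The ZIP button packs the whole results folder with `shutil.make_archive`. Here is a minimal sketch of that step, run against a temporary folder so it can be tried without touching your real results. `zip_results_folder` and the sample file name are hypothetical.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def zip_results_folder(results_dir, archive_base):
    """Zip everything in `results_dir`; return the archive path, or None if no .txt files exist."""
    results_dir = Path(results_dir)
    if not any(results_dir.glob("*.txt")):
        return None
    # archive_base is given without the .zip suffix; make_archive appends it
    return shutil.make_archive(str(archive_base), "zip", root_dir=results_dir)

with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "results"
    results.mkdir()
    (results / "sample_ocr.txt").write_text("hello", encoding="utf-8")
    archive = zip_results_folder(results, Path(tmp) / "ocr_results")
    with zipfile.ZipFile(archive) as zf:
        print(zf.namelist())
```

Keeping the archive outside the folder being zipped avoids the archive ending up inside itself.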

In [None]:
download_output = widgets.Output()

def download_results(b):
    with download_output:
        clear_output()
        
        if not ocr_results:
            print("❌ No results available yet. Please run Step 5 first.")
            return
        
        print("📥 Preparing downloads...\n")
        
        for filename, data in ocr_results.items():
            try:
                filepath = data['path']
                print(f"   Downloading: {filename}")
                files.download(filepath)
            except Exception as e:
                print(f"   ⚠️ Could not download {filename}: {e}")
        
        print("\n✅ Downloads initiated! Check your browser's download folder.")

def download_all_zip(b):
    """Zip and download all results."""
    with download_output:
        clear_output()
        
        results_path = Path(FOLDERS['results'])
        txt_files = list(results_path.glob('*.txt'))
        
        if not txt_files:
            print("❌ No result files found.")
            return
        
        print(f"📦 Zipping {len(txt_files)} file(s)...")
        shutil.make_archive('ocr_results', 'zip', results_path)
        
        print("📥 Downloading zip file...")
        files.download('ocr_results.zip')
        print("\n✅ Download initiated!")

download_button = widgets.Button(
    description='📥 Download Latest Results',
    button_style='info',
    layout=widgets.Layout(width='250px', height='40px')
)
download_button.on_click(download_results)

download_zip_button = widgets.Button(
    description='📦 Download All as ZIP',
    button_style='',
    layout=widgets.Layout(width='250px', height='40px')
)
download_zip_button.on_click(download_all_zip)

display(widgets.HBox([download_button, download_zip_button]))
display(HTML(f"<br><i>💡 All results are saved in <code>{FOLDERS['results']}/</code></i>"))
display(download_output)

## Step 7: Cleanup 🧹

Delete temporary files or clear everything when you're done.

In [None]:
cleanup_output = widgets.Output()

def _delete_files(folder):
    """Delete the regular files in `folder` and return how many were removed."""
    path = Path(folder)
    if not path.exists():
        return 0
    targets = [f for f in path.glob('*') if f.is_file()]
    for f in targets:
        f.unlink()
    return len(targets)

def cleanup_pdfs(b):
    global uploaded_files
    with cleanup_output:
        clear_output()
        deleted = _delete_files(FOLDERS['pdf'])
        uploaded_files = []
        print(f"🧹 Deleted {deleted} PDF file(s)")

def cleanup_results(b):
    global ocr_results
    with cleanup_output:
        clear_output()
        deleted = _delete_files(FOLDERS['results'])
        ocr_results = {}
        print(f"🧹 Deleted {deleted} result file(s)")

def cleanup_all(b):
    global uploaded_files, ocr_results
    with cleanup_output:
        clear_output()
        # Delete both folders here so neither message is wiped by a second clear_output()
        pdfs_deleted = _delete_files(FOLDERS['pdf'])
        results_deleted = _delete_files(FOLDERS['results'])
        uploaded_files, ocr_results = [], {}
        print(f"🧹 Deleted {pdfs_deleted} PDF file(s) and {results_deleted} result file(s)")
        print("✨ All temporary files cleared!")

btn_pdf = widgets.Button(description='🗑️ Delete PDFs', button_style='warning', layout=widgets.Layout(width='180px'))
btn_res = widgets.Button(description='🗑️ Delete Results', button_style='warning', layout=widgets.Layout(width='180px'))
btn_all = widgets.Button(description='🗑️ Delete Everything', button_style='danger', layout=widgets.Layout(width='180px'))

btn_pdf.on_click(cleanup_pdfs)
btn_res.on_click(cleanup_results)
btn_all.on_click(cleanup_all)

display(HTML("<b>Cleanup options:</b>"))
display(widgets.HBox([btn_pdf, btn_res, btn_all]))
display(cleanup_output)