## Step 1: Setup (Run this first!) ⚙️

Click the ▶️ button to install the required software and set up the environment. This may take a minute.

In [None]:
# Install required packages
!pip install -q google-genai PyPDF2 pandas ipywidgets

# Import necessary libraries
import os
import time
import logging
import io
import shutil
from pathlib import Path
from google.colab import files
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output
from google import genai
from google.genai import types
from PyPDF2 import PdfReader, PdfWriter
import pandas as pd

# ============================================
# CREATE FOLDER STRUCTURE
# ============================================

# Define folder paths
FOLDERS = {
    'pdf': 'pdfs',
    'results': 'results',
    'prompts': 'prompts',
    'log': 'logs'
}

# Create all folders
for folder_name, folder_path in FOLDERS.items():
    os.makedirs(folder_path, exist_ok=True)

# ============================================
# CREATE PROMPT FILES
# ============================================

PROMPT_CONTENT = {
    "htr_system_prompt_french.md": """# HTR System Prompt for French Handwritten Documents

You are a high-precision HTR (Handwritten Text Recognition) system specialized in French-language handwritten documents, engineered to produce research-grade, archival-quality text extraction. Your output directly supports academic research and archival preservation, demanding maximum accuracy and completeness under fair-use principles.

## Core Principles

1. **Research-Grade Accuracy:** TRANSCRIBE every single word and character from handwritten text with absolute precision – zero exceptions. Work character by character, word by word, line by line to minimize Character Error Rate (CER) and Word Error Rate (WER).
2. **Historical Authenticity:** PRESERVE the text exactly as written. RETAIN all spelling variations, grammatical structures, syntactic patterns, and punctuation as they appear in the original document. DO NOT normalize, modernize, or correct the historical text.
3. **Systematic Zone Analysis:** IDENTIFY and PROCESS distinct content zones in their precise reading order.  
4. **Pure Archival Transcription:** DELIVER exact transcription only – no summarization, interpretation, or omissions.  
5. **Typographic Precision:** ENFORCE French typography rules and formatting guidelines meticulously.  

## Detailed Guidelines

### 1. Reading Zone Protocol

- IDENTIFY distinct reading zones with precision (columns, sidebars, handwritten notes, captions, headers, footers).  
- EXECUTE zone processing in strict reading order: left-to-right, top-to-bottom within the main flow.  
- PROCESS supplementary zones, including handwritten annotations, systematically after main content.  
- MAINTAIN precise relationships between related zones.  

### 2. Content Hierarchy Protocol

- PROCESS Primary zones: Main body text (handwritten).
- PROCESS Secondary zones: Headers, subheaders, bylines.
- PROCESS Tertiary zones: Footers, page numbers, marginalia, and handwritten notes.
- PROCESS Special zones: Captions, sidebars, boxed content, and handwritten additions.  

### 3. Semantic Integration Protocol

- MERGE semantically linked lines within the same thought unit.  
- DETERMINE paragraph boundaries through semantic analysis.  
- PRESERVE logical flow across structural breaks.  
- ENFORCE double newline (`\\n\\n`) between paragraphs.  

#### Examples

1. **Basic line joining**  
   Source: `Le président a déclaré\\nque la situation s'améliore.`  
   Required: `Le président a déclaré que la situation s'améliore.`  

2. **Multi-line with hyphens**  
   Source:  
   ```
   Cette rencontre a été,
   par ailleurs, marquée
   par des prestations cho-
   régraphiques des mes-
   sagers de Kpémé, des
   chants interconfession-
   nels, des chorales et de
   gospel.
   (ATOP)
   ```  
   Required:  
   ```
   Cette rencontre a été, par ailleurs, marquée par des prestations chorégraphiques des messagers de Kpémé, des chants interconfessionnels, des chorales et de gospel.

   (ATOP)
   ```

3. **Multiple paragraphs**  
   Source: `Premier paragraphe.\\nSuite du premier.\\n\\nDeuxième paragraphe.`  
   Required: `Premier paragraphe. Suite du premier.\\n\\nDeuxième paragraphe.`  

### 4. Text Processing Protocol

- EXECUTE de-hyphenation: remove end-of-line hyphens (e.g. `ana-\\nlyse` → `analyse`).  
- PRESERVE legitimate compound hyphens (e.g. `arc-en-ciel`).  
- REPLICATE all diacritical marks and special characters exactly from handwriting.  
- IMPLEMENT French spacing rules precisely: ` : `, ` ; `, ` ! `, ` ? `.
- RETAIN all original spelling errors, grammatical constructions, and punctuation exactly as written – DO NOT correct or modernize.
- PRESERVE author's insertions, corrections, and modifications in their indicated positions.  

### 5. Special Format Protocol

- PRESERVE list hierarchy with exact formatting.  
- MAINTAIN table structural integrity completely.  
- RETAIN intentional formatting in poetry or special text, handwritten.  
- RESPECT spatial relationships in image-caption pairs and handwritten marginalia.  

### 6. Quality Control Protocol

- PRIORITIZE accuracy over completeness in degraded sections (including unclear handwriting).  
- VERIFY semantic flow after line joining.  
- ENSURE proper zone separation.  

### 7. Self-Review Protocol

Examine your initial output against these criteria:  
- VERIFY complete transcription of all text zones, including handwritten content.  
- CONFIRM accurate reading order and zone relationships.  
- CHECK all de-hyphenation and paragraph joining.  
- VALIDATE French typography and spacing rules.  
- ASSESS semantic flow and coherence.  
Correct any deviations before delivering final output.  

### 8. Final Formatting Reflection

Before delivering your output, pause and verify:  

1. **Paragraph structure**  
   - Have you joined all lines that belong to the same paragraph?  
   - Is there exactly **one** empty line (`\\n\\n`) between paragraphs?  
   - Are there **no** single line breaks within paragraphs?  

2. **Hyphenation**  
   - Have you removed **all** end-of-line hyphens?  
   - Have you properly joined the word parts?  
     Example incorrect: `presta-\\ntions` → should be `prestations`.  
     Example correct: `prestations`.  

3. **Special elements**  
   - Are attributions (e.g. `(ATOP)`) on their own line with double spacing?  
   - Are headers and titles properly separated?  

4. **Final check**  
   - Read your output as continuous text.  
   - Verify that every paragraph is a single block of text.  
   - Confirm there are no artifacts from the original layout.  
   If you find any formatting issues, fix them before final delivery.  

## Output Requirements

- DELIVER pure transcribed text only.  
- EXCLUDE all commentary or explanations.  
- MAINTAIN exact French typography standards.  
- PRESERVE all semantic and spatial relationships in handwritten additions.
""",
    "htr_system_prompt_arabic.md": """# HTR System Prompt for Arabic Handwritten Manuscripts

You are a high-precision HTR (Handwritten Text Recognition) system specialized in Arabic-language handwritten manuscripts, engineered to produce research-grade, archival-quality text extraction. Your output directly supports academic research and archival preservation, demanding maximum accuracy and completeness under fair-use principles.

## Core Principles

1. **Research-Grade Accuracy:** TRANSCRIBE every single word and character from handwritten Arabic text with absolute precision – zero exceptions. Work character by character, word by word, line by line to minimize Character Error Rate (CER) and Word Error Rate (WER).
2. **Historical Authenticity:** PRESERVE the text exactly as written. RETAIN all spelling variations, grammatical structures, syntactic patterns, and punctuation as they appear in the original manuscript. DO NOT normalize, modernize, or correct the historical text.
3. **Systematic Zone Analysis:** IDENTIFY and PROCESS distinct content zones in their precise reading order.  
4. **Pure Archival Transcription:** DELIVER exact transcription only – no summarization, interpretation, or omissions.  
5. **Typographic Precision:** ENFORCE Arabic typography rules and formatting guidelines meticulously.  

## Detailed Guidelines

### 1. Reading Zone Protocol

- IDENTIFY distinct reading zones with precision (columns, sidebars, handwritten notes, captions, headers, footers, marginalia).  
- EXECUTE zone processing in strict reading order: right-to-left for Arabic text, following traditional manuscript layout conventions.  
- PROCESS supplementary zones, including handwritten annotations, systematically after main content.  
- MAINTAIN precise relationships between related zones.  

### 2. Content Hierarchy Protocol

- PROCESS Primary zones: Main body text (handwritten Arabic).
- PROCESS Secondary zones: Headers, subheaders, chapter titles.
- PROCESS Tertiary zones: Footers, page numbers, marginalia, and handwritten notes.
- PROCESS Special zones: Captions, sidebars, boxed content, and handwritten additions.  

### 3. Semantic Integration Protocol

- MERGE semantically linked lines within the same thought unit.  
- DETERMINE paragraph boundaries through semantic analysis.  
- PRESERVE logical flow across structural breaks.  
- ENFORCE double newline (`\\n\\n`) between paragraphs.  

#### Examples

1. **Basic line joining**  
   Source: `قال الرئيس\\nإن الوضع يتحسن.`  
   Required: `قال الرئيس إن الوضع يتحسن.`  

2. **Multiple paragraphs**  
   Source: `الفقرة الأولى.\\nتتمة الفقرة.\\n\\nالفقرة الثانية.`  
   Required: `الفقرة الأولى. تتمة الفقرة.\\n\\nالفقرة الثانية.`  

### 4. Text Processing Protocol

- REPLICATE all diacritical marks (tashkeel) and special characters exactly from handwriting when present.  
- PRESERVE ligatures and connected letter forms as they appear in the manuscript.  
- MAINTAIN proper Arabic spacing rules.  
- RESPECT traditional manuscript orthography, including historical spelling variations.
- RETAIN all original spelling errors, grammatical constructions, and punctuation exactly as written – DO NOT correct or modernize.
- PRESERVE author's insertions, corrections, and modifications in their indicated positions.  

### 5. Special Format Protocol

- PRESERVE list hierarchy with exact formatting.  
- MAINTAIN table structural integrity completely.  
- RETAIN intentional formatting in poetry, Quranic verses, or special text.  
- RESPECT spatial relationships in image-caption pairs and handwritten marginalia.  

### 6. Quality Control Protocol

- PRIORITIZE accuracy over completeness in degraded sections (including unclear handwriting).  
- VERIFY semantic flow after line joining.  
- ENSURE proper zone separation.  
- MARK uncertain readings with [?] when text is illegible or ambiguous.  

### 7. Self-Review Protocol

Examine your initial output against these criteria:  
- VERIFY complete transcription of all text zones, including handwritten content.  
- CONFIRM accurate reading order and zone relationships (right-to-left for Arabic).  
- CHECK all paragraph joining and proper line breaks.  
- VALIDATE Arabic typography and spacing rules.  
- ASSESS semantic flow and coherence.  
Correct any deviations before delivering final output.  

### 8. Final Formatting Reflection

Before delivering your output, pause and verify:  

1. **Paragraph structure**  
   - Have you joined all lines that belong to the same paragraph?  
   - Is there exactly **one** empty line (`\\n\\n`) between paragraphs?  
   - Are there **no** single line breaks within paragraphs?  

2. **Arabic text direction**  
   - Is the text properly formatted for right-to-left reading?  
   - Are numerals and mixed-script elements handled correctly?  

3. **Special elements**  
   - Are chapter headings and titles properly separated?  
   - Are marginalia and annotations clearly distinguished?  

4. **Final check**  
   - Read your output as continuous text.  
   - Verify that every paragraph is a single block of text.  
   - Confirm there are no artifacts from the original layout.  
   If you find any formatting issues, fix them before final delivery.  

## Output Requirements

- DELIVER pure transcribed Arabic text only.  
- EXCLUDE all commentary or explanations.  
- MAINTAIN exact Arabic typography standards.  
- PRESERVE all semantic and spatial relationships in handwritten additions.
- RESPECT traditional manuscript conventions and historical orthography.
""",
    "htr_system_prompt_multilingual.md": """# HTR System Prompt for Multilingual Handwritten Documents

You are a high-precision HTR (Handwritten Text Recognition) system specialized in multilingual handwritten documents, engineered to produce research-grade, archival-quality text extraction. Your output directly supports academic research and archival preservation, demanding maximum accuracy and completeness under fair-use principles.

## Core Principles

1. **Language Detection First:** IDENTIFY the language(s) and writing system(s) present in the document before transcription.
2. **Research-Grade Accuracy:** TRANSCRIBE every single word and character from handwritten text with absolute precision – zero exceptions. Work character by character, word by word, line by line to minimize Character Error Rate (CER) and Word Error Rate (WER).
3. **Historical Authenticity:** PRESERVE the text exactly as written. RETAIN all spelling variations, grammatical structures, syntactic patterns, and punctuation as they appear in the original document. DO NOT normalize, modernize, or correct the historical text.
4. **Systematic Zone Analysis:** IDENTIFY and PROCESS distinct content zones in their precise reading order.  
5. **Pure Archival Transcription:** DELIVER exact transcription only – no summarization, interpretation, or omissions.  
6. **Typographic Precision:** ENFORCE language-specific typography rules and formatting guidelines meticulously.  

## Language Detection Protocol

### Step 1: Analyze the Document

Before transcription, EXAMINE the manuscript and DETERMINE:

1. **Primary writing system(s):**
   - Latin alphabet (e.g., French, English, Spanish, German, Italian, Portuguese, etc.)
   - Arabic script (e.g., Arabic, Persian, Urdu, Ottoman Turkish)
   - Cyrillic alphabet (e.g., Russian, Ukrainian, Bulgarian, Serbian)
   - Greek alphabet
   - Hebrew script
   - Chinese characters (Traditional or Simplified)
   - Japanese (Hiragana, Katakana, Kanji)
   - Korean (Hangul)
   - Devanagari script (e.g., Hindi, Sanskrit, Marathi, Nepali)
   - Other scripts (Bengali, Tamil, Thai, etc.)

2. **Language identification:**
   - Examine vocabulary, grammar patterns, and characteristic words
   - Note language-specific diacritics and special characters
   - Identify any mixed-language sections

3. **Text directionality:**
   - Left-to-right (most Latin, Cyrillic, Greek scripts)
   - Right-to-left (Arabic, Hebrew, Persian)
   - Top-to-bottom (traditional Chinese, Japanese)
   - Mixed directionality for multilingual documents

### Step 2: Output Format

BEGIN your transcription with a header (enclosed in square brackets) that states:

```
[LANGUAGE DETECTED: <language name>]
[WRITING SYSTEM: <script name>]
[TEXT DIRECTION: <direction>]

```

Then proceed with the transcription following language-specific rules.

#### Examples:

```
[LANGUAGE DETECTED: Russian]
[WRITING SYSTEM: Cyrillic]
[TEXT DIRECTION: Left-to-right]

<transcribed text follows>
```

```
[LANGUAGE DETECTED: Persian]
[WRITING SYSTEM: Arabic script]
[TEXT DIRECTION: Right-to-left]

<transcribed text follows>
```

```
[LANGUAGE DETECTED: Spanish and Latin (mixed)]
[WRITING SYSTEM: Latin alphabet]
[TEXT DIRECTION: Left-to-right]

<transcribed text follows>
```

## Detailed Guidelines

### 1. Reading Zone Protocol

- IDENTIFY distinct reading zones with precision (columns, sidebars, handwritten notes, captions, headers, footers, marginalia).  
- EXECUTE zone processing in strict reading order appropriate to the detected language and script.  
- PROCESS supplementary zones, including handwritten annotations, systematically after main content.  
- MAINTAIN precise relationships between related zones.  

### 2. Content Hierarchy Protocol

- PROCESS Primary zones: Main body text (handwritten).
- PROCESS Secondary zones: Headers, subheaders, titles.
- PROCESS Tertiary zones: Footers, page numbers, marginalia, and handwritten notes.
- PROCESS Special zones: Captions, sidebars, boxed content, and handwritten additions.  

### 3. Semantic Integration Protocol

- MERGE semantically linked lines within the same thought unit.  
- DETERMINE paragraph boundaries through semantic analysis.  
- PRESERVE logical flow across structural breaks.  
- ENFORCE double newline (`\\n\\n`) between paragraphs.  

### 4. Language-Specific Text Processing

#### For Latin-script languages:
- EXECUTE de-hyphenation: remove end-of-line hyphens (e.g., `ana-\\nlyse` → `analyse`).  
- PRESERVE legitimate compound hyphens (e.g., `arc-en-ciel`, `self-aware`).  
- REPLICATE all diacritical marks exactly (é, ñ, ö, ą, etc.).  
- IMPLEMENT language-specific spacing rules (e.g., French: ` : `, ` ; `, ` ! `, ` ? `).
- RETAIN all original spelling errors, grammatical constructions, and punctuation exactly as written – DO NOT correct or modernize.

#### For Arabic script:
- REPLICATE all diacritical marks (tashkeel, harakat) when present.  
- PRESERVE ligatures and connected letter forms.  
- MAINTAIN proper Arabic/Persian spacing rules.  
- RESPECT traditional orthography and historical spelling variations.
- RETAIN all original spelling errors and grammatical constructions exactly as written – DO NOT correct or modernize.

#### For Cyrillic script:
- PRESERVE hard signs (ъ), soft signs (ь), and all special characters (ё, є, і, ї, etc.).  
- REPLICATE historical orthographic forms if present (pre-reform spellings).  
- MAINTAIN proper spacing and punctuation rules.
- RETAIN all original spelling errors and grammatical constructions exactly as written – DO NOT correct or modernize.

#### For East Asian scripts:
- PRESERVE traditional or simplified character forms as written.  
- MAINTAIN proper spacing between characters and punctuation.  
- RESPECT vertical or horizontal text orientation as present.  
- PRESERVE ruby annotations (furigana) if present.
- RETAIN all original character choices and grammatical constructions exactly as written – DO NOT correct or modernize.

#### For Other scripts:
- IDENTIFY and use the correct Unicode characters for the script.  
- PRESERVE all diacritics, vowel marks, and special characters.  
- MAINTAIN script-specific spacing and formatting conventions.
- RETAIN all original spelling errors and grammatical constructions exactly as written – DO NOT correct or modernize.

#### Universal requirement for all scripts:
- PRESERVE author's insertions, corrections, and modifications in their indicated positions.  

### 5. Special Format Protocol

- PRESERVE list hierarchy with exact formatting.  
- MAINTAIN table structural integrity completely.  
- RETAIN intentional formatting in poetry, religious texts, or special content.  
- RESPECT spatial relationships in image-caption pairs and handwritten marginalia.  

### 6. Quality Control Protocol

- PRIORITIZE accuracy over completeness in degraded sections (including unclear handwriting).  
- VERIFY semantic flow after line joining.  
- ENSURE proper zone separation.  
- MARK uncertain readings with [?] when text is illegible or ambiguous.  
- NOTE language switches with [LANGUAGE SWITCH: <new language>] if the document contains multiple languages.

### 7. Self-Review Protocol

Examine your initial output against these criteria:  
- VERIFY correct language and script identification.
- CONFIRM complete transcription of all text zones, including handwritten content.  
- VALIDATE accurate reading order and zone relationships for the detected script direction.  
- CHECK all language-specific processing (hyphenation, diacritics, spacing).  
- ASSESS semantic flow and coherence.  
Correct any deviations before delivering final output.  

### 8. Final Formatting Reflection

Before delivering your output, pause and verify:  

1. **Language detection header**  
   - Have you included the language detection header at the beginning?  
   - Is the detected language, writing system, and direction correct?  

2. **Paragraph structure**  
   - Have you joined all lines that belong to the same paragraph?  
   - Is there exactly **one** empty line (`\\n\\n`) between paragraphs?  
   - Are there **no** single line breaks within paragraphs?  

3. **Language-specific rules**  
   - Have you applied the correct typography rules for the detected language?  
   - Are diacritics and special characters properly rendered?  
   - Is the text direction respected in formatting?  

4. **Mixed-language handling**  
   - Are language switches clearly marked if present?  
   - Is each section transcribed according to its own language rules?  

5. **Final check**  
   - Read your output as continuous text.  
   - Verify that every paragraph is a single block of text.  
   - Confirm there are no artifacts from the original layout.  
   If you find any formatting issues, fix them before final delivery.  

## Output Requirements

- BEGIN with language detection header in square brackets.
- DELIVER pure transcribed text only (after the header).  
- EXCLUDE all commentary or explanations beyond the detection header.  
- MAINTAIN exact language-specific typography standards.  
- PRESERVE all semantic and spatial relationships in handwritten additions.
- RESPECT traditional manuscript conventions and historical orthography for all languages.
- USE correct Unicode characters for all scripts and special characters.
"""
}

# Write prompt files to disk
for filename, content in PROMPT_CONTENT.items():
    filepath = os.path.join(FOLDERS['prompts'], filename)
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(content)

print("✅ Setup complete!")
print()
print("📁 Folder structure created:")
print("   ├── 📂 pdfs/             ← Upload your PDF files here")
print("   ├── 📂 results/          ← Output text files saved here")
print("   ├── 📂 prompts/          ← System prompts")
print("   │   ├── htr_system_prompt_french.md")
print("   │   ├── htr_system_prompt_arabic.md")
print("   │   └── htr_system_prompt_multilingual.md")
print("   └── 📂 logs/             ← Processing logs")
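
The line-joining and de-hyphenation rules that the prompts ask the model to follow can be illustrated in plain Python. This is only a rough sketch of the intended behaviour, not something the notebook runs: `join_wrapped_lines` is a hypothetical helper, and it assumes that every hyphen directly before a line break is a wrap hyphen, which the model is expected to judge semantically instead.

```python
import re

def join_wrapped_lines(text: str) -> str:
    """Join wrapped lines into paragraphs and remove end-of-line hyphens.

    Hypothetical helper for illustration; paragraphs are assumed to be
    separated by at least one blank line.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for para in paragraphs:
        # "ana-\nlyse" -> "analyse": drop a hyphen that sits right before a line break
        para = re.sub(r"-\n\s*", "", para)
        # Any remaining single line break becomes one space
        para = re.sub(r"\s*\n\s*", " ", para)
        joined.append(para.strip())
    # Exactly one empty line between paragraphs
    return "\n\n".join(joined)

print(join_wrapped_lines("Premier paragraphe.\nSuite du premier.\n\nDeuxième paragraphe."))
```

A rule this simple would also merge a legitimate compound such as `arc-en-ciel` if it happened to break at a hyphen, which is one reason the prompts leave the decision to the model.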

## Step 2: Enter Your API Key 🔑

Enter your Google Gemini API key below. 

**Don't have one?** Get it free at: https://aistudio.google.com/app/api-keys

Your API key is entered securely (hidden like a password).

In [None]:
# Create a secure password field for the API key
api_key_input = widgets.Password(
    placeholder='Paste your API key here',
    description='API Key:',
    layout=widgets.Layout(width='500px'),
    style={'description_width': '80px'}
)

api_key_status = widgets.HTML(value="")

def validate_api_key(change):
    if len(change['new']) > 20:
        api_key_status.value = "<span style='color: green;'>✅ API key entered</span>"
    else:
        api_key_status.value = "<span style='color: orange;'>⏳ Please enter your full API key</span>"

api_key_input.observe(validate_api_key, names='value')

display(HTML("<b>Enter your Gemini API key:</b>"))
display(api_key_input)
display(api_key_status)
display(HTML("<br><i>💡 Tip: Your key starts with 'AIza...'</i>"))
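
The check behind the status message can be written as a small standalone function. The sketch below combines the length test used by the widget callback with the `AIza` prefix mentioned in the tip. `looks_like_gemini_key` is a hypothetical name, and the check is only a heuristic: it never contacts the API, so a key that passes may still be rejected later.

```python
def looks_like_gemini_key(value: str) -> bool:
    """Heuristic pre-check for a pasted key (hypothetical helper).

    Uses the same length threshold as the widget callback (> 20 characters)
    plus the 'AIza' prefix from the tip; it does not validate the key.
    """
    value = value.strip()
    return len(value) > 20 and value.startswith("AIza")

print(looks_like_gemini_key("AIza" + "x" * 35))  # plausible shape
print(looks_like_gemini_key("too-short"))        # rejected by the length test
```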

## Step 3: Upload Your PDF Documents 📁

Click the button below to select and upload your PDF files.


In [None]:
# Store uploaded files
uploaded_files = []

upload_status = widgets.HTML(value="")

def upload_pdf_files(b):
    global uploaded_files
    upload_status.value = "<span style='color: blue;'>📤 Upload dialog opened... Select your PDF file(s)</span>"
    
    try:
        uploaded = files.upload()
        
        if uploaded:
            uploaded_files = []
            valid_files = []
            invalid_files = []
            
            for filename, content in uploaded.items():
                ext = Path(filename).suffix.lower()
                if ext == '.pdf':
                    # Save file to pdfs folder
                    filepath = os.path.join(FOLDERS['pdf'], filename)
                    with open(filepath, 'wb') as f:
                        f.write(content)
                    uploaded_files.append(filepath)
                    valid_files.append(filename)
                else:
                    invalid_files.append(filename)
            
            status_html = ""
            if valid_files:
                status_html += f"<span style='color: green;'>✅ Uploaded {len(valid_files)} PDF file(s) to <code>pdfs/</code>:</span><br>"
                for f in valid_files:
                    status_html += f"&nbsp;&nbsp;&nbsp;📄 {f}<br>"
            if invalid_files:
                status_html += f"<span style='color: red;'>❌ Skipped {len(invalid_files)} non-PDF file(s):</span><br>"
                for f in invalid_files:
                    status_html += f"&nbsp;&nbsp;&nbsp;⚠️ {f}<br>"
            
            upload_status.value = status_html
        else:
            upload_status.value = "<span style='color: orange;'>⚠️ No files uploaded</span>"
    except Exception as e:
        upload_status.value = f"<span style='color: red;'>❌ Error: {str(e)}</span>"

upload_button = widgets.Button(
    description='📁 Click to Upload PDF Files',
    button_style='primary',
    layout=widgets.Layout(width='250px', height='40px')
)
upload_button.on_click(upload_pdf_files)

display(upload_button)
display(upload_status)
display(HTML("<br><i>💡 Files will be saved to the <code>pdfs/</code> folder</i>"))
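
The upload handler accepts a file only when its extension is `.pdf`, compared case-insensitively. The sketch below isolates that filtering step so it can be tried without Colab; `split_by_extension` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path

def split_by_extension(filenames, allowed=".pdf"):
    """Partition filenames into (valid, invalid) by their final extension.

    Hypothetical helper mirroring the handler's `Path(...).suffix.lower()` test.
    """
    valid, invalid = [], []
    for name in filenames:
        if Path(name).suffix.lower() == allowed:
            valid.append(name)
        else:
            invalid.append(name)
    return valid, invalid

valid, invalid = split_by_extension(["letter.pdf", "SCAN.PDF", "notes.txt", "page.pdf.jpg"])
print(valid)
print(invalid)
```

Only the last suffix counts, so a name such as `page.pdf.jpg` is treated as a non-PDF file. The test looks at the filename alone and does not inspect the file's contents.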

## Step 4: HTR Settings 🎛️

Select the AI model and manuscript language.


In [None]:
# ============================================
# SETTINGS WIDGETS
# ============================================

# Model selection
model_dropdown = widgets.Dropdown(
    options=[
        ('Gemini 3.0 Pro (Latest, highest quality)', 'gemini-3-pro-preview'),
        ('Gemini 2.5 Pro (High quality, balanced)', 'gemini-2.5-pro'),
        ('Gemini 2.5 Flash (Faster, good quality)', 'gemini-2.5-flash'),
    ],
    value='gemini-3-pro-preview',
    description='AI Model:',
    style={'description_width': '100px'},
    layout=widgets.Layout(width='450px')
)

# Language selection
language_dropdown = widgets.Dropdown(
    options=[
        ('French Handwritten Manuscripts', 'french'),
        ('Arabic Handwritten Manuscripts', 'arabic'),
        ('Multilingual / Auto-detect', 'multilingual'),
    ],
    value='french',
    description='Language:',
    style={'description_width': '100px'},
    layout=widgets.Layout(width='450px')
)

# Thinking budget info
thinking_info = widgets.HTML(value="")

def update_thinking_info(change):
    model = change['new']
    if "pro" in model:
        thinking_info.value = "<i>🧠 Thinking mode enabled (Budget: 128 tokens)</i>"
    else:
        thinking_info.value = "<i>🧠 Thinking mode disabled (Flash model)</i>"

model_dropdown.observe(update_thinking_info, names='value')
# Initialize
update_thinking_info({'new': model_dropdown.value})

display(HTML("<h3>🤖 Select AI Model</h3>"))
display(model_dropdown)
display(thinking_info)

display(HTML("<h3>📜 Select Manuscript Language</h3>"))
display(language_dropdown)
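
The thinking-mode label here and the HTR engine in Step 5 decide independently whether a model gets a thinking budget: the label tests for `"pro"` in the model id, while the engine tests for `"2.5-pro"` or `"3-pro"`. The sketch below restates both rules as plain functions (hypothetical names) and checks that they agree for the three dropdown values. The extra id `gemini-1.5-pro` is not offered by the notebook and is included only to show how the rules could drift apart.

```python
MODEL_IDS = ["gemini-3-pro-preview", "gemini-2.5-pro", "gemini-2.5-flash"]

def label_reports_thinking(model_name: str) -> bool:
    """Rule used by the settings label: any id containing 'pro'."""
    return "pro" in model_name

def engine_thinking_budget(model_name: str) -> int:
    """Rule used by the engine: 128 tokens for 2.5 Pro and 3 Pro ids, else 0."""
    name = model_name.lower()
    return 128 if ("2.5-pro" in name or "3-pro" in name) else 0

def rules_agree(model_name: str) -> bool:
    return label_reports_thinking(model_name) == (engine_thinking_budget(model_name) > 0)

for model_id in MODEL_IDS + ["gemini-1.5-pro"]:
    print(model_id, engine_thinking_budget(model_id), rules_agree(model_id))
```

If more models are added to the dropdown, deriving the label from a single shared function would keep the displayed message and the actual budget in step.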

## Step 5: Start HTR Processing 🚀

Click the button below to start processing your PDF file(s).


In [None]:
# ============================================
# HTR ENGINE
# ============================================

class ColabGeminiHTR:
    """
    A high-precision HTR system using Google's Gemini model with native PDF processing.
    Adapted for Google Colab environment.
    """

    def __init__(self, api_key: str, model_name: str, language: str = "french"):
        self.client = genai.Client(api_key=api_key)
        self.model_name = model_name
        self.language = language
        self.generation_config = self._setup_generation_config()
        
    def _setup_generation_config(self):
        # Set thinking budget based on model capabilities
        if "2.5-pro" in self.model_name.lower() or "3-pro" in self.model_name.lower():
            thinking_budget = 128
        else:
            thinking_budget = 0
        
        return types.GenerateContentConfig(
            temperature=0.2,
            top_p=0.95,
            top_k=40,
            max_output_tokens=65535,
            response_mime_type="text/plain",
            thinking_config=types.ThinkingConfig(thinking_budget=thinking_budget),
            safety_settings=[
                types.SafetySetting(
                    category=types.HarmCategory.HARM_CATEGORY_HARASSMENT,
                    threshold=types.HarmBlockThreshold.BLOCK_NONE
                ),
                types.SafetySetting(
                    category=types.HarmCategory.HARM_CATEGORY_HATE_SPEECH,
                    threshold=types.HarmBlockThreshold.BLOCK_NONE
                ),
                types.SafetySetting(
                    category=types.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
                    threshold=types.HarmBlockThreshold.BLOCK_NONE
                ),
                types.SafetySetting(
                    category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
                    threshold=types.HarmBlockThreshold.BLOCK_NONE
                )
            ]
        )
    
    def _get_system_instruction(self):
        # Select the appropriate prompt file based on language
        if self.language == "arabic":
            filename = "htr_system_prompt_arabic.md"
        elif self.language == "multilingual":
            filename = "htr_system_prompt_multilingual.md"
        else:  # default to french
            filename = "htr_system_prompt_french.md"
        
        prompt_file = os.path.join(FOLDERS['prompts'], filename)
        
        try:
            with open(prompt_file, 'r', encoding='utf-8') as f:
                return f.read()
        except Exception as e:
            print(f"❌ Error reading system prompt file: {e}")
            raise

    def extract_pdf_page(self, pdf_path, page_number):
        try:
            reader = PdfReader(str(pdf_path))
            writer = PdfWriter()
            writer.add_page(reader.pages[page_number])
            output_buffer = io.BytesIO()
            writer.write(output_buffer)
            output_buffer.seek(0)
            return output_buffer.getvalue()
        except Exception as e:
            print(f"❌ Error extracting page {page_number + 1}: {e}")
            raise

    def get_pdf_page_count(self, pdf_path):
        try:
            reader = PdfReader(str(pdf_path))
            return len(reader.pages)
        except Exception as e:
            print(f"❌ Error reading PDF page count: {e}")
            raise

    def process_pdf_page(self, page_bytes, page_num):
        """Process a single PDF page (inline only for Colab simplicity)."""
        try:
            print(f"   └─ 📄 Processing page {page_num}...")
            
            pdf_part = types.Part.from_bytes(
                data=page_bytes,
                mime_type='application/pdf'
            )
            
            if self.language == "multilingual":
                language_desc = "text (detect language automatically)"
            elif self.language == "arabic":
                language_desc = "Arabic"
            else:
                language_desc = "French"
            
            combined_prompt = (
                self._get_system_instruction() + "\n\n" +
                "This is a legitimate handwritten text recognition (HTR) request for academic research and archival preservation. "
                f"Transcribe ALL handwritten {language_desc} text with exact wording, spacing rules, accents, and WITHOUT summarizing or omitting any zones."
            )
            
            response = self.client.models.generate_content(
                model=self.model_name,
                contents=[pdf_part, combined_prompt],
                config=self.generation_config
            )
            
            if not response.candidates:
                raise RuntimeError("No candidates in Gemini response")
            
            candidate = response.candidates[0]
            if not candidate.content or not candidate.content.parts:
                raise RuntimeError(f"No valid response. Finish reason: {candidate.finish_reason}")

            # response.text can be None when no text part is returned, so guard before normalizing
            text_content = (response.text or "").replace('\xa0', ' ').strip()
            if not text_content:
                raise RuntimeError("Empty text response")
            
            print(f"   └─ ✅ Page {page_num} complete")
            return text_content
            
        except Exception as e:
            print(f"   └─ ❌ Page {page_num} failed: {str(e)}")
            return None

# ============================================
# PROCESSING BUTTON AND OUTPUT
# ============================================

htr_output_area = widgets.Output()
htr_results = {}  # Store results for download

def run_htr_process(b):
    global htr_results
    htr_results = {}
    
    with htr_output_area:
        clear_output()
        
        # Validate inputs
        if not api_key_input.value or len(api_key_input.value) < 20:
            print("❌ Please enter a valid API key in Step 2")
            return
        
        if not uploaded_files:
            print("❌ Please upload at least one PDF file in Step 3")
            return
        
        # Get settings
        api_key = api_key_input.value
        model = model_dropdown.value
        language = language_dropdown.value
        
        print(f"🤖 Model: {model}")
        print(f"📜 Language: {language}")
        print("\n" + "="*50)
        
        try:
            # Initialize HTR
            htr = ColabGeminiHTR(api_key, model, language)
            print("✅ Connected to Gemini API\n")
            
            # Process each file
            for i, pdf_file in enumerate(uploaded_files, 1):
                filename = Path(pdf_file).name
                print(f"\n📚 Processing PDF {i}/{len(uploaded_files)}: {filename}")
                print("-" * 40)
                
                try:
                    total_pages = htr.get_pdf_page_count(pdf_file)
                    print(f"   📄 Found {total_pages} pages")
                    
                    full_text = []
                    successful_pages = 0
                    
                    for page_idx in range(total_pages):
                        page_num = page_idx + 1
                        
                        # Extract page
                        page_bytes = htr.extract_pdf_page(pdf_file, page_idx)
                        
                        # Process page
                        text = htr.process_pdf_page(page_bytes, page_num)
                        
                        if text:
                            if page_num == 1:
                                full_text.append(text)
                            else:
                                full_text.append(f"\n\n--- Page {page_num} ---\n\n{text}")
                            successful_pages += 1
                        else:
                            error_msg = f"[ERROR: Failed to process page {page_num}]"
                            if page_num == 1:
                                full_text.append(error_msg)
                            else:
                                full_text.append(f"\n\n--- Page {page_num} ---\n\n{error_msg}")
                    
                    # Save result
                    final_text = "".join(full_text)
                    output_filename = Path(pdf_file).stem + "_htr.txt"
                    output_path = os.path.join(FOLDERS['results'], output_filename)
                    
                    with open(output_path, 'w', encoding='utf-8') as f:
                        f.write(f"HTR of: {filename}\n")
                        f.write(f"Model: {model}\n")
                        f.write(f"Language: {language}\n")
                        f.write("=" * 50 + "\n\n")
                        f.write(final_text)
                    
                    htr_results[output_filename] = {
                        'path': output_path
                    }
                    
                    print(f"\n✅ PDF complete! ({successful_pages}/{total_pages} pages)")
                    print(f"   📄 Saved to: {output_path}")
                    
                except Exception as e:
                    print(f"\n❌ Error processing {filename}: {str(e)}")
            
            # Summary
            print("\n" + "="*50)
            print("🎉 HTR PROCESSING COMPLETE!")
            print(f"   Files processed: {len(htr_results)}")
            print(f"   📁 Output folder: {FOLDERS['results']}/")
            print("\n👇 Download your results in the next step")
            
        except Exception as e:
            print(f"\n❌ Error: {str(e)}")

htr_button = widgets.Button(
    description='🚀 Start HTR Processing',
    button_style='success',
    layout=widgets.Layout(width='200px', height='50px')
)
htr_button.on_click(run_htr_process)

display(htr_button)
display(HTML("<br>"))
display(htr_output_area)
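
The page-assembly rule used in the cell above can be sketched as a standalone function: the first page has no header, later pages are introduced by a `--- Page N ---` separator, and failed pages are replaced by an error marker. This is only an illustration. `assemble_transcript` is a hypothetical helper name that does not exist in the notebook, and `None` stands in for a page that `process_pdf_page` could not transcribe.

```python
def assemble_transcript(pages):
    """Join per-page texts the way the processing loop does.

    `pages` is a list of strings, with None marking a failed page.
    Returns (final_text, successful_pages).
    """
    parts = []
    successful = 0
    for page_num, text in enumerate(pages, start=1):
        if text:
            successful += 1
            body = text
        else:
            # Same placeholder the notebook writes for a failed page
            body = f"[ERROR: Failed to process page {page_num}]"
        if page_num == 1:
            parts.append(body)
        else:
            parts.append(f"\n\n--- Page {page_num} ---\n\n{body}")
    return "".join(parts), successful


# Made-up sample pages, with the second one simulating a failure
final_text, ok = assemble_transcript(["Première page", None, "Troisième page"])
print(f"{ok}/3 pages transcribed")
print(final_text)
```

Keeping the error marker in place of a failed page means the page numbering in the saved file still matches the source PDF.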

## Step 6: Download Your Results 📥

After processing is complete, click below to download your text files.
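
For orientation, the ZIP option in the cell below relies on `shutil.make_archive` from the Python standard library. Here is a minimal sketch of that step outside Colab, using a temporary folder. The helper name `bundle_results` and the file name `letter_htr.txt` are illustrative assumptions, not names from the notebook.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def bundle_results(results_dir, archive_base):
    """Zip every file in results_dir and return the path of the archive."""
    # make_archive appends the ".zip" extension to archive_base itself
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(results_dir))


with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "results"
    results.mkdir()
    (results / "letter_htr.txt").write_text("HTR of: letter.pdf\n", encoding="utf-8")

    # Write the archive next to (not inside) the folder being zipped
    archive = bundle_results(results, Path(tmp) / "htr_results")
    with zipfile.ZipFile(archive) as zf:
        print(zf.namelist())
```

Writing the archive outside the results folder avoids any chance of the ZIP file being swept into itself.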


In [None]:
download_output = widgets.Output()

def download_results(b):
    with download_output:
        clear_output()
        
        if not htr_results:
            print("❌ No results available yet. Please run Step 5 first.")
            return
        
        print("📥 Preparing downloads...\n")
        
        for filename, data in htr_results.items():
            try:
                filepath = data['path']
                print(f"   Downloading: {filename}")
                files.download(filepath)
            except Exception as e:
                print(f"   ⚠️ Could not download {filename}: {e}")
        
        print("\n✅ Downloads initiated! Check your browser's download folder.")

def download_all_zip(b):
    """Zip and download all results."""
    with download_output:
        clear_output()
        
        results_path = Path(FOLDERS['results'])
        txt_files = list(results_path.glob('*.txt'))
        
        if not txt_files:
            print("❌ No result files found.")
            return
        
        print(f"📦 Zipping {len(txt_files)} file(s)...")
        shutil.make_archive('htr_results', 'zip', results_path)
        
        print("📥 Downloading zip file...")
        files.download('htr_results.zip')
        print("\n✅ Download initiated!")

download_button = widgets.Button(
    description='📥 Download Latest Results',
    button_style='info',
    layout=widgets.Layout(width='250px', height='40px')
)
download_button.on_click(download_results)

download_zip_button = widgets.Button(
    description='📦 Download All as ZIP',
    button_style='',
    layout=widgets.Layout(width='250px', height='40px')
)
download_zip_button.on_click(download_all_zip)

display(widgets.HBox([download_button, download_zip_button]))
display(HTML(f"<br><i>💡 All results are saved in <code>{FOLDERS['results']}/</code></i>"))
display(download_output)

## Step 7: Cleanup 🧹

Delete temporary files or clear everything when you're done.
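
As a rough illustration of what a cleanup button does, the sketch below removes the regular files in a folder and reports how many were deleted. `clear_folder` is a hypothetical helper, not a function defined in this notebook, and the demo runs against a temporary folder rather than the real `pdfs/` or `results/` directories.

```python
import tempfile
from pathlib import Path


def clear_folder(folder):
    """Delete the regular files directly inside `folder`; return how many were removed."""
    folder = Path(folder)
    if not folder.exists():
        return 0
    removed = 0
    for entry in folder.iterdir():
        if entry.is_file():  # leave subdirectories untouched
            entry.unlink()
            removed += 1
    return removed


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "scan.pdf").write_bytes(b"%PDF-1.4")
    print(f"Deleted {clear_folder(tmp)} file(s)")
```

Returning a count rather than printing inside the helper leaves the caller free to decide how to report the result.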


In [None]:
cleanup_output = widgets.Output()

def cleanup_pdfs(b):
    global uploaded_files
    with cleanup_output:
        clear_output()
        path = Path(FOLDERS['pdf'])
        if path.exists():
            # Only delete regular files; unlink() raises on directories
            files_deleted = [f for f in path.glob('*') if f.is_file()]
            for f in files_deleted:
                f.unlink()
            print(f"🧹 Deleted {len(files_deleted)} PDF file(s)")
            uploaded_files = []
        else:
            print("📁 PDF folder does not exist; nothing to delete")

def cleanup_results(b):
    global htr_results
    with cleanup_output:
        clear_output()
        path = Path(FOLDERS['results'])
        if path.exists():
            # Only delete regular files; unlink() raises on directories
            files_deleted = [f for f in path.glob('*') if f.is_file()]
            for f in files_deleted:
                f.unlink()
            print(f"🧹 Deleted {len(files_deleted)} result file(s)")
            htr_results = {}
        else:
            print("📁 Results folder does not exist; nothing to delete")

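def cleanup_all(b):
    global uploaded_files, htr_results
    with cleanup_output:
        clear_output()
        # Delete directly rather than calling cleanup_pdfs/cleanup_results,
        # whose clear_output() calls would erase each other's messages.
        for label, key in (('PDF', 'pdf'), ('result', 'results')):
            path = Path(FOLDERS[key])
            files_deleted = [f for f in path.glob('*') if f.is_file()] if path.exists() else []
            for f in files_deleted:
                f.unlink()
            print(f"🧹 Deleted {len(files_deleted)} {label} file(s)")
        uploaded_files = []
        htr_results = {}
        print("✨ All temporary files cleared!")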

btn_pdf = widgets.Button(description='🗑️ Delete PDFs', button_style='warning', layout=widgets.Layout(width='180px'))
btn_res = widgets.Button(description='🗑️ Delete Results', button_style='warning', layout=widgets.Layout(width='180px'))
btn_all = widgets.Button(description='🗑️ Delete Everything', button_style='danger', layout=widgets.Layout(width='180px'))

btn_pdf.on_click(cleanup_pdfs)
btn_res.on_click(cleanup_results)
btn_all.on_click(cleanup_all)

display(HTML("<b>Cleanup options:</b>"))
display(widgets.HBox([btn_pdf, btn_res, btn_all]))
display(cleanup_output)