# Phase 2: Translate German Anchors to English

## Goal
Translate the 8,541 German anchor words to English using DeepL API.
This creates the hieroglyphic ↔ English anchor pairs needed for alignment.

## Strategy
1. Load `german_anchors.json`
2. Extract unique German words
3. Batch translate using DeepL API (most accurate for German→English)
4. Create final anchor dictionary: hieroglyphic → English
5. Save for embedding alignment

## Cost Estimate
- DeepL Free: 500k characters/month
- Our ~8.5k words ≈ 85k characters
- **Cost: FREE** (within free tier)

In [16]:
import json
import pickle
from pathlib import Path
from collections import Counter
import time

# Load German anchors
anchors_path = Path('../data/processed/german_anchors.json')
with open(anchors_path, 'r', encoding='utf-8') as f:
    german_anchors = json.load(f)

print(f"Loaded {len(german_anchors):,} anchors")
print(f"\nSample:")
for i in range(5):
    a = german_anchors[i]
    print(f"  {a['hieroglyphic']:15s} → {a['german']:20s} (conf: {a['confidence']:.2f})")

Loaded 8,541 anchors

Sample:
  n               → der                  (conf: 0.34)
  m               → der                  (conf: 0.37)
  =f              → er                   (conf: 0.38)
  =k              → du                   (conf: 0.45)
  =j              → ich                  (conf: 0.60)


In [2]:
# Extract unique German words
german_words = list(set([a['german'] for a in german_anchors]))
print(f"Unique German words to translate: {len(german_words):,}")

# Estimate character count
total_chars = sum(len(w) for w in german_words)
print(f"Total characters: {total_chars:,}")
print(f"\nWithin DeepL free tier: {total_chars < 500000}")

Unique German words to translate: 1,917
Total characters: 13,212

Within DeepL free tier: True


## DeepL API Setup

To use DeepL:
1. Sign up at https://www.deepl.com/pro-api
2. Get your API key (free tier: 500k chars/month)
3. Install: `pip install deepl`
4. Set environment variable: `export DEEPL_API_KEY="your-key"`

**Alternative:** Use Google Translate (less accurate for German)
```python
from googletrans import Translator
translator = Translator()
result = translator.translate(text, src='de', dest='en')
```

In [17]:

# Option 1: DeepL (recommended)
try:
    import deepl
    import os
    
    api_key = os.getenv('DEEPL_API_KEY')
    if not api_key:
        print("⚠️  DEEPL_API_KEY not found in environment")
        print("Please set it: export DEEPL_API_KEY='your-key'")
        raise ImportError("API key required")
    
    translator = deepl.Translator(api_key)
    print("✓ DeepL translator initialized")
    
    # Test translation
    test = translator.translate_text("Gott", target_lang="EN-US")
    print(f"Test: 'Gott' → '{test.text}'")
    
except ImportError as e:
    print(f"DeepL not available: {e}")
    print("\nFalling back to Google Translate...")
    
    try:
        from googletrans import Translator
        translator = Translator()
        print("✓ Google Translate initialized")
        
        # Test translation
        test = translator.translate("Gott", src='de', dest='en')
        print(f"Test: 'Gott' → '{test.text}'")
        
    except ImportError:
        print("❌ No translation library available")
        print("Install one: pip install deepl OR pip install googletrans==4.0.0-rc1")

⚠️  DEEPL_API_KEY not found in environment
Please set it: export DEEPL_API_KEY='your-key'
DeepL not available: API key required

Falling back to Google Translate...
❌ No translation library available
Install one: pip install deepl OR pip install googletrans==4.0.0-rc1


In [18]:
# Translate in batches
from tqdm import tqdm

translations = {}
BATCH_SIZE = 50  # Translate 50 words at a time
USE_DEEPL = 'deepl' in str(type(translator)).lower()

print(f"Translating {len(german_words):,} words using {'DeepL' if USE_DEEPL else 'Google Translate'}...")

for i in tqdm(range(0, len(german_words), BATCH_SIZE), desc="Batches"):
    batch = german_words[i:i+BATCH_SIZE]
    
    try:
        if USE_DEEPL:
            # DeepL batch translation
            results = translator.translate_text(batch, target_lang="EN-US")
            for german, result in zip(batch, results):
                translations[german] = result.text.lower()
        else:
            # Google Translate (one by one)
            for german in batch:
                result = translator.translate(german, src='de', dest='en')
                translations[german] = result.text.lower()
                time.sleep(0.1)  # Rate limiting
    except Exception as e:
        print(f"\nError in batch {i//BATCH_SIZE}: {e}")
        # Continue with next batch
        continue

print(f"\nTranslated {len(translations):,} / {len(german_words):,} words")

Translating 1,917 words using DeepL...


Batches: 100%|██████████| 39/39 [00:24<00:00,  1.61it/s]


Translated 1,917 / 1,917 words





In [19]:
# Show sample translations
print("Sample Translations:")
print("="*50)
for i, (de, en) in enumerate(list(translations.items())[:20], 1):
    print(f"{i:2d}. {de:20s} → {en}")

Sample Translations:
 1. festduftöl           → solid fragrance oil
 2. taub                 → deaf
 3. himmelskuh           → heavenly cow
 4. nahen                → near
 5. mnw                  → mnw
 6. 25                   → 25
 7. entstanden           → originated
 8. schwellung           → swelling
 9. rptrf                → rptrf
10. hymne                → hymn
11. 34                   → 34
12. bien                 → well
13. amme                 → amme
14. salz                 → salt
15. nwmmwt               → nwmmwt
16. meketaton            → meketaton
17. speichel             → saliva
18. ville                → city
19. fähre                → ferry
20. palastes             → palaces


In [20]:
# Create English anchors
english_anchors = []

for anchor in german_anchors:
    german = anchor['german']
    if german in translations:
        english_anchors.append({
            'hieroglyphic': anchor['hieroglyphic'],
            'english': translations[german],
            'german': german,  # Keep for reference
            'confidence': anchor['confidence'],
            'frequency': anchor['frequency']
        })

print(f"Created {len(english_anchors):,} English anchors")
print(f"\nTop 10:")
for i, a in enumerate(english_anchors[:10], 1):
    print(f"{i:2d}. {a['hieroglyphic']:15s} → {a['english']:20s} (was: {a['german']})")

Created 8,541 English anchors

Top 10:
 1. n               → der                  (was: der)
 2. m               → der                  (was: der)
 3. =f              → er                   (was: er)
 4. =k              → du                   (was: du)
 5. =j              → ich                  (was: ich)
 6. ḥr              → der                  (was: der)
 7. r               → der                  (was: der)
 8. wsjr            → osiris               (was: osiris)
 9. =sn             → sie                  (was: sie)
10. pw              → ist                  (was: ist)


In [21]:
# Save English anchors
output_pkl = Path('../data/processed/english_anchors.pkl')
output_json = Path('../data/processed/english_anchors.json')

with open(output_pkl, 'wb') as f:
    pickle.dump(english_anchors, f)

with open(output_json, 'w', encoding='utf-8') as f:
    json.dump(english_anchors, f, ensure_ascii=False, indent=2)

print(f"\nSaved {len(english_anchors):,} English anchors to:")
print(f"  - {output_pkl}")
print(f"  - {output_json}")


Saved 8,541 English anchors to:
  - ../data/processed/english_anchors.pkl
  - ../data/processed/english_anchors.json


## Comparison with V3

Let's compare our V5 anchors with V3's successful ones.

In [22]:
# V3's top anchors (from the notebook we saw earlier)
v3_anchors = {
    '=f': 'the',
    '=k': 'you',
    'm': 'the',
    'n': 'the',
    'ḥr.w': 'horus',
    'wnḏs': 'unas',
    'pn': 'this',
    'zꜣ': 'son',
    'pw': 'the',
    'ḏd-mdw': 'words',
    'ppy': 'pepi',
    'wsꜣr': 'osiris',
    'nꜣ.t': 'neith',
    'wsr.w': 'osiris'
}

# Check overlap
v5_dict = {a['hieroglyphic']: a['english'] for a in english_anchors}

print("V3 vs V5 Anchor Comparison:")
print("="*70)
for h, v3_en in v3_anchors.items():
    v5_en = v5_dict.get(h, "NOT FOUND")
    match = "✓" if v5_en == v3_en else "✗"
    print(f"{match} {h:15s} | V3: {v3_en:15s} | V5: {v5_en}")

V3 vs V5 Anchor Comparison:
✗ =f              | V3: the             | V5: er
✗ =k              | V3: you             | V5: du
✗ m               | V3: the             | V5: der
✗ n               | V3: the             | V5: der
✗ ḥr.w            | V3: horus           | V5: NOT FOUND
✗ wnḏs            | V3: unas            | V5: NOT FOUND
✗ pn              | V3: this            | V5: der
✓ zꜣ              | V3: son             | V5: son
✗ pw              | V3: the             | V5: ist
✗ ḏd-mdw          | V3: words           | V5: zu
✓ ppy             | V3: pepi            | V5: pepi
✗ wsꜣr            | V3: osiris          | V5: NOT FOUND
✗ nꜣ.t            | V3: neith           | V5: NOT FOUND
✗ wsr.w           | V3: osiris          | V5: der


## Next Steps: Phase 3 - Embedding Training

Now that we have English anchors, we can:
1. Train FastText on hieroglyphic corpus (104k texts)
2. Load/train English embeddings (Wikipedia)
3. Align using Orthogonal Procrustes
4. Evaluate against V3's test set

Create `06_embedding_training.ipynb` next.