# Phase 4: Procrustes Alignment

## Goal
Align the hieroglyphic and English embedding spaces using Orthogonal Procrustes.
This is the same technique behind V3's result (22% top-1 accuracy).

## Strategy
1. Load hieroglyphic embeddings (FastText trained on 104k texts)
2. Load English embeddings (GloVe 300d)
3. Extract anchor vectors from both spaces
4. Compute Procrustes transformation matrix
5. Align hieroglyphic space to English space
6. Evaluate and discover new meanings!

## Mathematical Background
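Orthogonal Procrustes finds the orthogonal matrix **R** (a rotation, possibly combined with a reflection) that minimizes the squared Frobenius norm:
```
||X·R - Y||²   subject to   Rᵀ·R = I
```
Where:
- X = hieroglyphic anchor vectors (one row per anchor pair)
- Y = English anchor vectors (rows in the same order as X)
- R = orthogonal matrix (computed via SVD of Xᵀ·Y)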

Once we have R, we can map any hieroglyphic vector v into English space as v·R, including words that are not in the anchor set!
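
To make the SVD step concrete, here is a minimal sketch on synthetic data. The names `X_toy`, `Y_toy`, and `Q_true` are hypothetical stand-ins, not the real anchor matrices. It builds the closed-form solution R = U·Vᵀ from the SVD of Xᵀ·Y and compares it with what `scipy.linalg.orthogonal_procrustes` returns.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

# Synthetic stand-ins for the anchor matrices (hypothetical data, not the real embeddings)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 4))
Q_true, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # a random orthogonal matrix
Y_toy = X_toy @ Q_true                             # Y is an exactly rotated copy of X

# Closed-form solution: SVD of the cross-covariance X^T Y, then R = U V^T
U, S, Vt = np.linalg.svd(X_toy.T @ Y_toy)
R_svd = U @ Vt

# Library solution for comparison
R_scipy, _ = orthogonal_procrustes(X_toy, Y_toy)

print("matches SciPy:       ", np.allclose(R_svd, R_scipy))
print("recovers true matrix:", np.allclose(R_svd, Q_true))
```

Because the toy `Y_toy` is a noise-free rotation of `X_toy`, the recovered matrix should coincide with `Q_true`; with real embeddings the fit is only approximate.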

In [1]:
import numpy as np
import pickle
import json
from pathlib import Path
from gensim.models import KeyedVectors
from scipy.linalg import orthogonal_procrustes
from collections import Counter
import pandas as pd

print("✓ Libraries loaded")

✓ Libraries loaded


## 1. Load Embeddings

In [2]:
# Load hieroglyphic embeddings
print("Loading hieroglyphic embeddings...")
hier_path = Path('../data/processed/hieroglyphic_vectors.kv')
hier_wv = KeyedVectors.load(str(hier_path), mmap='r')
print(f"✓ Loaded {len(hier_wv):,} hieroglyphic vectors")

# Load GloVe English embeddings
print("\nLoading GloVe embeddings...")
glove_path = Path('../data/processed/glove.6B.300d.txt')

# GloVe format: word vec1 vec2 ... vec300
eng_vectors = {}
with open(glove_path, 'r', encoding='utf-8') as f:
    for line in f:
        parts = line.strip().split()
        word = parts[0]
        vector = np.array([float(x) for x in parts[1:]])
        eng_vectors[word] = vector

print(f"✓ Loaded {len(eng_vectors):,} English vectors")
print(f"\nVector dimensions: {hier_wv.vector_size}d (hier) x {len(next(iter(eng_vectors.values())))}d (eng)")

Loading hieroglyphic embeddings...
✓ Loaded 11,974 hieroglyphic vectors

Loading GloVe embeddings...
✓ Loaded 400,000 English vectors

Vector dimensions: 300d (hier) x 300d (eng)


## 2. Load Anchors and Extract Vectors

In [3]:
# Load English anchors
anchors_path = Path('../data/processed/english_anchors.pkl')
with open(anchors_path, 'rb') as f:
    anchors = pickle.load(f)

print(f"Loaded {len(anchors):,} anchor pairs")
print(f"\nSample:")
for i in range(5):
    a = anchors[i]
    print(f"  {a['hieroglyphic']:15s} → {a['english']:15s}")

Loaded 8,541 anchor pairs

Sample:
  n               → der            
  m               → der            
  =f              → er             
  =k              → du             
  =j              → ich            


In [4]:
# Extract anchor vectors
X_list = []  # Hieroglyphic vectors
Y_list = []  # English vectors
valid_anchors = []

for anchor in anchors:
    h_word = anchor['hieroglyphic']
    e_word = anchor['english']
    
    # Check if both words exist in embeddings
    if h_word in hier_wv and e_word in eng_vectors:
        X_list.append(hier_wv[h_word])
        Y_list.append(eng_vectors[e_word])
        valid_anchors.append(anchor)

X = np.array(X_list)
Y = np.array(Y_list)

print(f"Valid anchor pairs: {len(valid_anchors):,} / {len(anchors):,} ({len(valid_anchors)/len(anchors)*100:.1f}%)")
print(f"\nAnchor matrix shapes:")
print(f"  X (hieroglyphic): {X.shape}")
print(f"  Y (English): {Y.shape}")

Valid anchor pairs: 7,471 / 8,541 (87.5%)

Anchor matrix shapes:
  X (hieroglyphic): (7471, 300)
  Y (English): (7471, 300)


## 3. Compute Procrustes Transformation

Find the optimal rotation matrix R using SVD.

In [6]:
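# Compute the optimal orthogonal matrix R.
# Note: the second return value is the sum of the singular values of X.T @ Y,
# not a scaling applied to the vectors; R itself only rotates/reflects.
R, scale = orthogonal_procrustes(X, Y)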

print(f"✓ Transformation matrix computed")
print(f"  Shape: {R.shape}")
print(f"  Scale factor: {scale:.4f}")

# Verify it's orthogonal (R^T · R ≈ I)
orthogonality_error = np.linalg.norm(R.T @ R - np.eye(R.shape[0]))
print(f"  Orthogonality error: {orthogonality_error:.6f}")

✓ Transformation matrix computed
  Shape: (300, 300)
  Scale factor: 80381.3102
  Orthogonality error: 0.000000
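
The large "scale factor" printed above is easier to read once it is clear what SciPy returns: the second value of `orthogonal_procrustes` is the sum of the singular values of Xᵀ·Y, so it grows with the number and magnitude of the anchor vectors. The sketch below checks this on synthetic data and adds two fit diagnostics that are easier to interpret. The helper names `mean_anchor_cosine` and `relative_residual` are hypothetical, not part of any library.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

# Synthetic anchors: Y is a rotated copy of X plus a little noise (hypothetical data)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 8))
Q_true, _ = np.linalg.qr(rng.normal(size=(8, 8)))
Y_toy = X_toy @ Q_true + 0.1 * rng.normal(size=(200, 8))

R_toy, scale_toy = orthogonal_procrustes(X_toy, Y_toy)

# The second return value equals the sum of singular values of X^T Y
singular_values = np.linalg.svd(X_toy.T @ Y_toy, compute_uv=False)
print("scale == sum of singular values:", np.isclose(scale_toy, singular_values.sum()))

def mean_anchor_cosine(X, Y, R):
    """Average cosine similarity between each mapped anchor X[i] @ R and its target Y[i]."""
    mapped = X @ R
    num = np.sum(mapped * Y, axis=1)
    den = np.linalg.norm(mapped, axis=1) * np.linalg.norm(Y, axis=1)
    return float(np.mean(num / den))

def relative_residual(X, Y, R):
    """Frobenius norm of the misfit, relative to the norm of Y."""
    return float(np.linalg.norm(X @ R - Y) / np.linalg.norm(Y))

print("mean anchor cosine:", mean_anchor_cosine(X_toy, Y_toy, R_toy))
print("relative residual: ", relative_residual(X_toy, Y_toy, R_toy))
```

Applied to the notebook's own `X`, `Y`, and `R`, the same two helpers would give a bounded summary of how well the anchors line up, which the raw scale value does not.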


## 4. Align Hieroglyphic Space

Transform all hieroglyphic vectors to English space.

In [7]:
# Create aligned hieroglyphic vectors
print("Aligning hieroglyphic vectors...")

aligned_hier = {}
for word in hier_wv.index_to_key:
    # Transform: v_aligned = v_hier · R
    aligned_hier[word] = hier_wv[word] @ R

print(f"✓ Aligned {len(aligned_hier):,} hieroglyphic vectors to English space")

Aligning hieroglyphic vectors...
✓ Aligned 11,974 hieroglyphic vectors to English space
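
The per-word loop above can also be written as a single matrix product, since multiplying every row by R is the same as multiplying the whole matrix by R. The sketch below shows the equivalence on toy data; `toy_words` and `toy_vectors` are hypothetical stand-ins for `hier_wv.index_to_key` and `hier_wv.vectors` (in gensim 4 the rows of `vectors` follow the order of `index_to_key`).

```python
import numpy as np

rng = np.random.default_rng(2)
toy_words = ["w0", "w1", "w2"]                     # hypothetical vocabulary
toy_vectors = rng.normal(size=(3, 5))              # stands in for hier_wv.vectors
R_toy, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # any orthogonal matrix

# One matrix product aligns every row at once
aligned_matrix = toy_vectors @ R_toy
aligned_toy = dict(zip(toy_words, aligned_matrix))

# Same result as the word-by-word loop
aligned_loop = {w: toy_vectors[i] @ R_toy for i, w in enumerate(toy_words)}
print(all(np.allclose(aligned_toy[w], aligned_loop[w]) for w in toy_words))

# An orthogonal map preserves vector lengths
print(np.allclose(np.linalg.norm(aligned_matrix, axis=1),
                  np.linalg.norm(toy_vectors, axis=1)))
```

Keeping the aligned vectors as one matrix also makes batched similarity search straightforward later on.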


## 5. Translation Function

Find the nearest English word for any hieroglyphic word.

In [8]:
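def translate_hieroglyphic(h_word, topn=5):
    """
    Translate a hieroglyphic word to English by finding nearest neighbors.

    Returns a list of (english_word, cosine_similarity) tuples, best first,
    or None if h_word is not in the aligned vocabulary.
    """
    if h_word not in aligned_hier:
        return None
    
    h_vec = aligned_hier[h_word]
    
    # Compute cosine similarity with all English words.
    # This is a pure-Python loop over the full English vocabulary, so each call is slow.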
    similarities = {}
    for e_word, e_vec in eng_vectors.items():
        # Cosine similarity
        sim = np.dot(h_vec, e_vec) / (np.linalg.norm(h_vec) * np.linalg.norm(e_vec))
        similarities[e_word] = sim
    
    # Get top N
    top = sorted(similarities.items(), key=lambda x: x[1], reverse=True)[:topn]
    return top

print("✓ Translation function ready")

✓ Translation function ready
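
The function above recomputes norms and dot products for every English word on every call. A faster variant, sketched below under stated assumptions, stacks the English vectors into one unit-normalized matrix once and then answers each query with a single matrix-vector product. The helpers `build_index` and `nearest_words` and the toy vocabulary are hypothetical, and the sketch assumes the query vector is non-zero.

```python
import numpy as np

def build_index(vectors_by_word):
    """Stack a {word: vector} dict into a word list and a row-normalized matrix."""
    words = list(vectors_by_word)
    matrix = np.vstack([vectors_by_word[w] for w in words]).astype(np.float32)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                      # avoid dividing by zero for empty vectors
    return words, matrix / norms

def nearest_words(query_vec, words, unit_matrix, topn=5):
    """Return the topn (word, cosine similarity) pairs for a non-zero query vector."""
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / np.linalg.norm(q)
    sims = unit_matrix @ q                       # cosine similarity with every word at once
    topn = min(topn, len(words))
    top_idx = np.argpartition(-sims, topn - 1)[:topn]   # unordered top-n
    top_idx = top_idx[np.argsort(-sims[top_idx])]       # order them best first
    return [(words[i], float(sims[i])) for i in top_idx]

# Toy English space (hypothetical vectors)
toy_eng = {
    "water": np.array([1.0, 0.0, 0.0]),
    "river": np.array([0.9, 0.1, 0.0]),
    "god":   np.array([0.0, 1.0, 0.0]),
    "son":   np.array([0.0, 0.0, 1.0]),
}
words, unit_matrix = build_index(toy_eng)
result = nearest_words(np.array([1.0, 0.0, 0.0]), words, unit_matrix, topn=2)
print(result)
```

With the notebook's data, `build_index(eng_vectors)` would run once, and each lookup would pass `aligned_hier[h_word]` as the query.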


## 6. Test Translations

Let's test on some key words from V3's discoveries!

In [9]:
# Test words from V3 and V5
test_cases = [
    ('wsjr', 'osiris'),      # Osiris (V5 spelling)
    ('ḥr,w', 'horus'),       # Horus (V5 spelling)
    ('ppy', 'pepi'),         # Pepi
    ('zꜣ', 'son'),           # Son
    ('nṯr', 'god'),          # God
    ('mw', 'water'),         # Water (V3's "perfect hit")
    ('ꜥnḫ', 'life'),         # Life/living
    ('rꜥw', 're'),           # Re (sun god)
]

print("Translation Tests:")
print("="*70)

for h_word, expected in test_cases:
    results = translate_hieroglyphic(h_word, topn=5)
    
    if results:
        top_word, top_score = results[0]
        match = "✓" if top_word.lower() == expected.lower() else "✗"
        
        print(f"\n{match} {h_word:15s} (expected: {expected})")
        print(f"  Top predictions:")
        for word, score in results:
            print(f"    {word:20s} (score: {score:.3f})")
    else:
        print(f"\n✗ {h_word:15s} - NOT IN VOCABULARY")

Translation Tests:
======================================================================
✓ wsjr            (expected: osiris)
  Top predictions:
    osiris               (score: 0.615)
    der                  (score: 0.404)
    anubis               (score: 0.387)
    isis                 (score: 0.324)
    und                  (score: 0.321)

✓ ḥr,w            (expected: horus)
  Top predictions:
    horus                (score: 0.621)
    der                  (score: 0.402)
    zum                  (score: 0.346)
    anubis               (score: 0.343)
    deutschen            (score: 0.339)

✓ ppy             (expected: pepi)
  Top predictions:
    pepi                 (score: 0.671)
    ist                  (score: 0.390)
    gott                 (score: 0.387)
    der                  (score: 0.353)
    auf                  (score: 0.351)

✓ zꜣ              (expected: son)
  Top predictions:
    son                  (score: 0.474)
    father               (score: 0.444)
    der                  (score: 0.419)
    eldest               (score: 0.407)

## 7. Evaluate on Anchors

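Calculate top-1 and top-5 accuracy on our anchor set (like V3's 22%). These are the same anchors that were used to fit R, so the result measures how well the fit reproduces its training pairs rather than how well it generalizes to unseen words.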

In [10]:
# Evaluate accuracy
correct = 0
total = 0
top5_correct = 0

for anchor in valid_anchors:
    h_word = anchor['hieroglyphic']
    e_word = anchor['english']
    
    results = translate_hieroglyphic(h_word, topn=5)
    if results:
        total += 1
        top_word = results[0][0]
        
        if top_word.lower() == e_word.lower():
            correct += 1
        
        # Check if in top 5
        if e_word.lower() in [w.lower() for w, _ in results]:
            top5_correct += 1

accuracy = correct / total * 100 if total > 0 else 0
top5_accuracy = top5_correct / total * 100 if total > 0 else 0

print("Evaluation Results:")
print("="*70)
print(f"Total anchors evaluated: {total:,}")
print(f"\nTop-1 Accuracy: {correct:,} / {total:,} = {accuracy:.2f}%")
print(f"Top-5 Accuracy: {top5_correct:,} / {total:,} = {top5_accuracy:.2f}%")
print(f"\nV3 Baseline: 22% (top-1)")
delta = accuracy - 22
print(f"V5 vs V3: {delta:+.2f} percentage points" + ("" if delta > 0 else " (needs improvement)"))

KeyboardInterrupt: 
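
The evaluation cell was interrupted before finishing: it calls `translate_hieroglyphic` once per anchor, and each call scans the full English vocabulary in Python. One possible alternative, sketched here on synthetic data, scores all test anchors in one matrix product and fits R on a training split only, so the reported number reflects held-out pairs. The function `heldout_top1` and the toy arrays are hypothetical; `target_ids[i]` is assumed to be the row of `target_matrix` holding the correct translation of anchor `i`.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def heldout_top1(X, target_ids, target_matrix, test_fraction=0.2, seed=0):
    """Fit R on a training split of the anchors and return top-1 accuracy on the rest."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = max(1, int(len(X) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]

    # Fit only on training anchors
    R, _ = orthogonal_procrustes(X[train_idx], target_matrix[target_ids[train_idx]])

    # Score every test anchor against every target word in one matrix product
    mapped = X[test_idx] @ R
    mapped /= np.linalg.norm(mapped, axis=1, keepdims=True)
    targets = target_matrix / np.linalg.norm(target_matrix, axis=1, keepdims=True)
    predicted = np.argmax(mapped @ targets.T, axis=1)
    return float(np.mean(predicted == target_ids[test_idx]))

# Synthetic check: the source space is a rotated, slightly noisy copy of the target space
rng = np.random.default_rng(3)
toy_targets = rng.normal(size=(30, 6))
Q_true, _ = np.linalg.qr(rng.normal(size=(6, 6)))
toy_ids = np.arange(30)
toy_X = toy_targets[toy_ids] @ Q_true.T + 0.01 * rng.normal(size=(30, 6))

acc = heldout_top1(toy_X, toy_ids, toy_targets)
print(f"held-out top-1 accuracy on toy data: {acc:.2f}")
```

For a 400,000-word target vocabulary the similarity matrix would need to be computed in batches of test anchors to keep memory bounded; the sketch omits that for brevity.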

## 8. Discover New Meanings

The exciting part! Let's explore words that might reveal new insights.

In [None]:
# Interesting words to explore
discovery_words = [
    'inpw',      # Anubis (if in V5 dataset)
    'wsjr',      # Osiris
    'ḥr,w',      # Horus
    'nṯr',       # God
    'ḥm-nṯr',    # Priest
    'ḥqt',       # Beer
    'rꜥw',       # Re (sun god)
]

print("New Discoveries:")
print("="*70)

for word in discovery_words:
    results = translate_hieroglyphic(word, topn=10)
    if results:
        print(f"\n{word}:")
        for i, (e_word, score) in enumerate(results, 1):
            print(f"  {i:2d}. {e_word:20s} (score: {score:.3f})")
    else:
        print(f"\n{word}: NOT FOUND")

## 9. Save Results

In [None]:
# Save transformation matrix
np.save('../data/processed/procrustes_matrix.npy', R)
print("✓ Saved transformation matrix")

# Save evaluation results
results = {
    'total_anchors': total,
    'top1_correct': correct,
    'top1_accuracy': accuracy,
    'top5_correct': top5_correct,
    'top5_accuracy': top5_accuracy,
    'v3_baseline': 22.0,
    'improvement': accuracy - 22.0
}

with open('../data/processed/alignment_results.json', 'w') as f:
    json.dump(results, f, indent=2)

print("✓ Saved evaluation results")
print(f"\nFinal V5 Accuracy: {accuracy:.2f}%")
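
To reuse the saved matrix in a later session, it can be read back with `np.load` and checked for orthogonality before use. The sketch below is a round trip through a temporary directory with a stand-in matrix `R_toy` (hypothetical); in the notebook the path would be the `procrustes_matrix.npy` file written above.

```python
import tempfile
from pathlib import Path

import numpy as np

# Stand-in for the fitted matrix (hypothetical)
rng = np.random.default_rng(4)
R_toy, _ = np.linalg.qr(rng.normal(size=(5, 5)))

with tempfile.TemporaryDirectory() as tmp:
    matrix_path = Path(tmp) / "procrustes_matrix.npy"
    np.save(matrix_path, R_toy)
    R_loaded = np.load(matrix_path)

# Sanity checks before applying the matrix to new vectors
same_values = np.array_equal(R_loaded, R_toy)
is_orthogonal = np.allclose(R_loaded.T @ R_loaded, np.eye(R_loaded.shape[0]))
print(same_values, is_orthogonal)

# Map a new vector into the target space
new_vec = rng.normal(size=5)
mapped_vec = new_vec @ R_loaded
```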