# Notebook 3: Alignment & Analysis

## Goal
This is the final step. We have two separate embedding spaces:
1.  **Hieroglyphic Space** ($H$)
2.  **English Space** ($E$)

We also have a set of **Anchors**: pairs of words $(h_i, e_i)$ that mean the same thing.

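We will use **Procrustes Analysis** to find an orthogonal matrix $R$ (a rotation, possibly combined with a reflection) that maps $H$ onto $E$. With word vectors stored as rows, as in the `X @ R.T` projection used in the code, this means:
$$ v_{h_i} R^T \approx v_{e_i} $$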

## Steps
1.  Load models and anchors.
2.  Construct alignment matrices from the anchors.
3.  Compute the optimal rotation matrix $R$ using SVD.
4.  Evaluate the alignment on a held-out test set.
5.  **Discover**: Translate unknown hieroglyphic words into English!

In [1]:
import os
import pickle
import numpy as np
from gensim.models import FastText, Word2Vec
from sklearn.model_selection import train_test_split

# Configuration
DATA_DIR = "data"
MODELS_DIR = "models"
ANCHOR_FILE = os.path.join(DATA_DIR, "anchors.pkl")
HIEROGLYPHIC_MODEL_FILE = os.path.join(MODELS_DIR, "hieroglyphic_fasttext.model")
ENGLISH_MODEL_FILE = os.path.join(MODELS_DIR, "english_word2vec.model")

print("Loading models...")
hier_model = FastText.load(HIEROGLYPHIC_MODEL_FILE)
eng_model = Word2Vec.load(ENGLISH_MODEL_FILE)

print("Loading anchors...")
with open(ANCHOR_FILE, 'rb') as f:
    anchors = pickle.load(f)
    
print(f"Loaded {len(anchors)} anchors.")

Loading models...
Loading anchors...
Loaded 1362 anchors.


## 1. Prepare Alignment Matrices

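We need to filter our anchors so that every pair has a vector on both sides. FastText can build a vector for any word from its subwords, so only the English (Word2Vec) word needs a vocabulary check.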

In [2]:
valid_anchors = []
X_list = []
Y_list = []

for anchor in anchors:
    h_word = anchor['hieroglyphic']
    e_word = anchor['english']
    
    # Check if words are in vocab
    # FastText always has a vector (subwords), but Word2Vec might not
    if e_word in eng_model.wv:
        valid_anchors.append((h_word, e_word))
        X_list.append(hier_model.wv[h_word])
        Y_list.append(eng_model.wv[e_word])

X = np.array(X_list)
Y = np.array(Y_list)

print(f"Filtered to {len(valid_anchors)} valid anchors.")
print(f"X shape: {X.shape}, Y shape: {Y.shape}")

Filtered to 1362 valid anchors.
X shape: (1362, 100), Y shape: (1362, 100)


## 2. Train/Test Split

We'll use 80% of anchors to learn the rotation and 20% to test if it generalizes.

In [3]:
X_train, X_test, Y_train, Y_test, anchors_train, anchors_test = train_test_split(
    X, Y, valid_anchors, test_size=0.2, random_state=42
)

print(f"Training on {len(X_train)} pairs, Testing on {len(X_test)} pairs.")

Training on 1089 pairs, Testing on 273 pairs.


## 3. Procrustes Alignment

We want to find the orthogonal matrix $R$ that minimizes $||X_{train}R^T - Y_{train}||_F$, where each row of $X_{train}$ and $Y_{train}$ is one word vector.
Solution: $R = UV^T$, where $U \Sigma V^T$ is the SVD of $Y_{train}^T X_{train}$.

In [4]:
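def learn_rotation(X, Y):
    """Return the orthogonal R minimizing ||X @ R.T - Y||_F (rows are word vectors)."""
    # SVD of the cross-covariance matrix Y.T @ X; the singular values are not needed
    U, _, Vt = np.linalg.svd(Y.T @ X)
    # The optimal orthogonal map is the product of the two singular-vector factors
    return U @ Vt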

print("Learning rotation matrix...")
R = learn_rotation(X_train, Y_train)
print("Done.")

Learning rotation matrix...
Done.
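
As a sanity check on the SVD recipe, the sketch below applies the same `learn_rotation` logic to synthetic data where the true map is known. The names `Q`, `X_toy`, and `Y_toy` are hypothetical and not part of the notebook; the toy dimension and sample count are arbitrary. If the recipe is right, the learned matrix should be orthogonal and should match the matrix used to generate the data.

```python
import numpy as np

def learn_rotation(X, Y):
    """Orthogonal R minimizing ||X @ R.T - Y||_F (same recipe as the cell above)."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

rng = np.random.default_rng(0)
d = 5

# Hypothetical ground truth: a random orthogonal matrix taken from a QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Toy "source" vectors (rows) and their exact images under Q.
X_toy = rng.normal(size=(50, d))
Y_toy = X_toy @ Q.T

R_toy = learn_rotation(X_toy, Y_toy)

is_orthogonal = np.allclose(R_toy @ R_toy.T, np.eye(d))  # R R^T should be the identity
recovers_Q = np.allclose(R_toy, Q)                       # learned map should equal Q
residual = np.linalg.norm(X_toy @ R_toy.T - Y_toy)       # Frobenius norm of the fit error

print(is_orthogonal, recovers_Q, residual)
```

On real anchors the fit is never exact, so the residual will not be near zero, but the orthogonality check should still pass for the `R` learned above.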


## 4. Evaluation

For each test anchor $(h, e)$, we map $h$ to English space: $v_{pred} = v_h R^T$.
Then we look at the top $k$ nearest neighbors (by cosine similarity) in the English space. If $e$ is among them, it counts as a hit.

In [5]:
def evaluate(X, anchors, R, top_k=10):
    hits = 0
    total = len(anchors)
    
    # Project all X to English space
    X_projected = X @ R.T  # Note: X @ R.T is equivalent to (R @ X.T).T
    
    for i, (h_word, true_e_word) in enumerate(anchors):
        pred_vec = X_projected[i]
        
        # Find nearest neighbors in English model
        # We use the model's built-in most_similar which is efficient
        similar = eng_model.wv.most_similar([pred_vec], topn=top_k)
        candidates = [w for w, score in similar]
        
        if true_e_word in candidates:
            hits += 1
            
    accuracy = hits / total
    return accuracy

print("Evaluating on Test Set...")
acc_1 = evaluate(X_test, anchors_test, R, top_k=1)
acc_5 = evaluate(X_test, anchors_test, R, top_k=5)
acc_10 = evaluate(X_test, anchors_test, R, top_k=10)

print(f"Top-1 Accuracy: {acc_1:.2%}")
print(f"Top-5 Accuracy: {acc_5:.2%}")
print(f"Top-10 Accuracy: {acc_10:.2%}")

Evaluating on Test Set...
Top-1 Accuracy: 13.19%
Top-5 Accuracy: 21.25%
Top-10 Accuracy: 23.08%
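
The loop in `evaluate` calls `most_similar` once per test word. To illustrate what that ranking does, here is a NumPy-only sketch of top-$k$ accuracy under cosine similarity, run on toy data. The function `topk_accuracy` and the toy arrays are hypothetical and not part of the notebook, and the sketch assumes every true translation is a row of the target matrix.

```python
import numpy as np

def topk_accuracy(pred, targets, true_idx, k):
    """Fraction of rows in `pred` whose true target row is among the
    k most cosine-similar rows of `targets`."""
    # Normalize rows so that a dot product equals cosine similarity.
    pred_n = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    targ_n = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    sims = pred_n @ targ_n.T                    # shape: (n_queries, n_targets)
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of the k best targets per query
    hits = (topk == np.asarray(true_idx)[:, None]).any(axis=1)
    return hits.mean()

# Toy "English" vocabulary of three 2-d vectors.
targets = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
# Toy projected vectors and the index of each one's correct target.
pred = np.array([[0.9, 0.1], [0.6, 0.8], [0.5, 0.5]])
true_idx = [0, 0, 2]

acc_top1 = topk_accuracy(pred, targets, true_idx, k=1)
acc_top2 = topk_accuracy(pred, targets, true_idx, k=2)
acc_top3 = topk_accuracy(pred, targets, true_idx, k=3)
print(acc_top1, acc_top2, acc_top3)
```

In this toy case one of three queries is a hit at $k=1$, two at $k=2$, and all three at $k=3$, which shows why accuracy can only stay flat or rise as $k$ grows.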


## 5. Discovery & Translation

Now for the fun part. Let's define a translation function and try it on some words.

In [6]:
def translate(h_word, top_k=5):
    # Get vector (FastText handles OOV)
    vec = hier_model.wv[h_word]
    # Project
    proj_vec = vec @ R.T
    # Find neighbors
    similar = eng_model.wv.most_similar([proj_vec], topn=top_k)
    
    print(f"\nTranslation for '{h_word}':")
    for w, score in similar:
        print(f"  -> {w} ({score:.3f})")

# Famous words
translate("nfr")       # Good/Beautiful
translate("pr-aa")     # Pharaoh (Great House)
translate("ankh")      # Life
translate("maat")      # Truth/Order
translate("ra")        # Sun God
translate("suten")     # King (nswt)
translate("netjer")    # God (nTr)


Translation for 'nfr':
  -> tjetu (0.487)
  -> the (0.484)
  -> of (0.473)
  -> behdeti (0.455)
  -> colorful (0.453)

Translation for 'pr-aa':
  -> house (0.597)
  -> domestic (0.574)
  -> elevator (0.558)
  -> hutsahaneb (0.535)
  -> annual (0.529)

Translation for 'ankh':
  -> norm (0.307)
  -> gloss (0.275)
  -> self (0.267)
  -> wiped (0.244)
  -> avoid (0.243)

Translation for 'maat':
  -> anchkai (0.355)
  -> powers (0.306)
  -> rises (0.301)
  -> trembles (0.283)
  -> removed (0.277)

Translation for 'ra':
  -> bench (0.248)
  -> anchkai (0.230)
  -> administrations (0.207)
  -> from (0.203)
  -> hall (0.197)

Translation for 'suten':
  -> incorruptible (0.316)
  -> tetis (0.298)
  -> unas (0.281)
  -> provides (0.273)
  -> dwellers (0.269)

Translation for 'netjer':
  -> wiped (0.310)
  -> lit (0.297)
  -> make (0.272)
  -> fall (0.254)
  -> mouth (0.252)
