# Notebook: CSLS Alignment (Attempt 4)

## Goal
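In Attempt 3 (V3), we used **Procrustes Analysis** to align the two embedding spaces and **Nearest Neighbors (NN)** to translate.
However, NN suffers from the **Hubness Problem**: in a high-dimensional space, a few vectors ("hubs") end up among the nearest neighbors of a disproportionate number of other points simply because they sit in a dense region. Such words (e.g. "the", "is") keep coming back as translations for unrelated queries.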

**CSLS (Cross-Domain Similarity Local Scaling)** fixes this. It penalizes words that are "hubs".
Instead of just asking "Are you close to me?", it asks "Are you close to me AND not close to everyone else?"
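
Hubness can be measured directly by counting, for each target vector, how many source vectors pick it as their top-1 neighbor. The sketch below does this on random toy data; `hub_counts`, `toy_source`, and `toy_target` are illustrative names, not part of this notebook.

```python
import numpy as np

def hub_counts(source_vecs, target_vecs):
    """Count how often each target row is the top-1 cosine neighbor of a source row."""
    src = source_vecs / np.linalg.norm(source_vecs, axis=1, keepdims=True)
    tgt = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    nearest = np.argmax(src @ tgt.T, axis=1)          # best target index per source
    return np.bincount(nearest, minlength=len(target_vecs))

# Toy data only (assumption): random Gaussian vectors, not real embeddings.
rng = np.random.default_rng(0)
toy_source = rng.normal(size=(500, 50))
toy_target = rng.normal(size=(200, 50))

counts = hub_counts(toy_source, toy_target)
print("largest hub is chosen by", counts.max(), "sources")
print("targets never chosen:", int((counts == 0).sum()))
```

A heavily skewed `counts` (a few large values, many zeros) is the signature of hubness that CSLS is meant to reduce.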

## Steps
1.  Load the pre-trained models from V3.
2.  Calculate the Procrustes Rotation Matrix $R$ (same as V3).
3.  Implement the **CSLS** metric.
4.  Compare CSLS vs. Standard NN accuracy.

In [1]:
import os
import pickle
import numpy as np
from gensim.models import FastText, Word2Vec
from sklearn.model_selection import train_test_split

# Configuration
DATA_DIR = "data"
MODELS_DIR = "models"
ANCHOR_FILE = os.path.join(DATA_DIR, "anchors.pkl")
HIEROGLYPHIC_MODEL_FILE = os.path.join(MODELS_DIR, "hieroglyphic_fasttext.model")
ENGLISH_MODEL_FILE = os.path.join(MODELS_DIR, "english_word2vec.model")

print("Loading models...")
hier_model = FastText.load(HIEROGLYPHIC_MODEL_FILE)
eng_model = Word2Vec.load(ENGLISH_MODEL_FILE)

print("Loading anchors...")
with open(ANCHOR_FILE, 'rb') as f:
    anchors = pickle.load(f)
    
print(f"Loaded {len(anchors)} anchors.")

Loading models...


Loading anchors...
Loaded 1362 anchors.


## 1. Re-Calculate Rotation (Procrustes)
We need to re-do the alignment step to get our rotation matrix $R$.

In [2]:
# Prepare matrices
valid_anchors = []
X_list = []
Y_list = []

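for anchor in anchors:
    h_word = anchor['hieroglyphic']
    e_word = anchor['english']
    # Only the English (Word2Vec) side is checked against the vocabulary;
    # FastText composes vectors for unseen words from character n-grams.
    if e_word in eng_model.wv:
        valid_anchors.append((h_word, e_word))
        X_list.append(hier_model.wv[h_word])
        Y_list.append(eng_model.wv[e_word])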

X = np.array(X_list)
Y = np.array(Y_list)

# Split
X_train, X_test, Y_train, Y_test, anchors_train, anchors_test = train_test_split(
    X, Y, valid_anchors, test_size=0.2, random_state=42
)

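# SVD for Rotation (orthogonal Procrustes):
# with Y_train.T @ X_train = U S Vt, R = U @ Vt is the orthogonal matrix
# that minimizes ||X_train @ R.T - Y_train||_F
U, S, Vt = np.linalg.svd(Y_train.T @ X_train)
R = U @ Vt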

print("Rotation matrix R calculated.")

Rotation matrix R calculated.
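
As a sanity check on the SVD recipe, the sketch below applies the same formula to synthetic data where the true mapping is known. `procrustes_rotation`, `R_true`, `X_toy`, and `Y_toy` are hypothetical names for this illustration. Note that the solution is an orthogonal matrix, so it may include a reflection and is not strictly a rotation.

```python
import numpy as np

def procrustes_rotation(X, Y):
    """Orthogonal R minimizing ||X @ R.T - Y||_F, with paired vectors as rows."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

# Synthetic setup (assumption): Y is an exact orthogonal transform of X.
rng = np.random.default_rng(0)
dim = 8
R_true, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthogonal matrix
X_toy = rng.normal(size=(100, dim))
Y_toy = X_toy @ R_true.T

R_est = procrustes_rotation(X_toy, Y_toy)
print("R is orthogonal:", np.allclose(R_est @ R_est.T, np.eye(dim)))
print("R recovers the true map:", np.allclose(R_est, R_true, atol=1e-6))
```

Running the same orthogonality check on the notebook's `R` is a cheap way to confirm the alignment step did what it should.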


## 2. Implement CSLS

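CSLS is defined as:
$$ \mathrm{CSLS}(x, y) = 2 \cos(x, y) - r_T(x) - r_S(y) $$
where $r_T(x)$ is the average cosine similarity of the projected source vector $x$ to its $k$ nearest neighbors in the target space, and $r_S(y)$ is the average cosine similarity of the target vector $y$ to its $k$ nearest neighbors in the (projected) source space.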

Basically: "Similarity minus Popularity".
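
A tiny hand-made example may make the formula concrete. The similarity matrix below is invented for illustration (3 source rows, 3 target columns, $k = 2$); target 0 is a hub that is the plain cosine winner for every source.

```python
import numpy as np

# Hypothetical cosine similarities: rows are sources, columns are targets.
sim = np.array([
    [0.80, 0.75, 0.10],
    [0.80, 0.20, 0.70],
    [0.85, 0.10, 0.15],
])
k = 2

r_T = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # per source: mean of its top-k targets
r_S = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # per target: mean of its top-k sources
csls = 2 * sim - r_T[:, None] - r_S[None, :]

nn_choice = sim.argmax(axis=1)
csls_choice = csls.argmax(axis=1)
print(nn_choice)    # [0 0 0]
print(csls_choice)  # [1 2 0]
```

Target 0 has the largest $r_S$ (0.825), so it is penalized most: sources 0 and 1 switch to their second-best cosine match, while source 2, whose similarity to target 0 clearly dominates its other options, keeps it.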

In [3]:
def get_csls_scores(source_vecs, target_vecs, k=10):
    """
    Compute CSLS scores between source and target vectors.
    source_vecs: (N, dim) - Projected hieroglyphic vectors
    target_vecs: (M, dim) - All English vectors
    """
    # Normalize vectors for cosine similarity
    source_norm = source_vecs / np.linalg.norm(source_vecs, axis=1, keepdims=True)
    target_norm = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    
    # Compute Cosine Similarity Matrix (N x M)
    # This can be large, so be careful with memory in production
    sim_matrix = np.dot(source_norm, target_norm.T)
    
    # Calculate r_T (average sim to k nearest neighbors in target for each source)
    # For each row (source word), find top k in target
    r_T = np.mean(np.sort(sim_matrix, axis=1)[:, -k:], axis=1)
    
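    # Calculate r_S (average sim to k nearest neighbors in source for each target)
    # For each col (target word), find top k in source.
    # Caveat: "source" here means only the rows passed in. With fewer than k rows
    # the mean runs over all of them, and with a single row r_S equals the cosine
    # itself, so CSLS reduces to cos - r_T and ranks exactly like plain NN.
    r_S = np.mean(np.sort(sim_matrix, axis=0)[-k:, :], axis=0)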
    
    # CSLS = 2*cos - r_T - r_S
    # Broadcast r_T (N, 1) and r_S (1, M)
    csls_scores = 2 * sim_matrix - r_T[:, np.newaxis] - r_S[np.newaxis, :]
    
    return csls_scores

# Prepare Target Space (All English Words)
# Cap at the first 20k entries of the vocabulary to keep it fast;
# a smaller vocabulary is simply used in full (see the printed count)
target_words = eng_model.wv.index_to_key[:20000]
target_vecs = np.array([eng_model.wv[w] for w in target_words])

print(f"Target space prepared: {len(target_words)} words.")

Target space prepared: 4177 words.
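
Because $r_S$ is meant to describe each target's neighborhood in the source space, it may be worth estimating it from a larger pool of source vectors than the batch being scored. The sketch below is one possible variant, not the notebook's implementation: `csls_with_pool`, `source_pool`, `unit_rows`, and `top_k_mean_rows` are hypothetical names, and it uses `np.partition` instead of a full sort to get the top-$k$ values.

```python
import numpy as np

def unit_rows(m):
    """L2-normalize each row."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def top_k_mean_rows(sim, k):
    """Mean of the k largest values in each row (k is capped at the row length)."""
    k = min(k, sim.shape[1])
    # After partitioning at position -k, the last k entries are the k largest.
    return np.partition(sim, -k, axis=1)[:, -k:].mean(axis=1)

def csls_with_pool(query_vecs, target_vecs, source_pool, k=10):
    """CSLS where r_S is estimated from `source_pool` instead of the query batch."""
    q, t, p = unit_rows(query_vecs), unit_rows(target_vecs), unit_rows(source_pool)
    sim = q @ t.T                          # (N, M) cosine similarities
    r_T = top_k_mean_rows(sim, k)          # (N,) query -> target neighborhood
    r_S = top_k_mean_rows(t @ p.T, k)      # (M,) target -> source-pool neighborhood
    return 2 * sim - r_T[:, None] - r_S[None, :]

# Toy data only (assumption): random vectors stand in for real embeddings.
rng = np.random.default_rng(0)
toy_queries = rng.normal(size=(3, 16))
toy_targets = rng.normal(size=(40, 16))
toy_pool = rng.normal(size=(200, 16))
toy_scores = csls_with_pool(toy_queries, toy_targets, toy_pool, k=10)
print(toy_scores.shape)  # (3, 40)
```

In this notebook, a natural pool would be the projected hieroglyphic vocabulary, which is not built here. This keeps the hub penalty meaningful even when scoring a single word.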


## 3. Evaluate CSLS vs NN

Let's compare the accuracy on the test set.

In [4]:
def evaluate_csls(X_test, anchors_test, R, top_k=10):
    # Project Test Vectors
    X_projected = X_test @ R.T
    
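    # Compute CSLS Matrix (Test_Size x Target_Vocab_Size)
    # Note: r_S for each English word is estimated from the test vectors only,
    # not from the full hieroglyphic vocabulary
    scores = get_csls_scores(X_projected, target_vecs)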
    
    hits = 0
    total = len(anchors_test)
    
    for i, (h_word, true_e_word) in enumerate(anchors_test):
        # Get top k indices for this word
        top_indices = np.argsort(scores[i])[-top_k:][::-1]
        candidates = [target_words[idx] for idx in top_indices]
        
        if true_e_word in candidates:
            hits += 1
            
    return hits / total

print("Evaluating CSLS...")
csls_acc_1 = evaluate_csls(X_test, anchors_test, R, top_k=1)
csls_acc_10 = evaluate_csls(X_test, anchors_test, R, top_k=10)

print(f"CSLS Top-1 Accuracy: {csls_acc_1:.2%}")
print(f"CSLS Top-10 Accuracy: {csls_acc_10:.2%}")

# V3 (NN) figure is quoted from the previous notebook's run; it is not recomputed here
print("\n(Recall V3 NN Top-10 was ~22%)")

Evaluating CSLS...


CSLS Top-1 Accuracy: 5.86%
CSLS Top-10 Accuracy: 15.38%

(Recall V3 NN Top-10 was ~22%)
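
To compare the two metrics on identical footing, the NN baseline can be recomputed on the same test split and target vocabulary. The function below is a minimal sketch of such a baseline; `nn_top_k_accuracy` and the `toy_*` variables are hypothetical names, and the toy data exists only to show the function runs.

```python
import numpy as np

def nn_top_k_accuracy(projected_vecs, gold_words, target_vecs, target_words, top_k=10):
    """Fraction of rows whose gold word is among the top_k cosine neighbors."""
    src = projected_vecs / np.linalg.norm(projected_vecs, axis=1, keepdims=True)
    tgt = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    sim = src @ tgt.T
    top = np.argsort(-sim, axis=1)[:, :top_k]      # best-first indices per row
    hits = sum(
        gold in {target_words[j] for j in row}
        for gold, row in zip(gold_words, top)
    )
    return hits / len(gold_words)

# Toy check (assumption): sources are lightly perturbed copies of the first 10 targets.
rng = np.random.default_rng(0)
toy_target_words = [f"w{i}" for i in range(50)]
toy_target_vecs = rng.normal(size=(50, 20))
toy_projected = toy_target_vecs[:10] + 0.01 * rng.normal(size=(10, 20))
toy_gold = toy_target_words[:10]

toy_acc = nn_top_k_accuracy(toy_projected, toy_gold, toy_target_vecs, toy_target_words, top_k=1)
print(f"toy NN top-1 accuracy: {toy_acc:.2%}")
```

In this notebook the call would presumably be `nn_top_k_accuracy(X_test @ R.T, [e for _, e in anchors_test], target_vecs, target_words)`, which scores NN against the same candidates as the CSLS numbers instead of relying on a recalled figure.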


## 4. Translation Demo

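CSLS Top-10 accuracy (15.38%) came in below the ~22% recalled for V3's NN, so the aggregate numbers did not improve. Let's check whether individual translations look any more sensible.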

In [5]:
def translate_csls(h_word, top_k=5):
    if h_word not in hier_model.wv:
        print(f"'{h_word}' not in vocab.")
        return
        
    vec = hier_model.wv[h_word]
    proj_vec = (vec @ R.T).reshape(1, -1)
    
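    # With a single query row, the per-target mean r_S is taken over that one row
    # and equals the cosine itself, so the ranking here matches plain cosine NN
    scores = get_csls_scores(proj_vec, target_vecs)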
    top_indices = np.argsort(scores[0])[-top_k:][::-1]
    
    print(f"\nCSLS Translation for '{h_word}':")
    for idx in top_indices:
        print(f"  -> {target_words[idx]} ({scores[0][idx]:.3f})")

translate_csls("nfr")
translate_csls("pr-aa")
translate_csls("ra")


CSLS Translation for 'nfr':
  -> wemetetka (0.055)
  -> tjetu (0.028)
  -> worried (0.008)
  -> behdeti (0.007)
  -> hetepeniptah (-0.002)

CSLS Translation for 'pr-aa':
  -> sacrifice (0.031)
  -> thotfest (0.030)
  -> writing (0.025)
  -> kapriests (0.022)
  -> subjects (0.010)

CSLS Translation for 'ra':
  -> demedj (0.080)
  -> nisachmetanch (0.040)
  -> nose (0.002)
  -> inside (0.000)
  -> tjauti (-0.008)
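
One thing to keep in mind when reading these outputs: when CSLS is computed for a single query, the source-side term is averaged over that one row only. The sketch below reproduces the same arithmetic on random toy vectors (all names are illustrative) and checks that the resulting scores are just the cosine shifted by a constant.

```python
import numpy as np

# Toy data only (assumption): one query vector and 50 random targets.
rng = np.random.default_rng(0)
query = rng.normal(size=(1, 16))
targets = rng.normal(size=(50, 16))

q = query / np.linalg.norm(query, axis=1, keepdims=True)
t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
sim = q @ t.T                                      # shape (1, 50)

k = 10
r_T = np.sort(sim, axis=1)[:, -k:].mean(axis=1)    # mean of the query's top-k cosines
r_S = np.sort(sim, axis=0)[-k:, :].mean(axis=0)    # only one row exists, so r_S == sim[0]
csls = 2 * sim - r_T[:, None] - r_S[None, :]

print("r_S equals the cosine row:", np.allclose(r_S, sim[0]))
print("CSLS is cos - r_T:", np.allclose(csls[0], sim[0] - r_T[0]))
print("same top-5 as plain NN:",
      np.array_equal(np.argsort(csls[0])[-5:], np.argsort(sim[0])[-5:]))
```

A constant shift does not change the ordering, so a single-word demo computed this way cannot show any hub correction. Scoring the query together with a larger set of source vectors would be needed for the penalty to take effect.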
