# V6 Phase 2: Multimodal Alignment

## Goal
Combine **HieroBERT's contextual embeddings** with our **768d Visual Embeddings** into a single joint representation, and test whether it aligns to English better than either modality alone.

## Strategy
1. **Load Models**: Load the pre-trained HieroBERT and the `visual_embeddings_768d.pkl`.
2. **Extract Contextual Vectors**: Pass the anchor words through HieroBERT to get their contextualized representations.
3. **Visual Fusion**: Combine the BERT vector with the Visual vector (e.g., `BERT + 0.5 * Visual`).
4. **Alignment**: Fit a Procrustes-style linear map from this multimodal space to English embeddings (this notebook uses `bert-base-uncased`; GloVe is an alternative target).
5. **Evaluation**: Test on the V3/V5 evaluation set.

## Inputs
- `models/hierobert_small`: Pre-trained HieroBERT.
- `data/processed/visual_embeddings_768d.pkl`: Visual features.
- `data/processed/anchors.json`: The anchor pairs (8,541 in this run).

In [1]:
!pip install transformers torch scikit-learn numpy pandas tqdm

In [5]:
import torch
import pickle
import json
import numpy as np
from pathlib import Path
from transformers import BertModel, BertTokenizerFast, BertTokenizer
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics.pairwise import cosine_similarity

# Paths
MODEL_PATH = Path("../models/hierobert_small")
VISUAL_PATH = Path("../data/processed/visual_embeddings_768d.pkl")
ANCHORS_PATH = Path("../data/processed/anchors.json")

device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: mps


## 1. Load Resources

In [6]:
# Load Visual Embeddings
with open(VISUAL_PATH, 'rb') as f:
    visual_embeddings = pickle.load(f)
print(f"Loaded {len(visual_embeddings)} visual embeddings.")

# Load HieroBERT
tokenizer = BertTokenizerFast.from_pretrained(str(MODEL_PATH))
model = BertModel.from_pretrained(str(MODEL_PATH)).to(device)
model.eval()
print("HieroBERT loaded.")

Some weights of BertModel were not initialized from the model checkpoint at ../models/hierobert_small and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loaded 1071 visual embeddings.
HieroBERT loaded.


## 2. Define Fusion Function
How do we combine a sequence of BERT vectors with static visual vectors?

**Approach**:
1. Get BERT output for the word (mean pooling of last hidden state).
2. Look up visual vectors for each glyph in the word.
3. Average the visual vectors.
4. Combine: `Final = BERT_Vector + (alpha * Visual_Vector)`

In [7]:
def get_multimodal_embedding(text, alpha=0.5):
    """
    Generates a 768d vector combining BERT context and Visual features.
    """
    # 1. BERT Embedding
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling of last hidden state (excluding [CLS] and [SEP])
    # Shape: [1, seq_len, 768]
    bert_vec = outputs.last_hidden_state[0, 1:-1, :].mean(dim=0).cpu().numpy()
    
    if np.isnan(bert_vec).any():
        bert_vec = np.zeros(768)

    # 2. Visual Embedding
    visual_vecs = []
    for char in text:
        if char in visual_embeddings:
            visual_vecs.append(visual_embeddings[char])
            
    if visual_vecs:
        visual_mean = np.mean(visual_vecs, axis=0)
    else:
        visual_mean = np.zeros(768)
        
    # 3. Fusion
    final_vec = bert_vec + (alpha * visual_mean)
    return final_vec
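
One thing the additive fusion above does not control for is scale: if the BERT and visual vectors have very different norms, `alpha` no longer reflects the real balance between the two modalities. A minimal sketch of a scale-aware variant follows, assuming both inputs are plain NumPy arrays of the same dimension. `l2_normalize` and `fuse_normalized` are hypothetical helpers, not part of this notebook, and whether normalization actually helps alignment here is untested.

```python
import numpy as np

def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit L2 norm; (near-)zero vectors are returned unchanged."""
    norm = np.linalg.norm(vec)
    return vec if norm < eps else vec / norm

def fuse_normalized(bert_vec, visual_vec, alpha=0.5):
    """Additive fusion after normalizing each modality to unit length."""
    return l2_normalize(bert_vec) + alpha * l2_normalize(visual_vec)

# Toy check with made-up 4d vectors of very different magnitude
bert_toy = np.array([10.0, 0.0, 0.0, 0.0])
visual_toy = np.array([0.0, 0.02, 0.0, 0.0])
fused_toy = fuse_normalized(bert_toy, visual_toy, alpha=0.5)
print(fused_toy)
```

With both inputs on the unit sphere, the visual part contributes exactly `alpha` times the weight of the BERT part, regardless of the raw magnitudes.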

## 3. Generate Anchor Embeddings & Align
We will:
1. Load English BERT (`bert-base-uncased`) to get target vectors.
2. Generate source (Hiero) and target (English) vectors for all anchors.
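3. Fit a linear regression without an intercept to map Hiero -> English. This is a Procrustes-style linear map, but unconstrained: the matrix is not forced to be orthogonal.
4. Evaluate Top-K retrieval accuracy on a held-out 20% split of the anchors.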
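
For comparison, strict orthogonal Procrustes restricts the map to a rotation or reflection, which plain linear regression does not. A minimal sketch of that variant via SVD, assuming `X` and `Y` are row-aligned matrices of equal width; `orthogonal_procrustes_map` is a hypothetical helper and the data below is synthetic, not the anchor set.

```python
import numpy as np

def orthogonal_procrustes_map(X, Y):
    """Return the orthogonal W minimizing ||X @ W - Y||_F (rows are paired)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy check: recover a known orthogonal map from synthetic paired data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
Y_toy = X_toy @ Q

W = orthogonal_procrustes_map(X_toy, Y_toy)
print(np.allclose(W, Q))
print(np.allclose(W.T @ W, np.eye(8)))
```

The orthogonality constraint preserves distances and angles in the source space, which acts as a strong regularizer when the number of anchors is small relative to the 768 x 768 parameters of a free linear map. `scipy.linalg.orthogonal_procrustes` provides the same solution if SciPy is available.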

In [8]:
# Load Anchors
with open(ANCHORS_PATH, 'r') as f:
    anchors = json.load(f)
print(f"Loaded {len(anchors)} anchor pairs.")

# Load English BERT for alignment target
en_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
en_model = BertModel.from_pretrained('bert-base-uncased').to(device)
en_model.eval()

def get_english_embedding(text):
    inputs = en_tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        outputs = en_model(**inputs)
    # Mean pooling
    return outputs.last_hidden_state[0, 1:-1, :].mean(dim=0).cpu().numpy()

# Generate Datasets
X_hiero = []
Y_english = []
valid_pairs = []

print("Generating embeddings...")
for pair in tqdm(anchors):
    h_text = pair['hieroglyphic']
    e_text = pair['english']
    
    # Generate Multimodal Hiero Vector
    h_vec = get_multimodal_embedding(h_text, alpha=0.5)
    
    # Generate English BERT Vector
    e_vec = get_english_embedding(e_text)
    
    if not np.isnan(h_vec).any() and not np.isnan(e_vec).any():
        X_hiero.append(h_vec)
        Y_english.append(e_vec)
        valid_pairs.append(pair)

X = np.array(X_hiero)
Y = np.array(Y_english)

print(f"Generated embeddings for {len(X)} pairs.")

# Split Data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Linear alignment: unconstrained least squares with no intercept
# (Procrustes-style, but the map is not constrained to be orthogonal)
print("Training alignment...")
aligner = LinearRegression(fit_intercept=False)
aligner.fit(X_train, Y_train)

# Evaluation
def evaluate(aligner, X_test, Y_test, k_values=(1, 5, 10)):
    # Project held-out Hiero vectors into the English space
    Y_pred = aligner.predict(X_test)
    
    # Cosine Similarity between all predictions and all targets
    sim_matrix = cosine_similarity(Y_pred, Y_test)
    
    top_k_hits = {k: 0 for k in k_values}
    n_test = len(X_test)
    
    for i in range(n_test):
        # Get indices of sorted similarities (descending)
        sorted_indices = np.argsort(-sim_matrix[i])
        
        # Check if the correct index (i) is in the top k
        for k in k_values:
            if i in sorted_indices[:k]:
                top_k_hits[k] += 1
                
    results = {f"Top-{k}": hits/n_test for k, hits in top_k_hits.items()}
    return results

print("Evaluating...")
scores = evaluate(aligner, X_test, Y_test)
print("Alignment Results:")
print(json.dumps(scores, indent=2))

# Save the fitted aligner (the mapping matrix itself is aligner.coef_)
with open('../models/alignment_matrix.pkl', 'wb') as f:
    pickle.dump(aligner, f)
print("Alignment model saved.")

Loaded 8541 anchor pairs.
Generating embeddings...


100%|██████████| 8541/8541 [04:03<00:00, 35.12it/s]


Generated embeddings for 8541 pairs.
Training alignment...
Evaluating...
Alignment Results:
{
  "Top-1": 0.005851375073142188,
  "Top-5": 0.016968987712112346,
  "Top-10": 0.024575775307197192
}
Alignment model saved.


Not a great result: Top-1 accuracy is about 0.6% (10 of 1,709 test pairs), and even Top-10 only reaches about 2.5%.
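
To put these numbers in context, it helps to compare them with chance. A small sketch, assuming retrieval runs over the 20% test split of the 8,541 pairs and that every test target is distinct (duplicate English glosses would shift the baseline); `chance_top_k` is a hypothetical helper.

```python
import math

def chance_top_k(n_candidates, k):
    """Probability that a uniformly random ranking puts the target in the top k."""
    return min(k, n_candidates) / n_candidates

# train_test_split rounds a fractional test size up
n_test = math.ceil(0.2 * 8541)
baseline = {f"Top-{k}": chance_top_k(n_test, k) for k in (1, 5, 10)}
print(n_test, baseline)
```

Comparing the measured scores against this baseline shows how far above random the alignment actually is, which is more informative than the raw percentages alone.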