# Phase 3: Embedding Training

## Goal
Train FastText embeddings on our corpus of 104k hieroglyphic texts and prepare them for alignment with English embeddings.

## Strategy
Following V3's successful approach, but with roughly 9x more data:
1. Train FastText on hieroglyphic transliterations
2. Load pre-trained English embeddings (or train on Wikipedia)
3. Verify anchor coverage in both embedding spaces
4. Save embeddings for alignment

## Key Improvements over V3
- **Roughly 9x more training data** (104k vs 12k texts)
- **About 6.5x more anchors** (8.5k vs 1.3k pairs)
- **Hieroglyphs column** for additional context
- **Larger vocabulary** (11,974 words with their own vector after `MIN_COUNT = 5`, out of 85,013 unique tokens; V3 had about 7k unique words)

In [1]:


import pandas as pd
import pickle
import json
from pathlib import Path
from gensim.models import FastText
from gensim.models import KeyedVectors
import numpy as np
from tqdm import tqdm

print("Libraries loaded successfully")

Libraries loaded successfully


## 1. Load Corpus and Anchors

In [2]:
# Load hieroglyphic corpus
corpus_path = Path('../data/processed/hieroglyphic_corpus_full.tsv')
df = pd.read_csv(corpus_path, sep='\t')

print(f"Corpus size: {len(df):,} texts")
print(f"\nColumns: {df.columns.tolist()}")

Corpus size: 104,426 texts

Columns: ['transliteration', 'translation', 'source', 'metadata', 'hieroglyphs', 'transliteration_clean']


In [3]:
# Load English anchors
anchors_path = Path('../data/processed/english_anchors.pkl')
with open(anchors_path, 'rb') as f:
    anchors = pickle.load(f)

print(f"Loaded {len(anchors):,} English anchors")
print(f"\nSample anchors:")
for i in range(5):
    a = anchors[i]
    print(f"  {a['hieroglyphic']:15s} → {a['english']:15s} (conf: {a['confidence']:.2f})")

Loaded 8,541 English anchors

Sample anchors:
  n               → der             (conf: 0.34)
  m               → der             (conf: 0.37)
  =f              → er              (conf: 0.38)
  =k              → du              (conf: 0.45)
  =j              → ich             (conf: 0.60)


## 2. Prepare Training Sentences

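gensim's FastText expects an iterable of tokenized sentences, each a list of string tokens. Here every non-empty cleaned transliteration becomes one sentence, split on whitespace.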

In [4]:
# Extract sentences from transliterations
sentences = []

for text in tqdm(df['transliteration_clean'].dropna(), desc="Preparing sentences"):
    if isinstance(text, str) and text.strip():
        # Split into words
        words = text.split()
        if len(words) > 0:
            sentences.append(words)

print(f"\nTotal sentences: {len(sentences):,}")
print(f"Sample sentence: {sentences[0][:10]}...")

# Vocabulary stats
all_words = [word for sent in sentences for word in sent]
vocab = set(all_words)
print(f"\nVocabulary size: {len(vocab):,} unique words")
print(f"Total tokens: {len(all_words):,}")

Preparing sentences: 100%|██████████| 104426/104426 [00:00<00:00, 727945.12it/s]


Total sentences: 104,426
Sample sentence: ['nḏ', '(w)di̯', 'r', '=s']...

Vocabulary size: 85,013 unique words
Total tokens: 872,070
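
Most of the 85,013 unique tokens are rare, so the frequency cutoff used in training decides how many of them get their own vector. The sketch below counts how many distinct tokens survive a few cutoffs. `vocab_size_by_min_count` is a hypothetical helper, and the toy sentences only stand in for the real `sentences` list.

```python
from collections import Counter


def vocab_size_by_min_count(sentences, thresholds=(1, 2, 5, 10)):
    """Return {threshold: number of distinct tokens seen at least that often}."""
    counts = Counter(token for sent in sentences for token in sent)
    return {t: sum(1 for c in counts.values() if c >= t) for t in thresholds}


# Toy data standing in for the real `sentences` list (assumption for illustration)
toy_sentences = [["nṯr", "ꜥꜣ"], ["nṯr", "nfr"], ["zꜣ", "nṯr"]]
toy_sizes = vocab_size_by_min_count(toy_sentences, thresholds=(1, 2, 3))
print(toy_sizes)
```

Running `vocab_size_by_min_count(sentences)` on the real corpus would show how sensitive the trained vocabulary is to the chosen cutoff before committing to a value.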





## 3. Train FastText Model

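We use hyperparameters similar to V3's, adjusted for the larger corpus. With `MIN_COUNT = 5`, only words seen at least five times get their own vector, which is why the trained vocabulary reported below (11,974 words) is much smaller than the 85,013 unique tokens counted above.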

In [5]:
# FastText hyperparameters
VECTOR_SIZE = 300      # Embedding dimension (same as V3)
WINDOW = 5             # Context window
MIN_COUNT = 5          # Minimum word frequency
EPOCHS = 10            # Training epochs
SG = 1                 # Skip-gram (1) vs CBOW (0)

print("Training FastText model...")
print(f"Parameters: dim={VECTOR_SIZE}, window={WINDOW}, min_count={MIN_COUNT}, epochs={EPOCHS}")

model = FastText(
    sentences=sentences,
    vector_size=VECTOR_SIZE,
    window=WINDOW,
    min_count=MIN_COUNT,
    epochs=EPOCHS,
    sg=SG,
    workers=4
)

print("\n✓ Training complete!")
print(f"Vocabulary size: {len(model.wv):,} words")

Training FastText model...
Parameters: dim=300, window=5, min_count=5, epochs=10


Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'



✓ Training complete!
Vocabulary size: 11,974 words
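
FastText builds each word vector from the vectors of its character n-grams, which matters in a corpus where the same lemma appears in bracketed or suffixed variants. The sketch below lists the n-grams of a token, assuming gensim's default lengths (`min_n=3`, `max_n=6`) and the `<`/`>` word-boundary markers. `char_ngrams` and `shared_ngrams` are illustrative helpers, not gensim functions, and the real model additionally hashes n-grams into a fixed number of buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def shared_ngrams(a, b, **kwargs):
    """N-grams that two tokens have in common (illustrative only)."""
    return set(char_ngrams(a, **kwargs)) & set(char_ngrams(b, **kwargs))


print(char_ngrams("wsjr", min_n=3, max_n=4))
print(shared_ngrams("wsjr", "wsj"))
```

Tokens that share many n-grams start out with overlapping representations, so spelling variants of one word tend to land close together even when some of them are rare.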


## 4. Verify Anchor Coverage

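Check how many of our anchors the trained model can return a vector for. FastText composes vectors from character n-grams, so in gensim 4 `w in model.wv` is true for any string when n-grams are enabled; the count below is therefore an upper bound on how many anchors are true in-vocabulary words.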

In [6]:
# Check anchor coverage
hieroglyphic_words = [a['hieroglyphic'] for a in anchors]
found = [w for w in hieroglyphic_words if w in model.wv]
missing = [w for w in hieroglyphic_words if w not in model.wv]

print(f"Anchor Coverage:")
print(f"  Found: {len(found):,} / {len(hieroglyphic_words):,} ({len(found)/len(hieroglyphic_words)*100:.1f}%)")
print(f"  Missing: {len(missing):,}")

if missing:
    print("\nTop 10 missing anchors:")
    # Build the frequency lookup once (first occurrence wins), then rank ALL
    # missing anchors by frequency rather than only the first ten encountered
    freq_by_word = {a['hieroglyphic']: a['frequency'] for a in reversed(anchors)}
    for w in sorted(set(missing), key=freq_by_word.get, reverse=True)[:10]:
        print(f"  {w:15s} (freq: {freq_by_word[w]:,})")

Anchor Coverage:
  Found: 8,541 / 8,541 (100.0%)
  Missing: 0
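
To count only anchors that have their own entry in the trained vocabulary, membership can be tested against the vocabulary mapping rather than against the vectors object. A minimal sketch: `anchor_coverage` is a hypothetical helper, the toy vocabulary is made up, and passing `model.wv.key_to_index` as `vocab` assumes gensim 4.

```python
def anchor_coverage(anchor_words, vocab):
    """Split anchor words into (found, missing) by exact membership in `vocab`."""
    found = [w for w in anchor_words if w in vocab]
    missing = [w for w in anchor_words if w not in vocab]
    return found, missing


# Toy stand-ins; with the trained model one would pass `model.wv.key_to_index`
toy_vocab = {"n": 0, "m": 1, "=f": 2}
toy_found, toy_missing = anchor_coverage(["n", "=f", "=k"], toy_vocab)
print(f"Found {len(toy_found)} / {len(toy_found) + len(toy_missing)}; missing: {toy_missing}")
```

Anchors reported as missing here can still be looked up in a FastText model, but their vectors come purely from n-grams, so they may be less reliable as alignment points.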


## 5. Test Embeddings

Quick sanity check: find similar words to key anchors.

In [7]:
# Test similar words
test_words = ['wsjr', 'ḥr,w', 'ppy', 'zꜣ', 'nṯr']  # osiris, horus, pepi, son, god

print("Similar words test:")
print("="*70)
for word in test_words:
    if word in model.wv:
        similar = model.wv.most_similar(word, topn=5)
        print(f"\n{word}:")
        for sim_word, score in similar:
            print(f"  {sim_word:15s} (similarity: {score:.3f})")
    else:
        print(f"\n{word}: NOT IN VOCABULARY")

Similar words test:

wsjr:
  wsj             (similarity: 0.833)
  〈wsjr〉          (similarity: 0.813)
  wsjr-ḫnt(,j)-jmn,tt (similarity: 0.789)
  wsjr-ḫnt,j-jmn,tj.w (similarity: 0.788)
  wsjr-wnn-nfr    (similarity: 0.780)

ḥr,w:
  ḥr,wt           (similarity: 0.813)
  +ḥr,w           (similarity: 0.797)
  〈ḥr,w〉          (similarity: 0.743)
  {ḥr,w}          (similarity: 0.741)
  ḥr,w-j          (similarity: 0.729)

ppy:
  nfr-kꜣ-rꜥw      (similarity: 0.876)
  nfr-kꜣ-[rꜥw]    (similarity: 0.820)
  ⸢nfr-kꜣ-rꜥw⸣    (similarity: 0.818)
  sḫm-kꜣ-rꜥw      (similarity: 0.782)
  [nfr-kꜣ-rꜥw]    (similarity: 0.781)

zꜣ:
  zꜣt             (similarity: 0.646)
  zꜣš             (similarity: 0.614)
  zꜣ-z            (similarity: 0.598)
  zꜣb             (similarity: 0.596)
  sms,w           (similarity: 0.595)

nṯr:
  nṯri̯           (similarity: 0.669)
  nṯr.j           (similarity: 0.662)
  nṯ              (similarity: 0.643)
  〈nṯr〉           (similarity: 0.642)
  nṯrj            (similarity
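
The scores above are cosine similarities. A minimal NumPy sketch of the same ranking on toy vectors follows; `top_n_cosine` is a hypothetical helper and, unlike `most_similar`, it does not exclude the query word itself.

```python
import numpy as np


def top_n_cosine(query, words, vectors, topn=5):
    """Rank `words` by cosine similarity of their rows in `vectors` to `query`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = unit @ q
    order = np.argsort(-scores)[:topn]
    return [(words[i], float(scores[i])) for i in order]


# Toy 2-d vectors (assumption for illustration, not real embeddings)
toy_words = ["a", "b", "c"]
toy_vectors = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
toy_ranked = top_n_cosine(np.array([1.0, 0.0]), toy_words, toy_vectors, topn=2)
print(toy_ranked)
```

With the trained model, the same function could presumably be called as `top_n_cosine(model.wv['nṯr'], model.wv.index_to_key, model.wv.vectors)`, assuming gensim 4 attribute names.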

## 6. Save Hieroglyphic Embeddings

In [8]:
# Save the model
model_path = Path('../data/processed/hieroglyphic_fasttext.model')
model.save(str(model_path))
print(f"✓ Saved full model to {model_path}")

# Save just the word vectors (smaller, faster to load)
wv_path = Path('../data/processed/hieroglyphic_vectors.kv')
model.wv.save(str(wv_path))
print(f"✓ Saved word vectors to {wv_path}")

✓ Saved full model to ../data/processed/hieroglyphic_fasttext.model
✓ Saved word vectors to ../data/processed/hieroglyphic_vectors.kv


## 7. English Embeddings

We have two options:
1. **Use pre-trained** (e.g., GloVe, Word2Vec from Google)
2. **Train on Wikipedia** (what V3 did)

For now, let's use **pre-trained GloVe** (faster, proven quality).

In [None]:
# Option 1: Download pre-trained GloVe embeddings
# Download from: https://nlp.stanford.edu/projects/glove/
# We'll use glove.6B.300d.txt (300 dimensions, 6B tokens)

print("To use pre-trained English embeddings:")
print("1. Download GloVe: wget http://nlp.stanford.edu/data/glove.6B.zip")
print("2. Extract: unzip glove.6B.zip")
print("3. Move glove.6B.300d.txt to ../data/processed/")
print("\nOr we can train on Wikipedia (slower, but custom to our needs)")
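
Once `glove.6B.300d.txt` is in place, it can be read directly: each line holds one token followed by space-separated floats, with no header line. The sketch below parses that format into a dictionary. `parse_glove` is a hypothetical helper, the toy lines are made up, and the optional `keep` set is an assumption meant to limit memory by loading only the English anchor words.

```python
import numpy as np


def parse_glove(lines, keep=None):
    """Parse GloVe text lines into {word: float32 vector}.

    `keep` optionally restricts the result to a set of words.
    """
    vectors = {}
    for line in lines:
        word, *values = line.rstrip("\n").split(" ")
        if keep is not None and word not in keep:
            continue
        vectors[word] = np.array([float(v) for v in values], dtype=np.float32)
    return vectors


# Toy lines in GloVe format (3 dimensions instead of 300, for illustration)
toy_lines = ["the 0.1 0.2 0.3\n", "son 0.4 0.5 0.6\n"]
toy_glove = parse_glove(toy_lines, keep={"son"})
print({w: v.shape for w, v in toy_glove.items()})
```

For the real file one would pass an open file handle, for example `with open(path, encoding='utf-8') as f: parse_glove(f)`. The 6B vectors are uncased, so anchor words should be lowercased before lookup. In gensim 4, `KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)` should load the same file into a `KeyedVectors` object.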

## Next Steps: Phase 4 - Alignment

Once we have both embedding spaces:
1. Load hieroglyphic and English embeddings
2. Extract anchor vectors from both spaces
3. Apply Orthogonal Procrustes alignment
4. Evaluate on V3's test set
5. Discover new meanings (Anubis-type connections!)

Create `07_procrustes_alignment.ipynb` next.
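
As a preview of step 3, Orthogonal Procrustes finds the orthogonal matrix W that minimizes ||XW - Y||, where the rows of X and Y are the anchor vectors in the two spaces. A minimal NumPy sketch on synthetic data follows; `orthogonal_procrustes` here is our own illustrative function, the random matrices stand in for real anchor vectors, and both spaces are assumed to share the same dimension (300 in our case).

```python
import numpy as np


def orthogonal_procrustes(X, Y):
    """Orthogonal W minimizing the Frobenius norm of X @ W - Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Synthetic check: rotate random "source" vectors by a known orthogonal map
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
Y = X @ Q

W = orthogonal_procrustes(X, Y)
print(np.allclose(W, Q))
```

On real data the recovered map will not be exact, so alignment quality has to be judged on held-out anchor pairs rather than on the pairs used to fit W.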