# V7 FastText + Visual Embeddings: Complete Pipeline

## Overview
This notebook combines **FastText** text embeddings with **Visual CNN features** to create multimodal hieroglyphic representations, then aligns them with English GloVe embeddings.

### Pipeline Steps
1. **Data Cleaning**: Extract transliteration from BBAW parquet
2. **FastText Training**: Train 768d embeddings on transliteration
3. **Visual Fusion**: Concatenate FastText (768d) + Visual (768d) = 1536d
4. **Alignment**: Ridge regression to map 1536d → 300d English space
5. **Evaluation**: Test on anchor pairs

### Key Fix
✅ Now using `transcription` column (transliteration) instead of `hieroglyphs` (MdC codes) to match anchor vocabulary!

## Setup

In [1]:
import pandas as pd
import numpy as np
import pickle
import json
import logging
from pathlib import Path
from tqdm import tqdm
from gensim.models import FastText, KeyedVectors
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Configure logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Setup paths
try:
    BASE_DIR = Path(__file__).resolve().parent.parent
except NameError:
    BASE_DIR = Path.cwd().parent

print(f"Base Directory: {BASE_DIR}")

Base Directory: /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual


## Step 1: Data Cleaning

Extract transliteration sequences from the BBAW dataset.

**Source**: `heiro_v6_BERT/data/raw/bbaw_huggingface.parquet`  
**Column**: `transcription` (transliteration like "jr,j-pꜥ,t ḥꜣ,tj-ꜥ")  
**Output**: `data/processed/cleaned_corpus.txt`
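
Transliterations in this corpus can carry editorial marks such as half brackets (`⸢…⸣`), which stay attached to tokens after whitespace normalization. The following is a minimal sketch of an optional extra cleaning step, assuming those marks should be dropped before training. The helper name `normalize_line` and the set of stripped characters are illustrative assumptions, not part of this notebook's pipeline.

```python
import re

# Assumption: half brackets (U+2E22, U+2E23), the reversed question mark
# (U+2E2E), square brackets and "?" are editorial noise for embedding training.
EDITORIAL_MARKS = re.compile(r"[\u2E22\u2E23\u2E2E\[\]?]")


def normalize_line(line: str) -> str:
    """Drop editorial marks, then collapse runs of whitespace."""
    stripped = EDITORIAL_MARKS.sub("", line)
    return " ".join(stripped.split())


print(normalize_line("\u2E22wr\u2E23.pl   n.w  [\u2E2Ehtr?]"))
```

Whether stripping these marks helps anchor coverage is untested here; it would have to be checked against the anchor vocabulary.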

In [2]:
# Define paths
PARQUET_PATH = BASE_DIR.parent / "heiro_v6_BERT/data/raw/bbaw_huggingface.parquet"
CLEAN_DATA_PATH = BASE_DIR / "data/processed/cleaned_corpus.txt"

# Ensure output directory exists
CLEAN_DATA_PATH.parent.mkdir(parents=True, exist_ok=True)

print(f"Reading from {PARQUET_PATH}...")
df = pd.read_parquet(PARQUET_PATH)

print(f"Total rows: {len(df)}")
print(f"Columns: {df.columns.tolist()}")

Reading from /Users/crashy/Development/heiroglyphy/heiro_v6_BERT/data/raw/bbaw_huggingface.parquet...
Total rows: 100736
Columns: ['transcription', 'translation', 'hieroglyphs']


In [3]:
# Filter for rows with non-empty transcription
df_trans = df[df['transcription'].notna() & (df['transcription'] != '')]
print(f"Rows with transcription: {len(df_trans)}")

# Show a sample
print("\nSample transcriptions:")
print(df_trans['transcription'].head(3).tolist())

Rows with transcription: 100729

Sample transcriptions:
['⸢pḏ,wt-9⸣   n =f   [⸮ḥtr?]   ⸢m⸣  ', 'ḥtr tp,j ꜥꜣ n ḥm =f Ꜥꜣ-nḫt,w', '⸢wr⸣.pl ⸢ꜥꜣi̯⸣.pl n.w ⸢Rṯn,w⸣ ⸢jni̯⸣ ⸢ḥm⸣ ⸢=f⸣ ⸢m⸣ ⸢sqr-ꜥnḫ⸣']


In [4]:
# Clean and tokenize
cleaned_lines = []

print("Processing transcriptions...")
for trans_str in tqdm(df_trans['transcription']):
    if not isinstance(trans_str, str):
        continue
    
    # Normalize whitespace (transliteration is already space-separated)
    clean_line = " ".join(trans_str.split())
    
    if clean_line:
        cleaned_lines.append(clean_line)

print(f"Extracted {len(cleaned_lines)} lines.")

# Save to file
with open(CLEAN_DATA_PATH, 'w', encoding='utf-8') as f:
    for line in cleaned_lines:
        f.write(line + "\n")

print(f"Saved cleaned corpus to {CLEAN_DATA_PATH}")

Processing transcriptions...


100%|██████████| 100729/100729 [00:00<00:00, 1037042.61it/s]

Extracted 100729 lines.
Saved cleaned corpus to /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/data/processed/cleaned_corpus.txt





In [5]:
# Verify output
print("\nFirst 3 lines of cleaned corpus:")
with open(CLEAN_DATA_PATH, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i >= 3:
            break
        print(f"{i+1}: {line.strip()[:100]}...")


First 3 lines of cleaned corpus:
1: ⸢pḏ,wt-9⸣ n =f [⸮ḥtr?] ⸢m⸣...
2: ḥtr tp,j ꜥꜣ n ḥm =f Ꜥꜣ-nḫt,w...
3: ⸢wr⸣.pl ⸢ꜥꜣi̯⸣.pl n.w ⸢Rṯn,w⸣ ⸢jni̯⸣ ⸢ḥm⸣ ⸢=f⸣ ⸢m⸣ ⸢sqr-ꜥnḫ⸣...


## Step 2: FastText Training

Train 768d FastText embeddings on the transliteration corpus.
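
FastText represents a word through its character n-grams, which is why spelling variants of a token tend to end up close together. The sketch below shows that decomposition under the assumption of gensim's default range (`min_n=3`, `max_n=6`), which applies because the training call does not override it. The function name `char_ngrams` is illustrative and is not a gensim API.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return the boundary-marked character n-grams of `word`."""
    marked = f"<{word}>"  # FastText wraps each word in "<" and ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


print(char_ngrams("ntr", min_n=3, max_n=4))
```

Tokens that share most of these n-grams share most of their subword vectors, so variants differing by one editorial character are expected to be near neighbours.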

In [6]:
# Define paths
MODEL_DIR = BASE_DIR / "models"
MODEL_PATH = MODEL_DIR / "fasttext_v7.model"

MODEL_DIR.mkdir(parents=True, exist_ok=True)

print(f"Training FastText model on {CLEAN_DATA_PATH}...")

Training FastText model on /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/data/processed/cleaned_corpus.txt...


In [9]:
# Load corpus
class MyCorpus:
    def __iter__(self):
        with open(CLEAN_DATA_PATH, 'r', encoding='utf-8') as f:
            for line in f:
                yield line.split()

sentences = MyCorpus()

# Train FastText
# Parameters:
# vector_size=768: Matches the 768d visual embeddings (the GloVe targets are 300d)
# window=5: Context window
# min_count=1: Keep all words for now
# sg=1: Skip-gram (usually better for smaller datasets)
# epochs=10: Train for a bit longer
model = FastText(vector_size=768, window=5, min_count=1, sentences=sentences, epochs=10, sg=1)

print(f"\nVocabulary size: {len(model.wv)}")

2025-11-21 14:20:23,575 : INFO : collecting all words and their counts
2025-11-21 14:20:23,577 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2025-11-21 14:20:23,599 : INFO : PROGRESS: at sentence #10000, processed 95884 words, keeping 16423 word types
2025-11-21 14:20:23,621 : INFO : PROGRESS: at sentence #20000, processed 177018 words, keeping 27382 word types
2025-11-21 14:20:23,640 : INFO : PROGRESS: at sentence #30000, processed 258107 words, keeping 36395 word types
2025-11-21 14:20:23,656 : INFO : PROGRESS: at sentence #40000, processed 329546 words, keeping 40659 word types
2025-11-21 14:20:23,673 : INFO : PROGRESS: at sentence #50000, processed 397825 words, keeping 49010 word types
2025-11-21 14:20:23,688 : INFO : PROGRESS: at sentence #60000, processed 453602 words, keeping 55475 word types
2025-11-21 14:20:23,716 : INFO : PROGRESS: at sentence #70000, processed 519943 words, keeping 61231 word types
2025-11-21 14:20:23,736 : INFO : PROGRESS: at s


Vocabulary size: 80662


In [10]:
# Save model
model.save(str(MODEL_PATH))
print(f"Model saved to {MODEL_PATH}")

# Save vectors in word2vec format for easy inspection
model.wv.save_word2vec_format(str(MODEL_DIR / "fasttext_v7.vec"))
print(f"Vectors saved to {MODEL_DIR / 'fasttext_v7.vec'}")

2025-11-21 14:23:04,180 : INFO : FastText lifecycle event {'fname_or_handle': '/Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2025-11-21T14:23:04.180002', 'gensim': '4.4.0', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'saving'}
2025-11-21 14:23:04,181 : INFO : not storing attribute vectors
2025-11-21 14:23:04,181 : INFO : storing np array 'vectors_vocab' to /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model.wv.vectors_vocab.npy
2025-11-21 14:23:04,221 : INFO : storing np array 'vectors_ngrams' to /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model.wv.vectors_ngrams.npy
2025-11-21 14:23:06,224 : INFO : not storing attribute buckets_word
2025-11-21 14:23:06,224 : INFO : storing np array 'syn1neg' to /Users/

Model saved to /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model
Vectors saved to /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.vec


In [11]:
# Test similarities with common transliteration tokens
test_words = ["n", "m", "r", "ḥr,w", "nṯr"]
for word in test_words:
    if word in model.wv:
        print(f"\nMost similar to '{word}':")
        print(model.wv.most_similar(word, topn=5))
    else:
        print(f"\n'{word}' not in vocabulary.")


Most similar to 'n':
[('hn,y', 0.6850689053535461), ('hn(n)', 0.685028612613678), ('jnt.(t)w', 0.6813384890556335), ('jnt.(t)(w)', 0.6766965389251709), ('sḫt,y', 0.6758492588996887)]

Most similar to 'm':
[('qrr.t(du)', 0.7199281454086304), ('{j}m-m', 0.7187355160713196), ('Hꜣkr', 0.7123856544494629), ('sꜣ(m)', 0.7116715312004089), ('ꜣbḏ', 0.7115925550460815)]

Most similar to 'r':
[('pḥ.t{r}', 0.684761106967926), ('d〈r〉', 0.6846187710762024), ('j〈r〉', 0.681447446346283), ('wḏ.〈t〉n', 0.6813123822212219), ('ḫ〈r〉', 0.6799421310424805)]

Most similar to 'ḥr,w':
[('n-ḥr,w', 0.9105653166770935), ('šꜣr,w', 0.9011860489845276), ('wꜥr,w', 0.901081919670105), ('tr,w', 0.894969642162323), ('ḫꜥr,w', 0.8937190175056458)]

Most similar to 'nṯr':
[('nṯr+', 0.8556978702545166), ('nṯrw', 0.8415403366088867), ('+nṯr', 0.8252946734428406), ('nṯr{r}', 0.7939909100532532), ('nṯr.j', 0.7816357016563416)]


## Step 3: Visual Embedding Fusion

Combine FastText vectors with pre-computed visual embeddings from V6.

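**Note**: Visual embeddings are keyed by Unicode code point (e.g. `U+13000`), and the lexicon maps glyph names to those code points. The FastText vocabulary, however, consists of transliterated words, and no transliteration → glyph mapping is available here, so the fusion cell below falls back to zero visual vectors.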
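
A minimal sketch of the glyph name → Unicode → visual vector lookup with a zero-vector fallback. The two dictionaries are toy stand-ins for the real lexicon and embedding file, and the helper name `visual_vector` is hypothetical.

```python
import numpy as np

VISUAL_DIM = 768

# Toy stand-ins (assumptions): the real data comes from the lexicon CSV
# and the pickled visual embedding dictionary.
glyph_to_unicode = {"a1": "U+13000", "a2": "U+13001"}
visual_embeds = {"U+13000": np.ones(VISUAL_DIM, dtype=np.float32)}


def visual_vector(glyph_name: str) -> np.ndarray:
    """Look up a glyph's visual vector, or return zeros if it is unknown."""
    code_point = glyph_to_unicode.get(glyph_name.lower())
    vec = visual_embeds.get(code_point) if code_point else None
    if vec is None:
        return np.zeros(VISUAL_DIM, dtype=np.float32)
    return vec


print(visual_vector("A1").sum())  # glyph with a visual embedding
print(visual_vector("a2").sum())  # in the lexicon, but no embedding
print(visual_vector("zz").sum())  # not in the lexicon at all
```

This covers only the glyph side of the chain. Going from a transliterated word to its glyph names still requires a mapping that this notebook does not have.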

In [12]:
# Load FastText Model
print(f"Loading FastText model from {MODEL_PATH}...")
ft_model = FastText.load(str(MODEL_PATH))
ft_wv = ft_model.wv
print(f"FastText Vocab Size: {len(ft_wv)}")

2025-11-21 14:24:59,237 : INFO : loading FastText object from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model
2025-11-21 14:24:59,255 : INFO : loading wv recursively from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model.wv.* with mmap=None
2025-11-21 14:24:59,255 : INFO : loading vectors_vocab from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model.wv.vectors_vocab.npy with mmap=None
2025-11-21 14:24:59,306 : INFO : loading vectors_ngrams from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model.wv.vectors_ngrams.npy with mmap=None


Loading FastText model from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model...


2025-11-21 14:25:02,496 : INFO : setting ignored attribute vectors to None
2025-11-21 14:25:02,510 : INFO : setting ignored attribute buckets_word to None
2025-11-21 14:25:06,964 : INFO : loading syn1neg from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model.syn1neg.npy with mmap=None
2025-11-21 14:25:07,044 : INFO : setting ignored attribute cum_table to None
2025-11-21 14:25:07,387 : INFO : FastText lifecycle event {'fname': '/Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model', 'datetime': '2025-11-21T14:25:07.386972', 'gensim': '4.4.0', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'loaded'}


FastText Vocab Size: 80662


In [13]:
# Load Visual Embeddings
VISUAL_EMBED_PATH = BASE_DIR.parent / "heiro_v6_BERT/data/processed/visual_embeddings_768d.pkl"
print(f"Loading Visual embeddings from {VISUAL_EMBED_PATH}...")

with open(VISUAL_EMBED_PATH, 'rb') as f:
    visual_embeds = pickle.load(f)
    
print(f"Visual Embeddings Size: {len(visual_embeds)}")
print(f"Sample key: {list(visual_embeds.keys())[0]}")

Loading Visual embeddings from /Users/crashy/Development/heiroglyphy/heiro_v6_BERT/data/processed/visual_embeddings_768d.pkl...
Visual Embeddings Size: 1071
Sample key: U+13000


In [14]:
# Load Lexicon for Mapping
LEXICON_PATH = BASE_DIR.parent / "heiro_v6_BERT/data/processed/hieroglyph_lexicon.csv"
print(f"Loading Lexicon from {LEXICON_PATH}...")

lexicon_df = pd.read_csv(LEXICON_PATH)
print(f"Lexicon size: {len(lexicon_df)}")
print(f"\nSample entries:")
print(lexicon_df.head())

Loading Lexicon from /Users/crashy/Development/heiroglyphy/heiro_v6_BERT/data/processed/hieroglyph_lexicon.csv...
Lexicon size: 1071

Sample entries:
   unicode character glyph_name gardiner_code
0  U+13000         𓀀         a1       U+13000
1  U+13001         𓀁         a2       U+13001
2  U+13002         𓀂         a3       U+13002
3  U+13003         𓀃         a4       U+13003
4  U+13004         𓀄         a5       U+13004


In [15]:
# Create mapping: glyph_name (lowercase) -> unicode
gardiner_to_unicode = dict(zip(lexicon_df['glyph_name'], lexicon_df['unicode']))
print(f"Created mapping for {len(gardiner_to_unicode)} glyphs")

Created mapping for 1071 glyphs


In [18]:
# Fuse Embeddings
print("Fusing embeddings...")
fused_vectors = []
words = []

visual_dim = 768
text_dim = 768

matches = 0
misses = 0

for word in ft_wv.index_to_key:
    # Get text vector
    text_vec = ft_wv[word]
    
    # Get visual vector (default to zeros if not found)
    visual_vec = np.zeros(visual_dim, dtype=np.float32)
    
    # Try to match word to visual embedding
    # For transliteration, this is harder - we'd need a transliteration->glyph mapping
    # For now, we'll just use zero vectors (no visual info)
    # This is a limitation we should note!
    
    # Normalize vectors (L2)
    text_norm = np.linalg.norm(text_vec)
    if text_norm > 0:
        text_vec = text_vec / text_norm
        
    visual_norm = np.linalg.norm(visual_vec)
    if visual_norm > 0:
        visual_vec = visual_vec / visual_norm
        matches += 1
    else:
        misses += 1
        
    # Concatenate
    fused_vec = np.concatenate([text_vec, visual_vec])
    fused_vectors.append(fused_vec)
    words.append(word)
    
print(f"Fusion Complete. Matches: {matches}, Misses: {misses}")
print(f"Match Rate: {matches / len(words):.2%}")

Fusing embeddings...
Fusion Complete. Matches: 0, Misses: 80662
Match Rate: 0.00%


In [17]:
# Save Fused Model
FUSED_MODEL_PATH = BASE_DIR / "models/fused_embeddings_1536d.kv"

fused_vectors = np.array(fused_vectors)
print(f"Fused Vectors Shape: {fused_vectors.shape}")

kv = KeyedVectors(vector_size=text_dim + visual_dim)
kv.add_vectors(words, fused_vectors)

kv.save(str(FUSED_MODEL_PATH))
print(f"Saved fused model to {FUSED_MODEL_PATH}")

Fused Vectors Shape: (80662, 1536)


2025-11-21 14:25:39,192 : INFO : KeyedVectors lifecycle event {'fname_or_handle': '/Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fused_embeddings_1536d.kv', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2025-11-21T14:25:39.192801', 'gensim': '4.4.0', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'saving'}
2025-11-21 14:25:39,194 : INFO : storing np array 'vectors' to /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fused_embeddings_1536d.kv.vectors.npy
2025-11-21 14:25:39,305 : INFO : saved /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fused_embeddings_1536d.kv


Saved fused model to /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fused_embeddings_1536d.kv


## Step 4: Alignment & Evaluation

Align the fused embeddings to English GloVe space using ridge regression (L2-regularized linear regression).
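
For reference, the alignment objective can be written as follows. This is a sketch that omits the intercept scikit-learn's `Ridge` fits by default. Here $X$ stacks the 1536d fused anchor vectors row-wise, $Y$ stacks the matching 300d GloVe vectors, and $\alpha$ is the regularization strength (`alpha=1.0` in this notebook):

$$
W^{*} = \arg\min_{W} \; \lVert XW - Y \rVert_F^2 + \alpha \lVert W \rVert_F^2
\quad\Longrightarrow\quad
W^{*} = (X^{\top}X + \alpha I)^{-1} X^{\top} Y
$$

A new fused vector $x$ is then mapped into English space as $x^{\top} W^{*}$.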

In [19]:
# Load Fused Model
print(f"Loading Fused Model from {FUSED_MODEL_PATH}...")
hiero_kv = KeyedVectors.load(str(FUSED_MODEL_PATH))
print(f"Loaded {len(hiero_kv)} hieroglyphic vectors")

2025-11-21 14:26:21,156 : INFO : loading KeyedVectors object from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fused_embeddings_1536d.kv
2025-11-21 14:26:21,178 : INFO : loading vectors from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fused_embeddings_1536d.kv.vectors.npy with mmap=None


Loading Fused Model from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fused_embeddings_1536d.kv...


2025-11-21 14:26:21,386 : INFO : KeyedVectors lifecycle event {'fname': '/Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fused_embeddings_1536d.kv', 'datetime': '2025-11-21T14:26:21.386791', 'gensim': '4.4.0', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'loaded'}


Loaded 80662 hieroglyphic vectors


In [20]:
# Load GloVe
GLOVE_PATH = BASE_DIR.parent / "heiro_v5_getdata/data/processed/glove.6B.300d.txt"
print(f"Loading GloVe from {GLOVE_PATH}...")
print("(This may take a minute...)")

glove_kv = KeyedVectors.load_word2vec_format(str(GLOVE_PATH), binary=False, no_header=True)
print(f"Loaded {len(glove_kv)} English vectors")

2025-11-21 14:26:27,274 : INFO : loading projection weights from /Users/crashy/Development/heiroglyphy/heiro_v5_getdata/data/processed/glove.6B.300d.txt


Loading GloVe from /Users/crashy/Development/heiroglyphy/heiro_v5_getdata/data/processed/glove.6B.300d.txt...
(This may take a minute...)


2025-11-21 14:27:35,768 : INFO : KeyedVectors lifecycle event {'msg': 'loaded (400000, 300) matrix of type float32 from /Users/crashy/Development/heiroglyphy/heiro_v5_getdata/data/processed/glove.6B.300d.txt', 'binary': False, 'encoding': 'utf8', 'datetime': '2025-11-21T14:27:35.768262', 'gensim': '4.4.0', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'load_word2vec_format'}


Loaded 400000 English vectors


In [21]:
# Load Anchors
ANCHORS_PATH = BASE_DIR.parent / "heiro_v6_BERT/data/processed/anchors.json"
print(f"Loading Anchors from {ANCHORS_PATH}...")

with open(ANCHORS_PATH, 'r', encoding='utf-8') as f:
    anchors = json.load(f)
    
print(f"Loaded {len(anchors)} anchor pairs")
print(f"\nSample anchors:")
for i in range(min(3, len(anchors))):
    print(f"  {anchors[i]['hieroglyphic']} → {anchors[i]['english']}")

Loading Anchors from /Users/crashy/Development/heiroglyphy/heiro_v6_BERT/data/processed/anchors.json...
Loaded 8541 anchor pairs

Sample anchors:
  n → the
  m → the
  =f → he


In [22]:
# Prepare Alignment Data
print("Preparing alignment data...")
X = []
Y = []
valid_anchors = []

for anchor in anchors:
    h_word = anchor['hieroglyphic']
    e_word = anchor['english'].lower()  # GloVe is lowercase
    
    # Check if words exist
    if h_word in hiero_kv and e_word in glove_kv:
        X.append(hiero_kv[h_word])
        Y.append(glove_kv[e_word])
        valid_anchors.append((h_word, e_word))
        
X = np.array(X)
Y = np.array(Y)

print(f"Valid Anchors: {len(X)} / {len(anchors)} ({len(X)/len(anchors)*100:.1f}%)")
print(f"\nThis is the KEY metric! We need good vocabulary overlap.")

Preparing alignment data...
Valid Anchors: 6700 / 8541 (78.4%)

This is the KEY metric! We need good vocabulary overlap.


In [23]:
# Split Data
if len(X) > 10:  # Only split if we have enough data
    X_train, X_test, Y_train, Y_test, anchors_train, anchors_test = train_test_split(
        X, Y, valid_anchors, test_size=0.2, random_state=42
    )
    print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")
else:
    print(f"⚠️  WARNING: Only {len(X)} valid anchors! Need more for reliable evaluation.")
    X_train, Y_train = X, Y
    X_test, Y_test = X, Y
    anchors_test = valid_anchors

Train size: 5360, Test size: 1340


In [24]:
# Train Alignment (Linear Regression / Ridge)
print("Training Linear Alignment...")
aligner = Ridge(alpha=1.0)
aligner.fit(X_train, Y_train)

print(f"R² Score on Train: {aligner.score(X_train, Y_train):.4f}")
print(f"R² Score on Test: {aligner.score(X_test, Y_test):.4f}")

Training Linear Alignment...
R² Score on Train: 0.0763
R² Score on Test: 0.0445


In [25]:
# Evaluate
print("Evaluating on Test Set...")
correct_top1 = 0
correct_top5 = 0
correct_top10 = 0
total = len(X_test)

# Predict all test vectors
Y_pred = aligner.predict(X_test)

for i in tqdm(range(total)):
    pred_vec = Y_pred[i]
    true_word = anchors_test[i][1]
    
    # Find nearest neighbors in GloVe
    neighbors = glove_kv.similar_by_vector(pred_vec, topn=10)
    neighbor_words = [w for w, s in neighbors]
    
    if true_word == neighbor_words[0]:
        correct_top1 += 1
    if true_word in neighbor_words[:5]:
        correct_top5 += 1
    if true_word in neighbor_words[:10]:
        correct_top10 += 1

Evaluating on Test Set...


100%|██████████| 1340/1340 [00:20<00:00, 66.47it/s]


In [26]:
# Calculate and display results
acc_top1 = correct_top1 / total * 100
acc_top5 = correct_top5 / total * 100
acc_top10 = correct_top10 / total * 100

print("\n" + "="*70)
print("V7 FastText + Visual Embeddings Results")
print("="*70)
print(f"Test Samples: {total}")
print(f"Valid Anchors: {len(X)} / {len(anchors)} ({len(X)/len(anchors)*100:.1f}%)")
print()
print(f"Top-1 Accuracy:  {acc_top1:.2f}%")
print(f"Top-5 Accuracy:  {acc_top5:.2f}%")
print(f"Top-10 Accuracy: {acc_top10:.2f}%")
print()
print(f"R² Score (Train): {aligner.score(X_train, Y_train):.4f}")
print(f"R² Score (Test):  {aligner.score(X_test, Y_test):.4f}")
print()
print("Comparison:")
print(f"  V5 Baseline: 24.53%")
print(f"  V6 BERT:     0.47%")
print(f"  V7 (This):   {acc_top1:.2f}%")


V7 FastText + Visual Embeddings Results
Test Samples: 1340
Valid Anchors: 6700 / 8541 (78.4%)

Top-1 Accuracy:  29.10%
Top-5 Accuracy:  36.57%
Top-10 Accuracy: 41.19%

R² Score (Train): 0.0763
R² Score (Test):  0.0445

Comparison:
  V5 Baseline: 24.53%
  V6 BERT:     0.47%
  V7 (This):   29.10%


In [27]:
# Save results
RESULTS_PATH = BASE_DIR / "data/processed/alignment_results_v7.json"

results = {
    "model": "V7 FastText + Visuals (Fused 1536d -> 300d)",
    "test_samples": total,
    "valid_anchors": len(X),
    "total_anchors": len(anchors),
    "anchor_coverage": len(X) / len(anchors) * 100,
    "top1_accuracy": acc_top1,
    "top5_accuracy": acc_top5,
    "top10_accuracy": acc_top10,
    "r2_train": aligner.score(X_train, Y_train),
    "r2_test": aligner.score(X_test, Y_test)
}

with open(RESULTS_PATH, 'w') as f:
    json.dump(results, f, indent=2)
    
print(f"\nSaved results to {RESULTS_PATH}")


Saved results to /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/data/processed/alignment_results_v7.json


## Analysis

### Key Metrics to Watch
1. **Anchor Coverage**: What % of anchors have valid vocabulary matches?
2. **Top-1 Accuracy**: Does the model predict the correct English word?
3. **R² Score**: How well does the linear transformation fit?
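
Top-k accuracy can be sketched as a cosine-similarity ranking against the target vocabulary. The version below is a vectorized illustration on toy 2d data, not the notebook's evaluation loop (which queries gensim per prediction). The function name `topk_accuracy` and the toy arrays are assumptions.

```python
import numpy as np


def topk_accuracy(pred: np.ndarray, vocab_vecs: np.ndarray,
                  true_idx: np.ndarray, k: int) -> float:
    """Fraction of rows whose true index is among the k most cosine-similar."""
    pred_n = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    vocab_n = vocab_vecs / np.linalg.norm(vocab_vecs, axis=1, keepdims=True)
    sims = pred_n @ vocab_n.T                    # shape (n_pred, n_vocab)
    topk = np.argsort(-sims, axis=1)[:, :k]      # best k vocabulary indices
    hits = (topk == true_idx[:, None]).any(axis=1)
    return float(hits.mean())


vocab = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy target vectors
pred = np.array([[0.9, 0.1], [0.2, 0.8]])               # toy predictions
truth = np.array([0, 2])                                # true vocabulary rows

print(topk_accuracy(pred, vocab, truth, k=1))
print(topk_accuracy(pred, vocab, truth, k=2))
```

The sketch assumes no all-zero rows, since those would make the cosine normalization divide by zero.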

### Expected Improvement
By using `transcription` instead of `hieroglyphs`, we expected the following; the observed values come from the run above:
- ✅ **Much higher anchor coverage**: up from 0.18% to 78.4% (6700 of 8541 anchors)
- ✅ **Meaningful accuracy**: 29.10% Top-1, above V5's 24.53%
- ⚠️ **No visual information** (transliteration doesn't map to glyphs easily)

### Limitations
- Visual embeddings are keyed by Unicode/Gardiner codes, not transliteration
- We're effectively training text-only FastText (visual vectors are zeros)
- To truly leverage visual features, we'd need a transliteration → glyph mapping
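
The claim that the zero visual block adds nothing can be checked on toy data. The sketch below uses a closed-form ridge fit without an intercept, which is a simplification of scikit-learn's `Ridge`, and shows that appending all-zero columns leaves the predictions unchanged. The helper `fit_ridge` and the toy shapes are assumptions.

```python
import numpy as np


def fit_ridge(X: np.ndarray, Y: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Closed-form ridge without intercept: solve (X^T X + alpha I) W = X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)


rng = np.random.default_rng(0)
text = rng.normal(size=(50, 4))                 # toy "text" block
fused = np.hstack([text, np.zeros((50, 4))])    # text block + zero "visual" block
Y = rng.normal(size=(50, 3))                    # toy targets

W_text = fit_ridge(text, Y)
W_fused = fit_ridge(fused, Y)

# The zero columns receive zero weights, so both models predict the same values.
print(np.allclose(W_fused[4:], 0.0))
print(np.allclose(fused @ W_fused, text @ W_text))
```

The same reasoning applies to the 1536d fused vectors here: with the second 768 dimensions all zero, the aligner can only use the FastText half.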