# V7 FastText + Visual Embeddings Pipeline

## Goal
Combine **FastText** (text-based embeddings) with **visual CNN features** into a single hieroglyph representation, then align it with English GloVe embeddings.

## Strategy
1. **Data Cleaning**: Extract hieroglyphs from BBAW parquet dataset
2. **FastText Training**: Train 300d embeddings on cleaned corpus
3. **Visual Fusion**: Concatenate FastText (300d) + Visual (768d) = 1068d
4. **Alignment**: Linear regression to map 1068d → 300d English space
5. **Evaluation**: Test on anchor pairs, compare to V5 (24.53%) and V6 (0.2%)

## Step 1: Data Cleaning

Extract hieroglyphic sequences from the BBAW parquet dataset.

**Data Source**: `heiro_v6_BERT/data/raw/bbaw_huggingface.parquet`  
**Output**: `data/processed/cleaned_corpus.txt`

In [13]:

import json
from pathlib import Path

import pandas as pd
from tqdm import tqdm

try:
    # If running as a script
    BASE_DIR = Path(__file__).resolve().parent.parent
except NameError:
    # If running in Jupyter, assume we are in 'notebooks/'
    BASE_DIR = Path.cwd().parent

# Assumes heiro_v6_BERT sits next to this project directory (see Data Source above).
PARQUET_PATH = BASE_DIR.parent / "heiro_v6_BERT/data/raw/bbaw_huggingface.parquet"
CLEAN_DATA_PATH = BASE_DIR / "data/processed/cleaned_corpus.txt"

print(f"Looking for data at: {PARQUET_PATH}")

# PARQUET_PATH must be a Path (not a str) for .exists() to work.
if not PARQUET_PATH.exists():
    raise FileNotFoundError(f"{PARQUET_PATH} not found.")

print(f"Reading from {PARQUET_PATH}...")
df = pd.read_parquet(PARQUET_PATH)
print(f"Total rows: {len(df)}")

# Stop early if the expected column is missing
if 'hieroglyphs' not in df.columns:
    raise KeyError("'hieroglyphs' column not found.")

# Filter for rows with non-empty hieroglyphs
df_glyphs = df[df['hieroglyphs'] != '']
print(f"Rows with hieroglyphs: {len(df_glyphs)}")

cleaned_lines = []

print("Processing hieroglyphs...")
for glyph_str in tqdm(df_glyphs['hieroglyphs']):
    if not isinstance(glyph_str, str):
        continue
        
    # The 'hieroglyphs' column holds MdC (Manuel de Codage) strings built from
    # Gardiner sign codes, e.g. "D21 :Q3 :D36 F4 :D36 L2 -X1 :S19 S29 -U23 -T21...".
    # FastText splits on whitespace, so to learn one embedding per sign code
    # ("D21", "Q3", ...) each code must be its own whitespace-separated token.

    # 1. Replace common MdC separators with spaces
    clean_line = glyph_str.replace('-', ' ').replace(':', ' ').replace('*', ' ').replace('&', ' ')

    # 2. Brackets and other editorial markers ([], (), {}, <>, ...) mentioned in
    #    the dataset README are left in place for now.

    # 3. Normalize whitespace
    clean_line = " ".join(clean_line.split())
    
    if clean_line:
        cleaned_lines.append(clean_line)

print(f"Extracted {len(cleaned_lines)} lines.")

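# Make sure the output directory exists before writing
CLEAN_DATA_PATH.parent.mkdir(parents=True, exist_ok=True)
with open(CLEAN_DATA_PATH, 'w', encoding='utf-8') as f:
    for line in cleaned_lines:
        f.write(line + "\n")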

print(f"Saved cleaned corpus to {CLEAN_DATA_PATH}")





In [None]:
# Verify output
!wc -l ../data/processed/cleaned_corpus.txt
!printf "\nFirst 3 lines:\n"
!head -n 3 ../data/processed/cleaned_corpus.txt
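
The separator handling in the cleaning loop can also be written as a single regular expression. This is a minimal sketch, assuming the only separators that matter are whitespace and the MdC grouping characters `-`, `:`, `*`, and `&` replaced above; `tokenize_mdc` is a hypothetical helper name, not part of the project scripts.

```python
import re

# Whitespace plus the MdC grouping characters replaced in the cleaning loop.
_MDC_SEPARATORS = re.compile(r"[\s\-:*&]+")


def tokenize_mdc(glyph_str: str) -> list[str]:
    """Split an MdC string into sign codes, dropping empty fragments."""
    return [tok for tok in _MDC_SEPARATORS.split(glyph_str) if tok]


sample = "D21 :Q3 :D36 F4 :D36 L2 -X1 :S19"
print(" ".join(tokenize_mdc(sample)))
```

Returning a token list instead of a joined string makes it easy to inspect the vocabulary (for example with `collections.Counter`) before writing the corpus file.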

## Step 2: FastText Training

Train 300d FastText embeddings on the cleaned corpus.

In [None]:
# Train FastText model
!python3 ../scripts/02_train_fasttext.py

In [None]:
# Verify model was created
!ls -lh ../models/fasttext_v7.*
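
`02_train_fasttext.py` is not shown in this notebook. As a rough sketch of what such a script might do, the code below assumes gensim's `FastText` class (the Step 3 logs show a gensim `FastText` object being loaded). Only the 300d vector size comes from this notebook; the function names and the `window`, `min_count`, and `epochs` values are illustrative assumptions.

```python
def read_corpus(path):
    """Return one token list per non-empty line of a whitespace-tokenized corpus."""
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f if line.strip()]


def train_fasttext(corpus_path, model_path, vector_size=300):
    """Train and save a gensim FastText model (hyperparameters are assumptions)."""
    # Imported lazily so read_corpus stays usable without gensim installed.
    from gensim.models import FastText

    sentences = read_corpus(corpus_path)
    model = FastText(
        sentences=sentences,
        vector_size=vector_size,  # 300d, as stated in the strategy
        window=5,                 # assumed context window
        min_count=1,              # assumed: keep rare sign codes
        epochs=10,                # assumed number of passes
    )
    model.save(str(model_path))
    return model
```

Under these assumptions, `train_fasttext(CLEAN_DATA_PATH, BASE_DIR / "models/fasttext_v7.model")` would produce the model file that the next step loads.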

## Step 3: Visual Embedding Fusion

Combine FastText vectors with pre-computed visual embeddings from V6.

In [14]:
# Fuse embeddings
!python3 ../scripts/03_fuse_embeddings.py

Loading FastText model from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model...
2025-11-21 12:04:50,597 : INFO : loading FastText object from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model
2025-11-21 12:04:50,601 : INFO : loading wv recursively from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model.wv.* with mmap=None
2025-11-21 12:04:50,601 : INFO : loading vectors_ngrams from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model.wv.vectors_ngrams.npy with mmap=None
2025-11-21 12:04:51,706 : INFO : setting ignored attribute buckets_word to None
2025-11-21 12:04:51,708 : INFO : setting ignored attribute vectors to None
2025-11-21 12:04:51,805 : INFO : setting ignored attribute cum_table to None
2025-11-21 12:04:51,846 : INFO : FastText lifecycle event {'fname': '/Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v

In [15]:
# Verify fused model
!ls -lh ../models/fused_embeddings_1068d.kv

-rw-r--r--@ 1 crashy  staff    10M Nov 21 12:04 ../models/fused_embeddings_1068d.kv
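
The fusion step amounts to a per-glyph concatenation of a 300d text vector and a 768d visual vector. The sketch below uses NumPy and assumes that glyphs without a visual embedding get a zero vector for the visual part; the notebook does not say how `03_fuse_embeddings.py` handles unmatched glyphs, and `fuse` and its arguments are hypothetical names.

```python
import numpy as np

TEXT_DIM, VISUAL_DIM = 300, 768


def fuse(text_vecs, visual_vecs):
    """Concatenate text and visual vectors per glyph into 1068d vectors.

    text_vecs / visual_vecs: dicts mapping glyph code -> 1-D NumPy array.
    Glyphs missing from visual_vecs get a zero visual part (an assumption).
    """
    fused = {}
    for glyph, text_vec in text_vecs.items():
        visual_vec = visual_vecs.get(glyph)
        if visual_vec is None:
            visual_vec = np.zeros(VISUAL_DIM, dtype=text_vec.dtype)
        fused[glyph] = np.concatenate([text_vec, visual_vec])
    return fused


# Toy data: two glyphs with text vectors, only one with a visual vector.
rng = np.random.default_rng(0)
text_vecs = {g: rng.normal(size=TEXT_DIM).astype(np.float32) for g in ("D21", "Q3")}
visual_vecs = {"D21": rng.normal(size=VISUAL_DIM).astype(np.float32)}
fused = fuse(text_vecs, visual_vecs)
print({glyph: vec.shape for glyph, vec in fused.items()})
```

Zero-filling keeps every glyph in the fused vocabulary, but it also means unmatched glyphs differ only in their first 300 dimensions.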


## Step 4: Alignment & Evaluation

Align the fused embeddings to English GloVe space using Linear Regression.

In [16]:
# Run alignment and evaluation
!python3 ../scripts/04_align_embeddings.py

Loading Fused Model from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fused_embeddings_1068d.kv...
2025-11-21 12:05:02,844 : INFO : loading KeyedVectors object from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fused_embeddings_1068d.kv
2025-11-21 12:05:02,864 : INFO : KeyedVectors lifecycle event {'fname': '/Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fused_embeddings_1068d.kv', 'datetime': '2025-11-21T12:05:02.846984', 'gensim': '4.4.0', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'loaded'}
Loading GloVe from /Users/crashy/Development/heiroglyphy/heiro_v5_getdata/data/processed/glove.6B.300d.txt...
2025-11-21 12:05:02,865 : INFO : loading projection weights from /Users/crashy/Development/heiroglyphy/heiro_v5_getdata/data/processed/glove.6B.300d.txt
2025-11-21 12:06:31,082 : INFO : KeyedVectors lifecycle event {'msg': 'loade
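
The alignment can be sketched as an ordinary least-squares fit followed by nearest-neighbour retrieval. This is an illustration with NumPy on synthetic data, not the contents of `04_align_embeddings.py`; the function names, the absence of a bias term, and the cosine-similarity ranking are assumptions.

```python
import numpy as np


def fit_alignment(X, Y):
    """Least-squares W such that X @ W approximates Y (no bias term)."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W


def top_k_accuracy(X, W, target_vecs, gold_idx, k=1):
    """Fraction of rows whose gold target is among the k nearest by cosine."""
    pred = X @ W
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    targets = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    sims = pred @ targets.T                      # (n_samples, n_targets)
    top_k = np.argsort(-sims, axis=1)[:, :k]     # indices of the k best targets
    hits = (top_k == np.asarray(gold_idx)[:, None]).any(axis=1)
    return float(hits.mean())


# Synthetic stand-ins: 50 anchors, 1068d source, 200 target words in 300d.
rng = np.random.default_rng(0)
english = rng.normal(size=(200, 300))
gold = rng.integers(0, 200, size=50)
X = rng.normal(size=(50, 1068))
W = fit_alignment(X, english[gold])
print(top_k_accuracy(X, W, english, gold, k=1))
```

With fewer training anchors than input dimensions, as in this toy setup, the least-squares fit reproduces the training pairs almost exactly, so only accuracy on held-out anchors says anything about the quality of the mapping.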

## Results

Load and display the evaluation results.

In [None]:
results_path = BASE_DIR / 'data/processed/alignment_results_v7.json'

if results_path.exists():
    with open(results_path, 'r') as f:
        results = json.load(f)
    
    print("=" * 70)
    print("V7 FastText + Visual Embeddings Results")
    print("=" * 70)
    print(f"Model: {results['model']}")
    print(f"Test Samples: {results['test_samples']}")
    print()
    print(f"Top-1 Accuracy:  {results['top1_accuracy']:.2f}%")
    print(f"Top-5 Accuracy:  {results['top5_accuracy']:.2f}%")
    print(f"Top-10 Accuracy: {results['top10_accuracy']:.2f}%")
    print()
    print(f"R² Score (Train): {results['r2_train']:.4f}")
    print(f"R² Score (Test):  {results['r2_test']:.4f}")
    print()
    print("Comparison:")
    print(f"  V5 Baseline: 24.53%")
    print(f"  V6 BERT:     0.47%")
    print(f"  V7 (This):   {results['top1_accuracy']:.2f}%")
    print()
    print("⚠️  WARNING: Only 15/8541 anchors matched!")
    print("   Root cause: Vocabulary mismatch (MdC codes vs transliteration)")
else:
    print("Results file not found. Run the alignment script first.")

## Analysis

### What Worked
- **Clean Data**: Extracted 35k hieroglyphic sequences from BBAW parquet
- **FastText**: Learned 2,551 glyph embeddings from MdC codes
- **Visual Fusion**: Successfully matched 37.7% of glyphs to visual embeddings

### Critical Issue: Vocabulary Mismatch
**Problem**: FastText learned MdC codes (e.g., `G43`, `M17`) from the BBAW `hieroglyphs` column, but anchors use transliteration (e.g., `n`, `m`, `zꜣ`) from the `transcription` column.

**Impact**: Only 15 of 8,541 anchors (about 0.18%) matched → 0% accuracy

**Root Cause**: The BBAW dataset has TWO representations:
1. **`hieroglyphs`**: MdC codes like "D21 Q3 D36" (what we trained on)
2. **`transcription`**: Transliteration like "jr,j-pꜥ,t" (what anchors use)
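
A mismatch like this can be caught before alignment by measuring how many anchor source words exist in the embedding vocabulary. The sketch below is a small illustration; `anchor_coverage` is a hypothetical helper, and the sample vocabulary and anchors are toy values built from the examples above, not read from the project's data.

```python
def anchor_coverage(anchor_words, vocab):
    """Return (matched, total, fraction) of anchor words found in vocab."""
    vocab = set(vocab)
    total = len(anchor_words)
    matched = sum(1 for word in anchor_words if word in vocab)
    return matched, total, (matched / total if total else 0.0)


# Toy data: the vocabulary holds MdC sign codes, the anchors transliterations.
mdc_vocab = {"G43", "M17", "D21", "Q3", "D36"}
anchors = ["n", "m", "zꜣ"]
matched, total, frac = anchor_coverage(anchors, mdc_vocab)
print(f"{matched}/{total} anchors matched ({frac:.1%})")
```

Running a check of this kind right after training, against the real anchor list, would flag the problem before any time is spent on fusion and alignment.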

### Next Steps
**Recommended**: Retrain FastText on the `transcription` column so that the model vocabulary matches the anchor vocabulary.

See `v7_results_analysis.md` for detailed analysis and options.