# V7 FastText + Visual Embeddings Pipeline

## Goal
Combine **FastText** (text-based embeddings) with **visual CNN features** to build a richer hieroglyphic representation than either source alone, then align it with English GloVe embeddings.

## Strategy
1. **Data Cleaning**: Extract hieroglyphs from BBAW parquet dataset
2. **FastText Training**: Train 300d embeddings on cleaned corpus
3. **Visual Fusion**: Concatenate FastText (300d) + Visual (768d) = 1068d
4. **Alignment**: Linear regression to map 1068d → 300d English space
5. **Evaluation**: Test on anchor pairs, compare to V5 (24.53%) and V6 (0.2%)

In [None]:
import json
from pathlib import Path

# Setup paths
try:
    # If running as a script
    BASE_DIR = Path(__file__).resolve().parent.parent
except NameError:
    # If running in Jupyter, assume we are in 'notebooks/'
    BASE_DIR = Path.cwd().parent

print(f"Base Directory: {BASE_DIR}")

## Step 1: Data Cleaning

Extract hieroglyphic sequences from the BBAW parquet dataset.

**Data Source**: `heiro_v6_BERT/data/raw/bbaw_huggingface.parquet`  
**Output**: `data/processed/cleaned_corpus.txt`
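
As a rough sketch of what this cleaning step might involve (the real logic lives in `01_clean_and_tokenize.py` and is not reproduced here), the helper below takes raw values from the `hieroglyphs` column, collapses whitespace, and drops empty entries. The function name `clean_sequences` and the exact rules are assumptions made for illustration only.

```python
from typing import Iterable, List, Optional


def clean_sequences(raw_values: Iterable[Optional[str]]) -> List[str]:
    """Normalize raw `hieroglyphs` cells into space-separated token lines.

    Hypothetical helper: the actual cleaning rules may differ.
    """
    cleaned = []
    for value in raw_values:
        if value is None:
            continue  # skip missing cells
        tokens = value.split()  # split on any run of whitespace
        if tokens:  # drop cells that were empty or whitespace-only
            cleaned.append(" ".join(tokens))
    return cleaned


# Toy rows standing in for the parquet column
sample = ["D21  Q3 D36", None, "   ", "G43 M17\n"]
lines = clean_sequences(sample)
print(lines)
```

Each returned line could then be written to `cleaned_corpus.txt`, one sequence per line, which is the shape FastText training expects.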

In [None]:
# Run the cleaning script
!python3 ../scripts/01_clean_and_tokenize.py

In [None]:
# Verify output
!wc -l ../data/processed/cleaned_corpus.txt
!printf "\nFirst 3 lines:\n"
!head -n 3 ../data/processed/cleaned_corpus.txt

## Step 2: FastText Training

Train 300d FastText embeddings on the cleaned corpus.

In [None]:
# Train FastText model
!python3 ../scripts/02_train_fasttext.py

In [None]:
# Verify model was created
!ls -lh ../models/fasttext_v7.*

## Step 3: Visual Embedding Fusion

Combine FastText vectors with pre-computed visual embeddings from V6.
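
The fusion itself is a concatenation of the two vectors for each glyph. The sketch below illustrates that idea with plain Python lists. Because not every glyph has a visual embedding, some fallback is needed. Zero-padding is used here, but that choice, along with the function name `fuse`, is an assumption and may not match `03_fuse_embeddings.py`.

```python
from typing import Dict, List

TEXT_DIM = 300    # FastText dimensionality
VISUAL_DIM = 768  # visual embedding dimensionality


def fuse(
    text_vecs: Dict[str, List[float]],
    visual_vecs: Dict[str, List[float]],
    visual_dim: int = VISUAL_DIM,
) -> Dict[str, List[float]]:
    """Concatenate text and visual vectors per glyph.

    Glyphs without a visual vector get a zero block (assumed fallback).
    """
    fused = {}
    for glyph, text_vec in text_vecs.items():
        visual_vec = visual_vecs.get(glyph)
        if visual_vec is None:
            visual_vec = [0.0] * visual_dim
        fused[glyph] = list(text_vec) + list(visual_vec)
    return fused


# Toy vectors: only G43 has a visual embedding
text_vecs = {"G43": [0.1] * TEXT_DIM, "M17": [0.2] * TEXT_DIM}
visual_vecs = {"G43": [0.5] * VISUAL_DIM}
fused = fuse(text_vecs, visual_vecs)
print({glyph: len(vec) for glyph, vec in fused.items()})
```

One consequence of zero-padding is that glyphs without visual features are distinguished only by their text dimensions, which is worth keeping in mind when interpreting the aligned results.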

In [None]:
# Fuse embeddings
!python3 ../scripts/03_fuse_embeddings.py

In [None]:
# Verify fused model
!ls -lh ../models/fused_embeddings_1068d.kv

## Step 4: Alignment & Evaluation

Align the fused embeddings to English GloVe space using Linear Regression.
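
To make the alignment step concrete, here is a minimal sketch on synthetic data. It fits a linear map by ordinary least squares with NumPy, then scores top-k retrieval by cosine similarity. The toy dimensions, the helper names, and ranking against the test targets only are all assumptions. `04_align_embeddings.py` may use a different regression implementation and candidate set.

```python
import numpy as np


def fit_linear_map(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Least-squares fit of Y ≈ [X, 1] @ W (last row of W is the bias)."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
    return W


def apply_linear_map(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project source vectors into the target space."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    return X_aug @ W


def top_k_accuracy(pred: np.ndarray, targets: np.ndarray, k: int = 1) -> float:
    """Fraction of rows whose true target is among the k nearest by cosine."""
    pred_n = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    tgt_n = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    sims = pred_n @ tgt_n.T                    # (n, n) cosine similarities
    top_k = np.argsort(-sims, axis=1)[:, :k]   # k most similar targets per row
    hits = (top_k == np.arange(len(pred))[:, None]).any(axis=1)
    return float(hits.mean())


# Synthetic stand-in for anchor pairs (8d source, 4d target, noiseless)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
W_true = rng.normal(size=(8, 4))
Y = X @ W_true

W = fit_linear_map(X[:100], Y[:100])        # train on the first 100 pairs
pred = apply_linear_map(X[100:], W)         # project the held-out 20
acc = top_k_accuracy(pred, Y[100:], k=1)
print(f"Top-1 accuracy on toy data: {acc:.2f}")
```

With the real 1068d inputs, the fit only generalizes if the number of matched anchor pairs comfortably exceeds the input dimensionality. Otherwise the system is underdetermined and the training fit says little about held-out pairs.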

In [None]:
# Run alignment and evaluation
!python3 ../scripts/04_align_embeddings.py

## Results

Load and display the evaluation results.

In [None]:
results_path = BASE_DIR / 'data/processed/alignment_results_v7.json'

if results_path.exists():
    with open(results_path, 'r') as f:
        results = json.load(f)
    
    print("=" * 70)
    print("V7 FastText + Visual Embeddings Results")
    print("=" * 70)
    print(f"Model: {results['model']}")
    print(f"Test Samples: {results['test_samples']}")
    print()
    print(f"Top-1 Accuracy:  {results['top1_accuracy']:.2f}%")
    print(f"Top-5 Accuracy:  {results['top5_accuracy']:.2f}%")
    print(f"Top-10 Accuracy: {results['top10_accuracy']:.2f}%")
    print()
    print(f"R² Score (Train): {results['r2_train']:.4f}")
    print(f"R² Score (Test):  {results['r2_test']:.4f}")
    print()
    print("Comparison:")
    print("  V5 Baseline: 24.53%")
    print("  V6 BERT:     0.47%")
    print(f"  V7 (This):   {results['top1_accuracy']:.2f}%")
    print()
    print("⚠️  WARNING: Only 15/8541 anchors matched!")
    print("   Root cause: Vocabulary mismatch (MdC codes vs transliteration)")
else:
    print("Results file not found. Run the alignment script first.")

## Analysis

### What Worked
- **Clean Data**: Extracted 35k hieroglyphic sequences from BBAW parquet
- **FastText**: Learned 2,551 glyph embeddings from MdC codes
- **Visual Fusion**: Successfully matched 37.7% of glyphs to visual embeddings

### Critical Issue: Vocabulary Mismatch
**Problem**: FastText learned MdC codes (e.g., `G43`, `M17`) from the BBAW `hieroglyphs` column, but anchors use transliteration (e.g., `n`, `m`, `zꜣ`) from the `transcription` column.

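**Impact**: Only 15 of 8,541 anchors (about 0.18%) matched the FastText vocabulary, far too few pairs to fit a 1068d → 300d linear map, hence the 0% accuracy.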

**Root Cause**: The BBAW dataset has TWO representations:
1. **`hieroglyphs`**: MdC codes like "D21 Q3 D36" (what we trained on)
2. **`transcription`**: Transliteration like "jr,j-pꜥ,t" (what anchors use)
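
A mismatch like this can be caught before training the alignment by measuring how many anchor words appear in the embedding vocabulary. The sketch below does that on toy data built from the examples above. The helper name `vocab_overlap` and the tiny vocabularies are illustrative assumptions, not the project's actual data.

```python
from typing import Iterable, List, Tuple


def vocab_overlap(
    anchor_words: List[str], embedding_vocab: Iterable[str]
) -> Tuple[List[str], float]:
    """Return the anchors found in the vocabulary and the match rate."""
    vocab = set(embedding_vocab)
    matched = [word for word in anchor_words if word in vocab]
    rate = len(matched) / len(anchor_words) if anchor_words else 0.0
    return matched, rate


# Toy vocabularies: MdC codes vs transliteration
mdc_vocab = {"G43", "M17", "D21", "Q3", "D36"}
translit_vocab = {"n", "m", "zꜣ", "jr"}
anchors = ["n", "m", "zꜣ"]

_, rate_mdc = vocab_overlap(anchors, mdc_vocab)
_, rate_translit = vocab_overlap(anchors, translit_vocab)
print(f"MdC vocab match rate:             {rate_mdc:.1%}")
print(f"Transliteration vocab match rate: {rate_translit:.1%}")
```

Running a check like this right after FastText training, and stopping if the match rate is near zero, would surface the problem before any time is spent on fusion and alignment.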

### Next Steps
**Recommended**: Retrain FastText on the `transcription` column to match anchor vocabulary

See `v7_results_analysis.md` for detailed analysis and options.