# V8 Training Pipeline: Retraining Alignment with Enhanced Anchors

## Goal
Retrain the alignment model using the enhanced anchor dictionary (+368 new anchors from the Coptic bridge) and evaluate the improvement in coverage and accuracy.

## Process
1. **Load Data**: Load the enhanced anchors, Egyptian vectors (V7 FastText), and English vectors (GloVe 300d).
2. **Prepare Data**: Keep only the anchor pairs that have vectors on both sides.
3. **Train Alignment**: Fit a Ridge Regression that maps Egyptian vectors to English vectors.
4. **Evaluate**: Measure Top-1, Top-5, and Top-10 accuracy on the held-out test set.
5. **Save Results**: Write the final metrics to `results.json` for comparison.

In [17]:
import logging
import json
import numpy as np
from pathlib import Path
from gensim.models import KeyedVectors, FastText
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from tqdm import tqdm

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

## 1. Load Data

We load:
- **Enhanced Anchors**: From Phase 2
- **Egyptian Vectors**: V7 FastText model (`fasttext_v7.model`, falling back to `fasttext_v7.vec` if the model file is missing)
- **English Vectors**: GloVe 300d (`glove.6B.300d.txt`), the standard target space

In [18]:
def load_data(project_root):
    """Load necessary data for training."""
    logger.info("Loading data...")
    
    # Define Repository Root (parent of heiro_v8_use_coptic)
    repo_root = project_root.parent
    logger.info(f"Repository Root: {repo_root}")
    
    # 1. Load Enhanced Anchors (in current project)
    anchors_path = project_root / 'data/processed/enhanced_anchors.json'
    with open(anchors_path, 'r', encoding='utf-8') as f:
        anchors = json.load(f)
    logger.info(f"Loaded {len(anchors)} enhanced anchors")
    
    # 2. Load Egyptian Vectors (V7 Model in sibling directory)
    egy_model_path = repo_root / 'heiro_v7_FastTextVisual/models/fasttext_v7.model'
    logger.info(f"Loading Egyptian model from {egy_model_path}...")
    
    if egy_model_path.exists():
        model = FastText.load(str(egy_model_path))
        egy_vectors = model.wv
        logger.info(f"Loaded model with {len(egy_vectors)} vectors")
    else:
        # Fallback to .vec if model not found
        egy_vec_path = repo_root / 'heiro_v7_FastTextVisual/models/fasttext_v7.vec'
        logger.warning(f"Model not found, falling back to vectors at {egy_vec_path}")
        egy_vectors = KeyedVectors.load_word2vec_format(str(egy_vec_path))
    
    # 3. Load English Vectors (GloVe in sibling directory)
    glove_path = repo_root / 'heiro_v5_getdata/data/processed/glove.6B.300d.txt'
    logger.info(f"Loading GloVe vectors from {glove_path}...")
    # GloVe is text format, no header
    eng_vectors = KeyedVectors.load_word2vec_format(str(glove_path), binary=False, no_header=True)
    logger.info(f"Loaded {len(eng_vectors)} English vectors")
    
    return anchors, egy_vectors, eng_vectors

# Set project root
PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
print(f"Project Root: {PROJECT_ROOT}")

anchors, egy_vectors, eng_vectors = load_data(PROJECT_ROOT)

2025-11-22 10:52:09,456 - INFO - Loading data...
2025-11-22 10:52:09,457 - INFO - Repository Root: /Users/crashy/Development/heiroglyphy
2025-11-22 10:52:09,459 - INFO - Loaded 8579 enhanced anchors
2025-11-22 10:52:09,460 - INFO - Loading Egyptian model from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model...
2025-11-22 10:52:09,460 - INFO - loading FastText object from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model
2025-11-22 10:52:09,477 - INFO - loading wv recursively from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model.wv.* with mmap=None
2025-11-22 10:52:09,478 - INFO - loading vectors_vocab from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model.wv.vectors_vocab.npy with mmap=None
2025-11-22 10:52:09,574 - INFO - loading vectors_ngrams from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model.wv

Project Root: /Users/crashy/Development/heiroglyphy/heiro_v8_use_coptic


2025-11-22 10:52:12,225 - INFO - setting ignored attribute vectors to None
2025-11-22 10:52:12,248 - INFO - setting ignored attribute buckets_word to None
2025-11-22 10:52:17,186 - INFO - loading syn1neg from /Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model.syn1neg.npy with mmap=None
2025-11-22 10:52:17,295 - INFO - setting ignored attribute cum_table to None
2025-11-22 10:52:17,597 - INFO - FastText lifecycle event {'fname': '/Users/crashy/Development/heiroglyphy/heiro_v7_FastTextVisual/models/fasttext_v7.model', 'datetime': '2025-11-22T10:52:17.597018', 'gensim': '4.4.0', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'loaded'}
2025-11-22 10:52:17,597 - INFO - Loaded model with 80662 vectors
2025-11-22 10:52:17,598 - INFO - Loading GloVe vectors from /Users/crashy/Development/heiroglyphy/heiro_v5_getdata/data/processed/glove.6B.300d.txt...
2025-11-22 10:52:17

## 2. Prepare Training Data

Keep only the anchor pairs whose Egyptian word and lowercased English gloss both have vectors in their respective spaces.

In [20]:
def prepare_data(anchors, egy_vectors, eng_vectors):
    """Prepare X (Egyptian) and Y (English) matrices for alignment."""
    X = []
    Y = []
    valid_anchors = []
    
    for egy_word, eng_word in anchors.items():
        # Normalize English word (GloVe is lowercase)
        eng_word = eng_word.lower()
        
        if egy_word in egy_vectors and eng_word in eng_vectors:
            X.append(egy_vectors[egy_word])
            Y.append(eng_vectors[eng_word])
            valid_anchors.append((egy_word, eng_word))
            
    X = np.array(X)
    Y = np.array(Y)
    
    logger.info(f"Valid Anchors: {len(X)} / {len(anchors)}")
    return X, Y, valid_anchors

X, Y, valid_anchors = prepare_data(anchors, egy_vectors, eng_vectors)

2025-11-22 10:53:30,639 - INFO - Valid Anchors: 7508 / 8579
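
The run above kept 7,508 of 8,579 anchors (about 87.5%). One caveat worth checking: in gensim 4.x, the `in` operator on FastText vectors that use character n-grams is documented to return `True` for any string, because an out-of-vocabulary word still receives a vector built from its n-grams. If that applies to the V7 model, the filter above effectively tests only the English side, and some Egyptian vectors may be n-gram estimates, not trained word vectors. The sketch below separates these cases using explicit vocabulary containers. `split_anchors_by_vocab` and its argument names are hypothetical; with gensim, one would presumably pass `egy_vectors.key_to_index` and `eng_vectors.key_to_index` as the two vocabularies.

```python
def split_anchors_by_vocab(anchors, egy_vocab, eng_vocab):
    """Partition anchor pairs by whether each side is a trained vocabulary entry.

    anchors   : dict mapping Egyptian word -> English gloss
    egy_vocab : container of in-vocabulary Egyptian words (assumed input)
    eng_vocab : container of in-vocabulary lowercase English words (assumed input)
    """
    in_vocab, egy_oov, eng_missing = [], [], []
    for egy_word, eng_word in anchors.items():
        eng_word = eng_word.lower()  # GloVe 6B is lowercase
        if eng_word not in eng_vocab:
            # No English vector exists, so the pair cannot be used at all
            eng_missing.append((egy_word, eng_word))
        elif egy_word not in egy_vocab:
            # A FastText vector could still be synthesized from n-grams
            egy_oov.append((egy_word, eng_word))
        else:
            in_vocab.append((egy_word, eng_word))
    return in_vocab, egy_oov, eng_missing
```

Comparing `len(in_vocab)` with the 7,508 reported above would show how many training pairs rely on n-gram-only Egyptian vectors, if any.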


## 3. Train Alignment

Split the valid anchors 80/20 into train and test sets, then fit a Ridge Regression model (`alpha=1.0`) that maps Egyptian vectors to English vectors.

In [21]:
# Split Data
X_train, X_test, Y_train, Y_test, anchors_train, anchors_test = train_test_split(
    X, Y, valid_anchors, test_size=0.2, random_state=42
)

print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")

# Train Alignment
print("Training Ridge Regression Alignment...")
aligner = Ridge(alpha=1.0)
aligner.fit(X_train, Y_train)

print(f"R^2 Score on Train: {aligner.score(X_train, Y_train):.4f}")
print(f"R^2 Score on Test: {aligner.score(X_test, Y_test):.4f}")

Train size: 6006, Test size: 1502
Training Ridge Regression Alignment...
R^2 Score on Train: 0.0884
R^2 Score on Test: 0.0465
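
For reference, ridge regression with an unpenalized intercept has a closed-form solution: center $X$ and $Y$, then solve $(X_c^\top X_c + \alpha I)\,W = X_c^\top Y_c$ and set $b = \bar{y} - \bar{x}\,W$. The NumPy sketch below illustrates that computation as a reading aid. Assuming scikit-learn's defaults (`fit_intercept=True`), it should agree with `Ridge(alpha=1.0)` up to numerical tolerance, though that has not been verified against this notebook's data. `fit_ridge` and `predict_ridge` are hypothetical helper names.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form multi-output ridge with an unpenalized intercept."""
    x_mean = X.mean(axis=0)
    y_mean = Y.mean(axis=0)
    Xc = X - x_mean          # centering removes the intercept from the penalized problem
    Yc = Y - y_mean
    d = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) W = Xc^T Yc for the (d_in, d_out) weight matrix
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(d), Xc.T @ Yc)
    b = y_mean - x_mean @ W  # intercept recovered from the means
    return W, b

def predict_ridge(X, W, b):
    """Map source-space rows of X into the target space."""
    return X @ W + b
```

Larger `alpha` shrinks `W` toward zero, which trades training fit for stability when anchors are noisy.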


## 4. Evaluate Accuracy

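Calculate Top-1, Top-5, and Top-10 accuracy on the test set: a prediction counts as correct at k when the gold English word is among the k nearest GloVe neighbors of the mapped vector.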

In [22]:
def evaluate_accuracy(aligner, X_test, anchors_test, eng_vectors):
    """Evaluate accuracy by finding nearest neighbors in English space."""
    print("Evaluating on Test Set...")
    correct_top1 = 0
    correct_top5 = 0
    correct_top10 = 0
    total = len(X_test)
    
    # Predict all test vectors
    Y_pred = aligner.predict(X_test)
    
    for i in tqdm(range(total)):
        pred_vec = Y_pred[i]
        true_word = anchors_test[i][1]
        
        # Find nearest neighbors in GloVe
        neighbors = eng_vectors.similar_by_vector(pred_vec, topn=10)
        neighbor_words = [w for w, s in neighbors]
        
        if true_word == neighbor_words[0]:
            correct_top1 += 1
        if true_word in neighbor_words[:5]:
            correct_top5 += 1
        if true_word in neighbor_words[:10]:
            correct_top10 += 1
            
    acc_top1 = correct_top1 / total * 100
    acc_top5 = correct_top5 / total * 100
    acc_top10 = correct_top10 / total * 100
    
    print(f"\nResults (Test Set N={total}):")
    print(f"Top-1 Accuracy: {acc_top1:.2f}%")
    print(f"Top-5 Accuracy: {acc_top5:.2f}%")
    print(f"Top-10 Accuracy: {acc_top10:.2f}%")
    
    return acc_top1, acc_top5, acc_top10

acc_top1, acc_top5, acc_top10 = evaluate_accuracy(aligner, X_test, anchors_test, eng_vectors)

Evaluating on Test Set...


100%|██████████| 1502/1502 [00:26<00:00, 55.95it/s]


Results (Test Set N=1502):
Top-1 Accuracy: 28.16%
Top-5 Accuracy: 36.09%
Top-10 Accuracy: 40.28%
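
`similar_by_vector` ranks GloVe words by cosine similarity to each predicted vector, one query at a time. The same Top-k metric can be sketched in batched form with NumPy, which makes the definition explicit. `topk_accuracy`, `vocab_matrix`, and `gold_indices` are hypothetical names. In this notebook the vocabulary matrix would presumably be `eng_vectors.vectors` and each gold index `eng_vectors.key_to_index[true_word]`. The sketch assumes no zero-norm vectors and builds the full similarity matrix at once, so a vocabulary the size of GloVe would need chunking.

```python
import numpy as np

def topk_accuracy(Y_pred, vocab_matrix, gold_indices, ks=(1, 5, 10)):
    """Percentage of queries whose gold word is among the k most cosine-similar rows."""
    # L2-normalize both sides so dot products equal cosine similarities
    P = Y_pred / np.linalg.norm(Y_pred, axis=1, keepdims=True)
    V = vocab_matrix / np.linalg.norm(vocab_matrix, axis=1, keepdims=True)
    sims = P @ V.T  # shape: (n_queries, vocab_size)
    gold_sims = sims[np.arange(len(gold_indices)), gold_indices]
    # Rank of the gold word = number of vocabulary words strictly more similar
    # (ties are therefore resolved in favor of the gold word)
    ranks = (sims > gold_sims[:, None]).sum(axis=1)
    return {k: float((ranks < k).mean() * 100) for k in ks}
```

Because the rank is computed directly, any other cutoff can be reported without re-running the neighbor search.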





## 5. Save Results

In [24]:
# Save results
results = {
    'total_anchors': len(anchors),
    'valid_anchors': len(valid_anchors),
    'top1_accuracy': acc_top1,
    'top5_accuracy': acc_top5,
    'top10_accuracy': acc_top10,
    'status': 'success'
}

output_path = PROJECT_ROOT / 'results.json'
with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(results, f, indent=2)
    
logger.info(f"Results saved to {output_path}")
print(json.dumps(results, indent=2))

2025-11-22 11:05:23,897 - INFO - Results saved to /Users/crashy/Development/heiroglyphy/heiro_v8_use_coptic/results.json


{
  "total_anchors": 8579,
  "valid_anchors": 7508,
  "top1_accuracy": 28.1624500665779,
  "top5_accuracy": 36.085219707057256,
  "top10_accuracy": 40.279627163781626,
  "status": "success"
}
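
Since the saved metrics are meant for comparison with earlier versions, a small helper can report the change per metric. The sketch below assumes a baseline file with the same keys exists somewhere. `compare_results`, `load_results`, and the baseline path in the usage comment are hypothetical, and no baseline values are implied here.

```python
import json
from pathlib import Path

METRIC_KEYS = ("top1_accuracy", "top5_accuracy", "top10_accuracy")

def load_results(path):
    """Read a results JSON file written by a training notebook."""
    with open(Path(path), "r", encoding="utf-8") as f:
        return json.load(f)

def compare_results(baseline, current, keys=METRIC_KEYS):
    """Return current minus baseline, in percentage points, for each shared metric."""
    return {k: current[k] - baseline[k] for k in keys if k in baseline and k in current}

# Usage (hypothetical baseline path):
# deltas = compare_results(load_results("path/to/baseline_results.json"),
#                          load_results("results.json"))
```

Positive deltas indicate that the current run scores higher than the baseline on that metric.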
