# V5 Data Strategy: Leveraging the Combined Corpus

## Context
We now have a **significantly larger dataset** than previous attempts:
- **TLA**: ~12,773 texts (German translations)
- **Ramses**: ~9,644 texts (French/multilingual)
- **BBAW**: ~100,736 texts (German translations + **hieroglyphs**)
- **Total**: ~123k texts → ~104k after deduplication
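
The deduplication step is not described here, so the following is only a minimal sketch of one way it could work, assuming duplicates are identified by a normalized transliteration string. The helper names `normalize_key` and `deduplicate` and the toy records are hypothetical, not the actual pipeline.

```python
import unicodedata

def normalize_key(transliteration: str) -> str:
    """Build a comparison key: Unicode-normalize (NFC) and collapse whitespace."""
    text = unicodedata.normalize("NFC", transliteration)
    return " ".join(text.split())

def deduplicate(records):
    """Keep the first record seen for each normalized transliteration."""
    seen = set()
    unique = []
    for record in records:
        key = normalize_key(record["transliteration"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Toy records (illustrative only, not taken from the corpus)
records = [
    {"transliteration": "jw=f  m pr", "source": "TLA"},
    {"transliteration": "jw=f m pr", "source": "BBAW"},
    {"transliteration": "nb tꜣ", "source": "Ramses"},
]
unique_records = deduplicate(records)
print(len(records), "->", len(unique_records))
```

Because the first occurrence wins, the order in which sources are concatenated decides which copy of a duplicated text survives.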

## The German Problem
Most of our data has **German** translations, not English. This creates a challenge for our goal of mapping hieroglyphs to **English** meanings.

## Previous Approaches (V1-V4)
### V3 (Most Successful: ~22% accuracy)
- Used **Orthogonal Procrustes** (linear alignment)
- Required ~1,362 **anchor pairs** (hieroglyphic ↔ English)
- Anchors extracted via co-occurrence analysis
- **Key insight**: Context-based alignment works, but needs good anchors

### V3's Data Pipeline
1. German → English translation (machine translation)
2. Build co-occurrence matrix between hieroglyphic words and English words
3. Extract high-confidence pairs as anchors
4. Train FastText on hieroglyphic corpus
5. Train Word2Vec on English corpus (Wikipedia)
6. Align using Procrustes with anchors
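
Step 6 can be sketched as follows. This is a minimal illustration of Orthogonal Procrustes on synthetic anchor vectors, not V3's actual code; the function name `procrustes_align`, the dimensions, and the random data are assumptions made for the example.

```python
import numpy as np

def procrustes_align(src_anchors: np.ndarray, tgt_anchors: np.ndarray) -> np.ndarray:
    """Return the orthogonal W that minimizes ||src_anchors @ W - tgt_anchors||_F."""
    # SVD of the cross-covariance between source and target anchor vectors
    u, _, vt = np.linalg.svd(src_anchors.T @ tgt_anchors)
    return u @ vt

# Synthetic check: the target space is an exact rotation of the source space
rng = np.random.default_rng(0)
dim, n_anchors = 8, 50
src_anchors = rng.normal(size=(n_anchors, dim))           # stand-in for hieroglyphic vectors
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
tgt_anchors = src_anchors @ true_rotation                 # stand-in for English vectors

W = procrustes_align(src_anchors, tgt_anchors)
print("Recovered the rotation:", np.allclose(W, true_rotation))
```

With real embeddings the anchors are noisy, so the fit is only approximate, which is why anchor quality matters as much as anchor count.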

## V5 Strategy Options

### Option 1: Machine Translation (V3 approach, scaled up)
**Pros:**
- Proven to work (V3 achieved 22%)
- ~8x more data (104k vs. ~12.8k texts) = potentially better embeddings
- Can reuse V3 pipeline

**Cons:**
- Translation errors compound
- Loses nuance (German → English machine translation is lossy)
- Expensive (API costs for 100k+ texts)

### Option 2: Trilingual Alignment (Hieroglyphic ↔ German ↔ English)
**Pros:**
- No translation needed for German
- Can use German as a "bridge" language
- Preserves original scholarly translations

**Cons:**
- More complex (need German embeddings)
- Two alignment steps (more error propagation)
- Still need English corpus

### Option 3: Multilingual Embeddings (Modern approach)
**Pros:**
- Use pre-trained multilingual models (mBERT, XLM-R)
- German/English already aligned in the model
- Contextual subword models, the current standard for multilingual NLP

**Cons:**
- Hieroglyphic transliteration not in pre-trained vocab
- Need to fine-tune on our corpus
- More complex than Word2Vec/FastText

### Option 4: Hybrid Approach (Recommended)
**Strategy:**
1. **Keep German** for the majority of data
2. **Translate only anchor candidates** (high-frequency words)
3. Use the **hieroglyphs column** from BBAW for richer context (populated for ~30% of records)
4. Build **German-aligned** hieroglyphic embeddings
5. Use **pre-trained German-English alignment** (from mBERT or similar)
6. Map: Hieroglyphic → German → English

**Why this should work:**
- Minimizes translation cost (only ~2k anchor words)
- Leverages scholarly German translations (more accurate than machine-translated English)
- Uses modern multilingual models for German↔English
- Preserves hieroglyphic context

## Next Steps
1. **Analyze the corpus** (this notebook)
2. **Extract German anchors** (high-confidence hieroglyphic ↔ German pairs)
3. **Translate anchors** to English (small, manageable set)
4. **Train hieroglyphic embeddings** (FastText on full corpus)
5. **Align to German** (Procrustes with German anchors)
6. **Bridge to English** (using pre-trained German-English alignment)
7. **Evaluate** against V3 results

In [1]:
import pandas as pd
import json
from pathlib import Path
from collections import Counter
import matplotlib.pyplot as plt

# Load the combined corpus
corpus_path = Path('../data/processed/hieroglyphic_corpus_full.tsv')
df = pd.read_csv(corpus_path, sep='\t')

print(f"Total records: {len(df):,}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nSource distribution:")
print(df['source'].value_counts())

Total records: 104,426

Columns: ['transliteration', 'translation', 'source', 'metadata', 'hieroglyphs', 'transliteration_clean']

Source distribution:
source
BBAW (HuggingFace)    87100
TLA (HuggingFace)      9088
Ramses Online          8238
Name: count, dtype: int64


In [2]:
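# Analyze translation availability
print("Translation Coverage:")
print(f"Records with translation: {df['translation'].notna().sum():,}")
# read_csv loads empty fields as NaN by default, so missing translations
# are excluded by notna() above and are not caught by this empty-string check
print(f"Empty translations: {(df['translation'] == '').sum():,}")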

# Check hieroglyphs column (from BBAW)
if 'hieroglyphs' in df.columns:
    print(f"\nRecords with hieroglyphs: {df['hieroglyphs'].notna().sum():,}")
    print(f"Percentage: {df['hieroglyphs'].notna().sum() / len(df) * 100:.1f}%")

Translation Coverage:
Records with translation: 95,314
Empty translations: 0

Records with hieroglyphs: 30,992
Percentage: 29.7%


In [3]:
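# Vocabulary analysis
# Whitespace tokens from the raw 'transliteration' column, so numerals and
# suffix pronouns such as '=f' are counted as separate words
all_words = ' '.join(df['transliteration'].dropna()).split()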
word_freq = Counter(all_words)

print(f"Total tokens: {len(all_words):,}")
print(f"Unique words: {len(word_freq):,}")
print(f"\nTop 20 most common hieroglyphic words:")
for word, count in word_freq.most_common(20):
    print(f"  {word}: {count:,}")

Total tokens: 872,070
Unique words: 87,636

Top 20 most common hieroglyphic words:
  n: 37,473
  =f: 37,217
  m: 32,386
  =k: 26,900
  ḥr: 14,064
  r: 13,849
  =j: 13,735
  jw: 10,208
  1: 7,030
  pꜣ: 6,536
  =s: 5,847
  =sn: 5,735
  nb: 5,403
  tꜣ: 4,421
  2: 4,409
  sw: 3,980
  pw: 3,707
  n(,j): 3,657
  jr: 3,646
  jm: 3,508


In [4]:
# Comparison with V3 dataset
V3_TEXTS = 12_773  # approximate V3 corpus size
V3_VOCAB = 7_174   # approximate V3 vocabulary size
print("="*50)
print("V3 vs V5 Comparison")
print("="*50)
print(f"V3 corpus size: ~{V3_TEXTS:,} texts")
print(f"V5 corpus size: {len(df):,} texts")
print(f"Increase: {len(df) / V3_TEXTS:.1f}x")
print(f"\nV3 vocabulary: ~{V3_VOCAB:,} unique words")
print(f"V5 vocabulary: {len(word_freq):,} unique words")
print(f"Increase: {len(word_freq) / V3_VOCAB:.1f}x")

V3 vs V5 Comparison
V3 corpus size: ~12,773 texts
V5 corpus size: 104,426 texts
Increase: 8.2x

V3 vocabulary: ~7,174 unique words
V5 vocabulary: 87,636 unique words
Increase: 12.2x


## Recommended Action Plan

### Phase 1: Anchor Extraction (German)
- Build co-occurrence matrix: hieroglyphic ↔ German
- Extract ~2,000 high-confidence pairs
- Filter for words that appear in V3's successful anchors

### Phase 2: Selective Translation
- Translate only the ~2,000 anchor German words to English
- Use DeepL or Google Translate API
- Manual review of top 100 most frequent

### Phase 3: Embedding Training
- Train FastText on full hieroglyphic corpus (104k texts)
- Use hieroglyphs column for additional context
- Larger embedding dimension (300 → 500?)
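
FastText builds each word vector from character n-grams, which is relevant for a vocabulary with many rare forms. The sketch below only illustrates that n-gram decomposition (using FastText's usual boundary markers `<` and `>` and its common 3 to 6 range); it does not train a model, and the function name `char_ngrams` and the example words are chosen for illustration.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # range() is empty when the wrapped word is shorter than n
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

# Two related forms share subword units, so their vectors share parameters
shared = set(char_ngrams("nb")) & set(char_ngrams("nb.t"))
print(char_ngrams("jw"))
print(shared)
```

Very short tokens such as `n` or `m` produce almost no n-grams, so their vectors depend mainly on the whole-word entry rather than on subword information.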

### Phase 4: Alignment
- Primary: Hieroglyphic → English (via translated anchors)
- Fallback: Hieroglyphic → German → English (bridge)
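
The fallback bridge can be sketched as a composition of two maps. This assumes both steps are orthogonal linear maps between static embedding spaces of equal dimension, which may not hold if the German-English step comes from a contextual model; the matrices here are random stand-ins, not trained alignments.

```python
import numpy as np

def random_orthogonal(dim: int, rng: np.random.Generator) -> np.ndarray:
    """Draw a random orthogonal matrix via QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q

rng = np.random.default_rng(42)
dim = 6
W_hiero_to_de = random_orthogonal(dim, rng)  # stand-in for the Procrustes map
W_de_to_en = random_orthogonal(dim, rng)     # stand-in for a German -> English map

# Composing the two steps gives a single hieroglyphic -> English map
W_bridge = W_hiero_to_de @ W_de_to_en

hiero_vec = rng.normal(size=dim)
two_step = (hiero_vec @ W_hiero_to_de) @ W_de_to_en
one_step = hiero_vec @ W_bridge
print("Same result:", np.allclose(two_step, one_step))
print("Still orthogonal:", np.allclose(W_bridge.T @ W_bridge, np.eye(dim)))
```

The composed map stays orthogonal, so distances are preserved, but errors in the anchors of either step still carry through to the final English neighbors.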

### Phase 5: Evaluation
- Test on V3's evaluation set
- Target: exceed V3's ~22% accuracy
- Analyze: Does the ~8x larger corpus improve Anubis-type discoveries?
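
How V3's accuracy was computed is not stated here. One plausible reading is top-k retrieval accuracy: map each test word into the target space and check whether the gold translation is among its k nearest neighbors by cosine similarity. The sketch below assumes that definition; `precision_at_k` and the toy vectors are illustrative.

```python
import numpy as np

def precision_at_k(mapped_src, tgt_matrix, gold_indices, k=1):
    """Fraction of source words whose gold target is among the k nearest targets."""
    # Normalize rows so the dot product equals cosine similarity
    src = mapped_src / np.linalg.norm(mapped_src, axis=1, keepdims=True)
    tgt = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    similarity = src @ tgt.T
    # Indices of the k most similar target words for each source word
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = [gold in row for gold, row in zip(gold_indices, top_k)]
    return float(np.mean(hits))

# Toy data: three target words and three already-mapped source vectors
tgt_matrix = np.eye(3)
mapped_src = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.6, 0.1, 0.5],  # gold is index 2, but index 0 is closer
])
gold_indices = [0, 1, 2]

p_at_1 = precision_at_k(mapped_src, tgt_matrix, gold_indices, k=1)
p_at_2 = precision_at_k(mapped_src, tgt_matrix, gold_indices, k=2)
print(f"P@1 = {p_at_1:.2f}, P@2 = {p_at_2:.2f}")
```

Reporting both P@1 and a looser P@k makes it easier to tell whether a miss is a near miss or a complete failure of the alignment.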