# V8 Anchor Enhancement: Egyptian → Coptic → English

## Goal
Combine Egyptian-Coptic cognates with Coptic-English Bible mappings to create enhanced Egyptian-English anchors.

## Strategy
1. Load Egyptian-Coptic cognates from ThotBank
2. Load Coptic-English word pairs from OPUS Bible
3. Chain mappings: Egyptian → Coptic → English
4. Merge with V7 baseline anchors
5. Analyze coverage improvement
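
Step 3 of the strategy can be sketched as follows. This is a minimal illustration rather than the code this notebook runs: `chain_anchors` is a hypothetical helper, the record fields follow the samples printed in the Load Data section, and picking the most frequent English pairing for each Coptic form is an assumption. The toy counts are invented.

```python
def chain_anchors(cognates, coptic_english):
    """Map Egyptian words to English through shared Coptic forms (hypothetical helper)."""
    # For each Coptic form, keep the English word with the highest co-occurrence count.
    best = {}
    for pair in coptic_english:
        current = best.get(pair['coptic'])
        if current is None or pair['count'] > current[1]:
            best[pair['coptic']] = (pair['english'], pair['count'])

    chained = {}
    for cognate in cognates:
        # A cognate entry may list several Coptic variants separated by commas.
        variants = [v.strip() for v in cognate['coptic'].split(',') if v.strip()]
        hits = [best[v] for v in variants if v in best]
        if hits and cognate['egyptian'] not in chained:
            # Assumption: prefer the variant whose English pairing is most frequent.
            chained[cognate['egyptian']] = max(hits, key=lambda h: h[1])[0]
    return chained


# Toy records shaped like the notebook's data (counts invented for illustration).
toy_cognates = [{'egyptian': 'rn', 'coptic': 'ⲣⲁⲛ, ⲣⲉⲛ'}]
toy_pairs = [
    {'coptic': 'ⲣⲁⲛ', 'english': 'name', 'count': 40},
    {'coptic': 'ⲣⲁⲛ', 'english': 'the', 'count': 3},
]
print(chain_anchors(toy_cognates, toy_pairs))
```

Taking the highest count is the simplest tie-breaker; frequent function words can still win for a Coptic form, which is the kind of noise a count threshold or stop-word filter would have to handle.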

In [21]:
import json
from pathlib import Path
from collections import defaultdict
import pandas as pd

# Paths
PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
PROCESSED_DIR = PROJECT_ROOT / 'data/processed'

# Input files
COGNATES_PATH = PROCESSED_DIR / 'egyptian_coptic_cognates.json'
COPTIC_ENG_PATH = PROCESSED_DIR / 'coptic_english_word_pairs.json'

# V7 baseline anchors
V7_ANCHORS_PATH = PROJECT_ROOT.parent / 'heiro_v7_FastTextVisual/data/processed/anchors.json'

print(f'Project root: {PROJECT_ROOT}')
print(f'Cognates file exists: {COGNATES_PATH.exists()}')
print(f'Coptic-English file exists: {COPTIC_ENG_PATH.exists()}')
print(f'V7 anchors file exists: {V7_ANCHORS_PATH.exists()}')

Project root: /Users/crashy/Development/heiroglyphy/heiro_v8_use_coptic
Cognates file exists: True
Coptic-English file exists: True
V7 anchors file exists: True


## Load Data

In [22]:
# Load Egyptian-Coptic cognates
with open(COGNATES_PATH, 'r', encoding='utf-8') as f:
    cognates = json.load(f)

print(f'Loaded {len(cognates)} Egyptian-Coptic cognate pairs')
print(f'Sample: {cognates[0]}')

Loaded 2326 Egyptian-Coptic cognate pairs
Sample: {'egyptian': 'bḏꜣ', 'egyptian_meaning': "crucible, bread mold; mold, baker's oven, [a (measuring) vessel], [a copper jug], [a mold for making figures of Osiris]", 'coptic': 'ⲃⲏⲧⲉ, ϩⲃⲏⲧⲉ', 'coptic_meaning': 'scale-like plate', 'entry_id': '0', 'coptic_id': 'C553'}


In [23]:
# Load Coptic-English word pairs
with open(COPTIC_ENG_PATH, 'r', encoding='utf-8') as f:
    coptic_english = json.load(f)

print(f'Loaded {len(coptic_english)} Coptic-English word pairs')
print(f'Sample: {coptic_english[0]}')

Loaded 5231 Coptic-English word pairs
Sample: {'coptic': 'ⲟⲩⲟϩ', 'english': 'and', 'count': 6532}


In [24]:
# Load V7 baseline anchors
with open(V7_ANCHORS_PATH, 'r', encoding='utf-8') as f:
    v7_anchors_list = json.load(f)

# Convert list to dictionary for easier lookup
v7_anchors = {item['hieroglyphic']: item['english'] for item in v7_anchors_list}

print(f'Loaded {len(v7_anchors)} V7 baseline anchors')
print(f'Sample: {list(v7_anchors.items())[0]}')

Loaded 8541 V7 baseline anchors
Sample: ('n', 'the')


## Extract Anchors from ThotBank

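Use ThotBank's English glosses for the Egyptian words directly instead of chaining through the Coptic-English Bible pairs, whose co-occurrence counts are noisy. The `coptic_english` pairs loaded above are therefore not used in the rest of this notebook.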

In [25]:
# Extract anchors directly from ThotBank Egyptian meanings
# ThotBank already has Egyptian words with English translations!
new_anchors = {}
match_stats = {
    'total_cognates': len(cognates),
    'with_english_meaning': 0,
    'new_anchors_created': 0
}

for cognate in cognates:
    egy_word = cognate['egyptian']
    egy_meaning = cognate['egyptian_meaning']

    # Skip entries that lack an Egyptian word or an English gloss
    if not (egy_word and egy_meaning and egy_meaning.strip()):
        continue
    match_stats['with_english_meaning'] += 1

    # Remove the bracket characters but keep the bracketed text
    clean_meaning = egy_meaning.strip().replace('[', '').replace(']', '')

    # Keep only the first comma-separated gloss.
    # Semicolon-separated senses and parenthetical notes are left as-is.
    if ',' in clean_meaning:
        clean_meaning = clean_meaning.split(',')[0].strip()

    # The first entry for an Egyptian word wins; later duplicates are ignored,
    # so the anchor count can be lower than the number of glossed entries.
    if clean_meaning and egy_word not in new_anchors:
        new_anchors[egy_word] = clean_meaning
        match_stats['new_anchors_created'] += 1

print('ThotBank direct extraction results:')
print(f"  Total cognates: {match_stats['total_cognates']}")
print(f"  With English meanings: {match_stats['with_english_meaning']}")
print(f"  New anchors created: {match_stats['new_anchors_created']}")
print("\nSample new anchors:")
for i, (egy, eng) in enumerate(list(new_anchors.items())[:20], start=1):
    print(f"  {i:2d}. {egy:25s} → {eng}")

ThotBank direct extraction results:
  Total cognates: 2326
  With English meanings: 2246
  New anchors created: 809

Sample new anchors:
   1. bḏꜣ                       → crucible
   2. bgs                       → to injure; to be injured; to be disloyal
   3. bhꜣ                       → fan
   4. bẖ                        → Buchis (sacred bull of Armant)
   5. bḥz                       → calf
   6. bj                        → bee
   7. bjk                       → falcon
   8. bjn                       → harp
   9. bjꜣj                      → amazement; confusion
  10. bkꜣ                       → to be (become) pregnant; to make pregnant
  11. bny                       → sweet
  12. bnw                       → to escape; to depart
  13. bl                        → outside
  14. bq                        → to be hostile
  15. bꜣ                        → eyeball; eye
  16. br                        → a mullet
  17. brg                       → a semi-precious stone (beryl?)
  18. brq     
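
Several sample anchors still carry semicolon-separated senses and parenthetical notes, because the extraction cell only splits on commas. A stricter cleaner could look like the sketch below. `first_gloss` is a hypothetical helper, and whether a single short sense makes a better anchor than the full gloss is an assumption that would need checking against the alignment results.

```python
import re


def first_gloss(meaning):
    """Return the first sense of a dictionary gloss (hypothetical helper)."""
    # Remove bracket characters but keep the bracketed text, as the extraction cell does.
    text = meaning.replace('[', '').replace(']', '')
    # Drop parenthetical notes such as "(sacred bull of Armant)".
    text = re.sub(r'\([^)]*\)', ' ', text)
    # Keep only the text before the first comma or semicolon.
    text = re.split(r'[,;]', text, maxsplit=1)[0]
    # Collapse whitespace left behind by the removals.
    return ' '.join(text.split())


print(first_gloss('to injure; to be injured; to be disloyal'))
print(first_gloss('Buchis (sacred bull of Armant)'))
print(first_gloss('a semi-precious stone (beryl?)'))
```

Parentheticals are removed before splitting so that a comma inside a note does not truncate the gloss in the wrong place.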

## Merge with V7 Baseline Anchors

In [26]:
# Merge new anchors with V7 baseline
# V7 anchors take precedence (they're from direct Egyptian-English dictionaries)
enhanced_anchors = v7_anchors.copy()

# Add new anchors that don't conflict
new_additions = 0
for egy_word, eng_word in new_anchors.items():
    if egy_word not in enhanced_anchors:
        enhanced_anchors[egy_word] = eng_word
        new_additions += 1

print('Anchor enhancement results:')
print(f'  V7 baseline anchors: {len(v7_anchors)}')
print(f'  New anchors from Coptic bridge: {len(new_anchors)}')
print(f'  New additions (non-overlapping): {new_additions}')
print(f'  Total enhanced anchors: {len(enhanced_anchors)}')
print(f'  Improvement: +{new_additions} anchors ({100*new_additions/len(v7_anchors):.1f}%)')

Anchor enhancement results:
  V7 baseline anchors: 8541
  New anchors from Coptic bridge: 809
  New additions (non-overlapping): 368
  Total enhanced anchors: 8909
  Improvement: +368 anchors (4.3%)
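
Of the 809 extracted anchors, 809 − 368 = 441 belong to Egyptian words that already have a V7 entry, and the merge discards them without checking whether the two glosses agree. The sketch below shows one way to inspect that overlap. `compare_overlap` is a hypothetical helper, the case-insensitive exact match is an assumed notion of agreement, and the toy glosses are invented.

```python
def compare_overlap(baseline, candidates):
    """Split shared keys by whether both sources give the same gloss (hypothetical helper)."""
    agree, differ = {}, {}
    # Dictionary key views support set intersection.
    for word in baseline.keys() & candidates.keys():
        base, cand = baseline[word], candidates[word]
        # Assumption: glosses agree if they match after trimming and lowercasing.
        if base.strip().lower() == cand.strip().lower():
            agree[word] = base
        else:
            differ[word] = (base, cand)
    return agree, differ


# Toy dictionaries (glosses invented for illustration).
toy_v7 = {'bjk': 'falcon', 'bj': 'honey'}
toy_new = {'bjk': 'Falcon', 'bj': 'bee', 'bny': 'sweet'}
agree, differ = compare_overlap(toy_v7, toy_new)
print(agree)
print(differ)
```

In the notebook this would be called as `compare_overlap(v7_anchors, new_anchors)`; the `differ` entries are the ones worth reviewing by hand.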


## Analyze Coverage

In [27]:
# Compare anchor coverage
print('Coverage comparison:')
print(f'\nV7 Baseline:')
print(f'  Total anchors: {len(v7_anchors)}')

print(f'\nV8 Enhanced:')
print(f'  Total anchors: {len(enhanced_anchors)}')
print(f'  New anchors: {new_additions}')
print(f'  Coverage increase: {100*new_additions/len(v7_anchors):.2f}%')

Coverage comparison:

V7 Baseline:
  Total anchors: 8541

V8 Enhanced:
  Total anchors: 8909
  New anchors: 368
  Coverage increase: 4.31%
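
The percentages above measure growth in dictionary size, not how much of a text the anchors cover. If a tokenized Egyptian corpus is available, coverage can be measured directly, as in this sketch. `token_coverage` is a hypothetical helper and the token list is toy data; no such corpus is loaded in this notebook.

```python
from collections import Counter


def token_coverage(tokens, anchors):
    """Return (type coverage, token coverage) for a token list (hypothetical helper)."""
    counts = Counter(tokens)
    if not counts:
        return 0.0, 0.0
    # Type coverage: share of distinct words that have an anchor.
    covered_types = sum(1 for word in counts if word in anchors)
    # Token coverage: share of running words that have an anchor.
    covered_tokens = sum(n for word, n in counts.items() if word in anchors)
    return covered_types / len(counts), covered_tokens / sum(counts.values())


# Toy corpus: comparing two anchor sets on the same tokens.
toy_tokens = ['n', 'bjk', 'n', 'bny', 'n']
print(token_coverage(toy_tokens, {'n': 'the'}))
print(token_coverage(toy_tokens, {'n': 'the', 'bny': 'sweet'}))
```

Running this once with `v7_anchors` and once with `enhanced_anchors` would show whether the 368 additions are words that actually occur in the training text.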


## Save Enhanced Anchors

In [None]:
# Save enhanced anchors
OUTPUT_PATH = PROCESSED_DIR / 'enhanced_anchors.json'

with open(OUTPUT_PATH, 'w', encoding='utf-8') as f:
    json.dump(enhanced_anchors, f, ensure_ascii=False, indent=2)

print(f'Saved {len(enhanced_anchors)} enhanced anchors to {OUTPUT_PATH}')

# Also save the new anchors separately for analysis
NEW_ANCHORS_PATH = PROCESSED_DIR / 'coptic_bridge_anchors.json'
with open(NEW_ANCHORS_PATH, 'w', encoding='utf-8') as f:
    json.dump(new_anchors, f, ensure_ascii=False, indent=2)

print(f'Saved {len(new_anchors)} Coptic-bridge anchors to {NEW_ANCHORS_PATH}')

## Summary

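Created an enhanced anchor dictionary from ThotBank's Egyptian-Coptic cognate entries.

**Key Results**:
- Extracted 809 Egyptian-English anchors from the English glosses of the 2,326 ThotBank cognate entries; the planned Egyptian → Coptic → English chain through the Bible word pairs was not used
- Added 368 anchors that were not already in the V7 baseline
- Increased the anchor count from 8,541 to 8,909 (+4.3%)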

**Next Steps**:
1. Retrain the V7 alignment model with the enhanced anchors
2. Evaluate the retrained model on the test set
3. Compare its accuracy with the V7 baseline (29.10%)