# V8 Data Collection: Egyptian-Coptic Cognates

## Goal
Extract Egyptian-Coptic cognate pairs from the ThotBank dataset to map Egyptian words to their Coptic equivalents.

## Data Source
- **ThotBank**: Part of the Maduwwe project (GitHub)
- **Content**: Coptic words with suggested Egyptian etymologies
- **Format**: JSON with references to Černý's dictionary and TLA

## Data Download

Download the ThotBank JSON file from GitHub, skipping the download if a local copy already exists.

In [25]:
import subprocess
from pathlib import Path
import json

# Ensure data directory exists
PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
RAW_DATA_DIR = PROJECT_ROOT / 'data/raw'
RAW_DATA_DIR.mkdir(parents=True, exist_ok=True)

# ThotBank is in JSON format
THOTBANK_URL = 'https://raw.githubusercontent.com/MKilani/Maduwwe/master/ThotBank/Json_Database_25_3_2020_1.7.json'
THOTBANK_PATH = RAW_DATA_DIR / 'thotbank_cognates.json'

# Download if not already present
if not THOTBANK_PATH.exists():
    print('Downloading ThotBank Egyptian-Coptic cognates...')
    # --fail makes curl exit non-zero on HTTP errors (e.g. 404) instead of
    # saving the error page; -sS hides the progress meter but keeps errors
    result = subprocess.run(
        ['curl', '--fail', '-sS', '-L', THOTBANK_URL, '-o', str(THOTBANK_PATH)],
        capture_output=True,
        text=True
    )
    if result.returncode == 0:
        print(f'✓ Downloaded to {THOTBANK_PATH}')
        print(f'  Size: {THOTBANK_PATH.stat().st_size / 1024:.1f} KB')
    else:
        # Remove any partial file so the next run retries the download
        THOTBANK_PATH.unlink(missing_ok=True)
        print(f'✗ Download failed: {result.stderr}')
else:
    print(f'✓ File already exists: {THOTBANK_PATH}')
    print(f'  Size: {THOTBANK_PATH.stat().st_size / 1024:.1f} KB')

✓ File already exists: /Users/crashy/Development/heiroglyphy/heiro_v8_use_coptic/data/raw/thotbank_cognates.json
  Size: 5859.1 KB
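
Because the cell above reuses any file that already exists at `THOTBANK_PATH`, an interrupted or otherwise bad download from an earlier run would be picked up silently. A minimal sketch of a sanity check follows; the helper name `is_valid_json_file` is hypothetical and not part of the original notebook.

```python
import json
from pathlib import Path


def is_valid_json_file(path: Path) -> bool:
    """Return True if `path` is an existing, non-empty file that parses as JSON.

    Hypothetical helper: it only checks well-formedness, not the ThotBank schema.
    """
    if not path.is_file() or path.stat().st_size == 0:
        return False
    try:
        with open(path, 'r', encoding='utf-8') as f:
            json.load(f)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False
    return True
```

In this notebook it could be called as `is_valid_json_file(THOTBANK_PATH)` before loading; if it returns `False`, deleting the file and re-running the download cell is one way to recover.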


## Load and Inspect Data

In [26]:
# Load ThotBank JSON
with open(THOTBANK_PATH, 'r', encoding='utf-8') as f:
    thotbank_data = json.load(f)

print('Loaded ThotBank data')
print(f'Type: {type(thotbank_data)}')

# Inspect structure
if isinstance(thotbank_data, dict):
    print(f'Total entries: {len(thotbank_data)}')
    first_key = next(iter(thotbank_data))
    print(f'\nFirst entry ID: {first_key}')
    print(f'First entry keys: {list(thotbank_data[first_key].keys())}')
elif isinstance(thotbank_data, list):
    print(f'Total entries: {len(thotbank_data)}')
    if len(thotbank_data) > 0:
        print(f'First entry keys: {list(thotbank_data[0].keys())}')

Loaded ThotBank data
Type: <class 'dict'>
Total entries: 1345

First entry ID: 0
First entry keys: ['EGY_ROOT__ID', 'EGY_ROOT__TLA_ROOT', 'EGY_ROOT__form', 'EGY_ROOT__meaning_en', 'TLA__forms', 'CCL__forms', 'MATCHES']


In [27]:
# Show first entry in detail
first_entry = next(iter(thotbank_data.values()))
print('First entry structure:\n')
print(json.dumps(first_entry, indent=2, ensure_ascii=False)[:1000] + '...')

First entry structure:

{
  "EGY_ROOT__ID": "0",
  "EGY_ROOT__TLA_ROOT": {
    "869423": {
      "TLA_ROOT__ID": "869423",
      "TLA_ROOT__form": "bḏꜣ",
      "TLA_ROOT__meaning_en": "",
      "TLA_ROOT__meaning_de": "[Gussform]"
    }
  },
  "EGY_ROOT__form": "bḏꜣ",
  "EGY_ROOT__meaning_en": "",
  "TLA__forms": {
    "58550": {
      "TLA__ID": "58550",
      "TLA__form": "bḏꜣ",
      "TLA__meaning_en": "crucible",
      "TLA__meaning_de": "Gussform"
    },
    "58570": {
      "TLA__ID": "58570",
      "TLA__form": "bḏꜣ",
      "TLA__meaning_en": "bread mold; mold",
      "TLA__meaning_de": "Backform; Gussform (Topf aus gebranntem Ton)"
    },
    "58600": {
      "TLA__ID": "58600",
      "TLA__form": "bḏꜣ.t",
      "TLA__meaning_en": "baker's oven",
      "TLA__meaning_de": "Backofen"
    },
    "54070": {
      "TLA__ID": "54070",
      "TLA__form": "bꜣd.t",
      "TLA__meaning_en": "[a (measuring) vessel]",
      "TLA__meaning_de": "[ein Topf (als Maß)]"
    },
    "55100": {
  

## Analyze Data Structure

In [28]:
# Analyze JSON structure
print('Data structure analysis:\n')
print(f'Total entries: {len(thotbank_data)}')

# Analyze first entry to understand structure
first_entry = next(iter(thotbank_data.values()))
print(f'\nFirst entry keys: {list(first_entry.keys())}')

# Count entries with Coptic forms (CCL = Comprehensive Coptic Lexicon)
entries_with_coptic = sum(1 for entry in thotbank_data.values() if entry.get('CCL__forms'))
print(f'\nEntries with Coptic forms: {entries_with_coptic} / {len(thotbank_data)} ({100*entries_with_coptic/len(thotbank_data):.1f}%)')

# Count total Coptic forms
total_coptic_forms = sum(len(entry.get('CCL__forms', {})) for entry in thotbank_data.values())
print(f'Total Coptic forms: {total_coptic_forms}')

Data structure analysis:

Total entries: 1345

First entry keys: ['EGY_ROOT__ID', 'EGY_ROOT__TLA_ROOT', 'EGY_ROOT__form', 'EGY_ROOT__meaning_en', 'TLA__forms', 'CCL__forms', 'MATCHES']

Entries with Coptic forms: 1171 / 1345 (87.1%)
Total Coptic forms: 2326
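
The totals above (2326 forms across 1171 entries) work out to roughly two Coptic forms per entry that has any, but an average hides how the forms are spread. A minimal sketch of a per-entry distribution is shown below; the function name and the tiny `sample` dictionary are illustrative assumptions, and only the `CCL__forms` key is taken from the data.

```python
from collections import Counter


def coptic_form_distribution(data: dict) -> Counter:
    """Map 'number of CCL forms' to 'number of entries with that many forms'."""
    return Counter(len(entry.get('CCL__forms') or {}) for entry in data.values())


# Illustrative stand-in for thotbank_data (IDs and contents are made up)
sample = {
    '0': {'CCL__forms': {'a': {}}},
    '1': {'CCL__forms': {'b': {}, 'c': {}}},
    '2': {'CCL__forms': {}},
    '3': {},  # entry without the key at all
}

dist = coptic_form_distribution(sample)
for n_forms, n_entries in sorted(dist.items()):
    print(f'{n_forms} form(s): {n_entries} entries')
```

Calling `coptic_form_distribution(thotbank_data)` in the notebook would show how many Egyptian roots map to one, two, or more Coptic forms.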


## Show Sample Entries

In [29]:
# Show sample entries
print('Sample Egyptian-Coptic cognate entries:\n')

for sample_count, (entry_id, entry) in enumerate(list(thotbank_data.items())[:5], start=1):
    print(f"Entry {sample_count} (ID: {entry_id}):")
    
    # Show Egyptian root
    egy_form = entry.get('EGY_ROOT__form', 'N/A')
    egy_meaning = entry.get('EGY_ROOT__meaning_en', 'N/A')
    print(f"  Egyptian: {egy_form} ({egy_meaning})")
    
    # Show Coptic forms if available (CCL = Comprehensive Coptic Lexicon)
    coptic_forms = entry.get('CCL__forms', {})
    if coptic_forms:
        print(f"  Coptic forms ({len(coptic_forms)}):")
        for cop_id, cop_data in list(coptic_forms.items())[:3]:
            cop_form = cop_data.get('CCL__form', 'N/A')
            cop_meaning = cop_data.get('CCL__meaning_en', 'N/A')
            print(f"    - {cop_form} ({cop_meaning})")
    print()

Sample Egyptian-Coptic cognate entries:

Entry 1 (ID: 0):
  Egyptian: bḏꜣ ()
  Coptic forms (1):
    - ⲃⲏⲧⲉ, ϩⲃⲏⲧⲉ (scale-like plate)

Entry 2 (ID: 1):
  Egyptian: bgs ()
  Coptic forms (2):
    - ⲃⲁⲧⲥ, ⲃⲁⲥⲧ (combat, quarrel)
    - ⲃⲱⲧⲥ (array, war)

Entry 3 (ID: 2):
  Egyptian: bhꜣ ()
  Coptic forms (1):
    - ⲃⲟⲩϩⲉ, ⲥⲣⲉⲃⲣⲟⲩⲃⲉ (eyelid, eyebrows)

Entry 4 (ID: 3):
  Egyptian: bẖ ()
  Coptic forms (1):
    - ⲃⲱⲱϩ, ⲃⲟϩ (an idol at Alexandria)

Entry 5 (ID: 4):
  Egyptian: bhn ()
  Coptic forms (2):
    - ⲃⲱϩⲛ (cover)
    - ⲃⲱϩⲛ, ⲃⲁϩⲛ (canopy, awning)



## Extract Cognate Pairs

In [30]:
# Extract cognate pairs from JSON structure
cognate_pairs = []

for entry_id, entry in thotbank_data.items():
    egy_form = entry.get('EGY_ROOT__form', '')
    
    # Try to find meaning in EGY_ROOT first
    egy_meaning = entry.get('EGY_ROOT__meaning_en', '')
    
    # If not found, look in TLA__forms (Thesaurus Linguae Aegyptiae)
    if not egy_meaning:
        tla_forms = entry.get('TLA__forms', {})
        # Collect all English meanings
        meanings = []
        for tla_id, tla_data in tla_forms.items():
            m = tla_data.get('TLA__meaning_en', '')
            if m and m not in meanings:
                meanings.append(m)
        
        # Join them or take the first one
        if meanings:
            egy_meaning = ', '.join(meanings)
    
    # Get Coptic forms from CCL (Comprehensive Coptic Lexicon)
    coptic_forms = entry.get('CCL__forms', {})
    
    for cop_id, cop_data in coptic_forms.items():
        cop_form = cop_data.get('CCL__form', '')
        cop_meaning = cop_data.get('CCL__meaning_en', '')
        
        if egy_form and cop_form:
            cognate_pairs.append({
                'egyptian': egy_form,
                'egyptian_meaning': egy_meaning,
                'coptic': cop_form,
                'coptic_meaning': cop_meaning,
                'entry_id': entry_id,
                'coptic_id': cop_id
            })

print(f'Extracted {len(cognate_pairs)} Egyptian-Coptic cognate pairs')
print('\nFirst 10 pairs:')
for i, pair in enumerate(cognate_pairs[:10]):
    print(f"{i+1:2d}. {pair['egyptian']:20s} → {pair['coptic']:30s}  ({pair['egyptian_meaning']})")

Extracted 2326 Egyptian-Coptic cognate pairs

First 10 pairs:
 1. bḏꜣ                  → ⲃⲏⲧⲉ, ϩⲃⲏⲧⲉ                     (crucible, bread mold; mold, baker's oven, [a (measuring) vessel], [a copper jug], [a mold for making figures of Osiris])
 2. bgs                  → ⲃⲁⲧⲥ, ⲃⲁⲥⲧ                      (to injure; to be injured; to be disloyal, wrongdoing)
 3. bgs                  → ⲃⲱⲧⲥ                            (to injure; to be injured; to be disloyal, wrongdoing)
 4. bhꜣ                  → ⲃⲟⲩϩⲉ, ⲥⲣⲉⲃⲣⲟⲩⲃⲉ                (fan)
 5. bẖ                   → ⲃⲱⲱϩ, ⲃⲟϩ                       (Buchis (sacred bull of Armant))
 6. bhn                  → ⲃⲱϩⲛ                            ()
 7. bhn                  → ⲃⲱϩⲛ, ⲃⲁϩⲛ                      ()
 8. bḥz                  → ⲃⲁϩⲥⲉ                           (calf)
 9. bj                   → ⲉⲃⲓⲧ                            (bee, honey, [an aggressive insect (wasp? bee?)])
10. bj                   → ⲁϥ ⲛⲉⲃⲓⲱ, ⲁϥ ⲙⲃⲓⲟⲩ, ⲉⲃⲓⲱ        (bee, honey, [
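
Several `coptic` values in the output above hold more than one variant separated by commas (for example `ⲃⲁⲧⲥ, ⲃⲁⲥⲧ`). If later steps need one Coptic spelling per record, the pairs could be expanded along these lines. This is a sketch under the assumption that a comma always separates variants; `split_coptic_variants` is a hypothetical helper, not part of the original notebook.

```python
def split_coptic_variants(pairs):
    """Expand pairs whose 'coptic' field lists comma-separated variants.

    Each variant becomes its own record; all other fields are copied as-is.
    Assumes commas are used only as variant separators.
    """
    expanded = []
    for pair in pairs:
        for variant in pair['coptic'].split(','):
            variant = variant.strip()
            if variant:
                expanded.append({**pair, 'coptic': variant})
    return expanded


# Two records modeled on the 'bgs' rows printed above
sample_pairs = [
    {'egyptian': 'bgs', 'coptic': 'ⲃⲁⲧⲥ, ⲃⲁⲥⲧ', 'entry_id': '1'},
    {'egyptian': 'bgs', 'coptic': 'ⲃⲱⲧⲥ', 'entry_id': '1'},
]
expanded = split_coptic_variants(sample_pairs)
for pair in expanded:
    print(pair['egyptian'], '→', pair['coptic'])
```

Multi-word variants such as `ⲁϥ ⲛⲉⲃⲓⲱ` stay intact because only commas are treated as separators.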

## Save Processed Data

In [31]:
# Save cognate pairs
PROCESSED_DIR = PROJECT_ROOT / 'data/processed'
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

COGNATES_PATH = PROCESSED_DIR / 'egyptian_coptic_cognates.json'

with open(COGNATES_PATH, 'w', encoding='utf-8') as f:
    json.dump(cognate_pairs, f, ensure_ascii=False, indent=2)

print(f'Saved {len(cognate_pairs)} cognate pairs to {COGNATES_PATH}')

Saved 2326 cognate pairs to /Users/crashy/Development/heiroglyphy/heiro_v8_use_coptic/data/processed/egyptian_coptic_cognates.json
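
A quick round-trip check can confirm that the saved file reloads cleanly and that every record carries the fields written by the extraction cell. The sketch below is one possible check; `check_saved_pairs` and `REQUIRED_KEYS` are hypothetical names, while the key set mirrors the dictionary built during extraction.

```python
import json
from pathlib import Path

# Keys written for each record by the extraction cell
REQUIRED_KEYS = {
    'egyptian', 'egyptian_meaning', 'coptic',
    'coptic_meaning', 'entry_id', 'coptic_id',
}


def check_saved_pairs(path: Path) -> int:
    """Reload saved pairs and validate them; return the number of records.

    Raises ValueError if a record lacks a required key or has an empty
    Egyptian or Coptic form.
    """
    with open(path, 'r', encoding='utf-8') as f:
        pairs = json.load(f)
    for i, pair in enumerate(pairs):
        missing = REQUIRED_KEYS - pair.keys()
        if missing:
            raise ValueError(f'Record {i} is missing keys: {sorted(missing)}')
        if not pair['egyptian'] or not pair['coptic']:
            raise ValueError(f'Record {i} has an empty form')
    return len(pairs)
```

In this notebook, `check_saved_pairs(COGNATES_PATH)` would be expected to return the same count that the save cell printed.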


## Summary

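Extracted 2,326 Egyptian-Coptic cognate pairs from the ThotBank dataset (1,171 of its 1,345 entries have at least one Coptic form) and saved them to `data/processed/egyptian_coptic_cognates.json`.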

**Next Steps**:
1. Load Coptic-English mappings from `01_coptic_english_extraction.ipynb`
2. Create Egyptian → Coptic → English chain
3. Generate enhanced anchor dictionary
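
The chaining step listed above could look roughly like the sketch below. The format of the Coptic-English mappings from `01_coptic_english_extraction.ipynb` is not shown here, so a plain dictionary from Coptic form to a list of English glosses is assumed; the function name and the sample glosses are placeholders rather than real lexicon data.

```python
def chain_egyptian_to_english(cognate_pairs, coptic_to_english):
    """Map each Egyptian form to sorted English glosses reached via Coptic.

    `coptic_to_english` is assumed to be {coptic_form: [english_gloss, ...]}.
    Egyptian forms whose Coptic cognates have no gloss are left out.
    """
    anchors = {}
    for pair in cognate_pairs:
        glosses = coptic_to_english.get(pair['coptic'], [])
        if glosses:
            anchors.setdefault(pair['egyptian'], set()).update(glosses)
    return {egy: sorted(found) for egy, found in anchors.items()}


# Pairs in the shape produced by the extraction cell (other fields omitted)
sample_pairs = [
    {'egyptian': 'bj', 'coptic': 'ⲉⲃⲓⲧ'},
    {'egyptian': 'bḥz', 'coptic': 'ⲃⲁϩⲥⲉ'},
    {'egyptian': 'bhn', 'coptic': 'ⲃⲱϩⲛ'},
]
# Placeholder glosses for illustration only, not taken from any lexicon
sample_coptic_to_english = {
    'ⲉⲃⲓⲧ': ['honey'],
    'ⲃⲁϩⲥⲉ': ['calf'],
}

anchor_dict = chain_egyptian_to_english(sample_pairs, sample_coptic_to_english)
print(anchor_dict)
```

Exact-string lookup means a `coptic` value that lists several comma-separated variants will only match if the mapping uses the same combined string, so some normalization of the Coptic forms is likely needed before chaining.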