# Notebook 1: Data Preparation & Anchor Extraction

## Goal
The objective of this notebook is to prepare the datasets required for our **Anchor-Guided Alignment** of Ancient Egyptian and Modern English.

Unlike previous attempts that relied on massive unsupervised training or machine translation bridges, we will explicitly construct a set of **Anchors** (known translations) that will serve as the "Rosetta Stone" for our mathematical alignment later.

## Steps
1.  **Load Raw Data**: We will use the TLA (Thesaurus Linguae Aegyptiae) dataset, which contains hieroglyphic transliterations and their German translations.
2.  **English Translation**: Since our target is English, we will use the pre-translated English versions (from the previous attempt's cache) to build our dictionary.
3.  **Clean & Normalize**: We need to ensure the transliterations are consistent and the English text is clean.
4.  **Construct Anchor Dictionary**: We will create a list of pairs `(hieroglyphic_word, english_word)` that we are confident in. These are our anchors.
5.  **Export**: Save the clean corpora and the anchor dictionary for the next steps.

In [1]:
import os
import pickle
import pandas as pd
import re
from collections import Counter
from tqdm import tqdm

# Configuration
DATA_DIR = "data"
CACHE_FILE = os.path.join(DATA_DIR, "german_english_translations.pkl")
HIEROGLYPHIC_CORPUS_FILE = os.path.join(DATA_DIR, "hieroglyphic_corpus.txt")
ANCHOR_FILE = os.path.join(DATA_DIR, "anchors.pkl")
CLEAN_CORPUS_FILE = os.path.join(DATA_DIR, "clean_corpora.pkl")

## 1. Load Data

We are loading the `german_english_translations.pkl` file which contains:
-   `hieroglyphic`: The transliterated Egyptian text.
-   `german`: The original German translation from TLA.
-   `english`: The English translation (machine translated from German in the previous attempt).

We rely on this cached file to save time and resources.

In [2]:
print(f"Loading data from {CACHE_FILE}...")
with open(CACHE_FILE, 'rb') as f:
    raw_data = pickle.load(f)

print(f"Loaded {len(raw_data)} entries.")

# Let's inspect a sample
print("\nSample Entry:")
print(raw_data[0])

Loading data from data/german_english_translations.pkl...
Loaded 12773 entries.

Sample Entry:
{'hieroglyphic': 'nḏ (w)di̯ r =s', 'german': '(es) werde zerrieben, (es) werde darauf gelegt.', 'hieroglyphs': '𓐩𓏌𓀜 𓂧 𓂋 𓋴', 'lemmatization': '90880|nḏ 51510|wdi̯ 91901|r 10090|=s', 'date_not_before': '-1580', 'date_not_after': '-1539', 'english': 'It shall be crushed, and it shall be laid upon it.'}
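
The sample entry suggests that each record is a dictionary of string fields. Before relying on the cache, it may be worth checking that every entry actually carries the fields used later. The sketch below runs on toy data; `REQUIRED_FIELDS` and `find_incomplete_entries` are hypothetical names introduced for illustration and are not part of the original pipeline.

```python
# Fields assumed to be required, based on the field list and sample entry above
REQUIRED_FIELDS = ("hieroglyphic", "german", "english")

def find_incomplete_entries(entries, required=REQUIRED_FIELDS):
    """Return indices of entries with a missing, empty, or non-string required field."""
    incomplete = []
    for index, entry in enumerate(entries):
        for field in required:
            value = entry.get(field)
            if not isinstance(value, str) or not value.strip():
                incomplete.append(index)
                break  # one bad field is enough to flag the entry
    return incomplete

# Toy entries standing in for the real cache
toy_entries = [
    {"hieroglyphic": "n =k", "german": "für dich", "english": "for you"},
    {"hieroglyphic": "", "german": "x", "english": "x"},   # empty transliteration
    {"hieroglyphic": "pn", "german": "dieser"},            # no English field
]
print(find_incomplete_entries(toy_entries))
```

On the real data the same call would be `find_incomplete_entries(raw_data)`; an empty result would mean that no entry is dropped by the cleaning step for lack of text.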


## 2. Data Cleaning

We need to process this list into a usable format. Because the entries are full sentences or phrases rather than single words, we first build **sentence-level** corpora for training the embeddings, and then extract **word-level anchors** from them.

### 2.1 Corpus Preparation
For `FastText` (Hieroglyphic) and our English model, we need clean lists of sentences.

In [3]:
def clean_hieroglyphic(text):
    """Normalize hieroglyphic transliteration."""
    if not isinstance(text, str): return ""
    # Remove brackets and uncertain markers often found in TLA
    text = re.sub(r'[\[\]\(\)\?\<\>]', '', text)
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def clean_english(text):
    """Normalize English text."""
    if not isinstance(text, str): return ""
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

hieroglyphic_sentences = []
english_sentences = []

for entry in tqdm(raw_data, desc="Cleaning Data"):
    h_clean = clean_hieroglyphic(entry.get('hieroglyphic', ''))
    e_clean = clean_english(entry.get('english', ''))
    
    if h_clean and e_clean:
        hieroglyphic_sentences.append(h_clean)
        english_sentences.append(e_clean)

print(f"\nPrepared {len(hieroglyphic_sentences)} parallel sentences.")

Prepared 12773 parallel sentences.
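
Transliterations contain diacritics that Unicode can encode in more than one way, so the same word may appear as two different strings. One optional extra step, which is not part of the cleaning cell above, is to apply NFC normalization before tokenizing. The sketch below uses the standard-library `unicodedata` module; `normalize_transliteration` is a hypothetical helper that otherwise mirrors `clean_hieroglyphic`.

```python
import re
import unicodedata

def normalize_transliteration(text):
    """Apply NFC normalization, strip editorial brackets, and collapse whitespace."""
    if not isinstance(text, str):
        return ""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"[\[\]()?<>]", "", text)
    return re.sub(r"\s+", " ", text).strip()

# "ḏ" as a single code point vs. "d" followed by a combining macron below (U+0331)
precomposed = "n\u1e0f"
decomposed = "nd\u0331"

print(precomposed == decomposed)
print(normalize_transliteration(precomposed) == normalize_transliteration(decomposed))
```

Without normalization the two spellings would be counted as different vocabulary items, which would split their co-occurrence statistics.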





## 3. Anchor Extraction

This is the most critical step for our "Anchor-Guided" approach. We need reliable pairs of `(egyptian_word, english_word)`.

Since we have aligned sentences, there are two candidate sources for such pairs: word **co-occurrence** across the parallel sentences, or the **lemmatization** data in the raw dump, if it turns out to be usable.

Looking at the raw data structure from the previous analysis:
```python
{
    'hieroglyphic': '...', 
    'german': '...', 
    'english': '...', 
    'lemmatization': 'lemma1|trans1 lemma2|trans2 ...'
}
```
The `lemmatization` field seems to contain the gold mine! It likely maps specific hieroglyphic words to their IDs or lemmas. However, we might not have direct English translations for those lemmas, only the sentence translation.

**Strategy**:
1.  We will try to build a dictionary from the `lemmatization` if it contains readable text.
2.  If `lemmatization` is just IDs, we will fall back to a frequency-based alignment on the sentences (e.g., if "nfr" appears in sentences with "good" 100 times, they are a pair).

Let's inspect the `lemmatization` field of the first few entries.

In [4]:
print("Inspecting Lemmatization:")
for i in range(5):
    print(f"Entry {i}: {raw_data[i].get('lemmatization', 'N/A')}")

Inspecting Lemmatization:
Entry 0: 90880|nḏ 51510|wdi̯ 91901|r 10090|=s
Entry 1: 78890|n 174900|ṯw 400007|m 10100|=sn
Entry 2: 113110|ḫꜣ 400082|m 168810|tʾ 110300|ḥnq.t 162930|kꜣ 107|ꜣpd 400055|n 25090|ꞽmꜣḫ.w 72420|ꞽm.ꞽ-rʾ-šnꜥ 400161|Ꞽmn-m-ḥꜣ.t 66750|mꜣꜥ-ḫrw
Entry 3: 40110|ꜥḥꜥ
Entry 4: 49461|Wsꞽr 800001|Wnꞽs 67780|mꞽ 400055|n 10110|=k 28410|ꞽr.t-Ḥr.w 21680|ꞽꜥb 400055|n 10110|=k 127770|sꞽ 91901|r 92560|rʾ 10110|=k
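
The output suggests that each token has the form `lemma_id|transliteration`, with a numeric ID and no English gloss. Assuming that format holds across the dataset, the field can be split into structured pairs as sketched below. `parse_lemmatization` is a hypothetical helper, not a function from the original notebook.

```python
def parse_lemmatization(field):
    """Split 'id|form id|form ...' into a list of (lemma_id, form) tuples.

    Tokens that do not match the assumed 'digits|text' shape are skipped.
    """
    pairs = []
    for token in field.split():
        lemma_id, separator, form = token.partition("|")
        if separator and lemma_id.isdigit() and form:
            pairs.append((int(lemma_id), form))
    return pairs

sample = "90880|nḏ 51510|wdi̯ 91901|r 10090|=s"
for lemma_id, form in parse_lemmatization(sample):
    print(lemma_id, form)
```

The numeric IDs could serve as stable keys for grouping spelling variants of a lemma, but they do not by themselves provide English translations.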


### 3.1 Building the Dictionary

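The inspection shows that each token has the form `lemma_id|transliteration`: the field identifies the Egyptian lemmas but carries no English gloss. We therefore fall back to aligning the Hieroglyphic words with the English words of each parallel sentence.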

For this demonstration, we will use a **Frequency-Based Probabilistic Dictionary** approach (simplified IBM Model 1 idea):
1.  Count word co-occurrences between Hieroglyphic words and English words in the parallel sentences.
2.  Filter for high-confidence pairs.

This approach is simple and does not require parsing the lemma strings, but very frequent English words such as "the" can dominate the counts.

In [5]:
co_occurrence = Counter()
h_freq = Counter()
e_freq = Counter()

print("Building co-occurrence matrix...")
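for h_sent, e_sent in zip(hieroglyphic_sentences, english_sentences):
    # Use sets so that each word counts at most once per sentence
    h_words = set(h_sent.split())
    e_words = set(e_sent.split())

    # Sentence-level frequency of each English word.
    # This must sit outside the loop over h_words, otherwise every English word
    # would be counted once per Hieroglyphic word in the sentence.
    for e in e_words:
        e_freq[e] += 1

    for h in h_words:
        h_freq[h] += 1
        for e in e_words:
            co_occurrence[(h, e)] += 1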

print(f"Unique Hieroglyphic words: {len(h_freq)}")
print(f"Unique English words: {len(e_freq)}")

Building co-occurrence matrix...


Unique Hieroglyphic words: 7174
Unique English words: 7800
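
For comparison, full IBM Model 1 does not stop at raw co-occurrence counts: it re-estimates translation probabilities with EM, so that a word already explained by one source word contributes less to the others. The following is a minimal sketch on a toy corpus, not the method used in this notebook. It omits the NULL source word, and the function name `ibm_model1` and the toy sentences are assumptions made for illustration.

```python
from collections import defaultdict

def ibm_model1(sentence_pairs, iterations=10):
    """Estimate t(e|h) by EM; returns a dict keyed by (h, e)."""
    english_vocab = {e for _, e_words in sentence_pairs for e in e_words}
    uniform = 1.0 / len(english_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialisation

    for _ in range(iterations):
        count = defaultdict(float)  # expected (h, e) alignment counts
        total = defaultdict(float)  # expected alignment counts per h
        # E-step: distribute each English word over the source words of its sentence
        for h_words, e_words in sentence_pairs:
            for e in e_words:
                norm = sum(t[(h, e)] for h in h_words)
                for h in h_words:
                    fraction = t[(h, e)] / norm
                    count[(h, e)] += fraction
                    total[h] += fraction
        # M-step: renormalise so that the probabilities for each h sum to 1
        t = {pair: c / total[pair[0]] for pair, c in count.items()}
    return t

# Toy parallel corpus (tokenised): pr = house, pn = this, zꜣ = son
toy_corpus = [
    (["pr", "pn"], ["this", "house"]),
    (["pr"], ["house"]),
    (["zꜣ", "pn"], ["this", "son"]),
]

t = ibm_model1(toy_corpus)
for (h, e), prob in sorted(t.items()):
    print(f"t({e} | {h}) = {prob:.2f}")
```

The single-word pair `pr` / `house` lets EM shift probability mass for `this` away from `pr` and toward `pn`, which plain co-occurrence counting cannot do.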


In [6]:
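# Score each pair by the conditional probability P(e|h) = count(h, e) / count(h),
# where the counts are numbers of sentence pairs containing the word(s).
# Pointwise Mutual Information (PMI) would be an alternative score; it is not used here.

anchors = {}
MIN_COUNT = 5
CONFIDENCE_THRESHOLD = 0.3  # Keep e if it appears in more than 30% of the sentences containing h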

print("Extracting anchors...")
for (h, e), count in co_occurrence.items():
    if count < MIN_COUNT:
        continue
        
    # Conditional probability P(e|h)
    prob = count / h_freq[h]
    
    if prob > CONFIDENCE_THRESHOLD:
        # We keep the best translation for each hieroglyphic word
        if h not in anchors or anchors[h]['prob'] < prob:
            anchors[h] = {'english': e, 'prob': prob, 'count': count}

print(f"Found {len(anchors)} potential anchors.")

# Let's see some top anchors
sorted_anchors = sorted(anchors.items(), key=lambda x: x[1]['count'], reverse=True)
print("\nTop 20 Anchors:")
for h, data in sorted_anchors[:20]:
    print(f"{h} -> {data['english']} (prob: {data['prob']:.2f}, count: {data['count']})")

Extracting anchors...
Found 1362 potential anchors.

Top 20 Anchors:
=f -> the (prob: 0.54, count: 1592)
=k -> you (prob: 0.61, count: 1425)
m -> the (prob: 0.65, count: 1414)
n -> the (prob: 0.56, count: 1288)
ḥr.w -> horus (prob: 0.99, count: 1172)
wnꞽs -> unas (prob: 0.98, count: 572)
ḥr -> the (prob: 0.57, count: 485)
r -> the (prob: 0.56, count: 448)
n.ꞽ -> the (prob: 0.76, count: 404)
pn -> this (prob: 0.87, count: 343)
zꜣ -> son (prob: 0.94, count: 307)
pw -> the (prob: 0.50, count: 287)
ḏd-mdw -> words (prob: 0.92, count: 285)
=ꞽ -> i (prob: 0.53, count: 275)
ꞽr -> the (prob: 0.57, count: 269)
=s -> the (prob: 0.49, count: 268)
ppy -> pepi (prob: 1.00, count: 268)
wsꞽr -> osiris (prob: 1.00, count: 262)
nꞽ.t -> neith (prob: 0.98, count: 258)
wsr.w -> osiris (prob: 1.00, count: 250)
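
Several of the high-count anchors map function words to "the". Conditional probability tends to produce this whenever the English word is frequent in the corpus as a whole. One possible remedy is to rank candidates by pointwise mutual information, PMI(h, e) = log2(count(h, e) · N / (count(h) · count(e))), where N is the number of sentence pairs. The sketch below applies it to a toy corpus; the sentences and the lowered `MIN_COUNT` are assumptions chosen for illustration, not results from the TLA data.

```python
import math
from collections import Counter

# Toy parallel sentences; "the" is frequent on the English side
toy_pairs = [
    ("zꜣ pn", "this son"),
    ("pr pn", "this house"),
    ("zꜣ n pr", "the son of the house"),
    ("m pr", "in the house"),
    ("nṯr ꜥꜣ", "the great god"),
    ("rn n nṯr", "the name of the god"),
]
MIN_COUNT = 2  # lowered for the toy corpus

co_occurrence, h_freq, e_freq = Counter(), Counter(), Counter()
for h_sent, e_sent in toy_pairs:
    h_words, e_words = set(h_sent.split()), set(e_sent.split())
    h_freq.update(h_words)   # sentence-level frequencies
    e_freq.update(e_words)
    co_occurrence.update((h, e) for h in h_words for e in e_words)

n_sentences = len(toy_pairs)
best = {}  # h -> (english, pmi)
for (h, e), count in co_occurrence.items():
    if count < MIN_COUNT:
        continue  # PMI is unreliable for rare pairs
    score = math.log2(count * n_sentences / (h_freq[h] * e_freq[e]))
    if h not in best or score > best[h][1]:
        best[h] = (e, score)

for h, (e, score) in sorted(best.items()):
    print(f"{h} -> {e} (PMI: {score:.2f})")
```

In this toy corpus `n` co-occurs with both "the" and "of" in every sentence where it appears, so P(e|h) cannot separate them, while PMI prefers "of" because "the" is common everywhere. PMI over-rewards rare pairs, which is why a minimum count still matters.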


## 4. Saving Data

We will save:
1.  The clean sentences (for training embeddings).
2.  The anchor dictionary (for alignment).

In [7]:
# Save Corpora
with open(CLEAN_CORPUS_FILE, 'wb') as f:
    pickle.dump({
        'hieroglyphic': hieroglyphic_sentences,
        'english': english_sentences
    }, f)

# Save Anchors
anchor_list = [{'hieroglyphic': h, 'english': data['english']} for h, data in anchors.items()]
with open(ANCHOR_FILE, 'wb') as f:
    pickle.dump(anchor_list, f)

print("Data saved successfully.")

Data saved successfully.
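
A later notebook would need to read these files back. The sketch below assumes the same structures as the save cell (a dictionary of two parallel lists, and a list of anchor dictionaries). It writes toy data to a temporary directory rather than touching `data/`, and the toy contents are illustrative only.

```python
import os
import pickle
import tempfile

# Toy stand-ins with the same shape as the saved objects
corpora = {"hieroglyphic": ["zꜣ pn"], "english": ["this son"]}
anchor_list = [{"hieroglyphic": "zꜣ", "english": "son"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = os.path.join(tmp_dir, "clean_corpora.pkl")
    anchor_path = os.path.join(tmp_dir, "anchors.pkl")

    # Save
    with open(corpus_path, "wb") as f:
        pickle.dump(corpora, f)
    with open(anchor_path, "wb") as f:
        pickle.dump(anchor_list, f)

    # Reload
    with open(corpus_path, "rb") as f:
        loaded_corpora = pickle.load(f)
    with open(anchor_path, "rb") as f:
        loaded_anchors = pickle.load(f)

# The two sentence lists must stay parallel
assert len(loaded_corpora["hieroglyphic"]) == len(loaded_corpora["english"])

# Convert the anchor list into a lookup table for the alignment step
anchor_lookup = {a["hieroglyphic"]: a["english"] for a in loaded_anchors}
print(anchor_lookup)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.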
