# V7 Phase 1: Data Cleaning & Tokenization

## Goal
Prepare a clean, high-quality corpus for FastText training.
Non-glyph artifacts such as line numbers and headers may have polluted previous models, so this phase removes them before training.

## Strategy
1. **Load Data**: `data/raw/all_data.json` (104k texts).
2. **Clean**: Apply regex to remove:
    - Line numbers (e.g., "1.", "[1]")
    - English text/comments
    - Punctuation that is not part of MdC (Manuel de Codage)
3. **Tokenize**: Ensure space-separated glyphs for FastText.
4. **Export**: Save as `data/processed/cleaned_corpus.txt`.
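
Step 3 can be sketched as follows. This is a minimal illustration, not the notebook's actual tokenizer: `tokenize_mdc` is a hypothetical helper, and it assumes that the MdC layout separators `-`, `:`, and `*` (plus whitespace) are the only boundaries that matter for FastText. Grouping parentheses and other MdC syntax are not handled.

```python
import re

# Assumed separators: "-" (sequence), ":" (vertical stacking),
# "*" (horizontal juxtaposition), plus any whitespace.
MDC_SEPARATORS = re.compile(r"[-:*\s]+")


def tokenize_mdc(text):
    """Split an MdC string into space-separated sign codes (hypothetical helper)."""
    return " ".join(tok for tok in MDC_SEPARATORS.split(text) if tok)


print(tokenize_mdc("A1-N35:X1*Z1 G17"))  # A1 N35 X1 Z1 G17
```

Splitting this way discards the layout information carried by the separators, which is probably acceptable for embedding training but would not be for rendering.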

In [None]:
import json
import re
from pathlib import Path
from tqdm import tqdm

# Paths
RAW_DATA_PATH = Path("../data/raw/all_data.json")
CLEAN_DATA_PATH = Path("../data/processed/cleaned_corpus.txt")

# Ensure output dir exists
CLEAN_DATA_PATH.parent.mkdir(parents=True, exist_ok=True)

## 1. Load Data

In [4]:
# Read as UTF-8 explicitly so the result does not depend on the platform default
with open(RAW_DATA_PATH, 'r', encoding='utf-8') as f:
    raw_data = json.load(f)

print(f"Loaded {len(raw_data)} texts.")
print("Sample raw text:", raw_data[0])

FileNotFoundError: [Errno 2] No such file or directory: 'data/raw/all_data.json'
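
The error above is what a relative path produces when the notebook is started from a different working directory. One possible workaround, sketched here under the assumption that the `data/` folder sits in the current directory or one of its nearest parents, is to search upward for the file. `find_data_file` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def find_data_file(relative, start=None, max_levels=3):
    """Look for `relative` in `start` and in up to `max_levels` parent directories.

    Hypothetical helper: `start` defaults to the current working directory.
    """
    base = Path(start).resolve() if start is not None else Path.cwd()
    for directory in [base, *base.parents][: max_levels + 1]:
        candidate = directory / relative
        if candidate.exists():
            return candidate
    raise FileNotFoundError(
        f"{relative} not found in {base} or its {max_levels} nearest parents"
    )


# Usage in the notebook (depends on the local directory layout):
# RAW_DATA_PATH = find_data_file("data/raw/all_data.json")
```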

## 2. Cleaning Logic

In [None]:
def clean_hieroglyphs(text):
    """Strip line numbers, bracketed annotations, and likely English words from one text."""
    if not isinstance(text, str):
        return ""

    # 1. Remove bracketed spans: [1], [2a], <...>, (...)
    # Caution: MdC uses <...> for cartouches and (...) for grouping, so the last
    # two patterns also delete any glyphs written inside such spans.
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'\(.*?\)', '', text)

    # 2. Remove line numbers such as "1." at the start of a line or after whitespace.
    # re.MULTILINE makes ^ match at every line start, not only at the start of the string.
    text = re.sub(r'^\d+\.', '', text, flags=re.MULTILINE)
    text = re.sub(r'\s\d+\.', ' ', text)

    # 3. Drop tokens that are unlikely to be MdC codes.
    # Heuristic: Gardiner codes (A1, N35) pair an uppercase letter with digits, so a
    # purely lowercase alphabetic token longer than one character is treated as English.
    # This is aggressive: lowercase MdC phonetic codes (e.g. "nfr") are dropped as well.
    clean_tokens = []
    for t in text.split():
        if t.isalpha() and t.islower() and len(t) > 1:
            continue  # likely an English word
        if t.isdigit():
            continue  # standalone number
        clean_tokens.append(t)

    return " ".join(clean_tokens)
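
Before running the full corpus through this function, it may be worth checking what the lowercase filter actually drops. The sketch below reproduces only that filter on a hand-made token list; the sample tokens are illustrative and not taken from the dataset. MdC allows signs to be written by phonetic value (for example `nfr` or `mn`), so if the corpus uses those codes they would be removed together with English words.

```python
def looks_like_english(token):
    """Same test as the lowercase filter: alphabetic, all lowercase, longer than one char."""
    return token.isalpha() and token.islower() and len(token) > 1


# Illustrative tokens: Gardiner codes, an English word, and MdC phonetic codes.
sample = ["A1", "N35", "the", "nfr", "mn", "G17", "n"]
kept = [t for t in sample if not looks_like_english(t)]
dropped = [t for t in sample if looks_like_english(t)]

print("kept:   ", kept)     # ['A1', 'N35', 'G17', 'n']
print("dropped:", dropped)  # ['the', 'nfr', 'mn']
```

If phonetic codes do appear in the raw data, an allow-list of known MdC codes would be a safer filter than letter case alone.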

## 3. Process & Export

In [None]:
cleaned_lines = []
skipped = 0

for item in tqdm(raw_data):
    # Each item is assumed to be either a plain string or a dict with a
    # 'hieroglyphs' key; change the key if the actual data uses another field name.
    text = item.get('hieroglyphs', '') if isinstance(item, dict) else item

    clean_text = clean_hieroglyphs(text)
    if clean_text:
        cleaned_lines.append(clean_text)
    else:
        # Nothing left after cleaning (or the item was not a string)
        skipped += 1

print(f"Processed {len(cleaned_lines)} lines. Skipped {skipped} empty lines.")

# Save to file
with open(CLEAN_DATA_PATH, 'w', encoding='utf-8') as f:
    for line in cleaned_lines:
        f.write(line + "\n")
        
print(f"Saved cleaned corpus to {CLEAN_DATA_PATH}")
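
A quick read-back of the exported file can confirm that it contains what FastText will see: one text per line, tokens separated by whitespace. The following is a minimal sketch; `corpus_stats` is a hypothetical helper and not part of the original notebook.

```python
from collections import Counter


def corpus_stats(path):
    """Return (non-empty line count, token count, Counter of tokens) for a corpus file."""
    counts = Counter()
    n_lines = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:
                n_lines += 1
                counts.update(tokens)
    return n_lines, sum(counts.values()), counts


# Usage in the notebook:
# n_lines, n_tokens, vocab = corpus_stats(CLEAN_DATA_PATH)
# print(n_lines, n_tokens, vocab.most_common(20))
```

Leftover English words or stray punctuation tend to show up near the top of the most common tokens, which makes them easy to spot.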