# V6 Phase 2: HieroBERT Pre-training

## Goal
Train a context-aware **HieroBERT** model on the corpus of 104k hieroglyphic texts.
The model learns to predict masked hieroglyphs from their surrounding context, which pushes it to capture the syntax and semantics of the language.

## Architecture: "HieroBERT-Small"
- **Hidden Size**: 768 (Matches our visual embeddings)
- **Layers**: 6 (Reduced from the 12 of BERT-base to limit overfitting on a small corpus)
- **Attention Heads**: 12
- **Vocab Size**: 30,000 (Learned via WordPiece)

## Steps
1. Train Tokenizer
2. Configure Model
3. Prepare Dataset (MLM)
4. Train
5. Save

In [1]:
!pip install transformers tokenizers datasets torch accelerate


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.1.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [3]:
import os
from pathlib import Path
from tokenizers import BertWordPieceTokenizer
from transformers import (
    BertConfig,
    BertForMaskedLM,
    LineByLineTextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments
)
import torch

# Paths
DATA_PATH = Path("../data/raw/hieroglyphic_corpus.txt")
MODEL_DIR = Path("../models/hierobert_small")
MODEL_DIR.mkdir(parents=True, exist_ok=True)

print(f"Using device: {torch.device('mps' if torch.backends.mps.is_available() else 'cpu')}")

Using device: mps


## 1. Train Tokenizer
We need a tokenizer that understands hieroglyphic groupings. We'll use WordPiece.
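
To make the WordPiece step concrete, here is a minimal sketch of the greedy longest-match-first lookup that a WordPiece tokenizer applies to each word once a vocabulary exists. It illustrates the lookup only, not the training procedure, and the toy vocabulary of Gardiner-style codes is hypothetical rather than taken from the corpus.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring from the right until it is in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (illustration only, not the learned one).
toy_vocab = {"A1", "G", "##17", "##1", "N35", "[UNK]"}

print(wordpiece_tokenize("G17", toy_vocab))  # split into a head piece and a continuation
print(wordpiece_tokenize("N35", toy_vocab))  # whole code is in the vocabulary
print(wordpiece_tokenize("Z9", toy_vocab))   # nothing matches, falls back to [UNK]
```

If sign codes are frequent enough in the corpus, most of them should end up as single tokens, with continuation pieces covering rarer variants.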

In [4]:
# Initialize tokenizer
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,  # No CJK text in the corpus
    strip_accents=False,  # Keep accents if any (the corpus is mostly sign codes)
    lowercase=False  # Preserve case: Gardiner codes distinguish e.g. A1 from the Aa category
)

# Train
tokenizer.train(
    files=[str(DATA_PATH)],
    vocab_size=30000,
    min_frequency=2,
    show_progress=True,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)

# Save tokenizer
tokenizer.save_model(str(MODEL_DIR))
print("Tokenizer saved.")




Tokenizer saved.
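
The requested `vocab_size=30000` acts as an upper bound, so a corpus with a limited sign inventory may produce a smaller vocabulary. A quick check is sketched below, assuming `save_model` wrote the usual one-token-per-line `vocab.txt` into the model directory; `count_vocab_entries` is a helper name introduced here for illustration.

```python
from pathlib import Path


def count_vocab_entries(vocab_file):
    """Count non-empty lines in a WordPiece vocab.txt (one token per line)."""
    with open(vocab_file, encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


# Same location as MODEL_DIR in the setup cell; the file name is an assumption.
vocab_path = Path("../models/hierobert_small") / "vocab.txt"
if vocab_path.exists():
    actual = count_vocab_entries(vocab_path)
    print(f"Learned vocabulary: {actual:,} tokens (requested 30,000)")
```

If the learned vocabulary turns out smaller than 30,000, passing the actual size to `BertConfig` avoids allocating embedding rows that are never used.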


## 2. Configure Model
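Instantiate HieroBERT-Small with the hyperparameters from the architecture section. The weights are randomly initialized, since we pre-train from scratch.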

In [5]:
config = BertConfig(
    vocab_size=30000,
    hidden_size=768,
    num_hidden_layers=6,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    type_vocab_size=1,
)

model = BertForMaskedLM(config)
print(f"Model parameters: {model.num_parameters():,}")

Model parameters: 66,584,880
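
The parameter count above can be reproduced by hand, which shows where the budget goes. The sketch below assumes the standard BERT layout with the MLM output projection tied to the word embeddings and no pooler layer; `bert_mlm_param_count` is a helper written for this illustration, not a library function.

```python
def bert_mlm_param_count(vocab, hidden, layers, intermediate, max_pos, type_vocab):
    """Estimate the parameter count of a BERT masked-LM model."""
    layer_norm = 2 * hidden  # scale and bias
    # Word, position and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases), then LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two feed-forward projections (weights + biases), then LayerNorm.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    encoder = layers * (attention + feed_forward)
    # MLM head: dense transform + LayerNorm + one output bias per vocabulary entry.
    # The output projection is assumed tied to the word embeddings, so it adds nothing.
    mlm_head = (hidden * hidden + hidden) + layer_norm + vocab
    return embeddings + encoder + mlm_head


total = bert_mlm_param_count(
    vocab=30000, hidden=768, layers=6, intermediate=3072, max_pos=512, type_vocab=1
)
print(f"{total:,}")  # → 66,584,880
```

Roughly 23M of the 66.6M parameters sit in the word embedding matrix, so a smaller vocabulary would shrink the model noticeably. The number of attention heads does not change the count.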


## 3. Prepare Dataset
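Load the corpus line by line and set up dynamic masking for Masked Language Modeling (MLM): in every batch, 15% of the tokens are selected as prediction targets.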

In [None]:
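# Load the trained vocabulary as a Transformers tokenizer
from transformers import BertTokenizerFast  # not imported in the setup cell

tokenizer = BertTokenizerFast.from_pretrained(
    str(MODEL_DIR),
    model_max_length=512,  # replaces the legacy max_len argument
    do_lower_case=False,  # match lowercase=False used when training the tokenizer
    strip_accents=False,
    tokenize_chinese_chars=False,
)

# One example per non-empty line, truncated to block_size tokens
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=str(DATA_PATH),
    block_size=128,  # most texts are short
)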

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
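
For intuition about what the collator does with `mlm_probability=0.15`, the following plain-Python sketch applies the usual BERT masking recipe: each non-special token is selected with probability 0.15, and a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged otherwise. This is an illustration under those assumptions, not the library's implementation. `mask_tokens` and the example ids are hypothetical, and the special-token ids assume they were assigned in the order given to `tokenizer.train`.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids, mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for one sequence under BERT-style masking."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(inputs)  # -100 marks positions ignored by the loss
    for i, tok in enumerate(token_ids):
        # Special tokens are never selected; others are selected with mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


# Assumed ids: [PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4; the other ids are made up.
example = [2, 57, 112, 9, 340, 3]
inputs, labels = mask_tokens(
    example, mask_id=4, vocab_size=30000, special_ids={0, 2, 3}, rng=random.Random(0)
)
print(inputs)
print(labels)
```

Because the selection is redrawn every time a batch is built, the same text is masked differently across the three epochs.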



## 4. Train

In [None]:
training_args = TrainingArguments(
    output_dir=str(MODEL_DIR),
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=32,
    save_steps=500,
    save_total_limit=2,
    prediction_loss_only=True,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=50,
    use_mps_device=torch.backends.mps.is_available(),  # deprecated in newer transformers releases, which select MPS automatically
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
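
As a rough guide to what these arguments imply, the sketch below estimates the number of optimizer steps and checkpoints. It assumes about 104,000 non-empty lines in the corpus, a single device, and no gradient accumulation; `training_schedule` is a helper introduced for this illustration.

```python
import math


def training_schedule(num_examples, batch_size, epochs, save_steps):
    """Return (steps per epoch, total steps, checkpoints written during training)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch is kept
    total_steps = steps_per_epoch * epochs
    checkpoints_written = total_steps // save_steps
    return steps_per_epoch, total_steps, checkpoints_written


# Assumption: ~104,000 examples; the exact count depends on the corpus file.
print(training_schedule(104_000, 32, 3, 500))  # → (3250, 9750, 19)
```

With `save_total_limit=2`, only the two most recent of those checkpoints remain on disk, and `logging_steps=50` would yield about 195 loss readings over the run.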

## 5. Save

In [None]:
trainer.save_model(str(MODEL_DIR))
print("HieroBERT saved successfully!")