# Ancient Greek ↔ English NMT — Tiny Demo Notebook

This notebook walks you through a **toy end‑to‑end run** using the single‑file pipeline (`grc_nmt.py`).

> **Notes**
> - Run on Colab or a local GPU box; the default mBART-50 checkpoint (about 600M parameters) is a multi-gigabyte download.
> - It expects `../grc_nmt.py` relative to this notebook.
> - We use *very* small data and **1 epoch** just to verify the wiring, so do not expect meaningful BLEU or chrF scores.


In [None]:
# Install dependencies (skip this cell if they are already installed)
!python -m pip install --upgrade "transformers>=4.42" "datasets>=2.16" "accelerate>=0.33" sacrebleu ftfy unidecode regex
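
To confirm the install worked before moving on, a small check along these lines may help. It is only a sketch: the package list is copied from the pip command above, and `installed_versions` is a hypothetical helper, not part of `grc_nmt.py`.

```python
from importlib import metadata

# Package names taken from the pip command in this notebook.
REQUIRED = ["transformers", "datasets", "accelerate", "sacrebleu",
            "ftfy", "unidecode", "regex"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name:15s} {version if version else 'MISSING'}")
```

Any package reported as `MISSING` should be installed before running the cells that follow.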

In [None]:
# Sanity check for pipeline script & copy tiny data
from pathlib import Path
import os, shutil
assert Path('..','grc_nmt.py').exists(), 'Place grc_nmt.py in the repo root.'
os.makedirs('../data/raw', exist_ok=True)
shutil.copy('../sample_data/grc.txt','../data/raw/grc.txt')
shutil.copy('../sample_data/en.txt','../data/raw/en.txt')
print('Copied toy data to ../data/raw/')
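
Before preprocessing, it can be worth checking that the two files are line-aligned. The sketch below assumes the usual parallel-corpus layout (one sentence per line, line *i* of `grc.txt` translating to line *i* of `en.txt`); that layout is an assumption here, and `check_parallel` is a hypothetical helper rather than something the pipeline script provides.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_parallel(src_path, tgt_path):
    """Return the shared line count, or raise ValueError if the files differ in length."""
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(
            f"Line count mismatch: {src_path} has {n_src}, {tgt_path} has {n_tgt}"
        )
    return n_src


# Paths match the copy step above; the check is skipped if the files are absent.
src, tgt = Path("../data/raw/grc.txt"), Path("../data/raw/en.txt")
if src.exists() and tgt.exists():
    print("Parallel lines:", check_parallel(src, tgt))
```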

## 1) Preprocess and split

In [None]:
!python ../grc_nmt.py preprocess \
  --src_file ../data/raw/grc.txt \
  --tgt_file ../data/raw/en.txt \
  --out_jsonl ../data/bitext.jsonl \
  --lang src_grc --keep_diacritics --lowercase

!python ../grc_nmt.py make_splits \
  --bitext ../data/bitext.jsonl \
  --train ../data/train.jsonl --dev ../data/dev.jsonl --test ../data/test.jsonl
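
To make the effect of `--keep_diacritics` and `--lowercase` concrete, here is a rough illustration of what such options commonly do to polytonic Greek. It is a sketch using only the standard library; `normalize_greek` and `strip_diacritics` are hypothetical names, and the actual behavior of `grc_nmt.py preprocess` may differ.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks (accents, breathings, iota subscript)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def normalize_greek(text, keep_diacritics=True, lowercase=True):
    """Hypothetical stand-in for the preprocessing options used above."""
    text = unicodedata.normalize("NFC", text)  # one canonical form for accents
    if lowercase:
        text = text.lower()
    if not keep_diacritics:
        text = strip_diacritics(text)
    return text


sample = "Οἱ παῖδες ἐν τῇ οἰκίᾳ εἰσίν."
print(normalize_greek(sample))                         # lowercased, diacritics kept
print(normalize_greek(sample, keep_diacritics=False))  # -> οι παιδες εν τη οικια εισιν.
```

Keeping diacritics preserves distinctions that matter in Ancient Greek (breathings, for instance), at the cost of a larger surface vocabulary.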

## 2) Train GRC→EN (mBART default) — 1 epoch

In [None]:
!python ../grc_nmt.py train \
  --train_jsonl ../data/train.jsonl --dev_jsonl ../data/dev.jsonl \
  --direction grc2en \
  --model_name facebook/mbart-large-50-many-to-many-mmt \
  --save_dir ../runs/grc2en.mbart \
  --num_epochs 1 --lr 5e-5 --batch 4 --grad_accum 1 \
  --src_lang_tok_code el_GR --tgt_lang_tok_code en_XX --forced_bos_lang en_XX
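
For a feel of how `--batch` and `--grad_accum` interact, the arithmetic below estimates optimizer steps per epoch. It assumes `--batch` is a per-device batch size and `--grad_accum` is the number of accumulation steps, which the flag names suggest but the notebook does not confirm. The example size of 100 training pairs is made up, and frameworks differ in how they treat the final partial batch, so treat the result as approximate.

```python
import math


def steps_per_epoch(num_examples, batch, grad_accum=1, num_devices=1):
    """Approximate optimizer updates per epoch for a given effective batch size."""
    effective_batch = batch * grad_accum * num_devices
    return math.ceil(num_examples / effective_batch)


# Hypothetical toy corpus of 100 training pairs with the settings used above.
print(steps_per_epoch(100, batch=4, grad_accum=1))  # -> 25
# Same data with 8 accumulation steps: effective batch of 32.
print(steps_per_epoch(100, batch=4, grad_accum=8))  # -> 4
```

If GPU memory is tight, lowering `--batch` and raising `--grad_accum` by the same factor keeps the effective batch size unchanged.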

## 3) Evaluate (BLEU + chrF)

In [None]:
!python ../grc_nmt.py evaluate \
  --model_dir ../runs/grc2en.mbart \
  --test_jsonl ../data/test.jsonl \
  --src_lang grc --tgt_lang en \
  --src_lang_tok_code el_GR --tgt_lang_tok_code en_XX

## 4) Try a translation

In [None]:
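# Write a one-sentence input file; the context manager closes it before the script reads it.
with open('../toy_in.grc', 'w', encoding='utf-8') as f:
    f.write('οἱ παῖδες ἐν τῇ οἰκίᾳ εἰσίν.\n')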
!python ../grc_nmt.py generate \
  --model_dir ../runs/grc2en.mbart \
  --in_txt ../toy_in.grc --out_txt ../toy_out.en \
  --src_lang grc --tgt_lang en \
  --src_lang_tok_code el_GR --tgt_lang_tok_code en_XX --forced_bos_lang en_XX
with open('../toy_out.en', 'r', encoding='utf-8') as f:
    print(f.read())