# Notebook 2: Embedding Training

## Goal
Now that we have our clean corpora, we need to turn words into vectors. We will train two separate embedding spaces:

1.  **Hieroglyphic Space**: Using **FastText**.
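    -   *Why?* Hieroglyphs (and their transliterations) are morphologically rich: a stem such as "nfr" occurs with many prefixes and suffixes. FastText represents each word as a bag of character n-grams (e.g., the 3-grams of "nfr" are "<nf", "nfr", "fr>", where "<" and ">" mark the word boundaries), so related forms share parameters and the model captures word structure better than standard Word2Vec.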
2.  **English Space**: Using **Word2Vec** (Skip-gram).
    -   *Why?* Standard Word2Vec is sufficient for English. We train it on the English side of the *same* parallel corpus so that its domain (Egyptology) matches that of the hieroglyphic model.
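
To make the n-gram idea concrete, here is a minimal sketch of how character n-grams can be extracted from a word. The helper name `char_ngrams` is hypothetical; the defaults `min_n=3` and `max_n=6` mirror gensim's FastText defaults, and the real library additionally hashes each n-gram into a fixed number of buckets, which this sketch omits.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, padded with '<' and '>'."""
    padded = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the padded word
        for start in range(len(padded) - n + 1):
            ngrams.append(padded[start:start + n])
    return ngrams

print(char_ngrams("nfr", min_n=3, max_n=3))   # ['<nf', 'nfr', 'fr>']
print(char_ngrams("snfr", min_n=3, max_n=3))  # ['<sn', 'snf', 'nfr', 'fr>']
```

The two words share the n-grams "nfr" and "fr>", which is how a derived form can end up close to its stem in the embedding space.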

## Steps
1.  Load the clean corpora.
2.  Train Hieroglyphic FastText model.
3.  Train English Word2Vec model.
4.  Inspect the nearest neighbors of a few known words as a sanity check on embedding quality.

In [1]:
import os
import pickle
import logging
from gensim.models import FastText, Word2Vec

# Setup logging to see training progress
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Configuration
DATA_DIR = "data"
MODELS_DIR = "models"
CLEAN_CORPUS_FILE = os.path.join(DATA_DIR, "clean_corpora.pkl")
HIEROGLYPHIC_MODEL_FILE = os.path.join(MODELS_DIR, "hieroglyphic_fasttext.model")
ENGLISH_MODEL_FILE = os.path.join(MODELS_DIR, "english_word2vec.model")

# Make sure the output directory exists before the models are saved
os.makedirs(MODELS_DIR, exist_ok=True)

# Hyperparameters
VECTOR_SIZE = 100
WINDOW = 5
MIN_COUNT = 2
EPOCHS = 50

## 1. Load Data

In [2]:
print(f"Loading corpora from {CLEAN_CORPUS_FILE}...")
with open(CLEAN_CORPUS_FILE, 'rb') as f:
    corpora = pickle.load(f)

hier_sentences = [s.split() for s in corpora['hieroglyphic']]
eng_sentences = [s.split() for s in corpora['english']]

print(f"Loaded {len(hier_sentences)} hieroglyphic sentences and {len(eng_sentences)} English sentences.")

Loading corpora from data/clean_corpora.pkl...
Loaded 12773 hieroglyphic sentences and 12773 English sentences.
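
Before training, it can help to confirm that the two sides are still aligned and to see how much vocabulary `MIN_COUNT` will discard. The following is a small sketch on toy data; `vocab_stats` is a hypothetical helper and the toy sentences are made up for illustration.

```python
from collections import Counter

def vocab_stats(sentences, min_count):
    """Count tokens, word types, and types kept at a frequency threshold."""
    counts = Counter(tok for sent in sentences for tok in sent)
    kept = sum(1 for c in counts.values() if c >= min_count)
    return {"tokens": sum(counts.values()), "types": len(counts), "kept_types": kept}

# Toy stand-ins for hier_sentences / eng_sentences (illustrative only)
toy_hier = [["nfr", "nṯr"], ["nfr", "pr"]]
toy_eng = [["the", "good", "god"], ["the", "good", "house"]]

# A parallel corpus needs one English sentence per hieroglyphic sentence
assert len(toy_hier) == len(toy_eng), "parallel corpora must have equal length"
print(vocab_stats(toy_hier, min_count=2))
print(vocab_stats(toy_eng, min_count=2))
```

On the real data, gensim reports the same quantities in its training log (word types collected and word types retained after `min_count`).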


## 2. Train Hieroglyphic FastText

We use the Skip-gram architecture (`sg=1`) because it tends to learn better representations for rare words, which matters on a small corpus like ours (roughly 70,000 tokens on the hieroglyphic side).
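
As a rough illustration of what Skip-gram trains on, the sketch below enumerates (center, context) pairs for a fixed window. This is a simplification: the function name `skipgram_pairs` is hypothetical, and gensim additionally shrinks the window at random, downsamples frequent words, and uses negative sampling, as the training log shows.

```python
def skipgram_pairs(sentence, window):
    """Yield (center, context) pairs within `window` tokens of each position."""
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, sentence[j]

pairs = list(skipgram_pairs(["a", "b", "c"], window=1))
print(pairs)  # [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
```

Each pair becomes one prediction task (predict the context word from the center word), so even a short sentence contributes several training examples.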

In [3]:
print("Training Hieroglyphic FastText model...")
hier_model = FastText(
    sentences=hier_sentences,
    vector_size=VECTOR_SIZE,
    window=WINDOW,
    min_count=MIN_COUNT,
    sg=1,  # Skip-gram
    epochs=EPOCHS,
    seed=42
)

print("Saving Hieroglyphic model...")
hier_model.save(HIEROGLYPHIC_MODEL_FILE)
print("Done.")

2025-11-19 09:25:45,315 : INFO : collecting all words and their counts


2025-11-19 09:25:45,315 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types


2025-11-19 09:25:45,322 : INFO : PROGRESS: at sentence #10000, processed 54462 words, keeping 6287 word types


2025-11-19 09:25:45,324 : INFO : collected 7174 word types from a corpus of 70267 raw words and 12773 sentences


2025-11-19 09:25:45,324 : INFO : Creating a fresh vocabulary


2025-11-19 09:25:45,330 : INFO : FastText lifecycle event {'msg': 'effective_min_count=2 retains 3677 unique words (51.25% of original 7174, drops 3497)', 'datetime': '2025-11-19T09:25:45.330533', 'gensim': '4.3.3', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'prepare_vocab'}


2025-11-19 09:25:45,330 : INFO : FastText lifecycle event {'msg': 'effective_min_count=2 leaves 66770 word corpus (95.02% of original 70267, drops 3497)', 'datetime': '2025-11-19T09:25:45.330925', 'gensim': '4.3.3', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'prepare_vocab'}


2025-11-19 09:25:45,338 : INFO : deleting the raw counts dictionary of 7174 items


2025-11-19 09:25:45,339 : INFO : sample=0.001 downsamples 48 most-common words


2025-11-19 09:25:45,339 : INFO : FastText lifecycle event {'msg': 'downsampling leaves estimated 48138.16567809211 word corpus (72.1%% of prior 66770)', 'datetime': '2025-11-19T09:25:45.339609', 'gensim': '4.3.3', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'prepare_vocab'}


2025-11-19 09:25:45,366 : INFO : estimated required memory for 3677 words, 2000000 buckets and 100 dimensions: 805413924 bytes


2025-11-19 09:25:45,366 : INFO : resetting layer weights


Training Hieroglyphic FastText model...


2025-11-19 09:25:46,038 : INFO : FastText lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-11-19T09:25:46.038696', 'gensim': '4.3.3', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'build_vocab'}


2025-11-19 09:25:46,039 : INFO : FastText lifecycle event {'msg': 'training model with 3 workers on 3677 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-11-19T09:25:46.039202', 'gensim': '4.3.3', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'train'}


2025-11-19 09:25:46,147 : INFO : EPOCH 0: training on 70267 raw words (48179 effective words) took 0.1s, 486942 effective words/s


2025-11-19 09:25:46,263 : INFO : EPOCH 1: training on 70267 raw words (48173 effective words) took 0.1s, 435304 effective words/s


2025-11-19 09:25:46,371 : INFO : EPOCH 2: training on 70267 raw words (48129 effective words) took 0.1s, 470524 effective words/s


2025-11-19 09:25:46,481 : INFO : EPOCH 3: training on 70267 raw words (48239 effective words) took 0.1s, 463918 effective words/s


2025-11-19 09:25:46,585 : INFO : EPOCH 4: training on 70267 raw words (48073 effective words) took 0.1s, 490068 effective words/s


2025-11-19 09:25:46,694 : INFO : EPOCH 5: training on 70267 raw words (48159 effective words) took 0.1s, 469505 effective words/s


2025-11-19 09:25:46,801 : INFO : EPOCH 6: training on 70267 raw words (48134 effective words) took 0.1s, 481484 effective words/s


2025-11-19 09:25:46,909 : INFO : EPOCH 7: training on 70267 raw words (48281 effective words) took 0.1s, 484001 effective words/s


2025-11-19 09:25:47,015 : INFO : EPOCH 8: training on 70267 raw words (48129 effective words) took 0.1s, 481289 effective words/s


2025-11-19 09:25:47,123 : INFO : EPOCH 9: training on 70267 raw words (48183 effective words) took 0.1s, 482425 effective words/s


2025-11-19 09:25:47,231 : INFO : EPOCH 10: training on 70267 raw words (48130 effective words) took 0.1s, 480429 effective words/s


2025-11-19 09:25:47,338 : INFO : EPOCH 11: training on 70267 raw words (48086 effective words) took 0.1s, 474568 effective words/s


2025-11-19 09:25:47,444 : INFO : EPOCH 12: training on 70267 raw words (48099 effective words) took 0.1s, 480595 effective words/s


2025-11-19 09:25:47,551 : INFO : EPOCH 13: training on 70267 raw words (48046 effective words) took 0.1s, 478188 effective words/s


2025-11-19 09:25:47,659 : INFO : EPOCH 14: training on 70267 raw words (48137 effective words) took 0.1s, 470045 effective words/s


2025-11-19 09:25:47,769 : INFO : EPOCH 15: training on 70267 raw words (48120 effective words) took 0.1s, 472478 effective words/s


2025-11-19 09:25:47,877 : INFO : EPOCH 16: training on 70267 raw words (48153 effective words) took 0.1s, 478690 effective words/s


2025-11-19 09:25:47,985 : INFO : EPOCH 17: training on 70267 raw words (48200 effective words) took 0.1s, 472807 effective words/s


2025-11-19 09:25:48,097 : INFO : EPOCH 18: training on 70267 raw words (48029 effective words) took 0.1s, 454108 effective words/s


2025-11-19 09:25:48,207 : INFO : EPOCH 19: training on 70267 raw words (48207 effective words) took 0.1s, 471884 effective words/s


2025-11-19 09:25:48,315 : INFO : EPOCH 20: training on 70267 raw words (48040 effective words) took 0.1s, 472410 effective words/s


2025-11-19 09:25:48,425 : INFO : EPOCH 21: training on 70267 raw words (48113 effective words) took 0.1s, 461548 effective words/s


2025-11-19 09:25:48,535 : INFO : EPOCH 22: training on 70267 raw words (48104 effective words) took 0.1s, 463228 effective words/s


2025-11-19 09:25:48,677 : INFO : EPOCH 23: training on 70267 raw words (48178 effective words) took 0.1s, 353544 effective words/s


2025-11-19 09:25:48,790 : INFO : EPOCH 24: training on 70267 raw words (48156 effective words) took 0.1s, 450581 effective words/s


2025-11-19 09:25:48,897 : INFO : EPOCH 25: training on 70267 raw words (48103 effective words) took 0.1s, 475458 effective words/s


2025-11-19 09:25:49,010 : INFO : EPOCH 26: training on 70267 raw words (48201 effective words) took 0.1s, 454636 effective words/s


2025-11-19 09:25:49,116 : INFO : EPOCH 27: training on 70267 raw words (48000 effective words) took 0.1s, 477606 effective words/s


2025-11-19 09:25:49,221 : INFO : EPOCH 28: training on 70267 raw words (48085 effective words) took 0.1s, 483293 effective words/s


2025-11-19 09:25:49,329 : INFO : EPOCH 29: training on 70267 raw words (48242 effective words) took 0.1s, 476090 effective words/s


2025-11-19 09:25:49,436 : INFO : EPOCH 30: training on 70267 raw words (48211 effective words) took 0.1s, 474774 effective words/s


2025-11-19 09:25:49,545 : INFO : EPOCH 31: training on 70267 raw words (48076 effective words) took 0.1s, 477305 effective words/s


2025-11-19 09:25:49,646 : INFO : EPOCH 32: training on 70267 raw words (48124 effective words) took 0.1s, 515362 effective words/s


2025-11-19 09:25:49,754 : INFO : EPOCH 33: training on 70267 raw words (48253 effective words) took 0.1s, 481811 effective words/s


2025-11-19 09:25:49,863 : INFO : EPOCH 34: training on 70267 raw words (48073 effective words) took 0.1s, 465593 effective words/s


2025-11-19 09:25:49,969 : INFO : EPOCH 35: training on 70267 raw words (48008 effective words) took 0.1s, 492640 effective words/s


2025-11-19 09:25:50,082 : INFO : EPOCH 36: training on 70267 raw words (48155 effective words) took 0.1s, 459038 effective words/s


2025-11-19 09:25:50,196 : INFO : EPOCH 37: training on 70267 raw words (48117 effective words) took 0.1s, 444727 effective words/s


2025-11-19 09:25:50,306 : INFO : EPOCH 38: training on 70267 raw words (48067 effective words) took 0.1s, 470456 effective words/s


2025-11-19 09:25:50,413 : INFO : EPOCH 39: training on 70267 raw words (48116 effective words) took 0.1s, 476953 effective words/s


2025-11-19 09:25:50,524 : INFO : EPOCH 40: training on 70267 raw words (48184 effective words) took 0.1s, 459666 effective words/s


2025-11-19 09:25:50,636 : INFO : EPOCH 41: training on 70267 raw words (48385 effective words) took 0.1s, 466007 effective words/s


2025-11-19 09:25:50,743 : INFO : EPOCH 42: training on 70267 raw words (48286 effective words) took 0.1s, 481861 effective words/s


2025-11-19 09:25:50,853 : INFO : EPOCH 43: training on 70267 raw words (48168 effective words) took 0.1s, 470909 effective words/s


2025-11-19 09:25:50,967 : INFO : EPOCH 44: training on 70267 raw words (48045 effective words) took 0.1s, 452880 effective words/s


2025-11-19 09:25:51,073 : INFO : EPOCH 45: training on 70267 raw words (48205 effective words) took 0.1s, 489718 effective words/s


2025-11-19 09:25:51,186 : INFO : EPOCH 46: training on 70267 raw words (48256 effective words) took 0.1s, 459909 effective words/s


2025-11-19 09:25:51,296 : INFO : EPOCH 47: training on 70267 raw words (48105 effective words) took 0.1s, 471636 effective words/s


2025-11-19 09:25:51,409 : INFO : EPOCH 48: training on 70267 raw words (48183 effective words) took 0.1s, 449051 effective words/s


2025-11-19 09:25:51,522 : INFO : EPOCH 49: training on 70267 raw words (48199 effective words) took 0.1s, 453524 effective words/s


2025-11-19 09:25:51,522 : INFO : FastText lifecycle event {'msg': 'training on 3513350 raw words (2407324 effective words) took 5.5s, 439050 effective words/s', 'datetime': '2025-11-19T09:25:51.522606', 'gensim': '4.3.3', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'train'}


2025-11-19 09:25:51,569 : INFO : FastText lifecycle event {'params': 'FastText<vocab=3677, vector_size=100, alpha=0.025>', 'datetime': '2025-11-19T09:25:51.569265', 'gensim': '4.3.3', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'created'}


2025-11-19 09:25:51,569 : INFO : FastText lifecycle event {'fname_or_handle': 'models/hieroglyphic_fasttext.model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2025-11-19T09:25:51.569737', 'gensim': '4.3.3', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'saving'}


2025-11-19 09:25:51,570 : INFO : storing np array 'vectors_ngrams' to models/hieroglyphic_fasttext.model.wv.vectors_ngrams.npy


2025-11-19 09:25:51,684 : INFO : not storing attribute vectors


2025-11-19 09:25:51,684 : INFO : not storing attribute buckets_word


2025-11-19 09:25:51,684 : INFO : not storing attribute cum_table


2025-11-19 09:25:51,687 : INFO : saved models/hieroglyphic_fasttext.model


Saving Hieroglyphic model...
Done.
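
The log above estimates roughly 805 MB of memory for a vocabulary of only 3677 words. A back-of-the-envelope calculation, assuming 4-byte float32 entries, suggests that almost all of it comes from the n-gram bucket matrix (2,000,000 buckets) rather than from the vocabulary itself. Variable names below are illustrative.

```python
BYTES_PER_FLOAT32 = 4          # assumption: vectors are stored as float32
buckets = 2_000_000            # from the "estimated required memory" log line
vector_size = 100              # VECTOR_SIZE
reported_total = 805_413_924   # bytes, from the same log line

# One vector of length `vector_size` per hash bucket
ngram_matrix_bytes = buckets * vector_size * BYTES_PER_FLOAT32
share = ngram_matrix_bytes / reported_total
print(f"{ngram_matrix_bytes:,} bytes ({share:.1%} of the estimate)")
```

If memory matters, gensim's `FastText` accepts a `bucket` argument; a smaller value would shrink this matrix at the cost of more n-gram hash collisions. Whether that affects quality on this corpus is untested here.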


## 3. Train English Word2Vec

In [4]:
print("Training English Word2Vec model...")
eng_model = Word2Vec(
    sentences=eng_sentences,
    vector_size=VECTOR_SIZE,
    window=WINDOW,
    min_count=MIN_COUNT,
    sg=1,  # Skip-gram
    epochs=EPOCHS,
    seed=42
)

print("Saving English model...")
eng_model.save(ENGLISH_MODEL_FILE)
print("Done.")

2025-11-19 09:25:51,691 : INFO : collecting all words and their counts


2025-11-19 09:25:51,692 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types


2025-11-19 09:25:51,702 : INFO : PROGRESS: at sentence #10000, processed 95814 words, keeping 6847 word types


2025-11-19 09:25:51,706 : INFO : collected 7800 word types from a corpus of 123355 raw words and 12773 sentences


2025-11-19 09:25:51,706 : INFO : Creating a fresh vocabulary


2025-11-19 09:25:51,713 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=2 retains 4177 unique words (53.55% of original 7800, drops 3623)', 'datetime': '2025-11-19T09:25:51.713352', 'gensim': '4.3.3', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'prepare_vocab'}


2025-11-19 09:25:51,713 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=2 leaves 119732 word corpus (97.06% of original 123355, drops 3623)', 'datetime': '2025-11-19T09:25:51.713818', 'gensim': '4.3.3', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'prepare_vocab'}


2025-11-19 09:25:51,722 : INFO : deleting the raw counts dictionary of 7800 items


2025-11-19 09:25:51,723 : INFO : sample=0.001 downsamples 54 most-common words


2025-11-19 09:25:51,723 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 80806.09783115299 word corpus (67.5%% of prior 119732)', 'datetime': '2025-11-19T09:25:51.723590', 'gensim': '4.3.3', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'prepare_vocab'}


2025-11-19 09:25:51,737 : INFO : estimated required memory for 4177 words and 100 dimensions: 5430100 bytes


2025-11-19 09:25:51,738 : INFO : resetting layer weights


2025-11-19 09:25:51,740 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2025-11-19T09:25:51.739993', 'gensim': '4.3.3', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'build_vocab'}


2025-11-19 09:25:51,740 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4177 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2025-11-19T09:25:51.740253', 'gensim': '4.3.3', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'train'}


Training English Word2Vec model...


2025-11-19 09:25:51,867 : INFO : EPOCH 0: training on 123355 raw words (80705 effective words) took 0.1s, 666366 effective words/s


2025-11-19 09:25:51,993 : INFO : EPOCH 1: training on 123355 raw words (80814 effective words) took 0.1s, 671061 effective words/s


2025-11-19 09:25:52,120 : INFO : EPOCH 2: training on 123355 raw words (80818 effective words) took 0.1s, 657125 effective words/s


2025-11-19 09:25:52,244 : INFO : EPOCH 3: training on 123355 raw words (80758 effective words) took 0.1s, 674087 effective words/s


2025-11-19 09:25:52,367 : INFO : EPOCH 4: training on 123355 raw words (80759 effective words) took 0.1s, 677432 effective words/s


2025-11-19 09:25:52,491 : INFO : EPOCH 5: training on 123355 raw words (80867 effective words) took 0.1s, 679286 effective words/s


2025-11-19 09:25:52,612 : INFO : EPOCH 6: training on 123355 raw words (80814 effective words) took 0.1s, 701102 effective words/s


2025-11-19 09:25:52,739 : INFO : EPOCH 7: training on 123355 raw words (80864 effective words) took 0.1s, 656084 effective words/s


2025-11-19 09:25:52,864 : INFO : EPOCH 8: training on 123355 raw words (80837 effective words) took 0.1s, 672724 effective words/s


2025-11-19 09:25:52,993 : INFO : EPOCH 9: training on 123355 raw words (80962 effective words) took 0.1s, 649014 effective words/s


2025-11-19 09:25:53,120 : INFO : EPOCH 10: training on 123355 raw words (81057 effective words) took 0.1s, 661069 effective words/s


2025-11-19 09:25:53,243 : INFO : EPOCH 11: training on 123355 raw words (80781 effective words) took 0.1s, 678741 effective words/s


2025-11-19 09:25:53,366 : INFO : EPOCH 12: training on 123355 raw words (80968 effective words) took 0.1s, 675659 effective words/s


2025-11-19 09:25:53,490 : INFO : EPOCH 13: training on 123355 raw words (80885 effective words) took 0.1s, 684600 effective words/s


2025-11-19 09:25:53,611 : INFO : EPOCH 14: training on 123355 raw words (80929 effective words) took 0.1s, 697479 effective words/s


2025-11-19 09:25:53,733 : INFO : EPOCH 15: training on 123355 raw words (80776 effective words) took 0.1s, 682808 effective words/s


2025-11-19 09:25:53,860 : INFO : EPOCH 16: training on 123355 raw words (80914 effective words) took 0.1s, 654491 effective words/s


2025-11-19 09:25:53,981 : INFO : EPOCH 17: training on 123355 raw words (80718 effective words) took 0.1s, 703385 effective words/s


2025-11-19 09:25:54,105 : INFO : EPOCH 18: training on 123355 raw words (80657 effective words) took 0.1s, 667907 effective words/s


2025-11-19 09:25:54,228 : INFO : EPOCH 19: training on 123355 raw words (80977 effective words) took 0.1s, 679983 effective words/s


2025-11-19 09:25:54,349 : INFO : EPOCH 20: training on 123355 raw words (80584 effective words) took 0.1s, 696378 effective words/s


2025-11-19 09:25:54,474 : INFO : EPOCH 21: training on 123355 raw words (80644 effective words) took 0.1s, 674937 effective words/s


2025-11-19 09:25:54,599 : INFO : EPOCH 22: training on 123355 raw words (80779 effective words) took 0.1s, 672697 effective words/s


2025-11-19 09:25:54,726 : INFO : EPOCH 23: training on 123355 raw words (80845 effective words) took 0.1s, 663170 effective words/s


2025-11-19 09:25:54,851 : INFO : EPOCH 24: training on 123355 raw words (80777 effective words) took 0.1s, 672302 effective words/s


2025-11-19 09:25:55,015 : INFO : EPOCH 25: training on 123355 raw words (80918 effective words) took 0.2s, 510391 effective words/s


2025-11-19 09:25:55,136 : INFO : EPOCH 26: training on 123355 raw words (80720 effective words) took 0.1s, 687121 effective words/s


2025-11-19 09:25:55,258 : INFO : EPOCH 27: training on 123355 raw words (80872 effective words) took 0.1s, 697015 effective words/s


2025-11-19 09:25:55,382 : INFO : EPOCH 28: training on 123355 raw words (81013 effective words) took 0.1s, 672589 effective words/s


2025-11-19 09:25:55,505 : INFO : EPOCH 29: training on 123355 raw words (80786 effective words) took 0.1s, 677492 effective words/s


2025-11-19 09:25:55,627 : INFO : EPOCH 30: training on 123355 raw words (80814 effective words) took 0.1s, 685841 effective words/s


2025-11-19 09:25:55,748 : INFO : EPOCH 31: training on 123355 raw words (80842 effective words) took 0.1s, 703026 effective words/s


2025-11-19 09:25:55,872 : INFO : EPOCH 32: training on 123355 raw words (80809 effective words) took 0.1s, 673076 effective words/s


2025-11-19 09:25:55,993 : INFO : EPOCH 33: training on 123355 raw words (80830 effective words) took 0.1s, 691296 effective words/s


2025-11-19 09:25:56,112 : INFO : EPOCH 34: training on 123355 raw words (80821 effective words) took 0.1s, 712138 effective words/s


2025-11-19 09:25:56,226 : INFO : EPOCH 35: training on 123355 raw words (80817 effective words) took 0.1s, 745679 effective words/s


2025-11-19 09:25:56,347 : INFO : EPOCH 36: training on 123355 raw words (80820 effective words) took 0.1s, 703851 effective words/s


2025-11-19 09:25:56,464 : INFO : EPOCH 37: training on 123355 raw words (81010 effective words) took 0.1s, 714434 effective words/s


2025-11-19 09:25:56,584 : INFO : EPOCH 38: training on 123355 raw words (80731 effective words) took 0.1s, 694873 effective words/s


2025-11-19 09:25:56,702 : INFO : EPOCH 39: training on 123355 raw words (80829 effective words) took 0.1s, 710486 effective words/s


2025-11-19 09:25:56,826 : INFO : EPOCH 40: training on 123355 raw words (80858 effective words) took 0.1s, 674569 effective words/s


2025-11-19 09:25:56,949 : INFO : EPOCH 41: training on 123355 raw words (80760 effective words) took 0.1s, 679767 effective words/s


2025-11-19 09:25:57,070 : INFO : EPOCH 42: training on 123355 raw words (80813 effective words) took 0.1s, 686756 effective words/s


2025-11-19 09:25:57,191 : INFO : EPOCH 43: training on 123355 raw words (80783 effective words) took 0.1s, 691003 effective words/s


2025-11-19 09:25:57,315 : INFO : EPOCH 44: training on 123355 raw words (80769 effective words) took 0.1s, 680227 effective words/s


2025-11-19 09:25:57,432 : INFO : EPOCH 45: training on 123355 raw words (80833 effective words) took 0.1s, 713507 effective words/s


2025-11-19 09:25:57,552 : INFO : EPOCH 46: training on 123355 raw words (80704 effective words) took 0.1s, 694406 effective words/s


2025-11-19 09:25:57,669 : INFO : EPOCH 47: training on 123355 raw words (80925 effective words) took 0.1s, 714249 effective words/s


2025-11-19 09:25:57,792 : INFO : EPOCH 48: training on 123355 raw words (80683 effective words) took 0.1s, 690055 effective words/s


2025-11-19 09:25:57,914 : INFO : EPOCH 49: training on 123355 raw words (80833 effective words) took 0.1s, 682020 effective words/s


2025-11-19 09:25:57,914 : INFO : Word2Vec lifecycle event {'msg': 'training on 6167750 raw words (4041082 effective words) took 6.2s, 654492 effective words/s', 'datetime': '2025-11-19T09:25:57.914988', 'gensim': '4.3.3', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'train'}


2025-11-19 09:25:57,915 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec<vocab=4177, vector_size=100, alpha=0.025>', 'datetime': '2025-11-19T09:25:57.915216', 'gensim': '4.3.3', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'created'}


2025-11-19 09:25:57,915 : INFO : Word2Vec lifecycle event {'fname_or_handle': 'models/english_word2vec.model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2025-11-19T09:25:57.915543', 'gensim': '4.3.3', 'python': '3.12.3 (main, Jun  1 2025, 04:19:33) [Clang 17.0.0 (clang-1700.0.13.5)]', 'platform': 'macOS-26.1-arm64-arm-64bit', 'event': 'saving'}


2025-11-19 09:25:57,915 : INFO : not storing attribute cum_table


2025-11-19 09:25:57,918 : INFO : saved models/english_word2vec.model


Saving English model...
Done.


## 4. Sanity Check

Let's check the "most similar" words for a few concepts to see if the individual spaces learned anything meaningful.
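
`most_similar` ranks candidates by cosine similarity between word vectors. A minimal pure-Python sketch of that score, with made-up 3-dimensional vectors standing in for real embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (illustrative only, not taken from the trained models)
query = [1.0, 0.0, 0.0]
print(cosine_similarity(query, [2.0, 0.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity(query, [0.0, 3.0, 0.0]))  # 0.0 (orthogonal)
```

Scores close to 1 mean two words occur in very similar contexts; scores near 0 mean the model sees little relation between them.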

In [5]:
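def check_similarity(model, word, title):
    """Print the 5 nearest neighbors of `word` by cosine similarity."""
    print(f"\n--- {title}: '{word}' ---")
    try:
        similar = model.wv.most_similar(word, topn=5)
        for w, score in similar:
            print(f"{w}: {score:.3f}")
    except KeyError:
        # Word2Vec raises KeyError for unseen words. FastText composes a
        # vector from character n-grams instead, so for FastText this
        # branch is generally not reached even for unseen words.
        print(f"Word '{word}' not in vocabulary.")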

# Check Hieroglyphic (e.g., 'nfr' = good/beautiful, 'ra' = sun god)
check_similarity(hier_model, 'nfr', "Hieroglyphic")
check_similarity(hier_model, 'ra', "Hieroglyphic")

# Check English
check_similarity(eng_model, 'god', "English")
check_similarity(eng_model, 'king', "English")


--- Hieroglyphic: 'nfr' ---
nfrꞽ: 0.744
nfr.n: 0.708
nfr.w: 0.649
nfr.tꞽ: 0.649
snfr: 0.645

--- Hieroglyphic: 'ra' ---
zꜣ: 0.228
ꞽm.ꞽ-rd: 0.227
ḫꜣ.w: 0.213
ꞽri̯.w: 0.205
hꜣi̯.y: 0.191

--- English: 'god' ---
chentjtkaues: 0.524
herwer: 0.515
balmier: 0.507
tjenetholydom: 0.499
jti: 0.499

--- English: 'king' ---
sheschi: 0.593
captain: 0.561
file: 0.546
heneni: 0.544
chui: 0.541
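
One caveat when reading these lists: FastText can compose a vector for a word it never saw during training from its character n-grams, so a missing word does not necessarily produce the "not in vocabulary" message. Very low scores such as those for 'ra' may indicate this, though that is not verified here. Below is a sketch of an explicit check, assuming gensim 4's `key_to_index` mapping on `model.wv`; the stub object and its contents are hypothetical and stand in for a trained model so that the snippet runs on its own.

```python
from types import SimpleNamespace

def is_trained_word(keyed_vectors, word):
    """True if `word` has its own entry in the trained vocabulary."""
    return word in keyed_vectors.key_to_index

# Stub standing in for `hier_model.wv` (hypothetical contents)
stub_wv = SimpleNamespace(key_to_index={"nfr": 0, "snfr": 1})

for w in ["nfr", "ra"]:
    status = "in vocabulary" if is_trained_word(stub_wv, w) else "out of vocabulary"
    print(f"{w}: {status}")
```

With the real models this would be called as `is_trained_word(hier_model.wv, 'ra')` before interpreting the neighbors.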
