# **`word2vec` Training Workflow**
In this workflow, we first train `word2vec` models across years using a range of hyperparameters (e.g., vector dimensions and training epochs). The purpose is twofold: (1) to determine whether models from earlier years are reasonably stable, and (2) to choose a set of hyperparameters that yields good results across all years. Models are evaluated using "intrinsic" tests of similarity and analogy performance, which we visualize using plots and analyze using linear regression.

Once we've chosen our hyperparameters, we use them to train models for every year from 1900 through 2019.

## **Setup**
### Imports

In [1]:
%load_ext autoreload
%autoreload 2

from ngramkit.ngram_train.word2vec import build_word2vec_models, evaluate_word2vec_models, plot_evaluation_results
from ngramkit.ngram_train.word2vec import run_regression_analysis, plot_regression_results

### Configure

In [2]:
db_path_stub = '/scratch/edk202/NLP_corpora/Google_Books/'
release = '20200217'
language = 'eng-fiction'
size = 5

## **Test Model Hyperparameters**
### Train Models
Here we test models from 1900 to 2015 in 5-year increments, cycling through a range of reasonable hyperparameters. In this workflow, we constrain our grid search as follows:
1. We stick to the Skip-Gram (`skip-gram`) approach. Skip-gram is known to be more efficient than Continuous Bag of Words (`CBOW`) for Google n-gram data.
2. We test vector dimensions (`vector_size`) from 100 to 300. Our vocabulary is probably too small to support the extraction of more than 300 meaningful features.
3. We test training epochs (`epochs`) from 5 to 30. More than 30 epochs risks overfitting.
4. We set the minimum word count (`min_count`) to 1, meaning that no words will be excluded from training. Our whitelist ensures that all vocabulary words appear frequently in every corpus from 1900 to 2015.
5. Weighting (`weight_by`) is set to none. `word2vec` already downweights extremely frequent words.
6. We set a context window (`window`) of 4. This width extracts as much context as possible from 5-grams.
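
To make the size of this search concrete, the constraints above can be enumerated as a grid. This is a minimal sketch that assumes `build_word2vec_models` trains one model per combination of year and hyperparameter values (a Cartesian product); the `grid` and `configs` names are illustrative and not part of `ngramkit`.

```python
from itertools import product

# Values copied from the constraints listed above.
years = range(1900, 2015 + 1, 5)  # 1900, 1905, ..., 2015 (end year inclusive)
grid = {
    "weight_by": ("none",),
    "vector_size": (100, 200, 300),
    "window": (4,),
    "min_count": (1,),
    "approach": ("skip-gram",),
    "epochs": (5, 10, 15, 20, 25, 30),
}

# Assumption: one model per (year, hyperparameter combination).
configs = [
    {"year": year, **dict(zip(grid, values))}
    for year, values in product(years, product(*grid.values()))
]

settings_per_year = len(configs) // len(years)
print(len(years), "years x", settings_per_year, "settings =", len(configs), "models")
```

Under that assumption, only `vector_size` and `epochs` actually vary, so each year gets 3 x 6 = 18 settings.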

In [None]:
build_word2vec_models(
    ngram_size=size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    dir_suffix='test',
    years=(1900, 2015),
    year_step=5,
    weight_by=('none',),
    vector_size=(100, 200, 300),
    window=(4,),
    min_count=(1,),
    approach=('skip-gram',),
    epochs=(5, 10, 15, 20, 25, 30),
    max_parallel_models=25,
    workers_per_model=2,
    mode="resume",
    unk_mode="strip",
    use_corpus_file=True,
    cache_corpus=True
)


Scanning for existing models...


Scanning existing models: 100%|██████████| 29/29 [00:00<00:00, 43.21 files/s]

  Valid models found:    29
  Invalid/partial:       0

WORD2VEC MODEL TRAINING
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-11-16 21:15:36

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Database:             ...NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/5grams_pivoted.db
Model directory:      ...edk202/NLP_models/Google_Books/20200217/eng-fiction/5gram_files/models_test
Log directory:        ...NLP_models/Google_Books/20200217/eng-fiction/5gram_files/logs_test/training
Parallel models:      25

Training Parameters
────────────────────────────────────────────────────────────────────────────────────────────────────
Years:                1900–2015 (step=5, 24 years)
Weighting:            ('none',)
Vector size:          (100, 200, 300)
Context window:       (4,)
Minimum word count:   (1,)
Approach:             ('skip-gram',)
Training 




  Created corpus file for year 1945, weight_by=none: /state/partition1/job-457422/w2v_corpus_y1945_wbnone_es6m8c9i.txt
  Created corpus file for year 1940, weight_by=none: /state/partition1/job-457422/w2v_corpus_y1940_wbnone_c_6xu_1a.txt
  Created corpus file for year 1935, weight_by=none: /state/partition1/job-457422/w2v_corpus_y1935_wbnone_s93eev6x.txt
  Created corpus file for year 1955, weight_by=none: /state/partition1/job-457422/w2v_corpus_y1955_wbnone_fmcs9xuk.txt
  Created corpus file for year 1915, weight_by=none: /state/partition1/job-457422/w2v_corpus_y1915_wbnone_6p58gn_w.txt
  Created corpus file for year 1965, weight_by=none: /state/partition1/job-457422/w2v_corpus_y1965_wbnone_yv_2g254.txt
  Created corpus file for year 1975, weight_by=none: /state/partition1/job-457422/w2v_corpus_y1975_wbnone_kd8s5naj.txt
  Created corpus file for year 1930, weight_by=none: /state/partition1/job-457422/w2v_corpus_y1930_wbnone_u5bv6pq0.txt
  Created corpus file for year 1960, weight_by=n

### Evaluate Models

Here we evaluate the models we've trained using two "intrinsic" tests: (1) a _similarity test_ assessing how well each model predicts human-rated synonymy judgments, and (2) an _analogy test_ assessing how well each model can answer SAT-style analogy questions. Test results are saved to a CSV file.
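
As a rough illustration of the similarity test, word-similarity benchmarks are conventionally scored by the Spearman rank correlation between a model's cosine similarities and human ratings. The sketch below assumes that convention; whether `evaluate_word2vec_models` computes its score exactly this way is an assumption, and the vectors and ratings are made-up placeholders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Placeholder 2-d "embeddings" and human ratings, for illustration only.
vectors = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.1, 1.0], "truck": [0.3, 0.8]}
human = [("cat", "dog", 9.0), ("car", "truck", 8.5), ("cat", "car", 2.0), ("dog", "truck", 2.5)]

model_sims = [cosine(vectors[a], vectors[b]) for a, b, _ in human]
human_sims = [rating for _, _, rating in human]
score = spearman(model_sims, human_sims)
print(f"similarity score: {score:.2f}")
```

Because the score depends only on rank order, a model is rewarded for ordering word pairs the way human raters do, not for matching the rating scale.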

Similarity performance is the metric of choice for models intended to track semantic relatedness over time. However, we run both tests here to demonstrate the evaluation code and show that different hyperparameters lend themselves to different performance metrics.
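
For the analogy test, a common way to answer "a is to b as c is to ?" with word vectors is the vector-offset method: form `b - a + c` and pick the nearest remaining word by cosine similarity. The sketch below assumes that formulation, which may differ from what `evaluate_word2vec_models` does internally; the three-dimensional vectors are hand-built placeholders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(vectors, a, b, c):
    """Return the word closest to b - a + c, excluding the three question words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in {a, b, c}]
    return max(candidates, key=lambda word: cosine(vectors[word], target))

# Placeholder vectors, for illustration only.
vectors = {
    "man": [1.0, 0.0, 0.1],
    "woman": [1.0, 1.0, 0.1],
    "king": [0.1, 0.0, 1.0],
    "queen": [0.1, 1.0, 1.0],
    "apple": [0.5, 0.5, 0.0],
}

questions = [("man", "woman", "king", "queen")]
correct = sum(solve_analogy(vectors, a, b, c) == answer for a, b, c, answer in questions)
analogy_score = correct / len(questions)
print(f"analogy score: {analogy_score:.2f}")
```

The analogy score in this sketch is simply the fraction of questions answered correctly.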

In [None]:
evaluate_word2vec_models(
    ngram_size=size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    dir_suffix='test',
    save_mode='overwrite',
    run_similarity=True,
    run_analogy=True,
    workers=100
)

### Visualize Model Performance
The code below plots the results of the similarity and analogy tests for easy inspection.
#### Similarity Results
Visual inspection of the similarity plots reveals that all models display decent performance, with the earliest models scoring in the .50–.55 range and the most recent models scoring in the .58–.62 range over our chosen timespan. Overall, then, the models are performing well for n-gram-based semantic relatedness. Vector dimensions and training epochs don't seem to matter much overall.

In [None]:
from pathlib import Path
from ngramkit.ngram_acquire.db.build_path import build_db_path
from ngramkit.ngram_train.word2vec.config import construct_model_path

# Construct path to evaluation results
base_path = Path(build_db_path(db_path_stub, size, release, language)).parent
model_base = construct_model_path(str(base_path))
eval_file = Path(model_base) / "evaluation_results_test.csv"

plot_evaluation_results(
    csv_file=str(eval_file),
    verbose=False,
    metric='similarity_score',
    x_vars=['epochs', 'vector_size'],
    panel_by='year',
    plot_type='line',
    plot_title='Similarity Score by Training Epochs and Vector Size'
)
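
To check score ranges read off the plots numerically, the evaluation CSV can be summarized per year. This is a minimal sketch that assumes the file has `year` and `similarity_score` columns (inferred from the arguments passed to `plot_evaluation_results`); the rows in `SAMPLE_CSV` are placeholders, not real results.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows for illustration only; real values come from the evaluation CSV.
SAMPLE_CSV = """year,vector_size,epochs,similarity_score
1900,100,5,0.10
1900,300,30,0.20
2015,100,5,0.30
2015,300,30,0.40
"""

def score_range_by_year(handle, metric="similarity_score"):
    """Return {year: (min, max)} of the metric across all hyperparameter settings."""
    scores = defaultdict(list)
    for row in csv.DictReader(handle):
        scores[int(row["year"])].append(float(row[metric]))
    return {year: (min(vals), max(vals)) for year, vals in sorted(scores.items())}

ranges = score_range_by_year(io.StringIO(SAMPLE_CSV))
for year, (low, high) in ranges.items():
    print(f"{year}: {low:.2f} to {high:.2f}")
```

To run it on the real results, pass an open file handle for `eval_file` instead of the `io.StringIO` placeholder.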

#### Analogy Results
As with the similarity results, model performance improves with corpus recency. Unlike the similarity results, analogy performance clearly improves with the number of training epochs. The number of vector dimensions doesn't seem to matter much.

In [None]:
# eval_file already constructed above
plot_evaluation_results(
    csv_file=str(eval_file),
    verbose=False,
    metric='analogy_score',
    x_vars=['epochs', 'vector_size'],
    panel_by='year',
    plot_type='line',
    plot_title='Analogy Score by Training Epochs and Vector Size'
)

### Regression Analysis
The code below runs regression analyses on the similarity and analogy results.
#### Predictors of Similarity Performance
The model coefficients and dot-and-whisker plots show that, not surprisingly, similarity performance is best for recent corpora. The number of training epochs doesn't seem to matter overall; in fact, the year-by-epochs interaction indicates that more epochs are detrimental to model quality for the earliest corpora. The small positive effects of vector dimensions and the year-by-dimensions interaction show that adding features improves model quality a bit, especially for the most recent corpora.
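
As a reading aid, the similarity regression specified in the cell below can be written out as follows. This assumes that `model_type="auto"` resolves to an ordinary linear model and that predictors enter on a common numeric scale; both are assumptions. The `approach` predictor is left out because only skip-gram models were trained, so it does not vary.

$$
\text{score} = \beta_0 + \beta_1\,\text{year} + \beta_2\,\text{dims} + \beta_3\,\text{epochs} + \beta_4\,(\text{year} \times \text{dims}) + \beta_5\,(\text{year} \times \text{epochs}) + \beta_6\,(\text{dims} \times \text{epochs}) + \varepsilon
$$

Under this form, the effect of one additional epoch is not a single number but depends on year and dimensionality:

$$
\frac{\partial\,\text{score}}{\partial\,\text{epochs}} = \beta_3 + \beta_5\,\text{year} + \beta_6\,\text{dims}
$$

If $\beta_5 > 0$, the slope for epochs rises with year, so it can be negative for the earliest corpora even when it is close to zero on average across years.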

In [None]:
# eval_file already constructed above
results = run_regression_analysis(
    csv_file=str(eval_file),
    model_type="auto",
    outcome='similarity_score',
    predictors=['year', 'vector_size', 'epochs', 'approach'],
    interactions=[('year', 'vector_size'), ('year', 'epochs'), ('vector_size', 'epochs')],
)

plot_regression_results(results)

#### Predictors of Analogy Performance
The regression results for analogy performance confirm what we saw in the plots. Models trained on more recent corpora perform better, and more training epochs contribute to better performance. Although it wasn't immediately apparent in the plots, adding vector dimensions negatively impacts analogy performance, and this effect is not moderated by year.

In [None]:
# eval_file already constructed above
results = run_regression_analysis(
    csv_file=str(eval_file),
    model_type="auto",
    outcome='analogy_score',
    predictors=['year', 'vector_size', 'epochs', 'approach'],
    interactions=[('year', 'vector_size'), ('year', 'epochs')],
)

plot_regression_results(results)

## **Train Final Models**

Having explored a range of hyperparameters, we train final models for every year from 1900 through 2019 using what we've learned. A defensible hyperparameter set is:
1. `approach=('skip-gram',)`
2. `window=(4,)`
3. `vector_size=(200,)`
4. `epochs=(10,)`
5. `min_count=(1,)`

In [None]:
build_word2vec_models(
    ngram_size=size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    dir_suffix='final',
    years=(1900, 2019),
    year_step=1,
    weight_by=('none',),
    vector_size=(200,),
    window=(4,),
    min_count=(1,),
    approach=('skip-gram',),
    epochs=(10,),
    max_parallel_models=33,
    workers_per_model=3,
    mode="resume",
    unk_mode="strip",
    use_corpus_file=True
)

## **Normalize and Align Models**

Before we can use the models for diachronic analysis, we need to unit-normalize the vectors and align them across years using Procrustes rotation. The `normalize_and_align_vectors` function does this.
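
The two steps can be sketched with NumPy as follows. This is an illustration of unit normalization and orthogonal Procrustes alignment in general, not the implementation inside `normalize_and_align_vectors`; it assumes both matrices have one row per word, in the same word order, and uses random placeholder data.

```python
import numpy as np

def unit_normalize(matrix):
    """Scale each row (word vector) to unit length, leaving all-zero rows unchanged."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.where(norms == 0, 1.0, norms)

def procrustes_rotation(source, anchor):
    """Return the orthogonal matrix R that minimizes ||source @ R - anchor||_F."""
    u, _, vt = np.linalg.svd(source.T @ anchor)
    return u @ vt

# Placeholder data: an "anchor year" matrix and a rotated copy standing in for another year.
rng = np.random.default_rng(0)
anchor = unit_normalize(rng.normal(size=(50, 8)))
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # a random orthogonal matrix
source = anchor @ q.T

rotation = procrustes_rotation(source, anchor)
aligned = source @ rotation

print("max alignment error:", float(np.abs(aligned - anchor).max()))
```

Because the rotation is orthogonal, it preserves vector lengths and the cosine similarities within a year; only the coordinate system changes, which is what makes vectors comparable across years.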

In [2]:
from ngramkit.ngram_train.word2vec.normalize_and_align_models import normalize_and_align_vectors

ngram_size = 5
proj_dir = '/scratch/edk202/NLP_models/Google_Books/20200217/eng'
dir_suffix = 'final'
anchor_year = 2000
workers = 16

normalize_and_align_vectors(
    ngram_size=ngram_size,
    proj_dir=proj_dir,
    dir_suffix=dir_suffix,
    anchor_year=anchor_year,
    workers=workers
)

Saved normalized anchor model to /scratch/edk202/NLP_models/Google_Books/20200217/eng/5gram_files/models_final/norm_and_align/w2v_y2000_wbnone_vs200_w004_mc001_sg1_e010.kv


Processing models: 100%|██████████| 119/119 [01:24<00:00,  1.40file/s]

Total runtime: 0:01:25.178946
Processed 120 models. Aligned to anchor year 2000.
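
The model file name in the output above appears to encode the hyperparameters. The sketch below parses such names under an inferred convention (`y` = year, `wb` = weighting, `vs` = vector size, `w` = window, `mc` = minimum count, `sg` = 1 for skip-gram, `e` = epochs). That reading, the `sg0` = CBOW mapping, and the `parse_model_name` helper itself are assumptions, not part of `ngramkit`.

```python
import re

# Pattern inferred from the single file name shown in the output; treat it as an assumption.
MODEL_NAME = re.compile(
    r"w2v_y(?P<year>\d{4})_wb(?P<weight_by>[a-z]+)_vs(?P<vector_size>\d+)"
    r"_w(?P<window>\d+)_mc(?P<min_count>\d+)_sg(?P<sg>[01])_e(?P<epochs>\d+)\.kv$"
)

def parse_model_name(filename):
    """Hypothetical helper: recover hyperparameters from a model file name."""
    match = MODEL_NAME.search(filename)
    if match is None:
        raise ValueError(f"Unrecognized model file name: {filename}")
    fields = match.groupdict()
    return {
        "year": int(fields["year"]),
        "weight_by": fields["weight_by"],
        "vector_size": int(fields["vector_size"]),
        "window": int(fields["window"]),
        "min_count": int(fields["min_count"]),
        "approach": "skip-gram" if fields["sg"] == "1" else "CBOW",  # sg0 = CBOW is assumed
        "epochs": int(fields["epochs"]),
    }

params = parse_model_name("w2v_y2000_wbnone_vs200_w004_mc001_sg1_e010.kv")
print(params)
```

A helper like this can be useful for confirming that every year from 1900 through 2019 has exactly one aligned model with the intended settings.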



