In [2]:
%matplotlib widget
from training_tools.train_models import train_models
from training_tools.evaluate_models import evaluate_models
from training_tools.plotting import load_results, plot_metrics


# **Train `Word2vec` Models**
## **Goal**: Train and evaluate word embeddings using the `Word2vec` algorithm. 

If you have successfully run the two previous workflows (`workflow_unigrams.ipynb` and `workflow_multigrams.ipynb`), you now have pre-processed yearly Google Ngram data. This workflow uses the `train_models.py` module to train year-specific models using [Word2vec](https://en.wikipedia.org/wiki/Word2vec), a popular technique for deriving [vector representations](https://en.wikipedia.org/wiki/Word_embedding) of the words in a corpus. You can then use the `evaluate_models.py` module to see how well your models have been trained.

Training word-embedding models requires decisions about various modeling parameters (i.e., hyperparameters). The most important are:

1. **`vector_size`**: the dimensionality of the vector space in which words are embedded. In essence, vector size is the number of learned features (coordinates) the model uses to represent each word. More dimensions are not necessarily better. While too few dimensions can prevent the model from learning words' full meanings, too many dimensions risk overfitting or allowing low-frequency, specialized uses to obscure words' most important meanings.

2. **`window`**: the width of Word2vec's sliding context window. Word2vec reads in "sentences" (in this case, multigrams) and tries to learn probabilistic relationships between a target word and the words that surround it in the corpus. To the extent that two words have the same probabilistic relations to other nearby words, their vector representations will be similar. The `window` parameter determines what "nearby" means: it is the maximum distance between a target word and a context word. With a window of 2, a word's meaning is defined by the words up to two positions away on either side; with a window of 5, by words up to five positions away. Roughly speaking, models trained using a narrow context window will privilege _syntactic_ relationships, whereas models trained using a wider context window will do a better job learning _semantic_ relationships.

3. **`approach`**: the training architecture to use. You can specify either `CBOW` (Continuous Bag of Words) or `skip-gram`. In CBOW, context words are used to predict target words; in `skip-gram`, target words are used to predict context words. `skip-gram` tends to yield better results with ngrams.

4. **`min_count`**: the minimum number of times a word must appear in the corpus to be used for training. You may wish to ignore extremely infrequent words—especially in large corpora. This parameter lets you do that.

5. **`weight_by`**: the strategy for weighting ngrams by their frequency in the corpus. `none` means that no weighting is used and each unique ngram is fed to the model only once. `freq` gives ngrams a "bonus" if they appear multiple times; however, the bonus diminishes as the frequency increases. For example, an ngram appearing 100 times in the corpus is fed to Word2vec twice, an ngram appearing 1,000 times is fed to the model 3 times, an ngram appearing 10,000 times is fed to the model 4 times, and so on. The `doc_freq` option does the same thing, but uses the number of unique _documents_ in which an ngram appears. The goal of weighting is to allow somewhat frequent ngrams to influence the model more than rare ones, while not letting extremely frequent ngrams skew the model.

6. **`epochs`**: the number of training passes over the corpus. Too many can lead to overfitting, while too few can lead to imprecise embeddings. Experiment to see which values lead to the best model performance.
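
To make the `window` parameter concrete, here is a minimal sketch of how a context window pairs each target word with its neighbors. The `context_pairs` helper and the example ngram are hypothetical illustrations, not part of `train_models.py`, and the sketch assumes the window is a fixed maximum distance on either side of the target.

```python
def context_pairs(tokens, window):
    """Yield (target, context) pairs for every word within `window` positions."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                # leftmost position in the window
        hi = min(len(tokens), i + window + 1)  # one past the rightmost position
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                yield target, tokens[j]


# Hypothetical 5-gram used only for illustration
ngram = ["the", "quick", "brown", "fox", "jumps"]

narrow = [c for t, c in context_pairs(ngram, window=1) if t == "brown"]
wide = [c for t, c in context_pairs(ngram, window=2) if t == "brown"]

print(narrow)  # ['quick', 'fox']
print(wide)    # ['the', 'quick', 'fox', 'jumps']
```

Widening the window adds more distant neighbors to each word's context, which is why wider windows tend to capture topical rather than grammatical relationships.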

The other options are straightforward. `proj_dir` is the base directory for your project, `years=([start_year], [end_year])` specifies which yearly models to train, and `workers` is the number of CPU cores to use.
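
The diminishing "bonus" described for `weight_by='freq'` matches a base-10 logarithm of the frequency. The sketch below reproduces the example counts given above; the `repeat_count` helper is hypothetical, and the handling of frequencies below 100 (a floor of one copy) is an assumption rather than the module's documented behavior.

```python
def repeat_count(freq):
    """Hypothetical helper: how many copies of an ngram seen `freq` times to feed the model."""
    # For a positive integer, the number of digits minus one equals floor(log10(freq)).
    log_floor = len(str(int(freq))) - 1
    # Assumption: every ngram that survives filtering is fed at least once.
    return max(1, log_floor)


for freq in (5, 100, 1_000, 10_000):
    print(freq, repeat_count(freq))
# 5 1
# 100 2
# 1000 3
# 10000 4
```

Because the count grows with the logarithm of the frequency, a hundredfold increase in frequency adds only two extra copies.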

### Train Models
The `train_models.py` module can iterate through multiple parameter combinations, allowing you to conduct a [grid search](https://en.wikipedia.org/wiki/Hyperparameter_optimization). Thus, if you want to train models for 2018 and 2019 using vector sizes of 100, 200, and 300 and all three weighting strategies, you would specify `years=(2018, 2019)`, `weight_by=('none', 'freq', 'doc_freq')`, and `vector_size=(100, 200, 300)`. The module would then train models using all combinations of these parameters, for a total of 18 models (2 years × 3 weighting strategies × 3 vector sizes).
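
The count of 18 can be checked by enumerating the grid directly. This is only a sketch of the combinatorics using the standard library; it assumes the `years` pair is an inclusive start and end year, and it does not show how `train_models.py` builds its grid internally.

```python
from itertools import product

years = (2018, 2019)                      # (start_year, end_year), assumed inclusive
weight_by = ("none", "freq", "doc_freq")
vector_size = (100, 200, 300)

# Expand the year pair into every year it covers
year_range = range(years[0], years[1] + 1)

# One tuple per model: (year, weighting strategy, vector size)
grid = list(product(year_range, weight_by, vector_size))

print(len(grid))  # 18
print(grid[0])    # (2018, 'none', 100)
```

Each additional hyperparameter value multiplies the number of models, so large grids quickly become expensive to train.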

In [None]:
train_models(
    ngram_size=5,
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng/',
    years=(1892, 1900),
    weight_by='none',
    vector_size=200,
    window=5,
    min_count=20,
    approach='skip-gram',
    epochs=10
)

### Evaluate Training Quality
The next step is to examine the quality of the trained models. The `evaluate_models.py` module performs two "intrinsic" tests on the models in your project's `models` directory: a similarity test and an analogy test. In a similarity test, models predict human-rated similarities between word pairs. In an analogy test, models attempt to answer questions of the form "_a_ is to _b_ as _c_ is to what?"—for example, "_king_ is to _queen_ as _man_ is to what?" (where the correct answer is _woman_).
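
A common way to answer an analogy question with word vectors is the vector-offset method: compute `b - a + c` and pick the remaining word whose vector is most similar to the result. The sketch below illustrates the idea with hand-made two-dimensional toy vectors; the vectors and the `solve_analogy` helper are hypothetical and are not taken from `evaluate_models.py` or from a trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical): first axis ~ "royalty", second axis ~ "gender"
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.1, 1.0),
    "woman": (0.1, -1.0),
    "apple": (-1.0, 0.1),
}


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to what?' using the offset b - a + c."""
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    # The question words themselves are excluded from the candidate answers
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


answer = solve_analogy("king", "queen", "man", vectors)
print(answer)  # woman
```

An analogy score is then the fraction of test questions for which the top-ranked word matches the expected answer.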

By default, the code runs the test sets packaged with Gensim. To use different test items, specify the `similarity_dataset` and `analogy_dataset` options, making sure that the test files are properly formatted.

In [None]:
evaluate_models(
    ngram_size=5,
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng/',
    eval_dir='/scratch/edk202/hist_w2v/training_results',
    save_mode='append',
    workers=48
)

### Plot Metrics

In [None]:
results_df = load_results('../training_results/evaluation_results.csv')

plot_metrics(
    df=results_df,
    x_vars=['vector_size', 'epochs'],
    plot_type='contour',
    metric='similarity_score'
)

In [None]:
results_df = load_results('../training_results/evaluation_results.csv')

plot_metrics(
    df=results_df,
    x_vars=['vector_size', 'epochs'],
    plot_type='contour',
    metric='analogy_score'
)