# _Word2Vec Model_

This notebook introduces the [`gensim`](https://radimrehurek.com/gensim/index.html) library, which can be used for various NLP tasks, with a specific focus on topic modeling. 

More specifically, this notebook follows `gensim`'s [Word2Vec Model](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py) tutorial, which gives hands-on experience not only with the library but also with [`Word2Vec`](https://en.wikipedia.org/wiki/Word2vec), an algorithm used to learn word embeddings from a text corpus.

`Word2Vec` takes large amounts of unannotated text and attempts to learn the semantic relationships between the words. The output is one vector per word. For example, these vectors allow us to detect the following relationships:
- vec("king") - vec("man") + vec("woman") =~ vec("queen")
- vec("Montreal Canadiens") - vec("Montreal") + vec("Toronto") =~ vec("Toronto Maple Leafs")
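
To make the arithmetic above concrete, here is a minimal sketch on hand-picked toy vectors. Everything in it is an assumption made for illustration: the 2-D vectors are invented (real embeddings have hundreds of dimensions), and `cosine` and `nearest` are hypothetical helpers, not part of `gensim`.

```python
import math

# toy 2-D vectors invented for this example:
# axis 0 loosely stands for "royalty", axis 1 for "gender"
toy_vectors = {
    'king':  [0.9, 0.9],
    'queen': [0.9, -0.9],
    'man':   [0.1, 0.9],
    'woman': [0.1, -0.9],
    'apple': [-0.8, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(target, vectors, exclude=()):
    """Return the word whose vector is most similar to the target vector."""
    candidates = (w for w in vectors if w not in exclude)
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# vec("king") - vec("man") + vec("woman")
target = [
    k - m + w
    for k, m, w in zip(toy_vectors['king'], toy_vectors['man'], toy_vectors['woman'])
]

# the query words themselves are excluded from the candidates
result = nearest(target, toy_vectors, exclude={'king', 'man', 'woman'})
print(result)
```

With these toy values the nearest remaining word is `queen`. On real embeddings the match is only approximate, which is why the relationships above are written with `=~`.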

Now how does it do this? How is it capable of detecting these relationships? It uses a neural network to embed words in a lower-dimensional vector space. The result is a set of word-vectors that cluster together according to their meanings. In other words, words that are grouped together have similar meanings, while words that are further away from each other are more dissimilar. 

There are two versions of `Word2vec`, and the `gensim` class implements them both:
1. Skip-grams (SG)
2. Continuous-bag-of-words (CBOW)

Skip-gram takes in pairs of words generated by moving a window across the text data. It trains a 1-hidden-layer neural network on a synthetic task: given an input word (the first word of the pair), predict a probability distribution over the words that appear near it. A virtual one-hot encoding of each word goes through a 'projection layer' to the hidden layer, and these projection weights are later interpreted as the word embeddings. 
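
The moving window can be sketched in a few lines of plain Python. This is a simplified illustration, not `gensim`'s implementation: `skipgram_pairs` is a hypothetical helper, and it uses a fixed window with no subsampling of frequent words.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (input_word, nearby_word) pairs from a window moved across tokens."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not paired with itself
                yield center, tokens[j]

sentence = ['the', 'quick', 'brown', 'fox']
sg_pairs = list(skipgram_pairs(sentence, window=1))
for input_word, nearby_word in sg_pairs:
    print(input_word, '->', nearby_word)
```

Each pair is one training example: the network sees the first word and is asked to assign a high probability to the second.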

Continuous-bag-of-words is similar in that it also uses a 1-hidden-layer neural network. Instead of using a single word as in skip-gram, the training task uses the average of multiple input context words to predict the center word. Again, the projection weights that turn one-hot words into averageable vectors, of the same width as the hidden layer, are interpreted as the word embeddings.
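
Here is a rough sketch of how CBOW training examples are built and how the context vectors are averaged. The helper names (`cbow_examples`, `average`) and the 2-D `toy_embeddings` are assumptions made for this example, not `gensim` internals.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, center_word) training examples."""
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            yield context, center

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# toy projection weights, one invented 2-D vector per word
toy_embeddings = {
    'the': [1.0, 0.0],
    'quick': [0.0, 1.0],
    'brown': [1.0, 1.0],
    'fox': [0.0, 0.0],
}

cbow_data = list(cbow_examples(['the', 'quick', 'brown', 'fox'], window=1))
context, center = cbow_data[1]

# the averaged context vectors form the hidden-layer input used to predict the center word
hidden = average([toy_embeddings[word] for word in context])
print(context, '->', center, hidden)
```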

### _Demo_

We are going to download a pre-trained model and play around with it. We'll use `gensim` to fetch the `Word2Vec` model trained on part of the Google News dataset, which covers approximately 3 million words and phrases. Training such a model can take hours, but this one is already available and all we have to do is download it. 

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import gensim.downloader as api

In [3]:
# load in word2vec trained on Google News
wv = api.load('word2vec-google-news-300')

In [4]:
# let's retrieve some of the vocabulary of the model
# (gensim 3.x API; gensim 4.0 removed `wv.vocab` in favor of `wv.index_to_key`)
for i, word in enumerate(wv.vocab):
    if i == 10:
        break
    print(word)

</s>
in
for
that
is
on
##
The
with
said


In [5]:
# we can easily obtain vectors for terms the model is familiar with
vec_king = wv['king']

In [6]:
# however, it is unable to infer vectors for unfamiliar words
try:
    vec_cameroon = wv['cameroon']
except KeyError:
    print('The word cameroon does not appear in this model.')

The word cameroon does not appear in this model.


In [7]:
# Word2Vec supports several word similarity tasks out of the box
pairs = [
    ('car', 'minivan'),
    ('car', 'bicycle'),
    ('car', 'airplane'),
    ('car', 'cereal'),
    ('car', 'communism')
]

for w1, w2 in pairs:
    print(f'{w1} : {w2} --> {wv.similarity(w1, w2)}')

car : minivan --> 0.6907036304473877
car : bicycle --> 0.5364484786987305
car : airplane --> 0.42435577511787415
car : cereal --> 0.13924746215343475
car : communism --> 0.05820293724536896


Let's take a second to assess what just happened. We took a list of tuples, each pairing the word `car` with another word. Some of these words are closely related to `car`, like `minivan`. Others have next to nothing to do with `car`, such as `communism`. 

When we look at their similarity scores, we can see a trend: the less related a word is to `car`, the lower its score. Remember that vector space we talked about before? Each score is the cosine similarity between the two word vectors, so it essentially represents how close that word is to `car`. The higher the value, the closer the word is; the lower the value, the further away it is.
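
To see what such a score measures, here is a small library-free sketch of cosine similarity. `cosine_similarity` is a hypothetical helper written for this notebook, and the input vectors are invented.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])      # close to -1
print(same_direction, orthogonal, opposite)
```

The score depends only on the angle between the vectors, not on their lengths, so it always falls between -1 and 1.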

In [2]:
# we can even see which are the 5 most similar words to car or minivan
# print(wv.most_similar(positive=['car', 'minivan'], topn=5))
# the above line takes too much time on my local machine; if your
# machine has a significant amount of memory, you can uncomment the line above
# and run this command

## _Training Your Own Model_

In addition to pre-trained models like the Google News one above, we can create our own model as well. First, we'll need some data. Gensim includes some data sets, including the [Lee Corpus](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee_background.cor), which we'll use to train our first model. 

This corpus is smaller than Google News. However, we'll still implement a memory-friendly iterator that reads it line by line, which better demonstrates how to handle a larger corpus.

The `MyCorpus` class also gives us the capability, if we so choose, to do custom preprocessing. For example, we could decode a non-standard encoding, or lowercase the text, extract named entities, etc. All of this could be done inside the `MyCorpus` iterator.
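
As a library-free sketch of that idea, the class below streams a file one line at a time and applies whatever preprocessing function it is given. `StreamingCorpus` and `lowercase_tokens` are hypothetical names invented for this example, and the two-line demo file is made up.

```python
import os
import tempfile

class StreamingCorpus:
    """Iterable that reads one document per line and preprocesses it on the fly."""

    def __init__(self, path, preprocess):
        self.path = path
        self.preprocess = preprocess

    def __iter__(self):
        # reopen the file on every pass so the corpus can be iterated repeatedly
        with open(self.path, encoding='utf-8') as handle:
            for line in handle:
                yield self.preprocess(line)

def lowercase_tokens(line):
    """Custom preprocessing step: lowercase the text, then split on whitespace."""
    return line.lower().split()

# write a tiny demo file so the sketch is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write('Hundreds of People\nForced From Their Homes\n')
    demo_path = tmp.name

demo_corpus = StreamingCorpus(demo_path, lowercase_tokens)
first_pass = list(demo_corpus)
second_pass = list(demo_corpus)  # a second pass works because __iter__ reopens the file
os.remove(demo_path)
print(first_pass)
```

Being able to iterate more than once matters because training reads the corpus several times (once to build the vocabulary, then once per epoch), so a one-shot generator would not be enough.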

In [2]:
from gensim.test.utils import datapath
from gensim import utils

In [3]:
class MyCorpus(object):
    '''An iterator that yields sentences (lists of str)'''
    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        # the context manager closes the file once iteration finishes
        with open(corpus_path) as corpus_file:
            for line in corpus_file:
                # assume there's one document per line, tokens separated by whitespace
                yield utils.simple_preprocess(line)

In [4]:
# train an out-of-the-box model on the Lee corpus
import gensim.models

sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences)

We now have our model, and we can use it in the same way as the Google News model above. The main part of the model is `model.wv`, where `wv` stands for **"word vectors"**.

In [5]:
# get vector for word king
vec_king = model.wv['king']

In [6]:
# retrieve the vocabulary of our model
for i, word in enumerate(model.wv.vocab):
    if i == 10:
        break
    print(word)

hundreds
of
people
have
been
forced
to
their
homes
in


## _Storing and loading models_

There is an unfortunate downside to training models: it can take a significant amount of time. Once your model has been trained and works as you expect, you can save it to disk. This means that we won't have to spend time training it again!

In [7]:
import tempfile

with tempfile.NamedTemporaryFile(prefix='gensim-model-', delete=False) as tmp:
    temporary_filepath = tmp.name
    model.save(temporary_filepath)
    #
    # the model is now safely stored at the filepath
    # you can copy it to other machines, share it with others, etc
    #
    # to load a saved model
    new_model = gensim.models.Word2Vec.load(temporary_filepath)

Saving and loading use pickle internally. In addition, you can load models created by the original C tool, in both its text and binary formats.
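
For intuition about the text format, the sketch below parses it by hand: a header line with the vocabulary size and vector width, followed by one line per word. `parse_word2vec_text` is a hypothetical helper written for illustration and the sample data is invented. In practice you would use `gensim`'s own loader, `KeyedVectors.load_word2vec_format`, rather than this function.

```python
import io

def parse_word2vec_text(stream):
    """Parse the plain-text layout: a 'count dim' header, then 'word v1 ... v_dim' lines."""
    vocab_size, vector_size = (int(x) for x in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        parts = stream.readline().rstrip().split(' ')
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != vector_size:
            raise ValueError(f'expected {vector_size} values for {word!r}, got {len(values)}')
        vectors[word] = values
    return vectors

# invented sample: 2 words, 3 dimensions each
sample = io.StringIO('2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n')
parsed = parse_word2vec_text(sample)
print(parsed['queen'])
```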

## _Training Parameters_

`Word2Vec` accepts several parameters that affect both training speed and quality.

### `min_count`

This is for pruning the internal dictionary. Words that appear only a few times in, say, a billion-word corpus are uninteresting, and there isn't enough data to learn anything useful about them, so it's best to ignore them. 

The default value of `min_count` is 5.
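
The pruning rule itself is easy to picture with a word counter. This is only a sketch of the idea under stated assumptions: `prune_vocabulary` is a hypothetical helper and the toy sentences are invented.

```python
from collections import Counter

def prune_vocabulary(sentences, min_count=5):
    """Keep only the words that occur at least min_count times across all sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

toy_sentences = [
    ['the', 'cat', 'sat'],
    ['the', 'cat', 'ran'],
    ['the', 'dog', 'sat'],
]

# 'ran' and 'dog' appear only once, so they are dropped
kept = prune_vocabulary(toy_sentences, min_count=2)
print(kept)
```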

In [8]:
# with our corpus, create model where a word has to appear at least 10 times
model = gensim.models.Word2Vec(sentences, min_count=10)

### `size`

This is the number of dimensions $N$ of the $N$-dimensional space that `gensim` Word2Vec maps the words onto. A bigger size value requires more training data, but can lead to better (i.e. more accurate) models. For a start, values within the tens to the hundreds are reasonable. Note that in gensim 4.0 and later this parameter is named `vector_size`.

In [9]:
# set the size of N equal to 200 (default = 100)
model = gensim.models.Word2Vec(sentences, size=200)

### `workers`

This is the last major parameter. It controls parallelization, which speeds up training. The default number of `workers` is 3. 

For a full list of parameters, see the [`Word2Vec` documentation](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec).

In [10]:
# set number of workers equal to 4 (default = 3)
model = gensim.models.Word2Vec(sentences, workers=4)

## _Memory_

`Word2Vec` models are stored as matrices (i.e. `NumPy` arrays), each holding **vocabulary size times `size` floats** (single precision, i.e. 4 bytes each). Three such matrices are held in RAM, so if your input contains 100,000 unique words and you set `size=200`, the model will require approximately `100,000 * 200 * 4 * 3 bytes = ~229MB`. 

Additionally, there's a little extra memory needed for storing the vocabulary tree (which requires a few megabytes) but unless your words are extremely long strings, the footprint of this will be minimal compared to the three matrices above. 
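
The estimate above is easy to reproduce for other vocabulary sizes. `estimate_matrix_bytes` is a hypothetical helper, and it assumes 4-byte floats and three matrices, as described in the text.

```python
def estimate_matrix_bytes(vocab_size, vector_size, bytes_per_float=4, n_matrices=3):
    """Rough RAM needed for the model matrices, ignoring the vocabulary structures."""
    return vocab_size * vector_size * bytes_per_float * n_matrices

total_bytes = estimate_matrix_bytes(100_000, 200)
total_mib = total_bytes / 2**20  # convert bytes to mebibytes
print(f'{total_bytes:,} bytes = {total_mib:.1f} MiB')
```

This gives 240,000,000 bytes, which is roughly 229 MB when one megabyte is counted as 2**20 bytes.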

## _Evaluating_

Since `Word2Vec` training is an unsupervised task, there's no good way to objectively evaluate the result; evaluation depends on your end application. 

Google has released their testing set of about 20,000 syntactic & semantic test examples that follow the __"A is to B as C is to D"__ task; this is provided in the `datasets` folder. 

An example of a syntactic analogy is `bad to worse : good to ?`. There are a total of 9 types of syntactic comparisons in the dataset, such as plural nouns and nouns of opposite meaning.

An example of a semantic question could be capital cities (`Paris to France : Tokyo to ?`) or family members (`brother to sister : dad to ?`). There are five types of these semantic analogies.
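
The test file groups its questions into sections: a line starting with `:` names the section, and every following line holds four words `A B C D`. The sketch below parses that layout and scores each section. It is a simplified illustration, not `gensim`'s evaluation code: the helper names are hypothetical, the sample lines are invented, and a lookup table stands in for a real model's prediction.

```python
def parse_analogies(lines):
    """Group 'A B C D' lines under the preceding ': section' header."""
    sections = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(':'):
            current = line[1:].strip()
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(tuple(line.split()))
    return sections

def section_accuracy(questions, predict):
    """Fraction of (a, b, c, d) questions for which predict(a, b, c) == d."""
    if not questions:
        return 0.0
    correct = sum(1 for a, b, c, d in questions if predict(a, b, c) == d)
    return correct / len(questions)

# invented sample in the same layout as the test file
sample_lines = [
    ': capital-common-countries',
    'Paris France Tokyo Japan',
    'Paris France Berlin Germany',
    ': family',
    'brother sister dad mom',
]

# stand-in predictor used only to exercise the scoring logic
lookup = {('Paris', 'France', 'Tokyo'): 'Japan', ('brother', 'sister', 'dad'): 'mom'}

def lookup_predict(a, b, c):
    return lookup.get((a, b, c))

analogies = parse_analogies(sample_lines)
scores = {name: section_accuracy(qs, lookup_predict) for name, qs in analogies.items()}
print(scores)
```

With a real model, the predictor would return the word whose vector is closest to `vec(b) - vec(a) + vec(c)`.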

In [13]:
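# you can access the Google test examples with the following command
# model.accuracy('/datasets/questions-words.txt')
# this throws a FileNotFoundError, probably because the leading slash makes
# the path absolute; a copy of the file also ships with gensim's test data,
# which newer gensim versions evaluate with:
# model.wv.evaluate_word_analogies(datapath('questions-words.txt'))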

## _Online training/Resuming Training_

A more advanced technique is to load a model and continue training it with more sentences and new vocabulary words.

In [17]:
model = gensim.models.Word2Vec.load(temporary_filepath)
more_sentences = [
    ['Advanced', 'users', 'can', 'load', 'a', 'model',
     'and', 'continue', 'training', 'it', 'with', 'more', 'sentences']
]
model.build_vocab(more_sentences, update=True)
# gensim 3.x API; in gensim 4.0 and later the attribute is `model.epochs`
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.iter)

(30, 65)

In [19]:
# clean up temporary file
import os
os.remove(temporary_filepath)

## _Training Loss Computation_

The `compute_loss` parameter toggles computation of the loss while training the `Word2Vec` model. The computed loss is stored in the model attribute `running_training_loss` and can be retrieved using the method `get_latest_training_loss`, as follows:

In [20]:
# instantiate and train the Word2Vec model
model_with_loss = gensim.models.Word2Vec(
    sentences,
    min_count=1,
    compute_loss=True,
    hs=0,
    sg=1,
    seed=42
)

# get the training loss value
training_loss = model_with_loss.get_latest_training_loss()
print(training_loss)

1383308.5
