# Learning word embeddings

LIN 371 :: UT Austin

Original material from https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

This tutorial covers:
* How to train your own word2vec word embedding model on text data.
* How to visualize a trained word embedding model using Principal Component Analysis.
* How to load pre-trained word2vec and GloVe word embedding models from Google and Stanford.

This notebook requires [Gensim](https://radimrehurek.com/gensim/). The code cells also import NLTK, scikit-learn, and matplotlib.

## Word Embeddings

[Word2Vec](https://en.wikipedia.org/wiki/Word2vec) is one algorithm for learning a word embedding from a text corpus. It offers two training algorithms for learning the embedding: continuous bag of words (CBOW) and skip-gram. Word2Vec models need a lot of text to train well, e.g. the entire Wikipedia corpus. Nevertheless, we will demonstrate the principles using a small in-memory example of text.
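
To make the difference between the two training algorithms concrete, here is a minimal sketch of the training examples each one draws from a sentence: CBOW predicts a target word from its surrounding context words, while skip-gram predicts each context word from the target. The helper names `context_of`, `cbow_examples`, and `skipgram_examples` are hypothetical and written only for illustration; this is not how Gensim implements training internally.

```python
def context_of(tokens, i, window):
    """Return the words within `window` positions of tokens[i], excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """CBOW: one example per position, (context words) -> target word."""
    return [(context_of(tokens, i, window), target)
            for i, target in enumerate(tokens)]


def skipgram_examples(tokens, window):
    """Skip-gram: one example per (target word, context word) pair."""
    return [(target, ctx)
            for i, target in enumerate(tokens)
            for ctx in context_of(tokens, i, window)]


tokens = ["Bilbo", "was", "very", "rich"]
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)

print(cbow[1])       # context and target for the word "was"
print(skipgram[:3])  # first three (target, context) pairs
```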

Gensim provides [the Word2Vec class](https://radimrehurek.com/gensim/models/word2vec.html) for working with a Word2Vec model. Learning a word embedding from text involves loading and organizing the text into sentences and providing them to the constructor of a new `Word2Vec()` instance. For example:

```
sentences = ...
model = Word2Vec(sentences)
```

Specifically, each sentence must be **tokenized**, meaning divided into words and prepared (e.g. filtered to drop unwanted tokens, or converted to a preferred case).
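
As a rough illustration of what "tokenized and prepared" can mean, the sketch below splits text into sentences, lowercases it, and drops punctuation using only the standard library. The `prepare` helper and its regular expressions are assumptions made for this example; the notebook itself uses NLTK, which handles cases (such as "Mr.") that this naive splitter gets wrong.

```python
import re


def prepare(text):
    """Split text into sentences, then into lowercase word tokens (punctuation dropped)."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    # This is an assumption for the sketch; it mis-splits abbreviations like "Mr.".
    raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Keep runs of letters, digits, apostrophes and hyphens; lowercase everything.
    return [re.findall(r"[a-z0-9'-]+", s.lower()) for s in raw_sentences if s]


prepared = prepare("Bilbo was very rich. He lived at Bag End!")
print(prepared)
```

The result is a list of token lists, which is the shape of input that `Word2Vec()` expects.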

In [None]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

from nltk import sent_tokenize, word_tokenize

# define training data
text = "When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday with a party of special magnificence, there was much talk and excitement in Hobbiton. Bilbo was very rich and very peculiar, and had been the wonder of the Shire for sixty years, ever since his remarkable disappearance and unexpected return. The riches he had brought back from his travels had now become a local legend, and it was popularly believed, whatever the old folk might say, that the Hill at Bag End was full of tunnels stuffed with treasure."
sentences = [word_tokenize(s) for s in sent_tokenize(text)]
for s in sentences:
    print(s)
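# train model; min_count=1 (an assumed choice) keeps every word of this tiny corpus
model = Word2Vec(sentences, min_count=1)
# summarize the loaded model
print(model)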


The `Word2Vec` constructor takes many parameters; a few noteworthy arguments you may wish to configure are:

* **vector_size**: (default 100) The number of dimensions of the embedding, i.e. the length of the dense vector that represents each token (word).
* **window**: (default 5) The maximum distance between a target word and the words around it that count as its context.
* **min_count**: (default 5) The minimum count of words to consider when training the model; words that occur fewer times than this are ignored.
* **sg**: (default 0 or CBOW) The training algorithm, either CBOW (0) or skip-gram (1).

The defaults are often good enough when just getting started.

You can print the learned vocabulary of tokens (words) as follows:

In [None]:
# summarize vocabulary
words = list(model.wv.key_to_index)
print(words)

In [None]:
# You can review the embedded vector for a specific token as follows
# ('Bilbo' is an example token; it must be in the trained vocabulary):
print(model.wv['Bilbo'])


In [None]:
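# save model. By default, the model is saved in a binary format to save space.
model.save('model.bin')  # 'model.bin' is an example filename

# load model
new_model = Word2Vec.load('model.bin')
print(new_model)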


When getting started, you can save the learned vectors in ASCII format and review the contents. You can do this by setting `binary=False` when calling the `save_word2vec_format()` function on `model.wv`:

In [None]:
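model.wv.save_word2vec_format('model.txt', binary=False)  # 'model.txt' is an example filename
## load model with load_word2vec_format, also specifying binary=False
loaded_vectors = KeyedVectors.load_word2vec_format('model.txt', binary=False)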


## Plot Word Vectors Using PCA

After you learn a word embedding for your text data, it can be nice to explore it with visualization. You can use PCA to project the high-dimensional word vectors down to two dimensions and plot them on a graph. The visualizations can provide a qualitative diagnostic for your learned model.

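The steps are: retrieve all of the vectors from the trained model, fit a projection method on them (such as those offered in scikit-learn), then use matplotlib to plot the projection as a scatter plot.

All of the vectors in a trained model are available as the NumPy array `model.wv.vectors`, with one row per vocabulary word, in the same order as `model.wv.key_to_index`.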

We can create a 2-dimensional PCA model of the word vectors using [the scikit-learn PCA class](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) as follows:

In [None]:
from sklearn.decomposition import PCA
from matplotlib import pyplot

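# retrieve all word vectors (one row per word, in the same order as model.wv.key_to_index)
X = model.wv.vectors

# fit a 2d PCA model to the vectors
pca = PCA(n_components=2)
result = pca.fit_transform(X)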

# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model.wv.key_to_index)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

## Load Google’s Word2Vec Embedding

Training your own word vectors may be the best approach for a given NLP problem. But it can require a long time, a fast computer with a lot of RAM and disk space, and perhaps some expertise in preparing the input data and tuning the training algorithm. An alternative is to simply use an existing pre-trained word embedding.

Along with the paper and code for word2vec, Google also published a pre-trained word2vec model on the [Word2Vec Google Code Project](https://code.google.com/archive/p/word2vec/).

A pre-trained model is nothing more than a file containing tokens and their associated word vectors. The pre-trained Google word2vec model was trained on Google News data (about 100 billion words); it contains 3 million words and phrases and was fit using 300-dimensional word vectors. The download is a 1.53-gigabyte file, available here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

Unzipped, the binary file (GoogleNews-vectors-negative300.bin) is 3.4 Gigabytes.

The Gensim library provides tools to load this file. Specifically, you can call the `KeyedVectors.load_word2vec_format()` function to load this model into memory, for example:

In [None]:
# filename = '/GoogleNews-vectors-negative300.bin'
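filename = '/Users/jessy/Downloads/GoogleNews-vectors-negative300.bin'
# load the binary word2vec file into memory (this can take a while)
model = KeyedVectors.load_word2vec_format(filename, binary=True)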


Recall that you can also do a little linear algebra arithmetic with words:
```
queen = (king - man) + woman
```
Gensim provides an interface for performing these types of operations in the `most_similar()` function on the trained or loaded model: words to add go in `positive` and words to subtract go in `negative`.

Let's try some other relations and see if they work, e.g. king:crown::police:badge

Let's look at some other things: how similar are two words?

What word seems out of place in the sequence?

## Load Stanford’s GloVe Embedding

Stanford researchers developed their own word embedding algorithm, similar to word2vec, called Global Vectors for Word Representation, or GloVe for short. In practice, NLP practitioners seem to prefer GloVe at the moment based on results. As with word2vec, the GloVe researchers provide pre-trained word vectors, in this case a large selection to choose from.

You can download the GloVe pre-trained word vectors and load them easily with gensim.

Let's download the smallest GloVe pre-trained model from [the GloVe website](https://nlp.stanford.edu/projects/glove/). It is an 822-megabyte zip file with 4 different models (50-, 100-, 200-, and 300-dimensional vectors) trained on Wikipedia data with 6 billion tokens and a 400,000-word vocabulary. Direct download: http://nlp.stanford.edu/data/glove.6B.zip

Now we can load it and perform the same `(king - man) + woman = ?` test as in the previous section.

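Note that the GloVe text files differ from the word2vec text format: they lack the first header line giving the vocabulary size and vector dimensionality, so we set `no_header=True` when loading.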

In [None]:
# load the Stanford GloVe model
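filename = '/Users/jessy/Downloads/glove.6B/glove.6B.300d.txt'
# GloVe files are plain text without a header line
model = KeyedVectors.load_word2vec_format(filename, binary=False, no_header=True)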


In [None]:
# calculate: (king - man) + woman = ?
print(model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1))
