# Training word vectors

In the last notebook we saw several applications of word and sentence vectors. In this notebook we will train our own.

For this we will use the gensim library.

Quoting Wikipedia:

> Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python. It uses NumPy, SciPy and optionally Cython for performance. Gensim is specifically designed to handle large text collections, using data streaming and efficient incremental algorithms, which differentiates it from most other scientific software packages that only target batch and in-memory processing.

LGPL licensed, https://github.com/RaRe-Technologies/gensim.

Documentation for the `Word2Vec` class: https://radimrehurek.com/gensim/models/word2vec.html

To get going, we need a set of documents on which to train our word2vec model. In theory, a document could be anything from a short 140-character tweet or a single paragraph (e.g., a journal article abstract) to a news article or a book. In NLP parlance, a collection of documents is often referred to as a corpus.

We need to create a list of sentences to feed to the `Word2Vec` class.
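
As a rough sketch of the expected input shape: each sentence is a list of string tokens, and the corpus is a list of such lists. The toy text and the naive splitting on periods and whitespace are stand-ins for illustration only; the notebook uses spaCy for the real tokenisation.

```python
# Toy text, not taken from the book.
raw_text = "The monster fled. The creator followed the monster."

# Naive splitting on "." and whitespace, for illustration only.
toy_sentences = [
    [word.lower() for word in sentence.split()]
    for sentence in raw_text.split(".")
    if sentence.strip()
]

print(toy_sentences)
```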

In [None]:
# Use `conda install -c conda-forge gensim` to install this
import gensim
from gensim.models import Word2Vec

In [None]:
import spacy

In [None]:
# loading the model can take a moment
nlp = spacy.load('en_core_web_sm')

In [None]:
# and so can chewing through the whole of Frankenstein
with open("../data/84-0.txt", encoding="utf-8") as f:
    doc = nlp(f.read())

In [None]:
# you can see that the sentence boundary detection is not
# perfect in spacy, especially at the beginning of the
# book which contains lots of strangely formatted text.
for n, sentence in enumerate(doc.sents):
    # skip the first 40 "sentences"
    # disable to see the weird ones
    if n < 40:
        continue
    # maybe preprocessing the text like this helps
    print(" ".join(w.lower_.strip() for w in sentence))
    print("-" * 80)
    if n > 40 + 20:
        break

The word2vec training needs a generator of sentences. Let's write one that skips over the first part of the book, and then applies some normalisation to each sentence.

In [None]:
# It seems simpler to use a generator than create a whole class
# as is shown in the gensim documentation. YMMV.
def sentences(document):
    for n, sentence in enumerate(document.sents):
        if n < 40:
            continue
        # maybe preprocessing the text like this helps
        yield [w.lower_.strip() for w in sentence if w.is_alpha]

In [None]:
# check your generator creates sentences
# one sentence per iteration, one sentence is a list of words
next(sentences(doc))
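
One detail worth keeping in mind: a generator object can be consumed only once, which is why the cells that follow call `sentences(doc)` afresh for each pass instead of reusing a single generator. A small sketch, with a hypothetical `toy_sentences` generator standing in for `sentences(doc)`:

```python
def toy_sentences():
    # Hypothetical stand-in for `sentences(doc)`: yields token lists.
    yield ["the", "monster", "fled"]
    yield ["the", "creator", "followed"]

gen = toy_sentences()
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # nothing is left to yield

print(len(first_pass), len(second_pass))

# Calling the function again returns a fresh generator.
fresh_pass = list(toy_sentences())
```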

In [None]:
# 20 dimensional vectors are probably enough for such a small text
# experiment a bit with what works best
# (gensim >= 4.0 renamed `size` to `vector_size` and `iter` to `epochs`)
w2v = Word2Vec(size=20, min_count=3, iter=10)
w2v.build_vocab(sentences(doc))
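
The `min_count=3` argument tells gensim to ignore words that occur fewer than three times in the corpus. The sketch below mimics that filtering step with a plain `Counter` on a made-up corpus; it illustrates the idea and is not gensim's actual implementation.

```python
from collections import Counter

# Made-up tokenised corpus, for illustration only.
toy_corpus = [
    ["the", "monster", "fled"],
    ["the", "creator", "followed", "the", "monster"],
    ["the", "monster", "wept"],
]
min_count = 3

# Count every token, then keep only the sufficiently frequent ones.
counts = Counter(word for sentence in toy_corpus for word in sentence)
vocab = sorted(word for word, n in counts.items() if n >= min_count)

print(vocab)
```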

In [None]:
w2v.train(sentences(doc),
          total_examples=w2v.corpus_count,
          epochs=w2v.iter
         )

In [None]:
# inspect the vocabulary
# (in gensim >= 4.0 use `w2v.wv.key_to_index` instead)
w2v.wv.vocab

In [None]:
w2v.wv.most_similar("violence")

In [None]:
w2v.wv.most_similar("cabin")
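
`most_similar` ranks words by the cosine similarity between their vectors. A dependency-free sketch of that measure, using made-up 3-dimensional vectors instead of ones from the trained model:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, for illustration only.
toy_vectors = {
    "cabin": [1.0, 0.0, 1.0],
    "hut": [2.0, 0.0, 2.0],     # same direction as "cabin"
    "anger": [0.0, 1.0, 0.0],   # orthogonal to "cabin"
}

# Rank all other words by similarity to the query, most similar first.
query = "cabin"
ranked = sorted(
    (w for w in toy_vectors if w != query),
    key=lambda w: cosine_similarity(toy_vectors[query], toy_vectors[w]),
    reverse=True,
)
print(ranked)
```

Because cosine similarity depends only on direction, "hut" counts as maximally similar to "cabin" here even though its vector is twice as long.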

In [None]:
# get the vector for "milk"
w2v.wv['milk']

In [None]:
# vectors for "milk" and "cabin"
w2v.wv[['milk', 'cabin']]

The vocabulary is very small compared to that of any of the pre-trained vectors. These vectors are "tuned" to this particular text, but are probably not as useful as generic word vectors such as GloVe.

One thing to note: if your documents contain a lot of specific jargon, you might improve performance by training a specialised set of word vectors. This is because words outside the vocabulary of a set of vectors (such as jargon or misspelt words) have no vector to assign. Often people simply set them to zero or initialise them randomly.
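
A minimal sketch of those two fallbacks, using a plain dict of made-up vectors in place of a trained model. The helper `lookup` and its `strategy` parameter are hypothetical names chosen for this illustration.

```python
import random

DIM = 4
# Made-up vectors standing in for a trained vocabulary.
toy_vectors = {"milk": [0.1, 0.2, 0.3, 0.4]}

def lookup(word, strategy="zero", rng=None):
    """Return the vector for `word`, with a fallback for unknown words."""
    if word in toy_vectors:
        return toy_vectors[word]
    if strategy == "zero":
        return [0.0] * DIM
    # "random": small random values; pass a seeded rng for reproducibility
    rng = rng or random.Random(0)
    return [rng.uniform(-0.1, 0.1) for _ in range(DIM)]

known = lookup("milk")
zeros = lookup("mlik")  # misspelt, so out of vocabulary
rand = lookup("mlik", strategy="random", rng=random.Random(42))
```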


## Using word vectors for movie reviews
Let's compare self-trained word vectors to simple TfIdf on the movie sentiment task. Use what you learnt above to train (small) word vectors on the IMDB dataset we used previously.

To train word vectors we need to:
* load all the individual reviews and chunk them into sentences
* feed sentences to our `Word2Vec` model
* train the model
* inspect word vectors (for sanity checking)

After training the vectors and checking that they are somewhat sensible, try
to use them as input features for a logistic regression model instead of TfIdf
or the `CountVectorizer` that we used before in `10-tfidf.ipynb`.
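
As a sketch of one way to build such features, shown on toy data: each document is represented by the mean of the vectors of its in-vocabulary words, giving one row per document and one column per embedding dimension. The vectors, documents and the helper `average_vector` are made up for illustration, and NumPy is assumed to be available.

```python
import numpy as np

# Made-up 3-dimensional embeddings, for illustration only.
toy_vectors = {
    "great": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
    "awful": np.array([-1.0, 0.0, -2.0]),
}

toy_docs = [["great", "movie"], ["awful", "movie", "unknownword"]]

def average_vector(tokens, vectors, dim=3):
    """Mean of the vectors of known tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# One row per document, one column per embedding dimension.
X_toy = np.vstack([average_vector(d, toy_vectors) for d in toy_docs])
print(X_toy.shape)
```

Returning zeros for a document with no known words is one possible convention; it avoids taking the mean of an empty list.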

In [None]:
import numpy as np
from sklearn.datasets import load_files

reviews_train = load_files("../data/aclImdb/train/", categories=['neg', 'pos'])

text_trainval, y_trainval = reviews_train.data, reviews_train.target

print("type of text_trainval: {}".format(type(text_trainval)))
print("length of text_trainval: {}".format(len(text_trainval)))
print("class balance: {}".format(np.bincount(y_trainval)))

In [None]:
from sklearn.model_selection import train_test_split


text_trainval = [doc.replace(b"<br />", b" ") for doc in text_trainval]

text_train, text_val, y_train, y_val = train_test_split(
    text_trainval, y_trainval, stratify=y_trainval, random_state=0)

In [None]:
text_train[:10]

In [None]:
tokenizer = spacy.load('en_core_web_sm')
# turn off features from spacy that we don't need
tokenizer.remove_pipe("ner")
tokenizer.remove_pipe("tagger")
# (in spaCy >= 3.0 this becomes `tokenizer.add_pipe('sentencizer')`)
tokenizer.add_pipe(tokenizer.create_pipe('sentencizer'))

def movie_sentences(text):
    for sample in text:
        doc = tokenizer(sample.decode())
        for sentence in doc.sents:
            # maybe preprocessing the text like this helps
            yield [w.lower_.strip() for w in sentence if w.is_alpha]

In [None]:
%%time
# compare the speed of this slimmed-down tokenizer to a full spacy
# model that performs NER etc (the %%time magic reports the timing)
next(movie_sentences(text_train))

In [None]:
%%time
# this step can take quite some time :(
# there is a pickle of all the sentences in the
# repository which you can just load instead of having
# to run this yourself.
all_movie_sentences = list(movie_sentences(text_train))

In [None]:
import pickle

# load sentence list
with open("all_movie_sentences.pkl", "rb") as f:
    all_movie_sentences = pickle.load(f)

# store sentence list
#with open("all_movie_sentences.pkl", "wb") as f:
#    pickle.dump(all_movie_sentences, f)

In [None]:
assert gensim.models.word2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

In [None]:
%%time
movie_w2v = Word2Vec(size=50, workers=5)
# no RAM? Use this slower version
#movie_w2v.build_vocab(movie_sentences(text_train))
movie_w2v.build_vocab(all_movie_sentences)

In [None]:
%%time
# no RAM? Use this slower version
#movie_w2v.train(movie_sentences(text_train),
#                total_examples=movie_w2v.corpus_count,
#                epochs=movie_w2v.iter
#                )
movie_w2v.train(all_movie_sentences,
                total_examples=movie_w2v.corpus_count,
                epochs=movie_w2v.iter
                )

In [None]:
# you get more specific synonyms than before
# compare to what spacy would find as similar words
# to movie
movie_w2v.wv.most_similar("movie")

In [None]:
# saving and loading the model is easy
movie_w2v.save("movie_w2v_model")

In [None]:
loaded_movie_w2v = Word2Vec.load("movie_w2v_model")

In [None]:
loaded_movie_w2v.wv.most_similar("movie")

In [None]:
loaded_movie_w2v.wv.most_similar("batman")

In [None]:
# Use word vectors as input to a logistic regression
# Let's see if we can improve on our baseline for movie reviews by using
# our own word vectors.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vect_w2v = CountVectorizer(vocabulary=loaded_movie_w2v.wv.index2word)
vect_w2v.fit(text_train)
docs = vect_w2v.inverse_transform(vect_w2v.transform(text_train))
docs[0]

In [None]:
# compute the average of the word vectors in a review to represent the whole document
# place your training data in `X_train`

In [None]:
# what should the shape of the training data in X_train be?
# What size is your embedding?
X_train.shape

In [None]:
# this computes the average word vectors for the validation dataset
val_docs = vect_w2v.inverse_transform(vect_w2v.transform(text_val))

X_val = np.vstack([np.mean(loaded_movie_w2v.wv[doc], axis=0) for doc in val_docs])

In [None]:
from sklearn.linear_model import LogisticRegression

lr_w2v = LogisticRegression(C=100).fit(X_train, y_train)
lr_w2v.score(X_train, y_train)

In [None]:
lr_w2v.score(X_val, y_val)

In [None]:
# Can you improve this by preprocessing the words that are given to the Word2Vec model,
# for example by removing stop words?
# Check out the documentation for `CountVectorizer` to see if you can find the
# stopword list used by scikit-learn.
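
One way to try the stop-word idea, sketched with a tiny hand-written stop list (a stand-in only; scikit-learn's own list is considerably longer). The helper `remove_stop_words` is a hypothetical name for this illustration.

```python
# Tiny hand-written stop list, for illustration only.
toy_stop_words = {"the", "a", "is", "of", "and"}

def remove_stop_words(sentences, stop_words):
    """Yield each tokenised sentence with stop words filtered out."""
    for sentence in sentences:
        yield [w for w in sentence if w not in stop_words]

toy_sentences = [["the", "movie", "is", "great"], ["a", "waste", "of", "time"]]
filtered = list(remove_stop_words(toy_sentences, toy_stop_words))
print(filtered)
```

Since the helper is itself a generator, it could wrap a sentence generator such as `movie_sentences(text_train)` before the sentences reach the model.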