# Training word vectors

In the last notebook we saw several applications of word and sentence vectors. In this notebook we will train our own.

For this we will use the gensim library.

Quoting Wikipedia:

> Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python. It uses NumPy, SciPy and optionally Cython for performance. Gensim is specifically designed to handle large text collections, using data streaming and efficient incremental algorithms, which differentiates it from most other scientific software packages that only target batch and in-memory processing.

LGPL licensed, https://github.com/RaRe-Technologies/gensim.

Documentation for the `Word2Vec` class: https://radimrehurek.com/gensim/models/word2vec.html

To get going, we need a set of documents on which to train our word2vec model. In principle a document can be anything from a short 140-character tweet to a single paragraph (e.g. a journal article abstract), a news article, or a whole book. In NLP parlance, a collection of documents is referred to as a corpus.

We need to create a list of sentences to feed to the `Word2Vec` class.
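
As a rough illustration of the input format: `Word2Vec` expects an iterable in which every sentence is itself a list of string tokens. The sketch below builds such a structure with a naive regular-expression splitter. `naive_sentences` is a hypothetical helper made up for this illustration; the cells that follow use spaCy for the real tokenisation.

```python
import re

def naive_sentences(text):
    """Split text into sentences, and each sentence into lowercase word tokens.

    A crude stand-in for a real tokenizer: sentences end at '.', '!' or '?'.
    """
    result = []
    for raw in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[a-z]+", raw.lower())
        if tokens:  # skip empty fragments, e.g. after the final full stop
            result.append(tokens)
    return result

toy_corpus = naive_sentences(
    "I commenced by inuring my body to hardship. Six years have passed!"
)
print(toy_corpus)
```

The result is a list of two token lists, one per sentence, which is the shape of data that the `Word2Vec` class consumes.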

In [1]:
import gensim
from gensim.models import Word2Vec

Using TensorFlow backend.


In [2]:
import spacy

In [3]:
# loading the model can take a moment
nlp = spacy.load('en_core_web_sm')

In [4]:
# and chewing through the whole of Frankenstein as well
doc = nlp(open("../data/84-0.txt").read())

In [5]:
# you can see that the sentence boundary detection is not
# perfect in spacy, especially at the beginning of the
# book which contains lots of strangely formatted text.
for n, sentence in enumerate(doc.sents):
    # skip the first 40 "sentences"
    # disable to see the weird ones
    if n < 40:
        continue
    # maybe preprocessing the text like this helps
    print(" ".join(w.lower_.strip() for w in sentence))
    print("-" * 80)
    if n > 40 + 20:
        break

you are well  acquainted with my failure and how heavily i bore the disappointment . 
--------------------------------------------------------------------------------
but just at that time i inherited the fortune of my cousin , and my  thoughts were turned into the channel of their earlier bent . 
--------------------------------------------------------------------------------
six years have passed since i resolved on my present undertaking . 
--------------------------------------------------------------------------------
i  can , even now , remember the hour from which i dedicated myself to this  great enterprise . 
--------------------------------------------------------------------------------
i commenced by inuring my body to hardship . 
--------------------------------------------------------------------------------
i  accompanied the whale - fishers on several expeditions to the north sea ;  i voluntarily endured cold , famine , thirst , and want of sleep ; i often  worked harde

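Word2vec training needs an iterable of tokenised sentences, and it reads that iterable more than once: first to build the vocabulary, then again for every training epoch. Let's write a generator function that skips over the first part of the book and applies some normalisation to each sentence.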
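
One detail worth keeping in mind is that a Python generator can only be consumed once. The sketch below uses toy data and hypothetical names (`toy_sentences`, `RestartableSentences`) to show the difference between a single-use generator and an object that can be iterated repeatedly. This is why the cells below call `sentences(doc)` afresh for `build_vocab` and again for `train`, instead of reusing one generator object.

```python
def toy_sentences():
    """Generator function: each call returns a fresh, single-use generator."""
    yield ["six", "years", "have", "passed"]
    yield ["i", "commenced", "by", "inuring", "my", "body"]

gen = toy_sentences()
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # nothing left: the generator is exhausted
print(len(first_pass), len(second_pass))

class RestartableSentences:
    """Wrap a generator function so that every iteration starts from scratch."""

    def __init__(self, make_generator):
        self.make_generator = make_generator

    def __iter__(self):
        # a new generator is created each time iteration begins
        return self.make_generator()

corpus = RestartableSentences(toy_sentences)
print(len(list(corpus)), len(list(corpus)))
```

A wrapper class of this kind is the pattern you would reach for if you want to pass a single object around and still stream the corpus from disk.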

In [6]:
# It seems simpler to use a generator than create a whole class
# as is shown in the gensim documentation. YMMV.
def sentences(document):
    for n, sentence in enumerate(document.sents):
        if n < 40:
            continue
        # maybe preprocessing the text like this helps
        yield [w.lower_.strip() for w in sentence if w.is_alpha]

In [7]:
# check your generator creates sentences
next(sentences(doc))

['you',
 'are',
 'well',
 'acquainted',
 'with',
 'my',
 'failure',
 'and',
 'how',
 'heavily',
 'i',
 'bore',
 'the',
 'disappointment']

In [8]:
# 20 dimensional vectors are probably enough for such a small text
# (gensim < 4.0 argument names; gensim >= 4.0 calls them vector_size= and epochs=)
w2v = Word2Vec(size=20, min_count=3, iter=10)
w2v.build_vocab(sentences(doc))

In [9]:
w2v.train(sentences(doc),
          total_examples=w2v.corpus_count,
          epochs=w2v.iter
         )

49157

In [10]:
# inspect the vocabulary
w2v.wv.vocab

{'you': <gensim.models.keyedvectors.Vocab at 0x1288c03c8>,
 'are': <gensim.models.keyedvectors.Vocab at 0x1288c0438>,
 'well': <gensim.models.keyedvectors.Vocab at 0x1288c3908>,
 'acquainted': <gensim.models.keyedvectors.Vocab at 0x1288c3e80>,
 'with': <gensim.models.keyedvectors.Vocab at 0x1288c3da0>,
 'my': <gensim.models.keyedvectors.Vocab at 0x1288c3358>,
 'failure': <gensim.models.keyedvectors.Vocab at 0x1288c3a20>,
 'and': <gensim.models.keyedvectors.Vocab at 0x1288c3b38>,
 'how': <gensim.models.keyedvectors.Vocab at 0x1288c3f98>,
 'heavily': <gensim.models.keyedvectors.Vocab at 0x1288c3ac8>,
 'i': <gensim.models.keyedvectors.Vocab at 0x1288c4780>,
 'bore': <gensim.models.keyedvectors.Vocab at 0x1288c4be0>,
 'the': <gensim.models.keyedvectors.Vocab at 0x1288c4240>,
 'disappointment': <gensim.models.keyedvectors.Vocab at 0x1288c42e8>,
 'but': <gensim.models.keyedvectors.Vocab at 0x1288c4320>,
 'just': <gensim.models.keyedvectors.Vocab at 0x1288c4400>,
 'at': <gensim.models.keyedve

In [11]:
w2v.wv.most_similar("violence")

[('cousin', 0.9845793843269348),
 ('back', 0.9840900301933289),
 ('once', 0.9831483960151672),
 ('sister', 0.9822748899459839),
 ('understand', 0.9815670251846313),
 ('because', 0.9814151525497437),
 ('confirmed', 0.9812175035476685),
 ('quit', 0.9808748960494995),
 ('always', 0.9808467626571655),
 ('very', 0.9807900786399841)]

In [12]:
w2v.wv.most_similar("cabin")

[('aside', 0.9007824659347534),
 ('original', 0.9007487893104553),
 ('please', 0.9005066752433777),
 ('plunged', 0.8995437622070312),
 ('excess', 0.8929291367530823),
 ('loathsome', 0.891219973564148),
 ('duties', 0.8853996992111206),
 ('permitted', 0.8851587772369385),
 ('derivative', 0.8848140239715576),
 ('winds', 0.884331226348877)]
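
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below computes that measure by hand on toy three-dimensional vectors; the vectors and the helper name `cosine_similarity` are made up for illustration. Scores close to 1 for seemingly unrelated words, as above, may indicate that the vectors all point in nearly the same direction, which is plausible for a model trained on a single short book.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 0.5]
b = [2.0, 4.0, 1.0]    # same direction as a, twice as long
c = [-2.0, 1.0, 0.0]   # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # close to 0: unrelated directions
```

Note that the measure ignores vector length, so `a` and `b` count as maximally similar even though their magnitudes differ.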

In [13]:
# index the vectors through `.wv`; indexing the model directly is deprecated
w2v.wv['milk']

array([ 0.01859159,  0.04055761, -0.00351291, -0.04027863, -0.01581051,
       -0.01603812,  0.02295979,  0.03851065, -0.08552489, -0.02482962,
        0.0238169 ,  0.02455953, -0.0017025 ,  0.06407107, -0.03061383,
        0.09628089,  0.03562825, -0.0743821 ,  0.03901686, -0.02824123], dtype=float32)

In [14]:
# a list of words returns one row per word
w2v.wv[['milk', 'cabin']]

array([[ 0.01859159,  0.04055761, -0.00351291, -0.04027863, -0.01581051,
        -0.01603812,  0.02295979,  0.03851065, -0.08552489, -0.02482962,
         0.0238169 ,  0.02455953, -0.0017025 ,  0.06407107, -0.03061383,
         0.09628089,  0.03562825, -0.0743821 ,  0.03901686, -0.02824123],
       [-0.01102102, -0.00042219,  0.00676592,  0.00438442, -0.02855915,
        -0.01677999,  0.00430627,  0.01867756, -0.0557486 , -0.0394527 ,
         0.00565027,  0.01897681,  0.01985462,  0.0678477 ,  0.0170551 ,
         0.07187837,  0.00589731, -0.02592836, -0.00037808,  0.0077379 ]], dtype=float32)

The vocabulary is very small compared to any of the pre-trained vectors. These vectors are "tuned" to this particular text, but they are probably less useful than generic word vectors such as GloVe.

If your documents contain a lot of specific jargon, you might improve performance by training a specialised set of word vectors, because a pre-trained model has no vector to assign to words that are outside its vocabulary (jargon, but also misspelt words). Often people simply set the vectors for such words to zero or initialise them randomly.
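
A minimal sketch of those two fallback strategies, using a plain dictionary of toy vectors in place of a trained model. All names here (`toy_vectors`, `lookup`, `DIM`) and the random range are assumptions chosen for illustration, not part of gensim.

```python
import random

DIM = 4  # toy dimensionality, far smaller than real word vectors

toy_vectors = {
    "monster": [0.1, -0.3, 0.2, 0.0],
    "cabin": [0.4, 0.1, -0.2, 0.3],
}

def lookup(word, vectors, strategy="zero", rng=None):
    """Return the vector for `word`, or a fallback if it is out of vocabulary."""
    if word in vectors:
        return vectors[word]
    if strategy == "zero":
        return [0.0] * DIM
    if strategy == "random":
        rng = rng or random.Random(0)
        # small random values; the range is an arbitrary choice
        return [rng.uniform(-0.05, 0.05) for _ in range(DIM)]
    raise ValueError("unknown strategy: {}".format(strategy))

print(lookup("cabin", toy_vectors))
print(lookup("cabbin", toy_vectors))                     # misspelt, so out of vocabulary
print(lookup("cabbin", toy_vectors, strategy="random"))
```

With the zero strategy an unknown word contributes nothing when vectors are summed or averaged, whereas random initialisation gives each unknown word some arbitrary direction.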


## Using word vectors for movie reviews
Let's compare using self-trained word vectors to simple TfIdf on the movie sentiment task. Use what you learnt above to train (small) word vectors on the IMDb dataset we used previously.

To train word vectors we need to:
* load all the individual reviews and chunk them into sentences
* feed sentences to our `Word2Vec` model
* train the model
* inspect word vectors (for sanity checking)

After training the vectors and checking that they are somewhat sensible try
and use them as input features for a logistic regression model instead of TfIdf
or the `CountVectorizer` that we used before in `10-tfidf.ipynb`.
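
One common way to turn word vectors into document features is to average the vectors of the words in each document. The sketch below shows the idea on toy two-dimensional vectors in pure Python. The dictionary, the helper `document_vector`, and the zero-vector fallback for documents without any known word are assumptions made for illustration.

```python
toy_vectors = {
    "good": [1.0, 0.0],
    "film": [0.0, 1.0],
    "bad": [-1.0, 0.0],
}
DIM = 2

def document_vector(tokens, vectors, dim):
    """Average the vectors of all in-vocabulary tokens in one document."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # no known words: fall back to the zero vector instead of dividing by zero
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

docs = [["good", "film"], ["bad", "film", "xyzzy"], ["xyzzy"]]
X = [document_vector(d, toy_vectors, DIM) for d in docs]
print(X)
```

The feature matrix has one row per document and one column per vector dimension, regardless of how long each document is.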

In [15]:
import numpy as np
from sklearn.datasets import load_files

reviews_train = load_files("../data/aclImdb/train/", categories=['neg', 'pos'])

text_trainval, y_trainval = reviews_train.data, reviews_train.target

print("type of text_train: {}".format(type(text_trainval)))
print("length of text_train: {}".format(len(text_trainval)))
print("class balance: {}".format(np.bincount(y_trainval)))

type of text_train: <class 'list'>
length of text_train: 25000
class balance: [12500 12500]


In [16]:
from sklearn.model_selection import train_test_split


text_trainval = [doc.replace(b"<br />", b" ") for doc in text_trainval]

text_train, text_val, y_train, y_val = train_test_split(
    text_trainval, y_trainval, stratify=y_trainval, random_state=0)

In [17]:
text_train[:10]

[b'Maybe it\'s just because I have an intense fear of hospitals and medical stuff, but this one got under my skin (pardon the pun). This piece is brave, not afraid to go over the top and as satisfying as they come in terms of revenge movies. Not only did I find myself feeling lots of hatred for the screwer and lots of sympathy towards the "screwee", I felt myself cringe and feel pangs of disgust at certain junctures which is really a rare and delightful thing for a somewhat jaded horror viewer like myself. Some parts are very reminiscant of "Hellraiser", but come off as tribute rather than imitation. It\'s a heavy handed piece that does not offer the viewer much to consider, but I enjoy being assaulted by a film once and awhile. This piece brings it and doesn\'t appologize. I liked this one a lot. Do NOT watch whilst eating pudding.',
 b'Sophmoric this film is. But, it is funny as all get out. It shows the "boys locker room mentality" being played by the "other side". It is good to see

In [18]:
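# start from the small English model and drop the components we do not need
tokenizer = spacy.load('en_core_web_sm')
tokenizer.remove_pipe("ner")
tokenizer.remove_pipe("tagger")
# spaCy 2.x API: create the sentencizer from the pipeline it is added to
tokenizer.add_pipe(tokenizer.create_pipe('sentencizer'))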

def movie_sentences(text):
    for sample in text:
        doc = tokenizer(sample.decode())
        for sentence in doc.sents:
            # maybe preprocessing the text like this helps
            yield [w.lower_.strip() for w in sentence if w.is_alpha]

In [19]:
%%time
# compare the speed of the tokenizer to a full spacy model
# that performs NER etc
# probably want to use the %%timeit magic
next(movie_sentences(text_train))

CPU times: user 58.1 ms, sys: 24.4 ms, total: 82.6 ms
Wall time: 39.1 ms


['maybe',
 'it',
 'just',
 'because',
 'i',
 'have',
 'an',
 'intense',
 'fear',
 'of',
 'hospitals',
 'and',
 'medical',
 'stuff',
 'but',
 'this',
 'one',
 'got',
 'under',
 'my',
 'skin',
 'pardon',
 'the',
 'pun']

In [20]:
%%time
# this step can take quite some time :(
# ask Tim for the pickle
all_movie_sentences = list(movie_sentences(text_train))

CPU times: user 32min 23s, sys: 9min 48s, total: 42min 11s
Wall time: 11min 5s


In [22]:
import pickle

# load sentence list
with open("all_movie_sentences.pkl", "rb") as f:
    all_movie_sentences = pickle.load(f)

# store sentence list
#with open("all_movie_sentences.pkl", "wb") as f:
#    pickle.dump(all_movie_sentences, f)

In [None]:
# FAST_VERSION is -1 when the compiled (Cython) training routines are unavailable
assert gensim.models.word2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

In [None]:
%%time
movie_w2v = Word2Vec(size=50, workers=5)
# no RAM? Use this slower version
#movie_w2v.build_vocab(movie_sentences(text_train))
movie_w2v.build_vocab(all_movie_sentences)

In [None]:
%%time
# no RAM? Use this slower version
#movie_w2v.train(movie_sentences(text_train),
#                total_examples=movie_w2v.corpus_count,
#                epochs=movie_w2v.iter
#                )
movie_w2v.train(all_movie_sentences,
                total_examples=movie_w2v.corpus_count,
                epochs=movie_w2v.iter
                )

In [None]:
# you get more specific synonyms than before
movie_w2v.wv.most_similar("movie")

In [None]:
# saving and loading the model is easy
movie_w2v.save("movie_w2v_model")

In [None]:
loaded_movie_w2v = Word2Vec.load("movie_w2v_model")

In [None]:
loaded_movie_w2v.wv.most_similar("movie")

In [None]:
loaded_movie_w2v.wv.most_similar("batman")

In [None]:
# Use word vectors as input to a logistic regression
# Let's see if we can improve on our baseline for movie reviews by using
# our own word vectors.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vect_w2v = CountVectorizer(vocabulary=loaded_movie_w2v.wv.index2word)
vect_w2v.fit(text_train)
docs = vect_w2v.inverse_transform(vect_w2v.transform(text_train))
docs[0]

In [None]:
# compute the average of the word vectors in a review to represent the whole document
# place your training data in `X_train`

In [None]:
# what should the shape of the training data in X_train be
X_train.shape

In [None]:
val_docs = vect_w2v.inverse_transform(vect_w2v.transform(text_val))

# one row per review: the mean of the vectors of the words it contains
X_val = np.vstack([np.mean(loaded_movie_w2v.wv[doc], axis=0) for doc in val_docs])

In [None]:
from sklearn.linear_model import LogisticRegression

lr_w2v = LogisticRegression(C=100).fit(X_train, y_train)
lr_w2v.score(X_train, y_train)

In [None]:
lr_w2v.score(X_val, y_val)

In [None]:
# Can you improve this by preprocessing the words that are given to the Word2Vec model
# For example by removing stop words?
# Check out the documentation for `CountVectorizer` to see if you can find the
# stopword list used by scikit-learn.
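
As a starting point for this exercise, the sketch below filters stop words out of tokenised sentences before they would be handed to `Word2Vec`. The stop word list here is a deliberately tiny, made-up set and `remove_stop_words` is a hypothetical helper. scikit-learn exposes its own English list as `ENGLISH_STOP_WORDS` in `sklearn.feature_extraction.text`, which you could pass in instead.

```python
# Hypothetical, deliberately tiny stop word list for illustration only.
TOY_STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "is", "it", "this"})

def remove_stop_words(sentences, stop_words=TOY_STOP_WORDS):
    """Yield each tokenised sentence without stop words, skipping empty ones."""
    for tokens in sentences:
        kept = [t for t in tokens if t not in stop_words]
        if kept:
            yield kept

toy = [
    ["this", "is", "a", "funny", "film"],
    ["it", "is"],
    ["pardon", "the", "pun"],
]
filtered = list(remove_stop_words(toy))
print(filtered)
```

Whether this helps is an empirical question: removing stop words shrinks the context windows the model sees, so compare validation scores with and without the filter.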

## Bonus: Compare to Google News Pretrained vectors

Surf to https://code.google.com/archive/p/word2vec/ and scroll to "Pre-trained word and phrase vectors". Download and extract [GoogleNews-vectors-negative300.bin.gz](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing).

The compressed file is about 1.5GB, so make sure you have some disk space.

Compare the similar words for these word vectors to the ones we just trained specifically for movies.

Repeat the exercise of fitting a logistic regression model, this time using the Google News word vectors.

In [None]:
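from gensim.models import KeyedVectors

# loading the 300 dimensional vectors takes a while and needs several GB of RAM
w = KeyedVectors.load_word2vec_format(
    '../GoogleNews-vectors-negative300.bin', binary=True)
w['queen'].shape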