# Training word vectors

In the last notebook we saw several applications of word and sentence vectors. In this notebook we will train our own.

For this we will use the gensim library.

Quoting Wikipedia:

> Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python. It uses NumPy, SciPy and optionally Cython for performance. Gensim is specifically designed to handle large text collections, using data streaming and efficient incremental algorithms, which differentiates it from most other scientific software packages that only target batch and in-memory processing.

LGPL licensed, https://github.com/RaRe-Technologies/gensim.

Documentation for the `Word2Vec` class: https://radimrehurek.com/gensim/models/word2vec.html

To get going, we need a set of documents to train our word2vec model. In theory, a document could be anything: a short 140-character tweet, a single paragraph (e.g., a journal article abstract), a news article, or a book. In NLP parlance, a collection of documents is often referred to as a corpus.

In [1]:
import gensim
from gensim.models import Word2Vec

Using TensorFlow backend.


In [2]:
import spacy

In [3]:
# loading the model can take a moment
nlp = spacy.load('en_core_web_sm')

In [4]:
# and chewing through the whole of Frankenstein as well
doc = nlp(open("../data/84-0.txt").read())

In [5]:
# you can see that the sentence boundary detection is not
# perfect in spacy, especially at the beginning of the
# book which contains lots of strangely formatted text.
for n, sentence in enumerate(doc.sents):
    if n < 40:
        continue
    # maybe preprocessing the text like this helps
    print(" ".join(w.lower_.strip() for w in sentence))
    print("-" * 80)
    if n > 40 + 20:
        break

you are well  acquainted with my failure and how heavily i bore the disappointment . 
--------------------------------------------------------------------------------
but just at that time i inherited the fortune of my cousin , and my  thoughts were turned into the channel of their earlier bent . 
--------------------------------------------------------------------------------
six years have passed since i resolved on my present undertaking . 
--------------------------------------------------------------------------------
i  can , even now , remember the hour from which i dedicated myself to this  great enterprise . 
--------------------------------------------------------------------------------
i commenced by inuring my body to hardship . 
--------------------------------------------------------------------------------
i  accompanied the whale - fishers on several expeditions to the north sea ;  i voluntarily endured cold , famine , thirst , and want of sleep ; i often  worked harde

The word2vec training needs a generator of sentences. Let's write one that skips over the first part of the book, and then applies some normalisation to each sentence.

In [6]:
# It seems simpler to use a generator than create a whole class
# as is shown in the gensim documentation. YMMV.
def sentences(document):
    for n, sentence in enumerate(document.sents):
        if n < 40:
            continue
        # maybe preprocessing the text like this helps
        yield [w.lower_.strip() for w in sentence if w.is_alpha]

In [7]:
next(sentences(doc))

['you',
 'are',
 'well',
 'acquainted',
 'with',
 'my',
 'failure',
 'and',
 'how',
 'heavily',
 'i',
 'bore',
 'the',
 'disappointment']
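
One caveat with plain generators is that they can be consumed only once. The gensim documentation describes the training input as an iterable that can be restarted, so when training for several epochs a one-shot generator may leave nothing for the later passes. The sketch below illustrates the difference with a small wrapper; the names `RestartableSentences` and `toy_sentences` are hypothetical and only stand in for the `sentences(doc)` generator above.

```python
class RestartableSentences:
    """Wrap a generator function so that every iter() call starts a fresh pass."""

    def __init__(self, generator_function, *args):
        self.generator_function = generator_function
        self.args = args

    def __iter__(self):
        # Calling the generator function again creates a brand new generator.
        return self.generator_function(*self.args)


def toy_sentences(lines):
    # Stand-in for `sentences(doc)`: lowercase and keep alphabetic tokens only.
    for line in lines:
        yield [w.lower() for w in line.split() if w.isalpha()]


lines = ["I commenced by inuring my body to hardship", "Six years have passed"]

one_shot = toy_sentences(lines)
first_pass = list(one_shot)
second_pass = list(one_shot)   # the generator is already exhausted here

restartable = RestartableSentences(toy_sentences, lines)
passes = [list(restartable), list(restartable)]   # two full passes
```

With the objects from this notebook the equivalent would be `RestartableSentences(sentences, doc)`. If the corpus fits in memory, `list(sentences(doc))` is a simpler way to get the same effect.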

In [8]:
# 20 dimensional vectors are probably enough for such a small text
w2v = Word2Vec(size=20, min_count=3, iter=10)
w2v.build_vocab(sentences(doc))

In [9]:
w2v.train(sentences(doc),
          total_examples=w2v.corpus_count,
          epochs=w2v.iter
         )

49157

In [10]:
# inspect the vocabulary
w2v.wv.vocab

{'you': <gensim.models.keyedvectors.Vocab at 0x129208c18>,
 'are': <gensim.models.keyedvectors.Vocab at 0x129208cf8>,
 'well': <gensim.models.keyedvectors.Vocab at 0x12920b0f0>,
 'acquainted': <gensim.models.keyedvectors.Vocab at 0x12920bfd0>,
 'with': <gensim.models.keyedvectors.Vocab at 0x12920b7f0>,
 'my': <gensim.models.keyedvectors.Vocab at 0x12920b390>,
 'failure': <gensim.models.keyedvectors.Vocab at 0x12920bc50>,
 'and': <gensim.models.keyedvectors.Vocab at 0x12920b630>,
 'how': <gensim.models.keyedvectors.Vocab at 0x12920be48>,
 'heavily': <gensim.models.keyedvectors.Vocab at 0x12920b128>,
 'i': <gensim.models.keyedvectors.Vocab at 0x12920ce80>,
 'bore': <gensim.models.keyedvectors.Vocab at 0x12920c438>,
 'the': <gensim.models.keyedvectors.Vocab at 0x12920cb00>,
 'disappointment': <gensim.models.keyedvectors.Vocab at 0x12920c940>,
 'but': <gensim.models.keyedvectors.Vocab at 0x12920c978>,
 'just': <gensim.models.keyedvectors.Vocab at 0x12920c470>,
 'at': <gensim.models.keyedve

In [11]:
w2v.wv.most_similar("violence")

[('endure', 0.9793963432312012),
 ('poured', 0.9728488922119141),
 ('charge', 0.9725732803344727),
 ('william', 0.9724262952804565),
 ('ecstasy', 0.9720258116722107),
 ('frankenstein', 0.9715650677680969),
 ('sorrow', 0.9713457822799683),
 ('city', 0.9712594747543335),
 ('agreement', 0.9709281921386719),
 ('effects', 0.97092604637146)]

In [12]:
w2v.wv.most_similar("cabin")

[('endowed', 0.9553591012954712),
 ('utterance', 0.9377244710922241),
 ('ocean', 0.9333876371383667),
 ('ernest', 0.9326275587081909),
 ('lives', 0.9325741529464722),
 ('shut', 0.9272109270095825),
 ('group', 0.9261401295661926),
 ('performed', 0.9242812991142273),
 ('cried', 0.9219566583633423),
 ('copying', 0.9213256239891052)]
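
The number next to each word is a cosine similarity: the cosine of the angle between the two word vectors, ignoring their lengths. The sketch below computes it with NumPy on made-up 3-dimensional vectors; the variable names and values are illustrative assumptions, not taken from the trained model.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up vectors, only for illustration.
v_cabin = np.array([0.2, 0.1, 0.4])
v_ocean = np.array([0.4, 0.2, 0.8])    # same direction as v_cabin, twice as long
v_other = np.array([-0.1, 0.2, 0.0])   # at a right angle to v_cabin

same_direction = cosine_similarity(v_cabin, v_ocean)   # close to 1
right_angle = cosine_similarity(v_cabin, v_other)      # close to 0
```

When nearly every neighbour scores above 0.9, as in the two queries above, the vectors all point in much the same direction, which may be a sign that the model has seen too little text to separate the words.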

In [13]:
w2v.wv['milk']

array([-0.01427533, -0.00309333, -0.03516025,  0.04312757, -0.05083948,
        0.0157671 ,  0.01348748,  0.00242899, -0.02110313,  0.02872177,
       -0.04411073,  0.02081146, -0.0285642 ,  0.00983214, -0.01428768,
        0.11524288, -0.00472942, -0.03330777,  0.03491119, -0.02706871], dtype=float32)

In [14]:
w2v.wv[['milk', 'cabin']]

array([[-0.01427533, -0.00309333, -0.03516025,  0.04312757, -0.05083948,
         0.0157671 ,  0.01348748,  0.00242899, -0.02110313,  0.02872177,
        -0.04411073,  0.02081146, -0.0285642 ,  0.00983214, -0.01428768,
         0.11524288, -0.00472942, -0.03330777,  0.03491119, -0.02706871],
       [ 0.01554712, -0.0135473 , -0.01374551,  0.02475062, -0.03733797,
         0.00212585,  0.02688417,  0.03376074, -0.03926441,  0.05290264,
        -0.01627741,  0.01778587, -0.03651457,  0.03482833, -0.00515658,
         0.07903156,  0.00192991, -0.02872157, -0.00893783, -0.00679017]], dtype=float32)

The vocabulary size is very small compared to any of the pre-trained vectors. These vectors are "fine-tuned" to this particular text but are probably not as useful as generic word vectors such as GloVe.

One thing to notice is that if you have a lot of specific jargon in your documents you might improve your performance by training a specialised set of word vectors.
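
One rough way to judge whether specialised vectors are worth training is to measure how many of your tokens a generic vocabulary covers. The sketch below is a minimal illustration in plain Python; `vocabulary_coverage`, the toy vocabulary, and the example tokens are all hypothetical.

```python
def vocabulary_coverage(tokens, vocabulary):
    """Return the fraction of tokens that appear in the vocabulary."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)


# Made-up example: a generic vocabulary that lacks a domain-specific term.
generic_vocabulary = {"the", "patient", "was", "given", "a", "dose", "of"}
tokens = ["the", "patient", "was", "given", "a", "dose", "of", "apixaban"]

coverage = vocabulary_coverage(tokens, generic_vocabulary)
missing = sorted(set(tokens) - generic_vocabulary)
```

A low coverage, or a list of missing words that carry most of the meaning, would be an argument for training vectors on your own documents.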


## Using word vectors for movie reviews
Let's compare self-trained word vectors to simple TfIdf on the movie sentiment task. Use what you learnt above to train (small) word vectors on the IMDB dataset we used previously.

To train word vectors we need to:
* load all the individual reviews and chunk them into sentences
* feed sentences to our `Word2Vec` model
* train the model
* inspect word vectors (for sanity checking)

After training the vectors and checking that they are somewhat sensible try
and use them as input features for a logistic regression model instead of TfIdf
or the `CountVectorizer` that we used before in `10-tfidf.ipynb`.

In [15]:
import numpy as np
from sklearn.datasets import load_files

reviews_train = load_files("../data/aclImdb/train/", categories=['neg', 'pos'])

text_trainval, y_trainval = reviews_train.data, reviews_train.target

print("type of text_train: {}".format(type(text_trainval)))
print("length of text_train: {}".format(len(text_trainval)))
print("class balance: {}".format(np.bincount(y_trainval)))

type of text_train: <class 'list'>
length of text_train: 25000
class balance: [12500 12500]


In [16]:
from sklearn.model_selection import train_test_split


text_trainval = [doc.replace(b"<br />", b" ") for doc in text_trainval]

text_train, text_val, y_train, y_val = train_test_split(
    text_trainval, y_trainval, stratify=y_trainval, random_state=0)

In [17]:
text_train[:10]

[b'Maybe it\'s just because I have an intense fear of hospitals and medical stuff, but this one got under my skin (pardon the pun). This piece is brave, not afraid to go over the top and as satisfying as they come in terms of revenge movies. Not only did I find myself feeling lots of hatred for the screwer and lots of sympathy towards the "screwee", I felt myself cringe and feel pangs of disgust at certain junctures which is really a rare and delightful thing for a somewhat jaded horror viewer like myself. Some parts are very reminiscant of "Hellraiser", but come off as tribute rather than imitation. It\'s a heavy handed piece that does not offer the viewer much to consider, but I enjoy being assaulted by a film once and awhile. This piece brings it and doesn\'t appologize. I liked this one a lot. Do NOT watch whilst eating pudding.',
 b'Sophmoric this film is. But, it is funny as all get out. It shows the "boys locker room mentality" being played by the "other side". It is good to see

In [18]:
tokenizer = spacy.load('en_core_web_sm')
tokenizer.remove_pipe("ner")
tokenizer.remove_pipe("tagger")
tokenizer.add_pipe(tokenizer.create_pipe('sentencizer'))

def movie_sentences(text):
    for sample in text:
        doc = tokenizer(sample.decode())
        for sentence in doc.sents:
            # maybe preprocessing the text like this helps
            yield [w.lower_.strip() for w in sentence if w.is_alpha]

In [19]:
%%time
# compare the speed of the tokenizer to a full spacy model
# that performs NER etc
# probably want to use the %%timeit magic
next(movie_sentences(text_train))

CPU times: user 58.9 ms, sys: 17.7 ms, total: 76.6 ms
Wall time: 26 ms


['maybe',
 'it',
 'just',
 'because',
 'i',
 'have',
 'an',
 'intense',
 'fear',
 'of',
 'hospitals',
 'and',
 'medical',
 'stuff',
 'but',
 'this',
 'one',
 'got',
 'under',
 'my',
 'skin',
 'pardon',
 'the',
 'pun']

In [29]:
%%time
all_movie_sentences = list(movie_sentences(text_train))

CPU times: user 31min 35s, sys: 8min 49s, total: 40min 24s
Wall time: 10min 26s


In [20]:
assert gensim.models.word2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

In [33]:
%%time
movie_w2v = Word2Vec(size=50, workers=5)
# no RAM? Use this slower version
#movie_w2v.build_vocab(movie_sentences(text_train))
movie_w2v.build_vocab(all_movie_sentences)

CPU times: user 1.96 s, sys: 12.9 ms, total: 1.97 s
Wall time: 1.97 s


In [34]:
%%time
# no RAM? Use this slower version
#movie_w2v.train(movie_sentences(text_train),
#                total_examples=movie_w2v.corpus_count,
#                epochs=movie_w2v.iter
#                )
movie_w2v.train(all_movie_sentences,
                total_examples=movie_w2v.corpus_count,
                epochs=movie_w2v.iter
                )

CPU times: user 6min 1s, sys: 725 ms, total: 6min 2s
Wall time: 1min 14s


15831755

In [35]:
movie_w2v.wv.most_similar("movie")

[('film', 0.944290041923523),
 ('flick', 0.7585451602935791),
 ('show', 0.7337191104888916),
 ('sequel', 0.7283260226249695),
 ('picture', 0.7199562788009644),
 ('documentary', 0.7051140666007996),
 ('it', 0.6698643565177917),
 ('series', 0.667942225933075),
 ('episode', 0.6619446873664856),
 ('mess', 0.6303345561027527)]

In [36]:
movie_w2v.save("movie_w2v_model")

In [38]:
loaded_movie_w2v = Word2Vec.load("movie_w2v_model")

In [39]:
loaded_movie_w2v.wv.most_similar("movie")

[('film', 0.944290041923523),
 ('flick', 0.7585451602935791),
 ('show', 0.7337191104888916),
 ('sequel', 0.7283260226249695),
 ('picture', 0.7199562788009644),
 ('documentary', 0.7051140666007996),
 ('it', 0.6698643565177917),
 ('series', 0.667942225933075),
 ('episode', 0.6619446873664856),
 ('mess', 0.6303345561027527)]

In [67]:
loaded_movie_w2v.wv.most_similar("batman")

[('superman', 0.733562171459198),
 ('lotr', 0.6726023554801941),
 ('trilogy', 0.6343921422958374),
 ('shakespeare', 0.632647693157196),
 ('godfather', 0.6310911178588867),
 ('k', 0.6288937926292419),
 ('king', 0.624601423740387),
 ('zu', 0.6244131922721863),
 ('holmes', 0.6220515966415405),
 ('hamlet', 0.613922655582428)]

In [None]:
# Use word vectors as input to a logistic regression

In [42]:
from sklearn.feature_extraction.text import CountVectorizer

vect_w2v = CountVectorizer(vocabulary=loaded_movie_w2v.wv.index2word)
vect_w2v.fit(text_train)
docs = vect_w2v.inverse_transform(vect_w2v.transform(text_train))
docs[0]

array(['the', 'and', 'of', 'to', 'is', 'it', 'in', 'this', 'that', 'as',
       'for', 'but', 'film', 'not', 'are', 'have', 'one', 'at', 'they',
       'by', 'an', 'like', 'just', 'do', 'some', 'very', 'my', 'only',
       'which', 'really', 'did', 'does', 'than', 'much', 'because',
       'movies', 'watch', 'being', 'over', 'off', 'go', 'thing', 'find',
       'lot', 'got', 'horror', 'come', 'feel', 'rather', 'maybe', 'once',
       'top', 'enjoy', 'piece', 'felt', 'liked', 'under', 'viewer',
       'parts', 'myself', 'stuff', 'feeling', 'somewhat', 'lots',
       'certain', 'towards', 'brings', 'fear', 'revenge', 'consider',
       'heavy', 'rare', 'terms', 'offer', 'intense', 'afraid', 'whilst',
       'delightful', 'eating', 'satisfying', 'brave', 'skin', 'sympathy',
       'handed', 'medical', 'tribute', 'hatred', 'cringe', 'imitation',
       'awhile', 'pun', 'disgust', 'jaded', 'pardon', 'hellraiser',
       'hospitals', 'assaulted', 'pudding'],
      dtype='<U20')

In [44]:
X_train = np.vstack([np.mean(loaded_movie_w2v.wv[doc], axis=0) for doc in docs])

In [45]:
X_train.shape

(18750, 50)
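
Note that `inverse_transform` lists each vocabulary word found in a review once, so the features above are averages over distinct words, not over every occurrence. A review containing no known word would also need special handling. The sketch below shows both variants on toy data with NumPy; `mean_vectors`, `counts`, and `vectors` are hypothetical stand-ins for the count matrix and the embedding matrix.

```python
import numpy as np

# Toy stand-ins: 3 documents, a 4-word vocabulary, 2-dimensional word vectors.
counts = np.array([[2, 1, 0, 0],     # word 0 appears twice, word 1 once
                   [0, 0, 3, 1],
                   [0, 0, 0, 0]])    # no known words in this document
vectors = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0],
                    [-1.0, 1.0]])


def mean_vectors(counts, vectors, weighted=True):
    """Average word vectors per document; documents without words stay zero."""
    weights = counts if weighted else (counts > 0)
    weights = weights.astype(float)
    totals = weights.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0          # avoid dividing by zero
    return weights @ vectors / totals


unweighted = mean_vectors(counts, vectors, weighted=False)  # distinct words
weighted = mean_vectors(counts, vectors, weighted=True)     # every occurrence
```

Applying this to the notebook's data would assume that the columns of the count matrix and the rows of the embedding matrix follow the same word order. Whether the weighted variant helps the classifier is something to check on the validation set.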

In [51]:
val_docs = vect_w2v.inverse_transform(vect_w2v.transform(text_val))

X_val = np.vstack([np.mean(loaded_movie_w2v.wv[doc], axis=0) for doc in val_docs])

In [53]:
from sklearn.linear_model import LogisticRegression

lr_w2v = LogisticRegression(C=100).fit(X_train, y_train)
lr_w2v.score(X_train, y_train)

0.80869333333333338

In [54]:
lr_w2v.score(X_val, y_val)

0.79776000000000002

In [None]:
# Can you improve this by preprocessing the words that are given to the Word2Vec model,
# for example by removing stop words?
# Check out the documentation for `CountVectorizer` to see if you can find the
# stopword list used by scikit-learn.

## Bonus: Compare to Google News Pretrained vectors

Surf to https://code.google.com/archive/p/word2vec/ and scroll to "Pre-trained word and phrase vectors". Download and extract [GoogleNews-vectors-negative300.bin.gz](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing).

The compressed file is about 1.5GB, so make sure you have some disk space.

Compare the similar words for these word vectors to the ones we just trained specifically for movies.

Repeat the exercise of fitting a logistic regression model on the new Google word vectors.

In [None]:
from gensim.models import KeyedVectors

w = KeyedVectors.load_word2vec_format(
    '../GoogleNews-vectors-negative300.bin', binary=True)
w['queen'].shape