# Training word vectors

In the last notebook we saw several applications of word and sentence vectors. In this notebook we will train our own.

For this we will use the gensim library.

Quoting Wikipedia:

> Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python. It uses NumPy, SciPy and optionally Cython for performance. Gensim is specifically designed to handle large text collections, using data streaming and efficient incremental algorithms, which differentiates it from most other scientific software packages that only target batch and in-memory processing.

LGPL licensed, https://github.com/RaRe-Technologies/gensim.

Documentation for the `Word2Vec` class: https://radimrehurek.com/gensim/models/word2vec.html

To get going, we need a set of documents on which to train our word2vec model. In principle a document can be anything from a short 140-character tweet or a single paragraph (e.g., a journal article abstract) to a news article or a whole book. In NLP parlance a collection of documents is often referred to as a corpus.

We need to create a list of sentences to feed to the `Word2Vec` class.
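
As a minimal sketch of the shape `Word2Vec` expects: each sentence is a list of string tokens, and the corpus is a list (or other iterable) of such lists. The toy text and the regex-based `tokenize` helper below are illustrative assumptions; the cells that follow use spaCy for tokenisation instead.

```python
import re

def tokenize(sentence):
    """Lower-case a sentence and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())

# Hypothetical toy corpus: two documents, already split into sentences.
toy_documents = [
    ["I commenced by inuring my body to hardship.", "Six years have passed."],
    ["The cabin was cold."],
]

# Flatten the documents into one list of tokenised sentences.
toy_sentences = [tokenize(s) for document in toy_documents for s in document]
print(toy_sentences)
```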

In [1]:
# Use `conda install -c conda-forge gensim` to install this
import gensim
from gensim.models import Word2Vec

Using TensorFlow backend.


In [2]:
import spacy

In [3]:
# loading the model can take a moment
nlp = spacy.load('en_core_web_sm')

In [4]:
# and chewing through the whole of Frankenstein as well
doc = nlp(open("../data/84-0.txt").read())

In [5]:
# you can see that the sentence boundary detection is not
# perfect in spacy, especially at the beginning of the
# book which contains lots of strangely formatted text.
for n, sentence in enumerate(doc.sents):
    # skip the first 40 "sentences"
    # disable to see the weird ones
    if n < 40:
        continue
    # maybe preprocessing the text like this helps
    print(" ".join(w.lower_.strip() for w in sentence))
    print("-" * 80)
    if n > 40 + 20:
        break

you are well  acquainted with my failure and how heavily i bore the disappointment . 
--------------------------------------------------------------------------------
but just at that time i inherited the fortune of my cousin , and my  thoughts were turned into the channel of their earlier bent . 
--------------------------------------------------------------------------------
six years have passed since i resolved on my present undertaking . 
--------------------------------------------------------------------------------
i  can , even now , remember the hour from which i dedicated myself to this  great enterprise . 
--------------------------------------------------------------------------------
i commenced by inuring my body to hardship . 
--------------------------------------------------------------------------------
i  accompanied the whale - fishers on several expeditions to the north sea ;  i voluntarily endured cold , famine , thirst , and want of sleep ; i often  worked harde

Word2vec training needs an iterable of tokenised sentences. Let's write a generator that skips over the first part of the book and applies some normalisation to each sentence.

In [6]:
# It seems simpler to use a generator than create a whole class
# as is shown in the gensim documentation. YMMV.
def sentences(document):
    for n, sentence in enumerate(document.sents):
        if n < 40:
            continue
        # maybe preprocessing the text like this helps
        yield [w.lower_.strip() for w in sentence if w.is_alpha]

In [7]:
next(sentences(doc))

['you',
 'are',
 'well',
 'acquainted',
 'with',
 'my',
 'failure',
 'and',
 'how',
 'heavily',
 'i',
 'bore',
 'the',
 'disappointment']
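
One caveat with a plain generator is that it can be iterated only once, whereas training for several epochs needs one pass over the corpus per epoch. Below is a minimal sketch of a wrapper that starts a fresh generator on every pass. `RestartableSentences` and `toy_sentences` are hypothetical names for illustration, and whether a given gensim release accepts or rejects a one-shot generator is worth checking against its documentation.

```python
class RestartableSentences:
    """Wrap a generator function so that every `iter()` call starts a fresh pass."""

    def __init__(self, generator_function, *args):
        self.generator_function = generator_function
        self.args = args

    def __iter__(self):
        return self.generator_function(*self.args)


def toy_sentences(texts):
    # Stand-in for `sentences(doc)`: yields one token list per text.
    for text in texts:
        yield text.lower().split()


texts = ["You are well acquainted", "Six years have passed"]

one_shot = toy_sentences(texts)
first_pass = list(one_shot)
second_pass = list(one_shot)  # the generator is already exhausted

restartable = RestartableSentences(toy_sentences, texts)
first = list(restartable)
second = list(restartable)    # a fresh generator is created for each pass

print(len(first_pass), len(second_pass), len(first), len(second))
# prints: 2 0 2 2
```

With the objects in this notebook the equivalent would be `RestartableSentences(sentences, doc)`. When the corpus fits in memory, `list(sentences(doc))` is a simpler alternative.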

In [8]:
# 20 dimensional vectors are probably enough for such a small text
# (gensim >= 4.0 renames `size` to `vector_size` and `iter` to `epochs`)
w2v = Word2Vec(size=20, min_count=3, iter=10)
w2v.build_vocab(sentences(doc))

In [9]:
w2v.train(sentences(doc),
          total_examples=w2v.corpus_count,
          epochs=w2v.iter
         )

49157

In [10]:
# inspect the vocabulary
w2v.wv.vocab

{'you': <gensim.models.keyedvectors.Vocab at 0x127022f98>,
 'are': <gensim.models.keyedvectors.Vocab at 0x127022e10>,
 'well': <gensim.models.keyedvectors.Vocab at 0x127022ef0>,
 'acquainted': <gensim.models.keyedvectors.Vocab at 0x127024be0>,
 'with': <gensim.models.keyedvectors.Vocab at 0x127024e10>,
 'my': <gensim.models.keyedvectors.Vocab at 0x127024d68>,
 'failure': <gensim.models.keyedvectors.Vocab at 0x127024dd8>,
 'and': <gensim.models.keyedvectors.Vocab at 0x1270240f0>,
 'how': <gensim.models.keyedvectors.Vocab at 0x127024cc0>,
 'heavily': <gensim.models.keyedvectors.Vocab at 0x127024eb8>,
 'i': <gensim.models.keyedvectors.Vocab at 0x127024320>,
 'bore': <gensim.models.keyedvectors.Vocab at 0x127026c88>,
 'the': <gensim.models.keyedvectors.Vocab at 0x1270262b0>,
 'disappointment': <gensim.models.keyedvectors.Vocab at 0x127026be0>,
 'but': <gensim.models.keyedvectors.Vocab at 0x127026668>,
 'just': <gensim.models.keyedvectors.Vocab at 0x1270266a0>,
 'at': <gensim.models.keyedve

In [11]:
w2v.wv.most_similar("violence")

[('next', 0.9769854545593262),
 ('sorrow', 0.9744818806648254),
 ('name', 0.9741429090499878),
 ('virtue', 0.9736821055412292),
 ('enough', 0.973641037940979),
 ('remaining', 0.9731167554855347),
 ('overcome', 0.972711443901062),
 ('poured', 0.9726100564002991),
 ('either', 0.9725511074066162),
 ('unable', 0.9724026918411255)]

In [12]:
w2v.wv.most_similar("cabin")

[('lead', 0.910813570022583),
 ('oh', 0.9100989699363708),
 ('uniform', 0.9086700677871704),
 ('utmost', 0.9084763526916504),
 ('web', 0.9070572853088379),
 ('grass', 0.9050130844116211),
 ('stream', 0.9039220809936523),
 ('boundary', 0.9038843512535095),
 ('wander', 0.9032682180404663),
 ('showed', 0.9023261070251465)]
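
The scores returned by `most_similar` are cosine similarities between word vectors. Here is a minimal sketch of that computation with NumPy, using made-up 3-dimensional vectors rather than vectors from the model above.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors for illustration only.
u = np.array([1.0, 2.0, 0.0])
v = np.array([2.0, 4.0, 0.0])  # same direction as u, twice as long
w = np.array([0.0, 0.0, 3.0])  # orthogonal to u

print(cosine_similarity(u, v))  # close to 1
print(cosine_similarity(u, w))  # 0
```

Vector length does not affect the score; only direction does.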

In [13]:
w2v.wv['milk']

array([-0.10349518,  0.05293521, -0.0146624 ,  0.04489234,  0.007876  ,
       -0.06108918,  0.04437191,  0.05421524,  0.0013715 ,  0.10909642,
        0.06968687, -0.0115966 , -0.0450452 , -0.03449446, -0.0087857 ,
       -0.01903279, -0.01777055, -0.03001725, -0.01400112,  0.00862293], dtype=float32)

In [14]:
w2v.wv[['milk', 'cabin']]

array([[-0.10349518,  0.05293521, -0.0146624 ,  0.04489234,  0.007876  ,
        -0.06108918,  0.04437191,  0.05421524,  0.0013715 ,  0.10909642,
         0.06968687, -0.0115966 , -0.0450452 , -0.03449446, -0.0087857 ,
        -0.01903279, -0.01777055, -0.03001725, -0.01400112,  0.00862293],
       [-0.05589633,  0.03057461, -0.02947029,  0.00460367,  0.01574911,
        -0.00695881,  0.03469225,  0.02978546, -0.01627689,  0.05157636,
         0.01818113, -0.04638796, -0.03510034, -0.00967856, -0.01573741,
         0.00311428, -0.03708719,  0.0088835 , -0.0157753 ,  0.00673747]], dtype=float32)

The vocabulary size is very small compared to any of the pre-trained vectors. These vectors are "fine-tuned" to this particular text but are probably not as useful as generic word vectors such as GloVe.

Note that if your documents contain a lot of domain-specific jargon, you might improve performance by training a specialised set of word vectors.
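
Part of the reason the vocabulary is small is the `min_count=3` setting used above, which drops every word seen fewer than three times. Here is a minimal sketch of that filtering step with `collections.Counter`. The toy sentences are made up, and this illustrates the idea rather than gensim's internal implementation.

```python
from collections import Counter

def build_vocabulary(sentences, min_count):
    """Return the set of words that occur at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

toy_sentences = [
    ["the", "monster", "fled"],
    ["the", "monster", "wept"],
    ["the", "cabin", "burned"],
]

print(sorted(build_vocabulary(toy_sentences, min_count=1)))
# ['burned', 'cabin', 'fled', 'monster', 'the', 'wept']
print(sorted(build_vocabulary(toy_sentences, min_count=2)))
# ['monster', 'the']
print(sorted(build_vocabulary(toy_sentences, min_count=3)))
# ['the']
```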


## Using word vectors for movie reviews
Let's compare self-trained word vectors to simple TF-IDF on the movie sentiment task. Use what you learnt above to train (small) word vectors on the IMDB dataset we used previously.

To train word vectors we need to:
* load all the individual reviews and chunk them into sentences
* feed sentences to our `Word2Vec` model
* train the model
* inspect word vectors (for sanity checking)

After training the vectors and checking that they are somewhat sensible, try
to use them as input features for a logistic regression model instead of the TF-IDF
or `CountVectorizer` features that we used before in `10-tfidf.ipynb`.

In [15]:
import numpy as np
from sklearn.datasets import load_files

reviews_train = load_files("../data/aclImdb/train/", categories=['neg', 'pos'])

text_trainval, y_trainval = reviews_train.data, reviews_train.target

print("type of text_train: {}".format(type(text_trainval)))
print("length of text_train: {}".format(len(text_trainval)))
print("class balance: {}".format(np.bincount(y_trainval)))

type of text_train: <class 'list'>
length of text_train: 25000
class balance: [12500 12500]


In [16]:
from sklearn.model_selection import train_test_split


text_trainval = [doc.replace(b"<br />", b" ") for doc in text_trainval]

text_train, text_val, y_train, y_val = train_test_split(
    text_trainval, y_trainval, stratify=y_trainval, random_state=0)

In [17]:
text_train[:10]

[b'Maybe it\'s just because I have an intense fear of hospitals and medical stuff, but this one got under my skin (pardon the pun). This piece is brave, not afraid to go over the top and as satisfying as they come in terms of revenge movies. Not only did I find myself feeling lots of hatred for the screwer and lots of sympathy towards the "screwee", I felt myself cringe and feel pangs of disgust at certain junctures which is really a rare and delightful thing for a somewhat jaded horror viewer like myself. Some parts are very reminiscant of "Hellraiser", but come off as tribute rather than imitation. It\'s a heavy handed piece that does not offer the viewer much to consider, but I enjoy being assaulted by a film once and awhile. This piece brings it and doesn\'t appologize. I liked this one a lot. Do NOT watch whilst eating pudding.',
 b'Sophmoric this film is. But, it is funny as all get out. It shows the "boys locker room mentality" being played by the "other side". It is good to see

In [18]:
tokenizer = spacy.load('en_core_web_sm')
tokenizer.remove_pipe("ner")
tokenizer.remove_pipe("tagger")
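# create the sentencizer from the pipeline it is added to
# (spaCy >= 3.0 uses `tokenizer.add_pipe('sentencizer')` instead)
tokenizer.add_pipe(tokenizer.create_pipe('sentencizer'))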

def movie_sentences(text):
    for sample in text:
        doc = tokenizer(sample.decode())
        for sentence in doc.sents:
            # maybe preprocessing the text like this helps
            yield [w.lower_.strip() for w in sentence if w.is_alpha]

In [19]:
%%time
# compare the speed of the tokenizer to a full spacy model
# that performs NER etc
# probably want to use the %%timeit magic
next(movie_sentences(text_train))

CPU times: user 59.1 ms, sys: 24.6 ms, total: 83.7 ms
Wall time: 31.5 ms


['maybe',
 'it',
 'just',
 'because',
 'i',
 'have',
 'an',
 'intense',
 'fear',
 'of',
 'hospitals',
 'and',
 'medical',
 'stuff',
 'but',
 'this',
 'one',
 'got',
 'under',
 'my',
 'skin',
 'pardon',
 'the',
 'pun']

In [20]:
#%%time
#all_movie_sentences = list(movie_sentences(text_train))

In [21]:
import pickle

# load sentence list
with open("all_movie_sentences.pkl", "rb") as f:
    all_movie_sentences = pickle.load(f)

# store sentence list
#with open("all_movie_sentences.pkl", "wb") as f:
#    pickle.dump(all_movie_sentences, f)
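
The load/store pair above can be folded into a single helper that computes the sentence list only when no cached file exists. This is a minimal sketch: `load_or_compute` is a hypothetical helper, and the temporary file path stands in for `all_movie_sentences.pkl`.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Load a pickled object from `path`, or compute and store it if the file is missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def expensive_tokenisation():
    # Stand-in for `list(movie_sentences(text_train))`.
    calls.append(1)
    return [["maybe", "it", "just"], ["this", "piece", "is", "brave"]]

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "sentences.pkl")
    first = load_or_compute(cache_path, expensive_tokenisation)   # computes and stores
    second = load_or_compute(cache_path, expensive_tokenisation)  # loads from disk

print(first == second, len(calls))
# prints: True 1
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.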

In [22]:
assert gensim.models.word2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

In [23]:
%%time
movie_w2v = Word2Vec(size=50, workers=5)
# no RAM? Use this slower version
#movie_w2v.build_vocab(movie_sentences(text_train))
movie_w2v.build_vocab(all_movie_sentences)

CPU times: user 1.93 s, sys: 20 ms, total: 1.95 s
Wall time: 1.96 s


In [24]:
%%time
# no RAM? Use this slower version
#movie_w2v.train(movie_sentences(text_train),
#                total_examples=movie_w2v.corpus_count,
#                epochs=movie_w2v.iter
#                )
movie_w2v.train(all_movie_sentences,
                total_examples=movie_w2v.corpus_count,
                epochs=movie_w2v.iter
                )

CPU times: user 6min 14s, sys: 1.13 s, total: 6min 15s
Wall time: 1min 17s


15829541

In [25]:
movie_w2v.wv.most_similar("movie")

[('film', 0.9476180076599121),
 ('show', 0.750161349773407),
 ('flick', 0.7486521601676941),
 ('sequel', 0.7103124856948853),
 ('documentary', 0.7055528163909912),
 ('picture', 0.7020326852798462),
 ('episode', 0.6879920959472656),
 ('series', 0.6628345847129822),
 ('it', 0.6605150103569031),
 ('movies', 0.6311708688735962)]

In [26]:
movie_w2v.save("movie_w2v_model")

In [27]:
loaded_movie_w2v = Word2Vec.load("movie_w2v_model")

In [28]:
loaded_movie_w2v.wv.most_similar("movie")

[('film', 0.9476180076599121),
 ('show', 0.750161349773407),
 ('flick', 0.7486521601676941),
 ('sequel', 0.7103124856948853),
 ('documentary', 0.7055528163909912),
 ('picture', 0.7020326852798462),
 ('episode', 0.6879920959472656),
 ('series', 0.6628345847129822),
 ('it', 0.6605150103569031),
 ('movies', 0.6311708688735962)]

In [29]:
loaded_movie_w2v.wv.most_similar("batman")

[('hamlet', 0.6821765899658203),
 ('king', 0.6802836656570435),
 ('godfather', 0.6740092039108276),
 ('trilogy', 0.6630781888961792),
 ('novels', 0.631263017654419),
 ('superman', 0.6286085844039917),
 ('bbc', 0.628006100654602),
 ('shakespeare', 0.6208999156951904),
 ('sherlock', 0.6149337291717529),
 ('spielberg', 0.6132889986038208)]

In [30]:
# Use word vectors as input to a logistic regression

In [31]:
from sklearn.feature_extraction.text import CountVectorizer

vect_w2v = CountVectorizer(vocabulary=loaded_movie_w2v.wv.index2word)
vect_w2v.fit(text_train)
docs = vect_w2v.inverse_transform(vect_w2v.transform(text_train))
docs[0]

array(['the', 'and', 'of', 'to', 'is', 'it', 'in', 'this', 'that', 'as',
       'for', 'but', 'film', 'not', 'are', 'have', 'one', 'at', 'they',
       'by', 'an', 'like', 'just', 'do', 'some', 'very', 'my', 'only',
       'which', 'really', 'did', 'does', 'than', 'much', 'because',
       'movies', 'watch', 'being', 'over', 'off', 'go', 'thing', 'find',
       'lot', 'got', 'horror', 'come', 'feel', 'rather', 'maybe', 'once',
       'top', 'enjoy', 'piece', 'felt', 'liked', 'under', 'viewer',
       'parts', 'myself', 'stuff', 'feeling', 'somewhat', 'lots',
       'certain', 'towards', 'brings', 'fear', 'revenge', 'consider',
       'heavy', 'rare', 'terms', 'offer', 'intense', 'afraid', 'whilst',
       'delightful', 'eating', 'satisfying', 'brave', 'skin', 'sympathy',
       'handed', 'medical', 'tribute', 'hatred', 'cringe', 'imitation',
       'awhile', 'pun', 'disgust', 'jaded', 'pardon', 'hellraiser',
       'hospitals', 'assaulted', 'pudding'],
      dtype='<U20')

In [32]:
X_train = np.vstack([np.mean(loaded_movie_w2v.wv[doc], axis=0) for doc in docs])

In [33]:
X_train.shape

(18750, 50)
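
The feature construction above represents each review as the mean of the vectors of its in-vocabulary words. Because `inverse_transform` returns each word once per document, repeated words are not weighted more heavily. Here is a minimal sketch of the same idea with a made-up embedding table. `toy_vectors` and `document_vector` are hypothetical, and the zero-vector fallback for reviews with no known words is an assumption, not something the notebook does.

```python
import numpy as np

# Made-up 2-dimensional "word vectors" for illustration.
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "film": np.array([0.0, 1.0]),
    "bad": np.array([-1.0, 0.0]),
}

def document_vector(words, vectors, dim=2):
    """Average the vectors of the distinct words that are in the vocabulary."""
    known = [vectors[w] for w in sorted(set(words)) if w in vectors]
    if not known:
        # Assumption: fall back to a zero vector when no word is known.
        return np.zeros(dim)
    return np.mean(known, axis=0)

documents = [["good", "good", "film"], ["bad", "film", "plot"], ["plot"]]
X = np.vstack([document_vector(d, toy_vectors) for d in documents])
print(X.shape)
print(X)
```

Each row of `X` is one document, so the matrix can be passed to a classifier in the same way as `X_train`.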

In [34]:
val_docs = vect_w2v.inverse_transform(vect_w2v.transform(text_val))

X_val = np.vstack([np.mean(loaded_movie_w2v.wv[doc], axis=0) for doc in val_docs])

In [35]:
from sklearn.linear_model import LogisticRegression

lr_w2v = LogisticRegression(C=100).fit(X_train, y_train)
lr_w2v.score(X_train, y_train)

0.80864000000000003

In [36]:
lr_w2v.score(X_val, y_val)

0.80000000000000004

In [37]:
# Can you improve this by preprocessing the words that are given to the Word2Vec model,
# for example by removing stop words?
# Check out the documentation for `CountVectorizer` to see if you can find the
# stopword list used by scikit-learn.