# Exercise Monday

Train your own model

In [1]:
from nltk.tokenize import sent_tokenize
import string
import re

# tqdm allows you to display progress bars in loops
from tqdm import tqdm
from datetime import datetime

# you need to have courseutils.py in the same folder
from courseutils import get_review_data

import gensim

# let's get more output
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Step 1: Get a lot of texts

I'll just take the movie reviews here, but you are *very much encouraged* to take your own data. Use any method to get them into a long list (or similar).

In [2]:
train, test, _, _ = get_review_data()

Using cached file reviewdata.pickle.bz2


In [3]:
# we just need one list
print(f"The original dataset has two sets of reviews of length {len(train)} and {len(test)}")
train.extend(test)
del test
print(f"We merged them into one list of {len(train)} reviews")

The original dataset has two sets of reviews of length 25000 and 25000
We merged them into one list of 50000 reviews


In [4]:
train[:3]

['I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, e

## Step 2: Reformat
We want to train on sentences, not on whole reviews. We don't need a list of reviews, but a list of sentences.

Also, **we only want unique sentences**. It has been shown that this improves the resulting models (and it speeds up training, of course).

There are different ways of achieving this; here is one. Some remarks:

- `tqdm` displays a progress bar; it's not strictly necessary
- a set is like a list without order in which all items are guaranteed to be unique. You could also use a list, but the set is faster. With a list, you need `uniquesentences = []` and `.append()` instead of `.add()`, plus an explicit check that the sentence is not in the list yet
- we also remove punctuation
- whether to lowercase here depends on whether the texts we later want to apply the model to are lowercased. That's a decision to make; here, we lowercase.
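
To make the remarks above concrete, here is a minimal sketch on made-up toy sentences (not taken from the review data). It applies the same cleaning steps and collects the results both in a list and in a set; the helper name `clean` is only used for this illustration.

```python
import re
import string

# Toy input, made up for illustration only
raw = ["A great film!", "a great film", "Terrible.<br />Really."]

# translation scheme for removing punctuation
trans = str.maketrans('', '', string.punctuation)

def clean(sentence):
    """Replace HTML tags by a space, strip punctuation, lowercase."""
    sentence = re.sub(r"<.*?>", " ", sentence)
    return sentence.translate(trans).lower()

as_list = []
as_set = set()
for sentence in raw:
    cleaned = clean(sentence)
    if cleaned not in as_list:   # a list needs an explicit (linear) check
        as_list.append(cleaned)
    as_set.add(cleaned)          # a set silently ignores duplicates

print(as_list)      # ['a great film', 'terrible really']
print(len(as_set))  # 2
```

Both containers end up with the same two sentences. The difference is that the membership check on the list scans all stored items, which becomes slow with hundreds of thousands of sentences.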

In [5]:
trans = str.maketrans('', '', string.punctuation) # translation scheme for removing punctuation
uniquesentences = set()
for review in tqdm(train):
    for sentence in sent_tokenize(review):
        # replace HTML tags (such as <br />) by a space
        sentence = re.sub(r"<.*?>", " ", sentence)
        # strip punctuation
        sentence = sentence.translate(trans)
        # a set ignores duplicates, so no membership check is needed
        uniquesentences.add(sentence.lower())

100%|██████████| 50000/50000 [00:19<00:00, 2541.32it/s]


In [6]:
print(f"We now have {len(uniquesentences)} unique sentences.")

We now have 516343 unique sentences.


In [7]:
# if we want to, we can turn the set into a list and inspect it, e.g. like this:
# list(uniquesentences)[:10]

**Note that `uniquesentences` can also be a generator that reads from disk (or from elsewhere) for the next step. Hence, it is possible to train models on more sentences than fit in your memory!**
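
A minimal sketch of such a disk-backed source, assuming a plain-text file with one cleaned sentence per line. The class name `SentencesFromFile` and the file layout are hypothetical, not part of the course material. It is written as a class with `__iter__` rather than as a plain generator so that it can be iterated more than once (one pass for building the vocabulary, one or more for training), and it yields each sentence as a list of tokens.

```python
import os
import tempfile

class SentencesFromFile:
    """Restartable iterable: yields one tokenized sentence per non-empty line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so only one line is in memory at a time
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                tokens = line.split()
                if tokens:  # skip empty lines
                    yield tokens

# Demo with a small temporary file (made-up content)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("this movie was great\n\nthe plot made no sense\n")
    path = tmp.name

sentences = SentencesFromFile(path)
first_pass = list(sentences)
second_pass = list(sentences)   # works again, unlike a plain generator
os.remove(path)

print(first_pass)
```

Because each pass starts from the beginning of the file, one such object can serve every step that needs to read the corpus.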

## Step 3: Train the model

That's really straightforward in gensim.

In [8]:
# we do not need a list of lists of tokens later on, so let's use a generator instead of a list to save memory
# note that we use round parentheses instead of square brackets to achieve this
# we do need two generators, though, as we first need to build the vocabulary and later need to train.
# If we used a list, we would obviously need only one.
tokenizedsentences = (sentence.split() for sentence in uniquesentences)
tokenizedsentences2 = (sentence.split() for sentence in uniquesentences)
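
Why two generators? A generator can be consumed only once; after the first full pass it yields nothing. The toy example below (made-up sentences, hypothetical variable names) illustrates this:

```python
# Toy data, made up for illustration
toy_sentences = {"a great film", "terrible really"}

gen = (sentence.split() for sentence in toy_sentences)

first = list(gen)    # consumes the generator
second = list(gen)   # the generator is now exhausted

print(len(first), len(second))  # 2 0
```

Building the vocabulary consumes the first generator completely, so training needs a fresh one.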

In [9]:
print(f"Started setting up the model at {datetime.now()}")
model = gensim.models.Word2Vec(size=300) # we want 300 dimensions (in gensim >= 4.0, this parameter is called vector_size)
model.build_vocab(tokenizedsentences)
print(f"Started training at {datetime.now()}")
model.train(tokenizedsentences2, total_examples=model.corpus_count, epochs=model.epochs)
print(f"Finished training at {datetime.now()}")

2021-04-08 12:42:55,880 : INFO : collecting all words and their counts
2021-04-08 12:42:55,880 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-04-08 12:42:55,928 : INFO : PROGRESS: at sentence #10000, processed 215831 words, keeping 20409 word types
2021-04-08 12:42:55,979 : INFO : PROGRESS: at sentence #20000, processed 428994 words, keeping 29908 word types
2021-04-08 12:42:56,044 : INFO : PROGRESS: at sentence #30000, processed 649955 words, keeping 37572 word types


Started setting up the model at 2021-04-08 12:42:55.879829


2021-04-08 12:42:56,096 : INFO : PROGRESS: at sentence #40000, processed 870867 words, keeping 43679 word types
2021-04-08 12:42:56,148 : INFO : PROGRESS: at sentence #50000, processed 1085511 words, keeping 49110 word types
2021-04-08 12:42:56,198 : INFO : PROGRESS: at sentence #60000, processed 1302984 words, keeping 53978 word types
2021-04-08 12:42:56,250 : INFO : PROGRESS: at sentence #70000, processed 1524256 words, keeping 58460 word types
2021-04-08 12:42:56,300 : INFO : PROGRESS: at sentence #80000, processed 1744797 words, keeping 62623 word types
2021-04-08 12:42:56,360 : INFO : PROGRESS: at sentence #90000, processed 1963297 words, keeping 66550 word types
2021-04-08 12:42:56,410 : INFO : PROGRESS: at sentence #100000, processed 2178469 words, keeping 70222 word types
2021-04-08 12:42:56,461 : INFO : PROGRESS: at sentence #110000, processed 2397220 words, keeping 73787 word types
2021-04-08 12:42:56,511 : INFO : PROGRESS: at sentence #120000, processed 2615912 words, keepin

Started training at 2021-04-08 12:43:08.691256


2021-04-08 12:43:09,715 : INFO : EPOCH 1 - PROGRESS: at 4.23% examples, 347486 words/s, in_qsize 6, out_qsize 0
2021-04-08 12:43:10,797 : INFO : EPOCH 1 - PROGRESS: at 8.90% examples, 355144 words/s, in_qsize 6, out_qsize 1
2021-04-08 12:43:11,808 : INFO : EPOCH 1 - PROGRESS: at 12.64% examples, 339547 words/s, in_qsize 5, out_qsize 0
2021-04-08 12:43:12,809 : INFO : EPOCH 1 - PROGRESS: at 16.12% examples, 328803 words/s, in_qsize 4, out_qsize 1
2021-04-08 12:43:13,816 : INFO : EPOCH 1 - PROGRESS: at 20.94% examples, 342361 words/s, in_qsize 5, out_qsize 2
2021-04-08 12:43:14,821 : INFO : EPOCH 1 - PROGRESS: at 29.57% examples, 403586 words/s, in_qsize 5, out_qsize 0
2021-04-08 12:43:15,829 : INFO : EPOCH 1 - PROGRESS: at 35.58% examples, 417240 words/s, in_qsize 4, out_qsize 1
2021-04-08 12:43:16,845 : INFO : EPOCH 1 - PROGRESS: at 41.26% examples, 423454 words/s, in_qsize 4, out_qsize 1
2021-04-08 12:43:17,850 : INFO : EPOCH 1 - PROGRESS: at 49.05% examples, 448140 words/s, in_qsize 

Finished training at 2021-04-08 12:43:24.036453


In [10]:
model.save("mymodel")

2021-04-08 12:43:24,040 : INFO : saving Word2Vec object under mymodel, separately None
2021-04-08 12:43:24,041 : INFO : storing np array 'vectors' to mymodel.wv.vectors.npy
2021-04-08 12:43:24,150 : INFO : not storing attribute vectors_norm
2021-04-08 12:43:24,150 : INFO : storing np array 'syn1neg' to mymodel.trainables.syn1neg.npy
2021-04-08 12:43:24,263 : INFO : not storing attribute cum_table
2021-04-08 12:43:24,313 : INFO : saved mymodel


In [11]:
# and load it again, just to check
mymodel = gensim.models.Word2Vec.load("mymodel")

2021-04-08 12:43:24,318 : INFO : loading Word2Vec object from mymodel
2021-04-08 12:43:24,435 : INFO : loading wv recursively from mymodel.wv.* with mmap=None
2021-04-08 12:43:24,436 : INFO : loading vectors from mymodel.wv.vectors.npy with mmap=None
2021-04-08 12:43:24,484 : INFO : setting ignored attribute vectors_norm to None
2021-04-08 12:43:24,485 : INFO : loading vocabulary recursively from mymodel.vocabulary.* with mmap=None
2021-04-08 12:43:24,485 : INFO : loading trainables recursively from mymodel.trainables.* with mmap=None
2021-04-08 12:43:24,486 : INFO : loading syn1neg from mymodel.trainables.syn1neg.npy with mmap=None
2021-04-08 12:43:24,510 : INFO : setting ignored attribute cum_table to None
2021-04-08 12:43:24,510 : INFO : loaded mymodel


## Step 4: Play with the model

In [12]:
animals = ['cat', 'dog', 'horse', 'goldfish', 'lion']
for animal in animals:
    try:
        print(f"A {animal} is almost the same as a {model.wv.most_similar(animal)[0][0]}.")
    except Exception as e:
        print(e)

2021-04-08 12:43:24,569 : INFO : precomputing L2-norms of word weight vectors


A cat is almost the same as a horse.
A dog is almost the same as a cat.
A horse is almost the same as a tree.
A goldfish is almost the same as a fifi.
A lion is almost the same as a luther.


In [13]:
words = ['director', 'actor', 'bad', 'good']
for word in words:
    try:
        print(f"A {word} is almost the same as a {model.wv.most_similar(word)[0][0]}.")
    except Exception as e:
        print(e)

A director is almost the same as a writer.
A actor is almost the same as a actress.
A bad is almost the same as a terrible.
A good is almost the same as a decent.


In [14]:
model.wv.most_similar("action")

[('suspense', 0.7064108848571777),
 ('gore', 0.6617962121963501),
 ('adventure', 0.6219613552093506),
 ('thrills', 0.6181213855743408),
 ('exciting', 0.6074783802032471),
 ('slapstick', 0.5946393013000488),
 ('atmosphere', 0.5907326936721802),
 ('animation', 0.5855820775032043),
 ('slasher', 0.5830831527709961),
 ('excitement', 0.5784242153167725)]

In [15]:
model.wv.most_similar("movie")

[('film', 0.8614335656166077),
 ('flick', 0.7448533177375793),
 ('documentary', 0.6746926307678223),
 ('picture', 0.6668728590011597),
 ('show', 0.6489842534065247),
 ('sequel', 0.6386774182319641),
 ('series', 0.6123355627059937),
 ('episode', 0.5932444930076599),
 ('program', 0.5816488265991211),
 ('garbage', 0.5613850355148315)]

## Step 5: Adapt

Now it's time to dive into the gensim documentation (online or via `?` / tab completion) to figure out the options you have, e.g., skip-gram vs. CBOW, the number of dimensions, etc.
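
As a starting point, here is a sketch of two settings you could compare. The values are hypothetical and untuned; the keyword names `sg`, `window`, `min_count` and `workers` are `Word2Vec` parameters, where `sg=0` selects CBOW and `sg=1` selects skip-gram. The actual model call is commented out so that the snippet runs without gensim.

```python
# Hypothetical, untuned settings for illustration
cbow_params = dict(sg=0, window=5, min_count=5, workers=4)
skipgram_params = dict(sg=1, window=10, min_count=2, workers=4)

# Which settings differ between the two configurations?
changed = sorted(k for k in cbow_params if cbow_params[k] != skipgram_params[k])
print(changed)  # ['min_count', 'sg', 'window']

# Usage with gensim 3.x, as in this notebook (gensim >= 4.0 uses vector_size instead of size):
# model = gensim.models.Word2Vec(size=300, **skipgram_params)
```

Changing one setting at a time and re-running the `most_similar` queries from Step 4 makes it easier to see what each option does.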

In [16]:
gensim.models.Word2Vec?