# Exercise Monday

Train your own model

In [1]:
from nltk.tokenize import sent_tokenize
import string
import re

# tqdm allows you to display progress bars in loops
from tqdm import tqdm
from datetime import datetime

# you need to have courseutils.py in the same folder
from courseutils import get_review_data

import gensim

# let's get more output
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)



Note that there are some slight syntax changes between gensim 3 and gensim 4 (for example, the `size` parameter is now called `vector_size`); this notebook is written for gensim 4.

In [2]:
gensim.__version__

'4.0.1'

## Step 1: Get a lot of texts

I'll just take the movie reviews here, but you are *very much encouraged* to take your own data. Use any method to get them into a long list (or similar).

In [3]:
train, test, _, _ = get_review_data()

Using cached file reviewdata.pickle.bz2


In [4]:
# we just need one list
print(f"The original dataset has two sets of reviews of length {len(train)} and {len(test)}")
train.extend(test)
del test
print(f"We merged them into one list of {len(train)} reviews")

The original dataset has two sets of reviews of length 25000 and 25000
We merged them into one list of 50000 reviews


In [5]:
train[:3]

['I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, e

## Step 2: Reformat
We want to train on sentences, not on whole reviews. We don't need a list of reviews, but a list of sentences.

Also, **we only want unique sentences**. It has been shown that this improves the resulting models (and it speeds up training, of course).

There are different ways of achieving this; here is one. Some remarks:

- tqdm displays a progress bar - it's not strictly necessary
- a set is like an unordered list in which every item is guaranteed to be unique. You could also use a list (`uniquesentences = []` and `.append()` instead of `.add()`), but then you have to check for duplicates yourself, and membership checks on a set are much faster.
- we also remove punctuation
- whether we lowercase here depends on whether the texts we later want to apply the model to are lowercased. That's a decision to make; in this notebook, we lowercase.

In [6]:
trans = str.maketrans('', '', string.punctuation) # translation scheme for removing punctuation
uniquesentences = set()
for review in tqdm(train):
    for sentence in sent_tokenize(review):
        # remove HTML tags in there
        sentence = re.sub(r"<.*?>"," ",sentence)
        sentence = sentence.translate(trans).lower()
        # a set silently ignores duplicates, so no membership check is needed
        uniquesentences.add(sentence)

100%|██████████| 50000/50000 [00:33<00:00, 1471.57it/s]


In [7]:
print(f"We now have {len(uniquesentences)} unique sentences.")

We now have 516343 unique sentences.


In [8]:
# if we want to, we can turn the set into a list and inspect it, e.g. like this:
# list(uniquesentences)[:10]

**Note that `uniquesentences` can also be a generator that reads from disk (or from elsewhere) for the next step. Hence, it is possible to train models on more sentences than fit in your memory!**
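One way to stream sentences from disk is a small class whose `__iter__` re-opens the file on every pass. This is only a minimal sketch, assuming a hypothetical plain-text file with one preprocessed sentence per line; the class name `SentenceStream` and the file name are made up for illustration and are not part of gensim or `courseutils`.

```python
import os
import tempfile


class SentenceStream:
    """Re-iterable corpus: yields one tokenized sentence per line of a text file.

    Hypothetical helper for illustration, not part of gensim or courseutils.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass, so the object can be iterated many
        # times while only one line is held in memory at a time.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                tokens = line.split()
                if tokens:  # skip empty lines
                    yield tokens


# Tiny demo file (assumed format: one lowercased, punctuation-free sentence per line)
demo_path = os.path.join(tempfile.mkdtemp(), "sentences.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("this movie was great\n\ni did not like the ending\n")

stream = SentenceStream(demo_path)
first_pass = list(stream)
second_pass = list(stream)  # works again, unlike a plain generator
print(first_pass)
```

Because the object can be iterated more than once, something like `stream` could be used both for building the vocabulary and for training, including training with several epochs.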

## Step 3: Train the model

That's really straightforward in gensim.

In [9]:
# we do not need a list of lists of tokens later on, so let's use a generator instead of a list to save memory
# note that we use round parentheses instead of square brackets to achieve this
# we do need two generators, though, as we first need to build the vocabulary and later need to train.
# If we used a list, we would obviously need only one.
tokenizedsentences = (sentence.split() for sentence in uniquesentences)
tokenizedsentences2 = (sentence.split() for sentence in uniquesentences)
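To see why two generators are needed, here is a small sketch with a toy set of sentences (the example sentences are made up): a generator expression is exhausted after a single pass, whereas a list can be traversed again.

```python
toy_sentences = {"a great film", "a boring film"}

gen = (s.split() for s in toy_sentences)
first = list(gen)   # consumes the generator
second = list(gen)  # nothing left to yield

as_list = [s.split() for s in toy_sentences]
again = list(as_list)  # a list can be read as often as we like

print(len(first), len(second), len(as_list), len(again))  # prints: 2 0 2 2
```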

In [10]:
print(f"Started setting up the model at {datetime.now()}")
model = gensim.models.Word2Vec(vector_size=300) # we want 300 dimensions
model.build_vocab(tokenizedsentences)
print(f"Started training at {datetime.now()}")
model.train(tokenizedsentences2, total_examples=model.corpus_count,  epochs=1)
# our model gets better if we use more epochs, but we can only do so if we use a list instead of a generator as input
# after all, you can only pass over a generator once.
# model.train(tokenizedsentences2, total_examples=model.corpus_count,  epochs=model.epochs)
print(f"Finished training at {datetime.now()}")

2021-04-12 15:13:23,112 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec(vocab=0, vector_size=300, alpha=0.025)', 'datetime': '2021-04-12T15:13:23.068712', 'gensim': '4.0.1', 'python': '3.8.5 (default, Jan 27 2021, 15:41:15) \n[GCC 9.3.0]', 'platform': 'Linux-5.4.0-70-generic-x86_64-with-glibc2.29', 'event': 'created'}
2021-04-12 15:13:23,113 : INFO : collecting all words and their counts
2021-04-12 15:13:23,116 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types


Started setting up the model at 2021-04-12 15:13:23.067450


2021-04-12 15:13:23,235 : INFO : PROGRESS: at sentence #10000, processed 218529 words, keeping 20717 word types
2021-04-12 15:13:23,357 : INFO : PROGRESS: at sentence #20000, processed 435537 words, keeping 30405 word types
2021-04-12 15:13:23,505 : INFO : PROGRESS: at sentence #30000, processed 653899 words, keeping 37751 word types
2021-04-12 15:13:23,627 : INFO : PROGRESS: at sentence #40000, processed 875468 words, keeping 43880 word types
2021-04-12 15:13:23,739 : INFO : PROGRESS: at sentence #50000, processed 1094762 words, keeping 49271 word types
2021-04-12 15:13:23,855 : INFO : PROGRESS: at sentence #60000, processed 1315483 words, keeping 54105 word types
2021-04-12 15:13:23,984 : INFO : PROGRESS: at sentence #70000, processed 1534805 words, keeping 58617 word types
2021-04-12 15:13:24,096 : INFO : PROGRESS: at sentence #80000, processed 1749939 words, keeping 62565 word types
2021-04-12 15:13:24,217 : INFO : PROGRESS: at sentence #90000, processed 1965987 words, keeping 6643

Started training at 2021-04-12 15:13:30.830823


2021-04-12 15:13:31,842 : INFO : EPOCH 1 - PROGRESS: at 3.28% examples, 273239 words/s, in_qsize 5, out_qsize 0
2021-04-12 15:13:32,868 : INFO : EPOCH 1 - PROGRESS: at 7.25% examples, 299411 words/s, in_qsize 6, out_qsize 0
2021-04-12 15:13:33,873 : INFO : EPOCH 1 - PROGRESS: at 10.86% examples, 300228 words/s, in_qsize 6, out_qsize 0
2021-04-12 15:13:34,901 : INFO : EPOCH 1 - PROGRESS: at 14.85% examples, 306453 words/s, in_qsize 5, out_qsize 0
2021-04-12 15:13:35,948 : INFO : EPOCH 1 - PROGRESS: at 19.30% examples, 316077 words/s, in_qsize 6, out_qsize 1
2021-04-12 15:13:36,970 : INFO : EPOCH 1 - PROGRESS: at 23.05% examples, 314256 words/s, in_qsize 6, out_qsize 0
2021-04-12 15:13:37,974 : INFO : EPOCH 1 - PROGRESS: at 26.67% examples, 312540 words/s, in_qsize 6, out_qsize 0
2021-04-12 15:13:38,998 : INFO : EPOCH 1 - PROGRESS: at 30.27% examples, 310560 words/s, in_qsize 5, out_qsize 0
2021-04-12 15:13:40,008 : INFO : EPOCH 1 - PROGRESS: at 33.99% examples, 310254 words/s, in_qsize 

Finished training at 2021-04-12 15:13:59.920354


In [11]:
gensim.models.Word2Vec?

In [12]:
model.save("mymodel")

2021-04-12 15:14:00,041 : INFO : Word2Vec lifecycle event {'fname_or_handle': 'mymodel', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-04-12T15:14:00.041830', 'gensim': '4.0.1', 'python': '3.8.5 (default, Jan 27 2021, 15:41:15) \n[GCC 9.3.0]', 'platform': 'Linux-5.4.0-70-generic-x86_64-with-glibc2.29', 'event': 'saving'}
2021-04-12 15:14:00,044 : INFO : storing np array 'vectors' to mymodel.wv.vectors.npy
2021-04-12 15:14:00,281 : INFO : storing np array 'syn1neg' to mymodel.syn1neg.npy
2021-04-12 15:14:00,583 : INFO : not storing attribute cum_table
2021-04-12 15:14:00,630 : INFO : saved mymodel


In [13]:
# and load it again, just to check
mymodel = gensim.models.Word2Vec.load("mymodel")

2021-04-12 15:14:00,639 : INFO : loading Word2Vec object from mymodel
2021-04-12 15:14:00,661 : INFO : loading wv recursively from mymodel.wv.* with mmap=None
2021-04-12 15:14:00,662 : INFO : loading vectors from mymodel.wv.vectors.npy with mmap=None
2021-04-12 15:14:00,782 : INFO : loading syn1neg from mymodel.syn1neg.npy with mmap=None
2021-04-12 15:14:00,977 : INFO : setting ignored attribute cum_table to None
2021-04-12 15:14:02,067 : INFO : Word2Vec lifecycle event {'fname': 'mymodel', 'datetime': '2021-04-12T15:14:02.067166', 'gensim': '4.0.1', 'python': '3.8.5 (default, Jan 27 2021, 15:41:15) \n[GCC 9.3.0]', 'platform': 'Linux-5.4.0-70-generic-x86_64-with-glibc2.29', 'event': 'loaded'}


## Step 4: Play with the model

In [14]:
animals = ['cat', 'dog', 'horse', 'goldfish', 'lion']
for animal in animals:
    try:
        print(f"A {animal} is almost the same as a {model.wv.most_similar(animal)[0][0]}.")
    except Exception as e:
        print(e)

A cat is almost the same as a horse.
A dog is almost the same as a baby.
A horse is almost the same as a cat.
A goldfish is almost the same as a sugar.
A lion is almost the same as a rueda.


In [15]:
words = ['director', 'actor', 'bad', 'good']
for word in words:
    try:
        print(f"A {word} is almost the same as a {model.wv.most_similar(word)[0][0]}.")
    except Exception as e:
        print(e)

A director is almost the same as a writer.
A actor is almost the same as a actress.
A bad is almost the same as a good.
A good is almost the same as a bad.


In [16]:
model.wv.most_similar("action")

[('suspense', 0.7984694242477417),
 ('gore', 0.7710525393486023),
 ('cheap', 0.7398205995559692),
 ('humor', 0.7144436240196228),
 ('mood', 0.7109081149101257),
 ('atmosphere', 0.7079086899757385),
 ('cheesy', 0.7015352845191956),
 ('violence', 0.7014023065567017),
 ('slasher', 0.698836088180542),
 ('music', 0.6971469521522522)]

In [17]:
model.wv.most_similar("movie")

[('film', 0.9301337599754333),
 ('show', 0.7778735756874084),
 ('flick', 0.7578645944595337),
 ('documentary', 0.7446282505989075),
 ('picture', 0.7295028567314148),
 ('sequel', 0.7083451747894287),
 ('series', 0.7011323571205139),
 ('episode', 0.6649038195610046),
 ('it', 0.6541301608085632),
 ('movies', 0.6459987163543701)]
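The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of what that number means, the following computes cosine similarity by hand for made-up three-dimensional vectors. The vectors and the helper names `cosine_similarity` and `rank_by_similarity` are invented for illustration; the real vectors of this model have 300 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only
toy_vectors = {
    "movie": [1.0, 2.0, 0.0],
    "film": [2.0, 4.0, 0.0],   # points in the same direction as "movie"
    "sugar": [0.0, 0.0, 3.0],  # orthogonal to "movie"
}


def rank_by_similarity(word, vectors):
    """Rank all other words by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_similarity("movie", toy_vectors)
print(ranking)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0. That is why "film" ranks above "sugar" in this toy example.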

## Step 5: Adapt

Now it's time to dive into the gensim documentation (online or via `?` / tab completion) to figure out the options you have, e.g., skip-gram vs. CBOW, number of dimensions, etc.

In [18]:
gensim.models.Word2Vec?
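To make the skip-gram vs. CBOW choice a bit more concrete: CBOW predicts a word from its surrounding context, while skip-gram predicts each context word from the center word. In gensim, this is selected with the `sg` parameter of `Word2Vec` (`sg=0` for CBOW, which is the default; `sg=1` for skip-gram). The sketch below does not use gensim. It only enumerates the training examples each scheme would derive from one toy sentence, using hypothetical helper names and a fixed window; real implementations add further details such as subsampling and negative sampling, which are omitted here.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word predicts each neighbour."""
    return [
        (center, ctx)
        for i, center in enumerate(tokens)
        for ctx in context_window(tokens, i, window)
    ]


def cbow_examples(tokens, window=2):
    """(context_words, center) examples: the neighbours jointly predict the center."""
    return [
        (context_window(tokens, i, window), center)
        for i, center in enumerate(tokens)
    ]


sentence = "the plot is thin".split()
sg_pairs = skipgram_pairs(sentence, window=1)
cbow = cbow_examples(sentence, window=1)
print(sg_pairs)
print(cbow)
```

Skip-gram produces one training pair per (center, neighbour) combination, whereas CBOW produces one example per position.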