# Learning word embeddings with `Gensim`

In this part, we will learn word embeddings with Gensim, a very popular toolkit for Natural Language Processing (NLP). We will build two kinds of word embeddings: `word2vec` and `fastText`.

First, let's import some useful packages:

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import nltk, re
from keras.preprocessing.text import Tokenizer
from datetime import datetime
from gensim.models import *
import logging
from rnn_utils import *
%matplotlib inline

Using TensorFlow backend.


## Data preparation

We will use the same dataset from Kaggle, "Bag of Words Meets Bags of Popcorn". We will work with both the training set and the unlabeled training set. Because we only want to learn word embeddings, not build a classifier, we don't need labels, which makes the unlabeled data useful too.

Now, let's load them using the `pandas` package.

In [2]:
train = pd.read_csv("./data/labeledTrainData.tsv", delimiter="\t", header = 0, quoting=3)
unlabeled = pd.read_csv("./data/unlabeledTrainData.tsv", delimiter="\t", header = 0, quoting=3)

There are 25,000 reviews in the training set and 50,000 reviews in the unlabeled dataset.

In [3]:
print("Train set: {}".format(train.shape))
print("Unlabeled set: {}".format(unlabeled.shape))

Train set: (25000, 3)
Unlabeled set: (50000, 2)


Let's take a quick look at one of the unlabeled reviews. We can see that there are many HTML tags in it, so our task is to clean up the reviews by removing unwanted characters such as HTML tags and special characters `(\, /, ?, !, etc.)`.

In [4]:
unlabeled.review[1]

'"I saw this film about 20 years ago and remember it as being particularly nasty. I believe it is based on a true incident: a young man breaks into a nurses\' home and rapes, tortures and kills various women.<br /><br />It is in black and white but saves the colour for one shocking shot.<br /><br />At the end the film seems to be trying to make some political statement but it just comes across as confused and obscene.<br /><br />Avoid."'

Function <span style="color:blue; font-family:Courier">convert_to_sentences</span> is defined to clean a review and split it into sentences. The steps are as follows:
1. Convert the review to lowercase and replace HTML `<br />` tags and the character '-' with a space. Also add a space after every '.' to help the tokenizer split sentences correctly; otherwise, sentences joined by a '.' with no space after it would be treated as one sentence. Then replace double spaces with a single space.
2. Split the review into sentences using the `punkt` tokenizer from NLTK, since one review may contain many sentences.
3. Apply the `special_characters` filter to each sentence to remove everything except letters, digits and spaces, then split the sentence into words.
4. Finally, return a list of sentences, each of which is a list of words.


In [5]:
special_characters = re.compile("[^A-Za-z0-9 ]")
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
def convert_to_sentences(data, tokenizer):
    # First, clean the raw text: lowercase it, replace HTML line breaks and
    # hyphens with spaces, and make sure every period is followed by a space
    data = data.lower().replace("<br />", " ")
    data = data.replace("-", " ")
    data = data.replace(".", ". ")
    data = re.sub("  ", " ", data)
    # Then split the review into sentences with the NLTK punkt tokenizer (english.pickle)
    all_sentences = tokenizer.tokenize(data.strip())
    
    # Second, converting each sentence into words
    sentences = []
    for words in all_sentences:
        s = re.sub(special_characters, "", words.lower())
        if (len(s)) > 0:
            sentences.append(s.split())
    
    # Finally, returning a list of sentences (containing words in each sentence)
    return sentences
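
To see why the string cleanup matters before sentence splitting, here is a minimal standalone sketch of just that part. The helper name `normalize_review` and the sample text are made up for illustration; the replacements mirror the ones in the function above, and the NLTK tokenizer is left out so the snippet runs with the standard library only.

```python
import re

def normalize_review(text):
    """Apply the string cleanup that runs before sentence splitting (illustrative helper)."""
    text = text.lower().replace("<br />", " ")  # drop HTML line breaks
    text = text.replace("-", " ")               # split hyphenated words
    text = text.replace(".", ". ")              # make sure a space follows every period
    text = re.sub("  ", " ", text)              # collapse double spaces into one
    return text.strip()

raw = "Cool title.I liked it<br /><br />Well-made."  # made-up sample review
print(normalize_review(raw))
# prints: cool title. i liked it well made.
```

After this step every period is followed by a space, so a sentence tokenizer has a clear boundary to split on.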

Below is a sample raw review:

In [6]:
print("Review: \n{}".format(unlabeled.review[90]))

Review: 
"Steven Seagal returns in Black Dawn.Cool title.I cannot say the same for the movie though.It ranks as the Worst Seagal movie yet.Yes.It is worse than Out for a Kill.Did we really need a sequel to the Forigner that bad?<br /><br />Steven returns as Jonathon Cold.Assassin for hire.This time he is out to stop a group of Terrorists and prevent Nuclier Bomb from hitting Los Angelas with the help of CIA agent Tamara Davies.<br /><br />This idea seems okay.The Jonathon Cold character once again gives the big guy a chance to tarnish his clean-cut cinematic Image.You get the feeling with a decent director and production team this movie could have went somewhere.If this cost 15 million I wanna know where it was spent?It was not spent on the movie?!! <br /><br />The action scenes are not bad.The movie was obviously shot like a Doug Liman flick.Its just that most of the Time Seagal is doubled so much you cannot take the movie seriously whatsoever.The script seems...what script?!Logic is 

Applying the function <span style="color:blue; font-family:Courier">convert_to_sentences</span> to this review, we get 31 sentences in total.

In [7]:
sentences_sample = convert_to_sentences(unlabeled.review[90], tokenizer)
print("Length of formatted sentences: {}".format(len(sentences_sample)))

Length of formatted sentences: 31


The sentences are as follows:

In [8]:
for i in sentences_sample:
    print(i)

['steven', 'seagal', 'returns', 'in', 'black', 'dawn']
['cool', 'title']
['i', 'cannot', 'say', 'the', 'same', 'for', 'the', 'movie', 'though']
['it', 'ranks', 'as', 'the', 'worst', 'seagal', 'movie', 'yet']
['yes']
['it', 'is', 'worse', 'than', 'out', 'for', 'a', 'kill']
['did', 'we', 'really', 'need', 'a', 'sequel', 'to', 'the', 'forigner', 'that', 'bad']
['steven', 'returns', 'as', 'jonathon', 'cold']
['assassin', 'for', 'hire']
['this', 'time', 'he', 'is', 'out', 'to', 'stop', 'a', 'group', 'of', 'terrorists', 'and', 'prevent', 'nuclier', 'bomb', 'from', 'hitting', 'los', 'angelas', 'with', 'the', 'help', 'of', 'cia', 'agent', 'tamara', 'davies']
['this', 'idea', 'seems', 'okay']
['the', 'jonathon', 'cold', 'character', 'once', 'again', 'gives', 'the', 'big', 'guy', 'a', 'chance', 'to', 'tarnish', 'his', 'clean', 'cut', 'cinematic', 'image']
['you', 'get', 'the', 'feeling', 'with', 'a', 'decent', 'director', 'and', 'production', 'team', 'this', 'movie', 'could', 'have', 'went', 'so

Now, let's do the same for all of our labeled and unlabeled reviews, collecting the sentences from both sets into a single list. This may take several minutes because of the large number of reviews to clean up (75,000).

In [9]:
sentences = []
for review in train.review:
    sentences += convert_to_sentences(review, tokenizer)
print("Done for train set.")

for review in unlabeled.review:
    sentences += convert_to_sentences(review, tokenizer)
print("Done for unlabeled set.")

Done for train set.
Done for unlabeled set.


Let's check the first 5 sentences:

In [10]:
for i in sentences[0:5]:
    print("{}\n".format(i))

['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'mj', 'ive', 'started', 'listening', 'to', 'his', 'music', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', 'watched', 'the', 'wiz', 'and', 'watched', 'moonwalker', 'again']

['maybe', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into', 'this', 'guy', 'who', 'i', 'thought', 'was', 'really', 'cool', 'in', 'the', 'eighties', 'just', 'to', 'maybe', 'make', 'up', 'my', 'mind', 'whether', 'he', 'is', 'guilty', 'or', 'innocent']

['moonwalker', 'is', 'part', 'biography', 'part', 'feature', 'film', 'which', 'i', 'remember', 'going', 'to', 'see', 'at', 'the', 'cinema', 'when', 'it', 'was', 'originally', 'released']

['some', 'of', 'it', 'has', 'subtle', 'messages', 'about', 'mjs', 'feeling', 'towards', 'the', 'press', 'and', 'also', 'the', 'obvious', 'message', 'of', 'drugs', 'are', 'bad', 'mkay']

['visually', 'impressive', 'but', 'of', 'course', 'this', 'is', 'all', 'about', 'mic

## `word2vec` embedding

Before training the embedding, let's define the model's hyperparameters.

- `num_feature`: The dimension of the word vectors. More dimensions can give a better representation, but they take more time and more data to learn. Since we don't have that much data, let's set this to 50 (the default is 100).
- `min_word_count`: Words that appear fewer times than this number are ignored during training (the default is 5).
- `window_size`: The maximum distance between the current word and the predicted word within a sentence; in other words, how many words to the left and right of a given word are considered (the default is 5).
- `down_sampling`: The threshold for configuring which higher-frequency words are randomly downsampled; the useful range is (0, 1e-5).
- `num_thread`: Number of worker threads used for training.
- `iteration`: Number of iterations (epochs) over the corpus (the default is 5). In practice, more iterations often improve the representations.
- Training algorithm: we choose between the CBOW and Skip-gram models.
        
    - CBOW: faster to train, with slightly better accuracy for frequent words.
    - Skip-gram: works well with small datasets and represents rare words and phrases well.
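
To make the window and the two training algorithms more concrete, here is a small sketch of the training examples each one would see for a toy sentence. The function names are hypothetical, and the sketch uses a fixed window; as far as we know, real implementations also shrink the window at random for each position, which is ignored here.

```python
def context_window(tokens, position, window):
    """Return the words within `window` positions of tokens[position]."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right

def skipgram_pairs(tokens, window):
    """Skip-gram: one (center, context_word) example per context word."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]

def cbow_examples(tokens, window):
    """CBOW: one (context_words, center) example per position."""
    return [(context_window(tokens, i, window), center)
            for i, center in enumerate(tokens)]

sentence = ["cool", "title", "indeed"]  # toy sentence
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

Skip-gram predicts each context word from the center word, while CBOW predicts the center word from all of its context words at once, so CBOW produces fewer training examples per sentence.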
    
In this case, we will pick CBOW, which is the default.
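
Going back to the `down_sampling` threshold, the sketch below shows how such a threshold can translate into a probability of keeping each occurrence of a word. The formula is the one we believe the original word2vec C code uses; treat it as an assumption rather than a description of Gensim's internals. The corpus size and word counts are made up.

```python
import math

def keep_probability(word_count, total_words, sample=0.001):
    """Probability of keeping one occurrence of a word during subsampling.

    Assumed formula: p = (sqrt(f / sample) + 1) * (sample / f), capped at 1,
    where f is the word's share of all tokens in the corpus.
    """
    frequency = word_count / total_words
    p = (math.sqrt(frequency / sample) + 1) * (sample / frequency)
    return min(1.0, p)

total = 1_000_000  # hypothetical corpus size in tokens
for word, count in [("the", 50_000), ("movie", 5_000), ("seagal", 100)]:
    print(word, round(keep_probability(count, total), 3))
```

Very frequent words are dropped most of the time, while rare words are always kept. This kind of subsampling, together with `min_word_count`, is one reason a training log can report fewer effective words than raw words.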

### Training word embeddings

In [21]:
num_feature = 50
min_word_count = 20
num_thread = 5
window_size = 10
down_sampling = 0.001
iteration = 20

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = word2vec.Word2Vec(sentences, 
                          iter = iteration,
                          size=num_feature, 
                          min_count = min_word_count, 
                          window = window_size, 
                          sample = down_sampling, 
                          workers=num_thread)

2018-05-18 00:06:24,082 : INFO : collecting all words and their counts
2018-05-18 00:06:24,087 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-05-18 00:06:24,601 : INFO : PROGRESS: at sentence #10000, processed 185521 words, keeping 17175 word types
2018-05-18 00:06:24,846 : INFO : PROGRESS: at sentence #20000, processed 367101 words, keeping 24423 word types
2018-05-18 00:06:25,091 : INFO : PROGRESS: at sentence #30000, processed 551281 words, keeping 29689 word types
2018-05-18 00:06:25,402 : INFO : PROGRESS: at sentence #40000, processed 733293 words, keeping 34309 word types
2018-05-18 00:06:25,684 : INFO : PROGRESS: at sentence #50000, processed 917534 words, keeping 38241 word types
2018-05-18 00:06:26,026 : INFO : PROGRESS: at sentence #60000, processed 1097938 words, keeping 41448 word types
2018-05-18 00:06:26,562 : INFO : PROGRESS: at sentence #70000, processed 1281359 words, keeping 44467 word types
2018-05-18 00:06:26,812 : INFO : PROGRESS: a

2018-05-18 00:06:56,539 : INFO : PROGRESS: at sentence #710000, processed 12919593 words, keeping 135944 word types
2018-05-18 00:06:57,379 : INFO : PROGRESS: at sentence #720000, processed 13103676 words, keeping 136863 word types
2018-05-18 00:06:57,888 : INFO : PROGRESS: at sentence #730000, processed 13283040 words, keeping 137837 word types
2018-05-18 00:06:58,486 : INFO : PROGRESS: at sentence #740000, processed 13468256 words, keeping 138924 word types
2018-05-18 00:06:58,922 : INFO : PROGRESS: at sentence #750000, processed 13650003 words, keeping 139879 word types
2018-05-18 00:06:59,443 : INFO : PROGRESS: at sentence #760000, processed 13830700 words, keeping 140743 word types
2018-05-18 00:06:59,857 : INFO : PROGRESS: at sentence #770000, processed 14012331 words, keeping 141698 word types
2018-05-18 00:07:00,418 : INFO : PROGRESS: at sentence #780000, processed 14195449 words, keeping 142712 word types
2018-05-18 00:07:01,217 : INFO : PROGRESS: at sentence #790000, processe

2018-05-18 00:07:56,013 : INFO : EPOCH 1 - PROGRESS: at 98.42% examples, 323289 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:07:56,823 : INFO : worker thread finished; awaiting finish of 4 more threads
2018-05-18 00:07:56,827 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-05-18 00:07:56,867 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-05-18 00:07:56,908 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-05-18 00:07:56,912 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-05-18 00:07:56,916 : INFO : EPOCH - 1 : training on 17503976 raw words (12867740 effective words) took 40.1s, 321072 effective words/s
2018-05-18 00:07:57,968 : INFO : EPOCH 2 - PROGRESS: at 2.04% examples, 258036 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:07:59,039 : INFO : EPOCH 2 - PROGRESS: at 4.08% examples, 252392 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:08:00,040 : INFO : EPOCH 2 - PROGRESS: at 6.47% examp

2018-05-18 00:09:00,584 : INFO : EPOCH 3 - PROGRESS: at 76.43% examples, 422423 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:09:01,600 : INFO : EPOCH 3 - PROGRESS: at 80.31% examples, 425311 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:09:02,619 : INFO : EPOCH 3 - PROGRESS: at 84.41% examples, 429068 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:09:03,640 : INFO : EPOCH 3 - PROGRESS: at 87.48% examples, 427473 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:09:04,644 : INFO : EPOCH 3 - PROGRESS: at 91.69% examples, 431650 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:09:05,655 : INFO : EPOCH 3 - PROGRESS: at 95.95% examples, 435426 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:09:06,672 : INFO : EPOCH 3 - PROGRESS: at 99.43% examples, 435833 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:09:06,737 : INFO : worker thread finished; awaiting finish of 4 more threads
2018-05-18 00:09:06,740 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-05-18 00:09:06,75

2018-05-18 00:10:04,275 : INFO : EPOCH 5 - PROGRESS: at 67.48% examples, 339250 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:10:05,310 : INFO : EPOCH 5 - PROGRESS: at 69.30% examples, 334893 words/s, in_qsize 10, out_qsize 1
2018-05-18 00:10:06,320 : INFO : EPOCH 5 - PROGRESS: at 70.84% examples, 329827 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:10:07,327 : INFO : EPOCH 5 - PROGRESS: at 73.00% examples, 327979 words/s, in_qsize 10, out_qsize 2
2018-05-18 00:10:08,352 : INFO : EPOCH 5 - PROGRESS: at 75.39% examples, 327054 words/s, in_qsize 8, out_qsize 1
2018-05-18 00:10:09,358 : INFO : EPOCH 5 - PROGRESS: at 77.73% examples, 326151 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:10:10,368 : INFO : EPOCH 5 - PROGRESS: at 81.22% examples, 329896 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:10:11,370 : INFO : EPOCH 5 - PROGRESS: at 83.89% examples, 330339 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:10:12,375 : INFO : EPOCH 5 - PROGRESS: at 86.52% examples, 330510 words/s, in_

2018-05-18 00:11:15,158 : INFO : EPOCH 6 - PROGRESS: at 99.21% examples, 219782 words/s, in_qsize 10, out_qsize 2
2018-05-18 00:11:15,445 : INFO : worker thread finished; awaiting finish of 4 more threads
2018-05-18 00:11:15,520 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-05-18 00:11:15,530 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-05-18 00:11:15,538 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-05-18 00:11:15,561 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-05-18 00:11:15,566 : INFO : EPOCH - 6 : training on 17503976 raw words (12870332 effective words) took 58.5s, 219967 effective words/s
2018-05-18 00:11:16,639 : INFO : EPOCH 7 - PROGRESS: at 2.20% examples, 283066 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:11:17,673 : INFO : EPOCH 7 - PROGRESS: at 4.59% examples, 290804 words/s, in_qsize 10, out_qsize 1
2018-05-18 00:11:18,759 : INFO : EPOCH 7 - PROGRESS: at 6.97% exam

2018-05-18 00:12:22,251 : INFO : EPOCH 8 - PROGRESS: at 10.92% examples, 194437 words/s, in_qsize 8, out_qsize 2
2018-05-18 00:12:23,361 : INFO : EPOCH 8 - PROGRESS: at 12.64% examples, 194985 words/s, in_qsize 8, out_qsize 1
2018-05-18 00:12:24,481 : INFO : EPOCH 8 - PROGRESS: at 14.26% examples, 193627 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:12:25,531 : INFO : EPOCH 8 - PROGRESS: at 16.03% examples, 195935 words/s, in_qsize 8, out_qsize 1
2018-05-18 00:12:26,547 : INFO : EPOCH 8 - PROGRESS: at 17.62% examples, 196509 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:12:27,598 : INFO : EPOCH 8 - PROGRESS: at 18.58% examples, 190001 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:12:28,685 : INFO : EPOCH 8 - PROGRESS: at 19.97% examples, 187773 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:12:29,737 : INFO : EPOCH 8 - PROGRESS: at 21.82% examples, 190355 words/s, in_qsize 9, out_qsize 1
2018-05-18 00:12:30,750 : INFO : EPOCH 8 - PROGRESS: at 23.70% examples, 193522 words/s, in_qsi

2018-05-18 00:13:33,860 : INFO : EPOCH 9 - PROGRESS: at 43.71% examples, 269521 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:13:34,970 : INFO : EPOCH 9 - PROGRESS: at 45.16% examples, 264253 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:13:36,073 : INFO : EPOCH 9 - PROGRESS: at 47.01% examples, 262127 words/s, in_qsize 9, out_qsize 1
2018-05-18 00:13:37,086 : INFO : EPOCH 9 - PROGRESS: at 48.61% examples, 259647 words/s, in_qsize 10, out_qsize 1
2018-05-18 00:13:38,159 : INFO : EPOCH 9 - PROGRESS: at 50.37% examples, 257613 words/s, in_qsize 6, out_qsize 3
2018-05-18 00:13:39,216 : INFO : EPOCH 9 - PROGRESS: at 52.32% examples, 256754 words/s, in_qsize 8, out_qsize 1
2018-05-18 00:13:40,228 : INFO : EPOCH 9 - PROGRESS: at 54.42% examples, 257191 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:13:41,243 : INFO : EPOCH 9 - PROGRESS: at 56.52% examples, 257566 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:13:42,259 : INFO : EPOCH 9 - PROGRESS: at 58.09% examples, 255648 words/s, in_q

2018-05-18 00:14:43,773 : INFO : EPOCH 10 - PROGRESS: at 77.96% examples, 255129 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:14:44,804 : INFO : EPOCH 10 - PROGRESS: at 79.39% examples, 253159 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:14:45,877 : INFO : EPOCH 10 - PROGRESS: at 81.32% examples, 252629 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:14:46,877 : INFO : EPOCH 10 - PROGRESS: at 83.44% examples, 253073 words/s, in_qsize 8, out_qsize 1
2018-05-18 00:14:47,913 : INFO : EPOCH 10 - PROGRESS: at 85.66% examples, 253641 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:14:48,919 : INFO : EPOCH 10 - PROGRESS: at 87.99% examples, 254662 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:14:49,992 : INFO : EPOCH 10 - PROGRESS: at 90.03% examples, 254468 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:14:51,091 : INFO : EPOCH 10 - PROGRESS: at 91.75% examples, 253193 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:14:52,125 : INFO : EPOCH 10 - PROGRESS: at 94.73% examples, 255712 wor

2018-05-18 00:15:49,111 : INFO : EPOCH 12 - PROGRESS: at 14.88% examples, 268476 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:15:50,134 : INFO : EPOCH 12 - PROGRESS: at 16.99% examples, 268101 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:15:51,141 : INFO : EPOCH 12 - PROGRESS: at 19.23% examples, 269907 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:15:52,166 : INFO : EPOCH 12 - PROGRESS: at 21.76% examples, 274492 words/s, in_qsize 9, out_qsize 1
2018-05-18 00:15:53,200 : INFO : EPOCH 12 - PROGRESS: at 24.10% examples, 276057 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:15:54,308 : INFO : EPOCH 12 - PROGRESS: at 26.28% examples, 273851 words/s, in_qsize 7, out_qsize 2
2018-05-18 00:15:55,346 : INFO : EPOCH 12 - PROGRESS: at 28.57% examples, 274499 words/s, in_qsize 9, out_qsize 1
2018-05-18 00:15:56,373 : INFO : EPOCH 12 - PROGRESS: at 31.06% examples, 276874 words/s, in_qsize 10, out_qsize 1
2018-05-18 00:15:57,420 : INFO : EPOCH 12 - PROGRESS: at 33.66% examples, 279428 word

2018-05-18 00:16:58,931 : INFO : EPOCH 13 - PROGRESS: at 78.18% examples, 305620 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:16:59,948 : INFO : EPOCH 13 - PROGRESS: at 80.25% examples, 304253 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:17:00,973 : INFO : EPOCH 13 - PROGRESS: at 82.07% examples, 302051 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:17:01,986 : INFO : EPOCH 13 - PROGRESS: at 83.84% examples, 299889 words/s, in_qsize 10, out_qsize 1
2018-05-18 00:17:03,129 : INFO : EPOCH 13 - PROGRESS: at 85.72% examples, 297179 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:17:04,134 : INFO : EPOCH 13 - PROGRESS: at 87.60% examples, 295692 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:17:05,145 : INFO : EPOCH 13 - PROGRESS: at 90.39% examples, 297244 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:17:06,157 : INFO : EPOCH 13 - PROGRESS: at 93.02% examples, 298160 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:17:07,181 : INFO : EPOCH 13 - PROGRESS: at 95.56% examples, 298579 wor

2018-05-18 00:18:04,725 : INFO : EPOCH 15 - PROGRESS: at 29.35% examples, 371577 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:18:05,744 : INFO : EPOCH 15 - PROGRESS: at 31.98% examples, 367250 words/s, in_qsize 9, out_qsize 1
2018-05-18 00:18:06,744 : INFO : EPOCH 15 - PROGRESS: at 34.45% examples, 362991 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:18:07,745 : INFO : EPOCH 15 - PROGRESS: at 35.85% examples, 349338 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:18:08,784 : INFO : EPOCH 15 - PROGRESS: at 38.29% examples, 346021 words/s, in_qsize 8, out_qsize 1
2018-05-18 00:18:09,800 : INFO : EPOCH 15 - PROGRESS: at 40.13% examples, 338369 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:18:10,807 : INFO : EPOCH 15 - PROGRESS: at 42.13% examples, 333203 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:18:11,814 : INFO : EPOCH 15 - PROGRESS: at 44.70% examples, 332866 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:18:12,856 : INFO : EPOCH 15 - PROGRESS: at 46.62% examples, 327526 words/

2018-05-18 00:19:11,075 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-05-18 00:19:11,079 : INFO : EPOCH - 16 : training on 17503976 raw words (12866047 effective words) took 37.9s, 339895 effective words/s
2018-05-18 00:19:12,138 : INFO : EPOCH 17 - PROGRESS: at 3.40% examples, 429487 words/s, in_qsize 8, out_qsize 1
2018-05-18 00:19:13,225 : INFO : EPOCH 17 - PROGRESS: at 6.64% examples, 406442 words/s, in_qsize 8, out_qsize 1
2018-05-18 00:19:14,237 : INFO : EPOCH 17 - PROGRESS: at 9.55% examples, 394686 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:19:15,244 : INFO : EPOCH 17 - PROGRESS: at 12.64% examples, 394381 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:19:16,280 : INFO : EPOCH 17 - PROGRESS: at 16.03% examples, 399210 words/s, in_qsize 10, out_qsize 0
2018-05-18 00:19:17,295 : INFO : EPOCH 17 - PROGRESS: at 18.46% examples, 384641 words/s, in_qsize 10, out_qsize 1
2018-05-18 00:19:18,318 : INFO : EPOCH 17 - PROGRESS: at 20.25% examples, 361633 wo

2018-05-18 00:20:18,431 : INFO : EPOCH 18 - PROGRESS: at 94.73% examples, 373118 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:20:19,437 : INFO : EPOCH 18 - PROGRESS: at 98.26% examples, 375476 words/s, in_qsize 10, out_qsize 1
2018-05-18 00:20:19,844 : INFO : worker thread finished; awaiting finish of 4 more threads
2018-05-18 00:20:19,880 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-05-18 00:20:19,890 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-05-18 00:20:19,894 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-05-18 00:20:19,910 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-05-18 00:20:19,913 : INFO : EPOCH - 18 : training on 17503976 raw words (12868225 effective words) took 34.1s, 376836 effective words/s
2018-05-18 00:20:20,952 : INFO : EPOCH 19 - PROGRESS: at 3.17% examples, 403694 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:20:21,970 : INFO : EPOCH 19 - PROGRESS: at 6.41%

2018-05-18 00:21:22,572 : INFO : EPOCH 20 - PROGRESS: at 80.88% examples, 361818 words/s, in_qsize 7, out_qsize 2
2018-05-18 00:21:23,627 : INFO : EPOCH 20 - PROGRESS: at 84.64% examples, 365287 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:21:24,652 : INFO : EPOCH 20 - PROGRESS: at 87.43% examples, 364792 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:21:25,692 : INFO : EPOCH 20 - PROGRESS: at 90.62% examples, 365804 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:21:26,698 : INFO : EPOCH 20 - PROGRESS: at 93.65% examples, 366441 words/s, in_qsize 9, out_qsize 0
2018-05-18 00:21:27,720 : INFO : EPOCH 20 - PROGRESS: at 96.58% examples, 366431 words/s, in_qsize 9, out_qsize 2
2018-05-18 00:21:28,515 : INFO : worker thread finished; awaiting finish of 4 more threads
2018-05-18 00:21:28,524 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-05-18 00:21:28,541 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-05-18 00:21:28,562 : INFO : worker thr

Total training time for `word2vec` embeddings is 851.8s.
Let's save the model as a backup.

In [46]:
model.save("./data/gensim_word2vec_model_180518")

2018-05-18 13:55:07,623 : INFO : saving Word2Vec object under ./data/gensim_word2vec_model_180518, separately None
2018-05-18 13:55:07,634 : INFO : not storing attribute vectors_norm
2018-05-18 13:55:07,639 : INFO : not storing attribute cum_table
2018-05-18 13:55:08,800 : INFO : saved ./data/gensim_word2vec_model_180518


To load the saved model, use the code below:

In [11]:
model = Word2Vec.load("./data/gensim_word2vec_model_180518")

### Model checking

Our embedding is ready now. Let's try some NLP tasks with the model. There are 24,800 words in our vocabulary.

In [12]:
print("Total of words: {}".format(len(model.wv.vocab)))

Total of words: 24800


#### Word similarity

In the example below, we can see that the numbers `1` and `2` are much closer to each other than the words `apple` and `weather` are.

In [13]:
print(model.wv.similarity('1', '2'))

0.776723224032


In [14]:
print(model.wv.similarity('apple', 'weather'))

0.193279800243
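
The scores above are cosine similarities between word vectors. As a minimal sketch of that measure, the snippet below computes it by hand for a few made-up 3-dimensional vectors; the vectors and variable names are purely illustrative and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, for illustration only
one = [1.0, 2.0, 0.5]
two = [0.9, 2.1, 0.4]
weather = [-1.0, 0.2, 2.0]

print(cosine_similarity(one, two))      # close to 1: vectors point the same way
print(cosine_similarity(one, weather))  # much smaller: vectors point in different directions
```

A value near 1 means the two vectors point in almost the same direction, while a value near 0 means they are nearly orthogonal.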


In [15]:
model.wv.most_similar('character')

[('protagonist', 0.7609272003173828),
 ('role', 0.7386566996574402),
 ('personality', 0.7352930307388306),
 ('villain', 0.6820997595787048),
 ('persona', 0.6581060886383057),
 ('antagonist', 0.6566777229309082),
 ('attraction', 0.6436684131622314),
 ('rapport', 0.6315550208091736),
 ('relationship', 0.6272909641265869),
 ('presence', 0.6150318384170532)]

If we have the analogy `good:best :: bad:?` and want to find the missing word (which is `worst`), we look for the word whose vector x is closest to:
    
    vector(`best`) - vector(`good`) + vector(`bad`)

In [16]:
model.wv.most_similar(positive=['best','bad'],negative=['good'])

[('worst', 0.8797682523727417),
 ('funniest', 0.771563708782196),
 ('stupidest', 0.7688117027282715),
 ('poorest', 0.7639316916465759),
 ('cheesiest', 0.7318538427352905),
 ('finest', 0.7109623551368713),
 ('weakest', 0.6855686902999878),
 ('greatest', 0.6832293272018433),
 ('lamest', 0.6801236867904663),
 ('dullest', 0.6715612411499023)]

Below is another example, `germany:berlin :: france:?`. The top result is exactly what we expect: the word `paris`.

In [17]:
model.wv.most_similar(positive=['berlin','france'],negative=['germany'])

[('paris', 0.7609268426895142),
 ('margo', 0.7574951648712158),
 ('du', 0.7245261073112488),
 ('vienna', 0.7232941389083862),
 ('pierre', 0.7172746062278748),
 ('clair', 0.7013946771621704),
 ('eleni', 0.6924647092819214),
 ('villa', 0.6900115013122559),
 ('belle', 0.6767010688781738),
 ('venice', 0.6763746738433838)]
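
The analogy queries above can be sketched with plain vector arithmetic. The toy 2-dimensional embeddings and the `analogy` helper below are made up for illustration. This simplified version adds and subtracts the raw vectors and ranks the remaining words by cosine similarity; Gensim's `most_similar` works on normalized vectors, so its scores will differ.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(embeddings, positive, negative, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative), skipping the query words."""
    dim = len(next(iter(embeddings.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, embeddings[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, embeddings[word])]
    exclude = set(positive) | set(negative)
    scored = [(word, cosine(target, vec))
              for word, vec in embeddings.items() if word not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Made-up toy embeddings: the second axis loosely encodes "superlative"
toy = {
    "good": [1.0, 1.0], "best": [1.0, 2.0],
    "bad": [-1.0, 1.0], "worst": [-1.0, 2.0],
    "film": [0.5, -1.0],
}
print(analogy(toy, positive=["best", "bad"], negative=["good"]))
```

Excluding the query words matters: without it, one of the input words is often the nearest neighbour of the target vector.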

#### Different from the group

Using the `doesnt_match` function, we can find out which word does not belong in a group of words. Here 3 out of 4 are negative words, which leaves `good` as the only positive word.

In [18]:
print(model.wv.doesnt_match("terrible bad horrible good".split()))

good
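
One plausible way to find the odd word out is to average the unit-length vectors of the group and pick the word that is least similar to that average. The sketch below implements that idea on made-up 2-dimensional vectors; it is our assumption about the general approach, not Gensim's actual code, and the helper names are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(embeddings, words):
    """Return the word whose unit vector is least similar to the group's mean direction."""
    units = {w: unit(embeddings[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = [sum(units[w][k] for w in words) / len(words) for k in range(dim)]
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Made-up toy vectors: the first axis loosely encodes sentiment
toy = {
    "terrible": [-1.0, 0.2], "bad": [-0.9, 0.1],
    "horrible": [-1.0, 0.3], "good": [1.0, 0.2],
}
print(odd_one_out(toy, "terrible bad horrible good".split()))
# prints: good
```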


#### Word vectors

We can easily get the vector representation of a specific word with the function `get_vector`, or by indexing `model.wv` with the word directly. Below are some example vectors:

In [19]:
print("Vector for word 'film': \n{}".format(model.wv.get_vector('film')))

Vector for word 'film': 
[-2.26498747  0.59922707 -5.11942196 -2.86017013 -1.70951402  0.84601313
 -1.06525266  4.49640226  4.57102489  3.69904923  1.4661808   0.78479469
  0.70178229  1.55773318 -3.21934795  0.53184873 -5.91274118 -4.79618073
 -2.17931485 -1.01804733  4.69565201 -1.75921273 -1.93044889 -4.00331783
 -1.81263435 -2.60498953 -4.44695473 -0.77245116  1.8795706  -1.26255
 -4.07912254  2.29629445  3.18819356  6.95551157 -2.86236167  4.5243454
 -0.4579851  -1.42139578  6.93507481 -4.84183931 -0.88582969 -2.39337063
 -1.08843029 -3.75431132 -0.25205219 -4.76414824 -3.94741678 -1.41823518
  0.39292559 -1.12679636]


In [20]:
print("Vector for word 'movie': \n{}".format(model.wv['movie']))

Vector for word 'movie': 
[-2.78263164  2.90011311 -4.97895622 -1.87051165  0.61191779  0.10896216
 -4.27431345  3.89203548  4.72996855  2.96508002  0.89933449  1.81341279
  1.32129169  1.44068658 -3.42293787  0.49027845 -3.87458372 -4.89371157
 -2.96179509  0.64510381  4.7046051  -1.42691004 -0.36782303 -5.17180538
 -0.8367576  -5.06912422 -5.59841299 -2.08465052  1.37301195 -0.82797796
 -4.32099485  3.36242175  2.25995493  6.93139076 -1.24982738  4.21200705
  0.46820539 -1.62913144  6.4912262  -4.05751276 -1.26376617 -3.03613043
 -3.47508025 -4.51454163  1.68685043 -4.90973186 -4.11981916 -0.05337746
  0.70520127 -2.7952342 ]


#### Word index

Another useful attribute, `index2word`, lets us find the word at a specific index.

In [21]:
print("At index 15 is the word: {}".format(model.wv.index2word[15]))

At index 15 is the word: movie


#### Index of word

By looking up a word in the vocabulary, we can get its index.

In [22]:
print("Index of word 'movie' is: {}".format(model.wv.vocab['movie'].index))

Index of word 'movie' is: 15


#### Saving model into array of words and vectors

To make it easier to work with the words in the vocabulary and their vectors, we will save them into 2 separate arrays, one for the words and one for the vectors.

First, create a zero array with the same number of columns as our embedding dimension (50), but with 2 more rows than the vocabulary: 1 row for the padding character '-' at the beginning and another for 'unk' (used for unknown words) at the end. Padding lets us fix the length of every sequence: anything shorter than our maximum length is padded with '-', whose vector is all zeros. We choose '-' because it never appears in our vocabulary, so it cannot be confused with a real word when training the RNN model. Any word that is not found in the list takes the index of 'unk', which also has an all-zero vector.

Then loop through the whole vocabulary, adding each word to `word_list` and its vector to `word_vector`.

In [23]:
vec_dim = 50
word_list = []
word_vector = np.zeros((len(model.wv.vocab)+2, vec_dim)) # add '-' as padding and 'unk' as unknown
word_list.append('-') # padding value
for i in range (len(model.wv.vocab)):
    word = model.wv.index2word[i]
    vector = model.wv[word]
    word_vector[i+1] = vector
    word_list.append(word)
word_list.append('unk')

In [24]:
print("Number of words in list: {}".format(len(word_list)))
print("Shape of word vector: {}".format(word_vector.shape))

Number of words in list: 24802
Shape of word vector: (24802, 50)
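
To illustrate how the padding and 'unk' entries are meant to be used later, here is a minimal sketch that turns a tokenized sentence into a fixed-length list of indices. The tiny word list, the `encode` helper and the choice to pad at the end are all assumptions made for this example; the real `word_list` has 24,802 entries.

```python
def encode(words, word_to_index, unk_index, max_len, pad_index=0):
    """Map words to indices, truncate to max_len, and pad with pad_index at the end."""
    ids = [word_to_index.get(word, unk_index) for word in words][:max_len]
    return ids + [pad_index] * (max_len - len(ids))

# Tiny stand-in for word_list: padding first, 'unk' last
toy_word_list = ["-", "the", "movie", "unk"]
word_to_index = {word: i for i, word in enumerate(toy_word_list)}
unk_index = len(toy_word_list) - 1

print(encode(["the", "movie", "rocks"], word_to_index, unk_index, max_len=5))
# prints: [1, 2, 3, 0, 0]
```

Because the rows for index 0 and for 'unk' are all zeros, padded positions and unknown words contribute nothing when the indices are looked up in the vector array.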


In [25]:
print("First word in our vocab: {}\n".format(word_list[0]))
print("Its vector values: \n{}".format(word_vector[0]))

First word in our vocab: -

Its vector values: 
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]


In [26]:
print("Last word in our vocab: {}".format(word_list[-1]))
print("Its vector values: \n{}".format(word_vector[-1]))

Last word in our vocab: unk
Its vector values: 
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]


Use the code below to save them for later use. (We have already saved them in the `data` folder.)

np.save('./data/word_list_gensim_w2v', word_list)
np.save('./data/word_vector_gensim_w2v', word_vector)

To load them, use the code below:

In [27]:
load_word_list = np.load('./data/word_list_gensim_w2v.npy')
load_word_vector = np.load('./data/word_vector_gensim_w2v.npy')

load_word_list = load_word_list.tolist()
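As a small sanity check of the save and load steps, the sketch below does the same round trip on toy data in a temporary folder. The toy names and paths are made up for illustration. Note that `np.save` appends the `.npy` extension by itself, which is why the file names gain `.npy` when loading.

```python
import os
import tempfile
import numpy as np

# Toy stand-ins for word_list and word_vector.
toy_word_list = ['-', 'movie', 'film', 'unk']
toy_word_vector = np.zeros((4, 3))
toy_word_vector[1] = [0.1, 0.2, 0.3]

with tempfile.TemporaryDirectory() as tmp_dir:
    list_path = os.path.join(tmp_dir, 'toy_word_list')
    vector_path = os.path.join(tmp_dir, 'toy_word_vector')
    np.save(list_path, toy_word_list)        # written as toy_word_list.npy
    np.save(vector_path, toy_word_vector)    # written as toy_word_vector.npy

    # The list comes back as a NumPy array of strings, so convert it to a list.
    restored_list = np.load(list_path + '.npy').tolist()
    restored_vector = np.load(vector_path + '.npy')

print(restored_list == toy_word_list)
print(np.array_equal(restored_vector, toy_word_vector))
```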

Now, let's check that our saved values are consistent with the original model.

Let's look up the word 'movie' in our saved list and compare its vector with the one stored in the model:

In [41]:
print("Index of word 'movie' in word_list: {}".format(load_word_list.index("movie")))
print("Vector of word 'movie':\n{}".format(load_word_vector[16]))

Index of word 'movie' in word_list: 16
Vector of word 'movie':
[-2.78263164  2.90011311 -4.97895622 -1.87051165  0.61191779  0.10896216
 -4.27431345  3.89203548  4.72996855  2.96508002  0.89933449  1.81341279
  1.32129169  1.44068658 -3.42293787  0.49027845 -3.87458372 -4.89371157
 -2.96179509  0.64510381  4.7046051  -1.42691004 -0.36782303 -5.17180538
 -0.8367576  -5.06912422 -5.59841299 -2.08465052  1.37301195 -0.82797796
 -4.32099485  3.36242175  2.25995493  6.93139076 -1.24982738  4.21200705
  0.46820539 -1.62913144  6.4912262  -4.05751276 -1.26376617 -3.03613043
 -3.47508025 -4.51454163  1.68685043 -4.90973186 -4.11981916 -0.05337746
  0.70520127 -2.7952342 ]


If the vector of the word `movie` in the model is identical to the one in our `word_vector` array, subtracting one from the other gives a vector of zeros, which confirms that our array is correct.

In [29]:
print("Result: \n{}".format(model.wv.word_vec('movie') - load_word_vector[16]))

Result: 
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
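The check above looks at a single word. A sketch of how the same idea could be extended to every row is shown below. `rows_match` is a hypothetical helper of ours, and `get_vector` stands for any function that returns the model's vector for a word; the toy dictionary plays the role of the model here.

```python
import numpy as np

def rows_match(word_list, word_vector, get_vector):
    """Return True if every real word's row equals its source vector
    and the padding / 'unk' rows are all zeros (hypothetical helper)."""
    inner_words = word_list[1:-1]  # skip '-' at the start and 'unk' at the end
    expected = np.array([get_vector(w) for w in inner_words])
    return (np.allclose(word_vector[1:-1], expected)
            and not word_vector[0].any()
            and not word_vector[-1].any())

# Toy data standing in for the model and the saved arrays.
toy_model = {'movie': np.array([1.0, 2.0]), 'film': np.array([3.0, 4.0])}
toy_word_list = ['-', 'movie', 'film', 'unk']
toy_word_vector = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [0.0, 0.0]])

print(rows_match(toy_word_list, toy_word_vector, toy_model.__getitem__))  # True
```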


## `fastText` embedding

Next, let's learn an embedding with the `fastText` algorithm in Gensim. We will train it on the same data prepared above.

### Training word embeddings

In [23]:
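num_feature = 50       # embedding dimension
min_word_count = 20    # ignore words that appear fewer than 20 times
num_thread = 5         # number of worker threads
window_size = 10       # context window size
down_sampling = 0.001  # defined here but not passed to FastText below
iteration = 20         # defined here but not passed to FastText below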

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model_fastText = FastText(sentences, 
                          size=num_feature, 
                          window=window_size, 
                          min_count=min_word_count, 
                          workers=num_thread)

2018-05-19 21:31:43,353 : INFO : collecting all words and their counts
2018-05-19 21:31:43,355 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-05-19 21:31:43,480 : INFO : PROGRESS: at sentence #10000, processed 185521 words, keeping 17175 word types
2018-05-19 21:31:43,570 : INFO : PROGRESS: at sentence #20000, processed 367101 words, keeping 24423 word types
2018-05-19 21:31:43,658 : INFO : PROGRESS: at sentence #30000, processed 551281 words, keeping 29689 word types
2018-05-19 21:31:43,722 : INFO : PROGRESS: at sentence #40000, processed 733293 words, keeping 34309 word types
2018-05-19 21:31:43,791 : INFO : PROGRESS: at sentence #50000, processed 917534 words, keeping 38241 word types
2018-05-19 21:31:43,856 : INFO : PROGRESS: at sentence #60000, processed 1097938 words, keeping 41448 word types
2018-05-19 21:31:43,929 : INFO : PROGRESS: at sentence #70000, processed 1281359 words, keeping 44467 word types
...

2018-05-19 21:31:48,240 : INFO : PROGRESS: at sentence #710000, processed 12919593 words, keeping 135944 word types
2018-05-19 21:31:48,323 : INFO : PROGRESS: at sentence #720000, processed 13103676 words, keeping 136863 word types
2018-05-19 21:31:48,414 : INFO : PROGRESS: at sentence #730000, processed 13283040 words, keeping 137837 word types
2018-05-19 21:31:48,507 : INFO : PROGRESS: at sentence #740000, processed 13468256 words, keeping 138924 word types
2018-05-19 21:31:48,596 : INFO : PROGRESS: at sentence #750000, processed 13650003 words, keeping 139879 word types
2018-05-19 21:31:48,688 : INFO : PROGRESS: at sentence #760000, processed 13830700 words, keeping 140743 word types
2018-05-19 21:31:48,784 : INFO : PROGRESS: at sentence #770000, processed 14012331 words, keeping 141698 word types
2018-05-19 21:31:48,881 : INFO : PROGRESS: at sentence #780000, processed 14195449 words, keeping 142712 word types
...

2018-05-19 21:32:36,320 : INFO : EPOCH 1 - PROGRESS: at 45.72% examples, 146637 words/s, in_qsize 10, out_qsize 0
2018-05-19 21:32:37,385 : INFO : EPOCH 1 - PROGRESS: at 46.91% examples, 146584 words/s, in_qsize 8, out_qsize 1
2018-05-19 21:32:38,388 : INFO : EPOCH 1 - PROGRESS: at 48.15% examples, 146926 words/s, in_qsize 10, out_qsize 0
2018-05-19 21:32:39,445 : INFO : EPOCH 1 - PROGRESS: at 49.17% examples, 146397 words/s, in_qsize 8, out_qsize 1
2018-05-19 21:32:40,513 : INFO : EPOCH 1 - PROGRESS: at 50.19% examples, 145851 words/s, in_qsize 10, out_qsize 2
2018-05-19 21:32:41,651 : INFO : EPOCH 1 - PROGRESS: at 51.51% examples, 145914 words/s, in_qsize 8, out_qsize 1
2018-05-19 21:32:42,685 : INFO : EPOCH 1 - PROGRESS: at 52.55% examples, 145516 words/s, in_qsize 8, out_qsize 1
2018-05-19 21:32:43,814 : INFO : EPOCH 1 - PROGRESS: at 53.85% examples, 145606 words/s, in_qsize 9, out_qsize 0
...

2018-05-19 21:33:48,423 : INFO : EPOCH 2 - PROGRESS: at 24.45% examples, 134128 words/s, in_qsize 8, out_qsize 1
2018-05-19 21:33:49,437 : INFO : EPOCH 2 - PROGRESS: at 25.71% examples, 135175 words/s, in_qsize 9, out_qsize 0
2018-05-19 21:33:50,501 : INFO : EPOCH 2 - PROGRESS: at 26.98% examples, 135871 words/s, in_qsize 9, out_qsize 0
2018-05-19 21:33:51,533 : INFO : EPOCH 2 - PROGRESS: at 28.17% examples, 136392 words/s, in_qsize 9, out_qsize 0
2018-05-19 21:33:52,624 : INFO : EPOCH 2 - PROGRESS: at 29.42% examples, 136857 words/s, in_qsize 8, out_qsize 1
2018-05-19 21:33:53,633 : INFO : EPOCH 2 - PROGRESS: at 30.58% examples, 137152 words/s, in_qsize 8, out_qsize 1
2018-05-19 21:33:54,705 : INFO : EPOCH 2 - PROGRESS: at 31.87% examples, 137640 words/s, in_qsize 9, out_qsize 0
2018-05-19 21:33:55,878 : INFO : EPOCH 2 - PROGRESS: at 33.08% examples, 137395 words/s, in_qsize 9, out_qsize 1
...

2018-05-19 21:35:03,347 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-05-19 21:35:03,410 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-05-19 21:35:03,418 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-05-19 21:35:03,420 : INFO : EPOCH - 2 : training on 17503976 raw words (12868553 effective words) took 98.4s, 130740 effective words/s
2018-05-19 21:35:04,504 : INFO : EPOCH 3 - PROGRESS: at 0.89% examples, 115523 words/s, in_qsize 9, out_qsize 0
2018-05-19 21:35:05,534 : INFO : EPOCH 3 - PROGRESS: at 2.03% examples, 128994 words/s, in_qsize 9, out_qsize 0
2018-05-19 21:35:06,574 : INFO : EPOCH 3 - PROGRESS: at 3.11% examples, 130815 words/s, in_qsize 8, out_qsize 1
2018-05-19 21:35:07,583 : INFO : EPOCH 3 - PROGRESS: at 4.20% examples, 132688 words/s, in_qsize 10, out_qsize 1
2018-05-19 21:35:08,596 : INFO : EPOCH 3 - PROGRESS: at 5.27% examples, 133639 words/s, in_qsize 8, out_qsize 1
...

2018-05-19 21:36:17,788 : INFO : EPOCH 3 - PROGRESS: at 76.48% examples, 132414 words/s, in_qsize 9, out_qsize 0
2018-05-19 21:36:18,797 : INFO : EPOCH 3 - PROGRESS: at 77.39% examples, 132203 words/s, in_qsize 9, out_qsize 0
2018-05-19 21:36:19,838 : INFO : EPOCH 3 - PROGRESS: at 78.47% examples, 132228 words/s, in_qsize 8, out_qsize 1
2018-05-19 21:36:20,945 : INFO : EPOCH 3 - PROGRESS: at 79.62% examples, 132235 words/s, in_qsize 8, out_qsize 1
2018-05-19 21:36:22,024 : INFO : EPOCH 3 - PROGRESS: at 80.82% examples, 132376 words/s, in_qsize 9, out_qsize 0
2018-05-19 21:36:23,107 : INFO : EPOCH 3 - PROGRESS: at 81.95% examples, 132421 words/s, in_qsize 7, out_qsize 2
2018-05-19 21:36:24,337 : INFO : EPOCH 3 - PROGRESS: at 83.15% examples, 132314 words/s, in_qsize 10, out_qsize 2
2018-05-19 21:36:25,398 : INFO : EPOCH 3 - PROGRESS: at 84.35% examples, 132485 words/s, in_qsize 9, out_qsize 0
...

2018-05-19 21:37:30,209 : INFO : EPOCH 4 - PROGRESS: at 53.41% examples, 136874 words/s, in_qsize 9, out_qsize 0
2018-05-19 21:37:31,321 : INFO : EPOCH 4 - PROGRESS: at 54.63% examples, 137051 words/s, in_qsize 8, out_qsize 1
2018-05-19 21:37:32,358 : INFO : EPOCH 4 - PROGRESS: at 55.73% examples, 137005 words/s, in_qsize 9, out_qsize 1
2018-05-19 21:37:33,422 : INFO : EPOCH 4 - PROGRESS: at 56.87% examples, 137018 words/s, in_qsize 10, out_qsize 0
2018-05-19 21:37:34,483 : INFO : EPOCH 4 - PROGRESS: at 57.99% examples, 137038 words/s, in_qsize 8, out_qsize 1
2018-05-19 21:37:35,608 : INFO : EPOCH 4 - PROGRESS: at 59.16% examples, 137040 words/s, in_qsize 8, out_qsize 1
2018-05-19 21:37:36,643 : INFO : EPOCH 4 - PROGRESS: at 60.31% examples, 137131 words/s, in_qsize 8, out_qsize 1
2018-05-19 21:37:37,664 : INFO : EPOCH 4 - PROGRESS: at 61.45% examples, 137246 words/s, in_qsize 9, out_qsize 0
...

2018-05-19 21:38:42,900 : INFO : EPOCH 5 - PROGRESS: at 31.64% examples, 137787 words/s, in_qsize 9, out_qsize 0
2018-05-19 21:38:43,901 : INFO : EPOCH 5 - PROGRESS: at 32.67% examples, 137590 words/s, in_qsize 8, out_qsize 1
2018-05-19 21:38:44,942 : INFO : EPOCH 5 - PROGRESS: at 33.84% examples, 137704 words/s, in_qsize 9, out_qsize 0
2018-05-19 21:38:46,042 : INFO : EPOCH 5 - PROGRESS: at 34.95% examples, 137553 words/s, in_qsize 8, out_qsize 1
2018-05-19 21:38:47,121 : INFO : EPOCH 5 - PROGRESS: at 36.08% examples, 137513 words/s, in_qsize 8, out_qsize 1
2018-05-19 21:38:48,183 : INFO : EPOCH 5 - PROGRESS: at 37.22% examples, 137540 words/s, in_qsize 8, out_qsize 1
2018-05-19 21:38:49,228 : INFO : EPOCH 5 - PROGRESS: at 38.41% examples, 137834 words/s, in_qsize 10, out_qsize 1
2018-05-19 21:38:50,316 : INFO : EPOCH 5 - PROGRESS: at 39.34% examples, 136956 words/s, in_qsize 8, out_qsize 1
...

To load the model, we can use the code below:

In [30]:
model_fastText = FastText.load("./data/gensim_fastText_model_190518")

### Model checking

#### Word similarity

Let's try to find similar words to `character`:

In [31]:
model_fastText.wv.most_similar('character')

[('characterize', 0.8779246211051941),
 ('characterisation', 0.8767824172973633),
 ('characteristically', 0.8579357862472534),
 ('characteristic', 0.8445369005203247),
 ('characterization', 0.8444110751152039),
 ('protagonist', 0.8273643255233765),
 ('uncharacteristically', 0.8241872787475586),
 ('characteristics', 0.8222066164016724),
 ('characterized', 0.8184616565704346),
 ('characterisations', 0.8086223602294922)]

With the `word2vec` model, we can't get a vector for the word `citi` or find words similar to it, because it doesn't exist in the vocabulary.

In [32]:
try:
    print(model.wv.get_vector('citi'))
except KeyError:
    print("Word 'citi' doesn't exist in word2vec vocab.")

Word 'citi' doesn't exist in word2vec vocab.


In [33]:
print("Vector for word 'citi' in fastText vocab: \n{}".format(model_fastText.wv.get_vector('citi')))

Vector for word 'citi' in fastText vocab: 
[ 0.32699245  1.16181421 -1.69706643  0.70408148  3.09601688  1.4794116
  0.4120028   0.53522533 -1.95446098 -0.77416432  0.76101339 -3.46570253
  2.60462809 -2.5443294  -0.15657869 -0.96487325 -2.42882991  2.43761039
 -2.11448622  2.24130487 -2.57711458  2.95774055  2.16594553  2.98466587
  1.72890961 -2.87294412 -0.03725249 -1.97050095  2.67122102  2.1939292
 -5.70520449  3.38641     0.34512585  2.02045274 -1.99330831 -6.04638863
  0.62465125  1.53422356  0.4336963  -0.90493697 -0.54554069  2.55114365
  0.9242475  -2.06424022  2.2968049   0.65186095 -0.73221743 -0.4610323
 -5.32092571  1.38438857]


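We can see that `fastText` handles this case better than `word2vec`: because it breaks each word into sub-words (character n-grams), it can build a vector for a word from the n-grams it shares with words seen during training, even when the word itself is missing from the vocabulary. As a result, it can also find similar words, as below: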

In [34]:
model_fastText.wv.most_similar('citi')

[('citizen', 0.8678514957427979),
 ('citizens', 0.8646746873855591),
 ('europa', 0.8545630574226379),
 ('ww2', 0.8246501684188843),
 ('euro', 0.8074249029159546),
 ('japan', 0.8040512800216675),
 ('wwi', 0.7996044754981995),
 ('europe', 0.7969143390655518),
 ('western', 0.7959458827972412),
 ('wwii', 0.7896401286125183)]
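To see why `citizen` and `citizens` come out on top, it helps to look at the sub-words themselves. The sketch below lists the character n-grams of a word, assuming the 3 to 6 character range and the `<` / `>` boundary markers described for fastText. `char_ngrams` is our own illustrative function; the real implementation also hashes the n-grams into buckets, which is left out here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

citi = char_ngrams("citi")
shared = sorted(set(citi) & set(char_ngrams("citizen")))

print(citi)
# ['<ci', 'cit', 'iti', 'ti>', '<cit', 'citi', 'iti>', '<citi', 'citi>', '<citi>']
print(shared)
# ['<ci', '<cit', '<citi', 'cit', 'citi', 'iti']
```

Six of the ten n-grams of `citi` also occur in `citizen`, so the two words end up with related vectors.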

Running the same analogy task as with the `word2vec` model gives the same most similar word: `worst`.

In [35]:
model_fastText.wv.most_similar(positive=['best','bad'],negative=['good'])

[('worst', 0.8669127225875854),
 ('dumbest', 0.862362265586853),
 ('lamest', 0.812777042388916),
 ('laziest', 0.8003767728805542),
 ('zest', 0.7850866913795471),
 ('funniest', 0.781031608581543),
 ('ugliest', 0.7756903171539307),
 ('silliest', 0.7735652923583984),
 ('tiniest', 0.7715500593185425),
 ('wiest', 0.7651567459106445)]

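Unfortunately, the `fastText` model doesn't work well on some analogy tasks, such as the example below, whereas the `word2vec` model handles them well. Here the results are dominated by names that share character sequences with the query words, such as `merlin`.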

In [42]:
model_fastText.wv.most_similar(positive=['berlin','france'],negative=['germany'])

[('merlin', 0.8392232656478882),
 ('palmer', 0.8322012424468994),
 ('darlene', 0.827530026435852),
 ('della', 0.822186291217804),
 ('daphne', 0.821194052696228),
 ('erin', 0.8182156682014465),
 ('lynne', 0.8165528774261475),
 ('kavner', 0.8089902400970459),
 ('gabriella', 0.8058374524116516),
 ('brynner', 0.8057600259780884)]
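For intuition, an analogy query can be read as vector arithmetic. The sketch below is a simplified version, as far as we understand what `most_similar` does: average the unit-length input vectors (negative words get a minus sign), then rank the remaining words by cosine similarity. The function `analogy` and the 2-dimensional toy vectors are made up for illustration only.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to the combined query vector."""
    query = np.mean([unit(vectors[w]) for w in positive] +
                    [-unit(vectors[w]) for w in negative], axis=0)
    query = unit(query)
    exclude = set(positive) | set(negative)  # input words are not returned
    scores = [(w, float(np.dot(unit(v), query)))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors: first axis is polarity, second axis is "superlative-ness".
toy = {
    'good':  np.array([1.0, 0.0]),
    'best':  np.array([1.0, 1.0]),
    'bad':   np.array([-1.0, 0.0]),
    'worst': np.array([-1.0, 1.0]),
    'film':  np.array([0.0, -1.0]),
}

result = analogy(toy, positive=['best', 'bad'], negative=['good'])
print(result[0][0])  # worst
```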

#### Different from the group

The result is the same as with the `word2vec` model.

In [37]:
print(model_fastText.wv.doesnt_match("terrible bad horrible good".split()))

good
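The idea behind this query can be sketched in a few lines: compute the average direction of the group and pick the word that is least similar to it. This is our simplified reading of `doesnt_match`, with a hypothetical function name and toy vectors, not the library's exact code.

```python
import numpy as np

def odd_one_out(vectors, words):
    """Return the word whose unit vector is least similar to the group mean."""
    unit_vecs = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    centre = unit_vecs.mean(axis=0)
    centre = centre / np.linalg.norm(centre)
    sims = np.dot(unit_vecs, centre)  # cosine similarity of each word to the centre
    return words[int(np.argmin(sims))]

# Toy vectors: three negative words point one way, 'good' points the other way.
toy = {
    'terrible': np.array([-1.0, 0.1]),
    'bad':      np.array([-1.0, 0.0]),
    'horrible': np.array([-1.0, -0.1]),
    'good':     np.array([1.0, 0.0]),
}

print(odd_one_out(toy, "terrible bad horrible good".split()))  # good
```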


To get a word's vector and index, the syntax is the same as for `word2vec` above.

#### Saving model into array of words and vectors

Finally, let's save the vocabulary and the corresponding vectors for the RNN model later.

In [38]:
vec_dim = 50
word_list_fastText = []
word_vector_fastText = np.zeros((len(model_fastText.wv.vocab)+2, vec_dim)) # add '-' as padding character and 'unk' as unknown
word_list_fastText.append('-') # padding value
for i in range (len(model_fastText.wv.vocab)):
    word = model_fastText.wv.index2word[i]
    vector = model_fastText.wv[word]
    word_vector_fastText[i+1] = vector
    word_list_fastText.append(word)
word_list_fastText.append('unk')

Let's check the number of words in our vocabulary and vector shape.

In [39]:
print("Number of words in list: {}".format(len(word_list_fastText)))
print("Shape of word vector: {}".format(word_vector_fastText.shape))

Number of words in list: 24802
Shape of word vector: (24802, 50)
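Since the same loop is used for both `word2vec` and `fastText`, it could be factored into one helper. The sketch below is one possible way to do so. `build_lookup` is a hypothetical name, and `get_vector` is any function that returns the vector of a word; the toy dictionary stands in for a trained model.

```python
import numpy as np

def build_lookup(index2word, get_vector, vec_dim, pad_token='-', unk_token='unk'):
    """Build the word list and vector matrix with zero rows for padding and 'unk'."""
    word_list = [pad_token] + list(index2word) + [unk_token]
    word_vector = np.zeros((len(word_list), vec_dim))
    # Row 0 (padding) and the last row ('unk') stay at zero.
    for row, word in enumerate(index2word, start=1):
        word_vector[row] = get_vector(word)
    return word_list, word_vector

# Toy data standing in for a trained model.
toy_vectors = {'movie': np.ones(3), 'film': np.full(3, 2.0)}
toy_words, toy_matrix = build_lookup(['movie', 'film'], toy_vectors.__getitem__, vec_dim=3)

print(toy_words)         # ['-', 'movie', 'film', 'unk']
print(toy_matrix.shape)  # (4, 3)
```

With the models above, a call such as `build_lookup(model_fastText.wv.index2word, model_fastText.wv.__getitem__, 50)` should give the same arrays as the loop, assuming the Gensim version used in this notebook.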


We are going to need those two arrays for building the RNN model. Let's save them for later use.

np.save('./data/word_list_gensim_fT', word_list_fastText)
np.save('./data/word_vector_gensim_fT', word_vector_fastText)

Now our embeddings are ready for use. In the next part, we will perform the following tasks:
- Build an RNN model using the pre-trained `GloVe` embedding
- Build an RNN model using our own Gensim `word2vec` embedding
- Build an RNN model using our own Gensim `fastText` embedding