# Learning word embeddings with `Gensim`

In this part, we are going to learn word embeddings with Gensim, a very popular toolkit for Natural Language Processing (NLP). We will build two word embeddings: `word2vec` and `fastText`.

First, let's import some useful packages:

In [7]:
# Packages used by the preprocessing code below (regular expressions and NLTK's sentence tokenizer)
import re
import nltk

import logging
%matplotlib inline




## Data preparation

Dataset: "Bag of Words Meets Bags of Popcorn". We will work with both the labeled training set and the unlabeled training set. Because we only want to learn word embeddings (not build a classifier), we don't need the labels, which makes the unlabeled data useful as well.

Now, let's load them using the `pandas` package.

In [19]:
# Import the required libraries for execution.

import numpy as np
import pandas as pd
from gensim.models import word2vec
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid', context='talk')


In [4]:
# Load labeledTrainData and unlabeledTrainData into DataFrames

train = pd.read_csv("labeledTrainData.tsv", delimiter="\t", header = 0, quoting=3)
unlabeled = pd.read_csv("unlabeledTrainData.tsv", delimiter="\t", header = 0, quoting=3)

There are 25,000 reviews in the training set and 50,000 reviews in the unlabeled dataset.

In [5]:
# Check the shape of the dataset provided

print("Train set: {}".format(train.shape))
print("Unlabeled set: {}".format(unlabeled.shape))

Train set: (25000, 3)
Unlabeled set: (50000, 2)


Let's take a quick look at one of the unlabeled reviews. We can see that there are many HTML tags in this review. So our task is to clean up the reviews by removing unwanted characters such as HTML tags and special characters `(\, /, ?, !, etc.)`.

In [6]:
unlabeled.review[1]

'"I saw this film about 20 years ago and remember it as being particularly nasty. I believe it is based on a true incident: a young man breaks into a nurses\' home and rapes, tortures and kills various women.<br /><br />It is in black and white but saves the colour for one shocking shot.<br /><br />At the end the film seems to be trying to make some political statement but it just comes across as confused and obscene.<br /><br />Avoid."'

Function <span style="color:blue; font-family:Courier">convert_to_sentences</span> is defined to remove these characters. The steps are as follows:
1. Lowercase the review and replace HTML tags (`<br />`) with a space. Replace the character '-' with a space as well, and add a space after each '.' to help the tokenizer split sentences correctly. Without this, sentences joined by '.' with no following space would be treated as one sentence, which is not correct. Double spaces are then collapsed into a single space.
2. Split the review into sentences, since one review may contain many sentences, using the `punkt` tokenizer from NLTK.
3. Apply the `special_characters` filter to each sentence and split it into words.
4. Finally, return a list of sentences, each of which is a list of words.


In [10]:
# Preprocess a review: HTML tags and special characters are replaced or removed so that sentences can be split and tokenized cleanly.

special_characters = re.compile("[^A-Za-z0-9 ]")
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
def convert_to_sentences(data, tokenizer):
    # First, converting each review into sentences
    # Use NLTK Tokenizer to split review into sentences (punkt tokenizer - english.pickle)
    data = data.lower().replace("<br />", " ")
    data = data.replace("-", " ")
    data = data.replace(".", ". ")
    data = re.sub("  ", " ", data)
    all_sentences = tokenizer.tokenize(data.strip())
    
    # Second, converting each sentence into words
    sentences = []
    for words in all_sentences:
        s = re.sub(special_characters, "", words.lower())
        if (len(s)) > 0:
            sentences.append(s.split())
    
    # Finally, returning a list of sentences (containing words in each sentence)
    return sentences
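
To see why the space after '.' matters, here is a minimal sketch that runs without NLTK. The helper `split_sentences` is a hypothetical, much simpler stand-in for the `punkt` tokenizer (it only splits after `.`, `!` or `?` followed by whitespace), so it illustrates the idea rather than reproducing the function above.

```python
import re

def split_sentences(text):
    """Naive stand-in for a sentence tokenizer (assumption, not punkt):
    split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

raw = "Cool title.I cannot say the same for the movie though.<br /><br />Avoid."

# Without any cleaning there is no whitespace after the periods,
# so the whole review is treated as a single sentence.
without_fix = split_sentences(raw)

# Same cleaning idea as in convert_to_sentences: drop HTML breaks, add a space after '.'
cleaned = raw.replace("<br />", " ").replace(".", ". ")
with_fix = split_sentences(cleaned)

print(len(without_fix))
print(with_fix)
```

With the cleaning applied, the toy review splits into three sentences instead of one.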

Below is a sample raw review:

In [11]:
print("Review: \n{}".format(unlabeled.review[90]))

Review: 
"Steven Seagal returns in Black Dawn.Cool title.I cannot say the same for the movie though.It ranks as the Worst Seagal movie yet.Yes.It is worse than Out for a Kill.Did we really need a sequel to the Forigner that bad?<br /><br />Steven returns as Jonathon Cold.Assassin for hire.This time he is out to stop a group of Terrorists and prevent Nuclier Bomb from hitting Los Angelas with the help of CIA agent Tamara Davies.<br /><br />This idea seems okay.The Jonathon Cold character once again gives the big guy a chance to tarnish his clean-cut cinematic Image.You get the feeling with a decent director and production team this movie could have went somewhere.If this cost 15 million I wanna know where it was spent?It was not spent on the movie?!! <br /><br />The action scenes are not bad.The movie was obviously shot like a Doug Liman flick.Its just that most of the Time Seagal is doubled so much you cannot take the movie seriously whatsoever.The script seems...what script?!Logic is 

Applying the function <span style="color:blue; font-family:Courier">convert_to_sentences</span> to this review, we get 31 sentences in total.

In [12]:
# Check the results from the function application and review them.

sentences_sample = convert_to_sentences(unlabeled.review[90], tokenizer)
print("Length of formatted sentences: {}".format(len(sentences_sample)))

Length of formatted sentences: 31


Details of each sentence are as follows:

In [13]:
# Print the list of words in each sentence of the sample
for i in sentences_sample:
    print(i)

['steven', 'seagal', 'returns', 'in', 'black', 'dawn']
['cool', 'title']
['i', 'cannot', 'say', 'the', 'same', 'for', 'the', 'movie', 'though']
['it', 'ranks', 'as', 'the', 'worst', 'seagal', 'movie', 'yet']
['yes']
['it', 'is', 'worse', 'than', 'out', 'for', 'a', 'kill']
['did', 'we', 'really', 'need', 'a', 'sequel', 'to', 'the', 'forigner', 'that', 'bad']
['steven', 'returns', 'as', 'jonathon', 'cold']
['assassin', 'for', 'hire']
['this', 'time', 'he', 'is', 'out', 'to', 'stop', 'a', 'group', 'of', 'terrorists', 'and', 'prevent', 'nuclier', 'bomb', 'from', 'hitting', 'los', 'angelas', 'with', 'the', 'help', 'of', 'cia', 'agent', 'tamara', 'davies']
['this', 'idea', 'seems', 'okay']
['the', 'jonathon', 'cold', 'character', 'once', 'again', 'gives', 'the', 'big', 'guy', 'a', 'chance', 'to', 'tarnish', 'his', 'clean', 'cut', 'cinematic', 'image']
['you', 'get', 'the', 'feeling', 'with', 'a', 'decent', 'director', 'and', 'production', 'team', 'this', 'movie', 'could', 'have', 'went', 'so

Now, let's do the same for all of our train and unlabeled reviews, joining the two lists into one. This may take several minutes to complete due to the large number of reviews that need to be cleaned up (75,000).

In [14]:
sentences = []
for review in train.review:
    sentences += convert_to_sentences(review, tokenizer)
print("Done for train set.")

for review in unlabeled.review:
    sentences += convert_to_sentences(review, tokenizer)
print("Done for unlabeled set.")

Done for train set.
Done for unlabeled set.


Let's check the first 5 sentences:

In [15]:
for i in sentences[0:5]:
    print("{}\n".format(i))

['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'mj', 'ive', 'started', 'listening', 'to', 'his', 'music', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', 'watched', 'the', 'wiz', 'and', 'watched', 'moonwalker', 'again']

['maybe', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into', 'this', 'guy', 'who', 'i', 'thought', 'was', 'really', 'cool', 'in', 'the', 'eighties', 'just', 'to', 'maybe', 'make', 'up', 'my', 'mind', 'whether', 'he', 'is', 'guilty', 'or', 'innocent']

['moonwalker', 'is', 'part', 'biography', 'part', 'feature', 'film', 'which', 'i', 'remember', 'going', 'to', 'see', 'at', 'the', 'cinema', 'when', 'it', 'was', 'originally', 'released']

['some', 'of', 'it', 'has', 'subtle', 'messages', 'about', 'mjs', 'feeling', 'towards', 'the', 'press', 'and', 'also', 'the', 'obvious', 'message', 'of', 'drugs', 'are', 'bad', 'mkay']

['visually', 'impressive', 'but', 'of', 'course', 'this', 'is', 'all', 'about', 'mic

## `word2vec` embedding

Before doing the embedding, let's define values for the model.

- `num_feature`: The dimension of the word vectors. More dimensions can give a better representation, but they need more data and more time to learn. Since we don't have that much data, we define this as 50 (default is 100). Note that the `Word2Vec` call below does not pass this value, so the trained `word2vec` vectors keep the default 100 dimensions.
- `min_word_count`: Any word that appears fewer times than this number will be ignored during learning (default is 5).
- `window_size`: For any given word, the window defines how many words to consider to its left and right (default is 5). This is the maximum distance between the current and predicted word within a sentence.
- `down_sampling`: The threshold for configuring which higher-frequency words are randomly downsampled; useful range is (0, 1e-5).
- `num_thread`: Number of worker threads used for training.
- `iteration`: Number of iterations (epochs) over the corpus (default is 5). In practice, more iterations often improve the representations. As with `num_feature`, this value is defined but not passed to the `Word2Vec` call below, so the default applies.
- Training algorithm: we will select between the CBOW and Skip-gram models.
        
    - CBOW: faster training time, slightly better accuracy for frequent words.
    - Skip-gram: works well with small datasets, represents rare words/phrases well.
    
In this case, we will pick CBOW which is the default value.
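
To make the window and the two training algorithms more concrete, here is a minimal sketch of the training examples each one would see for a toy sentence. The helper `context_windows` and the toy sentence are illustrative assumptions; real implementations typically also shrink the window randomly per position and apply down-sampling, which this sketch leaves out.

```python
def context_windows(sentence, window):
    """Yield (center_word, context_words) for every position in the sentence."""
    for i, center in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        yield center, left + right

sentence = ["the", "movie", "was", "really", "good"]
window = 2  # maximum distance between the center word and a context word

# CBOW: predict the center word from all of its context words at once.
cbow_examples = [(context, center)
                 for center, context in context_windows(sentence, window)]

# Skip-gram: predict each context word separately from the center word.
skipgram_pairs = [(center, ctx)
                  for center, context in context_windows(sentence, window)
                  for ctx in context]

print(cbow_examples[2])
print(len(cbow_examples), len(skipgram_pairs))
```

CBOW produces one example per position while Skip-gram produces one per (center, context) pair, which is one intuition for why CBOW trains faster.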

### Training word embeddings

In [20]:
# Train word2vec on all the tokenized sentences from our dataset.
# Note: num_feature and iteration are defined below but are not passed to Word2Vec,
# so gensim's defaults apply (100-dimensional vectors, 5 epochs).
num_feature = 50
min_word_count = 20
num_thread = 5
window_size = 10
down_sampling = 0.001
iteration = 20

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = word2vec.Word2Vec(sentences, 
                          min_count = min_word_count, 
                          window = window_size, 
                          sample = down_sampling, 
                          workers=num_thread)

2024-02-20 15:51:27,968 : INFO : collecting all words and their counts
2024-02-20 15:51:27,973 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types


2024-02-20 15:51:28,115 : INFO : PROGRESS: at sentence #10000, processed 186930 words, keeping 17241 word types
2024-02-20 15:51:28,334 : INFO : PROGRESS: at sentence #20000, processed 370700 words, keeping 24543 word types
2024-02-20 15:51:28,482 : INFO : PROGRESS: at sentence #30000, processed 555660 words, keeping 29804 word types
2024-02-20 15:51:28,619 : INFO : PROGRESS: at sentence #40000, processed 738919 words, keeping 34397 word types
2024-02-20 15:51:28,766 : INFO : PROGRESS: at sentence #50000, processed 925692 words, keeping 38424 word types
2024-02-20 15:51:28,906 : INFO : PROGRESS: at sentence #60000, processed 1106441 words, keeping 41571 word types
2024-02-20 15:51:29,049 : INFO : PROGRESS: at sentence #70000, processed 1290985 words, keeping 44601 word types
2024-02-20 15:51:29,186 : INFO : PROGRESS: at sentence #80000, processed 1472549 words, keeping 47163 word types
2024-02-20 15:51:29,330 : INFO : PROGRESS: at sentence #90000, processed 1657500 words, keeping 49788

Total training time for `word2vec` embeddings is 851.8s.
Let's save the model as a backup.

In [21]:
# Save the model created
model.save("gensim_word2vec_model_180518")

2024-02-20 16:46:28,904 : INFO : Word2Vec lifecycle event {'fname_or_handle': 'gensim_word2vec_model_180518', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2024-02-20T16:46:28.904089', 'gensim': '4.1.2', 'python': '3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22621-SP0', 'event': 'saving'}
2024-02-20 16:46:28,904 : INFO : not storing attribute cum_table
2024-02-20 16:46:28,949 : INFO : saved gensim_word2vec_model_180518


To load the saved model, use the code below:

In [22]:
# Load the model
model = word2vec.Word2Vec.load("gensim_word2vec_model_180518")

2024-02-20 16:46:39,914 : INFO : loading Word2Vec object from gensim_word2vec_model_180518
2024-02-20 16:46:39,960 : INFO : loading wv recursively from gensim_word2vec_model_180518.wv.* with mmap=None
2024-02-20 16:46:39,961 : INFO : setting ignored attribute cum_table to None
2024-02-20 16:46:40,101 : INFO : Word2Vec lifecycle event {'fname': 'gensim_word2vec_model_180518', 'datetime': '2024-02-20T16:46:40.101910', 'gensim': '4.1.2', 'python': '3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22621-SP0', 'event': 'loaded'}


### Model checking

Our embedding is ready now. Let's try some NLP tasks with the model. There are about 24,800 words in our vocabulary.

In [23]:
# Check the total number of words in the model
print("Total of words: {}".format(len(model.wv.index_to_key)))

Total of words: 24800
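
The vocabulary is smaller than the number of distinct word types seen in the corpus because of `min_word_count`. A rough sketch of that filtering on toy data follows; the toy sentences are made up, and the frequency-descending ordering reflects Gensim's default vocabulary sorting as I understand it.

```python
from collections import Counter

toy_sentences = [["good", "movie"], ["bad", "movie"], ["good", "film"], ["movie"]]
min_count = 2  # plays the role of min_word_count

# Count every word across all sentences
counts = Counter(word for sentence in toy_sentences for word in sentence)

# Keep only words seen at least min_count times, most frequent first
vocab = sorted((w for w, c in counts.items() if c >= min_count),
               key=lambda w: -counts[w])

print(vocab)
```

Here `bad` and `film` occur only once and are dropped, leaving `movie` and `good`.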


#### Word similarity

In the example below, we can see that the numbers `1` and `2` are much closer to each other than the words `apple` and `weather`.

In [24]:
# Similarity between '1' and '2': how close the two word vectors are to each other.
print(model.wv.similarity('1', '2'))

0.7546308


In [25]:
# Similarity between 'apple' and 'weather': how close the two word vectors are to each other.
print(model.wv.similarity('apple', 'weather'))

0.28807396
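
The scores above are cosine similarities between word vectors. A minimal pure-Python sketch of that measure, using made-up 2-D vectors rather than the trained embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (assumptions for illustration only)
a = [3.0, 4.0]
b = [6.0, 8.0]    # same direction as a, different length
c = [-4.0, 3.0]   # perpendicular to a

same_direction = cosine_similarity(a, b)
perpendicular = cosine_similarity(a, c)
print(same_direction, perpendicular)
```

Vectors pointing in the same direction score 1 regardless of their length, and perpendicular vectors score 0.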


In [26]:
# Find the words most similar to 'character'.
model.wv.most_similar('character')

[('protagonist', 0.6902007460594177),
 ('role', 0.6755948066711426),
 ('villain', 0.6517958045005798),
 ('personality', 0.6343037486076355),
 ('characters', 0.6324491500854492),
 ('attraction', 0.5448999404907227),
 ('actor', 0.5357121229171753),
 ('persona', 0.5354807376861572),
 ('antagonist', 0.5153608918190002),
 ('demeanor', 0.5046439170837402)]

If we have `good:best :: bad:?` and we want to find the missing word (which is `worst`), we have to look for vector(x) such that it is close to:
    
    vector(`best`) - vector(`good`) + vector(`bad`)

In [28]:
# Analogy example 1: find the missing word given positive and negative inputs.
model.wv.most_similar(positive=['best','bad'],negative=['good'])

[('worst', 0.7701299786567688),
 ('funniest', 0.66257643699646),
 ('finest', 0.626103937625885),
 ('weakest', 0.6009334921836853),
 ('poorest', 0.5990979671478271),
 ('greatest', 0.5783557891845703),
 ('scariest', 0.5781227946281433),
 ('stupidest', 0.5666341781616211),
 ('lamest', 0.5537522435188293),
 ('cheesiest', 0.5401484370231628)]

Below is another example. The output is exactly what we expect: the word `paris`.

In [27]:
# Analogy example 2: find the missing word given positive and negative inputs.
model.wv.most_similar(positive=['berlin','france'],negative=['germany'])

[('paris', 0.7268273234367371),
 ('vienna', 0.7151327729225159),
 ('vermont', 0.7054781317710876),
 ('du', 0.6951680183410645),
 ('tudor', 0.690239667892456),
 ('rio', 0.6882959604263306),
 ('virginia', 0.6670973300933838),
 ('georgia', 0.6643968820571899),
 ('london', 0.6642588376998901),
 ('18th', 0.663261890411377)]
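
The analogy queries above can be sketched with plain vector arithmetic. The 2-D toy vectors and the helper `analogy` are assumptions for illustration; Gensim's `most_similar` also normalizes the vectors before combining them, which this simplified version skips.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy embedding: first axis ~ sentiment, second axis ~ superlative degree
toy_vectors = {
    "good":  [1.0, 0.0],
    "best":  [1.0, 1.0],
    "bad":   [-1.0, 0.0],
    "worst": [-1.0, 1.0],
    "okay":  [0.5, 0.0],
}

answer = analogy(toy_vectors, positive=["best", "bad"], negative=["good"])
print(answer)
```

In this toy space, `best - good + bad` lands exactly on the vector for `worst`.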

#### Different from the group

By using the `doesnt_match` function, we can find out which word does not belong to a group of words. Here 3 out of 4 are negative words, leaving `good` as the only positive word.

In [48]:
# Find out which word does not belong to a particular group of words.
print(model.wv.doesnt_match("terrible bad horrible good".split()))

good
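
One way to think about this query: normalize each word vector, average them, and pick the word whose vector is least similar to that average. The sketch below follows this idea on made-up 2-D vectors; `odd_one_out` is a hypothetical helper, not Gensim's own code.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def odd_one_out(vectors, words):
    """Return the word whose unit vector has the smallest dot product with the group mean."""
    unit = {w: normalize(vectors[w]) for w in words}
    dim = len(unit[words[0]])
    mean = [sum(unit[w][i] for w in words) / len(words) for i in range(dim)]
    return min(words, key=lambda w: sum(a * b for a, b in zip(unit[w], mean)))

# Toy vectors (assumptions): the three negative words point in a similar direction
toy_vectors = {
    "terrible": [-1.0, 0.1],
    "bad":      [-0.9, 0.2],
    "horrible": [-1.0, -0.1],
    "good":     [1.0, 0.0],
}

outlier = odd_one_out(toy_vectors, "terrible bad horrible good".split())
print(outlier)
```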


#### Word vectors

We can get vector representations of a specific word easily by using the function `get_vector`. Below are some vectors as examples:

In [30]:
# Get a vector representation of any particular word - 'film'
print("Vector for word 'film': \n{}".format(model.wv.get_vector('film')))

Vector for word 'film': 
[-0.7503893   3.3117137  -0.08055668  3.3314981  -3.9168468   4.0740614
  1.720352    0.17729267 -2.9377303  -0.3530059  -0.26996315 -0.8304131
  2.3925884  -0.82503444  1.7596908  -2.4348834   0.5046636   1.2159004
 -1.3662195  -0.24275813 -2.9849536  -0.52368295 -3.0727255  -0.1403945
  0.5905988  -1.5283539  -1.2884144   2.6366818   3.2392526  -2.588227
 -2.2030482   1.5332707  -1.6776215   2.0891924  -2.864895   -2.3103518
  2.1927185   0.22805344 -3.9385703   1.7084055   3.2778258   0.57050663
  0.12756376 -0.7861977   1.4626007   0.2759785  -2.011784    0.6770213
 -1.9714171  -2.132479    0.66350836  1.0172902   2.44041     1.5059102
 -2.255609   -2.001187   -0.02927517  2.4219036   2.234259    4.323884
  2.4221594  -6.0610523  -0.06094478  1.0150019  -0.13281061  0.02484169
 -1.5710665   3.1357245  -1.3091756  -3.9651773  -1.7298561   1.3895074
 -0.38656276  0.8062439  -0.39627022  1.2938296   2.468507   -0.64202166
  0.30750096 -0.31411654  1.808364   -

In [29]:
# Get a vector representation of any particular word - 'movie'
print("Vector for word 'movie': \n{}".format(model.wv['movie']))

Vector for word 'movie': 
[-5.7691991e-01  1.1479995e+00 -1.0233791e+00  3.4909616e+00
 -3.1608334e+00  4.2682137e+00  3.6045623e-01  2.4319756e-01
 -2.6696441e+00 -1.8097198e-01  5.2845532e-01 -2.0128448e-01
  1.6378982e+00 -5.8694518e-01  1.4338499e+00 -2.1019242e+00
  4.9284998e-01  1.5599576e+00 -7.8196257e-01  2.1106923e+00
 -1.8566707e+00 -1.2288387e+00 -4.6938748e+00 -7.7675253e-02
  5.3333765e-01 -1.6631466e+00 -9.0212572e-01  2.3264823e+00
  3.3582087e+00 -2.9959867e+00 -2.4594772e+00  1.0541623e+00
 -2.4222088e+00  3.1137607e+00 -3.0317674e+00 -2.6605301e+00
  1.9669440e+00  3.9107588e-01 -3.6743505e+00  3.1654711e+00
  2.9152694e+00 -3.3426475e-02 -6.2896848e-01  4.2351496e-01
  9.7219682e-01  7.5878066e-01 -1.1812390e+00 -1.6225159e-01
 -2.2522054e+00 -2.5029373e+00  8.4004670e-01  1.4362350e+00
  2.5612805e+00  4.4987008e-01 -2.9342649e+00 -1.9887673e+00
 -9.1348392e-01  2.7184153e+00  2.3170495e+00  3.2989283e+00
  1.9566458e+00 -5.8452253e+00  9.2251801e-01  1.1729634e+0

#### Word index

Another useful attribute named `index_to_key` (called `index2word` before Gensim 4) can help us find the word at a specific index.

In [31]:
# Find a word at a specific index location eg. 15
print("At index 15 is the word: {}".format(model.wv.index_to_key[15]))

At index 15 is the word: movie


#### Index of word

By looking up a word in the vocabulary, we can get its index.

In [34]:
# Find the index of a particular word in the vocab.
word = 'movie'
if word in model.wv.key_to_index:
    print("Index of word '{}' is: {}".format(word, model.wv.key_to_index[word]))
else:
    print("Word '{}' not found in the vocabulary.".format(word))

Index of word 'movie' is: 15


#### Saving model into array of words and vectors

To make it easier to work with the words in the vocab and their vectors, we will save them into 2 separate arrays, one for words and one for vectors.

First, create a zero array with the same number of columns as our embedding dimension (100 for this model, since the `Word2Vec` call above used the default size) and 2 more rows than the vocabulary: 1 row for the padding character '-' at the beginning and another for 'unk' (used for unknown words) at the end. Padding lets us fix the length of every sequence: anything shorter than our max length is padded with '-', which has an all-zero vector. We choose '-' because it cannot appear in our vocab (it was replaced during preprocessing), so it causes no confusion with other words when training the RNN model. Any word that is not found in the list takes the index of 'unk', which also has an all-zero vector.

Then loop through the whole vocabulary and add each word to `word_list` and its vector to `word_vector`.

In [35]:
# The words in the vocabulary and their vectors are saved into 2 separate arrays for easier manipulation.

vec_dim = 100  # Dimensionality of the trained word2vec vectors (gensim default, since vector_size was not passed)
word_list = []
word_vector = np.zeros((len(model.wv) + 2, vec_dim))  # add '-' as padding and 'unk' as unknown
# Add padding value '-'
word_list.append('-')  
# Iterate through the word vectors
for i, word in enumerate(model.wv.index_to_key):
    vector = model.wv.get_vector(word)
    word_vector[i + 1] = vector
    word_list.append(word)
# Add 'unk' as unknown
word_list.append('unk')

In [36]:
# Find the number of words in the word_list and word_vector shape.

print("Number of words in list: {}".format(len(word_list)))
print("Shape of word vector: {}".format(word_vector.shape))

Number of words in list: 24802
Shape of word vector: (24802, 100)


In [37]:
# Get the first word from the vocab list and its respective vector value.

print("First word in our vocab: {}\n".format(word_list[0]))
print("Its vector values: \n{}".format(word_vector[0]))

First word in our vocab: -

Its vector values: 
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.]


In [38]:
# Get the last word from the vocab list and its respective vector value.

print("Last word in our vocab: {}".format(word_list[-1]))
print("Its vector values: \n{}".format(word_vector[-1]))

Last word in our vocab: unk
Its vector values: 
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.]
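
To show how the padding row and the 'unk' row are meant to be used later, here is a minimal sketch that turns a tokenized sentence into a fixed-length list of indices. The toy word list, the helper `encode`, and the choice to pad at the end are assumptions for illustration.

```python
def encode(tokens, word_to_index, max_len, pad_index, unk_index):
    """Map tokens to indices, truncate to max_len, and pad shorter sequences."""
    ids = [word_to_index.get(token, unk_index) for token in tokens[:max_len]]
    return ids + [pad_index] * (max_len - len(ids))

# Toy list with the same layout as word_list: '-' first, 'unk' last
toy_word_list = ['-', 'the', 'movie', 'was', 'good', 'unk']
word_to_index = {word: i for i, word in enumerate(toy_word_list)}
pad_index = 0
unk_index = len(toy_word_list) - 1

encoded = encode(['the', 'movie', 'was', 'superb'], word_to_index,
                 max_len=6, pad_index=pad_index, unk_index=unk_index)
print(encoded)
```

The out-of-vocabulary word `superb` maps to the 'unk' index, and the remaining positions are filled with the padding index, both of which point at all-zero rows in the vector array.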


Use the code below to save them for later use. (We already saved them into `data` folder.)

`np.save('./data/word_list_gensim_w2v', word_list)
np.save('./data/word_vector_gensim_w2v', word_vector)`

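To load them, use the code below:

In [39]:
# Load word list and word vector (np.save appends the .npy extension to the file names).

load_word_list = np.load('./data/word_list_gensim_w2v.npy')
load_word_vector = np.load('./data/word_vector_gensim_w2v.npy')

# Convert back to a Python list so that .index() can be used
load_word_list = load_word_list.tolist()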

Now, let's try to check if our saved values are consistent with the original model.

Let's get the vector of the word 'movie' in our model and try to compare them with the same word in our saved list:

In [40]:
# Check the values and compare with the original model values.

print("Index of word 'movie' in word_list: {}".format(load_word_list.index("movie")))
print("Vector of word 'movie':\n{}".format(load_word_vector[16]))

Index of word 'movie' in word_list: 16
Vector of word 'movie':
[-5.76919913e-01  1.14799953e+00 -1.02337909e+00  3.49096155e+00
 -3.16083336e+00  4.26821375e+00  3.60456228e-01  2.43197560e-01
 -2.66964412e+00 -1.80971980e-01  5.28455317e-01 -2.01284483e-01
  1.63789821e+00 -5.86945176e-01  1.43384993e+00 -2.10192418e+00
  4.92849976e-01  1.55995762e+00 -7.81962574e-01  2.11069226e+00
 -1.85667074e+00 -1.22883868e+00 -4.69387484e+00 -7.76752532e-02
  5.33337653e-01 -1.66314662e+00 -9.02125716e-01  2.32648230e+00
  3.35820866e+00 -2.99598670e+00 -2.45947719e+00  1.05416226e+00
 -2.42220879e+00  3.11376071e+00 -3.03176737e+00 -2.66053009e+00
  1.96694398e+00  3.91075879e-01 -3.67435050e+00  3.16547108e+00
  2.91526937e+00 -3.34264748e-02 -6.28968477e-01  4.23514962e-01
  9.72196817e-01  7.58780658e-01 -1.18123901e+00 -1.62251592e-01
 -2.25220537e+00 -2.50293732e+00  8.40046704e-01  1.43623495e+00
  2.56128049e+00  4.49870080e-01 -2.93426490e+00 -1.98876727e+00
 -9.13483918e-01  2.7184152

If the vector of the word `movie` in the model is the same as the vector in our `word_vector` array, the subtraction below yields all zeros, which means that our array is correct.

In [41]:
# Subtracting the loaded vector from the model's vector for 'movie' should give all zeros.

print("Result: \n{}".format(model.wv.get_vector('movie') - load_word_vector[16]))

Result: 
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.]




## `fastText` embedding

Next, let's learn embeddings using the `fastText` algorithm with Gensim. We will use the same data prepared above to train the embedding.

### Training word embeddings

In [42]:
import logging
from gensim.models import FastText

num_feature = 50
min_word_count = 20
num_thread = 5
window_size = 10
down_sampling = 0.001
iteration = 20

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model_fastText = FastText(sentences, 
                          vector_size=num_feature,  # Using vector_size instead of size
                          window=window_size, 
                          min_count=min_word_count, 
                          workers=num_thread,
                          min_n=3,  # Minimum length of char n-grams
                          max_n=6)  # Maximum length of char n-grams

2024-02-20 17:39:09,509 : INFO : collecting all words and their counts
2024-02-20 17:39:09,509 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-02-20 17:39:09,550 : INFO : PROGRESS: at sentence #10000, processed 186930 words, keeping 17241 word types
2024-02-20 17:39:09,591 : INFO : PROGRESS: at sentence #20000, processed 370700 words, keeping 24543 word types
2024-02-20 17:39:09,631 : INFO : PROGRESS: at sentence #30000, processed 555660 words, keeping 29804 word types
2024-02-20 17:39:09,674 : INFO : PROGRESS: at sentence #40000, processed 738919 words, keeping 34397 word types
2024-02-20 17:39:09,723 : INFO : PROGRESS: at sentence #50000, processed 925692 words, keeping 38424 word types
2024-02-20 17:39:09,757 : INFO : PROGRESS: at sentence #60000, processed 1106441 words, keeping 41571 word types
2024-02-20 17:39:09,823 : INFO : PROGRESS: at sentence #70000, processed 1290985 words, keeping 44601 word types
2024-02-20 17:39:09,877 : INFO : PROGRESS: a

### Model checking

#### Word similarity

Let's try to find similar words to `character`:

In [43]:
model_fastText.wv.most_similar('character')

[('characterisation', 0.8808436989784241),
 ('characterize', 0.8790835738182068),
 ('characteristically', 0.8554302453994751),
 ('characterization', 0.8502060770988464),
 ('characteristic', 0.8450654745101929),
 ('protagonist', 0.837716281414032),
 ('uncharacteristically', 0.8267769813537598),
 ('characterized', 0.8229421973228455),
 ('characteristics', 0.8223468661308289),
 ('characterisations', 0.8144554495811462)]

For the word `citi`, with the `word2vec` model, we can't find any similar words to it because it doesn't exist in the vocabulary.

In [44]:
# Try to find the word 'citi' in the word2vec model vocab.
try:
    print(model.wv.get_vector('citi'))
except KeyError:
    print("Word 'citi' doesn't exist in word2vec vocab.")

Word 'citi' doesn't exist in word2vec vocab.


In [45]:
# Try to get a vector for the word 'citi' from the fastText model.

print("Vector for word 'citi' in fastText vocab: \n{}".format(model_fastText.wv.get_vector('citi')))

Vector for word 'citi' in fastText vocab: 
[-0.35331836 -1.8534994  -1.687759   -3.1684115  -2.7455697  -0.14201069
  1.5212997   5.6288514   1.2461898   0.2898962   1.9896507  -1.5007322
 -1.9811895   1.472321    1.8036091   0.928356   -0.7877016  -2.7003706
 -0.6201223   3.3970153  -0.707863   -0.30301785 -1.1019266  -0.35861874
 -1.58039     1.8841965   1.2614226  -0.0312026   2.7832553  -2.8458462
  0.55815643  1.395603    1.077099   -2.1922326   2.1213596  -0.8281086
 -0.46570596 -1.0203273   0.18812998 -1.2443421  -1.8714466  -0.50708544
  2.3123326   2.406491   -4.420427    0.11099686  1.2977874   1.8224685
  1.0413573   2.5069156 ]


We can see that `fastText` can still produce a vector for `citi`, which `word2vec` could not, because it breaks words into sub-words (character n-grams) and builds the vector from them. As a result, it can also find similar words, as below:

In [47]:
# Find the most similar words for the word 'citi' from the vocab list.

model_fastText.wv.most_similar('citi')

[('citizens', 0.8885571956634521),
 ('citizen', 0.8578288555145264),
 ('europa', 0.8339913487434387),
 ('america', 0.8214999437332153),
 ('citys', 0.8171120882034302),
 ('europe', 0.8087190985679626),
 ('metropolitan', 0.803908109664917),
 ('civilian', 0.792152464389801),
 ('civil', 0.7896285057067871),
 ('civic', 0.786487877368927)]
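
The sub-word idea can be sketched by listing the character n-grams of a word, using the same `min_n=3` and `max_n=6` as in the training call. The helper `char_ngrams` is illustrative; the `<` and `>` boundary markers follow the fastText approach, while details such as n-gram hashing are left out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of the word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

citi_ngrams = char_ngrams("citi")
citizen_ngrams = char_ngrams("citizen")

# n-grams that the unseen word 'citi' shares with the known word 'citizen'
shared = sorted(set(citi_ngrams) & set(citizen_ngrams))

print(citi_ngrams)
print(shared)
```

Because `citi` shares several n-grams with `citizen`, a vector built from those n-grams ends up close to `citizen` and `citizens`, which matches the neighbours listed above.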

For the same analogy task used with the `word2vec` model, the most similar word is again `worst`.

In [48]:
# Analogy example 1: find the missing word given positive and negative inputs.
model_fastText.wv.most_similar(positive=['best','bad'],negative=['good'])

[('worst', 0.8682386875152588),
 ('dumbest', 0.8550361394882202),
 ('lamest', 0.803633451461792),
 ('laziest', 0.7880313396453857),
 ('ugliest', 0.7783506512641907),
 ('zest', 0.777234673500061),
 ('silliest', 0.7568947076797485),
 ('horst', 0.7552919983863831),
 ('tiniest', 0.7524278163909912),
 ('weakest', 0.7484068274497986)]

Unfortunately, the `fastText` model doesn't work well on some analogy tasks (example below), whereas the `word2vec` model handled the same query correctly.

In [49]:
# Analogy example 2: find the missing word given positive and negative inputs.
model_fastText.wv.most_similar(positive=['berlin','france'],negative=['germany'])

[('brynner', 0.8524536490440369),
 ('palmer', 0.8439511060714722),
 ('kavner', 0.8348411321640015),
 ('daphne', 0.8340691328048706),
 ('merlin', 0.8293867111206055),
 ('laura', 0.826379120349884),
 ('della', 0.8252083659172058),
 ('gabrielle', 0.8237994909286499),
 ('roscoe', 0.8233895897865295),
 ('darlene', 0.8228525519371033)]

#### Different from the group

The result is the same as with the `word2vec` model.

In [50]:
print(model_fastText.wv.doesnt_match("terrible bad horrible good".split()))

good


To get the word vector and word index, the syntax is the same as with `word2vec` above.

#### Saving model into array of words and vectors

Finally, let's save the vocabulary and the corresponding vectors for the later RNN model.

In [51]:
# Saving the model into array of words and vectors. 
vec_dim = 50
word_list_fastText = []
word_vector_fastText = np.zeros((len(model_fastText.wv.key_to_index) + 2, vec_dim)) # add '-' as padding character and 'unk' as unknown
word_list_fastText.append('-') # padding value

for i, word in enumerate(model_fastText.wv.index_to_key):
    vector = model_fastText.wv.get_vector(word)
    word_vector_fastText[i+1] = vector
    word_list_fastText.append(word)

word_list_fastText.append('unk')


Let's check the number of words in our vocabulary and vector shape.

In [52]:
# Find the number of words in the list and its shape of the word vector.

print("Number of words in list: {}".format(len(word_list_fastText)))
print("Shape of word vector: {}".format(word_vector_fastText.shape))

Number of words in list: 24802
Shape of word vector: (24802, 50)
