<a href="https://colab.research.google.com/github/c-w-m/pnlp/blob/master/Ch03/06_Training_embeddings_using_gensim.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 3.6: Training Embeddings Using __Gensim__
Word embeddings are an approach to representing text in NLP. In this notebook we demonstrate how to train embeddings using Gensim. [Gensim](https://radimrehurek.com/gensim/index.html) is an open source Python library for natural language processing, with a focus on topic modeling (explained in chapter 7).

In [1]:
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Define the training data.
# Gensim's Word2Vec expects a 'list of lists' for training: the outer list holds the documents,
# and each inner list holds the tokens of one document.
corpus = [['dog','bites','man'], ["man", "bites" ,"dog"],["dog","eats","meat"],["man", "eats","food"]]

# Train the models
model_cbow = Word2Vec(corpus, min_count=1, sg=0)  # sg=0: CBOW architecture
model_skipgram = Word2Vec(corpus, min_count=1, sg=1)  # sg=1: SkipGram architecture

## Continuous Bag of Words (CBOW) 
In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which the center word appears.
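
To make the task concrete, the sketch below lists the (context words, center word) examples that CBOW would learn from for one sentence of the toy corpus. The helper `cbow_examples` and the fixed `window=2` are illustrative assumptions, not part of Gensim, whose own implementation differs in its details.

```python
def cbow_examples(sentence, window=2):
    """Return (context_words, center_word) examples for one tokenized sentence.

    Hypothetical helper for illustration only; not a Gensim function.
    """
    examples = []
    for i, center in enumerate(sentence):
        left = sentence[max(0, i - window):i]      # up to `window` words before the center
        right = sentence[i + 1:i + 1 + window]     # up to `window` words after the center
        context = left + right
        if context:                                # skip one-word sentences
            examples.append((context, center))
    return examples


examples = cbow_examples(['dog', 'bites', 'man'])
for context, center in examples:
    print(context, '->', center)
# ['bites', 'man'] -> dog
# ['dog', 'man'] -> bites
# ['dog', 'bites'] -> man
```

Each sentence position yields a single example: all context words are combined to predict one center word.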

In [3]:
#Summarize the loaded model
print(model_cbow)

#Summarize vocabulary
#(wv.vocab is the gensim 3.x API; gensim 4.x replaces it with wv.key_to_index)
words = list(model_cbow.wv.vocab)
print(words)

#Access vector for one word
print(model_cbow.wv['dog'])

Word2Vec(vocab=6, size=100, alpha=0.025)
['dog', 'bites', 'man', 'eats', 'meat', 'food']
[ 1.5272563e-03  3.0823143e-03 -3.3804951e-03 -3.1428083e-03
  2.2466332e-03  6.2112015e-04 -2.7811723e-03 -2.9659085e-03
 -1.3845018e-03 -1.0679704e-03 -1.6333798e-03 -2.6655751e-03
  1.9629470e-03 -4.3159369e-03 -6.8872940e-04 -3.4440032e-03
  3.2118661e-03 -2.7149902e-03 -3.0569120e-03  2.4059096e-03
 -1.6733137e-03  2.8828131e-03  3.4532668e-03  4.0449455e-04
 -3.6279229e-03  2.5212732e-03 -4.9624010e-03  7.0500733e-05
 -1.5966175e-03  3.2820425e-03 -1.3758682e-03 -3.7259256e-04
  2.5749868e-03  3.1122605e-03 -4.9211895e-03  1.5680203e-03
  4.3737190e-04  3.4999684e-03 -3.2759327e-03 -3.0293607e-03
 -3.7514111e-03 -3.6232336e-03 -2.1212299e-03 -4.1303020e-03
  3.8243306e-03 -1.5868167e-03  1.5789388e-03  1.4703074e-03
 -4.7776653e-03 -4.7189468e-03 -4.3726638e-03 -4.6205176e-03
  3.9527213e-04  6.0655829e-04  2.0725809e-03 -3.2809997e-04
  2.9776883e-03 -4.4842032e-03 -4.9260687e-03 -8.0167019e

In [4]:
#Compute similarity
print("Similarity between eats and bites:",model_cbow.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:",model_cbow.wv.similarity('eats', 'man'))

Similarity between eats and bites: 0.10500215
Similarity between eats and man: -0.10587834


From the above similarity scores we can conclude that "eats" is more similar to "bites" than to "man".
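
The scores printed above are cosine similarities between word vectors, which range from -1 (opposite directions) to 1 (same direction). As a minimal sketch of what that measure computes, the pure-Python function below works on small made-up vectors; the name `cosine_similarity` and the toy vectors are assumptions for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors (real embeddings here have 100 dimensions)
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]    # same direction as a
c = [-2.0, 1.0, 0.0]   # orthogonal to a
d = [-1.0, -2.0, 0.0]  # opposite direction to a

print(cosine_similarity(a, b))  # close to 1
print(cosine_similarity(a, c))  # 0.0
print(cosine_similarity(a, d))  # close to -1
```

Because the measure depends only on direction, scaling a vector does not change its similarity to other vectors.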

In [5]:
#Most similar words
model_cbow.wv.most_similar('meat')

[('bites', 0.22404563426971436),
 ('eats', 0.10473474860191345),
 ('dog', 0.10331213474273682),
 ('man', 0.0980159193277359),
 ('food', -0.1473582684993744)]

In [6]:
# save model
model_cbow.save('model_cbow.bin')

# load model
new_model_cbow = Word2Vec.load('model_cbow.bin')
print(new_model_cbow)

Word2Vec(vocab=6, size=100, alpha=0.025)


## SkipGram
In skipgram, the task is to predict the context words from the center word.
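
As a rough illustration, the sketch below enumerates the (center word, context word) pairs that a skip-gram model would be trained on for one sentence of the toy corpus. The helper `skipgram_pairs` and the fixed `window=2` are assumptions made for this example; Gensim's internal implementation differs in its details.

```python
def skipgram_pairs(sentence, window=2):
    """Return (center_word, context_word) pairs for one tokenized sentence.

    Hypothetical helper for illustration only; not a Gensim function.
    """
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:                      # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs


pairs = skipgram_pairs(['dog', 'bites', 'man'])
for center, context in pairs:
    print(center, '->', context)
# dog -> bites
# dog -> man
# bites -> dog
# bites -> man
# man -> dog
# man -> bites
```

Here every center word produces one prediction per context word, so the three-word sentence yields six training pairs.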

In [7]:
#Summarize the loaded model
print(model_skipgram)

#Summarize vocabulary
words = list(model_skipgram.wv.vocab)
print(words)

#Access vector for one word
print(model_skipgram.wv['dog'])

Word2Vec(vocab=6, size=100, alpha=0.025)
['dog', 'bites', 'man', 'eats', 'meat', 'food']
[ 1.5272563e-03  3.0823143e-03 -3.3804951e-03 -3.1428083e-03
  2.2466332e-03  6.2112015e-04 -2.7811723e-03 -2.9659085e-03
 -1.3845018e-03 -1.0679704e-03 -1.6333798e-03 -2.6655751e-03
  1.9629470e-03 -4.3159369e-03 -6.8872940e-04 -3.4440032e-03
  3.2118661e-03 -2.7149902e-03 -3.0569120e-03  2.4059096e-03
 -1.6733137e-03  2.8828131e-03  3.4532668e-03  4.0449455e-04
 -3.6279229e-03  2.5212732e-03 -4.9624010e-03  7.0500733e-05
 -1.5966175e-03  3.2820425e-03 -1.3758682e-03 -3.7259256e-04
  2.5749868e-03  3.1122605e-03 -4.9211895e-03  1.5680203e-03
  4.3737190e-04  3.4999684e-03 -3.2759327e-03 -3.0293607e-03
 -3.7514111e-03 -3.6232336e-03 -2.1212299e-03 -4.1303020e-03
  3.8243306e-03 -1.5868167e-03  1.5789388e-03  1.4703074e-03
 -4.7776653e-03 -4.7189468e-03 -4.3726638e-03 -4.6205176e-03
  3.9527213e-04  6.0655829e-04  2.0725809e-03 -3.2809997e-04
  2.9776883e-03 -4.4842032e-03 -4.9260687e-03 -8.0167019e

In [8]:
#Compute similarity
print("Similarity between eats and bites:",model_skipgram.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:",model_skipgram.wv.similarity('eats', 'man'))

Similarity between eats and bites: 0.10498691
Similarity between eats and man: -0.10588615


From the above similarity scores we can conclude that "eats" is more similar to "bites" than to "man".

In [9]:
#Most similar words
model_skipgram.wv.most_similar('meat')

[('bites', 0.22404563426971436),
 ('eats', 0.10466405749320984),
 ('dog', 0.10331211984157562),
 ('man', 0.09801590442657471),
 ('food', -0.14735828340053558)]

In [10]:
# save model
model_skipgram.save('model_skipgram.bin')

# load model
new_model_skipgram = Word2Vec.load('model_skipgram.bin')
print(new_model_skipgram)

Word2Vec(vocab=6, size=100, alpha=0.025)


## Training Your Embedding on Wiki Corpus

##### The corpus download page : https://dumps.wikimedia.org/enwiki/20200120/
The entire wiki corpus as of 28/04/2020 is just over 16GB in size.
Because of computational constraints, we will train our Word2Vec and FastText embeddings on only a part of this corpus.

The file size is 294MB so it can take a while to download.

Source for code which downloads files from Google Drive: https://stackoverflow.com/questions/25010369/wget-curl-large-file-from-google-drive/39225039#39225039

In [11]:
import os
import requests

os.makedirs('data/en', exist_ok= True)
file_name = "data/en/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2"
file_id = "11804g0GcWnBIVDahjo5fQyc05nQLXGwF"

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

if not os.path.exists(file_name):
    download_file_from_google_drive(file_id, file_name)
else:
    print("file already exists, skipping download")

print(f"File at: {file_name}")

File at: data/en/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2


In [12]:
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.word2vec import Word2Vec
from gensim.models.fasttext import FastText
import time

In [13]:
#Preparing the Training data
wiki = WikiCorpus(file_name, lemmatize=False, dictionary={})
sentences = list(wiki.get_texts())

#If you get a memory error executing the lines above,
#comment them out and uncomment the lines below.
#Loading will be slower, but more stable.
# wiki = WikiCorpus(file_name, processes=4, lemmatize=False, dictionary={})
# sentences = list(wiki.get_texts())

#If you still get a memory error, try setting processes to 1 or 2 and run it again.

### Hyperparameters


1.   sg - Selects the training algorithm: 1 for skip-gram, 0 for CBOW. The default is CBOW.
2.   min_count - Ignores all words with a total frequency lower than this.<br>
There are many more hyperparameters; the full list can be found in the official documentation [here](https://radimrehurek.com/gensim/models/word2vec.html).
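
To see the effect of `min_count` in isolation, the sketch below counts tokens in the toy corpus from earlier in this notebook and keeps only the words that meet the threshold. The helper `build_vocab` is a hypothetical stand-in written for this example; it mimics the filtering rule described above and is not how Gensim builds its vocabulary internally.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Keep only the words whose total frequency is at least `min_count`.

    Hypothetical helper for illustration only; not a Gensim function.
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


corpus = [['dog', 'bites', 'man'], ['man', 'bites', 'dog'],
          ['dog', 'eats', 'meat'], ['man', 'eats', 'food']]

print(sorted(build_vocab(corpus, min_count=1)))
# ['bites', 'dog', 'eats', 'food', 'man', 'meat']
print(sorted(build_vocab(corpus, min_count=2)))
# ['bites', 'dog', 'eats', 'man']
```

With `min_count=2`, the words "meat" and "food" are dropped because each occurs only once. On the wiki corpus, `min_count=10` plays the same role and removes rare tokens.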


In [14]:
#CBOW
start = time.time()
word2vec_cbow = Word2Vec(sentences,min_count=10, sg=0)
end = time.time()

print("CBOW Model Training Complete.\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

CBOW Model Training Complete.
Time taken for training is:0.07 hrs 


In [15]:
#Summarize the loaded model
print(word2vec_cbow)
print("-"*30)

#Summarize vocabulary
words = list(word2vec_cbow.wv.vocab)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

#Access vector for one word
print(f"Length of vector: {len(word2vec_cbow.wv['film'])}")
print(word2vec_cbow.wv['film'])
print("-"*30)

#Compute similarity
print("Similarity between film and drama:",word2vec_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",word2vec_cbow.wv.similarity('film', 'tiger'))
print("-"*30)

Word2Vec(vocab=111150, size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111150
Printing the first 30 words.
['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']
------------------------------
Length of vector: 100
[-1.0506552  -0.02559625 -3.23259     1.488893    1.3472427  -0.71339667
 -1.3500955  -1.8422508   0.9123795  -0.33282     1.306864    1.256017
 -3.9414885  -3.2811587   0.92902136 -1.3302459   2.5690322  -4.466726
  0.31229255 -1.1977108  -3.6023428   1.7325197   0.60706854  0.6550589
  3.5523074   1.1816992  -1.67011     2.964605    3.1188414   0.46984217
  0.59819984 -1.4899861  -0.784721   -0.51103383  0.10948575 -1.6183871
 -1.2906426  -0.7345948  -0.692857    1.6596159   3.453691   -3.9252017
  2.4272904   2.8858774  

In [16]:
# save model
from gensim.models import Word2Vec, KeyedVectors   
word2vec_cbow.wv.save_word2vec_format('word2vec_cbow.bin', binary=True)

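# load model (vectors saved with save_word2vec_format are loaded through KeyedVectors)
# new_word2vec_cbow = KeyedVectors.load_word2vec_format('word2vec_cbow.bin', binary=True)
# print(new_word2vec_cbow)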

In [17]:
#SkipGram
start = time.time()
word2vec_skipgram = Word2Vec(sentences,min_count=10, sg=1)
end = time.time()

print("SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

SkipGram Model Training Complete
Time taken for training is:0.23 hrs 


In [18]:
#Summarize the loaded model
print(word2vec_skipgram)
print("-"*30)

#Summarize vocabulary
words = list(word2vec_skipgram.wv.vocab)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

#Access vector for one word
print(f"Length of vector: {len(word2vec_skipgram.wv['film'])}")
print(word2vec_skipgram.wv['film'])
print("-"*30)

#Compute similarity
print("Similarity between film and drama:",word2vec_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",word2vec_skipgram.wv.similarity('film', 'tiger'))
print("-"*30)

Word2Vec(vocab=111150, size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111150
Printing the first 30 words.
['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']
------------------------------
Length of vector: 100
[-0.40480924  0.19233392 -0.38905182  0.37828633  0.23791173 -0.69260263
 -0.07492838 -0.310254    0.3276737   0.21507213  0.2542622  -0.18547595
 -0.2083325  -0.365353    0.03026555 -0.64667964 -0.39256847 -0.95628494
 -0.3719754  -0.17749608  0.22393923  0.13561735 -0.02995735  0.36526108
  0.22936612  0.3042692  -0.21229504  0.3387915   0.48752365 -0.40663773
 -0.2855313  -0.15150711  0.24762839  0.16370678 -0.04403246  0.21982297
  0.55844146 -0.02884511 -0.00423493  0.40492105  0.725358   -0.3371495
  0.31590566 -0.133

In [19]:
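# save model
word2vec_skipgram.wv.save_word2vec_format('word2vec_sg.bin', binary=True)

# load model (vectors saved with save_word2vec_format are loaded through KeyedVectors)
# new_word2vec_skipgram = KeyedVectors.load_word2vec_format('word2vec_sg.bin', binary=True)
# print(new_word2vec_skipgram)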

## FastText

In [20]:
#CBOW
start = time.time()
fasttext_cbow = FastText(sentences, sg=0, min_count=10)
end = time.time()

print("FastText CBOW Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

FastText CBOW Model Training Complete
Time taken for training is:0.23 hrs 


In [21]:
#Summarize the loaded model
print(fasttext_cbow)
print("-"*30)

#Summarize vocabulary
words = list(fasttext_cbow.wv.vocab)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

#Access vector for one word
print(f"Length of vector: {len(fasttext_cbow.wv['film'])}")
print(fasttext_cbow.wv['film'])
print("-"*30)

#Compute similarity
print("Similarity between film and drama:",fasttext_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",fasttext_cbow.wv.similarity('film', 'tiger'))
print("-"*30)

FastText(vocab=111150, size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111150
Printing the first 30 words.
['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']
------------------------------
Length of vector: 100
[-3.1095068  -0.97738403 -0.4148361   2.1723313  -0.3926139  -2.9532897
  0.78257734  3.0079002  -0.04771179  6.0281415   3.063762   -8.705526
  0.34551618 -2.4340384   0.34824482 -2.81405     2.542777   -1.3970212
  5.048003   -1.101865   -3.5768595  -0.7213247  -1.4273902   0.29522827
  2.6626627  -1.0342059  -2.8289397  -3.7968981  -5.5886517  -1.1344062
 -0.89612794  0.22751266  2.4569476   0.54126287  1.3069663  -1.5429406
 -1.8660321   4.109258   -3.1627958  -1.6807796   2.527449    1.1229966
 -5.530825   -0.85293293 

In [22]:
#SkipGram
start = time.time()
fasttext_skipgram = FastText(sentences, sg=1, min_count=10)
end = time.time()

print("FastText SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

FastText SkipGram Model Training Complete
Time taken for training is:0.37 hrs 


In [23]:
#Summarize the loaded model
print(fasttext_skipgram)
print("-"*30)

#Summarize vocabulary
words = list(fasttext_skipgram.wv.vocab)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

#Access vector for one word
print(f"Length of vector: {len(fasttext_skipgram.wv['film'])}")
print(fasttext_skipgram.wv['film'])
print("-"*30)

#Compute similarity
print("Similarity between film and drama:",fasttext_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",fasttext_skipgram.wv.similarity('film', 'tiger'))
print("-"*30)

FastText(vocab=111150, size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111150
Printing the first 30 words.
['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']
------------------------------
Length of vector: 100
[-0.33846703  0.5514187   0.23116641  0.27032182 -0.38991967 -0.03107124
 -0.0388683   0.20263244  0.00909805  0.23460355 -0.49365014 -0.6210484
  0.13146438 -0.21467814 -0.28132766  0.17498098  0.24386536  0.25516975
  0.00322998  0.21797052 -0.30652273 -0.2812657  -0.19846335 -0.02550178
  0.42202958  0.04961881 -0.5410628   0.23720703 -0.11622997  0.36257976
  0.066438    0.51200646 -0.05544357 -0.43429697  0.1863742  -0.46424323
 -0.25711092 -0.47104245 -0.35568997  0.22489314  0.10952259 -0.5474739
 -0.11191456  0.2679

#### An interesting observation is that CBOW trains faster than SkipGram in both cases.
We leave it to the reader to figure out why. As a hint, revisit how CBOW and SkipGram work.