## Training Embeddings Using Gensim
Word embeddings are an approach to representing text in NLP. In this notebook we demonstrate how to train embeddings using Gensim. [Gensim](https://radimrehurek.com/gensim/index.html) is an open-source Python library for natural language processing, with a focus on topic modeling (explained in Chapter 7).

In [None]:
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings('ignore')


In [None]:
# Define the training data.
# Gensim's Word2Vec expects a 'list of lists' for training: the outer list holds the documents,
# and each inner list holds the tokens of one document.
corpus = [['dog','bites','man'], ["man", "bites" ,"dog"],["dog","eats","meat"],["man", "eats","food"]]

# Train the models
model_cbow = Word2Vec(corpus, min_count=1, sg=0)  # sg=0: CBOW architecture
model_skipgram = Word2Vec(corpus, min_count=1, sg=1)  # sg=1: SkipGram architecture
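
The 'list of lists' format described in the comments above can be produced from raw strings. The following is a minimal sketch, assuming whitespace tokenization is good enough for the data; `raw_documents` and `tokenize` are hypothetical names introduced only for illustration, and real text usually needs a proper tokenizer.

```python
# Hypothetical raw documents; in practice these would come from your own data.
raw_documents = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]

def tokenize(document):
    """Lowercase a document and split it on whitespace (a deliberately naive tokenizer)."""
    return document.lower().split()

# One inner list of tokens per document, which is the shape Word2Vec expects.
corpus_from_text = [tokenize(doc) for doc in raw_documents]
print(corpus_from_text)
```

The resulting list has the same shape as the hand-written `corpus` and could be passed to `Word2Vec` in the same way.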



## Continuous Bag of Words (CBOW) 
In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which the center word appears.
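
To make the prediction task concrete, here is a minimal sketch of how (context words, center word) training examples could be framed for one sentence. The helper `cbow_pairs` and its `window` argument are hypothetical and written only for illustration; this is not how Gensim builds its examples internally, and refinements used by real implementations are omitted.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, center_word) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before the center
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after the center
        pairs.append((left + right, center))
    return pairs

# Each sentence position yields one example: all context words jointly predict the center word.
for context, center in cbow_pairs(['dog', 'bites', 'man'], window=1):
    print(context, '->', center)
```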

In [None]:
# Summarize the trained model
print(model_cbow)

# Summarize the vocabulary (Gensim 3.x API; in Gensim 4.x use model_cbow.wv.key_to_index)
words = list(model_cbow.wv.vocab)
print(words)

# Access the vector for one word
print(model_cbow.wv['dog'])


Word2Vec(vocab=6, size=100, alpha=0.025)
['dog', 'bites', 'man', 'eats', 'meat', 'food']
[-4.5604608e-03 -4.1865492e-03  4.7075828e-03  3.8018953e-03
 -3.3744269e-03  4.5191548e-03 -4.7336686e-03  2.7872496e-03
 -4.4913584e-04 -4.5465166e-03  3.7044188e-04  4.8339888e-03
  1.1151227e-03  6.5007032e-04  4.6296185e-03  1.7534556e-03
  2.9777489e-03  1.3111510e-03 -1.5691645e-03  3.9113029e-03
 -3.4971205e-03 -4.5216680e-03  4.5007714e-03 -4.6802913e-03
 -2.6713819e-03  3.1111734e-03 -4.6404875e-03 -2.6754229e-03
  1.1636510e-03 -2.7437278e-03 -3.5955142e-03  2.5860409e-03
  4.8808232e-03  4.6369997e-03 -3.3008356e-03  4.7991946e-03
 -3.0183028e-03  3.5763083e-03  2.4996283e-03  3.4738888e-03
  2.4487043e-03 -3.1419268e-03 -4.5127183e-04  7.8748312e-04
  4.2158621e-03 -3.6303843e-03 -4.1386588e-03 -9.4168622e-04
 -1.2478436e-03 -9.2951243e-04 -3.0577860e-03  4.3834057e-03
 -3.4839928e-03 -3.5775993e-03  3.0613912e-03  2.3927158e-03
 -4.3483726e-03 -4.4364594e-03  2.8029773e-03  1.0735807e

In [None]:
# Compute similarity
print("Similarity between eats and bites:", model_cbow.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:", model_cbow.wv.similarity('eats', 'man'))


Similarity between eats and bites: 0.121292144
Similarity between eats and man: -0.22350481


From the similarity scores above, we can conclude that 'eats' is more similar to 'bites' than to 'man'.
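
The scores reported by `similarity` are cosine similarities between the two word vectors. As a rough sketch of what that means, the snippet below computes cosine similarity by hand on made-up 3-dimensional vectors; the vectors and the helper `cosine_similarity` are assumptions for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, made up for illustration only.
eats = [1.0, 2.0, 0.0]
bites = [2.0, 4.0, 0.0]   # points in the same direction as `eats`
man = [-2.0, 1.0, 0.0]    # perpendicular to `eats`

print(cosine_similarity(eats, bites))  # close to 1: same direction
print(cosine_similarity(eats, man))    # close to 0: unrelated directions
```

Scores lie between -1 and 1, and a higher score means the two vectors point in more similar directions.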

In [None]:
# Most similar words
model_cbow.wv.most_similar('meat')

[('bites', 0.05448414385318756),
 ('food', 0.05061908811330795),
 ('eats', 0.033351119607686996),
 ('man', 0.016889430582523346),
 ('dog', -0.1078210175037384)]

In [None]:
# save model
model_cbow.save('model_cbow.bin')

# load model
new_model_cbow = Word2Vec.load('model_cbow.bin')
print(new_model_cbow)

Word2Vec(vocab=6, size=100, alpha=0.025)


## SkipGram
In skipgram, the task is to predict the context words from the center word.
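
As an illustration of this task, the sketch below lists the (center word, context word) training pairs that one sentence could yield. The helper `skipgram_pairs` and its `window` argument are hypothetical names used only here; Gensim's internal example generation includes refinements that are left out.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center_word, context_word) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Every (center, context) combination inside the window is a separate prediction.
print(skipgram_pairs(['dog', 'bites', 'man'], window=1))
```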

In [None]:
# Summarize the trained model
print(model_skipgram)

# Summarize the vocabulary (Gensim 3.x API; in Gensim 4.x use model_skipgram.wv.key_to_index)
words = list(model_skipgram.wv.vocab)
print(words)

# Access the vector for one word
print(model_skipgram.wv['dog'])


Word2Vec(vocab=6, size=100, alpha=0.025)
['dog', 'bites', 'man', 'eats', 'meat', 'food']
[-4.5604608e-03 -4.1865492e-03  4.7075828e-03  3.8018953e-03
 -3.3744269e-03  4.5191548e-03 -4.7336686e-03  2.7872496e-03
 -4.4913584e-04 -4.5465166e-03  3.7044188e-04  4.8339888e-03
  1.1151227e-03  6.5007032e-04  4.6296185e-03  1.7534556e-03
  2.9777489e-03  1.3111510e-03 -1.5691645e-03  3.9113029e-03
 -3.4971205e-03 -4.5216680e-03  4.5007714e-03 -4.6802913e-03
 -2.6713819e-03  3.1111734e-03 -4.6404875e-03 -2.6754229e-03
  1.1636510e-03 -2.7437278e-03 -3.5955142e-03  2.5860409e-03
  4.8808232e-03  4.6369997e-03 -3.3008356e-03  4.7991946e-03
 -3.0183028e-03  3.5763083e-03  2.4996283e-03  3.4738888e-03
  2.4487043e-03 -3.1419268e-03 -4.5127183e-04  7.8748312e-04
  4.2158621e-03 -3.6303843e-03 -4.1386588e-03 -9.4168622e-04
 -1.2478436e-03 -9.2951243e-04 -3.0577860e-03  4.3834057e-03
 -3.4839928e-03 -3.5775993e-03  3.0613912e-03  2.3927158e-03
 -4.3483726e-03 -4.4364594e-03  2.8029773e-03  1.0735807e

In [None]:
# Compute similarity
print("Similarity between eats and bites:", model_skipgram.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:", model_skipgram.wv.similarity('eats', 'man'))


Similarity between eats and bites: 0.12128863
Similarity between eats and man: -0.22350654


From the similarity scores above, we can conclude that 'eats' is more similar to 'bites' than to 'man'.

In [None]:
# Most similar words
model_skipgram.wv.most_similar('meat')

[('bites', 0.05448414012789726),
 ('food', 0.05061909183859825),
 ('eats', 0.03328130766749382),
 ('man', 0.01688944548368454),
 ('dog', -0.1078210175037384)]

In [None]:
# Save the model
model_skipgram.save('model_skipgram.bin')

# Load the model back and print the reloaded copy
new_model_skipgram = Word2Vec.load('model_skipgram.bin')
print(new_model_skipgram)

Word2Vec(vocab=6, size=100, alpha=0.025)


## Training Your Embedding on Wiki Corpus

##### Corpus download page: https://dumps.wikimedia.org/enwiki/latest/
The entire Wikipedia corpus as of 28/04/2020 is just over 16GB in size.
Because of computational constraints, we train our Word2Vec and FastText embeddings on only a part of this corpus.


In [None]:
!mkdir -p data/en/
!wget -P data/en/ https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2

--2021-02-21 17:21:22--  https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.7, 2620:0:861:1:208:80:154:7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 293032519 (279M) [application/octet-stream]
Saving to: ‘data/en/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2’


2021-02-21 17:22:32 (3.98 MB/s) - ‘data/en/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2’ saved [293032519/293032519]



In [None]:
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.word2vec import Word2Vec
from gensim.models.fasttext import FastText
import time

In [None]:
# Prepare the training data
wiki = WikiCorpus('data/en/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2',
                  lemmatize=False, dictionary={})
# list(...) materializes the tokens of every article in memory
sentences = list(wiki.get_texts())


### Hyperparameters


1.   sg - Selects the training algorithm: 1 for skip-gram, 0 for CBOW. The default is CBOW.
2.   min_count - Ignores all words with a total frequency lower than this.<br>
There are many more hyperparameters; the full list can be found in the official documentation [here.](https://radimrehurek.com/gensim/models/word2vec.html)
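
To see what `min_count` does to the vocabulary, here is a minimal sketch that applies the same frequency cut-off by hand. The helper `filter_vocab` and the variable `toy_corpus` are hypothetical and exist only for this illustration; Gensim performs this filtering internally when it builds the vocabulary.

```python
from collections import Counter

def filter_vocab(sentences, min_count):
    """Return the set of words whose total corpus frequency is at least min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, count in counts.items() if count >= min_count}

toy_corpus = [['dog', 'bites', 'man'], ['man', 'bites', 'dog'],
              ['dog', 'eats', 'meat'], ['man', 'eats', 'food']]

# 'meat' and 'food' occur only once, so they are dropped when min_count=2.
print(sorted(filter_vocab(toy_corpus, min_count=2)))
```

On a large corpus, a higher `min_count` removes rare words, which shrinks the vocabulary that the model has to learn vectors for.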


In [None]:
#CBOW
start = time.time()
word2vec_cbow = Word2Vec(sentences,min_count=10, sg=0)
end = time.time()

print("CBOW Model Training Complete.\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))


CBOW Model Training Complete.
Time taken for training is:0.09 hrs 


In [None]:
# Summarize the trained model
print(word2vec_cbow)
print("-"*30)

# Summarize the vocabulary (Gensim 3.x API; in Gensim 4.x use word2vec_cbow.wv.key_to_index)
words = list(word2vec_cbow.wv.vocab)
print(words)
print("-"*30)

# Access the vector for one word
print(word2vec_cbow.wv['film'])
print("-"*30)

# Compute similarity
print("Similarity between film and drama:", word2vec_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:", word2vec_cbow.wv.similarity('film', 'tiger'))
print("-"*30)


Word2Vec(vocab=110997, size=100, alpha=0.025)
------------------------------
------------------------------
[ 0.14469117 -0.28921476  0.23295106 -1.8959092   0.59242994  2.3013368
  0.6362999  -0.47934872 -1.4844652  -3.2674859  -3.1097887   0.46180272
 -1.0115974  -0.29205173  1.3071846  -3.627418   -3.260393   -0.26122433
 -2.8342912  -0.95945543  3.363748    2.2227206   0.09122686 -0.37115598
 -1.7219776   3.3730228   2.008371    0.39256507 -2.3858988   0.83556646
  2.2703335   1.6889174  -0.59408367  1.2397051  -3.3127654  -0.5952282
 -0.33337146 -4.9783764  -0.44979894  0.13568197  0.68004054 -1.7465216
  1.4314096   1.0910078   1.1116912   1.0166246  -0.12953798  1.0931164
  2.4676175   2.5462422  -0.09013224  3.3413339   0.75600237 -1.851937
 -4.190646   -1.5449554   0.6143329   2.19413     0.6910196  -0.36486822
  1.4756849  -0.7670692   0.8687646  -1.8047935  -0.5015106  -1.3350308
  0.2546906   2.772934    4.9336224   0.03668461 -0.16972627  0.43100247
 -2.7304187   1.4655955

In [None]:
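# Save the word vectors in word2vec binary format
from gensim.models import Word2Vec, KeyedVectors
word2vec_cbow.wv.save_word2vec_format('word2vec_cbow.bin', binary=True)

# # Load the vectors (saved in word2vec format, so they are read back with KeyedVectors)
# new_word2vec_cbow = KeyedVectors.load_word2vec_format('word2vec_cbow.bin', binary=True)
# print(new_word2vec_cbow)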

In [None]:
#SkipGram
start = time.time()
word2vec_skipgram = Word2Vec(sentences,min_count=10, sg=1)
end = time.time()

print("SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))


SkipGram Model Training Complete
Time taken for training is:0.25 hrs 


In [None]:
# Summarize the trained model
print(word2vec_skipgram)
print("-"*30)

# Summarize the vocabulary (Gensim 3.x API; in Gensim 4.x use word2vec_skipgram.wv.key_to_index)
words = list(word2vec_skipgram.wv.vocab)
print(words)
print("-"*30)

# Access the vector for one word
print(word2vec_skipgram.wv['film'])
print("-"*30)

# Compute similarity
print("Similarity between film and drama:", word2vec_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:", word2vec_skipgram.wv.similarity('film', 'tiger'))
print("-"*30)


Word2Vec(vocab=110997, size=100, alpha=0.025)
------------------------------
------------------------------
[ 0.27225834 -0.5861906  -0.35500345  0.29826826 -0.19943865  0.34612972
 -0.8024322  -0.2913623  -0.31449452  0.19208223  0.31968918  0.2510302
  0.2255462   0.04457688  0.24135762  0.09339283  0.23359303 -0.04118087
 -0.06462657  0.27571356  0.61254907 -0.10575642  0.316353    0.5828546
  0.09691314 -0.00745691  0.508958   -0.20375773  0.08106556  0.520376
  0.3671975  -0.33622995  0.10092556 -0.26504773 -0.41886437  0.21565165
  0.23403569 -0.7975919  -0.76027215  0.6867329   0.03445792  0.08613819
  0.2853066   0.6391702  -0.18217704 -0.09379357  0.01560227  0.12126487
 -0.1057599  -0.39701825  0.04217564  0.3248062  -0.38682678 -0.15171234
 -0.2404829   0.02126346  0.19134267  0.7284584   0.6502726   0.06069236
  0.23028097 -0.80178106  0.03419147 -0.14850223 -0.39485648 -0.42202872
  0.3376493  -0.40125296  0.3554273  -0.54759246  0.01909317  0.27070278
 -0.03286394  0.1951

In [None]:
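# Save the SkipGram word vectors in word2vec binary format
word2vec_skipgram.wv.save_word2vec_format('word2vec_sg.bin', binary=True)

# # Load the vectors (saved in word2vec format, so they are read back with KeyedVectors)
# from gensim.models import KeyedVectors
# new_word2vec_skipgram = KeyedVectors.load_word2vec_format('word2vec_sg.bin', binary=True)
# print(new_word2vec_skipgram)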

## FastText

In [None]:

#CBOW
start = time.time()
fasttext_cbow = FastText(sentences, sg=0, min_count=10)
end = time.time()

print("FastText CBOW Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))


FastText CBOW Model Training Complete
Time taken for training is:0.39 hrs 


In [None]:
# Summarize the trained model
print(fasttext_cbow)
print("-"*30)

# Summarize the vocabulary (Gensim 3.x API; in Gensim 4.x use fasttext_cbow.wv.key_to_index)
words = list(fasttext_cbow.wv.vocab)
print(words)
print("-"*30)

# Access the vector for one word
print(fasttext_cbow.wv['film'])
print("-"*30)

# Compute similarity
print("Similarity between film and drama:", fasttext_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:", fasttext_cbow.wv.similarity('film', 'tiger'))
print("-"*30)


FastText(vocab=161018, size=100, alpha=0.025)
------------------------------
------------------------------
[-3.4158628   3.8855803   0.535912   -5.182645   -0.6096103   6.1879306
  1.9820771  -1.4699645   2.1031582  -1.8087566  -0.6309119  -1.205867
  1.5846244   0.22744241 -2.1966136   2.0511622  -0.6773196  -0.4715227
  3.9407995   6.199335   -2.6367686   1.1709683   2.7057931   4.855923
 -5.096699    7.433429    7.8346696   0.98290753  5.292873   -4.175929
 -4.130687    0.9335608  -5.3310313  -1.5800712   2.984793    0.28918087
  1.4197284  -0.89113504  1.6581714   1.1043363  -0.3220185   2.6870852
 -0.6005217  -2.289015    4.6048236  -0.65780896  2.1253297  -2.1278186
  0.41051725 -0.8623372   3.4963434   3.8041396  -1.9575641  -0.8581801
 -3.1491356  -2.6680999  -0.27547327 -0.25134414 -3.7401705  -0.40602767
 -4.5328755   1.3974704   5.138355   -0.500581   -3.237352    5.123986
 -1.2240024  -1.6047837  -1.6459205  -0.77467674  0.5509822   1.6679264
  4.729095    0.62219006 -2.73

In [None]:
#SkipGram
start = time.time()
fasttext_skipgram = FastText(sentences, sg=1, min_count=10)
end = time.time()

print("FastText SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))


FastText SkipGram Model Training Complete
Time taken for training is:0.65 hrs 


In [None]:
# Summarize the trained model
print(fasttext_skipgram)
print("-"*30)

# Summarize the vocabulary (Gensim 3.x API; in Gensim 4.x use fasttext_skipgram.wv.key_to_index)
words = list(fasttext_skipgram.wv.vocab)
print(words)
print("-"*30)

# Access the vector for one word
print(fasttext_skipgram.wv['film'])
print("-"*30)

# Compute similarity
print("Similarity between film and drama:", fasttext_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:", fasttext_skipgram.wv.similarity('film', 'tiger'))
print("-"*30)


FastText(vocab=161018, size=100, alpha=0.025)
------------------------------
------------------------------
[-4.01990078e-02  3.53444159e-01  2.54945934e-01  4.80403066e-01
 -3.21140260e-01  5.59775710e-01 -4.67194825e-01 -1.73437566e-01
 -1.64864406e-01 -2.49177516e-01 -4.84021157e-01  2.12465003e-01
 -2.15262547e-01  7.43608400e-02 -5.25858462e-01 -4.52829629e-01
  6.79721832e-02 -6.40648901e-02 -4.02468592e-01 -2.06037983e-01
 -4.81559843e-01 -3.20515335e-01  1.60730317e-01  5.23487292e-03
 -3.07535052e-01  7.72237599e-01 -5.03421545e-01  4.23307449e-01
 -6.49608374e-01 -1.15924791e-01 -6.47014454e-02 -3.46179813e-01
 -7.23404825e-01  1.36679158e-01  4.55667861e-02 -4.77901548e-01
  3.03246289e-01  3.38047385e-01  8.01058710e-02  1.11218736e-01
 -1.68238163e-01  2.86948115e-01 -1.24533847e-01  1.34248048e-01
  1.36137992e-01  8.41890052e-02  1.00599341e-01  4.17892247e-01
  2.27972612e-01  5.28719008e-01  9.70892459e-02  4.02288616e-01
 -1.97849020e-01 -5.34242280e-02 -3.59556358e-0

#### An interesting observation is that CBOW trains faster than SkipGram in both cases.
We leave it to the reader to figure out why. As a hint, review how CBOW and SkipGram work.