A word embedding model maps a given word to a numerical vector. Using Gensim's downloader API, we can download pre-built word embedding models such as word2vec, fastText, GloVe and ConceptNet. These are trained on large corpora of commonly occurring text, such as Wikipedia and Google News.

However, if we are working in a specialized niche such as technical documents, the pre-built models may not contain embeddings for all of our words. In such cases, it's desirable to train our own model.
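
One quick way to judge whether a pre-built model is good enough is to measure how much of the domain vocabulary it covers. The sketch below is only an illustration: `vocabulary_coverage`, `pretrained_vocab` and `domain_tokens` are hypothetical names and toy data, and the vocabulary is a plain Python set rather than a real Gensim model.

```python
def vocabulary_coverage(tokens, vocabulary):
    """Return (coverage ratio, sorted list of out-of-vocabulary words)."""
    unique_tokens = set(tokens)
    missing = sorted(unique_tokens - vocabulary)
    covered = len(unique_tokens) - len(missing)
    # An empty document is treated as fully covered.
    ratio = covered / len(unique_tokens) if unique_tokens else 1.0
    return ratio, missing


# Hypothetical vocabulary of a general-purpose pre-trained model.
pretrained_vocab = {"the", "valve", "is", "connected", "to", "pump"}
# Hypothetical tokens from a technical document.
domain_tokens = ["the", "solenoid", "valve", "is", "connected", "to", "the", "plc"]

coverage, missing_words = vocabulary_coverage(domain_tokens, pretrained_vocab)
print(f"coverage: {coverage:.2f}")
print("missing:", missing_words)
```

If many of the words that matter in your domain show up in the missing list, that is a reasonable signal to train or update a model on your own corpus.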

In [1]:
import gensim
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api
from multiprocessing import cpu_count

In [15]:
import warnings
warnings.simplefilter(action='ignore')

In [3]:
dataset=api.load('text8')

In [4]:
data=[wd for wd in dataset]

In [8]:
data[0][:5]

['anarchism', 'originated', 'as', 'a', 'term']

In [9]:
# Split the data into 2 parts; the second part will be used later to update the model.
data_part1=data[:1000]
data_part2=data[1000:]

In [10]:
# Train the word2vec model. The default vector length is 100.
model=Word2Vec(data_part1,min_count=0,workers=cpu_count())

In [16]:

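# Look up the vector through model.wv; indexing the model directly is deprecated and was removed in Gensim 4.
model.wv['topic']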

array([ 0.5913536 , -0.4669403 ,  0.2300151 , -0.0326189 ,  0.82428384,
        1.0622251 ,  0.41448697,  0.3202088 , -0.4964485 ,  1.2178779 ,
        0.36397702, -0.32891572,  0.09694634,  0.5569295 , -1.1447318 ,
        0.592945  , -0.7436546 , -0.3540702 , -0.18498786, -0.4505671 ,
        0.26177293,  0.59536797, -0.302518  , -1.5558119 ,  0.18977602,
       -0.57413954, -1.725028  , -1.4557137 ,  0.10832078, -0.59584504,
        0.764049  , -0.40850034, -0.29667923, -0.90620416, -0.8371268 ,
       -0.8163875 ,  0.5548657 , -0.9921423 ,  0.65224427,  0.89225054,
       -0.64238834,  0.0553678 ,  0.23970193, -0.47748867,  0.2895233 ,
       -0.6060436 , -0.54446214, -0.28231296,  0.3759572 ,  0.4926797 ,
        0.691612  , -0.31964317,  1.7561483 , -0.5470852 ,  0.3934648 ,
        0.05594723, -0.85763055, -0.43768674, -0.72433746,  0.6842822 ,
       -0.94029146, -0.11017247, -0.5186951 ,  0.14269929, -0.83399725,
       -0.2777898 , -0.06652811, -0.9678803 , -0.4636435 ,  0.66

In [18]:
model.wv.most_similar('topic')

[('discussion', 0.7643040418624878),
 ('consensus', 0.7583703994750977),
 ('interpretation', 0.7362848520278931),
 ('speculation', 0.7311907410621643),
 ('discussions', 0.7196162939071655),
 ('discourse', 0.7167651653289795),
 ('explanation', 0.7145833969116211),
 ('opinion', 0.7082844972610474),
 ('debate', 0.701775074005127),
 ('viewpoint', 0.6989355683326721)]
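
The scores in this output are cosine similarities between the query vector and every other word vector. As a rough illustration of that ranking (not Gensim's actual implementation), here is a pure-Python sketch; the three-dimensional `toy_vectors` and both function names are made up for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=2):
    """Rank every word except the query by cosine similarity to it."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 3-dimensional embeddings, for illustration only.
toy_vectors = {
    "topic": [1.0, 0.0, 1.0],
    "discussion": [0.9, 0.1, 0.8],
    "banana": [-1.0, 1.0, 0.0],
}

ranked = most_similar("topic", toy_vectors)
print(ranked)
```

A score close to 1 means the two vectors point in nearly the same direction, which is why related words such as 'discussion' and 'debate' rank highest for 'topic'.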

#### Save the model

In [None]:
#model.save('newmodel')

#### Load the model


In [None]:
#model=Word2Vec.load('newmodel')

## Update existing word2vec model

When a new dataset arrives, you may want to update the model so that it accounts for the new words.

On an existing Word2Vec model, call **build_vocab()** (with update=True) on the new dataset and then call the **train()** method. build_vocab() is called first because the model has to be told which new words to expect in the incoming corpus.

In [19]:
model.build_vocab(data_part2,update=True)
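
Conceptually, the vocabulary update scans the new corpus, counts its words, and registers those the model has not seen before. The sketch below imitates that idea with the standard library only. It is a simplified illustration under stated assumptions, not what Gensim does internally; `new_words`, `existing_vocab` and `incoming` are hypothetical names and toy data.

```python
from collections import Counter


def new_words(existing_vocab, new_sentences, min_count=1):
    """Return {word: count} for unseen words that occur at least min_count times."""
    counts = Counter(word for sentence in new_sentences for word in sentence)
    return {
        word: count
        for word, count in counts.items()
        if count >= min_count and word not in existing_vocab
    }


# Hypothetical vocabulary learned from the first part of the corpus.
existing_vocab = {"anarchism", "originated", "as", "a", "term"}
# Hypothetical tokenized sentences from the new data.
incoming = [["a", "new", "term"], ["new", "words", "originated"]]

added = new_words(existing_vocab, incoming)
print(added)
```

Raising `min_count` in this sketch drops rare newcomers: with `min_count=2`, only 'new' would be added.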

In [20]:
# model.epochs replaces the deprecated model.iter attribute, which was removed in Gensim 4.
model.train(data_part2,total_examples=model.corpus_count,epochs=model.epochs)

(26274353, 35026035)

In [21]:
model.wv['topic']

array([ 9.1000140e-01, -5.9464908e-01,  7.8773774e-02,  1.8042450e-01,
        1.0738237e+00,  1.0348794e+00,  4.9057728e-01,  6.9823545e-01,
       -7.2200233e-01,  8.6618423e-01, -2.3714249e-01, -1.1065097e+00,
       -4.6098250e-01,  8.5288048e-02, -1.1391873e+00,  8.6345319e-03,
       -1.4106318e+00, -9.1770929e-01, -1.6538605e-01,  1.7640872e-01,
        1.5542947e-01,  4.2729264e-01, -6.2255305e-01, -2.2827628e+00,
       -7.1614474e-02, -7.8251725e-01, -2.4768395e+00, -2.0002849e+00,
        4.1942349e-01, -5.5499250e-01,  3.4081087e-01, -1.1283801e+00,
        9.8258637e-02, -2.0512664e+00, -1.2708035e+00, -1.2356541e+00,
        1.0830261e+00, -6.1893833e-01,  1.0266517e+00,  1.3877740e+00,
       -4.4582039e-01, -6.7985398e-03,  5.3191161e-01, -6.1662298e-01,
        7.5019562e-01, -2.1223310e-01, -1.2417248e+00, -1.9535630e-03,
        3.3657765e-01,  8.5073608e-01,  1.2846785e+00, -3.8802296e-01,
        1.4071589e+00, -1.0248456e+00,  2.7813846e-01,  8.4242068e-02,
      

In [22]:
len(model.wv['topic'])

100