<a href="https://colab.research.google.com/github/c-w-m/pnlp/blob/master/Ch03/08_Training_Dov2Vec_using_Gensim.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 3.8: Training `Doc2Vec` Using __Gensim__
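This notebook demonstrates how to train a Doc2Vec model on a small custom corpus with Gensim, using both the distributed bag-of-words (DBOW, `dm=0`) and distributed memory (DM, `dm=1`) variants.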

In [1]:
import warnings
warnings.filterwarnings('ignore')
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from pprint import pprint
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
data = ["dog bites man",
        "man bites dog",
        "dog eats meat",
        "man eats food"]

# Tokenize each sentence and tag it with its index, so that Doc2Vec learns one vector per document
tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)])
               for i, doc in enumerate(data)]


In [3]:
tagged_data

[TaggedDocument(words=['dog', 'bites', 'man'], tags=['0']),
 TaggedDocument(words=['man', 'bites', 'dog'], tags=['1']),
 TaggedDocument(words=['dog', 'eats', 'meat'], tags=['2']),
 TaggedDocument(words=['man', 'eats', 'food'], tags=['3'])]

In [4]:
# Distributed bag-of-words (DBOW) variant: dm=0
model_dbow = Doc2Vec(tagged_data, vector_size=20, min_count=1, epochs=2, dm=0)


In [5]:
# Infer a feature vector for the tokenized document "man eats food"
print(model_dbow.infer_vector(['man', 'eats', 'food']))

[ 0.00833178 -0.01789699  0.02012537  0.00655591 -0.00948043 -0.01276905
  0.02016209 -0.01082841 -0.00845614  0.02440228  0.0107796  -0.00898186
 -0.00386237  0.00644664  0.0102197  -0.01220094 -0.00701983  0.01758396
 -0.00515178 -0.02301671]


In [6]:
model_dbow.wv.most_similar("man", topn=5)  # top 5 most similar words

[('food', 0.521940290927887),
 ('bites', 0.322302907705307),
 ('dog', 0.2668944001197815),
 ('eats', 0.21319103240966797),
 ('meat', 0.009995978325605392)]

In [7]:
model_dbow.wv.n_similarity(["dog"], ["man"])

0.26689443
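
For intuition, `n_similarity` returns the cosine similarity between the mean vectors of the two word lists, which is why the value above matches the `dog` score from `most_similar("man")`. The following is a minimal pure-Python sketch of that computation, not Gensim's own code: the `toy_vectors` values and the helpers `mean_vector`, `cosine`, and `n_similarity_sketch` are hypothetical names introduced only for illustration.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def n_similarity_sketch(vectors, words_a, words_b):
    """Cosine similarity between the mean vectors of two word lists."""
    mean_a = mean_vector([vectors[w] for w in words_a])
    mean_b = mean_vector([vectors[w] for w in words_b])
    return cosine(mean_a, mean_b)

# Made-up 3-dimensional vectors, NOT taken from the trained model
toy_vectors = {
    "dog": [1.0, 0.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "eats": [1.0, 1.0, 0.0],
}

# Orthogonal vectors have zero similarity
print(n_similarity_sketch(toy_vectors, ["dog"], ["man"]))
# The mean of "dog" and "man" points in the same direction as "eats"
print(n_similarity_sketch(toy_vectors, ["dog", "man"], ["eats"]))
```

With single-word lists the mean is just the word's own vector, so the result reduces to plain cosine similarity between the two words.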

In [8]:
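# Distributed memory (DM) variant: dm=1
# Note: the outputs below are nearly identical to the DBOW run, which suggests that
# two epochs on four short sentences leave the vectors close to their initial values.
model_dm = Doc2Vec(tagged_data, min_count=1, vector_size=20, epochs=2, dm=1)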

print("Inference Vector of man eats food\n ",model_dm.infer_vector(['man','eats','food']))

print("Most similar words to man in our corpus\n",model_dm.wv.most_similar("man",topn=5))
print("Similarity between man and dog: ",model_dm.wv.n_similarity(["dog"],["man"]))

Inference Vector of man eats food
  [ 0.00833187 -0.01789691  0.02012526  0.0065558  -0.00948052 -0.01276907
  0.02016215 -0.01082829 -0.00845597  0.02440219  0.0107796  -0.00898175
 -0.00386238  0.00644674  0.01021964 -0.01220085 -0.00701991  0.01758401
 -0.00515177 -0.02301681]
Most similar words to man in our corpus
 [('food', 0.521940290927887), ('bites', 0.322302907705307), ('dog', 0.2668944001197815), ('eats', 0.21319103240966797), ('meat', 0.009995978325605392)]
Similarity between man and dog:  0.26689443


What happens when we compare words that are not in the vocabulary?

In [9]:
try:
    model_dm.wv.n_similarity(['covid'],['man'])
except KeyError as err:
    print("model_dm.wv.n_similarity(['covid'],['man']): {}".format(err))

model_dm.wv.n_similarity(['covid'],['man']): "word 'covid' not in vocabulary"
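
One possible way to avoid the `KeyError` is to drop out-of-vocabulary tokens before calling `n_similarity`. The sketch below assumes a hypothetical helper, `known_tokens`, and uses a plain Python set as a stand-in for the model's vocabulary so that it runs without a trained model. With a real model you would likely pass `model_dm.wv.key_to_index` (Gensim 4.x) or `model_dm.wv.vocab` (Gensim 3.x) instead.

```python
def known_tokens(tokens, vocabulary):
    """Keep only the tokens that appear in `vocabulary` (hypothetical helper)."""
    return [token for token in tokens if token in vocabulary]

# Stand-in for the trained model's vocabulary (assumption: with Gensim 4.x this
# would be model_dm.wv.key_to_index, with Gensim 3.x model_dm.wv.vocab)
vocabulary = {"dog", "bites", "man", "eats", "meat", "food"}

query = known_tokens(["covid", "man"], vocabulary)
print(query)  # ['man']

# If every token is unknown, the filtered list is empty and should not be compared
unknown_only = known_tokens(["covid"], vocabulary)
if not unknown_only:
    print("No in-vocabulary tokens to compare")
```

Filtering silently changes the query, so it is worth checking that each filtered list is non-empty, and possibly logging the dropped tokens, before computing a similarity.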
