<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Doc2Vec example



---

This notebook measures document similarity with `doc2vec`, using the gensim implementation.

In [1]:
import gensim.models as gsm
from os import listdir
from os.path import join
from gensim.models.doc2vec import TaggedDocument

# Path to the directory holding the training corpus (.txt files)
train_corpus = "assets/data/doc2vec/papers/"


**Tagging the text files:**

In [2]:


class DocIterator(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            yield TaggedDocument(words=doc.split(), tags=[self.labels_list[idx]])

docLabels = [f for f in listdir(train_corpus) if f.endswith('.txt')]
print(docLabels)
# Read each file's full text; texts and labels stay aligned by index
data = []
for doc in docLabels:
    with open(join(train_corpus, doc), 'r') as f:
        data.append(f.read())

it = DocIterator(data, docLabels)



['DanBoneh_3.txt', 'DanBoneh_2.txt', 'DanBoneh_1.txt', 'AlexAiken_2.txt', 'GeoffreyFox_3.txt', 'GeoffreyFox_2.txt', 'AlexAiken_3.txt', 'AlexAiken_1.txt', 'GeoffreyFox_1.txt', 'StratosIdreos_1.txt', 'StratosIdreos_3.txt', 'StratosIdreos_2.txt']
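
The corpus is wrapped in a class with `__iter__` rather than a plain generator because gensim reads it more than once: once in `build_vocab` and again on every training epoch. The sketch below illustrates the difference. `Tagged` is a hypothetical stand-in for `TaggedDocument` so the example runs without gensim, and `one_shot` and the two toy texts are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument, so the sketch needs no gensim
Tagged = namedtuple("Tagged", ["words", "tags"])


class DocIterator(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        # Each call creates a fresh generator, so the corpus can be read again
        for idx, doc in enumerate(self.doc_list):
            yield Tagged(words=doc.split(), tags=[self.labels_list[idx]])


def one_shot(doc_list, labels_list):
    # A plain generator: exhausted after a single pass
    for idx, doc in enumerate(doc_list):
        yield Tagged(words=doc.split(), tags=[labels_list[idx]])


# Made-up toy corpus
texts = ["public key encryption", "query processing in databases"]
labels = ["a.txt", "b.txt"]

restartable = DocIterator(texts, labels)
first_pass = list(restartable)
second_pass = list(restartable)   # yields the same documents again

gen = one_shot(texts, labels)
gen_first = list(gen)
gen_second = list(gen)            # nothing left to yield
```

With the generator, a second pass sees an empty corpus, so training after `build_vocab` would have no documents to read.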


**Now we train the `doc2vec` model, and save it.**

In [3]:
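# Train the doc2vec model
# alpha == min_alpha keeps the learning rate fixed for the whole run
model = gsm.Doc2Vec(vector_size=300, window=10, min_count=1, workers=11, alpha=0.025, min_alpha=0.025)
model.build_vocab(it)
# total_examples must be the number of documents; build_vocab stores it in corpus_count
model.train(it, total_examples=model.corpus_count, epochs=20)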


model.save("assets/data/models/paper.model")

print("model is saved")

model is saved
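
To see what setting `alpha` equal to `min_alpha` does, here is a minimal sketch of a learning-rate schedule. It assumes a simple linear decay from `alpha` toward `min_alpha` computed once per epoch, which simplifies how gensim adjusts the rate during training. The helper `epoch_alphas` is hypothetical and not part of gensim.

```python
def epoch_alphas(alpha, min_alpha, epochs):
    """Starting learning rate of each epoch under an assumed linear decay."""
    step = (alpha - min_alpha) / epochs
    return [alpha - step * e for e in range(epochs)]


# Same values as the training cell: the rate never changes
fixed = epoch_alphas(0.025, 0.025, 20)

# For comparison, a lower min_alpha (illustrative value) makes the rate shrink
decayed = epoch_alphas(0.025, 0.0001, 20)

print(fixed[0], fixed[-1])
print(decayed[0], decayed[-1])
```

With equal values the step is zero, so every epoch trains at 0.025. A smaller `min_alpha` lets later epochs make progressively smaller updates.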


**Reloading the saved model**

In [4]:
# Load the saved model from disk
model_path = "assets/data/models/paper.model"
m = gsm.Doc2Vec.load(model_path)
print("model is loaded")

model is loaded


**Set up the document to test**

In [5]:
# Path to the test file
test_paper = "assets/data/doc2vec/test_paper/DanBoneh_4.txt"
# Tokenize on whitespace, matching how the training documents were split
with open(test_paper, 'r') as f:
    new_test = f.read().split()
#print(new_test)


**Infer a vector for the new document, then find the most similar training documents**

In [6]:
inferred_docvec = m.infer_vector(new_test)
# Note: gensim 4.0 renamed `docvecs` to `dv`; on newer versions use m.dv.most_similar(...)
m.docvecs.most_similar([inferred_docvec], topn=3)


[('DanBoneh_1.txt', 0.9290025234222412),
 ('DanBoneh_2.txt', 0.8971514701843262),
 ('StratosIdreos_3.txt', 0.7446365356445312)]
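
The scores above are cosine similarities between the inferred vector and each stored document vector. The ranking can be sketched in plain Python as follows. The functions `cosine` and `most_similar` and the three-dimensional `toy_vectors` are made up for illustration and are not gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, doc_vectors, topn=3):
    """Return the topn (label, score) pairs, highest cosine similarity first."""
    scored = [(label, cosine(query, vec)) for label, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up document vectors; real doc2vec vectors here have 300 dimensions
toy_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.8, 0.6, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}

ranking = most_similar([1.0, 0.1, 0.0], toy_vectors, topn=2)
print(ranking)
```

A score near 1 means the two vectors point in almost the same direction. Here `doc_a` ranks first because the query is nearly parallel to it.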