# Calculating Similarity using Gensim Doc2vec

Doc2vec is an extension for word2vec. The objective of doc2vec is to create the numerical representation of sentence/paragraphs/documents unlike word2vec that computes a feature vector for every word in the corpus, Doc2Vec computes a feature vector for every document in the corpus. The vectors generated by doc2vec can be used for tasks like finding similarity between sentences/paragraphs/documents

In [1]:
#Import all the dependencies
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import pandas as pd  
import nltk
import gensim
from time import time

## Data Preprocessing

In [115]:
#Reading Data
train = pd.read_csv('train.csv')
train.head(3)

Unnamed: 0,id,tid1,tid2,title1_zh,title2_zh,title1_en,title2_en,label
0,0,0,1,2017养老保险又新增两项，农村老人人人可申领，你领到了吗,警方辟谣“鸟巢大会每人领5万” 仍有老人坚持进京,There are two new old-age insurance benefits f...,"Police disprove ""bird's nest congress each per...",unrelated
1,3,2,3,"""你不来深圳，早晚你儿子也要来""，不出10年深圳人均GDP将超香港",深圳GDP首超香港？深圳统计局辟谣：只是差距在缩小,"""If you do not come to Shenzhen, sooner or lat...",Shenzhen's GDP outstrips Hong Kong? Shenzhen S...,unrelated
2,1,2,4,"""你不来深圳，早晚你儿子也要来""，不出10年深圳人均GDP将超香港",GDP首超香港？深圳澄清：还差一点点……,"""If you do not come to Shenzhen, sooner or lat...",The GDP overtopped Hong Kong? Shenzhen clarifi...,unrelated


In [3]:
import string
#User defined function for removing punctuation
def RemovePunctuation(my_str):
    punctuations = string.punctuation
    no_punct = ""
    for char in my_str:
        if char not in punctuations:
            no_punct = no_punct + char.lower()
            
    return no_punct

In [6]:
#Applying function on both the columns with news headlines
train['title1_en'] = train['title1_en'].apply(lambda x: RemovePunctuation(x))
train['title2_en'] = train['title2_en'].apply(lambda x: RemovePunctuation(x))

In [98]:
#Creating a list of each columns to make a list of tagged documents
text1 = train['title1_en'].tolist()
text2 = train['title2_en'].tolist()
data = text1+text2

In [9]:
#Tagging each row as documents with numbers
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]

## Training Doc2Vec model

In [11]:
max_epochs = 10
#vec_size = 20
alpha = 0.025

model = Doc2Vec(#size=vec_size,
                alpha=alpha, 
                min_alpha=0.00025,
                min_count=1,
                dm =1)
  
model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")

iteration 0




iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8
iteration 9
Model Saved


### Example: Testing trained model on a new sentence

In [30]:
model= Doc2Vec.load("d2v.model")

new_sentence = "I love chicken farm and netease".split(" ") 

model.docvecs.most_similar(positive=[model.infer_vector(new_sentence)],topn=5)

[('435402', 0.835533857345581),
 ('575540', 0.8246408104896545),
 ('575549', 0.81596839427948),
 ('435403', 0.8147094249725342),
 ('114843', 0.8118643760681152)]

### Creating Dataframe to store the results of similarity score from the Doc2Vec model

In [48]:
#Creating Dataframe for both column
tt1 = pd.DataFrame(text1)
tt2 = pd.DataFrame(text2)

In [50]:
#Applying tokenizing function on both the dataframes
tt1 = tt1.applymap(lambda x: word_tokenize(x))
tt2 = tt2.applymap(lambda x: word_tokenize(x))

In [58]:
#Extracting values in list format
text1_list = tt1.values
text2_list = tt2.values

In [86]:
#Calculating similarity score for each row and appending it to an empty list
cosine_score= []
for i in range(len(text1)):
    a = []
    if tt1.iloc[i][0] == a or tt2.iloc[i][0] == a :
        cosine_score.append(0)
        
    else:
        score = model.n_similarity(tt1.iloc[i][0] ,tt2.iloc[i][0])
        cosine_score.append(score)

  # Remove the CWD from sys.path while we load stuff.


In [119]:
df = train["label"]

In [125]:
cosine = pd.DataFrame(cosine_score,  columns = ["score"])

In [126]:
clasi_data = pd.concat([df, cosine], axis=1)

In [128]:
#Saving the new created dataframe in csv file
clasi_data.to_csv("Doc2Vec_data.csv", index=False)