# Doc2Vec
Doc2Vec using Gensim and Python.

<b>Author:</b> Yash Sharma<br />
<b>Created On:</b> 9th Septembe, 2017

<b> Libraries Used:</b>

    Gensim, NLTK, Pandas, OS, Matplotlib, Sklean

## Dependencies

In [1]:
import gensim
import pandas as pd
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
from os import listdir
from os.path import isfile, join

from sklearn.manifold import TSNE
from matplotlib import pylab

Using TensorFlow backend.


## Loading Files into the Memory
A Total of 100 Documents (News Articles on Science) in Doc2Vec Sample Dir.<br />


In [4]:
# List of FileNames in the Doc2Vec Sample Directory.
docLabels = [f for f in listdir('Doc2Vec Sample/')]

# Data Content of each file.
data = []
for doc in docLabels:
    data.append(open('Doc2Vec Sample/' + doc).read())

## Iterator Object
<i>Labeled_Sentences</i> class collects all documents from passed list <b>docList</b> and corresponding <b>labelsList</b> and <b>returns</b> an iterator over those documents

In [5]:
class Labeled_Sentences(object):
    def __init__(self, docList, labelsList):
        self.labels_list = labelsList
        self.doc_list = docList
        
    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            yield gensim.models.doc2vec.LabeledSentence(
                doc, [self.labels_list[idx]]
            )

## Pre-Processing Text
PreProcessing is one of the most important task we perform before building the model.

Here, we are converting the content (text) to tokens and then removing stopwords from the newly created tokens.

In [6]:
# Set of StopWords from English Dictionary
stopWords = set(stopwords.words('english'))

# Initializing the Regex Tokenizer
# We can use any Tokenizer.
tokenizer = RegexpTokenizer(r'\w+')

In [7]:
def PreProcess(data):
    new_data = []
    for d in data:
        # Convert data to lower case
        d = d.lower()
        # Tokenize the content
        tokens = tokenizer.tokenize(d)
        # Remove StopWords
        tokens = list(set(tokens).difference(stopWords))
        
        # Append the Processed Data (Content) to the list
        new_data.append(tokens)
        
    # Return the New Data List.
    return new_data

In [8]:
# Cleaning (PreProcessing)
data = PreProcess(data)

#### After Pre-Processing the Content we have

    docLabels containing Unique Labels for each Document.
    data containing corresponding data for that Document.
    
Next, pass these to the <i>Labeled_Sentences</i> class to get a Iterator Object.

In [9]:
# Calling the Labeled_Sentences Class.
# data -> Text Content of each Document.
# docLabels -> Name of the Document corresponding to the Content.
it = Labeled_Sentences(data, docLabels)

## Doc2Vec Model
### Parameters:
    size: Number of Features<br/>
    alpha: Learning Rate<br />
    min_count: To Neglect InFrequent Words.

In [10]:
model = gensim.models.Doc2Vec(size = 300, min_count=0, 
                              alpha = 0.025, min_alpha = 0.025)
model.build_vocab(it)

### Training the Model

In [11]:
model.train(it, total_examples=model.corpus_count, epochs=100)

334873

### Save the Trained Model

In [12]:
model.save('doc2vec.model')
print ("Model Saved")

Model Saved


## Let's Play
### Documents which are Most Similar

In [13]:
# To get most similar document with similarity scores using document-index
similar_doc = model.docvecs.most_similar(5) 

In [14]:
pd.DataFrame(similar_doc, columns=['Document', 'Score'])

Unnamed: 0,Document,Score
0,Asteroid impact in ocean may vaporise 250 mn t...,0.73578
1,Artificial intelligence can destroy society St...,0.72618
2,Are we alone in the universe .txt,0.692185
3,"Asteroid collision with Earth inevitable, warn...",0.675604
4,Apple hires MIT student who 3D printed his own...,0.673789
5,Alps could lose 70% of their snow cover by 210...,0.668411
6,Battery-less implantable device powered by bod...,0.665988
7,Printable solar cells to turn surfaces into po...,0.664394
8,Antibody that neutralises 98% HIV strains iden...,0.648427
9,Project 'City of Trees' aims to plant 30 lakh ...,0.647168


### Vocabulary

In [15]:
model.wv.vocab

{'conditions': <gensim.models.keyedvectors.Vocab at 0x1a9b5ffb9e8>,
 'hired': <gensim.models.keyedvectors.Vocab at 0x1a9b5fcdef0>,
 'child': <gensim.models.keyedvectors.Vocab at 0x1a9b5fcf1d0>,
 'long': <gensim.models.keyedvectors.Vocab at 0x1a9b5fcf320>,
 'flowing': <gensim.models.keyedvectors.Vocab at 0x1a9b5fcf358>,
 'technology': <gensim.models.keyedvectors.Vocab at 0x1a9b5fcf400>,
 'dubbed': <gensim.models.keyedvectors.Vocab at 0x1a9b6005fd0>,
 '19': <gensim.models.keyedvectors.Vocab at 0x1a9b5fcf4a8>,
 'mount': <gensim.models.keyedvectors.Vocab at 0x1a9b5fcf518>,
 'bdellovibrio': <gensim.models.keyedvectors.Vocab at 0x1a9b5fd1c88>,
 '76': <gensim.models.keyedvectors.Vocab at 0x1a9b5fcf5c0>,
 'firms': <gensim.models.keyedvectors.Vocab at 0x1a9b5fcf5f8>,
 'apple': <gensim.models.keyedvectors.Vocab at 0x1a9b5fcf668>,
 'release': <gensim.models.keyedvectors.Vocab at 0x1a9b5fcf6a0>,
 'beef': <gensim.models.keyedvectors.Vocab at 0x1a9b5fcf6d8>,
 'closest': <gensim.models.keyedvectors.V

## Plotting the Data

In [16]:
def plot_data_point(embedding, labels):
    # Print Error and Exit if we hav more labels than Embedding
    assert embedding.shape[0] >= len(labels), 'More Labels than Embeddings'
    
    pylab.clf()
    pylab.figure(figsize=(16,16))
    for i, label in enumerate(labels):
        x, y = embedding[i, :]
        pylab.scatter(x, y)
        pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
        
    pylab.savefig('Doc2Vec.png')

In [17]:
# List of all the vocabulary
All_Vocab = list(model.wv.vocab)

In [18]:
# Select Only Top 100
vocab = All_Vocab[:100]
X = model[vocab]

In [19]:
tSNE = TSNE(n_components=2)
X_tsne = tSNE.fit_transform(X)

In [20]:
plot_data_point(X_tsne, vocab)