## What is Doc2Vec?

Doc2Vec is the straightforward extension of the word2vec model taking into account the overall vectorized form of the paragraph or the document. Word2vec creates vectors of only the words but this has a disadvantage where the model looses the overall meaning of the words. This is where Doc2Vec is important.
It is based on the 2 word2vec architectures - skipgram(SG) and continuous bag of words(CBOW). The only difference is that apart from the word vectors, we also feed a paragraph-id as input to the neural network. Given below are the architectures defined in the original paper:

**1. The first architecture is the Distributed Memory Model of Paragraph Vectors (PV-DM). This is similar to CBOW architecture of Word2Vec where, given a context we need to predict the word which would follow the sequence. However, we take a paragraph-token as a word and feed it into the neural net.**

<img src="1.png" alt="Drawing" style="width: 600px;"/>

**2. In the second architecture the only the paragraph token is fed as input to the neural network which then learns/predicts the words in a fixed context/window. This architecture is similar to the Skip-Gram model in Word2Vec and is known as Distributed Bag of Words version of Paragraph Vector (PV-DBOW)**

<img src="2.png" alt="Drawing" style="width: 600px;"/>

## Import the libraries

In [1]:
import numpy as np
import keras

#Import necessary nlp tools from gensim
from gensim import utils
from gensim.models.doc2vec import LabeledSentence, TaggedDocument
from gensim.models import Doc2Vec

from random import shuffle

Using TensorFlow backend.


## Load the text data

We will use Cornell's IMDB Movie Review dataset for our sentiment analysis. We have four text files:
- `test-neg.txt`: 12500 negative movie reviews from the test data
- `test-pos.txt`: 12500 positive movie reviews from the test data
- `train-neg.txt`: 12500 negative movie reviews from the training data
- `train-pos.txt`: 12500 positive movie reviews from the training data

I have already downloaded the preprocessed datasets but for someone collecting the dataset, one needs to carry out the required preprocessing such as removing punctuations and converting everything to lowercase.

Each of the reviews should be formatted as such:

```
once again mr costner has dragged out a movie for far longer than necessary aside from the terrific sea rescue sequences of which there are very few i just did not care about any of the characters most of us have ghosts in the closet and costner s character are realized early on and then forgotten until much later by which time i did not care the character we should really care about is a very cocky overconfident ashton kutcher the problem is he comes off as kid who thinks he s better than anyone else around him and shows no signs of a cluttered closet his only obstacle appears to be winning over costner finally when we are well past the half way point of this stinker costner tells us all about kutcher s ghosts we are told why kutcher is driven to be the best with no prior inkling or foreshadowing no magic here it was all i could do to keep from turning it off an hour in
this is an example of why the majority of action films are the same generic and boring there s really nothing worth watching here a complete waste of the then barely tapped talents of ice t and ice cube who ve each proven many times over that they are capable of acting and acting well don t bother with this one go see new jack city ricochet or watch new york undercover for ice t or boyz n the hood higher learning or friday for ice cube and see the real deal ice t s horribly cliched dialogue alone makes this film grate at the teeth and i m still wondering what the heck bill paxton was doing in this film and why the heck does he always play the exact same character from aliens onward every film i ve seen with bill paxton has him playing the exact same irritating character and at least in aliens his character died which made it somewhat gratifying overall this is second rate action trash there are countless better films to see and if you really want to see this one watch judgement night which is practically a carbon copy but has better acting and a better script the only thing that made this at all worth watching was a decent hand on the camera the cinematography was almost refreshing which comes close to making up for the horrible film itself but not quite
```

In [2]:
# Define source files for input data
source_dict = {'test-neg.txt':'TEST_NEG',
                'test-pos.txt':'TEST_POS',
                'train-neg.txt':'TRAIN_NEG',
                'train-pos.txt':'TRAIN_POS'
               }



# Define a LabeledDocSentence class to process multiple documents. This is an extension of the gensim's 
# LabeledLine class. Gensim's LabeledLine class does not process multiple documents, hence we need to define our own
# implementation.
class LabeledDocSentence():
    
    # Initialize the source dict
    def __init__(self, source_dict):
        self.sources = source_dict
    
    # This creates sentences as a list of words and assigns each sentence a tag 
    # e.g. [['word1', 'word2', 'word3', 'lastword'], ['label1']]
    def create_sentences(self):
        self.sentences = []
        for source_file, prefix in self.sources.items():
            with utils.smart_open(source_file) as f:
                for line_id, line in enumerate(f):
                    sentence_label = prefix + '_' + str(line_id)
                    self.sentences.append(LabeledSentence(utils.to_unicode(line).split(), [sentence_label]))
        
        return self.sentences
             
    # Return a permutation of the sentences in each epoch. I read that this leads to the best results and 
    # helps the model to train properly.
    def get_permuted_sentences(self):
        sentences = self.create_sentences()
        shuffled = list(sentences)
        shuffle(shuffled)
        return shuffled

## Model Training

Now we use Gensim's Doc2Vec function to train our model on the sentences. There are various hyperparameters used in the function. Some of them are:
- `min_count`: ignore all words with total frequency lower than this. You have to set this to 1, since the sentence labels only appear once. Setting it any higher than 1 will miss out on the sentences.
- `window`: the maximum distance between the current and predicted word within a sentence. Word2Vec uses a skip-gram model, and this is simply the window size of the skip-gram model.
- `size`: dimensionality of the feature vectors in output. 100 is a good number. If you're extreme, you can go up to around 400.
- `sample`: threshold for configuring which higher-frequency words are randomly downsampled
- `workers`: use this many worker threads to train the model 

I train the model for 10 epochs. It takes around 10 mins. We can use higher epochs for better results.

In [3]:
labeled_doc = LabeledDocSentence(source_dict) 
model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=7)

# Let the model learn the vocabulary - all the words in the paragraph
model.build_vocab(labeled_doc.get_permuted_sentences())

In [None]:
# Train the model on the entire set of sentences/reviews for 10 epochs. At each epoch sample a different
# permutation of the sentences to make the model learn better.
for epoch in range(10):
    print epoch
    model.train(labeled_doc.get_permuted_sentences(), total_examples=model.corpus_count, epochs=10)

In [None]:
# To avoid retraining, we save the model
model.save('imdb.d2v')

In [5]:
# Load the saved model
model_saved = Doc2Vec.load('imdb.d2v')

In [7]:
# Check what the model learned. It will show 10 most similar words to the input word. Since we kept the window size
# to be 10, it will show the 10 most recent.
model_saved.most_similar('good')

[(u'great', 0.7556788921356201),
 (u'decent', 0.7458477020263672),
 (u'nice', 0.7088463306427002),
 (u'bad', 0.6982812285423279),
 (u'solid', 0.6951433420181274),
 (u'fine', 0.6745833158493042),
 (u'excellent', 0.6504001021385193),
 (u'terrific', 0.629471480846405),
 (u'fantastic', 0.6224625110626221),
 (u'cool', 0.5988503098487854)]

In [9]:
# Our model is a Doc2Vec model, hence it also learnt the sentence vectors apart from the word embeddings. Hence we
# can see the vector of any sentence by passing the tag for the sentence.
model_saved.docvecs['TRAIN_NEG_0']

array([-0.83013731,  1.28567719, -1.59022784,  1.13399196,  0.24096866,
       -0.75068295,  1.56633914, -0.46217346, -2.37130737,  1.59989583,
       -0.09220402, -0.24014057, -0.23156425, -1.29025269, -1.69817221,
        1.69940197, -0.808604  ,  0.64784503, -0.69525599, -2.82596254,
       -1.76106191,  2.25225616,  0.0821627 ,  2.27012062,  1.47031569,
        1.83323991,  1.43627822, -0.91032636,  2.7978332 , -1.95226407,
        0.16659185,  1.82336974, -0.36955601,  0.17694767, -0.3247571 ,
        1.28920376, -0.15226717,  2.2388792 ,  0.02890237,  1.87150168,
       -0.60525328,  0.12953719, -0.33312163, -0.97939563, -0.12537138,
        0.5496937 ,  1.26948678,  0.55162621, -1.57944381,  0.18557446,
        0.12860912, -1.0455178 ,  1.51583648,  1.01854277, -0.63605636,
       -2.5904026 , -1.82550168,  0.69058442, -0.71976918,  0.97268033,
       -0.58825153,  0.93084186, -2.34840751, -1.81852698,  0.10071582,
       -0.47328663, -1.20318794, -1.39567649, -2.14003181, -0.59

In [24]:
# Create a labelled training and testing set

x_train = np.zeros((25000, 100))
y_train = np.zeros(25000)
x_test = np.zeros((25000, 100))
y_test = np.zeros(25000)

for i in range(12500):
    prefix_train_pos = 'TRAIN_POS_' + str(i)
    prefix_train_neg = 'TRAIN_NEG_' + str(i)
    x_train[i] = model_saved.docvecs[prefix_train_pos]
    x_train[12500 + i] = model_saved.docvecs[prefix_train_neg]
    
    y_train[i] = 1
    y_train[12500 + i] = 0
    
    
for i in range(12500):
    prefix_test_pos = 'TRAIN_POS_' + str(i)
    prefix_test_neg = 'TRAIN_NEG_' + str(i)
    x_test[i] = model_saved.docvecs[prefix_test_pos]
    x_test[12500 + i] = model_saved.docvecs[prefix_test_neg]
    
    y_test[i] = 1
    y_test[12500 + i] = 0

In [25]:
print x_train

[[ 0.25382927 -0.71361071  0.98171765 ...,  0.06509715 -2.94681573
  -2.84318161]
 [-0.10604549 -2.1016674   1.83482242 ..., -1.58855641 -2.93337131
  -2.401752  ]
 [ 0.77429169  1.67526484 -1.24800324 ...,  0.34189934 -3.20372248
   2.06218839]
 ..., 
 [ 0.38597104 -2.45056629 -0.57511675 ..., -0.04185961 -1.80855238
   0.34784812]
 [-0.37830129  0.77813071 -0.86706328 ..., -0.97749794 -0.0616008
   0.31928027]
 [ 2.320539    0.19292563 -1.98491764 ..., -0.30593163  0.04952879
   1.83004558]]


In [26]:
# Convert the output to a categorical variable to be used for the 2 neuron output layer in the neural network.

from keras.utils import to_categorical

y_train_cat = to_categorical(y_train)
y_test_cat = to_categorical(y_test)

In [27]:
# Create a neural network with a single hidden layer and a softmax output layer with 2 neurons.

from keras.models import Sequential
from keras.layers import Dense

nnet = Sequential()
nnet.add(Dense(32, input_dim=100, activation='relu'))
nnet.add(Dense(2, input_dim=32, activation='softmax'))
nnet.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [28]:
# Visualize the neural net's layer
nnet.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_6 (Dense)              (None, 32)                3232      
_________________________________________________________________
dense_7 (Dense)              (None, 2)                 66        
Total params: 3,298
Trainable params: 3,298
Non-trainable params: 0
_________________________________________________________________


In [29]:
# Train the net on the training data
nnet.fit(x_train, y_train_cat, epochs=5, batch_size=32)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x16a2f9e50>

In [33]:
# Predict on the test set
score = nnet.evaluate(x_test, y_test_cat, batch_size=32)
score[1]



0.88331999999999999