# Sentiment Analysis using Doc2Vec

We use Doc2Vec, an extension of Word2Vec to whole documents, for sentiment analysis by attempting to classify the Cornell IMDB movie review corpus (http://www.cs.cornell.edu/people/pabo/movie-review-data/).

The data used in this demo can be found at https://github.com/bariscimen/doc2vec-sentiment

## Setup
### Modules
We use gensim, since it has a much more readable implementation of Word2Vec (and Doc2Vec). Bless those guys. We also use numpy for general array manipulation, and sklearn for the logistic regression classifier.


In [1]:
from random import shuffle

# gensim modules
from gensim import utils
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec

# numpy
import numpy

# classifier
from sklearn.linear_model import LogisticRegression

### Input Format
We can't input the raw reviews from the Cornell movie review data repository. Instead, we clean them up by converting everything to lower case and removing punctuation. 

The result is five documents:
- test-neg.txt: 12500 negative movie reviews from the test data
- test-pos.txt: 12500 positive movie reviews from the test data
- train-neg.txt: 12500 negative movie reviews from the training data
- train-pos.txt: 12500 positive movie reviews from the training data
- train-unsup.txt: 50000 unlabelled movie reviews

**Each document should be on one line, separated by new lines.**
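The cleanup described above could be sketched roughly as follows. This is only an illustration written for Python 3: clean_review is a hypothetical helper name, and it assumes "removing punctuation" means dropping ASCII punctuation characters, which may differ from how the files in the repository were actually prepared.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_review(text):
    """Lower-case a review, drop ASCII punctuation, and collapse all
    whitespace (including line breaks) so the review fits on one line."""
    lowered = text.lower().translate(_PUNCT_TABLE)
    return " ".join(lowered.split())

print(clean_review("This movie was GREAT!\nLoved it, really."))
# prints: this movie was great loved it really
```

Writing one cleaned review per line then gives files in the format listed above.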

### Feeding Data to Doc2Vec
Doc2Vec (the portion of gensim that implements the Doc2Vec algorithm) does a great job at word embedding, but a terrible job at reading in files. It only takes an iterable of TaggedDocument objects, a class from gensim.models.doc2vec that represents a single tagged sentence. So we write a TaggedLineSentence class that reads our files and yields one TaggedDocument per line.

In [2]:
class TaggedLineSentence(object):
    def __init__(self, sources):
        self.sources = sources
        
        flipped = {}
        
        # make sure that the prefixes (the dict values) are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')
    
    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    yield TaggedDocument(words=utils.to_unicode(line).split(), tags=[prefix + '_%s' % item_no])
    
    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    self.sentences.append(TaggedDocument(words=utils.to_unicode(line).split(), tags=[prefix + '_%s' % item_no]))
        return self.sentences
    
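    def shuffle_sentences(self):
        # random.shuffle works in place and returns None,
        # so return the shuffled list explicitly
        shuffle(self.sentences)
        return self.sentences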

Now we can feed the data files to TaggedLineSentence. TaggedLineSentence simply takes a dictionary with the file names as keys and, as values, the special prefixes for sentences from each file. The prefixes need to be unique, so that there is no ambiguity between sentences from different documents.

The prefixes will have a counter appended to them to label individual sentences in the documents.

In [3]:
sources = {'test-neg.txt':'TEST_NEG', 'test-pos.txt':'TEST_POS', 'train-neg.txt':'TRAIN_NEG', 'train-pos.txt':'TRAIN_POS', 'train-unsup.txt':'TRAIN_UNS'}

sentences = TaggedLineSentence(sources)
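To make the tagging scheme concrete, here is a small dependency-free sketch of how a prefix and a line counter combine into a tag. Tagged is a hypothetical stand-in for gensim's TaggedDocument, and tag_lines is a made-up helper; the real class reads the lines from files instead of a list.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument, so the sketch needs no dependencies.
Tagged = namedtuple("Tagged", ["words", "tags"])

def tag_lines(lines, prefix):
    """Yield one Tagged item per line, tagged '<prefix>_<line number>', counting from 0."""
    for item_no, line in enumerate(lines):
        yield Tagged(words=line.split(), tags=["%s_%s" % (prefix, item_no)])

docs = list(tag_lines(["a great film", "dull and slow"], "TRAIN_POS"))
print(docs[0].tags, docs[1].tags)
# prints: ['TRAIN_POS_0'] ['TRAIN_POS_1']
```

Note that the counter starts at 0, so the first review in train-pos.txt ends up tagged TRAIN_POS_0.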

## Model
### Building the Vocabulary Table
Doc2Vec requires us to build the vocabulary table (simply digesting all the words, filtering out the unique ones, and doing some basic counts on them). So we feed it the array of sentences. model.build_vocab takes an array of TaggedDocument objects, hence the to_array function in our TaggedLineSentence class.
If you're curious about the parameters, do read the Word2Vec documentation. Otherwise, here's a quick rundown:
- min_count: ignore all words with total frequency lower than this. You have to set this to 1, since the sentence labels only appear once. Setting it any higher than 1 will miss out on the sentences.
- window: the maximum distance between the current and predicted word within a sentence, i.e. the size of the context window used during training.
- size: dimensionality of the feature vectors in output. 100 is a good number. If you're extreme, you can go up to around 400.
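- sample: threshold for configuring which higher-frequency words are randomly downsampled
- negative: if greater than 0, negative sampling is used, and the value sets how many "noise words" are drawn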
- workers: use this many worker threads to train the model
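As a rough picture of what building the vocabulary involves, the sketch below counts words across tokenized documents and drops those rarer than min_count. It is a simplified illustration in plain Python, not gensim's actual implementation, and count_vocab is a hypothetical name.

```python
from collections import Counter

def count_vocab(documents, min_count=1):
    """Count every word across all documents and keep those seen at least min_count times."""
    counts = Counter(word for doc in documents for word in doc)
    return {word: n for word, n in counts.items() if n >= min_count}

toy_docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
print(count_vocab(toy_docs, min_count=2))
# prints: {'good': 3, 'movie': 2}
```

With min_count=1 nothing is filtered out, which is the setting used in this demo.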

In [4]:
model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=7)

In [5]:
model.build_vocab(sentences.to_array())

### Training Doc2Vec
Now we train the model. The model is better trained if **in each training epoch, the sequence of sentences fed to the model is randomized.** This is important: skipping this step gives you really poor results. This is the reason for the shuffle_sentences method in our TaggedLineSentence class.

We train it for 10 epochs. If I had more time, I'd have done 20.

This process takes around 10 mins, so go grab some coffee.

In [7]:
for epoch in range(10):
    sentences.shuffle_sentences()
    model.train(sentences.sentences)
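The per-epoch reshuffling can be seen in isolation in the sketch below. It is an illustration only: epoch_orders is a hypothetical helper, plain strings stand in for the TaggedDocument objects, and no model is trained.

```python
import random

def epoch_orders(items, epochs, seed=None):
    """Yield a freshly shuffled copy of `items` for each epoch."""
    rng = random.Random(seed)
    for _ in range(epochs):
        order = list(items)  # copy, so the caller's list is left untouched
        rng.shuffle(order)   # shuffles in place and returns None
        yield order

tags = ['TRAIN_POS_0', 'TRAIN_POS_1', 'TRAIN_NEG_0', 'TRAIN_NEG_1']
for epoch, order in enumerate(epoch_orders(tags, epochs=3, seed=42)):
    print(epoch, order)
```

Every epoch sees the same items, just in a different order, so the model never gets used to all the reviews of one file arriving in one block.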

Here's a sample vector for one of the negative training reviews (the one tagged TRAIN_NEG_1; tags are numbered from 0):

In [25]:
model.docvecs['TRAIN_NEG_1'] 

array([ 0.32277206, -0.03962494, -0.07266915, -0.3584137 ,  0.20809357,
       -0.25993574, -0.11822301,  0.48944992, -0.37939513, -0.22172844,
        0.67778319,  0.25943616,  0.18589967,  0.59513271, -0.03071132,
        0.16239348,  0.08516458, -0.17490691,  0.44850519, -0.3339214 ,
        0.07385798,  0.13084169,  0.10786842, -0.1326807 , -0.13181499,
        0.00358524,  0.01887527,  0.0025575 ,  0.3032847 , -0.0722024 ,
       -0.27787772, -0.01889877,  0.00405976, -0.19545884, -0.12312309,
       -0.276225  , -0.48368922,  0.29648355,  0.04676267,  0.20177564,
        0.0226334 , -0.28215566,  0.14990538,  0.1497556 , -0.22867598,
       -0.59170425, -0.08138497, -0.51442462, -0.04024356, -0.22482392,
        0.16997276, -0.21076104,  0.16284338,  0.26058167,  0.52194506,
        0.3672393 , -0.06896825,  0.03472322, -0.05638345, -0.50952393,
        0.34179381, -0.05666731,  0.08396357,  0.12766367, -0.16163798,
        0.1123985 ,  0.0809842 , -0.3686167 ,  0.13539273,  0.37

### Saving and Loading Models

To avoid training the model again, we can save it.

In [26]:
model.save('./imdb.d2v')

And load it.

In [27]:
model = Doc2Vec.load('./imdb.d2v')

## Classifying Sentiments
### Training Vectors
Now let's use these vectors to train a classifier. First, we must extract the training vectors. Remember that we have a total of 25000 training reviews, with equal numbers of positive and negative ones (12500 positive, 12500 negative).

Hence, we create numpy arrays (since the classifier we use only takes numpy arrays). There are two parallel arrays, one containing the vectors (train_arrays) and the other containing the labels (train_labels).

We simply put the positive ones in the first half of the array, and the negative ones in the second half.

In [28]:
train_arrays = numpy.zeros((25000, 100))
train_labels = numpy.zeros(25000)

for i in range(12500):
    prefix_train_pos = 'TRAIN_POS_' + str(i)
    prefix_train_neg = 'TRAIN_NEG_' + str(i)
    train_arrays[i] = model.docvecs[prefix_train_pos]
    train_arrays[12500 + i] = model.docvecs[prefix_train_neg]
    train_labels[i] = 1
    train_labels[12500 + i] = 0

The training array looks like this: rows and rows of vectors representing each sentence.

In [29]:
print(train_arrays)

[[-0.53936642 -0.23031077 -0.03832249 ...,  0.2249258  -0.54613489
  -0.11463475]
 [-0.41936064 -0.16192876  0.19960476 ...,  0.04475639  0.04624066
   0.0749617 ]
 [-0.09952848 -0.2859416   0.12483131 ...,  0.53659874 -0.65958494
   0.11224388]
 ..., 
 [-0.05353339 -0.2487039  -0.02476246 ..., -0.0764806  -0.30736387
   0.02775923]
 [-0.02365136 -0.14059344 -0.08467647 ..., -0.2423065  -0.38618767
   0.16995062]
 [-0.05582325 -0.16865364 -0.26180309 ..., -0.11948675 -0.29945078
  -0.19749543]]


The labels are simply category labels for the sentence vectors -- 1 representing positive and 0 for negative.

In [30]:
print(train_labels)

[ 1.  1.  1. ...,  0.  0.  0.]


### Testing Vectors
We do the same for testing data -- data that we are going to feed to the classifier after we've trained it using the training data. This allows us to evaluate our results. The process is pretty much the same as extracting the vectors for the training data.

In [32]:
test_arrays = numpy.zeros((25000, 100))
test_labels = numpy.zeros(25000)

for i in range(12500):
    prefix_test_pos = 'TEST_POS_' + str(i)
    prefix_test_neg = 'TEST_NEG_' + str(i)
    test_arrays[i] = model.docvecs[prefix_test_pos]
    test_arrays[12500 + i] = model.docvecs[prefix_test_neg]
    test_labels[i] = 1
    test_labels[12500 + i] = 0
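Since the training and testing cells differ only in their prefixes, the extraction could be folded into one helper. The following is a sketch under assumptions: build_arrays is a hypothetical name, and a plain dict stands in for model.docvecs so that the snippet runs without a trained model.

```python
import numpy

def build_arrays(vectors, pos_prefix, neg_prefix, count, dim):
    """Stack `count` positive vectors, then `count` negative ones, with matching labels.

    `vectors` is anything indexable by tag; here a dict stands in for model.docvecs.
    """
    arrays = numpy.zeros((2 * count, dim))
    labels = numpy.zeros(2 * count)
    for i in range(count):
        arrays[i] = vectors[pos_prefix + str(i)]
        arrays[count + i] = vectors[neg_prefix + str(i)]
        labels[i] = 1  # positives fill the first half; the negative half stays 0
    return arrays, labels

# Toy data: 2 positive and 2 negative "reviews" of dimension 3.
toy = {
    'TEST_POS_0': [1.0, 0.0, 0.0], 'TEST_POS_1': [0.9, 0.1, 0.0],
    'TEST_NEG_0': [0.0, 1.0, 0.0], 'TEST_NEG_1': [0.0, 0.9, 0.1],
}
toy_arrays, toy_labels = build_arrays(toy, 'TEST_POS_', 'TEST_NEG_', 2, 3)
print(toy_arrays.shape, toy_labels)
```

With the real model, the same call would take model.docvecs, the two prefixes, 12500 and 100 as arguments.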

### Classification
Now we train a logistic regression classifier using the training data.

In [33]:
classifier = LogisticRegression()
classifier.fit(train_arrays, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)

And find that we have achieved nearly 88% accuracy for sentiment analysis. This is rather incredible, given that we are only using a logistic regression classifier and a very shallow neural network.

In [34]:
classifier.score(test_arrays, test_labels)

0.88
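The number returned by score is the mean accuracy, i.e. the fraction of test reviews whose predicted label matches the true label. A plain-Python sketch of that calculation follows; accuracy is a hypothetical helper and the toy labels are made up for illustration.

```python
def accuracy(predicted, actual):
    """Return the fraction of positions where the predicted label equals the true label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return float(correct) / len(actual)

# Toy example: 3 of the 4 predictions match the true labels.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
# prints: 0.75
```

Because the test set here is balanced (12500 positive, 12500 negative), always guessing one class would score 0.5, which is the baseline to compare against.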

## References
- https://github.com/linanqiu/word2vec-sentiments
- Doc2vec: https://radimrehurek.com/gensim/models/doc2vec.html
- Paper that inspired this: http://arxiv.org/abs/1405.4053