# Using doc2vec to classify movie reviews
Aaron Palumbo | Nov. 8, 2015

## Objective

Use doc2vec to classify movie reviews as positive or negative. This implementation mostly follows the example in http://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis, but the library appears to have changed since that was written, which requires changes to some of the semantics.

## Dependencies

In [1]:
from gensim.models import Doc2Vec
import gensim.models.doc2vec
LabeledSentence = gensim.models.doc2vec.LabeledSentence

from collections import OrderedDict
import multiprocessing
import os
import numpy as np

from sklearn.cross_validation import train_test_split
from sklearn.linear_model import SGDClassifier

from IPython import display as dis

In [2]:
# silly utility to launch a qtconsole when one isn't already running
import psutil

def returnPyIDs():
    pyids = set()
    for pid in psutil.pids():
        try:
            if "python" in psutil.Process(pid).name():
                pyids.add(pid)
        except psutil.Error:
            # the process exited or is not accessible; skip it
            pass
    return pyids

def launchConsole():
    before_pyids = returnPyIDs()
    %qtconsole
    after_pyids = returnPyIDs()
    newid = after_pyids.difference(before_pyids)
    assert len(newid) == 1
    return list(newid)[0]

try:
    qtid
except NameError:
    qtid = launchConsole()
    
if qtid not in returnPyIDs():
    qtid = launchConsole()
    
qtid

6976

In [3]:
cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1

## Load and clean data

We need to create a list containing the text of each review.

In [4]:
# get the data from the link in the Appendix
# Reviews
reviews = {
    "positive": {
        "dir": "aclImdb/train/pos/",
        "text": []
    },
    "negative": {
        "dir": "aclImdb/train/neg/",
        "text": []
    },
    "unsup": {
        "dir": "aclImdb/train/unsup/",
        "text": []
    }
        
}

# Read files from disk
for sentiment in reviews.keys():
    d = reviews[sentiment]["dir"]
    for fileName in os.listdir(d):
        with open(os.path.join(d, fileName), "r") as f:
            reviews[sentiment]["text"] += f.readlines()

Let's make sure we understand the output of each step as we go along. The code above reads the reviews from all the files and collects them into one list per category.

In [5]:
reviews["unsup"]["text"][:2]

['I admit, the great majority of films released before say 1933 are just not for me. Of the dozen or so "major" silents I have viewed, one I loved (The Crowd), and two were very good (The Last Command and City Lights, that latter Chaplin circa 1931).<br /><br />So I was apprehensive about this one, and humor is often difficult to appreciate (uh, enjoy) decades later. I did like the lead actors, but thought little of the film.<br /><br />One intriguing sequence. Early on, the guys are supposed to get "de-loused" and for about three minutes, fully dressed, do some schtick. In the background, perhaps three dozen men pass by, all naked, white and black (WWI ?), and for most, their butts, part or full backside, are shown. Was this an early variation of beefcake courtesy of Howard Hughes?',
 'Take a low budget, inexperienced actors doubling as production staff\xc2\x97 as well as limited facilities\xc2\x97and you can\'t expect much more than "Time Chasers" gives you, but you can absolutely ex

Next we need to do some minor cleaning. We transform all words to lowercase and remove line breaks (both '\n' and '<br />'). We also treat punctuation marks as individual words.

In [6]:
# Minor cleaning
def cleanText(corpus):
    punctuation = """.,?!:;(){}[]"""
    corpus = [z.lower().replace('\n', '') for z in corpus]
    corpus = [z.replace('<br />', ' ') for z in corpus]
    
    # treat punctuation as individual words
    for c in punctuation:
        corpus = [z.replace(c, ' {} '.format(c)) for z in corpus]
    corpus = [z.split() for z in corpus]
    return corpus

pos = cleanText(reviews["positive"]["text"])
neg = cleanText(reviews["negative"]["text"])
unsup = cleanText(reviews["unsup"]["text"])

In [7]:
unsup[0][:20]

['i',
 'admit',
 ',',
 'the',
 'great',
 'majority',
 'of',
 'films',
 'released',
 'before',
 'say',
 '1933',
 'are',
 'just',
 'not',
 'for',
 'me',
 '.',
 'of',
 'the']
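To see what the cleaning step does on a small input, here is a minimal sketch that re-implements the same logic on a made-up review string. The sample text and the name `clean_text` are illustrative and not part of the dataset or the notebook. Note that double quotes are not in the punctuation list, so a token such as `"major"` keeps its quotes.

```python
def clean_text(corpus):
    """Lowercase, strip line breaks, and split punctuation into separate tokens."""
    punctuation = ".,?!:;(){}[]"
    corpus = [z.lower().replace("\n", "") for z in corpus]
    corpus = [z.replace("<br />", " ") for z in corpus]
    for c in punctuation:
        # pad each punctuation mark with spaces so split() isolates it
        corpus = [z.replace(c, " {} ".format(c)) for z in corpus]
    return [z.split() for z in corpus]

# made-up sample review (assumption, not taken from the dataset)
sample = ['One "major" film.<br /><br />Loved it!\n']
tokens = clean_text(sample)[0]
print(tokens)
```

If quoted words should match their unquoted forms, the double quote could be added to the punctuation string; the notebook as written does not do this.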

In [8]:
# Create results vector
# 1 for positive review 0 for negative review
y = np.concatenate(
    (np.ones(len(pos)), np.zeros(len(neg)))
)

# Split into training and test set
x_train, x_test, y_train, y_test = \
    train_test_split(np.concatenate((pos, neg)), y,
                     test_size=0.2, random_state=1)

Next we create a LabeledSentence object for each review. This is the input format required by Gensim's Doc2Vec implementation.

In [9]:
def labelizedReviews(reviews, label_type):
    labelized = []
    for i, v in enumerate(reviews):
        label = '{}_{}'.format(label_type, i)
        labelized.append(LabeledSentence(words=v, tags=[label]))
    return labelized

x_train = labelizedReviews(x_train, 'TRAIN')
x_test = labelizedReviews(x_test, 'TEST')
unsup = labelizedReviews(unsup, 'UNSUP')

The output looks like this:

In [10]:
print(type(unsup[0]))
unsup[0]

<class 'gensim.models.doc2vec.LabeledSentence'>


TaggedDocument(words=['i', 'admit', ',', 'the', 'great', 'majority', 'of', 'films', 'released', 'before', 'say', '1933', 'are', 'just', 'not', 'for', 'me', '.', 'of', 'the', 'dozen', 'or', 'so', '"major"', 'silents', 'i', 'have', 'viewed', ',', 'one', 'i', 'loved', '(', 'the', 'crowd', ')', ',', 'and', 'two', 'were', 'very', 'good', '(', 'the', 'last', 'command', 'and', 'city', 'lights', ',', 'that', 'latter', 'chaplin', 'circa', '1931', ')', '.', 'so', 'i', 'was', 'apprehensive', 'about', 'this', 'one', ',', 'and', 'humor', 'is', 'often', 'difficult', 'to', 'appreciate', '(', 'uh', ',', 'enjoy', ')', 'decades', 'later', '.', 'i', 'did', 'like', 'the', 'lead', 'actors', ',', 'but', 'thought', 'little', 'of', 'the', 'film', '.', 'one', 'intriguing', 'sequence', '.', 'early', 'on', ',', 'the', 'guys', 'are', 'supposed', 'to', 'get', '"de-loused"', 'and', 'for', 'about', 'three', 'minutes', ',', 'fully', 'dressed', ',', 'do', 'some', 'schtick', '.', 'in', 'the', 'background', ',', 'perhap
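The structure of these tagged objects can be mimicked without the library. The following is a minimal sketch that uses a standard-library namedtuple as a stand-in; the names `TaggedDoc` and `labelize` and the toy reviews are assumptions made for illustration, not the library's own types.

```python
from collections import namedtuple

# Hypothetical stand-in for the library's tagged-document type
# (assumption: a simple record with `words` and `tags` fields)
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

def labelize(reviews, label_type):
    """Give every tokenized review a unique tag such as 'TRAIN_0'."""
    return [
        TaggedDoc(words=words, tags=["{}_{}".format(label_type, i)])
        for i, words in enumerate(reviews)
    ]

# toy tokenized reviews (made up for this example)
toy_reviews = [["great", "film", "!"], ["not", "for", "me", "."]]
tagged = labelize(toy_reviews, "TRAIN")
for doc in tagged:
    print(doc.tags, doc.words)
```

The point of the unique tag is that each review gets its own vector, which is later looked up by that tag.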

We now train two model types:
* DM:
> DM stands for (D)istributed (M)emory. DM attempts to predict a word given its previous words and a paragraph vector.

* DBOW:
> DBOW stands for (D)istributed (B)ag (O)f (W)ords. DBOW predicts a random group of words in a paragraph given only its paragraph vector.

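The difference between the two model types can be illustrated by listing the prediction tasks each one would form from a single tagged document. This is a conceptual sketch only: the function names are hypothetical, no training happens here, and library implementations may differ (for example by using context on both sides of the target word, or by sampling words rather than enumerating them).

```python
def dm_examples(tag, words, window=2):
    """DM-style tasks: the document tag plus the previous `window` words
    are the inputs, and the next word is the target."""
    examples = []
    for i in range(window, len(words)):
        context = tuple(words[i - window:i])
        examples.append(((tag, context), words[i]))
    return examples

def dbow_examples(tag, words):
    """DBOW-style tasks: the document tag alone is the input, and a word
    from the document is the target (all candidates are listed here)."""
    return [(tag, word) for word in words]

words = ["the", "film", "was", "great"]
dm = dm_examples("DOC_0", words)
dbow = dbow_examples("DOC_0", words)
print(dm)
print(dbow)
```

In the DM tasks word order matters through the context window, while the DBOW tasks ignore order entirely, which is where the "bag of words" name comes from.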
In [11]:
size = 400
num_epochs = 10    #~2.5 min / epoch

# instantiate our DM and DBOW models
model_dm = gensim.models.Doc2Vec(
    min_count=1,
    window=10,
    size=size,
    sample=1e-3,
    negative=5,
    workers=3
)

model_dbow = gensim.models.Doc2Vec(
    min_count=1,
    window=10,
    size=size,
    sample=1e-3,
    negative=5,
    dm=0,
    workers=3
)

# build vocab over all reviews
model_dm.build_vocab(x_train + x_test + unsup)
model_dbow.build_vocab(x_train + x_test + unsup)

In [12]:
# in training the model, it is recommended to train over
# the data multiple times while randomizing the input order
all_training_reviews = x_train + unsup
for epoch in xrange(num_epochs):
    perm = np.random.permutation(len(all_training_reviews))
    model_dm.train([all_training_reviews[i] for i in perm])
    model_dbow.train([all_training_reviews[i] for i in perm])



In [13]:
# Get training set vectors
def getVecs(model, corpus, size):
    vecs = [np.array(model.docvecs[z.tags[0]]).reshape((1, size)) for z in corpus]
    return np.concatenate(vecs)

train_vecs_dm = getVecs(model_dm, x_train, size)
train_vecs_dbow = getVecs(model_dbow, x_train, size)

train_vecs = np.hstack((train_vecs_dm, train_vecs_dbow))

# Build test vectors
test_vecs_dm = getVecs(model_dm, x_test, size)
test_vecs_dbow = getVecs(model_dbow, x_test, size)

test_vecs = np.hstack((test_vecs_dm, test_vecs_dbow))
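The stacking step above places the DM and DBOW vectors for each review side by side, so each review ends up with a feature vector of length `2 * size`. A minimal sketch with toy arrays (the array names and sizes are stand-ins, not the trained vectors):

```python
import numpy as np

size = 4      # toy dimensionality (the notebook uses 400)
n_docs = 3    # toy number of reviews

# Hypothetical stand-ins for the per-model document vectors
vecs_dm = np.zeros((n_docs, size))
vecs_dbow = np.ones((n_docs, size))

# one row per review: DM features first, then DBOW features
combined = np.hstack((vecs_dm, vecs_dbow))
print(combined.shape)
```

This only works if row `i` refers to the same review in both arrays, which holds here because both are built from the same corpus in the same order.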

We are now ready to build a simple linear model, train it on our training set, and measure performance on our test set.

In [14]:
lr = SGDClassifier(loss='log', penalty='l1')
lr.fit(train_vecs, y_train)

print 'Test Accuracy: {:.2f}'.format(lr.score(test_vecs, y_test))

Test Accuracy: 0.50


## Results

That's not very good, and it is significantly lower than the accuracy reported by similar attempts to classify the same dataset, which achieve upwards of 87%. I must conclude that there is a mistake somewhere in my process, but I have been unable to find it.

(I tested the model using an SVM and saw the same results, so the problem is likely in the way I am constructing the document vectors.)
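One check that may be worth running, offered as a hypothesis rather than a confirmed diagnosis: the training loop iterates over `x_train + unsup`, while `getVecs` also looks up vectors for the `TEST_*` tags. The sketch below uses toy tag lists (all names and sizes are illustrative) to list tags that are looked up but never passed to training. If vectors for such tags stay at their initial values, that would be consistent with chance-level test accuracy, though this is not verified here.

```python
def tags_of(prefix, n):
    """Build tags in the same 'PREFIX_i' form used by labelizedReviews."""
    return ["{}_{}".format(prefix, i) for i in range(n)]

# toy sizes, for illustration only
train_tags = tags_of("TRAIN", 4)
test_tags = tags_of("TEST", 2)
unsup_tags = tags_of("UNSUP", 3)

# mirrors all_training_reviews = x_train + unsup
seen_in_training = set(train_tags + unsup_tags)

# mirrors the getVecs calls on x_train and x_test
looked_up = train_tags + test_tags

never_trained = [t for t in looked_up if t not in seen_in_training]
print(never_trained)
```

The same comparison could be run on the real `x_train`, `x_test`, and `unsup` lists by collecting `z.tags[0]` from each document.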

## Appendix

### References

#### Data:
* http://ai.stanford.edu/%7Eamaas/data/sentiment/

#### Methodology:
* http://cs.stanford.edu/~quocle/paragraph_vector.pdf
* http://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis
* https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb

### doc2vec documentation


 class gensim.models.doc2vec.Doc2Vec(documents=None, size=300, alpha=0.025, window=8, min_count=5, max_vocab_size=None, sample=0, seed=1, workers=1, min_alpha=0.0001, dm=1, hs=1, negative=0, dbow_words=0, dm_mean=0, dm_concat=0, dm_tag_count=1, docvecs=None, docvecs_mapfile=None, comment=None, trim_rule=None, **kwargs)

    Bases: gensim.models.word2vec.Word2Vec

    Class for training, using and evaluating neural networks described in http://arxiv.org/pdf/1405.4053v2.pdf

    Initialize the model from an iterable of documents. Each document is a TaggedDocument object that will be used for training.

    The documents iterable can be simply a list of TaggedDocument elements, but for larger corpora, consider an iterable that streams the documents directly from disk/network.

    If you don’t supply documents, the model is left uninitialized – use if you plan to initialize it in some other way.

    dm defines the training algorithm. By default (dm=1), ‘distributed memory’ (PV-DM) is used. Otherwise, distributed bag of words (PV-DBOW) is employed.

    size is the dimensionality of the feature vectors.

    window is the maximum distance between the predicted word and context words used for prediction within a document.

    alpha is the initial learning rate (will linearly drop to zero as training progresses).

    seed = for the random number generator. Only runs with a single worker will be deterministically reproducible because of the ordering randomness in multi-threaded runs.

    min_count = ignore all words with total frequency lower than this.

    max_vocab_size = limit RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit (default).

    sample = threshold for configuring which higher-frequency words are randomly downsampled; default is 0 (off), useful value is 1e-5.

    workers = use this many worker threads to train the model (=faster training with multicore machines).

    hs = if 1 (default), hierarchical softmax will be used for model training (else set to 0).

    negative = if > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20).

    dm_mean = if 0 (default), use the sum of the context word vectors. If 1, use the mean. Only applies when dm is used in non-concatenative mode.

    dm_concat = if 1, use concatenation of context vectors rather than sum/average; default is 0 (off). Note concatenation results in a much-larger model, as the input is no longer the size of one (sampled or arithmetically combined) word vector, but the size of the tag(s) and all words in the context strung together.

    dm_tag_count = expected constant number of document tags per document, when using dm_concat mode; default is 1.

    dbow_words = if set to 1, trains word-vectors (in skip-gram fashion) simultaneously with DBOW doc-vector training; default is 0 (faster training of doc-vectors only).

    trim_rule = vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used), or a callable that accepts parameters (word, count, min_count) and returns either util.RULE_DISCARD, util.RULE_KEEP or util.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.
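The `dm_mean` and `dm_concat` options described above differ in how the input vectors are combined before prediction. The following is a minimal sketch of the three modes using plain Python lists; the function name `combine` and the toy vectors are assumptions for illustration, and the library's actual implementation is not reproduced here.

```python
def combine(vectors, dm_mean=0, dm_concat=0):
    """Combine equal-length input vectors by sum, mean, or concatenation."""
    if dm_concat:
        # string all vectors together: length grows with the number of inputs
        return [x for v in vectors for x in v]
    summed = [sum(col) for col in zip(*vectors)]
    if dm_mean:
        return [s / float(len(vectors)) for s in summed]
    return summed

# toy inputs: one document vector followed by two context word vectors
doc_vec = [1.0, 2.0]
context_vecs = [[3.0, 4.0], [5.0, 6.0]]
inputs = [doc_vec] + context_vecs

summed = combine(inputs)
averaged = combine(inputs, dm_mean=1)
concatenated = combine(inputs, dm_concat=1)
print(summed, averaged, concatenated)
```

Sum and mean keep the input at the size of a single vector, while concatenation makes it grow with the window, which is why the documentation warns about a much larger model.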

