We work with a dataset containing 50,000 movie reviews from IMDB, labeled by sentiment (positive/negative).  In addition, there are another 50,000 IMDB reviews provided without any rating labels.  

The reviews are split evenly into train and test sets (25k train and 25k test). The overall distribution of labels is also balanced within the train and test sets (12.5k pos and 12.5k neg).  Our goal is to predict sentiment in the test dataset. 

In [1]:
import os                                # accessing directory of files
import pandas as pd                      # storing the data
from bs4 import BeautifulSoup            # removing HTML tags
import re                                # text processing with regular expressions
from gensim.models import word2vec       # embedding algorithm
import numpy as np                       # arrays and other mathy structures     
from tqdm import tqdm                    # progress bars for long loops
from gensim import models                # doc2vec implementation
from random import shuffle               # for shuffling reviews
from sklearn.linear_model import LogisticRegression
import nltk.data                         # sentence splitting
from keras.models import Sequential      # deep learning (part 1)
from keras.layers import Dense, Dropout  # deep learning (part 2)
%matplotlib inline                       


Using TensorFlow backend.


The dataset can be downloaded [here](http://ai.stanford.edu/~amaas/data/sentiment/).  We first write code to extract the reviews into Pandas dataframes.

In [2]:
def load_data(directory_name):
    # load dataset from directory to Pandas dataframe
    data = []
    files = [f for f in os.listdir('../../../aclImdb/' + directory_name)]
    for f in files:
        with open('../../../aclImdb/' + directory_name + f, "r", encoding = 'utf-8') as myfile:
            data.append(myfile.read())
    df = pd.DataFrame({'review': data, 'file': files})
    return df

# load training dataset
train_pos = load_data('train/pos/')
train_neg = load_data('train/neg/')

# load test dataset
test_pos = load_data('test/pos/')
test_neg = load_data('test/neg/')

# load unsupervised dataset
unsup = load_data('train/unsup/')

print("\n %d pos train reviews \n %d neg train reviews \n %d pos test reviews \n %d neg test reviews \n %d unsup reviews" \
      % (train_pos.shape[0], train_neg.shape[0], test_pos.shape[0], test_neg.shape[0], unsup.shape[0]))
print("\n TOTAL: %d reviews" % int(train_pos.shape[0] + train_neg.shape[0] + test_pos.shape[0] + test_neg.shape[0] + unsup.shape[0]))


 12500 pos train reviews 
 12500 neg train reviews 
 12500 pos test reviews 
 12500 neg test reviews 
 50000 unsup reviews

 TOTAL: 100000 reviews


`train_pos`, `train_neg`, `test_pos`, `test_neg`, and `unsup` are Pandas dataframes.  They each have two columns, and each row corresponds to a review:
- `file` : name of file that contains review
- `review` : the full text of the review

We write a function `review_to_wordlist`, which processes each review as follows:
- Punctuation is made consistent through the use of regular expressions.
- HTML tags are removed through the use of the Beautiful Soup library.
- All words are converted to lowercase.
- Each review is converted into a list of words.

We note that there is still some room for improvement.  For instance, 
- Strings like "Sgt. Cutter" currently are broken into two sentences.  We should instead determine how to differentiate between periods that signify the end of an abbreviation and periods that denote the end of a sentence.
- Some writers separate their sentences with commas or line breaks; the algorithm currently absorbs these multiple sentences into an individual sentence.
- Ellipses (...) are currently processed as multiple, empty sentences (which are then discarded).
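
The first item above (abbreviations such as "Sgt.") could be handled along these lines. This is only a sketch under stated assumptions: `ABBREVIATIONS` is a tiny hypothetical list, `split_sentences` is a helper introduced here for illustration, and the ellipsis and comma cases are left untouched.

```python
# Hypothetical, deliberately tiny abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"sgt", "mr", "mrs", "dr"}

def split_sentences(text):
    # Split on whitespace, closing a sentence at '.', '!' or '?'
    # unless the token is a known abbreviation followed by a period.
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")):
            stem = token.rstrip(".!?").lower()
            if token.endswith(".") and stem in ABBREVIATIONS:
                continue  # "Sgt." does not end the sentence
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

result = split_sentences("Sgt. Cutter is great. Watch it!")
print(result)  # ['Sgt. Cutter is great.', 'Watch it!']
```

A lookup list like this is easy to reason about but only as good as its coverage, which is why a trained sentence tokenizer may be the better long-term choice.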

Before writing this post, I read the Kaggle tutorial [here](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors).  My current processing algorithm borrows from that page, but also adds some meaningful improvements, partially informed by the algorithm [here](https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py).  For instance, 
- We keep the punctuation that ends each sentence (i.e., period vs. exclamation point), whereas punctuation was discarded in the Kaggle tutorial.  
- We do smarter processing of contractions, in a way that understands that "should've" = "should" + "'ve".  In the Kaggle tutorial, "should've" is kept as a single word (contractions are not understood in terms of their composite parts).
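
To make the two points above concrete, here is a small self-contained sketch. It uses only a subset of the substitutions that appear in `clean_str`; the helper name `split_contractions` and the sample sentence are made up for illustration.

```python
import re

def split_contractions(text):
    # A subset of the clean_str substitutions, enough for this demo
    text = re.sub(r"\'ve", " 've", text)   # should've -> should 've
    text = re.sub(r"n\'t", " n't", text)   # didn't    -> did n't
    text = re.sub(r",", " , ", text)       # commas become their own token
    text = re.sub(r"!", " ! ", text)       # sentence-ending punctuation is kept
    return text.lower().split()

tokens = split_contractions("I should've left, but I didn't!")
print(tokens)
# ['i', 'should', "'ve", 'left', ',', 'but', 'i', 'did', "n't", '!']
```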

In [3]:
def clean_str( string ):
    # Function that cleans text using regular expressions
    string = re.sub(r' +', ' ', string)
    string = re.sub(r'\.+', '.', string)
    string = re.sub(r'\.(?! )', '. ', string)    
    string = re.sub(r"\'s", " \'s", string) 
    string = re.sub(r"\'ve", " \'ve", string) 
    string = re.sub(r"n\'t", " n\'t", string) 
    string = re.sub(r"\'re", " \'re", string) 
    string = re.sub(r"\'m", " \'m", string) 
    string = re.sub(r"\'d", " \'d", string) 
    string = re.sub(r"\'ll", " \'ll", string) 
    string = re.sub(r",", " , ", string) 
    string = re.sub(r"!", " ! ", string) 
    string = re.sub(r"\(", " ( ", string) 
    string = re.sub(r"\)", " ) ", string) 
    string = re.sub(r"\?", " ? ", string) 
    string = re.sub(r"\.", " . ", string)
    string = re.sub(r"\-", " - ", string)
    string = re.sub(r"\;", " ; ", string)
    string = re.sub(r"\:", " : ", string)
    string = re.sub(r'\"', ' " ', string)
    string = re.sub(r'\/', ' / ', string)
    return string

Each cleaned review is fed into the `LabeledLineReview` class, written below.  These labeled reviews are fed into the Doc2Vec algorithm to obtain an embedding of each review.  We note that we use the full set of 100,000 reviews to learn the embedding, but our classification algorithm will be trained with the training set only.
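
One detail worth keeping in mind: the tag attached to each review is its bare file name, so two reviews stored under the same name in different folders would share a tag (and therefore a single document vector). Whether that actually happens depends on how the dataset names its files. A minimal sketch of one way to rule it out, assuming a hypothetical helper `make_tag` and made-up file names (this is not what the code in this post does):

```python
def make_tag(directory_name, file_name):
    # Hypothetical helper: prefix the folder so the tag is unique corpus-wide
    return directory_name + file_name

# Made-up file names, chosen to show two folders holding the same name
tags = [make_tag('train/pos/', '0_9.txt'), make_tag('test/pos/', '0_9.txt')]
print(tags)  # ['train/pos/0_9.txt', 'test/pos/0_9.txt']
```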

In [4]:
class Doc2VecUtility(object):

    @staticmethod
    def review_to_wordlist( review ):
        #
        # Function to turn each review into a list of sentences, 
        # where each sentence is a list of words
        #
        # 1. Process punctuation, excessive periods, missing spaces 
        review = clean_str(review)
        #
        # 2. Remove HTML tags 
        review = BeautifulSoup(review, "lxml").get_text()
        #
        # 3. remove white spaces
        review = review.strip()
        #
        # 4. return lowercase collection of words
        wordlist = review.lower().split()
        #
        # 5. Return the list of words
        return wordlist

    class LabeledLineReview(object):
        def __init__(self, dflist):
            self.dflist = dflist

        def __iter__(self):
            for df in self.dflist:
                for idx in tqdm(df.index):
                    yield models.doc2vec.LabeledSentence(Doc2VecUtility.review_to_wordlist(df.loc[idx, 'review']), [df.loc[idx, 'file']])

        def to_array(self):
            self.reviews = []
            for df in self.dflist:
                for idx in tqdm(df.index):
                    self.reviews.append(models.doc2vec.LabeledSentence(Doc2VecUtility.review_to_wordlist(df.loc[idx, 'review']), [df.loc[idx, 'file']]))
            return self.reviews

        def reviews_perm(self):
            shuffle(self.reviews)
            return self.reviews
    
    def train(self):
        #
        # Trains the doc2vec model
        #
        # 1. Get all reviews together
        reviews = self.LabeledLineReview([train_pos, train_neg, test_pos, test_neg, unsup])
        
        # 2. Set values for various parameters and define the model
        num_features = 100    # Word vector dimensionality                      
        min_word_count = 1   # Minimum word count                        
        num_workers = 8       # Number of threads to run in parallel
        context = 10          # Context window size                                                                                    
        downsampling = 1e-4   # Downsample setting for frequent words
        #
        model = models.Doc2Vec(workers = num_workers, \
                               size = num_features, min_count = min_word_count, \
                               window = context, sample = downsampling, negative = 5)
        model.build_vocab(reviews.to_array())
        #
        # 3. Train the model
        for epoch in tqdm(range(10)):
            model.train(reviews.reviews_perm())
        #
        # 4. Save the model
        model.init_sims(replace=True)
        model.save("models/d2v")
                
    def get_embedding(self):
        #
        # Returns embedding and labels, training model if necessary
        #
        # 1. Load the saved model.
        #   (If the model is not already saved, train the model)
        if not os.path.isfile('models/d2v'):
            self.train()
        model = models.Doc2Vec.load("models/d2v")
        #
        # 2. Obtain train data embeddings and labels
        train_array = np.zeros((25000, 100))
        train_tags = list(train_pos['file'].values) + list(train_neg['file'].values)
        for idx , val in enumerate(train_tags):
            train_array[idx] = model.docvecs[val]
        train_labels = np.append(np.ones(12500), np.zeros(12500))
        #
        # 3. Obtain test data embeddings and labels
        test_array = np.zeros((25000, 100))
        test_tags = list(test_pos['file'].values) + list(test_neg['file'].values)
        for idx , val in enumerate(test_tags):
            test_array[idx] = model.docvecs[val]
        test_labels = np.append(np.ones(12500), np.zeros(12500))
        #
        # 4. Return embeddings and labels
        return train_array, train_labels, test_array, test_labels

In [5]:
[d2v_train, train_labels, d2v_test, test_labels] = Doc2VecUtility().get_embedding()

In [6]:
classifier = LogisticRegression()
classifier.fit(d2v_train, train_labels)
classifier.score(d2v_test, test_labels)

0.8992

Woohoo!  We can predict sentiment with nearly 90 percent accuracy!  Can we do better?  Let's try out an MLP with one hidden layer.

In [7]:
# Now, we try a multilayer perceptron (with one hidden layer) in Keras !
keras_model = Sequential()
keras_model.add(Dense(200, input_dim=100, init='uniform', activation='relu'))
keras_model.add(Dropout(0.5))
keras_model.add(Dense(1, activation='sigmoid'))
keras_model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

keras_model.fit(d2v_train, train_labels, nb_epoch=20, batch_size=20)
loss, accuracy = keras_model.evaluate(d2v_test, test_labels, verbose=0)
print("\n", accuracy)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20

 0.90508
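
For a sense of how small this network is, its trainable parameters can be counted by hand. The sketch below assumes the usual fully connected layer (a weight matrix plus a bias vector) and that dropout adds no parameters; `dense_params` is a helper introduced here for the arithmetic, not a Keras function.

```python
def dense_params(n_in, n_out):
    # weights (n_in * n_out) plus one bias per output unit
    return n_in * n_out + n_out

hidden = dense_params(100, 200)   # 100-dim Doc2Vec input -> 200 hidden units
output = dense_params(200, 1)     # 200 hidden units -> 1 sigmoid output
total = hidden + output
print(hidden, output, total)      # 20200 201 20401
```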


Hmm, it's a slight improvement, but we haven't done much better.  To improve, our intuition tells us we should try some combination of:
- better string cleaning,
- testing other parameters for Doc2Vec embedding step, and
- constructing deeper neural networks.

However, it seems that Kaggle competitors were not able to make this work -- that is, we won't be able to do much better with the current plan.  Thus, we will try something else, informed by the techniques [here](https://www.kaggle.com/c/word2vec-nlp-tutorial/forums/t/14966/post-competition-solutions).  Namely, we will train a few more models and then use an ensemble method.

First, we train a Word2Vec embedding, which we expect will get slightly worse performance than the Doc2Vec model.

In [74]:
class Word2VecUtility(object):
    
    @staticmethod
    def getAvgFeatureVecs(df, model, num_features):
        #
        # Given a df of reviews, calculate the average feature vector for each one 
        # 
        index2word_set = set(model.index2word)
        reviewFeatureVecs = []
        for idx in tqdm(df.index):
            to_append = Word2VecUtility.makeAvgFeatureVec(df.loc[idx, 'review'], model, index2word_set, num_features)
            reviewFeatureVecs.append(to_append)
        reviewFeatureVecs = np.array(reviewFeatureVecs)
        return reviewFeatureVecs

    @staticmethod
    def makeAvgFeatureVec(review, model, index2word_set, num_features):
        #
        # Averages all of the word vectors in a given review
        #
        # 1. Pre-initialize an empty numpy array (for speed)
        featureVec = np.zeros((num_features,), dtype="float32")
        #
        # 2. Initialize number of words in review
        nwords = 0.
        # 
        # 3. Loop over each word in the review 
        #    If it is in the model's vocab, add its feature vector to the total
        words = Word2VecUtility.review_to_wordlist(review)
        for word in words:
            if word in index2word_set: 
                nwords = nwords + 1.
                featureVec = np.add(featureVec, model[word])
        # 
        # 4. Divide the result by the number of words to get the average
        #    (skip the division if no word was in the vocab, to avoid NaNs)
        if nwords > 0:
            featureVec = np.divide(featureVec, nwords)
        # 5. Return the average word vector
        return featureVec
    
    @staticmethod
    def review_to_wordlist( review, only_words = False ):
        #
        # Function to convert a document to a sequence of words,
        # optionally removing non-letter characters.  Returns a list of words.
        #
        # 1. Remove HTML
        review = BeautifulSoup(review, "lxml").get_text()
        #
        # 2. Process punctuation, excessive periods, missing spaces
        review = clean_str(review)
        #
        # 3. (Optionally) remove non-letters / non-words
        if only_words:
            review = re.sub("[^a-zA-Z]"," ", review)
        #
        # 4. Convert words to lower case and split into list
        words = review.lower().split()
        #
        # 5. Return a list of words
        return words
    
    @staticmethod
    def review_to_lists_of_lists( review, only_words = False ):
        # 
        # Function to turn each review into a list of sentences, 
        # where each sentence is a list of words
        #
        # 1. Remove HTML tags 
        review = BeautifulSoup(review, "lxml").get_text()
        #
        # 2. Process punctuation, excessive periods, missing spaces 
        review = clean_str(review)
        #
        # 3. (Optionally) remove non-letters / non-words
        if only_words:
            review = re.sub("[^a-zA-Z]"," ", review)
        #
        # 4. Use the NLTK tokenizer to split the review into list of sentences
        #   (getting rid of extra spaces at front/back)
        tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        raw_sentences = tokenizer.tokenize(review.strip())
        #
        # 5. Loop over each sentence to get list of list of lowercase words
        sentences = []
        for raw_sentence in raw_sentences:
            # Convert to lowercase and split into list of words
            raw_sentence = raw_sentence.lower().split()
            # If a sentence is not long enough, skip it
            if len(raw_sentence) > 1:
                # add list of words to returned object
                sentences.append( raw_sentence )
        #
        # 6. Return the list of sentences (each sentence is a list of words,
        # so returns a list of lists)
        return sentences

    @staticmethod
    def corpus_to_list(df, tokenizer):
        # Turns dataframe of reviews into a list of sentences,
        # where each sentence is a list of words
        # and sentences are derived from *all reviews* in dataframe df
        # (tokenizer is currently unused; review_to_lists_of_lists loads its own)
        sentences = []
        for idx in tqdm(df.index):
            to_append = Word2VecUtility.review_to_lists_of_lists(df.loc[idx, 'review'])
            sentences += to_append
        return sentences

    def train(self):
        #
        # Trains the word2vec model
        #
        # 1. Assemble all reviews
        tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        train_pos_sentences = Word2VecUtility.corpus_to_list(train_pos, tokenizer)
        train_neg_sentences = Word2VecUtility.corpus_to_list(train_neg, tokenizer)
        test_pos_sentences = Word2VecUtility.corpus_to_list(test_pos, tokenizer)
        test_neg_sentences = Word2VecUtility.corpus_to_list(test_neg, tokenizer)
        unsup_sentences = Word2VecUtility.corpus_to_list(unsup, tokenizer)
        sentences = train_pos_sentences + train_neg_sentences + test_pos_sentences + test_neg_sentences + unsup_sentences
        #
        # 2. Set values for various parameters 
        num_features = 300    # Word vector dimensionality                      
        min_word_count = 40   # Minimum word count                        
        num_workers = 4       # Number of threads to run in parallel
        context = 10          # Context window size                                                                                    
        downsampling = 1e-3   # Downsample setting for frequent words
        #
        # 3. Initialize and train the model 
        model = word2vec.Word2Vec(sentences, workers = num_workers, \
                    size = num_features, min_count = min_word_count, \
                    window = context, sample = downsampling)
        #
        # 4. Save the model
        model.init_sims(replace=True)
        model.save("models/w2v")
            
    def get_embedding(self):
        #
        # Returns embedding and labels, training model if necessary
        #
        # 1. Load the saved model.
        #   (If the model is not already saved, train the model)
        if not os.path.isfile('models/w2v'):
            self.train()
        model = word2vec.Word2Vec.load("models/w2v")
        #
        # 2. Obtain train data embeddings 
        pos_w2v_train = Word2VecUtility.getAvgFeatureVecs(train_pos, model, 300)
        neg_w2v_train = Word2VecUtility.getAvgFeatureVecs(train_neg, model, 300)
        w2v_train = np.append(pos_w2v_train, neg_w2v_train, axis=0)
        #
        # 3. Obtain test data embeddings
        pos_w2v_test = Word2VecUtility.getAvgFeatureVecs(test_pos, model, 300)
        neg_w2v_test = Word2VecUtility.getAvgFeatureVecs(test_neg, model, 300)
        w2v_test = np.append(pos_w2v_test, neg_w2v_test, axis=0)
        #
        # 4. Return all embeddings
        return w2v_train, w2v_test
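
The key step in the class above is that a review becomes a single feature vector by averaging the vectors of its in-vocabulary words. Here is a toy sketch of that idea in plain Python, with a made-up two-dimensional "embedding" (`toy_vectors`) standing in for the trained model; `average_vector` is a name introduced here, and returning zeros for a review with no known words is an assumption of this sketch.

```python
def average_vector(words, vectors, num_features):
    # Sum the vectors of known words, then divide by how many were found
    total = [0.0] * num_features
    n_known = 0
    for word in words:
        if word in vectors:
            n_known += 1
            total = [t + v for t, v in zip(total, vectors[word])]
    if n_known == 0:
        return total  # no known words: fall back to the zero vector
    return [t / n_known for t in total]

# Made-up vectors; "zzz" is out of vocabulary and is simply skipped
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
avg = average_vector(["good", "movie", "zzz"], toy_vectors, 2)
print(avg)  # [0.5, 0.5]
```

Averaging throws away word order entirely, which is one plausible reason to expect it to trail Doc2Vec.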

In [75]:
w2v_train, w2v_test = Word2VecUtility().get_embedding()

100%|██████████| 12500/12500 [00:13<00:00, 899.14it/s]
100%|██████████| 12500/12500 [00:13<00:00, 903.10it/s]
100%|██████████| 12500/12500 [00:13<00:00, 959.36it/s]
100%|██████████| 12500/12500 [00:12<00:00, 993.89it/s]


In [70]:
classifier = LogisticRegression()
classifier.fit(w2v_train, train_labels)
classifier.score(w2v_test, test_labels)

0.81228

Here, we see a drop in performance relative to the Doc2Vec embedding.

In [71]:
# Now, we try a multilayer perceptron (with one hidden layer) in Keras !
keras_model = Sequential()
keras_model.add(Dense(200, input_dim=300, init='uniform', activation='relu'))
keras_model.add(Dropout(0.5))
keras_model.add(Dense(1, activation='sigmoid'))
keras_model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

keras_model.fit(w2v_train, train_labels, nb_epoch=20, batch_size=20)
loss, accuracy = keras_model.evaluate(w2v_test, test_labels, verbose=0)
print("\n", accuracy)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20

 0.86204


As expected, we also see a drop in accuracy in the neural network with the Word2Vec embedding.
However, combining the two embeddings should result in an increase in performance, as we will see below.

In [79]:
# Combine Doc2Vec and Word2Vec features
train = np.append(d2v_train,w2v_train,axis=1)
test = np.append(d2v_test,w2v_test,axis=1)
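
Concatenating along `axis=1` keeps one row per review and places the 300 Word2Vec features next to the 100 Doc2Vec features, giving 400 columns. A toy illustration in plain Python, with made-up numbers and a hypothetical helper `concat_features` (the real code uses `np.append` as shown above):

```python
def concat_features(a_rows, b_rows):
    # Row i of the result is row i of a_rows followed by row i of b_rows
    if len(a_rows) != len(b_rows):
        raise ValueError("both feature sets must describe the same reviews")
    return [list(a) + list(b) for a, b in zip(a_rows, b_rows)]

d2v_toy = [[0.1, 0.2], [0.3, 0.4]]            # 2 reviews, 2 features each
w2v_toy = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # same 2 reviews, 3 features each
combined = concat_features(d2v_toy, w2v_toy)
print(len(combined), len(combined[0]))        # 2 5
```

This only makes sense if both arrays list the reviews in the same order, which holds here because both are built as positive reviews followed by negative ones.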

In [80]:
classifier = LogisticRegression()
classifier.fit(train, train_labels)
classifier.score(test, test_labels)

0.90007999999999999

This is a marginal increase in performance, relative to when we used Doc2Vec features alone.

In [82]:
# Now, we try a multilayer perceptron (with one hidden layer) in Keras !
keras_model = Sequential()
keras_model.add(Dense(200, input_dim=400, init='uniform', activation='relu'))
keras_model.add(Dropout(0.5))
keras_model.add(Dense(1, activation='sigmoid'))
keras_model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

keras_model.fit(train, train_labels, nb_epoch=20, batch_size=20)
loss, accuracy = keras_model.evaluate(test, test_labels, verbose=0)
print("\n", accuracy)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20

 0.9072


Still, only a slight increase in performance over Doc2Vec alone.  I bet we can do better :)!