This article is still __under development__: I'm currently working on adding more description and more algorithms.

We work with a dataset containing 50,000 movie reviews from IMDB, labeled by sentiment (positive/negative).  In addition, there are another 50,000 IMDB reviews provided without any rating labels.  

The reviews are split evenly into train and test sets (25k train and 25k test). The overall distribution of labels is also balanced within the train and test sets (12.5k pos and 12.5k neg).  Our goal is to predict sentiment in the test dataset. 

In [1]:
import os                                # accessing directory of files
import pandas as pd                      # storing the data
import re                                # text processing with regular expressions
import numpy as np                       # arrays and other mathy structures     
from sklearn.linear_model import LogisticRegression  # LR implementation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from keras.models import Sequential, load_model      # MLP, RNN implementation
from keras.layers import Dense, Dropout, LSTM        # MLP, RNN implementation
from keras.layers.embeddings import Embedding        # RNN implementation
from keras.preprocessing import sequence             # pre-process RNN data
from classes import Doc2VecUtility       # training, loading Doc2Vec embeddings for LR, MLP
from classes import Word2VecUtility      # training, loading Word2Vec embeddings
from classes import Word2VecRNNUtility   # training, loading Word2Vec embeddings for RNN
from classes import BagofWordsUtility    # loading data for NBSVM
import matplotlib.pyplot as plt          # plotting
import plotly.plotly as py               # pretty plotting
import plotly.graph_objs as go           # pretty plotting
from plotly.tools import FigureFactory as FF         # for tables
import seaborn as sns                    # pretty plotting
import warnings                          # suppress (plotly) warnings
import statistics                        # calculating median
from nbsvm import NBSVM                  # NBSVM
import random
%matplotlib inline   

random.seed(8675309)

# If you are using Python 3, you will get an error.
# (Pattern is a Python 2 library and fails to install for Python 3.)

Using TensorFlow backend.


The dataset can be downloaded [here](http://ai.stanford.edu/~amaas/data/sentiment/).  We have written code to extract the reviews into Pandas dataframes.

In [2]:
def load_data(directory_name):
    # load dataset from directory to Pandas dataframe
    data = []
    files = [f for f in os.listdir('../../../aclImdb/' + directory_name)]
    for f in files:
        with open('../../../aclImdb/' + directory_name + f, "r", encoding = 'utf-8') as myfile:
            data.append(myfile.read())
    df = pd.DataFrame({'review': data, 'file': files})
    return df

# load training dataset
train_pos = load_data('train/pos/')
train_neg = load_data('train/neg/')

# load test dataset
test_pos = load_data('test/pos/')
test_neg = load_data('test/neg/')

# load unsupervised dataset
unsup = load_data('train/unsup/')

print("\n %d pos train reviews \n %d neg train reviews \n %d pos test reviews \n %d neg test reviews \n %d unsup reviews" \
      % (train_pos.shape[0], train_neg.shape[0], test_pos.shape[0], test_neg.shape[0], unsup.shape[0]))
print("\n TOTAL: %d reviews" % int(train_pos.shape[0] + train_neg.shape[0] + test_pos.shape[0] + test_neg.shape[0] + unsup.shape[0]))


 12500 pos train reviews 
 12500 neg train reviews 
 12500 pos test reviews 
 12500 neg test reviews 
 50000 unsup reviews

 TOTAL: 100000 reviews


`train_pos`, `train_neg`, `test_pos`, `test_neg`, and `unsup` are Pandas dataframes.  They each have two columns, and each row corresponds to a review:
- `file` : name of file that contains review
- `review` : the full text of the review

Each review is cleaned before being fed into a classification algorithm.
- HTML tags are removed through the use of the Beautiful Soup library.
- Punctuation is made consistent through the use of regular expressions.  This code appears below, in the `clean_str` function.
- All words are converted to lowercase.
- Each review is converted into either a list of words or a list of lists of words.  In the latter, each inner list corresponds to a sentence.

In [3]:
def clean_str( string ):
    # Function that cleans text using regular expressions
    string = re.sub(r' +', ' ', string)
    string = re.sub(r'\.+', '.', string)
    string = re.sub(r'\.(?! )', '. ', string)    
    string = re.sub(r"\'s", " \'s", string) 
    string = re.sub(r"\'ve", " \'ve", string) 
    string = re.sub(r"n\'t", " n\'t", string) 
    string = re.sub(r"\'re", " \'re", string) 
    string = re.sub(r"\'m", " \'m", string) 
    string = re.sub(r"\'d", " \'d", string) 
    string = re.sub(r"\'ll", " \'ll", string) 
    string = re.sub(r",", " , ", string) 
    string = re.sub(r"!", " ! ", string) 
    string = re.sub(r"\(", " ( ", string) 
    string = re.sub(r"\)", " ) ", string) 
    string = re.sub(r"\?", " ? ", string) 
    string = re.sub(r"\.", " . ", string)
    string = re.sub(r"\-", " - ", string)
    string = re.sub(r"\;", " ; ", string)
    string = re.sub(r"\:", " : ", string)
    string = re.sub(r'\"', ' " ', string)
    string = re.sub(r'\/', ' / ', string)
    return string

We note that there is still some room for improvement.  For instance, 
- Strings like "Sgt. Cutter" currently are broken into two sentences.  We should instead determine how to differentiate between periods that signify the end of an abbreviation and periods that denote the end of a sentence.
- Some writers separate their sentences with commas or line breaks; the algorithm currently absorbs these multiple sentences into an individual sentence.
- Ellipses (...) are currently processed as multiple, empty sentences (which are then discarded).
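
To make the list-of-sentences representation and the abbreviation problem concrete, here is a minimal sketch that uses only the standard library. The helper names `strip_tags` and `review_to_sentences` are hypothetical, the tag-stripping regular expression is a crude stand-in for Beautiful Soup, and the sentence splitter is a simplification rather than the code used for this post.

```python
import re

def strip_tags(text):
    # Crude stand-in for Beautiful Soup: replace anything that looks like a tag with a space
    return re.sub(r"<[^>]+>", " ", text)

def review_to_sentences(review):
    # Hypothetical helper: returns a list of sentences, each a list of lowercase tokens
    text = strip_tags(review).lower()
    # Split after ., ! or ? when followed by whitespace, keeping the punctuation
    raw_sentences = re.split(r"(?<=[.!?])\s+", text)
    sentences = []
    for raw in raw_sentences:
        # Words (with apostrophes) and a few punctuation marks become separate tokens
        tokens = re.findall(r"[a-z0-9']+|[.!?,]", raw)
        if tokens:
            sentences.append(tokens)
    return sentences

example = "Sgt. Cutter was great!<br />I loved it."
sentences = review_to_sentences(example)
for sentence in sentences:
    print(sentence)
```

In this toy run, "Sgt." ends up as its own two-token sentence, which is exactly the abbreviation problem described in the first bullet.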

Before writing this post, I read the Kaggle tutorial [here](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors).  My current processing algorithm borrows from that page, but also adds some meaningful improvements, partially informed by the algorithm [here](https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py).  For instance, 
- We keep the punctuation that ends each sentence (i.e., period vs. exclamation point), whereas punctuation was discarded in the Kaggle tutorial.  
- We do smarter processing of contractions, in a way that understands that "should've" = "should" + "'ve".  In the Kaggle tutorial, "should've" is kept as a single word (contractions are not understood in terms of their composite parts).
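
The contraction handling can be seen in isolation with a small sketch. It reuses two of the substitutions from `clean_str`; the function name `split_contractions` and the sample sentence are made up for illustration.

```python
import re

def split_contractions(text):
    # Same idea as in clean_str, restricted to two contraction suffixes:
    # insert a space before the suffix so it becomes its own token
    text = re.sub(r"\'ve", " 've", text)
    text = re.sub(r"n\'t", " n't", text)
    return text

tokens = split_contractions("i should've known it wouldn't work").split()
print(tokens)
# ['i', 'should', "'ve", 'known', 'it', 'would', "n't", 'work']
```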

# TL;DR ~ Results

In [4]:
data_matrix = [['', 'Word2Vec', 'Doc2Vec', 'CountVectorizer', 'TF-IDF'],
               ['LR', '81.300%', '89.456%', '89.900%', '90.076%'],
               ['MLP', '86.264%', '90.496%', '-', '-'],
               ['LSTM', '85.648%', '-', '-', '-'],
               ['NBSVM', '-', '-', '91.584%', '90.328%']]
colorscale = [[0, '#375fa9'],[.5, '#ffffff'],[1, '#ffffff']]
table = FF.create_table(data_matrix, index=True, colorscale=colorscale)
py.iplot(table)

# Doc2Vec + (LR, MLP) Implementation

We can now see how well our Doc2Vec embedding performs.  To peek at the code under the hood, please look at the Python file [here](https://github.com/alexisbcook/Pelican_blog_all/blob/master/content/notebooks/classes/Doc2VecUtility.py).

In [5]:
[d2v_train, d2v_test] = Doc2VecUtility.get_embedding()

In [6]:
train_labels = np.append(np.ones(12500), np.zeros(12500))
test_labels = np.append(np.ones(12500), np.zeros(12500))

In [7]:
# train the model
classifier = LogisticRegression()
classifier.fit(d2v_train, train_labels)

# calculate accuracy
classifier.score(d2v_test, test_labels)

0.89456000000000002

Woohoo!  We can predict sentiment with nearly 90 percent accuracy!  Can we do better?  Let's try out an MLP with one hidden layer.

In [8]:
# train the model
keras_model = Sequential()
keras_model.add(Dense(200, input_dim=300, init='uniform', activation='relu'))
keras_model.add(Dropout(0.5))
keras_model.add(Dense(1, activation='sigmoid'))
keras_model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
keras_model.fit(d2v_train, train_labels, nb_epoch=10, batch_size=20)

# calculate accuracy
loss, accuracy = keras_model.evaluate(d2v_test, test_labels, verbose=0)
print("\n", accuracy)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

 0.90496


Hmm, it's a slight improvement, but we haven't done much better.  To improve, our intuition tells us we should try some combination of:
- better string cleaning,
- testing other parameters for the Doc2Vec embedding step, and
- constructing deeper neural networks.

However, it seems that Kaggle competitors were not able to make this work -- that is, we won't be able to do much better with the current plan.  Thus, we will try something else, informed by the techniques [here](https://www.kaggle.com/c/word2vec-nlp-tutorial/forums/t/14966/post-competition-solutions).  Namely, we will train a few more embedding models and then use an ensemble method.
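
One simple way to build an ensemble is to average the positive-class probabilities predicted by several models and threshold the result. The sketch below is an assumption about what such a step could look like, not the method from the linked solutions; the function name `average_ensemble` and the probabilities are made up.

```python
def average_ensemble(probabilities_per_model, threshold=0.5):
    # probabilities_per_model: one list of P(positive) per model, aligned by review
    n_models = len(probabilities_per_model)
    averaged = [sum(ps) / n_models for ps in zip(*probabilities_per_model)]
    # Predict positive (1) when the averaged probability reaches the threshold
    return [1 if p >= threshold else 0 for p in averaged]

# Made-up probabilities from three hypothetical models on three reviews
model_a = [0.9, 0.4, 0.2]
model_b = [0.7, 0.7, 0.1]
model_c = [0.8, 0.5, 0.6]
predictions = average_ensemble([model_a, model_b, model_c])
print(predictions)  # [1, 1, 0]
```

Averaging lets a confident model outvote an uncertain one, as in the second review, where one model alone would have predicted negative.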

# Word2Vec + (LR, MLP) Implementation

We train a Word2Vec embedding, which we expect will get slightly worse performance than the Doc2Vec model.  To peek at the code under the hood, please look at the Python file [here](https://github.com/alexisbcook/Pelican_blog_all/blob/master/content/notebooks/classes/Word2VecUtility.py).
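
Word2Vec produces one vector per word, so each review has to be reduced to a single fixed-length vector before it can be fed to logistic regression or an MLP. A common choice is to average the vectors of the words in the review. The sketch below assumes that approach (the actual `Word2VecUtility` may differ) and uses a made-up 3-dimensional vocabulary instead of a trained model.

```python
def average_vector(words, word_vectors, dim):
    # Average the vectors of all in-vocabulary words; unknown words are skipped
    total = [0.0] * dim
    count = 0
    for word in words:
        vec = word_vectors.get(word)
        if vec is None:
            continue
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    if count == 0:
        return total  # no known words: fall back to the zero vector
    return [t / count for t in total]

# Toy vectors, invented for illustration only
toy_vectors = {
    "great": [1.0, 0.0, 2.0],
    "movie": [0.0, 2.0, 0.0],
}
review = ["great", "movie", "!"]
features = average_vector(review, toy_vectors, dim=3)
print(features)  # [0.5, 1.0, 1.0]
```

Averaging discards word order, which is one plausible reason to expect Word2Vec features to trail Doc2Vec on this task.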

In [9]:
[w2v_train, w2v_test] = Word2VecUtility.get_embedding()

100%|██████████| 12500/12500 [00:13<00:00, 931.33it/s]
100%|██████████| 12500/12500 [00:13<00:00, 908.77it/s]
100%|██████████| 12500/12500 [00:13<00:00, 960.56it/s] 
100%|██████████| 12500/12500 [00:12<00:00, 996.59it/s] 


In [10]:
# train the model
classifier = LogisticRegression()
classifier.fit(w2v_train, train_labels)

# calculate accuracy
classifier.score(w2v_test, test_labels)

0.81299999999999994

In [11]:
# train the model
keras_model = Sequential()
keras_model.add(Dense(200, input_dim=300, init='uniform', activation='relu'))
keras_model.add(Dropout(0.5))
keras_model.add(Dense(1, activation='sigmoid'))
keras_model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
keras_model.fit(w2v_train, train_labels, nb_epoch=10, batch_size=20)

# calculate accuracy
loss, accuracy = keras_model.evaluate(w2v_test, test_labels, verbose=0)
print("\n", accuracy)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

 0.86264


# {Doc2Vec, Word2Vec} + (LR, MLP) Implementation

As expected, the Doc2Vec embedding provides better accuracy than the Word2Vec embedding (both with logistic regression and an MLP).  However, the two embeddings may capture different information, so we check below whether combining them improves performance.
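
Combining here means joining the two feature vectors of each review end to end, so two 300-dimensional embeddings give 600 features per review. The sketch below shows the operation on plain lists with toy sizes; `concat_features` is a hypothetical helper, and `np.append(a, b, axis=1)` performs the same column-wise join on NumPy arrays.

```python
def concat_features(first, second):
    # Join two feature matrices row by row (column-wise concatenation)
    if len(first) != len(second):
        raise ValueError("both matrices need one row per review")
    return [list(a) + list(b) for a, b in zip(first, second)]

# Two reviews: 2 toy "Doc2Vec" features and 3 toy "Word2Vec" features each
d2v_toy = [[0.1, 0.2], [0.3, 0.4]]
w2v_toy = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
combined = concat_features(d2v_toy, w2v_toy)
print(len(combined), len(combined[0]))  # 2 5
```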

In [12]:
# Combine Doc2Vec and Word2Vec features
d2v_w2v_train = np.append(d2v_train, w2v_train, axis=1)
d2v_w2v_test = np.append(d2v_test, w2v_test, axis=1)

In [13]:
# train the model
classifier = LogisticRegression()
classifier.fit(d2v_w2v_train, train_labels)

# calculate accuracy
classifier.score(d2v_w2v_test, test_labels)

0.89563999999999999

In [14]:
# train the model
keras_model = Sequential()
keras_model.add(Dense(200, input_dim=600, init='uniform', activation='relu'))
keras_model.add(Dropout(0.5))
keras_model.add(Dense(1, activation='sigmoid'))
keras_model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
keras_model.fit(d2v_w2v_train, train_labels, nb_epoch=10, batch_size=20)

# calculate accuracy
loss, accuracy = keras_model.evaluate(d2v_w2v_test, test_labels, verbose=0)
print("\n", accuracy)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

 0.906


Here, unfortunately, the performance is basically the same as when we used Doc2Vec alone.  

# Word2Vec + RNN/LSTM Implementation

To peek at the code under the hood, please look at the Python file [here](https://github.com/alexisbcook/Pelican_blog_all/blob/master/content/notebooks/classes/Word2VecRNNUtility.py).

In [15]:
[w2vRNN_train, w2vRNN_test, embedding_weights] = Word2VecRNNUtility.get_embedding()

100%|██████████| 12500/12500 [00:06<00:00, 1833.35it/s]
100%|██████████| 12500/12500 [00:06<00:00, 1839.01it/s]
100%|██████████| 12500/12500 [00:06<00:00, 1934.11it/s]
100%|██████████| 12500/12500 [00:06<00:00, 1830.87it/s]


Our input has variable length!  Keras expects every sequence in a batch to have the same length ... so, let's figure out how to pad or truncate the reviews ...

First, we calculate some statistics to determine how much information is lost when our reviews are truncated.

In [16]:
review_lengths = [len(element) for element in w2vRNN_train + w2vRNN_test]
print("Minimum review length:", min(review_lengths), "\n")
print("Maximum review length:", max(review_lengths), "\n")
print("Mean review length:", np.mean(review_lengths), "\n")
print("Median review length:", statistics.median(review_lengths), "\n")
# get number of reviews with more than 1000 words
print("Reviews of length >1000:", sum([i>1000 for i in review_lengths]), 
      "(", 100*sum([i>1000 for i in review_lengths])/len(review_lengths), "% ), ",
      "<=1000:", sum([i<=1000 for i in review_lengths]), "\n")

layout = go.Layout(
    xaxis=dict(title='Review Length'),
    yaxis=dict(title='Count'),
    bargap=0.1,
    bargroupgap=0.1
)

data = [go.Histogram(
    x=review_lengths,
    autobinx=True,
    opacity=0.75
)]

warnings.filterwarnings('ignore')
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

Minimum review length: 6 

Maximum review length: 2582 

Mean review length: 262.14742 

Median review length: 197.0 

Reviews of length >1000: 555 ( 1.11 % ),  <=1000: 49445 



We train only briefly, since RNNs are computationally expensive.
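
The `pad_sequences` call in the cell below brings every review to exactly 600 word indices. With `padding='post'` and `truncating='post'`, short reviews get zeros appended and long reviews lose everything after the first 600 indices. A pure-Python sketch of that behaviour for a single sequence follows; `pad_post` is a hypothetical helper, not part of Keras.

```python
def pad_post(seq, maxlen, value=0):
    # Keep the first `maxlen` items, then append `value` until the length is `maxlen`
    kept = list(seq[:maxlen])
    return kept + [value] * (maxlen - len(kept))

short = pad_post([5, 8, 2], maxlen=5)
long = pad_post([5, 8, 2, 9, 4, 7], maxlen=5)
print(short)  # [5, 8, 2, 0, 0]
print(long)   # [5, 8, 2, 9, 4]
```

Because index 0 is used only for padding, the `Embedding` layer can be created with `mask_zero=True` so that the LSTM skips the padded positions.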

In [17]:
# process data 
w2vRNN_train_final = sequence.pad_sequences(w2vRNN_train, maxlen=600, dtype='int32', \
                                            padding='post', truncating='post', value=0.)
w2vRNN_test_final = sequence.pad_sequences(w2vRNN_test, maxlen=600, dtype='int32', \
                                            padding='post', truncating='post', value=0.)

# get the model
if not os.path.isfile('models/rnn_model.h5'):
    keras_model = Sequential()
    keras_model.add(Embedding(output_dim=300, input_dim=len(embedding_weights), mask_zero=True, weights=[embedding_weights]))
    keras_model.add(LSTM(10, return_sequences=False))  
    keras_model.add(Dense(1, activation='sigmoid'))
    keras_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    keras_model.fit(w2vRNN_train_final, train_labels, nb_epoch=3, batch_size=20)
    # save the model
    keras_model.save('models/rnn_model.h5')
keras_model = load_model('models/rnn_model.h5')

# calculate accuracy
loss, accuracy = keras_model.evaluate(w2vRNN_test_final, test_labels, verbose=1)
print("\n", accuracy)


 0.85648


From the Keras example [here](https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py):

> The IMDB dataset is too small for LSTM to be of any advantage compared to simpler, much faster methods such as TF-IDF + LogReg.

> RNNs are tricky. The choice of batch size, loss function, and optimizer are critical. Some configurations won't converge.

# NBSVM Implementation

To peek at the code under the hood, please look at the Python file [here](https://github.com/alexisbcook/Pelican_blog_all/blob/master/content/notebooks/classes/BagofWordsUtility.py).
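
NBSVM (Wang and Manning, 2012) scales each feature by a naive Bayes log-count ratio before fitting a linear classifier. The sketch below computes only that ratio on a toy binary bag-of-words matrix. It illustrates the idea under the assumption of a smoothing constant `alpha = 1`; it is not the code in the `nbsvm` module used here, and `log_count_ratio` is a hypothetical name.

```python
import math

def log_count_ratio(rows, labels, alpha=1.0):
    # rows: binary feature vectors; labels: 1 for positive, 0 for negative
    n_features = len(rows[0])
    p = [alpha] * n_features  # smoothed feature counts in positive reviews
    q = [alpha] * n_features  # smoothed feature counts in negative reviews
    for row, label in zip(rows, labels):
        target = p if label == 1 else q
        for j, x in enumerate(row):
            target[j] += x
    p_sum, q_sum = sum(p), sum(q)
    # r_j = log( (p_j / ||p||_1) / (q_j / ||q||_1) )
    return [math.log((p[j] / p_sum) / (q[j] / q_sum)) for j in range(n_features)]

# Toy data; columns stand for "great", "boring", "movie"
rows = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 1, 1]]
labels = [1, 1, 0, 0]
r = log_count_ratio(rows, labels)
print(r)
```

Features that appear mostly in positive reviews get a positive ratio, features that appear mostly in negative reviews get a negative one, and features that are equally common in both classes land at zero.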

In [18]:
# get the data
BOW_train, BOW_test = BagofWordsUtility.get_embedding()

# get the count embedding
vectorizer = CountVectorizer(ngram_range = (1,4), \
                             analyzer = 'word', \
                             binary = True, \
                             max_features = 2000000)

# get train and test data
count_train_data = vectorizer.fit_transform(BOW_train)
count_test_data = vectorizer.transform(BOW_test)

# define and train the model
nbsvm_model = NBSVM()
nbsvm_model.fit(count_train_data, train_labels)

# evaluate accuracy
print('Test Accuracy: %s' % nbsvm_model.score(count_test_data, test_labels))

100%|██████████| 12500/12500 [00:06<00:00, 2018.07it/s]
100%|██████████| 12500/12500 [00:06<00:00, 2010.73it/s]
100%|██████████| 12500/12500 [00:06<00:00, 2036.89it/s]
100%|██████████| 12500/12500 [00:06<00:00, 2021.52it/s]


Test Accuracy: 0.91584


In [19]:
# train the model
classifier = LogisticRegression()
classifier.fit(count_train_data, train_labels)

# calculate accuracy
classifier.score(count_test_data, test_labels)

0.89900000000000002

# TF-IDF + (LR, NBSVM) Implementation

In [20]:
# get the embedding
vectorizer = TfidfVectorizer(ngram_range = (1,2), 
                             analyzer = 'word', 
                             binary = True,
                             max_features = 50000)

# get train and test data
TFIDF_train_data = vectorizer.fit_transform(BOW_train)
TFIDF_test_data = vectorizer.transform(BOW_test)

In [21]:
# define and train the model
classifier = LogisticRegression()
classifier.fit(TFIDF_train_data, train_labels)

# calculate accuracy
print('Test Accuracy: %s' % classifier.score(TFIDF_test_data, test_labels))

Test Accuracy: 0.90076


In [22]:
# define and train the model
nbsvm_model = NBSVM()
nbsvm_model.fit(TFIDF_train_data, train_labels)

# evaluate accuracy
print('Test Accuracy: %s' % nbsvm_model.score(TFIDF_test_data, test_labels))

Test Accuracy: 0.90328
