We work with a dataset of 50,000 movie reviews from IMDB, each labeled by sentiment (positive/negative).  The dataset also provides another 50,000 IMDB reviews without any sentiment labels.  

The labeled reviews are split evenly into train and test sets (25k train and 25k test). The labels are balanced within each set (12.5k pos and 12.5k neg in train, and the same in test).  Our goal is to predict sentiment in the test dataset. 

In [1]:
import os                                # accessing directory of files
import pandas as pd                      # storing the data
from bs4 import BeautifulSoup            # removing HTML tags
import re                                # text processing with regular expressions
from gensim.models import word2vec       # embedding algorithm
import numpy as np                       # arrays and other mathy structures     
from tqdm import tqdm                    # progress bars for long-running loops
from gensim import models                # doc2vec implementation
from random import shuffle               # reordering reviews between epochs
from sklearn.linear_model import LogisticRegression   # sentiment classifier
%matplotlib inline                       

# If you are using Python 3, you will get an error.
# (Pattern is a Python 2 library and fails to install for Python 3.)



The dataset can be downloaded [here](http://ai.stanford.edu/~amaas/data/sentiment/).  We first write code to extract the reviews into Pandas dataframes.

In [2]:
def load_data(directory_name):
    # load dataset from directory to Pandas dataframe
    directory = os.path.join('../../aclImdb', directory_name)
    files = os.listdir(directory)
    data = []
    for f in files:
        # each file holds the full text of one review
        with open(os.path.join(directory, f), "r", encoding='utf-8') as myfile:
            data.append(myfile.read())
    df = pd.DataFrame({'review': data, 'file': files})
    return df

# load training dataset
train_pos = load_data('train/pos/')
train_neg = load_data('train/neg/')

# load test dataset
test_pos = load_data('test/pos/')
test_neg = load_data('test/neg/')

# load unsupervised dataset
unsup = load_data('train/unsup/')

print("\n %d pos train reviews \n %d neg train reviews \n %d pos test reviews \n %d neg test reviews \n %d unsup reviews" \
      % (train_pos.shape[0], train_neg.shape[0], test_pos.shape[0], test_neg.shape[0], unsup.shape[0]))
print("\n TOTAL: %d reviews" % int(train_pos.shape[0] + train_neg.shape[0] + test_pos.shape[0] + test_neg.shape[0] + unsup.shape[0]))


 12500 pos train reviews 
 12500 neg train reviews 
 12500 pos test reviews     
 12500 neg test reviews 
 50000 unsup reviews

 TOTAL: 100000 reviews


`train_pos`, `train_neg`, `test_pos`, `test_neg`, and `unsup` are Pandas dataframes.  They each have two columns, and each row corresponds to a review:
- `file` : name of file that contains review
- `review` : the full text of the review
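
The `file` column carries more than an identifier.  According to the README shipped with the dataset, each file is named `[id]_[rating].txt`.  As a minimal sketch, assuming that convention holds for every file, the star rating can be recovered from the name (the helper `parse_file_name` is our own illustration and is not used in the rest of this post):

```python
import re

# Assumed naming convention: "<id>_<rating>.txt", e.g. "0_9.txt".
FILE_NAME_PATTERN = re.compile(r"^(\d+)_(\d+)\.txt$")

def parse_file_name(file_name):
    """Return (review_id, rating) parsed from a name like '0_9.txt'."""
    match = FILE_NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError("unexpected file name: %r" % file_name)
    return int(match.group(1)), int(match.group(2))

print(parse_file_name("0_9.txt"))      # (0, 9)
print(parse_file_name("12499_1.txt"))  # (12499, 1)
```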

We write a function `review_to_wordlist`, which processes each review as follows:
- Punctuation is made consistent through the use of regular expressions.
- HTML tags are removed through the use of the Beautiful Soup library.
- All words are converted to lowercase.
- Each review is converted into a list of words.

We note that there is still some room for improvement.  For instance, 
- Strings like "Sgt. Cutter" are currently treated as if the period ended a sentence.  We should instead determine how to differentiate between periods that signify the end of an abbreviation and periods that denote the end of a sentence.
- Some writers separate their sentences with commas or line breaks; the algorithm currently absorbs these multiple sentences into an individual sentence.
- Ellipses (...) are currently collapsed into a single period, so they cannot be told apart from an ordinary sentence ending.
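
One possible way to approach the first item is to protect the periods of known abbreviations before separating the remaining ones.  The sketch below is only an illustration of that idea: the abbreviation list is hypothetical and far too short for real use, and the function `separate_sentence_periods` is not part of the pipeline used in this post.

```python
import re

# Hypothetical, deliberately short list; a real one would be much longer.
ABBREVIATIONS = ("sgt", "mr", "mrs", "dr", "st")

# Matches a known abbreviation immediately followed by a period.
ABBREV_PERIOD = re.compile(r"\b(%s)\." % "|".join(ABBREVIATIONS), re.IGNORECASE)

# Placeholder character that should not occur in review text.
PLACEHOLDER = "\x00"

def separate_sentence_periods(text):
    """Pad sentence-ending periods with spaces, leaving abbreviations intact."""
    # 1. Temporarily replace the period of each known abbreviation.
    protected = ABBREV_PERIOD.sub(lambda m: m.group(1) + PLACEHOLDER, text)
    # 2. Separate the remaining periods so each becomes its own token.
    separated = protected.replace(".", " . ")
    # 3. Restore the protected periods.
    return separated.replace(PLACEHOLDER, ".")

print(separate_sentence_periods("Sgt. Cutter arrives. He is loud.").split())
# ['Sgt.', 'Cutter', 'arrives', '.', 'He', 'is', 'loud', '.']
```

A fixed list will miss abbreviations it does not know, and it will misfire when an abbreviation really does end a sentence.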

Before writing this post, I read the Kaggle tutorial [here](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors).  My current processing algorithm borrows from that page, but also adds some meaningful improvements, partially informed by the algorithm [here](https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py).  For instance, 
- We keep the punctuation that ends each sentence (i.e., period vs. exclamation point), whereas punctuation was discarded in the Kaggle tutorial.  
- We do smarter processing of contractions, in a way that understands that "should've" = "should" + "'ve".  In the Kaggle tutorial, "should've" is kept as a single word (contractions are not understood in terms of their composite parts).
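
To make the contraction handling concrete, here is a small standalone sketch.  It applies the same suffix rules that `clean_str` uses, but it is a simplified illustration rather than the full cleaning function (the name `split_contractions` is ours):

```python
import re

# The contraction suffixes handled by clean_str, in the same order.
CONTRACTION_SUFFIXES = ["'s", "'ve", "n't", "'re", "'m", "'d", "'ll"]

def split_contractions(text):
    """Insert a space before each contraction suffix."""
    for suffix in CONTRACTION_SUFFIXES:
        text = re.sub(re.escape(suffix), " " + suffix, text)
    return text

sentence = "we should've left but they couldn't wait"

# Plain whitespace splitting keeps each contraction as one word.
print(sentence.split())
# ['we', "should've", 'left', 'but', 'they', "couldn't", 'wait']

# Splitting after the suffix rules separates the composite parts.
print(split_contractions(sentence).split())
# ['we', 'should', "'ve", 'left', 'but', 'they', 'could', "n't", 'wait']
```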

In [3]:
def clean_str(string):
    # Function that cleans text using regular expressions
    # collapse repeated spaces and repeated periods (e.g. "..." becomes ".")
    string = re.sub(r' +', ' ', string)
    string = re.sub(r'\.+', '.', string)
    # make sure every period is followed by a space
    string = re.sub(r'\.(?! )', '. ', string)    
    # split contractions into their parts ("should've" becomes "should 've")
    string = re.sub(r"\'s", " \'s", string) 
    string = re.sub(r"\'ve", " \'ve", string) 
    string = re.sub(r"n\'t", " n\'t", string) 
    string = re.sub(r"\'re", " \'re", string) 
    string = re.sub(r"\'m", " \'m", string) 
    string = re.sub(r"\'d", " \'d", string) 
    string = re.sub(r"\'ll", " \'ll", string) 
    # surround punctuation with spaces so each mark becomes its own token
    string = re.sub(r",", " , ", string) 
    string = re.sub(r"!", " ! ", string) 
    string = re.sub(r"\(", " ( ", string) 
    string = re.sub(r"\)", " ) ", string) 
    string = re.sub(r"\?", " ? ", string) 
    string = re.sub(r"\.", " . ", string)
    string = re.sub(r"\-", " - ", string)
    string = re.sub(r"\;", " ; ", string)
    string = re.sub(r"\:", " : ", string)
    string = re.sub(r'\"', ' " ', string)
    string = re.sub(r'\/', ' / ', string)
    return string

def review_to_wordlist(review):
    #
    # Function to turn each review into a single list of words
    # (punctuation marks are kept as separate tokens)
    #
    # 1. Process punctuation, excessive periods, missing spaces 
    review = clean_str(review)
    #
    # 2. Remove HTML tags 
    review = BeautifulSoup(review, "lxml").get_text()
    #
    # 3. remove white spaces
    review = review.strip()
    #
    # 4. return lowercase collection of words
    wordlist = review.lower().split()
    #
    # Return the list of words
    return wordlist

Each cleaned review is wrapped by the `LabeledLineReview` class, written below, which tags it with its file name.  These tagged reviews are fed into the Doc2Vec algorithm to obtain an embedding of each review.  We note that we use the full set of 100,000 reviews to learn the embedding, but our classification algorithm (here, we use logistic regression) will be trained with the training set only.

In [4]:
class LabeledLineReview(object):
    def __init__(self, dflist):
        self.dflist = dflist
    
    def __iter__(self):
        # yield one tagged document per review, tagged with its file name
        # (TaggedDocument replaces the deprecated LabeledSentence; .loc replaces the deprecated .ix)
        for df in self.dflist:
            for idx in tqdm(df.index):
                yield models.doc2vec.TaggedDocument(review_to_wordlist(df.loc[idx, 'review']), [df.loc[idx, 'file']])
    
    def to_array(self):
        # materialize all tagged documents in a list so they can be shuffled
        self.reviews = list(self)
        return self.reviews
    
    def reviews_perm(self):
        # shuffle the stored reviews in place; to_array() must be called first
        shuffle(self.reviews)
        return self.reviews
    
reviews = LabeledLineReview([train_pos, train_neg, test_pos, test_neg, unsup])
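
Doc2Vec stores one vector per tag, and here the tag is the file name.  If the same file name appears in more than one directory (for example in both `train/pos/` and `test/pos/`), those reviews would share a single vector.  We have not checked whether this happens in this dataset.  The sketch below shows one way to detect repeats and to build tags that are unique per directory; the file names in it are hypothetical, and the helpers `find_repeated_tags` and `prefix_tags` are not used elsewhere in this post.

```python
from collections import Counter

def find_repeated_tags(tag_lists):
    """Return the tags that occur more than once across all lists."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    return sorted(tag for tag, n in counts.items() if n > 1)

def prefix_tags(prefix, file_names):
    """Make tags unique per directory, e.g. 'train_pos/0_9.txt'."""
    return ["%s/%s" % (prefix, name) for name in file_names]

# Hypothetical file names, for illustration only.
train_files = ["0_9.txt", "1_7.txt"]
test_files = ["0_10.txt", "1_7.txt"]

print(find_repeated_tags([train_files, test_files]))
# ['1_7.txt']

print(find_repeated_tags([prefix_tags("train_pos", train_files),
                          prefix_tags("test_pos", test_files)]))
# []
```

With the dataframes above, the same check could be run on the `file` columns before building the vocabulary.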

In [5]:
model = models.Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(reviews.to_array())

for epoch in tqdm(range(10)):
    model.train(reviews.reviews_perm())

100%|██████████| 12500/12500 [00:07<00:00, 1620.67it/s]
100%|██████████| 12500/12500 [00:07<00:00, 1592.76it/s]
100%|██████████| 12500/12500 [00:07<00:00, 1626.76it/s]
100%|██████████| 12500/12500 [00:07<00:00, 1645.64it/s]
100%|██████████| 50000/50000 [00:28<00:00, 1761.05it/s]
100%|██████████| 10/10 [20:12<00:00, 120.20s/it]


In [6]:
# obtain train data embeddings and labels
train_array = np.zeros((25000, 100))
train_tags = list(train_pos['file'].values) + list(train_neg['file'].values)
for idx , val in enumerate(train_tags):
    train_array[idx] = model.docvecs[val]
train_labels = np.append(np.ones(12500), np.zeros(12500))

# obtain test data embeddings and labels
test_array = np.zeros((25000, 100))
test_tags = list(test_pos['file'].values) + list(test_neg['file'].values)
for idx , val in enumerate(test_tags):
    test_array[idx] = model.docvecs[val]
test_labels = np.append(np.ones(12500), np.zeros(12500))

In [7]:
classifier = LogisticRegression()
classifier.fit(train_array, train_labels)
classifier.score(test_array, test_labels)

0.89927999999999997

Woohoo!  We can predict sentiment on the test set with nearly 90 percent accuracy!
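
For a classifier such as `LogisticRegression`, `score` returns the mean accuracy, i.e. the fraction of reviews whose predicted label matches the true label.  As a minimal sketch of that computation in plain Python (the labels below are made up for illustration and are not taken from the experiment):

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Made-up labels: 1 = positive, 0 = negative.
predicted = [1, 1, 0, 0, 1]
actual = [1, 0, 0, 0, 1]
print(accuracy(predicted, actual))  # 0.8

# On a balanced set, always predicting "positive" is right half the time.
print(accuracy([1, 1, 1, 1], [1, 1, 0, 0]))  # 0.5
```

Because the test set is balanced, 0.5 is the baseline that a constant prediction would reach, which gives the accuracy above some context.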