# Following the Kaggle Tutorial

We work with a dataset of movie reviews from IMDB. The train set contains 25,000 reviews labeled by sentiment (positive/negative), and the test set contains another 25,000 reviews. In addition, there are another 50,000 IMDB reviews provided without any rating labels.

The labels are balanced within the train set (12.5k pos and 12.5k neg). Our goal is to see how well we can predict sentiment in the test dataset.

We note that Keras also has a version of this dataset, but it is missing the unlabeled reviews.  We could download the original dataset [here](http://ai.stanford.edu/~amaas/data/sentiment/), but we settle on a slightly more formatted (and so easier-to-use) version from Kaggle [here](https://www.kaggle.com/c/word2vec-nlp-tutorial/data?unlabeledTrainData.tsv.zip).

For this first part of the notebook, we follow along with the Kaggle tutorial [here](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words).

In [1]:
import pandas as pd # for importing the data
from bs4 import BeautifulSoup # for removing HTML tags
import re # for processing text with regular expressions
import nltk.data # for sentence splitting
import logging # output messages for word2vec embedding step
from gensim.models import word2vec # embedding algorithm



In [2]:
# load the data
train = pd.read_csv("data/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
test = pd.read_csv( "data/testData.tsv", header=0, delimiter="\t", quoting=3 )
unlabeled_train = pd.read_csv("data/unlabeledTrainData.tsv", header=0, delimiter="\t", quoting=3 )

Here, `train` is a pandas dataframe with 25,000 rows, each corresponding to a review.  There are three columns:
- `id`: an identifier
- `sentiment` $\in \{0,1\}$: indicates pos(1) or neg(0) sentiment
- `review`: the full text of the review

`test` and `unlabeled_train` have 25,000 and 50,000 rows, respectively, and each contain two columns (`id` and `review`).

In [3]:
print("Read %d labeled train reviews, %d test reviews, " \
 "and %d unlabeled train reviews.\n" % (train["review"].size,  
 test["review"].size, unlabeled_train["review"].size))
print("TOTAL: %d reviews.\n" % int(train["review"].size + test["review"].size + unlabeled_train["review"].size))
print("In the labeled train set, %s reviews are positive, and %s are negative." % (sum(train["sentiment"].values), train.shape[0]-sum(train["sentiment"].values)))

Read 25000 labeled train reviews, 25000 test reviews, and 50000 unlabeled train reviews.

TOTAL: 100000 reviews.

In the labeled train set, 12500 reviews are positive, and 12500 are negative.


We visualize a sample review below.  

In [4]:
train["review"][2]

'"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature\'s most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger . In addition , a security agent (Stacy Haiduk) and her mate (Brian Wimmer) fight hardly against the carnivorous Smilodons. The Sabretooths, themselves , of course, are the real star stars and they are astounding terrifyingly though not convincing. The giant animals savagely are stalking its prey and the group run afoul and fight against 

We would like to process each review to make it mathematically digestible.  We focus on using distributed word vectors created by the Word2Vec algorithm.  We use the implementation of Word2Vec from the gensim package.

Word2Vec takes as input single sentences, each one as a list of words.  That is, the input format is a list of lists.  It outputs a model of semantic meaning. 
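
To make the expected input format concrete, here is a minimal sketch with a made-up two-sentence corpus. The names `toy_sentences` and `is_list_of_token_lists` are illustrative only; they are not part of the Kaggle tutorial or of gensim.

```python
# Hypothetical toy corpus: each inner list is one tokenized sentence.
toy_sentences = [
    ["the", "film", "starts", "with", "a", "manager"],
    ["the", "creature", "escapes"],
]


def is_list_of_token_lists(corpus):
    """Return True if corpus is a list of lists of strings."""
    return (
        isinstance(corpus, list)
        and all(isinstance(sentence, list) for sentence in corpus)
        and all(isinstance(token, str) for sentence in corpus for token in sentence)
    )


# A properly tokenized corpus passes the check ...
print(is_list_of_token_lists(toy_sentences))
# ... while a flat list of untokenized strings does not.
print(is_list_of_token_lists(["the film starts"]))
```

The second call fails the check because its single element is a plain string rather than a list of words.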

We split a paragraph into sentences through the use of NLTK's __punkt__ tokenizer.  As an example, we print the first sentence identified by __punkt__ from the review above.  It nicely corresponds to our expectations!

In [5]:
# Load the punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Visualize first stripped sentence
print(tokenizer.tokenize(train['review'][2].strip())[0])

"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park .


Each sentence also needs to be cleaned.  The reviews contain HTML tags such as `<br/>`, abbreviations, and punctuation, all of which are common issues when processing text from the web.  We need to tidy up the text before applying machine learning algorithms.

Our function `raw_sentence_to_wordlist` works as follows:
- HTML tags are removed through the use of the Beautiful Soup library.
- Numbers and punctuation are replaced with a space using regular expressions.
- We convert all of the words to lowercase and convert the sentence string into a list of words.

In [6]:
def raw_sentence_to_wordlist( raw_sentence ):
    # Function to convert a raw sentence to a list of words
    # The input is a single string (a raw sentence from a movie review), and 
    # the output is a list of strings (the preprocessed words of that sentence)
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_sentence, "lxml").get_text() 
    #
    # 2. Remove non-letters (replace numbers and punctuation, for instance, with a space)      
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    #
    # 3. Convert to lower case, split into individual words
    wordlist = letters_only.lower().split()                             
    #
    # 4. Return a list of words.
    return wordlist

As an example, we output the cleaned version of the sentence above.  Again, it appears to be working nicely!

In [7]:
print(raw_sentence_to_wordlist(tokenizer.tokenize(train['review'][2].strip())[0]))

['the', 'film', 'starts', 'with', 'a', 'manager', 'nicholas', 'bell', 'giving', 'welcome', 'investors', 'robert', 'carradine', 'to', 'primal', 'park']


We are almost done processing the input for the Word2Vec embedding.  We need only a function (here, `review_to_sentences`) that puts everything together: it loops over the sentences of a review and returns a list of clean sentences, each of which is a list of words.

In [8]:
# Define a function to split a review into parsed sentences
def review_to_sentences( review, tokenizer ):
    # Function to split a review into parsed sentences. Returns a 
    # list of sentences, where each sentence is a list of words
    #
    # 1. Use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    #
    # 2. Loop over each sentence
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call raw_sentence_to_wordlist to get a list of words
            sentences.append( raw_sentence_to_wordlist( raw_sentence ))
    #
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists)
    return sentences

We now need only to call this function to populate our list of lists.

In [9]:
# I chose not to use the tqdm package here, but it is a possibility for tracking longer runtimes.
sentences = []  # Initialize an empty list of sentences

print("Parsing sentences from training set")
for review in train["review"]:
    sentences += review_to_sentences(review, tokenizer)

print("Parsing sentences from unlabeled set")
for review in unlabeled_train["review"]:
    sentences += review_to_sentences(review, tokenizer)

Parsing sentences from training set




Parsing sentences from unlabeled set




The Word2Vec algorithm accepts a number of hyperparameters that must be chosen.  Here, we use as starting points the values selected by the writers of the Kaggle tutorial.

In [10]:
# OPTIONAL: Configure logging module so that Word2Vec creates nice output messages
# I chose not to use it, because it runs quite fast on my system.
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Set values for various parameters
num_features = 300    # Word vector dimensionality                      
min_word_count = 40   # Minimum word count                        
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model 
model = word2vec.Word2Vec(sentences, workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling)

# If you don't plan to train the model any further, calling 
# init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and save the model for later use. 
model_name = "300features_40minwords_10context"
model.save(model_name)
# to load: model = word2vec.Word2Vec.load("300features_40minwords_10context")

Training model...
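
As a rough illustration of what the minimum word count does, the sketch below filters a made-up corpus by word frequency. It only mimics the idea of dropping rare words; it is not gensim's implementation, and `toy_sentences` and `toy_min_count` are hypothetical names.

```python
from collections import Counter

# Hypothetical toy corpus in the list-of-lists format
toy_sentences = [
    ["the", "film", "was", "bad"],
    ["the", "acting", "was", "bad"],
    ["the", "plot", "was", "thin"],
]
toy_min_count = 2  # stand-in for min_word_count (40 in the cell above)

# Count how often each word occurs across all sentences
counts = Counter(word for sentence in toy_sentences for word in sentence)

# Keep only the words that occur at least toy_min_count times
vocab = sorted(word for word, n in counts.items() if n >= toy_min_count)
print(vocab)
```

Words below the threshold never receive a vector, which is why the vocabulary of the trained model is smaller than the set of distinct words in the reviews.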


We can use the `most_similar` method to get insight into the model's word clusters.  It returns the words whose vectors are closest (by cosine similarity) to that of the provided word, together with their similarity scores.

In [11]:
model.most_similar("bad")

[('terrible', 0.675002932548523),
 ('horrible', 0.6517869234085083),
 ('good', 0.642318069934845),
 ('lousy', 0.6187937259674072),
 ('awful', 0.5989417433738708),
 ('crappy', 0.5930088758468628),
 ('stupid', 0.581573486328125),
 ('cheesy', 0.5734360218048096),
 ('lame', 0.5684062242507935),
 ('atrocious', 0.5386618375778198)]
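
The scores above are cosine similarities between word vectors. The sketch below computes the same quantity by hand with numpy on made-up 3-dimensional vectors; `toy_vectors` and `cosine_similarity` are illustrative names, and the numbers have nothing to do with the trained model.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical 3-dimensional "word vectors", made up for illustration
toy_vectors = {
    "bad": np.array([1.0, 0.0, 0.0]),
    "terrible": np.array([0.8, 0.6, 0.0]),
    "sunny": np.array([0.0, 0.0, 1.0]),
}

# Vectors pointing in similar directions score close to 1,
# orthogonal vectors score 0.
for word in ["terrible", "sunny"]:
    score = cosine_similarity(toy_vectors["bad"], toy_vectors[word])
    print(word, round(score, 3))
```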

It seems we have a reasonably good model for semantic meaning :).  In case you're wondering how the model is stored, we can look beneath the hood.  The Word2Vec model consists of a feature vector for each word in the vocabulary, stored in a numpy array called "syn0".

In [12]:
print(type(model.syn0))
print(model.syn0.shape)

<class 'numpy.ndarray'>
(16490, 300)


Individual word vectors can be accessed in the following way:

In [17]:
# Each word is assigned a 300-dimensional numpy array (shape (300,))
print(model["bad"].shape)

# As an example, we output the first 5 entries of the array assigned to "bad" 
print(model["bad"][:5])

(300,)
[ 0.01623794  0.00502289  0.09394965  0.03428585 -0.00123815]


We now have a representation of each document, with the $i$-th document as a $w_i \times 300$ matrix, where $w_i$ is the number of words in the $i$-th document.  Unlike the bag-of-words representation, this representation retains word order, a useful improvement that could be fed into a classification algorithm.

However, there is a problem.  Classification and clustering algorithms typically require the text input to be represented as a fixed-length vector.  The Kaggle tutorial investigates two methods to deal with this issue.  The resultant test accuracy was then compared to the results of simply running a Random Forest classifier on bag-of-words features.
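
The fixed-length issue can be seen concretely in the sketch below, which stacks word vectors for two documents of different lengths. The embedding table is random and only 4-dimensional; `toy_embedding` and `document_matrix` are hypothetical names used for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_dim = 4  # stand-in for the 300 features used above

# Hypothetical embedding table: one random vector per word
toy_embedding = {
    word: rng.normal(size=toy_dim)
    for word in ["the", "film", "was", "bad", "good"]
}


def document_matrix(words, embedding):
    """Stack the vectors of in-vocabulary words into a (num_words, dim) matrix."""
    return np.vstack([embedding[w] for w in words if w in embedding])


doc_a = ["the", "film", "was", "bad"]
doc_b = ["good", "film"]

# The number of rows depends on the document length
print(document_matrix(doc_a, toy_embedding).shape)
print(document_matrix(doc_b, toy_embedding).shape)
```

The two matrices have 4 and 2 rows, respectively, so they cannot be passed directly to a classifier that expects one fixed-length feature vector per document.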

The first method was to simply average the word vectors in a given review; they also removed stop words in processing.  The resultant set of average word vectors was then fed into a Random Forest classifier.  This method (slightly) underperformed bag-of-words.
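
A minimal sketch of the averaging idea follows, assuming a made-up 2-dimensional embedding table and stop-word set. The names `toy_embedding`, `toy_stop_words`, and `average_vector` are hypothetical, and this is not the tutorial's exact code.

```python
import numpy as np

# Hypothetical 2-dimensional word vectors, made up for illustration
toy_embedding = {
    "film": np.array([1.0, 0.0]),
    "bad": np.array([0.0, 1.0]),
    "plot": np.array([1.0, 1.0]),
}
toy_stop_words = {"the", "was"}


def average_vector(words, embedding, stop_words, dim):
    """Mean of the vectors of in-vocabulary, non-stop words (zeros if none)."""
    vectors = [
        embedding[w] for w in words if w in embedding and w not in stop_words
    ]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Every review maps to a vector of the same length, whatever its word count
print(average_vector(["the", "film", "was", "bad"], toy_embedding, toy_stop_words, dim=2))
```

Averaging yields a fixed-length vector per review, at the cost of discarding word order.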

The second method was to cluster the set of words according to their Word2Vec embeddings.  They used K-means for this purpose, with the number of clusters set to one fifth of the vocabulary size ($K$ = 16490/5 = 3298).  (We note that while some clusters made intuitive sense, others did not.)  Each document was then represented as a "bag-of-centroids": a $K$-dimensional vector ($K$ = 3298) whose $i$-th entry is the number of words in the document that belong to the $i$-th cluster.  The resultant bag-of-centroids vectors were then fed into a Random Forest classifier.  This method (slightly) underperformed bag-of-words.
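
The bag-of-centroids construction can be sketched as below. We assume the word-to-cluster assignment is already available (in the tutorial it comes from K-means); here it is a made-up dictionary with 3 clusters, and `toy_word_to_cluster` and `bag_of_centroids` are hypothetical names.

```python
import numpy as np

# Hypothetical word -> cluster index map; in practice this would come
# from running K-means on the word vectors.
toy_word_to_cluster = {"bad": 0, "terrible": 0, "film": 1, "movie": 1, "good": 2}
toy_num_clusters = 3


def bag_of_centroids(words, word_to_cluster, num_clusters):
    """Count how many words of the document fall into each cluster."""
    counts = np.zeros(num_clusters, dtype=int)
    for word in words:
        if word in word_to_cluster:  # words outside the vocabulary are skipped
            counts[word_to_cluster[word]] += 1
    return counts


doc = ["bad", "film", "terrible", "movie", "movie", "unknown"]
print(bag_of_centroids(doc, toy_word_to_cluster, toy_num_clusters))
```

As with averaging, every document maps to a vector of the same length, here one entry per cluster.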

Thus, we must find a more clever approach to using our Word2Vec features. 

# Deviating from Kaggle Tutorial