# Data Preparation
In this exercise we will work with the IMDB sentiment dataset. The dataset contains movie reviews, each labeled with a sentiment: 1 for positive and 0 for negative. The labeled training and test data are provided on Moodle.

## Reading and preprocessing the data
To import the TSV file, we recommend the pandas package. The provided file can be imported as follows:

In [1]:
import numpy as np
import pandas as pd

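# load the data as a pandas DataFrame
train = pd.read_csv('labeledTrainData.tsv',
                    header=0,         # the first row holds the column names
                    delimiter="\t",   # tab-separated values
                    quoting=3)        # 3 = csv.QUOTE_NONE: treat double quotes as ordinary characters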
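
The effect of `quoting=3` can be checked on a small in-memory file. The following is a minimal sketch; the two-line `sample_tsv` string is a made-up stand-in for `labeledTrainData.tsv`, not part of the provided data.

```python
import csv
import io

import pandas as pd

# Hypothetical two-line TSV standing in for labeledTrainData.tsv
sample_tsv = 'id\tsentiment\treview\n"1_8"\t1\t"A "great" film"\n'

# quoting=3 is the numeric value of csv.QUOTE_NONE
df = pd.read_csv(io.StringIO(sample_tsv), header=0, delimiter="\t",
                 quoting=csv.QUOTE_NONE)

print(csv.QUOTE_NONE)    # the constant behind quoting=3
print(df['id'][0])       # the surrounding double quotes are kept
print(df['review'][0])   # inner double quotes do not confuse the parser
```

Because the quotes are not interpreted, they remain part of the strings. This is why the values printed in this notebook still start with a double quote.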

What data type is the variable train? Which values does it contain? Print some examples.

In [2]:
print('Data type:', type(train))
print()
print('Value 0,0:', train.values[0][0])
print('Value 0,1:',train.values[0][1])
print('Value 0,2:',train.values[0][2])

Data type: <class 'pandas.core.frame.DataFrame'>

Value 0,0: "5814_8"
Value 0,1: 1
Value 0,2: "With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans w

The review strings contain HTML tags, which have to be removed. To do this, use the bs4 (Beautiful Soup) package:

In [3]:
from bs4 import BeautifulSoup

example1 = BeautifulSoup(train['review'][0],'lxml').get_text()
print(example1)

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 mi

The imported text still contains punctuation, numbers, and very common words. For now, we assume that these do not help with sentiment classification, so we remove them. Punctuation and numbers can be removed using the regular expressions (re) package:

In [4]:
import re
# Use regular expressions to do a find-and-replace
letters_only = re.sub('[^a-zA-Z]',  # the pattern to search for: any character that is not a letter
                      ' ',          # the replacement string
                      example1)     # the text to search
print(letters_only)

 With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    mi
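
A short made-up sentence shows what the pattern does. Every non-letter is replaced by exactly one space, so the string keeps its length, and contractions such as "It's" are split into two tokens:

```python
import re

text = "It's a 10/10 film!"
letters_only = re.sub('[^a-zA-Z]', ' ', text)

print(repr(letters_only))     # repr makes the runs of spaces visible
print(letters_only.split())   # split() collapses any run of whitespace
```

The leftover single letters (here `s`) end up as separate words. Whether that matters depends on the later steps. For example, scikit-learn's default tokenizer ignores one-character tokens.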

It is also beneficial to convert all letters to lower case and to split the string into individual words.

In [5]:
lower_case = letters_only.lower()        # Convert to lower case
print('Lower case version:')
print(lower_case)
words = lower_case.split()   # Split into words
print()
print('First Word:', words[0])

Lower case version:
 with all this stuff going down at the moment with mj i ve started listening to his music  watching the odd documentary here and there  watched the wiz and watched moonwalker again  maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  some of it has subtle messages about mj s feeling towards the press and also the obvious message of drugs are bad m kay visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring  some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him the actual feature film bit when it finally starts 

For now, we also want to remove common words that do not carry much meaning, such as "a", "is", or "the". These are often referred to as stop words. A list of stop words can be obtained with the NLTK package:

In [6]:
import nltk
nltk.download('stopwords')  # Download the stop word corpus
from nltk.corpus import stopwords  # Import the stop word list
stops = stopwords.words('english')
print(stops)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/salamander/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'so
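
Stop word removal is not always harmless for sentiment analysis. NLTK's English list also contains negations such as "not", and dropping them can flip the apparent meaning of a phrase. The sketch below uses a small hypothetical stop list (`toy_stops`) and a hypothetical helper (`drop_stop_words`) to illustrate this.

```python
# Hypothetical mini stop list for illustration; the exercise uses NLTK's list
toy_stops = {'this', 'is', 'a', 'not', 'the'}

def drop_stop_words(words, stop_set):
    """Keep only the words that are not in stop_set."""
    return [w for w in words if w not in stop_set]

words = 'this is not a good movie'.split()

print(drop_stop_words(words, toy_stops))             # negation is lost
print(drop_stop_words(words, toy_stops - {'not'}))   # negation is kept
```

Using a `set` rather than a list for the stop words makes each membership test constant time on average, which adds up over a full training set.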

Write a function called `review_prepro` that takes a raw review string as input and returns a preprocessed review, i.e. a string with HTML tags removed, non-letter characters removed, all letters in lower case, and (optionally) no stop words. Then apply this function to the entire training set and store the results in the list `clean_train_reviews`, which contains all the cleaned reviews.

In [7]:
# function for preprocessing the data
def review_prepro(data, remove_stopwords=False):
    # remove HTML tags
    review_text = BeautifulSoup(data, 'lxml').get_text()
    # replace everything that is not a letter (punctuation, digits) with a space
    letters_only = re.sub('[^a-zA-Z]', ' ', review_text)
    # convert to lower case and split the review into single words
    words = letters_only.lower().split()

    if remove_stopwords:
        # remove stop words; a set makes the membership test fast
        stop_set = set(stops)
        words = [w for w in words if w not in stop_set]
    # return the remaining words joined into a single string
    return ' '.join(words)

# preprocess the training data
clean_train_reviews = [review_prepro(review, remove_stopwords=True)
                       for review in train['review']]
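
To see the whole pipeline on one input without the external packages, the following sketch uses a simplified stand-in, `toy_prepro`. It is hypothetical: it strips tags with a naive regular expression instead of BeautifulSoup and uses a made-up stop list, so it only illustrates the order of the steps and is not a replacement for `review_prepro`.

```python
import re

# Hypothetical stop list; the exercise uses stopwords.words('english')
toy_stops = {'the', 'was', 'a', 'and', 'it'}

def toy_prepro(raw, remove_stopwords=False):
    """Simplified stand-in for review_prepro (naive tag stripping)."""
    no_tags = re.sub(r'<[^>]+>', ' ', raw)             # naive: not a real HTML parser
    letters_only = re.sub('[^a-zA-Z]', ' ', no_tags)   # drop punctuation and digits
    words = letters_only.lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in toy_stops]
    return ' '.join(words)

raw = 'The plot was thin.<br /><br />Loved it anyway, 8/10!'
print(toy_prepro(raw))
print(toy_prepro(raw, remove_stopwords=True))
```

Comparing the two printed strings is a quick sanity check that the `remove_stopwords` flag behaves as intended before running the function on all reviews.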
    

## Creating Features from a Bag of Words
To generate a bag-of-words model, we will use the scikit-learn package. Use the following code:

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

# define the vectorizer
vectorizer = CountVectorizer(analyzer='word',
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)  # keep only the 5000 most frequent words
                             
# fit the vectorizer to the data
train_data_features = vectorizer.fit_transform(clean_train_reviews)
# convert to numpy array
train_data_features = train_data_features.toarray()
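
What the vectorizer produces is easiest to see on a toy corpus. In the sketch below, `toy_docs` is made up. Each row of the resulting matrix is one document, each column one vocabulary word, and each entry counts how often that word occurs in that document.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ['good movie good', 'bad movie']

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)   # returns a sparse matrix

# vocabulary_ maps each word to its column index (columns are in alphabetical order)
print(sorted(toy_vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))
print(toy_counts.toarray())
```

Note that `fit_transform` returns a sparse matrix. Converting it with `toarray()` is convenient for inspection, but the dense array takes far more memory. scikit-learn's `LogisticRegression` also accepts the sparse matrix directly.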

## Black box classifier
To do something meaningful with the generated data, we will use a prebuilt classifier, train it on the training data and then evaluate the learned classifier on the test data `labeledTestData.tsv`. 

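First, preprocess the test data in the same way as the training data and store the result in the variable `test_data_features`. (Hint: use `vectorizer.transform`, not `fit_transform`, so that the vocabulary learned from the training data is reused.)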

In [9]:
test = pd.read_csv('labeledTestData.tsv', 
                   header=0,
                   delimiter="\t",
                   quoting=3)

clean_test_reviews = [review_prepro(review, remove_stopwords=True)
                      for review in test['review']]

# reuse the vocabulary fitted on the training data
test_data_features = vectorizer.transform(clean_test_reviews).toarray()
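
The difference between `fit` and `transform` can be illustrated with a toy vocabulary. This is a minimal sketch with made-up documents: words that were not seen while fitting have no column and are silently dropped when transforming new text.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(['good movie', 'bad movie'])   # vocabulary: bad, good, movie

# 'great' was never seen during fitting, so it is ignored
toy_test = toy_vectorizer.transform(['great movie'])
print(toy_test.toarray())
```

Calling `fit_transform` on the test data instead would build a new vocabulary, so the columns would no longer line up with those the classifier was trained on.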

To train a logistic regression classifier, use the following code:

In [10]:
from sklearn.linear_model import LogisticRegression as LR

model = LR()
model.fit(train_data_features, train['sentiment'])

# predicted probability of the positive class (label 1)
p = model.predict_proba(test_data_features)[:, 1]
output = pd.DataFrame(data={'id': test['id'], 'sentiment': p})
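
The indexing `[:, 1]` relies on the column order of `predict_proba`, which follows `model.classes_`. The sketch below checks this on a hypothetical one-feature data set (`X_toy`, `y_toy`); it is not the IMDB data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data set: larger values belong to class 1
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

toy_model = LogisticRegression()
toy_model.fit(X_toy, y_toy)

proba = toy_model.predict_proba(X_toy)
print(toy_model.classes_)   # column order of predict_proba
print(proba[:, 1])          # probability of class 1 for each sample
```

Each row of `proba` sums to 1, so column 1 alone is enough to rank the samples. That ranking is what the AUC metric in the next section evaluates.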

## Evaluate result
We will use the area under the ROC curve (AUC) to measure performance. An AUC of 0.5 corresponds to a random classifier; the closer the score is to 1, the better.

In [11]:
from sklearn.metrics import roc_auc_score as AUC

auc = AUC( test['sentiment'].values, p )
print('AUC score:', auc)

AUC score: 0.929248214029
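
One way to interpret the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes this directly by comparing all pairs. The helper `pairwise_auc` and the four labels and scores are made up for illustration, and the quadratic loop is only practical for small inputs.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 3 of the 4 (positive, negative) pairs are ordered correctly
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

This view also explains why the probabilities `p`, rather than hard 0/1 predictions, are passed to the metric: only the ordering of the scores matters.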


## More sophisticated methods
Use a prebuilt TF-IDF vectorizer and experiment with its settings, such as stop words and n-grams, to see how they affect the performance of an LR classifier.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer_tf = TfidfVectorizer(max_features=5000,
                                ngram_range=(1, 1),   # unigrams only
                                sublinear_tf=True)    # use 1 + log(tf) instead of the raw term frequency

# fit the vectorizer to the data
train_data_features_tf = vectorizer_tf.fit_transform(clean_train_reviews)
# convert to numpy array
train_data_features_tf = train_data_features_tf.toarray()
test_data_features_tf = (vectorizer_tf.transform(clean_test_reviews)).toarray()


model_tf = LR()
model_tf.fit( train_data_features_tf, train['sentiment'] )
p_tf = model_tf.predict_proba( test_data_features_tf )[:,1] 
output_tf = pd.DataFrame( data={'id':test['id'], 'sentiment':p_tf} )

auc_tf = AUC( test['sentiment'].values, p_tf )
print('AUC score (TF-IDF):', auc_tf)

AUC score (TF-IDF): 0.9528444207
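
As a starting point for experimenting with n-grams, the following sketch shows what `ngram_range=(1, 2)` adds to the vocabulary. The two toy documents are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ['not good', 'very good']

toy_tfidf = TfidfVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
toy_features = toy_tfidf.fit_transform(toy_docs)

print(sorted(toy_tfidf.vocabulary_))
# each row is scaled to unit Euclidean length (norm='l2' is the default)
print(np.linalg.norm(toy_features.toarray(), axis=1))
```

Bigrams such as "not good" can capture negation that single words miss. In this exercise, however, the reviews in `clean_train_reviews` had their stop words removed beforehand, and the NLTK list contains "not". To try such bigrams, the reviews would have to be preprocessed with `remove_stopwords=False`.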
