# Sentiment prediction on IMDB reviews


The dataset contains 25,000 IMDB movie reviews, each labeled with a positive or negative sentiment.

The reviews contain HTML tags, which we remove with BeautifulSoup.


In [69]:
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

pd.options.mode.chained_assignment = None

df_original = pd.read_csv("../data/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
print(df_original.columns)
print(df_original.shape[0])

# Inspect the first review
print(df_original["review"][0])




Index(['id', 'sentiment', 'review'], dtype='object')
25000
"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.

In [117]:
# limit the data to the first 1,499 reviews

df = df_original[0:1499]
df.shape


(1499, 3)

In [118]:
# remove html tags
example1 = BeautifulSoup(df["review"][0], "html.parser")
print(example1.get_text())


"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 mi

# 1. Remove HTML tags

In [119]:
df['clean_text'] = df.review.apply(lambda x : BeautifulSoup(x, "html.parser").get_text() )
# check a review
df['clean_text'][2]

'"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature\'s most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger . In addition , a security agent (Stacy Haiduk) and her mate (Brian Wimmer) fight hardly against the carnivorous Smilodons. The Sabretooths, themselves , of course, are the real star stars and they are astounding terrifyingly though not convincing. The giant animals savagely are stalking its prey and the group run afoul and fight against 
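The cleaned reviews above show that `get_text()` joins the text on either side of a removed tag with no separator, so `m'kay.<br /><br />Visually` becomes `m'kay.Visually`. If that is unwanted, a cleaning helper along these lines keeps a space there. This is a minimal sketch: `clean_review` is a hypothetical name that does not appear in the original notebook.

```python
from bs4 import BeautifulSoup


def clean_review(raw_html):
    """Strip HTML tags, keeping a space where a tag separated two pieces of text."""
    # separator=" " makes get_text() join the text fragments with a space
    text = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ")
    # collapse any runs of whitespace into a single space
    return " ".join(text.split())


print(clean_review("drugs are bad m'kay.<br /><br />Visually impressive"))
```

It could be applied with `df.review.apply(clean_review)` in place of the lambda used for `clean_text`.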

# 2. Vectorize 

Apply vectorizer = [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) with

* max_features = 5000
* stop_words = 'english'
* analyzer = "word"

to df.clean_text

* look at the vocabulary of the vectorizer

* CountVectorizer returns a sparse matrix, for example:

        <499x1000 sparse matrix of type '<class 'numpy.int64'>' 
        with 23556 stored elements in Compressed Sparse Row format>

* The output of CountVectorizer is your X, and y = df.sentiment.values
* Split X and y into train and test sets with train_test_split(X, y, test_size=0.33)
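The steps listed above can be sketched on a tiny corpus. The documents and labels below are made up for illustration and stand in for `df.clean_text` and `df.sentiment.values`. The import path `sklearn.model_selection` assumes scikit-learn 0.18 or later.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-ins for df.clean_text and df.sentiment.values (made up for illustration)
docs = [
    "a great movie with great acting",
    "a boring movie with terrible acting",
    "great fun from start to finish",
    "terrible plot and boring characters",
    "the acting was great",
    "the plot was terrible",
]
labels = [1, 0, 1, 0, 1, 0]

vectorizer = CountVectorizer(analyzer="word", stop_words="english", max_features=5000)
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document
y = labels

# vocabulary_ maps each retained word to its column index in X
print(sorted(vectorizer.vocabulary_))
print(X.shape)

# train_test_split returns X_train, X_test, y_train, y_test, in that order
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
print(X_train.shape, X_test.shape)
```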


In [120]:
vectorizer = CountVectorizer( analyzer = "word", stop_words = 'english', max_features = 4000 ) 
v_text = vectorizer.fit_transform(df.clean_text)
# Transform the sparse matrix into a dataframe
df_text = pd.DataFrame(v_text.todense(), columns=vectorizer.get_feature_names_out())
df_text.columns

y = df.sentiment

# train_test_split returns X_train, X_test, y_train, y_test, in that order
X_train, X_test, y_train, y_test = train_test_split(v_text, y, test_size=0.33)

X_train.shape

(1004, 4000)

# 3. Random Forest classification

        from sklearn.ensemble import RandomForestClassifier

* Initialize a Random Forest classifier with 100 trees

        forest = RandomForestClassifier(n_estimators = 100)

* Fit the forest to the training set, using the bag of words as features and the sentiment labels as the response variable



In [124]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix
np.random.seed(888)
X = v_text
y = df.sentiment.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

forest = RandomForestClassifier(n_estimators = 400) 

forest.fit( X_train, y_train )

y_hat = forest.predict(X_test)

print(accuracy_score(y_hat, y_test))
# Predictions are passed first here, so rows are predicted labels and columns are true labels
confusion_matrix(y_hat, y_test)



0.817777777778


array([[191,  28],
       [ 54, 177]])
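The printed matrix can be read back into accuracy, precision, and recall by hand. The sketch below assumes label 0 means negative and label 1 means positive. Because the cell calls `confusion_matrix(y_hat, y_test)` with the predictions first, it treats rows as predicted labels and columns as true labels, which is the transpose of scikit-learn's usual `(y_true, y_pred)` layout. The variable names are illustrative only.

```python
import numpy as np

# Confusion matrix from the output above: row i = predicted label i, column j = true label j
cm = np.array([[191, 28],
               [54, 177]])

pred_neg_true_neg, pred_neg_true_pos = cm[0]
pred_pos_true_neg, pred_pos_true_pos = cm[1]

# Correct predictions sit on the diagonal
accuracy = (pred_neg_true_neg + pred_pos_true_pos) / cm.sum()
# Precision: of the reviews predicted positive, the share that are truly positive
precision = pred_pos_true_pos / (pred_pos_true_pos + pred_pos_true_neg)
# Recall: of the truly positive reviews, the share that were predicted positive
recall = pred_pos_true_pos / (pred_pos_true_pos + pred_neg_true_pos)

print(accuracy)  # agrees with the accuracy_score output above
print(precision, recall)
```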

# 4. TF-IDF

Instead of using a CountVectorizer, use a [tf-idf vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

Play with the following parameters

* max_df, min_df 
* max_features
* use_idf: True / False
* sublinear_tf
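A minimal sketch of what this exercise might look like with `TfidfVectorizer`, using a made-up toy corpus in place of `df.clean_text`. The parameter values are arbitrary starting points, not tuned recommendations.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for df.clean_text (made up for illustration)
docs = [
    "a great movie with great acting",
    "a boring movie with terrible acting",
    "great fun from start to finish",
    "terrible plot and boring characters",
    "the acting was great",
    "the plot was terrible",
]

tfidf = TfidfVectorizer(
    stop_words="english",
    max_df=0.9,         # drop terms found in more than 90% of documents
    min_df=1,           # keep terms found in at least 1 document
    max_features=5000,
    use_idf=True,       # set to False to skip the inverse-document-frequency weighting
    sublinear_tf=True,  # replace the raw count tf with 1 + log(tf)
)
X_tfidf = tfidf.fit_transform(docs)
print(X_tfidf.shape)

# With the default norm="l2", each non-empty row has unit Euclidean length
row_norms = np.sqrt(np.asarray(X_tfidf.multiply(X_tfidf).sum(axis=1)).ravel())
print(row_norms)
```

The resulting `X_tfidf` can be passed to `train_test_split` and the Random Forest in the same way as the count matrix.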



In [125]:
# This cell uses a HashingVectorizer (the subject of section 5) rather than a TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer( analyzer = "word", stop_words = 'english', n_features = 1000 ) 
v_text = vectorizer.fit_transform(df.clean_text)
v_text
np.random.seed(88)
X = v_text
y = df.sentiment.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

forest = RandomForestClassifier(n_estimators = 100) 

forest.fit( X_train, y_train )

y_hat = forest.predict(X_test)

print(accuracy_score(y_hat, y_test))
confusion_matrix(y_hat, y_test)



0.779797979798


array([[212,  53],
       [ 56, 174]])

# 5. Hashing Vectorizer

Compare the scores for different values of n_features (the HashingVectorizer counterpart of max_features).
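One way to run this comparison is a small loop over feature sizes. This is a sketch, not the notebook's own code: `compare_n_features` is a hypothetical helper, and the toy data stands in for `df.clean_text` and `df.sentiment.values`. Scores on such a tiny corpus are not meaningful in themselves.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def compare_n_features(texts, labels, sizes):
    """Return {n_features: test accuracy} for a Random Forest on hashed features."""
    scores = {}
    for n in sizes:
        hasher = HashingVectorizer(analyzer="word", stop_words="english", n_features=n)
        X = hasher.transform(texts)  # the hasher is stateless, so no fitting is needed
        X_train, X_test, y_train, y_test = train_test_split(
            X, labels, test_size=0.33, random_state=42
        )
        forest = RandomForestClassifier(n_estimators=100, random_state=42)
        forest.fit(X_train, y_train)
        scores[n] = accuracy_score(y_test, forest.predict(X_test))
    return scores


# Toy data (made up); in the notebook this would be df.clean_text and df.sentiment.values
texts = ["great movie", "terrible movie", "great acting", "terrible acting",
         "great fun", "terrible plot", "great story", "terrible script"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]
print(compare_n_features(texts, labels, sizes=[16, 256, 1024]))
```

Fixing `random_state` in both the split and the forest keeps the comparison between sizes from being dominated by random variation.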


# 6. Multinomial NaiveBayes

        from sklearn.naive_bayes import MultinomialNB

        nb = MultinomialNB()

What's the accuracy?

In [126]:
from sklearn.naive_bayes import MultinomialNB
np.random.seed(0)


vectorizer = CountVectorizer( analyzer = "word", stop_words = 'english', max_features = 1000 ) 
v_text = vectorizer.fit_transform(df.clean_text)
# Transform the sparse matrix into a dataframe

df_text = pd.DataFrame(v_text.todense(), columns=vectorizer.get_feature_names_out())
df_text.columns

y = df.sentiment
X = v_text


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

nb = MultinomialNB()
nb.fit(X_train, y_train)
y_hat = nb.predict(X_test)

print(accuracy_score(y_hat, y_test))
confusion_matrix(y_hat, y_test)


0.796666666667


array([[128,  27],
       [ 34, 111]])
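The cells above fit the vectorizer on all reviews before splitting, so the vocabulary is learned partly from the test set. One possible alternative is to chain the vectorizer and the classifier in a pipeline that is fit on the training texts only. The sketch below uses made-up toy texts and is not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (made up for illustration)
train_texts = ["great movie", "great acting", "terrible movie", "terrible acting"]
train_labels = [1, 1, 0, 0]

# The pipeline learns the vocabulary and the classifier from the training texts only
model = make_pipeline(
    CountVectorizer(analyzer="word", stop_words="english", max_features=1000),
    MultinomialNB(),
)
model.fit(train_texts, train_labels)

# New raw texts are vectorized with the training vocabulary before prediction
predictions = model.predict(["great", "terrible"])
print(predictions)
```

In the notebook, the split would then be applied to the raw text, for example `train_test_split(df.clean_text, df.sentiment.values, test_size=0.20, random_state=42)`, before calling `model.fit`.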