## Movie Review Classification

![This is getting exciting](https://i.kinja-img.com/gawker-media/image/upload/s--hIgTSFEs--/c_fit,fl_progressive,q_80,w_320/17j2zn73qxdlfgif.jpg)

Using everything we have learned so far, we will now combine our techniques to perform some basic classification. We'll use the NLTK movie reviews data set to classify reviews as positive or negative. Here's some code to get you started:

In [123]:
from nltk.corpus import movie_reviews as reviews

X = [reviews.raw(fileid) for fileid in reviews.fileids()]
y = [reviews.categories(fileid)[0] for fileid in reviews.fileids()]

### 1 - Print a positive and negative review

In [124]:
# Print a positive movie review
print('\n**** POSITIVE REVIEW ****')
print(X[y.index('pos')][:300])

# Print a negative movie review
print('\n**** NEGATIVE REVIEW ****')
print(X[y.index('neg')][:300])


**** POSITIVE REVIEW ****
films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
for starters , it was created by 

**** NEGATIVE REVIEW ****
plot : two teen couples go to a church party , drink and then drive . 
they get into an accident . 
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 
what's the deal ? 
watch the movie and " sorta " find out . . . 
critique : a mind-fuck movie for the


### 2 - Using the scikit-learn train_test_split function (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), split the data into a training set and a test set.

In [125]:
from sklearn.model_selection import train_test_split

# Split the reviews into train and test partitions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=11)
print('Training Reviews: ', len(X_train), '  Testing Reviews: ', len(X_test))

Training Reviews:  1600   Testing Reviews:  400
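
The split above is purely random, so the positive/negative ratio can drift slightly between partitions. As a minimal sketch (using toy stand-in data, not the NLTK corpus), passing `stratify` to `train_test_split` keeps the class ratio the same in both partitions. The names `docs` and `labels` are hypothetical.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-ins for the reviews and their labels (hypothetical data)
docs = ['review %d' % i for i in range(20)]
labels = ['pos'] * 10 + ['neg'] * 10

# stratify=labels preserves the pos/neg ratio in both partitions
docs_train, docs_test, labels_train, labels_test = train_test_split(
    docs, labels, test_size=0.20, random_state=11, stratify=labels)

print('Train:', Counter(labels_train))
print('Test: ', Counter(labels_test))
```

With a balanced corpus like this one the effect is small, but it makes the evaluation less sensitive to the choice of `random_state`.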


### 3 - Then lemmatize or stem the reviews, and transform the documents to tf-idf

In [126]:
import re

import nltk
import numpy as np
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Define stopwords and stemmer
stopwords = nltk.corpus.stopwords.words('english')
stemmer = SnowballStemmer('english')

# Define tokenize and stem functions 
def tokenize_and_stem(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

def tokenize_only(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

# tokenize and stem the reviews
totalvocab_stemmed = []
totalvocab_tokenized = []

for i in X:
    allwords_stemmed = tokenize_and_stem(i) 
    totalvocab_stemmed.extend(allwords_stemmed) 
    
    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)
    
# Print a sample of the stems and words
vocab_df = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print('*** There are ' + str(vocab_df.shape[0]) + ' items in the reviews vocabulary.')
print(vocab_df.sample(10))

# Configure TfIdf Vectorizer
vectorizer = TfidfVectorizer(stop_words=stopwords,
                             tokenizer=tokenize_only)
 
# Learn and transform train documents
X_train_tfidf  = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Extract feature names (on scikit-learn >= 1.0 use get_feature_names_out();
# get_feature_names() was removed in 1.2)
feature_names = np.array(vectorizer.get_feature_names())
print ('   ')
print ('*** There are ' + str(len(feature_names)) + ' tokenized features. ')

# Extract smallest and largest features by tfidf
print (" ")
sorted_tfidf_index = X_train_tfidf.max(0).toarray()[0].argsort()
print('Smallest TfIdf: \n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest TfIdf: \n{}\n'.format(feature_names[sorted_tfidf_index[:-11:-1]]))


*** There are 1316846 items in the reviews vocabulary.
           words
are          are
of            of
an            an
belong   belongs
bear        bear
is            is
be         being
script  scripted
if            if
in            in
   
*** There are 41463 tokenized features. 
 
Smallest TfIdf: 
['mikel' 'robinson-blackmore' 'koven' 'micheal' 'esmeralda' 'cradles'
 'pig-keeper' 'proudest' '25th' 'bardsley']

Largest TfIdf: 
['nbsp' 'webb' 'leila' 'pokemon' 'alessa' 'bye' 'flynt' 'matilda' 'rocky'
 'giles']
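
To make the tf-idf weights less of a black box, here is a hand-rolled sketch of the weighting that `TfidfVectorizer` documents for its default settings (smoothed idf, L2-normalized rows). It is an illustration on a toy corpus, not scikit-learn's actual implementation, and `tfidf_by_hand` and `toy_docs` are hypothetical names.

```python
import math
from collections import Counter

def tfidf_by_hand(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        raw = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

toy_docs = [['great', 'movie'], ['bad', 'movie'], ['great', 'great', 'acting']]
weights = tfidf_by_hand(toy_docs)
for row in weights:
    print(row)
```

In the second toy document, `bad` appears in fewer documents than `movie`, so it gets the larger weight. The "Smallest" and "Largest" lists above rank each feature by its maximum weight in any training review, which is why rare names dominate both ends.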



### 4 - Finally, build a model. To start, use a logistic regression (which we will review in detail in the coming lectures)
(http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [127]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Transform sentiment classes/labels to binary values
y_train_bin = [1 if x == 'pos' else 0 for x in y_train]
y_test_bin = [1 if x == 'pos' else 0 for x in y_test]

# Define and train logit_model using tfidf vectorized training data
logit_model = LogisticRegression()
logit_model.fit(X_train_tfidf, y_train_bin)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
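
Until the lectures cover it in detail, a rough mental model may help: a fitted logistic regression scores a review as a weighted sum of its tf-idf features and squashes that score into a probability with the sigmoid function. The sketch below uses hand-picked, hypothetical coefficients (not values fitted on the reviews) purely to show the arithmetic.

```python
import math

def predict_proba_positive(weights, bias, features):
    """Probability of the positive class: sigmoid(w . x + b)."""
    score = bias + sum(weights.get(term, 0.0) * value
                       for term, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients chosen by hand for illustration
toy_weights = {'great': 2.0, 'bad': -2.5, 'movie': 0.0}
toy_bias = 0.0

p_good = predict_proba_positive(toy_weights, toy_bias, {'great': 0.7, 'movie': 0.7})
p_bad = predict_proba_positive(toy_weights, toy_bias, {'bad': 0.7, 'movie': 0.7})
print(round(p_good, 3), round(p_bad, 3))
```

A probability above 0.5 is what `predict` reports as class 1.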

### 5 - Measure the efficacy of your model using the Receiver Operating Characteristic (ROC) Area Under the Curve (AUC). Report this metric on the test set of your data.

For more info on this, see: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#sphx-glr-auto-examples-model-selection-plot-roc-crossval-py

In [128]:
# Predict test review sentiment class based on logit model
logit_predict = logit_model.predict(X_test_tfidf)

# Compute the ROC AUC score on the test set from the predicted class labels
roc_auc_1 = roc_auc_score(y_test_bin, logit_predict)
print('Model 1. Standard logit model using tokenized only vectorization of features.')
print('ROC AUC score 1: ', roc_auc_1)
print('  ')
print ('Correctly predicts sample as:')
print('The movie is really bad, I will never recommend it',' ||| The movie was sad, but really great ending.  I will see it again')
print(logit_model.predict(vectorizer.transform(['The movie is really bad, I will never recommend it.','The movie was sad, but really great ending.  I will see it again.'])))

Model 1. Standard logit model using tokenized only vectorization of features.
ROC AUC score 1:  0.822951844903
  
Correctly predicts sample as:
The movie is really bad, I will never recommend it  ||| The movie was sad, but really great ending.  I will see it again
[0 1]
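
One caveat about the score above: it is computed from the hard 0/1 output of `predict`, which discards the ranking information that ROC AUC is meant to measure. The metric is more commonly computed from continuous scores, for example `logit_model.predict_proba(X_test_tfidf)[:, 1]`. The sketch below uses a hand-rolled pairwise AUC on made-up numbers (not `roc_auc_score` itself, and not the review data) to show how the two can differ.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities for six test reviews
y_true = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
hard = [1 if p >= 0.5 else 0 for p in probs]

auc_from_probs = roc_auc(y_true, probs)
auc_from_labels = roc_auc(y_true, hard)
print(auc_from_probs, auc_from_labels)
```

With hard labels the score collapses to the average of the true positive rate and the true negative rate, so the numbers reported in this notebook are probably closer to a balanced accuracy than to a full ROC AUC.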


### 6 - Change a parameter in your model (introduce regularization) or change a parameter in your word vector transformation
(http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Try introducing the use of stop words, or employing a cutoff on terms with min or max df.

In [129]:
# Configure TfIdf Vectorizer
vectorizer = TfidfVectorizer(stop_words=stopwords,
                             tokenizer=tokenize_and_stem)
 
# Learn and transform train documents
X_train_tfidf  = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Extract feature names
feature_names = np.array(vectorizer.get_feature_names())
print ('*** There are ' + str(len(feature_names)) + ' tokenized features. ')

# Extract smallest and largest features by tfidf
print (" ")
sorted_tfidf_index = X_train_tfidf.max(0).toarray()[0].argsort()
print('Smallest TfIdf: \n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest TfIdf: \n{}\n'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

# Build the logit model
logit_model.fit(X_train_tfidf, y_train_bin)

# Predict test review sentiment class based on logit model
logit_predict = logit_model.predict(X_test_tfidf)

# Compute the ROC AUC score on the test set from the predicted class labels
roc_auc_2 = roc_auc_score(y_test_bin, logit_predict)
print('Model 2. Standard logit model using tokenized and stemming vectorization of features.')
print('ROC AUC score 2: ', roc_auc_2)

*** There are 28649 tokenized features. 
 
Smallest TfIdf: 
['koven' 'mikel' 'robinson-blackmor' '`never' 'dragonlik' 'unshroud'
 'fondest' 'heartbroken' 'fondacairo' '`disneyfi']

Largest TfIdf: 
['pimp' 'nbsp' 'webb' 'pokemon' 'alessa' 'leila' 'bye' 'matilda' 'flynt'
 'gile']

Model 2. Standard logit model using tokenized and stemming vectorization of features.
ROC AUC score 2:  0.827829893684
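
Step 6 also suggests introducing regularization, which the cell above does not try. In scikit-learn's `LogisticRegression` the strength of the default L2 penalty is controlled by `C`, where smaller values mean stronger regularization. The following sketch uses synthetic data from `make_classification` as a stand-in for the tf-idf matrix; the chosen `C` values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the tf-idf features (not the review data)
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Smaller C = stronger L2 penalty = coefficients pulled toward zero
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_toy, y_toy)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print('C =', C, ' coefficient norm =', round(coef_norms[C], 3))
```

On the reviews, the same idea would mean refitting with something like `LogisticRegression(C=10)` and comparing the test AUC against the baseline.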


### 7 - Make four models in total, changing parameters and comparing the AUC results. Report your findings in a tabular form.

In [130]:
# Configure TfIdf Vectorizer
vectorizer = TfidfVectorizer(stop_words=stopwords,
                             tokenizer=tokenize_only,
                             min_df = .1, max_df = 0.7)
 
# Learn and transform train documents
X_train_tfidf  = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Extract feature names
feature_names = np.array(vectorizer.get_feature_names())
print ('*** There are ' + str(len(feature_names)) + ' tokenized features. ')

# Extract smallest and largest features by tfidf
print (" ")
sorted_tfidf_index = X_train_tfidf.max(0).toarray()[0].argsort()
print('Smallest TfIdf: \n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest TfIdf: \n{}\n'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

# Build the logit model
logit_model.fit(X_train_tfidf, y_train_bin)

# Predict test review sentiment class based on logit model
logit_predict = logit_model.predict(X_test_tfidf)

# Compute the ROC AUC score on the test set from the predicted class labels
roc_auc_3 = roc_auc_score(y_test_bin, logit_predict)
print('Model 3. Standard logit model using tokenize only and min/max freq of .1/.7 vectorization of features.')
print('ROC AUC score 3: ', roc_auc_3)


*** There are 422 tokenized features. 
 
Smallest TfIdf: 
['already' 'later' 'based' 'left' 'interest' 'directed' 'sort' 'wonder'
 'anyone' 'nearly']

Largest TfIdf: 
['family' 'star' 'classic' 'go' 'city' "'m" 'evil' 'black' 'run' 'series']

Model 3. Standard logit model using tokenize only and min/max freq of .1/.7 vectorization of features.
ROC AUC score 3:  0.780988117573
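
The drop to 422 features follows from how the document-frequency cutoffs work. When given as floats, `min_df` and `max_df` are proportions of documents, so with 1600 training reviews `min_df = .1` requires a term to appear in at least 160 of them. Below is a hand-rolled sketch of that filter on a toy corpus; `filter_vocabulary` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_df, max_df):
    """Keep terms whose document-frequency proportion lies in [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    # Count each term once per document that contains it
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(term for term, count in df.items()
                  if min_df <= count / n_docs <= max_df)

toy_docs = [['the', 'plot', 'twist'],
            ['the', 'plot'],
            ['the', 'acting'],
            ['the', 'plot', 'score']]

kept = filter_vocabulary(toy_docs, min_df=0.5, max_df=0.8)
print(kept)
```

Here `the` is dropped for being in every document and the rare terms are dropped for being in only one. Cutting this aggressively removes most of the sentiment-bearing vocabulary, which is consistent with Model 3 scoring lowest.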


In [131]:
# Configure TfIdf Vectorizer
vectorizer = TfidfVectorizer(stop_words=stopwords,
                             tokenizer=tokenize_only,
                             ngram_range = (1,3))
 
# Learn and transform train documents
X_train_tfidf  = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Extract feature names
feature_names = np.array(vectorizer.get_feature_names())
print ('*** There are ' + str(len(feature_names)) + ' tokenized features. ')

# Extract smallest and largest features by tfidf
print (" ")
sorted_tfidf_index = X_train_tfidf.max(0).toarray()[0].argsort()
print('Smallest TfIdf: \n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest TfIdf: \n{}\n'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

# Build the logit model
logit_model.fit(X_train_tfidf, y_train_bin)

# Predict test review sentiment class based on logit model
logit_predict = logit_model.predict(X_test_tfidf)

# Compute the ROC AUC score on the test set from the predicted class labels
roc_auc_4 = roc_auc_score(y_test_bin, logit_predict)
print('Model 4. Standard logit model using tokenize only and ngram 1to3 vectorization.')
print('ROC AUC score 4: ', roc_auc_4)

*** There are 1022530 tokenized features. 
 
Smallest TfIdf: 
['project minds nbsp' 'nbsp finally get' 'nbsp first'
 'nbsp first screenplay' 'nbsp five' 'nbsp five years' 'nbsp given'
 'nbsp given enough' 'nbsp however' 'nbsp however rather']

Largest TfIdf: 
['nbsp' 'leila' 'webb' 'flynt' 'kermit' 'paulie' 'alessa' 'matilda' 'giles'
 'ghost dog']

Model 4. Standard logit model using tokenize only and ngram 1to3 vectorization.
ROC AUC score 4:  0.825641025641
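
The jump to over a million features comes from `ngram_range = (1,3)`, which adds every run of two and three consecutive tokens to the vocabulary. Here is a minimal sketch of that expansion, written by hand for illustration (`word_ngrams` is a hypothetical helper, not scikit-learn's internal function).

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """Return all n-grams of consecutive tokens for n in the given range."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

tokens = ['nbsp', 'five', 'years', 'later']
grams = word_ngrams(tokens)
print(len(grams), grams)
```

Four tokens already yield nine features. Most trigrams occur in a single review, so they add a lot of dimensions but little signal, which may explain why Model 4 barely improves on Model 1.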


In [132]:
# Configure TfIdf Vectorizer
vectorizer = TfidfVectorizer(stop_words=stopwords,
                             tokenizer=tokenize_and_stem,
                             max_features = 5000)

 
# Learn and transform train documents
X_train_tfidf  = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Extract feature names
feature_names = np.array(vectorizer.get_feature_names())
print ('*** There are ' + str(len(feature_names)) + ' tokenized features. ')

# Extract smallest and largest features by tfidf
print (" ")
sorted_tfidf_index = X_train_tfidf.max(0).toarray()[0].argsort()
print('Smallest TfIdf: \n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest TfIdf: \n{}\n'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

# Build the logit model
logit_model.fit(X_train_tfidf, y_train_bin)

# Predict test review sentiment class based on logit model
logit_predict = logit_model.predict(X_test_tfidf)

# Compute the ROC AUC score on the test set from the predicted class labels
roc_auc_5 = roc_auc_score(y_test_bin, logit_predict)
print('Model 5. Standard logit model using tokenize / stem and max features = 5000 vectorization.')
print('ROC AUC score 5: ', roc_auc_5)

*** There are 5000 tokenized features. 
 
Smallest TfIdf: 
['characterist' 'peac' 'paragraph' 'februari' 'ensur' 'forewarn' 'incap'
 'releg' 'earnest' 'notion']

Largest TfIdf: 
['pimp' 'webb' 'nbsp' 'gile' 'bye' 'matilda' 'rocki' 'flynt' 'leila'
 'pollock']

Model 5. Standard logit model using tokenize / stem and max features = 5000 vectorization.
ROC AUC score 5:  0.8301438399


In [137]:
from collections import OrderedDict

results = OrderedDict([ ('Model', ['#1 - TfIdf - Tokenization Only', '#2 - Tokenize and stemming', 
                                   '#3 - Tokenize w/ doc freq min/max = .1/.7', '#4 - Tokenize w/ Ngram=Trigram level',
                                   '#5 - Tokenize and Stem w/ max features = 5000']),
          ('# Features', [41463, 28649, 422, 1022530, 5000]),
          ('ROC AUC Score', [roc_auc_1, roc_auc_2, roc_auc_3, roc_auc_4, roc_auc_5]) ])

results_df = pd.DataFrame.from_dict(results)
results_score_df = results_df.sort_values(by=['ROC AUC Score'],ascending=False)
results_score_df[:]

| | Model | # Features | ROC AUC Score |
|---|---|---|---|
| 4 | #5 - Tokenize and Stem w/ max features = 5000 | 5000 | 0.830144 |
| 1 | #2 - Tokenize and stemming | 28649 | 0.82783 |
| 3 | #4 - Tokenize w/ Ngram=Trigram level | 1022530 | 0.825641 |
| 0 | #1 - TfIdf - Tokenization Only | 41463 | 0.822952 |
| 2 | #3 - Tokenize w/ doc freq min/max = .1/.7 | 422 | 0.780988 |
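
The five model cells repeat the same vectorize, fit and score steps with different settings. One way to cut the repetition, sketched here on a tiny made-up corpus rather than the reviews, is to wrap the steps in a function that takes the vectorizer settings and returns the feature count and AUC. The names `evaluate` and `configs` are hypothetical, and the AUC is computed from predicted probabilities rather than hard labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for the train/test reviews
train_docs = ['great film', 'wonderful acting', 'great fun',
              'bad film', 'awful acting', 'bad plot']
train_labels = [1, 1, 1, 0, 0, 0]
test_docs = ['great acting', 'wonderful film', 'bad acting', 'awful film']
test_labels = [1, 1, 0, 0]

configs = {'unigrams': {}, 'uni+bigrams': {'ngram_range': (1, 2)}}

def evaluate(vectorizer_kwargs):
    """Fit tf-idf + logistic regression; return (n_features, test AUC)."""
    pipeline = make_pipeline(TfidfVectorizer(**vectorizer_kwargs),
                             LogisticRegression())
    pipeline.fit(train_docs, train_labels)
    scores = pipeline.predict_proba(test_docs)[:, 1]
    n_features = len(pipeline.named_steps['tfidfvectorizer'].vocabulary_)
    return n_features, roc_auc_score(test_labels, scores)

rows = [(name,) + evaluate(kwargs) for name, kwargs in configs.items()]
for row in rows:
    print(row)
```

Reading the feature count from the fitted vectorizer also avoids typing it into the results table by hand.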


### End of Assignment
