# Document Classification & Clustering - Lecture

What can we do with the document-term matrices (DTMs) created in the previous notebook? We could visualize them or use them to train a model for a specific task. We have covered both classification and clustering before, so we won't focus on the particulars of the algorithms. Instead, we'll focus on the problems unique to using text as input to these models.

## Contents
* [Part 1](#p1): Vectorize a whole Corpus
* [Part 2](#p2): Tune the vectorizer
* [Part 3](#p3): Apply Vectorizer to Classification problem
* [Part 4](#p4): Introduce topic modeling on text data

**Business Case**: Your managers at Smartphone Inc. have asked you to develop a system that buckets text messages into two categories: **spam** and **not spam (ham)**. The system will be implemented on your company's products to help users identify suspicious texts.

# Spam Filter - Count Vectorization Method

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_colwidth', 200)

**Import the data and take a look at it**

In [13]:
def load():
    url = "https://raw.githubusercontent.com/sokjc/BayesNotBaes/master/sms.tsv"

    df = pd.read_csv(url, sep='\t', header=None, 
                     names=['label', 'msg'])
    df = df.rename(columns={"msg":"text"})
    
    # encode target
    df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})
    
    return df

pd.set_option('display.max_colwidth', 200)
df = load()
df.tail()

Unnamed: 0,label,text,label_num
5567,spam,"This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.",1
5568,ham,Will ü b going to esplanade fr home?,0
5569,ham,"Pity, * was in mood for that. So...any other suggestions?",0
5570,ham,The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free,0
5571,ham,Rofl. Its true to its name,0


Notice that this text isn't as coherent as the job listings. We'll proceed as normal, though.

What is the ratio of Spam to Ham messages?

In [15]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [16]:
df['label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: label, dtype: float64

**Model Validation - Train Test Split** (Cross Validation would be better here) 

In [17]:
from sklearn.model_selection import train_test_split

X = df['text']
y = df['label_num']

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2, random_state=812)

In [18]:
print(X_train.shape,
      X_test.shape,
      y_train.shape,
      y_test.shape, sep='\n')

(4457,)
(1115,)
(4457,)
(1115,)
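The heading above notes that cross-validation would be better than a single split. A minimal sketch of what that could look like follows, assuming a tiny made-up set of messages in place of the SMS data. The key idea is to put the vectorizer inside a pipeline so the vocabulary is re-learned on each set of training folds.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up stand-in for the SMS data (illustrative only)
texts = [
    "win a free prize now", "free cash call now",
    "claim your free prize", "urgent call to win cash",
    "are you coming home", "see you at lunch",
    "call me when you get home", "lunch at home today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# The vectorizer lives inside the pipeline, so each fold builds its
# vocabulary from its own training messages only.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# Stratified folds keep the spam/ham ratio similar in every fold
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=812)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")

print(scores)
print("mean accuracy:", scores.mean())
```

On the real data you would pass `df['text']` and `df['label_num']` instead of the toy lists.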


**Count Vectorizer**

Today we're just going to let Scikit-Learn do our text cleaning and preprocessing for us.

Let's run our vectorizer on our text messages and take a peek at the tokenization of the vocabulary.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=None, ngram_range=(1,1), 
                             stop_words='english')

vectorizer.fit(X_train)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out()[300:325].tolist())

['150p16', '150pm', '150ppermesssubscription', '150ppm', '150ppmpobox10183bhamb64xe', '150ppmsg', '150pw', '151', '153', '15541', '16', '165', '1680', '169', '177', '18', '1843', '18p', '18yrs', '195', '1apple', '1b6a5ecef91ff9', '1cup', '1da', '1er']
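Odd tokens like these come from the vectorizer's default tokenization: it lowercases the text and keeps every run of two or more word characters. The sketch below applies the default `token_pattern` (the one shown in the vectorizer printouts later in this notebook) with plain `re`, on one of the spam messages. It only approximates the tokenization step and does not reproduce stop-word removal, which happens afterwards.

```python
import re

# Default token pattern, as it appears in the vectorizer printouts in this notebook
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

message = "U have won the £750 Pound prize. 2 claim is easy"

# Lowercase first, then keep runs of two or more word characters
tokens = re.findall(TOKEN_PATTERN, message.lower())
print(tokens)
```

Single-character tokens such as "U" and "2" never make it into the vocabulary, and punctuation only acts as a separator.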


Now we'll complete the vectorization with `.transform()`

In [9]:
train_word_counts = vectorizer.transform(X_train)

# not necessary to save to a dataframe, but helpful for previewing
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), 
                                  columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(4457, 7443)


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585334,0125698789,02,0207,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
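Almost every cell in this preview is zero, which is why `.transform()` returns a sparse matrix in the first place. As a rough illustration of what `.toarray()` costs, the arithmetic below estimates the size of the dense version of the training matrix, assuming the 64-bit integer counts that `CountVectorizer` produces by default.

```python
# Shape of the training matrix printed above, stored as 64-bit integers
n_rows, n_cols = 4457, 7443
bytes_per_value = 8  # int64

dense_bytes = n_rows * n_cols * bytes_per_value
print(f"Dense array: {dense_bytes / 1e6:.0f} MB")
```

The classifiers used in this notebook accept the sparse output of `.transform()` directly, so on larger corpora it is usually worth skipping the DataFrame step outside of previews.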


We also need to vectorize our `X_test` data, but **we must use the same vocabulary as the training dataset**, so we only call `.transform()` on `X_test` (never `.fit()`) to get our `X_test_vectorized`.

In [10]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), 
                                 columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(1115, 7443)


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585334,0125698789,02,0207,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
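To make the point about reusing the training vocabulary concrete, here is a small sketch with made-up messages. The vectorizer is fit on the training messages only, so a word that appears only in the test message ("pizza") has no column and is silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages for illustration
train_messages = ["free prize now", "see you at home"]
test_messages = ["free pizza at home"]

vectorizer = CountVectorizer()
vectorizer.fit(train_messages)  # vocabulary comes from the training data only

vocabulary = sorted(vectorizer.vocabulary_)
test_counts = vectorizer.transform(test_messages)

print(vocabulary)
print(test_counts.toarray())
```

The test matrix has exactly as many columns as the training vocabulary, which is what lets a model trained on one be applied to the other.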


Let's run some classification models and see what kind of accuracy we can get!

# Model Selection

In [11]:
from sklearn.metrics import accuracy_score

def assess_model(model, X_train, X_test, 
                 y_train, y_test, vect_type='Count'):
    model.fit(X_train, y_train)

    train_predictions = model.predict(X_train)
    test_predictions = model.predict(X_test)

    result = {}
    result['model'] = str(model).split('(')[0]
    result['acc_train'] = accuracy_score(y_train, train_predictions)
    result['acc_test'] = accuracy_score(y_test, test_predictions)
    result['vect_type'] = vect_type
    print(result)
    
    return result

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB # Multinomial Naive Bayes
from sklearn.ensemble import RandomForestClassifier

models = [LogisticRegression(random_state=42, solver='lbfgs'),
          MultinomialNB(),
          RandomForestClassifier()]

results = []
for model in models:
    result = assess_model(
        model,
        X_train_vectorized, X_test_vectorized, y_train, y_test)
    
    results.append(result)
    
pd.DataFrame.from_records(results)

{'model': 'LogisticRegression', 'acc_train': 0.9957370428539376, 'acc_test': 0.9766816143497757, 'vect_type': 'Count'}
{'model': 'MultinomialNB', 'acc_train': 0.9934933811981154, 'acc_test': 0.9856502242152466, 'vect_type': 'Count'}
{'model': 'RandomForestClassifier', 'acc_train': 0.9977563383441777, 'acc_test': 0.9721973094170404, 'vect_type': 'Count'}


Unnamed: 0,acc_test,acc_train,model,vect_type
0,0.976682,0.995737,LogisticRegression,Count
1,0.98565,0.993493,MultinomialNB,Count
2,0.972197,0.997756,RandomForestClassifier,Count


# Spam Filter - TF-IDF Vectorization Method

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=None, ngram_range=(1,1), stop_words='english')

# fit to train
vectorizer.fit(X_train)
print(vectorizer)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
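With the settings printed above (`use_idf=True`, `smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), scikit-learn documents the weight of a term as its raw count times idf(t) = ln((1 + n) / (1 + df(t))) + 1, with each document vector then scaled to unit L2 length. The sketch below works through that formula by hand on a made-up, pre-tokenized corpus. It illustrates the arithmetic and is not the library's implementation.

```python
import math
from collections import Counter

# Made-up, already-tokenized corpus
docs = [["free", "prize"], ["free", "call"], ["call", "home", "home"]]
n_docs = len(docs)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
       for term, count in doc_freq.items()}

def tfidf(doc):
    """Raw term counts times idf, scaled to unit L2 length."""
    counts = Counter(doc)
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf(doc) for doc in docs]
print(vectors[0])
```

In the first document, "prize" gets a larger weight than "free" because it appears in fewer documents. That is the whole point of the idf term.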


In [21]:
# apply to train
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(),
                                  columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(4457, 7443)


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585334,0125698789,02,0207,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
# apply to test
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(),
                                 columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(1115, 7443)


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585334,0125698789,02,0207,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
models = [LogisticRegression(random_state=42, solver='lbfgs'),
          MultinomialNB(),
          RandomForestClassifier()]

for model in models:
    result = assess_model(
        model,
        X_train_vectorized, X_test_vectorized, y_train, y_test,
        vect_type='Tfidf')
    
    results.append(result)
    
pd.DataFrame.from_records(results)

{'model': 'LogisticRegression', 'acc_train': 0.9703836661431456, 'acc_test': 0.9551569506726457, 'vect_type': 'Tfidf'}
{'model': 'MultinomialNB', 'acc_train': 0.982499439084586, 'acc_test': 0.9659192825112107, 'vect_type': 'Tfidf'}
{'model': 'RandomForestClassifier', 'acc_train': 0.9966345075162666, 'acc_test': 0.9695067264573991, 'vect_type': 'Tfidf'}


Unnamed: 0,acc_test,acc_train,model,vect_type
0,0.976682,0.995737,LogisticRegression,Count
1,0.98565,0.993493,MultinomialNB,Count
2,0.972197,0.997756,RandomForestClassifier,Count
3,0.955157,0.970384,LogisticRegression,Tfidf
4,0.965919,0.982499,MultinomialNB,Tfidf
5,0.969507,0.996635,RandomForestClassifier,Tfidf


# Sentiment Analysis

The objective of **sentiment analysis** is to take a text phrase and determine whether its sentiment is positive, neutral, or negative.

Suppose you want to use NLP to classify reviews of your company's products as positive, neutral, or negative. Maybe you don't trust the star ratings left by users and want an additional measure of sentiment from each review. You might use that measure as a generated feature for further modeling, or to identify disgruntled customers and reach out to them to improve your customer service. Sentiment analysis has also been used heavily in stock price estimation, by tracking the sentiment of individuals' tweets after news breaks about a company.

Does every word in a review contribute to its overall sentiment? Not really. Stop words, for example, tell us little about the overall sentiment of the text, so just as before, we will discard them.

### NLTK Movie Review Sentiment Analysis

`pip install -U nltk`

In [41]:
import random
import nltk

In [43]:
def load_movie_reviews():
    from nltk.corpus import movie_reviews
    nltk.download('movie_reviews')
    nltk.download('stopwords')
    
    print("Total reviews:", len(movie_reviews.fileids()))
    print("Positive reviews:", len(movie_reviews.fileids('pos')))
    print("Negative reviews:", len(movie_reviews.fileids('neg')))
    
    # Get Reviews and randomize
    reviews = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

    random.shuffle(reviews)
    
    documents = []
    sentiments = []

    for review in reviews:
        # Add sentiment to list
        if review[1] == "pos":
            sentiments.append(1)
        else:
            sentiments.append(0)

        # Add text to list
        review_text = " ".join(review[0])
        documents.append(review_text)

    df = pd.DataFrame({"text": documents, 
                       "sentiment": sentiments})
    
    return df

In [44]:
df = load_movie_reviews()
df.head()

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\City_Year\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\City_Year\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Total reviews: 2000
Positive reviews: 1000
Negative reviews: 1000


Unnamed: 0,text,sentiment
0,"aspiring broadway composer robert ( aaron williams ) secretly carries a torch for his best friend , struggling actor marc ( michael shawn lucas ) . the problem is , marc only has eyes for "" perfec...",0
1,"it happens every year -- the days get longer , the weather gets warmer and the studios start releasing their big - budget blockbusters . this year ' s crop already seems inferior to that of past s...",0
2,"` the skulls ' is a laughably bad thriller , a teen - orientated doppelganger of ` the firm ' so blazingly ridiculous that it caused me to drift into a hypnotic stupor . certain moments are so pre...",0
3,"with the exception of their surrealistic satire barton fink , the films of joel and ethan coen fit into two broad categories : quirky and sometimes darkly humorous takes on the "" film noir "" genre...",1
4,"disney cements their place in the forefront of feature animation with the release of their latest animated adventure , mulan . while it adheres a bit too close to the disney formula to be perfect ...",1


### Train Test Split

In [45]:
X = df['text']
y = df['sentiment']

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2, random_state=42)

# Sentiment Analysis - CountVectorizer

## Generate vocabulary from train dataset

In [46]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=None, ngram_range=(1,1), 
                             stop_words='english')

vectorizer.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [47]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), 
                                  columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(1600, 35989)


Unnamed: 0,00,000,007,03,04,05,05425,10,100,1000,...,zuehlke,zuko,zukovsky,zulu,zurg,zus,zweibel,zwick,zwigoff,zycie
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0


In [48]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), 
                                 columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(400, 35989)


Unnamed: 0,00,000,007,03,04,05,05425,10,100,1000,...,zuehlke,zuko,zukovsky,zulu,zurg,zus,zweibel,zwick,zwigoff,zycie
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Model Selection

In [49]:
models = [LogisticRegression(random_state=42, solver='lbfgs'),
          MultinomialNB(),
          RandomForestClassifier()]

In [50]:
results = []

In [51]:
for model in models:
    result = assess_model(
        model,
        X_train_vectorized, X_test_vectorized, y_train, y_test,
        vect_type='Count')
    
    results.append(result)
    
pd.DataFrame.from_records(results)

{'model': 'LogisticRegression', 'acc_train': 1.0, 'acc_test': 0.8375, 'vect_type': 'Count'}
{'model': 'MultinomialNB', 'acc_train': 0.975625, 'acc_test': 0.785, 'vect_type': 'Count'}
{'model': 'RandomForestClassifier', 'acc_train': 0.99, 'acc_test': 0.725, 'vect_type': 'Count'}


Unnamed: 0,acc_test,acc_train,model,vect_type
0,0.8375,1.0,LogisticRegression,Count
1,0.785,0.975625,MultinomialNB,Count
2,0.725,0.99,RandomForestClassifier,Count


# Sentiment Analysis - TfidfVectorizer

In [52]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=2000, ngram_range=(1,2),
                             min_df = 5, max_df = .80,
                             stop_words='english')

vectorizer.fit(X_train)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=2000, min_df=5,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
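This is the first vectorizer in the notebook with non-default settings, so it is worth seeing what each one does to the vocabulary. The sketch below uses a made-up four-document corpus and `CountVectorizer`, which builds its vocabulary the same way `TfidfVectorizer` does. The helper `vocab` is a hypothetical convenience function, not something defined elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus
docs = ["good movie", "good film", "bad movie", "good plot"]

def vocab(**kwargs):
    """Fit a CountVectorizer with the given settings and return its sorted terms."""
    return sorted(CountVectorizer(**kwargs).fit(docs).vocabulary_)

bigram_vocab = vocab(ngram_range=(1, 2))      # unigrams plus bigrams
frequent_vocab = vocab(min_df=2)              # terms in at least 2 documents
trimmed_vocab = vocab(min_df=2, max_df=0.6)   # ...and in at most 60% of documents

print(bigram_vocab)
print(frequent_vocab)
print(trimmed_vocab)
```

`min_df` removes rare terms, `max_df` removes terms so common they carry little signal, and `max_features` (not shown here) then keeps only the most frequent of whatever remains.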

In [53]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), 
                                  columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(1600, 2000)


Unnamed: 0,000,10,100,12,13,15,17,1995,1996,1997,...,year old,years,years ago,years later,yes,york,young,young man,younger,zero
0,0.0,0.0,0.0,0.0,0.0,0.071942,0.0,0.0,0.0,0.0,...,0.058335,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0535,0.0,0.078474,0.077244,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.047541,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057325,0.0
4,0.0,0.0307,0.038514,0.0,0.0,0.0,0.041062,0.0,0.0,0.07191,...,0.0,0.039311,0.07021,0.0,0.0,0.0,0.042771,0.0,0.0,0.0


In [54]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), 
                                 columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(400, 2000)


Unnamed: 0,000,10,100,12,13,15,17,1995,1996,1997,...,year old,years,years ago,years later,yes,york,young,young man,younger,zero
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.05356,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Model Selection

In [55]:
for model in models:
    result = assess_model(
        model,
        X_train_vectorized, X_test_vectorized, y_train, y_test,
        vect_type='tfidf')
    
    results.append(result)
    
pd.DataFrame.from_records(results)

{'model': 'LogisticRegression', 'acc_train': 0.93875, 'acc_test': 0.82, 'vect_type': 'tfidf'}
{'model': 'MultinomialNB', 'acc_train': 0.883125, 'acc_test': 0.7725, 'vect_type': 'tfidf'}
{'model': 'RandomForestClassifier', 'acc_train': 0.99375, 'acc_test': 0.69, 'vect_type': 'tfidf'}


Unnamed: 0,acc_test,acc_train,model,vect_type
0,0.8375,1.0,LogisticRegression,Count
1,0.785,0.975625,MultinomialNB,Count
2,0.725,0.99,RandomForestClassifier,Count
3,0.82,0.93875,LogisticRegression,tfidf
4,0.7725,0.883125,MultinomialNB,tfidf
5,0.69,0.99375,RandomForestClassifier,tfidf


# Using NLTK to clean the data

### Importing the data fresh to avoid variable collisions

In [56]:
df = load_movie_reviews()

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\City_Year\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\City_Year\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Total reviews: 2000
Positive reviews: 1000
Negative reviews: 1000


### Cleaning function to apply to each document

In [57]:
from nltk.corpus import stopwords
import string

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

df_nltk = pd.DataFrame()
df_nltk['text'] = df.text.apply(clean_doc)
df_nltk['sentiment'] = df.sentiment
df_nltk.head()

Unnamed: 0,text,sentiment
0,"[movies, based, video, games, street, fighter, mario, bros, never, generated, much, interest, box, office, first, mortal, kombat, movie, came, surprisingly, well, simple, story, pulsating, soundtr...",0
1,"[great, actor, james, woods, said, paraphrasing, sex, messy, right, truly, profound, statement, one, could, made, entire, mad, slasher, genre, replace, sex, mad, slasher, film, uninformed, souls, ...",0
2,"[like, good, action, film, metro, action, keeps, involved, action, films, action, sequences, conventional, attention, detracted, diverted, thoughts, ghost, darkness, opened, months, ago, film, act...",0
3,"[first, troy, beyer, wrote, critically, panned, makes, directorial, debut, writing, directing, starring, sub, par, film, women, talking, sex, though, without, redeeming, qualities, film, bad, basi...",0
4,"[thought, baz, luhrmann, radical, take, williamshakespeare, sromeo, juliet, wild, wait, see, tony, award, winning, stage, director, julie, taymor, thelionking, thebroadwaymusical, bard, titusandro...",1
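Fused tokens such as `williamshakespeare` in the last row are most likely a side effect of the punctuation step: deleting punctuation *inside* a token glues the remaining pieces together. The sketch below replays the same steps as `clean_doc` on two short strings, using a tiny stand-in stop-word set (an assumption for illustration) instead of NLTK's list so that it runs without any downloads.

```python
import string

# Tiny stand-in for NLTK's English stop-word list (assumption for illustration)
STOP_WORDS = {"it", "s", "a", "is", "the"}

def clean_doc_sketch(doc):
    """Same steps as clean_doc, with the stand-in stop-word set."""
    tokens = doc.split()
    table = str.maketrans("", "", string.punctuation)
    tokens = [w.translate(table) for w in tokens]         # strip punctuation
    tokens = [w for w in tokens if w.isalpha()]           # drop empties and numbers
    tokens = [w for w in tokens if w not in STOP_WORDS]   # drop stop words
    return [w for w in tokens if len(w) > 1]              # drop 1-letter tokens

spaced = clean_doc_sketch("it ' s a laughably bad thriller , a teen - orientated film")
fused = clean_doc_sketch("a well-made film")

print(spaced)
print(fused)
```

When punctuation is already separated by spaces, as in the NLTK corpus, it simply disappears. When it sits inside a token, the two halves merge into one new "word".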


### Reformat reviews for sklearn

In [58]:
documents = []
for review in df_nltk.text:
    review = " ".join(review)
    documents.append(review)
  
sentiment = list(df_nltk.sentiment)
new_df = pd.DataFrame({'text': documents, 'sentiment': sentiment})
new_df.head()

Unnamed: 0,text,sentiment
0,movies based video games street fighter mario bros never generated much interest box office first mortal kombat movie came surprisingly well simple story pulsating soundtrack lots awesomely choreo...,0
1,great actor james woods said paraphrasing sex messy right truly profound statement one could made entire mad slasher genre replace sex mad slasher film uninformed souls mad slasher genre sub genre...,0
2,like good action film metro action keeps involved action films action sequences conventional attention detracted diverted thoughts ghost darkness opened months ago film action bland uninvolving in...,0
3,first troy beyer wrote critically panned makes directorial debut writing directing starring sub par film women talking sex though without redeeming qualities film bad basic story follows three sin...,0
4,thought baz luhrmann radical take williamshakespeare sromeo juliet wild wait see tony award winning stage director julie taymor thelionking thebroadwaymusical bard titusandronicus audacious bloody...,1


### Train Test Split

In [59]:
X = new_df.text
y = new_df.sentiment

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2, random_state=42)

### Vectorize the reviews

In [60]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,1), 
                             stop_words='english')

vectorizer.fit(X_train)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [61]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), 
                                  columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(1600, 35240)


Unnamed: 0,aa,aaa,aaaaaaaahhhh,aaaahhhs,aahs,aaliyah,aalyah,aamir,aardman,aaron,...,zukovsky,zulu,zundel,zurg,zus,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [62]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), 
                                 columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(400, 35240)


Unnamed: 0,aa,aaa,aaaaaaaahhhh,aaaahhhs,aahs,aaliyah,aalyah,aamir,aardman,aaron,...,zukovsky,zulu,zundel,zurg,zus,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Model Selection

In [63]:
models = [LogisticRegression(random_state=42, solver='lbfgs'),
          MultinomialNB(),
          RandomForestClassifier()]

In [64]:
results = []

In [65]:
for model in models:
    result = assess_model(
        model,
        X_train_vectorized, X_test_vectorized, y_train, y_test,
        vect_type='Tfidf')
    
    results.append(result)
    
pd.DataFrame.from_records(results)

{'model': 'LogisticRegression', 'acc_train': 0.985, 'acc_test': 0.8125, 'vect_type': 'Tfidf'}
{'model': 'MultinomialNB', 'acc_train': 0.975625, 'acc_test': 0.8025, 'vect_type': 'Tfidf'}
{'model': 'RandomForestClassifier', 'acc_train': 0.99, 'acc_test': 0.705, 'vect_type': 'Tfidf'}


Unnamed: 0,acc_test,acc_train,model,vect_type
0,0.8125,0.985,LogisticRegression,Tfidf
1,0.8025,0.975625,MultinomialNB,Tfidf
2,0.705,0.99,RandomForestClassifier,Tfidf


In [None]:
# import xgboost as xgb
from xgboost.sklearn import XGBClassifier

clf = XGBClassifier(
        # hyperparameters
        n_jobs=-1,
        # recent XGBoost releases expect eval_metric here rather than in fit()
        eval_metric='auc',
)

clf.fit(X_train_vectorized, y_train)