<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 2 Lesson 3*

# Document Classification & Clustering

What do you do with all those cool document-term-matrices (dtm[s]) you created yesterday? You could use them in a sweet viz, or we can teach a dope algorithm to do some specific task. :) You have seen both classification and clustering before, so we won't focus on the particulars of algorithms. Instead we'll focus on the unique problems of dealing with text input for these models.

## Learning Objectives
* [Part 1](#p1): Vectorize a whole Corpus
* [Part 2](#p2): Tune the vectorizer
* [Part 3](#p3): Apply Vectorizer to Classification problem
* [Part 4](#p4): Introduce topic modeling on text data

**Business Case**: Your managers at Smartphone Inc. have asked you to develop a system to bucket text messages into two categories: spam and not spam (ham). The system will be implemented on your company's products to help users identify suspicious texts.


Ham | Spam
:----: | :----:
<img align="left" src="https://images.unsplash.com/photo-1524438418049-ab2acb7aa48f?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=3300&q=80" width=500> | <img align="left" src="https://images2.minutemediacdn.com/image/upload/c_crop,h_1576,w_2800,x_0,y_52/f_auto,q_auto,w_1100/v1554931909/shape/mentalfloss/20997-istock-471531747.jpg" width=500>

# Spam Filter - Count Vectorization Method

In [0]:
import pandas as pd
import numpy as np

## Import the Data

Import the data and take a look at it.

In [0]:
url = "https://raw.githubusercontent.com/sokjc/BayesNotBaes/master/sms.tsv"

df = pd.read_csv(url, sep='\t', header=None, names=['label', 'msg'])

## Tidy up initial DataFrame

- Change Pandas display options so that we can see more of the text
- Rename the `msg` column to `text`.

In [4]:
pd.set_option('display.max_colwidth', 200)
df = df.rename(columns={"msg":"text"})
df.tail()

Unnamed: 0,label,text
5567,spam,"This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate."
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other suggestions?"
5570,ham,The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free
5571,ham,Rofl. Its true to its name


You'll notice right off the bat that this text isn't as coherent as the job listings. We'll proceed like normal though.

What is the ratio of Spam to Ham messages?

In [5]:
df.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [6]:
df['label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: label, dtype: float64

## Categorical encoding on labels.

In [7]:
df['label_num'] = df.label.map({'ham': 0, 'spam': 1})

df.head()

Unnamed: 0,label,text,label_num
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives around here though",0


## Model Validation - Train Test Split (quick and dirty)
Since we're going to do some modeling, we're going to need some model validation. For simplicity, let's just do a quick train_test_split for today. You can try out Cross Validation on your assignment today; I just want to get to a quick baseline.

In [0]:
from sklearn.model_selection import train_test_split

X = df.text
y = df.label_num

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=812)

Look at the sizes of our train and test datasets.

In [9]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4457,)
(1115,)
(4457,)
(1115,)
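If you want to try Cross Validation instead of a single split, one possible approach is to wrap the vectorizer and the model in a pipeline and hand it to `cross_val_score`. This is only a sketch under assumptions: the toy messages and labels below are made up for illustration, and in the notebook you would pass `df.text` and `df.label_num` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages; 1 = spam, 0 = ham.
texts = [
    "win a free prize now", "free entry call now",
    "claim your free cash prize", "urgent call to claim prize",
    "are you coming home tonight", "see you at lunch tomorrow",
    "ok i will call you later", "thanks for the ride home",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# With the vectorizer inside the pipeline, each fold learns its
# vocabulary from that fold's training portion only.
model = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(model, texts, labels, cv=2, scoring="accuracy")
print(scores, scores.mean())
```

Keeping the vectorizer inside the pipeline avoids leaking vocabulary from the held-out fold into training.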


In [10]:
X_train

2757                                                                                                                  Have a good trip. Watch out for . Remember when you get back we must decide about easter.
1526                                                                                                                                                                           Pls pls find out from aunt nike.
1856                                                                                                                                                         K.:)you are the only girl waiting in reception ah?
554                                                                  Ok. Every night take a warm bath drink a cup of milk and you'll see a work of magic. You still need to loose weight. Just so that you know
3622             That means from february to april i'll be getting a place to stay down there so i don't have to hustle back and forth during audition season as i have 

## Count Vectorizer

Today we're just going to let Scikit-Learn do our text cleaning and preprocessing for us.

Let's run our vectorizer on our text messages and take a peek at the tokenization of the vocabulary.

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

# Note: on scikit-learn >= 1.0, use vectorizer.get_feature_names_out() instead
print(vectorizer.get_feature_names()[300:325])

['150p16', '150pm', '150ppermesssubscription', '150ppm', '150ppmpobox10183bhamb64xe', '150ppmsg', '150pw', '151', '153', '15541', '16', '165', '1680', '169', '177', '18', '1843', '18p', '18yrs', '195', '1apple', '1b6a5ecef91ff9', '1cup', '1da', '1er']
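Tokens like `'150ppm'` show up because the default `token_pattern` (`(?u)\b\w\w+\b`) treats any run of two or more letters, digits, or underscores as a word. Here is a rough sketch of that step using only `re`. The `simple_tokenize` helper is a hypothetical name, and stop-word removal is a separate step not shown here.

```python
import re

# Default token_pattern used by scikit-learn's text vectorizers.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def simple_tokenize(text):
    """Lowercase the text and keep runs of 2+ word characters."""
    return TOKEN_PATTERN.findall(text.lower())

ham_tokens = simple_tokenize("Ok lar... Joking wif u oni...")
spam_tokens = simple_tokenize("Only 10p per minute. BT-national-rate.")
print(ham_tokens)
print(spam_tokens)
```

Notice that the single-character token `u` is dropped and that hyphens split `BT-national-rate` into three tokens.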


Now we'll complete the vectorization by running .transform() and then save the results to a dataframe for viewing.
You don't need to save it to a dataframe; most ML models work with just the 2D array output.

That's a lot of columns.

In [20]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(4457, 7443)


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585334,0125698789,02,0207,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
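As mentioned above, the dataframe is only for viewing. Here is a small sketch of skipping it and fitting a model straight on the vectorizer's output, which is a SciPy sparse matrix. The toy messages and labels are hypothetical stand-ins for `X_train` and `y_train`.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy messages; 1 = spam, 0 = ham.
train_texts = ["free prize call now", "win free cash",
               "see you at home", "call me later tonight"]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_texts)  # sparse matrix, no DataFrame

model = MultinomialNB().fit(train_counts, train_labels)
test_counts = vectorizer.transform(["free cash prize"])
prediction = model.predict(test_counts)
print(issparse(train_counts), prediction)
```

Staying sparse avoids storing all those zeros, which matters once the vocabulary runs to thousands of columns.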


We also need to vectorize our X_test data using the same vocabulary as the training dataset, so we'll just call .transform() on X_test (without refitting) to get X_test_vectorized.

In [21]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(1115, 7443)


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585334,0125698789,02,0207,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
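To see why the train and test matrices end up with the same 7443 columns, here is a stripped-down sketch of the fit/transform split in plain Python. The helper names and toy documents are hypothetical, and whitespace splitting stands in for the real tokenizer.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each word seen in the training docs to a column index."""
    words = sorted({word for doc in docs for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def transform(docs, vocabulary):
    """Count only the words that are already in the vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

# Hypothetical toy documents.
train_docs = ["free prize now", "see you now"]
test_docs = ["free free lunch"]  # "lunch" never appeared in training

vocabulary = fit_vocabulary(train_docs)
test_rows = transform(test_docs, vocabulary)
print(list(vocabulary))
print(test_rows)
```

Words the vocabulary has never seen (here, `lunch`) are silently ignored at transform time, so the column count never changes.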


Let's run some classification models and see what kind of accuracy we can get!

In [0]:
results = []

## Logistic Regression

In [70]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)



Now we'll evaluate both our training and testing accuracy. 

In [71]:
from sklearn.metrics import accuracy_score

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

columns = ['model', 'acc_train', 'acc_test', 'vect_type']

lr_result = {}
lr_result['model'] = 'Logistic Regression'
lr_result['acc_train'] = accuracy_score(y_train, train_predictions)
lr_result['acc_test'] = accuracy_score(y_test, test_predictions)
lr_result['vect_type'] = 'Count'

results.append(lr_result)

Train Accuracy: 0.9703836661431456
Test Accuracy: 0.9551569506726457
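Accuracy can look great on imbalanced data even when spam slips through, so it may be worth checking precision and recall for the spam class as well. Below is a hand-rolled sketch; the `precision_recall` helper and the labels and predictions are hypothetical, not results from the models in this notebook.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall

# Hypothetical labels and predictions: 8 ham (0), 2 spam (1).
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]  # one spam message missed

accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
precision, recall = precision_recall(y_true_toy, y_pred_toy)
print(accuracy, precision, recall)
```

In this toy case the accuracy is high even though half of the spam was missed, which is exactly what recall surfaces.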


## Multinomial Naive Bayes

In [72]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result = {}
result['model'] = 'Multinomial Naive Bayes'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Count'

results.append(result)

Train Accuracy: 0.982499439084586
Test Accuracy: 0.9659192825112107


In [63]:
results

[{'acc_test': 0.9551569506726457,
  'acc_train': 0.9703836661431456,
  'model': 'Logistic Regression',
  'vect_type': 'Count'},
 {'acc_test': 0.9659192825112107,
  'acc_train': 0.982499439084586,
  'model': 'Multinomial Naive Bayes',
  'vect_type': 'Count'}]

## Random Forest Classifier

In [73]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result = {}
result['model'] = 'Random Forest'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Count'

results.append(result)



Train Accuracy: 0.9975319721785955
Test Accuracy: 0.9659192825112107


In [65]:
#results = pd.DataFrame.from_records(results)
#results.head()

Unnamed: 0,acc_test,acc_train,model,vect_type
0,0.955157,0.970384,Logistic Regression,Count
1,0.965919,0.982499,Multinomial Naive Bayes,Count
2,0.965919,0.998654,Random Forest,Count


# Spam Filter - TF-IDF Vectorization Method

In [66]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
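The settings printed above (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`) correspond to weighting each raw count by `ln((1 + n) / (1 + df)) + 1` and then scaling each row to unit length. Here is a plain-Python sketch of that arithmetic. The `tfidf_rows` helper and the toy corpus are hypothetical, and whitespace splitting stands in for the real tokenizer.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2 row normalization."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(x * x for x in weights))
        rows.append([x / norm if norm else 0.0 for x in weights])
    return vocabulary, rows

# Hypothetical toy corpus; "now" appears in every document.
docs = ["free prize now", "call me now", "see you now"]
vocabulary, rows = tfidf_rows(docs)
first = dict(zip(vocabulary, rows[0]))
print({word: round(weight, 3) for word, weight in first.items() if weight})
```

A word that appears in every document (`now`) ends up with a lower weight than words unique to one document, which is the whole point of the idf term.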

## Vectorize training data

In [67]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(4457, 7443)


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585334,0125698789,02,0207,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Vectorize testing data

In [53]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(1115, 7443)


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585334,0125698789,02,0207,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Logistic Regression

In [74]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result = {}
result['model'] = 'Logistic'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Tfidf'

results.append(result)

Train Accuracy: 0.9703836661431456
Test Accuracy: 0.9551569506726457




## Multinomial Naive Bayes

In [75]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result = {}
result['model'] = 'Naive Bayes'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Tfidf'

results.append(result)

Train Accuracy: 0.982499439084586
Test Accuracy: 0.9659192825112107


## Random Forest Classifier

In [76]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result = {}
result['model'] = 'Random Forest'
result['acc_train'] = accuracy_score(y_train, train_predictions)
result['acc_test'] = accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Tfidf'

results.append(result)



Train Accuracy: 0.99798070450976
Test Accuracy: 0.9650224215246637


In [0]:
results = pd.DataFrame.from_records(results)

In [78]:
results.head()

Unnamed: 0,acc_test,acc_train,model,vect_type
0,0.955157,0.970384,Logistic Regression,Count
1,0.965919,0.982499,Multinomial Naive Bayes,Count
2,0.965919,0.997532,Random Forest,Count
3,0.955157,0.970384,Logistic,Tfidf
4,0.965919,0.982499,Naive Bayes,Tfidf


# Sentiment Analysis

## What is Sentiment Analysis?

The objective of sentiment analysis is to take a phrase and, based on its text, determine whether its sentiment is Positive, Neutral, or Negative.

Suppose that you wanted to use NLP to classify reviews for your company's products as either positive, neutral, or negative. Maybe you don't trust the star ratings left by the users and you want an additional measure of sentiment from each review - maybe you would use this as a feature generation technique for additional modeling, or to identify disgruntled customers and reach out to them to improve your customer service, etc. Sentiment Analysis has also been used heavily in stock market price estimation by trying to track the sentiment of the tweets of individuals after breaking news comes out about a company.

Does every word in each review contribute to its overall sentiment? Not really. Stop words for example don't really tell us much about the overall sentiment of the text, so just like we did before, we will discard them. 

## NLTK Movie Review Sentiment Analysis

In [80]:
!pip install -U nltk

import nltk
nltk.download('movie_reviews')
nltk.download('stopwords')
from nltk.corpus import movie_reviews
import random

Collecting nltk
  Downloading https://files.pythonhosted.org/packages/73/56/90178929712ce427ebad179f8dc46c8deef4e89d4c853092bee1efd57d05/nltk-3.4.1.zip (3.1MB)
    100% |████████████████████████████████| 3.1MB 9.2MB/s 
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/97/8a/10/d646015f33c525688e91986c4544c68019b19a473cb33d3b55
Successfully built nltk
Installing collected packages: nltk
  Found existing installation: nltk 3.2.5
    Uninstalling nltk-3.2.5:
      Successfully uninstalled nltk-3.2.5
Successfully installed nltk-3.4.1


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Check that we have movie reviews

In [81]:
# How many total reviews are there?
print("Total reviews:", len(movie_reviews.fileids()))

# Total positive reviews
print("Positive reviews:", len(movie_reviews.fileids('pos'))) 
 
# Total negative reviews
print("Negative reviews:", len(movie_reviews.fileids('neg')))

Total reviews: 2000
Positive reviews: 1000
Negative reviews: 1000


In [85]:
movie_reviews.fileids('pos')[:10]

['pos/cv000_29590.txt',
 'pos/cv001_18431.txt',
 'pos/cv002_15918.txt',
 'pos/cv003_11664.txt',
 'pos/cv004_11636.txt',
 'pos/cv005_29443.txt',
 'pos/cv006_15448.txt',
 'pos/cv007_4968.txt',
 'pos/cv008_29435.txt',
 'pos/cv009_29592.txt']

## Get Reviews and randomize

In [0]:
reviews = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

random.shuffle(reviews)
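Since `random.shuffle` gives a different order on every run, the first rows you see may not match someone else's. If you want repeatable output, one option is to shuffle with a seeded generator. This is a sketch only: `seeded_shuffle` is a hypothetical helper and the toy tuples stand in for the real `(words, category)` reviews.

```python
import random

def seeded_shuffle(items, seed=42):
    """Return a shuffled copy; the same seed always gives the same order."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Hypothetical stand-ins for the (words, category) review tuples.
toy_reviews = [(["good"], "pos"), (["bad"], "neg"),
               (["fine"], "pos"), (["dull"], "neg")]
first_run = seeded_shuffle(toy_reviews)
second_run = seeded_shuffle(toy_reviews)
print(first_run == second_run)
```

Using a private `random.Random` instance keeps the seed from affecting any other random calls in the notebook.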

## Understand the format of the data

In [91]:
# Print Review Text:
print(reviews[0][0])

# Print Review Sentiment:
print(reviews[0][1])

['i', 'came', 'to', 'an', 'epiphany', 'while', 'watching', 'the', 'bachelor', ',', 'an', 'innocuous', '-', 'enough', '-', 'on', '-', 'the', '-', 'surface', 'romantic', 'comedy', '.', 'it', "'", 's', 'not', 'the', 'sort', 'of', 'film', 'in', 'which', 'one', 'would', 'expect', 'to', 'achieve', 'any', 'moment', 'of', 'clarity', ',', 'but', 'there', 'it', 'was', 'nonetheless', '.', 'i', 'sat', 'there', 'watching', 'this', 'marshmallow', 'of', 'a', 'movie', 'unfold', 'when', 'suddenly', 'i', 'realized', 'what', 'is', 'so', 'ridiculously', 'wrong', 'with', 'the', 'entire', 'romantic', 'comedy', 'genre', 'circa', '1999', '.', 'in', 'a', 'word', ',', 'it', "'", 's', 'the', 'same', 'thing', 'that', "'", 's', 'wrong', 'with', 'so', 'many', 'movies', 'circa', '1999', ':', 'writing', '.', 'more', 'to', 'the', 'point', ',', 'it', "'", 's', 'the', 'refusal', 'to', 'acknowledge', 'that', 'characterizations', 'matter', 'when', 'you', "'", 're', 'telling', 'a', 'story', 'about', 'a', 'relationship', '.

## Add reviews to a dataframe for kicks

In [92]:
documents = []
sentiments = []

for review in reviews:
  
  # Add sentiment to list
  if review[1] == "pos":
    sentiments.append(1)
  else:
    sentiments.append(0)
  
  # Add text to list
  review_text = " ".join(review[0])
  documents.append(review_text)
  
df = pd.DataFrame({"text": documents, "sentiment": sentiments})
df.head()

Unnamed: 0,text,sentiment
0,"i came to an epiphany while watching the bachelor , an innocuous - enough - on - the - surface romantic comedy . it ' s not the sort of film in which one would expect to achieve any moment of clar...",0
1,"aspiring broadway composer robert ( aaron williams ) secretly carries a torch for his best friend , struggling actor marc ( michael shawn lucas ) . the problem is , marc only has eyes for "" perfec...",0
2,"there have been bad films in recent years : ' mr . magoo ' was by far the worst ever made , the spectacularly bad ' blue in the face ' , the horrible ' baby genuises ' and now ' i woke up early th...",0
3,"one of the biggest cliches of any serial killer film is also one of the most believable . you know , the one where the detective looks at a wall of pictures and other police information , and sudd...",1
4,"nearly every film tim burton has directed has been an homage to the horror genre -- "" frankenweenie , "" "" beetlejuice , "" "" batman , "" "" edward scissorhands , "" "" ed wood , "" "" mars attacks ! "" --...",0


## Train Test Split

In [0]:
X = df.text
y = df.sentiment

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Sentiment Analysis - CountVectorizer

## Generate vocabulary from train dataset

In [94]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)



## Generate Vectorizations

In [95]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(1600, 35995)


Unnamed: 0,00,000,0009f,007,00s,05,10,100,1000,10000,...,zucker,zuko,zukovsky,zundel,zurg,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [96]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(400, 35995)


Unnamed: 0,00,000,0009f,007,00s,05,10,100,1000,10000,...,zucker,zuko,zukovsky,zundel,zurg,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Logistic Regression

In [97]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 1.0
Test Accuracy: 0.845


## Multinomial Naive Bayes

In [98]:
MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.981875
Test Accuracy: 0.8025


## Random Forest Classifier

In [99]:
RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.98875
Test Accuracy: 0.6925


# Sentiment Analysis - TfidfVectorizer

## Vocabulary

In [107]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=2000, ngram_range=(1,2),
                             min_df = 5, max_df = .80,
                             stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)

{'scientist': 1520, 'dr': 497, 'bob': 194, 'created': 386, 'evil': 575, 'folks': 686, 'stop': 1680, 've': 1875, 'heard': 814, 'animals': 76, 'town': 1812, 'small': 1610, 'band': 145, 'particular': 1276, 'animal': 75, 'field': 651, 'team': 1757, 'creatures': 391, 'earth': 520, 'oh': 1243, 'yeah': 1988, 'military': 1145, 'way': 1922, 'running': 1489, 'short': 1576, 'time': 1795, 'sound': 1628, 'familiar': 618, 'plot': 1324, 'films': 667, 'release': 1437, 'stars': 1660, 'pulled': 1386, 'annoying': 81, 'played': 1317, 'order': 1257, 'help': 822, 'figure': 655, 'local': 1052, 'death': 426, 'recently': 1429, 'discovers': 479, 'character': 283, 'used': 1863, 'weapon': 1926, 'say': 1506, 'knew': 981, 'exactly': 578, 'getting': 740, 'going': 754, 'frame': 706, 'footage': 693, 'seen': 1542, 'led': 1024, 'believe': 168, 'good': 757, 'respect': 1449, 'disappointed': 474, 'level': 1031, 'bad': 138, 'effects': 533, 'awful': 133, 'script': 1529, 'contrived': 366, 'feature': 637, 'laughing': 1008, 'th
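This time the vectorizer is tuned: `ngram_range=(1,2)` adds two-word phrases, `min_df=5` drops terms found in fewer than 5 documents, `max_df=.80` drops terms found in more than 80% of documents, and `max_features=2000` keeps the most frequent survivors. Here is a rough plain-Python sketch of that filtering on a toy corpus. The helper names, the documents, and the smaller `min_df=2` are all assumptions for illustration, and stop-word removal is left out.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """All contiguous n-grams from n_min to n_max words, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def select_vocabulary(docs, min_df=2, max_df=0.8, max_features=None):
    """Keep terms whose document frequency is within [min_df, max_df * n_docs]."""
    doc_terms = [ngrams(doc.lower().split()) for doc in docs]
    doc_freq = Counter(term for terms in doc_terms for term in set(terms))
    total_freq = Counter(term for terms in doc_terms for term in terms)
    kept = [t for t, df in doc_freq.items() if min_df <= df <= max_df * len(docs)]
    # max_features keeps the most frequent surviving terms across the corpus.
    kept.sort(key=lambda t: (-total_freq[t], t))
    return sorted(kept[:max_features])

# Hypothetical toy reviews; "movie" appears in every one of them.
toy_docs = ["great movie great cast", "bad movie", "great movie bad script",
            "movie night", "great movie"]
full_vocab = select_vocabulary(toy_docs)
small_vocab = select_vocabulary(toy_docs, max_features=2)
print(full_vocab)
print(small_vocab)
```

The word `movie` is dropped for being too common, one-off words are dropped for being too rare, and the bigram `great movie` survives as its own feature.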

## Train

In [108]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(1600, 2000)


Unnamed: 0,000,10,100,13,15,1995,1996,1997,1998,1999,...,year old,years,years ago,years later,yes,york,young,young man,younger,zero
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.050565,0.0,0.0,0.0,0.0,0.0,...,0.04045,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.048029,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.041431,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Test

In [109]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(400, 2000)


Unnamed: 0,000,10,100,13,15,1995,1996,1997,1998,1999,...,year old,years,years ago,years later,yes,york,young,young man,younger,zero
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.071964,0.065068,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.043404,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.047101,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.026499,0.0,0.0,0.115906,0.0,0.028267,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.204767,0.0,0.0,0.0,0.0


## Logistic Regression

In [110]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.939375
Test Accuracy: 0.8475




## Multinomial Naive Bayes

In [111]:
MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.89
Test Accuracy: 0.8175


## Random Forest Classifier

In [112]:
RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.985625
Test Accuracy: 0.64




# Using NLTK to clean the data

## Importing the data fresh to avoid variable collisions

In [0]:
reviews = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

random.shuffle(reviews)

In [114]:
documents = []
sentiments = []

for review in reviews:
  
  # Add sentiment to list
  if review[1] == "pos":
    sentiments.append(1)
  else:
    sentiments.append(0)
  
  # Add text to list
  review_text = " ".join(review[0])
  documents.append(review_text)
  
df = pd.DataFrame({"text": documents, "sentiment": sentiments})
df.head()

Unnamed: 0,text,sentiment
0,"way of the gun is brimming with surprises , some good , most bad . one of the good ones is ryan phillippe ' s surprisingly halfway decent performance . after the actor gained much attention by pos...",0
1,"i want to correct what i wrote last year in my retrospective of david lean ' s war picture . i still think that "" the bridge on the river kwai "" doesn ' t deserve being the number 13 in the americ...",1
2,"ingredients : man with amnesia who wakes up wanted for murder , dark science fiction city controlled by alien beings with mental powers . synopsis : what if you woke up one day , and suspected you...",1
3,"note : some may consider portions of the following text to be spoilers . be forewarned . east meets west in mulan , the latest installment in disney ' s parade of annual animated feature films . a...",1
4,"as you should know , this summer has been less than memorable . with a total of 4 decent films , it ' s not a surprise that these big budget failures keep appearing . with that said , you can pret...",0


## Cleaning function to apply to each document

In [115]:
from nltk.corpus import stopwords
import string

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

df_nltk = pd.DataFrame()
df_nltk['text'] = df.text.apply(clean_doc)
df_nltk['sentiment'] = df.sentiment
df_nltk.head()

Unnamed: 0,text,sentiment
0,"[way, gun, brimming, surprises, good, bad, one, good, ones, ryan, phillippe, surprisingly, halfway, decent, performance, actor, gained, much, attention, posing, preening, teen, swill, like, know, ...",0
1,"[want, correct, wrote, last, year, retrospective, david, lean, war, picture, still, think, bridge, river, kwai, deserve, number, american, film, institute, list, greatest, american, movies, think,...",1
2,"[ingredients, man, amnesia, wakes, wanted, murder, dark, science, fiction, city, controlled, alien, beings, mental, powers, synopsis, woke, one, day, suspected, earth, instead, part, experiment, g...",1
3,"[note, may, consider, portions, following, text, spoilers, forewarned, east, meets, west, mulan, latest, installment, disney, parade, annual, animated, feature, films, odd, fusion, ancient, asian,...",1
4,"[know, summer, less, memorable, total, decent, films, surprise, big, budget, failures, keep, appearing, said, pretty, much, predict, opinion, warrior, film, based, michael, crichton, eaters, dead,...",0
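To make each stage of `clean_doc` visible, here is a sketch that runs the same steps on one short string and keeps the intermediate token lists. It uses a tiny made-up stop-word set (`TOY_STOP_WORDS`) in place of `stopwords.words('english')` so it runs without NLTK; `clean_doc_steps` is a hypothetical helper.

```python
import string

# Hypothetical mini stop-word list standing in for stopwords.words('english').
TOY_STOP_WORDS = {"it", "s", "not", "the", "of"}

def clean_doc_steps(doc, stop_words=TOY_STOP_WORDS):
    """Return the token list after each cleaning stage."""
    table = str.maketrans("", "", string.punctuation)
    split = doc.split()
    no_punct = [w.translate(table) for w in split]
    alpha = [w for w in no_punct if w.isalpha()]
    no_stop = [w for w in alpha if w not in stop_words]
    final = [w for w in no_stop if len(w) > 1]
    return {"split": split, "no_punct": no_punct, "alpha": alpha,
            "no_stop": no_stop, "final": final}

steps = clean_doc_steps("it ' s not the sort of film , circa 1999 .")
for name, tokens in steps.items():
    print(f"{name:>8}: {tokens}")
```

Punctuation-only tokens become empty strings and are removed by the `isalpha()` filter, along with numbers like `1999`. One thing to keep in mind: NLTK's English stop-word list also includes negations such as `not`, which can matter for sentiment.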


## Reformat reviews for sklearn

In [116]:
documents = []
for review in df_nltk.text:
  review = " ".join(review)
  documents.append(review)
  
sentiment = list(df_nltk.sentiment)
new_df = pd.DataFrame({'text': documents, 'sentiment': sentiment})
new_df.head()

Unnamed: 0,text,sentiment
0,way gun brimming surprises good bad one good ones ryan phillippe surprisingly halfway decent performance actor gained much attention posing preening teen swill like know last summer hinted bit gro...,0
1,want correct wrote last year retrospective david lean war picture still think bridge river kwai deserve number american film institute list greatest american movies think angry men witness prosecu...,1
2,ingredients man amnesia wakes wanted murder dark science fiction city controlled alien beings mental powers synopsis woke one day suspected earth instead part experiment giant space terrarium mani...,1
3,note may consider portions following text spoilers forewarned east meets west mulan latest installment disney parade annual animated feature films odd fusion ancient asian traditions disconcerting...,1
4,know summer less memorable total decent films surprise big budget failures keep appearing said pretty much predict opinion warrior film based michael crichton eaters dead ahmed ibn fahdlan banishe...,0
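The join step is there because the sklearn vectorizers tokenize raw strings by default. As a rough sketch of an alternative (not what this notebook does), a callable `analyzer` lets the vectorizer take the token lists as they are. The `token_lists` data below is a made-up stand-in for `df_nltk.text`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy token lists standing in for df_nltk.text
token_lists = [["good", "movie"], ["bad", "movie"]]

# Option 1 (what the notebook does): join tokens back into strings
docs = [" ".join(tokens) for tokens in token_lists]
vec_joined = TfidfVectorizer()
dtm_joined = vec_joined.fit_transform(docs)

# Option 2 (assumed alternative): a callable analyzer returns the tokens unchanged,
# so the vectorizer skips its own preprocessing and tokenization
vec_tokens = TfidfVectorizer(analyzer=lambda tokens: tokens)
dtm_tokens = vec_tokens.fit_transform(token_lists)

print(sorted(vec_joined.vocabulary_))
print(sorted(vec_tokens.vocabulary_))
```

Both routes should learn the same vocabulary on this toy input. Joining is the simpler choice when you also want the vectorizer's own options (such as `stop_words` or `ngram_range`) to apply.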


## Train Test Split

In [0]:
X = new_df.text
y = new_df.sentiment

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Vectorize the reviews

In [118]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)
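To make the weights less of a black box, here is a minimal pure-Python sketch of what `TfidfVectorizer` computes with its default settings: a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, multiplied by the raw term count, then L2-normalized per document. The toy corpus and the whitespace tokenizer are simplifications for illustration; they are not the movie reviews or the vectorizer's real tokenizer.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not the movie reviews used above
docs = ["good movie", "bad movie", "good good plot"]

tokenized = [doc.split() for doc in docs]  # simplification: split on whitespace
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(tok for doc in tokenized for tok in set(doc))

# Smoothed idf (scikit-learn default, smooth_idf=True)
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized tf-idf vector for one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

tfidf = [tfidf_row(doc) for doc in tokenized]

for term in vocab:
    print(term, round(idf[term], 4))
```

Terms that show up in fewer documents ("bad", "plot") get a larger idf than terms that show up in more of them ("good", "movie"), which is the whole point of the weighting.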




In [119]:
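# Despite the variable name, these are TF-IDF weights rather than raw counts
train_word_counts = vectorizer.transform(X_train)

# get_feature_names() was removed in newer scikit-learn releases; use get_feature_names_out()
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())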

print(X_train_vectorized.shape)
X_train_vectorized.head()

(1600, 35288)


Unnamed: 0,aa,aaa,aaaaaaaaah,aaaaaaaahhhh,aaaaaah,aaliyah,aalyah,aamir,aardman,aaron,...,zukovsky,zulu,zundel,zurg,zus,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [120]:
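# Transform only: the vocabulary and idf weights come from the training set
test_word_counts = vectorizer.transform(X_test)

# get_feature_names() was removed in newer scikit-learn releases; use get_feature_names_out()
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())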

print(X_test_vectorized.shape)
X_test_vectorized.head()

(400, 35288)


Unnamed: 0,aa,aaa,aaaaaaaaah,aaaaaaaahhhh,aaaaaah,aaliyah,aalyah,aamir,aardman,aaron,...,zukovsky,zulu,zundel,zurg,zus,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
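Notice that the train and test matrices have the same 35288 columns. That is because the vectorizer was fit on the training text only, and `transform` reuses that vocabulary. Here is a small sketch of the same behavior on made-up data (the `train_docs` and `test_docs` strings are hypothetical): words the vectorizer never saw during `fit` simply get no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for X_train and X_test
train_docs = ["great film great cast", "dull film weak plot"]
test_docs = ["great plot surprising twist"]

vec = TfidfVectorizer()
train_dtm = vec.fit_transform(train_docs)  # learns vocabulary and idf from train only
test_dtm = vec.transform(test_docs)        # reuses the training vocabulary

# Words in the test document that were never seen during fit
unseen = [w for w in test_docs[0].split() if w not in vec.vocabulary_]

print(train_dtm.shape, test_dtm.shape)
print(unseen)
```

Fitting on the test set as well would leak information about it into the features, so the fit/transform split here mirrors the train/test split.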


## Logistic Regression

In [121]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.981875
Test Accuracy: 0.855
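The cells above convert the sparse TF-IDF output into a dense DataFrame before fitting, which is handy for inspection but memory hungry at 1600 x 35288. As a hedged sketch of another way to wire this up, a scikit-learn pipeline keeps the vectorizer and classifier together and passes the sparse matrix straight through. The `texts` and `labels` below are toy placeholders; in the notebook you would fit on `X_train` and `y_train` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; substitute X_train / y_train from the split above
texts = ["loved film", "great acting", "terrible plot", "boring film"]
labels = [1, 1, 0, 0]

pipe = make_pipeline(
    TfidfVectorizer(),                    # text -> sparse TF-IDF matrix
    LogisticRegression(random_state=42),  # accepts sparse input directly
)

pipe.fit(texts, labels)
preds = pipe.predict(["great film"])
print(preds)
```

A pipeline like this also makes it harder to accidentally fit the vectorizer on test data, since `fit` and `predict` handle the vectorizing step for you.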


## Multinomial Naive Bayes

In [122]:
MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.973125
Test Accuracy: 0.8375


## Random Forest Classifier

In [123]:
RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.99125
Test Accuracy: 0.7275


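## XGBoost Classifier

In [125]:
# import xgboost as xgb
from xgboost.sklearn import XGBClassifier

clf = XGBClassifier(
    # hyperparameters
    n_jobs=-1,
    eval_metric='auc',  # set here; passing it to fit() is deprecated in newer xgboost releases
)

# The labels were never vectorized, so fit on y_train (y_train_vectorized is undefined)
clf.fit(X_train_vectorized, y_train)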