# Applications of Vectorization

# Spam Filter - Count Vectorization Method

In [1]:
import pandas as pd
import numpy as np

## Import the Data

Import the data and take a look at it.

In [2]:
url = "https://raw.githubusercontent.com/ryanleeallred/datasets/master/spam.csv"

df = pd.read_csv(url, encoding="ISO-8859-1")
print(df.shape)
df.head()

(5572, 5)


Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


## Tidy up initial DataFrame

- Change Pandas display options so that we can see more of the text
- Drop the unnamed columns. It's unclear why they are in the file, but we don't need them.
- Rename the v1 and v2 columns.

In [3]:
pd.set_option('display.max_colwidth', 200)
df = df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
df = df.rename(columns={"v1":"label", "v2":"text"})
df.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


You'll notice right off the bat that this text isn't as coherent as the job listings. We'll proceed as usual, though.

What is the ratio of Spam to Ham messages?

In [4]:
df.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64
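
Those counts also give a baseline to keep in mind: a model that always predicts ham is already right most of the time. A quick sketch of the arithmetic, using only the counts printed above (the always-ham model is a hypothetical reference point, not something this notebook trains):

```python
# Class counts copied from the value_counts() output above
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

spam_share = counts["spam"] / total
# Accuracy of a hypothetical model that always predicts the majority class
majority_baseline = max(counts.values()) / total

print(f"Spam share: {spam_share:.3f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Roughly 13% of the messages are spam, so any accuracy figure reported later is best read against a baseline of about 87%.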

## Categorical encoding on labels.

In [5]:
df['label_num'] = df.label.map({'ham': 0, 'spam': 1})

df.head()

Unnamed: 0,label,text,label_num
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives around here though",0


## Model Validation - Train Test Split (quick and dirty)
Since we're going to do some modeling, we need some form of model validation. For simplicity, let's just do a quick train_test_split for today. You can try out cross-validation on your assignment; for now I just want to get to a quick baseline.

In [6]:
from sklearn.model_selection import train_test_split

X = df.text
y = df.label_num

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Look at the sizes of our train and test datasets.

In [7]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4457,)
(1115,)
(4457,)
(1115,)
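
Because spam is the minority class, a plain random split can leave the train and test sets with slightly different spam rates. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. A minimal sketch on made-up labels (the toy `texts` and `labels` below are assumptions for illustration, not the spam data):

```python
from sklearn.model_selection import train_test_split

# Toy data: 80 "ham" (0) and 20 "spam" (1) placeholder messages
texts = [f"message {i}" for i in range(100)]
labels = [0] * 80 + [1] * 20

# stratify=labels keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Share of spam in the train split and in the test split
print(sum(y_tr) / len(y_tr), sum(y_te) / len(y_te))
```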


## Count Vectorizer

Today we're just going to let Scikit-Learn do our text cleaning and preprocessing for us.

Let's run our vectorizer on our text messages and take a peek at the tokenization of the vocabulary.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)



Now we'll complete the vectorization by running .transform() and then save the results to a dataframe for viewing.
You don't need to save it to a dataframe; most ML models can be fit on the .transform() output directly.

The result has a lot of columns: one for every word in the vocabulary.

In [9]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(4457, 7472)


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,0125698789,02,0207,...,ìï,û_,û_thanks,ûªm,ûªt,ûªve,ûï,ûïharry,ûò,ûówell
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
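
The `.transform()` output is a SciPy sparse matrix, and the scikit-learn classifiers used in this notebook accept it as is. Converting to a dense DataFrame is convenient for viewing but stores every zero: the training matrix above has 4457 × 7472 entries, which is roughly 266 MB if each is an 8-byte integer. A small sketch of skipping the dense step, on a made-up corpus (the `docs` and `labels` here are illustrative, not the spam data):

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up mini corpus for illustration only
docs = ["win free cash now", "free entry win prize",
        "see you at lunch", "ok see you later"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
counts = vec.fit_transform(docs)   # sparse matrix, no .toarray() needed

print(sparse.issparse(counts), counts.shape)

# The classifier is fit on the sparse matrix directly
model = MultinomialNB().fit(counts, labels)
print(model.predict(vec.transform(["free cash prize"])))
```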


We also need to vectorize our X_test data, but we must use the same vocabulary as the training dataset, so we'll just call .transform() on X_test (without refitting the vectorizer) to get X_test_vectorized.

In [10]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(1115, 7472)


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,0125698789,02,0207,...,ìï,û_,û_thanks,ûªm,ûªt,ûªve,ûï,ûïharry,ûò,ûówell
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
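
The reason for calling only `.transform()` on the test set can be seen in a hand-rolled sketch of a count vectorizer. This is a simplification, not scikit-learn's implementation: it splits on whitespace, and the helper names `fit_vocabulary` and `transform_counts` are made up for illustration. The vocabulary is learned from the training messages only, and test words outside it are ignored, so both matrices end up with the same columns.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct lowercase token to a column index (sorted order)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def transform_counts(docs, vocabulary):
    """Count vocabulary tokens per document; unseen tokens are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_docs = ["free entry win", "win win cash"]
test_docs = ["win a free holiday"]

vocab = fit_vocabulary(train_docs)          # learned from training data only
train_matrix = transform_counts(train_docs, vocab)
test_matrix = transform_counts(test_docs, vocab)

print(vocab)
print(train_matrix)
print(test_matrix)
```

In the test row, "a" and "holiday" simply vanish because they have no column. That is the behavior we want, since the model was never trained on those features.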


Let's run some classification models and see what kind of accuracy we can get!

## Logistic Regression

In [11]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)



Now we'll evaluate both our training and testing accuracy. 

In [12]:
from sklearn.metrics import accuracy_score

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9964101413506843
Test Accuracy: 0.9775784753363229


## Multinomial Naive Bayes

In [13]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9946152120260264
Test Accuracy: 0.9838565022421525


## Random Forest Classifier

In [14]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.9957370428539376
Test Accuracy: 0.967713004484305


# Spam Filter - TF-IDF Vectorization Method

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)
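
TF-IDF rescales raw counts so that words appearing in many documents count for less. As a rough sketch of the arithmetic behind `TfidfVectorizer` under its default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row), here is a pure-Python version on a made-up two-document corpus. The function name `tfidf_rows` is hypothetical, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2-normalized rows (simplified sketch)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows

vocab, rows = tfidf_rows(["free prize free", "free lunch"])
for row in rows:
    print(dict(zip(vocab, (round(w, 3) for w in row))))
```

In the second document "free" and "lunch" each occur once, yet "lunch" gets the larger weight because it appears in only one of the two documents.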



## Vectorize training data

In [16]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(4457, 7472)


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,0125698789,02,0207,...,ìï,û_,û_thanks,ûªm,ûªt,ûªve,ûï,ûïharry,ûò,ûówell
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.265494,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Vectorize testing data

In [17]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(1115, 7472)


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,0125698789,02,0207,...,ìï,û_,û_thanks,ûªm,ûªt,ûªve,ûï,ûïharry,ûò,ûówell
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Logistic Regression

In [18]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9694862014808167
Test Accuracy: 0.9524663677130045


## Multinomial Naive Bayes

In [19]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9838456360780794
Test Accuracy: 0.9668161434977578


## Random Forest Classifier

In [20]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.9975319721785955
Test Accuracy: 0.9766816143497757


# Sentiment Analysis

## What is Sentiment Analysis?

The objective of sentiment analysis is to take a phrase and, based on its text, determine whether its sentiment is positive, neutral, or negative.

Suppose you wanted to use NLP to classify reviews of your company's products as positive, neutral, or negative. Maybe you don't trust the star ratings left by users and want an additional measure of sentiment from each review. You might use it as a generated feature for additional modeling, or to identify disgruntled customers and reach out to them to improve your customer service. Sentiment analysis has also been used heavily in stock market price estimation, by tracking the sentiment of tweets posted after news breaks about a company.

Does every word in each review contribute to its overall sentiment? Not really. Stop words, for example, tell us little about the overall sentiment of the text, so just like before, we will discard them.

## NLTK Movie Review Sentiment Analysis

In [21]:
!pip install -U nltk

import nltk
nltk.download('movie_reviews')
nltk.download('stopwords')
from nltk.corpus import movie_reviews
import random

Requirement already up-to-date: nltk in c:\users\cwcol\anaconda3\lib\site-packages (3.4)


[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\cwcol\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\movie_reviews.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\cwcol\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Check that we have movie reviews

In [22]:
# How many total reviews are there?
print("Total reviews:", len(movie_reviews.fileids()))

# Total positive reviews
print("Positive reviews:", len(movie_reviews.fileids('pos'))) 
 
# Total negative reviews
print("Negative reviews:", len(movie_reviews.fileids('neg')))

Total reviews: 2000
Positive reviews: 1000
Negative reviews: 1000


## Get Reviews and randomize

In [23]:
reviews = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

random.shuffle(reviews)
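
Note that `random.shuffle` reorders the list differently on every run unless the generator is seeded, so the reviews displayed and the accuracies reported in this section may not reproduce exactly. A small sketch of one way to make the order repeatable, using a dedicated `random.Random` instance (the seed value 42 and the placeholder list are arbitrary choices for illustration):

```python
import random

# Placeholder standing in for the list of (words, category) tuples
items = [(f"review_{i}", "pos" if i % 2 else "neg") for i in range(10)]

first = list(items)
random.Random(42).shuffle(first)    # seeded generator, shuffles in place

second = list(items)
random.Random(42).shuffle(second)   # same seed gives the same order

print(first == second)
print(first[:3])
```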

## Understand the format of the data

In [24]:
# Print Review Text:
print(reviews[0][0])

# Print Review Sentiment:
print(reviews[0][1])

# Print Review Text:
print(reviews[1][0])

# Print Review Sentiment:
print(reviews[1][1])

pos
['alchemy', 'is', 'steeped', 'in', 'shades', 'of', 'blue', '.', 'kieslowski', "'", 's', 'blue', ',', 'that', 'is', '.', 'with', 'its', 'examination', 'of', 'death', ',', 'isolation', ',', 'character', 'restoration', ',', 'and', 'recovery', 'from', 'loss', ',', 'suzanne', 'myers', "'", 'new', 'independent', 'film', 'echoes', 'the', 'polish', 'director', "'", 's', 'internationally', '-', 'acclaimed', '1993', 'release', '.', 'language', 'aside', ',', 'the', 'principal', 'difference', 'between', 'the', 'films', 'is', 'that', ',', 'while', 'kieslowski', 'took', 'great', 'pains', 'to', 'draw', 'us', 'into', 'the', 'main', 'character', "'", 's', 'world', ',', 'alchemy', 'keeps', 'its', 'viewers', 'at', 'arm', "'", 's', 'length', '.', 'as', 'a', 'result', ',', 'while', 'we', "'", 're', 'able', 'to', 'appreciate', 'the', 'film', "'", 's', 'intellectual', 'tapestry', ',', 'it', 'is', 'emotionally', 'distant', '.', 'alchemy', 'is', 'divided', 'into', 'three', 'chapters', ':', '"', 'charity', 

## Add reviews to a dataframe for kicks

In [25]:
documents = []
sentiments = []

for review in reviews:
  
  # Add sentiment to list
  if review[1] == "pos":
    sentiments.append(1)
  else:
    sentiments.append(0)
  
  # Add text to list
  review_text = " ".join(review[0])
  documents.append(review_text)
  
df = pd.DataFrame({"text": documents, "sentiment": sentiments})
df.head()

Unnamed: 0,text,sentiment
0,"everyone ' s heard about this movie , and more specifically , * the * scene . everyone ' s heard the famous barnyard animal quote squealed ( no pun intended ) over and over . and everyone ' s got ...",1
1,"alchemy is steeped in shades of blue . kieslowski ' s blue , that is . with its examination of death , isolation , character restoration , and recovery from loss , suzanne myers ' new independent ...",1
2,"at one point in this movie there is a staging of an opera that goes completely wrong . but one member of the crowd stands up and cheers , thinking the performance was planned , and applauding it f...",0
3,"robocop is an intelligent science fiction thriller and social satire , one with class and style . the film , set in old detroit in the year 1991 , stars peter weller as murphy , a lieutenant on th...",1
4,"plot : token director alan smithee steals the only copy of his film "" trio "" from the studio , after they complete the "" final cut "" without him . he threatens to burn the film reel if they do not...",0


## Train Test Split

In [26]:
X = df.text
y = df.sentiment

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Sentiment Analysis - CountVectorizer

## Generate vocabulary from train dataset

In [27]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)



## Generate Vectorizations

In [28]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(1600, 35953)


Unnamed: 0,00,000,007,00s,03,04,05,05425,10,100,...,zuehlke,zuko,zukovsky,zulu,zundel,zurg,zweibel,zwick,zwigoff,zycie
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(400, 35953)


Unnamed: 0,00,000,007,00s,03,04,05,05425,10,100,...,zuehlke,zuko,zukovsky,zulu,zundel,zurg,zweibel,zwick,zwigoff,zycie
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,10,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,4,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Logistic Regression

In [30]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 1.0
Test Accuracy: 0.8525


## Multinomial Naive Bayes

In [31]:
MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.97875
Test Accuracy: 0.825


## Random Forest Classifier

In [32]:
RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.993125
Test Accuracy: 0.7125


# Sentiment Analysis - TfidfVectorizer

## Vocabulary

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)



## Train

In [34]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(1600, 35953)


Unnamed: 0,00,000,007,00s,03,04,05,05425,10,100,...,zuehlke,zuko,zukovsky,zulu,zundel,zurg,zweibel,zwick,zwigoff,zycie
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.074541,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019569,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Test

In [35]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(400, 35953)


Unnamed: 0,00,000,007,00s,03,04,05,05425,10,100,...,zuehlke,zuko,zukovsky,zulu,zundel,zurg,zweibel,zwick,zwigoff,zycie
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.381541,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.12975,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Logistic Regression

In [36]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.979375
Test Accuracy: 0.8575


## Multinomial Naive Bayes

In [37]:
MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.974375
Test Accuracy: 0.835


## Random Forest Classifier

In [38]:
RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.99125
Test Accuracy: 0.695


# Using NLTK to clean the data

## Importing the data fresh to avoid variable collisions

In [39]:
reviews = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

random.shuffle(reviews)

In [40]:
documents = []
sentiments = []

for review in reviews:
  
  # Add sentiment to list
  if review[1] == "pos":
    sentiments.append(1)
  else:
    sentiments.append(0)
  
  # Add text to list
  review_text = " ".join(review[0])
  documents.append(review_text)
  
df = pd.DataFrame({"text": documents, "sentiment": sentiments})
df.head()

Unnamed: 0,text,sentiment
0,"do the folks at disney have no common decency ? they have resurrected yet another cartoon and turned it into a live action hodgepodge of expensive special effects , embarrassing writing and kid - ...",0
1,"vegas vacation is the fourth film starring chevy chase and beverly d ' angelo as the heads of the hapless griswold family . as with the other three films , their two children , rusty and audrey , ...",0
2,starring shawnee smith ; donovan leitch ; ricky paull goldin ; kevin dillon & billy beck the blob is the remake of the 1960 ' s classic ( a term that i use very loosely to define the original ) ab...,0
3,""" a breed apart "" casts rutger hauer as a crazy , bird - loving recluse who picks his feathered friends over kathleen turner . a bit hard to swallow ? that ' s only the first of many improbabiliti...",1
4,we share the descent into darkness of a talented boy pianist . years later we see his subsequent resurfacing ; in the mid 80 ' s a damaged man walks out of a rainstorm and back into the world . th...,1


## Cleaning function to apply to each document

In [41]:
from nltk.corpus import stopwords
import string

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', string.punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if w not in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

df_nltk = pd.DataFrame()
df_nltk['text'] = df.text.apply(clean_doc)
df_nltk['sentiment'] = df.sentiment
df_nltk.head()

Unnamed: 0,text,sentiment
0,"[folks, disney, common, decency, resurrected, yet, another, cartoon, turned, live, action, hodgepodge, expensive, special, effects, embarrassing, writing, kid, friendly, slapstick, mr, magoo, enou...",0
1,"[vegas, vacation, fourth, film, starring, chevy, chase, beverly, angelo, heads, hapless, griswold, family, three, films, two, children, rusty, audrey, played, revolving, series, actors, time, etha...",0
2,"[starring, shawnee, smith, donovan, leitch, ricky, paull, goldin, kevin, dillon, billy, beck, blob, remake, classic, term, use, loosely, define, original, really, mean, glob, goop, takes, anything...",0
3,"[breed, apart, casts, rutger, hauer, crazy, bird, loving, recluse, picks, feathered, friends, kathleen, turner, bit, hard, swallow, first, many, improbabilities, film, hauer, stars, man, obsessed,...",1
4,"[share, descent, darkness, talented, boy, pianist, years, later, see, subsequent, resurfacing, mid, damaged, man, walks, rainstorm, back, world, movie, charts, causes, mental, breakdown, based, li...",1
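
To see what each step of `clean_doc` does, it can help to trace one short string. The sketch below repeats the same steps but swaps NLTK's stop word list for a tiny hard-coded set, `DEMO_STOP_WORDS`, so that it runs without downloading any corpora. Both that set and the sample sentence are assumptions made for illustration.

```python
import string

# Tiny stand-in for stopwords.words('english'); illustration only
DEMO_STOP_WORDS = {"the", "at", "have", "no", "do", "and", "it", "into"}

def clean_doc_demo(doc, stop_words=DEMO_STOP_WORDS):
    """Same steps as clean_doc, with an injectable stop word set."""
    tokens = doc.split()
    # Strip punctuation characters from every token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # Drop tokens that are empty or contain non-letters (e.g. "1960")
    tokens = [w for w in tokens if w.isalpha()]
    tokens = [w for w in tokens if w not in stop_words]
    # Drop single-character leftovers such as the "s" from "' s"
    return [w for w in tokens if len(w) > 1]

sample = "do the folks at disney have no common decency ? the 1960 ' s classic"
print(clean_doc_demo(sample))
```

Standalone punctuation tokens become empty strings after the translate step and are removed by the `isalpha()` filter. The year is dropped for the same reason, and the stray "s" is caught by the length filter.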


## Reformat reviews for sklearn

In [42]:
documents = []
for review in df_nltk.text:
  review = " ".join(review)
  documents.append(review)
  
sentiment = list(df_nltk.sentiment)
new_df = pd.DataFrame({'text': documents, 'sentiment': sentiment})
new_df.head()

Unnamed: 0,text,sentiment
0,folks disney common decency resurrected yet another cartoon turned live action hodgepodge expensive special effects embarrassing writing kid friendly slapstick mr magoo enough people obviously ins...,0
1,vegas vacation fourth film starring chevy chase beverly angelo heads hapless griswold family three films two children rusty audrey played revolving series actors time ethan embry marisol nichols f...,0
2,starring shawnee smith donovan leitch ricky paull goldin kevin dillon billy beck blob remake classic term use loosely define original really mean glob goop takes anything gets way original version...,0
3,breed apart casts rutger hauer crazy bird loving recluse picks feathered friends kathleen turner bit hard swallow first many improbabilities film hauer stars man obsessed keeping birds island safe...,1
4,share descent darkness talented boy pianist years later see subsequent resurfacing mid damaged man walks rainstorm back world movie charts causes mental breakdown based life story david helfgott a...,1


## Train Test Split

In [43]:
X = new_df.text
y = new_df.sentiment

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Vectorize the reviews

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)




In [45]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(1600, 35345)


Unnamed: 0,aa,aaa,aaaaaaaaah,aaaaaah,aaaahhhs,aahs,aaliyah,aalyah,aardman,aaron,...,zuko,zukovsky,zulu,zundel,zus,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [46]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(400, 35345)


Unnamed: 0,aa,aaa,aaaaaaaaah,aaaaaah,aaaahhhs,aahs,aaliyah,aalyah,aardman,aaron,...,zuko,zukovsky,zulu,zundel,zus,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Logistic Regression

In [47]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.981875
Test Accuracy: 0.8175


## Multinomial Naive Bayes

In [48]:
MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9725
Test Accuracy: 0.8075


## Random Forest Classifier

In [49]:
RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.995
Test Accuracy: 0.6775
