
# Applications of Vectorization

# Spam Filter - Count Vectorization Method

In [0]:
import pandas as pd
import numpy as np

## Import the Data

Import the data and take a look at it.

In [2]:
url = "https://raw.githubusercontent.com/ryanleeallred/datasets/master/spam.csv"

df = pd.read_csv(url, encoding="ISO-8859-1")
print(df.shape)
df.head()

(5572, 5)


Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


## Tidy up initial DataFrame

- Change Pandas display options so that we can see more of the text
- Drop the unnamed columns. I'm not sure why they're in there, but we don't need them.
- Rename the v1 and v2 columns.

In [3]:
pd.set_option('display.max_colwidth', 200)
df = df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
df = df.rename(columns={"v1":"label", "v2":"text"})
df.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


You'll notice right off the bat that this text isn't as coherent as the job listings. We'll proceed as normal, though.

What is the ratio of Spam to Ham messages?

In [7]:
df.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [8]:
df['label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: label, dtype: float64
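
With ham at roughly 87% of the messages, a model that always predicts "ham" already reaches about 0.866 accuracy, so that is the number any classifier below has to beat. A minimal sketch of that majority-class baseline, using only the counts printed above:

```python
# Majority-class baseline from the value_counts() output above.
ham_count = 4825
spam_count = 747

total = ham_count + spam_count
baseline_accuracy = ham_count / total  # accuracy of always predicting "ham"

print(f"Total messages: {total}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```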

## Categorical encoding on labels.

In [5]:
df['label_num'] = df.label.map({'ham': 0, 'spam': 1})

df.head()

Unnamed: 0,label,text,label_num
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives around here though",0


## Model Validation - Train Test Split (quick and dirty)
Since we're going to do some modeling, we're going to need some model validation. For simplicity, let's just do a quick train_test_split for today. You can try out cross-validation on your assignment; I just want to get to a quick baseline.

In [0]:
from sklearn.model_selection import train_test_split

X = df.text
y = df.label_num

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Look at the sizes of our train and test datasets.

In [9]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4457,)
(1115,)
(4457,)
(1115,)
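
These sizes follow from test_size=0.2 applied to 5,572 rows. The sketch below reproduces them under the assumption that the test count is rounded up and the training set gets the remainder, which matches the shapes printed above:

```python
import math

n_samples = 5572   # rows in the spam DataFrame
test_size = 0.2    # same fraction passed to train_test_split

# Assumed rounding rule: test count is rounded up, train gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```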


## Count Vectorizer

Today we're just going to let Scikit-Learn do our text cleaning and preprocessing for us.

Let's run our vectorizer on our text messages and take a peek at the vocabulary it builds.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)



Now we'll complete the vectorization by running .transform() and then save the results to a dataframe for viewing.
You don't need to save it to a dataframe; most scikit-learn models accept the sparse matrix that .transform() returns directly.

That's a lot of columns.

In [11]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(4457, 7472)


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,0125698789,02,0207,...,ìï,û_,û_thanks,ûªm,ûªt,ûªve,ûï,ûïharry,ûò,ûówell
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We also need to vectorize our X_test data, using the same vocabulary as the training dataset. So we don't refit the vectorizer; we just call .transform() on X_test to get X_test_vectorized.

In [12]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(1115, 7472)


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,0125698789,02,0207,...,ìï,û_,û_thanks,ûªm,ûªt,ûªve,ûï,ûïharry,ûò,ûówell
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
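
The fit-then-transform pattern used above can be sketched by hand. This is a simplified stand-in, not what CountVectorizer actually does internally: it splits on whitespace only, keeps every token, and uses made-up toy messages. The helper names `fit_vocabulary` and `transform` are hypothetical. The point is that the vocabulary comes from the training text alone, so words seen only in the test text get no column.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct lowercase token in docs to a column index."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count vocabulary tokens per document; unseen tokens are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_docs = ["free entry win free", "ok see you later"]  # toy data
test_docs = ["win a free prize"]                          # toy data

vocabulary = fit_vocabulary(train_docs)  # learned from training text only
train_matrix = transform(train_docs, vocabulary)
test_matrix = transform(test_docs, vocabulary)

print(list(vocabulary))
print(train_matrix)
print(test_matrix)
```

Both matrices end up with the same number of columns, and "a" and "prize" from the test message are simply dropped.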


Let's run some classification models and see what kind of accuracy we can get!

## Logistic Regression

In [16]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)



Now we'll evaluate both our training and testing accuracy. 

In [17]:
from sklearn.metrics import accuracy_score

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9964101413506843
Test Accuracy: 0.9775784753363229
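
Accuracy is easy to read, but on data this imbalanced it can flatter a model, so it is worth also looking at precision and recall for the spam class. Here is a minimal sketch of both, computed by hand on toy labels that are made up for illustration and are not outputs of this notebook. scikit-learn's `precision_score` and `recall_score` in `sklearn.metrics` compute the same quantities.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels (hypothetical): 8 ham (0) and 2 spam (1).
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
precision, recall = precision_recall(y_true_toy, y_pred_toy)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
```

In this toy case accuracy is 0.80 while only half of the spam is caught, which is the kind of gap accuracy alone hides.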


## Multinomial Naive Bayes

In [18]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9946152120260264
Test Accuracy: 0.9838565022421525


## Random Forest Classifier

In [19]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.9975319721785955
Test Accuracy: 0.9704035874439462


# Spam Filter - TF-IDF Vectorization Method

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)



## Vectorize training data

In [21]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(4457, 7472)


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,0125698789,02,0207,...,ìï,û_,û_thanks,ûªm,ûªt,ûªve,ûï,ûïharry,ûò,ûówell
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.265494,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
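
The fractional values in this table come from TF-IDF weighting rather than raw counts. The sketch below works through the arithmetic on toy documents, assuming the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which I believe matches TfidfVectorizer's defaults. Tokenization is whitespace-only here, so this is an illustration rather than a reimplementation, and `tfidf_rows` is a hypothetical helper name.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (assumed default formula).
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])  # scale the row to unit length
    return vocab, rows

toy_docs = ["free prize free", "see you later", "free later"]  # toy data
vocab, rows = tfidf_rows(toy_docs)

print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

Words that appear in fewer documents get a larger idf, and each row has unit length, so long and short messages end up on the same scale.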


## Vectorize testing data

In [0]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(1115, 7472)


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,0125698789,02,0207,...,ìï,û_,û_thanks,ûªm,ûªt,ûªve,ûï,ûïharry,ûò,ûówell
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Logistic Regression

In [22]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.9694862014808167
Test Accuracy: 0.9695067264573991


## Multinomial Naive Bayes

In [23]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9838456360780794
Test Accuracy: 0.97847533632287


## Random Forest Classifier

In [24]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.9986538030065066
Test Accuracy: 0.9713004484304932


# Sentiment Analysis

## What is Sentiment Analysis?

The objective of sentiment analysis is to take a phrase and, based on its text, determine whether its sentiment is Positive, Neutral, or Negative.

Suppose that you wanted to use NLP to classify reviews for your company's products as either positive, neutral, or negative. Maybe you don't trust the star ratings left by the users and you want an additional measure of sentiment from each review - maybe you would use this as a feature generation technique for additional modeling, or to identify disgruntled customers and reach out to them to improve your customer service, etc. Sentiment Analysis has also been used heavily in stock market price estimation by trying to track the sentiment of the tweets of individuals after breaking news comes out about a company.

Does every word in each review contribute to its overall sentiment? Not really. Stop words, for example, don't tell us much about the overall sentiment of the text, so just like we did before, we will discard them.

## NLTK Movie Review Sentiment Analysis

In [25]:
!pip install -U nltk

import nltk
nltk.download('movie_reviews')
nltk.download('stopwords')
from nltk.corpus import movie_reviews
import random

Collecting nltk
  Downloading https://files.pythonhosted.org/packages/73/56/90178929712ce427ebad179f8dc46c8deef4e89d4c853092bee1efd57d05/nltk-3.4.1.zip (3.1MB)
     |████████████████████████████████| 3.1MB 2.8MB/s
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/97/8a/10/d646015f33c525688e91986c4544c68019b19a473cb33d3b55
Successfully built nltk
Installing collected packages: nltk
  Found existing installation: nltk 3.2.5
    Uninstalling nltk-3.2.5:
      Successfully uninstalled nltk-3.2.5
Successfully installed nltk-3.4.1


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Check that we have movie reviews

In [26]:
# How many total reviews are there?
print("Total reviews:", len(movie_reviews.fileids()))

# Total positive reviews
print("Positive reviews:", len(movie_reviews.fileids('pos'))) 
 
# Total negative reviews
print("Negative reviews:", len(movie_reviews.fileids('neg')))

Total reviews: 2000
Positive reviews: 1000
Negative reviews: 1000


## Get Reviews and randomize

In [0]:
reviews = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

random.shuffle(reviews)
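
Note that random.shuffle is unseeded here, so the review order (and therefore the rows that land in the test set) changes from run to run even though train_test_split uses random_state=42. If reproducibility matters, one option is a seeded shuffle. This is a minimal sketch on toy reviews; the helper name `seeded_shuffle` and the seed value are assumptions for illustration.

```python
import random

# Toy stand-ins for (word list, category) review tuples.
toy_reviews = [
    (["good", "film"], "pos"),
    (["dull", "plot"], "neg"),
    (["great", "cast"], "pos"),
    (["weak", "ending"], "neg"),
]

def seeded_shuffle(items, seed=42):
    """Return a shuffled copy; the same seed always gives the same order."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled

first = seeded_shuffle(toy_reviews)
second = seeded_shuffle(toy_reviews)

print(first == second)  # True
```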

## Understand the format of the data

In [28]:
# Print Review Text:
print(reviews[0][0])

# Print Review Sentiment:
print(reviews[0][1])

# Print Review Text:
print(reviews[1][0])

# Print Review Sentiment:
print(reviews[1][1])

['"', 'ladybugs', '"', 'is', 'a', 'typical', 'comedy', 'that', 'relies', 'on', 'three', 'supposed', 'guarantees', ':', 'the', 'pathetic', 'team', 'who', 'beats', 'the', 'champs', ';', 'cross', 'dressing', ';', 'and', 'the', 'presence', 'of', 'rodney', 'dangerfield', '.', 'this', 'picture', 'doesn', "'", 't', 'play', 'like', 'a', 'comedy', 'for', 'children', ',', 'so', 'who', 'is', 'it', 'aimed', 'at', '?', 'and', 'why', 'is', 'it', 'told', 'like', 'a', '91', '-', 'minute', 'sit', '-', 'com', 'instead', 'of', 'a', 'feature', 'film', '?', 'rodney', 'dangerfield', 'stars', 'as', 'chester', 'lee', ',', 'a', 'total', 'schmuck', 'working', 'at', 'a', 'huge', 'corporation', '.', 'he', 'obviously', 'doesn', "'", 't', 'have', 'a', 'lot', 'of', 'self', 'esteem', 'and', 'thinks', 'he', 'has', 'to', 'kiss', 'up', 'to', 'get', 'ahead', ',', 'which', 'he', 'does', 'by', 'volunteering', 'to', 'coach', 'the', 'company', "'", 's', 'girls', "'", 'soccer', 'team', '.', 'what', 'a', 'shock', 'to', 'learn'

## Add reviews to a dataframe for kicks

In [29]:
documents = []
sentiments = []

for review in reviews:
  
  # Add sentiment to list
  if review[1] == "pos":
    sentiments.append(1)
  else:
    sentiments.append(0)
  
  # Add text to list
  review_text = " ".join(review[0])
  documents.append(review_text)
  
df = pd.DataFrame({"text": documents, "sentiment": sentiments})
df.head()

Unnamed: 0,text,sentiment
0,""" ladybugs "" is a typical comedy that relies on three supposed guarantees : the pathetic team who beats the champs ; cross dressing ; and the presence of rodney dangerfield . this picture doesn ' ...",0
1,"( note : there are spoilers regarding the film ' s climax ; the election , of course ) we see matthew broderick , a man torn to a primal state ; he ' s been unfaithful to his wife , lied to and ma...",1
2,i want to be involved in show business one day . and i refuse to do any sequels to any movie i may make because i believe they only get worse . this movie proves it for me . i was a little worried...,0
3,there ' s good news and bad news about mulan . the positive is that disney has found a happy medium between the heavy - handedness of pocahontas and the hunchback of notre dame and the childishnes...,1
4,"when respecting a director , you must also respect the fact that they are not perfect . woody allen has made a couple less - than good films , and he ' s my favorite . even martin scorsese hasn ' ...",0
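
The loop above can also be written as two list comprehensions. A minimal sketch on toy reviews in the same (word list, category) shape; the data is made up for illustration.

```python
# Toy stand-ins for the (word list, category) tuples built from movie_reviews.
toy_reviews = [
    (["a", "fine", "film"], "pos"),
    (["a", "dull", "film"], "neg"),
]

# Join each word list back into one string per review.
documents = [" ".join(words) for words, _ in toy_reviews]

# Encode "pos" as 1 and everything else as 0.
sentiments = [1 if category == "pos" else 0 for _, category in toy_reviews]

print(documents)
print(sentiments)
```

The two lists can then be passed to pd.DataFrame exactly as in the cell above.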


## Train Test Split

In [0]:
X = df.text
y = df.sentiment

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Sentiment Analysis - CountVectorizer

## Generate vocabulary from train dataset

In [31]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)



## Generate Vectorizations

In [32]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(1600, 36067)


Unnamed: 0,00,000,0009f,007,00s,03,04,05,05425,10,...,zsigmond,zucker,zuko,zukovsky,zundel,zurg,zweibel,zwick,zwigoff,zycie
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [0]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(400, 36212)


Unnamed: 0,00,000,0009f,007,03,04,05,05425,10,100,...,zuehlke,zuko,zukovsky,zundel,zurg,zus,zweibel,zwick,zwigoff,zycie
0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
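
Notice that the two shapes disagree: the training matrix has 36,067 columns and the test matrix printed here has 36,212. A model fit on one width cannot score the other, and scikit-learn raises a ValueError in that situation. A mismatch like this typically appears when the test text was vectorized with a vocabulary fitted separately from the training text. A minimal guard to catch it early, assuming plain (rows, columns) shape tuples; `check_same_features` is a hypothetical helper.

```python
def check_same_features(train_shape, test_shape):
    """Raise early if the two matrices do not share a column count."""
    n_train_features = train_shape[1]
    n_test_features = test_shape[1]
    if n_train_features != n_test_features:
        raise ValueError(
            f"Feature mismatch: train has {n_train_features} columns, "
            f"test has {n_test_features}"
        )

# Shapes copied from the two outputs above.
try:
    check_same_features((1600, 36067), (400, 36212))
    mismatch_message = None
except ValueError as err:
    mismatch_message = str(err)

print(mismatch_message)
```

Calling it with `X_train_vectorized.shape` and `X_test_vectorized.shape` before fitting would surface the problem before any model runs.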


## Logistic Regression

In [33]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



ValueError: ignored

## Multinomial Naive Bayes

In [0]:
MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

## Random Forest Classifier

In [0]:
RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

# Sentiment Analysis - TfidfVectorizer

## Vocabulary

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)



## Train

In [35]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(1600, 36067)


Unnamed: 0,00,000,0009f,007,00s,03,04,05,05425,10,...,zsigmond,zucker,zuko,zukovsky,zundel,zurg,zweibel,zwick,zwigoff,zycie
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.017781,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Test

In [36]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(400, 36067)


Unnamed: 0,00,000,0009f,007,00s,03,04,05,05425,10,...,zsigmond,zucker,zuko,zukovsky,zundel,zurg,zweibel,zwick,zwigoff,zycie
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Logistic Regression

In [37]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.980625
Test Accuracy: 0.8275


## Multinomial Naive Bayes

In [38]:
MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.96875
Test Accuracy: 0.8175


## Random Forest Classifier

In [39]:
RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.991875
Test Accuracy: 0.6625
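
All three models score far better on the training set than on the test set here, and the Random Forest most of all. A quick way to see that is to compute the train-test gap from the accuracies reported above:

```python
# (train accuracy, test accuracy) as printed in the three cells above.
reported = {
    "LogisticRegression": (0.980625, 0.8275),
    "MultinomialNB": (0.96875, 0.8175),
    "RandomForestClassifier": (0.991875, 0.6625),
}

# A large gap suggests the model is memorizing the training reviews.
gaps = {name: train - test for name, (train, test) in reported.items()}

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: gap = {gap:.4f}")
```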


# Using NLTK to clean the data

## Importing the data fresh to avoid variable collisions

In [0]:
reviews = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

random.shuffle(reviews)

In [41]:
documents = []
sentiments = []

for review in reviews:
  
  # Add sentiment to list
  if review[1] == "pos":
    sentiments.append(1)
  else:
    sentiments.append(0)
  
  # Add text to list
  review_text = " ".join(review[0])
  documents.append(review_text)
  
df = pd.DataFrame({"text": documents, "sentiment": sentiments})
df.head()

Unnamed: 0,text,sentiment
0,"i can see a decent sports movie struggling to break free of oliver stone ' s ` any given sunday ' . it ' s an entertaining movie that offers both insight and excitement into the rock - em , sock -...",0
1,"don ' t let this movie fool you into believing the romantic noirs of william shakespeare . no one will truly understand the heart and soul of this man except through his work , and this movie make...",0
2,"fantastically over hyped , godzila finally lumbers onto the big screen . the film opens with footage of nuclear testing on the french polynesian islands , then an attack on a boat from some beast ...",0
3,""" stuart little "" is one of the best family films to come out this year . it ' s a cute , funny and very good - natured film that has nothing for parents to squirm over except a few mild cusswords...",1
4,i read the new yorker magazine and i enjoy some of their really in - depth articles about some incident . they will take some incident like the investigation of a mysterious plane crash and tell y...,0


## Cleaning function to apply to each document

In [42]:
from nltk.corpus import stopwords
import string

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', string.punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if w not in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

df_nltk = pd.DataFrame()
df_nltk['text'] = df.text.apply(clean_doc)
df_nltk['sentiment'] = df.sentiment
df_nltk.head()

Unnamed: 0,text,sentiment
0,"[see, decent, sports, movie, struggling, break, free, oliver, stone, given, sunday, entertaining, movie, offers, insight, excitement, rock, em, sock, em, profession, pro, football, unfortunately, ...",0
1,"[let, movie, fool, believing, romantic, noirs, william, shakespeare, one, truly, understand, heart, soul, man, except, work, movie, makes, vain, attempt, moves, glamorise, life, hollywood, annoyin...",0
2,"[fantastically, hyped, godzila, finally, lumbers, onto, big, screen, film, opens, footage, nuclear, testing, french, polynesian, islands, attack, boat, beast, finally, join, dr, nick, tatopoulos, ...",0
3,"[stuart, little, one, best, family, films, come, year, cute, funny, good, natured, film, nothing, parents, squirm, except, mild, cusswords, though, read, book, long, time, ago, really, remember, k...",1
4,"[read, new, yorker, magazine, enjoy, really, depth, articles, incident, take, incident, like, investigation, mysterious, plane, crash, tell, happened, detail, becomes, real, education, agencies, g...",0
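
To see what each step of the cleaning function removes, here is the same sequence of steps traced on one short phrase. This sketch swaps NLTK's stop word list for a tiny hypothetical set (`TOY_STOP_WORDS`) so it runs without any downloads; the real list is much longer.

```python
import string

# Hypothetical stand-in for stopwords.words('english').
TOY_STOP_WORDS = {"don", "this", "you"}

def clean_doc_toy(doc, stop_words=TOY_STOP_WORDS):
    """Same steps as clean_doc above, with a toy stop word set."""
    tokens = doc.split()                                  # split on whitespace
    table = str.maketrans("", "", string.punctuation)
    tokens = [w.translate(table) for w in tokens]         # strip punctuation
    tokens = [w for w in tokens if w.isalpha()]           # drop empty/non-alphabetic
    tokens = [w for w in tokens if w not in stop_words]   # drop stop words
    tokens = [w for w in tokens if len(w) > 1]            # drop one-letter tokens
    return tokens

sample = "don ' t let this movie fool you"
cleaned = clean_doc_toy(sample)

print(cleaned)  # ['let', 'movie', 'fool']
```

The stray apostrophe becomes an empty string and is dropped by the isalpha() check, and the leftover "t" is removed by the length filter.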


## Reformat reviews for sklearn

In [43]:
documents = []
for review in df_nltk.text:
  review = " ".join(review)
  documents.append(review)
  
sentiment = list(df_nltk.sentiment)
new_df = pd.DataFrame({'text': documents, 'sentiment': sentiment})
new_df.head()

Unnamed: 0,text,sentiment
0,see decent sports movie struggling break free oliver stone given sunday entertaining movie offers insight excitement rock em sock em profession pro football unfortunately director seems one priori...,0
1,let movie fool believing romantic noirs william shakespeare one truly understand heart soul man except work movie makes vain attempt moves glamorise life hollywood annoying tendency subtract achie...,0
2,fantastically hyped godzila finally lumbers onto big screen film opens footage nuclear testing french polynesian islands attack boat beast finally join dr nick tatopoulos broderick looking years o...,0
3,stuart little one best family films come year cute funny good natured film nothing parents squirm except mild cusswords though read book long time ago really remember know film disappoint finally ...,1
4,read new yorker magazine enjoy really depth articles incident take incident like investigation mysterious plane crash tell happened detail becomes real education agencies get involved theories sug...,0


## Train Test Split

In [0]:
X = new_df.text
y = new_df.sentiment

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Vectorize the reviews

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)




In [46]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(1600, 35190)


Unnamed: 0,aa,aaa,aaaaaaaahhhh,aaaaaah,aahs,aaliyah,aalyah,aamir,aardman,aaron,...,zuko,zukovsky,zulu,zundel,zurg,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [47]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(400, 35190)


Unnamed: 0,aa,aaa,aaaaaaaahhhh,aaaaaah,aahs,aaliyah,aalyah,aamir,aardman,aaron,...,zuko,zukovsky,zulu,zundel,zurg,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.228345,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Logistic Regression

In [48]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.9825
Test Accuracy: 0.8375


## Multinomial Naive Bayes

In [49]:
MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.970625
Test Accuracy: 0.815


## Random Forest Classifier

In [50]:
RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.98875
Test Accuracy: 0.6575


In [51]:
# import xgboost as xgb
from xgboost.sklearn import XGBClassifier

clf = XGBClassifier(
        # hyperparameters
        n_jobs = -1,
        eval_metric = 'auc',
)

# Fit on the vectorized features; X_train holds raw text, which XGBoost cannot consume.
clf.fit(X_train_vectorized, y_train)