
# Applications of Vectorization

# Spam Filter - Count Vectorization Method

In [0]:
import pandas as pd
import numpy as np

## Import the Data

Import the data and take a look at it.

In [5]:
url = "https://raw.githubusercontent.com/sokjc/BayesNotBaes/master/sms.tsv"

df = pd.read_csv(url, sep='\t', header=None, names=['label', 'msg'])

pd.set_option('display.max_colwidth', 200)
df = df.rename(columns={"msg":"text"})
df.tail()

Unnamed: 0,label,text
5567,spam,"This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate."
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other suggestions?"
5570,ham,The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free
5571,ham,Rofl. Its true to its name


In [6]:
df['label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: label, dtype: float64

## Tidy up initial DataFrame

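- Change Pandas display options so that we can see more of the text (done above with `display.max_colwidth`).
- There are no unnamed columns to drop here, because we supplied the column names ourselves when reading the file.
- Rename the `msg` column to `text` (also done above), then add a numeric `label_num` column.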

In [7]:
df['label_num'] = df.label.map({'ham': 0, 'spam': 1})

df.head()

Unnamed: 0,label,text,label_num
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives around here though",0


You'll notice right off the bat that this text isn't as coherent as the job listings. We'll proceed like normal though.

What is the ratio of Spam to Ham messages?

In [0]:
df.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64
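To put a number on the question above, here is a minimal sketch that works from the counts just printed. The variable names are made up for illustration. It also shows the accuracy a do-nothing model would get by always answering "ham", which is a handy floor to compare the classifiers against.

```python
# Counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

# Spam-to-ham ratio: roughly one spam message for every 6.5 ham messages.
spam_to_ham = counts["spam"] / counts["ham"]

# Accuracy of a hypothetical model that always predicts the majority class.
baseline_accuracy = max(counts.values()) / total

print(f"spam:ham ratio    = {spam_to_ham:.3f}")
print(f"majority baseline = {baseline_accuracy:.3f}")
```

Any model that scores below the majority baseline on this dataset is not really learning anything useful about spam.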

## Categorical encoding on labels.

The label_num column created above encodes ham as 0 and spam as 1. It is the target we'll hand to the models.


## Model Validation - Train Test Split (quick and dirty)
Since we're going to do some modeling, we're going to need some model validation. For simplicity, let's just do a quick train_test_split for today. You can try out cross-validation on your assignment; for now I just want to get to a quick baseline.

In [0]:
from sklearn.model_selection import train_test_split

X = df.text
y = df.label_num

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Look at the sizes of our train and test datasets.

In [10]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4457,)
(1115,)
(4457,)
(1115,)
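For the cross-validation mentioned above, one possible shape is sketched below. It is not part of the lesson's workflow: the toy messages and labels are invented stand-ins for df.text and df.label_num, and the choice of Multinomial Naive Bayes with two folds is just to keep the example tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for df.text / df.label_num (hypothetical messages, 1 = spam).
texts = [
    "win a free prize now", "free entry call now", "claim your cash prize",
    "urgent free cash offer", "see you at home", "are you coming tonight",
    "call me when you land", "lunch at noon tomorrow",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Putting the vectorizer inside the pipeline means the vocabulary is
# re-learned on each training fold, so no held-out words leak in.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# Stratified folds keep the spam/ham proportions similar in every fold.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")

print(scores, scores.mean())
```

On the real data you would pass X and y in place of the toy lists and probably use five or more folds.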


## Count Vectorizer

Today we're just going to let Scikit-Learn do our text cleaning and preprocessing for us.

Let's run our vectorizer on our text messages and take a peek at the tokenization of the vocabulary.

In [69]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=5000, ngram_range=(1,2), stop_words='english')

vectorizer.fit(X_train)

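# get_feature_names() was removed in newer scikit-learn releases; use get_feature_names_out()
print(vectorizer.get_feature_names_out()[300:325].tolist())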


['azaria', 'ba', 'ba ku', 'babe', 'baby', 'bachelor', 'backdrop', 'background', 'bacon', 'bad', 'bad acting', 'bad dialogue', 'bad film', 'bad guy', 'bad guys', 'bad movie', 'bad thing', 'badly', 'bag', 'baker', 'balance', 'baldwin', 'ball', 'balls', 'band']


Now we'll complete the vectorization by running .transform() and then save the results to a dataframe for viewing.
You don't need to save it to a dataframe, you can use most ML models with just the 2D array output.

That's a lot of columns.

In [71]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(1600, 5000)


Unnamed: 0,abandon,abandoned,abilities,ability,able,aboard,absent,absolute,absolutely,absurd,...,young man,young woman,younger,youth,zane,zany,zero,zeta,zeta jones,zone
0,0,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
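As noted above, the DataFrame is only for viewing. Here is a small sketch of skipping it and fitting directly on the sparse matrix that the vectorizer returns. The four messages and their labels are invented for illustration.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy messages (1 = spam, 0 = ham).
texts = ["free prize now", "win free cash", "see you at home", "call me later"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)   # SciPy sparse matrix, not a DataFrame

print(issparse(counts), counts.shape)

# scikit-learn estimators such as LogisticRegression accept the sparse matrix as-is.
model = LogisticRegression().fit(counts, labels)
preds = model.predict(vectorizer.transform(["free prize"]))
print(preds)
```

With tens of thousands of columns, staying sparse avoids allocating a dense array that is almost entirely zeros.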


We also need to vectorize our X_test data using the same vocabulary as the training dataset, so we call only .transform() (not .fit()) on X_test to get X_test_vectorized.

In [72]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(400, 5000)


Unnamed: 0,abandon,abandoned,abilities,ability,able,aboard,absent,absolute,absolutely,absurd,...,young man,young woman,younger,youth,zane,zany,zero,zeta,zeta jones,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [0]:
results=[]

Let's run some classification models and see what kind of accuracy we can get!

## Logistic Regression

In [100]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)



Now we'll evaluate both our training and testing accuracy. 

In [101]:
from sklearn.metrics import accuracy_score

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result ={}
result['model']='Logistic Regression'
result['acc_train']= accuracy_score(y_train, train_predictions)
result['acc_test']= accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Count'
results.append(result)


Train Accuracy: 0.984375
Test Accuracy: 0.82


## Multinomial Naive Bayes

In [102]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result ={}
result['model']='Multinomial Naive Bayes'
result['acc_train']= accuracy_score(y_train, train_predictions)
result['acc_test']= accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Count'
results.append(result)

Train Accuracy: 0.974375
Test Accuracy: 0.805


## Random Forest Classifier

In [103]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result ={}
result['model']='Random Forest'
result['acc_train']= accuracy_score(y_train, train_predictions)
result['acc_test']= accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Count'
results.append(result)



Train Accuracy: 0.99
Test Accuracy: 0.6975


In [104]:
results = pd.DataFrame(results)
results

Unnamed: 0,acc_test,acc_train,model,vect_type
0,0.82,0.984375,Logistic Regression,Count
1,0.805,0.974375,Multinomial Naive Bayes,Count
2,0.6975,0.99,Random Forest,Count


# Spam Filter - TF-IDF Vectorization Method

In [105]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)



TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
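For intuition about what TfidfVectorizer does differently, here is a short sketch on made-up documents. With default settings it should give the same numbers as running CountVectorizer and then TfidfTransformer, and each row comes out scaled to unit length (the norm='l2' shown in the repr above).

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy documents.
docs = ["free prize now", "free cash prize", "see you at home"]

# One step: raw text -> TF-IDF weights.
tfidf_direct = TfidfVectorizer().fit_transform(docs)

# Two steps: raw text -> counts -> TF-IDF weights.
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

same = np.allclose(tfidf_direct.toarray(), tfidf_two_step.toarray())
row_norms = np.linalg.norm(tfidf_direct.toarray(), axis=1)

print(same)
print(row_norms)
```

Words that appear in many documents get a smaller weight than words concentrated in a few, which is the main difference from plain counts.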

## Vectorize training data

In [106]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(1600, 35417)


Unnamed: 0,aa,aaa,aaaaaaaaah,aaaaaah,aaaahhhs,aahs,aaliyah,aalyah,aamir,aardman,...,zukovsky,zulu,zundel,zurg,zus,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Vectorize testing data

In [107]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(400, 35417)


Unnamed: 0,aa,aaa,aaaaaaaaah,aaaaaah,aaaahhhs,aahs,aaliyah,aalyah,aamir,aardman,...,zukovsky,zulu,zundel,zurg,zus,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [0]:
results_2= []

## Logistic Regression

In [109]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result ={}
result['model']='Logistic Regression'
result['acc_train']= accuracy_score(y_train, train_predictions)
result['acc_test']= accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Tfidf'
results_2.append(result)



Train Accuracy: 0.984375
Test Accuracy: 0.82


## Multinomial Naive Bayes

In [110]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result ={}
result['model']='Multinomial Naive Bayes'
result['acc_train']= accuracy_score(y_train, train_predictions)
result['acc_test']= accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Tfidf'
results_2.append(result)

Train Accuracy: 0.974375
Test Accuracy: 0.805


## Random Forest Classifier

In [111]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

result ={}
result['model']='Random Forest'
result['acc_train']= accuracy_score(y_train, train_predictions)
result['acc_test']= accuracy_score(y_test, test_predictions)
result['vect_type'] = 'Tfidf'
results_2.append(result)



Train Accuracy: 0.98625
Test Accuracy: 0.6825


In [112]:
results_2 = pd.DataFrame(results_2)
results_2

Unnamed: 0,acc_test,acc_train,model,vect_type
0,0.82,0.984375,Logistic Regression,Tfidf
1,0.805,0.974375,Multinomial Naive Bayes,Tfidf
2,0.6825,0.98625,Random Forest,Tfidf


In [113]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two tables the same way
results = pd.concat([results, results_2])

results

Unnamed: 0,acc_test,acc_train,model,vect_type
0,0.82,0.984375,Logistic Regression,Count
1,0.805,0.974375,Multinomial Naive Bayes,Count
2,0.6975,0.99,Random Forest,Count
0,0.82,0.984375,Logistic Regression,Tfidf
1,0.805,0.974375,Multinomial Naive Bayes,Tfidf
2,0.6825,0.98625,Random Forest,Tfidf
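The stacked table is easier to compare side by side once it is pivoted. This sketch rebuilds the table from the test accuracies printed above so that it runs on its own. The name `comparison` is just for illustration.

```python
import pandas as pd

# Test accuracies copied from the combined results table above.
results_demo = pd.DataFrame({
    "model": ["Logistic Regression", "Multinomial Naive Bayes", "Random Forest"] * 2,
    "vect_type": ["Count"] * 3 + ["Tfidf"] * 3,
    "acc_test": [0.82, 0.805, 0.6975, 0.82, 0.805, 0.6825],
})

# One row per model, one column per vectorizer.
comparison = results_demo.pivot(index="model", columns="vect_type", values="acc_test")
print(comparison)
```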


# Sentiment Analysis

## What is Sentiment Analysis?

The objective of sentiment analysis is to take a phrase and determine, based on its text, whether its sentiment is positive, neutral, or negative.

Suppose that you wanted to use NLP to classify reviews for your company's products as either positive, neutral, or negative. Maybe you don't trust the star ratings left by the users and you want an additional measure of sentiment from each review - maybe you would use this as a feature generation technique for additional modeling, or to identify disgruntled customers and reach out to them to improve your customer service, etc. Sentiment Analysis has also been used heavily in stock market price estimation by trying to track the sentiment of the tweets of individuals after breaking news comes out about a company.

Does every word in each review contribute to its overall sentiment? Not really. Stop words, for example, don't tell us much about the overall sentiment of the text, so just like we did before, we will discard them.

## NLTK Movie Review Sentiment Analysis

In [24]:
!pip install -U nltk

import nltk
nltk.download('movie_reviews')
nltk.download('stopwords')
from nltk.corpus import movie_reviews
import random

Collecting nltk
  Downloading https://files.pythonhosted.org/packages/73/56/90178929712ce427ebad179f8dc46c8deef4e89d4c853092bee1efd57d05/nltk-3.4.1.zip (3.1MB)
     |████████████████████████████████| 3.1MB 2.7MB/s
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/97/8a/10/d646015f33c525688e91986c4544c68019b19a473cb33d3b55
Successfully built nltk
Installing collected packages: nltk
  Found existing installation: nltk 3.2.5
    Uninstalling nltk-3.2.5:
      Successfully uninstalled nltk-3.2.5
Successfully installed nltk-3.4.1


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Check that we have movie reviews

In [25]:
# How many total reviews are there?
print("Total reviews:", len(movie_reviews.fileids()))

# Total positive reviews
print("Positive reviews:", len(movie_reviews.fileids('pos'))) 
 
# Total negative reviews
print("Negative reviews:", len(movie_reviews.fileids('neg')))

Total reviews: 2000
Positive reviews: 1000
Negative reviews: 1000


## Get Reviews and randomize

In [0]:
reviews = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

random.shuffle(reviews)

## Understand the format of the data

In [27]:
# Print Review Text:
print(reviews[0][0])

# Print Review Sentiment:
print(reviews[0][1])

# Print Review Text:
print(reviews[1][0])

# Print Review Sentiment:
print(reviews[1][1])

['the', 'main', 'problem', 'with', 'martin', 'lawrence', "'", 's', 'pet', 'project', ',', 'a', 'thin', 'line', 'between', 'love', 'and', 'hate', ',', 'like', 'any', 'fatal', 'attraction', 'variation', 'where', 'the', 'protagonist', 'is', 'a', 'man', ',', 'is', 'that', 'his', 'character', 'is', 'an', 'irresponsible', 'jerk', ',', 'and', 'if', 'that', 'is', 'the', 'case', ',', 'it', 'doesn', "'", 't', 'seem', 'to', 'do', 'anything', 'except', 'justify', 'the', 'woman', "'", 's', 'actions', '.', 'that', 'is', 'especially', 'the', 'case', 'in', 'lawrence', "'", 's', 'darnell', 'wright', '.', 'he', 'is', 'one', 'of', 'those', 'macho', 'guys', 'with', 'women', 'lined', 'up', 'a', 'mile', 'long', '.', 'now', 'don', "'", 't', 'think', 'i', 'condone', 'this', 'just', 'because', 'i', "'", 'm', 'male', '.', 'my', 'philosophy', 'is', ',', 'if', 'you', 'are', 'one', 'of', 'the', 'few', 'heterosexual', 'males', 'lucky', 'enough', 'to', 'get', 'your', 'hands', 'on', 'a', 'beautiful', ',', 'kind', 'gi

## Add reviews to a dataframe for kicks

In [28]:
documents = []
sentiments = []

for review in reviews:
  
  # Add sentiment to list
  if review[1] == "pos":
    sentiments.append(1)
  else:
    sentiments.append(0)
  
  # Add text to list
  review_text = " ".join(review[0])
  documents.append(review_text)
  
df = pd.DataFrame({"text": documents, "sentiment": sentiments})
df.head()

Unnamed: 0,text,sentiment
0,"the main problem with martin lawrence ' s pet project , a thin line between love and hate , like any fatal attraction variation where the protagonist is a man , is that his character is an irrespo...",0
1,delicatessen ( directors : marc caro / jean - pierre jeunet ; screenwriters : gilles adrien / marc caro ; cinematographer : darius khondji ; editor : herve schneid ; cast : dominique pinon ( louis...,0
2,"dark city is such a rare treat : it ? s a stunning , hyperkinetic vision of a place where our reality is fused with noir , science fiction and the darkest nights in manhattan and london . to boot ...",1
3,"who knew that in 16 years eddie murphy , who made such a brash , raucous big - screen splash in _48_hrs . _ , would become . . . cuddly . the disconcerting trend begun in this summer ' s cutesy , ...",0
4,"at times , you ' d think edtv would be an entertaining film . i mean , who can resist the story of your average joe becomming a celebrity by having his life filmed every minute of every day ? but ...",0


## Train Test Split

In [0]:
X = df.text
y = df.sentiment

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Sentiment Analysis - CountVectorizer

## Generate vocabulary from train dataset

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)



## Generate Vectorizations

In [31]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(1600, 36207)


Unnamed: 0,00,000,0009f,007,05,10,100,1000,100m,101,...,zuehlke,zuko,zulu,zurg,zus,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [32]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(400, 36207)


Unnamed: 0,00,000,0009f,007,05,10,100,1000,100m,101,...,zuehlke,zuko,zulu,zurg,zus,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,5,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Logistic Regression

In [33]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 1.0
Test Accuracy: 0.845


## Multinomial Naive Bayes

In [34]:
MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.98
Test Accuracy: 0.785


## Random Forest Classifier

In [35]:
RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.995
Test Accuracy: 0.6825


# Sentiment Analysis - TfidfVectorizer

## Vocabulary

In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)



## Train

In [37]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(1600, 36207)


Unnamed: 0,00,000,0009f,007,05,10,100,1000,100m,101,...,zuehlke,zuko,zulu,zurg,zus,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.037989,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Test

In [38]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(400, 36207)


Unnamed: 0,00,000,0009f,007,05,10,100,1000,100m,101,...,zuehlke,zuko,zulu,zurg,zus,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.043386,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.027512,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.202223,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Logistic Regression

In [39]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.979375
Test Accuracy: 0.8025


## Multinomial Naive Bayes

In [40]:
MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.97375
Test Accuracy: 0.795


## Random Forest Classifier

In [41]:
RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.991875
Test Accuracy: 0.6675


# Using NLTK to clean the data

## Importing the data fresh to avoid variable collisions

In [0]:
reviews = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

random.shuffle(reviews)

In [43]:
documents = []
sentiments = []

for review in reviews:
  
  # Add sentiment to list
  if review[1] == "pos":
    sentiments.append(1)
  else:
    sentiments.append(0)
  
  # Add text to list
  review_text = " ".join(review[0])
  documents.append(review_text)
  
df = pd.DataFrame({"text": documents, "sentiment": sentiments})
df.head()

Unnamed: 0,text,sentiment
0,"` strange days ' chronicles the last two days of 1999 in los angeles . as the locals gear up for the new millenium , lenny nero ( ralph fiennes ) goes about his business of peddling erotic memory ...",1
1,"* this review contains spoilers * as with most of her films , director amy heckerling ' s latest , loser , seesaws between unpleasant and artificial , and is sometimes both at once . when she tack...",0
2,"toward the bottom of the ' 80s action movie barrel lies action jackson , the only movie in hollywood history to show sharon stone and vanity topless within a span of ten minutes . this carl "" apol...",0
3,"with his successful books and movies , michael crichton is doing well . with early successes with westworld ( 1973 ) and coma ( 1978 ) , and recent films such as jurassic park ( 1993 ) , his films...",0
4,""" nothing more than a high budget masturbation fantasy "" showgirls ( nc - 17 ) - contains graphic nudity , profanity , sexual situations and violence . some people , however , keep their clothes o...",0


## Cleaning function to apply to each document

In [44]:
from nltk.corpus import stopwords
import string

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', string.punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

df_nltk = pd.DataFrame()
df_nltk['text'] = df.text.apply(clean_doc)
df_nltk['sentiment'] = df.sentiment
df_nltk.head()

Unnamed: 0,text,sentiment
0,"[strange, days, chronicles, last, two, days, los, angeles, locals, gear, new, millenium, lenny, nero, ralph, fiennes, goes, business, peddling, erotic, memory, clips, pines, ex, girlfriend, faith,...",1
1,"[review, contains, spoilers, films, director, amy, heckerling, latest, loser, seesaws, unpleasant, artificial, sometimes, tackles, big, issues, abortion, fast, times, ridgemont, high, impossible, ...",0
2,"[toward, bottom, action, movie, barrel, lies, action, jackson, movie, hollywood, history, show, sharon, stone, vanity, topless, within, span, ten, minutes, carl, apollo, creed, weathers, vehicle, ...",0
3,"[successful, books, movies, michael, crichton, well, early, successes, westworld, coma, recent, films, jurassic, park, films, entertaining, however, seems, taken, wrong, turn, somewhere, sphere, m...",0
4,"[nothing, high, budget, masturbation, fantasy, showgirls, nc, contains, graphic, nudity, profanity, sexual, situations, violence, people, however, keep, clothes, watch, porn, films, intellectual, ...",0
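To see what the punctuation and length filters in clean_doc do to this corpus, here is a standalone sketch of just those steps on an invented sentence written in the corpus's pre-tokenized style. It leaves out the NLTK stop word filter so that it runs without any downloads.

```python
import string

# Hypothetical sentence, already split the way movie_reviews tokens look.
doc = "the film , isn ' t bad -- really !"

tokens = doc.split()

# str.maketrans('', '', chars) builds a table that deletes every char in `chars`.
table = str.maketrans('', '', string.punctuation)
tokens = [w.translate(table) for w in tokens]

# Pure-punctuation tokens are now empty strings, which isalpha() rejects.
tokens = [w for w in tokens if w.isalpha()]

# Drop single-character leftovers such as the "t" from "isn ' t".
tokens = [w for w in tokens if len(w) > 1]

print(tokens)  # ['the', 'film', 'isn', 'bad', 'really']
```

Notice that the contraction survives only as the fragment "isn", so the negation is effectively lost. That is a known limitation of this kind of cleaning.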


## Reformat reviews for sklearn

In [45]:
documents = []
for review in df_nltk.text:
  review = " ".join(review)
  documents.append(review)
  
sentiment = list(df_nltk.sentiment)
new_df = pd.DataFrame({'text': documents, 'sentiment': sentiment})
new_df.head()

Unnamed: 0,text,sentiment
0,strange days chronicles last two days los angeles locals gear new millenium lenny nero ralph fiennes goes business peddling erotic memory clips pines ex girlfriend faith juliette lewis noticing an...,1
1,review contains spoilers films director amy heckerling latest loser seesaws unpleasant artificial sometimes tackles big issues abortion fast times ridgemont high impossible tell whether matter fac...,0
2,toward bottom action movie barrel lies action jackson movie hollywood history show sharon stone vanity topless within span ten minutes carl apollo creed weathers vehicle features traditional cop v...,0
3,successful books movies michael crichton well early successes westworld coma recent films jurassic park films entertaining however seems taken wrong turn somewhere sphere million mess good directo...,0
4,nothing high budget masturbation fantasy showgirls nc contains graphic nudity profanity sexual situations violence people however keep clothes watch porn films intellectual values write reviews re...,0


## Train Test Split

In [0]:
X = new_df.text
y = new_df.sentiment

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Vectorize the reviews

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)




In [48]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(1600, 35417)


Unnamed: 0,aa,aaa,aaaaaaaaah,aaaaaah,aaaahhhs,aahs,aaliyah,aalyah,aamir,aardman,...,zukovsky,zulu,zundel,zurg,zus,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [49]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(400, 35417)


Unnamed: 0,aa,aaa,aaaaaaaaah,aaaaaah,aaaahhhs,aahs,aaliyah,aalyah,aamir,aardman,...,zukovsky,zulu,zundel,zurg,zus,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Logistic Regression

In [50]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.984375
Test Accuracy: 0.82


## Multinomial Naive Bayes

In [51]:
MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.974375
Test Accuracy: 0.805


## Random Forest Classifier

In [52]:
RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.990625
Test Accuracy: 0.6425


In [53]:
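# import xgboost as xgb
from xgboost.sklearn import XGBClassifier

clf = XGBClassifier(
        #hyper params
        n_jobs = -1,
        # newer xgboost releases take eval_metric here rather than in fit()
        eval_metric = 'auc',
)

# Fit on the vectorized features; passing the raw text Series (X_train) raises an error
clf.fit(X_train_vectorized, y_train)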