# Machine Learning and NLP Exercises #

## Introduction ##

We will use the same Kaggle review data set from Week 2 for this exercise. The product we'll focus on this time is a cappuccino cup. The goal this week is not only to preprocess the data, but also to classify reviews as positive or negative based on the review text.

The following code will help you load in the data.

In [148]:
import nltk
import pandas as pd
import numpy as np
import re
import string
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB

In [81]:
data = pd.read_csv('coffee.csv')
data.head()

Unnamed: 0,user_id,stars,reviews
0,A2XP9IN4JOMROD,1,I wanted to love this. I was even prepared for...
1,A2TS09JCXNV1VD,5,Grove Square Cappuccino Cups were excellent. T...
2,AJ3L5J7GN09SV,2,I bought the Grove Square hazelnut cappuccino ...
3,A3CZD34ZTUJME7,1,"I love my Keurig, and I love most of the Keuri..."
4,AWKN396SHAQGP,1,It's a powdered drink. No filter in k-cup.<br ...


## Question 1 ##

* Determine how many reviews there are in total.
* Determine the percent of 1, 2, 3, 4 and 5 star reviews.
* Create a new data set for modeling with the following columns:
     - Column 1: 'positive' if the star rating is 4 or 5, and 'negative' if it is 1 or 2 (3-star reviews are dropped)
     - Column 2: review text
* Take a look at the number of positive and negative reviews in the newly created data set.

Checkpoint: the resulting data set should have 514 reviews.
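
The steps above can be sketched on a small made-up frame. This is a minimal illustration, not the solution for `coffee.csv`: the `toy` DataFrame and its values are hypothetical, and only the column names `stars` and `reviews` come from the data shown earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mirroring the 'stars' and 'reviews' columns.
toy = pd.DataFrame({
    "stars": [5, 1, 3, 4, 2, 5],
    "reviews": ["great", "awful", "okay", "good", "bad", "love it"],
})

# Percent of reviews per star rating.
star_pct = toy["stars"].value_counts(normalize=True).mul(100).round(1)

# Drop neutral 3-star reviews, then label the rest.
# .copy() makes an independent frame so later assignments do not warn.
model_df = toy.loc[toy["stars"] != 3, ["stars", "reviews"]].copy()
model_df["sentiment"] = np.where(model_df["stars"] > 3, "positive", "negative")
model_df = model_df[["sentiment", "reviews"]]

# Class balance of the modeling data set.
sentiment_counts = model_df["sentiment"].value_counts()
print(star_pct)
print(sentiment_counts)
```

Checking the class balance at this point matters because a heavily skewed split makes accuracy alone a weak measure of model quality.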

Use the preprocessing code below to clean the reviews data before moving on to modeling.

In [82]:
len(data.reviews)

542

In [83]:
counts = data.stars.value_counts()
df = pd.DataFrame({'rating':counts.index, 'count':counts.values})
df

Unnamed: 0,rating,count
0,5,308
1,1,96
2,4,65
3,2,45
4,3,28


In [84]:
df['pct'] = (df['count']/df['count'].sum())*100
df

Unnamed: 0,rating,count,pct
0,5,308,56.826568
1,1,96,17.712177
2,4,65,11.99262
3,2,45,8.302583
4,3,28,5.166052


In [85]:
New_df= data[['stars','reviews']]
New_df=New_df.rename(columns={"stars":"sentiment"})
New_df.head(5)

Unnamed: 0,sentiment,reviews
0,1,I wanted to love this. I was even prepared for...
1,5,Grove Square Cappuccino Cups were excellent. T...
2,2,I bought the Grove Square hazelnut cappuccino ...
3,1,"I love my Keurig, and I love most of the Keuri..."
4,1,It's a powdered drink. No filter in k-cup.<br ...


In [1]:
New_df = New_df.loc[New_df.sentiment != 3].copy()  # drop 3-star reviews; .copy() avoids SettingWithCopyWarning below
len(New_df)

514

In [88]:
New_df["sentiment"] = np.where(New_df.sentiment > 3,"positive","negative")
New_df.head(5)



Unnamed: 0,sentiment,reviews
0,negative,I wanted to love this. I was even prepared for...
1,positive,Grove Square Cappuccino Cups were excellent. T...
2,negative,I bought the Grove Square hazelnut cappuccino ...
3,negative,"I love my Keurig, and I love most of the Keuri..."
4,negative,It's a powdered drink. No filter in k-cup.<br ...


In [89]:
# Text preprocessing steps - remove numbers, capital letters and punctuation


alphanumeric = lambda x: re.sub(r"""\w*\d\w*""", ' ', x)
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())

New_df['reviews'] = New_df.reviews.map(alphanumeric).map(punc_lower)
New_df.head()



Unnamed: 0,sentiment,reviews
0,negative,i wanted to love this i was even prepared for...
1,positive,grove square cappuccino cups were excellent t...
2,negative,i bought the grove square hazelnut cappuccino ...
3,negative,i love my keurig and i love most of the keuri...
4,negative,it s a powdered drink no filter in k cup br ...
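
To see what the two substitutions do to a single string, here is a small sketch using only the standard library. The `clean_text` helper and the sample sentence are hypothetical; the two regular expressions are the ones from the cell above, and the final whitespace collapse is an extra step that the original code does not perform.

```python
import re
import string


def clean_text(text):
    """Apply the same two regex steps as the preprocessing cell."""
    # Replace any token that contains a digit (e.g. "12oz") with a space.
    text = re.sub(r"\w*\d\w*", " ", text)
    # Lowercase, then replace every punctuation character with a space.
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text.lower())
    # Extra step (not in the original cell): collapse runs of whitespace.
    return " ".join(text.split())


cleaned = clean_text("It's a K-Cup, 12oz!<br />")
print(cleaned)  # -> it s a k cup br
```

Note that HTML remnants such as `<br />` survive as the token `br`, which is why `br` appears in the cleaned reviews.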


## Question 2 ##

Prepare the data for modeling:
* Split the data into training and test sets. You should have four sets of data - X_train, X_test, y_train, y_test

Create numerical features with Count Vectorizer. Create two document-term matrices:
* Matrix 1: Terms should be unigrams (single words), and values should be word counts (Hint: this is the Count Vectorizer default)
* Matrix 2: Terms should be unigrams and bigrams, and values should be binary values

Recommendation: Utilize Count Vectorizer's stop words function to remove stop words from the reviews text.
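
One way to build both matrices is sketched below, assuming scikit-learn is installed. The toy reviews and labels are hypothetical. Unlike the cells that follow, this sketch splits the raw text first and fits each vectorizer on the training text only, so the test set does not influence the vocabulary; that ordering is a design choice of the sketch, not something the exercise requires.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy reviews, not taken from coffee.csv.
texts = [
    "great cappuccino great price",
    "tastes awful would not buy again",
    "love this cappuccino",
    "awful powdered drink",
    "great flavor love it",
    "not good awful aftertaste",
    "love the price",
    "would not recommend",
]
labels = ["positive", "negative"] * 4

# Split the raw text; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Matrix 1: unigram word counts (the CountVectorizer default).
count_vec = CountVectorizer(stop_words="english")
X_train_counts = count_vec.fit_transform(X_train)
X_test_counts = count_vec.transform(X_test)

# Matrix 2: unigrams and bigrams with binary (0/1) values.
binary_vec = CountVectorizer(stop_words="english", ngram_range=(1, 2), binary=True)
X_train_binary = binary_vec.fit_transform(X_train)
X_test_binary = binary_vec.transform(X_test)

print(X_train_counts.shape, X_train_binary.shape)
```

Both matrices stay sparse here; calling `.toarray()` is only needed for inspection, since the scikit-learn models used later accept sparse input.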

In [107]:
cv = CountVectorizer(stop_words='english')
X = cv.fit_transform(New_df.reviews)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
Matrix1=pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
temp=pd.DataFrame(New_df.sentiment)
temp.columns= ["sentiment"]
Mat1=pd.concat([temp.reset_index(drop=True),Matrix1],axis=1)
Mat1.head(5)

Unnamed: 0,sentiment,able,abomination,absolute,absolutely,acceptable,accident,acidy,actual,actually,...,years,yes,yessiree,yesterday,york,yuck,yucky,yum,yummy,yup
0,negative,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,positive,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [126]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(Matrix1,Mat1.sentiment, test_size=0.25)
X_train1.head(5)

Unnamed: 0,able,abomination,absolute,absolutely,acceptable,accident,acidy,actual,actually,add,...,years,yes,yessiree,yesterday,york,yuck,yucky,yum,yummy,yup
287,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
341,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
478,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
310,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
480,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [128]:
# Unigrams and bigrams; binary=True records presence (1) or absence (0) instead of counts
cv = CountVectorizer(stop_words='english', ngram_range=(1, 2), binary=True)
X = cv.fit_transform(New_df.reviews)
Matrix2=pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
temp=pd.DataFrame(New_df.sentiment)
temp.columns= ["sentiment"]
Mat2=pd.concat([temp.reset_index(drop=True),Matrix2],axis=1)
Mat2.head(5)

Unnamed: 0,sentiment,able,able cappuccino,able drink,able finish,able longer,able make,able return,able switch,abomination,...,yummy perfect,yummy price,yummy real,yummy run,yummy strong,yummy suitable,yummy treat,yummy won,yup,yup exactly
0,negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,positive,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [129]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(Matrix2,Mat2.sentiment, test_size=0.25)
X_train2.head(5)

Unnamed: 0,able,able cappuccino,able drink,able finish,able longer,able make,able return,able switch,abomination,abomination bet,...,yummy perfect,yummy price,yummy real,yummy run,yummy strong,yummy suitable,yummy treat,yummy won,yup,yup exactly
170,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
257,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
384,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
418,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
53,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Question 3 ##

Use Logistic Regression to classify reviews as positive or negative. Do this for both matrices.
* Fit a Logistic Regression model on the training data
* Apply the model on the test data and calculate the following error metrics: accuracy, precision, recall, F1 score
* Optional: Visualize the confusion matrix for both models
* Compare the error metrics of the two matrices

Recommendation: Create a function to calculate the error metrics, since you'll be doing this multiple times.
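
A helper along these lines could serve as that function. This is a sketch under assumptions: the name `error_metrics`, the choice of `"positive"` as the class of interest, and the toy labels are all hypothetical; the metric functions themselves are from `sklearn.metrics`.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def error_metrics(y_true, y_pred, pos_label="positive"):
    """Return accuracy, precision, recall and F1 for one class of interest."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, pos_label=pos_label),
        "recall": recall_score(y_true, y_pred, pos_label=pos_label),
        "f1": f1_score(y_true, y_pred, pos_label=pos_label),
    }


# Hypothetical labels: 2 true positives, 1 false negative,
# 1 false positive, 1 true negative.
y_true = ["positive", "positive", "positive", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "positive", "negative"]

metrics = error_metrics(y_true, y_pred)
print(metrics)
```

Passing `pos_label="negative"` reports the same metrics for the minority class, which is often the more informative view when most reviews are positive.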

In [145]:
logreg = LogisticRegression()
logreg.fit(X_train1, y_train1)
y_pred1 = logreg.predict(X_test1)
print('Accuracy of logistic regression classifier on test set: {:.2f}\n'.format(logreg.score(X_test1, y_test1)))
confusion_matrix1 = confusion_matrix(y_test1, y_pred1)
error_metrics1=classification_report(y_test1, y_pred1)
print(classification_report(y_test1, y_pred1))

Accuracy of logistic regression classifier on test set: 0.85

             precision    recall  f1-score   support

   negative       0.83      0.57      0.68        35
   positive       0.86      0.96      0.90        94

avg / total       0.85      0.85      0.84       129



In [146]:
logreg = LogisticRegression()
logreg.fit(X_train2, y_train2)
y_pred2 = logreg.predict(X_test2)
print('Accuracy of logistic regression classifier on test set: {:.2f}\n'.format(logreg.score(X_test2, y_test2)))
confusion_matrix2 = confusion_matrix(y_test2, y_pred2)
error_metrics2=classification_report(y_test2, y_pred2)
print(classification_report(y_test2, y_pred2))

Accuracy of logistic regression classifier on test set: 0.83

             precision    recall  f1-score   support

   negative       0.62      0.54      0.58        28
   positive       0.88      0.91      0.89       101

avg / total       0.82      0.83      0.82       129



## Question 4 ##

Let's try another machine learning technique to classify these reviews as positive or negative. Go through the same exercise as in the previous step, except this time, use Naive Bayes instead of Logistic Regression.

For count data, use [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB). For binary data, use [Bernoulli Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB).

Compare the results of both the Logistic Regression and Naive Bayes models.
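
The difference between the two Naive Bayes variants can be sketched on a tiny hand-made matrix. The counts, column meanings, and labels below are hypothetical. Multinomial Naive Bayes models how often each term occurs, while Bernoulli Naive Bayes models only whether each term is present, and also uses the absence of a term as evidence.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical document-term counts; columns stand for "great", "awful", "love".
X_counts = np.array([
    [2, 0, 1],
    [0, 3, 0],
    [1, 0, 2],
    [0, 1, 0],
])
y = np.array(["positive", "negative", "positive", "negative"])

# Binary version of the same matrix: 1 if the term occurs at all.
X_binary = (X_counts > 0).astype(int)

mnb = MultinomialNB().fit(X_counts, y)   # for count features
bnb = BernoulliNB().fit(X_binary, y)     # for 0/1 features

# Two unseen documents: one uses "great" three times, one uses "awful" twice.
new_counts = np.array([[3, 0, 0], [0, 2, 0]])
mnb_pred = mnb.predict(new_counts)
bnb_pred = bnb.predict((new_counts > 0).astype(int))

print(mnb_pred, bnb_pred)
```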

In [151]:
clf = MultinomialNB()
clf.fit(X_train1,y_train1)
y_pred3 = clf.predict(X_test1)
print('Accuracy of Multinomial Naive Bayes classifier on test set: {:.2f}\n'.format(clf.score(X_test1, y_test1)))
confusion_matrix3 = confusion_matrix(y_test1, y_pred3)
error_metrics3=classification_report(y_test1, y_pred3)
print(classification_report(y_test1, y_pred3))

Accuracy of Multinomial Naive Bayes classifier on test set: 0.89

             precision    recall  f1-score   support

   negative       0.92      0.66      0.77        35
   positive       0.88      0.98      0.93        94

avg / total       0.89      0.89      0.89       129



In [155]:
blf = BernoulliNB()
blf.fit(X_train2,y_train2)
y_pred4 = blf.predict(X_test2)
print('Accuracy of Bernoulli Naive Bayes classifier on test set: {:.2f}\n'.format(blf.score(X_test2, y_test2)))
confusion_matrix4 = confusion_matrix(y_test2, y_pred4)
error_metrics4=classification_report(y_test2, y_pred4)
print(classification_report(y_test2, y_pred4))

Accuracy of Bernoulli Naive Bayes classifier on test set: 0.79

             precision    recall  f1-score   support

   negative       0.60      0.11      0.18        28
   positive       0.80      0.98      0.88       101

avg / total       0.76      0.79      0.73       129



## Question 5 ##

Up to this point, we've been using Count Vectorizer to create document-term matrices to input into the models. For at least one of the four models you've created so far, use TF-IDF Vectorizer instead of Count Vectorizer, and see if it improves the results.

Out of all of the models you've created, which model do you think best classifies positive and negative cappuccino cup reviews?
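
A TF-IDF version of the feature step might look like the sketch below, assuming scikit-learn is installed. The toy texts are hypothetical. `TfidfVectorizer` takes the same `stop_words` and `ngram_range` arguments as `CountVectorizer`, but its values are weighted floats rather than integer counts, and by default each row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reviews, not taken from coffee.csv.
train_texts = [
    "great cappuccino great price",
    "awful powdered drink",
    "love this cappuccino",
    "awful aftertaste",
]
test_texts = ["great drink", "awful cappuccino"]

# Fit the vocabulary and IDF weights on the training text only.
tfidf = TfidfVectorizer(stop_words="english")
X_train_tfidf = tfidf.fit_transform(train_texts)
X_test_tfidf = tfidf.transform(test_texts)

# With the default norm='l2', every non-empty row has Euclidean length 1.
row_norms = np.sqrt(X_train_tfidf.multiply(X_train_tfidf).sum(axis=1))

print(X_train_tfidf.shape, X_test_tfidf.shape)
```

The resulting matrices can be passed to `LogisticRegression` or `MultinomialNB` in place of the count matrices; whether that improves the metrics has to be checked on the actual data.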

In [156]:
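# NOTE: CountVectorizer is used here, so Matrix3 repeats Matrix 1 rather than applying TF-IDF.
# To answer Question 5 with TF-IDF weights, swap in TfidfVectorizer(stop_words='english').
cv = CountVectorizer(stop_words='english')
X = cv.fit_transform(New_df.reviews)
Matrix3=pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
temp=pd.DataFrame(New_df.sentiment)
temp.columns= ["sentiment"]
Mat3=pd.concat([temp.reset_index(drop=True),Matrix3],axis=1)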
Mat3.head(5)

Unnamed: 0,sentiment,able,abomination,absolute,absolutely,acceptable,accident,acidy,actual,actually,...,years,yes,yessiree,yesterday,york,yuck,yucky,yum,yummy,yup
0,negative,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,positive,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [157]:
X_train3, X_test3, y_train3, y_test3 = train_test_split(Matrix3,Mat3.sentiment, test_size=0.25)


In [158]:
logreg = LogisticRegression()
logreg.fit(X_train3, y_train3)
y_pred5 = logreg.predict(X_test3)
print('Accuracy of logistic regression classifier on test set: {:.2f}\n'.format(logreg.score(X_test3, y_test3)))
confusion_matrix5 = confusion_matrix(y_test3, y_pred5)
error_metrics5=classification_report(y_test3, y_pred5)
print(classification_report(y_test3, y_pred5))

Accuracy of logistic regression classifier on test set: 0.88

             precision    recall  f1-score   support

   negative       0.83      0.69      0.75        35
   positive       0.89      0.95      0.92        94

avg / total       0.87      0.88      0.87       129



In [159]:
clf = MultinomialNB()
clf.fit(X_train3,y_train3)
y_pred6 = clf.predict(X_test3)
print('Accuracy of Multinomial Naive Bayes classifier on test set: {:.2f}\n'.format(clf.score(X_test3, y_test3)))
confusion_matrix6 = confusion_matrix(y_test3, y_pred6)
error_metrics6=classification_report(y_test3, y_pred6)
print(classification_report(y_test3, y_pred6))

Accuracy of Multinomial Naive Bayes classifier on test set: 0.92

             precision    recall  f1-score   support

   negative       0.93      0.77      0.84        35
   positive       0.92      0.98      0.95        94

avg / total       0.92      0.92      0.92       129

