# Machine Learning and NLP Exercises #

## Introduction ##

We will use the same Kaggle review data set from Week 2 for this exercise. The product we'll focus on this time is a cappuccino cup. The goal this week is not only to preprocess the data, but also to classify reviews as positive or negative based on the review text.

The following code will help you load in the data.

In [68]:
import numpy as np
import pandas as pd
import nltk
import re
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB,BernoulliNB

In [83]:
data = pd.read_csv('coffee.csv')
data.head(10)

| | user_id | stars | reviews |
|---|---|---|---|
| 0 | A2XP9IN4JOMROD | 1 | I wanted to love this. I was even prepared for... |
| 1 | A2TS09JCXNV1VD | 5 | Grove Square Cappuccino Cups were excellent. T... |
| 2 | AJ3L5J7GN09SV | 2 | I bought the Grove Square hazelnut cappuccino ... |
| 3 | A3CZD34ZTUJME7 | 1 | I love my Keurig, and I love most of the Keuri... |
| 4 | AWKN396SHAQGP | 1 | It's a powdered drink. No filter in k-cup.<br ... |
| 5 | A35NA371SV1PAH | 3 | Not enough coffee flavor and definitely to swe... |
| 6 | A1LR5HPNQLH4RI | 1 | don't bother! bet you couldn't tell the differ... |
| 7 | A2RCZ8YKLE8B3O | 1 | Never tasted this coffee before, I felt much t... |
| 8 | A31D6GWYLIKF4X | 2 | While the overall idea behind the product is l... |
| 9 | A1KZPDB5MOWNVU | 5 | I bought a keurig and bought these to try. Wel... |


## Question 1 ##

* Determine how many reviews there are in total.
* Determine the percent of 1, 2, 3, 4 and 5 star reviews.
* Create a new data set for modeling with the following columns:
     - Column 1: 'positive' if the star rating is 4 or 5, and 'negative' if it is 1 or 2
     - Column 2: review text
* Take a look at the number of positive and negative reviews in the newly created data set.

Checkpoint: the resulting data set should have 514 reviews.
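
The steps listed above can be sketched on a small made-up DataFrame. This is only an illustration: `toy`, `kept`, and `labeled` are hypothetical names, and the values are not taken from `coffee.csv`. It assumes `value_counts(normalize=True)` for the percentages and `np.where` for the labels, which is one of several reasonable ways to do this.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real reviews data.
toy = pd.DataFrame({
    "stars": [1, 5, 3, 4, 2, 5],
    "reviews": ["bad", "great", "okay", "good", "meh", "love it"],
})

# Percent of each star rating, ordered by rating.
percent = (toy["stars"].value_counts(normalize=True).sort_index() * 100).round(2)

# Drop the neutral 3-star reviews, then label what is left.
kept = toy[toy["stars"] != 3].reset_index(drop=True)
labeled = pd.DataFrame({
    "sentiment": np.where(kept["stars"] >= 4, "positive", "negative"),
    "reviews": kept["reviews"],
})

print(percent)
print(labeled["sentiment"].value_counts())
```

Resetting the index after filtering keeps the label column and the text column aligned row by row.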

Use the preprocessing code below to clean the reviews data before moving on to modeling.

In [95]:
print("Total reviews -", len(data))

Total reviews - 542


In [121]:
# Count reviews per star rating, then convert the counts to percentages
distribution = data.groupby("stars").size().reset_index(name="counts")
distribution['percent'] = np.round(distribution.counts / distribution.counts.sum() * 100, 2)
distribution

| | stars | counts | percent |
|---|---|---|---|
| 0 | 1 | 96 | 17.71 |
| 1 | 2 | 45 | 8.3 |
| 2 | 3 | 28 | 5.17 |
| 3 | 4 | 65 | 11.99 |
| 4 | 5 | 308 | 56.83 |


In [107]:
data1 = data[data.stars!=3].reset_index(drop=True)
dataset = pd.DataFrame()
dataset["sentiment"] = np.where(data1.stars>=4,"positive","negative")
dataset["reviews"] = data1.reviews
print("Number of reviews in new dataset - ",len(dataset))

Number of reviews in new dataset -  514


In [98]:
# Text preprocessing steps - remove numbers, capital letters and punctuation
alphanumeric = lambda x: re.sub(r"""\w*\d\w*""", ' ', x)
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())

dataset['reviews'] = dataset.reviews.map(alphanumeric).map(punc_lower)
dataset.head(10)
        

| | sentiment | reviews |
|---|---|---|
| 0 | negative | i wanted to love this i was even prepared for... |
| 1 | positive | grove square cappuccino cups were excellent t... |
| 2 | negative | i bought the grove square hazelnut cappuccino ... |
| 3 | negative | i love my keurig and i love most of the keuri... |
| 4 | negative | it s a powdered drink no filter in k cup br ... |
| 5 | negative | don t bother bet you couldn t tell the differ... |
| 6 | negative | never tasted this coffee before i felt much t... |
| 7 | negative | while the overall idea behind the product is l... |
| 8 | positive | i bought a keurig and bought these to try wel... |
| 9 | positive | my husband and i love this french vanilla capp... |
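
To see what the two preprocessing steps do to a single review, here is a small sketch that applies the same regular expressions to a made-up sentence. The function names and the `sample` string are hypothetical and used only for illustration.

```python
import re
import string

def drop_tokens_with_digits(text):
    # Replace any token containing a digit (e.g. "10oz", "2012") with a space.
    return re.sub(r"\w*\d\w*", " ", text)

def lower_and_strip_punctuation(text):
    # Lowercase, then replace every punctuation character with a space.
    return re.sub("[%s]" % re.escape(string.punctuation), " ", text.lower())

sample = "Bought 2 boxes of K-Cups (10oz) in 2012. Don't bother!"
cleaned = lower_and_strip_punctuation(drop_tokens_with_digits(sample))
tokens = cleaned.split()
print(tokens)
# ['bought', 'boxes', 'of', 'k', 'cups', 'in', 'don', 't', 'bother']
```

Because punctuation is replaced by spaces, contractions and hyphenated words are split into separate tokens ("don", "t"), which matches the cleaned reviews shown above.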


## Question 2 ##

Prepare the data for modeling:
* Split the data into training and test sets. You should have four sets of data - X_train, X_test, y_train, y_test

Create numerical features with Count Vectorizer. Create two document-term matrices:
* Matrix 1: Terms should be unigrams (single words), and values should be word counts (Hint: this is the Count Vectorizer default)
* Matrix 2: Terms should be unigrams and bigrams, and values should be binary values

Recommendation: Utilize Count Vectorizer's stop words function to remove stop words from the reviews text.
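
One possible way to set this up is to split the raw text first and fit each vectorizer on the training text only, so that no vocabulary from the test reviews leaks into the features. The sketch below uses a hypothetical toy corpus (`texts`, `labels`) rather than the real reviews, and the `random_state=0` and `stratify` arguments are assumptions added for reproducibility.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for dataset.reviews / dataset.sentiment.
texts = [
    "great cappuccino love it", "terrible powdered drink", "love the flavor",
    "too sweet and weak", "great price great taste", "awful do not buy",
    "really good coffee", "bad aftertaste",
]
labels = ["positive", "negative"] * 4

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Matrix 1: unigram counts (the CountVectorizer default).
cv_counts = CountVectorizer(stop_words="english")
X_train_counts = cv_counts.fit_transform(X_train)  # learn the vocabulary on train only
X_test_counts = cv_counts.transform(X_test)        # reuse that vocabulary

# Matrix 2: unigrams and bigrams with binary (presence/absence) values.
cv_binary = CountVectorizer(stop_words="english", ngram_range=(1, 2), binary=True)
X_train_binary = cv_binary.fit_transform(X_train)
X_test_binary = cv_binary.transform(X_test)

print(X_train_counts.shape, X_test_counts.shape)
print(X_train_binary.shape, X_test_binary.shape)
```

The matrices stay sparse here; converting to a dense DataFrame is convenient for inspection but is not required by the scikit-learn estimators used later.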

In [110]:
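# Matrix 1: unigram counts with English stop words removed
cv = CountVectorizer(stop_words='english')
cv.fit(dataset.reviews)
dtm = cv.transform(dataset.reviews)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
matrix1 = pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out())
matrix1["sentiment"] = dataset.sentiment
matrix1.head()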

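| | 00 | 0g | 10 | 100 | 10oz | 11 | 11s | 12 | 170mg | 18 | ... | yes | yessiree | yesterday | york | yuck | yucky | yum | yummy | yup | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | negative |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | positive |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | negative |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | negative |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | negative |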


In [111]:
# Pass the target as a 1-D Series rather than a one-column DataFrame
X_train1, X_test1, y_train1, y_test1 = train_test_split(matrix1.drop("sentiment",axis=1), matrix1["sentiment"], test_size=0.25)

In [112]:
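# Matrix 2: unigrams and bigrams, with binary (presence/absence) values
cv = CountVectorizer(stop_words='english', ngram_range=(1,2), binary=True)
cv.fit(dataset.reviews)
dtm = cv.transform(dataset.reviews)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
matrix2 = pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out())
matrix2["sentiment"] = dataset.sentiment
matrix2.head()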

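| | 00 | 00 cups | 00 thought | 0g | 0g protein | 10 | 10 00 | 10 2012 | 10 47 | 10 bought | ... | yummy price | yummy real | yummy run | yummy strong | yummy suitable | yummy treat | yummy won | yup | yup exactly | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | negative |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | positive |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | negative |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | negative |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | negative |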


In [113]:
# Pass the target as a 1-D Series rather than a one-column DataFrame
X_train2, X_test2, y_train2, y_test2 = train_test_split(matrix2.drop("sentiment",axis=1), matrix2["sentiment"], test_size=0.25)

## Question 3 ##

Use Logistic Regression to classify reviews as positive or negative. Do this for both matrices.
* Fit a Logistic Regression model on the training data
* Apply the model on the test data and calculate the following error metrics: accuracy, precision, recall, F1 score
* Optional: Visualize the confusion matrix for both models
* Compare the error metrics of the two matrices

Recommendation: Create a function to calculate the error metrics, since you'll be doing this multiple times.

In [127]:
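def errors_metrics(test, preds):
    """Print accuracy and support-weighted precision, recall and F1 for the predictions."""
    print("Accuracy -", accuracy_score(test, preds))
    # average='weighted' averages the per-class scores, weighted by class support
    precision, recall, f1, _ = precision_recall_fscore_support(test, preds, average='weighted')
    print("Precision -", precision)
    print("Recall -", recall)
    print("f1_score -", f1)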

In [128]:
logreg= LogisticRegression(solver='lbfgs')

fit = logreg.fit(X_train1,y_train1)
preds = fit.predict(X_train1)
print("\nmatrix1 - Train Data -")
errors_metrics(y_train1,preds)

preds = fit.predict(X_test1)
print("\nmatrix1 - Test Data -")
errors_metrics(y_test1,preds)

logreg= LogisticRegression()
fit = logreg.fit(X_train2,y_train2)
preds = fit.predict(X_train2)
print("\nmatrix2 - Train Data -")
errors_metrics(y_train2,preds)

preds = fit.predict(X_test2)
print("\nmatrix2 - Test Data -")
errors_metrics(y_test2,preds)



matrix1 - Train Data -
Accuracy - 0.9974025974025974
Precision - 0.9974118408282109
Recall - 0.9974025974025974
f1_score - 0.9973986984963624

matrix1 - Test Data -
Accuracy - 0.9224806201550387
Precision - 0.9300067735380447
Recall - 0.9224806201550387
f1_score - 0.9182065924644426

matrix2 - Train Data -
Accuracy - 1.0
Precision - 1.0
Recall - 1.0
f1_score - 1.0

matrix2 - Test Data -
Accuracy - 0.7984496124031008
Precision - 0.793035560477421
Recall - 0.7984496124031008
f1_score - 0.7787115744453355


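Accuracy for Logistic Regression:

The unigram model (matrix1) reached a test accuracy of 92.2%;
the unigram-bigram model (matrix2) reached a test accuracy of 79.8%.

The unigram model performs better on the test data. The unigram-bigram model fits its training data perfectly (accuracy 1.0) but scores much lower on the test data, which suggests overfitting.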
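
In each block of output above, the Recall value is identical to the Accuracy value. That is expected with `average='weighted'`: weighting each class's recall by its support gives the overall fraction of correct predictions. The sketch below checks this on made-up labels (`y_true` and `y_pred` are hypothetical) and also shows one way to get metrics for a single class by passing `average='binary'` with a `pos_label`.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical labels, for illustration only.
y_true = ["positive", "positive", "positive", "positive", "negative", "negative"]
y_pred = ["positive", "positive", "positive", "negative", "positive", "negative"]

accuracy = accuracy_score(y_true, y_pred)

# Support-weighted average over both classes (what errors_metrics reports).
_, weighted_recall, _, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)

# Metrics for one class of interest, here the negative reviews.
neg_precision, neg_recall, neg_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label="negative"
)

print(accuracy, weighted_recall)
print(neg_precision, neg_recall, neg_f1)
```

With imbalanced classes, as in this data set where positive reviews dominate, the per-class numbers for the minority class can be noticeably lower than the weighted averages.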

## Question 4 ##

Let's try another machine learning technique to classify these reviews as positive or negative. Go through the same exercise as in the previous step, but this time use Naive Bayes instead of Logistic Regression.

For count data, use [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB). For binary data, use [Bernoulli Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB).
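
As a rough illustration of that pairing, the sketch below fits `MultinomialNB` on word counts and `BernoulliNB` on binary presence/absence features. The toy corpus and the `new_review` text are made up for this example and are not from the review data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy corpus, for illustration only.
texts = ["great great coffee", "love this coffee", "awful awful drink", "bad drink"]
labels = ["positive", "positive", "negative", "negative"]

# Word counts pair naturally with the multinomial event model.
counts = CountVectorizer().fit(texts)
mnb = MultinomialNB().fit(counts.transform(texts), labels)

# Presence/absence features pair naturally with the Bernoulli event model.
binary = CountVectorizer(binary=True).fit(texts)
bnb = BernoulliNB().fit(binary.transform(texts), labels)

new_review = ["great coffee"]
mnb_pred = mnb.predict(counts.transform(new_review))
bnb_pred = bnb.predict(binary.transform(new_review))
print(mnb_pred, bnb_pred)
```

The Bernoulli model also uses the absence of each vocabulary term as evidence, which is one reason it can behave differently from the multinomial model when the vocabulary is large and most terms are absent from any given review.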

Compare the results of both the Logistic Regression and Naive Bayes models.

In [124]:
mnb = MultinomialNB()
fit = mnb.fit(X_train1,y_train1)
preds = fit.predict(X_train1)
print("\nmatrix1 Multinomial Naive Bayes- Train Data:")
errors_metrics(y_train1,preds)

preds = fit.predict(X_test1)
print("\nmatrix1 Multinomial Naive Bayes - Test Data:")
errors_metrics(y_test1,preds)

bnb= BernoulliNB()
fit = bnb.fit(X_train2,y_train2)
preds = fit.predict(X_train2)
print("\nmatrix2 Bernoulli Naive Bayes - Train Data:")
errors_metrics(y_train2,preds)

preds = fit.predict(X_test2)
print("\nmatrix2 Bernoulli Naive Bayes - Test Data:")
errors_metrics(y_test2,preds)



matrix1 Multinomial Naive Bayes- Train Data:
Accuracy - 0.9662337662337662
Precision - 0.9666365209185279
Recall - 0.9662337662337662
f1_score - 0.9657448498043414

matrix1 Multinomial Naive Bayes - Test Data:
Accuracy - 0.9224806201550387
Precision - 0.9257919147396467
Recall - 0.9224806201550387
f1_score - 0.9192339966434909

matrix2 Bernoulli Naive Bayes - Train Data:
Accuracy - 0.8181818181818182
Precision - 0.8481954212613237
Recall - 0.8181818181818182
f1_score - 0.7838383838383837

matrix2 Bernoulli Naive Bayes - Test Data:
Accuracy - 0.7286821705426356
Precision - 0.7038327247101032
Recall - 0.7286821705426356
f1_score - 0.6597275296914996


 Logistic Regression Accuracy-
    Unigram matrix1 Test: 92.2%;
    Unigram-bigram matrix2 Test: 79.8%

 Naive Bayes Accuracy-
    Multinomial matrix1 Test: 92.2%;
    Bernoulli matrix2 Test: 72.9%

Logistic Regression and Multinomial Naive Bayes, both on unigram counts (matrix1), perform equally well, each with a test accuracy of 92.2%.

## Question 5 ##

Up to this point, we've been using Count Vectorizer to create document-term matrices to input into the models. For at least one of the four models you've created so far, use TF-IDF Vectorizer instead of Count Vectorizer, and see if it improves the results.

Out of all of the models you've created, which model do you think best classifies positive and negative cappuccino cup reviews?
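
Before swapping vectorizers, it can help to see what TF-IDF changes. The sketch below uses a hypothetical toy corpus to show that a term appearing in every document gets a lower inverse document frequency than a rare term. It then chains the vectorizer and a classifier with `make_pipeline`, which is a design choice of this sketch rather than something the exercise requires.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, for illustration only.
texts = [
    "great coffee great flavor", "love this coffee", "coffee with great price",
    "awful coffee", "weak bitter coffee", "bitter powdered coffee drink",
]
labels = ["positive"] * 3 + ["negative"] * 3

# TF-IDF down-weights terms that occur in many documents.
tfidf = TfidfVectorizer().fit(texts)
idf = {term: tfidf.idf_[index] for term, index in tfidf.vocabulary_.items()}
print(idf["coffee"], idf["awful"])  # "coffee" is in every document, so its idf is lower

# Chaining vectorizer and classifier fits both on the same training text.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
predictions = model.predict(["great flavor", "bitter drink"])
print(predictions)
```

A pipeline like this can be passed straight to `train_test_split` output or to cross-validation helpers, since it accepts raw text as input.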

In [117]:
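# Matrix 3: TF-IDF weighted unigrams with English stop words removed
tfv = TfidfVectorizer(stop_words='english')
# tokenize and build vocab
tfv.fit(dataset.reviews)
dtm = tfv.transform(dataset.reviews)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
matrix3 = pd.DataFrame(dtm.toarray(), columns=tfv.get_feature_names_out())
matrix3["sentiment"] = dataset.sentiment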

In [118]:
# Pass the target as a 1-D Series rather than a one-column DataFrame
X_train3, X_test3, y_train3, y_test3 = train_test_split(matrix3.drop("sentiment",axis=1), matrix3["sentiment"], test_size=0.25)

In [126]:
mnb = MultinomialNB()
fit = mnb.fit(X_train3,y_train3)
preds = fit.predict(X_train3)
print("\nmatrix3 TFIDF Multinomial Naive Bayes- Train Data -")
errors_metrics(y_train3,preds)

preds = fit.predict(X_test3)
print("\nmatrix3 TFIDF Multinomial Naive Bayes- Test Data -")
errors_metrics(y_test3,preds)


matrix3 TFIDF Multinomial Naive Bayes- Train Data -
Accuracy - 0.787012987012987
Precision - 0.8351257557869128
Recall - 0.787012987012987
f1_score - 0.731267421329533

matrix3 TFIDF Multinomial Naive Bayes- Test Data -
Accuracy - 0.7209302325581395
Precision - 0.7994186046511628
Recall - 0.7209302325581395
f1_score - 0.6115722710581951



In [125]:
fit = logreg.fit(X_train3,y_train3)
preds = fit.predict(X_train3)
print("\nmatrix3 TFIDF - Train Data -")
errors_metrics(y_train3,preds)

preds = fit.predict(X_test3)
print("\nmatrix3 TFIDF - Test Data -")
errors_metrics(y_test3,preds)


matrix3 TFIDF - Train Data -
Accuracy - 0.8545454545454545
Precision - 0.8787159428108984
Recall - 0.8545454545454545
f1_score - 0.8343413697527137

matrix3 TFIDF - Test Data -
Accuracy - 0.7441860465116279
Precision - 0.8117209302325582
Recall - 0.7441860465116279
f1_score - 0.6606878200386334



Logistic Regression Accuracy-
    Unigram matrix1 Test: 92.2%;
    Unigram-bigram matrix2 Test: 79.8%;
    TFIDF matrix3 Test: 74.4%

Naive Bayes Accuracy-
    Multinomial matrix1 Test: 92.2%;
    Bernoulli matrix2 Test: 72.9%;
    Multinomial TFIDF matrix3 Test: 72.1%

TF-IDF did not improve the results here. Logistic Regression and Multinomial Naive Bayes, both on unigram counts from Count Vectorizer (matrix1), perform best and equally well, each with a test accuracy of 92.2%.
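
One caveat on the comparison: each matrix above was split by its own call to `train_test_split` without a fixed `random_state`, so the models were scored on different test sets, and the numbers will shift from run to run. A hedged sketch of a more like-for-like comparison is to score every candidate on the same cross-validation folds. The corpus, the `candidates` dictionary, and the fold settings below are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, for illustration only.
texts = [
    "great cappuccino love it", "love the flavor", "great price great taste",
    "really good coffee", "smooth and creamy", "excellent cups highly recommend",
    "terrible powdered drink", "too sweet and weak", "awful do not buy",
    "bad aftertaste", "tastes like chemicals", "waste of money",
]
labels = ["positive"] * 6 + ["negative"] * 6

# The same shuffled, stratified folds are reused for every model.
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

candidates = {
    "counts + logistic regression": make_pipeline(CountVectorizer(), LogisticRegression()),
    "counts + multinomial NB": make_pipeline(CountVectorizer(), MultinomialNB()),
    "tf-idf + logistic regression": make_pipeline(TfidfVectorizer(), LogisticRegression()),
}

scores = {
    name: cross_val_score(model, texts, labels, cv=folds, scoring="accuracy")
    for name, model in candidates.items()
}
for name, fold_scores in scores.items():
    print(name, round(fold_scores.mean(), 3))
```

On a corpus this small the scores themselves mean little; the point is only that every model sees identical train and test partitions.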