# Sentiment Analysis on Reviews Data

## Introduction ##

In [23]:
import nltk
import pandas as pd

In [24]:
data = pd.read_csv('coffee.csv')
data.head()

Unnamed: 0,user_id,stars,reviews
0,A2XP9IN4JOMROD,1,I wanted to love this. I was even prepared for...
1,A2TS09JCXNV1VD,5,Grove Square Cappuccino Cups were excellent. T...
2,AJ3L5J7GN09SV,2,I bought the Grove Square hazelnut cappuccino ...
3,A3CZD34ZTUJME7,1,"I love my Keurig, and I love most of the Keuri..."
4,AWKN396SHAQGP,1,It's a powdered drink. No filter in k-cup.<br ...


## Task 1 ##

* Determine how many reviews there are in total.
* Determine the percent of 1, 2, 3, 4 and 5 star reviews.
* Create a new data set for modeling with the following columns:
     - Column 1: 'positive' if the star rating is 4 or 5, and 'negative' if the star rating is 1 or 2
     - Column 2: review text
* Take a look at the number of positive and negative reviews in the newly created data set.


In [25]:
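# Text preprocessing: drop tokens containing digits, lowercase, and replace punctuation with spaces
import re
import string

# Replace any token that contains a digit (e.g. "2nd") with a space
alphanumeric = lambda x: re.sub(r'\w*\d\w*', ' ', x)
# Lowercase the text and replace every punctuation character with a space
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())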

data['reviews'] = data.reviews.map(alphanumeric).map(punc_lower)
data.head()

Unnamed: 0,user_id,stars,reviews
0,A2XP9IN4JOMROD,1,i wanted to love this i was even prepared for...
1,A2TS09JCXNV1VD,5,grove square cappuccino cups were excellent t...
2,AJ3L5J7GN09SV,2,i bought the grove square hazelnut cappuccino ...
3,A3CZD34ZTUJME7,1,i love my keurig and i love most of the keuri...
4,AWKN396SHAQGP,1,it s a powdered drink no filter in k cup br ...
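
To make the effect of the two substitutions concrete, here is a small standalone sketch that applies the same patterns to a made-up review string. The sample text and the `clean` helper name are illustrative assumptions, not part of the notebook. It also suggests why `br` survives in row 4 above: the angle brackets and slash of `<br />` are punctuation, but the tag name itself is ordinary letters.

```python
import re
import string

def clean(text):
    """Apply the notebook's two substitutions in the same order."""
    # Replace any token containing a digit with a space
    text = re.sub(r'\w*\d\w*', ' ', text)
    # Lowercase, then replace each punctuation character with a space
    return re.sub('[%s]' % re.escape(string.punctuation), ' ', text.lower())

sample = "It's a K-cup, 2nd try!<br />"   # made-up example text
tokens = clean(sample).split()
print(tokens)
```

If the HTML remnants matter for the model, one option would be to strip tags (for example with a pattern such as `<[^>]+>`) before the punctuation step.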


In [26]:
# Number of reviews
"No of reviews: " + str(len(data))

'No of reviews: 542'

In [27]:
# % of 1,2,3,4,5 star ratings
from collections import Counter
c = Counter(data.stars)
[(i, c[i] / len(data) * 100.0) for i in c]

[(1, 17.712177121771216),
 (5, 56.82656826568265),
 (2, 8.302583025830259),
 (3, 5.166051660516605),
 (4, 11.992619926199263)]

In [28]:
# Drop neutral 3-star reviews; only 1-2 star and 4-5 star reviews are labeled
d3 = data[data.stars != 3]
len(d3)

514

In [29]:
# Function that maps a star rating to the review type used in the new data set
def f(stars):
    if stars in (1, 2):
        return 'Negative'
    if stars in (4, 5):
        return 'Positive'
    return ''  # 3-star (neutral) reviews get no label

In [30]:
# Work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
d3 = d3.copy()
d3['Review_Type'] = d3['stars'].apply(f)
# Keep only the review text and its label
d3 = d3[['reviews', 'Review_Type']]
d3.head(10)


Unnamed: 0,reviews,Review_Type
0,i wanted to love this i was even prepared for...,Negative
1,grove square cappuccino cups were excellent t...,Positive
2,i bought the grove square hazelnut cappuccino ...,Negative
3,i love my keurig and i love most of the keuri...,Negative
4,it s a powdered drink no filter in k cup br ...,Negative
6,don t bother bet you couldn t tell the differ...,Negative
7,never tasted this coffee before i felt much t...,Negative
8,while the overall idea behind the product is l...,Negative
9,i bought a keurig and bought these to try wel...,Positive
10,my husband and i love this french vanilla capp...,Positive


In [31]:
data_positive = d3.loc[d3['Review_Type'] == 'Positive']
lp = len(data_positive)
print("No. of Positive Reviews: " + str(lp))

No. of Positive Reviews: 373


In [32]:
data_negative = d3.loc[d3['Review_Type'] == 'Negative']
ln = len(data_negative)
print("No. of Negative Reviews: " + str(ln))

No. of Negative Reviews: 141


## Task 2 ##

Prepare the data for modeling:
* Split the data into training and test sets. You should have four sets of data - X_train, X_test, y_train, y_test

Create numerical features with Count Vectorizer. Create two document-term matrices:
* Matrix 1: Terms should be unigrams (single words), and values should be word counts (Hint: this is the Count Vectorizer default)
* Matrix 2: Terms should be unigrams and bigrams, and values should be binary values

Recommendation: Use Count Vectorizer's `stop_words` parameter to remove stop words from the review text.
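
The difference between the two matrices is easiest to see on a toy corpus. The sketch below is a hand-rolled illustration in plain Python, not scikit-learn's implementation: the `ngrams` and `doc_term_matrix` helpers and the two sample documents are hypothetical, and stop-word removal is left out for brevity.

```python
from collections import Counter

def ngrams(tokens, n_min, n_max):
    """Return all n-grams (joined by spaces) for n_min <= n <= n_max."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

def doc_term_matrix(docs, n_min=1, n_max=1, binary=False):
    """Build a sorted vocabulary and one row of term values per document."""
    counts = [Counter(ngrams(d.split(), n_min, n_max)) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = []
    for c in counts:
        # With binary=True a term is recorded as present (1) or absent (0)
        rows.append([min(c[t], 1) if binary else c[t] for t in vocab])
    return vocab, rows

docs = ["good coffee good price", "bad coffee"]   # made-up reviews

vocab1, m1 = doc_term_matrix(docs)                         # Matrix 1 style: unigram counts
vocab2, m2 = doc_term_matrix(docs, n_max=2, binary=True)   # Matrix 2 style: unigrams + bigrams, binary
print(vocab1, m1)
print(vocab2, m2)
```

In the first matrix "good" is counted twice for the first document; in the second it is only marked as present, and bigrams such as "good coffee" become additional columns.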

In [33]:
import sklearn.model_selection  # binds `sklearn` and loads the submodule that provides train_test_split

In [34]:
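from sklearn.feature_extraction.text import CountVectorizer
# Matrix 1: unigram word counts with English stop words removed
cv = CountVectorizer(stop_words='english')
cdr = cv.fit_transform(d3.reviews)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (>= 1.0) replaces it
data_n = pd.DataFrame(cdr.toarray(), columns=cv.get_feature_names_out())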
revtype = pd.DataFrame(d3.Review_Type)
dnew = pd.concat([revtype.reset_index(drop=True),data_n],axis=1)
dnew.head(10)

Unnamed: 0,Review_Type,able,abomination,absolute,absolutely,acceptable,accident,acidy,actual,actually,...,years,yes,yessiree,yesterday,york,yuck,yucky,yum,yummy,yup
0,Negative,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,Positive,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,Negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,Negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,Negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,Positive,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,Positive,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [35]:
X = dnew.loc[:,dnew.columns!='Review_Type']
y = dnew.loc[:,'Review_Type']
X_train,X_test,y_train,y_test = sklearn.model_selection.train_test_split(X,y,test_size=0.3,random_state=0)

In [36]:
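# Matrix 2: unigrams and bigrams, recorded as binary presence/absence values
cv2 = CountVectorizer(stop_words='english', ngram_range=(1, 2), binary=True)
cdr2 = cv2.fit_transform(d3.reviews)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (>= 1.0) replaces it
data_nb = pd.DataFrame(cdr2.toarray(), columns=cv2.get_feature_names_out())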
revtype2= pd.DataFrame(d3.Review_Type)
datanew = pd.concat([revtype2.reset_index(drop=True),data_nb],axis=1)
datanew.head(10)

Unnamed: 0,Review_Type,able,able cappuccino,able drink,able finish,able longer,able make,able return,able switch,abomination,...,yummy perfect,yummy price,yummy real,yummy run,yummy strong,yummy suitable,yummy treat,yummy won,yup,yup exactly
0,Negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Positive,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,Negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,Negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,Negative,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,Positive,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,Positive,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
X2 = datanew.loc[:,datanew.columns!='Review_Type']
y2 = datanew.loc[:,'Review_Type']
X_train2,X_test2,y_train2,y_test2 = sklearn.model_selection.train_test_split(X2,y2,test_size=0.3,random_state=0)

## Task 3 ##

Use Logistic Regression to classify reviews as positive or negative. Do this for both matrices.
* Fit a Logistic Regression model on the training data
* Apply the model on the test data and calculate the following error metrics: accuracy, precision, recall, F1 score
* Visualize the confusion matrix for both models
* Compare the error metrics of the two matrices

Recommendation: Create a function to calculate the error metrics, since you'll be doing this multiple times.
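
For reference, the four metrics can be written in terms of the confusion-matrix counts, taking 'Positive' as the positive class ($TP$, $FP$, $TN$ and $FN$ are the true positives, false positives, true negatives and false negatives):

$$
\begin{aligned}
\text{accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{precision} &= \frac{TP}{TP + FP} \\
\text{recall} &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\end{aligned}
$$

$F_1$ is the harmonic mean of precision and recall, so it is high only when both are high.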

In [38]:
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression

def error_metrics(true, predict):
    # Labels are sorted alphabetically, so 'Positive' is treated as the positive class
    tn, fp, fn, tp = confusion_matrix(true, predict).ravel()
    a = (tn + tp) / (tn + fp + fn + tp)   # accuracy
    p = tp / (tp + fp)                    # precision
    r = tp / (tp + fn)                    # recall
    f = 2 * p * r / (p + r)               # F1 score: harmonic mean of precision and recall
    print(f"accuracy {a:.3f} precision {p:.3f} recall {r:.3f} F_score {f:.3f}")

model = LogisticRegression()
model.fit(X_train, y_train)
predclass = model.predict(X_test)
unigram = error_metrics(y_test, predclass)
unigram

accuracy 0.845 precision 0.878 recall 0.910 F_score 0.894




In [41]:
# Refit the same estimator on the unigram + bigram binary matrix
model.fit(X_train2, y_train2)
predclass2 = model.predict(X_test2)
unibigram = error_metrics(y_test2, predclass2)
unibigram

accuracy 0.826 precision 0.828 recall 0.955 F_score 0.887




**Comparison:** The unigram count model has slightly higher accuracy (0.845 vs 0.826) and precision (0.878 vs 0.828), while the unigram + bigram binary model has higher recall (0.955 vs 0.910). The F-score is almost the same for both matrices.
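
Task 3 also asks for a view of the confusion matrix. A minimal plain-Python sketch is shown below; it uses short made-up label lists rather than the notebook's predictions, and the `confusion_counts` helper is a hypothetical name. Rows are true labels and columns are predicted labels, which is the layout `confusion_matrix` uses as well.

```python
from collections import Counter

def confusion_counts(true, pred, labels):
    """Count (true, predicted) pairs and return them as a nested list."""
    pairs = Counter(zip(true, pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]

labels = ['Negative', 'Positive']
# Made-up labels for illustration only
y_true = ['Positive', 'Positive', 'Negative', 'Negative', 'Positive', 'Negative']
y_pred = ['Positive', 'Negative', 'Negative', 'Positive', 'Positive', 'Negative']

matrix = confusion_counts(y_true, y_pred, labels)

# Print a small text table: rows = true label, columns = predicted label
print(f"{'':>10}" + ''.join(f"{lab:>10}" for lab in labels))
for name, row in zip(labels, matrix):
    print(f"{name:>10}" + ''.join(f"{v:>10}" for v in row))

# Unpack the counts with 'Positive' as the positive class
(tn, fp), (fn, tp) = matrix
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision {precision:.3f} recall {recall:.3f} F1 {f1:.3f}")
```

Reading the off-diagonal cells side by side for the two models shows directly whether a model trades false negatives for false positives, which is what the precision and recall differences above summarize.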

## Task 4 ##

Let's try another machine learning technique, Naive Bayes, to classify these reviews as positive or negative.

For count data, use [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB). For binary data, use [Bernoulli Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB).

Compare the results of both the Logistic Regression and Naive Bayes models.
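
As a rough illustration of what a multinomial Naive Bayes classifier does with word counts, the sketch below trains one by hand on four made-up reviews. It is a simplified stand-in written for this explanation, not scikit-learn's `MultinomialNB`; the function names, the toy data and the `alpha=1.0` smoothing value are assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-probabilities."""
    vocab = sorted({w for d in docs for w in d.split()})
    word_counts = defaultdict(Counter)   # class -> word -> count
    class_counts = Counter(labels)
    for d, y in zip(docs, labels):
        word_counts[y].update(d.split())
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        # Additive (Laplace) smoothing so unseen words do not get zero probability
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
                       for w in vocab}
    return log_prior, log_like

def predict(doc, log_prior, log_like):
    """Pick the class with the highest log posterior; unknown words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(log_like[c][w] for w in doc.split() if w in log_like[c])
    return max(scores, key=scores.get)

# Made-up training reviews for illustration only
docs = ["great coffee great taste", "love this coffee", "awful bitter taste", "awful waste"]
labels = ["Positive", "Positive", "Negative", "Negative"]

log_prior, log_like = train_multinomial_nb(docs, labels)
print(predict("great taste", log_prior, log_like))
print(predict("bitter awful coffee", log_prior, log_like))
```

Each occurrence of a word adds its log-probability to the score, so repeated words count more than once. A Bernoulli model instead looks only at whether each vocabulary word is present, and it also penalizes the words that are absent from a document, which is why it is paired with the binary matrix.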

In [42]:
from sklearn.naive_bayes import MultinomialNB
nm = MultinomialNB()
nm.fit(X_train,y_train)
pred_nm = nm.predict(X_test)
ug_nb = error_metrics(y_test,pred_nm)
ug_nb

accuracy 0.884 precision 0.891 recall 0.955 F_score 0.922


In [43]:
from sklearn.naive_bayes import BernoulliNB
nmb = BernoulliNB()
nmb.fit(X_train2,y_train2)
pred_nb = nmb.predict(X_test2)
ugb_nb = error_metrics(y_test2,pred_nb)
ugb_nb

accuracy 0.735 precision 0.746 recall 0.955 F_score 0.838


**Comparison of Logistic Regression and Naive Bayes models:**
* On the binary unigram + bigram matrix, Logistic Regression is better than Bernoulli Naive Bayes: it has higher accuracy (0.826 vs 0.735), precision (0.828 vs 0.746) and F-score, with the same recall (0.955).
* On the unigram count matrix, Multinomial Naive Bayes is better than Logistic Regression: it has higher accuracy (0.884 vs 0.845), precision (0.891 vs 0.878), recall (0.955 vs 0.910) and F-score.

## Task 5 ##

Up to this point, we've been using Count Vectorizer to create document-term matrices to input into the models. Use TF-IDF Vectorizer instead of Count Vectorizer, and see if it improves the results.

Out of all of the models you've created, which model do you think best classifies positive and negative cappuccino cup reviews?
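
To show what changes when counts are replaced by TF-IDF weights, the sketch below computes them by hand on three made-up documents. It follows the smoothed formula documented for scikit-learn's default settings, `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, but it is a plain-Python approximation: tokenisation is a simple whitespace split, stop words are not removed, and the `tfidf_rows` helper is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocab, rows) with smoothed IDF weights and L2-normalised rows."""
    tokenised = [d.split() for d in docs]
    vocab = sorted({w for toks in tokenised for w in toks})
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(w for toks in tokenised for w in set(toks))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for toks in tokenised:
        tf = Counter(toks)
        raw = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = ["good coffee", "bad coffee", "good good price"]   # made-up reviews
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

Terms that appear in fewer documents get a larger weight than common ones, and every row is scaled to unit length, so long and short reviews become comparable.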

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer
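# TF-IDF weighted unigrams with English stop words removed
tfidf = TfidfVectorizer(stop_words='english')
tf = tfidf.fit_transform(d3.reviews)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (>= 1.0) replaces it
dt = pd.DataFrame(tf.toarray(), columns=tfidf.get_feature_names_out())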
tgtf = pd.DataFrame(d3.Review_Type)
dtf = pd.concat([tgtf.reset_index(drop=True),dt],axis=1)
dtf.head(10)

Unnamed: 0,Review_Type,able,abomination,absolute,absolutely,acceptable,accident,acidy,actual,actually,...,years,yes,yessiree,yesterday,york,yuck,yucky,yum,yummy,yup
0,Negative,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112391,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Positive,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Negative,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Negative,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Negative,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Negative,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Negative,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Negative,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Positive,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Positive,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [45]:
Xtf = dtf.loc[:,dtf.columns!='Review_Type']
ytf = dtf.loc[:,'Review_Type']
X_train_tf,X_test_tf,y_train_tf,y_test_tf = sklearn.model_selection.train_test_split(Xtf,ytf,test_size=0.3,random_state=0)
mnm = MultinomialNB()
mnm.fit(X_train_tf,y_train_tf)
predtf = mnm.predict(X_test_tf)
ugnb = error_metrics(y_test_tf,predtf)
ugnb

accuracy 0.735 precision 0.730 recall 1.000 F_score 0.844


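**Based on accuracy, precision, recall and F-score, the best model is Multinomial Naive Bayes on the unigram count matrix** (accuracy 0.884, precision 0.891, recall 0.955). TF-IDF did not improve the results: with TF-IDF features the same classifier reaches a recall of 1.0 but only 0.735 accuracy and 0.730 precision, which indicates that it labels almost every test review as Positive.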