# BT4222 Assignment: Yelp Reviews Data

### Due on: *9 Feb 2018 @ 18:00 (Week 4)*  

### Submit this .ipynb file to:  *IVLE > Student Submission > Individual Assignment*

### In addition, please prepend your NUS userID to the filename, i.e., "`a0123456_bt4222_assignment.ipynb`"

## Introduction

This assignment uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- **`yelp.csv`** contains the dataset. It is stored in the course repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.

**Goal:** Predict the star rating of a review using **only** the review text.

**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.

In [1]:
# for Python 2: use print only as a function
# from __future__ import print_function

## Task 1 (1 point)

Read **`yelp.csv`** into a Pandas DataFrame and examine it.

In [2]:
import pandas as pd

In [3]:
path = './yelp.csv'
yelp = pd.read_csv(path)

## Task 2 (2 pts)

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](https://www.youtube.com/watch?v=YPItfQ87qjM&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=9) explains how to do this.

In [4]:
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]
yelp_best_worst.shape

(4086, 10)
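The same filter can also be written with `Series.isin`, which stays readable when more than two star values are needed. A minimal sketch, assuming a made-up `toy` DataFrame (hypothetical data, not taken from `yelp.csv`):

```python
import pandas as pd

# hypothetical toy data standing in for yelp.csv
toy = pd.DataFrame({
    'stars': [5, 3, 1, 4, 5, 2],
    'text': ['great', 'ok', 'awful', 'good', 'loved it', 'meh'],
})

# boolean OR of two conditions (the approach used above)
by_or = toy[(toy.stars == 5) | (toy.stars == 1)]

# equivalent filter using isin
by_isin = toy[toy.stars.isin([1, 5])]

print(by_isin)
print(by_or.equals(by_isin))
```

Both expressions build a boolean mask over the rows, so they select the same rows in the same order.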

## Task 3 (2 pts)

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

- **Hint:** Keep in mind that X should be a Pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.

In [5]:
# define X and y
X = yelp_best_worst.text
y = yelp_best_worst.stars

# split X and y into training and testing sets
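# (train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split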
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# examine the object shapes
print(X_train.shape)
print(X_test.shape)

(3064,)
(1022,)
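The shapes above reflect the default behaviour of `train_test_split`: when no `test_size` is given, 25% of the rows go to the testing set (1022 of 4086 here). A minimal sketch of that default on made-up data, assuming a current scikit-learn release where the function is imported from `sklearn.model_selection` (the `X_toy` and `y_toy` names are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical toy data: 100 short "reviews" with made-up labels
X_toy = pd.Series(['review %d' % i for i in range(100)])
y_toy = pd.Series([5] * 80 + [1] * 20)

# no test_size given, so 25% of the rows go to the testing set
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)
print(X_tr.shape, X_te.shape)

# the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, random_state=1)
print(X_te.index.equals(X_te2.index))
```

Fixing `random_state` is what makes the later accuracy figures reproducible from run to run.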




## Task 4 (2 pts)

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [6]:
# use CountVectorizer to create document-term matrices from X_train and X_test
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

# fit and transform X_train
X_train_dtm = vect.fit_transform(X_train)

# only transform X_test
X_test_dtm = vect.transform(X_test)

# examine the shapes: rows are documents, columns are terms (aka "tokens" or "features")
print(X_train_dtm.shape)
print(X_test_dtm.shape)

(3064, 16825)
(1022, 16825)
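Both matrices have the same number of columns because the vocabulary is learned from the training text only; tokens that first appear in the testing text are ignored. A minimal sketch on made-up documents (the `train_docs`, `test_docs`, and `vect_demo` names are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical toy documents
train_docs = ['great food great service', 'terrible service']
test_docs = ['great pizza']  # 'pizza' never appears in the training text

vect_demo = CountVectorizer()
train_dtm = vect_demo.fit_transform(train_docs)  # learn vocabulary, then count
test_dtm = vect_demo.transform(test_docs)        # count using the learned vocabulary

print(sorted(vect_demo.vocabulary_))  # tokens learned from train_docs
print(train_dtm.toarray())
print(test_dtm.toarray())
```

The testing row has a single nonzero entry (for `great`); `pizza` has no column, so it contributes nothing to the prediction.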


## Task 5 (2 pts)

Use Multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

In [7]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [8]:
nb = MultinomialNB()

# train the model using X_train_dtm (timing it with an IPython "magic command")
%time nb.fit(X_train_dtm, y_train)

# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

Wall time: 6.01 ms


In [9]:
# calculate accuracy of class predictions (metrics was imported in the earlier cell)
metrics.accuracy_score(y_test, y_pred_class)

0.9187866927592955

In [10]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[126,  58],
       [ 25, 813]], dtype=int64)
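The accuracy can be recovered from the confusion matrix itself. scikit-learn puts the true classes in rows and the predicted classes in columns, with labels in sorted order (1-star first, then 5-star). A short sketch using the numbers reported above, assuming 5-star is treated as the "positive" class:

```python
import numpy as np

# confusion matrix reported above: rows = true class, columns = predicted class,
# labels in sorted order [1, 5]
confusion = np.array([[126, 58],
                      [25, 813]])

# assumption: 5-star is the positive class
tn, fp, fn, tp = confusion.ravel()

accuracy = (tn + tp) / confusion.sum()  # correct predictions over all predictions
sensitivity = tp / (tp + fn)            # share of 5-star reviews predicted correctly
specificity = tn / (tn + fp)            # share of 1-star reviews predicted correctly

print(accuracy, sensitivity, specificity)
```

The accuracy matches the value returned by `metrics.accuracy_score`, and the gap between sensitivity and specificity shows that most of the errors fall on 1-star reviews.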

## Task 6 (3 pts)

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!

In [11]:
# calculate null accuracy: share of the most frequent class in the testing set
y_test.value_counts().head(1) / len(y_test)

5    0.819961
Name: stars, dtype: float64
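An equivalent way to get the same number is `value_counts(normalize=True)`, which works for any label coding. A minimal sketch that rebuilds the testing labels from the row totals of the confusion matrix above (the `y_demo` Series is a hypothetical stand-in for `y_test`):

```python
import pandas as pd

# rebuild the testing labels from the confusion matrix row totals above:
# 126 + 58 = 184 one-star reviews and 25 + 813 = 838 five-star reviews
y_demo = pd.Series([1] * 184 + [5] * 838, name='stars')

# share of the most frequent class
null_accuracy = y_demo.value_counts(normalize=True).max()
print(null_accuracy)
```

Comparing this baseline with the model's accuracy shows how much the model improves on always predicting 5 stars.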

## Task 7 (4 pts)

Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?

In [12]:
# combine the test text, true ratings, and predicted ratings into a DataFrame
pd.DataFrame({'X_test':X_test, 'y_test':y_test, 'y_pred_class':y_pred_class}).head(10)

| | X_test | y_pred_class | y_test |
|---|---|---|---|
| 3922 | Looking a cutting edge, wanting the best for e... | 5 | 5 |
| 8379 | Greatness in the form of food, just like the o... | 5 | 5 |
| 4266 | The Flower Studio far exceeded my expectations... | 5 | 5 |
| 5577 | So yummy! Strange combination but great place | 5 | 5 |
| 537 | I've been hearing about these cheesecakes from... | 5 | 5 |
| 2175 | This has to be the worst restaurant in terms o... | 5 | 1 |
| 4556 | I ate at Scramble last Friday and I have to sa... | 1 | 1 |
| 1502 | We decided to eat here on a whim. My husband g... | 5 | 5 |
| 6434 | I LOVE BURRITO EXPRESS. My fiance has been goi... | 5 | 5 |
| 2282 | Just open. I had the roast beef sandwich and ... | 5 | 5 |


In [13]:
# print message text for the false positives (meaning they were incorrectly classified as 5 star)
X_test[y_test < y_pred_class].head()

2175    This has to be the worst restaurant in terms o...
1781    If you like the stuck up Scottsdale vibe this ...
2674    I'm sorry to be what seems to be the lone one ...
9984    Went last night to Whore Foods to get basics t...
3392    I found Lisa G's while driving through phoenix...
Name: text, dtype: object

In [14]:
# print message text for the false negatives (meaning they were incorrectly classified as 1 star)
X_test[y_test > y_pred_class].head()

7148    I now consider myself an Arizonian. If you dri...
4963    This is by far my favourite department store, ...
6318    Since I have ranted recently on poor customer ...
380     This is a must try for any Mani Pedi fan. I us...
5565    I`ve had work done by this shop a few times th...
Name: text, dtype: object

In [26]:
# look up reviews by their index label
print(X_test.loc[2175])  # false positive example
print(X_test.loc[7148])  # false negative example

This has to be the worst restaurant in terms of hygiene. Two of my friends had food -poisoning after having dinner here. The food is just unhealthy with tons of oil floating on the top of curries, and I am not sure if any health/hygiene code is followed here. 
The service is poor and the information on its website is incorrect, the owner does not allow dine-in after 9 or 10 even though it says that the restaurant is open till 11. 

One night I saw the owner cleaning the place without gloves and she was nice enough to give us a to-go parcel without cleaning her hands (great example to the servers!). I had a peek inside the kitchen when the door was ajar, and it definitely looked dirty.

I have been a lot of hole-in-the-wall places around this restaurant, including Haji Baba, the Vietnamese place and others, but neither any of my friends nor I have fallen sick coz of the food. If you need a spicy-food fix, i strongly recommend you do not try this place, lest you want a visit to the docto

**Answer:** Naive Bayes scores a review by combining per-token evidence learned from the training data, treating every token as independent and ignoring word order, negation, and sarcasm. A false positive (a 1-star review predicted as 5-star) occurs when the review contains many tokens that appeared mostly in 5-star training reviews; the example above uses "nice", "great", and "strongly recommend" sarcastically or in a negated context. Likewise, a false negative (a 5-star review predicted as 1-star) occurs when the review contains tokens that appeared mostly in 1-star training reviews. These errors are more frequent when test reviews are very long, since they contain many words unrelated to the overall sentiment, and when they are sarcastic.
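The effect of sarcastic or negated wording on Naive Bayes can be reproduced on a tiny made-up corpus. This is a minimal sketch with hypothetical training reviews (not taken from `yelp.csv`); the negative review below is predicted as 5-star because most of its known tokens were seen only in 5-star training text:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical toy training data
train_text = ['great food', 'nice staff', 'great place recommend',
              'dirty kitchen', 'poor service']
train_stars = [5, 5, 5, 1, 1]

vect_toy = CountVectorizer()
nb_toy = MultinomialNB()
nb_toy.fit(vect_toy.fit_transform(train_text), train_stars)

# a negative review that uses positive words sarcastically or with negation
sarcastic = ['dirty place, the owner was nice enough to ignore us, '
             'great example, would not recommend']
sarcastic_dtm = vect_toy.transform(sarcastic)

pred = nb_toy.predict(sarcastic_dtm)[0]
print(pred)
print(nb_toy.predict_proba(sarcastic_dtm))  # columns follow nb_toy.classes_
```

Words such as "not" and "ignore" never appear in the toy training text, so they are dropped, and the remaining tokens ("nice", "great", "recommend", "place") outweigh the single occurrence of "dirty".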

## Task 8 (4 pts)

Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.

- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.

In [16]:
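# store the vocabulary of X_train
# (get_feature_names_out() requires scikit-learn >= 1.0; get_feature_names() was removed in 1.2)
X_train_tokens = vect.get_feature_names_out()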
len(X_train_tokens)

16825

In [17]:
nb.feature_count_

array([[26.,  4.,  1., ...,  0.,  0.,  0.],
       [39.,  5.,  0., ...,  1.,  1.,  1.]])

In [18]:
nb.feature_count_.shape

(2, 16825)

In [19]:
onestar_token_count = nb.feature_count_[0, :]
print(onestar_token_count)

fivestar_token_count = nb.feature_count_[1, :]
print(fivestar_token_count)

[26.  4.  1. ...  0.  0.  0.]
[39.  5.  0. ...  1.  1.  1.]


In [20]:
# create a DataFrame of tokens with their separate 5 star and 1 star  counts
tokens = pd.DataFrame({'token':X_train_tokens, 'onestar_token_count':onestar_token_count, 'fivestar_token_count':fivestar_token_count}).set_index('token')
# examine 5 random DataFrame rows
tokens.sample(5, random_state=6)

| token | fivestar_token_count | onestar_token_count |
|---|---|---|
| fourteen | 0.0 | 1.0 |
| bean | 38.0 | 7.0 |
| student | 6.0 | 0.0 |
| cerignola | 1.0 | 0.0 |
| bethany | 2.0 | 0.0 |


In [21]:
# add 1 to avoid dividing by 0
tokens['onestar_token_count'] = tokens.onestar_token_count + 1
tokens['fivestar_token_count'] = tokens.fivestar_token_count + 1
tokens.sample(5, random_state=6)

| token | fivestar_token_count | onestar_token_count |
|---|---|---|
| fourteen | 1.0 | 2.0 |
| bean | 39.0 | 8.0 |
| student | 7.0 | 1.0 |
| cerignola | 2.0 | 1.0 |
| bethany | 3.0 | 1.0 |


In [22]:
# Naive Bayes counts the number of observations in each class
nb.class_count_

array([ 565., 2499.])

In [23]:
# convert the smoothed counts into frequencies by dividing by the number of reviews in each class
tokens['onestar_token_count'] = tokens.onestar_token_count / nb.class_count_[0]
tokens['fivestar_token_count'] = tokens.fivestar_token_count / nb.class_count_[1]
# ratio of 5-star frequency to 1-star frequency
tokens['fivestar_ratio'] = tokens.fivestar_token_count / tokens.onestar_token_count

In [24]:
# top 10 tokens most predictive of 5-star reviews (highest 5-star to 1-star frequency ratio)
tokens.sort_values('fivestar_ratio', ascending=False).head(10)

| token | fivestar_token_count | onestar_token_count | fivestar_ratio |
|---|---|---|---|
| fantastic | 0.077231 | 0.00354 | 21.817727 |
| perfect | 0.098039 | 0.00531 | 18.464052 |
| yum | 0.02481 | 0.00177 | 14.017607 |
| favorite | 0.138055 | 0.012389 | 11.143029 |
| outstanding | 0.019608 | 0.00177 | 11.078431 |
| brunch | 0.016807 | 0.00177 | 9.495798 |
| gem | 0.016006 | 0.00177 | 9.043617 |
| mozzarella | 0.015606 | 0.00177 | 8.817527 |
| pasty | 0.015606 | 0.00177 | 8.817527 |
| amazing | 0.185274 | 0.021239 | 8.723323 |
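The ratio for a single token can be checked by hand. A short worked example for `fantastic`, assuming smoothed counts of 2 (1-star) and 193 (5-star); these counts are inferred from the frequencies in the table and the class sizes, and are not printed in the original output:

```python
# class sizes reported by nb.class_count_ above
n_onestar, n_fivestar = 565, 2499

# smoothed counts for 'fantastic' (assumption, inferred from the table):
# 0.00354 * 565 is about 2 and 0.077231 * 2499 is about 193
onestar_count, fivestar_count = 2, 193

onestar_freq = onestar_count / n_onestar
fivestar_freq = fivestar_count / n_fivestar
fivestar_ratio = fivestar_freq / onestar_freq

print(round(onestar_freq, 5), round(fivestar_freq, 6), round(fivestar_ratio, 6))
```

Because the frequencies are normalized by class size, the ratio compares how often the token appears per review in each class, which corrects for there being far more 5-star than 1-star reviews.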


In [25]:
# top 10 tokens most predictive of 1-star reviews (highest 1-star to 5-star frequency ratio)
tokens['onestar_ratio'] = tokens.onestar_token_count / tokens.fivestar_token_count
tokens.sort_values('onestar_ratio', ascending=False).head(10)

| token | fivestar_token_count | onestar_token_count | fivestar_ratio | onestar_ratio |
|---|---|---|---|---|
| staffperson | 0.0004 | 0.030088 | 0.013299 | 75.19115 |
| refused | 0.0004 | 0.024779 | 0.016149 | 61.922124 |
| disgusting | 0.0008 | 0.042478 | 0.018841 | 53.076106 |
| filthy | 0.0004 | 0.019469 | 0.020554 | 48.653097 |
| unacceptable | 0.0004 | 0.015929 | 0.025121 | 39.80708 |
| acknowledge | 0.0004 | 0.015929 | 0.025121 | 39.80708 |
| unprofessional | 0.0004 | 0.015929 | 0.025121 | 39.80708 |
| ugh | 0.0008 | 0.030088 | 0.026599 | 37.595575 |
| yuck | 0.0008 | 0.028319 | 0.028261 | 35.384071 |
| fuse | 0.0004 | 0.014159 | 0.028261 | 35.384071 |
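The steps of Task 8 can also be collected into one function that leaves the raw counts untouched. This is a sketch only: `token_ratios` and its column names are hypothetical, it assumes a fitted two-class `MultinomialNB`, and it is demonstrated on made-up reviews rather than `yelp.csv`:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def token_ratios(vect, nb):
    """Smoothed per-class token frequencies and their ratio (hypothetical helper)."""
    # order tokens by their column index in the document-term matrix
    tokens = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
    # add 1 to every count, then divide by the number of documents in each class
    freq = (nb.feature_count_ + 1) / nb.class_count_[:, None]
    # row 0 is the lower class label, row 1 the higher one (nb.classes_ is sorted)
    out = pd.DataFrame({'low_freq': freq[0], 'high_freq': freq[1]}, index=tokens)
    out['high_ratio'] = out.high_freq / out.low_freq
    return out

# hypothetical toy data
toy_text = ['great food', 'nice staff', 'great place recommend',
            'dirty kitchen', 'poor service']
toy_stars = [5, 5, 5, 1, 1]

vect_toy = CountVectorizer()
nb_toy = MultinomialNB().fit(vect_toy.fit_transform(toy_text), toy_stars)

ratios = token_ratios(vect_toy, nb_toy)
print(ratios.sort_values('high_ratio', ascending=False).head(3))
print(ratios.sort_values('high_ratio').head(3))
```

Sorting `high_ratio` in descending order lists the tokens most predictive of the higher class, and sorting in ascending order lists those most predictive of the lower class, so a separate inverse-ratio column is not strictly needed.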
