# Tutorial Exercise: Yelp reviews

## Introduction

This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (A higher number of stars is better.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.

**Goal:** Predict the star rating of a review using **only** the review text.

**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.

## Task 1

Read **`yelp.csv`** into a pandas DataFrame and examine it.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('yelp.csv')
df.head()

| | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | review | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 |
| 1 | ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA | 5 | I have no idea why some people give bad review... | review | 0a2KyEL0d3Yb1V6aivbIuQ | 0 | 0 | 0 |
| 2 | 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 | IESLBzqUCLdSzSqm0eCSxQ | 4 | love the gyro plate. Rice is so good and I als... | review | 0hT2KtfLiobPvh6cDC8JQg | 0 | 1 | 0 |
| 3 | _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 | G-WvGaISbqqaMHlNnByodA | 5 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | review | uZetl9T0NcROGOyFfughhg | 1 | 2 | 0 |
| 4 | 6ozycU1RpktNG2-1BroVtw | 2012-01-05 | 1uJFq2r5QfJG_6ExMRCaGw | 5 | General Manager Scott Petello is a good egg!!!... | review | vYmM4KTsC8ZfQBg-j5MWkw | 0 | 0 | 0 |


## Task 2

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this.

In [3]:
one_or_five = df[(df['stars'] == 1) | (df['stars'] == 5)]
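The same filter can also be written with `Series.isin`, which stays readable when more than two values are allowed. The sketch below is only an illustration on a made-up `toy` DataFrame (its contents are hypothetical, not the Yelp data) and checks that both forms select the same rows:

```python
import pandas as pd

# Toy stand-in for the Yelp data (hypothetical values, for illustration only).
toy = pd.DataFrame({
    'stars': [1, 3, 5, 4, 5, 2, 1],
    'text': ['bad', 'ok', 'great', 'good', 'love it', 'meh', 'awful'],
})

# Two conditions combined with OR, as in the cell above.
by_or = toy[(toy['stars'] == 1) | (toy['stars'] == 5)]

# The same filter expressed with isin.
by_isin = toy[toy['stars'].isin([1, 5])]

print(by_isin.shape)
print(by_or.equals(by_isin))
```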

## Task 3

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.

In [29]:
from sklearn.model_selection import train_test_split

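def st(p):
    # Print the shape and type of an object as a quick sanity check
    print(p.shape, type(p))

# Keep only the feature (review text) and the response (star rating)
ds = one_or_five[['text', 'stars']]

# Split the rows into training and testing sets (default: 75% / 25%).
# X and y are pulled out of these two DataFrames as Series in the next task.
train, test = train_test_split(ds, random_state=1337)

st(train)
st(test)

train.head()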

(3064, 2) <class 'pandas.core.frame.DataFrame'>
(1022, 2) <class 'pandas.core.frame.DataFrame'>


| | text | stars |
|---|---|---|
| 4756 | I am not religious, but I will be the first pe... | 5 |
| 1772 | This place was incredible. The staff was frie... | 5 |
| 84 | really, I can't believe this place has receive... | 1 |
| 9482 | Place is awsome. Got the beef cheesesteak and... | 5 |
| 8730 | Yum! Healthy selections and delicious flavors! | 5 |
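The hint asks for X as a Series. A common alternative to splitting the whole DataFrame is to define X and y first and pass both to `train_test_split`, which then returns four objects. This is a minimal sketch on made-up data; the `toy` frame and the `text_train`/`text_test` names are assumptions for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the filtered reviews (hypothetical values).
toy = pd.DataFrame({
    'text': ['great food', 'awful service', 'loved it', 'never again',
             'best tacos', 'so rude', 'fantastic', 'terrible'],
    'stars': [5, 1, 5, 1, 5, 1, 5, 1],
})

X = toy['text']    # single brackets give a Series; toy[['text']] would be a DataFrame
y = toy['stars']

# With the default settings, 25% of the rows go to the testing set.
text_train, text_test, y_train, y_test = train_test_split(X, y, random_state=1)

print(type(text_train), text_train.shape)
print(type(y_train), y_train.shape)
```

Because X and y are split together, the row labels of `text_train` and `y_train` stay aligned.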


## Task 4

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

# Drop English stop words, strip accents, and ignore tokens that appear in fewer than 2 reviews
cv = CountVectorizer(stop_words='english', strip_accents='unicode', min_df=2)

# Learn the vocabulary from the training text only, then reuse it for the testing text
X_train = cv.fit_transform(train['text'])
X_test = cv.transform(test['text'])
y_train = train['stars']
y_test = test['stars']

st(X_train)
st(X_test)
st(y_train)
st(y_test)

(3064, 8444) <class 'scipy.sparse.csr.csr_matrix'>
(1022, 8444) <class 'scipy.sparse.csr.csr_matrix'>
(3064,) <class 'pandas.core.series.Series'>
(1022,) <class 'pandas.core.series.Series'>
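To see what a document-term matrix contains, it can help to build one from a handful of short strings. The sketch below uses made-up documents (not the Yelp reviews) and the default `CountVectorizer` settings; column names are recovered from the fitted `vocabulary_` mapping:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, for illustration only.
train_docs = pd.Series(['good food good service', 'bad food', 'good place'])
test_docs = pd.Series(['bad service at a new place'])

vect = CountVectorizer()
train_dtm = vect.fit_transform(train_docs)   # learn the vocabulary, then count tokens
test_dtm = vect.transform(test_docs)         # reuse the vocabulary; unseen tokens are ignored

# vocabulary_ maps each token to its column index, so sort tokens by that index.
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

print(pd.DataFrame(train_dtm.toarray(), columns=vocab))
print(pd.DataFrame(test_dtm.toarray(), columns=vocab))
```

Each row is a document and each column is a token. The testing matrix has the same columns as the training matrix, which is why the two shapes above share the same second dimension.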


## Task 5

Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

In [31]:
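from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

clf = MultinomialNB()
clf.fit(X_train, y_train)

# For classifiers, score() returns the classification accuracy
print('accuracy', clf.score(X_test, y_test))

# Rows are actual classes and columns are predicted classes, in sorted label order (1, then 5)
y_pred = clf.predict(X_test)
metrics.confusion_matrix(y_test, y_pred)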

accuracy 0.9256360078277887


array([[144,  39],
       [ 37, 802]])
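The accuracy can be recovered from the confusion matrix by hand. The sketch below copies the matrix printed above and, as an assumption for the purpose of naming the cells, treats 5 stars as the positive class:

```python
import numpy as np

# Confusion matrix copied from the output above: rows are actual classes,
# columns are predicted classes, both in sorted label order [1, 5].
cm = np.array([[144, 39],
               [37, 802]])

# With 5 stars as the positive class, the cells unpack in this order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()    # correct predictions over all predictions
sensitivity = tp / (tp + fn)       # share of actual 5-star reviews predicted as 5
specificity = tn / (tn + fp)       # share of actual 1-star reviews predicted as 1

print(accuracy, sensitivity, specificity)
```

The diagonal (144 + 802) holds the correct predictions, and dividing by the 1022 testing reviews reproduces the accuracy printed above.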

## Task 6 (Challenge)

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!

In [214]:
print(y_test.value_counts())
print('NULL accuracy', y_test.value_counts().head(1) / len(y_test))


5    811
1    211
Name: stars, dtype: int64
NULL accuracy 5    0.793542
Name: stars, dtype: float64
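The expression above returns a one-element Series. If a plain number is preferred, one option is to normalize the counts and take the maximum. This is a sketch on a made-up response vector (`y_toy` is hypothetical), followed by the same arithmetic using the counts printed above:

```python
import pandas as pd

# Toy response vector (hypothetical), standing in for y_test.
y_toy = pd.Series([5, 5, 5, 1, 5, 1, 5, 5, 1, 5], name='stars')

# Proportion of each class, then keep the largest proportion.
null_accuracy = y_toy.value_counts(normalize=True).max()
print(null_accuracy)

# The same calculation by hand, using the counts from the output above
# (811 five-star reviews and 211 one-star reviews).
null_accuracy_by_hand = 811 / (811 + 211)
print(null_accuracy_by_hand)
```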


## Task 7 (Challenge)

Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?

In [42]:
# y_pred is a NumPy array, while test['text'] and y_test are pandas Series
# that still carry the original row labels, so the two are not aligned by index.
# Before converting anything, look at a few rows side by side by position.
print(type(test['text']), type(y_test), type(y_pred))
print(test['text'].head())
print(test['text'][9299], '------', y_test[9299], y_pred[0])
print(test['text'][2618], '------', y_test[2618], y_pred[1])
print(test['text'][9277], '------', y_test[9277], y_pred[2])
# pd.Series(y_pred)


<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'> <class 'numpy.ndarray'>
9299    The salad plates were not chilled... As they u...
2618    Love this place, it's on our dinner rotation. ...
9277    This place is fantastic.\n\nAbout the restaura...
656     No matter where you go in the World if you are...
7895    I grew up in Arizona and we went to Wongs ever...
Name: text, dtype: object
The salad plates were not chilled... As they usually are it however is a busy night but something's are expected at the olive garden... The food tasted like it was under a hot lamp for so long it tasted almost hard and also under cooked for the seafood... Not a great experience at the olive garden there are certain things expected... The expectations are not that high shouldnt be that hard to do.... ------ 1 5
Love this place, it's on our dinner rotation. The owners and wait staff always greet us since we are regulars. Great place and cheap too. Try the #100 its the best ever. ----

In [70]:
# value = actual - predicted, with 5 stars as the positive class:
#   1 - 5 = -4: actually 1 star but predicted 5 stars (false positive)
#   5 - 1 =  4: actually 5 stars but predicted 1 star (false negative)

res = test['text'].reset_index()
fpn = pd.DataFrame({'text': res['text'], 'old_index': res['index'], 'value': y_test.reset_index(drop=True) - pd.Series(y_pred)})

print(fpn.head())

false_positives = fpn[fpn['value'] == -4]
false_negatives = fpn[fpn['value'] == 4]

# Should equal the off-diagonal total of the confusion matrix (39 + 37 = 76)
print(len(false_positives) + len(false_negatives))

def print_loc(idx):
    f = fpn.iloc[idx]
    print(f['text'], 'actual:', y_test[f['old_index']], 'predicted', y_pred[idx])
    print('-------------------')

# Examine some false positives
print(false_positives.head())
print_loc(0)
print_loc(57)
print_loc(94)

# Examine some false negatives
print('False negatives')
print(false_negatives.head())
print_loc(12)
print_loc(65)
print_loc(151)

                                                text  old_index  value
0  The salad plates were not chilled... As they u...       9299     -4
1  Love this place, it's on our dinner rotation. ...       2618      0
2  This place is fantastic.\n\nAbout the restaura...       9277      0
3  No matter where you go in the World if you are...        656      0
4  I grew up in Arizona and we went to Wongs ever...       7895      0
76
                                                  text  old_index  value
0    The salad plates were not chilled... As they u...       9299     -4
57   Take your money elsewhere, unless you've got k...        119     -4
94   Price reflects quality and service.  5/2010 pu...       8081     -4
156  How very disappointing!  Nello's had a delight...       2625     -4
212                 Poor service-small portions-pricey       5011     -4
The salad plates were not chilled... As they usually are it however is a busy night but something's are expected at the olive garden.
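An alternative to subtracting the labels is to select the misclassified reviews directly with boolean masks. The sketch below uses made-up stand-ins (`text`, `y_true`, and `y_hat` are hypothetical names and values) that mimic the situation above: two Series with non-consecutive row labels and a NumPy array of predictions in the same order:

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical): the Series keep shuffled row labels, like a test split does.
text = pd.Series(['cold food', 'love it', 'fine I guess', 'so rude', 'the best'],
                 index=[40, 12, 77, 3, 58])
y_true = pd.Series([1, 5, 5, 1, 5], index=text.index)
y_hat = np.array([5, 5, 1, 1, 5])

# Combining a boolean Series with a boolean array of the same length works by position,
# and the result keeps the Series index, so it can be used to filter `text`.
false_positives = text[(y_true == 1) & (y_hat == 5)]   # actually 1 star, predicted 5
false_negatives = text[(y_true == 5) & (y_hat == 1)]   # actually 5 stars, predicted 1

print(false_positives)
print(false_negatives)
```

The filtered Series keep their original row labels, so each review can still be traced back to the source DataFrame.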

## Task 8 (Challenge)

Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.

- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.

In [83]:
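# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0, use it instead
fn = cv.get_feature_names_out()

# feature_count_ has one row per class in sorted label order: row 0 is 1-star, row 1 is 5-star
star_1 = clf.feature_count_[0]
star_5 = clf.feature_count_[1]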


top_df = pd.DataFrame({'feature_names': fn, 'star_1': star_1, 'star_5': star_5})

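# Sort by raw per-class counts. Tokens that are common in both classes
# (such as 'food' and 'place') rank high in both lists.
top_bad = top_df.sort_values('star_1', ascending=False)
top_good = top_df.sort_values('star_5', ascending=False)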

print('TOP 10 tokens indicating bad review')
print(top_bad[:10])

print('TOP 10 tokens indicating good review')
print(top_good[:10])

TOP 10 tokens indicating bad review
     feature_names  star_1  star_5
3061          food   461.0  1258.0
5643         place   368.0  1462.0
4416          like   337.0   932.0
4152          just   322.0   936.0
6684       service   238.0   688.0
7670          time   234.0   798.0
3352          good   234.0  1234.0
2378           don   190.0   508.0
4866       minutes   176.0    96.0
3364           got   176.0   358.0
TOP 10 tokens indicating good review
     feature_names  star_1  star_5
3411         great    74.0  1537.0
5643         place   368.0  1462.0
3061          food   461.0  1258.0
3352          good   234.0  1234.0
4152          just   322.0   936.0
4416          like   337.0   932.0
4535          love    38.0   831.0
7670          time   234.0   798.0
8029            ve   118.0   733.0
811           best    53.0   730.0
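Raw counts favor tokens that are frequent everywhere. One common way to rank tokens by how strongly they point to a class is to convert the counts to per-class frequencies using `class_count_` and then take the ratio between the two classes. The sketch below is an assumption about how this could be done, not necessarily the intended solution, and it runs on made-up reviews (`docs` and `stars` are hypothetical):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up reviews and ratings, for illustration only.
docs = ['great food great service', 'love this place', 'great place',
        'rude service', 'waited thirty minutes', 'rude staff bad food']
stars = [5, 5, 5, 1, 1, 1]

vect = CountVectorizer()
dtm = vect.fit_transform(docs)
nb = MultinomialNB().fit(dtm, stars)

# Column order of the matrix, recovered from the fitted vocabulary.
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
tokens = pd.DataFrame({
    'one_star': nb.feature_count_[0],    # row 0 corresponds to class 1
    'five_star': nb.feature_count_[1],   # row 1 corresponds to class 5
}, index=vocab)

# Add 1 to every count so that no ratio divides by zero, then divide by the
# number of reviews in each class to get comparable frequencies.
tokens['one_star'] = (tokens['one_star'] + 1) / nb.class_count_[0]
tokens['five_star'] = (tokens['five_star'] + 1) / nb.class_count_[1]
tokens['five_star_ratio'] = tokens['five_star'] / tokens['one_star']

ranked = tokens.sort_values('five_star_ratio', ascending=False)
print(ranked.head(3))   # tokens most associated with 5-star reviews
print(ranked.tail(3))   # tokens most associated with 1-star reviews
```

With this ranking, a token that is equally common in both classes gets a ratio near 1 and drops out of both ends of the list.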


## Task 9 (Challenge)

Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.

Here are the steps:

- Define X and y using the original DataFrame. (y should contain 5 different classes.)
- Split X and y into training and testing sets.
- Create document-term matrices using CountVectorizer.
- Calculate the testing accuracy of a Multinomial Naive Bayes model.
- Compare the testing accuracy with the null accuracy, and comment on the results.
- Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)
- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!
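The steps above can be sketched as a single helper. `evaluate_all_stars` is a hypothetical function name, the default `CountVectorizer` settings are an assumption, and the `toy` data is made up so that the block runs on its own. With the real data, the DataFrame loaded from `yelp.csv` would be passed instead:

```python
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def evaluate_all_stars(frame, random_state=1):
    """Hypothetical helper: run the listed steps on a frame with 'text' and 'stars' columns."""
    # Define X and y, then split them into training and testing sets.
    X, y = frame['text'], frame['stars']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)

    # Create document-term matrices (vocabulary learned from the training text only).
    vect = CountVectorizer()
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)

    # Fit multinomial Naive Bayes and predict the testing set.
    nb = MultinomialNB().fit(X_train_dtm, y_train)
    y_pred = nb.predict(X_test_dtm)

    return {
        'accuracy': metrics.accuracy_score(y_test, y_pred),
        'null_accuracy': y_test.value_counts(normalize=True).max(),
        'confusion_matrix': metrics.confusion_matrix(y_test, y_pred, labels=[1, 2, 3, 4, 5]),
        'report': metrics.classification_report(y_test, y_pred),
    }


# Tiny made-up data set with all five classes (illustration only).
toy = pd.DataFrame({
    'text': ['awful', 'bad', 'okay', 'good', 'great'] * 8,
    'stars': [1, 2, 3, 4, 5] * 8,
})

results = evaluate_all_stars(toy)
print(results['accuracy'], results['null_accuracy'])
print(results['confusion_matrix'])
print(results['report'])
```

Passing `labels=[1, 2, 3, 4, 5]` fixes the row and column order of the confusion matrix, so it stays 5 by 5 even if a class happens to be missing from the testing set.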