# Homework with Yelp reviews data

## Introduction

This assignment uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- **`yelp.csv`** contains the dataset. It is stored in the course repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.

**Goal:** Predict the star rating of a review using **only** the review text.

**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.

## Task 1

Read **`yelp.csv`** into a Pandas DataFrame and examine it.

In [1]:
import pandas as pd

data = pd.read_csv("../data/yelp.csv")

In [2]:
data.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [3]:
data.describe()

Unnamed: 0,stars,cool,useful,funny
count,10000.0,10000.0,10000.0,10000.0
mean,3.7775,0.8768,1.4093,0.7013
std,1.214636,2.067861,2.336647,1.907942
min,1.0,0.0,0.0,0.0
25%,3.0,0.0,0.0,0.0
50%,4.0,0.0,1.0,0.0
75%,5.0,1.0,2.0,1.0
max,5.0,77.0,76.0,57.0


In [4]:
data_star_1 = data[data['stars'] == 1][['stars','text', 'type', 'cool', 'useful']]
data_star_5 = data[data['stars'] == 5][['stars','text', 'type', 'cool', 'useful']]

In [5]:
data_star_1.count()

stars     749
text      749
type      749
cool      749
useful    749
dtype: int64

In [6]:
data_star_5.count()

stars     3337
text      3337
type      3337
cool      3337
useful    3337
dtype: int64

## Task 2

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](https://www.youtube.com/watch?v=YPItfQ87qjM&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=9) explains how to do this.

In [7]:
data_15 = data[(data.stars == 1) | (data.stars == 5)]

In [8]:
data_15.sample(5)

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
4665,IuAPYzf3NSyfyXYgT46YVA,2011-08-05,G_WnFnEd6H5grE3Dr9zb2g,5,Fried Squash blossoms! How do they come up wi...,review,Zwtx7FLZcK5F2P0-Tz8mHg,2,3,1
6542,4JOv7EnnfZ8fD3JunQQpyg,2009-01-27,YydJLd9f9e9JhwKjuZf-wg,5,Tempe Beach Park - such a gem in our landlocke...,review,fczQCSmaWF78toLEmb0Zsw,8,10,5
7614,-5rFC4EVrT-v8g1PSEf6Xg,2011-07-16,si-nMGBPO5Q8Axg0zgly0g,5,I was impressed with this place. I had the Mon...,review,0BfSTRcQpBTeNlx9SuAGWw,0,1,0
5962,xmjv8g356v8Qo55ICjG8rg,2011-03-11,4gElVx0ozu6OLz47Dcz4Bg,5,I love CK's. Their happy hour specials are gre...,review,GeGDZ02UfARKBRl7LqKZuA,0,1,0
8357,gUt-pPUpOVVhaCFC8-E4yQ,2011-03-27,Qv1jJdZftPlfJef45OX3GQ,5,Bomb ............................................,review,_tTNzjkD-pvWqSb-Ahw9Uw,0,0,0


## Task 3

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

- **Hint:** Keep in mind that X should be a Pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.

In [9]:
X = data_15.text
y = data_15.stars

In [10]:
X.head()

0    My wife took me here on my birthday for breakf...
1    I have no idea why some people give bad review...
3    Rosie, Dakota, and I LOVE Chaparral Dog Park!!...
4    General Manager Scott Petello is a good egg!!!...
6    Drop what you're doing and drive here. After I...
Name: text, dtype: object

In [11]:
y.head()

0    5
1    5
3    5
4    5
6    5
Name: stars, dtype: int64

In [12]:
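# train_test_split lives in sklearn.model_selection (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split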

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

In [14]:
print(X_train.shape)
print(y_train.shape)

(3064,)
(3064,)
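
The shapes above can be sanity-checked by hand. The following is a minimal sketch, assuming `train_test_split` is using its default `test_size` of 0.25 and that the test share is rounded up (my understanding of scikit-learn's behavior, worth confirming against your installed version). The row counts are the ones printed for the 1-star and 5-star subsets in Task 1.

```python
import math

# Row counts printed earlier for the 1-star and 5-star subsets
n_one_star = 749
n_five_star = 3337
n_total = n_one_star + n_five_star

# Assumption: default test_size=0.25, with the test share rounded up
n_test = int(math.ceil(0.25 * n_total))
n_train = n_total - n_test

print(n_total, n_train, n_test)
```

If the training shape you get differs from `n_train`, the filter in Task 2 or the split arguments are the first things to re-check.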


In [15]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

## Task 4

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [16]:
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm

<3064x16530 sparse matrix of type '<type 'numpy.int64'>'
	with 234651 stored elements in Compressed Sparse Row format>

In [17]:
# on scikit-learn 1.0 and later, use vect.get_feature_names_out() instead
vect.get_feature_names()

[u'00',
 u'000',
 u'00am',
 u'00pm',
 u'01',
 u'02',
 u'03',
 u'03342',
 u'04',
 u'05',
 u'06',
 u'07',
 u'09',
 u'0l',
 u'10',
 u'100',
 u'1000',
 u'1000x',
 u'100s',
 u'100th',
 u'101',
 u'102',
 u'1030',
 u'105',
 u'108',
 u'109',
 u'10am',
 u'10ish',
 u'10min',
 u'10mins',
 u'10minutes',
 u'10pm',
 u'10th',
 u'10x',
 u'10yo',
 u'11',
 u'110',
 u'112',
 u'115',
 u'115th',
 u'116',
 u'118',
 u'11a',
 u'11am',
 u'12',
 u'120',
 u'128i',
 u'129',
 u'12am',
 u'12oz',
 u'12pm',
 u'12th',
 u'13',
 u'1300',
 u'13331',
 u'13th',
 u'14',
 u'140',
 u'147',
 u'14lbs',
 u'15',
 u'150',
 u'1500',
 u'150mm',
 u'15am',
 u'15mins',
 u'15th',
 u'16',
 u'160',
 u'165',
 u'16th',
 u'17',
 u'175',
 u'17th',
 u'18',
 u'180',
 u'1800',
 u'1895',
 u'18th',
 u'19',
 u'1900',
 u'1913',
 u'1940',
 u'1955',
 u'1956',
 u'1960',
 u'1961',
 u'1968',
 u'1970',
 u'1980',
 u'1980s',
 u'1987',
 u'1990',
 u'1990s',
 u'1995',
 u'1996',
 u'1998',
 u'1999',
 u'19th',
 u'1cent',
 u'1k',
 u'1p',
 u'1pm',
 u'1st',
 u'20',


In [18]:
# do not refit on X_test: reuse the vocabulary learned from X_train
X_test_dtm = vect.transform(X_test)
X_test_dtm

<1022x16530 sparse matrix of type '<type 'numpy.int64'>'
	with 79710 stored elements in Compressed Sparse Row format>
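
Both matrices above have 16530 columns because the vocabulary is learned from the training set only and then reused for the testing set. The sketch below illustrates that idea in plain Python on made-up documents. The `tokenize` and `transform` helpers are hypothetical stand-ins, not CountVectorizer's actual implementation, and the tokenizer is cruder than CountVectorizer's default.

```python
from collections import Counter

def tokenize(text):
    # Crude stand-in for CountVectorizer's default tokenizer:
    # lowercase, split on whitespace, keep alphanumeric tokens of 2+ characters
    return [tok for tok in text.lower().split() if tok.isalnum() and len(tok) >= 2]

train_docs = ["great food great service", "terrible service"]
test_docs = ["great pizza", "terrible terrible food"]

# "fit": learn a sorted vocabulary from the training documents only
vocab = sorted({tok for doc in train_docs for tok in tokenize(doc)})

def transform(docs):
    # "transform": count only the tokens that are already in the vocabulary
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[tok] for tok in vocab])
    return rows

train_dtm = transform(train_docs)
test_dtm = transform(test_docs)

print(vocab)
print(train_dtm)
print(test_dtm)
```

Note that "pizza" never appears in the training documents, so it gets no column and is silently dropped from the test matrix. Fitting again on the test set would change the columns and break the match with the trained model.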

## Task 5

Use Multinomial Naive Bayes to predict the star rating for the reviews in the testing set, and then calculate the accuracy and print the confusion matrix.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

In [19]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [20]:
# train the model using X_train_dtm (timing it with an IPython "magic command")
%time nb.fit(X_train_dtm, y_train)

CPU times: user 5.65 ms, sys: 1.7 ms, total: 7.35 ms
Wall time: 10.1 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [21]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [22]:
y_pred_class

array([5, 5, 5, ..., 5, 1, 5])

In [23]:
# calculate the model accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.92759295499


In [24]:
# Confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

[[128  61]
 [ 13 820]]
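
The accuracy reported earlier can be recovered from this matrix by hand. The sketch below assumes the usual scikit-learn layout (rows are true labels, columns are predicted labels, both in sorted order, so 1-star comes first) and, for illustration only, treats 5-star as the "positive" class.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions,
# both in sorted label order (1-star first, then 5-star)
confusion = [[128, 61],
             [13, 820]]

correct = confusion[0][0] + confusion[1][1]
total = sum(sum(row) for row in confusion)
accuracy = correct / float(total)

# Assumption for this sketch: 5-star is the "positive" class
tn, fp = confusion[0]
fn, tp = confusion[1]
recall_5_star = tp / float(tp + fn)  # share of true 5-star reviews predicted as 5-star
recall_1_star = tn / float(tn + fp)  # share of true 1-star reviews predicted as 1-star

print(round(accuracy, 4), round(recall_5_star, 4), round(recall_1_star, 4))
```

The gap between the two recall values shows that the model is much better at recognizing 5-star reviews than 1-star reviews, which the single accuracy number hides.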


In [25]:
print("True: ", y_test.values[10:30])
print("Pred: ", y_pred_class[10:30])

('True: ', array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 1, 5, 1, 1, 5, 5, 5, 5]))
('Pred: ', array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 1, 5, 5, 5, 5, 5, 5, 5]))


## Task 6 (Challenge)

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!

In [26]:
sy_class = pd.Series(y_test)

In [27]:
sy_class.value_counts()

5    833
1    189
dtype: int64

In [28]:
sy_class

5646    5
5573    5
2747    5
201     5
8406    5
9866    5
1177    5
1291    1
6724    5
1288    5
7800    5
1749    5
74      5
8143    5
7569    5
5240    5
5761    5
8025    5
334     5
9178    5
8697    5
4445    5
2726    1
6002    5
9351    1
1227    1
9667    5
266     5
9212    5
4106    5
       ..
3019    5
8605    5
2006    5
2306    5
2615    1
8233    1
2297    1
7413    1
9082    5
7377    1
1747    1
1116    1
4034    5
754     5
870     5
8642    1
2129    5
8072    1
4962    5
1253    5
145     5
9782    5
4       5
6984    5
8270    1
101     5
1458    5
9474    5
8758    1
6378    5
Name: stars, dtype: int64

In [29]:
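# null accuracy: share of the most frequent class
# (the labels are 1 and 5, not 0 and 1, so the mean-based shortcut does not apply)
print(sy_class.value_counts(normalize=True).max())

0.815068493151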
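
As a cross-check, null accuracy can be computed directly from the class counts shown by `value_counts()` above. The sketch below rebuilds the test labels from those counts in plain Python. It also shows why a shortcut based on the mean of `y` fails here: that shortcut assumes labels coded as 0 and 1, and with labels 1 and 5 it produces a meaningless negative number.

```python
from collections import Counter

# Rebuild the test labels from the class counts shown by value_counts() above
y_test_labels = [5] * 833 + [1] * 189

counts = Counter(y_test_labels)
most_common_label, most_common_count = counts.most_common(1)[0]
null_accuracy = most_common_count / float(len(y_test_labels))

# The 0/1 shortcut (based on the mean of y) breaks for labels 1 and 5
mean_label = sum(y_test_labels) / float(len(y_test_labels))

print(most_common_label, round(null_accuracy, 4))
print(round(1 - mean_label, 4))
```

A model is only useful if its testing accuracy clearly beats the null accuracy, so this is the number to compare against the accuracy from Task 5.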


## Task 7 (Challenge)

Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?

In [30]:
print(X_test.head())
print(y_test.head())

5646    So I was introduced to this station about 6 mo...
5573    Fair and honest prices.  Friends and I have be...
2747    Better than Starbucks anyday! \nThe staff enco...
201     A group of us from the IVAA Summit went to Zin...
8406    went down to Scottsdale for Spring Training (G...
Name: text, dtype: object
5646    5
5573    5
2747    5
201     5
8406    5
Name: stars, dtype: int64


In [31]:
sy_pred_class = pd.Series(y_pred_class)
sy_pred_class.head()

0    5
1    5
2    5
3    5
4    5
dtype: int64

In [32]:
new_df = pd.DataFrame({'text': X_test, 'True': y_test, 'Pred': y_pred_class})

In [33]:
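# predicted 5-star but actually 1-star (false positives if 5-star is treated as the positive class)
new_df[new_df['Pred'] > new_df['True']]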

Unnamed: 0,Pred,True,text
9351,5,1,"Dr. Pierrend is ok, he will write prescription..."
1227,5,1,This place was messy and loud. The food reall...
9296,5,1,My boyfriend and I tried this place last year ...
1406,5,1,"From my door, it's a five minute stroll throug..."
6229,5,1,Forget the yogurt and the berry berry bad serv...
7803,5,1,I'm sad to report that we dined here for lunch...
4562,5,1,despite it's billing as the 'largest thrift st...
6977,5,1,Had to eat here with coworkers for lunch today...
5814,5,1,I do enjoy a good bowl of Pho. This was -not...
666,5,1,I waited 45min and ended up with a tiny gross ...


In [34]:
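# predicted 1-star but actually 5-star (false negatives if 5-star is treated as the positive class)
new_df[new_df['Pred'] < new_df['True']]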

Unnamed: 0,Pred,True,text
3052,1,5,When I met some friends for dinner at this res...
3999,1,5,TJ was there for me when my water heater broke...
6376,1,5,They have a mechanical bull. Need I say more?
6512,1,5,When my youngest son graduated I took him to B...
3075,1,5,Unfortunately Out of Business.
6050,1,5,I went to sears today to check on a layaway th...
7903,1,5,"First, I'm sorry this review is lengthy, but i..."
2444,1,5,EXCELLENT CUSTOMER SERVICE! \n\nEven with Happ...
3149,1,5,I was told to see Greg after a local shop diag...
6334,1,5,I came here today for a manicure and pedicure....


## Task 8 (Challenge)

Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.

- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.

In [40]:
nb.feature_count_

array([[ 28.,   4.,   3., ...,   0.,   0.,   0.],
       [ 36.,   7.,   2., ...,   1.,   1.,   1.]])

In [42]:
nb.class_count_

array([  560.,  2504.])

In [43]:
X_train_tokens = vect.get_feature_names()
len(X_train_tokens)

16530
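
The cells above collect the ingredients but stop short of ranking the tokens. One common approach (a sketch, not necessarily the intended solution) is to convert the per-class counts into per-class frequencies and rank tokens by the ratio of the two. The example below uses made-up counts in plain Python lists. With the fitted model, the counts would come from `nb.feature_count_` and `nb.class_count_`, and the token names from the vectorizer.

```python
# Hypothetical token counts per class; with a fitted model these would come from
# nb.feature_count_ (one row per class) and nb.class_count_
tokens = ["amazing", "food", "refund", "rude"]
one_star_counts = [2.0, 30.0, 12.0, 20.0]
five_star_counts = [90.0, 120.0, 1.0, 3.0]
n_one_star, n_five_star = 50.0, 200.0

ratios = []
for tok, c1, c5 in zip(tokens, one_star_counts, five_star_counts):
    # Add 1 to each count to avoid dividing by zero, then normalize by class size
    freq_1 = (c1 + 1) / n_one_star
    freq_5 = (c5 + 1) / n_five_star
    ratios.append((freq_5 / freq_1, tok))

# Ascending sort: smallest ratios lean 1-star, largest ratios lean 5-star
ratios.sort()
most_one_star = [tok for _, tok in ratios[:2]]
most_five_star = [tok for _, tok in ratios[-2:][::-1]]

print(most_five_star)
print(most_one_star)
```

Normalizing by class size matters because the classes are imbalanced: without it, almost every token would look "5-star" simply because there are far more 5-star reviews in the training set.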

## Task 9 (Challenge)

Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.

Here are the steps:

- Define X and y using the original DataFrame. (y should contain 5 different classes.)
- Split X and y into training and testing sets.
- Create document-term matrices using CountVectorizer.
- Calculate the testing accuracy of a Multinomial Naive Bayes model.
- Compare the testing accuracy with the null accuracy, and comment on the results.
- Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)
- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!
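
For the last step, the per-class metrics in a classification report can be derived from a multi-class confusion matrix. The sketch below uses a hypothetical 3-class matrix (not results from this dataset) and assumes rows are true classes and columns are predicted classes.

```python
# Hypothetical 3-class confusion matrix (rows = true class, columns = predicted class)
labels = [1, 2, 3]
confusion = [[5, 2, 0],
             [1, 6, 3],
             [0, 2, 8]]

report = {}
for i, label in enumerate(labels):
    tp = confusion[i][i]
    predicted_as_label = sum(row[i] for row in confusion)  # column total
    truly_label = sum(confusion[i])                        # row total (support)
    precision = tp / float(predicted_as_label)
    recall = tp / float(truly_label)
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1, truly_label)

for label in labels:
    precision, recall, f1, support = report[label]
    print(label, round(precision, 2), round(recall, 2), round(f1, 2), support)
```

Precision divides the diagonal entry by its column total, recall divides it by its row total, and support is the row total itself. The same recipe extends to the 5x5 matrix in this task.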