
# Introduction
This exercise uses a small subset of the data from Kaggle's Yelp Business Rating Prediction competition.
Description of the data:
1. yelp.csv contains the dataset. It is stored in the repository (in the data directory), so there is no need to download anything from the Kaggle website.
2. Each observation (row) in this dataset is a review of a particular business by a particular user.
3. The stars column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.
4. The text column is the text of the review.
5. **Goal:** Predict the star rating of a review using only the review text.


## Task 1
Read yelp.csv into a pandas DataFrame and examine it.

In [4]:
import pandas as pd

# The file holds one JSON object per line, so read it as JSON Lines
path = r'..\..\HeavyDataset\Yelp Business Rating Prediction\yelp_training_set\yelp_training_set_review.json'
df = pd.read_json(path, lines=True)

In [5]:
df.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,"{u'funny': 0, u'useful': 5, u'cool': 2}"
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,"{u'funny': 0, u'useful': 0, u'cool': 0}"
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,"{u'funny': 0, u'useful': 1, u'cool': 0}"
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,"{u'funny': 0, u'useful': 2, u'cool': 1}"
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,"{u'funny': 0, u'useful': 0, u'cool': 0}"
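
The review file is newline-delimited JSON: one review object per line rather than a single JSON array. As a minimal sketch of what parsing that layout amounts to, the snippet below uses only the standard library and two made-up records in place of the real file. Wrapping the lines in brackets and joining them with commas yields the same records.

```python
import io
import json

# Two hypothetical records standing in for lines of the review file.
sample = io.StringIO(
    '{"stars": 5, "text": "Great breakfast"}\n'
    '{"stars": 1, "text": "Never again"}\n'
)

# Parse one JSON object per non-empty line.
records = [json.loads(line) for line in sample if line.strip()]

# Same result when the lines are wrapped into a single JSON array string.
sample.seek(0)
joined = "[" + ",".join(line.strip() for line in sample if line.strip()) + "]"
assert json.loads(joined) == records

print(records[0]["stars"], records[1]["text"])  # 5 Never again
```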


## Task 2
Create a new DataFrame that only contains the 5-star and 1-star reviews.
Hint: "How do I apply multiple filter criteria to a pandas DataFrame?" explains how to do this.

In [6]:
dftoplow = df[(df['stars']==1) | (df['stars']==5) ]

In [7]:
dftoplow.shape

(93709, 8)

## Task 3
Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the review text as the only feature and the star rating as the response.
Hint: Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.

In [8]:
X =  dftoplow['text']
y = dftoplow['stars']

In [9]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [10]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(70281L,)
(70281L,)
(23428L,)
(23428L,)
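
These shapes follow from `train_test_split` holding out 25% of the rows when neither `test_size` nor `train_size` is given. A quick check of the arithmetic, assuming (as current scikit-learn does) that the test count is rounded up and the remainder goes to training:

```python
import math

n_rows = 93709          # rows in the 1-star/5-star DataFrame
test_fraction = 0.25    # train_test_split default when no sizes are given

n_test = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_test

print(n_train, n_test)  # 70281 23428
```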


## Task 4
Use CountVectorizer to create document-term matrices from X_train and X_test.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [12]:
%time X_train_dtm = vect.fit_transform(X_train)

Wall time: 8.79 s


In [13]:
X_test_dtm = vect.transform(X_test)
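
The vocabulary is learned from the training text only (`fit_transform`), and the testing text is then mapped onto those same columns (`transform`). The pure-Python sketch below imitates that behaviour on made-up reviews. The helper names are illustrative, and the tokenization only approximates CountVectorizer's defaults (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

# Approximates CountVectorizer's default tokenization.
TOKEN_RE = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def fit_vocabulary(docs):
    """Learn a sorted token -> column index mapping from the training documents."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Build dense count rows; tokens missing from the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_docs = ["Great food, great service", "Terrible service"]
test_docs = ["Great pizza"]  # "pizza" never appears in the training text

vocab = fit_vocabulary(train_docs)
train_dtm = transform(train_docs, vocab)
test_dtm = transform(test_docs, vocab)

print(vocab)      # {'food': 0, 'great': 1, 'service': 2, 'terrible': 3}
print(train_dtm)  # [[1, 2, 1, 0], [0, 0, 1, 1]]
print(test_dtm)   # [[0, 1, 0, 0]]
```

Because "pizza" was never seen during fitting, it has no column and is dropped from the testing matrix.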

## Task 5
Use multinomial Naive Bayes to predict the star rating for the reviews in the testing set, and then calculate the accuracy and print the confusion matrix.
Hint: "Evaluating a classification model" explains how to interpret both classification accuracy and the confusion matrix.

In [14]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_train_dtm,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
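
The `alpha=1.0` shown above is the additive (Laplace) smoothing parameter. The following is a minimal pure-Python sketch of how multinomial Naive Bayes scores a document, using made-up token counts. The function names are illustrative and are not scikit-learn's internals.

```python
import math

def train_nb(count_rows, labels, alpha=1.0):
    """Return log priors and smoothed log likelihoods for each class."""
    classes = sorted(set(labels))
    n_features = len(count_rows[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [row for row, y in zip(count_rows, labels) if y == c]
        log_prior[c] = math.log(len(rows) / len(labels))
        # Total count of each token across this class's documents.
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in totals]
    return log_prior, log_likelihood

def predict_nb(row, log_prior, log_likelihood):
    """Pick the class with the highest log posterior (up to a constant)."""
    scores = {
        c: log_prior[c] + sum(n * ll for n, ll in zip(row, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Columns: counts of the hypothetical tokens "great", "terrible", "service".
X_counts = [[2, 0, 1], [1, 0, 0], [0, 2, 1]]
y_labels = [5, 5, 1]

log_prior, log_likelihood = train_nb(X_counts, y_labels)
print(predict_nb([1, 0, 1], log_prior, log_likelihood))  # 5
print(predict_nb([0, 1, 0], log_prior, log_likelihood))  # 1
```

Smoothing matters for the second prediction: "terrible" never occurs in the 5-star rows, and without `alpha` its likelihood there would be zero and its logarithm undefined.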

In [15]:
y_pred_class = nb.predict(X_test_dtm)
print(y_pred_class)

[5 5 5 ..., 1 5 5]


In [16]:
from sklearn import metrics
print(metrics.accuracy_score(y_test,y_pred_class))

0.931534915486


In [17]:
print(metrics.confusion_matrix(y_test,y_pred_class))

[[ 3731   658]
 [  946 18093]]
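
The accuracy can be recovered from the confusion matrix. Rows are true classes and columns are predicted classes (in sorted label order, 1 then 5), so the diagonal holds the correct predictions:

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
confusion = [[3731, 658],
             [946, 18093]]

correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)

accuracy = correct / total
print(round(accuracy, 4))  # 0.9315
```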


## Task 6 (Challenge)
Calculate the null accuracy, which is the classification accuracy that could be achieved by always predicting the most frequent class.
Hint: "Evaluating a classification model" explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!

In [18]:
# Examine the class distribution of testing set
print(y_test.value_counts())

5    19039
1     4389
Name: stars, dtype: int64


In [19]:
# Calculate null accuracy
y_test.value_counts().head(1) / len(y_test)

5    0.81266
Name: stars, dtype: float64
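
The same number can be checked by hand from the class counts: always predicting 5 stars would be right for every 5-star review in the testing set and wrong for every 1-star review. A small standard-library sketch:

```python
from collections import Counter

# Class counts in the testing set, taken from the value_counts() output above.
y_test_counts = Counter({5: 19039, 1: 4389})

most_common_class, most_common_count = y_test_counts.most_common(1)[0]
null_accuracy = most_common_count / sum(y_test_counts.values())

print(most_common_class, round(null_accuracy, 5))  # 5 0.81266
```

The Naive Bayes accuracy of about 0.93 is therefore a real improvement over the 0.81 baseline.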

## Task 7 (Challenge)
Browse through the review text of some of the false positives and false negatives. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?
Hint: "Evaluating a classification model" explains the definitions of "false positives" and "false negatives".
Hint: Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?

In [23]:
# False positives
X_test[y_test<y_pred_class].head(15)

180398    I have been eating Chicago Style Food for 25 y...
22525     This is in my neighborhood and I heard good th...
127083    What happened? I have no idea. With all the ac...
52786     Please note, this is my very first review of a...
30751     decided to hit me with two months of dues when...
213544    Very disappointed.  Place feels and looks like...
41783     i'm not in love with this chain, and this loca...
26889     So I figured out why this place has 5 from one...
201166    I went to this place because it was closed my ...
45563     "Arribas: At least you won't die."\n\nBland fo...
6051      This place has really bad service and the food...
29579     The staff was friendly. That's about the only ...
100690    These people have NO idea what they are doing....
152035    I have been coming here for a while now. I lik...
192552    Cool historic building but the service at lunc...
Name: text, dtype: object

In [22]:
# false positive: model is reacting to the words "good", "impressive", "nice"
X_test[22525]

u"This is in my neighborhood and I heard good things so I decided to give it a shot. The place is not recommended dine in as it gives me the feeling of an indoor version of one of them mobile food wagons with some fold out benches setup. We ordered some carne asada tacos and some guacamole and chips and a few bottled cokes I believe. \n\n\n The carne asada tacos portion of meat inside was small and the portions of guacamole,cabbage, and onions outweighed the 10 miniscule pieces of meat inside. The taco had almost a hint of peanut butter taste to the guacamole inside the tacos and dip which is disgusting. I am not sure if it was the crazy amount of cabbage that caused the taste but pretty gross. I am completely stunned how anyone can actually like the tacos or guacamole here as even chains like Rubio's have better street tacos(they aren't too shabby actually).\n\nService: 5/5 they were nice and prompt\nAmbiance: definitely take out unless you like picnics\nparking: pretty terrible if I 

In [26]:
# false positive: the review is short, so the model has few tokens to work with
X_test[192552]

u'Cool historic building but the service at lunch today was mediocre at best and stay away from the Chicken Parmigiana unless you like fishy tasting chicken.  The best part of my experience was the bread and butter.  :('

In [27]:
# False negatives
X_test[y_test>y_pred_class].head(10)

179135    Just saw Mission Impossible (Ghost Protocol) a...
221005    I've been here many times purchasing and/or re...
104737    My recent trip to Snowbowl had a bus breakdown...
29135     My former brother-in-law used to do all of our...
103490    after reading the reviews this morning I decid...
211439    Wowoweewow.\n\nI've never been so fascinated a...
64687     Whoa, a woman lost a $70000 engagement ring in...
9297      Fast service, the woman who did my hair was gr...
155531    crazy quick!!!! I got there around 10:20 and w...
119562    HELLS YEAH! I'll never eat people again! this ...
Name: text, dtype: object

In [30]:
# false negative: the review mostly quotes a news story, so negative-sounding tokens (e.g. "lost", "toilet", "accident") likely outweigh the positive ones
X_test[64687]

u'Whoa, a woman lost a $70000 engagement ring in the toilet at this place!\n\n"A Phoenix plumber became a hero after retrieving a $70,000 ring that had been flushed down the toilet of a Phoenix restaurant.\n\n"We just did what we do," said Mike Roberts, general manager of Mr. Rooter, a plumbing company. Roberts said he spent about eight hours fishing down the toilet with a fiber optic cable on Jan. 14.\n\nThe woman, Allison Berry, from California had gone to the restroom after eating at Phoenix\'s Black Bear Diner, 2410 West Bell Road when the accident happened. Her 7-carat diamond ring slipped off her finger and into the toilet as she was pulling up her pants, she reportedly told a waitress."\n\nRead more: http://www.azcentral.com/community/phoenix/articles/2009/01/23/20090123toilet0124.html'
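
With only the labels 1 and 5, the comparisons used in this task pick out each error type when 5 stars is treated as the positive class (an assumption made here for illustration). `y_test < y_pred_class` is true only when a 1-star review was predicted as 5 stars, and `y_test > y_pred_class` only for the reverse. A small sketch with made-up labels:

```python
# Hypothetical true and predicted labels for six reviews.
y_true = [1, 5, 1, 5, 5, 1]
y_pred = [5, 5, 1, 1, 5, 5]

# True 1, predicted 5: false positives when 5 stars is the positive class.
false_positives = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t < p]

# True 5, predicted 1: false negatives.
false_negatives = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t > p]

print(false_positives)  # [0, 5]
print(false_negatives)  # [3]
```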

## Task 8 (Challenge)
Calculate which 10 tokens are the most predictive of 5-star reviews, and which 10 tokens are the most predictive of 1-star reviews.
Hint: Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the feature_count_ and class_count_ attributes of the Naive Bayes model object.

In [31]:
# Store the vocabulary (use get_feature_names() on scikit-learn versions before 1.0)
X_train_tokens = vect.get_feature_names_out()
len(X_train_tokens)

68775

In [32]:
# Token counts learned by the model: one row per class, one column per token
nb.feature_count_.shape

(2L, 68775L)

In [33]:
# Rows follow the sorted class labels: row 0 is 1-star, row 1 is 5-star
one_star_token_count = nb.feature_count_[0,:]
five_star_token_count = nb.feature_count_[1,:]

In [34]:
# create a DataFrame of tokens with their separate one-star and five-star counts
dftoken = pd.DataFrame({'token':X_train_tokens,'one_star':one_star_token_count,'five_star':five_star_token_count}).set_index('token')

In [35]:
dftoken.head()

token,five_star,one_star
00,783.0,559.0
000,143.0,88.0
0005,0.0,1.0
000s,1.0,0.0
000sf,1.0,0.0


In [36]:
# Add 1 to every token count to avoid dividing by zero
dftoken['one_star'] = dftoken['one_star'] +1
dftoken['five_star'] = dftoken['five_star'] +1


In [37]:
nb.class_count_

array([ 13127.,  57154.])

In [38]:
# Convert counts to per-class frequencies, dividing each column by its own class count
dftoken['one_star'] = dftoken['one_star']/nb.class_count_[0]
dftoken['five_star'] = dftoken['five_star']/nb.class_count_[1]
dftoken['five_star_ratio'] = dftoken['five_star']/dftoken['one_star']

In [39]:
dftoken.sort_values('five_star_ratio',ascending=False).head()

token,five_star,one_star,five_star_ratio
unassuming,0.002135,7.6e-05,28.020681
deliciousness,0.003832,0.000152,25.14971
downside,0.003639,0.000152,23.886482
scrumptious,0.003534,0.000152,23.197449
faves,0.001662,7.6e-05,21.819383
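
Task 8 asks for the most predictive tokens of both classes. One way to read the ratio: after adding 1 to each count and dividing by the number of reviews in that class, a ratio above 1 leans 5-star and a ratio below 1 leans 1-star, so sorting in ascending order gives the 1-star list. The sketch below uses the raw counts for `00` and `000` shown earlier plus one made-up token.

```python
# (five_star_count, one_star_count) per token; "horrible" is a made-up entry.
token_counts = {"00": (783, 559), "000": (143, 88), "horrible": (20, 900)}

# Number of training reviews per class, from nb.class_count_ above.
n_one_star, n_five_star = 13127, 57154

ratios = {}
for token, (five, one) in token_counts.items():
    five_freq = (five + 1) / n_five_star   # add 1 to avoid dividing by zero
    one_freq = (one + 1) / n_one_star
    ratios[token] = five_freq / one_freq

most_five_star = sorted(ratios, key=ratios.get, reverse=True)
most_one_star = sorted(ratios, key=ratios.get)

print(most_five_star[0], most_one_star[0])  # 000 horrible
```

In the notebook's terms, `dftoken.sort_values('five_star_ratio').head(10)` would list the ten tokens most predictive of 1-star reviews.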


## Task 9 (Challenge)
Up to this point, we have framed this as a binary classification problem by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a 5-class classification problem.
Here are the steps:
Define X and y using the original DataFrame. (y should contain 5 different classes.)
Split X and y into training and testing sets.
Create document-term matrices using CountVectorizer.
Calculate the testing accuracy of a Multinomial Naive Bayes model.
Compare the testing accuracy with the null accuracy, and comment on the results.
Print the confusion matrix, and comment on the results. (This Stack Overflow answer explains how to read a multi-class confusion matrix.)
Print the classification report, and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!

In [40]:
X=df.text
y= df.stars

In [41]:
# Split with train test split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=1)

In [42]:
# Fit the vectorizer on the training text
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

X_train_dtm = vect.fit_transform(X_train)

In [43]:
X_test_dtm = vect.transform(X_test)

In [44]:
# Use naive bayes model
nb.fit(X_train_dtm,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [45]:
# Make class prediction
y_pred_class = nb.predict(X_test_dtm)

In [46]:
# calculate the accuracy
metrics.accuracy_score(y_test,y_pred_class)

0.54112079614454478

In [47]:
# Print the confusion matrix
metrics.confusion_matrix(y_test,y_pred_class)

array([[ 2427,  1078,   427,   212,   169],
       [  966,  1579,  1828,   706,   230],
       [  488,   758,  3268,  3625,   669],
       [  486,   322,  1800, 12186,  5239],
       [  720,   123,   315,  6214, 11642]])

Confusion matrix comments:
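Most 4-star and 5-star reviews are classified as 4 or 5 stars (about 87% and 94% respectively), but the model has a hard time distinguishing between the two.
1-star reviews are most often classified correctly, while 2-star reviews are most often classified as 3 stars and 3-star reviews as 4 stars. Errors tend to land on a neighboring rating, and the pull toward 4 stars is probably because it's the predominant class in the training data.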
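
These observations can be checked directly from the matrix. The sketch below finds the most frequent prediction for each true class and, since each row sum is the number of testing reviews with that true rating, also derives the null accuracy for comparison with the testing accuracy of about 0.541.

```python
labels = [1, 2, 3, 4, 5]
# Confusion matrix printed above: rows = true stars, columns = predicted stars.
confusion = [
    [2427, 1078, 427, 212, 169],
    [966, 1579, 1828, 706, 230],
    [488, 758, 3268, 3625, 669],
    [486, 322, 1800, 12186, 5239],
    [720, 123, 315, 6214, 11642],
]

# Most frequent prediction for each true class.
top_prediction = {
    true: labels[row.index(max(row))] for true, row in zip(labels, confusion)
}
print(top_prediction)  # {1: 1, 2: 3, 3: 4, 4: 4, 5: 5}

# Null accuracy: share of the most frequent true class.
class_counts = [sum(row) for row in confusion]
null_accuracy = max(class_counts) / sum(class_counts)
print(round(null_accuracy, 3))  # 0.349
```

Always predicting 4 stars would score about 0.349, so the model's 0.541 is well above the baseline even though it is far lower than in the binary problem.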

In [49]:
# Print the classification report
print(metrics.classification_report(y_test,y_pred_class))

             precision    recall  f1-score   support

          1       0.48      0.56      0.52      4313
          2       0.41      0.30      0.34      5309
          3       0.43      0.37      0.40      8808
          4       0.53      0.61      0.57     20033
          5       0.65      0.61      0.63     19014

avg / total       0.54      0.54      0.54     57477



**Precision** answers the question: "When a given class is predicted, how often are those predictions correct?" To calculate the precision for class 1, for example, you divide 2427 by the sum of the first column of the confusion matrix (5087), giving 0.48.

**Recall** answers the question: "When a given class is the true class, how often is that class predicted?" To calculate the recall for class 1, for example, you divide 2427 by the sum of the first row of the confusion matrix (4313), giving 0.56.

**F1** score is the harmonic mean of precision and recall, so it is high only when both are high.

**Support** answers the question: "How many observations exist for which a given class is the true class?" To calculate the support for class 1, for example, you sum the first row of the confusion matrix.
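
These definitions can be applied by hand to the 5-class confusion matrix. The sketch below recomputes each row of the classification report; the helper name `class_metrics` is illustrative.

```python
labels = [1, 2, 3, 4, 5]
# Rows = true stars, columns = predicted stars.
confusion = [
    [2427, 1078, 427, 212, 169],
    [966, 1579, 1828, 706, 230],
    [488, 758, 3268, 3625, 669],
    [486, 322, 1800, 12186, 5239],
    [720, 123, 315, 6214, 11642],
]

def class_metrics(matrix, k):
    """Precision, recall, F1 and support for the class at index k."""
    true_positives = matrix[k][k]
    predicted = sum(row[k] for row in matrix)  # column sum
    support = sum(matrix[k])                   # row sum
    precision = true_positives / predicted
    recall = true_positives / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support

for k, label in enumerate(labels):
    p, r, f1, support = class_metrics(confusion, k)
    print(label, round(p, 2), round(r, 2), round(f1, 2), support)
```

The printed values match the report above, for example `1 0.48 0.56 0.52 4313` for class 1.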


## Classification report comments:
1. Class 2 has the lowest recall (0.30), meaning that the model has a hard time detecting the 2-star reviews. Its precision (0.41) is also the lowest, so when the model predicts that a review is 2-star, it is correct less than half the time.
2. Classes 4 and 5 have the highest recall and precision, probably because 4-star and 5-star reviews have polarized language, and because the model has a lot of observations to learn from.

## Predicting the test dataset

In [53]:
# The test file also holds one JSON object per line, so read it as JSON Lines
path = r'..\..\HeavyDataset\Yelp Business Rating Prediction\yelp_test_set\yelp_test_set_review.json'
dftest = pd.read_json(path, lines=True)

In [55]:
print(dftest.shape)
dftest.head()

(22956, 3)


Unnamed: 0,business_id,type,user_id
0,AuMz7XGkjLcIUurp_AD51w,review,2WkM3pYfx7bt46tv7u4hHA
1,8i5hB_dmf33NVbWE5SwoMQ,review,eHWbF0k5QOBLgQXhGdeHmg
2,nvaAUTTl7oqiJDhuimNG6A,review,HrjjHfDGTafXyKpQKNrYHg
3,QwaoxP5Mgm3PJuZo_4bFsw,review,DrWLhrK8WMZf7Jb-Oqc7ww
4,0lEp4vISRmOXa8Xz2pWhbw,review,jDCONTPR6nyc3J7iimwzkQ
