# Linear regression homework with Yelp votes

## Introduction

This assignment uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- `yelp.json` is the original format of the file. `yelp.csv` contains the same data, in a more convenient format. Both of the files are in this repo, so there is no need to download the data from the Kaggle website.
- Each observation in this dataset is a review of a particular business by a particular user.
- The "stars" column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.
- The "cool" column is the number of "cool" votes this review received from other Yelp users. All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.
- The "useful" and "funny" columns are similar to the "cool" column.

## Task 1

Read `yelp.csv` into a DataFrame.

In [1]:
# access yelp.csv using a relative path
import pandas as pd
yelpvotes = pd.read_csv('../2_dataset/yelp.csv')

In [2]:
yelpvotes.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


## Task 1 (Bonus)

Ignore the `yelp.csv` file, and construct this DataFrame yourself from `yelp.json`. This involves reading the data into Python, decoding the JSON, converting it to a DataFrame, and adding individual columns for each of the vote types.

In [3]:
yelpvotes = pd.read_json('../2_dataset/yelp.json', lines=True)

In [4]:
# show the first review
yelpvotes.head(1)

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,"{u'funny': 0, u'useful': 5, u'cool': 2}"


In [5]:
yelpvotes.votes.head()

0    {u'funny': 0, u'useful': 5, u'cool': 2}
1    {u'funny': 0, u'useful': 0, u'cool': 0}
2    {u'funny': 0, u'useful': 1, u'cool': 0}
3    {u'funny': 0, u'useful': 2, u'cool': 1}
4    {u'funny': 0, u'useful': 0, u'cool': 0}
Name: votes, dtype: object

## Task 2

Explore the relationship between each of the vote types (cool/useful/funny) and the number of stars.

In [6]:
yelpvotes = pd.read_csv('../2_dataset/yelp.csv')
yelpvotes.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [7]:
yelpvotes.cool.describe()

count    10000.000000
mean         0.876800
std          2.067861
min          0.000000
25%          0.000000
50%          0.000000
75%          1.000000
max         77.000000
Name: cool, dtype: float64

In [8]:
yelpvotes.cool.value_counts(normalize=True)

0     0.6290
1     0.1955
2     0.0749
3     0.0396
4     0.0209
5     0.0119
6     0.0088
7     0.0041
8     0.0031
10    0.0030
11    0.0017
9     0.0015
13    0.0014
14    0.0010
12    0.0009
16    0.0006
15    0.0005
17    0.0005
27    0.0001
19    0.0001
22    0.0001
18    0.0001
20    0.0001
38    0.0001
28    0.0001
21    0.0001
32    0.0001
77    0.0001
23    0.0001
Name: cool, dtype: float64

In [9]:
# treat stars as a categorical variable and look for differences between groups
pd.crosstab(yelpvotes.stars, yelpvotes.cool)

stars \ cool,0,1,2,3,4,5,6,7,8,9,...,19,20,21,22,23,27,28,32,38,77
1,549,118,29,24,10,4,5,5,1,0,...,0,0,0,0,0,0,0,0,0,0
2,609,175,66,33,14,11,9,2,3,1,...,0,0,0,0,0,0,0,0,0,0
3,931,284,104,62,31,17,10,9,3,1,...,0,0,0,0,0,0,0,0,0,0
4,2129,718,285,146,92,44,38,12,10,6,...,0,1,1,1,0,0,0,0,1,0
5,2072,660,265,131,62,43,26,13,14,7,...,1,0,0,0,1,1,1,1,0,1


In [10]:
yelpvotes.stars.plot(kind='hist')

<matplotlib.axes._subplots.AxesSubplot at 0x106c41050>

In [11]:
yelpvotes.groupby('stars').cool.mean()

stars
1    0.576769
2    0.719525
3    0.788501
4    0.954623
5    0.944261
Name: cool, dtype: float64

In [12]:
yelpvotes.groupby('stars').useful.mean()

stars
1    1.604806
2    1.563107
3    1.306639
4    1.395916
5    1.381780
Name: useful, dtype: float64

In [13]:
yelpvotes.groupby('stars').funny.mean()

stars
1    1.056075
2    0.875944
3    0.694730
4    0.670448
5    0.608631
Name: funny, dtype: float64

In [14]:
yelpvotes.groupby('stars').cool.agg(['count', 'min', 'max', 'mean'])

stars,count,min,max,mean
1,749,0,17,0.576769
2,927,0,14,0.719525
3,1461,0,18,0.788501
4,3526,0,38,0.954623
5,3337,0,77,0.944261


## Task 3

Define cool/useful/funny as the features, and stars as the response.

In [15]:
# correlation matrix of the numeric columns (values range from -1 to 1)
yelpvotes.corr(numeric_only=True)

Unnamed: 0,stars,cool,useful,funny
stars,1.0,0.052555,-0.023479,-0.061306
cool,0.052555,1.0,0.887102,0.764342
useful,-0.023479,0.887102,1.0,0.723406
funny,-0.061306,0.764342,0.723406,1.0


In [16]:
# Pandas scatter plot
yelpvotes.plot(kind='scatter', x='cool', y='stars', alpha=0.2)

<matplotlib.axes._subplots.AxesSubplot at 0x106eb4490>

In [17]:
# create X and y
feature_cols = ['cool', 'useful', 'funny']
X = yelpvotes[feature_cols]
y = yelpvotes.stars

## Task 4

Fit a linear regression model and interpret the coefficients. Do the coefficients make intuitive sense to you? Explore the Yelp website to see if you detect similar trends.

In [18]:
# import, instantiate, fit
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [19]:
# print the coefficients
print(linreg.coef_)
linreg.intercept_

[ 0.27435947 -0.14745239 -0.13567449]


3.8398947927830829

In [20]:
# pair the feature names with the coefficients using the built-in zip function
print(pd.Series(list(zip(feature_cols, linreg.coef_))))

0       (cool, 0.274359468589)
1    (useful, -0.147452390994)
2     (funny, -0.135674490537)
dtype: object
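One way to read the coefficients is to plug a single review into the fitted equation by hand. The sketch below hard-codes the intercept and (rounded) coefficients printed above and applies them to the vote counts of the first review in the dataset (cool=2, useful=5, funny=0). The names `intercept`, `coefs`, and `votes` are illustrative and not part of the assignment.

```python
# Values copied from the fitted model output above (coefficients are rounded)
intercept = 3.8398947927830829
coefs = {'cool': 0.27435947, 'useful': -0.14745239, 'funny': -0.13567449}

# Vote counts for the first review in the dataset
votes = {'cool': 2, 'useful': 5, 'funny': 0}

# Linear regression prediction: intercept + sum(coefficient * feature value)
predicted_stars = intercept + sum(coefs[name] * votes[name] for name in coefs)
print(round(predicted_stars, 3))
```

Holding the other vote counts fixed, each additional "cool" vote raises the predicted rating by about 0.27 stars, while each additional "useful" or "funny" vote lowers it by about 0.15 and 0.14 stars respectively.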


## Task 5

Evaluate the model by splitting the data into training and testing sets and computing the RMSE on the testing set. Does the RMSE make intuitive sense to you?

In [21]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np



In [22]:
# define a function that accepts a list of features and returns testing RMSE
def train_test_rmse(feature_cols):
    X = yelpvotes[feature_cols]
    y = yelpvotes.stars
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

In [23]:
print(train_test_rmse(['funny']))

1.20043591364


In [24]:
# calculate RMSE with all three features
print(train_test_rmse(['funny','useful','cool']))

1.17336862742


## Task 6

Try removing some of the features and see if the RMSE improves.

In [25]:
print(train_test_rmse(['funny','useful']))
print(train_test_rmse(['funny','cool']))
print(train_test_rmse(['useful','cool']))
print(train_test_rmse(['funny']))
print(train_test_rmse(['useful']))
print(train_test_rmse(['cool']))

1.20070113589
1.1851949299
1.18537944234
1.20043591364
1.20143488625
1.20049049928


## Task 7 (Bonus)

Think of some new features you could create from the existing data that might be predictive of the response. Figure out how to create those features in Pandas, add them to your model, and see if the RMSE improves.

In [26]:
# new feature: review length (number of characters)


In [27]:
# new features: whether or not the review contains 'love' or 'hate'


In [28]:
# add new features to the model and calculate RMSE
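A possible way to build these features, sketched on a tiny made-up DataFrame rather than the real data. The DataFrame name `reviews` and the column names `length`, `love`, and `hate` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for the Yelp data; only the 'text' column matters here
reviews = pd.DataFrame({
    'text': ['I love this place', 'Hate the wait, love the food', 'It was fine'],
})

# new feature: review length (number of characters)
reviews['length'] = reviews.text.str.len()

# new features: whether or not the review contains 'love' or 'hate' (case-insensitive)
reviews['love'] = reviews.text.str.contains('love', case=False).astype(int)
reviews['hate'] = reviews.text.str.contains('hate', case=False).astype(int)

print(reviews[['length', 'love', 'hate']])
```

On the real data, the same three assignments could be applied to `yelpvotes`, after which the new column names can be passed to `train_test_rmse` alongside the vote columns to see whether the RMSE changes.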


## Task 8 (Bonus)

Compare your best RMSE on the testing set with the RMSE for the "null model", which is the model that ignores all features and simply predicts the mean response value in the testing set.

In [29]:
# split the data (outside of the function)


In [30]:
# create a NumPy array with the same shape as y_test


In [31]:
# fill the array with the mean of y_test

In [32]:
# calculate null RMSE
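The steps outlined in the cells above can be sketched with NumPy alone. The ratings in `y_test` below are made up for illustration; on the real data, `y_test` would come from `train_test_split`.

```python
import numpy as np

y_test = np.array([5, 4, 4, 1, 3, 5])    # hypothetical testing-set star ratings

# create an array with the same shape as y_test, filled with the mean of y_test
y_null = np.full_like(y_test, y_test.mean(), dtype=float)

# null RMSE: the error made by always predicting the mean
null_rmse = np.sqrt(np.mean((y_test - y_null) ** 2))
print(null_rmse)
```

Because the prediction is the mean of `y_test` itself, the null RMSE equals the population standard deviation of `y_test`. A model is only useful if its testing RMSE is lower than this baseline.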

## Task 9 (Bonus)

Instead of treating this as a regression problem, treat it as a classification problem and see what testing accuracy you can achieve with KNN.

In [33]:
# import and instantiate KNN

In [34]:
# classification models will automatically treat the response value (1/2/3/4/5) as unordered categories
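A rough sketch of the KNN workflow, using synthetic data in place of the Yelp votes so that it runs on its own. The random features, the random labels, and the choice of `n_neighbors=5` are all assumptions for illustration; the accuracy it prints says nothing about the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Synthetic stand-in for the vote features and star ratings (NOT the Yelp data)
rng = np.random.default_rng(0)
X = rng.poisson(lam=1.0, size=(200, 3))   # columns play the role of cool/useful/funny
y = rng.integers(1, 6, size=200)          # star ratings 1 through 5

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

# instantiate and fit KNN; n_neighbors=5 is an arbitrary choice here
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# testing accuracy: fraction of test rows whose class is predicted exactly
y_pred_class = knn.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred_class)
print(accuracy)
```

With the real data, `X` and `y` would be the feature matrix and response defined in Task 3, and it is worth trying several values of `n_neighbors`.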

## Task 10 (Bonus)

Figure out how to use linear regression for classification, and compare its classification accuracy with KNN's accuracy.

In [35]:
# use linear regression to make continuous predictions

In [36]:
# round its predictions to the nearest integer

In [37]:
# calculate classification accuracy of the rounded predictions
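The rounding and accuracy steps can be sketched as follows. The arrays are made up for illustration, and clipping the rounded values to the 1-5 range is an added assumption (a linear model can predict values outside that range), not something the assignment asks for.

```python
import numpy as np

y_test = np.array([4, 4, 5, 3, 5])                  # hypothetical true star ratings
y_pred_cont = np.array([3.7, 4.2, 4.6, 2.4, 5.3])   # hypothetical regression output

# round to the nearest integer, then clip to the valid 1-5 range (clipping is an assumption)
y_pred_class = np.clip(np.round(y_pred_cont), 1, 5).astype(int)

# classification accuracy: fraction of rounded predictions that match exactly
accuracy = np.mean(y_pred_class == y_test)
print(accuracy)
```

On the real data, `y_pred_cont` would be the output of `linreg.predict(X_test)`, and the resulting accuracy can be compared directly with the KNN accuracy from Task 9.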

## Task 1 (Bonus)

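Work with the JSON directly: decode each line of `yelp.json` yourself and build the DataFrame from the resulting dictionaries.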

In [38]:
# read the data from yelp.json into a list of rows
# each row is decoded into a dictionary using json.loads()

In [39]:
# convert the list of dictionaries to a DataFrame

In [40]:
# add DataFrame columns for cool, useful, and funny

In [41]:
# drop the votes column
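The four steps above might look like the sketch below. To keep it self-contained, it decodes two made-up JSON lines held in a list instead of reading the file; the field values and the variable names `lines`, `data`, and `yelp` are illustrative assumptions.

```python
import json
import pandas as pd

# Two hypothetical lines in the same shape as yelp.json (one JSON object per line)
lines = [
    '{"stars": 5, "text": "Great breakfast", "votes": {"funny": 0, "useful": 5, "cool": 2}}',
    '{"stars": 4, "text": "Good gyro", "votes": {"funny": 0, "useful": 1, "cool": 0}}',
]

# each row is decoded into a dictionary using json.loads()
data = [json.loads(line) for line in lines]

# convert the list of dictionaries to a DataFrame
yelp = pd.DataFrame(data)

# add DataFrame columns for cool, useful, and funny
for vote_type in ['cool', 'useful', 'funny']:
    yelp[vote_type] = yelp.votes.apply(lambda votes: votes[vote_type])

# drop the votes column
yelp = yelp.drop(columns='votes')
print(yelp)
```

With the real file, `lines` would instead come from iterating over the open file object for `../2_dataset/yelp.json`, since each line of that file holds one review.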