# Regression Phase 3: Multiple Regression (Interpretation)

In this notebook we use hotel data crawled from Expedia to predict hotel prices with multiple regression. We will:

* Use SFrames to do some feature engineering
* Use built-in graphlab functions to compute the regression weights (coefficients/parameters)
* Given the regression weights, predictors, and outcome, write a function to compute the Residual Sum of Squares (RSS)
* Look at coefficients and interpret their meanings
* Evaluate multiple models via RSS

# Import GraphLab

In [1]:
import graphlab

# Load in Hotels data


The dataset is from Expedia and covers the region around the city of Los Angeles, CA.

In [21]:
hotels = graphlab.SFrame('LA_0421.csv')
# convert the price and rates columns from string to float
hotels['price'] = hotels['price'].astype(float)
hotels['rates'] = hotels['rates'].astype(float)

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,float,float,str,str,str,str,int,str,str,int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [22]:
hotels.head()

name,zone,star,rating,rates,checkin,checkout
Sportsmen's Lodge,Universal Studios,3.5,3.7,1.0,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296.0,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296.0,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296.0,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296.0,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296.0,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296.0,04/21/2017,04/22/2017
Motel 6 Los Angeles LAX,Los Angeles Intl. (LAX),2.0,3.0,1.0,04/21/2017,04/22/2017
Motel 6 Los Angeles LAX,Los Angeles Intl. (LAX),2.0,3.0,1.0,04/21/2017,04/22/2017
Motel 6 Los Angeles LAX,Los Angeles Intl. (LAX),2.0,3.0,1.0,04/21/2017,04/22/2017

room,size,price,bed,guests,address
Studio Suite,400,221.0,1 king bed,2,Ventura Blvd
"Deluxe Room, 2 Queen Beds",490,164.0,2 queen beds,4,Casino Dr.
"Deluxe Room, 1 King Bed",455,164.0,1 king bed,2,Casino Dr.
"Classic Suite, 1 King Bed",654,239.0,1 king bed,3,Casino Dr.
"Classic Suite, Accessible",654,239.0,1 king bed,3,Casino Dr.
Suite (The Bike),904,314.0,1 king and 1 sofa bed,4,Casino Dr.
"Suite, Accessible (The Bike) ...",904,314.0,1 king and 1 sofa bed,4,Casino Dr.
"Standard Room, 1 Queen Bed, Accessible (Roll-in ...",200,106.0,1 queen bed,2,W Century Blvd
"Standard Room, 2 Queen Beds ...",200,106.0,2 queen beds,4,W Century Blvd
"Standard Room, 1 King Bed",200,106.0,1 king bed,2,W Century Blvd

link
https://www.expedia.com /Universal-Studios- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Los-Angeles-Intl-Hot ...
https://www.expedia.com /Los-Angeles-Intl-Hot ...
https://www.expedia.com /Los-Angeles-Intl-Hot ...


# Split data into training and testing.


In [23]:
train_data,test_data = hotels.random_split(.8,seed=0)
train_data

name,zone,star,rating,rates,checkin,checkout
Sportsmen's Lodge,Universal Studios,3.5,3.7,1.0,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296.0,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296.0,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296.0,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296.0,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296.0,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296.0,04/21/2017,04/22/2017
Motel 6 Los Angeles LAX,Los Angeles Intl. (LAX),2.0,3.0,1.0,04/21/2017,04/22/2017
Motel 6 Los Angeles LAX,Los Angeles Intl. (LAX),2.0,3.0,1.0,04/21/2017,04/22/2017
Motel 6 Los Angeles LAX,Los Angeles Intl. (LAX),2.0,3.0,1.0,04/21/2017,04/22/2017

room,size,price,bed,guests,address
Studio Suite,400,221.0,1 king bed,2,Ventura Blvd
"Deluxe Room, 2 Queen Beds",490,164.0,2 queen beds,4,Casino Dr.
"Deluxe Room, 1 King Bed",455,164.0,1 king bed,2,Casino Dr.
"Classic Suite, 1 King Bed",654,239.0,1 king bed,3,Casino Dr.
"Classic Suite, Accessible",654,239.0,1 king bed,3,Casino Dr.
Suite (The Bike),904,314.0,1 king and 1 sofa bed,4,Casino Dr.
"Suite, Accessible (The Bike) ...",904,314.0,1 king and 1 sofa bed,4,Casino Dr.
"Standard Room, 1 Queen Bed, Accessible (Roll-in ...",200,106.0,1 queen bed,2,W Century Blvd
"Standard Room, 2 Queen Beds ...",200,106.0,2 queen beds,4,W Century Blvd
"Standard Room, 1 King Bed",200,106.0,1 king bed,2,W Century Blvd

link
https://www.expedia.com /Universal-Studios- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Los-Angeles-Intl-Hot ...
https://www.expedia.com /Los-Angeles-Intl-Hot ...
https://www.expedia.com /Los-Angeles-Intl-Hot ...
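
For intuition, the split can be imitated in plain Python. This is only a sketch: it assumes each row is sent to the training set independently with probability 0.8, so the training share is approximately, not exactly, 80%. The helper name `random_split_rows` is hypothetical, and GraphLab's own random number stream will differ, so the same seed does not reproduce the same rows.

```python
import random

def random_split_rows(rows, fraction, seed):
    """Send each row to the first list with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in for hotel rows
train_rows, test_rows = random_split_rows(rows, 0.8, seed=0)
print("%d training rows, %d test rows" % (len(train_rows), len(test_rows)))
```

Fixing the seed matters for the rest of the notebook: every RSS value reported later depends on which rows ended up in the training set.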


# Learning a multiple regression model


Recall that we can learn a multiple regression model that predicts 'price' from the features example_features = ['star', 'size', 'rating'] on the training data with the following code:
(Aside: we set validation_set = None.)

In [24]:
example_features = ['star', 'size', 'rating']
#take a look at the API doc of the regression model
example_model = graphlab.linear_regression.create(train_data, target = 'price', features = example_features, 
                                                  validation_set = None)

Now that we have fitted the model we can extract the regression weights (coefficients) as an SFrame as follows:

In [25]:
example_weight_summary = example_model.get("coefficients")
print example_weight_summary

+-------------+-------+-----------------+-----------------+
|     name    | index |      value      |      stderr     |
+-------------+-------+-----------------+-----------------+
| (intercept) |  None |  -129.244772102 |  23.7829617109  |
|     star    |  None |  116.550058842  |  5.31826612178  |
|     size    |  None | 0.0808452500304 | 0.0142100330904 |
|    rating   |  None |  -9.03680177658 |  8.14256748473  |
+-------------+-------+-----------------+-----------------+
[4 rows x 4 columns]
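
One rough way to read this table is to divide each coefficient by its standard error. The sketch below copies the printed values by hand; the threshold of 2 is a common rule of thumb and an assumption here, not something GraphLab reports.

```python
# (value, stderr) pairs copied from the coefficient table above
coefficients = {
    "(intercept)": (-129.244772102, 23.7829617109),
    "star": (116.550058842, 5.31826612178),
    "size": (0.0808452500304, 0.0142100330904),
    "rating": (-9.03680177658, 8.14256748473),
}

ratios = {}
for name, (value, stderr) in coefficients.items():
    ratios[name] = value / stderr  # how many standard errors away from zero
    print("%-12s %8.2f" % (name, ratios[name]))

# Rule-of-thumb assumption: |ratio| < 2 means the sign is not well determined
weak = sorted(name for name, r in ratios.items() if abs(r) < 2)
print(weak)
```

With these numbers only rating falls below the threshold, so its negative sign should not be over-interpreted; star is estimated far more precisely.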



# Making Predictions


We will use existing GraphLab Create functions to analyze multiple regressions.

We can use the .predict() function to find the predicted values for the data we pass. For example, using the example model above:

In [26]:
example_predictions = example_model.predict(train_data)
print example_predictions[0]

277.582367282
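
The prediction is just the intercept plus each weight times its feature value. As a check, the sketch below redoes that arithmetic by hand, assuming the first training row is the Sportsmen's Lodge Studio Suite shown in the train_data output (star 3.5, size 400, rating 3.7) and using the printed coefficients.

```python
# Weights copied from the printed coefficient table
intercept = -129.244772102
weights = {"star": 116.550058842, "size": 0.0808452500304, "rating": -9.03680177658}

# Assumed first training row (values read off the train_data output above)
first_row = {"star": 3.5, "size": 400, "rating": 3.7}

manual_prediction = intercept + sum(weights[f] * first_row[f] for f in weights)
print(round(manual_prediction, 6))  # agrees with the value returned by .predict()
```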


# Compute RSS


Now that we can make predictions given the model, we need a function to compute the RSS of the model.

In [27]:
def get_residual_sum_of_squares(model, data, outcome):
    # First get the predictions
    predictions = model.predict(data)
    # Then compute the residuals/errors
    errors = outcome - predictions
    # Then square and add them up
    squared_errors = errors * errors
    RSS = squared_errors.sum()
    return RSS
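
The same computation on plain Python lists may make the formula, RSS = sum of (outcome - prediction)^2, easier to check by hand. The helper name `rss_from_lists` and the numbers are made up for illustration.

```python
def rss_from_lists(outcomes, predictions):
    """Residual sum of squares for two equal-length lists (illustrative helper)."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(outcomes, predictions))

# Made-up prices and predictions: the residuals are 10, -20 and 5
toy_rss = rss_from_lists([100.0, 200.0, 150.0], [90.0, 220.0, 145.0])
print(toy_rss)  # 100 + 400 + 25 = 525.0
```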

Test the function by computing the RSS on TEST data for the example model:

In [28]:
rss_example_test = get_residual_sum_of_squares(example_model, test_data, test_data['price'])
print rss_example_test # should be 3779506.10575

3779506.10575


# Create some new features


We often think of multiple regression as including several different features (e.g. 'star', 'size', 'rating'), but we can also consider transformations of existing features, e.g. the log of the size, or even "interaction" features such as the product of star and rating.

In [29]:
from math import log

Next create the following 3 new features as columns in both TEST and TRAIN data:

* star_squared = star * star
* star_rating = star * rating
* log_size = log(size)

As an example, here's the first one:

In [187]:
train_data['star_squared'] = train_data['star'].apply(lambda x: x**2)
test_data['star_squared'] = test_data['star'].apply(lambda x: x**2)

In [188]:
# create the remaining 2 features in both TEST and TRAIN data
#star_rating (element-wise product of the two columns)
train_data['star_rating'] = train_data['star'] * train_data['rating']
test_data['star_rating'] = test_data['star'] * test_data['rating']
#log_size
train_data['log_size'] = train_data['size'].apply(lambda x: log(x))
test_data['log_size'] = test_data['size'].apply(lambda x: log(x))

* Squaring star increases the separation between modest hotels (e.g. 2 stars) and luxury hotels (e.g. 5 stars), since 2^2 = 4 but 5^2 = 25. Consequently this feature mostly affects high-star luxury hotels.
* star times rating is what's called an "interaction" feature. It is large when both star and rating are large.
* Taking the log of the hotel room size brings large values closer together and spreads out small values.
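
A few toy values, not taken from the data set, show how the squared and log transforms reshape a feature:

```python
from math import log

stars = [2.0, 5.0]
gap_raw = stars[1] - stars[0]                # 3.0
gap_squared = stars[1] ** 2 - stars[0] ** 2  # 25 - 4 = 21.0

sizes = [200, 400, 800, 1600]  # each size is double the previous one
log_sizes = [log(s) for s in sizes]
# After the log, every doubling adds the same amount, log(2)
log_steps = [b - a for a, b in zip(log_sizes, log_sizes[1:])]

print("raw gap %s, squared gap %s" % (gap_raw, gap_squared))
print([round(step, 4) for step in log_steps])
```

On the raw scale the gaps between these sizes are 200, 400 and 800; on the log scale they are all equal, which is what "bringing large values closer together" means in practice.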

In [189]:
print test_data['star_squared'].mean()
print test_data['star_rating'].mean()
print test_data['log_size'].mean()


11.4475409836
12.8421311475
5.91405696273


# Learning Multiple Models


Now we will learn the weights for three (nested) models for predicting hotel prices. The first model has the fewest features, the second adds one more feature, and the third adds two more:


* Model 1: size, star, rating
* Model 2: add star_rating
* Model 3: add star_squared and log_size

In [190]:
model_1_features = ['size', 'star', 'rating']
model_2_features = model_1_features + ['star_rating']
model_3_features = model_2_features + ['star_squared', 'log_size']

In [191]:
# Learn the three models: (don't forget to set validation_set = None)
model_1 = graphlab.linear_regression.create(train_data, target = 'price', features = model_1_features,
                                            validation_set = None)
model_2 = graphlab.linear_regression.create(train_data, target = 'price', features = model_2_features,
                                            validation_set = None)
model_3 = graphlab.linear_regression.create(train_data, target = 'price', features = model_3_features,
                                            validation_set = None)

In [192]:
# Examine/extract each model's coefficients:
model_1_weight_summary = model_1.get("coefficients")
model_2_weight_summary = model_2.get("coefficients")
model_3_weight_summary = model_3.get("coefficients")

In [193]:
model_1_weight_summary

name,index,value,stderr
(intercept),,-129.244772102,23.7829617109
size,,0.0808452500304,0.0142100330904
star,,116.550058842,5.31826612178
rating,,-9.03680177658,8.14256748473


In [194]:
model_2_weight_summary

name,index,value,stderr
(intercept),,44.7513118174,79.7804756611
size,,0.079088697748,0.014204322029
star,,49.0818241099,29.9971649747
rating,,-51.2732109322,20.1963107268
star_rating,,16.2172232403,7.0962843183


In [195]:
model_3_weight_summary

name,index,value,stderr
(intercept),,-389.245307082,103.794553221
size,,0.0032073110041,0.0217214751786
star,,-1.54875235654,30.3383464326
rating,,77.3984967902,31.5805324912
star_rating,,-29.493545194,11.0228408461
star_squared,,34.4068605547,6.20052335183
log_size,,56.5395795933,12.5090974673
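
Once interaction and squared terms enter the model, the star coefficient can no longer be read on its own. In Model 2 the change in predicted price per extra star is w_star + w_star_rating * rating, and in Model 3 it is w_star + w_star_rating * rating + 2 * w_star_squared * star. The sketch below evaluates these expressions with the printed weights at an arbitrarily chosen hotel (4 stars, rating 4.0); the helper names are hypothetical.

```python
def star_effect_model_2(rating):
    # derivative of the Model 2 prediction with respect to star
    return 49.0818241099 + 16.2172232403 * rating

def star_effect_model_3(star, rating):
    # derivative of the Model 3 prediction with respect to star
    return -1.54875235654 + -29.493545194 * rating + 2 * 34.4068605547 * star

effect_2 = star_effect_model_2(4.0)
effect_3 = star_effect_model_3(4.0, 4.0)
print(round(effect_2, 2))
print(round(effect_3, 2))
```

At this point both effects are positive and of the same order as the 116.55 per star in Model 1, even though the raw star coefficient in Model 3 is slightly negative.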


# Comparing multiple models


Now that we've learned three models and extracted the model weights we want to evaluate which model is best.

First use our function from earlier to compute the RSS on TRAINING data for each of the three models.

In [196]:
# Compute the RSS on TRAINING data for each of the three models and record the values:
rss_model_1_train = get_residual_sum_of_squares(model_1, train_data, train_data['price'])
rss_model_2_train = get_residual_sum_of_squares(model_2, train_data, train_data['price'])
rss_model_3_train = get_residual_sum_of_squares(model_3, train_data, train_data['price'])

In [197]:
rss_model_1_train

14308109.19300476

In [198]:
rss_model_2_train

14242865.809031466

In [199]:
rss_model_3_train

13666716.383272396

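These numbers show that the training RSS falls as features are added, from about 14.31 million for Model 1 to 14.24 million for Model 2 and 13.67 million for Model 3, so the selection of features matters. Because the models are nested, a lower training RSS is expected for the larger models; comparing RSS on the TEST data is what tells us which model predicts best.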
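
To put the differences in proportion, the reported training RSS values can be turned into relative reductions. This is arithmetic on the numbers printed above, not a new evaluation.

```python
# Training RSS values copied from the cell outputs above
rss_train = {
    "model_1": 14308109.19300476,
    "model_2": 14242865.809031466,
    "model_3": 13666716.383272396,
}

baseline = rss_train["model_1"]
reduction_pct = {}
for name in ["model_2", "model_3"]:
    reduction_pct[name] = 100.0 * (baseline - rss_train[name]) / baseline
    print("%s: %.2f%% lower training RSS than model_1" % (name, reduction_pct[name]))
```

Adding star_rating alone trims the training RSS by less than half a percent, while star_squared and log_size together bring the reduction to roughly 4.5 percent.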