# Week 2_1: Multiple regression (Interpretation)

The goal of this first notebook is to explore multiple regression and feature engineering using the built-in functions of GraphLab Create.

In this notebook you will use data on house sales in King County to predict prices using multiple regression. You will:
* Use SFrames to do some feature engineering
* Use built-in graphlab functions to compute the regression weights (coefficients/parameters)
* Given the regression weights, predictors, and outcome, write a function to compute the residual sum of squares (RSS)
* Look at coefficients and interpret their meanings
* Evaluate multiple models via RSS
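
The RSS step in the list above can be sketched without GraphLab. This is a minimal, dependency-free illustration, assuming the weights are ordered with the intercept first; the function names and the toy data are hypothetical and are not part of the notebook.

```python
def predict(weights, row):
    """Linear prediction for one row; weights[0] is assumed to be the intercept."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))


def residual_sum_of_squares(weights, features, outcome):
    """Sum of squared differences between the outcome and the predictions."""
    residuals = [y - predict(weights, row) for row, y in zip(features, outcome)]
    return sum(r ** 2 for r in residuals)


# Toy data (made up for illustration): three examples, two features each.
toy_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_weights = [1.0, 2.0, 3.0]      # intercept, then one weight per feature
toy_outcome = [9.0, 18.0, 31.0]

# Predictions are 9, 19 and 29, so the residuals are 0, -1 and 2.
toy_rss = residual_sum_of_squares(toy_weights, toy_features, toy_outcome)
print('toy RSS: %s' % toy_rss)
```

With a fitted model, the same quantity could be computed from the model's predictions on a data set and the true prices in that data set.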

In [1]:
import graphlab
import numpy as np

[INFO] GraphLab Server Version: 1.8.1


## Load data

In [2]:
sales = graphlab.SFrame('kc_house_data.gl/')

In [11]:
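# Squaring bedrooms widens the gap between houses with few and many bedrooms
sales['bedrooms_squared'] = sales['bedrooms']**2
# Interaction term: large when both bedrooms and bathrooms are large
sales['bed_bath_rooms'] = sales['bedrooms']*sales['bathrooms']
# Log transform compresses large living areas and spreads out small ones
sales['log_sqft_living'] = np.log(sales['sqft_living'])
# Not a meaningful quantity; included to see how the model treats such a feature
sales['lat_plus_long'] = sales['lat'] + sales['long']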

In [12]:
train_data, test_data = sales.random_split(0.8, seed=0)

In [16]:
print test_data['bedrooms_squared'].mean()
print test_data['bed_bath_rooms'].mean()
print test_data['log_sqft_living'].mean()
print test_data['lat_plus_long'].mean()

12.4466777016
7.50390163159
7.55027467965
-74.6533349722


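The three models to compare are specified as follows:
* Model 1: 'sqft_living', 'bedrooms', 'bathrooms', 'lat', and 'long'
* Model 2: 'sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long', and 'bed_bath_rooms'
* Model 3: 'sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long', 'bed_bath_rooms', 'bedrooms_squared', 'log_sqft_living', and 'lat_plus_long'

Note that the feature lists passed to graphlab.linear_regression.create in the cells below omit 'sqft_living', so the logged outputs are for models with 4, 5, and 8 features.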
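
As a sketch, the three specifications above can be written as nested Python lists, which makes it explicit that each model extends the previous one. The variable names are hypothetical and do not appear in the notebook; lists like these could be passed as the features argument when creating each model.

```python
# Feature lists following the written model specifications (including 'sqft_living').
model_1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared',
                                       'log_sqft_living',
                                       'lat_plus_long']

for name, feats in [('Model 1', model_1_features),
                    ('Model 2', model_2_features),
                    ('Model 3', model_3_features)]:
    print('%s uses %d features' % (name, len(feats)))
```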

## Learning a multiple regression model

In [32]:
# model 1
model_1 = graphlab.linear_regression.create(train_data,
                                           target='price',
                                           features=['bedrooms', 'bathrooms', 'lat', 'long'],
                                           validation_set=None)

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 17384
PROGRESS: Number of features          : 4
PROGRESS: Number of unpacked features : 4
PROGRESS: Number of coefficients    : 5
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 0.037443     | 5853449.620139     | 294260.514511 |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:


In [33]:
# model 2
model_2 = graphlab.linear_regression.create(train_data,
                                           target='price',
                                           features=['bedrooms', 'bathrooms', 'lat', 'long', 'bed_bath_rooms'],
                                           validation_set=None)

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 17384
PROGRESS: Number of features          : 5
PROGRESS: Number of unpacked features : 5
PROGRESS: Number of coefficients    : 6
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 0.044404     | 5700638.852504     | 290492.952343 |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:


In [34]:
# model 3
model_3 = graphlab.linear_regression.create(train_data,
                                           target='price',
                                           features=['bedrooms', 'bathrooms', 'lat', 'long', 'bed_bath_rooms', 'bedrooms_squared', 'log_sqft_living', 'lat_plus_long'],
                                           validation_set=None)

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 17384
PROGRESS: Number of features          : 8
PROGRESS: Number of unpacked features : 8
PROGRESS: Number of coefficients    : 9
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 0.036926     | 5251970.041717     | 257390.979477 |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:


In [35]:
model_1.coefficients

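+-------------+-------+----------------+---------------+
| name        | index | value          | stderr        |
+-------------+-------+----------------+---------------+
| (intercept) |       | -48012691.9099 | 2046592.14162 |
| bedrooms    |       | 24506.5145753  | 2903.98016739 |
| bathrooms   |       | 236382.876979  | 3470.90606414 |
| lat         |       | 727267.262026  | 16271.2035192 |
| long        |       | -109490.510514 | 16417.0706888 |
+-------------+-------+----------------+---------------+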


In [36]:
model_1.get('coefficients')

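+-------------+-------+----------------+---------------+
| name        | index | value          | stderr        |
+-------------+-------+----------------+---------------+
| (intercept) |       | -48012691.9099 | 2046592.14162 |
| bedrooms    |       | 24506.5145753  | 2903.98016739 |
| bathrooms   |       | 236382.876979  | 3470.90606414 |
| lat         |       | 727267.262026  | 16271.2035192 |
| long        |       | -109490.510514 | 16417.0706888 |
+-------------+-------+----------------+---------------+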


In [39]:
model_2.get('coefficients')

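+----------------+-------+----------------+---------------+
| name           | index | value          | stderr        |
+----------------+-------+----------------+---------------+
| (intercept)    |       | -44996364.5931 | 2028764.71099 |
| bedrooms       |       | -87078.6089966 | 5924.15137777 |
| bathrooms      |       | 52930.0951045  | 9196.84386488 |
| lat            |       | 715304.419626  | 16072.4954437 |
| long           |       | -92554.3230669 | 16247.8626679 |
| bed_bath_rooms |       | 51546.9605159  | 2394.21790502 |
+----------------+-------+----------------+---------------+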
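
The sign of the bedrooms coefficient flips between Model 1 and Model 2, which can look surprising. One way to read it: once the interaction term 'bed_bath_rooms' is in the model, the predicted price change for one extra bedroom (holding bathrooms, lat, and long fixed) is the bedrooms coefficient plus the interaction coefficient times the number of bathrooms. The sketch below works through that arithmetic with the coefficients copied from the Model 2 output; the helper name is hypothetical.

```python
# Coefficients copied from the model_2 output above.
model_2_coef = {
    'bedrooms': -87078.6089966,
    'bathrooms': 52930.0951045,
    'bed_bath_rooms': 51546.9605159,
}


def bedroom_effect(bathrooms, coef=model_2_coef):
    """Predicted price change for one extra bedroom at a fixed bathroom count.

    With the term coef['bed_bath_rooms'] * bedrooms * bathrooms in the model,
    adding one bedroom changes the prediction by
    coef['bedrooms'] + coef['bed_bath_rooms'] * bathrooms.
    """
    return coef['bedrooms'] + coef['bed_bath_rooms'] * bathrooms


# Bathroom count at which an extra bedroom stops lowering the predicted price.
break_even_bathrooms = -model_2_coef['bedrooms'] / model_2_coef['bed_bath_rooms']

for n_bath in [1, 2, 3]:
    print('%d bathroom(s): extra bedroom changes prediction by %.2f'
          % (n_bath, bedroom_effect(n_bath)))
print('break-even bathrooms: %.2f' % break_even_bathrooms)
```

Read this way, the negative bedrooms coefficient in Model 2 does not by itself mean that bedrooms lower the predicted price; the effect depends on the bathroom count.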


In [40]:
print model_1.evaluate(train_data)
print model_2.evaluate(train_data)
print model_3.evaluate(train_data)

{'max_error': 5853449.620139115, 'rmse': 294260.5145112174}
{'max_error': 5700638.852503888, 'rmse': 290492.9523426828}
{'max_error': 5251970.041716702, 'rmse': 257390.97947653994}


In [41]:
print model_1.evaluate(test_data)
print model_2.evaluate(test_data)
print model_3.evaluate(test_data)

{'max_error': 4894438.0333952755, 'rmse': 282171.3588745187}
{'max_error': 4212529.507538155, 'rmse': 277503.88834027527}
{'max_error': 15371303.52558896, 'rmse': 339617.6663557775}
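
The evaluate() calls report RMSE rather than RSS. Since RMSE = sqrt(RSS / n), the two are related by RSS = n * RMSE^2, and for a fixed data set ranking models by RMSE gives the same order as ranking by RSS. The sketch below applies this to the values printed above; the training size of 17384 is taken from the training logs, the test set size is not shown in this notebook so only the ranking is computed for it, and the helper name is hypothetical.

```python
# RMSE values copied from the evaluate() outputs above.
train_rmse = {
    'model_1': 294260.5145112174,
    'model_2': 290492.9523426828,
    'model_3': 257390.97947653994,
}
test_rmse = {
    'model_1': 282171.3588745187,
    'model_2': 277503.88834027527,
    'model_3': 339617.6663557775,
}

N_TRAIN = 17384  # "Number of examples" reported in the training logs


def rss_from_rmse(rmse, n_examples):
    """Invert RMSE = sqrt(RSS / n) to recover RSS = n * RMSE**2."""
    return n_examples * rmse ** 2


train_rss = dict((name, rss_from_rmse(value, N_TRAIN))
                 for name, value in train_rmse.items())

# For a fixed data set, the lowest RMSE is also the lowest RSS.
best_on_train = min(train_rmse, key=train_rmse.get)
best_on_test = min(test_rmse, key=test_rmse.get)

for name in sorted(train_rss):
    print('%s training RSS: %.4e' % (name, train_rss[name]))
print('lowest training error: %s' % best_on_train)
print('lowest test error: %s' % best_on_test)
```

On these numbers the model with the most features has the lowest training error but the highest test error, which is why the comparison should be made on the test data.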
