# Multiple Regression (Interpretation)

**Goal**: to explore multiple regression and feature engineering with existing Turi Create functions.

We will use data on house sales in King County to predict prices using multiple regression. In this notebook you will:
* Use SFrames to do some feature engineering
* Use built-in Turi Create functions to compute the regression weights (coefficients/parameters)
* Given the regression weights, predictors, and outcome, write a function to compute the Residual Sum of Squares (RSS)
* Look at coefficients and interpret their meanings
* Evaluate multiple models via RSS

## Load in house sales data

The dataset is from house sales in King County, the region where the city of Seattle, WA is located.

In [9]:
import turicreate as tc
import math

In [4]:
sales = tc.SFrame('../ml-foundations/data/home_data.sframe')

## Split data into training and testing.
We use `seed=0` so that everyone running this notebook gets the same results. In practice, you may set your own random seed (or let Turi Create pick one for you).

In [5]:
train, test = sales.random_split(.8,seed=0)
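
To see why fixing the seed matters, here is a minimal pure-Python sketch of a seeded random split. It is not Turi Create's algorithm: the helper `random_split` below and its row-by-row coin flip are assumptions made purely for illustration.

```python
import random

def random_split(rows, fraction, seed):
    """Hypothetical helper: send each row to the first part with probability `fraction`."""
    rng = random.Random(seed)  # private generator, so the seed fully determines the split
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(1000))  # made-up stand-in for the rows of `sales`
train_a, test_a = random_split(rows, 0.8, seed=0)
train_b, test_b = random_split(rows, 0.8, seed=0)

print(train_a == train_b and test_a == test_b)  # True: same seed, same split
print(len(train_a), len(test_a))                # sizes are only approximately 80/20
```

Because each row is assigned at random, the two parts are close to, but not exactly, 80% and 20% of the data.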

### Add new features based on pre-existing ones

In [6]:
train['bedrooms_squared'] = train['bedrooms']*train['bedrooms']
test['bedrooms_squared'] = test['bedrooms']*test['bedrooms']

In [7]:
train['bed_bath_rooms'] = train['bedrooms']*train['bathrooms']
test['bed_bath_rooms'] = test['bedrooms']*test['bathrooms']

In [10]:
train['log_sqft_living'] = train['sqft_living'].apply(math.log)
test['log_sqft_living'] = test['sqft_living'].apply(math.log)

In [11]:
train['lat_plus_long'] = train['lat'] + train['long']
test['lat_plus_long'] = test['lat'] + test['long']
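
As a quick sanity check, the four engineered features above can be reproduced by hand for a single record. The sketch below uses plain Python and a made-up house (the values are hypothetical, not a row from the data); note that `math.log` is the natural logarithm.

```python
import math

# Hypothetical house record; the values are made up for illustration.
house = {'bedrooms': 3, 'bathrooms': 2.0, 'sqft_living': 1000.0,
         'lat': 47.5, 'long': -122.25}

features = {
    'bedrooms_squared': house['bedrooms'] * house['bedrooms'],
    'bed_bath_rooms': house['bedrooms'] * house['bathrooms'],
    'log_sqft_living': math.log(house['sqft_living']),  # natural log
    'lat_plus_long': house['lat'] + house['long'],
}
for name, value in features.items():
    print(name, value)
```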

In [49]:
# Q: what are the mean values of the new variables on TEST data?
print("Bedrooms squared",test['bedrooms_squared'].mean())
print("Bedrooms bathrooms",test['bed_bath_rooms'].mean())
print("Log of sqft",test['log_sqft_living'].mean())
print("Latitude+Longitude",test['lat_plus_long'].mean())

Bedrooms squared 12.446677701584301
Bedrooms bathrooms 7.503901631591394
Log of sqft 7.550274679645938
Latitude+Longitude -74.65333497217307


## Learning Multiple Regression Models

Use `turicreate.linear_regression.create()` (or any other regression library/function) to estimate the regression coefficients/weights for predicting `price` with the following three models. Include an intercept in all three models (most software does this by default).

(Aside: We set `validation_set = None` to ensure that the results are always the same.)

In [13]:
features_1 = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']
model_1 = tc.linear_regression.create(train, target='price', features=features_1, 
                                                    validation_set = None)

In [14]:
features_2 = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long','bed_bath_rooms']
model_2 = tc.linear_regression.create(train, target='price', features=features_2, 
                                                    validation_set = None)

In [15]:
features_3 = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long','bed_bath_rooms','bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
model_3 = tc.linear_regression.create(train, target='price', features=features_3, 
                                                    validation_set = None)

Now that we have fitted the models we can extract their regression weights (coefficients) into an SFrame as follows:

In [16]:
model1_weights = model_1.coefficients
model2_weights = model_2.coefficients
model3_weights = model_3.coefficients

### Examining the models' coefficients

In [31]:
# What is the sign of the coefficient for ‘bathrooms’ in Model 1?
model1_weights

| name | index | value | stderr |
|---|---|---|---|
| (intercept) | | -56140675.74114427 | 1649985.420135553 |
| sqft_living | | 310.26332577692136 | 3.1888296040737765 |
| bedrooms | | -59577.11606759667 | 2487.279773224501 |
| bathrooms | | 13811.840541653264 | 3593.542132967073 |
| lat | | 629865.7894714845 | 13120.710032363884 |
| long | | -214790.28516471 | 13284.285159576595 |


In [33]:
# What is the sign of the coefficient/weight for ‘bathrooms’ in Model 2?
model2_weights

| name | index | value | stderr |
|---|---|---|---|
| (intercept) | | -54410676.1071702 | 1650405.1652726454 |
| sqft_living | | 304.44929805407946 | 3.20217535637094 |
| bedrooms | | -116366.04322451768 | 4805.549665485823 |
| bathrooms | | -77972.33050970349 | 7565.059910947983 |
| lat | | 625433.8349445503 | 13058.353097300462 |
| long | | -203958.60289731968 | 13268.12837000966 |
| bed_bath_rooms | | 26961.624907583264 | 1956.3656155588428 |
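
The sign of the `bathrooms` weight flips between the two models because Model 2 also contains the interaction feature `bed_bath_rooms` (bedrooms × bathrooms). In Model 2, the change in predicted price from one extra bathroom, with everything else held fixed, is the `bathrooms` weight plus the `bed_bath_rooms` weight times the number of bedrooms, so it depends on how many bedrooms the house has. A small sketch using the Model 2 weights printed above (the helper name `bathroom_effect` is ours, not part of Turi Create):

```python
# Model 2 weights copied from the coefficient table above.
W_BATHROOMS = -77972.33050970349
W_BED_BATH_ROOMS = 26961.624907583264

def bathroom_effect(bedrooms):
    """Change in predicted price for one more bathroom, other features held fixed."""
    return W_BATHROOMS + W_BED_BATH_ROOMS * bedrooms

for bedrooms in range(1, 6):
    print(bedrooms, round(bathroom_effect(bedrooms), 2))

# Number of bedrooms at which the effect changes sign.
print(-W_BATHROOMS / W_BED_BATH_ROOMS)
```

With these weights the effect is negative for one or two bedrooms and positive from three bedrooms up, so the negative `bathrooms` weight on its own does not mean that bathrooms lower the predicted price.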


## Making Predictions

In the gradient descent notebook we use NumPy to do our regression. In this notebook we use existing Turi Create functions to analyze multiple regressions.

Recall that once a model is built, we can use the `.predict()` function to find the predicted values for the data we pass in. For example, using `model_1` from above:

In [34]:
example_predictions = model_1.predict(train)
print(example_predictions[0])

245784.01808077097
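
For a linear regression model, a prediction is the intercept plus the weighted sum of the feature values. The sketch below reproduces that arithmetic in plain Python with the Model 1 weights printed earlier. The helper `predict_price` and the house's feature values are hypothetical, chosen only for illustration; this is not a row of the data and not how Turi Create implements `.predict()`.

```python
# Model 1 weights copied from the coefficient table earlier in the notebook.
weights = {
    '(intercept)': -56140675.74114427,
    'sqft_living': 310.26332577692136,
    'bedrooms': -59577.11606759667,
    'bathrooms': 13811.840541653264,
    'lat': 629865.7894714845,
    'long': -214790.28516471,
}

def predict_price(house):
    """Intercept plus the weighted sum of the feature values."""
    total = weights['(intercept)']
    for name, value in house.items():
        total += weights[name] * value
    return total

# Hypothetical house; these feature values are made up, not a row of the data.
house = {'sqft_living': 2000, 'bedrooms': 3, 'bathrooms': 2, 'lat': 47.6, 'long': -122.3}
print(predict_price(house))
```

Read this way, each weight is the change in predicted price when its feature increases by one unit and the other features stay fixed.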


## Compute RSS

Now that we can make predictions given the model, let's write a function to compute the RSS of the model. Complete the function below to calculate RSS given the model, data, and the outcome.

In [36]:
def get_residual_sum_of_squares(model, data, labels):
    # First get the predictions
    predictions = model.predict(data)
    # Then compute the residuals/errors (element-wise on SArrays)
    residuals = labels - predictions
    # Then square and add them up
    RSS = (residuals * residuals).sum()
    return RSS
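
RSS is the sum, over all observations, of the squared difference between the actual outcome and the predicted outcome. A minimal pure-Python version that works on plain lists (the function name `rss` and the numbers are made up for illustration) makes the arithmetic easy to check by hand:

```python
def rss(labels, predictions):
    """Residual sum of squares: sum of squared (actual - predicted) differences."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(labels, predictions))

# Made-up numbers, chosen so the result is easy to verify by hand.
labels = [3.0, 5.0, 10.0]
predictions = [2.0, 7.0, 10.0]
print(rss(labels, predictions))  # 1**2 + (-2)**2 + 0**2 = 5.0
```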

Test your function by computing the RSS on TEST data for the example model (`model_1`):

## Comparing multiple models

Now that you've learned three models and extracted the model weights, we want to evaluate which model is best.

In [43]:
# Compute the RSS on TRAINING data for each of the three models and record the values:
rss_model1 = get_residual_sum_of_squares(model_1, train, train['price'])
rss_model2 = get_residual_sum_of_squares(model_2, train, train['price'])
rss_model3 = get_residual_sum_of_squares(model_3, train, train['price'])


In [44]:
# Q: Which of the 3 models has lowest RSS on TRAINING DATA?
print("{:e}".format(rss_model1))
print("{:e}".format(rss_model2))
print("{:e}".format(rss_model3))

9.713282e+14
9.615921e+14
9.052763e+14


Now compute the RSS on TEST data for each of the three models.

In [47]:
rss_model1_test = get_residual_sum_of_squares(model_1, test, test['price'])
rss_model2_test = get_residual_sum_of_squares(model_2, test, test['price'])
rss_model3_test = get_residual_sum_of_squares(model_3, test, test['price'])

In [48]:
# Q: Which of the 3 models has lowest RSS on TEST DATA?
print("{:e}".format(rss_model1_test))
print("{:e}".format(rss_model2_test))
print("{:e}".format(rss_model3_test))

2.265681e+14
2.243688e+14
2.518293e+14
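
To answer the two questions programmatically, one option is to store the RSS values by model name and take the minimum. The sketch below hard-codes the values printed above; the dictionary names are ours.

```python
# RSS values copied from the outputs above.
rss_train = {'model_1': 9.713282e+14, 'model_2': 9.615921e+14, 'model_3': 9.052763e+14}
rss_test = {'model_1': 2.265681e+14, 'model_2': 2.243688e+14, 'model_3': 2.518293e+14}

best_on_train = min(rss_train, key=rss_train.get)
best_on_test = min(rss_test, key=rss_test.get)
print(best_on_train)  # model_3
print(best_on_test)   # model_2
```

Model 3 has the lowest RSS on the training data but the highest on the test data, a pattern that is consistent with its extra features overfitting the training set.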


Note: The RSS on the test data is smaller than on the training data because the training set is much bigger than the test set (an 80/20 split here), and RSS is a sum over the number of observations.
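
Because RSS is a sum over observations, comparing training and test RSS directly mostly reflects the sizes of the two sets. One rough check, sketched below with the values printed above, is the ratio of training RSS to test RSS: if the error per house were the same on both sets, an 80/20 split would give a ratio near 4.

```python
# RSS values copied from the outputs above, in the order model 1, 2, 3.
rss_train = [9.713282e+14, 9.615921e+14, 9.052763e+14]
rss_test = [2.265681e+14, 2.243688e+14, 2.518293e+14]

ratios = [tr / te for tr, te in zip(rss_train, rss_test)]
for i, ratio in enumerate(ratios, start=1):
    print("model_{}: train/test RSS ratio = {:.2f}".format(i, ratio))
```

Assuming the set sizes are close to 4:1, Models 1 and 2 come out a little above 4, while Model 3 falls below 4, meaning its error per house is relatively larger on the test set.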