# Regression Week 1: Simple Linear Regression

In this notebook we will use data on house sales in King County to predict house prices using simple (one input) linear regression. You will:
* Use graphlab SArray and SFrame functions to compute important summary statistics
* Write a function to compute the Simple Linear Regression weights using the closed form solution
* Write a function to make predictions of the output given the input feature
* Turn the regression around to predict the input given the output
* Compare two different models for predicting house prices

In this notebook you will be provided with some already complete code as well as some code that you should complete yourself in order to answer quiz questions. The code we provide to complete is optional and is there to assist you with solving the problems, but feel free to ignore the helper code and write your own.

# Fire up graphlab create

In [1]:
import graphlab

# Load house sales data

Dataset is from house sales in King County, the region where the city of Seattle, WA is located.

##1

If you are using SFrame, import graphlab and load the house data; otherwise you can download the CSV. (Note that we will be using the training and testing CSV files provided.) E.g. in Python with SFrames:

**sales = graphlab.SFrame('kc_house_data.gl/')**

In [2]:
sales = graphlab.SFrame('kc_house_data.gl/')



# Split data into training and testing

We use seed=0 so that everyone running this notebook gets the same results.  In practice, you may set a random seed (or let GraphLab Create pick a random seed for you).  

##2

Split the data into 80% training and 20% test data. With SFrame, the following command sets the same seed for everyone, e.g. in Python:

**train_data,test_data = sales.random_split(.8,seed=0)**

In [3]:
train_data,test_data = sales.random_split(.8,seed=0)
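
As an aside, the role of the seed can be illustrated without GraphLab. The sketch below is a plain-Python stand-in, not GraphLab's actual splitting algorithm: the helper name `seeded_split` is hypothetical, and the code uses Python 3 print syntax rather than the Python 2 style of the notebook cells. It only shows that fixing the seed makes a random split repeatable.

```python
import random

def seeded_split(rows, fraction, seed):
    """Send each row to the first list with probability `fraction` (illustrative only)."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(1000))
train_a, test_a = seeded_split(rows, 0.8, seed=0)
train_b, test_b = seeded_split(rows, 0.8, seed=0)

print(train_a == train_b and test_a == test_b)  # True: same seed, same split
print(len(train_a) + len(test_a) == len(rows))  # True: every row lands in exactly one part
```

Note that a split of this kind gives roughly, not exactly, 80% of the rows to the first part.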

# Useful SFrame summary functions

To use the closed-form solution and take advantage of graphlab's built-in functions, we first review some important ones. In particular:
* Computing the sum of an SArray
* Computing the arithmetic average (mean) of an SArray
* Multiplying SArrays by constants
* Multiplying SArrays by other SArrays

In [4]:
# Let's compute the mean of the House Prices in King County in 2 different ways.
prices = sales['price'] # extract the price column of the sales SFrame -- this is now an SArray

# recall that the arithmetic average (the mean) is the sum of the prices divided by the total number of houses:
sum_prices = prices.sum()
num_houses = prices.size() # when prices is an SArray .size() returns its length
avg_price_1 = sum_prices/num_houses
avg_price_2 = prices.mean() # if you just want the average, use the .mean() function
print "average price via method 1: " + str(avg_price_1)
print "average price via method 2: " + str(avg_price_2)

average price via method 1: 540088.141905
average price via method 2: 540088.141905


As we see, we get the same answer both ways.

In [5]:
# if we want to multiply every price by 0.5 it's as simple as:
half_prices = 0.5*prices
# Let's compute the sum of squares of price. We can also multiply two SArrays of the same length elementwise with *
prices_squared = prices*prices
sum_prices_squared = prices_squared.sum() # prices_squared is an SArray of the squares and we want to add them up.
print "the sum of price squared is: " + str(sum_prices_squared)

the sum of price squared is: 9.21732513355e+15


Aside: The Python notation x.xxe+yy means x.xx \* 10^(yy), e.g. 100 = 10^2 = 1\*10^2 = 1e2.
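
A quick way to convince yourself of this notation is to compare a few values directly. This is a small illustrative check in Python 3 print syntax, not part of the assignment:

```python
# Scientific-notation literals are ordinary floats.
print(1e2 == 100.0)           # True
print(float("2.5e3"))         # 2500.0
# The same notation can be requested when formatting a number.
print("%.2e" % 1234567.0)     # 1.23e+06
```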

# Build a generic simple linear regression function 

Armed with these SArray functions we can use the closed form solution found from lecture to compute the slope and intercept for a simple linear regression on observations stored as SArrays: input_feature, output.

Complete the following function (or write your own) to compute the simple linear regression slope and intercept:

##3

Write a generic function that accepts a column of data (e.g. an SArray) ‘input_feature’ and another column ‘output’ and returns the Simple Linear Regression parameters ‘intercept’ and ‘slope’. Use the closed form solution from lecture to calculate the slope and intercept. E.g. in Python:

**def simple_linear_regression(input_feature, output):**

    [your code here]

**return(intercept, slope)**

Formulas for slope and intercept:

**slope = sum[(xi-xmean)*(yi-ymean)] / sum[(xi-xmean)^2]**

**intercept = ymean - slope * xmean**
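
For reference, here is one standard way to arrive at these two formulas, sketched under the assumption that the lecture minimizes the residual sum of squares. The symbols a (intercept) and b (slope) are a notational choice for this sketch, and a bar denotes a mean (xmean, ymean above). Start from

$$
\mathrm{RSS}(a, b) = \sum_{i=1}^{N} \left(y_i - a - b\,x_i\right)^2
$$

Setting the derivative with respect to a to zero gives the intercept:

$$
\frac{\partial\,\mathrm{RSS}}{\partial a} = -2\sum_{i=1}^{N}\left(y_i - a - b\,x_i\right) = 0
\;\;\Longrightarrow\;\; a = \bar{y} - b\,\bar{x}
$$

Setting the derivative with respect to b to zero, substituting a, and using the fact that deviations from a mean sum to zero gives the slope:

$$
\frac{\partial\,\mathrm{RSS}}{\partial b} = -2\sum_{i=1}^{N} x_i\left(y_i - a - b\,x_i\right) = 0
\;\;\Longrightarrow\;\; b = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{N}(x_i-\bar{x})^2}
$$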

In [35]:
def simple_linear_regression(input_feature, output):
    # compute the mean of input_feature and output
    input_mean = input_feature.mean()
    output_mean = output.mean()
    
    # create SArray containing input_feature[i] - input_mean
    input_diff = input_feature - input_mean
    
    # create SArray containing output[i] - output_mean
    output_diff = output - output_mean

    # compute sum of product of input_diff and output_diff
    input_diff_output_diff_product = input_diff * output_diff
    input_diff_output_diff_product_sum = input_diff_output_diff_product.sum()
    
    # compute sum of squared input_diff
    squared_input_diff_product = input_diff * input_diff
    squared_input_diff_product_sum = squared_input_diff_product.sum()
    
    # use the formula for the slope
    slope = input_diff_output_diff_product_sum / squared_input_diff_product_sum
    
    # use the formula for the intercept
    intercept = output_mean - slope * input_mean
    
    return (intercept, slope)

We can test that our function works by passing it data for which we know the answer. In particular, we can generate a feature and then put the output exactly on a line, output = 1 + 1\*input_feature. Both the slope and the intercept should then be 1.

In [36]:
test_feature = graphlab.SArray(range(5))
test_output = graphlab.SArray(1 + 1*test_feature)
(test_intercept, test_slope) =  simple_linear_regression(test_feature, test_output)
print "Intercept: " + str(test_intercept)
print "Slope: " + str(test_slope)

Intercept: 1.0
Slope: 1.0


Now that we know it works, let's build a regression model for predicting price based on sqft_living. Remember that we train on train_data!

##4

Use your function to calculate the estimated slope and intercept on the training data to predict ‘price’ given ‘sqft_living’. e.g. in python with SFrames using:

**input_feature = train_data[‘sqft_living’]**

**output = train_data[‘price’]**

Save the values of the slope and intercept for later (you might want to call them e.g. squarefeet_slope and squarefeet_intercept).

In [30]:
input_feature = train_data['sqft_living']
output = train_data['price']

sqft_intercept, sqft_slope = simple_linear_regression(input_feature, output)

print "Intercept: " + str(sqft_intercept)
print "Slope: " + str(sqft_slope)

Intercept: -47116.0765749
Slope: 281.958838568


# Predicting Values

Now that we have the model parameters: intercept & slope we can make predictions. Using SArrays it's easy to multiply an SArray by a constant and add a constant value. Complete the following function to return the predicted output given the input_feature, slope and intercept:

##5

Write a function that accepts a column of data ‘input_feature’, the ‘slope’, and the ‘intercept’ you learned, and returns a column of predictions ‘predicted_output’ for each entry in the input column. E.g. in Python:

**def get_regression_predictions(input_feature, intercept, slope):**

    [your code here]
    
**return(predicted_output)**

In [41]:
def get_regression_predictions(input_feature, intercept, slope):
    # calculate the predicted output
    predicted_output = intercept + (slope*input_feature)
    return predicted_output

Now that we can calculate a prediction given the slope and intercept, let's make a prediction. Use (or alter) the following to find the estimated price for a house with 2650 square feet according to the squarefeet model we estimated above.

##6

**Quiz Question: Using your Slope and Intercept from (4), What is the predicted price for a house with 2650 sqft?**

In [43]:
my_house_sqft = 2650
estimated_price = get_regression_predictions(my_house_sqft, sqft_intercept, sqft_slope)
print "The estimated price for a house with %d squarefeet is $%.2f" % (my_house_sqft, estimated_price)

The estimated price for a house with 2650 squarefeet is $700074.85
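
The prediction is just intercept + slope * input, so it can be checked by hand. The sketch below hard-codes the intercept and slope printed by the fit above (copied from that output, not recomputed) and uses Python 3 print syntax:

```python
# Parameters copied from the printed output of the sqft_living fit.
sqft_intercept = -47116.0765749
sqft_slope = 281.958838568

# price = intercept + slope * sqft
predicted_price = sqft_intercept + sqft_slope * 2650
print("%.2f" % predicted_price)  # 700074.85
```

Read this way, the slope says that under this model each additional square foot adds about $282 to the predicted price.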


# Residual Sum of Squares

Now that we have a model and can make predictions, let's evaluate our model using the Residual Sum of Squares (RSS). Recall that RSS is the sum of the squares of the residuals, and "residual" is just a fancy word for the difference between the predicted output and the true output.

Complete the following (or write your own) function to compute the RSS of a simple linear regression model given the input_feature, output, intercept and slope:

##7

Write a function that accepts columns of data ‘input_feature’ and ‘output’ and the regression parameters ‘slope’ and ‘intercept’, and outputs the Residual Sum of Squares (RSS). E.g. in Python:

**def get_residual_sum_of_squares(input_feature, output, intercept, slope):**

    [your code here]
    
**return(RSS)**

Recall that the RSS is the sum of the squares of the prediction errors (difference between output and prediction).

In [44]:
def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    # first get the predictions
    predicted_output = get_regression_predictions(input_feature, intercept, slope)

    # then compute the residuals (since we are squaring it doesn't matter which order you subtract)
    residuals = output - predicted_output

    # square the residuals and add them up
    squared_residuals = residuals * residuals
    RSS = squared_residuals.sum()

    return(RSS)

Let's test our get_residual_sum_of_squares function by applying it to the test model where the data lie exactly on a line. Since they lie exactly on a line the residual sum of squares should be zero!

In [45]:
print get_residual_sum_of_squares(test_feature, test_output, test_intercept, test_slope) # should be 0.0

0.0
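
A zero-RSS check cannot catch every mistake (for example, forgetting to square the residuals), so it can help to also try a small case where the answer is non-zero and easy to work out by hand. The sketch below uses plain Python lists and a hypothetical helper `rss` instead of SArrays; the toy numbers are made up for illustration.

```python
def rss(inputs, outputs, intercept, slope):
    """Residual sum of squares for the line intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(inputs, outputs))

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 2, 5, 4]  # deliberately not on a line

# The candidate line y = 1 + 1*x predicts 1, 2, 3, 4, 5,
# so the residuals are 0, 1, -1, 1, -1 and RSS = 0 + 1 + 1 + 1 + 1 = 4.
print(rss(xs, ys, 1, 1))  # 4

# The closed-form fit for this data is intercept 1.4, slope 0.8 (worked by hand);
# the least squares line should never have a larger RSS than another candidate line.
print(rss(xs, ys, 1.4, 0.8) < rss(xs, ys, 1, 1))  # True
```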


Now use your function to calculate the RSS on training data from the squarefeet model calculated above.

##8

**Quiz Question: According to this function and the slope and intercept from the squarefeet model, what is the RSS for the simple linear regression using squarefeet to predict prices on TRAINING data?**

In [46]:
rss_prices_on_sqft = get_residual_sum_of_squares(input_feature, output, sqft_intercept, sqft_slope)
print 'The RSS of predicting Prices based on Square Feet is : ' + str(rss_prices_on_sqft)

The RSS of predicting Prices based on Square Feet is : 1.20191835632e+15


# Predict the squarefeet given price

What if we want to predict the square feet given the price? Since we have an equation y = a + b\*x, we can solve it for x. If we have the intercept (a), the slope (b) and the price (y), we can solve for the estimated square feet (x).

Complete the following function to compute the inverse regression estimate, i.e. predict the input_feature given the output!

##9

Note that although we estimated the regression slope and intercept in order to predict the output from the input, since this is a simple linear relationship with only two variables we can invert the linear function to estimate the input given the output!

Write a function that accepts a column of data ‘output’ and the regression parameters ‘slope’ and ‘intercept’, and outputs the column of data ‘estimated_input’. Do this by solving the linear function output = intercept + slope*input for the ‘input’ variable (i.e. ‘input’ should be on one side of the equals sign by itself). E.g. in Python:

**def inverse_regression_predictions(output, intercept, slope):**

    [your code here]

**return(estimated_input)**

Formula for input:

**input = (output - intercept) / slope**

In [48]:
def inverse_regression_predictions(output, intercept, slope):
    # solve output = intercept + slope*input_feature for input_feature. 
    estimated_feature = (output - intercept) / slope

    return estimated_feature

Now that we have a function to compute the square feet given the price from our simple regression model, let's see how big we might expect a house that costs $800,000 to be.

##10

**Quiz Question: According to this function and the regression slope and intercept from (4), what is the estimated square-feet for a house costing $800,000?**

In [49]:
my_house_price = 800000
estimated_squarefeet = inverse_regression_predictions(my_house_price, sqft_intercept, sqft_slope)
print "The estimated squarefeet for a house worth $%.2f is %d" % (my_house_price, estimated_squarefeet)

The estimated squarefeet for a house worth $800000.00 is 3004
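
One simple sanity check for an inverse function is a round trip: invert a price to square feet, then predict the price again and confirm that the original value comes back. The sketch below uses hypothetical plain-Python helpers `predict` and `invert` and the intercept and slope printed by the fit above (copied, not recomputed), in Python 3 print syntax.

```python
def predict(x, intercept, slope):
    """Forward model: output = intercept + slope * input."""
    return intercept + slope * x

def invert(y, intercept, slope):
    """Inverse model: input = (output - intercept) / slope."""
    return (y - intercept) / slope

# Values copied from the printed output of the sqft_living fit.
intercept, slope = -47116.0765749, 281.958838568

sqft = invert(800000, intercept, slope)
print(int(sqft))  # 3004

# Round trip: predicting from the inverted value recovers the price
# up to floating-point rounding.
print(abs(predict(sqft, intercept, slope) - 800000) < 1e-6)  # True
```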


# New Model: estimate prices from bedrooms

We have made one model for predicting house prices using squarefeet, but there are many other features in the sales SFrame. 
Use your simple linear regression function to estimate the regression parameters for predicting prices based on the number of bedrooms. Use the training data!

##11

Instead of using ‘sqft_living’ to estimate prices we could use ‘bedrooms’ (a count of the number of bedrooms in the house) to estimate prices. Using your function from (3) calculate the Simple Linear Regression slope and intercept for estimating price based on bedrooms. Save this slope and intercept for later (you might want to call them e.g. bedroom_slope, bedroom_intercept).

In [52]:
# Estimate the slope and intercept for predicting 'price' based on 'bedrooms'
bedroom_input_feature = train_data['bedrooms']
bedroom_output = train_data['price']

(bedroom_intercept, bedroom_slope) = simple_linear_regression(bedroom_input_feature, bedroom_output)

print "Intercept: " + str(bedroom_intercept)
print "Slope: " + str(bedroom_slope)

Intercept: 109473.180469
Slope: 127588.952175


# Test your Linear Regression Algorithm

Now we have two models for predicting the price of a house. How do we know which one is better? Calculate the RSS on the TEST data (remember this data wasn't involved in learning the model). Compute the RSS from predicting prices using bedrooms and from predicting prices using squarefeet.

##12

Now that we have 2 different models, compute the RSS from BOTH models on TEST data.
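
The evaluation pattern is: learn the parameters on the training data only, then score those same parameters on the test data. The sketch below illustrates this with plain Python lists and made-up toy numbers; the helpers `fit` and `rss` and the two toy features are hypothetical stand-ins, not the house data or the GraphLab functions above.

```python
def fit(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    x_mean = sum(xs) / float(len(xs))
    y_mean = sum(ys) / float(len(ys))
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - slope * x_mean, slope

def rss(xs, ys, intercept, slope):
    """Residual sum of squares of the line on the given data."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Toy data: y is roughly 2*x, while z is only loosely related to y.
train_x, train_z, train_y = [1, 2, 3, 4, 5], [3, 1, 4, 1, 5], [2.1, 3.9, 6.2, 7.8, 10.1]
test_x, test_z, test_y = [6, 7], [9, 2], [12.2, 13.8]

# Parameters come from the TRAINING data only.
x_intercept, x_slope = fit(train_x, train_y)
z_intercept, z_slope = fit(train_z, train_y)

# The TEST data is used only for scoring the already-fitted models.
rss_x = rss(test_x, test_y, x_intercept, x_slope)
rss_z = rss(test_z, test_y, z_intercept, z_slope)
print("model with lower test RSS:", "x" if rss_x < rss_z else "z")  # x
```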

In [54]:
# Compute RSS when using bedrooms on TEST data:
# evaluate the model learned on train_data; do not refit on test_data
test_bedroom_feature = test_data['bedrooms']
test_bedroom_output = test_data['price']

test_bedroom_rss = get_residual_sum_of_squares(test_bedroom_feature, test_bedroom_output, bedroom_intercept, bedroom_slope)

print "RSS on TEST data with Bedrooms: " + str(test_bedroom_rss)


In [56]:
# Compute RSS when using squarefeet on TEST data:
# evaluate the model learned on train_data; do not refit on test_data
test_sqft_feature = test_data['sqft_living']
test_sqft_output = test_data['price']

test_sqft_rss = get_residual_sum_of_squares(test_sqft_feature, test_sqft_output, sqft_intercept, sqft_slope)

print "RSS on TEST data with Sqft: " + str(test_sqft_rss)


##13

**Quiz Question: Which model (square feet or bedrooms) has the lowest RSS on TEST data? Think about why this might be the case.**

Sqft model