# Regression Week 1: Simple Linear Regression

In this notebook we will use data on house sales in King County to predict house prices using simple (one input) linear regression.

In [14]:
import pandas as pd

sales = pd.read_csv('kc_house_data.csv')
sales.head(5)

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


Split the data into a training and test set.

In [15]:
import numpy as np

train_mask = np.random.rand(len(sales)) < 0.8

train_data = sales[train_mask]
test_data = sales[~train_mask]
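
The mask above is drawn without a fixed seed, so each run produces a different split and the numbers printed later in this notebook will vary between runs. Here is a minimal sketch of a reproducible variant. It assumes NumPy's `default_rng` with an arbitrary seed, and a small stand-in array in place of `sales`; the helper name `split_mask` and the seed value are illustrative and not part of the original notebook.

```python
import numpy as np

def split_mask(n_rows, train_fraction=0.8, seed=0):
    """Return a boolean mask that is True for rows assigned to training."""
    # A seeded generator makes the same mask come out on every run.
    rng = np.random.default_rng(seed)
    return rng.random(n_rows) < train_fraction

# Stand-in for the rows of `sales` (hypothetical data, 1000 rows).
rows = np.arange(1000)
mask = split_mask(len(rows))

train_rows = rows[mask]    # roughly 80% of the rows
test_rows = rows[~mask]    # the remaining rows
```

The same mask could be applied to a DataFrame exactly as in the cell above (`sales[mask]` and `sales[~mask]`).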

# Build a generic simple linear regression function

We will compute a simple linear regression model using a single feature. The function below uses the closed-form solution for the least-squares slope and intercept.

In [16]:
def simple_linear_regression(input_feature, output):
    """Return (intercept, slope) of the least-squares line for one input feature."""
    # Sums that appear in the closed-form solution.
    sum_input = input_feature.sum()
    sum_output = output.sum()
    sum_product_of_input_output = (input_feature * output).sum()
    sum_input_squared = (input_feature * input_feature).sum()

    length = len(output)
    inverse_length = 1.0 / length

    # slope = (sum(xy) - sum(x) * sum(y) / N) / (sum(x^2) - sum(x)^2 / N)
    slope_num = (sum_product_of_input_output - inverse_length * sum_input * sum_output)
    slope_denom = (sum_input_squared - inverse_length * sum_input * sum_input)
    slope = slope_num / slope_denom

    # intercept = mean(y) - slope * mean(x)
    intercept = inverse_length * sum_output - slope * inverse_length * sum_input

    return (intercept, slope)
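
One quick way to gain confidence in the function above is to run it on points that lie exactly on a known line and compare against an independent fit. The sketch below restates the same formulas in compact form, so that it runs on its own. It uses NumPy arrays as stand-ins for the pandas columns and `np.polyfit` as the reference; the test line y = 1 + 2x is an arbitrary choice.

```python
import numpy as np

def simple_linear_regression(input_feature, output):
    # Same closed-form formulas as the cell above, written compactly.
    n = len(output)
    sum_x = input_feature.sum()
    sum_y = output.sum()
    slope = ((input_feature * output).sum() - sum_x * sum_y / n) / (
        (input_feature * input_feature).sum() - sum_x * sum_x / n)
    intercept = sum_y / n - slope * sum_x / n
    return (intercept, slope)

# Noise-free points on the line y = 1 + 2x, so the fit should recover (1, 2).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

intercept, slope = simple_linear_regression(x, y)

# For degree 1, np.polyfit returns the coefficients as [slope, intercept].
ref_slope, ref_intercept = np.polyfit(x, y, 1)
```

Both approaches should agree up to floating-point rounding.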

# Math Behind Simple Linear Regression

Simple linear regression is trying to compute the relationship:

$$y = w_0 + w_1 x$$

where y is the desired output, x is the given input from the sample data, and w0 and w1 are the model parameters (the intercept and the slope, respectively).

For real-world applications, this formula is too simple to predict the output perfectly (and this remains true even when adding more variables of greater complexity). We are not looking for a perfect solution, but for one that minimizes the overall error of our predictions. To measure the error over a set of samples, we will use the residual sum of squares (RSS):

$$RSS(w) = \sum_{i=1}^{N} (y_i - (w_0 + w_1x_i))^2$$

The inputs (x) and outputs (y) are fixed, so the goal is to adjust the parameters (w0 and w1) to minimize the distance between our predictions and the true outputs across all samples. To do that, we will take the gradient of the RSS and solve for the point where the gradient is 0. This gives a critical point of the function; because the RSS is a convex quadratic in w0 and w1, that point is the minimum.

$$\frac{\partial RSS(w)}{\partial w_0} = -2\sum_{i=1}^{N} (y_i - (w_0 + w_1x_i))$$
$$\frac{\partial RSS(w)}{\partial w_1} = -2\sum_{i=1}^{N} x_i(y_i - (w_0 + w_1x_i))$$

Together, the partial derivatives with respect to w0 and w1 form the gradient of the residual sum of squares. I will leave it as an exercise to derive the closed-form solution for the values of w0 and w1 that minimize the RSS.
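
Without giving away the exercise, the claim that the gradient vanishes at the least-squares solution can be checked numerically. The sketch below evaluates the two partial derivatives above at a fit obtained from `np.polyfit`. The helper name `rss_gradient` and the four data points are made up for illustration.

```python
import numpy as np

def rss_gradient(x, y, w0, w1):
    """Return (dRSS/dw0, dRSS/dw1) using the two partial derivatives above."""
    residuals = y - (w0 + w1 * x)
    return (-2.0 * residuals.sum(), -2.0 * (x * residuals).sum())

# Small made-up data set that does not lie exactly on a line.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0])

# Least-squares fit; for degree 1, np.polyfit returns [slope, intercept].
w1, w0 = np.polyfit(x, y, 1)

grad_at_fit = rss_gradient(x, y, w0, w1)        # both components are ~0
grad_elsewhere = rss_gradient(x, y, 0.0, 0.0)   # nonzero away from the fit
```

At the fitted parameters both components are zero up to rounding, while at an arbitrary point such as (0, 0) they are not.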

# Predicting Values

Now that we have the model parameters (intercept and slope), we can make predictions. Complete the following function to return the predicted output given the input_feature, slope, and intercept:

In [17]:
def get_regression_predictions(input_feature, intercept, slope):
    return input_feature * slope + intercept

Now that we can calculate a prediction given the slope and intercept, let's make one. Use (or alter) the following to fit the model on sqft_living using the training data, then find the estimated price for a house with 2650 square feet.

In [20]:
(sqft_living_intercept, sqft_living_slope) = simple_linear_regression(train_data['sqft_living'], train_data['price'])

(sqft_living_intercept, sqft_living_slope)

(-53999.99432768591, 285.85279020428396)

In [21]:
estimated_price = get_regression_predictions(2650, sqft_living_intercept, sqft_living_slope)
estimated_price

703509.8997136666

# Residual Sum of Squares

Now that we have a model and can make predictions, let's evaluate the model using the residual sum of squares (RSS). Recall that the RSS is the sum of the squares of the residuals, where a residual is just a fancy word for the difference between the predicted output and the true output.

$$RSS(w) = \sum_{i=1}^{N} (y_i - (w_0 + w_1x_i))^2$$

Complete the following (or write your own) function to compute the RSS of a simple linear regression model given the input_feature, output, intercept and slope:

In [22]:
def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    predictions = get_regression_predictions(input_feature, intercept, slope)
    residuals = predictions - output
    return sum(residuals * residuals)
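
A small worked example makes the arithmetic concrete. The sketch below repeats the two functions so that it runs on its own, and applies them to three made-up points with hand-picked parameters (intercept 0, slope 2). NumPy arrays stand in for the pandas columns.

```python
import numpy as np

def get_regression_predictions(input_feature, intercept, slope):
    return input_feature * slope + intercept

def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    predictions = get_regression_predictions(input_feature, intercept, slope)
    residuals = predictions - output
    return sum(residuals * residuals)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 7.0])

# With intercept 0 and slope 2 the predictions are [2, 4, 6],
# so the residuals are [0, 0, -1] and the RSS is 0 + 0 + 1 = 1.
rss = get_residual_sum_of_squares(x, y, intercept=0.0, slope=2.0)
```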

Now use your function to calculate the RSS on the training data for the square-feet model calculated above.

In [23]:
rss_prices_on_sqft = get_residual_sum_of_squares(
    train_data['sqft_living'],
    train_data['price'],
    sqft_living_intercept,
    sqft_living_slope)

rss_prices_on_sqft

1219043435766016.0

# Predict the square feet given price

What if we want to predict the square feet given the price? Since we have the equation

$$y = a + bx$$

we can solve it for x. Given the intercept (a), the slope (b), and the price (y), the estimated square feet (x) is

$$x = \frac{y - a}{b}$$

The function below computes the square feet given the price from our simple regression model. Let's use it to see how big we might expect a house that costs $800,000 to be.

In [24]:
def inverse_regression_predictions(output, intercept, slope):
    return (output - intercept) / slope
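
Since this function only inverts the fitted line algebraically, predicting a price and then inverting it should return the original input (as long as the slope is nonzero). Here is a minimal round-trip sketch; the intercept and slope are hypothetical round numbers chosen for illustration, not the fitted values.

```python
def get_regression_predictions(input_feature, intercept, slope):
    return input_feature * slope + intercept

def inverse_regression_predictions(output, intercept, slope):
    return (output - intercept) / slope

# Hypothetical parameters chosen only for illustration.
intercept, slope = -50000.0, 280.0

# Forward: 2000 * 280 - 50000 = 510000
price = get_regression_predictions(2000.0, intercept, slope)

# Inverse: (510000 + 50000) / 280 = 2000
sqft = inverse_regression_predictions(price, intercept, slope)
```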

In [25]:
my_house_price = 800000
estimated_squarefeet = inverse_regression_predictions(
    my_house_price,
    sqft_living_intercept,
    sqft_living_slope)
estimated_squarefeet

2987.5517175024843

# New Model: Estimate Prices From Bedrooms

We have made one model for predicting house prices using square feet, but there are many other features in the sales data. Use your simple linear regression function to estimate the regression parameters for predicting price from the number of bedrooms. Use the training data!

In [26]:
(bedrooms_intercept, bedrooms_slope) = simple_linear_regression(
    train_data['bedrooms'],
    train_data['price'])

(bedrooms_intercept, bedrooms_slope)

(132516.56542891404, 121179.69187145524)

# Test your Linear Regression Algorithm

Now we have two models for predicting the price of a house. How do we know which one is better? Calculate the RSS on the TEST data (remember that this data wasn't involved in learning the model). Compute the RSS from predicting prices using bedrooms and from predicting prices using square feet.

In [27]:
# Compute RSS using bedrooms on TEST data:
get_residual_sum_of_squares(
    test_data['bedrooms'],
    test_data['price'],
    bedrooms_intercept,
    bedrooms_slope)

422061818864021.44

In [28]:
# Compute RSS using sqft living on TEST data:
get_residual_sum_of_squares(
    test_data['sqft_living'],
    test_data['price'],
    sqft_living_intercept,
    sqft_living_slope)

258735947931487.72
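
A lower RSS on the test set means a smaller total squared prediction error, so of the two models the one with the smaller value is preferred. Here is a minimal sketch of that comparison using the two values printed above; the dictionary and variable names are illustrative.

```python
# Test-set RSS values copied from the two outputs above.
test_rss = {
    'bedrooms': 422061818864021.44,
    'sqft_living': 258735947931487.72,
}

# Both values were computed on the same test rows, so they are directly comparable.
best_feature = min(test_rss, key=test_rss.get)
```

For this particular split, `best_feature` is `'sqft_living'`.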