### Import Libraries

In [18]:
import pandas as pd
import numpy as np

#### Import dataset

Students who are not using Turi Create should download the training and testing data CSV files.

In [19]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}

house_data = pd.read_csv('./data/kc_house_data.csv', dtype=dtype_dict)
train_data = pd.read_csv('./data/kc_house_train_data.csv', dtype=dtype_dict)
test_data = pd.read_csv('./data/kc_house_test_data.csv', dtype=dtype_dict)

3. Write a generic function that accepts a column of data (e.g., an SArray) ‘input_feature’ and another column ‘output’ and returns the Simple Linear Regression parameters ‘intercept’ and ‘slope’.

Use the closed-form solution from lecture to calculate the slope and intercept.

We want the line that “best fits” this data set as measured by the residual sum of squares, which is the simple linear regression cost. The closed-form solution involves the following terms:

1. The number of data points (N)
2. The sum (or mean) of the Ys
3. The sum (or mean) of the Xs
4. The sum (or mean) of the product of the Xs and the Ys
5. The sum (or mean) of the Xs squared


Then once we have calculated all of these terms, we can use the formulas to compute the slope and intercept. Recall that we first solve for the slope and then we use the value of the slope to solve for the intercept.  The formula for the slope is a fraction with:

- numerator = (sum of X*Y) - (1/N)*((sum of X) * (sum of Y))
- denominator = (sum of X^2) - (1/N)*((sum of X) * (sum of X))

Note that you can divide both the numerator and denominator by N (which doesn’t change the answer!) to get:
- numerator = (mean of X \* Y) - (mean of X)*(mean of Y)
- denominator = (mean of X^2) - (mean of X)*(mean of X)
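
The mean-based formulas above can be checked on a tiny example. The sketch below is only an illustration: the helper name `fit_by_means` and the synthetic data are assumptions, not part of the assignment. It fits points that lie exactly on the line y = 1 + 2x and compares the result with NumPy's `np.polyfit`.

```python
import numpy as np
import pandas as pd

def fit_by_means(x, y):
    # slope = (mean of X*Y - mean of X * mean of Y) / (mean of X^2 - (mean of X)^2)
    slope = ((x * y).mean() - x.mean() * y.mean()) / ((x ** 2).mean() - x.mean() ** 2)
    # intercept = mean of Y - slope * mean of X
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Synthetic points on the exact line y = 1 + 2x (illustrative values, not from the dataset)
x = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

intercept, slope = fit_by_means(x, y)

# np.polyfit returns coefficients from highest degree to lowest: [slope, intercept]
ref_slope, ref_intercept = np.polyfit(x.to_numpy(), y.to_numpy(), deg=1)

print(intercept, slope)
print(ref_intercept, ref_slope)
```

Both approaches should recover an intercept close to 1 and a slope close to 2, up to floating-point rounding.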

In [23]:
# numerator = (mean of X * Y) - (mean of X)*(mean of Y)
# denominator = (mean of X^2) - (mean of X)*(mean of X)
# slope = numerator / denominator
# intercept = (mean of Y) - slope * (mean of X)

def simple_linear_regression(input_feature, output):
    n = len(output)
    # Sum-based form of the slope; dividing numerator and denominator by n gives the mean-based form above
    numerator = (input_feature * output).sum() - input_feature.sum() * output.sum() / n
    denominator = (input_feature ** 2).sum() - input_feature.sum() * input_feature.sum() / n
    slope = numerator / denominator
    intercept = output.mean() - slope * input_feature.mean()
    return intercept, slope


In [24]:
sqft_intercept, sqft_slope = simple_linear_regression(train_data['sqft_living'],train_data['price'])
bed_intercept, bed_slope = simple_linear_regression(train_data['bedrooms'],train_data['price'])

4. Use your function to calculate the estimated slope and intercept on the training data to predict ‘price’ given ‘sqft_living’.

Save the values of the slope and intercept for later (you might want to call them, e.g., squarefeet_slope and squarefeet_intercept).

In [25]:
input_feature = train_data['sqft_living']
output = train_data['price']

In [26]:
intercept_train, slope_train = simple_linear_regression(input_feature, output)

5. Write a function that accepts a column of data ‘input_feature’, the ‘slope’, and the ‘intercept’ you learned, and returns a column of predictions ‘predicted_output’ for each entry in the input column. e.g. in python:

In [27]:
def get_regression_predictions(input_feature, intercept, slope):
    predicted_output = slope * input_feature + intercept
    return(predicted_output)
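
Because the prediction is plain arithmetic, the same function works for a single number and for a whole pandas column. A minimal sketch of that behaviour follows; the name `predict` and the parameter values are hypothetical and chosen only for illustration (they are not the fitted values).

```python
import pandas as pd

def predict(input_feature, intercept, slope):
    # Same linear rule as above: prediction = intercept + slope * input
    return intercept + slope * input_feature

# Hypothetical parameters for illustration only (not the fitted values)
demo_intercept, demo_slope = 10.0, 2.0

single = predict(3, demo_intercept, demo_slope)                      # scalar in, scalar out
column = predict(pd.Series([1, 2, 3]), demo_intercept, demo_slope)   # Series in, Series out

print(single)           # 16.0
print(column.tolist())  # [12.0, 14.0, 16.0]
```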

6. Quiz Question: Using your slope and intercept from (4), what is the predicted price for a house with 2650 sqft?

In [28]:
input_feature = 2650
print(get_regression_predictions(input_feature, intercept_train, slope_train))

700074.8459475137


7. Write a function that accepts two columns of data, ‘input_feature’ and ‘output’, and the regression parameters ‘slope’ and ‘intercept’, and outputs the Residual Sum of Squares (RSS). e.g. in python:

In [31]:
def get_residual_sum_of_squares(input_feature, output, intercept,slope):
    RSS = ((output - (intercept + slope * input_feature))**2).sum(axis = 0)
    return(RSS)
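
To see what the RSS measures, it can help to evaluate it by hand on three points. In the sketch below, the helper `rss` and the data are illustrative assumptions. The points lie exactly on y = 1 + 2x, so the true parameters give an RSS of zero, while shifting the intercept by 1 makes every residual equal to 1 and the RSS equal to 3.

```python
import pandas as pd

def rss(input_feature, output, intercept, slope):
    # Residual = observed output minus predicted output
    residuals = output - (intercept + slope * input_feature)
    return (residuals ** 2).sum()

# Illustrative data lying exactly on y = 1 + 2x (not from the dataset)
x = pd.Series([0.0, 1.0, 2.0])
y = pd.Series([1.0, 3.0, 5.0])

perfect = rss(x, y, intercept=1.0, slope=2.0)   # every residual is 0
shifted = rss(x, y, intercept=0.0, slope=2.0)   # every residual is 1

print(perfect)  # 0.0
print(shifted)  # 3.0
```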

8. Quiz Question: According to this function and the slope and intercept from (4), what is the RSS for the simple linear regression using square feet to predict prices on TRAINING data?

In [43]:
sqftrss = get_residual_sum_of_squares(train_data['sqft_living'], train_data['price'], intercept_train, slope_train)

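# Note: this call reuses the sqft_living parameters (intercept_train, slope_train);
# the RSS of the bedroom model itself would use bed_intercept and bed_slope instead.
bedrss = get_residual_sum_of_squares(train_data['bedrooms'], train_data['price'], intercept_train, slope_train)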

In [44]:
sqftrss, bedrss

(1201918354177283.0, 8334814087871522.0)

9. Note that although we estimated the regression slope and intercept in order to predict the output from the input, since this is a simple linear relationship with only two variables we can invert the linear function to estimate the input given the output!

Write a function that accepts a column of data ‘output’ and the regression parameters ‘slope’ and ‘intercept’ and outputs the column of data ‘estimated_input’. Do this by solving the linear function output = intercept + slope*input for the ‘input’ variable (i.e. ‘input’ should be on one side of the equals sign by itself). e.g. in python:

In [13]:
def inverse_regression_predictions(output, intercept, slope):
    estimated_input = (output - intercept)/slope
    return(estimated_input)
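
One way to check the inversion is a round trip: predict outputs from known inputs, then invert them and confirm the original inputs come back. The sketch below uses hypothetical helper names (`predict`, `invert`) and made-up parameters. It also assumes the slope is non-zero, since the inverse divides by it.

```python
import pandas as pd

def predict(input_feature, intercept, slope):
    return intercept + slope * input_feature

def invert(output, intercept, slope):
    # Solve output = intercept + slope * input for input (requires slope != 0)
    return (output - intercept) / slope

# Hypothetical parameters and inputs for illustration only
demo_intercept, demo_slope = 10.0, 2.0
sqft = pd.Series([1000.0, 2000.0, 3000.0])

prices = predict(sqft, demo_intercept, demo_slope)
recovered = invert(prices, demo_intercept, demo_slope)

print(recovered.tolist())
```

The recovered values should match the original `sqft` column up to floating-point rounding.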

10. Quiz Question: According to this function and the regression slope and intercept from (4), what is the estimated square footage for a house costing $800,000?

In [33]:
output = 800000
print(inverse_regression_predictions(output,intercept_train,slope_train))

3004.3962451522766


11. Instead of using ‘sqft_living’ to estimate prices, we could use ‘bedrooms’ (a count of the number of bedrooms in the house). Using your function from (3), calculate the Simple Linear Regression slope and intercept for estimating price based on bedrooms. Save this slope and intercept for later (you might want to call them, e.g., bedroom_slope and bedroom_intercept).