# Simple Linear Regression on House Sales data

### Fire up GraphLab Create

In [1]:
import graphlab

### Load house sales data

The dataset contains house sales from King County, the region where the city of Seattle, WA is located.

In [2]:
sales = graphlab.SFrame('kc_house_data.gl/')


### Data Exploration

In [6]:
sales

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront
7129300520,2014-10-13 00:00:00+00:00,221900.0,3.0,1.0,1180.0,5650,1,0
6414100192,2014-12-09 00:00:00+00:00,538000.0,3.0,2.25,2570.0,7242,2,0
5631500400,2015-02-25 00:00:00+00:00,180000.0,2.0,1.0,770.0,10000,1,0
2487200875,2014-12-09 00:00:00+00:00,604000.0,4.0,3.0,1960.0,5000,1,0
1954400510,2015-02-18 00:00:00+00:00,510000.0,3.0,2.0,1680.0,8080,1,0
7237550310,2014-05-12 00:00:00+00:00,1225000.0,4.0,4.5,5420.0,101930,1,0
1321400060,2014-06-27 00:00:00+00:00,257500.0,3.0,2.25,1715.0,6819,2,0
2008000270,2015-01-15 00:00:00+00:00,291850.0,3.0,1.5,1060.0,9711,1,0
2414600126,2015-04-15 00:00:00+00:00,229500.0,3.0,1.0,1780.0,7470,1,0
3793500160,2015-03-12 00:00:00+00:00,323000.0,3.0,2.5,1890.0,6560,2,0

view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat
0,3,7,1180,0,1955,0,98178,47.51123398
0,3,7,2170,400,1951,1991,98125,47.72102274
0,3,6,770,0,1933,0,98028,47.73792661
0,5,7,1050,910,1965,0,98136,47.52082
0,3,8,1680,0,1987,0,98074,47.61681228
0,3,11,3890,1530,2001,0,98053,47.65611835
0,3,7,1715,0,1995,0,98003,47.30972002
0,3,7,1060,0,1963,0,98198,47.40949984
0,3,7,1050,730,1960,0,98146,47.51229381
0,3,7,1890,0,2003,0,98038,47.36840673

long,sqft_living15,sqft_lot15
-122.25677536,1340.0,5650.0
-122.3188624,1690.0,7639.0
-122.23319601,2720.0,8062.0
-122.39318505,1360.0,5000.0
-122.04490059,1800.0,7503.0
-122.00528655,4760.0,101930.0
-122.32704857,2238.0,6819.0
-122.31457273,1650.0,9711.0
-122.33659507,1780.0,8113.0
-122.0308176,2390.0,7570.0


### Split data into training and testing

In [3]:
train_data,test_data = sales.random_split(.8,seed=0)

In [7]:
prices = sales['price'] # extract the price column of the sales SFrame -- this is now an SArray
sum_prices = prices.sum()
num_houses = prices.size()
avg_price_1 = sum_prices/num_houses
avg_price_2 = prices.mean()
print "average price via method 1: " + str(avg_price_1)
print "average price via method 2: " + str(avg_price_2)

average price via method 1: 540088.141905
average price via method 2: 540088.141905


In [8]:
prices_squared = prices*prices
sum_prices_squared = prices_squared.sum() # prices_squared is an SArray of the squares, and we want to add them up.
print "the sum of price squared is: " + str(sum_prices_squared)

the sum of price squared is: 9.21732513355e+15


### Simple linear regression algorithmic function - Part 1
To calculate the intercept and the slope of the regression line

In [11]:
# x is the input and y is the output
def simple_linear_regression(x, y):
    
    # compute the product of the output and the input and its sum
    product = x * y
    sum_of_product = product.sum()
    
    # compute the squared value of the input and its sum
    x_squared = x * x
    sum_x_squared = x_squared.sum()
    
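    # use the formula for the slope
    # (under Python 2, all-integer inputs make these divisions truncate; cast to float if that matters)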
    numerator = sum_of_product - ((x.sum() * y.sum()) / x.size())
    denominator = sum_x_squared - ((x.sum() * x.sum()) / x.size())
    slope = numerator / denominator
    
    # use the formula for the intercept
    intercept = y.mean() - (slope * x.mean())
    
    return (intercept, slope)
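
For reference, the closed-form least-squares expressions that the function above appears to implement can be written as follows. The notation is introduced here and is not from the original notebook: $x_i$ is the input, $y_i$ the output, $N$ the number of observations, and $\bar{x}$, $\bar{y}$ the means.

$$
\text{slope} = \frac{\sum_{i=1}^{N} x_i y_i - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)\left(\sum_{i=1}^{N} y_i\right)}{\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2},
\qquad
\text{intercept} = \bar{y} - \text{slope}\cdot\bar{x}
$$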

We can test that our function works by passing it inputs for which we know the answer. In particular, we can generate a feature and place the output exactly on a line: output = 1 + 1\*input_feature. Both the slope and the intercept should then be 1.

In [14]:
test_feature = graphlab.SArray(range(5))
test_output = graphlab.SArray(1 + 1*test_feature)
(test_intercept, test_slope) =  simple_linear_regression(test_feature, test_output)
print test_feature
print test_output
print "Intercept: " + str(test_intercept)
print "Slope: " + str(test_slope)

[0L, 1L, 2L, 3L, 4L]
[1L, 2L, 3L, 4L, 5L]
Intercept: 1.0
Slope: 1
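
The same kind of check can be run without GraphLab. The sketch below is a plain-Python stand-in, not the notebook's function: `fit_line` is a hypothetical name, ordinary lists replace SArrays, and the points are placed on a different line (output = 3 + 2\*input) to show the check is not specific to a slope and intercept of 1.

```python
# Plain-Python stand-in for simple_linear_regression (assumption: lists instead of SArrays).
def fit_line(xs, ys):
    n = float(len(xs))  # float so the divisions below never truncate
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    # same closed-form expressions as in the function above
    slope = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x * sum_x / n)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


xs = list(range(5))
ys = [3 + 2 * x for x in xs]  # points placed exactly on output = 3 + 2*input
intercept, slope = fit_line(xs, ys)
print("Intercept: " + str(intercept))
print("Slope: " + str(slope))
```

Because the points lie exactly on the line, the recovered intercept and slope should match the 3 and 2 used to generate them.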


Now that the function works, let's build a regression model for predicting price based on sqft_living.

In [15]:
sqft_intercept, sqft_slope = simple_linear_regression(train_data['sqft_living'], train_data['price'])

print "Intercept: " + str(sqft_intercept)
print "Slope: " + str(sqft_slope)

Intercept: -47116.0765749
Slope: 281.958838568


### Simple linear regression algorithmic function - Part 2
To calculate the predicted output

In [16]:
def get_regression_predictions(input_feature, intercept, slope):
    # calculate the predicted values:
    predicted_values = intercept + (slope * input_feature)
    return predicted_values

**What is the predicted price for a house with 2650 sqft?**

In [18]:
my_house_sqft = 2650
estimated_price = get_regression_predictions(my_house_sqft, sqft_intercept, sqft_slope)
print "The estimated price for a house with %d squarefeet is $%.2f" % (my_house_sqft, estimated_price)

The estimated price for a house with 2650 squarefeet is $700074.85


### Residual Sum of Squares

RSS is the sum of the squares of the residuals, where each residual is the difference between the true output and the predicted output.

In [22]:
def get_residual_sum_of_squares(input_feature, actual_output, intercept, slope):
    # First get the predictions
    predicted_output = intercept + (slope * input_feature)

    # then compute the residuals (since we are squaring it doesn't matter which order you subtract)
    residuals = actual_output - predicted_output

    # square the residuals and add them up
    residuals_squared = residuals * residuals
    residual_sum_squares = residuals_squared.sum()

    return(residual_sum_squares)
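
To make the definition concrete, here is a minimal sketch on five hand-made points, assuming plain Python lists in place of SArrays. The helper name `rss` and the toy data are illustrative only. A line that passes through every point gives an RSS of zero, while a flat line at the mean of the outputs does not.

```python
# Plain-Python sketch of RSS (assumption: lists instead of SArrays, hypothetical helper name).
def rss(xs, ys, intercept, slope):
    # residual = true output minus predicted output, one per observation
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return sum(r * r for r in residuals)


xs = [0, 1, 2, 3, 4]
ys = [1, 2, 3, 4, 5]  # exactly on output = 1 + 1*input

rss_exact = rss(xs, ys, 1.0, 1.0)  # the generating line
rss_flat = rss(xs, ys, 3.0, 0.0)   # horizontal line at the mean of ys
print("RSS of the exact line: " + str(rss_exact))
print("RSS of the flat line: " + str(rss_flat))
```

For the flat line the residuals are -2, -1, 0, 1 and 2, so the squares add up to 10.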

In [33]:
rss_prices_on_sqft = get_residual_sum_of_squares(train_data['sqft_living'], train_data['price'], sqft_intercept, sqft_slope)
print 'The RSS of predicting Prices based on Square Feet is : ' + str(rss_prices_on_sqft)

The RSS of predicting Prices based on Square Feet is : 1.20191835632e+15


### Predict the square feet given the price

In other words, predict the input given the output. The function is:

In [34]:
def inverse_regression_predictions(output, intercept, slope):
    # solve output = intercept + slope*input_feature for input_feature. Use this equation to compute the inverse predictions:
    estimated_feature = (output - intercept) / slope
    return estimated_feature
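
One quick sanity check on the inverse is a round trip: predict a price from a square footage, then invert that price and confirm the original square footage comes back. The sketch below hard-codes the intercept and slope printed earlier for the sqft_living model; `predict` and `invert` are hypothetical stand-ins for the two notebook functions.

```python
# Coefficients copied from the sqft_living fit printed earlier in the notebook.
sqft_intercept = -47116.0765749
sqft_slope = 281.958838568


def predict(input_feature, intercept, slope):
    return intercept + slope * input_feature


def invert(output, intercept, slope):
    # solve output = intercept + slope*input_feature for input_feature
    return (output - intercept) / slope


price = predict(2650, sqft_intercept, sqft_slope)
recovered_sqft = invert(price, sqft_intercept, sqft_slope)
print("Predicted price: %.2f" % price)
print("Recovered square feet: %.6f" % recovered_sqft)
```

The round trip only confirms that the algebra is consistent; it says nothing about how well the line fits the data.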

**What is the estimated square footage for a house costing $800,000?**

In [26]:
my_house_price = 800000
estimated_squarefeet = inverse_regression_predictions(my_house_price, sqft_intercept, sqft_slope)
print "The estimated squarefeet for a house worth $%.2f is %d" % (my_house_price, estimated_squarefeet)

The estimated squarefeet for a house worth $800000.00 is 3004


### New Model: estimate prices from bedrooms

In [29]:
# Estimate the slope and intercept for predicting 'price' based on 'bedrooms'
bedrm_intercept, bedrm_slope = simple_linear_regression(train_data['bedrooms'], train_data['price'])
print "Intercept: " + str(bedrm_intercept)
print "Slope: " + str(bedrm_slope)

Intercept: 109473.180469
Slope: 127588.952175


### Test the square feet and bedrooms models

**Which model (square feet or bedrooms) has the lower RSS on TEST data?**

In [30]:
# Compute RSS when using square feet on TEST data:
get_residual_sum_of_squares(test_data['sqft_living'], test_data['price'], sqft_intercept, sqft_slope)

275402936247141.3

In [32]:
# Compute RSS when using bedrooms on TEST data:
get_residual_sum_of_squares(test_data['bedrooms'], test_data['price'], bedrm_intercept, bedrm_slope)

493364582868287.94

The square feet model has a lower RSS on the test data than the bedrooms model (about 2.75e14 versus 4.93e14), so it is the better of the two predictors here.
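
The comparison can also be written as a small lookup over the two test RSS values reported above. This is only an illustrative sketch: the dictionary `test_rss` and its keys are assumptions made here, and the numbers are copied from the cell outputs rather than recomputed.

```python
# Test-set RSS values copied from the two cell outputs above.
test_rss = {
    "sqft_living": 275402936247141.3,
    "bedrooms": 493364582868287.94,
}

# the feature whose model has the smallest RSS on the test data
best_feature = min(test_rss, key=test_rss.get)
ratio = test_rss["bedrooms"] / test_rss["sqft_living"]

print("Lowest test RSS: " + best_feature)
print("bedrooms RSS / sqft_living RSS: %.2f" % ratio)
```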