# Regression Week 1: Simple Linear Regression

In this notebook we will use data on house sales in King County to predict house prices using simple (one input) linear regression. 

# Import packages

In [51]:
import pandas as pd
import numpy as np

# Load house sales data

The dataset contains house sales in King County, the region where the city of Seattle, WA is located.

In [3]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':float, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}
sales = pd.read_csv('week4csv/kc_house_data.csv', dtype=dtype_dict)

# Split data into training and testing

We use random_state=0 so that everyone running this notebook gets the same split. In practice, you may set a random seed of your own, or leave random_state unset to get a different split on each run.

In [8]:
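# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(sales, train_size=0.8, random_state=0)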
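To make the role of the seed concrete, here is a minimal sketch of a seeded shuffle-and-cut split written with the standard library only. The helper `split_rows` is a hypothetical stand-in for illustration, not how `train_test_split` is implemented; the point is only that the same seed reproduces the same split.

```python
import random

def split_rows(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows with a seeded generator, then cut it in two.

    Hypothetical helper for illustration only; not scikit-learn's implementation.
    """
    rng = random.Random(seed)      # private generator, global random state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))             # toy stand-in for the rows of `sales`
train_a, test_a = split_rows(rows, seed=0)
train_b, test_b = split_rows(rows, seed=0)

print(train_a == train_b and test_a == test_b)  # True: same seed, same split
print(len(train_a), len(test_a))                # 8 2
```

Every row lands in exactly one of the two parts, so the test set is never seen during fitting.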

# Summary functions

In [10]:
# Let's compute the mean of the House Prices in King County in 2 different ways.
prices = sales['price'] # extract the price column of the sales DataFrame; this is a pandas Series

# recall that the arithmetic average (the mean) is the sum of the prices divided by the total number of houses:
sum_prices = prices.sum()
num_houses = len(prices) # len() of a Series returns its number of rows
avg_price_1 = sum_prices/num_houses
avg_price_2 = prices.mean() # if you just want the average, the .mean() method computes it directly
print("average price via method 1: " + str(avg_price_1))
print("average price via method 2: " + str(avg_price_2))

average price via method 1: 540088.141767
average price via method 2: 540088.141767


As we can see, both methods give the same answer.

In [11]:
# if we want to multiply every price by 0.5 it's as simple as:
half_prices = 0.5*prices
# Let's compute the sum of squares of price. We can also multiply two Series of the same length elementwise with *
prices_squared = prices*prices
sum_prices_squared = prices_squared.sum() # prices_squared is a Series of the squares and we want to add them up.
print("the sum of price squared is: " + str(sum_prices_squared))

the sum of price squared is: 9.21732513847e+15


Aside: The Python notation x.xxe+yy means x.xx \* 10^(yy), e.g. 100 = 10^2 = 1 \* 10^2 = 1e2.
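A quick check of this notation in plain Python (the numbers below are arbitrary examples, not values from the dataset):

```python
# Scientific notation literals are ordinary floats.
hundred = 1e2
print(hundred == 100.0)            # True
print(2.5e3 == 2500.0)             # True

# Formatting a float with the "e" presentation type goes the other way.
formatted = "{:.2e}".format(123456.0)
print(formatted)                   # 1.23e+05
```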

# Simple linear regression function with sklearn

In [43]:
X1_train=train_data[['sqft_living']]
y1_train=train_data[['price']]
X1_train=X1_train.reset_index(drop=True)
y1_train=y1_train.reset_index(drop=True)

In [45]:
from sklearn import linear_model
model = linear_model.LinearRegression()
model.fit(X1_train, y1_train)
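model.coef_       # slope; not displayed, because a cell only echoes its last expression
model.intercept_  # intercept; this is the value shown below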

array([-48257.06359103])
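With a single input, the least-squares line has a closed form: the slope is the covariance of input and output divided by the variance of the input, and the intercept makes the line pass through the point of means. The sketch below illustrates this on toy data; the function name `simple_linear_regression` and the data are assumptions for illustration, not part of the notebook or of scikit-learn.

```python
def simple_linear_regression(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x.

    Illustrative helper (hypothetical name); expects two equal-length lists of floats.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of the input
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Toy data lying exactly on y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
intercept, slope = simple_linear_regression(x, y)
print(intercept, slope)  # 1.0 2.0
```

Applied to `sqft_living` and `price`, the same formulas should agree with `model.intercept_` and `model.coef_` up to floating-point error.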

# $R^2$ value

In [47]:
model.score(X1_train, y1_train)

0.49552283606032776
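For a regressor, `score` returns the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. A minimal sketch of that definition on toy numbers follows; the helper `r_squared` is a hypothetical name used only for this illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (illustrative helper)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)               # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0])    # predictions match exactly
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean

print(perfect)    # 1.0
print(mean_only)  # 0.0
```

Read this way, the value of about 0.496 above says that `sqft_living` alone accounts for roughly half of the variance in the training prices.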

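# MSE and RMSE

Note that `mean_squared_error` returns the mean squared error (MSE), which is in squared dollars; the RMSE is its square root.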

In [52]:
X1_test=test_data[['sqft_living']]
y1_test=test_data[['price']]
X1_test=X1_test.reset_index(drop=True)
y1_test=y1_test.reset_index(drop=True)

In [62]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y1_test, model.predict(X1_test))

61940787124.62471
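The value above is a mean squared error, so it is in squared dollars. Taking the square root gives the root mean squared error in dollars, roughly 248,900 here. The sketch below shows the relationship on toy numbers; the helper `mse` and the data are assumptions for illustration.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared prediction errors (illustrative helper)."""
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)

y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 32.0, 38.0]   # every prediction is off by exactly 2

toy_mse = mse(y_true, y_pred)
toy_rmse = math.sqrt(toy_mse)       # back in the units of y
print(toy_mse, toy_rmse)            # 4.0 2.0
```

In the notebook itself, the same step would be `np.sqrt(mean_squared_error(y1_test, model.predict(X1_test)))`.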

# Simple Linear Regression with statsmodels

In [56]:
import statsmodels.api as sm

x1_train=sm.add_constant(X1_train)

res = sm.OLS(y1_train, x1_train).fit()  # fit once and reuse the fitted result
results = res.summary()
results

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.496 |
| Model: | OLS | Adj. R-squared: | 0.495 |
| Method: | Least Squares | F-statistic: | 16980.0 |
| Date: | Tue, 11 Oct 2016 | Prob (F-statistic): | 0.0 |
| Time: | 00:23:02 | Log-Likelihood: | -240410.0 |
| No. Observations: | 17290 | AIC: | 480800.0 |
| Df Residuals: | 17288 | BIC: | 480800.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| const | -4.826e+04 | 4961.872 | -9.726 | 0.000 | -5.8e+04 -3.85e+04 |
| sqft_living | 283.9686 | 2.179 | 130.312 | 0.000 | 279.697 288.240 |

| | | | |
|---|---|---|---|
| Omnibus: | 12071.546 | Durbin-Watson: | 2.017 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 471754.959 |
| Skew: | 2.876 | Prob(JB): | 0.0 |
| Kurtosis: | 27.935 | Cond. No. | 5620.0 |


In [60]:
res.rsquared

0.49552283606032776

# MSE on the test data set

In [92]:
x1_test=sm.add_constant(X1_test)
sum((np.array(y1_test['price'])-np.array(res.predict(x1_test)))**2)/len(x1_test)

61940787124.624931