## King County House Prices

### Background

This [dataset](http://www.kaggle.com/harlfoxem/housesalesprediction) of house sale prices in King County, Washington between May 2014 and May 2015 comes from [Kaggle](http://www.kaggle.com).

From the Kaggle description, this dataset contains *19 house features plus the price and the id columns, along with 21613 observations*.

The Kaggle description lists the features as follows:
- **id** a notation for a house
- **date** Date house was sold
- **price** Price is prediction target
- **bedrooms** Number of Bedrooms/House
- **bathrooms** Number of bathrooms/House
- **sqft_living** square footage of the home
- **sqft_lot** square footage of the lot
- **floors** Total floors (levels) in house
- **waterfront** House which has a view to a waterfront
- **view** Has been viewed
- **condition** How good the condition is (overall)
- **grade** overall grade given to the housing unit, based on King County grading system
- **sqft_above** square footage of house apart from basement
- **sqft_basement** square footage of the basement
- **yr_built** Built Year
- **yr_renovated** Year when house was renovated
- **zipcode** zip
- **lat** Latitude coordinate
- **long** Longitude coordinate
- **sqft_living15** Living room area in 2015 (implies some renovations). This might or might not have affected the lot size area
- **sqft_lot15** Lot size area in 2015 (implies some renovations)

### Loading Data

In [1]:
# Standard imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
import sklearn
from IPython.display import display

# Suppress all warnings (this hides every category, not only deprecation warnings)
import warnings
warnings.filterwarnings('ignore')

In [5]:
housing_data = pd.read_csv('data/kc_house_data.csv')
print('Shape: {}'.format(housing_data.shape))

Shape: (21613, 21)


In [7]:
housing_data.head()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
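
The `date` column is loaded as plain strings such as `20141013T000000`. If the sale date is needed later (for example as a year or month feature), it could be parsed along the following lines. This is only a sketch: it rebuilds the five rows shown above by hand instead of reading the CSV, and the names `sample`, `sale_year` and `sale_month` are illustrative, not part of the dataset.

```python
import pandas as pd

# The five `date` and `price` values from the head() output above,
# typed in by hand so the example does not depend on the CSV file.
sample = pd.DataFrame({
    'date': ['20141013T000000', '20141209T000000', '20150225T000000',
             '20141209T000000', '20150218T000000'],
    'price': [221900.0, 538000.0, 180000.0, 604000.0, 510000.0],
})

# Parse the compact timestamp strings into datetime64 values.
sample['date'] = pd.to_datetime(sample['date'], format='%Y%m%dT%H%M%S')

# Hypothetical derived columns: year and month of the sale.
sample['sale_year'] = sample['date'].dt.year
sample['sale_month'] = sample['date'].dt.month
print(sample)
```

The same `pd.to_datetime` call should work on `housing_data['date']`, assuming every row follows the format seen in these five.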


### Fitting the Model

In [9]:
reg_cols = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view',
            'condition', 'grade', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15']

X = housing_data[reg_cols]
y = housing_data['price']
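
One thing worth checking about this feature set: in the five rows displayed earlier, `sqft_living` equals `sqft_above + sqft_basement`. If that relationship holds for every row (not verified here), the three columns are linearly dependent, so the individual coefficients a linear model assigns to them are not uniquely determined, even though the predictions are unaffected. A minimal check, sketched on those five rows only (`sample` and `n_mismatch` are illustrative names):

```python
import pandas as pd

# The three square-footage columns for the five rows shown by head().
sample = pd.DataFrame({
    'sqft_living':   [1180, 2570, 770, 1960, 1680],
    'sqft_above':    [1180, 2170, 770, 1050, 1680],
    'sqft_basement': [0, 400, 0, 910, 0],
})

# Count rows where the living area is NOT the sum of the other two columns.
mismatch = sample['sqft_living'] != sample['sqft_above'] + sample['sqft_basement']
n_mismatch = int(mismatch.sum())
print('Rows where sqft_living != sqft_above + sqft_basement:', n_mismatch)
```

Running the same comparison on `housing_data` would show whether the dependence is exact across the full dataset; if it is, dropping one of the three columns is one possible way to make the coefficients interpretable.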

In [11]:
# Split the data into training and test sets (scikit-learn's default is a shuffled 75% / 25% split)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the model
from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(X_train, y_train)
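
A natural next step after fitting is to compare the model's R² on the training and test sets. The sketch below shows the pattern on synthetic stand-in data, not on the King County data, so the printed scores say nothing about house prices. The names `X_demo`, `y_demo` and `lr_demo` and the choice `random_state=0` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the housing data): 200 rows, 3 features,
# with a linear target plus a small amount of Gaussian noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Passing random_state fixes the shuffle, so the split is reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
lr_demo = LinearRegression().fit(X_tr, y_tr)

# LinearRegression.score returns the coefficient of determination R^2.
train_r2 = lr_demo.score(X_tr, y_tr)
test_r2 = lr_demo.score(X_te, y_te)
print('Training R^2: {:.3f}'.format(train_r2))
print('Test R^2: {:.3f}'.format(test_r2))
```

Applied to the model above, the equivalent calls would be `lr.score(X_train, y_train)` and `lr.score(X_test, y_test)`. Because the split in the notebook does not set `random_state`, those scores will vary somewhat from run to run.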