# House Prices

url = https://www.kaggle.com/c/house-prices-advanced-regression-techniques

### Understanding the Question

Given a dataset of various features, can we predict the sales price for each house?

This is a supervised regression problem.

### Getting Started - Load & Inspect the Data

The data is available from Kaggle at https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data.  There is also a data_description.txt file that elaborates a bit on the feature columns.  The 'SalePrice' column is our label: the target variable we are trying to predict.

In [29]:
import pandas as pd
import numpy as np

df = pd.read_csv("train.csv")
df.head()

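| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |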


In [30]:
df.shape

(1460, 81)

In [31]:
df.dtypes

Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
Street            object
Alley             object
LotShape          object
LandContour       object
Utilities         object
LotConfig         object
LandSlope         object
Neighborhood      object
Condition1        object
Condition2        object
BldgType          object
HouseStyle        object
OverallQual        int64
OverallCond        int64
YearBuilt          int64
YearRemodAdd       int64
RoofStyle         object
RoofMatl          object
Exterior1st       object
Exterior2nd       object
MasVnrType        object
MasVnrArea       float64
ExterQual         object
ExterCond         object
Foundation        object
                  ...   
BedroomAbvGr       int64
KitchenAbvGr       int64
KitchenQual       object
TotRmsAbvGrd       int64
Functional        object
Fireplaces         int64
FireplaceQu       object
GarageType        object
GarageYrBlt      float64


In [32]:
df.describe()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1452.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 730.5 | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.95411 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.19589 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.0 | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 34900.0 |
| 25% | 365.75 | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 129975.0 |
| 50% | 730.5 | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | ... | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 163000.0 |
| 75% | 1095.25 | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 166.0 | 712.25 | ... | 168.0 | 68.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 214000.0 |
| max | 1460.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 | 755000.0 |


In [33]:
df.isnull().sum()

Id                  0
MSSubClass          0
MSZoning            0
LotFrontage       259
LotArea             0
Street              0
Alley            1369
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          8
MasVnrArea          8
ExterQual           0
ExterCond           0
Foundation          0
                 ... 
BedroomAbvGr        0
KitchenAbvGr        0
KitchenQual         0
TotRmsAbvGrd        0
Functional          0
Fireplaces          0
FireplaceQu       690
GarageType         81
GarageYrBlt        81
GarageFinish       81
GarageCars          0
GarageArea          0
GarageQual         81
GarageCond         81
PavedDrive

In [34]:
df.corr()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Id | 1.0 | 0.011156 | -0.010601 | -0.033226 | -0.028365 | 0.012609 | -0.012713 | -0.021998 | -0.050298 | -0.005024 | ... | -0.029643 | -0.000477 | 0.002889 | -0.046635 | 0.00133 | 0.057044 | -0.006242 | 0.021172 | 0.000712 | -0.021917 |
| MSSubClass | 0.011156 | 1.0 | -0.386347 | -0.139781 | 0.032628 | -0.059316 | 0.02785 | 0.040581 | 0.022936 | -0.069836 | ... | -0.012579 | -0.0061 | -0.012037 | -0.043825 | -0.02603 | 0.008283 | -0.007683 | -0.013585 | -0.021407 | -0.084284 |
| LotFrontage | -0.010601 | -0.386347 | 1.0 | 0.426095 | 0.251646 | -0.059213 | 0.123349 | 0.088866 | 0.193458 | 0.233633 | ... | 0.088521 | 0.151972 | 0.0107 | 0.070029 | 0.041383 | 0.206167 | 0.003368 | 0.0112 | 0.00745 | 0.351799 |
| LotArea | -0.033226 | -0.139781 | 0.426095 | 1.0 | 0.105806 | -0.005636 | 0.014228 | 0.013788 | 0.10416 | 0.214103 | ... | 0.171698 | 0.084774 | -0.01834 | 0.020423 | 0.04316 | 0.077672 | 0.038068 | 0.001205 | -0.014261 | 0.263843 |
| OverallQual | -0.028365 | 0.032628 | 0.251646 | 0.105806 | 1.0 | -0.091932 | 0.572323 | 0.550684 | 0.411876 | 0.239666 | ... | 0.238923 | 0.308819 | -0.113937 | 0.030371 | 0.064886 | 0.065166 | -0.031406 | 0.070815 | -0.027347 | 0.790982 |
| OverallCond | 0.012609 | -0.059316 | -0.059213 | -0.005636 | -0.091932 | 1.0 | -0.375983 | 0.073741 | -0.128101 | -0.046231 | ... | -0.003334 | -0.032589 | 0.070356 | 0.025504 | 0.054811 | -0.001985 | 0.068777 | -0.003511 | 0.04395 | -0.077856 |
| YearBuilt | -0.012713 | 0.02785 | 0.123349 | 0.014228 | 0.572323 | -0.375983 | 1.0 | 0.592855 | 0.315707 | 0.249503 | ... | 0.22488 | 0.188686 | -0.387268 | 0.031355 | -0.050364 | 0.00495 | -0.034383 | 0.012398 | -0.013618 | 0.522897 |
| YearRemodAdd | -0.021998 | 0.040581 | 0.088866 | 0.013788 | 0.550684 | 0.073741 | 0.592855 | 1.0 | 0.179618 | 0.128451 | ... | 0.205726 | 0.226298 | -0.193919 | 0.045286 | -0.03874 | 0.005829 | -0.010286 | 0.02149 | 0.035743 | 0.507101 |
| MasVnrArea | -0.050298 | 0.022936 | 0.193458 | 0.10416 | 0.411876 | -0.128101 | 0.315707 | 0.179618 | 1.0 | 0.264736 | ... | 0.159718 | 0.125703 | -0.110204 | 0.018796 | 0.061466 | 0.011723 | -0.029815 | -0.005965 | -0.008201 | 0.477493 |
| BsmtFinSF1 | -0.005024 | -0.069836 | 0.233633 | 0.214103 | 0.239666 | -0.046231 | 0.249503 | 0.128451 | 0.264736 | 1.0 | ... | 0.204306 | 0.111761 | -0.102303 | 0.026451 | 0.062021 | 0.140491 | 0.003571 | -0.015727 | 0.014359 | 0.38642 |


In [35]:
df.corr()['SalePrice']

Id              -0.021917
MSSubClass      -0.084284
LotFrontage      0.351799
LotArea          0.263843
OverallQual      0.790982
OverallCond     -0.077856
YearBuilt        0.522897
YearRemodAdd     0.507101
MasVnrArea       0.477493
BsmtFinSF1       0.386420
BsmtFinSF2      -0.011378
BsmtUnfSF        0.214479
TotalBsmtSF      0.613581
1stFlrSF         0.605852
2ndFlrSF         0.319334
LowQualFinSF    -0.025606
GrLivArea        0.708624
BsmtFullBath     0.227122
BsmtHalfBath    -0.016844
FullBath         0.560664
HalfBath         0.284108
BedroomAbvGr     0.168213
KitchenAbvGr    -0.135907
TotRmsAbvGrd     0.533723
Fireplaces       0.466929
GarageYrBlt      0.486362
GarageCars       0.640409
GarageArea       0.623431
WoodDeckSF       0.324413
OpenPorchSF      0.315856
EnclosedPorch   -0.128578
3SsnPorch        0.044584
ScreenPorch      0.111447
PoolArea         0.092404
MiscVal         -0.021190
MoSold           0.046432
YrSold          -0.028923
SalePrice        1.000000
Name: SalePr

### Simplifying the Problem

This is a fairly wide dataset: 81 columns, of which 79 are potential features once 'Id' and 'SalePrice' are excluded.  For this first attempt at predicting the sale price of a home, I am going to simplify by selecting a handful of features I think will matter the most.

Using my domain knowledge of real estate and the correlations with SalePrice above, I picked the following 14 features.  I also paid attention to which features were heavily correlated with each other so as not to use both; for example, 'GarageCars' and 'GarageArea' measure roughly the same thing.
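
The redundancy check described above can be sketched in code.  This is only an illustration on a small made-up frame, not the Kaggle data; the helper name `correlated_pairs` and the 0.8 threshold are assumptions picked for the example, not something the notebook uses.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, corr) for column pairs whose absolute
    Pearson correlation exceeds `threshold` (hypothetical helper)."""
    corr = frame.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs

# Toy data: garage area is nearly a linear function of garage car capacity,
# while the year of sale is unrelated to both.
rng = np.random.RandomState(0)
cars = rng.randint(0, 4, size=200)
toy = pd.DataFrame({
    'GarageCars': cars,
    'GarageArea': cars * 250 + rng.normal(0, 30, size=200),
    'YrSold': rng.randint(2006, 2011, size=200),
})

pairs = correlated_pairs(toy)
print(pairs)
```

Any pair flagged this way is a candidate for keeping only one of the two columns.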

In [36]:
feature_cols = ['Neighborhood', 'BldgType', 'OverallQual', 'YearBuilt', 'RoofStyle', 'ExterQual', 'BsmtFinSF1', 'TotalBsmtSF',
               '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'FullBath', 'TotRmsAbvGrd', 'GarageCars']
train_df = df[feature_cols]
train_df.head()

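| | Neighborhood | BldgType | OverallQual | YearBuilt | RoofStyle | ExterQual | BsmtFinSF1 | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | GrLivArea | FullBath | TotRmsAbvGrd | GarageCars |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CollgCr | 1Fam | 7 | 2003 | Gable | Gd | 706 | 856 | 856 | 854 | 1710 | 2 | 8 | 2 |
| 1 | Veenker | 1Fam | 6 | 1976 | Gable | TA | 978 | 1262 | 1262 | 0 | 1262 | 2 | 6 | 2 |
| 2 | CollgCr | 1Fam | 7 | 2001 | Gable | Gd | 486 | 920 | 920 | 866 | 1786 | 2 | 6 | 2 |
| 3 | Crawfor | 1Fam | 7 | 1915 | Gable | TA | 216 | 756 | 961 | 756 | 1717 | 1 | 7 | 3 |
| 4 | NoRidge | 1Fam | 8 | 2000 | Gable | Gd | 655 | 1145 | 1145 | 1053 | 2198 | 2 | 9 | 3 |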


In [37]:
train_df.isnull().sum()

Neighborhood    0
BldgType        0
OverallQual     0
YearBuilt       0
RoofStyle       0
ExterQual       0
BsmtFinSF1      0
TotalBsmtSF     0
1stFlrSF        0
2ndFlrSF        0
GrLivArea       0
FullBath        0
TotRmsAbvGrd    0
GarageCars      0
dtype: int64

### Cleaning & Pre-Processing the Data

Luckily, none of the columns I selected have any missing data, so there is nothing to fill in.  However, a handful of fields need to be converted to a numerical data type, after which the entire training set can be scaled.

We will also need to apply these same transforms to our features in the test data.

First we need to convert all non-numerical columns to numerical columns.  Let's start with 'Neighborhood'.

In [38]:
train_df.dtypes

Neighborhood    object
BldgType        object
OverallQual      int64
YearBuilt        int64
RoofStyle       object
ExterQual       object
BsmtFinSF1       int64
TotalBsmtSF      int64
1stFlrSF         int64
2ndFlrSF         int64
GrLivArea        int64
FullBath         int64
TotRmsAbvGrd     int64
GarageCars       int64
dtype: object

In [39]:
train_df['Neighborhood'].value_counts()

NAmes      225
CollgCr    150
OldTown    113
Edwards    100
Somerst     86
Gilbert     79
NridgHt     77
Sawyer      74
NWAmes      73
SawyerW     59
BrkSide     58
Crawfor     51
Mitchel     49
NoRidge     41
Timber      38
IDOTRR      37
ClearCr     28
SWISU       25
StoneBr     25
MeadowV     17
Blmngtn     17
BrDale      16
Veenker     11
NPkVill      9
Blueste      2
Name: Neighborhood, dtype: int64

In [40]:
from sklearn.preprocessing import LabelEncoder

le1 = LabelEncoder()
le1.fit(train_df['Neighborhood'])
le1.classes_ #Print classes

array(['Blmngtn', 'Blueste', 'BrDale', 'BrkSide', 'ClearCr', 'CollgCr',
       'Crawfor', 'Edwards', 'Gilbert', 'IDOTRR', 'MeadowV', 'Mitchel',
       'NAmes', 'NPkVill', 'NWAmes', 'NoRidge', 'NridgHt', 'OldTown',
       'SWISU', 'Sawyer', 'SawyerW', 'Somerst', 'StoneBr', 'Timber',
       'Veenker'], dtype=object)

In [41]:
train_df['Neighborhood'] = le1.transform(train_df['Neighborhood'])
train_df['Neighborhood'].head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


0     5
1    24
2     5
3     6
4    15
Name: Neighborhood, dtype: int64
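
The message above is pandas' SettingWithCopyWarning: `train_df` was created by selecting columns from `df`, so pandas cannot tell whether the assignment is meant to change `df` as well.  One common way to avoid it is to take an explicit copy when selecting the columns.  A minimal sketch on a made-up frame (the names `df_demo` and `features` are just for this example):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df_demo = pd.DataFrame({
    'Neighborhood': ['CollgCr', 'Veenker', 'CollgCr', 'Crawfor'],
    'SalePrice': [208500, 181500, 223500, 140000],
})

# .copy() makes the selection an independent DataFrame, so assigning
# to one of its columns is unambiguous and leaves df_demo untouched.
features = df_demo[['Neighborhood']].copy()

le = LabelEncoder()
features['Neighborhood'] = le.fit_transform(features['Neighborhood'])

print(features['Neighborhood'].tolist())
print(df_demo['Neighborhood'].tolist())
```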

In [42]:
train_df.dtypes

Neighborhood     int64
BldgType        object
OverallQual      int64
YearBuilt        int64
RoofStyle       object
ExterQual       object
BsmtFinSF1       int64
TotalBsmtSF      int64
1stFlrSF         int64
2ndFlrSF         int64
GrLivArea        int64
FullBath         int64
TotRmsAbvGrd     int64
GarageCars       int64
dtype: object

In [43]:
#Repeat the process for our other categorical data that is in text format
le2 = LabelEncoder()
train_df['BldgType'] = le2.fit_transform(train_df['BldgType'])
le3 = LabelEncoder()
train_df['RoofStyle'] = le3.fit_transform(train_df['RoofStyle'])
le4 = LabelEncoder()
train_df['ExterQual'] = le4.fit_transform(train_df['ExterQual'])
train_df.dtypes

Neighborhood    int64
BldgType        int64
OverallQual     int64
YearBuilt       int64
RoofStyle       int64
ExterQual       int64
BsmtFinSF1      int64
TotalBsmtSF     int64
1stFlrSF        int64
2ndFlrSF        int64
GrLivArea       int64
FullBath        int64
TotRmsAbvGrd    int64
GarageCars      int64
dtype: object
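
One caveat with this encoding: LabelEncoder numbers the categories alphabetically, so 'Veenker' becomes 24 and 'CollgCr' becomes 5, and a linear model will read that as an ordering even though none exists.  A commonly used alternative, which this notebook does not use, is one-hot encoding.  A small sketch with pandas `get_dummies` on made-up rows:

```python
import pandas as pd

demo = pd.DataFrame({
    'Neighborhood': ['CollgCr', 'Veenker', 'CollgCr', 'Crawfor'],
    'OverallQual': [7, 6, 7, 7],
})

# Each neighborhood becomes its own indicator column instead of one integer code.
encoded = pd.get_dummies(demo, columns=['Neighborhood'])
print(encoded.columns.tolist())
```

The trade-off is a wider feature matrix: one column per category instead of a single integer column.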

In [44]:
#Scale the features
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(train_df.values)
X[:10, :] #Inspect

array([[-1.20621453, -0.41169079,  0.65147924,  1.05099379, -0.49151573,
        -0.77797579,  0.57542484, -0.45930254, -0.79343379,  1.16185159,
         0.37033344,  0.78974052,  0.91220977,  0.31172464],
       [ 1.95430223, -0.41169079, -0.07183611,  0.15673371, -0.49151573,
         0.66345144,  1.17199212,  0.46646492,  0.25714043, -0.79516323,
        -0.48251191,  0.78974052, -0.31868327,  0.31172464],
       [-1.20621453, -0.41169079,  0.65147924,  0.9847523 , -0.49151573,
        -0.77797579,  0.09290718, -0.31336875, -0.62782603,  1.18935062,
         0.51501256,  0.78974052, -0.31868327,  0.31172464],
       [-1.03987154, -0.41169079,  0.65147924, -1.86363165, -0.49151573,
         0.66345144, -0.49927358, -0.68732408, -0.52173356,  0.93727612,
         0.38365915, -1.02604084,  0.29676325,  1.65030694],
       [ 0.45721535, -0.41169079,  1.3747946 ,  0.95163156, -0.49151573,
        -0.77797579,  0.46356847,  0.19967971, -0.04561126,  1.61787729,
         1.2993257 ,  0.78

#### Apply same Transforms to Test Features

In [47]:
#Load Data
test_data = pd.read_csv('test.csv')
test_df = test_data[feature_cols]

test_df['Neighborhood'] = le1.transform(test_df['Neighborhood'])
test_df['BldgType'] = le2.transform(test_df['BldgType'])
test_df['RoofStyle'] = le3.transform(test_df['RoofStyle'])
test_df['ExterQual'] = le4.transform(test_df['ExterQual'])

#Verify
test_df.dtypes

Neighborhood      int64
BldgType          int64
OverallQual       int64
YearBuilt         int64
RoofStyle         int64
ExterQual         int64
BsmtFinSF1      float64
TotalBsmtSF     float64
1stFlrSF          int64
2ndFlrSF          int64
GrLivArea         int64
FullBath          int64
TotRmsAbvGrd      int64
GarageCars      float64
dtype: object

In [50]:
#Check for any NaN values
test_df.isnull().sum()

Neighborhood    0
BldgType        0
OverallQual     0
YearBuilt       0
RoofStyle       0
ExterQual       0
BsmtFinSF1      1
TotalBsmtSF     1
1stFlrSF        0
2ndFlrSF        0
GrLivArea       0
FullBath        0
TotRmsAbvGrd    0
GarageCars      1
dtype: int64

In [57]:
#Fill NA values
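#Assign the filled column back instead of relying on inplace=True on a column selection
test_df['BsmtFinSF1'] = test_df['BsmtFinSF1'].fillna(test_df['BsmtFinSF1'].mean())
test_df['TotalBsmtSF'] = test_df['TotalBsmtSF'].fillna(test_df['TotalBsmtSF'].mean())
test_df['GarageCars'] = test_df['GarageCars'].fillna(test_df['GarageCars'].median())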
test_df.isnull().sum()

Neighborhood    0
BldgType        0
OverallQual     0
YearBuilt       0
RoofStyle       0
ExterQual       0
BsmtFinSF1      0
TotalBsmtSF     0
1stFlrSF        0
2ndFlrSF        0
GrLivArea       0
FullBath        0
TotRmsAbvGrd    0
GarageCars      0
dtype: int64
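
The cell above fills the gaps with statistics computed from the test set itself.  A possible variation, shown here only as a sketch on made-up numbers, is to compute the fill values from the training data and apply them to the test data, so that nothing is learned from the test set.  The frames `train_demo` and `test_demo` are invented for the example.

```python
import numpy as np
import pandas as pd

train_demo = pd.DataFrame({'TotalBsmtSF': [856.0, 1262.0, 920.0],
                           'GarageCars': [2.0, 2.0, 3.0]})
test_demo = pd.DataFrame({'TotalBsmtSF': [882.0, np.nan],
                          'GarageCars': [np.nan, 1.0]})

# Fill values come from the training frame only.
fill_values = {
    'TotalBsmtSF': train_demo['TotalBsmtSF'].mean(),
    'GarageCars': train_demo['GarageCars'].median(),
}

# fillna accepts a dict mapping column name -> replacement value.
test_filled = test_demo.fillna(value=fill_values)
print(test_filled)
```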

In [59]:
#Scale Test Set Features
X_test = StandardScaler().fit_transform(test_df.values)

#Extract Training Set Label
Y = df['SalePrice'].values

Now both sets of features are ready to go.

X is the set of features from our training set and Y is its label.  X_test is the set of test features that we will make predictions on for our Kaggle submission.
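
One thing to be aware of: the scaling cell above fits a second StandardScaler on the test features, so the training and test sets end up scaled with different means and standard deviations.  If the goal is to apply exactly the same transform to both, one option is to fit the scaler once on the training features and reuse it.  A minimal sketch with made-up arrays (the variable `scaler` does not exist in the notebook):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_raw = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_raw = np.array([[2.0, 400.0]])

# Learn the mean and standard deviation from the training features only...
scaler = StandardScaler().fit(X_train_raw)

# ...then apply that same transform to both sets.
X_train_scaled = scaler.transform(X_train_raw)
X_test_scaled = scaler.transform(X_test_raw)

print(X_test_scaled)
```

With this approach a test value equal to the training mean maps to 0, whatever the rest of the test set looks like.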


### Build & Test ML Models

Using cross-validation, we will now test several different supervised regression algorithms.

Our scoring metric is 'neg_mean_squared_error'.  The output is a negative number, and the best model is the one with the largest value (closest to 0).  This is because sklearn's scoring convention treats higher scores as better, so instead of reporting the mean squared error (where lower is better), cross_val_score reports the negative mean squared error, which is highest for the best-fitting regression.

#### Linear Regression

In [60]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

#Initialize Model
lin_reg = LinearRegression()
#Create Kfold (no shuffling, so random_state is not needed)
kfold = KFold(n_splits=7)
#Train & Test model
cross_val_results = cross_val_score(lin_reg, X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(cross_val_results.mean())

-1399335160.06
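
Because this score is a negative mean squared error in squared dollars, it is easier to read after converting it to a root mean squared error.  A minimal sketch using the value printed above:

```python
import math

# Cross-validated score reported for linear regression (negative MSE).
neg_mse = -1399335160.06

# Flip the sign to get the MSE, then take the square root to get back to dollars.
rmse = math.sqrt(-neg_mse)
print(round(rmse))
```

That works out to a typical error of roughly 37,400 dollars per house.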


#### LASSO Regression

In [61]:
from sklearn.linear_model import Lasso

#Initialize Model
lasso = Lasso()
#Create Kfold (no shuffling, so random_state is not needed)
kfold = KFold(n_splits=7)
#Train & Test model
cross_val_results = cross_val_score(lasso, X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(cross_val_results.mean())

-1399198731.48




#### ElasticNet Regression

In [62]:
from sklearn.linear_model import ElasticNet

#Initialize Model
enet = ElasticNet()
#Create Kfold (no shuffling, so random_state is not needed)
kfold = KFold(n_splits=7)
#Train & Test model
cross_val_results = cross_val_score(enet, X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(cross_val_results.mean())

-1424410394.93


#### Support Vector Regression

In [63]:
from sklearn.svm import SVR

#Initialize Model
svr = SVR()
#Create Kfold (no shuffling, so random_state is not needed)
kfold = KFold(n_splits=7)
#Train & Test model
cross_val_results = cross_val_score(svr, X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(cross_val_results.mean())

-6616321063.18
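
The four cells above repeat the same steps with a different estimator each time.  As a sketch of how the comparison could be written as a single loop, here is the same idea on synthetic data from `make_regression`; the data, the `models` dict and the `scores` dict are assumptions for illustration, and the numbers it prints are unrelated to the house price results.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for the scaled features X and the label Y.
X_demo, y_demo = make_regression(n_samples=140, n_features=5,
                                 noise=10.0, random_state=0)

models = {
    'linear': LinearRegression(),
    'lasso': Lasso(),
    'enet': ElasticNet(),
    'svr': SVR(),
}

kfold = KFold(n_splits=7)
scores = {}
for name, model in models.items():
    cv_results = cross_val_score(model, X_demo, y_demo, cv=kfold,
                                 scoring='neg_mean_squared_error')
    scores[name] = cv_results.mean()

# The best model has the largest (least negative) score.
best = max(scores, key=scores.get)
print(best, scores[best])
```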


### Get Predictions & Make Submission

The LASSO model had the smallest cross-validated error using the default settings, although only marginally ahead of plain linear regression.  Let's use it as a first submission.

In [68]:
#Fit model
lasso = Lasso()
lasso.fit(X, Y)

#Make Predictions
predictions = lasso.predict(X_test)

In [70]:
#Make Id Column for submission
id_value = np.arange(1461, 2920)

In [71]:
#Create Submission DF
submission = pd.DataFrame({'Id':id_value, 'SalePrice':predictions})
submission.head()

| | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 113860.273987 |
| 1 | 1462 | 170054.910428 |
| 2 | 1463 | 176844.173402 |
| 3 | 1464 | 188662.331638 |
| 4 | 1465 | 200522.783472 |


In [72]:
#Write file
submission.to_csv("house_prices_lasso_reg.csv", index=False)

#### Conclusion

This simple LASSO regression on a modified feature set scored 0.18262 on Kaggle.
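
Note that the leaderboard score is not on the same scale as the mean squared errors above.  As far as I know, this competition evaluates the root mean squared error between the logarithm of the predicted price and the logarithm of the actual price.  A small sketch of that metric on made-up prices (the helper name `rmse_log` is my own):

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """Root mean squared error between log(y_pred) and log(y_true)
    (hypothetical helper; assumes all prices are positive)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

actual = [208500, 181500, 223500]
predicted = [200000, 190000, 223500]
print(rmse_log(actual, predicted))
```

Working in logs means an error of 10% counts the same on a cheap house as on an expensive one.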