Analysis of the Kaggle House Prices data set

WHJ Wood (2019)

Data set from:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Part 1.
Linear regression

In the first instance, it is always useful just to look at the data.

Here we are interested in predicting the SalePrice variable (the final column), the price in USD for which the property was sold.

In [42]:
import numpy as np
import pandas as pd
housing_data = pd.read_csv("train.csv", index_col="Id")
print(housing_data.shape)
housing_data.head()


(1460, 80)


Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000


For simplicity, we will use only the numeric features listed below.

In [43]:
# select numeric data types (int64 or float64)
housing_data_numeric = housing_data.select_dtypes(include=[np.int64,np.float64])
COLS = list(housing_data_numeric.columns)
COLS.remove('SalePrice')
print(COLS)

['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']


From these features, we select those that are predictive of SalePrice.

In this case we choose features whose Pearson correlation with SalePrice is greater than 0.5 in absolute value.

In [44]:
Numeric_Features = []
y = np.array(housing_data["SalePrice"])
for col in COLS:
    X = np.array(housing_data[col])
    # Pearson correlation between this feature and SalePrice.
    # np.corrcoef returns NaN for columns with missing values, so those columns fail the test below.
    Corr = np.corrcoef(X,y)
    if abs(Corr[0,1]) > 0.5:
        print(col, Corr[0,1])
        Numeric_Features.append(col)

OverallQual 0.7909816005838047
YearBuilt 0.5228973328794969
YearRemodAdd 0.5071009671113865
TotalBsmtSF 0.6135805515591954
1stFlrSF 0.6058521846919148
GrLivArea 0.7086244776126522
FullBath 0.5606637627484456
TotRmsAbvGrd 0.5337231555820284
GarageCars 0.6404091972583531
GarageArea 0.6234314389183617
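
The same filter can be sketched with pandas alone. The example below is a minimal illustration on synthetic data; the column names `strong`, `noise`, `with_gaps` and `target` are invented for the example and are not part of the housing data. Unlike `np.corrcoef`, `Series.corr` ignores rows where either value is missing, so a column with gaps is still scored rather than dropped.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (all names here are hypothetical)
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
demo = pd.DataFrame({
    "strong": x,                                     # drives the target
    "noise": rng.normal(size=n),                     # unrelated to the target
    "with_gaps": x + rng.normal(scale=0.1, size=n),  # informative, but has missing values
})
demo.loc[demo.index[:10], "with_gaps"] = np.nan
demo["target"] = 2.0 * x + rng.normal(scale=0.5, size=n)

# Pearson correlation of each feature with the target; NaN rows are excluded pairwise
features = [c for c in demo.columns if c != "target"]
corr = demo[features].apply(lambda col: col.corr(demo["target"]))
selected = list(corr[corr.abs() > 0.5].index)
print(selected)

# For comparison: np.corrcoef propagates the missing values and returns NaN
np_corr = np.corrcoef(demo["with_gaps"], demo["target"])[0, 1]
print(np_corr)
```

Whether columns with missing values should be scored on their available rows or excluded outright is a modelling choice; the sketch only shows that the two approaches behave differently.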


  


We can now train a linear regression model to predict SalePrice from our chosen features.

Here I train the model using Scikit-learn's LinearRegression and validate it on a 75%/25% train/test split.

In [45]:
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error

feature_names = Numeric_Features
output_variable = "SalePrice"
X = housing_data[feature_names]
y = housing_data[output_variable]

# We want to do a train/test split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0) # default is 75% train



regr = linear_model.LinearRegression()
regr.fit(train_X,train_y)
print(regr.coef_,regr.intercept_)
yhat = regr.predict(val_X)

val_mae = mean_absolute_error(val_y,yhat)
fraction_error = np.abs(val_y-yhat)/val_y

print("Mean error (fraction)", np.mean(fraction_error))
print("Mean absolute error",val_mae)


[ 1.75742334e+04  2.94528857e+02  3.19795121e+02  3.08379287e+01
  1.19694437e+01  5.81931025e+01 -8.38956913e+03 -1.91486699e+02
  2.03929348e+03  4.05333995e+01] -1284468.5319339475
Mean error (fraction) 0.14549517053393599
Mean absolute error 24303.111780726205
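
To make the two reported numbers concrete, here is a small worked example with made-up prices (the values are illustrative only). The "fraction" figure is the mean of |y - yhat| / y, often called the mean absolute percentage error, while the mean absolute error is in USD.

```python
import numpy as np

# Hypothetical true and predicted sale prices in USD
y_true = np.array([100000.0, 200000.0, 400000.0])
y_pred = np.array([110000.0, 180000.0, 400000.0])

abs_err = np.abs(y_true - y_pred)          # per-house absolute error in USD
mae = abs_err.mean()                       # mean absolute error
frac = (abs_err / y_true).mean()           # mean error as a fraction of the true price

print("Mean error (fraction)", frac)
print("Mean absolute error", mae)
```

The fractional error weights every house equally regardless of price, whereas the mean absolute error is dominated by expensive houses.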


The model has a mean out-of-sample error of around 15% of the sale price. However, it is likely that the prediction can be improved in a number of ways, such as:

Including categorical features

Using a more general model that does not assume linearity

Imputation of missing values

Here, I use a random forest regressor to predict SalePrice. I also make use of Scikit-learn pipelines, which are useful for packaging preprocessing and model fitting into a single object.
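
As a minimal sketch of what the preprocessing stage does, the toy example below runs a median imputer plus scaler on one numeric column and a most-frequent imputer plus one-hot encoder on one categorical column. The frame `toy` and its columns `area` and `zone` are hypothetical and only serve to show the shape of the result.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "area": [1.0, np.nan, 3.0, 5.0],
    "zone": pd.Series(["a", "b", np.nan, "a"], dtype=object),
})

# Numeric branch: fill gaps with the median, then standardise
num_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())])

# Categorical branch: fill gaps with the most frequent label, then one-hot encode
cat_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])

toy_pre = ColumnTransformer(transformers=[
    ("num", num_steps, ["area"]),
    ("cat", cat_steps, ["zone"])])

encoded = toy_pre.fit_transform(toy)
print(encoded.shape)
```

The result has one scaled numeric column and one indicator column per category seen during fitting, so four rows and three columns here.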


In [46]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

housing_data = pd.read_csv("train.csv", index_col="Id")

# Select categorical (object-dtype) columns; all are kept, regardless of cardinality
categorical_cols = list(housing_data.select_dtypes(include=[object]).columns)

# Select numerical columns
numeric_cols = list(housing_data.select_dtypes(include=[np.int64,np.float64]).columns)


numeric_cols.remove("SalePrice")
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])



categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # fill_value applies only to strategy='constant', so it is dropped
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)])

# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling code in a pipeline
PL = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])

# Remove rows with missing target, separate target from predictors
housing_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = housing_data.SalePrice
housing_data.drop(['SalePrice'], axis=1, inplace=True)

# Break off validation set from training data
# (no random_state is set, so the split and the scores below vary between runs)
X_train, X_val, y_train, y_val = train_test_split(housing_data, y)


# Keep selected columns only
my_cols = numeric_cols + categorical_cols


X_train = X_train[my_cols].copy()
X_val = X_val[my_cols].copy()


# Preprocess the training data and fit the model, then predict on the validation set
PL.fit(X_train, y_train)
yhat = PL.predict(X_val)

val_mae = mean_absolute_error(y_val, yhat)
fraction_error = np.abs(y_val - yhat) / y_val

print("Mean error (fraction)", np.mean(fraction_error))
print("Mean absolute error",val_mae)

Mean error (fraction) 0.10037055071316117
Mean absolute error 15322.23616438356


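We see an improvement on the linear model: the mean fractional error is now around 10% with the random forest regressor, compared with around 15% before. Note that the two models were evaluated on different random splits, so the comparison is indicative rather than exact.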
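
One way to put two models on an equal footing is to score them on the same cross-validation folds. The sketch below does this on synthetic, deliberately non-linear data; `X_demo`, `y_demo` and the fold settings are assumptions made for the illustration, not part of the analysis above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with a quadratic relationship (hypothetical, for illustration)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3.0, 3.0, size=(300, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.1, size=300)

# The same shuffled folds are reused for every model
folds = KFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv_mae = {}
for name, est in candidates.items():
    # scikit-learn reports the negated MAE so that higher is better; flip the sign back
    scores = cross_val_score(est, X_demo, y_demo, cv=folds,
                             scoring="neg_mean_absolute_error")
    cv_mae[name] = -scores.mean()
    print(name, cv_mae[name])
```

A Scikit-learn Pipeline can be passed to `cross_val_score` in the same way as the bare estimators used here, so preprocessing is refitted inside each fold.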