Using Keras to solve the Kaggle competition: House Prices: Advanced Regression Techniques  
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview

In [1]:
import numpy as np
import pandas as pd

In [2]:
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
print('train data shape:',train_data.shape)
print('test data shape:', test_data.shape)

train data shape: (1460, 81)
test data shape: (1459, 80)


In [6]:
train_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [7]:
test_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


First, concatenate the training data and the test data so that the same feature engineering is applied to both at once.

In [3]:
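# Drops Id and SalePrice from train only; test_data keeps its Id column, so Id stays in all_data.
all_data = pd.concat((train_data.iloc[:, 1:-1], test_data))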
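To illustrate why the two frames are stacked, here is a minimal sketch on two tiny stand-in frames (the names `toy_train`, `toy_test` and `toy_all` are hypothetical, not part of the notebook). Unlike cell In [3], it drops `Id` from both frames, which is one possible way to keep the identifier out of the features; the notebook itself keeps `Id` on the test rows.

```python
import pandas as pd

# Tiny stand-ins for train.csv and test.csv (only one feature column is kept).
toy_train = pd.DataFrame(
    {"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]}
)
toy_test = pd.DataFrame({"Id": [1461, 1462, 1463], "LotArea": [11622, 14267, 13830]})

# Drop the identifier from both frames and the target from the training frame,
# then stack the remaining feature columns row-wise.
toy_all = pd.concat(
    [toy_train.drop(columns=["Id", "SalePrice"]), toy_test.drop(columns=["Id"])],
    ignore_index=True,
)

# The first len(toy_train) rows are the training rows, the rest are test rows.
n_toy_train = len(toy_train)
print(toy_all.shape)
print(toy_all.iloc[:n_toy_train])
```

Because the row order is preserved, the stacked frame can later be split back by row count alone.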

Check for missing values.

In [4]:
all_data_na = all_data.isnull().sum()
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index)
all_data_na = all_data_na.sort_values(ascending = False)
na_data = pd.DataFrame({'Na Value':all_data_na})
na_data

Unnamed: 0,Na Value
PoolQC,2909
MiscFeature,2814
Alley,2721
Fence,2348
Id,1460
FireplaceQu,1420
LotFrontage,486
GarageCond,159
GarageFinish,159
GarageQual,159
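Raw counts are easier to judge next to the fraction of rows they represent. The sketch below repeats the same counting steps on a small made-up frame and adds a ratio column; the frame `toy` and the column name `Na Ratio` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up frame with one mostly-missing column, one partly-missing, one complete.
toy = pd.DataFrame(
    {
        "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
        "LotFrontage": [65.0, np.nan, 68.0, 60.0],
        "LotArea": [8450, 9600, 11250, 9550],
    }
)

# Count missing values per column, keep only columns that have any,
# and sort from most to least missing.
na_count = toy.isnull().sum()
na_count = na_count[na_count > 0].sort_values(ascending=False)

# Report the count alongside the fraction of rows that are missing.
na_table = pd.DataFrame({"Na Value": na_count, "Na Ratio": na_count / len(toy)})
print(na_table)
```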


Next, standardize the numeric features, fill the remaining NA values with the column mean (which is zero after standardization), and one-hot encode the categorical features.

In [5]:
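# Standardize every numeric column to zero mean and unit variance.
numeric_features = all_data.dtypes[all_data.dtypes != 'object'].index
all_data[numeric_features] = all_data[numeric_features].apply(
    lambda x: (x - x.mean()) / x.std())

# Fill missing numeric values with the column mean.
all_data = all_data.fillna(all_data.mean(numeric_only=True))

# One-hot encode categorical columns; dummy_na=True adds an indicator column for missing values.
all_data = pd.get_dummies(all_data, dummy_na=True)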

print('The shape of all data set: ',all_data.shape)

The shape of all data set:  (2919, 332)
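The effect of these three steps is easier to see on a small frame. The following is a sketch on made-up data (the frame `toy` and the variable `num_cols` are hypothetical); it selects numeric columns with `select_dtypes`, a slight variation on the dtype comparison used in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame: one numeric column and one categorical column, each with a gap.
toy = pd.DataFrame(
    {
        "LotFrontage": [60.0, np.nan, 80.0],
        "MSZoning": ["RL", "RH", None],
    }
)

# Standardize numeric columns; missing values are ignored by mean() and std().
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].apply(lambda x: (x - x.mean()) / x.std())

# After standardization the column mean is zero, so mean imputation is a zero fill.
toy[num_cols] = toy[num_cols].fillna(0)

# One-hot encode the categorical column, with an extra indicator for missing values.
toy = pd.get_dummies(toy, dummy_na=True)
print(toy)
```

Each row ends up with exactly one active indicator among the `MSZoning` columns, including the row whose category was missing.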


Separate the training set and the test set.

In [8]:
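ntrain = train_data.shape[0]
# Split the stacked frame back by row count; cast to float32 so Keras receives a purely numeric array.
train_set = all_data[:ntrain].values.astype('float32')
test_set = all_data[ntrain:].values.astype('float32')
train_labels = train_data.SalePrice.values.reshape(-1, 1)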

Build a simple fully connected model and train it.

In [18]:
from keras.layers import Input, Dense, Activation, Dropout
from keras.models import Model
from keras import regularizers

import keras.backend as K

In [37]:
def simplemodel(input_shape):
    X_input=Input(input_shape)
    X=Dense(256,activation='relu')(X_input)
    X=Dense(128,activation='relu')(X)
    X=Dense(1)(X)
    model = Model(inputs = X_input, outputs = X, name='simple')
    return model
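To get a feel for the size of this network, the sketch below counts the weights and biases of a stack of fully connected layers in plain Python. The helper `dense_param_count` is hypothetical (it is not a Keras function); the input width of 332 is the column count printed earlier for `all_data`.

```python
def dense_param_count(n_inputs, layer_units):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_units:
        # Each layer has a (fan_in x units) weight matrix plus one bias per unit.
        total += fan_in * units + units
        fan_in = units
    return total


n_features = 332  # number of columns in all_data after one-hot encoding
n_params = dense_param_count(n_features, [256, 128, 1])
print(n_params)  # 118273
```

With 1460 training rows, the model has far more parameters than examples, which is worth keeping in mind when judging the error measured on the training set.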

In [38]:
Simplemodel = simplemodel(train_set.shape[1:])
Simplemodel.summary()  # call the method to print the layer table

In [39]:
Simplemodel.compile(optimizer = 'Adam', loss = "mean_squared_error",metrics=["mean_squared_error"])

In [40]:
Simplemodel.fit(x = train_set, y = train_labels,epochs = 100, batch_size = 64,verbose=1)

Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x245b180cac8>

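Evaluate the model with a log-error metric modeled on the competition's evaluation method (the RMSE between the logarithms of the predicted and observed prices):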

In [41]:
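def rmlse2(y_true, y_pred):
    # Mean fourth power of the log error: np.square is applied on top of **2 and no square root
    # is taken, so this value is not directly comparable to the competition's RMSLE.
    a = np.mean(np.square((np.log(y_pred)-np.log(y_true))**2))
    return a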

In [42]:
pred = Simplemodel.predict(train_set)
print(rmlse2(pred, train_labels))

0.00509644230855
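The competition scores submissions by the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price. A minimal sketch of that definition follows; the function name `rmsle` and the sample prices are chosen here for illustration and are not part of the notebook run.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared error between log(y_pred) and log(y_true)."""
    log_diff = np.log(y_pred) - np.log(y_true)
    return np.sqrt(np.mean(log_diff ** 2))


# Sanity check: if every prediction is off by a factor of e,
# the log error is 1 for each row, so the metric should be close to 1.
y_true = np.array([208500.0, 181500.0, 223500.0])
y_pred = y_true * np.e
score = rmsle(y_true, y_pred)
print(score)
```

Because the error is taken on logarithms, a mistake of a given ratio counts the same for cheap and expensive houses. The function assumes strictly positive predictions; a network with a linear output layer can in principle produce non-positive values, for which the logarithm is undefined.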


In [43]:
pred_test=Simplemodel.predict(test_set)
test_data['SalePrice'] = pd.Series(pred_test.reshape(1, -1)[0])
submission_nn = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
submission_nn.to_csv('submission_nn.csv', index=False)
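The submission file needs just two columns, `Id` and `SalePrice`. As a sketch of the same step on made-up numbers, the block below flattens a column vector of predictions (the shape returned by `predict` for a single-output model) and checks the resulting CSV header; `toy_ids`, `toy_pred` and `toy_submission` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up identifiers and predictions; predictions have shape (n, 1).
toy_ids = pd.Series([1461, 1462, 1463], name="Id")
toy_pred = np.array([[120000.0], [155000.0], [180000.0]])

# ravel() flattens the (n, 1) array into a 1-D array of length n.
toy_submission = pd.DataFrame({"Id": toy_ids, "SalePrice": toy_pred.ravel()})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = toy_submission.to_csv(index=False)
print(csv_text)
```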

Upload the predictions to Kaggle and check the score.  
The submission scored 0.15246.

The result is not good enough, which means there is more work to be done.