# A Deep Learning Approach to Real Estate

## Background
This notebook takes the classic Kaggle Real Estate problem and applies a deep learning approach to it, using techniques such as dropout, batch normalisation and mini-batch training.
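
For orientation, here is a minimal NumPy sketch of what batch normalisation and dropout compute on a single mini-batch during training. It is an illustration only, not the Keras implementation used later; the function names `batch_norm_forward` and `inverted_dropout` and the array sizes are assumptions made for this example.

```python
import numpy as np

def batch_norm_forward(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def inverted_dropout(x, rate, rng):
    """Zero roughly a fraction `rate` of activations and rescale the rest."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=2.0, size=(256, 8))  # one mini-batch of 256 rows

normed = batch_norm_forward(batch)                      # per-feature mean 0, std close to 1
dropped = inverted_dropout(normed, rate=0.5, rng=rng)   # about half the entries set to 0
```

The rescaling by `1 / (1 - rate)` keeps the expected activation unchanged, which is why no extra scaling is needed when dropout is switched off at prediction time.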

## Shoutouts
Full credit for this technique goes to Andrew Beam and Max Berggren for the post that inspired this kernel (links below).
http://maxberggren.se/2017/06/18/deep-learning-vs-xgboost/ 

https://www.valentinmihov.com/2015/04/17/adult-income-data-set/

## To Do
- Hyperparameter optimisation in Keras (http://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/, https://github.com/maxpumperla/hyperas)
- Ensembling in Keras (see the file keras-ensemble.py in this directory)
- TensorBoard (instrument the network to check whether it is converging)

In [261]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

#from subprocess import check_output
#print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [262]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

seed = 123456
np.random.seed(seed)

In [263]:
target_variable = 'saleprice'
df = (
    pd.read_csv('input/train.csv') # change this to run on kaggle
    #pd.read_csv('../input/train.csv')

    # Rename columns to lowercase and underscores
    .pipe(lambda d: d.rename(columns={
        k: v for k, v in zip(
            d.columns,
            [c.lower().replace(' ', '_') for c in d.columns]
        )
    }))
    # Switch categorical classes to integers
    #.assign(**{target_variable: lambda r: r[target_variable].astype('category').cat.codes})
)
print('Done')

Done


In [265]:
# Log-transform the sale price (predictions are converted back with np.exp later)
df['saleprice'] = np.log(df['saleprice'])
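
Training on the log of the price means that squared error on the transformed target measures relative rather than absolute error. A minimal sketch of this, using made-up prices and a hypothetical helper `log_rmse` that is not part of the notebook:

```python
import numpy as np

def log_rmse(y_true, y_pred):
    """Root mean squared error between log prices."""
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Made-up prices, for illustration only
actual = np.array([100_000.0, 200_000.0, 400_000.0])
predicted = actual * 1.10  # every prediction is 10% too high

error = log_rmse(actual, predicted)  # the same log(1.10) error for every house
restored = np.exp(np.log(actual))    # np.exp undoes np.log
```

A cheap house and an expensive house that are both overestimated by 10% contribute equally to this error, which would not be the case for squared error on raw prices.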

In [266]:
import sklearn.preprocessing as preprocessing
import seaborn as sns

In [267]:
# Encode the categorical features as numbers
def number_encode_features(df):
    result = df.copy()
    encoders = {}
    for column in result.columns:
        #print(column)
        #print(result.dtypes[column])
        if result.dtypes[column] == object:  # np.object was removed in NumPy 1.24
            encoders[column] = preprocessing.LabelEncoder()
            # if there are NaN's in the categorical data fill it with 'None' which becomes another category
            result[column] = encoders[column].fit_transform(result[column].fillna(value='None'))
    return result, encoders

# Encode the training data (the fitted encoders are discarded here)
encoded_data, _ = number_encode_features(df)

print('Done')

Done
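
One caveat of integer encoding: if the encoder is refit on a second file, such as the submission set, the same category can receive a different integer. Below is a minimal pandas sketch of one way to keep the mapping fixed. The helpers `fit_categories` and `apply_categories` and the toy frames are assumptions for illustration, not the code used in this notebook.

```python
import pandas as pd

def fit_categories(df):
    """Record the sorted category list of every non-numeric column."""
    cats = {}
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            cats[col] = sorted(df[col].fillna('None').unique())
    return cats

def apply_categories(df, cats):
    """Encode columns with a fixed mapping; unseen values become -1."""
    out = df.copy()
    for col, levels in cats.items():
        filled = out[col].fillna('None')
        out[col] = pd.Categorical(filled, categories=levels).codes
    return out

# Toy data: 'Dirt' appears only in the second frame
train = pd.DataFrame({'street': ['Pave', 'Grvl', None], 'area': [10, 20, 30]})
test = pd.DataFrame({'street': ['Pave', 'Dirt'], 'area': [15, 25]})

cats = fit_categories(train)
train_enc = apply_categories(train, cats)
test_enc = apply_categories(test, cats)
```

With this approach `'Pave'` maps to the same code in both frames, and a value never seen during fitting is flagged as `-1` instead of silently shifting the other codes.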


In [268]:
# Data is now in dataframe "encoded_data"

y = encoded_data[target_variable].values
X = (
    # Drop target variable
    encoded_data.drop(target_variable, axis=1)
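    # Min-max scaling to [0, 1] (only needed for the DL model);
    # constant columns give 0/0 = NaN, which fillna(0) maps to 0
    .pipe(lambda d: (d - d.min()) / (d.max() - d.min())).fillna(0)
    .values  # .as_matrix() was removed in pandas 1.0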
)
print('Done')

Done
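
Min-max scaling maps each column to the range [0, 1] via (x - min) / (max - min). The sketch below, on toy data, also shows how statistics learned on one frame can be reused on another so that both share the same scale. The helpers `fit_min_max` and `apply_min_max` are hypothetical names introduced for this example.

```python
import pandas as pd

def fit_min_max(df):
    """Return the per-column minimum and range (max - min)."""
    lo = df.min()
    span = df.max() - lo
    return lo, span

def apply_min_max(df, lo, span):
    """Scale with previously fitted statistics; 0/0 results become 0."""
    return ((df - lo) / span).fillna(0)

# Toy data: 'const' never varies, so its range is zero
train = pd.DataFrame({'year': [1900, 1950, 2000], 'const': [5, 5, 5]})
other = pd.DataFrame({'year': [1975], 'const': [5]})

lo, span = fit_min_max(train)
train_scaled = apply_min_max(train, lo, span)
other_scaled = apply_min_max(other, lo, span)
```

Here the year 1975 in the second frame scales to 0.75 because it is measured against the 1900 to 2000 range of the first frame, not against its own minimum and maximum.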


In [269]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score

# All rows are used for training; validation data comes from validation_split in m.fit
test_size = 0.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, random_state=seed
)
print('Done')

Done


In [294]:
from keras.models import Sequential
from keras.callbacks import EarlyStopping, ModelCheckpoint
# LeakyReLU and BatchNormalization are both exported from keras.layers
from keras.layers import Dense, Activation, Dropout, LeakyReLU, BatchNormalization
from keras import optimizers
from keras import initializers
print('Done')

Done


In [298]:
dropout = 0.5
init_mean = 0.0
init_stdev = 0.05

m = Sequential()
m.add(Dense(1024, input_shape=(X.shape[1],)
            ,kernel_initializer=initializers.TruncatedNormal(mean=init_mean
                                                             ,stddev=init_stdev
                                                             ,seed=seed)
            ,bias_initializer='zeros'))
m.add(BatchNormalization())
m.add(LeakyReLU())  # keeps a small gradient for negative inputs, unlike plain ReLU
m.add(Dropout(dropout))

m.add(Dense(512
            ,kernel_initializer=initializers.TruncatedNormal(mean=init_mean
                                                             ,stddev=init_stdev
                                                             ,seed=seed)
            ,bias_initializer='zeros'))
m.add(BatchNormalization())
m.add(LeakyReLU())
m.add(Dropout(dropout))

m.add(Dense(128
             ,kernel_initializer=initializers.TruncatedNormal(mean=init_mean
                                                             ,stddev=init_stdev
                                                             ,seed=seed)
            ,bias_initializer='zeros'))
m.add(BatchNormalization())
m.add(LeakyReLU())
m.add(Dropout(dropout))

m.add(Dense(1, activation=None))  # linear activation for regression

m.compile(
    optimizer=optimizers.Adam(lr=0.002),
    loss='mean_squared_error',
    #metrics=[log_rmse]
)

print('Done')

epochs = 100
batch_size = 256

print('Starting training....')

m.fit(
    # Feature matrix
    X_train,
    # Target: log of the sale price
    y_train,
    # Maximum number of epochs if not stopped by EarlyStopping
    epochs=epochs,
    callbacks=[
        # Stop when validation loss has not improved for 25 epochs
        EarlyStopping(monitor='val_loss', patience=25),
        # Keep the checkpoint with the lowest validation loss seen so far
        ModelCheckpoint(
            'best.model',
            monitor='val_loss',
            save_best_only=True,
            verbose=1
        )
    ],
    verbose=2,
    validation_split=0.1,
    batch_size=batch_size
)
print('Done')


Done
Starting training....
Train on 1314 samples, validate on 146 samples
Epoch 1/100
Epoch 00000: val_loss improved from inf to 138.52853, saving model to best.model
3s - loss: 136.5800 - val_loss: 138.5285
Epoch 2/100
Epoch 00001: val_loss improved from 138.52853 to 131.11693, saving model to best.model
0s - loss: 121.8826 - val_loss: 131.1169
Epoch 3/100
Epoch 00002: val_loss improved from 131.11693 to 123.16247, saving model to best.model
0s - loss: 109.7582 - val_loss: 123.1625
Epoch 4/100
Epoch 00003: val_loss improved from 123.16247 to 115.39986, saving model to best.model
0s - loss: 98.3148 - val_loss: 115.3999
Epoch 5/100
Epoch 00004: val_loss improved from 115.39986 to 108.35216, saving model to best.model
0s - loss: 87.6649 - val_loss: 108.3522
Epoch 6/100
Epoch 00005: val_loss improved from 108.35216 to 102.76442, saving model to best.model
0s - loss: 76.6463 - val_loss: 102.7644
Epoch 7/100
Epoch 00006: val_loss improved from 102.76442 to 97.42483, saving model to best.mod
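
The `EarlyStopping` callback used above halts training once `val_loss` has gone `patience` epochs without improving. The bookkeeping can be sketched in plain Python. `early_stop_epoch` is a hypothetical helper and the loss values are invented, so this illustrates the idea and does not reproduce Keras internals.

```python
def early_stop_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops at the first epoch where the loss has failed to
    improve for `patience` consecutive epochs; if that never happens,
    stop_epoch is the last epoch.
    """
    best_loss = float('inf')
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Invented validation losses: improvement stalls after epoch 2
losses = [5.0, 4.0, 3.0, 3.2, 3.1, 3.3, 2.9]
stop, best = early_stop_epoch(losses, patience=3)
```

With `patience=3` the loop stops before it ever sees the later, lower loss of 2.9, which is the trade-off a small patience makes. The best epoch and the stopping epoch differ, which is why a checkpoint of the best model is saved separately.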

### Now for the Submission set

In [272]:
df = (
    pd.read_csv('input/test.csv') # change this to run on kaggle
    #pd.read_csv('../input/test.csv')

    # Rename columns to lowercase and underscores
    .pipe(lambda d: d.rename(columns={
        k: v for k, v in zip(
            d.columns,
            [c.lower().replace(' ', '_') for c in d.columns]
        )
    }))
    # Switch categorical classes to integers
    #.assign(**{target_variable: lambda r: r[target_variable].astype('category').cat.codes})
)
print('Done')

Done


In [273]:
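# Integer-encode the categoricals, as for the training data.
# Note: the encoders and the scaling are refit on the test set here, so
# codes and ranges can differ from those the model was trained on.
encoded_data, _ = number_encode_features(df)

X_sub = (
    # Min-max scaling to [0, 1] (only needed for the DL model)
    encoded_data.pipe(lambda d: (d - d.min()) / (d.max() - d.min())).fillna(0)
    .values  # .as_matrix() was removed in pandas 1.0
)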
print('Done')

Done


In [274]:
# Save the predictions to CSV with Id and SalePrice
# Run the model (this uses the final-epoch weights; the best checkpoint is in 'best.model')
y_sub_preds = np.exp(m.predict(X_sub))  # bring them back to sale prices using the exponential
pred = pd.DataFrame(data=y_sub_preds)

print("Here is a sample...")

result = pd.concat([df['id'], pred], axis=1)
result.columns = ['Id','SalePrice'] 
print(result[0:10])

# Header: [Id, SalePrice]
result.to_csv('submission.csv', index=False)

print('Done')

Here is a sample...
     Id      SalePrice
0  1461  146040.984375
1  1462  226797.500000
2  1463  175905.562500
3  1464  238224.109375
4  1465  201734.234375
5  1466  232493.140625
6  1467  230393.218750
7  1468  215095.250000
8  1469  262079.515625
9  1470  122354.031250
Done
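
`m.predict` returns an array of shape `(n, 1)`, and the values are in log space. Below is a minimal sketch of converting them back to prices and building the two-column frame. It uses invented log predictions and a hypothetical helper `build_submission`; it is not the cell used above.

```python
import numpy as np
import pandas as pd

def build_submission(ids, log_preds):
    """Turn (n, 1) log-price predictions into an Id/SalePrice frame."""
    prices = np.exp(np.asarray(log_preds).ravel())
    return pd.DataFrame({'Id': np.asarray(ids), 'SalePrice': prices})

# Invented values standing in for df['id'] and m.predict(X_sub)
ids = [1461, 1462, 1463]
log_preds = np.log(np.array([[150_000.0], [200_000.0], [250_000.0]]))

submission = build_submission(ids, log_preds)
```

Building the frame from a dictionary names both columns up front, so no separate rename step is needed before `to_csv`.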
