# Using H2O AutoML to predict housing prices

This quick tutorial shows how to use H2O AutoML to train and evaluate a large number of models with only a few lines of code.
Please upvote if you find it useful.

### Load libraries

In [1]:
# You can easily install the library using pip
!pip install h2o



In [2]:
# And then load the libraries you'll use in this notebook
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


import h2o
from h2o.automl import H2OAutoML

In [3]:
# Initialize your cluster
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_192"; Java(TM) SE Runtime Environment (build 1.8.0_192-b12); Java HotSpot(TM) 64-Bit Server VM (build 25.192-b12, mixed mode)
  Starting server from /Users/Daniela/anaconda3/envs/general/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/1y/zfb5jdcs2vzg_qh45b_g07x00000gn/T/tmpmhru9jnz
  JVM stdout: /var/folders/1y/zfb5jdcs2vzg_qh45b_g07x00000gn/T/tmpmhru9jnz/h2o_Daniela_started_from_python.out
  JVM stderr: /var/folders/1y/zfb5jdcs2vzg_qh45b_g07x00000gn/T/tmpmhru9jnz/h2o_Daniela_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O cluster uptime:,02 secs
H2O cluster timezone:,Europe/Stockholm
H2O data parsing timezone:,UTC
H2O cluster version:,3.28.0.1
H2O cluster version age:,12 days
H2O cluster name:,H2O_from_python_Daniela_rgqrfo
H2O cluster total nodes:,1
H2O cluster free memory:,910 Mb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


### Load train and test datasets

In [4]:
train = h2o.import_file('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test = h2o.import_file('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')
train.head()

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000
6,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Mitchel,Norm,Norm,1Fam,1.5Fin,5,5,1993,1995,Gable,CompShg,VinylSd,VinylSd,,0,TA,TA,Wood,Gd,TA,No,GLQ,732,Unf,0,64,796,GasA,Ex,Y,SBrkr,796,566,0,1362,1,0,1,1,1,1,TA,5,Typ,0,,Attchd,1993,Unf,2,480,TA,TA,Y,40,30,0,320,0,0,,MnPrv,Shed,700,10,2009,WD,Normal,143000
7,20,RL,75.0,10084,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,1Fam,1Story,8,5,2004,2005,Gable,CompShg,VinylSd,VinylSd,Stone,186,Gd,TA,PConc,Ex,TA,Av,GLQ,1369,Unf,0,317,1686,GasA,Ex,Y,SBrkr,1694,0,0,1694,1,0,2,0,3,1,Gd,7,Typ,1,Gd,Attchd,2004,RFn,2,636,TA,TA,Y,255,57,0,0,0,0,,,,0,8,2007,WD,Normal,307000
8,60,RL,,10382,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NWAmes,PosN,Norm,1Fam,2Story,7,6,1973,1973,Gable,CompShg,HdBoard,HdBoard,Stone,240,TA,TA,CBlock,Gd,TA,Mn,ALQ,859,BLQ,32,216,1107,GasA,Ex,Y,SBrkr,1107,983,0,2090,1,0,2,1,3,1,TA,7,Typ,2,TA,Attchd,1973,RFn,2,484,TA,TA,Y,235,204,228,0,0,0,,,Shed,350,11,2009,WD,Normal,200000
9,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Artery,Norm,1Fam,1.5Fin,7,5,1931,1950,Gable,CompShg,BrkFace,Wd Shng,,0,TA,TA,BrkTil,TA,TA,No,Unf,0,Unf,0,952,952,GasA,Gd,Y,FuseF,1022,752,0,1774,0,0,2,0,2,2,TA,8,Min1,2,TA,Detchd,1931,Unf,2,468,Fa,TA,Y,90,0,205,0,0,0,,,,0,4,2008,WD,Abnorml,129900
10,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,Corner,Gtl,BrkSide,Artery,Artery,2fmCon,1.5Unf,5,6,1939,1950,Gable,CompShg,MetalSd,MetalSd,,0,TA,TA,BrkTil,TA,TA,No,GLQ,851,Unf,0,140,991,GasA,Ex,Y,SBrkr,1077,0,0,1077,1,0,1,0,2,2,TA,5,Typ,2,TA,Attchd,1939,RFn,1,205,Gd,TA,Y,0,4,0,0,0,0,,,,0,1,2008,WD,Normal,118000




### EDA, Data Processing and Feature Engineering

I won't do extensive EDA or feature engineering, as they are not the focus of this kernel. Here I'll only drop the rows whose `GrLivArea` is 4500 or above and apply a `log1p` transform, i.e. log(1 + x), to the target variable `SalePrice`.

In [5]:
train = train[train['GrLivArea'] < 4500]
train['SalePrice'] = train['SalePrice'].log1p()
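
Since the model is trained on the transformed target, every prediction has to be mapped back with the inverse function before it means anything in dollars. The sketch below shows that round trip with the standard `math` module instead of H2O, using the `SalePrice` of the first training row shown above; it only illustrates the pair `log1p` / `expm1`, not the H2O calls themselves.

```python
import math

# SalePrice of the first training row shown in the table above.
price = 208500.0

# Forward transform: this is the scale the model is trained on.
log_price = math.log1p(price)      # log(1 + price)

# Inverse transform: this is what predictions need before submission.
restored = math.expm1(log_price)   # exp(log_price) - 1

print("log scale:", log_price)
print("back on the original scale:", restored)
```

The H2OFrame methods `log1p()` and `expm1()` used in this notebook do the same thing column-wise.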

### Using H2O AutoML

The `H2OAutoML` class is quite easy to use. You pass the dataset you will use for training to its `train` method as `training_frame`, while `x` and `y` receive the column names of the features to be used and the name of the target variable, respectively.

You can customize your AutoML to fit your needs. You can add or exclude algorithms, set nfolds for cross-validation, choose metrics, use validation sets, early stopping and so on. Please check the documentation at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
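
As a rough sketch of what such a customization could look like, the snippet below collects a few constructor arguments in a dictionary. The parameter names (`nfolds`, `exclude_algos`, `sort_metric`, `stopping_metric`) are taken from the AutoML documentation linked above, but the values and the helper names `build_automl_kwargs` and `make_automl` are my own choices for illustration, not settings used in this kernel.

```python
def build_automl_kwargs(max_models=30, max_runtime_secs=300, seed=0):
    """Collect H2OAutoML constructor arguments in one place (illustrative values)."""
    return {
        "max_models": max_models,              # stop after this many models...
        "max_runtime_secs": max_runtime_secs,  # ...or after this many seconds
        "seed": seed,
        "nfolds": 5,                           # 5-fold cross-validation
        "exclude_algos": ["DeepLearning"],     # skip neural networks
        "sort_metric": "RMSE",                 # metric used to rank the leaderboard
        "stopping_metric": "RMSE",             # metric used for early stopping
    }


def make_automl(kwargs):
    """Build the AutoML object; h2o is imported lazily so the dict can be inspected without it."""
    from h2o.automl import H2OAutoML
    return H2OAutoML(**kwargs)


automl_kwargs = build_automl_kwargs(max_models=10)
print(automl_kwargs)
```

With a running cluster, `make_automl(automl_kwargs).train(x=x, y=y, training_frame=train)` would then start the search with these settings.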

In [6]:
# Identify predictors and response
x = [col for col in train.columns if col not in ['Id','SalePrice']]
y = 'SalePrice'
test_id = test['Id']

Let's now create our model:

In [7]:
# Stop after 30 models or 300 seconds, whichever comes first
aml = H2OAutoML(max_models=30, max_runtime_secs=300, seed=0, stopping_metric='RMSLE')
aml.train(x=x, y=y, training_frame=train)

AutoML progress: |████████████████████████████████████████████████████████| 100%


The leaderboard ranks the trained models by their cross-validated performance:

In [8]:
lb = aml.leaderboard
lb

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
GLM_1_AutoML_20191229_171415,0.0134072,0.11579,0.0134072,0.0806608,0.0090615
XGBoost_2_AutoML_20191229_171415,0.0149222,0.122156,0.0149222,0.0843188,0.00951875
XGBoost_1_AutoML_20191229_171415,0.0150298,0.122596,0.0150298,0.0846317,0.00956583
XGBoost_3_AutoML_20191229_171415,0.0150535,0.122693,0.0150535,0.0852775,0.00956654
GBM_3_AutoML_20191229_171415,0.0156149,0.124959,0.0156149,0.08617,0.00975866
GBM_grid__1_AutoML_20191229_171415_model_2,0.0156493,0.125097,0.0156493,0.0867238,0.00975575
GBM_4_AutoML_20191229_171415,0.0159195,0.126173,0.0159195,0.0857864,0.00985091
GBM_2_AutoML_20191229_171415,0.0159408,0.126257,0.0159408,0.0874646,0.00984835
XGBoost_grid__1_AutoML_20191229_171415_model_1,0.0159655,0.126355,0.0159655,0.0869011,0.00986021
GBM_1_AutoML_20191229_171415,0.0163378,0.12782,0.0163378,0.088993,0.00998078
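
A note on reading these columns: because `SalePrice` was log-transformed before training, the `rmse` column is already an error on the log scale, so it is the one that is roughly comparable to an RMSLE on the raw prices. The `rmsle` column takes a logarithm of values that are already logs, which is why it is so small. The sketch below checks the first claim in plain Python; the two helper functions and the predicted prices are made up for the example, and the actual prices are the first three `SalePrice` values of the training data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rmsle(actual, predicted):
    """Root mean squared logarithmic error, using log(1 + x)."""
    squared = [(math.log1p(a) - math.log1p(p)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


actual_prices = [208500.0, 181500.0, 223500.0]     # first three training rows
predicted_prices = [200000.0, 190000.0, 230000.0]  # made-up predictions

# RMSE computed on log1p-transformed values...
log_actual = [math.log1p(v) for v in actual_prices]
log_predicted = [math.log1p(v) for v in predicted_prices]
rmse_on_logs = rmse(log_actual, log_predicted)

# ...is the same quantity as RMSLE computed on the raw prices.
rmsle_on_prices = rmsle(actual_prices, predicted_prices)

print(rmse_on_logs, rmsle_on_prices)
```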




The `leader` attribute of your AutoML object holds the best model:

In [9]:
aml.leader

Model Details
H2OGeneralizedLinearEstimator :  Generalized Linear Modeling
Model Key:  GLM_1_AutoML_20191229_171415


GLM Model: summary


Unnamed: 0,Unnamed: 1,family,link,regularization,lambda_search,number_of_predictors_total,number_of_active_predictors,number_of_iterations,training_frame
0,,gaussian,identity,Ridge ( lambda = 0.01606 ),"nlambda = 30, lambda.max = 32.81, lambda.min = 0.01606, lambda.1se...",304,303,28,automl_training_py_3_sid_b6b2




ModelMetricsRegressionGLM: glm
** Reported on train data. **

MSE: 0.009926662182109965
RMSE: 0.09963263612948302
MAE: 0.07028477975615482
RMSLE: 0.007793609197735483
R^2: 0.9378265403090414
Mean Residual Deviance: 0.009926662182109965
Null degrees of freedom: 1457
Residual degrees of freedom: 1154
Null deviance: 232.78539642915086
Residual deviance: 14.473073461516329
AIC: -1977.4454237490122

ModelMetricsRegressionGLM: glm
** Reported on cross-validation data. **

MSE: 0.013407243824510874
RMSE: 0.11578965335689918
MAE: 0.08066081483664383
RMSLE: 0.009061498531336433
R^2: 0.9160266846804281
Mean Residual Deviance: 0.013407243824510874
Null degrees of freedom: 1457
Residual degrees of freedom: 1159
Null deviance: 233.26066584152278
Residual deviance: 19.547761496136854
AIC: -1549.213112794158

Cross-Validation Metrics Summary: 


Unnamed: 0,Unnamed: 1,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
0,mae,0.080781005,0.002941696,0.07843183,0.08065324,0.08516379,0.08183488,0.07782129
1,mean_residual_deviance,0.013300789,0.0021990049,0.015283024,0.012890057,0.01582157,0.011762972,0.010746321
2,mse,0.013300789,0.0021990049,0.015283024,0.012890057,0.01582157,0.011762972,0.010746321
3,null_deviance,46.652134,4.6403284,53.98957,45.268993,44.027664,47.982864,41.991573
4,r2,0.91622835,0.013008964,0.91688246,0.9157046,0.8950556,0.9285597,0.92493933
5,residual_deviance,3.8793285,0.6472956,4.462643,3.7638965,4.619899,3.4230247,3.1271791
6,rmse,0.115012884,0.009541001,0.123624526,0.11353438,0.12578383,0.10845724,0.10366446
7,rmsle,0.00899113,0.00081375416,0.009876991,0.008816265,0.009792915,0.008386962,0.0080825165



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,iteration,lambda,predictors,deviance_train,deviance_test,deviance_xval,deviance_se
0,,2019-12-29 17:15:30,0.000 sec,1,33.0,304,0.109922,,0.117879,0.006004
1,,2019-12-29 17:15:30,0.019 sec,2,24.0,304,0.098072,,0.107005,0.005695
2,,2019-12-29 17:15:30,0.030 sec,3,17.0,304,0.085299,,0.094839,0.005331
3,,2019-12-29 17:15:30,0.036 sec,4,13.0,304,0.072343,,0.081965,0.004917
4,,2019-12-29 17:15:30,0.044 sec,5,9.2,304,0.059998,,0.069147,0.004468
5,,2019-12-29 17:15:30,0.053 sec,6,6.7,304,0.049016,,0.057162,0.004005
6,,2019-12-29 17:15:30,0.061 sec,7,4.9,304,0.039837,,0.046719,0.003553
7,,2019-12-29 17:15:30,0.066 sec,8,3.6,304,0.032598,,0.038148,0.003128
8,,2019-12-29 17:15:30,0.077 sec,9,2.6,304,0.027139,,0.031501,0.002743
9,,2019-12-29 17:15:30,0.090 sec,10,1.9,304,0.023138,,0.026557,0.002406



See the whole table with table.as_data_frame()
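
The `mean` and `sd` columns of the Cross-Validation Metrics Summary above are simply aggregates over the five fold columns. As a small check, the sketch below recomputes them for the `rmse` row with the standard `statistics` module; the fold values are copied from the table, and the use of the sample standard deviation is my assumption about how the `sd` column is computed.

```python
import statistics

# rmse per fold, copied from cv_1_valid ... cv_5_valid in the table above.
fold_rmse = [0.123624526, 0.11353438, 0.12578383, 0.10845724, 0.10366446]

mean_rmse = statistics.mean(fold_rmse)
sd_rmse = statistics.stdev(fold_rmse)  # sample standard deviation (n - 1)

print("mean:", mean_rmse)
print("sd:", sd_rmse)
```

A small `sd` relative to the mean suggests the model performs consistently across folds.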




The leader model can also plot its feature importance:

In [10]:
aml.leader.varimp_plot()

<Figure size 1400x1000 with 1 Axes>
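
To my understanding, for a GLM leader like this one the importances are based on the absolute values of the standardized coefficients, which are then rescaled relative to the largest one and also expressed as a share of the total. The sketch below reproduces that bookkeeping in plain Python. The coefficient names and values are made up for illustration and are not the values behind the plot.

```python
# Made-up standardized coefficients, for illustration only.
standardized_coefs = {
    "GrLivArea": 0.12,
    "OverallQual": 0.09,
    "Neighborhood.NoRidge": -0.05,
}

# The sign is dropped: a strong negative effect is as important as a positive one.
relative = {name: abs(coef) for name, coef in standardized_coefs.items()}

largest = max(relative.values())
total = sum(relative.values())

scaled = {name: value / largest for name, value in relative.items()}    # top feature = 1.0
percentage = {name: value / total for name, value in relative.items()}  # shares sum to 1

ranked = sorted(relative, key=relative.get, reverse=True)
for name in ranked:
    print(name, round(scaled[name], 3), round(percentage[name], 3))
```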

### Predict

You can simply call `predict` on your best model:

In [11]:
preds = aml.leader.predict(test)

glm prediction progress: |████████████████████████████████████████████████| 100%




In [12]:
# Convert the predictions back to the original scale (they were log-transformed, remember?) and save them as a CSV file.
result = preds.expm1()
sub = test_id.cbind(result)
sub.columns = ['Id','SalePrice']
sub = sub.as_data_frame()
sub.to_csv('submission.csv', index = False)
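
Before uploading, it can be worth running a quick sanity check on the file. The helper below is a hypothetical sketch using only the standard library: `check_submission` is not part of H2O or Kaggle, and the two-row sample is made up. It verifies the header, the row count, and that every price is a positive number.

```python
import csv
import io


def check_submission(csv_text, expected_rows):
    """Return a list of problems found in a submission CSV (empty list means OK)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    if reader.fieldnames != ["Id", "SalePrice"]:
        problems.append("unexpected header: %s" % reader.fieldnames)
        return problems
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append("expected %d rows, found %d" % (expected_rows, len(rows)))
    for row in rows:
        try:
            price = float(row["SalePrice"])
        except (TypeError, ValueError):
            problems.append("non-numeric price for Id %s" % row["Id"])
            continue
        # NaN fails this comparison too, so missing predictions are caught here.
        if not price > 0:
            problems.append("non-positive price for Id %s" % row["Id"])
    return problems


# Made-up two-row sample in the same layout as submission.csv.
good_sample = "Id,SalePrice\n1461,120000.5\n1462,155000.0\n"
bad_sample = "Id,SalePrice\n1461,-5\n1462,abc\n"

good_problems = check_submission(good_sample, expected_rows=2)
bad_problems = check_submission(bad_sample, expected_rows=2)
print(good_problems)
print(bad_problems)
```

For the real file you would pass `open('submission.csv').read()` and the number of rows in `test`.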