# H2O's AutoML for Predictive Modeling
Automated Machine Learning (AutoML) makes algorithms and models more accessible to non-experts and improves the efficiency of experts. One of the significant players in this space is H2O.ai, with its open-source platform, H2O. In this notebook, we will walk through the process of using H2O's AutoML to train a predictive model and make predictions on unseen data. We will use a customer churn dataset, where our task is to predict whether a customer will churn based on features such as usage patterns and account characteristics.

## Install the Required Libraries
Before we get started, we need to install the required Python libraries. These include the H2O library itself, as well as a few others that we'll use along the way.

In [2]:
!pip install requests
!pip install tabulate
!pip install "colorama>=0.3.8"
!pip install future



In [3]:
!pip install h2o



## Initialize H2O and Set Up AutoML
Once we've installed the libraries, we'll import them and initialize H2O. We'll also specify the maximum memory size to be used.

In [4]:
import h2o
from h2o.automl import H2OAutoML
h2o.init(max_mem_size='16G') 

Checking whether there is an H2O instance running at http://localhost:54321. connected.


property,value
H2O_cluster_uptime:,8 hours 57 mins
H2O_cluster_timezone:,Asia/Kolkata
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.40.0.4
H2O_cluster_version_age:,1 month and 20 days
H2O_cluster_name:,H2O_from_python_pujachaudhury_f8wv8n
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,13.81 Gb
H2O_cluster_total_cores:,8
H2O_cluster_allowed_cores:,8


## Import Data and Set Up Variables
Next, we'll import our dataset, take a quick look at the first few rows, and split it into a training set and a test set.

In [5]:
df = h2o.import_file("Ecom.csv")

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [6]:
df.head()

CustomerID,Churn,Tenure,PreferredLoginDevice,CityTier,WarehouseToHome,PreferredPaymentMode,Gender,HourSpendOnApp,NumberOfDeviceRegistered,PreferedOrderCat,SatisfactionScore,MaritalStatus,NumberOfAddress,Complain,OrderAmountHikeFromlastYear,CouponUsed,OrderCount,DaySinceLastOrder,CashbackAmount
50001,1,4.0,Mobile Phone,3,6,Debit Card,Female,3.0,3,Laptop & Accessory,2,Single,9,1,11,1,1,5,160
50002,1,,Phone,1,8,UPI,Male,3.0,4,Mobile,3,Single,7,1,15,0,1,0,121
50003,1,,Phone,1,30,Debit Card,Male,2.0,4,Mobile,3,Single,6,1,14,0,1,3,120
50004,1,0.0,Phone,3,15,Debit Card,Male,2.0,4,Laptop & Accessory,5,Single,8,0,23,0,1,3,134
50005,1,0.0,Phone,1,12,CC,Male,,3,Mobile,5,Single,3,0,11,1,1,3,130
50006,1,0.0,Computer,1,22,Debit Card,Female,3.0,5,Mobile Phone,5,Single,2,1,22,4,6,7,139
50007,1,,Phone,3,11,Cash on Delivery,Male,2.0,3,Laptop & Accessory,2,Divorced,4,0,14,0,1,0,121
50008,1,,Phone,1,6,CC,Male,3.0,3,Mobile,2,Divorced,3,1,16,2,2,0,123
50009,1,13.0,Phone,3,9,E wallet,Male,,4,Mobile,3,Divorced,2,1,14,0,1,2,127
50010,1,,Phone,1,31,Debit Card,Male,2.0,5,Mobile,3,Single,2,0,12,1,1,1,123
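
The 'Churn' column holds 0/1 values, which H2O parses as a numeric column. With a numeric response, AutoML trains regression models, which is why the results later in this notebook report RMSE and MAE. If classification is what you want, one option is to convert the target to a categorical column before splitting and training. The helper below is a minimal sketch: the function name is hypothetical, `frame` is assumed to be an H2OFrame, and `asfactor()` is the H2OFrame method that does the conversion. The outputs shown in this notebook come from the numeric run and would differ after this change.

```python
def as_classification_target(frame, target="Churn"):
    """Convert a numeric 0/1 response column to categorical, in place.

    `frame` is assumed to be an H2OFrame. `asfactor()` returns the column
    as a categorical (enum) column, which makes AutoML treat the task as
    classification instead of regression.
    """
    frame[target] = frame[target].asfactor()
    return frame
```

A typical call would be `df = as_classification_target(df)` right after `h2o.import_file`, so that both the training and test splits inherit the categorical target.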


In [7]:
df_train, df_test = df.split_frame(ratios=[0.8])  # roughly 80% train / 20% test; H2O's split is approximate

In [8]:
df_train

CustomerID,Churn,Tenure,PreferredLoginDevice,CityTier,WarehouseToHome,PreferredPaymentMode,Gender,HourSpendOnApp,NumberOfDeviceRegistered,PreferedOrderCat,SatisfactionScore,MaritalStatus,NumberOfAddress,Complain,OrderAmountHikeFromlastYear,CouponUsed,OrderCount,DaySinceLastOrder,CashbackAmount
50001,1,4.0,Mobile Phone,3,6,Debit Card,Female,3,3,Laptop & Accessory,2,Single,9,1,11,1,1,5,160
50002,1,,Phone,1,8,UPI,Male,3,4,Mobile,3,Single,7,1,15,0,1,0,121
50003,1,,Phone,1,30,Debit Card,Male,2,4,Mobile,3,Single,6,1,14,0,1,3,120
50004,1,0.0,Phone,3,15,Debit Card,Male,2,4,Laptop & Accessory,5,Single,8,0,23,0,1,3,134
50006,1,0.0,Computer,1,22,Debit Card,Female,3,5,Mobile Phone,5,Single,2,1,22,4,6,7,139
50007,1,,Phone,3,11,Cash on Delivery,Male,2,3,Laptop & Accessory,2,Divorced,4,0,14,0,1,0,121
50010,1,,Phone,1,31,Debit Card,Male,2,5,Mobile,3,Single,2,0,12,1,1,1,123
50013,1,0.0,Phone,1,11,COD,Male,2,3,Mobile,3,Single,2,1,13,2,2,2,134
50014,1,0.0,Phone,1,15,CC,Male,3,4,Mobile,3,Divorced,1,1,17,0,1,0,134
50015,1,9.0,Mobile Phone,3,15,Credit Card,Male,3,4,Fashion,2,Single,2,0,16,0,4,7,196


We'll then specify our target variable and the input features. In this case, our target variable is 'Churn', and the features are all the other columns in the dataframe except for 'CustomerID'.

In [9]:
y = "Churn"  # target (dependent variable)
x = df.columns  # candidate features (independent variables)
x.remove(y)
x.remove('CustomerID')

## Train AutoML Model
Now we're ready to train our model. We'll specify a few parameters for the AutoML function: the maximum runtime in seconds, the maximum number of models to train, a seed for reproducibility, the logging verbosity, and the number of folds for cross-validation.

In [10]:
aml = H2OAutoML(max_runtime_secs=500, max_models=15, seed=7, verbosity="info", nfolds=4)
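
To make these settings easier to vary between experiments, they can be collected in a dictionary and unpacked into the constructor. The sketch below uses the same values as the cell above; the function name and the `classification` flag are hypothetical. As documented by H2O, the run stops at whichever of `max_runtime_secs` or `max_models` is reached first, and `max_models` does not count the Stacked Ensemble models. The `sort_metric="AUC"` entry is only meaningful if the target has been converted to a categorical column.

```python
def automl_settings(classification=False):
    """Return keyword arguments for H2OAutoML as a plain dict."""
    settings = {
        "max_runtime_secs": 500,  # wall-clock budget for the whole run
        "max_models": 15,         # base models only; ensembles are extra
        "seed": 7,                # for reproducibility
        "nfolds": 4,              # k-fold cross-validation
        "verbosity": "info",      # print the build log during training
    }
    if classification:
        # Assumption: the response was converted to categorical beforehand.
        settings["sort_metric"] = "AUC"
    return settings
```

With this in place, `H2OAutoML(**automl_settings())` would be equivalent to the constructor call above.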

In [11]:
aml.train(x=x, y=y, training_frame=df_train)

AutoML progress: |
04:02:18.867: Project: AutoML_6_20230618_40218
04:02:18.869: Setting stopping tolerance adaptively based on the training frame: 0.014867525836251314
04:02:18.869: Build control seed: 7
04:02:18.874: training frame: Frame key: AutoML_6_20230618_40218_training_py_3_sid_ba79    cols: 20    rows: 4524  chunks: 32    size: 212881  checksum: 7254733612136558130
04:02:18.876: validation frame: NULL
04:02:18.876: leaderboard frame: NULL
04:02:18.876: blending frame: NULL
04:02:18.876: response column: Churn
04:02:18.876: fold column: null
04:02:18.876: weights column: null
04:02:18.892: Loading execution steps: [{XGBoost : [def_2 (1g, 10w), def_1 (2g, 10w), def_3 (3g, 10w), grid_1 (4g, 90w), lr_search (7g, 30w)]}, {GLM : [def_1 (1g, 10w)]}, {DRF : [def_1 (2g, 10w), XRT (3g, 10w)]}, {GBM : [def_5 (1g, 10w), def_2 (2g, 10w), def_3 (2g, 10w), def_4 (2g, 10w), def_1 (3g, 10w), grid_1 (4g, 60w), lr_annealing (7g, 10w)]}, {DeepLearning : [def_1 (3g, 10w), grid_1 (4g, 30w), grid_2 

key,value
Stacking strategy,cross_validation
Number of base models (used / total),5/15
# GBM base models (used / total),2/6
# XGBoost base models (used / total),2/5
# DRF base models (used / total),1/2
# DeepLearning base models (used / total),0/1
# GLM base models (used / total),0/1
Metalearner algorithm,GLM
Metalearner fold assignment scheme,Random
Metalearner nfolds,4

metric,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid
mae,0.0930765,0.0047428,0.0985117,0.0888732,0.0955842,0.089337
mean_residual_deviance,0.0323132,0.0025117,0.0358759,0.0299873,0.0315239,0.0318657
mse,0.0323132,0.0025117,0.0358759,0.0299873,0.0315239,0.0318657
null_deviance,158.31607,3.1326356,159.50478,153.64606,160.33812,159.77534
r2,0.7691805,0.0134369,0.7502401,0.769403,0.7804991,0.77658
residual_deviance,36.504395,2.2222786,39.822304,35.324997,35.18072,35.689552
rmse,0.1796593,0.0069034,0.1894095,0.1731683,0.1775498,0.1785096
rmsle,0.1286928,0.0041045,0.1346084,0.1251248,0.1276941,0.1273438
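
As a quick sanity check on how to read the cross-validation summary above, the following sketch recomputes a few of its entries from the per-fold values. It only uses numbers copied from the table; the variable names are illustrative. It shows that RMSE is the square root of MSE within each fold, and that the `mean` column averages the per-fold values, which is not the same as taking the square root of the mean MSE.

```python
import math

# Per-fold values copied from the cross-validation summary above.
fold_mse = [0.0358759, 0.0299873, 0.0315239, 0.0318657]
fold_rmse = [0.1894095, 0.1731683, 0.1775498, 0.1785096]

# Within each fold, RMSE is the square root of MSE.
recomputed_rmse = [math.sqrt(m) for m in fold_mse]
max_gap = max(abs(a - b) for a, b in zip(recomputed_rmse, fold_rmse))

# The "mean" column is the average of the per-fold values ...
mean_rmse = sum(fold_rmse) / len(fold_rmse)

# ... which is slightly smaller than the square root of the mean MSE.
sqrt_mean_mse = math.sqrt(sum(fold_mse) / len(fold_mse))

print(f"largest per-fold gap: {max_gap:.2e}")
print(f"mean of fold RMSEs:   {mean_rmse:.7f}")
print(f"sqrt of mean MSE:     {sqrt_mean_mse:.7f}")
```

The identical `mse` and `mean_residual_deviance` rows are consistent with a squared-error (Gaussian) regression objective, where the two quantities coincide.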


Once the model is trained, we can view the leaderboard, which is a table that displays the performance of all the models that AutoML has trained.

In [12]:
lb = aml.leaderboard

In [13]:
lb

model_id,rmse,mse,mae,rmsle,mean_residual_deviance
StackedEnsemble_AllModels_1_AutoML_6_20230618_40218,0.179688,0.032288,0.0931891,0.128751,0.032288
StackedEnsemble_BestOfFamily_1_AutoML_6_20230618_40218,0.180193,0.0324695,0.0937873,0.129205,0.0324695
GBM_4_AutoML_6_20230618_40218,0.181717,0.0330211,0.0972626,0.130253,0.0330211
GBM_3_AutoML_6_20230618_40218,0.188147,0.0353992,0.103258,0.134788,0.0353992
GBM_2_AutoML_6_20230618_40218,0.196277,0.0385249,0.111302,0.140603,0.0385249
DRF_1_AutoML_6_20230618_40218,0.20007,0.0400282,0.109469,0.140022,0.0400282
GBM_5_AutoML_6_20230618_40218,0.20153,0.0406145,0.111857,0.143754,0.0406145
XGBoost_1_AutoML_6_20230618_40218,0.205472,0.0422186,0.117884,0.153185,0.0422186
XGBoost_2_AutoML_6_20230618_40218,0.206981,0.0428412,0.115222,0.153181,0.0428412
GBM_grid_1_AutoML_6_20230618_40218_model_1,0.207316,0.04298,0.127282,0.149832,0.04298


In [14]:
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])

In [15]:
model_ids

['StackedEnsemble_AllModels_1_AutoML_6_20230618_40218',
 'StackedEnsemble_BestOfFamily_1_AutoML_6_20230618_40218',
 'GBM_4_AutoML_6_20230618_40218',
 'GBM_3_AutoML_6_20230618_40218',
 'GBM_2_AutoML_6_20230618_40218',
 'DRF_1_AutoML_6_20230618_40218',
 'GBM_5_AutoML_6_20230618_40218',
 'XGBoost_1_AutoML_6_20230618_40218',
 'XGBoost_2_AutoML_6_20230618_40218',
 'GBM_grid_1_AutoML_6_20230618_40218_model_1',
 'XGBoost_3_AutoML_6_20230618_40218',
 'XRT_1_AutoML_6_20230618_40218',
 'GBM_1_AutoML_6_20230618_40218',
 'XGBoost_grid_1_AutoML_6_20230618_40218_model_2',
 'XGBoost_grid_1_AutoML_6_20230618_40218_model_1',
 'DeepLearning_1_AutoML_6_20230618_40218',
 'GLM_1_AutoML_6_20230618_40218']
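
The leaderboard is sorted from best to worst, so this list can also be used to pick a model other than the leader. For example, a stacked ensemble is harder to inspect than a single model, so you may want the best model that is not an ensemble. The helper below is a minimal sketch with a hypothetical name; it relies only on the `StackedEnsemble` prefix visible in the model ids above.

```python
def best_single_model_id(model_ids):
    """Return the first (best-ranked) id that is not a stacked ensemble."""
    for model_id in model_ids:
        if not model_id.startswith("StackedEnsemble"):
            return model_id
    return None  # every model was an ensemble, or the list was empty


# First few ids copied from the leaderboard above, in ranked order.
leaderboard_ids = [
    "StackedEnsemble_AllModels_1_AutoML_6_20230618_40218",
    "StackedEnsemble_BestOfFamily_1_AutoML_6_20230618_40218",
    "GBM_4_AutoML_6_20230618_40218",
    "GBM_3_AutoML_6_20230618_40218",
]
print(best_single_model_id(leaderboard_ids))
```

In the notebook, the returned id could then be passed to `h2o.get_model(...)` to retrieve the model object.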

In [16]:
aml.leader.model_performance(df_test)

In [17]:
aml.leader

key,value
Stacking strategy,cross_validation
Number of base models (used / total),5/15
# GBM base models (used / total),2/6
# XGBoost base models (used / total),2/5
# DRF base models (used / total),1/2
# DeepLearning base models (used / total),0/1
# GLM base models (used / total),0/1
Metalearner algorithm,GLM
Metalearner fold assignment scheme,Random
Metalearner nfolds,4

metric,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid
mae,0.0930765,0.0047428,0.0985117,0.0888732,0.0955842,0.089337
mean_residual_deviance,0.0323132,0.0025117,0.0358759,0.0299873,0.0315239,0.0318657
mse,0.0323132,0.0025117,0.0358759,0.0299873,0.0315239,0.0318657
null_deviance,158.31607,3.1326356,159.50478,153.64606,160.33812,159.77534
r2,0.7691805,0.0134369,0.7502401,0.769403,0.7804991,0.77658
residual_deviance,36.504395,2.2222786,39.822304,35.324997,35.18072,35.689552
rmse,0.1796593,0.0069034,0.1894095,0.1731683,0.1775498,0.1785096
rmsle,0.1286928,0.0041045,0.1346084,0.1251248,0.1276941,0.1273438


## Making Predictions on the Test Data
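Finally, we'll use our trained model to make predictions on the test data. Because 'Churn' was read as a numeric column, AutoML trained regression models, so the 'predict' column holds continuous scores rather than class labels, and a score can fall slightly outside the 0 to 1 range.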

In [18]:
pred = aml.predict(df_test)
pred.head()

stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%


predict
0.75434
0.815013
0.819143
0.91465
0.925408
0.0505852
0.566787
0.013585
-0.0233839
0.147007
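
To turn scores like these into churn / no-churn labels, one simple approach is to clip them to the 0 to 1 range and apply a cutoff. The sketch below uses the ten scores printed above; the function name and the 0.5 threshold are assumptions for illustration, and in practice the cutoff should be chosen from the cost of false positives versus false negatives.

```python
def scores_to_labels(scores, threshold=0.5):
    """Clip each score to [0, 1] and map it to 1 (churn) or 0 (no churn)."""
    labels = []
    for score in scores:
        clipped = min(max(score, 0.0), 1.0)
        labels.append(1 if clipped >= threshold else 0)
    return labels


# Scores copied from the prediction output above.
scores = [0.75434, 0.815013, 0.819143, 0.91465, 0.925408,
          0.0505852, 0.566787, 0.013585, -0.0233839, 0.147007]
labels = scores_to_labels(scores)
print(labels)
```

With the default cutoff, the first five rows and the seventh row are labeled as churn.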


And there you have it! You've trained a predictive model using H2O's AutoML and made predictions on unseen data. We encourage you to try H2O's AutoML on your own projects and experiment with the different parameters to see how they affect the model's performance. Happy modeling!