<center> <h1>XXX Datathon - Team n°1</h1> </center>

## In this notebook:

In this notebook, we build an **AutoML** instance using the **H2O AutoML library**
(https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html#).

AutoML, or automated machine learning, automatically generates a data analysis pipeline that can include data pre-processing, feature selection, and feature engineering methods, along with machine learning methods and parameter settings optimized for your data. Traditionally, these steps are very time-consuming and require expertise. The aim of AutoML is therefore to perform them quickly and to make data science and machine learning more readily accessible. (http://automl.info)

We want to experiment with this tool and see whether it can help us.


In [1]:
#Database Management
import numpy as np 
import pandas as pd 

#Sklearn
from sklearn.model_selection import train_test_split

#Other
#!pip install h2o
import h2o
from h2o.automl import H2OAutoML
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', 5000)
pd.set_option('display.max_rows', 5000)


# <font color='darkorange'>Loading the dataset, train/test split </font>

In [2]:
df = pd.read_csv('./data/processed_data.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5191 entries, 0 to 5190
Data columns (total 54 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   FACUL_NUM              5191 non-null   object 
 1   DIVISION_NUM           5191 non-null   int64  
 2   SEGMENT_LOB            5191 non-null   object 
 3   UWYEAR                 5191 non-null   int64  
 4   CT_PERIOD              5191 non-null   float64
 5   MAINOCCUPANCY          5191 non-null   object 
 6   SECTOR                 5191 non-null   object 
 7   BUSINESSUNIT           5191 non-null   object 
 8   UWCENTER               5191 non-null   object 
 9   SCOPE_PERILS           5191 non-null   object 
 10  SUBSIDIARY             5191 non-null   object 
 11  PARTTYPE               5191 non-null   object 
 12  GUARANTEE              5191 non-null   object 
 13  MAIN_PRICING_CATEG     5191 non-null   object 
 14  BI_TYPE                5191 non-null   object 
 15  BI_P

In [3]:
train, test = train_test_split(df, test_size=0.2)
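
The split above is unseeded, so each run of the notebook produces a different partition. As a minimal standard-library sketch (not the notebook's own code) of what a seeded 80/20 shuffle split does, the hypothetical helper `seeded_split` below partitions row indices reproducibly. With scikit-learn, the same effect comes from passing `random_state` to `train_test_split`. The seed value 1234 is an arbitrary assumption.

```python
import math
import random


def seeded_split(n_rows, test_size=0.2, seed=1234):
    """Return (train_idx, test_idx) for a reproducible shuffle split.

    Hypothetical helper for illustration. The test set size is rounded up,
    so 5191 rows give 1039 test rows and 4152 training rows.
    """
    n_test = math.ceil(test_size * n_rows)
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = seeded_split(5191)
print(len(train_idx), len(test_idx))
```

These sizes are consistent with the degrees of freedom reported in the model metrics further down (4151 and 1038, that is, one less than the row counts).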

# <font color='darkorange'> AutoML </font>

We use the H2O AutoML library. The current version trains and cross-validates the following algorithms (in the following order): 

- three pre-specified XGBoost GBM (Gradient Boosting Machine) models
- a fixed grid of GLMs
- a default Random Forest (DRF)
- five pre-specified H2O GBMs
- a near-default Deep Neural Net
- an Extremely Randomized Forest (XRT)
- a random grid of XGBoost GBMs
- a random grid of H2O GBMs
- a random grid of Deep Neural Nets. 

In some cases, there will not be enough time to complete all the algorithms, so some may be missing from the leaderboard. AutoML then trains two additional Stacked Ensemble models.

As a first step, we need to initialize an H2O instance.

In [4]:

h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O_cluster_uptime: 30 mins 58 secs
H2O_cluster_timezone: Europe/Paris
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.32.0.4
H2O_cluster_version_age: 1 month and 5 days
H2O_cluster_name: H2O_from_python_ziwang_b32vfn
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.772 Gb
H2O_cluster_total_cores: 12
H2O_cluster_allowed_cores: 12


We now create an H2OFrame, which holds our training data. This frame is fed to AutoML for training. We also define the predictor and response variables.

In [5]:
hf = h2o.H2OFrame(train)

predictors = hf.drop('PREMIUM').columns
response = 'PREMIUM'

Parse progress: |█████████████████████████████████████████████████████████| 100%
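
`hf.drop('PREMIUM').columns` derives a second frame only to read its column names. An equivalent way to get the predictor list is to filter the response out of the column names. The sketch below runs on a plain list, an illustrative stand-in for `hf.columns` that contains a few of the column names shown by `df.info()` plus the response.

```python
# Stand-in for hf.columns: a few column names from df.info(), plus the response.
columns = ["FACUL_NUM", "DIVISION_NUM", "SEGMENT_LOB", "UWYEAR", "PREMIUM"]

response = "PREMIUM"
# Every column except the response is used as a predictor.
predictors = [name for name in columns if name != response]

print(predictors)
```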


Now we create an AutoML instance and train it on this frame to generate a leaderboard of models. The run is capped at 20 models and 300 seconds.

In [6]:
aml = H2OAutoML(
    max_models=20,
    max_runtime_secs=300,
    seed=1234
)

aml.train(
    x=predictors,
    y=response,
    training_frame=hf,
)

AutoML progress: |████████████████████████████████████████████████████████| 100%


In [7]:
lb = aml.leaderboard
lb.head(rows=10) 

| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
|---|---|---|---|---|---|
| StackedEnsemble_BestOfFamily_AutoML_20210306_201945 | 19707400000.0 | 140383 | 19707400000.0 | 59936.2 | |
| StackedEnsemble_AllModels_AutoML_20210306_201945 | 20125700000.0 | 141865 | 20125700000.0 | 59500.1 | |
| GBM_1_AutoML_20210306_201945 | 22840600000.0 | 151131 | 22840600000.0 | 59352.7 | |
| XGBoost_3_AutoML_20210306_201945 | 23237000000.0 | 152437 | 23237000000.0 | 66023.0 | |
| GBM_grid__1_AutoML_20210306_201945_model_2 | 23677400000.0 | 153874 | 23677400000.0 | 58012.1 | |
| GBM_3_AutoML_20210306_201945 | 24821500000.0 | 157549 | 24821500000.0 | 60611.7 | |
| XGBoost_grid__1_AutoML_20210306_201945_model_1 | 25085100000.0 | 158383 | 25085100000.0 | 66207.6 | |
| GBM_2_AutoML_20210306_201945 | 25550500000.0 | 159845 | 25550500000.0 | 62366.1 | |
| GBM_4_AutoML_20210306_201945 | 25759900000.0 | 160499 | 25759900000.0 | 60171.4 | |
| XGBoost_grid__1_AutoML_20210306_201945_model_3 | 27009000000.0 | 164344 | 27009000000.0 | 67124.5 | |




At first glance, tree-based models such as GBM and XGBoost seem to perform well.
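
To put numbers on this reading of the leaderboard, the sketch below counts the algorithm families among the ten listed models and measures how far the best single model is from the best ensemble. The model ids and RMSE values are copied from the leaderboard output. Taking the text before the first underscore as the family name is an assumption about the id format that holds for the ids shown here.

```python
from collections import Counter

# (model_id, RMSE) pairs copied from the leaderboard output.
leaderboard = [
    ("StackedEnsemble_BestOfFamily_AutoML_20210306_201945", 140383),
    ("StackedEnsemble_AllModels_AutoML_20210306_201945", 141865),
    ("GBM_1_AutoML_20210306_201945", 151131),
    ("XGBoost_3_AutoML_20210306_201945", 152437),
    ("GBM_grid__1_AutoML_20210306_201945_model_2", 153874),
    ("GBM_3_AutoML_20210306_201945", 157549),
    ("XGBoost_grid__1_AutoML_20210306_201945_model_1", 158383),
    ("GBM_2_AutoML_20210306_201945", 159845),
    ("GBM_4_AutoML_20210306_201945", 160499),
    ("XGBoost_grid__1_AutoML_20210306_201945_model_3", 164344),
]

# Assumed id format: the family is the text before the first underscore.
families = Counter(model_id.split("_")[0] for model_id, _ in leaderboard)
print(dict(families))

# Relative RMSE gap between the best ensemble and the best single model.
best_rmse = leaderboard[0][1]
best_single_rmse = next(
    rmse for model_id, rmse in leaderboard
    if not model_id.startswith("StackedEnsemble")
)
gap = (best_single_rmse - best_rmse) / best_rmse
print(f"best single model is {gap:.1%} worse than the best ensemble")
```

All eight single models in the top ten are gradient-boosted trees (five H2O GBMs and three XGBoost models), and the best of them trails the best ensemble by under 8% in RMSE.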


# <font color='darkorange'>Deep dive into the AutoML results </font>


To check for overfitting, let's examine the training, cross-validation and test performance of the top-ranked model.

In [8]:
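# Model ids in leaderboard order; index 0 is the top-ranked model.
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
m0 = h2o.get_model(model_ids[0])
m0  # show the model details; `m0.explain` without a call only references the method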

Model Details
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_BestOfFamily_AutoML_20210306_201945

No model summary for this model

ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **

MSE: 2362052959.0505233
RMSE: 48600.95635942284
MAE: 29340.13515534512
RMSLE: NaN
R^2: 0.9748720142032873
Mean Residual Deviance: 2362052959.0505233
Null degrees of freedom: 4151
Residual degrees of freedom: 4148
Null deviance: 390291683755277.8
Residual deviance: 9807243885977.773
AIC: 101404.63862302399

ModelMetricsRegressionGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 19707377881.950706
RMSE: 140382.96863206272
MAE: 59936.1950519356
RMSLE: NaN
R^2: 0.7903490226115972
Mean Residual Deviance: 19707377881.950706
Null degrees of freedom: 4151
Residual degrees of freedom: 4148
Null deviance: 390442747169562.75
Residual deviance: 81825032965859.33
AIC: 110212.94859212307



In [9]:
hf_test = h2o.H2OFrame(test)

aml.leader.model_performance(hf_test)

Parse progress: |█████████████████████████████████████████████████████████| 100%

ModelMetricsRegressionGLM: stackedensemble
** Reported on test data. **

MSE: 21714116251.622528
RMSE: 147357.10451696086
MAE: 61759.31460377956
RMSLE: NaN
R^2: 0.7456351311449001
Mean Residual Deviance: 21714116251.622528
Null degrees of freedom: 1038
Residual degrees of freedom: 1035
Null deviance: 88696406085753.73
Residual deviance: 22560966785435.81
AIC: 27688.030584280343
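
To make the three reports easier to compare, the sketch below expresses the held-out RMSE as a multiple of the training RMSE. The values are copied from the stacked ensemble's metric reports. `gap_ratio` is a hypothetical helper name, and no particular threshold for "too large" is implied.

```python
# RMSE values copied from the stacked ensemble's metric reports.
rmse = {
    "train": 48600.95635942284,
    "cross-validation": 140382.96863206272,
    "test": 147357.10451696086,
}


def gap_ratio(held_out_rmse, train_rmse):
    """How many times larger the held-out error is than the training error."""
    return held_out_rmse / train_rmse


ratios = {
    split: gap_ratio(rmse[split], rmse["train"])
    for split in ("cross-validation", "test")
}
for split, ratio in ratios.items():
    print(f"{split}: {ratio:.2f}x the training RMSE")
```

Both held-out errors come out at roughly three times the training error, while the cross-validation and test figures stay close to each other.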




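There seems to be an overfitting issue: the test RMSE (about 147,000) is roughly three times the training RMSE (about 48,600), and R^2 drops from 0.97 on the training data to 0.75 on the test data.
Now we take a look at the best-performing single model, GBM_1, which ranks third on the leaderboard.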

In [10]:
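m2 = h2o.get_model(model_ids[2])  # index 2 is GBM_1, the best single model
m2  # show the model details; `m2.explain` without a call only references the method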

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_1_AutoML_20210306_201945


Model Summary: 


| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 97.0 | 97.0 | 84388.0 | 6.0 | 6.0 | 6.0 | 32.0 | 63.0 | 51.42268 |




ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 851254709.3190936
RMSE: 29176.26962651486
MAE: 20194.57110341232
RMSLE: NaN
Mean Residual Deviance: 851254709.3190936

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 22840627400.555138
RMSE: 151131.15959508528
MAE: 59352.65349862437
RMSLE: NaN
Mean Residual Deviance: 22840627400.555138

Cross-Validation Metrics Summary: 


| | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid |
|---|---|---|---|---|---|---|---|
| mae | 59351.484 | 5811.2476 | 68823.15 | 54737.953 | 60198.73 | 54615.742 | 58381.844 |
| mean_residual_deviance | 22838671400.0 | 8782501900.0 | 36636443000.0 | 17165000700.0 | 24189853700.0 | 13671348200.0 | 22530709500.0 |
| mse | 22838671400.0 | 8782501900.0 | 36636443000.0 | 17165000700.0 | 24189853700.0 | 13671348200.0 | 22530709500.0 |
| r2 | 0.76057994 | 0.050299212 | 0.69457316 | 0.7430632 | 0.7466154 | 0.8252824 | 0.7933656 |
| residual_deviance | 22838671400.0 | 8782501900.0 | 36636443000.0 | 17165000700.0 | 24189853700.0 | 13671348200.0 | 22530709500.0 |
| rmse | 148995.9 | 28259.79 | 191406.48 | 131015.27 | 155530.88 | 116924.54 | 150102.33 |
| rmsle | | 0.0 | | | | | |



Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance |
|---|---|---|---|---|---|
| 2021-03-06 20:20:08 | 2.185 sec | 0.0 | 306595.641235 | 156899.467009 | 94000890000.0 |
| 2021-03-06 20:20:08 | 2.220 sec | 5.0 | 214335.453882 | 110063.879654 | 45939690000.0 |
| 2021-03-06 20:20:08 | 2.253 sec | 10.0 | 158685.944094 | 82895.701592 | 25181230000.0 |
| 2021-03-06 20:20:08 | 2.291 sec | 15.0 | 121776.936717 | 66730.463661 | 14829620000.0 |
| 2021-03-06 20:20:08 | 2.326 sec | 20.0 | 96907.852814 | 56062.434868 | 9391132000.0 |
| 2021-03-06 20:20:09 | 2.359 sec | 25.0 | 81720.048292 | 49059.531719 | 6678166000.0 |
| 2021-03-06 20:20:09 | 2.405 sec | 30.0 | 70739.845425 | 43638.719076 | 5004126000.0 |
| 2021-03-06 20:20:09 | 2.450 sec | 35.0 | 62923.767205 | 39671.53119 | 3959400000.0 |
| 2021-03-06 20:20:09 | 2.489 sec | 40.0 | 56144.499859 | 36085.031643 | 3152205000.0 |
| 2021-03-06 20:20:09 | 2.524 sec | 45.0 | 51577.018522 | 33521.286783 | 2660189000.0 |



See the whole table with table.as_data_frame()

Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| TOTALINSUREDVALUE | 446497400000000.0 | 1.0 | 0.238918 |
| INSUREDVALUEPD | 212418700000000.0 | 0.475745 | 0.113664 |
| XXX_SHARE | 209058400000000.0 | 0.468219 | 0.111866 |
| MAINOCCUPANCY | 200856900000000.0 | 0.44985 | 0.107477 |
| MAIN_PRICING_CATEG | 122939600000000.0 | 0.275342 | 0.065784 |
| MODELED_CAT_EXPLOSS | 119885100000000.0 | 0.268501 | 0.06415 |
| INSUREDVALUEBI | 48971120000000.0 | 0.109678 | 0.026204 |
| uw_index | 45066160000000.0 | 0.100933 | 0.024115 |
| BI_time(Days) | 37336330000000.0 | 0.08362 | 0.019978 |
| UWYEAR | 36462540000000.0 | 0.081664 | 0.019511 |



See the whole table with table.as_data_frame()



In [11]:
m2.model_performance(hf_test)


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 22157806169.9135
RMSE: 148854.98369189224
MAE: 60580.09890986102
RMSLE: NaN
Mean Residual Deviance: 22157806169.9135




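The cross-validation and test RMSE are similar (about 151,000 and 149,000), which suggests that the cross-validation estimate is reliable. The training RMSE (about 29,200) is roughly five times lower, however, so this model also fits the training data much more closely than unseen data.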
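
The per-fold RMSE values in the GBM's cross-validation summary show how much the error varies from one held-out fold to another. As a small check, the sketch below recomputes the `mean` and `sd` entries of the `rmse` row with the standard library and places the test RMSE relative to that spread. All numbers are copied from the outputs above. Using the sample standard deviation is an assumption that happens to reproduce the reported `sd`.

```python
import statistics

# Per-fold validation RMSE, copied from the cross-validation summary.
fold_rmse = [191406.48, 131015.27, 155530.88, 116924.54, 150102.33]
test_rmse = 148854.98369189224  # GBM RMSE reported on the test data

mean_rmse = statistics.mean(fold_rmse)
sd_rmse = statistics.stdev(fold_rmse)  # sample standard deviation (n - 1)

# Distance of the test RMSE from the fold mean, in fold standard deviations.
z = (test_rmse - mean_rmse) / sd_rmse
print(f"mean={mean_rmse:.1f}, sd={sd_rmse:.1f}, z={z:.3f}")
```

The test RMSE lies within a small fraction of one standard deviation of the fold mean, so the held-out result is in line with what cross-validation predicted, even though individual folds range from about 117,000 to about 191,000.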

# <font color='darkorange'>Conclusion </font>


This experiment shows that AutoML results cannot replace hand-selected models, for several reasons:

- The leaderboard ranks models by cross-validation metrics only; the gap between training and held-out error is not part of the ranking, so overfitting has to be checked separately.
- The data could be better prepared for certain models, for example by keeping only the most important features, but AutoML does not explore this.
- The best models are the ensemble models, but a better ensemble might be built from a different set of base models.
- ...

However, AutoML results can be useful for several reasons:

- They serve as a good baseline.
- They provide hints for parameter tuning.
- ...

Although AutoML results should not be used directly, it is worth integrating AutoML into our workflow: it gives us a sense that, in general, tree-based models perform well on our dataset.