## Learning Objectives 

* Develop a function: if a block of code will be re-used several times, develop it as a function to avoid coding errors.
* ROC/AUC
* VARImp
* Train and test your model with a small dataset.
* Model with the entire dataset when there is no error.
* Keep your notebook clean and readable.
* AutoML. See [this H2O autoML api](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html) and [this one](http://docs.h2o.ai/h2o-tutorials/latest-stable/h2o-world-2017/automl/index.html)


In [26]:
import numpy as np
import datetime
import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [27]:
data = pd.read_csv("//Users/Chriskuo/Downloads/XYZloan_default_selected_vars.csv")

In [28]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(
     data, test_size=0.40, random_state=42)

In [29]:
train.shape

(48000, 90)

In [30]:
train.dtypes

Unnamed: 0          int64
Unnamed: 0.1        int64
Unnamed: 0.1.1      int64
id                  int64
loan_default        int64
AP001               int64
AP002               int64
AP003               int64
AP004               int64
AP005              object
AP006              object
AP007               int64
AP008               int64
AP009               int64
TD001               int64
TD002               int64
TD005               int64
TD006               int64
TD009               int64
TD010               int64
TD013               int64
TD014               int64
TD015               int64
TD022             float64
TD023             float64
TD024             float64
TD025             float64
TD026             float64
TD027             float64
TD028             float64
                   ...   
CD107             float64
CD108             float64
CD113             float64
CD114             float64
CD115             float64
CD117             float64
CD118             float64
CD120       

In [31]:
var = pd.DataFrame(train.dtypes)
var.head(10)

Unnamed: 0,0
Unnamed: 0,int64
Unnamed: 0.1,int64
Unnamed: 0.1.1,int64
id,int64
loan_default,int64
AP001,int64
AP002,int64
AP003,int64
AP004,int64
AP005,object


In [32]:
var = pd.DataFrame(train.dtypes).reset_index()
var.head()

Unnamed: 0,index,0
0,Unnamed: 0,int64
1,Unnamed: 0.1,int64
2,Unnamed: 0.1.1,int64
3,id,int64
4,loan_default,int64


In [33]:
var.columns = ['varname','dtype'] 
var.head(10)

Unnamed: 0,varname,dtype
0,Unnamed: 0,int64
1,Unnamed: 0.1,int64
2,Unnamed: 0.1.1,int64
3,id,int64
4,loan_default,int64
5,AP001,int64
6,AP002,int64
7,AP003,int64
8,AP004,int64
9,AP005,object


In [34]:
var['source'] = var['varname'].str[:2]
var.head()

Unnamed: 0,varname,dtype,source
0,Unnamed: 0,int64,Un
1,Unnamed: 0.1,int64,Un
2,Unnamed: 0.1.1,int64,Un
3,id,int64,id
4,loan_default,int64,lo


In [35]:
var['source'].value_counts()

CD    36
TD    24
AP     9
CR     8
PA     6
Un     3
MB     2
lo     1
id     1
Name: source, dtype: int64

In [36]:
# "AP004" is a bad data field and should be removed.
MB_list = list(var[var['source']=='MB']['varname'])
AP_list = list(var[(var['source']=='AP') & (var['varname']!='AP004')]['varname'])
TD_list = list(var[var['source']=='TD']['varname'])
CR_list = list(var[var['source']=='CR']['varname'])
PA_list = list(var[var['source']=='PA']['varname'])
CD_list = list(var[var['source']=='CD']['varname'])
AP_list

['AP001', 'AP002', 'AP003', 'AP005', 'AP006', 'AP007', 'AP008', 'AP009']

In [37]:
train['loan_default'].value_counts(dropna=False)

0    38736
1     9264
Name: loan_default, dtype: int64

# H2O

* If you encounter errors, [this page](https://h2o-release.s3.amazonaws.com/h2o/rel-yates/5/index.html) may help.

In [38]:
import h2o
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


0,1
H2O cluster uptime:,37 mins 37 secs
H2O cluster timezone:,America/New_York
H2O data parsing timezone:,UTC
H2O cluster version:,3.24.0.5
H2O cluster version age:,4 months and 28 days !!!
H2O cluster name:,H2O_from_python_chriskuo_axau69
H2O cluster total nodes:,1
H2O cluster free memory:,1.556 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


In [39]:
from h2o.automl import H2OAutoML

In [40]:
target='loan_default'

### When you model, you should run with a small sample dataset

* Try to write repeating code in a function

In [41]:
train_smpl = train.sample(frac=0.1, random_state=1)
test_smpl = test.sample(frac=0.1, random_state=1)
train_hex = h2o.H2OFrame(train_smpl)
test_hex = h2o.H2OFrame(test_smpl)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [42]:
predictors = CR_list + TD_list + AP_list + MB_list + CR_list + PA_list 

## Run AutoML 

Run AutoML, stopping after 60 seconds.  The `max_runtime_secs` argument provides a way to limit the AutoML run by time.  When using a time-limited stopping criterion, the number of models train will vary between runs.  If different hardware is used or even if the same machine is used but the available compute resources on that machine are not the same between runs, then AutoML may be able to train more models on one run vs another. 

The `test` frame is passed explicitly to the `leaderboard_frame` argument here, which means that instead of using cross-validated metrics, we use test set metrics for generating the leaderboard.

In [43]:
# Run AutoML for 20 base models (limited to 1 hour max runtime by default)
aml_v1 = H2OAutoML(max_runtime_secs = 60, max_models=20, seed=1)
aml_v1.train(predictors,target,training_frame=train_hex)

AutoML progress: |████████████████████████████████████████████████████████| 100%


## Leaderboard

Next, we will view the AutoML Leaderboard.  Since we specified a `leaderboard_frame` in the `H2OAutoML.train()` method for scoring and ranking the models, the AutoML leaderboard uses the performance on this data to rank the models.

After viewing the `"powerplant_lb_frame"` AutoML project leaderboard, we compare that to the leaderboard for the `"powerplant_full_data"` project.  We can see that the results are better when the full dataset is used for training.  

A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally and the leaderboard will be sorted by that metric.  In the case of regression, the default ranking metric is mean residual deviance.  In the future, the user will be able to specify any of the H2O metrics so that different metrics can be used to generate rankings on the leaderboard.

In [44]:
aml_v1.leaderboard.head()

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
GLM_grid_1_AutoML_20191116_210806_model_1,0.147001,0.383407,0.147001,0.29361,0.26904
StackedEnsemble_BestOfFamily_AutoML_20191116_210806,0.14713,0.383575,0.14713,0.293999,0.269089
StackedEnsemble_AllModels_AutoML_20191116_210806,0.147396,0.383922,0.147396,0.294316,0.2694
XGBoost_grid_1_AutoML_20191116_210806_model_1,0.148957,0.385949,0.148957,0.298872,0.272188
XGBoost_grid_1_AutoML_20191116_210806_model_3,0.14897,0.385966,0.14897,0.297642,0.27194
GBM_5_AutoML_20191116_210806,0.14929,0.386381,0.14929,0.294278,0.271852
XGBoost_3_AutoML_20191116_210806,0.149667,0.386868,0.149667,0.309214,0.274551
XGBoost_grid_1_AutoML_20191116_210806_model_4,0.150192,0.387547,0.150192,0.298969,0.273508
GBM_1_AutoML_20191116_210806,0.151545,0.389288,0.151545,0.29507,0.274555
DeepLearning_1_AutoML_20191116_210806,0.151646,0.389418,0.151646,0.289783,0.273573




Now we will view a snapshot of the top models. Here we should see the two Stacked Ensembles at or near the top of the leaderboard. Stacked Ensembles can almost always outperform a single model.

## Predict Using Leader Model

If you need to generate predictions on a test set, you can make predictions on the `"H2OAutoML"` object directly, or on the leader model object.

In [46]:
pred = aml_v1.predict(test_hex)
pred.head()

glm prediction progress: |████████████████████████████████████████████████| 100%


predict
0.245087
0.146986
0.178348
0.163205
0.134303
0.257439
0.158125
0.30727
0.135461
0.159373




If needed, the standard `model_performance()` method can be applied to the AutoML leader model and a test set to generate an H2O model performance object.

In [None]:
perf = aml_v1.leader.model_performance(test_hex)
perf

In [52]:
def createGains(model):
    predictions = model.predict(test_hex)
    test_scores = test_hex['loan_default'].cbind(predictions).as_data_frame()

    #sort on prediction (descending), add id, and decile for groups containing 1/10 of datapoints
    test_scores = test_scores.sort_values(by='predict',ascending=False)
    test_scores['row_id'] = range(0,0+len(test_scores))
    test_scores['decile'] = ( test_scores['row_id'] / (len(test_scores)/10) ).astype(int)
    #see count by decile
    test_scores.loc[test_scores['decile'] == 10]=9
    test_scores['decile'].value_counts()

    #create gains table
    gains = test_scores.groupby('decile')['loan_default'].agg(['count','sum'])
    gains.columns = ['count','actual']
    gains

    #add features to gains table
    gains['non_actual'] = gains['count'] - gains['actual']
    gains['cum_count'] = gains['count'].cumsum()
    gains['cum_actual'] = gains['actual'].cumsum()
    gains['cum_non_actual'] = gains['non_actual'].cumsum()
    gains['percent_cum_actual'] = (gains['cum_actual'] / np.max(gains['cum_actual'])).round(2)
    gains['percent_cum_non_actual'] = (gains['cum_non_actual'] / np.max(gains['cum_non_actual'])).round(2)
    gains['if_random'] = np.max(gains['cum_actual']) /10 
    gains['if_random'] = gains['if_random'].cumsum()
    gains['lift'] = (gains['cum_actual'] / gains['if_random']).round(2)
    gains['K_S'] = np.abs( gains['percent_cum_actual'] -  gains['percent_cum_non_actual'] ) * 100
    gains['gain']=(gains['cum_actual']/gains['cum_count']*100).round(2)
    gains = pd.DataFrame(gains)
    return(gains)

createGains(aml_v1)

glm prediction progress: |████████████████████████████████████████████████| 100%


Unnamed: 0_level_0,count,actual,non_actual,cum_count,cum_actual,cum_non_actual,percent_cum_actual,percent_cum_non_actual,if_random,lift,K_S,gain
decile,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
0,320,105,215,320,105,215,0.18,0.08,60.0,1.75,10.0,32.81
1,320,89,231,640,194,446,0.32,0.17,120.0,1.62,15.0,30.31
2,320,83,237,960,277,683,0.46,0.26,180.0,1.54,20.0,28.85
3,320,73,247,1280,350,930,0.58,0.36,240.0,1.46,22.0,27.34
4,320,64,256,1600,414,1186,0.69,0.46,300.0,1.38,23.0,25.87
5,320,51,269,1920,465,1455,0.78,0.56,360.0,1.29,22.0,24.22
6,320,49,271,2240,514,1726,0.86,0.66,420.0,1.22,20.0,22.95
7,320,39,281,2560,553,2007,0.92,0.77,480.0,1.15,15.0,21.6
8,320,27,293,2880,580,2300,0.97,0.88,540.0,1.07,9.0,20.14
9,320,20,300,3200,600,2600,1.0,1.0,600.0,1.0,0.0,18.75


### Now the code works fine with the small dataset, we can model with the entire dataset