In this notebook we train an ensemble of LightGBM models on the same data as the "LightGBM Basic Example" notebook, then compute the approximate propensity score (APS) of the ensemble with the `estimate_aps_user_defined` function, using its parallelization option.

## Training

We will train 6 LightGBM models with categorical features, following this Kaggle notebook: https://www.kaggle.com/ezietsman/simple-python-lightgbm-example?select=train.csv
The data comes from Porto Seguro's Safe Driver Prediction competition: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction

In [3]:
import numpy as np
import pandas as pd
import lightgbm
from sklearn.model_selection import train_test_split

This means that in case of installing LightGBM from PyPI via the ``pip install lightgbm`` command, you don't need to install the gcc compiler anymore.
Instead of that, you need to install the OpenMP library, which is required for running LightGBM on the system with the Apple Clang compiler.
You can install the OpenMP library by the following command: ``brew install libomp``.


In [4]:
def train_model():
    """Train one LightGBM binary classifier on a fresh random train/validation split."""
    # Prepare training data
    train = pd.read_csv('data/lgbm_cat_train.csv')

    # get the labels
    y = train.target.values
    train.drop(['id', 'target'], inplace=True, axis=1)

    x = train.values

    # Create training and validation sets
    x, x_test, y, y_test = train_test_split(x, y, test_size=0.2, stratify=y)

    # Create the LightGBM data containers
    categorical_features = [c for c, col in enumerate(train.columns) if 'cat' in col]
    train_data = lightgbm.Dataset(x, label=y, categorical_feature=categorical_features)
    test_data = lightgbm.Dataset(x_test, label=y_test)

    # Train the model

    parameters = {
        'objective': 'binary',  # 'application' is an alias of 'objective', so one entry suffices
        'metric': 'auc',
        'is_unbalance': 'true',
        'boosting': 'gbdt',
        'num_leaves': 31,
        'feature_fraction': 0.5,
        'bagging_fraction': 0.5,
        'bagging_freq': 20,
        'learning_rate': 0.05,
        'verbose': 0
    }

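    # Note: `early_stopping_rounds` is accepted here by LightGBM 3.x and earlier.
    # LightGBM 4.0 removed it from `train`; there, pass
    # callbacks=[lightgbm.early_stopping(stopping_rounds=100)] instead.
    model = lightgbm.train(parameters,
                           train_data,
                           valid_sets=test_data,
                           num_boost_round=5000,
                           early_stopping_rounds=100)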
    return model
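
In `train_model`, the categorical columns are identified by position, not by name, because the features are passed to LightGBM as a plain NumPy array (`train.values`). The positions are computed after `id` and `target` have been dropped, so they line up with the columns of that array. The sketch below replays the same selection rule on a short list of column names; the names are illustrative stand-ins in the style of the Porto Seguro data, not read from the actual file.

```python
# Illustrative column names (an assumption; the real file has many more columns).
columns = [
    "ps_ind_01",
    "ps_ind_02_cat",
    "ps_ind_03",
    "ps_ind_04_cat",
    "ps_ind_05_cat",
    "ps_ind_06_bin",
]

# Same rule as in train_model: keep the position of every column
# whose name contains the substring 'cat'.
categorical_features = [i for i, col in enumerate(columns) if "cat" in col]

print(categorical_features)  # [1, 3, 4]
```

Since the rule is a plain substring match, any column whose name happens to contain `cat` would be treated as categorical, which is worth checking when reusing it on other data.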

In [6]:
ensemble = []
for _ in range(6):
    model = train_model()
    ensemble.append(model)

New categorical_feature is [1, 3, 4, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))


[1]	valid_0's auc: 0.58609
Training until validation scores don't improve for 100 rounds
[2]	valid_0's auc: 0.599257
[3]	valid_0's auc: 0.601341
[4]	valid_0's auc: 0.606353
[5]	valid_0's auc: 0.60704
[6]	valid_0's auc: 0.608194
[7]	valid_0's auc: 0.609453
[8]	valid_0's auc: 0.610432
[9]	valid_0's auc: 0.611283
[10]	valid_0's auc: 0.612212
[11]	valid_0's auc: 0.612913
[12]	valid_0's auc: 0.61372
[13]	valid_0's auc: 0.614502
[14]	valid_0's auc: 0.615222
[15]	valid_0's auc: 0.615022
[16]	valid_0's auc: 0.615609
[17]	valid_0's auc: 0.615754
[18]	valid_0's auc: 0.616106
[19]	valid_0's auc: 0.616498
[20]	valid_0's auc: 0.616598
[21]	valid_0's auc: 0.617099
[22]	valid_0's auc: 0.617738
[23]	valid_0's auc: 0.618009
[24]	valid_0's auc: 0.618406
[25]	valid_0's auc: 0.61857
[26]	valid_0's auc: 0.618785
[27]	valid_0's auc: 0.618896
[28]	valid_0's auc: 0.618865
[29]	valid_0's auc: 0.619373
[30]	valid_0's auc: 0.619726
[31]	valid_0's auc: 0.620125
[32]	valid_0's auc: 0.620288
[33]	valid_0's auc: 0.6

New categorical_feature is [1, 3, 4, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))


[1]	valid_0's auc: 0.589782
Training until validation scores don't improve for 100 rounds
[2]	valid_0's auc: 0.606818
[3]	valid_0's auc: 0.610424
[4]	valid_0's auc: 0.616272
[5]	valid_0's auc: 0.617168
[6]	valid_0's auc: 0.617871
[7]	valid_0's auc: 0.619349
[8]	valid_0's auc: 0.61903
[9]	valid_0's auc: 0.620005
[10]	valid_0's auc: 0.620951
[11]	valid_0's auc: 0.620478
[12]	valid_0's auc: 0.621204
[13]	valid_0's auc: 0.620627
[14]	valid_0's auc: 0.621075
[15]	valid_0's auc: 0.620771
[16]	valid_0's auc: 0.621928
[17]	valid_0's auc: 0.622155
[18]	valid_0's auc: 0.622872
[19]	valid_0's auc: 0.622993
[20]	valid_0's auc: 0.622949
[21]	valid_0's auc: 0.623328
[22]	valid_0's auc: 0.623908
[23]	valid_0's auc: 0.624182
[24]	valid_0's auc: 0.62436
[25]	valid_0's auc: 0.624211
[26]	valid_0's auc: 0.624055
[27]	valid_0's auc: 0.623759
[28]	valid_0's auc: 0.623849
[29]	valid_0's auc: 0.624258
[30]	valid_0's auc: 0.624526
[31]	valid_0's auc: 0.625086
[32]	valid_0's auc: 0.62487
[33]	valid_0's auc: 0.

New categorical_feature is [1, 3, 4, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))


[1]	valid_0's auc: 0.585369
Training until validation scores don't improve for 100 rounds
[2]	valid_0's auc: 0.604273
[3]	valid_0's auc: 0.606683
[4]	valid_0's auc: 0.613009
[5]	valid_0's auc: 0.615081
[6]	valid_0's auc: 0.615631
[7]	valid_0's auc: 0.616144
[8]	valid_0's auc: 0.615853
[9]	valid_0's auc: 0.616981
[10]	valid_0's auc: 0.618419
[11]	valid_0's auc: 0.618536
[12]	valid_0's auc: 0.619144
[13]	valid_0's auc: 0.618855
[14]	valid_0's auc: 0.619276
[15]	valid_0's auc: 0.618767
[16]	valid_0's auc: 0.619266
[17]	valid_0's auc: 0.619336
[18]	valid_0's auc: 0.62003
[19]	valid_0's auc: 0.620305
[20]	valid_0's auc: 0.620104
[21]	valid_0's auc: 0.620421
[22]	valid_0's auc: 0.621339
[23]	valid_0's auc: 0.621854
[24]	valid_0's auc: 0.622638
[25]	valid_0's auc: 0.622819
[26]	valid_0's auc: 0.622597
[27]	valid_0's auc: 0.622988
[28]	valid_0's auc: 0.623201
[29]	valid_0's auc: 0.623654
[30]	valid_0's auc: 0.624236
[31]	valid_0's auc: 0.624588
[32]	valid_0's auc: 0.624634
[33]	valid_0's auc: 

New categorical_feature is [1, 3, 4, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))


[1]	valid_0's auc: 0.59226
Training until validation scores don't improve for 100 rounds
[2]	valid_0's auc: 0.606675
[3]	valid_0's auc: 0.610052
[4]	valid_0's auc: 0.615578
[5]	valid_0's auc: 0.617222
[6]	valid_0's auc: 0.61801
[7]	valid_0's auc: 0.620161
[8]	valid_0's auc: 0.620652
[9]	valid_0's auc: 0.621827
[10]	valid_0's auc: 0.623312
[11]	valid_0's auc: 0.623555
[12]	valid_0's auc: 0.624566
[13]	valid_0's auc: 0.624778
[14]	valid_0's auc: 0.625269
[15]	valid_0's auc: 0.625137
[16]	valid_0's auc: 0.625752
[17]	valid_0's auc: 0.625865
[18]	valid_0's auc: 0.626749
[19]	valid_0's auc: 0.626911
[20]	valid_0's auc: 0.626592
[21]	valid_0's auc: 0.626649
[22]	valid_0's auc: 0.62701
[23]	valid_0's auc: 0.627225
[24]	valid_0's auc: 0.627253
[25]	valid_0's auc: 0.627026
[26]	valid_0's auc: 0.626896
[27]	valid_0's auc: 0.626941
[28]	valid_0's auc: 0.626866
[29]	valid_0's auc: 0.627023
[30]	valid_0's auc: 0.627031
[31]	valid_0's auc: 0.627143
[32]	valid_0's auc: 0.626955
[33]	valid_0's auc: 0.

New categorical_feature is [1, 3, 4, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))


[1]	valid_0's auc: 0.593882
Training until validation scores don't improve for 100 rounds
[2]	valid_0's auc: 0.608751
[3]	valid_0's auc: 0.609948
[4]	valid_0's auc: 0.616541
[5]	valid_0's auc: 0.617983
[6]	valid_0's auc: 0.619698
[7]	valid_0's auc: 0.621025
[8]	valid_0's auc: 0.621669
[9]	valid_0's auc: 0.622879
[10]	valid_0's auc: 0.623634
[11]	valid_0's auc: 0.62374
[12]	valid_0's auc: 0.624887
[13]	valid_0's auc: 0.624973
[14]	valid_0's auc: 0.625709
[15]	valid_0's auc: 0.625415
[16]	valid_0's auc: 0.626045
[17]	valid_0's auc: 0.6264
[18]	valid_0's auc: 0.626615
[19]	valid_0's auc: 0.626854
[20]	valid_0's auc: 0.626896
[21]	valid_0's auc: 0.626943
[22]	valid_0's auc: 0.627462
[23]	valid_0's auc: 0.627964
[24]	valid_0's auc: 0.628073
[25]	valid_0's auc: 0.628241
[26]	valid_0's auc: 0.628624
[27]	valid_0's auc: 0.62874
[28]	valid_0's auc: 0.628526
[29]	valid_0's auc: 0.62868
[30]	valid_0's auc: 0.628878
[31]	valid_0's auc: 0.629309
[32]	valid_0's auc: 0.629106
[33]	valid_0's auc: 0.62

New categorical_feature is [1, 3, 4, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))


[1]	valid_0's auc: 0.586076
Training until validation scores don't improve for 100 rounds
[2]	valid_0's auc: 0.597059
[3]	valid_0's auc: 0.602155
[4]	valid_0's auc: 0.609465
[5]	valid_0's auc: 0.611814
[6]	valid_0's auc: 0.612369
[7]	valid_0's auc: 0.613606
[8]	valid_0's auc: 0.613733
[9]	valid_0's auc: 0.615001
[10]	valid_0's auc: 0.616491
[11]	valid_0's auc: 0.616331
[12]	valid_0's auc: 0.617034
[13]	valid_0's auc: 0.61713
[14]	valid_0's auc: 0.617839
[15]	valid_0's auc: 0.617395
[16]	valid_0's auc: 0.61794
[17]	valid_0's auc: 0.617849
[18]	valid_0's auc: 0.618362
[19]	valid_0's auc: 0.618582
[20]	valid_0's auc: 0.618487
[21]	valid_0's auc: 0.618761
[22]	valid_0's auc: 0.619973
[23]	valid_0's auc: 0.620004
[24]	valid_0's auc: 0.620296
[25]	valid_0's auc: 0.62032
[26]	valid_0's auc: 0.620483
[27]	valid_0's auc: 0.620371
[28]	valid_0's auc: 0.620695
[29]	valid_0's auc: 0.621119
[30]	valid_0's auc: 0.621257
[31]	valid_0's auc: 0.621873
[32]	valid_0's auc: 0.621825
[33]	valid_0's auc: 0.
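
The six runs above report different validation AUCs even though the code and hyperparameters are identical. A likely reason is that `train_test_split` is called without a `random_state`, so every call to `train_model` draws a different train/validation split. If repeatable ensemble members are wanted, one option is to give each member its own seed. The sketch below illustrates the idea with NumPy only; `split_indices` is a hypothetical helper, and unlike the notebook's split it does not stratify by label. In the notebook itself, the equivalent change would be to pass a per-member `random_state` to `train_test_split`.

```python
import numpy as np


def split_indices(n_rows, test_size, seed=None):
    """Shuffle row indices and split them into train and validation parts.

    Hypothetical helper for illustration; no stratification is applied.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_size))
    return order[n_test:], order[:n_test]


# One seed per ensemble member: the splits differ across members,
# but rerunning the cell reproduces exactly the same six splits.
splits = [split_indices(1000, 0.2, seed=member) for member in range(6)]

for member, (train_idx, valid_idx) in enumerate(splits):
    print(member, len(train_idx), len(valid_idx), valid_idx[:5])
```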

## Define Ensemble and Compute APS

In [9]:
X = pd.read_csv('data/lgbm_cat_test.csv')
X.drop(['id'], inplace=True, axis=1)
array = np.array(X)
ensemble[0].predict(array)

array([0.42158631, 0.44365395, 0.3972221 , ..., 0.53028419, 0.40072007,
       0.42507558])

In [10]:
def lgbm_ensemble(X):
    preds = np.column_stack([model.predict(X) for model in ensemble])
    ensemble_pred = np.mean(preds, axis=1)
    return ensemble_pred
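
To see what `lgbm_ensemble` returns without training anything, here is a minimal sketch that applies the same stack-and-average step to stand-in models. `ConstantModel` and `average_predictions` are hypothetical names used only for this illustration; the stand-ins mimic just the `predict(X)` call of a trained booster.

```python
import numpy as np


class ConstantModel:
    """Stand-in for a trained booster: predicts the same score for every row."""

    def __init__(self, score):
        self.score = score

    def predict(self, X):
        return np.full(len(X), self.score)


def average_predictions(models, X):
    # One column per model, one row per observation.
    preds = np.column_stack([m.predict(X) for m in models])
    # Averaging across columns gives one ensemble score per observation.
    return preds.mean(axis=1)


toy_ensemble = [ConstantModel(s) for s in (0.2, 0.4, 0.6)]
X_toy = np.zeros((4, 2))

ensemble_pred = average_predictions(toy_ensemble, X_toy)
print(ensemble_pred.shape)  # (4,)
```

The result is a one-dimensional array with one averaged score per input row, which is the shape a single model's `predict` also returns.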

In [None]:
from IVaps import estimate_aps_user_defined
# Compute APS using the ensemble function, with parallelization enabled
aps = estimate_aps_user_defined(lgbm_ensemble, data=X, parallel=True)

`data` given but no indices passed. We will assume that all the variables in `data` are continuous.
Running APS estimation with default (# processors) workers...
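
For intuition about what is being estimated, the sketch below computes a simulation-based APS by hand: for each row, it draws points uniformly from a small ball around that row, evaluates the prediction function on the draws, and averages the results. This is a conceptual illustration only, not the IVaps implementation; `sample_ball`, `aps_sketch`, `toy_predict`, and the parameter names `delta` and `n_draws` are all assumptions made for this example. Every column is perturbed, mirroring the message above that all variables are treated as continuous.

```python
import numpy as np


def sample_ball(center, delta, n_draws, rng):
    """Draw points uniformly from the ball of radius delta around center."""
    d = center.shape[0]
    # Random directions on the unit sphere.
    directions = rng.normal(size=(n_draws, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Radii scaled so that the draws are uniform in volume.
    radii = delta * rng.uniform(size=(n_draws, 1)) ** (1.0 / d)
    return center + radii * directions


def aps_sketch(predict, X, delta=0.1, n_draws=200, seed=0):
    """Average predict() over random draws around each row of X."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    return np.array(
        [predict(sample_ball(row, delta, n_draws, rng)).mean() for row in X]
    )


def toy_predict(X):
    # Stand-in for lgbm_ensemble: a smooth score in (0, 1).
    return 1.0 / (1.0 + np.exp(-X.sum(axis=1)))


X_toy = np.array([[0.0, 0.0], [1.0, -0.5], [2.0, 1.0]])
aps_toy = aps_sketch(toy_predict, X_toy)
print(aps_toy)
```

Because each row's average is computed independently of the others, the work splits naturally across rows, which is what makes this kind of computation a good candidate for parallel execution.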
