# Dead-Simple Boost Ensemble

This notebook was prepared for the Kaggle October 2021 tabular-data competition, with the objective of creating a high-performing model with minimal manual effort. Therefore, we're going with an ensemble of boosting models. Because XGBoost and LightGBM construct their trees differently, we'll want to use both of those models in our ensemble, possibly adding a tabular-MLP model if time permits. The competition data is pretty simple: it's a binary classification problem with a mixture of normalized float & boolean features. The data is balanced & scored via ROC AUC.

Behind the scenes, this script does some pretty cool stuff (at least I think so):

1) Converts the source data to an HDF5 table. The HDF5 data is grouped like '/{train|test}{0-N}{data|target}'. The {0-N} level seems a little wonky, but it allows the data to be further separated into minibatches, which makes multiprocessing super easy. Combining this with PyTorch's DataLoader cuts the time to load our data into memory from 60s (using the original CSV file) to 3s. The {train|test} split is automatically stratified (even though it's not necessary for this dataset).

2) The ModelBuilder has a hook for our MLflow server. This makes keeping track of experiments much easier. After running this script for a while, we can call 'get_best_models()', which returns the N best-performing models. This is especially useful later on when we want to build out our ensemble.

3) ModelBuilder also incorporates hyperopt. You'll still need to define each model's valid hyperparameter search space, but we now get to use TPE (Tree-structured Parzen Estimator) to pick the next set of search params, a massive improvement over grid search or random search.

4) The models (XGBModel or LGBModel) are trained on the GPU. We use k-fold cross-validation on our training data (split into train/val) to compute the run's average AUC metric. Once that's finished, the model is retrained on the entire training data & uploaded to the MLflow artifact store.

5) Finally, we can query the MLflow server to get the top N XGBModels and top N LGBModels, assign weights to them, and plug them into our ensembler. (Not coded out yet, but the logic is pretty simple.)

6) Done.
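
Since the ensembler in step 5 isn't coded out yet, here is a minimal sketch of what the weighting logic could look like: a weighted average of each model's predicted probabilities. The function name `blend_predictions`, the example probabilities, and the weights are all made up for illustration; nothing here comes from ModelBuilder itself.

```python
def blend_predictions(model_probs, weights):
    """Return the weighted average of several models' probability vectors.

    model_probs: list of equal-length lists, one list of P(target=1) per model.
    weights: one non-negative weight per model (normalized internally).
    """
    if len(model_probs) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_rows = len(model_probs[0])
    if any(len(p) != n_rows for p in model_probs):
        raise ValueError("all models must score the same number of rows")
    return [
        sum(w * probs[i] for w, probs in zip(weights, model_probs)) / total
        for i in range(n_rows)
    ]


# Hypothetical predictions from one XGBModel and one LGBModel on three rows.
xgb_probs = [0.9, 0.2, 0.6]
lgb_probs = [0.7, 0.4, 0.5]

blended = blend_predictions([xgb_probs, lgb_probs], weights=[0.6, 0.4])
print([round(p, 2) for p in blended])
```

One reasonable starting point for the weights would be each model's cross-validated AUC, but that's a design choice to tune, not something this notebook prescribes.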

In [None]:
from hdf_utils import HDFDatasetBuilder, HDFLoader
from modelbuilder import ModelBuilder
from hyperopt import hp

### Transform source data into a more computationally friendly data structure

In [None]:
HDFDatasetBuilder(
    source_filepath='data/train.csv',
    output_filepath='data.hdf5',
    desired_batchsize=50000,
    expected_workers=10,
    test_sample_size=0.15,
    shuffle_samples=True,
    stratify_data=True,
    features_to_exclude=['id'],
    target_column_name='target',
)
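
To make the batching idea a bit more concrete, here's a rough pure-Python sketch of how rows could be assigned to a stratified train/test split and then cut into numbered batches. It is not how HDFDatasetBuilder is implemented: `plan_hdf_layout` is a hypothetical helper, and the '/{split}/{batch}' path format (including the slashes) is an assumption. In a real HDF5 file, each of these groups would hold a `data` and a `target` dataset.

```python
import random


def plan_hdf_layout(labels, test_size=0.15, batch_size=50_000, seed=0):
    """Map hypothetical HDF5 group paths to the row indices they would hold."""
    rng = random.Random(seed)
    split_rows = {'train': [], 'test': []}

    # Stratify: draw the test fraction separately within each class,
    # so both splits keep the original class ratio.
    for label in sorted(set(labels)):
        rows = [i for i, value in enumerate(labels) if value == label]
        rng.shuffle(rows)
        n_test = round(len(rows) * test_size)
        split_rows['test'].extend(rows[:n_test])
        split_rows['train'].extend(rows[n_test:])

    layout = {}
    for split, rows in split_rows.items():
        rng.shuffle(rows)  # mix the classes before cutting batches
        for batch, start in enumerate(range(0, len(rows), batch_size)):
            layout[f'/{split}/{batch}'] = rows[start:start + batch_size]
    return layout


labels = [0, 1] * 100  # 200 balanced toy rows
layout = plan_hdf_layout(labels, test_size=0.15, batch_size=60)
for path, rows in layout.items():
    print(path, len(rows))
```

Because every batch lives under its own path, each worker can read one group independently, which is what makes the parallel loading straightforward.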

### Define hyperparameter search spaces for LightGBM and XGBoost models

In [None]:
# hp.uniform samples a float from [low, high]; hp.choice picks one of the listed options.
lgb_hyperopt_space = {
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 0.3),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 0.2),
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.25),
    'num_leaves': hp.choice('num_leaves', range(20, 50, 1)),
    'max_depth': hp.choice('max_depth', range(2, 12, 1)),
    'n_estimators': hp.choice('n_estimators', range(1000, 10000, 1)),
    'subsample': hp.uniform('subsample', 0.5, 1.0),
}

xgb_hyperopt_space = {
    'eta': hp.uniform('eta', 0.01, 0.15),
    'max_depth': hp.choice('max_depth', range(2, 8, 1)),
    'subsample': hp.uniform('subsample', 0.5, 0.75),
    # range(100, 101) contains only 100, so n_estimators is effectively fixed here.
    'n_estimators': hp.choice('n_estimators', range(100, 101, 1)),
}

### Run hyperparameter search to find optimal model configurations

In [None]:
model_builder = ModelBuilder()
# Alternate LightGBM and XGBoost runs indefinitely; interrupt the kernel to stop.
while True:
    model_builder.run(
        experiment_name='taboct-lgbm',
        model_type='LGBModel',
        hyperopt_space=lgb_hyperopt_space,
    )

    model_builder.run(
        experiment_name='taboct-xgbm',
        model_type='XGBModel',
        hyperopt_space=xgb_hyperopt_space,
    )
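
Each run is scored by its average AUC across the cross-validation folds. As a rough illustration of that metric (not ModelBuilder's actual code), here's a dependency-free sketch. The helpers `roc_auc`, `stratified_kfold`, and `cross_val_auc` are hypothetical names, and `fit_predict` stands in for training a model on the train fold and scoring the validation fold.

```python
import random
from statistics import mean


def roc_auc(y_true, y_score):
    """AUC as the chance a random positive outscores a random negative (ties count 0.5).

    Pairwise version for clarity; it is quadratic, so only suitable for small inputs.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_kfold(y, k, seed=0):
    """Yield (train_idx, val_idx) pairs that preserve the class ratio per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, value in enumerate(y) if value == label]
        rng.shuffle(idx)
        for position, row in enumerate(idx):
            folds[position % k].append(row)  # deal rows out round-robin
    for f in range(k):
        val_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, val_idx


def cross_val_auc(X, y, fit_predict, k=5):
    """Average validation AUC over k stratified folds."""
    scores = []
    for train_idx, val_idx in stratified_kfold(y, k):
        preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in val_idx],
        )
        scores.append(roc_auc([y[i] for i in val_idx], preds))
    return mean(scores)


# Toy data: one feature that separates the classes perfectly.
X = [[i / 20] for i in range(20)]
y = [0] * 10 + [1] * 10


def feature_as_score(X_train, y_train, X_val):
    # Placeholder "model": ignores training data and returns the raw feature.
    return [row[0] for row in X_val]


print(cross_val_auc(X, y, feature_as_score, k=5))
```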