# Hyperparameter tuning with XGBoost, Ray Tune, Hyperopt and Optuna

## Introduction


In this post we are going to demonstrate how we can speed up hyperparameter tuning with:

1) Bayesian optimization tuning algos like HyperOpt and Optuna, running on…

2) the [Ray](https://ray.io/) distributed ML framework, with a [unified API to many hyperparameter search algos](https://medium.com/riselab/cutting-edge-hyperparameter-tuning-with-ray-tune-be6c0447afdf) and…

3) a distributed cluster of cloud instances for even more speedup.

### Outline:
- Overview of hyperparameter tuning
- Baseline linear regression with no hyperparameters
- ElasticNet with L1 and L2 regularization using ElasticNetCV hyperparameter optimization
- ElasticNet with GridSearchCV hyperparameter optimization
- XGBoost: sequential grid search over hyperparameter subsets with early stopping 
- XGBoost: with HyperOpt and Optuna search algorithms
- LightGBM: with HyperOpt and Optuna search algorithms
- XGBoost: HyperOpt on a Ray cluster
- LightGBM: HyperOpt on a Ray cluster
- Conclusions

But first, here are results on the Ames housing data set, predicting Iowa home prices:

| ML Algo           | Hyperparameter search algo   | CV Error (RMSE in $)  | Time     |
|-------------------|------------------------------|-----------------------|----------|
| XGB               | Sequential Grid Search       | $18783                |   36:09  |
| XGB               | HyperOpt (128 samples)       | $18770                |   14:08  |
| LightGBM          | HyperOpt (128 samples)       | $18770                |   14:08  |
| XGB               | Optuna                       | $18618                |          |
| LightGBM          | Optuna                       | $18618                |          |
| XGB               | Optuna - 16-instance cluster | $18618                |          |
| LightGBM          | Optuna - 16-instance cluster | $18618                |          |
| Linear Regression | --                           | $18192                |   0:01s  |
| ElasticNet        | ElasticNetCV (Grid Search)   | $18122                |   0:02s  |
| ElasticNet        | GridSearchCV                 | $18061                |   0:05s  |

We see both speedup and RMSE improvement when using HyperOpt and Optuna, and when using the cluster. But our feature engineering was quite good and our simple linear model still outperforms boosting. (Not shown: SVR and kernel ridge (KR) are high-performing, and an ensemble improves over all individual algos.)


## Hyperparameter Tuning Overview

Here are [the principal approaches to hyperparameter tuning](https://en.wikipedia.org/wiki/Hyperparameter_optimization):

- Grid search: given a finite set of discrete values for each hyperparameter, exhaustively cross-validate all combinations

- Random search: given a discrete or continuous distribution for each hyperparameter, randomly sample from the joint distribution. Generally [more efficient than exhaustive grid search](https://dl.acm.org/doi/10.5555/2188385.2188395).

- Bayesian optimization: update the search space as you go based on outcomes of prior searches.

- Gradient-based optimization: attempt to estimate the gradient of the CV metric with respect to the hyperparameter and ascend/descend the gradient.

- Evolutionary optimization: sample the search space, discard combinations with poor metrics, and genetically evolve new combinations to try based on the successful combinations.

- Population-based: A method of performing hyperparameter optimization at the same time as training.
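
To make the first two approaches concrete, here is a minimal sketch that compares grid search and random search on a toy objective. The function `toy_cv_error` and its two parameters are hypothetical stand-ins for a real cross-validated metric, not anything from the notebook.

```python
import itertools
import random

def toy_cv_error(alpha, l1_ratio):
    """Hypothetical stand-in for a cross-validated error; minimum at (0.3, 0.7)."""
    return (alpha - 0.3) ** 2 + (l1_ratio - 0.7) ** 2

def grid_search(alphas, l1_ratios):
    """Exhaustively evaluate every combination; return (best_error, best_params)."""
    return min(
        (toy_cv_error(a, r), (a, r))
        for a, r in itertools.product(alphas, l1_ratios)
    )

def random_search(n_trials, seed=42):
    """Sample each hyperparameter independently and uniformly from [0, 1)."""
    rng = random.Random(seed)
    samples = ((rng.random(), rng.random()) for _ in range(n_trials))
    return min((toy_cv_error(a, r), (a, r)) for a, r in samples)

grid_values = [0.0, 0.25, 0.5, 0.75, 1.0]
grid_error, grid_params = grid_search(grid_values, grid_values)  # 25 evaluations
rand_error, rand_params = random_search(n_trials=25)             # same budget

print("grid   best params", grid_params, "error %.5f" % grid_error)
print("random best params", rand_params, "error %.5f" % rand_error)
```

Both searches spend the same 25 evaluations. The grid can only ever land on its predefined points, while random search tries 25 distinct values of each hyperparameter.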

In this post we focus on Bayesian optimization with HyperOpt and Optuna. What is Bayesian optimization? When we perform a grid search, the search space can be considered a prior belief that the best hyperparameter vector is in the search space, and the combinations have equal probability of being the best combination. So we try them all and pick the best one.

Perhaps we might do two passes of grid search. After an initial search on a broad, coarsely spaced grid, we might do a deeper dive in a smaller area around the best metric from the first pass, with a more finely-spaced grid. In Bayesian terminology we updated our prior belief.

Bayesian optimization first samples randomly, e.g. 30 combinations, and computes the cross-validation metric for each combination. Then the algorithm updates the distribution it samples from, so it is more likely to sample combinations near the good metrics, and less likely to sample combinations near the poor metrics. As it continues to sample, it continues to update the search distribution based on the metrics it finds.
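
The explore-then-concentrate behavior can be illustrated with a deliberately crude sketch. This is not the algorithm HyperOpt or Optuna implement; it is a toy adaptive sampler that warms up with random samples and then proposes points near the best trials seen so far. All names and the objective are hypothetical.

```python
import random

def toy_cv_error(x, y):
    """Hypothetical stand-in for a cross-validation metric; minimum at (0.3, 0.7)."""
    return (x - 0.3) ** 2 + (y - 0.7) ** 2

def clip(value, low=0.0, high=1.0):
    """Keep a proposed value inside the search bounds."""
    return min(high, max(low, value))

def adaptive_search(n_init=30, n_total=100, top_frac=0.2, sigma=0.05, seed=42):
    """Random warm-up, then sample near the best trials seen so far."""
    rng = random.Random(seed)
    trials = []  # list of (error, (x, y))

    # Phase 1: uniform random sampling over the unit square.
    for _ in range(n_init):
        point = (rng.random(), rng.random())
        trials.append((toy_cv_error(*point), point))
    best_after_init = min(trials)[0]

    # Phase 2: pick one of the best trials so far and perturb it.
    for _ in range(n_total - n_init):
        trials.sort()
        n_good = max(1, int(len(trials) * top_frac))
        _, (good_x, good_y) = rng.choice(trials[:n_good])
        point = (clip(rng.gauss(good_x, sigma)), clip(rng.gauss(good_y, sigma)))
        trials.append((toy_cv_error(*point), point))

    best_final = min(trials)[0]
    return best_after_init, best_final, trials

best_after_init, best_final, trials = adaptive_search()
print("best error after random warm-up: %.5f" % best_after_init)
print("best error after adaptive phase: %.5f" % best_final)
```

Real Bayesian optimizers replace the "perturb a good trial" step with a probabilistic model of the metric, but the overall loop has the same shape.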

Early stopping may also be highly beneficial: often we can discard a combination without fully training it. In this post we use [ASHA](https://arxiv.org/abs/1810.05934).
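
ASHA is an asynchronous variant of successive halving. As a rough sketch of the underlying idea only, here is plain synchronous successive halving on a toy problem: every candidate gets a small budget, the worst half are discarded, and the survivors get double the budget. The loss function, candidate list and helper names are hypothetical.

```python
import math

def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the configs at each rung, multiplying the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    history = []  # (budget, number of configs evaluated at that budget)
    while len(survivors) > 1:
        history.append((budget, len(survivors)))
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0], history

def toy_loss(learning_rate, budget):
    """Hypothetical loss: best near learning_rate=0.1, improves with more budget."""
    return abs(math.log10(learning_rate) - math.log10(0.1)) + 1.0 / budget

candidates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 0.6, 1.0]
best, history = successive_halving(candidates, toy_loss)
print("best learning rate:", best)
print("rungs (budget, configs):", history)
```

A scheduler can only stop a trial between reported results, so a trainable needs to report intermediate metrics for early stopping to have something to act on.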

We use 4 regression algorithms:
- LinearRegression: baseline with no hyperparameters
- ElasticNet: Linear regression with L1 and L2 regularization (2 hyperparameters).
- XGBoost
- LightGBM

We use 5 approaches:
- *Native CV*: In sklearn if an algo has hyperparameters it will often have an xxxCV version which performs automated hyperparameter tuning over a search space with specified kfolds.
- *GridSearchCV*: Abstracts CV for any sklearn algo, running multithreaded trials over specified folds. 
- *Manual sequential grid search*: What we typically do with XGBoost, which doesn't play well with GridSearchCV and has too many hyperparameters to tune in one pass.
- *Ray on local machine*: HyperOpt and Optuna with early stopping.
- *Ray on cluster*: Additionally scale out to run a single hyperparameter optimization task over many instances.

We use data from the [Ames Housing Dataset](https://www.kaggle.com/c/house-prices-advanced-regression-techniques). The original data has 79 raw features. The data we will use has 100 features with a fair amount of feature engineering from [my own attempt at modeling](https://github.com/druce/iowa), which was in the top 5% or so when I submitted it to Kaggle.

### Further reading: 
 - [Hyper-Parameter Optimization: A Review of Algorithms and Applications](https://arxiv.org/abs/2003.05689), Tong Yu, Hong Zhu (2020)
 - [Hyperparameter Search in Machine Learning](https://arxiv.org/abs/1502.02127v2), Marc Claesen, Bart De Moor (2015)
 - [Hyperparameter Optimization](https://link.springer.com/chapter/10.1007/978-3-030-05318-5_1), Matthias Feurer, Frank Hutter (2019) 

In [1]:
from itertools import product
from datetime import datetime, timedelta
import os
import random
import string

import numpy as np
import pandas as pd

import sklearn
from sklearn.linear_model import LinearRegression, ElasticNet, ElasticNetCV, Ridge, RidgeCV
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV, KFold
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline

#!conda install -y -c conda-forge  xgboost 
import xgboost
from xgboost import XGBRegressor
from xgboost import plot_importance

import lightgbm
from lightgbm import LGBMRegressor

import ray
from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest.bayesopt import BayesOptSearch
from ray.tune.suggest.hyperopt import HyperOptSearch
from ray.tune.suggest.optuna import OptunaSearch
from ray.tune.logger import DEFAULT_LOGGERS
from ray.tune.integration.wandb import WandbLogger

# import wandb
# os.environ['WANDB_NOTEBOOK_NAME']='hyperparameter_optimization.ipynb'

print(datetime.now())

print ("%-20s %s"% ("numpy", np.__version__))
print ("%-20s %s"% ("pandas", pd.__version__))
print ("%-20s %s"% ("sklearn", sklearn.__version__))
print ("%-20s %s"% ("xgboost", xgboost.__version__))
print ("%-20s %s"% ("lightgbm", lightgbm.__version__))
print ("%-20s %s"% ("ray", ray.__version__))


2020-10-20 02:59:56.746577
numpy                1.19.1
pandas               1.1.3
sklearn              0.23.2
xgboost              1.2.0
lightgbm             2.3.0
ray                  1.0.0


In [2]:
# set seed for reproducibility
RANDOMSTATE = 42
np.random.seed(RANDOMSTATE)


In [3]:
# import train data
df = pd.read_pickle('df_train.pickle')

response = 'SalePrice'
predictors = ['YearBuilt',
              'BsmtFullBath',
              'FullBath',
              'KitchenAbvGr',
              'GarageYrBlt',
              'LotFrontage',
              'MasVnrArea',
              '1stFlrSF',
              'GrLivArea',
              'GarageArea',
              'WoodDeckSF',
              'PorchSF',
              'AvgBltRemod',
              'FireBathRatio',
              'TotalSF x OverallQual x OverallCond',
              'AvgBltRemod x Functional x TotalFinSF',
              'Functional x OverallQual',
              'KitchenAbvGr x KitchenQual',
              'GarageCars x GarageYrBlt',
              'GarageQual x GarageCond x GarageCars',
              'HeatingQC x Heating',
              'monthnum',
              'log_YearBuilt',
              'log_LotArea',
              'log_TotalFinSF',
              'log_GarageRatio',
              'log_TotalSF x OverallQual x OverallCond',
              'log_TotalSF x OverallCond',
              'log_AvgBltRemod x TotalFinSF',
              'sq_2ndFlrSF',
              'sq_BsmtFinSF',
              'sq_BsmtFinSF x BsmtQual',
              'sq_BsmtFinSF x BsmtBath',
              'BldgType_4',
              'BsmtExposure_1',
              'BsmtExposure_4',
              'BsmtFinType1_1',
              'BsmtFinType1_2',
              'BsmtFinType1_4',
              'BsmtFinType1_5',
              'BsmtFinType1_6',
              'CentralAir_0',
              'CentralAir_1',
              'Condition1_1',
              'Condition1_3',
              'ExterCond_2',
              'ExterQual_2',
              'Exterior1st_4',
              'Exterior1st_5',
              'Exterior1st_10',
              'Fence_0',
              'Fence_2',
              'Foundation_1',
              'Foundation_5',
              'GarageCars_1',
              'GarageFinish_2',
              'GarageFinish_3',
              'GarageType_2',
              'HouseStyle_2',
              'KitchenQual_4',
              'LotConfig_0',
              'LotConfig_4',
              'MSSubClass_30',
              'MSSubClass_70',
              'MSZoning_0',
              'MSZoning_1',
              'MSZoning_4',
              'MasVnrType_2',
              'MasVnrType_3',
              'MoSold_1',
              'MoSold_5',
              'MoSold_6',
              'MoSold_11',
              'Neighborhood_3',
              'Neighborhood_4',
              'Neighborhood_5',
              'Neighborhood_10',
              'Neighborhood_11',
              'Neighborhood_16',
              'Neighborhood_17',
              'Neighborhood_19',
              'Neighborhood_22',
              'Neighborhood_24',
              'OverallCond_7',
              'OverallQual_5',
              'OverallQual_6',
              'OverallQual_7',
              'OverallQual_9',
              'PavedDrive_0',
              'PavedDrive_2',
              'SaleCondition_1',
              'SaleCondition_2',
              'SaleCondition_5',
              'SaleType_4',
              'BedroomAbvGr_1',
              'BedroomAbvGr_4',
              'BedroomAbvGr_5',
              'HalfBath_1',
              'TotalBath_1.0',
              'TotalBath_2.5']

X_train, X_test, y_train, y_test = train_test_split(df, df[response], test_size=.25)

display(df[predictors].head())
display(df[[response]].head())


Id,YearBuilt,BsmtFullBath,FullBath,KitchenAbvGr,GarageYrBlt,LotFrontage,MasVnrArea,1stFlrSF,GrLivArea,GarageArea,...,SaleCondition_1,SaleCondition_2,SaleCondition_5,SaleType_4,BedroomAbvGr_1,BedroomAbvGr_4,BedroomAbvGr_5,HalfBath_1,TotalBath_1.0,TotalBath_2.5
1,7,1,2,1,7,65.0,196.0,856,1710,548.0,...,0,0,0,1,0,0,0,1,0,0
2,34,0,2,1,34,80.0,0.0,1262,1262,460.0,...,0,0,0,1,0,0,0,0,0,1
3,9,1,2,1,9,68.0,162.0,920,1786,608.0,...,0,0,0,1,0,0,0,1,0,0
4,95,1,1,1,12,60.0,0.0,961,1717,642.0,...,1,0,0,1,0,0,0,0,0,0
5,10,1,2,1,10,84.0,350.0,1145,2198,836.0,...,0,0,0,1,0,1,0,1,0,0


Id,SalePrice
1,12.247699
2,12.109016
3,12.317171
4,11.849405
5,12.42922


In [4]:
# we are training on a response which is the log of 1 + the sale price
# transform prediction back to original basis with expm1 and evaluate vs. original

MEAN_RESPONSE=df[response].mean()
def cv_to_raw(cv_val, mean_response=MEAN_RESPONSE):
    """convert log1p rmse to underlying SalePrice error"""
    # MEAN_RESPONSE assumes folds have same mean response, which is true in expectation but not in each fold
    # we can also pass the actual response for each fold
    # but we're usually looking to consistently convert the log value to a more meaningful unit
    return np.expm1(mean_response+cv_val) - np.expm1(mean_response)
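
As a worked example of this conversion, the difference of the two `expm1` terms simplifies to `exp(mean_response) * (exp(cv_val) - 1)`, so a log-scale RMSE is scaled by the typical price level. The sketch below checks that identity using the standard library; the mean log price of 12.0 is a hypothetical round number, not the value computed from the data.

```python
import math

def cv_to_raw(cv_val, mean_response):
    """Same formula as the notebook's cv_to_raw, written with the math module."""
    return math.expm1(mean_response + cv_val) - math.expm1(mean_response)

mean_log_price = 12.0   # hypothetical mean of log1p(SalePrice)
log_rmse = 0.10         # hypothetical log-scale CV RMSE

raw = cv_to_raw(log_rmse, mean_log_price)
# expm1(m + c) - expm1(m) = e^m * (e^c - 1): the "- 1" terms cancel
closed_form = math.exp(mean_log_price) * math.expm1(log_rmse)

print("raw error:   %.2f" % raw)
print("closed form: %.2f" % closed_form)
```

For small errors `exp(c) - 1` is close to `c`, so a log-scale RMSE of 0.10 corresponds to roughly 10% of the typical sale price.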

In [5]:
# always use same k-folds for reproducibility
kfolds = KFold(n_splits=10, shuffle=True, random_state=RANDOMSTATE)


## Ray Cluster

- Cluster config is in `ray1.1.yaml`
- Edit `ray1.1.yaml` file with your region, availability zone, subnet, imageid information
    - to get those variables launch the latest Deep Learning AMI (Ubuntu 18.04) Version 35.0 into a small instance in your favorite region/zone
    - test that it works
    - note those 4 variables: region, availability zone, subnet, AMI imageid
    - terminate the instance and edit `ray1.1.yaml` accordingly
    - in future you can create your own image with everything pre-installed and specify its AMI imageid, instead of using the generic image and installing everything at launch.
- To run the cluster: 
`ray up ray1.1.yaml`
    - Creates head instance using image specified.
    - Installs ray and related requirements
    - Clones this Iowa repo
    - Launches worker nodes per auto-scaling parameters (currently we fix the number of nodes because we're not benching the time the cluster will take to auto-scale)
- After cluster starts you can check AWS console and note that several instances launched.
- Check `ray monitor ray1.1.yaml` for any error messages
- Run Jupyter on the cluster with port forwarding
 `ray exec ray1.1.yaml --port-forward=8899 'jupyter notebook --port=8899'`
- Open the notebook on the generated URL e.g. http://localhost:8899/?token=5f46d4355ae7174524ba71f30ef3f0633a20b19a204b93b4
- Make sure to choose the default kernel so that it runs in the conda environment with all installs
- Make sure to use the ray.init() command given in the startup messages.
- You can also run a terminal on the head node of the cluster with
 `ray attach /Users/drucev/projects/iowa/ray1.1.yaml`
- You can also ssh explicitly with the IP address and the generated private key
 `ssh -o IdentitiesOnly=yes -i ~/.ssh/ray-autoscaler_1_us-east-1.pem ubuntu@54.161.200.54`
- Run port forwarding to the Ray dashboard with `ray dashboard ray1.1.yaml` and then open http://localhost:8265/

See https://docs.ray.io/en/latest/cluster/launcher.html for additional info.

In [9]:
# make sure local ray service is shutdown
ray.shutdown()


In [10]:
# launch cluster in terminal with ray up ray1.1.yaml
# initialize ray on cluster
ray.init(address='localhost:6379', _redis_password='5241590000000000')


2020-10-20 03:01:38,742	INFO worker.py:634 -- Connecting to existing Ray cluster at address: 172.30.1.63:6379


{'node_ip_address': '172.30.1.63',
 'raylet_ip_address': '172.30.1.63',
 'redis_address': '172.30.1.63:6379',
 'object_store_address': '/tmp/ray/session_2020-10-19_23-52-35_378334_30868/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-10-19_23-52-35_378334_30868/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-10-19_23-52-35_378334_30868',
 'metrics_export_port': 64101}

In [8]:
# refactor to give ray.tune a single function of hyperparameters to optimize

# @wandb_mixin
def my_xgb(config):
    
    # fix these configs to match calling convention
    # search algos pass in floats but xgb wants ints
    #config['max_leaves'] = int(config['max_leaves'])
    config['n_estimators'] = int(config['n_estimators'])   # sampled as float from loguniform, cast to int
    # hyperopt needs the randint range to start at 0 but we want max_depth to start at 2
    config['max_depth'] = int(config['max_depth']) + 2
    # learning_rate is sampled as a base-10 exponent (e.g. -2.0), convert to the actual rate (0.01)
    config['learning_rate'] = 10 ** config['learning_rate']
    
    xgb = XGBRegressor(
        objective='reg:squarederror',
        n_jobs=1,
        random_state=RANDOMSTATE,
        booster='gbtree',   
        scale_pos_weight=1, 
        **config,
    )
    scores = -cross_val_score(xgb, df[predictors], df[response],
                                      scoring="neg_root_mean_squared_error",
                                      cv=kfolds)
    rmse = np.mean(scores)
    tune.report(rmse=rmse)
#     wandb.log({"rmse": rmse})
    
    return {"rmse": rmse}

In [9]:
xgb_tune_kwargs = {
    "n_estimators": tune.loguniform(100, 10000),
    "max_depth": tune.randint(0, 5),
    # max_leaves doesn't seem to have any impact on XGBoost but num_leaves does help LGBM, oddly.
    # (in XGBoost, max_leaves is only relevant when grow_policy='lossguide')
    # 'max_leaves': tune.loguniform(1, 1000),    
    "subsample": tune.quniform(0.25, 0.75, 0.01),
    "colsample_bytree": tune.quniform(0.05, 0.5, 0.01),
    "colsample_bylevel": tune.quniform(0.05, 0.5, 0.01),    
    "learning_rate": tune.quniform(-3.0, -1.0, 0.5),
#     "wandb": {
#         "project": "iowa_xgb",
#         "api_key_file": "~/secrets/wandb.txt",
#    }    
}

xgb_tune_params = [k for k in xgb_tune_kwargs.keys() if k != 'wandb']
xgb_tune_params

['n_estimators',
 'max_depth',
 'subsample',
 'colsample_bytree',
 'colsample_bylevel',
 'learning_rate']
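
The search space above does not sample the final XGBoost arguments directly: `my_xgb` converts them before building the model. The sketch below applies the same three conversions to an illustrative sampled config and lists the learning rates that the `quniform(-3.0, -1.0, 0.5)` exponent grid can produce. The helper name `transform_xgb_config` is hypothetical, and unlike `my_xgb` it returns a copy instead of modifying the config in place.

```python
def transform_xgb_config(raw):
    """Return a copy of a sampled config with the conversions my_xgb applies."""
    cfg = dict(raw)
    cfg["n_estimators"] = int(cfg["n_estimators"])      # float -> int
    cfg["max_depth"] = int(cfg["max_depth"]) + 2        # shift range so it starts at 2
    cfg["learning_rate"] = 10 ** cfg["learning_rate"]   # exponent -> actual rate
    return cfg

# illustrative sampled values of the kind the search algorithm proposes
sampled = {"n_estimators": 4626.62, "max_depth": 4, "learning_rate": -3.0}
cfg = transform_xgb_config(sampled)
print(cfg)

# every learning rate reachable from the exponent grid -3.0, -2.5, ..., -1.0
exponents = [-3.0, -2.5, -2.0, -1.5, -1.0]
rates = [10 ** e for e in exponents]
print(["%.5f" % r for r in rates])
```

Sampling the exponent rather than the rate itself spreads trials evenly across orders of magnitude instead of clustering them near the upper end of the range.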

In [10]:
NUM_SAMPLES=2048

print("XGBoost HyperOpt")

start_time = datetime.now()
print("%-20s %s" % ("Start Time", start_time))

algo = HyperOptSearch(random_state_seed=RANDOMSTATE)
# to limit number of cores, uncomment and set max_concurrent 
# algo = ConcurrencyLimiter(algo, max_concurrent=10)
scheduler = AsyncHyperBandScheduler()

analysis = tune.run(my_xgb,
                    num_samples=NUM_SAMPLES,
                    config=xgb_tune_kwargs,                    
                    name="hyperopt_xgb",
                    metric="rmse",
                    mode="min",
                    search_alg=algo,
                    scheduler=scheduler,
                    verbose=1,
#                    loggers=DEFAULT_LOGGERS + (WandbLogger, ),
                   )

end_time = datetime.now()
print("%-20s %s" % ("Start Time", start_time))
print("%-20s %s" % ("End Time", end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))

Trial name,status,loc,colsample_bylevel,colsample_bytree,learning_rate,max_depth,n_estimators,subsample,iter,total time (s),rmse
my_xgb_0bcbb918,TERMINATED,,0.36,0.16,-3.0,4,4626.62,0.48,1,71.884,0.177395
my_xgb_0bce02cc,TERMINATED,,0.07,0.47,-2.5,3,2853.98,0.46,1,42.8083,0.111589
my_xgb_0bcf6892,TERMINATED,,0.08,0.44,-1.0,1,5842.93,0.67,1,71.048,0.11164
my_xgb_0bd0f32e,TERMINATED,,0.09,0.11,-3.0,2,6877.73,0.47,1,69.7961,0.158541
my_xgb_0bd25610,TERMINATED,,0.09,0.29,-1.5,3,220.67,0.66,1,3.93005,0.123829
my_xgb_0bd3b910,TERMINATED,,0.37,0.46,-1.5,1,751.724,0.65,1,18.0691,0.10924
my_xgb_0bd51ab2,TERMINATED,,0.49,0.3,-1.5,2,7563.79,0.62,1,162.579,0.108419
my_xgb_0bd7c0dc,TERMINATED,,0.22,0.35,-2.5,0,3554.09,0.56,1,43.9944,0.113938
my_xgb_0bd9ea1a,TERMINATED,,0.07,0.2,-1.5,4,6918.6,0.52,1,83.3956,0.107773
my_xgb_0bdb6d9a,TERMINATED,,0.15,0.22,-2.0,1,2519.09,0.73,1,29.3867,0.10597


Start Time           2020-10-19 18:36:47.947494
End Time             2020-10-19 19:54:34.083124
1:17:46


In [11]:
param_cols = ['config.' + k for k in xgb_tune_params]
analysis_results_df = analysis.results_df[['rmse', 'date', 'time_this_iter_s'] + param_cols].sort_values('rmse')
analysis_results_df


trial_id,rmse,date,time_this_iter_s,config.n_estimators,config.max_depth,config.subsample,config.colsample_bytree,config.colsample_bylevel,config.learning_rate
0c464386,0.104323,2020-10-19_18-46-27,115.649617,9833,2,0.42,0.19,0.14,0.010000
0c4b2f5e,0.104862,2020-10-19_18-46-00,82.254241,5441,4,0.42,0.21,0.30,0.010000
10f66c44,0.105090,2020-10-19_18-49-22,48.515120,3959,2,0.35,0.20,0.15,0.010000
1dacf69c,0.105120,2020-10-19_18-55-18,44.188888,3908,2,0.33,0.20,0.09,0.031623
0d75f8be,0.105147,2020-10-19_18-46-49,56.106637,5321,2,0.33,0.50,0.05,0.010000
...,...,...,...,...,...,...,...,...,...
0b6c2da8,10.435452,2020-10-19_19-52-41,1.678954,100,4,0.26,0.13,0.25,0.001000
0e026e9c,10.435452,2020-10-19_19-53-05,1.261150,100,4,0.26,0.08,0.26,0.001000
46d0baae,10.435452,2020-10-19_19-07-24,1.437412,100,4,0.26,0.13,0.22,0.001000
0be8be68,10.435452,2020-10-19_19-52-43,2.298043,100,4,0.26,0.12,0.22,0.001000


In [12]:
best_config = {z: analysis_results_df.iloc[0]['config.' + z] for z in xgb_tune_params}

xgb = XGBRegressor(
    objective='reg:squarederror',
    random_state=RANDOMSTATE,    
    verbosity=1,
    n_jobs=-1,
    **best_config
)
print(xgb)

scores = -cross_val_score(xgb, df[predictors], df[response],
                          scoring="neg_root_mean_squared_error",
                          cv=kfolds)

raw_scores = [cv_to_raw(x) for x in scores]
print()
print("Log1p CV RMSE %.06f (STD %.04f)" % (np.mean(scores), np.std(scores)))
print("Raw CV RMSE %.0f (STD %.0f)" % (np.mean(raw_scores), np.std(raw_scores)))


XGBRegressor(base_score=None, booster=None, colsample_bylevel=0.14,
             colsample_bynode=None, colsample_bytree=0.19, gamma=None,
             gpu_id=None, importance_type='gain', interaction_constraints=None,
             learning_rate=0.01, max_delta_step=None, max_depth=2,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=9833, n_jobs=-1, num_parallel_tree=None,
             random_state=42, reg_alpha=None, reg_lambda=None,
             scale_pos_weight=None, subsample=0.42, tree_method=None,
             validate_parameters=None, verbosity=1)

Log1p CV RMSE 0.104323 (STD 0.0132)
Raw CV RMSE 18309 (STD 2428)


In [10]:
NUM_SAMPLES=2048

print("XGBoost Optuna")

start_time = datetime.now()
print("%-20s %s" % ("Start Time", start_time))

algo = OptunaSearch()
# to limit number of cores, uncomment and set max_concurrent 
# algo = ConcurrencyLimiter(algo, max_concurrent=10)
scheduler = AsyncHyperBandScheduler()

analysis = tune.run(my_xgb,
                    num_samples=NUM_SAMPLES,
                    config=xgb_tune_kwargs,                    
                    name="optuna_xgb",
                    metric="rmse",
                    mode="min",
                    search_alg=algo,
                    scheduler=scheduler,
                    verbose=1,
#                    loggers=DEFAULT_LOGGERS + (WandbLogger, ),
                   )

end_time = datetime.now()
print("%-20s %s" % ("Start Time", start_time))
print("%-20s %s" % ("End Time", end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))

Trial name,status,loc,colsample_bylevel,colsample_bytree,learning_rate,max_depth,n_estimators,subsample,iter,total time (s),rmse
my_xgb_657dd450,TERMINATED,,0.41,0.17,-1.0,4,4567.98,0.38,1,75.1909,0.115673
my_xgb_657ffba4,TERMINATED,,0.06,0.3,-2.0,4,3122.97,0.66,1,35.3099,0.106145
my_xgb_65831212,TERMINATED,,0.5,0.45,-2.5,1,2900.1,0.69,1,60.7716,0.114522
my_xgb_65846cb6,TERMINATED,,0.33,0.4,-2.0,3,864.612,0.34,1,16.4259,0.112361
my_xgb_6586b7b4,TERMINATED,,0.29,0.32,-1.5,2,345.632,0.69,1,6.45409,0.111497
my_xgb_6587f5a2,TERMINATED,,0.13,0.45,-2.5,2,1857.8,0.5,1,25.5816,0.125424
my_xgb_658979c2,TERMINATED,,0.21,0.05,-2.0,2,1461.07,0.68,1,16.4858,0.123313
my_xgb_658b295c,TERMINATED,,0.3,0.07,-3.0,4,4389.33,0.41,1,50.3121,0.228986
my_xgb_658c98c8,TERMINATED,,0.13,0.13,-3.0,4,519.547,0.28,1,5.88145,6.8762
my_xgb_658dded6,TERMINATED,,0.43,0.34,-2.5,2,834.411,0.34,1,14.2498,0.856262


Start Time           2020-10-19 21:02:28.384147
End Time             2020-10-19 22:05:45.410090
1:03:17


In [11]:
param_cols = ['config.' + k for k in xgb_tune_params]
analysis_results_df = analysis.results_df[['rmse', 'date', 'time_this_iter_s'] + param_cols].sort_values('rmse')
analysis_results_df


trial_id,rmse,date,time_this_iter_s,config.n_estimators,config.max_depth,config.subsample,config.colsample_bytree,config.colsample_bylevel,config.learning_rate
71aaa7ee,0.104342,2020-10-19_21-51-09,91.091878,9484,5,0.34,0.49,0.08,0.003162
6ee68988,0.104501,2020-10-19_21-40-18,85.187055,9585,3,0.43,0.17,0.09,0.010000
71e47ff0,0.104648,2020-10-19_21-51-48,27.259119,4665,2,0.26,0.11,0.15,0.031623
73c4d6c6,0.104730,2020-10-19_22-01-24,62.847865,5462,3,0.31,0.14,0.46,0.010000
66b47a90,0.104791,2020-10-19_21-08-48,115.582195,9640,6,0.44,0.23,0.05,0.010000
...,...,...,...,...,...,...,...,...,...
733f7d00,10.382251,2020-10-19_21-57-23,0.826261,105,2,0.45,0.42,0.11,0.001000
6a121a6c,10.403109,2020-10-19_21-19-14,1.454697,103,3,0.42,0.11,0.35,0.001000
71f82186,10.413595,2020-10-19_21-51-32,1.501438,102,7,0.40,0.50,0.19,0.001000
700d448c,10.433581,2020-10-19_21-44-34,1.338971,100,5,0.68,0.12,0.43,0.001000


In [12]:
best_config = {z: analysis_results_df.iloc[0]['config.' + z] for z in xgb_tune_params}

xgb = XGBRegressor(
    objective='reg:squarederror',
    random_state=RANDOMSTATE,    
    verbosity=1,
    n_jobs=-1,
    **best_config
)
print(xgb)

scores = -cross_val_score(xgb, df[predictors], df[response],
                          scoring="neg_root_mean_squared_error",
                          cv=kfolds)

raw_scores = [cv_to_raw(x) for x in scores]
print()
print("Log1p CV RMSE %.06f (STD %.04f)" % (np.mean(scores), np.std(scores)))
print("Raw CV RMSE %.0f (STD %.0f)" % (np.mean(raw_scores), np.std(raw_scores)))


XGBRegressor(base_score=None, booster=None, colsample_bylevel=0.08,
             colsample_bynode=None, colsample_bytree=0.49, gamma=None,
             gpu_id=None, importance_type='gain', interaction_constraints=None,
             learning_rate=0.0031622776601683794, max_delta_step=None,
             max_depth=5, min_child_weight=None, missing=nan,
             monotone_constraints=None, n_estimators=9484, n_jobs=-1,
             num_parallel_tree=None, random_state=42, reg_alpha=None,
             reg_lambda=None, scale_pos_weight=None,
             subsample=0.33999999999999997, tree_method=None,
             validate_parameters=None, verbosity=1)

Log1p CV RMSE 0.104342 (STD 0.0130)
Raw CV RMSE 18313 (STD 2403)


#### LGBM

In [6]:
lgbm_tune_kwargs = {
    "n_estimators": tune.loguniform(100, 10000),
    "max_depth": tune.randint(0, 5),
    'num_leaves': tune.quniform(1, 10, 1.0),               # exponent: num_leaves = 2**value (xgb max_leaves)
    "bagging_fraction": tune.quniform(0.5, 0.8, 0.01),    # xgb subsample
    "feature_fraction": tune.quniform(0.05, 0.5, 0.01),   # xgb colsample_bytree
    "learning_rate": tune.quniform(-3.0, -1.0, 0.5),
#     "wandb": {
#         "project": "iowa",
#     }        
}

#print("wandb name:", lgbm_tune_kwargs['wandb']['name'])
lgbm_tune_params = [k for k in lgbm_tune_kwargs.keys() if k != 'wandb']
print(lgbm_tune_params)


['n_estimators', 'max_depth', 'num_leaves', 'bagging_fraction', 'feature_fraction', 'learning_rate']


In [7]:
def my_lgbm(config):
    
    # fix these configs 
    config['n_estimators'] = int(config['n_estimators'])   # pass float eg loguniform distribution, use int
    config['num_leaves'] = int(2**config['num_leaves'])
    config['learning_rate'] = 10**config['learning_rate']
    
    lgbm = LGBMRegressor(objective='regression',
                         max_bin=200,
                         feature_fraction_seed=7,
                         min_data_in_leaf=2,
                         verbose=-1,
                         n_jobs=1,
                         # these are specified to suppress warnings
                         colsample_bytree=None,
                         min_child_samples=None,
                         subsample=None,
                         **config,
                         # early stopping params, maybe in fit
                         #early_stopping_rounds=early_stopping_rounds,
                         #valid_sets=[xgtrain, xgvalid], valid_names=['train','valid'], evals_result=evals_results
                         #num_boost_round=num_boost_round,
                         )
    
    scores = -cross_val_score(lgbm, df[predictors], df[response],
                              scoring="neg_root_mean_squared_error",
                              cv=kfolds)
    rmse=np.mean(scores)  
    tune.report(rmse=rmse)
    # wandb.log({"rmse": rmse})
    
    return {'rmse': np.mean(scores)}
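
`my_lgbm` samples `num_leaves` as a power-of-two exponent and `max_depth` separately. Since a binary tree of depth d has at most 2^d leaves, some sampled pairs describe the same effective limit. The sketch below makes that interaction explicit; the helper name is hypothetical, and it relies on LightGBM treating `max_depth <= 0` as "no limit".

```python
def effective_max_leaves(num_leaves_exponent, max_depth):
    """Upper bound on leaves per tree for a sampled (num_leaves exponent, max_depth) pair."""
    num_leaves = int(2 ** num_leaves_exponent)   # same conversion as my_lgbm
    if max_depth <= 0:                           # LightGBM: <= 0 means no depth limit
        return num_leaves
    return min(num_leaves, 2 ** max_depth)       # depth caps the leaf count at 2**depth

for exponent, depth in [(5.0, 4), (10.0, 0), (1.0, 4), (3.0, 2)]:
    print("exponent=%.0f max_depth=%d -> at most %d leaves"
          % (exponent, depth, effective_max_leaves(exponent, depth)))
```

When reading the results tables, a `num_leaves` above `2**max_depth` may not mean a more complex tree than a smaller value would give at the same depth.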

In [12]:
# tune LightGBM
print("LightGBM HyperOpt")

NUM_SAMPLES=2048

start_time = datetime.now()
print("%-20s %s" % ("Start Time", start_time))

algo = HyperOptSearch(random_state_seed=RANDOMSTATE)
# uncomment and set max_concurrent to limit number of cores
# algo = ConcurrencyLimiter(algo, max_concurrent=10)
scheduler = AsyncHyperBandScheduler()

# lgbm_tune_kwargs['wandb']['name'] = 'hyperopt_' + xgb_tune_kwargs['wandb']['name']

analysis = tune.run(my_lgbm,
                    num_samples=NUM_SAMPLES,
                    config = lgbm_tune_kwargs,
                    name="hyperopt_lgbm",
                    metric="rmse",
                    mode="min",
                    search_alg=algo,
                    scheduler=scheduler,
                    verbose=1,
#                     loggers=DEFAULT_LOGGERS + (WandbLogger, ),
                   )

end_time = datetime.now()
print("%-20s %s" % ("Start Time", start_time))
print("%-20s %s" % ("End Time", end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))


Trial name,status,loc,bagging_fraction,feature_fraction,learning_rate,max_depth,n_estimators,num_leaves,iter,total time (s),rmse
my_lgbm_a369ded6,TERMINATED,,0.71,0.16,-3.0,4,4626.62,5,1,40.4076,0.114896
my_lgbm_a36f6b30,TERMINATED,,0.51,0.47,-2.5,3,2853.98,5,1,26.2362,0.115249
my_lgbm_a373e66a,TERMINATED,,0.52,0.44,-1.0,1,5842.93,9,1,19.8297,0.114466
my_lgbm_a3788166,TERMINATED,,0.52,0.11,-3.0,2,6877.73,5,1,21.7738,0.119708
my_lgbm_a37bc9b6,TERMINATED,,0.53,0.29,-1.5,3,220.67,8,1,1.7295,0.115158
my_lgbm_a3810e58,TERMINATED,,0.72,0.46,-1.5,1,751.724,8,1,2.2245,0.118629
my_lgbm_a3871adc,TERMINATED,,0.79,0.3,-1.5,2,7563.79,8,1,32.9871,0.111968
my_lgbm_a38c4d40,TERMINATED,,0.62,0.35,-2.5,0,3554.09,7,1,420.315,0.117366
my_lgbm_a38f5b3e,TERMINATED,,0.52,0.2,-1.5,4,6918.6,6,1,57.1518,0.108537
my_lgbm_a392e998,TERMINATED,,0.56,0.22,-2.0,1,2519.09,10,1,5.64772,0.117232


Start Time           2020-10-19 23:56:00.160601
End Time             2020-10-20 01:50:59.020029
1:54:58


In [13]:
param_cols = ['config.' + k for k in lgbm_tune_params]
analysis_results_df = analysis.results_df[['rmse', 'date', 'time_this_iter_s'] + param_cols].sort_values('rmse')
analysis_results_df


trial_id,rmse,date,time_this_iter_s,config.n_estimators,config.max_depth,config.num_leaves,config.bagging_fraction,config.feature_fraction,config.learning_rate
5244be12,0.105979,2020-10-20_01-14-11,39.858768,8989,4,32,0.77,0.11,0.010
0d4bdf02,0.105982,2020-10-20_00-33-52,40.803843,8861,4,32,0.79,0.11,0.010
519892c2,0.105984,2020-10-20_01-14-11,58.197027,9412,4,32,0.78,0.11,0.010
0da7d992,0.105986,2020-10-20_00-33-56,43.700580,9338,4,32,0.79,0.11,0.010
edd87ab8,0.106003,2020-10-20_00-22-59,70.161946,9805,4,32,0.79,0.11,0.010
...,...,...,...,...,...,...,...,...,...
995fef6a,0.374434,2020-10-20_01-41-02,0.816357,109,4,4,0.71,0.10,0.001
97ebadb8,0.375581,2020-10-20_01-40-53,1.121649,101,4,4,0.71,0.11,0.001
2f73e64c,0.375825,2020-10-20_01-02-34,0.410276,100,4,4,0.71,0.11,0.001
98bf4330,0.375825,2020-10-20_01-41-03,0.881516,100,4,4,0.71,0.11,0.001

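The trial status table lists `learning_rate` values such as -3.0, fractional `n_estimators` and single-digit `num_leaves`, while the sorted results show learning rates between 0.001 and 0.01, integer `n_estimators` and `num_leaves` values that are powers of two. One reading, which is an assumption here and not something this section shows, is that the training function decodes the raw samples before fitting. The helper `decode_trial_config` below is hypothetical and only sketches that assumed mapping, using the first row of the trial status table as input:

```python
def decode_trial_config(raw):
    """Map raw search-space samples to LightGBM-ready values.

    The three transforms below are assumptions inferred from the printed
    tables, not the notebook's confirmed training function.
    """
    decoded = dict(raw)
    decoded["learning_rate"] = 10 ** raw["learning_rate"]  # assumed log10 scale
    decoded["num_leaves"] = int(2 ** raw["num_leaves"])    # assumed log2 scale
    decoded["n_estimators"] = int(raw["n_estimators"])     # truncate float sample
    return decoded


# First row of the trial status table (my_lgbm_a369ded6)
raw = {
    "bagging_fraction": 0.71,
    "feature_fraction": 0.16,
    "learning_rate": -3.0,
    "max_depth": 4,
    "n_estimators": 4626.62,
    "num_leaves": 5,
}
print(decode_trial_config(raw))
```

Under these assumptions the raw `learning_rate` of -3.0 corresponds to 0.001 and a raw `num_leaves` of 5 to 32 leaves, which matches the ranges seen in the sorted results.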

best_config = {z: analysis_results_df.iloc[0]['config.' + z] for z in lgbm_tune_params}

lgbm = LGBMRegressor(objective='regression',
                     max_bin=200,
                     feature_fraction_seed=7,
                     min_data_in_leaf=2,
                     verbose=-1,
                     **best_config,
                     # early stopping params, maybe in fit
                     #early_stopping_rounds=early_stopping_rounds,
                     #valid_sets=[xgtrain, xgvalid], valid_names=['train','valid'], evals_result=evals_results
                     #num_boost_round=num_boost_round,
                     )
 
print(lgbm)

scores = -cross_val_score(lgbm, df[predictors], df[response],
                          scoring="neg_root_mean_squared_error",
                          cv=kfolds)

raw_scores = [cv_to_raw(x) for x in scores]
print()
print("Log1p CV RMSE %.06f (STD %.04f)" % (np.mean(scores), np.std(scores)))
print("Raw CV RMSE %.0f (STD %.0f)" % (np.mean(raw_scores), np.std(raw_scores)))


LGBMRegressor(bagging_fraction=0.77, feature_fraction=0.11,
              feature_fraction_seed=7, learning_rate=0.01, max_bin=200,
              max_depth=4, min_data_in_leaf=2, n_estimators=8989, num_leaves=32,
              objective='regression', verbose=-1)

Log1p CV RMSE 0.105979 (STD 0.0127)
Raw CV RMSE 18615 (STD 2354)


# tune LightGBM
print("LightGBM Optuna")

NUM_SAMPLES=2048

start_time = datetime.now()
print("%-20s %s" % ("Start Time", start_time))

algo = OptunaSearch()
# uncomment and set max_concurrent to limit the number of trials running at once
# algo = ConcurrencyLimiter(algo, max_concurrent=10)
scheduler = AsyncHyperBandScheduler()

# lgbm_tune_kwargs['wandb']['name'] = 'optuna_' + lgbm_tune_kwargs['wandb']['name']

analysis = tune.run(my_lgbm,
                    num_samples=NUM_SAMPLES,
                    config=lgbm_tune_kwargs,
                    name="optuna_lgbm",
                    metric="rmse",
                    mode="min",
                    search_alg=algo,
                    scheduler=scheduler,
                    verbose=1,
#                     loggers=DEFAULT_LOGGERS + (WandbLogger, ),
                   )

end_time = datetime.now()
print("%-20s %s" % ("Start Time", start_time))
print("%-20s %s" % ("End Time", end_time))
# total_seconds() also counts whole days; .seconds alone would wrap after 24 hours
print(str(timedelta(seconds=int((end_time - start_time).total_seconds()))))


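| Trial name       | status     | loc | bagging_fraction | feature_fraction | learning_rate | max_depth | n_estimators | num_leaves | iter | total time (s) | rmse     |
|------------------|------------|-----|------------------|------------------|---------------|-----------|--------------|------------|------|----------------|----------|
| my_lgbm_a06d3588 | TERMINATED |     | 0.54             | 0.39             | -2.0          | 0         | 104.17       | 7          | 1    | 9.87122        | 0.186489 |
| my_lgbm_a06e4bda | TERMINATED |     | 0.78             | 0.09             | -3.0          | 2         | 543.697      | 6          | 1    | 1.91405        | 0.307559 |
| my_lgbm_a06f9ada | TERMINATED |     | 0.67             | 0.26             | -2.0          | 2         | 233.94       | 1          | 1    | 1.00491        | 0.209622 |
| my_lgbm_a070efca | TERMINATED |     | 0.72             | 0.18             | -2.5          | 2         | 730.262      | 5          | 1    | 2.91437        | 0.162804 |
| my_lgbm_a0740e58 | TERMINATED |     | 0.64             | 0.2              | -1.0          | 4         | 401.978      | 9          | 1    | 2.93445        | 0.111907 |
| my_lgbm_a0757324 | TERMINATED |     | 0.53             | 0.25             | -1.5          | 5         | 1033.12      | 1          | 1    | 2.79062        | 0.115774 |
| my_lgbm_a076a76c | TERMINATED |     | 0.7              | 0.35             | -1.0          | 1         | 306.475      | 7          | 1    | 1.24875        | 0.118222 |
| my_lgbm_a077e276 | TERMINATED |     | 0.68             | 0.46             | -1.0          | 3         | 9557.93      | 1          | 1    | 29.5012        | 0.116475 |
| my_lgbm_a078f9ae | TERMINATED |     | 0.75             | 0.29             | -2.0          | 2         | 4899.19      | 4          | 1    | 17.2029        | 0.110086 |
| my_lgbm_a07a22e8 | TERMINATED |     | 0.52             | 0.38             | -2.0          | 5         | 167.962      | 10         | 1    | 4.48442        | 0.150043 |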


Start Time           2020-10-20 03:02:02.123224
End Time             2020-10-20 04:32:03.907193
1:30:01


param_cols = ['config.' + k for k in lgbm_tune_params]
analysis_results_df = analysis.results_df[['rmse', 'date', 'time_this_iter_s'] + param_cols].sort_values('rmse')
analysis_results_df


| trial_id | rmse     | date                | time_this_iter_s | config.n_estimators | config.max_depth | config.num_leaves | config.bagging_fraction | config.feature_fraction | config.learning_rate |
|----------|----------|---------------------|------------------|---------------------|------------------|-------------------|-------------------------|-------------------------|----------------------|
| a98542aa | 0.105552 | 2020-10-20_03-40-08 | 15.154661        | 6224                | 3                | 256               | 0.75                    | 0.05                    | 0.010000             |
| a6bdaf12 | 0.105632 | 2020-10-20_03-25-21 | 17.946296        | 6889                | 3                | 32                | 0.57                    | 0.05                    | 0.010000             |
| ac525afe | 0.106055 | 2020-10-20_03-56-24 | 36.828710        | 8795                | 3                | 8                 | 0.66                    | 0.09                    | 0.003162             |
| a1dbfe9a | 0.106117 | 2020-10-20_03-07-16 | 46.485202        | 9858                | 4                | 8                 | 0.58                    | 0.11                    | 0.010000             |
| ab29ef70 | 0.106148 | 2020-10-20_03-50-23 | 25.347550        | 9072                | 3                | 1024              | 0.65                    | 0.06                    | 0.010000             |
| ...      | ...      | ...                 | ...              | ...                 | ...              | ...               | ...                     | ...                     | ...                  |
| a548dc92 | 0.376857 | 2020-10-20_03-19-38 | 1.088077         | 124                 | 0                | 2                 | 0.67                    | 0.15                    | 0.001000             |
| a0def416 | 0.377950 | 2020-10-20_03-03-25 | 0.563442         | 114                 | 1                | 512               | 0.77                    | 0.18                    | 0.001000             |
| ac971f4a | 0.378223 | 2020-10-20_03-56-42 | 0.373703         | 102                 | 1                | 4                 | 0.71                    | 0.41                    | 0.001000             |
| a53649ec | 0.378765 | 2020-10-20_03-19-15 | 0.631133         | 100                 | 1                | 32                | 0.51                    | 0.31                    | 0.001000             |


best_config = {z: analysis_results_df.iloc[0]['config.' + z] for z in lgbm_tune_params}

lgbm = LGBMRegressor(objective='regression',
                     max_bin=200,
                     feature_fraction_seed=7,
                     min_data_in_leaf=2,
                     verbose=-1,
                     **best_config,
                     # early stopping params, maybe in fit
                     #early_stopping_rounds=early_stopping_rounds,
                     #valid_sets=[xgtrain, xgvalid], valid_names=['train','valid'], evals_result=evals_results
                     #num_boost_round=num_boost_round,
                     )
 
print(lgbm)

scores = -cross_val_score(lgbm, df[predictors], df[response],
                          scoring="neg_root_mean_squared_error",
                          cv=kfolds)

raw_scores = [cv_to_raw(x) for x in scores]
print()
print("Log1p CV RMSE %.06f (STD %.04f)" % (np.mean(scores), np.std(scores)))
print("Raw CV RMSE %.0f (STD %.0f)" % (np.mean(raw_scores), np.std(raw_scores)))


LGBMRegressor(bagging_fraction=0.75, feature_fraction=0.05,
              feature_fraction_seed=7, learning_rate=0.01, max_bin=200,
              max_depth=3, min_data_in_leaf=2, n_estimators=6224,
              num_leaves=256, objective='regression', verbose=-1)

Log1p CV RMSE 0.105552 (STD 0.0133)
Raw CV RMSE 18537 (STD 2452)
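
To put the two LightGBM searches in this section side by side, the sketch below recomputes each elapsed time from the printed start and end timestamps and compares the raw CV RMSE values reported above. The names `earlier_run`, `optuna_run` and `elapsed_seconds` are illustrative only. Note that the gap in raw RMSE is far smaller than the fold-to-fold standard deviation of roughly $2,400, so a single run of each search does not show that one algorithm is better.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start, end):
    """Whole seconds between two timestamps in the printed format."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# Values copied from the printed outputs in this section
earlier_run = {
    "seconds": elapsed_seconds("2020-10-19 23:56:00.160601",
                               "2020-10-20 01:50:59.020029"),
    "raw_rmse": 18615,
}
optuna_run = {
    "seconds": elapsed_seconds("2020-10-20 03:02:02.123224",
                               "2020-10-20 04:32:03.907193"),
    "raw_rmse": 18537,
}

time_saved = earlier_run["seconds"] - optuna_run["seconds"]
rmse_gain = earlier_run["raw_rmse"] - optuna_run["raw_rmse"]

print("time saved: %d s (%.1f%%)"
      % (time_saved, 100 * time_saved / earlier_run["seconds"]))
print("raw RMSE improvement: $%d" % rmse_gain)
```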
