# Hyperparameter tuning with XGBoost, Ray Tune, Hyperopt and BayesOpt

Complex machine learning algorithms like XGBoost, LightGBM, and neural networks have many tuning parameters, so finding the best combination via an exhaustive grid search is time-consuming. This notebook is an example of speeding up hyperparameter tuning using Ray Tune, advanced search algorithms, and clusters.

Outline: I use the Housing Prices Competition for Kaggle Learn Users: https://www.kaggle.com/c/house-prices-advanced-regression-techniques . The response we are predicting is the log-transformed SalePrice, based on house features such as square footage, neighborhood, pool, and condition. I already did some feature engineering and feature selection, and my submission was in the top 5% when I submitted it. link to github

- Baseline linear regression with no hyperparameters
- ElasticNet with L1 and L2 regularization using ElasticNetCV hyperparameter optimization
- ElasticNet with GridSearchCV hyperparameter optimization
- XGBoost with a semi-manual hyperparameter optimization using early stopping and looping over a grid
- XGBoost with Ray, HyperOpt and BayesOpt search algorithms
- Accelerate advanced algorithms with a Ray cluster

Results table with cv error and timing

In [1]:
from itertools import product
from datetime import datetime

import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression, ElasticNet, ElasticNetCV, Ridge, RidgeCV

from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV, KFold
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline

#!conda install -y -c conda-forge  xgboost 
import xgboost
from xgboost import XGBRegressor
from xgboost import plot_importance

import ray
from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest.bayesopt import BayesOptSearch
from ray.tune.suggest.hyperopt import HyperOptSearch
# pip install bayesian-optimization
# pip install hyperopt

In [2]:
# set seed for reproducibility
RANDOMSTATE = 42
np.random.seed(RANDOMSTATE)


The dataset contains Iowa house prices, from the Housing Prices Competition for Kaggle Learn Users:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Feature engineering and feature selection are already done.
The response we predict is the log of SalePrice (log1p, i.e. the log of 1 + the sale price).
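
As a small illustration of the transform, using made-up prices (not taken from the dataset): `np.log1p` and `np.expm1` are inverses, so predictions made on the log scale can be mapped back to dollars.

```python
import numpy as np

# hypothetical sale prices in dollars (illustrative values only)
prices = np.array([100000.0, 200000.0, 350000.0])

# forward transform used for the response: log(1 + price)
log_prices = np.log1p(prices)

# inverse transform: exp(x) - 1 recovers the original scale
recovered = np.expm1(log_prices)

print(np.allclose(prices, recovered))
```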

In [3]:
# import train data
df = pd.read_pickle('df_train.pickle')

response = 'SalePrice'
predictors = ['YearBuilt',
              'BsmtFullBath',
              'FullBath',
              'KitchenAbvGr',
              'GarageYrBlt',
              'LotFrontage',
              'MasVnrArea',
              '1stFlrSF',
              'GrLivArea',
              'GarageArea',
              'WoodDeckSF',
              'PorchSF',
              'AvgBltRemod',
              'FireBathRatio',
              'TotalSF x OverallQual x OverallCond',
              'AvgBltRemod x Functional x TotalFinSF',
              'Functional x OverallQual',
              'KitchenAbvGr x KitchenQual',
              'GarageCars x GarageYrBlt',
              'GarageQual x GarageCond x GarageCars',
              'HeatingQC x Heating',
              'monthnum',
              'log_YearBuilt',
              'log_LotArea',
              'log_TotalFinSF',
              'log_GarageRatio',
              'log_TotalSF x OverallQual x OverallCond',
              'log_TotalSF x OverallCond',
              'log_AvgBltRemod x TotalFinSF',
              'sq_2ndFlrSF',
              'sq_BsmtFinSF',
              'sq_BsmtFinSF x BsmtQual',
              'sq_BsmtFinSF x BsmtBath',
              'BldgType_4',
              'BsmtExposure_1',
              'BsmtExposure_4',
              'BsmtFinType1_1',
              'BsmtFinType1_2',
              'BsmtFinType1_4',
              'BsmtFinType1_5',
              'BsmtFinType1_6',
              'CentralAir_0',
              'CentralAir_1',
              'Condition1_1',
              'Condition1_3',
              'ExterCond_2',
              'ExterQual_2',
              'Exterior1st_4',
              'Exterior1st_5',
              'Exterior1st_10',
              'Fence_0',
              'Fence_2',
              'Foundation_1',
              'Foundation_5',
              'GarageCars_1',
              'GarageFinish_2',
              'GarageFinish_3',
              'GarageType_2',
              'HouseStyle_2',
              'KitchenQual_4',
              'LotConfig_0',
              'LotConfig_4',
              'MSSubClass_30',
              'MSSubClass_70',
              'MSZoning_0',
              'MSZoning_1',
              'MSZoning_4',
              'MasVnrType_2',
              'MasVnrType_3',
              'MoSold_1',
              'MoSold_5',
              'MoSold_6',
              'MoSold_11',
              'Neighborhood_3',
              'Neighborhood_4',
              'Neighborhood_5',
              'Neighborhood_10',
              'Neighborhood_11',
              'Neighborhood_16',
              'Neighborhood_17',
              'Neighborhood_19',
              'Neighborhood_22',
              'Neighborhood_24',
              'OverallCond_7',
              'OverallQual_5',
              'OverallQual_6',
              'OverallQual_7',
              'OverallQual_9',
              'PavedDrive_0',
              'PavedDrive_2',
              'SaleCondition_1',
              'SaleCondition_2',
              'SaleCondition_5',
              'SaleType_4',
              'BedroomAbvGr_1',
              'BedroomAbvGr_4',
              'BedroomAbvGr_5',
              'HalfBath_1',
              'TotalBath_1.0',
              'TotalBath_2.5']

X_train, X_test, y_train, y_test = train_test_split(df, df[response], test_size=.25)

display(df[predictors].head())
display(df[[response]].head())


Id,YearBuilt,BsmtFullBath,FullBath,KitchenAbvGr,GarageYrBlt,LotFrontage,MasVnrArea,1stFlrSF,GrLivArea,GarageArea,...,SaleCondition_1,SaleCondition_2,SaleCondition_5,SaleType_4,BedroomAbvGr_1,BedroomAbvGr_4,BedroomAbvGr_5,HalfBath_1,TotalBath_1.0,TotalBath_2.5
1,7,1,2,1,7,65.0,196.0,856,1710,548.0,...,0,0,0,1,0,0,0,1,0,0
2,34,0,2,1,34,80.0,0.0,1262,1262,460.0,...,0,0,0,1,0,0,0,0,0,1
3,9,1,2,1,9,68.0,162.0,920,1786,608.0,...,0,0,0,1,0,0,0,1,0,0
4,95,1,1,1,12,60.0,0.0,961,1717,642.0,...,1,0,0,1,0,0,0,0,0,0
5,10,1,2,1,10,84.0,350.0,1145,2198,836.0,...,0,0,0,1,0,1,0,1,0,0


Id,SalePrice
1,12.247699
2,12.109016
3,12.317171
4,11.849405
5,12.42922


In [4]:
# we are training on a response which is the log of 1 + the sale price
# transform prediction back to original basis with expm1 and evaluate vs. original

def evaluate(y_train, y_pred_train, y_test, y_pred_test):
    """evaluate in train_test split"""
    print('Train RMSE', np.sqrt(mean_squared_error(np.expm1(y_train), np.expm1(y_pred_train))))
    print('Train R-squared', r2_score(np.expm1(y_train), np.expm1(y_pred_train)))
    print('Train MAE', mean_absolute_error(np.expm1(y_train), np.expm1(y_pred_train)))
    print()
    print('Test RMSE', np.sqrt(mean_squared_error(np.expm1(y_test), np.expm1(y_pred_test))))
    print('Test R-squared', r2_score(np.expm1(y_test), np.expm1(y_pred_test)))
    print('Test MAE', mean_absolute_error(np.expm1(y_test), np.expm1(y_pred_test)))

MEAN_RESPONSE=df[response].mean()
def cv_to_raw(cv_val):
    """convert log1p rmse to underlying SalePrice error"""
    return np.expm1(MEAN_RESPONSE+cv_val) - np.expm1(MEAN_RESPONSE)
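
To make `cv_to_raw` concrete: it asks how far, in dollars, a house at the mean log price moves if its log price shifts by one RMSE. A sketch with an assumed mean response of 12.0 (a placeholder, not the actual mean of the data) and a hypothetical helper name:

```python
import numpy as np

mean_response = 12.0   # assumed placeholder for MEAN_RESPONSE
log_rmse = 0.10        # an example RMSE on the log1p scale

def cv_to_raw_example(cv_val, mean_log=mean_response):
    """Dollar change when the mean log price is shifted by cv_val."""
    return np.expm1(mean_log + cv_val) - np.expm1(mean_log)

raw = cv_to_raw_example(log_rmse)
# algebraically the same thing: exp(mean_log) * (exp(cv_val) - 1)
closed_form = np.exp(mean_response) * (np.exp(log_rmse) - 1)
print(raw, closed_form)
```

Because the transform is nonlinear, this is an approximate dollar figure anchored at the mean, not an exact RMSE in dollars.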

In [5]:
# always use same k-folds
kfolds = KFold(n_splits=10, shuffle=True, random_state=RANDOMSTATE)
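
A quick check, on a toy array rather than the housing data, that a `KFold` with a fixed integer `random_state` yields the same folds each time `split` is called. That is what makes CV scores comparable across models.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(-1, 1)   # toy data, 20 rows
kf = KFold(n_splits=10, shuffle=True, random_state=42)

first = [test_idx.tolist() for _, test_idx in kf.split(X_toy)]
second = [test_idx.tolist() for _, test_idx in kf.split(X_toy)]

print(first == second)           # same folds on every call
print([len(f) for f in first])   # number of held-out rows per fold
```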


## Baseline linear regression


In [6]:
%%time
# Baseline: LinearRegression has no hyperparameters to tune
print("LinearRegression")

print(len(predictors), "predictors")

lr = LinearRegression()

#train and evaluate in train/test split
lr.fit(X_train[predictors], y_train)

y_pred_train = lr.predict(X_train[predictors])
y_pred_test = lr.predict(X_test[predictors])
evaluate(y_train, y_pred_train, y_test, y_pred_test)

# evaluate using kfolds, same process as train/test split but average results over 10 folds
# more sample-efficient, less CPU-efficient

scores = -cross_val_score(lr, df[predictors], df[response],
                          scoring="neg_root_mean_squared_error",
                          cv=kfolds,
                          n_jobs=-1)
raw_scores = [cv_to_raw(x) for x in scores]
print()
print("Log1p CV RMSE %.04f (STD %.04f)" % (np.mean(scores), np.std(scores)))
print("Raw CV RMSE %.04f (STD %.04f)" % (np.mean(raw_scores), np.std(raw_scores)))


LinearRegression
100 predictors
Train RMSE 16551.163619379957
Train R-squared 0.955449013094547
Train MAE 10998.860386226794

Test RMSE 17846.1719186747
Test R-squared 0.9364301769500323
Test MAE 12755.395673617797

Log1p CV RMSE 0.1037 (STD 0.0099)
Raw CV RMSE 18191.9791 (STD 1838.6678)
CPU times: user 265 ms, sys: 184 ms, total: 449 ms
Wall time: 1.62 s


## Native Sklearn xxxCV
- LogisticRegressionCV, LassoCV, RidgeCV, ElasticNetCV, etc.
- Test many hyperparameters in parallel with multithreading
- Note improvement vs. LinearRegression due to controlling overfitting
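
A minimal sketch on synthetic data (from `make_regression`, not the housing set) of how an xxxCV estimator exposes its search. `ElasticNetCV` stores the selected `alpha_` and `l1_ratio_`, and its `mse_path_` attribute holds the per-fold MSE for every candidate, from which a CV error can be derived. The grid values here are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X_syn, y_syn = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

l1_ratios = [0.1, 0.5, 0.9]
alphas = np.logspace(-3, 0, 5)
model = ElasticNetCV(l1_ratio=l1_ratios, alphas=alphas, cv=5, max_iter=10000)
model.fit(X_syn, y_syn)

# mse_path_ has shape (n_l1_ratio, n_alphas, n_folds)
mean_mse = model.mse_path_.mean(axis=-1)   # average over folds
best_cv_rmse = np.sqrt(mean_mse.min())     # square root of the lowest mean MSE

print(model.alpha_, model.l1_ratio_, best_cv_rmse)
```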


In [7]:
%%time
# Tune elasticnet search space for alphas and l1_ratio
# the predictor selection that created this training set used lasso,
# so the best l1_ratio is close to 0
# could use ridge (i.e. elasticnet with 0 L1 regularization)
# but then there is only 1 param; elasticnet is more general and useful here
print("ElasticnetCV")

# make pipeline
# with regularization must scale predictors
elasticnetcv = make_pipeline(RobustScaler(),
                             ElasticNetCV(max_iter=100000, 
                                          #l1_ratio=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99],
                                          l1_ratio=np.linspace(0.01, 0.21, 21),
                                          alphas=np.logspace(-4, -2, 21),
                                          cv=kfolds))

#train and evaluate in train/test split
elasticnetcv.fit(X_train[predictors], y_train)

y_pred_train = elasticnetcv.predict(X_train[predictors])
y_pred_test = elasticnetcv.predict(X_test[predictors])
evaluate(y_train, y_pred_train, y_test, y_pred_test)
l1_ratio = elasticnetcv._final_estimator.l1_ratio_
alpha = elasticnetcv._final_estimator.alpha_
print('l1_ratio', l1_ratio)
print('alpha', alpha)

# evaluate using kfolds on full dataset
# I don't see API to get CV error from elasticnetcv, so we use cross_val_score
elasticnet = ElasticNet(alpha=alpha,
                        l1_ratio=l1_ratio,
                        max_iter=10000)

scores = -cross_val_score(elasticnet, df[predictors], df[response],
                          scoring="neg_root_mean_squared_error",
                          cv=kfolds,
                          n_jobs=-1)
raw_scores = [cv_to_raw(x) for x in scores]
print()
print("Log1p CV RMSE %.04f (STD %.04f)" % (np.mean(scores), np.std(scores)))
print("Raw CV RMSE %.04f (STD %.04f)" % (np.mean(raw_scores), np.std(raw_scores)))


ElasticnetCV
Train RMSE 16749.841578148156
Train R-squared 0.9543730254176759
Train MAE 11014.427829330485

Test RMSE 17479.444684207574
Test R-squared 0.9390159700504748
Test MAE 12415.248148857769
l1_ratio 0.01
alpha 0.005011872336272725

Log1p CV RMSE 0.1032 (STD 0.0112)
Raw CV RMSE 18103.4794 (STD 2060.3546)
CPU times: user 24.4 s, sys: 5.64 s, total: 30 s
Wall time: 4.61 s


## GridSearchCV
- Useful for algos with no native multithreaded xxxCV
- Test many hyperparameter combinations in parallel with multithreading
- Similar result to ElasticNetCV (same l1_ratio; alpha differs because this search runs on the full dataset)
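
An alternative arrangement, sketched on synthetic data, puts the scaler inside the estimator that `GridSearchCV` searches over, so the scaler is refit on each training fold. This is not what the notebook cell does; the step name `elasticnet` is the one `make_pipeline` generates automatically, and the grid values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

X_syn, y_syn = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

pipe = make_pipeline(RobustScaler(), ElasticNet(max_iter=10000))
search = GridSearchCV(
    pipe,
    # pipeline parameters are addressed as <step name>__<parameter>
    param_grid={'elasticnet__alpha': np.logspace(-3, 0, 4),
                'elasticnet__l1_ratio': [0.1, 0.5, 0.9]},
    scoring='neg_root_mean_squared_error',
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(-search.best_score_)   # mean per-fold RMSE of the best candidate
```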


In [8]:
#%%time
gs = make_pipeline(RobustScaler(),
                   GridSearchCV(ElasticNet(max_iter=100000),
                                param_grid={'l1_ratio': np.linspace(0.01, 0.21, 21),
                                            'alpha': np.logspace(-4, -2, 21),
                                           },
                                scoring='neg_mean_squared_error',
                                refit=True,
                                cv=kfolds,
                                n_jobs=-1,
                                verbose=1
                               ))

# do cv using kfolds on full dataset
print("\nCV on full dataset")
gs.fit(df[predictors], df[response])
print('best params', gs._final_estimator.best_params_)
print('best score', -gs._final_estimator.best_score_)
l1_ratio = gs._final_estimator.best_params_['l1_ratio']
alpha = gs._final_estimator.best_params_['alpha']

elasticnet = ElasticNet(alpha=alpha,
                        l1_ratio=l1_ratio,
                        max_iter=100000)
print(elasticnet)

scores = -cross_val_score(elasticnet, df[predictors], df[response],
                          scoring="neg_root_mean_squared_error",
                          cv=kfolds,
                          n_jobs=-1)
raw_scores = [cv_to_raw(x) for x in scores]
print()
print("Log1p CV RMSE %.06f (STD %.04f)" % (np.mean(scores), np.std(scores)))
print("Raw CV RMSE %.06f (STD %.04f)" % (np.mean(raw_scores), np.std(raw_scores)))

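# the GridSearchCV best score differs from the cross_val_score average
# even with the same alpha, l1_ratio, kfolds. Two differences in setup:
# - GridSearchCV scored with neg_mean_squared_error, so best_score_ is a mean of per-fold MSE;
#   its square root is not the same as the mean of per-fold RMSE computed here
# - GridSearchCV ran inside the pipeline on RobustScaler-transformed predictors,
#   while this ElasticNet is fit on unscaled predictors
# weighting folds by number of samples changes the average only slightly:
nsamples = [len(z[1]) for z in kfolds.split(df)]
print("weighted average %.06f" % np.average(scores, weights=nsamples))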



CV on full dataset
Fitting 10 folds for each of 441 candidates, totalling 4410 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 305 tasks      | elapsed:    2.0s
[Parallel(n_jobs=-1)]: Done 804 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 1504 tasks      | elapsed:    7.8s
[Parallel(n_jobs=-1)]: Done 2404 tasks      | elapsed:   11.2s
[Parallel(n_jobs=-1)]: Done 4000 tasks      | elapsed:   16.6s
[Parallel(n_jobs=-1)]: Done 4410 out of 4410 | elapsed:   17.7s finished


best params {'alpha': 0.002511886431509582, 'l1_ratio': 0.01}
best score 0.010632767727426403
ElasticNet(alpha=0.002511886431509582, l1_ratio=0.01, max_iter=100000)

Log1p CV RMSE 0.102973 (STD 0.0108)
Raw CV RMSE 18054.997021 (STD 1984.3274)
weighted average 0.102993
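
One reason the GridSearchCV best score and the cross_val_score average need not agree, illustrated with made-up per-fold values: the square root of an average MSE is never smaller than the average of the per-fold RMSEs. A score computed from `neg_mean_squared_error` and one computed from `neg_root_mean_squared_error` therefore differ whenever the folds have different errors.

```python
import numpy as np

# hypothetical per-fold MSE values (not from the run above)
fold_mse = np.array([0.008, 0.010, 0.012, 0.015, 0.009])

sqrt_of_mean_mse = np.sqrt(fold_mse.mean())   # square root of an averaged MSE score
mean_of_rmse = np.sqrt(fold_mse).mean()       # average of per-fold RMSE

print(sqrt_of_mean_mse, mean_of_rmse)
print(sqrt_of_mean_mse >= mean_of_rmse)
```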


In [9]:
# roll-our-own CV 
# matches cross_val_score
alpha = 0.002511886431509582
l1_ratio = 0.01
regressor = ElasticNet(alpha=alpha,
                       l1_ratio=l1_ratio,
                       max_iter=10000)
print(regressor)
cverrors = []
for train_fold, cv_fold in kfolds.split(df): 
    fold_X_train=df[predictors].values[train_fold]
    fold_y_train=df[response].values[train_fold]
    fold_X_test=df[predictors].values[cv_fold]
    fold_y_test=df[response].values[cv_fold]
    regressor.fit(fold_X_train, fold_y_train)
    y_pred_test=regressor.predict(fold_X_test)
    cverrors.append(np.sqrt(mean_squared_error(fold_y_test, y_pred_test)))
    
print("%.06f" % np.average(cverrors))
    

ElasticNet(alpha=0.002511886431509582, l1_ratio=0.01, max_iter=10000)
0.102973


## XGBoost CV 
- XGBoost is a powerful gradient boosting algo with built-in multithreading and native CV
- XGBoost has many tuning parameters so a complete grid search has an unreasonable number of combinations
- We tune reduced sets of parameters sequentially and use early stopping.

### Tuning methodology
- Set an initial set of starting parameters
- Do 10-fold CV
- Use early stopping in each fold to halt training if no improvement after eg 100 rounds, average error over kfolds
- Tune max_depth and min_child_weight that result in smallest CV error
- Tune subsample and colsample_bytree
- Tune alpha, lambda and gamma (regularization)
- Tune learning rate: lower learning rate will need more rounds/n_estimators
- Retrain on full dataset with best learning rate and best n_estimators (average stopping point over kfolds)
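
The sequence above amounts to a coordinate-wise search: tune one small group of parameters on a grid, freeze the winners, and move to the next group. A toy sketch of that control flow, with a made-up `toy_cv_error` function standing in for the 10-fold CV of a real model:

```python
from itertools import product

def toy_cv_error(p):
    """Hypothetical stand-in for a CV error; minimized at depth 3, weight 2, subsample 0.7."""
    return ((p['max_depth'] - 3) ** 2
            + (p['min_child_weight'] - 2) ** 2
            + (p['subsample'] - 0.7) ** 2)

# starting parameters
params = {'max_depth': 5, 'min_child_weight': 5, 'subsample': 0.5}

# each round is (names of the parameters tuned together, grid of candidate tuples)
rounds = [
    (('max_depth', 'min_child_weight'), list(product(range(1, 6), range(1, 6)))),
    (('subsample',), [(s,) for s in (0.5, 0.6, 0.7, 0.8, 0.9)]),
]

for names, grid in rounds:
    # evaluate every candidate with the other parameters held fixed
    best = min(grid, key=lambda values: toy_cv_error({**params, **dict(zip(names, values))}))
    params.update(zip(names, best))   # freeze the winners before the next round

print(params)
```

The shortcut is cheap but can miss interactions between groups, since earlier choices are never revisited.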

### Notes
- It doesn't seem possible to get XGBoost early stopping and also use GridSearchCV. GridSearchCV doesn't pass the kfolds in a way that XGBoost understands for early stopping
- 2 alternative approaches 
    - use native xgboost .cv which understands early stopping but doesn't use sklearn API (uses DMatrix, not np array or dataframe)
    - use sklearn API and roll our own grid search instead of GridSearchCV (used below)
- XGBoost terminology differs from sklearn
    - num_boost_round = n_estimators
    - eta = learning_rate
- parameter reference: https://xgboost.readthedocs.io/en/latest/parameter.html
- training reference: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.training


In [10]:
# initial XGboost parameters

max_depth = 5
min_child_weight=5
colsample_bytree = 0.5
subsample = 0.5
reg_alpha = 1e-05
reg_lambda = 1
reg_gamma = 0
learning_rate = 0.01

BOOST_ROUNDS=50000   # we use early stopping so make this arbitrarily high
EARLY_STOPPING_ROUNDS=100 # stop if no improvement after 100 rounds

# round 1: tune depth and min_child_weight
max_depths = list(range(1,5))
min_child_weights = list(range(1,5))
gridsearch_params_1 = product(max_depths, min_child_weights)

# round 2: tune subsample and colsample_bytree
subsamples = np.linspace(0.1, 1.0, 10)
colsample_bytrees = np.linspace(0.1, 1.0, 10)
gridsearch_params_2 = product(subsamples, colsample_bytrees)

# round 2 (refined): tune subsample and colsample_bytree
subsamples = np.linspace(0.6, 0.8, 5)
colsample_bytrees = np.linspace(0.05, 0.25, 5)
gridsearch_params_2 = product(subsamples, colsample_bytrees)

# round 3: tune alpha, lambda, gamma
reg_alphas = np.logspace(-3, -2, 3)
reg_lambdas = np.logspace(-2, 1, 4)
reg_gammas = [0]
#reg_gammas = np.linspace(0, 5, 6)
gridsearch_params_3 = product(reg_alphas, reg_lambdas, reg_gammas)

# round 4: learning rate
learning_rates = reversed(np.logspace(-3, -1, 5).tolist())
gridsearch_params_4 = learning_rates

# override initial parameters after search
# round 1:
max_depth=3
min_child_weight=2
# # round 2:
subsample=0.65
colsample_bytree=0.05
# # round 3: 
reg_alpha = 0.001000
reg_lambda = 0.01
reg_gamma = 0

def my_cv(df, predictors, response, kfolds, regressor, verbose=False):
    """Roll our own CV over kfolds with early stopping"""
    metrics = []
    best_iterations = []

    for train_fold, cv_fold in kfolds.split(df): 
        fold_X_train=df[predictors].values[train_fold]
        fold_y_train=df[response].values[train_fold]
        fold_X_test=df[predictors].values[cv_fold]
        fold_y_test=df[response].values[cv_fold]
        regressor.fit(fold_X_train, fold_y_train,
                      early_stopping_rounds=EARLY_STOPPING_ROUNDS,
                      eval_set=[(fold_X_test, fold_y_test)],
                      eval_metric='rmse',
                      verbose=verbose
                     )
        y_pred_test=regressor.predict(fold_X_test)
        metrics.append(np.sqrt(mean_squared_error(fold_y_test, y_pred_test)))
        best_iterations.append(regressor.best_iteration)
    return np.average(metrics), np.std(metrics), np.average(best_iterations)

results = []
best_iterations = []

# for i, (max_depth, min_child_weight) in enumerate(gridsearch_params_1): # round 1
# for i, (subsample, colsample_bytree) in enumerate(gridsearch_params_2): # round 2
# for i, (reg_alpha, reg_lambda, reg_gamma) in enumerate(gridsearch_params_3): # round 3
for i, learning_rate in enumerate(gridsearch_params_4): # round 4

    params = {
        'max_depth': max_depth,
        'min_child_weight': min_child_weight,
        'subsample': subsample,
        'colsample_bytree': colsample_bytree,
        'reg_alpha': reg_alpha,
        'reg_lambda': reg_lambda,
        'gamma': reg_gamma,
        'learning_rate': learning_rate,
    }
    print("%s params  %3d: %s" % (datetime.strftime(datetime.now(), "%T"), i, params))
    
    xgb = XGBRegressor(
        objective='reg:squarederror',
        n_estimators=BOOST_ROUNDS,
        early_stopping_rounds=EARLY_STOPPING_ROUNDS,
        random_state=RANDOMSTATE,    
        verbosity=1,
        n_jobs=-1,
        **params
    )
    
    metric_rmse, metric_std, best_iteration = my_cv(df, predictors, response, kfolds, xgb, verbose=False)    
    results.append([max_depth, min_child_weight, subsample, colsample_bytree, reg_alpha, reg_lambda, reg_gamma, 
                   learning_rate, metric_rmse, metric_std, best_iteration])
    
    print("%s %3d result mean: %.6f std: %.6f, iter: %.2f" % (datetime.strftime(datetime.now(), "%T"), i, metric_rmse, metric_std, best_iteration))


results_df = pd.DataFrame(results, columns=['max_depth', 'min_child_weight', 'subsample', 'colsample_bytree', 
                               'reg_alpha', 'reg_lambda', 'reg_gamma', 'learning_rate', 'rmse', 'std', 'best_iter']).sort_values('rmse')
results_df


11:29:18 params    0: {'max_depth': 3, 'min_child_weight': 2, 'subsample': 0.65, 'colsample_bytree': 0.05, 'reg_alpha': 0.001, 'reg_lambda': 0.01, 'gamma': 0, 'learning_rate': 0.1}
11:29:22   0 result mean: 0.108778 std: 0.012633, iter: 434.00
11:29:22 params    1: {'max_depth': 3, 'min_child_weight': 2, 'subsample': 0.65, 'colsample_bytree': 0.05, 'reg_alpha': 0.001, 'reg_lambda': 0.01, 'gamma': 0, 'learning_rate': 0.03162277660168379}
11:29:31   1 result mean: 0.106535 std: 0.014412, iter: 843.80
11:29:31 params    2: {'max_depth': 3, 'min_child_weight': 2, 'subsample': 0.65, 'colsample_bytree': 0.05, 'reg_alpha': 0.001, 'reg_lambda': 0.01, 'gamma': 0, 'learning_rate': 0.01}
11:29:52   2 result mean: 0.105469 std: 0.013154, iter: 2296.30
11:29:52 params    3: {'max_depth': 3, 'min_child_weight': 2, 'subsample': 0.65, 'colsample_bytree': 0.05, 'reg_alpha': 0.001, 'reg_lambda': 0.01, 'gamma': 0, 'learning_rate': 0.0031622776601683794}
11:30:37   3 result mean: 0.106649 std: 0.013061, i

,max_depth,min_child_weight,subsample,colsample_bytree,reg_alpha,reg_lambda,reg_gamma,learning_rate,rmse,std,best_iter
2,3,2,0.65,0.05,0.001,0.01,0,0.01,0.105469,0.013154,2296.3
1,3,2,0.65,0.05,0.001,0.01,0,0.031623,0.106535,0.014412,843.8
4,3,2,0.65,0.05,0.001,0.01,0,0.001,0.106623,0.013864,16281.0
3,3,2,0.65,0.05,0.001,0.01,0,0.003162,0.106649,0.013061,5185.0
0,3,2,0.65,0.05,0.001,0.01,0,0.1,0.108778,0.012633,434.0


In [11]:
max_depth = int(results_df.iloc[0]['max_depth'])
min_child_weight = results_df.iloc[0]['min_child_weight']
subsample = results_df.iloc[0]['subsample']
colsample_bytree = results_df.iloc[0]['colsample_bytree']
reg_alpha = results_df.iloc[0]['reg_alpha']
reg_lambda = results_df.iloc[0]['reg_lambda']
reg_gamma = results_df.iloc[0]['reg_gamma']
learning_rate = results_df.iloc[0]['learning_rate']
N_ESTIMATORS = int(results_df.iloc[0]['best_iter'])

params = {
    'max_depth': int(max_depth),
    'min_child_weight': min_child_weight,
    'subsample': subsample,
    'colsample_bytree': colsample_bytree,
    'reg_alpha': reg_alpha,
    'reg_lambda': reg_lambda,
    'gamma': reg_gamma,
    'learning_rate': learning_rate,
    'n_estimators': N_ESTIMATORS,    
}

print(params)

{'max_depth': 3, 'min_child_weight': 2.0, 'subsample': 0.65, 'colsample_bytree': 0.05, 'reg_alpha': 0.001, 'reg_lambda': 0.01, 'gamma': 0.0, 'learning_rate': 0.01, 'n_estimators': 2296}


In [12]:
# evaluate without early stopping

xgb = XGBRegressor(
    objective='reg:squarederror',
    random_state=RANDOMSTATE,    
    verbosity=1,
    n_jobs=-1,
    **params
)
print(xgb)

scores = -cross_val_score(xgb, df[predictors], df[response],
                          scoring="neg_root_mean_squared_error",
                          cv=kfolds,
                          n_jobs=-1)
raw_scores = [cv_to_raw(x) for x in scores]
print()
print("Log1p CV RMSE %.06f (STD %.04f)" % (np.mean(scores), np.std(scores)))
print("Raw CV RMSE %.06f (STD %.04f)" % (np.mean(raw_scores), np.std(raw_scores)))


XGBRegressor(base_score=None, booster=None, colsample_bylevel=None,
             colsample_bynode=None, colsample_bytree=0.05, gamma=0.0,
             gpu_id=None, importance_type='gain', interaction_constraints=None,
             learning_rate=0.01, max_delta_step=None, max_depth=3,
             min_child_weight=2.0, missing=nan, monotone_constraints=None,
             n_estimators=2296, n_jobs=-1, num_parallel_tree=None,
             random_state=42, reg_alpha=0.001, reg_lambda=0.01,
             scale_pos_weight=None, subsample=0.65, tree_method=None,
             validate_parameters=False, verbosity=1)


In [13]:
# refactor for ray.tune
def my_xgb(config):
    
    # fix these configs for hyperopt
    config['max_depth'] += 2   # hyperopt needs left to start at 0 but we want to start at 2
    config['n_estimators'] = int(config['n_estimators'])   # pass float eg loguniform distribution, use int
    
    xgb = XGBRegressor(
        objective='reg:squarederror',
        n_jobs=1,
        **config,
    )
    scores = np.sqrt(-cross_val_score(xgb, df[predictors], df[response],
                                      scoring="neg_mean_squared_error",
                                      cv=kfolds))
    tune.report(mse=np.mean(scores))
    return xgb
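
The two adjustments at the top of `my_xgb` exist because of how the search space is sampled. A sketch, using plain NumPy rather than Ray Tune, of what a log-uniform draw between 100 and 10000 looks like and why it needs an integer cast before being used as `n_estimators`. The `+ 2` shift similarly maps an integer draw from 0..5 onto depths 2..7.

```python
import numpy as np

rng = np.random.default_rng(42)

# log-uniform: uniform in log space, then exponentiate
low, high = 100, 10000
draws = np.exp(rng.uniform(np.log(low), np.log(high), size=5))
n_estimators = [int(d) for d in draws]   # floats -> ints for the estimator

# integer draw in [0, 6), shifted so the smallest depth is 2
raw_depths = rng.integers(0, 6, size=5)
max_depths = raw_depths + 2

print(n_estimators)
print(max_depths.tolist())
```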


In [14]:
config = {
    'max_depth': max_depth-2,
    'min_child_weight': min_child_weight,
    'subsample': subsample,
    'colsample_bytree': colsample_bytree,
    'reg_alpha': reg_alpha,
    'reg_lambda': reg_lambda,
    'gamma': reg_gamma,
    'learning_rate': learning_rate,
    'n_estimators': N_ESTIMATORS,    
}

xgb = my_xgb(config)

print(xgb)

scores = -cross_val_score(xgb, df[predictors], df[response],
                          scoring="neg_root_mean_squared_error",
                          cv=kfolds,
                          n_jobs=-1)
raw_scores = [cv_to_raw(x) for x in scores]
print()
print("Log1p CV RMSE %.06f (STD %.04f)" % (np.mean(scores), np.std(scores)))
print("Raw CV RMSE %.06f (STD %.04f)" % (np.mean(raw_scores), np.std(raw_scores)))


Session not detected. You should not be calling this function outside `tune.run` or while using the class API. 


XGBRegressor(base_score=None, booster=None, colsample_bylevel=None,
             colsample_bynode=None, colsample_bytree=0.05, gamma=0.0,
             gpu_id=None, importance_type='gain', interaction_constraints=None,
             learning_rate=0.01, max_delta_step=None, max_depth=3,
             min_child_weight=2.0, missing=nan, monotone_constraints=None,
             n_estimators=2296, n_jobs=1, num_parallel_tree=None,
             random_state=None, reg_alpha=0.001, reg_lambda=0.01,
             scale_pos_weight=None, subsample=0.65, tree_method=None,
             validate_parameters=False, verbosity=None)


In [15]:
start_time = datetime.now()
print("%-20s %s" % ("Start Time", start_time))

algo = HyperOptSearch()
# algo = ConcurrencyLimiter(algo, max_concurrent=10)
scheduler = AsyncHyperBandScheduler()

tune_kwargs = {
    "num_samples": 512,
    "config": {
        "n_estimators": tune.loguniform(100, 10000),
        "max_depth": tune.randint(0, 6),
        'min_child_weight': tune.randint(0, 6),
        "subsample": tune.quniform(0.4, 0.9, 0.05),
        "colsample_bytree": tune.quniform(0.05, 0.8, 0.05),
        "reg_alpha": tune.loguniform(1e-04, 1),
        "reg_lambda": tune.loguniform(1e-04, 100),
        "gamma": 0,
        "learning_rate": tune.loguniform(0.001, 0.1)
    }
}

analysis = tune.run(my_xgb,
                    name="my_xgb",
                    metric="mse",
                    mode="min",
                    search_alg=algo,
                    scheduler=scheduler,
                    verbose=1,
                    **tune_kwargs)

print("%-20s %s" % ("Start Time", start_time))
print("%-20s %s" % ("End Time", datetime.now()))


Trial name,status,loc,colsample_bytree,gamma,learning_rate,max_depth,min_child_weight,n_estimators,reg_alpha,reg_lambda,subsample,iter,total time (s),mse
my_xgb_c7a07c9e,PENDING,,0.4,0,0.0186291,1,4,2722.49,0.0568616,0.000219101,0.85,,,
my_xgb_c7a711c6,PENDING,,0.45,0,0.025078,1,4,2312.5,0.0690756,0.000511991,0.85,,,
my_xgb_c7add0b0,PENDING,,0.4,0,0.0294749,1,4,1639.88,0.0330019,0.00071687,0.85,,,
my_xgb_c7b46a56,PENDING,,0.4,0,0.0104948,1,4,1216.53,0.0488063,1.68645,0.85,,,
my_xgb_c7bae53e,PENDING,,0.55,0,0.00971409,1,4,1490.05,0.0537389,0.373096,0.8,,,
my_xgb_c7c13948,PENDING,,0.55,0,0.00768961,1,4,2265.4,0.082557,1.24952,0.8,,,
my_xgb_c7c7f3aa,PENDING,,0.55,0,0.00808841,1,4,1084.78,0.000344957,2.13606,0.6,,,
my_xgb_c64f2d7c,RUNNING,,0.15,0,0.00170248,3,3,734.384,0.203226,0.0107159,0.6,,,
my_xgb_c6ac7590,RUNNING,,0.8,0,0.00905756,4,5,4412.05,0.0102789,36.6868,0.8,,,
my_xgb_c6b2a988,RUNNING,,0.75,0,0.0110132,4,5,6063.41,0.0111333,0.012667,0.85,,,


[2m[33m(pid=raylet)[0m E1009 12:19:00.761010  8339 378379712 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: Stream removed
[2m[33m(pid=raylet)[0m E1009 12:19:01.774693  8339 378379712 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 14: Transport closed
[2m[33m(pid=raylet)[0m E1009 12:19:02.786226  8339 378379712 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: Stream removed
[2m[33m(pid=raylet)[0m E1009 12:19:03.797698  8339 378379712 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: Stream removed
[2m[33m(pid=raylet)[0m E1009 12:19:04.813293  8339 378379712 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: Stream removed
[2m[33m(pid=raylet)[0m E1009 12:19:05.828049  8339 378379712 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: Stream removed
[2m[33m(pid=raylet)[0m E1009 12:19:06.84

KeyboardInterrupt: 

[2m[33m(pid=raylet)[0m E1009 12:19:12.917996  8339 378379712 node_manager.cc:3307] Failed to send get core worker stats request: IOError: 2: Stream removed


In [None]:
analysis.results_df.columns

In [None]:
analysis_results_df = analysis.results_df[['mse', 'date', 'time_this_iter_s',
       'config.n_estimators', 'config.max_depth', 'config.min_child_weight', 'config.subsample',
       'config.colsample_bytree', 'config.reg_alpha', 'config.reg_lambda', 'config.gamma',
       'config.learning_rate']].sort_values('mse')
analysis_results_df


In [None]:
max_depth = analysis_results_df.iloc[0]['config.max_depth']
min_child_weight = analysis_results_df.iloc[0]['config.min_child_weight']
subsample = analysis_results_df.iloc[0]['config.subsample']
colsample_bytree = analysis_results_df.iloc[0]['config.colsample_bytree']
reg_alpha = analysis_results_df.iloc[0]['config.reg_alpha']
reg_lambda = analysis_results_df.iloc[0]['config.reg_lambda']
reg_gamma = analysis_results_df.iloc[0]['config.gamma']
learning_rate = analysis_results_df.iloc[0]['config.learning_rate']
N_ESTIMATORS = analysis_results_df.iloc[0]['config.n_estimators']    


In [None]:
best_config = {
    'max_depth': 3,
    'min_child_weight': 0,
    'subsample': 0.9,
    'colsample_bytree': 0.1,
    'reg_alpha': 0.0734501,
    'reg_lambda': 0.0247377,
    'gamma': 0,
    'learning_rate': 0.00994503,
    'n_estimators':  5555
}

xgb = XGBRegressor(
    objective='reg:squarederror',
    random_state=RANDOMSTATE,    
    verbosity=1,
    n_jobs=-1,
    **best_config
)
print(xgb)

scores = -cross_val_score(xgb, df[predictors], df[response],
                          scoring="neg_root_mean_squared_error",
                          cv=kfolds)
print("CV Score %.04f (STD %.04f)" % (np.mean(scores), np.std(scores)))
print()

xgb.fit(X_train[predictors], y_train)
y_pred_train = xgb.predict(X_train[predictors])
y_pred_test = xgb.predict(X_test[predictors])
evaluate(y_train, y_pred_train, y_test, y_pred_test)

In [None]:
analysis.results_df['config.max_depth'].max()

In [None]:
# bayesopt
start_time = datetime.now()
print("%-20s %s" % ("Start Time", start_time))

algo = BayesOptSearch(utility_kwargs={
    "kind": "ucb",
    "kappa": 2.5,
    "xi": 0.0
})
    
# algo = ConcurrencyLimiter(algo, max_concurrent=10)
scheduler = AsyncHyperBandScheduler()

tune_kwargs = {
    "num_samples": 512,
    "config": {
        "n_estimators": tune.loguniform(100, 10000),
        "max_depth": tune.randint(0, 6),
        'min_child_weight': tune.randint(0, 6),
        "subsample": tune.quniform(0.4, 0.9, 0.05),
        "colsample_bytree": tune.quniform(0.05, 0.8, 0.05),
        "reg_alpha": tune.loguniform(1e-04, 1),
        "reg_lambda": tune.loguniform(1e-04, 100),
        "gamma": 0,
        "learning_rate": tune.loguniform(0.001, 0.1)
    }
}

analysis = tune.run(my_xgb,
                    name="my_xgb",
                    metric="mse",
                    mode="min",
                    search_alg=algo,
                    scheduler=scheduler,
                    verbose=1,
                    **tune_kwargs)

print("%-20s %s" % ("Start Time", start_time))
print("%-20s %s" % ("End Time", datetime.now()))
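
The `ucb` utility configured above scores each candidate by its predicted mean plus `kappa` times its predicted standard deviation, so a larger `kappa` favors exploration. A toy sketch with hand-written surrogate predictions follows. The numbers are invented; in practice they would come from the Gaussian-process surrogate, which maximizes its target, so a loss is typically negated first.

```python
import numpy as np

kappa = 2.5   # same value as in utility_kwargs above

# hypothetical surrogate predictions for four candidate configurations
mu = np.array([-0.110, -0.105, -0.108, -0.120])   # predicted (negated) error
sigma = np.array([0.001, 0.002, 0.010, 0.004])    # predictive uncertainty

ucb = mu + kappa * sigma               # upper confidence bound per candidate
next_candidate = int(np.argmax(ucb))   # candidate to evaluate next

print(ucb, next_candidate)
```

Here the candidate with the best predicted mean is index 1, but index 2 is chosen because its larger uncertainty leaves more room for improvement.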
