### Flavors of Boosting

In this notebook, we build a photometric redshift estimator using various boosting methods: AdaBoost and various flavors of Gradient Boosted Trees (GBM, HistGBM, and XGBoost). We also look at using RandomizedSearchCV in order to improve our exploration of parameter space.

Our goal is to estimate photometric redshifts starting from observations of galaxy magnitudes in six different photometric bands (u, g, r, i, z, y). 


Copyright: Viviana Acquaviva (2023); see also other data credits below.

License: [BSD-3-clause](https://opensource.org/license/bsd-3-clause/)

Essentially, we try to reproduce/improve upon the results of [this paper](https://arxiv.org/abs/1903.08174), for which the data are public and available [here](http://d-scholarship.pitt.edu/36064/).

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 100)

font = {'size'   : 16}
matplotlib.rc('font', **font)
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14) 
#matplotlib.rcParams.update({'figure.autolayout': True})
matplotlib.rcParams['figure.dpi'] = 300

In [None]:
from sklearn import metrics
from sklearn.model_selection import cross_validate, KFold, cross_val_predict, GridSearchCV
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor, GradientBoostingRegressor

### We can read the data set with the selections applied in the previous notebook.

In [None]:
sel_features = pd.read_csv('sel_features.csv', sep = '\t')

In [None]:
sel_target = pd.read_csv('sel_target.csv')

In [None]:
sel_features.shape

In [None]:
sel_target.values.ravel() #changes shape to 1d row-like array

We begin with AdaBoost. As discussed in the book, stacking learners that are too weak (e.g., very short Decision Trees) doesn't help.

This allows us to run a more informed parameter optimization.
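
To make the point about overly weak learners concrete, here is a minimal sketch that compares AdaBoost with base trees of different depths. It uses a small synthetic data set (an assumption for illustration, not the notebook's photometric data), so the numbers only indicate the kind of comparison one can run before choosing a grid.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the notebook's data set):
# six features and a nonlinear target with a little noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 6))
y_toy = (np.sin(X_toy[:, 0]) * X_toy[:, 1] + 0.5 * X_toy[:, 2] ** 2
         + 0.1 * rng.normal(size=400))

cv = KFold(n_splits=5, shuffle=True, random_state=5)
depth_scores = {}
for depth in (1, 3, 6, None):
    # The base learner is passed positionally, which avoids depending on
    # the keyword name (base_estimator vs. estimator) across sklearn versions.
    ada = AdaBoostRegressor(DecisionTreeRegressor(max_depth=depth),
                            n_estimators=30, random_state=0)
    depth_scores[depth] = cross_val_score(ada, X_toy, y_toy, cv=cv).mean()

for depth, score in depth_scores.items():
    print(f"max_depth={depth}: mean CV R2 = {score:.3f}")
```

If the shallowest trees score clearly worse on your data, it makes sense to leave them out of the grid, as done in the search that follows.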

In [None]:
# %%time Wall time on my machine was 1 min 17 s

# Note: the variables after the "time" magic command are not updated - 
# e.g. the "model" object will not be the one defined later in the cell.
# This may depend on the Jupyter notebook version.

# Note: in scikit-learn < 1.2 the keyword is "base_estimator" instead of "estimator".
parameters = {'estimator__max_depth':[6,10,None], 'loss':['linear','square'], 'n_estimators':[20,50,100], 'learning_rate': [0.3,0.5,1.0]}
nmodels = np.prod([len(el) for el in parameters.values()])
model = GridSearchCV(AdaBoostRegressor(estimator=DecisionTreeRegressor()), parameters, \
                     cv = KFold(n_splits=5, shuffle=True, random_state = 5), \
                     verbose = 2, n_jobs = 4, return_train_score=True)
model.fit(sel_features,sel_target.values.ravel())

print('Best params, best score:', "{:.4f}".format(model.best_score_), \
      model.best_params_)

We can take a look at the winning model scores; in this case, we also pay attention to the standard deviation of test scores, because we want to know what differences are statistically significant when we compare different models.

In [None]:
scores = pd.DataFrame(model.cv_results_)
scoresCV = scores[['params','mean_test_score','std_test_score','mean_train_score']].sort_values(by = 'mean_test_score', \
                                                    ascending = False)
scoresCV.head()

We can see that the standard deviation is 0.03 - giving us a hint of what's significant - and that a few different models have similar scores. If you change the random seed in the cross validation, the scores will change by a similar amount, and the best model may change as well.

Additionally, the resulting scores will not be exactly reproducible because there is another random component in the adaptive learning set (this means that if you run the cross_validate function using the best model from above, you might get a different average score!)
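
The extra random component can be isolated with a small experiment. The sketch below uses synthetic data (an assumption, not the notebook's data set) and shows that fixing `random_state` in `AdaBoostRegressor`, together with fixed CV folds, makes the scores repeatable, while changing the seed shifts them.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the notebook's data set).
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(300, 6))
y_toy = X_toy[:, 0] ** 2 + X_toy[:, 1] + 0.1 * rng.normal(size=300)

# Fixed folds, so any variation comes from the regressor itself.
cv = KFold(n_splits=5, shuffle=True, random_state=5)

def cv_scores(seed):
    """Per-fold R2 scores for AdaBoost with a given random seed."""
    ada = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                            n_estimators=20, random_state=seed)
    return cross_val_score(ada, X_toy, y_toy, cv=cv)

scores_a = cv_scores(seed=0)
scores_b = cv_scores(seed=0)  # same seed as scores_a
scores_c = cv_scores(seed=1)  # different seed

print("same seed, identical scores:", np.allclose(scores_a, scores_b))
print("mean score, seed 0 vs seed 1:", scores_a.mean(), scores_c.mean())
```

The grid search above leaves `random_state` unset in the regressor, which is why re-running cross validation on its best model can give a slightly different average.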

#### Let's pick the best model and check the derived metrics as well (OLF, NMAD). We should do nested cross validation to get the generalization errors right - but if we are just comparing models, this is ok.
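
For reference, the two derived metrics can be wrapped in small helper functions. This is a sketch that follows the definitions used in the cells of this notebook (outliers have |Δz| > 0.15 (1 + z), and NMAD is 1.48 times the median of |Δz| / (1 + z)); the names `outlier_fraction` and `nmad` are hypothetical and are not used elsewhere in the notebook.

```python
import numpy as np

def outlier_fraction(z_true, z_pred, threshold=0.15):
    """Fraction of objects with |z_true - z_pred| > threshold * (1 + z_true)."""
    z_true = np.asarray(z_true, dtype=float)
    z_pred = np.asarray(z_pred, dtype=float)
    return np.mean(np.abs(z_true - z_pred) > threshold * (1 + z_true))

def nmad(z_true, z_pred):
    """1.48 * median(|z_true - z_pred| / (1 + z_true)), as used in this notebook."""
    z_true = np.asarray(z_true, dtype=float)
    z_pred = np.asarray(z_pred, dtype=float)
    return 1.48 * np.median(np.abs(z_true - z_pred) / (1 + z_true))

# Tiny hand-checkable example: only the second object is an outlier
# (|dz| = 0.4 is larger than 0.15 * (1 + 1.0) = 0.3).
z_true_toy = np.array([0.5, 1.0, 1.5, 2.0])
z_pred_toy = np.array([0.5, 1.4, 1.5, 2.1])
olf_toy = outlier_fraction(z_true_toy, z_pred_toy)
nmad_toy = nmad(z_true_toy, z_pred_toy)
print(olf_toy, nmad_toy)
```

With helpers like these, the long one-line expressions in the cells below could be replaced by calls such as `outlier_fraction(sel_target.values.ravel(), y_pred_bm)`.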

In [None]:
bm = model.best_estimator_

Let's generate predicted values.

In [None]:
y_pred_bm = cross_val_predict(bm, sel_features, sel_target.values.ravel(), cv = KFold(n_splits=5, shuffle = True, random_state=10))

Calculate outlier fraction and NMAD:

In [None]:
print(np.round(len(np.where(np.abs(sel_target.values.ravel()-y_pred_bm)>0.15*(1+sel_target.values.ravel()))[0])/len(sel_target.values.ravel()),3))

print(np.round(1.48*np.median(np.abs(sel_target.values.ravel()-y_pred_bm)/(1 + sel_target.values.ravel())),3))

These are actually .... what we obtained for the Random Forests model! But is the difference statistically significant? One way to explore this is by generating several sets of predictions, and calculating their standard deviation.

In [None]:
seeds = np.random.choice(100,8, replace = False) #pick 8 different seeds

olf = np.zeros(8)
NMAD = np.zeros(8)

for i in range(8): #A bit rough, but it gives a sense of what happens by varying the random seeds!
    print('Iteration', i) #this is just to see the progress.
    ypred = cross_val_predict(bm, sel_features, sel_target.values.ravel(), cv = KFold(n_splits=5, shuffle=True, random_state=seeds[i]))
    olf[i] = len(np.where(np.abs(sel_target.values.ravel()-ypred)>0.15*(1+sel_target.values.ravel()))[0])/len(sel_target.values.ravel())
    NMAD[i] = 1.48*np.median(np.abs(sel_target.values.ravel()-ypred)/(1 + sel_target.values.ravel()))

print('OLF avg/std: {0:.5f}, {1:0.5f}'.format(olf.mean(), olf.std()))
print('NMAD avg/std: {0:.5f}, {1:0.5f}'.format(NMAD.mean(), NMAD.std()))

The result seems to be relatively solid, indicating that AdaBoost might be slightly .... RF when we take into account not just the R2 score, but the specific metrics we are monitoring for this problem.

### The next step is to compare Adaptive Boosting with different Gradient Boosted Trees algorithms. 

We begin by using sklearn's GBM, then we move on to the lighter version, HistGBM, and finally we consider one of the most popular GBT-based algorithms, XGBoost.

We also take a look at the possibility of using a Randomized Search instead of a Grid Search in order to speed up our optimization process.

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

The parameters depend on the particular implementation.

In the sklearn formulation, the parameters of each tree are essentially the same as the ones we have for Random Forests; additionally, we have the "learning_rate" parameter, which dictates how much each tree contributes to the final estimator, and the "subsample" parameter, which allows one to use a < 1.0 fraction of samples and introduce some regularization.
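
To illustrate what these two parameters do, here is a hand-written boosting loop for the squared-error loss. It is a simplified sketch on synthetic data, not sklearn's actual implementation; the function name `boost_squared_error` and the toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_squared_error(X, y, n_estimators=50, learning_rate=0.1,
                        subsample=1.0, max_depth=3, seed=0):
    """Return the training MSE after each boosting stage (toy illustration)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    pred = np.full(n, y.mean())               # stage 0: constant prediction
    mse_history = [np.mean((y - pred) ** 2)]
    for _ in range(n_estimators):
        residuals = y - pred                  # negative gradient of 0.5 * (y - F)^2
        n_sub = max(1, int(subsample * n))    # rows used to fit this tree
        idx = rng.choice(n, size=n_sub, replace=False)
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X[idx], residuals[idx])
        # learning_rate shrinks the contribution of each new tree
        pred = pred + learning_rate * tree.predict(X)
        mse_history.append(np.mean((y - pred) ** 2))
    return np.array(mse_history)

# Synthetic stand-in data (NOT the notebook's data set).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 6))
y_toy = np.sin(X_toy[:, 0]) + X_toy[:, 1] ** 2 + 0.1 * rng.normal(size=300)

for lr, sub in [(0.1, 1.0), (0.5, 1.0), (0.5, 0.5)]:
    history = boost_squared_error(X_toy, y_toy, learning_rate=lr, subsample=sub)
    print(f"learning_rate={lr}, subsample={sub}: "
          f"training MSE {history[0]:.3f} -> {history[-1]:.3f}")
```

A smaller learning rate lowers the training error more slowly and therefore usually needs more trees, which is why `learning_rate` and `n_estimators` are best tuned together.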


### We can run the optimization process for this algorithm on a similar grid to the one used for AdaBoost.

In [None]:
# %%time
# This took about 4.5 minutes on my machine

parameters = {'max_depth':[6,10,None], 'loss':['squared_error','absolute_error'], 
              'n_estimators':[20,50,100], 'learning_rate': [0.1,0.3,0.5]}
nmodels = np.prod([len(el) for el in parameters.values()])
model = GridSearchCV(GradientBoostingRegressor(), parameters, 
                     cv = KFold(n_splits=5, shuffle=True, random_state = 5), \
                     verbose = 2, n_jobs = 4, return_train_score=True)
model.fit(sel_features,sel_target.values.ravel())

print('Best params, best score:', "{:.4f}".format(model.best_score_), \
      model.best_params_)

These scores are comparable to what we obtained with AdaBoost (slightly lower, typically). We can check what happens to the outlier fraction and NMAD.

In [None]:
bm = model.best_estimator_

In [None]:
y_pred_bm = cross_val_predict(bm, sel_features, sel_target.values.ravel(), cv = KFold(n_splits=5, shuffle = True, random_state=10))

In [None]:
print(np.round(len(np.where(np.abs(sel_target.values.ravel()-y_pred_bm)>0.15*(1+sel_target.values.ravel()))[0])/len(sel_target.values.ravel()),3))

print(np.round(1.48*np.median(np.abs(sel_target.values.ravel()-y_pred_bm)/(1 + sel_target.values.ravel())),3))

Overall, the performance of the two algorithms is similar, but one important difference is *timing*. To explore exactly the same parameter space, GBR took ~ 3 times longer than AdaBoost. Additionally, gradient boosted methods typically require more estimators, and we should explore more regularization parameters (e.g. subsampling) as well. In a nutshell, it would be great to speed things up.

### How can we make things faster?

We can improve on the time constraints in two ways: by switching to the histogram-based version of Gradient Boosting Regressor, and by using a Random Search instead of a Grid Search.

HistGradientBoostingRegressor (inspired by [LightGBM](https://lightgbm.readthedocs.io/en/latest/)) works by binning the features into integer-valued bins (in sklearn, the `max_bins` parameter defaults to 255, which is also its maximum, and one additional bin is reserved for missing values), which greatly reduces the number of splitting points to consider, and results in a vast reduction of computation time, especially for large data sets.
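
The effect of binning on the number of candidate splits can be seen with a few lines of NumPy. This is a simplified sketch of quantile-based binning on a synthetic feature; sklearn's internal binning differs in its details, and the variable names are assumptions for illustration.

```python
import numpy as np

# Synthetic "magnitude-like" feature (NOT the notebook's data).
rng = np.random.default_rng(0)
feature = rng.normal(loc=22.0, scale=1.5, size=10_000)

n_bins = 255
# n_bins - 1 interior edges placed at evenly spaced quantiles.
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
# Each value is replaced by the integer index of its bin (0 ... n_bins - 1).
binned = np.searchsorted(edges, feature).astype(np.uint8)

# Candidate thresholds: one between each pair of consecutive distinct values.
n_raw_splits = np.unique(feature).size - 1
n_hist_splits = np.unique(binned).size - 1
print("candidate splits on raw values:   ", n_raw_splits)
print("candidate splits on binned values:", n_hist_splits)
```

On the raw feature a tree has to consider thousands of thresholds, while on the binned feature there are at most `n_bins - 1`, regardless of the size of the data set.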

In [None]:
from sklearn.ensemble import HistGradientBoostingRegressor

In [None]:
# %%time
# This took ~ 18s

parameters = {'max_depth':[6,10,None], 'loss':['squared_error','absolute_error'], 
              'max_iter':[20,50,100], 'learning_rate': [0.1,0.3,0.5]}
nmodels = np.prod([len(el) for el in parameters.values()])
model = GridSearchCV(HistGradientBoostingRegressor(), parameters, 
                     cv = KFold(n_splits=5, shuffle=True, random_state = 5), \
                     verbose = 2, n_jobs = 4, return_train_score=True)
model.fit(sel_features,sel_target.values.ravel())

print('Best params, best score:', "{:.4f}".format(model.best_score_), \
      model.best_params_)

In [None]:
scores = pd.DataFrame(model.cv_results_)
scoresCV = scores[['params','mean_test_score','std_test_score','mean_train_score']].sort_values(by = 'mean_test_score', \
                                                    ascending = False)
scoresCV.head()

Even for this relatively small data set, this is much faster (about 15x faster than GradientBoostingRegressor), giving us a chance to explore a wider parameter space (e.g. more trees, more options for learning rate). The trade-off is that we obtain a slight decrease in performance, compared with GBR. However, the standard deviation of test scores over the 5 CV folds suggests that this difference is not statistically significant.

Let's explore a wider parameter space here:

In [None]:
#%%time
# This took 2 min 42 secs

parameters = {'max_depth':[6,10,None], 'loss':['squared_error','absolute_error'], 
              'max_iter':[100,200,500], 'learning_rate': [0.05, 0.1,0.3,0.5], 
              'early_stopping':[True, False]}
nmodels = np.prod([len(el) for el in parameters.values()])
model = GridSearchCV(HistGradientBoostingRegressor(), parameters, cv = KFold(n_splits=5, shuffle=True), \
                     verbose = 2, n_jobs = 4, return_train_score=True)
model.fit(sel_features,sel_target.values.ravel())

print('Best params, best score:', "{:.4f}".format(model.best_score_), \
      model.best_params_)

In [None]:
scores = pd.DataFrame(model.cv_results_)
scoresCV = scores[['params','mean_test_score','std_test_score','mean_train_score']].sort_values(by = 'mean_test_score', \
                                                    ascending = False)
scoresCV.head()

HistGBR seems to improve (as expected) by using more trees and a smaller learning rate.

### Comparison with Random Search

In [None]:
from sklearn.model_selection import RandomizedSearchCV

Finally, we can compare the performance and timings of the grid search above with the option of using a Randomized Search instead. We note that Random Search is usually preferable when we have a high-dimensional parameter space; its use is not particularly warranted here.

The number of iterations (the number of models that are considered) also needs to be adjusted, and depends on the dimensionality of the parameter space as well as the functional dependence of the loss function on the parameters. We will compare the timings with the cell above, where we explore 144 models, and only use 30 for the random search.
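
The model counts quoted above can be checked directly with sklearn's `ParameterGrid` and `ParameterSampler` utilities. The sketch below reuses the same parameter dictionary as the wider grid search; the fixed `random_state` is an assumption added here only to make the sample repeatable.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

parameters = {'max_depth': [6, 10, None],
              'loss': ['squared_error', 'absolute_error'],
              'max_iter': [100, 200, 500],
              'learning_rate': [0.05, 0.1, 0.3, 0.5],
              'early_stopping': [True, False]}

# Full grid: every combination of the listed values.
n_grid = len(ParameterGrid(parameters))

# Random search: n_iter combinations drawn from the same space.
# When all parameters are given as lists, sampling is without replacement.
sampled = list(ParameterSampler(parameters, n_iter=30, random_state=0))

print("models in the full grid:", n_grid)
print("models in the random search:", len(sampled))
print("fraction of the grid explored: {:.2f}".format(len(sampled) / n_grid))
print("first sampled combination:", sampled[0])
```

Each model is then fit once per CV fold, so the total number of fits is the number of models times `n_splits`.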

The references below explore various ways of running a parameter search.

Bergstra, J. and Bengio, Y., Random search for hyper-parameter optimization, The Journal of Machine Learning Research (2012)

Bergstra, James, et al. "Algorithms for hyper-parameter optimization." Advances in neural information processing systems 24 (2011).

In [None]:
#%%time
# 31 seconds

parameters = {'max_depth':[6,10,None], 'loss':['squared_error','absolute_error'], 
              'max_iter':[100,200,500], 'learning_rate': [0.05, 0.1,0.3,0.5], 
             'early_stopping':[True, False]}
nmodels = np.prod([len(el) for el in parameters.values()])
model = RandomizedSearchCV(HistGradientBoostingRegressor(), parameters, cv = KFold(n_splits=5, shuffle=True), \
                     verbose = 2, n_jobs = 4, return_train_score=True, n_iter=30)
model.fit(sel_features,sel_target.values.ravel())

print('Best params, best score:', "{:.4f}".format(model.best_score_), \
      model.best_params_)

The Randomized Search was able to find a comparably good solution in less than 1/5 of the time. As we mentioned, the true gains of a Randomized Search pertain to exploring high-dimensional spaces. It is also possible to use the Randomized Search to find the general area of optimal parameters, and then refine the search in that neighborhood with a finer Grid Search.

Actually, probably a better idea (thanks Paco!) is to use [Optuna](https://optuna.org/), a general-purpose hyperparameter optimizer that would explore the hyperparameter space in a smart way (learning on the fly about where the minima are).

### Finally, we can compare with XGBoost.

[XGBoost](https://xgboost.readthedocs.io/en/latest/index.html#) stands for “Extreme Gradient Boosting”. It is sometimes known as "regularized" GBM, as it has a default regularization term on the weights of the ensemble, and is more robust to overfitting. It has more flexibility in defining weak learners, as well as the objective (loss) function (note that this doesn't apply to the base estimators, e.g. how splits in trees are chosen, but on the loss that is used to compute pseudoresiduals and gradients). 
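
To make the role of the objective more concrete, the sketch below writes out the gradient and second derivative (Hessian) of the squared log error, 0.5 * [log(1 + pred) - log(1 + y)]^2, which is the loss behind XGBoost's `reg:squaredlogerror` objective. This is plain NumPy and does not use XGBoost's API; the function names and toy values are assumptions for illustration.

```python
import numpy as np

def squared_log_error(pred, y):
    """Per-sample loss: 0.5 * (log(1 + pred) - log(1 + y))**2."""
    return 0.5 * (np.log1p(pred) - np.log1p(y)) ** 2

def grad_hess_squared_log_error(pred, y):
    """First and second derivatives of the loss with respect to pred."""
    diff = np.log1p(pred) - np.log1p(y)
    grad = diff / (pred + 1.0)
    hess = (1.0 - diff) / (pred + 1.0) ** 2
    return grad, hess

# Toy targets and current predictions (hypothetical values).
y_toy = np.array([0.2, 0.8, 1.5])
pred_toy = np.array([0.3, 0.5, 2.0])
grad, hess = grad_hess_squared_log_error(pred_toy, y_toy)

# Check the analytic gradient against a central finite difference.
eps = 1e-5
num_grad = (squared_log_error(pred_toy + eps, y_toy)
            - squared_log_error(pred_toy - eps, y_toy)) / (2 * eps)
print("analytic gradient: ", grad)
print("numerical gradient:", num_grad)
```

Changing the objective changes these derivatives, and therefore what each new tree is asked to fit, while the way the trees themselves are grown stays the same.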


In [None]:
import xgboost as xgb

Medium article explaining XGBoost: [here](https://towardsdatascience.com/a-beginners-guide-to-xgboost-87f5d4c30ed7); some nice tutorials from XGBoost's site: [here](https://xgboost.readthedocs.io/en/latest/tutorials/index.html)

We can begin by using Grid Search and the original parameter space, in order to compare timings with GBM and HistGBM.

In [None]:
# %%time

# This took ~ 3 minutes

parameters = {'max_depth':[6,10,None], 'n_estimators':[50,100,200], 
              'learning_rate': [0.1, 0.3,0.5], 'objective':['reg:squarederror','reg:squaredlogerror']}


nmodels = np.prod([len(el) for el in parameters.values()])
model = GridSearchCV(xgb.XGBRegressor(), parameters, cv = KFold(n_splits=5, shuffle=True), \
                     verbose = 2, n_jobs = 4, return_train_score=True)
model.fit(sel_features,sel_target.values.ravel())

print('Best params, best score:', "{:.4f}".format(model.best_score_), \
      model.best_params_)

XGBoost is slightly more efficient than GBM, and achieves comparable results on a similar grid. We can use the Random Search to explore some more intensive models (more trees, lower learning rate), and add subsampling as an extra form of regularization.

In [None]:
#%%time

# 3 min 36 secs

parameters = {'max_depth':[6,10,None], 'n_estimators':[50, 100, 200,500], 
              'learning_rate': [0.02,0.05,0.1,0.3], 'objective':['reg:squarederror',
                'reg:squaredlogerror'], 'subsample':[0.5,1]}

nmodels = np.prod([len(el) for el in parameters.values()])
model = RandomizedSearchCV(xgb.XGBRegressor(), parameters, cv = KFold(n_splits=5, shuffle=True), \
                     verbose = 2, n_jobs = 4, return_train_score=True, n_iter = 30)
model.fit(sel_features,sel_target.values.ravel())

print('Best params, best score:', "{:.4f}".format(model.best_score_), \
      model.best_params_)

In [None]:
scores = pd.DataFrame(model.cv_results_)
scoresCV = scores[['params','mean_test_score','std_test_score','mean_train_score']].sort_values(by = 'mean_test_score', \
                                                    ascending = False)
scoresCV.head()

We are able to get slightly higher scores using this wider parameter space in combination with the Randomized Search, but again, the statistical significance of this increase is very low.

We can also look for outlier fraction and NMAD:

In [None]:
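bm = model.best_estimator_ # update bm to the best XGBoost model; otherwise it would still be the GBM model selected earlier
y_pred_bm = cross_val_predict(bm, sel_features, sel_target.values.ravel(), cv = KFold(n_splits=5, shuffle = True, random_state=5))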

In [None]:
print(len(np.where(np.abs(sel_target.values.ravel()-y_pred_bm)>0.15*(1+sel_target.values.ravel()))[0])/len(sel_target.values.ravel()))

print(1.48*np.median(np.abs(sel_target.values.ravel()-y_pred_bm)/(1 + sel_target.values.ravel())))

### Conclusion: all boosting algorithms behave fairly similarly for this data set. It might be worth simply using the fastest one (HistGBR + Random Search).