hgboost is short for **Hyperoptimized Gradient Boosting**. The aim of hgboost is to determine the most robust model by efficiently searching the parameter space using hyperoptimization, where the loss is evaluated on a train/test set with k-fold cross-validation. The final optimized model is evaluated on an independent validation set. The incorporated boosting methods are *xgboost*, *catboost* and *lightboost*. hgboost can be applied to classification and regression tasks. This notebook shows some **regression** examples.

More information can be found here:

* [Github](https://github.com/erdogant/hgboost/blob/master/README.md)
* [API documentation](https://erdogant.github.io/hgboost/)
* [Classification examples Colab](https://colab.research.google.com/github/erdogant/hgboost/blob/master/notebooks/hgboost_classification_examples.ipynb)



In [None]:
pip install hgboost

Import the hgboost library

In [None]:
from hgboost import hgboost
import numpy as np

Initialize using specified parameters. The parameters here are the default parameters.

In [None]:
hgb = hgboost(max_eval=250, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, random_state=None, verbose=3)
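
To make the size parameters concrete, here is a minimal sketch of how `val_size` and `test_size` could partition a dataset. It assumes the validation set is held out first and the test fraction is then taken from the remainder. That order, the sample count of 1000, and the helper name `split_sizes` are illustrative assumptions and are not taken from the hgboost source.

```python
def split_sizes(n_samples, val_size=0.2, test_size=0.2):
    """Return approximate sample counts per partition (illustrative only)."""
    # Assumption: the independent validation set is held out first.
    n_val = round(n_samples * val_size)
    # The remainder is what the hyperparameter search gets to see.
    n_rest = n_samples - n_val
    # Assumption: the test fraction is taken from that remainder.
    n_test = round(n_rest * test_size)
    n_train = n_rest - n_test
    return {"validation": n_val, "test": n_test, "train": n_train}

# Hypothetical dataset of 1000 samples
sizes = split_sizes(1000)
print(sizes)
```

Under these assumptions the model never sees the validation samples during the search, which is what makes the final evaluation independent.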

Import the example dataset, in this case the Titanic dataset. We set **Age** as our response variable (y). Let's see how well we can predict **Age**.

In [None]:
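# Import data
df = hgb.import_example()
# Use Age as the response variable and remove it from the features
y = df['Age'].values
del df['Age']
# Boolean mask of the samples with a known age
I = ~np.isnan(y)
# Preprocess the features, then keep only the rows with a known age
X = hgb.preprocessing(df, verbose=0)
X = X.loc[I,:]
y = y[I]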

At this point we can choose which boosting model we want to **fit**. For **regression** there is *xgboost*, *catboost* or *lightboost*. In addition, it is possible to fit an **ensemble** of all (specified) models. For demonstration we will first fit using *xgboost*. If another boosting method is desired, simply uncomment the corresponding line.

In [None]:
# Fit
# results = hgb.lightboost_reg(X, y)
# results = hgb.catboost_reg(X, y)
results = hgb.xgboost_reg(X, y)

Done! Fast and clean! We evaluated 250 sets of parameters using HyperOpt with cross-validation to determine the best set of parameters for prediction under the specified evaluation metric (for regression the default is *rmse*). We can now easily predict new samples using the **predict** function.

In [None]:
# Use the predictor
y_pred, y_proba = hgb.predict(X)

In [None]:
# First 10 elements
y_pred[0:10]
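
To judge predictions like these against the true ages, a regression metric such as the root mean squared error (RMSE) can be used; this is also the metric reported in the validation output further down. The following is a minimal standard-library sketch of RMSE. The function name `rmse` and the example ages are hypothetical and are not taken from the Titanic data or from hgboost.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical true and predicted ages
ages_true = [22.0, 38.0, 26.0, 35.0]
ages_pred = [25.0, 34.0, 26.0, 30.0]
score = rmse(ages_true, ages_pred)
print(score)
```

With the notebook's variables this would be called as `rmse(y, y_pred)`. Note that `X` here includes the samples used for fitting, so such a score would be optimistic compared with the independent validation set.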

Let's examine the hyperparameters. We can plot each individual parameter to examine its density and how it evolves over the iterations.

In [None]:
# Make some plots
hgb.plot_params(figsize=(20,20))

Examine each of the iterations. The top 10 results with cross-validation are depicted with blue bars. The green dashed line is the best model without using CV. The red dashed line is the best model with CV. It can be seen that some iterations scored higher than the best CV model but were not selected.

In [None]:
hgb.plot(figsize=(15,8))
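
The selection logic described above can be sketched as follows: rank all evaluations by their single-split loss, re-score only the top few with cross-validation, and keep the one with the lowest mean CV loss. All numbers and the names `evaluations` and `select_best` are hypothetical. This is a sketch of the idea under those assumptions, not hgboost's implementation.

```python
from statistics import mean

def select_best(evaluations, top_cv_evals=10):
    """Pick the evaluation with the lowest mean CV loss among the top candidates."""
    # Rank by the loss from the single train/test split (lower is better).
    ranked = sorted(evaluations, key=lambda e: e["loss"])
    candidates = ranked[:top_cv_evals]
    # Among the candidates, prefer the lowest mean loss across the CV folds.
    return min(candidates, key=lambda e: mean(e["cv_losses"]))

# Hypothetical search results: single-split loss plus per-fold CV losses
evaluations = [
    {"id": 0, "loss": 10.1, "cv_losses": [12.9, 13.4, 12.2]},
    {"id": 1, "loss": 10.8, "cv_losses": [11.0, 11.3, 10.9]},
    {"id": 2, "loss": 11.5, "cv_losses": [11.9, 12.1, 11.6]},
    {"id": 3, "loss": 14.0, "cv_losses": [10.0, 10.2, 10.1]},
]
best = select_best(evaluations, top_cv_evals=3)
print(best["id"])
```

In this toy example, evaluation 0 has the best single-split loss but is not selected, because its loss is less stable across folds than that of evaluation 1.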

We can now dive deeper into the cross-validation of the best performing model (red dashed line) by plotting the scores for each fold. Here we see the results for the 5 folds.

In [None]:
hgb.plot_cv()
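
The cross-validation behind this plot splits the data into k parts and uses each part once as the held-out set. Below is a minimal standard-library sketch of such a split. It uses contiguous folds without shuffling, which is a simplifying assumption; how hgboost shuffles or stratifies its folds is not shown in this notebook, and the name `kfold_indices` is hypothetical.

```python
def kfold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# Hypothetical dataset of 12 samples split into 5 folds
folds = list(kfold_indices(12, k=5))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Each sample appears in exactly one held-out fold, so the k scores together cover the whole search portion of the data.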

Plot the best performing tree, and the ranked features.

In [None]:
hgb.treeplot()

Evaluate the results on the independent validation dataset.

In [None]:
hgb.plot_validation()

Let's see whether we can improve the results using the ensemble method!

In [None]:
results = hgb.ensemble(X, y, methods=['xgb_reg','ctb_reg','lgb_reg'])

*Wow!! Much better!!!*

**[hgboost] >[Ensemble] [rmse]: 27.38 on independent validation dataset**

[hgboost] >[xgb_reg]  [rmse]: 141 on independent validation dataset

[hgboost] >[ctb_reg]  [rmse]: 128.1 on independent validation dataset

[hgboost] >[lgb_reg]  [rmse]: 147.1 on independent validation dataset


In [None]:
hgb.plot_validation()
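
One common way to combine several regressors into an ensemble is to average their predictions per sample. The sketch below illustrates that idea with the standard library only. Whether hgboost combines its models in exactly this way is an assumption, and the function name `average_predictions` and all predicted values are hypothetical.

```python
def average_predictions(predictions_per_model):
    """Average per-sample predictions across models (unweighted)."""
    n_models = len(predictions_per_model)
    # zip(*...) groups the predictions of all models for the same sample.
    return [sum(sample) / n_models for sample in zip(*predictions_per_model)]

# Hypothetical predicted ages from three models for the same four samples
preds = {
    "xgb_reg": [24.0, 36.0, 30.0, 41.0],
    "ctb_reg": [26.0, 33.0, 27.0, 40.0],
    "lgb_reg": [22.0, 39.0, 30.0, 45.0],
}
ensemble_pred = average_predictions(list(preds.values()))
print(ensemble_pred)
```

Averaging tends to help when the individual models make different errors, since those errors partly cancel out.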