# Fine-tuning your XGBoost model

## Why tune your model?
### Tuning the number of boosting rounds
  
Let's start parameter tuning by seeing how the number of boosting rounds (the number of trees you build) affects the out-of-sample performance of your XGBoost model. You'll call `xgb.cv()` inside a for loop and build one model per `num_boost_round` value, working with the Ames housing dataset.

In [1]:
import pandas as pd
import numpy as np
import xgboost as xgb


# Load df
df = pd.read_csv('../_datasets/ames_housing_trimmed_processed.csv')

# X/y split
X, y = df.iloc[:, :-1], df.iloc[:, -1]

In [2]:
# SEED
SEED = 123

# Creating the DMatrix
housing_dmatrix = xgb.DMatrix(data= X, label= y)

# Creating the parameter dictionary for each tree
params = {
    'objective' : 'reg:squarederror',
    'max_depth' : 3
}

# Creating a list containing the number of boosting rounds
number_rounds = [5, 10, 15]

# Empty list for storing the final round RMSE per XGBoost model
final_rmse_per_round = []

# Iterate over number_rounds and build one model per num_boost_round value
for curr_num_rounds in number_rounds:
    # Performing cross-validation
    cv_results = xgb.cv(
        dtrain= housing_dmatrix,
        params= params,
        nfold= 3,
        num_boost_round= curr_num_rounds,
        metrics= 'rmse',
        as_pandas= True,
        seed=SEED
    )
    # Append final round RMSE to the empty list
    final_rmse_per_round.append(cv_results['test-rmse-mean'].tail().values[-1])

# Printing the resulting dataframe
# zip() returns an iterator that pairs up the two lists; list() materializes the pairs.
# The list of pairs then becomes the rows of the DataFrame, with a name for each column.
number_rounds_rmses = list(zip(number_rounds, final_rmse_per_round))
print(pd.DataFrame(number_rounds_rmses, columns=['num_boosting_rounds', 'rmse']))

# Over this range (5 to 15 rounds), increasing the number of boosting rounds decreases the RMSE.

   num_boosting_rounds          rmse
0                    5  50903.299752
1                   10  34774.194090
2                   15  32895.099185
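
The expression `cv_results['test-rmse-mean'].tail().values[-1]` used in the loop takes the last five rows, converts them to a NumPy array, and then picks the final element. A shorter equivalent is `.iloc[-1]`. The sketch below checks this on a stand-in Series; it reuses the three RMSE values printed above purely as sample data and is not a real `xgb.cv()` result.

```python
import pandas as pd

# Stand-in for cv_results['test-rmse-mean']; values reused from the table above
test_rmse_mean = pd.Series([50903.299752, 34774.194090, 32895.099185])

via_tail = test_rmse_mean.tail().values[-1]  # last 5 rows -> NumPy array -> last element
via_iloc = test_rmse_mean.iloc[-1]           # last element directly

print(via_tail == via_iloc)
```

Both forms return the metric after the final boosting round, so either can be used in the loops in this notebook.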


### Automated boosting round selection using early_stopping
Instead of hand-picking the number of boosting rounds, you can have XGBoost select it automatically within `xgb.cv()` using a technique called **early stopping**.  
Early stopping evaluates the model against a hold-out set after every boosting round and stops adding rounds (finishing training early) if the hold-out metric (**"rmse"** in our case) has not improved for a given number of rounds.  
Here you will use the `early_stopping_rounds` parameter in `xgb.cv()` with a generous upper limit of 50 boosting rounds. Bear in mind that if the hold-out metric keeps improving until `num_boost_round` is reached, early stopping does not occur.

In [3]:
# Creating the housing DMatrix
housing_dmatrix = xgb.DMatrix(data= X, label= y)

# Creating the parameter dictionary for each tree
params = {
    'objective' : 'reg:squarederror',
    'max_depth' : 4
}

# Perform cross-validation with early-stopping
cv_results = xgb.cv(
    dtrain= housing_dmatrix,
    nfold= 3,
    params= params,
    metrics= 'rmse',
    early_stopping_rounds= 10,
    num_boost_round= 50,
    as_pandas= True,
    seed= SEED
)

# Print cv_results
print(cv_results)

    train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0     141871.635216      403.633062   142640.653507     705.559723
1     103057.033818       73.768079   104907.664683     111.117033
2      75975.967655      253.727043    79262.056654     563.766693
3      57420.530642      521.658273    61620.137859    1087.693428
4      44552.956483      544.170426    50437.560906    1846.446643
5      35763.948865      681.796675    43035.659539    2034.471115
6      29861.464164      769.571418    38600.880800    2169.796804
7      25994.675122      756.520639    36071.817710    2109.795408
8      23306.836299      759.237848    34383.186387    1934.547433
9      21459.770256      745.624640    33509.140338    1887.375358
10     20148.721060      749.612186    32916.806725    1850.893437
11     19215.382607      641.387200    32197.833474    1734.456654
12     18627.388962      716.256240    31770.852340    1802.154296
13     17960.695080      557.043324    31482.782172    1779.12
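
The stopping rule itself is simple enough to sketch in plain Python. The function below is an illustration of the idea only, not XGBoost's internal implementation; the name `early_stopping_round` and the hold-out values are hypothetical.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, rounds_run) for a lower-is-better metric.

    Training stops once `patience` consecutive rounds fail to improve
    on the best score seen so far, or when the scores run out.
    """
    best_round = 0
    best_score = float("inf")
    for current_round, score in enumerate(scores):
        if score < best_score:
            best_round, best_score = current_round, score
        elif current_round - best_round >= patience:
            return best_round, current_round + 1
    return best_round, len(scores)


# Hypothetical hold-out RMSE per boosting round (not taken from the run above)
holdout_rmse = [100.0, 80.0, 70.0, 71.0, 72.0, 69.5, 73.0, 74.0, 75.0, 76.0]

best, rounds_run = early_stopping_round(holdout_rmse, patience=3)
print(best, rounds_run)  # 5 9: best score at round 5, stopped after 9 rounds
```

If the metric improves on every round, the function runs through all scores, which mirrors the case where `num_boost_round` is reached without early stopping.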

## Overview of XGBoost's hyperparameters

Common tree tunable parameters  
- `eta`: learning rate  
The learning rate controls how quickly the model fits the residual error with each additional base learner. A model with a low learning rate needs more boosting rounds to achieve the same reduction in residual error as one with a high learning rate.  
  
- `gamma`: min loss reduction to create new tree split  
Larger values make the algorithm more conservative about adding splits, so the trained model is more strongly regularized.  
  
- `lambda`: L2 regularization on leaf weights  
Larger values shrink leaf weights more strongly toward zero, so the trained model is more strongly regularized.  
  
- `alpha`: L1 regularization on leaf weights  
Larger values can push more leaf weights to exactly zero, so the trained model is more strongly regularized.  
  
- `max_depth`: max depth per tree  
`max_depth` must be a positive integer and controls how deeply each tree is allowed to grow during any given boosting round.  
  
- `subsample`: % of samples used per tree  
`subsample` must be a value between 0 and 1 and is the fraction of the total training set that can be used for any given boosting round. If the value is low, only a small fraction of your training data is used per boosting round and you may run into underfitting problems. A very high value, on the other hand, can lead to overfitting.  
  
- `colsample_bytree`: % of features used per tree  
`colsample_bytree` is the fraction of features you can select from during any given boosting round and must also be a value between 0 and 1. A large value means that almost all features can be used to build a tree during a given boosting round, whereas a small value means that only a small fraction of features can be selected from. In general, smaller `colsample_bytree` values can be thought of as providing additional regularization to the model, whereas using all columns may in certain cases overfit a trained model.
  
Linear tunable parameters
- `lambda`: L2 reg on weights  
  
- `alpha`: L1 reg on weights  
  
- `lambda_bias`: L2 reg term on bias
  
It's important to mention that the number of boosting rounds (that is, either the number of trees you build or the number of linear base learners you construct) is itself a tunable parameter, i.e. you can also tune the number of estimators used for both base model types.
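
To make `subsample` and `colsample_bytree` concrete, here is a rough NumPy sketch of what drawing rows and columns for a single tree could look like. The dataset shape and parameter values are hypothetical, and this illustrates the idea only; XGBoost's own sampling may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(123)

n_rows, n_features = 1000, 20            # hypothetical dataset shape
subsample, colsample_bytree = 0.8, 0.5   # illustrative values

# Rows and columns made available to one tree, drawn without replacement
row_idx = rng.choice(n_rows, size=int(subsample * n_rows), replace=False)
col_idx = rng.choice(n_features, size=int(colsample_bytree * n_features), replace=False)

print(len(row_idx), len(col_idx))  # 800 rows and 10 columns for this tree
```

Each tree then sees a different random slice of the data, which is where the regularizing effect of values below 1 comes from.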

### Tuning eta
  
It's time to practice tuning other XGBoost hyperparameters in earnest and observing their effect on model performance! You'll begin by tuning "`eta`", also known as the learning rate.

The learning rate in XGBoost can range between 0 and 1. It scales the contribution of each new tree, so lower values of `eta` shrink each update more strongly and act as stronger regularization, at the cost of needing more boosting rounds.

In [4]:
# Creating the DMatrix
housing_dmatrix = xgb.DMatrix(data= X, label= y)

# Creating the parameter dictionary for each tree in the boosting round
params = {
    'objective' : 'reg:squarederror',
    'max_depth' : 3
}

# Creating list of eta values and an empty list to store final round rmse per xgboost model
eta_vals = [0.001, 0.01, 0.1]
best_rmse = []

# Systematically vary the eta
for curr_val in eta_vals:

    # Set eta in the parameter dictionary; it takes the next value from eta_vals on each iteration.
    params['eta'] = curr_val

    # performing cross-validation
    cv_results = xgb.cv(
        dtrain= housing_dmatrix,
        params = params,
        nfold= 3,
        early_stopping_rounds= 5,
        num_boost_round= 10,
        metrics= 'rmse',
        seed= SEED,
        as_pandas= True
    )

    # Append the final round rmse to best_rmse
    # tail() returns the last 5 rows of the column by default, and .values exposes them
    # as a NumPy array of up to 5 elements. Indexing with [-1] then extracts the last element,
    # which is the mean test RMSE after the final boosting round of this run.
    best_rmse.append(cv_results['test-rmse-mean'].tail().values[-1])

# Print the resulting Dataframe
print(pd.DataFrame(list(zip(eta_vals, best_rmse)), columns= ['eta', 'best_rmse']))

     eta      best_rmse
0  0.001  195736.402543
1  0.010  179932.183986
2  0.100   79759.411808
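
One way to read the table above: with only 10 boosting rounds, a small learning rate leaves most of the error in place. The back-of-the-envelope sketch below assumes an idealized setting in which each round fits the current residual perfectly and then contributes only a fraction `eta` of that fit. Real depth-3 trees do not fit perfectly, so the actual numbers differ, but the ordering is the same.

```python
num_rounds = 10  # same number of boosting rounds as the cell above

# Idealized assumption: every round removes a fraction `eta` of the remaining residual
remaining_fraction = {eta: (1 - eta) ** num_rounds for eta in [0.001, 0.01, 0.1]}

for eta, fraction in remaining_fraction.items():
    print(f"eta={eta}: {fraction:.3f} of the initial residual remains")
```

Under this assumption roughly 99% of the residual remains for `eta=0.001`, about 90% for `eta=0.01`, and about 35% for `eta=0.1`, which is why small learning rates need a much larger `num_boost_round` to be competitive.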


### Tuning max_depth
  
In this exercise, your job is to tune `max_depth`, which is the parameter that dictates the maximum depth that each tree in a boosting round can grow to. Smaller values will lead to shallower trees, and larger values to deeper trees.

In [5]:
# Create DMatrix
housing_dmatrix = xgb.DMatrix(data= X, label= y)

# Create parameter dictionary
params = {
    'objective' : 'reg:squarederror',
}

# Create list of max_depth values
max_depths = [2, 5, 10, 20]

# Empty list to store values
best_rmse = []

# Loop for tuning max_depth, systematically varying the parameter
for curr_val in max_depths:

    # Set max_depth for this iteration
    params['max_depth'] = curr_val

    # Perform cross-validation
    cv_results = xgb.cv(
        dtrain= housing_dmatrix,
        params= params,
        nfold= 2,
        early_stopping_rounds= 5,
        num_boost_round= 10,
        seed= SEED,
        as_pandas= True
    )

    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results['test-rmse-mean'].tail().values[-1])

# Print results
print(pd.DataFrame(list(zip(max_depths, best_rmse)), columns=['max_depth', 'best_rmse']))

   max_depth     best_rmse
0          2  37957.469464
1          5  35596.599504
2         10  36065.547345
3         20  36739.576068
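
Deeper trees are not automatically better here. The short sketch below copies the values from the table above, lists the upper bound on the number of leaves for each depth (a binary tree of depth `d` has at most `2**d` leaves), and picks the depth with the lowest RMSE.

```python
# max_depth -> cross-validated RMSE, copied from the table above
results = {2: 37957.469464, 5: 35596.599504, 10: 36065.547345, 20: 36739.576068}

for depth, rmse in results.items():
    # A binary tree of depth d has at most 2**d leaves
    print(f"max_depth={depth:>2}  max_leaves={2 ** depth:>7}  rmse={rmse:.1f}")

best_depth = min(results, key=results.get)
print("Best max_depth:", best_depth)  # 5
```

Beyond a moderate depth the extra capacity no longer helps on held-out data in this run, and the RMSE starts to creep back up.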


### Tuning colsample_bytree
  
Now it's time to tune `colsample_bytree`. If you've worked with scikit-learn's `RandomForestClassifier` or `RandomForestRegressor`, you've seen a closely related parameter called `max_features`. Both limit the features a tree can choose from, but they sample at different points: `max_features` draws a fresh subset at every split, whereas `colsample_bytree` draws one subset per tree. In xgboost, `colsample_bytree` must be specified as a float greater than 0 and at most 1.

In [6]:
# Creating DMatrix
housing_dmatrix = xgb.DMatrix(data= X, label= y)

# Creating param dictionary
params = {
    'objective' : 'reg:squarederror',
    'max_depth' : 3
}

# Creating list of hyperparameter values for colsample_bytree
colsample_bytree_vals = [0.1, 0.5, 0.8, 1]

# Empty list for storage
best_rmse = []

# Systematically vary the hyperparameter value
for curr_val in colsample_bytree_vals:

    # Set colsample_bytree for this iteration
    params['colsample_bytree'] = curr_val

    cv_results = xgb.cv(
        dtrain= housing_dmatrix,
        params= params,
        nfold= 2,
        num_boost_round= 10,
        early_stopping_rounds= 5,
        metrics= 'rmse',
        as_pandas= True,
        seed= SEED
    )

    # Append final round to best_rmse
    best_rmse.append(cv_results['test-rmse-mean'].tail().values[-1])

# Display results
print(pd.DataFrame(list(zip(colsample_bytree_vals, best_rmse)), columns=['colsample_bytree', 'best_rmse']))

   colsample_bytree     best_rmse
0               0.1  51386.578876
1               0.5  36585.351887
2               0.8  36093.663501
3               1.0  35836.044343


There are several other individual parameters that you can tune, such as "`subsample`", which dictates the fraction of the training data that is used during any given boosting round. Next up: Grid Search and Random Search to tune XGBoost hyperparameters more efficiently!

### Review of grid search and random search

Grid Search: Review  
Grid search is a method of exhaustively searching through a collection of possible parameter values. For example, if you have 2 hyperparameters you would like to tune, and 4 possible values for each hyperparameter, then a grid search over that parameter space would try all 16 possible parameter configurations. In a grid search, you try every parameter configuration, evaluate some metric for that configuration, and pick the parameter configuration that gave you the best value for the metric you were using, which in our case will be the root mean squared error.
- Search exhaustively over a given set of hyperparameters, once per set of hyperparameters
- Number of models = number of distinct values per hyperparameter multiplied across each hyperparameter
- Pick final model hyperparameter values that give best cross-validated evaluation metric value
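
The "2 hyperparameters with 4 values each" example can be enumerated directly with the standard library. The grid values below are hypothetical and chosen only to show the counting.

```python
from itertools import product

# Hypothetical grid: 2 hyperparameters with 4 candidate values each
param_grid = {
    "max_depth": [2, 4, 6, 8],
    "eta": [0.01, 0.05, 0.1, 0.3],
}

names = list(param_grid)
configs = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(configs))  # 4 * 4 = 16 configurations
print(configs[0])    # {'max_depth': 2, 'eta': 0.01}
```

A grid search trains and evaluates one model for each entry of `configs`.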
  
Random Search: Review  
Random search is significantly different from grid search in that the number of models that you are required to iterate over doesn't grow as you expand the overall hyperparameter space. In random search, you get to decide how many models, or iterations, you want to try out before stopping. Random search simply involves drawing a random combination of possible hyperparameter values from the range of allowable hyperparameters a set number of times. Each time, you train a model with the selected hyperparameters, evaluate the performance of that model, and then rinse and repeat. When you've created the number of models you had specified initially, you simply pick the best one. 
- Create a (possibly infinite) range of hyperparameter values per hyperparameter that you would like to search over
- Set the number of iterations you would like for the random search to continue
- During each iteration, randomly draw a value in the range of specified values for each hyperparameter searched over and train/evaluate a model with those hyperparameters
- After you've reached the maximum number of iterations, select the hyperparameter configuration with the best evaluated score
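
The four steps above can be sketched in plain Python. `toy_score` is a hypothetical stand-in for "train a model and return its cross-validated error", and the ranges are illustrative; in practice that step would be a call such as `xgb.cv()`.

```python
import random


def toy_score(max_depth, eta):
    """Hypothetical stand-in for a cross-validated error (lower is better)."""
    return (max_depth - 5) ** 2 + (eta - 0.1) ** 2


rng = random.Random(123)
n_iter = 20   # number of configurations to try
trials = []

for _ in range(n_iter):
    config = {
        "max_depth": rng.randint(2, 12),   # integer drawn from 2..12 inclusive
        "eta": rng.uniform(0.01, 0.3),     # float drawn from a continuous range
    }
    trials.append((toy_score(**config), config))

# Keep the configuration with the best (lowest) score
best_score, best_config = min(trials, key=lambda trial: trial[0])
print(best_config, best_score)
```

The cost is fixed by `n_iter`, no matter how many hyperparameters or how wide the ranges are.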

### Grid search with XGBoost
  
Now that you've learned how to tune parameters individually with XGBoost, let's take your parameter tuning to the next level by using scikit-learn's grid search and randomized search with built-in cross-validation, via the `GridSearchCV` and `RandomizedSearchCV` classes. You will use these to search over combinations of several parameters simultaneously. Let's get to work, starting with `GridSearchCV`!

In [7]:
from sklearn.model_selection import GridSearchCV


# Creating the parameter grid
gbm_param_grid = {
    'colsample_bytree' : [0.3, 0.7],
    'n_estimators' : [50],
    'max_depth' : [2, 5]
}

# Instantiate the regressor
gbm = xgb.XGBRegressor()

# Perform grid search
grid_mse = GridSearchCV(
    param_grid= gbm_param_grid,
    estimator= gbm,
    scoring= 'neg_mean_squared_error',
    cv= 4,
    verbose= 1
)

# Fit grid_mse to the data
grid_mse.fit(X, y)

# Print the best parameters and lowest RMSE
print('Best parameters found: {}'.format(grid_mse.best_params_))
print('Lowest RMSE found: {}'.format(np.sqrt(np.abs(grid_mse.best_score_))))

Fitting 4 folds for each of 4 candidates, totalling 16 fits
Best parameters found: {'colsample_bytree': 0.7, 'max_depth': 2, 'n_estimators': 50}
Lowest RMSE found: 30355.698207097197
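
scikit-learn scorers follow a higher-is-better convention, which is why the MSE is reported negated and the cell above applies `np.abs()` before taking the square root. One subtlety: the square root of the mean MSE across folds is not the same number as the average of the per-fold RMSEs. A small sketch with made-up fold values (not taken from the run above):

```python
import numpy as np

# Hypothetical per-fold MSE values for one parameter setting (4 folds)
fold_mse = np.array([8.0e8, 9.5e8, 1.1e9, 8.5e8])

# With scoring='neg_mean_squared_error', the reported score is the mean of -MSE
mean_neg_mse = np.mean(-fold_mse)

rmse_from_mean_mse = np.sqrt(np.abs(mean_neg_mse))  # the conversion used in the cell above
mean_of_fold_rmse = np.mean(np.sqrt(fold_mse))      # average of per-fold RMSEs

print(rmse_from_mean_mse, mean_of_fold_rmse)
```

The first value is never smaller than the second, so keep the conversion consistent when comparing against the `test-rmse-mean` values reported by `xgb.cv()`.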


### Random search with XGBoost
  
Often, `GridSearchCV` can be really time-consuming, so in practice, you may want to use `RandomizedSearchCV` instead, as you will do in this exercise. The good news is you only have to make a few modifications to your `GridSearchCV` code to do `RandomizedSearchCV`. The key difference is you have to specify a `param_distributions` parameter instead of a `param_grid` parameter.
  
GBM stands for "**Gradient Boosting Machine**" or "**Gradient Boosting Model**". It is a type of ensemble machine learning algorithm used for supervised learning tasks, particularly for regression and classification problems. In the code below, `gbm` refers to an instance of the `XGBRegressor` class from the XGBoost library, which is an implementation of the gradient boosting algorithm for regression problems.

In [8]:
from sklearn.model_selection import RandomizedSearchCV


# Create the parameter grid
# n_estimators is used in the gbm_param_grid dictionary to specify a range of 
# values for the number of boosting rounds (i.e., the number of decision trees) 
# that should be used in the XGBRegressor algorithm. In this case, only one value, 25, is given.
gbm_param_grid = {
    'n_estimators' : [25],
    'max_depth' : range(2,12)  # 2,3,4,5,6,7,8,9,10,11 (10 values total)
}

# Instantiate the regressor
# n_estimators=10 is only the estimator's starting value. Because n_estimators also
# appears in gbm_param_grid, the search overrides it with 25 for every candidate
# it fits, so the value 10 is never actually used during the search.
gbm = xgb.XGBRegressor(n_estimators= 10)

# Perform random search
randomized_mse = RandomizedSearchCV(
    param_distributions= gbm_param_grid,
    estimator= gbm,
    scoring= 'neg_mean_squared_error',
    n_iter= 5,
    cv= 4,
    verbose= 1
)

# Fit randomized_mse to the data
randomized_mse.fit(X, y)

# Print the best parameters and lowest RMSE
print('Best parameters found: {}'.format(randomized_mse.best_params_))
print('Lowest RMSE found: {}'.format(np.sqrt(np.abs(randomized_mse.best_score_))))

Fitting 4 folds for each of 5 candidates, totalling 20 fits
Best parameters found: {'n_estimators': 25, 'max_depth': 4}
Lowest RMSE found: 29998.4522530019


### Limits of grid search and random search
  
Grid Search
- Number of models you must build grows very quickly with every additional parameter  
  
Random Search
- Parameter space to explore can be massive
- Randomly jumping throughout the space looking for a "best" result becomes a waiting game
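
To put rough numbers on the grid search limit, the sketch below counts model fits as hyperparameters are added, next to a random search with a fixed budget. The values per hyperparameter, fold count, and budget are illustrative assumptions.

```python
n_values = 4        # candidate values per hyperparameter (illustrative)
cv_folds = 4        # folds per configuration
random_budget = 20  # fixed n_iter for a random search (illustrative)

random_fits = random_budget * cv_folds
grid_fits = {n_params: n_values ** n_params * cv_folds for n_params in range(1, 7)}

for n_params, fits in grid_fits.items():
    print(f"{n_params} hyperparameters: grid={fits:>6} fits, random={random_fits} fits")
```

With six hyperparameters the grid already needs 16384 fits, while the random search stays at 80 but covers only a tiny part of the space.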