# Extreme Gradient Boosting with XGBoost

## Fine-tuning your XGBoost model

In [1]:
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

In [2]:
california_housing = fetch_california_housing(as_frame=True)

# Convert to Pandas
housing = pd.DataFrame(data=california_housing.data, columns=california_housing.feature_names)
housing["target"] = california_housing.target
housing.head()

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  target
0  8.3252      41.0  6.984127    1.02381       322.0  2.555556     37.88    -122.23   4.526
1  8.3014      21.0  6.238137    0.97188      2401.0  2.109842     37.86    -122.22   3.585
2  7.2574      52.0  8.288136   1.073446       496.0   2.80226     37.85    -122.24   3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25   3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25   3.422


### Tuning the number of boosting rounds
- Create a DMatrix called housing_dmatrix from X and y.
- Create a parameter dictionary called params, passing in the appropriate "objective" ("reg:squarederror") and "max_depth" (set it to 3).
- Iterate over num_rounds inside a for loop and perform 3-fold cross-validation. In each iteration of the loop, pass in the current number of boosting rounds (curr_num_rounds) to xgb.cv() as the argument to num_boost_round.
- Append the final boosting round RMSE for each cross-validated XGBoost model to the final_rmse_per_round list.
- num_rounds and final_rmse_per_round have been zipped and converted into a DataFrame so you can easily see how the model performs with each boosting round.

In [3]:
# Features are every column except the last; the target is the last column
X, y = housing.iloc[:, :-1], housing.iloc[:, -1]

# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree: params 
params = {"objective":"reg:squarederror", "max_depth":3}

# Create list of number of boosting rounds
num_rounds = [5, 10, 15]

# Empty list to store final round rmse per XGBoost model
final_rmse_per_round = []

# Iterate over num_rounds and build one model per num_boost_round parameter
for curr_num_rounds in num_rounds:

    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3, num_boost_round=curr_num_rounds, metrics="rmse", as_pandas=True, seed=123)
    
    # Append final round RMSE
    final_rmse_per_round.append(cv_results["test-rmse-mean"].iloc[-1])

# Print the resultant DataFrame
num_rounds_rmses = list(zip(num_rounds, final_rmse_per_round))
print(pd.DataFrame(num_rounds_rmses,columns=["num_boosting_rounds","rmse"]))

   num_boosting_rounds      rmse
0                    5  0.781960
1                   10  0.622003
2                   15  0.581889


### Automated boosting round selection using early_stopping

Early stopping evaluates the XGBoost model against a hold-out dataset after every boosting round and stops adding rounds (finishing training early) if the hold-out metric ("rmse" in our case) does not improve for a given number of rounds. Here you will use the early_stopping_rounds parameter in xgb.cv() with a large maximum number of boosting rounds (50). Bear in mind that if the hold-out metric keeps improving until num_boost_round is reached, early stopping does not occur.
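The stopping rule itself can be sketched in plain Python. This is only an illustration of the idea, not XGBoost's implementation: the function name `early_stopping_round`, the `patience` argument (standing in for early_stopping_rounds), and the RMSE values are all hypothetical.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, rounds_run) for a lower-is-better metric.

    Training stops once `patience` consecutive rounds fail to improve
    on the best score seen so far.
    """
    best_round, best_score = 0, float("inf")
    for round_idx, score in enumerate(scores):
        if score < best_score:
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            return best_round, round_idx + 1  # stopped early
    return best_round, len(scores)  # ran through every available round


# Hypothetical hold-out RMSE per boosting round (not taken from a real run)
holdout_rmse = [1.40, 1.10, 0.90, 0.80, 0.82, 0.81, 0.83, 0.79, 0.78]

print(early_stopping_round(holdout_rmse, patience=3))  # (3, 7)
print(early_stopping_round(holdout_rmse, patience=5))  # (8, 9)
```

With a patience of 3 the loop gives up after round 6 and reports round 3 as the best, missing the later improvement; a patience of 5 rides out the plateau and uses every round. This is the trade-off to keep in mind when choosing early_stopping_rounds.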

In [4]:
# Create the parameter dictionary for each tree: params
params = {"objective":"reg:squarederror", "max_depth":4}

# Perform cross-validation with early stopping: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params = params, nfold=3, num_boost_round = 50,
                   early_stopping_rounds=10, metrics="rmse", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

    train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0          1.462828        0.005529        1.465340       0.011217
1          1.143124        0.004570        1.150056       0.009817
2          0.939146        0.003773        0.948381       0.007459
3          0.808610        0.004855        0.821553       0.007730
4          0.725585        0.006141        0.739873       0.005441
5          0.671414        0.004638        0.688221       0.005761
6          0.633774        0.003228        0.650904       0.008133
7          0.606048        0.008341        0.624120       0.008888
8          0.586223        0.008837        0.605858       0.006527
9          0.568964        0.005324        0.589065       0.011414
10         0.555295        0.002455        0.576495       0.008162
11         0.546593        0.002349        0.570481       0.009861
12         0.540629        0.001898        0.566814       0.011006
13         0.528651        0.003696        0.555914       0.00

### Tuning ETA (Learning Rate)
- Create a list called eta_vals to store the following "eta" values: 0.001, 0.01, and 0.1.
- Iterate over your eta_vals list using a for loop.
- In each iteration of the for loop, set the "eta" key of params to be equal to curr_val. Then, perform 3-fold cross-validation with early stopping (5 rounds), 10 boosting rounds, a metric of "rmse", and a seed of 123. Ensure the output is a DataFrame.
- Append the final round RMSE to the best_rmse list.
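To see why a small "eta" needs many boosting rounds, here is a minimal sketch that boosts the simplest possible learner, a constant, under squared error. It is an assumption-laden toy, not XGBoost: the function `gap_after_boosting` and the starting prediction of 0.0 are hypothetical, and the targets are just the first five target values from the housing table.

```python
def gap_after_boosting(targets, eta, num_rounds, start=0.0):
    """Boost a constant learner and return |mean(targets) - prediction|."""
    prediction = start
    for _ in range(num_rounds):
        # For squared error, the best constant step is the mean residual
        mean_residual = sum(t - prediction for t in targets) / len(targets)
        prediction += eta * mean_residual  # eta shrinks every step
    return abs(sum(targets) / len(targets) - prediction)


# First five target values from the housing table
targets = [4.526, 3.585, 3.521, 3.413, 3.422]

for eta in [0.001, 0.01, 0.1]:
    gap = gap_after_boosting(targets, eta, num_rounds=10)
    print(f"eta={eta}: remaining gap to the target mean = {gap:.4f}")
```

In this toy each round closes only a fraction eta of the remaining gap, so after n rounds a factor of (1 - eta)^n is still left. With 10 rounds, eta = 0.001 leaves about 99% of the initial error, which suggests why the smallest eta values can score poorly when the number of boosting rounds is capped at 10.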

In [5]:
# Create your housing DMatrix: housing_dmatrix
# housing_dmatrix = xgb.DMatrix(data=X, label=y) ~ done above

# Create the parameter dictionary for each tree (boosting round)
params = {"objective":"reg:squarederror", "max_depth":3}

# Create a list of eta values and empty list to store final round rmse per xgboost model
eta_vals = [0.001, 0.01, 0.1]
best_rmse = []

# Systematically vary the eta
for curr_val in eta_vals:
  
  params["eta"] = curr_val
  
  # Perform cross-validation: cv_results
  cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,
                     num_boost_round=10, early_stopping_rounds=5,
                     metrics="rmse", as_pandas=True, seed=123)
  
  # Append the final round rmse to best_rmse
  best_rmse.append(cv_results["test-rmse-mean"].iloc[-1])
  
# Print the resultant DataFrame
print(pd.DataFrame(list(zip(eta_vals, best_rmse)), columns=["eta", "best_rmse"]))

     eta  best_rmse
0  0.001   1.931078
1  0.010   1.793167
2  0.100   0.974117


### Tuning max_depth
- Create a list called max_depths to store the following "max_depth" values: 2, 5, 10, and 20.
- Iterate over your max_depths list using a for loop.
- Systematically vary "max_depth" in each iteration of the for loop and perform 2-fold cross-validation with early stopping (5 rounds), 10 boosting rounds, a metric of "rmse", and a seed of 123. Ensure the output is a DataFrame.
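As a rough guide to what each "max_depth" value allows, the short sketch below computes the upper bound on leaves for a binary tree of each depth. It is a back-of-the-envelope calculation only; real trees usually stop splitting well before this bound because of other constraints.

```python
# Candidate depths from the exercise
max_depths = [2, 5, 10, 20]

# A binary tree of depth d has at most 2**d leaves
max_leaf_counts = {depth: 2 ** depth for depth in max_depths}

for depth, leaves in max_leaf_counts.items():
    print(f"max_depth={depth:>2}: at most {leaves:,} leaves per tree")
```

Capacity grows exponentially with depth: a depth-20 tree could in principle hold over a million leaves, enough to isolate very small groups of rows, which is one reason very deep trees tend to overfit.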

In [6]:
# Create the parameter dictionary
params = {"objective":"reg:squarederror"}

# Create list of max_depth values
max_depths = [2, 5, 10, 20]
best_rmse = []

# Systematically vary the max_depth
for curr_val in max_depths:

    params["max_depth"] = curr_val
    
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                 num_boost_round=10, early_stopping_rounds=5,
                 metrics="rmse", as_pandas=True, seed=123)
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].iloc[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(max_depths, best_rmse)),columns=["max_depth","best_rmse"]))

   max_depth  best_rmse
0          2   0.696380
1          5   0.562417
2         10   0.541528
3         20   0.565039


### Tuning colsample_bytree

Tune "colsample_bytree". If you have worked with scikit-learn's RandomForestClassifier or RandomForestRegressor, you have seen a related parameter called max_features. Both parameters **limit the features a tree may use, but they sample at different points:** max_features draws a new subset of features at every split, whereas colsample_bytree draws one subset per tree and uses it for every split in that tree. In xgboost, colsample_bytree must be specified as a float greater than 0 and at most 1.

------------------------------------------------------------------------------------------------------------------------------------------
- Create a list called colsample_bytree_vals to store the values 0.1, 0.5, 0.8, and 1.
- Systematically vary "colsample_bytree" and perform cross-validation, exactly as you did with max_depth and eta previously.
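The per-tree column sampling behind "colsample_bytree" can be sketched as follows. This is a simplified illustration, not XGBoost's code: the helper `sample_columns_per_tree`, the seed, and the rule for turning a fraction into a column count (round down, keep at least one) are assumptions.

```python
import random

# Feature columns of the housing data
feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]


def sample_columns_per_tree(features, colsample_bytree, num_trees, seed=123):
    """Draw one random column subset per tree (hypothetical helper)."""
    rng = random.Random(seed)
    # Assumption: round the column count down but always keep at least one
    n_keep = max(1, int(colsample_bytree * len(features)))
    return [sorted(rng.sample(features, n_keep)) for _ in range(num_trees)]


for fraction in [0.1, 0.5, 0.8, 1]:
    subsets = sample_columns_per_tree(feature_names, fraction, num_trees=3)
    print(fraction, [len(subset) for subset in subsets], subsets[0])
```

Under these assumptions a fraction of 0.1 leaves each tree a single feature out of eight, which gives an intuition for why the smallest colsample_bytree value can perform worst when only a few boosting rounds are available.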

In [7]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary
params={"objective":"reg:squarederror","max_depth":3}

# Create list of hyperparameter values: colsample_bytree_vals
colsample_bytree_vals = [0.1, 0.5, 0.8, 1]
best_rmse = []

# Systematically vary the hyperparameter value

for curr_val in colsample_bytree_vals:
  
  params['colsample_bytree'] = curr_val
  
  # Perform cross-validation
  cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
               num_boost_round=10, early_stopping_rounds=5,
               metrics="rmse", as_pandas=True, seed=123)
    
  # Append the final round rmse to best_rmse
  best_rmse.append(cv_results["test-rmse-mean"].iloc[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(colsample_bytree_vals, best_rmse)), columns=["colsample_bytree","best_rmse"]))

   colsample_bytree  best_rmse
0               0.1   0.852127
1               0.5   0.655957
2               0.8   0.625618
3               1.0   0.623250


### Grid search with XGBoost

Now that you've learned how to tune parameters individually with XGBoost, let's take your parameter tuning to the next level by using scikit-learn's grid search and randomized search capabilities, with internal cross-validation, through the GridSearchCV and RandomizedSearchCV classes. You will use these to search for the best model across a collection of possible values for several parameters simultaneously; GridSearchCV does so exhaustively.

---------------------------------------------------------------------------------------------------------------------------

- Create a parameter grid called gbm_param_grid that contains a list of "colsample_bytree" values (0.3, 0.7), a list with a single value for "n_estimators" (50), and a list of 2 "max_depth" (2, 5) values.
- Instantiate an XGBRegressor object called gbm.
- Create a GridSearchCV object called grid_mse, passing in: the parameter grid to param_grid, the XGBRegressor to estimator, "neg_mean_squared_error" to scoring, and 4 to cv. Also specify verbose=1 so you can better understand the output.
- Fit the GridSearchCV object to X and y.
- Print the best parameter values and lowest RMSE, using the .best_params_ and .best_score_ attributes, respectively, of grid_mse.

**Note: we are importing the GridSearchCV class from sklearn**

In [8]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
  'colsample_bytree':[0.3, 0.7],
  'n_estimators':[50],
  'max_depth':[2, 5]
}

# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor()

# Perform grid search: grid_mse
grid_mse = GridSearchCV(estimator=gbm, param_grid=gbm_param_grid, 
                        scoring="neg_mean_squared_error", cv=4, verbose=1)

# Fit grid_mse to the data
grid_mse.fit(X, y)

# Print the best parameters and lowest RMSE
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))

Fitting 4 folds for each of 4 candidates, totalling 16 fits
Best parameters found:  {'colsample_bytree': 0.7, 'max_depth': 5, 'n_estimators': 50}
Lowest RMSE found:  0.7076752258703651
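The "4 candidates, totalling 16 fits" line in the output follows directly from the grid: every combination of values is tried, and each combination is fitted once per fold. A minimal sketch of that enumeration, using only the standard library rather than scikit-learn's internals:

```python
from itertools import product

gbm_param_grid = {
    "colsample_bytree": [0.3, 0.7],
    "n_estimators": [50],
    "max_depth": [2, 5],
}
cv_folds = 4

# Cartesian product of all value lists: one dict per candidate model
names = list(gbm_param_grid)
candidates = [dict(zip(names, values))
              for values in product(*gbm_param_grid.values())]

for candidate in candidates:
    print(candidate)

total_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

Because the candidate count is the product of the list lengths, adding a third value to each of the two varied parameters would already raise it from 4 to 9.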


### Random Search with XGBoost

GridSearchCV can often be very time-consuming, so in practice you may want to use RandomizedSearchCV instead, as you will do in this exercise. The good news is that you only have to make a few modifications to your GridSearchCV code to use RandomizedSearchCV. The key difference is that you specify a param_distributions parameter instead of a param_grid parameter, along with the number of candidates to sample (n_iter).

---------------------------------------------------------------------------------------------------------------------------------------------------

- Create a parameter grid called gbm_param_grid that contains a list with a single value for 'n_estimators' (25) and the 'max_depth' values from 2 to 11; use range(2, 12) or, as in the code below, np.arange(2, 12) for this.
- Create a RandomizedSearchCV object called randomized_mse, passing in: the parameter grid to param_distributions, the XGBRegressor to estimator, "neg_mean_squared_error" to scoring, 5 to n_iter, and 4 to cv. Also specify verbose=1 so you can better understand the output.
- Fit the RandomizedSearchCV object to X and y.

**Note: we are importing the RandomizedSearchCV class from sklearn**

In [10]:
# Import the RandomizedSearchCV library
from sklearn.model_selection import RandomizedSearchCV

# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
  'n_estimators':[25],
  'max_depth':np.arange(2,12)
}

# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor(n_estimators=10)

# Perform random search: randomized_mse
randomized_mse = RandomizedSearchCV(estimator=gbm, param_distributions=gbm_param_grid,
                                 scoring='neg_mean_squared_error', n_iter=5, cv=4, verbose=1)

# Fit the randomized_mse to the data
randomized_mse.fit(X,y)

# Print the best parameters and lowest RMSE
print("Best parameters found: ", randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))

Fitting 4 folds for each of 5 candidates, totalling 20 fits
Best parameters found:  {'n_estimators': 25, 'max_depth': 5}
Lowest RMSE found:  0.6526003321681331
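Unlike the grid search, only n_iter of the possible candidates are tried here, which is why the output reports 5 candidates and 20 fits although the grid holds 10 combinations. The sketch below imitates that sampling with the standard library; the seed is hypothetical, and it assumes sampling without replacement, which is what scikit-learn documents for the case where every parameter is given as a list.

```python
import random
from itertools import product

gbm_param_grid = {"n_estimators": [25], "max_depth": list(range(2, 12))}
n_iter, cv_folds = 5, 4

# Every combination the grid could produce
names = list(gbm_param_grid)
all_candidates = [dict(zip(names, values))
                  for values in product(*gbm_param_grid.values())]

# Draw n_iter distinct candidates at random (hypothetical seed)
rng = random.Random(123)
sampled = rng.sample(all_candidates, n_iter)

for candidate in sampled:
    print(candidate)
print(f"{len(sampled)} of {len(all_candidates)} candidates, "
      f"{len(sampled) * cv_folds} fits")
```

Since the draw is random, two runs can evaluate different candidates and report different best parameters; passing random_state to RandomizedSearchCV makes the selection repeatable.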


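**Both approaches have limits: for Grid Search, the number of models to fit multiplies with every additional hyperparameter and value, so the search space can become massive; for Random Search, the run time is fixed by n_iter, but a fixed number of random draws covers a shrinking share of a growing search space, so finding a good configuration becomes less likely.**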