# Capstone Project
### Jupyter Notebook (3/4)

### Bruno Athayde e Silva - 448898

#### XGBoost Tuning

---

### Table of Contents

### [Introduction](#introduction)

### [Methodology](#methodology)

   - [Step 1: Optimal scaler, 'learning_rate' and 'n_estimators'](#step_1)
   
   - [Step 2: Optimal 'max_depth', 'reg_alpha', 'reg_lambda'](#step_2)
   
   - [Step 3: Optimal 'subsample' and 'colsample_bytree'](#step_3)
   
### [Next Steps](#next)

---

### Introduction
<a id = 'introduction'></a>

In this third Jupyter Notebook, the goal is to find the most suitable hyperparameters for the model predicting ***FarePerMile***. These values will later be used to optimize the XGBoost regressor that predicts the target variable from the model's features.

I will start by importing some essential libraries and loading the cleaned data from the EDA Jupyter Notebook.

In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns


# ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# set the display to show every column and row
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [3]:
# load the data
X = pd.read_csv('/Volumes/GoogleDrive/My Drive/BrainStation/CAPSTONE/DATABASE/FINAL/tkt_final_X.csv')
X = X.drop(columns = 'Unnamed: 0')

y = pd.read_csv('/Volumes/GoogleDrive/My Drive/BrainStation/CAPSTONE/DATABASE/FINAL/tkt_final_Y.csv')
y = y.drop(columns = 'Unnamed: 0')

In [4]:
# check the dataframe
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731515 entries, 0 to 731514
Columns: 585 entries, _Year to TPA-PHL
dtypes: float64(3), int64(582)
memory usage: 3.2 GB


In [5]:
# check the dataframe
y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731515 entries, 0 to 731514
Data columns (total 1 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   FarePerMile  731515 non-null  float64
dtypes: float64(1)
memory usage: 5.6 MB


---

### Methodology
<a id = 'methodology'></a>

After loading the dataset and defining **X** and **y**, I first need to split it into training, validation and test sets before working with the model.

After splitting the dataset, I will import a few additional libraries and start testing the hyperparameters for the XGBoost model, using **Pipeline** and **GridSearchCV**, from Scikit-Learn. 

In [6]:
# import train_test_split
from sklearn.model_selection import train_test_split

In [7]:
# train_test_split
# first split
X_rem, X_test, y_rem, y_test = train_test_split(X,    
                                                y,
                                                test_size = 0.10,     
                                                random_state = 15)

In [8]:
# train_test_split
# second split: 15% of the remaining 90% goes to validation
X_train, X_valid, y_train, y_valid = train_test_split(X_rem, 
                                                      y_rem, 
                                                      test_size = 0.15, 
                                                      random_state = 15)

In [9]:
# check dataframes shapes
print(f"The shape of the X_train dataframe is: {X_train.shape}.")
print(f"The shape of the X_valid dataframe is: {X_valid.shape}.")
print(f"The shape of the X_test dataframe is: {X_test.shape}.\n")
print(f"The shape of the y_train dataframe is: {y_train.shape}.")
print(f"The shape of the y_valid dataframe is: {y_valid.shape}.")
print(f"The shape of the y_test dataframe is: {y_test.shape}.\n")

The shape of the X_train dataframe is: (559608, 585).
The shape of the X_valid dataframe is: (98755, 585).
The shape of the X_test dataframe is: (73152, 585).

The shape of the y_train dataframe is: (559608, 1).
The shape of the y_valid dataframe is: (98755, 1).
The shape of the y_test dataframe is: (73152, 1).
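
The shapes above follow directly from the two `test_size` values. As a minimal sketch, assuming scikit-learn rounds the held-out portion up to a whole number of rows (which is consistent with the shapes printed above), the sizes and the effective fractions can be reproduced with plain arithmetic:

```python
import math

n_rows = 731515  # rows in X, from X.info() above

# first split: 10% held out as the test set
n_test = math.ceil(n_rows * 0.10)
n_rem = n_rows - n_test

# second split: 15% of the remainder held out as the validation set
n_valid = math.ceil(n_rem * 0.15)
n_train = n_rem - n_valid

print(n_train, n_valid, n_test)
for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n / n_rows:.1%} of the full dataset")
```

In other words, the effective split is roughly 76.5% training, 13.5% validation and 10% test.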



In [10]:
# import additional libraries 
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from yellowbrick.regressor import residuals_plot
from yellowbrick.regressor import prediction_error

#### Step 1: Optimal scaler, 'learning_rate' and 'n_estimators'
<a id = 'step_1'></a>

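Due to hardware limitations, I will not search all hyperparameters in a single grid. Instead, I will run **Pipeline** and **GridSearchCV** in three steps, fixing the best values from each step before moving on to the next.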

In this first step, I will start tuning: 
   - scaler 
   - 'learning_rate' 
   - 'n_estimators'
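
To see why splitting the search helps, the sketch below counts the candidates in a single joint grid versus the three staged grids. The value counts are taken from the grids used in this notebook; the variable names are only illustrative.

```python
import math

cv_folds = 3

# number of values tried per hyperparameter in each step of this notebook
step_sizes = {
    "step_1": {"scaling": 2, "learning_rate": 3, "n_estimators": 4},
    "step_2": {"max_depth": 7, "reg_alpha": 4, "reg_lambda": 3},
    "step_3": {"subsample": 5, "colsample_bytree": 5},
}

# staged search: candidates add up across steps
staged = {step: math.prod(sizes.values()) for step, sizes in step_sizes.items()}
staged_candidates = sum(staged.values())

# joint search: candidates multiply across all hyperparameters
joint_candidates = math.prod(staged.values())

print(staged)
print(f"staged: {staged_candidates} candidates, {staged_candidates * cv_folds} fits")
print(f"joint:  {joint_candidates} candidates, {joint_candidates * cv_folds} fits")
```

The staged approach needs 399 fits instead of 151,200. The trade-off is that it does not explore interactions between hyperparameters tuned in different steps.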

In [11]:
# combine XGBoost and GridSearchCV
# Step 1: Setting optimal Scaler, 'learning_rate' and n_estimators

# create placeholders for the two pipeline steps (scaler and model)
estimators = [
    ('scaling', StandardScaler()),
    ('model', XGBRegressor())
]

# instantiate Pipeline with the estimators above
pipe = Pipeline(estimators)

# define parameters for GridSearchCV()
params = [
    {
        'scaling': [MinMaxScaler(), StandardScaler()],
        'model': [XGBRegressor(tree_method = 'hist')], 
        'model__learning_rate': [0.1, 0.2, 0.3], 
        'model__n_estimators': [350, 500, 750, 1000]
    }
]

# instantiate GridSearchCV() using Pipeline as an estimator
# and the parameters set above
grid = GridSearchCV(estimator = pipe, 
                    param_grid = params, 
                    scoring = 'neg_mean_squared_error', 
                    verbose = 1, 
                    cv = 3)

# fit the GridSearchCV()
grid.fit(X_train, y_train)


Fitting 3 folds for each of 24 candidates, totalling 72 fits
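
The keys in `params` follow scikit-learn's `<step name>__<parameter>` convention, while a bare step name such as `'scaling'` swaps the whole step. The sketch below illustrates this on a tiny pipeline. `Ridge` is only a lightweight stand-in for `XGBRegressor` so the example runs quickly; it is not part of this project's model.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# same step names as in the notebook; Ridge is a stand-in model (assumption)
pipe = Pipeline([
    ('scaling', StandardScaler()),
    ('model', Ridge()),
])

# '<step>__<param>' reaches a parameter of the estimator inside that step
pipe.set_params(model__alpha = 0.5)

# a bare step name replaces the whole step, as GridSearchCV does for 'scaling'
pipe.set_params(scaling = MinMaxScaler())

print(type(pipe.named_steps['scaling']).__name__)
print(pipe.get_params()['model__alpha'])
```

This is why the grid can compare `MinMaxScaler()` and `StandardScaler()` in the same search that tunes `model__learning_rate` and `model__n_estimators`.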


In [12]:
# check best parameters for the first step
print(f"Best parameters: {grid.best_params_}.")

Best parameters: {'model': XGBRegressor(base_score=None, booster=None, colsample_bylevel=None,
             colsample_bynode=None, colsample_bytree=None, gamma=None,
             gpu_id=None, importance_type='gain', interaction_constraints=None,
             learning_rate=0.2, max_delta_step=None, max_depth=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=1000, n_jobs=None, num_parallel_tree=None,
             random_state=None, reg_alpha=None, reg_lambda=None,
             scale_pos_weight=None, subsample=None, tree_method='hist',
             validate_parameters=None, verbosity=None), 'model__learning_rate': 0.2, 'model__n_estimators': 1000, 'scaling': MinMaxScaler()}.


As a result of the first GridSearchCV, using 3 folds, the best parameters are:

   - Scaler: MinMaxScaler( )
   - Model: learning_rate --> 0.2
   - Model: n_estimators --> 1000
   
Those parameters will be hard-coded in the next GridSearchCV step.
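
The step-by-step procedure can be sketched in plain Python. The objective below is a made-up toy function, not the real cross-validated error, and the grids are shortened. It only shows how the best values from one step are frozen before the next step is searched.

```python
from itertools import product

def toy_error(learning_rate, max_depth, colsample_bytree):
    # hypothetical stand-in for the cross-validated MSE (lower is better)
    return ((learning_rate - 0.2) ** 2
            + (max_depth - 6) ** 2
            + (colsample_bytree - 0.7) ** 2)

# starting values for hyperparameters that have not been tuned yet
fixed = {'learning_rate': 0.3, 'max_depth': 5, 'colsample_bytree': 1.0}

# each step searches only its own small grid
steps = [
    {'learning_rate': [0.1, 0.2, 0.3]},
    {'max_depth': [4, 5, 6, 7, 8]},
    {'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0]},
]

for grid in steps:
    names = list(grid)
    best_combo = min(
        product(*grid.values()),
        key = lambda combo: toy_error(**{**fixed, **dict(zip(names, combo))}),
    )
    # freeze the winners before moving to the next step
    fixed.update(zip(names, best_combo))

print(fixed)
```

Because the toy function is separable, the staged search finds the same optimum a joint grid would. With a real model the hyperparameters interact, so staged tuning is an approximation.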

#### Step 2: Optimal 'max_depth', 'reg_alpha', 'reg_lambda'
<a id = 'step_2'></a>

In this second step, given the tuned hyperparameters above, I will tune: 
   - max_depth
   - reg_alpha
   - reg_lambda

In [13]:
# combine XGBoost and GridSearchCV
# Step 2: Given the optimal scaler, learning_rate and n_estimators from the previous
# GridSearch, find the optimal max_depth, reg_alpha and reg_lambda

# create placeholders for the two pipeline steps (scaler and model)
estimators = [
    ('scaling', StandardScaler()),
    ('model', XGBRegressor())
]

# instantiate Pipeline with the estimators above
pipe = Pipeline(estimators)

# define parameters for GridSearchCV()
params = [
    {
        'scaling': [MinMaxScaler()],
        'model': [XGBRegressor(tree_method = 'hist')], 
        'model__learning_rate': [0.2], 
        'model__n_estimators': [1000],
        'model__max_depth': [5, 4, 6, 7, 8, 9, 10],
        'model__reg_alpha': [0.01, 0.1, 1, 10],
        'model__reg_lambda': [0.8, 1, 1.2]
    }
]

# instantiate GridSearchCV() using Pipeline as an estimator
# and the parameters set above
grid = GridSearchCV(estimator = pipe, 
                    param_grid = params, 
                    scoring = 'neg_mean_squared_error', 
                    verbose = 1, 
                    cv = 3)

# fit the GridSearchCV()
grid.fit(X_train, y_train)


Fitting 3 folds for each of 84 candidates, totalling 252 fits
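
All three searches use `scoring = 'neg_mean_squared_error'`: scikit-learn always maximizes the score, so the MSE is negated and `grid.best_score_` comes out negative. Below is a minimal sketch of turning such a score back into an RMSE in the units of ***FarePerMile***. The score value is a made-up placeholder, not a result from this notebook.

```python
import math

# hypothetical value standing in for grid.best_score_ (negated MSE)
best_score = -0.0025

mse = -best_score       # undo the sign flip
rmse = math.sqrt(mse)   # back to the units of the target

print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}")
```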


In [14]:
# check best parameters for the second step
print(f"Best parameters: {grid.best_params_}.")

Best parameters: {'model': XGBRegressor(base_score=None, booster=None, colsample_bylevel=None,
             colsample_bynode=None, colsample_bytree=None, gamma=None,
             gpu_id=None, importance_type='gain', interaction_constraints=None,
             learning_rate=0.2, max_delta_step=None, max_depth=6,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=1000, n_jobs=None, num_parallel_tree=None,
             random_state=None, reg_alpha=1, reg_lambda=0.8,
             scale_pos_weight=None, subsample=None, tree_method='hist',
             validate_parameters=None, verbosity=None), 'model__learning_rate': 0.2, 'model__max_depth': 6, 'model__n_estimators': 1000, 'model__reg_alpha': 1, 'model__reg_lambda': 0.8, 'scaling': MinMaxScaler()}.
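
The printed dictionary is hard to read because the `'model'` entry holds the whole estimator. As a sketch, keeping only the `model__` keys gives a compact view. The dictionary below is a hand-copied stand-in for `grid.best_params_` using the values printed above, with the estimator objects replaced by strings.

```python
# stand-in for grid.best_params_ (values copied from the output above)
best_params = {
    'model': 'XGBRegressor(...)',
    'model__learning_rate': 0.2,
    'model__max_depth': 6,
    'model__n_estimators': 1000,
    'model__reg_alpha': 1,
    'model__reg_lambda': 0.8,
    'scaling': 'MinMaxScaler()',
}

# keep only the tuned hyperparameters and strip the step prefix
prefix = 'model__'
tuned = {key[len(prefix):]: value
         for key, value in best_params.items()
         if key.startswith(prefix)}

print(tuned)
```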


As a result of the second GridSearchCV, given the results of the first GridSearchCV, using 3 folds, the best parameters are:

   - Scaler: MinMaxScaler( )
   - Model: learning_rate --> 0.2
   - Model: n_estimators --> 1000
   - Model: max_depth --> 6
   - Model: reg_alpha --> 1
   - Model: reg_lambda --> 0.8
   
Those parameters will be hard-coded in the next GridSearchCV step, except for reg_lambda, which is left at its XGBoost default (1).

#### Step 3: Optimal 'subsample' and 'colsample_bytree'
<a id = 'step_3'></a>

In this third step, given the tuned hyperparameters above, I will tune: 
   - subsample
   - colsample_bytree

In [16]:
# combine XGBoost and GridSearchCV
# Step 3: Given the optimal scaler, learning_rate, n_estimators, max_depth and reg_alpha from the previous
# GridSearch, find the optimal subsample, colsample_bytree 

# create placeholders for the two pipeline steps (scaler and model)
estimators = [
    ('scaling', StandardScaler()),
    ('model', XGBRegressor())
]

# instantiate Pipeline with the estimators above
pipe = Pipeline(estimators)

# define parameters for GridSearchCV()
params = [
    {
        'scaling': [MinMaxScaler()],
        'model': [XGBRegressor(tree_method = 'hist')], 
        'model__learning_rate': [0.2], 
        'model__n_estimators': [1000],
        'model__reg_alpha': [1.0],
        'model__max_depth': [6], 
        'model__subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
        'model__colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0]
    }
]

# instantiate GridSearchCV() using Pipeline as an estimator
# and the parameters set above
grid = GridSearchCV(estimator = pipe, 
                    param_grid = params, 
                    scoring = 'neg_mean_squared_error', 
                    verbose = 1, 
                    cv = 3)

# fit the GridSearchCV()
grid.fit(X_train, y_train)


Fitting 3 folds for each of 25 candidates, totalling 75 fits


In [17]:
# check best parameters for the third step
print(f"Best parameters: {grid.best_params_}.")

Best parameters: {'model': XGBRegressor(base_score=None, booster=None, colsample_bylevel=None,
             colsample_bynode=None, colsample_bytree=0.7, gamma=None,
             gpu_id=None, importance_type='gain', interaction_constraints=None,
             learning_rate=0.2, max_delta_step=None, max_depth=6,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=1000, n_jobs=None, num_parallel_tree=None,
             random_state=None, reg_alpha=1.0, reg_lambda=None,
             scale_pos_weight=None, subsample=1.0, tree_method='hist',
             validate_parameters=None, verbosity=None), 'model__colsample_bytree': 0.7, 'model__learning_rate': 0.2, 'model__max_depth': 6, 'model__n_estimators': 1000, 'model__reg_alpha': 1.0, 'model__subsample': 1.0, 'scaling': MinMaxScaler()}.


As a result of the third GridSearchCV, given the results of the first and second GridSearchCV, using 3 folds, the best parameters are:

   - Scaler: MinMaxScaler( )
   - Model: learning_rate --> 0.2
   - Model: n_estimators --> 1000
   - Model: max_depth --> 6
   - Model: reg_alpha --> 1.0
   - Model: subsample --> 1.0
   - Model: colsample_bytree --> 0.7
   
Those hyperparameters are the ones I will use in my final model.
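
For reference, the selected values can be collected in one dictionary. This is only a sketch of how they might be passed on; the variable name `final_params` is an assumption, and `reg_lambda` is omitted because the third search did not set it.

```python
# tuned hyperparameters from the three searches (scaler: MinMaxScaler)
final_params = {
    'tree_method': 'hist',
    'learning_rate': 0.2,
    'n_estimators': 1000,
    'max_depth': 6,
    'reg_alpha': 1.0,
    'subsample': 1.0,
    'colsample_bytree': 0.7,
}

# these could then be unpacked into the estimator, e.g.
# model = XGBRegressor(**final_params)
for name, value in final_params.items():
    print(f"{name}: {value}")
```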

---

### Next Steps
<a id = 'next'></a>

In the next Jupyter Notebook, I will test the accuracy of this model against real-world data and create some visuals to help me explain the findings of this model.

---