In [3]:
# Import necessary libraries
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
import pickle

# XGBoost Classifier

In this section, we explore Extreme Gradient Boosting (XGBoost), an efficient gradient-boosted decision tree algorithm that performs well on both classification and regression tasks. XGBoost stands out for its ability to handle complex datasets, its built-in handling of missing values, and its use of regularization to limit overfitting.

## Key Characteristics of XGBoost:

- **Gradient Boosting**: XGBoost employs a gradient boosting framework, sequentially adding decision trees that correct the errors of the previous trees.
- **Regularization**: The algorithm incorporates L1 and L2 regularization terms to prevent overfitting, enhancing its generalization capability.
- **Speed and Efficiency**: XGBoost is optimized for speed and efficiency, making it a top choice for large datasets.
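
To make the "sequentially adding trees that correct earlier errors" idea concrete, here is a minimal sketch of boosting on a toy regression problem. It is not XGBoost's implementation: it uses one-split stumps written in NumPy, squared-error loss, and no regularization, and the data, `fit_stump`, and `predict_stump` are made up for illustration.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the single threshold on a 1-D feature that best fits the residuals."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the largest value so both sides are non-empty
        left, right = residual[x <= t], residual[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, left_val, right_val = best
    return t, left_val, right_val


def predict_stump(stump, x):
    t, left_val, right_val = stump
    return np.where(x <= t, left_val, right_val)


# Toy data (hypothetical): a noisy sine curve.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(0, 0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residual = y - prediction  # errors left by the trees added so far
    stump = fit_stump(x, residual)  # the new tree is fit to those errors
    prediction += learning_rate * predict_stump(stump, x)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"MSE before boosting: {mse_history[0]:.4f}, after 20 stumps: {mse_history[-1]:.4f}")
```

Each round fits a new weak learner to what the current ensemble still gets wrong, and the learning rate shrinks its contribution so that no single tree dominates.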

XGBoost has demonstrated its effectiveness in numerous machine learning competitions and real-world applications, making it a formidable tool for classification challenges.

In this section, we will delve into the development, fine-tuning, and evaluation of an XGBoost Classifier tailored to our specific dataset. This will prepare us for later comparisons with other classification models.

For a comprehensive guide to our implementation, please refer to the notebook "Final_Project_Data_Gen."

---


Let's load our dataset from CSV files, with `X_train` and `y_train` as our training features and labels, and `X_test` and `y_test` for testing, using Pandas DataFrames to facilitate data manipulation and analysis.


In [4]:
# Import cleaned train and test data
X_train = pd.read_csv("train_X_In-Car-Rec.csv")
y_train = pd.read_csv("train_y_In-Car-Rec.csv")
X_test = pd.read_csv("test_X_In-Car-Rec.csv")
y_test = pd.read_csv("test_y_In-Car-Rec.csv")

Let's examine the columns of our training dataset, `X_train`, to inspect the feature names and gain a better understanding of the available input variables, helping us verify data integrity and plan our analysis effectively.


In [5]:
# View columns to spot check values
X_train.columns

Index(['TEMPERATURE', 'HAS_CHILDREN', 'TOCOUPON_GEQ5MIN', 'TOCOUPON_GEQ15MIN',
       'TOCOUPON_GEQ25MIN', 'DIRECTION_SAME', 'DIRECTION_OPP',
       'DESTINATION_HOME', 'DESTINATION_NO_URGENT_PLACE', 'DESTINATION_WORK',
       ...
       'RESTAURANTLESSTHAN20_1~3', 'RESTAURANTLESSTHAN20_4~8',
       'RESTAURANTLESSTHAN20_GT8', 'RESTAURANTLESSTHAN20_LESS1',
       'RESTAURANTLESSTHAN20_NEVER', 'RESTAURANT20TO50_1~3',
       'RESTAURANT20TO50_4~8', 'RESTAURANT20TO50_GT8',
       'RESTAURANT20TO50_LESS1', 'RESTAURANT20TO50_NEVER'],
      dtype='object', length=109)

We'll ensure that the imported data is complete and correctly loaded by displaying the shapes of our training and test datasets, `X_train`, `y_train`, `X_test`, and `y_test`. These shape dimensions allow us to confirm that the data has been successfully loaded and is consistent with our expectations.


In [6]:
# Verify imported data is complete
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(10147, 109)
(10147, 1)
(2537, 109)
(2537, 1)


We perform a check to verify that the columns have been imported correctly, as sometimes an additional index column may be present when writing data to a CSV file. We examine the first few rows of our training and test datasets, `X_train`, `y_train`, `X_test`, and `y_test`, to ensure that the data columns match our expectations.


In [7]:
# Verify columns imported correctly, sometimes an extra index column is present when writing to csv
print(X_train.head())
print(y_train.head())
print(X_test.head())
print(y_test.head())

   TEMPERATURE  HAS_CHILDREN  TOCOUPON_GEQ5MIN  TOCOUPON_GEQ15MIN  \
0         80.0           0.0               1.0                1.0   
1         30.0           1.0               1.0                0.0   
2         55.0           0.0               1.0                0.0   
3         55.0           0.0               1.0                1.0   
4         80.0           0.0               1.0                1.0   

   TOCOUPON_GEQ25MIN  DIRECTION_SAME  DIRECTION_OPP  DESTINATION_HOME  \
0                0.0             0.0            1.0                 0   
1                0.0             0.0            1.0                 0   
2                0.0             1.0            0.0                 0   
3                1.0             0.0            1.0                 0   
4                0.0             0.0            1.0                 0   

   DESTINATION_NO_URGENT_PLACE  DESTINATION_WORK  ...  \
0                            1                 0  ...   
1                            1  
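
As a sketch of the index-column problem mentioned above, the snippet below writes a small frame to CSV with the default settings and reads it back. The two-row frame is a hypothetical stand-in for our data, and the `startswith("Unnamed")` check is one common heuristic rather than the only way to detect a saved index.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the cleaned features.
original = pd.DataFrame({"TEMPERATURE": [80.0, 30.0], "HAS_CHILDREN": [0.0, 1.0]})

# to_csv writes the index by default, which adds an unnamed first column.
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Columns whose names start with "Unnamed" are the usual sign of a saved index.
extra_cols = [col for col in reloaded.columns if col.startswith("Unnamed")]
cleaned = reloaded.drop(columns=extra_cols)

print(list(reloaded.columns))
print(list(cleaned.columns))
```

Writing with `to_csv(..., index=False)` in the first place avoids the extra column entirely.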

We simplify our model by creating a variable, `xgb`, which represents the Extreme Gradient Boosting (XGBoost) Classifier. This variable will be used for our XGBoost model throughout the analysis, making it more convenient to refer to and reuse in various sections.


In [8]:
# simplify the model with a variable for later use
xgb = XGBClassifier()

The section below sets up the hyperparameters for tuning the model via random search. Because the number of combinations is large, tuning proceeds in two rounds. The first round searches over half of the parameters while the others keep their defaults. After the first round, the best model's values are fixed for those parameters, and the second set is then searched over a range of values with `RandomizedSearchCV`.

These hyperparameters include the type of base learner (`booster`), `max_depth` (round 1), `min_child_weight` (round 1), `subsample` (round 1), `colsample_bytree` (round 1), `learning_rate` (round 2), `gamma` (round 2), and `n_estimators` (round 2). These parameters will be explored systematically to optimize the performance of our XGBoost Classifier.


In [226]:
# Type of base learner used for boosting
booster = [
    "gbtree"
]  # This is the default value. Linear booster is rarely used due to poor performance
# max_depth round 1 parameter: [int(x) for x in np.linspace(1,100, num=20)]
max_depth = [60]  # Any positive value, default 6
# min_child_weight round 1 parameter: [int(x) for x in np.linspace(1,10000, num=100)]
min_child_weight = [1]  # Any non-negative value, default 1, larger = less overfitting
# subsample round 1 parameter: [x for x in np.linspace(0,1, num=10)]
subsample = [
    0.77
]  # any value 0-1, default 1, lower = less overfitting but may underfit
# colsample_bytree round 1 parameter: [x for x in np.linspace(0,1, num=10)]
colsample_bytree = [0.44]  # any value 0-1, ratio of columns selected for each tree
# learning_rate round 2 parameter: [x for x in np.linspace(0,1, num=100)]
learning_rate = [0.01]  # any value 0-1, default 0.3
# gamma round 2 parameter: [0,0.1,1,10 ]
gamma = [0, 0.1, 1, 10]  # Any non-negative value, default 0, larger = conservative
# n_estimators round 2 parameter: [int(x) for x in np.linspace(0,1000, num=100)]
n_estimators = [
    int(x) for x in np.linspace(0, 1000, num=100)
]  # Number of trees in model, more = overfit

# Tune hyperparameters stepwise
# GROUP 1: max_depth , min_child_weight, subsample, colsample_bytree
# GROUP 2: learning_rate, gamma,n_estimators

# Create the random grid
param_grid_random = {
    "booster": booster,  # Default, stated for clarity
    "max_depth": max_depth,  # Round 1
    "min_child_weight": min_child_weight,  # Round 1
    "subsample": subsample,  # Round 1
    "colsample_bytree": colsample_bytree,  # Round 1
    "learning_rate": learning_rate,  # Round 2
    "gamma": gamma,  # Round 2
    "n_estimators": n_estimators,  # Round 2
}
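
To see why the tuning is split into rounds, we can count the candidate combinations. This sketch assumes the commented-out ranges in the cell above are the ones that were actually searched in each round; `count_combinations` is a helper introduced here for illustration.

```python
import numpy as np

# Candidate values per parameter, copied from the commented-out round ranges.
round_1 = {
    "max_depth": [int(x) for x in np.linspace(1, 100, num=20)],
    "min_child_weight": [int(x) for x in np.linspace(1, 10000, num=100)],
    "subsample": list(np.linspace(0, 1, num=10)),
    "colsample_bytree": list(np.linspace(0, 1, num=10)),
}
round_2 = {
    "learning_rate": list(np.linspace(0, 1, num=100)),
    "gamma": [0, 0.1, 1, 10],
    "n_estimators": [int(x) for x in np.linspace(0, 1000, num=100)],
}


def count_combinations(grid):
    """Multiply the number of distinct values offered for each parameter."""
    total = 1
    for values in grid.values():
        total *= len(set(values))
    return total


n_round_1 = count_combinations(round_1)
n_round_2 = count_combinations(round_2)
n_joint = n_round_1 * n_round_2  # size of the search if every parameter varied at once

print(n_round_1, n_round_2, n_joint)
```

Under these assumptions, each round on its own is far smaller than the joint space, which is what makes a stepwise search tractable. Note that the ranges include boundary values such as `subsample = 0` and `n_estimators = 0`, which are not useful settings in practice.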

Here is where we will establish our custom `scorer`. We have decided to use a custom F2 score to prioritize recall (minimizing the chance of not giving a coupon to someone who would use it) while considering the value of precision (minimizing the chance of giving a coupon to someone who will not use it).


In [9]:
# Create a custom score to optimize model
f2_scorer = make_scorer(fbeta_score, beta=2)
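
As a rough illustration of what `beta=2` does, the sketch below computes the F-beta score directly from confusion-matrix counts. The function `f_beta` and the counts are hypothetical and are only meant to show that, at the same total number of errors, false negatives cost more than false positives.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta from confusion-matrix counts; beta > 1 weights recall more heavily."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Hypothetical counts: six errors in both cases, split differently.
more_false_positives = f_beta(tp=8, fp=4, fn=2)
more_false_negatives = f_beta(tp=8, fp=2, fn=4)

print(f"F2 with 4 FP / 2 FN: {more_false_positives:.4f}")
print(f"F2 with 2 FP / 4 FN: {more_false_negatives:.4f}")
```

With `beta=2`, each false negative counts four times as much as a false positive in the denominator, which matches our preference for not missing drivers who would use a coupon.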

Let's execute a randomized search to tune our XGBoost model's hyperparameters efficiently. The search samples 60 hyperparameter combinations (`n_iter=60`) from `param_grid_random`, evaluates each with 5-fold cross-validation, and scores them with the F2 metric. `RandomizedSearchCV` is run with a fixed random state for reproducibility, and its execution time is measured with the `%%time` cell magic.


In [228]:
%%time
random_search = RandomizedSearchCV(xgb, param_grid_random, n_iter=60, cv=5, random_state=42,
                                  scoring = f2_scorer, n_jobs = -1)
random_search.fit(X_train, y_train)

# This code block was used multiple times to tune parameters in a step wise manner

Wall time: 1h 3min 56s
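
Since the real search takes about an hour, a scaled-down sketch can help show the mechanics. The example below is an illustration only: it uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier`, and the names `X_demo`, `y_demo`, `demo_grid`, and `demo_search` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic binary classification problem (not our coupon data).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

# 3 x 3 x 3 = 27 possible combinations; only n_iter of them are tried.
demo_grid = {
    "max_depth": [2, 3, 4],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [10, 20, 30],
}

demo_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    demo_grid,
    n_iter=5,  # number of sampled combinations
    cv=3,  # folds per combination
    scoring=make_scorer(fbeta_score, beta=2),
    random_state=42,
)
demo_search.fit(X_demo, y_demo)

n_tried = len(demo_search.cv_results_["params"])
print(n_tried, demo_search.best_params_, round(demo_search.best_score_, 3))
```

The cost of a randomized search is controlled by `n_iter` and `cv` rather than by the size of the grid, which is why it suits large search spaces.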


We then store the best-performing estimator, `best_estimator`, obtained from the randomized search. Additionally, we retrieve the optimal hyperparameters, stored in `best_random_params`, and the highest cross-validation score achieved, indicated by `best_random_score`. These values provide valuable insights into the ideal configuration of our tuned XGBoost Classifier model and its performance on the dataset.


In [229]:
# Store best estimator
best_estimator = random_search.best_estimator_

# Get the best parameters and score
best_random_params = random_search.best_params_
best_random_score = random_search.best_score_
best_random_params, best_random_score

({'subsample': 0.77,
  'n_estimators': 616,
  'min_child_weight': 1,
  'max_depth': 60,
  'learning_rate': 0.010101010101010102,
  'gamma': 0,
  'colsample_bytree': 0.44,
  'booster': 'gbtree'},
 0.818844534668821)

With the best parameters in hand, we keep the preliminary XGBoost model produced by the random search. To assess it, we generate predictions on the test dataset, `X_test`, store them in `y_pred_xgb_random`, and print the confusion matrix, `cm1`, as a spot check of how well the model classifies instances.


In [230]:
# Save preliminary model and view confusion matrix to spot check
# This model had issues with underfitting the data
y_pred_xgb_random = best_estimator.predict(X_test)

cm1 = confusion_matrix(y_test, y_pred_xgb_random)
print(cm1)

[[ 748  330]
 [ 224 1235]]
[[4312   84]
 [  33 5718]]


The first confusion matrix covers the 2,537 test instances: the model correctly identified 748 instances as negative and 1,235 as positive, while making 330 false positive and 224 false negative classifications.

The second confusion matrix is not produced by the code shown above. Its 10,147 instances match the size of the training set, so it appears to come from predictions on the training data. There the model has 4,312 true negatives and 5,718 true positives against only 84 false positives and 33 false negatives, a much closer fit than on the test set.
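
To put numbers on the two matrices, the sketch below derives the instance count, precision, and recall from each one. It assumes scikit-learn's layout (rows are true labels, columns are predictions); `summarize` is a helper introduced here for illustration.

```python
import numpy as np

# The two matrices printed above, in scikit-learn's [[TN, FP], [FN, TP]] layout.
cm_first = np.array([[748, 330], [224, 1235]])
cm_second = np.array([[4312, 84], [33, 5718]])


def summarize(cm):
    """Return instance count, precision, and recall for a 2x2 confusion matrix."""
    tn, fp, fn, tp = cm.ravel()
    return {
        "n": int(cm.sum()),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


first_summary = summarize(cm_first)
second_summary = summarize(cm_second)

for name, summary in [("first", first_summary), ("second", second_summary)]:
    print(
        f"{name}: n={summary['n']}, "
        f"precision={summary['precision']:.3f}, recall={summary['recall']:.3f}"
    )
```

Comparing the instance counts with the shapes printed earlier (2,537 test rows and 10,147 training rows) is a quick way to tell which dataset a confusion matrix was computed on.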


# Second Try - possible underfitting

The second try section records the results of my second attempt at hyperparameter tuning, as I ran into serious underfitting issues the first time around.


### Round 1

Below are the optimal parameters after round one of random search (time to complete: 39:48). These will be implemented as static values for round two. The round two parameters will now feature a distribution of values instead of defaults. Round one used 1000 random selections from 4 given parameters:

1. max_depth
2. min_child_weight
3. subsample
4. colsample_bytree

({'subsample': 0.7777777777777777,
'min_child_weight': 1,
'max_depth': 58,
'colsample_bytree': 0.4444444444444444,
'booster': 'gbtree'},
0.7942921145067275)


### Round 2

Below are the optimal parameters after round two of random search (time to complete: 1:03:56). These will be implemented as static values for round three. The round two parameters will now feature a distribution of values instead of defaults. Round two used 60 random selections from 3 given parameters and was able to explore all combinations:

1. learning_rate
2. gamma
3. n_estimators

({'subsample': 0.77,
'n_estimators': 616,
'min_child_weight': 1,
'max_depth': 60,
'learning_rate': 0.010101010101010102,
'gamma': 0,
'colsample_bytree': 0.44,
'booster': 'gbtree'},
0.818844534668821)


---


# First Try - possible underfitting


### Round 1 Results


Below are the optimal parameters after round one of random search (time to complete: 3:07:22). These will be implemented as static values for round two. The round two parameters will now feature a distribution of values instead of defaults. Round one used 7000 random selections from 4 given parameters:

1. max_depth
2. min_child_weight
3. subsample
4. colsample_bytree

({'subsample': 0.5050505050505051,  
 'min_child_weight': 950,  
 'max_depth': 11,  
 'colsample_bytree': 0.33333333333333337,  
 'booster': 'gbtree'},
0.8673946364835304)


### Round 2 Results


Below are the optimal parameters after round two of random search (time to complete: 34 mins). These will be implemented as static values for round three. The round two parameters will now feature a distribution of values instead of defaults. Round two used 1400 random selections from 2 given parameters and was able to explore all combinations:

1. learning_rate
2. gamma

({'subsample': 0.5,  
 'min_child_weight': 950,  
 'max_depth': 11,  
 'learning_rate': 0.010101010101010102,  
 'gamma': 0.001,  
 'colsample_bytree': 0.33,  
 'booster': 'gbtree'},  
 0.8673946364835304)


### Round 3 Results


Round three yielded the result that 10 was the optimal number for `n_estimators` given the other static parameters. Training time was 12:03. Results were as follows:

({'subsample': 0.5,  
 'n_estimators': 10,  
 'min_child_weight': 950,  
 'max_depth': 11,  
 'learning_rate': 0.01,  
 'gamma': 0.001,  
 'colsample_bytree': 0.33,  
 'booster': 'gbtree'},  
 0.8673946364835304)


### Grid Search


In [10]:
# Type of base learner used for boosting
booster = [
    "gbtree"
]  # This is the default value. Linear booster is rarely used due to poor performance
max_depth = [50, 60, 70]  # Any positive value, default 6
min_child_weight = [
    1,
    5,
    10,
]  # Any non-negative value, default 1, larger = less overfitting
subsample = [
    0.7,
    0.8,
    0.9,
]  # any value 0-1, default 1, lower = less overfitting but may underfit
colsample_bytree = [
    0.4,
    0.45,
    0.5,
]  # any value 0-1, ratio of columns selected for each tree
learning_rate = [0.01, 0.05]  # any value 0-1, default 0.3
gamma = [0, 1, 10]  # Any non-negative value, default 0, larger = conservative
n_estimators = [500, 600, 700]  # Number of trees in model, more = overfit

# Create the grid
param_grid = {
    "booster": booster,  # Default, stated for clarity
    "max_depth": max_depth,
    "min_child_weight": min_child_weight,
    "subsample": subsample,
    "colsample_bytree": colsample_bytree,
    "learning_rate": learning_rate,
    "gamma": gamma,
    "n_estimators": n_estimators,
}
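
Before launching an exhaustive search, it can help to count how many models it will train. The sketch below rebuilds the grid above under a separate, hypothetical name (`param_grid_demo`) and uses scikit-learn's `ParameterGrid` to enumerate it; the fold count of 5 is taken from the grid search that follows.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the grid defined above.
param_grid_demo = {
    "booster": ["gbtree"],
    "max_depth": [50, 60, 70],
    "min_child_weight": [1, 5, 10],
    "subsample": [0.7, 0.8, 0.9],
    "colsample_bytree": [0.4, 0.45, 0.5],
    "learning_rate": [0.01, 0.05],
    "gamma": [0, 1, 10],
    "n_estimators": [500, 600, 700],
}

n_candidates = len(ParameterGrid(param_grid_demo))  # every combination is tried
n_fits = n_candidates * 5  # one fit per candidate per cross-validation fold

print(n_candidates, n_fits)
```

Unlike the randomized search, a grid search has no `n_iter` cap, so adding one more value to any parameter multiplies the total work.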

We perform a grid search to further fine-tune our XGBoost model, exhaustively evaluating every hyperparameter combination specified in `param_grid`. Each combination is evaluated with 5-fold cross-validation and scored with the F2 metric. The `%%time` cell magic is used to measure the execution time of the grid search.


In [11]:
%%time
best_grid_search_model = GridSearchCV(xgb, param_grid, cv = 5,
                                      scoring=f2_scorer, n_jobs = -1)

_ = best_grid_search_model.fit(X_train, y_train)

# Obtain the best model through grid search


CPU times: total: 1min 1s
Wall time: 43min 24s


In [12]:
# Get the best parameters and score
best_params = best_grid_search_model.best_params_
best_score = best_grid_search_model.best_score_
best_params, best_score

({'booster': 'gbtree',
  'colsample_bytree': 0.4,
  'gamma': 0,
  'learning_rate': 0.01,
  'max_depth': 50,
  'min_child_weight': 1,
  'n_estimators': 500,
  'subsample': 0.7},
 0.8221619337168005)

### Testing Our Best Parameters


We create the final pipeline for our XGBoost Classifier, incorporating the optimal hyperparameters obtained from the grid search. The pipeline consists of the `XGBClassifier` with the following hyperparameter settings: `booster="gbtree"`, `colsample_bytree=0.4`, `gamma=0`, `learning_rate=0.01`, `max_depth=50`, `min_child_weight=1`, `n_estimators=500`, and `subsample=0.7`.


In [10]:
# final pipeline
xgb_pipeline = Pipeline(
    [
        (
            "xgb",
            XGBClassifier(
                booster="gbtree",
                colsample_bytree=0.4,
                gamma=0,
                learning_rate=0.01,
                max_depth=50,
                min_child_weight=1,
                n_estimators=500,
                subsample=0.7,
            ),
        ),
    ]
)

We train our final XGBoost pipeline, `xgb_pipeline`, on the training data, `X_train` and `y_train`, to develop a robust and optimized XGBoost Classifier model that will be used for subsequent predictions and evaluations.


In [12]:
# Train the final pipeline
xgb_pipeline.fit(X_train, y_train.values.ravel())
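
The `.values.ravel()` call above converts the single-column label frame into a flat array. A minimal sketch of the shape change, using a hypothetical four-row frame with a made-up column name `Y`:

```python
import pandas as pd

# Hypothetical single-column label frame, like y_train read from CSV.
y_frame = pd.DataFrame({"Y": [0, 1, 1, 0]})

# .values gives a 2-D NumPy array; .ravel() flattens it to 1-D.
y_flat = y_frame.values.ravel()

print(y_frame.shape, y_flat.shape)
```

Many scikit-learn estimators expect a 1-D target and emit a warning when given a column vector, so flattening the labels keeps the fit call clean.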

Now, let's generate predictions on the test dataset, `X_test`, using our trained XGBoost Classifier model, storing the results in `y_pred_xgb`. These predictions will be used for evaluating the model's performance on unseen data.


In [13]:
y_pred_xgb = xgb_pipeline.predict(X_test)

We compute the confusion matrix, `cm`, to assess the performance of our XGBoost Classifier model on the test data. The matrix provides details on the number of true negatives, false positives, false negatives, and true positives, allowing us to evaluate the model's classification accuracy and potential areas for improvement.


In [14]:
cm = confusion_matrix(y_test, y_pred_xgb)
print(cm)

[[ 730  348]
 [ 220 1239]]


In the confusion matrix for our final XGBoost Classifier model, 730 instances are correctly classified as negative and 1,239 as positive. The model also made 348 false positive and 220 false negative classifications, so it errs more often by predicting a coupon will be used than by missing a driver who would use one, which is consistent with optimizing a recall-weighted score. Further tuning may still be worthwhile in scenarios where minimizing either type of error is critical.


In [15]:
fbeta_score(y_test, y_pred_xgb, beta=2)

0.8345682338677085

The F2 score for our final XGBoost Classifier model, calculated using the provided test data and predictions, is approximately 0.8346. This score reflects the model's ability to balance precision and recall, with an emphasis on recall, making it suitable for scenarios where minimizing false negatives is important.
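
As a sanity check, the reported score can be reproduced from the final confusion matrix alone. The sketch below rebuilds label arrays that have the same counts (the ordering of instances is made up, which does not affect the metric) and compares scikit-learn's `fbeta_score` with the closed-form expression.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, fbeta_score

# Counts from the final confusion matrix: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 730, 348, 220, 1239

# Rebuild label arrays with exactly these counts.
y_true_demo = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred_demo = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

rebuilt_cm = confusion_matrix(y_true_demo, y_pred_demo)
f2_from_library = fbeta_score(y_true_demo, y_pred_demo, beta=2)
f2_from_counts = 5 * tp / (5 * tp + 4 * fn + fp)

print(rebuilt_cm)
print(f2_from_library, f2_from_counts)
```

Both values should agree with the score printed by the notebook, confirming that the confusion matrix and the F2 score describe the same predictions.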


### Pickling Our Model


In [16]:
# Specify the filename where you want to save the model
filename = "XGB_Model.pkl"

# Export the model to the file using pickle.dump
with open(filename, "wb") as file:
    pickle.dump(xgb_pipeline, file)
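
For completeness, here is a sketch of the reverse step: loading a pickled pipeline and checking that it predicts the same as the original. It uses a small stand-in pipeline (`LogisticRegression` on synthetic data) and a temporary file named `demo_model.pkl`, all of which are assumptions for illustration rather than part of our project.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stand-in pipeline trained on synthetic data (not our XGBoost model).
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
demo_pipeline = Pipeline([("clf", LogisticRegression())]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")

    # Save the fitted pipeline, then load it back from disk.
    with open(path, "wb") as file:
        pickle.dump(demo_pipeline, file)
    with open(path, "rb") as file:
        restored = pickle.load(file)

# The restored pipeline should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X_demo) == demo_pipeline.predict(X_demo)).all())
print(same_predictions)
```

Pickle files can execute arbitrary code when loaded, so only unpickle models from sources you trust, and load them with the same library versions used to save them where possible.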