## XGBoost Exploration

This notebook explores an XGBoost cross-validation pipeline that searches for the most promising hyperparameters on our scalar-feature TRAINING dataset, using grid-search CV (`GridSearchCV`). Its main output is a set of best parameters, which are then used to instantiate the XGBoost model in 3_final_models.ipynb, where the model is evaluated on the full training and test sets.

In [31]:
# IMPORTS 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
# ! pip install xgboost
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, classification_report


# ! pip install imbalanced-learn
from imblearn.over_sampling import SMOTE

In [14]:
def xgb_grid_search_cv(X, y, cv_folds=5):
    """
    Run grid-search cross-validation for an XGBoost classifier.

    The data is split 80/20 into train and test sets, the training portion is
    balanced with SMOTE, and the grid search is fit on the balanced training data.
    
    Parameters:
    - X: Features dataset.
    - y: Target variable dataset.
    - cv_folds: Number of folds for cross-validation.
    
    Returns:
    - grid_search: The fitted GridSearchCV object.
    """
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)

    # Oversample minority classes in the training split only; the test split is left untouched
    smote = SMOTE(random_state=23)
    X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
    
    # Define a pipeline with the classifier
    pipeline = Pipeline([
        ('xgb', XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=23))  # XGBoost classifier
    ])
    
    param_grid = {
        'xgb__n_estimators': [100, 200, 300],  # Number of gradient boosted trees. Equivalent to the number of boosting rounds.
        'xgb__learning_rate': [0.01, 0.1, 0.2],  # Step size shrinkage used to prevent overfitting. It scales the contribution of each tree by a factor between 0 and 1.
        'xgb__max_depth': [3, 6, 9],  # Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit.
        'xgb__min_child_weight': [1, 2, 3],  # Minimum sum of instance weight (hessian) needed in a child. Used to control over-fitting. Higher value = more regularized 
        'xgb__subsample': [0.5, 0.7, 1.0],  # Subsample ratio of the training instances. Helps prevent overfitting.
        'xgb__colsample_bytree': [0.5, 0.7, 1.0]  # Subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed.
    }

    
    # Initialize GridSearchCV
    grid_search = GridSearchCV(pipeline, param_grid, cv=cv_folds, scoring='accuracy', verbose=3, n_jobs=-1)
    
    # Perform grid search cross-validation
    grid_search.fit(X_train_balanced, y_train_balanced)
    
    print("Best parameters found: ", grid_search.best_params_)
    print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
    
    # Evaluate the refit best estimator on the held-out test set
    test_score = grid_search.score(X_test, y_test)
    print("Test set score: {:.2f}".format(test_score))
    
    return grid_search

# Import Data

In [27]:
train_df_scalar = pd.read_pickle('/Users/erin/Documents/comp-viz/final-project/fabric/pkls/train_0406_scalar_non-aug.pkl')
train_df_scalar.head()

|  | label | category | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Blended | 3441.793288 | 20285310.0 | 1844.05677 | 0.150226 | 346.120911 | 0.00521 | 9.3e-05 | 6.3e-05 | ... | 0.027344 | 0.054688 | 0.035156 | 0.039062 | 0.03125 | 0.027344 | 0.046875 | 0.0625 | 0.050781 | 0.050781 |
| 1 | 1 | Denim | 7211.992783 | 7747671.0 | 2271.840993 | 0.151475 | 348.99826 | 0.004833 | -6.7e-05 | 3.8e-05 | ... | 0.1 | 0.048649 | 0.021622 | 0.024324 | 0.013514 | 0.021622 | 0.024324 | 0.043243 | 0.016216 | 0.07027 |
| 2 | 2 | Polyester | 8856.756862 | 5854463.0 | 1967.259618 | 0.160454 | 369.684998 | 0.002032 | -2e-06 | 8e-06 | ... | 0.027778 | 0.00463 | 0.00463 | 0.25 | 0.12963 | 0.194444 | 0.018519 | 0.00463 | 0.027778 | 0.00463 |
| 3 | 0 | Blended | 7018.112788 | 7817569.0 | 1953.124972 | 0.1523 | 350.8992 | 0.004582 | 1.1e-05 | 3.4e-05 | ... | 0.052239 | 0.063433 | 0.052239 | 0.007463 | 0.022388 | 0.044776 | 0.085821 | 0.074627 | 0.033582 | 0.029851 |
| 4 | 3 | Cotton | 7932.263905 | 6971318.0 | 2053.412469 | 0.157971 | 363.965454 | 0.002823 | -3e-06 | 1.7e-05 | ... | 0.030612 | 0.047619 | 0.054422 | 0.047619 | 0.05102 | 0.081633 | 0.013605 | 0.044218 | 0.047619 | 0.054422 |


In [28]:
X_sc = train_df_scalar.iloc[:,2:]
y_sc = train_df_scalar.iloc[:,0]
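
The positional slices above rely on `label` and `category` being the first two columns, as in the preview. A minimal sketch of the same split done by column name, which would keep working if the column order changed. The small `demo_df` frame is a stand-in built from a few values in the preview, not the real pickle.

```python
import pandas as pd

# Stand-in frame with the same layout as the preview: label, category, then numeric features.
demo_df = pd.DataFrame({
    "label": [0, 1, 2],
    "category": ["Blended", "Denim", "Polyester"],
    0: [3441.793288, 7211.992783, 8856.756862],
    1: [20285310.0, 7747671.0, 5854463.0],
})

# Select features and target by name instead of by position.
X_demo = demo_df.drop(columns=["label", "category"])
y_demo = demo_df["label"]

# For this layout, the name-based split matches the positional one used in the notebook.
print(X_demo.equals(demo_df.iloc[:, 2:]), y_demo.equals(demo_df.iloc[:, 0]))
```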

# Run Grid Search CV

In [11]:
cv_params_scalar = xgb_grid_search_cv(X_sc, y_sc)

Fitting 5 folds for each of 729 candidates, totalling 3645 fits
[CV 2/5] END xgb__colsample_bytree=0.5, xgb__learning_rate=0.01, xgb__max_depth=3, xgb__min_child_weight=1, xgb__n_estimators=100, xgb__subsample=0.5;, score=0.516 total time=   0.9s
[CV 1/5] END xgb__colsample_bytree=0.5, xgb__learning_rate=0.01, xgb__max_depth=3, xgb__min_child_weight=1, xgb__n_estimators=100, xgb__subsample=0.5;, score=0.501 total time=   1.0s
[CV 5/5] END xgb__colsample_bytree=0.5, xgb__learning_rate=0.01, xgb__max_depth=3, xgb__min_child_weight=1, xgb__n_estimators=100, xgb__subsample=0.5;, score=0.542 total time=   0.9s
[CV 4/5] END xgb__colsample_bytree=0.5, xgb__learning_rate=0.01, xgb__max_depth=3, xgb__min_child_weight=1, xgb__n_estimators=100, xgb__subsample=0.5;, score=0.539 total time=   0.9s
[CV 3/5] END xgb__colsample_bytree=0.5, xgb__learning_rate=0.01, xgb__max_depth=3, xgb__min_child_weight=1, xgb__n_estimators=100, xgb__subsample=0.7;, score=0.544 total time=   0.9s
...
[CV 4/5] END xgb__colsample_bytree=0.5, xgb__learning_rate=0.01, xgb__max_depth=6, xgb__min_child_weight=1, xgb__n_estimators=300, xgb__subsample=0.5;, score=0.703 total time=   9.2s
[CV 4/5] END xgb__colsample_bytree=0.5, xgb__learning_rate=0.01, xgb__max_depth=6, xgb__min_child_weight=1, xgb__n_estimators=300, xgb__subsample=0.7;, score=0.705 total time=   9.0s
[CV 2/5] END xgb__colsample_bytree=0.5, xgb__learning_rate=0.01, xgb__max_depth=6, xgb__min_child_weight=1, xgb__n_estimators=300, xgb__subsample=0.7;, score=0.680 total time=   9.2s
[CV 3/5] END xgb__colsample_bytree=0.5, xgb__learning_rate=0.01, xgb__max_depth=6, xgb__min_child_weight=1, xgb__n_estimators=300, xgb__subsample=0.7;, score=0.703 total time=   9.1s
[CV 1/5] END xgb__colsample_bytree=0.5, xgb__learning_rate=0.01, xgb__max_depth=6, xgb__min_child_weight=1, xgb__n_estimators=300, xgb__subsample=1.0;, score=0.653 total time=   8.9s
...
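
As a sanity check on the log above, the sketch below reproduces the reported fit count from the grid sizes and groups the per-fold scores by parameter set so their spread can be inspected. It uses only the standard library. The helper name `fold_spread` and the regular expression are assumptions based on the log format shown here, and the sample contains just the three complete lines for one candidate that appear in the truncated output.

```python
import re
from collections import defaultdict
from math import prod
from statistics import mean, pstdev

# Number of values per hyperparameter, copied from param_grid.
grid_sizes = {
    "n_estimators": 3, "learning_rate": 3, "max_depth": 3,
    "min_child_weight": 3, "subsample": 3, "colsample_bytree": 3,
}
n_candidates = prod(grid_sizes.values())
n_fits = n_candidates * 5  # cv_folds=5
print(n_candidates, n_fits)  # 729 3645

# Matches complete verbose=3 lines; truncated lines simply do not match and are skipped.
LOG_LINE = re.compile(r"\[CV (\d+)/(\d+)\] END (.+?);, score=([\d.]+) total time=")


def fold_spread(log_text):
    """Return {param string: (mean score, population std, folds seen)}."""
    scores = defaultdict(list)
    for line in log_text.splitlines():
        match = LOG_LINE.search(line)
        if match:
            scores[match.group(3)].append(float(match.group(4)))
    return {params: (mean(s), pstdev(s), len(s)) for params, s in scores.items()}


params = ("xgb__colsample_bytree=0.5, xgb__learning_rate=0.01, xgb__max_depth=6, "
          "xgb__min_child_weight=1, xgb__n_estimators=300, xgb__subsample=0.7")
sample_log = "\n".join([
    f"[CV 4/5] END {params};, score=0.705 total time=   9.0s",
    f"[CV 2/5] END {params};, score=0.680 total time=   9.2s",
    f"[CV 3/5] END {params};, score=0.703 total time=   9.1s",
    "[CV 1/5] END xgb__colsample_bytree=0.5, xgb__learning_rate=0.01, xgb__max_depth=6, xg",
])

spread = fold_spread(sample_log)
for key, (avg, std, n_folds) in spread.items():
    print(f"{n_folds} folds, mean={avg:.3f}, std={std:.3f}")
```

On a fitted search object the same information is available without parsing, through the `mean_test_score` and `std_test_score` entries of `grid_search.cv_results_`.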

Notes on cross-validation:
* Scores vary noticeably from fold to fold for the same parameter set, judging by the print-outs above. This suggests the model is likely overfitting, so we can be less confident in how well any single set of parameters generalizes.
* This is reinforced by the gap between the held-out test-set score and the mean cross-validation score across folds reported by the grid search (0.77 and 0.85 respectively). Part of the gap may also come from applying SMOTE before the folds were split, which lets synthetic samples derived from validation-fold points leak into the training folds and can inflate the cross-validation score.
* The search is slow: the 3645 fits take a significant amount of time on a relatively powerful laptop.
* We proceeded by extracting the best parameters and running them in 3_final_models.ipynb, which contains confusion matrices for comparison.
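
Because the search runs over a `Pipeline`, the keys in `grid_search.best_params_` carry the `xgb__` step prefix and cannot be passed to `XGBClassifier` directly. A minimal sketch of one way to strip it before reuse. The helper `strip_step_prefix` is hypothetical, and the values in `example_best` are placeholders picked from the grid, not the parameters the search actually selected.

```python
def strip_step_prefix(best_params, step="xgb"):
    """Turn {'xgb__max_depth': 6} into {'max_depth': 6} for the given pipeline step."""
    prefix = f"{step}__"
    return {
        key[len(prefix):]: value
        for key, value in best_params.items()
        if key.startswith(prefix)
    }


# Placeholder values only: substitute grid_search.best_params_ from the real run.
example_best = {
    "xgb__n_estimators": 300,
    "xgb__learning_rate": 0.1,
    "xgb__max_depth": 6,
    "xgb__min_child_weight": 1,
    "xgb__subsample": 0.7,
    "xgb__colsample_bytree": 0.5,
}

model_kwargs = strip_step_prefix(example_best)
print(model_kwargs)

# The cleaned dict can then be unpacked into the classifier, for example:
# XGBClassifier(**model_kwargs, eval_metric="mlogloss", random_state=23)
```

If the fitted model itself is wanted rather than its parameters, `grid_search.best_estimator_.named_steps["xgb"]` returns the classifier refit on the full training data passed to `fit`.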