#### Load the cleaned dataset

After discussion at the end of week 2, Amy shared the data cleaning procedures she had tested and the improved AUROC scores they produced. The newly cleaned data was pushed to the master branch. Let us begin by loading it and re-testing our best model so far to see whether the AUROC score improves. Source: yang_yang-14169837-week2_dataprocessing_lazypredict.ipynb (GitHub, master branch, notebooks folder: https://github.com/amy-panda/NBA_Career_Prediction.git)

In [179]:
# Load the pandas and numpy packages
import pandas as pd
import numpy as np

In [202]:
#Import cleaned dataset using load_sets function defined in src.data.sets
import sys
sys.path.insert(1, '..')
from src.data.sets import load_sets
X_train, y_train, X_val, y_val, X_test, y_test = load_sets(path='../data/processed/week2')

#### Random Forest with Random Search

Let us re-run the Random Forest with Random Search model from last week and see if there is an improvement in the AUROC score with this newly cleaned data. 

In [203]:
# Import the RandomForestClassifier from sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier

# Import randint from scipy.stats
from scipy.stats import randint

# Define the hyperparameters value range. Keep n_estimators and min_samples_leaf the same but increase max_depth. 
hyperparams_dist6 = {
'n_estimators': randint(35, 50),
'max_depth': randint(30, 40),
'min_samples_leaf': randint(31,50)
}
# Import RandomizedSearchCV and KFold from sklearn.model_selection
from sklearn.model_selection import RandomizedSearchCV, KFold
rf6 = RandomForestClassifier(random_state=8)

# Instantiate a KFold with 5 splits
kf_cv = KFold(n_splits=5)

# Instantiate a RandomizedSearchCV with the hyperparameter values and the random forest model
random_search_rf6 = RandomizedSearchCV(rf6, hyperparams_dist6, random_state=8, cv=kf_cv, verbose=1)

# Fit the RandomizedSearchCV on the training set
random_search_rf6.fit(X_train, y_train)

# Import the roc_auc_score and roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Calculate the probabilities for train and validation datasets
probs_train=random_search_rf6.predict_proba(X_train)[:,1]
probs_val=random_search_rf6.predict_proba(X_val)[:,1]

# Calculate the roc_auc_score for train and validation dataset
print(f'Random_Search_Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Random_Search_Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Random_Search_Train ROC AUC Score: 0.7995328883411028
Random_Search_Val ROC AUC  Score: 0.7097045016851229


There is very little difference in the validation AUROC score (0.7097496 previously vs. 0.7097045 now): it has dropped by approximately 0.00005. Overfitting has also reduced slightly, with the difference between training and validation scores falling from 0.093206 to 0.089828. This cleaned dataset also adds engineered features, which increases dimensionality and may give the model additional signal to train on. We will proceed with this newly cleaned dataset and see if we can improve upon it further.
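One detail worth keeping in mind when reading these numbers: `RandomizedSearchCV` ranks candidates by the estimator's default score (accuracy for classifiers) unless `scoring` is passed, whereas we report AUROC. The following is a minimal sketch, on synthetic data from `make_classification` rather than the NBA data, of how the same search could select candidates by ROC AUC instead. The names `X_demo`, `y_demo`, `param_dist` and `search_auc` are illustrative only, and `n_iter=3` is chosen purely to keep the example quick.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in for X_train / y_train (assumption: not the NBA dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=8)

# Same hyperparameter ranges as the search above
param_dist = {
    'n_estimators': randint(35, 50),
    'max_depth': randint(30, 40),
    'min_samples_leaf': randint(31, 50),
}

# scoring='roc_auc' makes the search pick the candidate with the best mean CV ROC AUC
search_auc = RandomizedSearchCV(
    RandomForestClassifier(random_state=8),
    param_dist,
    n_iter=3,
    scoring='roc_auc',
    cv=KFold(n_splits=5),
    random_state=8,
)
search_auc.fit(X_demo, y_demo)

print(search_auc.best_params_)
print(f'Mean CV ROC AUC of best candidate: {search_auc.best_score_}')
```

With this setting, `best_score_` is directly comparable to the AUROC values printed for the training and validation sets.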


#### Feature Engineering & Re-running Random Forest 

In [206]:
#Import cleaned dataset using load_sets function defined in src.data.sets
import sys
sys.path.insert(1, '..')
from src.data.sets import load_sets
X_train, y_train, X_val, y_val, X_test, y_test = load_sets(path='../data/processed/week3') #data cleaning and feature engineering performed under - vimalasri_chanthru-NA-week3_dataprocessing.ipynb

In [208]:
# Re-Run Random Forest with Random Search (best model from prior experiment) with data cleaning/feature engineering from this week (week3)

# Import the RandomForestClassifier from sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier

# Import randint from scipy.stats
from scipy.stats import randint

# Define the hyperparameters value range. Keep n_estimators and min_samples_leaf the same but increase max_depth. 
hyperparams_dist6 = {
'n_estimators': randint(35, 50),
'max_depth': randint(30, 40),
'min_samples_leaf': randint(31,50)
}
# Import RandomizedSearchCV and KFold from sklearn.model_selection
from sklearn.model_selection import RandomizedSearchCV, KFold
rf6 = RandomForestClassifier(random_state=8)

# Instantiate a KFold with 5 splits
kf_cv = KFold(n_splits=5)

# Instantiate a RandomizedSearchCV with the hyperparameter values and the random forest model
random_search_rf6 = RandomizedSearchCV(rf6, hyperparams_dist6, random_state=8, cv=kf_cv, verbose=1)

# Fit the RandomizedSearchCV on the training set
random_search_rf6.fit(X_train, y_train)

# Import the roc_auc_score and roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Calculate the probabilities for train and validation datasets
probs_train=random_search_rf6.predict_proba(X_train)[:,1]
probs_val=random_search_rf6.predict_proba(X_val)[:,1]

# Calculate the roc_auc_score for train and validation dataset
print(f'Random_Search_Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Random_Search_Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Random_Search_Train ROC AUC Score: 0.7999084846772528
Random_Search_Val ROC AUC  Score: 0.7099572701011073


The validation AUROC has improved slightly (0.70996 vs. 0.70970), while the gap between training and validation scores is essentially unchanged (0.08995 vs. 0.08983). We will proceed with this newly cleaned data as our base. Further discussions with our team highlighted the potential of XGBoost. Of the Logistic Regression, KNN, SVM, Random Forest and XGBoost models tested by our team over the past three weeks, XGBoost performed best. The focus now shifts to training an XGBoost model to see whether it can outperform the Random Forest with Random Search validation score above (0.7100 to 4dp).

#### XGBoost (Default Hyperparameters)

In [207]:
#Let us run a XG Boost model with Default Hyperparameters using the week 3 processed data. 

# Train Xgboost model

import xgboost as xgb

# Instantiate the XGBClassifier class into a variable called xgboost1 with default hyperparameters

xgboost1 = xgb.XGBClassifier()

# Fit the model with the prepared data

xgboost1.fit(X_train, y_train)

# Import `dump` from `joblib` and save the fitted model into the folder `models` as a file called `xgboost_default`

from joblib import dump 

dump(xgboost1,  '../models/xgboost_default.joblib')

# Calculate and save the probability when target=1 for training and validation sets into 2 variables called `y_train_preds` and `y_val_preds`

y_train_preds = xgboost1.predict_proba(X_train)[:,1]
y_val_preds = xgboost1.predict_proba(X_val)[:,1]

# Import `print_class_perf` from `src/models/performance` and display the AUROC score of this baseline model on the training and validation sets

import sys
sys.path.insert(1, '..')
from src.models.performance import print_class_perf


print_class_perf(y_probs=y_train_preds, y_actuals=y_train, set_name='Training')
print_class_perf(y_probs=y_val_preds, y_actuals=y_val, set_name='Validation')

ROC AUC Score Training: 0.9996618753357936
ROC AUC Score Validation: 0.6596774193548387


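The default model overfits heavily: the training AUROC is 0.9997 while the validation AUROC is only 0.6597, well below our Random Forest. Let us try to combat this with hyperparameter tuning via Hyperopt.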

#### XGBoost (Hyperparameter Tuning via Hyperopt)

In [189]:
#XG Boost with Hyperopt Hyperparameter Tuning 

#Import Trials, STATUS_OK, tpe, hp, fmin from hyperopt package

from hyperopt import Trials, STATUS_OK, tpe, hp, fmin

#Define the search space for xgboost hyperparameters

space = {
    'max_depth' : hp.choice('max_depth', range(5, 20, 1)),
    'learning_rate' : hp.quniform('learning_rate', 0.01, 0.5, 0.05),
    'min_child_weight' : hp.quniform('min_child_weight', 1, 10, 1),
    'subsample' : hp.quniform('subsample', 0.1, 1, 0.05),
    'colsample_bytree' : hp.quniform('colsample_bytree', 0.1, 1.0, 0.05)
}

#Define a function called `objective` with the following logic:
    #-input parameters: hyperparameter search space (`space`)
    #-logic: train an xgboost model with the sampled hyperparameters and calculate the mean ROC AUC over 10-fold cross-validation
    #-output: dictionary with the loss (1 - mean ROC AUC) and STATUS_OK

def objective(space):
    from sklearn.model_selection import cross_val_score
    
    xgboost = xgb.XGBClassifier(
        max_depth = int(space['max_depth']),
        learning_rate = space['learning_rate'],
        min_child_weight = space['min_child_weight'],
        subsample = space['subsample'],
        colsample_bytree = space['colsample_bytree'])
    
    roc_auc = cross_val_score(xgboost, X_train, y_train, cv=10, scoring="roc_auc").mean()

    return{'loss': 1-roc_auc, 'status': STATUS_OK }

# Launch Hyperopt search and save the result in a variable called `best`

best = fmin(
    fn=objective,   
    space=space,       
    algo=tpe.suggest,       
    max_evals=5
)

# Print the best set of hyperparameters

print("Best: ", best)

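# Instantiate a XGBClassifier with best set of hyperparameters
# Note: for hp.choice parameters, fmin returns the index of the selected option, not its value,
# so best['max_depth'] = 6 is an index into range(5, 20) (i.e. a depth of 11 during the search),
# while the model below is fitted with max_depth=6.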

xgboost2 = xgb.XGBClassifier(
    max_depth = best['max_depth'],
    learning_rate = best['learning_rate'],
    min_child_weight = best['min_child_weight'],
    subsample = best['subsample'],
    colsample_bytree = best['colsample_bytree'])

# Fit the model with the prepared data

xgboost2.fit(X_train, y_train)

# Save the fitted model into the folder models as a file called `xgboost_best`

dump(xgboost2,  '../models/xgboost_best.joblib')

# Calculate the probability when target=1

probs_train=xgboost2.predict_proba(X_train)[:,1]
probs_val=xgboost2.predict_proba(X_val)[:,1]


#Display the AUROC score of this tuned model on the training and validation sets

print_class_perf(y_probs=probs_train, y_actuals=y_train, set_name='Training')
print_class_perf(y_probs=probs_val, y_actuals=y_val, set_name='Validation')



100%|██████████| 5/5 [00:12<00:00,  2.59s/trial, best loss: 0.32807957679871413]
Best:  {'colsample_bytree': 0.4, 'learning_rate': 0.05, 'max_depth': 6, 'min_child_weight': 7.0, 'subsample': 0.55}
ROC AUC Score Training: 0.8373546477363767
ROC AUC Score Validation: 0.7138149975926819


This run of our XGBoost model with Hyperopt tuning has yielded a better validation AUROC than our best Random Forest with Random Search (0.7138 vs. 0.7100), although the gap to the training score (0.8374) is wider. To give us more control, let us use the best parameters identified by Hyperopt as a base and tune manually to try to improve on this. Best Parameters - {'colsample_bytree': 0.4, 'learning_rate': 0.05, 'max_depth': 6, 'min_child_weight': 7.0, 'subsample': 0.55}
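When reading the `Best:` output, note that for parameters defined with `hp.choice`, `fmin` reports the position of the chosen option in the list rather than the option itself (Hyperopt provides `space_eval(space, best)` to map a result back to actual values). The sketch below reproduces that mapping for `max_depth` in plain Python, using the option list from our search space and the values printed above. The names `max_depth_options` and `resolved` are illustrative only.

```python
# Same option list as hp.choice('max_depth', range(5, 20, 1)) in the search space
max_depth_options = list(range(5, 20, 1))

# Values reported by fmin in the run above
best = {'colsample_bytree': 0.4, 'learning_rate': 0.05, 'max_depth': 6,
        'min_child_weight': 7.0, 'subsample': 0.55}

# Map the hp.choice index back to the depth that was evaluated during the search;
# the hp.quniform parameters are already reported as values and are left unchanged
resolved = dict(best)
resolved['max_depth'] = max_depth_options[best['max_depth']]

print(resolved['max_depth'])  # 11
```

In other words, the trial that achieved the best loss used a depth of 11, whereas passing `best['max_depth']` directly to `XGBClassifier` fits a model with a depth of 6.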

#### XGBoost with Manual Hyperparameter Tuning 

In [210]:
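# Instantiate the XGBClassifier class into a variable called xgb_manual, where we can set the hyperparameters by hand.
# Add n_estimators, eta, min_child_weight, scale_pos_weight (to handle imbalanced data), and gamma.
# Note: in XGBoost, eta is an alias for learning_rate, so eta=0.02 and learning_rate=0.05 below
# set the same parameter twice; which value takes effect may depend on the xgboost version.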

xgb_manual = xgb.XGBClassifier(
    n_estimators =150,
    eta=0.02, 
    max_depth=3, 
    learning_rate=0.05,
    min_child_weight=5,
    subsample=0.75,
    scale_pos_weight=0.80,
    gamma=5
    ) 

# Fit the XGBoost model
xgb_manual.fit(X_train, y_train)

# Import dump from joblib and save the model
from joblib import dump 

dump(xgb_manual,  '../models/xgb_manual.joblib')

# Calculate the probability when target=1
probs_train=xgb_manual.predict_proba(X_train)[:,1]
probs_val=xgb_manual.predict_proba(X_val)[:,1]


# Import the function print_class_perf from models.performance and display the ROC-AUC score
import sys
sys.path.insert(1, '..')
from src.models.performance import print_class_perf

print_class_perf(y_actuals=y_train, y_probs=probs_train,set_name='Training')
print_class_perf(y_actuals=y_val, y_probs=probs_val,set_name='Validation')

ROC AUC Score Training: 0.7606793385423062
ROC AUC Score Validation: 0.7188493018777082


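We have our best validation AUROC yet (0.7188), and the gap between training and validation scores (about 0.042) is the smallest of all the models tested in this notebook.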
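To make the comparison across experiments easier to see, the scores printed in this notebook can be collected into a single table. This is only a convenience sketch: the numbers are copied from the outputs above, and the DataFrame name `scores` and the model labels are our own.

```python
import pandas as pd

# Training and validation AUROC copied from the outputs printed above
scores = pd.DataFrame(
    [
        ('Random Forest, random search (week 2 data)', 0.7995328883411028, 0.7097045016851229),
        ('Random Forest, random search (week 3 data)', 0.7999084846772528, 0.7099572701011073),
        ('XGBoost, default', 0.9996618753357936, 0.6596774193548387),
        ('XGBoost, Hyperopt', 0.8373546477363767, 0.7138149975926819),
        ('XGBoost, manual', 0.7606793385423062, 0.7188493018777082),
    ],
    columns=['model', 'train_auc', 'val_auc'],
)

# Train-validation gap as a simple indicator of overfitting
scores['gap'] = scores['train_auc'] - scores['val_auc']

# Rank models by validation AUROC, best first
print(scores.sort_values('val_auc', ascending=False).round(4).to_string(index=False))
```

The manually tuned XGBoost model comes out on top for validation AUROC and also has the smallest gap.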

#### Load, clean and predict probabilities for the test dataset

In [152]:
# Load the pandas and numpy packages
import pandas as pd
import numpy as np

In [153]:
# Import csv file of test data and save into data_test
data_test=pd.read_csv('../data/raw/2022_test.csv')

In [154]:
# Create a copy of data_test and save it into a variable data_test_cleaned
data_test_cleaned=data_test.copy()

In [155]:
# Remove the columns Id, 3P Made, 3PA, 3P% and BLK
data_test_cleaned.drop(['Id','3P Made','3PA','3P%','BLK'],axis=1,inplace=True)

In [156]:
## Add the columns 'TOTAL_MIN','TOTAL_PTS' and 'FG/FT'
data_test_cleaned['TOTAL_MIN']=data_test_cleaned['MIN'] * data_test_cleaned['GP']
data_test_cleaned['TOTAL_PTS']=data_test_cleaned['PTS'] * data_test_cleaned['GP']
data_test_cleaned['FG/FT']=data_test_cleaned['FG%']/data_test_cleaned['FT%']
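Because the same three columns have to be created for the training, validation and test data, one option is to wrap the step in a small helper. The sketch below is an assumption on our part, not the code used in the week 3 processing notebook: the function name `add_engineered_features` and the toy DataFrame `demo` are hypothetical, and the guard that turns a zero `FT%` into a missing value (to avoid an infinite ratio) is an extra precaution that the cell above does not apply.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df):
    """Return a copy of df with TOTAL_MIN, TOTAL_PTS and FG/FT added."""
    out = df.copy()
    # Totals over the season: per-game average multiplied by games played
    out['TOTAL_MIN'] = out['MIN'] * out['GP']
    out['TOTAL_PTS'] = out['PTS'] * out['GP']
    # Ratio of field goal % to free throw %; a zero FT% becomes NaN instead of inf
    ft_pct = out['FT%'].replace(0, np.nan)
    out['FG/FT'] = out['FG%'] / ft_pct
    return out


# Toy data for illustration only (not rows from 2022_test.csv)
demo = pd.DataFrame({
    'GP': [50.0, 10.0],
    'MIN': [20.0, 5.0],
    'PTS': [8.0, 1.0],
    'FG%': [45.0, 30.0],
    'FT%': [75.0, 0.0],
})
demo_fe = add_engineered_features(demo)
print(demo_fe[['TOTAL_MIN', 'TOTAL_PTS', 'FG/FT']])
```

Any missing values produced by the guard would still need to be handled (for example imputed) before scaling.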

In [157]:
#  Import StandardScaler from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler

In [158]:
# Instantiate the StandardScaler
scaler=StandardScaler()

In [159]:
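# Fit and apply the scaling on data_test_cleaned
# Note: this fits a new scaler on the test data itself, so the test features are standardised
# with test-set statistics rather than with those of the training set.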
data_test_cleaned=scaler.fit_transform(data_test_cleaned)

In [160]:
# Create the variable X_test
X_test=data_test_cleaned

In [161]:
# Calculate the probabilities for test datasets
probs_test=xgb_manual.predict_proba(X_test)[:,1]

In [162]:
# Join the probs_test column into data_test
data_test['TARGET_5Yrs']=probs_test

In [163]:
# Export the csv file 'XGBoost_Manual_chanthru_submission.csv' for Kaggle submission
output=data_test[['Id','TARGET_5Yrs']]
output.to_csv('../XGBoost_Manual_chanthru_submission.csv',index=False)
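A note on the scaling step used for the test set: calling `fit_transform` on the test data standardises it with its own mean and standard deviation. The more common pattern is to fit the scaler once on the training features and reuse it for the test features, so that both are on the same scale as the data the model was trained on. The sketch below illustrates the difference on synthetic data. The arrays `X_train_raw` and `X_test_raw` are hypothetical stand-ins, since the unscaled training features are not loaded in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)

# Synthetic stand-ins (assumption): the test features are shifted relative to the training features
X_train_raw = rng.normal(loc=10.0, scale=2.0, size=(200, 3))
X_test_raw = rng.normal(loc=12.0, scale=2.0, size=(50, 3))

# Fit on the training features only, then apply the same transformation to the test features
scaler = StandardScaler().fit(X_train_raw)
X_test_scaled = scaler.transform(X_test_raw)

# Refitting on the test features instead centres them at zero and hides the shift
X_test_refit = StandardScaler().fit_transform(X_test_raw)

print('Test means using the training scaler:', X_test_scaled.mean(axis=0))
print('Test means after refitting on test:  ', X_test_refit.mean(axis=0))
```

In a project layout like ours, the scaler fitted during data processing could be saved with `joblib.dump` alongside the processed sets and loaded here before predicting.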