### Overview

This week's experiments focus on model testing and hyperparameter tuning for two models: Random Forest (Random Search vs. Grid Search) and Support Vector Machine (manual tuning of C and gamma). To compare our results accurately with last week's insights, we will retain the same cleaned dataset and compare AUROC scores. As per our group discussion, Amy will focus this week on further cleaning, feature engineering and Logistic Regression. Yatin will explore KNN and XGBoost. At the end of the week we will combine our findings to narrow down:

1. The best performing model (by AUROC) 
2. The cleaned dataset that best enhances these results. 

### Load the dataset

In [1]:
# Import the pandas, numpy packages and dump from joblib
import pandas as pd
import numpy as np
from joblib import dump

In [2]:
# Load the saved sets from last week (data/processed) using numpy
X_train = np.load('../data/processed/X_train.npy')
X_val   = np.load('../data/processed/X_val.npy'  )
y_train = np.load('../data/processed/y_train.npy')
y_val   = np.load('../data/processed/y_val.npy'  )

### Random Forest 

#### Train Initial Random Forest Model with Default Hyperparameters

In [48]:
# Import the RandomForestClassifier from sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier

In [49]:
# Instantiate the RandomForestClassifier class called rf1 with a random state=8
rf1 = RandomForestClassifier(random_state=8)

In [50]:
# Fit the RandomForest model
rf1.fit(X_train, y_train)

In [51]:
# Calculate the probability when target=1
probs_train=rf1.predict_proba(X_train)[:,1]
probs_val=rf1.predict_proba(X_val)[:,1]

In [52]:
# Import the roc_auc_score and roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve


In [53]:
# Print the ROC AUC score for train and validation data
print(f'Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Train ROC AUC Score: 1.0
Val ROC AUC  Score: 0.6740837144920558


We see that the random forest model with default hyperparameters is clearly overfitting. Unlike last week, let us explore hyperparameter tuning with Grid Search to see if it yields a better result. Note: the AUROC scores last week for Random Forest with Random Search were 0.7812 and 0.7067 for the training and validation sets respectively. 
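
The train and validation AUROC pair is computed many times in this notebook, so a small helper could report both scores and their gap in one call. The following is a minimal sketch, assuming a fitted binary classifier with `predict_proba`; the function name `auroc_report` and the synthetic data are illustrative and not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def auroc_report(model, X_tr, y_tr, X_va, y_va):
    """Return train AUROC, validation AUROC and their gap for a fitted classifier."""
    train_auc = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    val_auc = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
    return {"train": train_auc, "val": val_auc, "gap": train_auc - val_auc}


# Synthetic stand-in data (an assumption; this is not the NBA dataset)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

# Default-style forest, analogous to rf1
demo_rf = RandomForestClassifier(n_estimators=50, random_state=8).fit(X_tr, y_tr)
report = auroc_report(demo_rf, X_tr, y_tr, X_va, y_va)
print(report)
```

A large positive `gap` is the overfitting signal discussed above.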

#### Hyperparameter Tuning with Grid Search

In [54]:
#Import GridSearchCV from sklearn.model_selection
from sklearn.model_selection import GridSearchCV

In [55]:
# Let's create a dictionary containing the grid search parameters
hyperparams_grid2 = {
    'n_estimators': np.arange(10, 100, 20),
    'max_depth': np.arange(5, 30, 5),
    'min_samples_leaf': np.arange(2, 20, 4)
    }
hyperparams_grid2

{'n_estimators': array([10, 30, 50, 70, 90]),
 'max_depth': array([ 5, 10, 15, 20, 25]),
 'min_samples_leaf': array([ 2,  6, 10, 14, 18])}

In [56]:
# Instantiate a RandomForestClassifier called rf2 with random_state=8 (the class was imported earlier)
rf2 = RandomForestClassifier(random_state=8)

In [57]:
#Instantiate a GridSearchCV with the hyperparameter grid and the random forest model
grid_search_rf2 = GridSearchCV(rf2, hyperparams_grid2, cv=2, verbose=1)


In [58]:
#Fit the GridSearchCV on the training set
grid_search_rf2.fit(X_train, y_train)

Fitting 2 folds for each of 125 candidates, totalling 250 fits


In [59]:
#Display the best set of hyperparameters
grid_search_rf2.best_params_

{'max_depth': 15, 'min_samples_leaf': 10, 'n_estimators': 50}

In [60]:
# Calculate the probabilities for train and validation datasets
probs_train=grid_search_rf2.predict_proba(X_train)[:,1]
probs_val=grid_search_rf2.predict_proba(X_val)[:,1]

In [61]:
# Calculate the roc_auc_score for train and validation dataset
print(f'Grid_Search_Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Grid_Search_Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Grid_Search_Train ROC AUC Score: 0.9292779367147442
Grid_Search_Val ROC AUC  Score: 0.6989618440057775


Grid Search performs well but is still prone to overfitting and underperforms in comparison to the random search model from last week. Let us see if we can improve this score by amending the hyperparameter dictionary. 
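
One thing worth checking: when no `scoring` argument is given, `GridSearchCV` ranks candidates by the estimator's default score, which is accuracy for classifiers. Since models here are compared by AUROC, passing `scoring="roc_auc"` may select hyperparameters better aligned with that metric. The sketch below uses synthetic data and a deliberately small grid (both assumptions) purely to show the call; it does not reproduce the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; this is not the NBA dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Small illustrative grid so the example runs quickly
small_grid = {
    "n_estimators": [10, 30],
    "max_depth": [5, 10],
    "min_samples_leaf": [2, 10],
}

# scoring="roc_auc" makes the search rank candidates by cross-validated AUROC
auc_search = GridSearchCV(
    RandomForestClassifier(random_state=8),
    small_grid,
    cv=2,
    scoring="roc_auc",
)
auc_search.fit(X_demo, y_demo)

print(auc_search.best_params_)
print(auc_search.best_score_)  # mean cross-validated AUROC of the best candidate
```

The same argument is accepted by `RandomizedSearchCV`.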

In [62]:
# Let's create a new dictionary containing amended grid search parameters with a greater range and number of fits 

hyperparams_grid3 = {
    'n_estimators': np.arange(5, 200, 40),
    'max_depth': np.arange(1, 40, 5),
    'min_samples_leaf': np.arange(1, 60, 5)
    }
hyperparams_grid3

# Instantiate a RandomForestClassifier called rf3 with random_state=8 (the class was imported earlier)
rf3 = RandomForestClassifier(random_state=8)

#Instantiate a GridSearchCV with the hyperparameter grid and the random forest model
grid_search_rf3 = GridSearchCV(rf3, hyperparams_grid3, cv=2, verbose=1)

#Fit the GridSearchCV on the training set
grid_search_rf3.fit(X_train, y_train)

# Calculate the probabilities for train and validation datasets
probs_train=grid_search_rf3.predict_proba(X_train)[:,1]
probs_val=grid_search_rf3.predict_proba(X_val)[:,1]

# Calculate the roc_auc_score for train and validation dataset
print(f'Grid_Search_Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Grid_Search_Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Fitting 2 folds for each of 480 candidates, totalling 960 fits
Grid_Search_Train ROC AUC Score: 0.7884951059885361
Grid_Search_Val ROC AUC  Score: 0.708266129032258


The AUROC score has improved and overfitting has been reduced. This process however took approximately 5 minutes to execute. Further increasing the hyperparameter range and number of fits would not be the most effective use of time and resources currently. Let us use a new dictionary that compromises between rf2 and rf3 in terms of the number of fits and models tested to see if the result can be improved. 

In [64]:
# Let's create a new dictionary with grid parameters that balance the pros and cons of the rf2 and rf3 grids explored above
hyperparams_grid4 = {
    'n_estimators': np.arange(10, 200, 20),
    'max_depth': np.arange(5, 40, 10),
    'min_samples_leaf': np.arange(2, 50, 10)
    }
hyperparams_grid4

# Instantiate a RandomForestClassifier called rf4 with random_state=8 (the class was imported earlier)
rf4 = RandomForestClassifier(random_state=8)

#Instantiate a GridSearchCV with the hyperparameter grid and the random forest model
grid_search_rf4 = GridSearchCV(rf4, hyperparams_grid4, cv=2, verbose=1)

#Fit the GridSearchCV on the training set
grid_search_rf4.fit(X_train, y_train)

#Display the best set of hyperparameters
grid_search_rf4.best_params_

# Calculate the probabilities for train and validation datasets
probs_train=grid_search_rf4.predict_proba(X_train)[:,1]
probs_val=grid_search_rf4.predict_proba(X_val)[:,1]

# Calculate the roc_auc_score for train and validation dataset
print(f'Grid_Search_Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Grid_Search_Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Fitting 2 folds for each of 200 candidates, totalling 400 fits
Grid_Search_Train ROC AUC Score: 0.999740161223188
Grid_Search_Val ROC AUC  Score: 0.6932926095329803


This grid is a compromise, with 400 fits (compared to 250 for rf2 and 960 for rf3). However, the validation AUROC has worsened (0.6933) and the model is heavily overfitting again (training AUROC 0.9997), so random search may prove a more effective compromise between time/resources and performance. As of right now, the best Grid Search model (rf3, validation AUROC 0.7083) outperforms the Random Search model tested last week (0.7067). Let us see if the hyperparameter range used for the random search model can be tuned to provide better results with less computation.
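
The fit counts quoted above can be estimated before launching a search, which helps when budgeting time. A minimal sketch using `ParameterGrid` on the three grids from this section follows; the dictionary `grids` and the variable `cv_folds` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

cv_folds = 2  # matches cv=2 used in the grid searches above

# The three grids defined in this section
grids = {
    "hyperparams_grid2": {
        "n_estimators": np.arange(10, 100, 20),
        "max_depth": np.arange(5, 30, 5),
        "min_samples_leaf": np.arange(2, 20, 4),
    },
    "hyperparams_grid3": {
        "n_estimators": np.arange(5, 200, 40),
        "max_depth": np.arange(1, 40, 5),
        "min_samples_leaf": np.arange(1, 60, 5),
    },
    "hyperparams_grid4": {
        "n_estimators": np.arange(10, 200, 20),
        "max_depth": np.arange(5, 40, 10),
        "min_samples_leaf": np.arange(2, 50, 10),
    },
}

fit_counts = {}
for name, grid in grids.items():
    n_candidates = len(ParameterGrid(grid))  # product of the value counts
    fit_counts[name] = n_candidates * cv_folds
    print(f"{name}: {n_candidates} candidates x {cv_folds} folds = {fit_counts[name]} fits")
```

These counts exclude the one extra fit that `GridSearchCV` performs on the full training set when `refit=True` (the default).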

#### Hyperparameter Tuning with Random Search

In [65]:
# Import randint from scipy.stats
from scipy.stats import randint

In [66]:
# Define the hyperparameters. Let us use the parameters for random search last week as a starting point, ensuring our parameters fall within the same range.
# {'max_depth': 10, 'min_samples_leaf': 46, 'n_estimators': 186}
# However, let us reduce the n_estimators to see if there is a trade off between performance and computation. 

hyperparams_dist5 = {
'n_estimators': randint(35, 50),
'max_depth': randint(10, 30),
'min_samples_leaf': randint(31,50)
}
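
A detail that is easy to miss: `scipy.stats.randint(low, high)` samples integers from `low` up to but excluding `high`, so `randint(35, 50)` can return at most 49. The short check below illustrates this; the sample size and seed are arbitrary choices.

```python
from scipy.stats import randint

# Same distribution as the n_estimators entry above
n_estimators_dist = randint(35, 50)

# Draw many samples and inspect the observed range
samples = n_estimators_dist.rvs(size=10_000, random_state=8)
print("smallest draw:", samples.min())
print("largest draw:", samples.max())
print("50 ever drawn:", bool((samples == 50).any()))
```

`RandomizedSearchCV` draws `n_iter=10` candidates from these distributions by default, which is why each random search in this notebook reports 10 candidates.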

In [67]:
# Import RandomizedSearchCV and KFold from sklearn.model_selection
from sklearn.model_selection import RandomizedSearchCV, KFold
rf5 = RandomForestClassifier(random_state=8)

In [68]:
# Instantiate a KFold with 5 splits
kf_cv = KFold(n_splits=5)

In [69]:
# Instantiate a RandomizedSearchCV with the hyperparameter values and the random forest model
random_search_rf5 = RandomizedSearchCV(rf5, hyperparams_dist5, random_state=8, cv=kf_cv, verbose=1)

In [70]:
# Fit the RandomizedSearchCV on the training set
random_search_rf5.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [71]:
# Display the best set of hyperparameters
random_search_rf5.best_params_

{'max_depth': 22, 'min_samples_leaf': 44, 'n_estimators': 44}

In [72]:
# Calculate the probabilities for train and validation datasets
probs_train=random_search_rf5.predict_proba(X_train)[:,1]
probs_val=random_search_rf5.predict_proba(X_val)[:,1]

In [73]:
# Calculate the roc_auc_score for train and validation dataset
print(f'Random_Search_Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Random_Search_Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Random_Search_Train ROC AUC Score: 0.7844567856623883
Random_Search_Val ROC AUC  Score: 0.7087084737602312


Compared to last week's results (below), the AUROC score has improved and now outperforms the Grid Search tuned model.

Original_Random_Search_Train ROC AUC Score: 0.7812034352902928
Original_Random_Search_Val ROC AUC  Score: 0.706701372171401

Increasing max_depth tends to improve the model's performance but increases the risk of overfitting. Let us see if this can help us further improve the model's performance.

In [74]:
# Define the hyperparameters value range. Keep n_estimators and min_samples_leaf the same but increase max_depth. 
hyperparams_dist6 = {
'n_estimators': randint(35, 50),
'max_depth': randint(30, 40),
'min_samples_leaf': randint(31,50)
}

rf6 = RandomForestClassifier(random_state=8)

# Instantiate a KFold with 5 splits
kf_cv = KFold(n_splits=5)

# Instantiate a RandomizedSearchCV with the hyperparameter values and the random forest model
random_search_rf6 = RandomizedSearchCV(rf6, hyperparams_dist6, random_state=8, cv=kf_cv, verbose=1)

# Fit the RandomizedSearchCV on the training set
random_search_rf6.fit(X_train, y_train)

# Calculate the probabilities for train and validation datasets
probs_train=random_search_rf6.predict_proba(X_train)[:,1]
probs_val=random_search_rf6.predict_proba(X_val)[:,1]

# Calculate the roc_auc_score for train and validation dataset
print(f'Random_Search_Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Random_Search_Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')


Fitting 5 folds for each of 10 candidates, totalling 50 fits
Random_Search_Train ROC AUC Score: 0.8029553010453014
Random_Search_Val ROC AUC  Score: 0.7097496389022628


Our model has improved further, giving our best validation AUROC to date (0.7097). The gap between training and validation AUROC (0.8030 vs. 0.7097) also remains far smaller than with the default model. Let us see if increasing the max_depth further will improve our result.

In [75]:
# Define the hyperparameters value range. Keep n_estimators and min_samples_leaf the same but increase max_depth further. 
hyperparams_dist7 = {
'n_estimators': randint(35, 50),
'max_depth': randint(50, 70),
'min_samples_leaf': randint(31,50)
}

rf7 = RandomForestClassifier(random_state=8)

# Instantiate a KFold with 5 splits
kf_cv = KFold(n_splits=5)

# Instantiate a RandomizedSearchCV with the hyperparameter values and the random forest model
random_search_rf7 = RandomizedSearchCV(rf7, hyperparams_dist7, random_state=8, cv=kf_cv, verbose=1)

# Fit the RandomizedSearchCV on the training set
random_search_rf7.fit(X_train, y_train)

# Calculate the probabilities for train and validation datasets
probs_train=random_search_rf7.predict_proba(X_train)[:,1]
probs_val=random_search_rf7.predict_proba(X_val)[:,1]

# Calculate the roc_auc_score for train and validation dataset
print(f'Random_Search_Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Random_Search_Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Random_Search_Train ROC AUC Score: 0.7844567856623883
Random_Search_Val ROC AUC  Score: 0.7087084737602312


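Validation AUROC is slightly lower than for rf6; in fact, both scores are identical to those of rf5 (0.7845 training, 0.7087 validation). This suggests that, with min_samples_leaf this large, max_depth values in this range are no longer the binding constraint on tree growth. We will retain the max_depth range in hyperparams_dist6. Generally speaking, a higher n_estimators means more trees and usually better performance, with diminishing returns and a higher computational cost. Let us try increasing it slightly to see if performance can be improved without too much of an increase in computation.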
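
The rf7 scores match the rf5 scores to every printed digit. One plausible explanation is that a large `min_samples_leaf` already limits how deep each tree can grow, so raising `max_depth` beyond that depth changes nothing. The sketch below illustrates the effect on synthetic data (an assumption); the helper `fit_forest` and the chosen sizes are hypothetical and only mirror the `min_samples_leaf` of 44 seen earlier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; this is not the NBA dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)


def fit_forest(max_depth):
    """Fit a forest that differs only in its max_depth cap."""
    return RandomForestClassifier(
        n_estimators=20, max_depth=max_depth, min_samples_leaf=44, random_state=8
    ).fit(X_demo, y_demo)


shallow_cap = fit_forest(22)
deep_cap = fit_forest(60)

# Actual depth reached by each tree under the larger cap
depths = [tree.get_depth() for tree in deep_cap.estimators_]
print("deepest tree:", max(depths))

# If no tree reaches either cap, both forests are the same model
same_predictions = np.array_equal(
    shallow_cap.predict_proba(X_demo), deep_cap.predict_proba(X_demo)
)
print("identical predictions:", same_predictions)
```

With 1000 samples and at least 44 samples per leaf, a tree can have at most 22 leaves and therefore cannot be deeper than 21 levels, so neither cap is ever reached in this example.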

In [76]:
# Define the hyperparameters value range. Keep max_depth and min_samples_leaf the same as rf6 but increase n_estimators. 
hyperparams_dist8 = {
'n_estimators': randint(55, 95),
'max_depth': randint(30, 40),
'min_samples_leaf': randint(31,50)
}

rf8 = RandomForestClassifier(random_state=8)

# Instantiate a KFold with 5 splits
kf_cv = KFold(n_splits=5)

# Instantiate a RandomizedSearchCV with the hyperparameter values and the random forest model
random_search_rf8 = RandomizedSearchCV(rf8, hyperparams_dist8, random_state=8, cv=kf_cv, verbose=1)

# Fit the RandomizedSearchCV on the training set
random_search_rf8.fit(X_train, y_train)

# Calculate the probabilities for train and validation datasets
probs_train=random_search_rf8.predict_proba(X_train)[:,1]
probs_val=random_search_rf8.predict_proba(X_val)[:,1]

# Calculate the roc_auc_score for train and validation dataset
print(f'Random_Search_Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Random_Search_Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')


Fitting 5 folds for each of 10 candidates, totalling 50 fits
Random_Search_Train ROC AUC Score: 0.7922118384446687
Random_Search_Val ROC AUC  Score: 0.7058497833413577


Our validation AUROC has worsened again, so we will retain n_estimators between 35 and 50. We do not want to decrease min_samples_leaf, as small leaves produce very specific rules that apply to just a few observations, whereas higher values reduce the risk of overfitting. Let us increase min_samples_leaf to see if our AUROC improves.

In [77]:
# Define the hyperparameters value range. Keep n_estimators and max_depth the same as rf6 but increase min_samples_leaf.  
hyperparams_dist9 = {
'n_estimators': randint(35, 50),
'max_depth': randint(30, 40),
'min_samples_leaf': randint(50,100)
}

rf9 = RandomForestClassifier(random_state=8)

# Instantiate a KFold with 5 splits
kf_cv = KFold(n_splits=5)

# Instantiate a RandomizedSearchCV with the hyperparameter values and the random forest model
random_search_rf9 = RandomizedSearchCV(rf9, hyperparams_dist9, random_state=8, cv=kf_cv, verbose=1)

# Fit the RandomizedSearchCV on the training set
random_search_rf9.fit(X_train, y_train)

# Calculate the probabilities for train and validation datasets
probs_train=random_search_rf9.predict_proba(X_train)[:,1]
probs_val=random_search_rf9.predict_proba(X_val)[:,1]

# Calculate the roc_auc_score for train and validation dataset
print(f'Random_Search_Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Random_Search_Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Random_Search_Train ROC AUC Score: 0.751389970328769
Random_Search_Val ROC AUC  Score: 0.7046732065479056


Increasing min_samples_leaf has also worsened the validation AUROC. We shall proceed with the range of values defined in hyperparams_dist6 (rf6).

In [78]:
# Display the best set of hyperparameters
random_search_rf6.best_params_

{'max_depth': 39, 'min_samples_leaf': 36, 'n_estimators': 48}
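
The random search runs above can be gathered into one table to make the comparison easier to read. This sketch copies the scores printed in this notebook into a DataFrame; the column names and the `change_vs_rf5` descriptions are illustrative.

```python
import pandas as pd

# Scores copied from the random search outputs printed in this notebook
runs = pd.DataFrame(
    {
        "model": ["rf5", "rf6", "rf7", "rf8", "rf9"],
        "change_vs_rf5": [
            "baseline",
            "max_depth 30-39",
            "max_depth 50-69",
            "max_depth 30-39, n_estimators 55-94",
            "max_depth 30-39, min_samples_leaf 50-99",
        ],
        "train_auroc": [
            0.7844567856623883,
            0.8029553010453014,
            0.7844567856623883,
            0.7922118384446687,
            0.751389970328769,
        ],
        "val_auroc": [
            0.7087084737602312,
            0.7097496389022628,
            0.7087084737602312,
            0.7058497833413577,
            0.7046732065479056,
        ],
    }
)

# Train minus validation AUROC as a rough overfitting indicator
runs["gap"] = runs["train_auroc"] - runs["val_auroc"]

ranked = runs.sort_values("val_auroc", ascending=False).reset_index(drop=True)
print(ranked.round(4).to_string(index=False))

best_model = ranked.loc[0, "model"]
```

Ranking by validation AUROC puts rf6 first, which is why its best parameters are carried forward.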

Now that we have tuned our Random Forest via Random Search, let us look at a different type of model: Support Vector Machine (SVM). We will explore the AUROC with default hyperparameters, then tune the C, gamma and kernel type hyperparameters. We will compare its performance to our best Random Forest score (0.7097) to see which model is better suited to predicting whether a rookie player will last at least 5 years in the NBA based on their current stats.

### Support Vector Machine (SVM) 

#### Train Initial SVM with Default Hyperparameters

In [79]:
#Import SVC from sklearn.svm
from sklearn.svm import SVC

In [80]:
#Instantiate a SVC() model with default hyperparameters except probability = True for AUROC
svc_0=SVC(probability=True)

In [81]:
#Train the model on the train set 

svc_0.fit(X_train,y_train)

In [82]:
# Calculate the probability when target=1
probs_train=svc_0.predict_proba(X_train)[:,1]
probs_val=svc_0.predict_proba(X_val)[:,1]

In [83]:
# Import roc_auc_score and roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve


In [84]:
# Print the AUROC scores for the training and validation sets
print(f'Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Train ROC AUC Score: 0.8131324664065582
Val ROC AUC  Score: 0.6251354116514203


In [85]:
svc_0.get_params()

{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': True,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

SVM with default hyperparameters is overfitting and performs worse (by validation AUROC) than Random Forest with default hyperparameters, grid search or random search. Let us see if hyperparameter tuning will increase its performance.
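
A side note on cost: `probability=True` makes `SVC` run an extra internal cross-validation to calibrate probabilities, which slows down fitting. AUROC only needs a score that ranks the observations, so `decision_function` can be used instead. The sketch below shows that alternative on synthetic data (an assumption); its scores will not match the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; this is not the NBA dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

# probability is left at its default (False), so no calibration step is run
svc_demo = SVC()
svc_demo.fit(X_tr, y_tr)

# Signed distance to the separating boundary; larger means more like class 1
scores_val = svc_demo.decision_function(X_va)
val_auc = roc_auc_score(y_va, scores_val)
print(f"Val ROC AUC Score: {val_auc}")
```

Calibrated probabilities are still needed wherever an actual probability is required, such as in a submission file.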

#### Hyperparameter Tuning - Gamma 

In [86]:
#Gamma 1

svc_1 = SVC(C=1.0, kernel='rbf',gamma=0.005, probability=True)

#Train the model on the train set 
svc_1.fit(X_train,y_train)

# Calculate the probability when target=1
probs_train=svc_1.predict_proba(X_train)[:,1]
probs_val=svc_1.predict_proba(X_val)[:,1]

# Print the AUROC scores for the training and validation sets
print(f'Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Train ROC AUC Score: 0.6020638622843928
Val ROC AUC  Score: 0.5456337265286471


AUROC is worse than with default hyperparameters. Let us increase gamma to see if results improve. 

In [87]:
#Gamma 2

svc_2 = SVC(C=1.0, kernel='rbf',gamma=0.1, probability=True)

#Train the model on the train set 
svc_2.fit(X_train,y_train)

# Calculate the probability when target=1
probs_train=svc_2.predict_proba(X_train)[:,1]
probs_val=svc_2.predict_proba(X_val)[:,1]

# Print the AUROC scores for the training and validation sets
print(f'Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Train ROC AUC Score: 0.8345361798668403
Val ROC AUC  Score: 0.6154399374097255


Increasing gamma has improved the AUROC of both training and validation sets but increased overfitting. Larger gamma values tend to increase overfitting, so we will attempt to optimise gamma within the range 0.005 to 0.1. As 0.1 yielded much better results, we start by testing values closer to this.
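
For context when picking manual gamma values, the default `gamma='scale'` corresponds to `1 / (n_features * X.var())`. For standardised features the overall variance is 1, so this reduces to `1 / n_features`. The sketch below computes it on synthetic data; the feature count of 14 is a hypothetical choice, not the notebook's actual number of columns.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a hypothetical 14 features
X_raw, _ = make_classification(n_samples=500, n_features=14, random_state=0)
X_scaled = StandardScaler().fit_transform(X_raw)

# Value that gamma='scale' resolves to for this training matrix
n_features = X_scaled.shape[1]
gamma_scale = 1.0 / (n_features * X_scaled.var())

print(f"gamma='scale' is equivalent to gamma={gamma_scale:.4f}")
print(f"1 / n_features = {1.0 / n_features:.4f}")
```

Running the same two lines on `X_train` would show where the default sits relative to the values tested here.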

In [88]:
#Gamma 3
svc_3 = SVC(C=1.0, kernel='rbf',gamma=0.09, probability=True)

#Train the model on the train set 
svc_3.fit(X_train,y_train)

# Calculate the probability when target=1
probs_train=svc_3.predict_proba(X_train)[:,1]
probs_val=svc_3.predict_proba(X_val)[:,1]

# Print the AUROC scores for the training and validation sets
print(f'Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Train ROC AUC Score: 0.8289506137437649
Val ROC AUC  Score: 0.6194240491092922


Validation AUROC has improved and overfitting has reduced. Let us lower gamma further to see the effect.

In [89]:
#Gamma 4
svc_4 = SVC(C=1.0, kernel='rbf',gamma=0.05, probability=True)

#Train the model on the train set 
svc_4.fit(X_train,y_train)

# Calculate the probability when target=1
probs_train=svc_4.predict_proba(X_train)[:,1]
probs_val=svc_4.predict_proba(X_val)[:,1]

# Print the AUROC scores for the training and validation sets
print(f'Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Train ROC AUC Score: 0.803231852539647
Val ROC AUC  Score: 0.6280181752527684


Validation AUROC has once again improved and overfitting has reduced.

In [90]:
#Gamma 5
svc_5 = SVC(C=1.0, kernel='rbf',gamma=0.02, probability=True)

#Train the model on the train set 
svc_5.fit(X_train,y_train)

# Calculate the probability when target=1
probs_train=svc_5.predict_proba(X_train)[:,1]
probs_val=svc_5.predict_proba(X_val)[:,1]

# Print the AUROC scores for the training and validation sets
print(f'Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Train ROC AUC Score: 0.7425463918650231
Val ROC AUC  Score: 0.6319210399614829


Validation AUROC has further improved and overfitting has reduced.

In [91]:
#Gamma 6
svc_6 = SVC(C=1.0, kernel='rbf',gamma=0.01, probability=True)

#Train the model on the train set 
svc_6.fit(X_train,y_train)

# Calculate the probability when target=1
probs_train=svc_6.predict_proba(X_train)[:,1]
probs_val=svc_6.predict_proba(X_val)[:,1]

# Print the AUROC scores for the training and validation sets
print(f'Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Train ROC AUC Score: 0.6760186929106764
Val ROC AUC  Score: 0.6003761434761675


Whilst overfitting has reduced, the validation AUROC has worsened, so the optimal gamma appears to lie above 0.01. Let us test values between 0.01 and 0.02.

In [92]:
#Gamma 7
svc_7 = SVC(C=1.0, kernel='rbf',gamma=0.015, probability=True)

#Train the model on the train set 
svc_7.fit(X_train,y_train)

# Calculate the probability when target=1
probs_train=svc_7.predict_proba(X_train)[:,1]
probs_val=svc_7.predict_proba(X_val)[:,1]

# Print the AUROC scores for the training and validation sets
print(f'Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Train ROC AUC Score: 0.726139429100606
Val ROC AUC  Score: 0.6303863745787193


Validation AUROC is slightly worse compared to a gamma of 0.02. Let us test closer to 0.02 as one final check. 

In [93]:
#Gamma 8
svc_8 = SVC(C=1.0, kernel='rbf',gamma=0.019, probability=True)

#Train the model on the train set 
svc_8.fit(X_train,y_train)

# Calculate the probability when target=1
probs_train=svc_8.predict_proba(X_train)[:,1]
probs_val=svc_8.predict_proba(X_val)[:,1]

# Print the AUROC scores for the training and validation sets
print(f'Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Train ROC AUC Score: 0.7486773203497215
Val ROC AUC  Score: 0.6245456186807896


Validation AUROC has worsened once again. It seems Gamma 5 (svc_5), with a value of 0.02, has performed best, so we will proceed with this. The performance is still quite low compared to Random Forest, but we will do a few quick checks with tuning C and kernel type to see if there is any significant difference.
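
The manual gamma checks above repeat the same cell with one value changed, so they could be condensed into a loop. The following is a sketch on synthetic data (an assumption) that mirrors the notebook's use of `probability=True`; the names `gammas` and `sweep` are illustrative, and `random_state=8` is added only to make the calibrated probabilities repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; this is not the NBA dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

# The gamma values tried manually in this section
gammas = [0.005, 0.01, 0.015, 0.019, 0.02, 0.05, 0.09, 0.1]

sweep = []
for gamma in gammas:
    model = SVC(C=1.0, kernel="rbf", gamma=gamma, probability=True, random_state=8)
    model.fit(X_tr, y_tr)
    train_auc = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    val_auc = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
    sweep.append((gamma, train_auc, val_auc))
    print(f"gamma={gamma}: train={train_auc:.4f}, val={val_auc:.4f}")

# Pick the gamma with the highest validation AUROC
best_gamma, _, best_val = max(sweep, key=lambda row: row[2])
print(f"best gamma by validation AUROC: {best_gamma}")
```

The same loop shape would work for C by iterating over candidate C values with gamma held fixed.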

#### Hyperparameter Tuning - C

In [94]:
#C1 - increasing C to a large value to test effect
svc_9 = SVC(C=1000.0, kernel='rbf',gamma=0.02, probability=True)

#Train the model on the train set 
svc_9.fit(X_train,y_train)

# Calculate the probability when target=1
probs_train=svc_9.predict_proba(X_train)[:,1]
probs_val=svc_9.predict_proba(X_val)[:,1]

# Print the AUROC scores for the training and validation sets
print(f'Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Train ROC AUC Score: 0.7944805458690208
Val ROC AUC  Score: 0.6434099662975445


Increasing C to 1000 has improved validation AUROC but increased overfitting and requires more computation. Let us try reducing C.

In [95]:
#C2
svc_10 = SVC(C=20.0, kernel='rbf',gamma=0.02, probability=True)

#Train the model on the train set 
svc_10.fit(X_train,y_train)

# Calculate the probability when target=1
probs_train=svc_10.predict_proba(X_train)[:,1]
probs_val=svc_10.predict_proba(X_val)[:,1]

# Print the AUROC scores for the training and validation sets
print(f'Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Train ROC AUC Score: 0.7801427934586072
Val ROC AUC  Score: 0.6498104236880116


C = 20 yields a better validation AUROC with less overfitting and reduced computational strain. However, the tuned SVM model still performs poorly compared to Random Forest with hyperparameter tuning (Random Search and Grid Search). Let us do one final check with kernel type. The default is rbf. The data is not linear in nature, so the linear kernel is not worth testing. The precomputed option expects a square kernel matrix rather than the feature matrix, so it cannot be applied here. We will therefore test poly.

#### Hyperparameter Tuning - Kernel

In [96]:
#Poly Kernel Type
svc_11 = SVC(C=20.0, kernel='poly',gamma=0.02, probability=True)

#Train the model on the train set 
svc_11.fit(X_train,y_train)

# Calculate the probability when target=1
probs_train=svc_11.predict_proba(X_train)[:,1]
probs_val=svc_11.predict_proba(X_val)[:,1]

# Print the AUROC scores for the training and validation sets
print(f'Train ROC AUC Score: {roc_auc_score(y_train, probs_train)}')
print(f'Val ROC AUC  Score: {roc_auc_score(y_val, probs_val)}')

Train ROC AUC Score: 0.7248662718712298
Val ROC AUC  Score: 0.6589461964371689


Poly has improved the validation score and reduced overfitting compared to rbf. We have now tuned gamma, C and kernel type for our SVM model, but it is clear that it does not perform as well as Random Forest. It will be interesting to see which models performed best in Amy's and Yatin's experimentation (Logistic Regression, KNN and XGBoost). Random Forest will be compared to them in the coming weeks and, with further data cleaning and feature engineering, an AUROC score better than our current high of 0.7097 will hopefully be achieved.

### Load and clean the test dataset

In [97]:
# Load the pandas and numpy packages
import pandas as pd
import numpy as np

In [98]:
# Import csv file of test data and save into data_test
data_test=pd.read_csv('../data/raw/2022_test.csv')

In [99]:
# Create a copy of data_test and save it into a variable data_test_cleaned
data_test_cleaned=data_test.copy()

In [100]:
# Remove the columns Id, 3P Made, 3PA, 3P% and BLK
data_test_cleaned.drop(['Id','3P Made','3PA','3P%','BLK'],axis=1,inplace=True)

In [101]:
#  Import StandardScaler from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler

In [102]:
# Instantiate the StandardScaler
scaler=StandardScaler()

In [103]:
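# Fit and apply the scaling on data_test_cleaned
# Caveat: this fits a new scaler on the test data; if the training set was scaled last week, reusing that fitted scaler would keep both sets on the same scale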
data_test_cleaned=scaler.fit_transform(data_test_cleaned)

In [104]:
# Create the variable X_test
X_test=data_test_cleaned
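
Fitting a fresh `StandardScaler` on the test set means the test features are centred and scaled with different statistics from the training features. One common alternative is to save the scaler fitted on the training data with `joblib` and reuse it at prediction time. The sketch below assumes last week's preprocessing used a `StandardScaler`; the random arrays and the file location are hypothetical stand-ins.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
# Hypothetical stand-ins for the unscaled training and test features
train_features = rng.normal(loc=10.0, scale=3.0, size=(200, 4))
test_features = rng.normal(loc=12.0, scale=3.0, size=(50, 4))

# Fit the scaler on the training data only
train_scaler = StandardScaler().fit(train_features)

# Persist it (hypothetical path; a project might use a models/ folder instead)
scaler_path = os.path.join(tempfile.mkdtemp(), "scaler.joblib")
dump(train_scaler, scaler_path)

# Later, at prediction time: load and apply without refitting
loaded_scaler = load(scaler_path)
test_scaled = loaded_scaler.transform(test_features)

# Column means are not forced to zero, because training statistics were used
print(test_scaled.mean(axis=0))
```

With this approach the test set is only ever passed to `transform`, never to `fit_transform`.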

In [105]:
# Calculate the probabilities for test datasets
probs_test=random_search_rf6.predict_proba(X_test)[:,1]

In [106]:
# Join the probs_test column into data_test
data_test['TARGET_5Yrs']=probs_test

In [107]:
# Export the csv file 'rf6_submission_091122.csv' for Kaggle submission
output=data_test[['Id','TARGET_5Yrs']]
output.to_csv('../rf6_submission_091122.csv',index=False)