## 2- Model exploration

In this section, we'll explore a range of machine learning models to predict the career longevity of NBA rookies based on their first-season performance statistics. The goal is to identify the model that offers the best performance, ensuring robust predictions on whether an investment in a player will yield a long-term return. After trying out different models with default configurations, we’ll proceed to hyperparameter tuning to improve the selected model's performance. We will then save the best model to use it in the API.

#### Models

We will experiment with the following machine learning models:
- Logistic Regression
- K-Nearest Neighbors
- C-Support Vector Classification (SVC)
- Decision Tree
- Random Forest
- Gaussian Naive Bayes
- XGBoost

#### Metrics

We will focus on the following evaluation metrics to assess model performance:

1. **Precision**

    > ***Definition***: Precision is the ratio of correctly predicted positive observations to the total predicted positives. It answers the question: "Of all the players we predicted would last more than 5 years, how many actually did?"

    > ***Business Impact***: High precision means that when the model predicts a player will last more than 5 years, it is usually correct. This is crucial when making investment decisions because it reduces the risk of investing in players who won’t have a long career (false positives).

    $$Precision = \frac{TP}{TP+FP}$$


2. **Recall**

    > ***Definition***: Recall is the ratio of correctly predicted positive observations to all observations in the actual class. It answers the question: "Of all the players that actually lasted more than 5 years, how many did we correctly identify?"

    > ***Business Impact***: High recall means we’re capturing most of the players who will last long-term, ensuring we don’t miss valuable investment opportunities. However, lower recall could result in missing out on potentially good players (false negatives).

    $$Recall = \frac{TP}{TP+FN}$$


3. **F1 Score**

    > ***Definition***: The F1 score is the harmonic mean of precision and recall. It provides a single measure of a model’s performance when both precision and recall are important.

    > ***Business Impact***: The F1 score is crucial for balancing the trade-off between precision and recall. In the context of investing in players, the F1 score ensures that we don’t overly focus on one metric (e.g., precision) at the expense of the other (e.g., recall). This balance ensures a more well-rounded investment strategy.

    $$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$
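As a quick worked example of the three formulas above, the snippet below plugs in made-up confusion-matrix counts. The values of `tp`, `fp` and `fn` are hypothetical and are not taken from the NBA dataset.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration.
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)   # 80 / 100
recall = tp / (tp + fn)      # 80 / 120
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.3f}")
```

Because the F1 score is a harmonic mean, it sits closer to the lower of the two values, so a model cannot reach a high F1 by excelling at only one of precision or recall.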

### Imports

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns
sns.set_palette("Set2")
import matplotlib.pyplot as plt
%matplotlib inline
from itables import show

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import KFold, StratifiedKFold

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

from sklearn.pipeline import Pipeline

import pickle as pkl

### Load data

We start by loading the cleaned data and dropping the name column.

In [2]:
data = pd.read_csv('../data/nba_logreg_preprocessed.csv')
data = data.drop('Name', axis=1)
data["TARGET_5Yrs"] = data["TARGET_5Yrs"].astype("category")
show(data)

GP,MIN,PTS,FGM,FGA,FG%,3P Made,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TOV,TARGET_5Yrs


We choose the features that we selected in the EDA notebook, and separate the features from the target value in our dataset.

In [3]:
# X = data.drop('TARGET_5Yrs', axis=1)
X = data[['GP', 'MIN', 'PTS', 'FGM']]
y = data['TARGET_5Yrs']

We now use the function provided in `test.py` and add some minor modifications to track different metrics.

In [4]:
def score_classifier(dataset, classifier, labels):
    """
    Runs a shuffled 3-fold cross-validation and averages the scores over the folds.
    :param dataset: the feature matrix, as a NumPy array
    :param classifier: the classifier to fit and evaluate
    :param labels: the labels used for training and validation
    :return: a dict with the mean accuracy, precision, recall and F1 scores
    """

    kf = KFold(n_splits=3, random_state=50, shuffle=True)
    accuracy_list = []
    precision_list = []
    recall_list = []
    f1_list = []
    for training_ids, test_ids in kf.split(dataset):
        training_set = dataset[training_ids]
        training_labels = labels[training_ids]
        test_set = dataset[test_ids]
        test_labels = labels[test_ids]

        classifier.fit(training_set, training_labels)
        predicted_labels = classifier.predict(test_set)

        accuracy = accuracy_score(test_labels, predicted_labels)
        precision = precision_score(test_labels, predicted_labels)
        recall = recall_score(test_labels, predicted_labels)
        f1 = f1_score(test_labels, predicted_labels)

        recall_list.append(recall)
        precision_list.append(precision)
        accuracy_list.append(accuracy)
        f1_list.append(f1)

    recall = np.mean(recall_list)
    precision = np.mean(precision_list)
    accuracy = np.mean(accuracy_list)
    f1 = np.mean(f1_list)

    # print("Accuracy: {:.2f}".format(accuracy))
    # print("Precision: {:.2f}".format(precision))
    # print("Recall: {:.2f}".format(recall))
    # print("F1: {:.2f}".format(f1))

    return {"Accuracy": accuracy, "Precision": precision, "Recall": recall, "F1": f1}
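The per-fold averaging done by `score_classifier` can also be sketched with scikit-learn's `cross_validate`, which accepts several scorers at once. This is an alternative formulation, not the function from `test.py`; the data below is synthetic (`X_demo` and `y_demo` are hypothetical stand-ins for the scaled features and the target).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the scaled features and binary target (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=50
)

metrics = ["accuracy", "precision", "recall", "f1"]
kf = KFold(n_splits=3, shuffle=True, random_state=50)

# cross_validate returns one score per fold under the keys "test_<metric>".
cv_results = cross_validate(
    LogisticRegression(), X_demo, y_demo, cv=kf, scoring=metrics
)

# Average each metric over the three folds, as score_classifier does.
summary = {m.capitalize(): float(np.mean(cv_results[f"test_{m}"])) for m in metrics}
print(summary)
```

With the same `KFold` settings, this should visit the same splits as the manual loop, so the two approaches are expected to agree up to floating-point noise.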

### Scaling the data

We will use the standard scaler to scale our data.

In [5]:
X_scaled = StandardScaler().fit_transform(X)
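One caveat: fitting the scaler on the full dataset before cross-validation lets the test folds contribute to the scaling statistics. The effect is usually small for standardization, but a common way to avoid it is to wrap the scaler and the classifier in a `Pipeline`, so the scaler is refit on each training fold. The sketch below assumes synthetic data (`X_demo`, `y_demo`) and a default `SVC`; it is not the evaluation used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, unscaled stand-in for the feature matrix and target (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=50
)

# The scaler is part of the estimator, so each fold fits it on training rows only.
pipe_demo = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

kf = KFold(n_splits=3, shuffle=True, random_state=50)
fold_f1 = cross_val_score(pipe_demo, X_demo, y_demo, cv=kf, scoring="f1")
print(fold_f1, fold_f1.mean())
```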

### Model comparison

We first try several classifiers with their default parameters and compare their scores on all four metrics, focusing mainly on F1 and precision, in order to select the best model (or models) for the rest of the analysis.

In [6]:
classifiers = [
    LogisticRegression(),
    KNeighborsClassifier(),
    SVC(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    GaussianNB(),
    XGBClassifier()
]

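for classifier in classifiers:
    # Only seed the estimators that actually expose a random_state parameter.
    if "random_state" in classifier.get_params():
        classifier.set_params(random_state=50)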

In [7]:
scores = []

for classifier in classifiers:
    result = score_classifier(X_scaled, classifier, y)
    result["Classifier"] = classifier.__class__.__name__
    scores.append(result)

scores_df = pd.DataFrame(scores)
scores_df = scores_df.sort_values(by="F1", ascending=False)
scores_df = scores_df.reset_index(drop=True)
scores_df

| | Accuracy | Precision | Recall | F1 | Classifier |
|---|---|---|---|---|---|
| 0 | 0.705975 | 0.744422 | 0.806056 | 0.773898 | LogisticRegression |
| 1 | 0.705975 | 0.74847 | 0.797256 | 0.771896 | SVC |
| 2 | 0.671384 | 0.718285 | 0.780856 | 0.748126 | XGBClassifier |
| 3 | 0.670597 | 0.723019 | 0.768282 | 0.744392 | RandomForestClassifier |
| 4 | 0.664308 | 0.714664 | 0.769549 | 0.741014 | KNeighborsClassifier |
| 5 | 0.629717 | 0.703961 | 0.703993 | 0.703394 | DecisionTreeClassifier |
| 6 | 0.660377 | 0.82004 | 0.584443 | 0.681768 | GaussianNB |


Logistic Regression and SVC give the best results, with nearly identical F1 scores (0.774 and 0.772) and the same accuracy (0.706).

Gaussian Naive Bayes stands out with the highest precision (0.82) but the lowest recall (0.58). When it predicts that a player will last more than five years, it is right more often than the other models, but it misses many players who do go on to have long careers. From a business point of view, this corresponds to a conservative strategy that avoids risky investments at the cost of overlooking valuable players.

In our case, we want a good balance of precision and recall. This is why we will try to optimize the Logistic Regression and the SVC models moving forward.

### Logistic Regression

Let's try to tune the hyperparameters of the logistic regression model using Grid Search.

In [8]:
param_grid_lr = {
    'C': [0.001, 0.01, 0.1, 1, 10],
    'penalty': ['l2'],
    'solver': ['liblinear', 'saga', 'lbfgs', 'newton-cg', 'sag'],
}

cv_lr = KFold(n_splits=3, random_state=50, shuffle=True)

grid_search_lr = GridSearchCV(LogisticRegression(), param_grid_lr, cv=cv_lr, scoring='f1')
grid_search_lr.fit(X_scaled, y)
print(grid_search_lr.best_params_)

{'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}
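`best_params_` only reports the winning combination. To see how close the other candidates are, the full results can be read from `cv_results_`. The sketch below assumes synthetic data (`X_demo`, `y_demo`) and a reduced grid over `C` only, so its numbers say nothing about the NBA dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the scaled features and binary target (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=50
)

grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    {"C": [0.001, 0.01, 0.1, 1, 10]},
    cv=KFold(n_splits=3, shuffle=True, random_state=50),
    scoring="f1",
)
grid.fit(X_demo, y_demo)

# One row per candidate, with the mean and spread of the F1 score across folds.
columns = ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(grid.cv_results_)[columns].sort_values("rank_test_score")
print(results)
print(grid.best_params_, grid.best_score_)
```

When the gap between the top candidates is smaller than `std_test_score`, the choice between them is largely arbitrary.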


Let's evaluate the model with the selected best parameters.

In [9]:
log_reg = LogisticRegression(C=1, penalty="l2", solver="liblinear")
score_classifier(X_scaled, log_reg, y)

{'Accuracy': 0.7044025157232704,
 'Precision': 0.7438007775459159,
 'Recall': 0.8035401181627598,
 'F1': 0.77238024053494}

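These scores are marginally below those of the default Logistic Regression (F1 of 0.772 versus 0.774), so tuning brings no measurable gain here. Note that the cell above sets `C=1` rather than the `C=0.1` returned by the grid search, so the reported scores correspond to `C=1` with the `liblinear` solver.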

### SVC

Let's try to tune the hyperparameters of the SVC model using Grid Search.

In [10]:
param_grid_svc = {
    'C': [0.001, 0.01, 0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'sigmoid'],
    'degree': [2, 3, 4, 5],
    'gamma': ['scale', 'auto'],
}

cv_svc = KFold(n_splits=3, random_state=50, shuffle=True)

grid_search_svc = GridSearchCV(SVC(), param_grid_svc, cv=cv_svc, scoring='f1')
grid_search_svc.fit(X_scaled, y)
print(grid_search_svc.best_params_)

{'C': 0.1, 'degree': 2, 'gamma': 'scale', 'kernel': 'rbf'}
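A note on the grid: in scikit-learn's `SVC`, `degree` is only used by the `poly` kernel, which is not among the kernels searched here. The four `degree` values therefore repeat the same fits, and the `degree: 2` in the result is simply the first of four tied candidates. The sketch below illustrates this on synthetic data (`X_demo`, `y_demo` are hypothetical stand-ins).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the scaled features and binary target (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=50
)

# Same RBF model fitted twice, changing only the degree parameter.
pred_deg2 = SVC(kernel="rbf", degree=2).fit(X_demo, y_demo).predict(X_demo)
pred_deg5 = SVC(kernel="rbf", degree=5).fit(X_demo, y_demo).predict(X_demo)

# degree is ignored by the RBF kernel, so the predictions should match exactly.
same_predictions = bool(np.array_equal(pred_deg2, pred_deg5))
print(same_predictions)
```

Dropping `degree` from the grid, or adding `'poly'` to the kernels, would make every candidate distinct and cut the search time.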


Let's evaluate the model with the selected best parameters.

In [11]:
svc = SVC(C=0.1, degree=2, gamma='scale', kernel='rbf', random_state=50)
score_classifier(X_scaled, svc, y)

{'Accuracy': 0.7122641509433962,
 'Precision': 0.7483468718765591,
 'Recall': 0.8123546788641128,
 'F1': 0.7789482206068512}

Our tuned SVC model performs slightly better than the tuned logistic regression model (F1 of 0.779 versus 0.772) and than the default SVC (0.772).

We will use this SVC model for our API since it gives us the best performance with the optimized hyperparameters.

### Saving the model

We create a pipeline to scale the data and fit the model using the tuned SVC model, and save it to a pickle file.

In [12]:
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(C=0.1, degree=2, gamma='scale', kernel='rbf', random_state=50))
])

pipe.fit(X, y)

# Use a context manager so the file is flushed and closed after writing.
with open('../models/nba_classifier.pkl', 'wb') as f:
    pkl.dump(pipe, f)
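On the API side, the pipeline can be loaded back with `pickle` and called on raw, unscaled features, since the scaler is stored inside it. The sketch below is a round-trip check under stated assumptions: the data is synthetic, and `demo_classifier.pkl` is a hypothetical file written to a temporary directory, not the model saved above.

```python
import os
import pickle as pkl
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the four raw features and the target (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=50
)

pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(C=0.1, kernel="rbf", gamma="scale")),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_classifier.pkl")  # hypothetical path
    with open(model_path, "wb") as f:
        pkl.dump(pipe_demo, f)
    with open(model_path, "rb") as f:
        loaded = pkl.load(f)

# The loaded pipeline scales and classifies raw rows in a single call.
new_rows = X_demo[:5]
round_trip_ok = bool(np.array_equal(loaded.predict(new_rows), pipe_demo.predict(new_rows)))
print(round_trip_ok)
```

Pickled scikit-learn models should be loaded with the same library version that saved them, and only from trusted sources, since unpickling can execute arbitrary code.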