## Waiver Prediction Models
This notebook records some early attempts to predict, with several models, whether a player would be waived, and to make probabilistic predictions of the same variable. We ultimately moved from this target to player retention (i.e. whether a player is in the league at all), since we found higher accuracy for that prediction.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

In [None]:
player_data = pd.read_csv("train_data.csv")

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class YearWiseStandardScaler(BaseEstimator, TransformerMixin):
    """Standardize each column within its own season (z-score per year)."""
    def __init__(self, year_column):
        self.year_column = year_column
        self.year_stats = {}
    def fit(self, X, y=None):
        # Start from a clean slate so refitting does not keep stale years
        self.year_stats = {}
        for year, group in X.groupby(self.year_column):
            self.year_stats[year] = {
                "mean": group.mean(),
                "std": group.std(ddof=0),  # population standard deviation
            }
        return self
    def transform(self, X):
        def scale_row(row):
            # Use the statistics of the row's own year (KeyError if the year was not seen in fit)
            stats = self.year_stats[row[self.year_column]]
            return (row - stats["mean"]) / stats["std"]

        scaled_data = X.apply(scale_row, axis=1)
        # The year column only selects the statistics; it is not a feature
        return scaled_data.drop(columns=self.year_column)

In [None]:
#Checking that my YearWiseStandardScaler works
toy_data = {
    "Feature1": [1.5, 1.0, 2.4, 0.0, 8.1, 9.2],
    "Feature2": [10, 20, 30, 40, 50, 60],
    "Year": [2023, 2022, 2022, 2023, 2019, 2022]
}

In [None]:
toy_df = pd.DataFrame(toy_data)

In [None]:
scaler = YearWiseStandardScaler(year_column ="Year")
scaler.fit(toy_df)
scaled_df = scaler.transform(toy_df)
print(scaled_df)
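
As an independent cross-check of the toy example, the same per-year z-scores can be computed directly with `groupby(...).transform`. This is only a sketch for comparison, not part of the original pipeline; the frame `toy` copies the values from the toy cell above, and the names `features`, `means`, `stds`, and `expected` are introduced here for illustration.

```python
import pandas as pd

# Same values as the toy data above
toy = pd.DataFrame({
    "Feature1": [1.5, 1.0, 2.4, 0.0, 8.1, 9.2],
    "Feature2": [10, 20, 30, 40, 50, 60],
    "Year": [2023, 2022, 2022, 2023, 2019, 2022],
})

features = ["Feature1", "Feature2"]
grouped = toy.groupby("Year")[features]

# Per-year mean and population standard deviation (ddof=0),
# broadcast back to the shape of the original rows
means = grouped.transform("mean")
stds = grouped.transform(lambda s: s.std(ddof=0))

expected = (toy[features] - means) / stds
print(expected)
```

A year that appears only once (2019 here) has zero standard deviation, so its scaled values come out as NaN. The same thing would happen in the real data for any season with a single row.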

In [None]:
len(player_data)

In [None]:
player_data['WAIVED_BY_START_OF_NEXT_SEASON'] = player_data[['WAIVED', 'RELEASED']].any(axis=1).astype(int)

In [None]:
player_data.sample()

In [None]:
player_data = player_data.dropna(subset=['SALARY'])

In [None]:
len(player_data)

In [None]:
player_data = player_data[player_data['MIN'] != 0]

In [None]:
len(player_data)

In [None]:
columns_to_normalize = ['FGM', 'FGA', 'PTS', 'PF', 'DREB', 'OREB', 'REB', 'FTA', 'FTM', 'STL', 'TOV', 'BLK', 'AST', 'FG3A', 'FG3M']

# Normalize the selected columns by dividing by 'MIN'
player_data[columns_to_normalize] = player_data[columns_to_normalize].div(player_data['MIN'], axis=0)

# Rename columns to include "per minute"
rename_dict = {col: f"{col} / MIN" for col in columns_to_normalize}
player_data.rename(columns=rename_dict, inplace=True)

In [None]:
player_data.sample()

Now let's find the most correlated features

In [None]:
numeric_data = player_data.select_dtypes(include=['number'])

In [None]:
correlations = numeric_data.corr()['WAIVED_BY_START_OF_NEXT_SEASON']
sorted_correlations = correlations.sort_values(ascending=False)
pd.set_option('display.max_rows', None)

# Print the sorted correlations
print(sorted_correlations)

# Reset display options if needed
pd.reset_option('display.max_rows')

Since player_ID is correlated at about 10% despite being completely random, let's keep only the features that are correlated at 15% or higher.

In [None]:
Xfeatures = ['GP', 'MIN', 'DWS', 'GS', 'WS', 'OBPM', 'BPM', 'PER', 'OWS', 'WS_48', 'FT_PCT', 'FG_PCT', 'FGM / MIN', 'PTS / MIN', 'VORP', 'TS_PERCENT', 'SALARY', 'FTM / MIN', 'USG_PERCENT', 'FTA / MIN', 'FGA / MIN' , 'SEASON_START']
len(Xfeatures)
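
The feature list above was written out by hand. A threshold rule like this one could also be applied programmatically, as in the sketch below. It runs on synthetic stand-in data: the column names `TARGET`, `STRONG`, and `NOISE` are hypothetical, and taking the absolute value of the correlation is an assumption made here so that strongly negative correlations are also kept.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, size=n)

# Synthetic stand-in data (hypothetical columns, not the player data)
demo = pd.DataFrame({
    "TARGET": target,
    "STRONG": target + rng.normal(0, 0.5, size=n),  # built to track the target
    "NOISE": rng.normal(0, 1, size=n),              # unrelated to the target
})

threshold = 0.15
# Correlation of every column with the target, excluding the target itself
corr = demo.corr()["TARGET"].drop("TARGET")
# Keep columns whose absolute correlation clears the threshold
selected = corr[corr.abs() >= threshold].index.tolist()
print(corr)
print(selected)
```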

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_player_data, test_player_data = train_test_split(player_data, test_size=0.2, stratify=player_data['WAIVED_BY_START_OF_NEXT_SEASON'], random_state=812)
print(len(train_player_data))
print(len(test_player_data))

In [None]:
train_player_data_waived = train_player_data.loc[train_player_data['WAIVED_BY_START_OF_NEXT_SEASON'] == 1]
train_player_data_unwaived = train_player_data.loc[train_player_data['WAIVED_BY_START_OF_NEXT_SEASON'] == 0]
p = len(train_player_data_waived)/len(train_player_data)
print(p)
print(p*(1 - p))
bbs = p*(1 - p)

About 11% of the players are waived, which is useful to have as a baseline for the various metrics. For instance, a random guesser that predicts "waived" with probability p agrees with the truth with probability p^2 + (1 - p)^2, so the baseline accuracy is about 0.80.

We next look at some individual stats to determine whether there are statistically significant differences in the aggregate between the two classes of players. For example, for field goal percentage we get the following two charts:

In [None]:
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.hist(train_player_data_waived['FG_PCT'].dropna(), bins=20, color='blue', alpha=0.7)
plt.title('FG% for players who are waived')
plt.xlabel('Field Goal Percentage')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.hist(train_player_data_unwaived['FG_PCT'].dropna(), bins=20, color='green', alpha=0.7)
plt.title('FG% for players who are not waived')
plt.xlabel('Field Goal Percentage')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
train_player_data_waived.loc[:, 'FG_PCT'].mean()

In [None]:
train_player_data_unwaived.loc[:, 'FG_PCT'].mean()

On the whole, this difference is highly statistically significant. The p-value for a sample of unwaived players to have an average FG% as low as that of the waived players is extremely small.

In [None]:
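from scipy import stats
# One-sided z-test: is the mean FG% of the waived players lower than expected
# for a same-sized sample drawn from the unwaived population?
unstandardizedZ = train_player_data_waived.loc[:,'FG_PCT'].mean()  # sample mean of waived players
mu = train_player_data_unwaived.loc[:, 'FG_PCT'].mean()
sigma = train_player_data_unwaived.loc[:, 'FG_PCT'].std()
size = len(train_player_data_waived)
# Equivalent to sqrt(size) * (sample mean - mu) / sigma
standardizedZ = (size*unstandardizedZ - mu*size)/(np.sqrt(size)*sigma)
p_value = stats.norm.cdf(standardizedZ)  # lower-tail probability
print(standardizedZ)
print(p_value)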
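
The z-test above treats the unwaived players as a known reference population. As a hedged illustration of the same idea, the sketch below repeats that calculation on synthetic samples and sets it next to Welch's t-test (`scipy.stats.ttest_ind` with `equal_var=False`), which accounts for sampling error in both groups. The arrays `waived` and `unwaived` and their means and spreads are made up for the example and are not the notebook's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two FG% samples (made-up parameters)
waived = rng.normal(loc=0.40, scale=0.10, size=300)
unwaived = rng.normal(loc=0.45, scale=0.08, size=2500)

# One-sided z-test with the unwaived sample as the reference population.
# ddof=1 matches the default of pandas' .std()
n = len(waived)
z = (waived.mean() - unwaived.mean()) / (unwaived.std(ddof=1) / np.sqrt(n))
p_z = stats.norm.cdf(z)

# Welch's t-test, one-sided: is the waived mean lower than the unwaived mean?
t_stat, p_t = stats.ttest_ind(waived, unwaived, equal_var=False, alternative="less")

print("z =", z, "p =", p_z)
print("t =", t_stat, "p =", p_t)
```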

If there is more EDA stuff to add it should be added here.

Now we try some models to see how well the stats can predict at <i>an individual level</i> if a player will be waived or released. We'll look at KNN classification, logistic regression, decision trees, and random forest classifiers. The metrics we are interested in are f1_score, accuracy_score, balanced_accuracy_score, and the Brier score (with predict_proba). Given the 11/89 split of waived and unwaived, a random guess that predicts a player will be waived with probability p (about 0.11, the share of waived players) would have the following baseline metrics:

Accuracy: 0.7952 <br>
Balanced Accuracy: 0.5000 <br>
F1-Score: 0.1158 <br>
1 - Brier Score: 0.8976 <br>(Subtracting from 1 since a <i> smaller </i> Brier score is better.)

These are what we should compare our metrics against.
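
The baseline values listed above can be reproduced from the waived rate alone. The sketch below is one way to do that under stated assumptions: the guesser predicts "waived" with probability p independently of the truth, the F1 value is the large-sample approximation where precision and recall both equal p, and the Brier baseline is the score of a constant forecast p rather than of the hard random guesses. The function name `random_guess_baselines` is introduced here, and p = 0.1158 is taken from the baseline F1 value in the list.

```python
def random_guess_baselines(p):
    """Baseline metrics for a positive rate p and a guesser that
    predicts the positive class with probability p, independently of the truth."""
    return {
        # Guess and truth agree: both positive (p*p) or both negative ((1-p)*(1-p))
        "Accuracy": p**2 + (1 - p)**2,
        # Recall is p on the positive class and 1-p on the negative class
        "Balanced Accuracy": 0.5,
        # Precision and recall both tend to p, so F1 tends to p
        "F1 Score": p,
        # A constant forecast p has Brier score p*(1-p)
        "1-Brier Score": 1 - p * (1 - p),
    }

baselines = random_guess_baselines(0.1158)
for name, value in baselines.items():
    print(f"{name}: {value:.4f}")
```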

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, brier_score_loss, balanced_accuracy_score

In [None]:

train_player_data = train_player_data.dropna(subset = Xfeatures)
print(len(train_player_data))
X = train_player_data[Xfeatures]
y = train_player_data['WAIVED_BY_START_OF_NEXT_SEASON']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=812, stratify = y)
knn_pipeline = Pipeline([
    ('scaler', YearWiseStandardScaler(year_column = 'SEASON_START')),
    ('knn', KNeighborsClassifier(n_neighbors = 15))
])
knn_pipeline.fit(X_train, y_train)
y_pred = knn_pipeline.predict(X_test)
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)
print("Balanced Accuracy: ", balanced_accuracy_score(y_test, y_pred))
print("Accuracy: ", accuracy_score(y_test, y_pred))

print("Baseline F1 score: ", p)
print("Baseline Balanced Accuracy: .5")
print("Baseline Accuracy: ", p**2+(1 - p)**2)

In [None]:
y_prob = knn_pipeline.predict_proba(X_test)
y_prob = y_prob[:, 1]
brier_score = brier_score_loss(y_test, y_prob)
print("Brier Score: ", brier_score)
print("Brier Skill Score: ", 1 - brier_score/bbs)
print("Baseline Brier Score: ", bbs)
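
For reference, the Brier score is the mean squared difference between the predicted probability and the 0/1 outcome, and the Brier skill score rescales it against a reference forecast. The sketch below works through both on a small made-up example; the arrays `y_true_demo` and `y_prob_demo` are hypothetical. It uses the base rate of the demo sample itself as the reference, whereas the cell above uses `bbs` computed from the training frame.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Made-up outcomes and predicted probabilities of the positive class
y_true_demo = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])
y_prob_demo = np.array([0.1, 0.2, 0.1, 0.7, 0.3, 0.1, 0.4, 0.2, 0.1, 0.1])

# Brier score by hand: mean squared error of the probabilities
brier_manual = np.mean((y_prob_demo - y_true_demo) ** 2)
brier_sklearn = brier_score_loss(y_true_demo, y_prob_demo)

# Reference forecast: always predict the base rate, which scores p*(1-p)
base_rate = y_true_demo.mean()
brier_ref = base_rate * (1 - base_rate)

# Skill score: 1 is perfect, 0 matches the reference, negative is worse
skill = 1 - brier_manual / brier_ref
print(brier_manual, brier_sklearn, brier_ref, skill)
```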

In [None]:

metric_summary  = {
    "Metric": ["Accuracy", "Balanced Accuracy", "F1 Score", "1-Brier Score"],
    "KNN Model": [0.8846, 0.5538, 0.1985, 1-0.0876],
    "Baseline": [0.7952, 0.5000, 0.1158, 1-0.1024],
}

# Create a DataFrame
metrics_df = pd.DataFrame(metric_summary)
metrics_df

In [None]:
from sklearn.linear_model import LogisticRegression
X = train_player_data[Xfeatures]
y = train_player_data['WAIVED_BY_START_OF_NEXT_SEASON']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=813, stratify = y)


In [None]:
log_reg = Pipeline([
    ('scaler', YearWiseStandardScaler(year_column = 'SEASON_START')),
    ('logreg', LogisticRegression(max_iter=10000))
])

log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)
print("Balanced Accuracy: ", balanced_accuracy_score(y_test, y_pred))

In [None]:
y_prob = log_reg.predict_proba(X_test)

In [None]:

y_prob = y_prob[:, 1]  # keep the probability of the positive (waived) class
brier_score = brier_score_loss(y_test, y_prob)
print("Brier Score: ", brier_score)
print("Brier Skill Score: ", 1 - brier_score/bbs)

In [None]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Decision tree pipeline: year-wise scaling followed by a depth-limited tree
tree_cfr = Pipeline([
    ('scaler', YearWiseStandardScaler(year_column='SEASON_START')),
    ('classifier', DecisionTreeClassifier(max_depth=6, random_state=814))
])
X = train_player_data[Xfeatures]
y = train_player_data['WAIVED_BY_START_OF_NEXT_SEASON']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=815, stratify = y)
tree_cfr.fit(X_train, y_train)
y_pred = tree_cfr.predict(X_test)
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)
print("Balanced Accuracy: ", balanced_accuracy_score(y_test, y_pred))

In [None]:
y_prob = tree_cfr.predict_proba(X_test)
y_prob = y_prob[:, 1]
brier_score = brier_score_loss(y_test, y_prob)
print("Brier Score: ", brier_score)
print("Brier Skill Score: ", 1 - brier_score/bbs)

In [None]:
from sklearn.ensemble import RandomForestClassifier
# Shallow random forest wrapped in the same year-wise scaling pipeline
rf = Pipeline([
    ('scaler', YearWiseStandardScaler(year_column='SEASON_START')),
    ('classifier', RandomForestClassifier(
        n_estimators=50,
        max_depth=3,
        max_features=6,
        bootstrap=True,
        max_samples=500,
        random_state=816
    ))
])

In [None]:
X = train_player_data[Xfeatures]
y = train_player_data['WAIVED_BY_START_OF_NEXT_SEASON']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=817, stratify = y)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)
print("Balanced Accuracy: ", balanced_accuracy_score(y_test, y_pred))

In [None]:
y_prob = rf.predict_proba(X_test)  # score the random forest, not the earlier decision tree
y_prob = y_prob[:, 1]
brier_score = brier_score_loss(y_test, y_prob)
print("Brier Score: ", brier_score)
print("Brier Skill Score: ", 1 - brier_score/bbs)

Of the non-ensemble methods, the decision tree works best, so let's boost that one with AdaBoost.

In [None]:
from sklearn.ensemble import AdaBoostClassifier

In [None]:
# AdaBoost over depth-6 trees, wrapped in the same year-wise scaling pipeline
ada_clf = Pipeline([
    ('scaler', YearWiseStandardScaler(year_column='SEASON_START')),
    ('classifier', AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=6),
        n_estimators=100,
        algorithm='SAMME',
        learning_rate=0.05,
        random_state=818
    ))
])

In [None]:
X = train_player_data[Xfeatures]
y = train_player_data['WAIVED_BY_START_OF_NEXT_SEASON']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=817, stratify = y)
ada_clf.fit(X_train, y_train)

In [None]:
y_pred = ada_clf.predict(X_test)  # score the boosted model, not the earlier decision tree
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)
print("Balanced Accuracy: ", balanced_accuracy_score(y_test, y_pred))

In [None]:
y_prob = ada_clf.predict_proba(X_test)
y_prob = y_prob[:, 1]
brier_score = brier_score_loss(y_test, y_prob)
print("Brier Score: ", brier_score)
print("Brier Skill Score: ", 1 - brier_score/bbs)