# Set-up
In this notebook I've provided the primary code I used to initialize and tune my model. I attempted to trim some of the inefficiency in my first solo project by coding my models into pipeline functions, and as a result much of my modeling work can be seen in the nba_modeling_functions.py file in this repo.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
from scipy.spatial.distance import euclidean as euc
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, precision_score, log_loss
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
import random
from sklearn.metrics import plot_confusion_matrix, plot_roc_curve, classification_report, roc_auc_score, roc_curve, confusion_matrix, auc, precision_score, recall_score, accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier, AdaBoostClassifier, BaggingRegressor
from ipywidgets import interactive, FloatSlider
import pickle
from sklearn.model_selection import GridSearchCV
from tqdm import tqdm
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
import xgboost as xgb
import nba_all_modeling_functions as nbam

pd.set_option("display.max_rows", 600)
pd.set_option("display.max_columns", 60)
%matplotlib inline

# Unpickling my dataset

In [None]:
with open("supdated_df.pickle", "rb") as read_file:
    all_data_df = pickle.load(read_file)

# Initial attempt at model (logistic regression)
Initially I threw three simple features into a logistic regression model. I got the following results:

- Precision (non-All-Stars): 0.98
- Precision (All-Stars): 0.80
- Precision (weighted average): 0.97
- Recall (non-All-Stars): 0.99
- Recall (All-Stars): 0.59
- Recall (weighted average): 0.97
- F1 score (non-All-Stars): 0.99
- F1 score (All-Stars): 0.68

I also printed out a classification report and plotted a confusion matrix. The initial results above were tough to improve upon throughout.
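For reference, here is a minimal sketch of how per-class precision, recall, and F1 relate to confusion-matrix counts. The counts are hypothetical, picked only so the ratios land near the All-Star figures listed above; they are not my actual confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts for the All-Star class (illustration only)
tp, fp, fn = 16, 4, 11
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With imbalanced classes like these, the weighted averages are dominated by the non-All-Star rows, which is why the All-Star recall is the number to watch.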

In [None]:
# Using features I suspected would have high correlation
features = ["PTS", "AST", "PER"]
X = all_data_df[features]
y = all_data_df["All-Star next season?"]

# Test-train split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, random_state=10)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, random_state=10)

# Scaling training set (for now) and validation set (for later)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# The prob predictions I looked at separately
logr = LogisticRegression()
logr.fit(X_train_scaled, y_train)
logr_val_predictions = logr.predict(X_val_scaled)
logr_prob_predictions = logr.predict_proba(X_val_scaled)
logr_prob_dict = dict(zip(list(X_val.index), list(logr_prob_predictions)))

# Performance
print(f'Accuracy (val): {accuracy_score(y_val, logr_val_predictions)}')
print(classification_report(y_val, logr_val_predictions))
fig, ax = plt.subplots(figsize=(7, 7))
plot_confusion_matrix(logr, X_val_scaled, y_val, ax=ax, cmap="Oranges")

# Next baseline model: KNN (k arbitrarily set to 5)

In [None]:
nn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
nn.fit(X_train_scaled, y_train)
knn_predictions = nn.predict(X_val_scaled)
knn_prob_predictions = nn.predict_proba(X_val_scaled)
knn_tn, knn_fp, knn_fn, knn_tp = confusion_matrix(y_val, knn_predictions).ravel()

# Performance
print(f'Accuracy (val): {accuracy_score(y_val, knn_predictions)}')
print(classification_report(y_val, knn_predictions))
fig, ax = plt.subplots(figsize=(7, 7));
plot_confusion_matrix(nn, X_val_scaled, y_val, ax=ax, cmap="Oranges");

# Plotting the above two baseline models on an ROC curve
Early on, logistic regression looked to be a better option than KNN. That said, I knew I still needed to try more models, add (or at least try) a ton more features, and tune my k hyperparameter.

In [None]:
# Logistic regression model
logr_pos_preds = logr_prob_predictions[:, 1]
logr_fpr, logr_tpr, logr_threshold = roc_curve(y_val, logr_pos_preds)
logr_roc_auc = auc(logr_fpr, logr_tpr)  # named per model so the KNN value below doesn't overwrite it

# KNN model
knn_pos_preds = knn_prob_predictions[:, 1]
knn_fpr, knn_tpr, knn_threshold = roc_curve(y_val, knn_pos_preds)
knn_roc_auc = auc(knn_fpr, knn_tpr)  # named per model so it doesn't overwrite the logistic regression value

fig, ax = plt.subplots(figsize=(12,7), sharex=True, sharey=True)
ax.set_title('ROC', fontdict={"fontsize": 25}, y=1.05)
ax.plot(logr_fpr, logr_tpr, label="Logistic Regression", lw=4)
ax.plot(knn_fpr, knn_tpr, label="KNN (k=5)", lw=4)
ax.legend(loc = 'lower right')
ax.plot([0, 1], [0, 1],'r--');
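As a rough intuition for the AUC values computed above: the area under the ROC curve equals the share of (All-Star, non-All-Star) pairs in which the All-Star gets the higher predicted probability. The sketch below computes it that way in plain Python. The labels and probabilities are hypothetical, and `roc_auc_by_ranking` is an illustrative helper, not part of scikit-learn.

```python
def roc_auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted All-Star probabilities (illustration only)
y_demo = [0, 0, 1, 0, 1, 1, 0, 0]
p_demo = [0.05, 0.20, 0.90, 0.40, 0.35, 0.80, 0.10, 0.02]
auc_demo = roc_auc_by_ranking(y_demo, p_demo)
print(f"AUC: {auc_demo:.3f}")
```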

# Second round of models
Before I wised up and built my pipeline functions, I incrementally built upon these baseline models by deliberately adding features. I also tried out some different k values. Instead of flooding this notebook with all of that, I've included a select sample below, including my initial attempts at fitting random forest, Naive Bayes, and SVC (with SMOTE) models to my dataset.

## Second attempt at KNN (adding "MP" and "Age")

In [None]:
features = ["PTS", "AST", "PER", "MP", "Age"]
X = all_data_df[features]
y = all_data_df["All-Star next season?"]

# Test-train split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, random_state=10)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, random_state=10)

# Scaling training set (for now) and validation set (for later)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Instantiating and fitting
nn = KNeighborsClassifier(n_neighbors=20, n_jobs=-1)
nn.fit(X_train_scaled, y_train)
knn_predictions = nn.predict(X_val_scaled)
knn_prob_predictions = nn.predict_proba(X_val_scaled)
knn_tn, knn_fp, knn_fn, knn_tp = confusion_matrix(y_val, knn_predictions).ravel()

# Performance
print(f'Accuracy (val): {accuracy_score(y_val, knn_predictions)}')
print(classification_report(y_val, knn_predictions))
fig, ax = plt.subplots(figsize=(7, 7));
plot_confusion_matrix(nn, X_val_scaled, y_val, ax=ax, cmap="Oranges");

## Another attempt at logistic regression...
...with the same features as above plus "All-Star?", which is All-Star status in the season prior to the target year.

In [None]:
features = ["PTS", "AST", "PER", "MP", "Age", "All-Star?"]
X = all_data_df[features]
y = all_data_df["All-Star next season?"]

# Test-train split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, random_state=10)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, random_state=10)

# Scaling training set (for now) and validation set (for later)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Instantiating and fitting
logr = LogisticRegression()
logr.fit(X_train_scaled, y_train)
logr_val_predictions = logr.predict(X_val_scaled)
logr_tn, logr_fp, logr_fn, logr_tp = confusion_matrix(y_val, logr_val_predictions).ravel()
logr_prob_predictions = logr.predict_proba(X_val_scaled)

# Performance
print(f'Accuracy (val): {accuracy_score(y_val, logr_val_predictions)}')
print(classification_report(y_val, logr_val_predictions))
fig, ax = plt.subplots(figsize=(7, 7));
plot_confusion_matrix(logr, X_val_scaled, y_val, ax=ax, cmap="Oranges");

## First random forest model...
...with the same features as above. My training and validation sets remained scaled, even though they did not need to be. I simply pumped the material from the last logistic regression model right in here.

In [None]:
ran_for = RandomForestClassifier()
ran_for.fit(X_train_scaled, y_train)
# The model was fit on scaled data, so predict on the scaled validation set too
ran_for_predictions = ran_for.predict(X_val_scaled)
ran_for_prob_predictions = ran_for.predict_proba(X_val_scaled)

# Performance
print(f'Accuracy (val): {accuracy_score(y_val, ran_for_predictions)}')
print(classification_report(y_val, ran_for_predictions))
fig, ax = plt.subplots(figsize=(7, 7));
plot_confusion_matrix(ran_for, X_val_scaled, y_val, ax=ax, cmap="Oranges");
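On why scaling is unnecessary for a random forest: trees split on thresholds, so a monotonic rescaling such as standardization does not change which rows fall on each side of the best split. The sketch below checks this for a single Gini-based split on one feature. The values and labels are hypothetical, and `best_stump_partition` is an illustrative helper rather than anything from scikit-learn.

```python
from statistics import mean, pstdev

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_stump_partition(values, labels):
    """Return the set of row indices sent left by the Gini-optimal single split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = float("inf"), frozenset()
    for cut in range(1, len(order)):
        # Only split between distinct feature values
        if values[order[cut - 1]] == values[order[cut]]:
            continue
        left, right = order[:cut], order[cut:]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(order)
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left

# Hypothetical points-per-game values and All-Star labels (illustration only)
pts = [4.2, 8.1, 12.5, 18.9, 22.3, 27.0]
labels = [0, 0, 0, 1, 0, 1]
mu, sigma = mean(pts), pstdev(pts)  # population std, as StandardScaler uses
pts_scaled = [(x - mu) / sigma for x in pts]

left_raw = best_stump_partition(pts, labels)
left_scaled = best_stump_partition(pts_scaled, labels)
print(left_raw == left_scaled)
```

The invariance only holds when the same representation is used for fitting and predicting; a forest fit on scaled features still needs scaled inputs at prediction time.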

In [None]:
# And a look at feature importance as determined by this random forest model
print(ran_for.feature_importances_)

## First Naive Bayes model...
...also with the same features as above.

In [None]:
# Naive Bayes (Gaussian)
gnb = GaussianNB()
gnb.fit(X_train_scaled, y_train)
# The model was fit on scaled data, so predict and plot with the scaled validation set
gnb_predictions = gnb.predict(X_val_scaled)
gnb_prob_predictions = gnb.predict_proba(X_val_scaled)
gnb_tn, gnb_fp, gnb_fn, gnb_tp = confusion_matrix(y_val, gnb_predictions).ravel()

# Performance
print(classification_report(y_val, gnb_predictions))
fig, ax = plt.subplots(figsize=(7, 7));
plot_confusion_matrix(gnb, X_val_scaled, y_val, ax=ax, cmap="Oranges");

# Another ROC curve...
...for another look at all the models so far against one another. Logistic regression at this point no longer lapped the field but still seemed the likely best fit among the group in terms of performance so far, AUC, and the interpretability it offers.

In [None]:
# Logistic regression
logr_pos_preds = logr_prob_predictions[:, 1]
logr_fpr, logr_tpr, logr_threshold = roc_curve(y_val, logr_pos_preds)
logr_roc_auc = auc(logr_fpr, logr_tpr)

# KNN
knn_pos_preds = knn_prob_predictions[:, 1]
knn_fpr, knn_tpr, knn_threshold = roc_curve(y_val, knn_pos_preds)
knn_roc_auc = auc(knn_fpr, knn_tpr)

# Random forest
ran_for_pos_preds = ran_for_prob_predictions[:, 1]
ran_for_fpr, ran_for_tpr, ran_for_threshold = roc_curve(y_val, ran_for_pos_preds)
ran_for_roc_auc = auc(ran_for_fpr, ran_for_tpr)

# Naive Bayes
gnb_pos_preds = gnb_prob_predictions[:, 1]
gnb_fpr, gnb_tpr, gnb_threshold = roc_curve(y_val, gnb_pos_preds)
gnb_roc_auc = auc(gnb_fpr, gnb_tpr)

fig, ax = plt.subplots(figsize=(12,7), sharex=True, sharey=True)
ax.set_title('ROC', fontdict={"fontsize": 25}, y=1.05)
ax.plot(logr_fpr, logr_tpr, label="Logistic Regression", lw=4)
ax.plot(knn_fpr, knn_tpr, label="KNN (k=20)", lw=4)
ax.plot(ran_for_fpr, ran_for_tpr, label="Random forest", lw=4)
ax.plot(gnb_fpr, gnb_tpr, label="Naive Bayes (Gaussian)", lw=4)
ax.legend(loc = 'lower right')
ax.plot([0, 1], [0, 1],'r--');

# Hyperparameter tuning
In optimizing my models, I identified the following values as best-fit hyperparameters:
 
 - k: 9 (KNN)
 - max_depth: 5 (random forest)
 - min_samples_split: 7 (random forest)

Below are the processes I used.

## Finding best k

In [None]:
print(f'Default best k value (sqrt): {np.sqrt(all_data_df.shape[0])}')

# The target column is left out of the feature list to avoid leaking it into the model
features = ["All-Star?", "PTS/game", "AST/game", "Years from prime", "PER", "Trajectory", "Adjusted TV market value * GS", "TRB/game", "PTS+AST/game", "MP/game", "FT/game"]
X = all_data_df[features]
y = all_data_df["All-Star next season?"]

# Test-train split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, random_state=10)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, random_state=10)

# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

acc = []
for k in range(1,30):
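    # KNN is distance-based: fit on the scaled training set and score on the scaled validation set
    knn_ks = KNeighborsClassifier(n_neighbors = k).fit(X_train_scaled, y_train)
    predictions_ks = knn_ks.predict(X_val_scaled)
    acc.append(accuracy_score(y_val, predictions_ks))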
    
plt.figure(figsize=(18,6))
plt.plot(range(1, 30), acc, color = "cornflowerblue", marker='o', markerfacecolor="cornflowerblue", markersize=10)
plt.title("K value x accuracy", fontdict={"fontsize":20}, y=1.05)
plt.xlabel("K", fontsize=16)
plt.ylabel("Accuracy", fontsize=16, rotation=90)
# acc[0] corresponds to k=1, so offset the index by 1
print(f'Best k: {acc.index(max(acc)) + 1} with accuracy score of {max(acc)}');
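One detail worth spelling out when reading the best value off a list like `acc`: the list index is zero-based while k starts at 1, so the position of the maximum is one less than the k it belongs to. A minimal sketch with hypothetical accuracies (the numbers are made up so that the peak sits at k=9) that pairs each k with its score to sidestep the offset:

```python
ks = list(range(1, 30))

# Hypothetical validation accuracies, one per k, peaking at k=9 (illustration only)
acc_demo = [0.965 - 0.001 * abs(k - 9) for k in ks]

# The raw list position is zero-based, so it is off by one from k
naive_index = acc_demo.index(max(acc_demo))

# Pairing each k with its accuracy avoids the offset entirely
best_k, best_acc = max(zip(ks, acc_demo), key=lambda pair: pair[1])
print(f"list index: {naive_index}, best k: {best_k}")
```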

## Finding best max_depth

In [None]:
features = ["All-Star?", "PTS/game", "AST/game", "Years from prime", "PER", "Trajectory", "Adjusted TV market value * GS", "TRB/game", "PTS+AST/game", "MP/game", "FT/game", "0", "1-3", "4-7", "8+"]
X = all_data_df[features]
y = all_data_df["All-Star next season?"]

# Test-train split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, random_state=10)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, random_state=10)

acc = []
for depth in range(1,30):
    ran_for = RandomForestClassifier(max_depth=depth).fit(X_train, y_train)
    ran_for_predictions = ran_for.predict(X_val)
    acc.append(accuracy_score(y_val, ran_for_predictions))
    
plt.figure(figsize=(18,6))
plt.plot(range(1, 30), acc, color = "cornflowerblue", marker='o', markerfacecolor="cornflowerblue", markersize=10)
plt.title("max_depth x Accuracy", fontdict={"fontsize":20}, y=1.05)
plt.xlabel("max_depth", fontsize=16)
plt.ylabel("Accuracy", fontsize=16, rotation=90)
# acc[0] corresponds to max_depth=1, so offset the index by 1
print(f'Best max_depth: {acc.index(max(acc)) + 1} with accuracy score of {max(acc)}');
plt.savefig("Best max_depth value.svg");

## Finding best min_samples_split

In [None]:
features = ["All-Star?", "PTS/game", "AST/game", "Years from prime", "PER", "Trajectory", "Adjusted TV market value * GS", "TRB/game", "PTS+AST/game", "MP/game", "FT/game", "0", "1-3", "4-7", "8+"]
X = all_data_df[features]
y = all_data_df["All-Star next season?"]

# Test-train split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, random_state=10)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, random_state=10)

acc = []
# min_samples_split must be at least 2 when given as an integer
for split in range(2, 30):
    ran_for = RandomForestClassifier(min_samples_split=split).fit(X_train, y_train)
    ran_for_predictions = ran_for.predict(X_val)
    acc.append(accuracy_score(y_val, ran_for_predictions))
    
plt.figure(figsize=(18,6))
plt.plot(range(2, 30), acc, color = "cornflowerblue", marker='o', markerfacecolor="cornflowerblue", markersize=10)
plt.title("min_samples_split x Accuracy", fontdict={"fontsize":20}, y=1.05)
plt.xlabel("min_samples_split", fontsize=16)
plt.ylabel("Accuracy", fontsize=16, rotation=90)
# acc[0] corresponds to min_samples_split=2, so offset the index by 2
print(f'Best min_samples_split: {acc.index(max(acc)) + 2} with accuracy score of {max(acc)}');
plt.savefig("Best min_samples_split value.svg");

# More model tuning

## After engineering per game stats from the totals
"TV market size," which I added in a previous step, didn't do anything at all on its own. Disappointing but not surprising, given how reductive it was as a measure of player profile/star power. The per-game-adjusted features, meanwhile, brought me a .01 bump in my precision but didn't return the sort of leap I thought was possible. One thought I had here is that the NBA's top tier in key measures really does sit far from the mean. I engineered "...relative" versions of these measures in an attempt to emphasize the strength of the best performers, but that didn't get me anywhere new. I'm pretty sure I just normalized the data the way a StandardScaler does.

In [None]:
features = ["PTS/game", "AST/game", "PER", "MP/game", "Age", "All-Star?", "TV market size"]
X = all_data_df[features]
y = all_data_df["All-Star next season?"]

# Test-train split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, random_state=10)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, random_state=10)

# Scaling training set (for now) and validation set (for later)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Instantiating and fitting
logr = LogisticRegression()
logr.fit(X_train_scaled, y_train)
logr_val_predictions = logr.predict(X_val_scaled)
logr_tn, logr_fp, logr_fn, logr_tp = confusion_matrix(y_val, logr_val_predictions).ravel()
logr_prob_predictions = logr.predict_proba(X_val_scaled)

# Performance
print(f'Accuracy (val): {accuracy_score(y_val, logr_val_predictions)}')
print(classification_report(y_val, logr_val_predictions))
fig, ax = plt.subplots(figsize=(7, 7));
plot_confusion_matrix(logr, X_val_scaled, y_val, ax=ax, cmap="Oranges");

# Modeling with pipeline functions
At this point, or some point, I moved the work I was simply repeating and exposing to unnecessary error into a handful of functions that included train-test splits (random_state=10), scaling, training/fitting, cross-validation, scoring (including log loss), and plotting confusion matrices. I used these functions to inefficiently toggle on/off combinations of my dataset's 50+ features, including features I went back to scraping to obtain. I iterated through combinations, favoring logistic regression, until I pulled four models into an ensemble. I also gave SVC and SMOTE a shot. Not much luck with that. Among my other observations:

- "BLK/game" and "STL/game" didn't help, in line with the perception that defense doesn't make an All-Star
- "WS/48" slightly improved precision but knocked recall and overall score
- "TS%" didn't help (volume over efficiency?)
- My "Years from prime" feature, which I thought might better quantify age relative to accepted average peak (27), and then my "Years from prime ^ 2" feature to give it extra weight, didn't return huge gains
- "Trajectory" added the most to my overall accuracy of all of my engineered features
- Though I built them into my pipeline functions, neither RandomOverSampler nor RandomUnderSampler had any tangible effect for reasons I just do not see clearly (even plugging in different values for sampling_strategy)
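For context on the resampling point in the last bullet, here is a minimal sketch of what random oversampling does: it duplicates randomly chosen minority-class rows until every class matches the majority count, which is how imblearn's RandomOverSampler behaves with its default sampling_strategy. The helper `random_oversample` and the rows are hypothetical stand-ins written in plain Python, not the imblearn implementation.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=10):
    """Duplicate random minority-class rows until all classes match the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(majority - n):
            out_rows.append(rng.choice(pool))  # an exact copy of an existing row
            out_labels.append(cls)
    return out_rows, out_labels

# Hypothetical one-feature rows (points per game) with 2 All-Stars out of 8 players
rows_demo = [[25.1], [8.3], [11.0], [6.4], [27.9], [9.8], [12.2], [7.5]]
labels_demo = [1, 0, 0, 0, 1, 0, 0, 0]

rows_bal, labels_bal = random_oversample(rows_demo, labels_demo)
print(Counter(labels_bal))
```

Resampling like this would normally be applied to the training split only, so the validation set keeps the real class balance. Since the added rows are exact duplicates, they carry no new information, which may be part of why the effect was small; that is a guess rather than something verified here.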

## First SVC model...
...in part to use SMOTE as a means of smoothing out irregularities caused by my imbalanced targets.

In [None]:
score, class_report, con_matrix = nbam.nba_svc(all_data_df, ["PTS/game relative", "AST/game relative", "PER", "MP/game", "Age", "All-Star?"], "All-Star next season?", SMOTE=True, print_all=True)

## Logistic regression function in action

In [None]:
score, logloss, class_report, con_matrix = nbam.nba_log_regression(all_data_df, ["PTS/game", "AST/game", "PER", "MP/game", "Age", "Adjusted TV market value", "All-Star?"], "All-Star next season?", print_all=True)
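For readers unfamiliar with the log loss score returned here, this is a minimal sketch of binary log loss computed by hand. It penalizes confident wrong probabilities far more than hesitant ones. The labels and probabilities are hypothetical, and `binary_log_loss` is an illustrative helper, not the function used inside my pipeline.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels and predicted All-Star probabilities (illustration only)
y_demo = [1, 0, 1, 0]
p_demo = [0.9, 0.1, 0.6, 0.4]
loss_demo = binary_log_loss(y_demo, p_demo)
print(f"log loss: {loss_demo:.4f}")
```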

## KNN function in action

In [None]:
score, logloss, class_report, con_matrix = nbam.nba_knn(all_data_df, ["All-Star?", "PTS/game", "AST/game", "PER", "MP/game", "Years from prime", "Trajectory", "Adjusted TV market value * GS"], "All-Star next season?", RandomOverSampler=True, print_all=True)

## Random forest function in action

In [None]:
score, logloss, class_report, con_matrix = nbam.nba_random_forest(all_data_df, ["All-Star?", "PTS/game", "AST/game", "Years from prime", "PER", "Trajectory", "Adjusted TV market value * GS"], "All-Star next season?", RandomOverSampler=True, print_all=True)

## SVC function in action

In [None]:
score, logloss, class_report, con_matrix = nbam.nba_svc(all_data_df, ["All-Star?", "PTS/game", "AST/game", "Years from prime", "PER", "Trajectory", "Adjusted TV market value * GS"], "All-Star next season?", print_all=True)

# Ensembling and predictions
My feature engineering, modeling and hyperparameter tuning didn't return a single model that could predict the exact set of NBA All-Stars in a given season, though it did get 23/24 on the 2015-16 season I held out at the beginning. A very good result, but still out of sync with the results I was getting again and again on my training sets. My last effort was to try XGBoost, which surprisingly didn't improve my score at all, and a VotingClassifier() ensemble, which did, though marginally. My final model was an ensemble containing a logistic regression model, a KNN model (k=9), a random forest model (max_depth=5, min_samples_split=7), and an SVC model. It performed near-negligibly better than my initial, three-feature model, proving I guess that you need more than two weeks to innovate as an NBA data trader. And so I will try again...
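As a rough picture of what a voting ensemble does, the sketch below averages the positive-class probabilities from four models, which is the idea behind soft voting. Whether my ensemble used soft or hard voting is not shown in this notebook, so treat the choice here as an assumption. VotingClassifier defaults to hard (majority-label) voting, and soft voting needs every member to expose predict_proba (for SVC that means probability=True). The probabilities and `soft_vote` helper are hypothetical.

```python
def soft_vote(prob_lists, weights=None):
    """Weighted average of per-model positive-class probabilities, row by row."""
    weights = weights or [1.0] * len(prob_lists)
    total = sum(weights)
    n_rows = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_rows)
    ]

# Hypothetical All-Star probabilities from four models for three players
logr_p = [0.90, 0.40, 0.10]
knn_p = [0.78, 0.56, 0.00]
rf_p = [0.85, 0.35, 0.05]
svc_p = [0.95, 0.45, 0.15]

ensemble_p = soft_vote([logr_p, knn_p, rf_p, svc_p])
ensemble_labels = [int(p >= 0.5) for p in ensemble_p]
print(ensemble_p, ensemble_labels)
```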

In the meantime, I put together a function that used my final model in the manner I made my goal two weeks back: as a means of predicting exactly 24 players when fed a single season's worth of data, namely the 24 players projected as most likely to be selected as All-Stars in the following season. That function, which I built into my first Flask app (currently only hosted locally), is below.

In [None]:
nbam.nba_ensemble_predict(all_data_df, ["All-Star?", "PTS/game", "AST/game", "Years from prime", "PER", "Trajectory", "Adjusted TV market value * GS", "TRB/game", "PTS+AST/game", "MP/game", "FT/game"], "All-Star next season?", plot=False)
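The internals of nba_ensemble_predict live in the modeling functions file rather than in this notebook. One simple way to force exactly N picks, sketched here under that caveat, is to rank players by predicted probability and keep the top N instead of applying a 0.5 cutoff. The player names, probabilities, and `top_n_players` helper are hypothetical.

```python
def top_n_players(players, probabilities, n=24):
    """Return the n players with the highest predicted All-Star probability."""
    ranked = sorted(zip(players, probabilities), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:n]]

# Hypothetical players and ensemble probabilities (illustration only)
players_demo = ["Player A", "Player B", "Player C", "Player D", "Player E", "Player F"]
probs_demo = [0.12, 0.91, 0.47, 0.88, 0.03, 0.65]

picks = top_n_players(players_demo, probs_demo, n=3)
print(picks)
```

Ranking this way always returns exactly n names, even when fewer (or more) than n players clear a fixed probability threshold.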