<a href="https://colab.research.google.com/github/frankwillard/NBA-Hall-Of-Fame-Model/blob/main/Hall_of_Fame_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Load Packages and Data

In [2]:
from math import exp
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression, Lasso, LassoCV
from sklearn.model_selection import train_test_split, GridSearchCV, RepeatedStratifiedKFold
from sklearn.metrics import confusion_matrix, classification_report, get_scorer, accuracy_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, mutual_info_regression, RFE, SelectFromModel, SequentialFeatureSelector
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [3]:
# Read in cleaned player data
model_df = pd.read_csv("https://raw.githubusercontent.com/frankwillard/NBA-Hall-Of-Fame-Model/main/Scraped%20Player%20Data.csv", index_col=0)
model_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4977 entries, 0 to 4976
Data columns (total 76 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Player                     4977 non-null   object 
 1   Eligible                   4977 non-null   int64  
 2   Position                   4977 non-null   object 
 3   Hall_of_Fame               4977 non-null   int64  
 4   MVP                        4977 non-null   int64  
 5   Finals_MVP                 4977 non-null   int64  
 6   NBA_Champ                  4977 non-null   int64  
 7   All_NBA                    4977 non-null   int64  
 8   All_Defensive              4977 non-null   int64  
 9   Def_POY                    4977 non-null   int64  
 10  All_Star                   4977 non-null   int64  
 11  Scoring_Champ              4977 non-null   int64  
 12  TRB_Champ                  4977 non-null   int64  
 13  AST_Champ                  4977 non-null   int64

### Data Cleaning

In [4]:
# Replace all instances of -999 with NA (consider doing this in scraper to eliminate a step)
model_df = model_df.replace(-999, np.nan)
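
The comment above suggests handling the sentinel earlier. One option, sketched here on an inline CSV rather than the real file, is to let `pd.read_csv` convert `-999` to NaN at load time through its `na_values` parameter. The column values and the `demo_df` name are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row extract; -999 marks a missing value as in the scraped data
csv_text = "Player,MP_per_game,PTS_per_game\nA,-999,10.5\nB,31.2,-999\n"

# na_values tells the parser to treat -999 as missing while reading
demo_df = pd.read_csv(io.StringIO(csv_text), na_values=[-999])

print(demo_df.isna().sum())
```

This only helps if `-999` is never a legitimate value in any column, which appears to be the case for the statistics in this dataset.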

In [5]:
# Reduce number of possible positions to guard, forward, center
model_df.loc[model_df['Position'] == 'Center/Forward', 'Position'] = 'Center'
model_df.loc[model_df['Position'].isin(['PointGuard', 'ShootingGuard', 'Guard/Forward']), 'Position'] = 'Guard'
model_df.loc[model_df['Position'].isin(['SmallForward', 'PowerForward', 'Forward/Guard', 'Forward/Center']), 'Position'] = 'Forward'
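
The same regrouping can be expressed as a single lookup table, which keeps every label-to-group decision in one place. This is a sketch on a small made-up frame; `POSITION_GROUPS` and `demo` are hypothetical names, and the labels are copied from the assignments above.

```python
import pandas as pd

# Mapping from detailed position labels to the three coarse groups
POSITION_GROUPS = {
    'Center/Forward': 'Center',
    'PointGuard': 'Guard', 'ShootingGuard': 'Guard', 'Guard/Forward': 'Guard',
    'SmallForward': 'Forward', 'PowerForward': 'Forward',
    'Forward/Guard': 'Forward', 'Forward/Center': 'Forward',
}

demo = pd.DataFrame({'Position': ['PointGuard', 'Center', 'Forward/Center', 'Center/Forward']})

# replace leaves labels that are not keys in the mapping (e.g. 'Center') unchanged
demo['Position'] = demo['Position'].replace(POSITION_GROUPS)
print(demo['Position'].tolist())
```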

In [6]:
# Combine individual ABA and NBA accolades
model_df['All_League'] = model_df['All_NBA'] + model_df['All_ABA']
model_df['Champ'] = model_df['NBA_Champ'] + model_df['ABA_Champ']

#### Dealing with NA values

In [7]:
# Count number of NAs by column:
hofers = model_df['Hall_of_Fame'].value_counts()[1]
for col in model_df.columns:
  na_df = model_df[model_df[col].isna()]
  if len(na_df) > 0:
    try:
      na_rows = na_df['Hall_of_Fame'].value_counts()[1]
    except KeyError:
      na_rows = 0
    print(f"{col}:\t{len(model_df[model_df[col].isna()])} nulls \t{na_rows}/{hofers} HOFers are null")

MP_per_game:	340 nulls 	1/140 HOFers are null
3P_per_game:	1118 nulls 	53/140 HOFers are null
3PA_per_game:	1118 nulls 	53/140 HOFers are null
2P_per_game:	1118 nulls 	53/140 HOFers are null
2PA_per_game:	1118 nulls 	53/140 HOFers are null
ORB_per_game:	949 nulls 	39/140 HOFers are null
DRB_per_game:	949 nulls 	39/140 HOFers are null
TRB_per_game:	288 nulls 	1/140 HOFers are null
STL_per_game:	1180 nulls 	41/140 HOFers are null
BLK_per_game:	1180 nulls 	41/140 HOFers are null
MP_totals:	340 nulls 	1/140 HOFers are null
GS_totals:	1689 nulls 	66/140 HOFers are null
FG%_totals:	34 nulls 	0/140 HOFers are null
3P_totals:	1118 nulls 	53/140 HOFers are null
3PA_totals:	1118 nulls 	53/140 HOFers are null
3P%_totals:	1627 nulls 	54/140 HOFers are null
2P_totals:	1118 nulls 	53/140 HOFers are null
2PA_totals:	1118 nulls 	53/140 HOFers are null
2P%_totals:	1162 nulls 	53/140 HOFers are null
eFG%_totals:	1146 nulls 	53/140 HOFers are null
FT%_totals:	241 nulls 	0/140 HOFers are null
ORB_totals:	
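
A vectorized way to build the same kind of summary, shown on a tiny made-up frame. The `demo` data and the `summary` column names are assumptions for illustration, not output from the real dataset.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'Hall_of_Fame': [1, 0, 1, 0],
    'STL_per_game': [np.nan, 1.2, np.nan, np.nan],
    'PTS_per_game': [20.1, 5.0, 18.3, 2.2],
})

# Boolean mask of missing values for every feature column
null_mask = demo.drop(columns='Hall_of_Fame').isna()

summary = pd.DataFrame({
    'nulls': null_mask.sum(),                                  # nulls per column
    'hof_nulls': null_mask[demo['Hall_of_Fame'] == 1].sum(),   # nulls among HOFers
})

# Keep only columns that actually contain nulls
summary = summary[summary['nulls'] > 0]
print(summary)
```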

In [8]:
# Columns with -999s
for col in model_df.columns:
  if len(model_df[model_df[col] == -999]) > 0:
    print(col, "-", len(model_df[model_df[col] == -999]))

Columns to drop:
`GS_totals`, `Trp_Dbl_totals`, `ORB_per_game`, `DRB_per_game`, `ORB_totals`, `DRB_totals`, `3P%_totals`, `2P%_totals`, `eFG%_totals`, `OWS_advanced`, `DWS_advanced`, `WS/48_advanced`, `OBPM_advanced`, `DBPM_advanced`
<br/>
<br/>

Columns to consider dropping: `3P_per_game`, `3PA_per_game`, `3P_totals`, `3PA_totals`, `FG%_totals` (these players never took a shot), `FT%_totals` (these players never took a FT)
<br/>
<br/>

Columns to impute from FGM, FGA, etc.:
`2P_per_game`, `2PA_per_game`, `2P_totals`, `2PA_totals`
<br/>
<br/>

Columns to fill with league average:
`PER_advanced`, `VORP_advanced` (consider some more advanced PER/VORP)
<br/>
<br/>

Columns to make 0:
`WS_advanced`, `BPM_advanced` (consider some more advanced imputation for BPM)
<br/>
<br/>

Columns to make 0 or fill with mean (undecided):
`TS%_advanced` (these players never took a shot or free throw)
<br/>
<br/>

Columns to fill with mean by position:
`PTS_per_game`, `TRB_per_game`, `AST_per_game`, `STL_per_game`, `BLK_per_game`, `TRB_totals`, `AST_totals`, `STL_totals`, `BLK_totals`

In [9]:
# Fill NAs accordingly
def fillNulls(model_df):
  cols_to_zero = ['WS_advanced', 'OWS_advanced', 'DWS_advanced', 'BPM_advanced',
                  '3P_per_game', '3PA_per_game', '3P_totals', '3PA_totals', 'FG%_totals', 'FT%_totals', 'TS%_advanced']
  model_df[cols_to_zero] = model_df[cols_to_zero].fillna(0) # fill cols with 0
  
  cols_to_avg = ['PER_advanced', 'VORP_advanced', '3P%_totals', '2P%_totals', 'eFG%_totals']
  model_df[cols_to_avg] = model_df[cols_to_avg].fillna(model_df[cols_to_avg].mean()) # fill cols with avg
  
  cols_to_position_avg = ['TRB_per_game', 'AST_per_game', 'STL_per_game', 'BLK_per_game']
  model_df[cols_to_position_avg] = model_df.groupby("Position")[cols_to_position_avg].transform(lambda x: x.fillna(x.mean())) # fills cols with avg by position

  cols_to_scale_avg = ['TRB_totals', 'AST_totals', 'STL_totals', 'BLK_totals']
  for col_total, col_avg in zip(cols_to_scale_avg, cols_to_position_avg):
    model_df[col_total] = model_df[col_total].fillna(model_df[col_avg] * model_df['G_totals'])
  
  cols_to_fill = ['2P_per_game', '2PA_per_game', '2P_totals', '2PA_totals']
  cols_to_fill_with = ['FG_per_game', 'FGA_per_game', 'FG_totals', 'FGA_totals']
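  # Intended to fill the 2P shooting columns with the FG columns.
  # NOTE: fillna aligns on column labels, so the differently named FG columns
  # do not fill anything here; the 2P columns still contain nulls afterwards.
  model_df[cols_to_fill] = model_df[cols_to_fill].fillna(model_df[cols_to_fill_with])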

  return model_df
fillNulls(model_df)

Unnamed: 0,Player,Eligible,Position,Hall_of_Fame,MVP,Finals_MVP,NBA_Champ,All_NBA,All_Defensive,Def_POY,...,DWS_advanced,WS_advanced,WS/48_advanced,OBPM_advanced,DBPM_advanced,BPM_advanced,VORP_advanced,peak_ws_advanced,All_League,Champ
0,Alaa Abdelnaby,1,Forward,0,0,0,0,0,0,0,...,4.1,4.8,0.072,-2.9,-0.9,-3.8,-1.500000,2.1,0,0
1,Zaid Abdul-Aziz,1,Center,0,0,0,0,0,0,0,...,11.6,17.5,0.076,0.6,-0.2,0.4,2.700000,6.5,0,0
2,Kareem Abdul-Jabbar,1,Center,1,6,2,6,15,11,0,...,94.5,273.4,0.228,4.1,1.6,5.7,85.700000,25.4,15,6
3,Mahmoud Abdul-Rauf,1,Guard,0,0,0,0,0,0,0,...,8.4,25.2,0.077,0.7,-1.5,-0.8,4.500000,6.8,0,0
4,Tariq Abdul-Wahad,1,Guard,0,0,0,0,0,0,0,...,4.1,3.5,0.035,-2.6,-0.4,-3.0,-1.200000,2.2,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4972,Jim Zoet,1,Center,0,0,0,0,0,0,0,...,0.0,-0.1,-0.123,-5.6,0.2,-5.4,-0.100000,-0.1,0,0
4973,Bill Zopf,1,Guard,0,0,0,0,0,0,0,...,0.4,-0.1,-0.011,,,0.0,3.434634,-0.1,0,0
4974,Ivica Zubac,0,Center,0,0,0,0,0,0,0,...,9.6,26.1,0.183,0.3,0.4,0.6,4.500000,7.2,0,0
4975,Matt Zunic,1,Guard,0,0,0,0,0,0,0,...,1.8,2.0,,,,0.0,3.434634,2.0,0,0
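
The null count further down still lists the 2P columns after `fillNulls` has run. One likely reason is that `DataFrame.fillna` aligns a DataFrame filler on column labels, so filling `2P_*` columns from `FG_*` columns is a no-op unless the labels match. The sketch below reproduces this on a made-up two-row frame and shows one possible fix by renaming the filler's columns. It is an illustration, not the notebook's code, and applying it would change which columns survive the exclusion step.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'FG_totals': [100.0, 250.0],
    '2P_totals': [np.nan, 200.0],
})

# The filler's only column is 'FG_totals', so '2P_totals' is never touched
unfilled = demo[['2P_totals']].fillna(demo[['FG_totals']])

# Renaming the filler's column to match the target makes the fill take effect
filler = demo[['FG_totals']].rename(columns={'FG_totals': '2P_totals'})
filled = demo[['2P_totals']].fillna(filler)

print(unfilled['2P_totals'].tolist(), filled['2P_totals'].tolist())
```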


#### Excluding Columns


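*   NBA and ABA accolades were combined into `All_League` and `Champ`, so the separate league columns are redundant
*   Games played is not a performance statistic
*   Name, eligibility, and position are not used as predictors
*   Attempts add little on their own, since makes and percentage are passed in instead
*   FT and FG are already encoded in points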

In [10]:
# Exclude these columns
exclude_cols = ['Player', 'Eligible', 'Position', 'NBA_Champ', 'All_NBA', 'All_ABA', 'ABA_Champ', 'G_totals', 
                '3PA_totals', 'FTA_totals', 'FGA_totals', 
                'FTA_per_game', 'FGA_per_game', '3PA_per_game',
                'FT_per_game', 'FT_totals', 'FG_per_game', 'FG_totals']

# Count and drop columns with remaining null values
for col in model_df.columns:
  na_df = model_df[model_df[col].isna()]
  if len(na_df) > 0:
    try:
      na_rows = na_df['Hall_of_Fame'].value_counts()[1]
    except KeyError:
      na_rows = 0
    print(f"{col}:\t{len(model_df[model_df[col].isna()])} nulls \t{na_rows}/{hofers} HOFers are null")
    exclude_cols.append(col)

MP_per_game:	340 nulls 	1/140 HOFers are null
2P_per_game:	1118 nulls 	53/140 HOFers are null
2PA_per_game:	1118 nulls 	53/140 HOFers are null
ORB_per_game:	949 nulls 	39/140 HOFers are null
DRB_per_game:	949 nulls 	39/140 HOFers are null
MP_totals:	340 nulls 	1/140 HOFers are null
GS_totals:	1689 nulls 	66/140 HOFers are null
2P_totals:	1118 nulls 	53/140 HOFers are null
2PA_totals:	1118 nulls 	53/140 HOFers are null
ORB_totals:	949 nulls 	39/140 HOFers are null
DRB_totals:	949 nulls 	39/140 HOFers are null
Trp_Dbl_totals:	4526 nulls 	53/140 HOFers are null
WS/48_advanced:	344 nulls 	1/140 HOFers are null
OBPM_advanced:	1185 nulls 	41/140 HOFers are null
DBPM_advanced:	1185 nulls 	41/140 HOFers are null


In [11]:
exclude_cols

['Player',
 'Eligible',
 'Position',
 'NBA_Champ',
 'All_NBA',
 'All_ABA',
 'ABA_Champ',
 'G_totals',
 '3PA_totals',
 'FTA_totals',
 'FGA_totals',
 'FTA_per_game',
 'FGA_per_game',
 '3PA_per_game',
 'FT_per_game',
 'FT_totals',
 'FG_per_game',
 'FG_totals',
 'MP_per_game',
 '2P_per_game',
 '2PA_per_game',
 'ORB_per_game',
 'DRB_per_game',
 'MP_totals',
 'GS_totals',
 '2P_totals',
 '2PA_totals',
 'ORB_totals',
 'DRB_totals',
 'Trp_Dbl_totals',
 'WS/48_advanced',
 'OBPM_advanced',
 'DBPM_advanced']

### Feature Selection

In [12]:
# Get dataframe of Hall of Fame eligible players with all remaining columns
all_cols_eligible_df = (model_df[model_df['Eligible'] == 1]).loc[:, ~model_df.columns.isin(exclude_cols)]

In [13]:
# Scale features and split eligible dataset into dependent and independent variables
sc0 = StandardScaler()

X_all = pd.DataFrame(sc0.fit_transform(all_cols_eligible_df.iloc[:,1:]), index=all_cols_eligible_df.index, columns=all_cols_eligible_df.columns[1:])
y_all = all_cols_eligible_df.iloc[:,0]

In [14]:
# Print number of remaining columns
print(len(X_all.columns), "columns remaining")
X_all.columns

44 columns remaining


Index(['MVP', 'Finals_MVP', 'All_Defensive', 'Def_POY', 'All_Star',
       'Scoring_Champ', 'TRB_Champ', 'AST_Champ', 'STL_Champ', 'BLK_Champ',
       'ROY', '3P_per_game', 'TRB_per_game', 'AST_per_game', 'STL_per_game',
       'BLK_per_game', 'PTS_per_game', 'FG%_totals', '3P_totals', '3P%_totals',
       '2P%_totals', 'eFG%_totals', 'FT%_totals', 'TRB_totals', 'AST_totals',
       'STL_totals', 'BLK_totals', 'PTS_totals', 'pts_per_g_seasonal',
       'mvp_shares_seasonal', 'trb_per_g_seasonal', 'ast_per_g_seasonal',
       'ws_seasonal', 'accum_mvp_shares_seasonal', 'PER_advanced',
       'TS%_advanced', 'OWS_advanced', 'DWS_advanced', 'WS_advanced',
       'BPM_advanced', 'VORP_advanced', 'peak_ws_advanced', 'All_League',
       'Champ'],
      dtype='object')

#### Experimenting with Various Feature Selectors

In [15]:
# Select top k features based on mutual info regression
kbest_selector = SelectKBest(mutual_info_regression, k = 10)
kbest_selector.fit(X_all, y_all)
kbest_cols = list(X_all.columns[kbest_selector.get_support()])

kbest_cols

['All_Star',
 'PTS_per_game',
 'PTS_totals',
 'mvp_shares_seasonal',
 'ws_seasonal',
 'accum_mvp_shares_seasonal',
 'DWS_advanced',
 'WS_advanced',
 'peak_ws_advanced',
 'All_League']
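
Because `Hall_of_Fame` is a binary label, scikit-learn's `mutual_info_classif` may be a more natural scoring function than `mutual_info_regression`. The list above was produced with the regression variant and is left as-is. The sketch below uses synthetic data (`X_demo`, `y_demo` are made up) only to show how the classification variant plugs into `SelectKBest`.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 2] > 0).astype(int)  # only column 2 carries signal

# partial fixes random_state so the mutual information estimate is repeatable
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func, k=1).fit(X_demo, y_demo)

selected = np.flatnonzero(selector.get_support())
print(selected)
```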

In [16]:
# Select top features from RFE selector
rfe_selector = RFE(estimator=LogisticRegression(max_iter=120),n_features_to_select = 10, step = 1)
rfe_selector.fit(X_all, y_all)
rfe_cols = list(X_all.columns[rfe_selector.get_support()])

rfe_cols

['MVP',
 'All_Star',
 'TRB_per_game',
 'FG%_totals',
 'TRB_totals',
 'BLK_totals',
 'pts_per_g_seasonal',
 'PER_advanced',
 'DWS_advanced',
 'WS_advanced']

In [17]:
# Select top features from SFM selector
sfm_selector = SelectFromModel(estimator=LogisticRegression())
sfm_selector.fit(X_all, y_all)
sfm_cols = list(X_all.columns[sfm_selector.get_support()])

sfm_cols

['MVP',
 'All_Star',
 'Scoring_Champ',
 'TRB_per_game',
 'AST_per_game',
 'BLK_per_game',
 'FG%_totals',
 'FT%_totals',
 'TRB_totals',
 'BLK_totals',
 'PTS_totals',
 'pts_per_g_seasonal',
 'ws_seasonal',
 'PER_advanced',
 'OWS_advanced',
 'DWS_advanced',
 'WS_advanced',
 'Champ']

Variables selected by all three techniques:

In [18]:
list(set(sfm_cols) & set(rfe_cols) & set(kbest_cols))

['DWS_advanced', 'All_Star', 'WS_advanced']

Variables selected by at least two of the three techniques:

In [19]:
list(set(sfm_cols) & set(rfe_cols) | set(kbest_cols) & set(rfe_cols) | set(kbest_cols) & set(sfm_cols))

['TRB_totals',
 'PTS_totals',
 'All_Star',
 'MVP',
 'TRB_per_game',
 'DWS_advanced',
 'ws_seasonal',
 'WS_advanced',
 'pts_per_g_seasonal',
 'BLK_totals',
 'FG%_totals',
 'PER_advanced']
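
The set expressions above can also be read as a vote count: each selector casts one vote per feature it keeps. The sketch below hard-codes the three lists exactly as printed by the selectors and tallies them with `collections.Counter`; the `votes`, `all_three`, and `at_least_two` names are illustrative.

```python
from collections import Counter

# Feature lists as printed by the three selectors above
kbest_cols = ['All_Star', 'PTS_per_game', 'PTS_totals', 'mvp_shares_seasonal',
              'ws_seasonal', 'accum_mvp_shares_seasonal', 'DWS_advanced',
              'WS_advanced', 'peak_ws_advanced', 'All_League']
rfe_cols = ['MVP', 'All_Star', 'TRB_per_game', 'FG%_totals', 'TRB_totals',
            'BLK_totals', 'pts_per_g_seasonal', 'PER_advanced', 'DWS_advanced',
            'WS_advanced']
sfm_cols = ['MVP', 'All_Star', 'Scoring_Champ', 'TRB_per_game', 'AST_per_game',
            'BLK_per_game', 'FG%_totals', 'FT%_totals', 'TRB_totals', 'BLK_totals',
            'PTS_totals', 'pts_per_g_seasonal', 'ws_seasonal', 'PER_advanced',
            'OWS_advanced', 'DWS_advanced', 'WS_advanced', 'Champ']

# Each list has unique entries, so a feature's count equals its number of votes
votes = Counter(kbest_cols + rfe_cols + sfm_cols)

all_three = sorted(col for col, n in votes.items() if n == 3)
at_least_two = sorted(col for col, n in votes.items() if n >= 2)

print(all_three)
print(len(at_least_two), "features selected by at least two techniques")
```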

In [20]:
# Calculate VIF for multicollinearity detection
X_selected_all = X_all[['All_Star', 'WS_advanced', 'BLK_totals',
                        'FG%_totals', 'PER_advanced', 'PTS_totals', 'MVP',
                        'ws_seasonal', 'pts_per_g_seasonal', 'TRB_per_game',
                        'DWS_advanced', 'TRB_totals']]

# Create a dataframe to store VIF results
vif_data = pd.DataFrame()
vif_data['Variable'] = X_selected_all.columns
vif_data['VIF'] = [variance_inflation_factor(X_selected_all.values, i) for i in range(X_selected_all.shape[1])]

# Print the VIF results
print(vif_data)

              Variable        VIF
0             All_Star   4.030357
1          WS_advanced  22.492040
2           BLK_totals   2.574060
3           FG%_totals   1.666574
4         PER_advanced   1.938330
5           PTS_totals  13.859977
6                  MVP   2.290717
7          ws_seasonal   5.743731
8   pts_per_g_seasonal   3.512514
9         TRB_per_game   3.493599
10        DWS_advanced  14.286602
11          TRB_totals  13.222963


**Collinearity analysis of variables selected by at least two of the three techniques**

Using a VIF threshold of 10, four columns exceed the threshold:
*   'WS_advanced' and 'DWS_advanced': collinear with one another, since win shares are the sum of offensive and defensive win shares. Instead, we can include both 'OWS_advanced' and 'DWS_advanced' in the final model, which brings the VIF of 'DWS_advanced' below 10.
*   'TRB_totals': likely collinear with 'TRB_per_game', since players with a high career rebound total are also likely to have a high per-game average.
*   'PTS_totals': likely collinear with 'pts_per_g_seasonal' for a similar reason. However, high scoring across individual seasons may be as important as high career scoring, so we may experiment with both variables in the final model and determine the best choice.
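
For reference, the VIF of feature $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on the remaining features. The NumPy sketch below implements that definition directly on synthetic data, where a third column is nearly the sum of the first two (analogous to WS = OWS + DWS). The helper `vif` and the data are illustrative. Agreement with `variance_inflation_factor` is assumed for zero-mean features such as the standardized ones used here, not verified.

```python
import numpy as np

def vif(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of X."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)  # centering makes an explicit intercept unnecessary
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Least-squares fit of column j on all other columns
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / (target @ target)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
total = a + b + rng.normal(scale=0.1, size=500)  # nearly a + b

vif_independent = vif(np.column_stack([a, b]))
vif_with_total = vif(np.column_stack([a, b, total]))
print(vif_independent, vif_with_total)
```

Two independent columns give values near 1, while adding their near-exact sum pushes all three well past the threshold of 10.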















In [21]:
# Calculate VIF after choices
X_selected_all = X_all[['All_Star', 'OWS_advanced', 'BLK_totals',
                        'FG%_totals', 'PER_advanced', 'MVP',
                        'ws_seasonal', 'pts_per_g_seasonal', 'DWS_advanced',
                        'TRB_totals']]

# Create a dataframe to store VIF results
vif_data = pd.DataFrame()
vif_data['Variable'] = X_selected_all.columns
vif_data['VIF'] = [variance_inflation_factor(X_selected_all.values, i) for i in range(X_selected_all.shape[1])]

# Print the VIF results
print(vif_data)

             Variable       VIF
0            All_Star  4.003411
1        OWS_advanced  4.400314
2          BLK_totals  2.469652
3          FG%_totals  1.652027
4        PER_advanced  1.772769
5                 MVP  2.280258
6         ws_seasonal  4.539655
7  pts_per_g_seasonal  2.983937
8        DWS_advanced  9.392229
9          TRB_totals  7.479541


We also know that NBA champions, especially high performers on championship teams, are viewed favorably by the Selection Committee. It seems reasonable that the feature selectors would not pick up on this: every player on a championship team gets credit for the title (including bench players who barely or never played), so championships on their own were downweighted. We can remedy this by including `Champ` in the model, as well as the interaction terms `Champ * OWS_advanced` and `Champ * DWS_advanced`, to increase predictive power for high performers on championship teams.

In [22]:
# Create interaction terms
# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
X_selected_all = X_selected_all.copy()
X_selected_all['Champ'] = X_all['Champ']
X_selected_all['Champ_x_OWS_advanced'] = X_all['Champ'] * X_all['OWS_advanced']
X_selected_all['Champ_x_DWS_advanced'] = X_all['Champ'] * X_all['DWS_advanced']

# Create a dataframe to store VIF results
vif_data = pd.DataFrame()
vif_data['Variable'] = X_selected_all.columns
vif_data['VIF'] = [variance_inflation_factor(X_selected_all.values, i) for i in range(X_selected_all.shape[1])]

# Print the VIF results
print(vif_data)


                Variable        VIF
0               All_Star   4.093590
1           OWS_advanced   4.941695
2             BLK_totals   2.685916
3             FG%_totals   1.653981
4           PER_advanced   1.773810
5                    MVP   2.949502
6            ws_seasonal   4.591775
7     pts_per_g_seasonal   3.101335
8           DWS_advanced  11.250812
9             TRB_totals   7.860735
10                 Champ   2.220325
11  Champ_x_OWS_advanced   3.151245
12  Champ_x_DWS_advanced   3.404330


#### Selecting Our Columns

In [23]:
# Select variables for model
model_cols = [
    # Variables for splitting/assessing predictions
    'Player', 'Eligible', 'Hall_of_Fame',
    # Variables from variable selection
    'All_Star', 'OWS_advanced', 'BLK_totals', 'FG%_totals', 'PER_advanced',
    'MVP', 'ws_seasonal', 'pts_per_g_seasonal', 'DWS_advanced', 'TRB_totals',
    # Empirical selection
    'Champ'
]

eligible_df = (model_df[model_df['Eligible'] == 1]).loc[:, model_df.columns.isin(model_cols)]
noneligible_df = (model_df[model_df['Eligible'] == 0]).loc[:, model_df.columns.isin(model_cols)]

In [24]:
# Dropping players that are unpredictable (reasoning explained below)
extraneous_players = ['Maurice Stokes', 'Bill Bradley', 'Toni Kukoč',
       'Calvin Murphy', 'Vlade Divac', 'Buddy Jeannette',
       'Dražen Petrović', 'Al Cervi', 'Arvydas Sabonis',
       'Šarūnas Marčiulionis', 'Dino Radja', 'Chuck Cooper',
       'Bob Houbregs']

eligible_df = eligible_df[~eligible_df['Player'].isin(extraneous_players)]

### Train/Test Split

In [25]:
# Split eligible players into features (X) and target (y)
X_eligible = eligible_df.iloc[:, 3:].values
y_eligible = eligible_df.iloc[:, 2].values

In [26]:
# Train-test split dividing HOF eligible players into a training set and a validation set
X_training, X_validation, y_train, y_val = train_test_split(eligible_df, y_eligible, test_size = 0.25, random_state = 0)

In [27]:
X_train = X_training.iloc[:,3:].values
X_val = X_validation.iloc[:,3:].values

In [28]:
X_test = noneligible_df.iloc[:, 3:].values

### Feature Scaling

In [29]:
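# sc1 is fit on all eligible players; X_eligible is only used for the cross-validated grid search
# sc2 is fit on the training split only and then applied to the validation and test sets,
# so validation and test statistics do not leak into training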
sc1 = StandardScaler()
X_eligible = sc1.fit_transform(X_eligible)

sc2 = StandardScaler()
X_train = sc2.fit_transform(X_train)
X_val = sc2.transform(X_val)
X_test = sc2.transform(X_test)
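
When a scaled matrix is later passed to cross-validation, the scaler has already seen every fold. One common way to avoid that, sketched here on synthetic data rather than the notebook's features, is to wrap the scaler and the classifier in a scikit-learn pipeline so the scaler is refit on each training fold. `X_demo`, `y_demo`, and `pipe` are hypothetical names, and this is an alternative setup, not what the notebook does.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# The scaler is fit inside each CV split, on that split's training portion only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='f1')

print(scores.mean())
```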

### Model Selection

In [30]:
# Define candidate classifiers

CLASSIFIERS = [
    LogisticRegression(),
    XGBClassifier(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    SVC(),
    GaussianNB(),
    KNeighborsClassifier(),
    AdaBoostClassifier(),
    SVC(kernel = 'rbf')  # same as SVC() above: 'rbf' is the default kernel
  ]

In [31]:
# Define evaluation metrics

METRICS = [
    'f1',
    'accuracy',
    'precision',
    'recall',
    'average_precision'
]

Accuracy: fraction of all players classified correctly

Precision: of the players the model predicted as HOF, the fraction who are truly HOF

Recall: of the true HOF players, the fraction the model predicted as HOF

F1: harmonic mean of precision and recall
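
As a worked example, these definitions can be evaluated by hand from confusion-matrix counts. The counts below are inferred from the validation ratios reported later in the notebook for `LogisticRegression` (precision 0.88 = 22/25, recall 22/31, accuracy 1008/1020). They are an assumption that reproduces those ratios, not values read from an actual confusion matrix.

```python
# Assumed counts: true positives, false positives, false negatives, true negatives
tp, fp, fn, tn = 22, 3, 9, 986

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.6f} precision={precision:.2f} "
      f"recall={recall:.6f} f1={f1:.6f}")
```

With only a few dozen positives among roughly a thousand players, accuracy stays near 0.99 even when a third of the true HOF players are missed, which is why F1, precision, and recall are the more informative columns here.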

In [32]:
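# Function that fits on the training set, predicts on X, and evaluates each metric
def get_metrics(classifier, X, y_true):
  print(classifier)
  # Always fit on the (global) training split, regardless of which set is scored
  classifier.fit(X_train, y_train)
  y_pred = classifier.predict(X)

  output_metrics = []

  for metric in METRICS:
    # _score_func is the metric function behind the named scorer;
    # 'average_precision' is computed here from hard class labels, not scores
    score = get_scorer(metric)._score_func(y_true, y_pred)
    output_metrics.append(score)

  return output_metrics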

In [33]:
# Get metrics for each model type
training_metrics = []
val_metrics = []

for classifier in CLASSIFIERS:
  clf_train_metrics = get_metrics(classifier, X_train, y_train)
  clf_val_metrics = get_metrics(classifier, X_val, y_val)
  
  training_metrics.append([classifier] + clf_train_metrics)
  val_metrics.append([classifier] + clf_val_metrics)

train_metrics_df = pd.DataFrame(data = training_metrics, columns=['Classifier'] + METRICS)
val_metrics_df = pd.DataFrame(data = val_metrics, columns=['Classifier'] + METRICS)

LogisticRegression()
LogisticRegression()
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enabl

In [34]:
# Output modeling metrics for the training set
train_metrics_df.sort_values(by='f1', ascending=False)

Unnamed: 0,Classifier,f1,accuracy,precision,recall,average_precision
1,"XGBClassifier(base_score=None, booster=None, c...",1.0,1.0,1.0,1.0,1.0
2,DecisionTreeClassifier(),1.0,1.0,1.0,1.0,1.0
3,"(DecisionTreeClassifier(max_features='sqrt', r...",1.0,1.0,1.0,1.0,1.0
7,"(DecisionTreeClassifier(max_depth=1, random_st...",1.0,1.0,1.0,1.0,1.0
4,SVC(),0.958763,0.997386,0.94898,0.96875,0.920304
8,SVC(),0.958763,0.997386,0.94898,0.96875,0.920304
0,LogisticRegression(),0.931217,0.995752,0.946237,0.916667,0.869998
6,KNeighborsClassifier(),0.863636,0.992157,0.95,0.791667,0.758619
5,GaussianNB(),0.673759,0.969935,0.510753,0.989583,0.505759


In [35]:
# Output modeling metrics for the validation set
val_metrics_df.sort_values(by='f1', ascending=False)

Unnamed: 0,Classifier,f1,accuracy,precision,recall,average_precision
1,"XGBClassifier(base_score=None, booster=None, c...",0.792453,0.989216,0.954545,0.677419,0.656431
0,LogisticRegression(),0.785714,0.988235,0.88,0.709677,0.63334
5,GaussianNB(),0.759494,0.981373,0.625,0.967742,0.605819
4,SVC(),0.754717,0.987255,0.909091,0.645161,0.597295
6,KNeighborsClassifier(),0.754717,0.987255,0.909091,0.645161,0.597295
8,SVC(),0.754717,0.987255,0.909091,0.645161,0.597295
3,"(DecisionTreeClassifier(max_features='sqrt', r...",0.740741,0.986275,0.869565,0.645161,0.571794
2,DecisionTreeClassifier(),0.692308,0.984314,0.857143,0.580645,0.510441
7,"(DecisionTreeClassifier(max_depth=1, random_st...",0.679245,0.983333,0.818182,0.580645,0.487818


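Logistic regression does not have the highest f1 in either set: XGBoost and the tree-based models reach 1.0 on the training set, and XGBoost edges it out on the validation set (0.792 vs. 0.786). The perfect training scores followed by a drop on validation suggest that those models overfit. Logistic regression is nearly tied for the best validation f1 and has higher validation recall than XGBoost (0.710 vs. 0.677), so we will use a logistic regression classifier for our model.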

### Hyperparameter Optimization

In [36]:
logistic_classifier = LogisticRegression(random_state= 0)

Links for hyperparameter optimization:
*   https://machinelearningmastery.com/hyperparameters-for-classification-machine-learning-algorithms/
*   https://scikit-learn.org/stable/auto_examples/model_selection/
*   https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html#sphx-glr-auto-examples-model-selection-plot-grid-search-digits-py







In [37]:
# Define a list of optimization algorithms for logistic regression
solvers = ['newton-cg', 'lbfgs', 'liblinear']

# Define a list of regularization penalty types
penalty = ['l2']

# Define a list of values for the regularization parameter C
c_values = [100, 10, 1.0, 0.1, 0.01]

# Create a dictionary to represent the grid of hyperparameters
grid = dict(solver=solvers, penalty=penalty, C=c_values)
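
To get a feel for the size of this search, the grid can be enumerated by hand. The sketch below expands the same three lists with `itertools.product` and multiplies by the 10 splits and 3 repeats used by the cross-validation setup later in the notebook. `candidates` and `n_fits` are illustrative names.

```python
from itertools import product

solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l2']
c_values = [100, 10, 1.0, 0.1, 0.01]

# One dictionary per hyperparameter combination
candidates = [
    {'C': c, 'penalty': p, 'solver': s}
    for c, p, s in product(c_values, penalty, solvers)
]

# Each candidate is fit once per fold: 10 splits x 3 repeats
n_fits = len(candidates) * 10 * 3
print(len(candidates), "candidates,", n_fits, "fits")
```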

In [38]:
# Create function to print cross validated results
def print_dataframe(filtered_cv_results):
    """Pretty print for filtered dataframe"""
    for mean_precision, std_precision, mean_recall, std_recall, params in zip(
        filtered_cv_results["mean_test_precision"],
        filtered_cv_results["std_test_precision"],
        filtered_cv_results["mean_test_recall"],
        filtered_cv_results["std_test_recall"],
        filtered_cv_results["params"],
    ):
        print(
            f"precision: {mean_precision:0.3f} (±{std_precision:0.03f}),"
            f" recall: {mean_recall:0.3f} (±{std_recall:0.03f}),"
            f" for {params}"
        )
    print()

In [39]:
# Create function to select estimator
def refit_strategy(cv_results):
  """Define the strategy to select the best estimator.

  The strategy defined here is to filter out all results at or below a precision
  threshold of 0.89, rank the remaining by recall and select the model with the
  highest recall.

  Parameters
  ----------
  cv_results : dict of numpy (masked) ndarrays
      CV results as returned by the `GridSearchCV`.

  Returns
  -------
  best_index : int
      The index of the best estimator as it appears in `cv_results`.
  """
  
  precision_threshold = 0.89
  cv_results_ = pd.DataFrame(cv_results)
  print("All grid-search results:")
  print_dataframe(cv_results_)

  # Filter-out all results below the threshold
  high_precision_cv_results = cv_results_[
        cv_results_["mean_test_precision"] > precision_threshold
  ]

  print(f"Models with a precision higher than {precision_threshold}:")
  print_dataframe(high_precision_cv_results)

  high_precision_cv_results = high_precision_cv_results[
        [
            "mean_score_time",
            "mean_test_recall",
            "std_test_recall",
            "mean_test_precision",
            "std_test_precision",
            "rank_test_recall",
            "rank_test_precision",
            "params",
        ]
    ]

  # Select the best-performing model in terms of recall
  best_recall_index = high_precision_cv_results["mean_test_recall"].idxmax()

  print(
        "\nThe selected final model has the highest recall among the models\n"
        "that passed the precision threshold.\n"
        "Its cross-validation results are:\n\n"
        f"{high_precision_cv_results.loc[best_recall_index]}"
    )
  
  return best_recall_index
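
The core of the selection rule, stripped of printing, is a filter followed by an argmax. The sketch below applies it to a made-up three-row results table. The numbers and the `toy_results` name are hypothetical and only meant to show which row wins.

```python
import pandas as pd

# Made-up cross-validation summary with the two columns the strategy reads
toy_results = pd.DataFrame({
    'mean_test_precision': [0.88, 0.90, 0.92],
    'mean_test_recall':    [0.90, 0.87, 0.85],
})

# Step 1: keep only candidates above the precision threshold
eligible = toy_results[toy_results['mean_test_precision'] > 0.89]

# Step 2: among those, pick the row label with the highest recall
best_index = eligible['mean_test_recall'].idxmax()
print(best_index)
```

Row 0 has the best recall overall but is filtered out for low precision, so row 1 is chosen. Because `idxmax` returns the original row label, the result still points at the right entry of the unfiltered results.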

In [40]:
scorers = ['precision', 'recall']

# Define grid search
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=logistic_classifier, param_grid=grid, n_jobs=-1, cv=cv, scoring=scorers,refit=refit_strategy,error_score=0)
grid_result = grid_search.fit(X_eligible, y_eligible)

All grid-search results:
precision: 0.895 (±0.075), recall: 0.868 (±0.077), for {'C': 100, 'penalty': 'l2', 'solver': 'newton-cg'}
precision: 0.895 (±0.075), recall: 0.868 (±0.077), for {'C': 100, 'penalty': 'l2', 'solver': 'lbfgs'}
precision: 0.895 (±0.075), recall: 0.870 (±0.072), for {'C': 100, 'penalty': 'l2', 'solver': 'liblinear'}
precision: 0.898 (±0.073), recall: 0.867 (±0.087), for {'C': 10, 'penalty': 'l2', 'solver': 'newton-cg'}
precision: 0.898 (±0.073), recall: 0.867 (±0.087), for {'C': 10, 'penalty': 'l2', 'solver': 'lbfgs'}
precision: 0.898 (±0.069), recall: 0.873 (±0.087), for {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}
precision: 0.904 (±0.070), recall: 0.862 (±0.092), for {'C': 1.0, 'penalty': 'l2', 'solver': 'newton-cg'}
precision: 0.904 (±0.070), recall: 0.862 (±0.092), for {'C': 1.0, 'penalty': 'l2', 'solver': 'lbfgs'}
precision: 0.902 (±0.072), recall: 0.870 (±0.089), for {'C': 1.0, 'penalty': 'l2', 'solver': 'liblinear'}
precision: 0.916 (±0.078), recall: 0

### Modeling

In [41]:
# Fit a classifier with parameters found above
classifier = LogisticRegression(random_state = 1, **grid_result.best_params_)
classifier.fit(X_train, y_train)

### Predictions

In [42]:
# Predict both class and probability for the training set 
y_train_pred_probs = classifier.predict_proba(X_train)[:, 1]
y_train_pred = classifier.predict(X_train)

In [43]:
# Predict both class and probability for the val set 
y_val_pred_probs = classifier.predict_proba(X_val)[:, 1]
y_val_pred = classifier.predict(X_val)
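
For a binary logistic regression, the predicted probability is the sigmoid of a linear score, and the predicted class is 1 when that probability exceeds 0.5. The sketch below works this through in plain Python for a single player. The coefficients, intercept, and feature values are made up, not taken from the fitted model.

```python
from math import exp

def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

# Hypothetical coefficients, intercept, and one scaled feature row
coef = [1.2, -0.4]
intercept = -0.5
x = [0.8, 0.3]

z = intercept + sum(w * v for w, v in zip(coef, x))  # linear score
prob = sigmoid(z)                                    # analogue of predict_proba(...)[:, 1]
label = int(prob > 0.5)                              # analogue of predict(...)

print(round(z, 2), round(prob, 3), label)
```

This is why the borderline tables in the following sections are defined by thresholds around 0.5 on the `pred` column.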

In [44]:
#eligible_df['pred'] = y_train_pred_probs
X_training['pred'] = y_train_pred_probs
X_validation['pred'] = y_val_pred_probs

#### Borderline Correct Positive HOF Predictions

In train:

In [45]:
X_training[(X_training['pred'] < 0.56) & (X_training['pred'] > 0.5) & (X_training['Hall_of_Fame'] == 1)].sort_values(by='pred', ascending=True)

Unnamed: 0,Player,Eligible,Hall_of_Fame,MVP,All_Star,FG%_totals,TRB_totals,BLK_totals,pts_per_g_seasonal,ws_seasonal,PER_advanced,OWS_advanced,DWS_advanced,Champ,pred
3136,Chris Mullin,1,1,0,5,0.509,4034.0,549.0,21,0,18.8,69.2,23.8,0,0.549504


In validation:

In [46]:
X_validation[(X_validation['pred'] < 0.56) & (X_validation['pred'] > 0.5) & (X_validation['Hall_of_Fame'] == 1)].sort_values(by='pred', ascending=True)

Unnamed: 0,Player,Eligible,Hall_of_Fame,MVP,All_Star,FG%_totals,TRB_totals,BLK_totals,pts_per_g_seasonal,ws_seasonal,PER_advanced,OWS_advanced,DWS_advanced,Champ,pred


#### Under Predictions

In [47]:
X_training[(X_training['pred'] < 0.5) & (X_training['Hall_of_Fame'] == 1)].sort_values(by='pred', ascending=False)

Unnamed: 0,Player,Eligible,Hall_of_Fame,MVP,All_Star,FG%_totals,TRB_totals,BLK_totals,pts_per_g_seasonal,ws_seasonal,PER_advanced,OWS_advanced,DWS_advanced,Champ,pred
1781,Tim Hardaway,1,1,0,5,0.431,2855.0,129.0,5,8,18.6,57.7,27.2,0,0.497592
4698,Paul Westphal,1,1,0,5,0.504,1580.0,262.0,9,6,19.4,44.6,23.0,1,0.470971
761,Maurice Cheeks,1,1,0,4,0.523,3088.0,294.0,0,2,16.5,60.9,42.6,1,0.467544
4588,Ben Wallace,1,1,0,4,0.474,10482.0,2137.0,0,1,15.5,22.9,70.6,1,0.465142
3791,Dennis Rodman,1,1,0,2,0.521,11954.0,531.0,0,2,14.6,35.4,54.5,5,0.401135
3064,Sidney Moncrief,1,1,0,5,0.502,3575.0,228.0,0,38,18.7,61.4,28.9,0,0.346043
3790,Guy Rodgers,1,1,0,4,0.378,3791.0,127.815696,0,0,13.6,-5.3,38.6,0,0.281187
4754,Jamaal Wilkes,1,1,0,3,0.499,5117.0,262.0,0,0,16.5,43.5,27.8,4,0.128889


**Analysis of incorrect predictions in train:**



*   Sidney Moncrief (46% bball ref) - 2x DPOY, 5x All-NBA in a shorter career
*   Paul Westphal (41% bball ref) - 5x All-Star and 4x All-NBA
*   Guy Rodgers (9% bball ref) - Long NBA career, 4x All-Star, 2x AST champ
*   Manu Ginobili (20% bball ref) - Had a short career in Italy. Long career in NBA with solid stats, many championships. Top 75 in win shares
*   Dennis Rodman (75% bball ref) - Elite defensive player, hurt because our model doesn't include many defensive statistics and his other stats (points, All-Star selections, etc.) are weak
*   Jamaal Wilkes (18% bball ref) - 3x All-Star, 4 championships



In [48]:
X_validation[(X_validation['pred'] < 0.5) & (X_validation['Hall_of_Fame'] == 1)].sort_values(by='pred', ascending=False)

Unnamed: 0,Player,Eligible,Hall_of_Fame,MVP,All_Star,FG%_totals,TRB_totals,BLK_totals,pts_per_g_seasonal,ws_seasonal,PER_advanced,OWS_advanced,DWS_advanced,Champ,pred
3730,Arnie Risen,1,1,0,4,0.381,5011.0,426.752639,11,12,16.7,32.4,23.6,2,0.47968
4606,Bobby Wanzer,1,1,0,5,0.393,1979.0,81.389367,5,14,17.3,47.3,16.6,1,0.43945
989,Bob Davies,1,1,0,4,0.378,980.0,66.200506,12,5,18.1,35.4,14.3,1,0.295582
967,Bob Dandridge,1,1,0,4,0.484,5715.0,303.0,2,0,16.7,47.3,33.0,2,0.263439
1576,Tom Gola,1,1,0,5,0.431,5617.0,100.017215,0,5,14.2,29.1,24.1,1,0.191486
2211,Gus Johnson,1,1,0,5,0.44,7624.0,214.992182,1,0,16.7,11.6,25.2,1,0.099585
3627,Frank Ramsey,1,1,0,0,0.399,3410.0,89.27038,0,2,15.6,23.7,25.4,7,0.074039
3878,Ralph Sampson,1,1,0,4,0.486,4011.0,752.0,0,0,16.0,0.4,19.7,0,0.00285


**Analysis of incorrect predictions in validation:**

(Incomplete)

Nate Thurmond (67% bball ref)- 7x all star, 5x all defense, 12th all time in rebounds, top 10 in rebounds many times

Bob Dandridge (16% bball ref)- 2x champ, all defensive once, solid rebounder and scorer. Helpful in championship run- all star, all nba, all defensive. Took many years to be selected

Gail Goodrich (74% bball ref)- 5x all star, champ, many win shares in 72 season (8th). Several seasons in top 10 in scoring, top 20 in assists

Tom Gola (29% bball ref)- 50s and early 60s. Good rebounder and low scorer with 5x all star and a championship. Somewhat borderline

Arnie Risen (25% bball ref)- 50s, low scorer, high rebounder, short career, 2x champ, many win shares as rookie

Frank Ramsey (26% bball ref)- No other accolades but 7x champ and solid contributor over short career. 

Ralph Sampson - 6 solid years in NBA (of his 9), 4x all star, legend in college

KC Jones (29% bball ref)- No other accolades but 8x champ. Lesser contributor than Frank Ramsey; never eclipsed 10 ppg. Not sure why he is in HOF


#### Over predictions

In [49]:
X_training[(X_training['pred'] > 0.5) & (X_training['Hall_of_Fame'] == 0)].sort_values(by='pred', ascending=False)

Unnamed: 0,Player,Eligible,Hall_of_Fame,MVP,All_Star,FG%_totals,TRB_totals,BLK_totals,pts_per_g_seasonal,ws_seasonal,PER_advanced,OWS_advanced,DWS_advanced,Champ,pred
2281,Jimmy Jones,1,0,0,6,0.509,2930.0,47.0,9,26,17.1,53.2,25.6,0,0.81295
1440,Donnie Freeman,1,0,0,5,0.456,2292.0,48.0,16,10,18.2,37.0,21.0,1,0.779368
351,Chauncey Billups,1,0,0,5,0.415,2992.0,168.0,0,25,18.8,92.4,28.3,1,0.754094
4240,Amar'e Stoudemire,1,0,0,6,0.537,6632.0,1054.0,18,17,21.8,63.1,29.4,0,0.663624
2363,Shawn Kemp,1,0,0,6,0.488,8834.0,1279.0,0,3,19.1,37.3,52.2,0,0.625829
888,Larry Costello,1,0,0,6,0.438,2705.0,101.163544,0,2,14.5,41.8,20.9,1,0.506875


(Incomplete)

Larry Foust deserves to be in HOF. Also has 94% rating on Bball ref

Amar'e Stoudemire (72% bball ref), Chauncey Billups (84% on bball ref), Shawn Marion (76% on bball ref) are newer to the ballot

Larry Costello is in the HOF as a Contributor rather than as a Player, so he is labeled 0 here even though his playing career was one of his key contributions. 71% on bball ref

Tom Sanders in HOF as contributor but 15% rating on bball ref

Max Zaslofsky

In [50]:
X_validation[(X_validation['pred'] > 0.5) & (X_validation['Hall_of_Fame'] == 0)].sort_values(by='pred', ascending=False)

Unnamed: 0,Player,Eligible,Hall_of_Fame,MVP,All_Star,FG%_totals,TRB_totals,BLK_totals,pts_per_g_seasonal,ws_seasonal,PER_advanced,OWS_advanced,DWS_advanced,Champ,pred
1412,Larry Foust,1,0,0,8,0.405,8041.0,547.342082,7,20,19.8,51.1,28.0,0,0.954338
2287,Larry Jones,1,0,0,4,0.453,2725.0,18.0,27,28,18.9,38.3,18.2,0,0.703003
2767,Shawn Marion,1,0,0,4,0.484,10101.0,1233.0,0,19,18.8,63.6,61.3,1,0.662409


(Incomplete)

Jimmy Jones- ABA legend, very efficient. The ABA was probably less respected during his years than during Erving's. Shorter career

Mack Calvin- Early ABA

Shawn Kemp (39% bball ref)- Lower longevity

Donnie Freeman- ABA star in the late 60s and early 70s

Walter Davis (31% bball ref)

Jermaine O'Neal (32% bball ref)
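
The under- and over-prediction filters in this section all follow one pattern: compare `pred` to a 0.5 cutoff and to the `Hall_of_Fame` label. A minimal sketch of that pattern as a reusable helper is below. The function name `split_errors` and the small `sample` frame are hypothetical; the rows are copied from the tables above and mix training and validation players purely for illustration.

```python
import pandas as pd

def split_errors(df, threshold=0.5, label_col="Hall_of_Fame", prob_col="pred"):
    """Return (under_predictions, over_predictions) for a probability cutoff."""
    predicted_pos = df[prob_col] > threshold
    actual_pos = df[label_col] == 1
    # Under predictions: actual Hall of Famers the model scored at or below the cutoff
    under = df[~predicted_pos & actual_pos].sort_values(prob_col, ascending=False)
    # Over predictions: non-inductees the model scored above the cutoff
    over = df[predicted_pos & ~actual_pos].sort_values(prob_col, ascending=False)
    return under, over

# Illustrative rows copied from the outputs above (not the full data set)
sample = pd.DataFrame({
    "Player": ["Chris Mullin", "Tim Hardaway", "Ralph Sampson", "Jimmy Jones", "Larry Foust"],
    "Hall_of_Fame": [1, 1, 1, 0, 0],
    "pred": [0.549504, 0.497592, 0.00285, 0.81295, 0.954338],
})

under, over = split_errors(sample)
print(under["Player"].tolist())
print(over["Player"].tolist())
```

Passing a different `threshold` would show how the two error lists trade off against each other, which may help when deciding whether 0.5 is the right cutoff for such a rare class.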

### Model Coefficients

In [51]:
# Exponentiate each coefficient so it reads as an odds ratio
for col, coef in zip(model_cols[3:], classifier.coef_[0]):
  print(f"{col}: {exp(coef)}")

All_Star: 11.669261118208027
OWS_advanced: 8.647589089308802
BLK_totals: 0.35617114642966624
FG%_totals: 0.62010087368839
PER_advanced: 0.6553728032690266
MVP: 2.780500373529551
ws_seasonal: 1.2101104230957134
pts_per_g_seasonal: 0.9219021392259796
DWS_advanced: 1.2380944452230043
TRB_totals: 6.1910046999487856
Champ: 1.7622640482570644
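
Since the printed values are `exp(coef)`, each one can be read as an odds ratio: values above 1 raise the odds of induction as the feature increases, and values below 1 lower them. The sketch below takes the logarithm of the printed numbers to recover the coefficients and ranks them by absolute size. It assumes the values are used exactly as printed; if the features were standardized upstream (not visible in this section), each ratio would be per one standard deviation rather than per raw unit.

```python
import math
import pandas as pd

# Odds ratios copied from the printed output above
odds_ratios = pd.Series({
    "All_Star": 11.669261118208027,
    "OWS_advanced": 8.647589089308802,
    "BLK_totals": 0.35617114642966624,
    "FG%_totals": 0.62010087368839,
    "PER_advanced": 0.6553728032690266,
    "MVP": 2.780500373529551,
    "ws_seasonal": 1.2101104230957134,
    "pts_per_g_seasonal": 0.9219021392259796,
    "DWS_advanced": 1.2380944452230043,
    "TRB_totals": 6.1910046999487856,
    "Champ": 1.7622640482570644,
})

# Recover the log-odds coefficients and rank features by absolute effect size
log_odds = odds_ratios.apply(math.log)
ranked = log_odds.reindex(log_odds.abs().sort_values(ascending=False).index)
print(ranked.round(3))
```

Ranking on the log scale treats a ratio of 0.5 and a ratio of 2 as equally strong, which a plain sort of the odds ratios would not.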


### Confusion Matrix

In [52]:
# Output confusion matrix for the training set
cm = confusion_matrix(y_train, y_train_pred)
# Rows are actual classes, columns are predicted classes:
# [TN FP]
# [FN TP]
print(cm)

[[2958    6]
 [   8   88]]


In [53]:
# Call the public metric functions directly instead of the private _score_func attribute
from sklearn.metrics import f1_score, precision_score, recall_score
print(f"Accuracy: {accuracy_score(y_train, y_train_pred)}")
print(f"F1 score: {f1_score(y_train, y_train_pred)}")
print(f"Precision score: {precision_score(y_train, y_train_pred)}")
print(f"Recall score: {recall_score(y_train, y_train_pred)}")

Accuracy: 0.9954248366013072
F1 score: 0.9263157894736843
Precision score: 0.9361702127659575
Recall score: 0.9166666666666666


In [54]:
# Output confusion matrix for the validation set
cm = confusion_matrix(y_val, y_val_pred)
# Rows are actual classes, columns are predicted classes:
# [TN FP]
# [FN TP]
print(cm)
accuracy_score(y_val, y_val_pred)

[[986   3]
 [  8  23]]


0.9892156862745098

In [55]:
# Call the public metric functions directly instead of the private _score_func attribute
from sklearn.metrics import f1_score, precision_score, recall_score
print(f"Accuracy: {accuracy_score(y_val, y_val_pred)}")
print(f"F1 score: {f1_score(y_val, y_val_pred)}")
print(f"Precision score: {precision_score(y_val, y_val_pred)}")
print(f"Recall score: {recall_score(y_val, y_val_pred)}")

Accuracy: 0.9892156862745098
F1 score: 0.8070175438596492
Precision score: 0.8846153846153846
Recall score: 0.7419354838709677
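
As a sanity check, the four scores can be recomputed by hand from the confusion-matrix counts. The sketch below uses a hypothetical helper, `metrics_from_cm`, with the counts copied from the two printed matrices, read as `[[TN, FP], [FN, TP]]`.

```python
def metrics_from_cm(tn, fp, fn, tp):
    """Compute accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # share of predicted Hall of Famers that are correct
    recall = tp / (tp + fn)      # share of actual Hall of Famers that were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts copied from the printed matrices above
train_metrics = metrics_from_cm(tn=2958, fp=6, fn=8, tp=88)
val_metrics = metrics_from_cm(tn=986, fp=3, fn=8, tp=23)

for name, m in [("train", train_metrics), ("validation", val_metrics)]:
    print(name, {k: round(v, 4) for k, v in m.items()})
```

The drop from training to validation is almost entirely in recall (88 of 96 versus 23 of 31 Hall of Famers found), while accuracy barely moves because the negative class dominates both sets.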


### Other Metrics

In [56]:
print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2964
           1       0.94      0.92      0.93        96

    accuracy                           1.00      3060
   macro avg       0.97      0.96      0.96      3060
weighted avg       1.00      1.00      1.00      3060



In [57]:
print(classification_report(y_val, y_val_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       989
           1       0.88      0.74      0.81        31

    accuracy                           0.99      1020
   macro avg       0.94      0.87      0.90      1020
weighted avg       0.99      0.99      0.99      1020
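
The gap between the macro and weighted averages in this report comes from class imbalance: only 31 of the 1020 validation players are Hall of Famers. The sketch below recomputes both recall averages from the validation confusion matrix and compares accuracy to a trivial baseline that always predicts "not Hall of Fame". The variable names are illustrative.

```python
# Validation confusion matrix from above: [[986, 3], [8, 23]]
support = {0: 986 + 3, 1: 8 + 23}                     # actual players per class
recall = {0: 986 / support[0], 1: 23 / support[1]}    # per-class recall
total = sum(support.values())

# Macro average: every class counts equally
macro_recall = sum(recall.values()) / len(recall)
# Weighted average: every class counts in proportion to its support
weighted_recall = sum(recall[c] * support[c] for c in support) / total
# Baseline accuracy from always predicting class 0
baseline_accuracy = support[0] / total

print(round(macro_recall, 2), round(weighted_recall, 2), round(baseline_accuracy, 4))
```

Because the always-negative baseline already scores about 97% accuracy, the class-1 row and the macro average are the more informative lines of the report for this problem.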



### Predictions

In [58]:
# Predict both class and probability for the test set 
y_test_pred_probs = classifier.predict_proba(X_test)[:, 1]
y_test_pred = classifier.predict(X_test)

In [59]:
noneligible_df['pred'] = y_test_pred_probs

In [60]:
noneligible_df.sort_values(by='pred', ascending=False).head()

Unnamed: 0,Player,Eligible,Hall_of_Fame,MVP,All_Star,FG%_totals,TRB_totals,BLK_totals,pts_per_g_seasonal,ws_seasonal,PER_advanced,OWS_advanced,DWS_advanced,Champ,pred
2141,LeBron James,0,0,4,18,0.505,10210.0,1041.0,108,116,27.3,173.9,75.6,4,1.0
957,Stephen Curry,0,0,2,8,0.473,3838.0,187.0,41,36,23.8,87.6,32.7,4,1.0
1191,Kevin Durant,0,0,1,12,0.496,6646.0,1038.0,70,55,25.3,111.1,44.1,2,1.0
3271,Dirk Nowitzki,0,0,1,14,0.471,11489.0,1281.0,40,73,22.4,143.8,62.6,1,1.0
1783,James Harden,0,0,1,10,0.442,5294.0,512.0,69,73,24.5,112.1,37.5,0,1.0
