# Player Statistics

This notebook attempts to build an ML model for predicting the position of a player given their statistics.

##### Data Sources:

##### The attributes I chose to use in building the model include:

**Points, Assists, Rebounds, Steals and Blocks**, all on a per-game basis. The model would likely have been improved by statistics that further distinguish positions, such as three-point percentage (guards and forwards are usually much better three-point shooters than centers), turnovers or free throw percentage.

However, these more advanced statistics are rarely recorded outside of professional games, so I decided to use only the traditionally recorded statistics of points, assists, rebounds, steals and blocks. The average user, who may have only played up to high school basketball, would either have these attributes recorded or know a rough estimate of them. For example, American high school varsity basketball only records these statistics: [See Here](https://www.maxpreps.com/basketball/stat-leaders/).

Potential resources for downloading stats:

* https://www.basketball-reference.com/leagues/NBA_2023_per_game.html (download as many seasons as possible). Each downloaded file needs to be read in, and the statistics of players who played for multiple teams in a season need to be collated into a single row per player and season.
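
The collation step for multi-team players could look roughly like the following. This is a minimal sketch, not code from this notebook: it assumes the download uses the same `Player`, `Year` and `Tm` column names as `Seasons_Stats.csv` and that a traded player's combined row is marked with `Tm == "TOT"`. The helper name `collapse_traded_players` and the sample rows are made up for illustration.

```python
import pandas as pd


def collapse_traded_players(season: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per player and year: the 'TOT' row if present, else the only row.

    Assumes columns named Player, Year and Tm (hypothetical for a fresh download).
    """
    # Flag the combined rows so they sort ahead of the per-team rows
    flagged = season.assign(is_total=season["Tm"].eq("TOT"))
    flagged = flagged.sort_values("is_total", ascending=False, kind="mergesort")
    # The first row per player and year is now the TOT row whenever one exists
    kept = flagged.drop_duplicates(subset=["Player", "Year"], keep="first")
    # Restore the original row order and drop the helper column
    return kept.drop(columns="is_total").sort_index()


# Made-up rows: "Player A" was traded mid-season, "Player B" was not
season = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player A", "Player B"],
    "Year": [1950, 1950, 1950, 1950],
    "Tm": ["TOT", "DNN", "NYK", "CHS"],
    "G": [15, 13, 2, 67],
    "PTS": [63, 59, 4, 438],
})

collapsed = collapse_traded_players(season)
print(collapsed)
```

Matching on name and year alone would merge two different players who share a name in the same season, so a real pipeline may need an additional key.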


In [154]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import os
import pickle

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

sns.set_theme()
%matplotlib inline

In [155]:
np.random.seed(0)

# Data Preprocessing

In [156]:
# Reading in data
player_statistics = pd.read_csv(os.path.join("data", "Seasons_Stats.csv"))
active_players = pd.read_csv(os.path.join("data", "active_players.csv"))

In [157]:
player_statistics.head()

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,...,0.705,,,,176.0,,,,217.0,458.0
1,1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,...,0.708,,,,109.0,,,,99.0,279.0
2,2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,...,0.698,,,,140.0,,,,192.0,438.0
3,3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,...,0.559,,,,20.0,,,,29.0,63.0
4,4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,...,0.548,,,,20.0,,,,27.0,59.0


In [158]:
active_players.head()

Unnamed: 0,playerid,fname,lname,position,height,weight,birthday,country,school,draft_year,draft_round,draft_number
0,1630173,Precious,Achiuwa,Forward,6-8,225,1999-09-19,Nigeria,Memphis,2020,1.0,20.0
1,203500,Steven,Adams,Center,6-11,265,1993-07-20,New Zealand,Pittsburgh,2013,1.0,12.0
2,1628389,Bam,Adebayo,Center-Forward,6-9,255,1997-07-18,USA,Kentucky,2017,1.0,14.0
3,1630534,Ochai,Agbaji,Guard,6-5,215,2000-04-20,USA,Kansas,2022,1.0,14.0
4,1630583,Santi,Aldama,Forward-Center,7-0,215,2001-01-10,Spain,Loyola-Maryland,2021,1.0,30.0


### Handling the player statistics dataset

First, I'm going to convert all column headers into lowercase to have a consistent case between all columns.

In [159]:
player_statistics.columns = player_statistics.columns.str.lower()

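Now, I'm going to keep only the attributes that are feasible to collect (see explanation above), plus those that may have a relationship with a player's position (year, age). Games played (`g`) is kept for now because it is needed to convert totals into per-game values.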

In [160]:
player_statistics = player_statistics[['player', 'pos', 'year', 'age', 'pts', 'ast', 'trb', 'stl', 'blk', 'g']]
player_statistics.head()

Unnamed: 0,player,pos,year,age,pts,ast,trb,stl,blk,g
0,Curly Armstrong,G-F,1950.0,31.0,458.0,176.0,,,,63.0
1,Cliff Barker,SG,1950.0,29.0,279.0,109.0,,,,49.0
2,Leo Barnhorst,SF,1950.0,25.0,438.0,140.0,,,,67.0
3,Ed Bartels,F,1950.0,24.0,63.0,20.0,,,,15.0
4,Ed Bartels,F,1950.0,24.0,59.0,20.0,,,,13.0


Next, we're going to drop all rows with NA values, as we only want to keep players with all the relevant statistics. Any duplicate entries will also be dropped.

In [161]:
# Checking the shape, number of duplicates and missing values
print(f'The number of rows in the dataset is: {player_statistics.shape[0]}')
print(f'The number of columns/features in the dataset is: {player_statistics.shape[1]}')
print(f'The number of duplicate entries in the dataset is: {player_statistics.duplicated().sum()}')
print(f'The number of missing values in the dataset is: {player_statistics.isna().sum().sum()}')

The number of rows in the dataset is: 24691
The number of columns/features in the dataset is: 10
The number of duplicate entries in the dataset is: 67
The number of missing values in the dataset is: 8644


In [162]:
player_statistics = player_statistics.dropna()
player_statistics = player_statistics.drop_duplicates().reset_index(drop=True)

In [163]:
# Checking the number of duplicates and missing values
print(f'The number of duplicate entries in the dataset is: {player_statistics.duplicated().sum()}')
print(f'The number of missing values in the dataset is: {player_statistics.isna().sum().sum()}')

The number of duplicate entries in the dataset is: 0
The number of missing values in the dataset is: 0


It is evident that the statistics are shown as **totals** rather than **per game**. Although users could calculate their totals, it's more common for a player's statistics to be recorded per game. Therefore, we next convert all the statistics to per-game values.

In [164]:
# Convert stats into per game
stats_cols = ['pts', 'ast', 'trb', 'stl', 'blk']
for col in stats_cols:
    player_statistics[col] = np.round(player_statistics[col].div(player_statistics['g']), 2)
    
# Drop games
player_statistics.drop(columns=['g'], inplace=True)

In [165]:
player_statistics.head()

Unnamed: 0,player,pos,year,age,pts,ast,trb,stl,blk
0,Zaid Abdul-Aziz,C,1974.0,27.0,10.95,2.1,11.68,1.01,1.32
1,Kareem Abdul-Jabbar*,C,1974.0,26.0,27.05,4.77,14.54,1.38,3.49
2,Don Adams,SF,1974.0,26.0,10.26,1.91,6.05,1.49,0.16
3,Rick Adelman,PG,1974.0,27.0,3.31,1.02,1.25,0.65,0.02
4,Lucius Allen,PG,1974.0,26.0,17.61,5.19,4.04,1.9,0.31


It is evident that the Guard position is broken down into PG (Point Guard) and SG (Shooting Guard). Additionally, the Forward position is broken down into SF (Small Forward) and PF (Power Forward).

We're trying to predict the general position of a player (Guard, Forward, Center), so we collapse these labels below by keeping only the last character of each one.

In [166]:
# Keep the last character of each label: PG/SG -> G, SF/PF -> F, C -> C
player_statistics['pos'] = player_statistics['pos'].apply(lambda x: x[-1])
player_statistics.head()

Unnamed: 0,player,pos,year,age,pts,ast,trb,stl,blk
0,Zaid Abdul-Aziz,C,1974.0,27.0,10.95,2.1,11.68,1.01,1.32
1,Kareem Abdul-Jabbar*,C,1974.0,26.0,27.05,4.77,14.54,1.38,3.49
2,Don Adams,F,1974.0,26.0,10.26,1.91,6.05,1.49,0.16
3,Rick Adelman,G,1974.0,27.0,3.31,1.02,1.25,0.65,0.02
4,Lucius Allen,G,1974.0,26.0,17.61,5.19,4.04,1.9,0.31


In [167]:
player_statistics['pos'].value_counts()

pos
F    8394
G    8260
C    4142
Name: count, dtype: int64

* Now we only have the desired G, F and C positions.
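
To make the last-character rule concrete, the sketch below applies it to a handful of labels. `G-F` appears in the raw data shown earlier; `F-C` is included only as a hypothetical hybrid label to illustrate the behaviour.

```python
# Position labels; "G-F" is taken from the raw data, "F-C" is a hypothetical hybrid
raw_labels = ["PG", "SG", "SF", "PF", "C", "G-F", "F-C"]

# Same rule as the notebook cell: keep only the final character of the label
general = {label: label[-1] for label in raw_labels}

for label, position in general.items():
    print(f"{label:>3} -> {position}")
```

One consequence of this rule is that a hybrid label is always assigned to its second-listed position, so `G-F` is counted as a Forward rather than a Guard.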

# Building ML Model

### Trying different models & Evaluation

### Best Model: Random Forest

In [168]:
# Obtain predictors and response
player_predictors = player_statistics.drop(['player', 'pos'], axis = 1)
player_response = player_statistics['pos']

# Convert into 2D array with shape (n, 1)
player_response = np.array(player_response).reshape(-1, 1)

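# Encode player response to be numeric
# Note: scikit-learn >= 1.2 names this argument `sparse_output`; older releases use `sparse`
ohe = OneHotEncoder(sparse_output=False, categories='auto')
enc_player_response = ohe.fit_transform(player_response)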

# Obtain training/test split
X_train, X_test, y_train, y_test = train_test_split(player_predictors,
                                                    enc_player_response,
                                                    train_size=0.75)



In [169]:
rfc_pipeline = make_pipeline(StandardScaler(),
                             RandomForestClassifier(n_estimators=500, 
                                                    random_state=0))

# Define dictionary of {'param': [values_to_try]}; pipeline parameters are named '<step name>__<parameter>'
param_grid = {
    'randomforestclassifier__max_depth': [10, 20, 25],
    'randomforestclassifier__min_samples_leaf': [3, 4, 5, 6],
    'randomforestclassifier__bootstrap': [True, False]
}

rfc_grid = GridSearchCV(estimator=rfc_pipeline,
                    param_grid=param_grid,
                    cv = 3,
                    refit=True,
                    n_jobs = -1
                    )

rfc_grid.fit(X_train, y_train)

In [None]:
rfc_grid.best_params_

{'randomforestclassifier__bootstrap': False,
 'randomforestclassifier__max_depth': 25,
 'randomforestclassifier__min_samples_leaf': 3}

In [None]:
# Accuracy
y_pred = rfc_grid.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

Accuracy: 0.7509136372379304
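
Because the response is one-hot encoded, `y_test` and `y_pred` are 2-D indicator arrays, and for input of that shape `accuracy_score` reports exact-match (subset) accuracy: a row counts as correct only if all three columns agree. The sketch below reproduces that calculation with NumPy on made-up arrays (not the notebook's data), assuming the columns are ordered C, F, G, which is how the encoder sorts its categories.

```python
import numpy as np

# Made-up one-hot rows; columns are assumed to be ordered C, F, G
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],   # correct
                   [0, 1, 0],   # correct
                   [0, 0, 0],   # no class predicted at all
                   [0, 0, 1]])  # wrong class

# A row is correct only when every column matches
exact_match = np.all(y_true == y_pred, axis=1)
subset_accuracy = exact_match.mean()
print(subset_accuracy)  # 0.5
```

A row in which no class is predicted counts as an error under this metric. Since the forest treats each one-hot column as a separate output, such rows may occur, which is worth keeping in mind when reading the accuracy figure.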


In [None]:
# Transform back to obtain prediction
y_pred = rfc_grid.predict(X_test)
ohe.inverse_transform(y_pred[2].reshape(1, 3))

array([['F']], dtype=object)

# Retrain model on full dataset

Finally, I will retrain the model on the full dataset (training and test sets) to help the model learn from as much data as possible.

In [None]:
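# Retrain the tuned pipeline on all rows (rfc_pipeline itself is never fitted; GridSearchCV fits clones)
final_model = rfc_grid.best_estimator_.fit(player_predictors, enc_player_response)

# Export model
pickle.dump(final_model, open(os.path.join("models", "stats_rf.sav"), "wb"))
pickle.dump(ohe, open(os.path.join("models", "stats_ohe.sav"), "wb"))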