# Physical Dimensions

This notebook attempts to build an ML model for predicting the position of a player given their physical dimensions.

### TODO:

* Use PCA visualisations to check how well separated the positions are in the current data. Introduce new variables one at a time and see whether the separation improves, since better separation may mean a model that fits the data better (Page 2, Chapter 5 of DS book).
* Explore how model performance differs when using more up-to-date data (i.e. compare players before and after a chosen cut-off year).
* Visualise 4-dimensional data using seaborn.
* Visualise data before and after including year_start and year_end to understand why it improves performance.
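
For the first TODO item, the projection step could be sketched roughly as follows. This is only an illustration: it uses the five rows shown in the `player_data.head()` output further down rather than the full dataset, and the variable names (`pca_pipeline`, `components`) are my own. Plotting `components` coloured by `positions` would then give the separation picture.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Five rows copied from the player_data.head() output in this notebook:
# height (cm), weight (kg), year_start, year_end, BMI.
X = np.array([
    [180.0, 77.0, 1949, 1951, 23.765432],
    [188.0, 83.0, 1950, 1952, 23.483477],
    [193.0, 86.0, 1950, 1954, 23.087868],
    [196.0, 88.0, 1950, 1951, 22.907122],
    [178.0, 79.0, 1950, 1951, 24.933720],
])
positions = np.array(["G", "G", "F", "F", "G"])

# Scale first so that no single feature dominates the components.
pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
components = pca_pipeline.fit_transform(X)

# Share of total variance retained by the two components.
explained = pca_pipeline.named_steps["pca"].explained_variance_ratio_
print(components.shape)
print(explained.sum())
```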

In [527]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re
import pickle

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

sns.set_theme()
%matplotlib inline

In [528]:
np.random.seed(0)

# Data Preprocessing

First, let's read in the data and take a look at the features.

In [529]:
# Reading in data
player_attributes = pd.read_csv(os.path.join("data", "Players.csv"))
player_pos = pd.read_csv(os.path.join("data", "player_data.csv"))
player_stats = pd.read_csv(os.path.join("data", "Seasons_Stats.csv"))

In [530]:
player_attributes.head()

Unnamed: 0.1,Unnamed: 0,Player,height,weight,collage,born,birth_city,birth_state
0,0,Curly Armstrong,180.0,77.0,Indiana University,1918.0,,
1,1,Cliff Barker,188.0,83.0,University of Kentucky,1921.0,Yorktown,Indiana
2,2,Leo Barnhorst,193.0,86.0,University of Notre Dame,1924.0,,
3,3,Ed Bartels,196.0,88.0,North Carolina State University,1925.0,,
4,4,Ralph Beard,178.0,79.0,University of Kentucky,1927.0,Hardinsburg,Kentucky


In [531]:
player_pos.head()

Unnamed: 0,name,year_start,year_end,position,height,weight,birth_date,college
0,Alaa Abdelnaby,1991,1995,F-C,6-10,240.0,"June 24, 1968",Duke University
1,Zaid Abdul-Aziz,1969,1978,C-F,6-9,235.0,"April 7, 1946",Iowa State University
2,Kareem Abdul-Jabbar,1970,1989,C,7-2,225.0,"April 16, 1947","University of California, Los Angeles"
3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162.0,"March 9, 1969",Louisiana State University
4,Tariq Abdul-Wahad,1998,2003,F,6-6,223.0,"November 3, 1974",San Jose State University


Next, we want to obtain a single dataframe that contains the physical dimensions of players along with their NBA position. I will accomplish this through a merge.

In [532]:
# Join player_attributes with player_data to obtain position (response)
player_data = pd.merge(player_attributes, player_pos, how = "inner", left_on = "Player", right_on = "name")

In [533]:
print("Number of observations in player attributes:", len(player_attributes))
print("Number of observations in player positions:", len(player_pos))
print("Number of observations in player data:", len(player_data))

Number of observations in player attributes: 3922
Number of observations in player positions: 4550
Number of observations in player data: 3814


There are 3922 players with physical attributes and 4550 players with positions, but only 3814 merged rows have both. So some players have attributes but no matching position, and some position records have no matching player in the attributes table.

For example, the dataframe below shows players who have a height but no position.

In [534]:
# Let's explore the players that didn't have a position
full_player_data = pd.merge(player_attributes, player_pos, how = "left", left_on = "Player", right_on = "name")

full_player_data[(full_player_data['position'].isna()) & (~ full_player_data['height_x'].isna())].tail()

Unnamed: 0.1,Unnamed: 0,Player,height_x,weight_x,collage,born,birth_city,birth_state,name,year_start,year_end,position,height_y,weight_y,birth_date,college
3351,3307,Luc Mbah,201.0,99.0,,1984.0,,,,,,,,,,
3591,3544,Nando De,206.0,97.0,,1968.0,,,,,,,,,,
3775,3727,James Michael,198.0,90.0,,1992.0,,,,,,,,,,
3870,3822,Walter Tavares,221.0,117.0,,1992.0,Maio,Cape Verde,,,,,,,,
3932,3884,Sheldon McClellan,196.0,90.0,University of Miami,1992.0,Houston,Texas,,,,,,,,


The dataframe below shows players whose weight is missing from the positions data.

In [535]:
full_player_data = pd.merge(player_pos, player_attributes, how = "left", left_on = "name", right_on = "Player")

full_player_data[(full_player_data['weight_x'].isna())].tail()

Unnamed: 0.1,name,year_start,year_end,position,height_x,weight_x,birth_date,college,Unnamed: 0,Player,height_y,weight_y,collage,born,birth_city,birth_state
2360,Dick Lee,1968,1968,F,6-6,,,University of Washington,,,,,,,,
2783,Murray Mitchell,1950,1950,C,6-6,,"March 19, 1923",Sam Houston State University,132.0,Murray Mitchell,201.0,95.0,Sam Houston State University,1923.0,,
2973,Paul Nolen,1954,1954,C,6-10,,"September 3, 1929",Texas Tech University,341.0,Paul Nolen,211.0,106.0,Texas Tech University,1929.0,,
4279,Ray Wertis,1947,1948,G,5-11,,"January 1, 1922",St. John's University,,,,,,,,
4472,Bob Wood,1950,1950,G,5-10,,"October 7, 1921",Northern Illinois University,221.0,Bob Wood,175.0,70.0,Northern Illinois University,1921.0,,
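
The two left merges above could also be summarised in one step with an outer merge and pandas' `indicator=True` option, which adds a `_merge` column saying where each row came from. A minimal sketch, assuming three rows taken from the outputs above as stand-ins for the full tables:

```python
import pandas as pd

# Small stand-ins for player_attributes and player_pos, using rows seen above.
attributes = pd.DataFrame({
    "Player": ["Cliff Barker", "Luc Mbah"],
    "height": [188.0, 201.0],
})
positions = pd.DataFrame({
    "name": ["Cliff Barker", "Dick Lee"],
    "position": ["G", "F"],
})

# indicator=True labels each row as 'left_only', 'right_only' or 'both'.
merged = pd.merge(attributes, positions, how="outer",
                  left_on="Player", right_on="name", indicator=True)

n_both = int((merged["_merge"] == "both").sum())
n_attributes_only = int((merged["_merge"] == "left_only").sum())
n_positions_only = int((merged["_merge"] == "right_only").sum())
print(n_both, n_attributes_only, n_positions_only)
```

On the real tables, the `both` count should correspond to the size of the inner merge, while the other two counts show how many records each side fails to match.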


In [536]:
# Convert player position into string
player_data['position'] = player_data['position'].astype(str)

Next, let's keep only the columns relevant to the machine learning model:

* Dependent Variable (i.e. position)
* Potential Independent Variables (i.e. height, weight, year_start, year_end)
* Player Name

In [537]:
player_data = player_data[['Player', 'height_x', 'weight_x', 'year_start', 'year_end', 'position']]
# Reassign rather than rename in place, to avoid modifying a slice of the merged frame
player_data = player_data.rename(columns = {"Player": "player", "height_x": "height", "weight_x": "weight"})

Next, let's check whether the height, weight and position columns contain any NA values that need to be dropped.

In [538]:
print("The number of observations before removing NA values:", len(player_data))

player_data.dropna(subset = ['height', 'weight', 'position'], inplace = True)

print("The number of observations after removing NA values:", len(player_data))

The number of observations before removing NA values: 3814
The number of observations after removing NA values: 3814


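No rows were dropped, so there are no NA values to handle and no imputation is needed. Note, however, that position was cast to a string earlier, so a missing position would appear as the text "nan" rather than as an NA value and would not be caught by this check.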
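
A sketch of a filter that catches both forms of a missing position, a true NaN and the text "nan" left behind by a string cast. The toy Series here is an assumption for illustration, not the real column:

```python
import numpy as np
import pandas as pd

# Toy position column containing a real NaN and the literal text "nan".
positions = pd.Series(["G", np.nan, "F-C", "nan"], name="position")

# Keep rows that are neither missing nor the string "nan".
is_valid = positions.notna() & (positions != "nan")
valid_positions = positions[is_valid]
print(valid_positions.tolist())
```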

Next, I observed that there was a large number of players who had two positions, separated by a dash "-". Furthermore, one player had a position that was labelled as the string "nan". I will need to handle both of these situations to ensure that:

* Each player has a valid position
* Each player only has a single position

The possible positions of players should be G (Guard), F (Forward) or C (Center).

In [539]:
# Unique values of position
player_data['position'].value_counts()

G      1322
F      1079
C       434
F-C     332
G-F     296
C-F     176
F-G     174
nan       1
Name: position, dtype: int64

In [540]:
# Filtering nan value out
player_data = player_data[player_data['position'] != "nan"]

After filtering out the nan value, I need to reduce each player's position to a single label. From domain knowledge, the position listed before the dash "-" appears to be the player's primary position: players such as Karl-Anthony Towns and Myles Turner are listed as "C-F" and are widely regarded as Centers (C) who can also play the Forward (F) position. As a result, I will keep only the primary position of such players.

In [541]:
# Observing recent players to identify primary position
player_data[(player_data['year_start'] == 2016) & (player_data['position'].str.contains('-'))].tail()

Unnamed: 0,player,height,weight,year_start,year_end,position
3713,Jonathon Simmons,185.0,83.0,2016,2018,G-F
3715,Axel Toupane,201.0,89.0,2016,2017,G-F
3716,Karl-Anthony Towns,213.0,110.0,2016,2018,C-F
3717,Myles Turner,211.0,110.0,2016,2018,C-F
3720,Alan Williams,198.0,90.0,2016,2017,F-C


In [542]:
# Extracting only the primary position (first character) of each player
player_data['position'] = player_data['position'].str[0]

# Checking only primary positions exist
player_data['position'].value_counts()

G    1618
F    1585
C     610
Name: position, dtype: int64

Next, I will perform some feature engineering by adding each player's BMI as a feature. The rationale is that most NBA positions have been getting taller over time, especially Guards, which can cause confusion because some Guards, such as Luka Doncic, are as tall as an average Forward. BMI captures how "bulky" a player is: tall Guards are generally less bulky than Forwards of similar height because the roles differ. Forwards are generally expected to be better rebounders, which favours greater size, whereas Guards are generally expected to be more agile and shifty, which is typically associated with lighter players.

**Note:** Show a chart of average height of NBA players over time to support this point, plus a chart of the average weight of each position.

In [543]:
player_data['BMI'] = player_data['weight'] / (player_data['height']/100)**2
player_data.head()

Unnamed: 0,player,height,weight,year_start,year_end,position,BMI
0,Curly Armstrong,180.0,77.0,1949,1951,G,23.765432
1,Cliff Barker,188.0,83.0,1950,1952,G,23.483477
2,Leo Barnhorst,193.0,86.0,1950,1954,F,23.087868
3,Ed Bartels,196.0,88.0,1950,1951,F,22.907122
4,Ralph Beard,178.0,79.0,1950,1951,G,24.93372
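
As a quick check of the formula, the BMI calculation can be written as a small function and compared against the first row of the table above (180 cm, 77 kg). The function name `bmi` is my own:

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight in kg divided by the square of height in metres."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


# Curly Armstrong's row: 77 kg at 180 cm, listed above as 23.765432.
print(round(bmi(77.0, 180.0), 6))
```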


### NOTE: Include year_start and year_end.

Explanation: NBA players' dimensions have changed drastically over time (a graph could be inserted here). The year in which a player started playing could therefore inform which position they likely played.

**Note:** Try looking at the players that the model misclassified before including year_start and year_end but classified correctly after including them.
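
One way to run this before/after comparison is to score the same model on two feature lists. The sketch below is an assumption about how that might be set up, not the notebook's method: `compare_feature_sets` is a hypothetical helper, and the data is synthetic, with the label depending on height only, so it will not reproduce any effect of the year columns. Passing the real `player_data` would give the actual comparison.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def compare_feature_sets(data, feature_sets, target="position", cv=3):
    """Return the mean cross-validated accuracy for each named feature list."""
    scores = {}
    for label, columns in feature_sets.items():
        model = RandomForestClassifier(n_estimators=50, random_state=0)
        fold_scores = cross_val_score(model, data[columns], data[target], cv=cv)
        scores[label] = float(fold_scores.mean())
    return scores


# Synthetic stand-in data (NOT the real players).
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "height": rng.normal(198, 9, n),
    "weight": rng.normal(95, 12, n),
    "year_start": rng.integers(1950, 2018, n),
})
demo["year_end"] = demo["year_start"] + rng.integers(1, 15, n)
demo["position"] = np.where(demo["height"] > 205, "C",
                            np.where(demo["height"] > 193, "F", "G"))

results = compare_feature_sets(demo, {
    "dimensions only": ["height", "weight"],
    "dimensions + years": ["height", "weight", "year_start", "year_end"],
})
print(results)
```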

# Building ML Model

In [544]:
# Obtain predictors and response
player_predictors = player_data.drop(['player', 'position'], axis = 1)
player_response = player_data['position']

# Convert into 2D array with shape (n, 1)
player_response = np.array(player_response).reshape(-1, 1)

# Encode player response to be numeric
# (scikit-learn >= 1.2 renames `sparse` to `sparse_output`; `sparse` was removed in 1.4)
ohe_rf = OneHotEncoder(sparse=False, categories='auto')
enc_player_response = ohe_rf.fit_transform(player_response)

# Obtain training/test split
X_train, X_test, y_train, y_test = train_test_split(player_predictors,
                                                    enc_player_response,
                                                    train_size=0.75)



In [545]:
rfc_pipeline = make_pipeline(StandardScaler(),
                             RandomForestClassifier())

rfc_pipeline.fit(X_train, y_train)
y_pred = rfc_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

Accuracy: 0.8081761006289309


In [546]:
# Transform the third test prediction back into a position label
y_pred = rfc_pipeline.predict(X_test)
ohe_rf.inverse_transform(y_pred[2].reshape(1, 3))

array([['F']], dtype=object)
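
A possible alternative to one-hot encoding the response: scikit-learn classifiers accept string class labels directly. With a 2-D one-hot target the forest is fitted as a multilabel problem, where `accuracy_score` is a per-row exact match and a prediction can in principle contain no 1 at all. The sketch below uses synthetic heights and labels (an assumption for illustration, not the notebook's data) to show the plain multiclass setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: one feature (height in cm) and a string label.
rng = np.random.default_rng(0)
heights = rng.normal(198, 9, size=(200, 1))
labels = np.where(heights[:, 0] > 205, "C",
                  np.where(heights[:, 0] > 193, "F", "G"))

clf = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))
clf.fit(heights, labels)  # 1-D string target: ordinary multiclass problem

# Predictions come back as position labels, so no inverse_transform is needed.
pred = clf.predict(np.array([[185.0], [199.0], [215.0]]))
classes = list(clf.named_steps["randomforestclassifier"].classes_)
print(pred, classes)
```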

In [547]:
# Export model (context managers make sure the files are closed)
with open(os.path.join("models", "dimensions_rf.sav"), "wb") as f:
    pickle.dump(rfc_pipeline, f)
with open(os.path.join("models", "ohe_rf.sav"), "wb") as f:
    pickle.dump(ohe_rf, f)

# Obtaining Predictions

In [548]:
with open(os.path.join("models", "dimensions_rf.sav"), "rb") as f:
    model = pickle.load(f)
with open(os.path.join("models", "ohe_rf.sav"), "rb") as f:
    ohe = pickle.load(f)

In [564]:
height = 200
weight = 100
year_start = 1950
year_end = 1970

# [height, weight, year_start, year_end, bmi] is the required format
tmp_input = np.array([[height, weight, year_start, year_end, (weight / (height/100)**2)]])

In [574]:
model.predict(tmp_input)



array([[0., 1., 0.]])

In [566]:
ohe.inverse_transform(model.predict(tmp_input))



array([['C']], dtype=object)

In [602]:
height = 190
weight = 102
year_start = 2000
year_end = 2001
tmp_input = np.array([[height, weight, year_start, year_end, (weight / (height/100)**2)]])

In [603]:
model.predict(tmp_input)



array([[0., 0., 1.]])

In [604]:
ohe.inverse_transform(model.predict(tmp_input))[0][0]



'G'
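
Since the same input array is built by hand twice above, a small helper could make the feature order explicit. This is a sketch: `build_input` and `FEATURE_COLUMNS` are hypothetical names, and the column order is assumed from the `[height, weight, year_start, year_end, bmi]` comment earlier. Returning a DataFrame with named columns should also match the pipeline, which was fitted on a DataFrame.

```python
import pandas as pd

# Column order assumed from player_data after dropping 'player' and 'position'.
FEATURE_COLUMNS = ["height", "weight", "year_start", "year_end", "BMI"]


def build_input(height, weight, year_start, year_end):
    """Return a one-row DataFrame in the column order the model was fitted on."""
    bmi = weight / (height / 100) ** 2
    row = [[height, weight, year_start, year_end, bmi]]
    return pd.DataFrame(row, columns=FEATURE_COLUMNS)


example = build_input(200, 100, 1950, 1970)
print(example)
```

With the loaded model, this would be used as `model.predict(build_input(200, 100, 1950, 1970))`.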

**Note:** The above shows the importance of having a model that predicts with year_start and year_end. A player with these dimensions who played between 1950 and 1970 would have been considered a Center, whereas today the same dimensions would point to a Forward. Perhaps try to show a chart for this.