# Analysis of NBA Player Statistics

Professional basketball players put in a great deal of work to perform well every game, and much of their success comes from practice. Success in basketball may also depend on other factors, however. Some would argue that a seven-foot player is likely to score more, grab more rebounds, and generally perform better than a five-foot player. For this reason, in this project we ask:

## **Research Question:** Can we accurately predict an NBA player's height based on information from their season statistics?

## Data Collection
Our sample consists of the statistics of each player who played in the 2018-2019 NBA season:
* To obtain their names, we scraped data from the NBAstuffer website (https://www.nbastuffer.com/2018-2019-nba-player-stats/) and used BeautifulSoup to collect and parse the names of all players in the 2018-2019 NBA Season into a Python list.
![alt text](https://drive.google.com/uc?id=1IvtvLYlmrCHjQmXkmYpMqBO5C0X5CQNB)
* For every player in the list, we called the balldontlie API (https://www.balldontlie.io/#introduction) to collect each player's balldontlie ID into a list, and then passed this list in a single call to retrieve every player's 2018-2019 season averages.


![alt text](https://drive.google.com/uc?id=1Obj_Y2cOy-mmBkHykMwBBTYaGbWT0RH1)
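As a rough illustration of the "single call" step, the sketch below builds one request URL that covers several player IDs at once. The base URL, the `season_averages` path, and the repeated `player_ids[]` query key are assumptions about the balldontlie API as it existed at the time, and `season_averages_url` is a hypothetical helper, not the code used in this project. The IDs 3, 6, and 8 are taken from the `player_id` column shown later in this report.

```python
from urllib.parse import urlencode

# Assumed base URL for the balldontlie API; check the current API docs.
BASE_URL = "https://www.balldontlie.io/api/v1"


def season_averages_url(player_ids, season=2018):
    """Build one request URL covering every player ID (hypothetical helper)."""
    params = [("season", season)]
    # Repeating the "player_ids[]" key is the assumed array syntax.
    params += [("player_ids[]", pid) for pid in player_ids]
    return f"{BASE_URL}/season_averages?{urlencode(params)}"


url = season_averages_url([3, 6, 8])
print(url)
```

Batching the IDs this way keeps the number of requests low, which matters if the API enforces a rate limit.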

## Data Cleaning
At this point, we had a data frame of player season averages. From here, we needed to add individual player data, such as height, weight, and position.
* We called the API again, using the IDs to request each player's individual information (name, position, weight, and height), and stored the responses in another array of JSON data.
* We copied the fields of each JSON entry in the array into a single data frame that now included both player data and 2018-2019 season averages.
* Our **Quantitative Data** are all of the season averages (points per game, rebounds per game, minutes played, and field goal percentage, to name a few) as well as height and weight. Height was stored in two separate "feet" and "inches" columns, so we created a new "height (cm)" column in centimeters. We also converted the "min" column from a minutes:seconds string to a float, stored in the "minutes/gm" column.
* Our **Categorical Data** includes each player's name and position - guard (G), forward (F), center (C), or a combination of two. Position will be an important factor for determining a player's height.
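The two conversions described in the list above can be sketched as follows. The three rows are copied from the season-averages table in this report; the exact code used in the project is not shown, so treat this as one plausible way to do it with pandas.

```python
import pandas as pd

# Three rows copied from the season-averages table in this report.
df = pd.DataFrame({
    "player_name": ["Steven Adams", "LaMarcus Aldridge", "Grayson Allen"],
    "player_height_ft": [7.0, 6.0, 6.0],
    "player_height_in": [0.0, 11.0, 5.0],
    "min": ["33:19", "33:10", "8:40"],
})

CM_PER_INCH = 2.54

# Total inches (12 per foot) converted to centimeters.
df["height (cm)"] = (
    df["player_height_ft"] * 12 + df["player_height_in"]
) * CM_PER_INCH

# Split "MM:SS" into minutes and seconds, then express as fractional minutes.
parts = df["min"].str.split(":", expand=True).astype(float)
df["minutes/gm"] = parts[0] + parts[1] / 60

print(df[["player_name", "height (cm)", "minutes/gm"]])
```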

In [0]:
import pandas as pd
avgsurl = "https://raw.githubusercontent.com/austin-ng/data301/master/finalproject/EQcsv/nbaseasonavgs.csv"

df_avgs = pd.read_csv(avgsurl)
df_avgs.drop("Unnamed: 0", axis=1, inplace=True)
df_avgs

Unnamed: 0,games_played,player_id,season,min,fgm,fga,fg3m,fg3a,ftm,fta,oreb,dreb,reb,ast,stl,blk,turnover,pf,pts,fg_pct,fg3_pct,ft_pct,player_name,player_height_ft,player_height_in,player_weight,player_position,height (cm),minutes/gm
0,80,3,2018,33:19,6.01,10.09,0.00,0.03,1.83,3.65,4.89,4.61,9.50,1.55,1.48,0.95,1.71,2.55,13.85,0.596,0.000,0.500,Steven Adams,7.0,0.0,265.0,C,213.36,33.316667
1,81,6,2018,33:10,8.44,16.28,0.12,0.52,4.31,5.09,3.11,6.09,9.20,2.40,0.53,1.32,1.78,2.21,21.32,0.519,0.238,0.847,LaMarcus Aldridge,6.0,11.0,260.0,F,210.82,33.166667
2,48,8,2018,8:40,1.40,3.71,0.67,2.06,0.94,1.25,0.06,0.44,0.50,0.52,0.10,0.13,0.69,0.98,4.40,0.376,0.323,0.750,Grayson Allen,6.0,5.0,198.0,G,195.58,8.666667
3,80,9,2018,26:12,4.19,7.10,0.08,0.56,2.46,3.48,2.40,6.01,8.41,1.38,0.55,1.50,1.30,2.30,10.91,0.590,0.133,0.709,Jarrett Allen,6.0,11.0,237.0,C,210.82,26.200000
4,82,10,2018,27:56,3.13,7.23,1.17,3.41,1.83,2.11,1.37,6.07,7.44,1.27,0.83,0.40,0.88,1.74,9.27,0.433,0.343,0.867,Al-Farouq Aminu,6.0,9.0,220.0,F,205.74,27.933333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
205,3,1988,2018,4:15,1.00,2.00,0.00,0.00,0.00,0.67,0.67,0.33,1.00,0.33,0.00,0.33,1.00,2.00,2.00,0.500,0.000,0.000,Donatas Motiejunas,,,,,,4.250000
206,7,2106,2018,6:10,0.43,1.00,0.14,0.14,0.00,0.00,0.57,2.29,2.86,0.57,0.14,0.14,0.71,1.29,1.00,0.429,1.000,0.000,Eric Moreland,,,,,,6.166667
207,30,2158,2018,13:14,0.87,2.10,0.30,0.93,0.43,0.50,0.23,1.37,1.60,0.97,0.77,0.10,0.57,1.27,2.47,0.413,0.321,0.867,Patrick McCaw,6.0,7.0,185.0,,200.66,13.233333
208,40,2175,2018,24:28,2.95,6.30,1.85,4.45,1.40,1.78,0.63,2.88,3.50,1.00,0.53,0.28,0.88,2.00,9.15,0.468,0.416,0.789,Danuel House Jr.,6.0,7.0,220.0,,200.66,24.466667


## Data Exploration

With the data assembled, we began looking for correlations between the variables in the data set and player height.



We first checked the distribution of player heights.
![alt text](https://drive.google.com/uc?id=1gB8QYcJvmrfXybQY5FRTiw7XGj-U2br_)

We then explored the relationship between the player's position and their height, again in the form of a density plot.
![alt text](https://drive.google.com/uc?id=1brIKdXU6vr5Jg8-GCrmC-MVV75VgS37e)

This plot shows that certain positions, such as C-F, C, and F-C, are played by taller players, whereas other positions like G and F-G are played by shorter players.

Following that, we created scatterplots of the relationship between height and the following variables:

*   Points per game (r=-0.01)
![alt text](https://drive.google.com/uc?id=16j-kXuxSa-u1Bz3N-csfdLVUvD_3azYt)
*   Rebounds per game (r=0.44)
![alt text](https://drive.google.com/uc?id=1oNrsDMcsVbllLFZxGs76RykOC4GhxocQ)
*   Blocks per game (r=0.53)
![alt text](https://drive.google.com/uc?id=19A4hbUCD6CAGWGfUkhnovyInwhyafPGu)
*   Field goal percentage (r=0.52)
![alt text](https://drive.google.com/uc?id=1tL58Bm0V6sXgJfQwm5MI1ZWWAWb0cOQR)


We then adjusted the values by dividing each by the average number of minutes the player played per game, and created scatterplots for the adjusted variables:


*   Rebounds per game (adjusted r=0.72, original r=0.44)
![alt text](https://drive.google.com/uc?id=1d_YJH6R1SO7b57yqKFST8dj-282hyGg8)
*   Blocks per game (adjusted r=0.55, original r=0.53)
![alt text](https://drive.google.com/uc?id=1RSvZ8AV71y7MdSPbRbCO7LmMlPP0OhNM)
*   Field goal percentage (adjusted r=0.16, original r=0.52)
![alt text](https://drive.google.com/uc?id=1EiekVMoLLo-mEpBtBNpvOt9jMrnUd3Ls)

The correlation with height strengthened for two of the variables (substantially for rebounds, only marginally for blocks), but it dropped sharply for field goal percentage. This indicates that dividing by minutes played per game is not a suitable transformation for statistics like field goal percentage, which is already a rate and should be largely independent of time played. Nevertheless, the stronger correlations for rebounds and blocks per game lend more credence to the possibility of estimating height from stats.
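The per-minute adjustment and the correlation comparison can be sketched like this. The seven rows are the players with a listed height who appear in the season-averages table above, so the printed values will not reproduce the r values reported for the full data set; the loop and the `correlations` dictionary are illustrative choices, not the project's code.

```python
import pandas as pd

# Seven rows with known height, copied from the table in this report.
df = pd.DataFrame({
    "reb": [9.50, 9.20, 0.50, 8.41, 7.44, 1.60, 3.50],
    "blk": [0.95, 1.32, 0.13, 1.50, 0.40, 0.10, 0.28],
    "minutes/gm": [33.316667, 33.166667, 8.666667, 26.2,
                   27.933333, 13.233333, 24.466667],
    "height (cm)": [213.36, 210.82, 195.58, 210.82, 205.74, 200.66, 200.66],
})

correlations = {}
for stat in ["reb", "blk"]:
    # Per-minute rate: per-game average divided by minutes per game.
    per_minute = df[stat] / df["minutes/gm"]
    r_raw = df[stat].corr(df["height (cm)"])      # Pearson r by default
    r_adj = per_minute.corr(df["height (cm)"])
    correlations[stat] = (r_raw, r_adj)
    print(f"{stat}: raw r = {r_raw:.2f}, per-minute r = {r_adj:.2f}")
```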

We also created line graphs of the median statistic for each height for several statistics. They offer the same conclusion as the scatterplots, but it is helpful to be able to visualize the trend of the center of the statistic.
![alt text](https://drive.google.com/uc?id=1B2MepYRqQwIy7X9QJ-C0hG3a64iaVPFZ)
![alt text](https://drive.google.com/uc?id=14u7HA3LUnDR4_ajAvlbusLfO0n6XB6Ib)
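The median-per-height series behind line graphs like these can be computed with a groupby. The sketch below uses the seven players with a listed height from the season-averages table above, and rebounds per game as the example statistic; the plotting step is omitted.

```python
import pandas as pd

# Rebounds per game and height for seven players from the table above.
df = pd.DataFrame({
    "reb": [9.50, 9.20, 0.50, 8.41, 7.44, 1.60, 3.50],
    "height (cm)": [213.36, 210.82, 195.58, 210.82, 205.74, 200.66, 200.66],
})

# One median value per distinct height, sorted by height.
medians = df.groupby("height (cm)")["reb"].median()
print(medians)
```

Calling `medians.plot()` on the result would draw the line, assuming matplotlib is installed.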

Finally, we explored how height can relate to statistics when we segmented the data by the player position using a groupby. This revealed interesting new trends, like the one in the image below:
![alt text](https://drive.google.com/uc?id=1BV6IEDg-krLO3bMkvE4r3r6qVBPZY6DF)

The existence of these separate trends for each position is evidence that height may be estimable given multiple summary statistics using an appropriate machine learning model.

## Machine Learning: Predicting Height
There are 11 players in our data frame that do not have a listed height. We are going to predict the heights (in centimeters) of these 11 players by training on the 200 players whose heights are known. From there, we will test our predictions against their actual heights (collected from https://www.basketball-reference.com) to obtain a proper testing error in the form of RMSE.

In [0]:
testurl = "https://raw.githubusercontent.com/austin-ng/data301/master/finalproject/EQcsv/nbaseasonavgs.csv"

df_testdata = pd.read_csv(testurl)
df_testdata.drop("Unnamed: 0", axis=1, inplace=True)
df_testdata

Unnamed: 0,games_played,player_id,season,min,fgm,fga,fg3m,fg3a,ftm,fta,oreb,dreb,reb,ast,stl,blk,turnover,pf,pts,fg_pct,fg3_pct,ft_pct,player_name,player_height_ft,player_height_in,player_weight,player_position,height (cm),minutes/gm
0,80,3,2018,33:19,6.01,10.09,0.00,0.03,1.83,3.65,4.89,4.61,9.50,1.55,1.48,0.95,1.71,2.55,13.85,0.596,0.000,0.500,Steven Adams,7.0,0.0,265.0,C,213.36,33.316667
1,81,6,2018,33:10,8.44,16.28,0.12,0.52,4.31,5.09,3.11,6.09,9.20,2.40,0.53,1.32,1.78,2.21,21.32,0.519,0.238,0.847,LaMarcus Aldridge,6.0,11.0,260.0,F,210.82,33.166667
2,48,8,2018,8:40,1.40,3.71,0.67,2.06,0.94,1.25,0.06,0.44,0.50,0.52,0.10,0.13,0.69,0.98,4.40,0.376,0.323,0.750,Grayson Allen,6.0,5.0,198.0,G,195.58,8.666667
3,80,9,2018,26:12,4.19,7.10,0.08,0.56,2.46,3.48,2.40,6.01,8.41,1.38,0.55,1.50,1.30,2.30,10.91,0.590,0.133,0.709,Jarrett Allen,6.0,11.0,237.0,C,210.82,26.200000
4,82,10,2018,27:56,3.13,7.23,1.17,3.41,1.83,2.11,1.37,6.07,7.44,1.27,0.83,0.40,0.88,1.74,9.27,0.433,0.343,0.867,Al-Farouq Aminu,6.0,9.0,220.0,F,205.74,27.933333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
205,3,1988,2018,4:15,1.00,2.00,0.00,0.00,0.00,0.67,0.67,0.33,1.00,0.33,0.00,0.33,1.00,2.00,2.00,0.500,0.000,0.000,Donatas Motiejunas,,,,,,4.250000
206,7,2106,2018,6:10,0.43,1.00,0.14,0.14,0.00,0.00,0.57,2.29,2.86,0.57,0.14,0.14,0.71,1.29,1.00,0.429,1.000,0.000,Eric Moreland,,,,,,6.166667
207,30,2158,2018,13:14,0.87,2.10,0.30,0.93,0.43,0.50,0.23,1.37,1.60,0.97,0.77,0.10,0.57,1.27,2.47,0.413,0.321,0.867,Patrick McCaw,6.0,7.0,185.0,,200.66,13.233333
208,40,2175,2018,24:28,2.95,6.30,1.85,4.45,1.40,1.78,0.63,2.88,3.50,1.00,0.53,0.28,0.88,2.00,9.15,0.468,0.416,0.789,Danuel House Jr.,6.0,7.0,220.0,,200.66,24.466667
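The split described above (train on players with a known height, predict for those without) comes down to filtering on missing values. A minimal sketch, using four rows from the table and only one feature column for brevity:

```python
import numpy as np
import pandas as pd

# Two players with a listed height and two without, from the table above.
df = pd.DataFrame({
    "player_name": ["Steven Adams", "Grayson Allen",
                    "Donatas Motiejunas", "Eric Moreland"],
    "reb": [9.50, 0.50, 1.00, 2.86],
    "height (cm)": [213.36, 195.58, np.nan, np.nan],
})

has_height = df["height (cm)"].notna()
train = df[has_height]    # rows used to fit the model
test = df[~has_height]    # rows whose height will be predicted

print(len(train), "training rows;", len(test), "rows to predict")
```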


### Linear Regression Model
For our linear model, we tested several models built on different sets of quantitative variables. We found that the following statistics were most effective and produced the smallest RMS training error:
* Points: points scored per game
* Rebounds: missed shots recovered per game
* Assists: passes that led to a made basket per game
* Blocks: shots blocked per game
* Steals: opponent possessions stolen by the player per game
* Weight: the weight of the player in pounds

![alt text](https://drive.google.com/uc?id=1g9Z5QasWarqwby1bPwZQAY4sckDzattX)

Using these six statistics in a linear regression model, we obtained an RMSE of 4.877 on the test data.
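For readers unfamiliar with the workflow, here is a minimal sketch of fitting a six-feature linear regression and scoring it with RMSE using scikit-learn. The data is synthetic (randomly generated, not real player statistics), and the coefficients used to generate it are arbitrary assumptions; only the feature names and the 200/11 split sizes come from this report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
features = ["pts", "reb", "ast", "blk", "stl", "player_weight"]

# Synthetic stand-in data (NOT real player stats): 200 train + 11 test rows.
n_train, n_test = 200, 11
X = rng.normal(size=(n_train + n_test, len(features)))
# Arbitrary linear relationship plus noise, only so there is something to fit.
true_coef = np.array([0.0, 2.0, -1.0, 2.0, -0.5, 4.0])
y = 200 + X @ true_coef + rng.normal(scale=4.0, size=n_train + n_test)

X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# RMSE: square root of the mean squared prediction error.
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
print(f"test RMSE: {rmse:.3f}")
```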

In [0]:
linearurl = "https://raw.githubusercontent.com/austin-ng/data301/master/finalproject/EQcsv/linearresults.csv"
df_linresults = pd.read_csv(linearurl)
df_linresults.drop("Unnamed: 0", axis=1, inplace=True)
df_linresults

| | Player | Predicted Height (cm) | Actual Height (cm) | Difference |
|---|---|---|---|---|
| 0 | Michael Carter-Williams | 193.718595 | 195.58 | 1.861405 |
| 1 | Wayne Ellington | 194.536415 | 193.04 | 1.496415 |
| 2 | Enes Kanter | 214.561853 | 208.28 | 6.281853 |
| 3 | Wesley Matthews | 198.825386 | 193.04 | 5.785386 |
| 4 | Jodie Meeks | 198.668978 | 193.04 | 5.628978 |
| 5 | Greg Monroe | 211.932503 | 210.82 | 1.112503 |
| 6 | Markieff Morris | 207.238376 | 203.0 | 4.238376 |
| 7 | Andrew Bogut | 212.747895 | 213.36 | 0.612105 |
| 8 | Donatas Motiejunas | 202.262654 | 213.36 | 11.097346 |
| 9 | Eric Moreland | 206.056816 | 208.28 | 2.223184 |


### Random Forest Model
For our Random Forest Regressor, we tested the number of estimators from 1 to 50, finding 37 to be the optimal number. 
![alt text](https://drive.google.com/uc?id=1EW-xZhP91MHS3aXF7MsiZre4aigzExW7)
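A sweep over the number of estimators could look like the sketch below. How each setting was scored in the project is not stated, so the 3-fold cross-validation here is an assumption, and the data is synthetic (not real player statistics). The best value found on this toy data has no bearing on the value of 37 reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT real player stats): 80 rows, 6 features.
X = rng.normal(size=(80, 6))
y = 200 + 3 * X[:, 1] + 3 * X[:, 3] + rng.normal(scale=4.0, size=80)

rmse_by_n = {}
for n in range(1, 51):
    forest = RandomForestRegressor(n_estimators=n, random_state=0)
    # scikit-learn returns negated MSE so that higher is always better.
    mse = -cross_val_score(forest, X, y, cv=3,
                           scoring="neg_mean_squared_error")
    rmse_by_n[n] = float(np.sqrt(mse.mean()))

best_n = min(rmse_by_n, key=rmse_by_n.get)
print(f"lowest cross-validated RMSE at n_estimators = {best_n}")
```

Scoring on held-out folds rather than on the training rows helps keep the chosen setting from simply rewarding overfitting.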

In [0]:
foresturl = "https://raw.githubusercontent.com/austin-ng/data301/master/finalproject/EQcsv/forestresults.csv"
df_forestresults = pd.read_csv(foresturl)
df_forestresults.drop("Unnamed: 0", axis=1, inplace=True)
df_forestresults

| | Player | Predicted Height (cm) | Actual Height (cm) | Difference |
|---|---|---|---|---|
| 0 | Michael Carter-Williams | 193.383243 | 195.58 | 2.196757 |
| 1 | Wayne Ellington | 194.138378 | 193.04 | 1.098378 |
| 2 | Enes Kanter | 211.369189 | 208.28 | 3.089189 |
| 3 | Wesley Matthews | 199.14973 | 193.04 | 6.10973 |
| 4 | Jodie Meeks | 195.374054 | 193.04 | 2.334054 |
| 5 | Greg Monroe | 208.005405 | 210.82 | 2.814595 |
| 6 | Markieff Morris | 204.22973 | 203.0 | 1.22973 |
| 7 | Andrew Bogut | 206.495135 | 213.36 | 6.864865 |
| 8 | Donatas Motiejunas | 209.172432 | 213.36 | 4.187568 |
| 9 | Eric Moreland | 205.396757 | 208.28 | 2.883243 |


The RMSE for our Random Forest model was 3.28, making it the best of our three models by training error. The sum and mean of the differences between the actual and predicted heights for the Random Forest model are shown below.

In [0]:
df_forestresults["Difference"].sum(), df_forestresults["Difference"].mean()

(36.10324324324324, 3.282113022113022)
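Note that the mean of the absolute differences and the RMSE are different summaries: RMSE squares each error before averaging, so large misses weigh more and RMSE is never smaller than the mean absolute difference. The sketch below computes both from the ten Random Forest rows displayed above. Only ten rows appear in that table while the text mentions 11 test players, so these values will not exactly match the full-sample figures.

```python
import numpy as np

# Predicted and actual heights (cm) from the ten rows displayed above.
predicted = np.array([193.383243, 194.138378, 211.369189, 199.14973,
                      195.374054, 208.005405, 204.22973, 206.495135,
                      209.172432, 205.396757])
actual = np.array([195.58, 193.04, 208.28, 193.04, 193.04,
                   210.82, 203.0, 213.36, 213.36, 208.28])

errors = predicted - actual
mae = float(np.abs(errors).mean())            # mean absolute difference
rmse = float(np.sqrt((errors ** 2).mean()))   # root mean squared error

print(f"mean absolute difference: {mae:.3f} cm")
print(f"RMSE: {rmse:.3f} cm")
```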

### K Nearest Neighbors Model
In our K Neighbors model, we standardized the features used in the linear model, applied OneHotEncoder to the player's position, and fed the result to a K Neighbors Regressor in a pipeline. To find the optimal number of neighbors, we ran the data through 20 models, each using one more neighbor than the last. Using this method, we found that the optimal value for k is 6 neighbors: the RMSE is minimized at k = 6 in the graph below.

![alt text](https://drive.google.com/uc?id=1pftFBU35O2X4x53CCWHGWR1ZYo7UR0jz)

With the optimal value for k, we minimized our RMSE with the actual player heights to 3.88.
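The pipeline and the search over k can be sketched with scikit-learn as follows. The data is synthetic (not real player statistics), the position effect used to generate it is an arbitrary assumption, and scoring by 5-fold cross-validation is also an assumption, since the report does not say how each k was evaluated. `make_knn` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
numeric = ["pts", "reb", "ast", "blk", "stl"]

# Synthetic stand-in data (NOT real player stats).
n = 120
X = pd.DataFrame(rng.normal(size=(n, len(numeric))), columns=numeric)
X["player_position"] = rng.choice(["G", "F", "C"], size=n)
offset = X["player_position"].map({"G": -8.0, "F": 0.0, "C": 8.0})
y = 200 + offset + 2 * X["reb"] + rng.normal(scale=3.0, size=n)


def make_knn(k):
    """Standardize numeric stats, one-hot encode position, then KNN."""
    preprocess = make_column_transformer(
        (StandardScaler(), numeric),
        (OneHotEncoder(handle_unknown="ignore"), ["player_position"]),
    )
    return make_pipeline(preprocess, KNeighborsRegressor(n_neighbors=k))


rmse_by_k = {}
for k in range(1, 21):
    mse = -cross_val_score(make_knn(k), X, y, cv=5,
                           scoring="neg_mean_squared_error")
    rmse_by_k[k] = float(np.sqrt(mse.mean()))

best_k = min(rmse_by_k, key=rmse_by_k.get)
print(f"lowest cross-validated RMSE at k = {best_k}")
```

Putting the scaler and encoder inside the pipeline means they are refit on each training fold, so no information from the held-out fold leaks into preprocessing.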

Our final model was a 6-neighbors model trained on points, rebounds, assists, blocks, steals, and player position. The cell below loads its predictions and compares them with the players' actual heights.

In [0]:
import numpy as np
resultsurl = "https://raw.githubusercontent.com/austin-ng/data301/master/finalproject/EQcsv/finalresults2.csv"
df_results = pd.read_csv(resultsurl)
df_results.drop("Unnamed: 0", axis=1, inplace=True)
df_results["Difference"] = np.abs(df_results["Predicted Height (cm)"] - df_results["Actual Height (cm)"])
df_results

| | Player | Predicted Height (cm) | Actual Height (cm) | Difference |
|---|---|---|---|---|
| 0 | Michael Carter-Williams | 193.886667 | 195.58 | 1.693333 |
| 1 | Wayne Ellington | 196.003333 | 193.04 | 2.963333 |
| 2 | Enes Kanter | 209.126667 | 208.28 | 0.846667 |
| 3 | Wesley Matthews | 198.966667 | 193.04 | 5.926667 |
| 4 | Jodie Meeks | 193.04 | 193.04 | 0.0 |
| 5 | Greg Monroe | 208.703333 | 210.82 | 2.116667 |
| 6 | Markieff Morris | 204.47 | 203.0 | 1.47 |
| 7 | Andrew Bogut | 207.01 | 213.36 | 6.35 |
| 8 | Donatas Motiejunas | 206.163333 | 213.36 | 7.196667 |
| 9 | Eric Moreland | 207.01 | 208.28 | 1.27 |


With a final RMSE of 3.88, the KNN model was the second-best model according to our training error.

However, the model predicted 8 out of 11 players' heights within 4.5 centimeters (about 1.7 inches) of error. In fact, it even had a lower mean difference and a lower sum of differences between the actual and predicted heights on the test data than the Random Forest model. This suggests that our Random Forest model overfit the training data.

In [0]:
df_results["Difference"].sum(), df_results["Difference"].mean()

(34.06666666666678, 3.096969696969707)

For reference, the range and standard deviation of height values are provided below.

In [0]:
df_avgs["height (cm)"].max() - df_avgs["height (cm)"].min(), df_avgs["height (cm)"].std()

(38.099999999999994, 8.433337781251472)
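One way to read the standard deviation as a yardstick: a baseline that always predicts the mean height has an RMSE equal to the population standard deviation of the heights, so a model is useful to the extent that its error falls below that. The sketch below checks this identity on the seven listed heights that appear in the season-averages table above. Note that NumPy's `std` defaults to the population formula (`ddof=0`), while pandas' `std` defaults to the sample formula (`ddof=1`), so the two differ slightly on small samples.

```python
import numpy as np

# Heights (cm) of the seven players with a listed height in the table above.
heights = np.array([213.36, 210.82, 195.58, 210.82, 205.74, 200.66, 200.66])

# Baseline "model": predict the mean height for every player.
baseline = np.full_like(heights, heights.mean())
baseline_rmse = float(np.sqrt(((heights - baseline) ** 2).mean()))

print(f"baseline RMSE:      {baseline_rmse:.3f} cm")
print(f"population std dev: {heights.std():.3f} cm")
```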

## Final Results
Our best model (K Nearest Neighbors) predicted 8 out of 11 players' heights within 4.5 centimeters (about 1.7 inches), with a mean difference between actual and predicted height of 3.10 centimeters. Given that player heights span a range of 38.10 centimeters with a standard deviation of 8.43 centimeters, the model's typical error is well under half a standard deviation, although a better model could likely be constructed.

We therefore conclude that an NBA player's height can be predicted reasonably well from their season statistics. This suggests that a player's height is strongly associated with their performance in the NBA.