# NBA 2023 Draft Class Sucess Prediction

# Motivation:
## Problem
A problem that data science can provide helpful insights for is predicting the success of players in the NBA draft. By analyzing performance data from college players, classification algorithms can identify players with the potential to become stars based on their statistics.
This is an important problem for NBA teams as they invest a lot of money in drafting players, and their success depends on picking the right players. According to a study by DraftExpress, over the last 20 years, NBA teams have spent over $1 billion on draft picks, and the success rate of those picks has been only around 50% (1). Therefore, using data science to improve the accuracy of draft predictions could lead to significant benefits for NBA teams.

Schlosser, K. (2019). An NBA team's draft mistake can cost them $8.3 million, and the league is using AI to help make better decisions. Business Insider.
## Solution
The goal of this project is to utilize statistical features of basketball players such as points per game, field goal percentage, and player efficiency rating to predict their efficacy in the NBA draft. By analyzing the relationships between these statistical features and a player's performance in the draft, we can identify key factors that contribute to success in the NBA. To create a prediction model we will train our model on all of the draft classes sucess in the NBA from 2009-2018. This will require cleaning the data and mergering their NBA PER with their college statisitics. 

## Impact
If successful, this work on predicting the potential success of NBA draft picks based on their performance data may have several impacts. Firstly, it could help NBA teams to make more informed decisions when selecting players in the draft. This could lead to more successful draft classes, which may translate into more successful seasons for the teams.

However, there could be some negative impacts as well. For example, relying too heavily on machine learning algorithms to make player selections could lead to a lack of diversity in the types of players chosen. This could stifle innovation and creativity in the sport, as teams focus solely on picking players who fit the mold of what the algorithm identifies as successful.

# Dataset
## Detail
We will use a [2023 NBA Draft Prospects Stats](https://basketball.realgm.com/nba/draft/prospects/stats/Averages/All/) to observe the following features for each player:

- Team: College or international team
- GP: Games played
- MPG: Minutes per game
- PPG: Points per game
- FGM: Field goals made
- FGA: Field goals attempted
- FG%: Field goal percentage
- 3PM: 3-pointers made
- 3PA: 3-pointers attempted
- 3P%: 3-point percentage
- FTM: Free throws made
- FTA: Free throws attempted
- FT%: Free throw percentage
- ORB: Offensive rebounds
- DRB: Defensive rebounds
- RPG: Rebounds per game
- APG: Assists per game
- SPG: Steals per game
- BPG: Blocks per game
- TS%: True shooting percentage
- eFG%: Effective field goal percentage
- ORB%: Offensive rebound percentage
- DRB%: Defensive rebound percentage
- TRB%: Total rebound percentage
- AST%: Assist percentage
- TOV%: Turnover percentage
- STL%: Steal percentage
- BLK%: Block percentage
- USG%: Usage percentage
- PPR: Pure point rating
- PPS: Points per shot
- ORtg: Offensive rating
- DRtg: Defensive rating
- PER: Player efficiency rating

We will use a [2009-2018 NBA Draft College Stats](https://www.kaggle.com/datasets/adityak2003/college-basketball-players-20092021/code?select=CollegeBasketballPlayers2009-2021.csv) as our training data. We will drop missing values and merge college and current NBA PER. We will use df.groupby(“Player”).mean() for repeated player names for players who switch teams in the middle of a season




## Sufficient Data

These statistics can provide us with insights into a player's strengths and weaknesses, allowing us to assess their potential success in the NBA. For example, a player who has a high PPG and a high FG% may be more likely to become a successful scorer in the NBA. Similarly, a player who has a high RPG and a high BPG may be more likely to become a successful rebounder and shot blocker. By analyzing these statistics and identifying patterns and trends, we can create models and algorithms that can predict a player's potential success in the NBA draft. Additionally, I have included advanced stats from the dataset including PER, TS% etc. that can help us make complex descisions. Additionally, our college dataset has many data points to help us create a more accurate predictor. 

In [4]:
# reads in NBA 2023 draft class data
df_2023_draft = pd.read_csv('nba2023draftclasssupdated - nba2023draftclassupdated.csv')

# drop unneeded "Unnamed" columns  
feat_keep = ['Player', 'Team', 'GP', 'Ortg', 'usg', 'eFG', 'FTM', 'FTA', 'TO_per', 'TPM', 'TPA', 'drtg', 'mp', 'oreb', 'dreb', 'ast', 'stl', 'blk', 'pts']

# loop through the columns and drop those not in the list of columns to keep
for feat in df_2023_draft.columns:
    if feat not in feat_keep:
        df_2023_draft.drop(feat, axis=1, inplace=True)

# drops missing values
df_2023_draft.dropna(inplace=True)

df_2023_draft

Unnamed: 0,Player,Team,GP,mp,pts,TPM,TPA,FTM,FTA,oreb,dreb,ast,stl,blk,eFG,TO_per,usg,Ortg,drtg
0,Adam Flagler,BU,27,33.3,15.5,2.5,6.3,2.6,3.1,0.3,2.1,4.8,1.2,0.1,0.518,10.9,23.5,121.6,106.0
1,Adem Bona,UCLA,27,23.1,8.0,0.0,0.0,1.4,2.5,2.2,2.9,0.7,0.7,1.6,0.672,17.2,16.1,122.0,88.3
2,Alex Fudge,UF,27,20.0,6.0,0.4,1.4,1.1,1.8,1.3,3.1,0.4,0.5,0.7,0.455,15.1,19.0,94.1,97.1
3,Amari Bailey,UCLA,21,25.7,10.1,0.7,1.8,0.8,1.4,0.6,2.7,2.0,1.0,0.4,0.527,19.4,23.4,97.4,91.4
4,"Andre Jackson, Jr.",UConn,26,28.6,6.5,0.7,2.6,1.0,1.4,2.0,4.3,4.1,1.1,0.6,0.447,22.1,14.9,108.3,94.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77,Efe Abogidi,IGN,4,12.0,3.0,0.0,0.0,0.0,0.5,0.8,2.5,1.0,0.0,0.0,0.429,16.8,17.0,88.4,125.8
78,Sidy Cissoko,IGN,27,29.9,13.1,1.1,3.7,1.7,2.7,0.8,2.0,3.6,1.3,0.9,0.519,15.0,18.9,106.2,123.5
79,Scoot Henderson,IGN,19,30.7,16.5,0.7,2.7,2.2,2.9,1.1,4.3,6.5,1.0,0.5,0.455,18.4,26.8,99.2,123.9
80,Mojave King,IGN,28,24.5,7.7,0.6,2.3,1.0,1.1,0.9,3.4,1.1,0.4,0.2,0.459,10.0,14.1,103.4,125.1


In [3]:
import pandas as pd

# read in the data
df_10_years_college = pd.read_csv('2009 to 2008 college data.txt')

# group the data by player name and calculate the mean of each group
df_average_college = df_10_years_college.groupby('Player').mean()

# data from 2013 to 2022
df_4_years_after = pd.read_csv('nba10years.txt')

# group the data by player name and calculate the mean of each group
df_average_nba = df_4_years_after.groupby('Player').mean()

# resets the index of df_average_nba to show "Player" names
df_average_nba = df_4_years_after.reset_index()

# nba data with only "Player" and "PER" as columns
df_10_years_clean_nba = df_average_nba[['Player', 'PER']]

# merges two datasets
df_10_years_final = pd.merge(df_10_years_clean_nba, df_10_years_college, on='Player', how='outer')

# drop unneeded "Unnamed" columns  
feat_keep = ['Player', 'team', 'conf', 'GP', 'Ortg', 'usg', 'eFG', 'FTM', 'FTA', 'TO_per', 'TPM', 'TPA', 'drtg', 'mp', 'oreb', 'dreb', 'ast', 'stl', 'blk', 'pts']

# loop through the columns and drop those not in the list of columns to keep
for feat in df_10_years_final.columns:
    if feat not in feat_keep:
        df_10_years_final.drop(feat, axis=1, inplace=True)

# removes row if PER has no value
df_10_years_final.dropna(how='any', inplace=True)

# saves as csv
df_10_years_final.to_csv('df_10_years_final.csv', index=False)

df_10_years_final.head()

Unnamed: 0,Player,team,conf,GP,Ortg,usg,eFG,TO_per,FTM,FTA,TPM,TPA,drtg,mp,oreb,dreb,ast,stl,blk,pts
0,Jeff Ayres,Arizona St.,P10,35.0,130.8,20.2,66.0,12.3,113.0,145.0,0.0,1.0,95.9973,32.5429,2.4857,5.7143,0.8857,0.4857,0.8571,14.5429
3,DeJuan Blair,Pittsburgh,BE,33.0,125.8,26.2,58.4,11.0,97.0,160.0,0.0,0.0,86.5716,27.3636,5.5758,6.7273,1.2424,1.6061,0.9091,15.6061
4,Chase Budinger,Arizona,P10,35.0,117.5,23.9,55.4,16.6,125.0,156.0,67.0,168.0,105.009,37.6286,1.4,4.8,3.3714,1.4286,0.4571,18.0
5,DeMarre Carroll,Missouri,B12,38.0,118.8,25.1,57.6,11.5,121.0,191.0,16.0,44.0,89.7454,28.0,2.4211,4.7632,2.1579,1.5526,0.6579,16.5526
7,Earl Clark,Louisville,BE,37.0,100.3,24.6,49.2,22.0,90.0,139.0,31.0,95.0,88.6298,34.2703,2.7568,5.9459,3.2162,1.027,1.3784,14.1892


## How Data will Solve this Problem
To create a classification model, we would first need to define what we mean by "success". We eventually decided that this would be a player's PER. Success is defined using PER (player efficiency rating), which summarizes most of the player’s advanced statistics into one number. 

To predict an NBA player's Player Efficiency Rating (PER) based on their college statistics, we employed a regression machine learning model. By using regression, we were able to identify and quantify the relationships between NBA PER and college statistics, and then use these relationships to predict PER values based on the independent variables of college stats.

In order to improve our model’s accuracy, we also utilized Principal Component Analysis (PCA). By applying PCA, we can improve the overall performance and accuracy of our model. 

Our final regression model was a RandomForestRegressor that utilizes both PCA and cross validation to predict NBA PER using 10 years of data. This give us an indication of how good our model was a prediciting the 2023 draft class's PER. 