## What Does It Take To Be an NBA All-Star? 
## A Data Science Analysis
## Samuel Frankel

The NBA All-Star game has been held almost every year since 1951. Twenty-four of the best players in the NBA are selected, by a mix of fan voting and head coaches, to compete in this prestigious game. (There can end up being more than 24 because of injury replacements.) Each year, as the teams are announced, there is an abundance of discussion regarding the choices. Why did this player make it over that player? Why was this player chosen? I think that if we look at the analytics behind which players are selected and which are left out, we can better understand the process, as well as predict who will make it before the teams are even announced.

The goal of this project is to create a model that predicts which players will be selected to the All Star team. We will collect player statistics from 11 seasons (2009-10 through 2019-20), look at which players made the All Star team in each of those years, and use this to train a random forest of decision trees. The model will estimate how likely a player is to be selected to the All Star team based on their statistics. We will analyze our results throughout the project to see where we can improve the model's accuracy.

## Step 1: Getting the Data

First we will import the libraries we will need to do our analysis. We will use pandas, in order to create dataframes, re to do regular expression operations, and statsmodels to do statistical analysis.

In [251]:
import pandas as pd
import re
import statsmodels.api as sm
import statsmodels.formula.api as smf


Now, we are going to create a dataframe with all the NBA players' stats for each season from 2010 to 2020. To get the stats, go to https://www.basketball-reference.com/leagues/NBA_2022_per_game.html, changing the year in the URL to the season you need. This website has NBA stats for every season in NBA history. On the table, click "share and export" and export as csv. Copy the table. Next, in either Excel or Numbers, paste the table, and export it as a csv into your project folder.

Now, using pandas, we will read the data into a dataframe. A dataframe is a two-dimensional, labeled table whose columns can hold different types of data. To learn more about pandas dataframes, see https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#:~:text=to%20different%20objects.-,DataFrame,most%20commonly%20used%20pandas%20object. 
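
As a small illustration of what a dataframe looks like in practice, the sketch below builds one from a few lines of CSV text instead of a file. The two sample rows reuse values from the 2010 table in this walkthrough, with names shortened; the `csv_text` variable and the trimmed column set are assumptions made only to keep the example self-contained.

```python
import io
import pandas as pd

# Two sample rows with a trimmed set of columns (values taken from the 2010 table).
csv_text = """Player,Pos,Age,Tm,PTS
Arron Afflalo,SG,24,DEN,8.8
LaMarcus Aldridge,PF,24,POR,17.9
"""

# read_csv accepts any file-like object, so StringIO stands in for a CSV file on disk.
sample = pd.read_csv(io.StringIO(csv_text))

# Assigning a single value to a new column fills that value into every row.
sample['Year'] = 2010

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # each column has its own data type
```

Each column keeps its own type here (text for names, numbers for stats), which is what lets us filter on `PTS` later.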

Below you will see how to read the 2010 NBA data into a dataframe: 

In [252]:
data = pd.read_csv('NBA10.csv', sep=',')
data['Year'] = 2010
data

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
0,1,Arron Afflalo\afflaar01,SG,24,DEN,82,75,27.1,3.3,7.1,...,0.7,2.4,3.1,1.7,0.6,0.4,0.9,2.7,8.8,2010
1,2,Alexis Ajinça\ajincal01,C,21,CHA,6,0,5.0,0.8,1.7,...,0.2,0.5,0.7,0.0,0.2,0.2,0.3,0.8,1.7,2010
2,3,LaMarcus Aldridge\aldrila01,PF,24,POR,78,78,37.5,7.4,15.0,...,2.5,5.6,8.0,2.1,0.9,0.6,1.3,3.0,17.9,2010
3,4,Joe Alexander\alexajo01,SF,23,CHI,8,0,3.6,0.1,0.8,...,0.3,0.4,0.6,0.3,0.1,0.1,0.0,1.1,0.5,2010
4,5,Malik Allen\allenma01,PF,31,DEN,51,3,8.9,0.9,2.3,...,0.7,0.9,1.6,0.3,0.2,0.1,0.4,1.3,2.1,2010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
573,438,Dorell Wright\wrighdo01,SF,24,MIA,72,1,20.8,2.7,5.8,...,0.7,2.6,3.3,1.3,0.7,0.4,0.7,1.3,7.1,2010
574,439,Julian Wright\wrighju01,SF,22,NOH,68,14,12.8,1.7,3.4,...,0.9,1.3,2.1,0.6,0.4,0.3,0.6,0.7,3.8,2010
575,440,Nick Young\youngni01,SG,24,WAS,74,23,19.2,3.1,7.5,...,0.3,1.1,1.4,0.6,0.4,0.1,0.8,2.0,8.6,2010
576,441,Sam Young\youngsa01,SF,24,MEM,80,1,16.5,2.8,6.2,...,1.0,1.6,2.5,0.7,0.4,0.3,1.2,1.3,7.4,2010


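As you can see, the above data contains 578 rows for the players in the NBA that season. The Rk column only reaches the 440s, so some players appear on more than one row (most likely players who changed teams during the season). If we do this for all 11 seasons, the dataframe will have roughly 6,000 rows, which is too many to look through easily and would make the row-by-row operations later in this walkthrough slow.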

According to StatMuse.com, the last time a player made an NBA All Star team while averaging fewer than 10 points per game was 2004, which is before our data begins. Therefore, to make things easier, we can drop all players who average fewer than 10 points per game. 

We will accomplish this with the following function: 


In [253]:
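def drop_under_10(data):
    # Keep every row except those averaging fewer than 10 points per game.
    # A boolean mask filters the whole column at once instead of dropping rows one by one.
    return data[~(data['PTS'] < 10.0)]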


Now, we will apply it to our dataframe for 2010 NBA data.

In [254]:
data = drop_under_10(data)
data

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
2,3,LaMarcus Aldridge\aldrila01,PF,24,POR,78,78,37.5,7.4,15.0,...,2.5,5.6,8.0,2.1,0.9,0.6,1.3,3.0,17.9,2010
5,6,Ray Allen*\allenra02,SG,34,BOS,80,80,35.2,5.8,12.2,...,0.6,2.6,3.2,2.6,0.8,0.3,1.6,2.3,16.3,2010
15,14,Carmelo Anthony\anthoca01,SF,25,DEN,69,69,38.2,10.0,21.8,...,2.2,4.4,6.6,3.2,1.3,0.4,3.0,3.3,28.2,2010
17,16,Gilbert Arenas\arenagi01,PG,28,WAS,32,32,36.5,7.9,19.3,...,0.5,3.6,4.2,7.2,1.3,0.3,3.7,3.0,22.6,2010
18,17,Trevor Ariza\arizatr01,SF,24,HOU,72,71,36.5,5.5,13.9,...,1.1,4.5,5.6,3.8,1.8,0.6,2.2,2.3,14.9,2010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
565,430,Marvin Williams\willima02,SF,23,ATL,81,81,30.5,3.7,8.2,...,1.3,3.8,5.1,1.1,0.8,0.6,0.9,2.0,10.1,2010
566,431,Mo Williams\willima01,PG,27,CLE,69,68,34.2,5.5,12.4,...,0.4,2.6,3.0,5.3,1.0,0.3,2.5,2.5,15.8,2010
567,432,Reggie Williams\willire02,SF,23,GSW,24,10,32.6,5.8,11.8,...,0.8,3.8,4.6,2.8,1.0,0.3,1.2,2.0,15.2,2010
571,436,Metta World Peace\artesro01,SF,30,LAL,77,77,33.8,4.0,9.6,...,1.3,3.0,4.3,3.0,1.4,0.3,1.6,2.1,11.0,2010


As you can see, we have now managed to bring the size of our data frame down to 179 rows. 

Now, we will get the rest of the data. We will handle the next year, 2011, just as we did the 2010 data. 

In [255]:
data2 = pd.read_csv('NBA11.csv', sep=',')
data2['Year'] = 2011
data2 = drop_under_10(data2)

Next, we will add the two dataframes together.

In [257]:
data = pd.concat([data,data2])

Now, we will repeat this process with the data for the rest of the years. 

In [258]:

# Load, label, and filter each remaining season, then append it to the running dataframe.
for year in range(2012, 2021):
    data2 = pd.read_csv(f'NBA{year % 100}.csv', sep=',')
    data2['Year'] = year
    data2 = drop_under_10(data2)
    data = pd.concat([data, data2])

data = data.reset_index(drop=True)

In [259]:
data

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
0,3.0,LaMarcus Aldridge\aldrila01,PF,24,POR,78,78,37.5,7.4,15.0,...,2.5,5.6,8.0,2.1,0.9,0.6,1.3,3.0,17.9,2010
1,6.0,Ray Allen*\allenra02,SG,34,BOS,80,80,35.2,5.8,12.2,...,0.6,2.6,3.2,2.6,0.8,0.3,1.6,2.3,16.3,2010
2,14.0,Carmelo Anthony\anthoca01,SF,25,DEN,69,69,38.2,10.0,21.8,...,2.2,4.4,6.6,3.2,1.3,0.4,3.0,3.3,28.2,2010
3,16.0,Gilbert Arenas\arenagi01,PG,28,WAS,32,32,36.5,7.9,19.3,...,0.5,3.6,4.2,7.2,1.3,0.3,3.7,3.0,22.6,2010
4,17.0,Trevor Ariza\arizatr01,SF,24,HOU,72,71,36.5,5.5,13.9,...,1.1,4.5,5.6,3.8,1.8,0.6,2.2,2.3,14.9,2010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2052,,Justise Winslow,SF,23,MIA,11,5,32.0,4.5,11.7,...,1.5,5.2,6.6,4.0,0.6,0.5,2.2,3.5,11.3,2020
2053,,Christian Wood,PF,24,DET,62,12,21.4,4.6,8.2,...,1.7,4.6,6.3,1.0,0.5,0.9,1.4,1.6,13.1,2020
2054,,Thaddeus Young,PF,31,CHI,64,16,24.9,4.2,9.4,...,1.5,3.5,4.9,1.8,1.4,0.4,1.6,2.1,10.3,2020
2055,,Trae Young,PG,21,ATL,60,60,35.3,9.1,20.8,...,0.5,3.7,4.3,9.3,1.1,0.1,4.8,1.7,29.6,2020


As you can see, we currently have a dataframe with 2057 rows that contains the stats of every player who averaged at least 10 points per game in each season from 2010 to 2020. 

Next, we need to mark down whether or not the player made the All Star team that year.

To accomplish this, we will make a dataframe with all the All Star Selections between 2010 and 2020. 

To get the data, go to https://basketball.realgm.com/nba/allstar/game/rosters/2010. Highlight the table and paste it in either Excel or Numbers. Export it as a csv and add it to your project folder. Repeat this process for the years 2011-2020. 

Next, read it into a pandas dataframe like we did above, and add a column, Year, with the year the data comes from.

In [260]:
allstars = pd.read_csv('AllStars10.csv', sep=',')
allstars['Year'] = 2010

# Read each remaining year's roster once and append it.
# (Loading a year twice would duplicate that year's All Stars.)
for year in range(2011, 2021):
    allstars2 = pd.read_csv(f'AllStars{year % 100}.csv', sep=',')
    allstars2['Year'] = year
    allstars = pd.concat([allstars, allstars2])

allstars.head()

Unnamed: 0,Player,Year
0,Carmelo Anthony,2010
1,Chauncey Billups,2010
2,Kobe Bryant,2010
3,Tim Duncan,2010
4,Kevin Durant,2010


We now have a dataframe with all the All Stars from 2010-2020. 

Now, we need to add a column to our original dataframe that indicates whether or not a player made the All Star team that year. 

In [261]:
data['All Star'] = 0

A player will have a 1 if they made the team, and a 0 if they did not. 

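Now, we will write a function that returns True if a player made the All Star team in a given year. It takes a player's name and a year as input, searches the All Star dataframe, and returns True if that player and year appear together in it. Because the names in the stats dataframe still carry a suffix at this point, the function checks whether the All Star name is contained in the stats name rather than testing for equality.

In [262]:
def was_allstar(name, year):
    for i, r in allstars.iterrows():
        # Substring match: the stats name still has a backslash suffix attached.
        if (r['Player'] in name and year == r['Year']):
            return True
    # No matching row was found for this player and year.
    return False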

Now, we will go through the dataframe of stats and call this method for every player. If the method returns true, we will add a 1 in the All Star column and a 0 if it returns false. 

In [263]:
for i, r in data.iterrows():
    if (was_allstar(r['Player'], r['Year'])):
        data.at[i, 'All Star'] = 1
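
The row-by-row search above rescans the whole All Star table for every player. One possible alternative, sketched here on toy data, is to build a set of (player, year) pairs once and test membership against it. This assumes the player names have already been cleaned so that both tables spell them identically. The `stats` and `selections` frames and the placeholder player names are hypothetical stand-ins for `data` and `allstars`.

```python
import pandas as pd

# Hypothetical stand-ins for the stats table and the All Star table.
stats = pd.DataFrame({
    'Player': ['Player A', 'Player B', 'Player A'],
    'Year': [2010, 2010, 2011],
})
selections = pd.DataFrame({
    'Player': ['Player A', 'Player C'],
    'Year': [2010, 2010],
})

# Build the lookup set once from the All Star table.
selected_pairs = set(zip(selections['Player'], selections['Year']))

# 1 if the exact (player, year) pair was selected, 0 otherwise.
stats['All Star'] = [
    int((player, year) in selected_pairs)
    for player, year in zip(stats['Player'], stats['Year'])
]
print(stats)
```

Note that this uses exact matching, so it would behave differently from the substring check if the names were left uncleaned.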


In [264]:
data.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year,All Star
0,3.0,LaMarcus Aldridge\aldrila01,PF,24,POR,78,78,37.5,7.4,15.0,...,5.6,8.0,2.1,0.9,0.6,1.3,3.0,17.9,2010,0
1,6.0,Ray Allen*\allenra02,SG,34,BOS,80,80,35.2,5.8,12.2,...,2.6,3.2,2.6,0.8,0.3,1.6,2.3,16.3,2010,0
2,14.0,Carmelo Anthony\anthoca01,SF,25,DEN,69,69,38.2,10.0,21.8,...,4.4,6.6,3.2,1.3,0.4,3.0,3.3,28.2,2010,1
3,16.0,Gilbert Arenas\arenagi01,PG,28,WAS,32,32,36.5,7.9,19.3,...,3.6,4.2,7.2,1.3,0.3,3.7,3.0,22.6,2010,0
4,17.0,Trevor Ariza\arizatr01,SF,24,HOU,72,71,36.5,5.5,13.9,...,4.5,5.6,3.8,1.8,0.6,2.2,2.3,14.9,2010,0


Now, the last column contains whether or not they made the All Star team.

Currently, the "Player" column also contains a player ID after a backslash (for example, "aldrila01"), which appears to be the Basketball Reference identifier rather than part of the name. That is not good to look at, so we will remove it from the dataframe. The loop below uses regular expressions to find the "\" and removes everything from it onward.

In [265]:
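for i, r in data.iterrows():
    # Only names containing a backslash followed by a lowercase letter need cleaning.
    if re.search(r'\\[a-z]', data.at[i, 'Player']):
        # Split at the backslash and keep the part before it.
        name = re.split(r'\\[a-z]', data.at[i, 'Player'])
        data.at[i, 'Player'] = name[0]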
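
To see the name cleaning in isolation, here is a minimal sketch of the same idea as a standalone function. The function name `clean_player_name` is an assumption for illustration; it uses `re.sub` to delete the first backslash and everything after it, which should give the same visible result as the split-based loop.

```python
import re

def clean_player_name(raw_name):
    """Return the display name without the backslash-separated ID suffix."""
    # Remove the first backslash and everything that follows it.
    return re.sub(r'\\.*$', '', raw_name)

print(clean_player_name('LaMarcus Aldridge\\aldrila01'))
print(clean_player_name('Trae Young'))  # names without a suffix pass through unchanged
```

A trailing asterisk such as the one in "Ray Allen*" is left in place, just as in the loop above.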

In our model, we will only look at what I consider to be the most important basic stats: age, points per game, field goal percentage, rebounds, assists, steals, blocks, and turnovers. Therefore, we will drop the columns with all of the other information.

In [266]:
data = data.drop(columns=['Pos', 'Rk', 'G', 'GS', 'MP', 'FG', 'FGA', '3P', '3P%', '2P', '2PA', '2P%', '3PA', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'PF'])

## Step 2: Creating the Model

Now, we will create our model. We will be creating a random forest of decision trees. A decision tree is a machine learning tool that will create branches based on inputs, and output a decision. In this case, it will take as input all the stats in the dataframe, and output whether or not that player will make the All Star team. 

To learn more about decision trees, check out this article: https://scikit-learn.org/stable/modules/tree.html.

Import the following in order to make a decision tree model:

In [267]:
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

We will be making our model using a random forest. A random forest trains many different decision trees (100 by default in recent versions of scikit-learn), each on a random resampling of the training data, and combines their predictions into a single, more stable prediction. 
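
To make the idea of combining trees concrete, the sketch below checks on synthetic data (not NBA data) that scikit-learn's forest probabilities equal the average of the individual trees' probabilities. The dataset from `make_classification` and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data with 8 features, standing in for player stats.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Each fitted decision tree is available in the estimators_ list.
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Average the per-tree class probabilities by hand.
manual_average = per_tree.mean(axis=0)

print(len(forest.estimators_))
print(np.allclose(manual_average, forest.predict_proba(X)))
```

So the forest does not pick a single best tree; every tree contributes to each prediction.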

We will divide the dataframe into "inputs" and "target". Inputs is all the stats we want the model to consider, and target is whether or not the player made the all star team. 

In [268]:
model = RandomForestClassifier()
inputs = data.drop(columns=['Player','Tm','Year','All Star'])
target = data['All Star']

Next, we will split our data so we can validate the model. We will use a holdout split, the simplest relative of cross validation: we divide the data into training data and testing data, build the model on the training data, and evaluate how well it works on the testing data. To learn more, see https://scikit-learn.org/stable/modules/cross_validation.html. 

We will be splitting our data into 80% as training data, and 20% as testing data.
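
For comparison, k-fold cross validation repeats the split several times so that every row is used for testing exactly once. The sketch below runs both approaches on synthetic data; the dataset, the 85/15 class balance, and the variable names are assumptions for illustration and are not taken from the NBA data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced two-class data standing in for the player table.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.85, 0.15], random_state=1)

# Holdout: one 80/20 split, one accuracy score.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
holdout_model = RandomForestClassifier(random_state=1).fit(X_train, y_train)
holdout_accuracy = holdout_model.score(X_test, y_test)

# 5-fold cross validation: five splits, five accuracy scores.
fold_scores = cross_val_score(RandomForestClassifier(random_state=1), X, y, cv=5)

print(holdout_accuracy)
print(fold_scores.mean(), fold_scores.std())
```

The spread of the fold scores gives a rough sense of how much a single holdout number might vary from split to split.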

In [269]:
X_train, X_test, y_train, y_test = train_test_split(inputs, target, random_state=1, test_size= 0.2)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
metrics.accuracy_score(y_test, y_pred)

0.9320388349514563
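
For context on an accuracy figure like this one, it can help to compare it with a baseline that predicts "not an All Star" for everyone, since most rows are not All Stars. The sketch below estimates that baseline under stated assumptions: 2057 rows in total and roughly 24 selections in each of the 11 seasons, ignoring injury replacements. These counts are approximations, not values computed from the actual dataframe.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Assumed label counts: 2057 player-seasons, about 24 selections per season for 11 seasons.
n_rows = 2057
n_all_stars = 24 * 11
labels = [1] * n_all_stars + [0] * (n_rows - n_all_stars)

baseline = majority_baseline(labels)
print(round(baseline, 3))  # 0.872
```

Under these assumptions, always guessing "no" would already score about 87%, so precision and recall on the All Star class may be more informative than accuracy alone.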

## Step 3: Adding advanced stats to our model 
We have now created the model, and our model correctly predicted around 93.2% of our testing data! That is pretty good!

However, I think that if we add advanced metrics into our model, we might even be able to do better.

Advanced metrics are a set of statistics that show how efficient players are and compare them with the rest of the league. Many consider them a more accurate way of judging how good specific players are than traditional statistics. 

We will now create a dataframe of advanced statistics using the same method we used earlier for the traditional statistics. 

We will be looking at the following advanced metrics:

PER (Player Efficiency Rating): A metric that combines all traditional stats, compares to league averages, and adjusts for the total minutes played. The league average is always 15.0.

TS% (True Shooting Percentage): A shooting efficiency measure that accounts for 2-point field goals, 3-point field goals, and free throws.

WS (Win Shares): An estimate of the number of wins a player contributes to their team.

BPM (Box Plus Minus): An estimate of the number of points a player contributes to their team per 100 possessions. 
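
As a worked example of one of these metrics, true shooting percentage is commonly computed as points divided by twice the number of "true shooting attempts", where free throw attempts are weighted by 0.44. The sketch below implements that common formula; the function name and the sample stat line are hypothetical.

```python
def true_shooting_pct(points, fga, fta):
    """True shooting percentage from points, field goal attempts, and free throw attempts."""
    # 0.44 is the conventional weight for how many free throw attempts use up a possession.
    true_shooting_attempts = fga + 0.44 * fta
    return points / (2 * true_shooting_attempts)

# Hypothetical per-game line: 20 points on 15 field goal attempts and 5 free throw attempts.
print(round(true_shooting_pct(20, 15, 5), 3))  # 0.581
```

Unlike FG%, this rewards 3-pointers and free throws, which is why the two numbers can differ noticeably for the same player.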

In [270]:
def drop_under_50(data):
    for i,r in data.iterrows():
        if r['G'] < 50:
            data = data.drop(i)
    return data

In order to keep the data size from getting too big, the function above drops players who played fewer than 50 games, as it is very unlikely one could make an All Star team with that few games (there are 82 games in a normal NBA season).

In [271]:
advanced = pd.read_csv('Advanced10.csv')
advanced['Year'] = 2010
advanced = drop_under_50(advanced)
advanced = advanced[['Player', 'PER', 'TS%', 'WS', 'BPM', 'Year']]

In [272]:
# Load the remaining seasons, apply the same games-played filter, and keep the same columns.
for year in range(2011, 2021):
    advanced2 = pd.read_csv(f'Advanced{year % 100}.csv')
    advanced2['Year'] = year
    advanced2 = drop_under_50(advanced2)
    advanced2 = advanced2[['Player', 'PER', 'TS%', 'WS', 'BPM', 'Year']]
    # Combining the dataframes
    advanced = pd.concat([advanced, advanced2])

advanced = advanced.reset_index()
# Strip the backslash suffix from each player name, as we did for the basic stats.
for i, r in advanced.iterrows():
    if re.search(r'\\[a-z]', advanced.at[i, 'Player']):
        name = re.split(r'\\[a-z]', advanced.at[i, 'Player'])
        advanced.at[i, 'Player'] = name[0]
       


In [273]:
advanced.head()

Unnamed: 0,index,Player,PER,TS%,WS,BPM,Year
0,0,Arron Afflalo,10.9,0.576,4.3,-0.4,2010
1,2,LaMarcus Aldridge,18.2,0.535,8.8,1.2,2010
2,4,Malik Allen,5.9,0.431,0.1,-5.7,2010
3,5,Ray Allen*,15.2,0.601,7.9,1.2,2010
4,6,Tony Allen,14.2,0.54,1.9,0.4,2010


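We now have a dataframe with all the NBA players' advanced statistics for 2010-2020. Next, we want to add these stats to our original dataframe. 

We will create columns for those statistics, and set the default values to 0.0. A player with no matching row in the advanced dataframe (for example, one removed by the games-played filter) will keep these defaults.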

In [274]:
data['PER'] = 0.0
data['TS%'] = 0.0
data['WS'] = 0.0
data['BPM'] = 0.0

We will create a function that returns a player's advanced metrics for a given year. It works by running a query on the advanced stats dataframe. It returns a tuple of the player's advanced stats, or an empty tuple if no matching row is found.

In [275]:
def get_advanced(player, year):
    stats = advanced.query('Player == @player and Year == @year')
    if (not stats.empty):
        stats = stats.reset_index()
        return (stats.at[0,'PER'], stats.at[0,'TS%'], stats.at[0, 'WS'], stats.at[0, 'BPM'])
    else:
        return ()
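
The same lookup could also be expressed as a single left merge on the player name and year. The sketch below shows this on two rows whose values come from the tables in this walkthrough; the `basic` and `adv` frames are hypothetical stand-ins for `data` and `advanced`. One difference from the approach used here: unmatched players end up with missing values (NaN) instead of 0.0.

```python
import pandas as pd

# Stand-ins for the basic stats table and the advanced stats table.
basic = pd.DataFrame({
    'Player': ['LaMarcus Aldridge', 'Gilbert Arenas'],
    'Year': [2010, 2010],
    'PTS': [17.9, 22.6],
})
adv = pd.DataFrame({
    'Player': ['LaMarcus Aldridge'],
    'Year': [2010],
    'PER': [18.2],
    'WS': [8.8],
})

# A left merge keeps every row of basic and attaches advanced stats where both keys match.
merged = basic.merge(adv, on=['Player', 'Year'], how='left')

# Rows with no match in adv can be found by checking for missing values.
missing = merged['PER'].isna()
print(merged)
```

Whether to drop, impute, or keep the unmatched rows is a separate modelling decision; a missing value at least makes the gap visible rather than looking like a real score of 0.0.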



Now, we will iterate through each player in the dataframe, and add their advanced metrics.

In [276]:
for i,r in data.iterrows():
    ad = get_advanced(r['Player'], r['Year'])
    if len(ad) != 0:
        data.at[i, 'PER'] = ad[0]
        data.at[i, 'TS%'] = ad[1]
        data.at[i, 'WS'] = ad[2]
        data.at[i, 'BPM'] = ad[3]
data.head()

Unnamed: 0,Player,Age,Tm,FG%,TRB,AST,STL,BLK,TOV,PTS,Year,All Star,PER,TS%,WS,BPM
0,LaMarcus Aldridge,24,POR,0.495,8.0,2.1,0.9,0.6,1.3,17.9,2010,0,18.2,0.535,8.8,1.2
1,Ray Allen*,34,BOS,0.477,3.2,2.6,0.8,0.3,1.6,16.3,2010,0,15.2,0.601,7.9,1.2
2,Carmelo Anthony,25,DEN,0.458,6.6,3.2,1.3,0.4,3.0,28.2,2010,1,22.2,0.548,7.9,2.3
3,Gilbert Arenas,28,WAS,0.411,4.2,7.2,1.3,0.3,3.7,22.6,2010,0,0.0,0.0,0.0,0.0
4,Trevor Ariza,24,HOU,0.394,5.6,3.8,1.8,0.6,2.2,14.9,2010,0,13.3,0.488,3.2,0.5


## Step 4: Making the new model
Now, we will create a random forest using the same method we used above, and see how it compares.

In [277]:
model2 = RandomForestClassifier()
inputs = data.drop(columns=['Player','Tm','FG%','Year','All Star'])
target = data['All Star']
X_train, X_test, y_train, y_test = train_test_split(inputs, target, random_state=1, test_size= 0.3)
model2.fit(X_train,y_train)
y_pred = model2.predict(X_test)
metrics.accuracy_score(y_test, y_pred)

0.93042071197411

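Unfortunately, the accuracy is not better: about 93.0%, compared with 93.2% for the first model. Note that this run holds out 30% of the data for testing rather than 20% and leaves out FG%, so the two scores are not directly comparable. Still, it appears that advanced metrics aren't especially helpful for determining All Stars.  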

## Step 5: Predicting 2021's All Stars!

We will now use our model to predict the All Stars for the NBA season currently in progress. Using the same method we used above, import the players' statistics for the 2021 season so far.

In [278]:
nba21 = pd.read_csv('NBA21.csv', sep=',')
nba21['All Star %'] = 0.0
nba21 = nba21.dropna()
nba21 = nba21.reset_index()
nba21.head()

Unnamed: 0,index,Rk,Player,Pos,Age,Tm,FG%,3P%,TRB,AST,STL,BLK,TOV,PTS,All Star %
0,0,1,Precious Achiuwa\achiupr01,C,22,TOR,0.386,0.269,8.2,1.6,0.5,0.6,1.1,8.0,0.0
1,2,3,Bam Adebayo\adebaba01,C,24,MIA,0.519,0.0,10.2,3.2,1.1,0.3,2.9,18.7,0.0
2,3,4,Santi Aldama\aldamsa01,PF,21,MEM,0.364,0.111,2.6,0.8,0.1,0.2,0.3,3.6,0.0
3,4,5,LaMarcus Aldridge\aldrila01,C,36,BRK,0.573,0.367,5.7,0.9,0.4,1.2,0.8,14.0,0.0
4,5,6,Nickeil Alexander-Walker\alexani01,SG,23,NOP,0.366,0.31,3.9,2.6,1.0,0.4,1.5,13.5,0.0


We have created another column labeled All Star %, which will hold each player's probability of making the All Star team according to our model.

We will also add a column that contains which conference they are in, as the All Star teams consist of 12 players from each conference.

In [279]:
east = ['PHI', 'BRK', 'MIL', 'NYK', 'ATL', 'MIA', 'BOS', 'WAS','IND', 'CHA', 'CHI', 'TOR', 'CLE', 'ORL', 'DET']
nba21['Conference'] = ''
for i, r in nba21.iterrows():
    if r['Tm'] in east:
        nba21.at[i,'Conference'] = 'East'
    else:
        nba21.at[i,'Conference'] = 'West'
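
The conference labelling can also be done without an explicit loop. The sketch below applies the same rule with `isin` and `numpy.where` on a toy frame; the `teams` frame is a hypothetical stand-in for `nba21`, and the team list is the one defined above.

```python
import numpy as np
import pandas as pd

east = ['PHI', 'BRK', 'MIL', 'NYK', 'ATL', 'MIA', 'BOS', 'WAS',
        'IND', 'CHA', 'CHI', 'TOR', 'CLE', 'ORL', 'DET']

# Toy stand-in for the 2021 stats table.
teams = pd.DataFrame({'Tm': ['MIA', 'DEN', 'BRK', 'LAL']})

# isin gives True for Eastern teams; np.where maps True/False to the two labels.
teams['Conference'] = np.where(teams['Tm'].isin(east), 'East', 'West')
print(teams)
```

As in the loop, any team code that is not in the list falls through to 'West', including an aggregate code such as `TOT` if the export contains one, so it may be worth checking the distinct values of `Tm` first.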

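Now, we will get the probability that each player has of making the All Star team. We use the first model here, since this dataframe contains only the basic stats that model was trained on. 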

In [280]:
inputs = nba21.drop(columns=['Pos','Conference','Rk', 'Player', 'Tm', '3P%', 'All Star %', 'index'])
inputs = inputs.dropna()
probabilities = model.predict_proba(inputs)

`probabilities` is now a 2D array with one row per player. In each row, the first element is the probability that the player does not make the All Star team, and the second element is the probability that they do. 
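
To illustrate the layout of this array, the sketch below fits a small forest on synthetic data and inspects what `predict_proba` returns. The data and variable names are assumptions for illustration. The column order follows the classifier's `classes_` attribute, which is why the second column corresponds to the label 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data; labels are 0 and 1, like the All Star column.
X, y = make_classification(n_samples=100, n_features=8, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

proba = clf.predict_proba(X[:5])

# classes_ gives the label that each column of proba refers to.
print(clf.classes_)
print(proba.shape)

# Look up the column for label 1 instead of hard-coding its position.
positive_column = list(clf.classes_).index(1)
positive_proba = proba[:, positive_column]
print(positive_proba)
```

Each row sums to 1, so the second column alone carries all the information we need.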

We will now go through the 2021 NBA dataframe and add the probability that each player will make the All Star team. We will store this data in a TSV file. 

In [284]:
# inputs was built row-for-row from nba21, so the second column of probabilities
# (the probability of being selected) lines up with nba21's rows.
nba21['All Star %'] = probabilities[:, 1]
nba21 = nba21.sort_values(by=['Conference','All Star %'], ascending= False)
nba21.to_csv('AllStarPredictions.tsv', sep='\t')


Now, in order to make our predictions, we will look at the top 12 players with the highest percent chance of making the All Star team in each conference. The results are as follows: 
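
Picking the top players in each conference can be done directly in pandas by sorting on the probability and taking the first rows of each group. The sketch below uses a toy frame with placeholder players and made-up probabilities, and takes the top 2 per conference rather than 12; `preds` and `top_n` are hypothetical names.

```python
import pandas as pd

# Toy predictions with made-up probabilities.
preds = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Conference': ['East', 'West', 'East', 'West', 'East', 'West'],
    'All Star %': [0.90, 0.80, 0.40, 0.95, 0.70, 0.10],
})

top_n = 2  # would be 12 for a full All Star roster

# Sort by probability, then keep the first top_n rows within each conference.
top = (
    preds.sort_values('All Star %', ascending=False)
         .groupby('Conference')
         .head(top_n)
)
print(top)
```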



WEST: 

1. LeBron James
2. Karl-Anthony Towns
3. Anthony Davis
4. Luka Doncic
5. Paul George
6. Nikola Jokic
7. Russell Westbrook
8. Ja Morant
9. Stephen Curry
10. Donovan Mitchell
11. Devin Booker
12. Chris Paul

EAST: 

1. Kevin Durant
2. Joel Embiid
3. Jimmy Butler
4. James Harden
5. Trae Young
6. DeMar DeRozan
7. Zach LaVine
8. Jayson Tatum
9. Domantas Sabonis
10. Bradley Beal
11. Darius Garland
12. Jaylen Brown

## Conclusion:

We managed to make a model that predicts with about 93% accuracy on held-out data, so I would say that it is pretty successful. However, I do think there are several ways in which we can improve. As a good portion of All Stars are selected by fan vote, I think that a player's popularity would be a good metric to take into account. One way we could do this is by scraping the players' social media follower counts. 

Also, the model currently compares all the players in the NBA together, when it should really compare players only to those in the same conference. This can make a big difference in a year where the players in one conference are significantly better than those in the other. 
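
One possible way to start on the within-conference idea is to rank each player against others in the same conference and use that rank as a feature. This is only a sketch of the suggestion, not something the model above does; the toy frame, the placeholder player names, and the `PTS rank` column are hypothetical.

```python
import pandas as pd

# Toy frame with made-up scoring averages.
players = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D', 'E'],
    'Conference': ['East', 'West', 'East', 'West', 'West'],
    'PTS': [28.0, 30.0, 20.0, 25.0, 25.0],
})

# Rank scoring within each conference; 1 is the highest scorer, ties share the lower rank.
players['PTS rank'] = (
    players.groupby('Conference')['PTS'].rank(ascending=False, method='min')
)
print(players)
```

With several seasons of data, the grouping would presumably need to include the year as well as the conference.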

I hope that you learned a lot about Data Science through this walkthrough and saw how applicable it can be!
