## What Does It Take To Be an NBA All-Star? 
## A Data Science Analysis
## Samuel Frankel

Since 1951, the NBA All-Star Game has been held nearly every year. Twenty-four of the best players in the NBA are selected, by a mix of fan voting and head coaches, to compete in this prestigious game. (The final count can exceed 24 when injured players are replaced.) Each year, as the teams are announced, there is an abundance of discussion about the choices. Why did one player make it over another? Why was a particular player chosen? I think that if we look at the analytics behind which players are selected and which are left out, we can better understand the process and predict who will make it before the teams are even announced.

The goal of this project is to build a model that predicts which players will be selected to the All-Star team. We will collect player statistics from 11 seasons (2009-10 through 2019-20), record which players made the All-Star team in each of those seasons, and use this to train a tree-based classifier (the code below uses a random forest, which is an ensemble of decision trees). The model estimates how likely a player is to be selected based on their statistics. We will analyze our results throughout the project to see where we can improve and make the model as accurate as possible.

## Step 1: Getting the data

In [2]:
import pandas as pd
import re
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression


In [3]:
# Drop players who averaged fewer than 10 points per game
def drop_under_10(data):
    # Boolean mask instead of row-by-row drops; rows with a missing PTS value are kept
    return data[~(data['PTS'] < 10.0)]

# Load one per-game stats file per season (NBA10.csv through NBA20.csv),
# tag each with its season-ending year, and stack them into one frame.
frames = []
for year in range(2010, 2021):
    season = pd.read_csv(f'NBA{year % 100}.csv', sep=',')
    season['Year'] = year
    frames.append(drop_under_10(season))

data = pd.concat(frames).reset_index(drop=True)

In [4]:
#https://basketball.realgm.com/nba/allstar/game/rosters/2011
# Load each All-Star roster exactly once (AllStars10.csv through AllStars20.csv)
frames = []
for year in range(2010, 2021):
    roster = pd.read_csv(f'AllStars{year % 100}.csv', sep=',')
    roster['Year'] = year
    frames.append(roster)

allstars = pd.concat(frames)


In [5]:
data['All Star'] = 0

def was_allstar(name, year):
    # Substring match, because 'Player' in `data` may still carry the '\id' suffix here
    for i, r in allstars.iterrows():
        if r['Player'] in name and year == r['Year']:
            return True
    return False

for i, r in data.iterrows():
    if was_allstar(r['Player'], r['Year']):
        data.at[i, 'All Star'] = 1
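
The lookup above scans the whole roster table once per player-season. As a minimal sketch of an alternative, the roster can be turned into a set of (player, year) pairs so each check is a single membership test. This assumes the names have already been cleaned so that they match exactly, which is not the case at this point in the notebook; the frames `stats` and `rosters` are hypothetical stand-ins for `data` and `allstars`.

```python
import pandas as pd

# Hypothetical toy frames standing in for `data` and `allstars`.
stats = pd.DataFrame({
    'Player': ['Carmelo Anthony', 'Trevor Ariza', 'Trae Young'],
    'Year': [2010, 2010, 2020],
})
rosters = pd.DataFrame({
    'Player': ['Carmelo Anthony', 'Trae Young'],
    'Year': [2010, 2020],
})

# Build the set of (player, year) pairs once, then test membership per row.
selected = set(zip(rosters['Player'], rosters['Year']))
stats['All Star'] = [
    int((player, year) in selected)
    for player, year in zip(stats['Player'], stats['Year'])
]
print(stats)
```

Exact matching is stricter than the substring test, so it would only be a drop-in replacement after the name-cleaning step has run.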


In [6]:
# Clean the names: keep only the text before the '\id' suffix
for i, r in data.iterrows():
    if re.search(r'\\[a-z]', data.at[i, 'Player']):
        name = re.split(r'\\[a-z]', data.at[i, 'Player'])
        data.at[i, 'Player'] = name[0]
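
To see what this cleaning step does to a single value, here is a small standalone sketch. The helper name `strip_player_id` is hypothetical and not part of the notebook; it applies the same regular expression to one string at a time.

```python
import re

def strip_player_id(raw_name):
    """Return the display name, dropping a trailing backslash-plus-id suffix if present."""
    # Split once at the first backslash that is followed by a lowercase letter.
    return re.split(r'\\[a-z]', raw_name, maxsplit=1)[0]

print(strip_player_id('Precious Achiuwa\\achiupr01'))  # Precious Achiuwa
print(strip_player_id('Trae Young'))                   # Trae Young (unchanged)
```

With pandas, the same pattern could probably be applied to the whole column at once through the `.str` accessor, but the per-string version makes the behavior easier to check.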

In [7]:
data.to_csv('PlayerStats.csv', sep='\t')

In [8]:
data = data.drop(columns=['Pos', 'Rk', 'G', 'GS', 'MP', 'FG', 'FGA', '3P', '3P%', '2P', '2PA', '2P%', '3PA', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'PF'])

In [9]:
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

In [10]:
model = RandomForestClassifier()
inputs = data.drop(columns=['Player','Tm','Year','All Star'])
target = data['All Star']
X_train, X_test, y_train, y_test = train_test_split(inputs, target, random_state=1, test_size= 0.3)
model.fit(X_train,y_train)


RandomForestClassifier()

In [11]:
y_pred = model.predict(X_test)

In [12]:
metrics.accuracy_score(y_test, y_pred)

0.9142394822006472
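
Accuracy on its own can be flattering here, because only a couple of dozen players are selected each year and most rows are labeled 0. The sketch below uses made-up labels (not the real test set) to show how accuracy compares with precision, recall, and the score of a model that always predicts "not selected".

```python
# Hypothetical labels: 1 = All-Star, 0 = not selected (not taken from the real test set).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # All-Stars correctly predicted
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted All-Star, was not
fn = sum(t == 1 and p == 0 for t, p in pairs)  # missed All-Stars
tn = sum(t == 0 and p == 0 for t, p in pairs)  # correctly predicted non-All-Stars

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
baseline = y_true.count(0) / len(y_true)  # always predicting "not selected"

print(accuracy, precision, recall, baseline)  # 0.8 0.5 0.5 0.8
```

In this toy case the model's accuracy is no better than the baseline even though it looks high, so it may be worth checking `metrics.precision_score`, `metrics.recall_score`, or `metrics.confusion_matrix` alongside `metrics.accuracy_score` on the real test set.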

In [13]:
nba21 = pd.read_csv('NBA21.csv', sep=',')
nba21['All Star %'] = 0.0
nba21.head()


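|  | Rk | Player | Pos | Age | Tm | FG% | 3P% | TRB | AST | STL | BLK | TOV | PTS | All Star % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Precious Achiuwa\achiupr01 | C | 22 | TOR | 0.386 | 0.269 | 8.2 | 1.6 | 0.5 | 0.6 | 1.1 | 8.0 | 0.0 |
| 1 | 2 | Steven Adams\adamsst01 | C | 28 | MEM | 0.535 | NaN | 8.8 | 2.6 | 0.9 | 0.6 | 1.8 | 7.0 | 0.0 |
| 2 | 3 | Bam Adebayo\adebaba01 | C | 24 | MIA | 0.519 | 0.0 | 10.2 | 3.2 | 1.1 | 0.3 | 2.9 | 18.7 | 0.0 |
| 3 | 4 | Santi Aldama\aldamsa01 | PF | 21 | MEM | 0.364 | 0.111 | 2.6 | 0.8 | 0.1 | 0.2 | 0.3 | 3.6 | 0.0 |
| 4 | 5 | LaMarcus Aldridge\aldrila01 | C | 36 | BRK | 0.573 | 0.367 | 5.7 | 0.9 | 0.4 | 1.2 | 0.8 | 14.0 | 0.0 |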


In [14]:
nba21 = nba21.dropna()
inputs = nba21.drop(columns=['Pos','Rk', 'Player', 'Tm', '3P%', 'All Star %'])
inputs = inputs.dropna()

In [15]:
probabilities = model.predict_proba(inputs)
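
`predict_proba` returns one column per class, ordered as in `model.classes_`, so the All-Star probability is the column that corresponds to class 1. The sketch below illustrates this on a tiny, hypothetical one-feature dataset (points per game only); the numbers are invented and the fitted classifier `clf` is not the notebook's model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical one-feature data: points per game -> selected (1) or not (0).
X = np.array([[10.0], [12.0], [14.0], [25.0], [27.0], [30.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

proba = clf.predict_proba(np.array([[11.0], [28.0]]))
# Columns follow clf.classes_, so look up the column for class 1 explicitly.
pos_col = list(clf.classes_).index(1)
all_star_prob = proba[:, pos_col]

print(clf.classes_)   # class labels, in column order
print(all_star_prob)  # one probability per input row
```

Each probability is the average of the per-tree predictions, which is why the values in the final table come out in small increments.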


In [16]:
nba21 = nba21.dropna()
nba21 = nba21.reset_index()
# Column 1 of predict_proba is the probability of class 1 (All-Star)
nba21['All Star %'] = probabilities[:, 1]


In [17]:
east = ['PHI', 'BRK', 'MIL', 'NYK', 'ATL', 'MIA', 'BOS', 'WAS','IND', 'CHA', 'CHI', 'TOR', 'CLE', 'ORL', 'DET']
# Any team code not listed in `east` is labeled 'West'
nba21['Conference'] = nba21['Tm'].isin(east).map({True: 'East', False: 'West'})

In [18]:
nba21 = nba21.sort_values(by=['Conference','All Star %'], ascending= False)
nba21.to_csv('AllStarPredictions.csv', sep='\t')
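
Once the players are ranked by probability within each conference, a predicted roster is just the top few rows per group. The sketch below shows one way to take them, assuming 12 roster spots per conference for a 24-player game; the frame `preds`, the player labels, and the probabilities are all made up for illustration.

```python
import pandas as pd

# Hypothetical predictions; the probabilities are made up for illustration.
preds = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Conference': ['East', 'East', 'East', 'West', 'West', 'West'],
    'All Star %': [0.62, 0.16, 0.80, 0.05, 0.91, 0.40],
})

ROSTER_SIZE = 2  # would be 12 per conference under the assumption above

picks = (
    preds.sort_values('All Star %', ascending=False)
         .groupby('Conference')
         .head(ROSTER_SIZE)  # keep the highest-ranked rows in each conference
         .sort_values(['Conference', 'All Star %'], ascending=[True, False])
)
print(picks)
```

This ignores positional slots and injury replacements, so it is only a first approximation of how a roster would be filled.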

## Adding advanced stats to our model 


In [118]:
# Drop players who appeared in fewer than 50 games
def drop_under_50(data):
    return data[~(data['G'] < 50)]

In [119]:
advanced = pd.read_csv('Advanced10.csv')
advanced['Year'] = 2010
advanced = drop_under_50(advanced)
advanced = advanced[['Player', 'PER', 'TS%', 'WS', 'BPM', 'Year']]

In [120]:
# Load the remaining seasons (Advanced11.csv through Advanced20.csv)
# and combine the dataframes
frames = [advanced]
for year in range(2011, 2021):
    season = pd.read_csv(f'Advanced{year % 100}.csv')
    season['Year'] = year
    season = drop_under_50(season)
    frames.append(season[['Player', 'PER', 'TS%', 'WS', 'BPM', 'Year']])
advanced = pd.concat(frames)

advanced = advanced.reset_index(drop=True)
# Clean the names: keep only the text before the '\id' suffix
for i, r in advanced.iterrows():
    if re.search(r'\\[a-z]', advanced.at[i, 'Player']):
        name = re.split(r'\\[a-z]', advanced.at[i, 'Player'])
        advanced.at[i, 'Player'] = name[0]
       


In [121]:
data['PER'] = 0.0
data['TS%'] = 0.0
data['WS'] = 0.0
data['BPM'] = 0.0
data

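|  | Player | Age | Tm | FG% | TRB | AST | STL | BLK | TOV | PTS | Year | All Star | PER | TS% | WS | BPM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LaMarcus Aldridge | 24 | POR | 0.495 | 8.0 | 2.1 | 0.9 | 0.6 | 1.3 | 17.9 | 2010 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Ray Allen* | 34 | BOS | 0.477 | 3.2 | 2.6 | 0.8 | 0.3 | 1.6 | 16.3 | 2010 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Carmelo Anthony | 25 | DEN | 0.458 | 6.6 | 3.2 | 1.3 | 0.4 | 3.0 | 28.2 | 2010 | 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Gilbert Arenas | 28 | WAS | 0.411 | 4.2 | 7.2 | 1.3 | 0.3 | 3.7 | 22.6 | 2010 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Trevor Ariza | 24 | HOU | 0.394 | 5.6 | 3.8 | 1.8 | 0.6 | 2.2 | 14.9 | 2010 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2052 | Justise Winslow | 23 | MIA | 0.388 | 6.6 | 4.0 | 0.6 | 0.5 | 2.2 | 11.3 | 2020 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2053 | Christian Wood | 24 | DET | 0.567 | 6.3 | 1.0 | 0.5 | 0.9 | 1.4 | 13.1 | 2020 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2054 | Thaddeus Young | 31 | CHI | 0.448 | 4.9 | 1.8 | 1.4 | 0.4 | 1.6 | 10.3 | 2020 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2055 | Trae Young | 21 | ATL | 0.437 | 4.3 | 9.3 | 1.1 | 0.1 | 4.8 | 29.6 | 2020 | 1 | 0.0 | 0.0 | 0.0 | 0.0 |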


In [185]:
def get_advanced(player, year):
    stats = advanced.query('Player == @player and Year == @year')
    if (not stats.empty):
        stats = stats.reset_index()
        return (stats.at[0,'PER'], stats.at[0,'TS%'], stats.at[0, 'WS'], stats.at[0, 'BPM'])
    else:
        return ()



In [191]:
for i,r in data.iterrows():
    ad = get_advanced(r['Player'], r['Year'])
    if len(ad) != 0:
        data.at[i, 'PER'] = ad[0]
        data.at[i, 'TS%'] = ad[1]
        data.at[i, 'WS'] = ad[2]
        data.at[i, 'BPM'] = ad[3]
data.head()

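|  | Player | Age | Tm | FG% | TRB | AST | STL | BLK | TOV | PTS | Year | All Star | PER | TS% | WS | BPM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LaMarcus Aldridge | 24 | POR | 0.495 | 8.0 | 2.1 | 0.9 | 0.6 | 1.3 | 17.9 | 2010 | 0 | 18.2 | 0.535 | 8.8 | 1.2 |
| 1 | Ray Allen* | 34 | BOS | 0.477 | 3.2 | 2.6 | 0.8 | 0.3 | 1.6 | 16.3 | 2010 | 0 | 15.2 | 0.601 | 7.9 | 1.2 |
| 2 | Carmelo Anthony | 25 | DEN | 0.458 | 6.6 | 3.2 | 1.3 | 0.4 | 3.0 | 28.2 | 2010 | 1 | 22.2 | 0.548 | 7.9 | 2.3 |
| 3 | Gilbert Arenas | 28 | WAS | 0.411 | 4.2 | 7.2 | 1.3 | 0.3 | 3.7 | 22.6 | 2010 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Trevor Ariza | 24 | HOU | 0.394 | 5.6 | 3.8 | 1.8 | 0.6 | 2.2 | 14.9 | 2010 | 0 | 13.3 | 0.488 | 3.2 | 0.5 |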
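
In the output above, Gilbert Arenas keeps 0.0 in all four advanced columns because no matching row was found in `advanced`, and 0.0 is indistinguishable from a real value (a BPM of 0.0 is plausible). As a hedged alternative to the row-by-row lookup, a left join on `Player` and `Year` leaves unmatched rows as NaN and can flag them. The frames `basic` and `adv` below are hypothetical stand-ins for `data` and `advanced`, filled with values from the table.

```python
import pandas as pd

# Hypothetical toy frames standing in for `data` and `advanced`.
basic = pd.DataFrame({
    'Player': ['LaMarcus Aldridge', 'Gilbert Arenas'],
    'Year': [2010, 2010],
    'PTS': [17.9, 22.6],
})
adv = pd.DataFrame({
    'Player': ['LaMarcus Aldridge'],
    'Year': [2010],
    'PER': [18.2], 'TS%': [0.535], 'WS': [8.8], 'BPM': [1.2],
})

# A left join keeps every row of `basic`; unmatched players get NaN, not 0.0.
merged = basic.merge(adv, on=['Player', 'Year'], how='left', indicator=True)
missing = merged.loc[merged['_merge'] == 'left_only', 'Player'].tolist()

print(merged)
print(missing)  # ['Gilbert Arenas']
```

Rows with missing advanced stats would then need to be dropped or imputed before fitting, which makes the choice explicit instead of silently training on zeros.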


In [193]:
model = RandomForestClassifier()
inputs = data.drop(columns=['Player','Tm','FG%','Year','All Star'])
target = data['All Star']
X_train, X_test, y_train, y_test = train_test_split(inputs, target, random_state=1, test_size= 0.3)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
metrics.accuracy_score(y_test, y_pred)

0.9255663430420712

In [197]:
nba21 = pd.read_csv('NBA21.csv', sep=',')
nba21 = nba21.dropna()
nba21 = nba21.reset_index()
# As written, `probabilities` comes from the earlier model.predict_proba(inputs)
# call on the first model; NBA21.csv has no advanced stats, so the retrained
# model is not used for these predictions.
nba21['All Star %'] = probabilities[:, 1]
east = ['PHI', 'BRK', 'MIL', 'NYK', 'ATL', 'MIA', 'BOS', 'WAS','IND', 'CHA', 'CHI', 'TOR', 'CLE', 'ORL', 'DET']
# Any team code not listed in `east` is labeled 'West'
nba21['Conference'] = nba21['Tm'].isin(east).map({True: 'East', False: 'West'})
nba21 = nba21.sort_values(by=['Conference','All Star %'], ascending=False)
nba21.to_csv('AllStarPredictions.csv', sep='\t')


|  | index | Rk | Player | Pos | Age | Tm | FG% | 3P% | TRB | AST | STL | BLK | TOV | PTS | All Star % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Precious Achiuwa\achiupr01 | C | 22 | TOR | 0.386 | 0.269 | 8.2 | 1.6 | 0.5 | 0.6 | 1.1 | 8.0 | 0.01 |
| 1 | 2 | 3 | Bam Adebayo\adebaba01 | C | 24 | MIA | 0.519 | 0.000 | 10.2 | 3.2 | 1.1 | 0.3 | 2.9 | 18.7 | 0.16 |
| 2 | 3 | 4 | Santi Aldama\aldamsa01 | PF | 21 | MEM | 0.364 | 0.111 | 2.6 | 0.8 | 0.1 | 0.2 | 0.3 | 3.6 | 0.03 |
| 3 | 4 | 5 | LaMarcus Aldridge\aldrila01 | C | 36 | BRK | 0.573 | 0.367 | 5.7 | 0.9 | 0.4 | 1.2 | 0.8 | 14.0 | 0.11 |
| 4 | 5 | 6 | Nickeil Alexander-Walker\alexani01 | SG | 23 | NOP | 0.366 | 0.310 | 3.9 | 2.6 | 1.0 | 0.4 | 1.5 | 13.5 | 0.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 437 | 477 | 478 | McKinley Wright IV\wrighmc01 | PG | 23 | MIN | 1.000 | 1.000 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 | 0.16 |
| 438 | 478 | 479 | Thaddeus Young\youngth01 | PF | 33 | SAS | 0.566 | 0.000 | 3.4 | 2.5 | 0.9 | 0.3 | 1.0 | 6.2 | 0.04 |
| 439 | 479 | 480 | Trae Young\youngtr01 | PG | 23 | ATL | 0.461 | 0.383 | 4.1 | 9.3 | 1.0 | 0.1 | 4.1 | 27.0 | 0.62 |
| 440 | 480 | 481 | Omer Yurtseven\yurtsom01 | C | 23 | MIA | 0.521 | 0.000 | 2.1 | 0.5 | 0.1 | 0.5 | 0.4 | 3.1 | 0.07 |
