## Machine Learning Predictions for the 2019-2020 Season
All-NBA team

Every NBA season a panel of sportswriters and broadcasters throughout the United States and Canada vote on the All-NBA teams. The All-NBA teams acknowledge the best players in the NBA by position. There are 3 ranked teams (1st, 2nd, 3rd) each consisting of 2 guards, 2 forwards, and 1 center, combining for a total of 15 All-NBA players. Not only is it a huge honor to be selected to an All-NBA team, but it is also important for contract negotiations as players can only be paid certain amounts if they achieve All-NBA accolades for a certain number of years.

**You must use pip to install basketball_reference_web_scraper otherwise none of this will work. Install it using the following command: pip install basketball_reference_web_scraper** 

Note: I originally had an issue installing it which was solved by running the command like this:
pip install basketball_reference_web_scraper --ignore-installed

**Please run each cell in order!! The first cell takes a while to run since it is retrieving a lot of data. Just wait for it to finish**


First, to get the data:

I added in some calculated stats like field goal percentage and points.

Also, if a player was moved midseason, I combined their totals from each team.

In [3]:
from basketball_reference_web_scraper import client
import collections
import copy
import pandas as pd

statsToAdd = ['games_played', 'games_started', 'minutes_played', 'made_field_goals', 'attempted_field_goals', 'made_three_point_field_goals', 'attempted_three_point_field_goals', 'made_free_throws', 'attempted_free_throws', 'offensive_rebounds', 'defensive_rebounds', 'assists', 'steals', 'blocks', 'turnovers', 'personal_fouls']
statDict = {}
for year in range(2001, 2021):
    yearPlayers = client.players_season_totals(season_end_year=year)
    playerIdList = [player['slug'] for player in yearPlayers]
    tradedPlayersIdList = [playerId for playerId, count in collections.Counter(playerIdList).items() if count > 1]
    for tradedPlayerId in tradedPlayersIdList:
        playerStatsList = []
        toDelete = []
        totalPlayerStats = {'slug': '', 'name': '', 'positions': [], 'age': 0, 'team': 0, 'games_played': 0, 'games_started': 0, 'minutes_played': 0, 'made_field_goals': 0, 'attempted_field_goals': 0, 'made_three_point_field_goals': 0, 'attempted_three_point_field_goals': 0, 'made_free_throws': 0, 'attempted_free_throws': 0, 'offensive_rebounds': 0, 'defensive_rebounds': 0, 'assists': 0, 'steals': 0, 'blocks': 0, 'turnovers': 0, 'personal_fouls': 0}
        for idx, player in enumerate(yearPlayers):
            if tradedPlayerId == player['slug']:
                playerStatsList.append(player)
                toDelete.append(idx)
        for i in toDelete:
            yearPlayers.pop(toDelete[0])
        for playerTeamStats in playerStatsList:
            if totalPlayerStats['slug'] == '':
                totalPlayerStats['slug'] = playerTeamStats['slug']
            if totalPlayerStats['name'] == '':
                totalPlayerStats['name'] = playerTeamStats['name']
            if totalPlayerStats['positions'] == []:
                totalPlayerStats['positions'] = playerTeamStats['positions']
            if totalPlayerStats['age'] == 0:
                totalPlayerStats['age'] = playerTeamStats['age']
            if totalPlayerStats['team'] == 0:
                totalPlayerStats['team'] = playerTeamStats['team']
            for statName in statsToAdd:
                totalPlayerStats[statName] = totalPlayerStats[statName] + playerTeamStats[statName]
        yearPlayers.append(totalPlayerStats)
        
        for player in yearPlayers:
            #center = 1, forward = 2, guard = 3
            positionstring = str(player['positions'])
            if 'CENTER' in positionstring: # 8, -1
                 player['positions'] = 1
            elif 'FORWARD' in positionstring:
                player['positions'] = 2
            elif 'GUARD' in positionstring:
                player['positions'] = 3
                
            if(player['attempted_field_goals'] != 0):
                player['effective_field_goal_percentage'] = (player['made_field_goals'] + (.5 * player['made_three_point_field_goals'])) / player['attempted_field_goals']
            else:
                player['effective_field_goal_percentage'] = 0
            player['total_points'] = (player['made_three_point_field_goals'] * 3) + ((player['made_field_goals'] - player['made_three_point_field_goals']) * 2) + (player['made_free_throws']) #not necessary?? 
            if(player['attempted_free_throws'] != 0):
                player['free_throw_percentage'] = player['made_free_throws'] / player['attempted_free_throws']
            else:
                player['free_throw_percentage'] = 0
    statDict[year] = yearPlayers

In [4]:
for year in range(2001, 2021):
    for game in client.season_schedule(season_end_year=year):
        for player in statDict[year]:
            if "team_games_played" in player and player["team_games_played"] >= 82:
                break
            if player["team"] == game["home_team"] or player["team"] == game["away_team"]:
                if "team_games_played" in player:
                    player["team_games_played"] += 1
                else:
                    player["team_games_played"] = 1
            if game["home_team_score"] > game["away_team_score"]:
                if player["team"] == game["home_team"]:
                    if "wins" in player:
                        player["wins"] += 1
                    else:
                        player["wins"] = 1
            else:
                if player["team"] == game["away_team"]:
                    if "wins" in player:
                        player["wins"] += 1
                    else:
                        player["wins"] = 1

In [5]:
currentStatDict = copy.deepcopy(statDict[2020])

totalgames = 82
for player in currentStatDict:
    rescalefact = totalgames/player['team_games_played']
    
    #print(rescalefact)
    for key, value in player.items():
        if key != 'age' and key != 'positions' and isinstance(value, int) == True:                
            value = round(value * rescalefact)
            player[key] = value
        

In [6]:
statDict.pop(2020) #keep in statDict
print(currentStatDict)

[{'slug': 'adamsst01', 'name': 'Steven Adams', 'positions': 1, 'age': 26, 'team': <Team.OKLAHOMA_CITY_THUNDER: 'OKLAHOMA CITY THUNDER'>, 'games_played': 74, 'games_started': 74, 'minutes_played': 2004, 'made_field_goals': 336, 'attempted_field_goals': 568, 'made_three_point_field_goals': 1, 'attempted_three_point_field_goals': 4, 'made_free_throws': 138, 'attempted_free_throws': 234, 'offensive_rebounds': 251, 'defensive_rebounds': 445, 'assists': 181, 'steals': 64, 'blocks': 83, 'turnovers': 110, 'personal_fouls': 142, 'points': 811, 'effective_field_goal_percentage': 0.59255079006772, 'total_points': 811, 'free_throw_percentage': 0.5901639344262295, 'team_games_played': 82, 'wins': 51}, {'slug': 'adebaba01', 'name': 'Bam Adebayo', 'positions': 2, 'age': 22, 'team': <Team.MIAMI_HEAT: 'MIAMI HEAT'>, 'games_played': 82, 'games_started': 82, 'minutes_played': 2820, 'made_field_goals': 515, 'attempted_field_goals': 907, 'made_three_point_field_goals': 1, 'attempted_three_point_field_goals

Using beautiful soup to get a nice clean list of the all-nba players since 2000.

In [7]:
from bs4 import BeautifulSoup
import urllib.request
import collections
import re
import bs4
import lxml

url = 'https://www.basketball-reference.com/awards/all_league.html'

soup = BeautifulSoup()

textList = []
allNbaTeamDict = {}
with urllib.request.urlopen(url) as ef:
    soup = BeautifulSoup(ef)
    textList = soup.find('table').get_text().splitlines()[15:]

    for line in textList:
        year = line[:7]
        formattedYear = year[:2] + year[5:]
        if formattedYear == '1900':
            formattedYear = '2001'
        if year == '1999-00':
            break
        if formattedYear not in allNbaTeamDict:
            allNbaTeamDict[formattedYear] = []
        wordList = line[13:].split()
        if len(wordList) == 0:
            continue
        playerTuple = (wordList[2][0],"%s %s" % (wordList[0], wordList[1]))
        allNbaTeamDict[formattedYear].append(playerTuple)
        playerTuple = (wordList[4][0],"%s %s" % (wordList[2][1:], wordList[3]))
        allNbaTeamDict[formattedYear].append(playerTuple)
        playerTuple = (wordList[6][0],"%s %s" % (wordList[4][1:], wordList[5]))
        allNbaTeamDict[formattedYear].append(playerTuple)
        playerTuple = (wordList[8][0],"%s %s" % (wordList[6][1:], wordList[7]))
        allNbaTeamDict[formattedYear].append(playerTuple)
        playerTuple = (wordList[10][0],"%s %s" % (wordList[8][1:], wordList[9]))
        allNbaTeamDict[formattedYear].append(playerTuple)
    allNbaTeamDict.pop('')
print(allNbaTeamDict)

{'2019': [('C', 'Nikola Jokić'), ('F', 'Giannis Antetokounmpo'), ('F', 'Paul George'), ('G', 'James Harden'), ('G', 'Stephen Curry'), ('C', 'Joel Embiid'), ('F', 'Kevin Durant'), ('F', 'Kawhi Leonard'), ('G', 'Damian Lillard'), ('G', 'Kyrie Irving'), ('C', 'Rudy Gobert'), ('F', 'LeBron James'), ('F', 'Blake Griffin'), ('G', 'Kemba Walker'), ('G', 'Russell Westbrook')], '2018': [('C', 'Anthony Davis'), ('F', 'LeBron James'), ('F', 'Kevin Durant'), ('G', 'Damian Lillard'), ('G', 'James Harden'), ('C', 'Joel Embiid'), ('F', 'Giannis Antetokounmpo'), ('F', 'LaMarcus Aldridge'), ('G', 'Russell Westbrook'), ('G', 'DeMar DeRozan'), ('C', 'Karl-Anthony Towns'), ('F', 'Jimmy Butler'), ('F', 'Paul George'), ('G', 'Stephen Curry'), ('G', 'Victor Oladipo')], '2017': [('C', 'Anthony Davis'), ('F', 'LeBron James'), ('F', 'Kawhi Leonard'), ('G', 'James Harden'), ('G', 'Russell Westbrook'), ('C', 'Rudy Gobert'), ('F', 'Giannis Antetokounmpo'), ('F', 'Kevin Durant'), ('G', 'Stephen Curry'), ('G', 'Isai

Adding the all_nba_type back into our stat dictionary:

In [8]:
relevantCenterData = []
relevantForwardData = []
relevantGuardData = []

for year in allNbaTeamDict:
    for position, playerName in allNbaTeamDict[str(year)]:
        if position == 'C':
            for player in statDict[int(year)]:
                if player['name'] == playerName:
                    relevantCenterData.append(player)
        elif position == 'F':
            for player in statDict[int(year)]:
                if player['name'] == playerName:
                    relevantForwardData.append(player)
        elif position == 'G':
            for player in statDict[int(year)]:
                if player['name'] == playerName:
                    relevantGuardData.append(player)
#data for all NBA team players by position 

In [9]:
# 1 = All-NBA center, 2 = All-NBA forward, 3 = All-NBA guard, 0 = regular player
for playerSeason in relevantCenterData:
    for year in statDict:
        for player in statDict[int(year)]:
            if player == playerSeason:
                player['all_nba_type'] = 1 #if player in statdict is in allNbaTeamDict, update all_nba_type value
for playerSeason in relevantForwardData:
    for year in statDict:
        for player in statDict[int(year)]:
            if player == playerSeason:
                player['all_nba_type'] = 2
for playerSeason in relevantGuardData:
    for year in statDict:
        for player in statDict[int(year)]:
            if player == playerSeason:
                player['all_nba_type'] = 3
flattenedStats = []
for year in statDict:
        for player in statDict[int(year)]:
            if 'all_nba_type' not in player: #iterate through all players in stat dict, if they have no allnbatype make 0 
                player['all_nba_type'] = 0
            flattenedStats.append(player) #add all players to flattenedstats

historicalDf = pd.DataFrame.from_dict(flattenedStats) #df of all players with All NBA team type
currentDf = pd.DataFrame.from_dict(currentStatDict) #df of current year to be looked at 
historicalDf.describe() 

Unnamed: 0,positions,age,games_played,games_started,minutes_played,made_field_goals,attempted_field_goals,made_three_point_field_goals,attempted_three_point_field_goals,made_free_throws,...,blocks,turnovers,personal_fouls,points,effective_field_goal_percentage,total_points,free_throw_percentage,team_games_played,wins,all_nba_type
count,8874.0,8874.0,8874.0,8874.0,8874.0,8874.0,8874.0,8874.0,8874.0,8874.0,...,8874.0,8874.0,8874.0,7806.0,8874.0,8874.0,8874.0,8874.0,8874.0,8874.0
mean,2.194388,26.638833,53.266734,25.878972,1251.586996,192.703403,425.635903,36.345842,102.00462,94.231801,...,25.366464,71.631846,108.820261,532.802075,0.473686,515.984449,0.700573,80.334009,39.920216,0.050034
std,0.755586,4.322034,25.006668,29.1426,903.247901,173.202731,372.633601,48.948826,129.303639,106.098018,...,33.870831,62.680901,73.057498,483.166013,0.09967,473.007812,0.194316,2.646965,12.392695,0.34331
min,1.0,18.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,66.0,7.0,0.0
25%,2.0,23.0,33.0,1.0,412.0,47.25,112.0,0.0,2.0,18.0,...,4.0,20.0,44.0,130.0,0.442446,123.0,0.645161,81.0,31.0,0.0
50%,2.0,26.0,61.0,11.0,1183.5,151.0,339.5,12.0,39.5,59.0,...,13.0,57.0,107.0,421.0,0.483312,399.0,0.75,81.0,41.0,0.0
75%,3.0,30.0,75.0,52.0,1984.0,294.0,647.0,60.0,171.0,133.0,...,32.0,106.0,164.0,813.0,0.519927,783.0,0.81536,81.0,49.0,0.0
max,3.0,44.0,85.0,83.0,3485.0,978.0,2173.0,402.0,1028.0,756.0,...,307.0,464.0,344.0,2832.0,1.5,2832.0,1.0,82.0,72.0,3.0


In [10]:
currentDf.describe()

Unnamed: 0,positions,age,games_played,games_started,minutes_played,made_field_goals,attempted_field_goals,made_three_point_field_goals,attempted_three_point_field_goals,made_free_throws,...,steals,blocks,turnovers,personal_fouls,points,effective_field_goal_percentage,total_points,free_throw_percentage,team_games_played,wins
count,514.0,514.0,514.0,514.0,514.0,514.0,514.0,514.0,514.0,514.0,...,514.0,514.0,514.0,514.0,460.0,514.0,514.0,514.0,514.0,514.0
mean,2.270428,25.470817,50.533074,23.931907,1156.811284,195.447471,425.019455,58.038911,162.406615,84.525292,...,36.671206,23.661479,66.437743,98.523346,543.465217,0.492854,533.468872,0.693541,82.0,40.723735
std,0.743275,4.051343,26.140568,28.844161,854.705684,180.31847,385.11299,64.817411,170.473872,103.823758,...,31.818854,29.592854,64.484921,69.886377,514.940605,0.132063,503.161285,0.234024,0.0,13.186972
min,1.0,19.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,82.0,19.0
25%,2.0,22.0,28.0,0.0,355.0,42.25,96.25,4.0,16.0,13.0,...,9.0,5.0,17.25,38.25,108.75,0.470695,113.5,0.645913,82.0,29.0
50%,2.0,25.0,58.0,8.0,1119.0,149.0,327.0,36.5,115.0,47.0,...,29.5,14.0,50.0,99.0,417.5,0.516992,400.5,0.755834,82.0,38.0
75%,3.0,28.0,73.0,49.0,1910.0,302.75,651.75,92.0,257.75,114.75,...,56.75,32.0,95.0,152.0,814.25,0.555749,805.25,0.830811,82.0,52.0
max,3.0,43.0,82.0,82.0,2871.0,786.0,1776.0,347.0,985.0,793.0,...,145.0,232.0,354.0,310.0,2686.0,0.777778,2686.0,1.0,82.0,67.0


Now I have all of the All-NBA players since year 1999-2000 season and their stats, along with whether they made an All-NBA team or not, and which team they made. Now to select relevant statistic:

In [11]:
#extract relevant stats of historical data
relevantStatsHistoricalDf = historicalDf[['wins', 'positions','free_throw_percentage', 'turnovers', 'games_played', 'games_started', 'minutes_played', 'made_field_goals','attempted_field_goals', 'made_three_point_field_goals', 'attempted_three_point_field_goals', 'made_free_throws', 'attempted_free_throws', 'assists', 'blocks', 'steals', 'total_points', 'offensive_rebounds', 'defensive_rebounds', 'effective_field_goal_percentage', 'all_nba_type']] 
relevantStatsHistoricalDf.head()
    

Unnamed: 0,wins,positions,free_throw_percentage,turnovers,games_played,games_started,minutes_played,made_field_goals,attempted_field_goals,made_three_point_field_goals,...,made_free_throws,attempted_free_throws,assists,blocks,steals,total_points,offensive_rebounds,defensive_rebounds,effective_field_goal_percentage,all_nba_type
0,23,3,0.758621,26,41,0,486,120,246,4,...,22,29,76,1,9,266,5,20,0.495935,0
1,40,3,0.583333,34,29,12,420,43,111,4,...,21,36,22,13,14,111,14,45,0.405405,0
2,22,2,0.834275,231,81,81,3241,604,1280,12,...,443,531,250,77,90,1663,175,560,0.476562,0
3,43,3,0.666667,25,26,0,227,18,56,4,...,12,18,36,0,16,52,0,25,0.357143,0
4,52,3,0.887755,204,82,82,3129,628,1309,202,...,348,392,374,20,124,1806,101,327,0.556914,0


In [12]:
#extract relevant stats of current data
relevantStatsCurrentDf = currentDf[['wins', 'positions','free_throw_percentage', 'turnovers', 'games_played', 'games_started', 'minutes_played', 'made_field_goals','attempted_field_goals', 'made_three_point_field_goals', 'attempted_three_point_field_goals', 'made_free_throws', 'attempted_free_throws', 'assists', 'blocks', 'steals', 'total_points', 'offensive_rebounds', 'defensive_rebounds', 'effective_field_goal_percentage']] 
relevantStatsCurrentDf.head()

Unnamed: 0,wins,positions,free_throw_percentage,turnovers,games_played,games_started,minutes_played,made_field_goals,attempted_field_goals,made_three_point_field_goals,attempted_three_point_field_goals,made_free_throws,attempted_free_throws,assists,blocks,steals,total_points,offensive_rebounds,defensive_rebounds,effective_field_goal_percentage
0,51,1,0.590164,110,74,74,2004,336,568,1,4,138,234,181,83,64,811,251,445,0.592551
1,52,2,0.690058,233,82,82,2820,515,907,1,16,298,431,420,107,98,1328,208,653,0.56815
2,35,1,0.827225,96,69,69,2283,509,1032,79,204,206,249,168,113,47,1303,134,376,0.531526
3,36,3,0.607143,51,53,0,642,99,291,51,150,22,36,95,9,14,270,10,92,0.427313
4,40,3,0.857143,29,38,0,628,100,222,42,115,38,44,54,1,8,279,6,77,0.542614


In [13]:
# help from here: http://ataspinar.com/2017/05/26/classification-with-scikit-learn/
# I used the code from this guide and applied it to my data
from sklearn.model_selection import train_test_split

y_col = 'all_nba_type' #to be removed from original Df and stored as targets
x_cols = list(relevantStatsHistoricalDf.columns.values)
x_cols.remove(y_col)

x = relevantStatsHistoricalDf[x_cols].values #all column values of df excluding allnbatype
y = relevantStatsHistoricalDf[y_col].values #corresponding targets

X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2)

print(X_train)
print(Y_train)

[[5.80000000e+01 1.00000000e+00 5.45454545e-01 ... 8.30000000e+01
  1.32000000e+02 5.82901554e-01]
 [4.70000000e+01 3.00000000e+00 1.00000000e+00 ... 2.00000000e+00
  2.00000000e+00 5.92592593e-01]
 [3.20000000e+01 1.00000000e+00 5.41322314e-01 ... 1.51000000e+02
  3.41000000e+02 5.03797468e-01]
 ...
 [4.90000000e+01 3.00000000e+00 7.74818402e-01 ... 7.70000000e+01
  2.55000000e+02 4.94857595e-01]
 [4.90000000e+01 3.00000000e+00 7.79620853e-01 ... 1.43000000e+02
  2.58000000e+02 4.28017241e-01]
 [4.40000000e+01 2.00000000e+00 7.59750390e-01 ... 1.56000000e+02
  5.97000000e+02 5.44579173e-01]]
[0 0 0 ... 0 0 2]


In [14]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [15]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(max_iter=800)
parameter_space = {
    'hidden_layer_sizes': [(50,50,50)],
    'activation': ['tanh', 'relu'],
    'solver': [ 'sgd', 'adam'],
    'alpha': [0.0001],
    'learning_rate_init': [.001],
    'learning_rate': ['constant','adaptive'],
}

from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(mlp, parameter_space, n_jobs=-1, cv=3)
clf.fit(X_train, Y_train)

print('Best parameters found:\n', clf.best_params_)

Best parameters found:
 {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'adaptive', 'learning_rate_init': 0.001, 'solver': 'adam'}


In [16]:
#prepare 2020 data
predictions = clf.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(Y_test,predictions))
print(classification_report(Y_test,predictions))

train_score = clf.score(X_train, Y_train)
test_score = clf.score(X_test, Y_test)
print("train_score: ", train_score)
print("test_score: ", test_score)

#prediction error vs number of units 

[[1711    8    6    4]
 [   4    5    1    0]
 [   5    1   10    0]
 [   4    0    0   16]]
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1729
           1       0.36      0.50      0.42        10
           2       0.59      0.62      0.61        16
           3       0.80      0.80      0.80        20

    accuracy                           0.98      1775
   macro avg       0.68      0.73      0.70      1775
weighted avg       0.98      0.98      0.98      1775

train_score:  0.9994365403577968
test_score:  0.9814084507042253


In [17]:
curr_cols = list(relevantStatsCurrentDf.columns.values)
curr_test = relevantStatsCurrentDf[x_cols].values
curr_test = scaler.transform(curr_test) #scale testing data

#split by position
#then use PCA or regression 

#compare to Nate Silver with regression 

print(curr_test)

allnbateam = clf.predict(curr_test)
print(allnbateam)

[[ 8.92175971e-01 -1.58171913e+00 -5.56016061e-01 ...  3.18000620e+00
   1.98016046e+00  1.18848777e+00]
 [ 9.72592904e-01 -2.59148565e-01 -4.81829744e-02 ...  2.47459245e+00
   3.42949219e+00  9.44783516e-01]
 [-3.94494955e-01 -1.58171913e+00  6.49130016e-01 ...  1.26062461e+00
   1.49937253e+00  5.78992576e-01]
 ...
 [-1.27908122e+00  1.06342200e+00 -1.01438172e+00 ... -9.37641488e-01
  -1.08573360e+00 -4.72969689e+00]
 [-1.27908122e+00 -2.59148565e-01  4.78822512e-02 ...  4.56781035e-01
   7.32899012e-01  2.99866780e-01]
 [-8.76996553e-01 -2.59148565e-01  8.68467119e-01 ... -5.27517217e-01
   1.26520299e-03  7.25400287e-01]]
[0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 2 0 0 0 0 3 0 0
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

In [18]:
#allnbateam
print(len(allnbateam))
print("Predicted team")
for player in range(len(allnbateam)):
    if allnbateam[player] == 1:
        print("center: ", currentDf.iloc[player]['name'])
    elif allnbateam[player] == 2:
        print("forward: ", currentDf.iloc[player]['name'])
    elif allnbateam[player] == 3: 
        print("guard: ", currentDf.iloc[player]['name'])
        


514
Predicted team
forward:  Giannis Antetokounmpo
forward:  Jimmy Butler
forward:  Anthony Davis
forward:  DeMar DeRozan
guard:  Luka Dončić
center:  Joel Embiid
center:  Rudy Gobert
guard:  James Harden
guard:  LeBron James
center:  Nikola Jokić
forward:  Kawhi Leonard
guard:  Damian Lillard
guard:  Kyle Lowry
forward:  Pascal Siakam
forward:  Jayson Tatum
guard:  Russell Westbrook
center:  Hassan Whiteside
guard:  Trae Young


In [25]:
import numpy as np

allprobs = clf.predict_proba(curr_test)
#add index as 5th column of matrix 
idx = range(0, allprobs.shape[0])
print(allprobs.shape)
idx = np.asarray(idx)
idx = np.reshape(idx, (514, 1))
allprobs = np.hstack((allprobs, idx))

allcenters = allprobs[allprobs[:,1].argsort()]
c = 1
while c < 5:
    c_index = int(allcenters[-c][4])
    print("center: ", currentDf.iloc[c_index]['name'])
    c+=1

allforwards = allprobs[allprobs[:,2].argsort()]
f = 1
while f < 8:
    f_index = int(allforwards[-f][4])
    print("forward: ", currentDf.iloc[f_index]['name'])
    f+=1

allguards = allprobs[allprobs[:,3].argsort()]
g = 1
while g < 7:
    g_index = int(allguards[-g][4])
    print("guard: ", currentDf.iloc[g_index]['name'])
    g+=1
    
#best prediction: 13 accuracy, some positional discrepancy

(514, 4)
center:  Rudy Gobert
center:  Nikola Jokić
center:  Joel Embiid
center:  Hassan Whiteside
forward:  Jimmy Butler
forward:  Kawhi Leonard
forward:  Giannis Antetokounmpo
forward:  Anthony Davis
forward:  Pascal Siakam
forward:  Jayson Tatum
forward:  DeMar DeRozan
guard:  LeBron James
guard:  Russell Westbrook
guard:  Luka Dončić
guard:  James Harden
guard:  Damian Lillard
guard:  Kyle Lowry


In [109]:
import seaborn as sns
import matplotlib.pyplot as plt
import time

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

dict_classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Nearest Neighbors": KNeighborsClassifier(),
    "Linear SVM": SVC(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(n_estimators=1000),
    "Decision Tree": tree.DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=1000),
    "Neural Net": MLPClassifier(alpha = 1),
    "Naive Bayes": GaussianNB(),
}

rfPredictions = []

def batch_classify(X_train, Y_train, X_test, Y_test, no_classifiers = 5, verbose = True):
    """
    This method, takes as input the X, Y matrices of the Train and Test set.
    And fits them on all of the Classifiers specified in the dict_classifier.
    The trained models, and accuracies are saved in a dictionary. The reason to use a dictionary
    is because it is very easy to save the whole dictionary with the pickle module.
    
    Usually, the SVM, Random Forest and Gradient Boosting Classifier take quiet some time to train. 
    So it is best to train them on a smaller dataset first and 
    decide whether you want to comment them out or not based on the test accuracy score.
    """
    
    dict_models = {}
    for classifier_name, classifier in list(dict_classifiers.items())[:no_classifiers]:
        t_start = time.clock()
        classifier.fit(X_train, Y_train)
        t_end = time.clock()
        
        t_diff = t_end - t_start
        train_score = classifier.score(X_train, Y_train)
        test_score = classifier.score(X_test, Y_test)
        
        dict_models[classifier_name] = {'model': classifier, 'train_score': train_score, 'test_score': test_score, 'train_time': t_diff}
        if verbose:
            print("trained {c} in {f:.2f} s".format(c=classifier_name, f=t_diff))
    return dict_models



def display_dict_models(dict_models, sort_by='test_score'):
    cls = [key for key in dict_models.keys()]
    test_s = [dict_models[key]['test_score'] for key in cls]
    training_s = [dict_models[key]['train_score'] for key in cls]
    training_t = [dict_models[key]['train_time'] for key in cls]
    
    df_ = pd.DataFrame(data=np.zeros(shape=(len(cls),4)), columns = ['classifier', 'train_score', 'test_score', 'train_time'])
    for ii in range(0,len(cls)):
        df_.loc[ii, 'classifier'] = cls[ii]
        df_.loc[ii, 'train_score'] = training_s[ii]
        df_.loc[ii, 'test_score'] = test_s[ii]
        df_.loc[ii, 'train_time'] = training_t[ii]
    
    display(df_.sort_values(by=sort_by, ascending=False))

dict_models = batch_classify(X_train, Y_train, X_test, Y_test, no_classifiers = 8)
display_dict_models(dict_models)



trained Logistic Regression in 1.14 s
trained Nearest Neighbors in 0.03 s
trained Linear SVM in 0.10 s




trained Gradient Boosting Classifier in 18.98 s
trained Decision Tree in 0.02 s




trained Random Forest in 4.26 s




trained Neural Net in 19.57 s
trained Naive Bayes in 0.02 s




Unnamed: 0,classifier,train_score,test_score,train_time
1,Nearest Neighbors,0.986517,0.985126,0.030147
5,Random Forest,1.0,0.983982,4.255132
2,Linear SVM,0.982287,0.982838,0.104653
0,Logistic Regression,0.979247,0.982265,1.14165
6,Neural Net,0.98004,0.979977,19.569856
3,Gradient Boosting Classifier,1.0,0.978833,18.977044
4,Decision Tree,1.0,0.9754,0.020159
7,Naive Bayes,0.906015,0.913043,0.016035


So we can see that the Random Forest classifier was very accurate! The question is, did it correctly predict the all_nba_team values? There are far fewer all nba players compared to total players so this accuracy could be meaningless, meaning it just usually predicted 0. Now to try to use this model on the current data and see who it predicts to be on the all-nba team.

In [110]:
rfPredictions = dict_models['Random Forest']['model'].predict(relevantStatsCurrentDf)
rfPredictions = rfPredictions.tolist()
print(len(rfPredictions))
currentDf['all_nba_type'] = rfPredictions
resultsDf = currentDf.loc[currentDf['all_nba_type'] > 0]
display(resultsDf)

514


Unnamed: 0,slug,name,positions,age,team,games_played,games_started,minutes_played,made_field_goals,attempted_field_goals,...,assists,steals,blocks,turnovers,personal_fouls,points,effective_field_goal_percentage,total_points,free_throw_percentage,all_nba_type
4,allengr01,Grayson Allen,[Position.SHOOTING_GUARD],24,Team.MEMPHIS_GRIZZLIES,47,0,780,124,276,...,67,9,2,36,56,346.0,0.543478,347,0.854545,3
8,anderju01,Justin Anderson,[Position.SMALL_FORWARD],26,Team.BROOKLYN_NETS,20,0,113,7,40,...,0,0,7,0,13,20.0,0.175,21,0.538462,2
10,anderry01,Ryan Anderson,[Position.POWER_FORWARD],31,Team.HOUSTON_ROCKETS,19,0,133,19,66,...,19,10,0,10,10,48.0,0.363636,48,0.0,3
17,augusdj01,D.J. Augustin,[Position.POINT_GUARD],32,Team.ORLANDO_MAGIC,66,8,1637,210,535,...,300,40,1,93,71,688.0,0.468224,688,0.886256,3
31,belinma01,Marco Belinelli,[Position.SHOOTING_GUARD],33,Team.SAN_ANTONIO_SPURS,69,0,1034,141,360,...,85,13,1,21,54,399.0,0.497222,400,0.777778,3
43,bonejo01,Jordan Bone,[Position.POINT_GUARD],22,Team.DETROIT_PISTONS,27,0,143,14,54,...,22,3,0,5,16,32.0,0.305556,33,0.0,3
68,burketr01,Trey Burke,[Position.POINT_GUARD],27,Team.PHILADELPHIA_76ERS,42,0,553,99,213,...,89,12,2,20,35,247.0,0.528169,247,0.733333,3
87,clevean01,Antonius Cleveland,[Position.SHOOTING_GUARD],25,Team.DALLAS_MAVERICKS,23,0,84,8,23,...,0,4,4,4,8,19.0,0.347826,20,0.5,2
93,cookqu01,Quinn Cook,[Position.POINT_GUARD],26,Team.LOS_ANGELES_LAKERS,55,0,592,110,250,...,58,16,1,36,33,265.0,0.512,265,0.692308,3
94,cookty01,Tyler Cook,[Position.POWER_FORWARD],22,Team.CLEVELAND_CAVALIERS,28,0,89,18,25,...,3,3,0,3,10,48.0,0.72,49,0.866667,2


We could make the All-NBA first and second teams using these predicitons! I'm inputting a little personal basketball knowledge so that the two lineups make sense

All-NBA first team:
C: Joel Embiid
F: Giannis Antetokounmpo 
F: Kevin Durant
G: Bradley Beal
G: James Harden

All-NBA second team:
C: Nikola Jokic
F: Karl-Anthony Towns(techinically a center but often plays as a PF)
F: Paul George
G: Kemba Walker
G: Damian Lillard

Leftover:
Rudy Gobert(who is a lock to make the All-NBA third team)
Tobias Harris(very unlikely to make an All-NBA team, but one of the most efficient scorers in the NBA, so not an unreasonable prediction)


Obviously these predictions are not perfect. But, even I'm surprised at how accurate they could turn out to be. I really do expect 8/11 given players to be named All-NBA players.

I think the glaring shortcoming is that players like Lebron and Steph were left off. I realize now that this comes down to me using total stats instead of games along with per game averages. Lebron and Steph simply missed too many games for their total stats to be in the range of players who are generally selected for All-NBA teams. 

Also, I ran this a couple of time and got the same results except one time it added Andre Drummond. So just warning you that it is possible the results come out ever so slighlty different than the results I described here. 

