In [1]:
from BBallDataHelper import *
import json
from urllib.parse import urlencode
from urllib.request import urlretrieve
import pandas as pd
import numpy as np
import seaborn as sns

%matplotlib inline
import matplotlib.pyplot as plt

import requests
from lxml import html
import re
import pickle
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

idx = pd.IndexSlice

## Scope

For this project we will need to assemble two main datasets.
This notebook will show process for doing this and conclude with saving two .pkl files to be used in the analysis notebook.

## Dataset 1. NBA game stats.

This dataset will contain player stats from 2014-2018, all retrieved from stats.nba using their API. These stats will be used to cluster players into a more modern view of positions, which we will attempt to predict using the next dataset. 

#### Categories of stats included:

Standard advanced -
Things like AST%, AST/TO ratio, OREB%, TS%. These are usually linear combinations of standard box score stats(AST,REB,FG,FGA,etc.) but also include these stats in comparison with their teams as a whole in the example of AST% or OREB%.

Per 100 Possesion - 
Regular box score statistics weighted per 100 possesion. 100 possesion chosen over per game or totals as a players volume should not affect how they are classified positionally. Per 36 minutes also may have been chosen but those are all affected by a teams pace. Perhaps different player types play in a faster pace system but if we wanted to focus on something like team style for player analysis we would incororate those team stats directly. This might be an avenue to pursue in the future but regardless per 100 possesion stats make the most sense for the scope of this project.


Defensive Player Tracking - 
hese are derived through NBA player tracking data. They tell how many shots a player defended (or more accurately, was the closest defender for) and their success rate on such. They are segmented by distance away from the basket and also give some measure of the expected FG% of these shots. Likely simply a weighted average of the FG% for those who took each shot.

Scoring - This breaks down field goals makes,attempt by percent from location, and whether or not assisted.

#### Other considersations/future inclusions:

BBall Reference advanced - Basketball reference has several other advanced stats such as BPM, Win Share and Vorp. These are generally linear combinations of already included statistics but might be useful nonetheless.

More tracking data - While no raw tracking data is available. the stats.nba API is very fast with data derived from it. This could be used to include more defense related data. This route also features playtype data. For example how many drives or isolation plays did a player have for game. It also includes information on dribbling, touch time, and closest defender. A lot of these might be helpful and I intend to examine their relation for future analysis.

Team stats - This was mentioned lightly above. Teams that play in high pace, high 3 point shooting, agressive blitzing, traditional, or a variety of other systems might have a focus on certain players. Using data that suggests things like that might be helpful in determing where a player lands. It makes more sense to me however, to classify players and then determine a teams playtyle off that. Or using a teams playstyle see what types of players perform well.


## Dataset 2. Combine/College stats

These stats will be used to predict what position a draft prospect is most likely to play. 

#### Categories of stats included:
Draft measurements and drill results - These too are pulled from the stats.nba api. They include anthroporthic data including wingspan, hand length, height. Drill results include vertical leap, three quarter sprint, and lane agility time. 

College statitics - These are scraped from basketball reference (violation of terms and service please don't tell anyone). Advanced statistics like those mentioned above (including the bball reference specific win share, vorp, etc.) Also per 40 minutes regular box score stats were included. I initially wanted to use per 100 possesion stats, but those only dated back to 2011. A revisit to this project might decide to use those and limit players available to see if that improves prediction.

#### Data Impution:
A decent amount of players either completely missed the draft or did not participate in certain drills. For player who missed the draft, their height, weight, and position were pulled from stats.nba via other access points. Using height weight and position a simple linear model was built to impute the results for other players. This seemed to make sense a 7 foot tall player will have a longer wingspan, or a 190 pound guard should be quicker in a three quarter sprint than a 280 pound big man. 
A total of blank/blank players had their stats completely determiend this way


#### Other considerations/future inclusions:

Some player have combine data available from individual combines. If a comprehensive dataset of this is available this might help complete the data without need for impution. Other than this certain unique players may be chosen and looked up but this would get tedious for only 5-10 rows of data becoming more complete. 

Using college stats means there is no inclusion of euroleague players. Euroleague stats are not a perfect comparision to college stats or even each other (though neither are college stats one could argue). I have not yet figured out how I would go about incorporating these into the same dataset. Perhaps I could build seperate models if there is suffient data. For now I choose to exclude players coming from europe. 

## Combine Data Retrieval

Data is pulled from the nba endpoint for anthropomorphic and end results and then merged together.
Note: it occured to me after that an enpoint exists that does not require this merge. 

A start year of 2008 was chosen rather arbitrarily. I wanted a suffiently large data set (also arbitrary). I wanted to choose only more recent player as certain features were only recored after certain years. Also since this is a focus on modern NBA positions and it might be hard to predict where a player coming out of college in 2007 will be playing in 2017 when their style of play, and more importantly the leagues will have changed significantly by then. 

In [4]:
Player_ID_Dict = pickle.load(open('NBAPlayerDict.pkl','rb'))
## Simple Dictionary that maps player name and draft year to player id

combine_columns = ['PLAYER_NAME','DRAFT_YEAR','POSITION','STANDING_VERTICAL_LEAP',
                   'MAX_VERTICAL_LEAP','LANE_AGILITY_TIME',
                   'MODIFIED_LANE_AGILITY_TIME','THREE_QUARTER_SPRINT',
                   'BENCH_PRESS','HEIGHT_WO_SHOES','WEIGHT','WINGSPAN',
                   'STANDING_REACH','BODY_FAT_PCT','HAND_LENGTH','HAND_WIDTH'                  
                  ]
params = {'LeagueID':'00'}
full_combine_df = pd.DataFrame(columns = combine_columns)
for i in range(8,19):
    year="20"+format(i,'02d')+"-"+format(i+1,'02d')
    params.update({"SeasonYear":year})
    df1 = get_nba_data(endpoint="draftcombinedrillresults",params=params)
    df2 = get_nba_data(endpoint="draftcombineplayeranthro",params=params)
    dfMerged = pd.merge(df1,df2)
    dfMerged['PLAYER_ID']=pd.Series(dfMerged.PLAYER_ID,dtype=str)
    if(i>13):
        dfMerged['PLAYER_NAME'] = dfMerged['PLAYER_NAME']
        dfMerged['PLAYER_ID']=(dfMerged['PLAYER_NAME']+"("+year[0:4]+")").map(Player_ID_Dict)
        dfMerged['PLAYER_ID']=pd.Series(dfMerged['PLAYER_ID'],dtype=str).str.split(".").str.get(0)

    dfMerged['DRAFT_YEAR'] = int(year[0:4])    
    dfMerged = dfMerged[dfMerged['PLAYER_ID']!='nan']

    dfMerged = dfMerged.set_index('PLAYER_ID')[combine_columns]
        
    full_combine_df = pd.concat([full_combine_df,dfMerged])
full_combine_df.fillna(value=np.nan, inplace=True)
full_combine_df = full_combine_df.drop(['MODIFIED_LANE_AGILITY_TIME','BENCH_PRESS'],axis=1)

In [8]:
print(len(full_combine_df))
full_combine_df.tail()

501


Unnamed: 0,PLAYER_NAME,DRAFT_YEAR,POSITION,STANDING_VERTICAL_LEAP,MAX_VERTICAL_LEAP,LANE_AGILITY_TIME,THREE_QUARTER_SPRINT,HEIGHT_WO_SHOES,WEIGHT,WINGSPAN,STANDING_REACH,BODY_FAT_PCT,HAND_LENGTH,HAND_WIDTH
1628403,Caleb Swanigan,2017,PF,,,,,79.5,245.6,87.0,108.0,,9.5,10.25
1628414,Sindarius Thornwell,2017,SG-SF,27.0,30.5,11.48,3.36,75.5,211.6,82.0,103.0,7.8,8.75,8.75
1628476,Derrick Walton Jr.,2017,PG,26.0,32.5,11.28,3.29,71.0,188.6,74.5,95.0,5.8,8.0,8.5
1628401,Derrick White,2017,PG,31.0,36.5,10.84,3.08,75.25,189.8,79.5,101.5,6.2,8.25,8.5
1628391,D.J. Wilson,2017,PF,,,,,80.75,234.4,87.0,109.5,6.4,9.25,10.25


## Retrieval for dataset 2

We do this now because we want to see which players qualify before requesting data that we will end up just throwing away.

Request data from years 2013-2018 for both playoffs and regular season through endpoints corresponding to 4 types of data we wish to retrieve.

In [28]:
advanced_2017_18 = get_nba_data(endpoint=endpt,params=params)
advanced_2017_18['SEASON'] = 2018
advanced_2017_18['SEASON_TYPE']='Regular Season'

keep = ['PLAYER_ID','SEASON','SEASON_TYPE','PLAYER_NAME','TEAM_ID','TEAM_ABBREVIATION','AGE','GP','MIN','OFF_RATING','DEF_RATING',
    'AST_PCT','AST_TO','AST_RATIO','OREB_PCT','DREB_PCT','REB_PCT','TM_TOV_PCT','EFG_PCT',
    'TS_PCT','USG_PCT','PACE']
advanced_2017_18 = advanced_2017_18[keep].set_index(['PLAYER_ID','SEASON','SEASON_TYPE'])

In [29]:
advanced_full = pd.DataFrame(columns=advanced_2017_18.columns)

for i in range(13,18):
    params.update({'SeasonType':'Regular Season'})
    
    year="20"+format(i,'02d')+"-"+format(i+1,'02d')
    params.update({'Season':year})
    reg_season = get_nba_data(endpoint=endpt,params=params)
    reg_season['SEASON_TYPE'] = 'Regular Season'
    params.update({'SeasonType':'Playoffs'})
    
    playoff = get_nba_data(endpoint=endpt,params=params)
    playoff['SEASON_TYPE'] = 'Playoffs'
    merged = pd.concat([reg_season,playoff])
    
    merged['SEASON']= 2001+i
    
    merged = merged[keep]

    advanced_full = pd.concat([advanced_full,merged])

newcols = advanced_full.columns[advanced_full.columns=='PLAYER_NAME'].append(
advanced_full.columns[advanced_full.columns!='PLAYER_NAME'])
advanced_full['SEASON']=pd.Series(advanced_full.SEASON,dtype='int64')

advanced_full['PLAYER_ID']=pd.Series(advanced_full.PLAYER_ID,dtype='int64')
advanced_full['PLAYER_ID']=pd.Series(advanced_full.PLAYER_ID,dtype='str')

advanced_full = advanced_full[newcols]\
                        .set_index(['PLAYER_ID','SEASON','SEASON_TYPE'])

In [30]:
advanced_full.sort_index().head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,PLAYER_NAME,AGE,AST_PCT,AST_RATIO,AST_TO,DEF_RATING,DREB_PCT,EFG_PCT,GP,MIN,OFF_RATING,OREB_PCT,PACE,REB_PCT,TEAM_ABBREVIATION,TEAM_ID,TM_TOV_PCT,TS_PCT,USG_PCT
PLAYER_ID,SEASON,SEASON_TYPE,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
101106,2014,Regular Season,Andrew Bogut,29.0,0.087,18.3,1.15,98.8,0.293,0.627,67,26.4,107.8,0.117,98.5,0.208,GSW,1610612744,15.8,0.61,0.124
101106,2015,Playoffs,Andrew Bogut,30.0,0.121,25.2,1.38,98.9,0.275,0.56,19,23.1,101.8,0.107,93.93,0.193,GSW,1610612744,18.2,0.551,0.113
101106,2015,Regular Season,Andrew Bogut,30.0,0.149,27.3,1.7,95.2,0.262,0.563,67,23.6,111.7,0.101,99.56,0.185,GSW,1610612744,16.1,0.565,0.133
101106,2016,Playoffs,Andrew Bogut,31.0,0.127,22.7,1.58,95.4,0.203,0.623,22,16.6,105.9,0.134,98.08,0.17,GSW,1610612744,14.4,0.607,0.125
101106,2016,Regular Season,Andrew Bogut,31.0,0.145,29.7,1.95,97.2,0.252,0.629,70,20.7,111.4,0.094,101.07,0.178,GSW,1610612744,15.2,0.623,0.116


In [31]:
endpt,params = get_params('http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Per100Possessions&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=')
params.update({'SeasonType':'Regular Season'})
per100poss = get_nba_data(endpt,params)
per100poss['SEASON'] = 2018
per100poss['SEASON_TYPE']='Regular Season'
per100poss = per100poss.loc[:,~per100poss.columns.str.contains('RANK')]\
                                    .drop(['CFID','CFPARAMS','NBA_FANTASY_PTS',
                                           'DD2','TD3','W','L','W_PCT','MIN',
                                           'PLAYER_NAME','TEAM_ID','TEAM_ABBREVIATION',
                                           'AGE','GP'
                                          ]
                                          ,axis=1)\
                                    .set_index(['PLAYER_ID','SEASON','SEASON_TYPE'])
per100poss.columns

Index(['FGM', 'FGA', 'FG_PCT', 'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA',
       'FT_PCT', 'OREB', 'DREB', 'REB', 'AST', 'TOV', 'STL', 'BLK', 'BLKA',
       'PF', 'PFD', 'PTS', 'PLUS_MINUS'],
      dtype='object')

In [33]:
per_100p_full= pd.DataFrame(columns=per100poss.columns)

for i in range(13,18):
    params.update({'SeasonType':'Regular Season'})
    
    year="20"+format(i,'02d')+"-"+format(i+1,'02d')
    params.update({'Season':year})
    reg_season = get_nba_data(endpoint=endpt,params=params)
    reg_season['SEASON_TYPE'] = 'Regular Season'
    params.update({'SeasonType':'Playoffs'})
    
    playoff = get_nba_data(endpoint=endpt,params=params)
    playoff['SEASON_TYPE'] = 'Playoffs'
    merged = pd.concat([reg_season,playoff])
    merged = merged.loc[:,~merged.columns.str.contains('RANK')]\
                                    .drop(['CFID','CFPARAMS','NBA_FANTASY_PTS',
                                           'DD2','TD3','W','L','W_PCT','MIN',
                                           'PLAYER_NAME','TEAM_ID','TEAM_ABBREVIATION',
                                           'AGE','GP'
                                          ],
                                          axis=1)
    merged['SEASON']= 2001+i

    per_100p_full = pd.concat([per_100p_full,merged])

newcols = per_100p_full.columns[per_100p_full.columns=='PLAYER_NAME'].append(
per_100p_full.columns[per_100p_full.columns!='PLAYER_NAME'])
per_100p_full['SEASON']=pd.Series(per_100p_full.SEASON,dtype='int64')

per_100p_full['PLAYER_ID']=pd.Series(per_100p_full.PLAYER_ID,dtype='int64')
per_100p_full['PLAYER_ID']=pd.Series(per_100p_full.PLAYER_ID,dtype='str')

per_100p_full = per_100p_full[newcols]\
                        .set_index(['PLAYER_ID','SEASON','SEASON_TYPE'])
per_100p_full.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,AST,BLK,BLKA,DREB,FG3A,FG3M,FG3_PCT,FGA,FGM,FG_PCT,...,FTM,FT_PCT,OREB,PF,PFD,PLUS_MINUS,PTS,REB,STL,TOV
PLAYER_ID,SEASON,SEASON_TYPE,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
201985,2014,Regular Season,6.4,0.0,0.0,4.4,10.9,3.0,0.273,22.7,9.4,0.413,...,0.0,0.0,0.5,2.5,1.5,0.5,21.7,4.9,0.5,3.5
201166,2014,Regular Season,7.2,0.4,1.1,3.0,7.7,3.0,0.387,18.0,7.2,0.401,...,2.6,0.874,1.3,4.5,3.3,-2.9,19.9,4.3,1.6,3.6
201189,2014,Regular Season,3.1,1.1,1.4,9.7,0.1,0.0,0.0,8.6,3.8,0.443,...,1.5,0.55,5.9,9.0,2.2,-11.9,9.1,15.6,1.4,4.3
203519,2014,Regular Season,4.0,0.0,1.3,4.0,6.7,1.3,0.2,18.7,8.0,0.429,...,1.3,1.0,0.0,8.0,2.7,10.7,18.7,4.0,0.0,1.3
1733,2014,Regular Season,2.8,0.0,1.0,6.4,9.9,3.4,0.34,20.4,8.1,0.396,...,2.7,0.771,1.5,7.1,3.9,4.5,22.2,7.9,1.4,3.4


In [36]:
endpt,params = get_params('http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Scoring&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Playoffs&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=')
params.update({'SeasonType':'Regular Season'})
scoring = get_nba_data(endpt,params)
scoring['SEASON'] = 2018
scoring['SEASON_TYPE']='Regular Season'
scoring = scoring.loc[:,~scoring.columns.str.contains('RANK')]\
                                    .drop(['CFID','CFPARAMS','W','L','W_PCT',
                                           'PLAYER_NAME','TEAM_ID','TEAM_ABBREVIATION',
                                           'AGE','GP','MIN'
                                          ]
                                          ,axis=1)\
                                    .set_index(['PLAYER_ID','SEASON','SEASON_TYPE'])
scoring_full= pd.DataFrame(columns=scoring.columns)

for i in range(13,18):
    params.update({'SeasonType':'Regular Season'})
    
    year="20"+format(i,'02d')+"-"+format(i+1,'02d')
    params.update({'Season':year})
    reg_season = get_nba_data(endpoint=endpt,params=params)
    reg_season['SEASON_TYPE'] = 'Regular Season'
    params.update({'SeasonType':'Playoffs'})
    
    playoff = get_nba_data(endpoint=endpt,params=params)
    playoff['SEASON_TYPE'] = 'Playoffs'
    merged = pd.concat([reg_season,playoff])
    merged = merged.loc[:,~merged.columns.str.contains('RANK')]\
                        .drop(['CFID','CFPARAMS','W','L','W_PCT',
                                           'PLAYER_NAME','TEAM_ID','TEAM_ABBREVIATION',
                                           'AGE','GP','MIN'
                                          ]
                              ,axis=1)
    merged['SEASON']= 2001+i

    scoring_full = pd.concat([scoring_full,merged])

scoring_full['SEASON']=pd.Series(scoring_full.SEASON,dtype='int64')

scoring_full['PLAYER_ID']=pd.Series(scoring_full.PLAYER_ID,dtype='int64')
scoring_full['PLAYER_ID']=pd.Series(scoring_full.PLAYER_ID,dtype='str')

scoring_full = scoring_full.set_index(['PLAYER_ID','SEASON','SEASON_TYPE'])
scoring_full.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,PCT_AST_2PM,PCT_AST_3PM,PCT_AST_FGM,PCT_FGA_2PT,PCT_FGA_3PT,PCT_PTS_2PT,PCT_PTS_2PT_MR,PCT_PTS_3PT,PCT_PTS_FB,PCT_PTS_FT,PCT_PTS_OFF_TOV,PCT_PTS_PAINT,PCT_UAST_2PM,PCT_UAST_3PM,PCT_UAST_FGM
PLAYER_ID,SEASON,SEASON_TYPE,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
201985,2014,Regular Season,0.231,0.833,0.421,0.522,0.478,0.591,0.409,0.409,0.114,0.0,0.114,0.182,0.769,0.167,0.579
201166,2014,Regular Season,0.139,0.698,0.369,0.573,0.427,0.425,0.099,0.447,0.109,0.129,0.138,0.326,0.861,0.302,0.631
201189,2014,Regular Season,0.667,0.0,0.667,0.984,0.016,0.831,0.092,0.0,0.031,0.169,0.108,0.738,0.333,0.0,0.333
203519,2014,Regular Season,0.2,1.0,0.333,0.643,0.357,0.714,0.286,0.214,0.429,0.071,0.357,0.429,0.8,0.0,0.667
1733,2014,Regular Season,0.646,1.0,0.793,0.517,0.483,0.427,0.053,0.453,0.098,0.12,0.164,0.373,0.354,0.0,0.207


In [40]:
# year="2017-18"
# endpt = 'playerdashptshotdefend'
# params = {'DateFrom':'','DateTo':'','GameSegment':'',
#           'LastNGames':'0','LeagueID':'00','Location':'',
#           'Month':'0','OpponentTeamID':'0','Outcome':'',
#           'Period':'0','PlayerID':'0','Season':year,
#           'SeasonSemgent':'','SeasonType':'Regular Season',
#           'TeamID':'0','VsConference':'','VsDivision':'',
#           'PerMode':'Totals','SeasonSegment':''
#          }
# defensive = get_nba_data(endpoint=endpt,params=params)
# defensive['PLAYER_ID'] = defensive.CLOSE_DEF_PERSON_ID
# defensive = defensive.set_index('PLAYER_ID').drop(['CLOSE_DEF_PERSON_ID'],axis=1)
defensive.groupby('DEFENSE_CATEGORY')[['PCT_PLUSMINUS','FREQ','D_FG_PCT','D_FGA']].describe()

Unnamed: 0_level_0,PCT_PLUSMINUS,PCT_PLUSMINUS,PCT_PLUSMINUS,PCT_PLUSMINUS,PCT_PLUSMINUS,PCT_PLUSMINUS,PCT_PLUSMINUS,PCT_PLUSMINUS,FREQ,FREQ,...,D_FG_PCT,D_FG_PCT,D_FGA,D_FGA,D_FGA,D_FGA,D_FGA,D_FGA,D_FGA,D_FGA
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
DEFENSE_CATEGORY,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2 Pointers,529.0,0.00965,0.104892,-0.574,-0.024,0.004,0.043,0.552,529.0,0.653221,...,0.547,1.0,529.0,262.559546,215.415096,1.0,62.0,237.0,412.0,1027.0
3 Pointers,526.0,0.007226,0.123102,-0.4,-0.03075,0.001,0.035,0.725,526.0,0.358266,...,0.394,1.0,526.0,134.171103,106.625081,1.0,33.25,121.0,212.75,518.0
Greater Than 15 Ft,531.0,0.003154,0.112102,-0.441,-0.026,0.003,0.0335,0.675,531.0,0.506194,...,0.404,1.0,531.0,192.150659,150.499859,1.0,49.0,181.0,304.0,728.0
Less Than 10 Ft,525.0,0.021015,0.125468,-0.754,-0.037,0.018,0.071,0.565,525.0,0.428076,...,0.632,1.0,525.0,173.129524,151.214342,1.0,41.0,144.0,259.0,742.0
Less Than 6 Ft,523.0,0.030228,0.13838,-0.66,-0.0385,0.022,0.083,0.558,523.0,0.327956,...,0.7,1.0,523.0,132.265774,118.10814,1.0,31.0,104.0,201.5,588.0
Overall,534.0,0.00717,0.088566,-0.531,-0.02275,0.002,0.032,0.64,534.0,1.0,...,0.488,1.0,534.0,392.262172,306.495476,1.0,92.75,375.5,627.75,1291.0


Defensive data came stacked. This function unstacks and keeps which variables we would like

In [43]:
def defensiveUnstack(season,season_type,measures):
    endpt = 'playerdashptshotdefend'
    params = {'DateFrom':'','DateTo':'','GameSegment':'',
          'LastNGames':'0','LeagueID':'00','Location':'',
          'Month':'0','OpponentTeamID':'0','Outcome':'',
          'Period':'0','PlayerID':'0','Season':year,
          'SeasonSemgent':'','SeasonType':'Regular Season',
          'TeamID':'0','VsConference':'','VsDivision':'',
          'PerMode':'Totals','SeasonSegment':''
         }
    params.update({'Season':year,
                   'SeasonType':season_type                  
                  })
    defensive = get_nba_data(endpoint=endpt,params=params)
    defensive['PLAYER_ID']=defensive.CLOSE_DEF_PERSON_ID
    defensive['SEASON'] = int(season[:4])+1
    defensive['SEASON_TYPE'] = season_type
    defensive = defensive.set_index('CLOSE_DEF_PERSON_ID')

    categories = defensive.DEFENSE_CATEGORY.unique()[1:]
    
    defensive_seperate = [defensive[defensive.DEFENSE_CATEGORY==cat][measures] for cat in categories]
    suffix_list = ['_3Pt',"_2Pt","_Lt6","_Lt10","_Gt15"]
    col_pair = defensive_seperate[0].columns
    cols = ["DEF_"+col_pair + suffix for suffix in suffix_list]
    
    for i,pair in enumerate(cols):
        defensive_seperate[i].columns=pair

    return(pd.concat(defensive_seperate,axis=1))

defensiveUnstack('2016-17','Playoffs',['PCT_PLUSMINUS','FREQ','D_FG_PCT','D_FGA']).head()

Unnamed: 0_level_0,DEF_PCT_PLUSMINUS_3Pt,DEF_FREQ_3Pt,DEF_D_FG_PCT_3Pt,DEF_D_FGA_3Pt,DEF_PCT_PLUSMINUS_2Pt,DEF_FREQ_2Pt,DEF_D_FG_PCT_2Pt,DEF_D_FGA_2Pt,DEF_PCT_PLUSMINUS_Lt6,DEF_FREQ_Lt6,DEF_D_FG_PCT_Lt6,DEF_D_FGA_Lt6,DEF_PCT_PLUSMINUS_Lt10,DEF_FREQ_Lt10,DEF_D_FG_PCT_Lt10,DEF_D_FGA_Lt10,DEF_PCT_PLUSMINUS_Gt15,DEF_FREQ_Gt15,DEF_D_FG_PCT_Gt15,DEF_D_FGA_Gt15
CLOSE_DEF_PERSON_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
201567,-0.029,0.251,0.352,71.0,0.021,0.749,0.547,212.0,-0.016,0.456,0.612,129.0,0.004,0.565,0.581,160.0,-0.036,0.367,0.365,104.0
1627759,0.034,0.434,0.403,72.0,-0.079,0.566,0.447,94.0,-0.059,0.217,0.583,36.0,-0.087,0.349,0.5,58.0,0.022,0.584,0.412,97.0
2747,-0.068,0.471,0.315,89.0,0.036,0.529,0.56,100.0,0.09,0.27,0.725,51.0,0.112,0.339,0.688,64.0,-0.1,0.582,0.3,110.0
1628369,-0.068,0.461,0.308,65.0,0.042,0.539,0.566,76.0,0.115,0.291,0.756,41.0,0.095,0.376,0.679,53.0,-0.074,0.56,0.316,79.0
203382,-0.062,0.162,0.294,34.0,-0.038,0.838,0.5,176.0,-0.029,0.419,0.625,88.0,-0.035,0.552,0.56,116.0,-0.039,0.333,0.329,70.0


In [44]:
defensive = defensiveUnstack('2017-18','Playoffs',['PCT_PLUSMINUS','FREQ','D_FG_PCT','D_FGA']).head()
defensive_full= pd.DataFrame(columns=defensive.columns)

for i in range(13,18):
    year="20"+format(i,'02d')+"-"+format(i+1,'02d')
    params.update({'Season':year})
    reg_season = defensiveUnstack(year,'Regular Season',
                                  ['PCT_PLUSMINUS','FREQ','D_FG_PCT','D_FGA'])
    reg_season['SEASON_TYPE'] = "Regular Season"
    playoff = defensiveUnstack(year,'Playoffs',['PCT_PLUSMINUS','FREQ','D_FG_PCT','D_FGA'])
    playoff['SEASON_TYPE'] = "Playoffs"
    
    merged = pd.concat([reg_season,playoff])
    merged['PLAYER_ID']=merged.index
    merged['SEASON'] = 2001+i
    
    defensive_full = pd.concat([defensive_full,merged])


defensive_full['SEASON']=pd.Series(defensive_full.SEASON,dtype='int64')

defensive_full['PLAYER_ID']=pd.Series(defensive_full.PLAYER_ID,dtype='int64')
defensive_full['PLAYER_ID']=pd.Series(defensive_full.PLAYER_ID,dtype='str')

defensive_full = defensive_full.set_index(['PLAYER_ID','SEASON','SEASON_TYPE'])

defensive_full.sort_index().head(15)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,DEF_D_FGA_2Pt,DEF_D_FGA_3Pt,DEF_D_FGA_Gt15,DEF_D_FGA_Lt10,DEF_D_FGA_Lt6,DEF_D_FG_PCT_2Pt,DEF_D_FG_PCT_3Pt,DEF_D_FG_PCT_Gt15,DEF_D_FG_PCT_Lt10,DEF_D_FG_PCT_Lt6,DEF_FREQ_2Pt,DEF_FREQ_3Pt,DEF_FREQ_Gt15,DEF_FREQ_Lt10,DEF_FREQ_Lt6,DEF_PCT_PLUSMINUS_2Pt,DEF_PCT_PLUSMINUS_3Pt,DEF_PCT_PLUSMINUS_Gt15,DEF_PCT_PLUSMINUS_Lt10,DEF_PCT_PLUSMINUS_Lt6
PLAYER_ID,SEASON,SEASON_TYPE,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
101106,2014,Regular Season,704.0,48.0,201.0,472.0,363.0,0.447,0.292,0.378,0.485,0.499,0.936,0.064,0.267,0.628,0.483,-0.052,-0.056,-0.004,-0.066,-0.102
101106,2015,Playoffs,189.0,11.0,47.0,136.0,102.0,0.45,0.273,0.426,0.441,0.461,0.945,0.055,0.235,0.68,0.51,-0.049,-0.067,0.055,-0.103,-0.126
101106,2015,Regular Season,675.0,47.0,186.0,485.0,353.0,0.424,0.319,0.409,0.421,0.448,0.935,0.065,0.258,0.672,0.489,-0.075,-0.008,0.03,-0.126,-0.146
101106,2016,Playoffs,126.0,19.0,28.0,106.0,85.0,0.421,0.211,0.286,0.415,0.459,0.869,0.131,0.193,0.731,0.586,-0.112,-0.117,-0.069,-0.152,-0.145
101106,2016,Regular Season,642.0,57.0,185.0,444.0,343.0,0.449,0.333,0.378,0.48,0.51,0.918,0.082,0.265,0.635,0.491,-0.051,-0.005,0.001,-0.068,-0.09
101106,2017,Regular Season,196.0,18.0,53.0,139.0,107.0,0.464,0.556,0.396,0.525,0.579,0.916,0.084,0.248,0.65,0.5,-0.052,0.187,0.008,-0.041,-0.036
101106,2018,Regular Season,65.0,14.0,22.0,49.0,34.0,0.415,0.429,0.409,0.429,0.529,0.823,0.177,0.278,0.62,0.43,-0.112,0.062,0.028,-0.147,-0.089
101107,2014,Regular Season,456.0,163.0,269.0,296.0,212.0,0.507,0.356,0.398,0.541,0.613,0.737,0.263,0.435,0.478,0.342,0.01,-0.005,0.019,-0.016,0.011
101107,2015,Regular Season,502.0,163.0,306.0,304.0,221.0,0.466,0.301,0.343,0.523,0.557,0.755,0.245,0.46,0.457,0.332,-0.019,-0.046,-0.027,-0.013,-0.022
101107,2016,Playoffs,38.0,22.0,31.0,24.0,15.0,0.579,0.364,0.355,0.667,0.8,0.633,0.367,0.517,0.4,0.25,0.059,0.007,-0.003,0.08,0.132


In [46]:
x = pd.merge(advanced_full,per_100p_full,left_index=True,right_index=True)
y = pd.merge(x, scoring_full,left_index=True,right_index=True)
all_full = pd.merge(y,defensive_full,left_index=True,right_index=True)
len(all_full.columns)
all_full['SEASON_MIN']=all_full.GP*all_full.MIN

In [70]:
pickle.dump(all_full, open('NBA_Season_Data.pkl', 'wb'))

In [57]:
all_full_reg_season = all_full.loc[idx[:,:, 'Regular Season'],:]
qualified = all_full_reg_season.query("SEASON_MIN>700")
qualified_player_list = list(qualified.reset_index().PLAYER_ID.unique())
len(qualified_player_list)

542

### Height and weight for missing players

Very complicated loop here. Lots of exceptions need be thrown if a players current team is not the same as in the commonallplayers data source 

### Which players went to college and which of those is draft data missing

In [59]:
draft_history = get_nba_data('drafthistory',{'LeagueID':'00'})
draft_history['SEASON']=pd.Series(draft_history.SEASON,dtype='int64')
draft_history = draft_history.query('SEASON>=2008')
draft_history = draft_history.query("ORGANIZATION_TYPE == 'College/University'")

draft_history['PLAYER_ID']=pd.Series(draft_history['PERSON_ID'],dtype=str)

draft_history = draft_history.set_index('PLAYER_ID')
#draft_history = draft_history.drop('PERSON_ID',axis=1)


drafted_from_college = np.array(draft_history.index)
from_college_qual = np.intersect1d(qualified_player_list,drafted_from_college)
len(from_college_qual)

277

In [61]:
combine_player_list = np.array(full_combine_df.index)

players_to_find = pd.Series(np.setdiff1d(from_college_qual,combine_player_list))
len(players_to_find)

53

In [62]:
players_to_find_df = draft_history.loc[players_to_find,:][['PLAYER_NAME','SEASON','TEAM_ID','TEAM_NAME']]
players_to_find_df.head()

Unnamed: 0_level_0,PLAYER_NAME,SEASON,TEAM_ID,TEAM_NAME
PLAYER_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1626143,Jahlil Okafor,2015,1610612755,76ers
1626150,Andrew Harrison,2015,1610612756,Suns
1626156,D'Angelo Russell,2015,1610612747,Lakers
1626157,Karl-Anthony Towns,2015,1610612750,Timberwolves
1626162,Kelly Oubre Jr.,2015,1610612737,Hawks


In [69]:
#endpt = 'commonteamroster'
#all_players = get_nba_data('commonallplayers',{'LeagueID':'00','Season':'2017-18','IsOnlyCurrentSeason':'0'})
#player_team_dict = all_players.set_index('PERSON_ID')['TEAM_ID']
new_player_combine_df = pd.DataFrame(columns=full_combine_df.columns)

for index,row in players_to_find_df.iterrows():
    season = str(row.SEASON)[:4]+"-"+str(row.SEASON+1)[2:]
    df_temp = None
    try:
        params = {'Season':season,'TeamID':str(row.TEAM_ID)}
        df_temp = get_nba_data(endpt,params)
        df_temp = df_temp.set_index('PLAYER_ID')
        height = df_temp.HEIGHT[int(index)].split("-")
        height_inches = int(height[0])*12 + int(height[1])-1
        weight = df_temp.WEIGHT[int(index)]
        position = df_temp.POSITION[int(index)]
        print('trying')
    except:
        print('except')
        try:
            team_id = player_team_dict[int(index)]
        except:
            continue
        params = {'Season':'2017-18','TeamID':str(team_id)}
        season_curr = 2017
        while(team_id==0 and season_curr>=2014):
            season_str = str(season_curr)[:4]+"-"+str(season_curr+1)[2:]
            all_players_temp = get_nba_data('commonallplayers',{'LeagueID':'00',
                                                                'Season':season_str,
                                                           'IsOnlyCurrentSeason':'1'})
            player_team_dict_temp = all_players_temp.set_index('PERSON_ID')['TEAM_ID']
            try:
                team_id = player_team_dict_temp[int(index)]
            except: 
                break
            params = {'Season':season_str,'TeamID':str(team_id)}
            season_curr-=1
#            print(season_curr)
            print(season_str)
        try:
            df_temp = get_nba_data(endpt,params)
        except:
            continue
        df_temp = df_temp.set_index('PLAYER_ID')
        height = df_temp.HEIGHT[int(index)].split("-")
        height_inches = int(height[0])*12 + int(height[1])-1
        weight = df_temp.WEIGHT[int(index)]
        position = df_temp.POSITION[int(index)]
    if not df_temp:
        new_row = pd.Series({'PLAYER_ID':index,
                             'PLAYER_NAME':row.PLAYER_NAME,
                             'DRAFT_YEAR':row.SEASON,
                             'HEIGHT_WO_SHOES':height_inches,
                             'WEIGHT':weight,'POSITION':position
                            }).transpose()
        new_player_combine_df=new_player_combine_df.append(new_row,ignore_index=True)
        print([height,weight,position])
new_player_combine_df = new_player_combine_df.set_index('PLAYER_ID')

KeyboardInterrupt: 

In [None]:
len(new_player_combine_df)

In [None]:
full_combine_df = pd.concat([full_combine_df,new_player_combine_df])

In [None]:
## Reposition to 4 instead of ~15
full_combine_df['POSITION_SIMPLE']=full_combine_df.POSITION
full_combine_df.POSITION_SIMPLE[full_combine_df.POSITION_SIMPLE.str.contains('C')]='Big'
full_combine_df.POSITION_SIMPLE[full_combine_df.POSITION_SIMPLE.str.contains('PF')]='Forward'
full_combine_df.POSITION_SIMPLE[full_combine_df.POSITION_SIMPLE.str.contains('SF')]='Wing'
full_combine_df.POSITION_SIMPLE[full_combine_df.POSITION_SIMPLE.str.contains('G')]='Guard'
full_combine_df.POSITION_SIMPLE[full_combine_df.POSITION_SIMPLE.str.contains('F')]='Forwad'

full_combine_df.loc[:,['POSITION','POSITION_SIMPLE']]

In [None]:
## Make every column that can be numeric, remove those that are still missing weight

full_combine_df = full_combine_df.apply(pd.to_numeric,errors='ignore')
full_combine_df = full_combine_df.loc[~full_combine_df.WEIGHT.isnull(),:]

In [None]:
full_combine_df.groupby('POSITION_SIMPLE').describe().stack()

### Imputing missing values with linear model

In [None]:
predicting_vals = pd.get_dummies(full_combine_df[['DRAFT_YEAR','POSITION_SIMPLE','WEIGHT','HEIGHT_WO_SHOES']])
predicting_vals
cols_to_impute = ['STANDING_VERTICAL_LEAP','MAX_VERTICAL_LEAP','LANE_AGILITY_TIME',
                  'THREE_QUARTER_SPRINT','WINGSPAN','STANDING_REACH',
                  'BODY_FAT_PCT','HAND_LENGTH','HAND_WIDTH']

for col in cols_to_impute:
#    print(col)
    missing = full_combine_df[col].isnull()
    X = predicting_vals[~missing]
    Y = full_combine_df[col][~missing]
    X_fill = predicting_vals[missing]
    model = LinearRegression(fit_intercept=True)
    model.fit(X,Y)
    full_combine_df[col][missing]=model.predict(X_fill)

In [None]:
## Dump pickle of combine data. 
pickle.dump(full_combine_df,open('CombineImputed.pkl','wb'))