# nba_api

Exploring what the `nba_api` package (https://github.com/swar/nba_api) can do and what it is useful for.

In [2]:
from nba_api.stats.endpoints import playbyplay, playbyplayv2, leaguegamefinder
import pandas as pd
from pprint import pprint

### Getting play-by-play data for a game

I found both `playbyplay` and `playbyplayv2` modules in the code, so let's start with the former.

The `playbyplay.PlayByPlay` class requires a game ID, so I took one from the sample code.

In [3]:
pbp = playbyplay.PlayByPlay('0021700807')
dfs = pbp.get_data_frames()
len(dfs)

2

We get two DataFrames back, which is odd for a single game.

In [4]:
print(dfs[0].shape)
dfs[0].head()

(455, 12)


Unnamed: 0,GAME_ID,EVENTNUM,EVENTMSGTYPE,EVENTMSGACTIONTYPE,PERIOD,WCTIMESTRING,PCTIMESTRING,HOMEDESCRIPTION,NEUTRALDESCRIPTION,VISITORDESCRIPTION,SCORE,SCOREMARGIN
0,21700807,2,12,0,1,8:11 PM,12:00,,,,,
1,21700807,4,10,0,1,8:11 PM,12:00,Jump Ball Thompson vs. Towns: Tip to Gibson,,,,
2,21700807,7,2,1,1,8:11 PM,11:43,,,MISS Wiggins 27' 3PT Jump Shot,,
3,21700807,8,4,0,1,8:11 PM,11:33,,,Gibson REBOUND (Off:1 Def:0),,
4,21700807,9,2,97,1,8:11 PM,11:33,,,MISS Gibson 1' Tip Layup Shot,,


In [5]:
print(dfs[1].shape)
dfs[1].head()

(1, 1)


Unnamed: 0,VIDEO_AVAILABLE_FLAG
0,1


`dfs[0]` seems to be what we want. The second one holds a single `VIDEO_AVAILABLE_FLAG` value, and I'm not sure what it is for.

Now for `playbyplayv2`:

In [6]:
pbp2 = playbyplayv2.PlayByPlayV2('0021700807')
dfs2 = pbp2.get_data_frames()
len(dfs2)

2

In [7]:
print(dfs2[0].shape)
dfs2[0].head()

(455, 33)


Unnamed: 0,GAME_ID,EVENTNUM,EVENTMSGTYPE,EVENTMSGACTIONTYPE,PERIOD,WCTIMESTRING,PCTIMESTRING,HOMEDESCRIPTION,NEUTRALDESCRIPTION,VISITORDESCRIPTION,...,PLAYER2_TEAM_CITY,PLAYER2_TEAM_NICKNAME,PLAYER2_TEAM_ABBREVIATION,PERSON3TYPE,PLAYER3_ID,PLAYER3_NAME,PLAYER3_TEAM_ID,PLAYER3_TEAM_CITY,PLAYER3_TEAM_NICKNAME,PLAYER3_TEAM_ABBREVIATION
0,21700807,2,12,0,1,8:11 PM,12:00,,,,...,,,,0,0,,,,,
1,21700807,4,10,0,1,8:11 PM,12:00,Jump Ball Thompson vs. Towns: Tip to Gibson,,,...,Minnesota,Timberwolves,MIN,5,201959,Taj Gibson,1610613000.0,Minnesota,Timberwolves,MIN
2,21700807,7,2,1,1,8:11 PM,11:43,,,MISS Wiggins 27' 3PT Jump Shot,...,,,,0,0,,,,,
3,21700807,8,4,0,1,8:11 PM,11:33,,,Gibson REBOUND (Off:1 Def:0),...,,,,0,0,,,,,
4,21700807,9,2,97,1,8:11 PM,11:33,,,MISS Gibson 1' Tip Layup Shot,...,,,,0,0,,,,,


In [8]:
print(dfs2[1].shape)
dfs2[1].head()

(1, 1)


Unnamed: 0,VIDEO_AVAILABLE_FLAG
0,1


They look almost exactly the same: both return two DataFrames, the first a full capture of the game's play-by-play and the second a lone `VIDEO_AVAILABLE_FLAG` value that is pretty useless here.
The difference is that the v2 edition has 33 columns instead of 12.
Hopefully it has everything v1 has and more, so we can use v2 exclusively.
Let's compare the first 12 column names to be sure.

In [9]:
list(zip(dfs[0].columns, dfs2[0].columns[:12]))

[('GAME_ID', 'GAME_ID'),
 ('EVENTNUM', 'EVENTNUM'),
 ('EVENTMSGTYPE', 'EVENTMSGTYPE'),
 ('EVENTMSGACTIONTYPE', 'EVENTMSGACTIONTYPE'),
 ('PERIOD', 'PERIOD'),
 ('WCTIMESTRING', 'WCTIMESTRING'),
 ('PCTIMESTRING', 'PCTIMESTRING'),
 ('HOMEDESCRIPTION', 'HOMEDESCRIPTION'),
 ('NEUTRALDESCRIPTION', 'NEUTRALDESCRIPTION'),
 ('VISITORDESCRIPTION', 'VISITORDESCRIPTION'),
 ('SCORE', 'SCORE'),
 ('SCOREMARGIN', 'SCOREMARGIN')]

Yep, the column names match, so the v2 version is good enough for me.
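
The comparison above only looks at column names. A slightly stricter check, sketched below, also compares the values in the shared columns. The helper name `is_column_superset` and the toy frames are made up for illustration; in the notebook you would pass `dfs[0]` and `dfs2[0]` instead, and I have not verified that the real responses compare equal.

```python
import pandas as pd


def is_column_superset(small: pd.DataFrame, big: pd.DataFrame) -> bool:
    """Return True if every column of `small` exists in `big` with equal values."""
    missing = [col for col in small.columns if col not in big.columns]
    if missing:
        return False
    # DataFrame.equals treats NaNs in the same position as equal,
    # but it also requires matching dtypes.
    return small.reset_index(drop=True).equals(
        big[list(small.columns)].reset_index(drop=True)
    )


# Toy stand-ins for dfs[0] (v1) and dfs2[0] (v2); not real API output.
v1 = pd.DataFrame({"EVENTNUM": [2, 4], "HOMEDESCRIPTION": [None, "Jump Ball"]})
v2 = v1.assign(PLAYER3_NAME=[None, "Taj Gibson"])

print(is_column_superset(v1, v2))
```

Because `equals` is strict about dtypes, a `False` on real data would not necessarily mean the values differ; it would just be a prompt to look closer.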

### Full-season play-by-play
Fetching full play-by-play is fantastic, but so far we had to provide a game ID, and I could only find one in the docs.
Let's look for a more general way to get game IDs, ideally for a whole season or more.

In [10]:
league = leaguegamefinder.LeagueGameFinder()
dfs = league.get_data_frames()
len(dfs)

1

In [11]:
games = dfs[0]
print(games.shape)
games.head()

(30000, 28)


Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,...,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
0,22018,1610612758,SAC,Sacramento Kings,21800114,2018-11-01,SAC @ ATL,W,239,146,...,0.813,10,36,46,38,14,3,14,29,31.0
1,22018,1610612760,OKC,Oklahoma City Thunder,21800111,2018-11-01,OKC @ CHA,W,240,111,...,0.724,14,35,49,20,12,9,10,21,4.0
2,22018,1610612755,PHI,Philadelphia 76ers,21800113,2018-11-01,PHI vs. LAC,W,240,122,...,0.75,10,33,43,29,7,7,14,30,9.0
3,22018,1610612757,POR,Portland Trail Blazers,21800116,2018-11-01,POR vs. NOP,W,241,132,...,0.839,8,38,46,26,4,4,14,30,13.0
4,22018,1610612737,ATL,Atlanta Hawks,21800114,2018-11-01,ATL vs. SAC,L,239,115,...,0.676,10,33,43,26,6,6,22,28,-31.0


This looks like what we want.
For one NBA regular season, we know that there should be

30 teams \* 82 games/season \* (1 game/2 teams) = 1230 games/season
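
The same arithmetic as a quick Python sanity check, plus a toy illustration of why the count is halved: the table has one row per team per game, so each `GAME_ID` shows up twice. The toy frame below reuses a few values from the output above and is not a real API response.

```python
import pandas as pd

TEAMS = 30
GAMES_PER_TEAM = 82

# Every game involves two teams, so halve the team-game count.
expected_games = TEAMS * GAMES_PER_TEAM // 2
print(expected_games)  # 1230

# Toy rows shaped like the LeagueGameFinder output: one row per team per game.
toy = pd.DataFrame({
    "GAME_ID": ["21800114", "21800114", "21800111"],
    "TEAM_ABBREVIATION": ["SAC", "ATL", "OKC"],
})
rows_per_game = toy.groupby("GAME_ID").size()
print(rows_per_game.max())        # at most two rows per game
print(toy["GAME_ID"].nunique())   # number of distinct games
```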

Let's verify that's how many unique `GAME_ID` values we got.

In [12]:
games['GAME_ID'].nunique()

14998

Oh.

That's a lot more than expected.
Over what span is this data?

In [13]:
games['GAME_DATE'].min(), games['GAME_DATE'].max()

('2012-02-15', '2018-11-01')

Convert GAME_DATE to actual dates so we can see how SEASON_ID relates to the year of the game.

In [14]:
games['GAME_DATE'] = pd.to_datetime(games['GAME_DATE'])
games.groupby([games.GAME_DATE.dt.year, 'SEASON_ID'])['GAME_ID'].nunique()

GAME_DATE  SEASON_ID
2012       12012         144
           22011         713
           22012         845
           32011           4
           42011         102
           42012          19
2013       12013         140
           22012        1068
           22013         871
           32012           4
           32013           1
           42012         100
           42013          17
2014       12014         149
           22013        1074
           22014         920
           32013           4
           32014           1
           42013         106
           42014          18
2015       12015         130
           22014        1057
           22015         951
           32014           4
           32015           1
           42014          98
           42015          21
2016       12016         133
           22015        1056
           22016        1007
           32015           4
           42015         104
           42016          16
2017       12017      

It looks like the last 4 digits of SEASON_ID are always the current year or the one before.
Let's see what happens if we group by the last 4 digits of SEASON_ID.

In [15]:
games.groupby(games.SEASON_ID.str[-4:])['GAME_ID'].nunique()

SEASON_ID
2011     819
2012    2180
2013    2213
2014    2247
2015    2267
2016    2333
2017    2389
2018     550
Name: GAME_ID, dtype: int64

In [16]:
games.groupby([games.SEASON_ID.str[-4:], games.GAME_DATE.dt.year, games.SEASON_ID.str[0]])['GAME_ID'].nunique()

SEASON_ID  GAME_DATE  SEASON_ID
2011       2012       2             713
                      3               4
                      4             102
2012       2012       1             144
                      2             845
                      4              19
           2013       2            1068
                      3               4
                      4             100
2013       2013       1             140
                      2             871
                      3               1
                      4              17
           2014       2            1074
                      3               4
                      4             106
2014       2014       1             149
                      2             920
                      3               1
                      4              18
           2015       2            1057
                      3               4
                      4              98
2015       2015       1             130
        

Aha!
Looks pretty good, but I still don't understand what the first digit of SEASON_ID means (the third index level in the output).

But for now, it appears that the last 4 digits of SEASON_ID are the year the season began (2018 for the 2018-19 season).
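
A minimal sketch of that reading of SEASON_ID: split it into the unexplained first digit and the start year, then check that each game's calendar year is the start year or the year after. The helper name `split_season_id` and the column names `SEASON_PREFIX` and `SEASON_START_YEAR` are made up here; the three toy rows reuse values that appear in the outputs in this notebook.

```python
import pandas as pd


def split_season_id(season_id: pd.Series) -> pd.DataFrame:
    """Split SEASON_ID strings like '22018' into a prefix digit and a start year."""
    season_id = season_id.astype(str)
    return pd.DataFrame(
        {
            "SEASON_PREFIX": season_id.str[0],
            "SEASON_START_YEAR": season_id.str[-4:].astype(int),
        },
        index=season_id.index,
    )


toy = pd.DataFrame({
    "SEASON_ID": ["22018", "42017", "12017"],
    "GAME_DATE": pd.to_datetime(["2018-11-01", "2018-06-08", "2017-09-30"]),
})
parts = split_season_id(toy["SEASON_ID"])

# Calendar year of the game minus the season's start year.
# If the hypothesis holds, this is 0 or 1 for every row.
offset = toy["GAME_DATE"].dt.year - parts["SEASON_START_YEAR"]
print(offset.tolist())  # [0, 1, 0]
```

Running `offset.isin([0, 1]).all()` against the full `games` frame would be one way to test the hypothesis on all rows rather than by eye.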

Let's get all games from the 2017 season.

In [17]:
games_2017 = games[games.SEASON_ID.str[-4:] == '2017']
games_2017.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,...,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
972,42017,1610612739,CLE,Cleveland Cavaliers,41700404,2018-06-08,CLE vs. GSW,L,242,85,...,0.68,17,27,44,21,5,5,11,22,-23.0
976,42017,1610612744,GSW,Golden State Warriors,41700404,2018-06-08,GSW @ CLE,W,241,108,...,1.0,10,34,44,25,7,13,8,24,23.0
986,42017,1610612739,CLE,Cleveland Cavaliers,41700403,2018-06-06,CLE vs. GSW,L,240,102,...,0.765,15,32,47,20,6,4,13,18,-8.0
987,42017,1610612744,GSW,Golden State Warriors,41700403,2018-06-06,GSW @ CLE,W,239,110,...,0.895,6,31,37,27,6,5,10,20,8.0
996,42017,1610612744,GSW,Golden State Warriors,41700402,2018-06-03,GSW vs. CLE,W,238,122,...,0.619,7,34,41,28,3,8,12,25,19.0
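
With a list of game IDs in hand, the original goal (play-by-play for a whole season) could be sketched roughly as below. The function `collect_play_by_play`, its `fetch` parameter, and the `pause_seconds` delay are all hypothetical. The fetcher is injected so the sketch runs offline with a fake. The game IDs are written zero-padded to 10 characters by analogy with `'0021700807'`; that padding is an assumption.

```python
import time
from typing import Callable, Iterable

import pandas as pd


def collect_play_by_play(
    game_ids: Iterable[str],
    fetch: Callable[[str], pd.DataFrame],
    pause_seconds: float = 0.0,
) -> pd.DataFrame:
    """Fetch play-by-play for each distinct game ID and stack the results."""
    frames = []
    # dict.fromkeys de-duplicates while keeping first-seen order; the games
    # table lists every game twice (once per team).
    for game_id in dict.fromkeys(game_ids):
        frames.append(fetch(game_id))
        time.sleep(pause_seconds)  # hypothetical politeness delay between requests
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Fake fetcher so the sketch runs without network access.
def fake_fetch(game_id: str) -> pd.DataFrame:
    return pd.DataFrame({"GAME_ID": [game_id], "EVENTNUM": [1]})


season_pbp = collect_play_by_play(
    ["0041700404", "0041700404", "0041700403"], fake_fetch
)
print(season_pbp.shape)  # (2, 2)
```

In the notebook, `fetch` could be something like `lambda gid: playbyplayv2.PlayByPlayV2(gid).get_data_frames()[0]`, with a non-zero `pause_seconds`. I have not run that against a full season.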


Does that first digit relate to month in any way?

In [18]:
games_2017.groupby([games_2017.GAME_DATE.dt.year.rename('YEAR'),
                    games_2017.GAME_DATE.dt.month.rename('MONTH'),
                    games_2017.SEASON_ID.str[0]])['GAME_ID'].nunique()

YEAR  MONTH  SEASON_ID
2017  4      1              1
      5      1             12
             2             33
      6      2             51
      7      2            148
             3              1
      8      2             55
      9      1              2
             2             11
             4             13
      10     1             82
             2            104
             4              2
      11     1              1
             2            343
      12     2            378
2018  1      2            355
      2      2            269
             3              3
      3      2            343
             4              4
      4      2             88
             4             56
      5      4             31
      6      4              3
Name: GAME_ID, dtype: int64

Well, this is bizarre: not only does the first digit of SEASON_ID have no obvious relation to the month, but the 2017 season data begins in April 2017.
[The actual 2017-18 NBA season started October 17](https://www.basketball-reference.com/leagues/NBA_2018_games.html).
What are these games?

In [19]:
games_2017[games_2017.GAME_DATE < '2017-10-01'].head(10)

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,...,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
5225,12017,1610612744,GSW,Golden State Warriors,11700001,2017-09-30,GSW vs. DEN,L,240,102,...,0.833,17,36,53,17,13,4,14,28,-6.0
5226,12017,1610612750,MIN,Minnesota Timberwolves,11700002,2017-09-30,MIN @ LAL,W,240,108,...,0.8,14,35,49,28,7,3,17,21,9.0
5227,12017,1610612743,DEN,Denver Nuggets,11700001,2017-09-30,DEN @ GSW,W,240,108,...,0.676,9,40,49,23,6,0,18,24,6.0
5228,12017,1610612747,LAL,Los Angeles Lakers,11700002,2017-09-30,LAL vs. MIN,L,241,99,...,0.706,13,35,48,27,10,6,17,29,-9.0
5229,42017,1611661320,LAS,Los Angeles Sparks,1041700403,2017-09-29,LAS vs. MIN,W,200,75,...,0.842,8,26,34,19,11,5,15,14,11.0
5230,42017,1611661324,MIN,Minnesota Lynx,1041700403,2017-09-29,MIN @ LAS,L,202,64,...,0.7,4,23,27,13,11,5,15,21,-11.0
5231,42017,1611661324,MIN,Minnesota Lynx,1041700402,2017-09-26,MIN vs. LAS,W,199,70,...,0.65,7,29,36,10,9,6,12,17,2.0
5232,42017,1611661320,LAS,Los Angeles Sparks,1041700402,2017-09-26,LAS @ MIN,L,199,68,...,0.895,5,24,29,19,6,2,12,21,-2.0
5233,42017,1611661320,LAS,Los Angeles Sparks,1041700401,2017-09-24,LAS @ MIN,W,200,85,...,0.632,8,26,34,19,6,1,14,17,1.0
5234,42017,1611661324,MIN,Minnesota Lynx,1041700401,2017-09-24,MIN vs. LAS,L,200,84,...,0.692,7,28,35,22,6,3,15,19,-1.0


Many of these are WNBA games, and all of the WNBA rows shown have the 42017 SEASON_ID (the four NBA rows have 12017).

Could that be the SEASON_ID first-digit secret?

In [20]:
games_2017.groupby([games_2017.SEASON_ID.str[0], games_2017.TEAM_NAME])['GAME_ID'].nunique()

SEASON_ID  TEAM_NAME               
1          Agua Caliente Clippers       1
           Atlanta Dream                1
           Atlanta Hawks                5
           Boston Celtics               4
           Brisbane Bullets             1
           Brooklyn Nets                4
           Charlotte Hornets            5
           Chicago Bulls                6
           Chicago Sky                  2
           Cleveland Cavaliers          5
           Connecticut Sun              3
           Dallas Mavericks             6
           Dallas Wings                 2
           Denver Nuggets               5
           Detroit Pistons              5
           Erie BayHawks                1
           Golden State Warriors        4
           Guangzhou Long-Lions         1
           Haifa Maccabi Haifa          3
           Houston Rockets              5
           Indiana Fever                2
           Indiana Pacers               4
           Iowa Wolves                  

Nope, there are NBA, WNBA, and G League teams throughout.
I even found a Chinese team and an Israeli team.

So this complicates things.
Not only is the first digit of SEASON_ID still unexplained, but there are lots of non-NBA teams here and no obvious way to remove them.
Time to look at that LEAGUE_ID column.
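
Until the league question is sorted out, one stopgap is to filter on an explicit set of team IDs. This is only a sketch: the helper `keep_teams` is made up, and the caller has to supply the set of NBA `TEAM_ID` values, which this notebook has not established. The toy rows reuse IDs from the output above.

```python
import pandas as pd


def keep_teams(games: pd.DataFrame, team_ids: set) -> pd.DataFrame:
    """Keep only games in which every listed team is in `team_ids`."""
    is_known = games["TEAM_ID"].isin(team_ids)
    # A game is kept only if all of its rows (one per team) belong to known teams.
    all_known = is_known.groupby(games["GAME_ID"]).transform("all")
    return games[all_known]


toy = pd.DataFrame({
    "GAME_ID": ["11700001", "11700001", "1041700403", "1041700403"],
    "TEAM_ID": [1610612744, 1610612743, 1611661320, 1611661324],
    "TEAM_NAME": [
        "Golden State Warriors",
        "Denver Nuggets",
        "Los Angeles Sparks",
        "Minnesota Lynx",
    ],
})
nba_ids = {1610612744, 1610612743}  # hand-picked for the toy example only

nba_only = keep_teams(toy, nba_ids)
print(nba_only["TEAM_NAME"].tolist())
```

Filtering per game rather than per row also drops games where an NBA team plays a non-NBA opponent, which a simple row filter would keep.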

If you're following along, go to the [Leagues and Teams notebook](Leagues and Teams.ipynb).