# Predicting in-game win probabilities of NBA games

## Why? How? What else has been done?

Before I moved to Canada I never really watched basketball live, mostly because of the time difference and me liking my sleep. But when I got the opportunity to move here in March/April 2019, I witnessed the Toronto Raptors make their championship run, and I've been hooked ever since.

Naturally, as a Data Scientist and someone who studied Maths at university, I was drawn to the numbers side of basketball. Advanced stats, different ways of measuring impact, Elo scores and other people's efforts working with basketball data showed me the other half of the sport.

Seeing other people's efforts really highlighted the amount of data that can be collected and analysed, so I decided to do my own project with NBA data. Python has an excellent [API wrapper](https://github.com/swar/nba_api) around stats.nba.com, with access to things I didn't even know were tracked.

An interesting part of the API is that it provides live play-by-play data, and I became interested in how NBA games are predicted live.

## Research / Review

Looking around at what had been done I found a few blogs/articles/papers mostly related to other sports but some around basketball (NCAA and NBA).

- [Bayesian approach to predicting football (not soccer) games](https://dtai.cs.kuleuven.be/sports/blog/a-bayesian-approach-to-in-game-win-probability) [1]

- [Brian Burke's NFL forecasting](http://wagesofwins.com/2009/03/05/modeling-win-probability-for-a-college-basketball-game-a-guest-post-from-brian-burke/) [2]

- [py-ball](https://github.com/basketballrelativity/py_ball) [3]

- [inpredictable](http://stats.inpredictable.com/nba/wpBox_live.php) [4]


Reading [1] definitely cleared up that predicting football is a lot harder, since a lot of games end in draws and scoring is infrequent.

[2] & [3] gave me the first step towards building a model to predict games. The approach used there is to split the game into n-second intervals and then build a series of logistic regression models, one for each interval.
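To make the one-model-per-interval idea concrete, here is a minimal, dependency-free sketch. The toy rows, the single score-difference feature, and the helper names `fit_logistic_1d` and `predict_home_win` are my own illustrative assumptions, not code from [2] or [3]; in practice you would more likely reach for a library implementation of logistic regression.

```python
import math
from collections import defaultdict

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_1d(xs, ys, lr=0.1, epochs=500):
    """Fit P(y=1 | x) = sigmoid(w*x + b) with plain batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean log-loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical toy rows: (interval index, home score diff, 1 if the home team won).
rows = [
    (0, -3, 0), (0, -2, 1), (0, -1, 0), (0, 1, 1), (0, 2, 0), (0, 3, 1),
    (959, -8, 0), (959, -4, 0), (959, -2, 0), (959, 2, 1), (959, 4, 1), (959, 8, 1),
]

# Group the samples by interval, then fit one independent model per interval.
by_interval = defaultdict(list)
for interval, diff, home_win in rows:
    by_interval[interval].append((diff, home_win))

models = {}
for interval, samples in by_interval.items():
    xs = [diff for diff, _ in samples]
    ys = [home_win for _, home_win in samples]
    models[interval] = fit_logistic_1d(xs, ys)

def predict_home_win(interval, score_diff):
    """Home win probability from the model that belongs to this interval."""
    w, b = models[interval]
    return sigmoid(w * score_diff + b)

print(predict_home_win(0, 5), predict_home_win(959, 5))
```

On this toy data a five-point lead is worth more in the last interval than in the first, which is the behaviour you'd hope a per-interval model picks up.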

[4] gave me a sense of what other blogs were doing and something to compare my graphs to.

# Data Gathering

As mentioned before, Python has a GREAT wrapper for the stats.nba.com API, again linked [here](https://github.com/swar/nba_api), which is worth checking out in your own time just to see the volume of data available to play with. For this project I wrote a simple script to collect all the play-by-play data from the 2013-14 season onwards.
The problem is rate limits!

In [8]:
#script to get all live playbyplay data
from nba_api.stats.endpoints import leaguegamelog, playbyplayv2
import pandas as pd

In [2]:
seasons = [f'20{x}-{x+1}' for x in range(13, 22)]  # '2013-14' through '2021-22'; change the range as needed

league_game_logs = []
for season in seasons:
    game_log = leaguegamelog.LeagueGameLog(season=season).get_data_frames()[0]
    league_game_logs.append(game_log)

In [3]:
league_game_log = pd.concat(league_game_logs)

#### Filtering for home games

The reason I'm filtering for home games below is that there are only two outcomes in basketball, win or lose, so once you have the probability that the home team wins you also have the away team's (one minus it). The game log API wrapper returns each game from both teams' perspectives; matchups with '@' in them are rows from the away team's perspective.

In [4]:
league_game_log = league_game_log[~league_game_log['MATCHUP'].str.contains('@')]

Once you have all the game logs you can go and retrieve the play-by-play data. Usually, when there are no rate limits, I spam the API (WITHIN REASON) using multiprocessing. However, since the NBA API has some serious rate limits, this is a slow-burning job that'll take a few hours. I suggest you kick it off and let it run in the background while you get on with something else.

In [12]:
league_game_log

| | SEASON_ID | TEAM_ID | TEAM_ABBREVIATION | TEAM_NAME | GAME_ID | GAME_DATE | MATCHUP | WL | MIN | FGM | ... | DREB | REB | AST | STL | BLK | TOV | PF | PTS | PLUS_MINUS | VIDEO_AVAILABLE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22013 | 1610612747 | LAL | Los Angeles Lakers | 0021300003 | 2013-10-29 | LAL vs. LAC | W | 240 | 42 | ... | 34 | 52 | 23 | 8 | 6 | 19 | 23 | 116 | 13 | 1 |
| 3 | 22013 | 1610612748 | MIA | Miami Heat | 0021300002 | 2013-10-29 | MIA vs. CHI | W | 240 | 37 | ... | 35 | 40 | 26 | 10 | 7 | 20 | 21 | 107 | 12 | 1 |
| 5 | 22013 | 1610612754 | IND | Indiana Pacers | 0021300001 | 2013-10-29 | IND vs. ORL | W | 240 | 34 | ... | 34 | 44 | 17 | 4 | 18 | 21 | 13 | 97 | 10 | 1 |
| 6 | 22013 | 1610612755 | PHI | Philadelphia 76ers | 0021300005 | 2013-10-30 | PHI vs. MIA | W | 240 | 43 | ... | 32 | 40 | 24 | 16 | 1 | 18 | 21 | 114 | 4 | 1 |
| 8 | 22013 | 1610612744 | GSW | Golden State Warriors | 0021300017 | 2013-10-30 | GSW vs. LAL | W | 240 | 46 | ... | 41 | 48 | 34 | 8 | 9 | 15 | 22 | 125 | 31 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1521 | 22021 | 1610612761 | TOR | Toronto Raptors | 0022100784 | 2022-02-01 | TOR vs. MIA | W | 240 | 39 | ... | 28 | 43 | 20 | 9 | 3 | 15 | 24 | 110 | 4 | 1 |
| 1522 | 22021 | 1610612750 | MIN | Minnesota Timberwolves | 0022100770 | 2022-02-01 | MIN vs. DEN | W | 240 | 46 | ... | 39 | 52 | 35 | 8 | 9 | 10 | 21 | 130 | 15 | 1 |
| 1525 | 22021 | 1610612741 | CHI | Chicago Bulls | 0022100769 | 2022-02-01 | CHI vs. ORL | W | 240 | 46 | ... | 38 | 49 | 25 | 5 | 6 | 10 | 15 | 126 | 11 | 1 |
| 1526 | 22021 | 1610612759 | SAS | San Antonio Spurs | 0022100771 | 2022-02-01 | SAS vs. GSW | L | 240 | 46 | ... | 29 | 34 | 33 | 7 | 7 | 14 | 17 | 120 | -4 | 1 |


In [11]:
from tqdm import tqdm
from time import sleep

pbp = []
for game_id in tqdm(league_game_log.GAME_ID):
    pbp.append(playbyplayv2.PlayByPlayV2(game_id).get_data_frames()[0])
    # Pause between requests so we aren't spamming the API and hitting rate limits; tune the delay as needed.
    sleep(0.2)

  0%|▏         | 14/10284 [00:09<1:52:42,  1.52it/s]
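Since rate limits are the main pain point, one defensive pattern is to wrap each request in a retry with exponential backoff. The sketch below is generic: `fetch_with_retry`, its parameters, and the `flaky_fetch` stand-in are hypothetical names of mine, and I'm assuming a failed request surfaces as an exception. In the loop above you could pass something like `lambda: playbyplayv2.PlayByPlayV2(game_id).get_data_frames()[0]` as `fetch`.

```python
import time

def fetch_with_retry(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on an exception wait base_delay * 2**attempt seconds, then retry."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Stand-in for a request that fails twice before succeeding (hypothetical).
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated timeout")
    return "play-by-play rows"

# Record the waits in a list instead of actually sleeping, to keep the demo instant.
waits = []
result = fetch_with_retry(flaky_fetch, sleep=waits.append)
print(result, waits)
```

Doubling the wait after each failure backs off quickly when the API is unhappy while costing nothing on the happy path.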


In [13]:
df = pd.concat(pbp)
df.to_csv('bpbp.csv', index=False)

# Data Preprocessing and Featurizing 

Okay, we have our play-by-play data, but now what?
The first version of the model replicates what has already been done, so the current goal is to build 960 logistic regression models, one for each three-second period of the 2,880 seconds of regulation.

Our play-by-play data is not uniformly generated: records have times from 0-2880, but they are not evenly spaced and definitely don't arrive every three seconds (the shot clock alone runs for 24 seconds). So what can we do?

As Data Scientists we know real-world data is never going to be perfect (formatting, quality, frequency, etc.), but we must do what we can with what we have.

And so we need to make assumptions and preprocess our data accordingly. 

Assumptions we're going to make:
- The state of the game stays the same until the next play-by-play event, i.e. if a play happened at 2700 with the score at 10-15 and the next play happened at 2680 with the score now 12-15, then every moment between 2700 and 2680 still has the score 10-15.
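A minimal sketch of that carry-forward assumption, using the example from the bullet. I'm assuming here that times are seconds remaining (so they count down from 2880), that events arrive sorted from the start of the game to the end, and that the score is 0-0 before the first event; `carry_forward` is a hypothetical helper name.

```python
def carry_forward(events, step=3, total=2880):
    """Sample the score every `step` seconds, holding the last known score.

    events: list of (seconds_remaining, score_a, score_b), sorted by
    seconds_remaining in descending order (i.e. in game order).
    Returns {seconds_remaining: (score_a, score_b)} on the regular grid.
    """
    grid = {}
    idx = 0
    current = (0, 0)  # assumed score before the first recorded event
    for t in range(total, -1, -step):
        # Apply every event that has already happened by clock time t.
        while idx < len(events) and events[idx][0] >= t:
            current = events[idx][1:]
            idx += 1
        grid[t] = current
    return grid

events = [(2700, 10, 15), (2680, 12, 15)]
grid = carry_forward(events)

# 2682 falls between the two plays, so it keeps 10-15; 2679 comes after the second play.
print(grid[2700], grid[2682], grid[2679])
```

Note that this grid includes both end points (2880 and 0), so it has one more sample than there are three-second periods.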

The features we are aiming to produce for the first iteration are:

- Who has possession?
- Score difference
- Is it overtime?
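As a rough sketch, a single feature row for one moment of a game could look like the following. The function name, argument names, and output keys are hypothetical, and I'm assuming the possession flag has already been worked out from the raw events (that step is not shown here). The only hard fact relied on is that an NBA game has four regulation periods, so any period above 4 is overtime.

```python
def make_features(home_score, away_score, period, home_has_ball):
    """Build the three first-iteration features from the home team's perspective."""
    return {
        "score_diff": home_score - away_score,  # positive when the home team leads
        "home_possession": int(home_has_ball),  # 1 if the home team has the ball
        "is_overtime": int(period > 4),         # periods 5+ are overtime
    }

print(make_features(home_score=12, away_score=15, period=1, home_has_ball=True))
```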