# Predicting in-game win probabilities of NBA games

## Why? How? What else has been done?

Before I moved to Canada I never really watched basketball live, mostly because of the time difference and me liking my sleep. But when I got the opportunity to move here in March/April 2019, I witnessed the Toronto Raptors make their championship run, and I've been hooked ever since.

Naturally, as a data scientist and someone who studied maths at university, I was drawn to the numbers side of basketball. Advanced stats, different ways of measuring impact, Elo scores and other people's work with basketball data showed me the other half of the sport.

Seeing other people's efforts really highlighted how much data can be collected and analysed, so I decided to do my own project with NBA data. Python has an excellent [API wrapper](https://github.com/swar/nba_api) around stats.nba.com, with access to things I didn't even know were tracked.

An interesting part of the API is that it provides live play-by-play data, and I became interested in how NBA games are predicted live.

## Research / Review

Looking around at what had been done I found a few blogs/articles/papers mostly related to other sports but some around basketball (NCAA and NBA).

- [Bayesian approach to predicting football (not soccer) games](https://dtai.cs.kuleuven.be/sports/blog/a-bayesian-approach-to-in-game-win-probability) [1]

- [Brian Burke's NFL forecasting](http://wagesofwins.com/2009/03/05/modeling-win-probability-for-a-college-basketball-game-a-guest-post-from-brian-burke/) [2]

- [py-ball](https://github.com/basketballrelativity/py_ball) [3]

- [inpredictable](http://stats.inpredictable.com/nba/wpBox_live.php) [4]


Reading [1] made it clear that predicting football is a lot harder, since many games end in draws and scoring is infrequent.

[2] and [3] gave me the first step towards building a model to predict games. The approach used there is to split the game into n-second intervals and then build a series of logistic regression models, one for each interval.
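To make the interval idea concrete, here is a minimal sketch, not the model from [2] or [3]. It assumes a single feature (the home team's score margin at each checkpoint), one checkpoint per quarter, and made-up games simulated as a random walk. The helper names (`fit_logistic`, `simulate_game`, `win_probability`) and all constants are hypothetical, and the tiny gradient-descent fit only stands in for a proper library implementation.

```python
import math
import random

GAME_SECONDS = 48 * 60   # regulation length of an NBA game
INTERVAL = 12 * 60       # assumed n: one snapshot per quarter

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(margins, wins, lr=0.01, epochs=300):
    """Fit P(home win) = sigmoid(w * margin + b) by batch gradient descent."""
    w = b = 0.0
    n = len(margins)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(margins, wins):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def simulate_game(rng, checkpoints):
    """Made-up game: the home margin follows a Gaussian random walk."""
    margin, elapsed, snapshots = 0.0, 0, {}
    for t in checkpoints + [GAME_SECONDS]:
        # 0.05 is an arbitrary variance per second (final margin std of 12 points)
        margin += rng.gauss(0.0, math.sqrt(0.05 * (t - elapsed)))
        elapsed = t
        snapshots[t] = margin
    home_win = 1 if margin > 0 else 0
    return snapshots, home_win

rng = random.Random(0)
checkpoints = list(range(INTERVAL, GAME_SECONDS, INTERVAL))  # 720, 1440, 2160
games = [simulate_game(rng, checkpoints) for _ in range(300)]

# One independent model per interval, each trained on the margin at that time.
models = {}
for t in checkpoints:
    margins = [snap[t] for snap, _ in games]
    wins = [win for _, win in games]
    models[t] = fit_logistic(margins, wins)

def win_probability(t, margin):
    """Home win probability from the model fitted for checkpoint t."""
    w, b = models[t]
    return sigmoid(w * margin + b)

for t in checkpoints:
    print(t, round(win_probability(t, 5.0), 3))
```

Keeping a separate model per interval lets the weight on the score margin differ between early and late game states, which a single model over all time steps would have to capture with extra interaction terms.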

[4] gave me a sense of what other blogs were doing and something to compare my graphs to.

## Data - Gathering/Processing/Cleaning

As mentioned before, Python has a great wrapper for the stats.nba.com API, [nba_api](https://github.com/swar/nba_api), which is worth checking out in your own time just to see the volume of data available to play with. I wrote a simple script to collect all the play-by-play data for roughly the last seven years.
The problem is rate limits!
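One common way to cope with rate limits is to retry a failed request after a growing pause. The sketch below is an assumption about how that could look, not part of the original script: `fetch_with_backoff` is a hypothetical helper, `fetch` stands for any zero-argument callable wrapping a single API request, and the retry count and delays are illustrative.

```python
import time

def fetch_with_backoff(fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on failure wait base_delay * 2**attempt seconds, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception:  # a real script would catch narrower error types
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)

# Demo with a fake request that fails twice before succeeding.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate-limit error")
    return "ok"

waits = []
result = fetch_with_backoff(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

With the real wrapper, `fetch` could be something like `lambda: leaguegamelog.LeagueGameLog(season='2013-14').get_data_frames()[0]`, using the same endpoint call as the script in this post.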

In [16]:
# Script to get all live play-by-play data; this first part pulls the league game logs
from nba_api.stats.endpoints import leaguegamelog
import pandas as pd
import tqdm

In [13]:
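# Season strings in 'YYYY-YY' form, '2013-14' through '2021-22'; adjust the range as needed
seasons = [f'20{x}-{x+1}' for x in range(13, 22)]

# One request per season; keep the first DataFrame returned for each
league_game_logs = []
for season in seasons:
    game_log = leaguegamelog.LeagueGameLog(season=season).get_data_frames()[0]
    league_game_logs.append(game_log)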

In [17]:
# Stack the per-season logs into a single DataFrame
league_game_log = pd.concat(league_game_logs)

#### Filtering for home games

I'm filtering for home games below because a basketball game has only two outcomes, win or lose. Once you have the probability that the home team wins, the away team's win probability is simply one minus that, so the modelling effort focuses on the home team winning.

In [19]:
# Away rows have '@' in MATCHUP (e.g. 'LAC @ LAL'); keep only the home rows (e.g. 'LAL vs. LAC')
league_game_log = league_game_log[~league_game_log['MATCHUP'].str.contains('@')]

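| | SEASON_ID | TEAM_ID | TEAM_ABBREVIATION | TEAM_NAME | GAME_ID | GAME_DATE | MATCHUP | WL | MIN | FGM | FGA | FG_PCT | FG3M | FG3A | FG3_PCT | FTM | FTA | FT_PCT | OREB | DREB | REB | AST | STL | BLK | TOV | PF | PTS | PLUS_MINUS | VIDEO_AVAILABLE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22013 | 1610612747 | LAL | Los Angeles Lakers | 21300003 | 2013-10-29 | LAL vs. LAC | W | 240 | 42 | 93 | 0.452 | 14 | 29 | 0.483 | 18 | 28 | 0.643 | 18 | 34 | 52 | 23 | 8 | 6 | 19 | 23 | 116 | 13 | 1 |
| 1 | 22013 | 1610612746 | LAC | Los Angeles Clippers | 21300003 | 2013-10-29 | LAC @ LAL | L | 240 | 41 | 83 | 0.494 | 8 | 21 | 0.381 | 13 | 23 | 0.565 | 10 | 30 | 40 | 27 | 11 | 4 | 16 | 21 | 103 | -13 | 1 |
| 2 | 22013 | 1610612741 | CHI | Chicago Bulls | 21300002 | 2013-10-29 | CHI @ MIA | L | 240 | 35 | 83 | 0.422 | 7 | 26 | 0.269 | 18 | 23 | 0.783 | 11 | 30 | 41 | 23 | 11 | 4 | 19 | 27 | 95 | -12 | 1 |
| 3 | 22013 | 1610612748 | MIA | Miami Heat | 21300002 | 2013-10-29 | MIA vs. CHI | W | 240 | 37 | 72 | 0.514 | 11 | 20 | 0.55 | 22 | 29 | 0.759 | 5 | 35 | 40 | 26 | 10 | 7 | 20 | 21 | 107 | 12 | 1 |
| 4 | 22013 | 1610612753 | ORL | Orlando Magic | 21300001 | 2013-10-29 | ORL @ IND | L | 240 | 36 | 93 | 0.387 | 9 | 19 | 0.474 | 6 | 10 | 0.6 | 13 | 26 | 39 | 17 | 10 | 6 | 19 | 26 | 87 | -10 | 1 |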


Why did I do the abov