# LUT Generator

The purpose of this notebook is to generate a set of look-up tables, for all the NBA teams' stats at every point in the 2018-19 season.  These stats are
- average points for per game
- average points against per game
- average rebounds for per game
- average rebounds against per game
- average assists for per game
- average assists against per game
- winning percentage

They are then saved into a `pkl` file called `luts.pkl`, which can then be loaded into memory in future notebooks.  We start here by importing some libraries, and reading in the data for this season.  Note that the season data is generated in the `EDA.ipynb` notebook.

In [1]:
import numpy as np
import pandas as pd
import joblib

In [2]:
df = pd.read_csv('Data/games.csv')
teams = pd.read_csv('Data/teams.csv')
df.head()

Unnamed: 0,GAME_DATE_EST,GAME_ID,GAME_STATUS_TEXT,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,TEAM_ID_home,PTS_home,FG_PCT_home,FT_PCT_home,...,AST_home,REB_home,TEAM_ID_away,PTS_away,FG_PCT_away,FT_PCT_away,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS
0,2021-05-26,42000102,Final,1610612755,1610612764,2020,1610612755,120.0,0.557,0.684,...,26.0,45.0,1610612764,95.0,0.402,0.633,0.091,22.0,40.0,1
1,2021-05-26,42000132,Final,1610612752,1610612737,2020,1610612752,101.0,0.383,0.739,...,15.0,54.0,1610612737,92.0,0.369,0.818,0.273,17.0,41.0,1
2,2021-05-26,42000142,Final,1610612762,1610612763,2020,1610612762,141.0,0.544,0.774,...,28.0,42.0,1610612763,129.0,0.541,0.763,0.348,20.0,33.0,1
3,2021-05-25,42000112,Final,1610612751,1610612738,2020,1610612751,130.0,0.523,0.955,...,31.0,46.0,1610612738,108.0,0.424,0.783,0.353,23.0,43.0,1
4,2021-05-25,42000152,Final,1610612756,1610612747,2020,1610612756,102.0,0.465,0.933,...,21.0,31.0,1610612747,109.0,0.45,0.871,0.303,24.0,39.0,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24677 entries, 0 to 24676
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   GAME_DATE_EST     24677 non-null  object 
 1   GAME_ID           24677 non-null  int64  
 2   GAME_STATUS_TEXT  24677 non-null  object 
 3   HOME_TEAM_ID      24677 non-null  int64  
 4   VISITOR_TEAM_ID   24677 non-null  int64  
 5   SEASON            24677 non-null  int64  
 6   TEAM_ID_home      24677 non-null  int64  
 7   PTS_home          24578 non-null  float64
 8   FG_PCT_home       24578 non-null  float64
 9   FT_PCT_home       24578 non-null  float64
 10  FG3_PCT_home      24578 non-null  float64
 11  AST_home          24578 non-null  float64
 12  REB_home          24578 non-null  float64
 13  TEAM_ID_away      24677 non-null  int64  
 14  PTS_away          24578 non-null  float64
 15  FG_PCT_away       24578 non-null  float64
 16  FT_PCT_away       24578 non-null  float6

It looks like our date column is not in datetime format, and it would be useful to have a dictionary where we can look up a team's name, given their team ID.

In [4]:
# re-name the date column
df = df.rename(columns={'GAME_DATE_EST' : 'date'})

# re-format the date column
df['date'] = df['date'].astype('datetime64')

# sort the df by date
df = df.sort_values('date')

# generate a team name-ID dictionary from the teams dataset
teams = teams[['TEAM_ID', 'ABBREVIATION']]
teams = teams.set_index('TEAM_ID')
id_to_name = teams.to_dict()['ABBREVIATION']

# generate this same dictionary in reverse (i.e. ID-name instead of name-ID)
name_to_id = dict((v,k) for k,v in id_to_name.items())    

Next, let's define the functions which will calculate our game statistics.  These will be written out fairly long, and I am aware that it could definitely be shortened, but for now I will simply get the LUTs generated, and worry about the quality of the code a little later on.

In [5]:
def get_avg_pts_for(df, teams):
    '''
    This function takes in the games dataframe for a single team, and returns a Series corresponding to the 
    team's average point total per game, at the time of each game
    '''
    df['pts_scored'] = np.where(df['is_home'], df['PTS_home'], df['PTS_away'])
    df['avg_pts'] = round(df['pts_scored'].expanding().mean(), 1)

    return df['avg_pts']

In [6]:
def get_avg_pts_against(df, teams):
    '''
    This function takes in the games dataframe for a single team, and returns a Series corresponding to the 
    team's average point total per game, at the time of each game
    '''
    df['pts_scored'] = np.where(df['is_home'], df['PTS_away'], df['PTS_home'])
    df['avg_pts'] = round(df['pts_scored'].expanding().mean(), 1)

    return df['avg_pts']

In [7]:
def get_avg_reb_for(df, teams):
    '''
    This function takes in the games dataframe for a single team, and returns a Series corresponding to the 
    team's average rebound total per game, at the time of each game
    '''
    df['rebounds'] = np.where(df['is_home'], df['REB_home'], df['REB_away'])
    df['avg_reb'] = round(df['rebounds'].expanding().mean(), 1)

    return df['avg_reb']

In [8]:
def get_avg_reb_against(df, teams):
    '''
    This function takes in the games dataframe for a single team, and returns a Series corresponding to the 
    team's average rebound total per game, at the time of each game
    '''
    df['rebounds'] = np.where(~df['is_home'], df['REB_home'], df['REB_away'])
    df['avg_reb'] = round(df['rebounds'].expanding().mean(), 1)

    return df['avg_reb']

In [9]:
def get_avg_ast_for(df, teams):
    '''
    This function takes in the games dataframe for a single team, and returns a Series corresponding to the 
    team's average rebound total per game, at the time of each game
    '''
    df['assists'] = np.where(df['is_home'], df['AST_home'], df['AST_away'])
    df['avg_ast'] = round(df['assists'].expanding().mean(), 1)

    return df['avg_ast']

In [10]:
def get_avg_ast_against(df, teams):
    '''
    This function takes in the games dataframe for a single team, and returns a Series corresponding to the 
    team's average rebound total per game, at the time of each game
    '''
    df['assists'] = np.where(~df['is_home'], df['AST_home'], df['AST_away'])
    df['avg_ast'] = round(df['assists'].expanding().mean(), 1)

    return df['avg_ast']

In [11]:
def get_win_pct(df, teams):
    '''
    This function takes in the games dataframe for a single team, and returns a Series corresponding to the
    team's winning percentage at the time of each game
    '''
    df['is_win'] = np.where(df['HOME_TEAM_WINS'] == df['is_home'], 1, 0)
    df['num_wins'] = df['is_win'].cumsum()
    df['num_games'] = df.index + 1
    df['win_pct'] = round(df['num_wins'] / df['num_games'], 3)
    
    return df['win_pct']

Since we have pre-season and playoff games in our dataset, let's take into account the start and end dates of each regular season.  This way, our cumulative season stats will make more sense.

In [12]:
start_dates = {'2003' : '2003-10-28',
              '2004' : '2004-11-02',
              '2005' : '2005-11-01',
              '2006' : '2006-10-31',
              '2007' : '2007-10-30',
              '2008' : '2008-10-28',
              '2009' : '2009-10-27',
              '2010' : '2010-10-26',
              '2011' : '2010-12-25',
              '2012' : '2012-10-30',
              '2013' : '2013-10-29',
              '2014' : '2014-10-28',
              '2015' : '2015-10-27',
              '2016' : '2016-10-25',
              '2017' : '2017-10-17',
              '2018' : '2018-10-16'}

end_dates = {'2003' : '2004-04-14',
            '2004' : '2005-04-20',
            '2005' : '2006-04-19',
            '2006' : '2007-04-18',
            '2007' : '2008-04-16',
            '2008' : '2009-04-16',
            '2009' : '2010-04-14',
            '2010' : '2011-04-13',
            '2011' : '2012-04-26',
            '2012' : '2013-04-17',
            '2013' : '2014-04-16',
            '2014' : '2015-04-15',
            '2015' : '2016-04-13',
            '2016' : '2017-04-12',
            '2017' : '2018-04-11',
            '2018' : '2019-04-10'}

Now we need to generate tables for each team, each row of which will correspond to a game in which that team played.  The columns will represent the cumulative stats of that team, after that game has been played (e.g. updated winning percentage).  We can then use those tables to generate LUTs for each individual statistic.

Note that the tables for each team will be stored in a single dictionary, `data_dict`.

In [13]:
# initialize the dictionary of dataframes
data_dict = {}

# initialize the seasons to loop through
seasons = np.arange(2003, 2019, 1)

# cycle through the different seasons
for season in seasons:

    # filter original dataframe for games which occurred during this season
    tmp = df.loc[df['SEASON'] == season, :]
    
    # filter for regular season games only
    tmp = tmp.loc[(tmp['date'] >= start_dates[str(season)]) & \
                 (tmp['date'] <= end_dates[str(season)]), :]
    
    # cycle through the different teams 
    for team_id in id_to_name:

        print(f'Generating game-by-game statistics for {id_to_name[team_id]} in {season} season', end='\r')

        # this will be the unique key which will identify this teams df for this season
        id_ = str(season) + '_' + id_to_name[team_id]
        
        # select out games from this season which this team played in
        data_dict[id_] = tmp.loc[(df['HOME_TEAM_ID'] == team_id) | \
                                    (df['VISITOR_TEAM_ID'] == team_id), :].reset_index().drop('index', axis=1)
        
        # not all the teams are in the league for all the seasons, so if it's an empty df, just continue
        if (data_dict[id_].shape[0] == 0):
            continue

        # knowing whether this is a home game or not will make the calculations easier later
        data_dict[id_]['team_name'] = id_to_name[team_id]
        data_dict[id_]['home_name'] = data_dict[id_]['HOME_TEAM_ID'].map(id_to_name)
        data_dict[id_]['is_home'] = np.where(data_dict[id_]['team_name'] == data_dict[id_]['home_name'], 1, 0)

        # get average point numbers
        data_dict[id_]['avg_pts_against'] = get_avg_pts_against(data_dict[id_], id_to_name)
        data_dict[id_]['avg_pts_for']     = get_avg_pts_for(data_dict[id_], id_to_name)

        # get average rebound numbers
        data_dict[id_]['avg_reb_for']     = get_avg_reb_for(data_dict[id_], id_to_name)
        data_dict[id_]['avg_reb_against'] = get_avg_reb_against(data_dict[id_], id_to_name)

        # get average assist numbers
        data_dict[id_]['avg_ast_for']     = get_avg_ast_for(data_dict[id_], id_to_name)
        data_dict[id_]['avg_ast_against']     = get_avg_ast_against(data_dict[id_], id_to_name)

        # get winning percentage
        data_dict[id_]['win_pct']         = get_win_pct(data_dict[id_], id_to_name)

        # reduce the number of columns to only the date and stats we want
        data_dict[id_] = data_dict[id_].loc[:, ['date', 'avg_pts_for', 'avg_pts_against', 
                                                        'avg_reb_for', 'avg_reb_against', 'avg_ast_for',
                                                        'avg_ast_against', 'win_pct']]

        last_ix = data_dict[id_].shape[0] - 1
        last_date = data_dict[id_].loc[data_dict[id_].shape[0] - 1, 'date']

        # shift the dataframe so that we are looking at the average going into that game
        data_dict[id_] = data_dict[id_].shift(1)
        data_dict[id_]['new_date'] = data_dict[id_].shift(-1)['date']
        data_dict[id_].loc[last_ix, 'new_date'] = data_dict[id_].loc[last_ix, 'date']
        data_dict[id_] = data_dict[id_].fillna(0.0)
        data_dict[id_].loc[last_ix, 'new_date'] = last_date
        data_dict[id_]['date'] = data_dict[id_]['new_date']
        data_dict[id_].drop('new_date', axis=1, inplace=True)

Generating game-by-game statistics for GSW in 2018 season

Let's take a look at one of these newly generated dataframes:

In [14]:
test = data_dict['2018_TOR']
test.head()

Unnamed: 0,date,avg_pts_for,avg_pts_against,avg_reb_for,avg_reb_against,avg_ast_for,avg_ast_against,win_pct
0,2018-10-17,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2018-10-19,116.0,104.0,43.0,43.0,21.0,21.0,1.0
2,2018-10-20,114.5,102.5,46.0,46.0,22.5,22.5,1.0
3,2018-10-22,115.3,106.0,48.0,43.0,22.0,21.7,1.0
4,2018-10-24,118.2,106.0,47.5,43.8,25.5,25.2,1.0


In [15]:
test.shape

(82, 8)

It looks like we've generated these tables correctly (you can fact-check these numbers with a quick Google search).  Let's now convert this into some LUTs.

In [16]:
# set up the different stats we're going to make LUTs for
stats = ['avg_pts_for', 'avg_pts_against', 'avg_reb_for', 'avg_reb_against', 
         'avg_ast_for', 'avg_ast_against', 'win_pct']

# initialize the dictionary of LUTs
luts = {}

for season in seasons:

    print(f'Generating LUTs for season {season}', end='\r')
    
    # loop through the stats
    for stat in stats:

        # initialize the LUT for this stat
        lut = pd.DataFrame(pd.date_range(start=start_dates[str(season)], 
                                         end=end_dates[str(season)]), columns=['date'])

        # loop through each team in the dictionary of dataframes
        for team in name_to_id.keys():

            # this is the unique id which is a key in our data_dict dictionary
            data_dict_id = str(season) + '_' + team

            # some of them are empty, so we need to account for that
            if (data_dict[data_dict_id]).empty:
                continue
            
            # merge the date DataFrame with the column for this stat for this team
            lut = pd.merge(lut, data_dict[data_dict_id][['date', stat]], how='left', on='date')
            lut.rename(columns={stat : team}, inplace=True)
            lut.fillna(method='ffill', inplace=True)
            lut.fillna(0.0, inplace=True)

        # set up the unique key for this stat and this season, and save the newly generated LUT
        id_ = str(season) + '_' + stat
        luts[id_] = lut

Generating LUTs for season 2018

We can check how these LUTs came out:

In [17]:
luts['2018_avg_pts_for'].head()

Unnamed: 0,date,ATL,BOS,NOP,CHI,DAL,DEN,HOU,LAC,LAL,...,SAS,OKC,TOR,UTA,MEM,WAS,DET,CHA,CLE,GSW
0,2018-10-16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2018-10-17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2018-10-18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2018-10-19,107.0,105.0,131.0,0.0,0.0,0.0,0.0,98.0,0.0,...,0.0,100.0,116.0,123.0,83.0,0.0,0.0,112.0,104.0,108.0
4,2018-10-20,107.0,103.0,131.0,108.0,100.0,107.0,112.0,98.0,119.0,...,112.0,100.0,114.5,123.0,83.0,112.0,103.0,116.0,104.0,108.0


In [18]:
luts['2018_avg_pts_for'].tail()

Unnamed: 0,date,ATL,BOS,NOP,CHI,DAL,DEN,HOU,LAC,LAL,...,SAS,OKC,TOR,UTA,MEM,WAS,DET,CHA,CLE,GSW
172,2019-04-06,113.2,112.3,115.2,105.2,108.6,110.8,113.4,114.9,111.8,...,111.5,114.0,114.4,111.3,102.8,114.2,107.2,110.6,104.6,117.6
173,2019-04-07,113.2,112.4,115.4,105.2,108.7,110.9,113.5,114.9,111.9,...,111.7,114.1,114.3,111.4,103.0,114.1,107.2,110.6,104.8,117.6
174,2019-04-08,113.2,112.4,115.4,105.2,108.7,110.9,113.5,114.9,111.9,...,111.7,114.1,114.3,111.4,103.0,114.1,107.2,110.6,104.8,117.6
175,2019-04-09,113.2,112.3,115.6,105.1,108.9,110.8,114.0,114.9,111.9,...,111.7,114.4,114.4,111.3,103.3,114.1,107.0,110.5,104.6,117.8
176,2019-04-10,113.1,112.3,115.6,104.9,109.0,110.8,114.0,114.8,111.9,...,111.7,114.3,114.4,111.4,103.2,114.1,107.0,110.7,104.6,117.7


It looks like we've got ourselves some LUTs! We'll save the dictionary of LUTs so that we can load them into memory later.

In [19]:
joblib.dump(luts, 'Data/luts2.pkl')

['Data/luts2.pkl']