# LUT Generator

The purpose of this notebook is to generate a set of look-up tables, for all the NBA teams' stats at every point in the 2018-19 season.  These stats are
- average points for per game
- average points against per game
- average rebounds for per game
- average rebounds against per game
- average assists for per game
- average assists against per game
- winning percentage

They are then saved into a `pkl` file called `luts.pkl`, which can then be loaded into memory in future notebooks.  We start here by importing some libraries, and reading in the data for this season.  Note that the season data is generated in the `EDA.ipynb` notebook.

In [1]:
import numpy as np
import pandas as pd
import joblib

In [2]:
df = pd.read_csv('Data/games_2018-19.csv')
teams = pd.read_csv('Data/teams.csv')
df.head()

Unnamed: 0,GAME_DATE_EST,GAME_ID,GAME_STATUS_TEXT,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,TEAM_ID_home,PTS_home,FG_PCT_home,FT_PCT_home,...,FG_PCT_away,FT_PCT_away,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS,date,year,month,day
0,2018-10-16,21800002,Final,1610612744,1610612760,2018,1610612744,108.0,0.442,0.944,...,0.363,0.649,0.27,21.0,45.0,1,2018-10-16,2018,10,16
1,2018-10-16,21800001,Final,1610612738,1610612755,2018,1610612738,105.0,0.433,0.714,...,0.391,0.609,0.192,18.0,47.0,1,2018-10-16,2018,10,16
2,2018-10-17,21800012,Final,1610612746,1610612743,2018,1610612746,98.0,0.398,0.833,...,0.379,0.786,0.333,20.0,56.0,0,2018-10-17,2018,10,17
3,2018-10-17,21800003,Final,1610612766,1610612749,2018,1610612766,112.0,0.446,0.636,...,0.494,0.75,0.412,26.0,57.0,0,2018-10-17,2018,10,17
4,2018-10-17,21800004,Final,1610612765,1610612751,2018,1610612765,103.0,0.424,0.864,...,0.488,0.682,0.185,28.0,39.0,1,2018-10-17,2018,10,17


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1230 entries, 0 to 1229
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   GAME_DATE_EST     1230 non-null   object 
 1   GAME_ID           1230 non-null   int64  
 2   GAME_STATUS_TEXT  1230 non-null   object 
 3   HOME_TEAM_ID      1230 non-null   int64  
 4   VISITOR_TEAM_ID   1230 non-null   int64  
 5   SEASON            1230 non-null   int64  
 6   TEAM_ID_home      1230 non-null   int64  
 7   PTS_home          1230 non-null   float64
 8   FG_PCT_home       1230 non-null   float64
 9   FT_PCT_home       1230 non-null   float64
 10  FG3_PCT_home      1230 non-null   float64
 11  AST_home          1230 non-null   float64
 12  REB_home          1230 non-null   float64
 13  TEAM_ID_away      1230 non-null   int64  
 14  PTS_away          1230 non-null   float64
 15  FG_PCT_away       1230 non-null   float64
 16  FT_PCT_away       1230 non-null   float64


It looks like our date column is not in datetime format, and it would be useful to have a dictionary where we can look up a team's name, given their team ID.

In [4]:
# re-format the date column
df['date'] = df['date'].astype('datetime64')

# generate a team name-ID dictionary from the teams dataset
teams = teams[['TEAM_ID', 'ABBREVIATION']]
teams = teams.set_index('TEAM_ID')
teams = teams.to_dict()['ABBREVIATION']

# generate this same dictionary in reverse (i.e. ID-name instead of name-ID)
teams_rev = dict((v,k) for k,v in teams.items())    

Next, let's define the functions which will calculate our game statistics.  These will be written out fairly long, and I am aware that it could definitely be shortened, but for now I will simply get the LUTs generated, and worry about the quality of the code a little later on.

In [5]:
def get_avg_pts_for(df, teams):
    '''
    This function takes in the games dataframe for a single team, and returns a Series corresponding to the 
    team's average point total per game, at the time of each game
    '''
    df['pts_scored'] = np.where(df['is_home'], df['PTS_home'], df['PTS_away'])
    df['avg_pts'] = round(df['pts_scored'].expanding().mean(), 1)

    return df['avg_pts']

In [6]:
def get_avg_pts_against(df, teams):
    '''
    This function takes in the games dataframe for a single team, and returns a Series corresponding to the 
    team's average point total per game, at the time of each game
    '''
    df['pts_scored'] = np.where(df['is_home'], df['PTS_away'], df['PTS_home'])
    df['avg_pts'] = round(df['pts_scored'].expanding().mean(), 1)

    return df['avg_pts']

In [7]:
def get_avg_reb_for(df, teams):
    '''
    This function takes in the games dataframe for a single team, and returns a Series corresponding to the 
    team's average rebound total per game, at the time of each game
    '''
    df['rebounds'] = np.where(df['is_home'], df['REB_home'], df['REB_away'])
    df['avg_reb'] = round(df['rebounds'].expanding().mean(), 1)

    return df['avg_reb']

In [8]:
def get_avg_reb_against(df, teams):
    '''
    This function takes in the games dataframe for a single team, and returns a Series corresponding to the 
    team's average rebound total per game, at the time of each game
    '''
    df['rebounds'] = np.where(~df['is_home'], df['REB_home'], df['REB_away'])
    df['avg_reb'] = round(df['rebounds'].expanding().mean(), 1)

    return df['avg_reb']

In [9]:
def get_avg_ast_for(df, teams):
    '''
    This function takes in the games dataframe for a single team, and returns a Series corresponding to the 
    team's average rebound total per game, at the time of each game
    '''
    df['assists'] = np.where(df['is_home'], df['AST_home'], df['AST_away'])
    df['avg_ast'] = round(df['assists'].expanding().mean(), 1)

    return df['avg_ast']

In [10]:
def get_avg_ast_against(df, teams):
    '''
    This function takes in the games dataframe for a single team, and returns a Series corresponding to the 
    team's average rebound total per game, at the time of each game
    '''
    df['assists'] = np.where(~df['is_home'], df['AST_home'], df['AST_away'])
    df['avg_ast'] = round(df['assists'].expanding().mean(), 1)

    return df['avg_ast']

In [11]:
def get_win_pct(df, teams):
    '''
    This function takes in the games dataframe for a single team, and returns a Series corresponding to the
    team's winning percentage at the time of each game
    '''
    df['is_win'] = np.where(df['HOME_TEAM_WINS'] == df['is_home'], 1, 0)
    df['num_wins'] = df['is_win'].cumsum()
    df['num_games'] = df.index + 1
    df['win_pct'] = round(df['num_wins'] / df['num_games'], 3)
    
    return df['win_pct']

Now we need to generate tables for each team, each row of which will correspond to a game in which that team played.  The columsn will represent the cumulative stats of that team, after that game has been played (e.g. updated winning percentage).  We can then use those tables to generate LUTs for each individual statistic.

Note that the tables for each team will be stored in a dictionary, `data_dict`.

In [12]:
# initialize the dictionary of dataframes
data_dict = {}

# cycle through the different teams 
for team_id in teams:
    
    # select out games which this team played in
    data_dict[team_id] = df.loc[(df['HOME_TEAM_ID'] == team_id) | \
                                (df['VISITOR_TEAM_ID'] == team_id), :].reset_index().drop('index', axis=1)
    
    # knowing whether this is a home game or not will make the calculations easier later
    data_dict[team_id]['team_name'] = teams[team_id]
    data_dict[team_id]['home_name'] = data_dict[team_id]['HOME_TEAM_ID'].map(teams)
    data_dict[team_id]['is_home'] = np.where(data_dict[team_id]['team_name'] == data_dict[team_id]['home_name'], 1, 0)
    
    # get average point numbers
    data_dict[team_id]['avg_pts_against'] = get_avg_pts_against(data_dict[team_id], teams)
    data_dict[team_id]['avg_pts_for']     = get_avg_pts_for(data_dict[team_id], teams)
    
    # get average rebound numbers
    data_dict[team_id]['avg_reb_for']     = get_avg_reb_for(data_dict[team_id], teams)
    data_dict[team_id]['avg_reb_against'] = get_avg_reb_against(data_dict[team_id], teams)
    
    # get average assist numbers
    data_dict[team_id]['avg_ast_for']     = get_avg_ast_for(data_dict[team_id], teams)
    data_dict[team_id]['avg_ast_against']     = get_avg_ast_against(data_dict[team_id], teams)
    
    # get winning percentage
    data_dict[team_id]['win_pct']         = get_win_pct(data_dict[team_id], teams)
    
    # reduce the number of columns
    data_dict[team_id] = data_dict[team_id].loc[:, ['date', 'avg_pts_for', 'avg_pts_against', 
                                                    'avg_reb_for', 'avg_reb_against', 'avg_ast_for',
                                                    'avg_ast_against', 'win_pct']]

Let's take a look at one of these newly generated dataframes:

In [15]:
test = data_dict[teams_rev['TOR']]
test.tail()

Unnamed: 0,date,avg_pts_for,avg_pts_against,avg_reb_for,avg_reb_against,avg_ast_for,avg_ast_against,win_pct
77,2019-04-01,114.4,108.4,45.0,44.7,25.4,25.4,0.705
78,2019-04-03,114.4,108.4,45.1,44.8,25.4,25.4,0.709
79,2019-04-05,114.3,108.4,45.1,44.7,25.4,25.4,0.7
80,2019-04-07,114.4,108.5,45.1,44.8,25.4,25.5,0.704
81,2019-04-09,114.4,108.4,45.2,44.6,25.4,25.5,0.707


It looks like we've generated these tables correctly (you can fact-check these numbers with a quick Google search).  Let's now convert this into some LUTs.

In [16]:
# set up the different stats we're going to make LUTs for
stats = ['avg_pts_for', 'avg_pts_against', 'avg_reb_for', 'avg_reb_against', 
         'avg_ast_for', 'avg_ast_against', 'win_pct']

# initialize the dictionary of LUTs
luts = {}

# loop through the stats
for stat in stats:

    # initialize the LUT for this stat
    lut = pd.DataFrame(pd.date_range(start='10/16/2018', end='4/10/2019'), columns=['date'])

    # loop through each team in the dictionary of dataframes
    for team_id in data_dict:

        # get the team name
        team = teams[team_id]

        # merge the date DataFrame with the column for this stat for this team
        lut = pd.merge(lut, data_dict[team_id][['date', stat]], how='left', on='date')
        lut.rename(columns={stat : team}, inplace=True)
        lut.fillna(method='ffill', inplace=True)
        lut.fillna(0.0, inplace=True)
    
    luts[stat] = lut

We can check how these LUTs came out:

In [17]:
luts['avg_pts_against'].tail()

Unnamed: 0,date,ATL,BOS,NOP,CHI,DAL,DEN,HOU,LAC,LAL,...,SAS,OKC,TOR,UTA,MEM,WAS,DET,CHA,CLE,GSW
172,2019-04-06,119.2,107.8,116.7,113.4,110.0,106.6,109.1,113.8,113.7,...,110.4,110.8,108.4,105.9,105.8,117.0,107.7,112.2,114.0,111.2
173,2019-04-07,119.2,107.9,116.8,113.4,110.2,106.7,109.1,114.0,113.6,...,110.2,111.0,108.5,106.0,106.1,116.9,107.7,111.9,114.0,111.1
174,2019-04-08,119.2,107.9,116.8,113.4,110.2,106.7,109.1,114.0,113.6,...,110.2,111.0,108.5,106.0,106.1,116.9,107.7,111.9,114.0,111.1
175,2019-04-09,119.2,108.0,116.8,113.2,110.2,106.9,109.1,114.0,113.5,...,110.2,111.0,108.4,106.0,106.0,116.9,107.5,111.7,114.1,111.0
176,2019-04-10,119.4,108.0,116.8,113.4,110.1,106.7,109.1,114.3,113.5,...,110.0,111.1,108.4,106.5,106.1,116.9,107.3,111.8,114.1,111.2


It looks like we've got ourselves some LUTs! We'll save the dictionary of LUTs so that we can load them into memory later.

In [18]:
joblib.dump(luts, 'Data/luts.pkl')

['Data/luts.pkl']