# Preprocessing: Feature Addition for Current Game

Now that I have collected and cleaned the data, I want to add several features to help my model perform. I want the model to know how many days of rest the player has had since his last game, whether the game is a home game or an away game, and the team-level statistics for the opposing team. I also want the model to have access to data that goes beyond single-game statistics, so I calculate two kinds of moving averages: expanding averages and rolling averages.

## 1. Organizing Data by Game Date

Since many of the features that I am going to add refer to previous games, the first step I need to take is to organize the dataframe by game date.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('./Data/cleaned_and_merged_data.csv')

In [3]:
df.head()

Unnamed: 0,SEASON_ID,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,...,TEAM_PCT_UAST_2PM,TEAM_PCT_UAST_3PM,TEAM_PCT_UAST_FGM,BIO_AGE,BIO_PLAYER_HEIGHT_INCHES,BIO_PLAYER_WEIGHT,BIO_DRAFT_ROUND,BIO_DRAFT_NUMBER,BIO_SEASON_ID,BIO_PLAYER_SEASON_ID
0,22015,1626162,Kelly Oubre Jr.,1610612764,WAS,Washington Wizards,21501221,2016-04-13,WAS vs. ATL,W,...,0.394,0.444,0.405,20.0,79,205,1,15,2015-16,16261622015-16
1,22015,202397,Ish Smith,1610612755,PHI,Philadelphia 76ers,21501222,2016-04-13,PHI @ CHI,L,...,0.45,0.0,0.243,27.0,72,175,0,0,2015-16,2023972015-16
2,22015,201166,Aaron Brooks,1610612741,CHI,Chicago Bulls,21501222,2016-04-13,CHI vs. PHI,W,...,0.458,0.067,0.308,31.0,72,161,1,26,2015-16,2011662015-16
3,22015,203503,Tony Snell,1610612741,CHI,Chicago Bulls,21501222,2016-04-13,CHI vs. PHI,W,...,0.458,0.067,0.308,24.0,79,200,1,20,2015-16,2035032015-16
4,22015,203924,Jerami Grant,1610612755,PHI,Philadelphia 76ers,21501222,2016-04-13,PHI @ CHI,L,...,0.45,0.0,0.243,22.0,80,210,2,39,2015-16,2039242015-16


In [4]:
df.shape

(201805, 183)

In [25]:
df.columns.tolist()

['SEASON_ID',
 'PLAYER_ID',
 'PLAYER_NAME',
 'TEAM_ID',
 'TEAM_ABBREVIATION',
 'TEAM_NAME',
 'GAME_ID',
 'GAME_DATE',
 'MATCHUP',
 'WL',
 'MIN',
 'FGM',
 'FGA',
 'FG_PCT',
 'FG3M',
 'FG3A',
 'FG3_PCT',
 'FTM',
 'FTA',
 'FT_PCT',
 'OREB',
 'DREB',
 'REB',
 'AST',
 'STL',
 'BLK',
 'TOV',
 'PF',
 'PTS',
 'PLUS_MINUS',
 'PLAYER_GAME_ID',
 'AST_PCT',
 'AST_RATIO',
 'AST_TO',
 'DEF_RATING',
 'DREB_PCT',
 'EFG_PCT',
 'E_DEF_RATING',
 'E_NET_RATING',
 'E_OFF_RATING',
 'E_PACE',
 'E_TOV_PCT',
 'E_USG_PCT',
 'FGA_PG',
 'FGM_PG',
 'NET_RATING',
 'OFF_RATING',
 'OREB_PCT',
 'PACE',
 'PACE_PER40',
 'PIE',
 'POSS',
 'REB_PCT',
 'SEASON_YEAR',
 'TM_TOV_PCT',
 'TS_PCT',
 'USG_PCT',
 'sp_work_DEF_RATING',
 'sp_work_NET_RATING',
 'sp_work_OFF_RATING',
 'sp_work_PACE',
 'BLKA',
 'OPP_PTS_2ND_CHANCE',
 'OPP_PTS_FB',
 'OPP_PTS_OFF_TOV',
 'OPP_PTS_PAINT',
 'PFD',
 'PTS_2ND_CHANCE',
 'PTS_FB',
 'PTS_OFF_TOV',
 'PTS_PAINT',
 'PCT_AST_2PM',
 'PCT_AST_3PM',
 'PCT_AST_FGM',
 'PCT_FGA_2PT',
 'PCT_FGA_3PT',
 'PCT_

In [6]:
type(df['GAME_DATE'][0])

str

The dates are currently stored as strings, so I need to convert them to datetimes.

In [7]:
df['GAME_DATE'] = pd.to_datetime(df['GAME_DATE'])

In [8]:
type(df['GAME_DATE'][0])

pandas._libs.tslibs.timestamps.Timestamp

Now, I can sort the dataframe by game date.

In [9]:
df.sort_values('GAME_DATE',inplace=True)

In [10]:
df = df.reset_index(drop=True)

In [11]:
df

Unnamed: 0,SEASON_ID,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,...,TEAM_PCT_UAST_2PM,TEAM_PCT_UAST_3PM,TEAM_PCT_UAST_FGM,BIO_AGE,BIO_PLAYER_HEIGHT_INCHES,BIO_PLAYER_WEIGHT,BIO_DRAFT_ROUND,BIO_DRAFT_NUMBER,BIO_SEASON_ID,BIO_PLAYER_SEASON_ID
0,22015,203084,Harrison Barnes,1610612744,GSW,Golden State Warriors,21500003,2015-10-27,GSW vs. NOP,W,...,0.313,0.222,0.293,24.0,80,225,1,7,2015-16,2030842015-16
1,22015,201582,Alexis Ajinca,1610612740,NOP,New Orleans Pelicans,21500003,2015-10-27,NOP @ GSW,L,...,0.414,0.333,0.400,28.0,86,248,1,20,2015-16,2015822015-16
2,22015,2733,Shaun Livingston,1610612744,GSW,Golden State Warriors,21500003,2015-10-27,GSW vs. NOP,W,...,0.313,0.222,0.293,30.0,79,192,1,4,2015-16,27332015-16
3,22015,2571,Leandro Barbosa,1610612744,GSW,Golden State Warriors,21500003,2015-10-27,GSW vs. NOP,W,...,0.313,0.222,0.293,33.0,75,194,1,28,2015-16,25712015-16
4,22015,2570,Kendrick Perkins,1610612740,NOP,New Orleans Pelicans,21500003,2015-10-27,NOP @ GSW,L,...,0.414,0.333,0.400,31.0,82,270,1,27,2015-16,25702015-16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201800,22022,1627788,Furkan Korkmaz,1610612755,PHI,Philadelphia 76ers,22201217,2023-04-09,PHI @ BKN,W,...,0.462,0.286,0.415,25.0,79,202,1,26,2022-23,16277882022-23
201801,22022,1626149,Montrezl Harrell,1610612755,PHI,Philadelphia 76ers,22201217,2023-04-09,PHI @ BKN,W,...,0.462,0.286,0.415,29.0,79,240,2,32,2022-23,16261492022-23
201802,22022,203473,Dewayne Dedmon,1610612755,PHI,Philadelphia 76ers,22201217,2023-04-09,PHI @ BKN,W,...,0.462,0.286,0.415,33.0,82,245,0,0,2022-23,2034732022-23
201803,22022,1630564,RaiQuan Gray,1610612751,BKN,Brooklyn Nets,22201217,2023-04-09,BKN vs. PHI,L,...,0.435,0.250,0.371,23.0,79,269,2,59,2022-23,16305642022-23


Looks good. Next, it's time for feature addition.

## 2. Feature Addition: Single-Game Features

First, I add the single-game features: days of rest, home or away, and opponent.

### 2a. Days of Rest

In [13]:
def add_days_of_rest(player):
    # Date of this player's previous game (NaT for the first game on record)
    player['most_recent_game'] = player['GAME_DATE'].shift(1)
    # Gap between consecutive games, stored as a Timedelta
    player['days_of_rest'] = player['GAME_DATE'] - player['most_recent_game']

    return player

df = df.groupby('PLAYER_ID',group_keys=False).apply(add_days_of_rest)

In [14]:
df[['GAME_DATE','most_recent_game','days_of_rest']]

Unnamed: 0,GAME_DATE,most_recent_game,days_of_rest
0,2015-10-27,NaT,NaT
1,2015-10-27,NaT,NaT
2,2015-10-27,NaT,NaT
3,2015-10-27,NaT,NaT
4,2015-10-27,NaT,NaT
...,...,...,...
201800,2023-04-09,2023-04-07,2 days
201801,2023-04-09,2023-04-07,2 days
201802,2023-04-09,2023-04-07,2 days
201803,2023-04-09,NaT,NaT


Each player's first game comes up as a null value. In addition, some of the values are very large, sometimes spanning multiple years. Because I group by player only, the gap before the first game of each new season is counted as well, and other long gaps may reflect injuries or time spent playing in a different league. I replace both the nulls and any gap longer than 10 days with the value 10 so that the large numbers don't have an outsized pull on the model.

In [15]:
fill_value = pd.Timedelta('10 days')

In [16]:
df['days_of_rest'] = df['days_of_rest'].fillna(fill_value)

In [17]:
df['days_of_rest'].isna().sum()

0

In [18]:
df['days_of_rest'] = df['days_of_rest'].dt.days.astype(int)

In [19]:
df['days_of_rest'].value_counts().sort_index(ascending=False)

days_of_rest
1553         1
1518         1
1440         1
1396         1
1370         1
         ...  
5         3647
4         8537
3        29197
2       110293
1        32159
Name: count, Length: 476, dtype: int64

In [20]:
df.loc[df['days_of_rest'] > 10, 'days_of_rest'] = 10

In [21]:
df['days_of_rest'].value_counts().sort_index(ascending=False)

days_of_rest
10     10282
9       1761
8       2115
7       1715
6       2099
5       3647
4       8537
3      29197
2     110293
1      32159
Name: count, dtype: int64

Everything looks good. I no longer need the date of the most recent game, so I drop that column.

In [22]:
df.drop(['most_recent_game'],axis=1,inplace=True)
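
The days-of-rest steps above can also be collapsed into a single chained expression. The following is a minimal sketch on a made-up five-row frame, not the notebook's actual data. The toy frame and the name max_rest are assumptions for illustration; the column names and the cap of 10 come from the cells above.

```python
import pandas as pd

# Toy data (made up for illustration): two players, already sorted by date.
toy = pd.DataFrame({
    'PLAYER_ID': [1, 2, 1, 2, 1],
    'GAME_DATE': pd.to_datetime(
        ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-03', '2023-02-01']
    ),
})

max_rest = 10  # hypothetical name for the cap used above

toy['days_of_rest'] = (
    toy.groupby('PLAYER_ID')['GAME_DATE']
       .diff()                  # gap to the same player's previous game (NaT for the first)
       .dt.days                 # Timedelta -> number of days (NaN where NaT)
       .fillna(max_rest)        # a player's first game on record gets the cap
       .clip(upper=max_rest)    # long layoffs are capped as well
       .astype(int)
)

print(toy)
```

Written this way, no helper column such as most_recent_game is created, so there is nothing to drop afterwards.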

### 2b. Home or Away

In the MATCHUP column, an @ sign marks an away game and vs. marks a home game. I write a function that returns 1 when the team is playing at home and 0 when it is away.

In [23]:
def home_away(matchup):
    if '@' in matchup:
        return 0
    else:
        return 1

In [26]:
df['home'] = df['MATCHUP'].apply(home_away)

In [27]:
df[['MATCHUP','home']]

Unnamed: 0,MATCHUP,home
0,GSW vs. NOP,1
1,NOP @ GSW,0
2,GSW vs. NOP,1
3,GSW vs. NOP,1
4,NOP @ GSW,0
...,...,...
201800,PHI @ BKN,0
201801,PHI @ BKN,0
201802,PHI @ BKN,0
201803,BKN vs. PHI,1


Looks good. Next, I need to add moving averages.
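
As a side note on the home/away flag above, the same result can be obtained without a row-by-row apply. This is a small sketch on made-up MATCHUP strings in the same format as the data; it assumes, as the function above does, that every away game contains an @ sign.

```python
import pandas as pd

# Made-up sample of MATCHUP strings in the same format as the data above.
matchups = pd.Series(['GSW vs. NOP', 'NOP @ GSW', 'PHI @ BKN', 'BKN vs. PHI'])

# True where the string contains '@' (away game);
# invert and cast so that 1 = home and 0 = away.
home = (~matchups.str.contains('@', regex=False)).astype(int)

print(home.tolist())
```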

## 3. Feature Addition: Moving Averages

I want my model to have access to more than just single-game performance. I can't use full-season averages, though, because in the real world the model would only have access to games that have already been played. Instead, I use moving averages to create features that summarize only the games played up to a given point in the season.

### 3a. Expanding Averages

First, I add columns that take the median of each statistic over all games played in a given season up to that point. I use medians rather than means because I want to limit how much the averages are influenced by unusually good or bad performances. Before I get started, I review the dataframe to see the non-numeric columns and consider what to do with them.

In [28]:
non_numeric_columns = df.select_dtypes(exclude=['float', 'int'])
non_numeric_columns

Unnamed: 0,PLAYER_NAME,TEAM_ABBREVIATION,TEAM_NAME,GAME_DATE,MATCHUP,WL,SEASON_YEAR,PLAYER_SEASON_ID,TEAM_WL,BIO_SEASON_ID,BIO_PLAYER_SEASON_ID
0,Harrison Barnes,GSW,Golden State Warriors,2015-10-27,GSW vs. NOP,W,2015-16,2030842015-16,W,2015-16,2030842015-16
1,Alexis Ajinca,NOP,New Orleans Pelicans,2015-10-27,NOP @ GSW,L,2015-16,2015822015-16,L,2015-16,2015822015-16
2,Shaun Livingston,GSW,Golden State Warriors,2015-10-27,GSW vs. NOP,W,2015-16,27332015-16,W,2015-16,27332015-16
3,Leandro Barbosa,GSW,Golden State Warriors,2015-10-27,GSW vs. NOP,W,2015-16,25712015-16,W,2015-16,25712015-16
4,Kendrick Perkins,NOP,New Orleans Pelicans,2015-10-27,NOP @ GSW,L,2015-16,25702015-16,L,2015-16,25702015-16
...,...,...,...,...,...,...,...,...,...,...,...
201800,Furkan Korkmaz,PHI,Philadelphia 76ers,2023-04-09,PHI @ BKN,W,2022-23,16277882022-23,W,2022-23,16277882022-23
201801,Montrezl Harrell,PHI,Philadelphia 76ers,2023-04-09,PHI @ BKN,W,2022-23,16261492022-23,W,2022-23,16261492022-23
201802,Dewayne Dedmon,PHI,Philadelphia 76ers,2023-04-09,PHI @ BKN,W,2022-23,2034732022-23,W,2022-23,2034732022-23
201803,RaiQuan Gray,BKN,Brooklyn Nets,2023-04-09,BKN vs. PHI,L,2022-23,16305642022-23,L,2022-23,16305642022-23


I drop the columns that are redundant or unnecessary, and I convert the team win-loss column (TEAM_WL) into a numeric feature, with 1 for a win and 0 for a loss, because it could be useful.

In [30]:
df.drop(['WL','TEAM_ABBREVIATION','BIO_SEASON_ID','BIO_PLAYER_SEASON_ID'],
        axis=1,
        inplace=True)

In [31]:
df['TEAM_WL'] = df['TEAM_WL'].replace({'W': 1, 'L': 0})

In [32]:
non_numeric_columns = df.select_dtypes(exclude=['float', 'int'])
non_numeric_columns

Unnamed: 0,PLAYER_NAME,TEAM_NAME,GAME_DATE,MATCHUP,SEASON_YEAR,PLAYER_SEASON_ID
0,Harrison Barnes,Golden State Warriors,2015-10-27,GSW vs. NOP,2015-16,2030842015-16
1,Alexis Ajinca,New Orleans Pelicans,2015-10-27,NOP @ GSW,2015-16,2015822015-16
2,Shaun Livingston,Golden State Warriors,2015-10-27,GSW vs. NOP,2015-16,27332015-16
3,Leandro Barbosa,Golden State Warriors,2015-10-27,GSW vs. NOP,2015-16,25712015-16
4,Kendrick Perkins,New Orleans Pelicans,2015-10-27,NOP @ GSW,2015-16,25702015-16
...,...,...,...,...,...,...
201800,Furkan Korkmaz,Philadelphia 76ers,2023-04-09,PHI @ BKN,2022-23,16277882022-23
201801,Montrezl Harrell,Philadelphia 76ers,2023-04-09,PHI @ BKN,2022-23,16261492022-23
201802,Dewayne Dedmon,Philadelphia 76ers,2023-04-09,PHI @ BKN,2022-23,2034732022-23
201803,RaiQuan Gray,Brooklyn Nets,2023-04-09,BKN vs. PHI,2022-23,16305642022-23


First, I identify the non-numeric columns that I do not need to take expanding averages for.

In [33]:
non_numeric_col_list = non_numeric_columns.columns.tolist()

Next, I identify the numeric columns that I do not need to take averages for.

In [34]:
cols_to_remove = non_numeric_col_list + ['TEAM_ID', 'GAME_ID', 'BIO_AGE', 'PLAYER_GAME_ID',
                                         'TEAM_GAME_ID', 'BIO_PLAYER_HEIGHT_INCHES', 'BIO_PLAYER_WEIGHT',
                                         'BIO_DRAFT_YEAR', 'BIO_DRAFT_ROUND', 'BIO_DRAFT_NUMBER', 'opponent_id']

In [35]:
selected_cols = df.columns[~df.columns.isin(cols_to_remove)]

In [36]:
df_expanding = df[list(selected_cols)]

In [37]:
df_expanding

Unnamed: 0,SEASON_ID,PLAYER_ID,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,...,TEAM_PCT_PTS_3PT,TEAM_PCT_PTS_FB,TEAM_PCT_PTS_FT,TEAM_PCT_PTS_OFF_TOV,TEAM_PCT_PTS_PAINT,TEAM_PCT_UAST_2PM,TEAM_PCT_UAST_3PM,TEAM_PCT_UAST_FGM,days_of_rest,home
0,22015,203084,33,3,12,0.250,1,5,0.200,1,...,0.243,0.117,0.180,0.234,0.486,0.313,0.222,0.293,10,1
1,22015,201582,17,3,7,0.429,0,0,0.000,0,...,0.189,0.189,0.200,0.305,0.421,0.414,0.333,0.400,10,0
2,22015,2733,21,3,6,0.500,0,2,0.000,0,...,0.243,0.117,0.180,0.234,0.486,0.313,0.222,0.293,10,1
3,22015,2571,21,0,6,0.000,0,2,0.000,0,...,0.243,0.117,0.180,0.234,0.486,0.313,0.222,0.293,10,1
4,22015,2570,16,5,5,1.000,0,0,0.000,0,...,0.189,0.189,0.200,0.305,0.421,0.414,0.333,0.400,10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201800,22022,1627788,30,3,7,0.429,3,4,0.750,2,...,0.313,0.172,0.104,0.209,0.522,0.462,0.286,0.415,2,0
201801,22022,1626149,21,5,10,0.500,0,2,0.000,5,...,0.313,0.172,0.104,0.209,0.522,0.462,0.286,0.415,2,0
201802,22022,203473,19,6,8,0.750,1,1,1.000,1,...,0.313,0.172,0.104,0.209,0.522,0.462,0.286,0.415,2,0
201803,22022,1630564,35,6,12,0.500,2,5,0.400,2,...,0.343,0.114,0.219,0.181,0.343,0.435,0.250,0.371,10,1


Now, I define a function that finds the expanding median of each stat and apply it to each player-season group so that one season's games do not carry over into the next. Because the dataframe is already sorted by date, the expanding window moves through each player's games in chronological order.

In [38]:
def find_expanding_medians(group):
    # group holds one player-season; each row's median covers that row and all earlier rows
    return group.expanding().median()

In [39]:
df_expanding = df_expanding.groupby(['PLAYER_ID', 'SEASON_ID'],group_keys=False).apply(find_expanding_medians)
df_expanding

Unnamed: 0,SEASON_ID,PLAYER_ID,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,...,TEAM_PCT_PTS_3PT,TEAM_PCT_PTS_FB,TEAM_PCT_PTS_FT,TEAM_PCT_PTS_OFF_TOV,TEAM_PCT_PTS_PAINT,TEAM_PCT_UAST_2PM,TEAM_PCT_UAST_3PM,TEAM_PCT_UAST_FGM,days_of_rest,home
0,22015.0,203084.0,33.0,3.0,12.0,0.250,1.0,5.0,0.2,1.0,...,0.243,0.1170,0.1800,0.2340,0.486,0.3130,0.222,0.293,10.0,1.0
1,22015.0,201582.0,17.0,3.0,7.0,0.429,0.0,0.0,0.0,0.0,...,0.189,0.1890,0.2000,0.3050,0.421,0.4140,0.333,0.400,10.0,0.0
2,22015.0,2733.0,21.0,3.0,6.0,0.500,0.0,2.0,0.0,0.0,...,0.243,0.1170,0.1800,0.2340,0.486,0.3130,0.222,0.293,10.0,1.0
3,22015.0,2571.0,21.0,0.0,6.0,0.000,0.0,2.0,0.0,0.0,...,0.243,0.1170,0.1800,0.2340,0.486,0.3130,0.222,0.293,10.0,1.0
4,22015.0,2570.0,16.0,5.0,5.0,1.000,0.0,0.0,0.0,0.0,...,0.189,0.1890,0.2000,0.3050,0.421,0.4140,0.333,0.400,10.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201800,22022.0,1627788.0,7.0,1.0,2.0,0.333,0.0,1.0,0.0,0.0,...,0.298,0.1320,0.1650,0.1640,0.442,0.4860,0.167,0.391,2.0,0.0
201801,22022.0,1626149.0,12.0,2.0,3.0,0.500,0.0,0.0,0.0,1.0,...,0.325,0.1340,0.1650,0.1670,0.400,0.4860,0.167,0.375,2.0,0.0
201802,22022.0,203473.0,12.0,1.5,4.0,0.500,0.0,1.0,0.0,0.0,...,0.313,0.1055,0.1685,0.1685,0.443,0.4745,0.167,0.400,2.0,0.0
201803,22022.0,1630564.0,35.0,6.0,12.0,0.500,2.0,5.0,0.4,2.0,...,0.343,0.1140,0.2190,0.1810,0.343,0.4350,0.250,0.371,10.0,1.0


I label all of these columns as expanding for clarity.

In [40]:
expanding_cols = [f'{col}_EXPANDING' for col in df_expanding.columns]
df_expanding.columns = expanding_cols

df = pd.concat([df,df_expanding],axis=1)
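
The group-then-rename steps above can also be written with transform, which keeps the original index and leaves the grouping columns out of the result. This is a sketch on made-up data under stated assumptions: the toy frame and the stat_cols list are hypothetical, and only the column-naming scheme is taken from the cells above.

```python
import pandas as pd

# Toy data (made up): one player across two seasons, sorted by date.
toy = pd.DataFrame({
    'PLAYER_ID': [7, 7, 7, 7, 7],
    'SEASON_ID': [22021, 22021, 22021, 22022, 22022],
    'PTS':       [0, 10, 5, 4, 8],
    'REB':       [3, 1, 2, 6, 2],
})

stat_cols = ['PTS', 'REB']  # hypothetical subset of the stat columns

expanding = (
    toy.groupby(['PLAYER_ID', 'SEASON_ID'])[stat_cols]
       .transform(lambda g: g.expanding().median())  # restarts for each player-season
       .add_suffix('_EXPANDING')
)

# The result shares toy's index, so a plain join lines the rows up.
toy = toy.join(expanding)
print(toy)
```

Selecting the stat columns before transforming means the ID columns are never run through the median, so no PLAYER_ID_EXPANDING or SEASON_ID_EXPANDING columns are produced.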

In [41]:
df

Unnamed: 0,SEASON_ID,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,MIN,FGM,...,TEAM_PCT_PTS_3PT_EXPANDING,TEAM_PCT_PTS_FB_EXPANDING,TEAM_PCT_PTS_FT_EXPANDING,TEAM_PCT_PTS_OFF_TOV_EXPANDING,TEAM_PCT_PTS_PAINT_EXPANDING,TEAM_PCT_UAST_2PM_EXPANDING,TEAM_PCT_UAST_3PM_EXPANDING,TEAM_PCT_UAST_FGM_EXPANDING,days_of_rest_EXPANDING,home_EXPANDING
0,22015,203084,Harrison Barnes,1610612744,Golden State Warriors,21500003,2015-10-27,GSW vs. NOP,33,3,...,0.243,0.1170,0.1800,0.2340,0.486,0.3130,0.222,0.293,10.0,1.0
1,22015,201582,Alexis Ajinca,1610612740,New Orleans Pelicans,21500003,2015-10-27,NOP @ GSW,17,3,...,0.189,0.1890,0.2000,0.3050,0.421,0.4140,0.333,0.400,10.0,0.0
2,22015,2733,Shaun Livingston,1610612744,Golden State Warriors,21500003,2015-10-27,GSW vs. NOP,21,3,...,0.243,0.1170,0.1800,0.2340,0.486,0.3130,0.222,0.293,10.0,1.0
3,22015,2571,Leandro Barbosa,1610612744,Golden State Warriors,21500003,2015-10-27,GSW vs. NOP,21,0,...,0.243,0.1170,0.1800,0.2340,0.486,0.3130,0.222,0.293,10.0,1.0
4,22015,2570,Kendrick Perkins,1610612740,New Orleans Pelicans,21500003,2015-10-27,NOP @ GSW,16,5,...,0.189,0.1890,0.2000,0.3050,0.421,0.4140,0.333,0.400,10.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201800,22022,1627788,Furkan Korkmaz,1610612755,Philadelphia 76ers,22201217,2023-04-09,PHI @ BKN,30,3,...,0.298,0.1320,0.1650,0.1640,0.442,0.4860,0.167,0.391,2.0,0.0
201801,22022,1626149,Montrezl Harrell,1610612755,Philadelphia 76ers,22201217,2023-04-09,PHI @ BKN,21,5,...,0.325,0.1340,0.1650,0.1670,0.400,0.4860,0.167,0.375,2.0,0.0
201802,22022,203473,Dewayne Dedmon,1610612755,Philadelphia 76ers,22201217,2023-04-09,PHI @ BKN,19,6,...,0.313,0.1055,0.1685,0.1685,0.443,0.4745,0.167,0.400,2.0,0.0
201803,22022,1630564,RaiQuan Gray,1610612751,Brooklyn Nets,22201217,2023-04-09,BKN vs. PHI,35,6,...,0.343,0.1140,0.2190,0.1810,0.343,0.4350,0.250,0.371,10.0,1.0


I use one of my favorite players, Mikal Bridges, to check the results.

In [42]:
bridges_df = df.loc[df['PLAYER_NAME'] == 'Mikal Bridges']
bridges_df[['PTS','PTS_EXPANDING']].head(20)

Unnamed: 0,PTS,PTS_EXPANDING
78404,0,0.0
78954,10,5.0
79269,5,5.0
79549,2,3.5
80085,7,5.0
80120,9,6.0
80597,16,7.0
81012,2,6.0
81339,14,7.0
81590,6,6.5


Looks good! The expanding column contains the median outcome based on the current game and all previous games. Now that I have passed the Mikal Bridges test, I can move on to rolling averages.
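
The Mikal Bridges check can also be reproduced by hand with the standard library. This short sketch copies the ten PTS values from the table above and recomputes the median of games 1 through k for each k; it is only a cross-check and is independent of pandas.

```python
from statistics import median

# First ten PTS values for Mikal Bridges, copied from the table above.
pts = [0, 10, 5, 2, 7, 9, 16, 2, 14, 6]

# Median of games 1..k for each k, i.e. the current game plus all earlier ones.
expanding_medians = [median(pts[:k]) for k in range(1, len(pts) + 1)]

print(expanding_medians)
```

The values agree with the PTS_EXPANDING column shown above.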

### 3b. Rolling Averages

In addition to the expanding averages, which take the entire season up to that point into consideration, I also use rolling averages with a window of 10 games to focus on more recent information. For these rolling averages, if a player has not played 10 games yet, I use the expanding average instead.

In [43]:
df_rolling = df[list(selected_cols)]

In [44]:
df_rolling

Unnamed: 0,SEASON_ID,PLAYER_ID,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,...,TEAM_PCT_PTS_3PT,TEAM_PCT_PTS_FB,TEAM_PCT_PTS_FT,TEAM_PCT_PTS_OFF_TOV,TEAM_PCT_PTS_PAINT,TEAM_PCT_UAST_2PM,TEAM_PCT_UAST_3PM,TEAM_PCT_UAST_FGM,days_of_rest,home
0,22015,203084,33,3,12,0.250,1,5,0.200,1,...,0.243,0.117,0.180,0.234,0.486,0.313,0.222,0.293,10,1
1,22015,201582,17,3,7,0.429,0,0,0.000,0,...,0.189,0.189,0.200,0.305,0.421,0.414,0.333,0.400,10,0
2,22015,2733,21,3,6,0.500,0,2,0.000,0,...,0.243,0.117,0.180,0.234,0.486,0.313,0.222,0.293,10,1
3,22015,2571,21,0,6,0.000,0,2,0.000,0,...,0.243,0.117,0.180,0.234,0.486,0.313,0.222,0.293,10,1
4,22015,2570,16,5,5,1.000,0,0,0.000,0,...,0.189,0.189,0.200,0.305,0.421,0.414,0.333,0.400,10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201800,22022,1627788,30,3,7,0.429,3,4,0.750,2,...,0.313,0.172,0.104,0.209,0.522,0.462,0.286,0.415,2,0
201801,22022,1626149,21,5,10,0.500,0,2,0.000,5,...,0.313,0.172,0.104,0.209,0.522,0.462,0.286,0.415,2,0
201802,22022,203473,19,6,8,0.750,1,1,1.000,1,...,0.313,0.172,0.104,0.209,0.522,0.462,0.286,0.415,2,0
201803,22022,1630564,35,6,12,0.500,2,5,0.400,2,...,0.343,0.114,0.219,0.181,0.343,0.435,0.250,0.371,10,1


I define a function that takes the rolling median with a window of 10 games. Where the rolling median is null because the player has not yet played 10 games, the function falls back to the expanding median.

In [45]:
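def find_rolling_medians(group):
    # Median over the 10 most recent games; null until 10 games are available
    rolling = group.rolling(10).median()
    # Median over every game so far in the player-season
    expanding = group.expanding().median()
    # Use the rolling value where it exists, otherwise the expanding value
    return rolling.combine_first(expanding)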
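
For a column with no missing values, the fallback logic above should give the same result as letting the rolling window shrink at the start of the season via min_periods=1. The sketch below compares the two on a made-up 15-game scoring line. The data is hypothetical, and the equivalence is only claimed for columns without nulls: when a window contains a null, the two approaches can differ.

```python
import pandas as pd

# Made-up scoring line for one player-season (15 games).
pts = pd.Series([0, 10, 5, 2, 7, 9, 16, 2, 14, 6, 20, 3, 11, 8, 12])

# The approach above: rolling median, falling back to the expanding median
# while fewer than 10 games are available.
with_fallback = pts.rolling(10).median().combine_first(pts.expanding().median())

# Alternative: allow windows shorter than 10 at the start of the season.
shrinking_window = pts.rolling(10, min_periods=1).median()

print(with_fallback.equals(shrinking_window))
```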

Just like before, I apply the function by player and season so that one season's data does not spill into the next season's data. I then label the columns and do a sanity check to make sure everything worked.

In [47]:
df_rolling = df_rolling.groupby(['PLAYER_ID', 'SEASON_ID'],group_keys=False).apply(find_rolling_medians)
df_rolling

Unnamed: 0,SEASON_ID,PLAYER_ID,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,...,TEAM_PCT_PTS_3PT,TEAM_PCT_PTS_FB,TEAM_PCT_PTS_FT,TEAM_PCT_PTS_OFF_TOV,TEAM_PCT_PTS_PAINT,TEAM_PCT_UAST_2PM,TEAM_PCT_UAST_3PM,TEAM_PCT_UAST_FGM,days_of_rest,home
0,22015.0,203084.0,33.0,3.0,12.0,0.2500,1.0,5.0,0.2000,1.0,...,0.2430,0.1170,0.1800,0.2340,0.4860,0.3130,0.2220,0.2930,10.0,1.0
1,22015.0,201582.0,17.0,3.0,7.0,0.4290,0.0,0.0,0.0000,0.0,...,0.1890,0.1890,0.2000,0.3050,0.4210,0.4140,0.3330,0.4000,10.0,0.0
2,22015.0,2733.0,21.0,3.0,6.0,0.5000,0.0,2.0,0.0000,0.0,...,0.2430,0.1170,0.1800,0.2340,0.4860,0.3130,0.2220,0.2930,10.0,1.0
3,22015.0,2571.0,21.0,0.0,6.0,0.0000,0.0,2.0,0.0000,0.0,...,0.2430,0.1170,0.1800,0.2340,0.4860,0.3130,0.2220,0.2930,10.0,1.0
4,22015.0,2570.0,16.0,5.0,5.0,1.0000,0.0,0.0,0.0000,0.0,...,0.1890,0.1890,0.2000,0.3050,0.4210,0.4140,0.3330,0.4000,10.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201800,22022.0,1627788.0,6.0,1.0,2.0,0.5000,0.0,0.0,0.0000,0.0,...,0.3050,0.1375,0.1550,0.1510,0.4705,0.5150,0.1500,0.4125,2.0,0.0
201801,22022.0,1626149.0,4.5,1.0,2.0,0.5000,0.0,0.0,0.0000,0.0,...,0.3000,0.1540,0.1525,0.1590,0.4590,0.4895,0.2250,0.4125,4.5,0.0
201802,22022.0,203473.0,9.5,1.0,2.0,0.4165,0.0,0.0,0.0000,0.0,...,0.2815,0.1330,0.1630,0.1535,0.4930,0.5270,0.1500,0.4170,4.0,0.0
201803,22022.0,1630564.0,35.0,6.0,12.0,0.5000,2.0,5.0,0.4000,2.0,...,0.3430,0.1140,0.2190,0.1810,0.3430,0.4350,0.2500,0.3710,10.0,1.0


In [48]:
rolling_cols = [f'{col}_ROLLING' for col in df_rolling.columns]
df_rolling.columns = rolling_cols

df = pd.concat([df,df_rolling],axis=1)

In [49]:
df

Unnamed: 0,SEASON_ID,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,MIN,FGM,...,TEAM_PCT_PTS_3PT_ROLLING,TEAM_PCT_PTS_FB_ROLLING,TEAM_PCT_PTS_FT_ROLLING,TEAM_PCT_PTS_OFF_TOV_ROLLING,TEAM_PCT_PTS_PAINT_ROLLING,TEAM_PCT_UAST_2PM_ROLLING,TEAM_PCT_UAST_3PM_ROLLING,TEAM_PCT_UAST_FGM_ROLLING,days_of_rest_ROLLING,home_ROLLING
0,22015,203084,Harrison Barnes,1610612744,Golden State Warriors,21500003,2015-10-27,GSW vs. NOP,33,3,...,0.2430,0.1170,0.1800,0.2340,0.4860,0.3130,0.2220,0.2930,10.0,1.0
1,22015,201582,Alexis Ajinca,1610612740,New Orleans Pelicans,21500003,2015-10-27,NOP @ GSW,17,3,...,0.1890,0.1890,0.2000,0.3050,0.4210,0.4140,0.3330,0.4000,10.0,0.0
2,22015,2733,Shaun Livingston,1610612744,Golden State Warriors,21500003,2015-10-27,GSW vs. NOP,21,3,...,0.2430,0.1170,0.1800,0.2340,0.4860,0.3130,0.2220,0.2930,10.0,1.0
3,22015,2571,Leandro Barbosa,1610612744,Golden State Warriors,21500003,2015-10-27,GSW vs. NOP,21,0,...,0.2430,0.1170,0.1800,0.2340,0.4860,0.3130,0.2220,0.2930,10.0,1.0
4,22015,2570,Kendrick Perkins,1610612740,New Orleans Pelicans,21500003,2015-10-27,NOP @ GSW,16,5,...,0.1890,0.1890,0.2000,0.3050,0.4210,0.4140,0.3330,0.4000,10.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201800,22022,1627788,Furkan Korkmaz,1610612755,Philadelphia 76ers,22201217,2023-04-09,PHI @ BKN,30,3,...,0.3050,0.1375,0.1550,0.1510,0.4705,0.5150,0.1500,0.4125,2.0,0.0
201801,22022,1626149,Montrezl Harrell,1610612755,Philadelphia 76ers,22201217,2023-04-09,PHI @ BKN,21,5,...,0.3000,0.1540,0.1525,0.1590,0.4590,0.4895,0.2250,0.4125,4.5,0.0
201802,22022,203473,Dewayne Dedmon,1610612755,Philadelphia 76ers,22201217,2023-04-09,PHI @ BKN,19,6,...,0.2815,0.1330,0.1630,0.1535,0.4930,0.5270,0.1500,0.4170,4.0,0.0
201803,22022,1630564,RaiQuan Gray,1610612751,Brooklyn Nets,22201217,2023-04-09,BKN vs. PHI,35,6,...,0.3430,0.1140,0.2190,0.1810,0.3430,0.4350,0.2500,0.3710,10.0,1.0


In [50]:
bridges_df = df.loc[df['PLAYER_NAME'] == 'Mikal Bridges']
bridges_df[['PTS','PTS_EXPANDING','PTS_ROLLING']].head(20)

Unnamed: 0,PTS,PTS_EXPANDING,PTS_ROLLING
78404,0,0.0,0.0
78954,10,5.0,5.0
79269,5,5.0,5.0
79549,2,3.5,3.5
80085,7,5.0,5.0
80120,9,6.0,6.0
80597,16,7.0,7.0
81012,2,6.0,6.0
81339,14,7.0,7.0
81590,6,6.5,6.5


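Great! The two columns match through the first ten games, which is expected: the 10-game window and the season-to-date window cover the same games until the eleventh game, after which the rolling median only looks at the ten most recent games.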

## 4. Export to CSV

Now that all of the current-game features are included, I export the dataframe to a CSV file for use in the next notebook, where I will focus on future outcomes.

In [51]:
df.to_csv('./Data/current_game_all_features.csv',index=False)
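
One thing to keep in mind when the file is read back: CSV stores GAME_DATE as text, so the datetime conversion from Section 1 does not survive the export. The following is a minimal sketch, assuming the next notebook loads the file with pandas. It uses a made-up two-row frame and an in-memory buffer rather than the real file.

```python
import io
import pandas as pd

# Toy frame standing in for the exported data (values are made up).
toy = pd.DataFrame({
    'PLAYER_ID': [1, 2],
    'GAME_DATE': pd.to_datetime(['2015-10-27', '2023-04-09']),
    'days_of_rest': [10, 2],
})

# Write to an in-memory buffer instead of a file so the sketch has no side effects.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores GAME_DATE as a datetime column rather than plain strings.
reloaded = pd.read_csv(buffer, parse_dates=['GAME_DATE'])

print(reloaded.dtypes)
```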