# Data Processing

### Summary

Games.csv and ranking.csv will be merged after initial data processing. Since all the features are *post* game data (final score, winning percentage after the game, etc...) they cannot be used as predictors for the current game. All the features will be used as predictors for the "next game", so the data will need to adjusted so that the TARGET (HOME_TEAM_WINS) is in the same row as the predictors. 

The easiest approach seems to be add a field called TARGET that denotes whether the home team won its *next* game or not.

Game_details.csv will intially be held in reserve for feature engineering. With a roster of 24 players per game and 21 features per player, the initial plan is NOT to add all these 500 features indiscriminately but to instead try to find useful features and incorporate these.

Scaling and power-transforms will not be used at this time since the plan is to use GBTs (gradient boosted trees) such as XGBoost where this is not needed. These transforms may be needed later for PCA and other techniques.


duplicates

 - both games.csv and ranking.csv contain several duplicated rows from Dec 2020 (covid season) that the pandas function *df.duplicated()* failed to detect in EDA. These will be filtered out using subsets instead of the entire dataframe.

 games.csv
 
 - delete preseason games (this will also take care of the null games from early 2003)
 - keep only games where GAME_STATUS_TEXT = 'Final' (for better utility in the future)
 - remove duplicated records 
 - flag postseason games 
 - drop 'GAME_STATUS_TEXT', 'TEAM_ID_home', 'TEAM_ID_away'

ranking.csv
 
 - drop preseason rankings (SEASON_ID begins with 1)
 - split HOME_RECORD into HOME_W, HOME_L, and HOME_W_PCT
 - split ROAD_RECORD into ROAD_W, ROAD_L, and ROAD_W_PCT
 - numericaly encode CONFERENCE (East or West)
 - remove duplicated records
 - drop 'SEASON_ID', 'LEAGUE_ID', 'RETURNTOPLAY', 'TEAM', 'HOME_RECORD', 'ROAD_RECORD'

 game_details.csv
 
 - fix mixed formats in MIN and convert to float
 - fix negatives in MIN
 - if MIN is null, edit START_POSITION to 'NP' (not played)
 - any START_POSITION remaining null, convert to NS (not start, but still played)
 - drop TEAM_ABBREVIATION, TEAM_CITY, PLAYER_NAME, NICKNAME, COMMENT
 
 Join games with ranking
 
  - LINK: games.GAME_DATE_EST, games.HOME_TEAM_ID, -> ranking.STANDINGSDATE, ranking.TEAM_ID 
  - ADD: CONFERENCE, G, W, L, W_PCT, HOME_W, HOME_L, HOME_W_PCT, ROAD_W, ROAD_L, ROAD_W_PCT
  - repeat with AWAY_TEAM_ID instead of HOME_TEAM_ID
  
 Add TARGET
 
  - Sort games by HOME_TEAM_ID and GAME_ID
  - for each SEASON and HOME_TEAM_ID, shift HOME_TEAM_WINS down to TARGET for previous game
  - remove games with null TARGETs (last game played each season by each team will have no null TARGET)
  

 

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 500)

# For Visualization
import matplotlib.pyplot as plt
import seaborn as sns

from pathlib import Path  #for Windows/Linux compatibility
DATAPATH = Path(r'data')


## games.csv

In [2]:
games = pd.read_csv(DATAPATH / "games.csv")
games.head()

Unnamed: 0,GAME_DATE_EST,GAME_ID,GAME_STATUS_TEXT,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,TEAM_ID_home,PTS_home,FG_PCT_home,FT_PCT_home,FG3_PCT_home,AST_home,REB_home,TEAM_ID_away,PTS_away,FG_PCT_away,FT_PCT_away,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS
0,2022-03-12,22101005,Final,1610612748,1610612750,2021,1610612748,104.0,0.398,0.76,0.333,23.0,53.0,1610612750,113.0,0.422,0.875,0.357,21.0,46.0,0
1,2022-03-12,22101006,Final,1610612741,1610612739,2021,1610612741,101.0,0.443,0.933,0.429,20.0,46.0,1610612739,91.0,0.419,0.824,0.208,19.0,40.0,1
2,2022-03-12,22101007,Final,1610612759,1610612754,2021,1610612759,108.0,0.412,0.813,0.324,28.0,52.0,1610612754,119.0,0.489,1.0,0.389,23.0,47.0,0
3,2022-03-12,22101008,Final,1610612744,1610612749,2021,1610612744,122.0,0.484,0.933,0.4,33.0,55.0,1610612749,109.0,0.413,0.696,0.386,27.0,39.0,1
4,2022-03-12,22101009,Final,1610612743,1610612761,2021,1610612743,115.0,0.551,0.75,0.407,32.0,39.0,1610612761,127.0,0.471,0.76,0.387,28.0,50.0,0


**Clean Data**

In [3]:
#remove preseason games (GAME_ID begins with a 1)
games = games[games['GAME_ID'] > 20000000]

#flag postseason games (GAME_ID begins with >2)
games['PLAYOFF'] = (games['GAME_ID'] >= 30000000).astype('int8')

#remove duplicates (each GAME_ID should be unique)
games = games[~games.duplicated(subset=['GAME_ID'])]

#drop unnecessary fields
drop_fields = ['GAME_STATUS_TEXT', 'TEAM_ID_home', 'TEAM_ID_away']
games = games.drop(drop_fields,axis=1)

games
    

Unnamed: 0,GAME_DATE_EST,GAME_ID,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,PTS_home,FG_PCT_home,FT_PCT_home,FG3_PCT_home,AST_home,REB_home,PTS_away,FG_PCT_away,FT_PCT_away,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS,PLAYOFF
0,2022-03-12,22101005,1610612748,1610612750,2021,104.0,0.398,0.760,0.333,23.0,53.0,113.0,0.422,0.875,0.357,21.0,46.0,0,0
1,2022-03-12,22101006,1610612741,1610612739,2021,101.0,0.443,0.933,0.429,20.0,46.0,91.0,0.419,0.824,0.208,19.0,40.0,1,0
2,2022-03-12,22101007,1610612759,1610612754,2021,108.0,0.412,0.813,0.324,28.0,52.0,119.0,0.489,1.000,0.389,23.0,47.0,0,0
3,2022-03-12,22101008,1610612744,1610612749,2021,122.0,0.484,0.933,0.400,33.0,55.0,109.0,0.413,0.696,0.386,27.0,39.0,1,0
4,2022-03-12,22101009,1610612743,1610612761,2021,115.0,0.551,0.750,0.407,32.0,39.0,127.0,0.471,0.760,0.387,28.0,50.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25684,2014-10-29,21400014,1610612758,1610612744,2014,77.0,0.308,0.743,0.167,13.0,50.0,95.0,0.440,0.719,0.222,21.0,44.0,0,0
25685,2014-10-29,21400015,1610612757,1610612760,2014,106.0,0.448,0.773,0.379,23.0,42.0,89.0,0.407,0.808,0.125,19.0,43.0,1,0
25686,2014-10-28,21400001,1610612740,1610612753,2014,101.0,0.406,0.484,0.235,20.0,62.0,84.0,0.381,0.762,0.364,17.0,56.0,1,0
25687,2014-10-28,21400002,1610612759,1610612742,2014,101.0,0.529,0.813,0.500,23.0,38.0,100.0,0.487,0.842,0.381,17.0,33.0,1,0


## ranking.csv

In [4]:
ranking = pd.read_csv(DATAPATH / "ranking.csv")
ranking.head()

Unnamed: 0,TEAM_ID,LEAGUE_ID,SEASON_ID,STANDINGSDATE,CONFERENCE,TEAM,G,W,L,W_PCT,HOME_RECORD,ROAD_RECORD,RETURNTOPLAY
0,1610612756,0,22021,2022-03-12,West,Phoenix,67,53,14,0.791,28-8,25-6,
1,1610612744,0,22021,2022-03-12,West,Golden State,68,46,22,0.676,28-7,18-15,
2,1610612763,0,22021,2022-03-12,West,Memphis,68,46,22,0.676,24-10,22-12,
3,1610612762,0,22021,2022-03-12,West,Utah,67,42,25,0.627,24-10,18-15,
4,1610612742,0,22021,2022-03-12,West,Dallas,67,41,26,0.612,23-12,18-14,


**Clean Data**

In [5]:
#remove preseason rankings (SEASON_ID begins with 1)
ranking = ranking[ranking['SEASON_ID'] > 20000]

#convert home record and road record to numeric
ranking['HOME_W'] = ranking['HOME_RECORD'].apply(lambda x: x.split('-')[0]).astype('int')
ranking['HOME_L'] = ranking['HOME_RECORD'].apply(lambda x: x.split('-')[1]).astype('int')
ranking['HOME_W_PCT'] = ranking['HOME_W'] / ( ranking['HOME_W'] + ranking['HOME_L'] )

ranking['ROAD_W'] = ranking['ROAD_RECORD'].apply(lambda x: x.split('-')[0]).astype('int')
ranking['ROAD_L'] = ranking['ROAD_RECORD'].apply(lambda x: x.split('-')[1]).astype('int')
ranking['ROAD_W_PCT'] = ranking['ROAD_W'] / ( ranking['ROAD_W'] + ranking['ROAD_L'] )

#encode CONFERENCE as an integer (just using pandas - not importing sklearn for just one feature)
ranking['CONFERENCE'] = ranking['CONFERENCE'].apply(lambda x: 0 if x=='East' else 1 ).astype('int') 

#remove duplicates (there should only be one TEAM_ID per STANDINGSDATE)
ranking = ranking[~ranking.duplicated(subset=['TEAM_ID','STANDINGSDATE'])]

#drop unnecessary fields
drop_fields = ['SEASON_ID', 'LEAGUE_ID', 'RETURNTOPLAY', 'TEAM', 'HOME_RECORD', 'ROAD_RECORD']
ranking = ranking.drop(drop_fields,axis=1)

ranking


Unnamed: 0,TEAM_ID,STANDINGSDATE,CONFERENCE,G,W,L,W_PCT,HOME_W,HOME_L,HOME_W_PCT,ROAD_W,ROAD_L,ROAD_W_PCT
0,1610612756,2022-03-12,1,67,53,14,0.791,28,8,0.777778,25,6,0.806452
1,1610612744,2022-03-12,1,68,46,22,0.676,28,7,0.800000,18,15,0.545455
2,1610612763,2022-03-12,1,68,46,22,0.676,24,10,0.705882,22,12,0.647059
3,1610612762,2022-03-12,1,67,42,25,0.627,24,10,0.705882,18,15,0.545455
4,1610612742,2022-03-12,1,67,41,26,0.612,23,12,0.657143,18,14,0.562500
...,...,...,...,...,...,...,...,...,...,...,...,...,...
201787,1610612765,2014-09-01,0,82,29,53,0.354,17,24,0.414634,12,29,0.292683
201788,1610612738,2014-09-01,0,82,25,57,0.305,16,25,0.390244,9,32,0.219512
201789,1610612753,2014-09-01,0,82,23,59,0.280,19,22,0.463415,4,37,0.097561
201790,1610612755,2014-09-01,0,82,19,63,0.232,10,31,0.243902,9,32,0.219512


## game_details.csv

In [6]:
details = pd.read_csv(DATAPATH / "games_details.csv")
details.head()

  details = pd.read_csv(DATAPATH / "games_details.csv")


Unnamed: 0,GAME_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_CITY,PLAYER_ID,PLAYER_NAME,NICKNAME,START_POSITION,COMMENT,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TO,PF,PTS,PLUS_MINUS
0,22101005,1610612750,MIN,Minnesota,1630162,Anthony Edwards,Anthony,F,,36:22,4.0,10.0,0.4,3.0,8.0,0.375,4.0,4.0,1.0,0.0,8.0,8.0,5.0,3.0,1.0,1.0,1.0,15.0,5.0
1,22101005,1610612750,MIN,Minnesota,1630183,Jaden McDaniels,Jaden,F,,23:54,6.0,8.0,0.75,1.0,3.0,0.333,1.0,1.0,1.0,2.0,4.0,6.0,0.0,0.0,2.0,2.0,6.0,14.0,10.0
2,22101005,1610612750,MIN,Minnesota,1626157,Karl-Anthony Towns,Karl-Anthony,C,,25:17,4.0,9.0,0.444,1.0,3.0,0.333,6.0,8.0,0.75,1.0,9.0,10.0,0.0,0.0,0.0,3.0,4.0,15.0,14.0
3,22101005,1610612750,MIN,Minnesota,1627736,Malik Beasley,Malik,G,,30:52,4.0,9.0,0.444,4.0,9.0,0.444,0.0,0.0,0.0,0.0,3.0,3.0,1.0,1.0,0.0,1.0,4.0,12.0,20.0
4,22101005,1610612750,MIN,Minnesota,1626156,D'Angelo Russell,D'Angelo,G,,33:46,3.0,13.0,0.231,1.0,6.0,0.167,7.0,7.0,1.0,0.0,6.0,6.0,9.0,1.0,0.0,5.0,0.0,14.0,17.0


**Clean Data**

In [7]:
# convert MIN:SEC to float
df = details.loc[details['MIN'].str.contains(':',na=False)]
df['MIN_whole'] = df['MIN'].apply(lambda x: x.split(':')[0]).astype("int8")
df['MIN_seconds'] = df['MIN'].apply(lambda x: x.split(':')[1]).astype("int8")
df['MIN'] = df['MIN_whole'] + (df['MIN_seconds'] / 60)

details['MIN'].loc[details['MIN'].str.contains(':',na=False)] = df['MIN']
details['MIN'] = details['MIN'].astype("float16")

# convert negatives to positive
details['MIN'].loc[details['MIN'] < 0] = -(details['MIN'])

#update START_POSITION if did not play (MIN = NaN)
details['START_POSITION'].loc[details['MIN'].isna()] = 'NP'

#update START_POSITION if null
details['START_POSITION'] = details['START_POSITION'].fillna('NS')

#drop unnecessary fields
drop_fields = ['COMMENT', 'TEAM_ABBREVIATION', 'TEAM_CITY', 'PLAYER_NAME', 'NICKNAME'] 
details = details.drop(drop_fields,axis=1)

details

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['MIN_whole'] = df['MIN'].apply(lambda x: x.split(':')[0]).astype("int8")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['MIN_seconds'] = df['MIN'].apply(lambda x: x.split(':')[1]).astype("int8")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['MIN'] = df['MIN_whole'] + (df['MIN_seconds'] / 

Unnamed: 0,GAME_ID,TEAM_ID,PLAYER_ID,START_POSITION,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TO,PF,PTS,PLUS_MINUS
0,22101005,1610612750,1630162,F,36.375000,4.0,10.0,0.400,3.0,8.0,0.375,4.0,4.0,1.000,0.0,8.0,8.0,5.0,3.0,1.0,1.0,1.0,15.0,5.0
1,22101005,1610612750,1630183,F,23.906250,6.0,8.0,0.750,1.0,3.0,0.333,1.0,1.0,1.000,2.0,4.0,6.0,0.0,0.0,2.0,2.0,6.0,14.0,10.0
2,22101005,1610612750,1626157,C,25.281250,4.0,9.0,0.444,1.0,3.0,0.333,6.0,8.0,0.750,1.0,9.0,10.0,0.0,0.0,0.0,3.0,4.0,15.0,14.0
3,22101005,1610612750,1627736,G,30.859375,4.0,9.0,0.444,4.0,9.0,0.444,0.0,0.0,0.000,0.0,3.0,3.0,1.0,1.0,0.0,1.0,4.0,12.0,20.0
4,22101005,1610612750,1626156,G,33.781250,3.0,13.0,0.231,1.0,6.0,0.167,7.0,7.0,1.000,0.0,6.0,6.0,9.0,1.0,0.0,5.0,0.0,14.0,17.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
645948,11200005,1610612743,202706,NS,19.000000,4.0,9.0,0.444,3.0,6.0,0.500,6.0,7.0,0.857,0.0,2.0,2.0,0.0,2.0,0.0,1.0,3.0,17.0,
645949,11200005,1610612743,202702,NS,23.000000,7.0,11.0,0.636,0.0,0.0,0.000,4.0,4.0,1.000,1.0,0.0,1.0,1.0,1.0,0.0,3.0,3.0,18.0,
645950,11200005,1610612743,201585,NS,15.000000,3.0,7.0,0.429,0.0,0.0,0.000,0.0,0.0,0.000,3.0,5.0,8.0,0.0,1.0,0.0,0.0,3.0,6.0,
645951,11200005,1610612743,202389,NS,19.000000,1.0,1.0,1.000,0.0,0.0,0.000,0.0,2.0,0.000,1.0,2.0,3.0,1.0,0.0,0.0,4.0,2.0,2.0,


## Join games and ranking

In [8]:
#rename columns for merging Home Team ranking data
ranking = ranking.rename(columns={'STANDINGSDATE': 'GAME_DATE_EST', 'TEAM_ID': 'HOME_TEAM_ID'})
games_ranking_home = pd.merge(games, ranking, how="left", on=["GAME_DATE_EST", "HOME_TEAM_ID"])

#rename columns for merging Visitor Team ranking data
ranking = ranking.rename(columns={'HOME_TEAM_ID': 'VISITOR_TEAM_ID'})
games_ranking = pd.merge(games_ranking_home, ranking, how="left", on=["GAME_DATE_EST", "VISITOR_TEAM_ID"])
                                  
games_ranking           
                         

Unnamed: 0,GAME_DATE_EST,GAME_ID,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,PTS_home,FG_PCT_home,FT_PCT_home,FG3_PCT_home,AST_home,REB_home,PTS_away,FG_PCT_away,FT_PCT_away,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS,PLAYOFF,CONFERENCE_x,G_x,W_x,L_x,W_PCT_x,HOME_W_x,HOME_L_x,HOME_W_PCT_x,ROAD_W_x,ROAD_L_x,ROAD_W_PCT_x,CONFERENCE_y,G_y,W_y,L_y,W_PCT_y,HOME_W_y,HOME_L_y,HOME_W_PCT_y,ROAD_W_y,ROAD_L_y,ROAD_W_PCT_y
0,2022-03-12,22101005,1610612748,1610612750,2021,104.0,0.398,0.760,0.333,23.0,53.0,113.0,0.422,0.875,0.357,21.0,46.0,0,0,0,69,45,24,0.652,24,9,0.727273,21,15,0.583333,1,69,39,30,0.565,22,12,0.647059,17,18,0.485714
1,2022-03-12,22101006,1610612741,1610612739,2021,101.0,0.443,0.933,0.429,20.0,46.0,91.0,0.419,0.824,0.208,19.0,40.0,1,0,0,67,41,26,0.612,25,10,0.714286,16,16,0.500000,0,67,38,29,0.567,20,11,0.645161,18,18,0.500000
2,2022-03-12,22101007,1610612759,1610612754,2021,108.0,0.412,0.813,0.324,28.0,52.0,119.0,0.489,1.000,0.389,23.0,47.0,0,0,1,68,26,42,0.382,13,21,0.382353,13,21,0.382353,0,68,23,45,0.338,15,19,0.441176,8,26,0.235294
3,2022-03-12,22101008,1610612744,1610612749,2021,122.0,0.484,0.933,0.400,33.0,55.0,109.0,0.413,0.696,0.386,27.0,39.0,1,0,1,68,46,22,0.676,28,7,0.800000,18,15,0.545455,0,68,42,26,0.618,24,12,0.666667,18,14,0.562500
4,2022-03-12,22101009,1610612743,1610612761,2021,115.0,0.551,0.750,0.407,32.0,39.0,127.0,0.471,0.760,0.387,28.0,50.0,0,0,1,68,40,28,0.588,20,13,0.606061,20,15,0.571429,0,67,37,30,0.552,17,15,0.531250,20,15,0.571429
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24053,2014-10-29,21400014,1610612758,1610612744,2014,77.0,0.308,0.743,0.167,13.0,50.0,95.0,0.440,0.719,0.222,21.0,44.0,0,0,1,1,0,1,0.000,0,1,0.000000,0,0,,1,1,1,0,1.000,0,0,,1,0,1.000000
24054,2014-10-29,21400015,1610612757,1610612760,2014,106.0,0.448,0.773,0.379,23.0,42.0,89.0,0.407,0.808,0.125,19.0,43.0,1,0,1,1,1,0,1.000,1,0,1.000000,0,0,,1,1,0,1,0.000,0,0,,0,1,0.000000
24055,2014-10-28,21400001,1610612740,1610612753,2014,101.0,0.406,0.484,0.235,20.0,62.0,84.0,0.381,0.762,0.364,17.0,56.0,1,0,1,1,1,0,1.000,1,0,1.000000,0,0,,0,1,0,1,0.000,0,0,,0,1,0.000000
24056,2014-10-28,21400002,1610612759,1610612742,2014,101.0,0.529,0.813,0.500,23.0,38.0,100.0,0.487,0.842,0.381,17.0,33.0,1,0,1,1,1,0,1.000,1,0,1.000000,0,0,,1,1,0,1,0.000,0,0,,0,1,0.000000


In [9]:
games_ranking.describe(include = 'float').T.applymap('{:,.4f}'.format)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
PTS_home,24058.0,103.2917,13.161,59.0,94.0,103.0,112.0,168.0
FG_PCT_home,24058.0,0.4615,0.0564,0.25,0.422,0.461,0.5,0.684
FT_PCT_home,24058.0,0.7614,0.1005,0.143,0.7,0.767,0.833,1.0
FG3_PCT_home,24058.0,0.3569,0.1118,0.0,0.286,0.357,0.429,1.0
AST_home,24058.0,22.7912,5.1959,6.0,19.0,23.0,26.0,50.0
REB_home,24058.0,43.3263,6.6011,17.0,39.0,43.0,48.0,72.0
PTS_away,24058.0,100.4235,13.3249,54.0,91.0,100.0,109.0,168.0
FG_PCT_away,24058.0,0.4503,0.0553,0.244,0.413,0.449,0.488,0.687
FT_PCT_away,24058.0,0.7597,0.1035,0.143,0.694,0.765,0.833,1.0
FG3_PCT_away,24058.0,0.3503,0.1097,0.0,0.278,0.35,0.421,1.0


## Add correct TARGET to dataframe

In [10]:
games_ranking['TARGET'] = 0

#sort games by the order in which they were played for each home team
games_ranking = games_ranking.sort_values(by = ['HOME_TEAM_ID', 'GAME_ID'], axis=0, ascending=[False, False], ignore_index=True)

# for each season and each team, shift HOME_TEAM_WINS down one to TARGET
home_teams = games_ranking['HOME_TEAM_ID'].unique().tolist()
seasons = games_ranking['SEASON'].unique().tolist()

for season in seasons:
    for team in home_teams:
        games_ranking['TARGET'].loc[(games_ranking['SEASON'] == season) & (games_ranking['HOME_TEAM_ID'] == team)] = games_ranking['HOME_TEAM_WINS'].shift(periods=1)

#remove games with null TARGET
games_ranking = games_ranking[games_ranking['TARGET'].notna()]

games_ranking

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  games_ranking['TARGET'].loc[(games_ranking['SEASON'] == season) & (games_ranking['HOME_TEAM_ID'] == team)] = games_ranking['HOME_TEAM_WINS'].shift(periods=1)


Unnamed: 0,GAME_DATE_EST,GAME_ID,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,PTS_home,FG_PCT_home,FT_PCT_home,FG3_PCT_home,AST_home,REB_home,PTS_away,FG_PCT_away,FT_PCT_away,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS,PLAYOFF,CONFERENCE_x,G_x,W_x,L_x,W_PCT_x,HOME_W_x,HOME_L_x,HOME_W_PCT_x,ROAD_W_x,ROAD_L_x,ROAD_W_PCT_x,CONFERENCE_y,G_y,W_y,L_y,W_PCT_y,HOME_W_y,HOME_L_y,HOME_W_PCT_y,ROAD_W_y,ROAD_L_y,ROAD_W_PCT_y,TARGET
1,2016-04-25,41500124,1610612766,1610612748,2015,89.0,0.400,0.833,0.250,10.0,36.0,85.0,0.395,0.667,0.379,20.0,46.0,1,1,0,82,48,34,0.585,30,11,0.731707,18,23,0.439024,0,82,48,34,0.585,28,13,0.682927,20,21,0.487805,0.0
2,2016-04-23,41500123,1610612766,1610612748,2015,96.0,0.389,0.955,0.278,18.0,47.0,80.0,0.342,0.633,0.318,13.0,53.0,1,1,0,82,48,34,0.585,30,11,0.731707,18,23,0.439024,0,82,48,34,0.585,28,13,0.682927,20,21,0.487805,1.0
3,2014-04-28,41300114,1610612766,1610612748,2013,98.0,0.507,0.700,0.280,22.0,36.0,109.0,0.500,0.759,0.375,25.0,33.0,0,1,0,82,43,39,0.524,25,16,0.609756,18,23,0.439024,0,82,54,28,0.659,32,9,0.780488,22,19,0.536585,1.0
4,2014-04-26,41300113,1610612766,1610612748,2013,85.0,0.415,0.727,0.389,21.0,38.0,98.0,0.434,0.850,0.500,26.0,39.0,0,1,0,82,43,39,0.524,25,16,0.609756,18,23,0.439024,0,82,54,28,0.659,32,9,0.780488,22,19,0.536585,0.0
5,2010-04-26,40900114,1610612766,1610612753,2009,90.0,0.451,0.636,0.263,27.0,36.0,99.0,0.418,0.714,0.394,18.0,38.0,0,1,0,82,44,38,0.537,31,10,0.756098,13,28,0.317073,0,82,59,23,0.720,34,7,0.829268,25,16,0.609756,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24053,2003-11-22,20300177,1610612737,1610612739,2003,92.0,0.447,0.933,0.200,25.0,45.0,83.0,0.330,0.840,0.571,21.0,54.0,1,0,0,14,5,9,0.357,2,3,0.400000,3,6,0.333333,0,14,4,10,0.286,4,2,0.666667,0,8,0.000000,0.0
24054,2003-11-17,20300141,1610612737,1610612764,2003,97.0,0.423,0.872,0.231,12.0,39.0,106.0,0.390,0.750,0.375,19.0,44.0,0,0,0,11,3,8,0.273,1,3,0.250000,2,5,0.285714,0,10,4,6,0.400,2,3,0.400000,2,3,0.400000,1.0
24055,2003-11-15,20300128,1610612737,1610612751,2003,85.0,0.382,0.767,0.333,19.0,39.0,100.0,0.479,0.867,0.444,23.0,38.0,0,0,0,10,3,7,0.300,1,2,0.333333,2,5,0.285714,0,10,5,5,0.500,2,3,0.400000,3,2,0.600000,0.0
24056,2003-11-03,20300042,1610612737,1610612740,2003,90.0,0.427,0.652,0.333,20.0,50.0,80.0,0.407,0.588,0.222,21.0,42.0,1,0,0,4,1,3,0.250,1,1,0.500000,0,2,0.000000,0,4,3,1,0.750,2,0,1.000000,1,1,0.500000,0.0


**Add a few simple Features**

In [11]:
# add 3-game rolling sums for HOME_WINS_LAST_3_HOME, PTS_home_LAST_3_HOME
# add 3-game rolling means for AWAY_WINS_LAST_3_AWAY, PTS_away_LAST_3_AWAY
# these are for the home teams last 3 *home* games
# and for the away teams last 3 *away* games

# first, process data for home team
home_teams = games_ranking['HOME_TEAM_ID'].unique().tolist()
seasons = games_ranking['SEASON'].unique().tolist()

#sort games by the order in which they were played for each home team
games_ranking = games_ranking.sort_values(by = ['HOME_TEAM_ID', 'GAME_ID'], axis=0, ascending=[True, True], ignore_index=True)

games_ranking['HOME_WINS_LAST_3_HOME'] = 0
games_ranking['PTS_home_LAST_3_HOME'] = 0


for season in seasons:
    for team in home_teams:
        games_ranking['HOME_WINS_LAST_3_HOME'].loc[(games_ranking['SEASON'] == season) & (games_ranking['HOME_TEAM_ID'] == team)] = games_ranking['HOME_TEAM_WINS'].rolling(3).sum()
        games_ranking['PTS_home_LAST_3_HOME'].loc[(games_ranking['SEASON'] == season) & (games_ranking['HOME_TEAM_ID'] == team)] = games_ranking['PTS_home'].rolling(3).mean()


# now for away teams        
games_ranking = games_ranking.sort_values(by = ['VISITOR_TEAM_ID', 'GAME_ID'], axis=0, ascending=[True, True], ignore_index=True)

games_ranking['AWAY_WINS_LAST_3_AWAY'] = 0
games_ranking['PTS_away_LAST_3_AWAY'] = 0

for season in seasons:
    for team in home_teams:
        games_ranking['AWAY_WINS_LAST_3_AWAY'].loc[(games_ranking['SEASON'] == season) & (games_ranking['HOME_TEAM_ID'] == team)] = (games_ranking['HOME_TEAM_WINS'] == 0).astype('int8').rolling(3).sum()
        games_ranking['PTS_away_LAST_3_AWAY'].loc[(games_ranking['SEASON'] == season) & (games_ranking['HOME_TEAM_ID'] == team)] = games_ranking['PTS_away'].rolling(3).mean()
 
        
games_ranking


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  games_ranking['HOME_WINS_LAST_3_HOME'].loc[(games_ranking['SEASON'] == season) & (games_ranking['HOME_TEAM_ID'] == team)] = games_ranking['TARGET'].rolling(3).sum()
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  games_ranking['PTS_home_LAST_3_HOME'].loc[(games_ranking['SEASON'] == season) & (games_ranking['HOME_TEAM_ID'] == team)] = games_ranking['PTS_home'].rolling(3).mean()
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  games_ranking['AWAY_WINS_LAST_3_AWAY'].loc[(

Unnamed: 0,GAME_DATE_EST,GAME_ID,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,PTS_home,FG_PCT_home,FT_PCT_home,FG3_PCT_home,AST_home,REB_home,PTS_away,FG_PCT_away,FT_PCT_away,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS,PLAYOFF,CONFERENCE_x,G_x,W_x,L_x,W_PCT_x,HOME_W_x,HOME_L_x,HOME_W_PCT_x,ROAD_W_x,ROAD_L_x,ROAD_W_PCT_x,CONFERENCE_y,G_y,W_y,L_y,W_PCT_y,HOME_W_y,HOME_L_y,HOME_W_PCT_y,ROAD_W_y,ROAD_L_y,ROAD_W_PCT_y,TARGET,HOME_WINS_LAST_3_HOME,PTS_home_LAST_3_HOME,AWAY_WINS_LAST_3_AWAY,PTS_away_LAST_3_AWAY
0,2003-10-29,20300006,1610612740,1610612737,2003,88.0,0.324,0.700,0.160,24.0,55.0,83.0,0.398,0.737,0.214,18.0,58.0,1,0,0,1,1,0,1.000,1,0,1.000000,0,0,,0,1,0,1,0.000,0,0,,0,1,0.000000,1.0,2.0,91.666667,,
1,2003-10-31,20300024,1610612741,1610612737,2003,100.0,0.400,0.759,0.500,27.0,53.0,94.0,0.400,0.714,0.583,22.0,48.0,1,0,0,2,1,1,0.500,1,1,0.500000,0,0,,0,2,0,2,0.000,0,0,,0,2,0.000000,0.0,1.0,88.666667,,
2,2003-11-05,20300060,1610612744,1610612737,2003,99.0,0.446,0.645,0.278,23.0,52.0,72.0,0.367,0.500,0.333,19.0,43.0,1,0,1,4,2,2,0.500,2,1,0.666667,0,1,0.000000,0,5,1,4,0.200,1,1,0.500000,0,3,0.000000,1.0,3.0,96.666667,1.0,83.000000
3,2003-11-08,20300084,1610612757,1610612737,2003,90.0,0.425,0.900,0.500,28.0,41.0,83.0,0.438,0.786,0.100,21.0,45.0,1,0,1,6,3,3,0.500,3,1,0.750000,0,2,0.000000,0,6,1,5,0.167,1,1,0.500000,0,4,0.000000,1.0,3.0,88.666667,1.0,83.000000
4,2003-11-09,20300089,1610612760,1610612737,2003,81.0,0.379,0.737,0.056,12.0,46.0,91.0,0.479,0.789,0.533,16.0,41.0,0,0,1,4,3,1,0.750,2,1,0.666667,1,0,1.000000,0,7,2,5,0.286,1,1,0.500000,1,4,0.200000,0.0,1.0,96.666667,1.0,82.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24052,2016-04-17,41500121,1610612748,1610612766,2015,123.0,0.576,0.762,0.500,27.0,42.0,91.0,0.431,0.784,0.353,11.0,28.0,1,1,0,82,48,34,0.585,28,13,0.682927,20,21,0.487805,0,82,48,34,0.585,30,11,0.731707,18,23,0.439024,1.0,2.0,100.333333,0.0,92.000000
24053,2016-04-20,41500122,1610612748,1610612766,2015,115.0,0.579,0.818,0.563,19.0,35.0,103.0,0.427,0.788,0.063,9.0,39.0,1,1,0,82,48,34,0.585,28,13,0.682927,20,21,0.487805,0,82,48,34,0.585,30,11,0.731707,18,23,0.439024,0.0,2.0,108.000000,1.0,97.000000
24054,2016-04-27,41500125,1610612748,1610612766,2015,88.0,0.420,0.789,0.278,17.0,50.0,90.0,0.393,0.800,0.500,21.0,41.0,0,1,0,82,48,34,0.585,28,13,0.682927,20,21,0.487805,0,82,48,34,0.585,30,11,0.731707,18,23,0.439024,1.0,2.0,108.666667,1.0,94.666667
24055,2016-05-01,41500127,1610612748,1610612766,2015,106.0,0.483,0.688,0.375,24.0,58.0,73.0,0.321,1.000,0.318,14.0,36.0,1,1,0,82,48,34,0.585,28,13,0.682927,20,21,0.487805,0,82,48,34,0.585,30,11,0.731707,18,23,0.439024,0.0,1.0,103.000000,2.0,88.666667


**Save Train Data**

In [12]:
games_ranking.to_csv(DATAPATH / "transformed.csv",index=False)

**Sweetviz data visualization**

In [13]:
def run_sweetviz_report(df, TARGET):
    
    import sweetviz as sv
    from datetime import datetime
    
    report_label = datetime.today().strftime('%Y-%m-%d_%H_%M')
    
    my_report = sv.analyze(df,target_feat=TARGET)
    my_report.show_html(filepath='SWEETVIZ_' + report_label + '.html')
    
    return

df = pd.read_csv(DATAPATH / "transformed.csv")
run_sweetviz_report(df,'TARGET')

  all_source_names = [cur_name for cur_name, cur_series in source_df.iteritems()]
  filtered_series_names_in_source = [cur_name for cur_name, cur_series in source_df.iteritems()


                                             |          | [  0%]   00:00 -> (? left)

  for item in category_counts.iteritems():
  for item in to_process.source_counts["value_counts_without_nan"].iteritems():
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  for item in category_counts.iteritems():
  for item in category_counts.iteritems():
  for item in category_counts.iteritems():
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad()
  stats["mad"] = series.mad(

Report SWEETVIZ_2022-10-14_07_35.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


**Correlations with TARGET**

In [14]:
df = games_ranking.drop(columns=['GAME_ID', 'TARGET'])

x = df.corrwith(games_ranking['TARGET'], method = 'spearman').sort_values(ascending=False)

print(x)

HOME_WINS_LAST_3_HOME    0.565372
HOME_TEAM_WINS           0.201666
W_PCT_x                  0.198058
HOME_W_PCT_x             0.182644
ROAD_W_PCT_x             0.169725
PLAYOFF                  0.141798
ROAD_W_x                 0.102559
W_x                      0.099765
HOME_W_x                 0.093858
FG_PCT_home              0.049822
PTS_home_LAST_3_HOME     0.048203
AST_home                 0.045139
PTS_home                 0.041201
HOME_L_y                 0.034238
FG3_PCT_home             0.034229
L_y                      0.033399
ROAD_L_y                 0.031714
REB_home                 0.022203
CONFERENCE_x             0.014261
FT_PCT_home              0.013211
G_x                      0.010127
G_y                      0.009344
HOME_W_y                -0.011774
ROAD_W_y                -0.012187
W_y                     -0.012331
FG3_PCT_away            -0.013738
HOME_TEAM_ID            -0.016253
FT_PCT_away             -0.018810
SEASON                  -0.028647
HOME_W_PCT_y  

  x = df.corrwith(games_ranking['TARGET'], method = 'spearman').sort_values(ascending=False)
