# Data Cleaning

This script cleans the `games` dataset and gets it ready for modelling. We start by importing the standard libraries and reading in the data.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('Data/games.csv')
df.head()

Unnamed: 0,GAME_DATE_EST,GAME_ID,GAME_STATUS_TEXT,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,TEAM_ID_home,PTS_home,FG_PCT_home,FT_PCT_home,...,AST_home,REB_home,TEAM_ID_away,PTS_away,FG_PCT_away,FT_PCT_away,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS
0,2021-05-26,42000102,Final,1610612755,1610612764,2020,1610612755,120.0,0.557,0.684,...,26.0,45.0,1610612764,95.0,0.402,0.633,0.091,22.0,40.0,1
1,2021-05-26,42000132,Final,1610612752,1610612737,2020,1610612752,101.0,0.383,0.739,...,15.0,54.0,1610612737,92.0,0.369,0.818,0.273,17.0,41.0,1
2,2021-05-26,42000142,Final,1610612762,1610612763,2020,1610612762,141.0,0.544,0.774,...,28.0,42.0,1610612763,129.0,0.541,0.763,0.348,20.0,33.0,1
3,2021-05-25,42000112,Final,1610612751,1610612738,2020,1610612751,130.0,0.523,0.955,...,31.0,46.0,1610612738,108.0,0.424,0.783,0.353,23.0,43.0,1
4,2021-05-25,42000152,Final,1610612756,1610612747,2020,1610612756,102.0,0.465,0.933,...,21.0,31.0,1610612747,109.0,0.45,0.871,0.303,24.0,39.0,0


In [3]:
df.shape

(24677, 21)

We have 24,677 games to train on. Let's bring in the `teams` dataset so that we can build dictionaries that convert between team ID and team abbreviation.

In [4]:
# read in the dataset
teams = pd.read_csv('Data/teams.csv')

# generate an ID-to-name dictionary from the teams dataset
teams = teams[['TEAM_ID', 'ABBREVIATION']]
teams = teams.set_index('TEAM_ID')
id_to_name = teams.to_dict()['ABBREVIATION']

# generate the same dictionary in reverse (i.e. name-to-ID instead of ID-to-name)
name_to_id = {v: k for k, v in id_to_name.items()}

In [5]:
id_to_name

{1610612737: 'ATL',
 1610612738: 'BOS',
 1610612740: 'NOP',
 1610612741: 'CHI',
 1610612742: 'DAL',
 1610612743: 'DEN',
 1610612745: 'HOU',
 1610612746: 'LAC',
 1610612747: 'LAL',
 1610612748: 'MIA',
 1610612749: 'MIL',
 1610612750: 'MIN',
 1610612751: 'BKN',
 1610612752: 'NYK',
 1610612753: 'ORL',
 1610612754: 'IND',
 1610612755: 'PHI',
 1610612756: 'PHX',
 1610612757: 'POR',
 1610612758: 'SAC',
 1610612759: 'SAS',
 1610612760: 'OKC',
 1610612761: 'TOR',
 1610612762: 'UTA',
 1610612763: 'MEM',
 1610612764: 'WAS',
 1610612765: 'DET',
 1610612766: 'CHA',
 1610612739: 'CLE',
 1610612744: 'GSW'}

In [6]:
name_to_id

{'ATL': 1610612737,
 'BOS': 1610612738,
 'NOP': 1610612740,
 'CHI': 1610612741,
 'DAL': 1610612742,
 'DEN': 1610612743,
 'HOU': 1610612745,
 'LAC': 1610612746,
 'LAL': 1610612747,
 'MIA': 1610612748,
 'MIL': 1610612749,
 'MIN': 1610612750,
 'BKN': 1610612751,
 'NYK': 1610612752,
 'ORL': 1610612753,
 'IND': 1610612754,
 'PHI': 1610612755,
 'PHX': 1610612756,
 'POR': 1610612757,
 'SAC': 1610612758,
 'SAS': 1610612759,
 'OKC': 1610612760,
 'TOR': 1610612761,
 'UTA': 1610612762,
 'MEM': 1610612763,
 'WAS': 1610612764,
 'DET': 1610612765,
 'CHA': 1610612766,
 'CLE': 1610612739,
 'GSW': 1610612744}
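
Eyeballing the two printouts works for 30 teams, but the same check can be automated. The following is a minimal sketch, assuming a hypothetical three-row stand-in for `Data/teams.csv` (the IDs and abbreviations are copied from the output above), which confirms that the two dictionaries are exact inverses of each other.

```python
import pandas as pd

# Hypothetical stand-in for Data/teams.csv, using three rows from the output above
teams = pd.DataFrame({
    'TEAM_ID': [1610612737, 1610612738, 1610612744],
    'ABBREVIATION': ['ATL', 'BOS', 'GSW'],
})

# Series.to_dict() gives the same ID -> abbreviation mapping as the cell above
id_to_name = teams.set_index('TEAM_ID')['ABBREVIATION'].to_dict()
name_to_id = {name: team_id for team_id, name in id_to_name.items()}

# If two IDs shared an abbreviation, the reversed dictionary would be shorter
assert len(name_to_id) == len(id_to_name)

# Every ID should survive a round trip through both dictionaries
assert all(name_to_id[name] == team_id for team_id, name in id_to_name.items())
```

If either assertion fails on the real data, the `teams` file contains a duplicated abbreviation and the reverse lookup cannot be trusted.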

Looks like we generated these dictionaries properly. Let's rename the columns in the games dataframe to something more readable, and keep only the following features for each game:
- date
- season
- home team id
- away team id
- is_home_win

From there, we will be able to use our lookup tables (LUTs) to add in the real features for each team involved in the game.

In [7]:
df = df[['GAME_DATE_EST', 'SEASON', 'HOME_TEAM_ID', 'VISITOR_TEAM_ID', 'HOME_TEAM_WINS']]
# rename returns a new dataframe, which avoids modifying a slice in place
df = df.rename(columns={'GAME_DATE_EST' : 'date', 'SEASON' : 'season', 'HOME_TEAM_ID' : 'home_id',
                        'VISITOR_TEAM_ID' : 'away_id', 'HOME_TEAM_WINS' : 'is_home_win'})
df = df.sort_values('date')
# drop=True discards the old index instead of adding it as a column
df = df.reset_index(drop=True)

df.head()

Unnamed: 0,date,season,home_id,away_id,is_home_win
0,2003-10-05,2003,1610612762,1610612742,1
1,2003-10-06,2003,1610612763,1610612749,1
2,2003-10-07,2003,1610612758,1610612746,1
3,2003-10-07,2003,1610612757,1610612745,1
4,2003-10-07,2003,1610612748,1610612755,1
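
The sort above operates on date strings, which works here because `YYYY-MM-DD` strings sort in chronological order. As a hedged alternative, the sketch below parses the column into real datetimes before sorting, which also makes later date arithmetic possible. The `games` frame and its rows are hypothetical and only mimic the shape of the dataframe above.

```python
import pandas as pd

# Hypothetical rows in the same shape as the dataframe above, deliberately out of order
games = pd.DataFrame({
    'date': ['2003-10-07', '2003-10-05', '2003-10-06'],
    'home_id': [1610612758, 1610612762, 1610612763],
})

# Parse the strings into datetime64 values; format= makes the expected layout explicit
games['date'] = pd.to_datetime(games['date'], format='%Y-%m-%d')

# Sort chronologically and rebuild a clean 0..n-1 index in one step
games = games.sort_values('date').reset_index(drop=True)
```

Passing `format=` makes `pd.to_datetime` raise on a malformed date string rather than guessing, which is usually what you want during cleaning.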


To make the dataframe easier to read, let's replace the ID columns with team abbreviations and add a `winner` column that names the winning team (purely for readability).

In [8]:
df['home'] = df['home_id'].map(id_to_name)
df['away'] = df['away_id'].map(id_to_name)

df = df[['date', 'season', 'home', 'away', 'is_home_win']].copy()

df['winner'] = np.where(df['is_home_win'], df['home'], df['away'])

df.head()

Unnamed: 0,date,season,home,away,is_home_win,winner
0,2003-10-05,2003,UTA,DAL,1,UTA
1,2003-10-06,2003,MEM,MIL,1,MEM
2,2003-10-07,2003,SAC,LAC,1,SAC
3,2003-10-07,2003,POR,HOU,1,POR
4,2003-10-07,2003,MIA,PHI,1,MIA
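
One thing to watch with `Series.map` and a dictionary: any ID that is missing from the dictionary silently becomes `NaN` instead of raising an error. The sketch below shows one way to surface such rows. The `games` frame and the ID `9999` are made up for illustration, and the dictionary is a two-entry subset of `id_to_name`.

```python
import pandas as pd

# Two-entry subset of the id_to_name dictionary built earlier
id_to_name = {1610612762: 'UTA', 1610612742: 'DAL'}

# Hypothetical games; 9999 is a made-up ID that is absent from the dictionary
games = pd.DataFrame({
    'home_id': [1610612762, 9999],
    'away_id': [1610612742, 1610612762],
})

# IDs that are not keys of the dictionary map to NaN
games['home'] = games['home_id'].map(id_to_name)
games['away'] = games['away_id'].map(id_to_name)

# Keep only the rows where either lookup failed
unmapped = games[games[['home', 'away']].isna().any(axis=1)]
print(unmapped)
```

An empty `unmapped` frame on the real data would confirm that every game involves teams covered by the `teams` dataset.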


Let's save this dataframe to disk so that we can load it in our modelling script.

In [10]:
df.to_csv('Data/games_modelling.csv', index=False)
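
For the loading side, here is a rough sketch of how the modelling script might read the file back. It assumes a hypothetical two-row frame with the same columns as above, and it uses an in-memory buffer in place of `Data/games_modelling.csv` so that it runs on its own.

```python
import io
import pandas as pd

# Hypothetical two-row frame with the same columns as the saved file
games = pd.DataFrame({
    'date': ['2003-10-05', '2003-10-06'],
    'season': [2003, 2003],
    'home': ['UTA', 'MEM'],
    'away': ['DAL', 'MIL'],
    'is_home_win': [1, 1],
    'winner': ['UTA', 'MEM'],
})

# An in-memory buffer stands in for 'Data/games_modelling.csv'
buffer = io.StringIO()
games.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the date column as datetime64 instead of plain strings
reloaded = pd.read_csv(buffer, parse_dates=['date'])
```

Because the file was written with `index=False`, the reloaded frame has exactly the six saved columns and no leftover index column.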