# "Data Cleaning & Pipelines"
> "It's not the sexiest part of data science but it is probably the most important"

- toc: false
- branch: master
- badges: true
- comments: true
- categories: [data cleaning, data preparation, pandas.pipe, NBA]
- image: https://media.giphy.com/media/Qvpxb0bju1rEp9Nipy/giphy.gif

## Goal: Prepare the NBA dataset for a classification model 
The data exploration post showed how to use knowledge about a dataset to interpret information. Since we know how the 2017-2019 seasons went for the Milwaukee Bucks and Sacramento Kings, we can now plan out our machine learning problem. The model will attempt to predict the outcome of an NBA game before it is played. We can start with a logistic regression model to get a probabilistic output, and look into other classification models after we give this one a go. This article covers the most important part of a machine learning project: defining the problem and preparing the data.

In [1]:
#hide
import os
from pathlib import Path
import pandas as pd
import numpy as np

### Set all the necessary paths for the data 
The data was provided by https://www.basketball-reference.com/, a great starting point for anyone interested in sports analytics. Later in the project I can go into more depth on why the level of detail in sports data matters.

In [10]:
DATA_FOLDER = Path(os.getcwd(), 'mil_sac_data')

In [54]:
sac_2017_szn = Path(DATA_FOLDER, 'sac_2017_2018_szn.csv')
sac_2018_szn = Path(DATA_FOLDER, 'sac_2018_2019_szn.csv')
mil_2017_szn = Path(DATA_FOLDER, 'mil_2017_2018_szn.csv')
mil_2018_szn = Path(DATA_FOLDER, 'mil_2018_2019_szn.csv')

In [55]:
sac_2017_df = pd.read_csv(sac_2017_szn, header=[0,1])

In [56]:
sac_2017_df.head(5)

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,Team,Team,...,Opponent,Opponent,Opponent,Opponent,Opponent,Opponent,Opponent,Opponent,Opponent,Opponent
Unnamed: 0_level_1,Rk,G,Date,Unnamed: 3_level_1,Opp,W/L,Tm,Opp,FG,FGA,...,FT,FTA,FT%,ORB,TRB,AST,STL,BLK,TOV,PF
0,1,1,2017-10-18,,HOU,L,100,105,42,88,...,27,29,0.931,12,44,19,7,3,14,14
1,2,2,2017-10-20,@,DAL,W,93,88,37,87,...,15,21,0.714,7,36,19,8,7,12,13
2,3,3,2017-10-21,@,DEN,L,79,96,31,85,...,12,20,0.6,18,58,25,7,2,16,19
3,4,4,2017-10-23,@,PHO,L,115,117,43,99,...,23,27,0.852,6,45,20,6,5,20,25
4,5,5,2017-10-26,,NOP,L,106,114,38,81,...,17,23,0.739,10,47,22,5,3,15,24
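
The "Unnamed" labels in this output come from reading a two-row header with `header=[0, 1]`: pandas builds a MultiIndex for the columns and fills blank header cells with placeholder names. Here is a minimal sketch of that behavior, using a made-up in-memory CSV (the layout and values are illustrative assumptions, not the real file):

```python
import io

import pandas as pd

# Hypothetical miniature of the real file: two header rows, with blank cells
# where the source table has no group label or no column label.
csv_text = (
    ",,Team,Team\n"
    "G,,FG,FGA\n"
    "1,@,37,87\n"
    "2,,42,88\n"
)

toy_df = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Each column is now a (level 0, level 1) tuple; blank header cells are
# given placeholder names that start with "Unnamed".
for column in toy_df.columns:
    print(column)
```

Printing the tuples makes it easier to see which level each placeholder belongs to before deciding how to flatten them.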


### Merge the multi index headers and remove the "Unnamed: " tags within the columns of the df
The two header levels are first joined into a single name with a '.' separator. Then, using a regular expression, the "Unnamed: " tags can be removed from the column names.

In [57]:
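# Join the two header levels with '.', then trim any leading/trailing dots
flat_columns = sac_2017_df.columns.map('.'.join).str.strip('.')

In [58]:
# Remove each "Unnamed: N_level_M." prefix; the final dot is escaped so it only matches a literal '.'
sac_2017_df.columns = flat_columns.str.replace(r"Unnamed: [0-9]_level_[0-9]\.", '', regex=True)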

In [59]:
sac_2017_df.head(5)

Unnamed: 0,Rk,G,Date,Unnamed: 3_level_1,Opp,W/L,Tm,Opp.1,Team.FG,Team.FGA,...,Opponent.FT,Opponent.FTA,Opponent.FT%,Opponent.ORB,Opponent.TRB,Opponent.AST,Opponent.STL,Opponent.BLK,Opponent.TOV,Opponent.PF
0,1,1,2017-10-18,,HOU,L,100,105,42,88,...,27,29,0.931,12,44,19,7,3,14,14
1,2,2,2017-10-20,@,DAL,W,93,88,37,87,...,15,21,0.714,7,36,19,8,7,12,13
2,3,3,2017-10-21,@,DEN,L,79,96,31,85,...,12,20,0.6,18,58,25,7,2,16,19
3,4,4,2017-10-23,@,PHO,L,115,117,43,99,...,23,27,0.852,6,45,20,6,5,20,25
4,5,5,2017-10-26,,NOP,L,106,114,38,81,...,17,23,0.739,10,47,22,5,3,15,24
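
To see how the join and the regex interact, here is a small self-contained sketch on a toy MultiIndex. The column tuples are made up to mirror the real header, and the pattern escapes the final dot so that it only matches a literal '.'. A prefix is removed only when a dot follows it, which is why a name that *ends* in an "Unnamed" tag keeps that tag.

```python
import pandas as pd

# Toy two-level header mimicking the real file (names are illustrative).
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 1_level_0", "G"),
    ("Unnamed: 3_level_0", "Unnamed: 3_level_1"),
    ("Team", "FG"),
    ("Opponent", "FG"),
])
toy_df = pd.DataFrame([[1, "@", 42, 38]], columns=columns)

# Step 1: join the two levels into "level0.level1".
flat = toy_df.columns.map(".".join).str.strip(".")

# Step 2: drop every "Unnamed: N_level_M." prefix (tag followed by a dot).
cleaned = flat.str.replace(r"Unnamed: [0-9]_level_[0-9]\.", "", regex=True)

print(list(cleaned))
# ['G', 'Unnamed: 3_level_1', 'Team.FG', 'Opponent.FG']
```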


There is still an 'Unnamed: 3_level_1' tag after the regex processing. This column records whether the team of interest was playing at home or away. We won't be using the column as is, so we can derive a new column from it and drop 'Unnamed: 3_level_1' afterwards.

The existing column holds one of two discrete values: NaN when the team was playing at home, or '@' when it was playing away. We can check whether each row is NaN with the pandas .isnull() method and store the result as a new column.

In [68]:
sac_2017_df['playing_home'] = sac_2017_df['Unnamed: 3_level_1'].isnull()

Now that we have our new column, we can drop the existing "Unnamed: 3_level_1" column because "playing_home" carries the same information.

In [71]:
sac_2017_df.drop(columns=['Unnamed: 3_level_1'], inplace=True)
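
The two cells above modify the dataframe in place. As an alternative sketch, the same derive-then-drop step can be written as one chained expression that returns a new frame and leaves the original untouched. The toy rows below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: NaN marks a home game, '@' marks an away game (made-up rows).
games = pd.DataFrame({
    "Opp": ["HOU", "DAL", "DEN"],
    "Unnamed: 3_level_1": [np.nan, "@", "@"],
})

# assign() and drop() both return new frames, so `games` is not modified.
cleaned = (
    games
    .assign(playing_home=games["Unnamed: 3_level_1"].isnull())
    .drop(columns=["Unnamed: 3_level_1"])
)

print(cleaned)
```

Avoiding `inplace=True` is a style choice rather than a requirement; it mainly helps when the same raw frame is reused across notebook cells.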

In [72]:
sac_2017_df.head()

Unnamed: 0,Rk,G,Date,Opp,W/L,Tm,Opp.1,Team.FG,Team.FGA,Team.FG%,...,Opponent.FTA,Opponent.FT%,Opponent.ORB,Opponent.TRB,Opponent.AST,Opponent.STL,Opponent.BLK,Opponent.TOV,Opponent.PF,playing_home
0,1,1,2017-10-18,HOU,L,100,105,42,88,0.477,...,29,0.931,12,44,19,7,3,14,14,True
1,2,2,2017-10-20,DAL,W,93,88,37,87,0.425,...,21,0.714,7,36,19,8,7,12,13,False
2,3,3,2017-10-21,DEN,L,79,96,31,85,0.365,...,20,0.6,18,58,25,7,2,16,19,False
3,4,4,2017-10-23,PHO,L,115,117,43,99,0.434,...,27,0.852,6,45,20,6,5,20,25,False
4,5,5,2017-10-26,NOP,L,106,114,38,81,0.469,...,23,0.739,10,47,22,5,3,15,24,True


To prepare this data for a logistic regression model, we also need to convert the non-numeric columns we plan to use into numerical values. Specifically, the target column "W/L" becomes a boolean "dub" column (True for a win, False for a loss), which can be cast to 0/1 whenever integers are needed.

In [76]:
sac_2017_df['dub'] = sac_2017_df['W/L'] == 'W'

In [79]:
sac_2017_df.head()

Unnamed: 0,Rk,G,Date,Opp,W/L,Tm,Opp.1,Team.FG,Team.FGA,Team.FG%,...,Opponent.FT%,Opponent.ORB,Opponent.TRB,Opponent.AST,Opponent.STL,Opponent.BLK,Opponent.TOV,Opponent.PF,playing_home,dub
0,1,1,2017-10-18,HOU,L,100,105,42,88,0.477,...,0.931,12,44,19,7,3,14,14,True,False
1,2,2,2017-10-20,DAL,W,93,88,37,87,0.425,...,0.714,7,36,19,8,7,12,13,False,True
2,3,3,2017-10-21,DEN,L,79,96,31,85,0.365,...,0.6,18,58,25,7,2,16,19,False,False
3,4,4,2017-10-23,PHO,L,115,117,43,99,0.434,...,0.852,6,45,20,6,5,20,25,False,False
4,5,5,2017-10-26,NOP,L,106,114,38,81,0.469,...,0.739,10,47,22,5,3,15,24,True,False
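
The "dub" column holds booleans. If a model or library expects integers instead, the cast is a one-liner. Here is a short sketch on a made-up series that mirrors the first five results shown above:

```python
import pandas as pd

# Made-up sample mirroring the first five "W/L" values.
results = pd.Series(["L", "W", "L", "L", "L"], name="W/L")

dub = results == "W"          # boolean: True for a win
dub_as_int = dub.astype(int)  # integer: 1 for a win, 0 for a loss

print(dub_as_int.tolist())  # [0, 1, 0, 0, 0]
```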


### Might as well make a pipeline
We have established at least our first pass at preparing the dataset. Since the other dataframes need the same preparation, we can avoid repeating ourselves by creating a data pipeline. The pipeline takes each original dataframe and runs the same preprocessing steps, so every dataset is treated identically. Pipelines are not required, but they help you stay organized.

To make a pipeline, we'll turn the previous steps into a function that each dataframe can be passed through.

In [89]:
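def data_pipeline(df):
    """Flatten the header, derive 'playing_home' and 'dub', and drop the raw columns."""
    df = df.copy()  # work on a copy so the caller's dataframe is not modified
    # Merge the two header levels and strip the "Unnamed: N_level_M." prefixes
    flat_columns = df.columns.map('.'.join).str.strip('.')
    df.columns = flat_columns.str.replace(r"Unnamed: [0-9]_level_[0-9]\.", '', regex=True)
    # NaN in the home/away column means a home game, '@' means an away game
    df['playing_home'] = df['Unnamed: 3_level_1'].isnull()
    # True for a win, False for a loss
    df['dub'] = df['W/L'] == 'W'
    # The raw columns are redundant once the derived columns exist
    return df.drop(columns=['Unnamed: 3_level_1', 'W/L'])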
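
As a variation, the same preprocessing can be split into one small function per step and chained with `.pipe()`, which makes each step easier to test or swap out. This is only a sketch: the function names `flatten_columns`, `add_playing_home`, and `add_dub` are hypothetical, and the toy frame stands in for the real season files.

```python
import numpy as np
import pandas as pd


def flatten_columns(df):
    """Join the two header levels and strip 'Unnamed: N_level_M.' prefixes."""
    out = df.copy()
    flat = out.columns.map(".".join).str.strip(".")
    out.columns = flat.str.replace(r"Unnamed: [0-9]_level_[0-9]\.", "", regex=True)
    return out


def add_playing_home(df, source="Unnamed: 3_level_1"):
    """NaN in the source column means a home game; drop the source afterwards."""
    return df.assign(playing_home=df[source].isnull()).drop(columns=[source])


def add_dub(df, source="W/L"):
    """True for a win, False for a loss; drop the source afterwards."""
    return df.assign(dub=df[source] == "W").drop(columns=[source])


# Toy two-level header and rows (made up to mirror the real data).
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 3_level_0", "Unnamed: 3_level_1"),
    ("Unnamed: 5_level_0", "W/L"),
    ("Team", "FG"),
])
raw = pd.DataFrame([[np.nan, "L", 42], ["@", "W", 37]], columns=columns)

prepared = raw.pipe(flatten_columns).pipe(add_playing_home).pipe(add_dub)
print(prepared)
```

Each step returns a new frame, so `raw` keeps its original two-level header and can be inspected again if a step misbehaves.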

### Running the pipeline
Wrapping the steps in a function removes the duplicated lines: every similar dataset is processed by the same function. The code below reads in each dataset and immediately calls the pandas .pipe() method, passing in the preprocessing function. Though we didn't use it here, .pipe() also accepts positional and keyword arguments, which it forwards to the function being run.

In [90]:
sac_2017_df = pd.read_csv(sac_2017_szn, header=[0, 1]).pipe(data_pipeline)
sac_2018_df = pd.read_csv(sac_2018_szn, header=[0, 1]).pipe(data_pipeline)
mil_2017_df = pd.read_csv(mil_2017_szn, header=[0, 1]).pipe(data_pipeline)
mil_2018_df = pd.read_csv(mil_2018_szn, header=[0, 1]).pipe(data_pipeline)
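
To illustrate the argument forwarding mentioned above, here is a minimal sketch. The `label_games` function and its `team` and `season` parameters are hypothetical and not part of the pipeline in this post; the toy rows are made up.

```python
import pandas as pd


def label_games(df, team, season="unknown"):
    """Hypothetical step: tag every row with the team and season it came from."""
    return df.assign(team=team, season=season)


toy = pd.DataFrame({"Tm": [100, 93], "dub": [False, True]})  # made-up rows

# .pipe(func, *args, **kwargs) calls func(toy, "SAC", season="2017-18").
labeled = toy.pipe(label_games, "SAC", season="2017-18")
print(labeled)
```

A step like this could be useful if the four season dataframes are later combined, since each row would still record where it came from.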