## Exploratory Data Analysis for Data Cleaning

Script Name: data_cleaning_eda.ipynb

Author: Brian Cain


The purpose of this Jupyter notebook is to explore the joined data for characteristics that need to be cleaned before advancing to the data analysis phase. Findings from this EDA will motivate the data cleaning functions defined in the data_cleaning.py script. This notebook will also evolve over time: if data issues arise in the future, I will come back to it to explore them further and to document any changes to be made to data_cleaning.py.

<hr>

In [1]:
##Import pandas for dataframe management/operations
import pandas as pd

##Import numpy 
import numpy as np

##Import tabulate for organized table creation
from tabulate import tabulate

In [2]:
##Pull in the joined dataframe
joinedDf = pd.read_csv('D:\\College_Football_Model_Data\\joinedDf.csv')

##Display the first couple rows of the data to ensure it's been pulled in 
joinedDf.head()

Unnamed: 0,gameId,week_num,school,rush_td,pass_td,rush_attempt,yp_rush,rush_yards,yp_pass,completion_attempts,...,offensive_plays,offensive_drives,offensive_ppa,offensive_successRate,offensive_explosiveness,offensive_powerSuccess,offensive_stuffRate,offensive_lineYards,offensive_secondLevelYards,offensive_openFieldYards
0,400603830,1,Florida,4,4,41,5.4,222,10.1,31-38,...,81.0,13.0,0.489647,0.567901,1.272267,0.666667,0.146341,3.360976,1.414634,1.121951
1,400603830,1,New Mexico State,1,1,22,2.9,64,4.7,15-29,...,51.0,13.0,0.000437,0.27451,1.332495,0.666667,0.25,2.565,0.95,1.0
2,400787302,1,Ohio,2,3,38,5.4,205,11.4,20-25,...,64.0,10.0,0.474904,0.53125,1.228407,1.0,0.111111,3.752778,1.638889,1.388889
3,400787302,1,Idaho,2,1,28,3.6,100,6.2,36-48,...,78.0,12.0,0.094651,0.538462,0.832628,0.75,0.08,3.52,1.08,0.52
4,400763403,1,Texas,0,0,29,2.1,60,4.5,8-23,...,53.0,12.0,-0.101281,0.264151,1.123873,0.0,0.12,2.532,0.72,0.04


Let's first examine the data types to see which features might need cleaning.

In [3]:
##Display the dataframe data types
joinedDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10078 entries, 0 to 10077
Data columns (total 40 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   gameId                      10078 non-null  int64  
 1   week_num                    10078 non-null  int64  
 2   school                      10078 non-null  object 
 3   rush_td                     10078 non-null  int64  
 4   pass_td                     10078 non-null  int64  
 5   rush_attempt                10078 non-null  int64  
 6   yp_rush                     10078 non-null  float64
 7   rush_yards                  10078 non-null  int64  
 8   yp_pass                     10078 non-null  float64
 9   completion_attempts         10078 non-null  object 
 10  pass_yards                  10078 non-null  int64  
 11  total_yards                 10078 non-null  int64  
 12  turnovers                   10078 non-null  int64  
 13  tfl                         100

From prior knowledge of this dataset, all the float64 and int64 columns look correct as is. Some columns with the object data type, however, must be dealt with before the rest of the cleaning can start. These are listed below:

<b><i>Hyphenated Data:</i></b>
* completion_attempts
* penalty_yards
* fourthDown_eff
* thirdDown_eff

<b><i>Time Data:</i></b>
* possession_time

<b><i>List Data:</i></b>
* Quarterly_points

In the sections below we will perform transformations that put the columns listed above into a format better suited to analysis.
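
As a small aside, the non-numeric columns can also be listed programmatically rather than read off the `info()` output. The sketch below uses a toy frame (`toyDf`, with made-up values) in place of joinedDf; on the real data this would also return columns such as school, which are text but need no cleaning.

```python
import pandas as pd

# Toy frame standing in for joinedDf; column names follow the notebook, values are made up
toyDf = pd.DataFrame({
    'rush_td': [4, 1],
    'yp_rush': [5.4, 2.9],
    'completion_attempts': ['31-38', '15-29'],
    'possession_time': ['37:02', '22:58'],
})

# Excluding numeric dtypes leaves the text columns that may need cleaning
non_numeric_cols = toyDf.select_dtypes(exclude='number').columns.tolist()
print(non_numeric_cols)
```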

### Hyphenated Data

Below we will take a look at what is meant by "hyphenated" data. Essentially, this is a data point where a "-" sits between two integers to indicate that a team has gone "# for #" in some statistic (or, for penalty_yards, a penalty count followed by a yardage total).

In [4]:
##Display the hyphenated data columns
joinedDf[['completion_attempts','penalty_yards','fourthDown_eff','thirdDown_eff']].head()

Unnamed: 0,completion_attempts,penalty_yards,fourthDown_eff,thirdDown_eff
0,31-38,1-10,2-2,10-15
1,15-29,1-9,2-3,1-12
2,20-25,10-92,0-0,5-10
3,36-48,3-30,4-5,4-14
4,8-23,4-50,0-0,2-14


One of the first things to check is whether every value in these columns contains a "-". If any value does not, the data may be malformed.

In [5]:
##Display results of assessing if all data is hyphenated
hyph_colNames = ['completion_attempts','penalty_yards','fourthDown_eff','thirdDown_eff']
hyph_data = [['Column','Status']]
for i in hyph_colNames:
    if len(joinedDf[joinedDf[i].str.contains('-')]) != len(joinedDf):
        hyph_data.append([i,'Incorrect'])
    else:
        hyph_data.append([i,'Correct'])
print('Check all Data has "-":')
print(tabulate(hyph_data,headers='firstrow',tablefmt='grid'))

Check all Data has "-":
+---------------------+----------+
| Column              | Status   |
+=====================+==========+
| completion_attempts | Correct  |
+---------------------+----------+
| penalty_yards       | Correct  |
+---------------------+----------+
| fourthDown_eff      | Correct  |
+---------------------+----------+
| thirdDown_eff       | Correct  |
+---------------------+----------+


The above results indicate that every value has a "-" present, which suggests the data is structured correctly.

Before splitting the hyphenated data into new columns, it is also important to check for a "--" double hyphen. This would indicate incorrect data, and we would want to explore the places where it happens.

In [6]:
##Display results of assessing if any data contains a double hyphen
hyph_data = [['Column','Status']]
for i in hyph_colNames:
    if len(joinedDf[joinedDf[i].str.contains('--')]) > 0:
        hyph_data.append([i,'Incorrect'])
    else:
        hyph_data.append([i,'Correct'])
print('Check no Data has "--":')
print(tabulate(hyph_data,headers='firstrow',tablefmt='grid'))

Check no Data has "--":
+---------------------+-----------+
| Column              | Status    |
+=====================+===========+
| completion_attempts | Correct   |
+---------------------+-----------+
| penalty_yards       | Incorrect |
+---------------------+-----------+
| fourthDown_eff      | Correct   |
+---------------------+-----------+
| thirdDown_eff       | Correct   |
+---------------------+-----------+


The above result shows that the penalty_yards column contains a "--" double hyphen. We will now explore the rows of data where this is happening.

In [7]:
##Filter data to location where double hyphen is occurring
print('Row of "--" Occurrence:')
display(joinedDf[joinedDf['penalty_yards'].str.contains('--')][['gameId','school','penalty_yards']])
print('\nGame Record of "--" Occurrence:')
gameId = (joinedDf.loc[joinedDf['penalty_yards'].str.contains('--'),'gameId'].tolist()[0])
joinedDf[joinedDf['gameId']==gameId][['gameId','school','penalty_yards']]

Row of "--" Occurrence:


Unnamed: 0,gameId,school,penalty_yards
6092,401012776,Arizona State,7--4953



Game Record of "--" Occurrence:


Unnamed: 0,gameId,school,penalty_yards
6092,401012776,Arizona State,7--4953
6093,401012776,USC,4-33


Above we see that only a single row is affected by this error, and the other team's penalty_yards entry for the same game is fine. This data issue is a relatively simple fix. We will keep the penalty count (7) and impute the yardage as follows:
* New Value = "7-$\lfloor 7\cdot y\rfloor$", where $y=$ the average yards per penalty in the dataset. 

If this data error ever pops up again in the future, we will use the same imputation technique. 
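
To make the arithmetic concrete, here is a minimal worked sketch of the imputation on a handful of made-up "penalties-yards" strings. It assumes the average is taken over clean rows with a nonzero penalty count and nonzero yardage; the list `values` and the resulting number are illustrative only and are not taken from joinedDf.

```python
# Toy "penalties-yards" strings (made-up values), one of which is corrupt
values = ['1-10', '10-92', '3-30', '7--4953', '0-0']

# Average yards per penalty over clean rows with nonzero penalties and yards
ratios = []
for v in values:
    if '--' in v:
        continue
    penalties, yards = (int(part) for part in v.split('-'))
    if penalties != 0 and yards != 0:
        ratios.append(yards / penalties)
y = sum(ratios) / len(ratios)

# Rebuild the corrupt entry: keep the penalty count, impute the yardage
count = int('7--4953'.split('--')[0])
imputed = f"{count}-{int(count * y)}"
print(round(y, 2), imputed)  # 9.73 7-68
```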

In [8]:
##Define function that imputes average yards per penalty for corrupt penalty yards data
####as documented above.
def imputeAvg_penalty_yards(df,colName,error_delim='--',delim='-'):
    
    ##Split the rows that do not contain the error into penalties and yards
    isError = df[colName].str.contains(error_delim,regex=False)
    splitData = df.loc[~isError,colName].str.split(delim,expand=True)
    numPenalties, numYards = splitData[0].astype(float), splitData[1].astype(float)
    
    ##Compute average yards per penalty
    ####NOTE: rows with zero penalties or zero yards are excluded from both the sum and the
    ####count, so the skipped rows no longer pull the average toward zero
    valid = (numPenalties != 0) & (numYards != 0)
    mean_ypp = (numYards[valid]/numPenalties[valid]).mean()
    
    ##Define function for replacing error delimiter and imputing the yardage
    def replace_delim(x):
        if error_delim in x:
            numPen = int(x[:x.find(error_delim)])
            return str(numPen) + delim + str(int(numPen*mean_ypp))
        return x
    
    ##Edit rows with incorrect data and impute average penalty yards
    df[colName] = df[colName].apply(replace_delim)
    
    return df

##Now let's fix the penalty_yards double-hyphenation issue and assess if it worked
joinedDf = imputeAvg_penalty_yards(joinedDf,'penalty_yards')
hyph_data = [['Column','Status']]
for i in hyph_colNames:
    if len(joinedDf[joinedDf[i].str.contains('--')]) > 0:
        hyph_data.append([i,'Incorrect'])
    else:
        hyph_data.append([i,'Correct'])
print('Check no Data has "--":')
print(tabulate(hyph_data,headers='firstrow',tablefmt='grid'))

Check no Data has "--":
+---------------------+----------+
| Column              | Status   |
+=====================+==========+
| completion_attempts | Correct  |
+---------------------+----------+
| penalty_yards       | Correct  |
+---------------------+----------+
| fourthDown_eff      | Correct  |
+---------------------+----------+
| thirdDown_eff       | Correct  |
+---------------------+----------+


The above result indicates that the issue with double-hyphenation has been resolved. 
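
As a stricter alternative to the two substring checks used so far, a single pattern test can flag any value that is not exactly "digits, one hyphen, digits". The sketch below runs on a toy Series (`s`, made-up apart from the corrupt value seen earlier) and is only an illustration, not the check used in data_cleaning.py.

```python
import pandas as pd

# Toy values: two well-formed entries, a double hyphen, and a missing hyphen
s = pd.Series(['31-38', '0-0', '7--4953', '45'])

# fullmatch requires the whole string to be digits, one hyphen, digits
is_valid = s.str.fullmatch(r'\d+-\d+')
bad_values = s[~is_valid].tolist()
print(bad_values)  # ['7--4953', '45']
```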

The next issue we must address is moving the hyphenated data into a form that can be better used for feature engineering and statistical modeling. To do this we will split each hyphenated column into two columns of integers. Because each efficiency value reads "made-attempted" (for example, 31-38 is 31 completions on 38 attempts), the first number is the success count and the second is the attempt count. I have listed these splits below:
* completion_attempts $\rightarrow$ passComplete, passAttempt
* penalty_yards $\rightarrow$ penalties, penalty_yardage
* fourthDown_eff $\rightarrow$ fourthSuccess, fourthAttempts
* thirdDown_eff $\rightarrow$ thirdSuccess, thirdAttempts

Below a function is defined to perform these splits and the new columns are added to the dataframe. We display these new columns below. 

In [9]:
##Define function to split hyphenated data into two new integer columns
####NOTE: delimiter is default "-" but there is an option to change it for function generalizability
def hyphenated_split(df,colName,newNames,delim='-'):
    
    ##Split by delimiter, cast to integer to create new columns, and drop old column
    df[newNames] = df[colName].str.split(delim, expand=True).astype(int)
    df = df.drop(colName,axis=1)
    
    return df

##Now for each of the hyphenated columns let's perform this transformation 
####NOTE: names are ordered to match the data, i.e. successes first and attempts second
hyph_newNames = [['passComplete','passAttempt'],['penalties','penalty_yardage'],
                 ['fourthSuccess','fourthAttempts'],['thirdSuccess','thirdAttempts']]
for i in range(len(hyph_colNames)):
    joinedDf = hyphenated_split(joinedDf,hyph_colNames[i],hyph_newNames[i])

In [10]:
##Display the new dataframe with split columns
joinedDf[['passComplete','passAttempt','penalties','penalty_yardage',
          'fourthSuccess','fourthAttempts','thirdSuccess','thirdAttempts']].head()

Unnamed: 0,passComplete,passAttempt,penalties,penalty_yardage,fourthSuccess,fourthAttempts,thirdSuccess,thirdAttempts
0,31,38,1,10,2,2,10,15
1,15,29,1,9,2,3,1,12
2,20,25,10,92,0,0,5,10
3,36,48,3,30,4,5,4,14
4,8,23,4,50,0,0,2,14


We can see that the split data resulting from the original hyphenated columns is now in a much better form for future statistical modeling. The data is now integer rather than string data, so we'll be able to create new columns/features from it to better evaluate game outcomes. 
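
As one example of the kind of feature the split makes possible, a conversion rate can be computed from a success count and an attempt count. The sketch below uses a toy frame with hypothetical column names (`made`, `attempted`, `rate`) and made-up values. Treating zero attempts as missing is an assumption about how 0-for-0 games should be handled, not a decision made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy integer columns (hypothetical names, made-up values)
toyDf = pd.DataFrame({'made': [2, 2, 0, 4], 'attempted': [2, 3, 0, 5]})

# Treat zero attempts as missing so the rate is NaN whenever attempts are 0
toyDf['rate'] = toyDf['made'] / toyDf['attempted'].replace(0, np.nan)
print(toyDf)
```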

### Time Data

Now we must address the possession_time column. This data is given in mm:ss format and indicates how long during a game the team had possession of the ball. 

Realistically speaking, we don't need the seconds. Whole minutes are granular enough for us to model how long teams control the ball during a game. Therefore we'll perform the following data transformation:
* possession_time $\rightarrow$ possession_minutes

Below we define a function to perform this data transformation. 

In [11]:
##Define function to replace possession time with integer value possession minutes
def possession_minutes(df,colName):
    
    ##Create possession_minutes column (text before the ":", cast to integer) and drop possession_time column
    df['possession_minutes'] = df[colName].str.split(':').str[0].astype('int64')
    df = df.drop(colName,axis=1)
    
    return df

##Apply operation to dataframe and view results
joinedDf = possession_minutes(joinedDf,'possession_time')
joinedDf['possession_minutes'].head()

0    37
1    22
2    29
3    30
4    20
Name: possession_minutes, dtype: int64
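
Note that taking the text before the ":" truncates rather than rounds, so a possession of 22:58 is recorded as 22 minutes. If the seconds were ever wanted, they could be kept with the same split. The sketch below shows both on a toy Series; the values and the names `whole_minutes` and `total_seconds` are illustrative only.

```python
import pandas as pd

# Toy possession times in mm:ss (made-up values)
times = pd.Series(['37:02', '22:58', '29:30'])

# Split once on ':' and cast both parts to integers
parts = times.str.split(':', expand=True).astype('int64')
whole_minutes = parts[0]                  # truncation: 22:58 becomes 22
total_seconds = parts[0] * 60 + parts[1]  # keeps the seconds if ever needed
print(whole_minutes.tolist(), total_seconds.tolist())
# [37, 22, 29] [2222, 1378, 1770]
```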

### List Data

The only list data we have comes from the Quarterly_points column. Each row in this column contains a list of the points scored per quarter, typically with 4 elements (one per quarter), for example [7,14,0,3]. We will perform the following transformation:
* Quarterly_points $\rightarrow$ Q1_points, Q2_points, Q3_points, Q4_points


In [71]:
##Define function to split quarterly points in gameDf into separate columns
from ast import literal_eval
def split_quarterly_pts(df):
    
    ##Parse each string into a list (missing values become an empty list) and record its length
    df['Quarterly_points1'] = df.apply(lambda row: [] if str(row['Quarterly_points']) == 'nan' else
                                               literal_eval(str(row['Quarterly_points'])),axis=1)
    df['quarters_available'] = df.apply(lambda row: len(row['Quarterly_points1']), axis = 1)

    ##Pull out each quarter, using 0 when the list is too short to contain that quarter
    ####NOTE: checking len > 0 alone raises "IndexError: list index out of range" on rows with fewer than 4 entries
    for i,j in zip(['Q1_points','Q2_points','Q3_points','Q4_points'],[0,1,2,3]):
        df[i] = df.apply(lambda row: row['Quarterly_points1'][j] if len(row['Quarterly_points1'])>j else 0, axis = 1)

    return df

In [72]:
df = split_quarterly_pts(joinedDf)

In [74]:
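##NOTE: len() of a Series returns the number of rows, so the comparison below is a single
####boolean (False) rather than a row mask, and df[False] raises the KeyError shown
df[len(df['Quarterly_points'])==11].tolist()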

KeyError: False

In [76]:
####These are indicative of overtime games
df.loc[df['quarters_available']==11,['Quarterly_points','gameSeason']]

Unnamed: 0,Quarterly_points,gameSeason
4034,"[14, 0, 0, 17, 7, 7, 0, 8, 6, 6, 3]",2017.0
4035,"[10, 7, 7, 7, 7, 7, 0, 8, 6, 6, 6]",2017.0
6540,"[7, 10, 7, 7, 3, 7, 8, 3, 6, 8, 8]",2018.0
6541,"[7, 3, 7, 14, 3, 7, 8, 3, 6, 8, 6]",2018.0
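
One possible way to handle these longer lists is to keep the first four entries as the regulation quarters and sum everything after them into a single overtime total. The sketch below is an assumption about how such rows could be treated, not the approach taken in data_cleaning.py; the helper `split_points` is hypothetical, and the 11-entry string is copied from the output above while the other two are made up.

```python
from ast import literal_eval

# Toy score strings: a regular game, an overtime game, and a short row
rows = ['[7, 14, 0, 3]', '[14, 0, 0, 17, 7, 7, 0, 8, 6, 6, 3]', '[7, 3]']

def split_points(raw, n_quarters=4):
    """Return the first n_quarters scores (zero-padded) and the sum of any extra periods."""
    pts = literal_eval(raw)
    regulation = (pts[:n_quarters] + [0] * n_quarters)[:n_quarters]
    overtime = sum(pts[n_quarters:])
    return regulation, overtime

for raw in rows:
    print(split_points(raw))
# ([7, 14, 0, 3], 0)
# ([14, 0, 0, 17], 37)
# ([7, 3, 0, 0], 0)
```

Zero-padding short rows mirrors the fallback to 0 used for missing quarters, but whether a missing quarter should count as 0 points or as missing data is a modeling choice.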
