## Exploratory Data Analysis for Data Cleaning

Script Name: data_cleaning_eda.ipynb

Author: Brian Cain


The purpose of this Jupyter notebook is to explore the joined data for characteristics that need to be cleaned before advancing to the data analysis phase. Findings from this EDA will motivate the data cleaning functions defined in the data_cleaning.py script. This notebook will also evolve over time: if data issues arise in the future, I will return to it to explore them further and document any resulting changes to data_cleaning.py.

<hr>

In [1]:
##Import pandas for dataframe management/operations
import pandas as pd

##Import tabulate for organized table creation
from tabulate import tabulate

In [2]:
##Pull in the joined dataframe
joinedDf = pd.read_csv('D:\\College_Football_Model_Data\\joinedDf.csv')

##Display the first few rows of the data to ensure it has been pulled in
joinedDf.head()

| | gameId | week_num | school | rush_td | pass_td | rush_attempt | yp_rush | rush_yards | yp_pass | completion_attempts | ... | offensive_plays | offensive_drives | offensive_ppa | offensive_successRate | offensive_explosiveness | offensive_powerSuccess | offensive_stuffRate | offensive_lineYards | offensive_secondLevelYards | offensive_openFieldYards |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 400603830 | 1 | Florida | 4 | 4 | 41 | 5.4 | 222 | 10.1 | 31-38 | ... | 81.0 | 13.0 | 0.489647 | 0.567901 | 1.272267 | 0.666667 | 0.146341 | 3.360976 | 1.414634 | 1.121951 |
| 1 | 400603830 | 1 | New Mexico State | 1 | 1 | 22 | 2.9 | 64 | 4.7 | 15-29 | ... | 51.0 | 13.0 | 0.000437 | 0.27451 | 1.332495 | 0.666667 | 0.25 | 2.565 | 0.95 | 1.0 |
| 2 | 400787302 | 1 | Ohio | 2 | 3 | 38 | 5.4 | 205 | 11.4 | 20-25 | ... | 64.0 | 10.0 | 0.474904 | 0.53125 | 1.228407 | 1.0 | 0.111111 | 3.752778 | 1.638889 | 1.388889 |
| 3 | 400787302 | 1 | Idaho | 2 | 1 | 28 | 3.6 | 100 | 6.2 | 36-48 | ... | 78.0 | 12.0 | 0.094651 | 0.538462 | 0.832628 | 0.75 | 0.08 | 3.52 | 1.08 | 0.52 |
| 4 | 400763403 | 1 | Texas | 0 | 0 | 29 | 2.1 | 60 | 4.5 | 8-23 | ... | 53.0 | 12.0 | -0.101281 | 0.264151 | 1.123873 | 0.0 | 0.12 | 2.532 | 0.72 | 0.04 |


Let's first examine the data types to see which features might need cleaning.

In [3]:
##Display the dataframe data types
joinedDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10078 entries, 0 to 10077
Data columns (total 40 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   gameId                      10078 non-null  int64  
 1   week_num                    10078 non-null  int64  
 2   school                      10078 non-null  object 
 3   rush_td                     10078 non-null  int64  
 4   pass_td                     10078 non-null  int64  
 5   rush_attempt                10078 non-null  int64  
 6   yp_rush                     10078 non-null  float64
 7   rush_yards                  10078 non-null  int64  
 8   yp_pass                     10078 non-null  float64
 9   completion_attempts         10078 non-null  object 
 10  pass_yards                  10078 non-null  int64  
 11  total_yards                 10078 non-null  int64  
 12  turnovers                   10078 non-null  int64  
 13  tfl                         100

From prior knowledge of this dataset, all the float64 and int64 columns look correct as is. Several object-dtype columns, however, must be dealt with before the dataset can be cleaned. These are listed below:

<b><i>Hyphenated Data:</i></b>
* completion_attempts
* penalty_yards
* fourthDown_eff
* thirdDown_eff

<b><i>Time Data:</i></b>
* possession_time

<b><i>List Data:</i></b>
* Quarterly_points

In the sections below we will perform transformations to better format the columns listed above.
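As a quick cross-check on the manual list above, the non-numeric columns can also be picked out programmatically. The snippet below is a minimal sketch on a toy frame (`toy_df` and its values are illustrative stand-ins, not the real `joinedDf`); note that it would also flag plain text columns such as `school`, which need no parsing.

```python
import pandas as pd

# Toy frame standing in for joinedDf; the values are illustrative only.
toy_df = pd.DataFrame({
    "gameId": [400603830, 400787302],
    "school": ["Florida", "Ohio"],
    "yp_rush": [5.4, 5.4],
    "completion_attempts": ["31-38", "20-25"],
})

# Any column that is not numeric is a candidate for parsing or cleaning.
non_numeric_cols = [
    col for col in toy_df.columns
    if not pd.api.types.is_numeric_dtype(toy_df[col])
]
print(non_numeric_cols)
```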

### Hyphenated Data

Below we take a look at what is meant by "hyphenated" data. Essentially, this is a data point where a "-" sits between two integers to indicate that a team has gone "# for #" in some statistic.
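To make the "# for #" reading concrete, here is a small worked example on a single value. It assumes the first number is the count of successes and the second the count of attempts, which is how a "# for #" statistic is normally read; the variable names are illustrative.

```python
# A single hyphenated value, read as "31 for 38".
raw_value = "31-38"

# Split on the hyphen and convert both halves to integers.
made_str, attempted_str = raw_value.split("-")
made, attempted = int(made_str), int(attempted_str)

# With both parts numeric, derived statistics become straightforward.
completion_rate = made / attempted
print(made, attempted, round(completion_rate, 3))
```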

In [4]:
##Display the hyphenated data columns
joinedDf[['completion_attempts','penalty_yards','fourthDown_eff','thirdDown_eff']].head()

Unnamed: 0,completion_attempts,penalty_yards,fourthDown_eff,thirdDown_eff
0,31-38,1-10,2-2,10-15
1,15-29,1-9,2-3,1-12
2,20-25,10-92,0-0,5-10
3,36-48,3-30,4-5,4-14
4,8-23,4-50,0-0,2-14


The first thing to check is whether every value in these columns contains a "-"; if not, the data may be malformed.

In [5]:
##Display results of assessing if all data is hyphenated
hyph_colNames = ['completion_attempts','penalty_yards','fourthDown_eff','thirdDown_eff']
hyph_data = [['Column','Status']]
for i in hyph_colNames:
    if len(joinedDf[joinedDf[i].str.contains('-')]) != len(joinedDf):
        hyph_data.append([i,'Incorrect'])
    else:
        hyph_data.append([i,'Correct'])
print(tabulate(hyph_data,headers='firstrow',tablefmt='grid'))

+---------------------+----------+
| Column              | Status   |
+=====================+==========+
| completion_attempts | Correct  |
+---------------------+----------+
| penalty_yards       | Correct  |
+---------------------+----------+
| fourthDown_eff      | Correct  |
+---------------------+----------+
| thirdDown_eff       | Correct  |
+---------------------+----------+


The above results indicate that every value contains a "-". This suggests, at a coarse level, that the data is correctly structured, so we can move on to splitting the hyphenated data. Note, however, that this check only confirms at least one hyphen per value; it does not guarantee exactly one. We now write a function to split hyphenated data into two new columns.
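A stricter check than "contains a hyphen" is to require that each value matches the full pattern "digits, one hyphen, digits". The sketch below shows one way to do this with `Series.str.fullmatch` on a toy frame; the frame, its values, and the choice of pattern are assumptions for illustration rather than part of the original analysis.

```python
import pandas as pd

# Toy data; the doubled hyphen in "3--25" is a made-up malformed value.
toy = pd.DataFrame({
    "penalty_yards": ["1-10", "3--25", "4-50"],
    "thirdDown_eff": ["10-15", "1-12", "5-10"],
})

# Require the whole string to be <digits>-<digits>.
pattern = r"\d+-\d+"

# Count, per column, how many values fail the full-pattern match.
malformed_counts = {
    col: int((~toy[col].str.fullmatch(pattern)).sum())
    for col in toy.columns
}
print(malformed_counts)
```

A non-zero count for a column signals that a plain two-way split on "-" will not be safe for it.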

In [7]:
joinedDf.columns

Index(['gameId', 'week_num', 'school', 'rush_td', 'pass_td', 'rush_attempt',
       'yp_rush', 'rush_yards', 'yp_pass', 'completion_attempts', 'pass_yards',
       'total_yards', 'turnovers', 'tfl', 'sacks', 'qb_hurries',
       'fumbles_lost', 'interceptions', 'possession_time', 'penalty_yards',
       'fourthDown_eff', 'thirdDown_eff', 'firstDowns', 'defensive_td',
       'homeBool', 'gameSeason', 'team_id', 'points', 'Quarterly_points',
       'elo', 'offensive_plays', 'offensive_drives', 'offensive_ppa',
       'offensive_successRate', 'offensive_explosiveness',
       'offensive_powerSuccess', 'offensive_stuffRate', 'offensive_lineYards',
       'offensive_secondLevelYards', 'offensive_openFieldYards', 'passAttempt',
       'passComplete'],
      dtype='object')

In [27]:
#joinedDf['penalty_yards'].loc[joinedDf['penalty_yards'].str.len() < 4].str.split('-')
##Print the position of any row whose split does not yield exactly two parts
for ct, parts in enumerate(joinedDf['penalty_yards'].str.split('-')):
    if len(parts) != 2:
        print(ct)

6092


In [30]:
joinedDf['penalty_yards'][6092]

'7--4953'
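The same search can be done without an explicit loop by counting the parts of each split in a vectorized way. The snippet below is a sketch on a toy Series (the values are illustrative, with one entry copied from the malformed value above).

```python
import pandas as pd

# Toy stand-in for joinedDf['penalty_yards'].
penalty_yards = pd.Series(["1-10", "1-9", "7--4953", "4-50"])

# Number of pieces each value breaks into when split on "-".
n_parts = penalty_yards.str.split("-").str.len()

# Keep only the rows that do not split into exactly two pieces.
bad_rows = penalty_yards[n_parts != 2]
print(bad_rows)
```

Because the result keeps the original index, the offending rows can be looked up directly in the full dataframe.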

In [6]:
##Define function to split hyphenated data into two new columns
####NOTE: delimiter defaults to "-" but can be changed for function generalizability
def hyphenated_split(df,colName,newNames,delim='-'):
    
    ##Split by delimiter to create the new columns (the original column is kept)
    df[newNames] = df[colName].str.split(delim, expand=True)
    
    return df

##Now for each of the hyphenated columns let's perform this transformation
####NOTE: values read "# for #", so the first new name in each pair receives the success/count part
hyph_newNames = [['passComplete','passAttempt'],['penalties','penalty_yardage'],
                 ['fourthSuccess','fourthAttempts'],['thirdSuccess','thirdAttempts']]
for i in range(len(hyph_colNames)):
    print(i)
    joinedDf = hyphenated_split(joinedDf,hyph_colNames[i],hyph_newNames[i])

0
1


ValueError: Columns must be same length as key
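The error arises because a value such as '7--4953' splits into three pieces, so `expand=True` produces three columns that cannot be assigned to two new names. One possible workaround, sketched below, is to split on the first delimiter only (`n=1`) and convert the pieces to numbers. The function name `hyphenated_split_safe` is hypothetical and not part of data_cleaning.py, and whether a resulting negative yardage is a genuine value or a data error still has to be judged separately.

```python
import pandas as pd

def hyphenated_split_safe(df, col_name, new_names, delim="-"):
    """Hypothetical variant of hyphenated_split: split on the first delimiter only."""
    # n=1 caps the split at two pieces, so "7--4953" becomes ["7", "-4953"].
    # Assumes at least one value contains the delimiter, so two columns exist.
    parts = df[col_name].str.split(delim, n=1, expand=True)
    out = df.copy()
    for position, name in enumerate(new_names):
        # Unparseable pieces become NaN instead of raising.
        out[name] = pd.to_numeric(parts[position], errors="coerce")
    return out

# Toy data, including the malformed value found above.
toy = pd.DataFrame({"penalty_yards": ["1-10", "7--4953", "4-50"]})
result = hyphenated_split_safe(toy, "penalty_yards", ["penalties", "penalty_yardage"])
print(result)
```

Returning a copy keeps the input frame untouched, which makes it easier to compare the raw and parsed columns while the suspicious rows are still under review.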