## Feature Engineering
Script Name: feature_exploration.ipynb

Author: Brian Cain


The purpose of this notebook is to use the cleaned data from the Data Wrangling process to create features that are necessary or useful in Bayesian predictive modeling. These features will also be analyzed for properties such as collinearity, which will affect modeling decisions. The product of this notebook is a finalized dataframe used for predictive modeling. All functions resulting from this notebook can be found in feature_creation.ipynb.

<hr>

In [1]:
##Import necessary packages
import pandas as pd
import numpy as np

##Pull in and display the data from the finalized data wrangling process
data = pd.read_csv('D:\\College_Football_Model_Data\\cleanData.csv')
print('Data from Data Wrangling Process:')
data.head(5)

Data from Data Wrangling Process:


Unnamed: 0,gameId,school,week_num,gameSeason,homeBool,win,rush_td_movAvg,pass_td_movAvg,rush_attempt_movAvg,yp_rush_movAvg,...,gameControl_movAvg,completion_pct_movAvg,opp_completion_pct_movAvg,third_pct_movAvg,opp_third_pct_movAvg,fourth_pct_movAvg,opp_fourth_pct_movAvg,diff_points_movAvg,games_played,games_won
0,400764869,Temple,3,2015.0,1,1.0,1.666667,1.0,38.666667,3.766667,...,2.25,0.661111,0.514732,0.405811,0.322227,0.166667,0.333333,9.0,3,3.0
1,400763604,UTEP,3,2015.0,1,1.0,1.666667,1.333333,41.0,4.3,...,-7.0,0.695013,0.58254,0.433333,0.405594,0.333333,0.833333,-27.0,3,1.0
2,400756922,Georgia Tech,3,2015.0,1,0.0,3.5,2.5,51.5,6.2,...,5.875,0.566667,0.577273,0.433333,0.381818,0.833333,0.25,23.5,2,1.0
3,400603852,South Carolina,3,2015.0,1,0.0,1.333333,0.666667,40.0,5.2,...,-2.666667,0.50954,0.76568,0.31746,0.306777,0.666667,0.5,-10.666667,3,1.0
4,400603853,Kentucky,3,2015.0,1,0.0,1.666667,1.0,35.0,5.0,...,0.5,0.509353,0.618228,0.325397,0.25008,0.888889,0.916667,2.0,3,2.0


Note that the cleaned data above still contains two records for each game, one holding the home team's data and one holding the away team's data. We need to combine the two records for a game into a single record containing both teams' data. A function to perform this task is written below:

In [2]:
def combine_data(gameRecords, columns): ##Very similar to obtain_defensive in dataAggregation.py

    ##Split data into two dataframes [homeData, awayData]
    homeData = gameRecords.loc[gameRecords['homeBool'] == 1, columns]
    awayData = gameRecords.loc[gameRecords['homeBool'] == 0, columns]

    ##Prefix every column except the identifiers with 'opposition_'
    renamed = {c: 'opposition_' + c for c in columns
               if c != 'gameId' and c != 'gameSeason'}
    homeData = homeData.rename(columns=renamed)
    awayData = awayData.rename(columns=renamed)

    ##Join the away team's data onto each home record, and vice versa
    new_homeData = pd.merge(gameRecords[gameRecords['homeBool'] == 1],
                            awayData,
                            how='left',
                            on='gameId')
    new_awayData = pd.merge(gameRecords[gameRecords['homeBool'] == 0],
                            homeData,
                            how='left',
                            on='gameId')

    ##Stack the two dataframes for our end result
    new_gameRecords_df = pd.concat([new_homeData, new_awayData])

    return new_gameRecords_df

##Conduct operation so we have all game data for both teams in a single row
cols_for_combination = ['gameId', 'homeBool', 'win',
       'rush_td_movAvg', 'pass_td_movAvg', 'rush_attempt_movAvg',
       'yp_rush_movAvg', 'rush_yards_movAvg', 'yp_pass_movAvg',
       'pass_yards_movAvg', 'total_yards_movAvg', 'turnovers_movAvg',
       'fumbles_lost_movAvg', 'interceptions_movAvg', 'firstDowns_movAvg',
       'defensive_td_movAvg', 'points_movAvg', 'elo_movAvg',
       'offensive_drives_movAvg', 'offensive_ppa_movAvg',
       'offensive_successRate_movAvg', 'offensive_explosiveness_movAvg',
       'offensive_powerSuccess_movAvg', 'offensive_stuffRate_movAvg',
       'offensive_lineYards_movAvg', 'offensive_secondLevelYards_movAvg',
       'passComplete_movAvg', 'passAttempt_movAvg', 'fourthSuccess_movAvg',
       'fourthAttempts_movAvg', 'thirdSuccess_movAvg', 'thirdAttempts_movAvg',
       'quarters_available_movAvg', 'Q1_points_movAvg', 'Q2_points_movAvg',
       'Q3_points_movAvg', 'Q4_points_movAvg', 'opp_rush_td_movAvg',
       'opp_pass_td_movAvg', 'opp_rush_attempt_movAvg', 'opp_yp_rush_movAvg',
       'opp_rush_yards_movAvg', 'opp_yp_pass_movAvg', 'opp_pass_yards_movAvg',
       'opp_total_yards_movAvg', 'opp_turnovers_movAvg',
       'opp_fumbles_lost_movAvg', 'opp_interceptions_movAvg',
       'opp_firstDowns_movAvg', 'opp_defensive_td_movAvg',
       'opp_points_movAvg', 'opp_elo_movAvg', 'opp_offensive_drives_movAvg',
       'opp_offensive_ppa_movAvg', 'opp_offensive_successRate_movAvg',
       'opp_offensive_explosiveness_movAvg',
       'opp_offensive_powerSuccess_movAvg', 'opp_offensive_stuffRate_movAvg',
       'opp_offensive_lineYards_movAvg',
       'opp_offensive_secondLevelYards_movAvg', 'opp_passComplete_movAvg',
       'opp_passAttempt_movAvg', 'opp_fourthSuccess_movAvg',
       'opp_fourthAttempts_movAvg', 'opp_thirdSuccess_movAvg',
       'opp_thirdAttempts_movAvg',
       'opp_Q1_points_movAvg', 'opp_Q2_points_movAvg', 'opp_Q3_points_movAvg',
       'opp_Q4_points_movAvg', 'gameControl_movAvg', 'completion_pct_movAvg',
       'opp_completion_pct_movAvg', 'third_pct_movAvg', 'opp_third_pct_movAvg',
       'fourth_pct_movAvg', 'opp_fourth_pct_movAvg', 'diff_points_movAvg',
       'games_played', 'games_won']

##Call function to combine game data
data = combine_data(data,cols_for_combination)
data.head()

Unnamed: 0,gameId,school,week_num,gameSeason,homeBool,win,rush_td_movAvg,pass_td_movAvg,rush_attempt_movAvg,yp_rush_movAvg,...,opposition_gameControl_movAvg,opposition_completion_pct_movAvg,opposition_opp_completion_pct_movAvg,opposition_third_pct_movAvg,opposition_opp_third_pct_movAvg,opposition_fourth_pct_movAvg,opposition_opp_fourth_pct_movAvg,opposition_diff_points_movAvg,opposition_games_played,opposition_games_won
0,400764869,Temple,3,2015.0,1,1.0,1.666667,1.0,38.666667,3.766667,...,-4.5,0.549351,0.598214,0.275735,0.510526,0.5,0.25,-18.0,2,0.0
1,400763604,UTEP,3,2015.0,1,1.0,1.666667,1.333333,41.0,4.3,...,-4.166667,0.526959,0.76509,0.255342,0.544444,0.388889,0.666667,-17.666667,3,0.0
2,400756922,Georgia Tech,3,2015.0,1,0.0,3.5,2.5,51.5,6.2,...,4.166667,0.675362,0.455123,0.311688,0.247619,0.555556,0.722222,16.666667,3,3.0
3,400603852,South Carolina,3,2015.0,1,0.0,1.333333,0.666667,40.0,5.2,...,7.166667,0.731429,0.575883,0.371083,0.325397,0.5,0.583333,28.666667,3,3.0
4,400603853,Kentucky,3,2015.0,1,0.0,1.666667,1.0,35.0,5.0,...,5.0,0.6689,0.496169,0.385392,0.297619,1.0,0.611111,20.0,3,3.0
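
The combination step assumes that every `gameId` has exactly one home record and one away record. Below is a minimal sketch of how that assumption could be checked before merging. The toy values and the helper name `check_game_pairs` are illustrative and not part of the original notebook; only the `gameId` and `homeBool` columns come from the data above.

```python
import pandas as pd

def check_game_pairs(game_records: pd.DataFrame) -> pd.Series:
    """Return gameIds that do not have exactly one home and one away record.

    Hypothetical helper; assumes the 'gameId' and 'homeBool' columns used above.
    """
    # Count records per (gameId, homeBool) and pivot homeBool into columns 0 and 1
    counts = (game_records
              .groupby(['gameId', 'homeBool'])
              .size()
              .unstack(fill_value=0)
              .reindex(columns=[0, 1], fill_value=0))
    # Keep games where either side is missing or duplicated
    bad = counts[(counts[0] != 1) | (counts[1] != 1)]
    return bad.index.to_series().reset_index(drop=True)

# Toy data (made up): game 1 is complete, game 2 is missing its away record
toy = pd.DataFrame({
    'gameId':        [1, 1, 2],
    'homeBool':      [1, 0, 1],
    'points_movAvg': [30.0, 21.0, 14.0],
})
unpaired = check_game_pairs(toy)
print(unpaired.tolist())
```

A related safeguard is the `validate='one_to_one'` argument of `pd.merge`, which raises an error when the join key is duplicated on either side.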


<b>Important Modeling Decision:</b>

A choice must be made to model either the probability that the home team wins or the probability that the away team wins. A single model covering both scenarios would introduce inconsistencies in game prediction probabilities; for example, the two predictions made for the same game need not sum to one. I choose to model the probability that the home team will win the game, so the dataset is modified to reflect that.

In [3]:
##Filter data to only be home games
data = data[data['homeBool']==1]

In [4]:
##Inspect the win and loss class balance for home games
print('Win/Loss Class Balance of Data:')
print('Wins: '+str(len(data[data['win']==1])))
print('Losses: '+str(len(data[data['win']==0])))

Win/Loss Class Balance of Data:
Wins: 1701
Losses: 2101


The data for home teams is relatively balanced, though there are more losses than wins. This is an interesting result and potential evidence against home-field advantage; however, confirming that would require a more detailed study at another time.
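
To put a number on "relatively balanced", the counts printed above can be converted to class shares. This is a small sketch that uses only the two printed counts; the variable names are illustrative.

```python
import pandas as pd

# Counts printed by the class-balance cell above
outcomes = pd.Series({'win': 1701, 'loss': 2101})

# Share of each class among the home-team records
shares = outcomes / outcomes.sum()
print(shares.round(3))
```

Wins make up roughly 45% of the records and losses roughly 55%. On the dataframe itself, `data['win'].value_counts(normalize=True)` should give the same shares, assuming the `win` column has no missing values.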

### Feature Creation:

In this section we will explore which features already exist in the data and create new features based on them. First, let's explore the features we have and the data types they were loaded as.

In [12]:
##Import nice package for creating organized tables
from tabulate import tabulate

##Display current features and datatypes
colData = [['Feature','Data Type']]
for i in data.columns:
    colData.append([i,data[i].dtype])
print(tabulate(colData,headers='firstrow',tablefmt='grid'))

+--------------------------------------------------+-------------+
| Feature                                          | Data Type   |
+==================================================+=============+
| gameId                                           | int64       |
+--------------------------------------------------+-------------+
| school                                           | object      |
+--------------------------------------------------+-------------+
| week_num                                         | int64       |
+--------------------------------------------------+-------------+
| gameSeason                                       | float64     |
+--------------------------------------------------+-------------+
| homeBool                                         | int64       |
+--------------------------------------------------+-------------+
| win                                              | float64     |
+--------------------------------------------------+-------------+
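| rush_td_movAvg                                   | float64     |
+--------------------------------------------------+-------------+
| ...                                              | ...         |
+--------------------------------------------------+-------------+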

The datatypes of the available features look correct. We can now move directly to feature creation.
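
The visual check of the dtype table can also be done programmatically. The sketch below lists any columns that are not numeric, using a toy frame that mirrors a few of the columns shown above. The helper name `non_numeric_columns` is hypothetical; on the real data one would expect only identifier-like columns such as `school` to be returned.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def non_numeric_columns(df: pd.DataFrame) -> list:
    """List the columns whose dtype is not numeric (hypothetical helper)."""
    return [col for col in df.columns if not is_numeric_dtype(df[col])]

# Toy frame mirroring a few of the columns and dtypes displayed above
toy = pd.DataFrame({
    'gameId':         [400764869, 400763604],
    'school':         ['Temple', 'UTEP'],
    'win':            [1.0, 1.0],
    'rush_td_movAvg': [1.666667, 1.666667],
})
print(non_numeric_columns(toy))
```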

<b>Difference of Statistic Features:</b>

A single game record contains each team's...
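
Given the `opposition_` prefix produced by `combine_data`, one plausible reading of a "difference of statistic" feature is the team's value minus the opponent's value for the same statistic. The following is a minimal sketch under that assumption only. The helper name `add_difference_features`, the `_diff` suffix, and the toy values are hypothetical and are not taken from the original notebook.

```python
import pandas as pd

def add_difference_features(df, stat_columns, suffix='_diff'):
    """Add <stat><suffix> = <stat> - opposition_<stat> for each given statistic.

    Hypothetical helper; the '_diff' naming is an assumption.
    """
    out = df.copy()
    for col in stat_columns:
        opp_col = 'opposition_' + col
        # Skip statistics that have no opponent counterpart in the frame
        if opp_col in out.columns:
            out[col + suffix] = out[col] - out[opp_col]
    return out

# Toy data (made up) with one statistic and its opponent counterpart
toy = pd.DataFrame({
    'points_movAvg':            [30.0, 17.0],
    'opposition_points_movAvg': [21.0, 24.0],
})
result = add_difference_features(toy, ['points_movAvg'])
print(result['points_movAvg_diff'].tolist())
```

Working on a copy leaves the input frame unchanged, so the original and derived columns can be compared side by side during the collinearity analysis.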

### Feature Analysis/Collinearity