# College Football Analysis
## By David Weck

In this project, I analyze college football data from the 2015-2022 seasons. The goal is to derive informative insights, present clear visualizations, and eventually predict betting outcomes such as the winning team, the over/under, and the spread.

#### Setup and Package Import

In [1]:
import cfbd
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

os.chdir('..')
cwd = os.getcwd()
# Build the data directory portably; the trailing '' keeps the final path separator
data_path = os.path.join(cwd, 'Data', 'college_football_analysis', '')

In [7]:
os.getcwd()

'C:\\Users\\dweck\\OneDrive\\Documents\\Python_Projects'

#### Loading Data

All of this data was sourced using the `cfbd` Python API, which pulls from [collegefootballdata.com](https://collegefootballdata.com/exporter). Details on the API can be found [here](https://github.com/CFBD/cfbd-python). To see how the datasets below were created, see `data_pull.py` in the [GitHub repo](https://github.com/davidweck96/College-Football-Analysis) for this project.

In [2]:
adv_stats_df = pd.read_csv(data_path + 'adv_stats_df.csv')
betting_df = pd.read_csv(data_path + 'betting_df.csv')
game_results_df = pd.read_csv(data_path + 'game_results_df.csv')
recruiting_df = pd.read_csv(data_path + 'recruiting_df.csv')
talent_df = pd.read_csv(data_path + 'talent_df.csv')
teams_df = pd.read_csv(data_path + 'teams.csv')
win_prob_df = pd.read_csv(data_path + 'win_prob_df.csv')
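
The seven `read_csv` calls above follow one pattern, so they could also be collapsed into a loop. The following is a minimal sketch, not the notebook's actual code: `load_tables` is a hypothetical helper name, and it assumes every file sits directly in the data directory.

```python
from pathlib import Path

import pandas as pd


def load_tables(data_dir, file_names):
    """Read each CSV in `file_names` from `data_dir` into a dict of DataFrames.

    Hypothetical helper: tables are keyed by file name without the extension,
    so 'betting_df.csv' becomes tables['betting_df'].
    """
    data_dir = Path(data_dir)
    tables = {}
    for file_name in file_names:
        tables[Path(file_name).stem] = pd.read_csv(data_dir / file_name)
    return tables
```

With the paths used in this notebook, a call such as `load_tables(data_path, ['adv_stats_df.csv', 'betting_df.csv', 'teams.csv'])` would return the same frames under the keys `adv_stats_df`, `betting_df`, and `teams`.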

#### Exploring Data

In [15]:
# info() prints its summary directly and returns None, so no print() is needed
adv_stats_df.info()
adv_stats_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10986 entries, 0 to 10985
Data columns (total 32 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   game_id                           10986 non-null  int64  
 1   team                              10986 non-null  object 
 2   opponent                          10986 non-null  object 
 3   week                              10986 non-null  int64  
 4   offense_plays                     10986 non-null  int64  
 5   offense_drives                    10986 non-null  int64  
 6   offense_ppa                       10986 non-null  float64
 7   offense_total_ppa                 10986 non-null  float64
 8   offense_success_rate              10986 non-null  float64
 9   offense_explosiveness             10986 non-null  float64
 10  offense_power_success             10986 non-null  float64
 11  offense_stuff_rate                10986 non-null  float64
 12  offe

|  | game_id | team | opponent | week | offense_plays | offense_drives | offense_ppa | offense_total_ppa | offense_success_rate | offense_explosiveness | ... | defense_success_rate | defense_explosiveness | defense_power_success | defense_stuff_rate | defense_line_yards | defense_line_yards_total | defense_second_level_yards | defense_second_level_yards_total | defense_open_field_yards | defense_open_field_yards_total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 400603827 | Alabama | Wisconsin | 1 | 66 | 12 | 0.464209 | 30.63782 | 0.469697 | 1.518528 | ... | 0.344262 | 1.18687 | 0.5 | 0.166667 | 1.855556 | 33.0 | 0.277778 | 5.0 | 0.833333 | 15.0 |
| 1 | 400603827 | Wisconsin | Alabama | 1 | 61 | 12 | 0.042139 | 2.570494 | 0.344262 | 1.18687 | ... | 0.469697 | 1.518528 | 1.0 | 0.147059 | 3.382353 | 115.0 | 1.5 | 51.0 | 3.352941 | 114.0 |
| 2 | 400603828 | Arkansas | UTEP | 1 | 56 | 11 | 0.591284 | 33.111923 | 0.482143 | 1.728689 | ... | 0.425926 | 1.119297 | 1.0 | 0.46875 | 1.1625 | 37.0 | 0.90625 | 29.0 | 0.40625 | 13.0 |
| 3 | 400603828 | UTEP | Arkansas | 1 | 54 | 9 | 0.094017 | 5.07692 | 0.425926 | 1.119297 | ... | 0.482143 | 1.728689 | 0.4 | 0.205882 | 2.247059 | 76.0 | 0.647059 | 22.0 | 2.235294 | 76.0 |
| 4 | 400603829 | Auburn | Louisville | 1 | 63 | 10 | 0.258543 | 16.288231 | 0.539683 | 0.955356 | ... | 0.493827 | 1.123813 | 0.75 | 0.116279 | 3.802326 | 164.0 | 1.674419 | 72.0 | 1.162791 | 50.0 |
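
In the rows shown, each `game_id` appears twice, once per team, and one team's offensive numbers match its opponent's defensive numbers. The sketch below checks that structure on values copied from the first four rows of the output. Whether it holds for all 10,986 rows is an assumption that would need to be verified on the full frame.

```python
import numpy as np
import pandas as pd

# Values copied from the first four rows of adv_stats_df.head() above.
sample = pd.DataFrame({
    "game_id": [400603827, 400603827, 400603828, 400603828],
    "team": ["Alabama", "Wisconsin", "Arkansas", "UTEP"],
    "opponent": ["Wisconsin", "Alabama", "UTEP", "Arkansas"],
    "offense_success_rate": [0.469697, 0.344262, 0.482143, 0.425926],
    "defense_success_rate": [0.344262, 0.469697, 0.425926, 0.482143],
})

# Each game should contribute exactly two rows, one per team.
rows_per_game = sample.groupby("game_id").size()

# Pair every row with its opponent's row from the same game.
paired = sample.merge(
    sample,
    left_on=["game_id", "team"],
    right_on=["game_id", "opponent"],
    suffixes=("", "_opp"),
)

# A team's offensive success rate should equal its opponent's defensive one.
mirrored = np.isclose(
    paired["offense_success_rate"], paired["defense_success_rate_opp"]
).all()

print(rows_per_game)
print("offense/defense mirrored:", mirrored)
```

If the pattern holds across the full table, half of the columns are redundant at the game level, which matters when building one feature row per game.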


In [16]:
# info() prints its summary directly and returns None, so no print() is needed
betting_df.info()
betting_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5384 entries, 0 to 5383
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   game_id           5384 non-null   int64  
 1   season            5384 non-null   int64  
 2   week              5384 non-null   int64  
 3   home_team         5384 non-null   object 
 4   home_score        5384 non-null   int64  
 5   away_team         5384 non-null   object 
 6   provider          5384 non-null   object 
 7   spread            5377 non-null   float64
 8   formatted_spread  5384 non-null   object 
 9   spread_open       840 non-null    float64
 10  over_under        3675 non-null   float64
 11  over_under_open   841 non-null    float64
 12  home_moneyline    701 non-null    float64
 13  away_moneyline    700 non-null    float64
dtypes: float64(6), int64(4), object(4)
memory usage: 589.0+ KB


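|  | game_id | season | week | home_team | home_score | away_team | provider | spread | formatted_spread | spread_open | over_under | over_under_open | home_moneyline | away_moneyline |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 400756886 | 2015 | 1 | Oregon | 61 | Eastern Washington | consensus | -35.0 | Oregon -35 | NaN | NaN | NaN | NaN | NaN |
| 1 | 400756902 | 2015 | 1 | Pittsburgh | 45 | Youngstown State | consensus | -15.0 | Pittsburgh -15 | NaN | NaN | NaN | NaN | NaN |
| 2 | 400756890 | 2015 | 1 | California | 73 | Grambling | consensus | -43.0 | California -43 | NaN | NaN | NaN | NaN | NaN |
| 3 | 400756900 | 2015 | 1 | Miami | 45 | Bethune-Cookman | consensus | -39.0 | Miami -39 | NaN | NaN | NaN | NaN | NaN |
| 4 | 400764857 | 2015 | 1 | East Carolina | 28 | Towson | consensus | -30.5 | East Carolina -30.5 | NaN | NaN | NaN | NaN | NaN |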
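
The non-null counts in the `info()` output show that several betting columns are mostly empty. As a rough sketch, the share of missing values per column can be computed straight from those counts. The numbers below are copied from the output above, and the variable names are only illustrative.

```python
import pandas as pd

# Row count and non-null counts copied from the betting_df.info() output above.
total_rows = 5384
non_null = pd.Series({
    "spread": 5377,
    "spread_open": 840,
    "over_under": 3675,
    "over_under_open": 841,
    "home_moneyline": 701,
    "away_moneyline": 700,
})

# Fraction of rows in which each column is missing.
missing_share = (1 - non_null / total_rows).round(3)
print(missing_share.sort_values(ascending=False))
```

On the loaded frame, `betting_df.isna().mean()` would give the same shares directly. Columns that are missing in most rows, such as the opening lines and moneylines, are likely to need special handling before modeling.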


In [17]:
# info() prints its summary directly and returns None, so no print() is needed
game_results_df.info()
game_results_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4820 entries, 0 to 4819
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   game_id             4820 non-null   int64  
 1   season              4820 non-null   int64  
 2   week                4820 non-null   int64  
 3   conference_game     4820 non-null   bool   
 4   excitement_index    4820 non-null   float64
 5   attendance          4820 non-null   float64
 6   neutral_site        4820 non-null   bool   
 7   away_conference     4820 non-null   object 
 8   away_division       4820 non-null   object 
 9   away_id             4820 non-null   int64  
 10  away_points         4820 non-null   float64
 11  away_post_win_prob  4820 non-null   float64
 12  away_postgame_elo   4820 non-null   float64
 13  away_pregame_elo    4820 non-null   float64
 14  away_team           4820 non-null   object 
 15  home_conference     4820 non-null   object 
 16  home_d

|  | game_id | season | week | conference_game | excitement_index | attendance | neutral_site | away_conference | away_division | away_id | ... | away_pregame_elo | away_team | home_conference | home_division | home_id | home_points | home_post_win_prob | home_postgame_elo | home_pregame_elo | home_team |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 400763593 | 2015 | 1 | False | 5.390338 | 39184.0 | False | Conference USA | fbs | 2229 | ... | 1321.0 | Florida International | American Athletic | fbs | 2116 | 14.0 | 0.101317 | 1609.0 | 1626.0 | UCF |
| 1 | 400603840 | 2015 | 1 | False | 7.949786 | 51664.0 | True | ACC | fbs | 153 | ... | 1477.0 | North Carolina | SEC | fbs | 2579 | 17.0 | 0.322459 | 1646.0 | 1646.0 | South Carolina |
| 2 | 400763399 | 2015 | 1 | False | 3.188541 | 19717.0 | False | Big 12 | fbs | 197 | ... | 1567.0 | Oklahoma State | Mid-American | fbs | 2117 | 13.0 | 0.086295 | 1407.0 | 1417.0 | Central Michigan |
| 3 | 400603839 | 2015 | 1 | False | 5.657551 | 30307.0 | False | Conference USA | fbs | 98 | ... | 1521.0 | Western Kentucky | SEC | fbs | 238 | 12.0 | 0.690378 | 1371.0 | 1365.0 | Vanderbilt |
| 4 | 400756883 | 2015 | 1 | False | 4.617671 | 47825.0 | False | Big Ten | fbs | 130 | ... | 1553.0 | Michigan | Pac-12 | fbs | 254 | 24.0 | 0.901752 | 1610.0 | 1603.0 | Utah |
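
The three betting targets named in the introduction (winning team, over/under, spread) could be derived by joining game results to betting lines on `game_id`. The sketch below uses hypothetical toy values rather than real games. It assumes a negative `spread` means the home team is favored, which is inferred from rows like `Oregon -35` in the betting output, and it treats pushes (exact ties against the line) as not covering and not going over.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the real values live in game_results_df and betting_df.
games = pd.DataFrame({
    "game_id": [1, 2, 3],
    "home_points": [31.0, 17.0, 24.0],
    "away_points": [10.0, 20.0, 21.0],
})
lines = pd.DataFrame({
    "game_id": [1, 2, 3],
    "spread": [-14.0, -3.5, -7.0],       # assumed: negative = home team favored
    "over_under": [45.5, 36.5, np.nan],  # missing totals stay missing
})

merged = games.merge(lines, on="game_id", how="inner")
merged["home_margin"] = merged["home_points"] - merged["away_points"]
merged["total_points"] = merged["home_points"] + merged["away_points"]

# Winner: the home team wins when its margin is positive.
merged["home_win"] = merged["home_margin"] > 0

# Spread: the home team covers when its margin beats the line.
merged["home_cover"] = (merged["home_margin"] + merged["spread"]) > 0

# Over/under: left as NaN when no total was posted for the game.
merged["went_over"] = (merged["total_points"] > merged["over_under"]).where(
    merged["over_under"].notna()
)

print(merged[["game_id", "home_win", "home_cover", "went_over"]])
```

Since `betting_df` has a `provider` column and more rows than `game_results_df`, it may hold more than one line per game. If so, it would need to be reduced to one row per `game_id` before a join like this.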


In [18]:
# info() prints its summary directly and returns None, so no print() is needed
recruiting_df.info()
recruiting_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   year                   892 non-null    int64  
 1   team                   892 non-null    object 
 2   total_PPA              892 non-null    float64
 3   total_passing_ppa      892 non-null    float64
 4   total_receiving_ppa    892 non-null    float64
 5   total_rushing_ppa      892 non-null    float64
 6   percent_ppa            892 non-null    float64
 7   percent_passing_ppa    892 non-null    float64
 8   percent_receiving_ppa  892 non-null    float64
 9   perecent_rushing_ppa   892 non-null    float64
 10  usage                  892 non-null    float64
 11  passing_usage          892 non-null    float64
 12  receiving_usage        892 non-null    float64
 13  rushing_usage          892 non-null    float64
dtypes: float64(12), int64(1), object(1)
memory usage: 97.7+ KB

|  | year | team | total_PPA | total_passing_ppa | total_receiving_ppa | total_rushing_ppa | percent_ppa | percent_passing_ppa | percent_receiving_ppa | perecent_rushing_ppa | usage | passing_usage | receiving_usage | rushing_usage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2015 | Air Force | 225.0 | 14.2 | 109.3 | 101.5 | 0.634 | 0.164 | 0.869 | 0.714 | 0.646 | 0.162 | 0.95 | 0.711 |
| 1 | 2015 | Akron | 97.3 | 10.6 | 45.6 | 41.1 | 0.562 | 1.071 | 0.372 | 1.005 | 0.758 | 0.997 | 0.432 | 0.995 |
| 2 | 2015 | Alabama | 138.1 | 21.1 | 78.4 | 38.6 | 0.22 | 0.101 | 0.285 | 0.269 | 0.251 | 0.134 | 0.233 | 0.364 |
| 3 | 2015 | Appalachian State | 412.1 | 118.6 | 156.2 | 137.3 | 0.954 | 0.994 | 0.909 | 0.974 | 0.93 | 0.863 | 0.938 | 0.973 |
| 4 | 2015 | Arizona | 402.4 | 133.5 | 179.9 | 88.9 | 0.875 | 1.0 | 0.789 | 0.903 | 0.861 | 1.0 | 0.807 | 0.756 |
