# College Football - Pull Data

The goal of this notebook is to pull recruiting, team, and game data from CollegeFootballData.com and export it as CSV files to the data folder.

#### Before running this notebook 

* Request your own API key (it is sent by email) here: https://collegefootballdata.com/key
* Save it in a config/api_key.json file under the key api_key
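
The loading cell further down expects the file to be a JSON object with a single `api_key` entry. A minimal sketch of that layout follows; the `load_api_key` helper and the placeholder key value are illustrative assumptions, not part of the notebook.

```python
import json
import tempfile
from pathlib import Path


def load_api_key(path):
    """Read the 'api_key' entry from a JSON config file (hypothetical helper)."""
    with open(path, "r") as file:
        data = json.load(file)
    api_key = data.get("api_key")
    if not api_key:
        raise ValueError(f"No 'api_key' entry found in {path}")
    return api_key


# Demonstrate the expected file layout with a throwaway file and a placeholder key.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "api_key.json"
    config_path.write_text(json.dumps({"api_key": "YOUR_KEY_HERE"}))
    demo_key = load_api_key(config_path)

print(demo_key)
```

Raising an error when the entry is missing surfaces a misconfigured file right away, rather than later as a failed API request.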

##### Datasets extracted

1. **Recruiting information:** This data contains information about the high schoolers who were recruited into college football within a given timeframe.

2. **Team:** This data contains the FBS college football teams and a few of their attributes.

3. **Game:** This data contains all the games that were played within a given timeframe. Each row is one game.

4. **Game Manipulated:** This data takes the game dataset pulled above and changes its grain. Instead of one row per game, each game becomes two records: one from the perspective of the home team and one from the perspective of the away team. This makes analysis easier and exposes more familiar metrics such as points for and points against.


#### ----------------------------------

###### Helpful Tutorial
https://blog.collegefootballdata.com/introduction-to-cfb-analytics/

###### Actual Documentation
https://api.collegefootballdata.com/api/docs/?url=/api-docs.json


In [1]:
# Uncomment and run line below if cfbd library isn't already installed
#! pip install cfbd
from IPython.display import clear_output

import cfbd
import numpy as np
import pandas as pd
import json

pd.set_option('display.max_columns', None)
clear_output()


## Set up api connection

In [2]:
# This cell won't run as-is: you need your own API key.
# See the link above to have a key emailed to you, then store it in ../config/api_key.json under 'api_key'.

# Load JSON data from file
with open('../config/api_key.json', 'r') as file:
    data = json.load(file)

# Get the value of 'api_key'
api_key = data.get('api_key')


In [3]:
def api_setup(api_key):

    """
    Configure the API client.
    The only input is the API key, which can be requested from the link above.
    """

    configuration = cfbd.Configuration()
    configuration.api_key['Authorization'] = api_key
    configuration.api_key_prefix['Authorization'] = 'Bearer'

    return cfbd.ApiClient(configuration)
    
api_config = api_setup(api_key)

## Define timeframe 

In [4]:
start_year_timeframe = 2013
end_year_timeframe = 2023

## Get Datasource 1 - Player Recruiting Rankings

Get each football player's ranking and origin information as they were recruited into college each year.

In [5]:
def hs_recruits(start_year, end_year):
    
    """
    Two inputs: start_year and end_year (the ranges of years we want the recruiting data for - inclusive)
    
    1) Get each year as a json
    2) Convert to df
    3) Union each year's df together.
    """

    recruits_df_list = []

    for i in range(start_year, end_year + 1):

        # Connect to api for given year
        recr_api = cfbd.RecruitingApi(api_config)
        recruits = recr_api.get_recruiting_players(year = i)

        # Convert json to df
        df_recruits = pd.DataFrame.from_records([r.to_dict() for r in recruits])

        # Append dfs together to create list of dfs
        recruits_df_list.append(df_recruits)

    # Concatenate / union each year's df together
    df_recruits_final = pd.concat(recruits_df_list).reset_index()
    
    df_recruits_final['latitude'] = df_recruits_final.hometown_info.str['latitude']
    df_recruits_final['longitude'] = df_recruits_final.hometown_info.str['longitude']
    
    df_recruits_final.drop(columns = 'hometown_info', inplace = True)

    return df_recruits_final

df_recruits = hs_recruits(start_year_timeframe, end_year_timeframe)
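
The last steps of `hs_recruits` pull `latitude` and `longitude` out of the nested `hometown_info` column. A small sketch of that step on made-up rows (the names and coordinates are placeholders, not API output) shows how `.str[...]` reads a key from each dictionary in the column:

```python
import pandas as pd

# Made-up rows shaped like the recruiting records: one nested dict per recruit.
toy_recruits = pd.DataFrame(
    {
        "name": ["Player A", "Player B"],
        "hometown_info": [
            {"latitude": 33.8, "longitude": -83.9},
            {"latitude": 41.1, "longitude": -85.1},
        ],
    }
)

# .str[key] looks up the key in each dict of the column.
toy_recruits["latitude"] = toy_recruits["hometown_info"].str["latitude"]
toy_recruits["longitude"] = toy_recruits["hometown_info"].str["longitude"]

# The nested column is no longer needed once its values are flattened.
toy_recruits = toy_recruits.drop(columns="hometown_info")

print(toy_recruits)
```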

In [6]:
df_recruits.shape

(39216, 19)

In [7]:
# df_recruits[df_recruits['name'] == 'Mike Sainristil']
df_recruits.head() 


Unnamed: 0,index,id,athlete_id,recruit_type,year,ranking,name,school,committed_to,position,height,weight,stars,rating,city,state_province,country,latitude,longitude
0,0,24957,550011.0,HighSchool,2013,1.0,Robert Nkemdiche,Grayson,Ole Miss,SDE,76.0,285.0,5,1.0,Loganville,GA,USA,33.838998,-83.900738
1,1,24958,550965.0,HighSchool,2013,2.0,Jaylon Smith,Bishop Luers,Notre Dame,OLB,75.0,218.0,5,0.9986,Fort Wayne,IN,USA,41.07999,-85.138601
2,2,24959,551319.0,HighSchool,2013,3.0,Vernon Hargreaves III,Wharton,Florida,CB,71.0,185.0,5,0.9979,Tampa,FL,USA,27.94776,-82.458444
3,3,24960,550015.0,HighSchool,2013,4.0,Laremy Tunsil,Columbia,Ole Miss,OT,78.0,295.0,5,0.9975,Lake City,FL,USA,30.189676,-82.63929
4,4,24961,546417.0,HighSchool,2013,5.0,Su'a Cravens,Vista Murrieta,USC,S,73.0,205.0,5,0.996,Murrieta,CA,USA,33.577752,-117.188454


## Get Datasource 2 - College Football Team 

Get every FBS college football team along with a few attributes (conference, division, color, and logos).

In [8]:
def team_dataset():

    teams_api = cfbd.TeamsApi(api_config)
    teams = teams_api.get_fbs_teams()

    df_teams = pd.DataFrame.from_records([t.to_dict() for t in teams])
    # Keep only the columns needed downstream
    df_teams = df_teams[['id', 'school', 'conference', 'division', 'color', 'logos']]
    
    return df_teams

df_teams = team_dataset()

In [9]:
df_teams.head()

Unnamed: 0,id,school,conference,division,color,logos
0,2005,Air Force,Mountain West,Mountain,#004a7b,[http://a.espncdn.com/i/teamlogos/ncaa/500/200...
1,2006,Akron,Mid-American,East,#00285e,[http://a.espncdn.com/i/teamlogos/ncaa/500/200...
2,333,Alabama,SEC,West,#690014,[http://a.espncdn.com/i/teamlogos/ncaa/500/333...
3,2026,Appalachian State,Sun Belt,East,#000000,[http://a.espncdn.com/i/teamlogos/ncaa/500/202...
4,12,Arizona,Pac-12,,#002449,[http://a.espncdn.com/i/teamlogos/ncaa/500/12....


## Get Datasource 3 - All College Football Games 

Get every college football game played over a timeframe and store the results in a dataframe.

In [10]:
def games_non_transformed(start_year, final_year):
    
    
    """
    Connect to the games api and get every post and regular season game over a given time frame
    
    1) Beginning with post season, iterate over every year in given range.
    2) Union each year together
    3) Repeat for regular season
    4) Union post and regular season dfs together
    """

    # Connect to games api
    games_api = cfbd.GamesApi(api_config)

    # Post Season Games
    postseason_games = []
    for i in range(start_year, final_year + 1):

        games = games_api.get_games(year=i, season_type = 'postseason')
        df_games_post_i = pd.DataFrame.from_records([g.to_dict() for g in games])
        postseason_games.append(df_games_post_i)

    postseason_games_df = pd.concat(postseason_games)

    # Regular Season Games
    regseason_games = []
    for i in range(start_year, final_year + 1):

        games = games_api.get_games(year=i, season_type = 'regular')
        df_games_reg_i = pd.DataFrame.from_records([g.to_dict() for g in games])
        regseason_games.append(df_games_reg_i)

    regseason_games_df = pd.concat(regseason_games)

    # Union post and regular season
    return pd.concat([regseason_games_df, postseason_games_df])

df_games = games_non_transformed(start_year_timeframe, end_year_timeframe)
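
The two loops in `games_non_transformed` differ only in the `season_type` argument, so they could be folded into one. The sketch below is an illustration, not the notebook's code: `collect_games` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real API call so the example runs without an API key. In the notebook, the fetch step would be the `games_api.get_games(...)` call followed by `[g.to_dict() for g in games]`.

```python
import pandas as pd


def collect_games(fetch, start_year, final_year, season_types=("regular", "postseason")):
    """Call fetch(year, season_type) for every combination and stack the results."""
    frames = []
    for season_type in season_types:
        for year in range(start_year, final_year + 1):
            records = fetch(year, season_type)
            frames.append(pd.DataFrame.from_records(records))
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


def fake_fetch(year, season_type):
    """Stand-in for the API call: returns one made-up game per request."""
    return [{"season": year, "season_type": season_type, "home_team": "Team A"}]


toy_games = collect_games(fake_fetch, 2013, 2014)
print(toy_games)
```

Passing the fetch step in as a function keeps the looping logic testable offline. Note that `ignore_index=True` is a deliberate difference from the notebook, whose concatenated frame keeps each year's original row labels.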

In [11]:
df_games.head() 

Unnamed: 0,id,season,week,season_type,start_date,start_time_tbd,completed,neutral_site,conference_game,attendance,venue_id,venue,home_id,home_team,home_conference,home_division,home_points,home_line_scores,home_post_win_prob,home_pregame_elo,home_postgame_elo,away_id,away_team,away_conference,away_division,away_points,away_line_scores,away_post_win_prob,away_pregame_elo,away_postgame_elo,excitement_index,highlights,notes
0,332412309,2013,1,regular,2013-08-29T22:00:00.000Z,False,True,False,False,20790.0,3696.0,Dix Stadium,2309,Kent State,Mid-American,fbs,17.0,"[7, 0, 0, 10]",0.396157,1530.0,1536.0,2335,Liberty,Big South,fcs,10.0,"[0, 3, 7, 0]",0.603843,1467.0,1461.0,,,
1,332412579,2013,1,regular,2013-08-29T22:00:00.000Z,,True,False,False,81572.0,3994.0,Williams-Brice Stadium,2579,South Carolina,SEC,fbs,27.0,"[17, 3, 7, 0]",0.655586,1759.0,1785.0,153,North Carolina,ACC,fbs,10.0,"[0, 7, 3, 0]",0.344414,1638.0,1612.0,,,
2,332410154,2013,1,regular,2013-08-29T22:30:00.000Z,False,True,False,False,26202.0,3630.0,BB&T Field,154,Wake Forest,ACC,fbs,31.0,"[3, 14, 7, 7]",0.999789,,,2506,Presbyterian College,Big South,fcs,7.0,"[7, 0, 0, 0]",0.000211,,,,,
3,332410135,2013,1,regular,2013-08-29T23:00:00.000Z,,True,False,False,44217.0,3953.0,TCF Bank Stadium,135,Minnesota,Big Ten,fbs,51.0,"[3, 13, 14, 21]",0.448322,1417.0,1466.0,2439,UNLV,Mountain West,fbs,23.0,"[6, 7, 3, 7]",0.551678,1207.0,1158.0,,,
4,332410189,2013,1,regular,2013-08-29T23:00:00.000Z,,True,False,False,18142.0,3700.0,Doyt Perry Stadium,189,Bowling Green,Mid-American,fbs,34.0,"[3, 3, 7, 21]",0.648535,1450.0,1543.0,202,Tulsa,Conference USA,fbs,7.0,"[0, 0, 0, 7]",0.351465,1635.0,1542.0,,,


#### Manipulate data so it's at the team-game grain, rather than the game grain
###### -- Each game appears twice in the result, but we can now filter for a team on one column.

###### -- Each game will have two records: one from the home team's perspective, one from the away team's.
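
A minimal sketch of the grain change on two made-up games may help. It uses a rename-and-stack approach rather than the per-team loop in the cell that follows, so treat it as an alternative illustration under assumptions: the column names follow the games table, while the scores, team names, and the `to_team_game_grain` function are invented.

```python
import pandas as pd

toy_games = pd.DataFrame(
    {
        "id": [1, 2],
        "home_team": ["Team A", "Team B"],
        "away_team": ["Team B", "Team C"],
        "home_points": [42, 10],
        "away_points": [27, 24],
    }
)


def to_team_game_grain(df_games):
    """Return two rows per game: one per participating team."""
    # View of each game from the home team's side
    home_view = df_games.rename(
        columns={
            "home_team": "main_team",
            "away_team": "opposing_team",
            "home_points": "points_for",
            "away_points": "points_against",
        }
    ).assign(home_game_flag=1)
    # View of each game from the away team's side
    away_view = df_games.rename(
        columns={
            "away_team": "main_team",
            "home_team": "opposing_team",
            "away_points": "points_for",
            "home_points": "points_against",
        }
    ).assign(home_game_flag=0)

    # concat aligns on column names, so the differing column order is harmless
    team_games = pd.concat([home_view, away_view], ignore_index=True)
    team_games["point_differential"] = team_games["points_for"] - team_games["points_against"]
    team_games["win_flag"] = (team_games["point_differential"] > 0).astype(int)
    return team_games


toy_team_games = to_team_game_grain(toy_games)
print(toy_team_games.sort_values(["id", "home_game_flag"]))
```

Because every game contributes exactly one home row and one away row, this variant also covers teams that never appear as a home team.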

In [12]:
def games_manipulation(df_games):
    
    """
    The function takes in the output of the previous function games_non_transformed().
    It converts the grain of the data. Before, the grain was 1 row per game.
    Now, each game has two rows: one from the perspective of each team.
    For example, when Ohio State lost to Michigan 42-27, Michigan shows a point differential of +15 and Ohio State shows -15.
    
    1) First identifies every team that played at least 1 home game.
    2) Loops over every team.
    3) Converts the key data points so that the numbers are referenced from the perspective of the team of interest (also called main_team)
    """


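    df_seasons = []
    # NOTE: unique() keeps order of first appearance, so the [1:] slice skips the first
    # home team in df_games; that team gets no rows of its own unless the slice is removed.
    teams_list = list(df_games['home_team'].unique()[1:])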

    # Loop over every team that played a game over the time frame specified in games_non_transformed()
    for team in teams_list:

        # Find every home / away game for the team of interest
        df_home = df_games[df_games['home_team'] == team]
        df_away = df_games[df_games['away_team'] == team]

        # Combine home and away games into 1 table. 
        df_season_i = pd.concat([df_home, df_away])

        # Add column specifying what team that row of data pertains to
        df_season_i['main_team'] = team

        ############
        # Adjust key columns so they represent our team of interest

        df_season_i['home_game_flag'] = np.where(df_season_i['home_team'] == team, 1, 0)

        df_season_i['team_id'] = np.where(df_season_i['home_team'] == team, df_season_i['home_id'], df_season_i['away_id'])
        df_season_i['opposing_team_id'] = np.where(df_season_i['home_team'] == team, df_season_i['away_id'], df_season_i['home_id'])

        df_season_i['team_conference'] = np.where(df_season_i['home_team'] == team, df_season_i['home_conference'], df_season_i['away_conference'])
        df_season_i['opposing_conference'] = np.where(df_season_i['home_team'] == team, df_season_i['away_conference'], df_season_i['home_conference'])

        df_season_i['points_for'] = np.where(df_season_i['home_team'] == team, df_season_i['home_points'], df_season_i['away_points'])
        df_season_i['points_against'] = np.where(df_season_i['home_team']== team, df_season_i['away_points'], df_season_i['home_points'])

        df_season_i['point_differential'] = df_season_i['points_for'] - df_season_i['points_against']

        df_season_i['team_line_scores']  = np.where(df_season_i['home_team'] == team, df_season_i['home_line_scores'], df_season_i['away_line_scores'])
        df_season_i['opposing_line_scores']  = np.where(df_season_i['home_team'] == team, df_season_i['away_line_scores'], df_season_i['home_line_scores'])

        df_season_i['team_pregame_elo']  = np.where(df_season_i['home_team'] == team, df_season_i['home_pregame_elo'], df_season_i['away_pregame_elo'])
        df_season_i['team_postgame_elo']  = np.where(df_season_i['home_team'] == team, df_season_i['home_postgame_elo'], df_season_i['away_postgame_elo'])

        df_season_i['opponent_pregame_elo'] = np.where(df_season_i['home_team'] != team, df_season_i['home_pregame_elo'], df_season_i['away_pregame_elo'])
        df_season_i['opponent_postgame_elo'] = np.where(df_season_i['home_team'] != team, df_season_i['home_postgame_elo'], df_season_i['away_postgame_elo'])

        df_season_i['win_flag'] = np.where(df_season_i['point_differential'] > 0, 1, 0)
        
#         a = df_season_i.sort_values('start_date', ascending=True) \
#                        .groupby(['main_team', 'season']) \
#                        .cumcount() + 1
        
#         df_season_i['game_that_season'] = list(a)
        
        
        ############

        df_seasons.append(df_season_i)
        
    data = pd.concat(df_seasons)
    data = data.drop(columns = ['home_id', 'home_team', 'home_conference', 'home_division', 'home_points', 'home_line_scores', 'home_post_win_prob',
                                'home_pregame_elo', 'home_postgame_elo', 'away_id', 'away_team', 'away_conference', 'away_division', 'away_points',
                                'away_line_scores', 'away_post_win_prob', 'away_pregame_elo', 'away_postgame_elo'], axis = 1)
    
    # Field that counts what game (ie the 15th game, 3rd game, etc)
    data = data.reset_index()
    data['game_that_season'] = data.sort_values(['season', 'start_date', 'team_id'], ascending=True) \
                                        .groupby(['team_id', 'season']) \
                                        .cumcount() + 1
    
    
    
    return data
    
df_manipulated_games = games_manipulation(df_games)

df_manipulated_games.head()

Unnamed: 0,index,id,season,week,season_type,start_date,start_time_tbd,completed,neutral_site,conference_game,attendance,venue_id,venue,excitement_index,highlights,notes,main_team,home_game_flag,team_id,opposing_team_id,team_conference,opposing_conference,points_for,points_against,point_differential,team_line_scores,opposing_line_scores,team_pregame_elo,team_postgame_elo,opponent_pregame_elo,opponent_postgame_elo,win_flag,game_that_season
0,1,332412579,2013,1,regular,2013-08-29T22:00:00.000Z,,True,False,False,81572.0,3994.0,Williams-Brice Stadium,,,,South Carolina,1,2579,153,SEC,ACC,27.0,10.0,17.0,"[17, 3, 7, 0]","[0, 7, 3, 0]",1759.0,1785.0,1638.0,1612.0,1,1
1,324,332572579,2013,3,regular,2013-09-14T23:00:00.000Z,,True,False,True,81371.0,3994.0,Williams-Brice Stadium,,,,South Carolina,1,2579,238,SEC,SEC,35.0,25.0,10.0,"[21, 7, 7, 0]","[0, 10, 0, 15]",1769.0,1777.0,1644.0,1636.0,1,3
2,671,332782579,2013,6,regular,2013-10-05T23:30:00.000Z,,True,False,True,82313.0,3994.0,Williams-Brice Stadium,,,,South Carolina,1,2579,96,SEC,SEC,35.0,28.0,7.0,"[14, 10, 3, 8]","[0, 7, 0, 21]",1780.0,1768.0,1353.0,1365.0,1,5
3,1011,333062579,2013,10,regular,2013-11-02T16:21:00.000Z,,True,False,True,82111.0,3994.0,Williams-Brice Stadium,,,,South Carolina,1,2579,344,SEC,SEC,34.0,16.0,18.0,"[14, 3, 17, 0]","[7, 3, 0, 6]",1872.0,1890.0,1634.0,1616.0,1,9
4,1295,333202579,2013,12,regular,2013-11-17T00:00:00.000Z,,True,False,True,83853.0,3994.0,Williams-Brice Stadium,,,,South Carolina,1,2579,57,SEC,SEC,19.0,14.0,5.0,"[3, 3, 7, 6]","[7, 7, 0, 0]",1890.0,1887.0,1675.0,1678.0,1,10
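
The `game_that_season` field comes from sorting by date and numbering rows within each team and season. A toy sketch with invented dates shows the pattern. Assigning the result back relies on pandas aligning on the index, which is why a unique index (here the default one, in the function the result of `reset_index()`) matters.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "team_id": [1, 1, 1, 2, 2],
        "season": [2013, 2013, 2013, 2013, 2013],
        # Deliberately out of chronological order
        "start_date": ["2013-09-14", "2013-08-29", "2013-10-05", "2013-09-07", "2013-08-31"],
    }
)

# Sort chronologically, number the rows within each (team, season) group starting at 1,
# then let index alignment place each number back on its original row.
toy["game_that_season"] = (
    toy.sort_values(["season", "start_date", "team_id"])
    .groupby(["team_id", "season"])
    .cumcount()
    + 1
)

print(toy)
```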


## Output Data into CSV files 

The CSV files are written to the data folder.

In [13]:
df_games.to_csv('../data/games.csv') 
df_recruits.to_csv('../data/recruits.csv')
df_teams.to_csv('../data/teams.csv')
df_manipulated_games.to_csv('../data/games_manipulated.csv')
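
One thing to keep in mind when reading these files back: list-valued columns such as `home_line_scores` are written as text, and the default `to_csv` call also writes the index as an unnamed first column. A hedged sketch on a toy frame follows; it uses an in-memory buffer instead of a file, and `ast.literal_eval` is only one possible way to parse the lists back.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({"id": [1, 2], "home_line_scores": [[7, 0, 0, 10], [17, 3, 7, 0]]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False skips the unnamed index column
buffer.seek(0)

restored = pd.read_csv(buffer)
# After the round trip the lists are plain strings such as "[7, 0, 0, 10]"
was_string = isinstance(restored.loc[0, "home_line_scores"], str)

# Parse the strings back into Python lists
restored["home_line_scores"] = restored["home_line_scores"].apply(ast.literal_eval)
print(restored)
```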

In [14]:
print(f'Games: {df_games.shape}\n')
print(f'Game Manipulated: {df_manipulated_games.shape}\n')
print(f'Recruits: {df_recruits.shape}\n')
print(f'Teams: {df_teams.shape}\n')

Games: (21427, 33)

Game Manipulated: (42537, 33)

Recruits: (39216, 19)

Teams: (133, 6)

