# College Football - Pull Data

The goal of this notebook is to pull in all the data from CollegeFootballData.com and export it to the data folder. 

#### Before running this notebook 

* Request an API key (it is sent by email) here: https://collegefootballdata.com/key
* Add it to a `config/api_key.json` file under the key `api_key`

Example of the structure of the config/api_key.json: 
```
{  
    "api_key": "InserttheAPIKeyhere"

}
```



##### Datasets extracted

1. **Recruiting information:** This data contains information about the high schoolers who were recruited into college football within a given timeframe.

2. **Team:** This data contains all the college teams in the dataset.

3. **Game:** This data contains all the games that were played within a given timeframe. Each row is one game.

4. **Roster:** For every single team from 2013 - 2023, get all their roster data from the API.

5. **Team Composite Ratings:** For every single team in every year, pull in their team composite rating from the same API. This metric looks at all the players on the roster and aggregates their recruiting profiles.

6. **Game Manipulated:** This data takes the game dataset pulled above and reshapes it from one row per game into two rows per game: one from the perspective of the home team and one from the perspective of the away team. This makes analysis easier and exposes more familiar metrics like points for.

7. **Draft:** This data gives us the NFL draft picks with their rank and the school that they came from.
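
The reshaping described for the Game Manipulated dataset can be sketched on a toy table. The column names (`home_team`, `away_team`, `home_points`, `away_points`), the helper name `games_to_team_rows`, and the data are assumptions made for illustration; the real game dataset may use different names.

```python
import pandas as pd

def games_to_team_rows(df_games):
    """Hypothetical helper: turn one row per game into one row per team per game."""
    # Home team's perspective: its points are "points_for"
    home = df_games.rename(columns={
        'home_team': 'team', 'away_team': 'opponent',
        'home_points': 'points_for', 'away_points': 'points_against',
    }).assign(is_home=True)

    # Away team's perspective: the same game with the roles swapped
    away = df_games.rename(columns={
        'away_team': 'team', 'home_team': 'opponent',
        'away_points': 'points_for', 'home_points': 'points_against',
    }).assign(is_home=False)

    # Stack both perspectives; columns are aligned by name
    return pd.concat([home, away], ignore_index=True)

# Toy data (made up, not from the API)
toy_games = pd.DataFrame({
    'game_id': [1, 2],
    'home_team': ['Team A', 'Team C'],
    'away_team': ['Team B', 'Team A'],
    'home_points': [28, 10],
    'away_points': [14, 17],
})

team_rows = games_to_team_rows(toy_games)
```

With this layout, a per-team metric such as total points for becomes a plain `groupby('team')['points_for'].sum()` instead of separate home and away calculations.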

#### ----------------------------------

###### Helpful Tutorial
https://blog.collegefootballdata.com/introduction-to-cfb-analytics/

###### Actual Documentation
https://api.collegefootballdata.com/api/docs/?url=/api-docs.json

In [4]:
# Uncomment and run line below if cfbd library isn't already installed
# !pip install cfbd
from IPython.display import clear_output

import cfbd
import numpy as np
import pandas as pd
import json

pd.set_option('display.max_columns', None)
clear_output()


## Set up api connection

In [5]:
# Running this code by itself won't work. You'll need your own API key.
# See the link above to have an API key emailed to you, then save it in config/api_key.json.

# Load JSON data from file
with open('../config/api_key.json', 'r') as file:
    data = json.load(file)

# Get the value of 'api_key'
api_key = data.get('api_key')
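
If the file or the `api_key` entry is missing, `data.get('api_key')` returns `None` and the problem would likely only show up later, when the first API call fails. A minimal sketch of failing early instead; `load_api_key` is a hypothetical helper, not part of this notebook or of `cfbd`, and the demo writes to a throwaway file rather than the real config path.

```python
import json
import tempfile
from pathlib import Path

def load_api_key(path):
    """Hypothetical helper: read 'api_key' from a JSON file and fail early if it is missing."""
    with open(path, 'r') as file:
        data = json.load(file)

    key = data.get('api_key')
    if not key:
        raise ValueError(f"No 'api_key' value found in {path}")
    return key

# Demo against a temporary file with the same structure as config/api_key.json
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'api_key.json'
    demo_path.write_text(json.dumps({'api_key': 'InserttheAPIKeyhere'}))
    demo_key = load_api_key(demo_path)
```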

In [6]:
def api_setup(api_key):
    """
    Configure the API client.
    The only input is the API key, which can be requested from the link above.
    """
    configuration = cfbd.Configuration()
    configuration.api_key['Authorization'] = api_key
    configuration.api_key_prefix['Authorization'] = 'Bearer'

    return cfbd.ApiClient(configuration)

api_config = api_setup(api_key)

## Define timeframe 

In [27]:
start_year_timeframe = 2008
end_year_timeframe = 2020 # Recruits from the 2020 class would have played all 4 years by 2023 (i.e., Blake Corum's class).

## Get Datasource 1 - Player Recruiting Rankings

Get each football player's ranking and origin information as they were recruited into college each year.

In [28]:
def hs_recruits(start_year, end_year):
    
    """
    Two inputs: start_year and end_year (the ranges of years we want the recruiting data for - inclusive)
    
    1) Get each year as a json
    2) Convert to df
    3) Union each year's df together.
    """

    recruits_df_list = []

    # Connect to the api once and reuse the connection for every year
    recr_api = cfbd.RecruitingApi(api_config)

    for year in range(start_year, end_year + 1):

        # Get the recruits for the given year
        recruits = recr_api.get_recruiting_players(year = year)

        # Convert json to df
        df_recruits = pd.DataFrame.from_records([r.to_dict() for r in recruits])

        # Append dfs together to create list of dfs
        recruits_df_list.append(df_recruits)

    # Concatenate / union each year's df together
    df_recruits_final = pd.concat(recruits_df_list).reset_index()
    
    df_recruits_final['latitude'] = df_recruits_final.hometown_info.str['latitude']
    df_recruits_final['longitude'] = df_recruits_final.hometown_info.str['longitude']
    
    df_recruits_final.drop(columns = 'hometown_info', inplace = True)

    return df_recruits_final

df_recruits = hs_recruits(start_year_timeframe, end_year_timeframe)

In [29]:
df_recruits.sample(5)

Unnamed: 0,index,id,athlete_id,recruit_type,year,ranking,name,school,committed_to,position,height,weight,stars,rating,city,state_province,country,latitude,longitude
23038,2939,67043,,HighSchool,2015,2887.0,Dalton Banks,Alamo Heights,,PRO,74.0,207.0,2,0.7667,San Antonio,TX,USA,29.4246,-98.495141
44379,4145,63495,,HighSchool,2020,,Nick Bagashvili,Tottenville,Temple,DT,75.0,280.0,3,0.8263,Staten Island,NY,USA,40.564209,-74.125305
10786,1135,23146,535794.0,HighSchool,2012,1135.0,Jacob Bailey,Cathedral,Indiana,OT,76.0,265.0,3,0.8342,Indianapolis,IN,USA,39.768333,-86.15835
36988,919,47687,,HighSchool,2019,891.0,Christopher Russell,Dyersburg,Texas A&M,ILB,73.0,228.0,3,0.8593,Dyersburg,TN,USA,36.034516,-89.385628
16681,433,28631,4240780.0,HighSchool,2014,429.0,Lonnie Johnson,West Side,Western Michigan,S,75.0,190.0,3,0.8762,Gary,IN,USA,41.602129,-87.337137
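
The `latitude` and `longitude` columns above come from `hometown_info`, which holds one dictionary per recruit; indexing through `.str[...]` looks up the same key in every row. A small sketch of that step on made-up rows (the names and coordinates are illustrative only):

```python
import pandas as pd

# Toy frame with a dict-valued column, mimicking the shape of hometown_info
toy = pd.DataFrame({
    'name': ['Player One', 'Player Two'],
    'hometown_info': [
        {'latitude': 29.4246, 'longitude': -98.4951},
        {'latitude': 40.5642, 'longitude': -74.1253},
    ],
})

# .str[key] applies the lookup element-wise to each dictionary
toy['latitude'] = toy.hometown_info.str['latitude']
toy['longitude'] = toy.hometown_info.str['longitude']

# The nested column is no longer needed once it has been flattened
toy = toy.drop(columns='hometown_info')
```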


## Get Dataset 2 - Draft Data

In [32]:
import requests
from bs4 import BeautifulSoup
import pandas as pd 
import numpy as np 

In [35]:
from io import StringIO

def get_draft_data(year):
    """Scrape one year's NFL draft table from Pro Football Reference. Returns None on failure."""
    url = f"https://www.pro-football-reference.com/years/{year}/draft.htm"
    response = requests.get(url)

    # Check if the request was successful (status code 200)
    if response.status_code != 200:
        print("Failed to retrieve data. Status Code:", response.status_code)
        return None

    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the table containing the draft data (the first table on the page)
    table = soup.find('table')
    if table is None:
        print(f"No table found for {year}.")
        return None

    # Use Pandas to read the HTML table into a DataFrame and drop the top level of the
    # two-level header. StringIO avoids the pandas warning about literal HTML strings.
    return pd.read_html(StringIO(str(table)))[0].droplevel(0, axis=1)

def combine_draft_data(start_year, end_year):
    """Concatenate the draft tables for start_year up to, but not including, end_year."""
    years = np.arange(start_year, end_year, 1)
    print(f'Getting drafts for the following years: {years}. ')

    draft_dfs = []
    for year in years:
        data_draft = get_draft_data(year)
        if data_draft is None:
            continue  # Skip years that could not be downloaded
        data_draft['draft_year'] = year
        draft_dfs.append(data_draft)
    return pd.concat(draft_dfs, axis=0)
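
The `droplevel(0, axis=1)` call is there because the scraped draft table comes back with a two-level column header. A toy sketch of what that call does, using made-up group names and values rather than the real page layout:

```python
import pandas as pd

# Toy two-level header: (group, column). The group names are assumptions for illustration.
columns = pd.MultiIndex.from_tuples([
    ('Info', 'Rnd'),
    ('Info', 'Player'),
    ('Passing', 'Cmp'),
    ('Rushing', 'Att'),
])
toy_draft = pd.DataFrame([[1, 'Player One', 250, 40]], columns=columns)

# Drop the top (group) level so only the plain column names remain
flat = toy_draft.droplevel(0, axis=1)
flat_columns = list(flat.columns)
```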

In [36]:
start_year_timeframe = 2010 
end_year_timeframe = 2024

In [37]:
# Get draft data through 2024 (the end year passed in is exclusive, hence the +1)
df_draft = combine_draft_data(start_year_timeframe, end_year_timeframe+1)

Getting drafts for the following years: [2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023
 2024]. 


In [40]:
df_draft.head()

Unnamed: 0,Rnd,Pick,Tm,Player,Pos,Age,To,AP1,PB,St,wAV,DrAV,G,Cmp,Att,Yds,TD,Int,Att.1,Yds.1,TD.1,Rec,Yds.2,TD.2,Solo,Int.1,Sk,College/Univ,Unnamed: 28_level_1,draft_year
0,1,1,STL,Sam Bradford,QB,22,2018,0,0,5,44,25,83,1855,2967,19449,103,61,146,340,2,1,5,0,,,,Oklahoma,College Stats,2010
1,1,2,DET,Ndamukong Suh,DT,23,2022,3,5,12,100,59,199,0,0,0,0,0,0,0,0,0,0,0,392.0,1.0,71.5,Nebraska,College Stats,2010
2,1,3,TAM,Gerald McCoy,DT,22,2021,1,6,10,69,65,140,0,0,0,0,0,0,0,0,0,0,0,235.0,,59.5,Oklahoma,College Stats,2010
3,1,4,WAS,Trent Williams,T,22,2023,3,11,12,99,51,178,0,0,0,0,0,0,0,0,0,0,0,1.0,,,Oklahoma,College Stats,2010
4,1,5,KAN,Eric Berry,DB,21,2018,3,5,5,50,50,89,0,0,0,0,0,0,0,0,0,0,0,377.0,14.0,5.5,Tennessee,College Stats,2010


## Get Dataset 3 - College Team Data (conference, location, etc.)

In [7]:
def team_dataset():

    teams_api = cfbd.TeamsApi(api_config)
    teams = teams_api.get_fbs_teams()

    df_teams = pd.DataFrame.from_records([t.to_dict() for t in teams])
    
    # Extract coordinates from dict columns
    df_teams['latitude_school'] = df_teams['location'].apply(lambda x: x.get('latitude'))
    df_teams['longitude_school'] = df_teams['location'].apply(lambda x: x.get('longitude'))
    
    df_teams = df_teams[['id', 'school', 'conference', 'division', 'color', 'logos', 'latitude_school', 'longitude_school']]
    
    return df_teams

df_teams = team_dataset()

# Remove brackets around image url
df_teams['logos'] = df_teams['logos'].str.get(0)

In [9]:
df_teams.head()

Unnamed: 0,id,school,conference,division,color,logos,latitude_school,longitude_school
0,2005,Air Force,Mountain West,,#004a7b,http://a.espncdn.com/i/teamlogos/ncaa/500/2005...,38.99697,-104.843616
1,2006,Akron,Mid-American,,#00285e,http://a.espncdn.com/i/teamlogos/ncaa/500/2006...,41.072553,-81.508341
2,333,Alabama,SEC,,#9e1632,http://a.espncdn.com/i/teamlogos/ncaa/500/333.png,33.208275,-87.550384
3,2026,Appalachian State,Sun Belt,East,#000000,http://a.espncdn.com/i/teamlogos/ncaa/500/2026...,36.211427,-81.685428
4,12,Arizona,Big 12,,#0c234b,http://a.espncdn.com/i/teamlogos/ncaa/500/12.png,32.228805,-110.948868


### Save all to CSV in the 'data' folder

In [45]:
df_draft.to_csv('../data/draft_2010_2024.csv', index = False)
print('df_draft: ' + str(df_draft.shape))

df_recruits.to_csv('../data/recruits_2008_2020.csv', index = False)
print('df_recruits: ' + str(df_recruits.shape))

df_teams.to_csv('../data/teams.csv', index = False)
print('df_teams: ' + str(df_teams.shape))

df_draft: (3926, 30)
df_recruits: (44541, 19)
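
The file names above hard-code the year ranges, so they can drift out of sync if the timeframe variables change. One optional alternative is to build the names from the years; `dated_csv_path` is a hypothetical helper and the `../data` default simply mirrors the paths used in this notebook.

```python
from pathlib import Path

def dated_csv_path(name, start_year, end_year, data_dir='../data'):
    """Hypothetical helper: build a path like ../data/<name>_<start>_<end>.csv."""
    return Path(data_dir) / f'{name}_{start_year}_{end_year}.csv'

# Same file names as the cell above, derived from the year range
draft_path = dated_csv_path('draft', 2010, 2024)
recruits_path = dated_csv_path('recruits', 2008, 2020)
```

A call such as `df_draft.to_csv(draft_path, index = False)` would then write to the same location as before.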
