## SIADS Milestone I: CFB Analysis

#### The goal of this notebook is to answer the following questions:

1) Do teams need to build talented rosters through recruiting in order to compete for championships?
2) Is a player's college recruiting ranking a good indicator for the NFL draft?
--
##### Notebook Outline:
1) Read in the following
* Games outcomes dataset
* Team attributes dataset (for their image URL)
* Team Talent Composite Ratings dataset

2) Data Manipulation:
* Filter game dataset to week 12 to get only the end of the regular season
* Filter to only power 5 conferences
* Create a flag saying whether that team competed in the national championship that year
* Join all datasets together

3) Create scatter plot(s) to see the relationship between recruiting and team success.

4) Bring in draft data and player recruiting data to see correlation / heatmap between recruiting rank and draft stock
* Joining onto the draft data was complicated.
* Originally, intended to do it based on name, but there are instances where two players have the same name.
* To choose the correct match, I'm using the university columns to see whether the 'committed to' and 'drafted from' values match.
* Unfortunately, the two datasets use different naming conventions (like Mississippi vs Ole Miss).
* To get around this, I'm only checking to see if the two columns have at least one substring of 4 characters in common. 
* If there are duplicate players, we'll get the one with the matching university column
* Remove undrafted players
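
As a quick illustration of the 4-character check described above, here is a minimal sketch. The helper name `shares_substring` and the school pairs are chosen only for illustration; the notebook's own version of the function appears in a later cell.

```python
def shares_substring(a: str, b: str, min_len: int = 4) -> bool:
    """Return True if any run of min_len consecutive characters in a also occurs in b."""
    return any(a[i:i + min_len] in b for i in range(len(a) - min_len + 1))

# Illustrative school-name pairs (not taken from the datasets)
print(shares_substring("Mississippi", "Ole Miss"))  # both contain "Miss"
print(shares_substring("Alabama", "Ohio State"))    # no shared 4-character run
print(shares_substring("Texas", "Texas A&M"))       # different schools, but they share "Texa"
```

The last pair shows the trade-off of this heuristic: it tolerates naming differences, but it can also accept two different schools whose names overlap.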



In [1]:
# Uncomment and run line below if cfbd library isn't already installed
#! pip install cfbd

import cfbd
import numpy as np
import pandas as pd
import altair as alt

pd.set_option('display.max_columns', None)

In [2]:
# Get the teams dataset straight from the API
# The CSV version of the url was corrupted

import config
api_key = config.api_key

def api_setup(api_key):

    """
    Configure the api. 
    Only input is the apikey which can be created from the link above.
    """
    import cfbd
    
    configuration = cfbd.Configuration()
    configuration.api_key['Authorization'] = api_key
    configuration.api_key_prefix['Authorization'] = 'Bearer'

    return cfbd.ApiClient(configuration)
    
api_config = api_setup(api_key)

def team_dataset():

    teams_api = cfbd.TeamsApi(api_config)
    teams = teams_api.get_fbs_teams()

    df_teams = pd.DataFrame.from_records([t.to_dict() for t in teams])
    # Keep only the columns used later; copy so later column assignments don't modify a slice
    df_teams = df_teams[['id', 'school', 'conference', 'division', 'color', 'logos']].copy()
    
    return df_teams

df_teams = team_dataset()

# Each 'logos' entry is a list of URLs; keep only the first one
df_teams['logos'] = df_teams['logos'].str.get(0)

In [3]:
df = pd.read_csv('../data/games_manipulated.csv')

# Filter to only the power 5 conference week 12
power_5_conf = ['Pac-12', 'Big 12', 'ACC', 'SEC', 'Big Ten']
df = df[df['team_conference'].isin(power_5_conf)]
df = df[df['game_that_season'] == 12] # Final game of reg season

# Bring in only necessary columns
df = df[['season', 'team_id', 'main_team', 'team_postgame_elo', 'team_conference']]

# Combine teams and game datasets
final_df = pd.merge(left = df, right = df_teams, left_on = 'team_id', right_on = 'id')
final_df = final_df[['season', 'team_conference', 'team_postgame_elo', 'main_team', 'logos']]



#### Composite Talent Rankings and Team's Success.

In [4]:
df = pd.read_csv('../data/games_manipulated.csv')



In [5]:
df = pd.read_csv('../data/games_manipulated.csv')

# Filter to only the power 5 conference week 12
power_5_conf = ['Pac-12', 'Big 12', 'ACC', 'SEC', 'Big Ten']
df = df[df['team_conference'].isin(power_5_conf)]
df = df[df['game_that_season'] == 12] 

# Bring in only necessary columns
df = df[['season', 'team_id', 'main_team', 'team_postgame_elo', 'team_conference']]

final_df = pd.merge(left = df, right = df_teams, left_on = 'team_id', right_on = 'id')

final_df = final_df[['season', 'team_conference', 'team_postgame_elo', 'main_team', 'logos', 'color']]



In [6]:
# Get dataset of teams that competed in national championship game
df = pd.read_csv('../data/games_manipulated.csv')

# Match either capitalization in a single pass so no game row is selected twice
df = df[df['notes'].notna()]
is_title_game = df['notes'].str.contains('national championship', case=False)

championship_games = df[is_title_game].copy()
championship_games['championship_appearance'] = 1
championship_games = championship_games.sort_values(by = 'season', ascending = True)[['main_team', 'season', 'win_flag', 'championship_appearance']]



In [7]:
# Join championship dataset to end of regular season game dataset


new_df = pd.merge(final_df, championship_games, how='left', on=['season', 'main_team'])
new_df['Championship Game Appearance'] = new_df['championship_appearance'].fillna(0)
new_df['win_flag'] = new_df['win_flag'].fillna(0)
new_df = new_df.sort_values(by = 'team_postgame_elo', ascending = False)

new_df['year_team'] = new_df['season'].astype(str) + ' ' + new_df['main_team'].astype(str)
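
The left join plus `fillna(0)` pattern used in this cell can be sketched on a tiny made-up example. The team names and values below are hypothetical and stand in for the real game data.

```python
import pandas as pd

# Hypothetical end-of-season rows and a one-row championship table
season_end = pd.DataFrame({"season": [2019, 2019],
                           "main_team": ["Team A", "Team B"]})
title_games = pd.DataFrame({"season": [2019],
                            "main_team": ["Team A"],
                            "championship_appearance": [1]})

# A left join keeps every team; teams with no title game get NaN, which becomes 0
flagged = season_end.merge(title_games, how="left", on=["season", "main_team"])
flagged["championship_appearance"] = flagged["championship_appearance"].fillna(0).astype(int)
print(flagged)
```

Only Team A ends up flagged with a 1; Team B is kept with a 0 rather than being dropped.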

In [8]:
talent_df = pd.read_csv('../data/team_talent.csv')
new_df = pd.merge(new_df, talent_df,  how='left', left_on=['season','main_team'], right_on = ['year','school'])

In [9]:
scatter = alt.Chart(new_df).mark_circle(size = 60, opacity = .90).encode(
    alt.X('talent:Q', scale=alt.Scale(domain=[400, 1100]), title = 'Team Composite Talent Rating'),
    alt.Y('team_postgame_elo:Q', scale=alt.Scale(domain=[800, 2500]), title = 'Regular Season Success (ELO Rating)'),
    color = 'Championship Game Appearance:N',
    tooltip = ['year_team']).properties(
    width = 500, height = 500, title = 'Team Talent Rating vs Regular Season Success (ELO)')

elo_line = alt.Chart(new_df).mark_rule(color="#e60634", size = 3, strokeDash=[9,3], opacity = .85).encode(
    y = 'mean(team_postgame_elo)')

talent_line = alt.Chart(new_df).mark_rule(color="#e60634", size = 3, strokeDash=[9,3], opacity = .85).encode(
    x = 'mean(talent)')

scatter + elo_line + talent_line

In [10]:
scatter = alt.Chart(new_df).mark_image(size = 1, width = 25, height = 25).encode(
    alt.X('talent:Q', scale=alt.Scale(domain=[400, 1100]), title = 'Team Composite Talent Rating'),
    alt.Y('team_postgame_elo:Q', scale=alt.Scale(domain=[800, 2500]), title = 'End of Regular Season ELO Rating'),
    tooltip = ['season', 'talent', 'team_postgame_elo'],
    url = 'logos').facet(facet = 'season:O', columns = 3)

scatter

### Begin draft analysis

##### Identify relationship between recruiting rank and draft stock

In [11]:
draft = pd.read_csv('../data/draft.csv')
recruits = pd.read_csv('../data/recruits.csv')

In [12]:
# Sometimes there are two players with the same name - this will create duplicates
# About 2k of the 39k records are duplicated due to shared names.

merged_df = pd.merge(left = recruits, right = draft, how = 'left', left_on = 'name', right_on = 'Player')

In [13]:
# Check whether two strings share at least 4 sequential characters.
# Returns 1 if the school a player committed to and the school they were drafted from share such a sequence, else 0.
def has_common_sequence(str1, str2, min_seq_length=4):
    if pd.isna(str1) or pd.isna(str2):
        return 0

    for i in range(len(str1) - min_seq_length + 1):
        sequence = str1[i:i + min_seq_length]
        if sequence in str2:
            return 1
    return 0

# Apply the function to the DataFrame
merged_df['CommonSequence'] = merged_df.apply(lambda row: has_common_sequence(row['committed_to'], row['College/Univ']), axis=1)

merged_df['RN'] = merged_df.sort_values(['name','CommonSequence'], ascending=[True,False]) \
                           .groupby(['name', 'rating']) \
                           .cumcount() + 1

col = ['name', 'rating', 'stars', 'committed_to', 'athlete_id', 'Rnd', 'Pick', 'Player', 'draft_year', 'College/Univ', 'CommonSequence', 'RN']
merged_df = merged_df[col]
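
To see how the `RN` ranking prefers the row with the matching school when a name is shared, here is a small sketch on made-up rows. The player names, ratings, and schools are hypothetical.

```python
import pandas as pd

# Two joined rows for the same recruit name, only one with a matching school
toy = pd.DataFrame({
    "name": ["John Smith", "John Smith", "Al Jones"],
    "rating": [0.91, 0.91, 0.85],
    "College/Univ": ["Utah", "Ohio State", "Iowa"],
    "CommonSequence": [0, 1, 1],
})

# Sort so matching rows come first within each name, then number rows per (name, rating)
toy["RN"] = (
    toy.sort_values(["name", "CommonSequence"], ascending=[True, False])
       .groupby(["name", "rating"])
       .cumcount() + 1
)

# Keeping RN == 1 retains the matching-school row for the duplicated name
kept = toy[toy["RN"] == 1]
print(kept[["name", "College/Univ", "RN"]])
```

The assignment works because pandas aligns the ranked values back to the original rows by index, so sorting inside the expression does not reorder `toy` itself.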

In [14]:
# Remove duplicates

# Copy after filtering so the new column is added to an independent DataFrame
merged_df = merged_df[merged_df['RN'] == 1].copy()
merged_df['is_drafted'] = np.where(merged_df['Rnd'].isna(), 0.0, 1.0)

In [15]:
merged_df['rating_round'] = merged_df['rating'].round(3)

draft_likelihood = merged_df.groupby(by = ['stars', 'rating_round']).agg({'is_drafted': 'mean', 'name': 'count'}).reset_index()

rename_dict = {'stars': 'HS Recruiting Stars', 
               'rating_round': 'HS Recruiting Rating',
               'is_drafted': '% of Players Drafted',
               'name': 'count'}

draft_likelihood = draft_likelihood.rename(rename_dict, axis = 1)

In [16]:
alt.Chart(draft_likelihood).mark_circle(size = 60).encode(
    x = alt.X('HS Recruiting Rating', scale=alt.Scale(domain=[.6, 1]), title = 'Recruiting Rating (rounded to 3 digits)'),
    y = alt.Y('% of Players Drafted', title = '% of Players Who Get Drafted', scale=alt.Scale(domain=[0, 1])),
    color = 'HS Recruiting Stars:N').properties(
    width = 600, height = 400, title = 'High School Recruit Ranking vs Draft Likelihood')