## SIADS Milestone I: CFB Analysis

#### ELO Exploration

1) How reliable is the ELO metric? Has it historically done a good job predicting winners of CFB games?
2) What is the distribution of ELO scores? Does it match what we see in research papers?



In [1]:
# Uncomment and run the line below if the cfbd library isn't already installed
# !pip install cfbd

import cfbd
import numpy as np
import pandas as pd
import altair as alt
import warnings

pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')


#### Historically, has ELO served as a good predictor of game outcomes? Is it a reliable metric?

In [2]:
# Read in games data. Keep games for Power 5 teams from Week 7 onward
df = pd.read_csv('../data/games_manipulated.csv')
power_5_conf = ['Pac-12', 'Big 12', 'ACC', 'SEC', 'Big Ten']
df = df[df['team_conference'].isin(power_5_conf)]
# .copy() so that columns added later do not write into a view of df
df_mid_season = df[df['week'] >= 7].copy()

In [3]:
# df_elo = pd.merge(left = df, right = df_teams, left_on = 'team_id', right_on = 'id')
col = ['season','week', 'point_differential', 'team_pregame_elo', 'opponent_pregame_elo', 'win_flag']
df_mid_season = df_mid_season[col]

df_mid_season['pre_game_elo_diff'] = df_mid_season['team_pregame_elo'] - df_mid_season['opponent_pregame_elo']
df_mid_season['pre_game_elo_diff_rounded'] = df_mid_season['pre_game_elo_diff'].round(decimals=-1)

a = df_mid_season.groupby(by = 'pre_game_elo_diff_rounded').agg({'win_flag':'mean'}).reset_index()
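
One direct way to put a number on "has ELO done a good job predicting winners" is the share of games won by the team with the higher pre-game rating. The sketch below is illustrative only: it assumes `win_flag` is 1 for a win and 0 for a loss, the helper name `favorite_win_rate` is hypothetical, and the rows in `toy` are made-up values rather than real game data.

```python
import pandas as pd

def favorite_win_rate(games: pd.DataFrame) -> float:
    """Share of games won by the team with the higher pre-game Elo rating."""
    # Drop games with identical ratings, since neither side is the favorite
    decided = games[games['pre_game_elo_diff'] != 0]
    # A "hit" is a game where the sign of the Elo gap agrees with the result
    favorite_won = (decided['pre_game_elo_diff'] > 0) == (decided['win_flag'] == 1)
    return float(favorite_won.mean())

# Toy rows for illustration only (not real game data)
toy = pd.DataFrame({
    'pre_game_elo_diff': [120, -80, 40, -200, 0],
    'win_flag':          [1,    0,   0,   1,   1],
})
print(favorite_win_rate(toy))
```

On the real data this would be called as `favorite_win_rate(df_mid_season)`. If the dataset holds one row per team per game, each game is counted from both sides, which leaves the rate unchanged.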

In [4]:
title_params = {
      "text": "Pre-Game Elo Rating Difference vs. Win Probability", 
      "subtitle": "Power 5 team games from Week 7 onward, 2013 - 2023",
    }

base = alt.Chart(a).mark_circle(opacity=0.4, color = '#00274C').encode(
    alt.X('pre_game_elo_diff_rounded:Q', title = 'Difference in pre-game Elo Rating (rounded to nearest 10)'),
    alt.Y('win_flag:Q', title = 'Probability of Winning')
)

# Line is formed using altair's "LOcally Estimated Scatterplot Smoothing" (LOESS)
base + base.transform_loess('pre_game_elo_diff_rounded', 'win_flag').mark_line(size=4, color = '#FFCB05').properties(
    title = title_params,
    height = 500, width = 500)
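
For reference, the smoothed empirical curve can be compared against the win probability implied by the standard Elo formula, $1 / (1 + 10^{-d/400})$, where $d$ is the rating difference. The sketch below assumes the conventional scale of 400; whether the ratings in this dataset were built on that scale is an assumption, and `elo_expected_win_prob` is a hypothetical helper name.

```python
import numpy as np

def elo_expected_win_prob(elo_diff, scale=400.0):
    """Standard Elo expectation: 1 / (1 + 10 ** (-diff / scale))."""
    elo_diff = np.asarray(elo_diff, dtype=float)
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / scale))

# Evaluate the theoretical curve at a few rating differences
diffs = np.array([-400, -100, 0, 100, 400])
print(elo_expected_win_prob(diffs).round(3))
```

If the observed win rates per rounded Elo difference track this curve closely, that would suggest the ratings are reasonably well calibrated; systematic gaps would point to a different scale or to miscalibration.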

#### What is the distribution of ELO scores? Does it match the distribution reported by other sources?

In [5]:
# Distribution of pre-game ELO ratings, binned to the nearest 100
df_mid_season['team_pregame_elo_round'] = df_mid_season['team_pregame_elo'].round(decimals=-2)
# rename needs columns=; without it pandas tries to rename index labels and the column stays 'season'
b = (
    df_mid_season.groupby(by = 'team_pregame_elo_round')
    .agg({'season': 'count'})
    .reset_index()
    .rename(columns = {'season': 'count'})
)

alt.Chart(b).mark_area(opacity = .8).encode(
    alt.X('team_pregame_elo_round:Q', title = 'Pre-game ELO Rating (Rounded to Nearest 100)'),
    alt.Y('count:Q', title = 'Number of Games')
).properties(title = 'Distribution of ELO Rating Before Each Game', height = 300, width = 550)
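
To compare the distribution against published descriptions, a few summary statistics are easier to check than the shape of an area chart. The following is a minimal sketch: `summarize_elo` is a hypothetical helper, and `toy_ratings` holds made-up values used only to show the output format.

```python
import pandas as pd

def summarize_elo(ratings: pd.Series) -> pd.Series:
    """Count, center, spread, and skewness of a series of Elo ratings."""
    return pd.Series({
        'n': ratings.count(),
        'mean': ratings.mean(),
        'median': ratings.median(),
        'std': ratings.std(),
        'skew': ratings.skew(),
    })

# Toy ratings for illustration only (not real data)
toy_ratings = pd.Series([1300, 1450, 1500, 1550, 1700])
print(summarize_elo(toy_ratings))
```

On the real data this would be called as `summarize_elo(df_mid_season['team_pregame_elo'])`. A mean close to the median and a skew near zero would be consistent with a roughly symmetric distribution.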