## Introduction
The machine learning problem we are trying to solve is predicting the College Football Playoff (CFP) Committee's top 25 team rankings. This matters because the CFP Committee's rankings are riddled with controversy year after year, and teams are frequently frustrated with where they land. The rankings carry real weight: they determine how prestigious a bowl game a team plays in, as well as who gets to compete for a national championship. There is very little public information on what the committee considers when it ranks teams, so our model could give teams insight into which statistics the committee appears to value most when deciding which teams should be ranked higher than others. Teams could then emphasize, say, a strong passing offense or a positive turnover margin if the model shows that teams who perform well in those categories tend to be ranked well by the committee.

We obtained the dataset from Kaggle at the following link: https://shorturl.at/glT68

It holds over 140 statistical categories (our features) for all FBS teams (the best 130 or so teams in the country) for every year the College Football Playoff Committee has existed, from 2014 to the present. Examples of features in our dataset are total points scored, offensive yards per play, sacks, and many more. We have over 800 records in our dataset.



## Data Cleaning

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# import sklearn

In [2]:
first_year = 14 # the first dataset is from 2014
num_years = 9 # the datasets go from 2014-2022
df_dict = {} # holds individual dataframes for each year

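Load every dataset into a dataframe and map each year to its corresponding dataframe. The loops below stop two years short of 2022 because the 2021 and 2022 files use different column names, which makes the later merge fail.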

In [3]:
for year in range(first_year, first_year + num_years - 2):
    df = pd.read_csv("cfb" + str(year) + ".csv")
    df_dict[year] = df

There are two Miami universities in our datasets, one in Florida and one in Ohio. We replace the parenthesized location with a plain suffix so that it does not interfere with a later step that also parses information out of parentheses.

In [4]:
for year in range(first_year, first_year + num_years - 2):
    df = df_dict[year]
    # raw strings keep '\(' from being read as an invalid string escape
    df['Team'] = df['Team'].str.replace(r'Miami \(FL\)', 'Miami FL', regex=True)
    df['Team'] = df['Team'].str.replace(r'Miami \(OH\)', 'Miami OH', regex=True)

The conference a team is in plays a big role in how the committee ranks it. Most of a team's games are against teams in the same conference, and some conferences have historically had higher-performing teams. As a result, teams in these more elite conferences often have harder schedules, so it is critical to take a team's conference into consideration when ranking it.

In our datasets, the conference is not its own feature; it is attached to the feature for the team's name. Hence, we parse the conference from the team's name and add a column for the conference the team is in.

In [5]:
for year in range(first_year, first_year + num_years - 2):
    df = df_dict[year]  
    df['conference'] = df['Team'].apply(lambda x: x[x.find('(') + 1: x.find(')')] if '(' in x and ')' in x else x)

After we parse the conference from the team name feature and create the conference feature, we can remove the conference from the team name.

In [6]:
for year in range(first_year, first_year + num_years - 2):
    df = df_dict[year]
    df['Team'] = df['Team'].apply(lambda x: x[:x.find('(')] if '(' in x else x)
    # strip only trailing whitespace, so names that had no '(' are not cut short
    df['Team'] = df['Team'].str.rstrip()
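
As a hedged alternative, both parsing steps could be done in one pass with a regular expression. The sketch below runs on a small made-up sample (`sample` and `parsed` are illustrative names, not part of the notebook) and assumes raw names look like `Name (Conference)`. Unlike the cells above, a name without parentheses gets an empty conference here rather than a copy of the team name.

```python
import pandas as pd

# Made-up sample in the assumed "Name (Conference)" layout
sample = pd.DataFrame({"Team": ["Akron (MAC)", "Miami FL (ACC)", "Notre Dame"]})

# Group "Team": the name; optional group "conference": text inside the trailing parentheses
pattern = r"^(?P<Team>.*?)\s*(?:\((?P<conference>[^)]*)\))?\s*$"
parsed = sample["Team"].str.extract(pattern)

# Rows without parentheses have no conference match; use an empty string for them
parsed["conference"] = parsed["conference"].fillna("")

print(parsed)
```

Because the pattern only consumes whitespace that is actually present, there is no need for a separate step that trims the last character.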

Our class label is whether the team was ranked in the top 25 for the given year, so we use the lists below to add a column that holds a 1 if the team was ranked in the top 25 that year and a 0 if not.

In [7]:
# lists of the CFP Committee's top 25 ranking each year

top25_14 = ['Alabama', 'Oregon', 'Florida St.', 'Ohio St.', 'Baylor',
            'TCU', 'Mississippi St.', 'Michigan St.', 'Ole Miss', 'Arizona',
            'Kansas St.', 'Georgia Tech', 'Georgia', 'UCLA', 'Arizona St.',
            'Missouri', 'Clemson', 'Wisconsin', 'Auburn', 'Boise St.',
            'Louisville', 'Utah', 'LSU', 'Southern California', 'Minnesota'] # top 25 from 2014

top25_15 = ['Clemson', 'Alabama', 'Michigan St.', 'Oklahoma', 'Iowa',
            'Stanford', 'Ohio St.', 'Notre Dame', 'Florida St.', 'North Carolina',
            'TCU', 'Ole Miss', 'Northwestern', 'Michigan', 'Oregon',
            'Oklahoma St.', 'Baylor', 'Houston', 'Florida', 'LSU',
            'Navy', 'Utah', 'Tennessee', 'Temple', 'Southern California'] # top 25 from 2015

top25_16 = ['Alabama', 'Clemson', 'Ohio St.', 'Washington', 'Penn St.',
            'Michigan', 'Oklahoma', 'Wisconsin', 'Southern California', 'Colorado',
            'Florida St.', 'Oklahoma St.', 'Louisville', 'Auburn', 'Western Mich.',
            'West Virginia', 'Florida', 'Stanford', 'Utah', 'LSU',
            'Tennessee', 'Virginia Tech', 'Pittsburgh', 'Temple', 'Navy'] # top 25 from 2016

top25_17 = ['Clemson', 'Oklahoma', 'Georgia', 'Alabama', 'Ohio St.',
            'Wisconsin', 'Auburn', 'Southern California', 'Penn St.', 'Miami FL',
            'Washington', 'UCF', 'Stanford', 'Notre Dame', 'TCU',
            'Michigan St.', 'LSU', 'Washington St.', 'Oklahoma St.', 'Memphis',
            'Northwestern', 'Virginia Tech', 'Mississippi St.', 'NC State', 'Boise St.'] # top 25 from 2017

top25_18 = ['Alabama', 'Clemson', 'Notre Dame', 'Oklahoma', 'Georgia',
            'Ohio St.', 'Michigan', 'UCF', 'Washington', 'Florida',
            'LSU', 'Penn St.', 'Washington St.', 'Kentucky', 'Texas',
            'West Virginia', 'Utah', 'Mississippi St.', 'Texas A&M', 'Syracuse',
            'Fresno St.', 'Northwestern', 'Missouri', 'Iowa St.', 'Boise St.'] # top 25 from 2018

top25_19 = ['LSU', 'Ohio St.', 'Clemson', 'Oklahoma', 'Georgia',
            'Oregon', 'Baylor', 'Wisconsin', 'Florida', 'Penn St.',
            'Utah', 'Auburn', 'Alabama', 'Michigan', 'Notre Dame',
            'Iowa', 'Memphis', 'Minnesota', 'Boise St.', 'Appalachian St.',
            'Cincinnati', 'Southern California', 'Navy', 'Virginia', 'Oklahoma St.'] # top 25 from 2019

top25_20 = ['Alabama', 'Clemson', 'Ohio St.', 'Notre Dame', 'Texas A&M',
            'Oklahoma', 'Florida', 'Cincinnati', 'Georgia', 'Iowa St.',
            'Indiana', 'Coastal Carolina', 'North Carolina', 'Northwestern', 'Iowa',
            'BYU', 'Southern California', 'Miami FL', 'Louisiana', 'Texas',
            'Oklahoma St.', 'San Jose St.', 'NC State', 'Tulsa', 'Oregon'] # top 25 from 2020

top25_21 = ['Alabama', 'Michigan', 'Georgia', 'Cincinnati', 'Notre Dame',
            'Ohio St.', 'Baylor', 'Ole Miss', 'Oklahoma St.', 'Michigan St.',
            'Utah', 'Pittsburgh', 'BYU', 'Oregon', 'Iowa',
            'Oklahoma', 'Wake Forest', 'NC State', 'Clemson', 'Houston',
            'Arkansas', 'Kentucky', 'Louisiana', 'San Diego St.', 'Texas A&M'] # top 25 from 2021

top25_22 = ['Georgia', 'Michigan', 'TCU', 'Ohio St.', 'Alabama',
            'Tennessee', 'Clemson', 'Utah', 'Kansas St.', 'Southern California',
            'Penn St.', 'Washington', 'Florida St.', 'Oregon St.', 'Oregon',
            'Tulane', 'LSU', 'UCLA', 'South Carolina', 'Texas',
            'Notre Dame', 'Mississippi St.', 'NC State', 'Troy', 'UTSA'] # top 25 from 2022

top25_dict = {14: top25_14, 15: top25_15, 16: top25_16, 17: top25_17, 18: top25_18,
              19: top25_19, 20: top25_20, 21: top25_21, 22: top25_22} # dictionary to be able to grab rankings for a desired year

for year in range(first_year, first_year + num_years - 2):
    df = df_dict[year]
    top25 = top25_dict[year]
    df['top 25'] = df['Team'].apply(lambda team: 1 if team in top25 else 0)
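
Labeling by name only works if the spellings in the ranking lists match the dataset exactly. The following is a minimal sketch, on made-up data, of a vectorized version of the labeling step together with a sanity check for ranked names that matched no row. The names `teams`, `ranked`, and `unmatched` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Made-up sample; the notebook itself uses df_dict[year] and top25_dict[year]
teams = pd.DataFrame({"Team": ["Alabama", "Akron", "Ohio St."]})
ranked = ["Alabama", "Ohio St.", "Oregon"]

# Vectorized membership test, cast to 0/1
teams["top 25"] = teams["Team"].isin(ranked).astype(int)

# Ranked names with no matching row suggest a spelling mismatch
unmatched = sorted(set(ranked) - set(teams["Team"]))

print(teams)
print("unmatched:", unmatched)
```

If `unmatched` is non-empty for a real season, the label column for that year would contain fewer than 25 ones.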

Every individual dataframe is now almost completely cleaned and properly structured, so we can merge them all into one big dataframe.

In [8]:
key_set = sorted(list(df_dict.keys()))
while len(key_set) > 1:
  df0 = df_dict[key_set[0]]
  df1 = df_dict[key_set[1]]
  shared_columns = df0.columns.intersection(df1.columns)
  df0 = df0.loc[:, shared_columns]
  df1 = df1.loc[:, shared_columns]
  df0 = pd.concat([df0, df1], ignore_index=True)
  df_dict[key_set[0]] = df0
  key_set.remove(key_set[1])
df = df_dict[14] # our one big dataframe after merging
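
The same shared-columns merge can likely be expressed with a single `pd.concat` call using `join="inner"`, which keeps only the columns present in every frame. The sketch below uses made-up frames and also tags each row with its season. The `year` column is an assumption here: a later cell drops a column with that name, but no visible cell creates it.

```python
import pandas as pd

# Made-up per-year frames with partly different columns
frames = {
    14: pd.DataFrame({"Team": ["Akron"], "Win": [5], "Old.Stat": [1]}),
    15: pd.DataFrame({"Team": ["Akron"], "Win": [8], "New.Stat": [2]}),
}

# Tag each frame with its season (assumed column name: 'year')
tagged = [frame.assign(year=2000 + yy) for yy, frame in sorted(frames.items())]

# join="inner" keeps only the columns shared by all frames
merged = pd.concat(tagged, join="inner", ignore_index=True)

print(merged)
```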

There were a couple of individual records with missing or wrong conferences, so we manually set each affected team's conference to the correct one.

In [9]:
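def fix_conference_issues(df):
  # normalize the label used for independent teams
  df.loc[df['conference'] == 'Independent', 'conference'] = 'FBS Independent'
  # fill in conferences that were empty in the source data
  df.loc[(df['conference'] == '') & (df['Team'] == 'Ole Miss'), 'conference'] = 'SEC'
  df.loc[(df['conference'] == '') & (df['Team'] == 'Pittsburgh'), 'conference'] = 'ACC'

fix_conference_issues(df) # apply the fixes in place to the merged dataframe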

In [None]:
# def clean_data():
#   first_year = 14
#   num_years = 9
#   #for year in range(first_year, first_year + num_years):
#   for year in range(first_year, first_year + num_years - 2): # - 2 cuz 21 and 22 have different column names so merge fails
#     load_dataset(year)
#     adjust_miami_teams(df_dict[year])
#     add_conference_column(df_dict[year])
#     remove_conference_from_team_name(df_dict[year])
#     add_top25_column(year, df_dict[year])
#     add_year_column(year, df_dict[year])
#   merge_dataframes()
#   fix_conference_issues(df_dict[14])
#   cleaned_df = df_dict[14]
#   return cleaned_df

In [10]:
df

Unnamed: 0,Team,Games,Win,Loss,Off.Rank,Off.Plays,Off.Yards,Off.Yards.Play,Off.TDs,Off.Yards.per.Game,...,Fumbles.Recovered,Opponents.Intercepted,Turnovers.Gain,Fumbles.Lost,Interceptions.Thrown.y,Turnovers.Lost,Turnover.Margin,Avg.Turnover.Margin.per.Game,conference,top 25
0,Akron,12,5,7,88,891,4479,5.03,32,373.3,...,11,13,24,12,14,26,-2,-0.17,MAC,0
1,Alabama,14,12,2,17,1018,6783,6.66,67,484.5,...,9,11,20,12,10,22,-2,-0.14,SEC,1
2,Arizona,14,10,4,25,1139,6491,5.70,55,463.6,...,13,13,26,8,10,18,8,0.57,Pac-12,1
3,Arizona St.,13,10,3,34,975,5750,5.90,54,442.3,...,13,14,27,4,9,13,14,1.08,Pac-12,1
4,Arkansas,13,7,6,60,916,5278,5.76,52,406.0,...,12,12,24,11,6,17,7,0.54,SEC,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
854,West Virginia,9,5,4,42,690,3804,5.51,28,422.7,...,2,10,12,6,3,9,3,0.33,Big 12,0
855,Western Ky.,11,5,6,120,699,3200,4.58,21,290.9,...,4,5,9,10,2,12,-3,-0.27,C-USA,0
856,Western Mich.,6,4,2,15,392,2878,7.34,32,479.7,...,1,2,3,4,2,6,-3,-0.50,MAC,0
857,Wisconsin,6,3,3,93,431,2153,5.00,17,358.8,...,4,4,8,5,6,11,-3,-0.50,Big Ten,0


In [None]:
df['Win-Ratio'] = df['Win'] / df['Games']
df.columns.values

In [None]:
conference_ranked_counts = df.groupby('conference')['top 25'].sum()
conference_ranked_counts.plot(kind='bar')
plt.xlabel('Conference')
plt.ylabel('Number of Ranked Teams')
plt.title('Number of Ranked Teams in Each Conference')
plt.show()

In [None]:
avg_turnover_margin = df.groupby('top 25')['Turnover.Margin'].mean()
avg_turnover_margin.plot(kind='bar')
plt.xlabel('Top 25 (0 = unranked, 1 = ranked)')
plt.ylabel('Average Turnover Margin')
plt.title('Average Turnover Margin for Ranked vs. Unranked Teams')
plt.show()

In [None]:
avg_yards_per_play = df.groupby('top 25')['Off.Yards.Play'].mean()
avg_yards_per_play.plot(kind='bar')
plt.xlabel('Top 25 (0 = unranked, 1 = ranked)')
plt.ylabel('Average Offensive Yards per Play')
plt.title('Average Offensive Yards per Play for Ranked vs. Unranked Teams')
plt.show()

In [None]:
one_hot = pd.get_dummies(df['conference'])
one_hot = one_hot.astype('int')
#df = pd.get_dummies(df, columns = ['conference'])
df = df.drop('conference', axis = 1)
df = df.join(one_hot) # join returns a new dataframe, so assign it back
df
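
The commented-out line in the cell above points at a shorter route: `pd.get_dummies` can encode a column in place of the original when given `columns=`. A minimal sketch on made-up data (the `sample` and `encoded` names are illustrative):

```python
import pandas as pd

# Made-up sample with a categorical conference column
sample = pd.DataFrame({"Team": ["Akron", "Alabama"], "conference": ["MAC", "SEC"]})

# Replaces 'conference' with one 0/1 indicator column per conference value
encoded = pd.get_dummies(sample, columns=["conference"], dtype=int)

print(encoded.columns.tolist())
```

Note that this variant prefixes the new columns with `conference_`, whereas joining a separately built one-hot frame uses the bare conference names.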

In [None]:
# errors='ignore' skips 'year' if that column was never added to the dataframe
no_string_df = df.drop(['Team', 'year'], axis=1, errors='ignore')

def convert_time(str_time):
    minutes, seconds = str_time.split(':')
    return (int(minutes) * 60) + int(seconds)

no_string_df['Time.of.Possession'] = no_string_df['Time.of.Possession'].apply(convert_time)
no_string_df['Average.Time.of.Possession.per.Game'] = no_string_df['Average.Time.of.Possession.per.Game'].apply(convert_time)
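
To make the conversion concrete, here is a small vectorized sketch of the same minutes-to-seconds arithmetic. The time strings are made up and are assumed to follow the `MM:SS` layout that `convert_time` expects.

```python
import pandas as pd

# Made-up "MM:SS" values; minutes may exceed 59 for season totals
times = pd.Series(["30:15", "28:05", "412:30"])

# Split once into a minutes column (0) and a seconds column (1), then combine
parts = times.str.split(":", expand=True).astype(int)
seconds = parts[0] * 60 + parts[1]

print(seconds.tolist())  # [1815, 1685, 24750]
```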

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
scaler = StandardScaler()
pca = PCA()
scaled_df = scaler.fit_transform(no_string_df)
scaled_df = pd.DataFrame(scaled_df, columns=no_string_df.columns)
scaled_df
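
The cell above imports `PCA` but only performs the scaling step. As a hedged sketch of how the two might fit together, the example below standardizes the features of a synthetic table and projects them onto two principal components. The data, the column names `f1` to `f4`, and the choice of two components are all assumptions for illustration. The label column is set aside first so that it is neither scaled nor fed into the projection.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the numeric feature table plus the 0/1 label
data = pd.DataFrame(rng.normal(size=(50, 4)), columns=["f1", "f2", "f3", "f4"])
data["top 25"] = rng.integers(0, 2, size=50)

features = data.drop(columns=["top 25"])  # scale features only, not the label
labels = data["top 25"]

# Standardize each feature to zero mean and unit variance
scaled = StandardScaler().fit_transform(features)

# Project onto 2 components (an arbitrary choice for illustration)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

print(components.shape)
print(pca.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute reports how much of the total variance each retained component captures, which can guide how many components to keep.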

In [None]:

# features we are using for the prediction
#test_df = df.drop(['Team','top 25', 'year'],axis=1)

# data we are trying to predict
#result = df['top 25']