## Introduction
The machine learning problem we are trying to solve is predicting the College Football Playoff (CFP) Committee's top 25 team rankings. This matters because the CFP Committee's rankings are riddled with controversy year after year, as teams are frequently frustrated with where they are ranked. These rankings matter heavily, as they determine how prestigious a bowl game a team plays in, as well as who gets to compete for a national championship. There is very little information on what the committee considers when it ranks the teams, so our model could be used to give teams insight into which statistics the committee might value most when it decides which teams should be ranked higher than others. Teams could then place emphasis on, say, having a good passing offense or a positive turnover margin if the model shows that teams who perform well in those categories are ranked well by the committee.

We got our dataset from Kaggle at the following link: https://shorturl.at/glT68

It holds data on over 140 statistical categories (our features) for all FBS teams (the best 130 or so teams in the country) for every year the College Football Playoff Committee has existed, which is from 2014 to the present. Examples of features in our dataset are total points scored, offensive yards per play, and sacks, among many others. We have over 800 records in our dataset.



## Data Cleaning

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# import sklearn

In [3]:
first_year = 14 # the first dataset is from 2014
num_years = 9 # the datasets go from 2014-2022
df_dict = {} # holds individual dataframes for each year

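Load every dataset into a dataframe and map each year to its corresponding dataframe. Note that the loop stops at first_year + num_years - 2, so it reads the files for 2014 through 2020.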

In [4]:
for year in range(first_year, first_year + num_years - 2):
    df = pd.read_csv("cfb" + str(year) + ".csv")
    df_dict[year] = df

Add the year each dataset is from as a column in its dataframe. We will drop the year feature eventually, but it will be helpful for feature engineering after we merge the dataframes into one big dataframe.

In [5]:
for year in range(first_year, first_year + num_years - 2):
    df = df_dict[year]
    df['year'] = year

Merge the individual dataframes into one big dataframe, keeping only the columns that every year has in common.

In [6]:
key_set = sorted(df_dict.keys())
while len(key_set) > 1:
    df0 = df_dict[key_set[0]]
    df1 = df_dict[key_set[1]]
    # keep only the columns present in both dataframes
    shared_columns = df0.columns.intersection(df1.columns)
    df0 = df0.loc[:, shared_columns]
    df1 = df1.loc[:, shared_columns]
    # stack the next year onto the accumulated dataframe
    df0 = pd.concat([df0, df1], ignore_index=True)
    df_dict[key_set[0]] = df0
    key_set.remove(key_set[1])
df = df_dict[first_year] # our one big dataframe after merging
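
The same merge can likely be expressed in a single call. The sketch below uses two small made-up frames (the column names OldStat and NewStat are hypothetical, not from the Kaggle files) to show how pd.concat with join="inner" keeps only the columns shared by every frame, which is what the loop above does one pair at a time.

```python
import pandas as pd

# Toy stand-ins for two yearly frames (hypothetical data, not the real CSVs).
frames = {
    14: pd.DataFrame({"Team": ["A", "B"], "Win": [10, 4], "OldStat": [1, 2]}),
    15: pd.DataFrame({"Team": ["A", "C"], "Win": [8, 6], "NewStat": [3, 4]}),
}

# join="inner" keeps only the columns present in every frame,
# and ignore_index=True renumbers the rows starting from 0.
merged = pd.concat(
    [frames[year] for year in sorted(frames)],
    join="inner",
    ignore_index=True,
)

print(sorted(merged.columns))
print(len(merged))
```

With the real data, the list passed to pd.concat would be the values of df_dict in year order.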

There are two Miami universities in our datasets, one in Florida and another in Ohio, and each is distinguished by a location in parentheses. We strip the parentheses from the location so they do not interfere with a later step that parses other information out of parentheses.

In [7]:
# raw strings keep the escaped parentheses from being read as invalid Python escape sequences
df['Team'] = df['Team'].str.replace(r'Miami \(FL\)', 'Miami FL', regex=True)
df['Team'] = df['Team'].str.replace(r'Miami \(OH\)', 'Miami OH', regex=True)

## Feature Engineering

The number of games teams play can vary based on whether they qualified for their conference championship, and during the Covid-19 pandemic, some conferences played a drastically different number of games than others. For these reasons, it makes more sense to use win percentage rather than the raw number of wins in our model.

In [8]:
df['win perct'] = df['Win'] / df['Games']

After creating the win percentage feature, we can drop the number of wins, losses, and games played from our dataframe.

In [9]:
df = df.drop(['Win', 'Loss', 'Games'], axis=1)

The conference a team is in plays a big role in how the committee ranks it. Most of the games teams play are against teams in the same conference, and some conferences have historically had higher performing teams. As a result, teams in these more elite conferences often have harder schedules, so it is critical to take a team's conference into consideration when ranking it.

In our datasets, the conference is not its own feature. It is attached, in parentheses, to the feature for the team's name. Hence, we parse the conference from the team's name and add a column for the conference the team is in.

In [10]:
df['conference'] = df['Team'].apply(lambda x: x[x.find('(') + 1: x.find(')')] if '(' in x and ')' in x else x)

After we parse the conference from the team name feature and create the conference feature, we can remove the conference from the team name.

In [11]:
df['Team'] = df['Team'].apply(lambda x: x[:x.find('(')] if '(' in x else x)
# remove the space left before the parenthesis without cutting into names that had none
df['Team'] = df['Team'].str.rstrip()
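
As a rough alternative, the team name and the conference can be split in one pass with a regular expression. This is only a sketch on three made-up strings in the "Name (Conference)" layout described above. The pattern is our own assumption about that layout, and names without parentheses would come back as missing values rather than unchanged.

```python
import pandas as pd

# Toy team strings in the "Name (Conference)" layout (hypothetical sample).
raw = pd.Series(["Akron (MAC)", "Miami FL (ACC)", "Texas A&M (SEC)"])

# The first group captures the name before the parentheses,
# the second captures the text inside them.
parts = raw.str.extract(r"^(?P<Team>.*?)\s*\((?P<conference>[^)]*)\)\s*$")

print(parts)
```

The named groups become the column names of the result, so parts can be assigned back to the dataframe directly.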

A few individual records had missing or wrong conferences, so we manually set each of those teams' conference to the correct one.

In [12]:
df.loc[df['conference'] == 'Independent', 'conference'] = 'FBS Independent'
df.loc[(df['conference'] == '') & (df['Team'] == 'Ole Miss'), 'conference'] = 'SEC'
df.loc[(df['conference'] == '') & (df['Team'] == 'Pittsburgh'), 'conference'] = 'ACC'

Our class label is whether the team was ranked in the top 25 for the given year, so we use the lists below to add a column that holds a 1 if the team was ranked in the top 25 that year and a 0 if not.

In [13]:
# lists of the CFP Committee's top 25 ranking each year

top25_14 = ['Alabama', 'Oregon', 'Florida St.', 'Ohio St.', 'Baylor',
            'TCU', 'Mississippi St.', 'Michigan St.', 'Ole Miss', 'Arizona',
            'Kansas St.', 'Georgia Tech', 'Georgia', 'UCLA', 'Arizona St.',
            'Missouri', 'Clemson', 'Wisconsin', 'Auburn', 'Boise St.',
            'Louisville', 'Utah', 'LSU', 'Southern California', 'Minnesota'] # top 25 from 2014

top25_15 = ['Clemson', 'Alabama', 'Michigan St.', 'Oklahoma', 'Iowa',
            'Stanford', 'Ohio St.', 'Notre Dame', 'Florida St.', 'North Carolina',
            'TCU', 'Ole Miss', 'Northwestern', 'Michigan', 'Oregon',
            'Oklahoma St.', 'Baylor', 'Houston', 'Florida', 'LSU',
            'Navy', 'Utah', 'Tennessee', 'Temple', 'Southern California'] # top 25 from 2015

top25_16 = ['Alabama', 'Clemson', 'Ohio St.', 'Washington', 'Penn St.',
            'Michigan', 'Oklahoma', 'Wisconsin', 'Southern California', 'Colorado',
            'Florida St.', 'Oklahoma St.', 'Louisville', 'Auburn', 'Western Mich.',
            'West Virginia', 'Florida', 'Stanford', 'Utah', 'LSU',
            'Tennessee', 'Virginia Tech', 'Pittsburgh', 'Temple', 'Navy'] # top 25 from 2016

top25_17 = ['Clemson', 'Oklahoma', 'Georgia', 'Alabama', 'Ohio St.',
            'Wisconsin', 'Auburn', 'Southern California', 'Penn St.', 'Miami FL',
            'Washington', 'UCF', 'Stanford', 'Notre Dame', 'TCU',
            'Michigan St.', 'LSU', 'Washington St.', 'Oklahoma St.', 'Memphis',
            'Northwestern', 'Virginia Tech', 'Mississippi St.', 'NC State', 'Boise St.'] # top 25 from 2017

top25_18 = ['Alabama', 'Clemson', 'Notre Dame', 'Oklahoma', 'Georgia',
            'Ohio St.', 'Michigan', 'UCF', 'Washington', 'Florida',
            'LSU', 'Penn St.', 'Washington St.', 'Kentucky', 'Texas',
            'West Virginia', 'Utah', 'Mississippi St.', 'Texas A&M', 'Syracuse',
            'Fresno St.', 'Northwestern', 'Missouri', 'Iowa St.', 'Boise St.'] # top 25 from 2018

top25_19 = ['LSU', 'Ohio St.', 'Clemson', 'Oklahoma', 'Georgia',
            'Oregon', 'Baylor', 'Wisconsin', 'Florida', 'Penn St.',
            'Utah', 'Auburn', 'Alabama', 'Michigan', 'Notre Dame',
            'Iowa', 'Memphis', 'Minnesota', 'Boise St.', 'Appalachian St.',
            'Cincinnati', 'Southern California', 'Navy', 'Virginia', 'Oklahoma St.'] # top 25 from 2019

top25_20 = ['Alabama', 'Clemson', 'Ohio St.', 'Notre Dame', 'Texas A&M',
            'Oklahoma', 'Florida', 'Cincinnati', 'Georgia', 'Iowa St.',
            'Indiana', 'Coastal Carolina', 'North Carolina', 'Northwestern', 'Iowa',
            'BYU', 'Southern California', 'Miami FL', 'Louisiana', 'Texas',
            'Oklahoma St.', 'San Jose St.', 'NC State', 'Tulsa', 'Oregon'] # top 25 from 2020

top25_21 = ['Alabama', 'Michigan', 'Georgia', 'Cincinnati', 'Notre Dame',
            'Ohio St.', 'Baylor', 'Ole Miss', 'Oklahoma St.', 'Michigan St.',
            'Utah', 'Pittsburgh', 'BYU', 'Oregon', 'Iowa',
            'Oklahoma', 'Wake Forest', 'NC State', 'Clemson', 'Houston',
            'Arkansas', 'Kentucky', 'Louisiana', 'San Diego St.', 'Texas A&M'] # top 25 from 2021

top25_22 = ['Georgia', 'Michigan', 'TCU', 'Ohio St.', 'Alabama',
            'Tennessee', 'Clemson', 'Utah', 'Kansas St.', 'Southern California',
            'Penn St.', 'Washington', 'Florida St.', 'Oregon St.', 'Oregon',
            'Tulane', 'LSU', 'UCLA', 'South Carolina', 'Texas',
            'Notre Dame', 'Mississippi St.', 'NC State', 'Troy', 'UTSA'] # top 25 from 2022

top25_dict = {14: top25_14, 15: top25_15, 16: top25_16, 17: top25_17, 18: top25_18,
              19: top25_19, 20: top25_20, 21: top25_21, 22: top25_22} # dictionary to be able to grab rankings for a desired year

df['top 25'] = df.apply(lambda row: 1 if row['Team'] in top25_dict.get(row['year'], []) else 0, axis=1)
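
Because the label depends on exact string matches between the ranking lists and the Team column, a spelling mismatch would silently produce a 0. The sketch below, on a tiny made-up frame and ranking dictionary (both hypothetical), shows one way to build the label from a set of (year, team) pairs and then count the ranked teams per year as a sanity check.

```python
import pandas as pd

# Hypothetical miniature versions of top25_dict and df.
rankings = {14: ["Alabama", "Oregon"], 15: ["Clemson"]}
teams = pd.DataFrame({
    "Team": ["Alabama", "Akron", "Clemson", "Oregon"],
    "year": [14, 14, 15, 15],
})

# A set of (year, team) pairs makes each membership test a constant-time lookup.
ranked_pairs = {(year, team) for year, names in rankings.items() for team in names}
teams["top 25"] = [int(pair in ranked_pairs) for pair in zip(teams["year"], teams["Team"])]

# Sanity check: how many teams were labeled as ranked in each year.
ranked_per_year = teams.groupby("year")["top 25"].sum()
print(ranked_per_year)
```

On the real data, a year whose count falls short of 25 would point to a name in the ranking list that does not match the Team column.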

After we create the feature that says if a team was in the top 25 or not for the given year, we can drop the year column.

In [14]:
df = df.drop('year', axis=1)

In [15]:
df

| Unnamed: 0 | Team | Off.Rank | Off.Plays | Off.Yards | Off.Yards.Play | Off.TDs | Off.Yards.per.Game | Def.Rank | Def.Plays | Yards.Allowed | ... | Opponents.Intercepted | Turnovers.Gain | Fumbles.Lost | Interceptions.Thrown.y | Turnovers.Lost | Turnover.Margin | Avg.Turnover.Margin.per.Game | win perct | conference | top 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Akron | 88 | 891 | 4479 | 5.03 | 32 | 373.3 | 44 | 859 | 4453 | ... | 13 | 24 | 12 | 14 | 26 | -2 | -0.17 | 0.416667 | MAC | 0 |
| 1 | Alabama | 17 | 1018 | 6783 | 6.66 | 67 | 484.5 | 12 | 945 | 4598 | ... | 11 | 20 | 12 | 10 | 22 | -2 | -0.14 | 0.857143 | SEC | 1 |
| 2 | Arizona | 25 | 1139 | 6491 | 5.70 | 55 | 463.6 | 103 | 1115 | 6314 | ... | 13 | 26 | 8 | 10 | 18 | 8 | 0.57 | 0.714286 | Pac-12 | 1 |
| 3 | Arizona St. | 34 | 975 | 5750 | 5.90 | 54 | 442.3 | 81 | 964 | 5422 | ... | 14 | 27 | 4 | 9 | 13 | 14 | 1.08 | 0.769231 | Pac-12 | 1 |
| 4 | Arkansas | 60 | 916 | 5278 | 5.76 | 52 | 406.0 | 10 | 821 | 4204 | ... | 12 | 24 | 11 | 6 | 17 | 7 | 0.54 | 0.538462 | SEC | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 854 | West Virginia | 42 | 690 | 3804 | 5.51 | 28 | 422.7 | 5 | 561 | 2675 | ... | 10 | 12 | 6 | 3 | 9 | 3 | 0.33 | 0.555556 | Big 12 | 0 |
| 855 | Western Ky. | 120 | 699 | 3200 | 4.58 | 21 | 290.9 | 21 | 741 | 3700 | ... | 5 | 9 | 10 | 2 | 12 | -3 | -0.27 | 0.454545 | C-USA | 0 |
| 856 | Western Mich. | 15 | 392 | 2878 | 7.34 | 32 | 479.7 | 60 | 428 | 2398 | ... | 2 | 3 | 4 | 2 | 6 | -3 | -0.50 | 0.666667 | MAC | 0 |
| 857 | Wisconsin | 93 | 431 | 2153 | 5.00 | 17 | 358.8 | 1 | 332 | 1581 | ... | 4 | 8 | 5 | 6 | 11 | -3 | -0.50 | 0.500000 | Big Ten | 0 |


## Data Exploration

In [None]:
conference_ranked_counts = df.groupby('conference')['top 25'].sum()
conference_ranked_counts.plot(kind='bar')
plt.xlabel('Conference')
plt.ylabel('Number of Ranked Teams')
plt.title('Number of Ranked Teams in Each Conference')
plt.show()

In [None]:
avg_turnover_margin = df.groupby('top 25')['Turnover.Margin'].mean()
avg_turnover_margin.plot(kind='bar')
plt.xlabel('Top 25 (0 = unranked, 1 = ranked)')
plt.ylabel('Average Turnover Margin')
plt.title('Average Turnover Margin for Ranked vs Unranked Teams')
plt.show()

In [None]:
avg_yards_per_play = df.groupby('top 25')['Off.Yards.Play'].mean()
avg_yards_per_play.plot(kind='bar')
plt.xlabel('Top 25 (0 = unranked, 1 = ranked)')
plt.ylabel('Average Offensive Yards per Play')
plt.title('Average Offensive Yards per Play for Ranked vs Unranked Teams')
plt.show()

## Modeling

In [None]:
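# one-hot encode the conference so the models can use it as a numeric feature
one_hot = pd.get_dummies(df['conference'])
one_hot = one_hot.astype('int')
df = df.drop('conference', axis = 1)
# join returns a new dataframe, so assign the result to keep the one-hot columns
df = df.join(one_hot)
df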
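
For reference, pandas can also do the encode, drop, and join steps in one call. The sketch below runs on a three-row made-up frame and assumes a pandas version where get_dummies accepts the columns and dtype arguments. Note that this variant prefixes each new column with "conference_".

```python
import pandas as pd

# Hypothetical three-row sample with the same two columns as the real dataframe.
toy = pd.DataFrame({
    "Team": ["Akron", "Alabama", "Arizona"],
    "conference": ["MAC", "SEC", "Pac-12"],
})

# Replaces the 'conference' column with one 0/1 indicator column per conference.
encoded = pd.get_dummies(toy, columns=["conference"], dtype=int)

print(encoded.columns.tolist())
```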

In [None]:
no_string_df = df.drop(['Team'], axis=1)

def convert_time(str_time):
    # convert a "minutes:seconds" string into a total number of seconds
    minutes, seconds = str_time.split(':')
    return (int(minutes) * 60) + int(seconds)

no_string_df['Time.of.Possession'] = no_string_df['Time.of.Possession'].apply(convert_time)
no_string_df['Average.Time.of.Possession.per.Game'] = no_string_df['Average.Time.of.Possession.per.Game'].apply(convert_time)

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
scaler = StandardScaler()
pca = PCA()
scaled_df = scaler.fit_transform(no_string_df)
scaled_df = pd.DataFrame(scaled_df, columns=no_string_df.columns)
scaled_df

## Results

In [None]:
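features = no_string_df.drop('top 25', axis=1)
labels = no_string_df['top 25'].values.ravel()
scaled_features = scaled_df.drop('top 25', axis=1)
# the 'top 25' column in scaled_df was standardized along with the features,
# so reuse the original 0/1 labels instead of the scaled copy
scaled_labels = labels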

Decision Trees with different parameters

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

clf = DecisionTreeClassifier()
param_grid = {'max_depth': [5, 10, 15, 20], 'min_samples_leaf': [5, 10, 15, 20], 'max_features': [5, 10, 15]}
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
nested_scores = cross_val_score(grid_search, features, labels, cv=5)
print("Average accuracy:", nested_scores.mean() * 100, "%")
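
When reading the accuracy figures in this section, it may help to keep a majority-class baseline in mind, since only 25 of the roughly 130 FBS teams are ranked each year. The sketch below uses made-up labels (7 seasons of 125 teams with 25 ranked each, an assumption rather than the real data) and scikit-learn's DummyClassifier, which always predicts the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical labels: 7 seasons of 125 teams with 25 ranked each (not the real data).
toy_labels = np.tile(np.r_[np.ones(25, dtype=int), np.zeros(100, dtype=int)], 7)
# The dummy model ignores the features, so a single column of zeros is enough.
toy_features = np.zeros((len(toy_labels), 1))

# Always predicts the majority class, which here is "unranked".
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, toy_features, toy_labels, cv=5, scoring="accuracy")

print("Baseline accuracy:", baseline_scores.mean() * 100, "%")
```

Under these assumptions the baseline lands at about 80%, so a model's accuracy is most informative when compared against that level.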

Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
scores = cross_val_score(gnb, features, labels, cv=10, scoring='accuracy')
print("Average accuracy:", scores.mean() * 100, "%")

KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold

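# NOTE: pl was not defined anywhere in the notebook. The step names 'pca' and 'knn'
# are taken from param_grid below; the leading scaler step is an assumption.
pl = Pipeline([('scaler', StandardScaler()), ('pca', PCA()), ('knn', KNeighborsClassifier())])
param_grid = {'pca__n_components': list(range(5, 19)), 'knn__n_neighbors': list(range(1, 25))}

gr_sch = GridSearchCV(pl, param_grid, cv = 5)
gr_sch.fit(features, labels)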
outer_loop = KFold(n_splits = 5, shuffle = True, random_state = 21)
nested_scores = cross_val_score(gr_sch, features, labels, cv = outer_loop)
accuracy = nested_scores.mean()
print("Nested CV accuracy:", accuracy * 100, "%")

SVM

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

svm_pl = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
prm_grd = { 'svc__C': [0.1, 1, 10, 100], 'svc__kernel': ['linear', 'rbf', 'poly']}
gr_sch = GridSearchCV(svm_pl, prm_grd, cv = 5)
predictions = cross_val_predict(gr_sch, features, labels, cv = outer_loop)
accuracy = cross_val_score(gr_sch, features, labels, cv = 5, scoring = 'accuracy').mean()
print("Accuracy:", accuracy * 100, "%")

Neural Nets

In [None]:
from sklearn.utils._testing import ignore_warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

@ignore_warnings(category=ConvergenceWarning)
def run_nn():
    pl = Pipeline([('scaler', StandardScaler()), ('mlp', MLPClassifier(max_iter = 1000, random_state = 21))])
    prm_grd = {'mlp__hidden_layer_sizes': [(30,), (40,), (50,), (60,)], 'mlp__activation': ['logistic', 'tanh', 'relu']}
    grid_search = GridSearchCV(pl, prm_grd, cv=5)
    accuracy = cross_val_score(grid_search, features, labels, cv = 5, scoring = 'accuracy').mean()
    print("Accuracy:", accuracy * 100, "%")
  

run_nn()

Random Forest Ensemble

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf_clf = RandomForestClassifier(random_state = 21)
prm_grd = {'n_estimators': [50, 100, 150]}
grd_sch = GridSearchCV(rf_clf, prm_grd, cv = 5)
predictions = cross_val_predict(grd_sch, features, labels, cv = 5)
# per-class precision and recall for the random forest's cross-validated predictions
cls_rpt = classification_report(labels, predictions)
print(cls_rpt)