In [2]:
import numpy as np
import pandas as pd


I have already downloaded a CSV from collegefootballdata.com/data and saved it as 'cfb_dta.csv'.

In [4]:
df = pd.read_csv('cfb_dta.csv',sep='\t')
df.head()

Unnamed: 0,id,season,week,season_type,start_date,neutral_site,conference_game,attendance,venue_id,venue,...,home_line_scores[1],home_line_scores[2],home_line_scores[3],away_team,away_conference,away_points,away_line_scores[0],away_line_scores[1],away_line_scores[2],away_line_scores[3]
0,401110723,2019,1,regular,2019-08-24T23:00:00.000Z,True,False,,4013,Camping World Stadium,...,0.0,10.0,7.0,Miami,ACC,20.0,3.0,10.0,0.0,7.0
1,401114164,2019,1,regular,2019-08-25T02:30:00.000Z,False,False,,3610,Aloha Stadium,...,14.0,7.0,10.0,Arizona,Pac-12,38.0,0.0,21.0,14.0,3.0
2,401117854,2019,1,regular,2019-08-29T23:00:00.000Z,False,False,,3854,Nippert Stadium,...,3.0,7.0,7.0,UCLA,Pac-12,14.0,0.0,7.0,7.0,0.0
3,401119254,2019,1,regular,2019-08-29T23:00:00.000Z,False,False,,3700,Doyt Perry Stadium,...,17.0,7.0,9.0,Morgan State,,3.0,0.0,3.0,0.0,0.0
4,401117855,2019,1,regular,2019-08-29T23:00:00.000Z,False,False,,3892,Rentschler Field,...,3.0,14.0,0.0,Wagner,,21.0,0.0,0.0,14.0,7.0


Now we will get a list of unique teams.
We will only take the list from the home teams because the away teams include many FCS teams, and the data for those teams are too sparse to support a good analysis.


In [6]:
# Map each home team name to an integer index.
teams = np.asarray(df.home_team.unique())
team_dict = {team: i for i, team in enumerate(teams)}

Now we will generate our observations by taking the home points, away points, home team, and away team from each game in our dataset.

In [10]:
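obs = []
for i in range(df.shape[0]):
    samp = df.iloc[i]
    # Skip opponents that never appear as a home team and games with no recorded score.
    if samp.away_team not in team_dict or np.isnan(samp.away_points):
        continue
    # Each row: [home_points, away_points, home_idx, away_idx]
    obs.append([samp.home_points, samp.away_points, team_dict[samp.home_team], team_dict[samp.away_team]])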
obs = np.array(obs)
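
The same filtering can be sketched without an explicit loop. This is a minimal illustration on a made-up four-game frame: the `toy` data and the helper name `build_obs` are assumptions for illustration, not part of the dataset. It keeps only games whose away team also appears as a home team and whose score is recorded.

```python
import numpy as np
import pandas as pd

def build_obs(games):
    """Return rows of [home_points, away_points, home_idx, away_idx]."""
    teams = games.home_team.unique()
    team_dict = {team: i for i, team in enumerate(teams)}
    # Keep games where the away team is a known home team and the score exists.
    keep = games.away_team.isin(teams) & games.away_points.notna()
    kept = games[keep]
    obs = np.column_stack([
        kept.home_points.to_numpy(dtype=float),
        kept.away_points.to_numpy(dtype=float),
        kept.home_team.map(team_dict).to_numpy(dtype=float),
        kept.away_team.map(team_dict).to_numpy(dtype=float),
    ])
    return obs, team_dict

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "home_team":   ["A", "B", "C", "A"],
    "away_team":   ["B", "C", "Z", "C"],        # "Z" never plays at home
    "home_points": [21.0, 14.0, 35.0, np.nan],
    "away_points": [17.0, 28.0, 3.0, np.nan],   # last game has no score yet
})
toy_obs, toy_dict = build_obs(toy)
```

Only the first two toy games survive the filter: the third has an away team with no home games, and the fourth has no score.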

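We now define a function that gives the relative likelihood for all the games given the choice variable (params) and the observations (obs). The parameters are the offensive and defensive ratings for each team, and these are the values we need to optimize to find our final rankings.
The relative likelihood is obtained from the PDF of a normal distribution, where each score is modeled as normal with mean equal to the team's offensive rating divided by the opponent's defensive rating. We take the log (to avoid vanishing probabilities), then drop the normalizing constant since it has no impact on the optimal values. This leaves us with just the exponent of the PDF. Since we took the log, the result is summed rather than multiplied. The function returns this sum with the sign flipped, so maximizing the likelihood becomes minimizing the function; the divisor 98 corresponds to a standard deviation of 7 points ($2 \cdot 7^2 = 98$).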
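
The step from the normal PDF to the summed squares can be written out explicitly. The notation here is introduced for illustration: $s_{g,1}, s_{g,2}$ are the observed home and away scores in game $g$, $h$ and $a$ index that game's home and away teams, $N$ is the number of games, and $\sigma$ is a common standard deviation assumed for every score.

$$\log \prod_{g} p(s_{g,1})\, p(s_{g,2}) = -\sum_{g} \left[ \frac{\left(\frac{O_h}{D_a} - s_{g,1}\right)^2}{2\sigma^2} + \frac{\left(\frac{O_a}{D_h} - s_{g,2}\right)^2}{2\sigma^2} \right] - 2N \log\left(\sigma\sqrt{2\pi}\right)$$

The last term does not depend on the ratings, so maximizing the log-likelihood is equivalent to minimizing

$$\sum_{g} \left[ \frac{\left(\frac{O_h}{D_a} - s_{g,1}\right)^2}{2\sigma^2} + \frac{\left(\frac{O_a}{D_h} - s_{g,2}\right)^2}{2\sigma^2} \right],$$

and with $\sigma = 7$ the denominator $2\sigma^2$ equals 98.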

In [7]:
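def prob_calc(params, obs):
    # params is flat: params[2*t] is team t's offense, params[2*t+1] is its defense.
    prob = 0
    # Squared error of both scores, each divided by 98 = 2 * 7**2.
    f = lambda o1,o2,d1,d2,s1,s2:((o1/d2-s1)**2)/98+((o2/d1-s2)**2)/98
    for i in obs:
        # i = [home_points, away_points, home_idx, away_idx]
        o1,o2,d1,d2 = params[int(i[2])*2], params[int(i[3])*2], params[int(i[2])*2+1], params[int(i[3])*2+1]
        prob += f(o1,o2,d1,d2,i[0],i[1])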
    return prob
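
For larger datasets the per-game loop can be replaced by array indexing. The following is a sketch of an equivalent vectorized objective, assuming the same flat parameter layout (offense in even slots, defense in odd slots); the name `prob_calc_vec` and the two-game demo values are hypothetical.

```python
import numpy as np

def prob_calc_vec(params, obs):
    """Vectorized sum of squared score errors divided by 98 (2 * 7**2)."""
    params = np.asarray(params, dtype=float).ravel()
    home = obs[:, 2].astype(int)
    away = obs[:, 3].astype(int)
    off, dfn = params[0::2], params[1::2]  # even slots: offense, odd slots: defense
    home_err = off[home] / dfn[away] - obs[:, 0]
    away_err = off[away] / dfn[home] - obs[:, 1]
    return np.sum(home_err**2 + away_err**2) / 98

# Hypothetical example: team 0 hosts team 1, then team 1 hosts team 0.
demo_obs = np.array([[28.0, 14.0, 0, 1],
                     [10.0, 20.0, 1, 0]])
demo_params = np.array([40.0, 2.0, 30.0, 1.5])  # [o0, d0, o1, d1]
demo_value = prob_calc_vec(demo_params, demo_obs)
```

Because it takes the same arguments as the loop version, a function like this could be passed to the optimizer in its place.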
        

Now we will import minimize from scipy.optimize to find the optimal values for our parameters.

In [8]:
from scipy.optimize import minimize

In [11]:
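# Recent SciPy versions require x0 to be one-dimensional: [offense, defense] for each team in turn.
results = minimize(prob_calc, x0=np.random.uniform(2, 19, 2*len(teams)), args=(obs,)).x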
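
To see the optimizer call in isolation, here is a small sketch on a hypothetical three-team league with noise-free scores generated from made-up ratings. The names `toy_objective`, `true_params`, and the fixed starting point are assumptions for illustration. Whether the fit recovers the generating ordering depends on convergence, so the snippet makes no claim about the fitted values.

```python
import numpy as np
from scipy.optimize import minimize

def toy_objective(params, obs):
    """Same squared-error form as prob_calc, written for a toy league."""
    total = 0.0
    for home_pts, away_pts, h, a in obs:
        h, a = int(h), int(a)
        total += (params[2*h] / params[2*a + 1] - home_pts) ** 2 / 98
        total += (params[2*a] / params[2*h + 1] - away_pts) ** 2 / 98
    return total

# Hypothetical "true" ratings [o0, d0, o1, d1, o2, d2] used to generate scores.
true_params = np.array([60.0, 2.0, 40.0, 2.0, 30.0, 1.5])
pairs = [(h, a) for h in range(3) for a in range(3) if h != a]
toy_obs = np.array([[true_params[2*h] / true_params[2*a + 1],
                     true_params[2*a] / true_params[2*h + 1], h, a]
                    for h, a in pairs])

x0 = np.tile([20.0, 2.0], 3)               # flat 1-D start: [o, d] per team
start_value = toy_objective(x0, toy_obs)
fit = minimize(toy_objective, x0=x0, args=(toy_obs,))
toy_products = fit.x[0::2] * fit.x[1::2]   # offense * defense per team
```

Comparing `fit.fun` with `start_value` and inspecting `fit.success` gives a quick sanity check before trusting the ratings.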

We find the ranking metric by taking the product of each team's offensive rating and its defensive rating. This seems like the best metric under our assumptions because a team will beat another team in expectation (that is, the expected value of the team's score is higher than the expected value of its opponent's score) if and only if the product of its offensive and defensive ratings is higher than the opponent's.

Proof. Let $S_1, S_2$ be the scores for teams one and two respectively, $O_1, O_2$ the offensive ratings for each team, and $D_1, D_2$ the defensive ratings, with $D_1, D_2 > 0$. Then:

$E[S_1]=\frac{O_1}{D_2}, E[S_2] = \frac{O_2}{D_1}$ (by assumption)

Then we have:

$E[S_1]>E[S_2] \iff \frac{O_1}{D_2} > \frac{O_2}{D_1}$

Multiplying both sides by $D_1 D_2 > 0$ preserves the inequality, so this holds

$\iff$

$O_1 D_1 > O_2 D_2$

Q.E.D.

Please note that the optimal values are not unique: if every offensive and defensive rating is scaled by the same nonzero constant, the constant cancels in each fraction and the mean of the normal distribution is unchanged. Each product of offensive and defensive ratings is then scaled by the square of that constant, so the ratios between teams' products stay the same. Thus, if you re-run this code, the values may be different, but the ordering and ratios should be the same, provided the optimizer converges to an equivalent optimum.
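
The scaling argument can be checked numerically. This sketch uses hypothetical ratings for three teams and an arbitrary constant `c`; the helper `predicted_scores` is introduced here only for illustration.

```python
import numpy as np

def predicted_scores(params, home, away):
    """Expected (home, away) scores under E[S] = offense / opposing defense."""
    return (params[2*home] / params[2*away + 1],
            params[2*away] / params[2*home + 1])

# Hypothetical ratings laid out as [o0, d0, o1, d1, o2, d2].
params = np.array([60.0, 2.0, 40.0, 2.0, 30.0, 1.5])
c = 3.0
scaled = c * params

base_pred = predicted_scores(params, 0, 1)
scaled_pred = predicted_scores(scaled, 0, 1)   # same predictions: c cancels

products = params[0::2] * params[1::2]
scaled_products = scaled[0::2] * scaled[1::2]  # each product grows by c**2
```

Since every product is multiplied by the same factor `c**2`, sorting by the product gives the same order before and after scaling.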

In [21]:
my_ranks = np.array([results[2*i]*results[2*i+1] for i in range(len(teams))])

In [22]:
sorter = my_ranks.argsort()

In [23]:
final_rankings = teams[sorter][::-1]

# Final Rankings (As of Nov 1, 2019)

In [24]:
for idx, name in enumerate(final_rankings):
    print(idx+1,':\t   ',name, '-'*(25- len(name)), my_ranks[sorter][::-1][idx])

1 :	    Ohio State --------------- 210.35763118724307
2 :	    Penn State --------------- 111.52681551544967
3 :	    Wisconsin ---------------- 92.06981713301826
4 :	    Clemson ------------------ 85.11375703660192
5 :	    Georgia ------------------ 79.72622948916691
6 :	    Utah --------------------- 78.12207569790337
7 :	    Auburn ------------------- 73.09326000667845
8 :	    Iowa --------------------- 71.70470034311558
9 :	    LSU ---------------------- 71.32042505619873
10 :	    Alabama ------------------ 71.26379225334216
11 :	    Oregon ------------------- 61.377758152947905
12 :	    Michigan ----------------- 57.26440029881133
13 :	    Oklahoma ----------------- 56.59751776418706
14 :	    Iowa State --------------- 52.5657122319153
15 :	    Florida ------------------ 52.369073875116484
16 :	    Michigan State ----------- 48.33793731871939
17 :	    Cincinnati --------------- 47.43158099033406
18 :	    Baylor ------------------- 47.40310544774329
19 :	    Minnesota ---------------