In [1]:
import pandas as pd
import numpy as np 
import datetime
import math
import gc
import requests
gc.collect()

from tqdm import tqdm

# Load Processed Data

In [2]:
PATH = './output/processed.csv'
df = pd.read_csv(PATH)
print(len(df))

1927115


## Challenges

There are two main challenges when weighting an EPA model.  

First, once a team is up by 30 points or so, it tends to stop running efficient plays and just runs out the clock. Plays run when the score is out of hand therefore need some sort of penalty weight. I'll try to arrive at this penalty empirically.  
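
As a baseline to compare an empirical penalty against, here is a minimal sketch of a linear score-differential weight: 1.0 for a tied game, falling to 0.0 at the largest margin in the data. As I read the Yurko et al. paper, its weights have roughly this shape, but treat that as my assumption. The `score_diff` column name and the toy values are hypothetical and not taken from the processed data.

```python
import pandas as pd


def score_diff_weight(score_diff: pd.Series) -> pd.Series:
    """Linear down-weighting by absolute score margin.

    Returns 1.0 for the smallest observed margin and 0.0 for the largest.
    """
    abs_diff = score_diff.abs()
    span = abs_diff.max() - abs_diff.min()
    if span == 0:
        # Every play has the same margin, so there is nothing to penalize.
        return pd.Series(1.0, index=score_diff.index)
    return (abs_diff.max() - abs_diff) / span


# Toy plays; `score_diff` is a hypothetical column (offense score minus defense score).
plays = pd.DataFrame({"score_diff": [0, 7, -14, 30, -3]})
plays["weight"] = score_diff_weight(plays["score_diff"])
print(plays)
```

A linear ramp is only one option. If the data shows that teams play normally until the margin gets large, a weight that stays flat and then drops off would fit better.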

Second, the Ron Yurko et al. paper I'm referencing assigns weighting penalties when the next score is 4 or 5 drives in the future. Essentially, in that case the current drive says little about the eventual score, so it shouldn't count as much toward expected points. Again, I'll try to arrive at this penalty empirically.  
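
Any penalty of this kind first needs a "drives until next score" count for every drive. The sketch below shows one way to compute it, assuming drives are already in game order and carry a boolean scoring flag. The `scored` column and the toy frame are hypothetical; only `game_id` and `drive_id` appear in the real data.

```python
import pandas as pd


def drives_until_next_score(scored: pd.Series) -> pd.Series:
    """Count drives until the next scoring drive within one game.

    0 means the drive itself scores; NaN means nobody scores again in the game.
    """
    values = scored.to_numpy()
    result = [float("nan")] * len(values)
    next_idx = None
    # Walk backwards so each drive sees the nearest scoring drive at or after it.
    for i in range(len(values) - 1, -1, -1):
        if values[i]:
            next_idx = i
        if next_idx is not None:
            result[i] = next_idx - i
    return pd.Series(result, index=scored.index)


# Toy drive table; `scored` is a hypothetical indicator column.
drives = pd.DataFrame({
    "game_id":  [1, 1, 1, 1, 1, 2, 2, 2],
    "drive_id": [1, 2, 3, 4, 5, 1, 2, 3],
    "scored":   [False, False, True, False, False, True, False, True],
}).sort_values(["game_id", "drive_id"])

# Compute per game, then align back on the index.
parts = [drives_until_next_score(g) for _, g in drives.groupby("game_id")["scored"]]
drives["drives_till_score"] = pd.concat(parts)
print(drives)
```

This version counts across the whole game. Resetting the count at halftime would be a reasonable variation, since possession does not carry over.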

I'll start with the second challenge, and I'm going to approach it slightly differently. First, I'm going to take a small subset of the data and try to predict the number of drives until the next score. The expected value should be near zero in the opponent's red zone, and might max out (just a guess, at about 1.5) around a team's own 25. With multinomial logit, I can get a probability for zero, one, two, etc. drives until the next score. Then, using the rest of the data, I can group by drives until next score and predict the probability of each type of score that way.
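
To make the multinomial logit step concrete, here is a minimal sketch on synthetic data. It is not the model fit in this notebook: the single feature, the Poisson-generated target, and the cap at 3 drives are all assumptions made for illustration. It uses scikit-learn's `LogisticRegression`, which as far as I know fits a multinomial model for multi-class targets with its default solver in recent versions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Synthetic field position: yards from the opponent's goal line (1..99).
yards_to_goal = rng.integers(1, 100, size=n)

# Synthetic target (assumption): more drives until the next score when far
# from the end zone, capped at 3 so there are four classes.
drives_till_score = np.minimum(rng.poisson(yards_to_goal / 60.0), 3)

# Scale the feature to [0, 1] so the optimizer converges easily.
X = (yards_to_goal / 100.0).reshape(-1, 1)
model = LogisticRegression(max_iter=1000)
model.fit(X, drives_till_score)

# Class probabilities at the opponent's 10 and at a team's own 25.
probs = model.predict_proba(np.array([[0.10], [0.75]]))

# Expected drives until next score = sum over classes of class * probability.
expected = probs @ model.classes_
print(probs.round(3))
print(expected.round(3))
```

The expected value in the last step is the quantity guessed at above (near zero in the red zone, higher around a team's own 25), so it is a quick sanity check on a fitted model.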

In [3]:
gb = df.groupby(['game_id','drive_id'])['down'].count().reset_index()
# gb = gb.groupby(['down'])['drive_id'].count()
# gb = gb.sort_values(ascending=False)
# gb = gb.reset_index()
# gb = gb.rename(columns={'drive_id':'play_count'})
_max = gb.down.max()
print(_max)
print(len(gb))

36
296840


In [4]:
# just for fun/validation, wanted to look at the distribution of plays per drive
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

x = pd.Series(gb.down, name="play_count")
fig = plt.figure(figsize=[15,9])
# ax1 = sns.kdeplot(x, bw=0.14, label="play_count")
# histplot replaces the deprecated distplot; _max + 2 bin edges so the longest drive gets its own bin
ax = sns.histplot(x, bins=np.arange(0, _max + 2), stat="density")
ax.set(xlabel='play_count', ylabel='percentage of drives')
ax.set_title('Distribution of Play Count on 300,000 CFB Drives')
plt.show()

fig.savefig("./plots/play_counts.png")

<Figure size 1500x900 with 1 Axes>

In [5]:
# add indicator if it's a scoring drive
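
One way to build that indicator is straight from the `drive_result` labels. This is a minimal sketch under a rule I'm assuming rather than one taken from the notebook: a drive counts as ending in a score if its label ends with `TD` or is exactly `FG GOOD`. The helper name `is_scoring_drive` is hypothetical, and the rule deliberately leaves out the ambiguous bare `FG` label, which would need to be checked against the play text.

```python
import pandas as pd


def is_scoring_drive(drive_result: pd.Series) -> pd.Series:
    """Flag drives whose result label indicates points were scored.

    Assumed rule: label ends with 'TD' or equals 'FG GOOD'. Labels such as
    'INT TD' are scores by the defense, so this flags *any* score on the
    drive, not only scores by the offense.
    """
    label = drive_result.fillna("").str.strip().str.upper()
    return label.str.endswith("TD") | label.eq("FG GOOD")


# Toy example using labels that appear in the drive_result counts.
sample = pd.Series(["FG GOOD", "DOWNS", "INT TD", "FG MISSED", "END OF HALF", None])
flags = is_scoring_drive(sample)
print(flags.tolist())
```

Because defensive touchdowns are flagged too, the scoring team would have to be tracked separately before using this for anything signed, such as expected points for the offense.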

In [6]:
gb = df.groupby(['drive_result'])['down'].count()
gb

drive_result
BLOCKED FG                       186
BLOCKED FG (TD) TD                 4
BLOCKED PUNT                     256
BLOCKED PUNT TD                   28
DOWNS                          51298
DOWNS TD                          99
END OF 4TH QUARTER               234
END OF GAME                    33699
END OF GAME TD                    22
END OF HALF                    32496
END OF HALF TD                    75
FG                             91065
FG GOOD                       139204
FG GOOD TD                        31
FG MISSED                      53611
FG MISSED TD                     229
FG TD                             22
FUMBLE                         73713
FUMBLE RETURN TD                1485
FUMBLE TD                       2652
INCOMPLETE                       600
INT                            98958
INT RETURN TOUCH                 162
INT TD                          6608
KICKOFF                           81
KICKOFF RETURN TD                 34
LATERAL                  

In [8]:
for text in uncat.play_text:
    print(text)

Kevin Parks run for 1 yd to the Virg 38
Goff, Jared pass complete to Rubenzer, Luke for 11 yards to the COLO35, PENALTY CAL illegal forward pass (Rubenzer, Luke) 11 yards to the COLO35, NO PLAY.
Timeout CALIFORNIA, clock 00:01
Jared Goff pass complete to Trevor Davis for 9 yds to the Colo 46
Jared Goff pass incomplete to Trevor Davis
Goff, Jared pass incomplete, PENALTY COLO holding (Crawley, Ken) 10 yards to the CAL45, 1ST DOWN CAL, NO PLAY. for a 1ST down
Jared Goff pass incomplete
WEST VIRGINIA Penalty, unsportsmanlike conduct (Kj Dillon) to the WVirg 13
Tyler Rogers pass incomplete to Jordan Bergstrom
Larry Rose III run for 11 yds to the UTEP 43 for a 1ST down
Tyler Rogers pass incomplete to Teldrick Morgan
Tyler Rogers pass incomplete to Tyrain Taylor
Tyler Rogers pass complete to Jordan Bergstrom for 12 yds to the NMxSt 46 for a 1ST down
Tyler Rogers run for 9 yds to the NMxSt 34
Timeout NEW MEXICO ST, clock 00:15
Tyler Rogers pass incomplete to Joshua Bowen
Tyler Rogers pass inc