# NFL Big Data Bowl 

The data for this notebook comes from the Kaggle NFL Big Data Bowl competition: https://www.kaggle.com/c/nfl-big-data-bowl-2020/data.

# Goal

1. Create a predictive model to determine how many yards an NFL team will gain after a handoff.

2. Determine which defensive lineup performs best against each offensive formation, and build a classifier to predict the offensive formation that the offense will use.

In [11]:
# Package imports and reading dataset
import pandas as pd
import numpy as np

# Suppress library warnings (enable once the notebook is finished)
import warnings
warnings.filterwarnings('ignore')

from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score, recall_score


# Read in dataset
nfl = pd.read_csv("train.csv")

nfl.head()



,GameId,PlayId,Team,X,Y,S,A,Dis,Orientation,Dir,...,Week,Stadium,Location,StadiumType,Turf,GameWeather,Temperature,Humidity,WindSpeed,WindDirection
0,2017090700,20170907000118,away,73.91,34.84,1.69,1.13,0.4,81.99,177.18,...,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8,SW
1,2017090700,20170907000118,away,74.67,32.64,0.42,1.35,0.01,27.61,198.7,...,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8,SW
2,2017090700,20170907000118,away,74.0,33.2,1.22,0.59,0.31,3.01,202.73,...,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8,SW
3,2017090700,20170907000118,away,71.46,27.7,0.42,0.54,0.02,359.77,105.64,...,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8,SW
4,2017090700,20170907000118,away,69.32,35.42,1.82,2.43,0.16,12.63,164.31,...,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8,SW


# Data Preprocessing

In [5]:
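def strtoseconds(txt):
    """Convert a clock string with three ':'-separated fields to seconds."""
    parts = txt.split(':')
    # Fields are treated as minutes, seconds, and a trailing sub-second part
    return int(parts[0])*60 + int(parts[1]) + int(parts[2])/60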

In [6]:
# Converts the GameClock string (three ':'-separated fields) to seconds
nfl['GameClock'] = nfl['GameClock'].apply(strtoseconds)

In [7]:
# Converts player height from a feet-inches string (e.g. '6-2') to total inches (integer)
nfl['PlayerHeight'] = nfl['PlayerHeight'].apply(lambda x: 12*int(x.split('-')[0])+int(x.split('-')[1]))

In [8]:
# One row per play: keep only the row that belongs to the rusher
# (.copy() so that later column assignments do not act on a view of nfl)
nfl_sample = nfl.loc[nfl['NflId'] == nfl['NflIdRusher']].copy()

In [14]:
nfl.WindSpeed.unique()

array([8.0, 6.0, 10.0, 9.0, 11.0, nan, 7.0, 5.0, 2.0, 12.0, 1, 3, 4, 13,
       '10', '5', '6', '4', '8', '0', 'SSW', 14.0, 0.0, 15.0, 17.0, 18.0,
       16.0, '11-17', '16', '14', '13', '12', '23', '7', '9', '3', '17',
       '14-23', '1', '13 MPH', 24.0, '15', '12-22', '2', '4 MPh',
       '15 gusts up to 25', '11', '10MPH', '10mph', '22', 'E', '7 MPH',
       'Calm', '6 mph', '19', 'SE', '20', '10-20', '12mph'], dtype=object)
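
The values above mix floats, numeric strings, strings with units, ranges, and a few wind directions. One possible way to normalise them is sketched below. `parse_wind_speed` is a hypothetical helper, not part of the original notebook, and its choices (averaging a range, taking the first number of a gust description, mapping 'Calm' to 0, treating direction-only entries as missing) are assumptions.

```python
import math
import re

def parse_wind_speed(value):
    """Best-effort conversion of a raw WindSpeed entry to a float (mph)."""
    if value is None:
        return math.nan
    if isinstance(value, (int, float)):
        return float(value)  # already numeric (NaN stays NaN)
    text = str(value).strip().lower()
    if text == "calm":
        return 0.0  # assumption: 'Calm' means no wind
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if not numbers:
        return math.nan  # e.g. 'SSW', 'E', 'SE' carry no speed
    # A pure range such as '11-17' is replaced by its midpoint
    if re.fullmatch(r"\d+(?:\.\d+)?\s*-\s*\d+(?:\.\d+)?", text):
        return sum(numbers) / 2
    # Otherwise keep the first number ('13 MPH', '15 gusts up to 25')
    return numbers[0]

examples = ["13 MPH", "11-17", "15 gusts up to 25", "Calm", "SSW", 8.0]
cleaned = [parse_wind_speed(v) for v in examples]
print(cleaned)
```

In the notebook this could be applied as `nfl['WindSpeed'].apply(parse_wind_speed)`, if wind speed were to be used as a feature.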

## Preliminary Objective Results

#### To predict how many yards a team will gain on a given rushing play, I first looked for features that influence yardage by computing each numeric column's correlation with Yards. I also kept only the rusher's row for each play so that each play appears once; otherwise the duplicated play-level rows would distort the K-Nearest Neighbors model.

In [111]:
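# numeric_only=True (pandas >= 1.5) skips text columns, which newer pandas no longer drops silently
corr_with_yard = nfl_sample[nfl_sample.columns[1:]].corr(numeric_only=True)['Yards']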
display(corr_with_yard.abs().sort_values(ascending = False))

Yards                     1.000000
A                         0.158868
DefendersInTheBox         0.109212
S                         0.084061
Distance                  0.071936
YardLine                  0.064551
Dis                       0.052288
PlayId                    0.031466
Season                    0.031350
PlayerWeight              0.027306
Down                      0.021672
NflId                     0.018997
NflIdRusher               0.018997
GameClock                 0.016365
Temperature               0.012367
VisitorScoreBeforePlay    0.009084
Quarter                   0.006733
PlayerHeight              0.006587
Dir                       0.005075
HomeScoreBeforePlay       0.004706
Humidity                  0.003901
Orientation               0.003824
Week                      0.003265
Y                         0.002796
JerseyNumber              0.002264
X                         0.000257
Name: Yards, dtype: float64

In [112]:
nfl_sample["DefendersInBox_vs_Dist"] = nfl_sample['DefendersInTheBox'] / nfl_sample['Distance']


features = ["DefendersInTheBox",
           "Down",
           "OffenseFormation",
           "Distance",
           "A",
           "X",
           "Y",
           "Dis",
           "DefendersInBox_vs_Dist"]

nfl_sample.dropna(inplace = True)

train = nfl_sample.sample(frac=.5)
val = nfl_sample.drop(train.index)


X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")

y_train = train["Yards"]
y_val = val["Yards"]

# convert categorical variables to dummy variables
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_val = vec.transform(X_val_dict)

# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_val_sc = scaler.transform(X_val)

model = KNeighborsRegressor(n_neighbors=10)
model.fit(X_train_sc, y_train)
y_val_pred = model.predict(X_val_sc)
mae = (y_val - y_val_pred).abs().mean()
mae

4.012609764664549

##### The MAE is around 4.0, varying slightly with the random train/validation split. That is about the same as the average number of yards gained per rushing play (about 4.24), so the model's typical error is as large as a typical gain.

In [113]:
print("MAE of our K-Nearest Neighbor Model vs Average number of yards gained in a rushing play")
mae, nfl_sample['Yards'].mean()

MAE of our K-Nearest Neighbor Model vs Average number of yards gained in a rushing play


(4.012609764664549, 4.235803770050345)
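
Comparing the MAE with the mean of `Yards` gives a rough sense of scale, but a more direct yardstick is the MAE of a model that always predicts a single constant. A minimal sketch follows, using made-up yardage values rather than `train.csv`; `constant_baseline_mae` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np

def constant_baseline_mae(y_train, y_val):
    """MAE on y_val when always predicting the mean or the median of y_train."""
    y_train = np.asarray(y_train, dtype=float)
    y_val = np.asarray(y_val, dtype=float)
    mean_pred = y_train.mean()
    median_pred = np.median(y_train)
    return {
        "mean": float(np.abs(y_val - mean_pred).mean()),
        "median": float(np.abs(y_val - median_pred).mean()),
    }

# Toy yardage values, invented for illustration only
toy_train = [2, 3, 4, 1, 15, 0, 5, 3]
toy_val = [3, 2, 6, -1, 4, 20, 1, 3]
baseline = constant_baseline_mae(toy_train, toy_val)
print(baseline)
```

In the notebook this could be called with the regression `y_train` and `y_val` right after the split. A KNN MAE only shows real predictive value if it is clearly below these constant-prediction numbers.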

#### Let's try to find a better K-Value for our model

In [114]:
nfl_dict = nfl_sample[features].to_dict(orient = 'records')
nfl_y = nfl_sample['Yards']

vec = DictVectorizer(sparse=False)
scaler = StandardScaler()

def get_cv_error(k):
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mse = np.mean(-cross_val_score(
        pipeline, nfl_dict, nfl_y, 
        cv=5, scoring="neg_mean_squared_error"
    ))
    return mse

# This will take a bit of time
k_val = pd.Series(range(1, 51))
k_val_score = k_val.apply(get_cv_error)

# k_val_score has index 0..49 for k = 1..50, so shift the index of the lowest RMSE by 1
best_k = np.sqrt(k_val_score).idxmin() + 1
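
The search above selects k by cross-validated MSE, while the notebook reports MAE on the validation set. As an alternative sketch, scikit-learn's `GridSearchCV` can run the same kind of search and score directly on MAE. The data here is synthetic, and the names `X_toy`, `y_toy`, `pipe` and `search` are illustrative rather than taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the play features and Yards
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=200)

# Scaling lives inside the pipeline, so it is refit on each training fold
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsRegressor())])

search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": list(range(1, 21))},
    cv=5,
    scoring="neg_mean_absolute_error",  # scores are negated, higher is better
)
search.fit(X_toy, y_toy)

best_k_toy = search.best_params_["knn__n_neighbors"]
best_mae_toy = -search.best_score_
print(best_k_toy, best_mae_toy)
```

Scoring on MAE keeps the model-selection criterion consistent with the metric reported afterwards.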

After finding the best value of k between 1 and 50, let's re-run the K-Nearest Neighbors regressor.

In [115]:
model = KNeighborsRegressor(n_neighbors=best_k)
model.fit(X_train_sc, y_train)
y_val_pred = model.predict(X_val_sc)

MAE = (y_val - y_val_pred).abs().mean()
print("Mean absolute error of:", MAE)

Mean absolute error of: 3.835520430862914


#### Let's also test out new features

In [116]:
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
model = KNeighborsRegressor(n_neighbors=50)
pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])

In [117]:
nfl_y = nfl_sample['Yards']

In [118]:
features = ["DefendersInTheBox",
           "Down",
           "OffenseFormation",
           "Distance",
           "A",
           "X",
           "Y",
           "Dis",
           "DefendersInBox_vs_Dist"]
X_dict = nfl_sample[features].to_dict(orient="records")
np.mean(
    -cross_val_score(pipeline, X_dict, nfl_y, cv=10, scoring="neg_mean_squared_error")
)

40.620573086524985

In [119]:
# Function to test each individual feature to add to see which ones affect MSE the most
def feature_to_test(feat):
    features = ["DefendersInTheBox",
           "Down",
           "OffenseFormation",
           "Distance",
           "A",
           "X",
           "Y",
           "Dis",
           "DefendersInBox_vs_Dist"]
    features.append(feat)
    X_dict = nfl_sample[features].to_dict(orient="records")
    return np.mean(
        -cross_val_score(pipeline, X_dict, nfl_y, cv=10, scoring="neg_mean_squared_error")
    )

In [120]:
# Takes some time to test each feature
features_to_add = pd.Series(['GameClock', 'PlayerHeight', 'PlayerWeight'])
# Named mse_values rather than val so the validation DataFrame is not overwritten
mse_values = features_to_add.apply(feature_to_test)
MSE_df = pd.concat([features_to_add, mse_values], axis = 1)
MSE_df.rename(columns = {0: "Feature", 1: "MSE"}, inplace = True)
MSE_df

,Feature,MSE
0,GameClock,40.817052
1,PlayerHeight,40.861093
2,PlayerWeight,40.812371


The MSE values hardly changed (each is slightly higher than the 40.62 obtained without the extra feature), which is surprising because you would expect the rusher's height and weight to contribute to the yardage gained.

# Secondary Objective

#### Given that there are many defensive formations, I want to find the most successful DefensePersonnel for each offensive formation. There is also a lot of information in this data set that could be put to use, such as the angle of the defenders in response to where the running back is.

In [121]:
avg_yard = nfl_sample.groupby(['DefensePersonnel', 'OffenseFormation'])['Yards'].mean().to_frame()
freq = nfl_sample.groupby(['DefensePersonnel'])['OffenseFormation'].value_counts()
avg_yard['Frequency'] = freq

# Some defense/offense combinations appear too rarely to trust, so keep only those above the median frequency
cut_off = avg_yard['Frequency'].median()
avg_yard.loc[avg_yard['Frequency'] > cut_off].sort_values(by = ['OffenseFormation', 'Yards'])

DefensePersonnel,OffenseFormation,Yards,Frequency
"4 DL, 4 LB, 3 DB",I_FORM,2.910714,56
"5 DL, 3 LB, 3 DB",I_FORM,3.222222,27
"3 DL, 5 LB, 3 DB",I_FORM,3.689655,29
"3 DL, 4 LB, 4 DB",I_FORM,3.970641,1124
"5 DL, 2 LB, 4 DB",I_FORM,4.117647,68
"3 DL, 3 LB, 5 DB",I_FORM,4.133333,75
"4 DL, 3 LB, 4 DB",I_FORM,4.162559,1907
"2 DL, 5 LB, 4 DB",I_FORM,4.25,8
"4 DL, 2 LB, 5 DB",I_FORM,4.25,260
"2 DL, 4 LB, 5 DB",I_FORM,4.409091,88


#### From here we can see, for each offensive formation, which defensive lineups allow the fewest average yards gained.
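
Reading the best defense off a sorted table works, but the pick can also be made programmatically. The sketch below uses a small made-up summary table (not values from `train.csv`) and a hypothetical `min_plays` threshold, then takes the lowest-yardage row per formation with `groupby(...).idxmin()`.

```python
import pandas as pd

# Made-up summary rows for illustration only
summary = pd.DataFrame({
    "DefensePersonnel": ["4 DL, 4 LB, 3 DB", "4 DL, 3 LB, 4 DB",
                         "4 DL, 2 LB, 5 DB", "3 DL, 3 LB, 5 DB"],
    "OffenseFormation": ["I_FORM", "I_FORM", "SHOTGUN", "SHOTGUN"],
    "Yards": [2.9, 4.2, 4.6, 4.1],
    "Frequency": [56, 1907, 900, 300],
})

min_plays = 50  # hypothetical cut-off for "seen often enough"
frequent = summary[summary["Frequency"] >= min_plays]

# idxmin returns, per formation, the row label with the lowest average yards
best_rows = frequent.loc[frequent.groupby("OffenseFormation")["Yards"].idxmin()]
best_defense = best_rows.set_index("OffenseFormation")["DefensePersonnel"].to_dict()
print(best_defense)
```

A fixed play-count threshold is one alternative to the median cut-off; with a low cut-off, combinations seen only a handful of times can still top the ranking by chance.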

#### I want to take this a step further and use a classifier to predict the Offense Formation for each play

In [122]:
# First define features and format data 
nfl_sample.dropna(inplace = True)
features = ['DefendersInTheBox', 'DefensePersonnel', 'Down', 'Quarter', 'YardLine', 'Distance']
nfl_sample['Down'] = nfl_sample['Down'].apply(str)
nfl_sample['Quarter'] = nfl_sample['Quarter'].apply(str)

# Set up training and validation data
train = nfl_sample.sample(frac = 0.5)
val = nfl_sample.drop(train.index)

X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")

y_train = train['OffenseFormation']
y_val = val['OffenseFormation']

# convert categorical variables to dummy variables
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_val = vec.transform(X_val_dict)

# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_val_sc = scaler.transform(X_val)

# Fit model
model = KNeighborsClassifier(n_neighbors=10)
model.fit(X_train_sc, y_train)

y_val_pred = model.predict(X_val_sc)

print('Accuracy:', (y_val_pred == y_val).mean())

Accuracy: 0.48144245404519376


#### This accuracy is low (about 48%), but let's look at the F1 scores for the individual offensive formations

In [123]:
from sklearn.metrics import f1_score

O_form = nfl_sample['OffenseFormation'].unique().tolist()
scores = f1_score(y_val, y_val_pred, average = None, labels = O_form).tolist()

tuple(zip(O_form, scores))

(('SHOTGUN', 0.5556574362736109),
 ('SINGLEBACK', 0.46620498614958444),
 ('JUMBO', 0.6217008797653959),
 ('PISTOL', 0.0),
 ('I_FORM', 0.426610348468849),
 ('ACE', 0.0),
 ('WILDCAT', 0.0),
 ('EMPTY', 0.0))

#### Our classifier never correctly predicted some offensive formations (an F1 score of 0), which explains part of the low accuracy. This makes sense for ACE, EMPTY and WILDCAT since they occur so rarely, but it is odd that PISTOL also scores 0: it is more frequent than JUMBO, which has the highest F1 score.

In [124]:
nfl_sample['OffenseFormation'].value_counts()

SINGLEBACK    7253
SHOTGUN       5183
I_FORM        3689
PISTOL         456
JUMBO          412
WILDCAT         71
EMPTY           17
ACE              1
Name: OffenseFormation, dtype: int64
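
For context on the 48% accuracy, a rough check is the accuracy of always guessing the most common formation. The sketch below uses the counts printed above. Those counts cover all rusher rows rather than only the validation half, so the result only approximates the validation baseline; `formation_counts` is an illustrative name.

```python
import pandas as pd

# Formation counts copied from the value_counts() output above
formation_counts = pd.Series({
    "SINGLEBACK": 7253,
    "SHOTGUN": 5183,
    "I_FORM": 3689,
    "PISTOL": 456,
    "JUMBO": 412,
    "WILDCAT": 71,
    "EMPTY": 17,
    "ACE": 1,
})

majority_class = formation_counts.idxmax()
# Share of plays in the most common class = accuracy of always guessing it
majority_share = formation_counts.max() / formation_counts.sum()
print(majority_class, round(majority_share, 4))
```

With these counts the majority share is roughly 0.42, so the KNN classifier's 0.48 is only a modest improvement over always predicting SINGLEBACK.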