<font color=teal>
_______________________________________
</font>


### <font color=teal>Goal:</font>

- Merge play actions and offense/defense power scores into a play-by-play dataset focused on play-calling

### <font color=teal>Input:</font>

- pbp_actions.parquet
- defense_power.parquet
- offense_power.parquet


### <font color=teal>Steps:</font>
- merge offense and defense scores into each play based on which team is on offense and which is on defense
- save the final play-calling dataset
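
The merge step above can be sketched as follows. This is a minimal illustration, not the actual `/src` implementation: it assumes the power scores join on `season`, `week`, and team (matching `posteam` for offense and `defteam` for defense), and the helper name `merge_power_scores` and the toy values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the three input files (values are illustrative only).
pbp = pd.DataFrame({
    "season": [2016, 2016],
    "week": [1, 1],
    "posteam": ["IND", "DET"],
    "defteam": ["DET", "IND"],
    "action": ["pass", "rush"],
})
offense = pd.DataFrame({
    "team": ["IND", "DET"],
    "season": [2016, 2016],
    "week": [1, 1],
    "offense_power": [36.0, 30.0],
})
defense = pd.DataFrame({
    "team": ["IND", "DET"],
    "season": [2016, 2016],
    "week": [1, 1],
    "defense_power": [21.0, 19.0],
})


def merge_power_scores(pbp, offense, defense):
    """Attach offense_power for posteam and defense_power for defteam (hypothetical helper)."""
    # Offense score: match the team in possession for that season and week.
    merged = pbp.merge(
        offense.rename(columns={"team": "posteam"}),
        on=["season", "week", "posteam"],
        how="left",
        validate="many_to_one",  # fail loudly if a team/week appears twice
    )
    # Defense score: match the defending team for that season and week.
    merged = merged.merge(
        defense.rename(columns={"team": "defteam"}),
        on=["season", "week", "defteam"],
        how="left",
        validate="many_to_one",
    )
    return merged


play_calls = merge_power_scores(pbp, offense, defense)
```

Using `how="left"` keeps every play even when a score is missing, so gaps show up as `NaN` instead of silently dropping rows.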


### <font color=teal>Code:</font>
- /src module



### <font color=teal>Output:</font>

- nfl_pbp_play_calls.parquet



<font color=teal>
_______________________________________
</font>

In [1]:
import os
import sys

sys.path.append(os.path.abspath("../src"))

In [2]:
from matplotlib import pyplot as plt
import seaborn as sns
from tensorflow.keras.callbacks import EarlyStopping
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Dense
from tensorflow.keras.models import Model

import warnings

warnings.filterwarnings('ignore')

In [3]:
from src import *

In [4]:
DEBUG = False

data_directory = get_config('data_directory')

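# The 'seaborn-darkgrid' matplotlib style name is deprecated in recent matplotlib releases;
# setting the equivalent style through seaborn works across versions.
sns.set_style('darkgrid')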

In [5]:
full_path = os.path.join(data_directory, "nfl_pbp_play_calls.parquet")
pbp_actions_df = pd.read_parquet(full_path)
pbp_actions_df.head()

Unnamed: 0,row_id,season,week,game_id,drive,play_counter,posteam,posteam_score,posteam_score_post,defteam,...,next_starting_score,down,ydstogo,yards_to_goal,game_seconds_remaining,action,yards_gained,points_gained,offense_power,defense_power
0,3,2016,1,2016_01_DET_IND,1.0,55.0,IND,0,0,DET,...,0.0,1.0,10.0,75,3600.0,pass,6.0,0,36.354117,19.079208
1,36,2016,1,2016_01_DET_IND,1.0,142.0,IND,0,0,DET,...,0.0,2.0,6.0,61,3454.0,rush,2.0,0,36.354117,19.079208
2,80,2016,1,2016_01_DET_IND,1.0,241.0,IND,0,0,DET,...,,3.0,15.0,51,3295.0,pass,3.0,0,36.354117,19.079208
3,199,2016,1,2016_01_DET_IND,3.0,532.0,IND,0,0,DET,...,0.0,1.0,10.0,75,2983.0,pass,0.0,0,36.354117,19.079208
4,219,2016,1,2016_01_DET_IND,3.0,577.0,IND,0,0,DET,...,,3.0,8.0,73,2902.0,pass,0.0,0,36.354117,19.079208


In [6]:
full_path = os.path.join(data_directory, "offense_power.parquet")
offense_power_df = pd.read_parquet(full_path)
offense_power_df.head()

Unnamed: 0,team,season,week,offense_power
0,ARI,2016,1,32.910841
1,ARI,2016,2,43.420363
2,ARI,2016,3,19.541967
3,ARI,2016,4,31.546562
4,ARI,2016,5,36.564986


In [7]:
full_path = os.path.join(data_directory, "defense_power.parquet")
defense_power_df = pd.read_parquet(full_path)
defense_power_df.head()

Unnamed: 0,team,season,week,defense_power
0,ARI,2016,1,19.289876
1,ARI,2016,2,20.485941
2,ARI,2016,3,22.691844
3,ARI,2016,4,15.487278
4,ARI,2016,5,24.862436


In [8]:
db = database_loader.DatabaseLoader(get_config('connection_string'))
db.load_table(df=pbp_actions_df, table_name="nfl_pbp_play_calls", schema='controls')
db.load_table(df=offense_power_df, table_name="offense_power", schema='controls')
db.load_table(df=defense_power_df, table_name="defense_power", schema='controls')


Many of these columns are useful for understanding and validating the data, but they vary in how much they help a play-call predictor.

- drop: season, week, play_counter (unless we can use them to weight more recent seasons)
- not sure:
        - drive - we already get a sense of time from game_seconds_remaining, point_differential, yards_to_goal, etc.
        - posteam - we could label this, but offense and defense power probably identify the team better for this type of application
        - defteam - it would take much longer to train, and defense power is perhaps just as effective
        - down - useful for understanding what is going on, but perhaps not for a play-call predictor
- keepers:
        - point_differential - float
        - ydstogo - float
        - yards_to_goal - int64
        - game_seconds_remaining - float
        - action - label
        - yards_gained - float
        - points_gained - int
        - defense_power - float
        - offense_power - float
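
The idea of weighting more recent seasons could be sketched like this. It is only one possible scheme: the exponential decay, the `half_life` parameter, and the helper name `recency_weights` are assumptions for illustration and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd


def recency_weights(seasons: pd.Series, half_life: float = 2.0) -> pd.Series:
    """Return a weight per row that halves every `half_life` seasons (hypothetical helper)."""
    # Age in seasons relative to the most recent season in the data.
    age = seasons.max() - seasons
    return np.power(0.5, age / half_life)


seasons = pd.Series([2016, 2018, 2020])
weights = recency_weights(seasons)
```

Weights like these could be passed as `sample_weight` to `fit`, which would let `season` be dropped from the features while still influencing training.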

In [None]:
pbp_actions_df['power_differential'] = pbp_actions_df['offense_power'] - pbp_actions_df['defense_power']

In [None]:
keepers = [
    'action',
    'posteam',
    'point_differential',
    'ydstogo',
    'yards_to_goal',
    'game_seconds_remaining',
    'power_differential',
    'yards_gained'
]

df = pbp_actions_df[keepers]
df.head()

In [None]:
# Reassign instead of dropping in place: df is a slice of pbp_actions_df
df = df.dropna(axis=0)
assert df.isna().sum().sum() == 0

In [None]:
full_dataset = pd.get_dummies(df, columns=['action', 'posteam'], prefix='', prefix_sep='')
full_dataset.tail()

In [None]:
dataset = full_dataset.sample(15000, random_state=0)  # fixed seed so the subsample is reproducible
dataset.shape

In [None]:
# sns.pairplot(dataset[['point_differential',
#                       'ydstogo',
#                       'yards_to_goal',
#                       'game_seconds_remaining',
#                       'power_differential',
#                       'yards_gained']], diag_kind='kde')
# plt.show()

In [None]:
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)

In [None]:
train_features = train_dataset.copy()
test_features = test_dataset.copy()

train_labels = train_features.pop('yards_gained')
test_labels = test_features.pop('yards_gained')


In [None]:
import numpy as np

normalizer = tf.keras.layers.Normalization(axis=-1)
# Cast to float so one-hot (boolean) columns do not produce an object array
normalizer.adapt(np.array(train_features, dtype='float32'))

In [None]:
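# Adapt the normalizer on the single feature this model is trained on,
# not on the full feature matrix.
power = np.array(train_features['power_differential'], dtype='float32')

power_normalizer = tf.keras.layers.Normalization(input_shape=[1,], axis=None)
power_normalizer.adapt(power)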

In [None]:
power_model = tf.keras.Sequential([
    power_normalizer,
    Dense(64, activation="relu", name="layer1"),
    Dense(32, activation="relu", name="layer2"),
    Dense(units=1)
])

power_model.summary()

In [None]:
power_model.compile(
    optimizer=tf.keras.optimizers.legacy.Adam(learning_rate=0.1),
    loss='mean_absolute_error',
    # Accuracy is not meaningful for a regression target such as yards_gained
    metrics=['mean_squared_error']
)

In [None]:
%%time
history = power_model.fit(
    train_features['power_differential'],
    train_labels,
    epochs=100,
    # Suppress logging.
    # verbose=0,
    # Calculate validation results on 20% of the training data.
    validation_split = 0.2)

In [None]:
def plot_loss(history):
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.ylim([0, 10])
    plt.xlabel('Epoch')
    plt.ylabel('Error [yards_gained]')
    plt.legend()
    plt.grid(True)

In [None]:
plot_loss(history)