## Big Data Bowl Metric Calculation

*Big Data Bowl 2024 Submission*  
*By Eli Gnesin*  
*Masters of Statistical Science, Duke University*

This notebook calculates **TOCQ (Tackle Opportunity Containment Quotient)**, the metric I created for Big Data Bowl 2024. This version has been adjusted to calculate the metric for my Duke MSS portfolio. The only differences from the submitted version are:

-  If a frame is tagged as "ball snap", then the counting for Tackle Opportunity purposes starts 2 frames later. This decision fixes the issue of defensive linemen getting unfairly penalized for being lined up over the snap.
-  Rather than a constant tackle zone of radius 2 yards, I used the following tackle zones (for purposes of sensitivity analysis):

   1.   Constant radius of 1.5 yards
   2.   Radius extended by the maximum positive one-frame displacement: the maximum of $c$ and $c + st + \frac{1}{2}at^2$, evaluated at $t = 0.1$ for $c = 1, 1.25, 1.5$, where $s$ and $a$ are the player's speed and acceleration
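
The second tackle zone can be sketched as a small helper. This is a minimal illustration, assuming speed `s` is in yards per second, acceleration `a` is in yards per second squared, and one tracking frame lasts 0.1 seconds; the function name `tackle_zone_radius` is hypothetical and does not appear elsewhere in this notebook.

```python
def tackle_zone_radius(c, s, a, t=0.1):
    """Return the tackle zone radius in yards.

    c: base radius, s: speed, a: acceleration, t: time step in seconds
    (assumed to be one tracking frame).
    """
    # Distance covered in one frame under constant acceleration
    displacement = s * t + 0.5 * a * t ** 2
    # Never shrink the zone below the base radius
    return max(c, c + displacement)


# Hypothetical defender: 5 yd/s, 2 yd/s^2, base radius of 1.25 yards
print(tackle_zone_radius(1.25, s=5.0, a=2.0))
```

The `max` guards against the zone shrinking below `c` if the displacement term were ever negative.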

### Setup

In [1]:
# import packages
import pandas as pd
import numpy as np
from zipfile import ZipFile
import matplotlib.pyplot as plt
import matplotlib.patches as patch
from matplotlib.collections import PatchCollection
from matplotlib.animation import ArtistAnimation, PillowWriter
import matplotlib
import seaborn as sns
import math
from shapely import geometry
import shapely
import imageio
import os
from tqdm import tqdm
from IPython import display

import warnings
warnings.filterwarnings("ignore")

plt.rc('axes', axisbelow=True)

In [2]:
# Read in the data
games = pd.read_csv("games.csv")
plays = pd.read_csv("plays.csv")
tackles = pd.read_csv("tackles.csv")
players = pd.read_csv("players.csv")

### Functions

This notebook defines four functions:
1.  `get_all_plays_and_def` takes a `gameID` and returns the ID of every play in that game in the dataset, along with the defensive team on each play.

2.  `prep_play` takes four arguments: a tracking dataframe, a `gameID`, a `playID`, and the defensive team on the given play. It subsets the tracking data to that play and then does the following:
   -  Calculates the `x` and `y` bounds for the play (for visualization purposes)
   -  Subsets to only observations for the football or the defensive team
   -  Drops all frames before the ball snap + 2 (if available; the +2 keeps defensive tackles from being caught in the ball snap frames) or the pass catch (if available), and all frames after the tackle/out of bounds/slide/sack/fumble/safety
   -  Returns the dataframe, the list of frames kept, and the `x` and `y` bounds as two tuples

3.  `plot_play` takes the outputs of the `prep_play` function as well as a boolean `save` argument that controls whether the result is saved in addition to being displayed. It iterates through the frames, and for each frame makes a plot with each defensive player's location (marked in black) and tackle zone, and the location of the football (marked in brown). For players where the football is within the tackle zone, the zone is colored blue, and for players where the football is not within the tackle zone, the zone is colored gray. If `save = False`, the function only displays the animation, and if `save = True`, the function also saves it as a gif titled `gameId_playId.gif`.

4.  `count_instances` is the key function for counting tackle opportunities and containments, and takes a prepped play dataframe and the defensive team as its primary arguments. It then does the following:
   -  Creates a dictionary of zero counts with a key for each player ID combined with his team
   -  Determines the "action frame" where the tackle/out of bounds/sack/slide/fumble/safety occurs and exits if one does not exist (such as on a touchdown)
   -  Iterates through each frame and creates the tackle zone for each defensive player, as well as recording the football position
   -  In each frame, checks if the football is in the tackle zone for each player to record a "tackle opportunity"
   -  In the "action frame", checks if the football is in the tackle zone for each player to record a "tackle containment"
   -  Returns the dictionary for the play

There are three other arguments for `count_instances`. The first, `dictionary`, allows the user to pass a running dictionary forward: `count_instances` aggregates the current play into it and returns both the updated running dictionary and the dictionary for the individual play, in that order. This is crucial for calculating **TOCQ** because the metric is computed over the course of a game or season, not an individual play. The second argument, `oneplay`, is a debugging tool that assumes `dictionary = None` is passed in and returns only the dictionary for the given play. The final argument, `verbose`, is a debugging tool that flags plays where the tracking data may be faulty or where there is no "action frame". In these cases the function returns the unmodified `dictionary` and `None` in place of the play dictionary; it does so silently if `verbose = False` and prints a diagnostic line if `verbose = True`.
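
The rolling aggregation described above can be illustrated in isolation. This is a minimal sketch rather than the notebook's implementation: it uses plain lists in place of NumPy arrays, and the helper name `merge_play_counts` and the player keys are hypothetical.

```python
def merge_play_counts(running, play_counts):
    """Add one play's [opportunity, containment] flags into running totals."""
    for player, (opportunity, containment) in play_counts.items():
        # Start a player at [0, 0] the first time he appears
        totals = running.setdefault(player, [0, 0])
        totals[0] += opportunity
        totals[1] += containment
    return running


running = {}
# Two hypothetical plays; keys follow the "<nflId>_<team>" pattern
merge_play_counts(running, {"101_NYJ": [1, 0], "102_NYJ": [0, 0]})
merge_play_counts(running, {"101_NYJ": [1, 1], "103_NYJ": [1, 0]})
print(running)
# {'101_NYJ': [2, 1], '102_NYJ': [0, 0], '103_NYJ': [1, 0]}
```

Because each play contributes at most one opportunity and one containment per player, the running totals count plays, not frames.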

In [3]:
## Helper functions

def get_all_plays_and_def(gameID):
    game = plays[plays.gameId == gameID]
    playIDs = pd.unique(game.playId)
    def_teams = [game[game.playId == i].defensiveTeam.iloc[0] for i in playIDs]
    return (playIDs, def_teams)

def prep_play(tracking_data, gameID, playID, def_team):
    play = tracking_data[(tracking_data.gameId == gameID) & 
                         (tracking_data.playId == playID)]
    (xmin, xmax) = (min(play.x)-1, max(play.x)+1)
    (ymin, ymax) = (min(play.y)-1, max(play.y)+1) 
    play = play[((play.club == def_team) | (play.club == "football"))]
    events = play[["event", "frameId"]].drop_duplicates()
    if "ball_snap" in events.event.values: #Minor change here to not include ball snap frame (start 2 frames later)
        idx = events[events.event == "ball_snap"].frameId.iloc[0]
        events = events.loc[events.frameId >= idx + 2]
    if "pass_outcome_caught" in events.event.values:
        idx = events[events.event == "pass_outcome_caught"].frameId.iloc[0]
        events = events.loc[events.frameId >= idx]
    if "tackle" in events.event.values:
        idx = events[events.event == "tackle"].frameId.iloc[0]
        events = events.loc[events.frameId <= idx]
    elif "out_of_bounds" in events.event.values:
        idx = events[events.event == "out_of_bounds"].frameId.iloc[0]
        events = events.loc[events.frameId <= idx]
    elif "qb_slide" in events.event.values:
        idx = events[events.event == "qb_slide"].frameId.iloc[0]
        events = events.loc[events.frameId <= idx]
    elif "qb_sack" in events.event.values:
        idx = events[events.event == "qb_sack"].frameId.iloc[0]
        events = events.loc[events.frameId <= idx]
    elif "fumble" in events.event.values:
        idx = events[events.event == "fumble"].frameId.iloc[0]
        events = events.loc[events.frameId <= idx]
    elif "safety" in events.event.values:
        idx = events[events.event == "safety"].frameId.iloc[0]
        events = events.loc[events.frameId <= idx]
    frames = pd.unique(events.frameId)
    
    play = play[(play.frameId >= min(frames)) & (play.frameId <= max(frames))]

    return [play, frames, (xmin,xmax), (ymin,ymax)]
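
The frame-trimming logic in `prep_play` can also be read as "pick a start frame, then cut at the first available end event in priority order". The sketch below restates that idea on a plain list of `(frame_id, event)` pairs instead of a dataframe. The name `frame_window` and the sample play are hypothetical, and frames without an event are assumed to carry `None`.

```python
# End-of-play events, checked in the same priority order as the if/elif chain
END_EVENTS = ["tackle", "out_of_bounds", "qb_slide", "qb_sack", "fumble", "safety"]


def frame_window(events):
    """Return (first_frame, last_frame) to keep for one play.

    events: list of (frame_id, event_name) pairs in frame order.
    """
    def first_frame_of(name, pairs):
        return next((f for f, e in pairs if e == name), None)

    # Start two frames after the snap, if a snap is tagged
    snap = first_frame_of("ball_snap", events)
    if snap is not None:
        events = [(f, e) for f, e in events if f >= snap + 2]
    # A later catch moves the start to the catch frame
    catch = first_frame_of("pass_outcome_caught", events)
    if catch is not None:
        events = [(f, e) for f, e in events if f >= catch]
    # Cut at the first end event that exists, in priority order
    for name in END_EVENTS:
        end = first_frame_of(name, events)
        if end is not None:
            events = [(f, e) for f, e in events if f <= end]
            break
    frames = [f for f, _ in events]
    return min(frames), max(frames)


# Hypothetical 30-frame run play: snap at frame 6, tackle at frame 25
sample = [(f, None) for f in range(1, 31)]
sample[5] = (6, "ball_snap")
sample[24] = (25, "tackle")
print(frame_window(sample))  # (8, 25)
```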

In [4]:
def plot_play(prepped, frames, bounds, save = True):
    game_id = prepped.gameId.iloc[0]
    play_id = prepped.playId.iloc[0]

    fig, ax = plt.subplots()
    ax.set_xlim(left = bounds[0][0], right = bounds[0][1])
    ax.set_ylim(bottom = bounds[1][0], top = bounds[1][1])
    ax.set_xticklabels([50 - np.abs(60-int(k)) for k in ax.get_xticks()])
    ax.set_yticks([])
    ax.set_title(plays[(plays.gameId == game_id) & 
              (plays.playId == play_id)].playDescription.iloc[0])
    ax.set_facecolor("lawngreen")
    ax.grid(which = "major", axis = "x", color = "white", linewidth = 2) 

    ims = []
    for i in frames:
        frame = prepped[prepped.frameId == i]
        wedges = []
            
        for j in range(len(frame)):
            res = frame.iloc[j]
            if res.club != "football":
                wedges.append(patch.Wedge((res.x, res.y), r = max(1.25, 1.25+0.1*res.s + 0.5*0.01*res.a), 
                                          theta1 = ((-res.o+90) - 75) % 360, theta2 = ((-res.o+90) + 75) % 360, 
                                         color = "dimgray", alpha = 0.7))
                wedges.append(patch.Circle((res.x, res.y), radius = 0.5, color = "black"))
                continue
            # Plot the football
            else:
                for wedge in wedges:
                    if wedge.contains_point((res.x,res.y)):
                        if type(wedge) != matplotlib.patches.Circle:
                            wedge.set_color("blue")
                wedges.append(patch.Circle((res.x, res.y), radius = 0.5, color = "saddlebrown"))

        wedges = PatchCollection([wedge for wedge in wedges],
                                 facecolors = [wedge.get_facecolor() for wedge in wedges],
                                 edgecolors = [wedge.get_edgecolor() for wedge in wedges]
                                                       )
        im = ax.add_collection(wedges)
        ims.append([im])

    animation_play = ArtistAnimation(fig = fig, artists = ims, interval = 150,
                                    blit = True, repeat_delay = 2000)
    if save:
        animation_play.save(f"{game_id}_{play_id}.gif", writer = PillowWriter())
    video = animation_play.to_jshtml()
    html = display.HTML(video)
    display.display(html)
    plt.close()

In [5]:
# Calculation function

def count_instances(prepped, club = "", dictionary = None, oneplay = False, verbose = True):
    players = [f"{str(int(q))}_{club}" for q in pd.unique(prepped.nflId[prepped.nflId.notna()])]
    newdict = {}
    for player in players:
        newdict[player] = np.array([0,0])
    
    frames = prepped[["event", "frameId"]].drop_duplicates()
    action_frame = None
    for frame in frames.values:
        if frame[0] in ["tackle", "out_of_bounds", "qb_slide", "qb_sack", "fumble", "safety"]:
            action_frame = frame[1]
    if action_frame is None:
        if verbose:
            print(f"No tackle/action event available for play {prepped.playId.values[0]} in game {prepped.gameId.values[0]}")
        return dictionary, None
    frames = frames.frameId.values
    for frame in frames:
        play_frame = prepped[prepped.frameId == frame]
        wedges = [None] * 11
        players_frame = play_frame[play_frame.nflId.notna()]
        player_ids = players_frame.nflId.values
        for j in range(len(players_frame)):
            res = players_frame.iloc[j]  # index player rows only, so the football row is never picked up
            wedges[j]  = patch.Wedge((res.x, res.y), r = max(1.25, 1.25 + 0.1*res.s + 0.5*0.01*res.a), 
                                    theta1 = ((-res.o+90) - 75) % 360, 
                                    theta2 = ((-res.o+90) + 75) % 360)
        football = play_frame[play_frame.displayName == "football"]
        football_loc = (football.x.values[0], football.y.values[0])
        
        if wedges[0] is None:
            if verbose:
                print(f"Faulty tracking data for play {prepped.playId.values[0]} in game {prepped.gameId.values[0]}")
            return dictionary, None
                
        # Pair each wedge with its player; zip stops at the last tracked defender,
        # so any unused None slots in `wedges` are never touched
        for wedge, player_id in zip(wedges, player_ids):
            if wedge.contains_point(football_loc):
                key = f"{int(player_id)}_{club}"
                # Opportunity in any frame; containment only in the action frame
                if frame == action_frame:
                    newdict[key] = np.array([1,1])
                else:
                    newdict[key] = np.array([1,0])
                    
    if dictionary is not None:
        for player in players:
            if player not in dictionary.keys():
                dictionary[player] = newdict[player]
            else:
                dictionary[player] = dictionary[player] + newdict[player]
        
        return dictionary, newdict
                
    return newdict
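
For readers who want to see what the wedge test amounts to geometrically, here is a dependency-free sketch of the same check. It assumes the tracking orientation `o` is measured in degrees clockwise from the positive y-axis, which is why the code above converts it with `90 - o`. The function name `ball_in_tackle_zone` is hypothetical, and behavior exactly on the wedge boundary may differ from matplotlib's `contains_point`.

```python
import math


def ball_in_tackle_zone(px, py, o, radius, bx, by, half_angle=75.0):
    """Check whether the ball (bx, by) lies in a player's tackle wedge.

    (px, py): player position, o: orientation in degrees (assumed clockwise
    from the +y axis), radius: wedge radius in yards.
    """
    dx, dy = bx - px, by - py
    # Outside the radius: no opportunity regardless of direction
    if math.hypot(dx, dy) > radius:
        return False
    # Convert orientation to a counterclockwise angle from the +x axis
    facing = (90.0 - o) % 360.0
    bearing = math.degrees(math.atan2(dy, dx)) % 360.0
    # Signed angular difference folded into [-180, 180)
    diff = (bearing - facing + 180.0) % 360.0 - 180.0
    return abs(diff) <= half_angle


# Hypothetical player at the origin facing the +x direction (o = 90)
print(ball_in_tackle_zone(0, 0, 90, 1.25, 1, 0))   # True: ball in front
print(ball_in_tackle_zone(0, 0, 90, 1.25, -1, 0))  # False: ball behind
```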

In [6]:
# An example play for visualization
t2 = pd.read_csv("tracking_week_2.csv")
play = prep_play(t2, 2022091801, 2090, "NYJ")
plot_play(play[0], play[1], (play[2], play[3]))

### Calculating TOCQ

With the functions above, **TOCQ** can be calculated relatively efficiently. I iterated through the weeks and then through the game IDs for each week, collected all plays in the game with `get_all_plays_and_def`, prepped each play with `prep_play`, and called `count_instances` to do the counting. A rolling dictionary, passed from play to play through the `dictionary` argument, collected half-season totals for every player/team combination. I also kept the play dictionary returned for each play and aggregated these by game, which makes it possible to look at the metric by game/team combination and against particular opponents.

According to `tqdm`, the full calculation takes roughly 30 to 35 minutes (about 15 seconds per game), depending on server speed and other factors.

In [7]:
# Calculate the metric

dictionary = {}
games_dict = {}
for week_num in range(1,10):
    print(week_num)
    game_IDs = pd.unique(games[games.week == week_num].gameId)
    tracking = pd.read_csv(f"tracking_week_{week_num}.csv")
    for game_id in tqdm(game_IDs):
        plays_def = get_all_plays_and_def(game_id)
        for i in range(len(plays_def[0])):
            play_num = plays_def[0][i]
            def_team = plays_def[1][i]
            prepped_play = prep_play(tracking, game_id, play_num, def_team)
            dictionary, game_dict = count_instances(prepped_play[0], def_team, dictionary = dictionary, verbose = False)
        
            # Saving the Game dict
            if game_dict is not None:
                if game_id in games_dict.keys():
                    for k in game_dict.keys():
                        if k in games_dict[game_id].keys():
                            games_dict[game_id][k] = games_dict[game_id][k] + game_dict[k]
                        else:
                            games_dict[game_id][k] = game_dict[k]
                else:
                    games_dict[game_id] = game_dict

1
100%|██████████| 16/16 [03:56<00:00, 14.76s/it]

2
100%|██████████| 16/16 [03:49<00:00, 14.36s/it]

3
100%|██████████| 16/16 [04:01<00:00, 15.08s/it]

4
100%|██████████| 16/16 [03:54<00:00, 14.69s/it]

5
100%|██████████| 16/16 [04:04<00:00, 15.26s/it]

6
100%|██████████| 14/14 [03:28<00:00, 14.92s/it]

7
100%|██████████| 14/14 [03:34<00:00, 15.34s/it]

8
100%|██████████| 15/15 [03:56<00:00, 15.78s/it]

9
100%|██████████| 13/13 [03:10<00:00, 14.66s/it]

### Saving Results

I want to save both the full results dictionary and the dictionary of game dictionaries as CSV files so I can move them over to R for cleaner visualizations. For the full dictionary, each row corresponds to a player/team combination (so players who changed teams will have multiple rows) and includes their name, team, and position. For the game dictionaries, the CSV file will have each game/player combination as a row, with the player's name, team, and position included as well.

In [8]:
full_dict = pd.DataFrame.from_dict(data=dictionary, orient='index', columns =["Opportunities", "Contains"])
full_dict.reset_index(inplace = True)
full_dict[['Player_ID', 'Team']] = full_dict['index'].str.split('_', expand=True)
full_dict["Player_Name"] = full_dict.Player_ID.apply(lambda x: players[players.nflId == int(x)].displayName.values[0])
full_dict["Position"] = full_dict.Player_ID.apply(lambda x: players[players.nflId == int(x)].position.values[0])
full_dict.to_csv('full_dictionary_54.csv', header=True)

In [9]:
games_df = pd.DataFrame.from_dict(data=games_dict).melt(var_name = "game", ignore_index = False)
games_df.reset_index(inplace = True)
games_df[['Player_ID', 'Team']] = games_df['index'].str.split('_', expand=True)
games_df.dropna(inplace = True)
games_df['Opportunities'] = games_df["value"].apply(lambda x: x[0])
games_df['Contains'] = games_df["value"].apply(lambda x: x[1])
games_df["Player_Name"] = games_df.Player_ID.apply(lambda x: players[players.nflId == int(x)].displayName.values[0])
games_df["Position"] = games_df.Player_ID.apply(lambda x: players[players.nflId == int(x)].position.values[0])
games_df.drop("value", axis = 1,inplace = True)
games_df.to_csv("games_dictionary_54.csv", header = True)