# Data Preprocessing

This notebook defines the functions that preprocess our tracking data, including the following:

- Our model training is only concerned with tracking data in which the ball carrier has been established as the runner on the play, so we remove all tracking data prior to one of the following events:
    - a run (identifying a QB run),
    - a handoff, or
    - a completed pass.
- Once the play has ended, we need to remove the tracking data after one of the following events occurs:
    - out of bounds
    - touchdown
    - fumble
    - QB slide (`qb_slide`)
    - tackle

In [4]:
#Import libraries
import pandas as pd
import numpy as np
import missingno as msno
import matplotlib.pyplot as plt
import os
import warnings
from adjustText import adjust_text
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)

In [5]:
def remove_tracking_issues(tracking):
    # Plays to remove: any play in which a player (not the football) shows a tracking anomaly, identified by an acceleration above 17
    plays_to_rem = tracking[(tracking["displayName"]!="football") & (tracking["a"]>17)][["gameId","playId"]].drop_duplicates()
    
    # Anti-join on (gameId, playId): keep only the tracking rows from plays that were not flagged
    anti_join_result = tracking.merge(plays_to_rem, on=["gameId","playId"], how='left', indicator=True).query('_merge == "left_only"')

    # Drop the '_merge' column used for indicator and reset index
    anti_join_result = anti_join_result.drop('_merge', axis=1).reset_index(drop=True)

    return anti_join_result
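
The anti-join pattern used above can be sketched on a toy table. The player names and acceleration values below are made up for illustration; only the column names and the threshold of 17 come from the function itself.

```python
import pandas as pd

# Hypothetical tracking rows: play (1, 10) has a player with an implausible acceleration.
tracking = pd.DataFrame({
    "gameId": [1, 1, 1, 1],
    "playId": [10, 10, 20, 20],
    "displayName": ["Player A", "football", "Player B", "football"],
    "a": [18.2, 25.0, 3.1, 30.0],
})

# Flag plays where a player (the football is ignored) exceeds the threshold.
is_player = tracking["displayName"] != "football"
bad_plays = tracking[is_player & (tracking["a"] > 17)][["gameId", "playId"]].drop_duplicates()

# Left merge with an indicator column, then keep rows that found no match.
merged = tracking.merge(bad_plays, on=["gameId", "playId"], how="left", indicator=True)
kept = merged[merged["_merge"] == "left_only"].drop(columns="_merge").reset_index(drop=True)

print(kept)
```

Play (1, 20) survives even though the football's acceleration is high, because only player rows are checked against the threshold.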

In [6]:
def remove_plays_with_mult_tackles(tracking,tackles):
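    # One row per (play, tackler) for players credited with a tackle
    duplicated_tackles = tackles[tackles["tackle"]==1][["gameId", "playId","nflId"]].drop_duplicates()
    # Plays that appear more than once have more than one credited tackler
    plays_to_rem = duplicated_tackles[duplicated_tackles.duplicated(subset=['gameId', 'playId'], keep=False)][["gameId", "playId"]]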
    
    # Anti-join on (gameId, playId): keep only the tracking rows from plays with at most one tackler
    anti_join_result = tracking.merge(plays_to_rem, on=["gameId","playId"], how='left', indicator=True).query('_merge == "left_only"')

    # Drop the '_merge' column used for indicator and reset index
    anti_join_result = anti_join_result.drop('_merge', axis=1).reset_index(drop=True)
    
    return anti_join_result
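
As a minimal sketch of how plays with more than one tackler are detected, the toy `tackles` table below (with made-up IDs) has two credited tacklers on play (1, 10) and one on play (1, 20).

```python
import pandas as pd

# Hypothetical tackle records; `tackle == 1` marks a credited tackle.
tackles = pd.DataFrame({
    "gameId": [1, 1, 1, 1],
    "playId": [10, 10, 20, 20],
    "nflId": [101, 102, 201, 202],
    "tackle": [1, 1, 1, 0],
})

# One row per (play, tackler).
tacklers = tackles[tackles["tackle"] == 1][["gameId", "playId", "nflId"]].drop_duplicates()

# keep=False marks every row of a repeated (gameId, playId) pair, not just the later ones.
multi = tacklers[tacklers.duplicated(subset=["gameId", "playId"], keep=False)]
plays_to_remove = multi[["gameId", "playId"]].drop_duplicates()

print(plays_to_remove)
```

The final `drop_duplicates()` is an addition in this sketch. The function above does not need it, since every row matched in the merge is discarded anyway.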

In [7]:
def standardize_field(tracking):
    # Flip plays moving left so that every play runs in the same direction.
    # The field is 120 yards long (including both end zones) and 160/3 yards wide,
    # and the orientation and direction angles are rotated by 180 degrees.
    # Note: this modifies the `tracking` dataframe that is passed in.
    tracking['x'] = np.where(tracking['playDirection'] == 'left', 120 - tracking['x'], tracking['x'])
    tracking['y'] = np.where(tracking['playDirection'] == 'left', 160/3 - tracking['y'], tracking['y'])
    tracking['unitO'] = np.where(tracking['playDirection'] == 'left', (180 + tracking['unitO']) % 360, tracking['unitO'])
    tracking['unitDir'] = np.where(tracking['playDirection'] == 'left', (180 + tracking['unitDir']) % 360, tracking['unitDir'])

    return tracking
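
A small worked example may help check the flip. The two rows below are made up; the sketch works on a copy rather than the original frame, which is a choice made here and not how the function above behaves.

```python
import numpy as np
import pandas as pd

FIELD_LENGTH = 120.0   # yards, including both end zones
FIELD_WIDTH = 160 / 3  # yards

# Hypothetical rows: one play moving left, one moving right.
tracking = pd.DataFrame({
    "playDirection": ["left", "right"],
    "x": [100.0, 30.0],
    "y": [10.0, 20.0],
    "unitO": [90.0, 270.0],
    "unitDir": [350.0, 45.0],
})

going_left = tracking["playDirection"] == "left"
standardized = tracking.copy()

# Mirror the coordinates of left-moving plays through the center of the field.
standardized["x"] = np.where(going_left, FIELD_LENGTH - tracking["x"], tracking["x"])
standardized["y"] = np.where(going_left, FIELD_WIDTH - tracking["y"], tracking["y"])

# Rotate the angles by 180 degrees and wrap them back into [0, 360).
for col in ["unitO", "unitDir"]:
    standardized[col] = np.where(going_left, (tracking[col] + 180) % 360, tracking[col])

print(standardized)
```

The left-moving row at x = 100 lands at x = 20, and its direction of 350 degrees becomes 170 degrees. The right-moving row is unchanged.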

## Filter Frames By Events

Remember, the purpose of this model is to predict when a tackle occurs. The data we were given already excludes plays without a ball carrier, such as incomplete passes and quarterback sacks, because the focus is on predicting when a tackle will occur given that there is a ball carrier on the play. For example, when the ball is snapped to the quarterback, he is not deemed the ball carrier right away because he will either pass the ball or hand it off. Once the quarterback has passed the ball and a receiver has caught it, the receiver is considered the ball carrier. Likewise, a running back is not considered the ball carrier until he has received the handoff. In the data, we can identify that a quarterback has been deemed a runner rather than a passer when the event tag "run" occurs on a given play.

With all this said, we only want to build our model on frames that have a definitive ball carrier, and only up to the frame where a tackle occurs or another event tag signifies the end of the play. All frames after the play has ended can be disregarded. Thus, the purpose of this function is to take the tracking data and remove the frames at the beginning of the play that have no definitive ball carrier, along with the unnecessary frames at the end of the play.

Note: This function must be run after the dependent variable has been created, because it relies on the `tackle_multiple` column to keep the two extra frames that we added after a tagged "tackle".

In [1]:
# filter_frames_by_events: function that takes tracking data and filters frames based on start and end tagged events defined
#                          in the function. Uses tackle_multiple feature to add buffered frames for dependent variable
# Input: tracking: tracking data
# Output: tracking dataframe with filtered frames
# Example usage: filtered_frames = filter_frames_by_events(tracking)
def filter_frames_by_events(tracking):
    # Define start and end events
    start_events = ['run', 'handoff', 'pass_outcome_caught', 'lateral', 'snap_direct']
    end_events = ['out_of_bounds', 'touchdown', 'fumble', 'qb_slide', 'tackle', 'safety']

    # Function to filter frames for a single play
    def filter_frames(play_data):
        # Find the first frame of the start events
        start_frame = play_data[play_data['event'].isin(start_events)]['frameId'].min()
        
        # Find the end frame of the play
        # If there is a tackle: use the last frame flagged by tackle_multiple, which includes the 2 buffer frames after the tackle
        # Otherwise: use the first frame tagged with any of the end events
        if (play_data["tackle_multiple"] == 1.0).any():
            end_frame = play_data[play_data["tackle_multiple"] == 1.0]['frameId'].max()
            
        else:
            end_frame = play_data[play_data['event'].isin(end_events)]['frameId'].min()

        # If start_frame or end_frame is NaN, return an empty DataFrame
        if pd.isna(start_frame) or pd.isna(end_frame):
            return pd.DataFrame()
        
        # Filter the play_data for frames between start_frame and end_frame
        return play_data[(play_data['frameId'] >= start_frame) & (play_data['frameId'] <= end_frame)]

    # Group by game and play, apply the filter_frames function, and concatenate the results
    filtered_data = tracking.groupby(['gameId', 'playId']).apply(filter_frames)
    
    # Reset the index and return the result
    return filtered_data.reset_index(drop=True)
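
The per-play window can be traced on a single toy play. The frames below are made up, and the sketch assumes that `tackle_multiple` equals 1.0 on the tackle frame and the two frames after it, which is how we read the note above; the real column is created elsewhere.

```python
import pandas as pd

start_events = ["run", "handoff", "pass_outcome_caught", "lateral", "snap_direct"]
end_events = ["out_of_bounds", "touchdown", "fumble", "qb_slide", "tackle", "safety"]

# Hypothetical play: handoff at frame 3, tackle at frame 5, two buffer frames after it.
play = pd.DataFrame({
    "frameId": [1, 2, 3, 4, 5, 6, 7, 8],
    "event": ["ball_snap", None, "handoff", None, "tackle", None, None, None],
    "tackle_multiple": [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0],
})

# First frame with a definitive ball carrier.
start_frame = play.loc[play["event"].isin(start_events), "frameId"].min()

# Last buffered tackle frame if there is one, else the first end-event frame.
if (play["tackle_multiple"] == 1.0).any():
    end_frame = play.loc[play["tackle_multiple"] == 1.0, "frameId"].max()
else:
    end_frame = play.loc[play["event"].isin(end_events), "frameId"].min()

window = play[(play["frameId"] >= start_frame) & (play["frameId"] <= end_frame)]
print(window["frameId"].tolist())
```

Frames 1 and 2 are dropped because no ball carrier exists yet, and frame 8 is dropped because it falls after the buffered tackle window.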

In [9]:
def remove_football_frames(tracking):

    # Remove the football's own tracking rows (displayName == "football")
    filtered_data = tracking[tracking['displayName'] != "football"]

    return filtered_data

In [10]:
def remove_offensive_players(tracking,plays):
    # Attach each play's defensive team so that every row can be compared against the player's club
    data = tracking.merge(plays[["gameId","playId","defensiveTeam"]], how = "inner", on = ["gameId","playId"])
    return data[data["club"]==data["defensiveTeam"]]
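
As a quick illustration of the defensive filter, the sketch below uses made-up IDs and club codes; only the column names come from the function above.

```python
import pandas as pd

# Hypothetical tracking rows for one play: one offensive player, two defenders.
tracking = pd.DataFrame({
    "gameId": [1, 1, 1],
    "playId": [10, 10, 10],
    "nflId": [1, 2, 3],
    "club": ["KC", "DEN", "DEN"],
})

# Hypothetical play-level table naming the defensive team.
plays = pd.DataFrame({"gameId": [1], "playId": [10], "defensiveTeam": ["DEN"]})

# The inner merge also drops tracking rows whose play is missing from `plays`.
merged = tracking.merge(plays[["gameId", "playId", "defensiveTeam"]],
                        how="inner", on=["gameId", "playId"])
defenders = merged[merged["club"] == merged["defensiveTeam"]]

print(defenders)
```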