# Feature Engineering

In this notebook, we generate new features to be used in the dataset for our models. For each feature we include its name, a description, a justification for adding it, and a function that implements it on the data.

## Features From Data

The following features are variables from the data provided by the Kaggle competition that we believe to be valuable for our models:
- From players.csv file:
    - position: position of the player (character)
    - weight: weight in lbs for defender (numeric)
    - ballCarrier_position: position of ball carrier (character)
    - weight_ballCarrier: weight in lbs for ball carrier (numeric)
- From plays.csv file:
    - passProbability: NGS model probability that the next play is a pass, not the actual probability of the pass being caught (numeric)
    - preSnapWinProbabilityDefense: Pre-snap win probability for the defensive team (numeric) (not directly in data)
    - defendersInTheBox: Number of defenders in close proximity to the line of scrimmage (numeric)
    - offenseFormation: Formation used by the possession team (more on this variable below) (character)
    - absoluteYardlineNumber: Distance from the end zone for the possession team (numeric)
    - down: down of the play (numeric)
    - yardsToGo: distance to get first down (numeric)
- From tracking_week_#.csv files: 
    - x: player position along the long axis of the field (0-120 yards) (numeric)
    - y: player position along the wide axis of the field (0 - 53.3 yards) (numeric)
    - s: speed in yards/sec (numeric)
    - a: acceleration in yards/sec^2 (numeric)
    - o: player orientation (deg), 0-360 (numeric)
    - dir: angle of player motion (deg), 0 - 360 degrees (numeric)
    - football_x: football position along the long axis of the field (0-120 yards) (numeric) (not directly in data)
    - football_y: football position along the wide axis of the field (0 - 53.3 yards) (numeric) (not directly in data)
- From tackles.csv file: 
    - tackle: 0 or 1 indicating if the tackle was awarded (not directly in data) (Dependent Variable)


## Install External Datasets

In [3]:
pip install nfl_data_py

Defaulting to user installation because normal site-packages is not writeable
Collecting nfl_data_py
  Using cached nfl_data_py-0.3.1.tar.gz (16 kB)
Collecting fastparquet>0.5
  Using cached fastparquet-2023.10.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Collecting cramjam>=2.3
  Using cached cramjam-2.7.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
Collecting pandas>1
  Using cached pandas-2.1.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Collecting numpy>=1.20.3
  Downloading numpy-1.26.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
Collecting tzdata>=2022.1
  Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Building wheels for collected packages: nfl-data-py
  Building wheel for nfl-data-py (setup.py) ... done
  Created wheel for nfl-data-py: filename=nfl_data_py-0.3.1-py3-none-any.whl size=13206

## Orient Angle

In [1]:
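def orient_angle(angle):
    # Convert an angle measured clockwise from the +y axis into one measured
    # counterclockwise from the +x axis, i.e. (90 - angle) mod 360, except
    # that an input of exactly 90 returns 360 rather than 0.
    if angle >=0 and angle < 90:
        return (90-angle)
    if angle >=90 and angle <180:
        return ((180-angle)+270)
    if angle >=180 and angle < 270:
        return ((270 -angle)+180)
    else:
        return(180-(angle - 270))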
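
The piecewise rule above collapses to a single modular expression. Below is a minimal vectorized sketch, assuming the input is measured clockwise from the positive y-axis, as the branches imply. `orient_angle_vectorized` is a hypothetical name, and unlike `orient_angle` it maps 90 to 0 rather than 360 (the same direction).

```python
import numpy as np

def orient_angle_vectorized(angle):
    """Convert clockwise-from-+y degrees to counterclockwise-from-+x degrees."""
    # (90 - angle) flips the rotation direction and moves the zero axis;
    # the modulo wraps negative results back into [0, 360).
    return np.mod(90 - np.asarray(angle, dtype=float), 360)

print(orient_angle_vectorized([0, 45, 180, 270]))
```

Because it accepts arrays, this form could be applied to a whole tracking column at once instead of row by row.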

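## Distances and Angles Between Players
For each defender in each frame, this calculates the distance to the ten nearest offensive players (excluding the ball carrier), sorted from closest to farthest, as well as the distance to the ball carrier. Each distance is paired with the angle between the defender's direction of motion (unitDir) and the straight line to that player.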

In [2]:
import numpy as np

def calculate_angle(x_defender, y_defender, x_ball_carrier, y_ball_carrier, defender_dir):
    delta_x = x_ball_carrier - x_defender
    delta_y = y_ball_carrier - y_defender

    # Angle of the line from the defender to the target, in degrees within [0, 360)
    angle_degrees = np.mod(np.degrees(np.arctan2(delta_y, delta_x)), 360)

    # How far away is the angle of the player from the trajectory
    angle_degrees = np.abs(defender_dir - angle_degrees)

    # Take the smaller of the two arcs so the result lies within [0, 180].
    # This works element-wise for scalars, arrays, and Series.
    angle_degrees = np.minimum(angle_degrees, 360 - angle_degrees)

    return angle_degrees

# Define a function to calculate distance
def calculate_distance(x1, y1, x2, y2):
    from math import sqrt
    return sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

In [3]:
def calculate_distance_angles(tracking, plays):
    # Merge tracking data with the plays columns needed to identify the defensive team and ball carrier
    data = tracking.merge(plays[["gameId","playId","defensiveTeam","ballCarrierId"]], how="inner", on=["gameId","playId"])

    # Prepare columns for distances (c1Dist, c2Dist, ..., c10Dist)
    for i in range(1, 11):
        data[f'c{i}Dist'] = None

    # Prepare columns for angles (c1Ang, c2Ang, ..., c10Ang)
    for i in range(1, 11):
        data[f'c{i}Ang'] = None

    # Prepare columns for ball carrier distance and angle
    data["bcDist"] = None
    data["bcAng"] = None

    # Process each game, play, and frame
    for (game_id, play_id), play_data in data.groupby(['gameId', 'playId']):
        # Determine the defensive team for this play
        defensive_team = play_data.iloc[0]['defensiveTeam']
        ballCarrierId = play_data.iloc[0]['ballCarrierId']

        for frame_id, frame_data in play_data.groupby('frameId'):
            # Separate defensive and offensive players
            defensive_players = frame_data[frame_data['club'] == defensive_team]
            offensive_players = frame_data[(frame_data['club'] != defensive_team) & (frame_data['displayName'] != "football")]

            for index, def_player in defensive_players.iterrows():
                # Calculate distances to all offensive players except the ball carrier
                distances_angles = [
                    (calculate_distance(def_player['x'], def_player['y'], off_player['x'], off_player['y']),
                     calculate_angle(def_player['x'], def_player['y'], off_player['x'], off_player['y'], def_player["unitDir"]))
                    for _, off_player in offensive_players.iterrows()
                    if off_player['nflId'] != ballCarrierId
                ]

                # Sort the (distance, angle) pairs by distance in ascending order
                distances_angles.sort(key=lambda x: x[0])

                # Update the DataFrame with the sorted distances and angles
                for i, (dist, angle) in enumerate(distances_angles[:10], start=1):
                    data.loc[index, f'c{i}Dist'] = dist
                    data.loc[index, f'c{i}Ang'] = angle

                # Update the DataFrame with the ball carrier distance and angle.
                # Look the ball carrier up once and skip frames where they are missing.
                ball_carrier = offensive_players[offensive_players["nflId"] == ballCarrierId]
                if not ball_carrier.empty:
                    bc_x = ball_carrier['x'].iloc[0]
                    bc_y = ball_carrier['y'].iloc[0]
                    data.loc[index, 'bcDist'] = calculate_distance(def_player['x'], def_player['y'], bc_x, bc_y)
                    data.loc[index, 'bcAng'] = float(calculate_angle(def_player['x'], def_player['y'], bc_x, bc_y,
                                                                     def_player["unitDir"]))

    return data[["gameId", "playId", "nflId", "frameId",
                 "c1Dist", "c2Dist", "c3Dist", "c4Dist", "c5Dist", "c6Dist", "c7Dist", "c8Dist", "c9Dist", "c10Dist", "bcDist",
                 "c1Ang", "c2Ang", "c3Ang", "c4Ang", "c5Ang", "c6Ang", "c7Ang", "c8Ang", "c9Ang", "c10Ang", "bcAng"]]

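The nested `iterrows` loops above can be slow on full tracking weeks. As a hedged sketch of an alternative, the distances for one frame can be computed with NumPy broadcasting. `nearest_offense_distances` is a hypothetical helper that is not part of the notebook, and the coordinates are made-up toy values.

```python
import numpy as np

def nearest_offense_distances(def_xy, off_xy, k=10):
    """For each defender, return the sorted distances to the k nearest offensive players."""
    def_xy = np.asarray(def_xy, dtype=float)   # shape (n_def, 2)
    off_xy = np.asarray(off_xy, dtype=float)   # shape (n_off, 2)
    # Pairwise coordinate differences via broadcasting: shape (n_def, n_off, 2)
    diff = def_xy[:, None, :] - off_xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Sort each defender's row and keep the k closest
    return np.sort(dist, axis=1)[:, :k]

defenders = [[0.0, 0.0], [10.0, 0.0]]              # toy (x, y) positions
offense = [[3.0, 4.0], [6.0, 8.0], [10.0, 5.0]]    # toy (x, y) positions
print(nearest_offense_distances(defenders, offense, k=2))
```

The sketch covers distances only. The matching angles would need the same sort order applied to them, for example with `np.argsort`.
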
## Game Tackling Metrics
This function creates in-game tackling metrics for the defensive players. For each play, each player is assigned the cumulative tackles, assists, missed tackles, and forced fumbles recorded on earlier plays of the same game. A tackle efficiency metric is also calculated: the share of tackle opportunities (tackles + assists + missed tackles) on which the player made a tackle or assist. Lastly, a rough tackle rating builds on the efficiency metric by weighting each outcome: a tackle counts 1, an assist 0.5, and a forced fumble 5. Both ratios are NaN for a player with no earlier opportunities in the game, because the denominator is zero. The function reads the tackles dataframe from the notebook's global scope.

In [1]:
def ingame_tackling(df_week):
    
    def ingame_stats(row, tackles):
        game_id = row['gameId']
        play_id = row['playId']
        nfl_id = row['nflId']
        
        # For the gameId and nflId, get cumulative values for tackling metrics up until that play in the game
        cumulative_tackling = tackles[(tackles['gameId'] == game_id) & (tackles['nflId'] == nfl_id) & (tackles['playId'] < play_id)][['tackle', 'assist', 'forcedFumble', 'pff_missedTackle']].sum()
        return cumulative_tackling['tackle'], cumulative_tackling['assist'], cumulative_tackling['forcedFumble'], cumulative_tackling['pff_missedTackle']
    
    # Only apply above function for the minimum frame to speed up processing
    min_frame_indices = df_week.groupby(['gameId', 'playId', 'nflId'])['frameId'].idxmin()
    min_frame_data = df_week.loc[min_frame_indices]

    # Apply the function to the week dataframe and rename the columns
    cumulative_stats = min_frame_data.apply(lambda row: ingame_stats(row, tackles), axis = 1, result_type='expand')
    cumulative_stats.columns = ['tackles_ingame', 'assists_ingame', 'ff_ingame', 'misses_ingame']
    
    # Create other in game metrics
    cumulative_stats['tackle_efficiency_ingame'] = (cumulative_stats['tackles_ingame'] + cumulative_stats['assists_ingame']) / (cumulative_stats['tackles_ingame'] + cumulative_stats['assists_ingame'] + cumulative_stats['misses_ingame'])
    
    # Create a weighted tackle rating for in game stats
    cumulative_stats['tackle_rating_ingame'] = (cumulative_stats['tackles_ingame'] + cumulative_stats['assists_ingame'] * .5 + cumulative_stats['ff_ingame'] * 5) / (cumulative_stats['tackles_ingame'] + cumulative_stats['assists_ingame'] + cumulative_stats['misses_ingame'])
    
    # Concatenate cumulative game results with gameId, playId, and nflId
    cumulative_stats = pd.concat([min_frame_data[['gameId', 'playId', 'nflId']], cumulative_stats], axis = 1) #.fillna(0) #***This will replace nflId for ball from nan to 0***
    
    return cumulative_stats
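
The row-wise lookup above filters the tackles table once per player and play. A possible vectorized alternative is a grouped cumulative sum, shown here on a toy table with hypothetical values. Note the assumption: this sketch only covers plays that appear in the table itself, whereas the function above looks up any play in the tracking data.

```python
import pandas as pd

# Toy tackles table (hypothetical values): one row per player per play
toy = pd.DataFrame({
    "gameId": [1, 1, 1, 1],
    "nflId": [7, 7, 7, 7],
    "playId": [10, 20, 30, 40],
    "tackle": [1, 0, 1, 0],
    "pff_missedTackle": [0, 1, 0, 0],
})

toy = toy.sort_values(["gameId", "nflId", "playId"])
grouped = toy.groupby(["gameId", "nflId"])
# cumsum() includes the current play, so subtract it to count strictly earlier plays
toy["tackles_ingame"] = grouped["tackle"].cumsum() - toy["tackle"]
toy["misses_ingame"] = grouped["pff_missedTackle"].cumsum() - toy["pff_missedTackle"]
print(toy[["playId", "tackles_ingame", "misses_ingame"]])
```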

## Rolling Tackling Metrics
This creates rolling tackling metrics for the defensive players over their previous three games. Each player is assigned the tackles, assists, missed tackles, and forced fumbles summed over the three games before the current one (fewer when less history is available, and 0 for a player's first game). The function below returns only these sums; the tackle efficiency and tackle rating described for the in-game metrics can be derived from them in the same way.

In [5]:
def rolling_tackling():
  
    # Subset and merge the games and tackles data
    g = games[['gameId', 'week']]
    tackle_history = g.merge(tackles, how = 'left', on = 'gameId')
    
    tackles_weekly = tackle_history.groupby(['week', 'gameId', 'nflId'])[['tackle', 'assist', 'forcedFumble', 'pff_missedTackle']].sum().reset_index()
    
    def rolling_sums(row, window_size = 3):
        # Sort values by week to ensure the rolling window follows chronological order
        row = row.sort_values('week')
        
        # Calculate the rolling sums
        row['rolling_tackles'] = row['tackle'].rolling(window = window_size, min_periods = 1).sum().shift()
        row['rolling_assists'] = row['assist'].rolling(window = window_size, min_periods = 1).sum().shift()
        row['rolling_ff'] = row['forcedFumble'].rolling(window = window_size, min_periods = 1).sum().shift()
        row['rolling_mt'] = row['pff_missedTackle'].rolling(window = window_size, min_periods = 1).sum().shift()
        
        return row
    
    # Group by nflId without prepending group keys to the result index
    df_rolling = tackles_weekly.groupby('nflId', group_keys = False).apply(rolling_sums)
    df_rolling = df_rolling[['gameId', 'nflId', 'rolling_tackles', 'rolling_assists', 'rolling_ff', 'rolling_mt']].fillna(0)
    
    return df_rolling
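
To illustrate why `.shift()` follows the rolling sum, here is a small sketch on a toy series of per-game tackle counts (hypothetical values). The shift moves each window down one row, so a game's own stats never leak into its feature.

```python
import pandas as pd

# Hypothetical tackles per game for one player, indexed by week
weekly = pd.Series([2, 0, 3, 1, 4], index=[1, 2, 3, 4, 5], name="tackle")

# Sum over up to three games, then shift so each row only sees earlier games
rolling_prev = weekly.rolling(window=3, min_periods=1).sum().shift().fillna(0)
print(rolling_prev.tolist())
```

The first game has no history, so it receives 0 after `fillna(0)`.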

## Game Misc Attributes

From the external nfl_data_py play-by-play data, we bring in game-level information: the surface the game was played on and whether it was played inside or outside.

In [1]:
# game_miscs: function to obtain the playing surface and indoor/outdoor status of each game
# input: none (uses the global games dataframe and downloads 2022 play-by-play data with nfl_data_py)
# output: dataframe of game data with gameId, categorical surface type (all types of turf are collapsed into
#         "turf", anything else into "grass"), and categorical variable for indoor/outdoor
# usage: games.merge(game_miscs(), on = "gameId", how = "left")
def game_miscs():
    import nfl_data_py as nfl

    # Import data and get only relevant columns
    df_2022 = nfl.import_pbp_data([2022])
    df_2022 = df_2022[['home_team', 'away_team', 'week', 'roof', 'surface']].drop_duplicates()
    
    # Join onto games dataset to combine with information we imported
    games_misc = games.merge(df_2022, how = 'inner', left_on = ['week', 'homeTeamAbbr', 'visitorTeamAbbr'], right_on = ['week', 'home_team', 'away_team'])
    
    # Transform roof variable to inside or outside and get relevant columns
    games_misc['inside_outside'] = games_misc.apply(lambda x: 'inside' if x['roof'] in ['dome', 'closed'] else 'outside', axis = 1)
    games_misc = games_misc[['gameId', 'surface', 'inside_outside']].copy()

    # All turf surfaces become turf; anything else, including missing values, becomes grass
    games_misc['surface'] = games_misc['surface'].apply(lambda x: 'turf' if isinstance(x, str) and 'turf' in x else 'grass')
    
    return games_misc
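
The two recodings can also be written with vectorized pandas string and membership methods. The sketch below uses a toy frame whose roof and surface values are illustrative assumptions, not a listing of what nfl_data_py returns.

```python
import pandas as pd

# Toy game rows (hypothetical values), including a missing surface
toy_games = pd.DataFrame({
    "roof": ["dome", "outdoors", "closed", "open"],
    "surface": ["fieldturf", "grass", None, "a_turf"],
})

# Roofs listed as dome or closed count as inside; everything else is outside
inside = toy_games["roof"].isin(["dome", "closed"])
toy_games["inside_outside"] = inside.map({True: "inside", False: "outside"})

# na=False treats a missing surface as "not turf", so it falls through to grass
is_turf = toy_games["surface"].str.contains("turf", na=False)
toy_games["surface"] = is_turf.map({True: "turf", False: "grass"})
print(toy_games)
```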

## Play Types

The play_type variable identifies the type of play: pass, run, qb_run (designed run or scramble), or other. In the data-investigation.ipynb notebook, we found that every pass in the data was caught by an offensive player, so any play whose tracking data contains the event "pass_outcome_caught", "lateral", or "autoevent_passforward" is labeled "pass". We also found that every play with a "handoff" tracking event was a designed run by someone other than the player who received the snap; these are labeled "run". Plays with a "run" tracking event are those where the player who received the snap (usually the QB) either scrambled or had a designed run; these are labeled "qb_run". Any play with none of these tracking events is labeled "other". The checks are applied in that order, so a play containing both a pass event and a handoff is labeled "pass".

In [9]:
# play_type: takes the plays and tracking data to obtain a data frame with a unique identifier for a given play
#            and the type of play that was run on that play
# input: plays and tracking dataframes
# output: data frame with "gameId", "playId", "play_type"
# usage: tracking.merge(play_type(plays,tracking), on = ["gameId","playId"])
def play_type(plays,tracking):
    # Create a function to determine play type
    def determine_play_type(play_data):
        if any(np.isin(play_data["event"].values, ["pass_outcome_caught", "lateral", "autoevent_passforward"])):
            return "pass"
        elif "handoff" in play_data["event"].values:
            return "run"
        elif "run" in play_data["event"].values:
            return "qb_run"
        else:
            return "other"
        
        
    plays_tracking = plays.merge(tracking, on = ["gameId", "playId"])
    #Drop duplicates, just need gameId,playId,frameId, event, ball_carrier position
    bc_event = plays_tracking[plays_tracking["event"].notna()][['gameId', 'playId', 'frameId', 'event']].drop_duplicates().reset_index(drop=True)
    # Group by 'gameId' and 'playId' and apply the function to each group
    result = bc_event.groupby(['gameId', 'playId']).apply(determine_play_type).reset_index()
    result.columns = ['gameId', 'playId', 'play_type']
    
    return result
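
The order of the checks matters when a play contains more than one of these events. As a small sketch, the same priority rule can be applied to plain lists of event names. `classify_events` is a hypothetical standalone helper, and the event lists are toy inputs.

```python
PASS_EVENTS = {"pass_outcome_caught", "lateral", "autoevent_passforward"}

def classify_events(events):
    """Label a play from its tracking events using the same priority as determine_play_type."""
    events = set(events)
    if events & PASS_EVENTS:      # any pass-type event wins first
        return "pass"
    if "handoff" in events:       # then designed runs via handoff
        return "run"
    if "run" in events:           # then runs by the player who took the snap
        return "qb_run"
    return "other"

print(classify_events(["handoff", "tackle"]))
print(classify_events(["handoff", "pass_outcome_caught", "tackle"]))
print(classify_events(["run", "out_of_bounds"]))
print(classify_events(["fumble"]))
```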

## Defense Formations

In [10]:
# defense_formation: Takes a play and counts the number of defensive linemen, linebackers, and defensive backs on a
#                    given play based on the position categorized for the player
# Input: plays, tackles, tracking, players data
# Output: data frame with "gameId", "playId","DL", "LB","DB" and the corresponding counts on the play
# Usage: tracking.merge(defense_formation(plays,tackles, tracking,players), on = ["gameId", "playId"])
def defense_formation(plays,tackles, tracking,players):
    
    # NT (nose tackle) is a defensive line position, so it is counted with the linemen.
    # RB and WR are offensive positions; they are presumably listed for offensive players lined up on defense.
    dlinemen = ['DT','DE','NT']
    linebackers = ['OLB','ILB','MLB', 'RB']
    dbacks = ['CB','FS', 'SS', 'DB', 'WR']

    # Categorize defenders 
    def defense_pos(pos):
        if pos in dlinemen:
            return 'DL'
        elif pos in linebackers:
            return 'LB'
        elif pos in dbacks:
            return 'DB'

    def count_positions(play):

        counts = play['positionCat'].value_counts()
        dl = counts.get('DL', 0)
        lb = counts.get('LB', 0)
        db = counts.get('DB', 0)

        # Create the defense formation string
        defense_formation = f"{dl} - {lb} - {db}"

        return pd.Series({
            'DL': dl,
            'LB': lb,
            'DB': db,
            'defFormation': defense_formation
        })
    
    # Subset the datasets 
    plays_sub = plays[['gameId','playId','defensiveTeam','playDescription']]
    tackles_sub = tackles[['gameId','playId','nflId','tackle','assist']]
    tracking_sub = tracking[['gameId','playId','nflId','displayName','club']]
    players_sub = players[['nflId','displayName','position']]

    # Merge the Datasets
    inter = plays_sub.merge(tracking_sub, how='inner', on=['gameId','playId'])
    df = inter.merge(players_sub, how='left', on=['nflId'])

    # Filter for defense only 
    defense = df[df['club'] == df['defensiveTeam']].drop_duplicates()
    
    defense['positionCat'] = defense['position'].apply(defense_pos)

    #Apply the function to each group
    position_counts = defense.groupby(['gameId', 'playId']).apply(count_positions).reset_index()

    # # Merge the position counts back into the df DataFrame
    new_df = df.merge(position_counts, on=['gameId', 'playId'], how='left')
    
    return new_df[["gameId", "playId","DL", "LB","DB"]].drop_duplicates()
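
Counting categories per play is a cross-tabulation, so one possible alternative to the grouped `apply` is `pd.crosstab`. The sketch below runs on a toy frame with hypothetical rows and assumes the `positionCat` column has already been assigned.

```python
import pandas as pd

# Toy defenders (hypothetical): three on play 1, two on play 2
toy = pd.DataFrame({
    "gameId": [1, 1, 1, 1, 1],
    "playId": [1, 1, 1, 2, 2],
    "positionCat": ["DL", "LB", "DB", "DB", "DB"],
})

# Rows are (gameId, playId); columns are the position categories; cells are counts
counts = pd.crosstab([toy["gameId"], toy["playId"]], toy["positionCat"]).reset_index()
print(counts)
```

Categories that do not occur on a play are filled with 0 when they appear on any other play in the data.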

## Offense Formation

In [11]:
# offense_formation: Takes a play and counts the number of WR, QB, TE, RB, OL on a given play based on the 
#                    position categorized for the player
# Input: plays, tackles, tracking, players data
# Output: data frame with "gameId", "playId","WR", "QB","TE", "RB", "OL" and the corresponding counts on the play
# Usage: tracking.merge(offense_formation(plays,tackles, tracking,players), on = ["gameId", "playId"])
def offense_formation(plays,tackles, tracking,players):
    
    QB = ['QB']
    RB = ["RB","FB"]
    WR = ["WR"]
    TE = ["TE"]
    OL = ["G", "C", "T","ILB", "OLB", "MLB", "DT"]

    # Categorize offensive players
    def defense_pos(pos):
        if pos in RB:
            return 'RB'
        elif pos in OL:
            return 'OL'
        else:
            return pos

    def count_positions(play):

        counts = play['positionCat'].value_counts()
        Qb = counts.get('QB', 0)
        Rb = counts.get('RB', 0)
        Wr = counts.get('WR', 0)
        Te = counts.get('TE', 0)
        Ol = counts.get('OL', 0)


        return pd.Series({
            'QB': Qb,
            'RB': Rb,
            'WR': Wr,
            'TE': Te,
            'OL': Ol
        })
    
    # Subset the datasets 
    plays_sub = plays[['gameId','playId','possessionTeam','playDescription']]
    tackles_sub = tackles[['gameId','playId','nflId','tackle','assist']]
    tracking_sub = tracking[['gameId','playId','nflId','displayName','club']]
    players_sub = players[['nflId','displayName','position']]

    # Merge the Datasets
    inter = plays_sub.merge(tracking_sub, how='inner', on=['gameId','playId'])
    df = inter.merge(players_sub, how='left', on=['nflId'])

    # Filter for offense only
    offense = df[df['club'] == df['possessionTeam']].drop_duplicates()
    
    offense['positionCat'] = offense['position'].apply(defense_pos)

    #Apply the function to each group
    position_counts = offense.groupby(['gameId', 'playId']).apply(count_positions).reset_index()

    # # Merge the position counts back into the df DataFrame
    new_df = df.merge(position_counts, on=['gameId', 'playId'], how='left')
    
    return new_df[["gameId", "playId","QB", "RB","WR", "TE", "OL"]].drop_duplicates()

## Game Time in Seconds

The purpose of this function is to calculate the number of seconds elapsed since the start of the game. The first play of the game, starting at 15:00 in Q1, is at 0 seconds. 15:00 in Q2 is 900 seconds, 15:00 in Q3 is 1800 seconds, and 15:00 in Q4 is 2700 seconds. 0:00 in Q4 is 3600 seconds. Lastly, overtime starts at 10:00 on the clock, which is 3600 seconds, and runs to a maximum value of 4200 seconds.

In [12]:
# nfl_clock_to_seconds: takes a row of data and changes the game clock with its quarter to a corresponding time value 
#                       in seconds from the start of the game, i.e. 0 seconds at the start, and 3600 at the end of Q4
# input: row of data
# output: computed total seconds from the start of the game given the gameClock and quarter variable from plays data
# usage: plays["timeSinceStart"] = plays.apply(nfl_clock_to_seconds,axis = 1)
def nfl_clock_to_seconds(row):
    # Minutes elapsed before the current period began (Q1-Q4 are 15 minutes each; 5 = overtime)
    quarter = int(row["quarter"])
    quarter_minutes = (quarter - 1) * 15

    # Split the clock ("MM:SS") into minutes and seconds
    clock_parts = row["gameClock"].split(":")

    minutes = int(clock_parts[0])
    seconds = int(clock_parts[1])

    # Regulation quarters last 900 seconds; overtime lasts 600 seconds
    period_length = 900 if quarter != 5 else 600

    # Calculate the total time in seconds
    total_seconds = quarter_minutes*60 + (period_length - (minutes*60 + seconds))

    return total_seconds
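
The same conversion can be done for a whole column at once. The sketch below reproduces the worked values from the description on a toy frame. It assumes gameClock is always formatted as minutes:seconds and that quarter 5 denotes overtime.

```python
import pandas as pd

# Toy plays matching the examples in the description
toy_plays = pd.DataFrame({
    "quarter": [1, 2, 4, 5],
    "gameClock": ["15:00", "15:00", "0:00", "10:00"],
})

# Split "MM:SS" into two integer columns and compute seconds left in the period
parts = toy_plays["gameClock"].str.split(":", expand=True).astype(int)
remaining = parts[0] * 60 + parts[1]

# Regulation periods last 900 seconds; overtime lasts 600 seconds
period_length = toy_plays["quarter"].map(lambda q: 600 if q == 5 else 900)
toy_plays["timeSinceStart"] = (toy_plays["quarter"] - 1) * 900 + (period_length - remaining)
print(toy_plays)
```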

## presnapDefenseWinProbability and home

The following function takes the games, tracking, and plays data and returns the preSnapWinProbabilityDefense and home features. preSnapWinProbabilityDefense expresses the pre-snap win probability in terms of the defense rather than home vs. away. home is a binary variable indicating whether the player is on the home team.

In [13]:
# presnapDefenseWinProbability: takes games, tracking, and plays to calculate preSnapWinProbabilityDefense and identify
#                               whether the player is playing for the home team
# input: games, tracking, plays
# output: a dataframe of gameId, playId, nflId, frameId, home binary variable, preSnapWinProbabilityDefense
# usage: tracking.merge(presnapDefenseWinProbability(games, tracking, plays), on = ["gameId", "playId", "nflId", "frameId"])
def presnapDefenseWinProbability(games, tracking, plays):
    merged = pd.merge(games, plays, on = 'gameId', how = 'inner') #merge games and plays
    merged = pd.merge(merged, tracking, on = ['gameId','playId'], how = 'inner') #merge games, plays, tracking

    #need to know the player's club, preSnapHomeTeamWinProbability, who the home team is, and who is on defense
    #create home variable: 1 if the player's club is the home team
    merged["home"]=(merged["club"]== merged["homeTeamAbbr"]).astype(int)
    #create preSnapWinProbabilityDefense: if homeTeamAbbr == defensiveTeam use the home team win prob, else use 1 - home team win prob
    merged["preSnapWinProbabilityDefense"] = merged.\
                apply(lambda row: row['preSnapHomeTeamWinProbability'] if row["homeTeamAbbr"]==row["defensiveTeam"] else
                     1 - row['preSnapHomeTeamWinProbability'], axis = 1)

    #return gameId, playId, nflId home, preSnapWinProbabilityDefense to merge for feature
    return merged[["gameId","playId","nflId","frameId","home","preSnapWinProbabilityDefense"]]
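
The row-wise `apply` above can be slow on frame-level tracking data. As a hedged sketch, the same rule can be expressed with `np.where` on whole columns. The team abbreviations and probabilities below are toy values.

```python
import numpy as np
import pandas as pd

# Toy plays (hypothetical): the home team is on defense in the first row only
toy = pd.DataFrame({
    "homeTeamAbbr": ["KC", "KC"],
    "defensiveTeam": ["KC", "BUF"],
    "preSnapHomeTeamWinProbability": [0.7, 0.7],
})

home_on_defense = toy["homeTeamAbbr"] == toy["defensiveTeam"]
# Use the home probability when the home team defends, otherwise its complement
toy["preSnapWinProbabilityDefense"] = np.where(
    home_on_defense,
    toy["preSnapHomeTeamWinProbability"],
    1 - toy["preSnapHomeTeamWinProbability"],
)
print(toy)
```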

## Dependent variable: Tackle

For the dependent variable, we create four candidate variables so that we can decide later which one to use:
1. tackle_binary_all: 0/1, assigned to the player who made a tackle or assist, on every frame of the play
2. tackle_nonbinary_all: 0/0.5/1 (1 for a tackle, 0.5 for an assist), assigned on every frame of the play
3. tackle_binary_single: 0/1, assigned only on the frame where the tackle occurred
4. tackle_nonbinary_single: 0/0.5/1, assigned only on the frame where the tackle occurred

The frame where the tackle occurred is identified by the tracking events "tackle", "out_of_bounds", "fumble", "qb_slide", or "safety".

In [14]:
#Function to obtain appropriate dependent variable
#Input: tackles, tracking data
#Output: data frame with gameId, playId, frameId, nflId, tackle_binary_all, tackle_binary_single, tackle_nonbinary_all, tackle_nonbinary_single
def tackle_dependent_variable(tackles,tracking):
    
    #merge tracking and tackles
    merged = pd.merge(tracking, tackles, on = ['gameId','playId', 'nflId'], how = 'left')   
    
    #create a new variable called tackle_binary_all for the player who made the tackle or assist on a given play to have a value of 1 and 0 otherwise
    merged["tackle_binary_all"] = merged.apply(lambda row: 1 if row['tackle'] == 1 or row['assist'] == 1 else 0, axis=1)
    print("done tackle_binary_all")

    #create a new variable called tackle_binary_single for the player who made the tackle on at the instance they made the tackle have a value of 1 and 0 otherwise
    merged["tackle_binary_single"] = merged.apply(
        lambda row: 1 if ((row['tackle'] == 1 or row['assist'] == 1) and 
                          (row["event"] == "tackle" or row["event"] == "out_of_bounds" or row["event"]=="fumble" or row["event"]=="qb_slide" or row["event"]=="safety")) else 0,
        axis=1)    
    print("done tackle_binary_single")
    
    #create a new variable called tackle_nonbinary_all for the player who made the tackle or assist on a given play
    #to have a value of 0,0.5,1 depending if it was a tackle (value of 1)  or an assist (value of 0.5)
    merged["tackle_nonbinary_all"] = merged.apply(lambda row: 1 if row['tackle'] == 1 else (0.5 if row['assist'] == 1 else 0), axis=1)
    print("done tackle_nonbinary_all")
    
    
    #create a new variable called tackle_nonbinary_single for the player who made the tackle or assist on a given play
    #to have a value of 0,0.5,1 depending if it was a tackle (value of 1)  or an assist (value of 0.5) at the instance of a tackle occurring
    merged["tackle_nonbinary_single"] = merged.apply(
        lambda row: 1 if (row['tackle'] == 1 and 
                          (row["event"] == "tackle" or row["event"] == "out_of_bounds" or row["event"] =="fumble" or row["event"]=="qb_slide" or row["event"]=="safety"))
        else (0.5 if (row['assist'] == 1 and 
                      (row["event"] == "tackle" or row["event"] == "out_of_bounds" or row["event"] =="fumble" or row["event"]=="qb_slide" or row["event"]=="safety")) else 0),
        axis=1)
    print("done tackle_nonbinary_single")

    #return gameId, playId, frameId, nflId, tackle
    return merged[["gameId","playId","frameId","nflId",
                   "tackle_binary_all","tackle_binary_single", "tackle_nonbinary_all", "tackle_nonbinary_single"]]

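The four row-wise `apply` calls above can take a long time on full tracking data. The sketch below shows one way the same four variables might be built from boolean masks, on a toy merged frame with hypothetical rows. Missing tackle and assist values, as produced by the left merge, are treated as 0.

```python
import pandas as pd

END_EVENTS = ["tackle", "out_of_bounds", "fumble", "qb_slide", "safety"]

# Toy merged rows (hypothetical): tackler, assister, uninvolved player, tackler on a non-event frame
toy = pd.DataFrame({
    "tackle": [1, 0, None, 1],
    "assist": [0, 1, None, 0],
    "event": ["tackle", "tackle", "tackle", None],
})

made_tackle = toy["tackle"].eq(1)      # NaN compares as False
made_assist = toy["assist"].eq(1)
at_end = toy["event"].isin(END_EVENTS)

toy["tackle_binary_all"] = (made_tackle | made_assist).astype(int)
toy["tackle_binary_single"] = ((made_tackle | made_assist) & at_end).astype(int)
# 1 for a tackle, otherwise 0.5 for an assist, otherwise 0
toy["tackle_nonbinary_all"] = made_tackle * 1.0 + (~made_tackle & made_assist) * 0.5
toy["tackle_nonbinary_single"] = toy["tackle_nonbinary_all"] * at_end
print(toy)
```
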
## Ball Carrier Data

The following function attaches key data about the ball carrier to every tracking frame of each play. This includes the ball carrier's weight (in lbs), x and y coordinates on the field, speed, acceleration, orientation, direction, force, and position.

In [15]:
# ballCarrierData: takes plays, tracking, and players data to compute important data to include ball carrier data
#                  for each tracking aspect in our data
# input: plays, tracking, players
# output: a dataframe of all ball carrier information on a given play and frame
# usage: tracking.merge(ballCarrierData(plays,tracking,players), on = ["gameId", "playId", "frameId"])
def ballCarrierData(plays,tracking,players):
    plays_tracking = tracking.merge(plays, on = ['gameId','playId'], how = 'inner') #merge plays and tracking

    #subset data to obtain all information of ball carriers
    ball_carrier_tracking = plays_tracking[plays_tracking["nflId"]==plays_tracking["ballCarrierId"]]

    ball_carrier_tracking = ball_carrier_tracking.merge(players[["nflId", "weight", "position"]], on = "nflId", how = "left")

    ball_carrier_tracking = ball_carrier_tracking[["gameId", "playId", "frameId","x", "y", "s", "a", "o", "dir", "weight", "position"]]
    
    ball_carrier_tracking["bcForce"] = (ball_carrier_tracking["weight"]/2.2)*ball_carrier_tracking["a"]

    ball_carrier_tracking.rename(columns = {"x":"bcx", "y": "bcy", "s": "bcs", "a": "bca", "o": "bco", "dir":"bcdir",
                                           "weight": "bcweight", "position":"bcPosition"}, inplace = True)

    # return dataframe keyed on gameId, playId, and frameId with the ball carrier info
    # merge with tracking on those three columns
    return ball_carrier_tracking

## Force and Mass

The following function computes the mass of a given player from the listed weight in pounds in the players dataframe: mass in kg is approximated as weight/2.2. The force is then calculated as F = mass x acceleration. Because the tracking acceleration is in yards/sec^2, the resulting force is in kg·yards/sec^2 rather than newtons. The function returns a dataframe that can be merged back onto the tracking data.

In [16]:
# calculate_mass_and_force: takes the tracking and players data and computes the mass and force. The mass is calculated
#                       based on the player's listed weight. Force is then calculated based on acceleration and mass (F = ma)
# input: tracking, players dataframes
# output: dataframe of gameId, playId, nflId, frameId, mass, force where the dataframe can be merged with the Id values
# usage: tracking = tracking.merge(calculate_mass_and_force(tracking, players), on = ["gameId", "playId", "nflId", "frameId"])
def calculate_mass_and_force(tracking, players):
    # Join tracking_df and players_df based on nflId
    tracking_players_df = pd.merge(tracking, players, on='nflId', how='left')

    # Calculate mass (assuming weight is in pounds, converting to kilograms)
    tracking_players_df['mass'] = tracking_players_df['weight'] / 2.2

    # Calculate force (assuming 'a' represents acceleration)
    tracking_players_df['force'] = tracking_players_df['mass'] * tracking_players_df['a']

    # Select the desired columns (mass is included to match the documented output)
    result_df = tracking_players_df[['gameId', 'playId', 'nflId','frameId', 'mass', 'force']]

    return result_df
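
If a force in newtons is wanted, both the weight and the acceleration need converting to SI units. The sketch below is a worked example under that assumption. `force_newtons` is a hypothetical helper, the input values are made up, and the two constants are the exact definitions of the pound and the yard.

```python
LB_TO_KG = 0.45359237   # exact definition of the avoirdupois pound in kilograms
YARD_TO_M = 0.9144      # exact definition of the yard in metres

def force_newtons(weight_lbs, accel_yards_per_s2):
    """Return force in newtons from weight in pounds and acceleration in yards/sec^2."""
    mass_kg = weight_lbs * LB_TO_KG
    accel_m_per_s2 = accel_yards_per_s2 * YARD_TO_M
    return mass_kg * accel_m_per_s2

# A 220 lb player accelerating at 3 yards/sec^2 (toy values)
print(round(force_newtons(220, 3.0), 1))
```

The conversion only rescales the feature by a constant factor, so it matters for interpretability rather than for most models.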