# Data Prep

This notebook uses the feature functions from Feature-Engineering.ipynb and the data cleaning functions from Data-Preprocessing.ipynb to create cleaned training, validation, and testing data. At the end we split the data into 70% training, 15% validation, and 15% testing dataframes for model building and write them to .csv files.

## Run Preliminary Notebook Functions

The following cells run the Jupyter notebooks that define the functions needed to create features and clean the data, and import the required libraries. We then load the data and combine the weekly tracking files.

In [1]:
#Run notebooks that define the functions used to prepare the data
%run Feature-Engineering.ipynb
%run Data-Preprocessing.ipynb

In [2]:
#Import libraries
import pandas as pd
import numpy as np
import os
import warnings
import time
from sklearn.model_selection import GroupShuffleSplit

warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)


#Import Data
games = pd.read_csv("../Data/games.csv")
players = pd.read_csv("../Data/players.csv")
plays = pd.read_csv("../Data/plays.csv")
tackles = pd.read_csv("../Data/tackles.csv")

#Read in all tracking data
tracking_1 = pd.read_csv("../Data/tracking_week_1.csv")
tracking_2 = pd.read_csv("../Data/tracking_week_2.csv")
tracking_3 = pd.read_csv("../Data/tracking_week_3.csv")
tracking_4 = pd.read_csv("../Data/tracking_week_4.csv")
tracking_5 = pd.read_csv("../Data/tracking_week_5.csv")
tracking_6 = pd.read_csv("../Data/tracking_week_6.csv")
tracking_7 = pd.read_csv("../Data/tracking_week_7.csv")
tracking_8 = pd.read_csv("../Data/tracking_week_8.csv")
tracking_9 = pd.read_csv("../Data/tracking_week_9.csv")

In [3]:
#Add column for week
tracking_1.insert(0,'Week',1)
tracking_2.insert(0,'Week',2)
tracking_3.insert(0,'Week',3)
tracking_4.insert(0,'Week',4)
tracking_5.insert(0,'Week',5)
tracking_6.insert(0,'Week',6)
tracking_7.insert(0,'Week',7)
tracking_8.insert(0,'Week',8)
tracking_9.insert(0,'Week',9)

In [4]:
#combine tracking
tracking = pd.concat([tracking_1,tracking_2,tracking_3,tracking_4,tracking_5,tracking_6,tracking_7,tracking_8,tracking_9], axis = 0).reset_index(drop = True)
display(tracking)

Unnamed: 0,Week,gameId,playId,nflId,displayName,frameId,time,jerseyNumber,club,playDirection,x,y,s,a,dis,o,dir,event
0,1,2022090800,56,35472.0,Rodger Saffold,1,2022-09-08 20:24:05.200000,76.0,BUF,left,88.370000,27.270000,1.62,1.15,0.16,231.74,147.90,
1,1,2022090800,56,35472.0,Rodger Saffold,2,2022-09-08 20:24:05.299999,76.0,BUF,left,88.470000,27.130000,1.67,0.61,0.17,230.98,148.53,pass_arrived
2,1,2022090800,56,35472.0,Rodger Saffold,3,2022-09-08 20:24:05.400000,76.0,BUF,left,88.560000,27.010000,1.57,0.49,0.15,230.98,147.05,
3,1,2022090800,56,35472.0,Rodger Saffold,4,2022-09-08 20:24:05.500000,76.0,BUF,left,88.640000,26.900000,1.44,0.89,0.14,232.38,145.42,
4,1,2022090800,56,35472.0,Rodger Saffold,5,2022-09-08 20:24:05.599999,76.0,BUF,left,88.720000,26.800000,1.29,1.24,0.13,233.36,141.95,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12187393,9,2022110700,3787,,football,40,2022-11-07 23:06:49.200000,,football,right,26.219999,19.680000,1.37,2.58,0.15,,,tackle
12187394,9,2022110700,3787,,football,41,2022-11-07 23:06:49.299999,,football,right,26.320000,19.610001,1.07,2.74,0.12,,,
12187395,9,2022110700,3787,,football,42,2022-11-07 23:06:49.400000,,football,right,26.389999,19.559999,0.80,2.49,0.09,,,
12187396,9,2022110700,3787,,football,43,2022-11-07 23:06:49.500000,,football,right,26.450001,19.520000,0.57,2.38,0.07,,,


In [5]:
print(len(tracking))

12187398
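
The read, tag, and concatenate steps above repeat the same three lines for each of the nine weeks. A minimal sketch of the same idea as a loop is shown here on toy frames; `combine_weekly_frames` and `toy_weeks` are hypothetical names used only for illustration and are not part of the notebook.

```python
import pandas as pd

def combine_weekly_frames(frames_by_week):
    """Tag each weekly frame with its week number and stack them.

    frames_by_week: dict mapping week number -> DataFrame (hypothetical input).
    """
    tagged = []
    for week, frame in sorted(frames_by_week.items()):
        frame = frame.copy()           # leave the caller's frame untouched
        frame.insert(0, "Week", week)  # same column position as in the cells above
        tagged.append(frame)
    return pd.concat(tagged, axis=0).reset_index(drop=True)

# Toy stand-ins for the weekly tracking files
toy_weeks = {
    1: pd.DataFrame({"gameId": [1, 1], "playId": [56, 56]}),
    2: pd.DataFrame({"gameId": [2], "playId": [80]}),
}
combined = combine_weekly_frames(toy_weeks)
print(combined)
```

With the real files, the dictionary could be built with a comprehension such as `{w: pd.read_csv(f"../Data/tracking_week_{w}.csv") for w in range(1, 10)}`, assuming the file naming used above.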


## Sample Data

Run these cells only when working with a sample. Leave this code commented out when running the functions on the entire data set.

In [6]:
# # #Create unique id column and move it to the first column in the dataframe
# tracking['gamePlayId'] = tracking.apply(lambda x: str(x['gameId']) + str(x['playId']), axis=1)
# #Keep Column
# gamePlayId = tracking['gamePlayId']
# #Drop the column
# tracking = tracking.drop(columns=['gamePlayId'])
# # Insert the column at the beginning
# tracking.insert(0, 'gamePlayId', gamePlayId)

In [7]:
# display(tracking)

In [8]:
# # #Run this cell for a sample:
# # # Keep roughly 1% of the plays, sampling whole plays by gamePlayId
# #Initialize GroupShuffleSplit
# gss = GroupShuffleSplit(n_splits=2, test_size=0.01, random_state=1)

# # Keep the small "test" side of the last split as the sample
# for train_idx, test_idx in gss.split(tracking, groups=tracking['gamePlayId']):
#     _, tracking_sample = tracking.iloc[train_idx], tracking.iloc[test_idx]
    
# tracking = tracking_sample
    
# print("Number of rows in tracking:",len(tracking_sample))
# print("Number columns in tracking:", tracking_sample.shape[1])
# print("Number of plays in data:",tracking_sample["gamePlayId"].nunique())
# print("Weeks in data:", tracking_sample["Week"].unique())

In [9]:
# tracking = tracking_sample

## Features and Data Cleaning

The following code cells create the tracking features and apply the data preprocessing steps. The functions are run in the order needed to obtain the proper datasets. Each cell prints the elapsed time, the latest timestamp in the tracking data, the number of rows, the number of columns, and the number of unique plays, so we can confirm that each function changed the data as expected.
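
The timing and size checks are repeated in every cell of this section. One way to cut down on that boilerplate is a pair of small helpers, sketched here on a toy frame. The names `timed` and `report` are hypothetical and are not defined in either source notebook.

```python
import time
from contextlib import contextmanager

import pandas as pd

@contextmanager
def timed(label):
    """Print the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f} seconds")

def report(df, id_col="gamePlayId"):
    """Print and return the row, column, and unique-play counts."""
    summary = {
        "rows": len(df),
        "columns": df.shape[1],
        "plays": df[id_col].nunique(),
    }
    print(summary)
    return summary

# Toy frame standing in for the tracking data
toy = pd.DataFrame({"gamePlayId": ["a", "a", "b"], "x": [1.0, 2.0, 3.0]})
with timed("Elapsed time"):
    toy["x2"] = toy["x"] * 2   # placeholder for a feature function
summary = report(toy)
```

Each feature cell would then reduce to a `with timed(...)` block followed by one `report(tracking)` call.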

In [6]:
# #####NOTE: Comment this out if running a sample
#Create unique id column and move it to the first column in the dataframe
start_time = time.time()
tracking['gamePlayId'] = tracking.apply(lambda x: str(x['gameId']) + str(x['playId']), axis=1)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",len(tracking["gamePlayId"].drop_duplicates()))

Elapsed time: 103.07761096954346 seconds
Lastest Date: 2022-11-07 23:06:49.599999
Number of rows in tracking: 12187398
Number columns in tracking: 19
Number of plays in data: 12486


In [7]:
#Features for changing the orientation and direction to unit circle
start_time = time.time()
tracking["unitDir"] = tracking["dir"].apply(orient_angle)
tracking["unitO"] = tracking["o"].apply(orient_angle)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 9.362496852874756 seconds
Lastest Date: 2022-11-07 23:06:49.599999
Number of rows in tracking: 12187398
Number columns in tracking: 21
Number of plays in data: 12486
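
The `orient_angle` function is defined in Feature-Engineering.ipynb and is not shown here. As a rough illustration only, the sketch below assumes the raw angles are measured clockwise from the positive y-axis and converts them to the usual unit-circle convention (counterclockwise from the positive x-axis). This assumed formula reproduces one right-moving row displayed later in this notebook (`dir` 219.17, `unitDir` 230.83), but it may not match the author's implementation. `to_unit_circle` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def to_unit_circle(angle_deg):
    """Assumed conversion: clockwise-from-+y degrees to counterclockwise-from-+x degrees."""
    return (90.0 - angle_deg) % 360.0

# Works on a whole column at once, so no per-row apply is needed
toy = pd.DataFrame({"dir": [219.17, 0.0, 90.0, np.nan]})
toy["unitDir"] = to_unit_circle(toy["dir"])
print(toy)
```

Missing angles, such as those on the football rows, stay missing under this conversion.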


In [8]:
#Standardize the tracking data
start_time = time.time()
tracking = standardize_field(tracking)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 3.8418948650360107 seconds
Lastest Date: 2022-11-07 23:06:49.599999
Number of rows in tracking: 12187398
Number columns in tracking: 21
Number of plays in data: 12486


In [9]:
#Remove football data as we are only concerned with the position of the ball carrier
start_time = time.time()
tracking = remove_football_frames(tracking)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 3.468153953552246 seconds
Lastest Date: 2022-11-07 23:06:49.599999
Number of rows in tracking: 11657338
Number columns in tracking: 21
Number of plays in data: 12483


In [10]:
#Remove plays with multiple tackles on the play
start_time = time.time()
tracking = remove_plays_with_mult_tackles(tracking,tackles)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 19.97579073905945 seconds
Lastest Date: 2022-11-07 23:06:49.599999
Number of rows in tracking: 11656766
Number columns in tracking: 21
Number of plays in data: 12482


In [11]:
#Remove plays with tracking issues
start_time = time.time()
tracking = remove_tracking_issues(tracking)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 15.296844720840454 seconds
Lastest Date: 2022-11-07 23:06:49.599999
Number of rows in tracking: 11654016
Number columns in tracking: 21
Number of plays in data: 12479


In [12]:
#Create feature for dependent variables
start_time = time.time()
tracking = tracking.merge(tackle_dependent_variable(tackles,tracking), on = ["gameId", "playId", "nflId", "frameId"])
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

tackle_single variable done
tackle_multiple done
Elapsed time: 147.77235794067383 seconds
Lastest Date: 2022-11-07 23:06:49.599999
Number of rows in tracking: 11654093
Number columns in tracking: 23
Number of plays in data: 12479


In [13]:
#Filter frames in tracking to only include desired rows
start_time = time.time()
tracking = filter_frames_by_events(tracking)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",len(tracking["gamePlayId"].drop_duplicates()))

Elapsed time: 34.7127206325531 seconds
Lastest Date: 2022-11-07 23:06:49.400000
Number of rows in tracking: 7916161
Number columns in tracking: 23
Number of plays in data: 12468


In [14]:
#Create feature for ball carrier data to each player
start_time = time.time()
tracking = tracking.merge(ballCarrierData(plays,tracking,players), on = ["gameId", "playId","frameId"])
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 11.881945610046387 seconds
Lastest Date: 2022-11-07 23:06:49.400000
Number of rows in tracking: 7916161
Number columns in tracking: 30
Number of plays in data: 12468


In [15]:
#Create feature for ball carrier distance to adj endzone
start_time = time.time()
tracking['bcy_adj'] = tracking.apply(bc_adj, axis = 1)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 50.85424995422363 seconds
Lastest Date: 2022-11-07 23:06:49.400000
Number of rows in tracking: 7916161
Number columns in tracking: 31
Number of plays in data: 12468


In [16]:
#Create feature for ball carrier distance to out of bounds
start_time = time.time()
tracking['bcy_toob'] = tracking.apply(bc_toob, axis = 1)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 69.83258867263794 seconds
Lastest Date: 2022-11-07 23:06:49.400000
Number of rows in tracking: 7916161
Number columns in tracking: 32
Number of plays in data: 12468


In [17]:
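#Create feature for calculating Force
#Note: the merged result is stored in tracking_force, so tracking itself (and the counts printed below) is unchanged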
start_time = time.time()
tracking_force = tracking.merge(calculate_force(tracking, players), on = ["gameId", "playId", "nflId", "frameId"])
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 18.008800745010376 seconds
Lastest Date: 2022-11-07 23:06:49.400000
Number of rows in tracking: 7916161
Number columns in tracking: 32
Number of plays in data: 12468


### Split Data by Week for Long Processing Times

The calculate_distance_angles and voronoi_tessellations functions take a long time to run (on the order of 40 and 20 minutes per week, respectively, in the output below), so we process the tracking data one week at a time. On each pass the loop reads the checkpoint file if it exists, runs both functions on the next unprocessed week, appends the result, and rewrites the file. This way, if the notebook crashes, we can continue where we left off.

In [33]:
week = 1
var_name = "tracking_dist_voronoi_df"

if var_name in locals():
    del tracking_dist_voronoi_df

while week!=10:
    if os.path.exists("../Data/tracking_dist_voronoi.csv"):
        tracking_dist_voronoi_df = pd.read_csv("../Data/tracking_dist_voronoi.csv") #Read in data
        last_week = tracking_dist_voronoi_df['Week'].unique().max()
        week = last_week + 1
        if week>9:
            print("File already completed")
            break
        print(f"File Found. Updating Data starting with Week {week}")
    else:
        print("File not Found. Creating File")
        
    #Run calculations on tracking
    tracking_week = tracking[tracking["Week"]==week] #subset data for week
    
    start_time = time.time()
    tracking_week = tracking_week.merge(calculate_distance_angles(tracking_week,plays), on = ["gameId", "playId", "nflId", "frameId"])
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Elapsed time for distance and angles: {elapsed_time} seconds")
    
    start_time = time.time()
    tracking_week = tracking_week.merge(voronoi_tessellations(tracking_week, plays), on = ['gameId','playId','frameId','nflId'])
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Elapsed time for voronoi tessellations: {elapsed_time} seconds")
    
    #write the file
    if var_name in locals():
        tracking_dist_voronoi_df = pd.concat([tracking_dist_voronoi_df, tracking_week], ignore_index = True).reset_index(drop = True)
        #Write file
        tracking_dist_voronoi_df.to_csv("../Data/tracking_dist_voronoi.csv", index = False)
        print(f"Week {week} complete")
    else:
        tracking_week.to_csv("../Data/tracking_dist_voronoi.csv", index = False)
        print(f"Week {week} complete")
        
    #increase week    
    week += 1
    
#read back in dataframe
tracking = pd.read_csv("../Data/tracking_dist_voronoi.csv")
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

File Found. Updating Data starting with Week 6.0
Elapsed time for distance and angles: 2335.6207156181335 seconds
Elapsed time for voronoi tessellations: 1040.1737208366394 seconds
Week 6.0 complete
File Found. Updating Data starting with Week 7.0
Elapsed time for distance and angles: 2382.8126168251038 seconds
Elapsed time for voronoi tessellations: 1087.5858709812164 seconds
Week 7.0 complete
File Found. Updating Data starting with Week 8.0
Elapsed time for distance and angles: 2773.75280213356 seconds
Elapsed time for voronoi tessellations: 1265.845834493637 seconds
Week 8.0 complete
File Found. Updating Data starting with Week 9.0
Elapsed time for distance and angles: 2282.665349006653 seconds
Elapsed time for voronoi tessellations: 1016.0367240905762 seconds
Week 9.0 complete
Number of rows in tracking: 8828716
Number columns in tracking: 55
Number of plays in data: 12468
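
The loop above re-reads and rewrites the whole checkpoint file on every pass. A lighter variant appends each finished week to the file and reads back only the `Week` column to decide where to resume. The sketch below shows the idea on toy data in a temporary directory. `append_week` and `next_week` are hypothetical helpers, and the approach assumes every week produces the same columns in the same order.

```python
import os
import tempfile

import pandas as pd

def append_week(week_df, path):
    """Append one processed week, writing the header only when the file is new."""
    write_header = not os.path.exists(path)
    week_df.to_csv(path, mode="a", header=write_header, index=False)

def next_week(path, first_week=1):
    """Return the first week that is not yet in the checkpoint file."""
    if not os.path.exists(path):
        return first_week
    done = pd.read_csv(path, usecols=["Week"])["Week"]
    return int(done.max()) + 1   # int() avoids labels such as "Week 6.0"

tmp_dir = tempfile.mkdtemp()
checkpoint = os.path.join(tmp_dir, "tracking_checkpoint.csv")

start = next_week(checkpoint)            # no file yet, so start at week 1
for week in range(start, 3):             # toy run over weeks 1 and 2
    toy_week = pd.DataFrame({"Week": [week, week], "value": [0.1, 0.2]})
    append_week(toy_week, checkpoint)

resume_from = next_week(checkpoint)      # where a restarted run would pick up
restored = pd.read_csv(checkpoint)
print(resume_from, len(restored))
```

A crash in the middle of a write can still leave a partial week in the file, so a duplicate check like the one in the next cell remains useful.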


In [34]:
#Remove any duplicates from the process
#Duplicates can occur if the code fails partway through and a week that was already written is run again
tracking_duplicates = len(tracking)
tracking = tracking.drop_duplicates(subset=['gamePlayId', 'nflId', 'frameId'])
print("Number of rows in removed from tracking due to duplicates:",tracking_duplicates - len(tracking))
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Number of rows in removed from tracking due to duplicates: 912632
Number of rows in tracking: 7916084
Number columns in tracking: 55
Number of plays in data: 12468


### Continue data cleaning

In [35]:
#Remove offensive players from the data
start_time = time.time()
tracking = remove_offensive_players(tracking,plays)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 6.859408140182495 seconds
Lastest Date: 2022-11-07 23:06:49.400000
Number of rows in tracking: 3958042
Number columns in tracking: 56
Number of plays in data: 12468


In [36]:
display(tracking)

Unnamed: 0,Week,gameId,playId,nflId,displayName,frameId,time,jerseyNumber,club,playDirection,x,y,s,a,dis,o,dir,event,gamePlayId,unitDir,unitO,tackle_single,tackle_multiple,bcx,bcy,bcs,bca,bcUnitO,bcUnitDir,bcForce,bcy_adj,bcy_toob,c1Dist,c2Dist,c3Dist,c4Dist,c5Dist,c6Dist,c7Dist,c8Dist,c9Dist,c10Dist,bcDist,c1Ang,c2Ang,c3Ang,c4Ang,c5Ang,c6Ang,c7Ang,c8Ang,c9Ang,c10Ang,bcAng,voronoi_min_dist_from_bc,defensiveTeam
1,1.0,2.022091e+09,56.0,38577.0,Bobby Wagner,6.0,2022-09-08 20:24:05.700000,45.0,LA,left,41.89,24.593333,3.35,2.62,0.32,349.47,357.71,pass_outcome_caught,202209080056,272.29,280.53,0.0,0.0,40.15,17.743333,4.61,4.82,155.73,67.80,418.463636,69.85,17.743333,3.195387,10.116071,10.461855,10.882909,12.035414,12.635874,12.701657,13.169210,14.799963,23.582173,7.067538,75.239538,89.059902,122.727777,103.094000,104.570954,166.159966,86.642246,80.903275,77.812451,158.729540,16.542527,3.948566,LA
2,1.0,2.022091e+09,56.0,41239.0,Aaron Donald,6.0,2022-09-08 20:24:05.700000,99.0,LA,left,27.85,23.373333,3.62,2.86,0.37,186.16,157.65,pass_outcome_caught,202209080056,112.35,83.84,0.0,0.0,40.15,17.743333,4.61,4.82,155.73,67.80,418.463636,69.85,17.743333,1.400321,1.783620,2.496898,3.993257,4.414386,4.674409,8.228657,17.168183,21.436532,32.008038,13.527265,113.577579,163.037991,150.980380,102.982031,53.447313,68.130076,59.944064,110.580937,72.972206,65.058378,136.944687,12.659069,LA
5,1.0,2.022091e+09,56.0,42816.0,Troy Hill,6.0,2022-09-08 20:24:05.700000,2.0,LA,left,49.38,45.673333,2.60,4.14,0.27,331.57,278.33,pass_outcome_caught,202209080056,351.67,298.43,0.0,0.0,40.15,17.743333,4.61,4.82,155.73,67.80,418.463636,69.85,17.743333,1.233207,10.014569,22.204274,22.838312,26.325539,26.712411,27.894992,30.064028,31.255438,33.017583,29.415605,89.937075,111.358111,93.020417,127.965101,125.346367,127.777247,120.762863,123.703959,122.414454,123.035494,99.957121,23.831392,LA
6,1.0,2.022091e+09,56.0,43294.0,Jalen Ramsey,6.0,2022-09-08 20:24:05.700000,5.0,LA,left,41.85,15.483333,5.88,1.23,0.59,140.96,178.50,pass_outcome_caught,202209080056,91.50,129.04,0.0,0.0,40.15,17.743333,4.61,4.82,155.73,67.80,418.463636,69.85,17.743333,8.993442,13.196030,14.422794,14.850576,15.279797,15.418982,16.539265,16.979061,21.643128,32.342421,2.828003,22.070958,48.171854,61.668459,56.543692,67.764015,42.186107,43.622479,30.430267,8.319643,15.291336,35.450938,0.000000,LA
7,1.0,2.022091e+09,56.0,43298.0,Leonard Floyd,6.0,2022-09-08 20:24:05.700000,54.0,LA,left,27.89,20.193333,1.34,2.21,0.13,159.12,203.53,pass_outcome_caught,202209080056,66.47,110.88,0.0,0.0,40.15,17.743333,4.61,4.82,155.73,67.80,418.463636,69.85,17.743333,0.773886,2.104305,3.431049,5.466160,7.240836,7.311580,10.903687,17.517377,23.554390,34.387191,12.502404,48.770529,7.667234,0.177919,21.988835,3.672038,5.689796,3.646058,54.242746,21.039989,15.533146,77.770945,5.949168,LA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7916074,9.0,2.022111e+09,3787.0,52627.0,Geno Stone,42.0,2022-11-07 23:06:49.400000,26.0,BAL,right,32.75,23.420000,4.24,3.01,0.44,243.34,219.17,,20221107003787,230.83,206.66,0.0,0.0,26.40,19.620000,0.54,4.13,337.78,324.53,418.631818,83.60,19.620000,4.904080,5.773664,7.497526,8.112854,9.377062,12.696661,12.721761,13.493024,14.067242,15.810003,7.400169,17.982457,129.442220,39.445132,34.360749,36.126867,15.512649,57.329346,132.777965,24.847280,50.866240,19.932635,6.124674,BAL
7916078,9.0,2.022111e+09,3787.0,53460.0,Odafe Oweh,42.0,2022-11-07 23:06:49.400000,99.0,BAL,right,22.20,24.350000,0.72,0.83,0.07,70.32,71.02,,20221107003787,18.98,19.68,0.0,0.0,26.40,19.620000,0.54,4.13,337.78,324.53,418.631818,83.60,19.620000,2.151325,3.625810,4.006008,4.255091,5.339850,7.364306,8.272182,10.531904,15.149274,16.185750,6.325575,147.306728,84.889153,55.964317,68.364100,170.940961,48.155513,107.663884,7.706698,36.155014,83.079860,67.376549,4.346443,BAL
7916079,9.0,2.022111e+09,3787.0,53533.0,Brandon Stephens,42.0,2022-11-07 23:06:49.400000,21.0,BAL,right,33.04,37.060000,5.58,0.78,0.56,189.29,187.02,,20221107003787,262.98,260.71,0.0,0.0,26.40,19.620000,0.54,4.13,337.78,324.53,418.631818,83.60,19.620000,2.197908,8.107114,16.886033,16.940602,17.777089,17.866407,18.553975,21.094713,23.528342,27.529363,18.661275,75.660981,3.139458,8.119057,19.787059,39.643914,19.831886,23.276448,42.729287,19.893548,0.851087,13.823528,14.865301,BAL
7916081,9.0,2.022111e+09,3787.0,54541.0,Travis Jones,42.0,2022-11-07 23:06:49.400000,98.0,BAL,right,24.89,19.600000,0.93,3.09,0.10,102.30,150.23,,20221107003787,299.77,347.70,0.0,0.0,26.40,19.620000,0.54,4.13,337.78,324.53,418.631818,83.60,19.620000,1.522104,1.880877,2.394932,3.915763,4.317453,7.107461,8.824477,10.743393,11.620189,18.187724,1.510132,147.217212,169.730411,137.934734,77.461822,65.153408,167.507156,145.492954,5.710042,114.898669,131.067809,60.988840,0.830782,BAL


## Clean Up Tracking Data

The following cells drop the columns that are not needed for model training. We will drop the following columns:

- displayName
- time
- jerseyNumber
- club
- playDirection
- o
- dir
- event
- defensiveTeam

We will also need to fill missing values in some of our variables.

In [37]:
#columns to drop
columns_to_drop = ['displayName','time','jerseyNumber','club','playDirection','o', 'dir', 'event', 'defensiveTeam']

#Return df with only desired columns
final_df = tracking.drop(columns=columns_to_drop)
display(final_df)

Unnamed: 0,Week,gameId,playId,nflId,frameId,x,y,s,a,dis,gamePlayId,unitDir,unitO,tackle_single,tackle_multiple,bcx,bcy,bcs,bca,bcUnitO,bcUnitDir,bcForce,bcy_adj,bcy_toob,c1Dist,c2Dist,c3Dist,c4Dist,c5Dist,c6Dist,c7Dist,c8Dist,c9Dist,c10Dist,bcDist,c1Ang,c2Ang,c3Ang,c4Ang,c5Ang,c6Ang,c7Ang,c8Ang,c9Ang,c10Ang,bcAng,voronoi_min_dist_from_bc
1,1.0,2.022091e+09,56.0,38577.0,6.0,41.89,24.593333,3.35,2.62,0.32,202209080056,272.29,280.53,0.0,0.0,40.15,17.743333,4.61,4.82,155.73,67.80,418.463636,69.85,17.743333,3.195387,10.116071,10.461855,10.882909,12.035414,12.635874,12.701657,13.169210,14.799963,23.582173,7.067538,75.239538,89.059902,122.727777,103.094000,104.570954,166.159966,86.642246,80.903275,77.812451,158.729540,16.542527,3.948566
2,1.0,2.022091e+09,56.0,41239.0,6.0,27.85,23.373333,3.62,2.86,0.37,202209080056,112.35,83.84,0.0,0.0,40.15,17.743333,4.61,4.82,155.73,67.80,418.463636,69.85,17.743333,1.400321,1.783620,2.496898,3.993257,4.414386,4.674409,8.228657,17.168183,21.436532,32.008038,13.527265,113.577579,163.037991,150.980380,102.982031,53.447313,68.130076,59.944064,110.580937,72.972206,65.058378,136.944687,12.659069
5,1.0,2.022091e+09,56.0,42816.0,6.0,49.38,45.673333,2.60,4.14,0.27,202209080056,351.67,298.43,0.0,0.0,40.15,17.743333,4.61,4.82,155.73,67.80,418.463636,69.85,17.743333,1.233207,10.014569,22.204274,22.838312,26.325539,26.712411,27.894992,30.064028,31.255438,33.017583,29.415605,89.937075,111.358111,93.020417,127.965101,125.346367,127.777247,120.762863,123.703959,122.414454,123.035494,99.957121,23.831392
6,1.0,2.022091e+09,56.0,43294.0,6.0,41.85,15.483333,5.88,1.23,0.59,202209080056,91.50,129.04,0.0,0.0,40.15,17.743333,4.61,4.82,155.73,67.80,418.463636,69.85,17.743333,8.993442,13.196030,14.422794,14.850576,15.279797,15.418982,16.539265,16.979061,21.643128,32.342421,2.828003,22.070958,48.171854,61.668459,56.543692,67.764015,42.186107,43.622479,30.430267,8.319643,15.291336,35.450938,0.000000
7,1.0,2.022091e+09,56.0,43298.0,6.0,27.89,20.193333,1.34,2.21,0.13,202209080056,66.47,110.88,0.0,0.0,40.15,17.743333,4.61,4.82,155.73,67.80,418.463636,69.85,17.743333,0.773886,2.104305,3.431049,5.466160,7.240836,7.311580,10.903687,17.517377,23.554390,34.387191,12.502404,48.770529,7.667234,0.177919,21.988835,3.672038,5.689796,3.646058,54.242746,21.039989,15.533146,77.770945,5.949168
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7916074,9.0,2.022111e+09,3787.0,52627.0,42.0,32.75,23.420000,4.24,3.01,0.44,20221107003787,230.83,206.66,0.0,0.0,26.40,19.620000,0.54,4.13,337.78,324.53,418.631818,83.60,19.620000,4.904080,5.773664,7.497526,8.112854,9.377062,12.696661,12.721761,13.493024,14.067242,15.810003,7.400169,17.982457,129.442220,39.445132,34.360749,36.126867,15.512649,57.329346,132.777965,24.847280,50.866240,19.932635,6.124674
7916078,9.0,2.022111e+09,3787.0,53460.0,42.0,22.20,24.350000,0.72,0.83,0.07,20221107003787,18.98,19.68,0.0,0.0,26.40,19.620000,0.54,4.13,337.78,324.53,418.631818,83.60,19.620000,2.151325,3.625810,4.006008,4.255091,5.339850,7.364306,8.272182,10.531904,15.149274,16.185750,6.325575,147.306728,84.889153,55.964317,68.364100,170.940961,48.155513,107.663884,7.706698,36.155014,83.079860,67.376549,4.346443
7916079,9.0,2.022111e+09,3787.0,53533.0,42.0,33.04,37.060000,5.58,0.78,0.56,20221107003787,262.98,260.71,0.0,0.0,26.40,19.620000,0.54,4.13,337.78,324.53,418.631818,83.60,19.620000,2.197908,8.107114,16.886033,16.940602,17.777089,17.866407,18.553975,21.094713,23.528342,27.529363,18.661275,75.660981,3.139458,8.119057,19.787059,39.643914,19.831886,23.276448,42.729287,19.893548,0.851087,13.823528,14.865301
7916081,9.0,2.022111e+09,3787.0,54541.0,42.0,24.89,19.600000,0.93,3.09,0.10,20221107003787,299.77,347.70,0.0,0.0,26.40,19.620000,0.54,4.13,337.78,324.53,418.631818,83.60,19.620000,1.522104,1.880877,2.394932,3.915763,4.317453,7.107461,8.824477,10.743393,11.620189,18.187724,1.510132,147.217212,169.730411,137.934734,77.461822,65.153408,167.507156,145.492954,5.710042,114.898669,131.067809,60.988840,0.830782


In [58]:
final_df.isna().sum()

Week                        0
gameId                      0
playId                      0
nflId                       0
frameId                     0
x                           0
y                           0
s                           0
a                           0
dis                         0
gamePlayId                  0
unitDir                     0
unitO                       0
tackle_single               0
tackle_multiple             0
bcx                         0
bcy                         0
bcs                         0
bca                         0
bcUnitO                     0
bcUnitDir                   0
bcForce                     0
bcy_adj                     0
bcy_toob                    0
c1Dist                      0
c2Dist                      0
c3Dist                      0
c4Dist                      0
c5Dist                      0
c6Dist                      0
c7Dist                      0
c8Dist                      0
c9Dist                      0
c10Dist   

In [38]:
#Fill missing voronoi tessellation values with the column mean (it is unknown why these are NA)
final_df["voronoi_min_dist_from_bc"] = final_df["voronoi_min_dist_from_bc"].fillna(final_df["voronoi_min_dist_from_bc"].mean())

## Split the Data

The following cells split our data into the training, validation, and testing sets needed for model training. We perform a 70-15-15 split, grouped by gamePlayId so that all frames from a play fall in the same set.

In [39]:
#Initialize GroupShuffleSplit so that all frames of a play stay in the same split
#Note: with n_splits=2 the loop runs twice and only the second split is kept
gss = GroupShuffleSplit(n_splits=2, test_size=0.3, random_state=42)

# Split data into training (70%) and a held-out set (30%)
for train_idx, test_idx in gss.split(final_df, groups=final_df['gamePlayId']):
    x_train, x_test = final_df.iloc[train_idx], final_df.iloc[test_idx]

# Further split the held-out set in half to get validation (15%) and testing (15%)
gss_val = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=42)

for val_idx, test_idx in gss_val.split(x_test, groups=x_test['gamePlayId']):
    x_val, x_test = x_test.iloc[val_idx], x_test.iloc[test_idx]

In [40]:
#Examine the shape of each split
print("xtrain: ", x_train.shape, "\nxval:  ", x_val.shape, "\nxtest: ", x_test.shape)

xtrain:  (2792955, 47) 
xval:   (589446, 47) 
xtest:  (575641, 47)


In [41]:
print("Number of plays in training:",x_train["gamePlayId"].nunique())

Number of plays in training: 8727


In [42]:
print("Number of plays in validation:",x_val["gamePlayId"].nunique())

Number of plays in validation: 1870


In [43]:
print("Number of plays in test:",x_test["gamePlayId"].nunique())

Number of plays in test: 1871
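
The cells above print shapes and play counts but do not directly confirm that no play appears in more than one split. A small check along the following lines could do that. `shared_groups` is a hypothetical helper, shown here on toy frames where one play is leaked on purpose.

```python
from itertools import combinations

import pandas as pd

def shared_groups(splits, group_col="gamePlayId"):
    """Return {(name_a, name_b): set of group ids present in both splits}."""
    overlaps = {}
    for (name_a, df_a), (name_b, df_b) in combinations(splits.items(), 2):
        common = set(df_a[group_col]) & set(df_b[group_col])
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps

toy_train = pd.DataFrame({"gamePlayId": ["a", "a", "b"]})
toy_val = pd.DataFrame({"gamePlayId": ["c", "c"]})
toy_test = pd.DataFrame({"gamePlayId": ["d", "b"]})  # "b" leaks from train on purpose

clean = shared_groups({"train": toy_train, "val": toy_val})
leaky = shared_groups({"train": toy_train, "val": toy_val, "test": toy_test})
print(clean, leaky)
```

On the real data, `shared_groups({"train": x_train, "val": x_val, "test": x_test})` should return an empty dictionary if the grouped split worked as intended.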


## Sample Training and Testing Dataframes

We will be building larger models on these large dataframes. Thus, it is important to have smaller sampled dataframes, drawn from the training and validation dataframes, to prototype model architectures and save time.

In [44]:
#Keep about 15% of the training plays (test_size=0.85 is the share that is discarded)
gss = GroupShuffleSplit(n_splits=2, test_size=0.85, random_state=42)
for train_idx, test_idx in gss.split(x_train, groups=x_train["gamePlayId"]):
    x_train_sample,_ = x_train.iloc[train_idx], x_train.iloc[test_idx]
    
#get a sample of the validation data in the same way
for train_idx, test_idx in gss.split(x_val, groups=x_val["gamePlayId"]):
    x_val_sample, _ = x_val.iloc[train_idx], x_val.iloc[test_idx]

In [45]:
print("xtrain_sample: ", x_train_sample.shape, "xval_sample: ", x_val_sample.shape)

xtrain_sample:  (421333, 47) xval_sample:  (90475, 47)


In [46]:
print("Number of plays in training sample:",x_train_sample["gamePlayId"].nunique())

Number of plays in training sample: 1309


In [47]:
print("Number of plays in validation sample:",x_val_sample["gamePlayId"].nunique())

Number of plays in validation sample: 280


## Write Dataframes to csv files

The following data frames will be written to csv files: 
- final_df: clean data before being split
- x_train: training data
- x_val: validation data
- x_test: testing data
- x_train_sample: sampled training data
- x_val_sample: sampled validation data

In [48]:
final_df.to_csv("../Data/clean_tracking.csv", index = False)

In [49]:
x_train.to_csv("../Data/train.csv", index = False)

In [50]:
x_val.to_csv("../Data/val.csv", index = False)

In [51]:
x_test.to_csv("../Data/test.csv", index = False)

In [52]:
x_train_sample.to_csv("../Data/train_sample.csv", index = False)

In [53]:
x_val_sample.to_csv("../Data/val_sample.csv", index = False)