# Data Prep

The purpose of this notebook is to apply the feature functions from Feature-Engineering.ipynb and the data cleaning functions from Data-Preprocessing.ipynb to produce cleaned training, validation, and testing data. At the end, we split the data into 70% training, 15% validation, and 15% testing dataframes for model building and write them to .csv files.

## Run Preliminary Notebook Functions

The following cells run the necessary Jupyter notebooks, which define the functions used to create features and clean the data. We also import the required libraries, load the data, and combine the weekly tracking files into a single dataframe.
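The nine weekly reads and `insert` calls in the cells below follow one pattern, so they could also be written as a loop. The following is a minimal sketch, not the notebook's own code: `load_tracking` is a hypothetical helper name, and it assumes the files are named `tracking_week_<n>.csv` inside a single directory, as they are in this notebook.

```python
from pathlib import Path

import pandas as pd


def load_tracking(data_dir, weeks=range(1, 10)):
    """Read tracking_week_<n>.csv for each week and stack the results.

    A 'Week' column is inserted at position 0, matching the manual
    tracking_n.insert(0, 'Week', n) calls used in this notebook.
    """
    frames = []
    for week in weeks:
        df = pd.read_csv(Path(data_dir) / f"tracking_week_{week}.csv")
        df.insert(0, "Week", week)
        frames.append(df)
    # Stack all weeks and rebuild a clean 0..n-1 index
    return pd.concat(frames, axis=0).reset_index(drop=True)
```

In this notebook the call would be `tracking = load_tracking("../Data")`. The loop keeps only one week's raw frame in a named variable at a time, although all weeks are still held in the list until they are concatenated.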

In [20]:
#Run notebooks for functions to prepare data
%run Feature-Engineering.ipynb
%run Data-Preprocessing.ipynb

In [41]:
#Import libraries
import pandas as pd
import numpy as np
import os
import warnings
import time
from sklearn.model_selection import GroupShuffleSplit

warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)


#Import Data
games = pd.read_csv("../Data/games.csv")
players = pd.read_csv("../Data/players.csv")
plays = pd.read_csv("../Data/plays.csv")
tackles = pd.read_csv("../Data/tackles.csv")

#Read in all tracking data
tracking_1 = pd.read_csv("../Data/tracking_week_1.csv")
tracking_2 = pd.read_csv("../Data/tracking_week_2.csv")
tracking_3 = pd.read_csv("../Data/tracking_week_3.csv")
tracking_4 = pd.read_csv("../Data/tracking_week_4.csv")
tracking_5 = pd.read_csv("../Data/tracking_week_5.csv")
tracking_6 = pd.read_csv("../Data/tracking_week_6.csv")
tracking_7 = pd.read_csv("../Data/tracking_week_7.csv")
tracking_8 = pd.read_csv("../Data/tracking_week_8.csv")
tracking_9 = pd.read_csv("../Data/tracking_week_9.csv")

In [42]:
#Add column for week
tracking_1.insert(0,'Week',1)
tracking_2.insert(0,'Week',2)
tracking_3.insert(0,'Week',3)
tracking_4.insert(0,'Week',4)
tracking_5.insert(0,'Week',5)
tracking_6.insert(0,'Week',6)
tracking_7.insert(0,'Week',7)
tracking_8.insert(0,'Week',8)
tracking_9.insert(0,'Week',9)

In [43]:
#combine tracking
tracking = pd.concat([tracking_1,tracking_2,tracking_3,tracking_4,tracking_5,tracking_6,tracking_7,tracking_8,tracking_9], axis = 0).reset_index(drop = True)
display(tracking)

Unnamed: 0,Week,gameId,playId,nflId,displayName,frameId,time,jerseyNumber,club,playDirection,x,y,s,a,dis,o,dir,event
0,1,2022090800,56,35472.0,Rodger Saffold,1,2022-09-08 20:24:05.200000,76.0,BUF,left,88.370000,27.270000,1.62,1.15,0.16,231.74,147.90,
1,1,2022090800,56,35472.0,Rodger Saffold,2,2022-09-08 20:24:05.299999,76.0,BUF,left,88.470000,27.130000,1.67,0.61,0.17,230.98,148.53,pass_arrived
2,1,2022090800,56,35472.0,Rodger Saffold,3,2022-09-08 20:24:05.400000,76.0,BUF,left,88.560000,27.010000,1.57,0.49,0.15,230.98,147.05,
3,1,2022090800,56,35472.0,Rodger Saffold,4,2022-09-08 20:24:05.500000,76.0,BUF,left,88.640000,26.900000,1.44,0.89,0.14,232.38,145.42,
4,1,2022090800,56,35472.0,Rodger Saffold,5,2022-09-08 20:24:05.599999,76.0,BUF,left,88.720000,26.800000,1.29,1.24,0.13,233.36,141.95,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12187393,9,2022110700,3787,,football,40,2022-11-07 23:06:49.200000,,football,right,26.219999,19.680000,1.37,2.58,0.15,,,tackle
12187394,9,2022110700,3787,,football,41,2022-11-07 23:06:49.299999,,football,right,26.320000,19.610001,1.07,2.74,0.12,,,
12187395,9,2022110700,3787,,football,42,2022-11-07 23:06:49.400000,,football,right,26.389999,19.559999,0.80,2.49,0.09,,,
12187396,9,2022110700,3787,,football,43,2022-11-07 23:06:49.500000,,football,right,26.450001,19.520000,0.57,2.38,0.07,,,


In [6]:
print(len(tracking))

12187398


## Sample Data

Run these cells only when working with a sample. Comment out this code when running the functions on the entire data set.

In [44]:
#Create unique id column and move it to the first column in the dataframe
tracking['gamePlayId'] = tracking.apply(lambda x: str(x['gameId']) + str(x['playId']), axis=1)
#Keep Column
gamePlayId = tracking['gamePlayId']
#Drop the column
tracking = tracking.drop(columns=['gamePlayId'])
# Insert the column at the beginning
tracking.insert(0, 'gamePlayId', gamePlayId)

In [6]:
# display(tracking)

In [45]:
#Run this cell for a sample:
#Draw roughly 1% of plays, keeping all frames of a sampled play together
#Initialize GroupShuffleSplit
gss = GroupShuffleSplit(n_splits=2, test_size=0.01, random_state=1)

#The "test" side of each split is the 1% sample; with n_splits=2 the loop keeps the last split
for train_idx, test_idx in gss.split(tracking, groups=tracking['gamePlayId']):
    _, tracking_sample = tracking.iloc[train_idx], tracking.iloc[test_idx]
    
tracking = tracking_sample
    
print("Number of rows in tracking:",len(tracking_sample))
print("Number columns in tracking:", tracking_sample.shape[1])
print("Number of plays in data:",tracking_sample["gamePlayId"].nunique())
print("Weeks in data:", tracking_sample["Week"].unique())

Number of rows in tracking: 118864
Number columns in tracking: 19
Number of plays in data: 125
Weeks in data: [1 2 3 4 5 6 7 8 9]


In [8]:
# tracking = tracking_sample

## Features and Data Cleaning

The following code cells create the tracking features and apply the data preprocessing steps. The functions are run in the order required to obtain the proper datasets. Each cell reports the elapsed time, the latest timestamp in the tracking data, the number of rows, the number of columns, and the number of unique plays, so we can confirm that each function behaves as expected.
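Every cell in this section repeats the same timing and summary code. One way to factor that out is sketched below. `run_step` is a hypothetical helper, not something defined in the preliminary notebooks, and `drop_football` is only a stand-in for a real cleaning function such as `remove_football_frames`.

```python
import time

import pandas as pd


def run_step(label, func, df, *args, **kwargs):
    """Call func(df, *args, **kwargs), then print the per-step summary."""
    start = time.perf_counter()
    result = func(df, *args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"[{label}] Elapsed time: {elapsed:.3f} seconds")
    print("Latest date:", result["time"].max())
    print("Number of rows in tracking:", len(result))
    print("Number of columns in tracking:", result.shape[1])
    print("Number of plays in data:", result["gamePlayId"].nunique())
    return result


# Stand-in cleaning step (hypothetical) applied to toy data
def drop_football(df):
    return df[df["displayName"] != "football"]


toy = pd.DataFrame({
    "gamePlayId": ["1", "1", "2"],
    "displayName": ["Player A", "football", "Player B"],
    "time": ["2022-09-08 20:24:05.2", "2022-09-08 20:24:05.2", "2022-09-08 20:24:06.1"],
})
cleaned = run_step("remove football", drop_football, toy)
```

The helper only covers steps that take a dataframe and return a dataframe. Steps that merge a feature table back onto `tracking` would need a small wrapper function first.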

In [7]:
# # #####NOTE: Comment this out if running a sample
# #Create unique id column and move it to the first column in the dataframe
# start_time = time.time()
# tracking['gamePlayId'] = tracking.apply(lambda x: str(x['gameId']) + str(x['playId']), axis=1)
# end_time = time.time()
# # Calculate the elapsed time
# elapsed_time = end_time - start_time
# print(f"Elapsed time: {elapsed_time} seconds")
# print("Lastest Date:",tracking["time"].max())
# print("Number of rows in tracking:",len(tracking))
# print("Number columns in tracking:", tracking.shape[1])
# print("Number of plays in data:",len(tracking["gamePlayId"].drop_duplicates()))

Elapsed time: 101.84703707695007 seconds
Lastest Date: 2022-11-07 23:06:49.599999
Number of rows in tracking: 12187398
Number columns in tracking: 19
Number of plays in data: 12486
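The elapsed time reported above (about 100 seconds) comes from building `gamePlayId` with a row-wise `apply`. The sketch below shows a vectorized alternative on toy data. It assumes `gameId` and `playId` are integer columns, as they appear in the raw tracking files, and it has not been timed on the full data set.

```python
import pandas as pd

toy = pd.DataFrame({
    "gameId": [2022090800, 2022090800, 2022110700],
    "playId": [56, 56, 3787],
})

# Row-wise version, as used in this notebook
row_wise = toy.apply(lambda x: str(x["gameId"]) + str(x["playId"]), axis=1)

# Vectorized version: convert each column to strings once, then concatenate
vectorized = toy["gameId"].astype(str) + toy["playId"].astype(str)

# Insert directly at position 0, so no separate drop-and-reinsert is needed
toy.insert(0, "gamePlayId", vectorized)
```

Both versions give the same identifiers on this toy data, for example `202209080056` for game `2022090800`, play `56`.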


In [46]:
#Features for changing the orientation and direction to unit circle
start_time = time.time()
tracking["unitDir"] = tracking["dir"].apply(orient_angle)
tracking["unitO"] = tracking["o"].apply(orient_angle)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 0.11212682723999023 seconds
Lastest Date: 2022-11-07 21:53:32.799999
Number of rows in tracking: 118864
Number columns in tracking: 21
Number of plays in data: 125


In [47]:
#Standardize the tracking data
start_time = time.time()
tracking = standardize_field(tracking)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 0.04532313346862793 seconds
Lastest Date: 2022-11-07 21:53:32.799999
Number of rows in tracking: 118864
Number columns in tracking: 21
Number of plays in data: 125


In [49]:
#Remove football data as we are only concerned with the position of the ball carrier
start_time = time.time()
tracking = remove_football_frames(tracking)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 0.03897976875305176 seconds
Lastest Date: 2022-11-07 21:53:32.799999
Number of rows in tracking: 113696
Number columns in tracking: 21
Number of plays in data: 125


In [50]:
#Remove plays with multiple tackles on the play
start_time = time.time()
tracking = remove_plays_with_mult_tackles(tracking,tackles)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 0.08342361450195312 seconds
Lastest Date: 2022-11-07 21:53:32.799999
Number of rows in tracking: 113696
Number columns in tracking: 21
Number of plays in data: 125


In [51]:
#Remove plays with tracking issues
start_time = time.time()
tracking = remove_tracking_issues(tracking)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 0.07215261459350586 seconds
Lastest Date: 2022-11-07 21:53:32.799999
Number of rows in tracking: 113696
Number columns in tracking: 21
Number of plays in data: 125


In [52]:
#Create feature for dependent variables
start_time = time.time()
tracking = tracking.merge(tackle_dependent_variable(tackles,tracking), on = ["gameId", "playId", "nflId", "frameId"])
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

tackle_single variable done
tackle_multiple done
Elapsed time: 1.229814052581787 seconds
Lastest Date: 2022-11-07 21:53:32.799999
Number of rows in tracking: 113699
Number columns in tracking: 23
Number of plays in data: 125


In [53]:
#Filter frames in tracking to only include desired rows
start_time = time.time()
tracking = filter_frames_by_events(tracking)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",len(tracking["gamePlayId"].drop_duplicates()))

Elapsed time: 0.28427648544311523 seconds
Lastest Date: 2022-11-07 21:53:32.599999
Number of rows in tracking: 76695
Number columns in tracking: 23
Number of plays in data: 124


In [54]:
#Create feature for ball carrier data to each player
start_time = time.time()
tracking = tracking.merge(ballCarrierData(plays,tracking,players), on = ["gameId", "playId","frameId"])
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 0.08188772201538086 seconds
Lastest Date: 2022-11-07 21:53:32.599999
Number of rows in tracking: 76695
Number columns in tracking: 30
Number of plays in data: 124


In [55]:
#Create feature for ball carrier distance to adj endzone
start_time = time.time()
tracking['bcy_adj'] = tracking.apply(bc_adj, axis = 1)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 0.4848811626434326 seconds
Lastest Date: 2022-11-07 21:53:32.599999
Number of rows in tracking: 76695
Number columns in tracking: 31
Number of plays in data: 124


In [56]:
#Create feature for ball carrier distance to out of bounds
start_time = time.time()
tracking['bcy_toob'] = tracking.apply(bc_toob, axis = 1)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 0.6524934768676758 seconds
Lastest Date: 2022-11-07 21:53:32.599999
Number of rows in tracking: 76695
Number columns in tracking: 32
Number of plays in data: 124


In [57]:
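#Create feature for calculating force
#Note: the merged result is stored in tracking_force, so the summary printed below still describes tracking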
start_time = time.time()
tracking_force = tracking.merge(calculate_force(tracking, players), on = ["gameId", "playId", "nflId", "frameId"])
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Elapsed time: 0.09665465354919434 seconds
Lastest Date: 2022-11-07 21:53:32.599999
Number of rows in tracking: 76695
Number columns in tracking: 32
Number of plays in data: 124


### Split Data by Week to Handle Long Processing Times

Because these functions take a long time to run, we split the data by week and run calculate_distance_angles and voronoi_tessellations on one week at a time. On each pass through the loop we process one week, write the accumulated results to a csv file, and reread that file before processing the next week. This way, if the notebook crashes, we can continue where we left off.

In [59]:
week = 1
var_name = "tracking_dist_voronoi_df"
ran_already = False

if var_name in locals():
    del tracking_dist_voronoi_df

while week!=10:
    if os.path.exists("../Data/tracking_dist_voronoi.csv"):
        tracking_dist_voronoi_df = pd.read_csv("../Data/tracking_dist_voronoi.csv") #Read in data
        last_week = tracking_dist_voronoi_df['Week'].unique().max()
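        if ran_already:
            week = last_week + 1 #a week was already processed in this run, so continue with the week after the last one saved
        else:
            ran_already = True
            week = last_week #first pass with an existing file: rerun the last saved week in case it is incomplete (duplicates are removed afterwards)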
        print(f"File Found. Updating Data starting with Week {week}")
    else:
        print("File not Found. Creating File")
        
    #Run calculations on tracking
    tracking_week = tracking[tracking["Week"]==week] #subset data for week
    
    start_time = time.time()
    tracking_week = tracking_week.merge(calculate_distance_angles(tracking_week,plays), on = ["gameId", "playId", "nflId", "frameId"])
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Elapsed time for distance and angles: {elapsed_time} seconds")
    
    start_time = time.time()
    tracking_week = tracking_week.merge(voronoi_tessellations(tracking_week, plays), on = ['gameId','playId','frameId','nflId'])
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Elapsed time for voronoi tessellations: {elapsed_time} seconds")
    
    #write the file
    if var_name in locals():
        tracking_dist_voronoi_df = pd.concat([tracking_dist_voronoi_df, tracking_week], ignore_index = True).reset_index(drop = True)
        #Write file
        tracking_dist_voronoi_df.to_csv("../Data/tracking_dist_voronoi.csv", index = False)
        print(f"Week {week} complete")
    else:
        tracking_week.to_csv("../Data/tracking_dist_voronoi.csv", index = False)
        print(f"Week {week} complete")
        
    #increase week    
    week += 1
    
    #ran already
    ran_already = True
    
#read back in dataframe
tracking = pd.read_csv("../Data/tracking_dist_voronoi.csv")
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

File Found. Updating Data starting with Week 2.0
Elapsed time for distance and angles: 18.1406147480011 seconds
Elapsed time for voronoi tessellations: 6.394983768463135 seconds
Week 2.0 complete
File Found. Updating Data starting with Week 3.0
Elapsed time for distance and angles: 45.01141548156738 seconds
Elapsed time for voronoi tessellations: 15.449057340621948 seconds
Week 3.0 complete
File Found. Updating Data starting with Week 4.0
Elapsed time for distance and angles: 45.293052434921265 seconds
Elapsed time for voronoi tessellations: 16.09014868736267 seconds
Week 4.0 complete
File Found. Updating Data starting with Week 5.0
Elapsed time for distance and angles: 22.77634024620056 seconds
Elapsed time for voronoi tessellations: 7.965648889541626 seconds
Week 5.0 complete
File Found. Updating Data starting with Week 6.0
Elapsed time for distance and angles: 20.069495916366577 seconds
Elapsed time for voronoi tessellations: 6.986861705780029 seconds
Week 6.0 complete
File Found. U

In [60]:
#Remove any duplicates created by the weekly loop
#Duplicates can occur when the code fails partway through and a week has to be rerun
tracking_duplicates = len(tracking)
tracking = tracking.drop_duplicates(subset=['gamePlayId', 'nflId', 'frameId'])
print("Number of rows in removed from tracking due to duplicates:",tracking_duplicates - len(tracking))
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

Number of rows in removed from tracking due to duplicates: 5721
Number of rows in tracking: 76692
Number columns in tracking: 55
Number of plays in data: 124
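The duplicates removed above come from reprocessing the last saved week after a restart. The sketch below shows one alternative that skips weeks already present in the checkpoint file and appends each new week instead of rewriting the whole file. `process_by_week` and `add_double` are hypothetical names. The sketch assumes every week produces the same columns and that each write completes. If a run can crash in the middle of a write, the deduplication step above is still needed.

```python
import os
import tempfile

import pandas as pd


def process_by_week(tracking, process_week, checkpoint_path):
    """Process one week at a time, appending each finished week to a CSV."""
    done = set()
    if os.path.exists(checkpoint_path):
        # Only the Week column is needed to see what has been saved
        saved = pd.read_csv(checkpoint_path, usecols=["Week"])
        done = set(saved["Week"].unique())
    for week in sorted(tracking["Week"].unique()):
        if week in done:
            continue  # finished in an earlier run
        result = process_week(tracking[tracking["Week"] == week])
        # Append; write the header only when the file is first created
        result.to_csv(checkpoint_path, mode="a", index=False,
                      header=not os.path.exists(checkpoint_path))
    return pd.read_csv(checkpoint_path)


# Toy usage: add_double stands in for the distance and Voronoi functions
def add_double(df):
    return df.assign(x2=df["x"] * 2)


toy = pd.DataFrame({"Week": [1, 1, 2], "x": [1.0, 2.0, 3.0]})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.csv")
    first = process_by_week(toy, add_double, path)
    second = process_by_week(toy, add_double, path)  # rerun: nothing is reprocessed
```

Appending also avoids rereading and rewriting the full accumulated file on every pass, which matters more as the number of saved weeks grows.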


### Continue data cleaning

In [None]:
#Remove offensive players from the data
start_time = time.time()
tracking = remove_offensive_players(tracking,plays)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
print("Lastest Date:",tracking["time"].max())
print("Number of rows in tracking:",len(tracking))
print("Number columns in tracking:", tracking.shape[1])
print("Number of plays in data:",tracking["gamePlayId"].nunique())

In [39]:
display(tracking)

Unnamed: 0,Week,gameId,playId,nflId,displayName,frameId,time,jerseyNumber,club,playDirection,x,y,s,a,dis,o,dir,event,gamePlayId,unitDir,unitO,tackle_single,tackle_multiple,bcx,bcy,bcs,bca,bcUnitO,bcUnitDir,bcForce,bcy_adj,bcy_toob,c1Dist,c2Dist,c3Dist,c4Dist,c5Dist,c6Dist,c7Dist,c8Dist,c9Dist,c10Dist,bcDist,c1Ang,c2Ang,c3Ang,c4Ang,c5Ang,c6Ang,c7Ang,c8Ang,c9Ang,c10Ang,bcAng,voronoi_min_dist_from_bc
0,1.0,2.022091e+09,56.0,35472.0,Rodger Saffold,6.0,2022-09-08 20:24:05.700000,76.0,BUF,left,31.20,26.633333,1.15,1.42,0.12,234.48,139.41,pass_outcome_caught,202209080056,130.59,35.52,0.0,0.0,40.15,17.743333,4.61,4.82,155.73,67.80,418.463636,69.85,17.743333,,,,,,,,,,,,,,,,,,,,,,,10.467646
1,1.0,2.022091e+09,56.0,38577.0,Bobby Wagner,6.0,2022-09-08 20:24:05.700000,45.0,LA,left,41.89,24.593333,3.35,2.62,0.32,349.47,357.71,pass_outcome_caught,202209080056,272.29,280.53,0.0,0.0,40.15,17.743333,4.61,4.82,155.73,67.80,418.463636,69.85,17.743333,3.195387,10.116071,10.461855,10.882909,12.035414,12.635874,12.701657,13.169210,14.799963,23.582173,7.067538,75.239538,89.059902,122.727777,103.094000,104.570954,166.159966,86.642246,80.903275,77.812451,158.729540,16.542527,3.948566
2,1.0,2.022091e+09,56.0,41239.0,Aaron Donald,6.0,2022-09-08 20:24:05.700000,99.0,LA,left,27.85,23.373333,3.62,2.86,0.37,186.16,157.65,pass_outcome_caught,202209080056,112.35,83.84,0.0,0.0,40.15,17.743333,4.61,4.82,155.73,67.80,418.463636,69.85,17.743333,1.400321,1.783620,2.496898,3.993257,4.414386,4.674409,8.228657,17.168183,21.436532,32.008038,13.527265,113.577579,163.037991,150.980380,102.982031,53.447313,68.130076,59.944064,110.580937,72.972206,65.058378,136.944687,12.659069
3,1.0,2.022091e+09,56.0,42392.0,Mitch Morse,6.0,2022-09-08 20:24:05.700000,60.0,BUF,left,31.79,24.023333,1.42,0.64,0.14,282.32,347.15,pass_outcome_caught,202209080056,282.85,347.68,0.0,0.0,40.15,17.743333,4.61,4.82,155.73,67.80,418.463636,69.85,17.743333,,,,,,,,,,,,,,,,,,,,,,,9.536856
4,1.0,2.022091e+09,56.0,42489.0,Stefon Diggs,6.0,2022-09-08 20:24:05.700000,14.0,BUF,left,40.15,17.743333,4.61,4.82,0.45,114.27,202.20,pass_outcome_caught,202209080056,67.80,155.73,0.0,0.0,40.15,17.743333,4.61,4.82,155.73,67.80,418.463636,69.85,17.743333,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
988209,9.0,2.022111e+09,2156.0,53457.0,Payton Turner,65.0,2022-11-07 21:53:32.599999,98.0,NO,right,77.28,17.980000,3.44,2.10,0.36,154.80,150.22,,20221107002156,299.78,295.20,0.0,0.0,83.22,-1.040000,6.26,2.49,332.54,302.06,239.945455,26.78,-1.040000,2.256103,4.681250,6.033788,6.647353,7.015925,10.617495,11.035148,11.534318,16.680384,28.309451,19.925963,132.584266,134.224168,122.678170,56.079114,171.445135,102.624368,141.935699,39.462815,25.792274,133.065191,12.436267,16.508849
988210,9.0,2.022111e+09,2156.0,53489.0,Pete Werner,65.0,2022-11-07 21:53:32.599999,20.0,NO,right,73.24,6.040000,3.27,0.73,0.33,200.72,209.01,,20221107002156,240.99,249.28,0.0,0.0,83.22,-1.040000,6.26,2.49,332.54,302.06,239.945455,26.78,-1.040000,2.175983,7.009280,12.575341,15.658241,17.282433,18.540777,18.590132,22.493208,23.543078,40.911272,12.236290,134.195837,76.901270,159.403615,166.054485,168.953174,155.630458,172.545310,177.128820,164.825559,168.618893,83.657392,9.433519
988211,9.0,2.022111e+09,2156.0,53505.0,Paulson Adebo,65.0,2022-11-07 21:53:32.599999,29.0,NO,right,76.45,0.990000,2.99,3.07,0.31,250.25,266.27,,20221107002156,183.73,199.75,0.0,0.0,83.22,-1.040000,6.26,2.49,332.54,302.06,239.945455,26.78,-1.040000,2.020544,5.728569,17.543574,18.117166,21.594316,22.631394,23.592054,25.659139,28.014719,44.986598,7.067800,173.754873,82.557373,89.251140,118.045706,99.364030,102.934272,89.573441,113.478453,98.685567,105.504523,159.578520,2.412374
988212,9.0,2.022111e+09,2156.0,54490.0,Tyler Linderbaum,65.0,2022-11-07 21:53:32.599999,64.0,BAL,right,80.07,23.330000,0.90,0.61,0.10,163.14,168.28,,20221107002156,281.72,286.86,0.0,0.0,83.22,-1.040000,6.26,2.49,332.54,302.06,239.945455,26.78,-1.040000,,,,,,,,,,,,,,,,,,,,,,,20.961358


## Clean Up Tracking Data

The following cells keep only the columns needed for model training. We will drop the following columns:

- displayName
- time
- jerseyNumber
- club
- playDirection
- o
- dir
- event
- defensiveTeam

We will also need to fill missing values in some of our variables.

In [None]:
#columns to drop
columns_to_drop = ['displayName','time','jerseyNumber','club','playDirection','o', 'dir', 'event', 'defensiveTeam']

#Return df with only desired columns
final_df = tracking.drop(columns=columns_to_drop)
display(final_df)

In [None]:
#Fill missing Voronoi tessellation values with the column mean (it is unknown why these are NA)
final_df["voronoi_min_dist_from_bc"] = final_df["voronoi_min_dist_from_bc"].fillna(final_df["voronoi_min_dist_from_bc"].mean())
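Since the reason for the missing values is unknown, it can help to count them before filling. The following is a small sketch on toy data, not on `final_df`. It also shows that `Series.mean()` skips NaN by default, so the fill value is the mean of the observed rows only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "nflId": [1, 2, 3, 4],
    "voronoi_min_dist_from_bc": [10.0, np.nan, 4.0, np.nan],
})
col = "voronoi_min_dist_from_bc"

# Count the missing values before they are overwritten
n_missing = int(toy[col].isna().sum())
print(f"{n_missing} of {len(toy)} rows are missing {col}")

# mean() ignores NaN, so this fills with the mean of 10.0 and 4.0
toy[col] = toy[col].fillna(toy[col].mean())
```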

## Split the Data

The following cells split our data into the training, validation, and testing sets needed for model training. We perform a 70-15-15 training/validation/testing split.

In [None]:
#Initialize GroupShuffleSplit
gss = GroupShuffleSplit(n_splits=2, test_size=0.3, random_state=42)

# Split data into training (70%) and a held-out set (30%); with n_splits=2 the loop keeps the last split
for train_idx, test_idx in gss.split(final_df, groups=final_df['gamePlayId']):
    x_train, x_test = final_df.iloc[train_idx], final_df.iloc[test_idx]

# Split the held-out set in half into validation (15%) and testing (15%)
gss_val = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=42)

for val_idx, test_idx in gss_val.split(x_test, groups=x_test['gamePlayId']):
    x_val, x_test = x_test.iloc[val_idx], x_test.iloc[test_idx]

In [None]:
#Examine the shapes of the splits (the play counts below help confirm that no play appears in more than one split)
print("xtrain: ", x_train.shape, "\nxval:  ", x_val.shape, "\nxtest: ", x_test.shape)

In [None]:
print("Number of plays in training:",x_train["gamePlayId"].nunique())

In [None]:
print("Number of plays in validation:",x_val["gamePlayId"].nunique())

In [None]:
print("Number of plays in test:",x_test["gamePlayId"].nunique())
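The cells above print shapes and play counts but do not directly test for overlap between splits. The sketch below runs the same two-stage grouped split on toy data and asserts that no `gamePlayId` appears in more than one split. `split_by_group` is a hypothetical helper. It uses `n_splits=1`, which differs from the notebook's `n_splits=2`, so it would not reproduce the notebook's exact splits.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def split_by_group(df, group_col, test_size, seed):
    """Split df into two parts so that no group spans both parts."""
    gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    first_idx, second_idx = next(gss.split(df, groups=df[group_col]))
    return df.iloc[first_idx], df.iloc[second_idx]


# Toy data: 20 plays with 3 frames each
toy = pd.DataFrame({
    "gamePlayId": [str(i) for i in range(20) for _ in range(3)],
    "frameId": list(range(3)) * 20,
})

train, holdout = split_by_group(toy, "gamePlayId", 0.3, 42)
val, test = split_by_group(holdout, "gamePlayId", 0.5, 42)

# Leakage check: the sets of play ids must be pairwise disjoint
ids = [set(part["gamePlayId"]) for part in (train, val, test)]
assert ids[0].isdisjoint(ids[1])
assert ids[0].isdisjoint(ids[2])
assert ids[1].isdisjoint(ids[2])
print({name: len(s) for name, s in zip(("train", "val", "test"), ids)})
```

The same three `isdisjoint` assertions can be applied to `x_train`, `x_val`, and `x_test` as they are.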

## Sample Training and Validation Dataframes

We will be building larger models using these large dataframes. Thus, it is important to have smaller sampled dataframes drawn from the training and validation dataframes, so that we can build model architectures and save time.

In [None]:
#get a 15% sample of the training plays (the train side of a split with test_size=0.85)
gss = GroupShuffleSplit(n_splits=2, test_size=0.85, random_state=42)
for train_idx, test_idx in gss.split(x_train, groups=x_train["gamePlayId"]):
    x_train_sample,_ = x_train.iloc[train_idx], x_train.iloc[test_idx]
    
#get a 15% sample of the validation plays
for train_idx, test_idx in gss.split(x_val, groups=x_val["gamePlayId"]):
    x_val_sample, _ = x_val.iloc[train_idx], x_val.iloc[test_idx]

In [None]:
print("xtrain_sample: ", x_train_sample.shape, "xval_sample: ", x_val_sample.shape)

In [None]:
print("Number of plays in training sample:",x_train_sample["gamePlayId"].nunique())

In [None]:
print("Number of plays in validation sample:",x_val_sample["gamePlayId"].nunique())

## Write Dataframes to csv files

The following data frames will be written to csv files: 
- final_df: clean data before being split
- x_train: training data
- x_val: validation data
- x_test: testing data
- x_train_sample: sampled training data
- x_val_sample: sampled validation data
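The six writes listed above can also be driven from one mapping of file name to dataframe. This is a minimal sketch on toy data, and `write_frames` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_frames(frames, out_dir):
    """Write each dataframe to <out_dir>/<name>.csv and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, df in frames.items():
        paths[name] = out_dir / f"{name}.csv"
        df.to_csv(paths[name], index=False)
    return paths


# Toy usage in a temporary directory
toy = pd.DataFrame({"gamePlayId": ["1", "2"], "x": [0.5, 1.5]})
with tempfile.TemporaryDirectory() as tmp:
    written = write_frames({"train": toy, "val": toy.head(1)}, tmp)
    reloaded = pd.read_csv(written["train"])
```

In this notebook the mapping would use the keys `clean_tracking`, `train`, `val`, `test`, `train_sample`, and `val_sample` with `"../Data"` as the output directory, which matches the file names written in the cells below.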

In [None]:
final_df.to_csv("../Data/clean_tracking.csv", index = False)

In [None]:
x_train.to_csv("../Data/train.csv", index = False)

In [None]:
x_val.to_csv("../Data/val.csv", index = False)

In [None]:
x_test.to_csv("../Data/test.csv", index = False)

In [None]:
x_train_sample.to_csv("../Data/train_sample.csv", index = False)

In [None]:
x_val_sample.to_csv("../Data/val_sample.csv", index = False)