## Competition Overview
*Description*

- The downfield pass is the crown jewel of American sports. When the ball is in the air, anything can happen: a touchdown, an interception, or a contested catch. The uncertainty and importance of these outcomes are what keep audiences on the edge of their seats.

- The 2026 Big Data Bowl is designed to help the National Football League better understand player movement during the pass play, starting when the ball is thrown and ending when the ball is either caught or ruled incomplete. For the offensive team, this means focusing on the targeted receiver, whose job is to move toward the ball landing location in order to complete a catch. For the defensive team, which could have several players moving toward the ball, the job is both to prevent the offensive player from making a catch and to go for the ball themselves. This year's Big Data Bowl asks our fans to help track the movement of these players.

- In the Prediction Competition of the Big Data Bowl, participants are tasked with predicting player movement while the ball is in the air. Specifically, the NFL is sharing data from before the ball is thrown (including the Next Gen Stats tracking data), ending at the moment the quarterback releases the ball. In addition to the pre-pass tracking data, we are providing participants with the offensive player who was targeted (i.e., the targeted receiver) and the landing location of the pass.

- Using the information above, participants should generate prediction models for player movement during the frames when the ball is in the air. The most accurate algorithms will be those whose output most closely matches the eventual movement of each player.

*Competition Specifics*
- In the NFL's tracking data, there are 10 frames per second. As a result, if a ball is in the air for 2.5 seconds, there will be 25 frames of location data to predict.

- Quick passes (less than half a second), deflected passes, and throwaway passes are dropped from the competition.

- Evaluation for the training data is based on historical data. Evaluation for the leaderboard is based on games that haven't been played yet. Specifically, we will run a live leaderboard covering the last five weeks of the 2025 NFL season.
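
The frame arithmetic in the first bullet can be sketched as a small helper. The name `frames_to_predict` is hypothetical; the 10 frames-per-second rate and the half-second cutoff come from the bullets above, while rounding to the nearest frame is an assumption.

```python
FRAMES_PER_SECOND = 10  # NFL tracking rate stated above


def frames_to_predict(air_time_seconds: float) -> int:
    """Number of tracking frames while the ball is in the air (hypothetical helper)."""
    # Rounding to the nearest frame is an assumption, not a documented rule.
    return round(air_time_seconds * FRAMES_PER_SECOND)


print(frames_to_predict(2.5))  # 25 frames for a 2.5-second pass
print(frames_to_predict(0.5))  # 5 frames, the shortest pass that is not dropped
```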

*Evaluation*
- Submissions are evaluated using the Root Mean Squared Error between the predicted and the observed target.
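
As a rough illustration of the metric, the sketch below computes a root mean squared error over predicted and observed positions. The description above does not spell out how the x and y errors are combined, so pooling the squared errors of both coordinates is an assumption here, and the function name `rmse_xy` is hypothetical.

```python
import numpy as np


def rmse_xy(pred_x, pred_y, true_x, true_y) -> float:
    """RMSE over x and y jointly (assumed pooling; hypothetical helper)."""
    pred = np.column_stack([pred_x, pred_y]).astype(float)
    true = np.column_stack([true_x, true_y]).astype(float)
    # Mean of the squared errors across every coordinate of every frame
    return float(np.sqrt(np.mean((pred - true) ** 2)))


# Toy values: each prediction is off by 3 yards in x and 4 yards in y
score = rmse_xy([3.0, 13.0], [4.0, 24.0], [0.0, 10.0], [0.0, 20.0])
print(score)
```

Lower values mean the predicted positions sit closer to the observed ones; a perfect prediction scores 0.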

### Initial Data Load for Competition

In [9]:
import pandas as pd
import numpy as np
import glob

# Training Data Materialization
# Scotty's Path "C:\Users\F4WF76A\OneDrive - Fiserv Corp\Repos\NFL-Big-Data-Bowl-Prediction\Data\train"
# Brian's Path "C:\Users\npowers\Documents\Notre Dame MSSA\NFL-Big-Data-Bowl-Prediction\Data\train"

# Change to match your directory's path
train_path = r"C:\Users\npowers\Documents\Notre Dame MSSA\NFL-Big-Data-Bowl-Prediction\Data\train"
input_files = sorted(glob.glob(train_path + "/input_2023_w*.csv"))
output_files = sorted(glob.glob(train_path + "/output_2023_w*.csv"))
# Model training features dataframe
train_feature_df = pd.concat([pd.read_csv(f) for f in input_files], ignore_index=True)
# Training target dataframe
train_target_df = pd.concat([pd.read_csv(f) for f in output_files], ignore_index=True)


# Test Data Materialization
# Test input dataframe with the features used to generate predictions
test_input_df = pd.read_csv('../Data/test_input.csv')
# Test target rows (identifiers only) listing each position that must be predicted
test_df = pd.read_csv('../Data/test.csv')

### Training Data

**Training Input/Features**
- From files with pattern 'input_2023_w[01-18].csv'
- The input data contains tracking data before the pass is thrown

*Data Dictionary*
- game_id: Game identifier, unique (numeric)
- play_id: Play identifier, not unique across games (numeric)
- player_to_predict: whether or not the x/y prediction for this player will be scored (bool)
- nfl_id: Player identification number, unique across players (numeric)
- frame_id: Frame identifier for each play/type, starting at 1 for each game_id/play_id/file type (input or output) (numeric)
- play_direction: Direction that the offense is moving (left or right)
- absolute_yardline_number: Distance from end zone for possession team (numeric)
- player_name: player name (text)
- player_height: player height (ft-in)
- player_weight: player weight (lbs)
- player_birth_date: birth date (yyyy-mm-dd)
- player_position: the player's position (the specific role on the field that they typically play)
- player_side: team player is on (Offense or Defense)
- player_role: role player has on play (Defensive Coverage, Targeted Receiver, Passer or Other Route Runner)
- x: Player position along the long axis of the field, generally within 0 - 120 yards. (numeric)
- y: Player position along the short axis of the field, generally within 0 - 53.3 yards. (numeric)
- s: Speed in yards/second (numeric)
- a: Acceleration in yards/second^2 (numeric)
- o: orientation of player (deg)
- dir: angle of player motion (deg)
- num_frames_output: Number of frames to predict in output data for the given game_id/play_id/nfl_id. (numeric)
- ball_land_x: Ball landing position along the long axis of the field, generally within 0 - 120 yards. (numeric)
- ball_land_y: Ball landing position along the short axis of the field, generally within 0 - 53.3 yards. (numeric)
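
Some of these fields need parsing before they can be used as numeric features. As one example, `player_height` is stored as a feet-inches string such as `6-1`. A minimal sketch of a conversion to inches follows; the helper name `height_to_inches` is hypothetical.

```python
def height_to_inches(height: str) -> int:
    """Convert a 'ft-in' string such as '6-1' to total inches (hypothetical helper)."""
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)


print(height_to_inches("6-1"))   # 73
print(height_to_inches("5-10"))  # 70
```

With the dataframe loaded earlier, this could be applied as `train_feature_df['player_height'].map(height_to_inches)`.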

In [10]:
print('Train Features Dataframe Dimensions:', train_feature_df.shape)
# pd.set_option('display.max_columns', None)
train_feature_df.head()

Train Features Dataframe Dimensions: (4880579, 23)


Unnamed: 0,game_id,play_id,player_to_predict,nfl_id,frame_id,play_direction,absolute_yardline_number,player_name,player_height,player_weight,...,player_role,x,y,s,a,dir,o,num_frames_output,ball_land_x,ball_land_y
0,2023090700,101,False,54527,1,right,42,Bryan Cook,6-1,210,...,Defensive Coverage,52.33,36.94,0.09,0.39,322.4,238.24,21,63.259998,-0.22
1,2023090700,101,False,54527,2,right,42,Bryan Cook,6-1,210,...,Defensive Coverage,52.33,36.94,0.04,0.61,200.89,236.05,21,63.259998,-0.22
2,2023090700,101,False,54527,3,right,42,Bryan Cook,6-1,210,...,Defensive Coverage,52.33,36.93,0.12,0.73,147.55,240.6,21,63.259998,-0.22
3,2023090700,101,False,54527,4,right,42,Bryan Cook,6-1,210,...,Defensive Coverage,52.35,36.92,0.23,0.81,131.4,244.25,21,63.259998,-0.22
4,2023090700,101,False,54527,5,right,42,Bryan Cook,6-1,210,...,Defensive Coverage,52.37,36.9,0.35,0.82,123.26,244.25,21,63.259998,-0.22


In [None]:
# determine the summary statistics for the dataset
train_feature_df.describe(include='all')


Unnamed: 0,game_id,play_id,player_to_predict,nfl_id,frame_id,play_direction,absolute_yardline_number,player_name,player_height,player_weight,...,player_role,x,y,s,a,dir,o,num_frames_output,ball_land_x,ball_land_y
count,4880579.0,4880579.0,4880579,4880579.0,4880579.0,4880579,4880579.0,4880579,4880579,4880579.0,...,4880579,4880579.0,4880579.0,4880579.0,4880579.0,4880579.0,4880579.0,4880579.0,4880579.0,4880579.0
unique,,,2,,,2,,1383,16,,...,4,,,,,,,,,
top,,,False,,,right,,Cameron Sutton,6-1,,...,Defensive Coverage,,,,,,,,,
freq,,,3577139,,,2459074,,13641,909987,,...,2662657,,,,,,,,,
mean,2023155000.0,2196.409,,49558.9,16.13179,,60.55045,,,211.2783,...,,60.50074,26.8119,3.019878,2.118335,180.4972,181.5366,11.64147,60.51581,26.63766
std,201140.5,1246.426,,5210.338,11.13008,,23.05935,,,22.17747,...,,23.48919,10.0062,2.227939,1.415794,100.7162,98.00912,5.331537,25.29643,15.43814
min,2023091000.0,54.0,,30842.0,1.0,,11.0,,,153.0,...,,0.41,0.62,0.0,0.0,0.0,0.0,5.0,-5.26,-3.91
25%,2023101000.0,1150.0,,45198.0,8.0,,41.0,,,195.0,...,,42.63,18.99,1.09,1.01,90.91,91.74,8.0,42.61,13.3
50%,2023111000.0,2171.0,,52413.0,15.0,,61.0,,,207.0,...,,60.41,26.85,2.72,1.92,179.56,180.14,10.0,60.51,26.47
75%,2023121000.0,3246.0,,54500.0,22.0,,80.0,,,225.0,...,,78.23,34.62,4.62,3.04,270.83,271.58,14.0,78.47,39.87


In [20]:
# Print the unique values and value counts for each categorical column
categorical_columns = {
    'PLAYER ROLE': 'player_role',
    'PLAYER SIDE': 'player_side',
    'PLAYER POSITION': 'player_position',
    'PLAYER TO PREDICT': 'player_to_predict',
    'DIRECTION': 'play_direction',
}

for label, column in categorical_columns.items():
    print(f"{label} UNIQUE:")
    print(train_feature_df[column].unique())

    print(f"\n{label} COUNTS:")
    print(train_feature_df[column].value_counts())


PLAYER ROLE UNIQUE:
['Defensive Coverage' 'Other Route Runner' 'Passer' 'Targeted Receiver']

PLAYER ROLE COUNTS:
player_role
Defensive Coverage    2662657
Other Route Runner    1424243
Targeted Receiver      396914
Passer                 396765
Name: count, dtype: int64
PLAYER SIDE UNIQUE:
['Defense' 'Offense']

PLAYER SIDE COUNTS:
player_side
Defense    2662657
Offense    2217922
Name: count, dtype: int64
PLAYER POSITION UNIQUE:
['FS' 'SS' 'CB' 'MLB' 'WR' 'TE' 'QB' 'OLB' 'ILB' 'RB' 'DE' 'FB' 'NT' 'DT'
 'S' 'T' 'LB' 'P' 'K']

PLAYER POSITION COUNTS:
player_position
WR     1063660
CB     1056888
FS      476865
TE      417146
QB      401007
SS      392421
RB      314918
ILB     295593
OLB     207429
MLB     199983
FB       19584
DE       16932
S        13764
DT        3139
NT        1090
T           83
LB          31
P           23
K           23
Name: count, dtype: int64
PLAYER TO PREDICT UNIQUE:
[False  True]

PLAYER TO PREDICT COUNTS:
player_to_predict
False    3577139
True     1303440
Name: count, dtype: int64

**Training Target Data (Has Values We Want to Predict)**
- From files with pattern 'output_2023_w[01-18].csv'
- The output data contains tracking data after the pass is thrown.

*Data Dictionary*
- game_id: Game identifier, unique (numeric)
- play_id: Play identifier, not unique across games (numeric)
- nfl_id: Player identification number, unique across players. (numeric)
- frame_id: Frame identifier for each play/type, starting at 1 for each game_id/play_id/file type (input or output). The maximum value for a given game_id, play_id and nfl_id will be the same as the num_frames_output value from the corresponding input file. (numeric)
- x: Player position along the long axis of the field, generally within 0-120 yards. (TARGET TO PREDICT)
- y: Player position along the short axis of the field, generally within 0 - 53.3 yards. (TARGET TO PREDICT)
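
The `frame_id` rule above (the maximum output frame equals `num_frames_output` from the input file) can be checked with a group-and-merge. The sketch below uses toy frames with illustrative values in place of `train_feature_df` and `train_target_df`; the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the input and output files (values are illustrative only)
toy_input = pd.DataFrame({
    "game_id": [1, 1, 1, 1],
    "play_id": [10, 10, 10, 10],
    "nfl_id": [100, 100, 200, 200],
    "frame_id": [1, 2, 1, 2],
    "num_frames_output": [3, 3, 2, 2],
})
toy_output = pd.DataFrame({
    "game_id": [1] * 5,
    "play_id": [10] * 5,
    "nfl_id": [100, 100, 100, 200, 200],
    "frame_id": [1, 2, 3, 1, 2],
})

keys = ["game_id", "play_id", "nfl_id"]
# Largest output frame per player per play
max_frames = toy_output.groupby(keys, as_index=False)["frame_id"].max()
# num_frames_output repeats on every input row, so keep one row per key
expected = toy_input[keys + ["num_frames_output"]].drop_duplicates()
check = max_frames.merge(expected, on=keys, how="left")
mismatches = check[check["frame_id"] != check["num_frames_output"]]
print(len(mismatches))  # 0 for this toy data
```

Any rows left in `mismatches` would point to plays where the two files disagree.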

In [11]:
print('Train Target Dataframe Dimensions: ', train_target_df.shape)
train_target_df.head()

Train Target Dataframe Dimensions:  (562936, 6)


Unnamed: 0,game_id,play_id,nfl_id,frame_id,x,y
0,2023090700,101,46137,1,56.22,17.28
1,2023090700,101,46137,2,56.63,16.88
2,2023090700,101,46137,3,57.06,16.46
3,2023090700,101,46137,4,57.48,16.02
4,2023090700,101,46137,5,57.91,15.56


### Test Data

**Test Input/Features**
- Player tracking data for the same plays as the prediction targets. This file is provided only for convenience; the actual test data will be provided by the API.

In [12]:
print('Test input Dataframe Dimensions:', test_input_df.shape)
test_input_df.head()

Test input Dataframe Dimensions: (49753, 23)


Unnamed: 0,game_id,play_id,player_to_predict,nfl_id,frame_id,play_direction,absolute_yardline_number,player_name,player_height,player_weight,...,player_role,x,y,s,a,dir,o,num_frames_output,ball_land_x,ball_land_y
0,2024120805,74,False,52518,1,left,95,Darnay Holmes,5-10,198,...,Defensive Coverage,90.85,17.17,0.35,0.78,218.39,81.52,11,90.379997,46.470001
1,2024120805,74,False,52518,2,left,95,Darnay Holmes,5-10,198,...,Defensive Coverage,90.83,17.11,0.62,1.2,204.64,81.52,11,90.379997,46.470001
2,2024120805,74,False,52518,3,left,95,Darnay Holmes,5-10,198,...,Defensive Coverage,90.8,17.03,0.87,1.44,200.53,81.52,11,90.379997,46.470001
3,2024120805,74,False,52518,4,left,95,Darnay Holmes,5-10,198,...,Defensive Coverage,90.76,16.91,1.22,1.87,197.16,81.52,11,90.379997,46.470001
4,2024120805,74,False,52518,5,left,95,Darnay Holmes,5-10,198,...,Defensive Coverage,90.72,16.77,1.52,2.11,195.24,83.75,11,90.379997,46.470001


**Test Target**
- A mock test set representing the structure of the unseen test set. This file is provided only for convenience; the actual test_input data will be provided by the API. It contains the prediction targets as rows, with columns (game_id, play_id, nfl_id, frame_id) identifying each position that needs to be predicted.

In [13]:
print('Test Target Dataframe Dimensions:', test_df.shape)
test_df.head()

Test Target Dataframe Dimensions: (5837, 5)


Unnamed: 0,id,game_id,play_id,nfl_id,frame_id
0,12350,2024120805,74,54586,1
1,12351,2024120805,74,54586,2
2,12352,2024120805,74,54586,3
3,12353,2024120805,74,54586,4
4,12354,2024120805,74,54586,5
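
To make the relationship between the two test files concrete, the sketch below fills each target row with the player's last observed pre-throw position. This hold-in-place baseline is an illustration only, not a method from the competition, and the toy frames stand in for `test_input_df` and `test_df`.

```python
import pandas as pd

# Toy stand-ins for the test input and test target files (illustrative values)
toy_test_input = pd.DataFrame({
    "game_id": [1, 1, 1],
    "play_id": [10, 10, 10],
    "nfl_id": [100, 100, 100],
    "frame_id": [1, 2, 3],
    "x": [50.0, 50.5, 51.0],
    "y": [20.0, 20.2, 20.4],
})
toy_test = pd.DataFrame({
    "id": [0, 1],
    "game_id": [1, 1],
    "play_id": [10, 10],
    "nfl_id": [100, 100],
    "frame_id": [1, 2],
})

keys = ["game_id", "play_id", "nfl_id"]
# Last pre-throw frame per player per play
last_seen = (
    toy_test_input.sort_values("frame_id")
    .groupby(keys, as_index=False)
    .last()[keys + ["x", "y"]]
)
# Every target frame for a player inherits that last known position
baseline = toy_test.merge(last_seen, on=keys, how="left")
print(baseline[["id", "frame_id", "x", "y"]])
```

A real model would replace the constant `x` and `y` with per-frame predictions, but the join on (game_id, play_id, nfl_id) stays the same.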
