# Team Project: Part 2
## NFL Player Movement Prediction: Project Part 2

Project Title: project1

Team Name: group 4

Team member names: Brandon Chung (906507859), Hannah Solis, Riley Small, Janet Lu, Rafe Sholler

## 1. Project Introduction
Topic: NFL Big Data Bowl, Predicting Player Movement

Overview:
This project explores NFL player tracking data from the 2026 Big Data Bowl competition, which provides detailed spatial and motion information for every player on the field during pass plays. The dataset includes pre-pass tracking frames containing each player's position (x, y), speed, acceleration, orientation, and movement direction. This level of detail lets us study player behavior with quantitative and predictive methods.
Player movement is one of the most important aspects of football strategy. Estimating where players are likely to move, seconds or even fractions of a second into the future, has the potential to improve coaching decisions, defensive matching, route recognition, and automated broadcast visualizations.

Goal: Model and predict where players will move immediately after the pass is thrown.


Main Research Question: Which regression modeling approach most accurately predicts player movement (future x/y coordinates) in the moments after a pass is thrown?

Secondary Research Questions: 
- How do player speed, acceleration, and direction change before and after the pass?
- Do different player roles (receiver, defender, quarterback) show different movement patterns?
- How far ahead in time can player positions be accurately predicted using simple regression models?
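
One way to frame the last question is to compare any regression model against a constant-velocity baseline and track how the error grows with each future frame. The sketch below is illustrative only. It assumes 10 tracking frames per second, `s` in yards per second, and `dir` measured in degrees clockwise from the +y axis; none of these conventions is stated in this notebook, so they should be checked against the competition's data description. The function names are hypothetical.

```python
import numpy as np

FRAME_DT = 0.1  # assumed seconds per frame (10 Hz tracking); not confirmed in this notebook


def constant_velocity_forecast(x, y, s, dir_deg, n_frames, dt=FRAME_DT):
    """Extrapolate one player's position assuming speed and heading stay fixed.

    Assumes `s` is in yards per second and `dir_deg` is measured clockwise
    from the +y axis (so 90 degrees points along +x).
    Returns an array of shape (n_frames, 2) holding (x, y) per future frame.
    """
    theta = np.deg2rad(dir_deg)
    vx = s * np.sin(theta)
    vy = s * np.cos(theta)
    t = dt * np.arange(1, n_frames + 1)  # elapsed time at each future frame
    return np.column_stack([x + vx * t, y + vy * t])


def rmse_by_horizon(pred, actual):
    """Root-mean-square position error for each future frame.

    Both inputs have shape (n_players, n_frames, 2).
    """
    sq_dist = np.sum((pred - actual) ** 2, axis=-1)  # shape (n_players, n_frames)
    return np.sqrt(sq_dist.mean(axis=0))
```

Plotting the output of `rmse_by_horizon` against the frame index would show the horizon at which a simple model stops being useful, and gives a reference curve for judging fitted regression models.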

## 2. Data Sources
Links:
https://operations.nfl.com/gameday/analytics/big-data-bowl

https://www.kaggle.com/competitions/nfl-big-data-bowl-2026-prediction/data?select=train

### Pre-pass data (Model Input Data)
**Files used:** `train/input_2023_w[01–18].csv`  
These files include frame-by-frame tracking data *before* the quarterback releases the ball. Each row corresponds to a single player in a single frame and includes:

- x and y field position  
- speed (s)  
- acceleration (a)  
- orientation (o)  
- direction of movement (dir)  
- player position and role  
- play direction and yardline context  
- frame identifiers  

These variables serve as the **input features** for our movement prediction model.
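
Because `o` and `dir` are angles in degrees, values such as 359 and 1 are numerically far apart even though they describe nearly the same heading. A common way to handle this before fitting a regression model is to encode each angle as a sine and cosine pair. The helper below is a minimal sketch of that idea, not part of our pipeline yet; `add_angle_features` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def add_angle_features(df, angle_cols=("dir", "o")):
    """Return a copy of df with sine and cosine columns for each angle column (degrees)."""
    out = df.copy()
    for col in angle_cols:
        radians = np.deg2rad(out[col])
        out[f"{col}_sin"] = np.sin(radians)
        out[f"{col}_cos"] = np.cos(radians)
    return out
```

The original angle columns are left in place, so the encoded and raw versions can be compared during modeling.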

### Post-pass data (Model Target Data)
**Files used:** `train/output_2023_w[01–18].csv`  
These files contain the player tracking data *after* the ball is thrown. They record where each player actually moved during the pass play, including their future x and y positions for each frame after release.

These rows provide the **regression targets** (future x, future y) for evaluating model accuracy.

### Connection to the RQ
Our research question asks which regression modeling approaches best predict player movement during the moments after a pass is thrown. This dataset is ideal because:

- The **input files** capture each player's state in the frames leading up to the throw.  
- The **output files** capture the actual future movement we want to predict.  
- Every player is tracked at extremely fine temporal resolution (multiple frames per second).  
- The dataset includes high-dimensional tracking data, satisfying the project’s requirement for a large, challenging dataset.
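
To connect the two file types in practice, each post-pass target row needs to be paired with the same player's pre-pass state. The sketch below shows one possible pairing: it takes the last pre-pass frame per player and play and joins it to every output frame on `game_id`, `play_id`, and `nfl_id`. Treating the last input frame as the state at the throw is our assumption, and `build_training_pairs` and the `future_*` column names are hypothetical.

```python
import pandas as pd

KEYS = ["game_id", "play_id", "nfl_id"]


def build_training_pairs(input_df, output_df):
    """Attach each player's last pre-pass state to every post-pass target frame."""
    # Keep only the final pre-pass frame for each player on each play.
    last_idx = input_df.groupby(KEYS)["frame_id"].idxmax()
    last_state = input_df.loc[last_idx, KEYS + ["x", "y", "s", "a", "dir", "o"]]

    # Rename output columns so features and targets do not collide.
    targets = output_df.rename(
        columns={"frame_id": "future_frame", "x": "future_x", "y": "future_y"}
    )

    # Many target frames map to one pre-pass state; validate guards against duplicate keys.
    return targets.merge(last_state, on=KEYS, how="inner", validate="many_to_one")
```

Each resulting row holds the features (`x`, `y`, `s`, `a`, `dir`, `o`), the number of frames ahead (`future_frame`), and the regression targets (`future_x`, `future_y`).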

In [1]:
## 3. Data Loading and Inspection
import pandas as pd
import numpy as np
import os
from glob import glob

# Path to the folder containing all the train CSV files
DATA_PATH = "data/train/"

In [2]:
input_files = sorted(glob(os.path.join(DATA_PATH, "input_2023_w*.csv")))
output_files = sorted(glob(os.path.join(DATA_PATH, "output_2023_w*.csv")))

print("Input files:", len(input_files))
print("Output files:", len(output_files))

Input files: 18
Output files: 18


In [3]:
# Load and concatenate all input files
input_dfs = []

for f in input_files:
    print("Loading:", f)
    df = pd.read_csv(f)
    input_dfs.append(df)

input_df = pd.concat(input_dfs, ignore_index=True)

print("Input data shape:", input_df.shape)

Loading: data/train\input_2023_w01.csv
Loading: data/train\input_2023_w02.csv
Loading: data/train\input_2023_w03.csv
Loading: data/train\input_2023_w04.csv
Loading: data/train\input_2023_w05.csv
Loading: data/train\input_2023_w06.csv
Loading: data/train\input_2023_w07.csv
Loading: data/train\input_2023_w08.csv
Loading: data/train\input_2023_w09.csv
Loading: data/train\input_2023_w10.csv
Loading: data/train\input_2023_w11.csv
Loading: data/train\input_2023_w12.csv
Loading: data/train\input_2023_w13.csv
Loading: data/train\input_2023_w14.csv
Loading: data/train\input_2023_w15.csv
Loading: data/train\input_2023_w16.csv
Loading: data/train\input_2023_w17.csv
Loading: data/train\input_2023_w18.csv
Input data shape: (4880579, 23)


In [4]:
# Load and combine all output files
output_dfs = []

for f in output_files:
    print("Loading:", f)
    df = pd.read_csv(f)
    output_dfs.append(df)

output_df = pd.concat(output_dfs, ignore_index=True)

print("Output data shape:", output_df.shape)

Loading: data/train\output_2023_w01.csv
Loading: data/train\output_2023_w02.csv
Loading: data/train\output_2023_w03.csv
Loading: data/train\output_2023_w04.csv
Loading: data/train\output_2023_w05.csv
Loading: data/train\output_2023_w06.csv
Loading: data/train\output_2023_w07.csv
Loading: data/train\output_2023_w08.csv
Loading: data/train\output_2023_w09.csv
Loading: data/train\output_2023_w10.csv
Loading: data/train\output_2023_w11.csv
Loading: data/train\output_2023_w12.csv
Loading: data/train\output_2023_w13.csv
Loading: data/train\output_2023_w14.csv
Loading: data/train\output_2023_w15.csv
Loading: data/train\output_2023_w16.csv
Loading: data/train\output_2023_w17.csv
Loading: data/train\output_2023_w18.csv
Output data shape: (562936, 6)
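
With 4,880,579 input rows and 23 columns, the combined input frame can take a lot of memory. If that becomes a problem, one option is to read only the columns a model needs and store the tracking values as 32-bit floats. The helper below is a sketch of that approach rather than the loader used above; `load_weeks` and its parameters are hypothetical names.

```python
import pandas as pd


def load_weeks(files, usecols=None, float_cols=(), category_cols=()):
    """Read several weekly CSV files and stack them into one DataFrame.

    float_cols are read as float32; category_cols are converted to the
    pandas category dtype after concatenation, so all weeks share one
    set of categories.
    """
    dtype = {col: "float32" for col in float_cols}
    frames = [pd.read_csv(f, usecols=usecols, dtype=dtype or None) for f in files]
    df = pd.concat(frames, ignore_index=True)
    for col in category_cols:
        df[col] = df[col].astype("category")
    return df
```

For example, `load_weeks(input_files, float_cols=["x", "y", "s", "a", "dir", "o"], category_cols=["player_role"])` would load the same files with smaller numeric columns; `info(memory_usage="deep")` can then confirm how much was saved.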


In [5]:
print("\nInput Columns:")
print(list(input_df.columns))

print("\nOutput Columns:")
print(list(output_df.columns))

input_df.info(memory_usage="deep")
output_df.info(memory_usage="deep")

# Preview a few rows
input_df.head()


Input Columns:
['game_id', 'play_id', 'player_to_predict', 'nfl_id', 'frame_id', 'play_direction', 'absolute_yardline_number', 'player_name', 'player_height', 'player_weight', 'player_birth_date', 'player_position', 'player_side', 'player_role', 'x', 'y', 's', 'a', 'dir', 'o', 'num_frames_output', 'ball_land_x', 'ball_land_y']

Output Columns:
['game_id', 'play_id', 'nfl_id', 'frame_id', 'x', 'y']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4880579 entries, 0 to 4880578
Data columns (total 23 columns):
 #   Column                    Dtype  
---  ------                    -----  
 0   game_id                   int64  
 1   play_id                   int64  
 2   player_to_predict         bool   
 3   nfl_id                    int64  
 4   frame_id                  int64  
 5   play_direction            object 
 6   absolute_yardline_number  int64  
 7   player_name               object 
 8   player_height             object 
 9   player_weight             int64  
 10  player_birth

| | game_id | play_id | player_to_predict | nfl_id | frame_id | play_direction | absolute_yardline_number | player_name | player_height | player_weight | ... | player_role | x | y | s | a | dir | o | num_frames_output | ball_land_x | ball_land_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023090700 | 101 | False | 54527 | 1 | right | 42 | Bryan Cook | 6-1 | 210 | ... | Defensive Coverage | 52.33 | 36.94 | 0.09 | 0.39 | 322.4 | 238.24 | 21 | 63.259998 | -0.22 |
| 1 | 2023090700 | 101 | False | 54527 | 2 | right | 42 | Bryan Cook | 6-1 | 210 | ... | Defensive Coverage | 52.33 | 36.94 | 0.04 | 0.61 | 200.89 | 236.05 | 21 | 63.259998 | -0.22 |
| 2 | 2023090700 | 101 | False | 54527 | 3 | right | 42 | Bryan Cook | 6-1 | 210 | ... | Defensive Coverage | 52.33 | 36.93 | 0.12 | 0.73 | 147.55 | 240.6 | 21 | 63.259998 | -0.22 |
| 3 | 2023090700 | 101 | False | 54527 | 4 | right | 42 | Bryan Cook | 6-1 | 210 | ... | Defensive Coverage | 52.35 | 36.92 | 0.23 | 0.81 | 131.4 | 244.25 | 21 | 63.259998 | -0.22 |
| 4 | 2023090700 | 101 | False | 54527 | 5 | right | 42 | Bryan Cook | 6-1 | 210 | ... | Defensive Coverage | 52.37 | 36.9 | 0.35 | 0.82 | 123.26 | 244.25 | 21 | 63.259998 | -0.22 |


In [6]:
## 4. Data Cleaning and Processing 
The data was already clean, since it came directly from the NFL.
Summary values reported: Valid: 5837, Mismatched: 0, Missing: 0, Mean: 69.5k, SD: 63.5k, Quantiles: 12.3, 13.8, 35.0, 156, 158.
This check was performed for game_id, play_id, nfl_id, and frame_id.
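
The same kind of check can be reproduced directly in pandas. The sketch below counts missing values and duplicated key rows for the identifier columns, and converts the `player_height` strings (shown as `6-1` in the preview above) into inches so they can be used as a numeric feature. Both helper names are hypothetical, and the height format is assumed to be feet-inches throughout.

```python
import pandas as pd

ID_COLS = ["game_id", "play_id", "nfl_id", "frame_id"]


def summarize_keys(df, id_cols=ID_COLS):
    """Count missing values per identifier column and fully duplicated key rows."""
    missing = {col: int(n) for col, n in df[id_cols].isna().sum().items()}
    duplicate_rows = int(df.duplicated(subset=id_cols).sum())
    return {"missing": missing, "duplicate_rows": duplicate_rows}


def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-1' to total inches."""
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)
```

For instance, `input_df["player_height"].map(height_to_inches)` would produce a numeric height column, and `summarize_keys(input_df)` should report zero missing values if the data is as clean as described.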


In [7]:
## 5. Exploratory Data Visualization

In [8]:
## 6. Predictive Modeling

In [9]:
## 7. Findings and Interpretation

In [10]:
## 8. Contributions