# Clustering Pre-snap Movements and Predicting Routes

Roman Bukreev – Undergraduate Track (Salem State University)

# Introduction

**Pre-snap movements** in football are used by the offense to confuse the opponent and expose its defensive strategy. However, the insights that a defense can gain from the opponent's pre-snap movements are still underestimated. These movements can reveal offensive patterns and help the defense prepare for them. This project focuses on predicting the routes run by players who perform pre-snap motion, based on their movement trajectories.

The main tool used in this project is **clustering of pre-snap motions**. There are thousands of pre-snap motions in our dataset, and since the idea of the project is to make a prediction based on the opponent's motion, we need to be able to tell these motions apart. That is why we cluster the movements into categories. We then look at the distribution of routes run within each cluster to gain insights about them.

The route run was chosen as the offensive behaviour to predict, but the reasons for that choice are not unique to this metric. It simply seems like something that can help a defense prepare for games and provide useful insights about the offense. The main part of the project is the clustering of movements. Once the movements are split into categories, we can look at different metrics within each cluster and make predictions. The route run is therefore best seen as a simple example that illustrates our approach.

# Data preprocessing

We focus on the pre-snap movements in the first 9 weeks of the season. Since we are clustering offensive movements, we keep only the data points that correspond to the possession team. For consistency, we also make all plays run in the same direction and adjust the coordinates relative to the ball position. This ensures that similar trajectories in different regions of the field are treated similarly. Finally, we keep only the players who were in motion at the moment of the snap, and only the plays in which exactly one player was in motion.
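The direction and coordinate adjustment described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function name, the `"left"`/`"right"` labels, the field dimensions (120 by 53.3 yards, end zones included) and the choice of which ball position to use as the reference are all assumptions.

```python
FIELD_LENGTH = 120.0  # yards, including end zones (assumed tracking convention)
FIELD_WIDTH = 53.3    # yards (assumed tracking convention)


def normalize_frame(x, y, direction_deg, play_direction, ball_x, ball_y):
    """Return (x, y, direction) for one tracking frame, standardized so that
    every play runs left-to-right and coordinates are relative to the ball.

    All parameter names are hypothetical; `play_direction` is assumed to be
    either "left" or "right".
    """
    if play_direction == "left":
        # Rotate the field by 180 degrees so the offense always moves right.
        x, y = FIELD_LENGTH - x, FIELD_WIDTH - y
        ball_x, ball_y = FIELD_LENGTH - ball_x, FIELD_WIDTH - ball_y
        direction_deg = (direction_deg + 180.0) % 360.0
    # Express the position relative to the ball.
    return x - ball_x, y - ball_y, direction_deg


# The same motion, recorded on plays going in opposite directions,
# maps to the same normalized frame.
right_play = normalize_frame(30.0, 20.0, 90.0, "right", 35.0, 26.65)
left_play = normalize_frame(90.0, 33.3, 270.0, "left", 85.0, 26.65)
print(right_play, left_play)
```

With this normalization, two mirrored recordings of the same motion produce the same ball-relative trajectory, which is what the clustering step relies on.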

# Model

## Model selection

Our tracking data consists of a sequence of coordinates for each player. Clustering these sequences directly would be hard, so we needed a way to convert them into another representation. We decided to use an **LSTM encoder-decoder** for this. The LSTM encoder processes the input trajectories and encodes them into fixed-length latent representations that capture the underlying movement patterns. The decoder then attempts to reconstruct the original trajectories from these latent vectors, which ensures that the encoded representations retain the essential information. After training, we use the encoder to create a latent vector for each trajectory. These vectors can then be used for clustering.
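One way to write this setup in symbols is given below. The notation is introduced here for illustration and is not taken from the project code: $x_{1:T}$ is a trajectory of $T$ frames with $d$ features per frame, $f_\theta$ is the encoder, $g_\phi$ is the decoder, and $z$ is the fixed-length latent vector. We also assume that the reconstruction loss is the mean squared error, which matches the metric reported in this writeup.

$$
z = f_\theta(x_{1:T}), \qquad
\hat{x}_{1:T} = g_\phi(z), \qquad
\mathcal{L}(\theta, \phi) = \frac{1}{T d} \sum_{t=1}^{T} \lVert x_t - \hat{x}_t \rVert_2^2
$$

Only $z$ is passed on to the clustering step; the decoder exists to force $z$ to keep enough information to rebuild the trajectory.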

## Feature selection and engineering for model training

It is important to keep only the essential features for model training. These are the features we kept:

1. X coordinate
2. Y coordinate
3. Speed
4. Acceleration
5. $\sin$ of direction
6. $\cos$ of direction

Features 1-4 are basic features that represent the position and movement of a player. The direction angle was converted into two features: the $\sin$ and $\cos$ of the angle. The reason is that raw direction values have a circular structure. For example, 359° is much closer to 0° than to 270°, even though the raw numbers suggest the opposite. Using $\sin$ and $\cos$ takes care of this problem. We can think of these values as coordinates on a unit circle, so angles that differ by a full turn (0° and 360°, for example) have the same coordinates ($\sin$ and $\cos$ values).
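The effect of this encoding can be checked with a small sketch. The function names are ours, and the angles are assumed to be given in degrees:

```python
import math


def encode_direction(direction_deg):
    """Map an angle in degrees to its (sin, cos) point on the unit circle."""
    rad = math.radians(direction_deg)
    return math.sin(rad), math.cos(rad)


def encoded_distance(a_deg, b_deg):
    """Euclidean distance between two angles after sin/cos encoding."""
    sin_a, cos_a = encode_direction(a_deg)
    sin_b, cos_b = encode_direction(b_deg)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)


print(encoded_distance(359, 0))    # small: the angles are 1 degree apart
print(encoded_distance(359, 270))  # much larger: the angles are 89 degrees apart
print(encoded_distance(0, 360))    # effectively zero: same point on the circle
```

In the raw representation, 359 and 0 differ by 359 units; after encoding, their distance reflects the true 1-degree gap.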

## Model performance

After training, the model achieved a validation **MSE** of 0.063 and a test **MSE** of 0.064, which is a good result for such a complex task. The close agreement between the two values suggests that the model is not overfitting. It also indicates that the model can represent trajectories as latent vectors that capture their essential features, which we can then use for clustering.

# Clustering

Once the trajectories are compressed into vector representations, we can perform clustering. The clustering model we chose is a Gaussian Mixture Model (GMM). We chose it because it can capture the distribution of the movement data and provides a more flexible, probabilistic framework than, for example, k-means with its rigid cluster shapes.
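To illustrate what "probabilistic" means here, the following toy sketch computes the soft assignment of a single point under a one-dimensional mixture of two Gaussians. It is not the model used in the project, which works on multi-dimensional latent vectors; the weights, means and standard deviations are made-up values.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(x, weights, means, stds):
    """Posterior probability of each mixture component given the point x."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical mixture: a narrow component at 0 and a wide component at 4.
# The point x = 2 is equally far from both means.
probs = responsibilities(2.0, weights=[0.5, 0.5], means=[0.0, 4.0], stds=[1.0, 3.0])
print(probs)
```

A distance-based method such as k-means would see this point as a tie between the two centers. The mixture model favours the wider component, because it takes the spread of each cluster into account and returns a probability for each one instead of a hard label.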

## Number of clusters

Determining the number of clusters is often not a trivial task. In our case, we decided to look at the **<a href="https://en.wikipedia.org/wiki/Bayesian_information_criterion">Bayesian Information Criterion (BIC)</a>** score across different numbers of clusters. The model that achieves the lowest value is a good candidate for the optimal model. Here is the result we got for between 1 and 20 clusters:

<center><img src="https://raw.githubusercontent.com/b00nk3r/NFL-Big-Data-Bowl-2025/main/code/figures/BIC_scores.png" width="800"/></center>
<center>Figure 1. BIC scores across different numbers of clusters.</center>

Based on these scores, the optimal number of clusters should be somewhere between 10 and 15. We decided to stick with 12.
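For reference, the BIC of a fitted model is $k \ln(n) - 2 \ln(\hat{L})$, where $k$ is the number of free parameters, $n$ the number of samples and $\hat{L}$ the maximized likelihood. The sketch below shows how the comparison could be carried out once the log-likelihood of each candidate model is known. It assumes full covariance matrices, and the log-likelihood values in the example are invented for illustration only.

```python
import math


def gmm_param_count(n_components, n_features):
    """Free parameters of a GMM with full covariance matrices (an assumption)."""
    weights = n_components - 1  # mixture weights sum to 1
    means = n_components * n_features
    covariances = n_components * n_features * (n_features + 1) // 2
    return weights + means + covariances


def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion; lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def best_by_bic(log_likelihoods, n_features, n_samples):
    """Pick the number of components with the lowest BIC.

    `log_likelihoods` maps a number of components to the total
    log-likelihood of the corresponding fitted model.
    """
    scores = {
        k: bic(ll, gmm_param_count(k, n_features), n_samples)
        for k, ll in log_likelihoods.items()
    }
    return min(scores, key=scores.get), scores


# Hypothetical log-likelihoods for 1, 2 and 3 components on 100 2-D points.
best_k, scores = best_by_bic({1: -500.0, 2: -400.0, 3: -398.0}, n_features=2, n_samples=100)
print(best_k, scores)
```

The penalty term grows with the number of components, so a model with more clusters is preferred only when it improves the likelihood enough to pay for its extra parameters.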

## Looking at the results

It's finally time to look at the results of clustering. Below are the plots of all the trajectories within each cluster with the mean trajectory highlighted.

<center><img src="https://raw.githubusercontent.com/b00nk3r/NFL-Big-Data-Bowl-2025/main/code/figures/clusters_with_mean_trajectories.png?version=2" width="1600"/></center>
<center>Figure 2. All clusters with mean trajectory highlighted.</center>

So, we have 12 distinct clusters of pre-snap motions. Note that some clusters are symmetrical, and in those cases the mean trajectory does not represent the cluster well.

Overall, this is a good breakdown of the motion trajectories into 12 categories. Each cluster contains similar movement trajectories, and we can use the clusters to analyze different aspects of the game related to pre-snap motion. Let's look at one of them.

# Predicting the route run based on cluster

It is time to apply the clustering results to gain some insights about the game. We focus on predicting the route run by the player who performed the pre-snap motion. By looking at the most likely routes within each cluster, we can predict what to expect from the player in motion before the snap. Let's see what we get for each of the 12 clusters and each of the routes.

<center><img src="https://raw.githubusercontent.com/b00nk3r/NFL-Big-Data-Bowl-2025/main/code/figures/route_ran_heatmap.png?version=2" width="800"/></center>
<center>Figure 3. Probability of each route for each cluster.</center>

The heatmap shows that, for some clusters, we can make predictions about the route. For example, if we see a movement from cluster 0, we can expect either no route at all (which is, unsurprisingly, often the case for all of the clusters) or, with about a 20% chance, a CROSS route. Interestingly, cluster 9 has a 36% probability of the FLAT route, and the probability of no route at all for this cluster is only 36%, the lowest among all clusters. If we look at cluster 9 in Figure 2, we can see that it corresponds to pre-snap movements from the center toward the sidelines of the field. So it makes sense that such movements are often followed by a *flat* route.
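A table like the one in Figure 3 can be built by counting routes within each cluster and normalizing per cluster. The sketch below shows the idea on a handful of made-up records; the function names and the `"NO_ROUTE"` label are hypothetical, and the toy values are not the project's data.

```python
from collections import Counter, defaultdict


def route_distribution(cluster_labels, routes):
    """Return {cluster: {route: share of that cluster's motions}}."""
    counts = defaultdict(Counter)
    for cluster, route in zip(cluster_labels, routes):
        counts[cluster][route] += 1
    return {
        cluster: {route: n / sum(counter.values()) for route, n in counter.items()}
        for cluster, counter in counts.items()
    }


def most_likely_route(distribution, cluster):
    """Route with the highest share within the given cluster."""
    routes = distribution[cluster]
    return max(routes, key=routes.get)


# Toy example: one cluster label and one route label per motion.
toy_clusters = [9, 9, 9, 0, 0, 0, 0]
toy_routes = ["FLAT", "FLAT", "NO_ROUTE", "NO_ROUTE", "NO_ROUTE", "NO_ROUTE", "CROSS"]

distribution = route_distribution(toy_clusters, toy_routes)
print(distribution[9])
print(most_likely_route(distribution, 0))
```

Each row of the resulting table sums to 1, so its entries can be read as the estimated probability of a route given the cluster of the observed motion.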

# Conclusion

As mentioned before, the main idea behind this project is the clustering of pre-snap movements. Once the movements are clustered, we can make predictions within each cluster. The example provided here (predicting the route run) is quite simple, but the method has broader potential. We can explore specific clusters in detail and even build separate models for them, which in some cases can give more valuable predictions than looking at all the data without any separation.

# Appendix

All code is available **<a href="https://github.com/b00nk3r/NFL-Big-Data-Bowl-2025/">here</a>**. Feel free to contact me with any questions.

**Roman Bukreev** (romanbukreev@icloud.com)