<a href="https://www.kaggle.com/hongzeliu7/decompose-tackling-a-neural-network-approach?scriptVersionId=83982439" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Decompose Tackling, a Neural Network Approach

Github link: https://github.com/Xuanfu-lab/2022_NFL_DataBowl

## TABLE OF CONTENTS
* [Introduction](#one)
* [Issues with current PFF Scouting Data](#two)
* [Neural Network Approach to Identify Tacklers](#three)
    - [Data Preparation](#a)
    - [Control Influence](#b)
    - [Neural Network Model & Its Result](#c)
    - [Model Application](#d)
    - [Model Limitations](#e)
* [Neural Network Approach to Predict Tackling Outcome](#four)
* [Grading System Based on Neural Network Model](#five)
* [Expanded Application in Other Areas in NFL](#six)
* [Reference](#ref)

<a id="one"></a>
## I. INTRODUCTION

>*“Football is blocking and tackling, everything else is mythology.” - Vince Lombardi*

Although football has changed a lot since the great Lombardi’s era, blocking and tackling remain fundamental aspects of the game. An important tackle, such as Mike Jones’s tackle on Kevin Dyson in Super Bowl XXXIV, can send a team to dreamland. Missed tackles, such as the Cardinals’ failed attempts on James Harrison, can set up a heartbreaking defeat.

The most exciting and most important event on a punt or kickoff return is the returner’s one-on-one matchup against a tackler. Many returners succeed: their moves make the highlight reel, and the return team gains yards. But many others are stopped immediately and give their team minimal gain. In this paper, we use neural network models to identify tacklers, predict the probability of a successful tackle, and score returners and tacklers by comparing their actual broken-tackle and successful-tackle rates to the expected rates.



<a id="two"></a>
## II. ISSUES WITH CURRENT PFF SCOUTING DATA

During our data exploration phase, we noticed that the PFF Scouting data is less than perfect. Although PFF has done a good job of labeling “missed tackler”, “assist tackler”, and “tackler”, it fails to label some players who should be recorded. On some long returns (> 20 yards), players who made genuine efforts to tackle the returner do not appear in the PFF Scouting data.

An example is Antonio Callaway’s punt return against the Ravens (playID = 2502, GameID = 2018123000). The GIF below from the game suggests that Baltimore No. 36 and Baltimore No. 54 both missed tackles. However, the PFF Scouting data labels only Baltimore No. 54 as a missed tackler and ignores Baltimore No. 36.

<img src="https://media.giphy.com/media/GWNbM2IshTeck5JjXo/giphy.gif" width=600> 

<a id="three"></a>
## III. NEURAL NETWORK APPROACH TO IDENTIFY TACKLERS

Traditional methods are not suitable for identifying tacklers. Statistical methods such as quadratic discriminant analysis (QDA) failed to classify tacklers because of the large number of input features (over 200 were used). Simple heuristics, such as classifying the nearest opposing player as the tackler, also failed because they cannot account for events such as a returner’s teammate blocking the tackling path. Given the nature of the input features, we decided to use a neural network approach. Inspired by *Wang and Zemel*[1], we applied a Feed-Forward Network (FFN) to identify tacklers.

<a id="a"></a>
### a. Data Preparation

We first transformed the NFL’s Next Gen tracking data into a matrix, with each row containing all players’ relevant information in a given time frame. For each tackling event recorded in the PFF Scouting data, we selected the last frame at which the tackler entered a cutoff distance from the returner. The intuition is that a tackler must be close enough to a returner to attempt a tackle, and only the last attempt is labeled in the PFF Scouting data. The cutoff distance we chose is 1.5 yards, which is the estimated distance between the returner’s (x,y) and the tackler’s (x,y) when the tackler reaches out his arm and touches the returner’s body.
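The frame selection described above can be sketched as follows. This is a minimal illustration, not the code used for this work: the function name `last_entry_frame`, the per-frame list-of-tuples input, and the choice of `<=` at the boundary are all assumptions made for the example.

```python
import math

def last_entry_frame(returner_xy, tackler_xy, cutoff=1.5):
    """Return the index of the last frame at which the tackler enters the cutoff circle.

    returner_xy, tackler_xy: per-frame (x, y) positions in yards, same length.
    An "entry" is a frame inside the circle whose previous frame was outside
    (or the first frame, if the tackler starts inside). Returns None if the
    tackler never comes within the cutoff.
    """
    last_entry = None
    was_inside = False
    frames = zip(returner_xy, tackler_xy)
    for frame, ((rx, ry), (tx, ty)) in enumerate(frames):
        inside = math.hypot(tx - rx, ty - ry) <= cutoff
        if inside and not was_inside:
            last_entry = frame  # a new entry into the 1.5-yard circle
        was_inside = inside
    return last_entry

# Hypothetical example: the tackler enters, leaves, then re-enters the circle.
returner = [(0.0, 0.0)] * 5
tackler = [(3.0, 0.0), (1.0, 0.0), (3.0, 0.0), (1.2, 0.0), (1.0, 0.0)]
selected = last_entry_frame(returner, tackler)  # frame 3, the second entry
```

Only the second entry (frame 3) is kept here, which mirrors the rule that only the last attempt is used.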

We then added three features on top of the tracking data to further help the neural network model to identify tacklers. The first two added features are self-explanatory. We will discuss the mathematical formula and intuitions behind the third feature “control influence” later.

- d: Distance from a player to the returner
- an: Angle from a player to the returner
- i: A player’s control influence over the returner
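As a rough sketch, the first two added features could be computed as below. The angle convention (degrees, counter-clockwise from the positive x-axis) is an assumption for illustration; the convention actually used for `an` is not stated here, and `distance_and_angle` is a hypothetical helper name.

```python
import math

def distance_and_angle(player_xy, returner_xy):
    """Return (d, an): distance in yards and angle in degrees from player to returner.

    ASSUMPTION: the angle is measured counter-clockwise from the +x axis
    and wrapped into [0, 360).
    """
    px, py = player_xy
    rx, ry = returner_xy
    dx, dy = rx - px, ry - py
    d = math.hypot(dx, dy)
    an = math.degrees(math.atan2(dy, dx)) % 360.0
    return d, an

# Hypothetical example: the returner is one yard right and one yard up from the player.
d, an = distance_and_angle((0.0, 0.0), (1.0, 1.0))
```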

Below is a sample of one row of the input matrix: (217 features in total)

[returner’s *x, y, s, a, dis, o, dir*] [returner’s teammate1’s *x, y, s, a, dis, o, dir, d, an, i*] … [returner’s teammate10’s *x, y, s, a, dis, o, dir, d, an, i*] [opposing player1’s *x, y, s, a, dis, o, dir, d, an, i*] … [opposing player11’s *x, y, s, a, dis, o, dir, d, an, i*]

Note: ‘x’, ‘y’, ‘s’, ‘a’, ‘dis’, ‘o’, ‘dir’ are taken directly from the tracking data. 
Their definitions are in [NFL Big Data Bowl 2022 | Kaggle](https://www.kaggle.com/c/nfl-big-data-bowl-2022/data). 
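The 217-feature count follows from the row layout: 7 features for the returner, plus 10 features for each of 10 teammates and 11 opposing players (7 + 100 + 110 = 217). A short sketch that builds the column names confirms the arithmetic; the naming scheme (`teammate1_x`, and so on) is hypothetical.

```python
BASE = ["x", "y", "s", "a", "dis", "o", "dir"]  # taken from the tracking data
EXTRA = ["d", "an", "i"]                        # the three added features

def feature_names(n_teammates=10, n_opponents=11):
    """Build column names for one input row (naming scheme is illustrative)."""
    names = [f"returner_{f}" for f in BASE]
    for k in range(1, n_teammates + 1):
        names += [f"teammate{k}_{f}" for f in BASE + EXTRA]
    for k in range(1, n_opponents + 1):
        names += [f"opponent{k}_{f}" for f in BASE + EXTRA]
    return names

columns = feature_names()
n_features = len(columns)  # 7 + 10 * 10 + 11 * 10
```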

Further, we augmented the training set with mirror images of each play, reflected about the x-axis and the y-axis of the field. Our intuition is that tackling events should be invariant to these symmetric transformations. A tackler who has an 85% successful tackle rate against a returner on the left side of the field should still have an 85% successful tackle rate against a returner on the right side of the field; the same applies to reflecting the field about the y-axis.

This mirror image augmentation significantly reduced the neural network’s overfitting. Originally, we had 7,120 samples from the PFF Scouting data. With mirror plays, we had 28,480 (7,120 × 2 × 2) samples to train the model.
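A minimal sketch of the four-fold augmentation is shown below, under stated assumptions: the field spans 120 × 53.3 yards in tracking coordinates, and angles are in degrees measured clockwise from the positive y-axis. If the angle convention differs, the two angle formulas swap. Only `x`, `y`, and `dir` are mirrored here; `o` would transform like `dir`, while `s`, `a`, and `dis` are unchanged. The helper names are hypothetical.

```python
FIELD_LENGTH = 120.0  # yards along x, including end zones
FIELD_WIDTH = 53.3    # yards along y

def flip_x(p):
    """Mirror a player left-to-right (x -> FIELD_LENGTH - x).

    ASSUMPTION: angles are clockwise from the +y axis, so negating the
    x-component of the heading maps dir to 360 - dir.
    """
    return {"x": FIELD_LENGTH - p["x"], "y": p["y"],
            "dir": (360.0 - p["dir"]) % 360.0}

def flip_y(p):
    """Mirror a player top-to-bottom (y -> FIELD_WIDTH - y); dir -> 180 - dir."""
    return {"x": p["x"], "y": FIELD_WIDTH - p["y"],
            "dir": (180.0 - p["dir"]) % 360.0}

def mirror_variants(players):
    """Return the original frame plus its three mirror images (4 samples total)."""
    original = [dict(p) for p in players]
    flipped_x = [flip_x(p) for p in players]
    flipped_y = [flip_y(p) for p in players]
    flipped_xy = [flip_y(flip_x(p)) for p in players]
    return [original, flipped_x, flipped_y, flipped_xy]

# Hypothetical player moving toward +x (dir = 90 under the assumed convention).
variants = mirror_variants([{"x": 10.0, "y": 10.0, "dir": 90.0}])
```

Each original sample yields four training samples, which matches the 7,120 × 2 × 2 = 28,480 count.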

<a id="b"></a>
### b. Control Influence

When identifying a tackler, we wanted to quantitatively measure each player’s control of the field. With all players’ field control measures, we can determine the distribution of blockers and tacklers, and whether the blockers are positioned to block the tacklers. This can be measured using *William Spearman*’s[2] pitch control model. More specifically, we are interested in each player’s individual influence around the returner. We used *Javier Fernandez and Luke Bornn*’s[3] formula for the influence calculation. Player i’s influence at a given location p at time t is the value of his influence function f_i at p, normalized by its value at his own location p_i(t):

$$ I_i= \frac {f_i(p, t)}  {f_i(p_i(t), t)}$$

We demonstrate how control influence measures path interference with the hypothetical case below:

- Returner is at (10, 10) with speed 10 yard/s moving toward right-bottom corner
- Tackler is at (10, 12) with speed 10 yard/s moving downward

<img src="https://i.ibb.co/1nWSxfb/Picture1.png" width=800> 

The last column in the graph above shows the tackler’s influence over the region. Because the tackler is running downward from point (10, 12) at a relatively high speed, his influence is concentrated primarily in the lower region highlighted in red. The middle column shows the returner’s influence, and the first column shows the intertwined influence of the returner and the tackler. The red region in the first column shows where the two players are most likely to have conflicting control. In NFL terms, this is where a tackle would most likely happen.
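The influence ratio can be sketched with a simplified Gaussian influence function, loosely in the spirit of the cited formula. This is an illustration under stated assumptions, not the implementation used here: the influence radius (4.0), the maximum speed (13.0), the half-velocity shift of the Gaussian center, and the clamp on the speed ratio are all illustrative choices. Because the function computes a ratio, the Gaussian normalizing constant cancels and is omitted.

```python
import math

def influence(p, player_xy, velocity, radius=4.0, max_speed=13.0):
    """Influence of a player at location p: f(p) / f(player location).

    ASSUMPTIONS: f is a 2D Gaussian stretched along the direction of motion
    and centered half a velocity vector ahead of the player; radius and
    max_speed are illustrative constants.
    """
    px, py = player_xy
    vx, vy = velocity
    speed = math.hypot(vx, vy)
    # Unit vector along the direction of motion (arbitrary if stationary).
    ux, uy = (vx / speed, vy / speed) if speed > 0 else (1.0, 0.0)
    ratio = min(speed ** 2 / max_speed ** 2, 0.99)  # clamp keeps sigma_perp > 0
    sigma_par = (radius + radius * ratio) / 2.0     # spread along the motion
    sigma_perp = (radius - radius * ratio) / 2.0    # spread across the motion
    mx, my = px + 0.5 * vx, py + 0.5 * vy           # shifted Gaussian center

    def log_density(point):
        dx, dy = point[0] - mx, point[1] - my
        along = dx * ux + dy * uy
        across = -dx * uy + dy * ux
        return -0.5 * ((along / sigma_par) ** 2 + (across / sigma_perp) ** 2)

    return math.exp(log_density(p) - log_density(player_xy))

# The hypothetical tackler above: at (10, 12), moving downward at 10 yards/s.
below = influence((10.0, 10.0), (10.0, 12.0), (0.0, -10.0))
above = influence((10.0, 14.0), (10.0, 12.0), (0.0, -10.0))
```

With these assumed parameters, `below` is larger than `above`, which agrees with the tackler’s influence being concentrated in the region he is running toward.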

<a id="c"></a>
### c. Neural Network Model & Its Result

We trained a relatively shallow neural network (2 hidden layers with 2 drop-out layers) because of the limited sample size: even with mirror plays, we have only 28,480 samples. We used 80% of the samples as the training set and the remaining 20% as the test set. Deeper neural networks overfit the training set, resulting in very low accuracy on the test set. With a shallower structure and two drop-out layers, the overfitting issue was resolved.
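To make the layer shapes concrete, here is a dependency-free sketch of a forward pass through a 217 → 128 → 64 → 11 network with drop-out after each hidden layer. Several details are assumptions for illustration only: the ReLU activations, the drop-out rate of 0.5, the 11-way softmax output (one class per opposing player), and the random, untrained weights. It is not the trained model and does not reproduce any reported accuracy.

```python
import math
import random

def dense(inputs, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def dropout(values, rate, rng, training):
    """Inverted drop-out: active only during training, identity at inference."""
    if not training:
        return values
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def softmax(values):
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (illustrative initialization)."""
    scale = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(row, layers, rng, rate=0.5, training=False):
    (w1, b1), (w2, b2), (w3, b3) = layers
    h = dropout(relu(dense(row, w1, b1)), rate, rng, training)
    h = dropout(relu(dense(h, w2, b2)), rate, rng, training)
    return softmax(dense(h, w3, b3))  # one probability per opposing player

rng = random.Random(0)
layers = [init_layer(217, 128, rng), init_layer(128, 64, rng), init_layer(64, 11, rng)]
row = [rng.random() for _ in range(217)]  # stand-in for one input row
probs = forward(row, layers, rng)
predicted = max(range(11), key=probs.__getitem__)  # index of the predicted tackler
```

In practice a deep learning framework would handle training; the point of the sketch is only the flow of one 217-feature row into 11 class probabilities.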

<img src="https://i.ibb.co/2tNkRjw/Picture2.png">

Using naive guessing as a baseline, the probability of correctly picking the tackler is 9.1% (1 out of 11). Our model achieved 89.0% accuracy for the training set and 78.7% for the test set. 

<img src="https://i.ibb.co/pbpp9sv/Picture3.png">





<a id="d"></a>
### d. Model Application

With a successful neural network model, we expanded its application to identify tacklers that weren’t recorded in the PFF Scouting data. Take the game Baltimore vs. Cleveland (playID = 2502, GameID = 2018123000) as an example:

<img src="https://i.ibb.co/4Y9B78Y/Picture4.png">

The expected tackler sequence is:

[9 … 8 … 1 … 10 … 2 … 6 … 0 … 3]
    
Below is the tackler (Y) predicted by the neural network model for each frame of the punt-return phase:

[<span style="color:red;"> 4  4 </span> 9  9  9  9  9  9  8  8  8  8  8  1  1  1  1  1  1  1  1  1  1  1  10 10 10 10 10 10 10 10 10 10  2  2  2  2  2  2  6  6  6  6 <span style="color:red;"> 2  1  3  </span>0  0  0  0  0  0  0  0  3  3  3  3  3  3  3  3  3]

The model predicts the 1st tackler to be Y=9 (BAL 36). After the 8th frame, it predicts the 2nd tackler to be Y=8 (BAL 54), and so on. Lastly, it predicts Y=0 (BAL 4) and Y=3 (BAL 87) as the last two tacklers. Despite a few misclassified frames (highlighted in red above), the overall sequence is in line with what we observed in the real game.
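One simple way to turn the per-frame predictions into a tackler sequence is to collapse consecutive repeats and discard very short runs. The sketch below is an assumption about how this could be automated, not a step described in this work; the `min_run` threshold of 3 frames is an illustrative choice.

```python
from itertools import groupby

def tackler_sequence(frame_predictions, min_run=3):
    """Collapse per-frame predictions into an ordered tackler sequence.

    Runs shorter than min_run frames are treated as misclassifications
    and dropped (ASSUMPTION: min_run=3 is an illustrative threshold).
    """
    runs = [(label, len(list(group))) for label, group in groupby(frame_predictions)]
    sequence = []
    for label, length in runs:
        if length < min_run:
            continue
        # Merge with the previous run if a dropped blip had split it in two.
        if not sequence or sequence[-1] != label:
            sequence.append(label)
    return sequence

# Per-frame predictions from the play above.
frames = ([4] * 2 + [9] * 6 + [8] * 5 + [1] * 11 + [10] * 10
          + [2] * 6 + [6] * 4 + [2, 1, 3] + [0] * 8 + [3] * 9)
sequence = tackler_sequence(frames)  # [9, 8, 1, 10, 2, 6, 0, 3]
```

With this threshold, the frames highlighted in red are filtered out and the result matches the expected tackler sequence.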


<a id="e"></a>
### e. Model Limitations

As mentioned in data preparation, we only used each tackler’s last entry into the 1.5-yard circle around the returner. However, in some plays, the tackler misses a tackle, gets up, chases down the returner, and makes a successful tackle. Under such circumstances, only the last successful tackle is included in the data selection, and the first failed attempt is not. Such re-entries, however, usually mean that the returner is moving much slower than the tackler because of chaotic field situations and/or the tackler’s previous attempt. Under these circumstances, the returner is unlikely to make much forward progress.


<a id="four"></a>
## IV. NEURAL NETWORK APPROACH TO PREDICT TACKLING OUTCOME 

After seeing the neural network model’s success in identifying tacklers, we expanded its application to outcome prediction. The input matrix for the outcome prediction model is identical to that of the tackler identification model, and the model structures are similar except for a larger number of neurons in each layer. The two hidden layers now have 400 and 200 neurons respectively, compared to 128 and 64 in the previous model. The output layer uses a sigmoid function, which yields a value between 0 and 1. This value can be treated as the probability of a specific tackler successfully tackling the returner. The model applies a cutoff at 0.5 to determine whether a tackle is missed (Y < 0.5) or is an assist/success (Y ≥ 0.5). 
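The output rule can be written out as a short sketch. The function names and the string labels are hypothetical; the sigmoid and the 0.5 cutoff follow the description above.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def tackle_outcome(probability, cutoff=0.5):
    """Apply the cutoff: missed if Y < 0.5, assist/success if Y >= 0.5."""
    return "assist/success" if probability >= cutoff else "missed"

y = sigmoid(0.0)            # a raw output of 0 sits exactly on the cutoff
label = tackle_outcome(y)   # counted as assist/success because Y >= 0.5
```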

<img src="https://i.ibb.co/s2zyNV3/Picture5.png">

The model achieves 89.4% accuracy on the training set and 80.3% accuracy on the test set. While the results are good, the model shows signs of overfitting: the losses on the training set and the test set start to diverge after the 50th training epoch. As a result, we had to stop training early at the 150th epoch (compared to the 200th epoch for the tackler identification model) to prevent excessive overfitting. We hypothesize that the overfitting issue was due to our neural network model being too shallow. However, we were not able to train an effective deep neural network with the limited number of samples.
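The stopping epoch above was fixed at 150. A common alternative is patience-based early stopping on the held-out loss, sketched below. This is an assumption offered for illustration, not the procedure used here; `early_stop_epoch` and the `patience` value are hypothetical.

```python
def early_stop_epoch(val_losses, patience=10):
    """Return the epoch (0-indexed) with the lowest held-out loss seen before
    the loss fails to improve for `patience` consecutive epochs."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Synthetic losses: improving at first, then diverging as overfitting sets in.
losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.8]
stop_at = early_stop_epoch(losses, patience=3)  # epoch 3, where the loss bottoms out
```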

<img src="https://i.ibb.co/8N2ZLz6/Picture6.png">



<a id="five"></a>
## V. GRADING SYSTEM BASED ON NEURAL NETWORK MODEL

As mentioned above, the second neural network model can predict the probability (Y) of a specific tackler successfully tackling the returner. This probability translates to an unscaled difficulty for the returner.
A high number indicates high difficulty for the returner and low difficulty for the tackler. 
We can now use this unscaled difficulty measurement to score both the returner and the tackler based on their historical performances. We use the neural network model to predict the difficulty of each play, and the scores for each tackle action are calculated as follows:

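$$Difficulty_{returner}=Y$$
$$Difficulty_{tackler}=1-Y$$
$$Score_{returner}=Difficulty_{returner} \times \begin{cases} 100 & \text{if the returner breaks the tackle} \\ 0 & \text{otherwise} \end{cases}$$
$$Score_{tackler}=Difficulty_{tackler} \times \begin{cases} 100 & \text{if the tackle succeeds} \\ 0 & \text{otherwise} \end{cases}$$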
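The scoring formulas can be expressed as a small function. One assumption is made for the sketch: every tackle attempt either succeeds or is broken by the returner, so a single boolean covers both cases. The function name is hypothetical, and how per-attempt scores are aggregated into the rankings is not covered.

```python
def tackle_scores(y, tackle_succeeds):
    """Return (score_returner, score_tackler) for one tackle attempt.

    y: predicted probability that the tackler succeeds.
    ASSUMPTION: an attempt that does not succeed counts as a broken tackle.
    """
    difficulty_returner = y
    difficulty_tackler = 1.0 - y
    score_returner = difficulty_returner * (0.0 if tackle_succeeds else 100.0)
    score_tackler = difficulty_tackler * (100.0 if tackle_succeeds else 0.0)
    return score_returner, score_tackler

# Hypothetical attempt with Y = 0.8: hard for the returner, easy for the tackler.
broken = tackle_scores(0.8, tackle_succeeds=False)  # returner earns about 80
made = tackle_scores(0.8, tackle_succeeds=True)     # tackler earns about 20
```

A returner who breaks a tackle that was likely to succeed earns a large score, while a tackler who completes that same likely tackle earns a small one.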

Here we list the top 10 tacklers with more than 10 tackling attempts:

<img src="https://i.ibb.co/kyqGKWm/Picture7.png" width=650>

Below are the top 10 returners with more than 10 breaking attempts:

<img src="https://i.ibb.co/rpLfn7r/Picture8.png" width=650>



<a id="six"></a>
## VI. EXPANDED APPLICATION IN OTHER AREAS IN NFL

Although we only use special teams data to train, label, predict, and grade players’ tackling and tackle-breaking ability, our model could be applied widely to other areas of the NFL, since tackling is possibly the most fundamental element of football. We can apply our model to other types of plays, such as rushing and passing plays, to predict ball carriers’ and tacklers’ performance as well. Additionally, since rushing and passing plays are far more frequent in a football game than special teams plays, the significantly larger dataset should make our neural network models more reliable.

Word Count:1957


<a id="ref"></a>
## Reference

[1] Wang, Kuan-Chieh and Zemel, Richard. Classifying NBA Offensive Plays Using Neural Networks. MIT Sloan Sports Analytics Conference 2016. 2016.

[2] Spearman, William. Quantifying Pitch Control. 10.13140/RG.2.2.22551.93603. 2016.

[3] Fernandez, Javier and Bornn, Luke. Wide Open Spaces: A statistical technique for measuring space creation in professional soccer. MIT Sloan Sports Analytics Conference 2018. 2018.