## Big Man Betweenness (BMB): A Network Measure for QB Pass Protection

Team: Bruno Scodari, Mirjana Stevanovic, Peiying Hua  \
Affiliation: Dartmouth College

In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

## TODO 

All
- proofread

Bruno 
- pressure test code
- link folders to git 

Mina 
- drive conclusion? consider adding Abstract to just give the audience the answers up front?
- see if you want to add to any of the analysis sections 
- help with making visuals prettier? there's an NFL ggplot addition that maybe we should try to use

Peiying
- finish programming the visualization
- take a look at the mixed model specification to see if we can make it look more mathy? may not be necessary

In [3]:
import pandas as pd
import warnings
from IPython.display import display_html
from itertools import chain,cycle
warnings.filterwarnings('ignore')

# define a function that displays dataframes side by side
def display_side_by_side(*args,titles=cycle([''])):
    html_str=''
    for df,title in zip(args, chain(titles,cycle(['</br>'])) ):
        html_str+='<th style="text-align:center"><td style="vertical-align:top">'
        html_str+=f'<h2 style="text-align: center;">{title}</h2>'
        html_str+=df.to_html().replace('table','table style="display:inline"')
        html_str+='</td></th>'
    display_html(html_str,raw=True)

## Introduction

**Network analysis** is the scientific discipline that examines the structure of relations, often social ones, through the use of graph theory and quantitative methods. A network is characterized by a set of actors, or **nodes**, and the relationships, or **edges**, that connect them. Edges can be directed or undirected, and are often assigned weights that reflect the magnitude or strength of a given relationship. A network is easily represented mathematically using the underlying adjacency matrix, and can be visualized using a sociogram. Different measures can be computed using the adjacency matrix to answer questions about a specific network.

Network analysis has applications across many domains, including sports. For example, networks have been used in the past to study pass-sharing among soccer players and assess the importance of individual players and the relationships among them. In this project, we apply network analysis to the NFL's Next Gen Stats data and propose a novel network analytic metric titled **B**ig **M**an **B**etweenness (**BMB**) to evaluate the pass protection efficiency of offensive linemen. This notebook outlines our approach, and its content is organized as follows:
1. **Network Assembly**: We construct a network for each frame in the 8-week tracking data and integrate player spatial data to inform node positioning.
2. **The BMB Metric**: We outline the development and visualization of the BMB metric.
3. **Statistical Inference**: We test the association between BMB and quarterback pass protection. 
4. **Application**: We use the BMB metric to compute team ratings, quantify the probability of success, and visualize plays.
5. **Conclusion**: We discuss the limitations of our approach, and outline future directions of this project.

We hope this work will provide actionable insights to NFL coaches, and inform future network science applications in American football.

## 1. Network Assembly

For each frame in a given play, we define a network whose nodes represent the players and their spatial positions on the field. Edges are induced among the quarterback and all offensive and defensive players, as well as among the offensive line and defensive players. Edges are inversely weighted by $\epsilon$, where $\epsilon$ represents the Euclidean distance between nodes, so that ties among players who are closer on the field are stronger than ties among players who are farther apart. Lastly, edges between nodes that are more than 7 yards apart are removed. We chose a 7-yard cutoff because in the Shotgun formation, which is commonly used on passing plays, the quarterback stands approximately 5 to 7 yards behind the center. Restricting connections to those within 7 yards of a given player focuses our analysis on the most meaningful relationships and filters out noisy relations that could distort the results of the subsequent network analysis. The GIF below shows the networks that comprise a single play chosen for illustration.
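
The edge rules above can be sketched for a single frame as follows. This is a minimal illustration, not the code used for the analysis: the player names, roles, and coordinates are hypothetical, `eligible` encodes our reading of which pairs may be connected (quarterback to anyone, offensive lineman to defender), and the weight is assumed to be the reciprocal of the Euclidean distance.

```python
import math
from itertools import combinations

CUTOFF_YARDS = 7.0  # cutoff stated in the text

# Hypothetical single-frame data: name -> (role, (x, y) position in yards).
players = {
    "QB": ("QB", (0.0, 0.0)),
    "C": ("OL", (5.0, 0.0)),
    "LG": ("OL", (5.0, 1.5)),
    "DT": ("DEF", (6.0, 0.5)),
    "CB": ("DEF", (12.0, 20.0)),
    "WR": ("OFF", (5.0, -18.0)),
}


def eligible(role_a, role_b):
    """Assumed edge rule: QB with anyone, or an O-lineman with a defender."""
    roles = {role_a, role_b}
    return "QB" in roles or roles == {"OL", "DEF"}


def build_edges(players, cutoff=CUTOFF_YARDS):
    """Return {(name_a, name_b): weight}, with weight = 1 / Euclidean distance."""
    edges = {}
    for a, b in combinations(sorted(players), 2):
        (role_a, xy_a), (role_b, xy_b) = players[a], players[b]
        dist = math.dist(xy_a, xy_b)
        # Keep only eligible pairs that are at most `cutoff` yards apart.
        if eligible(role_a, role_b) and 0 < dist <= cutoff:
            edges[(a, b)] = 1.0 / dist
    return edges


edges = build_edges(players)
for pair, weight in sorted(edges.items()):
    print(pair, round(weight, 3))
```

In this toy frame the cornerback and wide receiver end up isolated because they are more than 7 yards from every eligible partner, and the two linemen are not tied to each other because the rule only connects them to the quarterback and to defenders.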

Week 1: Tampa Bay Buccaneers vs. Dallas Cowboys, play 947: *(:05) (Shotgun) T.Brady pass incomplete short middle to A.Brown (D.Kazee)*
![gif1](https://raw.githubusercontent.com/bscod27/big-man-betweenness/main/gifs/pos/final.gif)


Once a network is initialized, centrality measures are commonly computed to assign each node a numeric value that captures its "importance." A commonly used centrality measure is **betweenness centrality**. The betweenness of node $i$ is given by the proportion of **shortest paths** between pairs of other nodes that pass through node $i$, or, in unnormalized form, simply by the number of such paths. Betweenness centrality can also be computed in a weighted network by using **minimal weighted paths** instead of shortest paths, where path length is measured by the weights along the path (the weighted geodesic distance) rather than by the number of edges. Nodes with high betweenness are commonly referred to as "**brokers**," because this measure has often been used to understand which nodes in a social network "broker" or mediate relationships between others. Using the weighted network, we compute a betweenness score for each offensive lineman, take the square root of each score, calculate the mean across the offensive line, and overlay this dynamic metric onto the network in the GIF below.
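
To make the calculation concrete, here is a small sketch of weighted betweenness using Brandes' algorithm with Dijkstra's shortest paths, written with the standard library only. It rests on several assumptions: the cost of traversing an edge is taken to be the Euclidean distance (the reciprocal of the inverse-distance weight), scores are left unnormalized, ties between equally short paths are detected by exact equality, and the four players and their distances are hypothetical. The analysis code may normalize or weight differently.

```python
import heapq
import math


def weighted_betweenness(adj):
    """Unnormalized betweenness for an undirected graph {node: {neighbor: cost}}."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        dist = {s: 0.0}
        sigma = dict.fromkeys(adj, 0.0)  # number of shortest paths from s
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        done, order = set(), []
        heap = [(0.0, s)]
        while heap:
            d, v = heapq.heappop(heap)
            if v in done:
                continue  # stale heap entry
            done.add(v)
            order.append(v)
            for w, cost in adj[v].items():
                nd = d + cost
                if w not in dist or nd < dist[w]:
                    dist[w], sigma[w], preds[w] = nd, sigma[v], [v]
                    heapq.heappush(heap, (nd, w))
                elif nd == dist[w] and w not in done:
                    sigma[w] += sigma[v]  # another equally short path
                    preds[w].append(v)
        # Back-propagate pair dependencies from the farthest node inward.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2.0 for v, b in bc.items()}


# Hypothetical distances in yards between connected players.
distances = {("QB", "LT"): 4.0, ("QB", "LG"): 4.0, ("QB", "DE"): 6.0,
             ("LT", "DE"): 1.0, ("LG", "DE"): 3.0}
adj = {}
for (a, b), d in distances.items():
    adj.setdefault(a, {})[b] = d
    adj.setdefault(b, {})[a] = d

bc = weighted_betweenness(adj)
offensive_line = ("LT", "LG")
line_betw = sum(math.sqrt(bc[p]) for p in offensive_line) / len(offensive_line)
print(bc, line_betw)
```

Here the left tackle sits on the shortest path between the defensive end and the quarterback (1 + 4 yards beats the direct 6 yards), so he receives a betweenness of 1 while the left guard receives 0, and the mean of the square roots across the line is 0.5.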

Week 1: Tampa Bay Buccaneers vs. Dallas Cowboys, play 947: *(:05) (Shotgun) T.Brady pass incomplete short middle to A.Brown (D.Kazee)*
![gif2](https://raw.githubusercontent.com/bscod27/big-man-betweenness/main/gifs/betw/final.gif)

Extending the concept of a broker to football, **we hypothesize that offensive lines with higher mean betweenness centrality across the frames of a play are associated with better quarterback pass protection.** This forms the basis for our metric, BMB, which is explained in the following section. Before moving there, we wrangle our data into usable form by taking the following steps:
- Merge datasets to the 8-week tracking information based on primary keys specified in the Kaggle prompt
- Assemble networks for each frame in a given play and calculate the average weighted betweenness centrality among the offensive line
- Define an outcome variable called "**pressure**," which represents whether the QB was either hurried, hit, or sacked on a given play 
- Average dynamic variables and take the mode of static variables across the frames for each play to derive a large data frame whose rows/observations represent individual plays
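
The last two steps can be sketched with the standard library as follows. The frame records and the field names `qb_hurry`, `qb_hit`, and `qb_sack` are hypothetical stand-ins for the scouting columns; the roll-up takes the mean of the dynamic variable (`line_betw`) and the mode of the static ones, as described above.

```python
from collections import defaultdict
from statistics import mean, mode

# Hypothetical frame-level records (one row per frame of a play).
frames = [
    {"gameId": 2021091207, "playId": 152, "line_betw": 0.02,
     "def_coverage": "Cover-3", "qb_hurry": 0, "qb_hit": 0, "qb_sack": 0},
    {"gameId": 2021091207, "playId": 152, "line_betw": 0.04,
     "def_coverage": "Cover-3", "qb_hurry": 0, "qb_hit": 0, "qb_sack": 0},
    {"gameId": 2021091207, "playId": 218, "line_betw": 0.01,
     "def_coverage": "Cover-1", "qb_hurry": 1, "qb_hit": 0, "qb_sack": 0},
    {"gameId": 2021091207, "playId": 218, "line_betw": 0.03,
     "def_coverage": "Cover-1", "qb_hurry": 1, "qb_hit": 0, "qb_sack": 0},
    {"gameId": 2021091207, "playId": 218, "line_betw": 0.05,
     "def_coverage": "Cover-1", "qb_hurry": 1, "qb_hit": 0, "qb_sack": 0},
]


def roll_up(frames):
    """Collapse frame-level rows into one row per (gameId, playId)."""
    groups = defaultdict(list)
    for row in frames:
        groups[(row["gameId"], row["playId"])].append(row)
    plays = []
    for (game_id, play_id), rows in groups.items():
        plays.append({
            "gameId": game_id,
            "playId": play_id,
            # dynamic variable: average across frames
            "line_betw": mean([r["line_betw"] for r in rows]),
            # static variable: most common value across frames
            "def_coverage": mode([r["def_coverage"] for r in rows]),
            # pressure = QB hurried, hit, or sacked on the play
            "pressure": int(any(r["qb_hurry"] or r["qb_hit"] or r["qb_sack"]
                                for r in rows)),
        })
    return plays


plays = roll_up(frames)
for play in plays:
    print(play)
```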

The head of the resulting dataset is displayed below, which will be used for subsequent analysis.

In [3]:
import pandas as pd
url = 'https://raw.githubusercontent.com/bscod27/big-man-betweenness/main/snippets/rolled.csv'
df = pd.read_csv(url,index_col=0,parse_dates=[0])
print('Dimension:', df.shape, '\n')
df.head(10)

Dimension: (8550, 11) 



| week | gameId | playId | pos_team | def_team | down | yardstogo | def_coverage | def_covtype | def_playersinbox | line_betw | pressure |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2021091207 | 152 | ARI | TEN | 1 | 10 | Cover-3 | Zone | 8 | 0.032619 | 0 |
| 1 | 2021091207 | 218 | ARI | TEN | 1 | 18 | Cover-3 | Zone | 7 | 0.039551 | 0 |
| 1 | 2021091207 | 253 | ARI | TEN | 2 | 17 | Quarters | Zone | 5 | 0.009189 | 0 |
| 1 | 2021091207 | 386 | ARI | TEN | 2 | 24 | Cover-6 | Zone | 5 | 0.013034 | 0 |
| 1 | 2021091207 | 410 | ARI | TEN | 3 | 16 | Cover-3 | Zone | 6 | 0.010476 | 0 |
| 1 | 2021091207 | 621 | ARI | TEN | 2 | 5 | Cover-1 | Man | 5 | 0.016653 | 0 |
| 1 | 2021091207 | 660 | ARI | TEN | 3 | 5 | Red Zone | Other | 6 | 0.018937 | 0 |
| 1 | 2021091207 | 839 | ARI | TEN | 2 | 6 | Cover-1 | Man | 6 | 0.017703 | 0 |
| 1 | 2021091207 | 863 | ARI | TEN | 3 | 3 | Cover-1 | Man | 7 | 0.025111 | 0 |
| 1 | 2021091207 | 1090 | ARI | TEN | 1 | 10 | Quarters | Zone | 6 | 0.022553 | 0 |


## 2. The "Big Man Betweenness" Metric

### Overview
Recognizing that the betweenness centrality of an offensive line in a given play is largely dependent on the spatial positioning of the opposition at the time of snap, we attempt to account for the number of defenders in the box to standardize this metric accordingly. For instance, teams with dual-threat quarterbacks may strongly influence the opposition's formation and propensity to blitz, resulting in potentially lower betweenness centrality among the possession team's offensive line. Our methodology for standardizing betweenness of the offensive line forms the basis of the $BMB$ metric. 

### The math
We define $BMB$ as the ratio between the observed square root of O-line betweenness and its expectation given the number of defenders in the box. We take the square root of O-line betweenness to normalize its distribution. The mathematical definition is below:

$$BMB = \frac{Y}{E(Y|X)}, \quad \text{where } Y=\sqrt{\text{O-line betweenness}} \ \text{ and } \ X=\text{defenders in the box}$$

Both the numerator $Y$ and the denominator $E(Y|X)$ are treated as normally distributed with nonzero mean, i.e. $\mathcal{N}(\mu \neq 0, \ \sigma^2)$.

To estimate $E(Y|X)$, we fit a multilevel linear model where we regress $Y$ onto $X$ and control for within-group variability for week, game, offense, and defense using random intercepts. The fitted values of this model yield the expected values, which are then used to obtain the $BMB$ metric at a play-level. The model specification is included below:

$$ Y = \beta_0 + \beta_1 X + (1 | \text{week}) + (1 | \text{game}) + (1 | \text{offense}) + (1 | \text{defense}) + \epsilon$$

In matrix notation:

$$y = X\beta + Zb + \epsilon$$


where:
1. $y$ is the vector of observations of the dependent variable, with $E(y) = X\beta$.
2. $X$ and $Z$ are the design matrices for the fixed and random effects, respectively.
3. $\beta$ is the vector of fixed effects.
4. $b$ is the vector of random effects, with $E(b) = 0$ and variance-covariance matrix $var(b) = G$.
5. $\epsilon$ is the random error term, with $E(\epsilon) = 0$ and $var(\epsilon) = R$.

The estimates $\hat{\beta}$ and $\hat{b}$ are obtained by solving the mixed model equations:

$$\begin{bmatrix} X^TR^{-1}X & X^TR^{-1}Z \\ Z^TR^{-1}X & Z^TR^{-1}Z+G^{-1} \end{bmatrix} \begin{bmatrix} \hat{\beta} \\ \hat{b}\end{bmatrix} = \begin{bmatrix} X^TR^{-1}y \\ Z^TR^{-1}y\end{bmatrix}$$
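
The ratio itself is simple to compute once expected values are available. The sketch below is deliberately simplified: instead of the fitted values of the multilevel model, it uses the mean of $Y$ within each defenders-in-the-box group as a stand-in for $E(Y|X)$, and the play-level values are hypothetical. It is meant only to show how the observed and expected values combine, not to reproduce the model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical play-level rows:
#   y = sqrt(O-line betweenness), x = defenders in the box
plays = [
    {"x": 5, "y": 0.10},
    {"x": 5, "y": 0.14},
    {"x": 7, "y": 0.16},
    {"x": 7, "y": 0.24},
]


def expected_by_box(plays):
    """Stand-in for E(Y|X): the mean of y within each value of x.

    The write-up uses fitted values from a mixed model instead.
    """
    groups = defaultdict(list)
    for p in plays:
        groups[p["x"]].append(p["y"])
    return {x: mean(ys) for x, ys in groups.items()}


def add_bmb(plays):
    """Attach the expected value and BMB = observed / expected to each play."""
    expected = expected_by_box(plays)
    return [dict(p, expected=expected[p["x"]], bmb=p["y"] / expected[p["x"]])
            for p in plays]


scored = add_bmb(plays)
for row in scored:
    print(row["x"], row["y"], round(row["expected"], 3), round(row["bmb"], 3))
```

Plays with the same number of defenders in the box share a denominator, so a line is only compared against what is expected under a similar defensive look.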


### Sampling distribution
Since $Y \sim \mathcal{N}(\mu \neq 0, \ \sigma^2)$ and $E(Y|X) \sim \mathcal{N}(\mu \neq 0, \ \sigma^2)$, the resulting ratio distribution of $BMB$ should be **approximately normal** in certain cases according to prior [research](https://link.springer.com/content/pdf/10.1007/s00362-012-0429-2.pdf).

We interrogate the normality of the empirical distribution for play-level $BMB$. According to our data, the mean of the distribution is centered at approximately 1, while the variance is roughly 0.03. We can visualize the empirical distribution in relation to the $\mathcal{N}(\mu=1, \ \sigma^2=0.03)$ approximation. 


<div>
<img src="https://raw.githubusercontent.com/bscod27/big-man-betweenness/main/images/density.png" width="550"/>
</div>

While the empirical distribution is slightly off-centered and right-skewed, we observe that the normal approximation is reasonable. Understanding the sampling distribution for $BMB$ comes in handy for application purposes in subsequent sections.

### Interpretation
The major benefit of constructing $BMB$ as a ratio between observed and expected values is interpretability. Given that we want to maximize $BMB$, a $BMB>1$ indicates overperformance while a $BMB<1$ indicates underperformance. The normal approximation for our statistic also aids interpretation, because it lets us assess the likelihood of observing a $BMB$ value more extreme than the null value of $BMB=1$.

## 3. Statistical Inference

In the previous section, we showed how to calculate $BMB$, uncovered its approximate sampling distribution, and discussed how to interpret the statistic; however, we have not yet shown **why** this metric is important. To illustrate the importance of $BMB$, we fit several multilevel logistic regression models to demonstrate its strong association with the previously defined **pressure** outcome, which we denote $P$. The following models were fit on play-level data and control for varying subsets of covariates that may distort the exposure-outcome relationship:

- Model 1: $\text{logit} \Pr(P=1) = \beta_0 + (1 | \text{week}) + (1 | \text{game}) + (1 | \text{off}) + (1 | \text{def}) + \beta_1 (BMB)$
- Model 2: $\text{logit} \Pr(P=1) = \beta_0 + (1 | \text{week}) + (1 | \text{game}) + (1 | \text{off}) + (1 | \text{def})+ \beta_1 (BMB) + \beta_2 (\text{def in box}) + \beta_3 (\text{coverage})$
- Model 3: $\text{logit} \Pr(P=1) = \beta_0 + (1 | \text{week}) + (1 | \text{game}) + (1 | \text{off}) + (1 | \text{def}) + \beta_1 (BMB) +\beta_2 (\text{def in box}) + \beta_3 (\text{coverage}) + \beta_4 (\text{down}) + \beta_5 (\text{yards to go})$

The fixed effects for $BMB$ have been exponentiated to obtain odds ratios, and 95% CIs have been calculated via the normal approximation method. The results are shown below.
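
For readers unfamiliar with the conversion, the sketch below shows how a log-odds coefficient and its standard error become an odds ratio with a normal-approximation (Wald) confidence interval. The coefficient and standard error are hypothetical placeholders, not estimates from our models.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(beta, se, level=0.95):
    """Return (odds ratio, lower, upper) from a log-odds estimate and its SE."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical log-odds coefficient for BMB and its standard error.
odds_ratio, lower, upper = odds_ratio_ci(beta=-1.0, se=0.2)
print(f"OR = {odds_ratio:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

An interval that lies entirely below 1 indicates that higher values of the predictor are associated with lower odds of the outcome at the chosen confidence level.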


<div>
<img src="https://raw.githubusercontent.com/bscod27/big-man-betweenness/main/images/model_results.png" width="750"/>
</div>


In all three models, higher $BMB$ is significantly associated with decreased odds of quarterback pressure, even when controlling for various covariates that may distort the exposure-outcome relationship. Thus, we conclude that $BMB$ is an important indicator of successful quarterback pass protection, and one that NFL analysts should consider using.

## 4. Applications

### Team ratings
One of the obvious applications of $BMB$ is rating team O-line performance. Using the play-level data, we average O-line betweenness and defenders in the box across plays for each team. We then use the fixed effect coefficient from the previously fit linear mixed model to estimate the corresponding expected value of betweenness centrality and the $BMB$ statistic at the team level.
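
A rough sketch of this team-level roll-up is shown below. The team labels, play values, and the fixed-effect estimates `BETA_0` and `BETA_1` are all hypothetical, and the sketch assumes the team-level expectation is the fixed-effects prediction at the team's average number of defenders in the box.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical fixed-effect estimates (intercept and slope on defenders in box).
BETA_0, BETA_1 = 0.05, 0.02

# Hypothetical play-level rows:
#   y = sqrt(O-line betweenness), x = defenders in the box
plays = [
    {"team": "AAA", "y": 0.16, "x": 6},
    {"team": "AAA", "y": 0.20, "x": 6},
    {"team": "BBB", "y": 0.15, "x": 7},
    {"team": "BBB", "y": 0.17, "x": 7},
]


def team_bmb(plays, beta_0, beta_1):
    """Average y and x per team, then divide observed by expected."""
    by_team = defaultdict(list)
    for p in plays:
        by_team[p["team"]].append(p)
    ratings = {}
    for team, rows in by_team.items():
        observed = mean([r["y"] for r in rows])
        expected = beta_0 + beta_1 * mean([r["x"] for r in rows])
        ratings[team] = {"observed": observed, "expected": expected,
                         "bmb": observed / expected}
    # Rank-order teams from highest to lowest BMB.
    return dict(sorted(ratings.items(), key=lambda kv: kv[1]["bmb"], reverse=True))


ratings = team_bmb(plays, BETA_0, BETA_1)
for team, r in ratings.items():
    print(team, round(r["observed"], 3), round(r["expected"], 3), round(r["bmb"], 3))
```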

Now, we examine two plots that establish a framework for rating each team's O-line on pass plays. The first rank-orders each team by average $BMB$.


<div>
<img src="https://raw.githubusercontent.com/bscod27/big-man-betweenness/main/images/rankings.png" width="750"/>
</div>


The second plots the $BMB$ numerator vs. denominator to distill the driving elements of the calculated statistic for each team.


<div>
<img src="https://raw.githubusercontent.com/bscod27/big-man-betweenness/main/images/matrix.png" width="750"/>
</div>


The matrix above places each team into one of four partitions based on $Y$ vs. $E(Y|X)$. The diagonal line represents $y=x$, the null value at which $BMB=1$, and the vertical line indicates the mean of the expectation, or in mathematical terms, $E[E(Y|X)]$.

So, teams above the diagonal have an average $BMB>1$, and teams below the diagonal have an average $BMB<1$. Teams to the right of the vertical line had a higher expected value, and teams to the left had a lower expected value. 

It follows that one could interpret teams in the bottom right quadrant as those who were expected to perform better than they actually did on pass plays.
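
The partition logic can be written down in a few lines. The function below is a hypothetical helper, and the input values are made up; it assumes the expected value is on the horizontal axis and the observed value on the vertical axis, as in the plot.

```python
def partition(observed, expected, mean_expected):
    """Locate a team relative to the diagonal (BMB = 1) and the vertical line."""
    diagonal = "above diagonal" if observed > expected else "below diagonal"
    side = "right of mean" if expected > mean_expected else "left of mean"
    return diagonal, side


# Hypothetical team: expected more than average, observed less than expected.
print(partition(observed=0.16, expected=0.19, mean_expected=0.18))
```

A team that lands below the diagonal and to the right of the vertical line corresponds to the bottom right quadrant described above.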

### Probabilities of success
Because we established earlier that $BMB$ is approximately distributed as $\mathcal{N}(\mu=1, \sigma^2=0.03)$, we can use the cumulative distribution function for $BMB$ to assign probabilities of "success" (i.e., successful pass protection of the QB) to each play. This could be especially useful for coaches who spend a lot of time reviewing film and analyzing ways to be successful. In theory, using this probability of success, one could easily isolate the most critical plays or understand how probabilities change under different play scenarios.
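
One plausible reading of this calculation is sketched below: the probability of success for a play is taken to be the normal CDF evaluated at the play's $BMB$. The function name and the example values are hypothetical, and the exact definition used to produce our tables may differ.

```python
from statistics import NormalDist

# Normal approximation from the text: mean 1, variance 0.03.
BMB_DIST = NormalDist(mu=1.0, sigma=0.03 ** 0.5)


def success_probability(bmb):
    """Probability mass of the approximating normal that lies below `bmb`."""
    return BMB_DIST.cdf(bmb)


# Hypothetical play-level BMB values.
for bmb in (0.8, 1.0, 1.2):
    print(bmb, round(success_probability(bmb), 3))
```

Under this reading, a play with $BMB=1$ sits at a probability of 0.5, and averaging the play-level probabilities within a group (for example, by down) gives a group-level probability of success.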

Having established a strong association between $BMB$ and quarterback pressure on passing plays, we demonstrate how these probabilities could be useful by zooming in on the Dallas Cowboys. The tables below show how their probability of success changes across different cuts of the data: down, coverage type, and coverage scheme.

In [4]:
def read_file(name):
    url = 'https://raw.githubusercontent.com/bscod27/big-man-betweenness/main/snippets/' + name
    return pd.read_csv(url,index_col=0,parse_dates=[0])

df1 = read_file('down.csv')
df2 = read_file('covtype.csv')
df3 = read_file('defcov.csv')

display_side_by_side(df1,df2,df3) 

| Down | Probability |
|---|---|
| 4th/2pc | 0.738572 |
| 1st | 0.4865 |
| 3rd | 0.43991 |
| 2nd | 0.431703 |

| Coverage type | Probability |
|---|---|
| Man | 0.515263 |
| Zone | 0.433756 |
| Other | 0.421909 |

| Coverage | Probability |
|---|---|
| Cover-0 | 0.670825 |
| Cover-6 | 0.510233 |
| 2-Man | 0.495028 |
| Cover-1 | 0.477954 |
| Red Zone | 0.450754 |
| Cover-3 | 0.435605 |
| Cover-2 | 0.407063 |
| Bracket | 0.39089 |
| Quarters | 0.389206 |
| Other | 0.199838 |


This data is helpful for coaching staffs who seek to 1) optimize the performance of their O-line on pass plays or 2) exploit the opposition's O-line on pass plays.

### Frame-by-frame play analysis

By placing the players back on the field, we visualize how the $BMB$ statistic changes over the course of a play as the players move and their formations shift.
 
Week 1 Game 2021091207: Tennessee Titans vs. Arizona Cardinals, play 3828

In [17]:
from IPython.display import IFrame
IFrame(src='gifs/animation11.20quick.html', width=1000, height=700)

A last use case of $BMB$ is to visualize how the statistic changes over the course of a play. By doing so, coaches and players can see where passing plays break down and understand how to address any pain points with their offensive line.

- Peiying's video >> show how the statistic changes over each frame
- Try to pick a case where it flips from >1 to <1 to show how it could be useful

## 5. Conclusion

What we did / found

Limitations
- computationally burdensome
- no player level assessment done here

Future directions
- look at other dependent variables for statistical models
- extend this to an individual level 
- explore other network measures in the game

## Appendix

The supporting code and associated descriptions can be found on the following [GitHub page](https://github.com/bscod27/big-man-betweenness).