## Big Man Betweenness (BMB): A Network Measure for QB Pass Protection

Team: Bruno Scodari, Mirjana Stevanovic, Peiying Hua  \
Affiliation: Dartmouth College

In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')


In [2]:
import pandas as pd
import warnings
from IPython.display import display_html
from itertools import chain,cycle
warnings.filterwarnings('ignore')

# Define a function to display dataframes side by side
def display_side_by_side(*args,titles=cycle([''])):
    # Accumulate one HTML string holding every dataframe as an inline table
    html_str=''
    # If a finite list of titles runs out, fall back to a line break
    for df,title in zip(args, chain(titles,cycle(['<br>'])) ):
        html_str+='<th style="text-align:center"><td style="vertical-align:top">'
        html_str+=f'<h2 style="text-align: center;">{title}</h2>'
        html_str+=df.to_html().replace('table','table style="display:inline"')
        html_str+='</td></th>'
    display_html(html_str,raw=True)

## Introduction

**Network analysis** is the scientific discipline that examines the structure of relations, often social ones, through the use of graph theory and quantitative methods. A network is characterized by a set of actors, or **nodes**, and the relationships, or **edges**, that connect them. Edges can be directed or undirected, and are often assigned weights that reflect the magnitude or strength of a given relationship. A network is easily represented mathematically using the underlying adjacency matrix, and can be visualized using a sociogram. Different measures can be computed using the adjacency matrix to answer questions about a specific network.

Network analysis has applications across many domains, including sports. For example, networks have been used in the past to study pass-sharing among soccer players and assess the importance of individual players and the relationships among them. In this project, we apply network analysis to the NFL's Next Gen Stats data and propose a novel network analytic metric titled **B**ig **M**an **B**etweenness (**BMB**) to evaluate the pass protection efficiency of offensive linemen. This notebook outlines our approach, and its content is organized as follows:
1. **Network Assembly**: We construct a network for each frame in the 8-week tracking data and integrate player spatial data to inform node positioning.
2. **The BMB Metric**: We outline the development and visualization of the BMB metric.
3. **Statistical Inference**: We test the association between BMB and quarterback pass protection. 
4. **Applications**: We use the BMB metric to compute team ratings, quantify the probability of success, and visualize plays.
5. **Conclusion**: We discuss the limitations of our approach, and outline future directions of this project.

We hope this work will provide actionable insights to NFL coaches, and inform future network science applications in American football.

## 1. Network Assembly

For each frame in a given play, we define a network whose nodes represent the players and their spatial positions on the field. Edges are drawn between the quarterback (QB) and all other offensive and defensive players, as well as between offensive line (O-line) and defensive players. Each edge is weighted inversely to $\epsilon$, the Euclidean distance between its two nodes, so that ties between players who are closer on the field are stronger than ties between players who are farther apart. Lastly, edges between nodes that are more than $7$ yards apart are removed. We chose a 7-yard cutoff because, in the Shotgun formation commonly used on passing plays, the quarterback stands approximately 5 to 7 yards behind the center. Restricting connections to those within $7$ yards of a given player lets us focus on the most meaningful relationships and filter out noisy relations that could distort the subsequent network analysis. The GIF below shows the networks that comprise a single play chosen for illustration.

Week 1: Tampa Bay Buccaneers vs. Dallas Cowboys, play 947: *(:05) (Shotgun) T.Brady pass incomplete short middle to A.Brown (D.Kazee)*

<div>
<img src="https://raw.githubusercontent.com/bscod27/big-man-betweenness/main/gifs/pos/final.gif" width="550" class="center">
</div>
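
The edge rules described in this section can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code used for the analysis: the role labels (`"QB"`, `"OL"`, `"OFF"`, `"DEF"`), the `build_edges` helper, and the toy coordinates are hypothetical, and the tie strength is assumed to be exactly $1/\epsilon$.

```python
import math
from itertools import combinations

CUTOFF_YARDS = 7.0  # cutoff described in the text


def build_edges(players, cutoff=CUTOFF_YARDS):
    """Return {(a, b): (distance, strength)} for a single frame.

    `players` maps a hypothetical label to (role, x, y), where role is
    "QB", "OL" (O-line), "OFF" (other offense), or "DEF".
    """
    edges = {}
    for a, b in combinations(players, 2):
        role_a, xa, ya = players[a]
        role_b, xb, yb = players[b]
        roles = {role_a, role_b}
        # The QB may connect to anyone; O-line players connect to defenders.
        if not ("QB" in roles or roles == {"OL", "DEF"}):
            continue
        dist = math.dist((xa, ya), (xb, yb))
        # Drop pairs beyond the cutoff (and guard against zero distance).
        if dist == 0 or dist > cutoff:
            continue
        edges[(a, b)] = (dist, 1.0 / dist)  # assumed strength: 1 / distance
    return edges


# Toy frame with made-up coordinates (yards).
frame = {
    "QB": ("QB", 0.0, 0.0),
    "C": ("OL", 5.0, 0.0),
    "WR": ("OFF", 3.0, 20.0),
    "DT": ("DEF", 6.0, 0.0),
    "CB": ("DEF", 4.0, 20.0),
}
edges = build_edges(frame)
for pair, (dist, strength) in edges.items():
    print(pair, round(dist, 2), round(strength, 2))
```

In this toy frame only the QB, center, and defensive tackle end up connected; the receiver and cornerback are either too far from the QB or not an eligible pairing.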


For each network, we then calculate the **betweenness centrality** of its nodes. The betweenness of node $i$ is the proportion of **shortest paths** (geodesics) between pairs of other nodes that pass through node $i$. In a weighted network, betweenness can instead be computed from **minimal weighted paths**. Nodes with high betweenness are often called **brokers**, because the measure is widely used to identify which nodes in a social network broker or mediate relationships between others. Betweenness centrality is the basis for the BMB metric we develop below. The GIF below shows betweenness centrality computed for every frame of the same Tampa Bay Buccaneers vs. Dallas Cowboys play shown above. Specifically, it visualizes the square root of the average betweenness centrality of the O-line, which we explain in more detail in the following section. A higher value indicates that a higher proportion of paths between the quarterback and all other players go through the O-line, which we interpret as **stronger quarterback pass protection**.

Week 1: Tampa Bay Buccaneers vs. Dallas Cowboys, play 947: *(:05) (Shotgun) T.Brady pass incomplete short middle to A.Brown (D.Kazee)*

<div>
<img src="https://raw.githubusercontent.com/bscod27/big-man-betweenness/main/gifs/betw/final.gif" width="550" class="center">
</div>

For all our analyses, we use the dataframe we obtained in the following way: 
- We merged game, play, player, and PFF scouting information with the 8-week tracking data based on the primary key variables specified in the Kaggle prompt.
- We created networks for each frame in a given play and calculated the square root of the average betweenness centrality of the offensive line players.
- We defined an outcome variable called **pressure**, which represents whether the QB was hurried, hit, or sacked on a given play. 
- We averaged the dynamic variables and took the mode of the static ones across the frames of each play to derive a large data frame whose rows/observations represent individual plays.
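
The roll-up from frames to plays described in the steps above can be sketched as follows. This is a minimal standard-library illustration, not the pipeline we ran: the field names `hurry`, `hit`, and `sack` and the `roll_up` helper are hypothetical stand-ins, and the frame records are made up.

```python
from collections import defaultdict
from statistics import mean, mode

# Hypothetical frame-level records; field names are illustrative only.
frames = [
    {"gameId": 1, "playId": 10, "line_betw": 0.02, "down": 1, "hurry": 0, "hit": 0, "sack": 0},
    {"gameId": 1, "playId": 10, "line_betw": 0.04, "down": 1, "hurry": 0, "hit": 0, "sack": 0},
    {"gameId": 1, "playId": 11, "line_betw": 0.01, "down": 2, "hurry": 1, "hit": 0, "sack": 0},
    {"gameId": 1, "playId": 11, "line_betw": 0.03, "down": 2, "hurry": 1, "hit": 0, "sack": 0},
]


def roll_up(frames):
    """Collapse frame-level rows into one row per (gameId, playId)."""
    by_play = defaultdict(list)
    for row in frames:
        by_play[(row["gameId"], row["playId"])].append(row)

    plays = []
    for (game_id, play_id), rows in by_play.items():
        plays.append({
            "gameId": game_id,
            "playId": play_id,
            # Dynamic variable: average across frames.
            "line_betw": mean(r["line_betw"] for r in rows),
            # Static variable: most common value across frames.
            "down": mode(r["down"] for r in rows),
            # Pressure: 1 if the QB was hurried, hit, or sacked on the play.
            "pressure": int(any(r["hurry"] or r["hit"] or r["sack"] for r in rows)),
        })
    return plays


plays = roll_up(frames)
for play in plays:
    print(play)
```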

The head of the resulting dataset is displayed below.

In [3]:
import pandas as pd
url = 'https://raw.githubusercontent.com/bscod27/big-man-betweenness/main/snippets/rolled.csv'
df = pd.read_csv(url,index_col=0,parse_dates=[0])
print('Dimension:', df.shape, '\n')
df.head(10)

Dimension: (8550, 11) 



| week | gameId | playId | pos_team | def_team | down | yardstogo | def_coverage | def_covtype | def_playersinbox | line_betw | pressure |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2021091207 | 152 | ARI | TEN | 1 | 10 | Cover-3 | Zone | 8 | 0.032619 | 0 |
| 1 | 2021091207 | 218 | ARI | TEN | 1 | 18 | Cover-3 | Zone | 7 | 0.039551 | 0 |
| 1 | 2021091207 | 253 | ARI | TEN | 2 | 17 | Quarters | Zone | 5 | 0.009189 | 0 |
| 1 | 2021091207 | 386 | ARI | TEN | 2 | 24 | Cover-6 | Zone | 5 | 0.013034 | 0 |
| 1 | 2021091207 | 410 | ARI | TEN | 3 | 16 | Cover-3 | Zone | 6 | 0.010476 | 0 |
| 1 | 2021091207 | 621 | ARI | TEN | 2 | 5 | Cover-1 | Man | 5 | 0.016653 | 0 |
| 1 | 2021091207 | 660 | ARI | TEN | 3 | 5 | Red Zone | Other | 6 | 0.018937 | 0 |
| 1 | 2021091207 | 839 | ARI | TEN | 2 | 6 | Cover-1 | Man | 6 | 0.017703 | 0 |
| 1 | 2021091207 | 863 | ARI | TEN | 3 | 3 | Cover-1 | Man | 7 | 0.025111 | 0 |
| 1 | 2021091207 | 1090 | ARI | TEN | 1 | 10 | Quarters | Zone | 6 | 0.022553 | 0 |


## 2. The Big Man Betweenness Metric

The $BMB$ metric is computed as follows. For each offensive line player $v$, we calculate betweenness centrality

$$g(v)=\sum_{s\neq v\neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

where $\sigma_{st}$ is the number of minimal weighted paths between players $s$ and $t$, and $\sigma_{st}(v)$ is the number of those paths that pass through $v$. As ours is an undirected network, $g(v)$ is rescaled by $(N-1)(N-2)/2$, or the number of pairs of players not including $v$. $N$ is the total number of players in the network.
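
To make the definition of $g(v)$ concrete, here is a small standard-library sketch that counts minimal weighted paths with Dijkstra's algorithm and applies the $(N-1)(N-2)/2$ rescaling. It is an illustration rather than the implementation behind our results: the helper names are hypothetical, the graph is a toy example, and the values in `adj` are assumed to be path costs (how a cost is derived from the inverse-distance tie strength is an assumption left open here).

```python
import heapq
import math


def shortest_paths(adj, source):
    """Dijkstra from `source`: path costs and counts of minimal paths."""
    dist = {source: 0.0}
    sigma = {source: 1}
    done = set()
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        for w, cost in adj[u].items():
            if w in done:
                continue
            nd = d + cost
            if w in dist and math.isclose(nd, dist[w]):
                sigma[w] += sigma[u]  # another minimal path of equal cost
            elif w not in dist or nd < dist[w]:
                dist[w], sigma[w] = nd, sigma[u]
                heapq.heappush(heap, (nd, w))
    return dist, sigma


def betweenness(adj):
    """Rescaled betweenness g(v) for an undirected graph with positive costs."""
    nodes = list(adj)
    n = len(nodes)
    g = dict.fromkeys(nodes, 0.0)
    if n < 3:
        return g
    paths = {s: shortest_paths(adj, s) for s in nodes}
    for i, s in enumerate(nodes):
        dist_s, sigma_s = paths[s]
        for t in nodes[i + 1:]:
            if t not in dist_s:
                continue  # s and t are disconnected
            for v in nodes:
                if v == s or v == t or v not in dist_s:
                    continue
                dist_v, sigma_v = paths[v]
                # v lies on a minimal s-t path if the two legs add up exactly.
                if t in dist_v and math.isclose(dist_s[v] + dist_v[t], dist_s[t]):
                    g[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    scale = (n - 1) * (n - 2) / 2
    return {v: g[v] / scale for v in nodes}


# Toy undirected graph (symmetric adjacency): a 4-cycle with unit costs.
square = {
    "a": {"b": 1.0, "d": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0, "a": 1.0},
}
scores = betweenness(square)
print(scores)
```

In the 4-cycle, each node carries half of the two minimal paths between its two neighbors, so every node receives the same rescaled score.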

We then take the mean across all O-line players and take the square root of this mean O-line betweenness so that its distribution is closer to normal. We denote this quantity as $Y$:

$$Y = \sqrt{\frac{\sum_{i=1}^{n} g(v_{i})}{n}}, \quad \text{where } n \text{ is the number of O-line players}$$

Next, we note that the betweenness centrality of an offensive line in a given play is largely dependent on the spatial positioning of the opposition at the time of snap. For instance, teams with dual-threat quarterbacks may strongly influence the opposition's formation and propensity to blitz, resulting in potentially lower betweenness centrality among the possession team's offensive line. We therefore account for the number of defenders in the box in order to standardize our metric accordingly. Specifically, we define $BMB$ as the ratio between the observed square root of O-line betweenness and its expectation given the number of defenders in the box:

$$BMB = \frac{Y}{E[Y|X]}, \quad \text{where } X \text{ is the number of defenders in the box}$$

To estimate $E[Y|X]$, we fit a multilevel linear mixed-effects model where we regress $Y$ onto $X$ while controlling for within-group variability with respect to week, game, offense, and defense using random intercepts:

$$ \pmb{Y} = \pmb{X\beta} + \pmb{Zb} + \pmb{\epsilon} $$

where
1. $\pmb{Y}$ is a vector containing the square root of O-line betweenness, as defined above.
2. $\pmb{X}$ is a vector of the number of defenders in the box, the only predictor variable.
3. $\pmb{\beta}$ is the vector of regression coefficients corresponding to X.
4. $\pmb{Z}$ is the design matrix for the four random effects and their corresponding groups.
5. $\pmb{b}$ is the vector of random effects.
6. $\pmb{\epsilon}$ is the vector of residuals, or the random error term.
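
The observed-over-expected construction can be illustrated with a deliberately simplified sketch. Note the assumption: the block below replaces the mixed-effects model with ordinary least squares (no random intercepts for week, game, offense, or defense), so it shows only the shape of the calculation. The helper names and the play-level numbers are hypothetical.

```python
import math
from statistics import mean


def sqrt_mean_betweenness(line_scores):
    """Y: square root of the mean betweenness across O-line players."""
    return math.sqrt(mean(line_scores))


def fit_ols(x, y):
    """Least-squares intercept and slope for y ~ x (no random effects)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical play-level data: defenders in the box (X) and observed Y.
box = [5, 6, 6, 7, 8]
y_obs = [0.10, 0.12, 0.14, 0.15, 0.19]

# Simplified stand-in for E[Y|X]: fitted values from the OLS line.
intercept, slope = fit_ols(box, y_obs)
expected = [intercept + slope * x for x in box]

# BMB is the ratio of observed to expected Y for each play.
bmb = [y / e for y, e in zip(y_obs, expected)]
for x, y, e, b in zip(box, y_obs, expected, bmb):
    print(x, y, round(e, 4), round(b, 3))
```

With the full model, the fitted values would also include the estimated random intercepts, so the expected values (and therefore $BMB$) would differ from this sketch.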

The fitted values of this model yield the expected values of $Y$ given $X$, which are then used to obtain the $BMB$ metric at the play level. Because $BMB$ is the ratio of observed to expected values, its interpretation is straightforward. A $BMB$ value greater than 1 indicates that the O-line performed better than expected, whereas a $BMB$ value less than 1 indicates that it underperformed. Another benefit of constructing $BMB$ as a ratio of observed to expected values is its distribution. As mentioned previously, $Y$, the numerator of $BMB$, is approximately normally distributed after the square-root transformation, and we treat $E[Y|X]$, the denominator of $BMB$, as approximately normal as well. The ratio of the two should then also be **approximately normal** (D&iacute;az-Franc&eacute;s & Rubio, 2013). We check this against the empirical distribution of play-level $BMB$, whose mean is approximately 1 and whose variance is roughly 0.03. We can therefore visualize the empirical distribution of $BMB$ in relation to the $\mathcal{N}(\mu=1, \ \sigma^2=0.03)$ approximation.

<div>
<img src="https://raw.githubusercontent.com/bscod27/big-man-betweenness/main/images/density.png" width="550" class="center">
</div>

While the empirical distribution is slightly off-center and right-skewed, the normal approximation is reasonable. We use this approximate distribution of $BMB$ in a subsequent section to assign a probability of success to each play. Note also that $BMB=1$ makes for a natural null hypothesis when assessing the likelihood of observing a more extreme $BMB$ statistic.

## 3. Statistical Inference

Next, we fit several multilevel mixed-effects logistic regression models to assess the association between $BMB$ and the previously defined binary **pressure** outcome. We regress pressure onto $BMB$ while including random intercepts to control for within-group variability among repeated measures. We fit the following models:

| | Fixed effects | Random effects |
| --- | --- | --- |
| Model 1 | BMB | week, game, offense, defense | 
| Model 2 | BMB, defenders in box, coverage type | week, game, offense, defense | 
| Model 3 | BMB, defenders in box, coverage type, down, yards until 1st down  | week, game, offense, defense | 

We chose to control for the number of defenders in the box, coverage type, down, and distance needed for a first down because these variables could confound the relationship between BMB and pressure: each could be associated with both. For instance, it seems plausible that the more opposition players there are in the box, the higher the betweenness centrality of O-line players, but also the higher the likelihood that the QB gets hurried, hit, or sacked. Similarly, coverage type, a categorical variable indicating whether the defense's coverage was man, zone, or other, could affect both BMB and pressure. Whether the defensive players follow receivers on their routes or cover a zone of the field to protect against a pass may affect the spatial positioning of the offensive line, thereby affecting the BMB metric, and these different coverage types arguably also affect the outcome of a play. Down, which indicates how many of the four downs the team has used in its effort to advance ten yards or more towards the opponent's goal line, and yards until first down, a numeric variable, could likewise be associated with both BMB and pressure.

In the table below, the coefficients corresponding to the fixed effects in our models have been exponentiated to obtain odds ratios, and 95% CIs have been calculated via the normal approximation method:


| | Model 1 | Model 2 | Model 3 |
| --- | :---: | :---: | :---: |
| Big Man Betweenness | 0.31 [0.15, 0.66] | 0.25 [0.12, 0.54] | 0.25 [0.11, 0.54] |
| Defenders in box | | 0.74 [0.66, 0.84] | 0.78 [0.68, 0.89] |
| Defensive coverage: Zone vs. Other | | 0.78 [0.46, 1.30] | 0.90 [0.53, 1.53] |
| Defensive coverage: Man vs. Other | | 1.02 [0.59, 1.76] | 1.05 [0.61, 1.82] |
| Yards to go | | | 1.00 [0.97, 1.03] |
| Down: 2nd vs. 1st down | | | 1.30 [0.94, 1.79] |
| Down: 3rd vs. 1st down | | | 1.62 [1.16, 2.26] |
| Down: 4th/2pc vs. 1st down | | | 2.58 [1.41, 4.69] |


In all models, the higher the $BMB$ value, the lower the odds of quarterback pressure, even after controlling for covariates that may be associated with both $BMB$ and pressure. In particular, every 1-unit increase in $BMB$ is associated with a 69% decrease in the odds of quarterback pressure in Model 1 and a 75% decrease in Models 2 and 3. This association is statistically significant at the 0.05 level, since in each model the 95% CI for the odds ratio does not contain the null value of 1. Thus, we conclude that $BMB$ is an important indicator of successful quarterback pass protection, and one that NFL analysts should consider using.
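
The arithmetic behind these statements is short enough to spell out. The sketch below shows how a log-odds coefficient and its standard error map to an odds ratio with a normal-approximation 95% CI, and how an odds ratio maps to a percent change in the odds. The function names are hypothetical; the odds ratios in the loop are the $BMB$ values reported in the table above.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and normal-approximation 95% CI from a log-odds coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def percent_change_in_odds(odds_ratio):
    """Percent change in the odds per one-unit increase in the predictor."""
    return (odds_ratio - 1.0) * 100.0


# BMB odds ratios as reported for Models 1 to 3.
for label, reported_or in [("Model 1", 0.31), ("Model 2", 0.25), ("Model 3", 0.25)]:
    print(label, f"{percent_change_in_odds(reported_or):.0f}%")
```

A negative percent change corresponds to a decrease in the odds, so an odds ratio of 0.31 reads as a 69% decrease.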

## 4. Applications

### Team ratings
One obvious application of $BMB$ is to rank teams based on their O-line performance. Using play-level data, we average O-line betweenness and the number of defenders in the box across plays for each team. We then use the fixed effect coefficient from the previously fit linear mixed model to estimate the corresponding expected values for betweenness centrality and the $BMB$ statistic at the team-level. In the plot below, we rank-order teams based on their average $BMB$.


<div>
<img src="https://raw.githubusercontent.com/bscod27/big-man-betweenness/main/images/rankings.png" width="750" class="center">
</div>
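
The team-level calculation described in this subsection can be sketched as follows. The coefficient values, the play-level rows, and the `team_bmb` helper are all hypothetical placeholders chosen for illustration; they are not estimates from our model, and the sketch assumes the expected value is a fixed intercept plus a slope times the average number of defenders in the box.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical fixed-effect estimates: intercept and slope for defenders in box.
BETA_0, BETA_1 = 0.05, 0.015

# Hypothetical play-level rows: (team, Y, defenders in box).
plays = [
    ("DAL", 0.16, 6), ("DAL", 0.14, 7),
    ("TB", 0.12, 6), ("TB", 0.13, 7),
]


def team_bmb(plays, beta_0=BETA_0, beta_1=BETA_1):
    """Average Y and X per team, then divide observed by expected Y."""
    grouped = defaultdict(list)
    for team, y, box in plays:
        grouped[team].append((y, box))

    ratings = {}
    for team, rows in grouped.items():
        y_bar = mean(y for y, _ in rows)
        box_bar = mean(box for _, box in rows)
        ratings[team] = y_bar / (beta_0 + beta_1 * box_bar)

    # Rank-order teams from highest to lowest average BMB.
    return dict(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))


ratings = team_bmb(plays)
for team, value in ratings.items():
    print(team, round(value, 3))
```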


We then plot the $BMB$ numerator vs. denominator to better understand the driving elements of the calculated statistic for each team.


<div>
<img src="https://raw.githubusercontent.com/bscod27/big-man-betweenness/main/images/matrix.png" width="750" class="center">
</div>


The figure above shows that each team falls into one of four partitions based on $Y$ vs. $E[Y|X]$. The diagonal line represents points where the observed value of $Y$ equals its expected value $E[Y|X]$, or alternatively the null hypothesis that $BMB=1$. The vertical line indicates the mean of the expectation $E[E[Y|X]]$. Teams above the diagonal have an average $BMB>1$, and teams below the diagonal have an average $BMB<1$. Teams to the right of the vertical line had an expected value higher than the average, and teams to the left had a lower expected value. The teams in the bottom right quadrant are therefore those that were expected to perform better than they actually did on pass plays.

### Probability of a successful pass protection of the QB
Because $BMB$ is approximately distributed as $\mathcal{N}(\mu=1, \sigma^2=0.03)$, we can use its cumulative distribution function to assign each play a probability of successful pass protection of the QB. Coaches could use this probability alongside film review: it can help isolate the most critical plays in a game and show how the probability of success changes across play scenarios. In this way, coaching staffs can optimize the performance of their own O-line on pass plays and exploit weaknesses in the opposition's O-line. To illustrate, we take the Dallas Cowboys as an example and examine how their probability of success changes by down, coverage type, and defensive coverage.

In [4]:
def read_file(name):
    url = 'https://raw.githubusercontent.com/bscod27/big-man-betweenness/main/snippets/' + name
    return pd.read_csv(url,index_col=0,parse_dates=[0])

df1 = read_file('down.csv')
df2 = read_file('covtype.csv')
df3 = read_file('defcov.csv')

display_side_by_side(df1,df2,df3) 

| Down | Probability |
| --- | --- |
| 4th/2pc | 0.738572 |
| 1st | 0.4865 |
| 3rd | 0.43991 |
| 2nd | 0.431703 |

| Coverage type | Probability |
| --- | --- |
| Man | 0.515263 |
| Zone | 0.433756 |
| Other | 0.421909 |

| Coverage | Probability |
| --- | --- |
| Cover-0 | 0.670825 |
| Cover-6 | 0.510233 |
| 2-Man | 0.495028 |
| Cover-1 | 0.477954 |
| Red Zone | 0.450754 |
| Cover-3 | 0.435605 |
| Cover-2 | 0.407063 |
| Bracket | 0.39089 |
| Quarters | 0.389206 |
| Other | 0.199838 |
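
The mapping from a $BMB$ value to a probability can be sketched with the standard library. This is a minimal illustration that assumes the probability of success is the normal CDF evaluated at the observed $BMB$, using the mean of 1 and variance of 0.03 stated earlier; the `protection_probability` helper and the example values are hypothetical.

```python
from statistics import NormalDist

# Normal approximation stated in the text: mean 1, variance 0.03.
BMB_DIST = NormalDist(mu=1.0, sigma=0.03 ** 0.5)


def protection_probability(bmb):
    """P(BMB <= observed value) under the normal approximation."""
    return BMB_DIST.cdf(bmb)


# Hypothetical BMB values for three plays.
for value in (0.8, 1.0, 1.2):
    print(value, round(protection_probability(value), 3))
```

Under this reading, a play with $BMB=1$ sits exactly at 0.5, and values above or below 1 move the probability symmetrically away from it.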


### Frame-by-frame play analysis
A last use case of $BMB$ is to visualize how the statistic changes over the course of a play. By doing so, coaches and players can see where passing plays break down and understand how to address any pain points with their offensive line.
 
Week 1 Game 2021091207: Tennessee Titans vs. Arizona Cardinals, play 3828


<div>
<img src="https://raw.githubusercontent.com/bscod27/big-man-betweenness/main/gifs/visualizefieldwithBMB.gif" width="500" class="center">
</div>

## 5. Conclusion

In this project, we use network analysis to develop the BMB metric that can be used to evaluate quarterback pass protection. We show how it can be applied to rank teams, predict the probability of a successful pass protection of the quarterback, and analyze plays frame-by-frame in order to better understand the performance of the offensive line.

One limitation of our approach is that calculating BMB for every frame of a play is computationally expensive. While we were able to leverage a High Performance Computing cluster at our institution, we acknowledge that such resources are not widely accessible. A less computationally burdensome approach would focus only on the frames before the snap, though such an approach might be less robust than one that uses all the available data. One of the next steps in this project will be determining how much we can restrict the time window of interest without losing the predictive power of the BMB metric.

A further limitation of our approach is that it does not allow for the assessment of player-level performance. By averaging betweenness centrality across all O-line players, we lose player-level information that could help identify the individuals driving the performance of the O-line. The BMB metric could be easily adapted to retain this information. Instead of computing the mean betweenness centrality of all O-line players, we could take the betweenness centrality of each player and calculate the ratio between it and its expected value, obtained by regressing betweenness centrality on the number of defenders in the box while modeling week, game, offense, and defense as random intercepts. However, the distributional properties of such a player-level metric might be less convenient than those of BMB.

Future directions for this project therefore include implementing this alternative approach and testing its usability. We also intend to leverage additional variables available in the data provided, as they could improve the predictive power of BMB if included in the model used to compute $E[Y|X]$. Finally, we will explore other network measures.

## Appendix

The supporting code and associated documentation can be found on the following [GitHub page](https://github.com/bscod27/big-man-betweenness).