# NFL Big Data Bowl 2023

##### Zachary Galante, University of California, Berkeley
##### Dr. James G. Shanahan, University of California, Berkeley


# Background 


Consisting of 5 players, the offensive line plays a critical part in the success of a team. During every play, each lineman must understand his assignment and work in unison with the rest of the line. This analysis focuses on the performance of the offensive line as a whole during pass plays throughout the first 8 weeks of the 2021 NFL season. This is done by constructing a metric from both existing and newly created features.

![Patriots example play](patriots_example.gif)

## Methodology 
To create our metric, we first trained a logistic regression model to predict whether a play would contain an 'event of interest': a play where at least one of the 5 offensive linemen was marked as allowing one of the following: <i>'pff_hurryAllowed', 'pff_hitAllowed', 'pff_sackAllowed'</i>. Our best-performing logistic regression model achieved an accuracy score of 0.72 and a balanced accuracy score of 0.68. We then used the coefficients and bias term from this model in our metric to assess offensive line performance.
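
As a minimal sketch of the labeling step described above, the snippet below marks a play as an event of interest when any lineman row for that play has one of the three PFF columns set to 1. The `gameId` and `playId` keys and the sample rows are assumptions made for illustration; only the three `pff_*` column names come from the text.

```python
# PFF scouting columns named in the text; a value of 1 means the lineman allowed it.
EVENT_COLUMNS = ("pff_hurryAllowed", "pff_hitAllowed", "pff_sackAllowed")


def label_plays(lineman_rows):
    """Return {(gameId, playId): 0 or 1}; 1 if any lineman on the play allowed an event."""
    labels = {}
    for row in lineman_rows:
        key = (row["gameId"], row["playId"])  # assumed play identifier
        allowed = any(row.get(col) == 1 for col in EVENT_COLUMNS)
        labels[key] = int(labels.get(key, 0) or allowed)
    return labels


# Hypothetical rows (one per lineman per play), not real scouting data.
rows = [
    {"gameId": 1, "playId": 10, "pff_hurryAllowed": 0, "pff_hitAllowed": 0, "pff_sackAllowed": 0},
    {"gameId": 1, "playId": 10, "pff_hurryAllowed": 1, "pff_hitAllowed": 0, "pff_sackAllowed": 0},
    {"gameId": 1, "playId": 11, "pff_hurryAllowed": 0, "pff_hitAllowed": 0, "pff_sackAllowed": 0},
]
play_labels = label_plays(rows)
```

Here play 10 is labeled 1 because one lineman allowed a hurry, while play 11 is labeled 0.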

# Exploratory Data Analysis

During each play, data is collected for every player on the field at each tracking frame (one frame every tenth of a second). As a result, we started with 7,952,692 records and 34 features.

## Feature Engineering

Using existing features, several approaches were taken to construct new ones. Each feature described below was calculated from the frame-by-frame tracking data.

<b> 1: QB Circle </b>

This approach creates a circle centered on the QB that follows him throughout a play. The motivation is that if an offensive lineman and his assigned defender are inside the circle, the defender is in range to quickly sack, hit, or hurry the QB. For each play, the total time that the linemen spent in the circle is represented by the feature 'total_circle_time'.
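
One possible way to compute this feature is sketched below. The circle radius, the frame spacing, and the `(x, y)` track layout are all assumptions, since the text does not state them, and the sketch only checks the linemen's positions rather than their assigned defenders.

```python
import math


def total_circle_time(qb_track, linemen_tracks, radius, frame_seconds):
    """Total time (seconds) linemen spend within `radius` yards of the QB.

    qb_track: list of (x, y) QB positions, one per frame.
    linemen_tracks: list of tracks, one per lineman, aligned frame by frame with qb_track.
    """
    frames_inside = 0
    for track in linemen_tracks:
        for (qx, qy), (lx, ly) in zip(qb_track, track):
            if math.hypot(lx - qx, ly - qy) <= radius:
                frames_inside += 1
    return frames_inside * frame_seconds


# Hypothetical three-frame play with two linemen.
qb = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
linemen = [
    [(1.0, 0.0), (2.0, 0.0), (5.0, 0.0)],  # inside the circle for 2 frames
    [(4.0, 0.0), (4.0, 0.0), (4.0, 0.0)],  # never inside
]
# radius=3.0 yards and frame_seconds=0.1 are assumed values, not from the source.
circle_time = total_circle_time(qb, linemen, radius=3.0, frame_seconds=0.1)
```

Summing across linemen means two linemen inside the circle during the same frame count twice; counting a frame once if any lineman is inside would be an equally plausible reading of the feature.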




<b> 2: Offensive Line Bounding Box </b>

In this approach, a bounding box is used to encapsulate the entire offensive line throughout a play. We then use the average height and width of the box in the model. The motivation for the approach is to better understand the shape of the offensive line throughout a play.
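
The bounding-box features could be computed along these lines. This is a sketch that assumes an axis-aligned box, with height measured along the `y` axis and width along the `x` axis; the source does not say which axis is which, and the coordinates below are made up.

```python
def average_box_dimensions(frames):
    """Average height and width of the box enclosing the linemen.

    frames: list of frames; each frame is a list of (x, y) lineman positions.
    Returns (average_box_height, average_box_width).
    """
    heights, widths = [], []
    for positions in frames:
        xs = [x for x, _ in positions]
        ys = [y for _, y in positions]
        widths.append(max(xs) - min(xs))   # extent along x (assumed "width")
        heights.append(max(ys) - min(ys))  # extent along y (assumed "height")
    n = len(frames)
    return sum(heights) / n, sum(widths) / n


# Hypothetical two-frame play with two linemen per frame.
frames = [
    [(0.0, 0.0), (1.0, 4.0)],
    [(0.0, 0.0), (3.0, 6.0)],
]
average_box_height, average_box_width = average_box_dimensions(frames)
```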

<b> 3: Density of Linemen </b>

Here, the distance between linemen is calculated at every frame throughout the play. The reasoning is that if all the linemen are close together, almost forming a wall, the QB is less likely to be sacked; if the linemen are spread out, a defender has a higher chance of getting through to the QB.
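
A rough sketch of one way to turn this idea into a single number per play follows. It averages the Euclidean distance over every pair of linemen in a frame, then averages across frames; whether the original feature used all pairs or only neighboring linemen is not stated, so the all-pairs choice is an assumption.

```python
import math
from itertools import combinations


def average_linemen_distance(frames):
    """Mean pairwise distance between linemen, averaged over all frames.

    frames: list of frames; each frame is a list of (x, y) lineman positions.
    """
    per_frame = []
    for positions in frames:
        distances = [math.dist(a, b) for a, b in combinations(positions, 2)]
        per_frame.append(sum(distances) / len(distances))
    return sum(per_frame) / len(per_frame)


# Hypothetical two-frame play with three linemen (3-4-5 and 6-8-10 triangles).
frames = [
    [(0.0, 0.0), (0.0, 3.0), (4.0, 0.0)],
    [(0.0, 0.0), (0.0, 6.0), (8.0, 0.0)],
]
density = average_linemen_distance(frames)
```

Restricting the calculation to adjacent linemen would only require sorting each frame's positions along the line and pairing consecutive entries.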

# Model Building Phase
The final dataset used for modeling contained 8063 rows and 18 columns. The data was then split into 80% training and 20% test sets. Due to the extreme class imbalance between eventful and non-eventful plays, the training set was downsampled so that the number of eventful plays matched the number of non-eventful plays. After this transformation, the training set contained 3238 records with 18 columns. The test set remained unbalanced and contained 1612 plays.
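
The split and downsampling step might look something like the sketch below, which uses only the standard library. The `event` label key, the seed, and the toy data are assumptions; the sketch shrinks whichever class is larger in the training set and leaves the test set untouched.

```python
import random


def split_and_downsample(plays, label_key="event", test_fraction=0.2, seed=0):
    """Shuffle, hold out a test set, then balance the training set by downsampling."""
    rng = random.Random(seed)
    shuffled = plays[:]
    rng.shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    positives = [p for p in train if p[label_key] == 1]
    negatives = [p for p in train if p[label_key] == 0]
    n_keep = min(len(positives), len(negatives))

    # Keep every play from the smaller class and an equal-sized sample of the larger one.
    balanced = rng.sample(positives, n_keep) + rng.sample(negatives, n_keep)
    rng.shuffle(balanced)
    return balanced, test


# Hypothetical data: 100 plays, 25 of them eventful.
plays = [{"playId": i, "event": 1 if i < 25 else 0} for i in range(100)]
train_balanced, test_set = split_and_downsample(plays)
```

With 8063 rows and a 20% test fraction, `int(8063 * 0.2)` gives 1612 test plays, which matches the figure quoted above.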


<i> Sample Record Included Below</i>

| offenseFormation | team_win_percentage | dropbackType | average_linemen_distance | defendersInBox | average_box_height | average_box_width | time_to_throw | total_circle_time |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 4 | 0.125 | 5 | 1.720 | 6 | 6.63 | 0.96 | 1.8 | 0.0 |

# <b> Logistic Regression </b>
With the features previously defined, a logistic regression model was trained with the following results.

| Accuracy Score| Balanced Accuracy Score|
| ----------- | ----------- |
|71%|68%|
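
Because the test set is unbalanced, the two scores can differ. As an illustration, the sketch below computes both from scratch, taking balanced accuracy to be the mean of the per-class recalls. The labels and predictions are made up and are not the model's actual output.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in indices) / len(indices))
    return sum(recalls) / len(recalls)


# Hypothetical unbalanced test labels: six non-eventful plays, two eventful plays.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]

acc = accuracy(y_true, y_pred)               # 6 of 8 correct
bal_acc = balanced_accuracy(y_true, y_pred)  # mean of 5/6 and 1/2
```

In this toy case the plain accuracy is higher than the balanced accuracy because the model does better on the larger class.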

The learned weights and bias from this model were then used to calculate a score for each play, using the following equation:


$$
\begin{aligned}
Efficiency ={} & -0.08 \cdot (offenseFormation) - 0.245 \cdot (team\_win\_percentage) + 0.18 \cdot (dropbackType) \\
& + 0.99 \cdot (average\_linemen\_distance) - 0.176 \cdot (defendersInBox) - 0.42 \cdot (average\_box\_height) \\
& + 0.03 \cdot (average\_box\_width) + 0.39 \cdot (time\_to\_throw) + 0.46 \cdot (total\_circle\_time) - 0.46
\end{aligned}
$$
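
To make the equation concrete, the sketch below applies it to the sample record shown earlier. It assumes the feature values plug into the equation exactly as they appear in that table (no additional scaling), which the text does not confirm. The final sigmoid step is the standard way to turn a logistic regression score into a probability and is included only for interpretation.

```python
import math

# Coefficients and bias copied from the equation above.
WEIGHTS = {
    "offenseFormation": -0.08,
    "team_win_percentage": -0.245,
    "dropbackType": 0.18,
    "average_linemen_distance": 0.99,
    "defendersInBox": -0.176,
    "average_box_height": -0.42,
    "average_box_width": 0.03,
    "time_to_throw": 0.39,
    "total_circle_time": 0.46,
}
BIAS = -0.46


def efficiency(play):
    """Linear score: weighted sum of the play's features plus the bias term."""
    return sum(WEIGHTS[name] * play[name] for name in WEIGHTS) + BIAS


# The sample record from the modeling table.
sample_play = {
    "offenseFormation": 4,
    "team_win_percentage": 0.125,
    "dropbackType": 5,
    "average_linemen_distance": 1.720,
    "defendersInBox": 6,
    "average_box_height": 6.63,
    "average_box_width": 0.96,
    "time_to_throw": 1.8,
    "total_circle_time": 0.0,
}

score = efficiency(sample_play)
event_probability = 1 / (1 + math.exp(-score))  # sigmoid of the linear score
print(round(score, 4))  # -1.3176
```

Under these assumptions the sample play gets a negative score, which corresponds to a predicted event probability of roughly 0.21.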


## Model Evaluation
To evaluate the performance of the model, we used the publicly available Pro Football Focus article, [NFL Week 10 Offensive Line Rankings](https://www.pff.com/news/nfl-week-10-offensive-line-rankings-2021). The article also lists the previous week's rankings (Week 9), which are the rankings published after the final week of data provided.
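
The text does not say how the two sets of rankings were compared. One common option, shown here purely as an assumption, is Spearman's rank correlation, which is close to 1 when two rankings agree. The team labels and ranks below are hypothetical placeholders, not PFF's rankings or the metric's actual output.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same teams (no ties)."""
    teams = sorted(rank_a)
    n = len(teams)
    squared_diffs = sum((rank_a[t] - rank_b[t]) ** 2 for t in teams)
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))


# Hypothetical ranks (1 = best offensive line) for five placeholder teams.
metric_rank = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
pff_rank = {"A": 2, "B": 1, "C": 3, "D": 5, "E": 4}

rho = spearman_rho(metric_rank, pff_rank)
```

For these placeholder rankings, two adjacent swaps still leave a strong positive correlation of 0.8.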





# Relevance for coaches 

# Appendix

# Limitations / Future Implementations