This is the NFL Big Data Bowl 2021 submission from team **isaihchrischarlie** by collaborators [Isaih](https://www.kaggle.com/zayuhtheiv), [Chris](https://www.kaggle.com/christopherhanes), and [Charlie](https://www.kaggle.com/danoff).

# Introduction

Drawing on our team's strengths, we focused less on sophisticated machine learning and more on crisp data visualizations and clear writing informed by football expertise. We all have experience doing data analysis. Chris has extensive knowledge of databases, Charlie writes, and Isaih is our [Subject Matter Expert](https://en.wikipedia.org/w/index.php?title=Subject-matter_expert) (SME) on the game, drawing on his time playing defensive back in high school. Isaih and Charlie have been researching sports analytics for multiple years and [submitted an entry](https://www.kaggle.com/danoff/neural-networks-isaih-divya-charlie-version?scriptVersionId=24162374) to the previous Big Data Bowl, which tried to predict how many yards a running back would gain on a given play.

This time we are quantitatively evaluating defenders, which has more nuance as [Brian Burke outlined](http://archive.advancedfootballanalytics.com/2010/03/measuring-defensive-playmakers.html):

> Offensive stats are straightforward, but objective defensive stats are problematic. When a running back picks up a 10-yard gain, although other teammates contributed, that's obviously a good play by the ball carrier. And when a running back stumbles at the line for no gain, that's obviously bad. But looking at the same two plays from the other side of the ball is much trickier. A strong safety, say Troy Polamalu, who makes the best play he can by preventing the runner getting past 10 yards, would be debited for that 10 yard gain. The other four or five defenders who had a chance to make the play sooner, but didn't, aren't mentioned in the play description and wouldn't be docked for the play. On the other hand, if Polamalu is playing run support, and he reads the play and stuffs the running back at the line, that's certainly to his credit. If only there were a way to credit each defender for plays like this, and at the same time ignore the plays that really should count against his teammates.

To measure that credit we wanted to come up with a [Key Performance Indicator](https://en.wikipedia.org/w/index.php?title=Performance_indicator) (KPI) for the effectiveness of [defensive backs](https://en.wikipedia.org/w/index.php?title=Defensive_back). Ideally we sought one number to answer: which player is most effective at pass coverage?

That established, we agree with what [Burke also wrote in 2010](http://archive.advancedfootballanalytics.com/2010/03/measuring-defensive-playmakers.html): "the truth is no objective, quantitative football statistic will ever capture every individual contribution of a player". We therefore tried to limit the scope of what our KPI indicated. Our SME emphasized that the most important thing for a defensive back is not letting a receiver get behind him. Coaches are typically okay with giving up "cheap stuff" to help prevent a long touchdown pass. We thought that if we looked at the defenders who are scored on, we could measure the distance between the receiver and the DB to find a "burn factor", i.e., how badly the defender was beaten or "burned" by the wide receiver. We were also interested in measuring that distance between the two players on non-scoring receptions.

Another thing that came up during our work was the placement of players before the ball is snapped. [Cornerbacks](https://en.wikipedia.org/wiki/Cornerback) are typically in front of their man, anywhere between 3 and 10 yards away pre-snap. [Safeties](https://en.wikipedia.org/wiki/Safety_(gridiron_football_position)) have more variability: they can be lined up in the box like a [linebacker](https://en.wikipedia.org/wiki/Linebacker), as far back as a free safety in a baseball [center-fielder-like](https://en.wikipedia.org/wiki/Center_fielder) role, or even in a slot corner role. Once the ball is snapped, all defensive backs are expected to read and react (i.e., consistently put themselves in a good position and recover from disadvantageous positions). Exactly how they do so depends on whether they’re assigned to play [zone defense](https://en.wikipedia.org/w/index.php?title=Zone_defense_in_American_football) or [man-to-man](https://en.wikipedia.org/w/index.php?title=Man-to-man_defense).

Ideally we hoped to consider whether the team is playing zone, because zone leads to different player responsibilities than man coverage. If we are assigning "blame" with our KPI, perhaps that blame should be split among defenders when they are running zone. It may also need more context, since zone defenders are more generous with the space, or "bubble", they give the receiver; their priority is to ensure that the receiver does not get behind them.

Consider this [quote from former college corner](https://bleacherreport.com/articles/1745443-how-to-read-and-react-a-college-cornerbacks-guide-to-pre-snap-pass-defense) Michael Felder:

> In addition to communication with teammates, the cornerback has reads he must make on his own based upon down and distance, as well as receiver alignment. While the called coverage dictates the general concept to the corner, the down-and-distance situation should play a significant role in how the cornerback plays his responsibility. Essentially, Cover 2 on 1st-and-10 is not the same as playing Cover 2 on 3rd-and-4. First-and-10 calls for more rules as a team's playbook is wide open. Third-and-4, the first down is job one and protecting the sticks against run and pass must happen.

To understand more about analyzing the zone, we looked at Dutta, Yurko, and Ventura's 2020 paper [Unsupervised Methods for Identifying Pass Coverage Among Defensive Backs with NFL Player Tracking Data](https://arxiv.org/abs/1906.11373), as recommended by [Dr. Michael Lopez](https://www.kaggle.com/c/nfl-big-data-bowl-2021/discussion/191334) of the NFL and Skidmore College. They focused on corners. Drawing on their suggestion that "a complete analysis of other positions would require the design of new features specific to the safety position and the patterns of motion of safeties in relation to their teammates and opponents", we debated focusing just on safeties, since we felt it would be insightful to understand that position better. After investigating the tracking data and seeing the nuance among free safeties, strong safeties, and combo safeties, and their various, ever-changing responsibilities (including slot corner coverage as nickel backs), we decided to simplify and include all defensive backs in our metric.

We also felt it was important to keep false positives in mind. A defender may be very far away from a receiver leaving him open, but if the quarterback never looks on that side of the field, the team will not be harmed. We felt like over the course of a season this will be accounted for because it is unlikely for quarterbacks to avoid a specific defender all season.

The data is organized in an increasingly granular fashion, first by game, then by play and then by frame where there are multiple frames per second. Caio Brighenti described the data in his [Big Data Bowl submission](https://caiobrighenti.github.io/nfl-data-bowl.html) as:

> For each play, positional, directional, and movement data is available for each player on the field at the moment the ball is handed off to the running back. In other words, the data offer a snapshot of each play, alongside how many yards were gained on that play, as well as a host of game-status variables such as the down, quarter, and time on the clock.

We wanted to evaluate which defenders close gaps best while the ball is in the air, that is, from the moment the ball leaves the quarterback’s hands until it arrives at the targeted receiver. On one play, for example, Marcus Williams was 8.19 yards away from the targeted receiver at the "pass forward" event and 5.96 yards away at the "pass arrived" event, good for a 27.2% decrease in the gap.

Another factor we wanted to consider was which specific defender we were going to analyze on a given play. We decided that a corner on the other side of the field defending a non-targeted receiver should not be part of the evaluation. We created a dummy variable to show who is closest to the targeted receiver when the ball arrives. 

To gain an outside perspective on who some of the best defensive backs were in the year we are analyzing, we reviewed the [Top 25 Cornerbacks in the NFL in 2018](https://www.pff.com/news/pro-top-25-cornerbacks-in-the-nfl-in-2018) from Pro Football Focus. As a way to audit the quality of our KPI, we will see whether our top-rated defenders align with the [PFF grades](https://www.pff.com/grades).

## Focus of Analysis

Ultimately the question we decided to focus on was: 

*Is a pass defender’s percent distance closed to the target receiver, during the pass air time, a meaningful measure of pass defender performance?*

# Methodology

Now we will outline our methodological approach to preparing the base data for analysis.

1. We define pass defenders as any player playing any of the following positions during a play: 
    * Cornerback
    * Defensive Back
    * Free Safety
    * Inside Linebacker
    * Middle Linebacker
    * Outside Linebacker
    * Safety
    * Strong Safety

2. We calculate the distance of each pass defender to the target receiver at the moment of pass arrival.

3. We identify the pass defender closest to the target receiver.

4. We calculate the identified pass defender’s distance to target receiver at the moment pass was thrown (i.e., at the pass forward event).

5. We calculate a distance closed variable by subtracting the identified defender’s distance to the target receiver at the moment of pass arrival from the distance at the moment the pass was thrown.

6. We normalize the distance closed variable by turning it into a percentage and call it Percent Distance Closed (PDC).
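
The six steps above can be sketched in a few lines of Python. This is a minimal illustration rather than our actual ETL code: the function names, the dictionary layout, and the toy coordinates are assumptions made for the example, and it presumes the normalization in step 6 divides distance closed by the throw-time distance, which is consistent with the 27.2% Marcus Williams example given earlier.

```python
import math


def percent_distance_closed(dist_at_throw, dist_at_arrival):
    """Steps 5-6: distance closed as a percentage of the throw-time distance."""
    if dist_at_throw <= 0:
        raise ValueError("throw-time distance must be positive")
    return 100.0 * (dist_at_throw - dist_at_arrival) / dist_at_throw


def closest_defender_pdc(target, defenders):
    """Steps 2-6 for one play.

    target:    {"pass_forward": (x, y), "pass_arrived": (x, y)}
    defenders: {defender_id: same structure as target}
    """
    def gap(defender, event):
        (dx, dy), (tx, ty) = defender[event], target[event]
        return math.hypot(dx - tx, dy - ty)

    # Steps 2-3: find the pass defender closest to the target at pass arrival.
    closest = min(defenders, key=lambda d: gap(defenders[d], "pass_arrived"))
    # Step 4: that same defender's distance when the pass was thrown.
    at_throw = gap(defenders[closest], "pass_forward")
    at_arrival = gap(defenders[closest], "pass_arrived")
    return closest, percent_distance_closed(at_throw, at_arrival)


# Hypothetical coordinates in yards, not taken from the tracking data.
target = {"pass_forward": (50.0, 20.0), "pass_arrived": (60.0, 20.0)}
defenders = {
    "safety": {"pass_forward": (58.0, 20.0), "pass_arrived": (63.0, 24.0)},
    "corner": {"pass_forward": (50.0, 35.0), "pass_arrived": (60.0, 33.0)},
}
closest, pdc = closest_defender_pdc(target, defenders)
print(closest, round(pdc, 1))

# The distances quoted earlier for the Marcus Williams play.
print(round(percent_distance_closed(8.19, 5.96), 1))
```

Note that a defender who ends up farther from the receiver than he started gets a negative PDC, and the value is unbounded below when the throw-time distance is small.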

## Scope

The scope of our analysis encompasses 13,896 valid pass plays. A valid play is one that meets all of the following conditions:

1. The play_type value is play_type_pass.
2. The play has a specified targeted receiver.
3. One, and only one, frame has a pass_forward event (this excludes multi-pass plays).
4. One, and only one, frame has a pass_arrived event (this excludes single-pass plays having a lateral).
5. The distance between the identified closest defender and the target receiver at the pass forward event has to be strictly positive (i.e., > 0), since it is the denominator of PDC. We excluded four plays where the distance was 0.
6. "Symmetrical" tracking data records—that is, the number of tracking data records for a particular play equals the number of unique nflId values for the play multiplied by the number of unique frameId values for the play.
7. All timestamp values for each and every Frame ID are identical.
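
As a rough illustration of conditions 3, 4, and 6, the sketch below filters a toy tracking table with pandas. It is not our production workflow: the `make_play` helper and the toy plays are hypothetical, and only the column names `gameId`, `playId`, `frameId`, `nflId`, and `event` are meant to mirror the tracking data.

```python
import pandas as pd


def make_play(game_id, play_id, events, player_ids):
    """Hypothetical helper: one tracking row per (frame, player)."""
    rows = [
        {"gameId": game_id, "playId": play_id, "frameId": frame,
         "nflId": player, "event": event}
        for frame, event in enumerate(events, start=1)
        for player in player_ids
    ]
    return pd.DataFrame(rows)


def valid_play_keys(tracking):
    """Return (gameId, playId) pairs that pass conditions 3, 4 and 6."""
    valid = []
    for (game_id, play_id), play in tracking.groupby(["gameId", "playId"]):
        # One event label per frame, regardless of how many players it has.
        frame_events = play.drop_duplicates("frameId")["event"]
        one_forward = (frame_events == "pass_forward").sum() == 1
        one_arrived = (frame_events == "pass_arrived").sum() == 1
        # Condition 6: records == unique players x unique frames.
        expected = play["nflId"].nunique() * play["frameId"].nunique()
        symmetrical = len(play) == expected
        if one_forward and one_arrived and symmetrical:
            valid.append((int(game_id), int(play_id)))
    return valid


good = make_play(1, 10, ["None", "pass_forward", "pass_arrived"], [101, 102])
two_passes = make_play(1, 20, ["pass_forward", "pass_forward", "pass_arrived"], [101, 102])
missing_row = make_play(1, 30, ["None", "pass_forward", "pass_arrived"], [101, 102]).iloc[:-1]

tracking = pd.concat([good, two_passes, missing_row], ignore_index=True)
print(valid_play_keys(tracking))
```

Only the first toy play survives: the second has two `pass_forward` frames and the third is missing one player-frame record.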

## Data Processing Notebook Workflow

The data processing workflow entailed running the notebooks listed below in the order listed.
![dataprocessingworkflow](http://danoff.org/image001.jpg)

## Data Quality Issues

1. Games data
    * Data for three games are missing; all from 9/9/2018. The missing games are: MIN vs SF, DEN vs SEA, LAC vs KC
2. Players data
    * Inconsistent height values - height values are expressed in both inches as integers and in feet-inches as strings. These values were normalized in the ETL process.
    * Inconsistent birthDate values - birthDate values are expressed in both yyyy-mm-dd and mm/dd/yyyy formats. These values were normalized in the ETL process.
3. Tracking (weeks) data
    * Duplicate records - The weekly data files collectively contain 3,769 duplicate records. These represent 0.02% of 18,309,388 total records. The duplicate records have been removed during the ETL process, yielding 18,305,619 unique records.
    * Missing nflId values - 6.81% (1,247,642 of 18,309,388) of the nflId values in the weekly data files are missing.
        * All of these missing values correspond to records for the football, which makes sense for why they do not have an nflId value.
        * To remedy this, all of these 1,247,642 records have been assigned an nflId value of 0.
    * Asymmetrical data - 135 plays have asymmetrical data; that is, the number of records for a play does not equal the number of unique nflId values for the play multiplied by the number of unique frameId values for the play.
        * These plays remain in the tracking (weeks) data, but metadata about them are stored in a table named asymmetrical_play_data_meta; additionally, their playId and gameId values are stored in a lookup table named bad_plays_data_meta, which will be used to exclude these plays from any analysis.
    * Frames with different timestamp values - 1,339 play frames have different timestamp values within the same frame instead of identical values.
        * These 1,339 frames represent only 0.1073% of the 1,247,711 total frames.
        * These frames remain in the tracking (weeks) data, but metadata about them are stored in a table named invalid_frames_meta; additionally, their corresponding playId and gameId values are stored in a lookup table named bad_plays_data_meta, which will be used to exclude these plays from any analysis.
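
To make the players-data normalization concrete, here is a small sketch of how the two height formats and the two birthDate formats described above could be reconciled. It is an illustration under assumptions, not the ETL code itself: the function names are hypothetical, and it assumes feet-inches strings look like `6-2`.

```python
from datetime import datetime


def height_to_inches(value):
    """Convert '6-2' (feet-inches) or 74 / '74' (inches) to integer inches."""
    text = str(value).strip()
    if "-" in text:
        feet, inches = text.split("-")
        return int(feet) * 12 + int(inches)
    return int(text)


def normalize_birth_date(value):
    """Parse yyyy-mm-dd or mm/dd/yyyy and return an ISO yyyy-mm-dd string."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError("unrecognized birthDate format: {!r}".format(value))


print(height_to_inches("6-2"), height_to_inches(72))
print(normalize_birth_date("1990-08-12"), normalize_birth_date("08/12/1990"))
```

Raising on an unrecognized date format, rather than silently returning a missing value, makes any third format in the raw data visible during processing.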


# Results

For the initial analysis we only included data from defensive backs (i.e., removed all linebackers) and excluded one play that had two incomplete pass events (Game ID 2018090900, Play ID 2037). The mean PDC was 22.7% with a standard deviation of 59.2%. The large spread makes sense given the wide range of outcomes, from defenders getting badly burned to close encounters. On average, the closest defensive back made up a little over one fifth of the distance between them and the targeted receiver while the ball was in the air. The histogram below shows the shape of the data:


In [None]:
# Load libraries and check versions
 
# Python version
import sys
print('Python: {}'.format(sys.version))

# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
from scipy import stats

# numpy
import numpy as np
print('numpy: {}'.format(np.__version__))

# matplotlib

import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
%matplotlib inline
import matplotlib.pyplot as plt

# seaborn

import seaborn


# pandas
import pandas as pd
print('pandas: {}'.format(pd.__version__))
from pandas.plotting import scatter_matrix

# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# statsmodels

import statsmodels
from statsmodels.compat import lzip
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
import statsmodels.api as sm

# read CSV of selected 2018 pass data 

data = pd.read_csv('../input/onlydbs/alpha_analysis_detail_few_columns_only_dbs.csv', header=0)
data = data.dropna()
data.head()

# make PDC histogram

data.km_pctDistClosed.hist(color='yellow')
plt.suptitle('Percent of Distance Closed to Targeted Receiver by Defensive Back')
plt.title('While Pass is in the Air Histogram')
plt.xlabel('% of Distance Closed')
plt.ylabel('Frequency')
plt.xlim(-900, 100)
plt.grid(False)  # turn off the grid that pandas' hist() enables by default
plt.text(-300, 350, r'$\mu=22.7$')
plt.text(-300, 300, r'$\sigma=59.2$')
plt.text(-300, 250, r'min = -855.8')
plt.text(-300, 200, r'max = 95.2')
plt.savefig('hist_pct_distance_closed')

Next, we looked at whether our key metric increased the odds of successful on-field outcomes for defenders. We ran a logistic regression with PDC as the independent variable and whether the pass was incomplete as the dependent variable. PDC was not significant at the 10% level (*p* = 0.18), as you can see below.

In [None]:
# Binary logistic regression to predict incomplete passes take one

y=data['pm_efcPassOutcomeIncomplete']

data["constant"] = 1.0

x=data[['km_pctDistClosed', 'constant']]

logit_model=sm.Logit(y,x)
result=logit_model.fit(method='bfgs')
print(result.summary2())

params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']

print(np.exp(conf))

We tried again with more independent variables: 
* Number of defenders in the box
* Number of pass rushers
* Distance closed (total, not percentage)
* Defender weight
* Defender height
* Targeted receiver height
* Targeted receiver weight
* Yardline
* Yards to go
* Expected Points Added (EPA)

This time PDC was significant at the 10% level (*B* = 0.008, *p* = 0.029, Odds Ratio = 1.008). In this model, the better the defensive back is at closing the gap between themselves and the targeted receiver while the ball is in the air, the better the odds of an incomplete pass. The other significant predictors were number of pass rushers, distance closed (in total), yardline number, yards to go, and EPA.

In [None]:
# Binary logistic regression to predict incomplete passes take two

y2=data['pm_efcPassOutcomeIncomplete']

data["constant"] = 1.0

x2=data[['km_pctDistClosed', 'p_defendersInTheBox', 'p_numberOfPassRushers', 'km_distClosed', 'tdf_weight', 'tdf_height', 'ttr_height', 'ttr_weight', 'p_yardlineNumber', 'p_yardsToGo', 'p_epa',
        'constant']]

logit_model=sm.Logit(y2,x2)
result=logit_model.fit(method='bfgs')
print(result.summary2())

params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']

print(np.exp(conf))

We ran it again with only the significant predictors. This time they were all significant at the 10% level, except for yards to go. Of the significant variables, the number of pass rushers had the highest odds ratio at 1.491 per additional rusher, indicating that in this regression it increased the odds of an incomplete pass the most. PDC was second at 1.008 per percentage point. This suggests that the number of pass rushers (which may lead to the quarterback being more pressured) is more influential on the outcome of a pass than how much the closest defensive back closes the gap to the target receiver.

In [None]:
# Binary logistic regression to predict incomplete passes take three

y3=data['pm_efcPassOutcomeIncomplete']

data["constant"] = 1.0

x3=data[['km_pctDistClosed', 'p_numberOfPassRushers', 'km_distClosed', 'p_epa', 'p_yardlineNumber', 'p_yardsToGo',
        'constant']]

logit_model=sm.Logit(y3,x3)
result=logit_model.fit(method='bfgs')
print(result.summary2())

params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']

print(np.exp(conf))
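
When reading the odds ratios reported above, it helps to remember that each one is per unit of its own predictor, so 1.491 (one additional pass rusher) and 1.008 (one unit of PDC) are on different scales. The short calculation below is only a sketch of that arithmetic. It assumes one unit of `km_pctDistClosed` is one percentage point, which is what the histogram axis suggests, and the variable names are made up for the example.

```python
import math

# Odds ratios reported in the text above.
or_pdc = 1.008       # per one unit (assumed: one percentage point) of PDC
or_rushers = 1.491   # per one additional pass rusher

# An odds ratio is exp(coefficient), so the implied coefficients are:
coef_pdc = math.log(or_pdc)
coef_rushers = math.log(or_rushers)

# Odds ratio for a 10-point increase in PDC, holding other predictors fixed.
or_pdc_10_points = math.exp(10 * coef_pdc)

# How many PDC points carry the same change in log-odds as one extra rusher.
points_per_rusher = coef_rushers / coef_pdc

print(round(coef_pdc, 4), round(or_pdc_10_points, 3), round(points_per_rusher, 1))
```

Under these assumptions a 10-point increase in PDC multiplies the odds of an incompletion by about 1.08, so comparing the two predictors depends on how much each one realistically varies from play to play.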

Next we looked to see how individual defensive backs scored in PDC. In the chart below we show the top 15 defensive backs in PDC for the 2018 season with a minimum of 40 plays where they were the closest defender to the targeted receiver. 

![Top 15 Defensive Backs](http://danoff.org/Top15DefensiveBacksDash3.png)
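
A ranking like the one in this chart can be produced with a grouped aggregation. The sketch below is illustrative only: the chart itself was built separately, `defender_name` is a hypothetical column, and the toy values are invented. Only `km_pctDistClosed` matches a column used in the notebook cells.

```python
import pandas as pd


def top_defenders(plays, min_plays=40, n=15):
    """Mean PDC per defender, keeping only those with at least min_plays plays."""
    per_player = (
        plays.groupby("defender_name")["km_pctDistClosed"]
        .agg(plays="count", mean_pdc="mean")
    )
    qualified = per_player[per_player["plays"] >= min_plays]
    return qualified.sort_values("mean_pdc", ascending=False).head(n)


# Toy data: defender C has the best single play but too few plays to qualify.
toy = pd.DataFrame({
    "defender_name": ["A", "A", "A", "B", "B", "C"],
    "km_pctDistClosed": [30.0, 40.0, 50.0, 10.0, 20.0, 90.0],
})
ranking = top_defenders(toy, min_plays=2, n=2)
print(ranking)
```

The minimum-plays threshold matters because a mean over a handful of plays is dominated by one or two extreme PDC values.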

Adrian Amos, a safety for the Bears, had the highest average PDC at 44%. He was not one of the top defenders according to Pro Football Focus, but his fellow Bears safety Eddie Jackson, who was 4th on our list at 36%, was listed as [the top safety of the year by PFF](https://www.pff.com/news/pro-best-player-at-every-position-in-the-nfl-in-2018). Among the cornerbacks on our list, Johnathan Joseph and Kareem Jackson of the Texans, Adoree’ Jackson of the Titans, and A.J. Bouye of the Jaguars were also listed in the [top 25 cornerbacks of 2018 by PFF](https://www.pff.com/news/pro-top-25-cornerbacks-in-the-nfl-in-2018).

We then looked to see how different positions fared. Free Safeties had the highest mean PDC at 29%. Cornerbacks were the lowest amongst defensive backs at 17%. Not far behind them were Middle Linebackers at 15%. 

We felt that this made sense when considering that cornerbacks are typically tracking their assignments more closely when the ball is thrown in the first place, as evidenced by our data. We observed that cornerbacks were 4.57 yards away from the targeted receiver when the pass was thrown, on average (n=6,879). For safeties of all types the distance was more at 6.24 yards away, on average (n=2,609). 

Due to this, we have reason to believe that the PDC metric is higher for safeties because cornerbacks are already in a better position to make a play both when the ball is thrown and when it arrives. At pass arrival, cornerbacks are 3.49 yards away on average (n=6,879), compared with 4.16 yards for safeties (n=2,609).

With the larger gap for safeties they have more of an incentive to “get on their horse” and cover as much ground as possible by the time the ball does arrive, hence the larger percentages illustrated below:

![Comparing Positions](http://danoff.org/ComparingPositions3.png)
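
As a rough cross-check, the average distances quoted above can be turned into percentages directly. This is only a sketch of the arithmetic, with a made-up helper name. A percentage computed from average distances is not the same quantity as the mean of per-play PDC values, so it is not expected to reproduce the per-position means exactly.

```python
def pct_closed(avg_dist_at_throw, avg_dist_at_arrival):
    """Percent of the average throw-time gap closed by pass arrival."""
    return 100.0 * (avg_dist_at_throw - avg_dist_at_arrival) / avg_dist_at_throw


# Average distances (yards) quoted in the text above.
cornerbacks = pct_closed(4.57, 3.49)
safeties = pct_closed(6.24, 4.16)

print(round(cornerbacks, 1), round(safeties, 1))
```

This gives roughly 23.6% for cornerbacks and 33.3% for safeties, the same ordering as the per-play means, with safeties closing a larger share of a larger starting gap.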

We also looked at a scatter plot with average PDC on the y-axis and average pass distance on the x-axis. Similar to our finding above, among defensive backs, free safeties covered the longest passes on average, while cornerbacks covered the shortest.

![Scatter Plot](http://danoff.org/ScatterPlot.png)

## Including Zone as a Variable

We thought that there could be some interesting findings within our data if we parsed things out by zone and man coverage for context. Within our final dataset of 13,896 plays we found 737 observations (5%) that had information on what coverage the defense was in (thanks to the [bonus dataset](https://www.kaggle.com/tombliss/nfl-big-data-bowl-2021-bonus)). Of those 737 observations, 509 (69%) were zone, and 228 (31%) were man coverage. Here is a quick rundown of the coverage-contextualized findings:

* The average number of defenders in the box when in man coverage was 6.37 to zone’s 5.95
* The average number of expected points added is 0.27 when man coverage is called versus zone’s 0.07.
    * This falls in line with our expectations where higher stakes situations call for tighter coverage.
* An offense’s yards to go on average was 8.02 in man, and 9.62 in zone
    * This also falls in line with the expectations of higher stakes situations calling for man coverage since it is typically tighter
* As expected, the defensive back’s average distance to the targeted receiver when the ball is passed forward in man coverage (3.33 yards) is lower than in zone (5.90)
* As expected, the defensive back’s average distance to the targeted receiver when the ball arrives in man coverage (2.67 yards) is lower than in zone (4.38)
* As expected, the average distance closed by a defensive back on a targeted receiver is lower in man coverage (0.65 yards) than in zone (1.52)
* The average PDC by a defensive back on a targeted receiver is lower in man coverage (7.37%) than in zone (20.88%)
* Offenses typically earn nearly one extra yard on average when passing versus man coverage (8.97) over zone (8.02) 
* In man coverage, the offense’s pass outcome is found to be a touchdown 3.51% of the time compared to 1.38% in zone. 
    * Zone is considered to be a safer option. For example, at the end of a game defenses may employ a prevent defense, which can be thought of as a “hyper” zone. And as we mentioned above man is used more in tighter situations so it makes sense that there are more plays with defensive backs scored upon since they are in a riskier situation in the first place as evidenced by the EPA data referenced above. 
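
Per-coverage averages like the ones in this list come from grouping plays by coverage type. The sketch below shows one way to do that with pandas. It is a hedged illustration: the `coverage` and `is_touchdown` columns are hypothetical names, the four toy rows are invented, and only `km_distClosed` and `km_pctDistClosed` match columns used in the notebook cells.

```python
import pandas as pd

# Toy plays; values are invented for illustration only.
toy_plays = pd.DataFrame({
    "coverage": ["man", "man", "zone", "zone"],
    "km_distClosed": [0.5, 1.0, 1.0, 2.0],
    "km_pctDistClosed": [5.0, 10.0, 15.0, 25.0],
    "is_touchdown": [1, 0, 0, 0],
})

by_coverage = toy_plays.groupby("coverage").agg(
    plays=("km_distClosed", "size"),
    mean_dist_closed=("km_distClosed", "mean"),
    mean_pdc=("km_pctDistClosed", "mean"),
    touchdown_rate=("is_touchdown", "mean"),
)
print(by_coverage)
```

Keeping the play count next to each mean is useful here, since only 737 plays carry a coverage label and the man-coverage group is the smaller of the two.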

Finally, we compared PDC by position and coverage. In all cases the PDC was higher for zone than it was for man.

![Comparing Positions by Coverage](http://danoff.org/ComparingPositionsbyCoverage3.png)


# Future Work

In the future we would like to understand the quantitative measurements of defensive backs more deeply. For instance, similar to what [Dutta, Yurko, and Ventura](https://arxiv.org/abs/1906.11373) suggested, can we create specific models for free safeties, strong safeties, and even nickel backs? Can we come up with a metric for how effective defensive backs are at stopping the run? Now that we know more about PDC, can we connect it to [data from the annual NFL combine](https://www.pro-football-reference.com/draft/2020-combine.htm) to see whether things such as the shuttle drill actually connect with on-field performance? Could these defensive metrics help fantasy football players choose which defense to start in a given week, or help bettors make the winning pick?

# Conclusion

Our key takeaway is that, once other play-level factors are controlled for, a higher PDC is associated with significantly higher odds of an incomplete pass, and that PDC roughly aligns with Pro Football Focus's qualitative assessments of top defensive backs. Additionally, PDC is meaningfully affected by both position and the type of defensive coverage.

Given all of this, we can see that situational football plays heavily into the type of coverage called on any given play. Thus, there is a lot of context to be considered when evaluating defensive back play. Coaches should consider PDC and the other distance tracking metrics referenced here when evaluating their defensive backs' ability to read and react and, in turn, to prevent chunk yardage and touchdowns.