# Introduction 
NFL Big Data Bowl 2025: Cornerback performance in Press-Man alignment

[Blake Robinson](https://www.linkedin.com/in/blake-r-97426a10b/) | Metric Track

Press, an alignment where the cornerback lines up within 3 yards of the line of scrimmage before the football is snapped, is an important tenant of man coverage. However, it is difficult to execute consistenly. Therefore, teams, players, and coaches stand to benefit from an increased scrutinization of positive and negative outcomes based on the input of a press-man pre-snap alignment. While 3rd parties like PFF have made great strides in the documenting qualitative data such as defensive play types and individual coverage assignments, there is unfortunately no existing data to determine whether or not a cornerback is playing in a press alignment. Thankfully, we can determine press alignment by looking at how far off the line of scrimmage a cornerback lines up before the snap, using pre-snap tracking data. This project aims to measure cornerback performance in press-man coverage.

Goal: Rank NFL cornerbacks by the following metrics, based on performance when lined up in a press-man pre snap alignment.
 - Targets allowed over expected
 - Catches allowed over expected
 - Yards allowed over expected
 - Passes defended over expected
 - Interceptions over expected

# Feature Engineering
Importing Dependencies

In [48]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import classification_report
from sklearn.ensemble import HistGradientBoostingRegressor
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

import numpy as np 
import pandas as pd 
import os
import matplotlib as plt
import seaborn as sns

Loading in each CSV data source into relevant dataframes
In a now removed one-time step, I loaded in all tracking data and filtered out all player-play records for players that aren't cornerbacks or wide receivers, and saved these datasets to new CSV files. I don't need those rows, but they make up the vast majority of these CSV files, greatly slowing down my notebook's performance at best and causing my notebook to crash at worst. Removing these non WR/CB records resolved this.

In [49]:
plays = pd.read_csv('/kaggle/input/big-data-bowl-25/plays.csv')
games = pd.read_csv('/kaggle/input/big-data-bowl-25/games.csv')
player_play = pd.read_csv('/kaggle/input/big-data-bowl-25/player_play.csv')
players = pd.read_csv('/kaggle/input/big-data-bowl-25/players.csv')

default_path = '/kaggle/input/big-data-bowl-25/'
tracking_csvs = ['tracking_full_cb_pre_snap.csv', 'tracking_full_wr_pre_snap.csv']

#Loading 2 files into 1 dataframe
tracking_list = [pd.read_csv(default_path + file) for file in tracking_csvs]
tracking_full = pd.concat(tracking_list, ignore_index=True)

pd.set_option('display.max_columns', None)

This section includes "housekeeping" steps; I created player datasets containing CBs, WRs, and QBs, to help later on down the line. I also created a universal player ID column (univ_id) which I leaned on heavily in this project. This column represents each play in every game, and is needed because playId is not unique across games, whereas univ_id is. 

This section also creates the quarterback passing data dataframe for use in team passing aggregates, by filtering down to passing plays on which the ball was thrown, to play-lines that represent quarterbacks, and while creating a new feature to represent completed passes as a "catch".

In [50]:
CBs = players[(players['position'] == "CB")]
WRs =  players[(players['position'] == "WR")]
QBs = players[(players['position'] == "QB")]

#Creating a universal play ID, as playId is not unique between games, I want one value that represents every play uniquely and easily
player_play['univ_id'] = player_play['gameId'].astype(str) + player_play['playId'].astype(str)
plays['univ_id'] = plays['gameId'].astype(str) + plays['playId'].astype(str)
tracking_full['univ_id'] = tracking_full['gameId'].astype(str) + tracking_full['playId'].astype(str)

#plays_joined is a full dataset of every play from all 9 weeks, relevant_plays is a dataset containing only pass plays
plays_joined = player_play.join(plays.set_index('univ_id'), on='univ_id', how='inner', rsuffix='_play')
relevant_plays = plays_joined.dropna(subset=['passResult'])

qb_data = relevant_plays[relevant_plays['passResult'].isin(['C', 'I', 'IN'])]

qb_data = qb_data[qb_data['nflId'].isin(QBs['nflId'])]
qb_data['catch'] = (qb_data['passResult'] == 'C').astype(int)

In [51]:
value_counts_sorted = qb_data['passResult'].value_counts()
value_counts_sorted

passResult
C     5658
I     2922
IN     194
Name: count, dtype: int64

This section includes code used to create statistical aggregates for team passing data, which will be attached to the possessionTeams in the final features dataset. Created features includes "season" (weeks 1-9) completion percentage, average depth of target, passing yards per attempt, and pass attempts per game.

In [52]:
passing_aggregated_df = (
    qb_data
    .groupby(['possessionTeam'])
    .agg(
        games_played=('gameId', 'nunique'),  
        completions=('catch', 'sum'),  
        attempts=('univ_id', 'count'),  
        air_yards_sum=('passLength', 'sum'),  
        passing_yards=('passingYards', 'sum'), 
    )
    .reset_index()
)

passing_aggregated_df['completion_percentage'] = (
    (passing_aggregated_df['completions'] / passing_aggregated_df['attempts']) * 100
)
passing_aggregated_df['avg_depth_of_target'] = (
    passing_aggregated_df['air_yards_sum'] / passing_aggregated_df['attempts']
)
passing_aggregated_df['passing_yards_per_attempt'] = (
    passing_aggregated_df['passing_yards'] / passing_aggregated_df['attempts']
)
passing_aggregated_df['attempts_per_game'] = (
    passing_aggregated_df['attempts'] / passing_aggregated_df['games_played']
)

final_passing_df = passing_aggregated_df[
    [
        'possessionTeam',
        'completion_percentage',
        'avg_depth_of_target',
        'passing_yards_per_attempt',
        'attempts_per_game'
    ]
]

This section includes numerous steps of feature prep:
- Creating univ_player_id to represent the game, play, and player of each record
- Creating a dataframe of WR play records, for use in WR feature aggregates
- Furthur shaping relevant_plays from passing plays down to include only relevant columns, CB plays, plays determined to be man coverage by PFF, and merged with CB player names, for ease of reading.

In [53]:
relevant_plays['univ_player_id'] = relevant_plays['univ_id'].astype(str) + relevant_plays['nflId'].astype(str)

wr_plays = relevant_plays[relevant_plays['nflId'].isin(WRs['nflId'])]

wr_plays = wr_plays.dropna(subset=['routeRan'])


wr_plays = wr_plays[["gameId", "playId", "nflId", "univ_player_id", "teamAbbr", "univ_id", "possessionTeam", 
                    "defensiveTeam", "yardlineSide", "yardlineNumber", "absoluteYardlineNumber", 
                    "hadPassReception", "receivingYards", "wasTargettedReceiver", 
                    "yardageGainedAfterTheCatch", "timeToThrow", "routeRan", "passLength"]]


relevant_plays = relevant_plays[["gameId", "playId", "nflId", "teamAbbr", "univ_id", "univ_player_id", "possessionTeam", "defensiveTeam", "yardlineSide", "yardlineNumber", "absoluteYardlineNumber",
                                              "passDefensed", "hadInterception", "pff_primaryDefensiveCoverageMatchupNflId", "pff_defensiveCoverageAssignment", "passResult", "passLength", 
                                               "targetX", "targetY", "pff_passCoverage", "pff_manZone", "quarter", "down",
                                            "yardsToGo", "expectedPoints", "preSnapHomeScore", "preSnapVisitorScore", "gameClock",
                                            "timeToThrow", "routeRan", "playAction", "dropbackType", "timeInTackleBox", "unblockedPressure"]]

relevant_plays = relevant_plays[relevant_plays['nflId'].isin(CBs['nflId'])]
relevant_plays = relevant_plays[(relevant_plays['pff_defensiveCoverageAssignment'].astype(str) == 'MAN')]
relevant_plays = relevant_plays.merge(players[['nflId', 'position']],  on='nflId', how='left')

tracking_full['univ_player_id'] = ((tracking_full['univ_id'].astype(str) + tracking_full['nflId'].astype(str))).str.slice(0, -2)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  relevant_plays['univ_player_id'] = relevant_plays['univ_id'].astype(str) + relevant_plays['nflId'].astype(str)


Similar to the passing aggregates section above, this section focuses on creating wide receiver statistical aggregates, including the following:
- Catches per game
- Targets per game
- Average depth of target
- Yards per game

In [54]:
wr_plays = wr_plays.dropna(subset=['routeRan'])

wr_plays_agg = (
    wr_plays
    .groupby(['nflId'])
    .agg(
        games_played=('gameId', 'nunique'),
        catches=('hadPassReception', 'sum'),
        targets=('wasTargettedReceiver', 'sum'),
        air_yards_sum=('passLength', lambda x: (wr_plays.loc[x.index, 'wasTargettedReceiver'] * x).sum()),
        receiving_yards=('receivingYards', 'sum'),
    )
    .reset_index()
)

wr_plays_agg['wr_catches_per_game'] = wr_plays_agg['catches'] / wr_plays_agg['games_played']
wr_plays_agg['wr_targets_per_game'] = wr_plays_agg['targets'] / wr_plays_agg['games_played']
wr_plays_agg['wr_avg_depth_of_target'] = wr_plays_agg['air_yards_sum'] / wr_plays_agg['targets']
wr_plays_agg['wr_receiving_yards_per_target'] = wr_plays_agg['receiving_yards'] / wr_plays_agg['targets']

wr_plays.rename(columns={'univ_player_id': 'univ_player_id_coverage_assignment'}, inplace=True)


This section utilizes the player tracking dataset to determine cornerback play-lines in which the cornerback lines up in press. Additionally, due to the above filtering to only include coverage assignments of "MAN", I know that this dataset only includes passing play-lines for cornerbacks in man coverage.

This section starts by converting the "time" column to a datetime type which can be sorted and ordered. Then, while grouping by universal player ID, it finds the latest timestamp index for each play-line (i.e. the last pre-snap timestamp before the snap, based on the above filtering) and adds that to a new dataframe, called "tracking_final_frame", to represent these last moments before snaps. These rows can then be inferred to be the cornerback's tracking position at the time of the snap.

Next, it merges this tracking data with the play data from relevant_plays, to show us the the outcome of the play for each pre-snap tracking line.

Lastly, it determines the cornerback's distance from the line of scrimmage by utilizing absoluteYardlineNumber, and subtracting the player's X value (equivalent to the player's own absoluteYardlineNumber) from it. If this number is less than or equal to 3 (meaning the corner is 3 yards or less from the line of scrimmage), then we know the corner was lined up in a press-man alignment for that play, and the play-row is given a 1 for this new "in_press" column. The final lines capture the CB's coverage assignment by nflId, and also creates the universal player ID for this coverage assignment.

The subsequent section creates a new dataframe containing all the play-rows where the corner is in a press-man alignment.

In [55]:
#Press Logic

pd.set_option('display.max_columns', None)

tracking_full['time'] = pd.to_datetime(tracking_full['time'], format='mixed')

latest_idx = tracking_full.groupby('univ_player_id')['time'].idxmax()
tracking_final_frame = tracking_full.loc[latest_idx]

tracking_final_frame = pd.merge(tracking_final_frame, relevant_plays[['univ_player_id', 'yardlineNumber', 'absoluteYardlineNumber', 'pff_primaryDefensiveCoverageMatchupNflId']], on='univ_player_id', how='inner')

tracking_final_frame = tracking_final_frame.merge(players[['nflId', 'position']],  on='nflId', how='left')

tracking_final_frame['cb_distance_from_los'] = abs(tracking_final_frame['absoluteYardlineNumber'] - tracking_final_frame['x'])
tracking_final_frame['in_press'] = (tracking_final_frame['cb_distance_from_los'] <= 3).astype(int)
tracking_final_frame['wr_coverage_assignment'] = (tracking_final_frame['pff_primaryDefensiveCoverageMatchupNflId'].astype(int)).astype(str)
tracking_final_frame['univ_player_id_coverage_assignment'] = tracking_final_frame['univ_id'].astype(str) + (tracking_final_frame['pff_primaryDefensiveCoverageMatchupNflId'].astype(int)).astype(str)

In [56]:
cb_plays_in_press = tracking_final_frame[tracking_final_frame['in_press'] == 1]

In [57]:
join_cols_df = cb_plays_in_press[['displayName', 'univ_player_id', 'univ_player_id_coverage_assignment', 'wr_coverage_assignment']]

relevant_plays_cbs_press = relevant_plays.merge(join_cols_df, on='univ_player_id', how='inner')

play_level_cb = relevant_plays_cbs_press.merge(wr_plays, on='univ_player_id_coverage_assignment', how='inner')

wr_plays_agg['nflId'] = wr_plays_agg['nflId'].astype(str)

This section creates the WR aggregate and Team passing aggregate stats dataframes, and drops irrelevant columns to create the full features dataset.

The following section completes one-hot encoding for categorical features.

In [58]:
pbp_with_wr_stats = pd.merge(
    play_level_cb,  
    wr_plays_agg,  
    how='left',  
    left_on='wr_coverage_assignment',  
    right_on='nflId' 
)

pbp_with_passing_stats = pd.merge(
    pbp_with_wr_stats,  
    final_passing_df,  
    how='left',  
    left_on='possessionTeam_x', 
    right_on='possessionTeam'  
)

training_df_full = pbp_with_passing_stats.drop(['targetX', 'targetY', 'routeRan_x', 'yardlineSide_x', 'position', 'gameId_y', 'playId_y', 'nflId_y', 'teamAbbr_y', 'univ_id_y', 'possessionTeam_y', 'defensiveTeam_y', 'yardlineSide_y', 'yardlineNumber_y', 'absoluteYardlineNumber_y', 'nflId', 'pff_defensiveCoverageAssignment', 'pff_primaryDefensiveCoverageMatchupNflId', 'possessionTeam_x', 'defensiveTeam_x', 'teamAbbr_x', 'gameId_x', 'playId_x', 'possessionTeam', 'displayName', 'univ_id_x'], axis=1)

In [59]:
encoded_df = pd.get_dummies(training_df_full, columns=['passResult', 'pff_passCoverage', 'pff_manZone', 'playAction', 'dropbackType', 'unblockedPressure', 'routeRan_y'], prefix=['result', 'passCoverage', 'manZone', 'playAction', 'dropbackType', 'unblockedPressure', 'routeRan'], dtype=int)

encoded_df_pruned = encoded_df.drop(['univ_player_id_coverage_assignment', 'wr_coverage_assignment'], axis=1)

This section converts the gameClock column into gameClock_seconds and stage of quarter, which (after one-hot encoding) is usable in the models.

In [60]:
def game_clock_to_seconds(game_clock):
    minutes, seconds = map(int, game_clock.split(':'))
    return minutes * 60 + seconds

def game_clock_period(seconds):
    if seconds > 480:
        return 'early'
    elif seconds > 240:
        return 'mid'
    else:
        return 'late'

encoded_df_pruned['gameClock_seconds'] = encoded_df_pruned['gameClock'].apply(game_clock_to_seconds)
encoded_df_pruned['gameClock_period'] = encoded_df_pruned['gameClock_seconds'].apply(game_clock_period)

encoded_df_pruned = pd.get_dummies(encoded_df_pruned, columns=['gameClock_period'], prefix=['gameClock_period'], dtype=int)

encoded_df_pruned.drop(['gameClock'], axis=1, inplace=True)

# Models

In [61]:
#Target columns = passDefensed, hadInterception, hadPassReception, receivingYards, wasTargettedReceiver
X = encoded_df_pruned.drop(['nflId_x', 'univ_player_id', 'passDefensed', 'hadInterception', 'hadPassReception', 'receivingYards', 'wasTargettedReceiver'], axis=1)
y_passDefensed = encoded_df_pruned[['passDefensed']]
y_hadInterception = encoded_df_pruned[['hadInterception']]
y_allowedCatch = encoded_df_pruned[['hadPassReception']]
y_yardsAllowed = encoded_df_pruned[['receivingYards']]
y_targetAllowed = encoded_df_pruned[['wasTargettedReceiver']]

IDs = encoded_df_pruned[['nflId_x', 'univ_player_id']]

In [62]:
X_train, X_test, y_train_targeted, y_test_targeted = train_test_split(X, y_targetAllowed, test_size=0.2, random_state=42)
_, _, y_train_catch, y_test_catch = train_test_split(X, y_allowedCatch, test_size=0.2, random_state=42)
_, _, y_train_yards, y_test_yards = train_test_split(X, y_yardsAllowed, test_size=0.2, random_state=42)
_, _, y_train_defensed, y_test_defensed = train_test_split(X, y_passDefensed, test_size=0.2, random_state=42)
_, _, y_train_intercepted, y_test_intercepted = train_test_split(X, y_hadInterception, test_size=0.2, random_state=42)

The first model is predicting whether a corner's coverage assignment will be targeted on a given play. The model is effective at predicting and identifying when a receiver will not be targeted, as indicated by high precision and recall for class 0. It is also generally effective at predicting targets, with a 80% precision. However, it performs poorly at identifying targets, doing so only 45% of the time. After testing and changes to the model, this was this was the most effective configuration I found.

In [63]:
# Model 1: Predicting if targeted
model_targeted = XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss')
model_targeted.fit(X_train, y_train_targeted)

target_probabilities = model_targeted.predict_proba(X_test)

custom_threshold = 0.7
pred_targeted_custom = (target_probabilities[:, 1] >= custom_threshold).astype(int)

print("Targeted Report with Custom Threshold (0.7):\n")
print(classification_report(y_test_targeted, pred_targeted_custom))

np.set_printoptions(suppress=True, precision=6) 

Targeted Report with Custom Threshold (0.7):

              precision    recall  f1-score   support

           0       0.88      0.97      0.92       773
           1       0.80      0.45      0.58       192

    accuracy                           0.87       965
   macro avg       0.84      0.71      0.75       965
weighted avg       0.86      0.87      0.85       965



In [64]:
full_target_probabilities = model_targeted.predict_proba(X)[:, 1]

output = encoded_df_pruned
output['target_probability'] = full_target_probabilities

The second model predicting catch probabilities performed exceptionally well compared to the target model. It predicts 97% of no-catches and identifies 100% of them, while predicting 96% of catches and identifying 75% of them.

In [65]:
# Model 2: Predicting if catch
model_catch = XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss')
model_catch.fit(X_train, y_train_catch)
pred_catch = model_catch.predict(X_test)
print("Catch Report:\n", classification_report(y_test_catch, pred_catch))

catch_probabilities = model_catch.predict_proba(X_test)[:, 1]

custom_threshold_catch = 0.7
pred_catch_custom = (catch_probabilities >= custom_threshold_catch).astype(int)

print(f"Catch Report with Custom Threshold ({custom_threshold_catch}):\n")
print(classification_report(y_test_catch, pred_catch_custom))

Catch Report:
               precision    recall  f1-score   support

           0       0.97      0.99      0.98       866
           1       0.90      0.76      0.82        99

    accuracy                           0.97       965
   macro avg       0.94      0.87      0.90       965
weighted avg       0.97      0.97      0.97       965

Catch Report with Custom Threshold (0.7):

              precision    recall  f1-score   support

           0       0.97      1.00      0.98       866
           1       0.96      0.75      0.84        99

    accuracy                           0.97       965
   macro avg       0.97      0.87      0.91       965
weighted avg       0.97      0.97      0.97       965



In [66]:
full_catch_probabilities = model_catch.predict_proba(X)[:, 1]

output['catch_probability'] = full_catch_probabilities

The third model predicts how many receiving yards a corner is expected to give up on a given play. A histogram-based gradient boosting regressor model was chosen because it was able to easily handle null values in the receiving_yards column, of which there were 6 instances. The model performs fairly well as the Mean Absolute Error is .869 yards, which is reasonable given the possible yardage range of each play being 0 to 99, with the average yardage per play being on the lower end of that range. In addition, the R squared score is a strong .843, meaning the model is able to explain 84.3% of variance in the yards allowed on each play.

In [67]:
# Model 3: Predicting receiving yards allowed
model_yards_allowed = HistGradientBoostingRegressor()

X_train, X_test, y_train_yards, y_test_yards = train_test_split(X, y_yardsAllowed, test_size=0.2, random_state=42)

model_yards_allowed.fit(X_train, y_train_yards)

y_pred_yards = model_yards_allowed.predict(X_test)

mae = mean_absolute_error(y_test_yards, y_pred_yards)
r2 = r2_score(y_test_yards, y_pred_yards)

print(f"Mean Absolute Error: {mae}")
print(f"R^2 Score: {r2}")


  y = column_or_1d(y, warn=True)


Mean Absolute Error: 0.8691313576308233
R^2 Score: 0.8433624209980296


In [68]:
y_pred_full = model_yards_allowed.predict(X) 

min_yards = 0  
max_yards = 99  

# Clipping was used to remove negative predicted_yards values, as negative receiving yard plays are incredibly rare, and are better understood as 0 receiving yards.
y_pred_clipped = np.clip(y_pred_full, min_yards, max_yards)

output['predicted_yards_allowed'] = y_pred_clipped

The fourth model predicts whether a corner will defend (deflect) a pass thrown at them. This model, along with the fifth model, encounter a new issue where the vast majority of player-plays do not result in a pass defended. Therefore, the 0 class (no pass defended) are heavily overweight, and thus require resampling to undersample the majority class and oversample the minority class. Unfortunately, based on the performance report, it appears that the model is experiencing class imbalanced as it was unable to identify any positive class instances. However, with probability prediction, the results may still be usable.

In [69]:
# Model 4: Predicting pass defensed
#model_defensed = XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss')
#model_defensed.fit(X_train, y_train_defensed)
#pred_defensed = model_defensed.predict(X_test)
#print("Defensed Report:\n", classification_report(y_test_defensed, pred_defensed))

#defensed_probabilities = model_defensed.predict_proba(X_test)[:, 1]

In [70]:
# Model 4: Predicting pass defensed

imputer = SimpleImputer(strategy='mean')
X_train_nd = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
X_test_nd = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)

smote = SMOTE(random_state=42)
undersampler = RandomUnderSampler(random_state=42)

resampling_pipeline = Pipeline([
    ('smote', smote),         
    ('undersampler', undersampler)  
])

X_resampled, y_resampled = resampling_pipeline.fit_resample(X_train_nd, y_train_defensed)

model_defensed = XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss')
model_defensed.fit(X_resampled, y_resampled)

pred_defensed = model_defensed.predict(X_test_nd)

print("Defensed Report (After SMOTE + Undersampling):\n")
print(classification_report(y_test_defensed, pred_defensed))

defensed_probabilities = model_defensed.predict_proba(X_test)[:, 1]


Defensed Report (After SMOTE + Undersampling):

              precision    recall  f1-score   support

           0       0.97      0.99      0.98       937
           1       0.00      0.00      0.00        28

    accuracy                           0.96       965
   macro avg       0.49      0.50      0.49       965
weighted avg       0.94      0.96      0.95       965



In [71]:
full_defensed_probabilities = model_defensed.predict_proba(X)[:, 1]

output['defensed_probability'] = full_defensed_probabilities

The fifth and final model projects interceptions on player-plays. As mentioned above, the majority of player-plays do not result in an interception for the cornerback, therefore hyperparameter tuning is used. However, the performance report indicates, this model is likely overfitting. Despite this, the expected probability values might once again still prove usable for this project, as we will see post-aggregation.

In [72]:
# Model 5: Predicting interceptions
X = encoded_df_pruned.drop(['nflId_x', 'univ_player_id', 'passDefensed', 'hadInterception', 'hadPassReception', 'receivingYards', 'wasTargettedReceiver'], axis=1)
y_intercepted = encoded_df_pruned[['hadInterception']]

X_train, X_test, y_train_intercepted, y_test_intercepted = train_test_split(X, y_intercepted, test_size=0.2, random_state=42)

pipeline_intercepted = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),         # Handle missing values
    ('scaler', StandardScaler()),                        # Scale features
    ('model', XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss'))  # XGB model
])

pipeline_intercepted.fit(X_train, y_train_intercepted)

pred_intercepted = pipeline_intercepted.predict(X_test)
print("Intercepted Report (After Remake):\n", classification_report(y_test_intercepted, pred_intercepted))

param_grid = {
    'model__n_estimators': [50, 100, 200],  
    'model__max_depth': [3, 6, 9],          
    'model__learning_rate': [0.01, 0.1, 0.2]  
}

grid_search = GridSearchCV(pipeline_intercepted, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train_intercepted)

print("Best parameters from GridSearchCV:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

best_model = grid_search.best_estimator_
pred_intercepted_best = best_model.predict(X_test)
print("\nIntercepted Report (With Tuning):\n", classification_report(y_test_intercepted, pred_intercepted_best))


Intercepted Report (After Remake):
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       965

    accuracy                           1.00       965
   macro avg       1.00      1.00      1.00       965
weighted avg       1.00      1.00      1.00       965

Best parameters from GridSearchCV: {'model__learning_rate': 0.1, 'model__max_depth': 3, 'model__n_estimators': 50}
Best cross-validation score: 0.9992224619127302

Intercepted Report (With Tuning):
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       965

    accuracy                           1.00       965
   macro avg       1.00      1.00      1.00       965
weighted avg       1.00      1.00      1.00       965



In [73]:
full_int_probabilities = best_model.predict_proba(X)[:, 1]

output['intercept_probability'] = full_int_probabilities

The final step of the project is to send the relevant columns a new dataframe and add the players' names to each row with the CBs dataframe. Then, each of the metrics and their expected values are summed up, and over_expected columns are calculated. Lastly, 5 new dataframes are created to contain one metric each, and ranked. 

In [74]:
rankings_pre = output[['nflId_x', 'passDefensed', 'hadInterception', 'hadPassReception', 'receivingYards', 'wasTargettedReceiver', 'target_probability', 'catch_probability', 'predicted_yards_allowed', 'defensed_probability', 'intercept_probability']]
rankings_pre.rename(columns={'nflId_x': 'nflId'}, inplace=True)

CBs_subset = CBs[['nflId', 'displayName']]
rankings_pre_names = rankings_pre.merge(CBs_subset, on='nflId', how='left')

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  rankings_pre.rename(columns={'nflId_x': 'nflId'}, inplace=True)


In [75]:
totals_expected_aggregated = (
    rankings_pre_names
    .groupby(['nflId', 'displayName'])
    .agg(
        snaps=('nflId', 'size'),
        targets=('wasTargettedReceiver', 'sum'),
        expected_targets=('target_probability', 'sum'),
        receptions_allowed=('hadPassReception', 'sum'),
        expected_receptions_allowed=('catch_probability', 'sum'),
        yards_allowed=('receivingYards', 'sum'),
        expected_yards_allowed=('predicted_yards_allowed', 'sum'),
        defensed=('passDefensed', 'sum'),
        expected_defensed=('defensed_probability', 'sum'),
        interceptions=('hadInterception', 'sum'),
        expected_interceptions=('intercept_probability', 'sum')
    )
    .reset_index()
)

totals_expected_aggregated['targets_oe'] = totals_expected_aggregated['targets'] - totals_expected_aggregated['expected_targets']
totals_expected_aggregated['receptions_oe'] = totals_expected_aggregated['receptions_allowed'] - totals_expected_aggregated['expected_receptions_allowed']
totals_expected_aggregated['receiving_yards_oe'] = totals_expected_aggregated['yards_allowed'] - totals_expected_aggregated['expected_yards_allowed']
totals_expected_aggregated['passes_defensed_oe'] = totals_expected_aggregated['defensed'] - totals_expected_aggregated['expected_defensed']
totals_expected_aggregated['interceptions_oe'] = totals_expected_aggregated['interceptions'] - totals_expected_aggregated['expected_interceptions']

totals_expected_aggregated = totals_expected_aggregated[totals_expected_aggregated['snaps'] >= 10]

# Results

**Targets:** \
*Which cornerbacks allowed the most targets over expected?* \
Chidobe Awuzie: 2.53 targets over expected | 16 vs 13.47 expected \
Cam Dantzler: 1.85 targets o/e | 7 vs 5.14 expected \
Charvarius Ward: 1.83 targets o/e | 14 vs 12.17 expected \
Cornell Armstrong: 1.75 targets o/e | 10 vs 8.25 expected \
Jalen Ramsey: 1.73 targest o/e | 6 vs 4.27 expected 

*Which cornerbacks allowed the most targets under expected?* \
Michael Carter: 2.05 targets under expected | 8 vs 10.95 expected \
Terrence Mitchell: 1.81 targets u/e | 4 vs 5.81 expected \
Byron Murphy: 1.76 targets u/e | 7 vs 8.76 expected \
DJ Reed: 1.75 targets u/e | 11 vs 12.75 expected \
Jourdan Lewis: 1.53 targets u/e | 0 vs 1.53 expected 

**Primary insights:** \
Based on the findings of the model, expected targets are a good measure of whether or not the cornerback, after aligning pre-snap in press coverage, is covering well enough to dissuade the quarterback from throwing to their coverage assignment. Corners who are allowing more targets over expected are likely not staying in an advantageous coverage position, are covering a mismatch, or are being schemed at by the offense. By contrast, cornerbacks who are allowing fewer targets than expected are likely keeping a strong position in coverage, or are being schemed away from by the offense. Given that there are 9 weeks included in the sample, it becomes a fairly safe assumption to ascribe the ratings to coverage performance.

These findings can be used in future weeks as a starting point for deeper analysis, to spur questions like: why are higher corners allowing more targets in press coverage? Or, why are the lower cornerbacks stifling their coverage assignments? From a data standpoint, these results can be used to inform a defensive coordinator on which of their corners are allowing more targets in press than they should be, or which corners are performing better than expected. In turn, pre-snap alignments can be adjusted to line up poor performers off the line of scrimmage more often, or strong performers in press more often.

These findings can also inform offenses to gameplan their receivers who are strong against press to face corners struggling in press more often. Quarterbacks and wide receivers can be informed to look for particularly advantageous matchups when a specific cornerback is lined up in press pre-snap. 

A caveat of expected targets is that cornerbacks who are allowing more targets than expected could be more often charged with covering the offense's first read or #1 receiver. For example, Chidobe Awuzie, Jalen Ramsey, and Stephon Gilmore often covered the best receiver on the opposing team, and those coverage assignments will naturally draw more targets, regardless of if the corner is in press alignment. Where outsized insight comes from are those types of cornerbacks who are allowing targets under expected, such as Marlon Humphrey, Trevon Diggs, and Marshon Lattimore.

In [76]:
targets = totals_expected_aggregated[['displayName', 'snaps', 'targets', 'expected_targets', 'targets_oe']].sort_values(by='targets_oe', ascending=False).reset_index(drop=True)
targets.to_csv('targets.csv', index=False)

targets

Unnamed: 0,displayName,snaps,targets,expected_targets,targets_oe
0,Chidobe Awuzie,74,16,13.466761,2.533239
1,Cameron Dantzler,28,7,5.141062,1.858938
2,Charvarius Ward,65,14,12.168337,1.831663
3,Cornell Armstrong,23,10,8.249307,1.750693
4,Jalen Ramsey,19,6,4.268343,1.731657
5,Patrick Surtain,59,9,7.279074,1.720926
6,Damarion Williams,28,11,9.347524,1.652476
7,Avonte Maddox,32,5,3.383404,1.616596
8,Stephon Gilmore,67,20,18.418325,1.581675
9,Donte Jackson,32,11,9.440722,1.559278


**Receptions:** \
*Which cornerbacks allowed the most catches over expected?* \
Chidobe Awuzie: 1.74 receptions over expected | 5 vs 3.26 expected \
Donte Jackson: 1.25 receptions o/e | 8 vs 6.75 expected \
Myles Bryant: 1.23 receptions o/e | 8 vs 6.77 expected \
Paulson Adebo: 1.19 receptions o/e | 9 vs 7.8 expected \
Dee Alford: 1.06 receptions o/e | 5 vs 3.94 expected 

*Which cornerbacks allowed the most catches under expected?* \
Xavien Howard: 1.29 receptions under expected | 11 vs 12.29 expected \
DJ Reed: 1.16 receptions u/e | 7 vs 8.16 expected \
Emmanuel Moseley: 0.92 receptions u/e | 2 vs 2.92 expected \
Marcus Jones: 0.81 receptions u/e | 0 vs 0.81 expected \
Eli Apple: 0.8 receptions u/e | 4 vs 4.8 expected 

**Primary Insights:** \
Similar to the insights derrived from allowed targets, expected allowed receptions will generally show which cornerbacks are performing well in coverage and positioning themselves well. However, expected catches in particular inform how well a corner is doing at the catch point. Without necessarily making a play on the ball, this metric can inform how well a cornerback is disturbing a wide receiver's process after being targeted. This (naturally) does take into account whether or not the receiver was targeted, which as described above, speaks to the corner's positioning. 

There are notable crossovers at the top and bottom of the list compared to expected targets, including Chidobe Awuzie, who appears to be performing poorly in press-man, and DJ Reed, who appears to be a strong performer in press-man coverage. Interestingly, one corner who allowed more targets than expected but allowed far fewer catches than expected was Xavien Howard. Reviewing film might inform why this is. Perhaps he isn't as effective at pre-throw positioning, and much more effective at affecting receivers while the ball is in the air. 

In [77]:
receptions = totals_expected_aggregated[['displayName', 'receptions_allowed', 'expected_receptions_allowed', 'receptions_oe', 'snaps']].sort_values(by='receptions_oe', ascending=False).reset_index(drop=True)
receptions.to_csv('receptions.csv', index=False)

receptions

Unnamed: 0,displayName,receptions_allowed,expected_receptions_allowed,receptions_oe,snaps
0,Chidobe Awuzie,5,3.259903,1.740097,74
1,Donte Jackson,8,6.752584,1.247416,32
2,Myles Bryant,8,6.771557,1.228443,58
3,Paulson Adebo,9,7.808813,1.191187,55
4,Dee Alford,5,3.93543,1.06457,32
5,Damarion Williams,6,5.011921,0.988079,28
6,Greedy Williams,2,1.029235,0.970765,15
7,Christian Benford,2,1.073723,0.926277,24
8,Darious Williams,6,5.078888,0.921112,41
9,Eric Stokes,7,6.098673,0.901327,53


**Yards allowed:** \
*Which cornerbacks allowed the most receiving yards over expected?* \
Dee Alford: 35.37 yards over expected | 93 vs 57.63 expected \
Kyler Gordon: 32.52 yards o/e | 213 vs 180.48 expected \
Paulson Adebo: 24.93 yards o/e | 155 vs 130.02 expected \
Donte Jackson: 22.24 yards o/e | 76 vs 53.76 expected \
JC Jackson: 18.03 yards o/e | 111 vs 92.97 expected 

*Which cornerbacks allowed the most receiving yards under expected?* \
Jeff Okudah: 28.58 yards under expected | 213 vs 241.58 expected \
Jaylon Johnson: 27.27 yards u/e | 82 vs 109.27 expected \
Marshon Lattimore: 26.8 yards u/e | 68 vs 94.8 expected \
Xavien Howard: 25.33 yards u/e | 256 vs 281.33 expected \
Keion Crossen: 24.76 yards u/e | 53 vs 77.76 expected 

**Primary Insights:** \
Yardage allowed provides a number of insights into what comes out of a completed pass while a specfic corner is in coverage. Higher yards over expected can indicate a tackling or angles weakness, or it can indicate a corner's tendency to play the ball instead of the receiver, manifesting more clearly when the ball is caught. 

Yardage allowed appears to have a strong crossover with receptions allowed, as the names at the top of the yardage list (Donte Jackson, Paulson Adebo, Dee Alford, for example) tend to appear at the top of the receptions allowed list. Similarly, the names at the bottom of the each lists tend to cross over (Xavien Howard, Jeff Okudah, Jaylon Johnson). This makes intuitive sense, as allowing a catch implies allowing positive receiving yards in almost all cases. 

In [78]:
yards_allowed = totals_expected_aggregated[['displayName', 'yards_allowed', 'expected_yards_allowed', 'receiving_yards_oe', 'snaps']].sort_values(by='receiving_yards_oe', ascending=False).reset_index(drop=True)
yards_allowed.to_csv('yards_allowed.csv', index=False)

yards_allowed


Unnamed: 0,displayName,yards_allowed,expected_yards_allowed,receiving_yards_oe,snaps
0,Dee Alford,93,57.62661,35.37339,32
1,Kyler Gordon,213,180.483981,32.516019,55
2,Paulson Adebo,155,130.068043,24.931957,55
3,Donte Jackson,76,53.762332,22.237668,32
4,J.C. Jackson,111,92.971883,18.028117,41
5,Jaylen Watson,143,127.026864,15.973136,82
6,Jaire Alexander,71,56.861126,14.138874,26
7,Darius Slay,22,7.887122,14.112878,32
8,Charvarius Ward,89,76.301359,12.698641,65
9,Trevon Diggs,121,108.3376,12.6624,37


**Passes defended:** \
*Which cornerbacks defended the most passes over expected?* \
Cornell Armstrong: 2.03 passes defended over expected | 4 vs 1.97 expected \
Cameron Sutton: 1.89 passes defended o/e | 3 vs 1.11 expected \
Taron Johnson: 1.83 passes defended o/e | 2 vs 0.27 expected \
Avonte Maddox: 1.67 passes defended o/e | 2 vs 0.33 expected \
Ahmad Gardner: 1.15 passes defended o/e | 5 vs 3.85 expected 

*Which cornerbacks defended the most passes under expected?* \
Kenny Moore: 1.07 passes defended under expected | 0 vs 1.07 expected \
Myles Bryant: 1.0 passes defended under expected | 2 vs 3.0 expected \
Michael Davis: 0.83 passes defended under expected | 0 vs 0.83 expected \
Roger McCreary: 0.83 passes defended under expected | 1 vs 1.83 expected \
Terrance Mitchell: 0 passes defended under expected | 0 vs 0.82 expected 

**Primary Insights:** \
Passes defended serve as a key indicator of a cornerback's ability to contest throws in their direction, either by breaking up passes or closely shadowing their assignment to disrupt completions. Cornerbacks with high PD totals relative to their targets often display exceptional timing, ball skills, and closing speed, making them critical assets in press-man coverage. Conversely, low PD totals, especially for corners with many targets, could signal challenges in effectively contesting throws or potential mismatches in coverage assignments.

These insights can help identify cornerbacks who excel at disrupting plays, even when targeted frequently, and those who might benefit from improved techniques or adjustments to pre-snap alignments. For example, defensive coordinators could leverage this metric to identify matchups where press-man coverage is yielding tangible results or consider alternative coverages when contested plays are not materializing. Offenses, on the other hand, might use PD data to avoid targeting high-performing cornerbacks in critical situations or exploit matchups where defenders struggle to break up passes.

A limitation of passes defended is that they don't fully account for uncatchable throws or passes redirected away from a cornerback's coverage due to tight alignment. Thus, this metric works best when analyzed alongside other measures like targets over expected and completion percentage allowed.

In [79]:
defensed = totals_expected_aggregated[['displayName', 'defensed', 'expected_defensed', 'passes_defensed_oe', 'snaps']].sort_values(by='passes_defensed_oe', ascending=False).reset_index(drop=True)
defensed.to_csv('defensed.csv', index=False)

defensed

Unnamed: 0,displayName,defensed,expected_defensed,passes_defensed_oe,snaps
0,Cornell Armstrong,4,1.968549,2.031451,23
1,Cameron Sutton,3,1.111666,1.888334,39
2,Taron Johnson,2,0.172485,1.827515,44
3,Avonte Maddox,2,0.333076,1.666924,32
4,Ahmad Gardner,5,3.854238,1.145762,71
5,Charvarius Ward,3,1.976346,1.023654,65
6,Denzel Ward,3,1.991172,1.008828,34
7,Paulson Adebo,3,1.993947,1.006053,55
8,Damarion Williams,2,0.996768,1.003232,28
9,Carlton Davis,3,2.076408,0.923592,46


**Interceptions:** \
*Which cornerbacks intercepted the most passes over expected?* \
Joshua Williams: 0.3 interceptions over expected | 1 vs .7 expected \
Tariq Woolen: 0.17 interceptions o/e | 2 vs 1.83 expected \
Denzel Ward: 0.08 interceptions o/e | 1 vs 0.92 expected \
Levi Wallace: 0.05 interceptions o/e | 1 vs 0.95 expected \
Bryce Callahan: 0.03 interceptions o/e | 1 vs 0.97 expected 

*Which cornerbacks intercepted the most passes under expected?* \
Kindle Vildor: 0.32 interceptions under expected | 0 vs 0.32 expected \
Stephon Gilmore: 0.26 interceptions u/e | 1 vs 1.26 expected \
Jeff Okudah: 0.12 interceptions u/e | 0 vs 0.12 expected \
Jaylen Watson: 0.1 interceptions u/e | 0 vs 0.1 expected \
Xavien Howard: 0.1 interceptions u/e | 0 vs 0.1 expected

**Primary Insights:** \
Interceptions are the ultimate disruptor in pass defense, turning potential completions into immediate turnovers and often changing the momentum of a game. Cornerbacks with high interception totals in press-man coverage likely possess not only strong coverage skills but also exceptional instincts, anticipation, and hand-eye coordination. These traits allow them to take calculated risks while maintaining tight coverage, often forcing quarterbacks to second-guess throws into their vicinity.

Cornerbacks excelling in interceptions provide tremendous value to defensive schemes by flipping field position and directly affecting the scoreboard. Defensive coordinators can analyze interception trends to identify opportunities for press alignments or coverage schemes that maximize these ball-hawking tendencies. For offenses, interception data underscores the importance of carefully considering pre-snap reads and avoiding overly aggressive throws against corners demonstrating a propensity for turnovers.

The challenge with interceptions in this context is that there is relatively little variance between the cornerbacks in the NFL, but the model can still provide an idea of who is producing slightly better or worse than the field, while outlier cases can provide insights into which corners are performing particularly well or poor in production. 

A caveat of interceptions is that they can sometimes be situational—products of poorly thrown balls, tipped passes, or opportunistic plays rather than purely strong coverage. Additionally, cornerbacks rarely targeted may have fewer interception opportunities despite excellent coverage skills. As such, interception data should complement metrics like targets, passes defended, and coverage grades to form a holistic view of performance.

In [80]:
interceptions = totals_expected_aggregated[['displayName', 'interceptions', 'expected_interceptions', 'interceptions_oe', 'snaps']].sort_values(by='interceptions_oe', ascending=False).reset_index(drop=True)
interceptions.to_csv('interceptions.csv', index=False)

interceptions

Unnamed: 0,displayName,interceptions,expected_interceptions,interceptions_oe,snaps
0,Joshua Williams,1,0.698469,0.301531,27
1,Tariq Woolen,2,1.833819,0.166181,76
2,Denzel Ward,1,0.922251,0.077749,34
3,Levi Wallace,1,0.950696,0.049304,33
4,Bryce Callahan,1,0.969265,0.030735,43
5,Marco Wilson,1,0.971414,0.028586,39
6,Darius Slay,1,0.971814,0.028186,32
7,Kyler Gordon,1,1.005617,-0.005617,55
8,Josh Jackson,0,0.010215,-0.010215,10
9,Josiah Scott,0,0.010215,-0.010215,10


# **Conclusion** 
These findings in this notebook are representative of numerous insights that a coaching staff would be interested in on a week-to-week basis. The first three metrics especially provide valuable insights into the performance of cornerbacks in press alignment pre-snap. The fourth and fifth insights provide strong feedback on how effectively cornerbacks are producing on-ball. Overall, all five metrics tell one piece of the same story: How effective is a cornerback effective in press coverage. From these metrics teams can glean how well a player positions himself, how well the affect receivers while the ball is in the error, if they have tackling or angle weaknesses, if they are schemed towards or against, how productive they are on-ball, and many more. 


# **Appendix** 
Future improvements: \
At the recommendation of the Big Data Bowl, the scope of this project was kept narrow. However, there are a number of potential improvements that could be made to provide deeper analysis and improve current insights. Here are a few:
* Add coverage assignment tracking coordinates to measure starting distance between CB and WR. This can also be used to track inside/outside leverage.
* Divide the field into quarters to determine whether or not the CB is in the slot, or on the outside.
* If this data ever becomes available, a receiver's place in the QB's read order could be an exceptional feature to add to this project.
* Expand assignment pool to include tight ends and running backs. While this would only add a small number of records, it would provide a full picture of the CBs' press performance.

