# Logistic Regression: Classification With Categorical Variables on Pass Plays

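Logistic regression is an algorithm that performs classification by modeling the probability of a dependent variable (Y) in terms of one or more independent variables (X). In other words, it’s a generalized linear model that predicts the probability that an event will occur. The classic form is binary; scikit-learn's `LogisticRegression` also handles targets with more than two classes, which is the case here.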
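As a minimal sketch of the idea (not part of the original notebook), the logistic function turns a linear score into a probability. The intercept and weight below are made-up values chosen only for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, for illustration only
intercept = -0.5
weight = 0.2


def predict_proba(x: float) -> float:
    # Linear score followed by the logistic transformation
    return sigmoid(intercept + weight * x)


p_zero = sigmoid(0.0)            # a score of 0 corresponds to probability 0.5
p_example = predict_proba(10.0)  # score = -0.5 + 0.2 * 10 = 1.5
```

A positive score yields a probability above 0.5 and a negative score a probability below it, which is how the fitted coefficients translate into a predicted class.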

The NFL Big Data Bowl 2023 dataset from Kaggle will be used to build the logistic regression model. This data describes the plays from the first 8 weeks of the NFL's 2021 season.

There are several input variables, including yardsToGo, dropBackType, offenseFormation, defendersInBox, pff_passCoverage and pff_passCoverageType. There is one output variable, which describes the result of the pass play: a completed pass, an incomplete pass, an interception, etc.

Our objective will be to try to predict the pass result based on the input variables mentioned above.

### Importing libraries

In [1]:
import numpy as np
import pandas as pd
import glob
import os
from datetime import datetime
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

### Importing datasets

In [2]:
players = pd.read_csv("../input/nfl-big-data-bowl-2023/players.csv")
pff = pd.read_csv("../input/nfl-big-data-bowl-2023/pffScoutingData.csv")
plays = pd.read_csv("../input/nfl-big-data-bowl-2023/plays.csv")

## Exploratory Data Analysis

### Get the position identifier, we are only interested in linemen positions:

* T: Tackle
* C: Center 
* G: Guard

In [3]:
players.officialPosition.unique()

array(['QB', 'T', 'TE', 'WR', 'DE', 'SS', 'C', 'FS', 'NT', 'DT', 'CB',
       'G', 'OLB', 'RB', 'MLB', 'ILB', 'LB', 'FB', 'DB'], dtype=object)

In [4]:
## filter by linemen positions
linemen =  players[players["officialPosition"].isin(['T','C','G'])]

In [5]:
## merge with pff dataset to get stats on the linemen
stats_data = pd.merge(pff, linemen, how='inner', on='nflId')

In [6]:
## filter by role, only interested in rows where the player is pass blocking
stats_data = stats_data[stats_data['pff_role'] == 'Pass Block']

In [7]:
## get age of linemen
today = datetime.today()
stats_data['birthDate'] = pd.to_datetime(stats_data['birthDate'], format='%Y-%m-%d')
stats_data['age'] = stats_data['birthDate'].apply(lambda x: today.year - x.year - ((today.month, today.day) < (x.month, x.day)))

In [8]:
## remove unwanted columns
columns = ['pff_positionLinedUp', 'pff_hit', 'pff_hurry', 'pff_sack', 'pff_nflIdBlockedPlayer','birthDate','pff_role','pff_backFieldBlock','pff_beatenByDefender']
stats_data.drop(columns, axis=1, inplace=True)

In [9]:
stats_data.head()

Unnamed: 0,gameId,playId,nflId,pff_hitAllowed,pff_hurryAllowed,pff_sackAllowed,pff_blockType,height,weight,collegeName,officialPosition,displayName,age
0,2021090900,97,40151,0.0,0.0,0.0,SW,6-4,319,Colorado State-Pueblo,C,Ryan Jensen,31.0
1,2021090900,410,40151,0.0,0.0,0.0,PP,6-4,319,Colorado State-Pueblo,C,Ryan Jensen,31.0
2,2021090900,434,40151,0.0,0.0,0.0,NB,6-4,319,Colorado State-Pueblo,C,Ryan Jensen,31.0
3,2021090900,456,40151,0.0,0.0,0.0,CL,6-4,319,Colorado State-Pueblo,C,Ryan Jensen,31.0
4,2021090900,480,40151,0.0,0.0,0.0,CL,6-4,319,Colorado State-Pueblo,C,Ryan Jensen,31.0


### Distribution of weight of the linemen

In [10]:
weights = stats_data.sort_values(['nflId', 'weight'], ascending = [True, False]).groupby('nflId').first().reset_index()
fig = px.histogram(weights, x="weight")
fig.show()

### Distribution of height of the linemen

In [11]:
heights = stats_data.sort_values(['nflId', 'height'], ascending = [True, False]).groupby('nflId').first().reset_index()
fig = px.histogram(heights, x="height")
fig.show()

### Distribution of age of the linemen

In [12]:
ages = stats_data.sort_values(['nflId', 'age'], ascending = [True, False]).groupby('nflId').first().reset_index()
fig = px.histogram(ages, x="age")
fig.show()

### Top 15 college contributors to NFL teams

In [13]:
colleges = stats_data.sort_values(['nflId', 'collegeName'], ascending = [True, False]).groupby('nflId').first().reset_index()
colleges = colleges.groupby('collegeName')['nflId'].count().reset_index(name='players').sort_values(by='players', ascending=False).head(15)
fig = px.bar(colleges, x='collegeName', y='players')
fig.show()

### Offensive lineman pressures rates calculation

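Hurries, hits and sacks allowed all come into play for offensive linemen, as does PFF’s pass-blocking efficiency metric, which measures pressures allowed on a per-snap basis with weighting toward sacks allowed.

The formula:  1 – (Sacks + Hits) / (2) + (Hurries) / (2) / (Total Pass-Block Snaps) * 100. Read literally, with standard operator precedence, only the hurries term is divided by the snap count, and it is added rather than subtracted, so the score rises with hurries allowed per snap. The code below implements this literal reading, which should be kept in mind when interpreting the rankings.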

According to [pff.com](https://www.pff.com/news/pro-signature-stat-spotlight-offensive-line)
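To make the operator precedence concrete, here is a small sketch in plain Python. `pff_performance_as_coded` mirrors the expression used in this notebook. `pressure_rate_per_snap` is a hypothetical alternative, an assumption on our part rather than PFF's official definition, that normalises all pressures by snaps and gives sacks full weight.

```python
def pff_performance_as_coded(sacks, hits, hurries, snaps):
    # Python evaluates this as:
    # 1 - ((sacks + hits) / 2) + (((hurries / 2) / snaps) * 100)
    return 1 - (sacks + hits) / 2 + hurries / 2 / snaps * 100


def pressure_rate_per_snap(sacks, hits, hurries, snaps):
    # Hypothetical alternative (NOT confirmed as PFF's formula):
    # sacks count fully, hits and hurries count half, all divided by snaps.
    weighted_pressures = sacks + (hits + hurries) / 2
    return (1 - weighted_pressures / snaps) * 100


# Ronnie Stanley's row in this section: 0 sacks, 1 hit, 8 hurries, 36 snaps
as_coded = pff_performance_as_coded(0, 1, 8, 36)
alternative = pressure_rate_per_snap(0, 1, 8, 36)
```

Under the literal reading the eight hurries push the score up, whereas under the hypothetical per-snap reading every pressure allowed lowers it.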

In [14]:
performance = stats_data.groupby(['nflId','displayName','age']).agg(
    sacksAllowed=('pff_sackAllowed', sum),
    hitAllowed=('pff_hitAllowed', sum),
    hurryAllowed=('pff_hurryAllowed', sum),
    totalSnaps=('nflId', 'count')
    ).reset_index()
performance['pffPerformance'] = 1-(performance['sacksAllowed']+performance['hitAllowed'])/2 + (performance['hurryAllowed']) / (2) / (performance['totalSnaps']) * 100

### Top 10 best linemen by Performance

In [15]:
performance.sort_values(by='pffPerformance', ascending=False).head(10)

Unnamed: 0,nflId,displayName,age,sacksAllowed,hitAllowed,hurryAllowed,totalSnaps,pffPerformance
90,43295,Ronnie Stanley,28.0,0.0,1.0,8.0,36,11.611111
119,44942,David Sharpe,27.0,0.0,0.0,1.0,8,7.25
193,48094,Calvin Anderson,26.0,0.0,0.0,1.0,10,6.0
202,48513,Nate Herbig,24.0,0.0,1.0,7.0,72,5.361111
220,52519,Solomon Kindley,25.0,0.0,0.0,5.0,60,5.166667
186,47914,Wes Martin,26.0,1.0,0.0,4.0,43,5.151163
164,46755,Trenton Scott,28.0,0.0,1.0,4.0,43,5.151163
204,52418,Jedrick Wills,23.0,0.0,3.0,16.0,149,4.869128
169,47805,Andre Dillard,27.0,1.0,0.0,14.0,161,4.847826
16,38606,Brandon Brooks,33.0,0.0,0.0,3.0,39,4.846154


### Top 10 worst linemen by Performance

In [16]:
performance.sort_values(by='pffPerformance', ascending=False).tail(10)

Unnamed: 0,nflId,displayName,age,sacksAllowed,hitAllowed,hurryAllowed,totalSnaps,pffPerformance
211,52477,Damien Lewis,25.0,1.0,5.0,4.0,188,-0.93617
18,38642,Bobby Massie,33.0,5.0,4.0,14.0,292,-1.10274
136,46092,Isaiah Wynn,26.0,3.0,5.0,8.0,237,-1.312236
168,47801,Garrett Bradbury,27.0,2.0,7.0,11.0,254,-1.334646
215,52491,Lloyd Cushenberry,24.0,3.0,5.0,8.0,292,-1.630137
23,39947,Eric Fisher,31.0,5.0,6.0,11.0,209,-1.868421
112,44832,Garett Bolles,30.0,5.0,4.0,8.0,282,-2.08156
30,40124,David Quessenberry,32.0,4.0,6.0,10.0,276,-2.188406
2,33107,Duane Brown,37.0,6.0,5.0,9.0,233,-2.56867
145,46152,Orlando Brown,26.0,3.0,9.0,11.0,339,-3.377581


### The block type with the most sacks allowed


The `pff_blockType` column: if the player is a blocking offensive player, the type of block that the offensive player is executing on the defender (text).
Possible values:
* BH: Backfield Help - A block from a player aligned in the backfield on which the blocker merely helps on a block rather than fully engaging his assignment. Usually seen when a blocker is clearing up a block or picking up a defender when he has broken through or been missed by another blocker
* CH: Chip Block - This is only to be used for players who chip a pass rusher when they release for their route
* CL: Second Level – A block made at the second level, this must be at least two yards across the line of scrimmage
* NB: No Block - If a blocker executes no block on a play but simply runs his path or takes his pass set then we will note him with one all blocking line with this block type
* PA: Play Action Pass Protection - A blocker pass protecting inline on a play action pass selling the play action by stepping in to show a run block before converting to pass protect
* PP: Pass Protection - A standard pass protection block from an inline blocker
* PR: Pocket Roll Block - This block type will be used any time the offense is executing a “rolling pocket” by which the entire offensive line moves with the QB’s rollout to stay in front of him but without ever taking a “conventional” pass set. There will be flexibility here to record the PR – Pocket Roll Block type in the same way as PA & RP block types in that individual matchups & responsibilities won’t always be obvious or necessary, so PR block types can be recorded by multiple blockers on an individual defender on the same play
* PT: Post Block - A post block by an offensive player in pass protection to control a defender for another blocker while clearly demonstrating that he is not, at least initially, trying to fully engage with the block
* PU: Backfield Pickup - A pass protection pick-up by a player aligned in the backfield
* SR: Set & Release - A blocker who sets to pass protect a defender before releasing. This block will cover both players releasing from a set to block for a screen as well as “hold ups” by tight ends before they leak into the flat
* SW: Switch Block - A blocker who passes off (or attempts to pass off) a defender. Most often used on stunts but can also be used for pass offs when pass rushers are slanting across the pocket or interior defenders are working to the edge to replace a dropping edge rusher, with an interior offensive lineman passing them out rather than staying with them
* UP: Pull Pass Protection - A blocker pulling in pass protection from an inline alignment to block a defender in pass protection

In [17]:
sack_by_block = stats_data.groupby('pff_blockType')['pff_sackAllowed'].sum().sort_values(ascending=False).reset_index()
fig = px.pie(sack_by_block, values='pff_sackAllowed', names='pff_blockType')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

### Finding which type of coverage allows more sacks, hits and hurries

In [18]:
## remove unwanted columns
columns = ['playDescription', 'quarter', 'down', 'possessionTeam', 'defensiveTeam','yardlineSide','yardlineNumber','gameClock','preSnapHomeScore',
          'preSnapVisitorScore','penaltyYards','prePenaltyPlayResult','playResult','foulName1','foulNFLId1','foulName2','foulNFLId2','foulName3',
          'foulNFLId3','absoluteYardlineNumber','personnelO','personnelD']
plays.drop(columns, axis=1, inplace=True)
## merge with plays dataset
stats_plays = pd.merge(stats_data, plays, how='inner', on=['gameId','playId'])

In [19]:
## select several columns with a list; tuple-style selection fails on recent pandas versions
coverages = stats_plays.groupby(['pff_passCoverage','pff_passCoverageType'])[['pff_sackAllowed','pff_hitAllowed','pff_hurryAllowed']].sum().reset_index()
## sort by sacks, hits and hurries
coverages.sort_values(by=['pff_sackAllowed','pff_hitAllowed','pff_hurryAllowed'], ascending=False)

Unnamed: 0,pff_passCoverage,pff_passCoverageType,pff_sackAllowed,pff_hitAllowed,pff_hurryAllowed
3,Cover-1,Man,90.0,130.0,497.0
5,Cover-3,Zone,83.0,146.0,536.0
4,Cover-2,Zone,37.0,55.0,243.0
10,Quarters,Zone,31.0,59.0,239.0
6,Cover-6,Zone,30.0,44.0,177.0
11,Red Zone,Other,14.0,18.0,91.0
0,2-Man,Man,9.0,13.0,42.0
2,Cover-0,Man,7.0,10.0,39.0
1,Bracket,Other,3.0,1.0,15.0
8,Miscellaneous,Other,2.0,0.0,1.0


### Pass result according to the sacks, hits and hurries

In [20]:
stats_plays.loc[stats_plays['passResult'] == 'C', 'passResult'] = "Complete pass"
stats_plays.loc[stats_plays['passResult'] == 'I', 'passResult'] = "Incomplete pass"
stats_plays.loc[stats_plays['passResult'] == 'S', 'passResult'] = "QB Sack"
stats_plays.loc[stats_plays['passResult'] == 'IN', 'passResult'] = "Interception"
stats_plays.loc[stats_plays['passResult'] == 'R', 'passResult'] = "Scramble"
pass_results = stats_plays.groupby('passResult')[['pff_sackAllowed','pff_hitAllowed','pff_hurryAllowed']].sum().reset_index()
pass_results.sort_values(by=['pff_sackAllowed','pff_hitAllowed','pff_hurryAllowed'], ascending=False)

Unnamed: 0,passResult,pff_sackAllowed,pff_hitAllowed,pff_hurryAllowed
3,QB Sack,306.0,2.0,193.0
1,Incomplete pass,0.0,265.0,697.0
0,Complete pass,0.0,187.0,733.0
2,Interception,0.0,23.0,42.0
4,Scramble,0.0,0.0,224.0


## Try to predict the result of the pass based on pass situation variables

### Data Preprocessing

First of all, we group the data and keep only the columns we are interested in. For this analysis, the pass situation variables are our independent variables and the pass result is our dependent variable.

In [21]:
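## note: playId is only unique within a game, so grouping by playId alone keeps one row per playId across all games
df = stats_plays.groupby('playId')[['yardsToGo','dropBackType','offenseFormation','pff_playAction','defendersInBox','pff_passCoverage','pff_passCoverageType','passResult']].first().reset_index()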
df.drop(['playId'], axis=1, inplace=True)
df.head()

Unnamed: 0,yardsToGo,dropBackType,offenseFormation,pff_playAction,defendersInBox,pff_passCoverage,pff_passCoverageType,passResult
0,10,TRADITIONAL,SINGLEBACK,1,8.0,Cover-3,Zone,Complete pass
1,10,TRADITIONAL,EMPTY,0,7.0,Quarters,Zone,Complete pass
2,10,TRADITIONAL,I_FORM,1,7.0,Cover-2,Zone,Complete pass
3,10,SCRAMBLE_ROLLOUT_RIGHT,PISTOL,1,6.0,Cover-3,Zone,Scramble
4,10,TRADITIONAL,SHOTGUN,0,7.0,Cover-3,Zone,Incomplete pass


Next we’ll check our dataset for null values

In [22]:
df.isnull().sum()

yardsToGo                0
dropBackType            94
offenseFormation         1
pff_playAction           0
defendersInBox           1
pff_passCoverage         0
pff_passCoverageType     0
passResult               0
dtype: int64

For the next two variables, dropBackType and offenseFormation, we fill the NaN values with the string UNKNOWN.

In [23]:
df.dropBackType.unique()

array(['TRADITIONAL', 'SCRAMBLE_ROLLOUT_RIGHT', 'DESIGNED_ROLLOUT_RIGHT',
       'SCRAMBLE', None, 'DESIGNED_ROLLOUT_LEFT', 'SCRAMBLE_ROLLOUT_LEFT',
       'DESIGNED_RUN'], dtype=object)

In [24]:
df['dropBackType'] = df['dropBackType'].fillna('UNKNOWN')

In [25]:
df.offenseFormation.unique()

array(['SINGLEBACK', 'EMPTY', 'I_FORM', 'PISTOL', 'SHOTGUN', 'WILDCAT',
       'JUMBO', None], dtype=object)

In [26]:
df['offenseFormation'] = df['offenseFormation'].fillna('UNKNOWN')

For the defendersInBox variable, we drop the row with the NaN value, as only one record is affected.

In [27]:
df.defendersInBox.unique()

array([ 8.,  7.,  6.,  5.,  4.,  9., 11.,  3., 10.,  2.,  1., nan])

In [28]:
df.dropna(subset=['defendersInBox'], inplace=True)

Now we do not have any NaN records

In [29]:
df.isnull().sum()

yardsToGo               0
dropBackType            0
offenseFormation        0
pff_playAction          0
defendersInBox          0
pff_passCoverage        0
pff_passCoverageType    0
passResult              0
dtype: int64

### Distribution of pass results

We’ll start with a visualization of our output variable

In [30]:
fig = px.histogram(df, x='passResult',barmode='group')
fig.show()

### Data Preparation: Converting Categorical Features

At this stage, we need to change the categorical variables to a format that our Logistic Regression model will understand. We can do this by converting categorical features to dummy variables.
 
We’ll convert all “non-null object” columns to dummy variables using the following process:

In [31]:
df = pd.get_dummies(df, columns=['dropBackType','offenseFormation','pff_passCoverage','pff_passCoverageType'], drop_first=True)
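To illustrate what `pd.get_dummies` with `drop_first=True` does, here is a small sketch on a toy frame. The three rows are invented for illustration and only reuse column names from this notebook.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "yardsToGo": [10, 3, 7],
    "pff_passCoverageType": ["Zone", "Man", "Other"],
})

# One indicator column per category, minus the first (alphabetical) category
encoded = pd.get_dummies(toy, columns=["pff_passCoverageType"], drop_first=True)
encoded_columns = list(encoded.columns)

# The dropped category ("Man") is the reference level: all its indicators are 0
reference_row_sum = int(encoded.iloc[1, 1:].sum())
```

Dropping the first level avoids a redundant column, since a row with all indicators at zero already identifies the reference category.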

In [32]:
labels = pd.DataFrame(df['passResult'])
## encode each pass result as an integer class label (avoids chained assignment)
label_codes = {'Complete pass': 0, 'Scramble': 1, 'Incomplete pass': 2, 'QB Sack': 3, 'Interception': 4}
labels['passResult'] = labels['passResult'].map(label_codes)
df.drop('passResult', axis=1, inplace=True)
df = df.apply(pd.to_numeric)
labels = labels.apply(pd.to_numeric)

### Data Preparation: Train and Test Split

We will now split our data into training data sets and test data sets:

In [33]:
X_train, X_test, y_train, y_test = train_test_split(df, labels, test_size=0.3, random_state=0)

ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)
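Because the pass results are imbalanced, one option worth considering (it is not used in this notebook, so the reported results do not reflect it) is a stratified split, which keeps the class proportions the same in the training and test sets. A minimal sketch on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only: 80% class 0, 20% class 1
X = np.arange(200).reshape(-1, 1)
y = np.array([0] * 160 + [1] * 40)

# stratify=y asks scikit-learn to preserve the class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

train_share_minority = (y_tr == 1).mean()
test_share_minority = (y_te == 1).mean()
```

With rare classes such as interceptions, this guards against a test set that happens to contain too few examples to evaluate.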

### Building the Model

Now that we have our training and test data sets, we can train our model and make predictions:

In [34]:
## Logistic Regression
model = LogisticRegression(class_weight= 'balanced')

## Fitting the model
model.fit(X_train_scaled, y_train)

LogisticRegression(class_weight='balanced')

### Feature Importance
To better understand our model, we can inspect the coefficient of each individual feature. Note that `model.coef_[0]` holds the coefficients for the first class only (label 0, Complete pass):

In [35]:
importances = pd.DataFrame(data={'Attribute': X_train.columns,'Importance': model.coef_[0]})
importances = importances.sort_values(by='Importance', ascending=False)

In [36]:
fig = px.bar(importances, y='Attribute',x='Importance', orientation='h')
fig.show()

Coefficients at or very near zero indicate that the feature has little influence on the model's prediction for that class. Because the inputs were standardized, the coefficient magnitudes are roughly comparable across features.
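With more than two classes, scikit-learn's `coef_` has one row per class. The sketch below uses synthetic data (not the NFL data) to show the shape and one possible way to summarise importance across classes; averaging absolute coefficients is our own choice, not something the notebook does.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data, for illustration only: 3 classes, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = np.repeat([0, 1, 2], 50)

clf = LogisticRegression(max_iter=1000).fit(X, y)

coef_shape = clf.coef_.shape  # (n_classes, n_features)

# One possible summary (an assumption, not the notebook's method):
# mean absolute coefficient per feature across all classes
overall_importance = np.abs(clf.coef_).mean(axis=0)
```

Plotting only `coef_[0]` therefore shows which features push a prediction toward or away from the first class, not an overall ranking.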

### Predictions with our model

In [37]:
## Make some predictions with our test data
predictions = model.predict(X_test_scaled)

### Model Evaluation

In [38]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.73      0.27      0.39       636
           1       0.44      0.92      0.59        59
           2       0.42      0.28      0.34       345
           3       0.05      0.14      0.08        70
           4       0.01      0.26      0.03        19

    accuracy                           0.30      1129
   macro avg       0.33      0.37      0.28      1129
weighted avg       0.56      0.30      0.36      1129



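We can see that our model has a 73% precision rate for completed passes and a 42% precision rate for incomplete passes, but recall for both classes is below 30% and overall accuracy is only 30%. Performance on the rare classes QB Sack and Interception is especially poor (F1 of 0.08 and 0.03), which is likely related to our data imbalance.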
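The F1 score in the report is the harmonic mean of precision and recall. As a small worked example, using the rounded class 0 values from the report above:

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Class 0 (Complete pass): precision 0.73, recall 0.27 (rounded report values)
f1_complete = f1_score_from(0.73, 0.27)
```

The result is close to the 0.39 shown in the report, which illustrates how a low recall drags F1 down even when precision looks acceptable.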

In [39]:
print('Mean Squared Error: ', mean_squared_error(y_test, predictions))
print('Mean Absolute Error: ', mean_absolute_error(y_test, predictions))

Mean Squared Error:  5.2506643046944195
Mean Absolute Error:  1.753764393268379


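As we can see, both the MSE and the MAE are large. Note, however, that these metrics treat the class codes 0 to 4 as ordered numbers, while our pass results are nominal categories, so the values depend on the arbitrary label coding and are hard to interpret. Together with the classification report, they point to a weak performance of our model.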
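To see why squared error is hard to interpret for nominal class labels, the sketch below (toy predictions, invented for illustration) recodes the same classes with different integers. Accuracy is unchanged, but MSE changes.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels and predictions, invented for illustration
y_true = [0, 0, 2, 4, 3]
y_pred = [0, 4, 2, 0, 3]

# Same classes, different integer codes: swap the codes 1 and 4
recode = {0: 0, 1: 4, 2: 2, 3: 3, 4: 1}
y_true_recoded = [recode[v] for v in y_true]
y_pred_recoded = [recode[v] for v in y_pred]

acc_before, acc_after = accuracy(y_true, y_pred), accuracy(y_true_recoded, y_pred_recoded)
mse_before, mse_after = mse(y_true, y_pred), mse(y_true_recoded, y_pred_recoded)
```

Here accuracy stays at 0.6 under both codings, while MSE drops from 6.4 to 0.4, even though the predictions are exactly as right or wrong as before.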