
# Logistic Regression and Decision Trees with NFL Data


## Load the data

These data record play-by-play information for all games in the 2022 National Football League (NFL) season. These data were downloaded using the `nflverse` package for the R programming language (another statistics and data science environment), lightly edited, and saved in a tabular format for us to use in Python.

In [None]:
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

import warnings
warnings.filterwarnings('ignore')

In [None]:
file_path = "https://github.com/UM-Data-Science-101/lab-11/raw/refs/heads/main/NFL_play_by_play_2022.csv.gz"
nfl = pd.read_csv(file_path)

In [None]:
nfl.shape

(50147, 340)

There are many measurements for each play, some of which are values computed by `nflverse`. The data dictionary describes each field; here are the entries for a few of the fields we will use.

In [None]:
nfl_data_dictionary = pd.read_csv("https://github.com/UM-Data-Science-101/lab-11/raw/refs/heads/main/NFL_play_by_play_data_dictonary.csv", index_col = "Field")
nfl_data_dictionary.loc[["play_id", "game_id", "home_team", "away_team", "posteam",
                         "defteam", "yardline_100", "down", "ydstogo",
                        "touchdown", "play_type"]]

Field,Description,Type
play_id,Numeric play id that when used with game_id an...,numeric
game_id,Ten digit identifier for NFL game.,character
home_team,String abbreviation for the home team.,character
away_team,String abbreviation for the away team.,character
posteam,String abbreviation for the team with possession.,character
defteam,String abbreviation for the team on defense.,character
yardline_100,Numeric distance in the number of yards from t...,numeric
down,The down for the given play.,numeric
ydstogo,Numeric yards in distance from either the firs...,numeric
touchdown,Binary indicator for if the play resulted in a...,numeric


## Logistic Regression

A logistic regression model predicts the probability of an event occurring. Specifically, it models the log-odds of the event as a linear combination of one or more independent variables. It passes $v = a + b_1 x_1 + \dots$ through the function:

$$\frac{e^v}{1 + e^v}$$

where $e \approx 2.718$.
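
As a quick numerical check, the function above can be evaluated directly. The sketch below uses only the standard library; the name `logistic` and the example values of $v$ are illustrative and are not part of the lab code.

```python
import math

def logistic(v):
    """Map a log-odds value v to a probability, e^v / (1 + e^v)."""
    return math.exp(v) / (1 + math.exp(v))

# Negative log-odds give probabilities below 0.5, positive ones above 0.5.
for v in [-4, -1, 0, 1, 4]:
    print(v, round(logistic(v), 3))
```

A log-odds of 0 corresponds to a probability of exactly 0.5, and the output always stays strictly between 0 and 1.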

In previous models we assumed that a one-unit change in `down` leads to a constant change in the conditional probability of a touchdown. But this may not be the case. What if each down contributes a different amount to the conditional probability of a touchdown, even after accounting for whether the play is a pass or a run? Let's instead use an indicator variable for each down as the predictors.

In [None]:
pass_run = nfl.loc[nfl["play_type"].isin(["run", "pass"])].copy()
pass_run["pass"] = (pass_run["play_type"] == "pass") + 0 # Indicator of passing play. Force it to be numeric

In [None]:
tmp = pass_run[["yardline_100", "pass", "down", "touchdown"]].dropna()
tmp = tmp.join(pd.get_dummies(data = tmp["down"], prefix = "down")).dropna()
tmp.head()

Unnamed: 0,yardline_100,pass,down,touchdown,down_1.0,down_2.0,down_3.0,down_4.0
2,78.0,0,1.0,0.0,True,False,False,False
3,59.0,1,1.0,0.0,True,False,False,False
4,59.0,0,2.0,0.0,False,True,False,False
5,54.0,1,3.0,0.0,False,False,True,False
7,72.0,1,1.0,0.0,True,False,False,False


Here we create a logistic regression model predicting the probability of a `touchdown` given `pass`, `down_1.0`, `down_2.0`, `down_3.0`, `down_4.0`, and `yardline_100`. Note that we do not fit an intercept: exactly one of the four `down` dummy variables equals 1 on every play, so together they already play the role of the intercept.

In [None]:

X = tmp[["pass", "down_1.0", "down_2.0", "down_3.0", "down_4.0", "yardline_100"]]
y = tmp["touchdown"]

log_model = LogisticRegression(fit_intercept = False, penalty = None)
log_model.fit(X, y)

Print out the coefficients of the model. Looking at the sign of the coefficients, do these variables have a positive or negative impact on the likelihood of getting a touchdown?

<details>

````
log_model.coef_


# For a cleaner look:

coef_table = pd.DataFrame({
    "predictor": X.columns,
    "coefficient": log_model.coef_[0]
})

coef_table
````

</details>

*Replace this text with your answer*

Unlike linear regression, we can't interpret this result as "a one-unit change leads to a change of $b$ units in the conditional probability", but we can look at the signs and magnitudes to get an idea of how the conditional probability changes.
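
One way to read a logistic coefficient is on the odds scale: adding $b$ to the log-odds multiplies the odds by $e^b$, while the resulting change in probability depends on the starting point. The sketch below illustrates this with a made-up coefficient `b = 0.5`, which is an assumption for illustration and not a value from the fitted model.

```python
import math

def logistic(v):
    """Convert log-odds v to a probability."""
    return 1 / (1 + math.exp(-v))

b = 0.5                     # hypothetical coefficient, not taken from log_model
odds_ratio = math.exp(b)    # the odds are multiplied by this factor

# The same coefficient moves the probability by different amounts
# depending on the starting log-odds.
for v in [-3.0, 0.0]:
    p_before = logistic(v)
    p_after = logistic(v + b)
    print(f"log-odds {v:+.1f}: p goes {p_before:.3f} -> {p_after:.3f}, "
          f"change {p_after - p_before:.3f}, odds multiplied by {odds_ratio:.3f}")
```

The odds multiplier is the same in both rows, but the probability changes much less when the starting probability is close to 0.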

We can compute estimated probabilities and compare them. For example, on 4th down at the 10 yard line, assuming the team doesn't want to punt or attempt a field goal, should it run or pass to get a touchdown?

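We'll use the `predict_proba()` method to get the predicted probabilities. This looks a little tricky, but we're just creating a table with one row (the nested list `[[...]]`) whose values follow the column order of `X`: whether the team passed or ran (1st position), the four down indicators (2nd through 5th positions, so 4th down is the 5th position), and the yard line (6th position). The first trailing index picks out that single row, and the second picks a column of the output, which holds one probability per class in the order of `log_model.classes_` (column 0 for no touchdown, column 1 for a touchdown).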

In [None]:
fourth_10_run = log_model.predict_proba([[0, 0, 0, 0, 1, 10]])[0][1]   # column 1 = P(touchdown)
fourth_10_pass = log_model.predict_proba([[1, 0, 0, 0, 1, 10]])[0][1]
round(fourth_10_run - fourth_10_pass, 4)

np.float64(-0.0753)

The difference is negative: the model estimates that the probability of a touchdown is about 7.5 percentage points lower for a run than for a pass in this situation, so passing is the better choice.
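
The one-row inputs above are easy to get wrong because each position must match the column order of `X`. A small helper can build the row from named arguments instead. This is only a sketch: `make_row` and `COLUMNS` are hypothetical names introduced here, and the helper assumes the same six-column order that was used to fit `log_model`.

```python
# Column order assumed to match the X used to fit log_model.
COLUMNS = ["pass", "down_1.0", "down_2.0", "down_3.0", "down_4.0", "yardline_100"]

def make_row(is_pass, down, yardline_100):
    """Build one input row (a list) in the column order of COLUMNS."""
    if down not in (1, 2, 3, 4):
        raise ValueError("down must be 1, 2, 3, or 4")
    row = {name: 0 for name in COLUMNS}
    row["pass"] = int(is_pass)        # 1 for a pass, 0 for a run
    row[f"down_{down}.0"] = 1         # switch on the matching down indicator
    row["yardline_100"] = yardline_100
    return [row[name] for name in COLUMNS]

print(make_row(is_pass=False, down=4, yardline_100=10))  # [0, 0, 0, 0, 1, 10]
```

The result can be wrapped in an outer list and passed to `log_model.predict_proba`, for example `log_model.predict_proba([make_row(True, 4, 10)])`.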

On second down at the 50 yard line, should a team run or pass?

<details>

````
# Column 1 of predict_proba is the probability of a touchdown (class 1)
second_50_run = log_model.predict_proba([[0, 0, 1, 0, 0, 50]])[0][1]
second_50_pass = log_model.predict_proba([[1, 0, 1, 0, 0, 50]])[0][1]
second_50_run - second_50_pass
````

</details>

*Replace this text with your answer*

## Decision Trees

While logistic regression uses a linear model to fit a single decision boundary, decision trees create a hierarchical structure of non-linear partitions to separate data. Decision trees are good for handling complex, non-linear relationships and automatically detecting variable interactions. You can think of decision trees as flowchart-like diagrams used for decision-making and analysis.

Here we create a decision tree predicting `play_type` from `yardline_100`, `ydstogo`, `game_seconds_remaining`, and `score_differential`, using only fourth-down plays. The tree has a `max_depth` of 3, which caps the number of splits on the longest path from the root node to any leaf node. This setting directly controls the complexity of the decision tree model.

In [None]:
fourth_down = nfl.loc[nfl["down"] == 4].copy()
predictors = ["yardline_100", "ydstogo", "game_seconds_remaining", "score_differential"]
fourth_down["play_type"] = fourth_down["play_type"].astype("category")

In [None]:
X = fourth_down[predictors]
y = fourth_down["play_type"]
tree_model = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_model.fit(X, y.cat.codes)

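Below we can see a text rendering of the tree. The leaves show integer class codes; the table after the tree maps each code to a play type. What do you notice about the splits? In which circumstances are different variables prioritized to make decisions?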

In [None]:
print(export_text(tree_model, feature_names = predictors))

|--- yardline_100 <= 39.50
|   |--- ydstogo <= 1.50
|   |   |--- score_differential <= -3.50
|   |   |   |--- class: 5
|   |   |--- score_differential >  -3.50
|   |   |   |--- class: 5
|   |--- ydstogo >  1.50
|   |   |--- score_differential <= -11.50
|   |   |   |--- class: 2
|   |   |--- score_differential >  -11.50
|   |   |   |--- class: 0
|--- yardline_100 >  39.50
|   |--- ydstogo <= 1.50
|   |   |--- yardline_100 <= 56.50
|   |   |   |--- class: 5
|   |   |--- yardline_100 >  56.50
|   |   |   |--- class: 3
|   |--- ydstogo >  1.50
|   |   |--- game_seconds_remaining <= 261.00
|   |   |   |--- class: 3
|   |   |--- game_seconds_remaining >  261.00
|   |   |   |--- class: 3



*Replace this text with your answer*

In [None]:
pd.DataFrame({'code': fourth_down["play_type"].cat.codes, 'play_type': fourth_down["play_type"]}).drop_duplicates().sort_values("code")

Unnamed: 0,code,play_type
2708,-1,
32,0,field_goal
169,1,no_play
145,2,pass
6,3,punt
9919,4,qb_kneel
315,5,run


Using the above output, what choice would typically be made if a team is 30 yards from the opponent's end zone (`yardline_100` of 30), there are more than 1.5 yards to go for a first down, and the defense is leading by 12 points (a `score_differential` of -12)?


*Replace this text with your answer*