
# Model Assessment and Selection with NFL Data


## Load the data

These data record play-by-play information for all games in the 2022 National Football League (NFL) season. They were downloaded using the `nflverse` package for the R programming language (another statistics and data science environment), lightly edited, and saved in a tabular format for us to use in Python.

In [None]:
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

In [None]:
file_path = "https://github.com/UM-Data-Science-101/lab-11/raw/refs/heads/main/NFL_play_by_play_2022.csv.gz"
nfl = pd.read_csv(file_path)

In [None]:
nfl.shape

(50147, 340)

There are many measurements for each play, some of which are values computed by `nflverse`. Here is a short selection of fields, with descriptions taken from the data dictionary.

In [None]:
nfl_data_dictionary = pd.read_csv("https://github.com/UM-Data-Science-101/lab-11/raw/refs/heads/main/NFL_play_by_play_data_dictonary.csv", index_col = "Field")
nfl_data_dictionary.loc[["play_id", "game_id", "home_team", "away_team", "posteam",
                         "defteam", "yardline_100", "down", "ydstogo",
                        "touchdown", "play_type"]]

| Field | Description | Type |
|---|---|---|
| play_id | Numeric play id that when used with game_id an... | numeric |
| game_id | Ten digit identifier for NFL game. | character |
| home_team | String abbreviation for the home team. | character |
| away_team | String abbreviation for the away team. | character |
| posteam | String abbreviation for the team with possession. | character |
| defteam | String abbreviation for the team on defense. | character |
| yardline_100 | Numeric distance in the number of yards from t... | numeric |
| down | The down for the given play. | numeric |
| ydstogo | Numeric yards in distance from either the firs... | numeric |
| touchdown | Binary indicator for if the play resulted in a... | numeric |
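
Several descriptions above are cut off with `...` because pandas truncates long strings when it displays a table. The sketch below shows two ways to read the full text. It uses a small made-up table (`toy`, with a hypothetical field name `example_field`) rather than the real data dictionary, so the field and description are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the data dictionary (hypothetical field and description)
toy = pd.DataFrame(
    {
        "Description": ["A deliberately long description that pandas would shorten. " * 3],
        "Type": ["numeric"],
    },
    index=pd.Index(["example_field"], name="Field"),
)

# Option 1: pull out a single cell; strings are never truncated this way
cell = toy.loc["example_field", "Description"]
print(cell)

# Option 2: temporarily lift the column-width limit used when displaying tables
with pd.option_context("display.max_colwidth", None):
    print(toy)
```

With the real table, the same pattern would be `nfl_data_dictionary.loc["yardline_100", "Description"]`.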


## Model Assessment

The coefficient of determination, or $R^2$, measures the "goodness-of-fit" of a regression model. Specifically, $R^2$ is the proportion of the variance in the response variable that is explained by the predictor variable(s) in the model, so it indicates how closely the model's predictions match the observed data.

For standard linear regression with an intercept, $R^2$ ranges from 0 to 1 (0% to 100%). A value of 0 means the model explains none of the variability in the response, while a value of 1 means it explains all of it. $R^2$ equals the square of the Pearson correlation coefficient between the observed and predicted values; in simple linear regression (a single predictor), it also equals the squared correlation between the predictor and the response.
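
To make the definition concrete, here is a minimal sketch that computes $R^2$ by hand as $1 - SS_{res}/SS_{tot}$ and compares it with squared correlations. It uses synthetic data and plain NumPy least squares, not the NFL data or `statsmodels`; the variable names and the simulated relationship are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # synthetic data, not the NFL data
x = rng.uniform(1, 99, size=500)
y = 0.5 - 0.004 * x + rng.normal(0, 0.3, size=500)

# Fit y = b0 + b1 * x by ordinary least squares (intercept included)
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Squared correlations: predictor vs. response, and observed vs. predicted
corr_xy = np.corrcoef(x, y)[0, 1]
corr_fit = np.corrcoef(y, y_hat)[0, 1]

print(r_squared, corr_xy ** 2, corr_fit ** 2)
```

With one predictor and an intercept, the three printed numbers agree up to floating-point error.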

Let's use `sm.OLS` to fit a linear regression of `touchdown` on `yardline_100` for pass and run plays. Print out the $R^2$ value. How would you interpret this result?

In [None]:
pass_run = nfl.loc[nfl["play_type"].isin(["run", "pass"])].copy()
pass_run["pass"] = (pass_run["play_type"] == "pass").astype(int)  # 1 for a passing play, 0 for a run

<details>

````
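# Keep only rows where both variables are present
tmp = pass_run[["touchdown", "yardline_100"]].dropna()
# Take the predictor from tmp (not pass_run) so its rows match the response
model = sm.OLS(tmp["touchdown"], sm.add_constant(tmp["yardline_100"]))
model.fit().rsquared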
````

</details>

*Replace this text with your answer*

Calculate the correlation of `touchdown` and `yardline_100` and square the result. What do you notice?

<details>

````
(tmp["touchdown"].corr(tmp["yardline_100"]))**2
````

</details>

*Replace this text with your answer*

## Model Selection

Let's try adding the `ydstogo` variable as a predictor to our previous model.

Fit a regression of `touchdown` on `yardline_100` and `ydstogo`. Print out the $R^2$ value. If you had to choose between this model and the previous model, which model would you select? Why?

<details>

````
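# Drop rows with missing values so the response and predictors have matching rows
tmp = pass_run[["touchdown", "yardline_100", "ydstogo"]].dropna()
X = tmp[["yardline_100", "ydstogo"]]
model = sm.OLS(tmp["touchdown"], sm.add_constant(X))
model.fit().rsquared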
````

</details>

*Replace this text with your answer*
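
When comparing nested models like these, keep in mind that $R^2$ cannot decrease when a predictor is added, even if that predictor is unrelated to the response. The sketch below illustrates this on synthetic data with plain NumPy. It also computes adjusted $R^2$, one common correction that penalizes extra predictors; whether this lab intends that criterion is an assumption here. The helper `r2_and_adjusted` and all variable names are hypothetical.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit of y on X plus an intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

rng = np.random.default_rng(1)  # synthetic data, not the NFL data
n = 200
x1 = rng.normal(size=n)
noise_col = rng.normal(size=n)  # unrelated to y by construction
y = 2.0 * x1 + rng.normal(size=n)

r2_small, adj_small = r2_and_adjusted(x1.reshape(-1, 1), y)
r2_big, adj_big = r2_and_adjusted(np.column_stack([x1, noise_col]), y)

print("one predictor: ", r2_small, adj_small)
print("two predictors:", r2_big, adj_big)
```

In `statsmodels`, a fitted OLS result exposes the same quantities as `rsquared` and `rsquared_adj`.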

Finally, let's add the `yards_gained` variable as a predictor.

Fit a regression of `touchdown` on `yardline_100`, `ydstogo`, and `yards_gained`. Print out the $R^2$ value. If you had to choose between this model and the previous models, which model would you select? Why?

<details>

````
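# Drop rows with missing values so the response and predictors have matching rows
tmp = pass_run[["touchdown", "yardline_100", "ydstogo", "yards_gained"]].dropna()
X = tmp[["yardline_100", "ydstogo", "yards_gained"]]
model = sm.OLS(tmp["touchdown"], sm.add_constant(X))
model.fit().rsquared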
````

</details>

*Replace this text with your answer*