
# Breakout 4.c

In [None]:
import os
import pandas as pd
import numpy as np
import seaborn as sb
import statsmodels.api as sm
import matplotlib.pyplot as plt
pd.set_option('display.max_colwidth', None)

In [None]:
file_path = "https://github.com/UM-Data-Science-101/lab-11/raw/refs/heads/main/NFL_play_by_play_2022.csv.gz"
nfl = pd.read_csv(file_path)
nfl.shape

(50147, 340)

These data record play-by-play information for all games in the 2022 National Football League (NFL) season. These data were downloaded using the `nflverse` package for the R programming language (another statistics and data science environment), lightly edited, and saved in a tabular format for us to use in Python.

There are many measurements for each play, some of which are computed values from `nflverse`. Here's a brief list using the data dictionary.

In [None]:
nfl_data_dictionary = pd.read_csv("https://github.com/UM-Data-Science-101/homework-10/raw/refs/heads/main/NFL_play_by_play_data_dictonary.csv", index_col = "Field")
nfl_data_dictionary.loc[["play_id", "game_id", "home_team", "away_team", "posteam",
                         "defteam", "yardline_100", "down", "ydstogo",
                        "touchdown", "play_type"]]

| Field | Description | Type |
|---|---|---|
| play_id | Numeric play id that when used with game_id and drive provides the unique identifier for a single play. | numeric |
| game_id | Ten digit identifier for NFL game. | character |
| home_team | String abbreviation for the home team. | character |
| away_team | String abbreviation for the away team. | character |
| posteam | String abbreviation for the team with possession. | character |
| defteam | String abbreviation for the team on defense. | character |
| yardline_100 | Numeric distance in the number of yards from the opponent's endzone for the posteam. | numeric |
| down | The down for the given play. | numeric |
| ydstogo | Numeric yards in distance from either the first down marker or the endzone in goal down situations. | numeric |
| touchdown | Binary indicator for if the play resulted in a TD. | numeric |


## Question 1

### Part (a)

For this section, we will aggregate the individual plays into games.
Investigate `games` using a plot that shows the number of games played each week. The season consists of a regular season, in which all teams play, and postseason playoffs, in which only some teams play. Using the plot, how many weeks are in a regular season?


In [None]:
games = nfl.groupby("game_id").agg({"home_score": "first",
 "away_score": "first",
 "week": "first",
 "home_team": "first",
 "away_team": "first",
 "roof": "first",
 "wind": "median",
 "temp": "median",
 "play_id": "size"})

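One possible way to approach the plot, sketched on a small made-up table: count the games in each week and draw the counts as a bar chart. The name `toy_games` and its values are hypothetical; with the real data the same calls would apply to `games["week"]`.

```python
import pandas as pd

# Hypothetical stand-in for `games`: one row per game with its week number.
toy_games = pd.DataFrame({"week": [1, 1, 1, 2, 2, 2, 3, 3, 4]})

# Count the games in each week and order the result by week number.
games_per_week = toy_games["week"].value_counts().sort_index()

# Bar chart of games per week. In the real data, a sharp drop in bar height
# would mark the switch from regular season to playoffs.
ax = games_per_week.plot(kind="bar", xlabel="week", ylabel="number of games")
print(games_per_week)
```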


### Part (b)

Some people think teams benefit from playing at home. Compute the difference between the home team score and the away team score and store it as a new column (call it `"home_away_score"`).

Plot this new variable. Do you see evidence of this claim?
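
A minimal sketch of the column arithmetic and one possible plot, using made-up scores (`toy_games` and its values are hypothetical). A histogram is only one choice; a box plot would also work.

```python
import pandas as pd

# Hypothetical scores for five games (not the real data).
toy_games = pd.DataFrame({"home_score": [24, 17, 31, 10, 27],
                          "away_score": [21, 20, 14, 13, 20]})

# Positive values mean the home team outscored the away team.
toy_games["home_away_score"] = toy_games["home_score"] - toy_games["away_score"]

# Histogram of the differences with a reference line at zero. A distribution
# shifted to the right of the line would be consistent with a home advantage.
ax = toy_games["home_away_score"].plot(kind="hist", bins=5)
ax.axvline(0, color="black", linestyle="--")
```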



### Part (c)

Suppose these games represent a sample from all possible games that could have been played in 2022. Let $X$ be the home and away teams' score difference. Test the hypothesis:

$H_0: E(X) = 0$ against $H_1: E(X) \ne 0$

at the 5% level or create a 95% confidence interval for $E(X)$. What do you conclude about this hypothesis? Interpret your result as evidence for or against the claim of home field advantage.
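
A sketch of both routes on made-up differences, assuming `scipy` is available: a two-sided one-sample t-test and the matching 95% confidence interval. The array `diffs` is hypothetical; with the real data it would be `games["home_away_score"]`.

```python
import numpy as np
from scipy import stats

# Hypothetical score differences (home minus away); not the real data.
diffs = np.array([3, -3, 17, -3, 7, 10, -6, 1, 4, -14, 8, 2])

# Two-sided one-sample t-test of H0: E(X) = 0.
t_stat, p_value = stats.ttest_1samp(diffs, popmean=0)

# 95% confidence interval: mean +/- (t critical value) * (standard error).
n = len(diffs)
mean = diffs.mean()
se = diffs.std(ddof=1) / np.sqrt(n)
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

The two routes agree: the test rejects at the 5% level exactly when the 95% interval excludes zero.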




### Part (d)

One theory of home game advantage states that teams that play outdoors in cold weather are acclimated to cold weather, while teams that do not play outdoors will not perform as well in outdoor games.

We will ask a slightly simpler question: is the average home and away score difference larger in outdoor games than in indoor games?

To do this, we need to identify whether a game is played outdoors. Investigate the `"roof"` column and create a new column (call it `"is_outdoors"`) that has the value True if the game is played outdoors and False otherwise.


Use a box plot to explore whether games played outdoors have different home and away score differences than non-outdoor games.
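
A sketch of one way to build the indicator and the box plot. The roof labels used here (`"outdoors"`, `"dome"`, `"closed"`, `"open"`) are an assumption about how the column is coded, and treating only `"outdoors"` as outdoor is just one possible rule; check `games["roof"].value_counts()` before settling on a rule.

```python
import pandas as pd

# Hypothetical games. The roof labels are assumed, not taken from the data.
toy_games = pd.DataFrame({
    "roof": ["outdoors", "dome", "outdoors", "closed", "open", "outdoors"],
    "home_away_score": [7, -3, 10, 3, -4, 1],
})

# True only where the roof label is exactly "outdoors" (one possible rule).
toy_games["is_outdoors"] = toy_games["roof"] == "outdoors"

# Side-by-side box plots of the score difference for the two groups.
ax = toy_games.boxplot(column="home_away_score", by="is_outdoors")
```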




### Part (e)

Perform a difference of means hypothesis test to test the hypothesis that the average score difference is the same for both outdoors and non-outdoors games against the alternative that it is different. At the 5% level (or using a 95% confidence interval), what do you conclude?





### Part (f)

Another way to perform this test is to use linear regression. If we write:

$$E(Y \mid X = x) = a + b x$$

Then the difference of means is $$E(Y \mid X = 1) - E(Y \mid X = 0) = (a + b \cdot 1) - (a + b \cdot 0) = b$$

The hypothesis test will use a slightly different standard error calculation, but it will still be a valid way to test this hypothesis or get confidence intervals.

Use `sm.OLS` to perform a linear regression of `"home_away_score"` on `"is_outdoors"`. You will need to convert the `"is_outdoors"` variable to a numeric 1/0 version first. This can be done by using `.astype('int')` to create a new column of 0 and 1 values.

Display the confidence intervals for each coefficient. For the `is_outdoors` coefficient, what do you see?

In [None]:
## quick example for conversion
tf = pd.Series([True, False, False, True])
tf.astype("int")

|   | 0 |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 0 |
| 3 | 1 |

### Part (g)

If our theory is that outdoor games help the home team because of the weather, perhaps we can use measured temperature and wind to see whether decreasing temperature and increasing wind increase the home team's score relative to the away team's.

You will notice that there is some amount of missingness in the `"temp"` and `"wind"` columns. Create a new column that tracks whether either is missing for each game.

Compute the conditional probability of missing either of these measurements for the different `"roof"` categories. What do you notice?
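
A sketch of the missingness indicator and the conditional probabilities on made-up data. The name `weather_missing`, the roof labels, and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical games with some missing weather readings (not the real data).
toy_games = pd.DataFrame({
    "roof": ["outdoors", "outdoors", "dome", "dome", "closed", "open"],
    "temp": [45.0, 60.0, np.nan, np.nan, np.nan, 70.0],
    "wind": [10.0, np.nan, np.nan, np.nan, np.nan, 5.0],
})

# True when temp or wind (or both) is missing for the game.
toy_games["weather_missing"] = toy_games["temp"].isna() | toy_games["wind"].isna()

# The mean of a True/False column within each group estimates
# P(missing | roof category).
p_missing = toy_games.groupby("roof")["weather_missing"].mean()
print(p_missing)
```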





### Part (h)

Perform a multiple linear regression using `"is_outdoors"` (converted to 0 and 1), `"wind"`, and `"temp"`. Print out the parameters and 95% confidence intervals.

For each factor, holding the others constant, would we reject the hypothesis that the conditional mean of the score difference is independent of the factor?




## Question 2

In this question, we will look at the relationship between different types of plays (passing the ball, running the ball) and the "down" (the 4 attempts the offensive team has to gain 10 yards before turning over the ball to the other side).

Most plays are either passing or running the ball. When teams are on their 4th down and do not think they can make the full 10 yards, they will often punt. Because punting happens almost exclusively on 4th down and several of the other play types are so specialized, we will focus on just runs and passes.

In [None]:
plays = nfl.loc[nfl["play_type"].isin(["run", "pass"])].dropna(subset = ["play_type", "down"])
plays["play_type"].value_counts()

| play_type | count |
|---|---|
| pass | 20299 |
| run | 15005 |


We will relate this to the "down" column to see where runs and passes are more common.

In [None]:
plays["down"].describe()

|   | down |
|---|---|
| count | 35304.0 |
| mean | 1.810985 |
| std | 0.834178 |
| min | 1.0 |
| 25% | 1.0 |
| 50% | 2.0 |
| 75% | 2.0 |
| max | 4.0 |


### Part (a)

Create a mosaic plot for `play_type` and `down` in the `plays` table. Also compute the counts (we will use both later).

Look at the results. Do you notice anywhere that the patterns in one column are not the same as the patterns in the other column?
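
A sketch of one way to get both the counts and the mosaic plot, assuming the `mosaic` function from `statsmodels.graphics.mosaicplot`. The table `toy_plays` is made up; here the counts are computed first and passed to `mosaic` as a dictionary keyed by `(down, play_type)`.

```python
import pandas as pd
from statsmodels.graphics.mosaicplot import mosaic

# Hypothetical plays: 18 rows covering all four downs (not the real data).
toy_plays = pd.DataFrame({
    "down": [1] * 7 + [2] * 5 + [3] * 4 + [4] * 2,
    "play_type": (["pass"] * 3 + ["run"] * 4 + ["pass"] * 3 + ["run"] * 2
                  + ["pass"] * 3 + ["run"] + ["pass", "run"]),
})

# Counts for every (down, play_type) combination.
counts = toy_plays.groupby(["down", "play_type"]).size()
print(counts)

# Tile areas are proportional to the counts: the first key splits the plot
# by down, the second splits each down by play type.
fig, rects = mosaic(counts.to_dict())
```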



### Part (b)

Create a crosstab for these two variables. How is this similar to the mosaic plot above?
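
A sketch using `pd.crosstab` on a made-up table (`toy_plays` is hypothetical). The second call normalizes within each down, which gives the same proportions a mosaic plot encodes as tile heights.

```python
import pandas as pd

# Hypothetical plays: 18 rows covering all four downs (not the real data).
toy_plays = pd.DataFrame({
    "down": [1] * 7 + [2] * 5 + [3] * 4 + [4] * 2,
    "play_type": (["pass"] * 3 + ["run"] * 4 + ["pass"] * 3 + ["run"] * 2
                  + ["pass"] * 3 + ["run"] + ["pass", "run"]),
})

# Counts: rows are play types, columns are downs.
table = pd.crosstab(toy_plays["play_type"], toy_plays["down"])
print(table)

# Share of each play type within each down (every column sums to 1).
share_by_down = pd.crosstab(toy_plays["play_type"], toy_plays["down"],
                            normalize="columns")
print(share_by_down)
```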


### Part (c)

We are ultimately concerned with whether the two variables, `play_type` and `down`, are independent.

Use the function `scipy.stats.chi2_contingency` on your cross tab. Interpret the result. Do you see evidence against the claim that these are independent?
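
A sketch of the call on a made-up contingency table (the counts in `table` are hypothetical). The function returns the test statistic, the p-value, the degrees of freedom, and the counts expected under independence.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical crosstab: rows are play types, columns are downs.
table = pd.DataFrame({1: [300, 400], 2: [300, 200], 3: [300, 100], 4: [20, 15]},
                     index=["pass", "run"])

# Chi-squared test of independence between the row and column variables.
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4g}")
print(expected.round(1))  # counts expected if the variables were independent
```

A small p-value is evidence against independence. Comparing `expected` with the observed counts shows which cells drive the result.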
