<div class="alert alert-block alert-danger">


# 15B: Does it pay to run in football?


**Use with textbook version 6.0+**


**Lesson assumes students have read up through page: 15.8**




</div>


In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

nfl_salary <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQ6ZcO9y6f23HWKRUe8IKlYFUFhDw5trtNlyJJj7bI7aKVstNJJcsi_bwuEsb-vSPUwUz0diRa_JOKL/pub?gid=1297158575&single=true&output=csv")[,-1]


## About the Data

The dataset `nfl_salary` contains data from [Spotrac.com](https://www.spotrac.com) that was used by [fivethirtyeight](https://fivethirtyeight.com/features/running-backs-are-finally-getting-paid-what-theyre-worth/). Each row represents a particular position in American football in a particular year.

- `year` Ranges from 2011-18.
- `salary` The average salary (in millions of dollars) for the 16 highest-paid players in each position that year.
- `position` The NFL position: 
    - QB = Quarterback
    - RB = Running Back
    - S = Safety
    - ST = Special Teamer 
    - TE = Tight End 
    - WR = Wide Receiver 
    - CB = Cornerback 
    - DL = Defensive Lineman
    - LB = Linebacker
    - OL = Offensive Lineman

## 1.0 - Explore Variation

1.1 - First, take a moment to explore the data frame (look at whatever you are curious about!). 

1.2 - Does the salary of the highest paid players vary by year? Does it vary by position? Write a word equation that proposes that salaries vary by both year and position.

1.3 - Make a visualization to explore your hypothesis.

In [None]:
# Sample Responses

gf_jitter(salary ~ year, data = nfl_salary, color = ~position, size = .3) %>%
  gf_facet_wrap(~ position)

## 2.0 - Running Backs versus Quarterbacks

2.1 - To simplify our models a bit, let's zoom in on running backs and quarterbacks. Create a data set with data only for running backs and quarterbacks. 

Re-make your visualization with your new smaller data set.

In [None]:
# Sample Responses

nfl2 <- filter(nfl_salary, position == "QB" | position == "RB")

gf_jitter(salary ~ year, data = nfl2, color = ~position) %>%
  gf_facet_wrap(~ position)

2.2 - Which do you think will be a better model for this data: additive or interaction? Why?

<div class="alert alert-block alert-warning">

**Sample Responses:**

An additive model made of parallel lines would not fit this data as well as an interaction model with lines that have different slopes -- one more positive and steep, one more flat.

</div>

## 3.0 - Model Data

3.1 - Fit the model that would be a better fit for this data. Add it to your visualization. 

In [None]:
# Sample Responses

int_model <- lm(salary ~ year * position, data = nfl2)

gf_jitter(salary ~ year, data = nfl2, color = ~position) %>%
  gf_facet_wrap(~ position) %>%
  gf_model(int_model)

3.2a - Express the best fitting model in GLM notation.

In [None]:
# Sample Responses

int_model

<div class="alert alert-block alert-warning">

**Sample Responses:**

$salary_i = -3556.414 + 1.774year_i + 3832.565positionRB_i + -1.908year_i*positionRB_i$


</div>

3.2b - Take that equation and parse it into two separate equations, one representing the model for QB salaries and the other representing the model for RB salaries.

<div class="alert alert-block alert-warning">

**Sample Responses:**

The way to get to these is by assuming the value of $positionRB_i$ is 0 (for quarterback model) or 1 (for running back model).

- QB: $salary_i = -3556.414 + 1.774year_i + 3832.565*0 + -1.908year_i*0$
- RB: $salary_i = -3556.414 + 1.774year_i + 3832.565*1 + -1.908year_i*1$

To write it more simply: 

- QB: $salary_i = -3556.414 + 1.774year_i$
- RB: $salary_i = -3556.414 + 1.774year_i + 3832.565 + -1.908year_i$

By combining like terms, we can simplify the RB equation into:

- RB: $salary_i = 276.151 + -0.134year_i$



</div>
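
The same parsing can be sketched in R by plugging 0 or 1 in for $positionRB_i$. This is only an illustration: it uses the rounded estimates from the sample response above, and the names `b` and `line_for` are made up for this sketch (they are not part of `coursekata`).

```r
# Rounded parameter estimates copied from the sample response above
b <- c(intercept = -3556.414, year = 1.774,
       positionRB = 3832.565, interaction = -1.908)

# Hypothetical helper: returns the intercept and slope of the line
# for one position, given the value of positionRB (0 = QB, 1 = RB)
line_for <- function(position_rb) {
  c(intercept = unname(b["intercept"] + b["positionRB"] * position_rb),
    slope     = unname(b["year"] + b["interaction"] * position_rb))
}

qb_line <- line_for(0)  # positionRB terms drop out
rb_line <- line_for(1)  # positionRB terms are added in

qb_line
rb_line
```

Setting `position_rb` to 1 adds the `positionRB` estimate to the intercept and the interaction estimate to the slope, which is the "combining like terms" step written out.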

In [None]:
# Using R as a calculator to simplify the RB equation
-3556.414 + 3832.565
1.774 - 1.908

3.3 - Interpret the slope for the model of RB salaries. What does it mean? 

<div class="alert alert-block alert-warning">

**Sample Responses:**

- The negative slope matches the RB line in the visualization: predicted salary decreases over time.
- For each additional year, predicted salary for running backs changes by -0.134 million dollars (a decrease of about \$134,000 per year).

Teachers may want to compare briefly to the positive slope for quarterback salaries: for each year, there is a change of 1.774 million dollars in predicted salary. Quite a difference in what is happening to salaries over time.
</div>
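
One way to make the slopes concrete is to compute predicted salaries year by year. The sketch below assumes the rounded QB and RB equations from 3.2b; `predict_salary` is a hypothetical helper written for this illustration, not a `coursekata` function.

```r
# Hypothetical helper: prediction from a single line (intercept + slope * year)
predict_salary <- function(year, intercept, slope) {
  intercept + slope * year
}

years <- 2011:2018

# Rounded equations from 3.2b (salary in millions of dollars)
qb_pred <- predict_salary(years, intercept = -3556.414, slope = 1.774)
rb_pred <- predict_salary(years, intercept = 276.151, slope = -0.134)

# Predicted salaries side by side
data.frame(year = years, qb = round(qb_pred, 2), rb = round(rb_pred, 2))

# Year-to-year change in predicted salary: this is the slope of each line
diff(qb_pred)
diff(rb_pred)
```

Every year-to-year difference equals the slope of that position's line, which is what "for each additional year" means in the interpretation above.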

## 4.0 - Is the interaction model better?

In the visualization below, you can view either the interaction model (in black) or the additive model (in red). In this section, we're going to try and answer: **how much better is the interaction model than the additive model at explaining variation in salary?**

In [None]:
add_model <- lm(salary ~ year + position, data = nfl2)

gf_jitter(salary ~ year, data = nfl2, color = ~position) %>%
  gf_facet_wrap(~ position) %>%
  gf_model(int_model, color = "black")
# To view the additive model instead, swap the gf_model() line above for this one:
#  gf_model(add_model, color = "red")

4.1 - Generate an ANOVA table to help us evaluate the interaction model against the additive model.

In [None]:
supernova(int_model)


<div class="alert alert-block alert-warning">

**Sample Responses:**

Some students may generate ANOVA tables for both the additive and the interaction model. One key insight is that the interaction model's table already contains a comparison to the additive model (the `year:position` row).

You may want to show the ANOVA table for the interaction model and ask, "Which line of the ANOVA table shows us the model comparison between the interaction model and the additive model?"

</div>
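
If it helps to see why that row is a model comparison, here is a sketch using only base R. It runs on a small simulated data frame (`toy`, invented for this illustration, not the real `nfl_salary` data) and checks that the SS for the interaction equals the drop in SS Error when moving from the additive to the interaction model.

```r
# Simulated toy data (hypothetical values, NOT the real salaries)
set.seed(1)
toy <- data.frame(
  year     = rep(2011:2018, times = 2),
  position = rep(c("QB", "RB"), each = 8)
)
toy$salary <- ifelse(toy$position == "QB",
                     10 + 1.5 * (toy$year - 2011),
                     7 - 0.1 * (toy$year - 2011)) + rnorm(nrow(toy), sd = 1)

toy_add <- lm(salary ~ year + position, data = toy)
toy_int <- lm(salary ~ year * position, data = toy)

# Base R model comparison: additive (simpler) vs. interaction (more complex)
comparison <- anova(toy_add, toy_int)
comparison

# The same quantities by hand
ss_add_error   <- sum(resid(toy_add)^2)   # error left by the additive model
ss_int_error   <- sum(resid(toy_int)^2)   # error left by the interaction model
ss_interaction <- ss_add_error - ss_int_error
f_interaction  <- ss_interaction / (ss_int_error / df.residual(toy_int))
```

The values of `ss_interaction` and `f_interaction` match the "Sum of Sq" and "F" columns of `comparison`, the same comparison that the `year:position` row summarizes.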

In [None]:
# if students need a hint, they can use the generate_models() function
generate_models(int_model)

4.2 - How much more error is explained by adding the interaction term ($year_i*positionRB_i$) to the additive model?

<div class="alert alert-block alert-warning">

**Sample Responses:**

Students can answer this by appealing to SS, PRE, or F:
- 1223 squared millions of dollars
- 34% (or .34) of the error
- An F of about 129: the variance explained by this term is about 129 times the error variance

</div>

4.3 - The PRE for the whole interaction model is .78 and the PRE for the additive model is .67. The difference between these two numbers is not .34. Why not?

In [None]:
# if you want to look at the PREs yourself
supernova(int_model)
supernova(add_model)

<div class="alert alert-block alert-warning">

**Sample Responses:**

- The PREs for the whole model (either interaction or additive) use the SS Total as the denominator (11,129).
- The PRE for the `year:position` row uses a completely different denominator. This denominator takes out any parts that are explained by `year` alone and `position` alone. Thus the denominator is not SS Total; instead it is the sum of SS Error (2399) and the SS reduced by that interaction term (1223). 

</div>
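
The two denominators can be compared directly with R as a calculator. The sketch below uses the rounded SS values quoted in the sample response (11,129, 2399, and 1223), so the results are approximate; the variable names are made up for this illustration.

```r
# Rounded sums of squares from the sample response above
ss_total       <- 11129  # SS Total (empty model error)
ss_error_int   <- 2399   # SS Error left by the interaction model
ss_interaction <- 1223   # SS reduced by the year:position term

# The additive model's error is whatever the interaction term later removes
# plus what remains afterwards
ss_error_add <- ss_error_int + ss_interaction

# Whole-model PREs: denominator is SS Total
pre_int <- (ss_total - ss_error_int) / ss_total
pre_add <- (ss_total - ss_error_add) / ss_total

# Difference between whole-model PREs: still relative to SS Total
pre_int - pre_add

# PRE for the year:position row: relative to the additive model's error
pre_row <- ss_interaction / ss_error_add

round(c(pre_int = pre_int, pre_add = pre_add,
        difference = pre_int - pre_add, pre_row = pre_row), 2)
```

The same 1223 is divided by 11,129 in one case and by 3622 in the other, which is why the difference in whole-model PREs (about .11) is smaller than the row's PRE (about .34).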

In [None]:
# Complete version
# how the interaction row's PRE is calculated
1223.017 / (1223.017 + 2399.392)

4.4 - Going beyond the data now, why are running back salaries changing in a way that is **so different** from quarterback salaries?

<div class="alert alert-block alert-warning">

This is a chance for students to bring their knowledge of football into the discussion.

If you have students who want to read further about fivethirtyeight's hypotheses about this, you can refer them to this article: https://fivethirtyeight.com/features/running-backs-are-finally-getting-paid-what-theyre-worth/

</div>

## 5.0 - Bonus: Why is there a negative $b_0$?

Let's examine the two lines that represent the interaction model again:

QB: $salary_i = -3556.414 + 1.774year_i$

RB: $salary_i = 276.151 + -0.134year_i$

5.1 - Why is the $b_0$ for quarterback salary negative?

<div class="alert alert-block alert-warning">

**Sample Responses:**

Because $b_0$ is the predicted salary when year is 0, and the NFL did not exist over 2000 years ago, this number is an extrapolation far outside the range of the data (2011-18) and has no sensible interpretation.

</div>

5.2 - To make the parameter estimates more interpretable, what can we do? (see pages 9.5 and 9.6).

<div class="alert alert-block alert-warning">

**Sample Responses:**

To make these numbers more interpretable, students might want to explore centering the explanatory variable `year` (see pages 9.5 and 9.6).

Then we would get $b_0$s that represent the predicted salary in the average year (around 2014). In the equations below, $year_i$ stands for the centered variable `year_0`. The slopes are the same as before; only the intercept and the `positionRB` estimate change. Code for centering is shown below.

- $salary_i = 16.49 + 1.774year_i + -10.118positionRB_i + -1.908year_i*positionRB_i$
  - QB: $salary_i = 16.49 + 1.774year_i$
  - RB: $salary_i = 6.372 + -0.134year_i$
</div>
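
To check what centering does and does not change, here is a sketch on simulated data (`toy` is invented for this illustration, not the real `nfl_salary` data). It fits the interaction model with raw and centered year and compares the estimates.

```r
# Simulated toy data (hypothetical values, NOT the real salaries)
set.seed(2)
toy <- data.frame(
  year     = rep(2011:2018, times = 2),
  position = rep(c("QB", "RB"), each = 8)
)
toy$salary <- ifelse(toy$position == "QB",
                     11 + 1.8 * (toy$year - 2011),
                     6.7 - 0.1 * (toy$year - 2011)) + rnorm(nrow(toy))

# Centered year: 0 now means "the average year in the data"
toy$year_0 <- toy$year - mean(toy$year)

raw_fit      <- lm(salary ~ year * position, data = toy)
centered_fit <- lm(salary ~ year_0 * position, data = toy)

coef(raw_fit)
coef(centered_fit)

# The centered intercept is the raw model's QB prediction at the average year
qb_at_mean_year <- predict(
  raw_fit,
  newdata = data.frame(year = mean(toy$year), position = "QB")
)
```

In this sketch the slope and interaction estimates are identical in the two fits, while the centered intercept equals `qb_at_mean_year`. Centering moves where the lines are read off, not the lines themselves.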

In [None]:
mean(nfl2$year)

# centering year (now 0 means "average year in the data set")
nfl2$year_0 <- nfl2$year - mean(nfl2$year)
int_model_0 <- lm(salary ~ year_0 * position, data = nfl2)

gf_jitter(salary ~ year_0, data = nfl2, color = ~position) %>%
  gf_facet_wrap(~ position) %>%
  gf_model(int_model_0, color = "black") 

int_model_0