# Tutorial 02: MLR with different types of input variables

#### Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:

1. Give an example of a real problem that could be answered by a multiple linear regression.
2. Interpret the coefficients and $p$-values of different types of input variables, including categorical input variables.
3. Define interactions in the context of linear regression.
4. Write a computer script to perform linear regression when input variables are continuous or discrete and when there are interactions between some variables.

In [None]:
# Run this cell before continuing.
library(tidyverse)
library(cowplot)
library(repr)
library(infer)
library(broom)
library(AER)
library(modelr)
source("tests_tutorial_02.R")

## The data

In this tutorial, we will continue using the `CASchools` real-world dataset from 420 K-6 and K-8 districts in California. The California School dataset comes with an R package called `AER`, an acronym for Applied Econometrics with R (by Christian Kleiber & Achim Zeileis, 2017).

The dataset contains data on test performance, school characteristics, and student demographic backgrounds in Californian school districts. Among many variables available, we will use the following:

- `grades`: Factor indicating the grade span of the district.

- `income`: District average income (in USD 1,000).

- `english`: Percent of English learners.

- `read`: Average reading score.


In [None]:
# Run this cell
data(CASchools)

caschools <- 
    CASchools %>%
    select(grades, income, english, read) %>%
    mutate_if(is.numeric, round, 2)

head(caschools)

<font style='color: darkred'>**Note:** Make sure that categorical variables in your model are factors. If a categorical variable is stored as numeric codes instead, `lm` will treat it as a continuous variable and won't create dummy variables. </font> 
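To see why this matters, here is a minimal sketch using a small made-up data frame (`toy`, `group_num`, and `group_fct` are hypothetical names for illustration, not part of `CASchools`). The same three-level grouping is stored once as numeric codes and once as a factor, and `lm` handles the two versions differently.

```r
# Toy data (hypothetical, not CASchools): a 3-level group stored two ways.
toy <- data.frame(
  y         = c(1.0, 1.2, 2.1, 2.3, 3.9, 4.1),
  group_num = c(1, 1, 2, 2, 3, 3)
)
toy$group_fct <- factor(toy$group_num, labels = c("low", "mid", "high"))

# Numeric codes: lm() fits a single slope, as for a continuous variable.
fit_num <- lm(y ~ group_num, data = toy)

# Factor: lm() creates dummy variables relative to a baseline level.
fit_fct <- lm(y ~ group_fct, data = toy)

names(coef(fit_num))  # "(Intercept)" "group_num"
names(coef(fit_fct))  # "(Intercept)" "group_fctmid" "group_fcthigh"
```

You can check how a column is stored with `is.factor()` or `str()` before fitting a model.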

## 1. MLR: additive

As discussed in the lecture, R will create dummy variables to include categorical variables in the model. In this example, `grades` is a categorical variable with 2 levels. 

**Question 1.0**
<br>{points: 1}

The input variable `grades` is a discrete nominal variable with 2 levels, KK-06 and KK-08. Since this variable is a factor in the dataset, `lm` selects one level as a baseline to create a dummy variable. Which level of `grades` is selected, by default, as a baseline?

**A.** `KK-06`

**B.** `KK-08`

*Assign your answer to an object called `answer1.0`. Your answer should be one of `"A"` or `"B"` surrounded by quotes.*

In [None]:
# answer1.0 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.0()

**Question 1.1**
<br>{points: 1}

How many dummy variables will `lm` create to fit a linear regression with the categorical variable `grades`?

**A.** 1

**B.** 2

**C.** 3

**D.** 4

*Assign your answer to an object called `answer1.1`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer1.1 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.1()

**Question 1.2**
<br>{points: 1}

In the previous tutorial, you used a simple linear regression (SLR) to study the relation between `read` and `income` on average for all types of schools. Suppose you want to examine whether this relation differs depending on the school's grade span (i.e., KK-06 vs KK-08). 

If you expect the same change in reading score per unit change in the average income for all types of school (i.e., for all levels of `grades`), which MLR will you fit in `R` using the `lm` function?

**A.** `lm(read ~ income + grades, data = caschools)`

**B.** `lm(read ~ income * grades, data = caschools)`

*Assign your answer to an object called `answer1.2`. Your answer should be one of `"A"` or `"B"` surrounded by quotes.*

In [None]:
# answer1.2 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.2()

**Question 1.3**
<br>{points: 1}

Which of the following best describes a visualization of the MLR considered in *Question 1.2*?

**A.** one line through a cloud of data points

**B.** two lines with equal slopes but different intercepts

**C.** two lines with different slopes and different intercepts

**D.** a smooth concave curve through a cloud of data points

**E.** two boxplots for different levels of `grades`

*Assign your answer to an object called `answer1.3`. Your answer should be one of `"A"`, `"B"`, `"C"`, `"D"`, or `"E"` surrounded by quotes.*

In [None]:
# answer1.3 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.3()

**Question 1.4**
<br>{points: 1}

Using `caschools`, fit the MLR proposed in *Question 1.2*. Store the result in an object called `caschools_MLR_add`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# caschools_MLR_add <- ...(..., ...)

# your code here
fail() # No Answer - remove if you provide an answer

caschools_MLR_add

In [None]:
test_1.4.0()
test_1.4.1()
test_1.4.2()

**Question 1.5**
<br>{points: 1}

Create a scatterplot of the data in `caschools` along with the estimated regression lines from the additive regression model `caschools_MLR_add`. 

- Use different colours for the points and regression lines of each type of school (levels of `grades`). 
- Include a legend indicating the colour of each level, and use proper axis labels. 

The `ggplot()` object's name will be `caschools_MLR_add_plot`.

*Hint: Start by computing the predictions of the fitted model (the function `add_predictions` can be handy). Store the predictions in a new column named `pred_MLR_additive`.*
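
If you have not used `add_predictions` before, the following is a minimal sketch of how it behaves on the built-in `mtcars` data. The objects `toy_fit` and `toy_pred` and the column name `pred_toy` are made up for illustration and are not the model or names required in this question.

```r
library(modelr)

# Hypothetical toy fit on the built-in mtcars data (not the tutorial's model).
toy_fit <- lm(mpg ~ wt, data = mtcars)

# add_predictions() returns the data with one extra column of fitted values;
# the `var` argument sets the name of that column.
toy_pred <- add_predictions(mtcars, toy_fit, var = "pred_toy")

head(toy_pred[, c("mpg", "wt", "pred_toy")])
```

The new column can then be mapped to `y` in a line layer, so the plotted line follows the fitted model rather than the raw data.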



*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# Adjust these numbers so the plot looks good on your screen.
options(repr.plot.width = 15, repr.plot.height = 7)

# caschools <-
#     ... %>% 
#     ...(..., var = 'pred_MLR_additive')

# caschools_MLR_add_plot <- 
#     ... %>%
#     ggplot(aes(x = ...,y = ..., color = ...)) +
#     ...() +
#     ...(aes(y = ...), linewidth = 1) +
#     labs(title = ...,
#          x = ...,
#          y = ...) +
#     theme(text = element_text(size = 16.5),
#           plot.title = element_text(face = "bold"),
#           axis.title = element_text(face = "bold"),
#           legend.title = element_text(face = "bold")) +
#     labs(color = "grades")

# your code here
fail() # No Answer - remove if you provide an answer

caschools_MLR_add_plot

In [None]:
test_1.5()

**Question 1.6**
<br>{points: 1}

Find the estimated coefficients of `caschools_MLR_add` using `tidy()`. Report the estimated coefficients, their standard errors and corresponding $p$-values. Include the corresponding asymptotic 90% confidence intervals. 

Store the results in the variable `caschools_MLR_add_results`.
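
As a reminder of how `tidy()` reports confidence intervals, here is a minimal sketch on the built-in `mtcars` data. The objects `toy_fit` and `toy_results` are hypothetical, and the 95% level is chosen only for illustration; adjust `conf.level` to the level the question asks for.

```r
library(broom)

# Hypothetical toy fit on the built-in mtcars data (not the tutorial's model).
toy_fit <- lm(mpg ~ wt, data = mtcars)

# conf.int = TRUE adds the columns conf.low and conf.high;
# conf.level sets the confidence level of those intervals.
toy_results <- tidy(toy_fit, conf.int = TRUE, conf.level = 0.95)

toy_results
```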

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# caschools_MLR_add_results <-
#    ...(..., ..., ...) %>% 
#    mutate_if(is.numeric, round, 2)

# your code here
fail() # No Answer - remove if you provide an answer

caschools_MLR_add_results

In [None]:
test_1.6()

**Question 1.7**
<br>{points: 1}

Considering the results in `caschools_MLR_add_results` from *Question 1.6*, how would you interpret the estimated coefficient of the continuous variable `income`?

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.8**
<br>{points: 1}

Using a significance level $\alpha = 0.10$, which of the following claims are correct?


**A.** For any average income, the expected reading score is the same for schools with KK-06 and schools with KK-08 grades. 

**B.** There is enough evidence to reject the hypothesis that, for any average income, the expected reading score is the same for schools with KK-06 and schools with KK-08 grades.

**C.** For any school type, the reading test score (`read`) is significantly associated with the average district income (`income`). 

**D.** There is enough evidence to believe that the association between `income` and `read` varies depending on the grade span of the school.


*Assign your answers to the object `answer1.8`. Your answers have to be included in a single string indicating the correct options in alphabetical order and surrounded by quotes (e.g., `"ABCD"` indicates you are selecting the four options).*

In [None]:
# answer1.8 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.8()

**Question 1.9**
<br>{points: 1}

In one or two sentences, explain what "statistically significant" means in the following sentence and how it differs from "practical significance".

> "the reading test score (read) is significantly associated with the average district income (income)"


> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

## 2. MLR: with interactions

In this section, we will explore whether the relation between `read` and `income` is the same for all school types. We can do this using *interactions* between the input variables!

Note that interactions can be used when the relation between an input and the response depends on another input variable (not necessarily categorical!).
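
To illustrate the continuous case, consider a generic sketch with two continuous inputs. The symbols $X_1$, $X_2$, and $\beta_0, \dots, \beta_3$ are introduced here only for illustration and do not refer to variables in `caschools`.

$$E[Y \mid X_1, X_2] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 = \beta_0 + \beta_2 X_2 + (\beta_1 + \beta_3 X_2)\, X_1$$

Grouping the terms this way shows that the change in the expected response per unit change in $X_1$ is $\beta_1 + \beta_3 X_2$, so it depends on the value of $X_2$ whenever $\beta_3 \neq 0$.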

**Question 2.0**
<br>{points: 1}

We can use `lm` to fit the MLR with interactions between the continuous variable `income` and the categorical variable `grades` (with 2 levels) defined above.

How many regression coefficients will be estimated by `lm`?

**A.** 1

**B.** 2

**C.** 3

**D.** 4


*Assign your answer to an object called `answer2.0`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer2.0 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.0()

**Question 2.1**
<br>{points: 1}

Using `caschools`, fit the MLR with the interaction described above and call it `caschools_MLR_int`.

*Hint: Interaction terms can be easily specified in `lm()` using the notation `*`.*
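
As a quick check of what `*` does in a formula, the sketch below fits two models on the built-in `mtcars` data with two continuous inputs. The objects `fit_star` and `fit_colon` are hypothetical names, and this is not the model asked for in this question. In R's formula syntax, `a * b` is shorthand for `a + b + a:b`, where `a:b` is the interaction term alone.

```r
# Hypothetical toy models on the built-in mtcars data.
fit_star  <- lm(mpg ~ wt * hp, data = mtcars)        # main effects + interaction
fit_colon <- lm(mpg ~ wt + hp + wt:hp, data = mtcars) # same model, written out

# Both formulas produce the same terms and the same estimates.
names(coef(fit_star))
all.equal(coef(fit_star), coef(fit_colon))
```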


*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# caschools_MLR_int <- ...(..., ...)

# your code here
fail() # No Answer - remove if you provide an answer

caschools_MLR_int

In [None]:
test_2.1.0()
test_2.1.1()
test_2.1.2()
test_2.1.3()

**Question 2.2**
<br>{points: 1}

Create a scatterplot of the data in `caschools` along with the estimated regression lines from the regression model with interaction `caschools_MLR_int`. 

- Use different colours for the points and regression lines of each type of school (levels of `grades`). 
- Include a legend indicating the colour of each level, and use proper axis labels. 

The `ggplot()` object's name will be `caschools_MLR_int_plot`.

*Hint: Start by computing the predictions of the fitted model (the function `add_predictions` can be handy). Store the predictions in a new column named `pred_MLR_interaction`.*

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# Adjust these numbers so the plot looks good on your screen.
options(repr.plot.width = 8, repr.plot.height = 5)

# caschools <-
#     ... %>% 
#     ...(..., var = 'pred_MLR_interaction')

# caschools_MLR_int_plot <- 
#     ... %>%
#     ggplot(aes(x = ...,y = ..., color = ...)) +
#     ...() +
#     ...(aes(y = ...), linewidth = 1) +
#     labs(title = ...,
#          x = ...,
#          y = ...) +
#     theme(text = element_text(size = 16.5),
#           plot.title = element_text(face = "bold"),
#           axis.title = element_text(face = "bold"),
#           legend.title = element_text(face = "bold")) +
#     labs(color = "grades")

# your code here
fail() # No Answer - remove if you provide an answer

caschools_MLR_int_plot

In [None]:
test_2.2()

**Question 2.3**
<br>{points: 1}

Find the estimated coefficients of `caschools_MLR_int` using `tidy()`. Report the estimated coefficients, their standard errors and corresponding $p$-values. Include the corresponding asymptotic 90% confidence intervals. Store the results in the variable `caschools_MLR_int_results`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# caschools_MLR_int_results <- 
#    ...(..., ..., ...) %>%
#    mutate_if(is.numeric, round, 2)

# your code here
fail() # No Answer - remove if you provide an answer

caschools_MLR_int_results

In [None]:
test_2.3()

**Question 2.4**
<br>{points: 1}

Using a significance level $\alpha = 0.10$, which of the following claims are correct?


**A.** There is enough evidence to reject the hypothesis that, for any average income, the expected reading score is the same for schools with KK-06 and schools with KK-08 grades.

**B.** For any school type, the change in `read` per unit change in `income` is statistically significant.

**C.** There is enough evidence to reject the hypothesis that the change in `read` per unit change in `income` is the same for schools with KK-06 and schools with KK-08 grades.

**D.** There is not enough evidence to reject the hypothesis that the association between `income` and `read` is the same for schools with KK-06 and schools with KK-08 grades.

**E.** For schools with KK-06 grade span, the reading test score (`read`) is significantly associated with the average district income (`income`).

*Assign your answers to the object `answer2.4`. Your answers have to be included in a single string indicating the correct options **in alphabetical order** and surrounded by quotes (e.g., `"ABCDE"` indicates you are selecting the five options).*

In [None]:
# answer2.4 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.4()

**Question 2.5**
<br>{points: 1}

A common practice is not to interpret coefficients that are not statistically significant, since the estimate is not significantly different from 0 and may reflect just noise in the data. Alternatively, you can provide an interpretation but with the caveat that the result is not statistically significant.  

Following the second approach, what would be a correct interpretation of the estimated coefficient of the interaction term `income:gradesKK-08` from `caschools_MLR_int_results` in *Question 2.3*? (remember to comment on the significance of the result).

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 2.6**
<br>{points: 1}

Fit the following 3 models as indicated:

**A.** An SLR with `read` as the response and `income` as the *only* input variable, using only KK-06 schools in `caschools`. Use `tidy` to get the estimated parameters and standard errors. Call the results `caschools_SLR_kk06_results`.

**B.** An SLR with `read` as the response and `income` as the *only* input variable, using only KK-08 schools in `caschools`. Use `tidy` to get the estimated parameters and standard errors. Call the results `caschools_SLR_kk08_results`.

**C.** An MLR with `read` as the response and `income` and `grades` as input variables, *including their interaction*, using `caschools`. Note that you already have the estimated parameters and standard errors in `caschools_MLR_int_results`. 

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# caschools_SLR_kk06_results <- 
#    tidy(lm(... ~ ..., data = subset(...,... == ...))) %>%
#    mutate_if(is.numeric, round, 2)


# caschools_SLR_kk08_results <- 
#    tidy(lm(... ~ ..., data = subset(...,... == ...))) %>%
#    mutate_if(is.numeric, round, 2)

# your code here
fail() # No Answer - remove if you provide an answer

caschools_SLR_kk06_results
caschools_SLR_kk08_results
caschools_MLR_int_results

In [None]:
test_2.6.0()
test_2.6.1()
test_2.6.2()

**Question 2.7**
<br>{points: 1}

**2.7.0** Using the results from `caschools_SLR_kk06_results` and `caschools_MLR_int_results` in *Question 2.6*, explain why the estimated coefficients of `income` are the same in both models.

**2.7.1** Using the results from `caschools_SLR_kk08_results` and `caschools_MLR_int_results` in *Question 2.6*, explain why the estimated coefficients of `income` are *not* the same in both models. 

**2.7.2** Using the results from *Question 2.6*, explain why the estimated coefficient of `income` in `caschools_SLR_kk08_results` is *not* the same as that of `income:gradesKK-08` in `caschools_MLR_int_results`.

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.