# Worksheet 02: Different Types of Input Variables and Interactions

## Learning Objectives:

After completing this week's lecture and tutorial work, you will be able to:

1. Give an example of a real problem that could be answered by a multiple linear regression.

2. Interpret the coefficients and p-values of different types of input variables, including categorical input variables.

3. Define interactions in the context of linear regression.

4. Write a computer script to perform linear regression when input variables are continuous or discrete, and when there are interactions between some of these variables.

## Multiple Linear Regression (MLR)

This week, we will "unleash" the scope of linear regression models and use them to study the association between a continuous response and *many* input variables of *different types*! 

A linear regression model with many input variables is usually called a *Multiple Linear Regression (MLR)*. We will present models with the following:

- Continuous and categorical input variables

- Additive models and models with interaction terms

In [None]:
# Run this cell before continuing.
library(tidyverse)
library(repr)
library(infer)
library(cowplot)
library(broom)
library(GGally)
library(modelr)
source("tests_worksheet_02.R")

### 1. MLR additive models with continuous input variables

In this section, we will continue working with the dataset `US_cancer_data` introduced in `worksheet_01`. You noticed a positive association between mortality rates (measured by `TARGET_deathRate`) and the percentage of the county's populace in poverty (measured by `povertyPercent`).

However, there may be other variables associated with mortality. 

Let's start by visualizing the relationship between the variables in our data again using the plotting function `ggpairs()` from the library `GGally`. The `ggplot()` object's name will be `US_cancer_data_pairplots`.

In [None]:
US_cancer_data <- 
    read_csv("data/US_county_cancer_data.csv") %>%
    select(TARGET_deathRate, povertyPercent, PctPrivateCoverage)

In [None]:
US_cancer_data_pairplots <- 
    US_cancer_data %>%
    ggpairs(progress = FALSE) +
    theme(
        text = element_text(size = 15),
        plot.title = element_text(face = "bold"),
        axis.title = element_text(face = "bold")
    )

US_cancer_data_pairplots

**Question:** Should we study the relation of the response with input variables separately or jointly using linear regressions?

#### Additive models versus models with interaction: R code

In R, you can add input variables on the right-hand side of the formula with signs `+` or `*`, depending on the assumptions made. For example, in an **additive model**, variables are added to the model using `+`. If you want to model interactions between variables, you use `*`.

##### Additive Model:

The additive model is given by: 

$$
y_i = \beta_0 + \beta_1\times \text{povertyPercent}_i + \beta_2\times \text{PctPrivateCoverage}_i + \epsilon_i
$$

where $y_i$ is the $i$th entry of `TARGET_deathRate`.

You can fit such a model using the `+` sign on the right-hand side:

```r
MLR_poverty_coverage_add <- lm(TARGET_deathRate ~ povertyPercent + PctPrivateCoverage, data = US_cancer_data)
```

##### Model with interaction:

On the other hand, the model with interaction also includes the product of `povertyPercent` and `PctPrivateCoverage`. The model equation is given by:

$$
y_i = \beta_0 + \beta_1\times \text{povertyPercent}_i + \beta_2\times \text{PctPrivateCoverage}_i + \beta_3\times \text{povertyPercent}_i\times \text{PctPrivateCoverage}_i +\epsilon_i
$$

You can fit such a model using the `*` sign on the right-hand side:

```r
MLR_poverty_coverage_int <- lm(TARGET_deathRate ~ povertyPercent * PctPrivateCoverage, data = US_cancer_data)
```
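
As a quick check of what `*` does in a formula, the sketch below fits two models on simulated data (the data frame `toy` and its columns `x1`, `x2`, and `y` are made up for illustration) and compares `x1 * x2` with the longer form `x1 + x2 + x1:x2`, where `:` denotes the product term alone.

```r
# Simulated data, for illustration only (not the cancer dataset)
set.seed(1)
toy <- data.frame(x1 = runif(50), x2 = runif(50))
toy$y <- 1 + 2 * toy$x1 - 3 * toy$x2 + 4 * toy$x1 * toy$x2 + rnorm(50, sd = 0.1)

# `x1 * x2` is shorthand for both variables plus their product
fit_star <- lm(y ~ x1 * x2, data = toy)
fit_long <- lm(y ~ x1 + x2 + x1:x2, data = toy)

names(coef(fit_star))                      # "(Intercept)" "x1" "x2" "x1:x2"
all.equal(coef(fit_star), coef(fit_long))  # TRUE
```

In other words, using `*` never drops the individual variables from the model; it adds the product term on top of them.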

**Question 1.0**
<br>{points: 2}

Using the dataset `US_cancer_data`, fit the following three models:

**A.** An SLR with `TARGET_deathRate` as the response and `povertyPercent` as the single input variable. Call this model `SLR_poverty`.

**B.** An SLR with `TARGET_deathRate` as the response and `PctPrivateCoverage` as the single input variable. Call this model `SLR_coverage`.

**C.** An additive MLR with `TARGET_deathRate` as the response and `povertyPercent` and `PctPrivateCoverage` as input variables. Call this model `MLR_poverty_coverage`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# Model A 
# SLR_poverty <- ...(..., ...)
# SLR_poverty

# Model B
# SLR_coverage <- ...(..., ...)
# SLR_coverage

# Model C
# MLR_poverty_coverage <- ...(..., ...)
# MLR_poverty_coverage

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.0.0()
test_1.0.1()
test_1.0.2()

**Question 1.1**
<br>{points: 1}

Use `tidy` and manipulate the output to obtain the estimated coefficients of the variable `PctPrivateCoverage` in models **B** and **C**, respectively. Assign your answers to the objects `SLR_coverage_coef` and `MLR_coverage_coef`.

*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [None]:
# SLR_coverage_coef <- 
#     tidy(...) %>% 
#     mutate_if(is.numeric, round, 2) %>% 
#     filter(term == "...") %>%
#     pull(...)

# MLR_coverage_coef <- 
#     tidy(...) %>% 
#     mutate_if(is.numeric, round, 2) %>% 
#     filter(term == "...") %>%
#     pull(...)

# SLR_coverage_coef
# MLR_coverage_coef



# your code here
fail() # No Answer - remove if you provide an answer

SLR_coverage_coef
MLR_coverage_coef

In [None]:
test_1.1.0()
test_1.1.1()

The results from Question 1.1 show that the estimated relation between `TARGET_deathRate` and `PctPrivateCoverage` is ***not*** the same in both models. But why?

- In the MLR, the coefficient of `PctPrivateCoverage` models the relation between `TARGET_deathRate` and `PctPrivateCoverage` *for any fixed percentage of the populace in poverty*.

- In the SLR, the variable `povertyPercent` is not even in the model. Thus, the variation in mortality due to this variable is part of the error term, together with many other variables that may be related to mortality.
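
The two points above can be illustrated with a small simulation. This is only a sketch on made-up data: the variables `x1`, `x2`, and `y` and all numeric values are assumptions chosen for illustration, with `x1` and `x2` deliberately correlated.

```r
# Simulated data, for illustration only (not the cancer dataset)
set.seed(123)
n  <- 1000
x1 <- rnorm(n)
x2 <- -0.8 * x1 + rnorm(n, sd = 0.6)   # x2 is correlated with x1
y  <- 2 * x1 + 3 * x2 + rnorm(n)       # simulated slope of x2 is 3
toy <- data.frame(y, x1, x2)

# SLR: x1 is left out, so its contribution ends up in the error term
slr_slope <- unname(coef(lm(y ~ x2, data = toy))["x2"])

# MLR: slope of x2 for any fixed value of x1
mlr_slope <- unname(coef(lm(y ~ x1 + x2, data = toy))["x2"])

c(SLR = slr_slope, MLR = mlr_slope)
```

In this simulation the MLR slope lands close to the simulated value of 3, while the SLR slope is pulled away from it because `x2` also carries information about the omitted `x1`.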

Keeping these observations in mind, distinguish between the following interpretations:

**Question 1.2**
<br>{points: 1}

Use the estimates from Question 1.1 to complete the following interpretations:

1. *On average*, a 1 unit increase in the percentage of county residents with private health coverage is associated with an expected ...`answer1.2.0`...(cases/100,000) change in cancer mortality per capita.


2. *For any fixed percentage of the populace in poverty*, a 1 unit increase in the percentage of county residents with private health coverage is associated with an expected ...`answer1.2.1`...(cases/100,000) change in cancer mortality per capita.


*Assign your answers to the objects `answer1.2.0` and `answer1.2.1` (numeric type).* 

**Tip: Your answers should be the names of the objects used to store the estimates in the previous question**

In [None]:
# answer1.2.0 <- ...
# answer1.2.0

# answer1.2.1 <- ...
# answer1.2.1

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.2.0()
test_1.2.1()

**Question 1.3**
<br>{points: 1}

As we have learned for SLRs, `lm` uses some theoretical results to measure the sample-to-sample variation of the estimators. The same is true for the estimators of MLR coefficients. 

Use `tidy` to obtain the standard errors, test statistics, and p-values for the coefficients of the `MLR_poverty_coverage` model, which are needed to perform statistical hypothesis tests.

Store them in the variable `MLR_poverty_coverage_results`.

*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [None]:
# MLR_poverty_coverage_results <- 
#    ...(...) %>%
#    mutate_if(is.numeric, round, 2)

# your code here
fail() # No Answer - remove if you provide an answer

MLR_poverty_coverage_results

In [None]:
test_1.3()
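
To make the link between these columns concrete, here is a minimal base-R sketch on simulated data (the data frame `toy` and its columns are made up for illustration). It recomputes the test statistic as estimate divided by standard error, and the two-sided p-value from a $t$ distribution with the model's residual degrees of freedom, then compares both with what `summary()` reports.

```r
# Simulated data, for illustration only (not the cancer dataset)
set.seed(4)
toy <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
toy$y <- 1 + 0.5 * toy$x1 - 1.5 * toy$x2 + rnorm(60)

fit <- lm(y ~ x1 + x2, data = toy)
tab <- summary(fit)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)

# Test statistic: estimate / standard error
t_manual <- tab[, "Estimate"] / tab[, "Std. Error"]

# Two-sided p-value from a t distribution with n - (number of coefficients) df
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

all.equal(t_manual, tab[, "t value"])    # TRUE
all.equal(p_manual, tab[, "Pr(>|t|)"])   # TRUE
```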

**Question 1.4**
<br>{points: 1}

Based on the output stored in `MLR_poverty_coverage_results` with a significance level $\alpha = 0.05$, in plain words, what is the conclusion of the hypothesis testing for the association between mortality and the percentage of county residents with private health coverage?

**A.** We reject the null hypothesis; thus, holding the percentage of the county's populace in poverty constant (at any value), the percentage of county residents with private health coverage has a statistically significant effect on the county's cancer mortality per capita (cases/100,000).

**B.** We fail to reject the null hypothesis; thus, holding the percentage of the county's populace in poverty constant (at any value), the percentage of county residents with private health coverage does not have a significant effect on the county's cancer mortality per capita (cases/100,000).

**C.** We fail to reject the null hypothesis; thus, holding the percentage of the county's populace in poverty constant (at any value), the percentage of county residents with private health coverage is not significantly associated with the county's cancer mortality per capita (cases/100,000).

**D.** We reject the null hypothesis; thus, holding the percentage of the county's populace in poverty constant (at any value), the percentage of county residents with private health coverage is significantly associated with the county's cancer mortality per capita (cases/100,000).


*Assign your answer to an object called `answer1.4`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer1.4 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.4()


**Important:** <font color="blue"> MLR *simultaneously* models the association of multiple predictors with the response.</font>

In additive models:

- we assume that the association between each input and the response does *not* vary across values of other variables
    
    
- each estimate is interpreted as the association of the input variable with the response, keeping all other input variables **in the model** constant (at any value!).
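
A short derivation shows why "at any value" holds in an additive model. Here $x_1$ and $x_2$ are shorthand (introduced only for this sketch) for the two inputs of the additive model from Section 1:

$$
E[y \mid x_1 + 1, x_2] - E[y \mid x_1, x_2] = \left[\beta_0 + \beta_1 (x_1 + 1) + \beta_2 x_2\right] - \left[\beta_0 + \beta_1 x_1 + \beta_2 x_2\right] = \beta_1
$$

The terms involving $x_2$ cancel, so the expected change for a 1-unit increase in $x_1$ is $\beta_1$ regardless of the value at which $x_2$ is held.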

## 2. Categorical input variables with 2 or more levels
In STAT 201, you used statistical tools to analyze whether population quantities vary among groups. For instance, you investigated whether the average donation amount is influenced by different versions of a website (with two or more variations). In these scenarios, you are exploring the relationship between two variables: in this example, donations (in $) and website version (with levels A and B).

**Question:** Can we use linear regression to compare population quantities from different groups?

The answer is: yes! We can formulate these problems as linear models as well. 

Let's use the `US_cancer_data` to explore whether cancer mortality differs by state.

> Note: the original `US_county_cancer_data` does not contain a specific column for the state where each county belongs, so we'll get that information from the variable `Geography` using code to split strings.

Let's start by comparing the mortality rates in 2 states: *Alabama* vs *California*. 

In [None]:
US_cancer_data <- 
    read_csv("data/US_county_cancer_data.csv") %>%
    select(TARGET_deathRate, povertyPercent, PctPrivateCoverage, Geography) %>%
    mutate(state = str_extract(Geography, "[^,]+$")) %>% # Extract the text after the last ","
    mutate(state = str_trim(state, side = 'both')) %>%  # Remove spaces from the beginning and end
    mutate(state = as_factor(state)) # Convert the strings to a factor
                            
head(US_cancer_data, 3)
str(US_cancer_data)
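
For readers who want to see the string-splitting step in isolation, here is a base-R sketch of the same idea. The two `geography` strings are made-up examples in a "County, State" format, and `sub()` with `trimws()` is a base-R substitute for the `stringr` calls above, not the code used in this worksheet.

```r
# Made-up example strings in "County, State" format
geography <- c("Example County, Alabama", "Another County, California")

# Drop everything up to the last comma, then trim surrounding spaces
state_chr <- trimws(sub(".*,", "", geography))
state_chr   # "Alabama" "California"

# Keep levels in order of first appearance
state <- factor(state_chr, levels = unique(state_chr))
levels(state)   # "Alabama" "California"
```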

In [None]:
AC_cancer_data <- 
    US_cancer_data %>%
    filter(state %in% c("California", "Alabama")) %>% 
    droplevels()

str(AC_cancer_data)

**Question 2.0**
<br>{points: 1}

Let's start by graphically comparing both states' distributions and the spread of mortality rates. We will use side-by-side boxplots. The `ggplot()` object's name will be `TARGET_deathRate_boxplots`. 

> Note that function `stat_summary()` will be needed to add the average mortality rates in each state as points on top of each boxplot.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# Adjust these numbers so the plot looks good on your screen.
options(repr.plot.width = 15, repr.plot.height = 7) 

# TARGET_deathRate_boxplots <- 
#   ... %>%
#   ggplot() +
#   ...(aes(..., ..., fill = ...)) +
#   ggtitle(...) +
#   xlab(...) +
#   ylab(...) +
#   stat_summary(aes(..., ...),
#     fun = ..., geom = "point", colour = "yellow", 
#     shape = 18, size = 5
#   ) +
#   theme(
#     text = element_text(size = 18),
#     plot.title = element_text(face = "bold"),
#     axis.title = element_text(face = "bold")
#   )

# TARGET_deathRate_boxplots

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.0()

The side-by-side boxplots in `TARGET_deathRate_boxplots` show differences in mortality rates between both states. Let's use this data and a linear regression to estimate the relationship between these variables and test this hypothesis.

**Question:** But how can we include a categorical variable in a mathematical equation?

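We can think of the problem as having two simple models in one. Define a dummy variable $X_i$ that equals 1 if county $i$ is in California and 0 if it is in Alabama, and let $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$: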

- For counties in Alabama, $X_i=0$, then $Y_i = \beta_0 + \beta_1 \times 0 + \varepsilon_i = \beta_0 + \varepsilon_i$


- For counties in California, $X_i=1$, then $Y_i = \beta_0 + \beta_1 \times 1 + \varepsilon_i = \beta_0 + \beta_1 + \varepsilon_i$


Then,


- For counties in Alabama, $X_i=0$, $E[Y_i|X_i=0] = \beta_0$


- For counties in California, $X_i=1$, $E[Y_i|X_i=1] = \beta_0 + \beta_1$




### R Code

R creates special numerical variables for you, called dummy variables, to include in the model if you indicate that an input variable (in our case, `state`) is a *factor*. For example, 

`AC_data_LR <- lm(TARGET_deathRate ~ state, data = AC_cancer_data)`

Since `state` is a factor, R creates a dummy variable to estimate the model:

- `R` calls the dummy variable `stateCalifornia` (name of the variable followed by the level corresponding to 1)


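- The reference level (the "left out" level, for which the dummy variable equals 0) is "Alabama", the first level of the factor `state`. Base R's `factor()` orders levels alphabetically by default, whereas `as_factor()` keeps the order in which the levels first appear in the data.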

> **Important:** Note that there's not a *line* in this case. But we can still call it a "linear regression" because the model is a linear combination of variables! 
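
To see the dummy variable that R builds behind the scenes, the sketch below uses a tiny made-up data frame (`toy`, with a two-level factor `group` and a response `y`; all values are invented for illustration) and inspects the design matrix with `model.matrix()`.

```r
# Tiny made-up dataset: two groups, "A" is the reference level
toy <- data.frame(
  group = factor(c("A", "A", "B", "B", "B")),
  y     = c(10, 12, 20, 22, 24)
)

# The design matrix has an intercept column and one 0/1 dummy column
X <- model.matrix(y ~ group, data = toy)
colnames(X)    # "(Intercept)" "groupB"
X[, "groupB"]  # 0 0 1 1 1

# Intercept = mean of group A; groupB = mean of B minus mean of A
fit <- lm(y ~ group, data = toy)
coef(fit)      # (Intercept) = 11, groupB = 11
```

Here the mean of group A is 11 and the mean of group B is 22, so the coefficient of `groupB` is their difference, 11.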

**Question 2.1**
<br>{points: 1}

Using `AC_cancer_data`, use an LR to estimate the relation between `TARGET_deathRate` and `state` and assign it to the object `AC_data_LR`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# AC_data_LR <- ...(..., data = ... )

# your code here
fail() # No Answer - remove if you provide an answer

AC_data_LR

In [None]:
test_2.1.0()
test_2.1.1()

**Question 2.2**
<br>{points: 1}

Find the estimated coefficients of `AC_data_LR` using `tidy()`. Report the estimated coefficients, their standard errors, and corresponding $p$-values. Include the corresponding asymptotic 95% confidence intervals. 

Store the results in the variable `AC_data_LR_results`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# AC_data_LR_results <- 
#    ...(..., ...) %>% 
#    mutate_if(is.numeric, round, 2)

# your code here
fail() # No Answer - remove if you provide an answer

AC_data_LR_results

In [None]:
test_2.2()

**Question 2.3**
<br>{points: 1}

We want to investigate whether the states' expected mortality rates are significantly different. Hence, it will be necessary to perform hypothesis testing on the regression coefficient corresponding to `state`. In plain words, the hypotheses ($H_0$ versus $H_1$) are the following:

$$H_0: \text{there is no difference in the expected mortality rates (means) between both states}$$
$$H_1: \text{there is a difference in the expected mortality rates (means) between both states}$$

What is the mathematical translation of these hypotheses?

**A.** $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 > 0$

**B.** $H_0: \beta_0 = 0$ vs. $H_1: \beta_0 \neq 0$

**C.** $H_0: \beta_0 = 0$ vs. $H_1: \beta_0 > 0$

**D.** $H_0: \hat{\beta}_1 = 0$ vs. $H_1: \hat{\beta}_1 \neq 0$

**E.** $H_0: \hat{\beta}_0 = 0$ vs. $H_1: \hat{\beta}_0 > 0$

**F.** $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$

*Assign your answer to an object called `answer2.3`. Your answer should be one of `"A"`, `"B"`, `"C"`, `"D"`, `"E"`, or `"F"` surrounded by quotes.*

In [None]:
# answer2.3 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.3()

**Question 2.4**
<br>{points: 1}

Using the corresponding $p$-value stored in `AC_data_LR_results` and a significance level $\alpha = 0.05$, what is the conclusion of the hypothesis test in Question 2.3 in plain words?

**A.** We fail to reject the null hypothesis; thus, there is no statistically significant difference in the mortality rates between states.

**B.** We reject the alternative hypothesis; thus, there is no statistically significant difference between states' average mortality rates.

**C.** We reject the null hypothesis; thus, there is a statistically significant difference in the average mortality rates between states.

*Assign your answer to an object called `answer2.4`. Your answer should be one of `"A"`, `"B"`, or `"C"`, surrounded by quotes.*

In [None]:
# answer2.4 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.4()

**Question 2.5**
<br>{points: 1}

Let's take a look at the results again:

In [None]:
AC_data_LR_results

What is the correct interpretation of the estimated regression coefficient for the dummy variable `stateCalifornia`?

**A.** The average mortality rate in California is 34.63 (cases/100,000) below that of Alabama.

**B.** The average mortality rate in Alabama is 34.63 (cases/100,000) below that of California.

**C.** The average mortality rate in California is 34.63 (cases/100,000).


*Assign your answers to the object `answer2.5` (it should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes).*

In [None]:
# answer2.5 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.5()

### But wait: isn't this a 2-sample t-test?

Run the following cell and discuss results with your neighbour.

In [None]:
t.test(TARGET_deathRate ~ state, data = AC_cancer_data, var.equal = TRUE)

This is not a coincidence! `lm` is running the *same* *t*-test!!
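
The equivalence can also be checked numerically. The sketch below does so on simulated data (the data frame `toy` and its values are made up for illustration): the equal-variance two-sample $t$-test and the `lm` fit give the same statistic up to sign, and the same p-value.

```r
# Simulated two-group data, for illustration only
set.seed(2)
toy <- data.frame(
  group = factor(rep(c("A", "B"), each = 30)),
  y     = c(rnorm(30, mean = 10), rnorm(30, mean = 11))
)

# Two-sample t-test assuming equal variances
tt <- t.test(y ~ group, data = toy, var.equal = TRUE)

# Same comparison as a regression on the dummy variable
tab  <- summary(lm(y ~ group, data = toy))$coefficients
t_lm <- tab["groupB", "t value"]
p_lm <- tab["groupB", "Pr(>|t|)"]

# t.test() reports mean(A) - mean(B), lm reports mean(B) - mean(A),
# so the statistics agree up to sign
all.equal(abs(unname(tt$statistic)), abs(t_lm))  # TRUE
all.equal(tt$p.value, p_lm)                      # TRUE
```

Note that the match relies on `var.equal = TRUE`; the default Welch test in `t.test()` does not assume equal variances and will generally give slightly different numbers.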

### Categorical variable with more than 2 levels

But what do we do if the categorical variable has more levels?

We need additional dummy variables. Dummy variables are comparisons to a *reference* level. Thus, we need more of these variables to compare all other levels with respect to the reference.

Don't worry, R will create these variables (as long as the categorical variable is a factor).

In [None]:
ACK_cancer_data <- 
    US_cancer_data %>%
    filter(state %in% c("California", "Alabama", "Kansas")) %>% 
    droplevels()

In [None]:
ACK_data_LR <- 
    tidy(lm(TARGET_deathRate ~ state, data = ACK_cancer_data)) %>% 
    mutate_if(is.numeric, round, 3)

ACK_data_LR

From the table above, you can get the (sample) average mortality rate of each state:

- Alabama (reference level): 192.729
- California: 192.729 - 34.632 = 158.097
- Kansas: 192.729 - 24.894 = 167.835

#### Overriding R's default behavior

We can override R's default behavior of fitting an intercept (which corresponds to the reference level) by adding `0` to the formula. Each estimate is then the sample mean of its state instead of a difference from the reference level.

In [None]:
# The 0 tells R to not fit an intercept (which would be the reference level). 
tidy(lm(TARGET_deathRate ~ 0 + state, data = ACK_cancer_data)) %>% 
    mutate_if(is.numeric, round, 3)
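
The same behaviour can be verified on a tiny made-up dataset (the data frame `toy` and its values are invented for illustration): without an intercept, each coefficient equals the sample mean of its group.

```r
# Tiny made-up dataset with three groups
toy <- data.frame(
  group = factor(c("A", "A", "B", "B", "C", "C")),
  y     = c(1, 3, 5, 7, 10, 12)
)

# Default: intercept (mean of reference level "A") plus differences
fit_ref <- lm(y ~ group, data = toy)
coef(fit_ref)     # (Intercept) = 2, groupB = 4, groupC = 9

# No intercept: one coefficient per group, each equal to the group mean
fit_means <- lm(y ~ 0 + group, data = toy)
coef(fit_means)   # groupA = 2, groupB = 6, groupC = 11

# Group means computed directly, for comparison
tapply(toy$y, toy$group, mean)
```

One thing to keep in mind: in the no-intercept fit, the reported p-values test whether each group mean equals zero, not whether groups differ from a reference level.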

### Take-away points

- Categorical variables can be included in an LR using dummy variables;

- R creates them for all factors in the model
        
- (by default) the first level of the factor is used as the reference level (alphabetical order when the factor is created with `factor()`), and the other levels of the categorical variable are compared to it!

- we can force R to not fit an intercept, so we have one dummy variable per level (three in the example above), and each coefficient gives that level's sample mean. Be careful; this is not the default behaviour.
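
If a different reference level is wanted while keeping the intercept, one option in base R is `relevel()`. The sketch below uses a small made-up data frame (`toy`, invented for illustration) and is only one possible approach, not something this worksheet requires.

```r
# Small made-up dataset with three groups
toy <- data.frame(
  group = factor(c("A", "A", "B", "B", "C", "C")),
  y     = c(1, 3, 5, 7, 10, 12)
)

# Make "B" the reference level instead of "A"
toy$group_ref_b <- relevel(toy$group, ref = "B")
levels(toy$group_ref_b)   # "B" "A" "C"

# Intercept is now the mean of "B"; other coefficients are differences from "B"
fit_b <- lm(y ~ group_ref_b, data = toy)
coef(fit_b)   # (Intercept) = 6, group_ref_bA = -4, group_ref_bC = 5
```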

# PART II

## 3. MLR: additive model with one continuous and one categorical variable

**Question 3.0**
<br>{points: 1}

Use the `AC_cancer_data` dataset to estimate the following additive model:
$$
y_i = \beta_0 + \beta_1\times\text{stateCalifornia}_i + \beta_2\times\text{povertyPercent}_i + \varepsilon_i
$$

Call the object to store the estimated additive model `MLR_state_poverty_add`.

Report the estimated coefficients, their standard errors, and corresponding $p$-values using `tidy()`. Include the corresponding asymptotic 95% confidence intervals. 

Store the results in the variable `MLR_state_poverty_add_results`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# MLR_state_poverty_add <- ...(..., ...)

# MLR_state_poverty_add_results <- 
#    ...(..., ...) %>% 
#    mutate_if(is.numeric, round, 2)

# your code here
fail() # No Answer - remove if you provide an answer

MLR_state_poverty_add
MLR_state_poverty_add_results

In [None]:
test_3.0()

<span style="color: darkred">NOTE: there are 3 coefficients for 2 lines because the additive model assumes a **common slope**!!</span>

**Question 3.1**
<br>{points: 1}

Using a $10\%$ significance level (i.e., $\alpha = 0.10$), and the results in `MLR_state_poverty_add_results`, which of the following interpretations is correct?

**A.** On average, the mortality rate in California is significantly different from that in Alabama.

**B.** On average, the mortality rate is significantly associated with poverty.

**C.** The association between mortality and poverty differs per state.

**D.** The expected mortality rate is the same in both states.

*Assign your answers to the object `answer3.1`. Your answers have to be included in a single string indicating the correct options **in alphabetical order** and surrounded by quotes (e.g., `"ABCD"` indicates you are selecting the four options).*

In [None]:
# answer3.1 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.1()

**Question 3.2**
<br>{points: 1}

Create a plot of the data (using `geom_point()`) along with the estimated regression lines coming from the additive regression model, stored in `MLR_state_poverty_add`.

> Note that your plot should have two regression lines, one for each state. 

- Colour the points and regression lines by state. 

- Include a legend indicating what colour corresponds to each state with proper axis labels. 

The `ggplot()` object's name will be `MLR_state_poverty_add_plot`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# Let's first add a column with the predictions
# AC_cancer_data <- 
#     AC_cancer_data %>%
#     add_predictions(...put_the_model_here..., 
#                     var = "pred_MLR_add")



# MLR_state_poverty_add_plot <- 
#     AC_cancer_data %>%
#     ggplot(aes(..., ..., color = ...)) +
#     geom_...() +
#     geom_line(aes(..., ..., color = ...)) +  
#   labs(title = "...",
#        x = "...",
#        y = "..."
#   ) +
#   theme(
#     text = element_text(size = 16.5),
#     plot.title = element_text(face = "bold"),
#     axis.title = element_text(face = "bold"),
#     legend.title = element_text(face = "bold")
#   )

# your code here
fail() # No Answer - remove if you provide an answer

MLR_state_poverty_add_plot

In [None]:
test_3.2()

**Question 3.3**
<br>{points: 1}

Looking at the plot in Question 3.2 and the modelling framework in Question 3.0, what is our fundamental assumption when using an additive MLR with a mixture of continuous and categorical explanatory variables?

**A.** The regression lines for both states have different slopes relating poverty to average mortality.

**B.** The regression lines for both states have the same slopes relating poverty to average mortality but different intercepts.

**C.** The regression lines for both states have different slopes and intercepts relating poverty to average mortality.

*Assign your answer to the object `answer3.3`. Your answer should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes.*

In [None]:
# answer3.3 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.3()

#### An additive model with one continuous and one categorical input variable has a common slope and different intercepts for each level of the categorical variable.
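
To see where the common slope comes from, one can plug the two values of the dummy variable into the model from Question 3.0 (a sketch in that question's notation):

$$
\begin{aligned}
\text{Alabama } (\text{stateCalifornia}_i = 0):\quad & E[y_i] = \beta_0 + \beta_2\times\text{povertyPercent}_i \\
\text{California } (\text{stateCalifornia}_i = 1):\quad & E[y_i] = (\beta_0 + \beta_1) + \beta_2\times\text{povertyPercent}_i
\end{aligned}
$$

Both lines share the slope $\beta_2$, and $\beta_1$ only shifts the intercept.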

## 4. MLR: interaction between a continuous and a categorical variable

Now, we will estimate another MLR model called `MLR_state_poverty_int`, but this time, we do not assume that the association between mortality and poverty is the same in both states. In other words, an *interaction* may exist between the input variables. **If the relation changes by the levels of the categorical variable, we need to add interaction term(s).**
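
In the notation of Question 3.0, such a model could be written as follows, where $\beta_3$ (a symbol introduced here for illustration) multiplies the product of the two inputs:

$$
y_i = \beta_0 + \beta_1\times\text{stateCalifornia}_i + \beta_2\times\text{povertyPercent}_i + \beta_3\times\text{stateCalifornia}_i\times\text{povertyPercent}_i + \varepsilon_i
$$

For counties in Alabama the dummy variable is 0, so the product term vanishes and the slope for poverty is $\beta_2$. For counties in California the dummy variable is 1, so the slope becomes $\beta_2 + \beta_3$.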

**Question 4.0**
<br>{points: 1}

Use the data to estimate the new model with an interaction term. 

Call the object to store the estimated interaction model `MLR_state_poverty_int`.

Report the estimated coefficients, their standard errors, and corresponding $p$-values using `tidy()`. Include the corresponding asymptotic 95% confidence intervals. Store the results in the variable `MLR_state_poverty_int_results`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# MLR_state_poverty_int <- ...(..., ...)

# MLR_state_poverty_int_results <- 
#    ...(..., ...) %>%
#    mutate_if(is.numeric, round, 2)


# your code here
fail() # No Answer - remove if you provide an answer

MLR_state_poverty_int
MLR_state_poverty_int_results

In [None]:
test_4.0()

<font style='color: darkred'>Note: there are now 4 coefficients for 2 lines because in the model with interactions we do *NOT* assume a common slope!!</font>

**Question 4.1**
<br>{points: 1}

Using a significance level $\alpha = 0.10$, and the results in `MLR_state_poverty_int_results`, which of the following interpretations is correct?

**A.** Mortality rates in Alabama are low.

**B.** The expected mortality rate does not differ by state.

**C.** In California, the expected change in mortality as poverty increases is higher than in Alabama.

**D.** In Alabama, the mortality rate is statistically associated with poverty.

**E.** The association between mortality and poverty varies (significantly) between Alabama and California.

**F.** The association between mortality and poverty is the same in Alabama and California.

*Assign your answers to the object `answer4.1`. Your answers have to be included in a single string indicating the correct options **in alphabetical order** and surrounded by quotes (e.g., `"ABCD"` indicates you are selecting the four options).*

In [None]:
# answer4.1 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_4.1()

**Question 4.2**
<br>{points: 1}

Create a plot of the data (using `geom_point()`) along with the estimated regression lines coming from the interaction regression model `MLR_state_poverty_int`.

> Note that your plot should have two regression lines, one for each state. 

- You have to colour the points and regression lines by state. 

- Include a legend indicating the colour of each state with proper axis labels. 

The `ggplot()` object's name will be `MLR_state_poverty_int_plot`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# Let's first add a column with the predictions
# AC_cancer_data <- 
#     AC_cancer_data %>%
#     add_predictions(..., var = "pred_MLR_int")


# MLR_state_poverty_int_plot <- 
#     AC_cancer_data %>%
#     ggplot(aes(..., ..., color = ...)) +
#     geom_...() +
#     geom_line(aes(..., ..., color = ...)) +  
#   labs(title = "...",
#        x = "...",
#        y = "..."
#   ) +
#   theme(
#     text = element_text(size = 16.5),
#     plot.title = element_text(face = "bold"),
#     axis.title = element_text(face = "bold"),
#     legend.title = element_text(face = "bold")
#   )

# your code here
fail() # No Answer - remove if you provide an answer

MLR_state_poverty_int_plot

In [None]:
test_4.2()

**Question 4.3**
<br>{points: 1}

Looking at the plot in Question 4.2 and the modelling framework in Question 4.0, what is the fundamental assumption we are making when using an interaction MLR with a mixture of continuous and categorical explanatory variables?

**A.** The regression lines for both states have the same slope relating poverty to average mortality, but different intercepts.

**B.** The regression lines for both states have different slopes and intercepts relating poverty to average mortality.

**C.** The regression lines for both states have different slopes relating poverty to average mortality, but the same intercept.

*Assign your answer to the object `answer4.3`. Your answer should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes.*

In [None]:
# answer4.3 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_4.3()

<font color="darkred"> If the relation between the continuous predictor and the response changes across the levels of the categorical variable, we need to add interaction term(s). The model then has a different slope and a different intercept for each level of the categorical variable. </font>
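
As a sanity check of this statement, the sketch below simulates two groups with different lines (the data frame `toy`, the groups "A" and "B", and all numeric values are made up for illustration) and shows that the interaction model recovers the same intercept and slope per group as fitting a separate regression within each group.

```r
# Simulated data with a different line in each group, for illustration only
set.seed(3)
n <- 40
toy <- data.frame(
  group = factor(rep(c("A", "B"), each = n)),
  x     = runif(2 * n, min = 0, max = 10)
)
toy$y <- ifelse(toy$group == "A", 1 + 2 * toy$x, 5 + 0.5 * toy$x) +
  rnorm(2 * n, sd = 0.5)

# Interaction model: 4 coefficients for 2 lines
b <- coef(lm(y ~ group * x, data = toy))

# Line for the reference group "A"
line_A <- c(b[["(Intercept)"]], b[["x"]])
# Line for group "B": add the dummy and interaction coefficients
line_B <- c(b[["(Intercept)"]] + b[["groupB"]], b[["x"]] + b[["groupB:x"]])

# Separate regressions within each group
sep_A <- unname(coef(lm(y ~ x, data = subset(toy, group == "A"))))
sep_B <- unname(coef(lm(y ~ x, data = subset(toy, group == "B"))))

all.equal(line_A, sep_A)  # TRUE
all.equal(line_B, sep_B)  # TRUE
```

The fitted lines coincide, but the two approaches are not interchangeable for inference: the single interaction model pools the residual variance across groups and lets you test the difference in slopes directly through the `groupB:x` coefficient.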