# Worksheet 03: Model Assumptions and Causality

## Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:

1. Describe heteroscedasticity and the problem it presents to generative modelling.
2. Write a computer script to assess whether heteroscedasticity is present in a given data set, and if so, use practical solutions to manage it.
3. Describe collinearity and the problem it presents to generative modelling.
4. Write a computer script to assess whether collinearity exists between input variables in a given data set, and if so, use practical solutions to manage it.
5. Discuss why a data scientist may need to consult a domain expert when examining model assumptions.
6. Give an example of a real problem that aims to test a causal relationship between variables.
7. Give an example of a real problem where the model can only establish an association between the response and the input variables.
8. Discuss how the desired goal of generative modelling is usually to make causal claims, even though we often cannot do so easily (in particular, in the context of observational studies).
9. Discuss the role of confounders in causal inference.

In [None]:
# Run this cell before continuing.
library(tidyverse)
library(repr)
library(infer)
library(cowplot)
library(broom)
library(faux)
source("tests_worksheet_03.R")

## PART I - Model Assumptions

When we analyze a dataset to respond to a question of interest, we make several assumptions. However, in practice, many of these assumptions may not be true, posing many problems for our analysis. The most common problems that we may encounter are:

1. the relation between the response and the input variable(s) is not linear
2. error terms are correlated (not independent as we assumed)
3. error terms do not have a common variance (not identically distributed as we assumed)
4. error terms are not Normally distributed (not a strong assumption, but convenient)
5. (some) input variables are correlated

> "In practice, identifying and overcoming these problems is as much an art as a science" (from ISL, Section 3.3.3).

### 1. Warm Up Questions

**Question 1.1**
<br>{points: 1}

True or false?

The results of the hypothesis tests given by `lm()` are only valid if we assume that the errors in the linear regression model are (exactly) Normally distributed.

*Assign your answer to an object called `answer1.1`. Your answer should be either "true" or "false", surrounded by quotes.*

In [None]:
# answer1.1 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.1()

**Question 1.2**
<br>{points: 1}

True or false?

In linear regression, multicollinearity refers to the correlation between each input variable and the response variable.

*Assign your answer to an object called `answer1.2`. Your answer should be either "true" or "false", surrounded by quotes.*

In [None]:
# answer1.2 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.2()

**Question 1.3**
<br>{points: 1}

True or false?

In the presence of multicollinearity in multiple linear regression (MLR), it can be difficult to determine how collinear variables are separately associated with the response.

*Assign your answer to an object called `answer1.3`. Your answer should be either "true" or "false", surrounded by quotes.*

In [None]:
# answer1.3 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.3()

**Question 1.4**
<br>{points: 1}

True or false?

Multicollinearity inflates the estimates of the standard errors of the least squares (LS) estimators of the regression coefficients.

*Assign your answer to an object called `answer1.4`. Your answer should be either "true" or "false", surrounded by quotes.*

In [None]:
# answer1.4 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.4()

**Question 1.5**
<br>{points: 1}

True or false?

The assumption that all the error terms have the same variance does not affect the estimator of the standard error of the LS estimators.

*Assign your answer to an object called `answer1.5`. Your answer should be either "true" or "false", surrounded by quotes.*

In [None]:
# answer1.5 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.5()

## PART II - Violations of Model Assumptions

In the following problems, we will explore how violations of model assumptions affect our estimation and analyses *using simulated data*. 

The advantage of simulating data is that we have full control over the data generating process so **we know the true parameters** and we can examine problems in a controlled way. 

Real datasets are rich and interesting but usually contain many (unknown) problems, and we don't know the true parameters to assess the performance of our estimation and analyses. 

### 2.1 Benchmark model

**Question 2.0**
<br>{points: 1}

Let's start by generating a sample of size $n = 1000$ from a data-generating process that *fulfills all the assumptions* of classical least squares estimation in linear regression.

In this problem, we generate data with a continuous response, one continuous input variable, and one binary input variable. We call this first sample `sample_model_1`.

That is, for $i = 1, \dots, n$:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$$ 

where the error terms $\varepsilon_i \sim \mathcal{N}(0, \sigma^2 = 4)$ are independent and identically distributed. Assume that $X_{1}$ is uniformly distributed between 2 and 5, and $X_{2}$ is a binary random variable with levels "A" (value of 0) and "B" (value of 1) with equal probability.

Generate a response variable using these model assumptions and true (population) regression coefficients $\beta_0$, $\beta_1$, and $\beta_2$ equal to $10$, $8$, and $5$, respectively. 

> **Heads up**: The distributions used to generate the input variables do not affect the results of our analysis. 

Note that the columns of `sample_model_1` are:

- `x_1`: the values of the continuous input $X_{i1}$
- `x_2`: the levels of the discrete input $X_{i2}$
- `y`: the sampled response values $Y_i$
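
The same simulation pattern can be sketched on a smaller toy model. This is only an illustration: the names `n_toy`, `toy_sample`, and `toy_fit`, as well as the coefficients, input range, and noise level, are hypothetical and deliberately different from the values requested in this question.

```r
# Toy illustration only: the coefficients, ranges, and noise level below are
# hypothetical and deliberately different from those in Question 2.0.
set.seed(2024)

n_toy <- 200

toy_sample <- data.frame(
  x_1 = runif(n = n_toy, min = 0, max = 1),
  x_2 = factor(rbinom(n = n_toy, size = 1, prob = 0.5), labels = c("A", "B"))
)

# Additive linear model: intercept + slope * x_1 + shift for level "B" + noise.
toy_sample$y <- 1 + 2 * toy_sample$x_1 +
  3 * ifelse(toy_sample$x_2 == "B", 1, 0) +
  rnorm(n = n_toy, mean = 0, sd = 0.5)

# Fit the additive model and inspect the estimated coefficients.
toy_fit <- lm(y ~ x_1 + x_2, data = toy_sample)
coef(toy_fit)
```

Because the data-generating process is known, the estimated coefficients can be compared directly with the values used in the simulation.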

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(123) # DO NOT CHANGE!

sample_size <- 1000

# sample_model_1 <- tibble(
#   x_1 = runif(n = ..., ..., ...),
#   x_2 = factor(rbinom(n = ..., size = 1, prob = 0.5), labels=c("A", "B")),
#   y = ... * if_else(x_2 == "B", ..., ...) + rnorm(n = ..., mean = ..., sd = ...)
# )

# your code here
fail() # No Answer - remove if you provide an answer

head(sample_model_1)

In [None]:
test_2.0()

**Question 2.1**
<br>{points: 1}

Use the simulated `sample_model_1` and the `lm` function to estimate the regression parameters $\beta_0$, $\beta_1$, and $\beta_2$, and to obtain the quantities needed to make inferences using hypothesis tests. Assign the output of `lm` to the object `model_1`.

Obtain the estimated coefficients, their standard errors, corresponding $p$-values, and $95\%$ confidence intervals using `tidy()`. Store the results in `model_1_results`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# model_1 <- ...(..., ...)

# model_1_results <- 
#    ...(..., ...) %>%
#    mutate_if(is.numeric, round, 2)

# your code here
fail() # No Answer - remove if you provide an answer

model_1_results

In [None]:
test_2.1()

#### Concluding remarks: Benchmark Model

Note that the estimates $\hat{\beta}_0$,  $\hat{\beta}_1$, and $\hat{\beta}_2$ in `model_1` are close to the true population parameters (less than 2 standard errors away). At least for this sample, all of the 95% confidence intervals contain the true parameters that we've used to generate the data. Two final comments are worth making:

- Note that, in general, we don't know the true parameters to make this type of assessment.

- Even with simulated data, the estimated CI may not contain the true parameters. Why?
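
One way to explore the last question is to repeat a simulation many times and record how often the 95% confidence interval contains the true parameter. The sketch below uses a hypothetical simple regression (the names `n_rep`, `n_obs`, `true_slope`, and all numeric values are assumptions for illustration, not part of this worksheet's models).

```r
set.seed(2024)

n_rep <- 500      # number of simulated datasets (hypothetical choice)
n_obs <- 50       # observations per dataset (hypothetical choice)
true_slope <- 2   # hypothetical true slope

# For each simulated dataset, check whether the 95% CI covers the true slope.
covered <- replicate(n_rep, {
  x <- runif(n_obs)
  y <- 1 + true_slope * x + rnorm(n_obs, sd = 1)
  ci <- confint(lm(y ~ x), parm = "x", level = 0.95)
  ci[1, 1] <= true_slope && true_slope <= ci[1, 2]
})

# Proportion of intervals that contain the true slope.
coverage <- mean(covered)
coverage
```

Comparing `coverage` with the nominal confidence level is a useful starting point for thinking about the question.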

### 2.2 Heteroscedasticity

In the previous case, we assumed that the error terms were independent and identically distributed. *What would happen if this assumption is violated in the data?*

In particular, we will focus on a problem known as *heteroscedasticity*.

In the next question, we are going to simulate data from a *data generating process with heteroscedasticity* and use `lm` to estimate the coefficients of the MLR ignoring that the error terms are heteroscedastic.
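
A short derivation helps to see where the equal-variance assumption enters. The matrix notation below ($\mathbf{X}$ for the design matrix, $\boldsymbol{\Omega}$ for the error covariance matrix) is an assumption introduced for illustration and is not used elsewhere in this worksheet. Writing the model as $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with independent errors and $\mathrm{Var}(\boldsymbol{\varepsilon}) = \boldsymbol{\Omega} = \mathrm{diag}(\sigma_1^2, \dots, \sigma_n^2)$, the LS estimator and its variance (conditional on $\mathbf{X}$) are:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}, \qquad \mathrm{Var}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \boldsymbol{\Omega}\, \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1}$$

Only when $\sigma_i^2 = \sigma^2$ for all $i$ does this simplify to the classical expression:

$$\mathrm{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}$$

The standard errors reported by `lm` are based on this second expression, so they rely on the common-variance assumption.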

**Question 2.2**
<br>{points: 1}

Again, we generate a sample of size $n = 1000$. The variables $X_{i1}$ and $X_{i2}$ are generated as in *Question 2.0*.

The model equation is given by:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i, \quad i = 1, \dots, n$$ 

where the error terms $\varepsilon_i \sim \mathcal{N}\left(0, \sigma_i^2 = X_{i1}^4\right)$ (note how now we have **heteroscedasticity**, i.e., the value of $\sigma_i^2$ is different for each observation). 

As before, let the true (population) regression coefficients $\beta_0$, $\beta_1$, and $\beta_2$ be $10$, $8$, and $5$, respectively.

Call this new sample `sample_model_2`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(321) # DO NOT CHANGE!

# sample_size = ...
# sample_model_2 <- tibble(
#   x_1 = runif(n = ..., ..., ...),
#   x_2 = factor(rbinom(n = ..., size = 1, prob = 0.5), labels=c("A", "B")),
#   y = ... * if_else(x_2 == "B", ..., ...) + rnorm(n = ..., mean = ..., sd = ...)
# )

# your code here
fail() # No Answer - remove if you provide an answer

head(sample_model_2)

In [None]:
test_2.2()

**Question 2.3**
<br>{points: 1}

Use the simulated `sample_model_2` to estimate the regression parameters $\beta_0$, $\beta_1$, and $\beta_2$, and to obtain the quantities needed to make inferences using hypothesis tests.

*Ignore the heteroscedasticity of the data generating process* and use the function `lm()` to estimate the regression parameters. Assign the results to the object `model_2`.

Obtain the estimated coefficients, their standard errors, corresponding $p$-values, and $95\%$ confidence intervals using `tidy()`. Store the results in `model_2_results`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# model_2 <- ...(..., ...)

# model_2_results <- 
#    ...(..., ...) %>% 
#    mutate_if(is.numeric, round, 2)

# your code here
fail() # No Answer - remove if you provide an answer

model_2_results

To make the comparison easier, here are the Model 1 results:

In [None]:
# Run this cell and compare Model 1 results against Model 2 results.

################################################################################
# Pay close attention to the std. error columns and the width of the intervals #
################################################################################

model_1_results

In [None]:
test_2.3()

**Question 2.4**
<br>{points: 1}

Recall that the true population values of $\beta_0$, $\beta_1$, and $\beta_2$ are $10$, $8$, and $5$ respectively.  Which of the following consequences of ignoring the heteroscedasticity of the data-generating process is true? 

*Tip*: `lm` assumes that the errors are *iid* as in `sample_model_1`. Thus, you can also use `model_1_results` as a benchmark.

**A.** The estimates $\hat{\beta}_0$,  $\hat{\beta}_1$, and $\hat{\beta}_2$ in `model_2_results` are still similar to the true population parameters, but their estimated standard errors are inflated.

**B.** The 95% confidence intervals of the regression coefficients are not affected by the heteroscedasticity of the data-generating process.


*Assign your answer to an object called `answer2.4`. Your answer should be one of `"A"` or `"B"` surrounded by quotes.*

In [None]:
# answer2.4 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.4()

#### Detecting heteroscedasticity

When we don't have simulated data, we can (graphically) diagnose heteroscedasticity by plotting the **residuals** against the fitted values.

<font color="darkred">**Diagnosis rule**</font>

<font color="darkred">If the errors are homoscedastic (equal variance), the residuals should show equal variation for all fitted values. </font>
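
The graphical rule can be complemented with a formal check. The sketch below computes the studentized (Koenker) version of the Breusch-Pagan test by hand in base R. This test is not part of the worksheet; it is swapped in here only as an illustration, and the toy data and the names `n_toy`, `fit`, `aux_fit`, `bp_statistic`, and `bp_p_value` are hypothetical.

```r
set.seed(2024)

n_toy <- 300
x <- runif(n_toy, min = 1, max = 4)

# Hypothetical toy data: the error standard deviation grows with x.
y <- 2 + 3 * x + rnorm(n_toy, mean = 0, sd = x^2)

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the input variable.
squared_resid <- residuals(fit)^2
aux_fit <- lm(squared_resid ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary fit,
# compared against a chi-squared distribution with one degree of freedom
# (one input variable in the auxiliary regression).
bp_statistic <- n_toy * summary(aux_fit)$r.squared
bp_p_value <- pchisq(bp_statistic, df = 1, lower.tail = FALSE)

c(statistic = bp_statistic, p_value = bp_p_value)
```

A small $p$-value is evidence against a common error variance. As with any test, it is best read together with the residual plot rather than instead of it.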

Let's take a look at these plots for both cases simulated before: `model_1` (homoscedastic case) versus `model_2` (heteroscedastic case). We can obtain both plots via the function `plot()`.

*Run the cell below before continuing.*

In [None]:
# Adjust these numbers so the plot looks good on your screen.
options(repr.plot.width = 8, repr.plot.height = 6) 

plot(model_1, 1, main = "Model 1: Homoscedastic")
plot(model_2, 1, main = "Model 2: Heteroscedastic")

**Question 2.5**
<br>{points: 1}

What is the difference between the two plots of residuals versus fitted values?

**A.** There is no difference between both plots; their respective clouds of points look uniformly similar.

**B.** The diagnostic plot of `model_1` shows a uniform and more scattered cloud of points than `model_2`. The cloud of points in `model_2` shows a clear funnel shape, indicating a non-constant variance.

**C.** The diagnostic plot of `model_2` shows a uniform and more scattered cloud of points than `model_1`. The cloud of points in `model_1` shows a clear funnel shape, indicating a non-constant variance.

*Assign your answer to an object called `answer2.5`. Your answer should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes.*

In [None]:
# answer2.5 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.5()

### 2.3 Normality

Another assumption commonly made in linear regression is that the error terms are Normally distributed. *What would happen if this assumption is violated?*

In the next question, we are going to simulate data using a data-generating process with non-normal errors and use `lm` to estimate the coefficients of the MLR.

**Question 2.6**
<br>{points: 1}

Once again, we generate a sample of size $n = 1000$. The variables $X_{i1}$ and $X_{i2}$ are generated as in *Question 2.0*.

The model equation is given by:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i, \quad i = 1, \dots, n$$ 

where the error terms are independent and identically distributed from a Uniform distribution, $\varepsilon_i \sim \mathrm{U}(-10,10)$. 

As before, let the true (population) regression terms $\beta_0$, $\beta_1$, and $\beta_2$ be $10$, $8$, and $5$, respectively.

Call this new sample `sample_model_3`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(654) # DO NOT CHANGE!

# sample_size = ...

# sample_model_3 <- tibble(
#   x_1 = runif(n = ..., ..., ...),
#   x_2 = factor(rbinom(n = ..., size = ..., prob = ...), labels=c("A", "B")),
#   y = ... * if_else(x_2 == "B", ..., ...) + r...(n = ..., ..., ...)
# )

# your code here
fail() # No Answer - remove if you provide an answer

head(sample_model_3)

In [None]:
test_2.6()

**Question 2.7**
<br>{points: 1}

Use the simulated `sample_model_3` to estimate the regression parameters $\beta_0$, $\beta_1$, and $\beta_2$, and to obtain the quantities needed to make inferences using hypothesis tests.

Use the function `lm()` to estimate the regression parameters. Assign the results to the object `model_3`.

Obtain the estimated coefficients, their standard errors, corresponding $p$-values, and $95\%$ confidence intervals using `tidy()`. Store the results in `model_3_results`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# model_3 <- ...(..., ...)

# model_3_results <- 
#     ...(..., ...) %>%
#     mutate_if(is.numeric, round, 2)

# your code here
fail() # No Answer - remove if you provide an answer

model_3_results

In [None]:
test_2.7()

#### Concluding remarks: Normality

Note that the estimates of the regression parameters are not heavily affected by this problem. However, not only the distribution but also the variance of the error term has changed! As a result, the SEs are larger in this case.
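
The change in the error variance can be worked out directly. For a Uniform distribution on $(a, b)$ the variance is $(b - a)^2 / 12$, so for the errors used in `sample_model_3`:

$$\mathrm{Var}(\varepsilon_i) = \frac{(10 - (-10))^2}{12} = \frac{400}{12} \approx 33.3, \qquad \mathrm{SD}(\varepsilon_i) \approx 5.77$$

In the benchmark model the errors had $\sigma^2 = 4$, i.e., $\sigma = 2$. Since the standard errors of the LS estimators are proportional to the error standard deviation, we would expect them to be roughly $5.77 / 2 \approx 2.9$ times larger here, all else being equal (the two samples use different seeds, so the ratio will not be exact).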

The distributions of the test statistics, such as the $t$ or the $F$-statistics, rely on the normality of the $\varepsilon_i$'s, unless the sample size $n$ is large enough, in which case their distributions can be approximated using asymptotic results from the CLT.

In this case, the sample size is large so, according to the CLT, the sampling distribution of the estimators when $\sigma$ is known is approximately Normal.

Since, in general, $\sigma$ is unknown, it needs to be estimated, and the sampling distribution is then approximated by a Student's $t$-distribution (similar to the Normal but with slightly heavier tails).

<font style='color: darkred'>*Important note*: `lm` *assumes* that either the errors are Normal or the conditions of the CLT are met. Regardless, `lm` assumes that the sampling distribution can be approximated by a Student's $t$-distribution. It is **your** job to check these assumptions!</font>

### 2.4 Diagnostic plots

A $Q$-$Q$ plot and the histogram of residuals are graphical tools that help us to assess the normality assumption.
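
These graphical tools can be complemented with a formal test. As a minimal sketch, the Shapiro-Wilk test (`shapiro.test()` from base R's `stats` package, valid for sample sizes between 3 and 5000) can be applied to the residuals. The test is not used elsewhere in this worksheet, and the toy data and names below are hypothetical.

```r
set.seed(2024)

n_toy <- 500
x <- runif(n_toy)

# Two hypothetical toy responses: one with Normal errors, one with Uniform errors.
y_normal <- 1 + 2 * x + rnorm(n_toy, mean = 0, sd = 2)
y_uniform <- 1 + 2 * x + runif(n_toy, min = -10, max = 10)

# Shapiro-Wilk test on the residuals of each fitted model.
shapiro_normal <- shapiro.test(residuals(lm(y_normal ~ x)))
shapiro_uniform <- shapiro.test(residuals(lm(y_uniform ~ x)))

c(
  normal_errors = shapiro_normal$p.value,
  uniform_errors = shapiro_uniform$p.value
)
```

A small $p$-value is evidence against Normal errors. With large samples the test can flag departures that are too small to matter in practice, so the plots remain the primary diagnostic.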

Let us compare these plots for `model_1` (with Normal errors) and `model_3` (with non-Normal errors). We can obtain both $Q$-$Q$ plots via the function `plot()` and the histograms using `hist()`.

*Run the cell below before continuing.*

In [None]:
# Q-Q plots for Models 1 and 3
plot(model_1, 2, main = "Model 1")
plot(model_3, 2, main = "Model 3")

# Histograms for Models 1 and 3
hist(residuals(object = model_1),
  breaks = 10,
  main = "Histogram of Residuals for Model 1",
  xlab = "Residuals"
)

hist(residuals(object = model_3),
  breaks = 10,
  main = "Histogram of Residuals for Model 3",
  xlab = "Residuals"
)

**Question 2.8**
<br>{points: 1}


What is the difference between the two pairs of plots corresponding to `model_1` and `model_3`?

**A.** There are no differences between both pairs of plots, suggesting that both models fulfil the normality assumption.

**B.** For `model_3`, most of the points lie on the 45° dotted line of the $Q$-$Q$ plot, suggesting that the errors are normally distributed.

**C.** For `model_1`, most of the points lie on the 45° dotted line of the $Q$-$Q$ plot, suggesting that the errors are normally distributed.

**D.** The histogram of the residuals of `model_3` is similar to that of the residuals of `model_1`. 

*Assign your answer to an object called `answer2.8`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"`, surrounded by quotes.*

In [None]:
# answer2.8 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.8()

### 2.5 Multicollinearity

A common problem in real data occurs when the input variables are correlated. This problem is known as multicollinearity. In this section, we will simulate correlated input variables to understand how this problem affects the sampling distribution of the least squares regression parameter estimators.

In this exercise, we will generate 1000 samples of size $n = 100$ from the following data-generating process:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i,\quad i = 1, ..., n$$ 

where the error terms $\varepsilon_i \sim \mathcal{N}(0, \sigma^2 = 4)$ are independent and identically distributed, <font style='color: darkred'>and $X_{1}$ and $X_{2}$ are two continuous correlated input variables.</font> 

Assume that all the true (population) regression terms $\beta_0$, $\beta_1$, and $\beta_2$ are equal to $10$. 

**Note**: $n$ and the regression coefficients are different in this problem.
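
Before simulating, it may help to see one common way to quantify multicollinearity in a dataset: the variance inflation factor (VIF), defined for input $j$ as $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing input $j$ on the other inputs. The sketch below computes it by hand in base R. The VIF is not used elsewhere in this worksheet, and the toy inputs, the value of `rho`, and all names are assumptions for illustration.

```r
set.seed(2024)

n_toy <- 100
rho <- 0.95  # hypothetical correlation between the two inputs

# Build two correlated standard Normal inputs from independent ones.
z_1 <- rnorm(n_toy)
z_2 <- rnorm(n_toy)
x_1 <- z_1
x_2 <- rho * z_1 + sqrt(1 - rho^2) * z_2

# VIF of x_1: regress x_1 on the other input and use 1 / (1 - R^2).
r_squared_x_1 <- summary(lm(x_1 ~ x_2))$r.squared
vif_x_1 <- 1 / (1 - r_squared_x_1)

# Population counterpart for two inputs: 1 / (1 - rho^2).
vif_population <- 1 / (1 - rho^2)

c(sample_vif = vif_x_1, population_vif = vif_population)
```

With only two inputs, the VIF depends solely on their correlation. With more inputs, each VIF requires its own auxiliary regression on all the remaining inputs.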

**Question 2.9**
<br>{points: 1}

First, let's learn how to generate correlated inputs $X_{1}$ and $X_{2}$ from a bivariate normal distribution. Assume these variables have population means $\mu_1 = 10$ and $\mu_2 = 20$, respectively, and population standard deviations $\sigma_1 = 4$ and $\sigma_2 = 8$. To generate multicollinearity, we give them a high correlation, say $\rho = 0.95$.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(456) # DO NOT CHANGE!

# sample_size <- ...
# bivariate_normal_sample <- rnorm_multi(
#   n = ...,
#   mu = c(..., ...),
#   sd = c(..., ...),
#   r = ...,
#   varnames = c("x_1", "x_2"),
#   empirical = FALSE
# )

# your code here
fail() # No Answer - remove if you provide an answer

head(bivariate_normal_sample)

In [None]:
test_2.9()

**Question 2.10**
<br>{points: 1}

Now we generate $1000$ datasets of size $n=100$ from the data generating process described and fit an additive MLR using `lm` to each dataset. 

Store the corresponding $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ per sample in a dataframe called `lm_multicollinearity` of 1000 rows and three columns:

- intercept: The estimated intercept $\hat{\beta}_0$.
- beta_1_hat: The estimated slope $\hat{\beta}_1$.
- beta_2_hat: The estimated slope $\hat{\beta}_2$.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(321) # DO NOT CHANGE!

# sample_size <- ...
# num_replicates <- ...

# beta_0 <- ...
# beta_1 <- ...
# beta_2 <- ...

# lm_multicollinearity <- replicate(..., {
#   rnorm_multi(
#     n = ...,
#     mu = ...,
#     sd = ...,
#     r = ...,
#     varnames = c("x_1", "x_2"),
#     empirical = FALSE
#   ) %>%
#     mutate(y = ... + beta_1 * ... + ... * x_2 +
#       rnorm(n = ..., mean = 0, sd = ...)) %>%
#     lm(..., data = .) %>%
#     .$coef
# })

# lm_multicollinearity <- data.frame(
#   intercept = lm_multicollinearity[1, ],
#   beta_1_hat = lm_multicollinearity[2, ],
#   beta_2_hat = lm_multicollinearity[3, ]
# )

# head(lm_multicollinearity)

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.10()

**Question 2.11**
<br>{points: 1}

For comparison purposes, repeat the process for samples taken from a population without multicollinearity (use $\rho = 0.001$).

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(321) # DO NOT CHANGE!

# sample_size <- ...
# num_replicates <- ...

# beta_0 <- ...
# beta_1 <- ...
# beta_2 <- ...

# lm_no_multicollinearity <- replicate(..., {
#   rnorm_multi(
#     n = ...,
#     mu = ...,
#     sd = ...,
#     r = ...,
#     varnames = c("x_1", "x_2"),
#     empirical = FALSE
#   ) %>%
#     mutate(y = ... + beta_1 * ... + ... * x_2 +
#       rnorm(n = ..., mean = 0, sd = ...)) %>%
#     lm(..., data = .) %>%
#     .$coef
# })
# lm_no_multicollinearity <- data.frame(
#   intercept = lm_no_multicollinearity[1, ],
#   beta_1_hat = lm_no_multicollinearity[2, ],
#   beta_2_hat = lm_no_multicollinearity[3, ]
# )

# head(lm_no_multicollinearity)

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.11()

**Question 2.12**
<br>{points: 1}

Plot the 1000 regression estimates for the slope corresponding to $X_1$ (i.e., $\hat{\beta}_1$) stored in `lm_multicollinearity` and `lm_no_multicollinearity`, separately (i.e., two histograms with counts on the $y$-axis and the estimate on the $x$-axis). 

Name the `ggplot()` objects `hist_multicollinearity_slope_x_1` and `hist_no_multicollinearity_slope_x_1`, respectively. Moreover, plot the averages of these estimates in their respective histograms as vertical red lines.

<font style='color: darkred'>*Note*: these are **not** bootstrapped estimates!</font>

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# Adjust these numbers so the plot looks good on your screen.
options(repr.plot.width = 7, repr.plot.height = 9) 

# hist_multicollinearity_slope_x_1 <- ggplot(..., aes(...)) +
#   ...(bins = 15, color = "white") +
#   ...(..., col = "red", size = 1) +
#   coord_cartesian(xlim = c(9.4, 10.6), ylim = c(0, 250)) +
#   scale_x_continuous(breaks = seq(9, 10.6, 0.2)) +
#   xlab(...) +
#   ylab(...) +
#   theme(text = element_text(size = 14)) +
#   ggtitle(...)

# hist_no_multicollinearity_slope_x_1 <- ggplot(..., aes(...)) +
#   ...(bins = 15, color = "white") +
#   ...(..., col = "red", size = 1) +
#   coord_cartesian(xlim = c(9.4, 10.6), ylim = c(0, 250)) +
#   scale_x_continuous(breaks = seq(9, 10.6, 0.2)) +
#   xlab(...) +
#   ylab(...) +
#   theme(text = element_text(size = 14)) +
#   ggtitle(...)

# plot_grid(hist_multicollinearity_slope_x_1, hist_no_multicollinearity_slope_x_1,
#   ncol = 1
# )

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.12.0()
test_2.12.1()

**Question 2.13**
<br>{points: 1}

Based on your findings in `hist_multicollinearity_slope_x_1` and `hist_no_multicollinearity_slope_x_1`, what are the implications of multicollinearity in the sampling distributions of least squares estimators of MLR?

**A.** Multicollinearity reduces the standard error of the regression estimator.

**B.** Multicollinearity inflates the standard error of the regression estimator.

**C.** Multicollinearity does not seem to affect the estimation results.

*Assign your answer to an object called `answer2.13`. Your answer should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes.*


In [None]:
# answer2.13 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.13()

## PART III - Causality and Confounders

You will work on a simulation experiment outlined in `tutorial_03`.