# Tutorial 06: Goodness of Fit and Nested Models

#### Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:

1. List model metrics that are suitable for evaluation of a statistical model developed to make inferences about the data-generating mechanism (e.g., $R^2$, $\text{AIC}$, Likelihood ratio test/$F$-test), their strengths and limitations, as well as how they are calculated.
2. Write a computer script to calculate these model metrics. Interpret and communicate the results from that computer script.
3. Explain how an $F$-test to compare nested models can be used as a variable selection method.

In [None]:
# Run this cell before continuing.

library(broom)
library(tidyverse)
source("tests_tutorial_06.R")

## Can we predict protein from mRNA?

In *Worksheet 06*, you studied the significance of `mrna` and analyzed the goodness-of-fit of some models. However, there are other models that can be compared. For example, are interaction terms needed, or should we just use an additive model? 

Consider the following models using a dataset with 3 randomly selected genes:

- model.1: $\text{prot}_t=\beta_0 + \varepsilon_t$ 

- model.2:  $\text{prot}_t=\beta_0 + \beta_1 \text{mrna}_{t} + \varepsilon_t$ 

- model.3:  $\text{prot}_t=\beta_0 + \beta_2 \text{gene2}_{t} + \beta_3 \text{gene3}_{t} + \varepsilon_t$ 

- model.4:  $\text{prot}_t=\beta_0 + \beta_1 \text{mrna}_{t} + \beta_2 \text{gene2}_{t} + \beta_3 \text{gene3}_{t} + \varepsilon_t$ 

- model.5:  $\text{prot}_t=\beta_0 + \beta_1 \text{mrna}_{t} + \beta_2 \text{gene2}_{t} + \beta_3 \text{gene3}_{t} + \beta_4 \text{gene2}_{t}\text{mrna}_{t} + \beta_5 \text{gene3}_{t}\text{mrna}_{t} + \varepsilon_t$ 
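
As a rough sketch of how these five models could be written as R formulas, the block below fits them to a small simulated data frame. The object `toy` and its values are made up for illustration and are not the tutorial data. The sketch assumes `gene` is a factor with three levels, so that `lm()` creates the dummy variables for the second and third genes automatically.

```r
# Simulated toy data: NOT the tutorial dataset, for illustration only.
set.seed(1)
toy <- data.frame(
  gene = factor(rep(c("g1", "g2", "g3"), each = 10)),
  mrna = rnorm(30)
)
toy$prot <- 1 + 0.5 * toy$mrna + rnorm(30)

# One formula per model; `mrna * gene` expands to mrna + gene + mrna:gene.
formulas <- list(
  model.1 = prot ~ 1,
  model.2 = prot ~ mrna,
  model.3 = prot ~ gene,
  model.4 = prot ~ mrna + gene,
  model.5 = prot ~ mrna * gene
)

# Fit every model to the same data.
fits <- lapply(formulas, lm, data = toy)

# Number of estimated regression coefficients in each model.
n_coef <- sapply(fits, function(fit) length(coef(fit)))
n_coef
```

Every model in the list is nested in `model.5`, and `model.3` is nested in `model.4`. By contrast, `model.2` and `model.3` are not nested in each other, since neither can be obtained from the other by setting coefficients to zero.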

In [None]:
# Read and take a look at the data.
dat_bio <- read.csv("data/nature_dat.csv", row.names = 1, stringsAsFactors = TRUE)

str(dat_bio)
head(dat_bio, 3)
tail(dat_bio, 3)

In [None]:
# Run this cell.
# Fix the seed, then keep the rows whose gene is among three sampled `gene` values.
set.seed(561)
dat_3genes <- dat_bio %>%
  subset(gene %in% sample(gene, 3))

**Question 1.0**
<br>{points: 4}

We can use the adjusted $R^2$ to compare the goodness-of-fit of `model.5` and `model.4` to conclude which fits the data better. 

a) [2pts] Use the dataset `dat_3genes` to fit both models and the function `glance()` to obtain their $R^2$ and adjusted $R^2$. 

b) [1pt] Compare the adjusted $R^2$ of the two models and discuss the results. 

c) [1pt] Compare the $R^2$ of the two models and explain why that of `model.5` is larger.
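
For reference, both metrics can be written in terms of sums of squares. The notation below is an assumption introduced for this note rather than taken from the worksheet: $n$ is the number of observations, $p$ is the number of estimated coefficients excluding the intercept, $\text{RSS}$ is the residual sum of squares of the fitted model, and $\text{TSS}$ is the total sum of squares around the sample mean of `prot`.

$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}, \qquad R^2_{\text{adj}} = 1 - \frac{\text{RSS}/(n - p - 1)}{\text{TSS}/(n - 1)}$$

Under this notation, `model.4` has $p = 3$ and `model.5` has $p = 5$.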

In [None]:
# Your code and numerical results go here. We will grade this cell manually

# your code here
fail() # No Answer - remove if you provide an answer

> *Your explanation of the results goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.1**
<br>{points: 1}

Use the function `anova()` to test if the model with interaction terms (`model.5`) is significantly different from an additive one (`model.4`). 

> Note that both models have `mrna`,  but the full model, `model.5`, assumes that the expected change in protein levels per unit change in mRNA levels differs for each gene.

Store the output of the `anova` function in an object called `Ftest_3genes_add_full`.
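
To illustrate the mechanics of this kind of comparison, here is a minimal sketch on R's built-in `mtcars` data (not on `dat_3genes`). The variables `mpg`, `wt`, and `cyl` merely stand in for a response, a continuous input, and a categorical input. The object names are hypothetical. The sketch also recomputes the $F$ statistic by hand from the residual sums of squares, to show what `anova()` reports.

```r
# Illustration on the built-in `mtcars` data, not on `dat_3genes`.
cars_df <- transform(mtcars, cyl = factor(cyl))

reduced <- lm(mpg ~ wt + cyl, data = cars_df)  # additive: common slope for wt
full    <- lm(mpg ~ wt * cyl, data = cars_df)  # interaction: one slope per cyl level

# anova() takes the reduced model first, then the full model.
ftest <- anova(reduced, full)
ftest

# The same F statistic computed from the residual sums of squares.
rss_reduced <- sum(resid(reduced)^2)
rss_full    <- sum(resid(full)^2)
df_extra    <- df.residual(reduced) - df.residual(full)  # extra coefficients in full
f_manual    <- ((rss_reduced - rss_full) / df_extra) / (rss_full / df.residual(full))
p_manual    <- pf(f_manual, df_extra, df.residual(full), lower.tail = FALSE)

c(F = f_manual, p.value = p_manual)
```

The second row of the `anova()` table holds the test of the reduced model against the full one. The hand-computed values should match its `F` and `Pr(>F)` entries.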

*Write your own code and run it.*

In [None]:
#[write your code here]

# Hints: 
# - fit the additive model
# - fit the full model
# - call the anova function and store the output in Ftest_3genes_add_full


# your code here
fail() # No Answer - remove if you provide an answer

Ftest_3genes_add_full

In [None]:
test_1.1()

**Question 1.2**
<br>{points: 1}

Using a significance level $\alpha = 0.05$ and the results in `Ftest_3genes_add_full`, what is the conclusion, in plain words, of the test run in *Question 1.1*?

**A.** We reject the null hypothesis; thus, the *full* model is significantly better than the *reduced* model.

**B.** We fail to reject the null hypothesis; thus, there is not enough evidence that the *full* model with additional interaction terms is better than the additive (reduced) model.

**C.** We accept the alternative hypothesis; thus, the *full* model is significantly better than the *reduced* model.

**D.** We do not accept the alternative hypothesis; thus, the *full* model with additional interaction terms is not better than the *reduced* model.

*Assign your answer to an object called `answer1.2`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer1.2 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.2()

#### Assessing mRNA in the additive model

As a final test, let's examine the significance of `mrna` in the additive model.

**Question 1.3**
<br>{points: 2}

Compare the additive model with `mrna` as an input and distinct intercepts per gene (`model.4`) with a model that omits `mrna` and uses only the categorical variable `gene` as an input (`model.3`). Note that the second model predicts protein levels with the average protein level within each gene.

Use the function `tidy()` to obtain a summary of the additive model. Include the corresponding asymptotic 90% confidence intervals. Store the results in an object called `add_mrna_results`.

Use the function `anova()` to compare these models and store the results in an object called `Ftest_3genes_add_mrna`.
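
As a hedged illustration of what a coefficient summary with 90% confidence intervals contains, the sketch below uses base R on the built-in `mtcars` data (not on `dat_3genes`). The object names are hypothetical. With `broom`, `tidy(fit, conf.int = TRUE, conf.level = 0.90)` would be expected to report the same bounds in its `conf.low` and `conf.high` columns.

```r
# Illustration on the built-in `mtcars` data, not on `dat_3genes`.
cars_df <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ wt + cyl, data = cars_df)

# Coefficient table: estimate, standard error, t statistic, p-value.
coef_table <- coef(summary(fit))

# 90% confidence intervals; confint() uses t quantiles for lm fits.
ci_90 <- confint(fit, level = 0.90)

# Each interval is estimate +/- t quantile * standard error.
t_crit <- qt(0.95, df.residual(fit))
manual_wt <- coef_table["wt", "Estimate"] +
  c(-1, 1) * t_crit * coef_table["wt", "Std. Error"]

cbind(coef_table, ci_90)
manual_wt
```

A 90% interval leaves 5% in each tail, which is why the sketch uses the 0.95 quantile of the $t$ distribution.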

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
#[write your code here]

#add_mrna_results <- ...
#Ftest_3genes_add_mrna <- ...

# your code here
fail() # No Answer - remove if you provide an answer

add_mrna_results
Ftest_3genes_add_mrna

In [None]:
test_1.3.0()
test_1.3.1()

**Question 1.4**
<br>{points: 2}

Compare the $p$-value for `mrna` in `add_mrna_results` with that reported in `Ftest_3genes_add_mrna`. What do you observe? Indicate the null hypotheses tested in each case and explain the results.

> *Your explanation of the results goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.5**
<br>{points: 1}

Using a **significance level $\alpha = 0.10$** and the results in `add_mrna_results`, which of the following claims is correct? 

**A.** `model.4`, which includes `mrna`, is significantly different from `model.3`.

**B.** There is not enough evidence that `model.4`, which includes `mrna` as a predictor, is significantly better than `model.3`.

**C.** `model.4`, which includes `mrna` as a predictor, is equivalent to `model.3` since the coefficient for `mrna` is not significantly different from zero.

**D.** The variable `mrna` is essential to predict protein levels.

*Assign your answer to an object called `answer1.5`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer1.5 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.5()