# Worksheet 06: Goodness of Fit and Nested Models

#### Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:

1. List model metrics that are suitable for evaluation of a statistical model developed to make inferences about the data-generating mechanism (e.g., $R^2$, $\text{AIC}$, Likelihood ratio test/$F$-test), their strengths and limitations, as well as how they are calculated.
2. Write a computer script to calculate these model metrics. Interpret and communicate the results from that computer script.
3. Explain how an $F$-test to compare nested models can be used as a variable selection method.

In [None]:
# Run this cell before continuing.
library(tidyverse)
library(repr)
library(broom)
library(leaps)
library(moderndive)
source("tests_worksheet_06.R")

# Model Evaluation of Generative Models

In this worksheet, you will learn different methods to evaluate and select appropriate models to make inferences about the data-generating mechanism. In other words, the main goal is to estimate and assess generative models.

## Case Study: protein vs mRNA

In this section, we will work with a biology case study to learn how to examine the goodness of the fitted model and choose among different nested models. The data and some of the discussions related to this case were published in Nature (see citations below).

In 2014, a research group claimed to have found a "predictive model" that can predict protein expression from mRNA expression.

> Wilhelm, M. et al. Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587 (2014)

Although their hypotheses were grounded in the Central Dogma of Biology, most experimental results have shown a very low correlation between protein and mRNA values.

Further examination of their analysis has shown that their data do not support their claims:

> Fortelny N, Overall CM, Pavlidis P, Freue GVC. Can we predict protein from mRNA levels? Nature. 2017 Jul 26;547(7664):E19-E20. doi: 10.1038/nature22293.

We'll use the data this group submitted to the journal to reanalyze it and evaluate different models using concepts learned in this course.

### Data

The technology used did not detect many proteins due to values below the detection limits, and the protein dataset contains many missing values. For simplicity, we will use a set of 1392 genes that were measured in all 12 tissues and thus contain complete data on both protein and mRNA sets. 

*Run the cell below to read the data and take a peek at it before continuing.*

In [None]:
# Read and take a look at the data.
dat_bio <- read.csv("data/nature_dat.csv", row.names = 1, stringsAsFactors=TRUE)
str(dat_bio)
head(dat_bio,3)
tail(dat_bio,3)

In this long-format dataset, we can think of the data as 1392 separate datasets, one for each gene.

> Each dataset contains 12 observations, one per tissue, and 2 variables, `prot` and `mrna`, along with an accession number for the gene (`gene`, an ID for each gene) and the name of the tissue (`tissue`, which works as an ID for each observation).

*Run the cell below to get the data for gene ENSG00000085733.* 

In [None]:
(dat_ENSG00000085733 <- 
    dat_bio %>%
    subset(gene == "ENSG00000085733"))

### Models and Estimation

In the paper, the authors used linear regression to estimate the relation between protein and mRNA levels *per gene*. They used that model to predict protein levels *per gene*.

> it will be important at a later phase of the analysis to note that models are *gene-specific*

**Gene-specific models**: for each gene, they estimated the following model (for simplicity, we do not use a subscript $g$ for gene)  

$$\texttt{prot}_{t} = \beta_1 \times \texttt{mrna}_{t} + \varepsilon_t$$ 

where $\texttt{prot}_{t}$ and $\texttt{mrna}_{t}$ are the protein and mRNA levels of a gene $g$ in tissue $t$, respectively. In this case, they estimated $\beta_1$ with the median protein-to-mRNA ratio across tissues, $\hat{\beta}_1 = \text{median}(\texttt{prot}_{t}\ / \ \texttt{mrna}_{t})$.

<font color="darkred"> Note that these models do not contain an intercept, and they were not estimated by LS.</font> 
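
To make the distinction concrete, here is a minimal sketch on simulated data (not the worksheet's data). The names `toy`, `beta_median_ratio`, and `beta_ls_origin` are made up for illustration, and the sketch assumes the ratio is taken as protein over mRNA, which is the form that matches a no-intercept model for `prot`. It contrasts the median-ratio slope with the LS slope of a no-intercept model; the two are generally different numbers.

```r
# Simulated toy data for one hypothetical gene measured in 12 tissues.
set.seed(1)
toy <- data.frame(mrna = runif(12, min = 1, max = 10))
toy$prot <- 2 * toy$mrna + rnorm(12, sd = 1)

# Median-ratio slope (assumed form: median of prot / mrna across tissues).
beta_median_ratio <- median(toy$prot / toy$mrna)

# LS slope of a no-intercept model, which equals sum(x * y) / sum(x^2).
beta_ls_origin <- unname(coef(lm(prot ~ 0 + mrna, data = toy)))

c(median_ratio = beta_median_ratio, ls_origin = beta_ls_origin)
```

Both estimators target a slope through the origin, but only the second one minimizes the residual sum of squares, so LS-based quantities such as $R^2$ do not automatically carry over to the median-ratio fit.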
    
While it's true that various models and estimation methods can be applied to the same dataset, it's crucial to assess the results in light of the underlying assumptions and the chosen methodology. 
    
In this worksheet, we will use this same dataset to estimate different models and evaluate their fit with the data.

### 1. Gene-specific SLR for one selected gene

For each gene $g$, we will use LS to estimate the following SLR:  

$$\texttt{prot}_{t} = \beta_0 +\beta_1  \times \texttt{mrna}_{t} + \varepsilon_t$$ 

where the coefficients $\beta_0$ and $\beta_1$ are estimated by LS. Note that we are including the intercept here.

**Question 1.0: Visualization of the estimated line**
<br>{points: 1}

Using data from gene ENSG00000085733 in `dat_ENSG00000085733`, make a scatterplot of `prot` versus `mrna` and add the estimated SLR line. Name the `ggplot()` object `SLR_ENSG00000085733_plot`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# Adjust these numbers so the plot looks good on your screen.
options(repr.plot.width = 8, repr.plot.height = 5) 

# SLR_ENSG00000085733_plot <- 
#    ggplot(..., aes(..., ...)) +
#    ...() +
#    ...(..., se = FALSE, linewidth = 1.5) +
#    xlab(...) +
#    ylab(...) +
#    ggtitle("Sample and Estimated SLR for gene ENSG00000085733") +
#    theme(text = element_text(size = 14))

# your code here
fail() # No Answer - remove if you provide an answer

SLR_ENSG00000085733_plot

In [None]:
test_1.0()

**Question 1.1**
<br>{points: 3}

For the selected gene:

- 1.1.0 Use `lm()` to obtain the LS estimated coefficients for the selected gene. Call the model `SLR_ENSG00000085733`. 

- 1.1.1 Use `tidy()` to store all inference quantities from this model in an object called `SLR_ENSG00000085733_results`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# SLR_ENSG00000085733 <- lm(...,...)
# SLR_ENSG00000085733_results <- ...(...)

# your code here
fail() # No Answer - remove if you provide an answer

SLR_ENSG00000085733_results

In [None]:
test_1.1.0()
test_1.1.1()

**Question 1.2**
<br>{points: 2}

1.2.0 For the selected gene, use `glance()` to obtain goodness of fit measurements and tests for the fitted model `SLR_ENSG00000085733`. Store the results in an object called `SLR_ENSG00000085733_gof`.

1.2.1 You will now write your own code to compute the following quantities given by `glance()`:

- Residual Sum of Squares (RSS)
- Total Sum of Squares (TSS)
- R-squared (R2)
- Adjusted R-squared (adjR2)
- Residual Standard Error (RSE)

*Your answer should be a tibble with one row and 5 columns: `RSS`, `TSS`, `R2`, `adjR2`, and `RSE`. Assign your answer to an object called `my_gof`.* 

> **Hints**: you can get the residuals by passing the estimated model to the `residuals()` function, and you can use the standard deviation of the observed protein values to compute the TSS.

Are your results the same as those computed by `glance()`? 
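
For reference, these quantities are usually defined as follows for an SLR with an intercept fitted by LS to $n$ observations. The notation $y_t$, $\hat{y}_t$, $\bar{y}$, and $s_y$ (observed response, fitted value, sample mean, and sample standard deviation of the response) is introduced only for this summary:

$$\text{RSS} = \sum_{t=1}^{n} (y_t - \hat{y}_t)^2, \qquad \text{TSS} = \sum_{t=1}^{n} (y_t - \bar{y})^2 = (n-1)\, s_y^2$$

$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}, \qquad R^2_{\text{adj}} = 1 - \frac{\text{RSS}/(n-2)}{\text{TSS}/(n-1)}, \qquad \text{RSE} = \sqrt{\frac{\text{RSS}}{n-2}}$$

The divisor $n - 2$ reflects the two estimated coefficients ($\beta_0$ and $\beta_1$) of the SLR.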

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# n_tissues <- ...

# SLR_ENSG00000085733_gof <- ...(...)

# my_gof <- 
#     ... %>% 
#     mutate(residual = ...) %>%
#     summarize(RSS = ...,
#               TSS = ...,
#               R2 = ..., 
#               adjR2 = ...,
#               RSE = ...)

# your code here
fail() # No Answer - remove if you provide an answer

SLR_ENSG00000085733_gof
my_gof

In [None]:
test_1.2.0()
test_1.2.1()

**Question 1.3**
<br>{points: 1}

The output of `glance()` includes a $p$-value corresponding to a hypothesis test that compares the proposed SLR $\texttt{prot}_t=\beta_0 + \beta_1 \texttt{mrna}_{t} + \varepsilon_t$ with a reduced (null) model. Which of the following models is the null model in this comparison?

**A.** $\texttt{prot}_t=\beta_0 + \varepsilon_t$ 

**B.** $\texttt{prot}_t=\beta_0 + \beta_1 \texttt{gene}_{t} + \varepsilon_t$ 

**C.** $\texttt{prot}_t=\beta_0 + \beta_1 \texttt{gene}_{t} + \beta_2 \texttt{mrna}_{t} + \varepsilon_t$ 

*Assign your answer to an object called `answer1.3`. Your answer should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes.*

In [None]:
# answer1.3 <-

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.3()

**Question 1.4**
<br>{points: 1}

Select the code(s) that can be used to compute the same test as that included in the output of `glance()`:

**A.** `anova(lm(prot ~ 1, dat_ENSG00000085733))`

**B.** `anova(lm(prot ~ 1, dat_ENSG00000085733), lm(prot ~ mrna, dat_ENSG00000085733))`

**C.** `tidy(lm(prot ~ mrna, dat_ENSG00000085733)) %>% filter(term == 'mrna') %>% pull(p.value)`

*Assign your answer to an object called `answer1.4`. Your answers have to be included in a single string indicating the correct options in alphabetical order and surrounded by quotes (e.g., `"ABC"` indicates you are selecting the three options).*

In [None]:
# answer1.4 <-

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.4()

**Question 1.5**
<br>{points: 1}

Based on the results of the goodness-of-fit measurements obtained for the estimated SLR for gene ENSG00000085733, select all the true claims:

**A.** These results suggest that for the selected gene, a gene-specific model explains less than 6% of the observed variation in protein abundance. 

**B.** A gene-specific model fits the data well. 

**C.** It now becomes possible to predict protein abundance in any given tissue with good accuracy from the measured mRNA for any gene.  

**D.** Using $5\%$ significance level, there is not enough evidence that the gene-specific SLR using `mrna` as an input is better than using the average protein level to predict protein abundances.

*Assign your answer to an object called `answer1.5`. Your answers have to be included in a single string indicating the correct options in alphabetical order and surrounded by quotes (e.g., `"ABCD"` indicates you are selecting the four options).*

In [None]:
# answer1.5 <-

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.5()

### F-test and t-test for a SLR

Note that in this simple case, the only difference between the reduced and the full model is the term with $\beta_1$. Therefore, the $F$-test is testing $H_0: \beta_1 = 0$, the same null hypothesis as the $t$-test for `mrna` reported by `tidy()`.

Note that the two $p$-values are the same. This is not a coincidence: when only one parameter is tested, $F = t^2$ and the two tests are equivalent.
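
The relationship between the two tests can be checked numerically. The following is a minimal sketch on simulated data (not the worksheet's data); the object names are made up for illustration. It extracts the $t$ statistic of the slope from the coefficient table and the $F$ statistic from the nested-model comparison with `anova()`.

```r
# Simulated toy data (not the worksheet's data).
set.seed(2)
x <- rnorm(12)
y <- 1 + 0.5 * x + rnorm(12)

fit_full <- lm(y ~ x)   # SLR with intercept and slope
fit_null <- lm(y ~ 1)   # intercept-only (null) model

# t statistic and p-value for the slope, from the coefficient table.
coef_table <- summary(fit_full)$coefficients
t_stat <- coef_table["x", "t value"]
p_t <- coef_table["x", "Pr(>|t|)"]

# F statistic and p-value from the nested-model comparison.
f_table <- anova(fit_null, fit_full)
f_stat <- f_table$F[2]
p_f <- f_table$`Pr(>F)`[2]

c(t_squared = t_stat^2, F = f_stat, p_t = p_t, p_F = p_f)
```

The squared $t$ statistic and the $F$ statistic agree up to floating-point error, and so do the two $p$-values.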

Remember that this is **only 1** gene in the dataset; there are 1391 more genes. Results may be different for other genes.

### 2. Many gene-specific SLR models

**Question 2.0**
<br>{points: 1}

In this problem, you will: 

- fit 1392 SLRs, one for each gene in the dataset, using `lm()` and `group_by()`

- compute the $R^2$ for each gene and estimated SLR using the function `glance()`

- select the columns `gene`, `r.squared` and `p.value`. Store them in an object called `summary_gof`

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# dat_glance <- 
#     dat_bio %>% 
#     group_by(...) %>% 
#     do(glance(...(... ~ ..., data = .))) 

# summary_gof <- 
#     dat_glance %>% 
#     select(..., ..., ...)

# your code here
fail() # No Answer - remove if you provide an answer

tail(summary_gof, 8)

In [None]:
test_2.0()

#### Visualization of results 

We have computed 1392 coefficients of determination, one for each *gene-specific* fitted model. Let's visualize the results using a histogram of the coefficients of determination.

In [None]:
hist_slr_r2 <- 
    summary_gof  %>% 
    ggplot(aes(x=r.squared)) + 
    geom_histogram(color="white", bins=20) + 
    geom_vline(xintercept=median(summary_gof$r.squared, na.rm=T),color="red") +
    labs(title = "R² of gene-specific LS models",
         x = "R²",
         y = "Count") +
    xlim(0, 1)

hist_slr_r2

**Question 2.1**
<br>{points: 1}

In the following claims, a "gene-specific model" refers to an SLR using `mrna` as an input variable and estimated by LS for each gene.

Select all valid claims based on the $R^2$ computed for each gene.

**A.** These results suggest that the quality of the proposed models varies greatly across genes. A gene-specific model explains more than 80% of the observed variation in protein abundance for some genes. However, for more than half of the genes, it explains less than 15% of the observed variation in protein levels.

**B.** For the majority of the genes, a gene-specific model fits the data well. 

**C.** With the suggested gene-specific models, it now becomes possible to predict protein abundance in any given tissue with good accuracy from the measured mRNA for any gene.  

**D.** For gene ENSG00000262246, the gene-specific model explains approximately 87% of the observed variation in protein abundance, making `mrna` statistically significant.

*Assign your answer to an object called `answer2.1`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer2.1 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.1()

**Question 2.2**
<br>{points: 1}

The $R^2$ provides a measure for the goodness-of-fit for each model but is not a statistical test. In this exercise, you will examine the $p$-values from the $F$-tests for each gene. Each of these tests compares the full model with an intercept-only model. For these simple particular models, the null hypothesis is:

$H_0: \beta_1 = 0$

Plot a histogram of the $p$-values from all gene-specific SLRs. Add a solid blue vertical line at the median of the $p$-values and a dashed red vertical line at the threshold value 0.05.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
#hist_slr_F <- summary_gof  %>% 
#  ggplot(aes(x = ...)) + 
#  geom_...(color = "white", binwidth = 0.05, boundary = 0) + 
#  geom_...(xintercept = median(summary_gof$..., na.rm = T), color = "blue") +
#  ...(xintercept = ..., linetype = 2, color = "red") +
#  labs(
#    title = "F-test of gene-specific SLR models",
#    x = "pvalue",
#    y = "Count")+
#  xlim(0, 1)

# your code here
fail() # No Answer - remove if you provide an answer

hist_slr_F

In [None]:
test_2.2()

**Question 2.3**
<br>{points: 1}

Looking at the histogram in *Question 2.2* we conclude that for the majority of the genes, a gene-specific model with `mrna` as an input is significantly better than an intercept-only model.

TRUE or FALSE?

*Assign your answer to an object called `answer2.3`. Your answer should be either `"true"` or `"false"`, surrounded by quotes.*

In [None]:
# answer2.3 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.3()

### 3. Other LR models

The authors claimed that it is fundamental to consider different models for different genes. In the previous exercises, you fit different models, grouping the data by levels of `gene`. However, in previous worksheets, we noticed that we can fit different LRs for each level of a categorical variable simultaneously, adding dummy variables in the model. For simplicity, we will use a dataset with only 3 genes, called `dat_3genes`.  

Let's first outline different models that can be considered to model data from different genes.

**Question 3.0**
<br>{points: 1}

Consider the following models:

- model.1: $\text{prot}_t=\beta_0 + \varepsilon_t$ 

- model.2:  $\text{prot}_t=\beta_0 + \beta_1 \text{mrna}_{t} + \varepsilon_t$ 

- model.3:  $\text{prot}_t=\beta_0 + \beta_2 \text{gene2}_{t} + \beta_3 \text{gene3}_{t} + \varepsilon_t$ 

- model.4:  $\text{prot}_t=\beta_0 + \beta_1 \text{mrna}_{t} + \beta_2 \text{gene2}_{t} + \beta_3 \text{gene3}_{t} + \varepsilon_t$ 

- model.5:  $\text{prot}_t=\beta_0 + \beta_1 \text{mrna}_{t} + \beta_2 \text{gene2}_{t} + \beta_3 \text{gene3}_{t} + \beta_4 \text{gene2}_{t}\text{mrna}_{t} + \beta_5 \text{gene3}_{t}\text{mrna}_{t} + \varepsilon_t$ 

In previous worksheets, you've learned how to fit and interpret these models. Match the equations to the codes you can use to estimate these models using `dat_3genes`:

**A.** `lm(prot ~ mrna * gene, dat_3genes)`

**B.** `lm(prot ~ gene, dat_3genes)`

**C.** `lm(prot ~ mrna + gene, dat_3genes)`

**D.** `lm(prot ~ 1, dat_3genes)`

**E.** `lm(prot ~ mrna, dat_3genes)`


*Assign your answers to the objects `model.1`, `model.2`, `model.3`, `model.4`, and `model.5`. Your answer should each be a single character (`"A"`, `"B"`, `"C"`, `"D"`, or `"E"`) surrounded by quotes.*

In [None]:
# model.1 <- ...
# model.2 <- ...
# model.3 <- ...
# model.4 <- ...
# model.5 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.0()

Let's randomly select 3 genes to run some analyses:

In [None]:
#run this cell
set.seed(561)
dat_3genes <- 
    dat_bio %>%
    filter(gene %in% sample(gene, 3)) 

**Question 3.1**
<br>{points: 2}

Using the data from the 3 selected genes in `dat_3genes`, fit a model with interaction terms (`model.5`). Store the results in an object called `mlr_3genes_int`.

Use `tidy()` to obtain a table with results from the LS estimation and inference, call the output `mlr_3genes_int_results`.

Use `glance()` to obtain goodness-of-fit for this model; call the output `mlr_3genes_int_gof`. 

In [None]:
#mlr_3genes_int <- ...(...~ ..., data = ...)

#mlr_3genes_int_results <- tidy(...)

# mlr_3genes_int_gof <- 
#     glance(...) %>%
#     round(3)

# your code here
fail() # No Answer - remove if you provide an answer

mlr_3genes_int_results
mlr_3genes_int_gof

In [None]:
test_3.1.0()
test_3.1.1()

**Question 3.2**
<br>{points: 1}

The output of `glance()` for the model with interaction includes the $p$-value of a test of hypothesis. Which null hypothesis is tested?

**A.** $H_0: \beta_2 = \beta_3 = 0$ vs. $H_1: \text{at least one } \beta_j \neq 0 \text{ (for } j = 2, 3 \text{)}$

**B.** $H_0: \beta_4 = \beta_5 = 0$ vs. $H_1: \text{at least one } \beta_j \neq 0 \text{ (for } j = 4, 5 \text{)}$

**C.** $H_0: \beta_0 = \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$ vs. $H_1: \text{at least one } \beta_j \neq 0 \text{ (for } j = 0, 1, 2, 3, 4, 5 \text{)}$

**D.** $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$ vs. $H_1: \text{at least one } \beta_j \neq 0 \text{ (for } j = 1, 2, 3, 4, 5 \text{)}$

**E.** $H_0: \hat{\beta}_2 = \hat{\beta}_3 = 0$ vs. $H_1: \text{at least one } \hat{\beta}_j \neq 0 \text{ (for } j = 2, 3 \text{)}$

*Assign your answer to an object called `answer3.2`. Your answer should be one of `"A"`, `"B"`, `"C"`, `"D"` or `"E"` surrounded by quotes.*

In [None]:
# answer3.2 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.2()

**Question 3.3**
<br>{points: 1}

Use the function `anova()` to reproduce the test in **Question 3.2** using `dat_3genes`. 

Store the $F$-test and its corresponding $p$-value in an object called `Ftest_3genes_interaction`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# Ftest_3genes_interaction  <- ...(..., ...)  

# your code here
fail() # No Answer - remove if you provide an answer

Ftest_3genes_interaction 

In [None]:
test_3.3()

**Question 3.4**
<br>{points: 1}

Which of the following claims is correct based on the results from *Question 3.3*?

**A.** The three linear regressions are statistically significant.

**B.** There is enough evidence to reject the null hypothesis that the model with interaction terms (`model.5`) is equivalent to an intercept-only model (`model.1`)

**C.** `mrna` is statistically significant.


*Assign your answer to an object called `answer3.4`. Your answer should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes.*

In [None]:
# answer3.4 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.4()

### How does the interaction model fit the data?

Results from *Question 3.1* suggest that `model.5` does not fit the data well, even if it is better than the intercept-only model. The model explains only 44% of the variation observed in protein levels. While the $p$-value from `glance()` indicates there is enough evidence to reject $H_0$, it still doesn't mean that `mrna` is a relevant variable. Let's compare this model with other models. 

To examine that question, let's compare the following nested models:

$$\textbf{reduced}:\text{prot}_t=\beta_0 + \beta_2 \text{gene2}_{t} + \beta_3 \text{gene3}_{t} + \varepsilon_t$$

$$\textbf{full}:\text{prot}_t = \beta_0 + \beta_1 \text{mrna}_{t} + \beta_2 \text{gene2}_{t}  + \beta_3 \text{gene3}_{t}  + \beta_4 \text{gene2}_{t}  \text{mrna}_{t} + \beta_5 \text{gene3}_{t}  \text{mrna}_{t} +\varepsilon_t$$

**Question 3.5**
<br>{points: 1}

An $F$-test can be used to test *simultaneously* whether the additional parameters in the full model are zero. Which hypotheses need to be tested using an $F$-test?

**A.** $H_0: \beta_2 = \beta_3 = 0$ vs. $H_1: \text{at least one } \beta_j \neq 0 \text{ (for } j = 2, 3 \text{)}$

**B.** $H_0: \beta_1 = \beta_4 = \beta_5 = 0$ vs. $H_1: \text{at least one } \beta_j \neq 0 \text{ (for } j = 1, 4, 5 \text{)}$

**C.** $H_0: \beta_0 = \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$ vs. $H_1: \text{at least one } \beta_j \neq 0 \text{ (for } j = 0, 1, 2, 3, 4, 5 \text{)}$

**D.** $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$ vs. $H_1: \text{at least one } \beta_j \neq 0 \text{ (for } j = 1, 2, 3, 4, 5 \text{)}$

**E.** $H_0: \hat{\beta}_2 = \hat{\beta}_3 = 0$ vs. $H_1: \text{at least one } \hat{\beta}_j \neq 0 \text{ (for } j = 2, 3 \text{)}$

*Assign your answer to an object called `answer3.5`. Your answer should be one of `"A"`, `"B"`, `"C"`, `"D"` or `"E"` surrounded by quotes.*

In [None]:
# answer3.5 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.5()

**Question 3.6**
<br>{points: 1}

The hypotheses in *Question 3.5* compare the reduced model (`model.3`, without `mrna`) and the full model (`model.5`, with interaction) stated above.

Select the code that you can use to test these hypotheses.

**A.** `tidy(lm(prot ~ gene * mrna, dat_3genes))`

**B.** `tidy(lm(prot ~ gene + mrna, dat_3genes))`

**C.** `glance(lm(prot ~ gene * mrna, dat_3genes))`

**D.** `glance(lm(prot ~ gene + mrna, dat_3genes))`

**E.** `anova(lm(prot ~ gene, dat_3genes), lm(prot ~ gene * mrna, dat_3genes))`


*Assign your answer to an object called `answer3.6`. Your answer should be one of `"A"`, `"B"`, `"C"`, `"D"`, or `"E"` surrounded by quotes.*

In [None]:
# answer3.6 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.6()

**Question 3.7**
<br>{points: 1}

Use the function `anova()` to run the test from *Question 3.5*. Store your results in an object called `Ftest_3genes_mrna`. Note: `Ftest_3genes_mrna` contains the output of the `anova()` function. 

*Write your code in the cell below and run it.*

In [None]:
#[write your code here]

# your code here
fail() # No Answer - remove if you provide an answer

Ftest_3genes_mrna

In [None]:
test_3.7()

#### Conclusions from the analysis

While the authors claimed that "*it now becomes possible to predict protein abundance in any given tissue with good accuracy from the measured mRNA abundance*" as long as we consider different models for different genes, the test above does not provide enough evidence to support this hypothesis (at least for the 3 genes selected). Note that the complex model with a different LR for each gene is not significantly different from a model without mRNA! Most of the variation in the data seems to be explained by the variable `gene`.

## Conclusions

#### Evaluation of Models when the main goal is estimation and inference:

- The $R^2$, or coefficient of determination, can be used to compare the sum of squares of the residuals of the fitted model with that of the null model


- The $R^2$ is usually interpreted as the part of the variation in the response explained by the model


- Many definitions and interpretations of the $R^2$ are for LS estimators of LR containing an intercept


- The $R^2$ is not a test and does not provide a probabilistic result, and its distribution is unknown.

- Instead, we can use an $F$ test, also referred to as ANOVA, to compare nested models

    - tests the simultaneous significance of additional coefficients of the full model (not in the reduced model)
    
    - in particular, we can use it to test the significance of the fitted model over the null model
    

- These $F$ tests can be used to select variables, since they compare and test how the fit changes as variables are added to or removed from the model
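
The nested-model $F$ test summarized in the bullets above can be sketched by hand. This is a minimal illustration on simulated data (not the worksheet's data), with made-up names such as `toy`, `fit_reduced`, and `fit_full`. It computes the $F$ statistic from the two residual sums of squares and compares it with the output of `anova()`.

```r
# Simulated toy data: one numeric input and a 3-level factor (not the worksheet's data).
set.seed(3)
n <- 36
toy <- data.frame(
  x = rnorm(n),
  g = factor(rep(c("a", "b", "c"), each = n / 3))
)
toy$y <- 1 + as.numeric(toy$g) + rnorm(n)  # y depends on g only, not on x

fit_reduced <- lm(y ~ g, data = toy)      # 3 coefficients
fit_full    <- lm(y ~ g * x, data = toy)  # 6 coefficients

rss_reduced <- sum(residuals(fit_reduced)^2)
rss_full    <- sum(residuals(fit_full)^2)

k       <- length(coef(fit_full)) - length(coef(fit_reduced))  # extra coefficients
df_full <- df.residual(fit_full)                                # n - 6

# F = [(RSS_reduced - RSS_full) / k] / [RSS_full / df_full]
f_by_hand <- ((rss_reduced - rss_full) / k) / (rss_full / df_full)
p_by_hand <- pf(f_by_hand, df1 = k, df2 = df_full, lower.tail = FALSE)

# anova() reports the same statistic and p-value in its second row.
f_anova <- anova(fit_reduced, fit_full)

c(F_by_hand = f_by_hand, F_anova = f_anova$F[2],
  p_by_hand = p_by_hand, p_anova = f_anova$`Pr(>F)`[2])
```

The statistic measures how much the RSS drops per additional coefficient, relative to the residual variance estimated from the full model.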

#### Evaluation of Models when the main goal is prediction:

- The test MSE is a natural measure to compare new responses from a test set with the predicted values $\hat{y}_i$ from the LR estimated with training data

- The $R^2$ based on test data can also be used, but it should not be called a "coefficient of determination". 
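
As an illustration of these two prediction metrics, here is a minimal sketch on simulated data (not the worksheet's data). The 70/30 split, the object names, and the particular formula used for the test-set $R^2$ are assumptions made for this example; other conventions for a test-set $R^2$ exist.

```r
# Simulated toy data split into training and test sets (not the worksheet's data).
set.seed(4)
n <- 100
toy <- data.frame(x = rnorm(n))
toy$y <- 2 + 3 * toy$x + rnorm(n)

train_idx <- sample(n, size = 70)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Estimate the LR with the training data only.
fit <- lm(y ~ x, data = train)
pred_test <- predict(fit, newdata = test)

# Test MSE: average squared prediction error on data not used for fitting.
test_mse <- mean((test$y - pred_test)^2)

# One possible convention for a test-set R^2 (an assumption here):
# 1 - SSE_test / SS_test, with SS_test taken around the test-set mean.
# It can be negative, unlike the training R^2 of an LS fit with an intercept.
test_r2 <- 1 - sum((test$y - pred_test)^2) / sum((test$y - mean(test$y))^2)

c(test_mse = test_mse, test_r2 = test_r2)
```

Because the test-set version can fall below zero, it no longer has the "proportion of variation explained" interpretation of the in-sample coefficient of determination.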