# Tutorial 7: Model Evaluation and Model Selection

#### Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:

1. List model metrics that are suitable for evaluation of a statistical model developed to make inference about the data-generating mechanism (e.g., $R^2$, $\text{AIC}$, Likelihood ratio test/$F$-test), their strengths and limitations, as well as how they are calculated.
2. Write a computer script to calculate these model metrics. Interpret and communicate the results from that computer script.
3. Explain a variable selection method based on:
    - $F$-test to compare nested models.
    - RSS for models of equal size.
    - Adjusted $R^2$ for models of different sizes.

In [None]:
# Run this cell before continuing.
library(tidyverse)
library(repr)
library(digest)
library(infer)
library(gridExtra)
library(faraway)
library(broom)
library(leaps)
library(mltools)
source("tests_tutorial_07.R")

## Can we predict protein from mRNA?

In worksheet_07 you studied the significance of `mrna` and analyzed the goodness-of-fit of some models. However, there are other models that can be compared. For example, are interaction terms needed, or is an additive model sufficient?

Consider the following models using a dataset with 3 randomly selected genes:

- model.1: $\text{prot}_t=\beta_0 + \varepsilon_t$ 

- model.2:  $\text{prot}_t=\beta_0 + \beta_1 \text{mrna}_{t} + \varepsilon_t$ 

- model.3:  $\text{prot}_t=\beta_0 + \beta_2 \text{gene2}_{t} + \beta_3 \text{gene3}_{t} + \varepsilon_t$ 

- model.4:  $\text{prot}_t=\beta_0 + \beta_1 \text{mrna}_{t} + \beta_2 \text{gene2}_{t} + \beta_3 \text{gene3}_{t} + \varepsilon_t$ 

- model.5:  $\text{prot}_t=\beta_0 + \beta_1 \text{mrna}_{t} + \beta_2 \text{gene2}_{t} + \beta_3 \text{gene3}_{t} + \beta_4 \text{gene2}_{t}\text{mrna}_{t} + \beta_5 \text{gene3}_{t}\text{mrna}_{t} + \varepsilon_t$ 
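
As a hedged illustration of how nested models of this kind can be written and compared in R, the sketch below uses the built-in `mtcars` data rather than the protein data. Here `mpg`, `wt`, and `cyl` merely stand in for `prot`, `mrna`, and `gene`, and all object names (`cars_additive`, `cars_full`, `cars_Ftest`) are hypothetical.

```r
# Illustration only: built-in mtcars data, not the tutorial's dataset.
cars <- mtcars
cars$cyl <- factor(cars$cyl)  # categorical input with 3 levels (4, 6, 8)

# Additive model: one common slope for wt, a different intercept per cyl level.
cars_additive <- lm(mpg ~ wt + cyl, data = cars)

# Full model: the slope for wt is also allowed to differ per cyl level.
cars_full <- lm(mpg ~ wt * cyl, data = cars)

# F-test of H0: all interaction coefficients are zero (reduced model goes first).
cars_Ftest <- anova(cars_additive, cars_full)
cars_Ftest
```

The second row of the `anova()` table reports the $F$-statistic and its $p$-value; the `Df` column counts the extra coefficients in the full model (here, the two interaction terms).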

In [None]:
# Read and take a look at the data.
dat_bio <- read.csv("data/nature_dat.csv", row.names = 1, stringsAsFactors=TRUE)
str(dat_bio)
head(dat_bio,3)
tail(dat_bio,3)

In [None]:
#run this cell
set.seed(561)
dat_3genes <- dat_bio  %>%  
         subset(gene %in% sample(gene,3)) 

**Question 1.0**
<br>{points: 1}

Test if the model with interaction terms (`model.5`) is significantly different from an additive one (`model.4`). Note that both models have `mrna` but the full model assumes that the change in protein levels per unit change in mRNA levels is different for each gene.

Store your results in an object called `Ftest_3genes_add_full`.

*Write your own code and run it.*

In [None]:
#[write your code here]
#Ftest_3genes_add_full

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.0()

**Question 1.1**
<br>{points: 1}

Using a significance level $\alpha = 0.05$ and the results in `Ftest_3genes_add_full`, in plain words, what is the conclusion from the results of the test run in **Question 1.0**?

**A.** We reject the null hypothesis; thus, the *full* model is significantly better than the *reduced* model.

**B.** We fail to reject the null hypothesis; thus, there is not enough evidence that the *full* model with additional interaction terms is better than the additive (reduced) model.

**C.** We accept the alternative hypothesis; thus, the *full* model is significantly better than the *reduced* model.

**D.** We do not accept the alternative hypothesis; thus, the *full* model with additional interaction terms is not better than the *reduced* model.

*Assign your answer to an object called `answer1.1`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer1.1 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.1()

#### Assessing mRNA in the additive model

As a final test, let's examine the significance of `mrna` in the additive model.

**Question 1.2**
<br>{points: 2}

Compare the additive model with mRNA as an input and distinct intercepts per gene  (`model.4`) with a model without `mrna` and only the categorical variable `gene` as input variable (`model.3`). Note that the second model predicts protein levels with the average protein level within each gene!!

Use the function `tidy()` to obtain a summary of the additive model. Include the corresponding asymptotic 90% confidence intervals. Store the results in an object called `add_mrna_results`.

Use the function `anova()` to compare these models and store the results in an object called `Ftest_3genes_add_mrna`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
#[write your code here]

#add_mrna_results <- ...
#Ftest_3genes_add_mrna <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.2.0()
test_1.2.1()

**Question 1.3**
<br>{points: 2}

Compare the $p$-value for `mrna` in `add_mrna_results` with that reported in `Ftest_3genes_add_mrna`. What do you observe? Indicate the null hypotheses tested in each case and explain the results.

> *Your explanation of the results goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.4**
<br>{points: 1}

Using a **significance level $\alpha = 0.10$** and the results in `add_mrna_results`, which of the following claims is correct? 

**A.** The `model.4` that includes `mrna` is significantly different from `model.3`.

**B.** There is not enough evidence that the `model.4` that includes `mrna` as a predictor is significantly better than `model.3`.

**C.** The `model.4` that includes `mrna` as a predictor is equivalent to `model.3` since the coefficient for `mrna` is not significantly different from zero.

**D.** The variable `mrna` is essential to predict protein levels.

*Assign your answer to an object called `answer1.4`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer1.4 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.4()

**Question 1.5**
<br>{points: 4}

a) We can use the adjusted $R^2$ to compare the goodness-of-fit of `model.5` and `model.4` to conclude which one fits the data better. Use the function `glance()` to obtain these values and discuss the results obtained. 

b) Compare the $R^2$ for both models and explain why that of `model.5` is larger.

In [None]:
# Your code and numerical results go here. We will grade this cell manually

# your code here
fail() # No Answer - remove if you provide an answer

> *Your explanation of the results goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

# 2. Model selection

In this section you will use the **backward selection** algorithm to construct a generative model. 

> Note that the choice of which algorithm to use in each case was arbitrary and you can play with these algorithms and try other choices!

Let us start by loading the dataset to be used throughout this tutorial. We will use the dataset `fat` from the library `faraway`. You can find detailed information about it in [Johnson (1996)](https://www.tandfonline.com/doi/full/10.1080/10691898.1996.11910505). This dataset contains the percentage of body fat and a variety of body measurements (continuous variables) of 252 men. We will use the variable `brozek` as the response variable and a subset of 14 variables to build different models.

The response variable `brozek` is the percent of body fat using Brozek's equation:

$$\texttt{brozek} = \frac{457}{\texttt{density}} - 414.2,$$

where body `density` is measured in $\text{g}/\text{cm}^3$.

The 14 input variables are:

- `age`: Age in $\text{years}$.
- `weight`: Weight in $\text{lb}$.
- `height`: Height in $\text{in}$.
- `adipos`: Adiposity index in $\text{kg}/\text{m}^2$.

$$\texttt{adipos} = \frac{\texttt{weight}}{\texttt{height}^2}$$

- `neck`: Neck circumference in $\text{cm}$.
- `chest`: Chest circumference in $\text{cm}$.
- `abdom`: Abdomen circumference at the umbilicus and level with the iliac crest in $\text{cm}$.
- `hip`: Hip circumference in $\text{cm}$.
- `thigh`: Thigh circumference in $\text{cm}$.
- `knee`: Knee circumference in $\text{cm}$.
- `ankle`: Ankle circumference in $\text{cm}$.
- `biceps`: Extended biceps circumference in $\text{cm}$.
- `forearm`: Forearm circumference in $\text{cm}$.
- `wrist`: Wrist circumference distal to the styloid processes in $\text{cm}$.

Run the code below to create the working data frame called `fat_sample`.

In [None]:
# run this cell

fat_sample <- fat %>%
  select(
    brozek, age, weight, height, adipos, neck, chest, abdom,
    hip, thigh, knee, ankle, biceps, forearm, wrist
  )

head(fat_sample,3)

### Selecting a generative model

Although many potential input variables are available in the dataset, not all may be relevant to explain the variation of the response variable. The *subset* algorithms learned in the lecture can be used to select a subset of variables to build generative models.

Generative models are built and trained to examine the association between the response and the input variables. 

Since the same data cannot be used both to select a model and to make inference, we need 2 different datasets: a *selection* set and a *training* set. If two independent datasets are not available to select and build a generative model, we can split the data at hand to create these datasets.

> Using the same data to select and to estimate violates the assumptions of the analysis and invalidates the inference results. This problem is known as the *post-selection inference* problem and will be discussed further in future lectures.
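
A random split of this kind can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses the built-in `mtcars` data and base R indexing (not the `sample_n()` and `anti_join()` route used in the question that follows), and the seed and object names are arbitrary choices.

```r
# Illustration only: built-in mtcars data, base R instead of dplyr.
set.seed(2024)  # arbitrary seed, used here only for reproducibility

n_total <- nrow(mtcars)
n_first <- floor(0.70 * n_total)  # size of the larger piece

# Draw row indices without replacement for the first piece.
first_rows <- sample(seq_len(n_total), size = n_first, replace = FALSE)

cars_first <- mtcars[first_rows, ]    # about 70% of the rows
cars_second <- mtcars[-first_rows, ]  # the remaining rows

# The two pieces share no rows and together cover the data.
c(nrow(cars_first), nrow(cars_second))
```

Because the indices are drawn without replacement and the second piece is defined by exclusion, no observation can appear in both sets.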

In the following questions, we will use the *backward* selection algorithm and the *adjusted* $R^2$ to select a smaller model. 
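
For reference, the two goodness-of-fit measures used in this tutorial are commonly defined as follows. The notation is an assumption made here for illustration: $n$ is the number of observations, $p$ the number of input variables (not counting the intercept), $\text{RSS}$ the residual sum of squares, and $\text{TSS}$ the total sum of squares.

$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}, \qquad R^2_{\text{adj}} = 1 - \frac{\text{RSS}/(n - p - 1)}{\text{TSS}/(n - 1)} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$

The factor $(n-1)/(n-p-1)$ grows with $p$, so the adjusted $R^2$ can decrease when an added variable reduces the $\text{RSS}$ only slightly.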

**Question 2.0**
<br>{points: 1}

Let's start by randomly splitting `fat_sample` in two sets on a 70-30% basis: `training_fat` (70% of the data) and `second_set_fat` (the remaining 30%). The selection will be done using the `second_set_fat` dataset and the model will be built using the `training_fat` data.

Follow the next 4 steps to complete the code below:

1. Create an `ID` column in `fat_sample` (i.e., `fat_sample$ID`) with the row number corresponding to each man in the sample.

2. Use the function `sample_n()` to create `training_fat` (sampling *without* replacement) with 70\% of the observations coming from `fat_sample`.

3. Use `anti_join()` with `fat_sample` and `training_fat` to create `second_set_fat` by column `ID`.

4. Remove the variable `ID` used to split the data.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(123) # DO NOT CHANGE!

# fat_sample$ID <- rownames(fat_sample)
# training_fat <- ...(..., size = nrow(fat_sample) * 0.70,
#   replace = ...
# )

# second_set_fat <- anti_join(...,
#   ...,
#   by = ...
# )

# training_fat <- training_fat  %>% select(-"ID")
# second_set_fat <- second_set_fat %>% select(-"ID")

# head(training_fat)
# nrow(training_fat)

# head(second_set_fat)
# nrow(second_set_fat)

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.0()

**Question 2.1**
<br>{points: 1}

Using only the extra data in `second_set_fat`, select a reduced linear regression (LR) model using the **backward selection** algorithm. Recall that this method is implemented in the function `regsubsets()` from the library `leaps`.

The function `regsubsets()` identifies various subsets of input variables selected for models of different sizes. The argument `x` of `regsubsets()` is analogous to `formula` in `lm()`. 
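
As a rough sketch of how the output of `regsubsets()` can be read, the example below runs backward selection on the built-in `mtcars` data (an assumption for illustration, not the tutorial's data) and picks the model size with the largest adjusted $R^2$. The object names are hypothetical.

```r
library(leaps)

# Illustration only: built-in mtcars data with mpg as the response.
cars_backward_sel <- regsubsets(
  x = mpg ~ .,          # formula, as in lm()
  nvmax = 10,           # allow models with up to all 10 input variables
  data = mtcars,
  method = "backward"
)

cars_bwd_summary <- summary(cars_backward_sel)

# Size of the model with the largest adjusted R^2.
best_size <- which.max(cars_bwd_summary$adjr2)

# Row of the logical inclusion matrix for that size, dropping the intercept.
in_best <- cars_bwd_summary$which[best_size, -1]
names(in_best)[in_best]
```

The `which` component of the summary is a logical matrix with one row per model size, so the last line lists the input variables included in the chosen model.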

Create one object using `regsubsets()` with `second_set_fat` and call it `fat_backward_sel`. We will use `fat_bwd_summary_df` to check the results.

**Maintain any ordering of columns seen in `second_set_fat`**

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_backward_sel <- ...(
#   x=..., 
#   nvmax=...,
#   data=...,
#   method=...
# )
# fat_backward_sel

#fat_bwd_summary <- summary(fat_backward_sel)

#fat_bwd_summary_df <- data.frame(
#    n_input_variables = 1:14,
#    RSQ = fat_bwd_summary$rsq,
#    RSS = fat_bwd_summary$rss,
#    ADJ.R2 = fat_bwd_summary$adjr2
#)

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.1()

**Question 2.2**
<br>{points: 1}

The *backward* subset algorithm selected the best model of each size. Results of the 14 models selected are stored in `fat_bwd_summary`. 

Use the *adjusted* $R^2$ of these 14 models, stored in `fat_bwd_summary_df`, to select the best generative model and indicate which input variables are in the selected model.

**A.** `age`.

**B.** `weight`.

**C.** `height`.

**D.** `adipos`.

**E.**  `neck`.

**F.**  `chest`.

**G.**  `abdom`.

**H.**  `hip`.

**I.**  `thigh`.

**J.**  `knee`.

**K.**  `ankle`.

**L.**  `biceps`.

**M.**  `forearm`.

**N.**  `wrist`.

*Assign your answers to the object `answer2.2`. Your answers have to be included in a single string indicating the correct options **in alphabetical order** and surrounded by quotes.*

In [None]:
#Run this cell before continuing to examine the results

fat_bwd_summary_df

fat_bwd_summary 

In [None]:
# answer2.2 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.2()

**Question 2.3**
<br>{points: 1}

Now that you have selected a subset of input variables, use the independent dataset `training_fat` to build and evaluate a *generative* model. 

Use `lm` to fit the selected model using `training_fat`, and store the results in an object called `fat_bwd_generative`. 

> Enter the selected variables in the **same order** as they are in `training_fat`. This is not statistically necessary; it is only needed to autograde this question.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_bwd_generative <- ...(...,
#   ...
# )

# tidy(fat_bwd_generative)

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.3()

**Question 2.4**
<br>{points: 1}

Compute the coefficient of determination $R^2$ to evaluate the goodness of fit of the model.

> Note that the evaluation is also based on data from `training_fat`.

*Assign your answer to the object `answer2.4`. Your answer is a numeric object.*

In [None]:
# *Your code goes here.*

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.4()

**Question 2.5**
<br>{points: 1}

Interpret the coefficient of determination $R^2$ computed in **Question 2.4** and comment on the goodness-of-fit of the selected model.

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 2.6**
<br>{points: 2}

Previous research has shown that while weight can be highly variable during the day and even across days, body circumference measurements (e.g., abdominal circumference) are more stable and better predictors of body fat. Using the results from **Question 2.3**, assess whether abdominal circumference, `abdom`, is statistically (linearly) associated with the percent of body fat measured by `brozek` (at a significance level of 0.01).

In your answer, include an interpretation of the estimated coefficients as well as the results of the $t$-tests reported using `tidy()`.

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.