# Worksheet 07: Goodness of Fit beyond MLR and Stepwise Selection

#### Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:

1. List model metrics that are suitable for evaluation of a statistical model developed to make inferences about the data-generating mechanism (e.g., $R^2$, $\text{AIC}$, Likelihood ratio test/$F$-test), their strengths and limitations, as well as how they are calculated.
2. Identify appropriate goodness-of-fit metrics for MLR, logistic and Poisson regressions.
3. Compute appropriate residuals of logistic and Poisson regressions.
4. Explain how an $F$-test to compare nested models can be used as a variable selection method.
5. Write a computer script to calculate these model metrics. Interpret and communicate the results from that computer script.

In [None]:
# Run this cell before continuing.

library(gridExtra)
library(faraway)
library(broom)
library(tidyverse)
library(leaps)
source("tests_worksheet_07.R")

# 1. Goodness-of-fit beyond MLR

In this section, you will evaluate a logistic regression model using appropriate goodness-of-fit measures.

#### Dataset

You will first use the data frame `breast_cancer` to build and evaluate a logistic regression. 

This dataset will be a subset of the Wisconsin Diagnostic Breast Cancer dataset ([Mangasarian et al., 1995](http://ubc.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwlV1Nb9QwEB2xPSA4tHQLohRKDoDgsDSJndiRKlApVBx74PNk2bGDKui2jbf8Ff4uM46tbpZKFZdIO55NvNLLeLx-8waAla_z2UpMEAZXtlZKzllZGSk7WTBdtDa3ts6N61aoOnUqjSGWZaAJhkN9zJfML7dHCi-yfnt-MaPmUXTIGjtpTGAi2cDr-rKkvFsPbQwYBpyafxsvQMTDbPEOVxE5yQSH8iZKHH2iKv4TrsMadLQBKk03kU9WagPHAo___7vuwXpMT7ODAU-bcMvNp3A7seOnsJG6QGQxKEzh7pKk4RQ2o91nL6Oi9ast-LN_qvufb94RAX6xvxc-ZIPtkFDXj23vB_rfiR-b9dxmx_3ZdUO_T_TYgFtsfIXHtuOBinaK84wD9-Hz0YdPhx9nsSPEDPeBJWmp6pJ3VS1tzlvR8Ua6WljWlm3RaMkMs52R1jmJWGu4a6wwtugaaauGtQWaH8Da_GzuHkJWVlpwbgpthKOz3aZsmeRWC9mJxnG7DS8STNT5IPyhaMOEO0xF_WkUZ4qrSuTomEB0k-MzgpiK3UXx4un_F_9DX3qvDjCPw2yPMbxfcCPwLXrd6lgngdMmqa5lx6cJqyoiNTzQLz3xeRq4YWZbAYxXXgGJ27CT8K5iZPOqJEFAUnl8dP2XduDOUPVPjObHsLboL92TIGmxCxPx9TteMcDshnf0L2aRR50)). It has a **binary** response `target`: whether the tumour is `benign` or `malignant`. Hence, the binary response $Y_i$ is mathematically set as:

$$
Y_i =
\begin{cases}
1 \; \; \; \; \mbox{if the $i$th tumour is malignant},\\
0 \; \; \; \; 	\mbox{otherwise.}
\end{cases}
$$

The original dataset contains 569 observations from a digitized image of a breast mass' fine needle aspirate (FNA) and 30 real-valued characteristics (i.e., continuous input variables) plus the binary response and ID number. 

**We will only work with 16 input variables**.

In [None]:
#Run this cell to create a dataset

breast_cancer <- read_csv("data/breast_cancer.csv") %>%
  select(-c(
    mean_area, area_error, concavity_error, concave_points_error, worst_radius, worst_texture, worst_perimeter,
    worst_area, worst_smoothness, worst_compactness, worst_concavity, worst_concave_points, worst_symmetry,
    worst_fractal_dimension
  ))

**Question 1.0**
<br>{points: 1}

Replace the levels `malignant` and `benign` for `target` in the dataset `breast_cancer` with the numerical values `1` and `0`, respectively.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# breast_cancer <- 
#     breast_cancer %>% 
#     ...(... = ...(..., 1, 0))

# your code here
fail() # No Answer - remove if you provide an answer

head(breast_cancer)

In [None]:
test_1.0()

**Question 1.1**
<br>{points: 1}

Using the `glm` function, fit a logistic regression model. The model's response will be `target` and the rest of the covariates will be inputs. Call the resulting object `breast_cancer_model`.

> tip: check the data object, note that there's a variable called ID that you need to exclude

**Note**: You need to write most of this code cell. Go back to `worksheet_04` and `tutorial_04` if you don't recall how to fit a logistic model.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# breast_cancer_model <- 
#     ...

# your code here
fail() # No Answer - remove if you provide an answer

summary(breast_cancer_model)

In [None]:
test_1.1()

**Question 1.2**
<br>{points: 1}

1. Use the function `augment()` with the object `breast_cancer_model` to compute fitted values and residuals and add them to the dataset. We will select only the target to reduce the number of columns displayed.

Note that `augment()` computes the *deviance residuals*, but you can compute others from the model object, e.g., the *Pearson residuals*.

2. Use the function `residuals()` on the estimated model `breast_cancer_model` to add the residuals of type `response`, `pearson`, and `deviance` to the tibble.
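The three residual types can also be checked by hand. The sketch below uses the built-in `mtcars` data and a toy model `am ~ wt`, both chosen here purely for illustration and unrelated to `breast_cancer`; `p_hat` and the `*_by_hand` names are hypothetical.

```r
# Toy logistic regression on mtcars (illustrative only, not the worksheet data).
fit <- glm(am ~ wt, data = mtcars, family = binomial)
y <- mtcars$am          # observed 0/1 response
p_hat <- fitted(fit)    # fitted probabilities

# Response (raw) residual: observed minus fitted probability.
raw_by_hand <- y - p_hat

# Pearson residual: raw residual scaled by the estimated standard deviation.
pearson_by_hand <- (y - p_hat) / sqrt(p_hat * (1 - p_hat))

# Deviance residual: signed square root of each observation's deviance contribution.
deviance_by_hand <- sign(y - p_hat) *
  sqrt(-2 * (y * log(p_hat) + (1 - y) * log(1 - p_hat)))

# Compare with the values returned by residuals().
all.equal(unname(raw_by_hand), unname(residuals(fit, type = "response")))
all.equal(unname(pearson_by_hand), unname(residuals(fit, type = "pearson")))
all.equal(unname(deviance_by_hand), unname(residuals(fit, type = "deviance")))
```

The hand computations assume ungrouped binary data with all prior weights equal to 1.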

*Assign your answer to the object `breast_cancer_resid`. Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# breast_cancer_resid <- breast_cancer_model %>%
#               ...() %>%
#               dplyr::select(target,.fitted, .resid,.std.resid) %>%
#               mutate(resid_raw = residuals(..., type = "..."),
#                      resid_deviance = residuals(...),
#                      resid_pearson =...(...,... = "...")
#                      )

# your code here
fail() # No Answer - remove if you provide an answer

head(breast_cancer_resid,3)

In [None]:
test_1.2()

**Question 1.3**
<br>{points: 2}

The `summary()` function reports a *Residual deviance*, which is analogous to the residual sum of squares (RSS) of MLR.
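For ungrouped binary responses, the analogy can be written out explicitly. As a sketch, let $\hat{p}_i$ denote the fitted probability for observation $i$, $d_i$ its deviance residual, and $\hat{y}_i$ the fitted value in MLR (notation introduced here for illustration):

$$
D = \sum_{i=1}^n d_i^2 = -2 \sum_{i=1}^n \left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right],
\qquad
\text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2.
$$

Both quantities add up a squared per-observation discrepancy between the data and the fitted model, so smaller values indicate a closer fit.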

1. Use the output in `breast_cancer_resid` to compute the residual deviance. Save it in an object called `breast_cancer_resid_dev`

2. Extract the object `deviance` from `breast_cancer_model` to verify that you computed it correctly. Save it in an object called `summary_dev`

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# breast_cancer_resid_dev <- ...((...$...)^2)
# summary_dev <- ...$...

# your code here
fail() # No Answer - remove if you provide an answer

breast_cancer_resid_dev
summary_dev

In [None]:
test_1.3.0()
test_1.3.1()

**Question 1.4**
<br>{points: 2}

Note that the function `glance()` also computes the deviance of the intercept-only and full models. However, it does not provide the results of a statistical test.

1. Get the deviance of the intercept-only and full models using `glance()` and `breast_cancer_model`. Call the resulting object `breast_cancer_glance`.

2. Extract the null deviance from `breast_cancer_model` (stored as `null.deviance`; `$null` reaches it through R's partial matching of list names) to verify that it is the same as that given by `glance()`. Call the resulting object `summary_null`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# breast_cancer_glance <- ...(...)
# summary_null <- ...$null

# your code here
fail() # No Answer - remove if you provide an answer

breast_cancer_glance
summary_null

In [None]:
test_1.4.0()
test_1.4.1()

**Question 1.5**
<br>{points: 1}

1. Fit a logistic model with only an intercept. Call this model `model_null`.

2. Note that the null deviance from `model_null` is the same as its residual deviance. Why?

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# model_null <- ...(
#   ... ~ ...,
#   ...,
#   ...  
# )

# your code here
fail() # No Answer - remove if you provide an answer

summary(model_null)

In [None]:
test_1.5()

**Question 1.6**
<br>{points: 1}

1. Use the function `anova()` and the test `Chisq` to test if the full additive logistic model (`breast_cancer_model`) is significantly better than the intercept-only model (`model_null`).

2. Store the output of the `anova` function in an object called `breast_cancer_gof`.
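The `Chisq` test reported by `anova()` for nested GLMs is a likelihood ratio test: the drop in residual deviance is compared with a $\chi^2$ distribution whose degrees of freedom equal the number of extra coefficients. A minimal sketch on the built-in `mtcars` data, with toy models and object names (`toy_null`, `toy_full`, `toy_gof`) that are assumptions made for illustration:

```r
# Toy nested logistic models on mtcars (illustrative only, not the worksheet data).
toy_null <- glm(am ~ 1, data = mtcars, family = binomial)
toy_full <- glm(am ~ wt, data = mtcars, family = binomial)

# Likelihood ratio statistic: drop in residual deviance from the null to the full model.
lrt_stat <- deviance(toy_null) - deviance(toy_full)

# Degrees of freedom: number of additional coefficients in the full model.
lrt_df <- df.residual(toy_null) - df.residual(toy_full)

# Upper-tail chi-squared probability gives the p-value.
p_by_hand <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)

# The same p-value appears in the second row of the anova table.
toy_gof <- anova(toy_null, toy_full, test = "Chisq")
p_from_anova <- toy_gof[2, "Pr(>Chi)"]
all.equal(p_by_hand, p_from_anova)
```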

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
#breast_cancer_gof <- ...(..., ..., test = "...")

# your code here
fail() # No Answer - remove if you provide an answer

breast_cancer_gof

In [None]:
test_1.6()

**Question 1.7: TRUE or FALSE**
<br>{points: 1}

Compare the quantities given in the anova table with those obtained before using `summary` and `glance`. *Is the following observation true or false?*

"The quantities in the column `Resid.Dev` correspond to the residual deviance of the intercept-only and the full models, respectively."

*Assign your answer to an object called `answer1.7`. Your answer should be one of `"true"` or `"false"` surrounded by quotes.*

In [None]:
# answer1.7 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.7()

**Question 1.8**
<br>{points: 1}

Using a significance level $\alpha = 0.05$ and the results in `breast_cancer_gof`, what is the conclusion, in plain words, of the test run in *Question 1.6*?

**A.** We fail to reject the null hypothesis; thus, there is not enough evidence that the *full* model with additional interaction terms is better than the additive (reduced) model.

**B.** We reject the null hypothesis; thus, the *full* model is significantly better than the *reduced* model.

**C.** We accept the alternative hypothesis; thus, the *full* model is significantly better than the *reduced* model.

**D.** We do not accept the alternative hypothesis; thus, the *full* model with additional interaction terms is not better than the *reduced* model.

*Assign your answer to an object called `answer1.8`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer1.8 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.8()

**Question 1.9**
<br>{points: 2}

Although the `breast_cancer_model` is significantly better than the intercept-only model, not all variables are needed in the model. In fact, many variables are not statistically significant and there may be a multicollinearity problem.

In this question, we'll compare the additive logistic model `breast_cancer_model` with a reduced model called `breast_cancer_reduced`. 

1. Fit a reduced (additive and logistic) model *with only* the variables `mean_fractal_dimension`, `mean_smoothness`, `mean_compactness`, and `mean_concavity`.

2. Use the function `anova()` to compare these models and store the results in an object called `breast_cancer_aov`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# breast_cancer_reduced <- ...

#breast_cancer_aov <- ...(..., ..., test = "...")

# your code here
fail() # No Answer - remove if you provide an answer
summary(breast_cancer_reduced)
breast_cancer_aov

In [None]:
test_1.9()

**Question 1.10**
<br>{points: 1}

Using a **significance level $\alpha = 0.01$** and the results in `breast_cancer_aov`, which of the following claims is correct? 

**A.** There is enough evidence to reject the null hypothesis, which states that, compared to `breast_cancer_reduced`, the additional variables in the full model `breast_cancer_model` are not relevant (i.e., all their coefficients are simultaneously equal to 0).

**B.** There is not enough evidence that the `breast_cancer_reduced` with only 4 covariates is significantly better than the full model `breast_cancer_model`.

**C.** The `breast_cancer_reduced` is the best subset we could have considered.

**D.** It is surprising that the `breast_cancer_reduced` fits the data well since some of the selected variables were *not* statistically significant in `breast_cancer_model`.

*Assign your answer to an object called `answer1.10`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer1.10 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.10()

# 2. Model selection

In this section, you will use the **backward selection algorithm** to construct a generative multiple linear regression model.

Let us start by loading the dataset to be used throughout this section of the worksheet. We will use the dataset `fat` from the package `faraway`. You can find detailed information about it in [Johnson (1996)](https://www.tandfonline.com/doi/full/10.1080/10691898.1996.11910505). This dataset contains the percentage of body fat and a variety of body measurements (continuous variables) of 252 men. We will use `brozek` as the response variable and a subset of 14 variables to build different models. 

The response variable `brozek` is the percent of body fat using Brozek's equation:

$$\texttt{brozek} = \frac{457}{\texttt{density}} - 414.2,$$

where body `density` is measured in $\text{g}/\text{cm}^3$.
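As a quick worked example of Brozek's equation, the sketch below wraps it in a helper function. The name `brozek_from_density` and the density values are made up for illustration and are not taken from the `fat` data.

```r
# Hypothetical helper implementing Brozek's equation (density in g/cm^3).
brozek_from_density <- function(density) {
  457 / density - 414.2
}

# A body density of 1.05 g/cm^3 corresponds to about 21.04 percent body fat.
brozek_from_density(1.05)

# The function is vectorized: denser bodies map to lower body fat percentages.
brozek_from_density(c(1.03, 1.07))
```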

The 14 input variables are:

- `age`: Age in $\text{years}$.
- `weight`: Weight in $\text{lb}$.
- `height`: Height in $\text{in}$.
- `adipos`: Adiposity index in $\text{kg}/\text{m}^2$.

$$\texttt{adipos} = \frac{\texttt{weight}}{\texttt{height}^2}$$

- `neck`: Neck circumference in $\text{cm}$.
- `chest`: Chest circumference in $\text{cm}$.
- `abdom`: Abdomen circumference at the umbilicus and level with the iliac crest in $\text{cm}$.
- `hip`: Hip circumference in $\text{cm}$.
- `thigh`: Thigh circumference in $\text{cm}$.
- `knee`: Knee circumference in $\text{cm}$.
- `ankle`: Ankle circumference in $\text{cm}$.
- `biceps`: Extended biceps circumference in $\text{cm}$.
- `forearm`: Forearm circumference in $\text{cm}$.
- `wrist`: Wrist circumference distal to the styloid processes in $\text{cm}$.
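Because `weight` is listed in pounds and `height` in inches, the adiposity formula above only yields $\text{kg}/\text{m}^2$ after a unit conversion. A minimal sketch, assuming the standard conversions 1 lb = 0.45359237 kg and 1 in = 0.0254 m; the helper name `adiposity_index` and the example measurements are hypothetical:

```r
# Hypothetical helper: adiposity index in kg/m^2 from pounds and inches.
adiposity_index <- function(weight_lb, height_in) {
  weight_kg <- weight_lb * 0.45359237  # pounds to kilograms
  height_m <- height_in * 0.0254       # inches to metres
  weight_kg / height_m^2
}

# A 160 lb, 70 in person has an index of roughly 23 kg/m^2.
adiposity_index(weight_lb = 160, height_in = 70)
```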

Run the code below to create the working data frame called `fat_sample`.

In [None]:
# run this cell

fat_sample <- 
    fat %>%
    select(brozek, age:adipos, neck:wrist)

head(fat_sample,3)

### Selecting a generative model

Although many potential input variables are available in the dataset, not all may be relevant to explain the variation of the response variable. The *subset* algorithms learned in the lecture can be used to select a subset of variables to build generative models. 

Generative models are built and trained to examine the association between the response and the input variables. 

<font style='color: darkred'>However, using the same data to select and estimate violates the analysis's assumptions and invalidates inference results</font>. This problem is known as the **post-selection inference problem** and will be discussed further in future lectures. Since the same data cannot be used to select and make inferences, we need to split our dataset into two sets: a *selection* set and a *training* set. 

In the following questions, we will use the *backward* selection algorithm and the *adjusted* $R^2$ to select a smaller model. 
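The adjusted $R^2$ penalizes the ordinary $R^2$ for the number of input variables, which makes it usable for comparing models of different sizes. The sketch below computes both by hand for a toy linear model on the built-in `mtcars` data and compares them with the values from `summary()`; the model `mpg ~ wt + hp` and all object names are assumptions made for illustration.

```r
# Toy linear model on mtcars (illustrative only, not the worksheet data).
toy_lm <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)                  # number of observations
p <- length(coef(toy_lm)) - 1      # number of input variables (intercept excluded)
rss <- sum(residuals(toy_lm)^2)    # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares

# R^2: proportion of the total variation explained by the model.
r2_by_hand <- 1 - rss / tss

# Adjusted R^2: each sum of squares is divided by its degrees of freedom.
adj_r2_by_hand <- 1 - (rss / (n - p - 1)) / (tss / (n - 1))

# Compare with the values reported by summary().
all.equal(r2_by_hand, summary(toy_lm)$r.squared)
all.equal(adj_r2_by_hand, summary(toy_lm)$adj.r.squared)
```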

**Question 2.0**
<br>{points: 1}

Let's start by randomly splitting `fat_sample` into two sets on a 70-30% basis: `training_fat` (70% of the data) and `selection_set_fat` (the remaining 30%). The selection will be done using the `selection_set_fat` dataset and the model will be built using the `training_fat` data.

Follow the next 4 steps to complete the code below:

1. Create an `id` column in `fat_sample` with the row number corresponding to each man in the sample (see function `row_number`).

2. Use the function `slice_sample()` to create `training_fat` (sampling *without* replacement) with 70% of the observations coming from `fat_sample`.

3. Use `anti_join()` with `fat_sample` and `training_fat` to create `selection_set_fat` by column `id`.

4. Remove the variable `id`, which was only used to split the data, from both sets.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(123) # DO NOT CHANGE!

# fat_sample <- 
#     fat_sample %>%
#     ...(id = ...)

# training_fat <- 
#     ... %>%
#     ...(prop = ..., replace = ...)

# selection_set_fat <- 
#     ... %>%
#     anti_join(..., by = ...)

# training_fat <- 
#     training_fat %>% 
#     select(-"id")

# selection_set_fat <- 
#     selection_set_fat %>% 
#     select(...)

# your code here
fail() # No Answer - remove if you provide an answer

head(training_fat)
nrow(training_fat)

head(selection_set_fat)
nrow(selection_set_fat)

In [None]:
test_2.0()

**Question 2.1**
<br>{points: 1}

Using only the data in `selection_set_fat`, select a reduced linear regression (LR) model using the **backward selection** algorithm. Recall that this method is implemented in the `leaps::regsubsets()` function.

The function `regsubsets()` from the `leaps` package identifies various subsets of input variables selected for models of different sizes. The argument `x` of `regsubsets()` is analogous to `formula` in `lm()`. 
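The greedy idea behind backward selection can be sketched in base R. This is an illustration on the built-in `mtcars` data with an arbitrary subset of columns, not the routine that `regsubsets()` runs internally; the names `toy`, `remaining`, and `path` are hypothetical.

```r
# Backward elimination by hand (illustrative sketch, not the worksheet data).
toy <- mtcars[, c("mpg", "wt", "hp", "disp", "qsec")]
remaining <- setdiff(names(toy), "mpg")  # start from the full model
path <- list()                           # path[[k]] stores the model with k inputs

while (length(remaining) > 0) {
  fit <- lm(reformulate(remaining, response = "mpg"), data = toy)
  path[[length(remaining)]] <- list(
    variables = remaining,
    adj_r2 = summary(fit)$adj.r.squared
  )
  if (length(remaining) == 1) break

  # RSS of every model obtained by dropping exactly one remaining variable.
  rss_after_drop <- sapply(remaining, function(v) {
    reduced <- lm(reformulate(setdiff(remaining, v), response = "mpg"), data = toy)
    sum(residuals(reduced)^2)
  })

  # Drop the variable whose removal leaves the smallest RSS.
  remaining <- setdiff(remaining, names(which.min(rss_after_drop)))
}

# One model per size; pick the size with the largest adjusted R^2.
adj_r2_by_size <- sapply(path, function(m) m$adj_r2)
best_size <- which.max(adj_r2_by_size)
path[[best_size]]$variables
```

Each step only considers models nested in the previous one, so the path contains one candidate per size rather than every possible subset.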

Create one object using `regsubsets()` with `selection_set_fat` and call it `fat_backward_sel`. We will use `fat_bwd_summary_df` to check the results.

**Maintain any ordering of columns seen in `selection_set_fat`**

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_backward_sel <- ...(
#   x=..., 
#   nvmax=...,
#   data=...,
#   method=...
# )

# fat_backward_sel

# fat_bwd_summary <- summary(fat_backward_sel)

# fat_bwd_summary_df <- 
#     tibble(
#         n_input_variables = 1:14,
#         RSQ = fat_bwd_summary$rsq,
#         RSS = fat_bwd_summary$rss,
#         ADJ.R2 = fat_bwd_summary$adjr2
#     )

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.1()

**Question 2.2**
<br>{points: 1}

The *backward* selection algorithm selected one model of each size, the best it found along its greedy path. Results of the 14 models selected are stored in `fat_bwd_summary`. 

Use the *adjusted* $R^2$ of these 14 models, stored in `fat_bwd_summary_df`, to select the best generative model and indicate which input variables are in the selected model.

**A.** `age`.

**B.** `weight`.

**C.** `height`.

**D.** `adipos`.

**E.**  `neck`.

**F.**  `chest`.

**G.**  `abdom`.

**H.**  `hip`.

**I.**  `thigh`.

**J.**  `knee`.

**K.**  `ankle`.

**L.**  `biceps`.

**M.**  `forearm`.

**N.**  `wrist`.

*Assign your answers to the object `answer2.2`. Your answers must be included in a single string indicating the correct options **in alphabetical order** and surrounded by quotes.*

In [None]:
#Run this cell before continuing to examine the results


# HINT: try arranging your results in descending order with: arrange(desc(column_name))
fat_bwd_summary_df

# HINT 2: You can also use the output $which to see the summary of the variables in a 
# matrix format. Give it a try: fat_bwd_summary$which 
fat_bwd_summary

In [None]:
# answer2.2 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.2()

**Question 2.3**
<br>{points: 1}

Now that you have selected a subset of input variables, use the independent dataset `training_fat` to build and evaluate a *generative* model. 

Use `lm` to fit the selected model using `training_fat`, and store the results in an object called `fat_bwd_generative`. 

Enter the selected variables in the **same order** as they are in `training_fat`. This is not statistically needed; it's only needed for auto-grading purposes.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_bwd_generative <- ...(..., ...)

# your code here
fail() # No Answer - remove if you provide an answer

tidy(fat_bwd_generative)

In [None]:
test_2.3()

**Question 2.4**
<br>{points: 1}

Compute the coefficient of determination $R^2$ to evaluate the model's goodness of fit.

> Note that the evaluation is also based on data from `training_fat`

*Assign your answer to the object `answer2.4`. Your answer must be a number.*

In [None]:
# *Your code goes here.*

# your code here
fail() # No Answer - remove if you provide an answer

answer2.4

In [None]:
test_2.4()