# Tutorial 7: Model Evaluation and Model Selection

#### Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:

1. List model metrics that are suitable for evaluation of a statistical model developed to make inference about the data-generating mechanism (e.g., $R^2$, $\text{AIC}$, Likelihood ratio test/$F$-test), their strengths and limitations, as well as how they are calculated.
2. Write a computer script to calculate these model metrics. Interpret and communicate the results from that computer script.
3. Explain a variable selection method based on:
    - $F$-test to compare nested models.
    - RSS for models of equal size.
    - Adjusted $R^2$ for models of different sizes.

In [None]:
# Run this cell before continuing.
library(tidyverse)
library(repr)
library(digest)
library(infer)
library(gridExtra)
library(carData)
library(broom)
library(leaps)
library(mltools)
source("tests_tutorial_07.R")

## 1. Goodness of Fit and Nested Models

In this section we will compare nested models to select multiple linear regression (MLR) models. We will continue using the Facebook dataset from `tutorial_04`. 

Recall this dataset contains critical information on users' engagement during 2014 on a Facebook page of a famous cosmetics brand. The original dataset contains 500 observations related to different types of posts. It can be found on [data.world](https://data.world/uci/facebook-metrics/workspace/project-summary?agentid=uci&datasetid=facebook-metrics). The dataset you will work with in this tutorial contains 491 observations. The dataset was first analyzed by [Moro et al. (2016)](https://gw2jh3xr2c.search.serialssolutions.com/log?L=GW2JH3XR2C&D=ADALY&J=JOUROFBUSRE&P=Link&PT=EZProxy&A=Predicting+social+media+performance+metrics+and+evaluation+of+the+impact+on+brand+building%3A+A+data+mining+approach&H=d8c19bb47c&U=https%3A%2F%2Fezproxy.library.ubc.ca%2Flogin%3Furl%3Dhttps%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Flink%3Fref_val_fmt%3Dinfo%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal%26svc_val_fmt%3Dinfo%3Aofi%2Ffmt%3Akev%3Amtx%3Asch_srv%26rfr_dat%3Dsaltver%3A1%26rfr_dat%3Dorigin%3ASERIALSSOL%26ctx_enc%3Dinfo%3Aofi%2Fenc%3AUTF-8%26ctx_ver%3DZ39.88-2004%26rft_id%3Dinfo%3Adoi%2F10.1016%2Fj.jbusres.2016.02.010) in their data mining work to predict the performance of different post metrics, which are also based on the type of post. The original dataset has 17 variables of different types (continuous and discrete). Nonetheless, for this tutorial, we selected four continuous variables to be saved in the object `facebook_data`:


1.  The continuous variable `total_engagement_percentage` is an essential variable for any company owning a Facebook page. It gives a sense of how engaged the overall social network's users are with the company's posts, **regardless of whether they previously liked their Facebook page or not**. *The larger the percentage, the better the total engagement*. It is computed as follows:

$$\texttt{total_engagement_percentage} = \frac{\text{Lifetime Engaged Users}}{\text{Lifetime Post Total Reach}} \times 100\%$$

-   **Lifetime Post Total Reach:** The number of overall *Facebook unique users* who *saw* the post.
-   **Lifetime Engaged Users:** The number of overall *Facebook unique users* who *saw and clicked* on the post. This count is a subset of **Lifetime Post Total Reach**.

2.  The continuous variable `page_engagement_percentage` is analogous to `total_engagement_percentage`, but only with users who engaged with the post **given they have liked the page**. This variable gives the company a sense of the extent to which these subscribed users react to its posts. *The larger the percentage, the better the page engagement*. It is computed as follows:

$$\texttt{page_engagement_percentage} = \frac{\text{Lifetime Users Who Have Liked the Page and Engaged with the Post}}{\text{Lifetime Post Reach by Users Who Liked the Page}} \times 100\% $$

-   **Lifetime Post Reach by Users Who Liked the Page:** The number of *Facebook unique page subscribers* who *saw* the post.
-   **Lifetime Users Who Have Liked the Page and Engaged with the Post:** The number of *Facebook unique page subscribers* who *saw and clicked* on the post. This count is a subset of **Lifetime Post Reach by Users Who Liked the Page**.

3.  The continuous variable `share_percentage` is the percentage of the total number of *likes*, *comments*, and *shares* in each post that corresponds to *shares*. It is computed as follows:

$$\texttt{share_percentage} = \frac{\text{Number of Shares}}{\text{Total Post Interactions}} \times 100\% $$

-   **Total Post Interactions:** The sum of *likes*, *comments*, and *shares* in a given post.
-   **Number of Shares:** The number of *shares* in a given post. This count is a subset of *Total Post Interactions*.

4.  The continuous variable `comment_percentage` is the percentage of the total number of *likes*, *comments*, and *shares* in each post that corresponds to *comments*. It is computed as follows:

$$\texttt{comment_percentage} = \frac{\text{Number of Comments}}{\text{Total Post Interactions}} \times 100\% $$

-   **Total Post Interactions:** The sum of *likes*, *comments*, and *shares* in a given post.
-   **Number of Comments:** The number of *comments* in a given post. This count is a subset of *Total Post Interactions*.

Let us add a discrete input called `post_category`, which has three different categories depending on the content characterization:

-   `Action`: Brand's contests and special offers for the customers.
-   `Product`: Regular advertisements for products with explicit brand content.
-   `Inspiration`: Non-explicit brand-related content.

> **For this tutorial, we will consider `total_engagement_percentage` as a response along with `page_engagement_percentage`, `share_percentage`, `comment_percentage`, and `post_category` as inputs.**

Run the cell below before proceeding.

In [None]:
facebook_data <- read_csv("data/facebook_data.csv")
head(facebook_data)

facebook_data$post_category <- as.factor(facebook_data$post_category)
levels(facebook_data$post_category)

**Question 1.0**
<br>{points: 1}

Let us draw a random sample of $n = 100$ observations called `facebook_sample` from `facebook_data`.
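As a rough illustration of drawing a reproducible random sample of rows, here is a minimal sketch on the built-in `mtcars` data (not `facebook_data`). It uses base R indexing, which is only one of several ways to sample rows; the object names `row_ids` and `mtcars_sample` are hypothetical.

```r
# Illustrative sketch only: built-in mtcars data, not facebook_data.
set.seed(123) # Fixing the seed makes the random draw reproducible.

# Draw 10 distinct row indices (sampling without replacement), then keep those rows.
row_ids <- sample(nrow(mtcars), size = 10)
mtcars_sample <- mtcars[row_ids, ]

nrow(mtcars_sample) # number of sampled rows
```

Because sampling is without replacement by default, no row appears twice in the sample.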

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(123) # DO NOT CHANGE!

# facebook_sample <- ...(..., size = ...)
# head(facebook_sample)

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.0()

**Question 1.1**
<br>{points: 1}

We will use `facebook_sample` to explore the difference between additive and interaction MLR models. Let $Y_i$ be the response `total_engagement_percentage` for these models. The input $X_{1i}$ corresponds to the continuous `page_engagement_percentage`. Furthermore, we will have the following dummy variables associated with `post_category` as follows:

$$
X_{2i} =
\begin{cases}
1 \; \; \; \; \mbox{if the $i$th post is Inspiration-type,}\\
0 \; \; \; \; \mbox{otherwise};
\end{cases}
$$

$$
X_{3i} =
\begin{cases}
1 \; \; \; \; \mbox{if the $i$th post is Product-type,}\\
0 \; \; \; \; \mbox{otherwise}.
\end{cases}
$$

> **Heads-up:** Note that `Action` is the baseline level in `post_category`.

Finally, $X_{4i}$ and $X_{5i}$ correspond to the continuous variables `share_percentage` and `comment_percentage`, respectively.

Estimate the following MLRs:

- An additive MLR with the inputs `page_engagement_percentage` and `post_category`. Call it `facebook_MLR_add`.
- A second additive MLR with the inputs `page_engagement_percentage`, `post_category`, `share_percentage`, and `comment_percentage`. Call it `facebook_MLR_add_2`.
- An interaction MLR with the inputs `page_engagement_percentage` and `post_category`. Call it `facebook_MLR_int`.
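To illustrate the difference between the additive and interaction formula syntax in `lm()`, here is a minimal sketch on the built-in `iris` data (not `facebook_sample`), with one continuous input and one three-level factor. The object names `iris_add` and `iris_int` are hypothetical.

```r
# Illustrative sketch only: built-in iris data, not facebook_sample.

# Additive model: one common slope, a separate intercept per species.
iris_add <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# Interaction model: `*` expands to both main effects plus their interaction,
# so each species gets its own intercept and its own slope.
iris_int <- lm(Sepal.Length ~ Petal.Length * Species, data = iris)

length(coef(iris_add)) # 4 coefficients
length(coef(iris_int)) # 6 coefficients
```

The two extra coefficients in the interaction model are the slope adjustments for the two non-baseline levels of the factor.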

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# facebook_MLR_add <- lm(...,
#   ...
# )
# facebook_MLR_add

# facebook_MLR_add_2 <- lm(...,
#   ...
# )
# facebook_MLR_add_2

# facebook_MLR_int <- lm(...,
#   ...
# )
# facebook_MLR_int

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.1()

**Question 1.2**
<br>{points: 1}

Create a plot of the data (using `geom_point()`) and the predictions of `facebook_MLR_add` (note that your plot should have three regression lines, one for each `post_category`). You have to colour the points and regression lines by `post_category`. Include a legend indicating which colour corresponds to each `post_category` and proper axis labels. The `ggplot()` object's name will be `facebook_MLR_Add_plot`. Recall that the response must be placed on the $y$-axis, whereas the continuous input must be on the $x$-axis.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
options(repr.plot.width = 15, repr.plot.height = 7) # Adjust these numbers so the plot looks good on your screen.

# facebook_sample$pred_MLR_Add <- predict(facebook_MLR_add) # Using predict() to create estimated regression lines.

# facebook_MLR_Add_plot <- ggplot(..., aes(
#   ...,
#   ...,
#   color = ...
# )) +
#   ...() +
#   geom_line(aes(y = pred_MLR_Add), size = 1) +
#   labs(
#     title = ...,
#     x = ...,
#     y = ...
#   ) +
#   theme(
#     text = element_text(size = 16.5),
#     plot.title = element_text(face = "bold"),
#     axis.title = element_text(face = "bold"),
#     legend.title = element_text(face = "bold"),
#   ) +
#   labs(color = ...)
# facebook_MLR_Add_plot

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.2()

**Question 1.3**
<br>{points: 1}

Create a plot of the data (using `geom_point()`) and the predictions of `facebook_MLR_int` (note that your plot should have three regression lines, one for each `post_category`). You have to colour the points and regression lines by `post_category`. Include a legend indicating which colour corresponds to each `post_category` and proper axis labels. The `ggplot()` object's name will be `facebook_MLR_Int_plot`. Recall that the response must be placed on the $y$-axis, whereas the continuous input must be on the $x$-axis.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# facebook_sample$pred_MLR_Int <- predict(facebook_MLR_int) # Using predict() to create estimated regression lines.

# facebook_MLR_Int_plot <- ggplot(..., aes(
#   ...,
#   ...,
#   color = ...
# )) +
#   ...() +
#   geom_line(aes(y = pred_MLR_Int), size = 1) +
#   labs(
#     title = ...,
#     x = ...,
#     y = ...
#   ) +
#   theme(
#     text = element_text(size = 16.5),
#     plot.title = element_text(face = "bold"),
#     axis.title = element_text(face = "bold"),
#     legend.title = element_text(face = "bold"),
#   ) +
#   labs(color = ...)
# facebook_MLR_Int_plot

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.3()

**Question 1.4**
<br>{points: 1}

Find the estimated coefficients of `facebook_MLR_add`, `facebook_MLR_add_2`, and `facebook_MLR_int` using `tidy()`. Report the estimated coefficients, their standard errors, and the corresponding $p$-values. Include the corresponding asymptotic 90% confidence intervals. Store the results in the variables `facebook_MLR_add_results`, `facebook_MLR_add_2_results`, and `facebook_MLR_int_results`, respectively.
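As a reminder of how `tidy()` can report confidence intervals, here is a minimal sketch on the built-in `iris` data (not `facebook_sample`). The object names are hypothetical; `conf.int` and `conf.level` are arguments of `broom::tidy()` for `lm` objects.

```r
library(broom)

# Illustrative sketch only: built-in iris data, not facebook_sample.
iris_add <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# conf.int = TRUE appends the columns conf.low and conf.high;
# conf.level = 0.90 requests 90% intervals (the default is 0.95).
iris_add_results <- tidy(iris_add, conf.int = TRUE, conf.level = 0.90)
iris_add_results
```

Each row of the output corresponds to one coefficient, so the table has as many rows as the model has coefficients.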

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# print("facebook_MLR_add_results")
# facebook_MLR_add_results <- ...(..., ..., ...) %>% mutate_if(is.numeric, round, 2)
# facebook_MLR_add_results

# print("facebook_MLR_add_2_results")
# facebook_MLR_add_2_results <- ...(..., ..., ...) %>% mutate_if(is.numeric, round, 2)
# facebook_MLR_add_2_results

# print("facebook_MLR_int_results")
# facebook_MLR_int_results <- ...(..., ..., ...) %>% mutate_if(is.numeric, round, 2)
# facebook_MLR_int_results

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.4()

**Question 1.5**
<br>{points: 1}

Using a **significance level $\alpha = 0.10$** and the results in `facebook_MLR_add_results`, which of the following claims *is* correct? (only one is correct) 

**A.** There's enough evidence to reject the null hypothesis that the expected `total_engagement_percentage` is the same in all `post_category` types.

**B.** There's enough evidence to reject the null hypothesis that the expected `total_engagement_percentage` is the same in `Action` and `Product` posts, for any value of `page_engagement_percentage`.

**C.** There's enough evidence to reject the null hypothesis that the expected `total_engagement_percentage` is the same in `Action` and `Inspiration` posts, for any value of `page_engagement_percentage`.

**D.** There's enough evidence to reject the null hypothesis that the expected `total_engagement_percentage` is the same in `Product` and `Inspiration` posts, for any value of `page_engagement_percentage`.


*Assign your answers to the object `answer1.5`. Your answers should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer1.5 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.5()

**Question 1.6**
<br>{points: 1}

Using a **significance level $\alpha = 0.10$** and the results in `facebook_MLR_add_2_results`, which of the following claims *are* correct? (there may be more than one)

**A.** We fail to reject the null hypothesis that `total_engagement_percentage` and `share_percentage` are not associated, holding all other variables fixed.

**B.** We fail to reject the null hypothesis that `total_engagement_percentage` and `comment_percentage` are not associated, holding all other variables fixed.

**C.** We fail to reject the null hypothesis that the expected `total_engagement_percentage` is the same in `Action` and `Product` posts, holding all other variables fixed.

**D.** We fail to reject the null hypothesis that the expected `total_engagement_percentage` is the same in `Action` and `Inspiration` posts, holding all other variables fixed.

**E.** We fail to reject the null hypothesis that the expected `total_engagement_percentage` is the same in `Product` and `Inspiration` posts, holding all other variables fixed.

*Assign your answers to the object `answer1.6`. Your answers have to be included in a single string indicating the correct options **in alphabetical order** and surrounded by quotes (e.g., `"ABCD"` indicates you are selecting the four options).*

In [None]:
# answer1.6 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.6()

**Question 1.7**
<br>{points: 1}

Using a **significance level $\alpha = 0.10$** and the results in `facebook_MLR_int_results`, which of the following claims *are* correct? (there may be more than one)

> *Tip*: Note that the results of `facebook_MLR_int_results` cannot be used to examine if (overall) `total_engagement_percentage` and `page_engagement_percentage` are associated. The $t$-tests in the `tidy()` table compare this association for different types of posts (i.e., levels of the categorical variable `post_category`). An $F$-test is required instead!

**A.** After controlling for other input variables in the model, `total_engagement_percentage` and `page_engagement_percentage` are significantly associated.

**B.** After controlling for other input variables in the model, `total_engagement_percentage` and `page_engagement_percentage` are significantly associated for `Action` posts.

**C.** After controlling for other input variables in the model, `total_engagement_percentage` and `page_engagement_percentage` are significantly associated for `Product` posts.

**D.** After controlling for other input variables in the model, the association between `total_engagement_percentage` and `page_engagement_percentage` is significantly different for `Product` posts compared to that of `Action` posts.


*Assign your answers to the object `answer1.7`. Your answers have to be included in a single string indicating the correct options **in alphabetical order** and surrounded by quotes (e.g., `"ABCD"` indicates you are selecting the four options).*

In [None]:
# answer1.7 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.7()

**Question 1.8**
<br>{points: 1}

Before testing the goodness of fit and comparing our three models, let us review the corresponding regression equations:

- `facebook_MLR_add`:

$$Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \beta_3 X_{3,i} + \varepsilon_i.$$

- `facebook_MLR_add_2`:

$$Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \beta_3 X_{3,i} + \beta_4 X_{4,i} + \beta_5 X_{5,i} + \varepsilon_i.$$

- `facebook_MLR_int`:

$$Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \beta_3 X_{3,i} + \beta_6 X_{1,i} \times X_{2,i} + \beta_7 X_{1,i} \times X_{3,i}+ \varepsilon_i.$$
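To see what each coefficient in the interaction model measures, it can help to write out the mean response separately for each `post_category`. Plugging the dummy coding from **Question 1.1** (with `Action` as the baseline, $X_{2,i}$ for `Inspiration`, and $X_{3,i}$ for `Product`) into the `facebook_MLR_int` equation gives:

$$
\begin{aligned}
\texttt{Action:} \quad & E(Y_i \mid X_{1,i}) = \beta_0 + \beta_1 X_{1,i}, \\
\texttt{Inspiration:} \quad & E(Y_i \mid X_{1,i}) = (\beta_0 + \beta_2) + (\beta_1 + \beta_6) X_{1,i}, \\
\texttt{Product:} \quad & E(Y_i \mid X_{1,i}) = (\beta_0 + \beta_3) + (\beta_1 + \beta_7) X_{1,i}.
\end{aligned}
$$

Hence $\beta_1$ is the slope for the baseline category only, while $\beta_6$ and $\beta_7$ are *differences* in slope relative to the baseline.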


Using `glance()`, obtain metrics to explore the goodness of fit of these models. Store values in `facebook_MLR_add_statistics`, `facebook_MLR_add_2_statistics`, and `facebook_MLR_int_statistics`.
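As a reminder of what `glance()` returns, here is a minimal sketch on the built-in `iris` data (not `facebook_sample`). The object names are hypothetical; the column names are those produced by `broom::glance()` for `lm` objects.

```r
library(broom)

# Illustrative sketch only: built-in iris data, not facebook_sample.
iris_add <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# glance() returns a one-row summary of the whole model.
iris_add_statistics <- glance(iris_add)

# A few of the goodness-of-fit columns; statistic and p.value refer to the
# overall F-test of the fitted model against the intercept-only model.
iris_add_statistics[, c("r.squared", "adj.r.squared", "statistic", "p.value", "AIC")]
```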

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# print("facebook_MLR_add_statistics")
# facebook_MLR_add_statistics <- ...(...) %>% mutate_if(is.numeric, round, 3)
# facebook_MLR_add_statistics

# print("facebook_MLR_add_2_statistics")
# facebook_MLR_add_2_statistics <- ...(...) %>% mutate_if(is.numeric, round, 3)
# facebook_MLR_add_2_statistics

# print("facebook_MLR_int_statistics")
# facebook_MLR_int_statistics <- ...(...) %>% mutate_if(is.numeric, round, 3)
# facebook_MLR_int_statistics

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.8()

**Question 1.9**
<br>{points: 1}

We can use the adjusted $R^2$ to compare the three models and conclude which one fits the data better. By looking at your metrics from **Question 1.8**, which model fits the data better according to the adjusted $R^2$, and why?
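As a reminder, the usual definition of the adjusted $R^2$ penalizes the residual sum of squares by the number of estimated coefficients. Writing $n$ for the sample size, $p$ for the number of coefficients excluding the intercept (each dummy variable and each interaction term counts as one), and $\text{TSS} = \sum_{i=1}^n (Y_i - \bar{Y})^2$ for the total sum of squares:

$$
R^2_{\text{adj}} = 1 - \frac{\text{RSS}/(n - p - 1)}{\text{TSS}/(n - 1)} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}.
$$

Unlike $R^2$, this quantity can decrease when an added term reduces the RSS only slightly.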

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.10**
<br>{points: 1}

Do these three models fit the data better than the null model? Using your results from **Question 1.8** with a **significance level $\alpha = 0.10$**, provide the three corresponding statistical conclusions of these tests.
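The overall $F$-test reported for a fitted `lm` is itself a nested-model comparison against the intercept-only (null) model. A minimal sketch on the built-in `iris` data (not `facebook_sample`) illustrates this; all object names are hypothetical.

```r
# Illustrative sketch only: built-in iris data, not facebook_sample.
iris_null <- lm(Sepal.Length ~ 1, data = iris) # intercept-only (null) model
iris_add <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# F-statistic reported by summary() for the fitted model.
f_from_summary <- unname(summary(iris_add)$fstatistic["value"])

# The same statistic from an explicit comparison of the two nested models.
f_from_anova <- anova(iris_null, iris_add)$F[2]

# The two values should agree up to numerical tolerance.
all.equal(f_from_summary, f_from_anova)
```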

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.11**
<br>{points: 1}

By looking at the equations in **Question 1.8**, `facebook_MLR_add` is nested in `facebook_MLR_add_2`. Moreover, `facebook_MLR_add` is nested in `facebook_MLR_int`. Hence, we can use $F$-tests to make pairwise comparisons between these models. Run these tests following these steps: 

- With models `facebook_MLR_add` and `facebook_MLR_add_2`, use `anova()` to obtain the $F$-statistic and its associated $p$-value to *simultaneously* test whether all additional coefficients in `facebook_MLR_add_2` are zero. Store your results in an object called `facebook_F_test_add2_vs_add`.

- With models `facebook_MLR_add` and `facebook_MLR_int`, use `anova()` to obtain the $F$-statistic and its associated $p$-value to *simultaneously* test whether all additional coefficients in `facebook_MLR_int` are zero. Store your results in an object called `facebook_F_test_int_vs_add`.
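To make explicit what `anova()` computes for two nested models, the following sketch on the built-in `iris` data (not `facebook_sample`) reproduces the $F$-statistic and its $p$-value by hand from the residual sums of squares. All object names are hypothetical.

```r
# Illustrative sketch only: built-in iris data, not facebook_sample.
iris_add <- lm(Sepal.Length ~ Petal.Length + Species, data = iris) # reduced model
iris_int <- lm(Sepal.Length ~ Petal.Length * Species, data = iris) # full model

# Reduced model first, then the full model.
iris_F_test <- anova(iris_add, iris_int)

# By hand: for lm objects, deviance() returns the residual sum of squares (RSS).
rss_add <- deviance(iris_add)
rss_int <- deviance(iris_int)
extra_terms <- df.residual(iris_add) - df.residual(iris_int) # additional coefficients

f_by_hand <- ((rss_add - rss_int) / extra_terms) / (rss_int / df.residual(iris_int))
p_by_hand <- pf(f_by_hand, extra_terms, df.residual(iris_int), lower.tail = FALSE)

# Both comparisons should agree up to numerical tolerance.
all.equal(f_by_hand, iris_F_test$F[2])
all.equal(p_by_hand, iris_F_test$`Pr(>F)`[2])
```

The numerator measures the drop in RSS per additional coefficient, and the denominator is the estimated error variance of the full model.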

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# print("facebook_MLR_add vs. facebook_MLR_add_2")
# facebook_F_test_add2_vs_add <- ...(..., ...) %>% mutate_if(is.numeric, round, 3)
# facebook_F_test_add2_vs_add

# print("facebook_MLR_add vs. facebook_MLR_int")
# facebook_F_test_int_vs_add <- ...(..., ...) %>% mutate_if(is.numeric, round, 3)
# facebook_F_test_int_vs_add

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.11()

**Question 1.12**
<br>{points: 1}

Based on your results from **Question 1.11** with a **significance level $\alpha = 0.10$**:

- Does `facebook_MLR_add_2` fit the data better than `facebook_MLR_add`?
- Comparing `facebook_MLR_add` vs `facebook_MLR_int`, does the inclusion of interaction terms in `facebook_MLR_int` improve the model's fit to the data?

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.13**
<br>{points: 1}

In **Question 1.7** we noted that when you fit a model with interactions, the results in `facebook_MLR_int_results` cannot be used to examine if (overall) `total_engagement_percentage` and `page_engagement_percentage` are associated.

The $t$-tests in the `tidy()` table compare this association for different types of posts (i.e., levels of the categorical variable `post_category`).

If you want to examine the *overall* association (across the different types of posts), you need to compare `facebook_MLR_int` with a model that does not contain the variable `page_engagement_percentage`.

**1.13.0** Use `lm()` to fit the nested model to be compared with `facebook_MLR_int`. Assign the output to a named object so it can be used in the next question.

**1.13.1** Write the code to run an appropriate $F$-test to answer the question above.

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

#### An interesting exercise (not for marks): 

Repeat the previous comparison, this time between the model you suggested in **Question 1.13.0** and `facebook_MLR_add`. Note that the $F$-statistic from `anova()` equals the square of one of the $t$-statistics from `facebook_MLR_add_results`! This is not a coincidence: these nested models differ in only one coefficient.
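The same identity can be checked on any pair of nested models that differ by a single coefficient. Here is a minimal sketch on the built-in `iris` data (not `facebook_sample`); all object names are hypothetical.

```r
# Illustrative sketch only: built-in iris data, not facebook_sample.
iris_species_only <- lm(Sepal.Length ~ Species, data = iris)
iris_add <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# F-statistic for the single additional coefficient (Petal.Length).
f_stat <- anova(iris_species_only, iris_add)$F[2]

# t-statistic for that same coefficient in the larger model.
t_stat <- summary(iris_add)$coefficients["Petal.Length", "t value"]

# F should equal t squared, up to numerical tolerance.
all.equal(f_stat, t_stat^2)
```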