# Tutorial 4: MLR with different types of input variables

#### Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:

1. Give an example of a real problem that could be answered by a multiple linear regression.
2. Interpret the coefficients and $p$-values of different types of input variables, including categorical input variables.
3. Define interactions in the context of linear regression.
4. Write a computer script to perform linear regression when input variables are continuous or discrete, and when there are interactions between some of these variables.

In [None]:
# Run this cell before continuing.
library(tidyverse)
library(repr)
library(infer)
library(cowplot)
library(broom)
source("tests_tutorial_04.R")

## The data

We will continue using the Facebook dataset from `tutorial_03`. Recall that this dataset contains information on user engagement during 2014 on the Facebook page of a famous cosmetics brand. The original dataset contains 500 observations related to different types of posts. It can be found on [data.world](https://data.world/uci/facebook-metrics/workspace/project-summary?agentid=uci&datasetid=facebook-metrics). The dataset you will work with in this tutorial contains 491 observations. The dataset was first analyzed by [Moro et al. (2016)](https://gw2jh3xr2c.search.serialssolutions.com/log?L=GW2JH3XR2C&D=ADALY&J=JOUROFBUSRE&P=Link&PT=EZProxy&A=Predicting+social+media+performance+metrics+and+evaluation+of+the+impact+on+brand+building%3A+A+data+mining+approach&H=d8c19bb47c&U=https%3A%2F%2Fezproxy.library.ubc.ca%2Flogin%3Furl%3Dhttps%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Flink%3Fref_val_fmt%3Dinfo%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal%26svc_val_fmt%3Dinfo%3Aofi%2Ffmt%3Akev%3Amtx%3Asch_srv%26rfr_dat%3Dsaltver%3A1%26rfr_dat%3Dorigin%3ASERIALSSOL%26ctx_enc%3Dinfo%3Aofi%2Fenc%3AUTF-8%26ctx_ver%3DZ39.88-2004%26rft_id%3Dinfo%3Adoi%2F10.1016%2Fj.jbusres.2016.02.010) in their data mining work to predict the performance of different post metrics, which are also based on the type of post. The original dataset has 17 variables of different types (continuous and discrete). For this tutorial, however, we extracted the following five variables, saved in `facebook_data`:

1.  `total_engagement_percentage` (continuous): gives a sense of how engaged the overall social network's users are with the company's posts, **regardless of whether they previously liked their Facebook page or not**. *The larger the percentage, the better the total engagement*. It is computed as follows:

$$\texttt{total_engagement_percentage} = \frac{\text{Lifetime Engaged Users}}{\text{Lifetime Post Total Reach}} \times 100\%$$

-   **Lifetime Post Total Reach:** The number of overall *Facebook unique users* who *saw* the post.
-   **Lifetime Engaged Users:** The number of overall *Facebook unique users* who *saw and clicked* on the post. This count is a subset of **Lifetime Post Total Reach**.

2.  `page_engagement_percentage` (continuous): is analogous to `total_engagement_percentage`, but only counts users who engaged with the post **given they have liked the page**. This variable gives the company a sense of the extent to which these subscribed users react to its posts. *The larger the percentage, the better the page engagement*. It is computed as follows:

$$\texttt{page_engagement_percentage} = \frac{\text{Lifetime Users Who Have Liked the Page and Engaged with the Post}}{\text{Lifetime Post Reach by Users Who Liked the Page}} \times 100\% $$

-   **Lifetime Post Reach by Users Who Liked the Page:** The number of *Facebook unique page subscribers* who *saw* the post.
-   **Lifetime Users Who Have Liked the Page and Engaged with the Post:** The number of *Facebook unique page subscribers* who *saw and clicked* on the post. This count is a subset of **Lifetime Post Reach by Users Who Liked the Page**.

3.  `share_percentage` (continuous): is the percentage that the number of *shares* represents out of the sum of *likes*, *comments*, and *shares* in each post. It is computed as follows:

$$\texttt{share_percentage} = \frac{\text{Number of Shares}}{\text{Total Post Interactions}} \times 100\% $$

-   **Total Post Interactions:** The sum of *likes*, *comments*, and *shares* in a given post.
-   **Number of Shares:** The number of *shares* in a given post. This count is a subset of *Total Post Interactions*.

4.  `comment_percentage` (continuous): is the percentage that the number of *comments* represents out of the sum of *likes*, *comments*, and *shares* in each post. It is computed as follows:

$$\texttt{comment_percentage} = \frac{\text{Number of Comments}}{\text{Total Post Interactions}} \times 100\% $$

-   **Total Post Interactions:** The sum of *likes*, *comments*, and *shares* in a given post.
-   **Number of Comments:** The number of *comments* in a given post. This count is a subset of *Total Post Interactions*.

5. `post_category` (categorical): is a discrete variable with three different levels characterizing the content of the post:

-   `Action`: Brand's contests and special offers for the customers.
-   `Product`: Regular advertisements for products with explicit brand content.
-   `Inspiration`: Non-explicit brand-related content.
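
As a rough illustration of how the four percentage variables are built, the sketch below applies the formulas above to made-up counts for a single post. All object names and numbers here are hypothetical and are not taken from `facebook_data`.

```r
# Hypothetical raw counts for one post (not real data).
reach_total <- 2000   # Lifetime Post Total Reach
engaged_total <- 150  # Lifetime Engaged Users (subset of reach_total)
reach_page <- 1200    # Lifetime Post Reach by Users Who Liked the Page
engaged_page <- 120   # Page subscribers who saw and clicked (subset of reach_page)
likes <- 80
comments <- 5
shares <- 15

# Total Post Interactions is the sum of likes, comments, and shares.
interactions <- likes + comments + shares

total_engagement_percentage <- engaged_total / reach_total * 100
page_engagement_percentage <- engaged_page / reach_page * 100
share_percentage <- shares / interactions * 100
comment_percentage <- comments / interactions * 100

# Collect the four percentages in a named vector.
c(
  total_engagement = total_engagement_percentage,
  page_engagement = page_engagement_percentage,
  share = share_percentage,
  comment = comment_percentage
)
```

Because each numerator counts a subset of its denominator, every one of these variables lies between 0 and 100.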

> **Note**: For all the estimated regression models in this tutorial, we will consider `total_engagement_percentage` as a response along with `page_engagement_percentage` and `post_category` as inputs.

#### Important: Note that if the categorical variable is not a factor, `lm` won't create a dummy variable!! 

> Make sure that categorical variables in your model are factors. If they are not, then **set them as factors**!! (see code example below)

In [None]:
facebook_data <- read_csv("data/facebook_data.csv")
head(facebook_data)

facebook_data$post_category <- as.factor(facebook_data$post_category)
levels(facebook_data$post_category)
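
To see why the factor conversion matters, the sketch below uses a small made-up data frame (all names are hypothetical) in which the same three categories are stored once as numeric codes and once as a factor. `model.matrix()` shows the columns of the design matrix that `lm` would build from each version.

```r
# Toy data: the same three categories stored in two ways (hypothetical names).
toy <- data.frame(
  x = c(1.2, 2.5, 3.1, 4.8, 5.0, 6.3),
  group_code = c(1, 2, 3, 1, 2, 3) # numeric codes, NOT a factor
)
toy$group <- factor(toy$group_code, labels = c("a", "b", "c"))

# Numeric codes are treated as a single continuous input: one column.
colnames(model.matrix(~ x + group_code, data = toy))

# A factor is expanded into dummy variables, one per non-baseline level.
colnames(model.matrix(~ x + group, data = toy))
```

With numeric codes, the model would estimate a single "slope" for the category codes, which is rarely meaningful for a nominal variable.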

## 1. MLR: additive

Let $Y_i$ be the response `total_engagement_percentage` and $X_{1i}$ be a continuous input for the variable `page_engagement_percentage`. Furthermore, define the following dummy variables associated with `post_category` as:

$$
X_{2i} =
\begin{cases}
1 \; \; \; \; \mbox{if the $i$th post is Inspiration-type,}\\
0 \; \; \; \; \mbox{otherwise};
\end{cases}
$$

$$
X_{3i} =
\begin{cases}
1 \; \; \; \; \mbox{if the $i$th post is Product-type,}\\
0 \; \; \; \; \mbox{otherwise}.
\end{cases}
$$

Mathematically, the equation of an additive MLR with one continuous variable and one categorical variable with 3 levels can be written as:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i$$

The regression coefficients $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$ are the **true and unknown** regression coefficients **we want to estimate**. 

Moreover, $\varepsilon_i$ is the error term with $E(\varepsilon_i|X_{ji}) = E(\varepsilon_i) = 0$ (for all $j$ variables) and $\text{Var}(\varepsilon_i) = \sigma^2$. Note that $i$ could take on the following values: $1 , \dots, n$. Recall that $n$ is the sample size.
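
One way to build intuition for "true and unknown" coefficients is a small simulation in which we choose the coefficients ourselves and check that `lm` recovers them approximately. The sketch below is an illustration only: the coefficient values, sample size, and error standard deviation are arbitrary assumptions, and the simulated data are unrelated to `facebook_data`.

```r
set.seed(2024) # for reproducibility

n <- 3000
beta <- c(b0 = 2, b1 = 0.5, b2 = 1, b3 = -1) # assumed "true" coefficients

x1 <- runif(n, min = 0, max = 10)
category <- factor(sample(c("Action", "Inspiration", "Product"), n, replace = TRUE))

# Dummy variables as defined above.
x2 <- as.numeric(category == "Inspiration")
x3 <- as.numeric(category == "Product")

# Error term with mean 0 and constant variance (sigma = 1).
eps <- rnorm(n, mean = 0, sd = 1)
y <- beta[["b0"]] + beta[["b1"]] * x1 + beta[["b2"]] * x2 + beta[["b3"]] * x3 + eps

# Fit the additive model; lm builds the dummy variables from the factor.
sim_fit <- lm(y ~ x1 + category)
round(coef(sim_fit), 2) # estimates should be close to 2, 0.5, 1, and -1
```

In a real analysis we only observe the data, never `beta`, which is why the estimates come with standard errors and confidence intervals.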

> **Note** that only one of these coefficients is a *slope* in a mathematical sense. The coefficients of dummy variables cannot be interpreted as slopes!
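
To make this concrete, substituting the possible values of the dummy variables into the additive model gives the expected response for each kind of post:

$$
E(Y_i \mid X_{1i}, X_{2i}, X_{3i}) =
\begin{cases}
\beta_0 + \beta_1 X_{1i} \; \; \; \; \mbox{if } X_{2i} = 0 \mbox{ and } X_{3i} = 0,\\
(\beta_0 + \beta_2) + \beta_1 X_{1i} \; \; \; \; \mbox{if } X_{2i} = 1 \mbox{ (Inspiration-type post)},\\
(\beta_0 + \beta_3) + \beta_1 X_{1i} \; \; \; \; \mbox{if } X_{3i} = 1 \mbox{ (Product-type post)}.
\end{cases}
$$

In each case the coefficient multiplying $X_{1i}$ is $\beta_1$, while $\beta_2$ and $\beta_3$ only shift the constant term.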

**Question 1.0**
<br>{points: 1}

Since the input `post_category` is discrete and nominal, `lm` selects, by default, one level as a baseline to create 2 dummy variables. Which of the three levels of `post_category` is selected, by default, as a baseline?

**A.** `Inspiration`

**B.** `Product`

**C.**  `Action`

*Assign your answer to an object called `answer1.0`. Your answer should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes.*

In [None]:
# answer1.0 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.0()

**Question 1.1**
<br>{points: 1}

How many dummy variables does `lm` create to fit a linear regression with the categorical variable `post_category`?

**A.** 1

**B.** 2

**C.** 3

**D.** 4

*Assign your answer to an object called `answer1.1`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer1.1 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.1()

**Question 1.2**
<br>{points: 1}

If you assume that the relation between `total_engagement_percentage` and `page_engagement_percentage` is the same for all types of posts (that is, for all levels of `post_category`), which MLR will you fit in `R` using the `lm` function?

**A.** `lm(total_engagement_percentage ~ page_engagement_percentage + post_category, data = facebook_data)`

**B.** `lm(total_engagement_percentage ~ page_engagement_percentage * post_category, data = facebook_data)`

*Assign your answer to an object called `answer1.2`. Your answer should be one of `"A"` or `"B"` surrounded by quotes.*

In [None]:
# answer1.2 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.2()

**Question 1.3**
<br>{points: 1}

Which of the following descriptions will best describe a visualization of the MLR considered in **Question 1.2**? 

**A.** one line through a cloud of data points

**B.** two lines with equal slopes but different intercepts

**C.** two lines with different slopes and different intercepts

**D.** three lines with equal slopes but different intercepts

**E.** three lines with different slopes and different intercepts

**F.** three boxplots for different levels of `post_category`

*Assign your answer to an object called `answer1.3`. Your answer should be one of `"A"`, `"B"`, `"C"`, `"D"`, `"E"`, or `"F"` surrounded by quotes.*

In [None]:
# answer1.3 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.3()

**Question 1.4**
<br>{points: 1}

Using `facebook_data`, estimate the MLR proposed in **Question 1.2** and call it `facebook_MLR_add`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# facebook_MLR_add <- ...(...,
#   ...
# )
# facebook_MLR_add

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.4.0()
test_1.4.1()
test_1.4.2()
test_1.4.3()

**Question 1.5**
<br>{points: 1}

Create a plot of the data in `facebook_data` (using `geom_point()`) along with the estimated regression lines coming from the additive regression model `facebook_MLR_add`. Use different colours for the points and regression lines of each type of post (levels of `post_category`). Include a legend indicating which colour corresponds to each level, and use proper axis labels. The `ggplot()` object's name will be `facebook_MLR_add_plot`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
options(repr.plot.width = 15, repr.plot.height = 7) # Adjust these numbers so the plot looks good on your screen.

# facebook_data$pred_MLR_Add <- predict(facebook_MLR_add) # Using predict() to create estimated regression lines.

# facebook_MLR_add_plot <- ggplot(..., aes(
#   ...,
#   ...,
#   color = ...
# )) +
#   ...() +
#   geom_line(aes(y = pred_MLR_Add), size = 1) +
#   labs(
#     title = ...,
#     x = ...,
#     y = ...
#   ) +
#   theme(
#     text = element_text(size = 16.5),
#     plot.title = element_text(face = "bold"),
#     axis.title = element_text(face = "bold"),
#     legend.title = element_text(face = "bold"),
#   ) +
#   labs(color = "Post Category")
# facebook_MLR_add_plot

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.5()

**Question 1.6**
<br>{points: 1}

Find the estimated coefficients of `facebook_MLR_add` using `tidy()`. Report the estimated coefficients, their standard errors and corresponding $p$-values. Include the corresponding asymptotic **90% confidence intervals**. Store the results in the variable `facebook_MLR_add_results`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# facebook_MLR_add_results <- ...(..., ..., ...) %>% mutate_if(is.numeric, round, 2)
# facebook_MLR_add_results

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.6()
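
For intuition on where confidence intervals like those in **Question 1.6** come from, the sketch below rebuilds 90% intervals by hand as estimate ± critical value × standard error. It uses the built-in `mtcars` data rather than `facebook_data`, and base R's `confint()` rather than `tidy()`, so it illustrates the computation and is not the answer to the question. The model and variable choices are arbitrary assumptions.

```r
# Illustrative additive model on a built-in dataset (arbitrary choice).
cars_data <- mtcars
cars_data$cyl <- as.factor(cars_data$cyl)
toy_fit <- lm(mpg ~ wt + cyl, data = cars_data)

# Estimates and standard errors from the coefficient table.
coef_table <- coef(summary(toy_fit))
estimate <- coef_table[, "Estimate"]
std_error <- coef_table[, "Std. Error"]

# Critical value for a 90% interval, based on the t distribution.
t_crit <- qt(0.95, df = df.residual(toy_fit))

manual_ci <- cbind(
  lower = estimate - t_crit * std_error,
  upper = estimate + t_crit * std_error
)

# The manual intervals should agree with those returned by confint().
manual_ci
confint(toy_fit, level = 0.90)
```

A 90% interval that excludes zero corresponds to a $p$-value below 0.10 for the two-sided test of that coefficient.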

**Question 1.7**
<br>{points: 1}

Using the results in `facebook_MLR_add_results` from **Question 1.6**, how would you interpret the estimated coefficient of the continuous variable `page_engagement_percentage`?

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.8**
<br>{points: 1}

Using a **significance level $\alpha = 0.10$**, which of the following claims are correct?


**A.** There is enough evidence to reject the hypothesis that, for any page engagement percentage, the expected total engagement percentage is the same for Inspiration-posts and Action-posts.

**B.** There is enough evidence to reject the hypothesis that, for any page engagement percentage, the expected total engagement percentage is the same for Product-posts and Action-posts.

**C.** For any type of post, the page engagement percentage is statistically associated with the total engagement percentage.

**D.** There is enough evidence to believe that the association between `page_engagement_percentage` and `total_engagement_percentage` varies with the types of posts.


*Assign your answers to the object `answer1.8`. Your answers have to be included in a single string indicating the correct options **in alphabetical order** and surrounded by quotes (e.g., `"ABCD"` indicates you are selecting the four options).*

In [None]:
# answer1.8 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.8()

## 2. MLR: with interactions

In this section, we will explore whether the relation between `total_engagement_percentage` and `page_engagement_percentage` is the same for all types of posts. We can do this using **interactions** between the input variables!

> **Note** that interactions can be used, in general, when the relation between an input and the response depends on another input variable (not necessarily categorical!).
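
As an illustration of an interaction between two *continuous* inputs, the sketch below uses the built-in `mtcars` data (unrelated to the Facebook data; the variable choice is arbitrary). With an interaction, the slope of one input is no longer a single number: it changes with the value of the other input. The helper `slope_wt()` is a hypothetical function defined only for this example.

```r
# mpg modelled by weight (wt), horsepower (hp), and their interaction.
cont_fit <- lm(mpg ~ wt * hp, data = mtcars)
b <- coef(cont_fit)
names(b) # "(Intercept)" "wt" "hp" "wt:hp"

# Implied slope of wt at a given hp: b_wt + b_(wt:hp) * hp.
slope_wt <- function(hp) unname(b["wt"] + b["wt:hp"] * hp)

slope_wt(100) # slope of wt for cars with hp = 100
slope_wt(200) # slope of wt for cars with hp = 200
```

The two calls return different values whenever the estimated interaction coefficient is non-zero.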

Mathematically, the equation of an MLR with one continuous input variable that interacts with one categorical variable with 3 levels can be written as:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{1i} \times X_{2i} + \beta_5 X_{1i} \times X_{3i} + \varepsilon_i$$

> **Note**: the continuous variable, $X_1$, is multiplied by the dummy variables, $X_2$ and $X_3$, to model *different slopes*.

**Question 2.0**
<br>{points: 1}

We can use `lm` to fit the MLR with interactions between the continuous variable `page_engagement_percentage` and the categorical variable `post_category` (with 3 levels) defined above.

How many regression coefficients will be estimated by `lm`?

**A.** 1

**B.** 2

**C.** 3

**D.** 4

**E.** 5

**F.** 6

*Assign your answer to an object called `answer2.0`. Your answer should be one of `"A"`, `"B"`, `"C"`, `"D"`, `"E"`, or `"F"` surrounded by quotes.*

In [None]:
# answer2.0 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.0()

**Question 2.1**
<br>{points: 1}

Using `facebook_data`, estimate the MLR with interaction described above and call it `facebook_MLR_int`.

> **Hint:** Interaction terms can be easily specified in `lm()` using the notation `*`.
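
As a quick check of what `*` does in a model formula, the sketch below inspects the terms of a formula without fitting anything. The variable names `y`, `x`, and `g` are placeholders, not columns of `facebook_data`.

```r
# `x * g` is shorthand for the main effects plus their interaction.
attr(terms(y ~ x * g), "term.labels")

# The same model written out explicitly.
attr(terms(y ~ x + g + x:g), "term.labels")
```

Both calls list the same three terms, so the two formulas describe the same model.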

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# facebook_MLR_int <- ...(...,
#   ...
# )
# facebook_MLR_int

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.1.0()
test_2.1.1()
test_2.1.2()
test_2.1.3()
test_2.1.4()
test_2.1.5()

**Question 2.2**
<br>{points: 1}

Create a plot of the data in `facebook_data` (using `geom_point()`) along with the estimated regression lines coming from the interaction regression model `facebook_MLR_int` (note that your plot should have three regression lines, one for each `post_category`). Use different colours for the points and regression lines of each type of post (levels of `post_category`). Include a legend indicating which colour corresponds to each level, and use proper axis labels. The `ggplot()` object's name will be `facebook_MLR_int_plot`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# facebook_data$pred_MLR_int <- predict(facebook_MLR_int) # Using predict() to create estimated regression lines.

# facebook_MLR_int_plot <- ggplot(..., aes(
#   ...,
#   ...,
#   color = ...
# )) +
#   ...() +
#   geom_line(aes(y = pred_MLR_int), size = 1) +
#   labs(
#     title = ...,
#     x = ...,
#     y = ...
#   ) +
#   theme(
#     text = element_text(size = 16.5),
#     plot.title = element_text(face = "bold"),
#     axis.title = element_text(face = "bold"),
#     legend.title = element_text(face = "bold"),
#   ) +
#   labs(color = "Post Category")
# facebook_MLR_int_plot

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.2()

**Question 2.3**
<br>{points: 1}

Find the estimated coefficients of `facebook_MLR_int` using `tidy()`. Report the estimated coefficients, their standard errors and corresponding $p$-values. Include the corresponding asymptotic 90% confidence intervals. Store the results in the variable `facebook_MLR_int_results`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# facebook_MLR_int_results <- ...(..., ..., ...) %>% mutate_if(is.numeric, round, 2)
# facebook_MLR_int_results

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.3()

**Question 2.4**
<br>{points: 1}

Using a **significance level $\alpha = 0.10$**, which of the following claims are correct?


**A.** There is enough evidence to reject the hypothesis that, for any page engagement percentage, the expected total engagement percentage is the same for Product-posts and Action-posts.

**B.** For any type of post, the page engagement percentage is statistically associated with the total engagement percentage.

**C.** There is enough evidence to reject the hypothesis that the association between `page_engagement_percentage` and `total_engagement_percentage` is the same for Product-posts and Action-posts.

**D.** There is not enough evidence to reject the hypothesis that the association between `page_engagement_percentage` and `total_engagement_percentage` is the same for Inspiration-posts and Action-posts.

**E.** There is a statistically significant association between `page_engagement_percentage` and `total_engagement_percentage` for Action-posts.

*Assign your answers to the object `answer2.4`. Your answers have to be included in a single string indicating the correct options **in alphabetical order** and surrounded by quotes (e.g., `"ABCDE"` indicates you are selecting the five options).*

In [None]:
# answer2.4 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.4()

**Question 2.5**
<br>{points: 1}

What is the correct interpretation of the estimated coefficient of the interaction term `page_engagement_percentage:post_categoryProduct` from `facebook_MLR_int_results` in **Question 2.3**?

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 2.6**
<br>{points: 1}

Fit the following 3 models as indicated:

**A.** An SLR with `total_engagement_percentage` as the response and `page_engagement_percentage` as the *only* input variable, using only Action-posts in `facebook_data`. Use `tidy` to get the estimated parameters and standard errors. Call the results `facebook_SLR_action_results`.

**B.** An SLR with `total_engagement_percentage` as the response and `page_engagement_percentage` as the *only* input variable, using only Product-posts in `facebook_data`. Use `tidy` to get the estimated parameters and standard errors. Call the results `facebook_SLR_product_results`.

**C.** An MLR with `total_engagement_percentage` as the response and `page_engagement_percentage` and `post_category` as input variables, *including their interaction*, using `facebook_data`. Note that you already have the estimated parameters and standard errors in `facebook_MLR_int_results`. Uncomment the line to get the results again here.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
#facebook_action_data <- ...  %>% subset(... == ...)
#facebook_product_data <- ...  %>% subset(... == ...)

#facebook_SLR_action_results <- tidy(lm(...~..., data= ...)) %>% mutate_if(is.numeric, round, 2)
#facebook_SLR_action_results

#facebook_SLR_product_results <- tidy(lm(...~..., data= ...)) %>% mutate_if(is.numeric, round, 2)
#facebook_SLR_product_results

#facebook_MLR_int_results


# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.6.0()
test_2.6.1()
test_2.6.2()

**Question 2.7**
<br>{points: 1}

**2.7.0** Using the results from `facebook_SLR_action_results` and `facebook_MLR_int_results` in **Question 2.6**, explain why the estimated coefficients of `page_engagement_percentage` are the same in both models.

**2.7.1** Using the results from `facebook_SLR_product_results` and `facebook_MLR_int_results` in **Question 2.6**, explain why the estimated coefficients of `page_engagement_percentage` are *not* the same in both models.

**2.7.2** Using the results from **Question 2.6**, explain why the estimated coefficient of `page_engagement_percentage` in `facebook_SLR_product_results` is *not* the same as that of `page_engagement_percentage:post_categoryProduct` in `facebook_MLR_int_results`.

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.