# Tutorial 3: Introduction to Generative Modelling

## Learning Objectives

After completing this week's lecture and tutorial work, you will be able to:

1. Give an example of a question that could be answered by generative modelling.
2. Explain how a linear regression can be used to approximate the underlying mechanism that generated the data (quantitative response and input variables).
3. Interpret the estimated coefficients and $p$-values derived from theoretical results for a simple linear regression (i.e., one input variable).
4. Discuss the assumptions made to estimate the simple linear regression coefficients and approximate their sampling distribution.
5. Explain how to approximate the sampling distribution of the simple linear regression coefficient estimators using bootstrapping. 
6. Contrast the sampling distribution approximated using theoretical results with bootstrapping alternatives for a simple linear regression setting.
7. Compute confidence intervals for the simple linear regression coefficients using theoretical approximations and bootstrapping results.
8. Write a computer script to perform simple linear regression analysis.

In [None]:
# Run this cell before continuing.
library(tidyverse)
library(repr)
library(digest)
library(infer)
library(cowplot)
library(broom)
library(GGally)
source("tests_tutorial_03.R")

## 1. Warm Up Questions

**Question 1.0**
<br>{points: 1}

**True or false?**

To estimate how the birth weight of a newborn child is affected by the socioeconomic status of the family and the parental stability, a researcher wants to use a linear regression. The variable `newborn_weight` must be used as the response variable.

*Assign your answer to an object called `answer1.0`. Your answer should be either "true" or "false", surrounded by quotes.*

In [None]:
# answer1.0 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.0()

**Question 1.1**
<br>{points: 1}

**True or false?**

In a simple linear regression, the response is an exact linear function of the input variable, i.e.,

$$Y = \beta_0 + \beta_1 X$$

*Assign your answer to an object called `answer1.1`. Your answer should be either "true" or "false", surrounded by quotes.*

In [None]:
# answer1.1 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.1()

**Question 1.2**
<br>{points: 1}

**True or false?**

The error term $\varepsilon_i$ in the regression equation below contains relevant explanatory variables not taken into account in the model.

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

*Assign your answer to an object called `answer1.2`. Your answer should be either "true" or "false", surrounded by quotes.*

In [None]:
# answer1.2 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.2()

**Question 1.3**
<br>{points: 1}

**True or false?**

The population regression coefficient $\beta_1$ is always known, and we do not have to estimate it.

*Assign your answer to an object called `answer1.3`. Your answer should be either "true" or "false", surrounded by quotes.*

In [None]:
# answer1.3 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.3()

## 2. Exploratory Data Analysis

An important step in any data analysis is to explore the data and learn its main characteristics. This includes, but is not limited to:

- knowing the size of the data 

- examining distributions of all variables using graphical and numerical summaries

- identifying missing values and potential outliers

- beginning to discover relationships between variables

This step, usually referred to as **Exploratory Data Analysis (EDA)**, is generally the first step in the analysis. However, a typical data science workflow is never linear, and you may need to continue exploring the data at many points along the analysis path.

> **Heads up**: Professors Peng and Matsui, in their book "The Art of Data Science", described this process with **epicycles**.

###  EDA Checklist

> From The Art of Data Science (by Peng and Matsui)

1. Formulate your question
2. Read in your data
3. Check the packaging
4. Look at the top and the bottom of your data
5. Check your “n”s

#### The question

As a community manager of a cosmetics brand, you are interested in incorporating social media marketing communication to leverage your business. Thus, you want to determine which variables are associated with the impact of posts. 

#### Read in the data

In this tutorial, you will use part of the **Facebook dataset** to answer this question. This dataset contains critical information on users' engagement during 2014 on a Facebook page of a famous cosmetics brand. The original dataset contains 500 observations and 17 variables of different types (continuous and discrete). It can be found in [data.world](https://data.world/uci/facebook-metrics/workspace/project-summary?agentid=uci&datasetid=facebook-metrics). 

In this tutorial, you will work with a slightly smaller dataset containing 491 observations that was first analyzed by [Moro et al. (2016)](https://gw2jh3xr2c.search.serialssolutions.com/log?L=GW2JH3XR2C&D=ADALY&J=JOUROFBUSRE&P=Link&PT=EZProxy&A=Predicting+social+media+performance+metrics+and+evaluation+of+the+impact+on+brand+building%3A+A+data+mining+approach&H=d8c19bb47c&U=https%3A%2F%2Fezproxy.library.ubc.ca%2Flogin%3Furl%3Dhttps%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Flink%3Fref_val_fmt%3Dinfo%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal%26svc_val_fmt%3Dinfo%3Aofi%2Ffmt%3Akev%3Amtx%3Asch_srv%26rfr_dat%3Dsaltver%3A1%26rfr_dat%3Dorigin%3ASERIALSSOL%26ctx_enc%3Dinfo%3Aofi%2Fenc%3AUTF-8%26ctx_ver%3DZ39.88-2004%26rft_id%3Dinfo%3Adoi%2F10.1016%2Fj.jbusres.2016.02.010) to predict the impact of posts using different performance metrics.

Consider the dataset `facebook_data` as your *random sample* (like in DSCI 100 and STAT 201!) in your analysis.

> **Heads up**: Recall the importance of using a *random* sample to obtain representative summaries and broad conclusions!

Let's start by reading this dataset!

In [None]:
facebook_data <- read_csv("data/facebook_data.csv") %>%
  select(total_engagement_percentage, page_engagement_percentage, share_percentage, comment_percentage)

The object `facebook_data` contains four (of the original 17) variables:

1.  The continuous variable `total_engagement_percentage` is an essential variable for any company owning a Facebook page. It gives a sense of how engaged the overall social network's users are with the company's posts, **regardless of whether they previously liked their Facebook page or not**. *The larger the percentage, the better the total engagement*. It is computed as follows:

$$\texttt{total_engagement_percentage} = \frac{\text{Lifetime Engaged Users}}{\text{Lifetime Post Total Reach}} \times 100\%$$

-   **Lifetime Post Total Reach:** The number of overall *Facebook unique users* who *saw* the post.
-   **Lifetime Engaged Users:** The number of overall *Facebook unique users* who *saw and clicked* on the post. This count is a subset of **Lifetime Post Total Reach**.

2.  The continuous variable `page_engagement_percentage` is analogous to `total_engagement_percentage`, but only counts users who engaged with the post **given they have liked the page**. This variable gives the company a sense of the extent to which these subscribed users react to its posts. *The larger the percentage, the better the page engagement*. It is computed as follows:

$$\texttt{page_engagement_percentage} = \frac{\text{Lifetime Users Who Have Liked the Page and Engaged with the Post}}{\text{Lifetime Post Reach by Users Who Liked the Page}} \times 100\% $$

-   **Lifetime Post Reach by Users Who Liked the Page:** The number of *Facebook unique page subscribers* who *saw* the post.
-   **Lifetime Users Who Have Liked the Page and Engaged with the Post:** The number of *Facebook unique page subscribers* who *saw and clicked* on the post. This count is a subset of **Lifetime Post Reach by Users Who Liked the Page**.

3.  The continuous variable `share_percentage` is the percentage that the number of *shares* represents out of the sum of *likes*, *comments*, and *shares* in each post. It is computed as follows:

$$\texttt{share_percentage} = \frac{\text{Number of Shares}}{\text{Total Post Interactions}} \times 100\% $$

-   **Total Post Interactions:** The sum of *likes*, *comments*, and *shares* in a given post.
-   **Number of Shares:** The number of *shares* in a given post. This count is a subset of *Total Post Interactions*.

4.  The continuous variable `comment_percentage` is the percentage that the number of *comments* represents out of the sum of *likes*, *comments*, and *shares* in each post. It is computed as follows:

$$\texttt{comment_percentage} = \frac{\text{Number of Comments}}{\text{Total Post Interactions}} \times 100\% $$

-   **Total Post Interactions:** The sum of *likes*, *comments*, and *shares* in a given post.
-   **Number of Comments:** The number of *comments* in a given post. This count is a subset of *Total Post Interactions*.
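
As a quick check of the formulas above, the two interaction percentages can be reproduced from raw counts. The counts below are made up for illustration (they are not taken from the Facebook dataset), and the object names `likes`, `comments`, `shares`, and `total_interactions` are hypothetical.

```r
# Hypothetical counts for a single post (not taken from the real dataset).
likes <- 80
comments <- 5
shares <- 15

# Total Post Interactions is the sum of likes, comments, and shares.
total_interactions <- likes + comments + shares

# Each percentage is the count divided by the total, times 100.
share_percentage <- 100 * shares / total_interactions
comment_percentage <- 100 * comments / total_interactions

share_percentage
comment_percentage
```

Because both percentages share the same denominator, they can never add up to more than 100% for a given post.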

#### Check the packaging

In [None]:
str(facebook_data)

#### Look at the top and the bottom of your data

**Question 2.0**
<br>{points: 1}

Get the first and last 3 rows of `facebook_data`. Call these new objects `facebook_data_head` and `facebook_data_tail`, respectively.

*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [None]:
# facebook_data_head <- ... %>% ...(...)
# facebook_data_tail <- ... %>% ...(...)
# facebook_data_head
# facebook_data_tail

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.0()

#### Check your “n”s

**Question 2.1**
<br>{points: 1}

Check the dimensions of `facebook_data`. Call this new object `facebook_data_dim`.

**2.1.** How many observations (posts) are there in the dataset `facebook_data`?

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it. Assign your answer to the object `answer2.1` (numeric type).*

In [None]:
# facebook_data_dim <- ...
# facebook_data_dim
# answer2.1 <- ...
# answer2.1

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.1()

#### Get summary statistics

In this example, we will obtain some useful summary statistics of all the variables in the dataset. 

- Use the `gather` function to convert the dataset into a *long* format

- Then use `summarise` to obtain summary statistics

In [None]:
facebook_data_long <- gather(facebook_data, factor_key=TRUE)
head(facebook_data_long)
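
To see what the reshaping step does, here is a minimal sketch on a toy data frame (the tibble `toy_wide` and its columns `a` and `b` are hypothetical). It also shows `pivot_longer()`, which the `tidyr` documentation recommends over the superseded `gather()`; both return the same key/value pairs, only in a different row order.

```r
library(tidyr)
library(tibble)

# Toy wide data with hypothetical column names.
toy_wide <- tibble(a = c(1, 2), b = c(3, 4))

# gather() stacks the columns one after another into key/value pairs.
# factor_key = TRUE stores the key as a factor that keeps the column order.
toy_long_gather <- gather(toy_wide, factor_key = TRUE)

# pivot_longer() produces the same key/value pairs, ordered row by row.
toy_long_pivot <- pivot_longer(toy_wide, cols = c(a, b),
                               names_to = "key", values_to = "value")

toy_long_gather
toy_long_pivot
```

In the long format, every row holds one variable name and one value, which is what makes a grouped summary per variable possible.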

**Question 2.2**
<br>{points: 1}

Obtain the sample mean, standard deviation, maximum, and minimum for all variables and save them into an object called `facebook_data_stats`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# facebook_data_stats <- facebook_data_long %>% group_by(...)%>%
#  summarise(mean= ...(...), sd= ...(...), max = ...(...),min = ...(...))
# facebook_data_stats 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.2()

#### Check the distribution of the variables

Although not mentioned in the previous checklist, an important item in the EDA is to check the distribution of your variables. 

**Question 2.7**
<br>{points: 1}

As a community manager of the cosmetics brand, you are interested in determining the inputs associated with the main metric `total_engagement_percentage`.

Use the plotting function `ggpairs()`, from the library `GGally`, to generate a pair plot **of ALL the variables found in `facebook_data`**. The `ggplot()` object's name will be `facebook_pair_plots`. 

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
options(repr.plot.width = 15, repr.plot.height = 12) # Adjust these numbers so the plot looks good on your screen.

# facebook_pair_plots <- ... %>%
#   ...(progress = FALSE) +
#   theme(
#     text = element_text(size = 15),
#     plot.title = element_text(face = "bold"),
#     axis.title = element_text(face = "bold")
#   )
# facebook_pair_plots

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.7()

> **Heads up**: compare the output of the function `ggpairs()` with that of the function `pairs()` used in `worksheet_03`. Which one do you prefer? No need to answer here, just think about it and discuss it with peers if you want!

**Question 2.8**
<br>{points: 1}

Looking at the distribution of `page_engagement_percentage`, how would you describe the empirical distribution of this variable?

**A.** Fairly symmetric.

**B.** Left-skewed.

**C.** Right-skewed.

*Assign your answer to an object called `answer2.8`. Your answer should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes.*

In [None]:
# answer2.8 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.8()

## 3. Simple Linear Regression (SLR)

**Question 3.0**
<br>{points: 1}

Based on the visual inspection of the relationship between the variables in the data, you decide to use a simple linear regression (SLR) to study the relation of `total_engagement_percentage` and `page_engagement_percentage`. 

How would you describe the graphical association between these two variables?

**A.** Positive.

**B.** Negative.

*Assign your answer to an object called `answer3.0`. Your answer should be one of `"A"` or `"B"` surrounded by quotes.*

In [None]:
# answer3.0 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.0()

**Question 3.1**
<br>{points: 1}

Within the context of this exercise, answer the following:

3.1.0. Which variable will you choose as a response $Y$? Answer with the column's name from `facebook_data`.

3.1.1. Which variable will you choose as the input $X$? Answer with the column's name from `facebook_data`.

*Assign your answers to the objects `answer3.1.0` and `answer3.1.1` (both character type, surrounded by quotes).*

In [None]:
# answer3.1.0 <- ...
# answer3.1.0
# answer3.1.1 <- ...
# answer3.1.1

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.1.0()
test_3.1.1()

**Question 3.2**
<br>{points: 1}

Fit the SLR model proposed and assign it to the object `facebook_SLR`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# facebook_SLR <- ...(... ~ ...,
#   data = ...
# )
# facebook_SLR

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.2()

**Question 3.3**
<br>{points: 1}

Find the estimated coefficients of `facebook_SLR` using `tidy()`. Report the estimated coefficients, their standard errors and corresponding $p$-values. Include the corresponding asymptotic 95% confidence intervals. Store the results in the variable `facebook_SLR_results`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# facebook_SLR_results <- ...(..., conf.int = ...) %>% mutate_if(is.numeric, round, 2)
# facebook_SLR_results

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.3()
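
The confidence intervals requested in Question 3.3 are theory-based. As a rough illustration of how such an interval is built, the sketch below fits an SLR to simulated data (the data frame `sim`, its columns, and the seed are all hypothetical choices, unrelated to `facebook_data`) and reproduces the interval returned by `confint()` by hand.

```r
set.seed(2024)  # arbitrary seed, for reproducibility only

# Simulated data with hypothetical names; true slope 2, Normal errors.
n <- 50
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- 1 + 2 * sim$x + rnorm(n, sd = 1.5)

fit <- lm(y ~ x, data = sim)
est <- coef(summary(fit))["x", "Estimate"]
se <- coef(summary(fit))["x", "Std. Error"]

# 95% CI: estimate +/- t quantile (n - 2 degrees of freedom) * standard error.
t_crit <- qt(0.975, df = df.residual(fit))
ci_manual <- c(est - t_crit * se, est + t_crit * se)

ci_manual
confint(fit, "x", level = 0.95)  # same interval, computed by R
```

The interval is centred at the estimated slope, and its width is driven by the standard error, so a larger sample generally yields a narrower interval.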

**Question 3.4**
<br>{points: 1}

Using `facebook_data`, create a scatterplot of the response $Y$ versus the input $X$. Add the estimated SLR line!

Call the resulting object `facebook_scatterplot`.

*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [None]:
# facebook_scatterplot <- ggplot(..., aes(x = ..., y = ...)) +
#   ...() +
#   ...(aes(x = ..., y = ...), method = ..., se = FALSE, size = 1.5) +
#   xlab(...) +
#   ylab(...) +
#   theme(
#     text = element_text(size = 16.5),
#     plot.title = element_text(face = "bold"),
#     axis.title = element_text(face = "bold")
#   )
# facebook_scatterplot
 
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.4()

**Question 3.5**
<br>{points: 1}

Using the results in `facebook_SLR_results`, write the correct interpretation of the estimated slope.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

## 4. Inference

**Question 4.0**
<br>{points: 1}

Using a significance level $\alpha = 0.05$, is `page_engagement_percentage` statistically associated with `total_engagement_percentage`?

*Assign your answer to an object called `answer4.0`. Your answer should be either "true" or "false", surrounded by quotes.*

In [None]:
# answer4.0 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_4.0()

**Question 4.1**
<br>{points: 1}

The $p$-values displayed in `facebook_SLR_results`, obtained from the model fitted with `lm`, are based on:

**A.** classical theoretical approximations or results

**B.** bootstrapping experiments

**C.** none of the above

*Assign your answer to an object called `answer4.1`. Your answer should be one of "A", "B", or "C" surrounded by quotes.*

In [None]:
# answer4.1 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_4.1()

**Question 4.2**
<br>{points: 1}

Using `facebook_SLR_results`, provide a correct interpretation of the 95% CI for `page_engagement_percentage`.



DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 4.3**
<br>{points: 1}

One of the *sampling distributions* related to SLR is:

**A.** The distribution of the response $Y$.

**B.** The distribution of the true population slope $\beta_1$.

**C.** The distribution of $\hat{\beta}_1$, the estimator of the slope.

**D.** The distribution of the input variable $X$.

*Assign your answer to an object called `answer4.3`. Your answer should be one of "A", "B", "C", or "D" surrounded by quotes.*

In [None]:
# answer4.3 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_4.3()

**Question 4.4**
<br>{points: 1}

If we assume that the error terms are Normal, it can be proved that the sampling distributions of the coefficient estimators are Normal as well. However, we usually don't know the true distribution of the error terms.
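
To make this concrete, here is the classical result for the slope, stated as a sketch under the usual SLR assumptions. The error variance $\sigma^2$ and the sample size $n$ are notation introduced here for illustration. If the errors $\varepsilon_i$ are independent $N(0, \sigma^2)$ and the inputs $X_i$ are treated as fixed, then

$$\hat{\beta}_1 \sim N\left(\beta_1, \ \frac{\sigma^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\right)$$

In practice $\sigma^2$ is unknown and is replaced by an estimate, which leads to a $t$ distribution with $n - 2$ degrees of freedom for the standardized slope:

$$\frac{\hat{\beta}_1 - \beta_1}{\widehat{\text{SE}}(\hat{\beta}_1)} \sim t_{n-2}$$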

When we don't want to make these assumptions, another way of approximating the sampling distribution is bootstrapping from `facebook_data`.

- Obtain $B = 1000$ sets of regression estimates by fitting an SLR $B$ times using their respective bootstrapped samples. 

- Store the corresponding $\hat{\beta}^*_0$ and $\hat{\beta}^*_1$ per bootstrapped sample in the data frame `lm_boot` of 1000 rows and two columns:
    - boot_intercept: list of bootstrapped intercepts $\hat{\beta}^*_0$.
    - boot_slope: list of bootstrapped slopes $\hat{\beta}^*_1$.
    
*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(123)  # DO NOT CHANGE!

# n <- ...
# B <- ...

# lm_boot <- replicate(..., {
#   sample_n(..., ..., replace = ...) %>%
#     lm(formula = ..., data = .) %>%
#     .$coef
# })
# lm_boot <- data.frame(boot_intercept = lm_boot[1, ], boot_slope = lm_boot[2, ])

# head(lm_boot)
# tail(lm_boot)

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_4.4()
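
The bootstrap procedure described in Question 4.4 can also be written as an explicit loop. The sketch below is only an illustration on simulated data (the data frame `sim`, the vector `boot_slopes`, the seed, and the smaller value of `B` are hypothetical choices, not part of the exercise), using base R instead of `sample_n()` and `replicate()`.

```r
set.seed(2024)  # arbitrary seed, for reproducibility only

# Simulated sample with hypothetical names (true intercept 1, true slope 2).
n <- 100
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- 1 + 2 * sim$x + rnorm(n, sd = 1.5)

B <- 500  # number of bootstrap samples (kept small for speed)
boot_slopes <- numeric(B)

for (b in seq_len(B)) {
  # Resample row indices with replacement, keeping the original sample size.
  rows <- sample(n, size = n, replace = TRUE)
  boot_fit <- lm(y ~ x, data = sim[rows, ])
  boot_slopes[b] <- coef(boot_fit)[["x"]]
}

# Bootstrap percentile 95% CI for the slope.
boot_ci <- quantile(boot_slopes, probs = c(0.025, 0.975))
boot_ci
```

Each bootstrap sample has the same size as the original sample and is drawn with replacement, so the spread of `boot_slopes` mimics the sample-to-sample variation of the slope estimator.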

**Question 4.5**
<br>{points: 1}

Now that we have a list of bootstrapped estimates, we can visualize the approximate sampling distributions!

Let's focus on the sampling distribution of the estimator of the slope.

The ggplot() object's name will be `slope_sampling_dist_boot`.

*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [None]:
#slope_sampling_dist_boot <-  ggplot(..., aes(x = ...)) +
#    geom_histogram(bins = ...) +
#    coord_cartesian(xlim = c(0.9, 1.2)) +
#    xlab("...") +
#    ggtitle("Sampling distribution for the estimator of the slope")

#slope_sampling_dist_boot


# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_4.5()

**Question 4.6**
<br>{points: 1}

Add vertical lines to the plot of the sampling distribution, `slope_sampling_dist_boot`, to visualize the upper and lower limits of the bootstrap *percentile CI* of the slope.

The ggplot() object's name will be `slope_sampling_dist_boot_limits`.

*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [None]:
# slope_sampling_dist_boot_limits <- ... + 
#   geom_vline(aes(xintercept = quantile(..., ...)), col = "blue", size = 1) +
#   geom_vline(aes(xintercept = quantile(..., ...)), col = "blue", size = 1)
# slope_sampling_dist_boot_limits

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_4.6()

**Question 4.7**
<br>{points: 1}

In one or two sentences explain how to use `lm_boot` generated in **Question 4.4** to approximate the sampling distribution of the estimator of the intercept.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.