# Tutorial 01: Introduction to Generative Modelling

## Learning Objectives

After completing this week's lecture and tutorial work, you will be able to:

1. Give an example of a question that could be answered by generative modelling.
2. Explain how a linear regression can be used to approximate the underlying mechanism that generated the data (quantitative response and input variables).
3. Interpret the estimated coefficients and $p$-values derived from theoretical results for a simple linear regression (i.e., one input variable).
4. Discuss the assumptions made to estimate the simple linear regression coefficients and approximate their sampling distribution.
5. Explain how to approximate the sampling distribution of the simple linear regression coefficient estimators using bootstrapping. 
6. Contrast the sampling distribution approximated using theoretical results with bootstrapping alternatives for a simple linear regression setting.
7. Compute confidence intervals for the simple linear regression coefficients using theoretical approximations and bootstrapping results.
8. Write a computer script to perform simple linear regression analysis.

In [None]:
# Run this cell before continuing.
library(tidyverse)
library(repr)
library(infer)
library(cowplot)
library(broom)
library(GGally)
library(AER)
source("tests_tutorial_01.R")

## 1. Warm Up Questions

**Question 1.0**
<br>{points: 1}

**True or false?**

To estimate how the weight at birth of a newborn child is affected by the socioeconomic status of the family and the parental stability, a researcher wants to use a linear regression. In this case, the variable that measures the newborns' weight would be the response variable.

*Assign your answer to an object called `answer1.0`. Your answer should be either "true" or "false", surrounded by quotes.*

In [None]:
# answer1.0 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.0()

**Question 1.1**
<br>{points: 1}

**True or false?**

In a simple linear regression, the response is an exact linear function of the input variable, i.e.,

$$Y = \beta_0 + \beta_1 X$$

*Assign your answer to an object called `answer1.1`. Your answer should be either "true" or "false", surrounded by quotes.*

In [None]:
# answer1.1 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.1()

**Question 1.2**
<br>{points: 1}

**True or false?**

The error term $\varepsilon_i$ in the regression equation below contains relevant information of explanatory variables not considered in the model to explain the variation in the response.

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

*Assign your answer to an object called `answer1.2`. Your answer should be either "true" or "false", surrounded by quotes.*

In [None]:
# answer1.2 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.2()

**Question 1.3**
<br>{points: 1}

**True or false?**

The population regression coefficient $\beta_1$ is always known, and we do not have to estimate it.

*Assign your answer to an object called `answer1.3`. Your answer should be either "true" or "false", surrounded by quotes.*

In [None]:
# answer1.3 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.3()
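
To make the regression equation $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ from the warm-up questions more concrete, here is a minimal sketch that simulates data from it and fits the line by least squares. Everything in it is an assumption made for illustration: the values of `beta_0` and `beta_1`, the Normal error term, and the sample size are chosen by us and have nothing to do with the tutorial's dataset.

```r
# Hypothetical simulation: the "population" coefficients below are made up.
set.seed(1)
n <- 200
beta_0 <- 10                              # assumed population intercept
beta_1 <- 2                               # assumed population slope
x <- runif(n, min = 0, max = 5)           # input variable
epsilon <- rnorm(n, mean = 0, sd = 1.5)   # error term
y <- beta_0 + beta_1 * x + epsilon        # response

# Vertical distance of each observation from the population line
deviations <- y - (beta_0 + beta_1 * x)
summary(deviations)

# Least squares estimates computed from the simulated sample
fit <- lm(y ~ x)
coef(fit)
```

In a simulation we can compare `coef(fit)` with `beta_0` and `beta_1` because we chose them ourselves; with real data, only the estimates are available.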

## 2. Exploratory Data Analysis

An essential step in any data analysis is to explore and know important characteristics of the data. This includes, but is not limited to:

- knowing the size of the data 

- examining distributions of all variables using graphical and numerical summaries

- identifying missing values and potential outliers

- beginning to discover relationships between variables

This step, usually referred to as *Exploratory Data Analysis (EDA)*, is generally the first step in the analysis. However, a typical data science workflow is never linear, and you may need to continue exploring the data at many points of the analysis path. Professors Peng and Matsui, in their book "The Art of Data Science", described this process with epicycles.

###  EDA Checklist

From The Art of Data Science (by Peng and Matsui)

1. Formulate your question
2. Read your data
3. Check the packaging
4. Look at the top and the bottom of your data
5. Check your “n”s
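
As a rough illustration of checklist items 2 to 5, the sketch below runs the usual base R inspection functions on a small made-up data frame. The data frame `toy` and its values are hypothetical and only stand in for a dataset you have just read in.

```r
# Hypothetical toy data; the numbers are invented for illustration only.
toy <- data.frame(
  district = paste0("D", 1:8),
  income   = c(12.5, 9.8, 15.2, 22.1, 8.4, 10.9, 30.3, 11.7),
  read     = c(650, 641, 662, 680, 633, 645, 695, NA)
)

str(toy)                        # check the packaging: column types and size
toy_head <- head(toy, n = 3)    # look at the top of the data
toy_tail <- tail(toy, n = 3)    # look at the bottom of the data
toy_dim  <- dim(toy)            # check your "n"s: rows and columns
toy_missing <- colSums(is.na(toy))  # count missing values per column

toy_head
toy_tail
toy_dim
toy_missing
```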

#### The question

You want to know if students' school performance is associated with the family’s income.

#### Read in the data

In this tutorial, we will use a real-world dataset from 420 K-6 and K-8 districts in California with data available for 1998 and 1999. The California School data set (`CASchools`) comes with an R package called `AER`, an acronym for Applied Econometrics with R (by Christian Kleiber and Achim Zeileis, 2017).

The dataset contains data on test performance, school characteristics, and student demographic backgrounds in Californian school districts. Among many variables available, we will use the following:

- `grades`: factor indicating grade span of district.

- `income`: District average income (in USD 1,000).

- `english`: Percent of English learners.

- `read`: Average reading score.


Consider the dataset `CASchools` as your *random sample* (like in DSCI 100 and STAT 201!) for the analysis.

> Recall the importance of using a *random* sample to obtain representative summaries and broad conclusions!

Let's start by reading this dataset!

In [None]:
#run this cell

data(CASchools)

caschools <- CASchools %>%
  select(grades, income, english, read) %>%
  mutate_if(is.numeric, round, 2)

head(caschools)

#### Check the packaging

In [None]:
str(caschools)

**Question 2.0: Look at the top and the bottom of your data**
<br>{points: 1}

Get the first and last 3 rows of `caschools`. Call these new objects `caschools_head` and `caschools_tail`, respectively.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# caschools_head <- ... %>% ...(...)
# caschools_tail <- ... %>% ...(...)
#caschools_head
#caschools_tail

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.0()

**Question 2.1: Check the dimensions**
<br>{points: 1}

Check the dimensions of `caschools`. Call the new object `caschools_dim`.

(*Hint: Check the function `dim`*). 


In [None]:
# caschools_dim <- ...
# caschools_dim

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.1()

#### Get summary statistics

In this example, we will obtain some useful summary statistics of all the continuous variables in the dataset. 

- Use `select` to select only the continuous variables in the dataset

- Use the `gather` function to convert the dataset into a *long* format

- Then use `summarise` to obtain the summary statistics in the skeleton

In [None]:
#run this cell

caschools_long <- caschools %>%
    select(-grades) %>%
    gather(factor_key=TRUE, key = 'variable')
head(caschools_long)

**Question 2.2**
<br>{points: 1}

Obtain the sample mean, standard deviation, maximum, and minimum for all continuous variables and save them into an object called `caschools_stats`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# caschools_stats <- 
#     caschools_long %>% 
#     group_by(...) %>%
#     summarise(mean = ...(...),
#               sd = ...(...), 
#               max = ...(...), 
#               min = ...(...))


# your code here
fail() # No Answer - remove if you provide an answer

caschools_stats 

In [None]:
test_2.2()

**Question 2.3: Check the distribution of the variables**
<br>{points: 1}

Although not mentioned in the previous checklist, checking the distribution of your variables is an important item in the EDA. Use the plotting function `ggpairs()`, from the library `GGally`, to generate a pair plot of all the variables found in `caschools`.

> **Note**: when a dataset contains too many variables, `ggpairs()` is not the best visualization to use.

The `ggplot()` object's name will be `caschools_pair_plots`. 

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# Adjust these numbers so the plot looks good on your computer.
options(repr.plot.width = 15, repr.plot.height = 12) 

# caschools_pair_plots <- 
#   ... %>%
#   ...(progress = FALSE) +
#   theme(
#     text = element_text(size = 15),
#     plot.title = element_text(face = "bold"),
#     axis.title = element_text(face = "bold")
#   )

# your code here
fail() # No Answer - remove if you provide an answer

caschools_pair_plots

In [None]:
test_2.3()

> Compare the output of the function `ggpairs()` with that of the function `pairs()` used in `worksheet_03`. Which one do you prefer? Discuss it with peers!

**Question 2.4**
<br>{points: 1}

Looking at the distribution of `income`, how would you describe the empirical distribution of this variable?

**A.** Fairly symmetric.

**B.** Left-skewed.

**C.** Right-skewed.

*Assign your answer to an object called `answer2.4`. Your answer should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes.*

In [None]:
# answer2.4 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.4()

## 3. Simple Linear Regression (SLR)

**Question 3.0**
<br>{points: 1}

Based on the visual inspection of the relationship between the variables in the data, you use a Simple Linear Regression (SLR) to study the relation of `read` and `income`. 

How would you describe the graphical association between these two variables?


**A.** Positive and non-linear.

**B.** Positive and linear.

**C.** Negative and non-linear.

**D.** Negative and linear.

*Assign your answer to an object called `answer3.0`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer3.0 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.0()

**Question 3.1**
<br>{points: 1}

Within the context of this exercise and to answer the question of interest:

- Which variable will you choose as a response variable? Answer with the column's name from `caschools`. Assign your answer to the object `answer3.1_response`.
    
    
- Which variable will you choose as the input variable? Answer with the column's name from `caschools`. Assign your answer to the object `answer3.1_predictor`.

In [None]:
# answer3.1_response <- "..."
# answer3.1_predictor <- "..."

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.1.0()
test_3.1.1()

**Question 3.2**
<br>{points: 1}

Even though we found that the relationship between `read` and `income` is non-linear, let us fit the SLR model estimated by least squares (LS) anyway for practice purposes. Assign it to the object `caschools_SLR`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# caschools_SLR <- ...(... ~ ..., data = ...)

# your code here
fail() # No Answer - remove if you provide an answer

caschools_SLR

In [None]:
test_3.2()

**Question 3.3**
<br>{points: 1}

Find the estimated coefficients of `caschools_SLR` using `tidy()`. Report the estimated coefficients, their standard errors and corresponding $p$-values. Include the corresponding asymptotic 95% confidence intervals. Store the results in the variable `caschools_SLR_results`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# caschools_SLR_results <- 
#    ...(..., conf.int = ...) %>% 
#    mutate_if(is.numeric, round, 2)

# your code here
fail() # No Answer - remove if you provide an answer

caschools_SLR_results

In [None]:
test_3.3()
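
To see where theory-based standard errors, $p$-values, and confidence intervals come from, the sketch below recomputes them by hand for a simulated dataset and compares the interval with `confint()`. It uses base R rather than `tidy()`, and the data, sample size, and coefficient values are assumptions made for illustration only.

```r
# Hypothetical simulated data; not the tutorial's dataset.
set.seed(2)
n <- 100
x <- runif(n, min = 0, max = 10)
y <- 5 + 1.5 * x + rnorm(n, sd = 2)
fit <- lm(y ~ x)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(fit)$coefficients
est <- coef_table["x", "Estimate"]
se  <- coef_table["x", "Std. Error"]

# Test statistic and two-sided p-value for H0: slope = 0,
# based on a t distribution with n - 2 degrees of freedom
t_stat   <- est / se
manual_p <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# 95% confidence interval: estimate +/- t quantile * standard error
t_crit    <- qt(0.975, df = n - 2)
manual_ci <- c(est - t_crit * se, est + t_crit * se)

manual_p
manual_ci
confint(fit, "x", level = 0.95)   # should agree with manual_ci
```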

**Question 3.4**
<br>{points: 1}

Using `caschools`, create a scatterplot of the response variable (on the $y$-axis) versus the input variable (on the $x$-axis), and add the estimated SLR line.

Call the resulting object `caschools_scatterplot`.

*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [None]:
# Adjust these numbers so the plot looks good on your computer.
options(repr.plot.width = 10, repr.plot.height = 5) 

# caschools_scatterplot <- 
#   ggplot(..., aes(x = ..., y = ...)) +
#   ...() +
#   ...(method = ..., se = FALSE, linewidth = 1.5) +
#   xlab(...) +
#   ylab(...) +
#   theme(
#     text = element_text(size = 16.5),
#     plot.title = element_text(face = "bold"),
#     axis.title = element_text(face = "bold")
#   )

 
# your code here
fail() # No Answer - remove if you provide an answer

caschools_scatterplot

In [None]:
test_3.4()

**Question 3.5**
<br>{points: 1}

Using the results in  `caschools_SLR_results`, write a correct interpretation of the estimated slope.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

## 4. Inference

**Question 4.0**
<br>{points: 1}

Using a significance level $\alpha = 0.05$, is `income` statistically associated with `read`?

*Assign your answer to an object called `answer4.0`. Your answer should be either "true" or "false", surrounded by quotes.*

In [None]:
# answer4.0 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_4.0()

**Question 4.1**
<br>{points: 1}

The $p$-values computed by the function `lm` and displayed in `caschools_SLR_results` are based on:

**A.** classical theoretical approximations or results

**B.** bootstrapping experiments

**C.** none of the above

*Assign your answer to an object called `answer4.1`. Your answer should be one of "A", "B", or "C" surrounded by quotes.*

In [None]:
# answer4.1 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_4.1()

**Question 4.2**
<br>{points: 1}

Using `caschools_SLR_results`, provide a correct interpretation of the 95% CI for the regression parameter of `income`.



DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 4.3**
<br>{points: 1}

One of the *sampling distributions* related to SLR is 

**A.** The distribution of the response $Y$.

**B.** The distribution of the true population slope $\beta_1$.

**C.** The distribution of $\hat{\beta}_1$, the estimator of the slope.

**D.** The distribution of the input variable $X$.

*Assign your answer to an object called `answer4.3`. Your answer should be one of "A", "B", "C", or "D" surrounded by quotes.*

In [None]:
# answer4.3 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_4.3()

**Question 4.4**
<br>{points: 1}

If we assume that the error terms are Normal, it can be proved that the sampling distributions are also Normal. However, we usually don't know the true distribution of the error terms. 

If we prefer not to make these assumptions, we can approximate the sampling distribution using bootstrapping from `caschools`.

- Obtain $B = 10000$ sets of regression estimates by fitting an SLR with LS to each of $B$ bootstrapped samples. Note that this might take several seconds to run.

- Store the corresponding bootstrap estimates in the data frame `lm_boot` of $B$ rows and two columns:
    - `boot_intercept`: the bootstrapped intercepts.
    - `boot_slope`: the bootstrapped slopes.
    
*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(123)  # DO NOT CHANGE!

# n <- ...
# B <- ...

# lm_boot <- replicate(..., {
#   sample_n(..., ..., replace = ...) %>%
#     lm(formula = ..., data = .) %>%
#     .$coef
# })
#
# lm_boot <- data.frame(boot_intercept = lm_boot[1, ], boot_slope = lm_boot[2, ])

# your code here
fail() # No Answer - remove if you provide an answer

head(lm_boot)
tail(lm_boot)

In [None]:
test_4.4()
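
The resampling idea behind this question can be sketched in base R on a simulated dataset. This is only an illustration under stated assumptions: the data frame `toy`, its skewed error term, and the smaller number of replicates are hypothetical choices, and `sample()` with row indexing stands in for `sample_n()`.

```r
# Hypothetical simulated data with skewed, mean-zero errors.
set.seed(3)
n <- 80
toy <- data.frame(x = runif(n, min = 0, max = 10))
toy$y <- 5 + 1.5 * toy$x + (rexp(n, rate = 0.5) - 2)

B <- 1000   # fewer replicates than in the question, to keep the sketch fast

# Each replicate: resample n rows with replacement, refit the SLR,
# and keep the two estimated coefficients (intercept, slope).
boot_coefs <- replicate(B, {
  rows <- sample(n, size = n, replace = TRUE)
  coef(lm(y ~ x, data = toy[rows, ]))
})

# replicate() returns a 2 x B matrix: row 1 = intercepts, row 2 = slopes
toy_boot <- data.frame(
  boot_intercept = boot_coefs[1, ],
  boot_slope     = boot_coefs[2, ]
)
head(toy_boot)
```

The spread of `toy_boot$boot_slope` approximates the sampling variability of the slope estimator without assuming Normal errors.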

**Question 4.5**
<br>{points: 1}

Now that we have a list of bootstrapped estimates, we can visualize the sampling distributions! 

Let's focus on the sampling distribution of the slope estimator.

The ggplot object's name will be `slope_sampling_dist_boot`.

*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [None]:
#slope_sampling_dist_boot <-
#    ggplot(..., aes(x = ...)) +
#    geom_histogram(bins = 30, color = 'white') +
#    coord_cartesian(xlim = c(1.5, 2.5)) +
#    xlab("...") +
#    ggtitle("Sampling distribution for the estimator of the slope")

# your code here
fail() # No Answer - remove if you provide an answer

slope_sampling_dist_boot

In [None]:
test_4.5()

**Question 4.6**
<br>{points: 1}

Add vertical lines to the plot of the sampling distribution, `slope_sampling_dist_boot`, to visualize the upper and lower limits of the percentile-based bootstrap *95% CI* of the slope.

The ggplot() object's name will be `slope_sampling_dist_boot_limits`.

*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [None]:
#slope_sampling_dist_boot_limits <- 
#   ... + 
#   geom_vline(aes(xintercept = quantile(..., ...)), col = 'blue', linewidth = 1) +
#   geom_vline(aes(xintercept = quantile(..., ...)), col = 'blue', linewidth = 1)

# your code here
fail() # No Answer - remove if you provide an answer

slope_sampling_dist_boot_limits

In [None]:
test_4.6()
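
A percentile-based bootstrap interval is just a pair of quantiles of the bootstrapped estimates. The sketch below computes one on simulated data and prints it next to the theory-based interval from `confint()`. The data, the number of replicates, and the use of base graphics instead of `ggplot()` are all assumptions made to keep the example self-contained.

```r
# Hypothetical simulated data; not the tutorial's dataset.
set.seed(4)
n <- 80
toy <- data.frame(x = runif(n, min = 0, max = 10))
toy$y <- 5 + 1.5 * toy$x + rnorm(n, sd = 2)
fit <- lm(y ~ x, data = toy)

# Bootstrap the slope: resample rows with replacement and refit
B <- 1000
boot_slope <- replicate(B, {
  rows <- sample(n, size = n, replace = TRUE)
  coef(lm(y ~ x, data = toy[rows, ]))[["x"]]
})

# 95% percentile interval: the 2.5% and 97.5% quantiles of the bootstrap slopes
percentile_ci <- quantile(boot_slope, probs = c(0.025, 0.975))
theory_ci     <- confint(fit, "x", level = 0.95)

percentile_ci
theory_ci

# Histogram of the bootstrap slopes with the interval limits marked
hist(boot_slope, breaks = 30,
     main = "Bootstrap slopes (toy data)", xlab = "Bootstrapped slope")
abline(v = percentile_ci, col = "blue", lwd = 2)
```

For a different confidence level, only the `probs` argument changes; for example, `c(0.05, 0.95)` gives a 90% interval.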

**Question 4.7**
<br>{points: 1}

In one or two sentences, explain how to use `lm_boot` generated in **Question 4.4** to approximate the sampling distribution of the estimator of the intercept.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 4.8**
<br>{points: 1}

Write appropriate code and use `lm_boot` to generate and visualize a percentile-based bootstrap 90% confidence interval for the intercept. Your answer must include a visualization and a 90% CI.

In [None]:
## Your code goes here

# your code here
fail() # No Answer - remove if you provide an answer