# Worksheet 3: Introduction to Generative Modelling

## Learning Objectives

After completing this week's lecture and tutorial work, you will be able to:

1. Give an example of a question that could be answered by generative modelling.
2. Explain how a linear regression can be used to approximate the underlying mechanism that generated the data (quantitative response and input variables).
3. Interpret the estimated coefficients and $p$-values derived from theoretical results for a simple linear regression (i.e., one input variable).
4. Discuss the assumptions made to estimate the simple linear regression coefficients and approximate their sampling distribution.
5. Explain how to approximate the sampling distribution of the simple linear regression coefficient estimators using bootstrapping. 
6. Contrast the sampling distribution approximated using theoretical results with bootstrapping alternatives for a simple linear regression setting.
7. Compute confidence intervals for the simple linear regression coefficients using theoretical approximations and bootstrapping results.
8. Write a computer script to perform simple linear regression analysis.

In [None]:
# Run this cell before continuing.
library(tidyverse)
library(repr)
library(infer)
library(cowplot)
library(broom)
source("tests_worksheet_03.R")

# 1. Generative models

As data scientists, we are often interested in understanding the relationship between variables in our data using *models*. For example:

- which variables are associated with a response of interest? 

- can we model the relationship between the response and the input variables? Is a linear model adequate?

- which variables are positively/negatively associated with the response? 

- does the relationship between the response and an input variable depend on the values of the other variables?

**Linear regression models can be used to answer these questions**


## 1.1 Scope of Linear Regression

A linear regression is used to study the relationship between a *continuous response* and one or more input variables of different types (continuous or discrete).


- In STAT 201 (Statistical Inference) you have learned tools to study the relation between a continuous and a categorical variable. For example, 
    - *does website A attract more donations than website B?* 


- However, there are some questions that you won't be able to answer with the tools learned in STAT 201. For example, 
    - *is the size of the donation related to the income of the user?* (unless you collapse the information into 2 groups, e.g., large vs small income users)
    
    - if a company changes its website, what would be the expected size of the donations received?

#### <font color=blue>  Linear Regression Models provide a unifying framework to study the relation between different types of variables and a **continuous response** </font>

Research in linear models has been focused on 3 important aspects: **estimation, inference, and prediction** 

- **Estimation**: how to estimate the true (but unknown) relation between the response and the input variables

- **Inference**: how to use the model to infer information about the unknown relation between variables

- **Prediction**: how to use the model to predict the value of the response for new observations 

**Note**: These goals are related!

## 1.2 Historical throwback ...

> **Historical note I**: Least squares (a classical method in Regression) was first used by **Legendre** (1805) and by **Gauss** (1809) to estimate the orbits of comets based on measurements of the comets’ previous locations. Gauss even predicted the appearance of the asteroid Ceres using LS combined with other complex computations (Source: [The Discovery of Statistical Regression](http://econ.ucsb.edu/~doug/240a/The%20Discovery%20of%20Statistical%20Regression.htm))

    
![](http://pix-media.s3.amazonaws.com/blog/1061/image021.png)

> **Historical note II**: However, neither Legendre nor Gauss coined the term "Regression". **Francis Galton** used this term in the nineteenth century to describe a biological phenomenon that he observed: "It appeared from these experiments that the offspring did not tend to resemble their parent seeds in size, but to be always more mediocre than they". It was later his colleague **Karl Pearson** who associated Least Squares with Regression...

<img src="img/galton_pearson.png" style="width:800px"/>

Note: unfortunately, Francis Galton had disturbing and unacceptable views of race (https://en.wikipedia.org/wiki/Francis_Galton)

## 1.3. How have linear regression models been used?

### Sports: an example of prediction

Billy Beane, general manager of the Oakland Athletics, used statistics to identify low-cost players who could help the team win (example from Introduction to Data Science, Rafael Irizarry)

<table><tr>
<td> <img src="img/moneyball.png" style="width:800px;"/> </td>
<td> <img src="img/moneyball_RI.png" style="width:800px;"/> </td>
</tr></table>

### Public Health: an example of estimation

#### [Funding and Publication of Research on Gun Violence](https://jamanetwork.com/journals/jama/fullarticle/2595514)

![](img/jama_fig.png)
Reference: *JAMA* 2017; 317(1):84-85

Featured twice in the New York Times: [Gun Research Is Suddenly Hot](https://www.nytimes.com/2019/04/17/upshot/gun-research-is-suddenly-hot.html), [There's an Awful Lot We Still Don't Know About Guns](https://www.nytimes.com/interactive/2018/03/02/upshot/what-should-government-study-gun-research-funding.html)

### Climate Change: an example of inference

[Here’s The Best Place To Move If You’re Worried About Climate Change](https://fivethirtyeight.com/features/heres-the-best-place-to-move-if-youre-worried-about-climate-change/)

![img/pic_FiveThirtyEight](https://fivethirtyeight.com/wp-content/uploads/2019/09/CLIMATEWINNERS-0919-4x3.png?w=575)
Reference: featured article in FiveThirtyEight

[Climate Amenities, Climate Change, and American Quality of Life](https://www.journals.uchicago.edu/doi/full/10.1086/684573)

Economists have used different **linear regression models** to explain people's choices in relation to climate variables. Reference: *JAERE* 2016; 3(1): 205-246

### Medicine and Molecular Biology: an example of prediction

#### [Can We Predict Protein from mRNA levels?](https://www.nature.com/articles/nature23293)

<img src="img/nature_gcf.png" style="width:1300px;"/>

Reference: *Nature* 2017, 547:E19–E20

# PART I: Estimation

# 2. Introduction to Simple Linear Regression (SLR)

The purpose of this section is to practice building and interpreting SLR models and become familiar with `R` functions such as `lm()` and `tidy()` (from the `broom` package). 

> "Simple" refers to a linear model with only *one* input variable!

#### The question
In this worksheet we will examine the relationship between cancer mortality rates and different demographic and medical variables.

#### Read in the data

We will use the dataset `US_cancer_data` that contains data on cancer mortality rate and different demographic and medical variables in American counties. 

The data come from [data.world](https://data.world/nrippner/ols-regression-challenge) and other sources: [census.gov](http://census.gov/), [clinicaltrials.gov](http://clinicaltrials.gov/), and [cancer.gov](http://cancer.gov/). All values have been collected in the 2010s but vary per source.

> **Heads up**: Recall the importance of using a *random* sample to obtain representative summaries and broad conclusions!

In the [source documentation](https://data.world/nrippner/ols-regression-challenge) you can find the definition of each of the selected variables:

- `TARGET_deathRate`: a continuous variable that measures cancer mortality per capita (for every 100,000 inhabitants), obtained as an average of data collected from the years 2010-2016.

- `povertyPercent`: a continuous variable that measures percentage of the county's populace in poverty from 2013 American Census estimates.

- `PctPrivateCoverage`: a continuous variable that measures percentage of the county residents with private health coverage from 2013 American Census estimates. 

Let's start by reading this dataset!

In [None]:
US_cancer_data <- read_csv("data/US_county_cancer_data.csv") %>%
  select(TARGET_deathRate, povertyPercent, PctPrivateCoverage)

You can hypothesize that the cancer mortality of each county may be affected by its average level of poverty. Thus, you want to study quantitatively the association, at the county level, between `TARGET_deathRate` and `povertyPercent`.

**Question 2.0**
<br>{points: 1}


Within the context of this exercise, answer the following:

**2.0.0.** Which variable will you choose as a response $Y$? Answer with the column's name from `US_cancer_data`.

**2.0.1.** Which variable will you choose as the input $X$? Answer with the column's name from `US_cancer_data`.


*Assign your answers to the objects `answer2.0.0` (character type surrounded by quotes), `answer2.0.1` (character type surrounded by quotes).*

In [None]:
# answer2.0.0 <- ...
# answer2.0.0
# answer2.0.1 <- ...
# answer2.0.1

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.0()

**Question 2.1**
<br>{points: 1}

Using `US_cancer_data`, create a scatterplot of the response $Y$ versus the input $X$. Call the resulting object `cancer_poverty_scatterplot`. 

> **Heads-up:** It is always important to display units of variables in plots to allow their proper interpretation!

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
options(repr.plot.width = 10, repr.plot.height = 7) # Adjust these numbers so the plot looks good in your desktop.


# cancer_poverty_scatterplot <- ggplot(..., aes(..., ...)) +
#   ...() +
#   xlab(...) +
#   ylab(...) +
#   theme(
#     text = element_text(size = 16.5),
#     plot.title = element_text(face = "bold"),
#     axis.title = element_text(face = "bold")
#   )
# cancer_poverty_scatterplot

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.1()

**Question 2.2**
<br>{points: 1}

Based on the `cancer_poverty_scatterplot`, how would you describe the graphical association between $X$ and $Y$?

**A.** Negative.

**B.** Positive.

*Assign your answer to an object called `answer2.2`. Your answer should be one of `"A"` or `"B"` surrounded by quotes.*

In [None]:
# answer2.2 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.2()

## 2.1 The model

We will begin by modeling the cancer mortality rate as a *linear* function of poverty. Note that this relation is not exact: it is a modelling assumption. Is it reasonable to *assume* that the cancer mortality changes *at a constant rate* with the percentage of the populace in poverty? 

In this stage of the analysis we will examine this assumption and estimate the corresponding **Simple Linear Regression** (SLR). Define:

- $Y$: the response variable

   - in our example: the cancer mortality per capita 

- $X$: the input variable

    - in our example: percentage of the populace in poverty 
    
> "Simple" means that there's a single continuous input variable in the model    

### Linearity Assumption

#### The conditional expectation: $E[Y|X] = \beta_0 + \beta_1 X$

- in our example: $E[\text{mortality}|\text{poverty}] = \beta_0 + \beta_1 \text{poverty}$

#### <font color="blue">  The conditional expectation of the response is linearly related to the input variable and the line is the *linear regression*</font>

> **NOTE 1**: some textbooks ignore the "conditional" statement assuming that $X$ is not random

> **NOTE 2**: This is not the only way to model the conditional expectation! If the true conditional expectation is not linear, other methods will be better to predict the response! For example: in DSCI100, you have used `kNN`!

### Terminology

- **The response variable: $Y$**

Also known as: **output**, **explained variable**, **dependent variable**

- **The input variable(s): $X$**

Also known as: **predictors**, **explanatory variables**, **independent variables**, **covariates**, **features**

- **The regression coefficients: $\beta_0, \; \beta_1$**

The true intercept and the slope of the line are called **regression parameters or coefficients** 

> <font color="blue"> IMPORTANT: The population parameters are *unknown* and *non-random*.</font> **We will use a sample to estimate them using the `lm` function in R**

# 3. Estimation of the regression line

**The true regression parameters are unknown!** So we *use data* from a random sample to *estimate them!!*

Let $\{(X_i,Y_i): i = 1, \ldots , n\}$ be a <font color="red">random sample</font> of size $n$ from the population

In our example:

- $Y_i$: the cancer mortality per capita for the $i$th county

- $X_i$: percentage of the populace in poverty for the $i$th county

> Note that the response and input are indexed by $i$ which could take on the following values: $1 , \dots, n$ to identify the $i$th county in the data.

Observations from a random sample won't be perfectly aligned, but we can assume that:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,$$

#### The error term: $\varepsilon_i$

The error term contains all factors that deviate $Y_i$ from its conditional expected value:

$$\varepsilon_i = Y_i - (\beta_0 + \beta_1 X_i) = Y_i - E[Y_i|X_i]$$ 

> **Heads up**: Note that $\varepsilon$ is a random variable!

- We assume that this random variable is centered at $0$, $E(\varepsilon) = 0$, and has a variance denoted by $\text{Var}(\varepsilon) = \sigma^2$

- We assume that these random errors are independent and identically distributed: *iid assumption*
    - as any other assumption, it may not hold or may not be a good assumption

- Note that any distributional assumption made about the error term also affects the random variable $Y$
    - for example, if you assume that $\varepsilon$ is a Normal random variable, then $Y$ (given $X$) would also be Normal
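
The assumptions above can be made concrete with a small simulation. This is only a sketch on made-up data: the values chosen for `beta_0`, `beta_1`, and `sigma`, the Uniform input, and the Normal errors are all assumptions for illustration and are not taken from `US_cancer_data`.

```r
# Toy simulation of the assumed data-generating mechanism (all values are made up).
set.seed(1)

n      <- 1000   # hypothetical sample size
beta_0 <- 150    # hypothetical true intercept
beta_1 <- 1.5    # hypothetical true slope
sigma  <- 25     # hypothetical standard deviation of the error term

x       <- runif(n, min = 0, max = 40)     # input variable
epsilon <- rnorm(n, mean = 0, sd = sigma)  # iid errors centred at 0
y       <- beta_0 + beta_1 * x + epsilon   # response = conditional mean + error

# The errors should average out close to 0 and have a spread close to sigma.
mean(epsilon)
sd(epsilon)

# Subtracting the conditional mean from the response recovers the errors.
all.equal(y - (beta_0 + beta_1 * x), epsilon)
```

In a simulation we know `beta_0`, `beta_1`, and every `epsilon`; with real data none of these are observed.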

> <font color="blue"> **IMPORTANT: the population parameters are *unknown* and *non-random*.**</font> **How can we use data to estimate them?**

#### but how ...???

There is an infinite number of lines to choose from ...

#### Which one is the best line??

In [None]:
US_cancer_data %>% ggplot(aes(povertyPercent, TARGET_deathRate)) + theme(axis.text.x = element_text(angle = 90))+
    geom_point()+ 
    geom_abline(intercept=145,slope=1.5, size=2, col = "blue")+
    geom_abline(intercept=200,slope=-1, size=2, col = "orange")+
    geom_abline(intercept=100,slope=5, size=2, col = "red")+
    geom_abline(intercept=155,slope=1.2, size=2, col = "green")+
    geom_abline(intercept=147,slope=1.9, size=2, col = "purple")+
    xlab("Populace in Poverty (%)") +
    ylab("Cancer Mortality per Capita (cases/100,000)")

**Question 3.0**
<br>{points: 1}

How would you choose the **best line**? 

**A.** The line that contains most data points 

**B.** The line that minimizes the distance of the points to the line 

**C.** The line that looks the best upon visual inspection

*Assign your answer to an object called `answer3.0`. Your answer should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes.*

In [None]:
# answer3.0 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.0()

To define the **best line** we need to know how to measure the distance of the points to the line! Which of the following criteria would you choose to define "distance of a point to the line"?? 

> **Note**: discuss your choice with your neighbour; more than one option may be valid

![](img/dist.png)

Figure by Prof. Joel Ostblom

> **Option B** measures the vertical distance between the response $Y_i$ and its expected value given by the line

![image.png](img/LS_ISL.png)

From An Introduction to Statistical Learning, by James, Witten, Hastie and Tibshirani 

### 3.1 Least Squares (LS) method 

Many methods can be used to estimate the true regression line depending on the criteria used to define "optimal"!

**Least Squares method** minimizes the sum of the *squares of the residuals*!!
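
In symbols, writing $b_0$ and $b_1$ for the intercept and slope of a candidate line (notation introduced here only for illustration), the LS criterion can be sketched as:

$$(\hat{\beta}_0, \hat{\beta}_1) = \underset{b_0, \, b_1}{\arg\min} \sum_{i=1}^n \left(Y_i - b_0 - b_1 X_i\right)^2$$

For a simple linear regression, this minimization has a closed-form solution:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^n(X_i-\bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$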

#### Residual 

The residual of an observation is the difference between its response value and its predicted response on the line (dotted vertical line in **B**)
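
To see the criterion at work, the sketch below computes the sum of squared residuals for a few hand-picked candidate lines and for the line returned by `lm()`. The data are simulated, and the helper `rss()` and the candidate values are hypothetical names and numbers chosen for illustration.

```r
# Toy data (made up for illustration).
set.seed(2)
x <- runif(100, min = 0, max = 40)
y <- 150 + 1.5 * x + rnorm(100, sd = 25)

# Sum of squared residuals for a candidate line with intercept b0 and slope b1.
rss <- function(b0, b1) {
  residuals <- y - (b0 + b1 * x)  # observed response minus value on the line
  sum(residuals^2)
}

# A few hand-picked candidate lines (intercept, slope).
candidates <- data.frame(
  b0 = c(145, 200, 100, 155),
  b1 = c(1.5, -1, 5, 1.2)
)
candidates$rss <- mapply(rss, candidates$b0, candidates$b1)
candidates

# The LS line from lm() has a sum of squared residuals no larger than any candidate.
ls_fit <- lm(y ~ x)
rss_ls <- rss(coef(ls_fit)[1], coef(ls_fit)[2])
rss_ls
```

Some candidate lines may come close, but none can do better than the LS line under this criterion.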

> **Note**: discuss why the **residual** is different from the **error term**. 

#### Check [this application](http://setosa.io/ev/ordinary-least-squares-regression/)

# 4. LS in R

The parameters of the linear model, a.k.a. regression coefficients, can be estimated through **ordinary least squares**! We will use the `lm` function in R to obtain estimates using the data in `US_cancer_data`.

> **Heads up**: LS is not the only method to estimate the regression coefficients. However, it is the default method in `lm`

The relevant arguments are:

- `formula`: takes the form `response ~ input`.
- `data`: takes a data frame in tidy format.

> **Note**: `lm(response ~ .,data= df)` uses all variables in the dataset `df`, except the `response`, as predictors 

> **Note**: `lm(response ~ input - 1,data= df)` forces the estimated intercept to be 0. Never do this unless you know what you are doing and why. 
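
As a minimal sketch of these arguments, the cell below fits an SLR on a toy data frame. The data frame `df` and its columns `response` and `input` are hypothetical and simulated; they only mirror the placeholder names used in the notes above.

```r
# Toy data frame (hypothetical column names `response` and `input`).
set.seed(3)
df <- data.frame(input = runif(50, min = 0, max = 40))
df$response <- 150 + 1.5 * df$input + rnorm(50, sd = 25)

# Fit the SLR: the response goes on the left of `~`, the input on the right.
toy_fit <- lm(formula = response ~ input, data = df)
coef(toy_fit)  # named vector: "(Intercept)" and "input"

# Dropping the intercept with `- 1` leaves a single coefficient (the slope).
toy_fit_no_intercept <- lm(formula = response ~ input - 1, data = df)
coef(toy_fit_no_intercept)
```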

In this dataset we have a sample of size $n = 3047$ to estimate the regression coefficients. 

To examine the properties of the estimator let's start with a smaller sample and see what happens as the sample size increases.

**Question 4.0**
<br>{points: 1}

From the pool of American counties `US_cancer_data`, collect one random sample of size `250` and call it `US_cancer_sample250`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(123) # DO NOT CHANGE!

# US_cancer_sample250 <- ...(..., size = ...)

# your code here
fail() # No Answer - remove if you provide an answer

head(US_cancer_sample250)

In [None]:
test_4.0()

**Question 4.1**
<br>{points: 1}

It is time to use `R` to estimate the SLR using `US_cancer_sample250`. This estimated model can be used to evaluate whether there exists a linear association between cancer mortality and poverty.

Use the `lm()` function to estimate the SLR. 

Store this estimated model in the variable `SLR_cancer_sample250`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# SLR_cancer_sample250 <- ...(formula = ..., data = ...)
# SLR_cancer_sample250

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_4.1()

### The estimated slope

The estimated slope: $\hat{\beta}_1=1.52$ 

> the "hat" indicates that this is an *estimated slope* based on the data, not the *true* slope!!


#### Interpretation:

- **Correct**: A 1 percent increase in the input variable *is associated* with a $\hat{\beta}_1$ change in the response


- **Correct**: A $\hat{\beta}_1$ change in the response is expected for every 1 percent increase in the input variable 


- **Wrong**: A 1 percent increase in the input variable *causes* a $\hat{\beta}_1$ change in the response


### The estimated intercept

The estimated intercept: $\hat{\beta}_0=153.04$ 

> the "hat" indicates that this is an *estimated intercept* based on the data, not the *true* intercept!!


- We are not usually interested in this parameter. However, it is important to include an intercept in the LR!! 


- It may not even be realistic in the context of the study. When $0$ lies outside the observed range of the input, it is really an extrapolated value of our model


#### Interpretation:

- It measures the expected response when the input variable is $0$!!

- Note that if the predictor is centered, $X_{i}-\bar{X}$, then the intercept represents the expected response for an average input value.
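
The effect of centering can be checked numerically. The sketch below uses simulated data (made-up values, not the cancer dataset): after centering the input, the estimated intercept equals the sample mean of the response, while the estimated slope does not change.

```r
# Toy data (made up for illustration).
set.seed(4)
x <- runif(80, min = 0, max = 40)
y <- 150 + 1.5 * x + rnorm(80, sd = 25)

# Fit with the raw input and with the centered input.
fit_raw      <- lm(y ~ x)
x_centered   <- x - mean(x)
fit_centered <- lm(y ~ x_centered)

coef(fit_raw)       # intercept: expected response at x = 0
coef(fit_centered)  # intercept: expected response at the average x
mean(y)             # equals the intercept of the centered fit
```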


#### <font color="red"> Important: many statistical properties do not hold for models without intercept! </font>

**Question 4.2**
<br>{points: 1}

What is the correct interpretation of the estimated regression slope in `SLR_cancer_sample250`?

**A.** The effect of a one percent increase of the county's populace in poverty is 1.52 increase in the cancer mortality per capita (cases/100,000).

**B.** One percent increase of the county's populace in poverty causes 1.52 increase in cancer mortality per capita (cases/100,000).

**C.** The expected cancer mortality per capita (cases/100,000) increases by 1.52 per one percent increase of the county's populace in poverty.

*Assign your answer to an object called `answer4.2`. Your answer should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes.*

In [None]:
# answer4.2 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_4.2()

### Visualization of the estimated line

**Question 4.3**
<br>{points: 1}

Using `US_cancer_sample250`, we can plot `TARGET_deathRate` versus `povertyPercent` **with points** and add the estimated SLR. The `ggplot()` object's name will be `SLR_cancer_sample250_plot`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# SLR_cancer_sample250_plot <- ggplot(..., aes(..., ...)) +
#   ...() +
#   ...(aes(..., ...), ..., se = FALSE, size = 1.5) +
#   coord_cartesian(xlim = c(0, 50), ylim = c(50, 400)) +
#   xlab(...) +
#   ylab(...) +
#   ggtitle("Sample Scatterplot and Estimated SLR") +
#   theme(
#     text = element_text(size = 16.5),
#     plot.title = element_text(face = "bold"),
#     axis.title = element_text(face = "bold")
#   )


# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_4.3()

**Question 4.4**
<br>{points: 1}

Considering the SLR model estimated in this exercise, which of the following questions relates to inference and estimation, and which relates to prediction? 

| **Question** | **Type** |
| ------------------------------- | ----------------------- |
| How can we determine an association between the expected cancer mortality per capita (cases/100,000) and the county’s populace living in poverty of all American counties? | `answer4.4.0` |
| We observe a new American county with 14% of its populace living in poverty. What cancer mortality per capita (cases/100,000) should we expect? | `answer4.4.1` |

The right column of the table is empty but should describe one of the following: 

**A.** Prediction.

**B.** Inference and estimation.

*Assign your answers to the objects `answer4.4.0` and `answer4.4.1`. Your answer should each be a single character (`"A"` or `"B"`) surrounded by quotes.*

In [None]:
# answer4.4.0 <- ...
# answer4.4.1 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_4.4()

# PART II: Inference

# 5. Inference 

The estimated intercept and slope in `SLR_cancer_sample250` are computed using the $n = 250$ sampled observations from `US_cancer_data`. 

In other words, we used data from a *random sample* to compute *point estimates* of the *population regression parameters* ($\beta_0$ and $\beta_1$) using *least squares estimators* ($\hat{\beta}_0$ and $\hat{\beta}_1$)

Since $\hat{\beta}_0$ and $\hat{\beta}_1$ are computed from a random sample, they are random variables themselves!  

### Parameter *vs* Estimator *vs* Estimate 


#### 3 important different concepts:  

| **Course** | **Population Parameter** | **Estimator** | **Estimate** |
| -------------------------------| ------------------------------- | ------------------------------- | ----------------------- |
|  | unknown quantity| function of the random sample: *random variable*| real number computed with data (non-random) |
| STAT 201 | mean = $E[Y]$| sample mean = $\bar{Y}$| 153.04 |
| STAT 301 | slope = $\beta_1$| estimator of the slope = $\hat{\beta}_1 = \frac{\sum_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^n(X_i-\bar{X})^2}$ | 1.52 |


> **Note**: usually $\hat{\beta}_0$ and $\hat{\beta}_1$ are used for both the estimates and the estimators, which can be confusing
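
The distinction between estimator and estimate can be sketched in code: the estimator is a rule (a function of the sample), and the estimate is the number obtained when the rule is applied to one particular sample. The function name `slope_estimator()` and the simulated data are hypothetical, chosen for illustration.

```r
# The estimator is a rule (a function of the sample); names here are hypothetical.
slope_estimator <- function(x, y) {
  sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
}

# Toy sample (made up for illustration).
set.seed(5)
x <- runif(100, min = 0, max = 40)
y <- 150 + 1.5 * x + rnorm(100, sd = 25)

# Applying the rule to one particular sample gives an estimate: a fixed number.
slope_estimate <- slope_estimator(x, y)
slope_estimate

# It matches the slope reported by lm() for the same sample.
coef(lm(y ~ x))["x"]
```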

We can think that the *estimates* are (good) guesses about the population parameters based on our data. However,  the values of the estimates depend on the random sample used to compute them:

> **Important**: different samples yield different estimates!!

Let's see an example! Recall the estimates we obtained with the sample taken:

In [None]:
SLR_sample250_0 <- tidy(lm(formula = TARGET_deathRate ~ povertyPercent, data = US_cancer_sample250)) %>% select(estimate)
many_SLR <- SLR_sample250_0 
many_SLR

Take *another sample* from the full dataset and estimate the regression line

> **Important**: In practice, we will rarely take multiple samples!

> **NOTE**: This is NOT bootstrapping!! Why??

In [None]:
set.seed(301)

US_cancer_sample250_1 <- rep_sample_n(US_cancer_data, size = 250)


In [None]:
# Another set of point estimates

SLR_sample250_1 <- tidy(lm(formula = TARGET_deathRate ~ povertyPercent, data = US_cancer_sample250_1))  %>% select(estimate)
many_SLR <- many_SLR %>% bind_cols(SLR_sample250_1)
many_SLR

In [None]:
set.seed(30)

US_cancer_sample250_2 <- rep_sample_n(US_cancer_data, size = 250)

In [None]:
# Yet another set of point estimates

SLR_sample250_2 <- tidy(lm(formula = TARGET_deathRate ~ povertyPercent, data = US_cancer_sample250_2))  %>% select(estimate)
many_SLR <- many_SLR  %>% bind_cols(SLR_sample250_2)
many_SLR

#### and so on .... as we take new samples we get different *estimates* of the regression parameters

> **Important**: what is the sample-to-sample variation??

### 5.1. The standard error!!

The variation of these estimates from sample to sample is measured by their standard deviation, which has a special name: *the standard error* (SE)

> But in practice, how can we compute the standard error if we have only 1 sample?? 

We have different ways of answering this question:

1. take multiple samples from the population and compute multiple estimates as we did above. Then compute their SD. But this is *not a realistic option*  

2. use a theoretical result! This is what `lm` does!!

3. use bootstrapping!! As you did in STAT 201 for other quantities!! This is what we will also do in STAT 301
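
The three approaches can be compared side by side in a simulation. This is only a sketch under stated assumptions: the "population" is simulated from a made-up linear mechanism, the helper `slope_of()` is a hypothetical name, and the bootstrap is written with base `R` resampling rather than with the `infer` package.

```r
set.seed(6)

# Toy "population" (made up): 5000 units generated from a known linear mechanism.
N <- 5000
population <- data.frame(x = runif(N, min = 0, max = 40))
population$y <- 150 + 1.5 * population$x + rnorm(N, sd = 25)

n <- 200  # sample size

# Helper: estimated slope from a data frame with columns x and y.
slope_of <- function(d) unname(coef(lm(y ~ x, data = d))[2])

# 1. Many samples from the population (rarely possible in practice).
slopes_many_samples <- replicate(500, slope_of(population[sample(N, n), ]))
se_many_samples <- sd(slopes_many_samples)

# One observed sample, used by approaches 2 and 3.
one_sample <- population[sample(N, n), ]

# 2. Theoretical result reported by lm().
se_theory <- summary(lm(y ~ x, data = one_sample))$coefficients["x", "Std. Error"]

# 3. Bootstrapping: resample the observed sample with replacement.
slopes_boot <- replicate(500, slope_of(one_sample[sample(n, n, replace = TRUE), ]))
se_boot <- sd(slopes_boot)

c(many_samples = se_many_samples, theory = se_theory, bootstrap = se_boot)
```

The three numbers should be of similar size, although approaches 2 and 3 only use the single observed sample.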

**Question 5.1.0**
<br>{points: 1}

Use the `broom` package's `tidy()` to obtain the estimated coefficients, associated standard errors, $t$-statistics, and $p$-values from the model `SLR_cancer_sample250`, which was estimated using the random sample `US_cancer_sample250`. 

Store them in the variable `SLR_cancer_sample250_results` whose columns are the following:

- The first column has the names of the regression terms.

- The second column shows the values of the estimated coefficients of the regression line, $\hat{\beta}_0$ and $\hat{\beta}_1$ 

- The remaining three columns have important quantities to assess uncertainty and test hypotheses about the regression terms (we'll learn more about these quantities later).

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# SLR_cancer_sample250_results <- ...(...) %>% mutate_if(is.numeric, round, 2)
# SLR_cancer_sample250_results

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_5.1.0()

### 5.2. Hypothesis Tests

### The null hypothesis

**Question 5.2.0**
<br>{points: 1}

The object `SLR_cancer_sample250_results` contains useful information to test some hypotheses about the regression coefficients.

Suppose we want to test whether there exists a linear association between the response and the input variable. Which of the following null hypotheses is correct?

**A.** $H_0: \hat{\beta}_1 = 0 $

**B.** $H_0: \hat{\beta}_0 = 0 $

**C.** $H_0: \beta_1 = 0$ 

**D.** $H_0: \beta_0 = 0$ 

*Assign your answer to an object called `answer5.2.0`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer5.2.0 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_5.2.0()

### The alternative hypothesis 

**Question 5.2.1**
<br>{points: 1}

The alternative hypothesis reflects our beliefs about reality. 

> **Heads up**: Although the alternative hypothesis contains the claim that we want to prove, it is important to note that in statistics we are not proving that the null hypothesis is true or false! We can only *reject* or *fail to reject* the null hypothesis based on our evidence in the data!!

The object `SLR_cancer_sample250_results` contains useful information to test some hypotheses about the regression coefficients.

Suppose we want to test whether there exists a positive (linear) association between the response and the input variable. Which of the following alternative hypotheses is correct?

**A.** $H_1: \hat{\beta}_1 > 0 $

**B.** $H_1: \hat{\beta}_0 > 0 $

**C.** $H_1: \beta_1 \neq 0$ 

**D.** $H_1: \beta_1 > 0$ 

*Assign your answer to an object called `answer5.2.1`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer5.2.1 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_5.2.1()

#### The statistic

To test $H_0$ we can use our estimated slope and check how far the estimate $\hat{\beta}_1$ is from $0$. But **how far is far??** 

The standard error $\text{SE}(\hat{\beta}_1)$ (found in column `std.error` from `SLR_cancer_sample250_results`) will help us in assessing this via the following test statistic:

$$T = \frac{\hat{\beta}_1 - 0}{\text{SE}(\hat{\beta}_1)}$$

The value of this statistic in our data can be found in the column `statistic` from `SLR_cancer_sample250_results`
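
As a quick check of the formula, the sketch below recomputes the statistic by hand on simulated data (made-up values) and compares it with the value reported by `summary()` for the same fit.

```r
# Toy data (made up for illustration).
set.seed(7)
x <- runif(60, min = 0, max = 40)
y <- 150 + 0.5 * x + rnorm(60, sd = 25)

fit <- lm(y ~ x)
coef_table <- summary(fit)$coefficients  # columns: Estimate, Std. Error, t value, Pr(>|t|)

# Test statistic for H0: beta_1 = 0, computed by hand.
t_by_hand <- (coef_table["x", "Estimate"] - 0) / coef_table["x", "Std. Error"]
t_by_hand

# Value reported by R for the same coefficient.
coef_table["x", "t value"]
```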

#### The p-value

In `SLR_cancer_sample250_results` you can find the $p$-values in the column `p.value`. But how are these p-values computed?? 

> We need the distribution of the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ (the sampling distributions) to compute $p$-values!!

Under $H_0$, `lm` approximates the **sampling distribution** with a $t$-distribution with $n - k$ degrees of freedom, where $n$ is the sample size and $k$ the number of regression coefficients (see Section 6)

- For a SLR: there are $k = 2$ coefficients: $\beta_0$ and $\beta_1$

> **Heads up**: By default, the reported $p$-values correspond to the two-sided alternative $H_1: \beta_j \neq 0$ for each coefficient $\beta_j$. For a one-sided alternative, you can compute the $p$-value yourself from the `statistic` column.

The `p.value` is interpreted as the probability, under $H_0$, that $\mid T \mid$ is equal to or larger than the value observed in our sample (given in the column `statistic` of `SLR_cancer_sample250_results`). 
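
The computation can be sketched with the $t$-distribution functions in base `R`. The data below are simulated (made-up values); the last lines also show how the same statistic could be used for a one-sided alternative such as $H_1: \beta_1 > 0$.

```r
# Toy data (made up for illustration).
set.seed(8)
n <- 60
x <- runif(n, min = 0, max = 40)
y <- 150 + 0.5 * x + rnorm(n, sd = 25)

fit <- lm(y ~ x)
coef_table <- summary(fit)$coefficients
t_obs <- coef_table["x", "t value"]

k <- 2  # number of regression coefficients in an SLR

# Two-sided p-value: probability that |T| is at least |t_obs| under H0.
p_two_sided <- 2 * pt(abs(t_obs), df = n - k, lower.tail = FALSE)
p_two_sided
coef_table["x", "Pr(>|t|)"]  # value reported by R

# One-sided p-value for H1: beta_1 > 0 (upper tail only).
p_greater <- pt(t_obs, df = n - k, lower.tail = FALSE)
p_greater
```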

**Question 5.2.2**

Which of the following statements is correct?

**A.** The p-value is not the probability that the null hypothesis is true

**B.** The p-value is the probability that the alternative hypothesis is false

**C.** The p-value indicates the size or importance of the observed effect

**D.** The p-value is the probability that the observed effects were produced by random chance alone.

*Assign your answer to an object called `answer5.2.2`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer5.2.2 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_5.2.2()

#### Decision rule

The smaller the $p$-value, the stronger the evidence against $H_0$. Thus, small $p$-values (less than the significance level $\alpha$) indicate that the data provides enough statistical evidence against the null hypothesis of no association (i.e., to reject $H_0$).

> **Heads up**: in recent years, the scientific community has identified a "crisis of p-values". If you are interested in this topic you can read more about it in [this article](https://www.nature.com/articles/d41586-019-00857-9) and in the [ASA statement](https://www.stat.berkeley.edu/~aldous/Real_World/ASA_statement.pdf).

**Question 5.2.3**
<br>{points: 1}

Using the output stored in `SLR_cancer_sample250_results` and a significance level $\alpha = 0.05$, in plain words, what is the conclusion of the following hypothesis test?

$H_0: \beta_1 = 0 $

$H_1: \beta_1 \neq 0 $

**A.** We accept the alternative hypothesis; thus, the percentage of the county's populace in poverty has a statistically significant effect on the county's cancer mortality per capita (cases/100,000).

**B.** We reject the null hypothesis; thus, the percentage of the county's populace in poverty is statistically associated with the county's cancer mortality per capita (cases/100,000).

**C.** We fail to reject the null hypothesis; thus, the percentage of the county's populace in poverty is not statistically associated with the county's cancer mortality per capita (cases/100,000).

*Assign your answer to an object called `answer5.2.3`. Your answer should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes.*

In [None]:
# answer5.2.3 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_5.2.3()

### 5.3. Confidence Intervals

The values in `SLR_cancer_sample250_results` can also be used to compute confidence intervals for the regression parameters! 

Confidence intervals (CIs) reflect the uncertainty of our regression estimates and can be used to make inference about the regression coefficients. 

A $(1 - \alpha) \times 100\%$ CI for a regression coefficient with estimate $\hat{b}$ is computed as:

$$\hat{b} \pm \text{SE}(\hat{b}) \times t_{\alpha/ 2, n - k},$$

where 

- $\text{SE}(\hat{b})$ is the estimate's standard error. 

- $n$ is the sample size and $k$ the number of regression parameters 
    - $k = 2$ in the case of the SLR, $\beta_0$ and $\beta_1$.

- $t_{\alpha/ 2, n - k}$ is the **upper $\alpha/2$ quantile of the $t$-distribution** with $n - k$ degrees of freedom (i.e., the value that leaves probability $1 - \alpha/2$ to its left)!! 

**Heads up**: this result is *also* based on classical theory approximations or distributional assumptions!!
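
As an illustration of the formula above, the following sketch builds 95% CIs by hand and compares them with base R's `confint()`. The data are simulated, and the names `x`, `y`, `fit`, and `manual_CIs` are assumptions introduced only for this example.

```r
# Simulated data for illustration only (not the worksheet's cancer data).
set.seed(2)
n <- 80
x <- rnorm(n, mean = 15, sd = 5)
y <- 100 + 1.5 * x + rnorm(n, sd = 10)
fit <- lm(y ~ x)

k <- length(coef(fit))   # number of regression parameters (2 in SLR)
alpha <- 0.05

est <- coef(fit)                                   # estimates
se <- summary(fit)$coefficients[, "Std. Error"]    # standard errors
t_quant <- qt(1 - alpha / 2, df = n - k)           # t quantile with n - k df

# estimate -/+ SE * t quantile, one row per coefficient.
manual_CIs <- cbind(lower = est - t_quant * se, upper = est + t_quant * se)
manual_CIs

# Base R computes the same intervals for lm objects.
confint(fit, level = 1 - alpha)
```

Both outputs should contain the same numbers; only the column labels differ.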

There are also many misunderstandings around the concept of CIs.

A 95% CI computed from the data is **not** a range of values that contains the true regression parameter with 95% probability. Once the interval has been computed from the data, *nothing is random*! So it either covers the true value or it does not. 

> **Heads up**: this observation alludes to a more general and difficult concept: the difference between an *estimator* and an *estimate* as noted before!!

**Question 5.3.0**
<br>{points: 1}

Using `tidy()` on `SLR_cancer_sample250`, obtain the asymptotic 95% CIs for each regression parameter.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# SLR_cancer_sample250_CIs <- ...(..., ...) %>% mutate_if(is.numeric, round, 2)
# SLR_cancer_sample250_CIs

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_5.3.0()

# 6. The Sampling Distribution

We mentioned before that the estimators of the regression coefficients, $\hat{\beta}_0$ and $\hat{\beta}_1$, are *random variables*. Thus, they also have a *distribution*, called the *sampling distribution* (same as in STAT 201!). And we need this distribution to compute $p$-values!!

But how do we know the *sampling distribution* (i.e., the distribution of the estimators of the regression coefficients)??

We have different ways of answering this question:

1. take multiple samples from the population and compute multiple estimates as we did above. Then look at their distribution. But this is *not a realistic option* 

2. use a theoretical result! This is what `lm` does!!

3. use bootstrapping!! As you did in STAT 201 for other quantities!! This is what we will also do in STAT 301


#### Theoretical results 

<img src="img/assumptions_SLR.png" style="width:600px"/>

From *Beyond Multiple Linear Regression (BMLR): Applied Generalized Linear Models and Multilevel Models in R*, by Paul Roback and Julie Legler.

From the figure in BMLR, we can see that an additional assumption was made:

#### <font color="blue">  ASSUMPTION 2: the conditional distribution of the error terms is Normal!! and so is the conditional distribution of the response! </font>

This assumption is not always needed in the analysis of linear regression. However, when it holds, the distribution of the test statistic is known exactly, even for small sample sizes!

> **Classical Theory 1**: if we assume that the (conditional) distribution of the error terms $\varepsilon_i$ is Normal, under $H_0$, the statistic $T$ follows a $t$-distribution with $n - k$ degrees of freedom where $n$ is the sample size and $k$ the number of regression parameters.

    
**Important**: <font color="blue">CLT to the rescue</font>: if the assumption is not true but the conditional distribution of the error terms is *nice enough* (under certain conditions), **and if the sample size is large**, then 

> **Classical Theory 2**: under $H_0$, the **Central Limit Theorem** can be used to prove that the statistic $T$ follows *approximately* a $t$-distribution with $n - k$ degrees of freedom (the correct terminology is *asymptotically*, for very large sample sizes).

    
#### This theoretical result is used by `lm` to compute p-values and confidence intervals!!

`lm` approximates the sampling distribution with a $t$-distribution with $n - k$ degrees of freedom 
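
A rough way to see the large-sample approximation at work is a small simulation. The sketch below is only illustrative: the sample size, the number of repetitions, and the centred exponential errors are all assumptions chosen for this example. It generates data under $H_0$ (true slope equal to 0) with skewed errors and checks how often the `lm` $p$-value falls below $\alpha$.

```r
set.seed(3)
n <- 200        # sample size (assumed "large" for this sketch)
reps <- 2000    # number of simulated samples
alpha <- 0.05

# Simulate under H0 (true slope = 0) with skewed, non-Normal errors.
p_values <- replicate(reps, {
  x <- runif(n)
  errors <- rexp(n) - 1          # centred exponential: mean 0, right-skewed
  y <- 5 + 0 * x + errors
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})

# If the t approximation is adequate, roughly alpha of the tests reject H0.
rejection_rate <- mean(p_values < alpha)
rejection_rate
```

A rejection rate close to $\alpha$ suggests that the $t$-distribution is a reasonable approximation here even though the errors are not Normal.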

> **Heads up**: what can we do if the assumptions do not hold? or if we don't want to use LS estimators? or if the sample size is not large enough to use approximations?

<img src="../../resources/img/bootstrapping.png" width=500>
<div style="text-align: center"><i>Bootstrapping diagram.</i></div>

>  **Historical note**: The term bootstrapping originates from the phrase "to pull oneself up by one's bootstraps", which refers to completing a seemingly impossible task with no external help. 

In Statistics, bootstrapping refers to sampling from our original sample **with replacement** (also called **resampling with replacement**) to generate a **bootstrap sampling distribution**. 

>  **Heads up**: **sampling with replacement** means that each time we choose an observation from the sample, we return it before randomly selecting another. Resampling with replacement is required because drawing $n$ observations *without* replacement from a sample of size $n$ would just return the original sample every time, leaving no sampling variation to approximate.
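
The difference between the two sampling schemes can be seen with a tiny, made-up sample (the values in `original_sample` are hypothetical and chosen only for this sketch):

```r
set.seed(4)
original_sample <- c(12, 15, 9, 21, 18)   # hypothetical observations
n <- length(original_sample)

# One bootstrap sample: n draws WITH replacement, so values can repeat
# and some original values can be left out.
boot_sample <- sample(original_sample, size = n, replace = TRUE)
boot_sample

# WITHOUT replacement, n draws only reshuffle the original sample.
no_replacement <- sample(original_sample, size = n, replace = FALSE)
identical(sort(no_replacement), sort(original_sample))
```

The last line always returns `TRUE`: without replacement we recover exactly the same set of values, so every "resample" would give the same estimates.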

The idea is to use the original sample as an *estimate* of the unknown population. The single sample acts as the "bootstraps" that we can use to "pull ourselves up" and create an approximation of the sampling distribution.

<font color="blue">Again, note that we are sampling from the sample!! not from the population!!</font>

Using *bootstrapping*, we generate a *long* list of estimates to approximate the sampling distribution *empirically*!

- sample with replacement to obtain $B$ samples of size $n$

- for each sample, compute the estimated regression coefficients

- use the $B$ regression estimates of a given population parameter $\beta_j$ to approximate the sampling distribution of $\hat{\beta}_j$ 

> **Note**: we use the subscript $j$ to name either the intercept ($j=0$) or the slope ($j=1$)

> **Heads up**: this list can also be used to estimate the mean and the standard error of $\hat{\beta}_j$
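
The three steps above can be sketched on simulated data. Everything in this example is an assumption made for illustration: the data frame `sim_sample`, the true coefficients, and the choice to resample row indices with base R's `sample()` (one of several equivalent ways to draw a bootstrap sample).

```r
# Simulated "original sample" for illustration only.
set.seed(5)
n <- 150
sim_sample <- data.frame(x = runif(n, 0, 10))
sim_sample$y <- 3 + 2 * sim_sample$x + rnorm(n, sd = 4)

B <- 1000   # number of bootstrap samples

# Steps 1 and 2: resample n rows with replacement, refit, keep the coefficients.
boot_coefs <- replicate(B, {
  rows <- sample(n, size = n, replace = TRUE)
  coef(lm(y ~ x, data = sim_sample[rows, ]))
})

# replicate() returns a 2 x B matrix: row 1 = intercepts, row 2 = slopes.
boot_estimates <- data.frame(
  boot_intercept = boot_coefs[1, ],
  boot_slope = boot_coefs[2, ]
)

# Step 3: the B slopes approximate the sampling distribution of the slope
# estimator; their mean and standard deviation estimate its mean and SE.
mean(boot_estimates$boot_slope)
sd(boot_estimates$boot_slope)
```

The standard deviation of the bootstrapped slopes plays the role of the standard error that `lm` reports from theory.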


**Question 6.0**
<br>{points: 1}

Let's approximate the sampling distribution by bootstrapping from `US_cancer_sample250`. 

Obtain $B = 1000$ sets of regression estimates by fitting an SLR $B$ times, once for each bootstrapped sample. Store the corresponding $\hat{\beta}^*_0$ and $\hat{\beta}^*_1$ per bootstrapped sample in the data frame `lm_boot250` of 1000 rows and two columns:

- `boot_intercept`: list of bootstrapped intercepts $\hat{\beta}^*_0$.
- `boot_slope`: list of bootstrapped slopes $\hat{\beta}^*_1$.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(123)  # DO NOT CHANGE!

# n <- ...
# B <- ...

# lm_boot250 <- replicate(..., {
#   sample_n(..., ..., ...) %>%
#     lm(..., data = .) %>%
#     .$coef
# })
# lm_boot250 <- data.frame(boot_intercept = lm_boot250[1, ], boot_slope = lm_boot250[2, ])

# head(lm_boot250)
# tail(lm_boot250)

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_6.0()

**Question 6.1**
<br>{points: 1}

Now that we have a list of bootstrapped estimates, we can visualize the sampling distribution!

Let's focus on the sampling distribution of the slope.

The `ggplot()` object's name will be `slope_sampling_dist_250`.


*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
#slope_sampling_dist_250 <-  ggplot(..., aes(x = ...)) +
#    geom_histogram(bins = 30, color = "white", fill = "blue") +
#    coord_cartesian(xlim = c(0, 3)) +
#    xlab("...") +
#    ggtitle("Sampling distribution for the estimator of the slope, n=250")

#slope_sampling_dist_250 


# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_6.1()

### Does the sample size matter?

Our estimator depends on a random sample and thus on its size! What happens to the estimates and the sampling distribution when we change the sample size? 

Let's approximate the sampling distribution of estimators computed from samples of different sizes 

**Question 6.2**
<br>{points: 2}

- start with a new sample of size $n=500$

- repeat the bootstrapping experiment above 

- plot the sampling distribution

The `ggplot()` object's name will be `slope_sampling_dist_500`

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(123)  # DO NOT CHANGE!
US_cancer_sample500 <- rep_sample_n(US_cancer_data, size = 500)  # DO NOT CHANGE!

# n <- ...
# B <- ...

# lm_boot500 <- replicate(..., {
#   sample_n(..., ..., ...) %>%
#     lm(..., data = .) %>%
#     .$coef
# })
# lm_boot500 <- data.frame(boot_intercept = lm_boot500[1, ], boot_slope = lm_boot500[2, ])

#slope_sampling_dist_500 <-  ggplot(..., aes(x = ...)) +
#    geom_histogram(bins = ...,color = "white", fill = "blue") +
#    coord_cartesian(xlim = c(0, 3)) +
#    xlab("...") +
#    ggtitle("Sampling distribution for the estimator of the slope, n=500")

#slope_sampling_dist_500 



# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_6.2()

**Question 6.3**
<br>{points: 1}

- use the whole sample with all American counties!

- repeat the bootstrapping experiment above 

- plot the sampling distribution

The `ggplot()` object's name will be `slope_sampling_dist_3047`

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(123)  # DO NOT CHANGE!

# n <- ...
# B <- ...

# lm_boot3047 <- replicate(..., {
#   sample_n(..., ..., ...) %>%
#     lm(..., data = .) %>%
#     .$coef
# })
# lm_boot3047 <- data.frame(boot_intercept = lm_boot3047[1, ], boot_slope = lm_boot3047[2, ])

#slope_sampling_dist_3047 <-  ggplot(..., aes(x = ...)) +
#    geom_histogram(bins = ..., color = "white", fill = "blue") +
#    coord_cartesian(xlim = c(0, 3)) +
#    xlab("...") +
#    ggtitle("Sampling distribution for the estimator of the slope, n=3047")

#slope_sampling_dist_3047 



# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_6.3()

Let's look at the 3 sampling distributions obtained by bootstrapping from samples of different sizes, side by side:

In [None]:
plot_grid(slope_sampling_dist_250, slope_sampling_dist_500, slope_sampling_dist_3047)

**Question 6.4**
<br>{points: 1}

Which of the following observations about the sampling distribution is true?

**A.** The sampling distribution of the estimator of the slope does not change with the size of the sample we bootstrapped from

**B.** The center of the sampling distribution of the estimator of the slope does not change with the size of the sample we bootstrapped from

**C.** The sampling distribution of the estimator of the slope becomes tighter as the size of the sample we bootstrapped from increases


*Assign your answer to an object called `answer6.4`. Your answer should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes.*

In [None]:
# answer6.4 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_6.4()

### Bootstrap Confidence Intervals

In this exercise, we use the bootstrap sampling distribution to compute *bootstrap percentile* CIs for the regression parameters.

This empirical sampling distribution can be used to make inference, for example to construct CIs (also done in STAT 201).

Sorting the $B$ bootstrap estimates $\hat{\beta}^*_j$ from the smallest to the largest value, the corresponding $(1 - \alpha) \times 100\%$ bootstrap CI is obtained with the following percentiles:

$$\Big(\hat{\beta}^*_{j,\alpha/2}, \hat{\beta}^*_{j,1 - \alpha/2}\Big)$$
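
As a minimal sketch of the percentile interval above, the example below bootstraps the slope of a regression fitted to simulated data and reads off the $\alpha/2$ and $1 - \alpha/2$ percentiles with `quantile()`. The data and the names `boot_slopes` and `percentile_CI` are assumptions used only for illustration.

```r
# Simulated "original sample" for illustration only.
set.seed(6)
n <- 150
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 4)

B <- 1000   # number of bootstrap samples

# Bootstrapped slopes: resample row indices with replacement and refit.
boot_slopes <- replicate(B, {
  rows <- sample(n, size = n, replace = TRUE)
  coef(lm(y[rows] ~ x[rows]))[[2]]
})

# Percentile CI: the alpha/2 and 1 - alpha/2 quantiles of the B estimates.
alpha <- 0.05
percentile_CI <- quantile(boot_slopes, probs = c(alpha / 2, 1 - alpha / 2))
percentile_CI
```

Note that `quantile()` does the sorting for us, so there is no need to order the bootstrap estimates manually.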

**Question 6.5**
<br>{points: 1}

Obtain a summary of the $B$ results, `boot_SLR_CIs`, from `lm_boot3047` with two rows (one for `boot_intercept` and another for `boot_slope`) and three columns: bootstrap estimate average (`B_avg`), 95% lower bound quantile (`B_conf.low`), and 95% upper bound quantile (`B_conf.high`).

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# boot_SLR_CIs <- data.frame(
#   B_avg = lm_boot3047 %>% summarize(
#     boot_intercept = ...,
#     boot_slope = ...
#   ) %>% unlist(),
#   B_conf.low = lm_boot3047 %>% summarize(
#     boot_intercept = ...,
#     boot_slope = ...
#   ) %>% unlist(),
#   B_conf.high = lm_boot3047 %>% summarize(
#     boot_intercept = ...,
#     boot_slope = ...
#   ) %>% unlist()
# ) %>% mutate_if(is.numeric, round, 2)
# boot_SLR_CIs

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_6.5()

In [None]:
# Histogram of the bootstrapped slopes; the vertical lines mark the 2.5% and
# 97.5% percentiles, i.e., the bounds of the 95% bootstrap percentile CI.
slope_sampling_dist_3047 <- ggplot(lm_boot3047, aes(x = boot_slope)) +
    geom_histogram(bins = 30, color = "white", fill = "blue") +
    geom_vline(aes(xintercept = quantile(boot_slope, 0.025)), size = 1) +
    geom_vline(aes(xintercept = quantile(boot_slope, 0.975)), size = 1) +
    coord_cartesian(xlim = c(0, 3)) +
    xlab("Estimated Slopes") +
    ggtitle("Sampling distribution and CI for the estimator of the slope, n=3047")

slope_sampling_dist_3047