# From Hypothesis Testing to Confidence Intervals with Bootstrapping
## Chapter 12.1-12.5 Overview Notebook

In [None]:
# load libraries
library(coursekata)

# formatting
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))

# load data
junkads_study <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRUb_QI68mB7nYNrD9zC1qVgYRSjIcrlddFz1Dj_hErlG0_lGVBQt3WASRi-zTyLZpg0qyMUKCO1PGr/pub?gid=877709649&single=true&output=csv")


## 1 Exploring and Modeling Variation in a New Dataset 


### 1.1 Let's collect some data. You will be handed a page with a few paragraphs and three questions. Read and answer the questions.

<div class="discussion-question">

### 1.2 Discussion Question: Tell us what you read about. What differences do you notice between the various versions of the handouts in the class?

</div>


### 1.3 Research Study by [Bryan et al. (2016)](https://www.pnas.org/doi/10.1073/pnas.1604586113#sec-7)

Researchers wanted to test a new way of motivating healthy eating among teens. They thought that traditional public health messages (e.g., “eat more fruits and vegetables because it’s good for you”) don’t always work well with adolescents because they can feel a little obvious or preachy.

Instead, the researchers tried reframing healthy eating as an act of freedom and resistance, a way to push back against manipulative food industry practices such as the unfair targeting of young people. In this “expose” version, healthy eating was connected to values of autonomy and justice, not just health.

Students were randomly assigned to receive one of these messages: expose or public-health. Then they were asked to rate some ads.

In our in-class version, we rated only one ad (Sprite), but [students in the original study](https://www.pnas.org/doi/10.1073/pnas.1604586113#sec-7) rated advertisements for three different drinks (Sprite, Coca-Cola, and Gatorade). They were asked how *angry* the ads made them and how much the ads made them want to *drink* the product. Their responses to the three ads were averaged together.

The data frame `junkads_study` contains data from a sample of 89 students with the following variables:
- `message` whether the student read the expose or public health message
- `SpriteOpen_ended` the open-ended written response to the Sprite ad
- `angry` (1–5) on average, how angry the three ads made them feel
- `drink` (1–5) on average, how much they wanted to drink the product after seeing the ad

<div class="discussion-question"> 
    
### 1.4 Discussion Question: What do you think researchers were hoping to find in this data? 

</div>

<div class="guided-notes">

### 1.5 What outcome is being modeled in the jitter plot: `angry` or `drink`? Label the y-axis of the graph.

Which variable would you expect to be, on average, higher for the public health group than for the expose group?

</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/12.1-12.5-overview-no-outcome-label.jpg" alt="what is the outcome variable here?" width = 60%>

<div class="guided-notes">

### 1.6 Fill in the blanks below to write R code for a model predicting that outcome from the message the student received.

<center><code>message_model <- lm(__________ ~ __________, data=__________)</code></center>
    
</div>

<div class="guided-notes">

### 1.7 Take a look at the R output of `message_model`. Interpret $b_1$ in the context of this study and label it on the graph.

</div>

## 2 Evaluating the Null Model of the DGP

<div class="discussion-question">

### 2.1 Discussion Question: In prior chapters we learned how to evaluate whether this data came from a DGP where $\beta_1 = 0$. What are some ways we could evaluate that idea?

</div>

<div class="guided-notes">

### 2.2 The alpha (0.05) is depicted in the shuffled sampling distribution of $b_1$s. Use a highlighter to depict the p-value and estimate how big it might be.
    
What does this tell you about the hypothesis that our sample came from a DGP where $\beta_1=0$?

</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/12.1-12.5-overview-sdob1-alpha.jpg" alt="sliding sampling distribution around, representing different possible beta-1s" width = 60%>
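
A shuffled sampling distribution like the one pictured can be approximated in a few lines of base R. The sketch below is illustrative only: the data frame `toy` is simulated (it is not `junkads_study`), the helper `get_b1()` is a hypothetical name, and base R's `sample()` stands in for shuffling the outcome.

```r
set.seed(1)

# Hypothetical two-group data, NOT the junkads_study data
toy <- data.frame(
  message = rep(c("expose", "public health"), times = c(45, 44)),
  drink   = c(rnorm(45, mean = 2.5, sd = 1), rnorm(44, mean = 3.0, sd = 1))
)

# b1 for a two-group model: the coefficient on the group variable
get_b1 <- function(outcome, group) {
  unname(coef(lm(outcome ~ group))[2])
}

sample_b1 <- get_b1(toy$drink, toy$message)

# Shuffle the outcome many times; each shuffle breaks any link to message,
# which mimics a DGP where beta_1 = 0
shuffled_b1s <- replicate(1000, get_b1(sample(toy$drink), toy$message))

# Two-tailed p-value: proportion of shuffled b1s at least as extreme as ours
p_value <- mean(abs(shuffled_b1s) >= abs(sample_b1))
p_value
```

Because the shuffles are random, the p-value will vary a little from run to run unless the seed is fixed.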

In [None]:
# run this to see p-value
message_model <- lm(drink ~ message, data = junkads_study)
supernova(message_model)

## 3 A Better Question About the DGP: What could $\beta_1$ be?

Up until now, we’ve only evaluated one possible value of $\beta_1$: 0. We could reject it or not, but that’s a limited answer.

A better question is: What could $\beta_1$ be?

In reality, $\beta_1$ could take on a range of values. Our goal is to figure out which of those values are most consistent with our sample data.

The basic idea is illustrated in the animation below. The blue number lines represent the possible values of $\beta_1$ (the population parameter) and $b_1$ (the estimate from the sample). 

The animation starts with the sampling distribution we constructed for the empty model. Then, using hypothetical thinking, we imagine sliding that distribution up and down the number line, each position representing a different possible value of $\beta_1$:
- When the sampling distribution is shifted to the right, we’re imagining a DGP where $\beta_1$ is larger (sometimes, much larger than our sample $b_1$).
- When it’s shifted to the left, we’re imagining a DGP where $\beta_1$ is smaller.


<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/12.1-12.5-gif.gif" width = 80% alt = "sampling distribution being moved up and down number line">

<div class="discussion-question">

### 3.1 Discussion Question: Are some values of $\beta_1$ more likely to produce our observed $b_1$ than others?

In the animation, think about how close or far each $\beta_1$ is from the sample $b_1$. Which $\beta_1$s would make our observed $b_1$ more likely, and which would make it less likely?

</div>

<div class="guided-notes">

### 3.2 Turn to your paper number lines and sampling distribution. Let's start with one DGP we're familiar with: the empty model.
    
- Mark the sample $b_1$ on the number line provided. 
- Then place the paper sampling distribution to represent the possible samples from the empty-model DGP. (Note this sampling distribution has 10,000 runs to make the shape smoother.)
    
</div>

<div class="discussion-question">

### 3.3 Discussion Question: Is the sample likely to come from this DGP? How is the paper activity like the code below?
    
<code>sdob1 <- do(1000) * b1(shuffle(drink) ~ message, data = junkads_study)</code>
         
<code>sample_b1 <- b1(drink ~ message, data = junkads_study)</code>
    
<code>gf_histogram(~b1, data = sdob1, fill = ~middle(b1, 0.95)) %>%</code><br>
<code>  gf_point(0 ~ sample_b1)</code>

</div>

<div class="guided-notes">

### 3.4 Try sliding your sampling distribution to explore different $\beta_1$s:

- Find a $\beta_1$ where our sample $b_1$ would be unlikely (in the outer tails) to be generated by that DGP. 
- Find a $\beta_1$ where our sample $b_1$ would be not-unlikely (in the middle .95) to be generated by that DGP.
- What do you notice about $\beta_1$s where our sample would be not-unlikely?

</div>

<div class="guided-notes">

### 3.5 Let's use this strategy to find the *confidence interval* for $\beta_1$: the range of values where our sample would be considered not-unlikely.
    
- Find the biggest $\beta_1$ where our sample is not-unlikely. Mark this value on the DGP number line.
- Find the smallest $\beta_1$ where our sample is not-unlikely. Mark this value on the DGP number line.

</div>

<div class="guided-notes">

### 3.6 Imagine a DGP where $\beta_1$ happens to be exactly the same as our sample $b_1$ (~0.50). Slide your sampling distribution so that it is aligned with $\beta_1 = 0.50$. 

How does the middle 95% region of this sampling distribution compare to the confidence interval you just marked?
    
</div>
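
The paper-sliding activity can also be imitated in code. The following is a rough sketch under stated assumptions: the null sampling distribution is simulated from a normal distribution with an arbitrary spread (it is not the class's shuffled distribution), `sample_b1` is set to the approximate value of 0.50 quoted above, and `not_unlikely()` is a hypothetical helper.

```r
set.seed(2)

# Hypothetical sampling distribution of b1 centered at 0 (spread is an assumption)
null_b1s  <- rnorm(10000, mean = 0, sd = 0.23)
sample_b1 <- 0.50  # approximate sample b1

# If the distribution were slid so it is centered at beta1,
# would our sample b1 land in its middle 95%?
not_unlikely <- function(beta1) {
  cutoffs <- quantile(null_b1s + beta1, c(0.025, 0.975))
  sample_b1 >= cutoffs[1] && sample_b1 <= cutoffs[2]
}

# Try many candidate values of beta1 along the number line
candidates <- seq(-0.5, 1.5, by = 0.01)
keep <- sapply(candidates, not_unlikely)

# Smallest and largest beta1 for which the sample is not-unlikely
ci_by_sliding <- range(candidates[keep])
ci_by_sliding

# Middle 95% of the same distribution when it is centered at sample_b1
middle_95 <- quantile(null_b1s + sample_b1, c(0.025, 0.975))
middle_95
```

With a roughly symmetric sampling distribution, the two results should agree up to the 0.01 grid spacing and simulation noise.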

## 4 Bootstrapping a Sampling Distribution of $b_1$s

If only we had a sampling distribution centered at our sample $b_1$, we could find the confidence interval directly (without using paper sampling distributions) because:

- The .025 cutoff in the lower tail of that sampling distribution would line up perfectly with the lower bound of the confidence interval.
- The .025 cutoff in the upper tail would line up with the upper bound.

We can create that distribution using bootstrapping (also called *resampling*). Bootstrapping assumes that the entire population is made up of many, many copies of individuals just like the ones in our sample.

The `resample()` function creates new samples by drawing individuals (rows) from our data frame with replacement. That means we might select the same person more than once, as if there were many identical copies of these 89 individuals from `junkads_study` in our population.
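
To make "drawing rows with replacement" concrete, here is a minimal base-R sketch of the same idea. It is not the actual implementation of `resample()`; the tiny data frame `toy` and its values are made up for illustration.

```r
set.seed(3)

# Hypothetical mini data frame standing in for junkads_study
toy <- data.frame(id = 1:6, drink = c(2, 3, 5, 1, 4, 3))

# Draw row numbers with replacement, then pull out those rows
rows <- sample(nrow(toy), size = nrow(toy), replace = TRUE)
boot_sample <- toy[rows, ]

boot_sample            # same number of rows as toy; ids may repeat
table(boot_sample$id)  # how many times each drawn row appears
```

Each run (with a different seed) gives a different resampled data frame, which is why a resampled $b_1$ changes from run to run while the original sample $b_1$ does not.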

<div class="discussion-question">

### 4.1 Discussion Question: Run the code below a few times. What is the difference between the two results?
    
Use the `do()` function to create a few resampled $b_1$s. How different are they from the original sample $b_1$?

</div>

In [None]:
# run this code a few times
b1(drink ~ message, data = junkads_study)

b1(drink ~ message, data = resample(junkads_study))


<div class="guided-notes">

### 4.2 Fill in the blanks to bootstrap a sampling distribution and visualize it as a histogram

<code>sdob1_boot <- do(_____) * b1(drink ~ message, data = ____________________)</code>

<code>gf_histogram(~b1, data = ____________, fill = ~middle(b1, .95), bins = 100)</code>

</div>

In [None]:
# run this
sdob1_boot <- do(1000) * b1(drink ~ message, data = resample(junkads_study))

gf_histogram(~ b1, data = sdob1_boot, fill = ~middle(b1, .95), bins = 100) 

<div class="discussion-question">

### 4.3 Discussion Question: Why is this sampling distribution centered at 0.50? Why not at 0?

</div>

<div class="guided-notes">
    
### 4.4 Just by eyeballing the distribution, where would you put the .025 cutoffs in the lower and upper tails, the values that would align with the bounds of the confidence interval? Label them on the sampling distribution.
    
</div>

### 4.5 To find the actual cutoff values, we can arrange the $b_1$s in `sdob1_boot` in order (from smallest to largest), then look at the 26th and the 975th (ignoring the bottom 25 and top 25 $b_1$s).
    
<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/R4scg010.png">
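
The sort-and-count idea can be sketched on a plain numeric vector. The 1000 values below are simulated stand-ins for bootstrapped $b_1$s (their center and spread are assumptions), and the names `boot_b1s`, `lower`, and `upper` are illustrative.

```r
set.seed(4)

# Hypothetical stand-in for 1000 bootstrapped b1s
boot_b1s <- rnorm(1000, mean = 0.50, sd = 0.23)

sorted_b1s <- sort(boot_b1s)   # smallest to largest
lower <- sorted_b1s[26]        # first value after the bottom 25
upper <- sorted_b1s[975]       # last value before the top 25

c(lower = lower, upper = upper)

# Proportion of the b1s that fall between the two cutoffs (inclusive)
mean(boot_b1s >= lower & boot_b1s <= upper)
```

Dropping 25 values from each end of 1000 leaves the middle 950, which is where the 95% comes from.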


<div class="guided-notes">
    
### 4.6 Complete the table to show what each line of code does and write the missing code

</div>

<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;"> 
    <thead> 
        <tr> 
            <td style="border: 1px solid black; font-weight: bold; text-align: center; width:40%">What it does</td> 
            <td style="border: 1px solid black; font-weight: bold; text-align: center; width:60%">Code</td> 
        </tr> 
    </thead> 
    <tbody> 
        <tr> 
            <td style="border: 1px solid black; text-align: left; height: 60px;"> </td> 
            <td style="border: 1px solid black; text-align: left; height: 60px;"><code>sdob1_boot &lt;- arrange(sdob1_boot, b1)</code></td> 
        </tr> 
        <tr> 
            <td style="border: 1px solid black; text-align: left; height: 60px;"> </td> 
            <td style="border: 1px solid black; text-align: left; height: 60px;"><code>sdob1_boot$b1[26]</code></td> 
        </tr>
        <tr> 
            <td style="border: 1px solid black; text-align: left; height: 60px;">Prints the 975th <i>b</i><sub>1</sub></td> 
            <td style="border: 1px solid black; text-align: left; height: 60px;"> </td> 
        </tr> 
    </tbody> 
</table>

In [None]:
# add to this code
sdob1_boot <- arrange(sdob1_boot, b1)
sdob1_boot$b1[26]


<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/12.1-12.5-overview-t-distributions.jpg" alt = "concept of confidence interval with t-distributions instead of simulated sampling distributions" align = "right" width = 50%>

### 4.7 Just as we used the t-distribution in the previous chapter to model the sampling distribution of $b_1$ for purposes of calculating a p-value (the approach used by `supernova()`), we can use it here to calculate a 95% confidence interval.

Try running the code below to calculate a 95% confidence interval that uses the t-distribution as a model of the sampling distribution of $b_1$.

In [None]:
# run this
confint(message_model)

<div class="discussion-question">

### 4.8 Discussion Question: What is the confidence interval for $\beta_1$ using this approach? How does it compare to the bootstrapped confidence interval we found earlier?

</div>
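
For readers curious about what the t-based interval involves, the sketch below rebuilds it by hand as the estimate plus or minus a t cutoff times the standard error, then compares it with `confint()`. It uses simulated data (`toy` and `toy_model` are hypothetical, not `junkads_study` or `message_model`).

```r
set.seed(5)

# Hypothetical two-group data, NOT the junkads_study data
toy <- data.frame(
  message = rep(c("expose", "public health"), times = c(45, 44)),
  drink   = c(rnorm(45, mean = 2.5, sd = 1), rnorm(44, mean = 3.0, sd = 1))
)
toy_model <- lm(drink ~ message, data = toy)

# Estimate and standard error of b1 from the model summary
est <- coef(summary(toy_model))[2, "Estimate"]
se  <- coef(summary(toy_model))[2, "Std. Error"]

# t cutoff that leaves .025 in each tail
t_crit <- qt(0.975, df = df.residual(toy_model))

by_hand <- c(est - t_crit * se, est + t_crit * se)
by_hand

confint(toy_model)[2, ]  # row for b1; should match by_hand
```

The same .025 tail cutoffs appear here as in the bootstrapped approach, except that they come from a mathematical model of the sampling distribution rather than from resampled $b_1$s.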

<div class="guided-notes">
    
### 4.9 What are the pros and cons of the different ways we estimated confidence intervals today?
    
</div>

<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;"> 
    <thead> 
        <tr> 
            <td style="border: 1px solid black; font-weight: bold; text-align: center; width:40%">Approach</td> 
            <td style="border: 1px solid black; font-weight: bold; text-align: center; width:30%">Pros</td> 
            <td style="border: 1px solid black; font-weight: bold; text-align: center; width:30%">Cons</td> 
        </tr> 
    </thead> 
    <tbody> 
        <tr> 
            <td style="border: 1px solid black; text-align: left; height: 60px;">Paper sampling distributions</td> 
            <td style="border: 1px solid black; text-align: left; height: 60px;"> </td> 
            <td style="border: 1px solid black; text-align: left; height: 60px;"> </td> 
        </tr> 
        <tr> 
            <td style="border: 1px solid black; text-align: left; height: 60px;">Bootstrapped sampling distributions<br><code>resample(junkads_study)</code></td> 
            <td style="border: 1px solid black; text-align: left; height: 60px;"> </td> 
            <td style="border: 1px solid black; text-align: left; height: 60px;"> </td> 
        </tr> 
        <tr> 
            <td style="border: 1px solid black; text-align: left; height: 60px;"><i>t</i>-distribution, mathematical model of sampling distribution<br>
                <code>confint(message_model)</code></td> 
            <td style="border: 1px solid black; text-align: left; height: 60px;"> </td> 
            <td style="border: 1px solid black; text-align: left; height: 60px;"> </td> 
        </tr> 
    </tbody> 
</table>

<div class="guided-notes">
    
### 4.10 All three methods lead to the same interpretation of the confidence interval: 0.04, 0.96. Write that interpretation.
    
</div>