# Sampling Distributions for Model Comparison (COMPLETE)
## Chapter 11.1-11.5 Overview Notebook

In [None]:
# run this to set up the notebook
suppressMessages(library(coursekata))

css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))

# read in data frame
know_geo <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRxzG5YClv66VeFsYhq0At4eX6n1yJnp6ylluf-EEont6jg1aNvfpZ8wbA7Qi17v93ronB7eVg5Ba8X/pub?gid=1892668026&single=true&output=csv")

# remove scientific notation from notebook
options(scipen = 999)

<div class="teacher-note">
    
**Teacher Note:** In this section, students extend their understanding of sampling distributions to PRE and F. They learn to simulate sampling distributions for these statistics under the empty model and use them to evaluate whether the parameter estimates of a complex model could plausibly have arisen just by chance. This builds on earlier logic developed with sampling distributions of b1.

Students should understand that:
- Just like with b1, we can simulate sampling distributions of model-comparison statistics like PRE and F assuming that the empty model is true.
- Just like with b1, we build up the sampling distribution by shuffling the outcome variable to break any real relationship, repeatedly fitting the model, and recording the resulting statistic (PRE or F) each time.
- The observed statistic (PRE or F) from the original data can be compared to the simulated distribution. If it is more extreme than most shuffled values, this suggests the model may capture a real effect in the DGP. (Unlike with b1, “extreme” values for PRE and F are large positive values.)
- Just like with b1, a p-value can be estimated from the sampling distribution by calculating the proportion of shuffled values that are as extreme or more extreme than the observed value.
- This approach is structurally identical to null hypothesis significance testing, but framed through the lens of model comparison.
- Sampling distributions of b1, PRE, or F will lead to the same conclusion. For example, if we use the sampling distribution of b1 to reject the empty model as a plausible DGP, using sampling distributions of PRE or F will not only lead to the same conclusion (e.g., reject empty model), it will result in the same p-value. These are just different ways of exploring the same underlying question: Is the empty model a plausible model of the DGP?


    
A <a href="https://docs.google.com/document/d/1cn81sPCepswbaI-Ktdl1LK-lM5NE4Glsk1bjv8Ru6cI/edit?tab=t.5y2a0ykmi2fk" target="_blank">printable student guided-notes worksheet</a> is available to go with this Jupyter notebook, as well as a student version of this notebook.
    
</div>

## 1 Exploring and Modeling Variation in a New Dataset 


### 1.1 Let's collect some data. On the half-sheet of paper you have before you, answer this question:

> How many country names do you think you can write down in 1 minute? ______

Then, when we say "Go," flip your paper over and do the task. 

### 1.2 Data from a class of 22 students who did this task are in the `know_geo` data frame

There are three variables in this data frame:
- `pre_count` which is the student's answer to "How many country names do you think you can write down in 1 minute?"
- `actual_count` how many countries they named 
- `total_responses` how many total responses (these responses included non-countries such as states, cities, or continents)

Run the code below to take a look at the data frame.

In [None]:
# run this
head(know_geo)

<div class="guided-notes">
    
## 1.3 Write a word equation for this hypothesis: `pre_count` will help us make a better prediction of `actual_count`

</div>

<div class="discussion-question">

## 1.4 Discussion Questions: Take a look at the data visualization and best-fitting model. Do you think there is some evidence for this hypothesis? Explain.
    
</div>

In [None]:
# run this
pre_count_model <- lm(actual_count ~ pre_count, data = know_geo)

# we've zoomed out from the default graph a little bit
gf_point(actual_count ~ pre_count, data = know_geo) %>%
  gf_point(0 ~ -5, alpha = 0) %>%
  gf_lm(color = "firebrick") +
  scale_x_continuous(breaks = seq(-5, 30, 5))
  

<div class="guided-notes">
    
## 1.5 Annotate the graph with $b_0$ and $b_1$. What does each represent?
    
- $b_0 = 4.97$:

- $b_1 = 0.47$:


</div>

In [None]:
# run this
lm(actual_count ~ pre_count, data = know_geo)

<div class="guided-notes">
    
## 1.6 Here is the best-fitting parameter estimate for the empty model. Draw the empty model onto the visualization. 

</div>

In [None]:
# run this
lm(actual_count ~ NULL, data = know_geo)

<div class="guided-notes">
    
## 1.7 Interpret the PRE and F. (Make sure to note how these are model comparisons that compare the `pre_count_model` against the empty model.)

</div>
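
As a reference for interpreting the table, here is a minimal base-R sketch of how PRE and F are built from sums of squares. It is only an illustration: `fake_geo` is simulated data (not the class data), its variable names simply mirror `know_geo`, and the object names are made up for this example.

```r
# Hypothetical data: 22 simulated students (NOT the real know_geo data)
set.seed(11)
fake_geo <- data.frame(pre_count = round(runif(22, 5, 30)))
fake_geo$actual_count <- round(5 + 0.5 * fake_geo$pre_count + rnorm(22, sd = 4))

empty_model   <- lm(actual_count ~ 1, data = fake_geo)
complex_model <- lm(actual_count ~ pre_count, data = fake_geo)

ss_total <- sum(resid(empty_model)^2)    # error left by the empty model
ss_error <- sum(resid(complex_model)^2)  # error left by the complex model
ss_model <- ss_total - ss_error          # error reduced by adding pre_count

df_model <- 1                    # one extra parameter (b1)
df_error <- nrow(fake_geo) - 2   # n minus the two estimated parameters

pre_value <- ss_model / ss_total                            # proportion of error reduced
f_value   <- (ss_model / df_model) / (ss_error / df_error)  # MS Model / MS Error

c(PRE = pre_value, F = f_value)
```

Both statistics compare the same two models: PRE expresses the reduction in error as a proportion, while F expresses it per parameter spent, relative to the error that remains.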

In [None]:
# run this
supernova(pre_count_model)

## 2 Could this sample have come from the empty-model DGP (where $\beta_1 = 0$)?

We’ve learned to evaluate this kind of question in two different ways (that lead to the same conclusion):
1. **Using the p-value:** we check how likely our sample statistic would be if the empty-model DGP were true using a mathematically modeled sampling distribution.
2. **By simulating a sampling distribution:** we check how likely our sample statistic would be if the empty-model DGP were true using a shuffled sampling distribution.

So far, we’ve done this with $b_1$ from a group model (e.g., comparing two conditions).

Now we’ll apply the same idea to the $b_1$ from a regression model, using the same concepts and R code.

Later, we’ll see that we can extend this idea to any sample statistic, not just $b_1$! We’ll do it with F and PRE, and we’ll see that all of these approaches (using the p-value, or building a sampling distribution of $b_1$, F, or PRE) lead to the same conclusion.

<div class="discussion-question">

### 2.1 Make a Prediction: Based on the p-value from the ANOVA table, is our sample $b_1$ (and our sample F and PRE) unlikely to have been generated from the empty-model DGP? If we visualize our sample on a sampling distribution, where do you think it will fall?

Use the community standard for “unlikely” that we have seen before.
    
</div>

<div class="guided-notes">

### 2.2 Let's check our predictions by generating a sampling distribution from the empty-model DGP. 
    
Fill in the missing R code to generate a sampling distribution of 1000 $b_1$s (`sdob1`) from the empty-model DGP in the first three rows of the **Sampling Distribution of $b_1$** column.
    
</div>

<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%">What the Code is Doing</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:40%">Sampling Distribution of $b_1$</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:40%">Sampling Distribution of F</td>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">Generate a sampling distribution of 1000 samples from the empty-model DGP</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sdob1 <- do(1000) * b1(shuffle(actual_count) ~ pre_count, data = know_geo)</code></td>
            <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sdof <-</code></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">Save our sample statistic as an R object</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sample_b1 <- ___(actual_count ~ pre_count, data = know_geo)</code></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sample_f <-</code></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">Plot the sampling distribution; fill it so you can see alpha<br><br>
Add the sample statistic as a dot
</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>gf_histogram(~___, data=_______,<br><br>
  fill = ~_________________) %>%<br><br>
  gf_point(0 ~ ______________)</code></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>gf_histogram(~___, data=_______,<br><br>
  fill = ~_________________) %>%<br><br>
  gf_point(0 ~ ______________)</code></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">Label the alpha in this sampling distribution<br><br>

Add a dot for the sample statistic<br><br>

Use a highlighter to indicate p-value<br><br>
</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/11.1-11.5-overview-sdob1-2.jpg" alt="sampling distribution of b1s"></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/11.1-11.5-overview-sdof-2.jpg" alt="sampling distribution of fs"></td>
    </tr>
  </tbody>
</table>


In [None]:
# Generate a sampling distribution of 1000 samples from the empty-model DGP
sdob1 <- do(1000) * b1(shuffle(actual_count) ~ pre_count, data = know_geo)

# Save our sample statistic as an R object
sample_b1 <- b1(actual_count ~ pre_count, data = know_geo)

# Plot the sampling distribution; fill it so you can see alpha
# Add the sample statistic as a dot
gf_histogram(~b1, data = sdob1, 
  fill = ~middle(b1, .95)) %>%
  gf_point(0 ~ sample_b1)


<div class="guided-notes">

### 2.3 Annotate the visualization

- Label the alpha in this sampling distribution
- Add a dot for the sample statistic
- Use a highlighter to indicate p-value

    
</div>

<div class="discussion-question">

### 2.4 Discussion Questions: What is the shape, center, spread of the sampling distribution of $b_1$? 

</div>

<div class="teacher-note">

**Sample Responses:**
- Shape: normal or bell-shaped, symmetrical
- Center: 0 (this makes sense because these samples all came from a DGP where the $\beta_1 = 0$); since there is no relationship between `pre_count` and `actual_count` in this DGP, most of the samples don't have a relationship either
- Spread: the 1000 $b_1$s fall roughly between $\pm0.6$; the middle .95 fall between about $\pm0.4$
    
**Teacher Note:** Use this question to also show students that, like any distribution, we can compute the summary statistics (or favstats) for a sampling distribution. Run the `favstats()` code to show that the mean is basically 0 and the standard deviation is ~0.20. Point out to students that the standard deviation of a sampling distribution has a special name: the standard error of $b_1$.

A connection you can make is that the middle 95% of a roughly normal sampling distribution usually falls within $\pm2$ standard deviations of the mean ($0.20 \times 2 = 0.40$). This is what we observed in the histogram.

</div>
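
The $\pm2$ standard error connection can also be checked numerically. The sketch below uses base R only and is meant as an illustration under stated assumptions: `sample()` stands in for `shuffle()`, `fake_geo` is simulated data (not the class data), and `shuffled_b1()` is a helper invented for this example.

```r
# Hypothetical data: 22 simulated students (NOT the real know_geo data)
set.seed(22)
fake_geo <- data.frame(pre_count = round(runif(22, 5, 30)))
fake_geo$actual_count <- round(5 + 0.5 * fake_geo$pre_count + rnorm(22, sd = 4))

# One shuffled b1: permute the outcome, refit the model, keep the slope
shuffled_b1 <- function(data) {
  coef(lm(sample(actual_count) ~ pre_count, data = data))[["pre_count"]]
}

b1s <- replicate(1000, shuffled_b1(fake_geo))

standard_error <- sd(b1s)                    # SD of the sampling distribution
middle_95 <- quantile(b1s, c(0.025, 0.975))  # cutoffs for the middle 95%

mean(b1s)                   # should be close to 0
standard_error
middle_95
c(-2, 2) * standard_error   # should land close to the cutoffs above
```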

In [None]:
# run this
favstats(~ b1, data = sdob1)

<div class="guided-notes">

### 2.5 Annotate the visualization of the sampling distribution of $b_1$ (`sdob1`) with the mean and standard error. 
    
</div>

## 3 Using a Sampling Distribution of F 

From the p-value and from generating a sampling distribution of $b_1$, we already have an answer to this question: Could this sample have come from the empty-model DGP (where $\beta_1 = 0$)?

It could, *but* our sample $b_1$ would be an unlikely one from that DGP.

Here we will see how we can come to this conclusion through yet another route: by generating a sampling distribution of F from the empty-model DGP.

### 3.1 Here's a new function called `f()`. This compares the `pre_count` model to the empty model and returns the F-statistic.

Run it and confirm it matches the F you saw in the ANOVA table.

In [None]:
# run this
f(actual_count ~ pre_count, data = know_geo)


### 3.2 Modify the code to simulate an F from the empty-model DGP

Incorporate `shuffle()`, just like you did for the `b1()` function, to generate data from the null DGP, fit the model, compare it to the empty model, and return the F.

In [None]:
# modify this
f(actual_count ~ pre_count, data = know_geo)

# COMPLETE
f(shuffle(actual_count) ~ pre_count, data = know_geo)


<div class="discussion-question">

### 3.3 Discussion Question: Run the code to generate 5 Fs from the empty-model DGP. Why are these different?

</div>

In [None]:
# run this
do(5) * f(shuffle(actual_count) ~ pre_count, data = know_geo)

<div class="teacher-note">

**Sample Responses:**
- Each F value is different because every time we use `shuffle()`, we create a new random sample from the empty-model DGP.
- Each sample comes from a DGP where there is no real relationship between `pre_count` and `actual_count`. So these sample Fs vary because of randomness. 
- These different F values show that even if the empty-model DGP were true, there would be some variation in Fs.

**Teacher Note:** You may want to ask a follow up question—Do these Fs seem big or small? This is a hard question because students are usually like "How should I know? How do you even judge an F?" That could motivate generating 1000 Fs to see how they tend to vary and to remind them that our goal will be to see whether our sample F blends in with these random Fs or sticks out in some way.
</div>
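
For teachers who want to show what is happening under the hood, here is a rough base-R sketch of the shuffle-and-refit idea. It does not use the `f()` or `shuffle()` functions from the notebook: `sample()` plays the role of shuffling, `f_stat()` is a helper invented for this example, and `fake_geo` is simulated data rather than the class data.

```r
# Hypothetical data: 22 simulated students (NOT the real know_geo data)
set.seed(33)
fake_geo <- data.frame(pre_count = round(runif(22, 5, 30)))
fake_geo$actual_count <- round(5 + 0.5 * fake_geo$pre_count + rnorm(22, sd = 4))

# F comparing the one-predictor model to the empty model, read off the ANOVA table
f_stat <- function(y, x) anova(lm(y ~ x))[1, "F value"]

# F for the unshuffled (simulated) sample
sample_f_fake <- f_stat(fake_geo$actual_count, fake_geo$pre_count)

# Five shuffled Fs: sample() reorders the outcome, breaking any link to pre_count
five_fs <- replicate(5, f_stat(sample(fake_geo$actual_count), fake_geo$pre_count))

sample_f_fake
five_fs
```

Each call to `sample()` produces a different ordering of the outcome, which is why the five Fs differ even though nothing about the predictor changes.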

<div class="guided-notes">

### 3.4 Let's try to figure out how to generate a sampling distribution of Fs from the empty-model DGP by using the `f()` function (instead of the `b1()` function)
    
Fill in the missing R code to generate a sampling distribution of 1000 Fs (`sdof`, so to speak) from the empty-model DGP in the first three rows of the **Sampling Distribution of F** column.
    
Don't worry about filling in the histogram with color yet. Just make the histogram first.
    
</div>

<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%">What the Code is Doing</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:40%">Sampling Distribution of $b_1$</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:40%">Sampling Distribution of F</td>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">Generate a sampling distribution of 1000 samples from the empty-model DGP</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sdob1 <- do(1000) * b1(shuffle(actual_count) ~ pre_count, data = know_geo)</code></td>
            <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sdof <-</code></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">Save our sample statistic as an R object</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sample_b1 <- b1(actual_count ~ pre_count, data = know_geo)</code></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sample_f <-</code></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">Plot the sampling distribution; fill it so you can see alpha<br><br>
Add the sample statistic as a dot
</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>gf_histogram(~b1, data = sdob1,<br><br>
  fill = ~middle(b1, .95)) %>%<br><br>
  gf_point(0 ~ sample_b1)</code></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>gf_histogram(~___, data=_______,<br><br>
  fill = ~_________________) %>%<br><br>
  gf_point(0 ~ ______________)</code></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">Label the alpha in this sampling distribution<br><br>

Add a dot for the sample statistic<br><br>

Use a highlighter to indicate p-value<br><br>
</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/11.1-11.5-overview-sdob1-2.jpg" alt="sampling distribution of b1s"></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/11.1-11.5-overview-sdof-2.jpg" alt="sampling distribution of fs"></td>
    </tr>
  </tbody>
</table>


In [None]:
# Note this code takes a little time to run (like 15 seconds)
# Generate a sampling distribution of 1000 Fs from the empty-model DGP
sdof <- do(1000) * f(shuffle(actual_count) ~ pre_count, data = know_geo)

# Prints out the head of sdof
head(sdof)

In [None]:
# Save our sample statistic as an R object
sample_f <- f(actual_count ~ pre_count, data = know_geo)

# Plot the sampling distribution; don't worry about the fill yet
# Add the sample statistic as a dot
gf_histogram(~f, data = sdof) %>%
  gf_point(0 ~ sample_f)


<div class="discussion-question">

### 3.5 Discussion Question: What is the shape, center, and spread of this sampling distribution?

- Why is it different from the sampling distribution of $b_1$s? 
- Where are the "unlikely" Fs?
    
</div>

<div class="teacher-note">

**Sample Responses:**
- Shape: Very much right-skewed; long upper tail.
- Center: Although there are lots of Fs near 0, the mean is probably not 0 
- Spread: Most Fs are small, with occasional large values in the upper tail (as high as 15 or 20)
- The unlikely Fs are the really big ones (the small ones are very likely)

**Teacher Note:**
Point out that F is a ratio of variances (MS Model and MS Error). Variances can’t be negative (they are built from squared deviations), so the ratio can't go below 0.

Another connection to make is that when $\beta_1 = 0$ (that is, there is no relationship in the DGP between X and Y), MS Model (the variance explained by the model) and MS Error (the variance left unexplained) both estimate the same error variance, so F tends to be around 1. You can run `favstats()` to check the mean of the shuffled Fs.

</div>
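
The two points above (F cannot go below 0, and F tends to be around 1 under the empty model) can be illustrated with a small base-R simulation. This is a sketch under assumptions: `fake_geo` is simulated data, `sample()` stands in for `shuffle()`, and `f_stat()` is a helper invented for this example.

```r
# Hypothetical data: 22 simulated students (NOT the real know_geo data)
set.seed(44)
fake_geo <- data.frame(pre_count = round(runif(22, 5, 30)))
fake_geo$actual_count <- round(5 + 0.5 * fake_geo$pre_count + rnorm(22, sd = 4))

# F comparing the one-predictor model to the empty model
f_stat <- function(y, x) anova(lm(y ~ x))[1, "F value"]

# 1000 Fs from shuffled outcomes (no real relationship with pre_count)
fs <- replicate(1000, f_stat(sample(fake_geo$actual_count), fake_geo$pre_count))

min(fs)              # never below 0: F is a ratio of two non-negative quantities
mean(fs)             # tends to be around 1 when there is no relationship
quantile(fs, 0.95)   # cutoff for the upper 5% (the alpha region)
```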

In [None]:
# run this
favstats(~ f, data = sdof)

### 3.6 Because this sampling distribution is right-skewed, the unlikely values are not split between the lower and upper tails, but only in the upper tail.

That means we’ll want to color in the upper 5% of this distribution to represent our alpha region.

We can do this by using the functions `upper()` or `lower()` (instead of `middle()`) in our `gf_histogram()` code. Try playing around with these functions in the code cell below.

In [None]:
# try upper() or lower() instead of middle()
gf_histogram(~f, fill = ~middle(f, .95), data = sdof) %>%
  gf_point(0 ~ sample_f)

# COMPLETE
# using upper
gf_histogram(~f, fill = ~upper(f, .05), data = sdof) %>%
  gf_point(0 ~ sample_f)

# using lower
gf_histogram(~f, fill = ~lower(f, .95), data = sdof) %>%
  gf_point(0 ~ sample_f)

<div class="guided-notes">

### 3.7 Update the code in the third row of the table so that the histogram of the Sampling Distribution of F shows the most unlikely 5% of Fs (the upper tail).

</div>

<div class="guided-notes">

### 3.8 Annotate the visualization of the sampling distribution of F.

- Label the alpha in this sampling distribution
- Add a dot for the sample statistic
- Use a highlighter to indicate p-value

    
</div>

### 3.9 Modify the code below to color-fill the p-value

The code below color-fills Fs greater than 1 in a different color. How would we color-fill the random Fs that are more extreme than our sample F differently? 

In [None]:
# modify this
gf_histogram(~f, fill=~(f > 1), data = sdof) %>%
  gf_point(0 ~ sample_f)

# COMPLETE
gf_histogram(~f, fill=~(f > sample_f), data = sdof) %>%
  gf_point(0 ~ sample_f)

<div class="discussion-question">

### 3.10 Make a Prediction: The p-value from the ANOVA table was 0.0149. That represents the probability that the empty-model DGP would produce an F as extreme as, or more extreme than, our sample F. If we generated 1000 Fs by shuffling, how many of them would you expect to be more extreme than our sample F?

Another way to ask this is this: How many Fs are shown in the p-value we just visualized?

It will only be approximately that amount, not exactly. Why? 

</div>

<div class="teacher-note">

**Sample Responses:**

- 0.0149 of 1000 Fs would be ~15 Fs.
- It won't be exactly 15 because it's just 1000 random Fs. A different set of 1000 Fs would lead to a slightly different number of extreme Fs. But it should be close to 15.

</div>
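
The arithmetic behind this prediction can be written out in a few lines of base R. As a simplifying assumption, the sketch treats the p-value from the ANOVA table as the chance that any single shuffled F beats the sample F, so the count of extreme Fs behaves like a binomial count. The object names are invented for this example.

```r
p_value    <- 0.0149   # p-value reported in the ANOVA table
n_shuffles <- 1000     # number of shuffled Fs

# Expected number of shuffled Fs more extreme than the sample F
expected_count <- p_value * n_shuffles   # 14.9, i.e. about 15 Fs

# Under the binomial assumption, the count varies from one set of shuffles to the next
count_sd <- sqrt(n_shuffles * p_value * (1 - p_value))

expected_count
count_sd
qbinom(c(0.025, 0.975), size = n_shuffles, prob = p_value)  # typical range of counts
```

The spread in the count is why two runs of the shuffle code rarely give exactly the same number of extreme Fs.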

<div class="guided-notes">

### 3.11 Write the R code to count how many random Fs are more extreme than our sample F.

How would we turn this into a proportion?
</div>

In [None]:
# write code


# COMPLETE
tally(~ f > sample_f, data = sdof)
tally(~ f > sample_f, data = sdof, format = "proportion")

In [None]:
# If you'd like to see what happens with a new set of random Fs, run this

# Note this code takes a little time to run (like 15 seconds)
sdof <- do(1000) * f(shuffle(actual_count) ~ pre_count, data = know_geo)
tally(~ f > sample_f, data = sdof)

<div class="guided-notes">
    
### 3.12 Just like the sampling distribution of $b_1$s can be modeled by a smooth t-distribution, the sampling distribution of Fs can be modeled by a smooth F-distribution.

The exact shape of the F-distribution depends on both the df Model and the df Error.

Add the appropriate values for df Model (`df1`) and df Error (`df2`) to the code below to visualize our p-value on a mathematical F-distribution.
    
</div>

In [None]:
# add in the appropriate df1 and df2
xpf(sample_f, df1 = 1, df2 = 1)

# COMPLETE
xpf(sample_f, df1 = 1, df2 = 20)

<div class="discussion-question">

### 3.13 Discussion Question: We already had a strategy for evaluating whether our sample could have come from the empty-model DGP: using a sampling distribution of $b_1$. We learned another strategy that leads to the same conclusion: using a sampling distribution of F. Why might it still be useful to learn both approaches? 
    
What does the F approach help us see or do that the $b_1$ approach doesn’t?

</div>

<div class="teacher-note">

**Sample Responses:**

- $b_1$ works well for a two-group model or a simple regression, but if we have many groups or predictors, using F might be better than testing each parameter estimate separately.
- The F-distribution is easier to think about because we only have to check the upper tail! The `tally()` code is way easier.
- F provides a model-comparison perspective; it tells us how much better our model fits compared to the empty model, not just the direction or size of the slope. The F-statistic also accounts for how complex our model is.
- It’s good to know more than one way because it helps our thinking be more flexible. If we forget one method, we might remember the other. Also, learning both methods gives us another chance to reinforce ideas about shuffle, sampling distributions, alpha, and p-value.
- One thing that’s annoying about F is that the shuffle code takes a little longer to run!

**Teacher Note:**
This is a good chance for students to "convince themselves" that these two views are connected and that it is valuable to learn both. In psychology, this is called a *saying is believing* exercise. Encourage them to see that both $b_1$ and F are different examples of the same inferential logic: whether the data pattern we found is likely if the empty model of the DGP were true. 

</div>
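
One way to make the connection concrete for a one-predictor model is to check that F is the square of the t statistic for $b_1$, and that PRE is a rescaling of F. The base-R sketch below does this on simulated data; `fake_geo` and the object names are assumptions made for illustration, not part of the class data.

```r
# Hypothetical data: 22 simulated students (NOT the real know_geo data)
set.seed(55)
fake_geo <- data.frame(pre_count = round(runif(22, 5, 30)))
fake_geo$actual_count <- round(5 + 0.5 * fake_geo$pre_count + rnorm(22, sd = 4))

fit <- lm(actual_count ~ pre_count, data = fake_geo)

t_value   <- summary(fit)$coefficients["pre_count", "t value"]
f_value   <- anova(fit)[1, "F value"]
pre_value <- summary(fit)$r.squared   # equals PRE relative to the empty model
df_error  <- df.residual(fit)         # n - 2

# With one predictor, F is the square of t, and PRE = F / (F + df Error)
all.equal(t_value^2, f_value)
all.equal(pre_value, f_value / (f_value + df_error))

# Two tails of the t-distribution match the upper tail of the F-distribution
p_from_t <- 2 * pt(-abs(t_value), df = df_error)
p_from_f <- pf(f_value, df1 = 1, df2 = df_error, lower.tail = FALSE)
all.equal(p_from_t, p_from_f)
```

Because squaring folds both tails of the $b_1$ (or t) distribution into the upper tail of F, "extreme in either direction" for $b_1$ becomes "large" for F.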

## 4 Practice What You Learned: Using a Sampling Distribution of PRE

In this section, we will discover another example of the same inferential logic. Use what you learned from generating sampling distributions of $b_1$ and F in order to generate a sampling distribution of PRE.


<div class="teacher-note">

<b>Teacher Note:</b>

After following along and filling in their guided notes, this is a section where students can write code and explanations on their own. You may want to have students go to **Kernel → Restart & Run All** to ensure the notebook has run all the code cells above. 
    
</div>

### 4.1 The function `pre()` calculates PRE: the proportion of error from the simpler empty model that is reduced by the more complex `pre_count` model.

Try running the code below to verify that it produces our sample PRE.

Then modify it to generate a shuffled PRE from the empty-model DGP.

In [None]:
# run then modify 
pre(actual_count ~ pre_count, data = know_geo)

# COMPLETE
pre(shuffle(actual_count) ~ pre_count, data = know_geo)

### 4.2 Make a Prediction: Given what you know about PRE, do you predict its sampling distribution will look more like the t-distribution or the F-distribution?

What values of PRE will be considered “unlikely”?

### 4.3 Below is code that generates a sampling distribution of 1000 shuffled Fs. Modify it to generate a sampling distribution of 1000 shuffled PREs instead.

Run `head()` on your new data frame to check your results (you may want to rename it).

In [None]:
# Note this code takes a little time to run (like 15 seconds)
# modify this
sdof <- do(1000) * f(shuffle(actual_count) ~ pre_count, data = know_geo)

# Prints out the head of sdof
head(sdof)

# COMPLETE
sdopre <- do(1000) * pre(shuffle(actual_count) ~ pre_count, data = know_geo)

# Prints out the head of sdopre
head(sdopre)

### 4.4 Below is code to visualize the sampling distribution of F as a histogram and place the sample F on it. Modify it to visualize the sampling distribution of PRE and place the sample PRE on it.

The fill will show alpha = 0.05.

In [None]:
# modify this
sample_f <- f(actual_count ~ pre_count, data = know_geo)

gf_histogram(~f, fill = ~upper(f, .05), data = sdof) %>%
  gf_point(0 ~ sample_f, size = 4)

# COMPLETE
sample_pre <- pre(actual_count ~ pre_count, data = know_geo)

gf_histogram(~pre, fill = ~upper(pre, .05), data = sdopre) %>%
  gf_point(0 ~ sample_pre, size = 4)

### 4.5 If we had evaluated the empty model of the DGP only using the sampling distribution of PRE, what would we have decided: to reject or not reject it? Explain why.
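
As a possible follow-up, the base-R sketch below computes PRE and F from the same set of shuffles and compares the resulting p-values. It is an illustration under assumptions: `fake_geo` is simulated data (not the class data), `sample()` stands in for `shuffle()`, and `pre_and_f()` is a helper invented for this example.

```r
# Hypothetical data: 22 simulated students (NOT the real know_geo data)
set.seed(66)
fake_geo <- data.frame(pre_count = round(runif(22, 5, 30)))
fake_geo$actual_count <- round(5 + 0.5 * fake_geo$pre_count + rnorm(22, sd = 4))

# PRE and F from one fitted model
pre_and_f <- function(y, x) {
  fit <- lm(y ~ x)
  c(pre = summary(fit)$r.squared, f = anova(fit)[1, "F value"])
}

observed <- pre_and_f(fake_geo$actual_count, fake_geo$pre_count)

# Each column holds the PRE and F from one shuffle of the outcome
shuffled <- replicate(1000, pre_and_f(sample(fake_geo$actual_count), fake_geo$pre_count))

range(shuffled["pre", ])   # PRE stays between 0 and 1

# Proportion of shuffled statistics as extreme or more extreme than the observed one
p_from_pre <- mean(shuffled["pre", ] >= observed[["pre"]])
p_from_f   <- mean(shuffled["f", ]   >= observed[["f"]])
c(p_from_pre = p_from_pre, p_from_f = p_from_f)   # the two should match
```

Because PRE increases whenever F increases, a shuffle that beats the sample F also beats the sample PRE, so the two sampling distributions lead to the same decision about the empty model.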