# Samples, Populations, and the Data Generating Process (DGP) (COMPLETE)

## Chapter 3.10-3.13 Overview Notebook

In [None]:
# run this cell to prepare the notebook
library(coursekata)
library(gridExtra)
set.seed(127)

# Shift all to have mean of ~100 and sd of ~15
target_mean <- 100
target_sd <- 15  

# Create original distributions
pop_a <- round(scale(rnorm(10000)) * target_sd + target_mean, 2)
pop_b <- round(scale(runif(10000)) * target_sd + target_mean, 2)
pop_c <- round(scale(rexp(10000, rate = 0.5)) * target_sd + target_mean, 2)

# make small samples
sample_1 <- c(98.29, 112.22, 92.77, 99.98, 100.92, 87.78, 91.44, 93.65, 114.99, 94.63, 
              101.47, 93.08, 95.05, 107.87, 92.09, 110.47, 100.29, 87.04, 100.66, 107.51)
sample_2 <- c(76.75, 76.43, 89.17, 79.25, 120.06, 76.52, 99.77, 116.69, 116.65, 115.28,
              111.7, 82.73, 101.17, 96.46, 125.63, 102.15, 120.95, 80.42, 78.64, 115.05)
sample_3 <- c(88.74, 75.61, 101.69, 99.39, 102.97, 86.36, 73.43, 95.07, 124.34, 110.32,
              87.97, 95.34, 97.29, 103.26, 71.73, 98.23, 122.74, 119.55, 105.15, 101.61)
sample_4 <- c(124,85.38,75.42,109.07,97.55,88.39,99.52,82.36,119.56,76.63,105.46,
              124.19,78.57,81.23,117.28,91.89,96.73,79.44,81.81,79.81)
sample_5 <- c(99.9,94.38,125.99,104.51,90.56,85.38,106.49,110.25,79.12,118.87,74.22,
              79.52,98.81,103.88,115.93,81.72,65.24,93.42,75.89,104.85)
sample_6 <- c(87.46,90.14,96.76,98.11,113.98,153.06,94.21,95.06,96.3,105.4,86.5,113.37,
              92.93,122.56,90.81,96.86,105.15,138.6,100.11,92.77)

# set styles
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles_v2.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))

<div class="teacher-note">
<b>Section Goals:</b> In this section students are introduced to the idea that we look at data in order to understand the Data Generating Process (DGP), the hidden processes in the world that generate the patterns we see in a sample of data. Students should understand that:
<ul>
<li>The DGP includes both the processes that generate the distribution in a population over time and the processes by which researchers sample from that population.
<li>Simulations of a DGP using R are a useful tool for understanding the various types of sample distributions that might arise from the same DGP when samples are randomly selected for study. 
<li>Features of a sample (e.g., shape, center, spread, or "weird things") may differ from the features of the population it comes from; larger samples are more likely to resemble the population from which they are drawn.
<li>Any sample of data they look at is one of many possible samples that could have come from the same DGP.
</ul>
    A <a href="https://docs.google.com/document/d/1xT0wDbnwwwiho1MwWAaX7rg4RS3sX-8mfwV_qCGeYs0/edit?tab=t.5y2a0ykmi2fk" target="_blank">printable student guided-notes worksheet</a> is available to go with this Jupyter notebook, as well as a student version of this notebook.
</div>

## 1 Defining Samples, Populations, and the DGP

When we collect a sample of data, we usually are interested not in the sample itself but in the population from which it is drawn. In fact, we usually are interested in the Data Generating Process (or DGP) that resulted in both the population and the sample. In this notebook we are going to sort through these ideas, and hopefully help you see why they are important.

### 1.1 Some definitions

**Sample:** The cases that got selected and measured and so ended up in our data. (Real and known)

**Population:** All the cases in the world from which we sampled, and that we want to understand. (Real but unknown; we want to know it!)

**Data Generating Process (DGP):** The hidden processes in the world that generate the distribution of a variable in a population and then the sample. (Real but unknown; we want to know it!)

<div class="guided-notes">

### 1.2 Label each row in the guided notes table as sample, population, or DGP. For each, also label it as *known* or *unknown*.
    
</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/3.10-3.13-dgp-pop-sample.jpg" alt="DGP is the cause of variation in the population and the population is the source of variation in the sample">

### 1.3 Explanation of how sample, population, and DGP relate
    
The Data Generating Process (DGP) includes all the real-world processes that produce variation on a variable. When the DGP runs for a long time, it generates a population (the set of all cases and the variation among them at a given time). The population isn’t ever *done* being generated. It may change slightly from day to day, but as long as the same DGP is operating, the type of variation we see in the population stays similar over time.

From the population, a researcher draws a sample. The variation we see in the sample comes from the variation in the population, which came from the DGP.

<div class="teacher-note">
<b>Teacher Note:</b>

Spend some time explaining how these concepts fit together so that students can then write down an explanation in their own words. (This is a hard concept for students; all we want here is for them to begin thinking about it.) Spend time unpacking the figure above.

- **Sample:** The sample is real; it's generally all we have to go on as we try to understand the population from which it was drawn.

- **Population:** The population is real, but generally unknown. You can't go out and analyze the population like you can a sample.

- **DGP:** A concept that goes deeper than population—it refers to the *processes* in the world that generate variation. These processes are real but hidden. We want students to begin thinking in terms of *what causes the patterns* they see in the population as well as the sample.
    
</div>


<div class="guided-notes">

### 1.4 Write a brief explanation - in your own words - of the relationship between DGP, population, and sample.
    
</div>

<div class="discussion-question">

<h3>1.5 Key Discussion Questions: One way to think about the relationship of samples to populations is by analogy to children and parents. How well does this analogy apply when thinking about samples and populations?</h3> 

- Do children always look like their parents? 
- Can you tell exactly what a child’s parent looks like just by seeing the child? 
- Can you always predict what a parent’s children will look like?

Use your answers to these questions to reflect on the relationship between samples and populations.
</div> 

<div class="teacher-note">
    
<b>Teacher Note:</b> The goal of this discussion is to get students thinking about uncertainty in both directions—from sample to population and vice versa.
Use the parent-child analogy to draw out ideas like:

- Children often resemble their parents, but not always—just like samples often resemble their populations, but not always.

- It’s hard to know exactly what a population really looks like just from a small sample—just like it’s hard to reconstruct a particular parent just from looking at one child.

- Usually you can guess what a parent's children will look like, but sometimes you see children who look pretty different, just as samples usually resemble the population they came from but sometimes look pretty different.
    
</div>

## 2 From Samples to Populations: The Problem of Inference

A key challenge for data analysts is the problem of inference: What can we infer about the population distribution by looking at a sample distribution? What kind of population might have produced this particular sample?

Take, for example, the sample distribution on the right. This sample consists of 20 observations sampled from one of the three population distributions on the left. Which one? How can you tell?

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/3.10-3.13-which-pop-sample.jpg" alt="When you look at a sample, we wonder what population was the source of it?" width = 80%>

### 2.1 The 6 samples we will show you came from one of these three population distributions (also shown in your guided notes)
Each of these population distributions consists of 10,000 observations. 

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/3.10-3.13-three-pop.jpg" alt="population A is normal; population B is uniform; population C is skewed" width = 100%>

<div class="guided-notes">

### 2.2 Label each population distribution based on its shape

</div>

<div class="teacher-note">
    
<b>Teacher Note:</b> This is a good opportunity for students to review some shapes of distributions: normal, uniform, and skewed to the right. 
    
(Students who are familiar with exponential functions may be interested to know that the skewed distribution on the right is called "exponential" in shape. It’s called exponential because the probability drops off exponentially as the values get larger.)
    
</div>

### 2.3 Activity instructions: Which population was the source of the sample?

We'll show you 6 sample distributions, one at a time. Each sample consists of 20 observations randomly sampled from one of the three population distributions. For each sample, vote on which of the three population distributions you think it came from.

<div class="teacher-note">
<b>Teacher Note:</b> Run the code in the next cell to see the first of six sample distributions. For each question:

- Display the sample by running the provided code.
- Ask students to raise their hands to vote for population A, B, or C as the source of each sample.
- Count the number of students selecting each option.
- Edit the title of the plot in R to reflect the vote counts (this makes it visible for discussion).
- Note that we are "saving" the plot using the assignment operator, <code><-</code>, because later we will look at all of them together.
  
**Example:**
If 20 students vote A, 10 vote B, and 2 vote C, update the title like this:
    
<code>plot1 <- gf_histogram(~sample_1) %>%
    gf_labs(title = "Sample 1: Pop A=20, B=10, C=2")</code>

</div>

In [None]:
# run this code
# then update the title with the vote count and run again
plot1 <- gf_histogram(~sample_1) %>%
  gf_labs(title = "Sample 1: Pop A=?, B=?, C=?") 

plot1

<div class="guided-notes">

### 2.4 Record the vote for sample_1 
</div>

In [None]:
# run this code
# then update the title with the vote count
plot2 <- gf_histogram(~sample_2) %>%
  gf_labs(title = "Sample 2: Pop A=?, B=?, C=?") 

plot2

<div class="guided-notes">

### 2.5 Record the vote for sample_2 
</div>

In [None]:
# run and update
plot3 <- gf_histogram(~sample_3) %>%
  gf_labs(title = "Sample 3: Pop A=?, B=?, C=?") 

plot3

<div class="guided-notes">

### 2.6 Record the vote for sample_3 
    
</div>

In [None]:
# run and update
plot4 <- gf_histogram(~sample_4) %>%
  gf_labs(title = "Sample 4: Pop A=?, B=?, C=?") 

plot4

<div class="guided-notes">

### 2.7 Record the vote for sample_4 

</div>

In [None]:
# run and update
plot5 <- gf_histogram(~sample_5) %>%
  gf_labs(title = "Sample 5: Pop A=?, B=?, C=?") 

plot5

<div class="guided-notes">

### 2.8 Record the vote for sample_5 
</div>

In [None]:
# run and update
plot6 <- gf_histogram(~sample_6) %>%
  gf_labs(title = "Sample 6: Pop A=?, B=?, C=?") 

plot6

<div class="guided-notes">

### 2.9 Record the vote for sample_6 
</div>

<div class="discussion-question">

<h3>2.10 Key Discussion Questions: Review the vote totals for each sample. Which samples had less agreement? Why do you think this is?</h3>

</div>


In [None]:
# run this code to look at all 6 plots at once
# resizes the plot
options(repr.plot.width = 6*2.5, repr.plot.height = 4*2)

# makes a grid of our 6 plots
grid.arrange(plot1, plot2, plot3, 
             plot4, plot5, plot6, nrow=2)

# resizes plots back to default
options(repr.plot.width = 6, repr.plot.height = 4)

<div class="discussion-question">

<h3>2.11 Key Discussion Questions: Where did we get the population wrong?</h3>

Take a look at the figure below which shows you which sample came from which population. You may want to circle the correct population in your guided notes to help compare with our original guesses.

- For which samples did we misidentify the population?
- What about the sample led us astray?
- What does this activity show about trying to infer a population from a small sample (n=20)?

</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/3.10-3.13-pop-answers2.jpg" alt="sample 1 is from pop C, sample 2 is from pop B, sample 3 is from pop A, sample 4 is from pop X, sample 5 is from pop X, sample 6 is from pop X">

<div class="teacher-note">
<b>Teacher Note:</b>

- Reassure students that it’s okay if their guesses didn’t match the actual populations. This activity is meant to demonstrate how difficult inference is in the real world of data analysis. Samples, especially small samples, often don't look like the populations from which they were drawn.
- Highlight the limitations of small samples: Even using reasonable logic (matching the rough shape of the sample to the population), they can get tripped up because small samples can still look very different from the populations they came from.

</div>


<div class="guided-notes">

### 2.12 If we see a small sample of data with a distribution that appears skewed to the right, what can we assume about the population? 

Other than the shape of the population distribution, what else might explain the skewed shape of the sample?
    
</div>

<div class="teacher-note">

<b>Sample Responses:</b>

- Can’t assume the population is skewed. The skew could be due to random sampling.
- The population might be skewed, but it also might not be.
- Small samples vary a lot, so even a skewed-looking sample could come from a non-skewed population. 

</div>


## 3 From Populations to Samples: the Power of Simulation

As we’ve seen, it’s hard to draw conclusions about a population from a single small sample—especially in real life data analysis where we can *never* directly study the population. That’s where simulation comes in.

Using R, we can make assumptions about a population and simulate multiple samples from it. By examining the many “children” produced by a single “parent,” we can explore whether our actual sample could plausibly have come from that population—or if it seems unlikely.

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/3.10-3.13-pop-sim-samples.jpg" alt="We can simulate random samples from a population" width = 75%>
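Here is a minimal sketch of that idea in base R. It assumes a hypothetical normal population (`assumed_pop`, with a mean of 100 and a standard deviation of 15) and uses a rough, made-up skew measure (`skew_stat`, the gap between mean and median in standard deviation units). Neither name comes from the lesson; they are only there to show how many simulated "children" of one "parent" can be compared.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in for a population we ASSUME is normal
assumed_pop <- rnorm(10000, mean = 100, sd = 15)

# Rough skew measure (an assumption for illustration):
# positive values suggest a longer right tail, negative a longer left tail
skew_stat <- function(x) (mean(x) - median(x)) / sd(x)

# Simulate 1000 random samples of n = 20 and compute the measure for each
sim_skews <- replicate(1000, skew_stat(sample(assumed_pop, 20)))

# How much does the measure bounce around even though the parent is symmetric?
range(sim_skews)

# Proportion of simulated samples that look at least mildly right-skewed
mean(sim_skews > 0.2)
```

If a real sample's skew measure falls well inside the range of the simulated values, a normal parent remains plausible; if it falls far outside, a normal parent seems less likely. The cutoff of 0.2 is arbitrary.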

### 3.1 Edit the code to view `pop_a` in a histogram.

Each of the populations you saw earlier (normal, uniform, and skewed) is stored in a vector of 10,000 values: `pop_a`, `pop_b`, and `pop_c`.

Let’s start by visualizing Population A.

In [None]:
# edit this code
#gf_histogram(~vector, fill = "purple")

# sample response
gf_histogram(~pop_a, fill = "purple")

<div class="discussion-question">
    <h3>3.2 Key Discussion Question: How would you describe the distribution of <code>pop_a</code>?</h3> Think about shape, center, spread, and weird things.
</div> 

### 3.3 Write code to simulate a random sample of n=20 from `pop_a`

We will use the `sample()` function in R: <code>sample(&lt;vector&gt;, &lt;sample size&gt;)</code>. This function takes a random sample of a specified sample size from a population saved in a vector.
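To see the general pattern on something small first, here is a sketch that uses a hypothetical toy population of 10 values (`toy_pop` is made up for illustration and is not one of the lesson's vectors). By default, `sample()` draws without replacement, so no case is picked twice within one sample.

```r
set.seed(7)  # arbitrary seed so the sketch is reproducible

# Hypothetical toy population of 10 values
toy_pop <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

# Draw two separate random samples of 3 values each
draw_1 <- sample(toy_pop, 3)
draw_2 <- sample(toy_pop, 3)

draw_1
draw_2
```

Each call to `sample()` makes a fresh random draw, so `draw_1` and `draw_2` will usually contain different values.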


In [None]:
# write code

#sample response
sample(pop_a, 20)


<div class="teacher-note">
<b>Teacher Note:</b> Try to help the students go from the general structure <code>sample(&lt;vector&gt;, &lt;sample size&gt;)</code> to filling in the name of the vector <code>pop_a</code> and the desired sample size <code>20</code>. This will help them improve their ability to read R documentation. Make sure students notice that the result of the <code>sample()</code> function is a vector of 20 numbers sampled from the <code>pop_a</code> vector.
</div>

<div class="guided-notes">

### 3.4 Write R code to make a histogram of the sample of 20 from `pop_a`

Now take the sample you created and put it inside the `gf_histogram()` function to graph the distribution of your sample.
    
</div>

In [None]:
# revise this code
# sample(pop_a, 20)
gf_histogram(~ sample(pop_a, 20))

### 3.5 What do you predict will happen if we run the code above again? 

Will we get the same distribution? Why or why not?


<div class="guided-notes">
    <h3>3.6 Go ahead and run that code again and again. What sample shapes do you observe?</h3> The image below contains the population and some examples of simulated samples to help with discussion.
</div> 

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/3.10-3.13-pop-a-and-11-samples.jpg" width="100%">

<div class="guided-notes">

### 3.7 If we assume that a population is normally distributed, what can we predict about random samples from it? 
What do you think causes some samples to look different from the population?
    
</div>

<div class="teacher-note">

<b>Sample Responses:</b>

- Some samples will look roughly normal, but not all—especially if the sample is small.
- Some samples might look skewed or uneven just because of randomness.
- Random sampling causes some samples to look different from the population.

</div>


## 4 How Does Sample Size Affect Shape of the Sample Distribution?
We have seen that small samples may not represent the distribution of the population from which they are sampled. But how small is small? How big would a sample need to be for you to be confident that you can tell the shape of the population distribution?

### 4.1 Let's try different sample sizes simulated from the normal population (`pop_a`)
In the code cell below, try samples of 20, 50, and 500. Run each sample size multiple times and see what you notice.

In [None]:
# change the sample size, run multiple times, look at the result
# try samples of 20, 50, and 500
gf_histogram(~ sample(pop_a, 20))


<div class="discussion-question">
    <h3>4.2 What sample sizes seem to better represent the population distribution?</h3> What do you notice about multiple samples of 20 compared to multiple samples of 500? The image below contains a row of n=20, n=50, and n=500 samples to help with discussion.
</div> 

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/3.10-3.13-pop-a-20-50-500.jpg" width="100%">

<div class="teacher-note">
<b>Teacher Note:</b> We want students to notice that samples of n=20 vary a lot in shape from one run to the next, whereas samples of n=500 all seem to represent the normal shape of the parent population well.
</div>
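Shape is hard to boil down to a single number, so the sketch below tracks the sample mean as a stand-in, to put a rough number on how much samples of each size disagree with one another. It assumes a hypothetical normal population (`demo_pop`) and a helper (`sample_means`); both names are made up for illustration.

```r
set.seed(2024)  # arbitrary seed so the sketch is reproducible

# Hypothetical normal population, similar in spirit to pop_a
demo_pop <- rnorm(10000, mean = 100, sd = 15)

# For a given sample size n, draw many random samples and record each mean
sample_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(demo_pop, n)))
}

# Standard deviation of the sample means for each sample size:
# smaller values mean the samples agree more closely with one another
spread_by_n <- sapply(c(20, 50, 500), function(n) sd(sample_means(n)))
names(spread_by_n) <- c("n=20", "n=50", "n=500")
spread_by_n
```

The spread of the sample means shrinks as the sample size grows, which matches the pattern in the histograms: larger samples are more consistent from one draw to the next.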

## 5 Practice What You've Learned

So far we've been sampling from `pop_a`, which is normal in shape. Try taking random samples of different sizes from `pop_b` (uniform in shape) and `pop_c` (exponential in shape, skewed to the right). 

For each population, make a histogram of the parent population, then histograms of random samples of different sizes. Try drawing small samples (n = 20), medium samples (n=50), and large samples (n = 500) from each population and compare their shapes to the shape of the population.

### 5.1 Look at `pop_b` and its samples (n=20, 50, 500)

In [None]:
# edit this
gf_histogram(~ pop_b)

# sample code
gf_histogram(~ sample(pop_b,20))
gf_histogram(~ sample(pop_b,50))
gf_histogram(~ sample(pop_b,500))

<div class="guided-notes">
    <h3>5.2 Write down the code you tested and describe the shapes of the sample distributions</h3>
</div> 

### 5.3 Look at `pop_c` and its samples (n=20, 50, 500)

In [None]:
# edit this
gf_histogram(~ pop_c)

# sample code
gf_histogram(~ sample(pop_c,20))
gf_histogram(~ sample(pop_c,50))
gf_histogram(~ sample(pop_c,500))

<div class="guided-notes">
    <h3>5.4 Write down the code you tested and describe the shapes of the sample distributions.</h3>
</div> 

<div class="teacher-note">
<b>Teacher Note:</b> 

Section 6 is an optional section—best for students (or instructors) who are curious about how we created the population distributions used earlier in the lesson.

It’s also an opportunity to introduce a subtle but important distinction:  
- A Data Generating Process (DGP) refers to the real-world **causes** of variation and can include everything from genes, psychological causes, and social mechanisms to sampling methodology.  
- A simulation is a *model of a DGP*. It helps us study patterns of variation, but it’s not the real thing.

This may be a challenging concept, especially because students are still getting comfortable with what a DGP is. If it feels like too big a leap, you can skip this section for now or assign it as a separate bonus notebook.

</div>


## 6 Extra: Where Did the Populations We Sampled From Come From?

We used three populations of 10,000 in this lesson: `pop_a`, `pop_b`, and `pop_c`. Each was stored in its own vector. We learned how to randomly sample from these populations using the `sample()` function. But how did we get these 10,000 data points to begin with?

The answer is: we simulated them. They are not from “real” populations. Instead, we used functions in R to generate data that come from different population shapes (normal, uniform, exponential). These simulations are not the real-world DGPs. We use simulation as a technique to *model* how a population with a certain shape might be generated.

For example, to model a normal population, we used the function `rnorm()`, which generates values from a theoretical normal distribution with particular features (the default features are a mean of 0 and a standard deviation of 1).
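The sketch below follows the same recipe as this notebook's setup cell: generate raw values with `rnorm()`, `runif()`, or `rexp()`, standardize them with `scale()`, then stretch and shift them to a mean of 100 and a standard deviation of 15. The helper name `rescale` and the list `demo_pops` are made up for illustration, and the setup cell's rounding step is left out.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Raw values from each generating function
raw_normal  <- rnorm(10000)             # defaults: mean 0, sd 1
raw_uniform <- runif(10000)             # defaults: between 0 and 1
raw_skewed  <- rexp(10000, rate = 0.5)  # same rate as the setup cell

# Hypothetical helper: standardize, then stretch and shift
rescale <- function(x, target_mean = 100, target_sd = 15) {
  as.numeric(scale(x)) * target_sd + target_mean
}

demo_pops <- list(
  normal  = rescale(raw_normal),
  uniform = rescale(raw_uniform),
  skewed  = rescale(raw_skewed)
)

# All three now share the same center and spread but keep their own shapes
sapply(demo_pops, mean)
sapply(demo_pops, sd)
```

Rescaling changes only the center and spread of each distribution, not its shape, which is why the three populations could be compared side by side on the same scale.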

### 6.1 Try generating random samples from a normal distribution using the <code>rnorm(&lt;sample size&gt;)</code> function. Try generating samples that are small (like 10 or 20) or very large (like 500 or 1000).

The code below generates 10 random values from a normal distribution and graphs them in a histogram. 

In [None]:
# edit this code
gf_histogram(~ rnorm(10))

### 6.2 Try generating samples from a uniform distribution using <code>runif(&lt;sample size&gt;)</code>.  Try generating samples that are small (like 10 or 20) or very large (like 500 or 1000).
The code below generates 10 values from a uniform distribution, where all values in a range are equally likely.

In [None]:
# edit this code
gf_histogram(~ runif(10))

### 6.3 Try generating samples from an exponential (skewed) distribution using <code>rexp(&lt;sample size&gt;)</code>.  Try generating samples that are small (like 10 or 20) or very large (like 500 or 1000).

The code below generates 10 values from an exponential (skewed) distribution.

In [None]:
# edit this code
gf_histogram(~ rexp(10))