# Using the Distribution to Make Predictions (COMPLETE)

## Chapter 6.7-6.11 Overview Notebook

In [None]:
# run this to set up the notebook
library(coursekata)
library(gridExtra)

# set styles
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles_v2.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))


<div class="teacher-note">
    
<b>Section Goals:</b> In this section students are introduced to the idea of using a distribution of data as a probability distribution in order to make predictions about future observations, as well as the concept of modeling the distribution with the normal probability distribution.

- Students will learn how to use the distribution of data as a probability distribution to make predictions about the probability of future observations falling within a certain range.
- Students should understand that a distribution used in this way may be misleading if the sample size is small, and that we may get better predictions by modeling the shape of the distribution with a theoretical distribution, in this case the normal distribution.
- Students will understand that the underlying distribution of many variables is well modeled by the normal curve, and that especially variables that are sums of other variables will tend to have a normal shape.
- Students will learn to use the xpnorm() function in R to calculate the probability of a specific range of scores on an outcome variable using the normal distribution.
    
</div>

## 1 A Quick Intro to Pokemon Cards (and HP)

<img align = "right" src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/6.7-6.11-overview-pikachu-card.jpg" alt="example pikachu card; 60 HP" width="250" style="padding: 20px;">Pokemon is a popular card game where each card represents a "pocket monster" (or Pokemon). There are many different Pokemon, each with different abilities and stats. You “battle” by choosing different Pokemon to go up against opponents. Knowing how powerful your opponent’s Pokemon is could give you an edge.

One of the simplest measures of a Pokemon's power is **Hit Points (HP)**, the amount of damage it can take before it faints.  

- Higher HP generally means a tougher, more powerful Pokemon.  
- For example, a basic Pikachu might have **60 HP**, as noted in the upper right corner of the card.

> **Note for Pokemon experts in the room:** Even though the game is more complicated than just comparing HP, we’re going to simplify and treat HP as a “measure of power” to help everyone (even those who have never played Pokemon) get on the same page.

If you could **predict your opponent’s Pokemon’s HP**, that would give you some sense of how strong they might be in a battle. This makes HP a perfect example for thinking about distributions, probabilities, and predictions.


### 1.1 Step into the Shoes of a Game Maker

As of 2025, there are 1025 different Pokemon, and their number of Hit Points can range from 1 into the hundreds. Imagine a Pokemon game designer who has to decide how many Hit Points (HP) each Pokemon should have. Here are some of the things they must consider:

- If every Pokemon had super high HP, battles wouldn’t be very interesting.  
- If every Pokemon had really low HP, battles *also* wouldn’t be very interesting.  
- If HP values were spread evenly across the whole range, the difference in hit points between two opponents would be large, making too many battles wildly one-sided.

To keep the game fair and fun, game makers usually design HP (and other features) to follow a **bell-shaped distribution**: Most Pokemon have HP clustered around the average, while only a few are extremely weak or extremely powerful. 
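A quick simulation can illustrate this design argument. The sketch below compares the typical HP gap between two randomly matched opponents under two hypothetical designs: HP spread evenly between 0 and 150 (an assumed range, chosen only for illustration) and bell-shaped HP with a mean of 75 and a standard deviation of 25. The variable names, the seed, and the number of simulated battles are also assumptions.

```r
# Sketch: typical HP gap between two random opponents under two designs
set.seed(1)            # arbitrary seed so the comparison is repeatable
n_battles <- 10000     # number of simulated battles (assumed)

# Design A (hypothetical): HP spread evenly from 0 to 150
gap_even <- abs(runif(n_battles, 0, 150) - runif(n_battles, 0, 150))

# Design B: bell-shaped HP with mean 75 and sd 25
gap_bell <- abs(rnorm(n_battles, 75, 25) - rnorm(n_battles, 75, 25))

# average gap between opponents under each design
mean_gap_even <- mean(gap_even)
mean_gap_bell <- mean(gap_bell)
c(even = mean_gap_even, bell = mean_gap_bell)
```

If the argument holds, the average gap should come out noticeably larger under the evenly spread design, meaning more one-sided battles.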

### 1.2 Simulating a distribution of HP

Let's consider a game maker who decides that the mean HP of all Pokemon should be 75 HP with a standard deviation of 25. Run the code below to generate a bell-shaped (normal) distribution of HP for 1000 Pokemon, save them in a vector called `hp`, and then make a histogram to show the resulting distribution.

In [None]:
# this is how a game maker might generate a large sample distribution of HP
set.seed(82)
hp <- rnorm(1000, 75, 25)

# here's a histogram of the generated HP
gf_histogram(~hp)

<div class="discussion-question">
    
### 1.3 Discussion Question: How does a bell-shaped distribution make the game more fair or more fun?  
    
</div>

<div class="teacher-note">

**Sample Responses:**

- Most Pokemon are average, so battles are usually close. (You'll need more strategy!)
- It can be boring if the gap in HP between opponents is too large.
- The game usually feels fair but every once in a while you get a rare strong (or weak) card.
</div>


## 2 From Proportions to Probabilities

### 2.1 Where we are going in this section

The game maker generated a sample distribution of 1000 Pokemon cards with a variety of HP and saved it in a vector called `hp`. We can use this distribution to calculate probabilities and make predictions about what future cards might be like.  

We’ll explore this idea with three questions. The questions are similar, but not identical. The first question is most concrete, with each of the next two questions a bit more abstract: 

1. **Proportion:** What *proportion* of cards have HP > 75?  
2. **Probability:** Imagine that each card from `hp` is mixed up in a giant hat. What is the *probability* of drawing one with HP > 75?  
3. **Future prediction:** Imagine getting a new Pokemon card *in the future*. What is the probability it will have HP > 75?  

Let’s get started with the first, most concrete question.

### 2.2 Run the cell below to get (again) the histogram of `hp` that was generated above

In [None]:
# make a histogram of the simulated HP
gf_histogram(~hp) 

<div class="discussion-question">
    
### 2.3 Discussion Question: Just based on looking at the histogram, what would you guess is the proportion of cards with HP > 75? Explain your thinking.  
    
</div>

<div class="guided-notes">

### 2.4 Write R code to calculate the exact proportion of cards in the `hp` vector with HP > 75
    
</div>

In [None]:
# code here
tally(hp > 75, format = "proportion")

# challenge question: calculate the proportion of cards with HP > 50 and < 100
tally(hp > 50 & hp < 100, format = "proportion")

<div class="teacher-note">

**Teacher Note:**
We want students to have the intuition that as long as this is a symmetrical distribution, approximately 50% of the observations will be above the mean. But it may not be exactly 50%; it will depend on the exact HPs in the distribution. In this case, however, it is 50%.
    
</div>

### 2.5 From proportions to probabilities

Proportions are very concrete. For example, in the game maker's sample distribution, 0.50 of the HP values are greater than the mean of 75. That’s a fact we can check by tallying them up.

But now let's move to the world of probability. Imagine writing each of the 1,000 HP values on a piece of paper and mixing them all up in a giant hat. You close your eyes and randomly draw one HP. What is the probability that the card you pick will have HP > 75?

<div class="discussion-question">
    
### 2.6 Discussion Question: What is the *probability* that the single HP you pick out of the hat will be greater than the mean of 75? Explain your thinking.
    
</div>

<div class="teacher-note">
    
**Sample Responses:**
- Same as proportion: 50% or 0.50
- Although the answer is the same, the thinking is different. If there are 1000 HPs in the hat, each one has a 1/1000 chance of being picked. Because 500 of the cards are > 75, there is a 500/1000 chance you would pick one of those cards.
- Even though the probability is 0.50, you still could pick a card that is < 75.
   
</div>

### 2.7 From probability to prediction: When the exact probability is unknown

In the previous example, we knew the exact probability of picking a card with HP > 75 out of the hat. The proportion of cards with HP > 75 was exactly 0.50 (500 out of the 1000 cards). Each card in the hat had an equal chance of being drawn, 1/1000. Because 500 of the 1000 cards were > 75, there was a 500/1000 chance of selecting one of those cards.
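The hat idea can also be tried out in code. The sketch below is a minimal simulation, not part of the original activity: it recreates `hp` with the same call used earlier in this notebook, then draws one value from the "hat" many times (putting it back each time) and checks how often the draw is above 75. The number of repetitions and the names `draws`, `prop_above`, and `sim_prob` are arbitrary choices.

```r
# Sketch: drawing one HP out of the hat, many times over
set.seed(82)
hp <- rnorm(1000, 75, 25)   # same call the notebook uses to create hp

# proportion of the 1000 HP values that are greater than 75
prop_above <- mean(hp > 75)

# draw one value from the hat, put it back, and repeat 10,000 times
# (the number of repetitions is an arbitrary choice)
draws <- replicate(10000, hp[sample.int(length(hp), 1)])

# share of draws that came out above 75
sim_prob <- mean(draws > 75)
c(proportion = prop_above, simulated = sim_prob)
```

The simulated share should land close to the proportion, though usually not exactly on it, because each run of random draws is a little different.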

But now let’s take another conceptual leap... to cards that are not part of this 1000! Let's say the game maker sent us a new Pokemon card. It is not one of the 1000 we already have, but it comes from the same series. What is the probability that this new card will have an HP > 75? 

<div class="discussion-question"> 
    
### 2.8 Discussion Questions: 
- What is the probability that this new card will have an HP greater than 75? 
- How does this question differ from the previous one? 
- Can you be just as certain of your answer? Why or why not?
- What assumptions are you making when you use a sample distribution to make predictions?

</div>

<div class="teacher-note">

**Sample Responses:**

- It should still be 0.50 assuming the game maker created the card from the same underlying distribution.
- This is a different question from the previous one because we don't know the exact probability; this card is not in the original hat.
- We cannot be as certain of our answer because we are not drawing from a hat with a known distribution. Instead, we are assuming that the distribution of the cards we have can be used to make predictions about cards we do not have.

</div>


<div class="guided-notes">

### 2.9 Each row of the table represents one step in our progression of questions from proportion to probability. Fill in the missing cells in your guided notes.
    
</div>

<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: left; width: 20%">Question</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: left; width: 40%">How to Calculate Answer</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: left; width: 40%">Assumptions</td>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid black; text-align: left;">What proportion of our sample of 1000 HPs are greater than 75?</td>
        <td style="border: 1px solid black; text-align: left;"><code>tally(hp > 75, format = "proportion")</code></td>
      <td style="border: 1px solid black; text-align: left;">None. Proportion > 75 is a fixed feature of the distribution.</td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left;">What is the probability that the single HP you pick out of the hat will be greater than 75?</td>
      <td style="border: 1px solid black; text-align: left;"></td>
      <td style="border: 1px solid black; text-align: left;">Assume each HP has an equal probability of being picked out of the hat (i.e., 1/1000)
</td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left;">What is the probability that a new card you get will have an HP greater than 75?</td>
      <td style="border: 1px solid black; text-align: left;"></td>
      <td style="border: 1px solid black; text-align: left;"></td>
    </tr>
  </tbody>
</table>



<div style="font-size: 18px; line-height: 1.4; border: 2px solid black; padding: 10px;"><b>Important Note:</b> 
    
As soon as we start predicting the probabilities of future observations, we are making an assumption: that the shape, center, and spread of the population distribution (what we get if we run the Data Generating Process for a long time) is the same as in our sample.  

<i><b>The new card is not one of the 1,000 in the sample, but another card presumed to come from the same Data Generating Process (or population). Thus, if we use our sample to calculate the probability of a future event, we are assuming our sample is representative of the population.</b></i>
    
</div>


## 3 Calculating Probabilities When the DGP or Population Distribution Is Not Well Understood
<img align = "right" src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/6.7-6.11-overview-pikachu-card.jpg" alt="example pikachu card; 60 HP" width="250" style="padding: 20px;">If a game maker *designs* a distribution of HP, they know everything about the data-generating process (DGP) (though we may not!). Similarly, when we simulated the distribution of 1000 HPs from a normal distribution, we knew the exact DGP. 

But what about situations where the DGP is **not well known**? This is actually much more common.  

For Pokemon HP, the game maker has not published information about the DGP they use to generate cards. But even if they did, the real-world distribution may not be the same as the DGP that was initially used by the game maker. Why not? Think about the following: 

- Billions of cards are printed each year.  
- Some Pokemon (like Pikachu) are fan favorites, so more of them may be in circulation.  
- Once cards are out in the world, some are lost, damaged, or destroyed.  

Because of factors like these, the actual distribution of HP among cards in the real world is hard to know!

**If we don’t know the population distribution, how can we estimate the probability that my *next* card will have HP lower than Pikachu (60 HP)?** We will refer to this probability as P(HP < 60).


### 3.1 Estimating probabilities when all we have are samples... and sometimes not very large samples!

We usually don't know much about the population distribution. If we did, we would use the population distribution to calculate the probability of a future sample; that would be the most accurate estimate. But usually all we have are samples, and sometimes only small samples. Let's investigate the impact of sample size on probability estimates.

We actually have access to a very large dataset of Pokemon cards and their HPs (not billions, but pretty big). In fact, it is so large (including more than 48,000 cards) that we could assume it is representative of the whole population. But let's pretend we only have access to three samples from this larger dataset: 

- A tiny sample of 10 cards  
- A small-ish sample of 30 cards  
- A much larger sample of 360 cards  

Let's see how well we do with each of these samples estimating **the probability that HP < 60**.   


### 3.2 Run the code below to load the dataset. (It might take a little longer than usual.)  

In [None]:
# run this to get big_poke
big_poke <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRuTLjKZoPCjTTT40T8w9UZ10-wmdVXyjXXYIztLVoYm_yVXH1YPXfuIUYGrYP3c3zyjSja8MeW4KtM/pub?gid=2093797551&single=true&output=csv")

# feel free to take a sneak peek 
# but don't learn too much about it!
head(big_poke)

### 3.3 Write code to take a random sample of 10 cards from `big_poke`

R has a function called `sample()` that lets us simulate taking a random sample from a dataframe.  

Try to write code that takes a random sample of 10 cards from `big_poke`.

(Note that you will get a different sample each time you run the code as the sample is random!)
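For intuition about what sampling rows means, here is a minimal base-R sketch on a small made-up data frame (`toy_cards` and its columns are hypothetical, not part of `big_poke`). It picks 10 row numbers at random and keeps those rows, which is roughly the idea behind calling `sample()` on a data frame in this notebook.

```r
# hypothetical toy data frame standing in for a card dataset
toy_cards <- data.frame(
  card_id = 1:50,
  hp = seq(30, 275, by = 5)   # 50 made-up HP values
)

# pick 10 distinct row numbers at random, then keep those rows
picked_rows <- sample.int(nrow(toy_cards), 10)
toy_sample <- toy_cards[picked_rows, ]
toy_sample
```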

In [None]:
# code here

# COMPLETE
sample(big_poke, 10)


### 3.4 `set.seed()` will give the same random sample each time

When we are working with others (e.g., in class), it’s often helpful if everyone gets the same random sample. To do this, we can use the `set.seed()` function right before running the `sample()` function. As long as you put the same number into the function it will generate the same "random" sample each time you run the code. 

Think of it like setting the “starting point” for randomness. If you use the same seed, you’ll get the same sample every time. In the code below we used the seed 15. (You can choose any number.) Go ahead and run the cell a few times.
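The "starting point" idea can be seen in a tiny example that does not involve any card data. In this sketch (the names `first_draw` and `second_draw` are just for illustration), the same seed is set before each of two draws, and the two draws are then compared.

```r
# Sketch: the same seed reproduces the same "random" draw
set.seed(15)
first_draw <- sample.int(100, 5)    # 5 random numbers between 1 and 100

set.seed(15)                        # reset to the same starting point
second_draw <- sample.int(100, 5)

# compare the two draws element by element
identical(first_draw, second_draw)
```

Removing the second `set.seed(15)` would let the second draw continue from wherever the random number generator left off, so the two draws would typically differ.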

In [None]:
# run this several times
# now we get the same random sample every time
set.seed(15)
sample(big_poke, 10)

<div class="guided-notes">

### 3.5 Write code to simulate and save three random samples 
    
Fill in the second column. Create three random samples of 10, 30, and 360 cards and then save those samples as data frames (named `poke10`, `poke30`, `poke360`, respectively). We have started some code for you.  
    
</div>

<table border="1" style="font-size: 16px; margin-left: 0; border-collapse: collapse; width: 100%">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: left; width: 10%">Sample size</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: left; width: 40%">[3.5] Write code to simulate and save random samples</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: left; width: 40%">[3.6] Write code to estimate P(HP &lt; 60) in the population from that sample</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: left; width: 10%">[3.7] Record Estimated
P(HP &lt; 60)
</td>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid black; text-align: left;">n = 10</td>
      <td style="border: 1px solid black; text-align: left;"><code>poke10 &lt;- </code></td>
      <td style="border: 1px solid black; text-align: left;"></td>
      <td style="border: 1px solid black; text-align: left;"> </td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left;">n = 30</td>
      <td style="border: 1px solid black; text-align: left;"><code>poke30 &lt;- </code></td>
      <td style="border: 1px solid black; text-align: left;"></td>
      <td style="border: 1px solid black; text-align: left;"> </td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left;">n = 360</td>
      <td style="border: 1px solid black; text-align: left;"><code>poke360 &lt;- </code></td>
      <td style="border: 1px solid black; text-align: left;"></td>
      <td style="border: 1px solid black; text-align: left;"> </td>
    </tr>
  </tbody>
</table>


In [None]:
# code here
set.seed(10)

# COMPLETE
set.seed(10)
poke10 <- sample(big_poke, size = 10)
poke30 <- sample(big_poke, size = 30)
poke360 <- sample(big_poke, size = 360)


<div class="guided-notes">

### 3.6 Write code to estimate P(HP < 60) in the population based on each of the three samples 
    
</div>

<table border="1" style="font-size: 16px; margin-left: 0; border-collapse: collapse;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: left;">Sample size</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: left;">Code to simulate and save random samples</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: left;">Code to estimate P(HP &lt; 60) in the population based on each sample</td>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid black; text-align: left;">n = 10</td>
      <td style="border: 1px solid black; text-align: left;"><code>poke10 &lt;- sample(big_poke, 10)</code></td>
      <td style="border: 1px solid black; text-align: left;"></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left;">n = 30</td>
      <td style="border: 1px solid black; text-align: left;"><code>poke30 &lt;- sample(big_poke, 30)</code></td>
      <td style="border: 1px solid black; text-align: left;"></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left;">n = 360</td>
      <td style="border: 1px solid black; text-align: left;"><code>poke360 &lt;- sample(big_poke, 360)</code></td>
      <td style="border: 1px solid black; text-align: left;"></td>
    </tr>
  </tbody>
</table>


In [None]:
# write code


# COMPLETE
tally(~hp < 60, data = poke10, format = "proportion")
tally(~hp < 60, data = poke30, format = "proportion")
tally(~hp < 60, data = poke360, format = "proportion")

<div class="guided-notes">

### 3.7 Record your estimated P(HP < 60) based on each sample in the last column of the table
    
</div>

<table border="1" style="font-size: 16px; margin-left: 0; border-collapse: collapse; width: 100%">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: left; width: 10%">Sample size</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: left; width: 40%">[3.5] Write code to simulate and save random samples</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: left; width: 40%">[3.6] Write code to estimate P(HP &lt; 60) in the population from that sample</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: left; width: 10%">[3.7] Record Estimated
P(HP &lt; 60)
</td>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid black; text-align: left;">n = 10</td>
      <td style="border: 1px solid black; text-align: left;"><code>poke10 &lt;- sample(big_poke, 10)</code></td>
      <td style="border: 1px solid black; text-align: left;"><code>tally(~hp < 60, data = poke10, format = "proportion")</code></td>
      <td style="border: 1px solid black; text-align: left;"> </td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left;">n = 30</td>
      <td style="border: 1px solid black; text-align: left;"><code>poke30 &lt;- sample(big_poke, 30)</code></td>
      <td style="border: 1px solid black; text-align: left;"><code>tally(~hp < 60, data = poke30, format = "proportion")</code></td>
      <td style="border: 1px solid black; text-align: left;"> </td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left;">n = 360</td>
      <td style="border: 1px solid black; text-align: left;"><code>poke360 &lt;- sample(big_poke, 360)</code></td>
      <td style="border: 1px solid black; text-align: left;"><code>tally(~hp < 60, data = poke360, format = "proportion")</code></td>
      <td style="border: 1px solid black; text-align: left;"> </td>
    </tr>
  </tbody>
</table>


### 3.8 Below are the histograms and `favstats` for `hp` in the three samples

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/6.7-6.11-overview-three-samples.jpg" alt="histograms of the three samples (n=10, 30, 360)">

In [None]:
# run this code
favstats(~ hp, data = poke10)
favstats(~ hp, data = poke30)
favstats(~ hp, data = poke360)

<div class="guided-notes">

### 3.9 Describe the shape, center, and spread of each distribution in the table
    
</div>


<table border="1" style="font-size: 16px; margin-left: 0; border-collapse: collapse; width: 100%">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: left;" width = 15%>Sample size</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: left;" width = 55%>Shape</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: left;" width = 15%>Center (mean)</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: left;" width = 15%>Spread (sd)</td>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid black; text-align: left;">n = 10</td>
      <td style="border: 1px solid black; text-align: left;"></td>
      <td style="border: 1px solid black; text-align: left;"></td>
      <td style="border: 1px solid black; text-align: left;"></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left;">n = 30</td>
      <td style="border: 1px solid black;"></td>
      <td style="border: 1px solid black;"></td>
      <td style="border: 1px solid black;"></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left;">n = 360</td>
      <td style="border: 1px solid black;"></td>
      <td style="border: 1px solid black;"></td>
      <td style="border: 1px solid black;"></td>
    </tr>
  </tbody>
</table>

<div class="discussion-question">
    
### 3.10 Discussion Question: These sample distributions look quite different from one another. Did they come from the same population distribution? Answer YES, NO, or NOT SURE, and explain your thinking.
    
</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/6.7-6.11-overview-three-samples.jpg" alt="histograms of the three samples (n=10, 30, 360)">


<div class="teacher-note">

**Sample Responses:**
    
- NO: They look different, so maybe they’re from different populations. <i>(This is a common misconception.)</i>   
- NOT SURE: They look pretty different; maybe that's randomness; but didn't they all come from `big_poke`?
- YES: Even though they look very different, we used `sample()` to pull them all from `big_poke`. They have to be from the same population. We simulated that part.

**Teacher Note:** Even though this may seem obvious to experts, students may be genuinely thrown off by the fact that these look so different from each other and have different means and standard deviations. This is a good moment to reinforce that differences between samples don't necessarily mean they are from different populations. They can just be the result of sampling variation.

</div>


<div class="discussion-question">
    
### 3.11 Discussion Questions: Imagine you are a researcher and everything you know about `hp` in `big_poke` is based on these three samples. 

- How confident would you be in each of the three probability estimates of **P(HP < 60)**?  
- Is there one estimate you trust more than the others? Why?

</div>

<div class="teacher-note">

**Sample Responses:**

- The sample of 10 seems less trustworthy than the sample of 360.
- 360 might be closer to the population's shape, center, spread. 
- None of them are perfect; we don't truly know. But if I had to guess, maybe I would trust the biggest sample most.
    
**Teacher Note:**
Some students may intuit that smaller samples will vary more. You can demonstrate this by running `gf_histogram(~ hp, data = sample(big_poke, 10))` a few times versus the version with 360. They will be able to see that the larger sample is consistently close to normally distributed while the small sample will vary more.

</div>
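The idea that estimates from small samples bounce around more can also be checked with numbers. The sketch below is a rough illustration under stated assumptions: it uses a made-up, normally distributed stand-in population (`pop_hp`, which is NOT the real `big_poke` data) and a hypothetical helper `estimate_once()`. For each sample size it repeats the sampling 1000 times and reports the standard deviation of the resulting estimates of P(HP < 60).

```r
# Sketch: how much does an estimate of P(HP < 60) vary from sample to sample?
set.seed(2025)   # arbitrary seed

# hypothetical stand-in population (NOT the real big_poke data)
pop_hp <- rnorm(48000, mean = 70, sd = 25)

# estimate P(HP < 60) from one random sample of size n
estimate_once <- function(n) {
  mean(pop_hp[sample.int(length(pop_hp), n)] < 60)
}

# repeat 1000 times for each sample size and summarize the spread
sizes <- c(10, 30, 360)
spread <- sapply(sizes, function(n) sd(replicate(1000, estimate_once(n))))
names(spread) <- paste0("n = ", sizes)
spread
```

A smaller value means the estimates from that sample size agree with each other more closely from one sample to the next.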


## 4 Modeling a Distribution with Mathematical Models

In a real research setting all we have is our sample. If we only have a sample of n=30 or n=360, it may look kind of like the ones we sampled above. 

Up to now, we have estimated probabilities in the population by calculating proportions of a sample distribution. But we can also estimate probabilities using mathematical functions. Whereas histograms based on real samples of data can be very jagged and uneven, mathematical models typically are smooth, continuous functions. 

To use a mathematical model we must make some guesses about the shape, center, and spread of the population. In this section we will explore why we might want to do that.

<div class="discussion-question">

### 4.1 Discussion: Based on what we've seen in these samples, what shape do you think the distribution of HP in the population might have? What makes you think that?

</div>

<div class="teacher-note">

**Sample Responses:**
- Especially if students have been convinced that the largest sample (n=360) is the one to pay attention to they might say:
    - It's not perfectly normal but it's roughly normal.
    - It's a little bit skewed but it looks sort of like a bell curve.
- Students who also look at the n=30 sample might believe the population is somewhat skewed. 

**Teacher Note:** Most students will recognize some form of bell-shaped/normal-ness, especially in the larger sample. We want students to get comfortable with the idea that the normal distribution may not be a perfect fit, but it might still be a useful approximation of the shape of the population.

</div>


### 4.2 The normal curve is a mathematical model used to model distributions

The normal curve does not exist "in nature"; it was invented by mathematicians. But it has proven very useful for modeling the shape of distributions that do exist out there in the world (e.g., heights, reaction times, even Pokemon HP).

The normal curve is a **mathematical function** (more specifically, a density function). 

Just like a line in algebra can be defined with a slope and y-intercept, or a square can be defined just by knowing the length of one side, the normal curve is defined by two numbers:
- the **mean** (center)
- the **standard deviation** (spread)

This function gives us a smooth, idealized curve that we can use to model real-world data even if that data is a little jagged or messy.

Run the R code below to overlay a normal curve on our largest sample of Pokemon cards (n = 360) and see how well it fits. 

In [None]:
# notice the d for density histogram
gf_dhistogram(~ hp, data = poke360) %>%
  gf_lims(x=c(-20,200)) %>%
  gf_fitdistr() 

### 4.3 How R creates the curve to overlay on our histogram

The particular normal curve R overlaid on our data is defined by the mean and standard deviation of our sample (`poke360`). R first calculates the sample **mean** and **standard deviation**, then plugs them into the formula for the normal curve, and finally overlays that density curve on top of the histogram.
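To make "plugs them into the formula" a bit more concrete, the sketch below writes out the normal density formula by hand and compares it with R's built-in `dnorm()` function. It uses the mean and standard deviation that appear in this notebook's `xpnorm()` call (69.1 and 25.39); the HP values in `x` and the variable names are chosen only for illustration.

```r
# Sketch: the height of the normal curve at a few HP values
m <- 69.1    # mean used in this notebook's xpnorm() call
s <- 25.39   # standard deviation used in the same call
x <- c(20, 60, 69.1, 100)   # a few HP values chosen for illustration

# the normal density formula written out by hand
by_hand <- 1 / (s * sqrt(2 * pi)) * exp(-(x - m)^2 / (2 * s^2))

# R's built-in normal density function
built_in <- dnorm(x, mean = m, sd = s)

# check that the two versions agree (up to rounding)
all.equal(by_hand, built_in)
```

The only inputs the formula needs, besides the HP value itself, are the mean and the standard deviation.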
    
In the figure below we have added a vertical line as well as a short horizontal line (jutting out from the vertical line). Which represents the mean? Which the standard deviation? 

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/6.7-6.11-overview-fitdistr-mean-sd-v2.png" alt="density histogram of n = 360 with normal distribution fit over it" width = 50%>

<div class="guided-notes">

### 4.4 Label the overlaid normal curve in your guided notes with the mean and standard deviation
    
</div>


### 4.5 Calculating probabilities using the normal curve

Calculating probabilities from data is pretty straightforward. In our sample of 360 Pokemon cards, we just tallied how many HP values were below 60, and used that proportion as a probability estimate.

But when we use a **normal curve** to model the population, there are no individual data points to count. The normal curve is a **continuous mathematical function** (that's why it's also called a **probability density function**).  

So instead of counting, we calculate the **area under the curve** that falls within a range of values on the x-axis (e.g., HP < 60, or HP > 75, or HP between 50 and 100). The total area under the curve is considered to be 1.0 (representing all of the probability). The proportion of total area that falls within a specified range of values represents the **probability** of a value falling in that range. 
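The area idea can be checked numerically. The sketch below is one possible illustration, assuming a normal curve with the mean and standard deviation used in this notebook's `xpnorm()` call (69.1 and 25.39). It uses base R's `integrate()` to measure area under the curve and compares the result with `pnorm()`, which returns the area to the left of a cut point. The helper name `normal_curve` is hypothetical.

```r
# Sketch: probability as area under the normal curve
m <- 69.1    # mean used in this notebook's xpnorm() call
s <- 25.39   # standard deviation used in the same call

# the normal curve as a function of HP
normal_curve <- function(x) dnorm(x, mean = m, sd = s)

# total area under the whole curve
total_area <- integrate(normal_curve, -Inf, Inf)$value

# area between HP = 50 and HP = 100, measured directly
area_50_100 <- integrate(normal_curve, 50, 100)$value

# same area as a difference of two left-tail areas
shortcut <- pnorm(100, mean = m, sd = s) - pnorm(50, mean = m, sd = s)

c(total = total_area, between = area_50_100, shortcut = shortcut)
```

The last two numbers should agree: the area between two cut points is the area to the left of the upper one minus the area to the left of the lower one.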

<div class="guided-notes">

### 4.6 Shade the areas under the normal curve to represent each probability

In each normal curve, shade the area under the curve that matches the probability described:

- Shade the full area = 1.0  
- Shade 0.50 of the area (either side)  
- Shade 0.68 of the area 
    
</div>


<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/6.7-6.11-overview-to-shade.jpg" alt="three normal distributions that should be shaded with probability = 1.0, 0.5, and 0.68">

<div class="guided-notes">
    
### 4.7 Use `xpnorm()` to estimate P(HP < 60) using the normal curve
                                                              
Here’s the format: `xpnorm(<cut point>, mean = <mean>, sd = <sd>)`
   
- What mean and sd would we ideally use? 
- What values should we use instead as our best guess?
    
</div>
    

In [None]:
# write code


# COMPLETE
xpnorm(60, mean = 69.1, sd = 25.39)

<div class="guided-notes">

### 4.8 When you run `xpnorm()`, R gives you a lot of output. Write down in the table what you think each part of the output means.

</div>

<table border="1" style="font-size: 16px; margin-left: 0; border-collapse: collapse; width: 100%">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: left;" width = 25%>Output</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: left;" width = 75%>What it means</td>
    </tr>
  </thead>
  <tbody>
    <tr>
        <td style="border: 1px solid black; text-align: left;"><code>X ~ N(69.1, 25.39)</code></td>
      <td style="border: 1px solid black; text-align: left;"></td>
    </tr>
    <tr>
        <td style="border: 1px solid black; text-align: left;"><code>P(X <= 60)</code></td>
      <td style="border: 1px solid black;"></td>
    </tr>
    <tr>
        <td style="border: 1px solid black; text-align: left;"><code>P(Z <= -0.3584)</code></td>
      <td style="border: 1px solid black;"></td>
    </tr>
    <tr>
        <td style="border: 1px solid black; text-align: left;"><code>0.36</code> and <code>0.64</code></td>
      <td style="border: 1px solid black;"></td>
    </tr>
    <tr>
        <td style="border: 1px solid black; text-align: left;"><code>0.3600187</code></td>
      <td style="border: 1px solid black;"></td>
    </tr>
     <tr>
        <td style="border: 1px solid black; text-align: left;">the darker shaded region</td>
      <td style="border: 1px solid black;"></td>
    </tr>
    <tr>
        <td style="border: 1px solid black; text-align: left;">the lighter shaded region</td>
      <td style="border: 1px solid black;"></td>
    </tr>
    <tr>
        <td style="border: 1px solid black; text-align: left;"><code>z = -0.36</code></td>
      <td style="border: 1px solid black;"></td>
    </tr>
  </tbody>
</table>

### 4.9 Calculate the z-score of 60 in the code cell below

What mean and standard deviation should we use? Does this help us understand the output of `xpnorm`?
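One way to connect the z-score to the `xpnorm()` output is to compute the same probability twice: once on the HP scale and once on the z scale. The sketch below uses base R's `pnorm()`, which returns the area to the left of a cut point without drawing a plot, together with the mean and standard deviation from the `xpnorm()` call above. The variable names are chosen only for illustration.

```r
# Sketch: the same probability on the HP scale and on the z scale
m <- 69.1    # mean used in the xpnorm() call
s <- 25.39   # standard deviation used in the xpnorm() call

# z-score: how many standard deviations 60 is from the mean
z <- (60 - m) / s

# area to the left of 60 on a normal curve with mean m and sd s
p_raw <- pnorm(60, mean = m, sd = s)

# area to the left of z on the standard normal curve (mean 0, sd 1)
p_std <- pnorm(z)

c(z = z, p_raw = p_raw, p_std = p_std)
```

The two probabilities should match, which is why `xpnorm()` can report both `P(X <= 60)` and `P(Z <= -0.3584)` for the same shaded region.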

In [None]:
# ( x - m ) / sd


# COMPLETE
(60 - 69.1) / 25.39

<div class="discussion-question">

### 4.10 Discussion Questions: What is the probability that a randomly drawn Pokemon card has HP less than Pikachu (60)? Of all the probabilities we estimated, which would you use as your best guess? 

- When might the probabilities from a **sample distribution** be more accurate than using the normal curve?
- When might the probabilities from a **normal curve** give you a better estimate than the sample tallies?


</div>

## 5 Time to Unveil the Mystery...

So far, we’ve been estimating the probability that a randomly drawn Pokemon card has HP less than 60, using:
- A tiny sample (n = 10)
- A small-ish sample (n = 30)
- A larger sample (n = 360)
- The normal curve (with mean and sd from `poke360`)

But unlike most cases, we actually have access to the much larger dataset that the samples came from: `big_poke`.

### 5.1 Use `big_poke` to calculate the true proportion of cards in the population with HP < 60.

In [None]:
# write code


# COMPLETE
tally(~ hp < 60, data = big_poke, format = "proportion")

<div class="discussion-question">

### 5.2 Discussion Questions: 
    
- What is the actual value of **P(HP < 60)** in `big_poke`?
- Why is this different from your previous estimates?
- Do you trust this estimate more than your previous ones? Why or why not?
- Which of your previous estimates came closest to this estimate?  
- What does this reveal about sample size, shape, and the usefulness of modeling?

</div>

# End of Notebook

In [None]:
source("https://raw.githubusercontent.com/UCLATALL/stopwatch/refs/heads/main/add_sd_ruler.R")
# notice the d for density histogram
plot <- gf_dhistogram(~ hp, data = poke360, fill="grey") %>%
  gf_lims(x=c(-20,200)) %>%
  gf_fitdistr() %>% gf_model(lm(hp ~ NULL, data=poke360))
add_sd_ruler(plot, hp)