<div class="alert alert-block alert-danger">

# Central Limit Theorem for **S**ampling **D**istribution **o**f the **M**ean (SDoM) (COMPLETE)

</div>

In [None]:
# Load the CourseKata library
library(coursekata)


## 1.0 - From Simulations to a Theorem

In this class, we use simulations to learn about the characteristics of sampling distributions. This approach would not have been possible before the invention of the modern computer. Think about how long it would take to draw 1,000 random samples from a population and calculate the sample statistic for each one by hand. In R, we can do that in one line of code (e.g., `do(1000) * mean(...)`) that runs in 10 seconds or less. Before computers, it would have taken days of work!

In [None]:
# pick or simulate an outcome variable Y
Y <- Fingers$Thumb
# pick a sample size
n <- 250

# make a sampling distribution of the mean by resampling Y
SDoM <- do(1000) * mean(resample(Y, n))

# visualize the sampling distribution 
gf_histogram(~ mean, data = SDoM) %>%
  gf_labs(title = "Sampling Distribution of the Mean")



Sampling distributions tend to exhibit certain patterns that mathematicians captured long ago in the *Central Limit Theorem*. The Central Limit Theorem (CLT) describes the shape, center, and spread of the distribution of sample means when every sample has the same size and is randomly chosen from some population. The CLT has also been proven: it is a mathematical law, which means it holds for sampling distributions of means from all kinds of populations, not just the ones we have simulated.

According to the Central Limit Theorem, sampling distributions of means are known to have the following characteristics:
- The **shape** of the distribution of means ($\bar{Y}$) will be approximately normal, provided the sample size ($n$) is large enough or the population itself is normal.
- The **mean** of the distribution of means ($\mu_{\bar{Y}}$, often called the **expected value**) will be the same as the mean of the population ($\mu$) from which the samples are randomly chosen. That is, the sample means will center around the true population mean.
- The **standard error** ($\sigma_{\bar{Y}}$), which is the standard deviation of the distribution of means, will be smaller for larger sample sizes. More specifically, the standard error will equal the population standard deviation divided by the square root of the sample size ($\sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}}$).
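
The three characteristics listed here can be checked with a short simulation. The sketch below is only an illustration: it uses base R (so it runs without extra packages) instead of `do()` and `resample()`, and the exponential population, the sample size of 50, and all variable names are made up for this example.

```r
# A minimal base-R check of the three CLT claims.
# The population, sample size, and names below are assumptions for illustration.
set.seed(42)                                # make the random draws reproducible

# hypothetical skewed population: exponential with mean 10
population <- rexp(100000, rate = 1 / 10)
mu    <- mean(population)                   # population mean
sigma <- sd(population)                     # population standard deviation

n <- 50                                     # size of each sample

# draw 1,000 samples of size n and keep the mean of each one
sample_means <- replicate(1000, mean(sample(population, n, replace = TRUE)))

# center: mean of the sample means next to the population mean
c(mean_of_means = mean(sample_means), population_mean = mu)

# spread: simulated standard error next to the CLT value sigma / sqrt(n)
c(simulated_se = sd(sample_means), clt_se = sigma / sqrt(n))

# shape: histogram of the sample means (the population itself is strongly skewed)
hist(sample_means, main = "Sampling Distribution of the Mean")
```

If the CLT holds for this population, the two numbers in each pair should be close, and the histogram should look roughly bell-shaped even though the population does not.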

## 2.0 - Processing the Notation

**2.1:** Take a look at the notation for the mean of the distribution of means ($\mu_{\bar{Y}}$) and the standard error ($\sigma_{\bar{Y}}$). 

a. Why does one have a $\mu$ and the other have a $\sigma$?

b. Why do you think there is a little $\bar{Y}$ subscript for each?

**2.2:** Which parts of this formula for standard error correspond to the three components below?

$\sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}}$

The three components:
- Population standard deviation (the standard deviation of the DGP that produced this distribution of means)
- The square root of sample size
- Standard error (of the distribution of means)

## 3.0 - Standard Error

If the sampling distribution of the mean looked just like the population distribution (as it does when each sample contains a single observation), the two distributions would have the same standard deviation. But as the sample size of our simulated samples grows, the sample means vary less and cluster more tightly around the population mean. The CLT quantifies this pattern as:

$$\sigma_{\bar{Y}}=\frac{\sigma}{\sqrt{n}}$$
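
As a quick worked case (the value $\sigma = 10$ is made up for illustration), a sample of size 1 leaves the standard error equal to the population standard deviation, and quadrupling the sample size from 25 to 100 cuts the standard error in half:

$$n = 1:\ \sigma_{\bar{Y}}=\frac{10}{\sqrt{1}}=10 \qquad n = 25:\ \sigma_{\bar{Y}}=\frac{10}{\sqrt{25}}=2 \qquad n = 100:\ \sigma_{\bar{Y}}=\frac{10}{\sqrt{100}}=1$$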

Let’s explore this mathematical pattern. 

<iframe data-type="vimeo" id="381975155" width="640" height="360" src="https://player.vimeo.com/video/381975155" frameborder="0" allow="autoplay; fullscreen" allowfullscreen></iframe>


<style>
    table.table--outlined { border: 1px solid black;  border-collapse: collapse; margin-left: auto; margin-right: auto;  }
    table.table--outlined th, table.table--outlined td  { border: 1px solid black; padding: .5em; }
</style>
<table class="table--outlined">
    <thead>
        <tr>
            <th align="right">Sample Size of Simulated Samples (n)</th>
            <th align="right">Calculated Standard Error (from CLT)</th>
            <th align="right">Simulated Standard Error</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td align="right">2</td>
            <td align="right">3.54</td>
            <td align="right">3.52</td>
        </tr>
        <tr>
            <td align="right">5</td>
            <td align="right">2.24</td>
            <td align="right">2.23</td>
        </tr>
        <tr>
            <td align="right">16</td>
            <td align="right">1.25</td>
            <td align="right">1.24</td>
        </tr>
        <tr>
            <td align="right">20</td>
            <td align="right">1.12</td>
            <td align="right">1.12</td>
        </tr>
        <tr>
            <td align="right">25</td>
            <td align="right">1.00</td>
            <td align="right">1.01</td>
        </tr>
    </tbody>
</table>
<br>
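
A table like this one can be rebuilt in a few lines of base R. The sketch below rests on two assumptions that the text does not state: the population standard deviation is taken to be 5 (the value the calculated column is consistent with), and the simulated samples are drawn from a normal population with mean 0. The simulated column will therefore differ slightly from the one shown in the table.

```r
# Rebuild the two standard error columns for the sample sizes in the table.
# Assumptions: sigma = 5 is inferred from the table, and the normal
# population with mean 0 is hypothetical.
set.seed(1)
sigma <- 5
n <- c(2, 5, 16, 20, 25)                    # sample sizes from the table

# calculated standard error from the CLT
calculated_se <- sigma / sqrt(n)

# simulated standard error: the SD of 10,000 sample means at each sample size
simulated_se <- sapply(n, function(size) {
  sd(replicate(10000, mean(rnorm(size, mean = 0, sd = sigma))))
})

data.frame(n,
           calculated_se = round(calculated_se, 2),
           simulated_se  = round(simulated_se, 2))
```

The calculated column depends only on `sigma` and `n`, so it comes out the same on every run, while the simulated column changes a little with each new seed.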

**3.1:** In the video, what did we find out about the standard error part of the Central Limit Theorem (CLT)? (Check all that apply.)

A. This is a formula that is helpful for creating simulations.

B. This is the mathematical relationship between the standard error, population standard deviation, and sample size.

C. By using the CLT, you can predict the standard error you might see from simulations pretty well.

D. Sometimes it is more accurate to simulate to get standard error and sometimes it is more accurate to calculate it from the CLT.

E. If you calculate standard error from the CLT, you must also simulate it.

## 4.0 - Can You Break the CLT?

Think you could break the CLT? Give it a try. Make population distributions with crazy shapes, then simulate sampling distributions of the mean. Are they normal? Is their standard error basically equal to the population standard deviation divided by the square root of n?

<a href="http://onlinestatbook.com/stat_sim/sampling_dist" target="_blank">(Click here to give it a try)</a>
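
The same experiment can be tried directly in R. The sketch below is one possible attempt, not the method used by the linked simulator: the lumpy population (two clumps plus a few extreme values), the sample sizes, and the helper name `simulate_sdom` are all made up for illustration.

```r
# Try to "break" the CLT with an oddly shaped population (base R only).
# The population and the helper function are assumptions for illustration.
set.seed(2024)

# hypothetical population: two clumps far apart, plus a few extreme values
population <- c(rnorm(50000, mean = 0,  sd = 1),
                rnorm(45000, mean = 20, sd = 2),
                runif(5000, min = 50, max = 100))
sigma <- sd(population)

# simulate a sampling distribution of the mean for samples of size n
simulate_sdom <- function(n, reps = 2000) {
  replicate(reps, mean(sample(population, n, replace = TRUE)))
}

for (n in c(2, 30, 200)) {
  means <- simulate_sdom(n)
  # compare the simulated standard error with sigma / sqrt(n)
  cat("n =", n,
      " simulated SE =", round(sd(means), 2),
      " sigma / sqrt(n) =", round(sigma / sqrt(n), 2), "\n")
  # look at the shape of the sampling distribution
  hist(means, breaks = 40, main = paste("Sample means, n =", n))
}
```

Compare the histograms as `n` grows. The standard error relationship should hold at every sample size, while the normal shape may need a larger `n` to appear when the population is this irregular.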
