<div class="alert alert-block alert-danger">

# Assumptions in CI of the Mean (COMPLETE)
  
</div>

In [None]:
# Load the CourseKata library
library(coursekata)


## Using R to Calculate the Confidence Interval of the Sample Mean

Whenever we have a sample mean (such as the mean of `Thumb` in our `Fingers` data frame), we can calculate the confidence interval of the population mean using R.

In [None]:
# COMPLETE
empty_model <- lm(Thumb ~ NULL, data = Fingers)
confint(empty_model)

To calculate the confidence interval of the mean, we need a sampling distribution of the statistic, which we can build with `resample`, with `shuffle`, or with a mathematical approximation. This sampling distribution allows us to:
- figure out the margin of error
- calculate the confidence interval: the lowest and highest $\beta_0$ that could plausibly have produced our sample $b_0$, at a particular level of confidence
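
As a rough illustration of the resampling route, here is a minimal sketch in base R. It uses `sample()` with replacement rather than the `resample` function mentioned above, and the data are simulated stand-ins (the name `thumb_sim` and the values 60 and 9 are assumptions, not the `Fingers` data).

```r
# Hypothetical data: 157 simulated thumb lengths in mm (NOT the Fingers data)
set.seed(1)
thumb_sim <- rnorm(157, mean = 60, sd = 9)

# Build a sampling distribution of the mean by resampling with replacement
boot_means <- replicate(2000, mean(sample(thumb_sim, replace = TRUE)))

# The middle 95% of the resampled means gives an approximate interval
boot_ci <- quantile(boot_means, probs = c(.025, .975))
boot_ci

# Half the width of that interval is a rough margin of error
unname(diff(boot_ci) / 2)
```

The rest of this notebook takes the mathematical route instead, so the numbers from a resampled interval like this one will be close to, but not identical to, what `confint()` reports.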

This notebook will delve into three of the assumptions R uses to model the sampling distribution, margin of error, and ultimately the confidence interval:

1. The sampling distribution is modeled mathematically as a t-distribution.
2. The margin of error is modeled as the number of standard errors multiplied by the length of a standard error.
3. The confidence interval is calculated as the sample statistic plus or minus the margin of error.

## 1.0 - Sampling Distribution of Means (SDoM) modeled as t-distribution

1.1 - If we assumed the sampling distribution were normal, how far would we have to go from the center of the SDoM to find the 95% cutoff boundaries?

Note: When we express this margin of error in units of standard error (rather than in mm, the unit in which `Thumb` length is measured), we call it **the critical z**.

In [None]:
# this code assumes a normal distribution
# and finds the 99% cut off boundaries 
# modify to find the 95% cut off boundaries
xqnorm(.005)
xqnorm(.995)

In [None]:
# COMPLETE
# this code assumes a normal distribution
# and finds the 95% cutoff boundaries
# (modified from the 99% boundaries, .005 and .995)
xqnorm(.025)
xqnorm(.975)
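
The same idea works for any confidence level. The sketch below uses base R's `qnorm()`, which returns the same cutoff as `xqnorm()` without drawing the plot; the helper name `critical_z` is made up for this illustration.

```r
# Hypothetical helper: critical z for any confidence level
critical_z <- function(level = 0.95) {
  tail_area <- (1 - level) / 2   # proportion left over in each tail
  qnorm(1 - tail_area)           # upper cutoff; the lower cutoff is its negative
}

critical_z(0.95)   # about 1.96
critical_z(0.99)   # about 2.58
```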

The normal distribution isn’t *always* going to be the best mathematical model for the SDoM. When your sample size is fairly large, you can assume the sampling distribution is approximately normal. But if the sample size is small or if the $\sigma$ of the DGP is unknown (which it generally is), you’ll have more variation in your sampling distribution than is modeled by the normal distribution. 

What do we do then? We use the *t-distribution*, which is very similar to the normal distribution (also called the z-distribution) but slightly wider. How much wider depends on the degrees of freedom used to estimate $\sigma$.
- For small samples, the t-distribution is noticeably wider.
- For very large samples, the t-distribution looks almost exactly like the normal distribution (because the extra width nearly disappears).
- Bottom line: we usually use the t-distribution because it acts like the normal distribution when appropriate.

**The t-distribution, like the standard normal distribution, is a mathematical probability function that is a good model for sampling distributions of the mean. It’s just a little wider.**

1.2 - Let’s try to think about what this distance would be if we used the t-distribution instead of the z-distribution. In other words, what would the critical t be? Would it be bigger or smaller than 1.96? 

<div class="alert alert-block alert-warning">

**Sample Response**

- If a distribution is wider (more spread out), then to capture the middle 95%, you'd have to extend the boundaries out a bit further. So it would be bigger than 1.96.

</div>
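
One way to check this reasoning numerically is to ask how much of each distribution falls within 1.96 standard errors of the center. The sketch below uses base R's `pt()` and `pnorm()`; the helper name `within_196` is made up for this illustration.

```r
# Proportion of a t-distribution that falls within +/- 1.96 standard errors
within_196 <- function(df) pt(1.96, df) - pt(-1.96, df)

pnorm(1.96) - pnorm(-1.96)   # normal: very close to .95
within_196(df = 19)          # t with 19 df: less than .95
within_196(df = 999)         # t with 999 df: nearly the same as the normal
```

Because the t-distribution keeps less than 95% inside ±1.96, its 95% boundaries have to sit further out.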

1.3 - A function called `xqt()` takes the proportion of the distribution that should fall below the cutoff (e.g., .025 for the lower boundary or .975 for the upper boundary) and the degrees of freedom (which, for now, will be n-1) and returns the critical t: the margin of error in units of standard error (which is about 2 but not exactly).

For a very large sample (like 1,000 data points), this code will return something very close to 1.96. Modify the code cell below to find the critical t when the degrees of freedom is 1000-1.

In [None]:
# modify this code for a very large sample (n=1000)
xqt(.975, df = 99)

# un-comment if you want to compare the critical t to the critical z
#xqnorm(.975)

In [None]:
# COMPLETE
# modify this code for a very large sample (n=1000)
xqt(.975, df = 999)

# un-comment if you want to compare the critical t to the critical z
#xqnorm(.975)

1.4 - Both the critical t and the critical z are around 1.96. Now try modifying `xqt()` for a small sample (such as n = 20). Are the critical t and z more similar or less similar? Why?

In [None]:
# COMPLETE
# modify this code for a very small sample (n=20)
xqt(.975, df = 19)

# un-comment if you want to compare the critical t to the critical z
#xqnorm(.975)

<div class="alert alert-block alert-warning">

**Sample Response**

- Less similar. Because the t-distribution is wider (more spread out) with smaller samples, you have to extend the boundaries out a bit further to capture the middle 95%. Thus, you have to go all the way to 2.09 (a bit farther out than 1.96).

</div>

Let’s take a look at the critical t for different sample sizes.  

<style>
    table.table--outlined { border: 1px solid black;  border-collapse: collapse; margin-left: auto; margin-right: auto;  }
    table.table--outlined th, table.table--outlined td  { border: 1px solid black; padding: .5em; }
</style>
<table class="table--outlined">
    <thead>
        <tr>
            <th>Degrees of Freedom (df)</th>
            <th align="left">Critical <i>t</i> (the critical distance<br>in units of standard error)</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td align="right">499</td>
            <td align="right">1.964729</td>
        </tr>
        <tr>
            <td align="right">156</td>
            <td align="right">1.975288</td>
        </tr>
        <tr>
            <td align="right">49</td>
            <td align="right">2.009575</td>
        </tr>
        <tr>
            <td align="right">19</td>
            <td align="right">2.093024</td>
        </tr>
    </tbody>
</table><br> 
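
The values in this table can be reproduced in one step. The sketch below uses base R's `qt()`, which accepts a vector of degrees of freedom and returns the same cutoffs as `xqt()` without the plots; the object names are made up for this illustration.

```r
# Degrees of freedom from the table above
dfs <- c(499, 156, 49, 19)

# Critical t for the middle 95% at each df
critical_t_table <- data.frame(df = dfs, critical_t = qt(.975, df = dfs))
critical_t_table
```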

1.5 - What pattern do you notice between the df and the critical t?

<div class="alert alert-block alert-warning">

**Sample Responses**

- Bigger df (or sample size) is associated with smaller critical t (the number of standard errors to cover 95% of the sampling distribution). 
- Use this opportunity to gesture with the students to indicate the SDoM is more tightly clustered around the center for larger df (sample size).
</div>

## 2.0 - Margin of Error in Raw Units

2.1 - Whenever we have a variable (such as `Thumb` from `Fingers`), there is a raw unit: the actual unit of measurement recorded in the values of the variable. In the case of `Thumb`, what is the raw unit of measurement?

<div class="alert alert-block alert-warning">

**Sample Responses**

mm

</div>

When we figure out the margin of error using mathematical models of the sampling distribution, these models are *all purpose*: they aren't tailored to mm, feet, pounds, years, or any other particular raw unit. The critical t is expressed in **standardized units** rather than raw units. The unit is *# of standard errors*.

If we know the margin of error in *# of standard errors* and we know how long a standard error is in mm, we can figure out the margin of error in *mm*. Said more generally, if we know the margin of error in *standardized units*, and we can figure out how long a standard unit is in raw units, we can figure out the margin of error in *raw units*. 

$$\text{margin of error} = \underbrace{\text{critical t}}_{\text{# of SE}} \times \underbrace{\text{standard error}}_{\text{length of SE}}$$
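
To see the unit conversion at work, here is a small sketch with made-up numbers: a sample of n = 25 and an assumed standard error of 2 mm. Neither value comes from the `Fingers` data, and `qt()` is the base R counterpart of `xqt()`.

```r
# Hypothetical values for illustration only (not computed from Fingers)
critical_t <- qt(.975, df = 24)   # number of SEs for a sample of n = 25
se_mm <- 2                        # assumed length of one SE, in mm

# (# of SE) * (mm per SE) = margin of error in mm
margin_of_error_mm <- critical_t * se_mm
margin_of_error_mm
```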

2.2 - Find the critical t (the number of standard errors) needed to cover 95% of an SDoM for the sample of `Thumb` lengths from the `Fingers` data. (Note there are 157 thumb lengths in this data set.)

In [None]:
# COMPLETE
xqt(.975, df = 156)

2.3 - The standard error is estimated using a result from the Central Limit Theorem: $\text{SE} = \frac{s}{\sqrt{n}}$, where $s$ is the sample standard deviation and $n$ is the sample size. Use R to estimate SE below.

In [None]:
# COMPLETE
sd(Fingers$Thumb) / sqrt(157)
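
If you want to see why dividing by $\sqrt{n}$ is reasonable, a simulation can help. The sketch below assumes a hypothetical normal DGP with a mean of 60 mm and a standard deviation of 9 mm (both made up, not estimated from `Fingers`), draws many samples of 157, and compares the spread of the sample means to what the formula predicts.

```r
set.seed(42)
n <- 157      # same sample size as the Thumb data
sigma <- 9    # hypothetical population SD in mm (an assumption)

# Simulate 5000 samples from the hypothetical DGP and keep each sample mean
sim_means <- replicate(5000, mean(rnorm(n, mean = 60, sd = sigma)))

sd(sim_means)     # spread of the simulated sampling distribution
sigma / sqrt(n)   # what the formula predicts
```

The two numbers should be close but not identical, because the simulation is itself based on a finite number of samples.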

2.4 - Use the critical t and estimated standard error to calculate the margin of error for `Thumb` in mm.

In [None]:
# COMPLETE

# most efficient code
xqt(.975, df = 156) * sd(Fingers$Thumb) / sqrt(157)

# each component laid out separately
critical_t <- xqt(.975, df = 156) 
se <- sd(Fingers$Thumb) / sqrt(157)
critical_t * se

## 3.0 - Confidence Interval of the Mean

Now that we have mathematically modeled the SDoM with a t-distribution and estimated the standard error using the Central Limit Theorem, we can put it all together to calculate the confidence interval of the mean, as R does with `confint(empty_model)`.

$$\text{CI of mean} = \text{sample mean} \pm \text{margin of error in raw units}$$

Find the confidence interval using this method in the code block below.

In [None]:
# COMPLETE
margin_of_error <- xqt(.975, df = 156) * sd(Fingers$Thumb) / sqrt(157)
sample_mean <- b0(empty_model)

# lower and upper bounds (same order as confint() reports them)
sample_mean - margin_of_error
sample_mean + margin_of_error

# if you want to check against confint()
confint(empty_model)
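
The three steps can also be wrapped into one reusable function. This is a minimal sketch in base R, not how `confint()` is written internally; the helper name `t_confint` is made up, and the data are simulated rather than taken from `Fingers`. The empty model is fit here as `lm(x ~ 1)`, the base R way of writing a model with only an intercept.

```r
# Hypothetical helper: t-based confidence interval for a mean
t_confint <- function(x, level = 0.95) {
  n <- length(x)
  critical_t <- qt(1 - (1 - level) / 2, df = n - 1)   # number of SEs
  se <- sd(x) / sqrt(n)                               # length of one SE
  margin_of_error <- critical_t * se                  # in raw units
  c(lower = mean(x) - margin_of_error, upper = mean(x) + margin_of_error)
}

# Simulated data standing in for a real variable
set.seed(7)
x <- rnorm(40, mean = 60, sd = 9)

t_confint(x)
confint(lm(x ~ 1))   # should give the same two boundaries
```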