In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from scipy.stats import norm, bernoulli, t
from statsmodels.stats import proportion
from math import ceil
from IPython.display import display, Latex

# Reference

> [Unit: Confidence intervals](https://www.khanacademy.org/math/statistics-probability/confidence-intervals-one-sample)

---

# Introduction to confidence intervals

> [Confidence intervals and margin of errors](https://www.khanacademy.org/math/statistics-probability/confidence-intervals-one-sample/introduction-to-confidence-intervals/v/confidence-intervals-and-margin-of-error)<br>

Proportion
- Population proportion: $\displaystyle p$
- Expected number of successes in $n$ trials: $\displaystyle np$
- Population variance (single observation): $\displaystyle \sigma^{2} = pq$, where $q = 1 - p$
- Sample proportion: $\displaystyle \hat p$
- Sample size: $\displaystyle n$
- Standard deviation (parameter): $\displaystyle \sigma_{\hat p} = \sqrt{\frac{\sigma^{2}}{n}} = \sqrt{\frac{p(1 - p)}{n}}$
- Standard error (statistic): $\displaystyle SE = \hat \sigma_{\hat p} = \sqrt{\frac{\hat p (1 - \hat p)}{n}}$

Mean
- Population mean: $\displaystyle \mu$
- Population variance: $\displaystyle \sigma^{2}$
- Sample mean: $\displaystyle \bar x$
- Sample Standard Deviation (statistic): $\displaystyle \sigma_{x}$ or $\displaystyle s$
- The mean of the sampling distribution of a sample mean (statistic): $\mu_{\bar x} = \mu$
- Sample size: $\displaystyle n$
- Standard deviation(parameter): $\displaystyle \sigma_{\bar x} = \sqrt{\frac{\sigma^{2}}{n}} = \frac{\sigma}{\sqrt{n}}$
- Standard error (statistic): $\displaystyle SE = \hat \sigma_{\bar x} = \frac{\sigma_{x}}{\sqrt{n}}$
- Standard error (statistic): $\displaystyle SE = s_{\bar x} = \frac{s}{\sqrt{n}}$


## Margin of error

> [Margin of error](https://en.wikipedia.org/wiki/Margin_of_error)

The **margin of error** is a statistic expressing the amount of random [sampling error](https://en.wikipedia.org/wiki/Sampling_error "Sampling error") in the results of a [survey](https://en.wikipedia.org/wiki/Statistical_survey "Statistical survey"). The larger the margin of error, the less confidence one should have that a poll result would reflect the result of a survey of the entire [population](https://en.wikipedia.org/wiki/Statistical_population "Statistical population"). The margin of error will be positive whenever a population is incompletely sampled and the outcome measure has positive [variance](https://en.wikipedia.org/wiki/Variance "Variance"), which is to say, the measure _varies_.

Proportion
- Margin of error: $\displaystyle MOE_{\gamma} = z_{\gamma} \cdot SE$
- Confidence level: $\gamma$
- Critical value: $z_{\gamma}$

Mean
- Margin of error: $\displaystyle MOE_{\gamma} = t_{\gamma} \cdot SE$
- Confidence level: $\gamma$
- Critical value: $t_{\gamma}$
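
As a minimal sketch of the two margin of error formulas above, the helper functions below (hypothetical names, not part of any library) multiply a critical value by a standard error. The input values are made up for illustration.

```python
import numpy as np
from scipy.stats import norm, t

def z_margin_of_error(p_hat, n, confidence):
    """Margin of error for a proportion: z critical value times SE of p-hat."""
    z_star = norm.ppf((1 + confidence) / 2)
    return z_star * np.sqrt(p_hat * (1 - p_hat) / n)

def t_margin_of_error(s, n, confidence):
    """Margin of error for a mean: t critical value (df = n - 1) times SE of x-bar."""
    t_star = t.ppf((1 + confidence) / 2, df=n - 1)
    return t_star * s / np.sqrt(n)

# Illustrative inputs only (assumed values, not taken from a real survey)
moe_p = z_margin_of_error(p_hat=0.5, n=1000, confidence=0.95)
moe_mean = t_margin_of_error(s=10.0, n=25, confidence=0.95)
print(round(moe_p, 4), round(moe_mean, 4))
```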

---

## Confidence interval

> [Confidence interval simulation](https://www.khanacademy.org/math/statistics-probability/confidence-intervals-one-sample/introduction-to-confidence-intervals/v/confidence-interval-simulation)<br>
> [Interpreting confidence level example](https://www.khanacademy.org/math/statistics-probability/confidence-intervals-one-sample/introduction-to-confidence-intervals/v/interpreting-confidence-intervals-example)<br>
> [Interpreting confidence levels and confidence intervals](https://www.khanacademy.org/math/statistics-probability/confidence-intervals-one-sample/introduction-to-confidence-intervals/a/interpreting-confidence-levels-and-confidence-intervals)<br>
> [Confidence interval](https://en.wikipedia.org/wiki/Confidence_interval)

The confidence _level_ refers to the long-term success rate of the method, that is, how often this type of interval will capture the parameter of interest.

A specific confidence _interval_ gives a range of plausible values for the parameter of interest.

In [frequentist statistics](https://en.wikipedia.org/wiki/Frequentist_statistics "Frequentist statistics"), a **confidence interval** (**CI**) is a range of estimates for an unknown [parameter](https://en.wikipedia.org/wiki/Statistical_parameter "Statistical parameter"). A confidence interval is computed at a designated _confidence level_. The $95\%$ level is most common, but other levels (such as $90\%$ or $99\%$) are sometimes used. The confidence level represents the long-run [proportion](https://en.wikipedia.org/wiki/Frequency_distribution "Frequency distribution") of correspondingly computed intervals that end up containing the true value of the parameter.

Proportion
- Confidence interval: $\displaystyle CI = \hat p \pm MOE_{\gamma} = \hat p \pm z_{\gamma} \sqrt{\frac{\hat p (1 - \hat p)}{n}}$

Mean
- Confidence interval: $\displaystyle CI = \bar x \pm MOE_{\gamma} = \bar x \pm t_{\gamma} \sqrt{\frac{\sigma_{x}^{2}}{n}}$

---

### Example 1: Interpreting a confidence level

A political pollster plans to ask a random sample of $500$ voters whether or not they support the incumbent candidate. The pollster will take the results of the sample and construct a $90\%$ confidence interval for the true proportion of all voters who support the candidate.

**Which of the following is a correct interpretation of the $90\%$ confidence level?**

If the pollster repeats this process and constructs $20$ intervals from separate independent samples, we can expect about $18$ of those intervals to contain the true proportion of voters who support the candidate.

Explain:

*The stated confidence level means that we can expect $\approx90\%$ of these intervals to contain the parameter of interest, and $18$ of $20$ is $90\%$.*

If the pollster repeats this process many times, then about $90\%$ of the intervals produced will capture the true proportion of voters who support the candidate.

Explain:

*Confidence levels tell us the long-term rate at which a certain type of confidence interval will successfully capture the parameter of interest.*
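
The long-run interpretation can be illustrated with a small simulation. This is only a sketch: the true proportion `true_p` is an assumed value (in practice it is unknown), and the random seed is fixed so the run is repeatable.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# true_p is a hypothetical population proportion chosen for the simulation
true_p, n, confidence, repetitions = 0.55, 500, 0.90, 10_000

z_star = norm.ppf((1 + confidence) / 2)

# Each repetition is one poll of n voters and one resulting interval
p_hats = rng.binomial(n, true_p, size=repetitions) / n
moes = z_star * np.sqrt(p_hats * (1 - p_hats) / n)
captured = (p_hats - moes <= true_p) & (true_p <= p_hats + moes)

coverage = captured.mean()
print(f"share of intervals that captured p: {coverage:.3f}")
```

The printed share should land close to $0.90$, while any single interval either captures `true_p` or does not.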

---

### Example 2: Interpreting a confidence interval

A baseball coach was curious about the true mean speed of fastball pitches in his league. The coach recorded the speed in kilometers per hour of each fastball in a random sample of $100$ pitches and constructed a $95\%$ confidence interval for the mean speed. The resulting interval was $(110, 120)$.

**Which of the following is a correct interpretation of the interval $(110, 120)$?**

We're $95\%$ confident that the interval $(110, 120)$ captured the true mean pitch speed.

Explain:

*A confidence interval gives us a range of plausible values for a population parameter.*

Can we say there is a $95\%$ chance that the true mean is between $110$ and $120$ kilometers per hour?

Explain:
    
*We shouldn't say there is a $95\%$ chance that _this specific_ interval contains the true mean, because it implies that the mean may be within this interval, or it may be somewhere else. This phrasing makes it seem as if the population mean is variable, but it's not. This interval either captured the mean or didn't. Intervals change from sample to sample, but the population parameter we're trying to capture does not.*

*It's safer to say we're $95\%$ confident that this interval captured the mean, since this phrasing more closely agrees with the long-term capture rate of confidence levels.*

---

### Example 3: Effect of changing confidence level

Suppose that the coach from the previous example decides they want to be more confident. The coach uses the same sample data as before, but recalculates the confidence interval using a $99\%$ confidence level.

**How will increasing the confidence level from $95\%$ to $99\%$ affect the confidence interval?**

Increasing the confidence will increase the margin of error resulting in a wider interval.

Explain:

*A larger margin of error produces a wider confidence interval that is more likely to contain the parameter of interest (increased confidence).*

*Increasing the confidence level means we are more likely to capture the true value of the parameter with each interval. To be more confident, we use wider intervals with a larger margin of error.*
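
A rough numerical sketch of this effect, assuming the coach's $(110, 120)$ interval was a one-sample $t$ interval with $n = 100$ (the example does not say which method was used): back out the standard error from the $95\%$ interval, then rebuild the interval at $99\%$.

```python
from scipy.stats import t

n = 100
lower_95, upper_95 = 110, 120

center = (lower_95 + upper_95) / 2      # sample mean
moe_95 = (upper_95 - lower_95) / 2      # margin of error at 95%

# Assumption: a t interval was used, so MOE = t* x SE
se = moe_95 / t.ppf(0.975, df=n - 1)

# Same data and same SE, larger critical value at 99%
moe_99 = t.ppf(0.995, df=n - 1) * se
interval_99 = (center - moe_99, center + moe_99)
print(round(moe_95, 2), round(moe_99, 2), interval_99)
```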

---

# Estimating a population proportion

> [Confidence interval example](https://www.khanacademy.org/math/statistics-probability/confidence-intervals-one-sample/estimating-population-proportion/v/confidence-interval-example)<br>
> [Conditions for valid confidence intervals for a proportion](https://www.khanacademy.org/math/statistics-probability/confidence-intervals-one-sample/estimating-population-proportion/v/conditions-for-valid-confidence-intervals)<br>
> [Interpreting a z interval for a proportion](https://www.khanacademy.org/math/statistics-probability/confidence-intervals-one-sample/estimating-population-proportion/a/interpreting-z-interval-proportion)

When we want to carry out inference on one proportion (build a confidence interval or do a significance test), the accuracy of our methods depends on a few conditions. Before doing the actual computations of the interval or test, it's important to check whether or not these conditions have been met; otherwise the calculations and conclusions that follow aren't actually valid.

The conditions we need for inference on one proportion are:

-   **Random** (most important): The data needs to come from a random sample or randomized experiment.
-   **Normal**: The sampling distribution of $\hat p$ needs to be approximately normal — needs at least $10$ expected successes and $10$ expected failures. $n \hat p \geq 10$ and $n \hat q \geq 10$
-   **Independent**: Individual observations need to be independent. If sampling without replacement, our sample size shouldn't be more than $10\%$ of the population: $n \leq 10\% \text{ of the population}$.

---

## Conditions for a z interval for a proportion

> [Conditions for valid confidence intervals for a proportion](https://www.khanacademy.org/math/statistics-probability/confidence-intervals-one-sample/estimating-population-proportion/v/conditions-for-valid-confidence-intervals)

### The random condition

Random samples give us unbiased data from a population. When samples aren't randomly selected, the data usually has some form of bias, so using data that wasn't randomly selected to make inferences about its population can be risky.

More specifically, sample proportions are unbiased estimators of their population proportion. For example, if we have a bag of candy where $50\%$ of the candies are orange and we take random samples from the bag, some will have more than $50\%$ orange and some will have less. But on average, the proportion of orange candies in each sample will equal $50\%$. We write this property as $\mu_{\hat p}=p$, which holds true as long as our sample is random.

This won't necessarily happen if our sample isn't randomly selected though. Biased samples lead to inaccurate results, so they shouldn't be used to create confidence intervals or carry out significance tests.

---

### The normal condition

The sampling distribution of $\hat p$ is approximately normal as long as the expected number of successes and failures are both at least $10$. This happens when our sample size $n$ is reasonably large.

So we need:

- $\displaystyle \text{expected successes: } np \geq 10$
- $\displaystyle \text{expected failures: } n(1 - p) \geq 10$

If we are building a confidence interval, we don't have a value of $p$ to plug in, so we instead count the observed number of successes and failures in the sample data to make sure they are both at least $10$. If we are doing a significance test, we use our sample size $n$ and the hypothesized value of $p$ to calculate our expected numbers of successes and failures.

---

### The independence condition

To use the formula for standard deviation of $\hat p$, we need individual observations to be independent. When we are sampling without replacement, individual observations aren't technically independent since removing each item changes the population.

But the $10\%$ condition says that if we sample $10\%$ or less of the population, we can treat individual observations as independent since removing each observation doesn't significantly change the population as we sample. For instance, if our sample size is $n=150$, there should be at least $N=1500$ members in the population.

This allows us to use the formula for standard deviation of $\hat p$:

$\displaystyle \sigma_{\hat p} = \sqrt{\frac{\sigma^{2}}{n}} = \sqrt{\frac{p(1 - p)}{n}}$

In a significance test, we use the sample size $n$ and the hypothesized value of $p$.

If we are building a confidence interval for $p$, we don't actually know what $p$ is, so we substitute $\hat p$ as an estimate for $p$. When we do this, we call it the **standard error** of $\hat p$ to distinguish it from the standard deviation.

So our formula for standard error of $\hat p$ is

$\displaystyle SE = \hat \sigma_{\hat p} = \sqrt{\frac{\hat p (1 - \hat p)}{n}}$
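
The difference between the two quantities can be sketched in a few lines. The counts and the hypothesized value `p_0` below are made up for illustration.

```python
import numpy as np

# Hypothetical counts, for illustration only
k, n = 55, 100            # observed successes and sample size
p_hat = k / n

# Standard error: p-hat stands in for the unknown p (confidence interval)
se = np.sqrt(p_hat * (1 - p_hat) / n)

# Standard deviation under a hypothesized p (significance test)
p_0 = 0.5                 # hypothetical null value
sd = np.sqrt(p_0 * (1 - p_0) / n)

print(round(se, 4), round(sd, 4))
```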

---

### Example 1

Ali is in charge of the dinner menu for his senior prom, and he wants to use a one-sample $z$ interval to estimate what proportion of seniors would order a vegetarian option. He randomly selects $30$ of the $150$ total seniors and finds that $7$ of those sampled would order the vegetarian option.

**Which conditions for constructing this confidence interval did Ali's sample meet?**

- The data is a random sample from the population of interest.
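
The three checks can be written out as a small helper. This is a sketch with a hypothetical function name; whether the sample is random cannot be computed from the numbers, so it is passed in as a flag.

```python
def check_z_interval_conditions(successes, n, population_size=None, is_random=True):
    """Return which conditions for a one-sample z interval appear to be met."""
    failures = n - successes
    return {
        "random": is_random,
        # Normal condition: at least 10 observed successes and 10 observed failures
        "normal": successes >= 10 and failures >= 10,
        # 10% condition; None when the population size is not known
        "independent": None if population_size is None else n <= 0.10 * population_size,
    }

# Ali's sample: 7 of 30 sampled seniors, out of 150 seniors in total
ali = check_z_interval_conditions(successes=7, n=30, population_size=150)
print(ali)
```

For Ali's data only the random condition holds: there are fewer than $10$ observed successes, and $30$ is more than $10\%$ of $150$.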

---

### Example 2

A school administrator wants to know what proportion of teachers in their state have a Master's degree. The administrator takes an SRS (simple random sample) of $100$ teachers from a statewide database containing every teacher, and they find $55$ teachers in the sample have a Master's degree. The administrator wants to use this data to construct a one-sample $z$ interval for a proportion.

**Which conditions for constructing this confidence interval did their sample meet?**

- The data is a random sample from the population of interest.
- $n \hat p \geq 10$ and $n \hat q \geq 10$
- Individual observations can be considered independent.

---

## Finding the critical value $z^{*}$ for a desired confidence level

---

### Example 1: Finding the critical value

**What is the critical value $z^{*}$ for constructing an $88\%$ confidence interval?**

In [2]:
CL = 0.88
# A central area of CL leaves (1 - CL) / 2 in each tail
critical_value = norm.ppf((1 - CL) / 2 + CL)
precision = 3

display(Latex(f"$z^* = {round(critical_value, precision)}$"))

$z^* = 1.555$

In [3]:
# OR
norm.interval(0.88)

(-1.5547735945968535, 1.5547735945968535)

---

### Example 2: Finding the confidence level

Emer made a one-sample $z$ interval for a proportion and used the critical value $z^*=1.476$.

**What confidence level did she use?**

In [4]:
critical_value = 1.476
# Confidence level = central area between -z* and +z*
CL = norm.cdf(critical_value) - norm.cdf(-critical_value)
precision = 3

display(Latex(f"$CL = {round(CL, precision)}$"))

$CL = 0.86$

---

## Calculating a z interval for a proportion

---

### Example 1

A survey of $400$ randomly selected homes from a large city with over $30{,}000$ homes showed that $16$ of the sampled homes didn't have a television.

**Based on this sample, which of the following is a $95\%$ confidence interval for the proportion of homes in this city that don't have a television?**

In [5]:
k, n, CL = 16, 400, .95
p_hat = k / n  # sample proportion
precision = 3

# critical_value = NormalDist().inv_cdf((1 - CL) / 2 + CL)
critical_value = norm.ppf((1 - CL) / 2 + CL)
SE = bernoulli.std(p_hat) / np.sqrt(n)  # sqrt(p_hat * (1 - p_hat) / n)
MOE = SE * critical_value
CI_L, CI_R = p_hat - MOE, p_hat + MOE
CI_L, CI_R = round(CI_L, precision), round(CI_R, precision)
(CI_L, CI_R)

(0.021, 0.059)

In [6]:
# OR
k, n, CL = 16, 400, .95
alpha = 1 - CL
precision = 3

ci_low, ci_upp = proportion.proportion_confint(count=k, nobs=n, alpha=alpha, method='normal')
round(ci_low, precision), round(ci_upp, precision)

(0.021, 0.059)

---

### Example 2

Yun is curious what proportion of users will click an advertisement that appears on his website. He takes a random sample of $200$ users and finds that $34$ of them clicked the advertisement. He's willing to assume independence between users in the sample.

**Based on this sample, which of the following is a $99\%$ confidence interval for the proportion of users that click the ad?**

In [7]:
CL, n, k = 0.99, 200, 34
p_hat = k / n  # sample proportion

critical_value = norm.ppf((1 - CL) / 2 + CL)
SE = bernoulli.std(p_hat) / np.sqrt(n)  # sqrt(p_hat * (1 - p_hat) / n)

print(p_hat, 1 - p_hat, n, critical_value)

0.17 0.83 200 2.5758293035489004


$\displaystyle 0.17 \pm 2.576 \sqrt{\frac{0.17(0.83)}{200}}$
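
To carry the expression above through to interval endpoints, a short sketch using the same inputs (the variable names here are chosen for the sketch):

```python
import numpy as np
from scipy.stats import norm

CL, n, k = 0.99, 200, 34
p_hat = k / n

z_star = norm.ppf((1 + CL) / 2)                     # about 2.576
MOE = z_star * np.sqrt(p_hat * (1 - p_hat) / n)     # margin of error
ci = (round(p_hat - MOE, 3), round(p_hat + MOE, 3))
print(ci)
```

With these inputs the interval comes out to roughly $(0.102, 0.238)$.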

---

## Interpreting a z interval for a proportion

> [Interpreting a z interval for a proportion](https://www.khanacademy.org/math/statistics-probability/confidence-intervals-one-sample/estimating-population-proportion/a/interpreting-z-interval-proportion)

Once we build a confidence interval for a proportion, it's important to be able to interpret what the interval tells us about the population, and what it doesn't tell us. Let's look at a few examples that demonstrate how to interpret a confidence interval for a proportion.

---

### Example 1

Ahmad saw a report that claimed $57\%$ of US adults think a third major political party is needed. He was curious how students at his large university felt on the topic, so he asked the same question to a random sample of $100$ students and made a $95\%$ confidence interval to estimate the proportion of students who agreed that a third major political party was needed. His resulting interval was $\left(0.599, 0.781\right)$. Assume that the conditions for inference were all met.

**Based on his interval, is it plausible that $57\%$ of all students at his university would agree that a third party is needed?**

No, it isn't. The interval says that plausible values for the true proportion are between $59.9\%$ and $78.1\%$. Since the interval doesn't contain $57\%$, it doesn't seem plausible that $57\%$ of students at this university would agree. In other words, the entire interval is above $57\%$, so the true proportion at this university is likely higher.

---

### Example 2

Ahmad's sister, Diedra, was curious how students at her large high school would answer the same question, so she asked it to a random sample of $100$ students at her school. She also made a $95\%$ confidence interval to estimate the proportion of students at her school who would agree that a third party is needed. Her interval was $\left(0.557, 0.743\right)$. Assume that the conditions for inference were all met.

**Based on her interval, is it plausible that $57\%$ of students at her school would agree that a third party is needed?**

Yes. Since the interval contains $57\%$, it is a plausible value for the population proportion.

**Does her interval provide evidence that the true proportion of students at her school who would agree that a third party is needed is $57\%$?**

No. Confidence intervals don't give us evidence that a parameter equals a specific value; they give us a range of plausible values. Diedra's interval says that the true proportion of students who agree could be as low as $55.7\%$ or as high as $74.3\%$ and that values outside of this interval aren't likely. So it wouldn't be appropriate to say this interval supports the value of $57\%$.

---

### Example 3

A video game gives players a reward of gold coins after they defeat an enemy. The creators of the game want players to have a chance at earning bonus coins when they defeat a certain challenging enemy. The creators attempt to program the game so that the bonus is awarded randomly with a $30\%$ probability after the enemy is defeated.

To see if the bonus is being awarded as intended, the creators defeated the enemy in a series of $100$ attempts (they're willing to treat this as a random sample). After each attempt, they recorded whether or not the bonus was awarded. They used the results to build a $95\%$ confidence interval for $p$, the proportion of attempts that will be rewarded with the bonus. The resulting interval was $(0.323, 0.517)$.

**What does this interval suggest?**

The bonus most likely isn't being awarded at the intended rate of $30\%$.

Explain:

Our interval doesn't contain $30\%$. The entire interval is greater than $30\%$, so the bonus is likely being awarded more often than intended.

---

### Example 4

The creators of the video game also want players to have a chance at earning a rare item when they defeat a challenging enemy. The creators attempt to program the game so that the rare item is awarded randomly with a $15\%$ probability after the enemy is defeated.

To see if the rare item is being awarded as intended, the creators defeated the enemy in a series of $100$ attempts (they're willing to treat this as a random sample). After each attempt, they recorded whether or not the rare item was awarded. They used the results to build a $95\%$ confidence interval for $p$, the proportion of attempts that will be rewarded with the rare item, which was $0.12 \pm 0.06$.

**What does this interval suggest?**

It's plausible that the rare item is being awarded at the intended rate of $15\%$.

Explain:

The interval $0.12 \pm 0.06$ runs from $0.06$ to $0.18$, which contains $0.15$, so $15\%$ is a plausible value for the rate.
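
The reasoning in Examples 3 and 4 amounts to checking whether the intended rate falls inside the interval. A minimal sketch, with a hypothetical helper name:

```python
def is_plausible(value, interval):
    """True when the value lies inside the (low, high) confidence interval."""
    low, high = interval
    return low <= value <= high

bonus_interval = (0.323, 0.517)                    # Example 3
rare_item_interval = (0.12 - 0.06, 0.12 + 0.06)    # Example 4, written as center +/- MOE

print(is_plausible(0.30, bonus_interval))       # intended bonus rate
print(is_plausible(0.15, rare_item_interval))   # intended rare item rate
```

A value inside the interval is only plausible; as Example 2 notes, that is not evidence that the parameter equals it.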

---

## Sample size and margin of error in a z interval for p

---

### Example 1

Della wants to make a one-sample $z$ interval to estimate what proportion of her community members favor a tax increase for more local school funding. She wants her margin of error to be no more than $\pm 2\%$ at the $95\%$ confidence level.

**What is the smallest sample size required to obtain the desired margin of error?**

In [8]:
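CL, MOE = 0.95, 0.02
p = 0.5 # p(1 - p) is largest at p = 0.5, which gives the most conservative (largest) n

critical_value = norm.ppf((1 - CL) / 2 + CL)
SE = MOE / critical_value  # largest standard error that still meets the target MOE
n = (p * (1 - p)) / SE**2
# Round up: rounding down could leave the margin of error above the target
print("smallest sample size:", ceil(n))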

smallest sample size: 2401


In [9]:
# OR
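CL, MOE = 0.95, 0.02
p = 0.5 # p(1 - p) is largest at p = 0.5, which gives the most conservative (largest) n
alpha = 1 - CL

n = proportion.samplesize_confint_proportion(proportion=p, half_length=MOE, alpha=alpha)
# Round up: rounding down could leave the margin of error above the target
print("smallest sample size:", ceil(n))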

smallest sample size: 2401
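
The calculation above can be read as solving the margin of error formula for $n$. As a worked sketch, using the rounded critical value $z^{*} \approx 1.96$:

$\displaystyle MOE = z^{*} \sqrt{\frac{p(1 - p)}{n}} \quad \Rightarrow \quad n \geq \left(\frac{z^{*}}{MOE}\right)^{2} p(1 - p)$

$\displaystyle n \geq \left(\frac{1.96}{0.02}\right)^{2} (0.5)(0.5) = 9604 \times 0.25 = 2401$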


---

### Example 2

A large restaurant chain is curious what proportion of their customers in a given day are new customers. They are thinking of taking a sample of either $n=50$ or $n=100$ customers and building a one-sample $z$ interval for a proportion using the data from the sample.

**Assuming the sample proportion is the same in each sample, what is true about the margins of error from these two samples?**

The margin of error from the smaller sample will be $\sqrt{2}$ times the margin of error from the larger sample.
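
Explain:

One way to see where the factor $\sqrt{2}$ comes from, assuming the same $\hat p$ and the same critical value $z^{*}$ for both samples:

$\displaystyle \frac{MOE_{50}}{MOE_{100}} = \frac{z^{*} \sqrt{\hat p (1 - \hat p) / 50}}{z^{*} \sqrt{\hat p (1 - \hat p) / 100}} = \sqrt{\frac{100}{50}} = \sqrt{2}$

The margin of error is proportional to $1 / \sqrt{n}$, so halving the sample size multiplies it by $\sqrt{2}$.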

---

### Example 3

Researchers make a one-sample $z$ interval to estimate the proportion of adults who would say they learned something new in the past month. Previous studies have suggested that about $40\%$ would answer "yes" to this question. The researchers plan on using a confidence level of $95\%$, and they want the margin of error to be no more than $\pm 5\%$.

**If we assume $\hat p=0.40$, what is the smallest sample size required to obtain the desired margin of error?**

In [10]:
CL, MOE = 0.95, 0.05
p = 0.4
alpha = 1 - CL

n = proportion.samplesize_confint_proportion(proportion=p, half_length=MOE, alpha=alpha)
# Round up: rounding down could leave the margin of error above the target
print("smallest sample size:", ceil(n))

smallest sample size: 369


---

# Estimating a population mean

---

## Conditions for valid t intervals

> [Conditions for inference on a mean](https://www.khanacademy.org/math/statistics-probability/confidence-intervals-one-sample/estimating-population-mean/a/reference-conditions-inference-one-mean)

When we want to carry out inference (build a confidence interval or do a significance test) on a mean, the accuracy of our methods depends on a few conditions. Before doing the actual computations of the interval or test, it's important to check whether or not these conditions have been met. Otherwise the calculations and conclusions that follow may not be correct.

The conditions we need for inference on a mean are:

-   **Random**: A random sample or randomized experiment should be used to obtain the data.
-   **Normal**: The sampling distribution of $\bar x$ (the sample mean) needs to be approximately normal. This is true if our parent population is normal or if our sample is reasonably large $(n \geq 30)$.
-   **Independent**: Individual observations need to be independent. If sampling without replacement, our sample size shouldn't be more than $10\%$ of the population.

Let's look at each of these conditions a little more in-depth.

---

### The random condition

Random samples give us unbiased data from a population. When we don't use random selection, the resulting data usually has some form of bias, so using it to infer something about the population can be risky.

More specifically, sample means are unbiased estimators of their population mean. For example, suppose we have a bag of ping pong balls individually numbered from $0$ to $30$, so the population mean of the bag is $15$. We could take random samples of balls from the bag and calculate the mean from each sample. Some samples would have a mean higher than $15$ and some would be lower. But on average, the mean of each sample will equal $15$. We write this property as $\mu_{\bar x}=\mu$, which holds true as long as we are taking random samples.

This won't necessarily happen if we use a non-random sample. Biased samples can lead to inaccurate results, so they shouldn't be used to create confidence intervals or carry out significance tests.

---

### The normal condition

The sampling distribution of $\bar x$ (a sample mean) is approximately normal in a few different cases. The shape of the sampling distribution of $\bar x$ mostly depends on the shape of the parent population and the sample size $n$.

#### Case 1: Parent population is normally distributed

If the parent population is normally distributed, then the sampling distribution of $\bar x$ is approximately normal regardless of sample size. So if we know that the parent population is normally distributed, we pass this condition even if the sample size is small. In practice, however, we usually don't know if the parent population is normally distributed.

#### Case 2: Not normal or unknown parent population; sample size is large $(n \geq 30)$

The sampling distribution of $\bar x$ is approximately normal as long as the sample size is reasonably large. Because of the central limit theorem, when $n \geq 30$, we can treat the sampling distribution of $\bar x$ as approximately normal regardless of the shape of the parent population.

There are a few rare cases where the parent population has such an unusual shape that the sampling distribution of the sample mean $\bar x$ isn't quite normal for sample sizes near $30$. These cases are rare, so in practice, we are usually safe to assume approximate normality in the sampling distribution when $n \geq 30$.

#### Case 3: Not normal or unknown parent population; sample size is small $(n<30)$

As long as the parent population doesn't have outliers or strong skew, even smaller samples will produce a sampling distribution of $\bar x$ that is approximately normal. In practice, we can't usually see the shape of the parent population, but we can try to infer shape based on the distribution of data in the sample. If the data in the sample shows skew or outliers, we should doubt that the parent is approximately normal, and so the sampling distribution of $\bar x$ may not be normal either. But if the sample data are roughly symmetric and don't show outliers or strong skew, we can assume that the sampling distribution of $\bar x$ will be approximately normal.

_The big idea is that we need to graph our sample data when $n<30$ and then make a decision about the normal condition based on the appearance of the sample data._
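
Alongside a graph (for example a histogram or box plot), a numerical screen can flag the same problems. The sketch below uses a hypothetical helper, made-up data, the common $1.5 \times IQR$ outlier rule, and an arbitrary skewness cutoff of $1.0$; these thresholds are assumptions, not part of the condition as stated above.

```python
import numpy as np
from scipy.stats import skew

def screen_small_sample(data, skew_limit=1.0):
    """Flag outliers (1.5 x IQR rule) and strong skew in a small sample."""
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
    sample_skew = float(skew(data))
    return {
        "skewness": sample_skew,
        "outliers": outliers.tolist(),
        "looks_ok": outliers.size == 0 and abs(sample_skew) < skew_limit,
    }

# Made-up samples for illustration
roughly_symmetric = [12, 14, 15, 15, 16, 17, 18, 19, 20, 22]
with_outlier = [12, 14, 15, 15, 16, 17, 18, 19, 20, 60]

print(screen_small_sample(roughly_symmetric))
print(screen_small_sample(with_outlier))
```

A screen like this supplements the graph rather than replacing it.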

---

### The independence condition

To use the formula for standard deviation of $\bar x$, we need individual observations to be independent. In an experiment, good design usually takes care of independence between subjects (control, different treatments, randomization).

In an observational study that involves sampling without replacement, individual observations aren't technically independent since removing each observation changes the population. However, the $10\%$ condition says that if we sample $10\%$ or less of the population, we can treat individual observations as independent since removing each observation doesn't change the population all that much as we sample. For instance, if our sample size is $n=30$, there should be at least $N=300$ members in the population for the sample to meet the independence condition.

Assuming independence between observations allows us to use this formula for standard deviation of $\bar x$ when we're making confidence intervals or doing significance tests:

$\displaystyle \sigma_{\bar x} = \frac{\sigma}{\sqrt{n}}$

We usually don't know the population standard deviation $\sigma$, so we substitute the sample standard deviation $s_x$ as an estimate for $\sigma$. When we do this, we call it the **standard error** of $\bar x$ to distinguish it from the standard deviation.

So our formula for standard error of $\bar x$ is:

$\displaystyle \sigma_{\bar x} \approx \frac{s_x}{\sqrt{n}}$
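
As a quick sketch of this formula on made-up data: the sample standard deviation uses $n - 1$ in the denominator (`ddof=1` in NumPy), and `scipy.stats.sem` returns the same standard error directly.

```python
import numpy as np
from scipy.stats import sem

# Made-up measurements, for illustration only
x = np.array([4.1, 5.3, 3.8, 4.9, 5.0, 4.4])

s_x = x.std(ddof=1)                  # sample standard deviation
se_manual = s_x / np.sqrt(x.size)    # s_x / sqrt(n)
se_scipy = sem(x)                    # uses ddof=1 by default

print(se_manual, se_scipy)
```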

---

### Summary


If all three of these conditions are met, then we can feel good about using $t$ distributions to make a confidence interval or do a significance test. Satisfying these conditions makes our calculations accurate and conclusions reliable.

The random condition is perhaps the most important. If we break the random condition, there is probably bias in the data. The only reliable way to correct for a biased sample is to recollect the data in an unbiased way.

The other two conditions are important, but if we don't meet the normal or independence conditions, we may not need to start over. For example, there is a way to correct for the lack of independence when we sample more than $10\%$ of a population, but it's beyond the scope of what we're learning right now.

The main idea is that it's important to verify certain conditions are met before we make these confidence intervals or do these significance tests.

---

### Example 1

Flavia wanted to estimate the mean age of the faculty members at her large university. She took an SRS (simple random sample) of $20$ of the approximately $700$ faculty members, and each faculty member in the sample provided Flavia with their age. The ages were skewed to the right with a sample mean of $\bar x =38.75$. She's considering using her data to make a confidence interval to estimate the mean age of faculty members at her university.

**Which conditions for constructing a $t$ interval have been met?**

- The data is a random sample from the population of interest.
- Individual observations can be considered independent.

---

### Example 2

Here are two different samples drawn from two different populations:

![](https://raw.githubusercontent.com/ZacksAmber/PicGo/master/img/20220501123656.png)

**Which sample satisfies the normal condition for constructing a $t$ interval?**

Sample A only

Even though the sample is small, the sample data are roughly symmetric with no outliers, so it satisfies the normal condition.

---

## Finding the critical value $t^{*}$ for a desired confidence level

---

### Example 1

**What is the critical value $t^{*}$ for constructing a $95\%$ confidence interval for a mean from a sample size of $n=20$ observations?**

In [11]:
CL, n = .95, 20
# t critical value with n - 1 degrees of freedom
critical_value = t.ppf((1 - CL) / 2 + CL, df=n-1)
precision = 3

display(Latex(f"$t^* = {round(critical_value, precision)}$"))

$t^* = 2.093$

In [12]:
# Or
CL, n = .95, 20
t.interval(CL, df=n-1)

(-2.093024054408263, 2.093024054408263)

---

## Calculating a t interval for a mean

---

### Example 1

Ignacio is curious about the average age of cars in the commuter lot at his large university. He takes a random sample of $16$ cars and finds that their average age is $\bar x=14.5$ years and their standard deviation is $s_x=4.6$ years. The distribution of ages in the sample was roughly symmetric with no obvious outliers.

**Based on this sample, which of the following is a $90\%$ confidence interval for the mean age of cars (in years) in this commuter lot?**

In [13]:
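CL, n, s_mu, s_sigma = .90, 16, 14.5, 4.6  # confidence level, sample size, sample mean, sample sd
precision = 2

SE = s_sigma / np.sqrt(n)  # standard error of the sample mean
# t.interval centers the t distribution at the sample mean and scales it by the standard error
CI_L, CI_R = t.interval(CL, df=n-1, loc=s_mu, scale=SE)
CI_L, CI_R = round(CI_L, precision), round(CI_R, precision)
(CI_L, CI_R)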

(12.48, 16.52)
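
The same interval can be reproduced by hand from the one-sample $t$ interval formula. The critical value $t^{*} \approx 1.753$ (for $\text{df}=15$ and $90\%$ confidence) is rounded here, so the arithmetic is approximate:

$\displaystyle \bar x \pm t^{*} \cdot \frac{s_x}{\sqrt{n}} = 14.5 \pm 1.753 \cdot \frac{4.6}{\sqrt{16}} \approx 14.5 \pm 2.02$

This gives approximately $(12.48, 16.52)$.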

---

### Example 2

A nutritionist wants to estimate the average caloric content of the burritos at a popular restaurant. They obtain a random sample of $14$ burritos and measure their caloric content. Their sample data are roughly symmetric with a mean of $700$ calories and a standard deviation of $50$ calories.

**Based on this sample, which of the following is a $95\%$ confidence interval for the mean caloric content of these burritos?**

In [14]:
CL, n, s_mu, s_sigma = .95, 14, 700, 50
critical_value = t.ppf((1 - CL) / 2 + CL, df=n-1)
precision = 1

SE = s_sigma / np.sqrt(n)
MOE = critical_value * SE  # margin of error = t* times the standard error
# Raw f-string so that "\pm" reaches LaTeX unchanged instead of being parsed as an escape sequence
display(Latex(rf"${s_mu} \pm {round(MOE, precision)}$"))

$700 \pm 28.9$

---

## Confidence interval for a mean with paired data

> [Making a t interval for paired data](https://www.khanacademy.org/math/statistics-probability/confidence-intervals-one-sample/estimating-population-mean/a/one-sample-t-interval-paired-data)

In some studies, we make two observations on the same individual. For instance, we might look at each student's pre-test and post-test scores in a course. In other studies, we might make an observation on each of two similar individuals. For example, some medicine trials involve pairing similar subjects so one receives the medicine and the other receives a placebo.

In both types of studies, we're working with _paired data_ (see also [Study Design - Experiments](./study_design.ipynb)), and whenever we're working with paired data, we're typically interested in the difference between each pair: for example, the difference between the pre-test and the post-test data, or the difference between the medicine and the placebo data.

If certain conditions are met, we can construct a $t$ interval to estimate the mean of these differences and draw conclusions.

The example below walks through making a $t$ interval for paired data, from calculating the differences to interpreting the interval.

---

### Example 1

A running magazine wanted to review two watches (watch A and watch B) that use global positioning systems (GPS) to calculate the distance someone runs. They noticed that the watches didn't usually agree on the distance someone traveled in a given run.

The magazine took a random sample of $5$ subscribers and asked them to run a $10$ kilometer route wearing both watches at the same time (they all agreed to participate). At the end of their runs, the participants recorded the distance each watch said they traveled. Here are the data (all distances are in kilometers):

|Runner|1|2|3|4|5|
|:-:|-:|-:|-:|-:|-:|
|Watch A|9.8|9.8|10.1|10.1|10.2|
|Watch B|10.1|10|10.2|9.9|10.1|

**Construct a $95\%$ confidence interval to estimate the mean difference in the distances reported by these watches. Does the interval suggest that there is a difference between the two watches?**

---

#### Step 1: Calculate the differences

Even though it appears we have two sets of data—watch A and watch B—these data didn't come from two independent samples. The magazine took a single sample of $5$ runners, and each runner wore both watches, so this is a matched pairs design. The one set of data we're interested in is the _difference_ between watch A and watch B for each runner. Let's define this variable as $\text{difference = B - A}$ and calculate the difference for each runner:

In [15]:
df = pd.DataFrame(
    {'Watch A': [9.8, 9.8, 10.1, 10.1, 10.2],
     'Watch B': [10.1, 10, 10.2, 9.9, 10.1]},
    index=pd.Series([1, 2, 3, 4, 5], name='Runner')
)
df.transpose()

Runner,1,2,3,4,5
Watch A,9.8,9.8,10.1,10.1,10.2
Watch B,10.1,10.0,10.2,9.9,10.1


In [16]:
df['Difference(B - A)'] = df['Watch B'] - df['Watch A']
df.transpose()

Runner,1,2,3,4,5
Watch A,9.8,9.8,10.1,10.1,10.2
Watch B,10.1,10.0,10.2,9.9,10.1
Difference(B - A),0.3,0.2,0.1,-0.2,-0.1


**Key idea:** When dealing with paired data, we're most interested in the distribution of the differences.

---

#### Step 2: Check conditions

We want to use these $n=5$ differences to construct a confidence interval for the mean difference. Since we don't know the population standard deviation of the differences, we'll have to use the sample standard deviation in its place. This makes it appropriate to use a $t$ interval instead of a $z$ interval to estimate the mean difference. Let's check the conditions for making a $t$ interval.

- Random: The magazine took a random sample of their subscribers.
- Normal: Since our sample of $n=5$ runners is small, we need to plot the data. The differences are roughly symmetric with no outliers, so it should be safe to proceed.
- Independent: It's reasonable to assume independence between each runner's measurements. They were randomly selected, and they shouldn't influence each other's results.

---

#### Step 3: Construct the interval

Here are the data:

In [17]:
df.transpose()

Runner,1,2,3,4,5
Watch A,9.8,9.8,10.1,10.1,10.2
Watch B,10.1,10.0,10.2,9.9,10.1
Difference(B - A),0.3,0.2,0.1,-0.2,-0.1


Here are the summary statistics:

In [18]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Watch A,5.0,10.0,0.187083,9.8,9.8,10.1,10.1,10.2
Watch B,5.0,10.06,0.114018,9.9,10.0,10.1,10.1,10.2
Difference(B - A),5.0,0.06,0.207364,-0.2,-0.1,0.1,0.2,0.3


Since we want to construct a confidence interval for the mean difference, we only need the summary statistics for the differences.

We'll use the formula for a one-sample $t$ interval for a mean:

$\displaystyle \text{statistic} \pm \text{(critical value)(standard deviation of statistic)}$

$\displaystyle \bar x_{\text{Diff}} \pm t^{*} \cdot \frac{s_{\text{Diff}}}{\sqrt{n}}$

**Components of formula:**

Our statistic is the sample mean $\bar x_{\text{Diff}}=0.06\text{ km}$.

Our sample size is $n=5$ runners.

Our sample standard deviation is $s_{\text{Diff}}=0.21\text{ km}$.

Our degrees of freedom is $\text{df}=5-1=4$, so for $95\%$ confidence our critical value is $t^{*}=2.776$.

In [19]:
diff = df['Difference(B - A)']
CL, n, s_mu, s_sigma = .95, diff.size, diff.mean(), diff.std()  # pandas .std() uses ddof=1 (sample sd)
precision = 2

SE = s_sigma / np.sqrt(n)
# Note: the keyword argument df= (degrees of freedom) is unrelated to the DataFrame named df
CI_L, CI_R = t.interval(CL, df=n-1, loc=s_mu, scale=SE)
CI_L, CI_R = round(CI_L, precision), round(CI_R, precision)
(CI_L, CI_R)

(-0.2, 0.32)
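
As a cross-check, the interval can also be built term by term from the formula $\bar x_{\text{Diff}} \pm t^{*} \cdot \frac{s_{\text{Diff}}}{\sqrt{n}}$. The sketch below is self-contained and re-enters the watch data as plain arrays; the variable names are chosen for this illustration only.

```python
import numpy as np
from scipy.stats import t

watch_a = np.array([9.8, 9.8, 10.1, 10.1, 10.2])
watch_b = np.array([10.1, 10.0, 10.2, 9.9, 10.1])
diff = watch_b - watch_a            # difference = B - A for each runner

n = diff.size
x_bar = diff.mean()                 # sample mean of the differences
s = diff.std(ddof=1)                # sample standard deviation (n - 1 in the denominator)
t_star = t.ppf(0.975, df=n - 1)     # critical value for 95% confidence, df = 4
moe = t_star * s / np.sqrt(n)       # margin of error

manual_ci = (x_bar - moe, x_bar + moe)
library_ci = t.interval(0.95, df=n - 1, loc=x_bar, scale=s / np.sqrt(n))

print(round(float(t_star), 3))
print([round(float(v), 2) for v in manual_ci])
print([round(float(v), 2) for v in library_ci])
```

Note that NumPy's `std` defaults to `ddof=0`, so `ddof=1` is needed here to match the sample standard deviation that pandas reports.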

---

#### Step 4: Interpret the interval

**Does the interval suggest that there is a difference between the two watches?**

We're $95\%$ confident that the interval $(-0.20, 0.32)$ captures the mean difference between the distances (in kilometers) reported by the watches on this sort of run. Notice that the interval contains $0\text{ km}$—which represents no difference—so it's plausible that there is no difference between the distances reported by Watch A and Watch B.

If the entire interval had been above $0$ (all positive values), or if it had been entirely below $0$ (all negative values), then it would have suggested a difference between the two watches.

---

## Interpreting a confidence interval for a mean

> [Interpreting a confidence interval for a mean](https://www.khanacademy.org/math/statistics-probability/confidence-intervals-one-sample/estimating-population-mean/a/interpret-one-sample-t-interval-mean)

After we build a confidence interval for a mean, it's important to be able to interpret what the interval tells us about the population and what it doesn't tell us.

A confidence interval for a mean gives us a range of plausible values for the population mean. If a confidence interval does not include a particular value, we can say that it is not likely that the particular value is the true population mean. However, even if a particular value is within the interval, we shouldn't conclude that the population mean equals that specific value.

Let's look at a few examples that demonstrate how to interpret a confidence interval for a mean.

---

### Example 1

Felix is a quality control expert at a factory that paints car parts. Their painting process consists of a primer coat, color coat, and clear coat. For a certain part, these layers have a combined target thickness of $150$ microns. Felix measured the thickness of $50$ randomly selected points on one of these parts to see if it was painted properly. His sample had a mean thickness of $\bar x=148$ microns and a standard deviation of $s_x=3.3$ microns.

A $95\%$ confidence interval for the mean thickness based on his data is $(147.1,148.9)$.

**Based on his interval, is it plausible that this part's average thickness agrees with the target value?**

No, it isn't. The interval says that the plausible values for the true mean thickness on this part are between $147.1$ and $148.9$ microns. Since this interval doesn't contain $150$ microns, it doesn't seem plausible that this part's average thickness agrees with the target value. In other words, the entire interval is below the target value of $150$ microns, so this part's mean thickness is likely below the target.

---

### Example 2

Martina read that the average graduate student is $33$ years old. She wanted to estimate the mean age of graduate students at her large university, so she took a random sample of $30$ graduate students. She found that their mean age was $\bar x=31.8$ and the standard deviation was $s_x=4.3$ years. A $95\%$ confidence interval for the mean based on her data was $(30.2,33.4)$.

**Based on this interval, is it plausible that the mean age of all graduate students at her university is also $33$ years?**

Yes. Since $33$ is within the interval, it is a plausible value for the mean age of the entire population of graduate students at her university.
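
Both intervals in this section can be reproduced from the summary statistics alone, and the "is this value plausible?" question then reduces to checking whether the value lies inside the interval. The sketch below assumes the conditions for a $t$ interval hold; the helper names `t_interval_from_summary` and `is_plausible` are hypothetical and introduced only for this illustration.

```python
from math import sqrt
from scipy.stats import t

def t_interval_from_summary(x_bar, s, n, cl=0.95):
    """One-sample t interval computed from summary statistics only."""
    t_star = t.ppf(1 - (1 - cl) / 2, df=n - 1)
    moe = t_star * s / sqrt(n)
    return x_bar - moe, x_bar + moe

def is_plausible(value, interval):
    """True if value lies inside the interval, i.e. it is a plausible population mean."""
    low, high = interval
    return bool(low <= value <= high)

felix = t_interval_from_summary(x_bar=148, s=3.3, n=50)      # paint thickness, target 150 microns
martina = t_interval_from_summary(x_bar=31.8, s=4.3, n=30)   # graduate student ages, claimed mean 33

print([round(float(v), 1) for v in felix], is_plausible(150, felix))
print([round(float(v), 1) for v in martina], is_plausible(33, martina))
```

A value inside the interval is only plausible, not confirmed: the check says nothing about the population mean being exactly equal to that value.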

---

## Sample size and margin of error in a confidence interval for a mean

---

### Example 1

Nadia wants to estimate the mean driving range for her company's new electric vehicle. She'll sample vehicles and measure each of their driving ranges to construct a confidence interval for the mean driving range. She wants the margin of error to be no more than $10$ kilometers at a $90\%$ level of confidence. A pilot study suggests that the driving ranges for this type of vehicle have a standard deviation of $15$ kilometers.

**Which of these is the smallest approximate sample size required to obtain the desired margin of error?**

In [20]:
CL, sd, MOE = .90, 15, 10
# The pilot-study sd is treated as the population sd, so we use a z critical value
# (a t critical value would depend on the still-unknown n)
critical_value = norm.ppf((1 - CL) / 2 + CL)

# MOE <= 10
# critical_value * (sd / n**0.5) <= MOE
# (critical_value * sd / MOE)**2 <= n
n = (critical_value * sd / MOE)**2

ceil(n)  # round up to the next whole observation

7
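
The rearrangement in the code comments can be written out as a worked inequality. This sketch treats the pilot-study value $\sigma \approx 15$ kilometers as the population standard deviation and uses the rounded critical value $z^{*} \approx 1.645$ for $90\%$ confidence:

$\displaystyle z^{*} \cdot \frac{\sigma}{\sqrt{n}} \le \text{MOE} \quad\Longrightarrow\quad n \ge \left(\frac{z^{*} \cdot \sigma}{\text{MOE}}\right)^{2} = \left(\frac{1.645 \cdot 15}{10}\right)^{2} \approx 6.09$

Since $n$ must be a whole number, we round up to $n=7$.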

---

### Example 2

Dr. Ashby is studying the heart rate of adult men. Preliminary data suggests that the standard deviation of heart rates for adult men is $\sigma=7.5$ beats per minute (bpm). He plans on taking a sample of $n$ men to construct a $99\%$ confidence interval for the mean heart rate. He wants the margin of error to be no more than $3\text{ bpm}$.

**Which of these is the smallest approximate sample size required to obtain the desired margin of error?**

In [21]:
CL, sd, MOE = .99, 7.5, 3
# The preliminary sd is treated as the population sd, so we use a z critical value
critical_value = norm.ppf((1 - CL) / 2 + CL)

# MOE <= 3
# critical_value * (sd / n**0.5) <= MOE
# (critical_value * sd / MOE)**2 <= n
n = (critical_value * sd / MOE)**2

ceil(n)  # round up to the next whole observation

42
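
The two sample-size calculations above follow the same pattern, so they can be wrapped in a small function and sanity-checked: the returned $n$ should meet the target margin of error, while $n-1$ should miss it. This is a sketch that treats the given standard deviation as the population value and uses $z^{*}$; the function names `min_sample_size` and `margin_of_error` are hypothetical.

```python
from math import ceil, sqrt
from scipy.stats import norm

def margin_of_error(cl, sd, n):
    """z-based margin of error for a mean, treating sd as the population sd."""
    z_star = norm.ppf(1 - (1 - cl) / 2)
    return z_star * sd / sqrt(n)

def min_sample_size(cl, sd, moe):
    """Smallest whole n with z* * sd / sqrt(n) <= moe."""
    z_star = norm.ppf(1 - (1 - cl) / 2)
    return ceil((z_star * sd / moe) ** 2)

# (confidence level, sd, target margin of error) for Example 1 and Example 2
for cl, sd, moe in [(0.90, 15, 10), (0.99, 7.5, 3)]:
    n = min_sample_size(cl, sd, moe)
    print(n, round(margin_of_error(cl, sd, n), 2), round(margin_of_error(cl, sd, n - 1), 2))
```

Because the interval that is eventually built from the sample would typically use $t^{*}$, which is larger than $z^{*}$, the realized margin of error can come out somewhat above the target for small $n$; the $z$-based result is best read as a planning estimate.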