## Chapter 06
# Confidence Intervals

Adapted from ["Elementary Statistics - Picturing the World" 6th edition](https://www.amazon.com/Elementary-Statistics-Picturing-World-6th/dp/0321911210/)

In [1]:
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'scroll': True,
        'width': "100%",
        'height': "100%",
})

{'width': '100%', 'height': '100%', 'scroll': True}


## 6.1 <br/>Confidence Intervals for the Mean ($\sigma$ Known)

### Estimating Population Parameters

- In this chapter, we will learn an important technique of statistical inference: using sample statistics to estimate the value of an unknown population parameter.
- In this section and the next, we will learn how to use sample statistics to estimate the population parameter $\mu$ when the population standard deviation $\sigma$ is known (this section) or unknown (the next section).
- To make such an inference, begin by finding a **point estimate**.


### Point Estimate

- A **point estimate** is a single value estimate for a population parameter. 
- The most unbiased point estimate of the population mean $\mu$ is the sample mean $\bar{x}$.
- The validity of an estimation method is increased when you use a sample statistic that is unbiased and has low variability. 
- A statistic is unbiased if it does not overestimate or underestimate the population parameter.

### Finding a Point Estimate [example 1]

- An economics researcher is collecting data about grocery store employees in a county. 
- The data listed below represents a random sample of the number of hours worked by $40$ employees from several grocery stores in the county. 
- Find a point estimate of the population mean $\mu$.

![](./image/6_1_ex_1_point_estimate.png)

### Finding a Point Estimate [solution]


- The sample mean of the data is:

$\bar{x} = \frac{\sum{x}}{n} = \frac{1184}{40} = 29.6$

- The point estimate for the mean number of hours worked by grocery store employees in this county is $29.6$ hours.
- The probability that the population mean is exactly $29.6$ is virtually zero. 
- Instead of estimating $\mu$ to be exactly $29.6$ using a point estimate, we can estimate that $\mu$ lies in an interval. 
- This is called making an **interval estimate**.


### Interval Estimate

- An **interval estimate** is an interval, or range of values, used to estimate a population parameter.
- To form an interval estimate, use the point estimate as the center of the interval, and then add and subtract a margin of error. 
- For instance, if the margin of error is $2.1$, then an interval estimate would be given by 
 - $29.6 \pm 2.1$ or 
 - $27.5 < \mu < 31.7$. 
- The point estimate and interval estimate are shown in the figure.

![](./image/6_1_interval_estimate.png)

- Before finding a margin of error for an interval estimate, we should first determine how confident we need to be that our interval estimate contains the population mean $\mu$.

### Level of Confidence

- The **level of confidence** $c$ is the probability that the interval estimate contains the population parameter, assuming that the estimation process is repeated a large number of times.
- We know from the Central Limit Theorem that when $n > 30$, the sampling distribution of sample means is approximately a normal distribution.
- The level of confidence $c$ is the area under the standard normal curve between the critical values, $-z_{c}$ and $z_{c}$.
- **Critical values** are values that separate sample statistics that are probable from sample statistics that are improbable, or unusual.
- We can see from the figure shown below that $c$ is the percent of the area under the normal curve between $-z_{c}$ and $z_{c}$. 

![](./image/6_1_level_of_confidence_graph.png)

- The area remaining is $1 - c$, so the area in each tail is $\frac{1}{2}(1 - c)$. 
- For instance, if $c = 90\%$, then $5\%$ of the area lies to the left of $-z_{c} = -1.645$ and $5\%$ lies to the right of $z_{c} = 1.645$, as shown in the table.

![](./image/6_1_level_of_confidence_table.png)
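
The relationship between $c$ and $z_{c}$ described above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library's `statistics.NormalDist`; the helper name `critical_value` is hypothetical and not part of the textbook.

```python
from statistics import NormalDist

def critical_value(c):
    """Return z_c such that the area between -z_c and z_c is c."""
    tail = (1 - c) / 2                      # area in each tail
    return NormalDist().inv_cdf(1 - tail)   # standard normal quantile

for c in (0.90, 0.95, 0.99):
    print(f"c = {c:.2f}  ->  z_c = {critical_value(c):.3f}")
# c = 0.90  ->  z_c = 1.645
# c = 0.95  ->  z_c = 1.960
# c = 0.99  ->  z_c = 2.576
```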


### Sampling Error

- The difference between the point estimate and the actual parameter value is called the sampling error. 
- When $\mu$ is estimated, the sampling error is the difference $\bar{x} - \mu$. 
- In most cases, of course, $\mu$ is unknown, and $\bar{x}$ varies from sample to sample. 
- However, you can calculate a maximum value for the error when you know the level of confidence and the sampling distribution.

### Margin of Error

- Given a level of confidence $c$, the margin of error $E$ (sometimes also called the maximum error of estimate or error tolerance) is the greatest possible distance between the point estimate and the value of the parameter it is estimating. 
- For a population mean $\mu$ where $\sigma$ is known, the margin of error is:

  - $E = z_{c} \sigma_{\bar{x}} = z_{c} \frac{\sigma}{\sqrt{n}}$

- when these conditions are met:
 - The sample is random.
 - At least one of the following is true: The population is normally distributed or $n > 30$.

### Finding the Margin of Error [example 2]

- Use the data in Example 1 and a $95\%$ confidence level to find the margin of error for the mean number of hours worked by grocery store employees.
- Assume the population standard deviation is $7.9$ hours.

![](./image/6_1_ex_1_point_estimate.png)

### Finding the Margin of Error [solution]

- Because $\sigma$ is known $(\sigma = 7.9)$, the sample is random, and $n > 30$, we can use the formula for $E$. 
- The $z$-score that corresponds to a $95\%$ confidence level is $1.96$. 
- This implies that $95\%$ of the area under the standard normal curve falls within $1.96$ standard deviations of the mean. 
- We can approximate the distribution of the sample means with a normal curve by the Central Limit Theorem because $n > 30$.

![](./image/6_1_ex2_graph.png)

$E = z_{c} \frac{\sigma}{\sqrt{n}} = 1.96 \times \frac{7.9}{\sqrt{40}} \approx 2.4$

##### Interpretation  
You are $95\%$ confident that the margin of error for the population mean is about $2.4$ hours.
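
As a quick check of the arithmetic in Example 2, the margin of error can be recomputed directly from $E = z_{c} \frac{\sigma}{\sqrt{n}}$. This is only a sketch; the variable names are chosen here for illustration.

```python
from math import sqrt
from statistics import NormalDist

n = 40        # sample size from Example 1
sigma = 7.9   # population standard deviation assumed in Example 2
c = 0.95      # level of confidence

z_c = NormalDist().inv_cdf((1 + c) / 2)   # critical value, about 1.96
E = z_c * sigma / sqrt(n)                 # margin of error

print(round(z_c, 2), round(E, 1))
# 1.96 2.4
```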

### Confidence Intervals for a Population Mean

- Using a point estimate and a margin of error, you can construct an interval estimate of a population parameter such as $\mu$. 
- This interval estimate is called a **confidence interval**.
- A $c$-confidence interval for a population mean $\mu$ is
 - $\bar{x} - E < \mu < \bar{x} + E$
- The probability that the confidence interval contains $\mu$ is $c$, assuming that the estimation process is repeated a large number of times.

### Guidelines for Constructing a Confidence Interval for a Population Mean ($\sigma$ Known)

1. Verify that $\sigma$ is known, the sample is random, and either the population is normally distributed or $n > 30$.

2. Find the sample statistics $n$ and $\bar{x}$.
 - $\bar{x} = \frac{\sum{x}}{n}$

3. Find the critical value $z_{c}$ that corresponds to the given level of confidence.
4. Find the margin of error $E$.
 - $E = z_{c} \frac{\sigma}{\sqrt{n}}$
 
5. Find the left and right endpoints and form the confidence interval.
 - Left endpoint: $\bar{x} - E$
 - Right endpoint: $\bar{x} + E$
 - Interval: $\bar{x} - E < \mu < \bar{x} + E$
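
Steps 3 to 5 of the guidelines can be wrapped in a small helper. The function below is a hypothetical sketch (the name `z_interval` and its signature are assumptions made for illustration); it does not check the conditions in step 1, which still have to be verified by hand.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, c):
    """Return (left, right) endpoints of a c-confidence interval for mu
    when sigma is known. Assumes the conditions in step 1 already hold."""
    z_c = NormalDist().inv_cdf((1 + c) / 2)   # step 3: critical value
    E = z_c * sigma / sqrt(n)                 # step 4: margin of error
    return x_bar - E, x_bar + E               # step 5: endpoints

# Values from Examples 1 and 2: x_bar = 29.6, sigma = 7.9, n = 40
left, right = z_interval(29.6, 7.9, 40, 0.95)
print(round(left, 2), round(right, 2))
# 27.15 32.05
```

The endpoints differ slightly from a hand calculation that rounds $E$ to $2.4$ before adding and subtracting, because the helper keeps full precision until the end.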

### Constructing a Confidence Interval [example 3]

Use the data in Example $1$ to construct a $95\%$ confidence interval for the mean number of hours worked by grocery store employees.

![](./image/6_1_ex_1_point_estimate.png)

### Constructing a Confidence Interval [solution]


- In Examples $1$ and $2$, we found that $\bar{x} = 29.6$ and $E \approx 2.4$. 
- The confidence interval is constructed as shown.

![](./image/6_1_ex_3_confidence_interval.png)

##### Interpretation  
With $95\%$ confidence, you can say that the population mean number of hours worked is between $27.2$ and $32.0$ hours.

In [7]:
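# Example 3 (also covers Example 4)
from scipy import stats
from math import sqrt

n = 40        # sample size
sigma = 7.9   # known population standard deviation
x_bar = 29.6  # sample mean from Example 1 (the point estimate of mu)

stats.norm.interval(0.95, loc=x_bar, scale=sigma/sqrt(n))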

(27.151809622396982, 32.04819037760302)

### Constructing a Confidence Interval [example 5]

- A college admissions director wishes to estimate the mean age of all students currently enrolled. 
- In a random sample of 20 students, the mean age is found to be $22.9$ years. 
- From past studies, the standard deviation is known to be $1.5$ years, and the population is normally distributed. 
- Construct a $90\%$ confidence interval of the population mean age.



### Constructing a Confidence Interval [solution]


- Because $\sigma$ is known, the sample is random, and the population is normally distributed, use the formula for $E$ given in this section. 
- Using $n = 20$, $\bar{x} = 22.9$, $\sigma = 1.5$, and $z_{c} = 1.645$, the margin of error at the $90\%$ confidence level is:
 - $E = z_{c} \frac{\sigma}{\sqrt{n}} = 1.645 \times \frac{1.5}{\sqrt{20}} \approx 0.6$

- The $90\%$ confidence interval can be written as $\bar{x} \pm E \approx 22.9 \pm 0.6$ or as shown below.

![](./image/6_1_ex_5_confidence_interval.png)

##### Interpretation  
- With $90\%$ confidence, you can say that the mean age of all the students is between $22.3$ and $23.5$ years.
- The statement "With $90\%$ confidence, the mean is in the interval ($22.3$, $23.5$)" means that when a large number of samples is collected and a confidence interval is created for each sample, approximately $90\%$ of these intervals will contain $\mu$.

In [9]:
from scipy import stats
from math import sqrt

n = 20        # sample size
x_bar = 22.9  # sample mean (the point estimate of mu)
sigma = 1.5   # known population standard deviation

stats.norm.interval(0.90, loc=x_bar, scale=sigma/sqrt(n))

(22.348299321564912, 23.451700678435085)
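
The repeated-sampling interpretation of the $90\%$ confidence level can be checked with a small simulation. This is a sketch under stated assumptions: the "true" mean `mu_true = 23.0` is a made-up value chosen only so that coverage can be counted, while $\sigma = 1.5$ and $n = 20$ are taken from Example 5.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

random.seed(0)                      # fixed seed so the run is repeatable

mu_true = 23.0                      # hypothetical population mean (assumption)
sigma, n, c = 1.5, 20, 0.90         # values from Example 5
z_c = NormalDist().inv_cdf((1 + c) / 2)
E = z_c * sigma / sqrt(n)           # same margin of error for every sample

trials = 10_000
hits = 0
for _ in range(trials):
    # Draw one random sample of size n from a normal population
    sample = [random.gauss(mu_true, sigma) for _ in range(n)]
    x_bar = fmean(sample)
    # Count the interval if it contains the true mean
    if x_bar - E < mu_true < x_bar + E:
        hits += 1

coverage = hits / trials
print(f"Fraction of intervals containing mu: {coverage:.3f}")
```

The printed fraction should land close to $0.90$. Any single interval either contains $\mu$ or does not; the level of confidence describes the long-run behavior of the procedure.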

### Sample Size

- For the same sample statistics, as the level of confidence increases, the confidence interval widens. 
- As the confidence interval widens, the precision of the estimate decreases. 
- One way to improve the precision of an estimate without decreasing the level of confidence is to increase the sample size. 
- But how large a sample size is needed to guarantee a certain level of confidence for a given margin of error?
- We can answer this by solving the margin of error formula for the minimum sample size $n$.
- Given a $c$-confidence level and a margin of error $E$, the minimum sample size $n$ needed to estimate the population mean $\mu$ is

$$
\begin{aligned}
E &= z_{c} \frac{\sigma}{\sqrt{n}} \\
n &= \left(\frac{z_{c} \sigma}{E}\right)^{2}
\end{aligned}
$$

- When $\sigma$ is unknown, we can estimate it using $s$, provided we have a preliminary sample with at least $30$ members.

### Determining a Minimum Sample Size [example 6]

- The economics researcher in Example 1 wants to estimate the mean number of hours worked by all grocery store employees in the county. 
- How many employees must be included in the sample to be $95\%$ confident that the sample mean is within $1.5$ hours of the population mean?

![](./image/6_1_ex_1_point_estimate.png)

### Determining a Minimum Sample Size [solution]

- Using $c = 0.95$, $z_{c} = 1.96$, $\sigma = 7.9$ (from Example 2), and $E = 1.5$, we can solve for the minimum sample size $n$.
 - $n = \left(\frac{z_{c} \sigma}{E}\right)^{2} = \left(\frac{1.96 \times 7.9}{1.5}\right)^{2} \approx 106.56$
- When necessary, round up to obtain a whole number. So, the researcher needs at least $107$ grocery store employees in the sample.

##### Interpretation  
- The researcher already has $40$ employees, so the sample needs 67 more members. 
- Note that $107$ is the minimum number of employees to include in the sample. 
- The researcher could include more, if desired.
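
The sample size calculation in Example 6 can be reproduced with a short helper. The function name `min_sample_size` is hypothetical; the sketch simply applies $n = \left(\frac{z_{c} \sigma}{E}\right)^{2}$ and rounds up.

```python
from math import ceil
from statistics import NormalDist

def min_sample_size(sigma, E, c):
    """Smallest whole n for which the margin of error is at most E."""
    z_c = NormalDist().inv_cdf((1 + c) / 2)
    return ceil((z_c * sigma / E) ** 2)   # always round up, never down

n_min = min_sample_size(sigma=7.9, E=1.5, c=0.95)
print(n_min)        # 107
print(n_min - 40)   # 67 additional employees beyond the original 40
```

Rounding up matters here: rounding $106.56$ down to $106$ would give a margin of error slightly larger than $1.5$ hours.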

## 6.2 <br/>Confidence Intervals for the Mean ($\sigma$ Unknown) 

### The $t$-Distribution

- In many real-life situations, the population standard deviation ($\sigma$) is unknown.
- So, how can you construct a confidence interval for a population mean when $\sigma$ is not known? 
- For a random variable that is normally distributed (or approximately normally distributed), you can use a $t$-distribution.

### The $t$-Distribution

If the distribution of a random variable $x$ is approximately normal, then the statistic

$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$

follows a $t$-distribution. Critical values of $t$ are denoted by $t_{c}$. Here are several properties of the $t$-distribution.

1. The mean, median, and mode of the $t$-distribution are equal to $0$.
2. The $t$-distribution is bell-shaped and symmetric about the mean.
3. The total area under the $t$-distribution curve is equal to $1$.
4. The tails in the $t$-distribution are “thicker” than those in the standard normal distribution.
5. The standard deviation of the $t$-distribution varies with the sample size, but it is greater than $1$.
6. The $t$-distribution is a family of curves, each determined by a parameter called the degrees of freedom. The degrees of freedom (sometimes abbreviated as $d.f.$) are the number of free choices left after a sample statistic such as $\bar{x}$ is calculated. When you use a $t$-distribution to estimate a population mean, the degrees of freedom are equal to one less than the sample size.
$d.f. = n - 1$
7. As the degrees of freedom increase, the $t$-distribution approaches the standard normal distribution, as shown in the figure. After $30$ $d.f.$, the $t$-distribution is close to the standard normal distribution.

![](./image/6_2_t_distrtibution.png)
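
Properties 5 and 7 can be illustrated numerically with `scipy.stats`, the library already used in this chapter. The sketch below compares the $95\%$ critical value $t_{c}$ and the standard deviation of the $t$-distribution with their standard normal counterparts; the particular degrees of freedom are arbitrary choices made for illustration.

```python
from scipy import stats

c = 0.95
z_c = stats.norm.ppf((1 + c) / 2)        # standard normal critical value

rows = []
for df in (5, 10, 30, 100):
    t_c = stats.t.ppf((1 + c) / 2, df)   # t critical value for this d.f.
    sd = stats.t.std(df)                 # standard deviation of the t-distribution
    rows.append((df, t_c, sd))
    print(f"d.f. = {df:3d}   t_c = {t_c:.3f}   std = {sd:.3f}")

print(f"normal       z_c = {z_c:.3f}   std = 1.000")
```

In each row $t_{c}$ is larger than $z_{c}$ and the standard deviation is greater than $1$, and both shrink toward the normal values as the degrees of freedom grow.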