## Chapter 06
# Confidence Intervals

Adopted from ["Elementary Statistics - Picturing the World" 6th edition](https://www.amazon.com/Elementary-Statistics-Picturing-World-6th/dp/0321911210/)

In [1]:
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'scroll': True,
        'width': "100%",
        'height': "100%",
})

{'scroll': True, 'width': '100%', 'height': '100%'}


## 6.1 <br/>Confidence Intervals for the Mean ($\sigma$ Known)

### Estimating Population Parameters

- In this chapter, we will learn an important technique of statistical inference—to use sample statistics to estimate the value of an unknown population parameter.
- In this section and the next, we will learn how to use sample statistics to make an estimate of the population parameter $\mu$ when the population standard deviation $\sigma$ is known (this section) or when $\sigma$ is unknown. 
- To make such an inference, begin by finding a **point estimate**.


### Point Estimate

- A **point estimate** is a single value estimate for a population parameter. 
- The most unbiased point estimate of the population mean $\mu$ is the sample mean $\bar{x}$.
- The validity of an estimation method is increased when you use a sample statistic that is unbiased and has low variability. 
- A sample statistic is unbiased if it does not overestimate or underestimate the population parameter.

### Finding a Point Estimate [example 1]

- An economics researcher is collecting data about grocery store employees in a county. 
- The data listed below represents a random sample of the number of hours worked by $40$ employees from several grocery stores in the county. 
- Find a point estimate of the population mean $\mu$.

![](./image/6_1_ex_1_point_estimate.png)

### Finding a Point Estimate [solution]


- The sample mean of the data is:

$\bar{x} = \frac{\sum{x}}{n} = \frac{1184}{40} = 29.6$

- The point estimate for the mean number of hours worked by grocery store employees in this county is $29.6$ hours.
- The probability that the population mean is exactly $29.6$ is virtually zero. 
- Instead of estimating $\mu$ to be exactly $29.6$ using a point estimate, we can estimate that $\mu$ lies in an interval. 
- This is called making an **interval estimate**.


### Interval Estimate

- An **interval estimate** is an interval, or range of values, used to estimate a population parameter.
- To form an interval estimate, use the point estimate as the center of the interval, and then add and subtract a margin of error. 
- For instance, if the margin of error is $2.1$, then an interval estimate would be given by 
 - $29.6 \pm 2.1$ or 
 - $27.5 < \mu < 31.7$
- The point estimate and interval estimate are shown in the figure.

![](./image/6_1_interval_estimate.png)

- Before finding a margin of error for an interval estimate, we should first determine how confident we need to be that your interval estimate contains the population mean $\mu$.

### Level of Confidence

- The **level of confidence** $c$ is the probability that the interval estimate contains the population parameter, assuming that the estimation process is repeated a large number of times.
- We know from the Central Limit Theorem that when $n \ge 30$, the sampling distribution of sample means is a normal distribution. 
- The level of confidence $c$ is the area under the standard normal curve between the critical values, $-z_{c}$ and $z_{c}$.
- **Critical values** are values that separate sample statistics that are probable from sample statistics that are improbable, or unusual.
- We can see from the figure shown below that $c$ is the percent of the area under the normal curve between $-z_{c}$ and $z_{c}$. 

![](./image/6_1_level_of_confidence_graph.png)

- The area remaining is $1 - c$, so the area in each tail is $\frac{1}{2}(1 - c)$. 
- For instance, if $c = 90\%$, then $5\%$ of the area lies to the left of $-z_{c}= -1.645$ and $5\%$ lies to the right of $z_{c} = 1.645$, as shown in the table.

![](./image/6_1_level_of_confidence_table.png)


### Sampling Error

- The difference between the point estimate and the actual parameter value is called the sampling error. 
- When $\mu$ is estimated, the sampling error is the difference $\bar{x} - \mu$. 
- In most cases, of course, $\mu$ is unknown, and $\bar{x}$ varies from sample to sample. 
- However, you can calculate a maximum value for the error when you know the level of confidence and the sampling distribution.

### Margin of Error

- Given a level of confidence $c$, the margin of error $E$ (sometimes also called the maximum error of estimate or error tolerance) is the greatest possible distance between the point estimate and the value of the parameter it is estimating. 
- For a population mean $\mu$ where $\sigma$ is known, the margin of error is:

  - $E = z_{c} \sigma_{\bar{x}} = z_{c} \frac{\sigma}{\sqrt{n}}$

- when these conditions are met:
 - The sample is random.
 - At least one of the following is true: The population is normally distributed or $n \ge 30$.

### Finding the Margin of Error [example 2]

- Use the data in Example 1 and a $95\%$ confidence level to find the margin of error for the mean number of hours worked by grocery store employees.
- Assume the population standard deviation is $7.9$ hours.

![](./image/6_1_ex_1_point_estimate.png)

### Finding the Margin of Error [solution]

- Because $\sigma$ is known $(\sigma = 7.9)$, the sample is random, and $n \ge 30$, we can use the formula for $E$. 
- The $z$-score that corresponds to a $95\%$ confidence level is $1.96$. 
- This implies that $95\%$ of the area under the standard normal curve falls within $1.96$ standard deviations of the mean. 
- We can approximate the distribution of the sample means with a normal curve by the Central Limit Theorem because $n \ge 30$.

![](./image/6_1_ex2_graph.png)

$E = z_{c} \frac{\sigma}{\sqrt{n}} = 1.96 \times \frac{7.9}{\sqrt{40}} \approx 2.4$

##### Interpretation  
You are $95\%$ confident that the margin of error for the population mean is about $2.4$ hours.

In [2]:
from scipy import stats
import math 

c = 95/100
sigma = 7.9
n = 40
p = 1 - ((1-c) / 2)

zc = stats.norm.ppf(p)

E = zc * sigma / math.sqrt(n)
print(f'E: {E}')

E: 2.448190377603018


### Confidence Intervals for a Population Mean

- Using a point estimate and a margin of error, you can construct an interval estimate of a population parameter such as $\mu$. 
- This interval estimate is called a **confidence interval**.
- A confidence interval for a population mean $\mu$ is
 - $\bar{x} - E < \mu < \bar{x} + E$
- The probability that the confidence interval contains $\mu$ is $c$, assuming that the estimation process is repeated a large number of times.

### Guidelines fo Constructing a Confidence Interval for a Population Mean ($\sigma$ Known)

1. Verify that $\sigma$ is known, the sample is random, and either the population is normally distributed or $n \ge 30$.

2. Find the sample statistic $\bar{x}$.
 - $\bar{x} = \frac{\sum{x}}{n}$

3. Find the critical value $z_{c}$ that corresponds to the given level of confidence.
4. Find the margin of error $E$.
 - $E = z_{c} \frac{\sigma}{\sqrt{n}}$
 
5. Find the left and right endpoints and form the confidence interval.
 - Left endpoint: $\bar{x} - E$
 - Right endpoint: $\bar{x} + E$
 - Interval: $\bar{x} - E < \mu < \bar{x} + E$

### Constructing a Confidence Interval [example 3]

Use the data in Example $1$ to construct a $95\%$ confidence interval for the mean number of hours worked by grocery store employees.

![](./image/6_1_ex_1_point_estimate.png)

### Constructing a Confidence Interval [solution]


- In Examples $1$ and $2$, we found that $\bar{x} = 29.6$ and $E \approx 2.4$. 
- The confidence interval is constructed as shown.

![](./image/6_1_ex_3_confidence_interval.png)

##### Interpretation  
With $95\%$ confidence, you can say that the population mean number of hours worked is between $27.2$ and $32.0$ hours.

In [3]:
# also cover example 4
from scipy import stats
import math

n = 40
sigma = 7.9
x_bar = 29.6

c = 95/100
p = 1 - (1 - c)/2

zc = stats.norm.ppf(p)
E = zc * sigma / math.sqrt(n)
print(E)

print(f'interval: from {x_bar-E} to {x_bar+E}')

2.448190377603018
interval: from 27.151809622396982 to 32.04819037760302


In [4]:
# also cover example 4
from scipy import stats
from math import sqrt

n = 40
sigma = 7.9
x_bar = 29.6
c = 95/100
sigma_x_bar = sigma/sqrt(n)

stats.norm.interval(c, loc=x_bar, scale=sigma_x_bar)

(27.151809622396982, 32.04819037760302)

### Constructing a Confidence Interval [example 5]

- A college admissions director wishes to estimate the mean age of all students currently enrolled. 
- In a random sample of 20 students, the mean age is found to be $22.9$ years. 
- From past studies, the standard deviation is known to be $1.5$ years, and the population is normally distributed. 
- Construct a $90\%$ confidence interval of the population mean age.



### Constructing a Confidence Interval [solution]


- Because $\sigma$ is known, the sample is random, and the population is normally distributed, use the formula for $E$ given in this section. 
- Using $n = 20$, $\bar{x} = 22.9$, $\sigma = 1.5$, and $z_{c} = 1.645$, the margin of error at the $90\%$ confidence level is:
 - $E = z_{c} \frac{\sigma}{\sqrt{n}} = 1.645 \times \frac{1.5}{\sqrt{20}} \approx 0.6$

- The $90\%$ confidence interval can be written as $\bar{x} \pm E \approx 22.9 \pm 0.6$ or as shown below.

![](./image/6_1_ex_5_confidence_interval.png)

##### Interpretation  
- With $90\%$ confidence, you can say that the mean age of all the students is between $22.3$ and $23.5$ years.
- “With  $90\%$ confidence, the mean is in the interval ($22.3$, $23.5$).” 
- This means that when a large number of samples is collected and a confidence interval is created for each sample, approximately $90\%$ of these intervals will contain $\mu$.

In [5]:
from scipy import stats
from math import sqrt

n = 20
sigma = 1.5
x_bar = 22.9
c = 90/100

sigma_x_bar = sigma/sqrt(n)
p = 1 - ((1-c)/2)
zc = stats.norm.ppf(p)
E = zc * sigma_x_bar
print(f'E: {E}')
print(f'interval: from {x_bar - E} to {x_bar + E}')

E: 0.5517006784350859
interval: from 22.348299321564912 to 23.451700678435085


In [6]:
from scipy import stats
from math import sqrt

n = 20
x_bar = 22.9 
sigma = 1.5
c = 90/100
sigma_x_bar = sigma/sqrt(n)

stats.norm.interval(c, loc=x_bar, scale=sigma_x_bar)

(22.348299321564912, 23.451700678435085)

### Sample Size

- For the same sample statistics, as the level of confidence increases, the confidence interval widens. 
- As the confidence interval widens, the precision of the estimate decreases. 
- One way to improve the precision of an estimate without decreasing the level of confidence is to increase the sample size. 
- But how large a sample size is needed to guarantee a certain level of confidence for a given margin of error?
- By using the formula for the margin of error to find the minimum sample size $n$.
- Given a confidence level and a margin of error $E$, the minimum sample size $n$ needed to estimate the population mean $\mu$ is

$$
\begin{aligned}
E &= z_{c} \frac{\sigma}{\sqrt{n}} \\
n &= (\frac{z_{c} \sigma}{E})^{2}
\end{aligned}
$$

- When $\sigma$ is unknown, we can estimate it using $s$, provided you have a preliminary sample with at least $30$ members.

### Determining a Minimum Sample Size [example 6]

- The economics researcher in Example 1 wants to estimate the mean number of hours worked by all grocery store employees in the county. 
- How many employees must be included in the sample to be $95\%$ confident that the sample mean is within $1.5$ hours of the population mean?

### Determining a Minimum Sample Size [solution]

- Using $c = 0.95$, $z_{c} = 1.96$, $\sigma = 7.9$ (from Example 2), and $E = 1.5$, we can solve for the minimum sample size $n$.
 - $n = (\frac{z_{c} \sigma}{E})^{2} = (\frac{1.96 \times 7.9}{1.5})^{2} \approx 106.56$
- When necessary, round up to obtain a whole number. So, the researcher needs at least $107$ grocery store employees in the sample.

##### Interpretation  
- The researcher already has $40$ employees, so the sample needs 67 more members. 
- Note that $107$ is the minimum number of employees to include in the sample. 
- The researcher could include more, if desired.

In [7]:
c = 0.95
zc = 1.96
sigma = 7.9
E = 1.5
n = (zc * sigma / E)**2
print(n)

106.55744711111112


## 6.2 <br/>Confidence Intervals for the Mean ($\sigma$ Unknown) 

### The $t$-Distribution

- In many real-life situations, the population standard deviation ($\sigma$) is unknown.
- So, how can you construct a confidence interval for a population mean when $\sigma$ is not known? 
- For a random variable that is normally distributed (or approximately normally distributed), you can use a $t$-distribution.

### The $t$-Distribution

If the distribution of a random variable $x$ is approximately normal, then

$t = \frac{\bar{x} - \mu}{s / \sqrt(n)}$

Critical values of $t$ are denoted by $t_{c}$ . Here are several properties of the $t$-distribution.

1. The mean, median, and mode of the $t$-distribution are equal to $0$.
2. The $t$-distribution is bell-shaped and symmetric about the mean.
3. The total area under the $t$-distribution curve is equal to $1$.
4. The tails in the $t$-distribution are “thicker” than those in the standard normal distribution.
5. The standard deviation of the $t$-distribution varies with the sample size, but it is greater than $1$.
6. The $t$-distribution is a family of curves, each determined by a parameter called the degrees of freedom. The degrees of freedom (sometimes abbreviated as $d.f.$) are the number of free choices left after a sample statistic such as $\bar{x}$ is calculated. When you use a $t$-distribution to estimate a population mean, the degrees of freedom are equal to one less than the sample size.
$d.f. = n - 1$
7. As the degrees of freedom increase, the $t$-distribution approaches the standard normal distribution, as shown in the figure. After $30$ $d.f.$, the $t$-distribution is close to the standard normal distribution.

![](./image/6_2_t_distrtibution.png)

### Finding Critical Values of $t$ [example 1]

Find the critical value $t_{c}$ for a $95\%$ confidence level when the sample size is $15$.

### Finding Critical Values of $t$ [solution]

- Because $n = 15$, the degrees of freedom are $d.f. = n - 1 = 15 - 1 = 14$.
- Using $d.f. = 14$ and $c = 0.95$, you can find the critical value $t_{c}$ , as shown by the highlighted areas in the table.

![](./image/6_2_critical_values_table.png)

- From the table, you can see that $t_{c} = 2.145$. 
- The figure below shows the $t$-distribution for $14$ degrees of freedom, $c = 0.95$, and $t_{c} = 2.145$.

![](./image/6_2_critical_values_graph.png)

##### Interpretation  
For a $t$-distribution curve with $14$ degrees of freedom, $95\%$ of the area under the curve lies between $t \pm 2.145$.

In [8]:
from scipy.stats import t

n = 15
df = n - 1
c = 95/100
p = 1 - ((1-c)/2)

tc = t.ppf(p, df)
print(f'tc: {tc}')

tc: 2.1447866879169273


### Confidence Intervals and $t$-Distributions

- Constructing a confidence interval for $\mu$ when $\sigma$ is not known using the $t$-distribution is similar to constructing a confidence interval for $\mu$ when $\sigma$ is known using the standard normal distribution—both use a point estimate $\bar{x}$ and a margin of error $E$. 
- When $\sigma$ is not known, the margin of error $E$ is calculated using the sample standard deviation $s$ and the critical value $t_{c}$ . 
- The formula for $E$ is:
 - $E = t_{c} \frac{s}{\sqrt{n}}$
- Before using this formula, verify that the sample is random, and either the population is normally distributed or $n \ge 30$.

### Guidelines for Constructing a Confidence Interval for a Population Mean ($\sigma$ Unknown)

1. Verify that $\sigma$ is not known, the sample is random, and either the population is normally distributed or $n \ge 30$.
2. Find the sample statistics $n$, $\bar{x}$, and $s$.
 - $\bar{x} = \frac{\sum{x}}{n}$
 - $s = \sqrt{\frac{\sum{(x-\bar{x})^{2}}}{n-1}}$
3. Identify the degrees of freedom, the level of confidence $c$, and the critical value $t_{c}$ 
 - $df = n-1$
4. Find the margin of error $E$.
 - $E = t_{c} \frac{s}{\sqrt{n}}$
5. Find the left and right endpoints and form the confidence interval.
 - Left endpoint: $\bar{x} - E$
 - Right endpoint: $\bar{x} + E$
 - Interval: $\bar{x}-E < \mu <\bar{x}+E$

### Constructing a Confidence Interval [example 2]

- You randomly select $16$ coffee shops and measure the temperature of the coffee sold at each. 
- The sample mean temperature is $162^\circ F$ with a sample standard deviation of $10^\circ F$. 
- Construct a $95\%$ confidence interval for the population mean temperature of coffee sold. 
- Assume the temperatures are approximately normally distributed.

### Constructing a Confidence Interval [solution]

- Because $\sigma$ is unknown, the sample is random, and the temperatures are approximately normally distributed, use the $t$-distribution. 
- Using $n = 16$, $\bar{x} = 162$, $s = 10$, $c = 0.95$, and $d.f. = 15$, you can use *Table* to find that $t_{c} = 2.131$. 

- The margin of error at the 95% confidence level is:
 - $E = t_{c} \frac{s}{\sqrt{n}} = 2.131 \times \frac{10}{\sqrt{16}} \approx 5.3$
- The confidence interval is shown below.

![](./image/6_2_ex_2_confidence_interval_1.png)

![](./image/6_2_ex_2_confidence_interval_2.png)

##### Interpretation  
With $95\%$ confidence, you can say that the population mean temperature of coffee sold is between $156.7^\circ F$ and $167.3^\circ F$.

In [9]:
from scipy.stats import t
from math import sqrt

n = 16
df = n - 1
c = 95/100
s = 10
x_bar = 162

p = 1 - ((1-c)/2)
tc = t.ppf(p, df)
print(f'tc: {tc}')

E = tc * s / sqrt(n)
print(f'E: {E}')

print(f'Left End Point: {x_bar - E}')
print(f'Right End Point: {x_bar + E}')

tc: 2.131449545559323
E: 5.328623863898308
Left End Point: 156.67137613610169
Right End Point: 167.32862386389831


In [10]:
from scipy.stats import t
from math import sqrt

n = 16
df = n - 1
c = 95/100
s = 10
x_bar = 162

t.interval(c, df, loc=x_bar, scale=s/sqrt(n))

(156.67137613610169, 167.32862386389831)

### Constructing a Confidence Interval [example 3]

- You randomly select $36$ cars of the same model that were sold at a car dealership and determine the number of days each car sat on the dealership’s lot before it was sold. 
- The sample mean is $9.75$ days, with a sample standard deviation of $2.39$ days. 
- Construct a $99\%$ confidence interval for the population mean number of days the car model sits on the dealership’s lot.

### Constructing a Confidence Interval [solution]

- Because $\sigma$ is unknown, the sample is random, and $n \ge 30$, use the $t$-distribution. 
- Using $n = 36$, $\bar{x} = 9.75$, $s = 2.39$, $c = 0.99$, and $d.f. = 35$, we can use Table to find that $t_{c} = 2.724$. 
- The margin of error at the $99\%$ confidence level is:
 - $E = t_{c} \frac{s}{\sqrt{n}} = 2.724 \times \frac{2.39}{\sqrt{36}} \approx 1.09$
- The confidence interval is constructed as shown.

![](./image/6_2_ex_3_confidence_interval.png)

##### Interpretation  
With $99\%$ confidence, you can say that the population mean number of days the car model sits on the dealership’s lot is between $8.66$ and $10.84$.

In [11]:
from scipy.stats import t
from math import sqrt

n = 36
df = n - 1
c = 99/100
s = 2.39
x_bar = 9.75

p = 1 - ((1-c)/2)
tc = t.ppf(p, df)
print(f'tc: {tc}')

E = tc * s / sqrt(n)
print(f'E: {E}')

print(f'Left End Point: {x_bar - E}')
print(f'Right End Point: {x_bar + E}')

tc: 2.723805589208047
E: 1.0849825597012053
Left End Point: 8.665017440298795
Right End Point: 10.834982559701205


In [12]:
from scipy.stats import t
from math import sqrt

n = 36
df = n - 1
c = 99/100
s = 2.39
x_bar = 9.75

t.interval(c, df, loc=x_bar, scale=s/sqrt(n))

(8.665017440298794, 10.834982559701206)

### Construct a Confidence Interval for a Population Mean

- The flowchart describes when to use the standard normal distribution and when to use the $t$-distribution to construct a confidence interval for a population mean.

![](./image/6_2_confidence_interval_for_a_population.png)

- Notice in the flowchart that when both $n < 30$ and the population is not normally distributed, you cannot use the standard normal distribution or the $t$-distribution.

### Choosing the Standard Normal Distribution or the $t$-Distribution [example 4]

- You randomly select $25$ newly constructed houses. The sample mean construction cost is $\$181,000$ and the population standard deviation is $\$28,000$. 
- Assuming construction costs are normally distributed, should you to construct a $95\%$ confidence interval for the population mean construction cost? 
- Explain your reasoning.

### Choosing the Standard Normal Distribution or the $t$-Distribution [solution]

- Is $\sigma$ known? Yes.
- Is either the population normally distributed or $n \ge 30$? Yes, the population is normally distributed.
- Decision:Use the standard normal distribution.


## 6.3 <br/>Confidence Intervals for Population Proportions

### Point Estimate for a Population Proportion

- The probability of success in a single trial of a binomial experiment is $p$. 
- This probability is a population proportion. 
- In this section, we will learn how to estimate a **population proportion** $p$ using a confidence interval.
- As with confidence intervals for $\mu$, we will start with a point estimate.

### Point Estimate for $p$

- The point estimate for $p$, the population proportion of successes, is given by the proportion of successes in a sample and is denoted by
 - $\hat{p} = \frac{x}{n}$
- where $x$ is the number of successes in the sample and $n$ is the sample size.
- The point estimate for the population proportion of failures is $\hat{q} = 1 - \hat{p}$.
- The symbols $\hat{p}$ and $\hat{q}$ are read as “$p$ hat” and “$q$ hat.”


### Finding a Point Estimate for $p$ [example 1]

- In a survey of $1000$ U.S. teens, $372$ said that they own smartphones. 
- Find a point estimate for the population proportion of U.S. teens who own smartphones.

### Finding a Point Estimate for $p$ [solution]


- Using $n = 1000$ and $x = 372$,

 - $\hat{p} = \frac{x}{n} = \frac{372}{1000} = 0.372 = 37.2\% $

- The point estimate for the population proportion of U.S. teens who own smartphones is $37.2\%$.

### Confidence Intervals for a Population Proportion

- Constructing a confidence interval for a population proportion $p$ is similar to constructing a confidence interval for a population mean. 
- We start with a point estimate and calculate a margin of error.

- A confidence interval for a population proportion $p$ is:
 - $\hat{p} - E < p < \hat{p} + E$
 
- where margin of error for $p$ is:
 - $E = z_{c} \sqrt{\frac{\hat{p}\hat{q}}{n}}$

- The probability that the confidence interval contains $p$ is $c$, assuming that the estimation process is repeated a large number of times.


### Guidelines for Constructing a Confidence Interval for a Population Proportion

1. Identify the sample statistics $n$ and $x$.
2. Find the point estimate $\hat{p}$.
 - $\hat{p} = \frac{x}{n}$
 
3. Verify that the sampling distribution of $\hat{p}$ can be approximated by a normal distribution.
 - $n\hat{p} \ge 5, n \hat{q} \ge 5$
 
4. Find the critical value $z_{c}$ that corresponds to the given level of confidence $c$.
 - We can use ```SciPy```
 
5. Find the margin of error $E$.
 - $E = z_{c} \sqrt{\frac{\hat{p}\hat{q}}{n}}$
 
6. Find the left and right endpoints and form the confidence interval.
 - Left endpoint: $\hat{p} - E$
 - Right endpoint: $\hat{p} + E$
 - Interval: $\hat{p} - E < p < \hat{p} + E$

### Constructing a Confidence Interval for p [example 2]

Use the data in Example $1$ to construct a $95\%$ confidence interval for the population proportion of U.S. teens who own smartphones.

### Constructing a Confidence Interval for p [solution]

- From Example 1, $\hat{p} = 0.372$. 
- The point estimate for the population proportion of failures is:
 - $\hat{q} = 1 - \hat{p} = 1 - 0.372 = 0.628$
 
- Using $n = 1000$, we can verify that the sampling distribution of $\hat{p}$ can be approximated by a normal distribution.
 - $n \hat{p} = 1000 \times 0.372 = 372 \ge 5$
 - $n \hat{q} = 1000 \times 0.628 = 628 \ge 5$

- Using $z_{c} = 1.96$, the margin of error is:
 - $E = z_{c} \sqrt{\frac{\hat{p} \hat{q}}{n}} = 1.96 \times \sqrt{\frac{0.372 \times 0.628}{1000}} \approx 0.03$

- Next, find the left and right endpoints and form the $95\%$ confidence interval.
![](./image/6_3_ex_2_confidence_interval.png)

##### Interpretation  
With $95\%$ confidence, you can say that the population proportion of U.S. teens who own smartphones is between $34.2\%$ and $40.2\%$.

In [13]:
from scipy.stats import t
from math import sqrt 

n = 1000
df = n-1
c = 0.95

p_hat = 0.372
q_hat = 1 - p_hat

zc = t.ppf(1 - ((1-c)/2), df)
print(f'zc: {zc}')

E = zc * sqrt(p_hat * q_hat / n)
print(f'E: {E}')
print(f'Left End Point: {p_hat - E}')
print(f'Right End Point: {p_hat + E}')

zc: 1.9623414611334487
E: 0.029993411898275826
Left End Point: 0.3420065881017242
Right End Point: 0.4019934118982758


In [14]:
t.interval(c, df, loc=p_hat, scale=sqrt(p_hat * q_hat / n))

(0.3420065881017242, 0.4019934118982758)

### Constructing a Confidence Interval for $p$ [example 3]

- The figure shows a survey of $498$ U.S. adults.
- Construct a $99\%$ confidence interval for the population proportion of U.S. adults who think that teenagers are the more dangerous drivers.

![](./image/6_3_dangerous_drivers.png)

### Constructing a Confidence Interval for $p$ [solution]


- From the figure, $\hat{p} = 0.71$. 
 - $\hat{q} = 1 - \hat{p} = 1 - 0.71 = 0.29$.
 
- Using these values and the values $n = 498$ and $z_{c} = 2.575$, the margin of error is:
 - $E = z_{c} \sqrt{\frac{\hat{p} \hat{q}}{n}} = 2.575 \times \sqrt{\frac{0.71 \times 0.29}{498}} \approx 0.052$

- Next, find the left and right endpoints and form the $99\%$ confidence interval.

![](./image/6_3_ex_3_confidence_interval.png)

##### Interpretation  
With $99\%$ confidence, you can say that the population proportion of U.S. adults who think that teenagers are the more dangerous drivers is between $65.8\%$ and $76.2\%$.

In [15]:
from scipy.stats import t
from math import sqrt 

n = 498
df = n-1
c = 0.99

p_hat = 0.71
q_hat = 1 - p_hat

zc = t.ppf(1 - ((1-c)/2), df)
print(f'zc: {zc}')

E = zc * sqrt(p_hat * q_hat / n)
print(f'E: {E}')
print(f'Left End Point: {p_hat - E}')
print(f'Right End Point: {p_hat + E}')

zc: 2.585757619531686
E: 0.052577667429680196
Left End Point: 0.6574223325703198
Right End Point: 0.7625776674296801


In [16]:
t.interval(c, df, loc=p_hat, scale=sqrt(p_hat * q_hat / n))

(0.6574223325703198, 0.7625776674296801)

### Finding a Minimum Sample Size 

- One way to increase the precision of a confidence interval without decreasing the level of confidence is to increase the sample size.
- Given a $c$-confidence level and a margin of error $E$, the minimum sample size $n$ needed to estimate the population proportion $p$ is:

 - $n = \hat{p} \hat{q}(\frac{z_{c}}{E})^{2}$
 
- This formula assumes that we have preliminary estimates of $\hat{p}$ and $\hat{q}$. 
- If not, use $\hat{p} = 0.5$ and $\hat{q} = 0.5$.

### Determining a Minimum Sample Size [example 4]

- You are running a political campaign and wish to estimate, with $95\%$ confidence, the population proportion of registered voters who will vote for your candidate.
- Your estimate must be accurate within $3\%$ of the population proportion. 
- Find the minimum sample size needed when:
 - Q1: no preliminary estimate is available
 - Q2: a preliminary estimate gives $\hat{p} = 0.31$
- Compare your results.

### Determining a Minimum Sample Size [solution]

##### Q1:
- Because we do not have a preliminary estimate of $\hat{p}$, use $\hat{p} = 0.5$ and $\hat{q} = 0.5$. 
- Using $z_{c} = 1.96$ and $E = 0.03$, we can solve for $n$.
 - $n = \hat{p} \hat{q}(\frac{z_{c}}{E})^{2} = 0.5 \times 0.5 \times (\frac{1.96}{0.03})^{2} \approx 1067.11$
- Because $n$ is a decimal, round up to the nearest whole number, $1068$.

##### Q2:
- We have a preliminary estimate of $\hat{p} = 0.31$. 
 - $\hat{q} = 1-\hat{p} = 1 - 0.31 = 0.69$. 
- Using $z_{c} = 1.96$ and $E = 0.03$, we can solve for $n$.
 - $n = \hat{p} \hat{q}(\frac{z_{c}}{E})^{2} = 0.31 \times 0.69 \times (\frac{1.96}{0.03})^{2} \approx 913.02$
- Because $n$ is a decimal, round up to the nearest whole number, $914$.

##### Interpretation  
- With no preliminary estimate, the minimum sample size should be at least $1068$ registered voters. 
- With a preliminary estimate of $\hat{p} = 0.31$, the sample size should be at least $914$ registered voters. 
- So, we will need a larger sample size when no preliminary estimate is available.

## 6.4 <br/>Confidence Intervals for Variance and Standard Deviation

### Point Estimate for Variance and Standard Deviation

- In manufacturing, it is necessary to control the amount that a process varies. 
- For instance, an automobile part manufacturer must produce thousands of parts to be used in the manufacturing process. 
- It is important that the parts vary little or not at all. 
- How can you measure, and consequently control, the amount of variation in the parts? 
- We can start with a **point estimate**.
- The point estimate for $\sigma^{2}$ is $s^{2}$ and the point estimate for $\sigma$ is $s$. 

### The Chi-Square Distribution

- We can use a chi-square distribution to construct a confidence interval for the variance and standard deviation.
- If a random variable $x$ has a normal distribution, then the distribution of $x^{2}$ forms a chi-square distribution for samples of any size $n > 1$.
 - $x^{2} = \frac{(n-1)s^{2}}{\sigma^{2}}$

### Properties of the Chi-Square distribution:

1. All values of $x^{2}$ are greater than or equal to $0$.
2. The chi-square distribution is a family of curves, each determined by the degrees of freedom. To form a confidence interval for $\sigma^{2}$ , use the chi-square distribution with degrees of freedom equal to one less than the sample size.
 - $df = n-1$
 
3. The total area under each chi-square distribution curve is equal to $1$.
4. The chi-square distribution is positively skewed and therefore the distribution is not symmetric.
5. The chi-square distribution is different for each number of degrees of freedom, as shown in the figure. As the degrees of freedom increase, the chi-square distribution approaches a normal distribution.

![](./image/6_4_chi_square_distribution.png)

### Critical Values

- There are two critical values for each level of confidence. 
 - $x_{R}^{2}$ represents the right-tail critical value.
 - $x_{L}^{2}$ represents the left-tail critical value.
 
![](./image/6_4_xr_plot.png)

![](./image/6_4_xl_plot.png)

![](./image/6_4_xl_xr_plot.png)

### Finding Critical Values for $x^{2}$ [example 1]

Find the critical values $x_{R}^{2}$ and $x_{L}^{2}$ for a $95\%$ confidence interval when the sample size is $18$.



### Finding Critical Values for $x^{2}$ [solution]

- Degree of freedom:
 - $df = n - 1 = 18 - 1 = 17$

- The areas to the right of $x_{R}^{2}$ and $x_{L}^{2}$ are:
 - area to the right of $x_{R}^{2} = \frac{1-c}{2} = \frac{1-0.95}{2} = 0.025$
 - area to the right of $x_{L}^{2} = \frac{1+c}{2} = \frac{1+0.95}{2} = 0.975$
 
- Using $df = 17$ and the areas $0.975$ and $0.025$, we can find the critical values, as shown by the highlighted areas in the table.

![](./image/6_4_ex_1_table.png)

- From the table, we can see that $x_{R}^{2} = 30.191$ and $x_{L}^{2} = 7.564$.

![](./image/6_4_ex_1_distribution_plot.png)

##### Interpretation  
For a chi-square distribution curve with $17$ degrees of freedom, $95\%$ of the area under the curve lies between $7.564$ and $30.191$, as shown in the figure at the left.

In [17]:
from scipy.stats import chi2
from math import sqrt

n = 18
df = n-1
c = 0.95

xr2 = (1-c)/2
xl2 = (1+c)/2

print(f'xr2: {xr2}')
print(f'xl2: {xl2}')
print()

xr2_critical = chi2.ppf(1-xr2, df)
xl2_critical = chi2.ppf(1-xl2, df)
print(f'xr2_critical: {xr2_critical}')
print(f'xl2_critical: {xl2_critical}')

xr2: 0.025000000000000022
xl2: 0.975

xr2_critical: 30.19100912163982
xl2_critical: 7.56418644957757


In [18]:
chi2.interval(c, df)

(7.56418644957757, 30.19100912163982)

### Confidence Intervals for $\sigma^{2}$ and $\sigma$

- We can use the critical values $x_{R}^{2}$ and $x_{L}^{2}$ to construct confidence intervals for a population variance and standard deviation. 
- The best point estimate for the variance is $s^{2}$ and the best point estimate for the standard deviation is $s$. 
- Because the chi-square distribution is not symmetric, the confidence interval for $\sigma^{2}$ cannot be written as $s^{2} \pm E$. 
- We must do separate calculations for the endpoints of the confidence interval.

- Confidence Interval for $\sigma^{2}$:
 - $\frac{(n-1)s^{2}}{x_{R}^{2}} < \sigma^{2} < \frac{(n-1)s^{2}}{x_{L}^{2}}$

- Confidence Interval for $\sigma$:
 - $\sqrt{\frac{(n-1)s^{2}}{x_{R}^{2}}} < \sigma < \sqrt{\frac{(n-1)s^{2}}{x_{L}^{2}}}$

- The probability that the confidence intervals contain $\sigma^{2}$ or $\sigma$ is $c$, assuming that the estimation process is repeated a large number of times.

### Guidelines for Constructing a Confidence Interval for a Variance and Standard Deviation

1. Verify that the population has a normal distribution.
2. Identify the sample statistic $n$ and the degrees of freedom.
 - $df = n-1$

3. Find the point estimate $s^{2}$.
 - $s^{2} = \frac{\sum{(x-\bar{x})^2}}{n-1}$

4. Find the critical values $x_{R}^{2}$ and $x_{L}^{2}$ that correspond to the given level of confidence $c$ and the degrees of freedom.
 - using table

5. Find the left and right endpoints and form the confidence interval for the population variance.
 - $\frac{(n-1)s^{2}}{x_{R}^{2}} < \sigma^{2} < \frac{(n-1)s^{2}}{x_{L}^{2}}$
 
6. Find the confidence interval for the population standard deviation by taking the square root of each endpoint.
 - $\sqrt{\frac{(n-1)s^{2}}{x_{R}^{2}}} < \sigma < \sqrt{\frac{(n-1)s^{2}}{x_{L}^{2}}}$


### Constructing Confidence Intervals [example 2]

- You randomly select and weight $30$ samples of an allergy medicine. 
- The sample standard deviation ($s$) is $1.20$ milligrams. 
- Assuming the weights are normally distributed, construct $99\%$ confidence intervals for the population variance ($\sigma^{2}$) and standard deviation ($\sigma$).

### Constructing Confidence Intervals [solution]

- The areas to the right of $x_{R}^{2}$ and $x_{L}^{2}$ are:
 - $x_{R}^{2} = \frac{1-c}{2} = \frac{1-0.99}{2} = 0.005$
 - $x_{L}^{2} = \frac{1+c}{2} = \frac{1+0.99}{2} = 0.995$
 
- Using the values $n = 30$, $df = 29$, and $c = 0.99$, the critical values $x_{R}^{2}$ and $x_{L}^{2}$ are:
 - $x_{R}^{2} = 52.336$    
 - $x_{L}^{2} = 13.121$
 
- Using these critical values and $s^{2} = 1.20$, the confidence interval for $\sigma^{2}$ is:
![](./image/6_4_ex_2_confidence_interval_for_variance.png)

- The confidence interval for $\sigma$ is:
![](./image/6_4_ex_2_confidence_interval_for_standard_deviation.png)

##### Interpretation  
With $99\%$ confidence, we can say that the population variance is between $0.80$ and $3.18$, and the population standard deviation is between $0.89$ and $1.78$ milligrams.

In [19]:
from scipy.stats import chi2
from math import sqrt

n = 30
df = n-1
c = 0.99
s = 1.20
s_2 = s ** 2

xr2 = (1-c)/2
xl2 = (1+c)/2

print(f'xr2: {xr2}')
print(f'xl2: {xl2}')
print()

xr2_critical = chi2.ppf(1-xr2, df)
xl2_critical = chi2.ppf(1-xl2, df)
print(f'xr2_critical: {xr2_critical}')
print(f'xl2_critical: {xl2_critical}')
print()

right_end_point = df * s_2 / xr2_critical
left_end_point = df * s_2 / xl2_critical
print(f'right_end_point (variance): {right_end_point}')
print(f'left_end_point (variance): {left_end_point}')
print()

print(f'right_end_point (sigma): {sqrt(right_end_point)}')
print(f'left_end_point (sigma): {sqrt(left_end_point)}')

xr2: 0.0050000000000000044
xl2: 0.995

xr2_critical: 52.335617785933614
xl2_critical: 13.121148887960413

right_end_point (variance): 0.7979269523636721
left_end_point (variance): 3.1826481321553914

right_end_point (sigma): 0.8932675704197887
left_end_point (sigma): 1.783997794885238
