In [1]:
import math
import scipy.stats as st

import ipywidgets as widgets
from ipywidgets import interact

# Chapter 8: Confidence Intervals

* `Inferential statistics` - use sample data to make generalizations about an unknown population.
    * **The sample data help us to make an estimate of a population parameter**

1. first calculate a **point estimate**
2. then construct interval estimates called **confidence intervals**.

Chapter Goals:
* How to construct and interpret confidence intervals.
* A new distribution - the Student's-t, and how it is used with these intervals.
* Keep in mind that the confidence interval is a random variable.  It is the population parameter that is fixed.
* The sample mean $\bar{x}$ is the **point estimate** for the population mean $\mu$.
* The sample standard deviation $s$ is the **point estimate** for the population standard deviation $\sigma$
* Each $\bar{x}$ and $s$ is called a **statistic**
* A **confidence interval** is also an estimate, but it is an interval of numbers rather than a single value. It provides a _range_ of reasonable values in which we expect the population parameter to fall.
* The `empirical rule`, which applies to bell-shaped distributions, says that in approximately 95% of the samples, the sample mean, $\bar{x}$, will be within two standard deviations of the population mean $\mu$. Here "standard deviation" means the standard deviation of the sampling distribution of $\bar{x}$, which is $\frac{\sigma}{\sqrt{n}}$.
* A **confidence interval** is created for an **unknown population parameter** like the population mean, $\mu$.  Confidence intervals for some parameters have the form:
    * (point estimate - margin of error, point estimate + margin of error)


## Calculating the Confidence Interval
> A confidence interval for a population mean, when the population standard deviation is known, is based on the conclusion of the Central Limit Theorem that the sampling distribution of the sample means follows an approximately normal distribution. Suppose that our sample has a mean of $\bar{x} = 10$ and we have constructed the 90% confidence interval (5, 15) where EBM = 5.

To construct a confidence interval for a single unknown population mean $\mu$, **where the population standard deviation is known**, we need $\bar{x}$ as an estimate for $\mu$ and we need the margin of error. Here, the margin of error is called the **error bound for a population mean** (abbreviated **EBM**).  The sample mean $\bar{x}$ is the **point estimate** of the unknown population mean $\mu$.

The confidence interval estimate will have the form:

(point estimate - error bound, point estimate + error bound) or, in symbols, $(\bar{x} - EBM, \bar{x} + EBM)$

The margin of error (EBM) depends on the **confidence level** (abbreviated **CL**). The confidence level is often considered the probability that the calculated confidence interval estimate will contain the true population parameter. However, it is more accurate to state that the confidence level is the percent of confidence intervals that contain the true population parameter when repeated samples are taken.

**$\alpha$**: the probability that the interval does not contain the unknown population parameter.

* $\alpha + CL = 1$

##### <span style="color:orange">Example 8.1</span>
* Suppose we have collected data from a sample. We know the sample mean but we do not know the mean for the entire population.
* The sample mean is seven, and the error bound for the mean is 2.5.

$\bar{x}=7 \text{ and } EBM = 2.5$

The confidence interval is (7 – 2.5, 7 + 2.5), and calculating the values gives (4.5, 9.5).

If the confidence level (CL) is 95%, then we say that, "We estimate with 95% confidence that the true value of the population mean is between 4.5 and 9.5."

### Calculating the Confidence Interval
To construct a confidence interval estimate for an unknown population mean, we need data from a random sample. The steps to construct and interpret the confidence interval are:
* Calculate the sample mean $\bar{x}$ from the sample data. Remember, in this section we already know the population standard deviation $\sigma$.
* Find the z-score that corresponds to the confidence level.
* Calculate the error bound EBM.
* Construct the confidence interval.
* Write a sentence that interprets the estimate in the context of the situation in the problem. (Explain what the confidence interval means, in the words of the problem.)

### Finding the z-score for the Stated Confidence Level
<span style="color:pink">When we know the population standard deviation σ, we use a standard normal distribution to calculate the error bound EBM and construct the confidence interval</span>. We need to find the value of z that puts an area equal to the confidence level (in decimal form) in the middle of the standard normal distribution Z ~ N(0, 1).

The confidence level, CL, is the area in the middle of the standard normal distribution. CL = 1 – α, so α is the area that is split equally between the two tails. Each of the tails contains an area equal to $\frac{\alpha}{2}$.

The z-score that has an area to the right of $\frac{\alpha}{2}$ is denoted by $z_{\frac{\alpha}{2}}$.

For example, when CL = 0.95, α = 0.05 and $\frac{\alpha}{2} = 0.025$; we write $z_{\frac{\alpha}{2}}=z_{0.025}$.

The area to the right of $z_{0.025}$ is 0.025 and the area to the left of $z_{0.025}$ is 1 – 0.025 = 0.975.

$z_{\frac{\alpha}{2}}=z_{0.025}=1.96$ , using a calculator, computer or a standard normal probability table.
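The lookup can be made repeatable for any confidence level with a small helper. This is only a sketch: the name `z_critical` is illustrative, and it uses the standard library's `statistics.NormalDist` in place of `st.norm.ppf` so that it runs without SciPy.

```python
from statistics import NormalDist

def z_critical(cl):
    """Return z_{alpha/2} for a confidence level cl given as a decimal (0 < cl < 1)."""
    alpha = 1 - cl
    # The area to the LEFT of z_{alpha/2} is 1 - alpha/2.
    return NormalDist(mu=0, sigma=1).inv_cdf(1 - alpha / 2)

for cl in (0.90, 0.95, 0.98):
    print(cl, round(z_critical(cl), 3))
```

Passing `1 - alpha / 2` rather than `alpha / 2` matters: the inverse CDF works with area to the left, while $z_{\frac{\alpha}{2}}$ is defined by area to the right.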

### Calculating the Error Bound (EBM)
The error bound formula for an unknown population mean μ when the population standard deviation σ is known is:
* $EBM=\bigl(z_{\frac{\alpha}{2}}\bigr)\bigl(\frac{\sigma}{\sqrt{n}}\bigr)$

### Constructing the Confidence Interval
The confidence interval estimate has the format $(\bar{x}-EBM, \bar{x}+EBM)$

The graph gives a picture of the entire situation.

$CL+\frac{\alpha}{2}+\frac{\alpha}{2}=CL+\alpha=1$

![image.png](attachment:d55cce77-52f2-4549-8fb8-40fa2f911372.png)
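The z-score, error bound, and interval steps can be combined into one function. The sketch below assumes $\sigma$ is known; the function name `z_interval` and the input values in the usage line are made up for illustration, not taken from the chapter.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, cl):
    """Confidence interval for a mean when the population SD sigma is known."""
    alpha = 1 - cl
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    ebm = z * sigma / math.sqrt(n)            # error bound for the mean
    return x_bar - ebm, x_bar + ebm

# Hypothetical inputs: sample mean 10, sigma 2, n = 16, 95% confidence.
low, high = z_interval(x_bar=10, sigma=2, n=16, cl=0.95)
print(round(low, 2), round(high, 2))
```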

### Writing the Interpretation
The interpretation should clearly state the confidence level (CL), explain what population parameter is being estimated (here, a population mean), and state the confidence interval (both endpoints). "We estimate with ___% confidence that the true **population mean** (include the context of the problem) is between ___ and ___ (include appropriate units)."

##### <span style="color:orange">Example 8.2</span>
Suppose scores on exams in statistics are normally distributed with an unknown population mean and a population standard deviation of three points. A random sample of 36 scores is taken and gives a sample mean (sample mean score) of 68. Find a confidence interval estimate for the population mean exam score (the mean score on all exams).

Find a 90% confidence interval for the true (population) mean of statistics exam scores.

so,
* $\mu$ = ?
* $\sigma = 3$ points
* $n=36$
* $\bar{x}=68$

For a 90% confidence interval, that means:
* $CL = 0.90$
* $\alpha = 0.1$
* $\therefore \frac{\alpha}{2} = 0.05$
* $z_{\frac{\alpha}{2}}=z_{0.05} = 1.645$

In [2]:
st.norm.ppf(1 - 0.05, loc=0, scale=1)

1.6448536269514722

$EBM = (1.645)(\frac{3}{\sqrt{36}}) = 0.8225$

$\bar{x} - EBM = 68 - 0.8225 = 67.1775$

$\bar{x} + EBM = 68 + 0.8225 = 68.8225$

The 90% confidence interval is (67.1775, 68.8225).

Interpretation: We estimate with 90% confidence that the true population mean exam score for all statistics students is between 67.18 and 68.82.

Explanation of 90% Confidence Level: Ninety percent of all confidence intervals constructed in this way contain the true mean statistics exam score. For example, if we constructed 100 of these confidence intervals, we would expect 90 of them to contain the true population mean exam score.
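The repeated-sampling interpretation can be checked with a simulation. The sketch below invents a "true" mean of 68 (something we would never know in practice) and draws many samples of size 36 from a normal population with $\sigma = 3$; the seed and the number of trials are arbitrary choices.

```python
import math
import random
from statistics import NormalDist, mean

random.seed(0)                         # arbitrary seed, for reproducibility only
mu, sigma, n, cl = 68, 3, 36, 0.90     # mu is known here only because we simulate
z = NormalDist().inv_cdf(1 - (1 - cl) / 2)
ebm = z * sigma / math.sqrt(n)

trials = 2000
hits = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = mean(sample)
    # Does this sample's interval capture the fixed population mean?
    if x_bar - ebm <= mu <= x_bar + ebm:
        hits += 1

coverage = hits / trials
print(coverage)
```

The printed fraction should land near 0.90. Each interval moves from sample to sample while $\mu$ stays fixed, which is why the interval, not the parameter, is the random quantity.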

##### <span style="color:orange">Example 8.3</span>
|Phone Model	|SAR	|Phone Model	|SAR	|Phone Model	|SAR|
|--|--|--|--|--|--|
|Apple iPhone 4S	|1.11	|LG Ally	|1.36	|Pantech Laser	|0.74|
|BlackBerry Pearl 8120	|1.48	|LG AX275	|1.34	|Samsung Character	|0.5|
|BlackBerry Tour 9630	|1.43	|LG Cosmos	|1.18	|Samsung Epic 4G Touch	|0.4|
|Cricket TXTM8	|1.3	|LG CU515	|1.3	|Samsung M240	|0.867|
|HP/Palm Centro	|1.09	|LG Trax CU575	|1.26	|Samsung Messager III SCH-R750	|0.68|
|HTC One V	|0.455	|Motorola Q9h	|1.29	|Samsung Nexus S	|0.51|
|HTC Touch Pro 2	|1.41	|Motorola Razr2 V8	|0.36	|Samsung SGH-A227	|1.13|
|Huawei M835 Ideos	|0.82	|Motorola Razr2 V9	|0.52	|SGH-a107 GoPhone	|0.3|
|Kyocera DuraPlus	|0.78	|Motorola V195s	|1.6	|Sony W350a	|1.48|
|Kyocera K127 Marbl	|1.25	|Nokia 1680	|1.39	|T-Mobile Concord	|1.38|

Find a 98% confidence interval for the true (population) mean of the Specific Absorption Rates (SARs) for cell phones. Assume that the population standard deviation is σ = 0.337.

So,
* $\mu = ?$
* $\sigma = 0.337$
* $n=30$
* Use $CL=0.98$
* $\therefore \alpha = 0.02$
* $\frac{\alpha}{2}= 0.01$
* $z_{\frac{\alpha}{2}}=z_{0.01} = 2.326$

![image.png](attachment:02adf254-1947-4912-a89c-b7129be2ad6e.png)

In [3]:
phone_data = [1.11, 1.48, 1.43, 1.3, 1.09, 0.455, 1.41, 0.82, 0.78, 1.25,
             1.36, 1.34, 1.18, 1.3, 1.26, 1.29, 0.36, 0.52, 1.6, 1.39,
             0.74, 0.5, 0.4, 0.867, 0.68, 0.51, 1.13, 0.3, 1.48, 1.38]

In [4]:
sum(phone_data)/len(phone_data)

1.0237333333333332

$\therefore \bar{x}=1.0237$

In [5]:
st.norm.ppf(1 - 0.01, loc=0, scale=1)

2.3263478740408408

* $EBM=\bigl(z_{\frac{\alpha}{2}}\bigr)\bigl(\frac{\sigma}{\sqrt{n}}\bigr)$

In [6]:
(2.326)*(0.337/math.sqrt(30))

0.1431129664570382

To find the 98% confidence interval, find $\bar{x} \pm EBM$.

In [7]:
1.0237 + 0.14311

1.1668100000000001

In [8]:
1.0237 - 0.14311

0.8805900000000001

We estimate with 98% confidence that the true SAR mean for the population of cell phones in the United States is between 0.8806 and 1.1668 watts per kilogram.

##### <span style="color:orange">Example 8.4</span>

Suppose we change the original problem in <span style="color:orange">Example 8.2</span> by using a 95% confidence level. Find a 95% confidence interval for the true (population) mean statistics exam score.

so,
* $\mu$ = ?
* $\sigma = 3$ points
* $n=36$
* $\bar{x}=68$

For a 95% confidence interval, that means:
* $CL = 0.95$
* $\alpha = 0.05$
* $\therefore \frac{\alpha}{2} = 0.025$
* $z_{\frac{\alpha}{2}}=z_{0.025} = 1.96$

In [9]:
st.norm.ppf(1-0.025, loc=0, scale=1)

1.959963984540054

* $EBM=\bigl(z_{\frac{\alpha}{2}}\bigr)\bigl(\frac{\sigma}{\sqrt{n}}\bigr)$

$EBM = (1.96)(\frac{3}{\sqrt{36}}) = 0.98$

$\bar{x} - EBM = 68 - 0.98 = 67.02$

$\bar{x} + EBM = 68 + 0.98 = 68.98$

The 95% confidence interval is (67.02, 68.98).

We estimate with 95% confidence that the true population mean for all statistics exam scores is between 67.02 and 68.98.

Explanation of 95% Confidence Level: Ninety-five percent of all confidence intervals constructed in this way contain the true value of the population mean statistics exam score.

Comparing the results: The 90% confidence interval is (67.18, 68.82). The 95% confidence interval is (67.02, 68.98). The 95% confidence interval is wider. If you look at the graphs, because the area 0.95 is larger than the area 0.90, it makes sense that the 95% confidence interval is wider. To be more confident that the confidence interval actually does contain the true value of the population mean for all statistics exam scores, the confidence interval necessarily needs to be wider.

![image.png](attachment:fe4f3b40-f4bd-41e5-b5a0-8bc4baafa380.png)

**Summary: Effect of Changing the Confidence Level**
* Increasing the confidence level increases the error bound, making the confidence interval wider.
* Decreasing the confidence level decreases the error bound, making the confidence interval narrower.
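The effect can be tabulated directly. This sketch reuses the exam-score setup ($\sigma = 3$, $n = 36$); the 99% level is added for comparison and does not come from the examples above.

```python
import math
from statistics import NormalDist

sigma, n = 3, 36
ebm_by_cl = {}
for cl in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf(1 - (1 - cl) / 2)   # z_{alpha/2}
    ebm_by_cl[cl] = z * sigma / math.sqrt(n)
    print(f"CL={cl:.2f}  z={z:.3f}  EBM={ebm_by_cl[cl]:.4f}")
```

Only $z_{\frac{\alpha}{2}}$ changes between rows, so the error bound grows in step with the critical value.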

##### <span style="color:orange">Example 8.5</span>
Suppose we change the original problem in Example 8.2 to see what happens to the error bound if the sample size is changed.

Leave everything the same except the sample size. Use the original 90% confidence level. What happens to the error bound and the confidence interval if we increase the sample size and use n = 100 instead of n = 36? What happens if we decrease the sample size to n = 25 instead of n = 36?
* $\bar{x} = 68$
* $EBM = \bigl(z_{\frac{\alpha}{2}}\bigr)\bigl(\frac{\sigma}{\sqrt{n}}\bigr)$
* $\sigma=3$; the confidence level is 90% (CL = 0.90); $z_{\frac{\alpha}{2}}=z_{0.05}=1.645$

___
If we **increase** the sample size n to 100, we **decrease** the error bound.

When $n=100$: $EBM=\bigl( z_{\frac{\alpha}{2}} \bigr) \bigl(\frac{\sigma}{\sqrt{n}}\bigr)=(1.645)(\frac{3}{\sqrt{100}})=0.4935$

___
If we **decrease** the sample size n to 25, we **increase** the error bound.

When $n=25$: $EBM=\bigl( z_{\frac{\alpha}{2}} \bigr) \bigl(\frac{\sigma}{\sqrt{n}}\bigr)=(1.645)(\frac{3}{\sqrt{25}})=0.987$

___
Summary: Effect of Changing the Sample Size
* Increasing the sample size causes the error bound to decrease, making the confidence interval narrower.
* Decreasing the sample size causes the error bound to increase, making the confidence interval wider.

___

In the widget below, the slider sets the sample size $n$, and the output is the resulting error bound (EBM) at the 90% confidence level with $\sigma = 3$.

In [10]:
def ebm_play(sample_size):
    """Error bound (EBM) for Example 8.2 at 90% confidence: z = 1.645, sigma = 3."""
    return 1.645*(3/math.sqrt(sample_size))

interact(ebm_play, sample_size=widgets.IntSlider(value=25, min=2, max=1500));


## Working Backwards to Find the Error Bound or Sample Mean
When we calculate a confidence interval we:
* find the sample mean
* calculate the error bound

Then we use those to calculate the confidence interval.

If we know the confidence interval, we can work backwards to find both the error bound and the sample mean.

### Finding the Error Bound
* From the upper value for the interval, subtract the sample mean,
* OR, from the upper value for the interval, subtract the lower value. Then divide the difference by two.

### Finding the Sample Mean
* Subtract the error bound from the upper value of the confidence interval,
* OR, average the upper and lower endpoints of the confidence interval.

_Notice that there are two methods to perform each calculation. You can choose the method that is easier to use with the information you know._

##### <span style="color:orange">Example 8.6</span>
Suppose we know that a confidence interval is **(67.18, 68.82)** and we want to find the error bound. We may know that the sample mean is 68, or perhaps our source only gave the confidence interval and did not tell us the value of the sample mean.

**Calculate the Error Bound:**
* If we know that the sample mean is 68: $EBM = 68.82-68 = 0.82$
* If we don't know the sample mean: $EBM =  \frac{(68.82-67.18)}{2}  = 0.82$

**Calculate the Sample Mean:**
* If we know the error bound:  $\bar{x}= 68.82-0.82 = 68$
* If we don't know the error bound:  $\bar{x}=\frac{(67.18+68.82)}{2}  = 68$
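Both recoveries need only the two endpoints. A minimal sketch, with `from_interval` as an illustrative name:

```python
def from_interval(lower, upper):
    """Recover the sample mean and error bound from a symmetric confidence interval."""
    x_bar = (lower + upper) / 2   # midpoint of the interval
    ebm = (upper - lower) / 2     # half the interval width
    return x_bar, ebm

x_bar, ebm = from_interval(67.18, 68.82)
print(round(x_bar, 2), round(ebm, 2))
```

This works only for intervals of the form $(\bar{x} - EBM, \bar{x} + EBM)$, which are symmetric about the point estimate.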

### Calculating the Sample Size $n$
If researchers desire a specific margin of error, then they can use the error bound formula to calculate the required sample size.

The error bound formula for a population mean when the population standard deviation is known is:

$EBM=\bigl(z_{\frac{\alpha}{2}}\bigr) \bigl(\frac{\sigma}{\sqrt{n}}\bigr)$

The formula for the sample size is $n=\frac{z^2 \sigma^2}{EBM^2}$ , found by solving the error bound formula for n.

In this formula, $z$ is $z_{\frac{\alpha}{2}}$, corresponding to the desired confidence level. A researcher planning a study who wants a specified confidence level and error bound can use this formula to calculate the size of the sample needed for the study.

##### <span style="color:orange">Example 8.7</span>
The population standard deviation for the age of Foothill College students is 15 years. If we want to be 95% confident that the sample mean age is within two years of the true population mean age of Foothill College students, how many randomly selected Foothill College students must be surveyed?

* From the problem, we know that σ = 15 and EBM = 2.
* $z = z_{0.025} = 1.96$, because the confidence level is 95%.
* $n =  \frac{z^2 \sigma^2}{{EBM}^2}  =  \frac{(1.96)^2 (15)^2}{2^2}  = 216.09$ using the sample size equation.
* Use $n = 217$: Always round the answer UP to the next higher integer to ensure that the sample size is large enough.

Therefore, 217 Foothill College students should be surveyed in order to be 95% confident that we are within two years of the true population mean age of Foothill College students.
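The same calculation as a reusable function, with the round-up step made explicit through `math.ceil`. The function name is illustrative, and the exact critical value is used instead of the rounded 1.96, so the unrounded result differs slightly from 216.09.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, ebm, cl):
    """Smallest n such that the error bound for the mean is at most ebm (sigma known)."""
    z = NormalDist().inv_cdf(1 - (1 - cl) / 2)   # z_{alpha/2}
    n_exact = (z * sigma / ebm) ** 2             # n = z^2 sigma^2 / EBM^2
    return math.ceil(n_exact)                    # always round UP

n_needed = sample_size_for_mean(sigma=15, ebm=2, cl=0.95)
print(n_needed)
```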

___

## A Single Population Mean using the Student t Distribution

<span style="color:yellow">Use when:</span>
* <span style="color:yellow">Population standard deviation is unknown and</span>
* <span style="color:yellow">the distribution of the sample mean is approximately normal</span>

In practice, we rarely know the population **standard deviation**

If you draw a simple random sample of size $n$ from a population that has an approximately normal distribution with mean $\mu$ and unknown population standard deviation $\sigma$ and calculate the t-score $t=\frac{\bar{x} - \mu}{\bigl(\frac{s}{\sqrt{n}}\bigr)}$ ,  then the t-scores follow a `Student's t-distribution` **with n-1 degrees of freedom**. 

The t-score has the same interpretation as the **z-score**.  It measures how far $\bar{x}$ is from its mean $\mu$. <span style="color:pink">_For each sample size $n$, there is a different Student's t-distribution_</span>.

`degrees of freedom (df)`: $n-1$
 
**Properties of the Student's t-Distribution**
* The graph is similar to the standard normal curve.
* $\mu=0$ , the distribution is symmetric about zero.
* The tails have more probability than the standard normal distribution because the spread is greater than that of the normal. So the t-distribution will be thicker in the tails and shorter in the center than the standard normal distribution.
* The exact shape depends on the degrees of freedom. As the df increases, the graph of the Student's t-distribution becomes more like the graph of the standard normal distribution.
* The underlying population of individual observations is assumed to be normally distributed with unknown population mean μ and unknown population standard deviation σ. The size of the underlying population is generally not relevant unless it is very small. If it is bell shaped (normal) then the assumption is met and doesn't need discussion. Random sampling is assumed, but that is a completely separate assumption from normality.
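The convergence toward the standard normal can be seen in the critical values. The sketch below compares $t_{0.025}$ for several arbitrarily chosen degrees of freedom with $z_{0.025}$, using the same `scipy.stats` calls as the rest of this notebook.

```python
import scipy.stats as st

z = st.norm.ppf(1 - 0.025)   # standard normal critical value, about 1.96

# t critical values with 0.025 in the right tail, for increasing df
t_values = {df: st.t.ppf(1 - 0.025, df=df) for df in (2, 5, 14, 30, 100, 1000)}

for df, t in t_values.items():
    print(f"df={df:>4}  t={t:.4f}  t - z = {t - z:.4f}")
```

Every t critical value exceeds the z value because of the heavier tails, and the gap shrinks as df grows.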

The t-distribution tables take as input the: confidence level (column headings) and the degrees of freedom (rows).

**The notation for the Student's t-distribution (using $T$ as the random variable) is**:
* $T \sim t_{\text{df}}$ where $\text{df}=n-1$
* For example, if we have a sample of size $n=20$ items, then we calculate the degrees of freedom as $\text{df}=n-1=20-1=19$ and we write the distribution as $T\sim t_{19}$

**If the population standard deviation is not known**, the **error bound for a population mean** is:
* $EBM = \bigl(t_{\frac{\alpha}{2}}\bigr)\bigl(\frac{s}{\sqrt{n}}\bigr)$
* $t_{\frac{\alpha}{2}}$ is the t-score with area to the right equal to $\frac{\alpha}{2}$
* use $\text{df}=n-1$ degrees of freedom, and
* $s = $ sample standard deviation

**The format for the confidence interval is**:

$(\bar{x} - EBM, \bar{x} + EBM)$

##### <span style="color:orange">Example 8.8</span>
Suppose you do a study of acupuncture to determine how effective it is in relieving pain. You measure sensory rates for 15 subjects with the results given. Use the sample data to construct a 95% confidence interval for the mean sensory rate for the population (assumed normal) from which you took the data.
8.6; 9.4; 7.9; 6.8; 8.3; 7.3; 9.2; 9.6; 8.7; 11.4; 10.3; 5.4; 8.1; 5.5; 6.9

So,
* $n=15$
* Confidence level: CL = 0.95

In [11]:
data = [8.6, 9.4, 7.9, 6.8, 8.3, 7.3, 9.2, 9.6, 8.7, 11.4, 10.3, 5.4, 8.1, 5.5, 6.9]

In [12]:
sum(data)/len(data)

8.226666666666667

In [13]:
len(data)

15

In [14]:
st.tstd(data)

1.6722383060978339

$\therefore$
* $\text{df} = 15 - 1 = 14$
* $\bar{x} = 8.2267$
* $s=1.6722$
* $\alpha = 1 - 0.95 = 0.05$
* $\frac{\alpha}{2}=\frac{0.05}{2}=0.025$, then $t_{\frac{\alpha}{2}}=t_{0.025}$

In [15]:
st.t.ppf(1 - 0.025, df=14)

2.1447866879169273

$EBM = \bigl(t_\frac{\alpha}{2}\bigr)\bigl(\frac{s}{\sqrt{n}}\bigr)$

$EBM = \bigl(2.14\bigr)\bigl(\frac{1.6722}{\sqrt{15}}\bigr)=0.924$

$\bar{x} - EBM = 8.2267 - 0.9240 = 7.30$

$\bar{x} + EBM = 8.2267 + 0.9240 = 9.15$

The 95% confidence interval is (7.30, 9.15).

We estimate with 95% confidence that the true population mean sensory rate is between 7.30 and 9.15.
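The steps of this example can be collected into one function. This is a sketch under the same assumptions (approximately normal population, $\sigma$ unknown); `t_interval` is an illustrative name, and `statistics.stdev` is used for the sample standard deviation with the $n-1$ divisor.

```python
import math
import statistics
import scipy.stats as st

def t_interval(data, cl):
    """Confidence interval for a mean when the population SD is unknown."""
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)                     # sample SD (n - 1 divisor)
    t = st.t.ppf(1 - (1 - cl) / 2, df=n - 1)       # t_{alpha/2} with n - 1 df
    ebm = t * s / math.sqrt(n)
    return x_bar - ebm, x_bar + ebm

rates = [8.6, 9.4, 7.9, 6.8, 8.3, 7.3, 9.2, 9.6, 8.7, 11.4, 10.3, 5.4, 8.1, 5.5, 6.9]
low, high = t_interval(rates, cl=0.95)
print(round(low, 2), round(high, 2))
```

Because the function keeps full precision for $t$ and $s$, the endpoints can differ in the third decimal place from a hand calculation that rounds $t$ to 2.14.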

##### <span style="color:orange">Example 8.9</span>
The Human Toxome Project (HTP) is working to understand the scope of industrial pollution in the human body. Industrial chemicals may enter the body through pollution or as ingredients in consumer products. In October 2008, the scientists at HTP tested cord blood samples for 20 newborn infants in the United States. The cord blood of the "In utero/newborn" group was tested for 430 industrial compounds, pollutants, and other chemicals, including chemicals linked to brain and nervous system toxicity, immune system toxicity, reproductive toxicity, and fertility problems. There are health concerns about the effects of some chemicals on the brain and nervous system. Table 8.3 shows how many of the targeted chemicals were found in each infant’s cord blood.

In [16]:
data = [79, 145, 147, 160, 116, 100, 159, 151, 156, 126,
        137, 83, 156, 94, 121, 144, 123, 114, 139, 99]

Use this sample data to construct a 90% confidence interval for the mean number of targeted industrial chemicals to be found in an infant’s blood.

In [17]:
len(data)

20

In [18]:
sum(data)/len(data)

127.45

In [19]:
st.tstd(data)

25.964500055997508

$\therefore$
* $n=20$
* $\text{df}=20-1=19$
* $\bar{x}=127.45$
* $s=25.9645$
* $\text{CL}=0.90$
* $\alpha = 1 - 0.90 = 0.1$
* $\frac{\alpha}{2}=\frac{0.1}{2}=0.05$, then $t_{\frac{\alpha}{2}}=t_{0.05}$ 

In [20]:
st.t.ppf(1 - 0.05, df=19)

1.729132811521367

$EBM = \bigl(t_\frac{\alpha}{2}\bigr)\bigl(\frac{s}{\sqrt{n}}\bigr)$

$EBM = \bigl(1.729\bigr)\bigl(\frac{25.965}{\sqrt{20}}\bigr)=10.038$

$\bar{x} - EBM = 127.45 - 10.038 = 117.412$

$\bar{x} + EBM = 127.45 + 10.038 = 137.488$

The 90% confidence interval is (117.412, 137.488).

We estimate with 90% confidence that the mean number of all targeted industrial chemicals found in cord blood in the United States is between 117.412 and 137.488.

In [21]:
1.729*(25.965/math.sqrt(20))

10.038488420686715

In [22]:
127.45-10.038

117.412

## A Population Proportion

The procedure to find the confidence interval, the sample size, the **error bound**, and the **confidence level** for a proportion is similar to that for the population mean, but the formulas are different.

**How do you know you are dealing with a proportion problem?** First, the underlying distribution is a **binomial distribution**. (There is no mention of a mean or average.)

The binomial is a type of distribution that has two possible outcomes

If,
* $X$ is a binomial random variable
* $n$ is the number of trials
* $p$ is the probability of a success
* $B$ is a binomial distribution

then,
* $X\sim B(n, p)$

To form a proportion, take $X$, the random variable for the number of successes, and divide it by $n$, the number of trials (or the sample size). The random variable $P^\prime$ (read "P prime") is that proportion.
* $P^\prime = \frac{X}{n}$

**Note**: Sometimes the random variable is denoted as $\hat{P}$ (read "P hat").

<span style="color:yellow">When $n$ is large and $p$ is not close to zero or one, we can use the **normal distribution** to approximate the binomial.</span>

$X\sim N(np, \sqrt{npq})$

**$P^\prime$ follows a normal distribution for proportions:** $\frac{X}{n}=P^\prime \sim N\bigl(\frac{np}{n}, \frac{\sqrt{npq}}{n}\bigr)$

The confidence interval has the form $(p^\prime - EBP, p^\prime + EBP)$
* EBP: Error bound for the proportion.
* $p^\prime = \frac{x}{n}$
* $p^\prime =$ the **estimated proportion** of successes ($p^\prime$ is a **point estimate** for p, the true proportion.)
* $x = $ the **number of successes**
* $n = $ the size of the sample

**The error bound for a proportion is**
* $EBP = \bigl(z_{\frac{\alpha}{2}} \bigr)\bigl(\sqrt{\frac{p^\prime q^\prime}{n}}\bigr)$ where $q^\prime = 1 - p^\prime$

Note: For a mean, when the population standard deviation is known, the appropriate standard deviation that we use is $\frac{\sigma}{\sqrt{n}}$.  For a **proportion**, the appropriate standard deviation is $\sqrt{\frac{pq}{n}}$. However, in the **error bound formula** we use $\sqrt{\frac{p^\prime q^\prime}{n}}$.

In the error bound formula, the **sample proportions $p^\prime$ and $q^\prime$ are estimates of the unknown population proportions $p$ and $q$**.  The estimated proportions $p^\prime$ and $q^\prime$ are used because $p$ and $q$ are not known. The sample proportions $p^\prime$ and $q^\prime$ are calculated from the data:
* $p^\prime$ is the estimated proportion of successes
* $q^\prime$ is the estimated proportion of failures

<span style="color:yellow">The confidence interval can be used only if the number of successes $np^\prime$ and the number of failures $nq^\prime$ are both greater than five.</span>

##### <span style="color:orange">Example 8.10</span>

Suppose that a market research firm is hired to estimate the percent of adults living in a large city who have cell phones. Five hundred randomly selected adult residents in this city are surveyed to determine whether they have cell phones. Of the 500 people surveyed, 421 responded yes - they own cell phones. Using a 95% confidence level, compute a confidence interval estimate for the true proportion of adult residents of this city who have cell phones.

So,
* This is a proportion problem, because the underlying distribution is a binomial distribution (do they have a cell phone: yes/no)
* $\therefore X\sim B\bigl(500, \frac{421}{500}\bigr)$

To calculate the confidence interval, you must find $p^\prime$, $q^\prime$, and EBP
* $n=500$
* $p^\prime = \frac{421}{500}=0.842$ (for yes they have a cell phone)
* $q^\prime = 1-0.842=0.158$
* $\text{CL}=0.95$
* $\alpha = 0.05$
* $z_{\frac{\alpha}{2}}=z_{0.025}=1.96$
* $EBP = (1.96)(\sqrt{\frac{0.842*0.158}{500}})=0.032$

$\therefore$
* The confidence interval is:
    * (0.842 - 0.032, 0.842 + 0.032)
    
Interpretation: We estimate with 95% confidence that between 81% and 87.4% of all adult residents of this city have cell phones.

Explanation of 95% Confidence Level: Ninety-five percent of the confidence intervals constructed in this way would contain the true value for the population proportion of all adult residents of this city who have cell phones.

In [23]:
421/500

0.842

In [24]:
0.05/2

0.025

In [25]:
st.norm.ppf(1-0.025, loc=0, scale=1)

1.959963984540054

In [26]:
1-0.842

0.15800000000000003

In [27]:
1.96*math.sqrt((0.842*0.158)/500)

0.031970958621849295

In [28]:
0.842-0.032

0.8099999999999999

In [29]:
0.842+0.032

0.874
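The cell-by-cell arithmetic above can be wrapped in one function. This is a sketch: `proportion_interval` is an illustrative name, the check on successes and failures follows the "both greater than five" rule stated earlier, and `statistics.NormalDist` stands in for `st.norm.ppf` so the block needs only the standard library.

```python
import math
from statistics import NormalDist

def proportion_interval(x, n, cl):
    """Confidence interval for a population proportion from x successes in n trials."""
    if x <= 5 or n - x <= 5:
        raise ValueError("need more than five successes and more than five failures")
    p_hat = x / n                                    # p' (point estimate)
    q_hat = 1 - p_hat                                # q'
    z = NormalDist().inv_cdf(1 - (1 - cl) / 2)       # z_{alpha/2}
    ebp = z * math.sqrt(p_hat * q_hat / n)           # error bound for the proportion
    return p_hat - ebp, p_hat + ebp

low, high = proportion_interval(x=421, n=500, cl=0.95)
print(round(low, 3), round(high, 3))
```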

##### <span style="color:orange">Example 8.11</span>

For a class project, a political science student at a large university wants to estimate the percent of students who are registered voters. He surveys 500 students and finds that 300 are registered voters. Compute a 90% confidence interval for the true percent of students who are registered voters, and interpret the confidence interval.

So,
* This is a proportion problem, because the underlying distribution is a binomial distribution (are they a registered voter: yes/no)
* $\therefore X\sim B\bigl(500, \frac{300}{500}\bigr)$

To calculate the confidence interval, you must find $p^\prime$, $q^\prime$, and EBP
* $n=500$
* $p^\prime = \frac{300}{500}=0.6$ (for yes they are a registered voter)
* $q^\prime = 1-0.6=0.4$
* $\text{CL}=0.90$
* $\alpha = 0.1$
* $z_{\frac{\alpha}{2}}=z_{0.05}=1.645$
* $EBP = (1.645)(\sqrt{\frac{0.6*0.4}{500}})=0.036$

$\therefore$
* The confidence interval is:
    * (0.6 - 0.036, 0.6 + 0.036)
    
The confidence interval for the true binomial population proportion is $(p^\prime - EBP, p^\prime + EBP) = (0.564,0.636)$.

* Interpretation: We estimate with 90% confidence that the true percent of all students that are registered voters is between 56.4% and 63.6%.
* Alternate Wording: We estimate with 90% confidence that between 56.4% and 63.6% of ALL students are registered voters.

Explanation of 90% Confidence Level: Ninety percent of all confidence intervals constructed in this way contain the true value for the population percent of students that are registered voters.

In [30]:
300/500

0.6

In [31]:
st.norm.ppf(1-0.05, loc=0, scale=1)

1.6448536269514722

In [32]:
1.645*math.sqrt(((.6*.4)/500))

0.036040144283839934

In [33]:
(0.6 - 0.036, 0.6 + 0.036)

(0.564, 0.636)

### "Plus Four" Confidence Interval for $p$
There is a certain amount of error introduced into the process of calculating a confidence interval for a proportion. Because we do not know the true proportion for the population, we are forced to use point estimates to calculate the appropriate standard deviation of the sampling distribution. Studies have shown that the resulting estimation of the standard deviation can be flawed.

Fortunately, there is a simple adjustment that allows us to produce more accurate confidence intervals. We simply pretend that we have four additional observations. Two of these observations are successes and two are failures. The new sample size, then, is $n + 4$, and the new count of successes is $x + 2$.

<span style="color:yellow">Computer studies have demonstrated the effectiveness of this method. It should be used when the confidence level desired is at least 90% and the sample size is at least ten.</span>

##### <span style="color:orange">Example 8.12</span>

A random sample of 25 statistics students was asked: “Have you smoked a cigarette in the past week?” Six students reported smoking within the past week. Use the “plus-four” method to find a 95% confidence interval for the true proportion of statistics students who smoke.

So,
* This is a proportion problem, because the underlying distribution is a binomial distribution (have they smoked a cigarette in the past week: yes/no)
* $\therefore X\sim B\bigl(29, \frac{8}{29}\bigr)$

To calculate the confidence interval, you must find $p^\prime$, $q^\prime$, and EBP
* $n=25$
    * $n+4=29$
* $p^\prime = \frac{8}{29}=0.276$ (for yes they smoked)
* $q^\prime = 1-0.276=0.724$
* $\text{CL}=0.95$
* $\alpha = 0.05$
* $z_{\frac{\alpha}{2}}=z_{0.025}=1.96$
* $EBP = (1.96)(\sqrt{\frac{0.276*0.724}{29}})=0.163$

$\therefore$
* The confidence interval is:
    * (0.276 - 0.163, 0.276 + 0.163)
    
The confidence interval for the true binomial population proportion is $(p^\prime - EBP, p^\prime + EBP) = (0.113,0.439)$.

We are 95% confident that the true proportion of all statistics students who smoke cigarettes is between 0.113 and 0.439.

In [34]:
8/29

0.27586206896551724

In [35]:
2/29

0.06896551724137931

In [36]:
1-0.276

0.724

In [37]:
.05/2

0.025

In [38]:
st.norm.ppf(1-0.025, loc=0, scale=1)

1.959963984540054

In [39]:
1.96*math.sqrt((.276*0.724)/29)

0.16269750632851518

In [40]:
(0.276 - 0.163, 0.276 + 0.163)

(0.11300000000000002, 0.43900000000000006)
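The plus-four adjustment is a two-line change to the usual proportion interval. A sketch, with `plus_four_interval` as an illustrative name, taking the original counts (6 successes in 25) and applying the adjustment inside the function:

```python
import math
from statistics import NormalDist

def plus_four_interval(x, n, cl):
    """Plus-four interval: pretend there are two extra successes and two extra failures."""
    n_adj = n + 4
    p_adj = (x + 2) / n_adj                          # adjusted p'
    z = NormalDist().inv_cdf(1 - (1 - cl) / 2)       # z_{alpha/2}
    ebp = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - ebp, p_adj + ebp

low, high = plus_four_interval(x=6, n=25, cl=0.95)
print(round(low, 3), round(high, 3))
```

Passing the raw counts keeps the adjustment in one place, so the caller cannot forget to add 2 to the successes or 4 to the sample size.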

##### <span style="color:orange">Example 8.13</span>

The Berkman Center for Internet & Society at Harvard recently conducted a study analyzing the privacy management habits of teen internet users. In a group of 50 teens, 13 reported having more than 500 friends on Facebook. Use the “plus four” method to find a 90% confidence interval for the true proportion of teens who would report having more than 500 Facebook friends.

So,
* This is a proportion problem, because the underlying distribution is a binomial distribution (do they have more than 500 friends on Facebook: yes/no)
* $\therefore X\sim B(54, \frac{15}{54})$
* $n = 50$
    * $n+4 = 54$
* $p^\prime = \frac{15}{54} = 0.278$ (percent said they have > 500 friends on facebook)
* $q^\prime = 1 - 0.278 = 0.722$
* $\text{CL} = 0.90$
* $\alpha = 0.10$
* $z_{\frac{\alpha}{2}}=z_{0.05}= 1.645$
* $EBP=(1.645)(\sqrt{\frac{0.278*0.722}{54}})=0.100$

$\therefore$
* The confidence interval is:
    * (0.278 - 0.100, 0.278 + 0.100)
    * (0.178, 0.378)

We are 90% confident that between 17.8% and 37.8% of all teens would report having more than 500 friends on Facebook.

In [41]:
15/54

0.2777777777777778

In [42]:
1-0.278

0.722

In [43]:
0.05/2

0.025

In [44]:
st.norm.ppf(1 - 0.05, loc=0, scale=1)

1.6448536269514722

In [45]:
1.645*(math.sqrt((0.2778*0.722)/54))

0.10025446917996003

In [46]:
(0.278 - 0.100, 0.278 + 0.100)

(0.17800000000000002, 0.378)

### Calculating the Sample Size $n$
If researchers desire a specific margin of error, then they can use the error bound formula to calculate the required sample size.

The error bound formula for a population proportion is:
* $EBP=\bigl(z_{\frac{\alpha}{2}}\bigr) \bigl(\sqrt{\frac{p^\prime q^\prime}{n}}\bigr)$
* Solving for $n$ gives you an equation for the sample size.
* $n=\frac{\bigl(z_{\frac{\alpha}{2}}\bigr)^2 (p^\prime q^\prime)}{{EBP}^2}$

##### <span style="color:orange">Example 8.14</span>

Suppose a mobile phone company wants to determine the current percentage of customers aged 50+ who use text messaging on their cell phones. How many customers aged 50+ should the company survey in order to be 90% confident that the estimated (sample) proportion is within three percentage points of the true population proportion of customers aged 50+ who use text messaging on their cell phones?

So,
* $n = $ ?
* $\text{CL} = 0.90$
* $\alpha = 0.1$
* $z_{\frac{\alpha}{2}}=z_{\frac{0.1}{2}}=z_{0.05}=1.645$
* $EBP = 0.03$ (from the problem statement)

> However, in order to find n, we need to know the estimated (sample) proportion p′. Remember that q′ = 1 – p′. But, we do not know p′ yet. Since we multiply p′ and q′ together, we make them both equal to 0.5 because p′q′ = (0.5)(0.5) = 0.25 results in the largest possible product. (Try other products: (0.6)(0.4) = 0.24; (0.3)(0.7) = 0.21; (0.2)(0.8) = 0.16 and so on). The largest possible product gives us the largest n. This gives us a large enough sample so that we can be 90% confident that we are within three percentage points of the true population proportion. To calculate the sample size n, use the formula and make the substitutions.

* $n = \frac{z^2 p^\prime q^\prime}{{EBP}^2}$ gives $n=\frac{{1.645}^2 (0.5)(0.5)}{{0.03}^2}=751.7$

Round the answer to the next higher value. The sample size should be 752 cell phone customers aged 50+ in order to be 90% confident that the estimated (sample) proportion is within three percentage points of the true population proportion of all customers aged 50+ who use text messaging on their cell phones.

In [47]:
st.norm.ppf(1-0.05, loc=0, scale=1)

1.6448536269514722
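The sample-size calculation for a proportion can be sketched as a function that defaults to the conservative choice $p^\prime = 0.5$ when no prior estimate is available. The function name and the `p_guess` parameter are illustrative.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(ebp, cl, p_guess=0.5):
    """Smallest n giving an error bound of at most ebp; p_guess=0.5 maximizes p'q'."""
    z = NormalDist().inv_cdf(1 - (1 - cl) / 2)       # z_{alpha/2}
    n_exact = z ** 2 * p_guess * (1 - p_guess) / ebp ** 2
    return math.ceil(n_exact)                        # always round UP

n_needed = sample_size_for_proportion(ebp=0.03, cl=0.90)
print(n_needed)
```

If a pilot study suggests a proportion far from 0.5, passing it as `p_guess` gives a smaller required sample, at the cost of relying on that guess.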