<a href="https://colab.research.google.com/github/gibsonea/Biostats/blob/main/LabSolutions/Solutions14_Parametric_CI_Proportions_Full.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <a name="20intro">3.4: Parametric Confidence Intervals for Proportions</a>

---

# <a name="20poll-intro">Public Opinion Polls</a>

---

Confidence intervals are frequently used when polling public opinion.
Rather than give a point estimate alone, poll results are typically
given along with a <font color="dodgerblue">**margin of error**</font> corresponding to a specified <font color="dodgerblue">**confidence level**</font>, which is typically 95%. For example, summarized in the bar plot and table below are the results of a PBS NewsHour/NPR/Marist poll<sup>1</sup> that surveyed ${\color{tomato}{n=1,\!227}}$ randomly selected adults in the US and gauged their opinions on how the US is handling the COVID pandemic
approximately one year after the initial outbreak in the United States.


<figure>
<img
src="https://raw.githubusercontent.com/CU-Denver-MathStats-OER/Statistical-Theory/main/Images/20fig-vaccination-poll1.png"
alt="A stacked bar plot summary of poll results" width = "70%"/>
<figcaption aria-hidden="true">
A stacked bar plot summary of poll results
</figcaption>
</figure>

<figure>
<img
src="https://raw.githubusercontent.com/CU-Denver-MathStats-OER/Statistical-Theory/main/Images/20fig-vaccination-table1.png"
alt="A table summary of the poll
results" width = "90%"/>
<figcaption aria-hidden="true">
A table summary of the poll results
</figcaption>
</figure>

<br>

<font size=2>1. “Politics still drives how Americans feel about COVID response, one year in”, [PBS NewsHour](https://www.pbs.org/newshour/politics/politics-still-drives-how-americans-feel-about-covid-response-one-year-in) March 11, 2021</font>

## <a name="20q1">Question 1</a>

---

Based on the poll summaries above, approximately what proportion of ALL
adults in the US do NOT plan to get vaccinated?

### <a name="20sol1">Solution to Question 1</a>

---

<br>  
$$p \approx \widehat{p} = 0.3$$
<br>

# <a name="20general-sum">A General Summary of Confidence Intervals</a>

---

A confidence interval is an interval estimate for a population parameter
whose rate of success is given by the confidence level of the interval. We
can construct confidence intervals for all sorts of parameters, and we
can use confidence intervals as a tool for analyzing possible
associations between two different variables. In general, a confidence
interval has three components:

-   A point estimate calculated from a sample.
-   A chosen confidence level.
-   A margin of error (MoE) that accounts for the uncertainty due to sampling.
    -   The MoE depends on the confidence level that is chosen.
    -   Careful, the MoE is different from the standard error, but they are related!
    -   **The MoE is a multiple of the standard error, SE.**

If we want to construct a confidence interval estimate for parameter
$\theta$, then we have

$$({\color{dodgerblue}{\mbox{point estimate}}}) - {\color{tomato}{\mbox{MoE}}} < \theta < ({\color{dodgerblue}{\mbox{point estimate}}}) + {\color{tomato}{\mbox{MoE}}}.$$

All confidence intervals have this same general structure that we can
construct using similar steps:

1.  <font color="dodgerblue">Choose a point estimate.</font>
2.  Using the CLT, estimate the standard error.
3.  Based on the confidence level, find the appropriate multiple   ($z_{\alpha/2}$ or $t_{\alpha/2}$).
4.  <font color="tomato">The margin of error (MoE) is the product of the SE and either $z_{\alpha/2}$ or $t_{\alpha/2}$.</font>

#### <a name="20single-mean">Confidence Intervals for a Single Mean with Known $\sigma^2$</a>

---

If we would like to estimate a population mean $\mu$ of a population
that has known variance $\sigma^2$, we can pick a random sample of $n$
values from the population. As long as the [sample is large enough](#20samp-size) or symmetric, we have:

1.  <font color="dodgerblue">A reasonable point estimate is the   corresponding sample mean, $\bar{x}$.</font>
2.  From CLT, the standard error is   $\mbox{SE} = \frac{\sigma}{\sqrt{n}}$.
3.  Using the standard normal distribution, find the value of   $z_{\alpha/2}$.
4.  <font color="tomato">The MoE   $= z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$.</font>

$$ {\color{dodgerblue}{\overline{X}}} - {\color{tomato}{z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}}} < \mu < {\color{dodgerblue}{\overline{X}}} + {\color{tomato}{z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}}} .$$
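
The four steps above can be sketched in R as follows. The sample mean, $\sigma$, and $n$ used here are hypothetical values chosen only for illustration; they do not come from the poll.

```r
# Hypothetical values chosen only for illustration (not from the poll)
x.bar <- 72.4       # sample mean (point estimate)
sigma <- 8          # known population standard deviation
n.obs <- 64         # sample size
conf.level <- 0.95  # chosen confidence level

alpha <- 1 - conf.level
z.crit <- qnorm(1 - alpha/2)     # z_{alpha/2} from the standard normal
se.mean <- sigma / sqrt(n.obs)   # standard error from the CLT
moe.mean <- z.crit * se.mean     # margin of error = multiple of the SE

ci.z <- c(lower = x.bar - moe.mean, upper = x.bar + moe.mean)
ci.z
```

Changing `conf.level` changes only `z.crit`, which is why the margin of error depends on the confidence level while the standard error does not.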

#### <a name="20var-known">Confidence Intervals for a Single Mean with Unknown $\sigma^2$</a>

---

If we would like to estimate a population mean $\mu$ of a population
that has unknown variance $\sigma^2$, we can pick a random sample of $n$
values from the population. As long as the [sample is large enough](#20samp-size) or symmetric, we have:

1.  <font color="dodgerblue">A reasonable point estimate is the   corresponding sample mean, $\bar{x}$.</font>
2.  We use CLT, <font color="mediumseagreen">plugging $s$ in place   of $\sigma$</font>, to get   $\mbox{SE} = \frac{{\color{mediumseagreen}{s}}}{\sqrt{n}}$.
3.  Using a <font color="mediumseagreen">$t$-distribution</font>,   find the value of ${\color{mediumseagreen}{t_{\alpha/2}}}$.
4.  <font color="tomato">The MoE   $= {\color{mediumseagreen}{t_{\alpha/2}}} \cdot \frac{{\color{mediumseagreen}{s}}}{\sqrt{n}}$.</font>

$${\color{dodgerblue}{\overline{X}}} - {\color{tomato}{t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}}} < \mu < {\color{dodgerblue}{\overline{X}}} + {\color{tomato}{t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}}} .$$
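
As a rough check of the formula above, the sketch below builds the interval step by step and compares it with `t.test()$conf.int`. The data vector is made up for illustration, and the sketch assumes the usual $n-1$ degrees of freedom for the $t$-distribution, which this summary does not state explicitly.

```r
# Hypothetical sample, made up for illustration
x <- c(5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 5.3, 6.1, 5.0, 5.6)
n.x <- length(x)
conf.level <- 0.95
alpha <- 1 - conf.level

se.x <- sd(x) / sqrt(n.x)                 # plug s in place of sigma
t.crit <- qt(1 - alpha/2, df = n.x - 1)   # t_{alpha/2}, assuming df = n - 1
moe.x <- t.crit * se.x                    # margin of error

ci.manual <- c(mean(x) - moe.x, mean(x) + moe.x)
ci.ttest <- as.numeric(t.test(x, conf.level = conf.level)$conf.int)

ci.manual
ci.ttest
```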

#### <a name="20diff-means">Confidence Interval for a Difference in Two Means</a>

---

If we would like to estimate the difference in means from two independent populations that have unknown variances, we can pick random samples of sizes $n_1$ and $n_2$ from each respective population. As long as the [samples are large enough](#20samp-size) or symmetric, we have:

1.  <font color="dodgerblue">A reasonable point estimate is the   corresponding difference in sample means,   $\bar{x}_1 - \bar{x}_2$.</font>
2.  We use CLT, <font color="mediumseagreen">plugging $s_1$ and   $s_2$ in place of $\sigma_1$ and $\sigma_2$</font>, to get

$${\color{mediumseagreen}{\mbox{SE} = \sqrt{ \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}}.$$

3.  Using a <font color="mediumseagreen">$t$-distribution</font>,   find the value of ${\color{mediumseagreen}{t_{\alpha/2}}}$.
4.  <font color="tomato">The MoE   $= {\color{mediumseagreen}{t_{\alpha/2}}} \cdot \sqrt{ \frac{{\color{mediumseagreen}{s_1}}^2}{n_1} + \frac{{\color{mediumseagreen}{s_2}}^2}{n_2}}$.</font>

$$\left( {\color{dodgerblue}{\overline{X}_1-\overline{X}_2}} \right) - {\color{tomato}{t_{\alpha/2} \cdot \sqrt{ \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}} < \mu_1 - \mu_2 < \left( {\color{dodgerblue}{\overline{X}_1-\overline{X}_2}} \right) + {\color{tomato}{t_{\alpha/2} \cdot \sqrt{ \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}} .$$
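
The sketch below reproduces this interval for two made-up samples. The formula above does not say which degrees of freedom to use for $t_{\alpha/2}$; here we assume the Welch degrees of freedom that `t.test()` reports by default, and we read them from the `t.test()` output rather than computing them by hand.

```r
# Hypothetical samples, made up for illustration
x1 <- c(12.1, 11.4, 13.0, 12.7, 11.9, 12.5, 13.2, 12.0)
x2 <- c(10.8, 11.1, 10.2, 11.6, 10.9, 11.3, 10.5)

welch <- t.test(x1, x2, conf.level = 0.95)  # var.equal = FALSE by default

diff.means <- mean(x1) - mean(x2)                             # point estimate
se.diff <- sqrt(var(x1)/length(x1) + var(x2)/length(x2))      # standard error
df.welch <- unname(welch$parameter)     # degrees of freedom used by t.test()
t.crit <- qt(0.975, df = df.welch)      # t_{alpha/2} for 95% confidence

ci.manual <- c(diff.means - t.crit * se.diff, diff.means + t.crit * se.diff)
ci.ttest <- as.numeric(welch$conf.int)

ci.manual
ci.ttest
```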

## <a name="20q2">Question 2</a>

---

We would like to estimate the parameter $p$, the proportion of all
adults in the US that do not plan on getting vaccinated. From the
vaccination poll in [Question 1](#20q1), we have one random sample of 1,227
adults. From our sample, we observe that 30% said they do not intend to
get vaccinated. Let’s apply the same general process summarized above to
construct a 95% confidence interval to estimate the **proportion** of
all adults in the US that do not plan on getting vaccinated.

### <a name="20q2a">Question 2a</a>

---

Based on the sample of polled adults, what is a reasonable point
estimate for $p$, the proportion of all adults in the US that do not
plan on getting vaccinated?

#### <a name="20sol2a">Solution to Question 2a</a>

---

<br>  
$$p \approx \widehat{p} = 0.3$$
<br>

### <a name="20q2b">Question 2b</a>

---

Using the [Central Limit Theorem for proportions](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap3/11-Sampling-Dist-Prop.ipynb#11clt), estimate the
standard error for the sampling distribution of sample proportions.

<br>

<font color="mediumseagreen">*Hint: To calculate the standard error, we need to know the population proportion $p$. Plug an appropriate sample statistic in place of $p$ to estimate the standard error.*</font>

#### <a name="20sol2b">Solution to Question 2b</a>

---

<br>  
$$\sigma_{\widehat{P}} = \sqrt{\frac{p(1-p)}{n}} \approx \sqrt{\frac{{\color{tomato}{\widehat{p}}}(1- {\color{tomato}{\widehat{p}}})}{n}} = \sqrt{\frac{{\color{tomato}{0.3}}(1- {\color{tomato}{0.3}})}{1,227}} = 0.01308$$
<br>

In [None]:
se.no <- sqrt( 0.3*(1 - 0.3) / 1227)
se.no

### <a name="20q2c">Question 2c</a>

---

Next, we identify the value ($z_{\alpha/2}$ or $t_{\alpha/2}$) we
multiply the standard error by to get the margin of error. For
proportions, as long as [the sample is large enough](#20samp-size), a normal distribution is an accurate model for
the underlying sampling distribution.

From the poll in [Question 1](#20q1), we have $n=1227$. Since we do not know
$p$, we substitute $\hat{p} = 0.3$ instead. Since both
$n\hat{p} \geq 10$ and $n(1-\hat{p}) \geq 10$, we can **use a normal
distribution to calculate the margin of error** for the confidence interval.

What is the value of $z_{\alpha/2}$ for a 95% confidence interval for a
proportion?

#### <a name="20sol2c">Solution to Question 2c</a>

---

<br>  

$$z_{\alpha/2} = 1.96$$

<br>

In [None]:
z.95 <- qnorm(0.975, 0, 1)
z.95

moe <- z.95 * se.no
moe

### <a name="20q2d">Question 2d</a>

---

Based on your previous answers in [Question 2](#20q2), give a 95% confidence
interval to estimate $p$, the proportion of all adults in the US that do
not plan on getting vaccinated.

#### <a name="20sol2d">Solution to Question 2d</a>

---

<br>  
The margin of error for a 95% confidence interval is therefore

$$\mbox{MoE} = (1.96)(0.01308) = 0.02564.$$

The lower cutoff for a 95% confidence interval is therefore

$$\hat{p} - z_{\alpha /2} \cdot \sqrt{\frac{{\color{tomato}{\widehat{p}}}(1- {\color{tomato}{\widehat{p}}})}{n}} = 0.3 - 0.02564 = 0.27436.$$

Similarly, the upper cutoff for a 95% confidence interval is

$$\hat{p} + z_{\alpha /2} \cdot \sqrt{\frac{{\color{tomato}{\widehat{p}}}(1- {\color{tomato}{\widehat{p}}})}{n}} = 0.3 + 0.02564 = 0.32564.$$


A 95% confidence interval for the proportion of all adults in the US that do not plan on getting vaccinated is

$$\boxed{ 0.27436 < p <  0.32564}.$$

<br>

### <a name="20q2e">Question 2e</a>

---

Interpret the practical meaning of your confidence interval in
[Question 2d](#20q2d) in the context of COVID vaccinations in the US.

#### <a name="20sol2e">Solution to Question 2e</a>

---

<br>  

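We are 95% confident that between $27.436\%$ and $32.564\%$ of all adults in the US do not plan on getting vaccinated. Here, "95% confident" means that the method used to construct the interval captures the actual population percent for about 95% of all possible random samples.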

<br>
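
One way to see what the confidence level means is a small simulation. The sketch below assumes, purely for illustration, that the true proportion is 0.3, draws many samples of size 1,227, and counts how often the resulting 95% interval contains the assumed true value. The seed and number of repetitions are arbitrary choices.

```r
set.seed(2021)       # arbitrary seed so the simulation is reproducible
p.true <- 0.3        # assumed "true" proportion (for the simulation only)
n.sim <- 1227        # sample size, matching the poll
reps <- 10000        # number of simulated samples
z.crit <- qnorm(0.975)

# One sample proportion per simulated poll
p.hats <- rbinom(reps, size = n.sim, prob = p.true) / n.sim

# Margin of error for each simulated poll, then check coverage
moes <- z.crit * sqrt(p.hats * (1 - p.hats) / n.sim)
covered <- (p.hats - moes < p.true) & (p.true < p.hats + moes)

coverage <- mean(covered)   # proportion of intervals containing p.true
coverage
```

The coverage should land close to 0.95, which is the sense in which the method has a 95% rate of success.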

# <a name="20wald-one">Wald Confidence Interval for a Proportion</a>

---

The <font color="dodgerblue">**Wald confidence interval for a
proportion**</font> is given by

$${\large \boxed{ \mbox{Wald:} \qquad {\color{tomato}{\hat{p}}} - z_{\alpha/2} \cdot \sqrt{ \frac{{\color{tomato}{\hat{p}}}(1-{\color{tomato}{\hat{p}}})}{n}}  < p <  {\color{tomato}{\hat{p}}} + z_{\alpha/2} \cdot \sqrt{ \frac{{\color{tomato}{\hat{p}}}(1-{\color{tomato}{\hat{p}}})}{n}}}}$$

We use the **plug-in principle** and <font color="tomato">use
$\hat{p}$ for the unknown value of $p$</font> when calculating the
standard error.

-   The advantage of this estimate is we can do it by hand.
-   The downside is we have introduced additional uncertainty using $\hat{p}$ in place of $p$ when estimating the standard error.

We <font color="mediumseagreen">**use the standard normal distribution**</font> to identify  ${\color{mediumseagreen}{z_{\alpha/2}}}$ when finding the margin of error since we are using a <font color="mediumseagreen">continuous distribution</font> (proportions) to approximate a <font color="mediumseagreen">binomial distribution</font> (counts).
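
The Wald formula is short enough to wrap in a small function. The name `wald.ci` is a hypothetical helper defined here for illustration; it is not a built-in R function.

```r
# Hypothetical helper (not built into R): Wald interval for one proportion
wald.ci <- function(x, n, conf.level = 0.95) {
  p.hat <- x / n                                  # point estimate
  z.crit <- qnorm(1 - (1 - conf.level) / 2)       # z_{alpha/2}
  moe <- z.crit * sqrt(p.hat * (1 - p.hat) / n)   # plug-in margin of error
  c(lower = p.hat - moe, upper = p.hat + moe)
}

# Poll data: 30% of 1,227 adults do not plan to get vaccinated
ci.wald <- wald.ci(x = 0.3 * 1227, n = 1227)
ci.wald
```

The result should agree with the interval found by hand in Question 2d up to rounding.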

# <a name="20score">Score Confidence Interval For a Proportion</a>

---

From the [CLT for proportions](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap3/11-Sampling-Dist-Prop.ipynb#11clt), we have
$\widehat{P} \sim N \left( \mu_{\widehat{P}} , \sigma_{\widehat{P}} \right) = N\left( p, \sqrt{\frac{ p (1-p)}{n}} \right)$.
A standardized sample proportion has $z$-score
$z = \frac{\hat{p} - p}{\sqrt{\frac{ p (1-p)}{n}}}$. Thus, for
confidence level $CL$, we have

$$P \left( -z_{\alpha/2} < \frac{\hat{p} -\color{tomato}{p}}{\sqrt{\frac{\color{tomato}{p}(1-\color{tomato}{p})}{n}}} < z_{\alpha/2} \right) =CL.$$

In the equation above, the <font color="tomato">unknown population
parameter $p$ is in red</font>. All the other values ($\hat{p}$, $n$,
and $z_{\alpha/2}$) in the formula are known values. Given a confidence
level, we can algebraically solve for the cutoff values for
${\color{tomato}{p}}$ by solving the equations:

$$\dfrac{\hat{p} -{\color{tomato}{p}}}{\sqrt{\dfrac{{\color{tomato}{p}}(1-{\color{tomato}{p}})}{n}}} = z_{\alpha/2} \qquad \mbox{and} \qquad \dfrac{\hat{p} - {\color{tomato}{p}}}{\sqrt{\dfrac{{\color{tomato}{p}}(1-{\color{tomato}{p})}}{n}}} = -z_{\alpha/2}$$
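
Before solving these equations algebraically, we can solve them numerically as a sanity check. The sketch below uses `uniroot()` with the poll values $\hat{p} = 0.3$ and $n = 1227$; the helper `z.of.p` and the search intervals are choices made for this illustration.

```r
p.hat <- 0.3
n.obs <- 1227
z.crit <- qnorm(0.975)   # z_{alpha/2} for 95% confidence

# Standardized distance between p.hat and a candidate value p
z.of.p <- function(p) (p.hat - p) / sqrt(p * (1 - p) / n.obs)

# Candidate p below p.hat where the z-score equals +z.crit
lower.root <- uniroot(function(p) z.of.p(p) - z.crit,
                      interval = c(1e-6, p.hat), tol = 1e-10)$root

# Candidate p above p.hat where the z-score equals -z.crit
upper.root <- uniroot(function(p) z.of.p(p) + z.crit,
                      interval = c(p.hat, 1 - 1e-6), tol = 1e-10)$root

c(lower.root, upper.root)
```

Unlike the Wald interval, the two roots are not equally far from $\hat{p}$, because the standard error is evaluated at each candidate $p$ rather than at $\hat{p}$.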

## <a name="20score-form">Score Confidence Interval Formulas</a>

---

The confidence interval estimate resulting from the algebraic solution
is called the <font color="dodgerblue">**score confidence interval
for a proportion**</font>. The algebraic work involved in solving the
equations above is provided in the [Appendix](#20append-score). The
corresponding lower ($L$) and upper ($U$) cutoffs are:

$$\begin{aligned}
&L= \dfrac{\hat{p} + \dfrac{z_{\alpha/2}^2}{2n} - z_{\alpha/2} \cdot \sqrt{ \dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z_{\alpha/2}^2}{4n^2}}}{1+ \dfrac{z_{\alpha/2}^2}{n}} \\
\\
&U= \dfrac{\hat{p} + \dfrac{z_{\alpha/2}^2}{2n} + z_{\alpha/2} \cdot \sqrt{ \dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z_{\alpha/2}^2}{4n^2}}}{1+ \dfrac{z_{\alpha/2}^2}{n}}
\end{aligned}$$

-   **Pro:** There is no additional uncertainty beyond the initial   variability in sampling.
-   **Con:** The formulas are quite complicated. Calculating by hand is   not really practical.
-   Typically we use technology to calculate score confidence intervals.

## <a name="20score-r">Score Confidence Intervals in R</a>

---

R has a built-in function, `prop.test()`, whose `conf.int` component gives a score
confidence interval for a proportion.

-   In R, use the command   `prop.test(X, n, conf.level = CL, correct = FALSE)$conf.int`
  -   $X$ denotes the number of “successes” observed in the sample.
  -   $n$ denotes the total number of observations in the sample.
  -   `CL` is a chosen confidence level (as a proportion).
  -   The option `correct = FALSE` means no [continuity correction](#20append-corr) is applied.

## <a name="20q4">Question 4</a>

---

Using the poll data in [Question 1](#20q1), find a 95% score confidence interval for the proportion of all adults in the US that do not plan to get vaccinated by completing the `prop.test()` command in the code
cell below.

### <a name="20sol4">Solution to Question 4</a>

---

In [None]:
#code to help with calculations

p.hat <- 0.3
n.no <- 1227
x.no <- p.hat * n.no  # approximate count of "No" responses (368.1, since 30% is a rounded figure)

prop.test(x.no, n.no, conf.level = 0.95, correct = FALSE)$conf.int

<br>

A 95% score confidence interval (without a continuity correction) for the proportion of all adults in the US that do not plan on getting vaccinated is

$$\boxed{ 0.27502 < p <  0.32623}.$$

<br>

#### <a name="20check4">Checking Your Solution to Question 4</a>

---

Based on the polling sample data, enter the values for `X`, `n`, and
`z.star` in the first code cell below. Then run the code cell.

In [None]:
##################################################
# Replace the ?? in the three lines of code below
# with appropriate values or commands
##################################################
X <- 0.3 * 1227  # number of successes in sample (do not plan to get vax)
n <- 1227  # sample size
z.star <- qnorm(0.975, 0, 1)  # find z_alpha/2 for 95% confidence level

Next, run the code cell below to calculate the upper and lower cutoffs
for a 95% score confidence interval for a proportion.

In [None]:
#########################################
# first run the code cell above
# nothing to edit in this code cell
# run as is
#########################################
phat <- X/n  # Compute sample proportion

# Computes Cutoffs for Score Confidence Interval
lower.score95 <- (phat+z.star^2/(2*n) -
                z.star*sqrt( (phat*(1-phat))/n + z.star^2/(4*n^2) ) )/(1+z.star^2/n)
upper.score95 <- (phat+z.star^2/(2*n) +
                z.star*sqrt( (phat*(1-phat))/n + z.star^2/(4*n^2) ) )/(1+z.star^2/n)

# Print cutoffs to screen
lower.score95
upper.score95

## <a name="20cont-corr">Applying the Continuity Correction</a>

---

In our construction of a score confidence interval, we have used a
normal distribution to approximate a discrete (binomial) distribution.
Recall that when using a continuous, normal distribution to approximate a
discrete, binomial distribution (as with the Central Limit Theorem for
proportions), we miss some area under the curve, resulting in an
underestimate. We can improve estimates that use a normal distribution in place of a binomial distribution by
applying a [continuity correction](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap3/11-Sampling-Dist-Prop.ipynb#11correct).

Similarly, we can obtain a more accurate score confidence interval for a
proportion by applying a continuity correction. The [Appendix
explains how the continuity correction is applied](#20append-corr) and
provides the corresponding formulas. In practice, we can simply change
the `correct = FALSE` option in `prop.test()$conf.int` to
`correct = TRUE`.

-  `prop.test(X, n, conf.level = CL, correct = TRUE)$conf.int`
-  The default for `prop.test` if no `correct` option is specified is   `correct = TRUE`.
-  <font color="dodgerblue">**Applying the continuity correction results in a slightly wider, more conservative confidence interval.**</font>
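
To see what the correction does to the interval, the sketch below computes both versions for the poll data and compares their widths. The count 368 is an assumption: it is $0.3 \cdot 1227$ rounded to a whole number of respondents.

```r
x.obs <- 368    # assumed count: 0.3 * 1227 rounded to a whole number
n.obs <- 1227   # sample size

ci.plain <- as.numeric(
  prop.test(x.obs, n.obs, conf.level = 0.95, correct = FALSE)$conf.int)
ci.corr <- as.numeric(
  prop.test(x.obs, n.obs, conf.level = 0.95, correct = TRUE)$conf.int)

width.plain <- diff(ci.plain)   # upper minus lower, no correction
width.corr <- diff(ci.corr)     # upper minus lower, with correction

rbind(no.correction = ci.plain, with.correction = ci.corr)
c(width.plain, width.corr)
```

With a sample this large the two intervals differ only in the third or fourth decimal place; the correction matters more for small samples.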

### <a name="20corr-r">Applying the Continuity Correction in Code</a>

---

Below we perform the direct calculations using the [continuity
correction formulas derived in the Appendix](#20append-corr).

In [None]:
##############################################
# Be sure you have run previous code cells
# And have already defined X, n, and z.star
# Run this code cell without any edits needed
##############################################

# Continuity corrections applied to sample proportion
cc.phat.L <- (X - 0.5)/n
cc.phat.U <- (X + 0.5)/n

# Plugged into formulas for Score Conf Interval
cc.lower <- (cc.phat.L + z.star^2/(2*n) -
            z.star*sqrt( (cc.phat.L*(1-cc.phat.L))/n + z.star^2/(4*n^2) ) )/(1+z.star^2/n)
cc.upper <- (cc.phat.U + z.star^2/(2*n) +
            z.star*sqrt( (cc.phat.U*(1-cc.phat.U))/n + z.star^2/(4*n^2) ) )/(1+z.star^2/n)

# Print results to screen to check
cc.lower
cc.upper

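In the code cell below, we apply the continuity correction using the
`correct = TRUE` option in `prop.test()` to compare with the previous result. Here the count is entered as the whole number 368 (that is, $0.3 \cdot 1227 = 368.1$ rounded), so the cutoffs can differ slightly from the direct calculation above, which used `X = 368.1`.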

In [None]:
prop.test(368, 1227, conf.level = 0.95, correct = TRUE)$conf.int

<br>  
A 95% score confidence interval, with a continuity correction, for the proportion of all adults in the US that do not plan on getting vaccinated is

$$\boxed{ 0.27454 < p <  0.32657}.$$
<br>

# <a name="20diff-prop">A Difference in Two Proportions</a>

---

## <a name="20clt-prop">Central Limit Theorem for $\widehat{P}_1 - \widehat{P}_2$</a>

---

For a difference in two proportions, we can derive a Central Limit
Theorem to model the sampling distribution for the difference in two
sample proportions, $\widehat{P}_1 - \widehat{P}_2$. See the [Appendix
for a proof of the CLT for a difference in two
proportions](#20append-clt) which is stated below:

$$\widehat{P}_1 - \widehat{P}_2  \sim N \left( \mu_{\widehat{P}_1 - \widehat{P}_2} , \mbox{SE}(\widehat{P}_1 - \widehat{P}_2) \right) = N \left( p_1 - p_2  , \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \right).$$
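
A quick simulation can make this statement concrete. The population proportions, sample sizes, seed, and number of repetitions below are all hypothetical choices; the sketch compares the simulated mean and standard deviation of $\widehat{P}_1 - \widehat{P}_2$ with the values given by the theorem.

```r
set.seed(314)             # arbitrary seed for reproducibility
p1 <- 0.40; n1 <- 200     # hypothetical population 1 and sample size
p2 <- 0.25; n2 <- 150     # hypothetical population 2 and sample size
reps <- 20000             # number of simulated pairs of samples

# Difference in sample proportions for each simulated pair of samples
diffs <- rbinom(reps, n1, p1) / n1 - rbinom(reps, n2, p2) / n2

# Standard error stated in the CLT for a difference in two proportions
se.theory <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

c(simulated.mean = mean(diffs), theory.mean = p1 - p2)
c(simulated.sd = sd(diffs), theory.se = se.theory)
```

A histogram, for example `hist(diffs)`, should also look roughly bell-shaped and centered near $p_1 - p_2$.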

## <a name="20ci-diff">Confidence Interval for $\widehat{P}_1 - \widehat{P}_2$</a>

---

We can modify the <font color="dodgerblue">**Wald confidence
interval**</font> to give an approximation for a confidence interval for
a difference in two proportions:

-   The <font color="dodgerblue">point estimate</font> is the   difference in the two sample proportions,   $\color{dodgerblue}{\hat{p}_1 - \hat{p}_2}$.
-   The standard error we estimate by   <font color="mediumseagreen">plugging $\hat{p}_1$ and   $\hat{p}_2$ in place of $p_1$ and $p_2$</font> in the formula for   the standard error from the Central Limit Theorem:

$$\mbox{SE} \left( \widehat{P}_1 - \widehat{P}_2 \right) = \sqrt{ \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \approx \sqrt{ \frac{{\color{mediumseagreen}{\hat{p}_1}}(1-{\color{mediumseagreen}{\hat{p}_1}})}{n_1} + \frac{{\color{mediumseagreen}{\hat{p}_2}}(1-{\color{mediumseagreen}{\hat{p}_2}})}{n_2}}$$

-   Since we are using a continuous distribution (proportions) to approximate a binomial distribution (counts), we **use the standard normal distribution** to identify   $z_{\alpha/2}$ to find the <font color="tomato">margin of error</font>.
  - Note: A normal distribution is an accurate model assuming <font color="tomato">**all four**</font> of the conditions are true: <font color="tomato">$n_1 p_1 \geq 10$, $n_1(1-p_1) \geq 10$, $n_2 p_2 \geq 10$, and $n_2(1-p_2) \geq 10$</font>.

- Thus, we have a <font color="dodgerblue">**Wald confidence interval for a difference in proportions**</font>.

$$({\color{dodgerblue}{\hat{p}_1 - \hat{p}_2}}) - {\color{tomato}{z_{\alpha/2} \cdot \sqrt{ \dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2 (1-\hat{p}_2) }{n_2}}}}  < p_1-p_2 < ({\color{dodgerblue}{\hat{p}_1 - \hat{p}_2}}) + {\color{tomato}{z_{\alpha/2} \cdot \sqrt{ \dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2 (1-\hat{p}_2) }{n_2}}}}$$



### <a name="20prop-tes-diff">A Wald Confidence Interval for $p_1 - p_2$ Using `prop.test()`</a>

---

The formulas above give a Wald confidence interval for a difference in two proportions. Be aware there are other variations of confidence intervals for a
difference in two proportions similar to the score
confidence intervals.

In R, the command

<center>

`prop.test(c(x1, x2), c(n1, n2), conf.level = CL, correct = FALSE)$conf.int`

</center>

computes a **Wald confidence interval** for a difference in two
proportions without a continuity correction applied. If we want to apply a continuity correction<sup>2</sup>, we use the option `correct = TRUE`.

<br>

<font color="mediumseagreen">*Note: In R, the `prop.test()` function uses different methods depending on whether the confidence interval is for a single proportion or a difference in two proportions.*</font>

-   <font color="mediumseagreen">*For a single proportion, `prop.test()` gives a score confidence interval.*</font>
-   <font color="mediumseagreen">*For a difference in two proportions, `prop.test()` gives a Wald confidence interval.*</font>
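
The sketch below checks that the Wald formula for a difference in two proportions matches `prop.test()` with `correct = FALSE`. The counts and sample sizes are hypothetical values chosen for illustration.

```r
# Hypothetical counts of "successes" and sample sizes
x1 <- 45; n1 <- 150
x2 <- 60; n2 <- 120
conf.level <- 0.90

p1.hat <- x1 / n1
p2.hat <- x2 / n2
z.crit <- qnorm(1 - (1 - conf.level) / 2)   # z_{alpha/2}

# Plug-in standard error and margin of error
se.diff <- sqrt(p1.hat * (1 - p1.hat) / n1 + p2.hat * (1 - p2.hat) / n2)
moe.diff <- z.crit * se.diff

ci.manual <- c((p1.hat - p2.hat) - moe.diff, (p1.hat - p2.hat) + moe.diff)
ci.r <- as.numeric(
  prop.test(c(x1, x2), c(n1, n2), conf.level = conf.level,
            correct = FALSE)$conf.int)

ci.manual
ci.r
```

The order of the counts matters: `c(x1, x2)` gives an interval for $p_1 - p_2$, and swapping the order flips the signs of both cutoffs.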

<br>

<font size=2>2. See [Colab Notebook](https://colab.research.google.com/drive/1su7go-u4AAUtPS_CLeCIVSJjYx1gaqbY?usp=sharing) for an explanation of how a continuity correction is applied to a difference in proportions with `prop.test()` in R.</font>

## <a name="20q5">Question 5</a>

---

Using the data below collected from the poll in [Question 1](#20q1),
construct a 95% Wald confidence interval for the difference in the
proportion of all Democrats and the proportion of all Republicans that
do not plan to be vaccinated.

| <font size=3>Party</font> | <font size=3>Yes, will</font> | <font size=3>Yes, already</font> | <font size=3>No</font>  | <font size=3>Unsure</font> | <font size=3>Total</font> |
|------------|-----------|--------------|-----|--------|-------|
| <font size=3>Democrat</font> | <font size=3>213</font> | <font size=3>108</font> | <font size=3>40</font> | <font size=3>7</font>  | <font size=3>368</font>   |
| <font size=3>Republican</font> | <font size=3>93</font> | <font size=3>70</font> | <font size=3>120</font> | <font size=3>9</font> | <font size=3>292</font>   |
| <font size=3>Total</font> | <font size=3>306</font> | <font size=3>178</font> | <font size=3>160</font> | <font size=3>16</font> | <font size=3>660</font> |

### <a name="20sol5">Solution to Question 5</a>

---

In [None]:
# use code cell to help
prop.test(c(40, 120), c(368, 292), conf.level = 0.95, correct = TRUE)$conf.int

<br>  

A 95% Wald confidence interval (with a continuity correction) for the difference in the proportion of all Democrats and the proportion of all Republicans that do not plan on getting vaccinated is

$$\boxed{ -0.3701 < p_D - p_R <  -0.2344}.$$
<br>

In particular, since the only plausible values for $p_D - p_R$ are less than zero, our data indicates $p_D < p_R$. The proportion of all Democrats that do not plan on getting vaccinated is less than the proportion of all Republicans that do not plan on getting vaccinated.
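
For readers curious where the cutoffs come from, the sketch below rebuilds the interval by hand. The form of the correction used here, widening the Wald margin of error by $0.5\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$, is our reading of how `prop.test()` behaves for these data and should be treated as an assumption; the comparison with `prop.test()` at the end is the check.

```r
x <- c(40, 120)    # "No" counts: Democrats, Republicans
n <- c(368, 292)   # group totals: Democrats, Republicans

p.hat <- x / n
z.crit <- qnorm(0.975)              # z_{alpha/2} for 95% confidence
diff.hat <- p.hat[1] - p.hat[2]     # point estimate of p_D - p_R

# Wald margin of error (no correction)
moe.wald <- z.crit * sqrt(sum(p.hat * (1 - p.hat) / n))

# Assumed form of the continuity correction for two proportions
cc.term <- 0.5 * sum(1 / n)

ci.cc <- c(diff.hat - moe.wald - cc.term, diff.hat + moe.wald + cc.term)
ci.r <- as.numeric(
  prop.test(x, n, conf.level = 0.95, correct = TRUE)$conf.int)

ci.cc
ci.r
```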

<br>

# <a name="20summary">Summarizing Results of Confidence Intervals</a>

---

| <font size=3>Parameter(s) of Interest</font> | <font size=3>Point Estimate</font> |<font size=3>Distribution</font> | <font size=3>Margin of Error</font> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; |
|--------------------------|-----------------|--------------|-----------------|
| <font size=3>A single mean <br> ($\sigma^2$ known)</font> | <font size=3>$\bar{x}$</font>  | <font size=3>$N(0,1)$</font>  | <font size=3>$z_{\alpha/2} \cdot \dfrac{\sigma}{\sqrt{n}}$</font> |
| <font size=3>A single mean <br> ($\sigma^2$ unknown)</font> | <font size=3>$\bar{x}$</font>  | <font size=3><font color="mediumseagreen">$t$-dist</font></font> | <font size=3>${\color{mediumseagreen}{t_{\alpha/2}}} \cdot \dfrac{{\color{tomato}{s}}}{\sqrt{n}}$</font> |
| <font size=3>A difference in two means <br> (with unknown variances)</font> | <font size=3>$\bar{x}_1 - \bar{x}_2$</font> | <font size=3><font color="mediumseagreen">$t$-dist</font></font> | <font size=3>${\color{mediumseagreen}{t_{\alpha/2}}} \cdot \sqrt{ \dfrac{{\color{tomato}{s_1}}^2}{n_1} + \dfrac{{\color{tomato}{s_2}}^2}{n_2}}$</font> |
| <font size=3>Wald for single <br> proportion</font>  | <font size=3>$\hat{p}=\dfrac{X}{n}$</font>  | <font size=3>$N(0,1)$</font>  | <font size=3>$z_{\alpha/2} \cdot \sqrt{ \dfrac{{\color{tomato}{\widehat{p}}}(1-{\color{tomato}{\widehat{p}}})}{n}}$</font> |
| <font size=3>Wald for a difference <br> in two proportions</font> | <font size=3> $\hat{p}_{1} - \hat{p}_{2}$ </font> | <font size=3>$N(0,1)$</font>  | <font size=3>$z_{\alpha/2} \cdot \sqrt{ \dfrac{{\color{tomato}{\hat{p}_1}}(1-{\color{tomato}{\hat{p}_1}})}{n_1} + \dfrac{{\color{tomato}{\hat{p}_2}}(1-{\color{tomato}{\hat{p}_2}})}{n_2}}$</font> |

## <a name="20samp-size">A Note About Sample Sizes</a>

---

-   For a **single mean**, we can use the CLT to construct a parametric confidence interval as long as:

  -   Either the population is symmetric or $n \geq 30$.
  -   If the sample is symmetric, we can assume the population is  symmetric.

-   For a **difference in two means**, we can use the CLT to construct a parametric confidence interval as long as:

  -   Population 1 is either symmetric or $n_1 \geq 30$, and
  -   Population 2 is either symmetric or $n_2 \geq 30$.

-   For a **single proportion**, we can use the CLT to construct a parametric confidence interval as long as:

  -   Both $n\hat{p} \geq 10$ and $n(1-\hat{p}) \geq 10$.

-   For a **difference in two proportions**, we can use the CLT to construct a parametric confidence interval as long as:

  -   All of $n_1\hat{p}_1 \geq 10$, $n_1(1-\hat{p}_1) \geq 10$,  $n_2\hat{p}_2 \geq 10$, and $n_2(1-\hat{p}_2) \geq 10$ are  satisfied.
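
The conditions for proportions are easy to check in code. The function `prop.sample.ok` below is a hypothetical helper written for this illustration; it accepts either a single count and sample size or vectors of counts and sample sizes.

```r
# Hypothetical helper: TRUE when every n*p.hat and n*(1 - p.hat) is at least 10
prop.sample.ok <- function(x, n, min.count = 10) {
  p.hat <- x / n
  all(n * p.hat >= min.count, n * (1 - p.hat) >= min.count)
}

# Single proportion from the vaccination poll
ok.poll <- prop.sample.ok(0.3 * 1227, 1227)

# Two proportions: "No" counts and totals for Democrats and Republicans
ok.party <- prop.sample.ok(c(40, 120), c(368, 292))

# A made-up small sample with only 4 successes out of 50
ok.small <- prop.sample.ok(4, 50)

c(ok.poll, ok.party, ok.small)
```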

## <a name="20r-functions">Useful R Functions</a>

---

In R, we have the functions:

-   `t.test()$conf.int` constructs a $t$-confidence interval for a single mean or a difference in two means.
-   `prop.test()$conf.int` constructs a score confidence interval for a   single proportion.
-   `prop.test()$conf.int` constructs a Wald confidence interval for a   difference in two proportions.

# <a name="20append">Appendix</a>

---

## <a name="20append-score">Deriving the Score Confidence Interval Formulas</a>

---

Let $X \sim \mbox{Binom}(n,p)$ and consider the distribution of sample
proportions, $\widehat{P} = \frac{X}{n}$. From the [CLT for proportions](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap3/11-Sampling-Dist-Prop.ipynb#11clt)
we know $\widehat{P} \sim N \left( p, \sqrt{\frac{p(1-p)}{n}} \right)$.
Thus, for confidence level CL, we have

$$P(-z_{\alpha/2}< Z < z_{\alpha/2}) = P \left( -z_{\alpha/2} < \frac{\hat{p} -{\color{tomato}{p}}}{\sqrt{\frac{{\color{tomato}{p}}(1-{\color{tomato}{p}})}{n}}} < z_{\alpha/2} \right) =CL.$$

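At the <font color="dodgerblue">**lower cutoff, $L$**</font>, the sample
proportion $\hat{p}$ lies $z_{\alpha/2}$ standard errors above ${\color{tomato}{p}}$, so $L$ is a value for ${\color{tomato}{p}}$ such that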

$$\dfrac{\hat{p} -{\color{tomato}{p}}}{\sqrt{\dfrac{{\color{tomato}{p}}(1-{\color{tomato}{p}})}{n}}} = z_{\alpha/2}.$$

To solve for ${\color{tomato}{p}}$, we multiply both sides of the
equation above by
$\sqrt{\dfrac{{\color{tomato}{p}}(1-{\color{tomato}{p}})}{n}}$ and then
square both sides giving

$$\big( \hat{p} - {\color{tomato}{p}} \big)^2 = (z_{\alpha/2})^2 \left( \frac{{\color{tomato}{p}}(1-{\color{tomato}{p}})}{n} \right).$$

Next we distribute terms on both sides of the equation and have

$$\hat{p}^2 - 2 {\color{tomato}{p}}\hat{p} + {\color{tomato}{p}}^2 = {\color{tomato}{p}} \left( \frac{z_{\alpha/2}^2}{n} \right) - {\color{tomato}{p}}^2 \left( \frac{z_{\alpha/2}^2}{n} \right).$$

We have a quadratic equation for the unknown ${\color{tomato}{p}}$. We
group all like terms together on one side of the equation,

$${\color{dodgerblue}{\left( 1+ \frac{z_{\alpha/2}^2}{n} \right)}} p^2 + {\color{tomato}{\left(-2\hat{p}-\frac{z_{\alpha/2}^2}{n} \right)}} p + {\color{mediumseagreen}{\hat{p}^2}} = {\color{dodgerblue}{a}}p^2 + {\color{tomato}{b}} p + {\color{mediumseagreen}{c}} = 0.$$

We use the quadratic formula to solve for $p$. The quadratic equation
has two real solutions. The larger of the two solutions is the upper
limit, $U$, for a score confidence interval:

$${\large \boxed{ U = \frac{ \hat{p} + \dfrac{z_{\alpha/2}^2}{2n} + z_{\alpha/2} \cdot \sqrt{ \dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z_{\alpha/2}^2}{4n^2}}}{1+\dfrac{z_{\alpha/2}^2}{n}}}}$$

The smaller of the two solutions is the
<font color="dodgerblue">**lower cutoff, $L$**</font>

$${\large \boxed{ L = \frac{ \hat{p} + \dfrac{z_{\alpha/2}^2}{2n} - z_{\alpha/2} \cdot \sqrt{ \dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z_{\alpha/2}^2}{4n^2}}}{1+\dfrac{z_{\alpha/2}^2}{n}}}}$$
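
For readers who want the intermediate step, the quadratic formula can be written out with the coefficients $a$, $b$, and $c$ identified above. To keep the expressions short, we write $z$ for $z_{\alpha/2}$ here (a shorthand used only in this step):

$$\begin{aligned}
p &= \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} = \frac{2\hat{p} + \dfrac{z^2}{n} \pm \sqrt{\left(2\hat{p} + \dfrac{z^2}{n}\right)^2 - 4\left(1 + \dfrac{z^2}{n}\right)\hat{p}^2}}{2\left(1 + \dfrac{z^2}{n}\right)}, \\
\\
b^2 - 4ac &= \frac{4z^2\,\hat{p}(1-\hat{p})}{n} + \frac{z^4}{n^2} = 4z^2 \left( \frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2} \right).
\end{aligned}$$

Taking the square root of the discriminant gives $2z \sqrt{ \frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}$, and dividing the numerator and denominator by $2$ yields the boxed expressions for $U$ and $L$.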

We also consider the equation

$$\dfrac{\hat{p} -{\color{tomato}{p}}}{\sqrt{\dfrac{{\color{tomato}{p}}(1-{\color{tomato}{p}})}{n}}} = -z_{\alpha/2}.$$

If we multiply both sides of the equation above by
$\sqrt{\dfrac{{\color{tomato}{p}}(1-{\color{tomato}{p}})}{n}}$ and then
square both sides, we get

$$\big( \hat{p} - {\color{tomato}{p}} \big)^2 = (-z_{\alpha/2})^2 \left( \frac{{\color{tomato}{p}}(1-{\color{tomato}{p}})}{n} \right).$$

The resulting equation is the same as in the first case because
$(-z_{\alpha/2})^2 = z_{\alpha/2}^2$. Thus, solving the equation above
gives the same expressions for $U$ and $L$.

## <a name="20append-corr">Applying a Continuity Correction to a Score Confidence Interval</a>

---

Recall, when using a normal distribution to approximate a discrete,
binomial distribution $X \sim \mbox{Binom}(n,p)$, we can improve the
estimate by using a [continuity
correction](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap3/11-Sampling-Dist-Prop.ipynb#11correct).

In the case of a score confidence interval for a proportion, the
continuity correction is applied as follows:

-   In the formula for the corrected lower cutoff $L^*$, we use the corrected sample proportion ${\color{dodgerblue}{\hat{p}_L^*= \dfrac{X-0.5}{n}}}$.
-   In the formula for the corrected upper cutoff $U^*$, we use the corrected sample proportion ${\color{tomato}{\hat{p}_U^* = \dfrac{X+0.5}{n}}}$.

$$\begin{aligned}
&L^* = \dfrac{{\color{dodgerblue}{\hat{p}_L^*}} + \dfrac{z_{\alpha/2}^2}{2n} - z_{\alpha/2} \cdot \sqrt{ \dfrac{{\color{dodgerblue}{\hat{p}_L^*}}(1-{\color{dodgerblue}{\hat{p}_L^*}})}{n} + \dfrac{z_{\alpha/2}^2}{4n^2}}}{1+ \dfrac{z_{\alpha/2}^2}{n}} \\
\\
&U^* = \dfrac{{\color{tomato}{\hat{p}_U^*}} + \dfrac{z_{\alpha/2}^2}{2n} + z_{\alpha/2} \cdot \sqrt{ \dfrac{{\color{tomato}{\hat{p}_U^*}}(1-{\color{tomato}{\hat{p}_U^*}})}{n} + \dfrac{z_{\alpha/2}^2}{4n^2}}}{1+ \dfrac{z_{\alpha/2}^2}{n}}
\end{aligned}$$
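
The corrected cutoffs $L^*$ and $U^*$ above can be computed the same way as the uncorrected ones. The following Python sketch is one possible implementation; the name `corrected_score_interval` and the example counts are hypothetical, and the code assumes $0 < X < n$ so that both corrected proportions stay between 0 and 1.

```python
from math import sqrt
from statistics import NormalDist


def corrected_score_interval(x, n, conf_level=0.95):
    """Score interval with continuity correction (hypothetical helper).

    Assumes 0 < x < n so that (x - 0.5)/n and (x + 0.5)/n lie in (0, 1).
    """
    alpha = 1 - conf_level
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z**2 / n

    def cutoff(p_star, sign):
        # Same formula as L and U, evaluated at a corrected proportion
        spread = z * sqrt(p_star * (1 - p_star) / n + z**2 / (4 * n**2))
        return (p_star + z**2 / (2 * n) + sign * spread) / denom

    p_lower_star = (x - 0.5) / n  # corrected proportion used for L*
    p_upper_star = (x + 0.5) / n  # corrected proportion used for U*
    return cutoff(p_lower_star, -1), cutoff(p_upper_star, +1)


# Illustrative values only (not real survey data)
lower_star, upper_star = corrected_score_interval(x=40, n=100)
print(f"Corrected 95% score interval: ({lower_star:.4f}, {upper_star:.4f})")
```

Because the lower cutoff uses a smaller proportion and the upper cutoff uses a larger one, the corrected interval is slightly wider than the uncorrected score interval for the same data.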

## <a name="20append-clt">CLT: Difference in Two Proportions</a>

---

Let $X_1 \sim \mbox{Binom}(n_1,p_1)$ and
$X_2 \sim \mbox{Binom}(n_2,p_2)$ be two independent binomial random
variables with corresponding sample proportions
$\widehat{P}_1 = \frac{X_1}{n_1}$ and $\widehat{P}_2 = \frac{X_2}{n_2}$,
respectively. As long as [both samples are large
enough](#20samp-size), the sampling distribution for the difference in
sample proportions $\widehat{P}_1-\widehat{P}_2$ will:

-   Be approximately <font color="dodgerblue">**normally distributed**</font>.
-   Have mean ${\color{dodgerblue}{E(\widehat{P}_1-\widehat{P}_2) = \mu_{\widehat{P}_1 - \widehat{P}_2} =p_1 - p_2}}$.
-   Have <font color="dodgerblue">**standard error**</font>   $${\color{dodgerblue}{\mbox{SE}(\widehat{P}_1 - \widehat{P}_2) =  \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}}}.$$

We summarize the results of the <font color="dodgerblue">**Central
Limit Theorem (CLT) for a Difference in Two Proportions**</font> more
concisely below:

$${\color{dodgerblue}{\boxed{ \widehat{P}_1 - \widehat{P}_2  \sim N \left( \mu_{\widehat{P}_1 - \widehat{P}_2} , \mbox{SE}(\widehat{P}_1 - \widehat{P}_2) \right) = N \left( p_1 - p_2  , \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \right)}}}.$$
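
The boxed summary can be checked empirically with a small simulation. The Python sketch below uses made-up values $n_1 = 100$, $p_1 = 0.6$, $n_2 = 150$, $p_2 = 0.5$ and a hypothetical helper `simulate_differences`. It draws many pairs of independent binomial samples and compares the mean and standard deviation of the simulated differences with $p_1 - p_2$ and the standard error formula.

```python
import random
from math import sqrt
from statistics import mean, stdev


def simulate_differences(n1, p1, n2, p2, reps=4000, seed=1):
    """Simulate reps values of P1_hat - P2_hat from independent samples."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    diffs = []
    for _ in range(reps):
        # Each binomial count is a sum of independent Bernoulli trials
        x1 = sum(rng.random() < p1 for _ in range(n1))
        x2 = sum(rng.random() < p2 for _ in range(n2))
        diffs.append(x1 / n1 - x2 / n2)
    return diffs


# Illustrative parameter values only
n1, p1, n2, p2 = 100, 0.6, 150, 0.5
diffs = simulate_differences(n1, p1, n2, p2)

theory_mean = p1 - p2
theory_se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

print(f"Simulated mean: {mean(diffs):.4f}   Theoretical mean: {theory_mean:.4f}")
print(f"Simulated SD:   {stdev(diffs):.4f}   Theoretical SE:   {theory_se:.4f}")
```

The simulated values should land close to the theoretical ones, and the agreement typically improves as `reps` increases.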

### <a name="20append-proof">Proof of CLT for Difference in Two Proportions</a>

---

Below we prove both $E(\widehat{P}_1-\widehat{P}_2) = \mu_{\widehat{P}_1 - \widehat{P}_2} =p_1 - p_2$ and $\mbox{SE}(\widehat{P}_1 - \widehat{P}_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$.

<br>

In proving the claims about a difference in proportions, we use the results of the Central Limit Theorem for a single proportion. If $X \sim \mbox{Binom}(n,p)$ and $\widehat{P} = \frac{X}{n}$, then

$$E \left( \widehat{P} \right) = p \qquad \mbox{and} \qquad \mbox{Var} \left( \widehat{P} \right) = \frac{p(1-p)}{n}.$$




#### <a name="20append-proof1">Expected Value of Sampling Distribution for $\widehat{P}_1 - \widehat{P}_2$</a>

---


<br>

Using [properties of
expected value](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap2/06-Expected-Value-and-Variance.ipynb#06prop-ex), we have


<br>

$$E \left( \widehat{P}_1 - \widehat{P}_2 \right) = E \left( \widehat{P}_1 \right) - E \left( \widehat{P}_2 \right) = p_1 - p_2.$$




#### <a name="20append-proof2">Standard Error of Sampling Distribution for $\widehat{P}_1 - \widehat{P}_2$</a>

---

Since $X_1$ and $X_2$ are independent, so are $\widehat{P}_1$ and $\widehat{P}_2$. Using [properties of
variance](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap2/06-Expected-Value-and-Variance.ipynb#06prop-var), we have

$$\mbox{Var} \left( \widehat{P}_1 - \widehat{P}_2 \right) = \mbox{Var} \left( \widehat{P}_1 \right) + (-1)^2 \cdot \mbox{Var} \left( \widehat{P}_2 \right) = \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}.$$

Therefore, the standard error for the sampling distribution of the difference in sample proportions is

$$\mbox{SE}\left( \widehat{P}_1 - \widehat{P}_2 \right) = \sqrt{ \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}.$$


# <a name="20CC License">Creative Commons License Information</a>
---

![Creative Commons
License](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)

*Statistical Methods: Exploring the Uncertain* by [Adam
Spiegler (University of Colorado Denver)](https://github.com/CU-Denver-MathStats-OER/Statistical-Theory)
is licensed under a [Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International
License](http://creativecommons.org/licenses/by-nc-sa/4.0/). This work is funded by an [Institutional OER Grant from the Colorado Department of Higher Education (CDHE)](https://cdhe.colorado.gov/educators/administration/institutional-groups/open-educational-resources-in-colorado).

For similar interactive OER materials in other courses funded by this project in the Department of Mathematical and Statistical Sciences at the University of Colorado Denver, visit <https://github.com/CU-Denver-MathStats-OER>.