# Chapter 9 Inferences for Proportions and Count Data

In [10]:
import math
import polars as pl
from polars import col, lit
from scipy import stats
import numpy as np

RNG = np.random.default_rng()

## 9.1 Inferences on Proportion

This chapter begins with inference procedures for an unknown proportion $p$ in a Bernoulli population. The sample proportion $\hat{p}$ from a random sample of size $n$ is an unbiased estimate of $p$. Inferences on $p$ are based on the central limit theorem (CLT) result that for large $n$, the sample proportion $\hat{p}$ is approximately normal with mean = $p$ and standard deviation = $\sqrt{pq/n}$, where $q = 1 - p$. A large sample two-sided 100(1 - $\alpha$)% confidence interval for $p$ is given by

$$
\left[ \hat{p} \pm z_{\alpha /2} \sqrt{\frac{\hat{p} \hat{q}}{n}}\;\right]
$$

where $\hat{q} = 1 - \hat{p}$ and $z_{\alpha/2}$ is the upper $\alpha/2$ critical point of the standard normal distribution. A large sample test of $H_0: p = p_0$ can be based on the test statistic

$$
z = \frac{\hat{p} - p_0}{\sqrt{\hat{p}\hat{q}/n}} \quad \text{or} \quad 
z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}.
$$

Here $q_0 = 1 - p_0$. Both statistics are asymptotically standard normal under $H_0$.
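
The two formulas above can be sketched as one small helper. This is a minimal illustration, not part of the original solutions: the function name `one_proportion_ztest` and the data (60 successes in 100 trials) are hypothetical. The interval uses $\hat{p}$ in the standard error, and the test uses the null value $p_0$.

```python
import math
from scipy import stats

def one_proportion_ztest(k, n, p0, alpha=0.05):
    """Large-sample CI and two-sided z-test for a single proportion."""
    p_hat = k / n
    z_crit = stats.norm.ppf(1 - alpha / 2)
    # Confidence interval: standard error estimated from p_hat
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
    ci = (p_hat - z_crit * se_hat, p_hat + z_crit * se_hat)
    # Test statistic: standard error computed under H0: p = p0
    se_null = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se_null
    p_value = 2 * stats.norm.sf(abs(z))
    return ci, z, p_value

# Hypothetical data: 60 successes in 100 trials, testing H0: p = 0.5
ci, z, p_value = one_proportion_ztest(k=60, n=100, p0=0.5)
print(ci, z, p_value)
```

With these made-up numbers the statistic is $z = 2$, so the two-sided $P$-value falls just under .05.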

### Ex 9.1

A business journal publisher plans to survey a sample of the subscribers to estimate the proportion $p$ with annual household incomes over \$100,000.

#### (a)

How many subscribers must be surveyed to obtain a 99% CI for $p$ with a margin of error no greater than 0.05? Assume that no prior estimate of $p$ is available.

✍️ The margin of error

$$
E = z_{\alpha/2} \sqrt{\frac{p q}{n}} \text{.}
$$

Therefore,

$$
n = \frac{z_{\alpha/2}^2 \, p q}{E^2} \text{.}
$$

Because no previous estimate of $p$ is available, we use $1/2$ as a conservative estimate.

In [7]:
α = 1 - 0.99
p = 1/2
n = stats.norm.ppf(1-α/2)**2 * p * (1-p)/ 0.05**2
print(np.ceil(n))

664.0


So 664 subscribers should be surveyed.

#### (b)

The marketing department thinks that $p$ = 0.30 would be a reasonable guess. What is the corresponding sample size?

In [10]:
α = 1 - 0.99
p = 0.3
n = stats.norm.ppf(1-α/2)**2 * p * (1-p)/ 0.05**2
print(np.ceil(n))

558.0


#### (c)

Refer to the sample size obtained in (b). If a 40% nonresponse rate is anticipated, how many surveys need to be mailed? How may such a high nonresponse rate cause bias in the estimate?

In [13]:
non_response_rate = 0.4
mails = np.ceil(n) / (1 - non_response_rate)  # start from the rounded-up sample size (558)
print(np.ceil(mails))

930.0


Nonresponse causes bias when respondents differ systematically from nonrespondents. For example, if higher-income households are more likely to respond, the sample proportion will overestimate $p$.

### Ex 9.2

While imprisoned by the Germans during World War II, the English mathematician John Kerrich tossed a coin 10,000 times and obtained 5067 heads. Let $p$ be the probability of a head on a single toss. We wish to check if the data are consistent with the hypothesis that the coin was fair.

#### (a)
Set up the hypotheses. Why should the alternative be two-sided?

✍️

$$
\begin{align*}
H_0: & p = 1/2 \quad \text{vs.} \\
H_1: & p \ne 1/2 \text{.}
\end{align*}
$$

$H_1$ is two-sided because if the coin is not fair, $p$ could be either $> 1/2$ or $< 1/2$.

#### (b)
Calculate the $P$-value. Can you reject $H_0$ at the .05 level?

In [28]:
n = 10_000
p = 1/2
q = 1 - p
z = (5067/n - p)/np.sqrt(p*q/n)
p_val = 2 * stats.norm.sf(abs(z))  # two-sided P-value; abs() keeps it valid if z is negative
print(p_val)

0.18024534492890254


Because 0.18 > 0.05, we cannot reject $H_0$ at the .05 level.

#### (c)
Find a 95% CI for the proportion of heads for Kerrich's coin.

In [27]:
α = 1 - 0.95
margin_of_error = float(stats.norm.ppf(1-α/2) * np.sqrt(p*q/n))
ci = (5067/n - margin_of_error, 5067/n + margin_of_error)
print(ci)

(0.49690018007729975, 0.5164998199227003)
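
The interval above plugs the null value $p = 1/2$ into the standard error, whereas the large-sample formula in the chapter introduction uses $\hat{p}$. A minimal sketch of the $\hat{p}$-based version follows. The variable names are illustrative. Because $\hat{p} = 0.5067$ is so close to 1/2, the two intervals should agree to about four decimal places.

```python
import math
from scipy import stats

heads, tosses = 5067, 10_000
p_hat = heads / tosses

# 95% interval with the standard error estimated from p_hat
z_crit = stats.norm.ppf(0.975)
margin = z_crit * math.sqrt(p_hat * (1 - p_hat) / tosses)
ci_hat = (p_hat - margin, p_hat + margin)
print(ci_hat)
```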


### Ex 9.3

Calls to technical support service of a software company are monitored on a sampling basis for quality assurance. Each monitored call is classified as satisfactory or unsatisfactory by the supervisor in terms of the quality of help offered. A random sample of 100 calls was monitored over one month for a new trainee; 8 calls were classified as unsatisfactory.

#### (a)
Calculate a 95% CI for the actual proportion of unsatisfactory calls during the month.
Use both formulas (9.1) and (9.3) and compare the results.

✍️ Formula (9.1):

$$
\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} \le p \le \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} 
$$

In [29]:
α = 1 - 0.95
n = 100
p = 8/100
q = 1 - p
margin_of_error = float(stats.norm.ppf(1-α/2) * np.sqrt(p*q/n))
ci = (p - margin_of_error, p + margin_of_error)
print(ci)

(0.02682751000723329, 0.1331724899927667)


Formula (9.3):

$$
\frac{\hat{p} + \frac{z^2}{2n} - \sqrt{\frac{\hat{p}\hat{q}z^2}{n} + \frac{z^4}{4n^2}}}{1 + \frac{z^2}{n}} 
\le p \le  
\frac{\hat{p} + \frac{z^2}{2n} + \sqrt{\frac{\hat{p}\hat{q}z^2}{n} + \frac{z^4}{4n^2}}}{1 + \frac{z^2}{n}} 
$$

where $z$ = $z_{\alpha/2}$.

In [34]:
α = 1 - 0.95
z = stats.norm.ppf(1-α/2)
n = 100
p = 8/100
q = 1 - p
denom = 1 + z**2/n
margin = np.sqrt(p*q*z**2/n + z**4/(4*n**2)) / denom
mid = (p + z**2/(2*n)) / denom 
ci = (float(mid - margin), float(mid + margin))
print(ci)

(0.04109346148438062, 0.14998107700948735)


Formula (9.3) gives an interval of similar width that is shifted to the right of the one from formula (9.1) by about 0.015, so it stays further away from 0.
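
Formula (9.3) is the Wilson score interval, which SciPy exposes through `proportion_ci(method='wilson')` on the result of `stats.binomtest`. As a cross-check, the sketch below should reproduce the hand-computed limits of roughly (0.041, 0.150). It assumes a SciPy version that provides `binomtest`.

```python
from scipy import stats

# 8 unsatisfactory calls out of 100 monitored calls
result = stats.binomtest(k=8, n=100)

# Wilson score interval without continuity correction
wilson = result.proportion_ci(confidence_level=0.95, method='wilson')
print(wilson.low, wilson.high)
```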

#### (b) 
This CI is used to test $H_0: p = 0.10$ vs. $H_1: p \ne 0.10$. If $H_0$ is not rejected, then monitoring of the trainee is continued at the same frequency; if $H_0$ is rejected in the lower tail, then monitoring frequency is reduced; and if $H_0$ is rejected in the upper tail, then the trainee is provided additional training. Based on the CI calculated in (a), what action should be taken on this trainee?

✍️ Because 0.1 is contained in the CI (for both formulas), $H_0$ is not rejected. Therefore monitoring will continue at the same frequency.

### Ex 9.4

The fraction defective in a high volume production process is to be estimated using a 95% CI with a margin of error of 0.2%.

#### (a) 
If the a priori guess at the fraction defective is 1%, how many parts should be sampled? Compare this number with the sample size that you would need if no a priori information regarding the true fraction defective is assumed.

✍️ 

$$
n = \frac{z_{\alpha/2}^2 \, p q}{E^2}
$$

In [43]:
α = 1 - 0.95
p = np.array([0.01, 0.5])
n = stats.norm.ppf(1-α/2)**2 * p * (1-p)/ 0.002**2
print(np.ceil(n))

[  9508. 240092.]

With the prior guess $p$ = 1%, about 9,508 parts are needed. With no prior information (the conservative $p$ = 1/2), about 240,092 parts are needed, roughly 25 times as many.


#### (b)
One problem with estimating a very low fraction defective is that no defectives may be obtained in the sample, making it impossible to calculate a CI. What sampling method would you use to ensure that there will be sufficient number of defectives in the sample to provide reliable information on the true fraction defective?

✍️ Use inverse (sequential) sampling: instead of fixing the sample size in advance, keep sampling until a prespecified number of defectives has been observed. The sample size is then random, but the sample is guaranteed to contain enough defectives to provide reliable information on the proportion of defectives.
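
A small simulation can illustrate sampling until a fixed number of defectives is reached. Everything here is an assumption made for illustration: the true fraction `p_true = 0.01`, the stopping count `r = 20`, and the estimator $(r-1)/(N-1)$, which is the standard unbiased estimator of $p$ when $N$ is the total number sampled up to the $r$th defective.

```python
import numpy as np

rng = np.random.default_rng(0)

p_true = 0.01   # hypothetical true fraction defective
r = 20          # hypothetical stopping rule: sample until 20 defectives are seen

# negative_binomial returns the number of non-defectives drawn before the r-th defective
non_defectives = rng.negative_binomial(r, p_true, size=10_000)
total_sampled = non_defectives + r

# Unbiased estimator of p under this stopping rule
p_est = (r - 1) / (total_sampled - 1)
print(total_sampled.mean(), p_est.mean())
```

The average sample size should come out near $r/p = 2000$, and the average estimate near the assumed 1%.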

### Ex 9.5

A quarterback from a Big Ten college football team worked to improve his proportion of completed passes. His career average had been 46.5% completed passes. His record halfway into the new season is 82 completed passes out of 151 attempted.

#### (a) 
Set up the hypotheses to test whether his proportion of completed passes has improved. Should the alternative be one-sided or two-sided? Explain.

✍️ Because we are looking only for signs of improvement, a one-sided alternative should be used.

$$
H_0: p = p_0 = 46.5\% \quad \text{vs.} \quad H_1: p > p_0 
$$

#### (b) 
Perform a test at $\alpha$ = .05. Is there a significant improvement?

In [49]:
k, n = 82, 151
p0 = 0.465
z = (k/n - p0) / np.sqrt(p0*(1-p0)/n)
pvalue = stats.norm.sf(z)
print(pvalue)

0.027251575849474706


Or use the exact, binomial test:

In [47]:
test = stats.binomtest(k=82, n=151, p=0.465, alternative='greater')
print(test.pvalue)

0.032948694746468936


Both methods give a $P$-value below $\alpha$ = .05, so yes, there is a significant improvement.

#### (c) 
At least how many passes out of 151 should he have completed in order to demonstrate significant improvement at $\alpha$ = .025?

✍️ The $z$ statistic should be at least

In [51]:
α = .025
z = stats.norm.ppf(1-α)
print(z)

1.959963984540054


Therefore,

In [53]:
k = n * (z * np.sqrt(p0*(1-p0)/n) + p0)
print(np.ceil(k))

83.0


At $\alpha$ = .025, he should complete 83 passes or more.
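
As a check on the algebra, the threshold can also be found by direct search: compute the one-sided $P$-value for each possible number of completions and take the smallest count that reaches significance. This is a sketch with illustrative names (`upper_p_value`, `k_min`), using the same normal approximation as above.

```python
import math
from scipy import stats

n_passes, p_null, level = 151, 0.465, 0.025
se_null = math.sqrt(p_null * (1 - p_null) / n_passes)

def upper_p_value(k):
    """One-sided (upper-tail) P-value of the z-test for k completed passes."""
    return stats.norm.sf((k / n_passes - p_null) / se_null)

# Smallest number of completions whose P-value is at most the level
k_min = next(k for k in range(n_passes + 1) if upper_p_value(k) <= level)
print(k_min, upper_p_value(k_min - 1), upper_p_value(k_min))
```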

### Ex 9.6

A blood test intended to identify patients at "high risk" of cardiac disease gave positive results on 80 out of 100 known cardiac patients, but also on 16 out of 200 known normal patients.

#### (a) 
Find a 90% CI for the sensitivity of the test, which is defined as the probability that a cardiac patient is correctly identified.

In [3]:
α = 1 - 0.9
n = 100
p = 80/100
margin = float(stats.norm.ppf(1-α/2) * np.sqrt(p*(1-p)/n))
ci = (p - margin, p + margin)
print(ci)

(0.7342058549219411, 0.865794145078059)


Compare it with the exact (Clopper-Pearson) interval, which is the default method of `proportion_ci`:

In [14]:
(
    stats.binomtest(k=80, n=100, alternative='two-sided')
    .proportion_ci(confidence_level=0.9)
)

ConfidenceInterval(low=0.7227997503290864, high=0.8633386747541124)

#### (b) 
Find a 90% CI for the specificity of the test, which is defined as the probability that a normal patient is correctly identified.

In [15]:
(
    stats.binomtest(k=200-16, n=200, alternative='two-sided')
    .proportion_ci(confidence_level=0.9)
)

ConfidenceInterval(low=0.8810282256716788, high=0.9491782829700732)

### Ex 9.7

People at high risk of sudden cardiac death can be identified using the change in a signal averaged electrocardiogram before and after prescribed activities. The current method is about 80% accurate. The method was modified, hoping to improve its accuracy. The new method was tested on 50 people and gave correct results for 46 of them. Is this convincing evidence that the new method is more accurate?

#### (a) 
Set up the hypotheses to test that the accuracy of the new method is better than that of the current method.

✍️ 

$$
H_0: p = 0.8 \quad \text{vs.} \quad H_1: p > 0.8 .
$$

#### (b) 
Perform a test of the hypotheses at $\alpha$ = .05. What do you conclude about the accuracy of the new method?

In [9]:
n = 50
p0 = 0.8
p = 46/50
z = (p - p0) / np.sqrt(p0*(1-p0)/n)
pval = stats.norm.sf(z)
print(pval)

0.016947426762344633


In [8]:
test = stats.binomtest(k=46, n=50, p=0.8, alternative='greater')
print(test.pvalue)

0.018496015060209342


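Both the normal-approximation and the exact binomial $P$-values are below .05, so $H_0$ is rejected: the new method is significantly more accurate.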

### Ex 9.8

Refer to the previous exercise.

#### (a) 
If the new method actually has 90% accuracy, what power does a sample of 50 have to demonstrate that the new method is better, using a .05-level test?

✍️ When the true accuracy is $p_1$, $\frac{\hat{p} - p_1}{\sqrt{p_1 q_1 / n}}$ is approximately $N(0,1)$ for large $n$, so the power is
$$
\begin{align*}
\pi &= \mathrm{P}\left\{\frac{\hat{p} - p_0}{\sqrt{\frac{p_0 q_0}{n}}} > z_{\alpha}\right\}\\
&= \mathrm{P}\left\{ \frac{\hat{p} - p_1}{\sqrt{\frac{p_1 q_1}{n}}} > 
    z_{\alpha} \sqrt{\frac{p_0 q_0}{p_1 q_1}} + \frac{p_0 - p_1}{\sqrt{\frac{p_1 q_1}{n}}} \right\} \\
&= 1-\Phi\left( z_{\alpha} \sqrt{\frac{p_0 q_0}{p_1 q_1}} + \frac{p_0 - p_1}{\sqrt{\frac{p_1 q_1}{n}}}\right) \\
&= \Phi\left( \frac{(p_1 - p_0) \sqrt{n} - z_\alpha \sqrt{p_0 q_0}}{\sqrt{p_1 q_1}} \right) \text{.}
\end{align*}
$$

In [17]:
α = .05
p0 = 0.8
p1 = 0.9
n = 50
z = ((p1-p0)*math.sqrt(n) - stats.norm.ppf(1-α)*math.sqrt(p0*(1-p0))) / math.sqrt(p1*(1-p1))
power = float(stats.norm.cdf(z))
print(power)

0.5650889396286685


#### (b) 
How many patients should be tested in order for this power to be at least 0.75?

✍️ Using the result from (a) and letting $\pi = 0.75 = 1-\beta$, we have

$$
\frac{(p_1 - p_0) \sqrt{n} - z_\alpha \sqrt{p_0 q_0}}{\sqrt{p_1 q_1}} = z_{\beta} \text{,}
$$

therefore,

$$
n = \left( \frac{z_\beta \sqrt{p_1 q_1} + z_\alpha \sqrt{p_0 q_0}}{p_1 - p_0} \right)^2 \text{.}
$$

In [18]:
β = 1 - 0.75
n = ((stats.norm.ppf(1-β)*math.sqrt(p1*(1-p1)) + stats.norm.ppf(1-α)*math.sqrt(p0*(1-p0)))/(p1 - p0))**2
print(np.ceil(n))

75.0
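
To confirm the rounding, the power formula from (a) can be evaluated at the neighbouring sample sizes. The helper name `power_one_sided` is illustrative. It is a direct transcription of the last line of the derivation in (a).

```python
import math
from scipy import stats

def power_one_sided(n, p0, p1, alpha):
    """Normal-approximation power of the upper one-sided z-test of H0: p = p0 at p = p1."""
    z_alpha = stats.norm.ppf(1 - alpha)
    numerator = (p1 - p0) * math.sqrt(n) - z_alpha * math.sqrt(p0 * (1 - p0))
    return float(stats.norm.cdf(numerator / math.sqrt(p1 * (1 - p1))))

for n_try in (50, 74, 75):
    print(n_try, power_one_sided(n_try, 0.8, 0.9, 0.05))
```

The power at $n$ = 74 should fall just short of 0.75 and the power at $n$ = 75 should reach it.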


### Ex 9.9

A preelection poll is to be planned for a senatorial election between two candidates. Previous polls have shown that the election is hanging in delicate balance. If there is a shift (in either direction) by more than 2 percentage points since the last poll, then the polling agency would like to detect it with probability of at least 0.80 using a .05-level test. Determine how many voters should be polled. If actually 2500 voters are polled, what is the value of this probability?

✍️ The true proportion has shifted either above 52% or below 48%, but not both, so the calculation is essentially the same as in Ex 9.8 with $z_{\alpha/2}$ substituted for $z_\alpha$ (the chance of rejecting in the wrong tail is negligible). The formula for $n$ becomes

$$
n = \left( \frac{z_\beta \sqrt{p_1 q_1} + z_{\alpha/2} \sqrt{p_0 q_0}}{p_1 - p_0} \right)^2 \text{.}
$$

Note that the result is the same for $p_1 = 1/2 + \delta$ and $p_1 = 1/2 - \delta$, because both $p_1 q_1$ and $(p_1 - p_0)^2$ are unchanged.

In [22]:
β = 1 - 0.8
α = .05
p1 = 0.52 # or 0.48, the result will be the same.
p0 = 0.5
n = ((stats.norm.ppf(1-β)*math.sqrt(p1*(1-p1)) + stats.norm.ppf(1-α/2)*math.sqrt(p0*(1-p0)))/(p1 - p0))**2
print(np.ceil(n))

4904.0


If 2500 voters are actually polled, the power (with probabilities evaluated at $p = p_1$) is

$$
\begin{align*}
\pi &= \mathrm{P}\left\{\frac{\hat{p} - p_0}{\sqrt{\frac{p_0 q_0}{n}}} > z_{\alpha/2}\right\}
    + \mathrm{P}\left\{\frac{\hat{p} - p_0}{\sqrt{\frac{p_0 q_0}{n}}} < -z_{\alpha/2}\right\}\\
&= \Phi\left( \frac{(p_1 - p_0) \sqrt{n} - z_{\alpha/2} \sqrt{p_0 q_0}}{\sqrt{p_1 q_1}} \right)
    + \Phi\left( \frac{(p_0 - p_1) \sqrt{n} - z_{\alpha/2} \sqrt{p_0 q_0}}{\sqrt{p_1 q_1}} \right) \text{.}
\end{align*}
$$

Again, for $p_1 = 1/2 \pm \delta$, the result would be the same.

In [28]:
α = .05
p0 = 0.5
p1 = 0.52 # or 0.48, the result will be the same
n = 2500
z1 = ((p1-p0)*np.sqrt(n) - stats.norm.ppf(1-α/2)*np.sqrt(p0*(1-p0))) / np.sqrt(p1*(1-p1))
z2 = ((p0-p1)*np.sqrt(n) - stats.norm.ppf(1-α/2)*np.sqrt(p0*(1-p0))) / np.sqrt(p1*(1-p1))
power = float(stats.norm.cdf(z1) + stats.norm.cdf(z2))
print(power)

0.5160175620301128


## 9.2 Inferences for Comparing Two Proportions

Next we consider the problem of comparing two Bernoulli proportions, $p_1$ and $p_2$, based on two independent random samples of sizes $n_1$ and $n_2$. The basis for inferences on $p_1 - p_2$ is the result that for large $n_1$ and $n_2$, the difference in the sample proportions, $\hat{p}_1 - \hat{p}_2$, is approximately normal with mean = $p_1 - p_2$ and standard deviation = $\sqrt{p_1 q_1 / n_1 + p_2 q_2 / n_2}$. A large sample two-sided 100(1 - $\alpha$)% confidence interval for $p_1 - p_2$ is given by

$$
\left[ \hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}\; \right].
$$

A large sample two-sided $z$-test can be used to test $H_0: p_1 = p_2$ vs. $H_1: p_1 \ne  p_2$ by using the test statistic

$$
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}}
\quad \text{or} \quad 
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
$$


where $\hat{p} = (n_1\hat{p}_1 + n_2\hat{p}_2)/(n_1 + n_2)$ is the pooled sample proportion. Small sample tests to compare $p_1$ and $p_2$ are also given for independent samples (Fisher's exact test) and matched pairs designs (McNemar's test).

### Ex 9.10

To gauge a change in opinion regarding the public view on bilingual education, a telephone poll was taken in September 1993 and again in September 1995. The results based on the survey of 1000 American adults contacted in each poll were that 40% from the 1993 poll and 48% from the 1995 poll favored teaching all children in English over bilingual alternatives. Has there been a significant change in opinion? Answer by doing a two-sided test for the significance of the difference in two proportions at $\alpha$ = .05. Why is a two-sided alternative appropriate here?
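
One way to approach this exercise is the pooled two-sided $z$-test from the section introduction. The sketch below assumes the reported percentages are exact (400 and 480 of 1000 respondents) and that the two polls are independent samples. The variable names are illustrative. A two-sided alternative is appropriate because opinion could have shifted in either direction.

```python
import math
from scipy import stats

n1 = n2 = 1000
p1_hat, p2_hat = 0.40, 0.48   # 1993 and 1995 sample proportions

# Pooled proportion and standard error under H0: p1 = p2
p_pool = (n1 * p1_hat + n2 * p2_hat) / (n1 + n2)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z = (p1_hat - p2_hat) / se_pool
p_value = 2 * stats.norm.sf(abs(z))
print(z, p_value)
```

Under these assumptions $|z|$ is about 3.6, well beyond the critical value 1.96, which points to a significant change in opinion at $\alpha$ = .05.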

### Ex 9.11

A high school had 17 students receive National Merit recognition (semifinalist or commendation) out of 482 seniors in 1992 and 29 students out of 503 seniors in 1995. Does this represent a significant change in the proportion recognized at this school? Answer by doing a two-sided test for the significance of the difference in two proportions at $\alpha$ = .10. Why is a two-sided alternative appropriate here?
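
The same pooled $z$-test can be sketched for this exercise, together with the unpooled 90% confidence interval for $p_1 - p_2$. The variable names are illustrative, and the two senior classes are treated as independent samples. A two-sided alternative is appropriate because the proportion recognized could have moved in either direction.

```python
import math
from scipy import stats

k1, n1 = 17, 482   # 1992: recognized students, seniors
k2, n2 = 29, 503   # 1995: recognized students, seniors
p1_hat, p2_hat = k1 / n1, k2 / n2
diff = p1_hat - p2_hat

# Pooled two-sided z-test of H0: p1 = p2
p_pool = (k1 + k2) / (n1 + n2)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = diff / se_pool
p_value = 2 * stats.norm.sf(abs(z))

# Unpooled 90% confidence interval for p1 - p2
se_unpooled = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
margin = stats.norm.ppf(0.95) * se_unpooled
ci = (diff - margin, diff + margin)
print(z, p_value, ci)
```

Here the $P$-value comes out only slightly below .10, so the change is significant at $\alpha$ = .10 but only marginally.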

## 9.3 Inferences for One-way Count Data

A generalization of the test on the binomial proportion $p$ is a test on the cell probabilities of a multinomial distribution. Based on a random sample of size $n$ from a $c$-cell multinomial distribution (one-way count data) with cell probabilities $p_1, p_2, \ldots, p_c$, the test of

$$
H_0: p_1 = p_{10},\, p_2 = p_{20},\, \ldots, \, p_c = p_{c0} \quad \text{vs.} \quad
H_1: \text{At least one}\, p_i \ne p_{i0}
$$

is based on the **chi-square statistic** having the general form:

$$
\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}
$$

where "observed" refers to the observed cell counts $n_i$ and "expected" refers to the expected cell counts $e_i = n p_{i0}$ under $H_0$. The degrees of freedom (d.f.) of the chi-square statistic are $c$ - 1. The primary use of this statistic is for the **goodness of fit** test of a specified distribution to a set of data. If any parameters of the distribution are estimated from the data, then one d.f. is deducted for each independent estimated parameter from the total d.f. $c$ - 1.
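
As an illustration of the goodness of fit test, the sketch below checks whether a die is fair using `scipy.stats.chisquare`. The observed counts are hypothetical and chosen only so that the arithmetic is easy to follow. No parameters are estimated, so the d.f. are $c - 1 = 5$.

```python
import numpy as np
from scipy import stats

# Hypothetical counts from 120 rolls of a die
observed = np.array([18, 22, 16, 25, 19, 20])
p_null = np.full(6, 1 / 6)              # H0: all faces equally likely
expected = observed.sum() * p_null      # e_i = n * p_i0 = 20 for each face

chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
critical = stats.chi2.ppf(0.95, df=len(observed) - 1)
print(chi2_stat, p_value, critical)
```

For these made-up counts the statistic is $50/20 = 2.5$, far below the .05-level critical value, so fairness would not be rejected.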

## 9.4 Inferences for Two-way Count Data

Two-way count data result when

1. a single sample is cross-classified based on two categorical variables into $r$ rows and $c$ columns (**multinomial sampling**), or 
2. independent samples are drawn from $r$ multinomial distributions with the same $c$ categories
(**product multinomial sampling**). 

In both cases, the data are summarized in the form of an $r \times c$ **contingency table** of counts. In case (1), the null hypothesis of interest is the **independence hypothesis** between the row and column variables; in case (2), it is the **homogeneity hypothesis**. In both cases, the chi-square statistic has the same general form given above, with the expected count for the $(i, j)$th cell (under $H_0$) being the $i$th row total times the proportion of all observations falling in the $j$th column. The d.f. of the chi-square statistic equal $(r - 1)(c - 1)$. Thus association between the row and the column variable is demonstrated at level $\alpha$ if $\chi^2 > \chi^2_{(r-1)(c-1), \alpha}$.
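
The expected counts and the test can be sketched with `scipy.stats.chi2_contingency`. The $2 \times 3$ table below is hypothetical. Passing `correction=False` gives the plain chi-square statistic described above.

```python
import numpy as np
from scipy import stats

# Hypothetical 2 x 3 contingency table of counts
table = np.array([[30, 20, 10],
                  [20, 30, 40]])

# Expected count = row total * (column total / grand total)
row_totals = table.sum(axis=1, keepdims=True)
col_totals = table.sum(axis=0, keepdims=True)
expected_by_hand = row_totals * col_totals / table.sum()

chi2_stat, p_value, dof, expected = stats.chi2_contingency(table, correction=False)
critical = stats.chi2.ppf(0.95, dof)
print(chi2_stat, p_value, dof, critical)
```

The `expected` array returned by SciPy should match `expected_by_hand`, and `dof` should equal $(r-1)(c-1) = 2$.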