# Chapter 9 Inferences for Proportions and Count Data

In [1]:
import polars as pl
from polars import col, lit
from scipy import stats
import numpy as np

RNG = np.random.default_rng()

## 9.1 Inferences on Proportion

This chapter begins with inference procedures for an unknown proportion $p$ in a Bernoulli population. The sample proportion $\hat{p}$ from a random sample of size $n$ is an unbiased estimate of $p$. Inferences on $p$ are based on the central limit theorem (CLT) result that for large $n$, the sample proportion $\hat{p}$ is approximately normal with mean = $p$ and standard deviation = $\sqrt{pq/n}$, where $q = 1 - p$. A large sample two-sided 100(1 - $\alpha$)% confidence interval for $p$ is given by

$$
\left[ \hat{p} \pm z_{\alpha /2} \sqrt{\frac{\hat{p} \hat{q}}{n}}\;\right]
$$

where $\hat{q}$ = 1 - $\hat{p}$ and $z_{\alpha/2}$ is the upper $\alpha/2$ critical point of the standard normal distribution. A large sample test on $p$ to test $H_0: p = p_0$ can be based on the test statistic

$$
z = \frac{\hat{p} - p_0}{\sqrt{\hat{p}\hat{q}/n}} \quad \text{or} \quad 
z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}.
$$

Both these statistics are asymptotically standard normal under $H_0$.
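The interval and the second (null standard error) form of the test statistic above can be sketched in a few lines. This is a minimal illustration, not part of the original notebook: the function names `wald_ci` and `z_test_proportion` and the counts (60 successes in 100 trials, $p_0 = 0.5$) are hypothetical.

```python
import numpy as np
from scipy import stats


def wald_ci(x, n, conf=0.95):
    """Large-sample CI for p: p_hat ± z_{α/2} * sqrt(p_hat * q_hat / n)."""
    p_hat = x / n
    z_crit = stats.norm.ppf(1 - (1 - conf) / 2)
    half_width = z_crit * np.sqrt(p_hat * (1 - p_hat) / n)
    return float(p_hat - half_width), float(p_hat + half_width)


def z_test_proportion(x, n, p0):
    """Two-sided z-test of H0: p = p0 using the null standard error sqrt(p0 * q0 / n)."""
    p_hat = x / n
    z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
    p_value = 2 * stats.norm.sf(abs(z))  # abs() handles z on either side of 0
    return float(z), float(p_value)


# Hypothetical data: 60 successes in 100 trials, H0: p = 0.5
lo, hi = wald_ci(60, 100)
z, p_value = z_test_proportion(60, 100, 0.5)
print(f"95% CI: ({lo:.4f}, {hi:.4f})")
print(f"z = {z:.3f}, P-value = {p_value:.4f}")
```

Using $\sqrt{p_0 q_0 / n}$ in the test and $\sqrt{\hat{p}\hat{q}/n}$ in the interval means the two procedures can occasionally disagree near the boundary; for large $n$ the difference is small.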

### Ex 9.1

A business journal publisher plans to survey a sample of the subscribers to estimate the proportion $p$ with annual household incomes over \$100,000.

#### (a)

How many subscribers must be surveyed to obtain a 99% CI for $p$ with a margin of error no greater than 0.05? Assume that no prior estimate of $p$ is available.

✍️ The margin of error is

$$
E = z_{\alpha/2} \sqrt{\frac{p q}{n}} \text{.}
$$

Therefore,

$$
n = \frac{z_{\alpha/2}^2\, p q}{E^2}
$$

Because no previous estimate of $p$ is available, we use $1/2$ as a conservative estimate.
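To see why $1/2$ is the conservative choice, note that the product $pq$ is largest there, so the required $n$ is largest too. A short worked step, using the same symbols as above:

$$
pq = p(1 - p) = \frac{1}{4} - \left(p - \frac{1}{2}\right)^2 \le \frac{1}{4},
\qquad \text{so} \qquad
n = \frac{z_{\alpha/2}^2\, p q}{E^2} \le \frac{z_{\alpha/2}^2}{4E^2}.
$$

Any other value of $p$ gives a smaller sample size, which is why a prior guess such as $p = 0.30$ reduces $n$.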

In [7]:
α = 1 - 0.99
p = 1/2
n = stats.norm.ppf(1-α/2)**2 * p * (1-p)/ 0.05**2
print(np.ceil(n))

664.0


So 664 subscribers should be surveyed.

#### (b)

The marketing department thinks that $p$ = 0.30 would be a reasonable guess. What is the corresponding sample size?

In [10]:
α = 1 - 0.99
p = 0.3
n = stats.norm.ppf(1-α/2)**2 * p * (1-p)/ 0.05**2
print(np.ceil(n))

558.0


#### (c)

Refer to the sample size obtained in (b). If a 40% nonresponse rate is anticipated, how many surveys need to be mailed? How may such a high nonresponse rate cause bias in the estimate?

In [13]:
non_response_rate = 0.4
mails = np.ceil(n) / (1 - non_response_rate)  # start from the rounded-up sample size (558) from (b)
print(np.ceil(mails))

930.0


Assuming the responses tend to come from higher-income households, the result may overestimate $p$.

### Ex 9.2

While imprisoned by the Germans during World War II, the English mathematician John Kerrich tossed a coin 10,000 times and obtained 5067 heads. Let $p$ be the probability of a head on a single toss. We wish to check if the data are consistent with the hypothesis that the coin was fair.

#### (a)
Set up the hypotheses. Why should the alternative be two-sided?

✍️

$$
\begin{align*}
H_0: & p = 1/2\text{, vs.} \\
H_1: & p \ne 1/2 \text{.}
\end{align*}
$$

$H_1$ is two-sided because if the coin is not fair, $p$ could be either $> 1/2$ or $< 1/2$.

#### (b)
Calculate the $P$-value. Can you reject $H_0$ at the .05 level?

In [28]:
n = 10_000
p = 1/2
q = 1 - p
z = (5067/n - p)/np.sqrt(p*q/n)
p_val = 2 * stats.norm.sf(abs(z))  # abs() keeps the two-sided P-value valid when z < 0
print(p_val)

0.18024534492890254


Because 0.18 > 0.05, we cannot reject $H_0$ at the .05 level.

#### (c)
Find a 95% CI for the proportion of heads for Kerrich's coin.

In [27]:
α = 1 - 0.95
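margin_of_error = float(stats.norm.ppf(1-α/2) * np.sqrt(p*q/n))  # p, q still hold the null value 1/2 from (b); using p̂ = 0.5067 instead changes the margin by less than 1e-6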
ci = (5067/n - margin_of_error, 5067/n + margin_of_error)
print(ci)

(0.49690018007729975, 0.5164998199227003)


### Ex 9.3

Calls to technical support service of a software company are monitored on a sampling basis for quality assurance. Each monitored call is classified as satisfactory or unsatisfactory by the supervisor in terms of the quality of help offered. A random sample of 100 calls was monitored over one month for a new trainee; 8 calls were classified as unsatisfactory.

#### (a)
Calculate a 95% CI for the actual proportion of unsatisfactory calls during the month. Use both formulas (9.1) and (9.3) and compare the results.

#### (b) 
This CI is used to test $H_0: p = 0.10$ vs. $H_1: p \ne 0.10$. If $H_0$ is not rejected, then monitoring of the trainee is continued at the same frequency; if $H_0$ is rejected in the lower tail, then monitoring frequency is reduced; and if $H_0$ is rejected in the upper tail, then the trainee is provided additional training. Based on the CI calculated in (a), what action should be taken on this trainee?
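The notebook leaves this exercise unanswered, and formulas (9.1) and (9.3) are not reproduced in this section. The sketch below therefore rests on an assumption: that the two formulas are the two standard large-sample intervals, namely the Wald interval $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}\hat{q}/n}$ and the score (Wilson) interval obtained by solving $|\hat{p} - p| \le z_{\alpha/2}\sqrt{pq/n}$ for $p$. Which equation number corresponds to which interval is not claimed here.

```python
import numpy as np
from scipy import stats

x, n, p0 = 8, 100, 0.10          # data from Ex 9.3 and the null value from (b)
p_hat = x / n
z = stats.norm.ppf(0.975)        # z_{α/2} for a 95% interval

# Wald interval: p_hat ± z * sqrt(p_hat * q_hat / n)
half_wald = z * np.sqrt(p_hat * (1 - p_hat) / n)
wald = (float(p_hat - half_wald), float(p_hat + half_wald))

# Score (Wilson) interval: all p satisfying |p_hat - p| <= z * sqrt(p(1-p)/n)
denom = 1 + z**2 / n
center = (p_hat + z**2 / (2 * n)) / denom
half_wilson = z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
wilson = (float(center - half_wilson), float(center + half_wilson))

for name, (lo, hi) in {"Wald": wald, "Wilson": wilson}.items():
    # (b): H0: p = 0.10 is rejected only if 0.10 lies outside the interval
    print(f"{name}: ({lo:.4f}, {hi:.4f}), contains p0 = {lo <= p0 <= hi}")
```

Under these assumed formulas both intervals contain 0.10, so $H_0$ is not rejected and, by the rule in (b), monitoring would continue at the same frequency. The score interval is shifted toward 1/2 relative to the Wald interval, which is typical when $\hat{p}$ is close to 0.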

## 9.2 Inferences for Comparing Two Proportions

Next we consider the problem of comparing two Bernoulli proportions, $p_1$ and $p_2$, based on two independent random samples of sizes $n_1$ and $n_2$. The basis for inferences on $p_1 - p_2$ is the result that for large $n_1$ and $n_2$, the difference in the sample proportions, $\hat{p}_1 - \hat{p}_2$, is approximately normal with mean = $p_1 - p_2$ and standard deviation = $\sqrt{p_1 q_1 / n_1 + p_2 q_2 / n_2}$. A large sample two-sided 100(1 - $\alpha$)% confidence interval for $p_1 - p_2$ is given by

$$
\left[ \hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}\; \right].
$$

A large sample two-sided $z$-test can be used to test $H_0: p_1 = p_2$ vs. $H_1: p_1 \ne  p_2$ by using the test statistic

$$
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}}
\quad \text{or} \quad 
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
$$


where $\hat{p} = (n_1\hat{p}_1 + n_2\hat{p}_2)/(n_1 + n_2)$ is the pooled sample proportion. Small sample tests to compare $p_1$ and $p_2$ are also given for independent samples (Fisher's exact test) and matched pairs designs (McNemar's test).
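As a rough illustration of the formulas above, the following sketch computes the unpooled confidence interval and the pooled $z$-test for two independent samples, then runs SciPy's `stats.fisher_exact` on the same $2 \times 2$ table. The function name `two_proportion_inference` and the counts (45 of 100 vs. 30 of 100) are hypothetical.

```python
import numpy as np
from scipy import stats


def two_proportion_inference(x1, n1, x2, n2, conf=0.95):
    """Unpooled CI for p1 - p2 and pooled two-sided z-test of H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2

    # CI uses the unpooled standard error
    se_unpooled = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = stats.norm.ppf(1 - (1 - conf) / 2)
    ci = (float(diff - z_crit * se_unpooled), float(diff + z_crit * se_unpooled))

    # Test uses the pooled proportion, which is valid under H0: p1 = p2
    p_pool = (x1 + x2) / (n1 + n2)
    se_pooled = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = diff / se_pooled
    p_value = 2 * stats.norm.sf(abs(z))
    return ci, float(z), float(p_value)


# Hypothetical counts: 45 successes in 100 vs. 30 successes in 100
x1, n1, x2, n2 = 45, 100, 30, 100
ci, z, p_value = two_proportion_inference(x1, n1, x2, n2)
print(f"95% CI for p1 - p2: ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"z = {z:.3f}, P-value = {p_value:.4f}")

# Fisher's exact test on the same table (rows: samples, columns: success/failure)
odds_ratio, p_exact = stats.fisher_exact([[x1, n1 - x1], [x2, n2 - x2]])
print(f"Fisher exact P-value = {p_exact:.4f}")
```

McNemar's test for matched pairs is not included here.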

## 9.3 Inferences for One-way Count Data

A generalization of the test on the binomial proportion $p$ is a test on the cell probabilities of a multinomial distribution. Based on a random sample of size $n$ from a $c$-cell multinomial distribution (one-way count data) with cell probabilities $p_1, p_2, \ldots, p_c$, the test of

$$
H_0: p_1 = p_{10},\, p_2 = p_{20},\, \ldots, \, p_c = p_{c0} \quad \text{vs.} \quad
H_1: \text{At least one}\, p_i \ne p_{i0}
$$

is based on the **chi-square statistic** having the general form:

$$
\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}
$$

where "observed" refers to the observed cell counts $n_i$ and "expected" refers to the expected cell counts $e_i = n p_{i0}$ under $H_0$. The degrees of freedom (d.f.) of the chi-square statistic are $c$ - 1. The primary use of this statistic is for the **goodness of fit** test of a specified distribution to a set of data. If any parameters of the distribution are estimated from the data, then one d.f. is deducted for each independent estimated parameter from the total d.f. $c$ - 1.
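A minimal sketch of the goodness of fit computation follows, using hypothetical counts from 120 rolls of a die and testing $H_0: p_i = 1/6$ for every face. The statistic is computed directly from the formula and then compared with `scipy.stats.chisquare`.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts for faces 1..6 (n = 120 rolls)
observed = np.array([18, 22, 16, 25, 19, 20])
p0 = np.full(6, 1 / 6)                 # cell probabilities under H0

n = observed.sum()
expected = n * p0                      # e_i = n * p_i0
chi2 = float(((observed - expected) ** 2 / expected).sum())
df = len(observed) - 1                 # c - 1; no parameters were estimated
p_value = float(stats.chi2.sf(chi2, df))
print(f"chi2 = {chi2:.3f}, d.f. = {df}, P-value = {p_value:.4f}")

# Same test via SciPy; if k parameters were estimated from the data,
# passing ddof=k would reduce the d.f. to c - 1 - k
res = stats.chisquare(observed, f_exp=expected)
print(res.statistic, res.pvalue)
```

Here the statistic is small relative to its degrees of freedom, so these made-up counts give no evidence against a fair die.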

## 9.4 Inferences for Two-way Count Data

Two-way count data result when

1. a single sample is cross-classified based on two categorical variables into $r$ rows and $c$ columns (**multinomial sampling**), or 
2. independent samples are drawn from $r$ multinomial distributions with the same $c$ categories
(**product multinomial sampling**). 

In both cases, the data are summarized in the form of an $r \times c$ **contingency table** of counts. In case (1), the null hypothesis of interest is the **independence hypothesis** between the row and column variables; in case (2), it is the **homogeneity hypothesis**. In both cases, the chi-square statistic has the same general form given above, with the expected count for the $(i, j)$th cell (under $H_0$) being the $i$th row total times the proportion of all observations falling in the $j$th column. The d.f. of the chi-square statistic equal $(r - 1)(c - 1)$. Thus association between the row and the column variable is demonstrated at level $\alpha$ if $\chi^2 > \chi^2_{(r-1)(c-1), \alpha}$.
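The expected counts and the rejection rule can be sketched as follows for a hypothetical $2 \times 3$ table. The counts are made up for illustration. The manual calculation is checked against `scipy.stats.chi2_contingency`, called with `correction=False` so that no continuity correction is applied.

```python
import numpy as np
from scipy import stats

# Hypothetical 2 x 3 contingency table of counts
table = np.array([[30, 20, 10],
                  [20, 30, 40]])
alpha = 0.05

row_totals = table.sum(axis=1, keepdims=True)   # shape (r, 1)
col_totals = table.sum(axis=0, keepdims=True)   # shape (1, c)
n = table.sum()

# Expected count for cell (i, j): row total i * (column total j / n)
expected = row_totals * col_totals / n
chi2 = float(((table - expected) ** 2 / expected).sum())

r, c = table.shape
df = (r - 1) * (c - 1)
critical = float(stats.chi2.ppf(1 - alpha, df))  # upper-α critical point
print(f"chi2 = {chi2:.3f}, d.f. = {df}, critical value = {critical:.3f}")
print("Reject H0" if chi2 > critical else "Do not reject H0")

# Cross-check with SciPy (tuple unpacking: statistic, P-value, d.f., expected counts)
stat, p_value, dof, expected_scipy = stats.chi2_contingency(table, correction=False)
print(stat, p_value, dof)
```

The same computation serves both the independence and the homogeneity hypotheses; only the sampling scheme and the interpretation differ.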