# Solution

## Goal
Given the results of $n$ independent coin tosses with $x$ heads, compute a confidence interval (CI) for the true probability of heads $p$.

## Problem Statement
How can you calculate the confidence interval for the proportion of heads in a series of coin tosses?

## Approach
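We start with the sample proportion $\hat{p} = x/n$. For a $(1-\alpha)$ CI, the classic *normal approximation* (Wald interval) is:

$$\hat{p} \pm z_{1-\alpha/2} \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of the standard normal distribution (about 1.96 for a 95% CI).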

In practice, we’ll also compute the Wilson score interval and the exact (Clopper–Pearson) interval to highlight when the Wald interval can misbehave.
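
For reference, the two alternative intervals can be written out as follows. This is a sketch that mirrors the formulas used in the code cells, with $z = z_{1-\alpha/2}$ and $B(q;\, a,\, b)$ denoting the $q$-quantile of a $\mathrm{Beta}(a, b)$ distribution (the symbol $B$ is notation introduced here for convenience).

Wilson score interval:

$$\frac{\hat{p} + \frac{z^2}{2n}}{1 + \frac{z^2}{n}} \pm \frac{z}{1 + \frac{z^2}{n}} \sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}$$

Clopper–Pearson interval:

$$p_{\text{lo}} = B\left(\tfrac{\alpha}{2};\, x,\, n-x+1\right), \qquad p_{\text{hi}} = B\left(1-\tfrac{\alpha}{2};\, x+1,\, n-x\right)$$

with the conventions $p_{\text{lo}} = 0$ when $x = 0$ and $p_{\text{hi}} = 1$ when $x = n$.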

In [1]:
import numpy as np
from scipy import stats

np.random.seed(42)

In [2]:
def _z_crit(alpha: float) -> float:
    if not (0 < alpha < 1):
        raise ValueError("alpha must be between 0 and 1")
    return stats.norm.ppf(1 - alpha / 2)


def wald_ci(x: int, n: int, alpha: float = 0.05):
    """Wald (normal approximation) CI for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not (0 <= x <= n):
        raise ValueError("x must be between 0 and n")

    p_hat = x / n
    z = _z_crit(alpha)
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    m = z * se
    lo, hi = p_hat - m, p_hat + m
    return max(0.0, lo), min(1.0, hi), p_hat, z, se


def wilson_ci(x: int, n: int, alpha: float = 0.05):
    """Wilson score CI (generally better behaved than Wald)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not (0 <= x <= n):
        raise ValueError("x must be between 0 and n")

    p_hat = x / n
    z = _z_crit(alpha)
    z2 = z**2

    denom = 1 + z2 / n
    center = (p_hat + z2 / (2 * n)) / denom
    half = (z / denom) * np.sqrt((p_hat * (1 - p_hat) / n) + (z2 / (4 * n**2)))

    lo, hi = center - half, center + half
    return max(0.0, lo), min(1.0, hi), p_hat, z


def clopper_pearson_ci(x: int, n: int, alpha: float = 0.05):
    """Exact (Clopper–Pearson) CI via Beta distribution quantiles."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not (0 <= x <= n):
        raise ValueError("x must be between 0 and n")

    if x == 0:
        lo = 0.0
    else:
        lo = stats.beta.ppf(alpha / 2, x, n - x + 1)

    if x == n:
        hi = 1.0
    else:
        hi = stats.beta.ppf(1 - alpha / 2, x + 1, n - x)

    return lo, hi, x / n
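
As an optional cross-check, hand-written Wilson and Clopper–Pearson intervals like the ones above can be compared against SciPy's built-in versions. The sketch below assumes SciPy 1.7 or newer (where `stats.binomtest` was introduced) and uses illustrative counts of 52 heads in 100 tosses.

```python
from scipy import stats

# Illustrative counts: 52 heads in 100 tosses.
x, n = 52, 100

# binomtest returns a result object whose proportion_ci method
# supports "exact" (Clopper-Pearson) and "wilson" intervals.
result = stats.binomtest(x, n)

wilson = result.proportion_ci(confidence_level=0.95, method="wilson")
exact = result.proportion_ci(confidence_level=0.95, method="exact")

print(f"Wilson: ({wilson.low:.3f}, {wilson.high:.3f})")
print(f"Exact : ({exact.low:.3f}, {exact.high:.3f})")
```

If the two implementations agree to several decimal places, that is reasonable evidence the hand-written formulas are coded correctly.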

## Worked example
Suppose you toss a coin $n=100$ times and observe $x=52$ heads. Below we compute 90%, 95%, and 99% CIs.

In [3]:
x, n = 52, 100
p_hat = x / n
print(f"Observed heads: x={x}, n={n}, p_hat={p_hat:.4f}")

for conf in [0.90, 0.95, 0.99]:
    alpha = 1 - conf

    w_lo, w_hi, _, z, se = wald_ci(x, n, alpha=alpha)
    ws_lo, ws_hi, _, _ = wilson_ci(x, n, alpha=alpha)
    cp_lo, cp_hi, _ = clopper_pearson_ci(x, n, alpha=alpha)

    margin = z * se
    print()
    print(f"{int(conf*100)}% CI (alpha={alpha:.2f}, z={z:.3f})")
    print(f"  Wald   : ({w_lo:.3f}, {w_hi:.3f})  [SE={se:.4f}, margin={margin:.4f}]")
    print(f"  Wilson : ({ws_lo:.3f}, {ws_hi:.3f})")
    print(f"  Exact  : ({cp_lo:.3f}, {cp_hi:.3f})")

Observed heads: x=52, n=100, p_hat=0.5200

90% CI (alpha=0.10, z=1.645)
  Wald   : (0.438, 0.602)  [SE=0.0500, margin=0.0822]
  Wilson : (0.438, 0.601)
  Exact  : (0.433, 0.606)

95% CI (alpha=0.05, z=1.960)
  Wald   : (0.422, 0.618)  [SE=0.0500, margin=0.0979]
  Wilson : (0.423, 0.615)
  Exact  : (0.418, 0.621)

99% CI (alpha=0.01, z=2.576)
  Wald   : (0.391, 0.649)  [SE=0.0500, margin=0.1287]
  Wilson : (0.394, 0.643)
  Exact  : (0.388, 0.650)
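
As a hand check of the 95% Wald row above, the arithmetic works out as:

$$\mathrm{SE} = \sqrt{\frac{0.52 \times 0.48}{100}} = \sqrt{0.002496} \approx 0.04996$$

$$z_{0.975} \cdot \mathrm{SE} \approx 1.960 \times 0.04996 \approx 0.0979$$

$$0.52 \pm 0.0979 \;\Rightarrow\; (0.422,\ 0.618)$$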


## Key biases / pitfalls
- **Independence**: the usual CI formulas assume the tosses are independent. Positive correlation between tosses (e.g., a "sticky" coin, changing technique, or a non-stationary process) reduces the effective sample size and can make the CI too narrow.
- **Normal approximation limitations**: the Wald CI can be inaccurate for small $n$ or when $\hat{p}$ is near 0 or 1. It can extend outside $[0,1]$ before clipping, and it collapses to a zero-width interval when $x=0$ or $x=n$.
- **Better default**: the **Wilson score interval** is often a better general-purpose choice for binomial proportions.
- **Exact isn’t magic**: the Clopper–Pearson interval guarantees coverage at least $(1-\alpha)$, but it can be conservative (wider than necessary).
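
The independence pitfall can be illustrated with a small simulation. The sketch below is only an illustration under an assumed dependence model: a hypothetical "sticky" coin that repeats its previous outcome with probability `stay` (so `stay=0.5` is an ordinary fair coin and the long-run probability of heads stays at 0.5). The helper names `wald_interval`, `sticky_heads`, and `coverage` are made up for this example, and it uses only the standard library rather than the notebook's SciPy functions.

```python
import math
import random
from statistics import NormalDist


def wald_interval(x, n, alpha=0.05):
    """Wald interval clipped to [0, 1] (same formula as in the Approach section)."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)


def sticky_heads(n, stay, rng):
    """Count heads in n tosses where each toss repeats the previous
    outcome with probability `stay` (hypothetical dependence model)."""
    toss = rng.random() < 0.5  # first toss is fair
    heads = int(toss)
    for _ in range(n - 1):
        if rng.random() >= stay:  # switch sides with probability 1 - stay
            toss = not toss
        heads += int(toss)
    return heads


def coverage(stay, n=100, reps=2000, seed=0):
    """Fraction of simulated experiments whose 95% Wald CI contains 0.5."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        lo, hi = wald_interval(sticky_heads(n, stay, rng), n)
        hits += lo <= 0.5 <= hi
    return hits / reps


independent = coverage(stay=0.5)
sticky = coverage(stay=0.8)
print(f"independent tosses: coverage ~ {independent:.3f}")
print(f"sticky tosses     : coverage ~ {sticky:.3f}")
```

With `stay=0.8` the interval is computed as if all 100 tosses carried independent information, so its coverage should fall noticeably below the nominal 95%, while the `stay=0.5` run should land close to it.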

## Demonstrations (Python): coverage via simulation
We simulate repeated experiments to see how often the Wald and Wilson CIs contain the true $p$ ("coverage").

In [None]:
def simulate_coverage(p_true: float, n: int, conf: float = 0.95, reps: int = 20000):
    """Estimate Wald and Wilson coverage from `reps` simulated binomial experiments."""
    alpha = 1 - conf
    xs = np.random.binomial(n=n, p=p_true, size=reps)

    wald_contains = 0
    wilson_contains = 0

    for x in xs:
        lo, hi, *_ = wald_ci(int(x), n, alpha=alpha)
        wald_contains += (lo <= p_true <= hi)

        lo, hi, *_ = wilson_ci(int(x), n, alpha=alpha)
        wilson_contains += (lo <= p_true <= hi)

    return {
        'p_true': p_true,
        'n': n,
        'conf': conf,
        'reps': reps,
        'wald_coverage': wald_contains / reps,
        'wilson_coverage': wilson_contains / reps,
    }


for n in [20, 100]:
    out = simulate_coverage(p_true=0.5, n=n, conf=0.95, reps=10000)
    print(f"n={out['n']}, reps={out['reps']}, target={out['conf']:.2f}")
    print(f"  Wald coverage  : {out['wald_coverage']:.4f}")
    print(f"  Wilson coverage: {out['wilson_coverage']:.4f}")
    print()

# A more challenging case where p is near 0
out = simulate_coverage(p_true=0.05, n=50, conf=0.95, reps=10000)
print(f"n={out['n']}, p_true={out['p_true']}, reps={out['reps']}, target={out['conf']:.2f}")
print(f"  Wald coverage  : {out['wald_coverage']:.4f}")
print(f"  Wilson coverage: {out['wilson_coverage']:.4f}")
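
Because $x$ can only take the values $0, \dots, n$, coverage can also be computed exactly by summing binomial probabilities over every $x$ whose interval contains the true $p$, which removes Monte Carlo noise. The sketch below is one way to do that with the standard library only. The helper names `wald`, `wilson`, and `exact_coverage` are introduced here for illustration and re-implement the same formulas as the notebook functions.

```python
import math
from statistics import NormalDist


def wald(x, n, z):
    """Wald interval clipped to [0, 1]."""
    p_hat = x / n
    m = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - m), min(1.0, p_hat + m)


def wilson(x, n, z):
    """Wilson score interval clipped to [0, 1]."""
    p_hat = x / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return max(0.0, center - half), min(1.0, center + half)


def exact_coverage(ci, p_true, n, conf=0.95):
    """Sum Binomial(n, p_true) probabilities over every x whose CI contains p_true."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    total = 0.0
    for x in range(n + 1):
        lo, hi = ci(x, n, z)
        if lo <= p_true <= hi:
            total += math.comb(n, x) * p_true**x * (1 - p_true) ** (n - x)
    return total


# Same three scenarios as the simulation cell.
for p_true, n in [(0.5, 20), (0.5, 100), (0.05, 50)]:
    print(f"p={p_true}, n={n}: "
          f"Wald={exact_coverage(wald, p_true, n):.4f}, "
          f"Wilson={exact_coverage(wilson, p_true, n):.4f}")
```

For the $p = 0.05$, $n = 50$ case, this calculation puts Wald coverage below the nominal 95% and Wilson coverage above it, which is the pattern the simulation is meant to reveal.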

## Practical checklist
- Report $x$ and $n$ along with $\hat{p}$, not just the CI.
- Use a confidence level that matches the decision context (e.g., 90%, 95%, or 99%).
- Prefer Wilson (or exact) when $n$ is small or the observed proportion is close to 0 or 1.
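- Sanity check that the CI stays in $[0,1]$, and interpret it as a range of plausible values for the true long-run probability: over many repeated experiments, roughly $(1-\alpha)$ of intervals built this way would contain $p$.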

## Conclusion
The Wald CI is easy to compute and matches the standard textbook formula, but it can have poor coverage in small samples or near the boundaries. The Wilson score interval is a strong default, and the Clopper–Pearson interval provides conservative "exact" coverage.