In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

plt.style.use('fivethirtyeight')
%matplotlib inline

In [None]:
def ks2_plot(x_interval, sample1, sample2):
    """Plot the ecdfs of two samples, extending the flat ends out to x_interval = (xmin, xmax)."""
    xmin = x_interval[0]
    xmax = x_interval[1]
    n = len(sample1)
    sorted_sample1 = np.sort(sample1)
    plt.step(sorted_sample1, np.arange(1, n+1)/n, where='post', lw=2, color='red', label='ECDF of Sample X')
    if sorted_sample1.item(0) > xmin:
        plt.plot((xmin, sorted_sample1.item(0), sorted_sample1.item(0)), (0, 0, 1/n), lw=2, color='red')
    if sorted_sample1.item(n-1) < xmax:
        plt.plot((sorted_sample1.item(n-1), xmax), (1, 1), lw=2, color='red')
    m = len(sample2)
    sorted_sample2 = np.sort(sample2)
    plt.step(sorted_sample2, np.arange(1, m+1)/m, where='post', lw=2, color='darkblue', label='ECDF of Sample Y')
    if sorted_sample2.item(0) > xmin:
        plt.plot((xmin, sorted_sample2.item(0), sorted_sample2.item(0)), (0, 0, 1/m), lw=2, color='darkblue')
    if sorted_sample2.item(m-1) < xmax:
        plt.plot((sorted_sample2.item(m-1), xmax), (1, 1), lw=2, color='darkblue')
    plt.legend()
    plt.title(f'n = {n}, m = {m}', size=12);

# Worksheet 6 #

## 1. Is it Exponential?

In Lecture 1 we used the exponential as a model for the interarrival times of California earthquakes of magnitude at least 4. If the earthquakes arrive like a Poisson process, the interarrival times are exponential. Based on the histogram of the observed distribution of those times, and some basic checking of means and variances, the exponential model seemed reasonable.

And then we said that later in the term we'd develop a method for testing whether data are like i.i.d. draws from an exponential distribution.

That time has come. In this exercise you will read in the data by running the cells below, and then perform a Kolmogorov-Smirnov test of whether the interarrival times are like an i.i.d. sample from an exponential distribution with unknown rate $\lambda$.

The code is taken from Lecture 1.

In [None]:
# Load the declustered earthquake data (produced by usgs_gk_ca_mainshocks.ipynb)
earthquakes = pd.read_csv('california_earthquakes_declustered.csv')
earthquakes['time'] = pd.to_datetime(earthquakes['time'], format='ISO8601')

print(f"Total earthquakes in dataset: {len(earthquakes)}")
print(f"  - Mainshocks: {earthquakes['is_mainshock'].sum()}")
print(f"  - Dependent: {(~earthquakes['is_mainshock']).sum()}")
print(f"Date range: {earthquakes['time'].min().date()} to {earthquakes['time'].max().date()}")

In [None]:
# Filter to mainshocks with M >= 4.0
mag_threshold = 4.0
mainshocks = earthquakes[(earthquakes['is_mainshock']) & (earthquakes['mag'] >= mag_threshold)].copy()
mainshocks = mainshocks.sort_values('time').reset_index(drop=True)

print(f"Mainshocks with M >= {mag_threshold}: {len(mainshocks)}")

In [None]:
# Compute interarrival times (in days)
interarrivals = mainshocks['time'].diff().dt.total_seconds() / (60 * 60 * 24)
interarrivals = interarrivals.dropna().values

print(f"Number of interarrival times: {len(interarrivals)}")
print(f"Mean interarrival time: {np.mean(interarrivals):.2f} days")
print(f"Median interarrival time: {np.median(interarrivals):.2f} days")

The array `interarrivals` is the sample. The null hypothesis is that it looks like i.i.d. draws from an exponential distribution with an unknown rate $\lambda$. 

**(a)** Find $\hat{\lambda}$, the MLE of $\lambda$. Refer to Lecture 1 for this if necessary.

**(b)** Find the Kolmogorov-Smirnov distance between the empirical cdf of the sample and the cdf of the exponential distribution with parameter $\hat{\lambda}$. You can use [`kstest`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html) for this. See the last code cell in Section 1 of Lecture 11 to see how to extract just the test statistic.

**(c)** Follow the parametric bootstrap process outlined in Section 4 of Lecture 11 to simulate the null distribution of the test statistic. That is, repeat the following 10,000 times and draw the empirical histogram of the results.
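- The simulation is under the null hypothesis, so generate a new sample of the same size as `interarrivals` from $F_{\hat{\theta}}$. Here $\theta$ is the rate $\lambda$, so $F_{\hat{\theta}}$ is the exponential cdf with rate $\hat{\lambda}$.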
- Let $\hat{\theta}^*$ be the MLE of $\theta$ based on this sample, and let the corresponding $F_{\hat{\theta}^*}$ be your new "true" distribution.
- Calculate the KS distance between the ecdf of the new sample and the new "true" $F_{\hat{\theta}^*}$.

**(d)** Approximately what is the $p$-value of your test? Do you still think the exponential model looks reasonable?
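
The steps in parts **(a)** through **(d)** can be sketched as follows. This is only an illustration on a synthetic sample (`demo_sample`, drawn from a gamma distribution so that it is deliberately not exponential), not on `interarrivals`. The helper names `ks_distance_exponential` and `bootstrap_null_distances` are made up for this sketch, and it uses 1,000 repetitions instead of 10,000 to keep it quick.

```python
import numpy as np
from scipy import stats

def ks_distance_exponential(sample):
    """KS distance between the ecdf of `sample` and the exponential cdf fitted by MLE."""
    rate_hat = 1 / np.mean(sample)  # MLE of the exponential rate
    # scipy parametrizes the exponential by (loc, scale) with scale = 1 / rate
    return stats.kstest(sample, 'expon', args=(0, 1 / rate_hat)).statistic

def bootstrap_null_distances(sample, reps, rng):
    """Simulate the KS distance under the fitted exponential, refitting the rate each time."""
    n = len(sample)
    scale_hat = np.mean(sample)  # 1 / rate_hat
    return np.array([
        ks_distance_exponential(rng.exponential(scale_hat, size=n))
        for _ in range(reps)
    ])

rng = np.random.default_rng(0)
demo_sample = rng.gamma(shape=3.0, scale=1.0, size=200)  # synthetic, not exponential

observed = ks_distance_exponential(demo_sample)
null_distances = bootstrap_null_distances(demo_sample, reps=1000, rng=rng)

# Approximate p-value: proportion of simulated distances at least as large as the observed one
p_value = np.mean(null_distances >= observed)
print(f"observed distance = {observed:.3f}, approximate p-value = {p_value:.3f}")
```

Because the rate is re-estimated inside every repetition, the simulated distances reflect the extra closeness that comes from fitting the parameter to the same sample being tested.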

In [None]:
# use as many cells as you need
...

## 2. Two-Sample Kolmogorov-Smirnov Test and Stochastic Ordering
The dataset `births` comes from a random sample of mother-newborn pairs. Each row contains data on one such pair. The column `Birth Weight` contains the newborn's birth weight in ounces. The column `Maternal Smoker` has Boolean values for whether or not the mother smoked during pregnancy.

In what way is maternal smoking during pregnancy associated with the baby's birth weight? A step in addressing this question is to see whether there's an association in the first place. Let "smoker" be short for "mother who smoked during pregnancy," and define "nonsmoker" analogously. You can assume that the smokers and nonsmokers were drawn independently of each other. You can also assume that within each of the two categories, the pairs are i.i.d.

**(a)** Complete the cell below to plot the empirical cdfs of the two samples. Each of `smoker_sample` and `nonsmoker_sample` should be an array of birth weights of the appropriate sample. 

In [None]:
# Answer to 2a
births = pd.read_csv('births.csv')

smoker_sample = ...
nonsmoker_sample = ...

ks2_plot([50, 200], smoker_sample, nonsmoker_sample)

**(b)** Let $F$ and $G$ be the underlying cdfs of a birth weight in the smoker population and the nonsmoker population respectively. Run the cell below to perform the two-sample Kolmogorov-Smirnov test of the null hypothesis $H_0: F = G$. What does the test conclude, and why?

In [None]:
stats.ks_2samp(smoker_sample, nonsmoker_sample)

**(c)** Simulate the null distribution of the statistic in the following steps, and say whether the distribution is consistent with the $p$-value in Part **b**.
- Under the null hypothesis, the birth weights in the entire data set are i.i.d. It doesn't matter what label each row has. Use this idea to define a function `one_ksd()` that takes no arguments and returns one simulated value of the Kolmogorov-Smirnov distance under the null hypothesis. You should shuffle the `Maternal Smoker` column (see documentation for `np.random.permutation`) and then split the relabeled sample as you did in Part **a**. To get just the value of the statistic, use `ks_2samp(sample1, sample2).statistic`.
- Generate 10,000 simulated values of the statistic, calling `one_ksd()` each time.
- Draw the histogram of the 10,000 simulated values.
- Examine the range of simulated values and explain why the $p$-value in Part **b** is or is not consistent with the histogram.
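
The shuffling idea in the steps above can be sketched on synthetic data. Everything here is an assumption made for illustration: the arrays `values` and `labels` stand in for a numeric column and a Boolean column, the helper names are invented, and `rng.permutation` is the `Generator` counterpart of `np.random.permutation`. The sketch uses 1,000 shuffles to keep it fast.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in: 150 values labeled True and 250 labeled False
values = np.concatenate([rng.normal(110, 17, size=150), rng.normal(125, 17, size=250)])
labels = np.concatenate([np.ones(150, dtype=bool), np.zeros(250, dtype=bool)])

def ks_distance(values, labels):
    """Two-sample KS distance between the True-labeled and False-labeled values."""
    return stats.ks_2samp(values[labels], values[~labels]).statistic

def one_shuffled_distance(values, labels, rng):
    """Shuffle the labels (group sizes are preserved) and recompute the distance."""
    return ks_distance(values, rng.permutation(labels))

observed = ks_distance(values, labels)
simulated = np.array([one_shuffled_distance(values, labels, rng) for _ in range(1000)])

# Proportion of shuffled distances at least as large as the observed distance
perm_p = np.mean(simulated >= observed)
print(f"observed = {observed:.3f}, largest simulated = {simulated.max():.3f}, p ~ {perm_p:.3f}")
```

Shuffling only the labels keeps the two group sizes fixed, which matches the null hypothesis that the labels carry no information about the values.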

In [None]:
# Answer to 2c (use as many lines of code as you need)

def one_ksd():
    ...    

ksds = np.array([])
...

plt.hist(ksds, color='tab:blue', edgecolor='w', bins=50);

**(d)** Use the plot in Part **a** to say which of the two you think is bigger: $F(120)$ or $G(120)$? Explain, keeping in mind that the plot shows neither $F$ nor $G$.

**(e)** Use the empirical cdfs to say which group has babies with lower weights, and explain your reasoning. 

The terminology for what you are observing is *stochastic ordering*. A distribution with cdf $F_1$ is said to be *stochastically smaller* than a distribution with cdf $F_2$ if $F_1(x) \ge F_2(x)$ for all $x$.
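
As a rough numerical illustration of the definition, the sketch below compares two empirical cdfs on a grid. The samples are synthetic normal draws chosen for this example, and `ecdf_on_grid` is a made-up helper. Since ecdfs only estimate the underlying cdfs, small negative gaps in the tails can be sampling noise rather than a violation of the ordering.

```python
import numpy as np

def ecdf_on_grid(sample, grid):
    """Proportion of sample values less than or equal to each grid point."""
    return np.searchsorted(np.sort(sample), grid, side='right') / len(sample)

rng = np.random.default_rng(0)
sample_1 = rng.normal(0.0, 1.0, size=2000)  # synthetic, centered at 0
sample_2 = rng.normal(1.0, 1.0, size=2000)  # synthetic, shifted right by 1

grid = np.linspace(-4, 5, 200)
gap = ecdf_on_grid(sample_1, grid) - ecdf_on_grid(sample_2, grid)

# If distribution 1 is stochastically smaller, the gap F_1(x) - F_2(x) is nonnegative
print(f"smallest gap = {gap.min():.4f}, largest gap = {gap.max():.4f}")
```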

---
---

## 3. Neyman–Pearson Examples

Recall the **Neyman–Pearson Lemma**: to test $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ at level $\alpha$, the most powerful test rejects for large values of the **likelihood ratio**

$$\text{LR}(\mathbf{X}) = \frac{f_{\theta_1}(\mathbf{X})}{f_{\theta_0}(\mathbf{X})}.$$

That is, the NP test rejects when $\text{LR}(\mathbf{X}) > c$ for a cutoff $c$ chosen so that $P_{\theta_0}(\text{LR}(\mathbf{X}) > c) = \alpha$.
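
When the null distribution of the likelihood ratio is awkward to work out by hand, the cutoff can be approximated by simulation. The sketch below assumes an exponential model with illustrative values of `rate0`, `rate1`, `n`, and `alpha`, and works with $\log \text{LR}$ purely for numerical convenience. The function name `log_lr` is invented for this example.

```python
import numpy as np
from scipy import stats

def log_lr(samples, rate0, rate1):
    """Log likelihood ratio of Exp(rate1) to Exp(rate0) for each row of `samples`."""
    log_f1 = stats.expon.logpdf(samples, scale=1 / rate1).sum(axis=1)
    log_f0 = stats.expon.logpdf(samples, scale=1 / rate0).sum(axis=1)
    return log_f1 - log_f0

rng = np.random.default_rng(0)
rate0, rate1, n, alpha, reps = 1.0, 1.5, 20, 0.05, 20000  # illustrative values

# Cutoff: the (1 - alpha) quantile of the statistic simulated under H0
null_samples = rng.exponential(1 / rate0, size=(reps, n))
log_cutoff = np.quantile(log_lr(null_samples, rate0, rate1), 1 - alpha)

# Check the level on fresh null samples, and estimate the power under H1
fresh_null = rng.exponential(1 / rate0, size=(reps, n))
alt_samples = rng.exponential(1 / rate1, size=(reps, n))
level = np.mean(log_lr(fresh_null, rate0, rate1) > log_cutoff)
power = np.mean(log_lr(alt_samples, rate0, rate1) > log_cutoff)
print(f"simulated level = {level:.3f}, simulated power = {power:.3f}")
```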

**(a)** Suppose the NP test rejects $H_0$ when $\text{LR}(\mathbf{X}) > c$, and let $\varphi$ be a strictly increasing function. Show that there exists a cutoff $c'$ such that rejecting when $\varphi(\text{LR}(\mathbf{X})) > c'$ gives exactly the same test.

Use this to conclude that the NP test can equivalently be written as: reject for large values of any statistic $T(\mathbf{X})$ that is a strictly increasing function of $\text{LR}(\mathbf{X})$.

**(b)** Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\text{Exp}(\lambda)$ with density $f_\lambda(x) = \lambda e^{-\lambda x}$ for $x > 0$. Consider testing $H_0: \lambda = \lambda_0$ versus $H_1: \lambda = \lambda_1$ where $\lambda_1 > \lambda_0$.

Find the likelihood ratio $\text{LR}(\mathbf{X})$ and use part **(a)** to show that the NP test rejects for small values of $\bar{X}$.

**(c)** Let $X_1, X_2, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ where $\sigma^2$ is known, with density $f_\mu(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}$. Consider testing $H_0: \mu = \mu_0$ versus $H_1: \mu = \mu_1$ where $\mu_1 > \mu_0$.

Find the likelihood ratio $\text{LR}(\mathbf{X})$ and use part **(a)** to show that the NP test rejects for large values of $\bar{X}$.

**(d)** Let $X \sim \text{Binomial}(n, p)$ with PMF $f_p(x) = \binom{n}{x} p^x (1-p)^{n-x}$ for $x \in \{0, 1, \ldots, n\}$. Consider testing $H_0: p = p_0$ versus $H_1: p = p_1$ where $p_1 > p_0$.

Find the likelihood ratio $\text{LR}(X)$ and use part **(a)** to show that the NP test rejects for large values of $X$.

Looking at parts **(b)**–**(d)**, you should notice that the NP test statistic is always a function of the sample mean (or of the single observation $X$ in the binomial case). Problem 4 will explain why this is not a coincidence.

---
---

## 4. Exponential Families

A parametric family of distributions with densities (or PMFs) $f_\theta$ is called a **one-parameter exponential family** if $f_\theta$ can be written in the form

$$f_\theta(x) = h(x) \exp\!\big\{\eta(\theta)\, T(x) - A(\theta)\big\}$$

where the different components have conventional names:
- $T(x)$ is the **sufficient statistic** (a function of the data, not of $\theta$),
- $\eta(\theta)$ is the **natural parameter** (a function of $\theta$ only),
- $A(\theta)$ is the **log-partition function** (ensures the density integrates to 1),
- $h(x) \geq 0$ is the **base density** (does not depend on $\theta$).

Note that this decomposition is not unique: for example, you could replace $T(x)$ by $2T(x)$ and $\eta(\theta)$ by $\eta(\theta)/2$ and still have a valid exponential family representation.
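
The non-uniqueness can be checked numerically. The sketch below uses the Poisson$(\lambda)$ family as an outside example (it is not one of the families in Problem 3), written with $h(x) = 1/x!$, $T(x) = x$, $\eta(\lambda) = \log \lambda$, and $A(\lambda) = \lambda$. The helper `expfam_pmf` is a made-up name.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

def expfam_pmf(h_values, eta, t_values, log_partition):
    """Evaluate h(x) * exp(eta * T(x) - A) on arrays of h(x) and T(x) values."""
    return h_values * np.exp(eta * t_values - log_partition)

lam = 2.5
x = np.arange(0, 15)
h_values = np.exp(-gammaln(x + 1))  # 1 / x!

# Representation 1: T(x) = x, eta = log(lam)
pmf_a = expfam_pmf(h_values, np.log(lam), x, lam)
# Representation 2: T(x) = 2x, eta = log(lam) / 2
pmf_b = expfam_pmf(h_values, np.log(lam) / 2, 2 * x, lam)

reference = stats.poisson.pmf(x, lam)
print(np.allclose(pmf_a, reference), np.allclose(pmf_b, reference))
```

Both representations reproduce the same probabilities because only the product $\eta(\theta)\,T(x)$ enters the formula.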

**(a)** Show that each of the three distributions from Problem 3 — $\text{Exp}(\lambda)$, $N(\mu, \sigma^2)$ with $\sigma^2$ known, and $\text{Binomial}(n, p)$ — is a one-parameter exponential family by identifying $T(x)$, $\eta(\theta)$, $A(\theta)$, and $h(x)$ for each.

**(b)** Now consider any one-parameter exponential family $f_\theta(x) = h(x) \exp\!\big\{\eta(\theta)\, T(x) - A(\theta)\big\}$, and suppose we observe $X_1, X_2, \ldots, X_n$ i.i.d. from $f_\theta$.

Consider the simple-vs-simple test $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ where $\eta(\theta_1) > \eta(\theta_0)$.

Show that the likelihood ratio $\text{LR}(\mathbf{X})$ is a strictly increasing function of $\sum_{i=1}^n T(X_i)$, and conclude that the NP most powerful test rejects for large values of $\frac{1}{n}\sum_{i=1}^n T(X_i)$ (or for small values when $\eta(\theta_1) < \eta(\theta_0)$).

Check that this is consistent with your answers in Problem 3.

---
---

## 5. The Cauchy Likelihood Ratio Test

The **Cauchy location model** has density

$$f_\theta(x) = \frac{1}{\pi(1 + (x - \theta)^2)}, \qquad x \in \mathbb{R}.$$

This is a location family (like the Normal), but with much heavier tails: the Cauchy distribution has no finite mean or variance.

**(a)** Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\text{Cauchy}(\theta)$. Consider testing $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ where $\theta_1 > \theta_0$.

Write down the log-likelihood ratio $\log \text{LR}(\mathbf{X}) = \sum_{i=1}^n \log\!\left(\frac{f_{\theta_1}(X_i)}{f_{\theta_0}(X_i)}\right)$ explicitly.

For a single observation $X$, plot the per-observation log-likelihood ratio $g(x) = \log\!\left(\frac{f_{\theta_1}(x)}{f_{\theta_0}(x)}\right)$ over $x \in [-10, 10]$ for two cases on the same axes: $(\theta_0, \theta_1) = (0, 1)$ and $(\theta_0, \theta_1) = (0, 5)$. Describe the shapes: where does each curve peak? Are they monotone in $x$?

In [None]:
...

**(b)** In Problem 4(b), you showed that for one-parameter exponential families, $\log \text{LR}$ is a **linear** function of $\sum T(X_i)$. So the NP test rejects for large (or small) values of $\frac{1}{n}\sum T(X_i)$, which is $\bar{X}$ in the examples of Problem 3, regardless of which specific $\theta_1$ is the alternative.

The Cauchy distribution is **not** an exponential family. Looking at your plots from part **(a)**, where $\theta_0 = 0$, explain why no test that rejects for large values of a single monotone function of $\bar{X}$ can be most powerful against all alternatives $\theta_1 > 0$ simultaneously. What changes about the optimal NP test statistic as $\theta_1$ changes?

**(c)** Now specialize to testing $H_0: \theta = 0$ versus $H_1: \theta = 1$. Compare the NP test with the naive test that rejects for large $\bar{X}$, across three sample sizes: $n = 1$, $n = 20$, and $n = 100$.

For each value of $n$, simulate $N = 10{,}000$ samples of size $n$ from the Cauchy distribution under the null ($\theta = 0$), and another $N = 10{,}000$ samples of size $n$ under the alternative ($\theta = 1$).

For the naive test:
- Under $H_0$, compute $\bar{X}$ for each sample and find the 95th percentile as the cutoff $c$ (so the test has level $\alpha = 0.05$).
- Under $H_1$, compute the power: the proportion of samples where $\bar{X} > c$.

For the NP test:
- Under $H_0$, compute $\log \text{LR}(\mathbf{X})$ for each sample and find the 95th percentile as the cutoff.
- Under $H_1$, compute the power.

Report the power of both tests for each $n$ in a table. What pattern do you notice for each test as $n$ grows?
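
The bookkeeping for this kind of comparison can be sketched with a generic helper. To avoid working the exercise, the sketch uses a normal stand-in model (null mean 0, alternative mean 0.5) and compares the sample mean with the sample median. Those choices, and the helper name `simulated_power`, are assumptions made only for illustration.

```python
import numpy as np

def simulated_power(statistic, null_samples, alt_samples, alpha=0.05):
    """Cutoff is the (1 - alpha) quantile of the statistic under H0;
    power is the proportion of H1 samples whose statistic exceeds the cutoff."""
    cutoff = np.quantile(statistic(null_samples), 1 - alpha)
    return np.mean(statistic(alt_samples) > cutoff)

def row_means(samples):
    return samples.mean(axis=1)

def row_medians(samples):
    return np.median(samples, axis=1)

rng = np.random.default_rng(0)
N = 10_000
results = {}
for n in (1, 20, 100):
    null_samples = rng.normal(0.0, 1.0, size=(N, n))  # stand-in null model
    alt_samples = rng.normal(0.5, 1.0, size=(N, n))   # stand-in alternative
    results[n] = (
        simulated_power(row_means, null_samples, alt_samples),
        simulated_power(row_medians, null_samples, alt_samples),
    )

print("  n   mean test   median test")
for n, (power_mean, power_median) in results.items():
    print(f"{n:3d}   {power_mean:9.3f}   {power_median:11.3f}")
```

Each row of a simulated array is one sample, so any statistic that maps an $N \times n$ array to $N$ values can be passed to the helper.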

In [None]:
...

**(d)** It is a fact (which you do not need to prove) that if $X_1, X_2, \ldots, X_n$ are i.i.d. $\text{Cauchy}(\theta)$, then $\bar{X}$ also has the $\text{Cauchy}(\theta)$ distribution, regardless of $n$. Use this to explain why the naive test's power does not improve as $n$ grows.
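
The stated fact can be checked by simulation before you use it. The sketch below, with $\theta = 0$ and an arbitrary seed, compares simulated sample means to the standard Cauchy distribution, whose quartiles are $-1$ and $1$. The dictionary `summary` is a made-up name for holding the results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 10_000
summary = {}
for n in (1, 20, 100):
    # N sample means, each computed from n i.i.d. standard Cauchy draws
    means = rng.standard_cauchy(size=(N, n)).mean(axis=1)
    q25, q75 = np.quantile(means, [0.25, 0.75])
    # KS distance between the ecdf of the means and the standard Cauchy cdf
    distance = stats.kstest(means, 'cauchy').statistic
    summary[n] = (q25, q75, distance)
    print(f"n = {n:3d}: quartiles = ({q25:.2f}, {q75:.2f}), KS distance = {distance:.4f}")
```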