### Basic Resampling: the Jackknife
The jackknife is a statistical resampling method used for bias reduction and obtaining approximate confidence intervals.

Suppose we have a collection $\{x_k\}_{k=1}^n$ of $n$ data points and we want to compute an estimator $\hat{\theta}$ of some parameter $\theta$. In addition to computing $\hat\theta$ over the whole dataset, the jackknife recomputes the estimator $n$ times, each time over the subset formed by removing a single observation. Denote by $\hat\theta_k$ the value of the estimator on the subset formed by removing observation $x_k$.

The bias (the systematic error) of the estimator is given by $b(\hat\theta) = \text{E}[\hat\theta] - \theta$. Notice that this formula uses the unknown sought-after parameter $\theta$, so it cannot be evaluated directly. To estimate the bias of $\hat\theta$ from the data, we use Quenouille's jackknife bias estimator,
$$
\hat{b}(\hat\theta) = (n-1)\left(\frac{1}{n}\sum_{k=1}^n \hat\theta_k - \hat\theta\right)
$$


We get the jackknife estimate of $\theta$ by subtracting this bias estimate from the original estimator, $\hat\theta_{\text{jack}} = \hat\theta - \hat{b}(\hat\theta)$, which expands to:
$$
\hat\theta_{\text{jack}} = n\hat\theta - (n-1)\frac{1}{n}\sum_{k=1}^n \hat\theta_k
$$
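
As a concrete illustration of the two formulas above, here is a minimal sketch that applies them to the plug-in variance (`np.var` with its default `ddof=0`), which is known to be biased. The helper name `jackknife_bias_corrected` is hypothetical and is not part of the notebook's own code.

```python
import numpy as np

def jackknife_bias_corrected(data, estimator):
    """Return (theta_hat, bias_hat, theta_jack) for a 1-D sample.

    Hypothetical helper written for this sketch only.
    """
    x = np.asarray(data, dtype=float)
    n = len(x)
    theta_hat = estimator(x)  # estimator on the full sample
    # Leave-one-out estimates theta_k, one per removed observation.
    theta_k = np.array([estimator(np.delete(x, k)) for k in range(n)])
    # Quenouille's bias estimate: (n - 1) * (mean of theta_k - theta_hat).
    bias_hat = (n - 1) * (theta_k.mean() - theta_hat)
    # Bias-corrected estimate: n * theta_hat - (n - 1) * mean of theta_k.
    theta_jack = theta_hat - bias_hat
    return theta_hat, bias_hat, theta_jack

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
theta_hat, bias_hat, theta_jack = jackknife_bias_corrected(data, np.var)
print(theta_hat, bias_hat, theta_jack)
print(np.var(data, ddof=1))  # unbiased sample variance, for comparison
```

For this particular estimator the correction is exact: the jackknife estimate coincides with the unbiased sample variance (`ddof=1`). That is a special property of the variance and should not be expected of estimators in general.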

The point of this procedure is that it removes the leading, order-$1/n$ term of the bias. It can be shown (see the introduction to Shao and Tu's *The Jackknife and Bootstrap*) that the bias of the jackknife estimate $\hat\theta_{\text{jack}}$ is given by
$$
b(\hat\theta_\text{jack}) = -\frac{C}{n(n-1)} + O\left(\frac{1}{n^2}\right),
$$
for some constant $C$, so it is of order $1/n^2$. This is of smaller order than the bias of the estimator itself,
$$
b(\hat\theta) = \frac{c_0}{n} + \frac{c_1}{n^2} + O\left(\frac{1}{n^3}\right),
$$
for appropriate constants $c_0$ and $c_1$; when $c_0 \neq 0$ this bias is of order $1/n$.
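
To see where the cancellation comes from, here is a short sketch of the calculation. It assumes the observations are i.i.d., so that every leave-one-out estimate $\hat\theta_k$ has the same expectation as the estimator on $n-1$ points, and that the expansion of $b(\hat\theta)$ holds with the same constants at sample sizes $n$ and $n-1$. The notation $\hat\theta^{(m)}$ for the estimator on $m$ points is introduced only for this sketch.
$$
\begin{aligned}
\text{E}[\hat\theta_\text{jack}] &= n\,\text{E}[\hat\theta^{(n)}] - (n-1)\,\text{E}[\hat\theta^{(n-1)}] \\
&= n\left(\theta + \frac{c_0}{n} + \frac{c_1}{n^2}\right) - (n-1)\left(\theta + \frac{c_0}{n-1} + \frac{c_1}{(n-1)^2}\right) + O\left(\frac{1}{n^2}\right) \\
&= \theta + \frac{c_1}{n} - \frac{c_1}{n-1} + O\left(\frac{1}{n^2}\right) \\
&= \theta - \frac{c_1}{n(n-1)} + O\left(\frac{1}{n^2}\right).
\end{aligned}
$$
The $c_0$ terms cancel exactly, which is why the leading $1/n$ term of the bias disappears; under these assumptions the constant $C$ is $c_1$.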


In [1]:
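from typing import Any, Callable, Sequence
import numpy as np

def ith_jackknife_sample(data: Sequence[Any], i: int) -> np.ndarray:
    """Return a copy of `data` with the i-th observation removed."""
    if i < 0 or i >= len(data):
        # An out-of-range index is an error; do not silently return the full data.
        raise IndexError(f"index {i} out of range for {len(data)} observations")
    # np.delete works for lists and arrays alike and always returns an ndarray.
    return np.delete(np.asarray(data), i, axis=0)

def jackknife_sets(data: Sequence[Any]) -> np.ndarray:
    """Return all n leave-one-out subsets, one per row."""
    return np.array([ith_jackknife_sample(data, i) for i in range(len(data))])

def jackknife_estimate(data: Sequence[Any], estimator: Callable[[np.ndarray], float], dtype=np.float64) -> np.ndarray:
    """Return the n leave-one-out estimates, one per removed observation."""
    sets = jackknife_sets(data)
    return np.fromiter(map(estimator, sets), dtype=dtype)

def jackknife_variance(data: Sequence[Any], estimator: Callable[[np.ndarray], float], dtype=np.float64) -> float:
    """Jackknife variance estimate: (n-1)/n * sum_k (theta_k - mean(theta_k))^2."""
    n = len(data)
    estimates = jackknife_estimate(data, estimator, dtype=dtype)
    return (n - 1) / n * np.sum(np.square(estimates - np.mean(estimates)))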

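Below we compute the jackknife estimate of the mean, together with the jackknife estimate of its variance. Because the sample mean is unbiased, the average of the leave-one-out estimates coincides with the jackknife estimate and with the sample mean itself.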

In [2]:
data = [1,2,3,4,5,6,7,8,9,10]
estimates = jackknife_estimate(data, np.mean)
mu = np.mean(estimates)
sigma = jackknife_variance(data, np.mean)  # note: a variance, not a standard deviation
mu, sigma

(5.5, 0.9166666666666672)
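
As a sanity check on the value above: for the sample mean, the jackknife variance reduces algebraically to the classical estimate $s^2/n$, where $s^2$ is the unbiased sample variance. The sketch below verifies this numerically. It is self-contained and uses a hypothetical helper `loo_means` that relies on a shortcut valid only for the mean.

```python
import numpy as np

def loo_means(x):
    """Leave-one-out means of a 1-D sample (hypothetical helper for this sketch).

    Removing x_k from the sum and dividing by n - 1 gives the mean of the
    remaining observations without building each subset explicitly.
    """
    x = np.asarray(x, dtype=float)
    return (x.sum() - x) / (x.size - 1)

data = np.arange(1, 11)
n = data.size
theta_k = loo_means(data)

# Jackknife variance: (n - 1) / n * sum of squared deviations of theta_k.
jack_var = (n - 1) / n * np.sum((theta_k - theta_k.mean()) ** 2)
# Classical variance of the sample mean: s^2 / n.
classical_var = np.var(data, ddof=1) / n

print(jack_var, classical_var)
```

The two numbers agree up to floating-point rounding. This identity is specific to the mean; for other estimators the jackknife variance is only an approximation.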

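We can also use this information to construct, e.g., an approximate $95\%$ confidence interval for the mean, based on a normal approximation with the square root of the jackknife variance as the standard error.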

In [3]:
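alpha = 1.96  # 0.975 quantile of the standard normal (critical value for a two-sided 95% interval)
std_err = np.sqrt(sigma)  # sigma holds the jackknife variance, so take its square root
print(f"the 95% confidence interval for the mean is {mu - alpha * std_err, mu + alpha * std_err}")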

the 95% confidence interval for the mean is (3.623442868797577, 7.376557131202423)


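The value $1.96$ (named `alpha` in the code above) is the critical value for a two-sided $95\%$ interval, i.e. the $0.975$ quantile of the standard normal distribution. It is readily available in statistical tables or can be computed from the inverse of the normal cumulative distribution function.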
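
One way to compute this value rather than look it up is sketched below, using `NormalDist.inv_cdf` from Python's standard-library `statistics` module (available since Python 3.8). The function name `normal_critical_value` is hypothetical and introduced only for this example.

```python
from statistics import NormalDist

def normal_critical_value(confidence: float) -> float:
    """Two-sided critical value of the standard normal for a given confidence level."""
    tail = (1.0 - confidence) / 2.0  # probability mass left in each tail
    return NormalDist().inv_cdf(1.0 - tail)

z = normal_critical_value(0.95)
print(z)  # approximately 1.96
```

Other confidence levels follow from the same call, e.g. `normal_critical_value(0.99)` for a $99\%$ interval.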

For more information on the jackknife see Jun Shao and Dongsheng Tu's book *The Jackknife and Bootstrap*.