# Chapter 14 Nonparametric Statistical Methods

Nonparametric or distribution-free methods are useful when the data cannot be assumed to follow a specific distribution (e.g., normal).

In [93]:
import math
import itertools
from typing import Any, Sequence

import polars as pl
from polars import col, lit
from scipy import stats
import numpy as np
import altair as alt

RNG = np.random.default_rng()
DATA = {}  # input data

## 14.1 Inferences for Single Samples

In the single sample case, the nonparametric alternatives to the t-test are the **sign test** and the **Wilcoxon signed rank test**. To test a null hypothesis $H_0: \tilde{\mu} = \tilde{\mu_0}$ on the population median $\tilde{\mu}$, based on data $x_1, x_2, \ldots , x_n$ from that population, the sign test takes into account only whether each difference $d_i = x_i - \tilde{\mu_0}$ is $> 0$ or $< 0$. The test statistics are $s_+ = \#(d_i > 0)$ and $s_- = \#(d_i < 0) = n - s_+$. Under $H_0$, both the test statistics have a $\mathrm{Bin}(n, 1/2)$ distribution. The Wilcoxon signed rank test takes into account not only the signs of the $d_i$ but also the ranks of the $|d_i|$. The test statistics are $w_+$ = sum of the ranks of the positive $d_i$'s and $w_- = n (n + 1)/2 - w_+$ = sum of the ranks of the negative $d_i$'s. This test is more powerful than the sign test but requires the additional assumption that the population distribution is symmetric. Both these tests can be inverted to obtain a confidence interval for $\tilde{\mu}$. Both methods extend to the paired sample case.

In scipy:
- The sign test is implemented as [stats.quantile_test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.quantile_test.html).
- The Wilcoxon signed rank test is implemented as [stats.wilcoxon](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html).
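
As a rough illustration of the definitions above, the sketch below computes $s_+$, $s_-$, $w_+$ and $w_-$ by hand. The sample `x` and the hypothesized median `mu0` are made-up values chosen for illustration, not data from the text.

```python
import numpy as np
from scipy import stats

# Hypothetical sample and null median (illustrative values only)
x = np.array([5.1, 3.2, 6.8, 4.9, 7.3, 2.5])
mu0 = 4.0

d = x - mu0                      # differences d_i
n = d.size
s_plus = int(np.sum(d > 0))      # number of positive differences
s_minus = int(np.sum(d < 0))     # number of negative differences

ranks = stats.rankdata(np.abs(d))      # ranks of |d_i| (midranks if tied)
w_plus = float(ranks[d > 0].sum())     # sum of ranks of the positive d_i
w_minus = float(ranks[d < 0].sum())    # sum of ranks of the negative d_i

# Upper one-sided sign test P-value: P(Bin(n, 1/2) >= s_plus)
p_sign = stats.binom.sf(s_plus - 1, n, 0.5)
print(s_plus, s_minus, w_plus, w_minus, p_sign)
```

The identity $w_+ + w_- = n(n+1)/2$ gives a quick sanity check on the ranking step.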

### Ex 14.1

Test whether the median of the population from which the following sample is drawn exceeds 30, i.e., test $H_0: \tilde\mu = 30$ vs. $H_1: \tilde\mu > 30$:

37 | 26 | 31 | 35 | 32 | 32 | 27 | 31 | 34 | 36
---|---|---|---|---|---|---|---|---|---

#### (a)

Find the exact P-value for the sign test and find the normal approximation to it. Is the normal approximation accurate? Can you reject $H_0$ at $\alpha = .05$?

In [2]:
DATA['14.1'] = np.array([37, 26, 31, 35, 32, 32, 27, 31, 34, 36])

For the exact p-value:

In [3]:
stats.quantile_test(DATA['14.1'], q=30, p=0.5, alternative='greater')

QuantileTestResult(statistic=2, statistic_type=1, pvalue=0.0546875)

which can be calculated directly using the binomial CDF.

In [36]:
stats.binom.cdf(np.count_nonzero(DATA['14.1'] <= 30), n=DATA['14.1'].size, p=0.5)

0.0546875

Or, using the normal approximation with continuity correction:

In [29]:
stats.norm.cdf((np.count_nonzero(DATA['14.1'] <= 30) - DATA['14.1'].size * 0.5 + 0.5) / np.sqrt(DATA['14.1'].size * 0.5 * 0.5))

0.056923149003329024

Either way the P-value exceeds $\alpha = 0.05$, so we cannot reject $H_0$. The normal approximation (0.0569) is close to the exact value (0.0547).

#### (b)

Repeat part (a) using the Wilcoxon signed rank test.

For the exact test:

In [36]:
stats.wilcoxon(DATA['14.1']-30, alternative='greater', method='exact')

WilcoxonResult(statistic=43.5, pvalue=0.0654296875)

Normal approximation with continuity correction:

In [38]:
stats.wilcoxon(DATA['14.1']-30, alternative='greater', method='approx', correction=True)

WilcoxonResult(statistic=43.5, pvalue=0.05671152359300298)

Either way the P-value exceeds $\alpha = 0.05$, so we cannot reject $H_0$.

### Ex 14.2

In a study of the effect of vitamin B on learning, 12 matched pairs of children were randomly divided into two groups. One child in each pair received a vitamin B tablet (treatment) every day, while the other child received a placebo tablet and served as control. The following table shows the gain in IQ over the six weeks of the study:

Pair |1|2|3|4|5|6|7|8|9|10|11|12
---|---|---|---|---|---|---|---|---|---|---|---|---
Treated | 14 | 26 | 2 | 4 | -5 | 14 | 3 | -1 | 1 | 6 | 3 | 4
Control | 8 | 18 | -7 | -1 | 2 | 9 | 0 | -4 | 13 | 3 | 3 | 3

#### (a)

Find the exact P-value for the sign test to determine if vitamin B improves the IQ.

In [42]:
DATA['14.2'] = pl.DataFrame({
    'pair': range(1, 13),
    'treated': [14, 26, 2, 4, -5, 14, 3, -1, 1, 6, 3, 4],
    'control': [8, 18, -7, -1, 2, 9, 0, -4, 13, 3, 3, 3]})
DATA['14.2']

pair,treated,control
i64,i64,i64
1,14,8
2,26,18
3,2,-7
4,4,-1
5,-5,2
6,14,9
7,3,0
8,-1,-4
9,1,13
10,6,3


In [54]:
stats.quantile_test(
    DATA['14.2'].select(col('treated')-col('control')).to_series(), 
    alternative='greater')

QuantileTestResult(statistic=3, statistic_type=1, pvalue=0.072998046875)

Since the P-value (0.073) exceeds 0.05, we cannot conclude at the 0.05 significance level that vitamin B improves IQ.

#### (b) 

Repeat part (a) using the Wilcoxon signed rank test. Why is a less significant result obtained in this case?

In [58]:
stats.wilcoxon(
    DATA['14.2'].get_column('treated'),
    DATA['14.2'].get_column('control'),
    alternative='greater')

WilcoxonResult(statistic=47.0, pvalue=0.10604514898621259)

Neither the sign test nor the Wilcoxon signed rank test shows that vitamin B improves IQ at the 0.05 significance level. The Wilcoxon test gives an even less significant P-value (0.106) because the few negative differences have large magnitudes, which partly offsets the larger number of positive differences (see the plot below). The sign test considers only the signs and ignores magnitudes, so it does not register this offsetting effect.

In [67]:
(
    alt.Chart(
        DATA['14.2'].select(
            (col('treated')-col('control')).alias('diff')))
    .mark_tick()
    .encode(
        alt.X('diff:Q').axis(grid=False)))

### Ex 14.3

For the previous exercise calculate 90% sign and Wilcoxon signed rank CI's on the median of the difference in IQ scores of treated vs. control children. Compare the results with those obtained using the corresponding tests.

### Ex 14.4

Refer to the corneal thickness data of glaucoma patients from Exercise 8.7.

#### (a) 

Do the sign test to determine if the corneal thickness differs between an eye affected with glaucoma and an unaffected eye. Use $\alpha = .05$.

#### (b) 

Repeat part (a) using the Wilcoxon signed rank test.

### Ex 14.5

For the previous exercise calculate 95% (or the nearest achievable higher confidence level to be on the conservative side) sign and Wilcoxon signed rank CI's on the median difference in the corneal thickness between an affected and an unaffected eye. Compare the results with those obtained using the corresponding tests.

### Ex 14.6

Refer to the data from Exercise 8.15 on home heating energy consumption before and after installing insulation.

#### (a) 
Do the sign test at $\alpha = .05$ to determine if insulation reduced the energy consumption.

#### (b) 

Repeat part (a) using the Wilcoxon signed rank test.

### Ex 14.7

For the previous exercise calculate 95% (or the nearest achievable higher confidence level to be on the conservative side) sign and Wilcoxon signed rank CI's on the median energy saving due to insulation. Compare the results with those obtained using the corresponding tests.

### Ex 14.8

Find the null distribution of the Wilcoxon signed rank statistic for n=4 by enumerating all possible assignments of ranks to signed differences as in Table 14.2. Verify that the mean of the distribution equals $n(n + 1)/4 = 5$ and variance equals $n(n + 1)(2n + 1)/24 = 7.5$.
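
One possible way to carry out the enumeration is sketched below: under $H_0$ each of the $2^4 = 16$ sign assignments to the ranks 1 to 4 is equally likely, so the null distribution of $w_+$ can be tabulated directly. The variable names are illustrative.

```python
import itertools
import numpy as np

n = 4
ranks = np.arange(1, n + 1)

# Under H0 every rank is positive or negative with probability 1/2,
# so all 2^n sign assignments are equally likely.
w_plus = np.array([
    int(np.dot(ranks, signs))        # sum of the ranks given a positive sign
    for signs in itertools.product([0, 1], repeat=n)])

values, counts = np.unique(w_plus, return_counts=True)
null_dist = dict(zip(values.tolist(), (counts / 2**n).tolist()))

mean = w_plus.mean()   # mean over the 16 equally likely outcomes
var = w_plus.var()     # population variance (ddof=0)
print(null_dist, mean, var)
```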

### Ex 14.9

Many nonparametric test statistics have discrete distributions which impose a lower bound on the P-value (attained when the sample outcome is most favorable for rejecting $H_0$, e.g., when all signs are plus in the upper one-sided sign test or the Wilcoxon signed rank test) and therefore make it impossible to reject $H_0$ at an $\alpha$ less than this lower bound.

#### (a)

Show that the lowest attainable P-value using the one-sided sign test or the Wilcoxon signed rank test is $(1/2)^n$.

#### (b) 

Is it possible to reject $H_0$ at $\alpha = .01$ if n = 6 using these tests?

#### (c) 

What is the smallest sample size required for these tests if rejection of $H_0$ at $\alpha = .01$ must be possible?
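
A small numerical check for parts (b) and (c), taking the bound $(1/2)^n$ from part (a) as given. The helper `min_p_value` is a hypothetical name introduced here for illustration.

```python
import itertools

alpha = 0.01

def min_p_value(n: int) -> float:
    """Lowest attainable one-sided P-value: all n signs are plus."""
    return 0.5 ** n

p_min_6 = min_p_value(6)              # bound for n = 6
can_reject_n6 = p_min_6 <= alpha      # is rejection at alpha possible?

# Smallest n for which the bound drops to alpha or below
n_min = next(n for n in itertools.count(1) if min_p_value(n) <= alpha)
print(p_min_6, can_reject_n6, n_min)
```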

### Ex 14.10

Check the accuracy of the normal approximation (14.6) to the Wilcoxon signed rank critical constant $w_{n,\alpha}$ by comparing it with the exact constant from Table A.10 for n = 10 and $\alpha = .053$.

## 14.2 Inferences for Two Independent Samples

The nonparametric alternative to the two independent samples t-test is the **Wilcoxon-Mann-Whitney** test. Let $n_1$ and $n_2$ be the sizes of the two samples. The Wilcoxon test ranks all $N = n_1 + n_2$ observations together and computes rank sums, $w_1$ and $w_2$, for the two samples, where $w_1 + w_2 = N (N + 1)/2$. The Mann-Whitney test uses the statistics $u_1 = \#(x_i > y_j)$ and $u_2 = \#(x_i < y_j) = n_1 n_2 - u_1$. The two tests are equivalent because their test statistics are related: $u_1 = w_1 - n_1 (n_1 + 1)/2$ and $u_2 = w_2 - n_2 (n_2 + 1)/2$. The null distributions of these statistics are derived using the fact that all $\frac{N!}{n_1! n_2!}$ orderings of the data are equally likely under the null hypothesis that the two populations are identical.
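
The relation between the rank sums and the Mann-Whitney counts can be checked numerically. The sketch below uses two made-up samples without ties; the values are illustrative only.

```python
import numpy as np
from scipy import stats

# Hypothetical samples without ties (illustrative values only)
x = np.array([12.0, 7.5, 9.1, 15.3])
y = np.array([8.2, 6.4, 10.7, 5.9, 11.6])
n1, n2 = x.size, y.size
N = n1 + n2

# Wilcoxon: rank all N observations together, then sum the ranks per sample
ranks = stats.rankdata(np.concatenate([x, y]))
w1, w2 = ranks[:n1].sum(), ranks[n1:].sum()

# Mann-Whitney: count the pairs with x_i > y_j and with x_i < y_j
u1 = int(np.sum(x[:, None] > y[None, :]))
u2 = int(np.sum(x[:, None] < y[None, :]))

print(w1, w2, u1, u2)
print(u1 == w1 - n1 * (n1 + 1) / 2, u2 == w2 - n2 * (n2 + 1) / 2)
```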

### Ex 14.11

The table below gives the survival times of 16 mice that were randomly assigned to a control group or to a treatment group. Did the treatment prolong survival? Answer using the Wilcoxon-Mann-Whitney test at $\alpha = .10$.

Survival Times of Mice (Days)||||||||||
---|---|---|---|---|---|---|---|---|---
**Control Group** | 52 | 104 | 146 | 10 | 50 | 31 | 40 | 27 | 46
**Treatment Group** | 94 | 197 | 16 | 38 | 99 | 141 | 23 ||

In [97]:
# ✍️
DATA['14.11'] = pl.read_json('Ex14-11.json')
DATA['14.11'].head()

Control,Treatmnt
list[f64],list[f64]
"[52.0, 104.0, … 46.0]","[94.0, 197.0, … 23.0]"


In [3]:
stats.mannwhitneyu(
    DATA['14.11']['Treatmnt'].explode(),
    DATA['14.11']['Control'].explode(),
    alternative='greater')

MannwhitneyuResult(statistic=36.0, pvalue=0.3402972027972028)

The P-value (0.34) exceeds 0.10, so there is not enough evidence to conclude that the treatment prolongs survival.

### Ex 14.12

Consider the dopamine level data from Exercise 4.22. Apply the Wilcoxon-Mann-Whitney test at $\alpha = .05$ to find out if there is a significant difference between the dopamine levels of the psychotic and nonpsychotic patients.

### Ex 14.13

Consider the data from Exercise 8.11 on the measurement of atomic weight of carbon by two different methods.

#### (a) 

Do the Wilcoxon-Mann-Whitney test at $\alpha$ = .10 to determine if there is a significant difference between the two methods.

#### (b)

Calculate a 90% CI on the median difference between the two methods. Does the result agree with the result of the hypothesis test from part (a)?

### Ex 14.14

Find the null distribution of the Mann-Whitney statistic ($U_1$) for $n_1 = n_2 = 3$ by enumerating all possible assignments of ranks to the $x$'s and $y$'s as in Table 14.5. Verify that the mean of the distribution equals $n_1 n_2 / 2 = 4.5$ and variance equals $n_1 n_2 (N + 1) / 12 = 5.25$.
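
A sketch of the enumeration, assuming no ties: each of the $\binom{6}{3} = 20$ choices of ranks for the $x$'s is equally likely under $H_0$, and $u_1$ follows from the rank sum $w_1$. Variable names are illustrative.

```python
import itertools
import numpy as np

n1 = n2 = 3
N = n1 + n2

# Each choice of n1 ranks (out of 1..N) for the x's is equally likely under H0
u1 = np.array([
    sum(x_ranks) - n1 * (n1 + 1) // 2      # u1 = w1 - n1 (n1 + 1) / 2
    for x_ranks in itertools.combinations(range(1, N + 1), n1)])

values, counts = np.unique(u1, return_counts=True)
null_dist = dict(zip(values.tolist(), (counts / len(u1)).tolist()))

mean, var = u1.mean(), u1.var()    # population mean and variance (ddof=0)
print(null_dist, mean, var)
```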

### Ex 14.15

This exercise extends Exercise 14.9 to the Wilcoxon-Mann-Whitney test.

#### (a) 

Show that the lowest attainable P-value using the one-sided Wilcoxon-Mann-Whitney test is $1/\binom{n_1+n_2}{n_1}$, where $n_1$ and $n_2$ are the sample sizes of the two groups.

#### (b)

Is it possible to reject $H_0$ at $\alpha = .01$ if $n_1 = n_2 = 4$?

#### (c)

What is the smallest $n_1 = n_2 = n$ required if rejection of $H_0$ at $\alpha = .01$ must be possible?

### Ex 14.16

Check the accuracy of the normal approximation (14.11) to the Wilcoxon-Mann-Whitney critical constant $u_{n_1, n_2, \alpha}$ by comparing it with the exact constant from Table A.11 for $n_1 = 8$, $n_2 = 10$ and $\alpha = .051$.

### Ex 14.17

Use the formulas (14.10) for the mean and variance of the Mann-Whitney U-statistic to show that the means and variances of the Wilcoxon rank sum statistics are given by

$$
\mathrm{E}(W_1) = \frac{n_1 (N + 1)}{2}, \quad \mathrm{Var}(W_1) = \frac{n_1 n_2 (N + 1)}{12}
$$

and

$$
\mathrm{E}(W_2) = \frac{n_2 (N + 1)}{2}, \quad \mathrm{Var}(W_2) = \frac{n_1 n_2 (N + 1)}{12}\,.
$$

What is the expected rank of any observation under $H_0$? Use this expected rank to justify the formulas for $\mathrm{E}(W_1)$ and $\mathrm{E}(W_2)$.

## 14.3 Inferences for Several Independent Samples

The **Kruskal-Wallis test** is a generalization of the Wilcoxon-Mann-Whitney test for $a > 2$ independent samples; it is a nonparametric alternative to the ANOVA F-test for a one-way layout. Let $n_i$ denote the sample size from the $i$th treatment group. All $N = \sum_{i=1}^a n_i$ observations are ranked together and the rank sums $r_i$ and rank averages $\bar{r_i} = r_i / n_i$ are computed. The test statistic is a weighted sum of squares of the $\bar{r_i}$ around their average, which equals $(N + 1)/2$. The null distribution of the statistic is derived by using the same argument as above, namely that under the null hypothesis of equality of the $a$ population distributions all $\frac{N!}{n_1! n_2! \cdots n_a!}$ orderings of the data are equally likely. Asymptotically (as $n_i \rightarrow \infty$ for all $i$), this null distribution is chi-square with $a - 1$ d.f.

### Ex 14.18

How many different assignments of ranks are possible in the Kruskal-Wallis test for a one-way layout with three treatment groups and four observations per group? Assume that there are no ties.
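
The count is the multinomial coefficient $\frac{N!}{n_1! n_2! n_3!}$ from the section summary; a quick way to evaluate it (variable names are illustrative):

```python
import math

group_sizes = [4, 4, 4]     # three treatment groups, four observations each
N = sum(group_sizes)

# Multinomial coefficient N! / (n_1! n_2! ... n_a!)
n_assignments = math.factorial(N)
for n_i in group_sizes:
    n_assignments //= math.factorial(n_i)

print(n_assignments)
```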

### Ex 14.19

Sixteen students were randomized to four different educational programs, four students per program. The following are the number of days absent from the program over a period of one academic year.

Program | A | B | C | D
---|---|---|---|---
|| 17 | 14 | 6 | 11
|| 9 | 10 | 8 | 2
|| 35 | 21 | 5 | 4
|| 20 | 18 | 13 | 7

#### (a)

Perform the Kruskal-Wallis test at $\alpha = .05$ to determine if there are significant differences among the four programs.
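
A minimal sketch using `stats.kruskal`, with the data transcribed from the table above. Note that `stats.kruskal` reports a P-value based on the chi-square approximation, which may be rough with only four observations per group.

```python
from scipy import stats

# Days absent per program, transcribed from the table above
absences = {
    'A': [17, 9, 35, 20],
    'B': [14, 10, 21, 18],
    'C': [6, 8, 5, 13],
    'D': [11, 2, 4, 7]}

res = stats.kruskal(*absences.values())
reject_h0 = bool(res.pvalue < 0.05)    # compare with alpha = .05
print(res.statistic, res.pvalue, reject_h0)
```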

#### (b) 

Is there a significant difference between the two programs with the most and the least absences? Answer using a multiple comparison test at $\alpha = .05$.

### Ex 14.20

Refer to the water salinity data from Exercise 12.4. Apply the Kruskal-Wallis test to check if there are significant differences between the salinity levels of water from different sites. If so, apply a nonparametric multiple comparison test to determine which methods differ from each other. Use $\alpha = .01$ for both tests.

### Ex 14.21

Refer to the sickle cell data from Exercise 12.5. Apply the Kruskal-Wallis test to check if there are significant differences between the hemoglobin levels of patients with different types of sickle cell disease. If so, apply a nonparametric multiple comparison test to determine which disease types differ from each other. Use $\alpha = .05$ for both tests.

## 14.4 Inferences For Several Matched Samples

The **Friedman test** is a generalization of the sign test for $a > 2$ matched samples; it is a nonparametric alternative to the ANOVA F-test for a randomized block design. Here the rankings are done separately within each of the $b$ sets of matched samples or blocks. The test statistic is similar to the Kruskal-Wallis statistic based on rank averages. The null distribution of the statistic is derived by using the fact that under the null hypothesis of the equality of the $a$ population distributions, all $(a!)^b$ rankings are equally likely. Asymptotically (as $b \rightarrow \infty$), this null distribution is chi-square with $a - 1$ d.f.

### Ex 14.22

How many different assignments of ranks are possible in the Friedman test for a randomized block design with three treatment groups and four blocks? Assume that there are no ties.

### Ex 14.23

Three assembly fixtures were compared using four different workers, with each worker using the three fixtures in a random order. The time (in minutes) taken to complete the assembly is given in the following table.

Fixture \ Worker | 1 | 2 | 3 | 4
---|---|---|---|---
1 | 1.3 | 1.6 | 2.0 | 1.4
2 | 2.1 | 1.9 | 2.2 | 1.5
3 | 1.8 | 1.7 | 1.2 | 1.1

#### (a)

Do the Friedman test to determine if there are significant differences between the fixtures in terms of the assembly times required. Use $\alpha = .10$.
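
A minimal sketch using `stats.friedmanchisquare`, which takes one sequence per treatment (fixture) with the blocks (workers) in the same order in each. The P-value it reports comes from the chi-square approximation, which may be rough with only four blocks.

```python
from scipy import stats

# Assembly times (minutes); one list per fixture, ordered by worker 1..4
fixture_1 = [1.3, 1.6, 2.0, 1.4]
fixture_2 = [2.1, 1.9, 2.2, 1.5]
fixture_3 = [1.8, 1.7, 1.2, 1.1]

res = stats.friedmanchisquare(fixture_1, fixture_2, fixture_3)
reject_h0 = bool(res.pvalue < 0.10)    # compare with alpha = .10
print(res.statistic, res.pvalue, reject_h0)
```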

#### (b)

If significant differences are found in part (a), determine which fixtures differ from each other using a nonparametric multiple comparison test with $\alpha = .10$.

### Ex 14.24

Analyze the penicillin yield data from Exercise 12.21 by using the Friedman test to determine if there are significant differences between the methods of manufacture. If so, apply a nonparametric multiple comparison test to determine which methods differ from each other. Use $\alpha = .10$ for both tests.

### Ex 14.25

Refer to the blood plasma data from Exercise 12.26. Apply the Friedman test to check if there are significant differences between the plasma levels of each person at different times. Use $\alpha = .05$.

## 14.5 Rank Correlation Methods

Two nonparametric alternatives to the Pearson correlation coefficient for n independent pairs of observations, $\{(x_i, y_i)\ |\ i = 1, 2, \ldots, n\}$, are the **Spearman rank correlation coefficient** and the **Kendall rank correlation coefficient**. The Spearman rank correlation coefficient is just the Pearson correlation coefficient applied to the ranks of the $x_i$'s and the $y_i$'s. The Kendall rank correlation coefficient is defined as the difference between the proportions of concordant and discordant pairs, which can also be computed from the ranks. The null distributions (under the hypothesis of independence between the $x_i$'s and the $y_i$'s) of both these rank correlation coefficients are derived by regarding all $n!$ pairings of the ranks as equally likely. Asymptotically (as $n \rightarrow \infty$), the null distributions of both the rank correlation coefficients are normal with zero means but different variances.
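
Both definitions can be checked against scipy on a small example. The paired data below are made up and contain no ties; with ties, `stats.kendalltau` computes the tau-b variant by default, which would no longer match the simple proportion used here.

```python
import itertools
import numpy as np
from scipy import stats

# Hypothetical paired data without ties (illustrative values only)
x = np.array([2.1, 4.7, 3.3, 8.0, 6.2, 5.5])
y = np.array([1.0, 3.9, 4.4, 7.1, 5.0, 6.8])
n = x.size

# Spearman: Pearson correlation applied to the ranks
r_s = np.corrcoef(stats.rankdata(x), stats.rankdata(y))[0, 1]

# Kendall: proportion of concordant pairs minus proportion of discordant pairs
signs = [
    np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])   # +1 concordant, -1 discordant
    for i, j in itertools.combinations(range(n), 2)]
tau = sum(signs) / len(signs)

r_s_scipy, _ = stats.spearmanr(x, y)
tau_scipy, _ = stats.kendalltau(x, y)
print(r_s, r_s_scipy, tau, tau_scipy)
```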

## 14.6 *Resampling Methods

Resampling methods draw repeated samples from the observed sample itself to generate the sampling distribution of a statistic. The permutation method draws samples without replacement, while the bootstrap method draws samples with replacement. The jackknife method resamples by deleting one observation at a time. These methods are useful for assessing accuracies (e.g., the bias and standard error) of complex statistics.

The inherent error of resampling could be far less serious than making wrong and often unverifiable assumptions about the population distribution.

- permutation test: draw from samples *without replacement*. The Wilcoxon-Mann-Whitney test is a special case of this.
- bootstrap method: draw from samples *with replacement*.
- jackknife method: delete one observation at a time.
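
The three resampling schemes can be sketched with numpy as follows. The sample `x` and the seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)               # arbitrary seed for repeatability
x = np.array([3.1, 4.5, 2.2, 5.8, 4.0])      # hypothetical sample

# Permutation: reorder the observations (sampling without replacement)
permuted = rng.permutation(x)

# Bootstrap: draw n observations with replacement
bootstrapped = rng.choice(x, size=x.size, replace=True)

# Jackknife: n samples, each with one observation deleted
jackknifed = [np.delete(x, i) for i in range(x.size)]

print(permuted, bootstrapped, jackknifed, sep='\n')
```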

### Ex 14.34

Consider a subset of the carbon atomic weight data shown below from Exercise 8.11.

||||||
---|---|---|---|---
**Method 1**| 12.0129 | 12.0072 | 12.0064 | 12.0054
**Method 2**| 12.0318 | 12.0246 | 12.0069 |

Do a two-sided permutation test to compare the two methods of atomic weight determination by listing all $\binom{7}{3} = 35$ assignments of labels x and y to the data values. Use $\alpha = .10$.

In [40]:
# ✍️
method_1 = np.array([12.0129, 12.0072, 12.0064, 12.0054])
method_2 = np.array([12.0318, 12.0246, 12.0069])

In [3]:
def get_statistic(m1: np.ndarray, m2: np.ndarray) -> float:
    return np.mean(m1) - np.mean(m2)

In [42]:
res = stats.permutation_test(
    (method_1, method_2), get_statistic,
    permutation_type='independent', alternative='two-sided', n_resamples=np.inf)
res.pvalue

0.17142857142857143

The P-value exceeds 0.1, therefore there is no significant difference between the two methods.

### Ex 14.35

Do a one-sided permutation test on the mouse data from Exercise 14.11 by using a computer package to draw 25 random samples _without replacement_. Enumerate your samples. Estimate the P-value and compare it with that obtained in Exercise 14.11.

✍️ First, let's enumerate the samples. Remember that the mouse data from Exercise 14.11 is:

In [58]:
DATA['14.11']

Control,Treatmnt
list[f64],list[f64]
"[52.0, 104.0, … 46.0]","[94.0, 197.0, … 23.0]"


In [116]:
# simulate a permutation test
n_samples = 25
control, treatment = DATA['14.11'].row(0)

df = (
    DATA['14.11']
    .select(
        pl.repeat(col('Control').list.concat('Treatmnt'), n_samples)
        .list.sample(fraction=1, shuffle=True)
        .alias('data'))
    .select(
        col('data').list.head(len(control)).alias('control'),
        col('data').list.tail(len(treatment)).alias('treatment'))
    .with_columns(
        statistic=col('treatment').list.mean() - col('control').list.mean()))

with pl.Config(tbl_rows=-1):
    print(df)

shape: (25, 3)
┌────────────────────────┬────────────────────────┬────────────┐
│ control                ┆ treatment              ┆ statistic  │
│ ---                    ┆ ---                    ┆ ---        │
│ list[f64]              ┆ list[f64]              ┆ f64        │
╞════════════════════════╪════════════════════════╪════════════╡
│ [146.0, 52.0, … 16.0]  ┆ [40.0, 23.0, … 38.0]   ┆ -5.428571  │
│ [94.0, 197.0, … 50.0]  ┆ [10.0, 16.0, … 104.0]  ┆ -33.873016 │
│ [94.0, 46.0, … 52.0]   ┆ [99.0, 10.0, … 40.0]   ┆ 12.857143  │
│ [16.0, 146.0, … 94.0]  ┆ [40.0, 10.0, … 141.0]  ┆ 20.47619   │
│ [94.0, 23.0, … 104.0]  ┆ [50.0, 141.0, … 197.0] ┆ 42.063492  │
│ [31.0, 16.0, … 99.0]   ┆ [40.0, 27.0, … 52.0]   ┆ -15.333333 │
│ [38.0, 31.0, … 52.0]   ┆ [27.0, 146.0, … 94.0]  ┆ 22.507937  │
│ [52.0, 197.0, … 99.0]  ┆ [146.0, 27.0, … 94.0]  ┆ -17.365079 │
│ [94.0, 38.0, … 10.0]   ┆ [27.0, 16.0, … 50.0]   ┆ -5.68254   │
│ [146.0, 141.0, … 38.0] ┆ [50.0, 31.0, … 40.0]   ┆ -43.777778 │
│ [94.0, 2

So the P-value is estimated to be:

In [117]:
1 - 0.01 * stats.percentileofscore(df['statistic'], np.mean(treatment) - np.mean(control))

0.12

Because 25 random permutations cover only a tiny fraction of the $\binom{16}{7} = 11440$ possible relabelings, this estimate varies from run to run. It should nonetheless be reasonably close to the exact P-value obtained by enumerating all the combinations with `stats.permutation_test`:

In [119]:
res = stats.permutation_test(
    (treatment, control), 
    lambda m1, m2: np.mean(m1) - np.mean(m2),
    permutation_type='independent', alternative='greater', n_resamples=np.inf)
res.pvalue

0.14055944055944056

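In Ex 14.11, by contrast, the Wilcoxon-Mann-Whitney test gave a P-value of 0.34. The two differ because the permutation test here uses the difference in sample means rather than rank sums.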

### Ex 14.36

Use a computer package to draw 25 bootstrap samples from the data in Exercise 7.16 on thermostat settings. Enumerate your samples. Estimate the standard error of the sample mean and compare it with the exact value.

In [3]:
# ✍️
DATA['14.36'] = pl.read_json('../data/Ex7-16.json')
DATA['14.36']

"[202.2, 203.4, … 199.0]"


Now to enumerate 25 bootstrap samples:

In [4]:
df = (
    DATA['14.36']
    .select(
        pl.repeat(col(''), n=25)
        .list.sample(fraction=1, with_replacement=True)
        .alias('temperature')))

with pl.Config(tbl_rows=-1, fmt_table_cell_list_len=-1, fmt_str_lengths=100):
    print(df.with_row_index(offset=1))

shape: (25, 2)
┌───────┬────────────────────────────────────────────────────────────────────────┐
│ index ┆ temperature                                                            │
│ ---   ┆ ---                                                                    │
│ u32   ┆ list[f64]                                                              │
╞═══════╪════════════════════════════════════════════════════════════════════════╡
│ 1     ┆ [200.5, 201.3, 203.7, 200.8, 203.4, 203.7, 201.3, 203.7, 202.5, 202.2] │
│ 2     ┆ [202.2, 198.0, 203.7, 203.7, 202.5, 202.2, 203.4, 199.0, 200.8, 203.4] │
│ 3     ┆ [203.7, 199.0, 206.3, 202.2, 199.0, 199.0, 198.0, 203.4, 201.3, 199.0] │
│ 4     ┆ [206.3, 199.0, 203.7, 202.5, 201.3, 200.8, 206.3, 198.0, 203.7, 202.5] │
│ 5     ┆ [198.0, 206.3, 202.2, 206.3, 206.3, 202.2, 206.3, 201.3, 201.3, 202.2] │
│ 6     ┆ [201.3, 199.0, 200.5, 202.5, 206.3, 203.7, 201.3, 198.0, 203.7, 202.2] │
│ 7     ┆ [198.0, 202.2, 202.5, 198.0, 199.0, 198.0, 203.7, 203.4, 203.7

The standard error of these sample means is therefore:

In [5]:
df.select(col('temperature').list.mean().std().alias('SE (25 samples)'))

SE (25 samples)
f64
0.629497


Compare it with the exact standard error $s/\sqrt{n}$ computed from the given data. The two are indeed comparable.

In [6]:
# parenthesize the ratio so that the alias applies to the whole expression
DATA['14.36'].select((col('').list.std() / col('').list.len().sqrt()).alias('exact SE'))

0.762168


### Ex 14.37

Use the bootstrap samples from Exercise 14.36 to estimate the standard errors of the sample median, sample maximum, and sample minimum.

In [31]:
df.select(
    col('temperature').list.mean().std().alias('SE of mean'),
    col('temperature').list.median().std().alias('of median'),
    col('temperature').list.max().std().alias('of max'),
    col('temperature').list.min().std().alias('of min'))

SE of mean,of median,of max,of min
f64,f64,f64,f64
0.629497,0.777909,1.19147,0.845084


Compare with the result below, obtained using `stats.bootstrap`. Note that each SE is computed from its own set of bootstrap resamples (9999 by default).

In [34]:
DATA['14.36'].select(
    col('').map_elements(
        lambda x: {
            'SE of mean': stats.bootstrap((x,), np.mean).standard_error,
            'of median': stats.bootstrap((x,), np.median).standard_error,
            'of max': stats.bootstrap((x,), np.max).standard_error,
            'of min': stats.bootstrap((x,), np.min).standard_error,})
).unnest('')

SE of mean,of median,of max,of min
f64,f64,f64,f64
0.722429,0.863929,1.340081,0.836833


### Ex 14.38

Do a bootstrap test on the mouse data from Exercise 14.11 by using a computer package to draw 25 random samples _with replacement_ as done in Example 14.7. Enumerate your samples. Estimate the P-value and compare it with that obtained in Exercises 14.11 and 14.35.

✍️ Remember the data from Exercise 14.11 was:

In [98]:
DATA['14.11']

Control,Treatmnt
list[f64],list[f64]
"[52.0, 104.0, … 46.0]","[94.0, 197.0, … 23.0]"


In [121]:
# to enumerate the 25 samples
n_resamples = 25
df = (
    DATA['14.11']
    .select( # concatenate the 2 lists and take 25 bootstrap resamples
        pl.all().list.len().name.prefix('len_'),
        pl.repeat(
            col('Control').list.concat('Treatmnt'), n=n_resamples)
        .list.sample(fraction=1, with_replacement=True)
        .alias('combined'))
    .select( # label the first 9 'control' and the rest 'treatment'
        col('combined').list.head('len_Control').alias('control'),
        col('combined').list.tail('len_Treatmnt').alias('treatment'))
    .with_columns(statistic=col('treatment').list.mean() - col('control').list.mean()))

with pl.Config(tbl_rows=-1, fmt_table_cell_list_len=-1, fmt_str_lengths=100):
    print(df.with_row_index(offset=1))

shape: (25, 4)
┌───────┬──────────────────────────────────────┬──────────────────────────────────────┬────────────┐
│ index ┆ control                              ┆ treatment                            ┆ statistic  │
│ ---   ┆ ---                                  ┆ ---                                  ┆ ---        │
│ u32   ┆ list[f64]                            ┆ list[f64]                            ┆ f64        │
╞═══════╪══════════════════════════════════════╪══════════════════════════════════════╪════════════╡
│ 1     ┆ [31.0, 46.0, 104.0, 23.0, 23.0,      ┆ [46.0, 52.0, 10.0, 197.0, 94.0,      ┆ -7.619048  │
│       ┆ 141.0, 141.0, 38.0, 104.0]           ┆ 23.0, 31.0]                          ┆            │
│ 2     ┆ [40.0, 16.0, 16.0, 141.0, 99.0,      ┆ [40.0, 104.0, 52.0, 146.0, 23.0,     ┆ 13.206349  │
│       ┆ 99.0, 31.0, 99.0, 10.0]              ┆ 104.0, 52.0]                         ┆            │
│ 3     ┆ [146.0, 141.0, 38.0, 50.0, 38.0,     ┆ [31.0, 46.0, 10.0, 104.0, 1

With the bootstrap distribution of the differences obtained, the P-value would be

In [122]:
diff = DATA['14.11'].select(
        col('Treatmnt').list.mean() - col('Control').list.mean()).item()
1 - 0.01 * stats.percentileofscore(df['statistic'], diff)

0.07999999999999996

which is significant at the $\alpha = .10$ level and would support the claim that the treatment prolongs survival, whereas both the Mann-Whitney test in Ex 14.11 (p = 0.34) and the permutation test in Ex 14.35 (p = 0.14) fail to detect an effect.

`stats.bootstrap` by default uses 9999 resamples, and in this case the P-value converges to around 0.12, as shown below, in rough agreement with the permutation result. This shows that 25 resamples are too few for the P-value to be stable.

In [124]:
data = DATA['14.11'].select(col('Control').list.concat('Treatmnt')).item()
dist = stats.bootstrap((data, ), lambda x: np.mean(x[:7]) - np.mean(x[7:])).bootstrap_distribution
1 - 0.01 * stats.percentileofscore(dist, diff)

0.12491249124912485

### Ex 14.39

Show the result (14.30) which relates the bootstrap estimate of the standard error of the sample mean to the exact value. You will need to use the following result from **finite population sampling**: Consider a population of N numbers: $\{x_1, x_2, \ldots , x_N\}$ with mean and variance equal to

$$
\mu = \frac{\sum_{i=1}^N x_i}{N} \quad \text{and} \quad \sigma^2 = \frac{\sum_{i=1}^N (x_i - \mu)^2}{N}.
$$

If $\bar{X}$ is the sample mean of an SRS of size n _with replacement_ from this population, then

$$
\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n} = \frac{\sum_{i=1}^N (x_i - \mu)^2}{n\,N}.
$$

### Ex 14.40

Show the result (14.32) which states that the jackknife estimate of the standard error of the sample mean is exact. Check the result numerically by enumerating the 10 jackknife samples for the data from Exercise 14.1 and calculating the jackknife estimate of the standard error of the sample mean.
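
A sketch of the numerical check, assuming that (14.32) refers to the usual jackknife formula $\sqrt{\frac{n-1}{n}\sum_i (\hat\theta_{(i)} - \bar{\hat\theta})^2}$ and that the exact value is $s/\sqrt{n}$. Variable names are illustrative.

```python
import numpy as np

# Data from Exercise 14.1
x = np.array([37, 26, 31, 35, 32, 32, 27, 31, 34, 36], dtype=float)
n = x.size

# The n jackknife samples: delete one observation at a time
jack_samples = [np.delete(x, i) for i in range(n)]
jack_means = np.array([s.mean() for s in jack_samples])

# Jackknife SE: sqrt((n - 1)/n * sum of squared deviations of the jackknife means)
se_jack = np.sqrt((n - 1) / n * np.sum((jack_means - jack_means.mean()) ** 2))
se_exact = x.std(ddof=1) / np.sqrt(n)    # usual s / sqrt(n)

for i, s in enumerate(jack_samples, start=1):
    print(i, s, s.mean())
print(se_jack, se_exact)
```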

## Advanced Exercises

### Ex 14.41

For a random sample $X_1, X_2, \ldots, X_n$ from an $N(\mu, \sigma^2)$ distribution consider testing $H_0: \mu = 0$ vs. $H_1: \mu > 0$ using the sign test. The hypotheses tested by the sign test are $H_0: p = 1/2$ vs. $H_1: p > 1/2$, where $p = P(X_i > 0)$.

#### (a) 

Show that the power of the $\alpha$-level sign test that rejects $H_0$ when $S_+ = \#(X_i > 0) \ge b_{n,\alpha}$ is given by

$$
\sum_{i=b_{n, \alpha}}^n \binom{n}{i} p^i (1-p)^{n-i}
$$

where $p = \Phi(\mu / \sigma)$.

#### (b)

The power of the corresponding normal theory z-test for testing $H_0: \mu = 0$ vs. $H_1: \mu > 0$ (assuming $\sigma$ is known) is given by

$$
\Phi\left(\frac{\mu \sqrt{n}}{\sigma} - z_\alpha\right)
$$

using the results from Chapter 7. Calculate the powers of the sign test and z-test for $n=10$, $\alpha = .0547$ (in which case $b_{n, \alpha} = 8$ and $z_\alpha = 1.601$) for $\mu / \sigma$ = 0.5, 1.0, and 1.5. (*Hint*: First calculate $p$ for given $\mu / \sigma $. Then calculate the power of the sign test using the binomial distribution formula.)
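
The two power formulas can be evaluated as sketched below, following the hint: first $p = \Phi(\mu/\sigma)$, then the binomial tail. The constants are those given in the exercise; variable names are illustrative.

```python
import numpy as np
from scipy import stats

n, b, z_alpha = 10, 8, 1.601            # values given in the exercise
effect = np.array([0.5, 1.0, 1.5])      # mu / sigma

p = stats.norm.cdf(effect)                      # p = P(X_i > 0) = Phi(mu / sigma)
power_sign = stats.binom.sf(b - 1, n, p)        # P(S+ >= b), S+ ~ Bin(n, p)
power_z = stats.norm.cdf(effect * np.sqrt(n) - z_alpha)

for e, ps, pz in zip(effect, power_sign, power_z):
    print(f'mu/sigma = {e}: sign test {ps:.4f}, z-test {pz:.4f}')
```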

#### (c) 

Why is the z-test power higher than the sign test power for all values of $\mu / \sigma$? Do you think that the z-test will be more powerful than the sign test even if the distribution of the $X_i$'s is not normal? Why or why not?

### Ex 14.42

The sign test is a test on the median or the 50th percentile of a distribution; it is equivalent to a binomial test of $H_0: p = 1/2$, where $p$ is the probability that an observation from the distribution exceeds the median value postulated under $H_0$.

#### (a)

Generalize the sign test for testing the $q$th quantile or the 100$q$th percentile (denoted by $\tilde{\mu_q}$) of a distribution. State the test procedure. (*Hint*: Define $p = P(X > \tilde{\mu_q}) = 1 - q$.)

#### (b) 

Let $\tilde{\mu}_{.75}$ denote the 75th percentile of the distribution of household incomes of a population. Test $H_0: \tilde{\mu}_{.75}$ = 60K vs. $H_1: \tilde{\mu}_{.75}$ > 60K at $\alpha = .10$ based on the following incomes (expressed in K\$'s) of 15 sample households:

||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
75 | 45 | 50 | 62 | 40 | 35 | 80 | 68 | 40 | 38 | 43 | 70 | 55 | 65 | 70
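
One way to carry out the test, following the hint in part (a): under $H_0$ the number of incomes above 60K is $\mathrm{Bin}(15, 1 - q)$ with $q = 0.75$, and large counts favor $H_1$. Variable names are illustrative.

```python
import numpy as np
from scipy import stats

incomes = np.array([75, 45, 50, 62, 40, 35, 80, 68, 40, 38, 43, 70, 55, 65, 70])
q, quantile_0 = 0.75, 60      # quantile level and its hypothesized value (in K$)

# Under H0 each income exceeds the 75th percentile with probability 1 - q
s_plus = int(np.sum(incomes > quantile_0))
p_value = stats.binom.sf(s_plus - 1, incomes.size, 1 - q)   # P(S+ >= s_plus)
reject_h0 = bool(p_value < 0.10)
print(s_plus, p_value, reject_h0)
```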

### Ex 14.43

Derive the formulas (14.4) for the mean and variance of the Wilcoxon signed rank statistic under $H_0$ by carrying out the steps below.

#### (a)

The $i$th rank can correspond to a positive sign or a negative sign. Let $Z_i$ be an indicator variable with $Z_i = 1$ if the $i$th rank corresponds to a positive sign and $Z_i = 0$ if the $i$th rank corresponds to a negative sign. Show that $W_+ = \sum_{i=1}^n i\,Z_i$.

✍️ Because $W_+$ is the sum of the ranks that carry a positive sign, and $i\,Z_i$ equals $i$ when rank $i$ is positive and 0 otherwise,
$$
W_+ = \sum_{i=1}^n i\,Z_i\,.
$$

#### (b)

Note that the $Z_i$ are independent and identically distributed Bernoulli r.v.'s with success probability $p = P(X_i > \tilde{\mu_0})$. Hence show that

$$
\mathrm{E}(W_+) = \sum_{i=1}^n i\,p = \frac{p\,n(1+n)}{2}
$$

and

$$
\mathrm{Var}(W_+) = \sum_{i=1}^n i^2 p\,(1-p) = \frac{p\,(1-p)\,n(n+1)(2n+1)}{6}\,.
$$

Now (14.4) follows by substituting $p=1/2$ in the above formulas.

✍️
$$
\begin{align*}
\mathrm{E}(W_+) &= \mathrm{E}\left[\sum_{i=1}^n i\,Z_i\right] \\
&= \sum_{i=1}^n i\,\mathrm{E}(Z_i) \\
&= p \sum_{i=1}^n i \\
&= p\,n(1+n)/2\,.
\end{align*}
$$

Similarly, since the $Z_i$ are independent,

$$
\begin{align*}
\mathrm{Var}(W_+) &= \mathrm{Var}\left[\sum_{i=1}^n i\,Z_i\right] \\
&= \sum_{i=1}^n i^2\,\mathrm{Var}(Z_i) \\
&= p\,(1-p) \sum_{i=1}^n i^2 \\
&= p\,(1-p)\,n(n+1)(2n+1)/6\,.
\end{align*}
$$