# STAT 345: Nonparametric Statistics

## Lesson 05.2: The Kruskal-Wallis Test

**Reading: Conover Section 5.2**

*Prof. John T. Whelan*

Tuesday 25 February 2025

These lecture slides are in a computational notebook.  You have access to them through http://vmware.rit.edu/

Flat HTML and slideshow versions are also in MyCourses.

The notebook can run Python commands (other notebooks can use R or Julia; "Ju-Pyt-R").  Think: computational data analysis, not "coding".

Standard commands to activate inline interface and import libraries:

In [1]:
%matplotlib inline

In [2]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8.0,5.0)
plt.rcParams['font.size'] = 14

Recall Wilcoxon rank-sum test for two independent samples $\{x_{1j}|j=1,\ldots,n_1\}$ & $\{x_{2j}|j=1,\ldots,n_2\}$.  

Test whether they could be from the same distribution by ranking all $N=n_1+n_2$ data points.  If the sum of the $\{R_{1j}\}$ (which we've called $W$) is too high or too low, we reject $H_0$ (in a two-sided test).

Since $\sum_{j=1}^{n_2} R_{2j}=\frac{N(N+1)}{2}-\sum_{j=1}^{n_1}R_{1j}$, one rank-sum carries all the information.  The two-sided test asks if $\sum_{j=1}^{n_1}R_{1j}$ is too far from its expected value of $n_1\frac{N+1}{2}$ (which would mean $\sum_{j=1}^{n_2}R_{2j}$ is too far from $n_2\frac{N+1}{2}$).
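A quick numerical check of this bookkeeping may help. The sketch below uses two small made-up samples (`x1` and `x2` are illustrative values, not data from the lecture) and confirms that the two rank sums add up to $N(N+1)/2$, so that their deviations from the null expectations are equal and opposite.

```python
import numpy as np
from scipy import stats

# Two illustrative samples (hypothetical values, not from the lecture)
x1 = np.array([3.1, 7.4, 1.2])
x2 = np.array([5.6, 2.8, 9.9, 4.3])
n1, n2 = len(x1), len(x2)
N_two = n1 + n2

# Rank the pooled data, then split the ranks back into the two samples
ranks = stats.rankdata(np.concatenate([x1, x2]))
W1, W2 = ranks[:n1].sum(), ranks[n1:].sum()

# The two rank sums always add up to N(N+1)/2
total = N_two * (N_two + 1) / 2

# Deviations from the null expectations n_i (N+1)/2
dev1 = W1 - n1 * (N_two + 1) / 2
dev2 = W2 - n2 * (N_two + 1) / 2
print(W1, W2, total, dev1, dev2)
```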

Now generalize: suppose instead of two, we have $k$ samples, and let the $i$th sample $\{x_{ij}\}$ have size $n_i$.

In [3]:
x_i_j = [np.array([ 14.97,   5.80,  25.03,   5.50 ]),
       np.array([  5.83,  13.96,  21.96]),
       np.array([ 17.89,  23.03,  61.09,   18.62,  55.51])]
n_i = np.array([len(xi_j) for xi_j in x_i_j]); n_i

array([4, 3, 5])

In [4]:
k = len(n_i); k

3

Note that, since the sample sizes $\{n_i\}$ can all be different, we can't just store this in Python as an array with two indices.  Instead `x_i_j` is a list of $k$ arrays, which in general have different sizes.

The total number of data points in all of the samples is
$N=\sum_{i=1}^k n_i$.

In [5]:
N = np.sum(n_i); N

12

We're trying to assess the null hypothesis that the samples were all drawn from the same distribution--or more specifically that they were drawn in a way that $P(\color{royalblue}{X_{ij}}\mathbin{>}\color{royalblue}{X_{i'j'}}) = P(\color{royalblue}{X_{ij}}\mathbin{<}\color{royalblue}{X_{i'j'}})$--against the alternative that some of the samples (not specified which) are drawn from distributions that have different location parameters, i.e., that $P(\color{royalblue}{X_{ij}}\mathbin{>}\color{royalblue}{X_{i'j'}}) \ne P(\color{royalblue}{X_{ij}}\mathbin{<}\color{royalblue}{X_{i'j'}})$ for at least one pair $(i,i')$.  We wish to do this in a rank-based way which does not rely on a particular assumed sampling distribution for the samples.

We combine ("concatenate") all $N$ values into a single list and rank them, and let the
rank of $x_{ij}$ be $R_{ij}$.

In [6]:
x_r = np.concatenate(x_i_j); x_r

array([14.97,  5.8 , 25.03,  5.5 ,  5.83, 13.96, 21.96, 17.89, 23.03,
       61.09, 18.62, 55.51])

In [7]:
R_r = stats.rankdata(x_r); R_r 

array([ 5.,  2., 10.,  1.,  3.,  4.,  8.,  6.,  9., 12.,  7., 11.])

In [8]:
# A little trick to make a list of labels to see which of the k samples each value is from:
i_r = np.concatenate([(i,)*n_i[i] for i in range(k)]); i_r

array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2])

In [9]:
# This lets us organize Rij into a list of arrays
R_i_j = [R_r[i_r==i] for i in range(k)]; R_i_j

[array([ 5.,  2., 10.,  1.]),
 array([3., 4., 8.]),
 array([ 6.,  9., 12.,  7., 11.])]

If we write the sum of the ranks in the
$i$th sample as $R_{i}=\sum_{j=1}^{n_i} R_{ij}=n_{i}\overline{R}_i$, we
have $k$ statistics $\{R_{i}\}$.

In [10]:
R_i = np.array([np.sum(Ri_j) for Ri_j in R_i_j]); R_i

array([18., 15., 45.])

However, they can be described by only
$k-1$ quantities, since they obey the constraint
$\sum_{i=1}^k R_{i}=\frac{N(N+1)}{2}$.

In [11]:
np.sum(R_i), (N*(N+1)//2)

(78.0, 78)

For $k=2$, we
already saw that the sum of the ranks in the second group, $R_2$, which
we called $W_y$, was determined by the sum of the ranks in the first
group, $R_1$, which we called $W_x$, so there was only $k-1=1$
independent statistic.

If we consider the $k$ statistics $\{{\color{royalblue}{R_i}}\}$, they
have expectation values, assuming the null hypothesis that the samples are equally likely to contain any combination of ranks, of
$$E({\color{royalblue}{R_i}}) = n_i\frac{N+1}{2}=n_i \overline{R}$$
If the alternative hypothesis is true, some of the samples will tend to contain higher or lower ranks than average, and then some of the differences $\{\color{royalblue}{R_i}-n_i\overline{R}\}$ will be "too far" from zero:

In [12]:
Rbar = 0.5*(N+1); (Rbar,np.mean(R_r))

(6.5, 6.5)

In [13]:
R_i - n_i * Rbar

array([-8. , -4.5, 12.5])
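The null expectation $E(R_i)=n_i\frac{N+1}{2}$ can be checked by simulation. The following is a minimal sketch, assuming that under $H_0$ every assignment of the ranks $1,\ldots,N$ to the samples is equally likely; the seed, the number of simulations and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(345)   # seed chosen arbitrarily
sizes = np.array([4, 3, 5])        # the sample sizes n_i from the example
N_tot = sizes.sum()
splits = np.cumsum(sizes)[:-1]     # where to cut a shuffled list of ranks

# Shuffle the ranks 1..N and record the rank sum of each sample
n_sim = 20000
sums = np.empty((n_sim, len(sizes)))
for s in range(n_sim):
    shuffled = rng.permutation(np.arange(1, N_tot + 1))
    sums[s] = [part.sum() for part in np.split(shuffled, splits)]

# Simulated averages should be close to n_i (N+1)/2
expected = sizes * (N_tot + 1) / 2
print(sums.mean(axis=0), expected)
```

The observed differences from cell 13 can then be judged against the spread of `sums - expected` across the simulations.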

In this sort of situation, where a multidimensional measurement could be off in any "direction" from its expected value, the standard approach is to convert this to a normalized "distance", whose value under the null hypothesis is described by a chi-squared distribution.  Not only the null distribution but the construction of the distance itself relies upon the normal approximation.

## Reminder of $\chi^2$ Construction

Recall: if ${\color{royalblue}{Z_i}}$ are $n$ independent standard normal random variables, then
${\color{royalblue}{W}} = \sum_{i=1}^n ({\color{royalblue}{Z_i}})^2$
is a chi-squared random variable with $n$ degrees of freedom,
${\color{royalblue}{W}}\sim\chi^2(n)$.

So if
$\{{\color{royalblue}{X_i}}\}$ are independent normal random variables
with $E({\color{royalblue}{X_i}})=\mu_i$ and
$\operatorname{Var}({\color{royalblue}{X_i}})=\sigma_i^2$, then
$${\color{royalblue}{W}} = \sum_{i=1}^n \left(\frac{{\color{royalblue}{X_i}}-\mu_i}{\sigma_i}\right)^2\sim\chi^2(n)$$
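As a sanity check on this construction, one can simulate it. The sketch below draws independent normals with made-up means and standard deviations (`mu` and `sigma` are hypothetical), standardizes them, and compares the sum of squares with what a $\chi^2(n)$ random variable should give: mean $n$, variance $2n$, and the right tail probability.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2025)            # arbitrary seed
mu = np.array([1.0, -2.0, 0.5, 10.0])        # hypothetical means
sigma = np.array([0.5, 1.0, 3.0, 2.0])       # hypothetical standard deviations
n_dim = len(mu)

# Each row is one draw of n independent normals X_i ~ N(mu_i, sigma_i^2)
X = rng.normal(loc=mu, scale=sigma, size=(50000, n_dim))
W = np.sum(((X - mu) / sigma) ** 2, axis=1)

# A chi-squared variable with n degrees of freedom has mean n and variance 2n
print(W.mean(), W.var(), n_dim, 2 * n_dim)

# Empirical tail probability versus the chi2(n) survival function
tail_sim = np.mean(W > 9.49)
tail_chi2 = stats.chi2(df=n_dim).sf(9.49)
print(tail_sim, tail_chi2)
```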

Analogous situation where the rvs are not independent: suppose $\{{\color{royalblue}{X_i}}\}$ is a random sample from $N(\mu,\sigma^2)$.
$${\color{royalblue}{Y_i}} = {\color{royalblue}{X_i}} - {\color{royalblue}{{{\overline{X}}}}}
  = {\color{royalblue}{X_i}} - \frac{1}{n}\sum_{k=1}^n {\color{royalblue}{X_k}}$$
are $n$ correlated rvs related by one constraint $\sum_{i=1}^n{\color{royalblue}{Y_i}}=0$.

One
of the results of Student’s theorem, which underpins the confidence
intervals for mean and variance when both are unknown, is that
$$\sum_{i=1}^n \left(\frac{{\color{royalblue}{Y_i}}}{\sigma}\right)^2
  = \sum_{i=1}^n \left(\frac{{\color{royalblue}{X_i}} - {\color{royalblue}{{{\overline{X}}}}}}{\sigma}\right)^2
  \sim \chi^2(n-1)$$

For a derivation of this result, see, e.g., Section 6.3 of <https://ccrg.rit.edu/~whelan/courses/2015_3fa_STAT_405/notes03.pdf>.

The one constraint among the $n$ rvs
$\{{\color{royalblue}{Y_i}}\}$ causes the # of degrees of freedom
to be $n-1$ rather than $n$.
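The loss of one degree of freedom can also be seen numerically. This is a rough simulation sketch with made-up parameters (`n_samp`, `mu0` and `sigma0` are hypothetical): it checks the constraint $\sum_i Y_i=0$ and shows that the scaled sum of squares has mean and variance near $n-1$ and $2(n-1)$, rather than $n$ and $2n$.

```python
import numpy as np

rng = np.random.default_rng(405)             # arbitrary seed
n_samp, mu0, sigma0 = 5, 3.0, 2.0            # hypothetical sample size and parameters

X = rng.normal(mu0, sigma0, size=(50000, n_samp))
Y = X - X.mean(axis=1, keepdims=True)        # deviations from each sample mean

# The constraint: each row of Y sums to zero (up to rounding error)
max_row_sum = np.abs(Y.sum(axis=1)).max()

# Sum of squared scaled deviations, one value per simulated sample
W = np.sum((Y / sigma0) ** 2, axis=1)
print(max_row_sum, W.mean(), W.var())
```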

## Construction of Kruskal-Wallis Statistic

- We have $k-1$ independent quantities among the $k$ rank-sums $\{\color{royalblue}{R_i}=\sum_{j=1}^{n_i}\color{royalblue}{R_{ij}}\}$, to be combined into a single $\chi^2$-like statistic.

- The means are $E({\color{royalblue}{R_i}}) = n_i\frac{N+1}{2}=n_i \overline{R}$

- If there are no ties, they have variances and covariances
$$\operatorname{Var}({\color{royalblue}{R_i}}) = n_i(N-n_i)\frac{N+1}{12}
\qquad\operatorname{Cov}({\color{royalblue}{R_i}},{\color{royalblue}{R_{\ell}}}) = -n_in_{\ell}\frac{N+1}{12}
  \qquad \hbox{if } i\ne \ell$$

- The variances and covariances can be
summarized into a variance-covariance matrix with elements$$
  \operatorname{Cov}({\color{royalblue}{R_i}},{\color{royalblue}{R_{\ell}}}) = (N\delta_{i\ell}n_i-n_in_{\ell})\frac{N+1}{12}
  \qquad  i=1,\ldots k; \quad\ell=1,\ldots k 
  $$
where $$\delta_{i\ell}
  =
  \begin{cases}
    1 & \hbox{if } i=\ell \\
    0 & \hbox{if } i\ne \ell
  \end{cases}$$ is the Kronecker delta (the elements of the identity
matrix).
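For small $N$ the variance-covariance formula can be verified exactly by brute force. The sketch below assumes a hypothetical set of sample sizes `(2, 1, 3)`, chosen only so that all $6!=720$ arrangements of the ranks can be enumerated, and compares the exact moments of the rank sums with the expressions above.

```python
import itertools
import numpy as np

sizes = np.array([2, 1, 3])                        # small hypothetical sizes, N = 6
N_small = sizes.sum()
labels = np.repeat(np.arange(len(sizes)), sizes)   # sample label of each slot

# Rank sums for every equally likely arrangement of the ranks 1..N
all_sums = np.array([
    np.bincount(labels, weights=np.asarray(perm, dtype=float), minlength=len(sizes))
    for perm in itertools.permutations(range(1, N_small + 1))
])

# Exact (population) mean vector and covariance matrix under H0
mean_exact = all_sums.mean(axis=0)
cov_exact = np.cov(all_sums, rowvar=False, bias=True)

# Formula from the text: (N delta_il n_i - n_i n_l)(N+1)/12
cov_formula = (N_small * np.diag(sizes) - np.outer(sizes, sizes)) * (N_small + 1) / 12
print(cov_exact)
print(cov_formula)
```

Each row of the matrix sums to zero, which is the covariance-level statement of the constraint $\sum_i R_i=\frac{N(N+1)}{2}$.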

We could turn any one of them into a standard normal by
shifting and scaling, but instead we combine them into a single $\chi^2$-like statistic
$${\color{royalblue}{T}} = \frac{12}{N(N+1)}
  \sum_{i=1}^k \frac{1}{n_i}\left({\color{royalblue}{R_i}}-n_i\frac{N+1}{2}\right)^2
  \sim \chi^2(k-1)$$

The key step in the derivation is to write the variance-covariance
matrix as $\frac{N(N+1)}{12}\sqrt{n_i n_{\ell}}
\left(\delta_{i\ell}-\frac{\sqrt{n_i n_{\ell}}}{N}\right)$ and
recognize the expression in parentheses as a projection operator
with one zero eigenvalue and $k-1$ unit eigenvalues.
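The projection-operator claim is easy to check numerically. In the sketch below, the matrix `P` has elements $\delta_{i\ell}-\sqrt{n_in_\ell}/N$, i.e. $P=I-vv^T$ with the unit vector $v_i=\sqrt{n_i/N}$, and the factorization being checked is $\operatorname{Cov}(R_i,R_\ell)=\frac{N(N+1)}{12}\sqrt{n_in_\ell}\,P_{i\ell}$; the variable names are illustrative.

```python
import numpy as np

sizes = np.array([4, 3, 5])        # the n_i from the example
N_tot = sizes.sum()

# P = I - v v^T with unit vector v_i = sqrt(n_i / N)
v = np.sqrt(sizes / N_tot)
P = np.eye(len(sizes)) - np.outer(v, v)

# Covariance matrix as N(N+1)/12 * D^{1/2} P D^{1/2}, with D = diag(n_i)
root_D = np.diag(np.sqrt(sizes))
cov_from_P = N_tot * (N_tot + 1) / 12 * root_D @ P @ root_D
cov_formula = (N_tot * np.diag(sizes) - np.outer(sizes, sizes)) * (N_tot + 1) / 12

# A projection operator satisfies P @ P = P and has eigenvalues 0 or 1
eigvals = np.linalg.eigvalsh(P)    # returned in ascending order
is_projection = np.allclose(P @ P, P)
print(eigvals, is_projection, np.allclose(cov_from_P, cov_formula))
```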

In [14]:
T = 12/(N*(N+1)) * np.sum((R_i-n_i*Rbar)**2/n_i); T

4.153846153846154

We get the $p$-value by using the $\chi^2$ distribution.

In [15]:
stats.chi2(df=k-1).sf(T)

0.12531520484413722
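As a cross-check, SciPy provides this test as `scipy.stats.kruskal`, which takes the samples as separate arguments and returns the statistic and the $\chi^2(k-1)$ $p$-value. The sketch below re-enters the example data under the name `samples` so that it is self-contained.

```python
import numpy as np
from scipy import stats

# The three samples from the example above
samples = [np.array([14.97, 5.80, 25.03, 5.50]),
           np.array([5.83, 13.96, 21.96]),
           np.array([17.89, 23.03, 61.09, 18.62, 55.51])]

# Built-in Kruskal-Wallis test; compare with T and the p-value computed by hand
result = stats.kruskal(*samples)
print(result.statistic, result.pvalue)
```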

If there are ties, variance is reduced. Various equivalent forms;
the simplest is probably
$${\color{royalblue}{T}} = (N-1)
  \frac{\sum_{i=1}^k\frac{1}{n_i}\left({\color{royalblue}{R_i}}-n_i\overline{R}\right)^2}
  {\sum_{i=1}^k \sum_{j=1}^{n_i}
    \left({\color{royalblue}{R_{ij}}}-\overline{R}\right)^2}
  \sim \chi^2(k-1)$$

In [16]:
(N-1) * np.sum((R_i-n_i*Rbar)**2/n_i) / np.sum((R_r-Rbar)**2)

4.153846153846154

Of course, this also works if there are no ties; in that case
$\sum_{i=1}^k \sum_{j=1}^{n_i}
    \left({\color{royalblue}{R_{ij}}}-\overline{R}\right)^2 = \sum_{r=1}^N
    \left(R_r-\overline{R}\right)^2$ is just $\frac{(N+1)N(N-1)}{12}$

In [17]:
np.sum((R_r-Rbar)**2), (N+1)*N*(N-1)/12

(143.0, 143.0)
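To see the tie-aware form at work, here is a minimal sketch that wraps it in a function and applies it to made-up data containing ties. The function name `kruskal_wallis_T` and the data in `tied` are hypothetical; the comparison with `scipy.stats.kruskal`, which applies its own tie correction, is only meant as a consistency check.

```python
import numpy as np
from scipy import stats

def kruskal_wallis_T(samples):
    """Kruskal-Wallis statistic in the (N-1) * ratio form, valid with ties."""
    sizes = np.array([len(s) for s in samples])
    N_tot = sizes.sum()
    ranks = stats.rankdata(np.concatenate(samples))   # ties get average ranks
    rank_bar = (N_tot + 1) / 2
    labels = np.repeat(np.arange(len(samples)), sizes)
    rank_sums = np.bincount(labels, weights=ranks)
    numerator = np.sum((rank_sums - sizes * rank_bar) ** 2 / sizes)
    denominator = np.sum((ranks - rank_bar) ** 2)
    return (N_tot - 1) * numerator / denominator

# Hypothetical data with several tied values
tied = [np.array([1.0, 2.0, 2.0, 3.0]),
        np.array([2.0, 4.0, 5.0]),
        np.array([3.0, 5.0, 6.0, 6.0, 7.0])]
T_tied = kruskal_wallis_T(tied)
T_scipy = stats.kruskal(*tied).statistic
print(T_tied, T_scipy)
```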

<span id="fn3">[<sup>3</sup>](#fm2)By the way, if you're familiar with the Analysis of Variance (ANOVA) this expression may look somewhat familiar; see for example equation (1.6) of <https://ccrg.rit.edu/~whelan/courses/2019_1sp_MATH_252/notes10.pdf>.  In fact, the Kruskal-Wallis statistic is sometimes referred to as "ANOVA with ranks".  One key difference, though, is that the expression $\sum_{i=1}^k \sum_{j=1}^{n_i}
    \left({\color{royalblue}{R_{ij}}}-\frac{N+1}{2}\right)^2$, which is known as the Total Sum of Squares (SST), is equal to $\frac{(N+1)N(N-1)}{12}$ (if there are no ties, but in any event determined by the available ranks rather than their arrangement in the samples), while in ANOVA it is a linear combination of two $\chi^2$ random variables (SST=SSTr+SSE).</span>
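The "ANOVA with ranks" connection can be made concrete. Since SSTr $= T\cdot\mathrm{SST}/(N-1)$ and SSE $=$ SST $-$ SSTr, the one-way ANOVA $F$ statistic computed on the ranks should equal $\frac{T/(k-1)}{(N-1-T)/(N-k)}$. The sketch below checks this with `scipy.stats.f_oneway` applied to the ranks of the example data; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# The three samples from the example above
samples = [np.array([14.97, 5.80, 25.03, 5.50]),
           np.array([5.83, 13.96, 21.96]),
           np.array([17.89, 23.03, 61.09, 18.62, 55.51])]
sizes = np.array([len(s) for s in samples])
N_tot, k_grp = sizes.sum(), len(samples)

# Replace the data by their pooled ranks, keeping the grouping
ranks = stats.rankdata(np.concatenate(samples))
rank_groups = np.split(ranks, np.cumsum(sizes)[:-1])

# Ordinary one-way ANOVA F statistic, but computed on the ranks
F_ranks = stats.f_oneway(*rank_groups).statistic

# The same quantity rebuilt from the Kruskal-Wallis statistic
T_kw = stats.kruskal(*samples).statistic
F_from_T = (T_kw / (k_grp - 1)) / ((N_tot - 1 - T_kw) / (N_tot - k_grp))
print(F_ranks, F_from_T)
```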

## Multiple Comparisons

- The Kruskal-Wallis test tells us whether some of the samples have differing location parameters, but it doesn't tell us which pairs of samples differ significantly, nor in which direction.  (It's inherently like a two-sided hypothesis test in this regard.)  

- Conover describes a procedure for identifying which differences are significant.  (This plays a role something like Tukey's multiple comparison test in the ANOVA case, except that it's more fundamentally connected to the original statistic.)

- The test has become known as the Conover-Iman test, and while it's probably best preserved in your textbook itself, the original reference is Conover and Iman, "Multiple-comparisons procedures. Informal report," Los Alamos Informal Report LA-7677-MS, https://doi.org/10.2172/6057803

Anyway, the test statistic, appropriate if the original Kruskal-Wallis statistic ${\color{royalblue}{T}}$ is significant at the $\alpha$ level, is

$$
\frac{
  {\color{royalblue}{R_i}}/n_i-{\color{royalblue}{R_j}}/n_j
}
{
  \sqrt{
    {\color{royalblue}{S^2}}\frac{N-1-{\color{royalblue}{T}}}{N-k}
    \left(\frac{1}{n_i}+\frac{1}{n_j}\right)
  }
}
$$

which is to be compared to the $\alpha/2$ tail of the Student-$t$ distribution with $N-k$ degrees of freedom.  Here
$$
{\color{royalblue}{S^2}} = \frac{1}{N-1}\sum_{i=1}^k \sum_{j=1}^{n_i}
    \left({\color{royalblue}{R_{ij}}}-\overline{R}\right)^2
$$
is the normalizing quantity in the denominator of the Kruskal-Wallis statistic, which is equal to $\frac{N(N+1)}{12}$ if there are no ties.

Returning to our example, recall we got $p=0.125$ from the Kruskal-Wallis test, so there's not really any evidence of the samples differing.

In [18]:
stats.chi2(df=k-1).sf(T)

0.12531520484413722

But suppose the data had been a little different, and indicated a more significant mismatch (keeping the same sizes so we don't have to recompute $n_i$, $k$, etc.)

In [19]:
newx_i_j = [
    np.array([ 14.97,   5.80,  15.03,   5.50 ]),
    np.array([  5.83,  13.96,  21.96 ]),
    np.array([ 27.89,  23.03,  61.09,   18.62,  55.51 ])
]
newx_r = np.concatenate(newx_i_j)
newR_r = stats.rankdata(newx_r)
newR_i_j = [newR_r[i_r==i] for i in range(k)]; newR_i_j

[array([5., 2., 6., 1.]),
 array([3., 4., 8.]),
 array([10.,  9., 12.,  7., 11.])]

In [20]:
newR_i = np.array([np.sum(Ri_j) for Ri_j in newR_i_j]); newR_i

array([14., 15., 49.])

In [21]:
newT = (N-1) * np.sum((newR_i-n_i*Rbar)**2/n_i) / np.sum((newR_r-Rbar)**2); newT

7.476923076923077

In [22]:
stats.chi2(df=k-1).sf(newT)

0.023790676031139445

Now it's significant at the 5% level, so we can ask which differences are significant.

In [23]:
newSsq = np.sum((newR_r-Rbar)**2)/(N-1); newSsq, N*(N+1)/12

(13.0, 13.0)

In [24]:
newRbar_i = newR_i/n_i; newRbar_i, Rbar

(array([3.5, 5. , 9.8]), 6.5)

In [25]:
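# Caution: this uses T (original data), matching the output below; the formula above calls for newT
newT_ii = (newRbar_i[:,None]-newRbar_i[None,:])/np.sqrt(newSsq*(N-1-T)/(N-k)*(1/n_i[:,None]+1/n_i[None,:])); newT_ii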

array([[ 0.        , -0.62453835, -2.98648642],
       [ 0.62453835,  0.        , -2.0901051 ],
       [ 2.98648642,  2.0901051 ,  0.        ]])

In [26]:
2*stats.t(df=N-k).sf(np.abs(newT_ii)), stats.chi2(df=k-1).sf(newT)

(array([[1.        , 0.54777771, 0.01528782],
        [0.54777771, 1.        , 0.06617206],
        [0.01528782, 0.06617206, 1.        ]]),
 0.023790676031139445)
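The pairwise comparison can be packaged so that every ingredient is recomputed from the samples passed in. The following is a sketch under that assumption: the function name `conover_iman` and its return values are hypothetical, not part of SciPy. Inside it, the Kruskal-Wallis statistic is computed from the supplied samples themselves, so the factor $N-1-T$ always refers to the data being compared (for the modified data, that is `newT`).

```python
import numpy as np
from scipy import stats

def conover_iman(samples):
    """Pairwise t-like statistics and two-sided p-values for all sample pairs."""
    sizes = np.array([len(s) for s in samples])
    N_tot, k_grp = sizes.sum(), len(samples)
    ranks = stats.rankdata(np.concatenate(samples))
    rank_bar = (N_tot + 1) / 2
    labels = np.repeat(np.arange(k_grp), sizes)
    mean_ranks = np.bincount(labels, weights=ranks) / sizes
    # S^2 = sum of squared rank deviations / (N-1); equals N(N+1)/12 without ties
    S2 = np.sum((ranks - rank_bar) ** 2) / (N_tot - 1)
    # Kruskal-Wallis statistic of these same samples
    T_kw = np.sum(sizes * (mean_ranks - rank_bar) ** 2) / S2
    scale = S2 * (N_tot - 1 - T_kw) / (N_tot - k_grp)
    inv = 1 / sizes
    t_stat = (mean_ranks[:, None] - mean_ranks[None, :]) / np.sqrt(
        scale * (inv[:, None] + inv[None, :]))
    p_two_sided = 2 * stats.t(df=N_tot - k_grp).sf(np.abs(t_stat))
    return T_kw, t_stat, p_two_sided

# The modified data from cell 19
new_samples = [np.array([14.97, 5.80, 15.03, 5.50]),
               np.array([5.83, 13.96, 21.96]),
               np.array([27.89, 23.03, 61.09, 18.62, 55.51])]
T_kw, t_stat, p_vals = conover_iman(new_samples)
print(T_kw)
print(t_stat)
print(p_vals)
```

Entry `[i, j]` of each matrix compares sample `i` with sample `j`; the matrix of statistics is antisymmetric and the diagonal $p$-values are 1 by construction.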

You will examine this test more carefully on the homework.