# 4.7 Comparison of distributions

When given two samples, we can ask whether or not they are drawn from the same distribution, or whether the two samples imply a difference in the measured quantity. Similarly, we can ask whether a sample is consistent with being drawn from some known distribution.

We should clarify what we mean by "the same distribution". We can describe distributions by their shape, location, and scale. Once we can assume the shape, or equivalently once we know the distribution from which the sample is drawn (e.g., a Gaussian distribution), the problem simplifies: we now only need to consider two parameters, location and scale.

Depending on the data type, and on what assumptions we can make about the underlying distributions, we can perform specific statistical tests. These tests use the data to compute an appropriate statistic, and then compare the data-based value with its expected distribution. The expected distribution is evaluated by assuming that the null hypothesis is true, as discussed in the preceding section. When this expected distribution implies that the data-based value is unlikely to have arisen from it by chance (i.e., a small $p$ value), the null hypothesis is rejected at some threshold probability $\alpha$, typically 0.05 or 0.01 ($p < \alpha$). For example, if the null hypothesis is that our datum came from the $\mathcal{N}(0,1)$ distribution, then $x=3$ corresponds to $p=0.003$. Note that $p > \alpha$ does not mean the null hypothesis is proven correct.
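The quoted value for $x=3$ can be checked with a short sketch. It assumes a two-sided test (both tails of $\mathcal{N}(0,1)$ count as "at least this extreme"), which the text does not state explicitly; the function name `two_sided_p_value` is illustrative only.

```python
import math


def two_sided_p_value(x):
    """Two-sided p value of a datum x under the N(0, 1) null hypothesis.

    Assumption of this sketch: both tails are counted, so
    p = P(|X| >= |x|) = erfc(|x| / sqrt(2)).
    """
    return math.erfc(abs(x) / math.sqrt(2.0))


p = two_sided_p_value(3.0)
alpha = 0.01
print(f"p = {p:.4f}")                    # about 0.003, as quoted in the text
print("reject null hypothesis:", p < alpha)
```

With a one-sided test the p value would be half as large, so the choice of tails should be made before looking at the data.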

## 4.7.1 Regression towards the mean

We'll begin this section by clarifying a simple statistical selection effect that is often ignored and leads to invalid conclusions.

If two instances of a data set $\{x_i\}$ are drawn from some distribution, the expected mean difference between the matched values (i.e., the $i$th values from both data sets) is zero. However, if we use one data set to select a subsample for comparison, the mean difference can become biased. For example, if we select the lowest quartile from the first data set, then the mean difference between the second and first data sets will be larger than zero.

This effect is known as regression toward the mean: if a random variable is extreme on its first measurement, it will tend to be closer to the population mean on a second measurement.

- Example: In an astronomical context, a common related tale states that the weather conditions observed at a telescope site today are, typically, not as good as those that would have been inferred from the prior measurements made during the site selection process.

Thus, when selecting a subsample for further study, or a control sample for comparison analysis, one has to worry about various statistical selection effects.
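The lowest-quartile example can be illustrated with a small simulation. This is a sketch under stated assumptions: both data sets are independent draws from $\mathcal{N}(0,1)$, and the sample size and seed are arbitrary choices made here, not values from the text.

```python
import random


def regression_toward_mean_demo(n=10000, seed=42):
    """Return (mean difference for all pairs, mean difference for the
    pairs whose FIRST measurement falls in the lowest quartile)."""
    rng = random.Random(seed)
    first = [rng.gauss(0.0, 1.0) for _ in range(n)]
    second = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Unselected sample: matched differences average to about zero.
    mean_diff_all = sum(b - a for a, b in zip(first, second)) / n

    # Select on the first data set only: indices of its lowest quartile.
    order = sorted(range(n), key=lambda i: first[i])
    lowest = order[: n // 4]
    mean_diff_low = sum(second[i] - first[i] for i in lowest) / len(lowest)
    return mean_diff_all, mean_diff_low


mean_diff_all, mean_diff_low = regression_toward_mean_demo()
print(f"all pairs:       {mean_diff_all:+.3f}")
print(f"lowest quartile: {mean_diff_low:+.3f}")
```

The selected subsample shows a clearly positive mean difference even though nothing changed between the two draws; the bias comes entirely from selecting on one of the two measurements.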

## 4.7.2 Nonparametric methods for comparing distributions

When the distributions are not known, the tests are called nonparametric, or distribution-free, tests. The most popular nonparametric test is the Kolmogorov-Smirnov (K-S) test, which compares the cumulative distribution function, $F(x)$, for two samples, $\{x1_i\}$, $i = 1,...,N_1$, and $\{x2_i\}$, $i = 1,...,N_2$.

The K-S test is based on the following statistic, which measures the maximum distance between the two cumulative distributions, $F_1(x1)$ and $F_2(x2)$,

$$D = \text{max}|F_1(x1) - F_2(x2)| $$

($0 \leq D \leq 1$; we note that other statistics could be used to measure the difference between $F_1$ and $F_2$, e.g., the integrated square error). The key question is how often the value of $D$ computed from the data would arise by chance if the two samples were drawn from the same distribution (the null hypothesis in this case). Kolmogorov showed in 1933 that the probability of obtaining by chance a value of $D$ larger than the measured value is given by the function

$$Q_{KS}(\lambda) = 2\sum^{\infty}_{k=1} (-1)^{k-1}e^{-2k^2\lambda^2} \qquad(1)$$

where the argument $\lambda$ can be accurately described by the following approximation:

$$\lambda = \bigg(0.12 + \sqrt{n_e} + \frac{0.11}{\sqrt{n_e}}\bigg)D $$

where the number of effective data points is computed from

$$n_e = \frac{N_1N_2}{N_1 +N_2} $$

Note that for large $n_e$, $\lambda \approx \sqrt{n_e}D$. If the probability that a given value of $D$ is due to chance is very small (e.g., 0.01 or 0.05), we can reject the null hypothesis that the two samples were drawn from the same underlying distribution.
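The chain from $D$ through $n_e$ and $\lambda$ to $Q_{KS}$ can be sketched in plain Python as follows. The function names, the number of series terms, and the small-$\lambda$ cutoff are choices made for this illustration and are not part of the text.

```python
import math
from bisect import bisect_right


def ks_statistic(x1, x2):
    """Maximum distance D between the two empirical cumulative distributions."""
    s1, s2 = sorted(x1), sorted(x2)
    d = 0.0
    # Both empirical CDFs are step functions, so the maximum distance
    # is reached at one of the data points.
    for x in s1 + s2:
        f1 = bisect_right(s1, x) / len(s1)
        f2 = bisect_right(s2, x) / len(s2)
        d = max(d, abs(f1 - f2))
    return d


def q_ks(lam, terms=100):
    """Series of eq. (1), truncated after `terms` terms."""
    if lam < 0.2:
        # Assumed cutoff: the alternating series converges slowly for
        # small lambda, where Q_KS is 1 to high precision.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def ks_two_sample(x1, x2):
    """Return (D, Q_KS) for two samples."""
    d = ks_statistic(x1, x2)
    n_e = len(x1) * len(x2) / (len(x1) + len(x2))
    lam = (0.12 + math.sqrt(n_e) + 0.11 / math.sqrt(n_e)) * d
    return d, q_ks(lam)


# Made-up samples, for illustration only.
sample_a = [0.1, 0.4, 0.7, 1.3, 2.0]
sample_b = [1.5, 2.2, 2.9, 3.1, 3.8, 4.4]
d_ab, q_ab = ks_two_sample(sample_a, sample_b)
print(f"D = {d_ab:.3f}, Q_KS = {q_ab:.3f}")
```

For these samples $D = 5/6$ and $Q_{KS}$ falls below 0.05, so the null hypothesis would be rejected at $\alpha = 0.05$ but not at $\alpha = 0.01$.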

For $n_e$ greater than about 10 or so, we can bypass eq. (1) and use the following approximation to evaluate the value of $D$ corresponding to a given probability $\alpha$ of obtaining a value at least that large:

$$D_{KS} = \frac{C(\alpha)}{\sqrt{n_e}} $$

where $C(\alpha = 0.05)= 1.36$ and $C(\alpha = 0.01)= 1.63$. Note that the ability to reject the null hypothesis (if it really is false) increases with $\sqrt{n_e}$. For example, if $n_e = 100$, then $D> D_{KS} = 0.163$ would arise by chance in only 1% of all trials. If the actual data-based value is indeed 0.163, we can reject the null hypothesis that the data were drawn from the same (unknown) distribution, with our decision being correct in 99 out of 100 cases.
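The worked number for $n_e = 100$ can be reproduced with a few lines. The helper name `d_ks_threshold` is hypothetical, and the sample sizes $N_1 = N_2 = 200$ are chosen here only because they give $n_e = 100$.

```python
import math

# C(alpha) values quoted in the text.
C_ALPHA = {0.05: 1.36, 0.01: 1.63}


def d_ks_threshold(n1, n2, alpha=0.05):
    """Critical value D_KS = C(alpha) / sqrt(n_e), intended for n_e > ~10."""
    n_e = n1 * n2 / (n1 + n2)
    return C_ALPHA[alpha] / math.sqrt(n_e)


# N1 = N2 = 200 gives n_e = 100.
d_99 = d_ks_threshold(200, 200, alpha=0.01)
d_95 = d_ks_threshold(200, 200, alpha=0.05)
print(f"D_KS(alpha=0.01) = {d_99:.3f}")
print(f"D_KS(alpha=0.05) = {d_95:.3f}")
```

Quadrupling both sample sizes halves the threshold, which is the $\sqrt{n_e}$ scaling noted above.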

We can also use the K-S test to ask "Is the measured $f(x)$ consistent with a known reference distribution function $h(x)$?" This is known as the "one sample" K-S test, as opposed to the "two sample" K-S test discussed above. In this case, $N_1 = N$ and $N_2 = \infty$, and thus $n_e = N$. Again, a small value of $Q_{KS}$ (or $D > D_{KS}$) indicates that it is unlikely, at the confidence level set by $\alpha$, that the data summarized by $f(x)$ were drawn from $h(x)$.
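A minimal sketch of the one-sample statistic is given below. It assumes the reference distribution is supplied as a cumulative distribution function (a Python callable); `ks_one_sample` and `normal_cdf` are names introduced for this illustration.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of N(mu, sigma)."""
    return 0.5 * math.erfc(-(x - mu) / (sigma * math.sqrt(2.0)))


def ks_one_sample(sample, cdf):
    """Maximum distance between the empirical CDF and a reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        h = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x, so the
        # distance must be checked on both sides of the step.
        d = max(d, i / n - h, h - (i - 1) / n)
    return d


# A perfectly regular grid on [0, 1] compared with the uniform CDF.
grid = [(i - 0.5) / 10 for i in range(1, 11)]
d_grid = ks_one_sample(grid, lambda x: x)
print(f"D (grid vs uniform) = {d_grid:.3f}")

# The same grid compared with N(0, 1): a much larger distance.
d_gauss = ks_one_sample(grid, normal_cdf)
print(f"D (grid vs Gaussian) = {d_gauss:.3f}")
```

With $n_e = N$, the resulting $D$ can be compared against $D_{KS} = C(\alpha)/\sqrt{N}$ in the same way as for the two-sample test.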

The K-S test is sensitive to the location, scale, and shape of the underlying distribution. Additionally, because the test relies on cumulative distributions, it is invariant to reparametrization of $x$ (we would get the same answer if we used $\ln{x}$ instead of $x$). The main strength, but also the main weakness, of the K-S test is its ignorance about the underlying distribution. For example, the test is insensitive to details in the differential distribution function (e.g., narrow regions where it drops to zero), and more sensitive near the center of the distribution than at the tails (the K-S test is not the best choice for distinguishing samples drawn from Gaussian and exponential distributions).
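The invariance to reparametrization follows from the fact that $D$ depends only on the ordering of the data, which any monotonic transformation preserves. A short check, using made-up positive values and an illustrative `ks_statistic` helper:

```python
import math
from bisect import bisect_right


def ks_statistic(x1, x2):
    """Maximum distance between the two empirical cumulative distributions."""
    s1, s2 = sorted(x1), sorted(x2)
    d = 0.0
    for x in s1 + s2:
        f1 = bisect_right(s1, x) / len(s1)
        f2 = bisect_right(s2, x) / len(s2)
        d = max(d, abs(f1 - f2))
    return d


# Made-up positive samples, for illustration only.
x1 = [0.5, 1.2, 3.4, 7.7, 20.0]
x2 = [0.9, 2.2, 2.8, 15.0]

d_linear = ks_statistic(x1, x2)
d_log = ks_statistic([math.log(x) for x in x1],
                     [math.log(x) for x in x2])
print(d_linear, d_log)  # identical: the ordering of the data is unchanged
```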

A simple test related to the K-S test was developed by Kuiper to treat distributions defined on a circle. It is based on the statistic

$$D^* = \text{max}\{F_1(x1)-F_2(x2)\} + \text{max}\{F_2(x2)-F_1(x1)\} $$

This statistic considers both positive and negative differences between the two distributions ($D$ from the K-S test is equal to the greater of the two terms). For distributions defined on a circle (e.g., 0° < x < 360°), the value of $D^*$ is invariant to where exactly the origin (x=0°) is placed. Thus, the Kuiper test is a good test for comparing the longitude distributions of two astronomical samples. By analogy with the K-S test,

$$Q_{\text{Kuiper}}(\lambda) = 2\sum^{\infty}_{k=1}(4k^2\lambda^2-1)e^{-2k^2\lambda^2} $$

with

$$\lambda = \bigg(0.155 + \sqrt{n_e} +\frac{0.24}{\sqrt{n_e}}\bigg) D^*$$
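The origin invariance of the Kuiper statistic can be checked numerically. In the sketch below the angles are made-up values in degrees, the function names are illustrative, and the cutoff below which $Q_{\text{Kuiper}}$ is set to 1 is an assumption of this sketch (the series is very close to 1 there).

```python
import math
from bisect import bisect_right


def signed_distances(x1, x2):
    """Return (max of F1 - F2, max of F2 - F1) for the empirical CDFs."""
    s1, s2 = sorted(x1), sorted(x2)
    d_plus = d_minus = 0.0
    for x in s1 + s2:
        diff = bisect_right(s1, x) / len(s1) - bisect_right(s2, x) / len(s2)
        d_plus = max(d_plus, diff)
        d_minus = max(d_minus, -diff)
    return d_plus, d_minus


def kuiper_statistic(x1, x2):
    """Sum of the largest positive and largest negative differences."""
    return sum(signed_distances(x1, x2))


def q_kuiper(lam, terms=100):
    """Truncated series for the Kuiper significance."""
    if lam < 0.4:
        return 1.0  # assumed cutoff; the series is ~1 for small lambda
    total = sum((4.0 * k * k * lam * lam - 1.0)
                * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# Made-up longitudes in degrees, then the same data with a shifted origin.
lon_a = [10, 50, 120, 200, 300]
lon_b = [30, 80, 90, 250, 310, 350]
shift = 77
lon_a_shifted = [(x + shift) % 360 for x in lon_a]
lon_b_shifted = [(x + shift) % 360 for x in lon_b]

kuiper_original = kuiper_statistic(lon_a, lon_b)
kuiper_shifted = kuiper_statistic(lon_a_shifted, lon_b_shifted)
ks_original = max(signed_distances(lon_a, lon_b))
ks_shifted = max(signed_distances(lon_a_shifted, lon_b_shifted))
print(f"Kuiper: {kuiper_original:.4f} vs {kuiper_shifted:.4f}")
print(f"K-S:    {ks_original:.4f} vs {ks_shifted:.4f}")
```

For these values the Kuiper statistic is unchanged by the shift, while the K-S distance computed from the same data is not.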

The K-S test is not the only option for nonparametric comparison of distributions. The Cramér-von Mises criterion, the Watson test, and the Anderson-Darling test are similar to the K-S test but consider somewhat different statistics. For example, the Anderson-Darling test is more sensitive to differences in the tails of the two distributions than the K-S test. A difficulty with these other statistics is that a simple summary of their behavior, such as given by eq. (1) for the K-S test, is not readily available.

The Gini coefficient is a somewhat similar quantity that is also based on the cumulative distribution function. It measures the deviation of a given cumulative distribution ($F(x)$, defined for $x_\text{min} \leq x \leq x_\text{max}$) from that expected for a uniform distribution:

$$ G = 1-2\int^{x_\text{max}}_{x_\text{min}}F(x)dx $$

When $F(x)$ corresponds to a uniform differential distribution, $G=0$, and $G \leq 1$ always. The Gini coefficient is not a statistical test, but we mention it here for reference because it is commonly used for classification in economics and related fields.


$$\lambda = \frac{\sqrt{n_e}D}{1+(0.25-0.75/\sqrt{n_e})\sqrt{1-\rho^2}}$$
$$ A^2 = -N - \frac{1}{N}\sum^N_{i=1}[(2i-1)\ln(F_i)+(2N-2i+1)\ln(1-F_i)] $$

$$z_i = \frac{x_i-\mu}{\sigma}$$
$$W = \frac{\big(\sum^N_{i=1}a_iR_i\big)^2}{\sum^N_{i=1}(x_i-\overline{x})^2} $$
$$Z_1 = 1.3\frac{|\mu-q_{50}|}{s}\sqrt{N}$$
$$Z_2= 1.1\bigg|\frac{s}{\sigma_G}-1\bigg|\sqrt{N}$$
$$h(x) = \mathcal{N}(\mu,\sigma)\sum^\infty_{k=0}a_kH_k(z) $$
$$t=\frac{\overline{x1}-\overline{x2}}{s_D}$$
$$s_D = \sqrt{s^2_{12}\bigg(\frac{1}{N_1}+\frac{1}{N_2}\bigg)}$$
$$s_{12} = \sqrt{\frac{(N_1-1)s^2_1+(N_2-1)s^2_2}{N_1+N_2-2}} $$
$$s_D = \sqrt{\frac{(N_1-1)s^2_1+(N_2-1)s^2_2-2\text{Cov}_{12}}{N}} $$
$$\text{Cov}_{12} = \frac{1}{N-1}\sum^N_{i=1}(x1_i-\overline{x1})(x2_i-\overline{x2}) $$
$$F = \frac{s^2_1}{s^2_2} $$

### The U test and the Wilcoxon test

The K-S test, as well as other nonparametric methods for comparing distributions, is often sensitive to more than one distribution property, such as the location or scale parameters. We often care about differences in only a particular statistic, such as the mean value, and do not care about the others. For these cases, there exist several nonparametric tests analogous to the better-known classical parametric tests: the t test and the paired t test. These are based on the ranks of data points and not their values.

The $U$ test, or the Mann-Whitney-Wilcoxon test (or the Wilcoxon rank-sum test), is a nonparametric test for testing whether two data sets are drawn from distributions with different location parameters (if the distributions are known to be Gaussian, the standard classical test is called the t test). The sensitivity of the $U$ test is dominated by a difference in medians of the two tested distributions.

The $U$ statistic is determined using ranks for the full sample obtained by concatenating the two data sets and sorting them. To compute the $U$ statistic, take each value from sample 1 and count the number of observations in sample 2 that have a smaller rank (in the case of identical values, take half a count). The sum of these counts is $U$, and the smaller of this value and the one obtained with the samples reversed is used to assess the significance. For cases with more than about 20 points per sample, the $U$ statistic for sample 1 can be more easily computed as

$$U_1 = R_1 - \frac{N_1(N_1+1)}{2} $$

where $R_1$ is the sum of ranks for sample 1, and analogously for sample 2. The adopted $U$ statistic is the smaller of the two (note that $U_1+U_2 = N_1N_2$, which can be used to check computations). The behavior of $U$ for large samples can be well approximated with a Gaussian distribution, $\mathcal{N}(\mu_U,\sigma_U)$, of a variable

$$z = \frac{U-\mu_U}{\sigma_U} $$

with 

$$\mu_U = \frac{N_1N_2}{2} $$

and 

$$\sigma_U = \sqrt{\frac{N_1N_2(N_1+N_2+1)}{12}} $$
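The two ways of obtaining $U$ (direct counting and the rank sum) can be compared in a short sketch. The data are made up, ties receive average ranks (an assumption consistent with the half-count rule), and the Gaussian approximation is shown only for illustration since the example samples are far smaller than 20 points.

```python
import math


def average_ranks(values):
    """1-based ranks of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def u_by_counting(x1, x2):
    """For each value in x1, count smaller values in x2 (ties count 1/2)."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in x1 for b in x2)


def u_test(x1, x2):
    """Return (U1, U2, z) using the rank-sum shortcut."""
    n1, n2 = len(x1), len(x2)
    ranks = average_ranks(list(x1) + list(x2))
    r1 = sum(ranks[:n1])                   # rank sum of sample 1
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1                      # from U1 + U2 = N1 * N2
    mu_u = n1 * n2 / 2
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (min(u1, u2) - mu_u) / sigma_u
    return u1, u2, z


# Made-up samples with one tied value.
s1 = [1.1, 2.5, 3.0, 4.2]
s2 = [0.5, 2.5, 5.0]
u1, u2, z = u_test(s1, s2)
print(u1, u2, round(z, 3))
```

Here the counting definition and the rank-sum shortcut both give $U_1 = 6.5$, and $U_1 + U_2 = N_1 N_2 = 12$ as expected.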

A special case of comparing the means of two data sets is when the data sets have the same size ($N_1 = N_2 = N$) and the data points are paired. For example, the two data sets could correspond to the same sample measured twice, before and after something that could have affected the values, and we are testing for evidence of a change in mean values. The nonparametric Wilcoxon signed-rank test can be used to compare the means of two arbitrary distributions. The test is based on the differences $y_i=x1_i-x2_i$, and the values with $y_i=0$ are excluded, yielding the new sample size $m \leq N$. The sample is ordered by $|y_i|$, resulting in the rank $R_i$ for each pair, and each pair is assigned $\Phi_i = 1$ if $x1_i > x2_i$ and 0 otherwise. The Wilcoxon signed-rank statistic is then

$$W_+ = \sum^m_{i=1} \Phi_iR_i $$

that is, all the ranks with $y_i > 0$ are summed. Analogously, $W_{-}$ is the sum of all the ranks with $y_i < 0$, and the statistic $T$ is the smaller of the two. For small values of $m$, the significance of $T$ can be found in tables. For $m$ larger than about 20, the behavior of $T$ can be well approximated with a Gaussian distribution, $\mathcal{N} (\mu_T, \sigma_T)$, of the variable

$$z = \frac{T-\mu_T}{\sigma_T} $$

with 

$$\mu_T = \frac{m(m+1)}{4} $$

and

$$\sigma_T = \sqrt{\frac{m(m+1)(2m+1)}{24}} $$
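The signed-rank procedure can be sketched as follows. The paired values are made up, ties in $|y_i|$ receive average ranks (an assumption of this sketch), and the Gaussian approximation uses the standard large-sample values $\mu_T = m(m+1)/4$ and $\sigma_T = \sqrt{m(m+1)(2m+1)/24}$; it is evaluated here only to show the mechanics, since the example has far fewer than 20 pairs.

```python
import math


def average_ranks(values):
    """1-based ranks of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(x1, x2):
    """Return (W_plus, W_minus, z) for paired samples x1 and x2."""
    # Differences y_i = x1_i - x2_i, with zero differences excluded.
    y = [a - b for a, b in zip(x1, x2) if a != b]
    m = len(y)
    ranks = average_ranks([abs(v) for v in y])
    w_plus = sum(r for r, v in zip(ranks, y) if v > 0)
    w_minus = sum(r for r, v in zip(ranks, y) if v < 0)
    t = min(w_plus, w_minus)
    mu_t = m * (m + 1) / 4
    sigma_t = math.sqrt(m * (m + 1) * (2 * m + 1) / 24)
    return w_plus, w_minus, (t - mu_t) / sigma_t


# Made-up paired measurements ("before" and "after").
before = [5, 7, 8, 6, 9]
after = [4, 7, 6, 9, 5]
w_plus, w_minus, z_t = wilcoxon_signed_rank(before, after)
print(w_plus, w_minus, round(z_t, 3))
```

One pair has zero difference and is dropped, so $m = 4$; the check $W_+ + W_- = m(m+1)/2 = 10$ holds for the result.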

## 4.7.3 Comparison of two-dimensional distributions

The cumulative probability distribution is not well defined in more than one dimension. Thus, there is no direct analog of the K-S test for multidimensional distributions. However