# STAT 345: Nonparametric Statistics

## Lesson 11.1: Cochran's $Q$ Test

**Reading: Conover Section 4.6 & 3.5**

*Guest Lecturer: Prof. Nonhle Channon Mdziniso*

Thursday 17 April 2025

These lecture slides are in a computational notebook.  You have access to them through http://vmware.rit.edu/

Flat HTML and slideshow versions are also in MyCourses.

The notebook can run Python commands (other notebooks can use R or Julia; "Ju-Pyt-R").  Think: computational data analysis, not "coding".

Standard commands to activate inline interface and import libraries:

In [1]:
%matplotlib inline

In [2]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8.0,5.0)
plt.rcParams['font.size'] = 14

## Cochran’s Q Test

We now turn to a scenario which is somewhere between a contingency table
and a complete block design. We consider a contingency table in which
all of the counts are either $0$ or $1$, and are assumed to correspond
to the yes/no response of $r$ different subjects to $c$ different
treatments. We call these responses $\{X_{ij}\}$ and the data table
looks like

<table>
    <tr><td></td><th colspan="4" style="text-align: center">Treatment</th><td></td></tr>
<tr><td></td><td> $$1$$</td><td>   $$2$$</td><td> $$\cdots$$</td><td>$$c$$</td><td></td></tr>
<tr><th> $$i=1$$</th><td> $$X_{11}$$</td><td>$$X_{12}$$</td><td>$$\cdots$$</td><td>$$X_{1c}$$</td><td>$$r_1$$</td></tr>
<tr><th> $$i=2$$</th><td>  $$X_{21}$$</td><td>$$X_{22}$$</td><td>$$\cdots$$</td><td>$$X_{2c}$$</td><td>$$r_2$$</td></tr>
<tr><th> $$\vdots$$</th><td>$$\vdots$$</td><td>$$\vdots$$</td><td>$$\ddots$$</td><td>$$\vdots$$</td><td>$$\vdots$$</td></tr>
<tr><th> $$i=r$$</th><td>$$X_{r1}$$</td><td>$$X_{r2}$$</td><td>$$\cdots$$</td><td>$$X_{rc}$$</td><td>$$r_r$$</td></tr>
<tr><td></td><td>$$c_1$$</td><td>$$c_2$$</td><td>  $$\cdots$$</td><td> $$c_c$$</td><td> $$N$$</td></tr>
    </table>

If we imagine repeating the experiment, each observation is a Bernoulli
random variable, i.e., a binomial with one trial,
${\color{royalblue}{X_{ij}}}\sim\operatorname{Bin}(1,p^{(ij)})$. We
write the probability as $p^{(ij)}$ rather than $p_{ij}$ to stress that
there is no constraint placed on any sum of the probabilities, just a
requirement that $0\le p^{(ij)}\le 1$ for all $i$ and $j$. We can also
see that the marginal totals are all random statistics in this picture,
since they represent the total numbers of successes that happen to
occur:

<table>
    <tr><td></td><th colspan="4" style="text-align: center">Treatment</th><td></td></tr>
<tr><td></td><td> $$1$$</td><td>   $$2$$</td><td> $$\cdots$$</td><td>$$c$$</td><td></td></tr>
<tr><th> $$i=1$$</th><td> $$\color{royalblue}{X_{11}}$$</td><td>$$\color{royalblue}{X_{12}}$$</td><td>$$\cdots$$</td><td>$$\color{royalblue}{X_{1c}}$$</td><td>$$\color{royalblue}{R_1}$$</td></tr>
<tr><th> $$i=2$$</th><td>  $$\color{royalblue}{X_{21}}$$</td><td>$$\color{royalblue}{X_{22}}$$</td><td>$$\cdots$$</td><td>$$\color{royalblue}{X_{2c}}$$</td><td>$$\color{royalblue}{R_2}$$</td></tr>
<tr><th> $$\vdots$$</th><td>$$\vdots$$</td><td>$$\vdots$$</td><td>$$\ddots$$</td><td>$$\vdots$$</td><td>$$\vdots$$</td></tr>
<tr><th> $$i=r$$</th><td>$$\color{royalblue}{X_{r1}}$$</td><td>$$\color{royalblue}{X_{r2}}$$</td><td>$$\cdots$$</td><td>$$\color{royalblue}{X_{rc}}$$</td><td>$$\color{royalblue}{R_r}$$</td></tr>
<tr><td></td><td>$$\color{royalblue}{C_1}$$</td><td>$$\color{royalblue}{C_2}$$</td><td>  $$\cdots$$</td><td> $$\color{royalblue}{C_c}$$</td><td> $$\color{royalblue}{N}$$</td></tr>
    </table>

- ${\color{royalblue}{X_{ij}}}\sim\operatorname{Bin}(1,p^{(ij)})$. $H_0$ says each subject responds the same way to
all the treatments, i.e., $p^{(ij)}=p^{(i\bullet)}$ for each $i$, but we
don’t make any statements about the $\{p^{(i\bullet)}\}$.

- Under $H_0$, row totals might be quite different, but column totals (#
of successes for each treatment) should be similar. So we're interested in the statistical properties of
${\color{royalblue}{C_j}}=\sum_{i=1}^r{\color{royalblue}{X_{ij}}}$.
$$E({\color{royalblue}{C_j}}) = \sum_{i=1}^rE({\color{royalblue}{X_{ij}}}) = \sum_{i=1}^r p^{(i\bullet)}
\qquad\hbox{and}\qquad\operatorname{Var}({\color{royalblue}{C_j}}) = \sum_{i=1}^r\operatorname{Var}({\color{royalblue}{X_{ij}}})
  = \sum_{i=1}^r p^{(i\bullet)}(1-p^{(i\bullet)})$$

- For the statistic, replace the unknown $p^{(i\bullet)}$ with the estimator ${\color{royalblue}{R_i}}/c$. Note that this means the
estimator of $E({\color{royalblue}{C_j}})$ is
$$\sum_{i=1}^r\frac{{\color{royalblue}{R_i}}}{c} = \frac{{\color{royalblue}{N}}}{c}
  = \frac{1}{c}\sum_{j=1}^c {\color{royalblue}{C_j}}$$ which is the natural best guess anyway (the expected column total is
the average of the column totals).

- Each column’s contribution to the statistic should then be
$$\frac{[{\color{royalblue}{C_j}} - E({\color{royalblue}{C_j}})]^2}{\operatorname{Var}({\color{royalblue}{C_j}})}
  \approx \frac{({\color{royalblue}{C_j}}-{\color{royalblue}{N}}/c)^2}{\sum_{i=1}^r({\color{royalblue}{R_i}}/c)(1-{\color{royalblue}{R_i}}/c)}$$

- Because the mean and the variance are both estimated from the data, the
column contributions are correlated; accounting for this turns out to give the statistic
$${\color{royalblue}{Q}} = c(c-1)\frac{\sum_{j=1}^c({\color{royalblue}{C_j}}-{\color{royalblue}{N}}/c)^2}
  {\sum_{i=1}^r({\color{royalblue}{R_i}})(c-{\color{royalblue}{R_i}})}$$
whose null distribution is approximately $\chi^2(c-1)$.
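
The formula for $Q$ can be wrapped in a small helper. This is only a sketch: the function name `cochran_q` and the array `toy` (a made-up table with 6 subjects and 3 treatments) are illustrative and do not come from the lecture data.

```python
import numpy as np
from scipy import stats

def cochran_q(x):
    """Return (Q, approximate p-value) for an r x c array of 0/1 responses."""
    x = np.asarray(x)
    n_rows, n_cols = x.shape
    row_tot = x.sum(axis=1)   # R_i: number of successes for subject i
    col_tot = x.sum(axis=0)   # C_j: number of successes for treatment j
    total = x.sum()           # N: total number of successes
    numer = np.sum((col_tot - total / n_cols) ** 2)
    denom = np.sum(row_tot * (n_cols - row_tot))
    q = n_cols * (n_cols - 1) * numer / denom
    # The null distribution is approximately chi-squared with c-1 degrees of freedom
    return q, stats.chi2(df=n_cols - 1).sf(q)

# Made-up responses: rows are subjects, columns are treatments
toy = np.array([[1, 1, 0],
                [1, 0, 0],
                [1, 1, 1],
                [1, 0, 0],
                [0, 1, 0],
                [1, 1, 0]])
q_toy, p_toy = cochran_q(toy)
print(q_toy, p_toy)
```

Note that the denominator vanishes if every subject responds identically to all treatments (every $R_i$ is $0$ or $c$); the sketch does not guard against that case.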

As a concrete example, consider these data from Messenger et al, *Phys Rev D* **92**, 023006 (2015), which represent the performance of $c=5$ different search pipelines on the same set of $r=50$ different simulated gravitational-wave signals.  A 1 means the signal was detected, a 0 means it was not.  We wish to know if there's a difference among the performance of the different pipelines.

In [4]:
X_ij = np.loadtxt('lesson_11_1_found.dat',usecols=(2,3,4,5,6),dtype=int); r,c = X_ij.shape; r, c 

(50, 5)

In [5]:
X_ij

array([[0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1],
       [0, 1, 0, 0, 1],
       [0, 1, 0, 0, 1],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1],
       [0, 1, 0, 0, 1],
       [1, 1, 0, 0, 1],
       [0, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1],
       [1, 1, 0, 0, 1],
       [1, 1, 0, 0, 1],
       [0, 1, 0, 0, 1],
       [1, 1, 0, 0, 1],
       [0, 0, 0, 0, 1],
       [1, 1, 0, 0, 1],
       [1, 1, 1, 0, 1],
       [1, 1, 0, 0, 1],
       [1, 1, 0, 0, 1],
       [1, 1, 1, 0, 1],
       [1, 1, 1, 0, 1],
       [1, 1, 1, 0, 1],
       [1, 1, 0, 0, 1],
       [0, 1, 0, 0, 1],
       [0, 1, 0, 0, 1],
       [0, 1, 0, 0, 1],
       [1, 1, 0, 0, 1],
       [0, 0, 0, 0, 1],
       [1, 1, 0, 0, 1],
       [1, 1, 1, 0, 1],
       [1, 1, 0, 0, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 0, 1],
       [1, 1, 1,

The row total $r_i$ is the number of pipelines that detected the $i$th signal.  The column total $c_j$ is the number of signals detected by the $j$th pipeline.

In [6]:
r_i = X_ij.sum(axis=-1); c_j = X_ij.sum(axis=0); N=np.sum(X_ij); r_i, c_j, N, np.sum(r_i), np.sum(c_j)

(array([1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 3, 1, 2, 1, 1, 3, 3, 2, 3,
        1, 3, 4, 3, 3, 4, 4, 4, 3, 2, 2, 2, 3, 1, 3, 4, 3, 5, 4, 5, 4, 5,
        4, 5, 4, 5, 5, 5]),
 array([28, 34, 16,  7, 50]),
 135,
 135,
 135)

The "expected" number of detections for each pipeline is $N/c$, and actual minus expected is $c_j-N/c$:

In [7]:
N/c, c_j-N/c

(27.0, array([  1.,   7., -11., -20.,  23.]))

The contribution to the null variance from each row is $r_i(c-r_i)$ so signals where about half the pipelines made a detection help us distinguish the hypotheses:

In [8]:
r_i, r_i*(c-r_i)

(array([1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 3, 1, 2, 1, 1, 3, 3, 2, 3,
        1, 3, 4, 3, 3, 4, 4, 4, 3, 2, 2, 2, 3, 1, 3, 4, 3, 5, 4, 5, 4, 5,
        4, 5, 4, 5, 5, 5]),
 array([4, 4, 4, 4, 6, 6, 4, 4, 4, 4, 4, 4, 6, 6, 4, 6, 4, 4, 6, 6, 6, 6,
        4, 6, 4, 6, 6, 4, 4, 4, 6, 6, 6, 6, 6, 4, 6, 4, 6, 0, 4, 0, 4, 0,
        4, 0, 4, 0, 0, 0]))

- We construct the statistic
$${\color{royalblue}{Q}} = c(c-1)\frac{\sum_{j=1}^c({\color{royalblue}{C_j}}-{\color{royalblue}{N}}/c)^2}
  {\sum_{i=1}^r({\color{royalblue}{R_i}})(c-{\color{royalblue}{R_i}})}$$

In [9]:
Q = c*(c-1)*np.sum((c_j-N/c)**2)/np.sum(r_i*(c-r_i)); Q

104.76190476190476

- Obviously, this is a huge value for a chi-squared with 4 degrees of freedom, so we easily reject the null hypothesis that the searches perform equally well.

In [10]:
c-1, stats.chi2(df=c-1).sf(Q)

(4, 9.519798192058123e-22)
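
The $\chi^2(c-1)$ null distribution is only an approximation, and one way to get a feel for it is by simulation. The sketch below is not part of the original analysis: the seed, the number of simulated tables `n_sim`, and the per-signal detection probabilities `p_row` are all arbitrary assumptions. It draws tables for which $H_0$ is true and compares the simulated $Q$ values to the $\chi^2(4)$ reference.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(345)            # arbitrary seed
n_rows, n_cols, n_sim = 50, 5, 2000         # table shape as in the example; n_sim is arbitrary
p_row = rng.uniform(0.2, 0.8, size=n_rows)  # hypothetical per-row success probabilities

def q_stat(x):
    """Cochran's Q for an array of 0/1 responses."""
    ncol = x.shape[1]
    r_i = x.sum(axis=1)
    c_j = x.sum(axis=0)
    return ncol * (ncol - 1) * np.sum((c_j - x.sum() / ncol) ** 2) / np.sum(r_i * (ncol - r_i))

q_null = np.empty(n_sim)
for k in range(n_sim):
    # Under H0 every treatment has the same success probability within a row
    x = (rng.random((n_rows, n_cols)) < p_row[:, None]).astype(int)
    q_null[k] = q_stat(x)

mean_q = q_null.mean()                                             # chi2(4) has mean 4
frac_above = np.mean(q_null > stats.chi2(df=n_cols - 1).ppf(0.95)) # nominal value 0.05
print(mean_q, frac_above)
```

If the approximation is reasonable, `mean_q` should be close to $c-1=4$ and `frac_above` close to $0.05$.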

- That was sort of obvious, since the total numbers of detected signals out of 50 were so different:

In [11]:
c_j

array([28, 34, 16,  7, 50])

Since that's a pretty extreme result, let's see what would have happened if we'd just restricted attention to two of the searches, which happen to be the most comparable:

In [12]:
myX_ij = np.loadtxt('lesson_11_1_found.dat',usecols=(2,3),dtype=int); myr,myc = myX_ij.shape; myr, myc

(50, 2)

In [13]:
myX_ij

array([[0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 1],
       [0, 1],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 1],
       [1, 1],
       [0, 0],
       [1, 0],
       [0, 0],
       [0, 0],
       [1, 1],
       [1, 1],
       [0, 1],
       [1, 1],
       [0, 0],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [0, 1],
       [0, 1],
       [0, 1],
       [1, 1],
       [0, 0],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1]])

Now we see one pipeline detects 28 signals, the other 34, and some signals are detected by one, both, or neither pipeline:

In [14]:
myr_i = myX_ij.sum(axis=-1); myc_j = myX_ij.sum(axis=0); myN=np.sum(myX_ij); myr_i, myc_j, myN, np.sum(myr_i), np.sum(myc_j)

(array([0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 2, 2, 1, 2,
        0, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2]),
 array([28, 34]),
 62,
 62,
 62)

In [15]:
myN/myc, myc_j-myN/myc

(31.0, array([-3.,  3.]))

In [16]:
myr_i, myr_i*(myc-myr_i)

(array([0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 2, 2, 1, 2,
        0, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2]),
 array([0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0]))

- Again we construct the Cochran $Q$ statistic:

In [17]:
myQ = myc*(myc-1)*np.sum((myc_j-myN/myc)**2)/np.sum(myr_i*(myc-myr_i)); myQ

4.5

- This is still kind of high for a $\chi^2(1)$: the chi-squared approximation gives $p\approx0.034$, so we can reject the null hypothesis at the 5% level.

In [18]:
myc-1, stats.chi2(df=myc-1).sf(myQ)

(1, 0.033894853524689295)

### Importance of Blocking Information

Note that, if we didn’t know that all of the observations in the same
block were related, we would pool them together and create a $2\times
c$ contingency table where the two rows were just successes and
failures. The observations would be $O_{1j}\equiv$ the number of
successes in column $j$ (which is $C_j$ in the current notation) and
$O_{2j}\equiv$ the number of failures in column $j$ (which is $r-C_j$ in
the current notation). The contingency table would look like

<table>
    <tr><td></td><th colspan="4" style="text-align: center">Treatment</th><td></td></tr>
<tr><td></td><td> $$1$$</td><td>   $$2$$</td><td> $$\cdots$$</td><td>$$c$$</td><td></td></tr>
<tr><th> Success</th><td> $$\color{royalblue}{C_{1}}$$</td><td>$$\color{royalblue}{C_{2}}$$</td><td>$$\cdots$$</td><td>$$\color{royalblue}{C_{c}}$$</td><td>$$\color{royalblue}{N}$$</td></tr>
<tr><th> Failure</th><td>  $$r-\color{royalblue}{C_{1}}$$</td><td>$$r-\color{royalblue}{C_{2}}$$</td><td>$$\cdots$$</td><td>$$r-\color{royalblue}{C_{c}}$$</td><td>$$rc-\color{royalblue}{N}$$</td></tr>
<tr><td></td><td>$$r$$</td><td>$$r$$</td><td>  $$\cdots$$</td><td> $$r$$</td><td> $$rc$$</td></tr>
    </table>

As you’ll see on the homework, this test is less sensitive if the
blocking carries important information.

The numbers of successes are $c_j$, but the expected number for each pipeline is $\frac{(N)(r)}{rc}=N/c$:

In [19]:
myO1_j = myc_j; myE1j = myN/myc; myO2_j = myr - myc_j; myE2j = myr-myN/myc; myO1_j, myE1j, myO2_j, myE2j

(array([28, 34]), 31.0, array([22, 16]), 19.0)

In [20]:
myT = np.sum((myO1_j-myE1j)**2/myE1j) + np.sum((myO2_j-myE2j)**2/myE2j); round(myT, 3)

1.528

In [21]:
round(stats.chi2(df=myc-1).sf(myT), 3)

0.216
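
As a cross-check on the pooled contingency-table statistic, `scipy.stats.chi2_contingency` can be applied directly to the $2\times c$ table of successes and failures. This is a sketch rather than part of the original notebook: the counts are copied from the column totals printed above, and `correction=False` is passed so that no continuity correction is applied in the $2\times 2$ case.

```python
import numpy as np
from scipy import stats

n_signals = 50  # r: number of simulated signals

# Two-pipeline comparison: successes (28, 34) and failures (22, 16)
successes2 = np.array([28, 34])
table2 = np.vstack([successes2, n_signals - successes2])
chi2_2, p_2, dof_2, expected_2 = stats.chi2_contingency(table2, correction=False)

# Five-pipeline comparison
successes5 = np.array([28, 34, 16, 7, 50])
table5 = np.vstack([successes5, n_signals - successes5])
chi2_5, p_5, dof_5, expected_5 = stats.chi2_contingency(table5, correction=False)

print(chi2_2, p_2, dof_2)
print(chi2_5, p_5, dof_5)
```

The statistic sums $(O-E)^2/E$ over both the success row and the failure row, with expected counts $N/c$ and $r-N/c$ respectively.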

This is not significant.  Detecting 28 vs 34 signals out of 50 is not such a big deal, if you don't know how many signals one pipeline detected that the other didn't.

Incidentally, if you compare all five pipelines this way, the results are still very significant, since the number of detections is so different.

In [22]:
O1_j = c_j; E1j = N/c; O2_j = r - c_j; E2j = r-N/c; O1_j, E1j, O2_j, E2j

(array([28, 34, 16,  7, 50]), 27.0, array([22, 16, 34, 43,  0]), 23.0)

In [23]:
T = np.sum((O1_j-E1j)**2/E1j) + np.sum((O2_j-E2j)**2/E2j); round(T, 3)

88.567

In [24]:
print(f"{stats.chi2(df=c-1).sf(T):.2e}")

2.65e-18

### McNemar’s Test (see Conover Section 3.5)

In the case where $c=2$, Cochran’s test is equivalent to a test
performed using a $2\times 2$ contingency table, known as McNemar’s
test. When there are only two columns, the information in each block
consists of whether the pair $(X_{i1},X_{i2})$ is $(0,0)$, $(0,1)$,
$(1,0)$, or $(1,1)$, and the important information is how many blocks of
each kind we have. We’re then dealing with a $2\times 2$ contingency
table

|            |$$X_{i2}=0$$|$$X_{i2}=1$$|
| ---------- | ---------- | ---------- |
|$$X_{i1}=0$$|    $$a$$   |    $$b$$   |
|$$X_{i1}=1$$|    $$c$$   |    $$d$$   |

In [25]:
thisa = np.sum((1-myX_ij[:,0]) * (1-myX_ij[:,1])); 
thisb = np.sum((1-myX_ij[:,0]) * myX_ij[:,1]); 
thisc = np.sum(myX_ij[:,0] * (1-myX_ij[:,1])); 
thisd = np.sum(myX_ij[:,0] * myX_ij[:,1]); 
thisa, thisb, thisc, thisd

(15, 7, 1, 27)

The interpretation of this table is different from the usual two-way
contingency table, though. If the treatments behave differently, $b$ and
$c$ will differ from each other. The McNemar test statistic is
$$\frac{(b-c)^2}{b+c}$$ which is approximately $\chi^2(1)$ distributed
if the treatments are equivalent.

|            |#2 Missed|#2 Found|
| ---------- | ---------- | ---------- |
|**#1 Missed**|    $$15$$   |    $$7$$   |
|**#1 Found**|    $$1$$   |    $$27$$   |

In [26]:
thisT = (thisb-thisc)**2/(thisb+thisc); thisT, myQ

(4.5, 4.5)

Note that this is exactly the same as the Cochran $Q$ statistic.
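
The agreement is not a coincidence. A short derivation sketch: to avoid a clash of symbols, the number of treatments is written explicitly as $2$ here, and $a$, $b$, $c$, $d$ are the cell counts of the $2\times 2$ table. The column totals and their deviations from $N/2$ are
$$C_1 = c+d,\qquad C_2 = b+d,\qquad N = b+c+2d
\qquad\Longrightarrow\qquad
C_1-\frac{N}{2} = \frac{c-b}{2},\qquad C_2-\frac{N}{2} = \frac{b-c}{2}$$
Only rows with $R_i=1$ contribute to the denominator, each with $R_i(2-R_i)=1$, and there are $b+c$ such rows, so
$$Q = 2(2-1)\,\frac{\frac{(c-b)^2}{4}+\frac{(b-c)^2}{4}}{b+c} = \frac{(b-c)^2}{b+c}$$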

The McNemar test is actually just the sign test, since the off-diagonal counts $b$ and $c$ play the roles of $n_+$ and $n_-$, while the pairs on the diagonal are ties and are discarded.
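
Viewing the test as a sign test also suggests an exact alternative to the chi-squared approximation: under the null hypothesis, each discordant pair is equally likely to fall in either off-diagonal cell. The sketch below is an addition, not part of the original notebook; it uses the counts $b=7$ and $c=1$ from the table above, and the name `c_disc` is chosen only to avoid confusion with the number of treatments.

```python
from scipy import stats

b, c_disc = 7, 1      # discordant counts from the 2x2 table above
n_disc = b + c_disc   # number of untied pairs

# Exact two-sided sign test: double the smaller tail of Binomial(n_disc, 1/2)
p_exact = min(1.0, 2 * stats.binom(n_disc, 0.5).cdf(min(b, c_disc)))

# Chi-squared approximation used by the McNemar statistic
p_chi2 = stats.chi2(df=1).sf((b - c_disc) ** 2 / n_disc)

print(p_exact, p_chi2)
```

With only 8 discordant pairs the approximation is rough: the exact two-sided $p$-value is $18/256\approx 0.070$, compared with about $0.034$ from the $\chi^2(1)$ approximation.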

Note that this makes it really clear why the Cochran $Q$ test (which is the same as the McNemar test when there are only two treatments) gave a significant result, when the test which just counted the number of found signals did not.  The McNemar test ignores the 15 signals which were missed by both pipelines and the 27 which were found by both, and notes that the second pipeline found 7 signals that the first one missed, while the first one found only one signal that the second one missed.