# STAT 345: Nonparametric Statistics

## Lesson 09.2: Two-Way Contingency Tables

**Reading: Conover Section 4.2**

*Prof. John T. Whelan*

Tuesday 8 April 2025

These lecture slides are in a computational notebook.  You have access to them through http://vmware.rit.edu/

Flat HTML and slideshow versions are also in MyCourses.

The notebook can run Python commands (other notebooks can use R or Julia; "Ju-Pyt-R").  Think: computational data analysis, not "coding".

Standard commands to activate inline interface and import libraries:

In [1]:
%matplotlib inline

In [2]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8.0,5.0)
plt.rcParams['font.size'] = 14

## Contingency Tables
- Generalization of categorical data: multiple sets of categories, each observation classified in one category from each set.
- E.g., draw individuals from a population & note hair & eye color.

- Each observation is a multi-dimensional categorical vector. For concreteness, focus on the case w/two sets of categories:
$\{{\mathcal{R}}_1,\ldots,{\mathcal{R}}_r\}\equiv\{{\mathcal{R}}_i|i=1,\ldots,r\}$
and
$\{{\mathcal{C}}_1,\ldots,{\mathcal{C}}_c\}\equiv\{{\mathcal{C}}_j|j=1,\ldots,c\}$.

- $N$ obs $\equiv$ paired categorical data sample
$\{(x_I,y_I)|I=1,\ldots,N\}$, w/$x_I\in\{{\mathcal{R}}_i\}$ and
$y_I\in\{{\mathcal{C}}_j\}$.

- Count up the number of observations in each pair of categories
$$O_{ij} = \sum_{I=1}^N I[x_I={\mathcal{R}}_i,y_I={\mathcal{C}}_j]$$

- Arrange observations $\{O_{ij}\}$ into a **contingency table**:

| | $$j=1$$ | $$j=2$$ | $$\cdots$$ | $$j=c$$ | Total |
|-| ------- | ------- | ---------- | ------- | --- |
|$$i=1$$ | $$O_{11}$$ |  $$O_{12}$$ |  $$\cdots$$ |  $$O_{1c}$$ | $$r_1$$ |
 |      $$i=2$$ | $$O_{21}$$ |  $$O_{22}$$ |  $$\cdots$$ |  $$O_{2c}$$ | $$r_2$$ |
 |   $$\vdots$$ | $$\vdots$$ |  $$\vdots$$ |  $$\ddots$$ |  $$\vdots$$ | $$\vdots$$ |
 |      $$i=r$$ | $$O_{r1}$$ |  $$O_{r2}$$ |  $$\cdots$$ |  $$O_{rc}$$ | $$r_r$$ |
 | **Total** |             $$c_1$$ |     $$c_2$$ |    $$\cdots$$ |   $$c_c$$ |   $$N$$ |

- Define the row and column totals (total number of observations in each set of categories)
$$\sum_{i=1}^r O_{ij} = c_j \qquad \sum_{j=1}^c O_{ij} = r_i$$
- The total number of observations is given by
$$\sum_{i=1}^r r_i = N = \sum_{j=1}^c c_j$$
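
The counting and the marginal sums above can be sketched in a few lines of NumPy. This is only an illustration: the paired sample `x`, `y` below is made up, and it assumes the categories have already been encoded as integer indices starting at 0.

```python
import numpy as np

# Hypothetical paired sample (made-up values): x holds row-category
# indices 0..r-1 and y holds column-category indices 0..c-1
x = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y = np.array([0, 2, 1, 1, 0, 2, 2, 0])
r, c = 2, 3

# O[i, j] = number of observations with x == i and y == j
O = np.zeros((r, c), dtype=int)
np.add.at(O, (x, y), 1)   # unbuffered add, so repeated (i, j) pairs all count

r_i = O.sum(axis=1)   # row totals
c_j = O.sum(axis=0)   # column totals
N = O.sum()           # total number of observations
print(O, r_i, c_j, N)
```

`np.add.at` is used rather than `O[x, y] += 1` because the latter would count each repeated index pair only once.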

- Usual contingency table test checks for association between the categories in the two sets (rather than specified probabilities).

- $H_0$ says row & column categories have no influence on each other (e.g., a blue-eyed person is no more likely to have brown vs blond hair than a brown-eyed person)

- Details are different depending on how you formulate the problem (which row/column totals are fixed), but all predict expected counts of $E_{ij}=r_ic_j/N$ and construct statistic
$$\sum_{i=1}^r\sum_{j=1}^c \frac{(O_{ij}-E_{ij})^2}{E_{ij}}
  = \sum_{i=1}^r\sum_{j=1}^c \frac{O_{ij}^2}{E_{ij}} - N$$
which is approximately $\chi^2([r-1][c-1])$ distributed if $H_0$ is true.
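
The equality of the two forms of the statistic can be checked with a short derivation, assuming only $\sum_{i,j}O_{ij}=N$ and $E_{ij}=r_ic_j/N$. First, the expected counts also add up to $N$:

$$\sum_{i=1}^r\sum_{j=1}^c E_{ij} = \frac{1}{N}\left(\sum_{i=1}^r r_i\right)\left(\sum_{j=1}^c c_j\right) = \frac{N^2}{N} = N$$

Expanding the square then gives

$$\sum_{i=1}^r\sum_{j=1}^c \frac{(O_{ij}-E_{ij})^2}{E_{ij}}
  = \sum_{i=1}^r\sum_{j=1}^c \left(\frac{O_{ij}^2}{E_{ij}} - 2O_{ij} + E_{ij}\right)
  = \sum_{i=1}^r\sum_{j=1}^c \frac{O_{ij}^2}{E_{ij}} - 2N + N
  = \sum_{i=1}^r\sum_{j=1}^c \frac{O_{ij}^2}{E_{ij}} - N$$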

### Example

For instance, suppose we survey students majoring in four
disciplines about their food choices:

|               |  Vegan |  Vegetarian |  Non-Veg  | Total|
|  -------------|------- |------------ |--------- |-------|
|  Math & Stat  |    9   |      22     |    50   |    81|
|  Physics      |    6   |      16     |    29    |   51|
|  Chemistry    |   11   |      28     |    63   |    102|
|  Biology      |   26   |      50     |    97   |   173|
|  Total        |   52   |     116      |   239  |    407|

Are there significant tendencies for
students in one major to have one diet or another?

There are two ways to pose the question:

- Any difference in the tendencies of students in one major or
another to have a vegan, vegetarian, or non-vegetarian diet?
(homogeneity)

- Any correlation between the major chosen by a student and their
dietary choices? (independence)

Either way, we construct the same $\chi^2$ statistic, starting from the observations $\{O_{ij}\}$ & row/column sums:

In [3]:
O_ij = np.array([[9, 22, 50], [6, 16, 29], [11, 28, 63], [26, 50, 97]])
r_i = np.sum(O_ij,axis=1); c_j = np.sum(O_ij,axis=0); N = np.sum(O_ij); r_i, c_j, N, np.sum(r_i), np.sum(c_j)

(array([ 81,  51, 102, 173]), array([ 52, 116, 239]), 407, 407, 407)

Under $H_0$, the estimated expected counts $E_{ij}=r_ic_j/N$ are
$\frac{(81)(52)}{407}\approx 10.35$, $\frac{(81)(116)}{407}\approx
23.09$, etc.:

In [4]:
E_ij = r_i[:,None] * c_j[None,:] / N; E_ij

array([[ 10.34889435,  23.08599509,  47.56511057],
       [  6.51597052,  14.53562654,  29.94840295],
       [ 13.03194103,  29.07125307,  59.8968059 ],
       [ 22.1031941 ,  49.30712531, 101.58968059]])

Note that $\{E_{ij}\}$ has the same row & column sums as $\{O_{ij}\}$:

In [5]:
np.sum(E_ij,axis=1), np.sum(E_ij,axis=0), np.sum(E_ij)

(array([ 81.,  51., 102., 173.]), array([ 52., 116., 239.]), 407.0)

The $\chi^2$ statistic is $\sum_{i=1}^r\sum_{j=1}^c \frac{(O_{ij}-E_{ij})^2}{E_{ij}}
  = \sum_{i=1}^r\sum_{j=1}^c \frac{O_{ij}^2}{E_{ij}} - N\approx 1.99$:

In [6]:
w = np.sum((O_ij-E_ij)**2/E_ij); w, np.sum(O_ij**2/E_ij) - N

(1.9911349995983592, 1.99113499959833)

This is on the low side for $(4-1)(3-1)=6$ degrees of freedom (a $\chi^2(6)$ random variable has mean $6$), so the $p$-value of $0.92$ tells us there's no evidence of association between major and diet:

In [7]:
r=len(r_i); c=len(c_j); r, c, (r-1)*(c-1), stats.chi2(df=(r-1)*(c-1)).sf(w)

(4, 3, 6, 0.9205121061730804)

There are fancier ways to do this with e.g., Pandas, but old-fashioned spreadsheets like gnumeric or Excel work well too...
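
One shortcut that needs nothing beyond the libraries already imported is `scipy.stats.chi2_contingency`, which computes the expected counts, statistic, degrees of freedom and $p$-value in one call. A minimal cross-check of the numbers above, with the table re-entered so the cell stands alone:

```python
import numpy as np
from scipy import stats

# Observed counts from the example (rows: majors, columns: diets)
O_ij = np.array([[9, 22, 50], [6, 16, 29], [11, 28, 63], [26, 50, 97]])

# correction=False turns off the Yates continuity correction, which
# SciPy would in any case only apply when there is 1 degree of freedom
chi2_stat, p_value, dof, expected = stats.chi2_contingency(O_ij, correction=False)
print(chi2_stat, p_value, dof)
print(expected)
```

The returned `expected` array should agree with `E_ij` computed by hand above.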

## Test for Inhomogeneity

To see where the $\chi^2$ test comes from (and what the null distribution is for small samples), consider what assumptions are being made.

First, suppose the total number $r_i$ in each row is fixed.  I.e., we poll a given number of students from each major, and consider whether the distributions of diets are different.

For each $i$, we have a
multinomial random vector
$\{{\color{royalblue}{O_{ij}}}|j=1,\ldots,c\}$ with $r_i$ trials, which
in general has probabilities
$\{p^{(i)}_1,p^{(i)}_2,\ldots,p^{(i)}_c\}\equiv\{p^{(i)}_j\}$. $H_0$ says $p^{(i)}_j=p_{\bullet j}$ for all $i$.

- Multinomial has expectation value $E({\color{royalblue}{O_{ij}}})=r_i p^{(i)}_j$

- Assuming homogeneity, estimate $p_{\bullet j}$ as $\hat{p}_{\bullet j}=c_j/N$

- Estimated expected number is $\hat{E}_{ij}=r_i \hat{p}_{\bullet j}=r_i c_j/N$; use that to make the $\chi^2$.

- We have $r$ multinomial random vectors with $c$ categories each, which means we've observed $r(c-1)$ non-trivial numbers (the counts in each row have to add to the fixed $r_i$). We've estimated $c$ probabilities, but only $c-1$ of them were non-trivial because they had to add to $1$. Thus the number of degrees of freedom for the chi-squared should be $r(c-1) - (c-1) = (r-1)(c-1)$.

- Note that although the formalism treats the rows and columns rather differently, the final data analysis prescription treats them symmetrically.
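
The fixed-row-totals setup can be explored with a small Monte Carlo. This is a sketch only: it takes the row totals from the diet example, and *assumes* the common null probabilities equal the observed column fractions; the seed and number of simulations are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(345)   # arbitrary seed

r_i = np.array([81, 51, 102, 173])       # fixed row totals from the example
p_j = np.array([52, 116, 239]) / 407     # assumed common H0 probabilities
n_sim = 2000

def chi2_stat(O):
    """Pearson chi-squared statistic with E_ij = r_i c_j / N."""
    E = O.sum(axis=1)[:, None] * O.sum(axis=0)[None, :] / O.sum()
    return np.sum((O - E)**2 / E)

# Each simulated table is one independent multinomial draw per row.
# (A column total of zero would give E_ij = 0; with these numbers
# that is vanishingly unlikely, so it is not handled here.)
w_sim = np.array([
    chi2_stat(np.array([rng.multinomial(n, p_j) for n in r_i]))
    for _ in range(n_sim)
])

dof = (len(r_i) - 1) * (len(p_j) - 1)
print(w_sim.mean(), dof)   # sample mean vs. mean of chi2(dof)
# Fraction of simulations above the nominal 5% critical value
print(np.mean(w_sim > stats.chi2(df=dof).ppf(0.95)))
```

With row totals this large, the simulated mean should land near $(r-1)(c-1)=6$ and the exceedance fraction near $0.05$.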

## Test for Dependence

Now suppose only the total number of observations $N$ is fixed, e.g., we've just picked 407 students at random and noted their major and diet.

- Row totals
    $\{{\color{royalblue}{R_i}}\}$ and column totals
    $\{{\color{royalblue}{C_j}}\}$ are random variables. In general,
    $\{{\color{royalblue}{O_{ij}}}|i=1,\ldots,r;j=1,\ldots,c\}$ are a
    multinomial random vector with probabilities $\{p_{ij}\}$, $\sum_{i=1}^r\sum_{j=1}^c p_{ij}=1$, so $E\left(\color{royalblue}{O_{ij}}\right) = N p_{ij}$

- $H_0$ (independence) says $p_{ij}=p_{i\bullet}p_{\bullet j}$ for some
    $\{p_{i\bullet}|i=1,\ldots,r\}$ with $\sum_{i=1}^r p_{i\bullet}=1$
    and some $\{p_{\bullet j}|j=1,\ldots,c\}$ with
    $\sum_{j=1}^c p_{\bullet j}=1$.

- Estimate $\hat{p}_{i\bullet}=r_i/N$ & $\hat{p}_{\bullet j}=c_j/N$ so $\hat{E}_{ij} = N\frac{r_i}{N}\frac{c_j}{N} = r_ic_j/N$ as before.

- The table is a single multinomial w/$rc$ categories, so there are $rc-1$ independent observations.  We estimated $r$ probs $\{\hat{p}_{i\bullet}\}$ ($r-1$ independent) & $c$ probs $\{\hat{p}_{\bullet j}\}$ ($c-1$ independent), so $\sum_{i=1}^r\sum_{j=1}^c \frac{(O_{ij}-E_{ij})^2}{E_{ij}}$ is a $\chi^2$ w/$rc-1-(r-1)-(c-1)=rc-r-c+1=(r-1)(c-1)$ dof, same as before.
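
The setup with only $N$ fixed can be simulated the same way, now with a single multinomial draw over all $rc$ cells per table. Again a sketch: the marginal probabilities are *assumed* equal to the observed fractions from the diet example, and the seed and simulation count are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2025)   # arbitrary seed

N = 407
p_row = np.array([81, 51, 102, 173]) / N   # assumed p_{i.}
p_col = np.array([52, 116, 239]) / N       # assumed p_{.j}
p_ij = np.outer(p_row, p_col)              # independence: p_ij = p_{i.} p_{.j}

n_sim = 2000
# One multinomial draw over all r*c cells per simulated table
tables = rng.multinomial(N, p_ij.ravel(), size=n_sim).reshape(n_sim, *p_ij.shape)

R = tables.sum(axis=2, keepdims=True)   # random row totals, shape (n_sim, r, 1)
C = tables.sum(axis=1, keepdims=True)   # random column totals, shape (n_sim, 1, c)
E = R * C / N                           # estimated expected counts per table
w_sim = np.sum((tables - E)**2 / E, axis=(1, 2))

dof = (p_ij.shape[0] - 1) * (p_ij.shape[1] - 1)
print(w_sim.mean(), dof)
# Fraction of simulations above the nominal 5% critical value
print(np.mean(w_sim > stats.chi2(df=dof).ppf(0.95)))
```

Here the row and column totals vary from table to table, yet the statistic should still look like a $\chi^2$ with $6$ degrees of freedom at this sample size.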

### Other Tests

Finally, could assume the row numbers $\{r_i\}$ & column numbers $\{c_j\}$ are fixed. The distribution of $\{{\color{royalblue}{O_{ij}}}\}$ is then just combinatorics: arrange the $N$ observations into rows and columns, respecting the marginal totals.  We'll consider this in the next lesson, but for large numbers, it's the same chi-squared as the other cases.

Note, although these assumptions all give the same chi-squared test, the exact distributions for the statistics will be different when the numbers are smaller, as you'll examine on the homework.

In each case we'll have a null expectation value $E_{ij}$ for the number
of observations in row $i$ and column $j$, and we'll define a
chi-squared statistic
$$\sum_{i=1}^r\sum_{j=1}^c \frac{(O_{ij}-E_{ij})^2}{E_{ij}}
  = \sum_{i=1}^r\sum_{j=1}^c \frac{O_{ij}^2}{E_{ij}} - N$$
It will turn
out that for each of the assumptions about the null distribution,
$E_{ij}=r_ic_j/N$, and, under the null hypothesis and the normal
approximation, the statistic will be chi-squared distributed with
$(r-1)(c-1)$ degrees of freedom. For small samples, however, the details
of the null distribution will depend on the assumptions made about the
experimental setup.

## Projects

- In the next few weeks you'll carry out and present a project.

- The project should consist of
either investigation and presentation of a nonparametric statistical
method not covered this semester, or an in-depth numerical evaluation of
an analysis or comparison of analyses which we *have* covered.

- By 8am **Monday** April 14 you'll submit a proposal.  I'll select four of these, and assign each to groups of 2 or 3 of you.  You'll submit written reports with the last two homeworks, and give presentations during the final two classes.