# The Neyman "potential outcomes" model for causal inference

There are $N$ subjects and $T$ possible treatments.
Each subject is represented by a ticket. 
Ticket $j$ lists $T$ numbers, $(x_{j0}, \ldots, x_{jT-1})$.
The value $x_{jt}$ is the response subject $j$ will have if assigned to treatment $t$.
(Treatment $0$ is commonly control or placebo.)

This mathematical setup embodies the _non-interference_ assumption, which means that
subject $j$'s response depends only on which treatment subject $j$ receives, and not
on the treatment other subjects receive.
(That is not a good assumption in situations like vaccine trials, where whether one subject
becomes infected may depend on which other subjects are vaccinated, if subjects
may come in contact with each other.)

This model is also called the _potential outcomes_ model, because it starts with the
_potential_ outcomes each subject will have to each treatment. 
Assigning a subject to a 
treatment just reveals the potential outcome that corresponds to that treatment, for that subject. 
This model was introduced by Jerzy Neyman, the founder of the U.C. Berkeley Department of Statistics, in a 1923 paper in Polish [translated into English in 1990](https://projecteuclid.org/journals/statistical-science/volume-5/issue-4/On-the-Application-of-Probability-Theory-to-Agricultural-Experiments-Essay/10.1214/ss/1177012031.full).
It was popularized by Donald Rubin in the 1970s and 1980s.

There are generalizations of this model, including one in which the "potential outcomes" are random, rather than deterministic, but their distributions are fixed before assignment to treatment: if subject $j$ is assigned treatment $t$, a draw from the distribution $\mathbb{P}_{jt}$ is observed. 
Draws for different subjects are independent.
We shall see an example of this when we discuss [nuisance parameters](./nuisance.ipynb).

### Null hypotheses for the Neyman model

The _strong_ null hypothesis is that subject by subject, the effect of
all $T$ treatments is the same.
That is,
\begin{equation*}
x_{j0} = x_{j1} = \cdots = x_{jT-1}, \;\; j=1, \ldots, N.
\end{equation*}
Different subjects may have different responses ($x_{jt}$ might not equal $x_{kt}$ if $j \ne k$), but each subject's response is the same regardless of the treatment assigned 
to that subject.
This is the null hypothesis Fisher considered in _The Design of Experiments_ and which
he generally considered the "correct" null in practice.

Suppose $T=2$: we are comparing two treatments. Suppose we assign $n$ subjects at random
to treatment 0 and the other $m = N-n$ to treatment 1.
Let $\{z_j\}_{j=1}^n$ be the responses of the subjects assigned treatment 0
and $\{y_j\}_{j=1}^m$ be the responses of the subjects assigned treatment 1.
(That is, $z_1 = x_{k0}$ if $k$ is the first subject assigned treatment $0$,
and $y_1 = x_{k1}$ if $k$ is the first subject assigned treatment $1$.)
Then testing the strong null hypothesis is identical to the _two-sample problem_:
under the strong null, each subject's response would have been the same, regardless
of treatment, so allocating subjects to treatments and observing their responses
is just randomly partitioning a fixed set of $N$ numbers into a group of size $n$ and a group of size $m$.
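
The random-partition description above can be sketched in code. This is a minimal illustration, not a prescribed method: the function name `exact_two_sample_p`, the choice of the difference in sample means as the test statistic, and the one-sided alternative are all assumptions made for the example. It enumerates every way of choosing which $m$ of the $N$ pooled responses are labeled "treatment 1", so it is practical only for small $N$.

```python
from itertools import combinations
import numpy as np

def exact_two_sample_p(z, y):
    """Exact one-sided P-value for the strong null.

    z: responses of the n subjects assigned treatment 0
    y: responses of the m subjects assigned treatment 1
    Test statistic (an assumption of this sketch): mean(y) - mean(z).
    """
    pooled = np.concatenate([z, y]).astype(float)
    N, m = len(pooled), len(y)
    total = pooled.sum()
    observed = np.mean(y) - np.mean(z)
    as_large = 0
    n_assignments = 0
    # Under the strong null, every choice of m subjects is equally likely
    for idx in combinations(range(N), m):
        s1 = pooled[list(idx)].sum()
        stat = s1 / m - (total - s1) / (N - m)
        as_large += int(stat >= observed - 1e-12)  # tolerance for rounding
        n_assignments += 1
    return as_large / n_assignments

p_demo = exact_two_sample_p([1, 2, 3], [4, 5, 6])  # hypothetical data
```

For larger $N$, the enumeration is usually replaced by simulating random partitions.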

The _weak_ null hypothesis is that on average across subjects, all treatments have the same effect. 
That is,
\begin{equation*}
\frac{1}{N} \sum_{j=1}^N x_{j0} = \frac{1}{N} \sum_{j=1}^N x_{j1} = \cdots = \frac{1}{N} \sum_{j=1}^N x_{jT-1}.
\end{equation*}
Much of Neyman's work on experiments involves this null hypothesis.
The statistical theory is more complex for the weak null hypothesis than for the strong null.

The strong null is indeed a stronger hypothesis than the weak null, because if the strong null is true, the weak null must also be true: if $T$ lists are equal, element by element, then their means are equal. 
But the converse is not true: the weak null can be true even if the strong null is false.
For example, for $T=2$ and $N=2$, we might have potential responses $(0, 1)$ for subject 1 and $(1,0)$ for subject 2. The effect of treatment is to increase subject 1's response from 0 to 1 and to decrease subject 2's response from 1 to 0.
The treatment affects both subjects, but the average effect of treatment is the same: the average response across subjects is 1/2, with or without treatment.
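
The two-subject example can be checked numerically. In this small sketch, the array layout (one row per subject, one column per treatment) and the helper names are assumptions made for illustration.

```python
import numpy as np

# Rows are subjects, columns are treatments: x[j, t] is subject j's
# potential response to treatment t (layout assumed for this sketch).
x = np.array([[0, 1],
              [1, 0]])

def strong_null_holds(x):
    """True if every subject has the same response to every treatment."""
    return bool(np.all(x == x[:, [0]]))

def weak_null_holds(x):
    """True if the mean response across subjects is the same for every treatment."""
    means = x.mean(axis=0)
    return bool(np.allclose(means, means[0]))

strong = strong_null_holds(x)
weak = weak_null_holds(x)
```

Here `strong` is `False` and `weak` is `True`: the column means are both 1/2 even though each row has two different entries.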

If we can test the weak null, we can also make inferences about the _average treatment effect_. If we can only test the strong null, in general we have to make assumptions about how treatment affects responses in order to make inferences about the average treatment effect.

### Alternative hypotheses in the Neyman model

#### Constant shift

For example, if we assume that the effect of treatment is to shift every subject's response by the same amount, then we can use a test of the strong null to make inferences about that constant effect.
In symbols, this alternative states that there is some number $\Delta$
such that $x_{j1} = x_{j0}+\Delta$ for all subjects $j$.

As with the strong null, once the original data are observed, this hypothesis completely specifies the probability distribution of the data: we know what subject $j$'s response would have been had the subject been assigned the other treatment. If the subject was assigned treatment 0, the response would have been larger by $\Delta$ if the subject had been assigned treatment 1 instead. If the subject was assigned treatment 1, the response would have been smaller by $\Delta$ if the subject had been assigned treatment 0 instead.
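
One way to use this in practice is to subtract the hypothesized $\Delta$ from the responses of the subjects assigned treatment 1, which recovers every subject's $x_{j0}$ under the hypothesis, and then to test the strong null on the adjusted data. The sketch below does that by exhaustive enumeration. The function name `shift_p_value`, the absolute difference in means as the test statistic, and the data are assumptions made for illustration.

```python
from itertools import combinations
import numpy as np

def shift_p_value(z, y, delta):
    """Exact two-sided P-value for the hypothesis x_j1 = x_j0 + delta for all j.

    z: responses of subjects assigned treatment 0
    y: responses of subjects assigned treatment 1
    """
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    # If the hypothesis is true, these are the treatment-0 responses of all N subjects
    base = np.concatenate([z, y - delta])
    N, m = len(base), len(y)
    total = base.sum()

    def stat(s1):
        # |mean of the "treated" group - mean of the "control" group| on adjusted data
        return abs(s1 / m - (total - s1) / (N - m))

    observed = stat(base[len(z):].sum())
    stats = np.array([stat(base[list(idx)].sum())
                      for idx in combinations(range(N), m)])
    return float(np.mean(stats >= observed - 1e-12))

p_shift_3 = shift_p_value([1, 2, 3], [4, 5, 6], delta=3)  # hypothetical data
p_shift_0 = shift_p_value([1, 2, 3], [4, 5, 6], delta=0)
```

The set of values of $\Delta$ that such a test does not reject at level $\alpha$ is a confidence set for the constant shift.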

#### Other tractable alternative hypotheses

A more general alternative is that $x_{j1} = f(x_{j0})$ for some strictly monotonic (and thus invertible) function $f$. 
A simple example is that treatment multiplies the response by a constant.

In some contexts, it can be reasonable to assume that treatment can only help, that is that $x_{j1} \ge x_{j0}$, without specifying a functional relationship between them. 

### Testing the strong null hypothesis

Under the strong null that the treatment makes no difference whatsoever--as if 
the response had been predetermined before assignment to treatment or control--the null distribution of any test statistic is completely determined once the data have
been observed: we know what the data would have been for any other random assignment, namely, the same. 
And we know the chance that each of those possible datasets would have resulted from
the experiment, since we know how subjects were assigned at random to treatments.

For alternatives that allow us to find $x_{j0}$ from $x_{j1}$ and vice versa,
the alternative also completely determines the 
probability distribution of any test statistic, once the data have been observed.

## Two treatments, binary responses
 
Imagine testing whether a vaccine prevents a disease.
We assign a random sample of $n$ of the $N$ subjects to receive treatment 1;
the other $N-n$ receive a placebo, treatment 0.
Let $W_j$ denote the treatment assigned to subject $j$, so $\sum_{j=1}^N W_j = n$.
After some time has passed, we observe 
\begin{equation*}
X_j := (1-W_j) x_{j0} + W_j x_{j1}, \;\; j=1, \ldots, N.
\end{equation*}
These are random variables, but the only source of randomness is $\{W_j\}$, the 
treatment assignment variables.

The total number of infections among the vaccinated is
\begin{equation*}
   X_1^* := \sum_{j=1}^N W_j x_{j1}
\end{equation*}
and the total among the unvaccinated is
\begin{equation*}
   X_0^* := \sum_{j=1}^N (1-W_j) x_{j0}.
\end{equation*}
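
The definitions of $W_j$, $X_j$, $X_0^*$, and $X_1^*$ can be illustrated with a short simulation. This is a sketch under assumptions: the table `x_demo` of potential outcomes is made up (in a real experiment only one entry per row is ever seen), and the function name `reveal` is hypothetical.

```python
import numpy as np

def reveal(x, n, rng):
    """Assign n of the N subjects to treatment 1 by simple random sampling.

    x: N-by-2 array of binary potential outcomes; x[j, t] is the response
       of subject j to treatment t.
    Returns (W, X, X0_star, X1_star).
    """
    x = np.asarray(x)
    N = x.shape[0]
    W = np.zeros(N, dtype=int)
    W[rng.choice(N, size=n, replace=False)] = 1
    X = (1 - W) * x[:, 0] + W * x[:, 1]        # observed responses
    X1_star = int(np.sum(W * x[:, 1]))         # infections among the vaccinated
    X0_star = int(np.sum((1 - W) * x[:, 0]))   # infections among the unvaccinated
    return W, X, X0_star, X1_star

# Hypothetical potential outcomes for N = 6 subjects
x_demo = np.array([[0, 0], [1, 0], [1, 1], [0, 1], [1, 0], [0, 0]])
W, X, X0_star, X1_star = reveal(x_demo, n=3, rng=np.random.default_rng(1))
```

Rerunning with a different seed changes `W`, and hence `X`, but never `x_demo`: the assignment is the only source of randomness.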


Under the strong null that the vaccine makes no difference whatsoever--as if whether a subject would become ill was predetermined before assignment to treatment or control--the number of infections among the vaccinated, $X_1^*$, has a hypergeometric distribution with population size $N$, number of infections in the population $G=X_0^*+X_1^*$, and sample size $n$.
Testing the strong null using this hypergeometric distribution yields _Fisher's Exact Test_.
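
A minimal sketch of that test, using only the Python standard library (`math.comb`, available in Python 3.8 and later), follows. The function names and the one-sided alternative (that the vaccine reduces infections) are assumptions made for the example, as are the numbers in the usage line.

```python
from math import comb

def hypergeom_pmf(k, N, G, n):
    """Chance that a simple random sample of size n from a population of N
    items, G of which are infections, contains exactly k infections."""
    if k < max(0, n - (N - G)) or k > min(n, G):
        return 0.0
    return comb(G, k) * comb(N - G, n - k) / comb(N, n)

def fisher_lower_p(x1_star, x0_star, N, n):
    """One-sided P-value for the strong null: the chance of seeing x1_star or
    fewer infections among the n vaccinated subjects."""
    G = x0_star + x1_star
    return sum(hypergeom_pmf(k, N, G, n) for k in range(x1_star + 1))

# Hypothetical trial: N = 10 subjects, n = 5 vaccinated,
# 0 infections among the vaccinated and 4 among the unvaccinated.
p_fisher = fisher_lower_p(x1_star=0, x0_star=4, N=10, n=5)
```

In this hypothetical trial the P-value is $\binom{6}{5}/\binom{10}{5} = 6/252$.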

This model can be generalized by considering the total number of infections $X_0^*+X_1^*$ to be random, then conditioning on the observed value to get a conditional test.

### Inference about the size of the treatment effect

But how can we find a confidence interval for the treatment effect?

First, we need to define what we mean by the treatment effect!

#### Effect Size for the Weak Null

Recall that the "weak" null hypothesis is that the average response is the same, regardless
of the assigned treatment (but individuals might have different responses to treatment).
Define $\bar{X}_0 := X_0^*/(N-n)$ and $\bar{X}_1 := X_1^*/n$, the observed mean
responses for the control group and the treatment group.
These are unbiased estimates of 
the corresponding population parameters,
\begin{equation*}
\bar{x}_0 := \frac{1}{N} \sum_{j=1}^N x_{j0}
\end{equation*}
and
\begin{equation*}
\bar{x}_1 := \frac{1}{N} \sum_{j=1}^N x_{j1}.
\end{equation*}

The _average treatment effect_ for the study population is
\begin{equation*}
\tau := \bar{x}_1 - \bar{x}_0 = \frac{1}{N} \sum_{j=1}^N (x_{j1}-x_{j0}) = \frac{1}{N} \sum_{j=1}^N \tau_j,
\end{equation*}
where $\tau_j := x_{j1}-x_{j0}$.
An unbiased estimate of $\tau$ is $\hat{\tau} := \bar{X}_1 - \bar{X}_0$.
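
Unbiasedness can be checked by brute force for a small population: average the difference in observed means over every equally likely assignment and compare the result with $\tau$. The sketch below assumes a made-up table of potential outcomes and a hypothetical function name. It needs $1 \le n \le N-1$ so that both groups are nonempty.

```python
from itertools import combinations
import numpy as np

def tau_and_expected_tau_hat(x, n):
    """Return (tau, E[tau_hat]), where the expectation is taken over all
    assignments of n of the N subjects to treatment 1."""
    x = np.asarray(x, dtype=float)
    N = x.shape[0]
    tau = x[:, 1].mean() - x[:, 0].mean()
    estimates = []
    for treated in combinations(range(N), n):
        W = np.zeros(N, dtype=bool)
        W[list(treated)] = True
        # observed mean among the treated minus observed mean among controls
        estimates.append(x[W, 1].mean() - x[~W, 0].mean())
    return float(tau), float(np.mean(estimates))

# Hypothetical potential outcomes for N = 6 subjects
x_small = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 1], [0, 0]]
tau_true, tau_hat_mean = tau_and_expected_tau_hat(x_small, n=2)
```

For this table both values equal $1/6$, up to floating-point rounding.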

The study population can be summarized by 4 numbers, $N_{00}$, $N_{01}$, $N_{10}$, and $N_{11}$, where $N_{ik}$ is the number of subjects $j$ for whom 
$x_{j0} = i$ and $x_{j1} = k$, for $i, k \in \{0, 1\}$.
Of course, $N = N_{00} + N_{01} + N_{10} + N_{11}$.
The average treatment effect can be written $\tau = (N_{01}-N_{10})/N$.
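
Reading $N_{ik}$ as the number of subjects with $x_{j0} = i$ and $x_{j1} = k$, the identity $\tau = (N_{01}-N_{10})/N$ can be checked on a small made-up table of potential outcomes. The table and the function name are assumptions made for illustration.

```python
import numpy as np

def summary_counts(x):
    """Return a 2-by-2 array whose [i, k] entry is the number of subjects
    with x_j0 = i and x_j1 = k."""
    counts = np.zeros((2, 2), dtype=int)
    for x0, x1 in np.asarray(x, dtype=int):
        counts[x0, x1] += 1
    return counts

# Hypothetical binary potential outcomes for N = 6 subjects
x_bin = np.array([[0, 0], [0, 0], [1, 0], [1, 1], [0, 1], [0, 1]])
N_tab = summary_counts(x_bin)
N_subjects = x_bin.shape[0]

tau_from_counts = (N_tab[0, 1] - N_tab[1, 0]) / N_subjects
tau_direct = x_bin[:, 1].mean() - x_bin[:, 0].mean()
```

Subjects counted in $N_{00}$ and $N_{11}$ have $\tau_j = 0$, so only $N_{01}$ (each contributing $+1$) and $N_{10}$ (each contributing $-1$) enter the sum.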

#### First approach: Bonferroni simultaneous confidence intervals

A collection of confidence set procedures $\{\mathcal{I}_i \}_{i=1}^m$ for a corresponding set of parameters $\{ \theta_i \}_{i=1}^m$ has _simultaneous coverage probability_ $1-\alpha$ if
\begin{equation*}
\mathbb{P}_{\theta_1, \ldots, \theta_m} \left ( \bigcap_{i=1}^m \{ \theta_i \in \mathcal{I}_i \} \right ) \ge 1-\alpha
\end{equation*}
whatever the true values $\{\theta_i\}$ are.

Bonferroni's inequality says that for any collection of events $\{A_i\}$,
\begin{equation*}
\mathbb{P} \left ( \cup_i A_i \right ) \le \sum_i \mathbb{P}(A_i).
\end{equation*}
It follows that if $\mathcal{I}_i$ is a confidence interval procedure for $\theta_i$ with coverage probability $1-\alpha_i$, $i=1, \ldots, m$, then the collection of
confidence set procedures $\{\mathcal{I}_i \}_{i=1}^m$ has simultaneous coverage probability
not smaller than $1-\sum_i \alpha_i$.
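
The intermediate step can be written out by taking $A_i$ to be the event that $\mathcal{I}_i$ fails to cover $\theta_i$, which has probability at most $\alpha_i$:
\begin{equation*}
\mathbb{P} \left ( \bigcap_{i=1}^m \{ \theta_i \in \mathcal{I}_i \} \right )
= 1 - \mathbb{P} \left ( \bigcup_{i=1}^m \{ \theta_i \notin \mathcal{I}_i \} \right )
\ge 1 - \sum_{i=1}^m \mathbb{P} ( \theta_i \notin \mathcal{I}_i )
\ge 1 - \sum_{i=1}^m \alpha_i.
\end{equation*}
For instance, two intervals that each have coverage probability $0.975$ have simultaneous coverage probability at least $0.95$.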

In the current situation, if we had simultaneous confidence sets for both $N_{01}$ and $N_{10}$, we could find a confidence set for $\tau$ because $\tau = (N_{01}-N_{10})/N$.  
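
The last step is interval arithmetic: $\tau$ is smallest when $N_{01}$ is as small and $N_{10}$ as large as the simultaneous bounds allow, and largest in the opposite case. The sketch below assumes the simultaneous bounds are already in hand. How to construct them is not shown, and the function name and the numbers are hypothetical.

```python
def tau_interval(lo01, hi01, lo10, hi10, N):
    """Interval for tau = (N01 - N10)/N implied by simultaneous bounds
    lo01 <= N01 <= hi01 and lo10 <= N10 <= hi10."""
    return (lo01 - hi10) / N, (hi01 - lo10) / N

# Hypothetical simultaneous bounds for a study with N = 10 subjects
tau_lo, tau_hi = tau_interval(lo01=2, hi01=5, lo10=1, hi10=3, N=10)
```

Whenever both bounds hold, $\tau$ lies in the resulting interval, so its coverage probability is at least the simultaneous coverage probability of the two bounds.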

#### Second approach: testing all 2x2 tables of potential outcomes

After the randomization, for each subject $j$, we observe either $x_{j0}$ or
$x_{j1}$.
At that point, we know $N$ of the $2N$ numbers $\{x_{jk} : j = 1, \ldots, N; \; k = 0, 1\}$.
The other $N$ numbers--the responses that were not observed--can be any combination of 0s and 1s: there are $2^N$ possibilities in all.

But at the end of the day, all that matters are the four numbers $N_{00}$, $N_{01}$, $N_{10}$, and $N_{11}$.
Together, these sum to $N$. 
The number of ways to split $N$ subjects into 4 counts (nonnegative integers that sum to $N$) can be found by Feller's "bars and stars" argument (see the [notes on nuisance parameters](./nuisance.ipynb));
the answer is $\binom{N+3}{3}$. 
This is $O(N^3)$ tables
([Rigdon and Hudgens (2015)](https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.6384)).
But many of those tables are incompatible with the observed data.
[Li and Ding (2016)](https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.6764) show that taking into account the observed data constrains
the number of tables to $O(N^2)$, greatly speeding the computation.
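
The count $\binom{N+3}{3}$ can be confirmed for small $N$ by listing every candidate table. The generator below is only a sketch of the brute-force enumeration: it does not filter out tables that are incompatible with the observed data, and the name `all_tables` is hypothetical.

```python
from math import comb

def all_tables(N):
    """Yield every (N00, N01, N10, N11) of nonnegative integers summing to N."""
    for n00 in range(N + 1):
        for n01 in range(N - n00 + 1):
            for n10 in range(N - n00 - n01 + 1):
                yield (n00, n01, n10, N - n00 - n01 - n10)

# Compare the number of tables with the bars-and-stars count for small N
table_counts = {N: sum(1 for _ in all_tables(N)) for N in range(8)}
matches = all(table_counts[N] == comb(N + 3, 3) for N in table_counts)
```

Each table would then be tested against the observed data, which is why reducing the number of candidate tables matters for computation.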

The (unknown) complete table of results is as follows:

$w_j$      | $x_{j0}$   | $x_{j1}$ |
:-----:    | :------:   | :------: |
 1         |    0       |    0     |
 1         |    0       |    0     |
 1         |    1       |    0     |
 1         |    1       |    1     |
 1         |    0       |    1     |
 $\cdots$   | $\cdots$   | $\cdots$  |
 0         |    0       |    0     |
 0         |    0       |    1     |
 0         |    1       |    1     |
 $\cdots$   | $\cdots$   | $\cdots$  |




[MORE TO COME]