# The Neyman "potential outcomes" model for causal inference

There are $N$ subjects and $T$ possible treatments.
Each subject is represented by a ticket. 
Ticket $j$ lists $T$ numbers, $(x_{j0}, \ldots, x_{jT-1})$.
The value $x_{jt}$ is the response subject $j$ will have if assigned to treatment $t$.
(Treatment $0$ is commonly control or placebo.)

This mathematical setup embodies the _non-interference_ assumption, which means that
subject $j$'s response depends only on which treatment subject $j$ receives, and not
on the treatment other subjects receive.
(That is not a good assumption in situations like vaccine trials, where whether one subject
becomes infected may depend on which other subjects are vaccinated, if subjects
may come in contact with each other.)

This model is also called the _potential outcomes_ model, because it starts with the
_potential_ outcomes each subject will have to each treatment. 
Assigning a subject to a 
treatment just reveals the potential outcome that corresponds to that treatment, for that subject. 
This model was introduced by Jerzy Neyman, the founder of the U.C. Berkeley Department of Statistics, in a 1923 paper in Polish [translated into English in 1990](https://projecteuclid.org/journals/statistical-science/volume-5/issue-4/On-the-Application-of-Probability-Theory-to-Agricultural-Experiments-Essay/10.1214/ss/1177012031.full).
It was popularized by Donald Rubin in the 1970s and 1980s.

There are generalizations of this model, including one in which the "potential outcomes" are random, rather than deterministic, but their distributions are fixed before assignment to treatment: if subject $j$ is assigned treatment $t$, a draw from the distribution $\mathbb{P}_{jt}$ is observed. 
Draws for different subjects are independent.
We shall see an example of this when we discuss [nuisance parameters](./nuisance.ipynb).

### Null hypotheses for the Neyman model

The _strong_ null hypothesis is that subject by subject, the effect of
all $T$ treatments is the same.
That is,
\begin{equation*}
x_{j0} = x_{j1} = \cdots = x_{jT-1}, \;\; j=1, \ldots, N.
\end{equation*}
Different subjects may have different responses ($x_{jt}$ might not equal $x_{kt}$ if $j \ne k$), but each subject's response is the same regardless of the treatment assigned 
to that subject.
This is the null hypothesis Fisher considered in _The Design of Experiments_ and which
he generally considered the "correct" null in practice.

Suppose $T=2$: we are comparing two treatments. Suppose we assign $n$ subjects at random
to treatment 0 and the other $m = N-n$ to treatment 1.
Let $\{z_j\}_{j=1}^n$ be the responses of the subjects assigned treatment 0
and $\{y_j\}_{j=1}^m$ be the responses of the subjects assigned treatment 1.
(That is, $z_1 = x_{k0}$ if $k$ is the first subject assigned treatment $0$,
and $y_1 = x_{k1}$ if $k$ is the first subject assigned treatment $1$.)
Then testing the strong null hypothesis is identical to the _two-sample problem_:
under the strong null, each subject's response would have been the same, regardless
of treatment, so allocating subjects to treatments and observing their responses
is just randomly partitioning a fixed set of $N$ numbers into a group of size $n$ and a group of size $m$.

The _weak_ null hypothesis is that on average across subjects, all treatments have the same effect. 
That is,
\begin{equation*}
\frac{1}{N} \sum_{j=1}^N x_{j0} = \frac{1}{N} \sum_{j=1}^N x_{j1} = \ldots = \frac{1}{N} \sum_{j=1}^N x_{jT-1}.
\end{equation*}
Much of Neyman's work on experiments involves this null hypothesis.
The statistical theory is more complex for the weak null hypothesis than for the strong null.

The strong null is indeed a stronger hypothesis than the weak null, because if the strong null is true, the weak null must also be true: if $T$ lists are equal, element by element, then their means are equal. 
But the converse is not true: the weak null can be true even if the strong null is false.
For example, for $T=2$ and $N=2$, we might have potential responses $(0, 1)$ for subject 1 and $(1,0)$ for subject 2. The effect of treatment is to increase subject 1's response from 0 to 1 and to decrease subject 2's response from 1 to 0.
The treatment affects both subjects, but its average effect is zero: the average response across subjects is 1/2, with or without treatment.
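
The two-subject example can be checked mechanically. The sketch below is illustrative only: the function names are invented here, and each ticket is stored as a tuple $(x_{j0}, x_{j1})$.

```python
# Potential outcomes for the two-subject example: entry j is (x_j0, x_j1).
tickets = [(0, 1), (1, 0)]

def strong_null_holds(tickets):
    """True if every subject's response is the same under every treatment."""
    return all(len(set(ticket)) == 1 for ticket in tickets)

def treatment_means(tickets):
    """Mean response across subjects, one entry per treatment."""
    n_subjects = len(tickets)
    return [sum(column) / n_subjects for column in zip(*tickets)]

def weak_null_holds(tickets):
    """True if the mean response is the same for every treatment."""
    means = treatment_means(tickets)
    return all(abs(m - means[0]) < 1e-12 for m in means)

print(strong_null_holds(tickets))  # False
print(weak_null_holds(tickets))    # True
print(treatment_means(tickets))    # [0.5, 0.5]
```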

If we can test the weak null, we can also make inferences about the _average treatment effect_
\begin{equation*}
\tau := \frac{1}{N} \sum_{j=1}^N (x_{j1} - x_{j0})
\end{equation*}
(when treatment 1 is being compared to control).

If we can only test the strong null, in general we have to make assumptions about how treatment affects responses in order to make inferences about the average treatment effect.

### Alternative hypotheses in the Neyman model

#### Constant shift

For example, if we assume that the effect of treatment is to shift every subject's response by the same amount, then we can use a test of the strong null to make inferences about that constant effect.
In symbols, this alternative states that there is some number $\tau$
such that $x_{j1} = x_{j0}+\tau$ for all subjects $j$.

Again, once the original data are observed, this hypothesis completely specifies the probability distribution of the data: we know what subject $j$'s response would have been had the subject been assigned the other treatment. If the subject was assigned treatment 0, the response would have been larger by $\tau$ if the subject had been assigned treatment 1 instead. If the subject was assigned treatment 1, the response would have been smaller by $\tau$ if the subject had been assigned treatment 0 instead.
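
A minimal sketch of how a hypothesized shift can be tested, assuming small samples so that every assignment can be enumerated. The function name `shift_p_value`, the difference-in-means statistic, and the two-sided centering at the hypothesized shift are choices made for this illustration, not part of the model.

```python
from itertools import combinations

def shift_p_value(z, y, shift):
    """Two-sided exact p-value for H: x_j1 = x_j0 + shift for every subject j.

    z: responses observed under treatment 0
    y: responses observed under treatment 1
    """
    # Under the hypothesis, y_j - shift is what a treated subject would have
    # shown under treatment 0, so all N control responses are known.
    control = list(z) + [v - shift for v in y]
    N, m = len(control), len(y)
    total = sum(control)

    def statistic(treated):
        # Difference in sample means for a hypothetical assignment: treated
        # subjects reveal x_j0 + shift, the others reveal x_j0.
        treated_sum = sum(control[i] for i in treated)
        return (treated_sum / m + shift) - (total - treated_sum) / (N - m)

    observed = sum(y) / m - sum(z) / (N - m)
    count = 0
    n_assignments = 0
    for treated in combinations(range(N), m):
        n_assignments += 1
        # Under the hypothesis the statistic has expected value `shift`.
        if abs(statistic(treated) - shift) >= abs(observed - shift) - 1e-12:
            count += 1
    return count / n_assignments

z = [1, 2, 3]   # hypothetical control responses
y = [4, 5, 6]   # hypothetical treatment responses
print(shift_p_value(z, y, shift=3))  # 1.0
print(shift_p_value(z, y, shift=0))  # 0.1
```

Collecting the shifts whose p-value exceeds $\alpha$, over a grid of candidate values, gives an approximate confidence set for the constant effect.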

#### Other tractable alternative hypotheses

A more general alternative is that $x_{j1} = f(x_{j0})$ for some strictly monotonic (and thus invertible) function $f$. 
A simple example is that treatment multiplies the response by a constant.

In some contexts, it can be reasonable to assume that treatment can only help, that is that $x_{j1} \ge x_{j0}$, without specifying a functional relationship between them. 

### Testing the strong null hypothesis

Under the strong null that the treatment makes no difference whatsoever--as if 
the response had been predetermined before assignment to treatment or control--the null distribution of any test statistic is completely determined once the data have
been observed: we know what the data would have been for any other random assignment, namely, the same. 
And we know the chance that each of those possible datasets would have resulted from
the experiment, since we know how subjects were assigned at random to treatments.

For alternatives that allow us to find $x_{j0}$ from $x_{j1}$ and vice versa,
the alternative also completely determines the 
probability distribution of any test statistic, once the data have been observed.
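
As an illustration, the null distribution under the strong null can be enumerated exactly when the groups are small. This is a sketch under stated assumptions: the statistic (difference in sample means), the one-sided direction, and the name `exact_permutation_p_value` are choices made here, not prescribed by the text.

```python
from itertools import combinations

def exact_permutation_p_value(z, y):
    """One-sided exact p-value for the strong null, using mean(y) - mean(z).

    z: observed responses of subjects assigned treatment 0
    y: observed responses of subjects assigned treatment 1
    Large values of the statistic count as evidence against the null.
    """
    pooled = list(z) + list(y)
    N, m = len(pooled), len(y)
    total = sum(pooled)

    def statistic(treated_sum):
        return treated_sum / m - (total - treated_sum) / (N - m)

    observed = statistic(sum(y))
    as_extreme = 0
    n_assignments = 0
    # Under the strong null, every choice of m subjects for treatment 1 is
    # equally likely and leaves each subject's response unchanged.
    for treated in combinations(range(N), m):
        n_assignments += 1
        if statistic(sum(pooled[i] for i in treated)) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / n_assignments

# Hypothetical data: only 1 of the 20 possible assignments gives a
# difference in means as large as the one observed.
print(exact_permutation_p_value([1, 2, 3], [4, 5, 6]))  # 0.05
```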

## Two treatments, binary responses.
 
Imagine testing whether a vaccine prevents a disease.
We assign a random sample of $n$ of the $N$ subjects to receive treatment 1;
the other $N-n$ receive a placebo, treatment 0.
Let $W_j$ denote the treatment assigned to subject $j$, so $\sum_{j=1}^N W_j = n$.
After some time has passed, we observe 
\begin{equation*}
X_j := (1-W_j) x_{j0} + W_j x_{j1}, \;\; j=1, \ldots, N.
\end{equation*}
These are random variables, but (in this model)
the only source of randomness is $\{W_j\}$, the 
treatment assignment variables.

The total number of infections among the vaccinated is
\begin{equation*}
   X_1^* := \sum_{j=1}^N W_j x_{j1}
\end{equation*}
and the total among the unvaccinated is
\begin{equation*}
   X_0^* := \sum_{j=1}^N (1-W_j) x_{j0}.
\end{equation*}


Under the strong null that the vaccine makes no difference whatsoever--as if whether a subject would become ill was predetermined before assignment to treatment or control--the number of infections among the vaccinated, $X_1^*$, has a hypergeometric distribution with population size $N$, number of "1"s in the population $G=X_0^*+X_1^*$, and sample size $n$.
Testing the strong null using this hypergeometric distribution yields _Fisher's Exact Test_.
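
A minimal sketch of the resulting test, using only the standard library. The function names and the counts in the example are hypothetical; the alternative considered is that the vaccine reduces infections, so small values of $X_1^*$ are evidence against the null.

```python
from math import comb

def hypergeom_pmf(k, N, G, n):
    """P(k ones in a simple random sample of size n from N items, G of them ones)."""
    if k < max(0, n - (N - G)) or k > min(n, G):
        return 0.0
    return comb(G, k) * comb(N - G, n - k) / comb(N, n)

def fisher_lower_tail_p(x1, x0, N, n):
    """P-value for the strong null against 'the vaccine reduces infections'.

    x1: infections among the n vaccinated subjects
    x0: infections among the N - n unvaccinated subjects
    """
    G = x0 + x1  # total infections, fixed under the strong null
    return sum(hypergeom_pmf(k, N, G, n) for k in range(0, x1 + 1))

# Hypothetical trial: 20 subjects, 10 vaccinated, 1 infection among the
# vaccinated and 6 among the unvaccinated.
p = fisher_lower_tail_p(x1=1, x0=6, N=20, n=10)
print(round(p, 4))
```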

This model can be generalized by considering the total number of infections $X_0^*+X_1^*$ to be random, then conditioning on the observed number of infections to get a conditional test.

### Inference about the size of the treatment effect

Recall that the "weak" null hypothesis is that the average response is the same, regardless
of the assigned treatment (although individuals might respond differently to different treatments).
Define 
\begin{equation*}
\bar{X}_0 := X_0^*/(N-n)
\end{equation*}
and
\begin{equation*}
\bar{X}_1 := X_1^*/n,
\end{equation*}
the observed mean
responses for the control group and the treatment group, respectively.
These are unbiased estimates of 
the corresponding population parameters,
\begin{equation*}
\bar{x}_0 := \frac{1}{N} \sum_{j=1}^N x_{j0}
\end{equation*}
and
\begin{equation*}
\bar{x}_1 := \frac{1}{N} \sum_{j=1}^N x_{j1}.
\end{equation*}

The _average treatment effect_ for the study population is
\begin{equation*}
\tau = \bar{x}_1 - \bar{x}_0 = \frac{1}{N} \sum_{j=1}^N (x_{j1}-x_{j0}) = \frac{1}{N} \sum_{j=1}^N \tau_j,
\end{equation*}
where $\tau_j := x_{j1}-x_{j0}$.
An unbiased estimate of $\tau$ is $\hat{\tau} := \bar{X}_1 - \bar{X}_0$.
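
Unbiasedness can be checked numerically for a small population by averaging $\hat{\tau}$ over every equally likely assignment. The potential outcomes below are made up for illustration, and `tau_hat_mean` is a name introduced here.

```python
from itertools import combinations

def tau_hat_mean(x0, x1, n):
    """Average of tau-hat over all assignments of n subjects to treatment 1."""
    N = len(x0)
    estimates = []
    for treated in combinations(range(N), n):
        treated = set(treated)
        mean_1 = sum(x1[j] for j in treated) / n
        mean_0 = sum(x0[j] for j in range(N) if j not in treated) / (N - n)
        estimates.append(mean_1 - mean_0)
    return sum(estimates) / len(estimates)

# Hypothetical potential outcomes for N = 5 subjects.
x0 = [0, 1, 1, 0, 0]
x1 = [1, 1, 0, 0, 1]
tau = sum(b - a for a, b in zip(x0, x1)) / len(x0)

# The two printed numbers agree up to floating-point rounding.
print(tau, tau_hat_mean(x0, x1, n=2))
```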

Since each $\tau_j$ is either $-1$, $0$, or $1$, the only possible values of
$\tau$ are multiples of $1/N$.
The smallest and largest possible values are $\tau=-1$ (which
occurs if $x_{j1}=0$ and $x_{j0}=1$ for all $j$) and
$\tau=1$ (which occurs if $x_{j1}=1$ and $x_{j0}=0$ for all $j$), a range of 2.

The study population can be summarized by four numbers, $N_{00}$, $N_{01}$, $N_{10}$, and $N_{11}$, where $N_{ik}$ is the number of subjects $j$ for whom 
$x_{j0} = i$ and $x_{j1} = k$, for $i, k \in \{0, 1\}$.
That is, $N_{00}$ is the number of subjects whose response is "0" whether they are assigned to treatment or to control, while $N_{01}$ is the number of subjects whose response would be
"0" if assigned to control and "1" if assigned to treatment, etc.
Of course, $N = N_{00} + N_{01} + N_{10} + N_{11}$.

Define $N_{\cdot 1} := N_{01} + N_{11}$ and $N_{1\cdot} := N_{10} + N_{11}$.
Then $N_{\cdot 1}$ is the number of subjects whose response would be "1" if assigned to treatment and $N_{1\cdot}$ is the number of subjects whose response would be "1" if 
assigned to control.

Now $\sum_{j=1}^N x_{j0} = N_{1\cdot}$ and $\sum_{j=1}^N x_{j1} = N_{\cdot 1}$, so
the average treatment effect can be written
\begin{equation*}
\tau = \frac{1}{N} \sum_{j=1}^N (x_{j1}-x_{j0}) = \frac{1}{N} (N_{\cdot 1} - N_{1\cdot}).
\end{equation*}

Once the data have been observed, we know that $N_{1\cdot}$ is at least $X_0^*$
(we saw that many ones in the control group)
and at most $X_0^* + n$ (if every unobserved control response were one).
Similarly, we know that $N_{\cdot 1}$ is at least $X_1^*$
(we saw that many ones in the treatment group)
and at most $X_1^* + (N-n)$ (if every unobserved 
treatment response were one).
Their difference is thus between
\begin{equation*}
X_1^* - X_0^* - n 
\end{equation*}
and
\begin{equation*}
X_1^* - X_0^* + N - n
\end{equation*}
so
\begin{equation*}
\tau \in \{ (X_1^* - X_0^* - n)/N, (X_1^* - X_0^* - n + 1)/N, \ldots, (X_1^* - X_0^* + N-n)/N \},
\end{equation*}
which has range $1$.
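
The bounds that hold with certainty once the data are in can be listed directly. This sketch uses exact fractions; `possible_taus` and the counts in the example are hypothetical.

```python
from fractions import Fraction

def possible_taus(x1_star, x0_star, N, n):
    """All values of tau consistent with the observed counts, in increasing order."""
    # Lowest: every unobserved treatment response is 0 and every
    # unobserved control response is 1.
    lowest = x1_star - x0_star - n
    # Highest: every unobserved treatment response is 1 and every
    # unobserved control response is 0.
    highest = x1_star - x0_star + (N - n)
    return [Fraction(d, N) for d in range(lowest, highest + 1)]

# Hypothetical data: N = 20, n = 10, one "1" among the treated, six among controls.
taus = possible_taus(x1_star=1, x0_star=6, N=20, n=10)
print(taus[0], taus[-1], len(taus))  # -3/4 1/4 21
```

Observing the data halves the a priori range of 2, whatever the counts turn out to be.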

#### First approach to confidence bounds for $\tau$: Bonferroni simultaneous confidence intervals

A collection of confidence set procedures $\{\mathcal{I}_i \}_{i=1}^m$ for a corresponding set of parameters $\{ \theta_i \}_{i=1}^m$ has _simultaneous coverage probability_ $1-\alpha$ if
\begin{equation*}
\mathbb{P}_{\theta_1, \ldots, \theta_m} \left ( \bigcap_{i=1}^m \{ \theta_i \in \mathcal{I}_i \} \right ) \ge 1-\alpha
\end{equation*}
whatever the true values $\{\theta_i\}$ are.

Bonferroni's inequality says that for any collection of events $\{A_i\}$,
\begin{equation*}
\mathbb{P} \left ( \bigcup_i A_i \right ) \le \sum_i \mathbb{P}(A_i).
\end{equation*}
It follows that if $\mathcal{I}_i$ is a confidence interval procedure for $\theta_i$ with coverage probability $1-\alpha_i$, $i=1, \ldots, m$, then the collection of $m$
confidence set procedures $\{\mathcal{I}_i \}_{i=1}^m$ has simultaneous coverage probability
not smaller than $1-\sum_{i=1}^m \alpha_i$.

In the current situation, if we had simultaneous confidence sets for both $N_{\cdot 1}$ and $N_{1\cdot}$, we could find a confidence set for $\tau$, because 
$\tau = (N_{\cdot 1} - N_{1\cdot})/N$.

Now, $X_1^*$ is the number of "1"s in a random sample of size $n$ from a population of size $N$ of which $N_{\cdot 1}$ are labeled "1" and the rest are labeled "0."
Similarly, $X_0^*$ is the number of "1"s in a random sample of size $N-n$ from a population of size $N$ of which $N_{1\cdot}$ are labeled "1" and the rest are labeled "0."
That is, the probability distribution of $X_1^*$ is 
hypergeometric with parameters $N$, $G=N_{\cdot 1}$, and $n$,
and the probability distribution of $X_0^*$ is 
hypergeometric with parameters $N$, $G=N_{1 \cdot}$, and $N-n$, but $X_1^*$ and $X_0^*$ are
dependent, because the control group consists of exactly the subjects not assigned to treatment.

To construct an _upper_ 1-sided $1-\alpha$ confidence bound for $\tau$, we can find an _upper_ 1-sided $1-\alpha/2$ confidence bound for $N_{\cdot 1}$, subtract a _lower_ 1-sided $1-\alpha/2$ confidence bound for $N_{1 \cdot}$, and divide the result by $N$.

To construct a _lower_ 1-sided $1-\alpha$ confidence bound for $\tau$, we can find a _lower_ 1-sided $1-\alpha/2$ confidence bound for $N_{\cdot 1}$, subtract an _upper_ 1-sided $1-\alpha/2$ confidence bound for $N_{1 \cdot}$, and divide the result by $N$.

To construct a 2-sided confidence interval for $\tau$, we can find a 
2-sided $1-\alpha/2$ confidence interval for $N_{\cdot 1}$ and a 2-sided $1-\alpha/2$ confidence interval for $N_{1 \cdot}$. 
The lower endpoint of the $1-\alpha$ confidence interval for $\tau$ 
is the lower endpoint of the 2-sided interval for $N_{\cdot 1}$  minus the upper
endpoint of the 2-sided interval for $N_{1 \cdot}$, divided by $N$.
The upper endpoint of the $1-\alpha$ confidence interval for $\tau$ 
is the upper endpoint of the 2-sided interval for $N_{\cdot 1}$  minus the lower
endpoint of the 2-sided interval for $N_{1 \cdot}$, divided by $N$.
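
The two-sided construction can be sketched as follows, using only the standard library. Two things are assumptions of this sketch rather than part of the recipe: each 2-sided $1-\alpha/2$ interval is formed from two 1-sided bounds at level $1-\alpha/4$ (valid but conservative), and each 1-sided bound is found by inverting a hypergeometric tail test. All function names are hypothetical.

```python
from math import comb

def pmf(k, N, G, n):
    """Hypergeometric P(X = k): sample of size n from N items, G of which are 1."""
    if k < max(0, n - (N - G)) or k > min(n, G):
        return 0.0
    return comb(G, k) * comb(N - G, n - k) / comb(N, n)

def lower_bound_G(x, N, n, alpha):
    """Smallest G for which P_G(X >= x) > alpha: a lower 1 - alpha bound."""
    for G in range(x, N - (n - x) + 1):      # only these G can produce x
        if sum(pmf(k, N, G, n) for k in range(x, n + 1)) > alpha:
            return G
    return N - (n - x)

def upper_bound_G(x, N, n, alpha):
    """Largest G for which P_G(X <= x) > alpha: an upper 1 - alpha bound."""
    for G in range(N - (n - x), x - 1, -1):
        if sum(pmf(k, N, G, n) for k in range(0, x + 1)) > alpha:
            return G
    return x

def bonferroni_interval_tau(x1_star, x0_star, N, n, alpha):
    """Two-sided interval for tau with coverage at least 1 - alpha."""
    a = alpha / 4  # four one-sided bounds, each at level 1 - alpha/4
    lo_dot1 = lower_bound_G(x1_star, N, n, a)       # bounds for N_{.1}
    hi_dot1 = upper_bound_G(x1_star, N, n, a)
    lo_1dot = lower_bound_G(x0_star, N, N - n, a)   # bounds for N_{1.}
    hi_1dot = upper_bound_G(x0_star, N, N - n, a)
    return (lo_dot1 - hi_1dot) / N, (hi_dot1 - lo_1dot) / N

# Hypothetical data: N = 20, n = 10, X_1^* = 1, X_0^* = 6.
lo, hi = bonferroni_interval_tau(1, 6, N=20, n=10, alpha=0.05)
print(lo, hi)
```

With so few subjects the interval is wide, which illustrates how conservative the Bonferroni construction can be.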

This approach has a number of advantages: it is conceptually simple, conservative, and
only requires the ability to compute confidence intervals for $G$ for
hypergeometric distributions.
But the intervals will in general be quite conservative, i.e., unnecessarily wide.
We now examine sharper approaches.

#### Second approach: testing all tables of potential outcomes

After the randomization, for each subject $j$, we observe either $x_{j0}$ or
$x_{j1}$.
At that point, we know $N$ of the $2N$ numbers $\{x_{jk}\}_{j=1}^N{}_{k=0}^1$.
The other $N$ numbers--the responses that were not observed--can be any combination of 0s and 1s: there are $2^N$ possibilities in all.
But not all of those yield distinguishable tables of results.

At the end of the day, the average treatment effect $\tau$ is determined by
the four numbers, $N_{00}$, $N_{01}$, $N_{10}$, and $N_{11}$, which sum to $N$. 
How many possible values are there for those four numbers?
The number of ways of partitioning $N$ items into 4 groups can be found by Feller's "stars and bars" argument (see the [notes on nuisance parameters](./nuisance.ipynb));
the answer is $\binom{N+3}{3} = (N+3)(N+2)(N+1)/6$. 
This is $O(N^3)$ tables
(see [Rigdon and Hudgens (2015)](https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.6384)).
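
The count can be confirmed by brute force for a small population. The generator name below is hypothetical.

```python
from math import comb

def potential_outcome_tables(N):
    """Yield every (N00, N01, N10, N11) of nonnegative integers summing to N."""
    for n00 in range(N + 1):
        for n01 in range(N - n00 + 1):
            for n10 in range(N - n00 - n01 + 1):
                yield n00, n01, n10, N - n00 - n01 - n10

N = 10
tables = list(potential_outcome_tables(N))
print(len(tables), comb(N + 3, 3))  # 286 286
```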

But many of those tables are incompatible with the observed data.
For instance, we know that $ N_{1\cdot} \ge X_0^*$ and $N_{\cdot 1} \ge X_1^*$.
[Li and Ding (2016)](https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.6764) show that taking into account the observed data constrains
the number of tables to $O(N^2)$, greatly speeding the computation.



**A permutation approach**

Together, $N_{00}$, $N_{01}$, $N_{10}$, and $N_{11}$ determine the sampling distribution of any statistic, through the random allocation of subjects to treatments.
To test the null hypothesis $H_0: \{N_{00}=N_{00}^0, N_{01}=N_{01}^0, N_{10}=N_{10}^0, N_{11}=N_{11}^0 \}$, we can define some function $T$ of $X_0^*$ and $X_1^*$ and reject $H_0$ if the observed value of $T$ is in the tail of the probability distribution corresponding to
$H_0$.

What should we use for $T$? Since we are interested in $\tau$,
\begin{equation*}
\hat{\tau} := \bar{X}_1 - \bar{X}_0 = X_1^*/n - X_0^*/(N-n)
\end{equation*}
an unbiased estimator of $\tau$, is a sensible choice (although not necessarily optimal in any sense).

Recall that $N_{\cdot 1} := N_{01} + N_{11}$, $N_{1\cdot} := N_{10} + N_{11}$,
and 
\begin{equation*}
\tau = (N_{\cdot 1} - N_{1\cdot})/N = (N_{01}-N_{10})/N.
\end{equation*}
If $H_0$ is true, then $\tau = \tau_0 = (N_{01}^0 - N_{10}^0)/N$.
If we do not reject $H_0$, then we cannot reject the hypothesis $\tau = \tau_0$.
We can reject the hypothesis $\tau = \tau_0$ if we can reject _every_ table for which
$\tau = \tau_0$.

We might calibrate the test by finding the exact probability distribution of $T$ under $H_0$:
for each of the $\binom{N}{n}$ equally likely treatment assignments, we can find
the corresponding value of $T$.
When $N$ is large and $n$ is not close to $0$ or $N$, $\binom{N}{n}$ is too large for that enumeration to be practical.
Instead, we might approximate the probability distribution by simulation: actually drawing
pseudo-random samples.
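
A minimal simulation sketch for one hypothesized table, using the standard-library generator. The function name, the default number of replications, the seed, and the choice to measure extremeness by $|\hat{\tau} - \tau_0|$ are all assumptions of this illustration.

```python
import random

def simulated_p_value(table, n, tau_hat_obs, reps=10_000, seed=12345):
    """Estimate P(|tau_hat - tau_0| >= |tau_hat_obs - tau_0|) under the table H_0.

    table: hypothesized (N00, N01, N10, N11), where N_ik counts subjects
           with control response i and treatment response k
    n:     number of subjects assigned to treatment 1
    """
    N00, N01, N10, N11 = table
    N = sum(table)
    tau_0 = (N01 - N10) / N
    # One ticket (x_j0, x_j1) per subject, as the hypothesized table dictates.
    tickets = [(0, 0)] * N00 + [(0, 1)] * N01 + [(1, 0)] * N10 + [(1, 1)] * N11
    rng = random.Random(seed)
    threshold = abs(tau_hat_obs - tau_0) - 1e-12
    hits = 0
    for _ in range(reps):
        treated = set(rng.sample(range(N), n))   # simple random sample of size n
        x1_star = sum(tickets[j][1] for j in treated)
        x0_star = sum(tickets[j][0] for j in range(N) if j not in treated)
        tau_hat = x1_star / n - x0_star / (N - n)
        if abs(tau_hat - tau_0) >= threshold:
            hits += 1
    return hits / reps

# Hypothetical table and observed estimate.
print(simulated_p_value((3, 2, 1, 4), n=5, tau_hat_obs=0.6, reps=2000))
```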

[MORE TO COME]

**Enumerating all feasible 2x2 tables**

Constraints on $N_{ik}$:

\begin{equation*}
N_{10} + N_{11} \ge X_0^*, \qquad N_{01} + N_{11} \ge X_1^*.
\end{equation*}

Define 
\begin{equation*}
n_{wk} := \sum_{j=1}^N 1_{W_j=w, x_{jw}=k}
\end{equation*}
for $w \in \{0, 1\}$ and $k \in \{0, 1\}$.
That is, $n_{00} = (N-n)-X_0^*$, $n_{01} = X_0^*$, $n_{10} = n-X_1^*$, and
$n_{11} = X_1^*$.
Clearly, $\sum_{w,k} n_{wk} = N$: these numbers count the elements in 
a partition of the subjects.

[Rigdon and Hudgens (2015)](https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.6384) show that it suffices to examine 
$n_{RH} := (n_{11} + 1)(n_{10} + 1)(n_{01} + 1)(n_{00} + 1)$ 2 by 2 tables to find confidence bounds
for $\tau$ using $\hat{\tau}$ as the test statistic.

The argument is as follows: Consider the $n_{11}$ subjects who were assigned to treatment $w=1$ and whose response was $x_{j1}=1$.
Fix the unobserved values of the remaining $N-n_{11}$ subjects.
As the unobserved responses of those $n_{11}$ vary, the value of $\tau$ 
and the probability distribution of $T$ depend only on how many of them
have $x_{j0}=0$ and how many have $x_{j0}=1$.
The number $m$ for which $x_{j0}=0$ could be 0, 1, $\ldots$, or $n_{11}$;
the other $n_{11}-m$ have $x_{j0}=1$. 
There are thus only $n_{11}+1$ tables that need to be examined,
given the unobserved values in the other three groups.
By an analogous argument, as the unobserved values of the $n_{01}$
subjects vary, holding constant the unobserved values for the rest, there are at
most $n_{01}+1$ distinct values of $\tau$ and distinct probability distributions for
$T$. The same goes for $n_{10}$ and $n_{00}$. By the fundamental rule of counting,
the total number of tables that give rise to distinguishable probability distributions
for $T$ is thus at most $n_{RH}$.

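[Li and Ding (2016)](https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.6764) show that
a table is consistent with the observed values $n_{wk}$ iff some integer lies between the lower and upper bounds on the number of treated subjects with $x_{j0} = x_{j1} = 1$.
In this section's convention (the first index of $N_{ik}$ is the response under control, the second is the response under treatment, and $N_{1\cdot} = N_{10} + N_{11}$), the condition is

\begin{equation*}
\max \{0,n_{11}-N_{01}, N_{11}-n_{01}, N_{1\cdot} - n_{10} - n_{01} \}
  \le
\min \{N_{11}, n_{11}, N_{1\cdot}-n_{01}, N - N_{01} - n_{01} - n_{10} \}.
\end{equation*}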
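
Compatibility can also be checked directly from its definition: a table is consistent with the observed counts if some way of assigning its subjects to treatment reproduces them. The sketch below does that by exhaustive search and compares the answer with an interval condition on the number of treated subjects with $x_{j0} = x_{j1} = 1$, written in this section's index order ($N_{ik}$ counts subjects with control response $i$ and treatment response $k$; $n_{wk}$ counts subjects assigned $w$ with response $k$). The function names are hypothetical.

```python
from itertools import product

def compatible_by_search(table, observed):
    """True if some allocation of the table's subjects yields `observed`.

    table:    (N00, N01, N10, N11)
    observed: (n00, n01, n10, n11)
    """
    N00, N01, N10, N11 = table
    # a_ik = number of type-(i, k) subjects assigned to treatment 1.
    for a00, a01, a10, a11 in product(range(N00 + 1), range(N01 + 1),
                                      range(N10 + 1), range(N11 + 1)):
        treated_zeros = a00 + a10                   # treated reveal x_j1
        treated_ones = a01 + a11
        control_zeros = (N00 - a00) + (N01 - a01)   # controls reveal x_j0
        control_ones = (N10 - a10) + (N11 - a11)
        if (control_zeros, control_ones, treated_zeros, treated_ones) == tuple(observed):
            return True
    return False

def compatible_by_inequality(table, observed):
    """Same question, answered by bounding the treated type-(1, 1) count."""
    N00, N01, N10, N11 = table
    n00, n01, n10, n11 = observed
    N = sum(table)
    if N != sum(observed):
        return False
    N1dot = N10 + N11   # subjects whose control response is 1
    lower = max(0, n11 - N01, N11 - n01, N1dot - n10 - n01)
    upper = min(N11, n11, N1dot - n01, N - N01 - n01 - n10)
    return lower <= upper

# Two subjects, both with x_j0 = 0 and x_j1 = 1; one is treated.
print(compatible_by_search((0, 2, 0, 0), (1, 0, 0, 1)))      # True
print(compatible_by_inequality((0, 2, 0, 0), (1, 0, 0, 1)))  # True
```

The exhaustive search is only practical for small $N$; its role here is to cross-check the inequality, which is the version one would use when enumerating tables for larger studies.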