In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from cmdstanpy import CmdStanModel

Below we repeat the analysis that was done in the original simulation.

In [3]:
# Load the data
chicks = pd.read_table("chickens.dat", sep=r"\s+")
J = len(chicks)
x = chicks["freq"]
y0 = chicks["sham_est"] - 1
se0 = chicks["sham_se"]
n0 = chicks["sham_n"]
y1 = chicks["exposed_est"] - 1
se1 = chicks["exposed_se"]
n1 = chicks["exposed_n"]
diff = y1 - y0
diff_se = np.sqrt(se1**2 + se0**2)


In [4]:
# Prepare data for Stan model
y = list(y0) + list(y1)
se = list(se0) + list(se1)
expt_id = list(range(1, J + 1)) + list(range(1, J + 1))
z = J * [0] + J * [1]  # treatment: 0 = sham, 1 = exposed
chick_data = {
    "N": 2 * J,
    "J": J,
    "y": y,
    "se": se,
    "x": x,
    "expt_id": expt_id,
    "z": z,
    "diff": list(diff),
    "diff_se": list(diff_se),
}

In [5]:
model = CmdStanModel(stan_file="chickens-no-corr-hier.stan")

In [6]:
fit = model.sample(data=chick_data, adapt_delta=0.9)

14:33:39 - cmdstanpy - INFO - CmdStan start processing



14:33:40 - cmdstanpy - INFO - CmdStan done processing.





In [7]:
theta = fit.stan_variable("theta")
theta_mean = np.mean(theta, axis=0)
theta_se = np.std(theta, axis=0)

The function below simulates a dataset of `num_experiments` experiments, all assumed to allocate the same proportion `prop_treatment` of subjects to the treatment group. One complication is that the original simulation assumes every experiment has the same number of subjects in its control and treatment groups.

The original assumptions are:

$$
\begin{align*}
    y_{j1} &\sim N(\theta_j + b_j, s_{j1}) \\
    y_{j0} &\sim N(b_j, s_{j0}) \\
    \theta_j &\sim N(\mu_\theta, \sigma_\theta) \\
    b_j &\sim N(\mu_b, \sigma_b)
\end{align*}
$$

These assumptions mostly still work here since $s_{j1}$ and $s_{j0}$ need not be the same for all experiments $j$. However, in the original simulation, it was assumed that $s_{j1} = s_{j0} = 0.04$ for all $j$. This is reasonable if the treatment and control groups have the same size and if we expect the treatment/sham effects to have similar spread, which was empirically shown for the chicken dataset. 

However, if the groups have unequal numbers of subjects, then we expect the group with fewer subjects to have higher variance. We let $Y_{j0}^{(i)}$ and $Y_{j1}^{(k)}$ be the outcomes for the $i$-th and $k$-th individuals in the control and treatment groups respectively for experiment $j$. Let the $j$-th experiment have $N_{j0}$ and $N_{j1}$ subjects in control and treatment groups respectively. If the number of subjects in each group is at least 30, then by the CLT we have

$$ y_{j1} = \frac{1}{N_{j1}} \sum_{k=1}^{N_{j1}} Y_{j1}^{(k)} \overset{\text{approx}}{\sim} N(\mu_{j1}, \sigma_{j1}) $$

where $\sigma_{j1} = \sqrt{\frac{\text{var}(Y_{j1}^{(k)})}{N_{j1}}}$ and similarly for $y_{j0}$. We identify $\mu_{j1}$ with $\theta_j + b_j$ and $\sigma_{j1}$ with $s_{j1}$. Since $s_{j1} = 0.04$ in the original simulation, we use $\text{var}(Y_{j1}^{(k)}) = 32 \times (0.04)^2$ for all experiments $j$ and individuals $k$, so that a group of 32 subjects reproduces $s_{j1} = 0.04$; likewise for $\text{var}(Y_{j0}^{(i)})$.
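
As a quick check of this scaling, the sketch below recovers the per-subject standard deviation implied by $32 \times (0.04)^2$ and evaluates the resulting standard errors for a few group sizes. The group sizes other than 32 and the variable names are illustrative assumptions, not values from the original simulation.

```python
import numpy as np

# Per-subject variance chosen so that a group of 32 subjects has standard error 0.04.
var_individual = 32 * 0.04**2
sd_individual = np.sqrt(var_individual)

# Hypothetical group sizes; only 32 corresponds to the original simulation.
group_sizes = np.array([8, 16, 32, 64])
standard_errors = sd_individual / np.sqrt(group_sizes)

for n, se in zip(group_sizes, standard_errors):
    print(f"N = {n:3d}: standard error = {se:.4f}")
```

Halving the group size inflates the standard error by a factor of $\sqrt{2}$, which is why an unbalanced split makes the smaller group noisier.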

*Thought: What if the sample sizes are too small for the CLT? Maybe bootstrap or make distributional assumptions on $Y_{j1}^{(k)}$?*
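
One way to explore this thought is a nonparametric bootstrap of the group mean. The sketch below is only an illustration: the skewed (exponential) outcome distribution, the group size of 10, and the helper name `bootstrap_mean_se` are assumptions made for the example, not part of the original analysis.

```python
import numpy as np


def bootstrap_mean_se(sample, num_resamples=2000, rng=None):
    """Estimate the standard error of the sample mean by resampling with replacement."""
    rng = np.random.default_rng() if rng is None else rng
    sample = np.asarray(sample)
    # Each row of idx selects one bootstrap resample of the same size as the sample.
    idx = rng.integers(0, len(sample), size=(num_resamples, len(sample)))
    return sample[idx].mean(axis=1).std(ddof=1)


rng = np.random.default_rng(0)
small_group = rng.exponential(scale=1.0, size=10)  # hypothetical skewed outcomes
se_boot = bootstrap_mean_se(small_group, rng=rng)
se_formula = small_group.std(ddof=1) / np.sqrt(len(small_group))
print(f"bootstrap SE: {se_boot:.3f}, plug-in SE: {se_formula:.3f}")
```

The bootstrap gives a standard error without appealing to the CLT, although with very few subjects the resampled means can still describe the tails poorly.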

Here is a more general treatment. Assuming random allocation between groups, it seems reasonable to model 

$$
\begin{align}
    Y_{j1}^{(k)} &\overset{\mathrm{iid}}{\sim} F_{j1}, \quad \forall k = 1, \ldots, N_{j1} \\
    Y_{j0}^{(i)} &\overset{\mathrm{iid}}{\sim} F_{j0}, \quad \forall i = 1, \ldots, N_{j0}
\end{align}
$$

for some distributions $F_{j1}$ and $F_{j0}$. In practice, we expect these distributions to vary over $j$ due to differences in the experimental setup or sampling from different populations (intentional or not). However, we make the simplifying assumption that $F_{j1} = F_1$ and $F_{j0} = F_0$ for all experiments $j$.

Further, we assume
$$
\begin{align}
    F_1 &= N(\mu_1, \sigma_1) \\
    F_0 &= N(\mu_0, \sigma_0)
\end{align}
$$

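Under this assumption the group means are exactly normal. Even if the individual outcomes are not normal, the same distribution holds approximately when $N_{j1} \geq 30$ and $N_{j0} \geq 30$ for all $j$, by the CLT argument above. This yields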

$$y_{j1} = \frac{1}{N_{j1}} \sum_{k=1}^{N_{j1}} Y_{j1}^{(k)} \sim N(\mu_1, \sigma_1 / \sqrt{N_{j1}})$$
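
A small Monte Carlo check of this result, as a sketch: the values of $\mu_1$, $\sigma_1$, and $N_{j1}$ below are arbitrary illustrative choices, not parameters from the original simulation.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_1, sigma_1, n_j1 = 0.1, 0.2, 40  # arbitrary illustrative values

# Simulate many treatment groups of size n_j1 and compute each group's mean.
outcomes = rng.normal(mu_1, sigma_1, size=(20_000, n_j1))
group_means = outcomes.mean(axis=1)

empirical_sd = group_means.std(ddof=1)
theoretical_sd = sigma_1 / np.sqrt(n_j1)
print(f"empirical: {empirical_sd:.4f}, theoretical: {theoretical_sd:.4f}")
```

The spread of the simulated group means should sit close to $\sigma_1 / \sqrt{N_{j1}}$, up to Monte Carlo error.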

In [None]:
def simulate_experiments(
    num_experiments,
    num_subjects_per_experiment,
    prop_treatment,
    true_b,
    true_theta,
    sigma_b,
    sigma_theta,
    sigma_treatment,
    sigma_control,
):
    # Split each experiment's subjects between the treatment and control groups.
    num_treated = np.floor(prop_treatment * num_subjects_per_experiment).astype(int)
    num_control = num_subjects_per_experiment - num_treated

    # Standard errors of the group means: sigma / sqrt(group size).
    sigma_y0 = sigma_control / np.sqrt(num_control)
    sigma_y1 = sigma_treatment / np.sqrt(num_treated)

    # Draw experiment-level baselines and treatment effects, then the observed group means.
    b = np.random.normal(true_b, sigma_b, num_experiments)
    theta = np.random.normal(true_theta, sigma_theta, num_experiments)
    y_control = np.random.normal(b, sigma_y0)
    y_treated = np.random.normal(theta + b, sigma_y1)

    return {
        "N": 2 * num_experiments,
        "J": num_experiments,
        "P": prop_treatment,
        "y": np.concatenate([y_control, y_treated]),
        # np.concatenate rejects scalars, so repeat each standard error once per experiment.
        "se": np.concatenate(
            [np.full(num_experiments, sigma_y0), np.full(num_experiments, sigma_y1)]
        ),
        "x": np.ones(num_experiments),
        "expt_id": np.tile(np.arange(1, num_experiments + 1), 2),
        # Integer treatment indicator: 0 = sham, 1 = exposed.
        "z": np.repeat([0, 1], num_experiments),
    }

np.float64(0.03726713045384946)