Inspired by: https://mc-stan.org/users/documentation/case-studies/pool-binary-trials.html

In [1]:
library(tidyverse)
library(yaml)

── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [2]:
data_config_id <- "pooling_sim"

In [3]:
data_config <- yaml.load_file(paste0("../data/configs/", data_config_id, "/data.yaml"))

In [4]:
data_base_dir <- paste0("../", data_config$output_dir)
data_path <- paste0(data_base_dir, "/data.csv")
if (!dir.exists(data_base_dir)) {
    dir.create(data_base_dir, recursive = TRUE)
}

In [5]:
N_per_group <- data_config$N_per_group
N_groups <- data_config$N_groups
N_obs <- N_groups * N_per_group

beta_0 <- data_config$beta_0
beta_1 <- data_config$beta_1
sigma_group <- data_config$sigma_group
sigma_individual <- data_config$sigma_individual
sigma_measurement <- data_config$sigma_measurement

min_x <- data_config$min_x
max_x <- data_config$max_x

data_seed <- data_config$data_seed
n_measurements <- data_config$n_measurements
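
Every value above is read from `data.yaml`, and in R a missing list element comes back as `NULL` rather than raising an error, so a typo in the config would only surface later (for example, `N_obs` would become a zero-length vector). The sketch below shows one possible way to check for required keys up front. The inline YAML and all of its values are hypothetical placeholders, since the real `data.yaml` is not shown here, and `example_config` and `required_keys` are names introduced only for this illustration.

```r
library(yaml)

# Hypothetical config contents (assumption): the real values live in data.yaml.
example_config <- yaml.load("
output_dir: data/generated/pooling_sim
N_per_group: 10
N_groups: 5
beta_0: 1.0
beta_1: 2.0
sigma_group: 0.5
sigma_individual: 0.3
sigma_measurement: 0.1
min_x: 0
max_x: 1
data_seed: 42
n_measurements: 3
")

# Keys that the cells above read from the config
required_keys <- c(
    "output_dir", "N_per_group", "N_groups", "beta_0", "beta_1",
    "sigma_group", "sigma_individual", "sigma_measurement",
    "min_x", "max_x", "data_seed", "n_measurements"
)

# Fail early, naming every key that is absent
missing_keys <- setdiff(required_keys, names(example_config))
if (length(missing_keys) > 0) {
    stop("Missing config keys: ", paste(missing_keys, collapse = ", "))
}
```

The same check could be run on `data_config` right after it is loaded.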

The true data-generating process is:

$y_{ijr} = \beta_{0} + \beta_{1} X_{ij} + u_{0j} + p_{0i} + e_{ijr}$

Where:
- $r$ denotes the measurement number (we assume repeated measurements)
- $y_{ijr} \sim \mathcal{N}(\beta_{0} + \beta_{1} X_{ij} + u_{0j} + p_{0i}, \sigma_{\text{measurement}}^2)$ = measurement $r$ of the response for individual $i$ in group $j$, conditional on the group and individual effects
- $u_{0j} \sim \mathcal{N}(0, \sigma_{\text{group}}^2)$ = the group-level effect for group $j$
- $p_{0i} \sim \mathcal{N}(0, \sigma_{\text{individual}}^2)$ = the individual-level effect for individual $i$
- $X_{ij}$ = covariate value for individual $i$ in group $j$ (drawn uniformly between `min_x` and `max_x` in the simulation)
- $e_{ijr} \sim \mathcal{N}(0, \sigma_{\text{measurement}}^2)$ = measurement error for a given observation
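
If the three random terms are independent, as they are when each is drawn separately, then conditional on $X_{ij}$ the variance of a single observation is $\sigma_{\text{group}}^2 + \sigma_{\text{individual}}^2 + \sigma_{\text{measurement}}^2$. The base R sketch below checks this numerically. All parameter values and variable names in it are hypothetical and chosen only for illustration; they are not the values in `data.yaml`.

```r
set.seed(1)  # arbitrary seed, used only for this illustration

# Hypothetical parameter values (assumptions, not taken from data.yaml)
n_groups <- 200
n_per_group <- 20
n_meas <- 3
sd_group <- 1.0
sd_indiv <- 0.5
sd_meas <- 0.25
b0 <- 2
b1 <- 0.5

n_indiv <- n_groups * n_per_group
group_of_indiv <- rep(seq_len(n_groups), each = n_per_group)

u <- rnorm(n_groups, mean = 0, sd = sd_group)  # group effects u_{0j}
p <- rnorm(n_indiv, mean = 0, sd = sd_indiv)   # individual effects p_{0i}
x <- runif(n_indiv)                            # covariate X_{ij}

# One row per (individual, measurement) pair
indiv_of_row <- rep(seq_len(n_indiv), each = n_meas)
e <- rnorm(n_indiv * n_meas, mean = 0, sd = sd_meas)  # measurement error e_{ijr}

y <- b0 + b1 * x[indiv_of_row] +
    u[group_of_indiv[indiv_of_row]] + p[indiv_of_row] + e

# Remove the fixed part, then compare the spread with the sum of variances
random_part <- y - (b0 + b1 * x[indiv_of_row])
empirical_var <- var(random_part)
theoretical_var <- sd_group^2 + sd_indiv^2 + sd_meas^2
```

With enough groups, `empirical_var` should land close to `theoretical_var`. The agreement is limited mainly by the number of groups, because the group effect is the largest component here and only `n_groups` independent draws of it exist.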

## Simulate data

In [6]:
set.seed(data_seed)

In [7]:
data_df <- tibble(
        # One row per group, each with its own random intercept u_{0j}
        group_id = seq_len(N_groups),
        group_effect = rnorm(N_groups, mean = 0, sd = sigma_group)
    ) %>%
    inner_join(
        # One row per individual: covariate x and random intercept p_{0i}
        tibble(
            group_id = rep(seq_len(N_groups), each = N_per_group),
            x = runif(N_obs, min = min_x, max = max_x),
            indiv_effect = rnorm(N_obs, mean = 0, sd = sigma_individual),
            indiv_id = seq_len(N_obs)
        ),
        by = "group_id"
    ) %>%
    # Repeat each individual's row once per measurement
    slice(rep(1:n(), each = n_measurements)) %>%
    group_by(indiv_id) %>%
    mutate(measurement_id = row_number()) %>%
    ungroup() %>%
    mutate(
        # Independent measurement error e_{ijr} for every row
        measurement_error_term = rnorm(n(), mean = 0, sd = sigma_measurement),
        y = beta_0 + beta_1 * x + group_effect + indiv_effect + measurement_error_term
    )
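
The row-replication step in the pipeline above can be easier to follow on a tiny table. The sketch below uses a hypothetical three-row tibble (`toy`) and a hypothetical measurement count `k`, and shows that `slice(rep(1:n(), each = k))` repeats each row `k` times in consecutive positions, so a per-individual counter then numbers the measurements 1 to `k`.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: three individuals with made-up effects
toy <- tibble(indiv_id = 1:3, indiv_effect = c(0.1, -0.2, 0.3))
k <- 2  # hypothetical number of repeated measurements

toy_long <- toy %>%
    slice(rep(1:n(), each = k)) %>%        # rows 1, 1, 2, 2, 3, 3
    group_by(indiv_id) %>%
    mutate(measurement_id = row_number()) %>%  # counter within each individual
    ungroup()
```

Because the individual-level columns are copied along with the row, every measurement for an individual shares the same `indiv_effect`, which is what makes it an individual-level effect rather than per-measurement noise.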

In [8]:
data_df %>%
    write_csv(data_path)

In [9]:
data_df %>% pull(y) %>% quantile()

TODO: specify a subset of data to hold out