# DGP generate_classic_rct_26

### Math Explanation of the `generate_classic_rct_26` DGP

The `generate_classic_rct_26` function generates a synthetic dataset for a **Classic Randomized Controlled Trial (RCT)**. In this scenario, treatment is assigned completely at random, and covariates $X$ affect the outcome (prognostic) but do not influence the treatment assignment (no confounding).

By default, it simulates a **conversion** experiment (binary outcome) with 10,000 samples and a 50/50 split.


#### 1. Covariate Generation (Confounders)
Three binary covariates $X = [x_1, x_2, x_3]$ are generated independently:
*   **`platform_ios`** ($x_1$): $x_1 \sim \text{Bernoulli}(0.5)$
*   **`country_usa`** ($x_2$): $x_2 \sim \text{Bernoulli}(0.6)$
*   **`source_paid`** ($x_3$): $x_3 \sim \text{Bernoulli}(0.3)$
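
The covariate draws above can be sketched with plain NumPy. This is a minimal illustration of the stated distributions, not the library's internal code; the seed and the `probs` dictionary are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen for illustration only
n = 10_000

# Success probabilities taken from the list above.
probs = {"platform_ios": 0.5, "country_usa": 0.6, "source_paid": 0.3}

# Each covariate is an independent Bernoulli draw, stored as 0.0 / 1.0.
X = {name: rng.binomial(1, p, size=n).astype(float) for name, p in probs.items()}

for name, col in X.items():
    print(name, round(col.mean(), 3))
```

Each sample mean should land close to its Bernoulli probability.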


#### 2. Treatment Assignment ($D$)
Because this is an RCT, the treatment $D$ is independent of $X$ and is assigned with probability $P(D=1) = 0.5$:
$$D \sim \text{Bernoulli}(0.5)$$
The log-odds of treatment (intercept $\alpha_d$) is $0$.
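
As a rough sketch of this step (again plain NumPy, with an illustrative seed and a stand-in covariate `x` that are not part of the library), the assignment ignores the covariates entirely, so treatment and any covariate should be essentially uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(1)  # seed chosen for illustration only
n = 10_000
p_treat = 0.5

# Intercept on the log-odds scale: logit(0.5) = log(1) = 0.
alpha_d = np.log(p_treat / (1 - p_treat))

# Treatment is drawn without looking at any covariate.
d = rng.binomial(1, p_treat, size=n).astype(float)

# A stand-in covariate, drawn separately, to check independence empirically.
x = rng.binomial(1, 0.3, size=n).astype(float)

print("alpha_d:", alpha_d)
print("share treated:", round(d.mean(), 3))
print("corr(d, x):", round(np.corrcoef(d, x)[0, 1], 3))
```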


#### 3. Outcome Generation (`conversion`)
The outcome is a binary variable representing conversion. The probability of conversion for an individual is modeled using a logistic link function:
$$P(\text{conversion}=1 \mid D, X) = \sigma(L)$$
where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.

The latent linear predictor $L$ is defined as:
$$L = \alpha_y + \sum_{j=1}^3 \beta_{y,j} x_j + g_y(X) + D \cdot \theta$$

*   **Baseline Intercept ($\alpha_y$):**
    Derived from the target control conversion rate $p_A = 0.10$.
    $$\alpha_y = \text{logit}(0.10) = \ln\left(\frac{0.10}{0.90}\right) \approx -2.197$$
*   **Treatment Effect ($\theta$):**
    Derived from the target treatment conversion rate $p_B = 0.11$. It represents the shift in log-odds.
    $$\theta = \text{logit}(0.11) - \text{logit}(0.10) = \ln\left(\frac{0.11}{0.89}\right) - \ln\left(\frac{0.10}{0.90}\right) \approx 0.106$$
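    On the probability scale, this moves a unit with all covariates equal to zero from 10% to 11%, a lift of 1 percentage point. Because the prognostic coefficients are positive, any unit with a covariate equal to 1 has higher log-odds, so the marginal conversion rates exceed these baseline targets (about 20% and 23% in the `outcome_stats` output further down).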
*   **Prognostic Coefficients ($\beta_y$):**
    By default, $\beta_y = [0.6, 0.4, 0.8]$. These values determine how much each covariate shifts the log-odds of conversion.
*   **Non-linear Signal ($g_y(X)$):**
    $g_y(X) = 0$ unless `add_pre=True`. When enabled, a randomly generated non-linear function of $X$ is added to the linear predictor.
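
The pieces above can be checked numerically. The sketch below assumes the default case $g_y(X) = 0$ and uses hypothetical helper names (`logit`, `sigmoid`); it is not the library's implementation. It computes $\alpha_y$ and $\theta$, then averages $\sigma(L)$ over the eight covariate cells to get the conversion rates implied by the linear predictor as written.

```python
import itertools
import numpy as np

def logit(p):
    """Log-odds of a probability."""
    return np.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit."""
    return 1 / (1 + np.exp(-z))

p_a, p_b = 0.10, 0.11
alpha_y = logit(p_a)                 # baseline intercept
theta = logit(p_b) - logit(p_a)      # treatment shift in log-odds
beta_y = np.array([0.6, 0.4, 0.8])   # default prognostic coefficients
p_x = np.array([0.5, 0.6, 0.3])      # P(x_j = 1) for each covariate

# Average sigma(L) over all 8 covariate cells, weighted by cell probability.
rate_control = 0.0
rate_treated = 0.0
for cell in itertools.product([0, 1], repeat=3):
    x = np.array(cell)
    weight = np.prod(np.where(x == 1, p_x, 1 - p_x))
    base = alpha_y + beta_y @ x      # g_y(X) = 0 assumed here
    rate_control += weight * sigmoid(base)
    rate_treated += weight * sigmoid(base + theta)

print("alpha_y:", round(alpha_y, 3), "theta:", round(theta, 3))
print("implied control rate:", round(rate_control, 3))
print("implied treated rate:", round(rate_treated, 3))
```

Under these assumptions the 10% and 11% targets hold only in the cell where every covariate is zero; the population-averaged rates are higher because every covariate adds a positive term to the log-odds.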


#### 4. Pre-period Covariate (`y_pre`)
If `add_pre=True`, a continuous pre-treatment covariate is generated to act as a proxy for the baseline outcome (useful for CUPED/ANCOVA):
$$y_{\text{pre}} = (s(X) - \bar{s}) + \epsilon_{\text{pre}}$$
where $s(X)$ is the prognostic signal from covariates ($\sum \beta_{y,j} x_j + g_y(X)$) and $\epsilon_{\text{pre}}$ is Gaussian noise calibrated to achieve a target correlation (default $\rho = 0.7$) with the outcome's latent signal.
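
One way such a calibration could work is sketched below. It is an assumption, not the library's documented method: it takes the target to be the correlation between $y_{\text{pre}}$ and $s(X)$, omits $g_y(X)$, and solves $\rho = \text{sd}(s) / \sqrt{\text{var}(s) + \sigma_{\text{pre}}^2}$ for the noise scale.

```python
import numpy as np

rng = np.random.default_rng(2)  # seed chosen for illustration only
n = 10_000
rho = 0.7  # target correlation

# Covariates and linear prognostic signal (g_y(X) omitted in this sketch).
X = np.column_stack([rng.binomial(1, p, size=n) for p in (0.5, 0.6, 0.3)])
beta_y = np.array([0.6, 0.4, 0.8])
s = X @ beta_y
s_centered = s - s.mean()

# For independent noise, corr(s + eps, s) = sd(s) / sqrt(var(s) + sigma^2).
# Setting this equal to rho gives sigma = sd(s) * sqrt(1 - rho^2) / rho.
sigma_pre = s_centered.std() * np.sqrt(1 - rho**2) / rho
y_pre = s_centered + rng.normal(0.0, sigma_pre, size=n)

print("empirical corr(y_pre, s):", round(np.corrcoef(y_pre, s)[0, 1], 3))
```

The empirical correlation should sit near the target, up to sampling noise.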


#### Summary of Default Parameters
*   **$N$**: 10,000
*   **Control Conversion ($p_A$)**: 10% (baseline, all covariates zero)
*   **Treatment Conversion ($p_B$)**: 11% (baseline, all covariates zero)
*   **Treatment Split**: 50%
*   **Confounders**: 3 binary (purely prognostic)


In [6]:
import numpy as np
from causalis.data.dgps import generate_classic_rct_26
from causalis.data import CausalData

In [7]:
# let's generate a df
data = generate_classic_rct_26(return_causal_data=False)
data.head()

|   | conversion | d | platform_ios | country_usa | source_paid |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |


In [8]:
# wrap it in CausalData
causaldata = CausalData(df = data,
                        treatment='d',
                        outcome='conversion',
                        confounders=['platform_ios', 'country_usa', 'source_paid'])
causaldata

CausalData(df=(10000, 5), treatment='d', outcome='conversion', confounders=['platform_ios', 'country_usa', 'source_paid'])

In [9]:
# Statistics for outcome comparison
from causalis.statistics.functions import outcome_stats

outcome_stats(causaldata)

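|   | treatment | count | mean | std | min | p10 | p25 | median | p75 | p90 | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 4955 | 0.198991 | 0.399281 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 1 | 1.0 | 5045 | 0.232904 | 0.422723 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |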


In [10]:
# Let's check the balance of confounders
from causalis.statistics.functions import confounders_balance

confounders_balance(causaldata)

|   | confounders | mean_d_0 | mean_d_1 | abs_diff | smd | ks_pvalue |
|---|---|---|---|---|---|---|
| 2 | source_paid | 0.299092 | 0.313776 | 0.014684 | 0.031853 | 0.64592 |
| 0 | platform_ios | 0.494046 | 0.502874 | 0.008828 | 0.017654 | 0.98861 |
| 1 | country_usa | 0.586276 | 0.591873 | 0.005597 | 0.011374 | 1.0 |
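
The `smd` column above is consistent with the usual standardized mean difference, $|\bar{x}_1 - \bar{x}_0| / \sqrt{(s_0^2 + s_1^2)/2}$. The sketch below is an assumption about the formula rather than the library's code (`smd_binary` is a hypothetical helper); it uses $p(1-p)$ as the variance of a binary covariate and reproduces the reported values to about four decimal places from the group means alone.

```python
import numpy as np

def smd_binary(p0, p1):
    """Standardized mean difference for a binary covariate.

    Uses p * (1 - p) as the variance in each arm and the average of the
    two variances as the pooled variance. Hypothetical helper for this example.
    """
    pooled_sd = np.sqrt((p0 * (1 - p0) + p1 * (1 - p1)) / 2)
    return abs(p1 - p0) / pooled_sd

# (mean_d_0, mean_d_1) pairs copied from the balance table above.
group_means = {
    "source_paid": (0.299092, 0.313776),
    "platform_ios": (0.494046, 0.502874),
    "country_usa": (0.586276, 0.591873),
}

for name, (m0, m1) in group_means.items():
    print(name, round(smd_binary(m0, m1), 4))
```

All three values are far below the commonly used 0.1 threshold, which is what complete randomization should produce.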
