# Scenario: Classic RCT

We call a 'Classic Randomized Controlled Trial' (RCT) a scenario in which a treatment is randomly assigned to participants and no pre-experiment data about them, such as a pre-treatment outcome, is available.

Treatment: a new onboarding flow for new users.

We will test the following hypotheses:

$H_0$ - There is no difference in conversion rate between treatment and control groups.

$H_a$ - There is a difference in conversion rate between treatment and control groups.

## Data

We will use the gold DGP (data-generating process) from the causalis library. You can read more at

In [1]:
import numpy as np
from causalis.data.dgps import generate_classic_rct_26
from causalis.data import CausalData

data = generate_classic_rct_26(return_causal_data=False)
data.head()

Unnamed: 0,conversion,d,platform_ios,country_usa,source_paid
0,0.0,0.0,1.0,0.0,1.0
1,0.0,1.0,0.0,0.0,1.0
2,0.0,0.0,1.0,1.0,0.0
3,0.0,1.0,1.0,1.0,0.0
4,0.0,1.0,0.0,1.0,0.0


In [2]:
causaldata = CausalData(df = data,
                        treatment='d',
                        outcome='conversion',
                        confounders=['platform_ios', 'country_usa', 'source_paid'])

In [3]:
from causalis.statistics.functions import outcome_stats
outcome_stats(causaldata)

Unnamed: 0,treatment,count,mean,std,min,p10,p25,median,p75,p90,max
0,0.0,4955,0.198991,0.399281,0.0,0.0,0.0,0.0,0.0,1.0,1.0
1,1.0,5045,0.232904,0.422723,0.0,0.0,0.0,0.0,0.0,1.0,1.0


# Monitoring

An assignment system randomly splits users: half should receive the new onboarding and the other half should not. We monitor the split with a Sample Ratio Mismatch (SRM) test.

### Math of Sample Ratio Mismatch (SRM)

Sample Ratio Mismatch (SRM) is a diagnostic test used in Randomized Controlled Trials (RCTs) to detect whether the actual distribution of participants between variants (e.g., Control and Treatment) deviates significantly from the intended design. It uses **Pearson's Chi-square Goodness-of-Fit test**.

#### 1. Setup
- **Variants**: $k$ experimental groups (e.g., $k=2$ for Control and Treatment).
- **Observed Counts ($O_i$)**: The actual number of users assigned to variant $i$.
- **Total Sample Size ($N$)**: The sum of all observed counts, $N = \sum_{i=1}^k O_i$.
- **Target Allocation ($p_i$)**: The intended probability of assignment for variant $i$ (e.g., $0.5$ for a 50/50 split).

#### 2. Expected Counts ($E_i$)
The number of users we *expected* to see in each variant if the randomization worked perfectly:
$$E_i = N \times p_i$$

#### 3. Chi-square Statistic ($\chi^2$)
We calculate the cumulative squared deviation between observed and expected counts, normalized by the expected counts:
$$\chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}$$

#### 4. Hypothesis Testing
- **Null Hypothesis ($H_0$)**: The observed counts follow the target distribution (no mismatch).
- **Degrees of Freedom ($df$)**: $df = k - 1$.
- **P-value**: The probability of observing a $\chi^2$ statistic at least as extreme as the one calculated, assuming $H_0$ is true. It is derived from the Chi-square distribution:
  $$p\text{-value} = P(\chi^2_{df} > \chi^2_{calculated})$$
- **Decision**: If $p\text{-value} < \alpha$ (where $\alpha$ is a conservative threshold like $0.001$), we reject $H_0$ and flag an **SRM**.

### Example (50/50 split)
If you intended to split 1,000 users evenly but observed **450** in Control and **550** in Treatment:
1. $E_{control} = 1000 \times 0.5 = 500$
2. $E_{treatment} = 1000 \times 0.5 = 500$
3. $\chi^2 = \frac{(450-500)^2}{500} + \frac{(550-500)^2}{500} = \frac{2500}{500} + \frac{2500}{500} = 5 + 5 = 10$
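4. With $df=1$, a $\chi^2$ of 10 gives $p\text{-value} \approx 0.0016$. This is strong evidence of a mismatch at conventional levels, although it is slightly above the conservative $\alpha = 0.001$ threshold, so it would call for investigation rather than an automatic SRM flag.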
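
The worked example can be checked numerically. The following is a minimal standard-library sketch, not the causalis implementation: the function name `srm_chi2_two_groups` is hypothetical, and the p-value shortcut via `math.erfc` is valid only for two variants ($df = 1$).

```python
import math


def srm_chi2_two_groups(n_control, n_treatment, p_control=0.5):
    """Pearson chi-square goodness-of-fit test for a two-variant split (df = 1).

    Hypothetical helper for illustration; not part of causalis.
    """
    total = n_control + n_treatment
    observed = [n_control, n_treatment]
    expected = [total * p_control, total * (1.0 - p_control)]  # E_i = N * p_i

    # Sum of squared deviations, normalized by the expected counts.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # For df = 1, P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


chi2, p_value = srm_chi2_two_groups(450, 550)
print(f"chi2={chi2:.4f}, p_value={p_value:.5f}")
```

Calling the same function with the group sizes reported by `outcome_stats` (4955 and 5045) should give values close to those returned by `check_srm`.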

In [4]:
from causalis.scenarios.rct import check_srm

check_srm(assignments=causaldata, target_allocation={0: 0.5, 1: 0.5}, alpha=0.001)

SRMResult(status=no SRM, p_value=0.36812, chi2=0.8100, df=1)

# Check the confounders balance

Are the groups balanced on the confounders? We choose the confounders using domain and business knowledge and then check their balance.
A covariate is commonly flagged as imbalanced when either of these benchmarks is crossed:

- $SMD > 0.1$
- `ks_pvalue` $< 0.05$
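
To make the SMD benchmark concrete, here is a minimal sketch of a standardized mean difference for one covariate. It assumes the common definition with a pooled standard deviation, $\sqrt{(s_1^2 + s_0^2)/2}$; whether `confounders_balance` uses exactly this formula is an assumption, and the function name `standardized_mean_difference` is hypothetical.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """Absolute SMD with a pooled standard deviation (illustrative helper)."""
    diff = abs(mean(treated) - mean(control))
    # Pooled SD from the two sample variances (assumed definition).
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2.0)
    if pooled_sd == 0.0:
        # Both groups are constant; report no imbalance if the means match.
        return 0.0 if diff == 0.0 else math.inf
    return diff / pooled_sd


# Toy binary covariate (made-up values, not the notebook data).
treated = [1, 1, 0, 0]
control = [1, 0, 0, 0]

smd = standardized_mean_difference(treated, control)
print(f"SMD={smd:.4f}, imbalanced={smd > 0.1}")
```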

In [5]:
from causalis.statistics.functions import confounders_balance

confounders_balance(causaldata)

Unnamed: 0,confounders,mean_d_0,mean_d_1,abs_diff,smd,ks_pvalue
0,source_paid,0.299092,0.313776,0.014684,0.031853,0.64592
1,platform_ios,0.494046,0.502874,0.008828,0.017654,0.98861
2,country_usa,0.586276,0.591873,0.005597,0.011374,1.0


# Estimation with Diff-in-Means

In [6]:
from causalis.statistics.models.diff_in_means import DiffInMeans

model = DiffInMeans().fit(causaldata)

### What is `conversion_z_test`

`conversion_z_test` compares conversion rates between two groups (Treatment and Control). It returns a p-value for the hypothesis test and robust confidence intervals for both the absolute and the relative difference.

#### 1. Observed Metrics
For each group (Control $0$, Treatment $1$):
- $n_0, n_1$: Total number of observations.
- $x_0, x_1$: Number of successes (conversions).
- $p_0 = \frac{x_0}{n_0}, \;\; p_1 = \frac{x_1}{n_1}$: Observed conversion rates.

#### 2. Hypothesis Test (P-value)
The test evaluates $H_0: p_1 = p_0$ (no difference).
- **Pooled Proportion**: $\hat{p} = \frac{x_0 + x_1}{n_0 + n_1}$
- **Pooled Standard Error**: $SE_{pooled} = \sqrt{\hat{p}(1 - \hat{p}) \left(\frac{1}{n_0} + \frac{1}{n_1}\right)}$
- **Z-Statistic**: $Z = \frac{p_1 - p_0}{SE_{pooled}}$
- **P-value**: $2 \times (1 - \Phi(|Z|))$, where $\Phi$ is the standard normal CDF.
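
The test above can be sketched with the standard library alone. This is an illustration of the formulas, not the causalis code: `pooled_ztest` is a hypothetical name, and the conversion counts 986 and 1175 are reconstructed as count × mean from the `outcome_stats` table.

```python
import math


def pooled_ztest(x0, n0, x1, n1):
    """Two-sided pooled z-test for a difference in proportions (illustrative)."""
    p0, p1 = x0 / n0, x1 / n1
    p_pool = (x0 + x1) / (n0 + n1)
    se_pooled = math.sqrt(p_pool * (1.0 - p_pool) * (1.0 / n0 + 1.0 / n1))
    z = (p1 - p0) / se_pooled
    # 2 * (1 - Phi(|z|)) is equal to erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Counts reconstructed from outcome_stats: 4955 * 0.198991 and 5045 * 0.232904.
z, p_value = pooled_ztest(x0=986, n0=4955, x1=1175, n1=5045)
print(f"z={z:.4f}, p_value={p_value:.3e}")
```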

#### 3. Absolute Difference (Newcombe CI)
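To calculate the confidence interval for the difference $\Delta = p_1 - p_0$, we use a **Newcombe**-style construction built from per-group Wilson score intervals, which is more robust than the standard Wald interval for conversion rates. Note that the combined interval below subtracts the Wilson bounds directly, which is wider (more conservative) than Newcombe's original hybrid score interval, where the distances to the Wilson bounds are combined as a square root of summed squares.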
1.  **Wilson Score Interval** for each group:
    $$CI_{Wilson, i} = (l_i, u_i) = \frac{p_i + \frac{z^2}{2n_i} \pm z \sqrt{\frac{p_i(1 - p_i)}{n_i} + \frac{z^2}{4n_i^2}}}{1 + \frac{z^2}{n_i}}$$
2.  **Combined Interval**:
    $$CI_{\Delta} = (l_1 - u_0, \;\; u_1 - l_0)$$
    *(where $z$ is the critical value for the chosen $\alpha$)*
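
The two steps can be sketched as follows. This is a minimal illustration of the formulas in this section rather than the library's code: the function names are hypothetical, $z = 1.959964$ is hard-coded for $\alpha = 0.05$, and the conversion counts are reconstructed as count × mean from the `outcome_stats` table.

```python
import math

Z_95 = 1.959964  # two-sided critical value for alpha = 0.05 (assumed here)


def wilson_interval(x, n, z=Z_95):
    """Wilson score interval (l, u) for a single proportion x / n."""
    p = x / n
    denom = 1.0 + z**2 / n
    center = p + z**2 / (2.0 * n)
    half = z * math.sqrt(p * (1.0 - p) / n + z**2 / (4.0 * n**2))
    return (center - half) / denom, (center + half) / denom


def combined_diff_interval(x0, n0, x1, n1, z=Z_95):
    """Interval for p1 - p0 formed as (l1 - u0, u1 - l0)."""
    l0, u0 = wilson_interval(x0, n0, z)
    l1, u1 = wilson_interval(x1, n1, z)
    return l1 - u0, u1 - l0


lower, upper = combined_diff_interval(x0=986, n0=4955, x1=1175, n1=5045)
print(f"CI for the absolute difference: ({lower:.6f}, {upper:.6f})")
```

With these inputs the bounds should land close to the `lower_ci` and `upper_ci` values reported in the summary table further down.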

#### 4. Relative Difference (Lift)
Lift measures the percentage change: $\text{Lift} = (\frac{p_1}{p_0} - 1) \times 100\%$.
The confidence interval is calculated on the **log relative risk (log-RR)** scale to handle uncertainty in the denominator:
- **Log-RR Standard Error**: $SE_{\ln(RR)} = \sqrt{\frac{1}{x_1} - \frac{1}{n_1} + \frac{1}{x_0} - \frac{1}{n_0}}$
- **Relative CI**: $(\exp[\ln(\frac{p_1}{p_0}) \pm z \times SE_{\ln(RR)}] - 1) \times 100\%$
*(Small constants are added if $x=0$ using the Haldane–Anscombe correction).*
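
A minimal sketch of the lift and its log-RR interval is shown below. It is an illustration under stated assumptions, not the causalis implementation: `relative_lift_ci` is a hypothetical name, $z = 1.959964$ corresponds to $\alpha = 0.05$, the zero-count correction is omitted (both $x_0$ and $x_1$ must be positive), and the counts are reconstructed as count × mean from the `outcome_stats` table.

```python
import math


def relative_lift_ci(x0, n0, x1, n1, z=1.959964):
    """Lift in percent with a CI built on the log relative-risk scale.

    Illustrative helper; requires x0 > 0 and x1 > 0 (no zero-count correction).
    """
    if x0 <= 0 or x1 <= 0:
        raise ValueError("this sketch does not handle zero conversions")
    p0, p1 = x0 / n0, x1 / n1
    log_rr = math.log(p1 / p0)
    se_log_rr = math.sqrt(1.0 / x1 - 1.0 / n1 + 1.0 / x0 - 1.0 / n0)

    lift = (math.exp(log_rr) - 1.0) * 100.0
    lower = (math.exp(log_rr - z * se_log_rr) - 1.0) * 100.0
    upper = (math.exp(log_rr + z * se_log_rr) - 1.0) * 100.0
    return lift, lower, upper


lift, lower, upper = relative_lift_ci(x0=986, n0=4955, x1=1175, n1=5045)
print(f"lift={lift:.2f}%  CI=({lower:.2f}%, {upper:.2f}%)")
```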

In [7]:
result_conversion = model.estimate('conversion_ztest')
result_conversion.summary()

Unnamed: 0,estimand,coefficient,p_val,lower_ci,upper_ci,relative_diff_%,is_significant
0,ATE,0.033913,3.8e-05,0.011108,0.056658,17.04246,True


In [8]:
result_conversion

CausalEstimate(estimand='ATE', model='DiffInMeans', model_options={'method': 'conversion_ztest', 'alpha': 0.05}, value=0.03391294694870284, ci_upper_absolute=0.05665834451230578, ci_lower_absolute=0.011107630056575168, value_relative=17.04245964815645, ci_upper_relative=26.161257674372564, ci_lower_relative=8.582758392024402, alpha=0.05, p_value=3.794497600839719e-05, is_significant=True, n_treated=5045, n_control=4955, outcome='conversion', treatment='d', confounders=['platform_ios', 'country_usa', 'source_paid'], instrument=[], time=datetime.datetime(2026, 1, 13, 10, 1, 2, 142603), diagnostic_data=DiagnosticData(), sensitivity_analysis={})