# 🩺 4.7 Analysing a Simulated Clinical Trial

This notebook walks you through analysing a **simulated clinical trial** — a common task in nutrition and health research. You'll:
- Simulate trial data
- Summarise baseline characteristics (Table 1)
- Visualise distributions
- Compare groups using both **frequentist** and **Bayesian** statistics

---

### 🧠 What is a Clinical Trial?

A clinical trial compares outcomes between groups (e.g. treatment vs control) to determine if an intervention works. In this case, we’ll simulate a trial comparing a **biomarker outcome** between two groups: Control and Intervention.


## 🧪 Step 1: Simulate the Data

We’ll simulate data for 100 participants, each randomly assigned to one of two groups (so the group sizes need not be exactly equal):
- `group = 0`: Control (no intervention)
- `group = 1`: Intervention (treatment group)

We generate:
- `age`: Normally distributed, mean 40 years, SD 10
- `bmi`: Normally distributed, mean 27 kg/m², SD 4
- `outcome`: Biomarker change, normally distributed with SD 2 in both groups (Control: mean 0, Intervention: mean 1)

In [None]:
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Simulate data
n = 100
df = pd.DataFrame({
    'participant_id': range(1, n+1),
    'age': np.random.normal(40, 10, n),
    'bmi': np.random.normal(27, 4, n),
    'group': np.random.choice([0, 1], size=n)
})

# Outcome depends on group
df['outcome'] = np.where(df['group'] == 0,
                         np.random.normal(0, 2, n),
                         np.random.normal(1, 2, n))

df.head()
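Because `np.random.choice` assigns each participant independently, the two arms are unlikely to contain exactly 50 participants each. A minimal sketch for checking the allocation and the per-group outcome summary follows; it re-creates the simulation so it runs on its own, and the names `group_sizes` and `summary` are only illustrative.

```python
import numpy as np
import pandas as pd

# Re-create the simulated trial with the same seed and draw order as above
np.random.seed(42)
n = 100
df = pd.DataFrame({
    'participant_id': range(1, n + 1),
    'age': np.random.normal(40, 10, n),
    'bmi': np.random.normal(27, 4, n),
    'group': np.random.choice([0, 1], size=n),
})
df['outcome'] = np.where(df['group'] == 0,
                         np.random.normal(0, 2, n),
                         np.random.normal(1, 2, n))

# Participants per arm (0 = Control, 1 = Intervention)
group_sizes = df['group'].value_counts().sort_index()
print(group_sizes)

# Sample size, mean and SD of the outcome in each arm
summary = df.groupby('group')['outcome'].agg(['count', 'mean', 'std']).round(2)
print(summary)
```

With only 100 participants, the sample means and SDs will differ somewhat from the values used to generate the data.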

## 📋 Step 2: Baseline Table (Table 1)

"Table 1" summarises baseline characteristics (like age and BMI) by group before any analysis. This helps check that the groups are comparable at the start of the trial.

In [None]:
table1 = df.groupby('group')[['age', 'bmi']].agg(['mean', 'std']).round(1)
table1.columns = ['Age (Mean)', 'Age (SD)', 'BMI (Mean)', 'BMI (SD)']
table1.index = ['Control', 'Intervention']
table1['Age'] = table1.apply(lambda row: f"{row['Age (Mean)']} ± {row['Age (SD)']}", axis=1)
table1['BMI'] = table1.apply(lambda row: f"{row['BMI (Mean)']} ± {row['BMI (SD)']}", axis=1)
table1[['Age', 'BMI']]
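One common way to quantify how comparable the groups are is the standardised mean difference (SMD) for each baseline variable. This is an addition to the notebook's analysis, not part of it: the sketch assumes a hypothetical helper named `smd` and rebuilds the baseline data so it runs on its own. Values near 0 indicate good balance; 0.1 is a frequently quoted rule of thumb, not a strict cut-off.

```python
import numpy as np
import pandas as pd

def smd(x0, x1):
    """Standardised mean difference: (mean1 - mean0) / sqrt((var0 + var1) / 2)."""
    x0 = np.asarray(x0, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    pooled = np.sqrt((x0.var(ddof=1) + x1.var(ddof=1)) / 2)
    return (x1.mean() - x0.mean()) / pooled

# Rebuild the baseline data with the same seed and draw order as Step 1
np.random.seed(42)
n = 100
df = pd.DataFrame({
    'age': np.random.normal(40, 10, n),
    'bmi': np.random.normal(27, 4, n),
    'group': np.random.choice([0, 1], size=n),
})

control = df[df['group'] == 0]
intervention = df[df['group'] == 1]

# One SMD per baseline variable (Intervention minus Control)
balance = pd.Series({var: smd(control[var], intervention[var]) for var in ['age', 'bmi']})
print(balance.round(2))
```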

## 📈 Step 3: Visualise Distributions

We now inspect the distributions of age, BMI, and the outcome to check:
- Are the groups similarly distributed?
- Are there outliers?
- What’s the shape of the data?

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style='whitegrid')
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for i, var in enumerate(['age', 'bmi', 'outcome']):
    sns.boxplot(data=df, x='group', y=var, ax=axes[i])
    axes[i].set_title(f'{var.capitalize()} by Group')
    # Fix the tick positions before relabelling to avoid a Matplotlib warning
    axes[i].set_xticks([0, 1])
    axes[i].set_xticklabels(['Control', 'Intervention'])

plt.tight_layout()
plt.show()
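By default, the points a boxplot draws beyond its whiskers are those lying more than 1.5 × IQR outside the quartiles. The same rule can be sketched in code; the helper name `iqr_outliers` is hypothetical, and the data here are freshly simulated for illustration rather than taken from the notebook's `df`.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask marking values more than k * IQR beyond the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Illustrative data: 99 draws like the Control outcome plus one planted extreme value
rng = np.random.default_rng(0)
sample = np.append(rng.normal(0, 2, 99), 15.0)

mask = iqr_outliers(sample)
print(f"Flagged {mask.sum()} of {sample.size} values")
print("Flagged values:", np.round(sample[mask], 2))
```

To apply it to the trial data, you could call it once per group, for example on `df.loc[df['group'] == 0, 'outcome']`.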

## 📏 Step 4: Frequentist Effect Size (Cohen’s d)

We’ll use **Cohen’s d** to estimate the effect size of the intervention:
- $d = \frac{\text{difference in means}}{\text{pooled SD}}$

We also run an independent two-sample t-test (SciPy's `ttest_ind`, which assumes equal variances by default) to check for statistical significance.
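For reference, the pooled SD in the denominator is usually defined with each group's sample variance weighted by its degrees of freedom. The symbols are introduced here for illustration: $n_0, n_1$ are the group sizes, $s_0^2, s_1^2$ the sample variances, and $\bar{x}_0, \bar{x}_1$ the sample means.

$$
s_{\text{pooled}} = \sqrt{\frac{(n_0 - 1)\,s_0^2 + (n_1 - 1)\,s_1^2}{n_0 + n_1 - 2}},
\qquad
d = \frac{\bar{x}_1 - \bar{x}_0}{s_{\text{pooled}}}
$$

When the two groups are the same size, this reduces to $\sqrt{(s_0^2 + s_1^2)/2}$.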

In [None]:
from scipy.stats import ttest_ind

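group0 = df[df['group'] == 0]['outcome']
group1 = df[df['group'] == 1]['outcome']
n0, n1 = len(group0), len(group1)

mean_diff = group1.mean() - group0.mean()
# Weight each sample variance by its degrees of freedom: random assignment
# leaves the groups with unequal sizes, so a simple average would be biased
pooled_sd = np.sqrt(((n0 - 1) * group0.var() + (n1 - 1) * group1.var()) / (n0 + n1 - 2))
cohens_d = mean_diff / pooled_sd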

t_stat, p_val = ttest_ind(group1, group0)

print(f"Cohen's d: {cohens_d:.2f}")
print(f"T-test: t = {t_stat:.2f}, p = {p_val:.3f}")
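A p-value alone does not say how large the effect might be, so it is often reported alongside a confidence interval for the difference in means. The sketch below assumes a hypothetical helper `mean_diff_ci` and uses freshly drawn illustrative data rather than the notebook's `df`; it makes the same equal-variance assumption as the default `ttest_ind`.

```python
import numpy as np
from scipy import stats

def mean_diff_ci(a, b, level=0.95):
    """Pooled-variance (Student) confidence interval for mean(a) - mean(b)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = a.size, b.size
    dof = na + nb - 2
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / dof
    se = np.sqrt(pooled_var * (1 / na + 1 / nb))
    diff = a.mean() - b.mean()
    t_crit = stats.t.ppf(0.5 + level / 2, dof)
    return diff - t_crit * se, diff + t_crit * se

# Illustrative arms drawn like the notebook's outcome (not the notebook's df)
rng = np.random.default_rng(1)
intervention = rng.normal(1, 2, 50)
control = rng.normal(0, 2, 50)

low, high = mean_diff_ci(intervention, control)
print(f"Mean difference: {intervention.mean() - control.mean():.2f}")
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```

In the notebook you could call `mean_diff_ci(group1, group0)`. The 95% interval excludes 0 exactly when the equal-variance t-test gives p < 0.05.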

## 🧠 Step 5: Bayesian Effect Size

Bayesian analysis estimates a **distribution of possible effects** rather than a single value. This gives us:
- A posterior distribution
- A 95% Highest Density Interval (HDI) showing the most credible values

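We use `PyMC` to model the outcome in each group as normally distributed, with a normal prior on each group mean and a half-normal prior on a shared standard deviation.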
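Written out, the model fitted in the next cell is as follows. The notation is introduced here for illustration, while the numbers come from the code: $g_i \in \{0, 1\}$ is participant $i$'s group, and the last argument of each distribution is a standard deviation (scale), not a variance.

$$
\begin{aligned}
\mu_0,\ \mu_1 &\sim \mathcal{N}(0,\ 10) \\
\sigma &\sim \text{HalfNormal}(2) \\
y_i &\sim \mathcal{N}(\mu_{g_i},\ \sigma) \\
\text{diff} &= \mu_1 - \mu_0
\end{aligned}
$$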

In [None]:
import pymc as pm
import arviz as az

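# Integer group labels (0/1) used to pick each participant's group mean
group_idx = df['group'].to_numpy()

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0, sigma=10, shape=2)   # one mean per group
    sigma = pm.HalfNormal("sigma", sigma=2)         # shared outcome SD
    y_obs = pm.Normal("y_obs", mu=mu[group_idx], sigma=sigma, observed=df['outcome'].to_numpy())
    diff = pm.Deterministic("diff", mu[1] - mu[0])  # Intervention minus Control
    trace = pm.sample(1000, tune=1000, random_seed=42, return_inferencedata=True, progressbar=True)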

az.plot_posterior(trace, var_names=["diff"], ref_val=0)
plt.title("Posterior Difference in Means")
plt.show()

# Flatten chains and draws so az.hdi returns a plain [lower, upper] array
diff_samples = trace.posterior['diff'].values.flatten()
hdi_low, hdi_high = az.hdi(diff_samples, hdi_prob=0.95)
print(f"Posterior mean difference: {diff_samples.mean():.2f}")
print(f"95% HDI: [{hdi_low:.2f}, {hdi_high:.2f}]")
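Posterior draws also let you read off the probability that the effect exceeds a chosen threshold. The sketch below assumes a hypothetical helper `prob_greater` and uses made-up stand-in draws so it runs without fitting the model; the numbers it prints are not results from this trial.

```python
import numpy as np

def prob_greater(draws, threshold=0.0):
    """Share of posterior draws above `threshold`, an estimate of P(diff > threshold)."""
    draws = np.asarray(draws, dtype=float).ravel()
    return float((draws > threshold).mean())

# Stand-in draws for illustration only; in the notebook you would pass
# trace.posterior['diff'].values instead
rng = np.random.default_rng(2)
fake_draws = rng.normal(1.0, 0.4, size=(4, 1000))  # shape: (chains, draws)

print(f"P(diff > 0)   = {prob_greater(fake_draws):.3f}")
print(f"P(diff > 0.5) = {prob_greater(fake_draws, 0.5):.3f}")
```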

## ✅ Summary

We have:
- Simulated a clinical trial
- Summarised and visualised the data
- Compared outcomes using frequentist and Bayesian methods

**Key Insight**:  
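A p-value is typically used to make a *yes/no* decision about statistical significance, while a Bayesian posterior shows the *range of plausible effects* and how credible each value is. Both approaches also give an estimate of the effect size.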

---

### 🔁 Optional Exercises
1. Change the outcome mean in the Intervention group from 1 to 2. How do Cohen’s d and the posterior shift?
2. Add `age` as a covariate in the Bayesian model.
3. Create a scatterplot of `outcome` vs. `bmi`, coloured by group.