# 🩺 4.7 Analysing a Simulated Clinical Trial

In this notebook, we’ll analyse data from a **simulated clinical trial**. Clinical trials are a common design in nutrition research for comparing outcomes between groups.

You will:
- Simulate trial data
- Create a baseline summary table (Table 1)
- Visualise variable distributions
- Calculate effect sizes using frequentist and Bayesian approaches
- Interpret results, including visualising posterior chains

---

### 🧠 What is a Clinical Trial?

A clinical trial is a study that tests the effect of an intervention (e.g. a new diet) on a health outcome. In a randomised trial, such as the one simulated here, participants are assigned to groups at random.

## 🧪 Step 1: Simulate the Data

We simulate a dataset for 100 participants, with two groups:
- Control (group = 0)
- Intervention (group = 1)

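We'll generate age, BMI, and a simulated outcome (e.g., biomarker change). Group membership is an independent coin flip per participant, so the two groups will not necessarily contain 50 participants each. The outcome is normally distributed with SD 2, with mean 0 in the control group and mean 1 in the intervention group, so the true standardised effect is 0.5.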

In [None]:
import pandas as pd
import numpy as np

np.random.seed(42)
n = 100
df = pd.DataFrame({
    'participant_id': range(1, n+1),
    'age': np.random.normal(40, 10, n),
    'bmi': np.random.normal(27, 4, n),
    'group': np.random.choice([0, 1], size=n)
})
df['outcome'] = np.where(
    df['group'] == 0,
    np.random.normal(0, 2, n),
    np.random.normal(1, 2, n)
)

df.head()
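Because `np.random.choice` assigns each participant independently, the arms above usually end up with unequal sizes. If you would rather have exactly 50 per arm, one option is to shuffle a fixed list of labels. The sketch below is an illustration, not part of the original simulation: it uses NumPy's `default_rng` generator instead of `np.random.seed`, and the names `balanced`, `group` and `outcome` are introduced here for the example.

```python
import numpy as np
import pandas as pd

# Assumption: the Generator API is used here instead of the notebook's np.random.seed
rng = np.random.default_rng(42)
n = 100

# Shuffle exactly n/2 zeros and n/2 ones so both arms have the same size
group = rng.permutation(np.repeat([0, 1], n // 2))

# Draw each outcome from its own arm's distribution:
# control mean 0, intervention mean 1, SD 2 in both arms
outcome = rng.normal(loc=np.where(group == 1, 1.0, 0.0), scale=2.0)

balanced = pd.DataFrame({"group": group, "outcome": outcome})
print(balanced["group"].value_counts().sort_index())
```

Passing an array as `loc` draws one value per participant from that participant's own distribution, which avoids generating two full arrays and discarding half of each.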

## 📋 Step 2: Baseline Table

Table 1 reports the mean ± SD of the baseline characteristics (age and BMI) in each group, before the intervention's effect is measured. With random assignment, the two groups should look similar at baseline.

In [None]:
table1 = df.groupby('group')[['age', 'bmi']].agg(['mean', 'std']).round(1)
table1.columns = ['Age (Mean)', 'Age (SD)', 'BMI (Mean)', 'BMI (SD)']
table1.index = ['Control', 'Intervention']
table1['Age'] = table1.apply(lambda row: f"{row['Age (Mean)']} ± {row['Age (SD)']}", axis=1)
table1['BMI'] = table1.apply(lambda row: f"{row['BMI (Mean)']} ± {row['BMI (SD)']}", axis=1)
table1[['Age', 'BMI']]
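One way to put a number on "similar at baseline" is the standardised mean difference (SMD): the difference in group means divided by a pooled SD. The sketch below is a minimal illustration under stated assumptions: the helper `standardised_mean_difference` is a name introduced here, the groups are coded 0 and 1 as in this notebook, and the toy values are made up so the result can be checked by hand.

```python
import numpy as np

def standardised_mean_difference(values, groups):
    """SMD between group 1 and group 0, using the unweighted average of the two variances."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    control = values[groups == 0]
    treated = values[groups == 1]
    pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
    return (treated.mean() - control.mean()) / pooled_sd

# Toy data for illustration only: both groups have variance 1 and the means differ by 1
toy_values = [1, 2, 3, 2, 3, 4]
toy_groups = [0, 0, 0, 1, 1, 1]
smd = standardised_mean_difference(toy_values, toy_groups)
print(f"SMD: {smd:.2f}")
```

In the notebook you could call `standardised_mean_difference(df['age'], df['group'])`. An absolute SMD below about 0.1 is a commonly used rule of thumb for good balance, though it is a convention rather than a strict cut-off.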

## 📈 Step 3: Visualise Distributions

We'll use histograms with kernel density estimates (KDEs) overlaid to compare the distribution of each variable between groups.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style='whitegrid')  # set_theme is the current name for sns.set
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
variables = ['age', 'bmi', 'outcome']
titles = ['Age Distribution', 'BMI Distribution', 'Outcome Distribution']

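# Map the numeric codes to readable labels so seaborn builds the legend itself;
# overriding labels with ax.legend(labels=...) can attach them to the wrong group.
plot_df = df.assign(Group=df['group'].map({0: 'Control', 1: 'Intervention'}))

for i, var in enumerate(variables):
    sns.histplot(data=plot_df, x=var, hue='Group', hue_order=['Control', 'Intervention'],
                 kde=True, element="step", stat="density",
                 common_norm=False, palette='Set1', ax=axes[i])
    axes[i].set_title(titles[i])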

plt.tight_layout()
plt.show()

## 📏 Step 4: Frequentist Analysis

We use **Cohen’s d** (the difference in group means divided by the pooled standard deviation) to measure the size of the effect, and an independent-samples t-test to check for statistical significance.

In [None]:
from scipy.stats import ttest_ind

group0 = df[df['group'] == 0]['outcome']
group1 = df[df['group'] == 1]['outcome']

mean_diff = group1.mean() - group0.mean()
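# Pooled SD weighted by degrees of freedom, since the two groups differ in size
n0, n1 = len(group0), len(group1)
pooled_sd = np.sqrt(((n0 - 1) * group0.var() + (n1 - 1) * group1.var()) / (n0 + n1 - 2))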
cohens_d = mean_diff / pooled_sd

t_stat, p_val = ttest_ind(group1, group0)

print(f"Cohen's d: {cohens_d:.2f}")
print(f"T-test: t = {t_stat:.2f}, p = {p_val:.3f}")
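A p-value alone does not show how precisely the difference is estimated, so it can help to report a confidence interval alongside it. The sketch below computes a Welch-type interval (no equal-variance assumption) for the difference in means. The helper `welch_ci` and the toy arrays are introduced here for illustration; this is one reasonable choice of interval, not the method the notebook's t-test uses by default.

```python
import numpy as np
from scipy import stats

def welch_ci(treated, control, level=0.95):
    """Confidence interval for mean(treated) - mean(control) without assuming equal variances."""
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    # Squared standard error contributed by each group
    v1 = treated.var(ddof=1) / len(treated)
    v0 = control.var(ddof=1) / len(control)
    diff = treated.mean() - control.mean()
    se = np.sqrt(v1 + v0)
    # Welch-Satterthwaite approximation to the degrees of freedom
    dof = (v1 + v0) ** 2 / (v1 ** 2 / (len(treated) - 1) + v0 ** 2 / (len(control) - 1))
    margin = stats.t.ppf(0.5 + level / 2, dof) * se
    return diff, diff - margin, diff + margin

# Toy data for illustration only; in the notebook, pass group1 and group0 instead
rng = np.random.default_rng(0)
toy_treated = rng.normal(1, 2, 45)
toy_control = rng.normal(0, 2, 55)
diff, lower, upper = welch_ci(toy_treated, toy_control)
print(f"Difference: {diff:.2f}, 95% CI: [{lower:.2f}, {upper:.2f}]")
```

The matching test in SciPy is `ttest_ind(group1, group0, equal_var=False)`; the call in the cell above uses the default `equal_var=True`, which assumes both groups share one variance.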

## 🧠 Step 5: Bayesian Analysis

We use PyMC to estimate the difference in outcome between groups and generate a **posterior distribution**.

We will:
- Estimate the posterior mean difference
- Calculate a 95% HDI (Highest Density Interval)
- Visualise the posterior
- Plot the sampling chains to assess convergence

In [None]:
import pymc as pm
import arviz as az

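# Integer index (0 or 1) that selects each participant's group mean
group_idx = df['group'].to_numpy()

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0, sigma=10, shape=2)   # one mean per group
    sigma = pm.HalfNormal("sigma", sigma=2)         # shared residual SD
    y_obs = pm.Normal("y_obs", mu=mu[group_idx], sigma=sigma, observed=df['outcome'].to_numpy())
    diff = pm.Deterministic("diff", mu[1] - mu[0])  # intervention minus control
    trace = pm.sample(1000, tune=1000, random_seed=42, return_inferencedata=True, progressbar=True)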

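# Set hdi_prob explicitly so the plot matches the 95% HDI printed below (ArviZ defaults to 94%)
az.plot_posterior(trace, var_names=["diff"], ref_val=0, hdi_prob=0.95)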
plt.title("Posterior Difference in Means")
plt.show()

# Posterior mean and HDI
posterior_diff = trace.posterior['diff'].values.flatten()
posterior_mean = posterior_diff.mean()
hdi_bounds = az.hdi(posterior_diff, hdi_prob=0.95)

print(f"Posterior mean difference: {posterior_mean:.2f}")
print(f"95% HDI: [{hdi_bounds[0]:.2f}, {hdi_bounds[1]:.2f}]")
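A posterior also lets you answer direct questions such as "how probable is it that the intervention mean exceeds the control mean?" by counting draws. The sketch below is a minimal illustration: `summarise_draws` is a helper introduced here, the interval it returns is an equal-tailed quantile interval (not the HDI computed above), and the `placeholder` draws are synthetic stand-ins, not output from the model.

```python
import numpy as np

def summarise_draws(draws, threshold=0.0, level=0.95):
    """Mean, P(draw > threshold), and an equal-tailed interval for a set of posterior draws."""
    draws = np.asarray(draws, dtype=float).ravel()
    tail = (1 - level) / 2
    lower, upper = np.quantile(draws, [tail, 1 - tail])
    return {
        "mean": draws.mean(),
        "prob_above": (draws > threshold).mean(),  # share of draws above the threshold
        "interval": (lower, upper),
    }

# Placeholder draws, NOT results from the model above; in the notebook use posterior_diff
placeholder = np.random.default_rng(1).normal(1.0, 0.4, 4000)
summary = summarise_draws(placeholder)
print(f"P(diff > 0): {summary['prob_above']:.3f}")
```

In the notebook, `summarise_draws(posterior_diff)` would give the posterior probability that the difference is positive. Changing `threshold` lets you ask about a minimum effect of practical interest instead of zero.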

### 🔁 Step 5b: Visualise Chains

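A trace plot helps us assess **MCMC convergence**. The chains should overlap and look like stationary noise with no trends or drifts, which suggests the sampler explored the posterior thoroughly.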

In [None]:
az.plot_trace(trace, var_names=["diff"])
plt.suptitle("Trace Plot for Posterior Difference", y=1.02)
plt.show()
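Visual inspection can be backed up with a numerical diagnostic. The sketch below implements the classic Gelman-Rubin statistic (R-hat) in plain NumPy to show the idea: it compares the variance between chains with the variance within chains, and values close to 1 suggest the chains agree. This is a simplified illustration on synthetic chains; the function name `gelman_rubin` is introduced here, and ArviZ's own `az.rhat` uses a more refined rank-normalised split version, so its numbers can differ slightly.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic (non-split) R-hat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_draws = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()             # W: mean within-chain variance
    between = n_draws * chains.mean(axis=1).var(ddof=1)    # B: variance of the chain means, scaled
    var_hat = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(var_hat / within)

# Synthetic chains for illustration only (not draws from the model above)
rng = np.random.default_rng(0)
mixed = rng.normal(0, 1, size=(4, 1000))                   # four chains sampling the same distribution
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])     # two chains shifted to a different region

print(f"Well-mixed chains: R-hat = {gelman_rubin(mixed):.3f}")
print(f"Disagreeing chains: R-hat = {gelman_rubin(stuck):.3f}")
```

In the notebook, `az.summary(trace, var_names=["diff"])` reports R-hat and effective sample sizes for the fitted model directly.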

## ✅ Summary

You’ve:
- Simulated a clinical trial dataset
- Analysed it with frequentist and Bayesian methods
- Visualised the posterior and trace

**Key Insight**:  
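The frequentist analysis gives a point estimate (Cohen’s d) and a p-value with very little computation. The Bayesian analysis takes longer to run but returns a full posterior distribution for the difference, from which you can read off an interval and the probability of any effect size of interest. Both can be valuable.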

---

### 🔁 Optional Exercises
1. Change the simulated effect size (e.g., Intervention mean = 2). Re-run and compare.
2. Add `age` as a covariate in the Bayesian model.
3. Create a scatterplot of `outcome` vs. `bmi`, coloured by group.
