# 🩺 4.7 Analysing a Simulated Clinical Trial

This notebook walks through the analysis of a simulated clinical trial, a common task in nutrition and health research. We’ll generate a dataset, create a "Table 1" to summarise baseline characteristics, inspect distributions, measure effect sizes using both frequentist and Bayesian methods, and visualise the data.

**Objectives**:
- Generate a simulated clinical trial dataset.
- Create a "Table 1" for baseline characteristics.
- Inspect distributions of key variables.
- Calculate effect sizes using frequentist and Bayesian approaches.
- Visualise the data with informative plots.

**Context**: Clinical trials often compare an intervention (e.g., a new diet) against a control (e.g., standard diet). We’ll simulate a trial with 100 participants, comparing a biomarker outcome between Control and Intervention groups.

<details><summary>Fun Fact</summary>
Clinical trials are like a hippo testing a new swimming spot—careful measurement and comparison ensure the best outcomes! 🦛
</details>

In [None]:
# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

# Define the module and dataset for this notebook
MODULE = '04_data_analysis'
DATASET = 'simulated_trial.csv'
BASE_PATH = '/content/data-analysis-toolkit-FNS'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

# Step 1: Attempt to clone the repository (automatic method)
# Note: If you encounter a cloning error (e.g., 'fatal: destination path already exists'),
#       reset the runtime (Runtime > Restart runtime) and run this cell again.
try:
    print('Attempting to clone repository...')
    if os.path.exists(BASE_PATH):
        print('Repository already exists, skipping clone.')
    else:
        !git clone https://github.com/ggkuhnle/data-analysis-toolkit-FNS.git
    
    # Debug: Print directory structure
    print('Listing repository contents:')
    !ls {BASE_PATH}
    print('Listing notebooks directory contents:')
    !ls {BASE_PATH}/notebooks
    
    # Check if the module directory exists
    if not os.path.exists(MODULE_PATH):
        raise FileNotFoundError(f'Module directory {MODULE_PATH} not found. Check the repository structure.')
    
    # Set working directory to the notebook's folder
    os.chdir(MODULE_PATH)
    
    # Verify dataset is accessible
    if os.path.exists(DATASET_PATH):
        print(f'Dataset found: {DATASET_PATH} 🦛')
    else:
        raise FileNotFoundError(f'Dataset {DATASET} not found after cloning.')
except Exception as e:
    print(f'Automatic setup failed: {e}')
    print('Falling back to manual upload option...')

    # Step 2: Manual upload option
    print(f'Please upload {DATASET} manually.')
    print('1. Click the "Choose Files" button below.')
    print(f'2. Select {DATASET} from your local machine.')
    print(f'3. Ensure the file is placed in notebooks/{MODULE}/data/')
    
    # Create the data directory if it doesn't exist
    os.makedirs('data', exist_ok=True)
    
    # Prompt user to upload the dataset
    uploaded = files.upload()
    
    # Check if the dataset was uploaded
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} to {DATASET_PATH} 🦛')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')

# Install required packages for this notebook
%pip install pandas numpy pymc arviz matplotlib seaborn scipy
print('Python environment ready.')

## 📊 Generate Simulated Clinical Trial Data

We’ll simulate a clinical trial dataset with 100 participants, split into Control and Intervention groups. The dataset includes:
- `participant_id`: Unique identifier (1 to 100).
- `age`: Age in years (mean=40, sd=10).
- `bmi`: Body Mass Index (mean=27, sd=4).
- `group`: 0 (Control) or 1 (Intervention).
- `outcome`: Change in biomarker level (e.g., ng/mL) after 12 weeks (Control: mean=0, sd=2; Intervention: mean=1, sd=2).

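With a true mean difference of 1 and a standard deviation of 2, this simulates a moderate treatment effect (a true standardised difference of 0.5), which we’ll analyse.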
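
As a quick sanity check on this design, the sketch below computes the true standardised effect implied by the simulation parameters and the approximate power of a two-sided two-sample t-test. It assumes exactly 50 participants per group and a 5% significance level. The helper name `two_sample_power` is hypothetical, introduced only for illustration.

```python
import numpy as np
from scipy import stats

def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, equal-variance two-sample t-test.

    d is the true standardised mean difference (hypothetical helper).
    """
    dof = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)          # non-centrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, dof)    # two-sided critical value
    # Probability that the test statistic falls in either rejection region
    return stats.nct.sf(t_crit, dof, ncp) + stats.nct.cdf(-t_crit, dof, ncp)

# True effect from the simulation settings: (1 - 0) / 2
true_d = (1 - 0) / 2
power = two_sample_power(true_d, n_per_group=50)
print(f"True standardised effect: {true_d:.2f}")
print(f"Approximate power with 50 per group: {power:.2f}")
```

The power comes out at roughly 0.7, below the conventional 80% target, so a single simulated trial of this size can miss p < 0.05 even though the effect is real.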

In [None]:
# Import libraries for data generation
import os  # needed below for os.makedirs if the setup cell was skipped
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Simulate data for 100 participants
n_participants = 100
data = {
    'participant_id': range(1, n_participants + 1),
    'age': np.random.normal(loc=40, scale=10, size=n_participants),  # Age ~ N(40, 10)
    'bmi': np.random.normal(loc=27, scale=4, size=n_participants),   # BMI ~ N(27, 4)
    'group': np.random.choice([0, 1], size=n_participants, p=[0.5, 0.5]),  # Each participant has a 50% chance of either arm, so group sizes need not be exactly equal
}

# Generate outcome based on group
# Control (group=0): outcome ~ N(0, 2)
# Intervention (group=1): outcome ~ N(1, 2)
data['outcome'] = np.where(
    data['group'] == 0,
    np.random.normal(loc=0, scale=2, size=n_participants),  # Control
    np.random.normal(loc=1, scale=2, size=n_participants)   # Intervention
)

# Create DataFrame
df = pd.DataFrame(data)

# Save to CSV (optional, since it's included in the repository)
os.makedirs('data', exist_ok=True)
df.to_csv('data/simulated_trial.csv', index=False)
print('Simulated dataset created and saved as data/simulated_trial.csv 🦛')

# Display first few rows
df.head()

## 📋 Table 1: Baseline Characteristics

In clinical trials, "Table 1" summarises baseline characteristics by group. We’ll calculate means and standard deviations for `age` and `bmi` in each group (Control and Intervention).
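
Published Table 1s usually also report the number of participants in each arm. Here is a minimal sketch of one way to add that, using pandas named aggregation. The `demo` frame and its seed are assumptions for illustration, not the trial data generated above.

```python
import numpy as np
import pandas as pd

# Stand-in data with the same columns as the simulated trial (illustrative only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'age': rng.normal(40, 10, 100),
    'bmi': rng.normal(27, 4, 100),
    'group': rng.integers(0, 2, 100),  # 0 = Control, 1 = Intervention
})

# Named aggregation: one row per group, one column per statistic
summary = demo.groupby('group').agg(
    n=('age', 'size'),
    age_mean=('age', 'mean'),
    age_sd=('age', 'std'),
    bmi_mean=('bmi', 'mean'),
    bmi_sd=('bmi', 'std'),
)

# Map group codes to labels explicitly rather than relying on row order
summary.index = summary.index.map({0: 'Control', 1: 'Intervention'})

# Combine mean and SD into the usual "mean (±SD)" presentation
table1_demo = pd.DataFrame({
    'n': summary['n'],
    'Age': [f"{m:.1f} (±{s:.1f})" for m, s in zip(summary['age_mean'], summary['age_sd'])],
    'BMI': [f"{m:.1f} (±{s:.1f})" for m, s in zip(summary['bmi_mean'], summary['bmi_sd'])],
}, index=summary.index)
print(table1_demo)
```

Mapping the index through a dictionary keeps the labels correct even if the group codes are stored in a different order.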

In [None]:
# Group by 'group' and calculate mean and std for age and bmi
table1 = df.groupby('group')[['age', 'bmi']].agg(['mean', 'std']).round(1)

# Rename columns for clarity
table1.columns = ['Age (Mean)', 'Age (SD)', 'BMI (Mean)', 'BMI (SD)']
table1.index = ['Control', 'Intervention']

# Format as a typical Table 1
table1_formatted = table1.copy()
table1_formatted['Age'] = table1_formatted.apply(lambda x: f"{x['Age (Mean)']} (±{x['Age (SD)']})", axis=1)
table1_formatted['BMI'] = table1_formatted.apply(lambda x: f"{x['BMI (Mean)']} (±{x['BMI (SD)']})", axis=1)
table1_formatted = table1_formatted[['Age', 'BMI']]

print('Table 1: Baseline Characteristics')
table1_formatted

## 📈 Inspect Distributions

Let’s visualise the distributions of `age`, `bmi`, and `outcome` to understand their shapes and differences between groups. We’ll use histograms and boxplots.
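
Plots are easier to read alongside a few numbers. The sketch below shows one possible numerical complement: the median, skewness, and a Shapiro-Wilk normality test for each group. It runs on stand-in data (the `demo` frame is an assumption for illustration); the same loop could be applied to `df` and to `age` or `bmi`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Stand-in outcome data for two groups of 50 (illustrative only)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'group': np.repeat([0, 1], 50),
    'outcome': np.concatenate([rng.normal(0, 2, 50), rng.normal(1, 2, 50)]),
})

rows = []
for g, sub in demo.groupby('group'):
    # Shapiro-Wilk tests the null hypothesis that the sample is normally distributed
    w_stat, p_norm = stats.shapiro(sub['outcome'])
    rows.append({
        'group': g,
        'n': len(sub),
        'median': sub['outcome'].median(),
        'skewness': stats.skew(sub['outcome']),
        'shapiro_p': p_norm,
    })

shape_summary = pd.DataFrame(rows).set_index('group')
print(shape_summary.round(3))
```

A small Shapiro-Wilk p-value suggests a departure from normality, but with around 50 participants per group the test has limited power, so it is best read together with the histograms and boxplots.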

In [None]:
# Import plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
sns.set_theme(style='whitegrid')

# Create a figure with subplots
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Histograms for age
sns.histplot(data=df[df['group'] == 0], x='age', color='blue', alpha=0.5, label='Control', ax=axes[0, 0])
sns.histplot(data=df[df['group'] == 1], x='age', color='orange', alpha=0.5, label='Intervention', ax=axes[0, 0])
axes[0, 0].set_title('Age Distribution')
axes[0, 0].legend()

# Histograms for bmi
sns.histplot(data=df[df['group'] == 0], x='bmi', color='blue', alpha=0.5, label='Control', ax=axes[0, 1])
sns.histplot(data=df[df['group'] == 1], x='bmi', color='orange', alpha=0.5, label='Intervention', ax=axes[0, 1])
axes[0, 1].set_title('BMI Distribution')
axes[0, 1].legend()

# Histograms for outcome
sns.histplot(data=df[df['group'] == 0], x='outcome', color='blue', alpha=0.5, label='Control', ax=axes[0, 2])
sns.histplot(data=df[df['group'] == 1], x='outcome', color='orange', alpha=0.5, label='Intervention', ax=axes[0, 2])
axes[0, 2].set_title('Outcome Distribution')
axes[0, 2].legend()

# Boxplots for age
sns.boxplot(data=df, x='group', y='age', ax=axes[1, 0])
axes[1, 0].set_title('Age by Group')
axes[1, 0].set_xticks([0, 1])
axes[1, 0].set_xticklabels(['Control', 'Intervention'])

# Boxplots for bmi
sns.boxplot(data=df, x='group', y='bmi', ax=axes[1, 1])
axes[1, 1].set_title('BMI by Group')
axes[1, 1].set_xticks([0, 1])
axes[1, 1].set_xticklabels(['Control', 'Intervention'])

# Boxplots for outcome
sns.boxplot(data=df, x='group', y='outcome', ax=axes[1, 2])
axes[1, 2].set_title('Outcome by Group')
axes[1, 2].set_xticks([0, 1])
axes[1, 2].set_xticklabels(['Control', 'Intervention'])

# Adjust layout
plt.tight_layout()
plt.show()

## 📏 Frequentist Effect Size: Cohen’s d

We’ll calculate the effect size of the intervention on the outcome using Cohen’s d, a common frequentist measure. Cohen’s d is the difference in means divided by the pooled standard deviation.
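
Written out, with subscripts $C$ and $I$ for Control and Intervention (notation chosen here for illustration), the calculation is:

```latex
d = \frac{\bar{x}_I - \bar{x}_C}{s_p},
\qquad
s_p = \sqrt{\frac{(n_C - 1)\,s_C^2 + (n_I - 1)\,s_I^2}{n_C + n_I - 2}}
```

Here $\bar{x}$, $s$ and $n$ are each group's sample mean, sample standard deviation and size, so a positive $d$ means the Intervention group has the higher mean outcome.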

In [None]:
# Import library for statistical calculations
from scipy.stats import ttest_ind

# Split data by group
control_outcome = df[df['group'] == 0]['outcome']
intervention_outcome = df[df['group'] == 1]['outcome']

# Calculate means and standard deviations
mean_control = control_outcome.mean()
mean_intervention = intervention_outcome.mean()
sd_control = control_outcome.std()
sd_intervention = intervention_outcome.std()

# Calculate pooled standard deviation
n_control = len(control_outcome)
n_intervention = len(intervention_outcome)
pooled_sd = np.sqrt(((n_control - 1) * sd_control**2 + (n_intervention - 1) * sd_intervention**2) / (n_control + n_intervention - 2))

# Calculate Cohen's d
cohens_d = (mean_intervention - mean_control) / pooled_sd

# Perform t-test for reference (Intervention first, so the sign of t matches Cohen's d)
t_stat, p_value = ttest_ind(intervention_outcome, control_outcome)

print(f"Cohen's d: {cohens_d:.2f}")
print(f"T-test: t={t_stat:.2f}, p={p_value:.3f}")
print("Interpretation (Cohen's conventional benchmarks for |d|):")
print(' - around 0.2: Small effect')
print(' - around 0.5: Medium effect')
print(' - around 0.8 or more: Large effect')

## 🧠 Bayesian Effect Size: Posterior Difference

Now, let’s estimate the effect size using a Bayesian approach. We’ll model the outcome as normally distributed with a separate mean for each group and a shared standard deviation, then calculate the posterior distribution of the difference between the Intervention and Control means.
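
A useful cross-check on MCMC output for a two-group normal model with a common variance is the closed-form result under the standard non-informative prior $p(\mu_0, \mu_1, \sigma^2) \propto 1/\sigma^2$. Under that prior, the posterior of the difference in means is a shifted, scaled Student-t distribution. The sketch below assumes that prior, not the weakly informative priors of a PyMC model, and uses stand-in data (`control` and `intervention` are illustrative arrays), so sampled results on the same data would be similar but not identical.

```python
import numpy as np
from scipy import stats

# Stand-in outcomes for two groups (illustrative only)
rng = np.random.default_rng(42)
control = rng.normal(0, 2, 50)
intervention = rng.normal(1, 2, 50)

n0, n1 = len(control), len(intervention)
dof = n0 + n1 - 2

# Posterior location: observed difference in sample means
diff_hat = intervention.mean() - control.mean()

# Posterior scale: pooled variance times (1/n0 + 1/n1), square-rooted
pooled_var = ((n0 - 1) * control.var(ddof=1) + (n1 - 1) * intervention.var(ddof=1)) / dof
scale = np.sqrt(pooled_var * (1 / n0 + 1 / n1))

# Posterior of (mu_1 - mu_0) under the non-informative prior
posterior_diff = stats.t(df=dof, loc=diff_hat, scale=scale)
lo, hi = posterior_diff.interval(0.95)     # central 95% credible interval
prob_positive = posterior_diff.sf(0)       # P(difference > 0 | data)

print(f"Posterior mean difference: {diff_hat:.2f}")
print(f"95% credible interval: [{lo:.2f}, {hi:.2f}]")
print(f"P(difference > 0): {prob_positive:.3f}")
```

Because this posterior is symmetric and unimodal, its central interval coincides with the highest-density interval, which makes it directly comparable to an HDI reported by ArviZ.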

In [None]:
# Import Bayesian libraries
import pymc as pm
import arviz as az

# Define the Bayesian model
with pm.Model() as model:
    # Priors for the means of Control (group 0) and Intervention (group 1)
    mu = pm.Normal('mu', mu=0, sigma=10, shape=2)  # Mean for each group
    
    # Prior for standard deviation
    sigma = pm.HalfNormal('sigma', sigma=2)
    
    # Likelihood: outcome ~ Normal(mu[group], sigma)
    # .values passes plain NumPy arrays for indexing and for the observed data
    y_obs = pm.Normal('y_obs', mu=mu[df['group'].values], sigma=sigma, observed=df['outcome'].values)
    
    # Calculate the difference between means
    diff = pm.Deterministic('diff', mu[1] - mu[0])
    
    # Sample from the posterior (seeded so results are reproducible)
    trace = pm.sample(1000, tune=1000, random_seed=42, return_inferencedata=True)

# Summarise the posterior difference
diff_mean = float(trace.posterior['diff'].mean())
# az.hdi returns an xarray Dataset here, so select 'diff' before taking .values
diff_hdi = az.hdi(trace, var_names=['diff'], hdi_prob=0.95)['diff'].values
print(f"Posterior mean difference (Intervention - Control): {diff_mean:.2f}")
print(f"95% HDI: [{diff_hdi[0]:.2f}, {diff_hdi[1]:.2f}]")

# Visualise the posterior difference
az.plot_posterior(trace, var_names=['diff'], ref_val=0)
plt.title('Posterior Distribution of Difference in Means')
plt.show()

## 📉 Visualise the Data

Let’s create additional visualisations to compare the outcome between groups, using a boxplot and a kernel density plot.

In [None]:
# Create a figure with subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Boxplot of outcome by group
sns.boxplot(data=df, x='group', y='outcome', ax=axes[0])
axes[0].set_title('Outcome by Group')
axes[0].set_xticks([0, 1])
axes[0].set_xticklabels(['Control', 'Intervention'])
axes[0].set_ylabel('Outcome (Change in Biomarker)')

# Kernel density plot of outcome by group
sns.kdeplot(data=df[df['group'] == 0], x='outcome', label='Control', color='blue', ax=axes[1])
sns.kdeplot(data=df[df['group'] == 1], x='outcome', label='Intervention', color='orange', ax=axes[1])
axes[1].set_title('Outcome Density by Group')
axes[1].legend()
axes[1].set_xlabel('Outcome (Change in Biomarker)')

# Adjust layout
plt.tight_layout()
plt.show()

## 🧪 Exercises

1. **Adjust the Simulation**: Modify the simulated data to increase the treatment effect (e.g., change the Intervention outcome mean to 2). Re-run the frequentist and Bayesian analyses. How does the effect size change?

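2. **Add a Covariate**: Include `age` as a covariate in the Bayesian model, e.g. add a slope `beta_age = pm.Normal('beta_age', 0, 1)` and use `mu[df['group'].values] + beta_age * (df['age'].values - df['age'].mean())` as the mean of the likelihood. Centring age keeps `mu` interpretable as the group means at the average age. Re-run the analysis and compare the posterior difference.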

3. **Visualisation**: Create a scatter plot of `outcome` vs. `bmi`, coloured by group. What patterns do you observe?

**Guidance**: Use the code above as a starting point. Experiment to deepen your understanding!
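
For Exercise 1, it can help to wrap the data generation in a function so the true effect becomes a single argument. The sketch below is one way to do that. The function name `simulate_trial` and its parameters are hypothetical, and it uses NumPy's `default_rng` rather than `np.random.seed`, so it will not reproduce the exact values of the dataset generated earlier.

```python
import numpy as np
import pandas as pd

def simulate_trial(n=100, effect=1.0, sd=2.0, seed=42):
    """Simulate a two-arm trial (hypothetical helper).

    effect is the true Intervention - Control difference in mean outcome.
    """
    rng = np.random.default_rng(seed)
    group = rng.integers(0, 2, size=n)  # 0 = Control, 1 = Intervention
    # Control outcomes are centred on 0, Intervention outcomes on `effect`
    outcome = rng.normal(loc=effect * group, scale=sd, size=n)
    return pd.DataFrame({
        'participant_id': np.arange(1, n + 1),
        'age': rng.normal(40, 10, size=n),
        'bmi': rng.normal(27, 4, size=n),
        'group': group,
        'outcome': outcome,
    })

# Same seed, different true effects
df_small = simulate_trial(effect=1.0)
df_large = simulate_trial(effect=2.0)

for label, d in [('effect=1', df_small), ('effect=2', df_large)]:
    means = d.groupby('group')['outcome'].mean()
    print(f"{label}: observed difference in means = {means[1] - means[0]:.2f}")
```

Assigning the returned frame to `df` and re-running the Cohen’s d and Bayesian cells lets you compare effect sizes across settings without editing the generation code each time.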

**Your Answers**:

**Exercise 1: Adjust the Simulation**  
[Write your code and observations here]

**Exercise 2: Add a Covariate**  
[Write your code and results here]

**Exercise 3: Visualisation**  
[Write your code and observations here]

## Conclusion

You’ve analysed a simulated clinical trial, creating a Table 1, inspecting distributions, measuring effect sizes with frequentist and Bayesian methods, and visualising the data. These techniques are essential for nutrition research and clinical studies.

**Next Steps**: Explore logistic regression and survival analysis in `4.6_logistic_and_survival.ipynb`.

**Resources**:
- [PyMC Documentation](https://www.pymc.io/)
- [ArviZ Documentation](https://arviz-devs.github.io/arviz/)
- Repository: [github.com/ggkuhnle/data-analysis-toolkit-FNS](https://github.com/ggkuhnle/data-analysis-toolkit-FNS)