
🧪 Clinical Trial: Hipponol Intervention Study
==============================================

This notebook explores data from a simulated 2-year clinical trial testing the effects of a novel nutrient, **Hipponol**, on blood pressure and survival.

# Methodology
## Study Design

- **Population**: 1000 participants, aged 40–70, male/female, smokers and non-smokers.
- **Design**: Randomised controlled trial (RCT), 1:1 allocation to Hipponol vs. placebo.
- **Blinding**: Double-blind.
- **Duration**: 2-year follow-up.
- **Outcomes**:
  - **Primary**: Survival at 2 years.
  - **Secondary**: Change in systolic blood pressure (SBP) from baseline to follow-up.

## Analysis Plan

This notebook will follow key elements of the **CONSORT guidelines** for RCT reporting.

### 🔍 Descriptive Analysis (Baseline)

- Generate a **Table 1** comparing baseline characteristics (age, sex, smoking status, baseline SBP) between groups.
- Use appropriate statistical tests:
  - Continuous variables: t-test or Mann–Whitney U
  - Categorical variables: Chi-squared test

### 📉 Outcome Analysis

Results are analysed using both **Frequentist** and **Bayesian** methods.

### Primary endpoint

- **Survival**:
  - Logistic regression
  - Kaplan–Meier survival curves by group.
  - Log-rank test and Cox proportional hazards model.


### Secondary endpoint

- **Blood Pressure**:
  - Compare change in SBP using paired and unpaired t-tests.
  - Adjust for baseline covariates using linear regression.


### 📌 Notes

- Missing data handling: listwise deletion for simplicity.
- Sensitivity analyses optional (not shown in basic notebook).
- Assumptions for all statistical models will be checked.
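
As a rough illustration of listwise deletion, the sketch below drops every participant with a missing value in any analysis column. The toy DataFrame and the `analysis_cols` list are assumptions for illustration; the column names mirror those used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the trial dataset (values are made up)
toy = pd.DataFrame({
    "Group": ["Control", "Hipponol", "Control", "Hipponol"],
    "Age": [52, 61, np.nan, 47],
    "Baseline_SBP": [138.0, 142.0, 150.0, np.nan],
    "Survival": [1, 1, 0, 1],
})

# Hypothetical list of columns needed for the analysis
analysis_cols = ["Group", "Age", "Baseline_SBP", "Survival"]

# Listwise deletion: keep only rows that are complete on every analysis column
complete = toy.dropna(subset=analysis_cols)

n_dropped = len(toy) - len(complete)
print(f"Dropped {n_dropped} of {len(toy)} participants with missing values")
```

Reporting how many participants were dropped in each group helps readers judge whether missingness could bias the comparison.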

<details>
<summary>🦛 Fun Fact</summary>
The nutrient Hipponol is purely fictional—but if hippos had trials, we bet they'd run them by CONSORT too!
</details>


In [None]:
# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

# Define the module and dataset for this notebook
MODULE = '10_mini_projects'  
DATASET = 'hipponol_trial_data.csv'
BASE_PATH = '/content/data-analysis-toolkit-FNS'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

# Step 1: Attempt to clone the repository (automatic method)
# Note: If you encounter a cloning error (e.g., 'fatal: destination path already exists'),
#       reset the runtime (Runtime > Restart runtime) and run this cell again.
try:
    print('Attempting to clone repository...')
    if os.path.exists(BASE_PATH):
        print('Repository already exists, skipping clone.')
    else:
        !git clone https://github.com/ggkuhnle/data-analysis-toolkit-FNS.git
    
    # Debug: Print directory structure
    print('Listing repository contents:')
    !ls {BASE_PATH}
    print(f'Listing notebooks directory contents:')
    !ls {BASE_PATH}/notebooks
    
    # Check if the module directory exists
    if not os.path.exists(MODULE_PATH):
        raise FileNotFoundError(f'Module directory {MODULE_PATH} not found. Check the repository structure.')
    
    # Set working directory to the notebook's folder
    os.chdir(MODULE_PATH)
    
    # Verify dataset is accessible
    if os.path.exists(DATASET_PATH):
        print(f'Dataset found: {DATASET_PATH} 🦛')
    else:
        # Raise with a message so the fallback below reports what went wrong
        raise FileNotFoundError(f'Dataset {DATASET} not found after cloning.')
except Exception as e:
    print(f'Cloning failed: {e}')
    print('Falling back to manual upload option...')

    # Step 2: Manual upload option
    print(f'Please upload {DATASET} manually.')
    print(f'1. Click the "Choose Files" button below.')
    print(f'2. Select {DATASET} from your local machine.')
    print(f'3. Ensure the file is placed in notebooks/{MODULE}/data/')
    
    # Create the data directory if it doesn't exist
    os.makedirs('data', exist_ok=True)
    
    # Prompt user to upload the dataset
    uploaded = files.upload()
    
    # Check if the dataset was uploaded
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} to {DATASET_PATH} 🦛')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')

# Install required packages for this notebook
%pip install pandas numpy
print('Python environment ready.')

# 🛠️ Step 1: Setting Up Python for the Hipponol Trial

This section ensures you're ready to run the analyses in Google Colab or a local environment.

## ✅ Requirements


You need the following Python packages:
- `pandas`
- `numpy`
- `matplotlib`
- `seaborn`
- `scipy`
- `statsmodels` (for logistic regression)
- `lifelines` (for survival analysis)
- `pymc` and `arviz` (for Bayesian analysis, optional)


In [None]:
# Setup

%pip install lifelines
%pip install pymc
%pip install arviz


# Import libraries for data manipulation, visualisation, and statistical analysis
import pandas as pd            # For data manipulation and analysis
import numpy as np             # For numerical operations
import matplotlib.pyplot as plt # For plotting
import seaborn as sns          # For enhanced statistical visualisations
from scipy.stats import ttest_ind, chi2_contingency  # For hypothesis testing

# Import Bayesian modelling and visualisation libraries
import pymc as pm              # For Bayesian statistical modelling
import arviz as az             # For Bayesian inference visualisation

# Import survival analysis tools
from lifelines import KaplanMeierFitter, CoxPHFitter
import scipy.stats as stats    # For distributions and confidence intervals

import patsy                   # For formula-based design matrices





# 📊 Baseline Comparison: Table 1

This section generates a **Table 1** summary of participant characteristics by study group (Hipponol vs Control), including:

- Means and standard deviations for continuous variables  
- Frequencies and percentages for categorical variables  
- Statistical comparison (t-test or chi-squared)  

In [None]:
# Load the dataset (adjust path as needed)
df = pd.read_csv("data/hipponol_trial_data.csv")

# Create a copy to avoid modifying original
df1 = df.copy()

# Display first few rows
df1.head()


## 🔍 Table 1: Summary Function

This cell builds a summary table comparing baseline characteristics across treatment groups using **standard Python libraries**:

- **Numerical variables** (e.g., age, baseline BP) are compared using independent t-tests. We report mean ± standard deviation.
- **Categorical variables** (e.g., gender, smoker) are compared using chi-squared tests, and shown as counts with percentages.

The function `create_table1_clean()`:

- Takes the dataset and grouping column (e.g. `'Group'`) as input.
- Returns a DataFrame summarising each variable by group, along with p-values for between-group comparisons.

This approach is useful when you want:

- Full control over what’s included in the table.
- Clear, interpretable formatting for reports or publications.
- To avoid installing additional packages (e.g. `tableone`).

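🛠️ Reporting baseline characteristics by group follows the **CONSORT guidelines**. Note that CONSORT discourages significance tests of baseline differences in randomised trials, so the p-values here are best read as a teaching aid.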

In [None]:
def create_table1_clean(data, group_col):
    """Create a clean Table 1: baseline characteristics by group."""
    if group_col not in data.columns:
        raise KeyError(f"Grouping column '{group_col}' not found in DataFrame.")
    
    group_names = data[group_col].unique()
    if len(group_names) != 2:
        raise ValueError("This function currently supports exactly two groups.")

    group1, group2 = group_names
    summary = []

    numeric_cols = ['Age', 'Baseline_SBP']
    cat_cols = ['Sex', 'SmokingStatus']

    # Continuous variables
    for col in numeric_cols:
        vals1 = data[data[group_col] == group1][col].dropna()
        vals2 = data[data[group_col] == group2][col].dropna()
        mean1, std1 = vals1.mean(), vals1.std()
        mean2, std2 = vals2.mean(), vals2.std()
        t_stat, p_val = ttest_ind(vals1, vals2)

        summary.append({
            'Variable': col,
            f'{group1} (Mean ± SD)': f'{mean1:.1f} ± {std1:.1f}',
            f'{group2} (Mean ± SD)': f'{mean2:.1f} ± {std2:.1f}',
            'p-value': f'{p_val:.3f}'
        })

    # Categorical variables
    for col in cat_cols:
        cont_table = pd.crosstab(data[col], data[group_col])
        chi2, p_val, _, _ = chi2_contingency(cont_table)

        for i, cat in enumerate(cont_table.index):
            row = {
                'Variable': f"{col} = {cat}",
                f'{group1} (Count %)': f"{cont_table.loc[cat, group1]} ({cont_table.loc[cat, group1] / cont_table[group1].sum() * 100:.1f}%)",
                f'{group2} (Count %)': f"{cont_table.loc[cat, group2]} ({cont_table.loc[cat, group2] / cont_table[group2].sum() * 100:.1f}%)",
                'p-value': f"{p_val:.3f}" if i == 0 else ''
            }
            summary.append(row)

    return pd.DataFrame(summary)



# Create and show Table 1

table1_df = create_table1_clean(df1, group_col='Group')
display(table1_df)


## 📊 Exploring Distributions
Before diving into inferential statistics, it's crucial to understand the **distribution** of our variables. This helps us:

- Check whether groups are comparable
- Identify outliers or skewed data
- Decide which statistical tests are appropriate (e.g., parametric vs. non-parametric)

### 🎯 Focus: Baseline Blood Pressure and Age

For this study, we’ll explore:

- **Age**
- **Baseline Systolic Blood Pressure (SBP)**

These continuous variables should ideally be similarly distributed across the control and intervention groups if randomisation was successful.

--- 

## 🔍 Visualising Distributions

We can use the following plots to explore the distribution:

- **Histogram**: Shows the shape of the distribution (e.g. normal, skewed)
- **Boxplot**: Useful for comparing groups and spotting outliers
- **Violin plot (*optional*)**: Combines boxplot with a kernel density estimate

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set plotting style
sns.set(style="whitegrid")

# Plot: Distribution of Age
plt.figure(figsize=(10, 4))
sns.histplot(data=df1, x='Age', hue='Group', kde=True, bins=20)
plt.title('Age Distribution by Group')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

# Plot: Baseline Systolic BP
plt.figure(figsize=(10, 4))
sns.boxplot(data=df1, x='Group', y='Baseline_SBP')
plt.title('Baseline Systolic Blood Pressure by Group')
plt.ylabel('Systolic BP (mmHg)')
plt.xlabel('')
plt.show()

# Plot 3: Violin plot of Baseline Systolic BP
plt.figure(figsize=(10, 4))
sns.violinplot(data=df1, x='Group', y='Baseline_SBP', inner='box')
plt.title('Violin Plot: Baseline Systolic Blood Pressure by Group')
plt.ylabel('Systolic BP (mmHg)')
plt.xlabel('')
plt.show()

### 💡 Interpretation Tips

- Look for symmetry or skewness.

- Check for group overlap—similar distributions are a good sign of effective randomisation.

- Use this to support your interpretation of Table 1.
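
One way to back up the visual check with numbers is sketched below, using simulated values rather than the trial data. It reports skewness and then runs both a t-test and a Mann–Whitney U test so the two can be compared; the group means, spreads and sample sizes are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated baseline SBP for two groups (made-up parameters)
sbp_control = rng.normal(loc=140, scale=15, size=500)
sbp_hipponol = rng.normal(loc=140, scale=15, size=500)

# Skewness near 0 suggests a roughly symmetric distribution
for name, values in [("Control", sbp_control), ("Hipponol", sbp_hipponol)]:
    print(f"{name}: skewness = {stats.skew(values):.2f}")

# Parametric comparison (assumes approximate normality)
t_res = stats.ttest_ind(sbp_control, sbp_hipponol)

# Non-parametric alternative (no normality assumption)
u_res = stats.mannwhitneyu(sbp_control, sbp_hipponol, alternative="two-sided")

print(f"t-test p = {t_res.pvalue:.3f}, Mann-Whitney U p = {u_res.pvalue:.3f}")
```

If the two tests disagree markedly, that is usually a hint to look more closely at skewness or outliers before choosing one.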

---

# Primary endpoint
## 🧬 Analysing Survival Outcomes

The primary endpoint of the Hipponol trial is **2-year survival**, coded as:

- 1 = **Survived**
- 0 = **Died during follow-up**

Survival can be analysed either as a binary outcome (survived or not at a fixed time point) or as time-to-event data (e.g. Kaplan–Meier, Cox models). We first treat it as **binary survival** at 2 years, for which logistic regression is appropriate; the time-to-event methods follow later.


### 🧪 Quick Check: Chi-squared Test for Survival
Before diving into regression models, we can start with a simple comparison of survival rates between the Hipponol and Control groups using a Chi-squared test.

✅ Why use it?

- It's a straightforward way to compare **proportions** between two groups.
- Useful for **binary outcomes** like survival (1 = survived, 0 = died).
- Helps confirm whether survival rates differ **significantly** between the two groups.

🧾 Method

We use a **contingency table** to count survivors and non-survivors in each group, then apply the **Chi-squared test of independence**:

In [None]:
from scipy.stats import chi2_contingency

contingency = pd.crosstab(df1['Group'], df1['Survival'])
chi2, p, dof, expected = chi2_contingency(contingency)

print("Contingency Table:")
print(contingency)
print(f"Chi-squared test p-value: {p:.3f}")


📌 Interpretation

- A **small p-value** (typically < 0.05) suggests a **statistically significant** difference in survival between groups.
- This test **does not account for confounders** like age or smoking — use logistic regression for that.

## 🔎 Logistic Regression

Once we’ve done a basic comparison (like a Chi-squared test), the next step is a **logistic regression**. This lets us model the probability of survival while also:

- Controlling for other variables (e.g. age, smoking)
- Estimating odds ratios, which are easier to interpret than raw coefficients

### 💡 What is logistic regression?

Logistic regression is used when your outcome is **binary** — in this case, survival (`1`) vs. death (`0`). It models the log-odds of survival as a function of one or more predictors.

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit a logistic regression model
logit_model = smf.logit('Survival ~ Group', data=df1).fit()

# Print summary
print(logit_model.summary())

🔁 From log-odds to odds ratio

The model estimates **log-odds**, which are not intuitive. We convert them to **odds ratios** using the exponential function:

In [None]:
odds_ratios = logit_model.params.apply(np.exp)
print(odds_ratios)

An odds ratio (OR):

- = 1 means no effect
- > 1 means increased odds (e.g. Hipponol improves survival)
- < 1 means reduced odds (e.g. Hipponol worsens survival)

### 📌 Example Interpretation

If `Group[T.Hipponol]` has an OR of `1.6`:

> "Participants in the Hipponol group had 60% higher odds of surviving the 2-year follow-up compared to the control group."
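
An odds ratio is not a ratio of probabilities, which is easy to overlook. The short sketch below converts between probabilities and odds; the 80% control-group survival is an assumed figure chosen purely for illustration.

```python
def prob_to_odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def odds_to_prob(odds):
    """Convert odds back to a probability."""
    return odds / (1 + odds)

p_control = 0.80          # assumed control-group survival (illustrative)
odds_ratio = 1.6          # example OR from the text

odds_control = prob_to_odds(p_control)       # about 4.0
odds_hipponol = odds_control * odds_ratio    # about 6.4
p_hipponol = odds_to_prob(odds_hipponol)

print(f"Control:  p = {p_control:.3f}, odds = {odds_control:.2f}")
print(f"Hipponol: p = {p_hipponol:.3f}, odds = {odds_hipponol:.2f}")
```

Under this assumption, 60% higher odds corresponds to survival rising from 80% to about 86%, not to a 60% increase in the probability of survival.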

### 📎 Optional: Add other variables

You can also control for age, sex, and smoking:

In [None]:
logit_full = smf.logit('Survival ~ Group + Age + C(Sex) + C(SmokingStatus)', data=df1).fit()
print(logit_full.summary())

This gives you adjusted odds ratios — a better estimate when multiple factors influence the outcome.

### 📏 Confidence Intervals for Odds Ratios

When you run a logistic regression, the model provides **standard errors** for the coefficients. You can use these to calculate a 95% confidence interval (CI) for each odds ratio:

In [None]:
# Get confidence intervals for the log-odds
conf = logit_model.conf_int()
conf.columns = ['2.5%', '97.5%']

# Add odds ratios
conf_exp = np.exp(conf)
conf_exp['Odds Ratio'] = np.exp(logit_model.params)

print(conf_exp)

### 🧠 How to interpret this:

If the 95% CI for an odds ratio:
- Includes 1, the result is **not statistically significant** at p < 0.05
- Excludes 1, then the effect is **statistically significant**

### 📌 Example:

| Variable         | OR   | 2.5% | 97.5% |
|------------------|------|------|-------|
| Group[T.Hipponol] | 1.62 | 1.02 | 2.58  |

---

> "The odds of survival in the Hipponol group are 62% higher than in the control group (OR 1.62, 95% CI: 1.02–2.58)."

This suggests a **statistically significant effect**, as the interval does not include 1.
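
The interval reported by `conf_int()` can also be reproduced by hand from a coefficient and its standard error, using a normal approximation on the log-odds scale. In this sketch the standard error (0.237) is an assumed value chosen so that the result roughly matches the example table; it is not taken from the trial data, and `odds_ratio_ci` is a helper name made up for illustration.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) for a log-odds coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

beta = math.log(1.62)   # log-odds corresponding to the example OR
se = 0.237              # assumed standard error (illustrative)

or_, lower, upper = odds_ratio_ci(beta, se)
print(f"OR = {or_:.2f}, 95% CI: {lower:.2f} to {upper:.2f}")
```

The interval is symmetric on the log-odds scale but not on the odds-ratio scale, which is why the upper limit sits further from the estimate than the lower one.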

## Bayesian Logistic Regression – Survival

In this section, we apply **Bayesian logistic regression** to evaluate whether assignment to the **Hipponol** group affects the **probability of survival** compared to the **Control** group.

This analysis complements the Frequentist approach by offering a full **posterior distribution** of the treatment effect, providing a probabilistic interpretation of results rather than just point estimates and p-values.

### 🧮 Why Bayesian?

Bayesian modelling is particularly useful when:

- You want **probabilistic interpretations** of effect size.
- You have **prior knowledge** you wish to incorporate. 
- You want a **posterior distribution** instead of a point estimate.
- You're dealing with **small samples**, where Frequentist power is limited.

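In our case, we use wide, **weakly informative** priors (Normal(0, 10) on the log-odds scale) to express general uncertainty without strong assumptions. A standard deviation of 10 on this scale is very diffuse, so the data dominate the posterior.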

---

### 🎯 Goal

We aim to:

- Estimate the **posterior probability distribution** of the survival effect due to Hipponol.
- Quantify the **uncertainty** in this effect.
- Calculate the **posterior odds ratio** for survival in the Hipponol group compared to Control.

Bayesian inference allows us to express the result as:

> _“There is a XX% probability that Hipponol increases survival compared to control.”_

This is fundamentally different from the Frequentist interpretation, where a p-value gives the probability of observing data at least as extreme as ours if the null hypothesis were true.

---

### 📦 Define Predictor and Outcome

To build our model, we first define the **predictor** (treatment group) and **outcome** (survival).

- `X`: A binary predictor indicating group membership  
  `0 = Control`, `1 = Hipponol`

- `y`: A binary outcome  
  `0 = Did not survive`, `1 = Survived`

This mirrors the structure required for **Bernoulli trials**, where each observation represents a binary outcome.


In [None]:
# Convert group to numeric: 0 = Control, 1 = Hipponol
X = (df['Group'] == 'Hipponol').astype(int).values

# Survival outcome: already coded as 0 = no, 1 = yes
y = df['Survival'].astype(int).values

Next, we build and fit the Bayesian model using PyMC.

## 🔧 Step-by-Step: Bayesian Logistic Regression in PyMC

We now construct and sample from a **Bayesian logistic regression model** using `PyMC`. This model estimates the effect of group assignment (Hipponol vs Control) on the probability of survival.

---

### 📐 Model Structure

1. **Data**:  
   - `x_shared`: Encodes treatment group (0 = Control, 1 = Hipponol)  
   - `y_shared`: Binary survival outcome (0 = no, 1 = yes)

2. **Priors**:  
   - `Intercept`: Prior belief about survival probability in the Control group.  
     → Set as `Normal(0, 10)` to reflect wide uncertainty.
   - `Beta_Group`: Prior for the effect of the Hipponol group.  
     → Also `Normal(0, 10)` (can be tightened if prior knowledge exists).

3. **Logistic function**:  
   - Computes log-odds: `logit_p = intercept + beta_group * x_shared`  
   - Converts to probability via sigmoid: `p = sigmoid(logit_p)`

4. **Likelihood**:  
   - Models observed survival data as a Bernoulli trial with success probability `p`.

5. **Sampling**:  
   - We use Hamiltonian Monte Carlo (HMC) via the NUTS sampler.  
   - 4 chains × 2000 draws each, with 1000 tuning samples per chain.
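
Written out, the structure above corresponds to the following model, where \( x_i \) is the group indicator and \( y_i \) the survival outcome for participant \( i \). The symbols \( \alpha \) (for `Intercept`) and \( \beta \) (for `Beta_Group`) are notation introduced here for illustration.

\[
\begin{aligned}
\alpha &\sim \text{Normal}(0, 10) \\
\beta &\sim \text{Normal}(0, 10) \\
\text{logit}(p_i) &= \alpha + \beta x_i \\
y_i &\sim \text{Bernoulli}(p_i)
\end{aligned}
\]

Exponentiating \( \beta \) gives the odds ratio for survival in the Hipponol group relative to Control.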

---

### 📦 PyMC Model Code


In [None]:
import pymc as pm
import arviz as az

with pm.Model() as model:
    # Data containers (for flexibility in updating)
    x_shared = pm.Data("x_shared", X)
    y_shared = pm.Data("y_shared", y)

    # Priors
    intercept = pm.Normal("Intercept", mu=0, sigma=10)
    beta_group = pm.Normal("Beta_Group", mu=0, sigma=10)

    # Logistic function
    logit_p = intercept + beta_group * x_shared
    p = pm.Deterministic("p", pm.math.sigmoid(logit_p))

    # Likelihood
    y_obs = pm.Bernoulli("y_obs", p=p, observed=y_shared)

    # Sampling from the posterior
    # chains=4 is set explicitly; the default number of chains depends on available cores
    trace = pm.sample(2000, tune=1000, chains=4, target_accept=0.95, return_inferencedata=True)


🔍 Next Steps
After sampling, we will:

- **Inspect the posterior** of Beta_Group to assess the treatment effect.
- **Convert** this to an **odds ratio** (by exponentiating Beta_Group).
- **Visualise** the posterior distribution using `arviz`.

In [None]:
# Plot posterior for Beta_Group
az.plot_posterior(trace, var_names=['Beta_Group'], ref_val=0)
plt.title("Posterior Distribution: Group Effect (log-odds)")
plt.show()

# Convert to odds ratio
odds_ratios = np.exp(trace.posterior['Beta_Group'].values.flatten())

# Plot posterior odds ratio
az.plot_posterior(odds_ratios, ref_val=1)
plt.title("Posterior Odds Ratio: Hipponol vs Control")
plt.xlabel("Odds Ratio")
plt.show()

### 🧠 Interpretation

- If the **posterior** of Beta_Group is mostly above 0 → **Hipponol increases odds of survival**.
- If the **95% credible interval (HDI)** for odds_ratios excludes 1 → **strong evidence for a difference**.

You can compute az.hdi() for numeric intervals:

In [None]:
az.hdi(odds_ratios, hdi_prob=0.95)

✅ This Bayesian approach allows for **direct probability statements**:

“Given the model and priors, there is a 95% probability that the odds ratio for survival (Hipponol vs Control) lies between X and Y.”
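
The probability in such a statement is simply the share of posterior draws that satisfy it. The sketch below uses simulated draws as a stand-in for the sampler output (the mean and spread are invented); in the notebook, `beta_draws` would instead be `trace.posterior['Beta_Group'].values.flatten()`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for posterior draws of Beta_Group (log-odds scale); values are made up
beta_draws = rng.normal(loc=0.45, scale=0.25, size=8000)

# Posterior probability that Hipponol increases the odds of survival (beta > 0)
prob_benefit = (beta_draws > 0).mean()

# Posterior probability of an odds ratio above 1.2 (a hypothetical threshold)
prob_or_above_1_2 = (np.exp(beta_draws) > 1.2).mean()

print(f"P(beta > 0) = {prob_benefit:.3f}")
print(f"P(OR > 1.2) = {prob_or_above_1_2:.3f}")
```

Any threshold of clinical interest can be substituted for 1.2; the calculation stays the same.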


## 📈 Kaplan–Meier Survival Curves

To visualise how survival changes over time in each group, we use **Kaplan–Meier survival estimates**. These curves provide a non-parametric estimate of the survival function, which reflects the probability of survival beyond a given time point.

### 🔍 Why Kaplan–Meier?

Kaplan–Meier curves are ideal for visualising time-to-event data because they:

- Handle censored data (participants who didn’t experience the event by the end of follow-up)
- Show the proportion of participants surviving over time
- Allow intuitive comparison between groups (e.g. Hipponol vs Control)

### 📘 Interpretation

Each drop in the curve corresponds to an event (e.g. death). A steeper decline indicates more frequent events, whereas a flat curve suggests stable survival.

If the curves for the two groups diverge:

- **Wider separation** may indicate a meaningful difference in survival.
- **Overlap** suggests similar survival patterns.
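
To make the estimate less of a black box, here is a minimal hand-rolled version of the product-limit calculation on toy data. The function name `kaplan_meier` and the toy follow-up times are assumptions for illustration; `lifelines` remains the tool to use in practice. Note that the event flag marks deaths (1 = died, 0 = censored).

```python
import numpy as np

def kaplan_meier(durations, events):
    """Return (event_times, survival_probabilities) via the product-limit formula."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(durations[events])   # sorted times with at least one death
    surv = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)              # still under observation at t
        deaths = np.sum((durations == t) & events)    # deaths occurring at t
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)

# Toy data: months of follow-up; event 1 = death, 0 = censored
toy_durations = [3, 5, 5, 8, 12, 12, 24, 24]
toy_events = [1, 1, 0, 1, 0, 1, 0, 0]

times, surv = kaplan_meier(toy_durations, toy_events)
for t, s in zip(times, surv):
    print(f"t = {t:>4.0f} months: S(t) = {s:.3f}")
```

Censored participants never cause a drop in the curve, but they leave the risk set, so later deaths produce larger steps.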


In [None]:
# Split data
df_control = df[df['Group'] == 'Control']
df_hipponol = df[df['Group'] == 'Hipponol']

# Initialise fitter
kmf_control = KaplanMeierFitter()
kmf_hipponol = KaplanMeierFitter()

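# Fit and plot on a shared set of axes
fig, ax = plt.subplots(figsize=(10, 6))

# 'Survival' is coded 1 = survived, 0 = died. lifelines expects event_observed
# to flag the event itself (death), so we pass Survival == 0.
kmf_control.fit(durations=df_control['Time_to_Event'],
                event_observed=(df_control['Survival'] == 0),
                label='Control')
kmf_control.plot_survival_function(ax=ax, ci_show=True)

kmf_hipponol.fit(durations=df_hipponol['Time_to_Event'],
                 event_observed=(df_hipponol['Survival'] == 0),
                 label='Hipponol')
kmf_hipponol.plot_survival_function(ax=ax, ci_show=True)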

plt.title("Kaplan–Meier Survival Curves by Group")
plt.xlabel("Time to Event (months)")
plt.ylabel("Survival Probability")
plt.grid(True)
plt.tight_layout()
plt.show()



## 🧪 Compare groups statistically – Log-rank test
This checks whether the survival distributions are significantly different between the groups.

### 🔍 What is the Log-Rank Test?
The **log-rank test** is a non-parametric statistical test used to compare the survival distributions of two or more groups. It's particularly useful in clinical trials or studies with time-to-event data.

✅ What it does:

- Compares the observed number of events in each group to the expected number under the null hypothesis (that all groups have the same survival curve).
- It handles **censored data** properly.
- Outputs a **p-value**: if this is small (typically < 0.05), it suggests the survival curves differ significantly.

In [None]:

from lifelines.statistics import logrank_test

# Filter your data
df_control = df[df["Group"] == "Control"]
df_hipponol = df[df["Group"] == "Hipponol"]

# Run the log-rank test
# 'Survival' is coded 1 = survived, so the event (death) is Survival == 0
results = logrank_test(
    df_control["Time_to_Event"], df_hipponol["Time_to_Event"],
    event_observed_A=(df_control["Survival"] == 0),
    event_observed_B=(df_hipponol["Survival"] == 0)
)

# Display the results
print("📈 Log-rank test between Control and Hipponol:")
print(f"Test statistic: {results.test_statistic:.3f}")
print(f"p-value: {results.p_value:.4f}")

# Optional: simple interpretation
if results.p_value < 0.05:
    print("⚠️ There is a significant difference in survival between groups.")
else:
    print("✅ No significant difference in survival between groups.")


### 🧪 When to Use:
Use this when:

- You’ve plotted Kaplan-Meier curves and want to **test if the difference is significant**.
- You’re evaluating **intervention effects** in RCTs or cohort studies.
- You want a simple, unadjusted test that does not require fitting a regression model (unlike Cox models, which also let you adjust for covariates).


## Cox Proportional Hazards Model: A Detailed Guide

### ✅ What Is the Cox Proportional Hazards Model?

The **Cox PH model** is a **semi-parametric regression model** used in survival analysis. It relates the time that passes before an event occurs to **one or more covariates**.

### Core idea:
It models the **hazard rate** at time *t* as:

\[
h(t|X) = h_0(t) \cdot \exp(\beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p)
\]

- \( h_0(t) \) is the **baseline hazard** (unspecified)
- \( X_i \) are the covariates (e.g. Group, Age, Sex)
- \( \beta_i \) are coefficients estimated from the data

The model makes the **proportional hazards assumption**: the ratio of hazards between groups is constant over time.

---

## 🛠️ How to Fit It in Python (with `lifelines`)

In [None]:

from lifelines import CoxPHFitter

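# Select relevant columns only
df_cox = df[["Age", "Sex", "SmokingStatus", "Group", "Time_to_Event", "Survival"]].copy()

# lifelines needs an event indicator (1 = death observed, 0 = censored).
# 'Survival' is coded 1 = survived, so invert it and drop the original column.
df_cox["Death"] = 1 - df_cox["Survival"]
df_cox = df_cox.drop(columns="Survival")

# One-hot encode categorical variables, dropping first category to avoid multicollinearity
df_cox = pd.get_dummies(df_cox, drop_first=True, dtype=float)

# Fit the Cox model
cph = CoxPHFitter()
cph.fit(df_cox, duration_col="Time_to_Event", event_col="Death")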

# Print the summary
cph.print_summary()



### 📏 How to Interpret the Output

The summary table includes:

- `coef`: log of the hazard ratio (HR)
- `exp(coef)`: the **hazard ratio**
- `p`: p-value testing if the coefficient is significantly different from 0
- `-log2(p)`: significance in a log-scale
- `95% CI`: confidence intervals for the HR

### Example Interpretation:

| Variable              | exp(coef) | p-value | Interpretation |
|----------------------|-----------|---------|----------------|
| Group_Hipponol       | 0.65      | 0.001   | Hipponol **reduces** the hazard by 35% vs. Control |
| Age                  | 1.02      | 0.050   | Each year of age **increases** hazard by 2% |
| Sex_Male             | 1.10      | 0.200   | No significant difference vs. Female |
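
The percentages in the table follow directly from the hazard ratios. A small sketch of the arithmetic, using the example values above (the helper name `pct_change_in_hazard` is made up for illustration):

```python
def pct_change_in_hazard(hazard_ratio):
    """Percentage change in hazard implied by a hazard ratio."""
    return (hazard_ratio - 1) * 100

hr_group = 0.65   # example HR for Group_Hipponol
hr_age = 1.02     # example HR per additional year of age

print(f"Hipponol vs Control: {pct_change_in_hazard(hr_group):+.0f}% hazard")
print(f"Per year of age:     {pct_change_in_hazard(hr_age):+.0f}% hazard")

# Hazard ratios multiply, so a 10-year age difference compounds
hr_age_10y = hr_age ** 10
print(f"Per 10 years of age: HR = {hr_age_10y:.2f}")
```

Because effects are multiplicative on the hazard scale, a 10-year age difference gives a hazard ratio of about 1.22 rather than 1.20.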

---

### ⚠️ Test the Proportional Hazards (PH) Assumption

This is **essential** because if PH is violated, your hazard ratios are **not valid over time**.


In [None]:
# Check proportional hazards assumption
cph.check_assumptions(df_cox, p_value_threshold=0.05, show_plots=True)


This call will:

- Print a test result for each covariate
- Show **Schoenfeld residual** plots:
  - If residuals show **systematic trends** over time, PH is **likely violated**
  - Flat residuals = good

### If PH is violated:
- Add **interaction terms with time**
- Use **stratified Cox models**
- Split time into intervals (time-dependent covariates)






### 🧪 (Optional) Predict or Visualise

You can predict survival curves for a given profile:

In [None]:
import matplotlib.pyplot as plt

# Create a base profile (mean age, female, non-smoker)
base = pd.DataFrame({
    'Age': [df['Age'].mean()],
    'Sex_Male': [0],
    'SmokingStatus_Smoker': [0],
    'Group_Hipponol': [0]  # Control group
})

# Copy and modify for Hipponol
hipponol = base.copy()
hipponol['Group_Hipponol'] = 1  # Switch to Hipponol group

# Predict survival curves
sf_control = cph.predict_survival_function(base)
sf_hipponol = cph.predict_survival_function(hipponol)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(sf_control, label="Control")
plt.plot(sf_hipponol, label="Hipponol")
plt.title("Predicted Survival Curve for Control vs. Hipponol")
plt.xlabel("Time")
plt.ylabel("Survival Probability")
plt.legend()
plt.grid(True)
plt.show()



# 🧪 Secondary Endpoint: Blood Pressure Change

This section explores how to compare systolic blood pressure (SBP) outcomes between the treatment group (`Hipponol`) and the control group. SBP is treated as a continuous variable, and we are particularly interested in:

- Whether there are baseline differences
- Whether the groups differ at follow-up
- Whether **Hipponol** leads to a greater reduction in SBP than the control

---

## 🎨 Visual Exploration of Distributions

Before diving into formal tests, it's useful to examine the data visually. You can compare the SBP distributions for the two groups at:

### Baseline




In [None]:
# Baseline SBP distribution
plt.figure(figsize=(10, 5))
sns.kdeplot(data=df, x='Baseline_SBP', hue='Group', fill=True, common_norm=False, alpha=0.4)
plt.title("Baseline SBP Distribution by Group")
plt.show()



### Follow-up

In [None]:
# Follow-up SBP distribution
plt.figure(figsize=(10, 5))
sns.kdeplot(data=df, x='Followup_SBP', hue='Group', fill=True, common_norm=False, alpha=0.4)
plt.title("Follow-up SBP Distribution by Group")
plt.show()

### Change in Blood Pressure

In [None]:
df['SBP_change'] = df['Followup_SBP'] - df['Baseline_SBP']

plt.figure(figsize=(10, 5))
sns.kdeplot(data=df, x='SBP_change', hue='Group', fill=True, common_norm=False, alpha=0.4)
plt.title("Change in SBP (Follow-up – Baseline) by Group")
plt.axvline(0, color='black', linestyle='--')
plt.show()


## 🔍 t-Test for Comparing Change in SBP Between Two Groups

### What is it?

A **two-sample t-test** (also called an **independent samples t-test**) is used to determine whether there is a **statistically significant difference** in the **means of a continuous variable** between two **independent groups**.

In this case, we're interested in:

> **Is the average change in SBP significantly different between participants in the Control group and those in the Hipponol group?**

---

### 🧪 When to use it

Use a two-sample t-test when:
- The outcome variable (here: `SBP change = Followup_SBP − Baseline_SBP`) is continuous.
- The two groups are **independent** (no participant is in both groups).
- The data are approximately **normally distributed** in each group.
- Variances are assumed to be **equal** (although Welch's correction can be used otherwise).

---

### 🧠 Hypotheses

Let:
- \( \mu_1 \): mean change in SBP in the **Control** group
- \( \mu_2 \): mean change in SBP in the **Hipponol** group

Then:

- **Null hypothesis \( H_0 \)**: \( \mu_1 = \mu_2 \) (no difference)
- **Alternative hypothesis \( H_A \)**: \( \mu_1 \ne \mu_2 \) (there is a difference)

You can also use a **one-sided test** if you expect Hipponol to reduce SBP.
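
Both variants mentioned above are available through `scipy.stats.ttest_ind`. The sketch below runs them on simulated SBP changes (the means, spreads and group sizes are invented), using `alternative='less'` for the one-sided hypothesis that Hipponol lowers SBP more than control, and `equal_var=False` for Welch's correction.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Simulated SBP change (follow-up minus baseline, mmHg); parameters are made up
change_hipponol = rng.normal(loc=-8, scale=10, size=200)
change_control = rng.normal(loc=-2, scale=10, size=200)

# Two-sided test, equal variances assumed (the default)
two_sided = ttest_ind(change_hipponol, change_control)

# One-sided test: H_A is that the Hipponol mean change is lower than Control
one_sided = ttest_ind(change_hipponol, change_control, alternative="less")

# Welch's t-test: does not assume equal variances
welch = ttest_ind(change_hipponol, change_control, equal_var=False)

print(f"Two-sided p = {two_sided.pvalue:.2e}")
print(f"One-sided p = {one_sided.pvalue:.2e}")
print(f"Welch p     = {welch.pvalue:.2e}")
```

When the observed difference lies in the hypothesised direction, the one-sided p-value is half the two-sided one, which is why the direction should be fixed before looking at the data.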

---



In [None]:

hip = df[df['Group'] == 'Hipponol']['SBP_change'].dropna()
con = df[df['Group'] == 'Control']['SBP_change'].dropna()

# Mean difference (Hipponol minus Control)
diff = hip.mean() - con.mean()

# Pooled standard error of the difference, matching the equal-variance
# t-test below (degrees of freedom = n1 + n2 - 2)
n1, n2 = len(hip), len(con)
dof = n1 + n2 - 2
pooled_var = ((n1 - 1) * hip.var(ddof=1) + (n2 - 1) * con.var(ddof=1)) / dof
se_diff = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# 95% Confidence Interval
ci = stats.t.interval(0.95, dof, loc=diff, scale=se_diff)

# t-test (equal_var=True is the default; use equal_var=False for Welch's test)
t_stat, p_val = ttest_ind(hip, con)

print(f"Mean difference: {diff:.2f} mmHg")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f}) mmHg")
print(f"t-statistic: {t_stat:.3f}, p-value: {p_val:.4f}")



## 🧾 Interpreting results

- A **small p-value** (typically < 0.05) means you reject the null hypothesis and conclude that **SBP change differs significantly** between the groups.
- A **large p-value** suggests that the observed difference might be due to chance.


## 🧪 ANOVA and ANCOVA

### Analysis of Variance (ANOVA)

**ANOVA** is used to determine whether there are statistically significant differences in the means of a continuous variable across two or more groups.

In our case, we can use **one-way ANOVA** to compare the change in systolic blood pressure (SBP) between the `Control` and `Hipponol` groups.

### 🔍 Assumptions:
- Independence of observations
- Normally distributed SBP changes within each group
- Homogeneity of variances across groups

In [None]:
# Assume df is your DataFrame and has already loaded the dataset
# (recomputed here so the cell runs on its own)
df['SBP_change'] = df['Followup_SBP'] - df['Baseline_SBP']

# Split by group, dropping missing values so the test does not return NaN
control = df[df['Group'] == 'Control']['SBP_change'].dropna()
hipponol = df[df['Group'] == 'Hipponol']['SBP_change'].dropna()

# Run one-way ANOVA
f_stat, p_value = stats.f_oneway(control, hipponol)

print("One-way ANOVA")
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_value:.4f}")

### 🧪 Comparison: t-test vs. ANOVA

| Feature                        | **t-test**                                  | **ANOVA (one-way)**                         |
|-------------------------------|---------------------------------------------|---------------------------------------------|
| **Use case**                  | Comparing **two** group means               | Designed for comparing **three or more**    |
| **Statistic reported**        | *t*-statistic                               | *F*-statistic                               |
| **P-value**                   | Same as ANOVA when only 2 groups            | Same as t-test when only 2 groups           |
| **Interpretation**            | Mean difference                             | Variance between vs. within groups          |
| **Extension**                 | Needs other models for >2 groups            | Naturally extends to >2 group comparisons   |

---

## 📐 Mathematical Note

When there are exactly **two groups**, the relationship between the statistics is:

\\[
F = t^2
\\]

So the **p-value** from a one-way ANOVA is identical to that from the equal-variance (Student's) t-test; only the formulation differs. Welch's unequal-variance t-test does not share this identity.
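
The identity is easy to confirm numerically. The sketch below runs both tests on two simulated groups of unequal size (the means, standard deviation, and sizes are illustrative assumptions) and compares the statistics.

```python
import numpy as np
from scipy import stats

# Two simulated groups (illustrative values, not trial data)
rng = np.random.default_rng(0)
group_a = rng.normal(-2.0, 10.0, size=120)
group_b = rng.normal(-6.0, 10.0, size=100)

# ttest_ind pools the variances by default (Student's t-test)
t_stat, p_t = stats.ttest_ind(group_a, group_b)
f_stat, p_f = stats.f_oneway(group_a, group_b)

print(f"t^2 = {t_stat**2:.6f}, F = {f_stat:.6f}")
print(f"p (t-test) = {p_t:.6g}, p (ANOVA) = {p_f:.6g}")

# Welch's t-test does not pool variances, so its statistic generally differs
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"Welch t^2 = {t_welch**2:.6f}")
```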

---

## ✅ When to Use Each?

- ✅ **Use t-test**: when you are only comparing two groups and want the simplest approach.
- ✅ **Use ANOVA**: when you are planning to include more than two groups in future or need consistency in your analysis framework.

---

## 🔄 Practical Implication

For the SBP change analysis with just Control and Hipponol:
- Both methods are valid and will return the same **p-value**.
- ANOVA provides more flexibility later (e.g., adjusting for other factors via ANCOVA or general linear models).


### Analysis of Covariance (ANCOVA)

**ANCOVA** extends ANOVA by including one or more **continuous covariates** that may influence the dependent variable. This is particularly useful when there are baseline differences between groups (e.g. in baseline SBP).

In our context, ANCOVA allows us to compare the **Follow-up SBP**, adjusting for **Baseline SBP** as a covariate.

#### 💡 Why use ANCOVA?

- It reduces residual variance, improving statistical power
- It adjusts for potential imbalances in baseline values
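
The gain in precision can be illustrated with a small simulation. The sketch below assumes a data-generating model in which follow-up SBP depends on baseline SBP and on a true treatment effect of -5 mmHg. Those numbers, and the helper `ols_fit`, are hypothetical and exist only for this example. It compares the standard error of the treatment coefficient with and without the baseline covariate.

```python
import numpy as np

def ols_fit(X, y):
    """Return OLS coefficients and their standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])   # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

rng = np.random.default_rng(1)
n = 1000
treat = rng.integers(0, 2, size=n).astype(float)
baseline = rng.normal(140.0, 15.0, size=n)
# Assumed data-generating model: true treatment effect of -5 mmHg
followup = 40.0 + 0.7 * baseline - 5.0 * treat + rng.normal(0.0, 8.0, size=n)

ones = np.ones(n)
beta_u, se_u = ols_fit(np.column_stack([ones, treat]), followup)            # unadjusted
beta_a, se_a = ols_fit(np.column_stack([ones, treat, baseline]), followup)  # baseline-adjusted

print(f"Unadjusted: effect = {beta_u[1]:.2f}, SE = {se_u[1]:.3f}")
print(f"Adjusted:   effect = {beta_a[1]:.2f}, SE = {se_a[1]:.3f}")
```

Because baseline SBP explains part of the variation in follow-up SBP, the adjusted model leaves less residual noise and the treatment effect is estimated with a smaller standard error.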

#### 🔍 Assumptions:

- Same as ANOVA, plus:
  - Linearity between covariate and outcome
  - Homogeneity of regression slopes (no interaction between group and covariate)
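
A common way to check homogeneity of regression slopes is to add a group-by-covariate interaction term and see whether it is distinguishable from zero. The sketch below does this with `statsmodels` on simulated data. The column names mirror the notebook's, but the values and the data-generating model (parallel slopes, a -5 mmHg treatment effect) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with the same slope in both groups (illustrative, not trial data)
rng = np.random.default_rng(2)
n = 400
sim = pd.DataFrame({
    "Group": rng.choice(["Control", "Hipponol"], size=n),
    "Baseline_SBP": rng.normal(140.0, 15.0, size=n),
})
effect = np.where(sim["Group"] == "Hipponol", -5.0, 0.0)
sim["Followup_SBP"] = 40.0 + 0.7 * sim["Baseline_SBP"] + effect + rng.normal(0.0, 8.0, size=n)

# `*` expands to both main effects plus the Group x Baseline_SBP interaction
slopes_model = smf.ols("Followup_SBP ~ C(Group) * Baseline_SBP", data=sim).fit()

# The interaction term is the only parameter whose name contains ':'
interaction_terms = [name for name in slopes_model.params.index if ":" in name]
term = interaction_terms[0]
coef_interaction = slopes_model.params[term]
p_interaction = slopes_model.pvalues[term]

print(f"{term}: coefficient = {coef_interaction:.3f}, p = {p_interaction:.3f}")
```

An interaction coefficient close to zero with a large p-value is consistent with parallel slopes. A clearly non-zero interaction would mean the treatment effect depends on baseline SBP, and a single adjusted effect would be misleading.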

In [None]:
# Convert Group to categorical if needed
df['Group'] = df['Group'].astype('category')

# Fit ANCOVA model
model = smf.ols('Followup_SBP ~ Group + Baseline_SBP + Age + Sex + SmokingStatus', data=df).fit()

# Show summary
print(model.summary())
print(df.head())

#### 📊 Notes

- ANOVA checks whether the mean changes in SBP differ between groups.
- ANCOVA goes further by adjusting for baseline SBP, increasing statistical power and controlling for initial differences.


# Regression Analysis: Blood Pressure Change

To assess the relationship between treatment group and change in systolic blood pressure (SBP), a linear regression model can be used. This allows us to estimate the **magnitude** of the treatment effect while optionally adjusting for **confounders** like age, sex, and smoking status.

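#### 📦 Step 1: Compute SBP Change

SBP change is follow-up SBP minus baseline SBP, so a negative value means blood pressure fell during the trial.

#### 🔧 Step 2: Encode the Treatment Group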

To use `Group` in regression, convert it into a numeric variable. We'll use **0 = Control**, **1 = Hipponol**.


In [None]:
df['SBP_Change'] = df['Followup_SBP'] - df['Baseline_SBP']
df['Treatment'] = (df['Group'] == 'Hipponol').astype(int)

#### 📊 Step 3: Fit the regression model

This will estimate the **mean difference in SBP change** between the two groups. The intercept represents the mean change in the Control group, and the coefficient for `Treatment` represents the **additional change** for the Hipponol group.

In [None]:
# Fit the model
model = smf.ols('SBP_Change ~ Treatment', data=df).fit()

# Show results
print(model.summary())



#### 🧮 Step 4: Regression with Covariates (Optional)

To adjust for potential confounders, add age, sex, and smoking status to the model as covariates.

Note: `Sex` and `SmokingStatus` should be treated as categorical variables. Wrapping them in `C()` in the formula makes this explicit:

In [None]:

model_adj = smf.ols('SBP_Change ~ Treatment + Age + C(Sex) + C(SmokingStatus)', data=df).fit()
print(model_adj.summary())

#### 📈 Step 5: Visualise the Regression


In [None]:
# Encode group as binary treatment (if not already done)
df['Treatment'] = (df['Group'] == 'Hipponol').astype(int)

plt.figure(figsize=(10, 6))

# Add jittered individual points
sns.stripplot(x='Treatment', y='SBP_Change', data=df,
              jitter=0.2, size=5, alpha=0.6, color='grey')

# Add group means and error bars (95% CI)
sns.pointplot(x='Treatment', y='SBP_Change', data=df,
              linestyle='none', capsize=0.1, err_kws={'linewidth': 1.5}, color='blue')

# Customise axes
plt.xticks([0, 1], ['Control', 'Hipponol'])
plt.title('Change in Systolic Blood Pressure by Treatment Group')
plt.xlabel('Treatment Group')
plt.ylabel('Change in SBP (mmHg)')
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

### 🧠 Interpretation

- A **negative coefficient** for `Treatment` implies a greater **reduction** in SBP in the Hipponol group.
- The **p-value** tells you whether this difference is statistically significant.
- The **adjusted R²** provides an indication of how much variance in SBP change is explained by the model.


# 🧠 Bayesian Regression Analysis (SBP Change)

This section uses a Bayesian approach to model the effect of treatment (`Hipponol` vs `Control`) on the change in systolic blood pressure (SBP), while adjusting for age, sex, and smoking status.

---

#### 🧹 Step 1: Data Preparation

First, we calculate the change in SBP and encode the categorical variables.

---

#### 🧮 Step 2: Model Specification

We model SBP change as a linear function of:

- Age (standardised)
- Sex (Female as reference)
- Smoking Status (Non-smoker as reference)
- Group (Control as reference)

In [None]:

def scale_column(df, column, method='zscore'):
    """
    Scales a column in the dataframe using the specified method.
    
    Parameters:
    - df: pandas DataFrame
    - column: name of the column to scale
    - method: 'zscore', 'minmax', 'robust', or 'maxabs'
    
    Returns:
    - A pandas Series with the scaled values
    """
    x = df[column]

    if method == 'zscore':
        return (x - x.mean()) / x.std()
    elif method == 'minmax':
        return (x - x.min()) / (x.max() - x.min())
    elif method == 'robust':
        median = x.median()
        iqr = x.quantile(0.75) - x.quantile(0.25)
        return (x - median) / iqr
    elif method == 'maxabs':
        return x / x.abs().max()
    else:
        raise ValueError("Unknown scaling method. Choose 'zscore', 'minmax', 'robust', or 'maxabs'.")



# Assume df is already loaded
df['SBP_Change'] = df['Followup_SBP'] - df['Baseline_SBP']

# Ensure proper one-hot encoding
df_encoded = pd.get_dummies(df[['SBP_Change', 'Age', 'Sex', 'SmokingStatus', 'Group']], drop_first=True)

# Standardise age
df_encoded['Age_Standardised'] = scale_column(df_encoded, 'Age')

# Drop the unstandardised 'Age' column if not needed
df_encoded = df_encoded.drop(columns='Age')

# Ensure all columns are float (important for PyMC)
X = df_encoded.drop(columns=['SBP_Change']).astype(float)
y = df_encoded['SBP_Change'].astype(float)

predictor_names = X.columns.tolist()

with pm.Model() as model:
    # Data containers
    predictors = pm.Data("predictors", X)
    outcome = pm.Data("outcome", y)

    # Priors
    intercept = pm.Normal("Intercept", mu=0, sigma=10)
    beta = pm.Normal("Beta", mu=0, sigma=1, shape=len(predictor_names))
    sigma = pm.HalfNormal("Sigma", sigma=5)

    # Linear model
    mu = intercept + pm.math.dot(predictors, beta)

    # Likelihood
    y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=outcome)

    # Sampling
    trace = pm.sample(2000, tune=1000, target_accept=0.95, return_inferencedata=True)

trace.posterior = trace.posterior.rename({'Beta_dim_0': 'variable'})
trace.posterior['variable'] = ('variable', predictor_names)

#### 📊 Step 3: Posterior Summary

After sampling from the posterior distribution using PyMC, we now have a large collection of samples (called the trace) representing the joint posterior distribution of the model parameters.

The posterior summary allows us to understand:

- **Central tendency** (e.g., mean or median of each parameter estimate)
- **Uncertainty** (e.g., standard deviation or credible intervals)
- **Effect direction and strength** (is the coefficient consistently above/below zero?)

Here’s how to examine this in Python using arviz:


In [None]:
az.summary(trace, var_names=["Intercept", "Beta", "Sigma"], hdi_prob=0.95)

- **Mean**: Posterior mean (expected value)
- **SD**: Posterior standard deviation
- **HDI (Highest Density Interval)**: The narrowest interval containing 95% of the posterior probability mass. If this interval excludes 0, the effect is considered credibly non-zero (loosely analogous to 'significant').

🧭 How to interpret:
- If the **HDI excludes zero**, it suggests a consistent effect (e.g., a real impact of the treatment).
- If the **HDI includes zero**, there's high uncertainty about whether that variable truly affects the outcome.
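
To make the HDI concrete, the sketch below computes it directly from a set of posterior draws as the narrowest interval holding 95% of them. The draws are simulated from a normal distribution as a stand-in for a real trace, and `hdi_from_samples` is a hypothetical helper written for this example. For an actual trace, ArviZ's `az.hdi` does this job.

```python
import numpy as np

def hdi_from_samples(samples, prob=0.95):
    """Narrowest interval containing `prob` of the samples (assumes a unimodal posterior)."""
    sorted_samples = np.sort(np.asarray(samples))
    n = sorted_samples.size
    n_in = int(np.ceil(prob * n))  # number of draws inside the interval
    # Width of every candidate interval that spans n_in consecutive sorted draws
    widths = sorted_samples[n_in - 1:] - sorted_samples[: n - n_in + 1]
    start = int(np.argmin(widths))
    return sorted_samples[start], sorted_samples[start + n_in - 1]

# Simulated posterior draws for a treatment coefficient (illustrative only)
rng = np.random.default_rng(7)
draws = rng.normal(loc=-4.0, scale=1.0, size=8000)

hdi_low, hdi_high = hdi_from_samples(draws, prob=0.95)
excludes_zero = not (hdi_low <= 0.0 <= hdi_high)

print(f"95% HDI: [{hdi_low:.2f}, {hdi_high:.2f}]")
print(f"HDI excludes zero: {excludes_zero}")
```

For a symmetric posterior the HDI is close to the equal-tailed interval; for a skewed posterior the two can differ noticeably.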

---
📈 Plotting the posterior:

Visualising the posterior is an intuitive way to understand the model:


In [None]:
az.plot_forest(trace, var_names=["Beta"], combined=True)

## ✅ Interpretation

The key output here is the coefficient for `Group_Hipponol`. If its 95% Highest Density Interval (HDI) excludes zero, there is credible evidence that `Hipponol` affects SBP change compared to control. The width of the interval shows how precisely that effect is estimated.


# 🧪 Comparing Treatment Effects on SBP Change

We evaluated the effect of the Hipponol intervention on the change in **systolic blood pressure (SBP)** using four analytical approaches: a standard independent samples **T-test**, an **Analysis of Covariance (ANCOVA)**, a classical **Ordinary Least Squares (OLS) regression**, and a **Bayesian regression model**. Each method provides an estimate of the mean difference in SBP change between the treatment (Hipponol) and control groups, along with a measure of uncertainty.

## 🔹 T-test
The T-test compares the mean SBP change directly between the two groups, assuming equal variance. The estimated mean difference was statistically significant, indicating that the SBP change in the Hipponol group was different from the control group. This method provides a straightforward test of the null hypothesis with a confidence interval for the mean difference.

- **Advantage**: Simple and easy to interpret.
- **Limitation**: Does not adjust for potential confounders such as age, sex, or smoking status.

## 🔹 ANCOVA
The Analysis of Covariance (ANCOVA) extends the OLS model by including baseline covariates—such as age, sex, and smoking status—to control for confounding and improve precision. This method adjusts the treatment effect for baseline imbalances and provides a more accurate estimate.

- **Advantage**: Adjusts for potential confounders, improving estimate precision.
- **Limitation**: Requires linear relationships between covariates and outcome, and assumes homogeneity of regression slopes.

## 🔹 OLS Regression
The OLS regression treats group membership as a binary predictor for SBP change. With `Treatment` as the only predictor, the coefficient is numerically identical to the T-test mean difference. The comparison code below also includes age, sex, and smoking status in the OLS model, so its estimate coincides with the ANCOVA estimate instead. Regression frameworks extend naturally to multiple covariates and interaction terms.

- **Advantage**: Can be extended to include confounders or interactions.
- **Limitation**: Assumes linear relationships and homoscedasticity.
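
The link between a regression on a 0/1 treatment indicator and the two group means can be checked directly. The sketch below fits the regression with plain least squares on simulated SBP changes; the group means, standard deviation, and sizes are illustrative assumptions.

```python
import numpy as np

# Simulated SBP changes in mmHg (illustrative values, not trial data)
rng = np.random.default_rng(3)
control_sim = rng.normal(-2.0, 10.0, size=480)
hipponol_sim = rng.normal(-7.0, 10.0, size=520)

# Stack outcomes and build the design matrix: intercept column + 0/1 treatment indicator
y = np.concatenate([control_sim, hipponol_sim])
treatment = np.concatenate([np.zeros(control_sim.size), np.ones(hipponol_sim.size)])
X = np.column_stack([np.ones_like(treatment), treatment])

(intercept, slope), *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"Intercept: {intercept:.4f}  | Control mean:       {control_sim.mean():.4f}")
print(f"Slope:     {slope:.4f}  | Difference in means: {hipponol_sim.mean() - control_sim.mean():.4f}")
```

The intercept reproduces the control-group mean and the slope reproduces the difference in means, which is why the unadjusted regression and the equal-variance t-test agree.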

## 🔹 Bayesian Regression
The Bayesian model estimated the treatment effect using probabilistic priors and provided a full posterior distribution for the treatment coefficient. The 95% Highest Density Interval (HDI) offers a Bayesian analogue to the confidence interval, interpreted as the range that contains the true effect with 95% posterior probability, given the model and priors.

- **Advantage**: Flexibility, intuitive probabilistic interpretation, and explicit incorporation of prior beliefs.
- **Limitation**: Requires computational resources and careful choice of priors.

In [None]:
# Assume df is already loaded and processed
df['SBP_Change'] = df['Followup_SBP'] - df['Baseline_SBP']
df['Treatment'] = (df['Group'] == 'Hipponol').astype(int)

# --- T-test ---
control = df[df['Treatment'] == 0]['SBP_Change']
hipponol = df[df['Treatment'] == 1]['SBP_Change']

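t_stat, t_pval = stats.ttest_ind(hipponol, control)
mean_diff = hipponol.mean() - control.mean()

# Pooled standard error of the difference (matches the equal-variance t-test above)
n_h, n_c = len(hipponol), len(control)
pooled_var = ((n_h - 1) * hipponol.var(ddof=1) + (n_c - 1) * control.var(ddof=1)) / (n_h + n_c - 2)
se_diff = np.sqrt(pooled_var * (1 / n_h + 1 / n_c))
ci_low, ci_high = stats.t.interval(
    0.95, df=n_h + n_c - 2,
    loc=mean_diff,
    scale=se_diff
)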

# --- OLS Regression ---
model_adj = smf.ols('SBP_Change ~ Treatment + Age + C(Sex) + C(SmokingStatus)', data=df).fit()
ols_estimate = model_adj.params['Treatment']
ols_ci_low, ols_ci_high = model_adj.conf_int().loc['Treatment']


# ANCOVA: Adjusting for age, sex, smoking
df_ancova = df.copy()
df_ancova = pd.get_dummies(df_ancova, columns=['Sex', 'SmokingStatus'], drop_first=True)
formula = "SBP_Change ~ Treatment + Age + Sex_Male + SmokingStatus_Smoker"
y_ancova, X_ancova = patsy.dmatrices(formula, data=df_ancova, return_type='dataframe')
ancova_model = sm.OLS(y_ancova, X_ancova).fit()
ancova_estimate = ancova_model.params['Treatment']
ancova_ci_low, ancova_ci_high = ancova_model.conf_int().loc['Treatment']


# --- Bayesian Regression ---
with pm.Model() as model:
    beta_0 = pm.Normal("Intercept", mu=0, sigma=10)
    beta_1 = pm.Normal("Treatment", mu=0, sigma=5)
    sigma = pm.HalfNormal("Sigma", sigma=5)

    mu = beta_0 + beta_1 * df['Treatment'].values
    y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=df['SBP_Change'])

    trace = pm.sample(2000, tune=1000, target_accept=0.95, return_inferencedata=True)

bayes_summary = az.summary(trace, var_names=["Treatment"], hdi_prob=0.95)
bayes_estimate = bayes_summary.loc["Treatment", "mean"]
bayes_hdi_low = bayes_summary.loc["Treatment", "hdi_2.5%"]
bayes_hdi_high = bayes_summary.loc["Treatment", "hdi_97.5%"]

# --- Comparison Table with Formatting ---
effect_df = pd.DataFrame({
    "Method": ["T-test", "OLS regression", "ANCOVA", "Bayesian regression"],
    "Estimate (mmHg)": [mean_diff, ols_estimate, ancova_estimate, bayes_estimate],
    "95% CI Lower": [ci_low, ols_ci_low, ancova_ci_low, bayes_hdi_low],
    "95% CI Upper": [ci_high, ols_ci_high, ancova_ci_high, bayes_hdi_high]
})

# Round values for presentation
effect_df = effect_df.round(2)

print(effect_df)


### 🔄 Relationship Between ANCOVA and OLS Regression

In this instance, **Analysis of Covariance (ANCOVA)** and **Ordinary Least Squares (OLS) regression** provide **identical results**—but why is that the case?

#### 🧠 Conceptual Overlap

ANCOVA is essentially a special case of linear regression. Specifically, it combines:

- **ANOVA**: for testing differences between categorical groups (e.g. treatment vs. control), and  
- **Regression**: for adjusting these comparisons based on continuous or additional categorical covariates (e.g. age, sex, smoking status).

Thus, ANCOVA is not a fundamentally different statistical method but rather a **framework for interpreting a linear model** that includes both categorical and continuous predictors. When implemented in software such as Python’s `statsmodels`, ANCOVA is conducted via the **same OLS regression engine** as any standard linear regression model.

---

#### 📊 Technical Equivalence

The following model, framed as an OLS regression:

```python
model = smf.ols('SBP_Change ~ Treatment + Age + C(Sex) + C(SmokingStatus)', data=df).fit()
```

…is analytically identical to an ANCOVA testing the **effect of Treatment on SBP change**, **adjusting for Age, Sex, and Smoking Status**. The categorical variables are automatically dummy-coded, and the model estimates group differences while controlling for the covariates.

Both approaches:
- Estimate coefficients using least squares
- Provide p-values, confidence intervals, and model diagnostics
- Are suitable for continuous outcomes (like SBP change)

---

#### 🧾 Summary

| Term        | What it Emphasises                | Underlying Method |
|-------------|-----------------------------------|-------------------|
| **ANCOVA**  | Group comparison **with adjustment** | Linear regression |
| **OLS Regression** | Flexible modelling of continuous outcomes | Linear regression |

**✅ Conclusion:**  
Whether you call it ANCOVA or OLS regression, you're using the **same statistical engine**. The terminology difference mostly reflects **intent and interpretation**, not implementation.
