# FB2NEP Workbook 8 – Missing Data and Sensitivity Analysis

Version 0.0.1

In all real epidemiological datasets, some information is missing. In this workbook we focus more systematically on:

- What “missing data” mean in practice.
- Patterns of missingness in the FB2NEP cohort.
- Complete-case versus imputation-based analyses.
- Simple sensitivity analyses to assess robustness.
- A brief introduction to Bayesian thinking about missing data.

We work with a simple regression example using the synthetic FB2NEP cohort.

Run the first two code cells to set up the repository and load the data.

In [None]:
# ============================================================
# FB2NEP bootstrap cell (works both locally and in Colab)
#
# What this cell does:
# - Ensures that we are inside the fb2nep-epi repository.
# - In Colab: clones the repository from GitHub if necessary.
# - Loads and runs scripts/bootstrap.py.
# - Makes the main dataset available as the variable `df`.
#
# Important:
# - You may see messages printed below (for example from pip
#   or from the bootstrap script). This is expected.
# - You may also see WARNINGS (often in yellow). In most cases
#   these are harmless and can be ignored for this module.
# - The main thing to watch for is a red error traceback
#   (for example FileNotFoundError, ModuleNotFoundError).
#   If that happens, please re-run this cell first. If the
#   error persists, ask for help.
# ============================================================

import os
import sys
import pathlib
import subprocess
import importlib.util

# ------------------------------------------------------------
# Configuration: repository location and URL
# ------------------------------------------------------------
# REPO_URL: address of the GitHub repository.
# REPO_DIR: folder name that will be created when cloning.
REPO_URL = "https://github.com/ggkuhnle/fb2nep-epi.git"
REPO_DIR = "fb2nep-epi"

# ------------------------------------------------------------
# 1. Ensure we are inside the fb2nep-epi repository
# ------------------------------------------------------------
# In local Jupyter, you may already be inside the repository,
# for example in fb2nep-epi/notebooks.
#
# In Colab, the default working directory is /content, so
# we need to clone the repository into /content/fb2nep-epi
# and then change into that folder.
cwd = pathlib.Path.cwd()

# Case A: we are already in the repository (scripts/bootstrap.py exists here)
if (cwd / "scripts" / "bootstrap.py").is_file():
    repo_root = cwd

# Case B: we are outside the repository (for example in Colab)
else:
    repo_root = cwd / REPO_DIR

    # Clone the repository if it is not present yet
    if not repo_root.is_dir():
        print(f"Cloning repository from {REPO_URL} into {repo_root} ...")
        subprocess.run(["git", "clone", REPO_URL, str(repo_root)], check=True)
    else:
        print(f"Using existing repository at {repo_root}")

    # Change the working directory to the repository root
    os.chdir(repo_root)
    repo_root = pathlib.Path.cwd()

print(f"Repository root set to: {repo_root}")

# ------------------------------------------------------------
# 2. Load scripts/bootstrap.py as a module and call init()
# ------------------------------------------------------------
# The shared bootstrap script contains all logic to:
# - Ensure that required Python packages are installed.
# - Ensure that the synthetic dataset exists (and generate it
#   if needed).
# - Load the dataset into a pandas DataFrame.
#
# We load the script as a normal Python module (fb2nep_bootstrap)
# and then call its init() function.
bootstrap_path = repo_root / "scripts" / "bootstrap.py"

if not bootstrap_path.is_file():
    raise FileNotFoundError(
        f"Could not find {bootstrap_path}. "
        "Please check that the fb2nep-epi repository structure is intact."
    )

# Create a module specification from the file
spec = importlib.util.spec_from_file_location("fb2nep_bootstrap", bootstrap_path)
bootstrap = importlib.util.module_from_spec(spec)
sys.modules["fb2nep_bootstrap"] = bootstrap

# Execute the bootstrap script in the context of this module
spec.loader.exec_module(bootstrap)

# The init() function is defined in scripts/bootstrap.py.
# It returns:
# - df   : the main synthetic cohort as a pandas DataFrame.
# - CTX  : a small context object with paths, flags and settings.
df, CTX = bootstrap.init()

# Optionally expose a few additional useful variables from the
# bootstrap module (if they exist). These are not essential for
# most analyses, but can be helpful for advanced use.
for name in ["CSV_REL", "REPO_NAME", "REPO_URL", "IN_COLAB"]:
    if hasattr(bootstrap, name):
        globals()[name] = getattr(bootstrap, name)

print("Bootstrap completed successfully.")
print("The main dataset is available as the variable `df`.")
print("The context object is available as `CTX`.")


In [None]:
# Imports used throughout this workbook
#
# - numpy, pandas: general data handling
# - matplotlib: simple visualisations
# - statsmodels: regression and multiple imputation (MICE)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
from statsmodels.imputation.mice import MICEData, MICE

from scripts.helpers_tables import ensure_cmdstan

cmdstan_root = ensure_cmdstan()

from cmdstanpy import CmdStanModel
print("Using CmdStan from:", cmdstan_root)

%matplotlib inline

In [None]:
# Quick inspection of the main dataset
# This is just to remind ourselves of the structure of the FB2NEP cohort.

df.head()

## Background: what do we mean by “missing data”?

Before inspecting missingness in the FB2NEP dataset, we first clarify what *missing* means in an epidemiological context.

### (a) What can be missing?

Information can be missing at several levels:

- **Whole participants**  
  These individuals never appear in the analysis dataset (not recruited, withdrew immediately, or provided no usable data).

- **Entire visits or time points**  
  A participant attends baseline but not follow-up; in longitudinal data this appears as absent rows or missing whole sets of variables.

- **Individual variables (items)**  
  A single measurement is absent:  
  - SBP not taken or not recorded.  
  - Height or weight missing.  
  - A questionnaire item skipped.

In this workbook we mainly deal with **item-level missingness** in outcome, exposure, and covariates within a regression model.

### (b) Different types of missing entries

Not all missing values have the same meaning:

- **Item non-response**  
  A measurement *should* exist, but is not observed.

- **Unit non-response**  
  A whole clinic visit or questionnaire is missing.

- **Structurally missing (“not applicable”)**  
  These entries are *supposed* to be missing. Examples:  
  - `menopausal_status` in men.  
  - `CVD_date` for participants without a CVD event.  

Structural missingness is not imputed because it is determined by the logic of the variable.

### (c) What does “missing” mean in the dataset?

In the FB2NEP dataset, missing values appear as `NaN` in pandas.

This means:

- The participant **exists** in the cohort.  
- But the dataset contains **no recorded value** for that variable.

It does **not** mean:

- zero,  
- “no disease”,  
- “never”,  
- or any other substantive category.

Example:

- `SBP = 0` mmHg would be physiologically impossible and signals an error.  
- `SBP = NaN` means the measurement was not collected or not available.

Missingness therefore concerns the **measurement**, not the person. A missing BMI does not mean the participant lacks body mass; it means the dataset lacks the recorded value.
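
The difference between `NaN`, zero, and an impossible value can be seen in a small toy example. This is a sketch with made-up numbers (the `toy` DataFrame is hypothetical and not part of the FB2NEP cohort); it only illustrates how pandas treats each case.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the FB2NEP cohort).
# One SBP value was never recorded (NaN) and one was wrongly entered as 0.
toy = pd.DataFrame({"SBP": [120.0, np.nan, 135.0, 0.0]})

# NaN marks "no recorded value"; pandas skips it when computing the mean.
n_missing = int(toy["SBP"].isna().sum())
mean_skipping_nan = toy["SBP"].mean()

# Treating "missing" as zero silently changes the answer.
mean_nan_as_zero = toy["SBP"].fillna(0).mean()

# An impossible value such as 0 mmHg is a data error, not a measurement.
# Here we recode it explicitly as missing before analysis.
cleaned = toy["SBP"].replace(0.0, np.nan)
mean_cleaned = cleaned.mean()

print("Missing values:", n_missing)
print("Mean, NaN skipped:", mean_skipping_nan)
print("Mean, NaN treated as 0:", mean_nan_as_zero)
print("Mean after recoding 0 as NaN:", mean_cleaned)
```

The three means differ because each treats the unrecorded and the impossible value differently; only the last one uses plausible measurements alone.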

In the rest of this workbook, we focus on:

- How much item-level missingness there is in our chosen variables.  
- How different ways of handling these missing values (complete-case analysis, single imputation, multiple imputation) can influence our regression results.

## 1. Missingness overview

We now turn to the concrete pattern of missing data in the FB2NEP cohort.

In this workbook we focus on a simple blood pressure model:

- Outcome: `SBP` (systolic blood pressure).
- Main exposure: `BMI` (body mass index).
- Covariates: `age`, `sex`, `smoking_status`, `SES_class`.

We begin by calculating the proportion of missing values in each of these variables.

In [None]:
# Select variables of interest for this workbook.
#
# We keep the code robust by checking that each variable actually exists in df.

vars_of_interest = [v for v in ["SBP", "BMI", "age", "sex", "smoking_status", "SES_class"] if v in df.columns]
df_an = df[vars_of_interest].copy()

# Ensure that categorical variables are coded as categories.
if "sex" in df_an.columns:
    df_an["sex"] = df_an["sex"].astype("category")
if "smoking_status" in df_an.columns:
    df_an["smoking_status"] = df_an["smoking_status"].astype("category")
if "SES_class" in df_an.columns:
    df_an["SES_class"] = df_an["SES_class"].astype("category")

# Proportion of missing values for each variable.
missing_props = df_an.isna().mean()
missing_props

The table above shows the **fraction** of missing values for each variable.

- A value of `0.05` means that 5% of observations for that variable are missing.
- The amount of missing data can differ markedly between variables.

To visualise this pattern we can draw a simple bar chart.

In [None]:
# Bar plot of missingness for the selected variables.

plt.figure(figsize=(6, 4))
missing_props.sort_values(ascending=False).plot(kind="bar")
plt.ylabel("Proportion missing")
plt.ylim(0, 1)
plt.title("Proportion of missing values by variable")
plt.tight_layout()
plt.show()
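
Per-variable proportions do not show which variables tend to be missing *together*. A possible way to tabulate missingness patterns is sketched below on toy data; the values are made up, and only the column names mirror the workbook's variables. The same calls could be applied to `df_an`.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values.
toy = pd.DataFrame({
    "SBP": [120.0, np.nan, 135.0, 128.0, np.nan, 142.0],
    "BMI": [24.1, 27.3, np.nan, 22.8, np.nan, 30.2],
    "age": [54, 61, 47, 66, 58, 70],
})

# Indicator table: True where a value is missing.
indicator = toy.isna()

# Count how often each combination of missing/observed variables occurs.
pattern_counts = indicator.value_counts().rename("n_rows").reset_index()
print(pattern_counts)

# Number of missing items per participant (0 = complete case).
n_missing_per_row = indicator.sum(axis=1)
print(n_missing_per_row.value_counts().sort_index())
```

Rows with `False` in every column are the complete cases; the remaining rows show whether missing values cluster in the same participants.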

### 1.1 Conceptual mechanisms of missingness

The statistical properties of any missing-data method depend on **how** data are missing. Three standard mechanisms are:

- **MCAR – Missing Completely At Random**  
  The probability that a value is missing does not depend on any observed or unobserved variable. For example, a blood sample is lost due to a laboratory freezer failure that affects samples randomly.

- **MAR – Missing At Random**  
  The probability that a value is missing may depend on **observed** variables, but not on the value of the missing variable itself, after conditioning on the observed data. For example, BMI may be more likely to be missing among older participants, but conditional on age and sex, missingness is unrelated to the true BMI value.

- **MNAR – Missing Not At Random**  
  The probability that a value is missing still depends on the *unobserved* value, even after conditioning on observed covariates. For example, participants with very high BMI may be particularly reluctant to be weighed, even after adjusting for age, sex and smoking.

In practice we rarely know the true mechanism. A key message is therefore:

> We can rarely “fix” missing data, but we can **make our assumptions explicit** and **explore sensitivity** of results to these assumptions.
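
The three mechanisms can be made concrete with a small simulation. The sketch below uses entirely hypothetical numbers (a simulated cohort in which BMI rises slightly with age) and made-up missingness probabilities; it is meant to illustrate the definitions, not to describe the FB2NEP data.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 100_000

# Simulated cohort; all numbers are hypothetical.
age = rng.normal(55, 10, n)
bmi = 27 + 0.1 * (age - 55) + rng.normal(0, 4, n)  # BMI rises slightly with age


def logistic(x):
    return 1 / (1 + np.exp(-x))


# Probability that BMI is missing under each mechanism.
p_missing = {
    "MCAR": np.full(n, 0.2),                      # same for everyone
    "MAR": logistic(-1.5 + 0.08 * (age - 55)),    # depends on observed age only
    "MNAR": logistic(-1.5 + 0.25 * (bmi - 27)),   # depends on BMI itself
}

true_mean = bmi.mean()
results = {}
for name, p in p_missing.items():
    observed = rng.random(n) >= p
    # Age-adjusted estimate: predicted BMI at age 55 from a straight-line fit
    # to the observed rows (np.polyfit returns slope first, then intercept).
    slope, intercept = np.polyfit(age[observed] - 55, bmi[observed], 1)
    results[name] = {
        "observed_mean": bmi[observed].mean(),
        "adjusted_at_55": intercept,
    }
    print(f"{name}: observed mean = {results[name]['observed_mean']:.2f}, "
          f"age-adjusted BMI at 55 = {intercept:.2f}")

print(f"True mean BMI (no missing data): {true_mean:.2f}")
```

In this simulation the observed mean should stay close to the truth under MCAR, drift downwards under MAR until age is taken into account, and remain too low under MNAR even after adjusting for age.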

## 2. Complete-case versus single imputation

We now fit a simple linear regression model with systolic blood pressure (SBP) as the outcome and BMI as the main exposure, adjusted for age and available covariates.

We compare two strategies:

1. **Complete-case analysis**: use only participants with no missing values in any of the model variables. This is easy but loses power and can give biased estimates, unless data are MCAR (or, for regression coefficients, missingness does not depend on the outcome once the model covariates are taken into account).
2. **Single imputation**: fill in missing values with a single “best guess” (mean or mode) and analyse the resulting dataset as if it were complete. This preserves sample size but underestimates uncertainty.

We start with the complete-case analysis.

In [None]:
# 2.1 Complete-case analysis
# --------------------------
# We drop any row that has at least one missing value in the variables
# we intend to use in the model.

df_cc = df_an.dropna()
print(f"Number of complete cases: {len(df_cc)} out of {len(df_an)} participants")

# Construct the regression formula step by step.
formula = "SBP ~ BMI + age"
if "sex" in df_cc.columns:
    formula += " + C(sex)"
if "smoking_status" in df_cc.columns:
    formula += " + C(smoking_status)"
if "SES_class" in df_cc.columns:
    formula += " + C(SES_class)"

print("Model formula:", formula)

# Fit the ordinary least squares (OLS) model using statsmodels.
model_cc = smf.ols(formula, data=df_cc).fit()

# Create a compact summary table with estimates and 95% confidence intervals.
cc_summary = model_cc.summary2().tables[1][["Coef.", "Std.Err.", "[0.025", "0.975]"]]
cc_summary

In [None]:
# 2.2 Single imputation (mean/mode)
# ---------------------------------
# For illustration, we perform a very simple single imputation:
# - For numeric variables, replace missing values with the mean.
# - For categorical variables, replace missing values with the most frequent category.
#
# This method is *not* recommended for serious analyses, but it is useful to
# demonstrate how different handling of missing data can influence estimates.

df_si = df_an.copy()

for col in df_si.columns:
    if df_si[col].dtype.kind in "biufc":  # numeric types
        df_si[col] = df_si[col].fillna(df_si[col].mean())
    else:  # categorical or object types
        df_si[col] = df_si[col].fillna(df_si[col].mode().iloc[0])

# Fit the same model to the single-imputed dataset.
model_si = smf.ols(formula, data=df_si).fit()

si_summary = model_si.summary2().tables[1][["Coef.", "Std.Err.", "[0.025", "0.975]"]]

# Compare coefficients from complete-case and single-imputed analyses.
comparison_2 = pd.DataFrame({
    "cc_coef": cc_summary["Coef."],
    "si_coef": si_summary["Coef."],
    "cc_SE": cc_summary["Std.Err."],
    "si_SE": si_summary["Std.Err."]
})
comparison_2

In the table above, focus on the **BMI** coefficient and its standard error in the two approaches.

- Do the point estimates differ?  
- Are the confidence intervals similar or noticeably different?  
- How many observations were used in the complete-case analysis compared with the imputed analysis?

Single imputation keeps the original sample size but fails to reflect the extra uncertainty caused by missing data, so standard errors are often too small.
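
Why the standard errors shrink can be seen without any regression model. The following sketch uses simulated BMI values (hypothetical, with 30% removed completely at random) and compares the spread and the naive standard error of the mean before and after mean imputation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Simulated BMI values (hypothetical), with about 30% set to missing.
bmi_full = pd.Series(rng.normal(27, 4, 1_000))
bmi_obs = bmi_full.mask(rng.random(1_000) < 0.3)

# Mean imputation: every missing value receives the same number.
bmi_imputed = bmi_obs.fillna(bmi_obs.mean())


def se_of_mean(x):
    """Naive standard error of the mean: SD / sqrt(n), ignoring NaN."""
    x = x.dropna()
    return x.std(ddof=1) / np.sqrt(len(x))


sd_observed = bmi_obs.std(ddof=1)
sd_imputed = bmi_imputed.std(ddof=1)
se_observed = se_of_mean(bmi_obs)
se_imputed = se_of_mean(bmi_imputed)

print(f"SD, observed values only: {sd_observed:.2f}")
print(f"SD, after mean imputation: {sd_imputed:.2f}")
print(f"SE of mean, observed values only: {se_observed:.3f}")
print(f"SE of mean, after mean imputation: {se_imputed:.3f}")
```

The imputed values add no new information, yet they reduce the standard deviation and increase the apparent sample size, so the naive standard error becomes smaller than the data justify.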

## 3. Multiple imputation with MICE (simplified)

Multiple imputation aims to improve on single imputation by:

1. Drawing **several** plausible values for each missing observation, generating multiple imputed datasets.
2. Fitting the analysis model in each dataset.
3. Combining estimates and standard errors using **Rubin's rules**.

Conceptually, this is closer to the idea of uncertainty used elsewhere in statistics: we acknowledge that the missing values could have been different, and we propagate this uncertainty into the final estimates.
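
Step 3 can be written out in a few lines. The sketch below pools a single coefficient with Rubin's rules: the pooled estimate is the mean of the per-dataset estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `pool_rubin` and the input numbers are hypothetical and chosen for illustration only.

```python
import numpy as np


def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed datasets using Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(std_errors, dtype=float) ** 2  # within-imputation variances
    m = len(q)

    q_bar = q.mean()             # pooled point estimate
    w = u.mean()                 # average within-imputation variance
    b = q.var(ddof=1)            # between-imputation variance
    t = w + (1 + 1 / m) * b      # total variance
    return q_bar, np.sqrt(t)


# Hypothetical BMI coefficients and standard errors from m = 5 imputed datasets.
coefs = [0.50, 0.54, 0.47, 0.52, 0.49]
ses = [0.10, 0.11, 0.10, 0.09, 0.10]

pooled_coef, pooled_se = pool_rubin(coefs, ses)
print(f"Pooled coefficient: {pooled_coef:.3f}")
print(f"Pooled standard error: {pooled_se:.3f}")
```

The pooled standard error is larger than the typical per-dataset standard error because the disagreement between imputed datasets is added to the total variance.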

Here we use a simple implementation of **MICE (Multivariate Imputation by Chained Equations)** from `statsmodels`. The details of the imputation models are beyond the scope of FB2NEP; we treat this as a black box and focus on the *idea* and the comparison of results.

In [None]:
# 3.1 Prepare data for MICE
# -------------------------
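# MICE in statsmodels expects a numeric design matrix. We therefore use
# one-hot encoding (dummy variables) for categories.
#
# Caveats:
# - dtype=float avoids boolean dummy columns (the default in recent pandas).
# - pd.get_dummies codes a missing category as all zeros, so rows with a
#   missing categorical value are silently assigned to the reference category;
#   only missing numeric values (for example SBP, BMI) are imputed by MICE.

df_mice = pd.get_dummies(df_an, drop_first=True, dtype=float)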

# Create a MICEData object that stores the data and handles the chained equations.
mice_data = MICEData(df_mice)

# Outcome and predictors: here we use all available predictors for imputation
# and analysis (this is not always ideal, but sufficient for illustration).
endog = "SBP"
predictors = [c for c in df_mice.columns if c != endog]
formula_mice = endog + " ~ " + " + ".join(predictors)
print("MICE model formula:")
print(formula_mice)

# 3.2 Fit the MICE model with m=5 imputations.
# --------------------------------------------
# The MICE object performs imputations internally and then fits the regression
# model repeatedly, combining estimates automatically.

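# Note: first argument = formula, second = *class* with .from_formula (sm.OLS)
mice = MICE(formula_mice, sm.OLS, mice_data)

# fit() takes n_burnin as its first argument, so the number of imputations
# must be passed by name to obtain m=5.
result_mice = mice.fit(n_burnin=10, n_imputations=5)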

# The summary is long, but it shows pooled estimates and standard errors.
result_mice.summary()


In [None]:
# 3.3 Extract and compare key coefficients across methods
# -------------------------------------------------------
# We focus on the BMI effect. For the MICE model, BMI is numeric and should
# appear among the exogenous (predictor) names.

# 1. Get BMI coefficient and SE from the complete-case and single-impute models
bmi_cc_coef = model_cc.params.get("BMI", np.nan)
bmi_cc_se   = model_cc.bse.get("BMI", np.nan)

bmi_si_coef = model_si.params.get("BMI", np.nan)
bmi_si_se   = model_si.bse.get("BMI", np.nan)

print("Raw BMI coefficients from the two frequentist models:")
print(f"  Complete-case BMI coef: {bmi_cc_coef:.6f}")
print(f"  Single-impute BMI coef: {bmi_si_coef:.6f}")

# 2. Wrap the pooled MICE parameters and standard errors in a DataFrame
pooled = pd.DataFrame(
    {
        "coef": result_mice.params,
        "se": result_mice.bse,
    },
    index=result_mice.model.exog_names,
)

if "BMI" in pooled.index:
    bmi_mi_coef = pooled.loc["BMI", "coef"]
    bmi_mi_se   = pooled.loc["BMI", "se"]
else:
    bmi_mi_coef = np.nan
    bmi_mi_se   = np.nan

print(f"  Multiple-impute BMI coef: {bmi_mi_coef:.6f}")

# 3. Assemble comparison table
summary_methods = pd.DataFrame({
    "method":   ["complete_case", "single_impute", "multiple_impute"],
    "BMI_coef": [bmi_cc_coef,     bmi_si_coef,     bmi_mi_coef],
    "BMI_SE":   [bmi_cc_se,       bmi_si_se,       bmi_mi_se],
    "n_used":   [len(df_cc),      len(df_si),      len(df_an)],
})

summary_methods


This table summarises the estimated BMI effect and its standard error under three strategies.

- Multiple imputation typically has **smaller standard errors** than complete-case analysis (because it uses more data) but **larger standard errors** than naive single imputation (because it acknowledges imputation uncertainty).
- In well-behaved situations, point estimates are often similar across methods, but they **can** differ, especially if missingness is related to key variables.

Even a moderately sceptical hippo would insist that the assumptions behind each method are made clear in any report or dissertation.

## 4. Simple sensitivity analyses

No missing-data method is perfect, and the mechanism of missingness is usually not known with certainty. **Sensitivity analyses** explore how robust our conclusions are to alternative assumptions or modelling choices.

Here we illustrate two simple strategies:

1. Restricting the analysis to a more “ordinary” BMI range.
2. Applying a small **delta adjustment** to imputed values to mimic an MNAR scenario.

### 4.1 Restricting the BMI range

Extreme values can sometimes drive results and may also be more prone to measurement error or missingness. As a basic sensitivity analysis, we refit the complete-case model **excluding** participants with BMI ≥ 40 kg/m² and compare results.

In [None]:
# 4.1 Restrict analysis to BMI < 40 kg/m^2

if "BMI" in df_cc.columns:
    df_cc_restricted = df_cc[df_cc["BMI"] < 40]
    print(f"Complete cases in original model:   {len(df_cc)}")
    print(f"Complete cases with BMI < 40:      {len(df_cc_restricted)}")

    model_cc_rest = smf.ols(formula, data=df_cc_restricted).fit()

    cc_rest_summary = model_cc_rest.summary2().tables[1][["Coef.", "Std.Err.", "[0.025", "0.975]"]]

    comp_rest = pd.DataFrame({
        "original_coef": cc_summary["Coef."],
        "restricted_coef": cc_rest_summary["Coef."],
        "original_SE": cc_summary["Std.Err."],
        "restricted_SE": cc_rest_summary["Std.Err."],
    })
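    # Inside an `if` block the last expression is not shown automatically,
    # so we display the table explicitly.
    display(comp_rest)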
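
A delta adjustment of the kind listed at the start of this section can be sketched as follows. This is an illustration on simulated data, not the workbook's own implementation: the data, the 25% missingness, the single stochastic regression imputation, and the helper `bmi_slope_after_delta` are all assumptions made for the example. The idea is to impute under MAR, then shift the imputed BMI values by a fixed amount `delta` (as if the unrecorded values were systematically higher than MAR would predict) and see how the BMI coefficient responds.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

# Simulated data; all numbers are hypothetical.
bmi = rng.normal(27, 4, n)
sbp = 100 + 1.0 * bmi + rng.normal(0, 10, n)

# Remove about 25% of BMI values.
missing = rng.random(n) < 0.25
bmi_obs = np.where(missing, np.nan, bmi)

# Imputation model for BMI given SBP, fitted to the observed rows only.
slope, intercept = np.polyfit(sbp[~missing], bmi_obs[~missing], 1)
resid = bmi_obs[~missing] - (intercept + slope * sbp[~missing])
resid_sd = resid.std(ddof=2)

# One stochastic imputation under MAR: prediction plus random noise.
bmi_imp_mar = intercept + slope * sbp[missing] + rng.normal(0, resid_sd, missing.sum())


def bmi_slope_after_delta(delta):
    """Shift imputed BMI values by delta (kg/m^2) and refit SBP ~ BMI."""
    bmi_filled = bmi_obs.copy()
    bmi_filled[missing] = bmi_imp_mar + delta
    return np.polyfit(bmi_filled, sbp, 1)[0]  # slope of SBP on BMI


slopes = {delta: bmi_slope_after_delta(delta) for delta in [0.0, 1.0, 2.0, 3.0]}
for delta, b in slopes.items():
    print(f"delta = {delta:.1f} kg/m^2 -> BMI coefficient = {b:.3f}")
```

If the coefficient changes little over a plausible range of `delta`, the conclusion is robust to this particular departure from MAR; a full analysis would repeat the shift within each multiply imputed dataset and pool the results.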

### 4.3 Sensitivity to adjustment set (model specification)

Sensitivity analyses are not limited to missing data. In epidemiology, it is good practice
to ask how sensitive our main association is to *modelling choices*, in particular:

- Which covariates we adjust for (potential confounders).
- How we code exposures and covariates (continuous vs categories).
- Functional form (for example, linear vs quadratic).

Here we consider the association between BMI and SBP and compare two regression models:

1. A **minimally adjusted model**:  
   $$
   \text{SBP} = \beta_0 + \beta_1 \cdot \text{BMI} + \beta_2 \cdot \text{age} + \beta_3 \cdot \text{sex} + \varepsilon.
   $$

2. A **more fully adjusted model** that also includes potential confounders:  
   $$
   \text{SBP} = \beta_0 + \beta_1 \cdot \text{BMI} + \beta_2 \cdot \text{age}
   + \beta_3 \cdot \text{sex} + \beta_4 \cdot \text{smoking status}
   + \beta_5 \cdot \text{SES} + \beta_6 \cdot \text{physical activity} + \varepsilon.
   $$

The aim is to see whether the estimated BMI effect ($\beta_1$) is stable or changes
substantially when we change the adjustment set. Large changes might indicate strong
confounding or model misspecification.


In [None]:
# 4.3 Sensitivity to adjustment set (model specification)
# ------------------------------------------------------
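# We compare:
# - A minimally adjusted model: SBP ~ BMI + age + sex
# - A more fully adjusted model: add smoking_status, SES_class, physical_activity (if present)
#
# Note: df_cc was built from df_an, which only holds the six variables selected
# in Section 1, so it can never contain physical_activity. We therefore go back
# to the full dataset df and form the complete cases for each model here.

covars_min = ["age", "sex"]
covars_full = ["age", "sex", "smoking_status", "SES_class", "physical_activity"]

# Keep only variables that actually exist in the data.
covars_min = [v for v in covars_min if v in df.columns]
covars_full = [v for v in covars_full if v in df.columns]

vars_min = ["SBP", "BMI"] + covars_min
vars_full = ["SBP", "BMI"] + covars_full

# Code the known categorical covariates as categories (as in Section 1).
df_sens = df[vars_full].copy()
for col in ["sex", "smoking_status", "SES_class"]:
    if col in df_sens.columns:
        df_sens[col] = df_sens[col].astype("category")

# Complete cases for each model (the two samples can differ in size).
df_cc_min = df_sens[vars_min].dropna()
df_cc_full = df_sens[vars_full].dropna()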

print(f"Minimally adjusted model:   {len(df_cc_min)} complete cases")
print(f"Fully adjusted model:       {len(df_cc_full)} complete cases")

# Construct formulas
formula_min = "SBP ~ BMI"
for cov in covars_min:
    if str(df_cc_min[cov].dtype) == "category":
        formula_min += f" + C({cov})"
    else:
        formula_min += f" + {cov}"

formula_full = "SBP ~ BMI"
for cov in covars_full:
    if str(df_cc_full[cov].dtype) == "category":
        formula_full += f" + C({cov})"
    else:
        formula_full += f" + {cov}"

print("Minimal model formula:   ", formula_min)
print("Full model formula:      ", formula_full)

# Fit both models
model_min = smf.ols(formula_min, data=df_cc_min).fit()
model_full = smf.ols(formula_full, data=df_cc_full).fit()

# Extract BMI coefficients and standard errors
bmi_min_coef = model_min.params.get("BMI", np.nan)
bmi_min_se   = model_min.bse.get("BMI", np.nan)

bmi_full_coef = model_full.params.get("BMI", np.nan)
bmi_full_se   = model_full.bse.get("BMI", np.nan)

sens_adjust = pd.DataFrame({
    "model": ["minimal", "full"],
    "BMI_coef": [bmi_min_coef, bmi_full_coef],
    "BMI_SE": [bmi_min_se, bmi_full_se],
    "n_used": [len(df_cc_min), len(df_cc_full)],
})

sens_adjust


## 5. Bayesian approaches

Bayesian methods treat missing values as **unknown parameters** and estimate them jointly with all other parameters in the model.

In a Bayesian missing-data model:

- We specify a **likelihood** for the observed data given parameters and missing values.
- We specify **prior distributions** for both the model parameters and the missing values.
- We use Markov Chain Monte Carlo (MCMC) or related algorithms to sample from the joint **posterior distribution**.

The resulting posterior samples naturally incorporate uncertainty about missing values into uncertainty about regression coefficients. This is conceptually similar to multiple imputation, but the Bayesian approach integrates all uncertainty in a single framework and can make it easier to specify complex MNAR models.
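
For intuition only, the idea of treating missing values as unknowns can be shown in the simplest possible setting: a single variable with a normal model and the standard noninformative prior, where posterior draws are available in closed form. The SBP values below are hypothetical, and this toy model is far simpler than a regression with missing covariates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical SBP values; three participants have no recorded value.
sbp = np.array([118.0, 131.0, np.nan, 125.0, 140.0,
                np.nan, 122.0, 136.0, np.nan, 129.0])
obs = sbp[~np.isnan(sbp)]
n_obs = len(obs)
n_mis = int(np.isnan(sbp).sum())
ybar = obs.mean()
s2 = obs.var(ddof=1)

n_draws = 4_000

# Posterior draws for the parameters under a normal model with the standard
# noninformative prior:
#   sigma^2 | data      ~ scaled inverse chi-square(n_obs - 1, s2)
#   mu | sigma^2, data  ~ Normal(ybar, sigma^2 / n_obs)
sigma2 = (n_obs - 1) * s2 / rng.chisquare(n_obs - 1, n_draws)
mu = rng.normal(ybar, np.sqrt(sigma2 / n_obs))

# For each posterior draw, draw the missing values given that draw's parameters.
y_mis = rng.normal(mu[:, None], np.sqrt(sigma2)[:, None], size=(n_draws, n_mis))

print(f"Posterior mean of mu: {mu.mean():.1f}")
print(f"SD of observed SBP: {np.sqrt(s2):.1f}")
print(f"SD of drawn missing values: {y_mis.std():.1f}")
```

The drawn missing values are more spread out than the observed data because they carry both the natural variation in SBP and the uncertainty about the mean and variance, which is the same principle that multiple imputation approximates.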

Implementing Bayesian missing-data models is beyond the scope of FB2NEP, but it is useful to be aware of them, particularly for more advanced research projects.

In [None]:
# 5. Bayesian regression with Stan (CmdStanPy)
# --------------------------------------------
# Model: SBP ~ BMI + age + sex
#
# We use weakly informative priors and sample from the posterior using Stan,
# accessed via CmdStanPy. This is an illustrative example that mirrors the
# frequentist linear regression used earlier.


# The helpers used here live in scripts/helpers_tables.py. ensure_cmdstan was
# already imported in the imports cell; it is imported again so that this
# section can be run on its own after the bootstrap cell.
from scripts.helpers_tables import ensure_cmdstan

cmdstan_root = ensure_cmdstan()

from cmdstanpy import CmdStanModel
print("Using CmdStan from:", cmdstan_root)

# Prepare a clean analysis dataset (complete cases for the relevant variables)

df_bayes = df_an.dropna(subset=["SBP", "BMI", "age", "sex"]).copy()
df_bayes["sex_M"] = (df_bayes["sex"] == "M").astype(float)  # 1.0 if sex is coded "M", else 0.0

data_stan = {
    "N": len(df_bayes),
    "SBP": df_bayes["SBP"].to_numpy(),
    "BMI": df_bayes["BMI"].to_numpy(),
    "age": df_bayes["age"].to_numpy(),
    "sex_M": df_bayes["sex_M"].to_numpy(),
}


print(f"N for Bayesian model: {data_stan['N']}")


In [None]:
stan_code = """
data {
  int<lower=1> N;
  vector[N] SBP;
  vector[N] BMI;
  vector[N] age;
  vector[N] sex_M;        // indicator per participant: 1 = male, 0 = otherwise
}

parameters {
  real intercept;
  real beta_BMI;
  real beta_age;
  real beta_sex;
  real<lower=0> sigma;
}

model {
  // Weakly informative priors
  intercept ~ normal(120, 20);
  beta_BMI  ~ normal(0.5, 0.5);
  beta_age  ~ normal(0.6, 0.3);
  beta_sex  ~ normal(0, 2);
  sigma     ~ normal(15, 5);

  // Linear predictor
  SBP ~ normal(
    intercept
    + beta_BMI * BMI
    + beta_age * age
    + beta_sex * sex_M,
    sigma
  );
}
"""
with open("sbp_regression.stan", "w") as f:
    f.write(stan_code)

sbp_model = CmdStanModel(stan_file="sbp_regression.stan")


In [None]:
fit = sbp_model.sample(
    data=data_stan,
    seed=11088,
    chains=4,
    parallel_chains=4,
    iter_warmup=1000,
    iter_sampling=2000,
    show_console=True,   # keep on while debugging
)


In [None]:
from scripts.helpers_tables import stan_summary_table

param_order = ["intercept", "beta_BMI", "beta_age", "beta_sex", "sigma"]
bayes_tbl = stan_summary_table(fit, param_order=param_order)
bayes_tbl


In [None]:
# 5.1 Comparison of Bayesian (Stan) and frequentist estimates
# -----------------------------------------------------------
# Note: the three analyses are not identical. The complete-case and MICE models
# also adjust for smoking_status and SES_class (when present), whereas the Stan
# model includes only BMI, age and sex.

# Extract BMI, age, sex coefficients from the complete-case model
bmi_cc  = model_cc.params.get("BMI", np.nan)
age_cc  = model_cc.params.get("age", np.nan)
sex_cc  = model_cc.params.get("C(sex)[T.M]", np.nan)

# From the pooled MI results (table `pooled` from Section 3.3)
bmi_mi  = bmi_mi_coef
age_mi  = pooled["coef"].get("age", np.nan)
sex_mi  = pooled["coef"].get("sex_M", np.nan)   # dummy name created by pd.get_dummies

# From Stan posterior means
bmi_stan = bayes_tbl.loc["beta_BMI", "mean"]
age_stan = bayes_tbl.loc["beta_age", "mean"]
sex_stan = bayes_tbl.loc["beta_sex", "mean"]

comparison = pd.DataFrame({
    "method": ["OLS (complete-case)", "Multiple Imputation", "Bayesian (Stan)"],
    "BMI_coef": [bmi_cc, bmi_mi, bmi_stan],
    "age_coef": [age_cc, age_mi, age_stan],
    "sex_M_coef": [sex_cc, sex_mi, sex_stan],
})

comparison

## 6. Practical mini-assignment

In this final section you will carry out a small, assignment-style analysis using the FB2NEP cohort.

### Task

1. Choose an **outcome** (for example, `SBP` or a biomarker).
2. Choose a main **exposure** (for example, `BMI`, `salt_g_d`, or `physical_activity`).
3. Choose at least **two covariates** (for example, `age`, `sex`, `SES_class`).
4. Inspect and describe the pattern of missingness in your chosen variables.
5. Fit three models:
   - Complete-case analysis.
   - Single imputation (mean/mode or another reasonable rule).
   - Multiple imputation using MICE.
6. Perform **one sensitivity analysis**, for example:
   - Restrict the analysis to a more typical range of the exposure.
   - Apply a small delta-based MNAR adjustment similar to Section 4.2.
7. Write a short interpretation (approximately 150–200 words) answering:
   - How do the point estimates differ between methods?  
   - How do the standard errors differ?  
   - How sensitive are your conclusions to the chosen assumptions?
