# FB2NEP Workbook 8 – Missing Data and Sensitivity Analysis

We now focus more systematically on:

- Patterns of missingness.
- Complete-case versus imputation-based analyses.
- Simple sensitivity analyses.

Run the first two code cells to set up the repository and load the data.

In [None]:
import os
import sys
import runpy
import pathlib
import subprocess

REPO_URL = "https://github.com/ggkuhnle/fb2nep-epi.git"
REPO_NAME = "fb2nep-epi"

# 1. If we are in Colab and scripts/bootstrap.py is not present,
#    clone the repository and change into it.
if "google.colab" in sys.modules and not pathlib.Path("scripts/bootstrap.py").exists():
    root = pathlib.Path("/content")
    repo_dir = root / REPO_NAME

    if not repo_dir.exists():
        print(f"Cloning {REPO_URL} …")
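        # Clone explicitly into repo_dir so the os.chdir below works from any working directory.
        subprocess.run(["git", "clone", REPO_URL, str(repo_dir)], check=True)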

    os.chdir(repo_dir)
    print("Changed working directory to:", os.getcwd())

# 2. Now try to locate and run scripts/bootstrap.py
for p in ["scripts/bootstrap.py", "../scripts/bootstrap.py", "../../scripts/bootstrap.py"]:
    if pathlib.Path(p).exists():
        print(f"Bootstrapping via: {p}")
        runpy.run_path(p)
        break
else:
    print("⚠️ scripts/bootstrap.py not found – "
          "please check that the FB2NEP repository is available.")


In [None]:
import pandas as pd

# Load the main synthetic cohort used in all FB2NEP workbooks
df = pd.read_csv("data/synthetic/fb2nep.csv")

# Quick check: first rows
df.head()

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
from statsmodels.imputation.mice import MICEData, MICE
%matplotlib inline

## 1. Missingness overview

In [None]:
# Keep only the analysis variables that are present in this data set.
vars_of_interest = [v for v in ["SBP", "BMI", "age", "sex", "smoking_status", "SES_class"] if v in df.columns]
df_an = df[vars_of_interest].copy()
# Store the categorical variables as pandas categories.
for col in ["sex", "smoking_status", "SES_class"]:
    if col in df_an.columns:
        df_an[col] = df_an[col].astype("category")
# Proportion of missing values per variable.
df_an.isna().mean()
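
The per-variable proportions do not show which variables tend to be missing together. As a minimal sketch, assuming a small made-up data frame (`toy` is hypothetical and not the FB2NEP cohort), the joint patterns of missingness could be tabulated like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only (not the FB2NEP cohort).
toy = pd.DataFrame({
    "SBP": [120.0, np.nan, 135.0, 128.0, np.nan, 140.0],
    "BMI": [24.1, 27.3, np.nan, 22.8, np.nan, 31.0],
    "age": [45, 52, 61, 38, 70, 55],
})

# Proportion of missing values per variable.
prop_missing = toy.isna().mean()

# Label each row by the set of variables that are missing in it.
pattern = toy.isna().apply(
    lambda row: ", ".join(row.index[row.to_numpy()]) or "none", axis=1
)

# Number of rows showing each missingness pattern.
pattern_counts = pattern.value_counts()

print(prop_missing)
print(pattern_counts)
```

The same two steps should work on `df_an` in place of `toy`.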

## 2. Complete-case versus single imputation

In [None]:
df_cc = df_an.dropna()
print(f"Number of complete cases: {len(df_cc)}")
formula = "SBP ~ BMI + age"
if "sex" in df_cc.columns:
    formula += " + C(sex)"
if "smoking_status" in df_cc.columns:
    formula += " + C(smoking_status)"
if "SES_class" in df_cc.columns:
    formula += " + C(SES_class)"
model_cc = smf.ols(formula, data=df_cc).fit()
model_cc.params

In [None]:
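df_si = df_an.copy()
for col in df_si.columns:
    if df_si[col].dtype.kind in "biufc":
        # Numeric column: replace missing values with the observed mean.
        df_si[col] = df_si[col].fillna(df_si[col].mean())
    else:
        # Categorical column: replace missing values with the most frequent category.
        df_si[col] = df_si[col].fillna(df_si[col].mode().iloc[0])
# Refit the same model on the singly imputed data and compare the coefficients.
model_si = smf.ols(formula, data=df_si).fit()
pd.DataFrame({"complete_case": model_cc.params, "single_impute": model_si.params})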
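
One known drawback of mean imputation is that it leaves the mean unchanged but shrinks the spread, because every imputed value sits exactly at the centre of the distribution. The following sketch illustrates this with a made-up BMI series (the values are hypothetical, not taken from the cohort):

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values with three missing entries (illustration only).
bmi = pd.Series([22.0, 25.0, np.nan, 31.0, np.nan, 27.0, 24.0, np.nan])

# Single (mean) imputation, as in the cell above.
bmi_imputed = bmi.fillna(bmi.mean())

# pandas skips NaN by default, so these describe the observed values only.
mean_observed = bmi.mean()
sd_observed = bmi.std()

# After imputation the mean is unchanged, but the standard deviation is smaller.
mean_imputed = bmi_imputed.mean()
sd_imputed = bmi_imputed.std()

print(f"Mean: observed {mean_observed:.2f}, imputed {mean_imputed:.2f}")
print(f"SD:   observed {sd_observed:.2f}, imputed {sd_imputed:.2f}")
```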

## 3. Multiple imputation with MICE (simplified)

Here we use multiple imputation by chained equations (MICE) as implemented in
`statsmodels`. The details are beyond the scope of FB2NEP; the main aim is to show
that different strategies for handling missing data can lead to slightly different
estimates.

In [None]:
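# One-hot encode the categorical variables as floats, so that all columns are numeric.
df_mice = pd.get_dummies(df_an, drop_first=True, dtype=float)
# get_dummies codes a missing category as all zeros; set it back to NaN so that it is imputed.
for col in df_an.select_dtypes("category").columns:
    dummies = [c for c in df_mice.columns if c.startswith(col + "_")]
    df_mice.loc[df_an[col].isna(), dummies] = np.nan
mice_data = MICEData(df_mice)
endog = "SBP"
predictors = [c for c in df_mice.columns if c != endog]
formula_mice = endog + " ~ " + " + ".join(predictors)
mice = MICE(smf.ols, formula_mice, mice_data)
# The first argument of fit() is n_burnin; n_imputations is left at its default of 10.
result_mice = mice.fit(n_burnin=5, n_imputations=10)
result_mice.summary()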
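
Estimates from several imputed data sets are conventionally combined with Rubin's rules: the pooled estimate is the mean of the per-data-set estimates, and its variance adds the average within-imputation variance to the between-imputation variance, inflated by a factor of (1 + 1/m). The sketch below shows the arithmetic only. The helper `pool_rubin` and the numbers are hypothetical and are not part of `statsmodels`.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed data sets using Rubin's rules.

    estimates: the coefficient estimated in each imputed data set
    variances: the squared standard error of that coefficient in each data set
    Returns the pooled estimate and its pooled standard error.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)

    pooled = estimates.mean()        # pooled point estimate
    within = variances.mean()        # average within-imputation variance
    between = estimates.var(ddof=1)  # between-imputation variance
    total = within + (1 + 1 / m) * between
    return pooled, np.sqrt(total)

# Hypothetical BMI coefficients and squared standard errors from m = 5 imputations.
est = [0.80, 0.85, 0.78, 0.82, 0.75]
var = [0.010, 0.011, 0.009, 0.010, 0.010]

pooled_est, pooled_se = pool_rubin(est, var)
print(f"Pooled estimate: {pooled_est:.3f}, pooled SE: {pooled_se:.3f}")
```

The pooled standard error is larger than the typical single-data-set standard error because it also reflects how much the estimate varies from one imputation to the next.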

## 4. Simple sensitivity analysis

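As a basic sensitivity analysis, we exclude participants with very high BMI
(40 kg/m² or above) from the complete cases and refit the model. If the coefficients
change little, the original results are not driven by these extreme values.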

In [None]:
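if "BMI" in df_cc.columns:
    # Keep only complete cases with BMI below 40.
    df_cc_restricted = df_cc[df_cc["BMI"] < 40]
    model_cc_rest = smf.ols(formula, data=df_cc_restricted).fit()
    # An expression inside an if-block is not displayed automatically, so print it.
    print(pd.DataFrame({"original_cc": model_cc.params, "restricted_cc": model_cc_rest.params}))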
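
To judge how much the restriction matters, it can help to express each coefficient's change as a percentage of the original value. This is a minimal sketch: the helper `percent_change` and the coefficient values are hypothetical. In the workbook, `model_cc.params` and `model_cc_rest.params` would take the place of the two made-up series.

```python
import pandas as pd

def percent_change(original, restricted):
    """Change in each coefficient, as a percentage of the original magnitude."""
    return 100 * (restricted - original) / original.abs()

# Hypothetical coefficients, for illustration only.
original = pd.Series({"Intercept": 90.0, "BMI": 0.80, "age": 0.50})
restricted = pd.Series({"Intercept": 91.0, "BMI": 0.72, "age": 0.50})

change = percent_change(original, restricted)
print(change.round(1))
```

What counts as a meaningful change depends on the research question, so the percentages are best read alongside the confidence intervals rather than against a fixed threshold.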