# 02b · Study population & missing data plan

> **Purpose**: define the *analysis cohort* transparently (inclusion/exclusion), describe exclusions with a simple flow, compare included vs excluded participants, and agree a pragmatic missing-data strategy to carry forward.

> **Learning objectives**
- Specify clear inclusion/exclusion rules and reproduce them in code.
- Produce a CONSORT-style flow (counts at each step) and save it.
- Compare baseline traits of included vs excluded to check selection bias risks.
- Choose a *default* missing-data approach (complete-case vs simple imputation) for subsequent notebooks.

---

In [None]:
# Make sure the repo root (which has scripts/bootstrap.py) is on sys.path.
import sys, os, pathlib, subprocess

REPO_NAME = "fb2nep-epi"
REPO_URL  = "https://github.com/ggkuhnle/fb2nep-epi.git"
IN_COLAB  = "google.colab" in sys.modules

def ensure_repo_on_path():
    here = pathlib.Path.cwd()
    # Walk up a few levels to find scripts/bootstrap.py
    for p in [here, *here.parents]:
        if (p / "scripts" / "bootstrap.py").exists():
            os.chdir(p)                 # normalise CWD to repo root
            sys.path.append(str(p))     # ensure imports like "from scripts..." work
            return p
    # Not found locally: if on Colab, clone then chdir
    if IN_COLAB:
        target = pathlib.Path("/content") / REPO_NAME
        # Clone only if missing. Cloning into an explicit path means the current
        # directory does not matter, and check=True fails loudly if the clone breaks.
        if not target.exists():
            subprocess.run(["git", "clone", REPO_URL, str(target)], check=True)
        os.chdir(target)
        sys.path.append(str(target))
        return target
    # Otherwise, we can’t proceed
    raise FileNotFoundError("Could not find repo root containing scripts/bootstrap.py")

repo_root = ensure_repo_on_path()
print("Repo root:", repo_root)

In [None]:
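# Bootstrap: the cell above set the CWD to the repo root and added it to sys.path.
# The parent-directory entry below only matters if that cell was skipped and the
# notebook is being run from a direct subfolder of the repo.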
import sys, pathlib
sys.path.append(str(pathlib.Path.cwd().parent))
from scripts.bootstrap import init
df, ctx = init()
df.head(2)

## 1) Inclusion / Exclusion rules (edit if needed)

**Default rules for this cohort**
- Age ≥ 40 years at baseline.
- Non-missing key covariates for modelling: `age`, `sex`, `BMI`, `SES_class`, `IMD_quintile`, `smoking_status`.
- For *prospective* analyses later, require `baseline_date` present.
- Exposure/outcome specific requirements will be applied in their respective notebooks.

_You can tighten/relax these below and regenerate the flow._

In [None]:
import pandas as pd
N0 = len(df)

# Start: all participants
pop = df.copy()

# Rule A: age >= 40 (already true by design, but keep explicit)
mask_age = pop['age'] >= 40
N_age = mask_age.sum()
pop = pop[mask_age].copy()

# Rule B: baseline_date present (prospective work later)
mask_base = pop['baseline_date'].notna()
N_base = mask_base.sum()
pop = pop[mask_base].copy()

# Rule C: key covariates non-missing (for a default teaching cohort)
key_covars = ['age','sex','BMI','SES_class','IMD_quintile','smoking_status']
mask_cov = pop[key_covars].notna().all(axis=1)
N_cov = mask_cov.sum()
analysis_cohort = pop[mask_cov].copy()

flow = pd.DataFrame([
    {"step":"Start (all)", "n": N0},
    {"step":"Age ≥ 40", "n": int(N_age)},
    {"step":"Baseline date present", "n": int(N_base)},
    {"step":"Key covariates non-missing", "n": int(N_cov)}
])
flow
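
A flow table is easier to audit when it also reports how many participants each rule removed. The sketch below is one possible way to generalise the cell above: rules are given as an ordered list of labels and mask functions, and each step records both the remaining *n* and the number dropped. The helper name `sequential_flow` and the toy data are illustrative assumptions, not part of the course repository.

```python
import pandas as pd

def sequential_flow(data, rules):
    """Apply (label, mask_function) rules in order; return (cohort, flow table)."""
    rows = [{"step": "Start (all)", "n": len(data), "dropped": 0}]
    current = data
    for label, make_mask in rules:
        mask = make_mask(current)  # boolean Series aligned to `current`
        kept = current[mask]
        rows.append({"step": label, "n": len(kept),
                     "dropped": len(current) - len(kept)})
        current = kept
    return current.copy(), pd.DataFrame(rows)

# Toy data for illustration only (not the course dataset)
toy = pd.DataFrame({
    "age": [35, 52, 61, 47, 70],
    "baseline_date": ["2010-01-05", None, "2011-03-02", "2012-07-19", "2009-11-30"],
    "BMI": [24.1, 27.3, None, 31.0, 22.8],
})
rules = [
    ("Age ≥ 40", lambda d: d["age"] >= 40),
    ("Baseline date present", lambda d: d["baseline_date"].notna()),
    ("Key covariates non-missing", lambda d: d[["age", "BMI"]].notna().all(axis=1)),
]
toy_cohort, toy_flow = sequential_flow(toy, rules)
print(toy_flow)
```

Because the rules are applied sequentially, the `dropped` count for a step depends on the order of the rules; reordering them changes the per-step counts but not the final cohort.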

### Save flow and derived cohort
We keep a CSV of the flow and a derived cohort for reproducibility (later notebooks may choose further, question-specific filters).

In [None]:
from pathlib import Path
Path('derived').mkdir(exist_ok=True)
flow.to_csv('derived/consort_flow_default.csv', index=False)
analysis_cohort.to_csv('derived/analysis_cohort_default.csv', index=False)
print('Saved: derived/consort_flow_default.csv; derived/analysis_cohort_default.csv')
analysis_cohort.shape
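
CSV files do not store column types, so a date column comes back as plain text when the cohort is reloaded unless it is parsed explicitly. The sketch below illustrates this with a toy frame and a temporary file; it assumes `baseline_date` is held as a datetime column, which may not match how the course data are stored.

```python
import tempfile
from pathlib import Path
import pandas as pd

# Toy data for illustration only (not the course dataset)
toy = pd.DataFrame({
    "id": [1, 2],
    "baseline_date": pd.to_datetime(["2010-01-05", "2011-03-02"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_cohort.csv"
    toy.to_csv(path, index=False)
    plain = pd.read_csv(path)                                  # dates come back as text
    parsed = pd.read_csv(path, parse_dates=["baseline_date"])  # dates restored

print(plain["baseline_date"].dtype, parsed["baseline_date"].dtype)
```

Categorical columns are affected in the same way: any ordering or category dtype has to be re-applied after loading.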

## 2) Included vs excluded (selection check)
Compare some baseline variables between **included** and **excluded** after Rule C (key covariates complete). Large differences suggest possible **selection bias** if missingness relates to exposure/outcome.

In [None]:
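import numpy as np
# Flag membership of the default analysis cohort (Rules A to C). SBP is kept for
# comparison only: it is not an inclusion criterion, so it must not enter the mask.
key_covars = ['age','sex','BMI','SES_class','IMD_quintile','smoking_status']
need = df[key_covars + ['SBP']].copy()
inc_mask = (df['age'] >= 40) & df['baseline_date'].notna() & df[key_covars].notna().all(axis=1)
need['included'] = np.where(inc_mask, 'Included','Excluded')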

def num_summary(d, v):
    return d.groupby('included')[v].agg(['mean','std','median','count']).round(2)
def cat_summary(d, v):
    ct = pd.crosstab(d['included'], d[v], normalize='index').round(3)
    return ct

tab_age  = num_summary(need, 'age')
tab_bmi  = num_summary(need, 'BMI')
tab_sex  = cat_summary(need, 'sex')
tab_ses  = cat_summary(need, 'SES_class')
tab_imd  = cat_summary(need, 'IMD_quintile')
tab_smok = cat_summary(need, 'smoking_status')
tab_age, tab_bmi, tab_sex, tab_ses.head(), tab_imd.head(), tab_smok.head()

_Prompt_: Are excluded participants older / more deprived / different smokers? If yes, discuss direction of potential bias for your question (e.g., could complete-case inflate/attenuate associations?).
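
To put a number on "large differences" for continuous variables, one common option is the standardized mean difference (SMD): the difference in group means divided by a pooled standard deviation. An absolute SMD above roughly 0.1 is often read as a noticeable imbalance, although that threshold is a convention rather than a rule. The helper below is a minimal sketch on toy data; the function name and the `included` labels are assumptions chosen to mirror the cell above.

```python
import numpy as np
import pandas as pd

def standardized_mean_difference(d, var, group="included"):
    """SMD = (mean_included - mean_excluded) / pooled SD, ignoring missing values."""
    inc = d.loc[d[group] == "Included", var].dropna()
    exc = d.loc[d[group] == "Excluded", var].dropna()
    pooled_sd = np.sqrt((inc.var(ddof=1) + exc.var(ddof=1)) / 2)
    return (inc.mean() - exc.mean()) / pooled_sd

# Toy data for illustration only (not the course dataset)
toy = pd.DataFrame({
    "included": ["Included"] * 4 + ["Excluded"] * 4,
    "age": [50, 52, 54, 56, 60, 62, 64, 66],
})
smd_age = standardized_mean_difference(toy, "age")
print(round(smd_age, 2))
```

A negative value here means the included group is younger on average than the excluded group. Unlike a p-value, the SMD does not grow with sample size, which makes it easier to compare across variables.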

## 3) Missing-data map and drivers
High-level visual of missingness and quick probes of variables that may drive it under a **missing at random (MAR)** assumption (age, IMD, SES).

In [None]:
import matplotlib.pyplot as plt
miss_rate = df.isna().mean().sort_values(ascending=False)

plt.figure(figsize=(7.5,3.6))
plt.bar(miss_rate.index[:20], miss_rate.values[:20])
plt.xticks(rotation=75, ha='right'); plt.ylabel('Proportion missing')
plt.title('Top 20 variables by missingness')
plt.tight_layout(); plt.show()

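# Example probe: missing vit C by SES
import pandas as pd
if 'plasma_vitC_umol_L' in df.columns:
    mflag = df['plasma_vitC_umol_L'].isna().astype(int).rename('vitC_missing')
    # A bare expression inside an `if` block is not displayed by Jupyter, so print it.
    print(pd.crosstab(df['SES_class'], mflag, normalize='index').round(3))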
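
The same probe can be wrapped in a small helper so it is easy to repeat for other variables and other candidate drivers. This is a sketch under assumptions: the helper name `missing_rate_by` and the toy SES labels are made up for illustration and do not come from the course dataset.

```python
import pandas as pd

def missing_rate_by(d, target, driver):
    """Proportion of rows with `target` missing within each level of `driver`."""
    flag = d[target].isna()
    return flag.groupby(d[driver]).mean().rename(f"prop_missing_{target}")

# Toy data for illustration only (labels are hypothetical)
toy = pd.DataFrame({
    "SES_class": ["ABC1", "ABC1", "C2DE", "C2DE", "C2DE", "ABC1"],
    "plasma_vitC_umol_L": [55.0, 48.2, None, None, 39.5, 61.0],
})
rates = missing_rate_by(toy, "plasma_vitC_umol_L", "SES_class")
print(rates)
```

Clearly different missingness rates across levels of an observed variable are consistent with MAR given that variable, but they cannot rule out missingness that also depends on the unobserved values themselves.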

## 4) Default missing-data strategy to carry forward
For **teaching** (no extra deps), choose **one** of:

A) **Complete-case per-analysis**: drop rows with any missingness in variables used *for that analysis* (preferred base).  
B) **Crude imputation for select covariates**: median for continuous, mode for categoricals (only if you need stable n for in-class comparisons).

_Record your choice below_; later notebooks will respect it if you load the saved cohort or reapply the function here.

In [None]:
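CHOICE = 'A'  # 'A' = complete-case per analysis, 'B' = crude imputation

def impute_simple(d):
    """Median-impute numeric columns and mode-impute categorical ones.

    Note: this fills *every* column with missing values, including any
    exposure or outcome columns, so use it for teaching comparisons only.
    """
    out = d.copy()
    # 'number' covers all numeric dtypes (e.g. float32, nullable Int64), not just float64/int64
    for c in out.select_dtypes(include='number').columns:
        out[c] = out[c].fillna(out[c].median())
    for c in out.select_dtypes(include=['object','category']).columns:
        md = out[c].mode(dropna=True)
        if len(md):  # an all-missing column has no mode; leave it untouched
            out[c] = out[c].fillna(md.iloc[0])
    return out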

cohort_for_next = analysis_cohort.copy() if CHOICE=='A' else impute_simple(analysis_cohort)
cohort_for_next.to_csv('derived/analysis_cohort_default_next.csv', index=False)
print('Saved: derived/analysis_cohort_default_next.csv (strategy', CHOICE, ')')
cohort_for_next.shape
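
Strategy A means the analysis sample is defined separately for each model, using only the variables that model needs. The sketch below shows one way to do this with `DataFrame.dropna(subset=...)`; the helper name `complete_case` and the toy data are illustrative assumptions.

```python
import pandas as pd

def complete_case(d, variables):
    """Keep rows with no missing values in `variables`; other columns may stay missing."""
    kept = d.dropna(subset=variables)
    print(f"Complete-case on {variables}: kept {len(kept)} of {len(d)} rows")
    return kept.copy()

# Toy data for illustration only (not the course dataset)
toy = pd.DataFrame({
    "age": [52, 61, 47, 70],
    "BMI": [27.3, None, 31.0, 22.8],
    "SBP": [130.0, 142.0, None, 125.0],
})
model_1 = complete_case(toy, ["age", "BMI"])
model_2 = complete_case(toy, ["age", "BMI", "SBP"])
```

The two models end up with different *n*, which is the price of per-analysis complete-case: results from models with different variable sets are not based on exactly the same participants.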

## 5) TODO: your turn
1. Tighten one rule (e.g., require non-missing `SBP`) and regenerate the flow. What’s the new *n* and who did you drop?
2. Create an **exposure-specific** cohort for *salt → CVD* by requiring non-missing `salt_g_d` and `urinary_sodium_mmol_L`; save as `derived/cohort_salt.csv`.
3. Write 2–3 sentences on the **selection risk** posed by your exclusions. Which direction of bias is plausible for your primary analysis?
4. If you chose **B** (impute), list at least two limitations vs multiple imputation.

_We will load `derived/analysis_cohort_default_next.csv` in Notebook 03 unless you override the path there._