# FB2NEP Workbook 8 – Regression and Modelling (Part 1)

This workbook introduces:

- Linear, logistic, and Cox regression.
- Model assumptions and basic diagnostics.
- Interpretation of β, odds ratios (OR), and hazard ratios (HR).
- Confounding, colliders, and mediators.
- Graphical understanding with DAGs.
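
For reference, the three models can be written as follows. The generic covariates $x_1, \dots, x_p$ and the baseline hazard $h_0(t)$ are notation introduced here for illustration and are not variable names from the dataset.

$$
\begin{aligned}
\text{Linear:} \quad & E[Y \mid x] = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p \\
\text{Logistic:} \quad & \log\frac{P(Y = 1 \mid x)}{1 - P(Y = 1 \mid x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p, \qquad \mathrm{OR}_j = e^{\beta_j} \\
\text{Cox:} \quad & h(t \mid x) = h_0(t)\,\exp(\beta_1 x_1 + \dots + \beta_p x_p), \qquad \mathrm{HR}_j = e^{\beta_j}
\end{aligned}
$$

In each model, $\beta_j$ describes a one-unit increase in $x_j$ with the other covariates held fixed: a difference in the mean outcome (linear), a log odds ratio (logistic), or a log hazard ratio (Cox).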

In [None]:
from __future__ import annotations

import pathlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline

DATA_PATH = pathlib.Path("data") / "fb2nep_synthetic.csv"
df = pd.read_csv(DATA_PATH)
df.head()

## 1. Linear regression

Example: association between systolic blood pressure (SBP) and BMI, adjusted for age and sex.

In [None]:
if "sex" in df.columns:
    df["sex"] = df["sex"].astype("category")

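# The formula uses C(sex), so "sex" must be present as well.
required = {"sbp", "bmi", "age", "sex"}
if required.issubset(df.columns):
    model_lin = smf.ols("sbp ~ bmi + age + C(sex)", data=df).fit()
    # Print explicitly: a bare expression inside an `if` block is not displayed.
    print(model_lin.summary())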

In [None]:
if {"sbp", "bmi", "age", "sex"}.issubset(df.columns):
    # Coefficients are in SBP units per one-unit change in each covariate.
    print(model_lin.params)
    print("\n95% confidence intervals:")
    print(model_lin.conf_int())
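
To see why the model above is "adjusted for age", the following sketch simulates a situation in which age drives both BMI and SBP while BMI itself has no effect on SBP. All numbers are invented for illustration and do not come from `fb2nep_synthetic.csv`; the binary variables `older` and `high_bmi` are hypothetical simplifications. The crude comparison suggests a BMI effect, whereas the comparison within age strata does not.

```python
import random
from statistics import mean


def simulate(n=20000, seed=0):
    """Simulate (older, high_bmi, sbp) rows in which high_bmi has NO effect on SBP."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        older = rng.random() < 0.5
        # Older participants are more likely to have a high BMI ...
        high_bmi = rng.random() < (0.8 if older else 0.2)
        # ... and have a higher SBP, regardless of their BMI.
        sbp = 120 + 15 * older + rng.gauss(0, 10)
        rows.append((older, high_bmi, sbp))
    return rows


def mean_difference(rows):
    """Mean SBP in the high-BMI group minus mean SBP in the other group."""
    exposed = [sbp for _, high_bmi, sbp in rows if high_bmi]
    unexposed = [sbp for _, high_bmi, sbp in rows if not high_bmi]
    return mean(exposed) - mean(unexposed)


rows = simulate()
crude = mean_difference(rows)
# Adjusted estimate: average of the within-stratum differences.
# Equal weights are reasonable here because both age strata are about the same size.
adjusted = mean(
    mean_difference([r for r in rows if r[0] == stratum]) for stratum in (True, False)
)
print(f"Crude difference:    {crude:.2f} mmHg")
print(f"Adjusted difference: {adjusted:.2f} mmHg")
```

Including `age` in the regression formula plays the same role as the stratification in this sketch, without requiring age to be cut into groups.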

## 2. Model diagnostics – linear regression

In [None]:
if {"sbp", "bmi", "age", "sex"}.issubset(df.columns):
    fitted = model_lin.fittedvalues
    residuals = model_lin.resid

    plt.figure(figsize=(6, 4))
    plt.scatter(fitted, residuals, alpha=0.5)
    plt.axhline(0, color="black", linestyle="--")
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.title("Residuals vs fitted values")
    plt.tight_layout()
    plt.show()

    plt.figure(figsize=(6, 4))
    plt.hist(residuals, bins=30)
    plt.xlabel("Residual")
    plt.ylabel("Number of observations")
    plt.title("Distribution of residuals")
    plt.tight_layout()
    plt.show()
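
The histogram gives a visual impression of whether the residuals are roughly normal. A numerical companion is the correlation between the sorted residuals and the matching standard normal quantiles, which is the quantity a Q-Q plot displays. The sketch below uses only the standard library and simulated residuals; the function name `qq_correlation` and the plotting positions `(i + 0.5) / n` are choices made for this illustration.

```python
import math
import random
from statistics import NormalDist, mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def qq_correlation(residuals):
    """Correlation between sorted residuals and standard normal quantiles."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return pearson(ordered, theoretical)


rng = random.Random(1)
normal_like = [rng.gauss(0, 1) for _ in range(500)]        # symmetric residuals
skewed = [rng.expovariate(1.0) - 1.0 for _ in range(500)]  # right-skewed residuals

print(f"Normal-like residuals: {qq_correlation(normal_like):.3f}")
print(f"Skewed residuals:      {qq_correlation(skewed):.3f}")
```

Values close to 1 indicate that the points of a Q-Q plot lie near a straight line. With real data, `qq_correlation(list(model_lin.resid))` could be used in the same way, assuming the linear model above was fitted.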

## 3. Logistic regression

In [None]:
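# `high_upf` is the binary outcome. If the dataset does not provide it,
# build a stand-in so that the cell still runs.
if "high_upf" not in df.columns:
    if "energy_kcal" in df.columns:
        # Stand-in: above-median energy intake. This is a proxy for illustration,
        # not a direct measure of UPF intake.
        median_energy = df["energy_kcal"].median()
        df["high_upf"] = (df["energy_kcal"] > median_energy).astype(int)
    else:
        # Last resort: random 0/1 labels, so no real association is expected.
        np.random.seed(11088)
        df["high_upf"] = np.random.randint(0, 2, size=len(df))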

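# The formula uses C(sex), so "sex" must be present as well.
if {"high_upf", "bmi", "age", "sex"}.issubset(df.columns):
    model_logit = smf.logit("high_upf ~ bmi + age + C(sex)", data=df).fit()
    # Print explicitly: a bare expression inside an `if` block is not displayed.
    print(model_logit.summary())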

In [None]:
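if {"high_upf", "bmi", "age", "sex"}.issubset(df.columns):
    # Coefficients are on the log-odds scale; exponentiate to obtain odds ratios.
    params = model_logit.params
    conf = model_logit.conf_int()  # columns 0 and 1 hold the lower and upper limits
    or_ = np.exp(params)
    or_ci = np.exp(conf)
    display(pd.DataFrame({"OR": or_, "CI_lower": or_ci[0], "CI_upper": or_ci[1]}))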
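
The conversion from a logistic coefficient to an odds ratio can also be done by hand, which helps when reporting the effect of a larger step such as 5 BMI units. The sketch below assumes a Wald interval with z = 1.96 and a positive step size `per`; the coefficient and standard error are hypothetical values, not results from the dataset, and the helper `odds_ratio` is defined here for illustration.

```python
import math


def odds_ratio(beta, se, per=1.0, z=1.96):
    """Return (OR, lower, upper) for a `per`-unit increase in the covariate.

    `beta` and `se` are on the log-odds scale; `per` is assumed to be positive.
    """
    estimate = math.exp(per * beta)
    lower = math.exp(per * (beta - z * se))
    upper = math.exp(per * (beta + z * se))
    return estimate, lower, upper


# Hypothetical log-odds coefficient and standard error for BMI.
beta_bmi, se_bmi = 0.05, 0.01

or_1 = odds_ratio(beta_bmi, se_bmi)
or_5 = odds_ratio(beta_bmi, se_bmi, per=5)
print("Per 1 unit:  OR = {:.3f} ({:.3f} to {:.3f})".format(*or_1))
print("Per 5 units: OR = {:.3f} ({:.3f} to {:.3f})".format(*or_5))
```

The odds ratio for a 5-unit step is the one-unit odds ratio raised to the fifth power, not five times the one-unit odds ratio, because the model is linear on the log-odds scale.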

## 4. Cox regression (brief)

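This section assumes the dataset contains `time_followup` (follow-up time) and `event_cvd` (event indicator; lifelines treats 1 as an observed event and 0 as censored).
We fit a simple Cox model with BMI and age as covariates.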

In [None]:
from lifelines import CoxPHFitter

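# A list keeps the column order stable; a set has no guaranteed order.
surv_cols = ["time_followup", "event_cvd", "bmi", "age"]
if set(surv_cols).issubset(df.columns):
    # Complete-case analysis: drop rows with a missing value in any of these columns.
    surv_df = df[surv_cols].dropna()
    cph = CoxPHFitter()
    # Every column other than the duration and event columns enters as a covariate.
    cph.fit(surv_df, duration_col="time_followup", event_col="event_cvd")
    cph.print_summary()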
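
To make the fitting step less of a black box, the following sketch writes out the Cox partial log-likelihood for a single covariate, assuming no tied event times. It is a simplified illustration and not how lifelines is implemented; the toy data and the function name `cox_partial_loglik` are hypothetical.

```python
import math


def cox_partial_loglik(beta, times, events, x):
    """Partial log-likelihood for one covariate, assuming no tied event times."""
    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects contribute only through the risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = [math.exp(beta * x[j]) for j, t_j in enumerate(times) if t_j >= t_i]
        loglik += beta * x[i] - math.log(sum(risk))
    return loglik


# Toy data: follow-up time, event indicator (1 = event, 0 = censored), covariate.
times = [2.0, 4.0, 5.0, 7.0]
events = [1, 1, 0, 1]
x = [1.0, 0.0, 1.0, 0.0]

ll_original = cox_partial_loglik(0.5, times, events, x)
# Rescaling time (for example years to months) keeps the ordering of events,
# so the partial likelihood is unchanged.
ll_rescaled = cox_partial_loglik(0.5, [12 * t for t in times], events, x)
print(ll_original, ll_rescaled)
```

Only the order of the event times enters the calculation, which is why the Cox model can estimate hazard ratios without specifying the baseline hazard.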

## 5. Confounding, colliders, and mediators – DAG example

In [None]:
dagitty_code = """
dag {
  SES -> high_upf
  SES -> CVD
  high_upf -> CVD
}
"""
print(dagitty_code)
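
The DAG above shows SES as a confounder of the association between `high_upf` and CVD. The sketch below contrasts this with a mediator and a collider by checking the direction of the arrows. It is a minimal classifier that looks at direct edges only; the mediator (`BMI`) and the collider (`hospitalised`) are hypothetical examples added for illustration and are not part of the DAG above.

```python
def classify_role(edges, exposure, outcome, node):
    """Classify `node` relative to exposure -> outcome, using direct edges only."""
    edge_set = set(edges)
    if (node, exposure) in edge_set and (node, outcome) in edge_set:
        return "confounder"  # common cause of exposure and outcome
    if (exposure, node) in edge_set and (node, outcome) in edge_set:
        return "mediator"    # lies on the causal path from exposure to outcome
    if (exposure, node) in edge_set and (outcome, node) in edge_set:
        return "collider"    # common effect of exposure and outcome
    return "other"


# The DAG from the cell above, written as (cause, effect) pairs.
confounding_dag = [("SES", "high_upf"), ("SES", "CVD"), ("high_upf", "CVD")]
# Hypothetical variants for comparison.
mediation_dag = [("high_upf", "BMI"), ("BMI", "CVD"), ("high_upf", "CVD")]
collider_dag = [("high_upf", "hospitalised"), ("CVD", "hospitalised"), ("high_upf", "CVD")]

print(classify_role(confounding_dag, "high_upf", "CVD", "SES"))
print(classify_role(mediation_dag, "high_upf", "CVD", "BMI"))
print(classify_role(collider_dag, "high_upf", "CVD", "hospitalised"))
```

The role determines the adjustment strategy: adjust for confounders, do not adjust for mediators when the total effect is of interest, and do not adjust for colliders, because conditioning on a common effect opens a spurious path between exposure and outcome.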