# 05 · Regression analysis (cross-sectional)

> **Purpose**: build and assess cross-sectional models: linear (SBP) and logistic (prevalent hypertension). Focus on transformations, non-linear terms, diagnostics, and clear interpretation.

> **Learning objectives**
- Fit linear regression for a continuous outcome (SBP) and interpret coefficients.
- Fit logistic regression for a derived binary outcome (hypertension). Report ORs (95% CI).
- Apply transformations (log, z) and non-linear terms (polynomial, splines) when justified.
- Run basic diagnostics: residuals, QQ, influence, and multicollinearity (VIF).

---

In [None]:
# Make sure the repo root (which has scripts/bootstrap.py) is on sys.path.
import sys, os, pathlib, subprocess

REPO_NAME = "fb2nep-epi"
REPO_URL  = "https://github.com/ggkuhnle/fb2nep-epi.git"
IN_COLAB  = "google.colab" in sys.modules

def ensure_repo_on_path():
    here = pathlib.Path.cwd()
    # Walk up a few levels to find scripts/bootstrap.py
    for p in [here, *here.parents]:
        if (p / "scripts" / "bootstrap.py").exists():
            os.chdir(p)                 # normalise CWD to repo root
            sys.path.append(str(p))     # ensure imports like "from scripts..." work
            return p
    # Not found locally: if on Colab, clone then chdir
    if IN_COLAB:
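        # Clone only if missing; clone explicitly into /content so the chdir below
        # always matches, and fail loudly if the clone does not succeed
        target = pathlib.Path("/content") / REPO_NAME
        if not target.exists():
            subprocess.run(["git", "clone", REPO_URL, str(target)], check=True)
        os.chdir(target)
        sys.path.append(os.getcwd())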
        return pathlib.Path.cwd()
    # Otherwise, we can’t proceed
    raise FileNotFoundError("Could not find repo root containing scripts/bootstrap.py")

repo_root = ensure_repo_on_path()
print("Repo root:", repo_root)

In [None]:
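# Bootstrap: ensure repo root on path, then import init.
# If the working directory is already the repo root, use it; otherwise fall back to its parent.
import sys, pathlib
root = pathlib.Path.cwd()
if not (root / "scripts" / "bootstrap.py").exists():
    root = root.parent
sys.path.append(str(root))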
from scripts.bootstrap import init
df, ctx = init()
df.head(2)

## 1) Outcome & predictors (cross-sectional setup)
We’ll treat **SBP** (systolic BP) as a continuous outcome for linear regression. For a binary example, define **prevalent hypertension** as SBP ≥ 140 mmHg (teaching-only threshold).

**Candidate predictors**: `age`, `sex`, `BMI`, `IMD_quintile`, `SES_class`, `smoking_status`, `salt_g_d`, `fruit_veg_g_d`, `red_meat_g_d`, `physical_activity`.

In [None]:
import numpy as np, pandas as pd

work = df[['SBP','age','sex','BMI','IMD_quintile','SES_class','smoking_status','salt_g_d','fruit_veg_g_d','red_meat_g_d','physical_activity']].copy()
work = work.dropna()
work['HT_prev'] = (work['SBP'] >= 140).astype(int)
work.head(3), work.shape

## 2) Linear regression — SBP as outcome
Start simple (salt only), then build up with plausible confounders and compare models by fit and interpretability.

In [None]:
import statsmodels.api as sm
from patsy import dmatrices

# Unadjusted: SBP ~ salt_g_d
y_u, X_u = dmatrices('SBP ~ salt_g_d', data=work, return_type='dataframe')
fit_u = sm.OLS(y_u, X_u).fit()
fit_u.summary().tables[1]

In [None]:
# Adjusted: SBP ~ salt + age + sex + BMI + IMD_quintile + SES_class + smoking + physical_activity
# C() wraps categoricals; continuous left as-is (for now)
formula_a = 'SBP ~ salt_g_d + age + BMI + C(sex) + C(IMD_quintile) + C(SES_class) + C(smoking_status) + C(physical_activity)'
y_a, X_a = dmatrices(formula_a, data=work, return_type='dataframe')
fit_a = sm.OLS(y_a, X_a).fit()
fit_a.summary().tables[1]

### Diagnostics — residual plots & QQ
We’re aiming for *adequate* teaching diagnostics: residuals vs fitted, and normal Q–Q. Look for patterning (non-linearity) or heavy tails (outliers/influence).

In [None]:
import matplotlib.pyplot as plt
resid = fit_a.resid.values.ravel()
fitted = fit_a.fittedvalues.values.ravel()

plt.figure(figsize=(5.2,4)); plt.scatter(fitted, resid, s=8, alpha=0.6)
plt.axhline(0, ls='--'); plt.xlabel('Fitted'); plt.ylabel('Residual'); plt.title('Residuals vs Fitted'); plt.tight_layout(); plt.show()

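# Raw residuals are in mmHg, not standard-normal units, so use a standardised
# reference line (line='s') rather than the 45-degree line
sm.qqplot(resid, line='s'); plt.title('Normal Q–Q (residuals)'); plt.tight_layout(); plt.show()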
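
The learning objectives also mention influence, which the two plots above only hint at. A minimal sketch of Cook's distance follows, written with NumPy on simulated data rather than the course dataset. The function name `cooks_distance`, the simulated `salt` and `sbp` arrays, and the planted outlier are illustrative assumptions, and the 4/n cut-off is a common screening heuristic, not a rule.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each row of an OLS fit; X must include an intercept column."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    # Leverage h_ii = x_i' (X'X)^-1 x_i (diagonal of the hat matrix)
    leverage = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    mse = resid @ resid / (n - p)
    # D_i = e_i^2 / (p * MSE) * h_ii / (1 - h_ii)^2
    return (resid**2 / (p * mse)) * leverage / (1 - leverage)**2

# Simulated data (assumption): SBP rises with salt, plus one planted outlier in row 0
rng = np.random.default_rng(0)
salt = rng.uniform(2, 12, size=200)
sbp = 110 + 2.0 * salt + rng.normal(0, 8, size=200)
salt[0], sbp[0] = 30.0, 90.0   # extreme salt value and far below the trend

X = np.column_stack([np.ones_like(salt), salt])
d = cooks_distance(X, sbp)
flagged = np.where(d > 4 / len(sbp))[0]   # 4/n screening heuristic
print("Row with largest Cook's D:", int(d.argmax()))
print("Rows above 4/n:", flagged)
```

For a fitted statsmodels OLS result such as `fit_a`, `fit_a.get_influence().cooks_distance[0]` should return the same quantity without the manual algebra.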

### Multicollinearity — quick VIF check
Rule of thumb: VIF > ~5–10 suggests strong collinearity (teaching heuristic, not law).

In [None]:
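from statsmodels.stats.outliers_influence import variance_inflation_factor

# Build a numeric design matrix: expand text/categorical columns as dummies
# (reference level dropped) and cast to float so mixed bool/float columns
# do not end up as an object array
XA = pd.get_dummies(work[['salt_g_d','age','BMI','sex','IMD_quintile','SES_class','smoking_status','physical_activity']], drop_first=True, dtype=float)
# variance_inflation_factor does not add an intercept itself, so add one here
XA = sm.add_constant(XA, has_constant='add')
vif = pd.Series([variance_inflation_factor(XA.values, i) for i in range(XA.shape[1])], index=XA.columns)
# The intercept's VIF is not informative, so leave it out of the ranking
vif.drop('const').round(2).sort_values(ascending=False).head(12)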
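
To make the VIF numbers less of a black box, the sketch below computes them from the definition VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. The data are simulated, the helper `vif_manual` is an illustrative name, and `waist` is a hypothetical variable that is not in the course dataset; it is included only because it is strongly tied to BMI.

```python
import numpy as np

def vif_manual(X):
    """VIF for each column of X (pass predictors only, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        centred = target - target.mean()
        r2 = 1 - (resid @ resid) / (centred @ centred)
        out[j] = 1 / (1 - r2)
    return out

# Simulated predictors (assumption, not the course data)
rng = np.random.default_rng(1)
n = 500
bmi = rng.normal(27, 4, n)
waist = 2.5 * bmi + rng.normal(0, 3, n)   # hypothetical, highly correlated with BMI
age = rng.normal(55, 10, n)               # generated independently of the other two

vifs = vif_manual(np.column_stack([bmi, waist, age]))
print(dict(zip(['bmi', 'waist', 'age'], vifs.round(2))))
```

The two correlated columns should show clearly inflated values while the independent one stays near 1, which is the pattern to look for in the real output.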

## 3) Transformations & non-linear terms
- Skewed predictors (e.g., salt) may behave better on a `log1p` scale.
- Consider **polynomial** terms (e.g., BMI²) or **splines** for flexible curvature.

_Teaching note_: do this when diagnostics or prior knowledge suggest non-linearity — not by default.

In [None]:
work2 = work.copy()
work2['salt_log1p'] = np.log1p(work2['salt_g_d'])
work2['BMI2'] = work2['BMI']**2

formula_nl = 'SBP ~ salt_log1p + age + BMI + BMI2 + C(sex) + C(IMD_quintile) + C(SES_class) + C(smoking_status) + C(physical_activity)'
y_nl, X_nl = dmatrices(formula_nl, data=work2, return_type='dataframe')
fit_nl = sm.OLS(y_nl, X_nl).fit()
fit_nl.summary().tables[1]
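
Before transforming, it is worth checking that the predictor really is skewed. The sketch below compares sample skewness before and after `log1p` on a simulated right-skewed variable. The lognormal parameters and the `skewness` helper are assumptions chosen for illustration, not properties of the course data.

```python
import numpy as np

def skewness(x):
    """Sample skewness: mean of cubed z-scores (0 for a symmetric distribution)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z**3))

# Simulated right-skewed intake in g/day (assumption)
rng = np.random.default_rng(2)
salt_sim = rng.lognormal(mean=1.8, sigma=0.6, size=2000)

skew_raw = skewness(salt_sim)
skew_log = skewness(np.log1p(salt_sim))
print('Skewness, raw scale:  ', round(skew_raw, 2))
print('Skewness, log1p scale:', round(skew_log, 2))
```

Running the same two lines on `work2['salt_g_d']` gives a quick, data-driven reason to transform or to leave the variable alone.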

_Optional_: cubic B-splines for age (requires `patsy.bs`).

In [None]:
from patsy import bs
formula_s = 'SBP ~ salt_log1p + bs(age, df=4) + BMI + C(sex) + C(IMD_quintile) + C(SES_class) + C(smoking_status) + C(physical_activity)'
y_s, X_s = dmatrices(formula_s, data=work2, return_type='dataframe')
fit_s = sm.OLS(y_s, X_s).fit()
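# Note: fit_nl also contains BMI2 while fit_s does not, so the two models differ
# in more than the age term; the labels spell this out
print('Adj R^2 (linear age, BMI + BMI2):', round(fit_nl.rsquared_adj,4))
print('Adj R^2 (spline age, linear BMI):', round(fit_s.rsquared_adj,4))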

## 4) Logistic regression — prevalent hypertension (teaching example)
Define `HT_prev = 1` if SBP ≥ 140 mmHg. Fit unadjusted and adjusted models. Report odds ratios (OR) with 95% CIs. _Reminder_: cross-sectional **prevalent** hypertension mixes incidence and duration; this is a didactic example.

In [None]:
import statsmodels.api as sm
from patsy import dmatrices

log_u_y, log_u_X = dmatrices('HT_prev ~ salt_log1p', data=work2, return_type='dataframe')
log_u = sm.Logit(log_u_y, log_u_X).fit(disp=False)

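# Wrap a variable in C() only when pandas stores it as text or categorical.
# A numerically coded variable (for example IMD_quintile, if stored as numbers)
# is therefore entered as a linear trend here, unlike in the linear models above.
def cat_term(df_, v):
    return f"C({v})" if (df_[v].dtype=='object' or str(df_[v].dtype).startswith('category')) else v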

adj_terms = [
    'salt_log1p','age','BMI',
    cat_term(work2,'sex'), cat_term(work2,'IMD_quintile'), cat_term(work2,'SES_class'),
    cat_term(work2,'smoking_status'), cat_term(work2,'physical_activity')
]
formula_log_a = 'HT_prev ~ ' + ' + '.join(adj_terms)
log_a_y, log_a_X = dmatrices(formula_log_a, data=work2, return_type='dataframe')
log_a = sm.Logit(log_a_y, log_a_X).fit(disp=False)

import numpy as np
def tidy_or(fit):
    OR = np.exp(fit.params).rename('OR')
    CI = np.exp(fit.conf_int()).rename(columns={0:'2.5%',1:'97.5%'})
    return pd.concat([OR,CI], axis=1).round(3)

tidy_or(log_u), tidy_or(log_a)

### Brief interpretation guide
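- **Linear model**: the coefficient on `salt_log1p` is the expected difference in SBP (mmHg) per 1-unit difference in log(1 + salt g/day), holding the other covariates constant. One log unit means (1 + salt) is multiplied by e ≈ 2.72, so contrasts between realistic intakes are usually smaller than the coefficient itself.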
- **Logistic model**: OR for `salt_log1p` — multiplicative change in odds of prevalent hypertension per 1-unit increase in log(1+salt), adjusted for covariates.
- Non-linear terms (e.g., `BMI2`, `bs(age)`) let the effect vary by level; interpret via **predicted margins** not just coefficients.
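
Coefficients on a `log1p` scale are easier to communicate as a contrast between two concrete intakes. The sketch below converts a linear coefficient and a logit coefficient into the expected SBP difference and the odds ratio for 10 versus 5 g/day. Both coefficient values are hypothetical placeholders, not results from this notebook; substitute `fit_nl.params['salt_log1p']` and `log_a.params['salt_log1p']` once the models are fitted.

```python
import numpy as np

# Hypothetical coefficients (assumptions for illustration only)
beta_linear = 4.0   # mmHg per 1-unit difference in log1p(salt)
beta_logit = 0.50   # log-odds per 1-unit difference in log1p(salt)

def log1p_contrast(low, high):
    """Difference on the log1p scale between two raw intakes (g/day)."""
    return np.log1p(high) - np.log1p(low)

delta = log1p_contrast(5.0, 10.0)        # equals log(11 / 6)
sbp_diff = beta_linear * delta           # expected SBP difference in mmHg
odds_ratio = np.exp(beta_logit * delta)  # OR for 10 vs 5 g/day

print('Contrast on log1p scale:', round(float(delta), 3))
print('SBP difference (mmHg):  ', round(float(sbp_diff), 2))
print('Odds ratio:             ', round(float(odds_ratio), 3))
```

Because the contrast is smaller than one log unit, both quantities are smaller than the raw coefficient and `exp(beta_logit)` would suggest.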

## 5) TODO: hands-on tasks
1. **Model comparison**: Compare `fit_a`, `fit_nl`, and `fit_s` using adjusted R² and residual plots. Which balances fit and simplicity?
2. **Alternative exposure**: Swap `salt_log1p` for `red_meat_g_d` (log1p as needed). Refit linear SBP model and logistic HT model; interpret changes.
3. **Predicted margins (bonus)**: Compute predicted SBP at the 10th, 50th, 90th percentiles of `salt_g_d` holding other vars at typical values. Summarise in one sentence.
4. **Collinearity reflection**: Which predictors show higher VIF? Suggest a remedy (e.g., remove, combine, or centre variables).

In [None]:
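# (3) Predicted margins demo
from patsy import build_design_matrices

q = work2['salt_g_d'].quantile([0.1, 0.5, 0.9])
# One row per salt percentile, with the other numeric covariates held at their medians
base = work2.median(numeric_only=True).to_frame().T
new = pd.concat([base] * len(q), ignore_index=True)
new['salt_g_d'] = q.values
new['salt_log1p'] = np.log1p(new['salt_g_d'])
new['BMI2'] = new['BMI']**2   # keep the squared term consistent with BMI
# Keep categoricals at mode
for v in ['sex','IMD_quintile','SES_class','smoking_status','physical_activity']:
    new[v] = work2[v].mode().iloc[0]
# Reuse the fitted design info so the dummy columns match X_nl exactly;
# rebuilding from the formula on 3 rows would see only one level per categorical
Xp = build_design_matrices([X_nl.design_info], new, return_type='dataframe')[0]
pred = fit_nl.predict(Xp)
pd.DataFrame({'salt_g_d': new['salt_g_d'].values, 'pred_SBP': np.asarray(pred)})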
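
For task 4, centring is one of the remedies listed. The sketch below shows on simulated data why it helps for polynomial terms: raw BMI and BMI² are almost perfectly correlated, whereas the centred versions are not. The simulated `bmi` values are an assumption and stand in for `work2['BMI']`.

```python
import numpy as np

# Simulated BMI values (assumption, not the course data)
rng = np.random.default_rng(3)
bmi = rng.normal(27, 4, size=1000)

# Raw linear and squared terms move together almost perfectly
corr_raw = np.corrcoef(bmi, bmi**2)[0, 1]

# Centre first, then square: the two terms become close to uncorrelated
bmi_c = bmi - bmi.mean()
corr_centred = np.corrcoef(bmi_c, bmi_c**2)[0, 1]

print('corr(BMI, BMI^2), raw:    ', round(float(corr_raw), 3))
print('corr(BMI, BMI^2), centred:', round(float(corr_centred), 3))
```

Centring changes the interpretation of the linear BMI coefficient (it becomes the slope at mean BMI) but leaves the fitted values of the model unchanged.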

> ## Key takeaways
>
> - Let **design and diagnostics** motivate transformation or non-linearity — don’t add complexity by default.
> - Report effects clearly: units for linear; **OR (95% CI)** for logistic.
> - Check residuals and **VIF**; reflect on practical remedies for violations.
> - Cross-sectional models describe associations; causal interpretation demands a DAG and longitudinal design (next notebook).