# FB2NEP Workbook 6 – Regression and Modelling: Foundations

This workbook introduces the foundations of regression modelling in nutritional epidemiology.

We will focus on:

- Theoretical background to regression.
- Linear, logistic, and Cox proportional hazards regression.
- Quantile regression.
- Model assumptions and diagnostics.
- Non-linear models (polynomials and splines).
- Interpretation of coefficients (β, OR, RR, HR).
- Generating predictions from fitted models.

All analyses use the synthetic *FB2NEP cohort*.

Run the first code cell to configure the repository and load the dataset.


In [None]:
"""FB2NEP bootstrap cell

This cell:

- Locates and runs the common `bootstrap.py` script.
- Makes the main analysis DataFrame `df` available.
- Optionally exposes a context dictionary `CTX` with additional information.

You *must* run this cell before running any of the later analysis cells.
"""

import pathlib
import runpy

bootstrap_paths = [
    "scripts/bootstrap.py",
    "../scripts/bootstrap.py",
    "../../scripts/bootstrap.py",
]

CTX = None

for path in bootstrap_paths:
    if pathlib.Path(path).exists():
        print(f"Bootstrapping via: {path}")
        CTX = runpy.run_path(path)
        break
else:
    raise FileNotFoundError("Could not find 'scripts/bootstrap.py' in any expected location.")

# Expect that bootstrap.py defines at least a DataFrame called `df`.
if "df" not in CTX:
    raise KeyError("The bootstrap context does not contain a DataFrame named 'df'.")

df = CTX["df"]

print("DataFrame 'df' loaded.")
print("Number of rows:", len(df))
print("Number of columns:", df.shape[1])


In [None]:
"""Inspect the first few rows and the variable types.

This provides a quick overview of the FB2NEP cohort and its variables.
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from IPython.display import display

display(df.head())
display(df.dtypes.head(30))


## 1. What regression is

Regression modelling is a central tool in epidemiology. In its most basic form, regression estimates the **expected value** of an outcome variable given one or more predictors:

\[
E(Y \mid X_1, X_2, \ldots, X_p).
\]

The regression model describes a **systematic component** (the part explained by the predictors) and a **random component** (the unexplained variability, or error term).

In this workbook we will:

- Start with simple regression models.
- Extend to different outcome types.
- Introduce models for non-linear relationships and for different parts of the outcome distribution.


### 1.1 Prediction versus inference

Regression can be used for different purposes:

- **Prediction**: obtain accurate predictions \( \hat{Y} \) for new individuals.
- **Inference**: estimate and interpret the parameters (for example, β, OR, HR) and their uncertainty.

In nutritional epidemiology we are often interested primarily in **inference**:

- How much higher is blood pressure, on average, in individuals with high sodium intake?
- What is the hazard ratio for cardiovascular disease per 5 kg/m² higher BMI?

Prediction is also important, for example when developing risk scores, but the focus in this workbook is on understanding **parameters** and **assumptions**.


In [None]:
"""Simple visual example: BMI and age.

Here we:

- Create a scatter plot of BMI against age.
- Add rudimentary formatting to make the figure readable.

We do *not* fit a model yet; this is purely descriptive.
"""

fig, ax = plt.subplots(figsize=(6, 4))

ax.scatter(df["age"], df["bmi"], alpha=0.3, edgecolor="none")

ax.set_xlabel("Age (years)")
ax.set_ylabel("Body mass index (kg/m²)")
ax.set_title("Scatter plot of BMI against age (FB2NEP cohort)")

plt.tight_layout()
plt.show()


## 2. Types of regression models

In this section we introduce three commonly used regression models in epidemiology:

- **Linear regression** for continuous outcomes.
- **Logistic regression** for binary outcomes.
- **Cox proportional hazards regression** for time-to-event outcomes.

The underlying idea is similar in all three cases: we model how a summary of the outcome distribution (the mean, the probability of an event, or the hazard) changes with the predictors.


### 2.1 Linear regression

Linear regression models a continuous outcome as a linear function of predictors:

\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon.
\]

- \( Y \) is a continuous variable (for example, BMI or systolic blood pressure).
- \( X_1, X_2, \ldots, X_p \) are predictors (for example, age, sex, smoking status).
- \( \varepsilon \) is a random error term.

The key quantity is the **conditional mean**:

\[
E(Y \mid X_1, \ldots, X_p) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p.
\]

The coefficient \( \beta_j \) describes the expected difference in Y associated with a one-unit difference in \( X_j \), holding the other predictors constant.
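
The interpretation of \( \beta_j \) can be checked numerically. The following is a minimal sketch on simulated data, not the FB2NEP cohort: the variable names, the sample size, and the "true" coefficients (22.0, 0.05, 1.0) are assumptions chosen only for illustration. It fits the model by least squares with NumPy and confirms that the difference in predicted mean BMI between two women who differ by one year of age equals the fitted age coefficient.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Simulated predictors (hypothetical, for illustration only).
age = rng.uniform(40, 80, size=n)
female = rng.integers(0, 2, size=n)

# Assumed data-generating model: systematic component plus random error.
bmi = 22.0 + 0.05 * age + 1.0 * female + rng.normal(0, 3, size=n)

# Design matrix with an intercept column, then least-squares fit.
X = np.column_stack([np.ones(n), age, female])
beta_hat, *_ = np.linalg.lstsq(X, bmi, rcond=None)

# Predicted mean BMI for two women aged 60 and 61.
mean_60 = np.array([1.0, 60.0, 1.0]) @ beta_hat
mean_61 = np.array([1.0, 61.0, 1.0]) @ beta_hat

print("Estimated coefficients (intercept, age, female):", beta_hat)
print("Difference in predicted mean per year of age:", mean_61 - mean_60)
```

The printed difference matches the age coefficient because the other predictor (sex) is held constant in the comparison.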


In [None]:
"""Linear regression example: BMI on age (and sex).

We will:

- Fit a simple linear regression model.
- Inspect the summary output.
- Overlay the fitted line on a scatter plot.

Assumptions and diagnostics will be discussed later; for now the aim is to see the basic mechanics.
"""

import statsmodels.api as sm
import statsmodels.formula.api as smf

# For this example we assume the following variables exist:
# - 'bmi': continuous outcome
# - 'age': continuous predictor
# - 'sex': binary or categorical (for example 'Male', 'Female')

# Fit an ordinary least squares (OLS) model using a formula interface.
model_lin = smf.ols("bmi ~ age + C(sex)", data=df)
result_lin = model_lin.fit()

# Display a standard model summary.
result_lin.summary()


In [None]:
"""Plot the fitted regression line for BMI ~ age.

For visual simplicity we will:

- Restrict to one sex (for example, 'Female').
- Fit a simple model with age as the only predictor in this subgroup.
- Overlay the fitted line on the scatter plot.

This is purely for illustration.
"""

# Subset to one sex (adjust the label if your dataset uses different coding).
df_female = df[df["sex"] == "Female"].copy()

model_lin_f = smf.ols("bmi ~ age", data=df_female)
result_lin_f = model_lin_f.fit()

# Create a grid of ages spanning the observed range.
age_grid = np.linspace(df_female["age"].min(), df_female["age"].max(), 100)
pred_df = pd.DataFrame({"age": age_grid})
pred_df["bmi_hat"] = result_lin_f.predict(pred_df)

fig, ax = plt.subplots(figsize=(6, 4))

ax.scatter(df_female["age"], df_female["bmi"], alpha=0.3, edgecolor="none", label="Observed BMI")
ax.plot(pred_df["age"], pred_df["bmi_hat"], linewidth=2, label="Fitted line")

ax.set_xlabel("Age (years)")
ax.set_ylabel("Body mass index (kg/m²)")
ax.set_title("Linear regression: BMI ~ age (example subset)")
ax.legend()

plt.tight_layout()
plt.show()


### 2.2 Logistic regression

Logistic regression is used when the outcome is **binary**, for example the presence or absence of hypertension.

Let \( Y \in \{0, 1\} \) with \( Y = 1 \) indicating that the event (for example, hypertension) is present. The logistic model specifies the **log odds** of the event as a linear function of predictors:

\[
\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p,
\]

where \( p = P(Y = 1 \mid X_1, \ldots, X_p) \).

If we exponentiate a coefficient \( \beta_j \), we obtain an **odds ratio**:

\[
\exp(\beta_j)
\]

which describes the multiplicative change in the odds of the outcome associated with a one-unit increase in \( X_j \), holding other predictors constant.
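
To make the distinction between odds and probabilities concrete, here is a small sketch using only the Python standard library. The coefficients `beta0` and `beta1` are hypothetical values, not estimates from the FB2NEP cohort. It shows that a one-unit increase in BMI multiplies the odds by the same factor \( \exp(\beta_1) \) everywhere, whereas the corresponding change in probability depends on the starting point.

```python
import math


def inv_logit(z):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


# Hypothetical coefficients: intercept and log odds ratio per 1 kg/m² of BMI.
beta0, beta1 = -6.0, 0.15

results = []
for bmi in (22, 30):
    p_low = inv_logit(beta0 + beta1 * bmi)
    p_high = inv_logit(beta0 + beta1 * (bmi + 1))
    results.append({
        "bmi": bmi,
        "odds_ratio": odds(p_high) / odds(p_low),
        "risk_difference": p_high - p_low,
    })

for row in results:
    print(row)
print("exp(beta1):", math.exp(beta1))
```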


In [None]:
"""Logistic regression example: hypertension on BMI and age.

We assume there is a binary outcome variable 'hypertension' coded 0/1.

The model:

    logit(P(hypertension = 1)) = β0 + β1 * bmi + β2 * age + β3 * sex

We fit the model and inspect the estimated odds ratios.
"""

# Fit logistic regression using the formula interface.
model_log = smf.logit("hypertension ~ bmi + age + C(sex)", data=df)
result_log = model_log.fit()

# Display the model summary.
display(result_log.summary())

# Extract odds ratios and 95 % confidence intervals.
params = result_log.params
conf = result_log.conf_int()
or_table = pd.DataFrame({
    "OR": np.exp(params),
    "CI_lower": np.exp(conf[0]),
    "CI_upper": np.exp(conf[1]),
})

or_table


In [None]:
"""Plot predicted probability of hypertension across BMI.

We:

- Fix age and sex at reference values.
- Vary BMI across a grid.
- Compute predicted probabilities from the fitted logistic model.
"""

# Choose reference values (adjust if needed).
age_ref = 60
sex_ref = "Female"  # adjust if your coding is different

bmi_grid = np.linspace(df["bmi"].quantile(0.05),
                       df["bmi"].quantile(0.95),
                       100)

pred_df = pd.DataFrame({
    "bmi": bmi_grid,
    "age": age_ref,
    "sex": sex_ref,
})

pred_df["p_hyp"] = result_log.predict(pred_df)

fig, ax = plt.subplots(figsize=(6, 4))

ax.plot(pred_df["bmi"], pred_df["p_hyp"], linewidth=2)

ax.set_xlabel("Body mass index (kg/m²)")
ax.set_ylabel("Predicted probability of hypertension")
ax.set_title("Logistic regression: predicted probability vs BMI")

plt.tight_layout()
plt.show()


### 2.3 Cox proportional hazards regression

In many epidemiological studies we are interested in **time-to-event** outcomes, such as time to incident cardiovascular disease. Cox proportional hazards regression models the **hazard** (the instantaneous event rate) as:

\[
h(t \mid X_1, \ldots, X_p) = h_0(t) \exp(\beta_1 X_1 + \cdots + \beta_p X_p),
\]

where:

- \( h_0(t) \) is the **baseline hazard** (unspecified).
- The exponentiated coefficients \( \exp(\beta_j) \) are **hazard ratios**.

A hazard ratio \( \exp(\beta_j) > 1 \) indicates a higher hazard (instantaneous event rate) at higher values of \( X_j \), holding the other predictors constant and assuming the **proportional hazards assumption** holds.
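
Hazard ratios are often reported per clinically meaningful increment rather than per single unit (for example, per 5 kg/m² of BMI). Because the model is linear on the log hazard scale, the rescaling is done on the coefficient before exponentiating. The sketch below uses hypothetical values for the coefficient and its standard error, and a normal approximation for the confidence interval; the helper `hazard_ratio` is introduced here for illustration only.

```python
import math

# Hypothetical log hazard ratio per 1 kg/m² of BMI and its standard error.
beta_bmi = 0.04
se_bmi = 0.01


def hazard_ratio(beta, se, units=1.0, z=1.96):
    """Return (HR, lower, upper) for a difference of `units` in the predictor."""
    estimate = math.exp(units * beta)
    lower = math.exp(units * (beta - z * se))
    upper = math.exp(units * (beta + z * se))
    return estimate, lower, upper


hr_1 = hazard_ratio(beta_bmi, se_bmi, units=1)
hr_5 = hazard_ratio(beta_bmi, se_bmi, units=5)

print("HR per 1 kg/m²: %.3f (%.3f to %.3f)" % hr_1)
print("HR per 5 kg/m²: %.3f (%.3f to %.3f)" % hr_5)
```

Note that the hazard ratio per 5 units is the per-unit hazard ratio raised to the fifth power, not multiplied by five.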


In [None]:
"""Cox regression example: time to CVD event.

We assume the dataset contains:

- 'time_cvd': follow-up time (for example, in years).
- 'event_cvd': event indicator (1 if event occurred, 0 if censored).
- 'age', 'sex', 'bmi' as predictors.

We use the `lifelines` package for Cox regression.
"""

try:
    from lifelines import CoxPHFitter
except ImportError as e:
    raise ImportError(
        "The 'lifelines' package is required for this section. " 
        "Install it with `pip install lifelines` and re-run the cell."
    ) from e

# Select relevant columns and drop missing values.
cols = ["time_cvd", "event_cvd", "age", "bmi", "sex"]
df_cox = df[cols].dropna().copy()

# Lifelines expects categorical variables to be encoded appropriately.
# Here we create a simple indicator for sex == 'Female' as an example.
df_cox["sex_female"] = (df_cox["sex"] == "Female").astype(int)

cph = CoxPHFitter()
cph.fit(df_cox[["time_cvd", "event_cvd", "age", "bmi", "sex_female"]],
        duration_col="time_cvd",
        event_col="event_cvd")

cph.print_summary()


In [None]:
"""Plot example survival curves for two profiles.

We:

- Define two hypothetical profiles (for example, lower vs higher BMI).
- Use the fitted Cox model to estimate survival curves.
"""

# Define two example profiles.
profile_low = {
    "age": 60,
    "bmi": 24,
    "sex_female": 1,
}

profile_high = {
    "age": 60,
    "bmi": 32,
    "sex_female": 1,
}

profiles = pd.DataFrame([profile_low, profile_high])
profiles.index = ["BMI 24", "BMI 32"]

surv = cph.predict_survival_function(profiles)

fig, ax = plt.subplots(figsize=(6, 4))

for label in surv.columns:
    ax.plot(surv.index, surv[label], label=label)

ax.set_xlabel("Follow-up time")
ax.set_ylabel("Estimated survival probability")
ax.set_title("Cox model: example survival curves")
ax.legend()

plt.tight_layout()
plt.show()


## 3. Quantile regression

So far we have focused on models for the **mean** of the outcome (linear regression), the **probability** of a binary outcome (logistic regression), or the **hazard** of an event (Cox regression).

For many nutritional and biomedical variables the distribution is **skewed**. In such cases it can be helpful to model not only the mean, but also specific **quantiles** (for example, the median or upper decile).

**Quantile regression** estimates conditional quantiles of \( Y \) given predictors. For the median (quantile 0.5) we write:

\[
Q_{0.5}(Y \mid X) = \beta_0(0.5) + \beta_1(0.5) X.
\]

The interpretation of \( \beta_1(0.5) \) is:

> The expected difference in the *median* of Y associated with a one-unit difference in X, holding other predictors constant.
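
Quantile regression is fitted by minimising the so-called check (or pinball) loss rather than the sum of squared residuals. The sketch below illustrates this in the simplest possible setting, a model with no predictors, on simulated skewed data (an assumption for illustration; not the FB2NEP cohort). The constant that minimises the check loss at \( q = 0.5 \) is the sample median, and at \( q = 0.9 \) it is the sample 0.9 quantile.

```python
import numpy as np


def pinball_loss(y, c, q):
    """Mean check (pinball) loss of the constant prediction c at quantile q."""
    u = y - c
    return np.mean(np.where(u >= 0, q * u, (q - 1) * u))


rng = np.random.default_rng(0)

# Simulated right-skewed outcome (hypothetical biomarker values).
y = rng.lognormal(mean=3.0, sigma=0.5, size=2001)

# Try every observed value as a candidate constant and keep the best one.
candidates = np.sort(y)
best = {}
for q in (0.5, 0.9):
    losses = [pinball_loss(y, c, q) for c in candidates]
    best[q] = candidates[int(np.argmin(losses))]
    print(f"q = {q}: minimiser = {best[q]:.3f}, "
          f"sample quantile = {np.quantile(y, q):.3f}")
```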


In [None]:
"""Quantile regression example: BMI on age.

We:

- Fit an ordinary least squares (OLS) model (mean regression).
- Fit a median (q = 0.5) quantile regression model.
- Fit an upper-quantile (q = 0.9) model.
- Compare the fitted lines.

We use `statsmodels` for quantile regression.
"""

from statsmodels.regression.quantile_regression import QuantReg

# Subset to complete cases for age and bmi.
df_qr = df[["age", "bmi"]].dropna().copy()

# Ordinary least squares for comparison.
ols_model = smf.ols("bmi ~ age", data=df_qr).fit()

# Quantile regression at median (0.5) and 0.9.
qr_model_05 = QuantReg(df_qr["bmi"], sm.add_constant(df_qr["age"]))
qr_result_05 = qr_model_05.fit(q=0.5)

qr_model_09 = QuantReg(df_qr["bmi"], sm.add_constant(df_qr["age"]))
qr_result_09 = qr_model_09.fit(q=0.9)

# Create prediction grid.
age_grid = np.linspace(df_qr["age"].min(), df_qr["age"].max(), 100)
X_grid = sm.add_constant(age_grid)

bmi_hat_ols = ols_model.predict(pd.DataFrame({"age": age_grid}))
bmi_hat_05 = qr_result_05.predict(X_grid)
bmi_hat_09 = qr_result_09.predict(X_grid)

fig, ax = plt.subplots(figsize=(6, 4))

ax.scatter(df_qr["age"], df_qr["bmi"], alpha=0.2, edgecolor="none", label="Observed BMI")
ax.plot(age_grid, bmi_hat_ols, linewidth=2, label="OLS (mean)")
ax.plot(age_grid, bmi_hat_05, linewidth=2, linestyle="--", label="Quantile 0.5 (median)")
ax.plot(age_grid, bmi_hat_09, linewidth=2, linestyle=":", label="Quantile 0.9")

ax.set_xlabel("Age (years)")
ax.set_ylabel("Body mass index (kg/m²)")
ax.set_title("BMI ~ age: mean and quantile regression")

ax.legend()
plt.tight_layout()
plt.show()


### 3.1 Strengths and limitations of quantile regression

**Strengths:**

- Provides a more complete description of the conditional distribution of Y.
- Robust to outliers (especially when modelling the median).
- Naturally accommodates heteroscedasticity (non-constant variance).

**Limitations:**

- Interpretation can be less intuitive than mean regression.
- Confidence intervals and hypothesis tests are more complex.
- More demanding computationally (although not an issue for this workbook).

In nutritional epidemiology quantile regression can be particularly useful when:

- The upper tail of a distribution is of special interest (for example, high sodium intake).
- The outcome distribution is strongly skewed (for example, some biomarkers).


## 4. Assumptions of regression models

All models are simplifications of reality. To interpret results sensibly we need to be aware of their assumptions.

Here we briefly review key assumptions for:

- Linear regression.
- Logistic regression.
- Cox proportional hazards regression.

Diagnostics and practical illustrations follow in the next section.


### 4.1 Linearity

In a standard linear regression model we assume that the relationship between each continuous predictor and the outcome is **linear** (after any transformations we choose).

If the true relationship is markedly non-linear, then:

- The model may fit poorly.
- Estimates of effect may be biased.
- Residual plots may show systematic patterns.

Later in this workbook we will introduce non-linear models (polynomials and splines) that relax this assumption.


### 4.2 Independence

We usually assume that the residuals (errors) are **independent** between individuals.

Violations of independence can occur when:

- The same individual contributes multiple observations (for example, repeated measures).
- Observations are clustered (for example, participants from the same household or clinic).

More advanced methods, such as mixed models or cluster-robust standard errors, are used in those situations. Here we make the simplifying assumption of independence.


### 4.3 Homoscedasticity

**Homoscedasticity** means that the variance of the residuals is constant across levels of the predictors.

If residual variance increases or decreases with fitted values (heteroscedasticity), then:

- Estimates of standard errors may be biased.
- Confidence intervals and P-values may be unreliable.

Residual-versus-fitted plots can be used to detect such patterns.


### 4.4 Normality of residuals

For linear regression, we often assume that the residuals are approximately **normally distributed**.

- This assumption is not necessary for obtaining unbiased estimates of the regression coefficients.
- It matters mainly for **inference** (confidence intervals and tests) in small samples.

Normality can be explored with **Q–Q plots**, which compare the distribution of residuals with a theoretical normal distribution.


### 4.5 Multicollinearity

**Multicollinearity** arises when predictors are strongly correlated with one another.

Consequences:

- Coefficients may be unstable.
- Standard errors become large.
- It can be difficult to disentangle separate effects.

The **variance inflation factor (VIF)** is a commonly used diagnostic: large VIF values indicate problematic collinearity.
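
The VIF for predictor \( X_j \) is \( 1 / (1 - R_j^2) \), where \( R_j^2 \) is obtained by regressing \( X_j \) on all other predictors. The following sketch computes it directly with NumPy on simulated data; the predictors `x1`, `x2`, `x3` and the helper `vif` are hypothetical and chosen so that `x1` and `x2` are strongly correlated while `x3` is independent.

```python
import numpy as np


def vif(X, j):
    """VIF for column j of X (predictor columns only, no intercept column)."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    r_squared = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r_squared)


rng = np.random.default_rng(3)
n = 2_000
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)   # strongly correlated with x1
x3 = rng.normal(size=n)                    # independent of x1 and x2

X = np.column_stack([x1, x2, x3])
vifs = [vif(X, j) for j in range(X.shape[1])]

for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"VIF({name}) = {value:.2f}")
```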


### 4.6 Separation in logistic regression

In logistic regression, **separation** occurs when a predictor (or combination of predictors) perfectly predicts the outcome (for example, all smokers have disease, all non-smokers are healthy).

Consequences:

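- Maximum likelihood estimates may not exist: the likelihood keeps increasing as the coefficient grows, so the estimate drifts towards ±∞.
- Standard software may fail to converge, or may report extremely large coefficients and standard errors.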

In practice one may:

- Collapse categories.
- Use penalised logistic regression.
- Rethink the model structure.
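
Why the maximum likelihood estimate fails to exist under separation can be seen in a toy example. The sketch below uses six made-up observations in which every negative `x` has outcome 0 and every positive `x` has outcome 1, and a logistic model with a slope only (no intercept, an assumption made to keep the example short). The log-likelihood keeps rising towards zero as the slope increases, so there is no finite maximum.

```python
import math

# Perfectly separated toy data (hypothetical).
x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y = [0, 0, 0, 1, 1, 1]


def log_likelihood(beta):
    """Log-likelihood of a no-intercept logistic model with slope `beta`."""
    total = 0.0
    for xi, yi in zip(x, y):
        sign = 1.0 if yi == 1 else -1.0
        # log P(Y = yi | xi) = -log(1 + exp(-sign * beta * xi))
        total -= math.log1p(math.exp(-sign * beta * xi))
    return total


betas = [0.5, 1.0, 2.0, 5.0, 10.0, 20.0]
log_liks = [log_likelihood(b) for b in betas]

for b, ll in zip(betas, log_liks):
    print(f"beta = {b:5.1f}  log-likelihood = {ll:.6f}")
```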


### 4.7 Proportional hazards in Cox regression

The Cox model assumes that hazard ratios are **constant over time** (proportional hazards). In other words:

\[
\frac{h(t \mid X = 1)}{h(t \mid X = 0)} = \text{constant in } t.
\]

Violations of this assumption can be detected using:

- Plots of log(-log(survival)) curves.
- Schoenfeld residuals and associated tests.

If proportional hazards does not hold, options include:

- Stratified Cox models.
- Time-varying coefficients.
- Alternative modelling approaches.


## 5. Model diagnostics

We now illustrate a few standard diagnostic tools for regression models.

The aim is not to be exhaustive, but to provide a first hands-on experience with:

- Residual plots.
- Q–Q plots.
- Influence diagnostics.
- Goodness-of-fit metrics.


In [None]:
"""Diagnostics for linear regression: residual plots and Q–Q plot.

We use the previously fitted linear model 'result_lin'.
"""

# Compute residuals and fitted values.
resid = result_lin.resid
fitted = result_lin.fittedvalues

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs fitted values.
axes[0].scatter(fitted, resid, alpha=0.3, edgecolor="none")
axes[0].axhline(0, color="black", linewidth=1)
axes[0].set_xlabel("Fitted values")
axes[0].set_ylabel("Residuals")
axes[0].set_title("Residuals vs fitted")

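# Q–Q plot for residuals. `fit=True` standardises the residuals (estimated
# location and scale) so that the 45-degree reference line is meaningful.
sm.ProbPlot(resid, fit=True).qqplot(line="45", ax=axes[1])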
axes[1].set_title("Q–Q plot of residuals")

plt.tight_layout()
plt.show()


In [None]:
"""Variance Inflation Factors (VIF) for linear regression predictors.

We:

- Construct the design matrix for the linear model.
- Compute VIF for each predictor.

High VIF values (for example, > 5 or > 10) may indicate problematic collinearity.
"""

from statsmodels.stats.outliers_influence import variance_inflation_factor

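# Build design matrix for the linear model. The intercept column stays in X
# so that the VIFs of the other predictors are computed correctly.
X = result_lin.model.exog
vif_values = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
vif_table = pd.DataFrame({
    "variable": result_lin.model.exog_names,
    "VIF": vif_values,
})

# The VIF reported for the intercept itself is not interpretable; drop that row.
vif_table = vif_table[vif_table["variable"] != "Intercept"]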

vif_table


In [None]:
"""Receiver operating characteristic (ROC) curve for logistic regression.

We:

- Compute predicted probabilities of hypertension.
- Calculate true positive and false positive rates.
- Plot the ROC curve and compute the area under the curve (AUC).

This provides a measure of overall discrimination.
"""

from sklearn.metrics import roc_curve, roc_auc_score

# Ensure no missing values for the logistic model variables.
df_log = df[["hypertension", "bmi", "age", "sex"]].dropna().copy()
result_log = smf.logit("hypertension ~ bmi + age + C(sex)", data=df_log).fit(disp=False)

y_true = df_log["hypertension"]
y_score = result_log.predict(df_log)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

fig, ax = plt.subplots(figsize=(6, 4))

ax.plot(fpr, tpr, linewidth=2, label=f"ROC curve (AUC = {auc:.3f})")
ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="No discrimination")

ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("ROC curve for logistic regression")
ax.legend()

plt.tight_layout()
plt.show()


## 6. Non-linear models

The term "linear regression" refers to linearity in the **parameters** (β), not necessarily in the predictors themselves.

Many epidemiological relationships are **non-linear**. For example:

- Body mass index and mortality risk.
- Age and blood pressure.
- Sodium intake and blood pressure.

To capture such patterns we can:

- Add **polynomial terms** (for example, age²).
- Use **splines**, which fit smooth curves made of polynomial segments.

In this section we briefly introduce both approaches.


### 6.1 Polynomial regression

A simple extension of linear regression is to add powers of a predictor, for example:

\[
Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon.
\]

This is still a linear model in the parameters \( \beta_0, \beta_1, \beta_2 \), but represents a **curved** relationship between X and Y.

Caution is required:

- High-order polynomials can behave very erratically at the boundaries of the data.
- Interpretation of individual coefficients is difficult; the focus should be on the **overall shape** of the fitted curve.


In [None]:
"""Polynomial regression example: BMI on age and age².

We:

- Create a squared age term.
- Fit a model with age and age².
- Plot the fitted curve and compare with the simple linear model.
"""

df_poly = df[["age", "bmi"]].dropna().copy()
df_poly["age2"] = df_poly["age"] ** 2

model_poly = smf.ols("bmi ~ age + age2", data=df_poly).fit()

age_grid = np.linspace(df_poly["age"].min(), df_poly["age"].max(), 100)
pred_poly = model_poly.predict(pd.DataFrame({
    "age": age_grid,
    "age2": age_grid ** 2,
}))

pred_lin = ols_model.predict(pd.DataFrame({"age": age_grid}))

fig, ax = plt.subplots(figsize=(6, 4))

ax.scatter(df_poly["age"], df_poly["bmi"], alpha=0.2, edgecolor="none", label="Observed BMI")
ax.plot(age_grid, pred_lin, linewidth=2, label="Linear model")
ax.plot(age_grid, pred_poly, linewidth=2, linestyle="--", label="Polynomial (age + age²)")

ax.set_xlabel("Age (years)")
ax.set_ylabel("Body mass index (kg/m²)")
ax.set_title("Polynomial regression: BMI ~ age + age²")

ax.legend()
plt.tight_layout()
plt.show()


### 6.2 Splines

**Splines** provide a more flexible and stable approach to modelling non-linear relationships.

Idea:

- The range of X is divided into intervals by "knots".
- Within each interval we fit low-degree polynomials.
- The pieces are joined smoothly at the knots.

A widely used choice in epidemiology is the **restricted cubic spline**, which behaves linearly beyond the outer knots and smoothly between knots.

Advantages:

- Flexible yet stable.
- Interpretation focuses on the **shape** of the curve.
- Works well in large cohorts.


In [None]:
"""Restricted cubic spline example: BMI on age.

We:

- Construct spline basis functions for age.
- Fit a linear model using the spline terms.
- Plot the fitted spline curve.

We use `patsy` to generate the spline basis.
"""

from patsy import dmatrix, build_design_matrices

# Use a subset with complete data.
df_spline = df[["age", "bmi"]].dropna().copy()

# Construct a spline basis for age with 4 degrees of freedom.
# The patsy function 'cr' creates a natural cubic regression spline,
# which is linear beyond the boundary knots.
spline_basis = dmatrix("cr(age, df=4)", data=df_spline, return_type="dataframe")

# Fit OLS directly on the basis matrix, which already contains an
# intercept column.
model_spline = sm.OLS(df_spline["bmi"], spline_basis).fit()

# Prediction grid, evaluated with the knots learned from the data
# (stored in the basis matrix's design_info).
age_grid = np.linspace(df_spline["age"].min(), df_spline["age"].max(), 100)
spline_grid = build_design_matrices([spline_basis.design_info],
                                    {"age": age_grid},
                                    return_type="dataframe")[0]

pred_spline = model_spline.predict(spline_grid)

fig, ax = plt.subplots(figsize=(6, 4))

ax.scatter(df_spline["age"], df_spline["bmi"], alpha=0.2, edgecolor="none", label="Observed BMI")
ax.plot(age_grid, pred_spline, linewidth=2, label="Spline fit (df = 4)")

ax.set_xlabel("Age (years)")
ax.set_ylabel("Body mass index (kg/m²)")
ax.set_title("Restricted cubic spline: BMI ~ age")

ax.legend()
plt.tight_layout()
plt.show()


### 6.3 Comparing models

To decide whether a non-linear term is useful we can compare models using:

- Visual inspection of fitted curves.
- Goodness-of-fit measures such as the Akaike information criterion (AIC).
- Likelihood ratio tests (for nested models).

For example, we can compare:

- A simple linear model (BMI ~ age).
- A polynomial model (BMI ~ age + age²).
- A spline model (BMI ~ spline(age)).

Lower AIC values indicate a better trade-off between fit and complexity.
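
The AIC is defined as \( 2k - 2\log L \), where \( k \) is the number of estimated parameters and \( L \) the maximised likelihood. The sketch below computes it by hand for a Gaussian linear model on simulated data with a curved relationship; the data-generating model and the helper `gaussian_aic` are assumptions for illustration. Here the error variance is counted as a parameter; software conventions differ on this point, but since it adds the same constant to every model the ranking is unaffected.

```python
import numpy as np


def gaussian_aic(y, X):
    """AIC of a Gaussian linear model fitted by least squares."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    sigma2 = rss / n                                   # ML variance estimate
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * (k + 1) - 2 * log_lik                   # +1 for the variance


rng = np.random.default_rng(2)
n = 1_000
x = rng.uniform(40, 80, size=n)

# Assumed curved relationship plus noise (hypothetical).
y = 10 + 0.8 * x - 0.006 * x**2 + rng.normal(0, 1, size=n)

X_lin = np.column_stack([np.ones(n), x])
X_quad = np.column_stack([np.ones(n), x, x**2])

aic_lin = gaussian_aic(y, X_lin)
aic_quad = gaussian_aic(y, X_quad)

print(f"AIC linear:    {aic_lin:.1f}")
print(f"AIC quadratic: {aic_quad:.1f}")
```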


In [None]:
"""Compare linear, polynomial, and spline models using AIC.

This is a simple numeric comparison; interpretation still relies on graphs and subject-matter knowledge.
"""

aic_results = pd.DataFrame({
    "model": ["Linear", "Polynomial (age + age²)", "Spline (df = 4)"],
    "AIC": [ols_model.aic, model_poly.aic, model_spline.aic],
})

aic_results


## 7. Interpreting effect estimates

Different regression models produce different types of effect estimates. It is important to be clear about their meaning.

- **β (beta) coefficients** in linear regression: expected difference in the mean outcome per unit change in the predictor.
- **Odds ratios (OR)** in logistic regression: multiplicative change in the odds of the outcome.
- **Risk ratios (RR)**: multiplicative change in risk (probability); not directly estimated in standard logistic models.
- **Hazard ratios (HR)** in Cox regression: multiplicative change in the instantaneous hazard.

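In polynomial and spline models the interpretation usually focuses on the **shape of the curve** rather than individual coefficients. In quantile regression the coefficients are read like those of linear regression, but they describe differences in the chosen quantile rather than in the mean.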
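
Odds ratios and risk ratios are easily confused. The short sketch below, using made-up risks, compares the two measures for a rare and a common outcome with the same risk ratio of 1.5. When the outcome is rare the odds ratio is close to the risk ratio; when it is common the odds ratio is further from 1.

```python
def risk_ratio(p_exposed, p_unexposed):
    """Ratio of two risks (probabilities)."""
    return p_exposed / p_unexposed


def odds_ratio(p_exposed, p_unexposed):
    """Ratio of the odds corresponding to two risks."""
    odds_exposed = p_exposed / (1.0 - p_exposed)
    odds_unexposed = p_unexposed / (1.0 - p_unexposed)
    return odds_exposed / odds_unexposed


# Hypothetical risks in exposed and unexposed groups.
scenarios = {
    "rare outcome": (0.015, 0.010),
    "common outcome": (0.45, 0.30),
}

summary = {}
for label, (p1, p0) in scenarios.items():
    summary[label] = (risk_ratio(p1, p0), odds_ratio(p1, p0))
    print(f"{label}: RR = {summary[label][0]:.2f}, OR = {summary[label][1]:.2f}")
```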


In [None]:
"""Extract and summarise effect estimates from the fitted models.

We:

- Summarise β estimates from the linear model.
- Present odds ratios from the logistic model.
- Present hazard ratios from the Cox model.

This illustrates how different models report different effect measures.
"""

# Linear regression coefficients (bmi ~ age + C(sex)).
beta_lin = result_lin.params.to_frame(name="estimate")
beta_lin["model"] = "Linear (BMI)"

# Logistic regression odds ratios (hypertension ~ bmi + age + C(sex)).
params_log = result_log.params
conf_log = result_log.conf_int()
or_table = pd.DataFrame({
    "estimate": np.exp(params_log),
    "CI_lower": np.exp(conf_log[0]),
    "CI_upper": np.exp(conf_log[1]),
})
or_table["model"] = "Logistic (hypertension)"

# Cox model hazard ratios.
cox_summary = cph.summary[["coef", "exp(coef)", "exp(coef) lower 95%", "exp(coef) upper 95%"]].copy()
cox_summary.rename(columns={
    "coef": "coef",
    "exp(coef)": "HR",
    "exp(coef) lower 95%": "CI_lower",
    "exp(coef) upper 95%": "CI_upper",
}, inplace=True)
cox_summary["model"] = "Cox (time to CVD)"

display(beta_lin)
display(or_table)
display(cox_summary)


## 8. Estimation and inference (brief overview)

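Most regression models in this workbook are estimated using **maximum likelihood** or a closely related approach (ordinary least squares for linear regression, partial likelihood for the Cox model). Quantile regression is the exception: it minimises an asymmetric absolute-error loss.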

The key ideas are:

- Parameters are estimated by finding values that make the observed data "most likely" under the assumed model.
- Standard errors quantify the typical variation of estimates across hypothetical repeated samples.
- **Wald tests** and **likelihood ratio tests** are used to assess whether coefficients differ from zero.
- **Confidence intervals** indicate a range of parameter values that are compatible with the observed data and the model assumptions.

A full treatment of the underlying theory is beyond the scope of FB2NEP, but it is important to know that:

- Estimates are subject to sampling variability.
- P-values and confidence intervals rely on model assumptions.
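
As an illustration of a Wald test, the sketch below divides an estimate by its standard error and converts the resulting z statistic into a two-sided P-value under a normal approximation. The estimate and standard error are hypothetical numbers, and `wald_test` is a helper written for this example using only the standard library.

```python
import math


def wald_test(estimate, std_error):
    """Return the Wald z statistic and two-sided P-value (normal approximation)."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical coefficient and standard error.
z_stat, p_val = wald_test(0.045, 0.012)
print(f"z = {z_stat:.2f}, two-sided P = {p_val:.5f}")

# Sanity check: z = 1.96 corresponds to P of about 0.05.
_, p_196 = wald_test(1.96, 1.0)
print(f"P at z = 1.96: {p_196:.4f}")
```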


In [None]:
"""Manual computation of a confidence interval for a linear regression coefficient.

We illustrate the basic idea using the coefficient for 'age' in the linear model.

The 95 % confidence interval is:

    estimate ± 1.96 * standard_error

under a normal approximation.
"""

# Extract estimate and standard error for 'age'.
age_est = result_lin.params["age"]
age_se = result_lin.bse["age"]

ci_lower = age_est - 1.96 * age_se
ci_upper = age_est + 1.96 * age_se

print("Coefficient for age (linear model):", f"{age_est:.3f}")
print("Standard error:", f"{age_se:.3f}")
print("Approximate 95 % CI:", f"[{ci_lower:.3f}, {ci_upper:.3f}]")


## 9. Predictions from fitted models

One of the most practical uses of regression models is to obtain **predicted values** for specified combinations of predictors.

Examples:

- Predicted mean BMI at age 65 years in women.
- Predicted probability of hypertension at age 65 years for different BMI values.
- Predicted survival curves for different risk profiles.

In all cases it is important to remember:

- Predictions depend on the **assumed model** and its **fitted parameters**.
- Uncertainty in predictions can be quantified (for example, by confidence intervals or prediction intervals).
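
The difference between a confidence interval for the predicted **mean** and a prediction interval for a **new individual** can be sketched as follows. The data are simulated (assumed model: BMI = 22 + 0.05 × age + noise), the fit uses plain NumPy least squares, and the intervals use a normal approximation with the multiplier 1.96.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
age = rng.uniform(40, 80, size=n)
bmi = 22.0 + 0.05 * age + rng.normal(0, 3, size=n)  # assumed data-generating model

# Least-squares fit and residual variance estimate.
X = np.column_stack([np.ones(n), age])
beta_hat, *_ = np.linalg.lstsq(X, bmi, rcond=None)
resid = bmi - X @ beta_hat
s2 = resid @ resid / (n - X.shape[1])
xtx_inv = np.linalg.inv(X.T @ X)

# Predictor values at which to predict: intercept and age 65.
x0 = np.array([1.0, 65.0])
y_hat = x0 @ beta_hat
leverage = x0 @ xtx_inv @ x0

se_mean = np.sqrt(s2 * leverage)          # uncertainty in the fitted mean
se_new = np.sqrt(s2 * (1.0 + leverage))   # adds individual-level variability

z = 1.96
ci = (y_hat - z * se_mean, y_hat + z * se_mean)
pi = (y_hat - z * se_new, y_hat + z * se_new)

print(f"Predicted mean BMI at age 65: {y_hat:.2f}")
print(f"95 % CI for the mean:       [{ci[0]:.2f}, {ci[1]:.2f}]")
print(f"95 % prediction interval:   [{pi[0]:.2f}, {pi[1]:.2f}]")
```

The prediction interval is much wider because it includes the residual variability between individuals, which does not shrink as the sample size grows.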


In [None]:
"""Prediction from a linear model: BMI at age 65.

We:

- Create a small DataFrame with the desired predictor values.
- Use the `predict` method of the fitted model.

For simplicity we focus on a single sex.
"""

# Example: predict BMI at age 65 for women.
new_data = pd.DataFrame({
    "age": [65],
    "sex": ["Female"],
})

pred_bmi = result_lin.predict(new_data)

print("Predicted mean BMI at age 65 (Female):", float(pred_bmi.iloc[0]))


In [None]:
"""Prediction from a logistic model: probability of hypertension.

We:

- Create a grid of BMI values at a fixed age and sex.
- Compute predicted probabilities using the logistic model.
"""

bmi_grid = np.linspace(df["bmi"].quantile(0.05),
                       df["bmi"].quantile(0.95),
                       100)

new_data = pd.DataFrame({
    "bmi": bmi_grid,
    "age": age_ref,
    "sex": sex_ref,
})

new_data["p_hyp"] = result_log.predict(new_data)

fig, ax = plt.subplots(figsize=(6, 4))

ax.plot(new_data["bmi"], new_data["p_hyp"], linewidth=2)

ax.set_xlabel("Body mass index (kg/m²)")
ax.set_ylabel("Predicted probability of hypertension")
ax.set_title(f"Predicted probability vs BMI (age = {age_ref}, sex = {sex_ref})")
plt.tight_layout()
plt.show()


In [None]:
"""Prediction from a spline model.

We:

- Use the spline model fitted earlier (BMI ~ spline(age)).
- Compute predicted BMI over an age grid.
- Plot the curve, which captures non-linearity.
"""

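from patsy import build_design_matrices

age_grid = np.linspace(df_spline["age"].min(), df_spline["age"].max(), 100)

# Evaluate the spline basis on the grid using the knots learned from the
# original data (stored in the basis matrix's design_info).
spline_grid = build_design_matrices([spline_basis.design_info],
                                    {"age": age_grid},
                                    return_type="dataframe")[0]

pred_spline = model_spline.predict(spline_grid)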

fig, ax = plt.subplots(figsize=(6, 4))

ax.plot(age_grid, pred_spline, linewidth=2)
ax.set_xlabel("Age (years)")
ax.set_ylabel("Predicted BMI (kg/m²)")
ax.set_title("Predicted BMI vs age (spline model)")
plt.tight_layout()
plt.show()


## 10. Summary and further reading

In this workbook you have:

- Reviewed the basic idea of regression as modelling conditional expectations.
- Fitted and interpreted linear, logistic, and Cox proportional hazards models.
- Seen how quantile regression extends the idea to conditional quantiles.
- Examined key model assumptions and basic diagnostics.
- Modelled non-linear relationships using polynomial terms and splines.
- Obtained predictions from fitted models.

These tools are building blocks for more advanced topics in nutritional epidemiology:

- Confounding and adjustment.
- Causal diagrams (DAGs).
- Mediation analysis.
- Missing data and more complex model structures.

These topics are developed further in **Workbook 7**.

**Suggested further reading:**

- Kleinbaum, D. G., and Klein, M. *Logistic Regression: A Self-Learning Text.*
- Harrell, F. E. *Regression Modeling Strategies.*
- Rothman, K. J., Greenland, S., and Lash, T. L. *Modern Epidemiology.*
- Koenker, R. *Quantile Regression.*
