# FB2NEP Assignment Notebook – Personal Dataset and Analysis

**Module:** FB2NEP Nutritional Epidemiology and Public Health  
**Academic Year:** 2025/26

---

## Overview

This notebook supports **Part B** of the FB2NEP coursework assignment. It generates a **personal synthetic dataset** based on your student ID and guides you through:

1. Creating a baseline characteristics table ("Table 1") comparing **males and females**
2. Exploring distributions and considering transformations
3. Fitting regression models relating dietary intake to health outcomes
4. Linking your model specification to causal reasoning (DAGs)

You do **not** need to understand Python code to complete this notebook. In most cases, you only need to:

1. Edit one line to add your **student ID**
2. Run the code cells in order (top to bottom)
3. Copy selected tables and figures into your Word document
4. Answer the questions in your own words

> **Important:** Warnings (yellow text) are usually harmless. If you see a red error message, check that you have run every earlier cell in order, then re-run the cell. If the problem persists, ask for help.

## Assignment Structure

The full coursework consists of:

**Part A (in Word only):**
- Short knowledge questions on epidemiological concepts
- Drawing and explaining a directed acyclic graph (DAG)
- Interpretation of results from a published epidemiological study

**Part B (this notebook + Word):**
- **B1:** Table 1 – baseline characteristics by sex, with commentary
- **B2:** Distribution exploration and transformation decisions
- **B3:** Regression model and interpretation
- **B4:** DAG-informed adjustment strategy
- **Bonus:** Interaction or non-linear effects (optional)

This notebook is designed to support **Part B**. You will copy results and figures from here into your Word document and write your interpretations there.

---

## Step 1 – Set Up Python Libraries

Run the cell below once. It loads the Python libraries that this notebook uses.

In [None]:
# ============================================================
# Import required Python libraries
#
# You do not need to change anything in this cell.
# Simply run it once. Warnings are usually fine.
# ============================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from IPython.display import display
from scipy import stats

# Make plots appear inside the notebook
%matplotlib inline

# Nicer table display
pd.set_option("display.max_columns", 50)
pd.set_option("display.precision", 3)

print("✓ Libraries imported successfully.")

---

## Step 2 – Enter Your Student ID (Personal Random Seed)

To give students different datasets, we use your student ID (or candidate number) to create a **personal random seed**.

### Instructions:
1. Edit the line `student_id = "12345678"` below
2. Replace `"12345678"` with your own student ID (keep the quotation marks)
3. Run the cell

> **Please use the same ID every time** you run this notebook. This ensures you always get the same dataset.

In [None]:
# ============================================================
# EDIT THIS LINE: Replace "12345678" with your student ID
# ============================================================

student_id = "12345678"  # <-- CHANGE THIS TO YOUR STUDENT ID

# ============================================================
# Do not edit below this line
# ============================================================

# Convert student ID to a numeric seed
seed_int = sum(ord(c) for c in student_id) * 17 + len(student_id) * 1337
print(f"Your student ID: {student_id}")
print(f"Your personal random seed: {seed_int}")
print("\n✓ Seed set. Run the next cell to generate your dataset.")
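
The seed above depends only on the sum of the character codes and the length of the ID, so two IDs that contain the same digits in a different order (for example `"12345678"` and `"12345687"`) give the same seed and therefore the same dataset. The sketch below contrasts this with one possible order-sensitive alternative based on a SHA-256 hash. The helper names `notebook_seed` and `hashed_seed` are hypothetical, and this is not the seed the notebook uses; leave the cell above unchanged so that your dataset stays the same.

```python
import hashlib

def notebook_seed(student_id: str) -> int:
    """Seed formula from the cell above (insensitive to character order)."""
    return sum(ord(c) for c in student_id) * 17 + len(student_id) * 1337

def hashed_seed(student_id: str) -> int:
    """Hypothetical alternative: first 8 bytes of the SHA-256 digest as an integer."""
    digest = hashlib.sha256(student_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

id_a, id_b = "12345678", "12345687"  # same digits, last two swapped

print("Notebook seeds equal:", notebook_seed(id_a) == notebook_seed(id_b))
print("Hashed seeds equal:  ", hashed_seed(id_a) == hashed_seed(id_b))
```

Either function returns the same value every time it is called with the same ID, which is what makes the dataset reproducible.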

---

## Step 3 – Generate Your Personal Dataset

Run this cell to generate a synthetic cohort of **2,000 participants** with:

- Demographics: age, sex, socioeconomic status (SES), Index of Multiple Deprivation (IMD)
- Lifestyle: smoking status, physical activity
- Diet: sugar-sweetened beverage (SSB) intake, fruit and vegetable intake, salt intake
- Anthropometry: BMI, waist circumference
- Biomarkers: systolic blood pressure (SBP)
- Outcomes: CVD risk score, obesity status

The dataset is designed to have realistic epidemiological properties and known associations.

In [None]:
# ============================================================
# Generate the synthetic dataset for this assignment
# ============================================================

rng = np.random.default_rng(seed_int)
n = 2000

# Study ID
study_id = np.arange(1, n + 1)

# Sex: approximately 52% female, 48% male
sex = rng.choice(["Female", "Male"], size=n, p=[0.52, 0.48])

# Age: 40–80 years, mean around 55, slight sex difference
age = np.where(
    sex == "Female",
    rng.normal(loc=54, scale=11, size=n),
    rng.normal(loc=56, scale=12, size=n)
)
age = np.clip(age, 40, 80)

# IMD quintile (1=most deprived, 5=least deprived)
imd_quintile = rng.choice([1, 2, 3, 4, 5], size=n, p=[0.15, 0.20, 0.30, 0.20, 0.15])

# SES: derived from IMD with some noise
ses_probs = {
    1: [0.55, 0.35, 0.10],
    2: [0.40, 0.45, 0.15],
    3: [0.25, 0.50, 0.25],
    4: [0.15, 0.45, 0.40],
    5: [0.10, 0.35, 0.55]
}
ses = np.array([rng.choice(["Low", "Middle", "High"], p=ses_probs[q]) for q in imd_quintile])

# Smoking status
smoking = rng.choice(["Never", "Former", "Current"], size=n, p=[0.50, 0.32, 0.18])

# Physical activity: socially patterned
pa_base = np.where(ses == "Low", 0.25, np.where(ses == "Middle", 0.35, 0.45))
pa_probs = np.column_stack([1 - pa_base - 0.3, 0.3 * np.ones(n), pa_base])
pa = np.array([rng.choice(["Low", "Moderate", "High"], p=pa_probs[i]) for i in range(n)])

# SSB intake (servings/day): skewed, socially patterned, sex difference
ssb_base = rng.gamma(shape=1.8, scale=0.6, size=n)
ssb_ses_effect = np.where(ses == "Low", 0.4, np.where(ses == "Middle", 0.1, -0.2))
ssb_sex_effect = np.where(sex == "Male", 0.3, 0)
ssb_age_effect = -0.015 * (age - 55)
ssb = np.clip(ssb_base + ssb_ses_effect + ssb_sex_effect + ssb_age_effect, 0, 6)

# Fruit and vegetable intake (portions/day)
fv_base = rng.normal(loc=4.0, scale=1.5, size=n)
fv_ses_effect = np.where(ses == "Low", -0.8, np.where(ses == "Middle", 0, 0.6))
fv_sex_effect = np.where(sex == "Female", 0.5, 0)
fruit_veg = np.clip(fv_base + fv_ses_effect + fv_sex_effect, 0, 12)

# Salt intake (g/day)
salt_base = rng.normal(loc=8.0, scale=2.0, size=n)
salt_sex_effect = np.where(sex == "Male", 1.2, 0)
salt = np.clip(salt_base + salt_sex_effect, 3, 15)

# BMI: influenced by diet, PA, age, sex
bmi_base = rng.normal(loc=26.5, scale=4.5, size=n)
bmi_ssb_effect = 0.8 * ssb
bmi_fv_effect = -0.3 * fruit_veg
bmi_pa_effect = np.where(pa == "Low", 1.5, np.where(pa == "Moderate", 0, -1.2))
bmi_age_effect = 0.05 * (age - 55)
bmi_sex_effect = np.where(sex == "Male", 0.5, 0)
bmi = np.clip(bmi_base + bmi_ssb_effect + bmi_fv_effect + bmi_pa_effect + bmi_age_effect + bmi_sex_effect, 18, 50)

# Waist circumference (cm): correlated with BMI, sex difference
waist_base = 40 + 1.8 * bmi + rng.normal(0, 5, size=n)
waist_sex_effect = np.where(sex == "Male", 8, 0)
waist = np.clip(waist_base + waist_sex_effect, 60, 150)

# Obesity status (BMI >= 30)
obese = (bmi >= 30).astype(int)

# Systolic blood pressure: influenced by salt, BMI, age, smoking
sbp_base = rng.normal(loc=120, scale=12, size=n)
sbp_salt_effect = 1.5 * (salt - 8)
sbp_bmi_effect = 0.8 * (bmi - 26)
sbp_age_effect = 0.5 * (age - 55)
sbp_smoking_effect = np.where(smoking == "Current", 5, 0)
sbp_sex_effect = np.where(sex == "Male", 4, 0)
sbp = np.clip(sbp_base + sbp_salt_effect + sbp_bmi_effect + sbp_age_effect + sbp_smoking_effect + sbp_sex_effect, 90, 200)

# CVD risk score (continuous, 0-100 scale)
cvd_base = rng.normal(loc=15, scale=8, size=n)
cvd_age_effect = 0.5 * (age - 55)
cvd_sbp_effect = 0.25 * (sbp - 120)
cvd_bmi_effect = 0.4 * (bmi - 26)
cvd_smoking_effect = np.where(smoking == "Current", 8, np.where(smoking == "Former", 3, 0))
cvd_sex_effect = np.where(sex == "Male", 5, 0)
cvd_ses_effect = np.where(ses == "Low", 4, np.where(ses == "Middle", 1, -2))
cvd_ssb_effect = 1.2 * ssb  # Direct effect of SSB on CVD risk
cvd_risk = np.clip(cvd_base + cvd_age_effect + cvd_sbp_effect + cvd_bmi_effect + 
                   cvd_smoking_effect + cvd_sex_effect + cvd_ses_effect + cvd_ssb_effect, 0, 80)

# Create DataFrame
df = pd.DataFrame({
    "study_id": study_id,
    "sex": sex,
    "age": np.round(age, 1),
    "imd_quintile": imd_quintile,
    "ses": ses,
    "smoking": smoking,
    "pa": pa,
    "ssb": np.round(ssb, 2),
    "fruit_veg": np.round(fruit_veg, 2),
    "salt_g_d": np.round(salt, 2),
    "bmi": np.round(bmi, 1),
    "waist_cm": np.round(waist, 1),
    "obese": obese,
    "sbp": np.round(sbp, 1),
    "cvd_risk": np.round(cvd_risk, 2)
})

# Add some missing values (realistic)
missing_idx_bmi = rng.choice(n, size=int(n * 0.02), replace=False)
missing_idx_sbp = rng.choice(n, size=int(n * 0.03), replace=False)
df.loc[missing_idx_bmi, "bmi"] = np.nan
df.loc[missing_idx_sbp, "sbp"] = np.nan

print("✓ Dataset generated successfully!")
print(f"  - {len(df)} participants")
print(f"  - {len(df.columns)} variables")
print("\nFirst 5 rows:")
display(df.head())
print("\nVariable types:")
print(df.dtypes)
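
The cell above sets about 2% of BMI values and 3% of SBP values to missing. The statsmodels formula interface drops rows with a missing value in any model variable by default, so models that include `bmi` or `sbp` are fitted on fewer participants than models that do not. A minimal sketch of how missingness could be counted, using a small made-up frame (`toy`) in place of `df`:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the notebook's `df` (values are made up)
toy = pd.DataFrame({
    "ssb": [0.5, 1.2, 2.0, 0.0, 3.1],
    "bmi": [24.1, np.nan, 31.0, 27.5, np.nan],
    "sbp": [118.0, 130.5, np.nan, 125.0, 140.2],
})

# Number of missing values per variable
n_missing = toy.isna().sum()
print(n_missing)

# Rows that remain once any row with a missing BMI or SBP is dropped
complete = toy.dropna(subset=["bmi", "sbp"])
print(f"Complete cases: {len(complete)} of {len(toy)}")
```

Running `df.isna().sum()` on your own dataset would give the corresponding counts.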

---

# Part B1 – Table 1: Baseline Characteristics by Sex

A **"Table 1"** is a standard table in epidemiological papers that summarises the baseline characteristics of the study population. It typically:

- Shows continuous variables as mean ± SD (or median and IQR if skewed)
- Shows categorical variables as counts and percentages
- Compares groups (in this case, males vs females)
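
The table produced below reports mean ± SD for every continuous variable. For a skewed variable such as SSB intake, a median with interquartile range may describe the data better. The sketch below shows one way this could be computed; `median_iqr` is a hypothetical helper and the data are simulated, so the numbers are not from your dataset.

```python
import numpy as np
import pandas as pd

def median_iqr(values: pd.Series) -> str:
    """Format a variable as 'median (Q1 to Q3)'; missing values are ignored."""
    q1, med, q3 = values.quantile([0.25, 0.50, 0.75])
    return f"{med:.2f} ({q1:.2f} to {q3:.2f})"

# Made-up right-skewed intake data for two groups
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "sex": np.repeat(["Female", "Male"], 200),
    "ssb": rng.gamma(shape=1.8, scale=0.6, size=400),
})

# One formatted summary per group
summary = toy.groupby("sex")["ssb"].apply(median_iqr)
print(summary)
```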

### Question B1 (to answer in Word)

Using the table produced below:

1. **Describe** the main characteristics of male vs female participants
2. **Identify** substantial differences that might be epidemiologically relevant
3. **Consider** how these differences might affect analyses of diet-disease associations

Write approximately 200 words in your Word document.

> Copy the table into Word and refer to specific numbers in your commentary.

In [None]:
# ============================================================
# Create Table 1 – Baseline Characteristics by Sex
# ============================================================

def make_table1(data, group_var, continuous_vars, categorical_vars):
    """Create a Table 1 with means (SD) for continuous and counts (%) for categorical."""
    
    groups = sorted(data[group_var].dropna().unique())
    table = {"Variable": []}
    
    # Add column for each group plus total
    for g in groups:
        table[g] = []
    table["Total"] = []
    
    # Sample sizes
    table["Variable"].append("N")
    for g in groups:
        n_g = len(data[data[group_var] == g])
        table[g].append(str(n_g))
    table["Total"].append(str(len(data)))
    
    # Continuous variables
    for v in continuous_vars:
        if v not in data.columns:
            continue
        table["Variable"].append(f"{v}, mean ± SD")
        for g in groups:
            df_g = data[data[group_var] == g]
            m = df_g[v].mean()
            s = df_g[v].std()
            table[g].append(f"{m:.1f} ± {s:.1f}")
        m_total = data[v].mean()
        s_total = data[v].std()
        table["Total"].append(f"{m_total:.1f} ± {s_total:.1f}")
    
    # Categorical variables
    for v in categorical_vars:
        if v not in data.columns:
            continue
        categories = sorted(data[v].dropna().unique())
        for cat in categories:
            table["Variable"].append(f"{v}: {cat}, n (%)")
            for g in groups:
                df_g = data[data[group_var] == g]
                count = (df_g[v] == cat).sum()
                pct = 100 * count / len(df_g) if len(df_g) > 0 else 0
                table[g].append(f"{count} ({pct:.1f}%)")
            count_total = (data[v] == cat).sum()
            pct_total = 100 * count_total / len(data)
            table["Total"].append(f"{count_total} ({pct_total:.1f}%)")
    
    return pd.DataFrame(table)

# Define variables for Table 1
continuous_vars = ["age", "bmi", "waist_cm", "ssb", "fruit_veg", "salt_g_d", "sbp", "cvd_risk"]
categorical_vars = ["ses", "smoking", "pa", "obese"]

table1 = make_table1(df, group_var="sex", continuous_vars=continuous_vars, categorical_vars=categorical_vars)

print("="*70)
print("TABLE 1 – Baseline Characteristics by Sex")
print("="*70)
display(table1)
print("\n➜ Copy this table into your Word document and write your commentary there.")

In [None]:
# ============================================================
# Statistical comparisons (for reference)
# ============================================================

print("Statistical comparisons between males and females:")
print("="*60)

males = df[df["sex"] == "Male"]
females = df[df["sex"] == "Female"]

# Two-sample t-tests for continuous variables (SciPy default: equal variances assumed)
for v in ["age", "bmi", "ssb", "fruit_veg", "salt_g_d", "sbp", "cvd_risk"]:
    t_stat, p_val = stats.ttest_ind(males[v].dropna(), females[v].dropna())
    sig = "*" if p_val < 0.05 else ""
    print(f"{v:15s}: t = {t_stat:6.2f}, p = {p_val:.4f} {sig}")

print("\n* indicates p < 0.05")
print("\nNote: These tests are for descriptive purposes only.")
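
With roughly 1,000 participants per group, even small differences can give p < 0.05, so a p-value says little about whether a difference is large enough to matter. A standardised mean difference (SMD) expresses the difference in SD units instead; a commonly quoted rule of thumb treats an absolute SMD above about 0.1 as a noteworthy imbalance. The sketch below uses a hypothetical helper and simulated salt intakes, not your dataset.

```python
import numpy as np

def standardised_mean_difference(x, y):
    """(mean of x - mean of y) divided by the pooled SD; missing values are ignored."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x, y = x[~np.isnan(x)], y[~np.isnan(y)]
    pooled_sd = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
    return (x.mean() - y.mean()) / pooled_sd

# Made-up salt intakes (g/day) for two groups
rng = np.random.default_rng(1)
salt_male = rng.normal(loc=9.2, scale=2.0, size=960)
salt_female = rng.normal(loc=8.0, scale=2.0, size=1040)

smd = standardised_mean_difference(salt_male, salt_female)
print(f"SMD (male - female) = {smd:.2f}")
```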

---

# Part B2 – Distributions and Transformations

Before fitting regression models, we should examine the distributions of key variables. This helps us decide whether:

- Variables need **transformation** (e.g., log-transformation for skewed distributions)
- There are **outliers** that might influence results
- Assumptions for statistical models are likely to be met
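
In this dataset SSB intake is clipped at 0, and the logarithm of 0 is undefined, so a plain log-transformation would fail for participants with zero intake. One common option is `np.log1p`, which computes log(1 + x). The sketch below applies it to simulated intake data (not your dataset) and compares skewness before and after.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Made-up right-skewed intake with some exact zeros, similar in shape to `ssb`
raw = rng.gamma(shape=1.8, scale=0.6, size=2000) - 0.2
intake = pd.Series(np.clip(raw, 0, None))

# log(1 + x) is defined at x = 0, unlike log(x)
log_intake = np.log1p(intake)

print(f"Zeros in raw data: {(intake == 0).sum()}")
print(f"Skewness (raw):    {intake.skew():.2f}")
print(f"Skewness (log1p):  {log_intake.skew():.2f}")
```

After a transformation like this, a regression coefficient describes the change in outcome per unit of log(1 + intake) rather than per serving, which is harder to explain in plain language.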

### Question B2 (to answer in Word)

Based on the histograms and boxplots below:

1. **Describe** the distribution of each variable (symmetric, skewed, outliers?)
2. **State** whether you would consider any transformation and explain why
3. **Discuss** how the distributions might affect interpretation of regression coefficients

Write approximately 150 words in your Word document.

> You may copy one or two plots into your Word document as illustration.

In [None]:
# ============================================================
# Distribution plots for key variables
# ============================================================

vars_to_plot = ["ssb", "bmi", "cvd_risk", "sbp"]

fig, axes = plt.subplots(len(vars_to_plot), 2, figsize=(12, 3*len(vars_to_plot)))

for i, var in enumerate(vars_to_plot):
    if var not in df.columns:
        continue
    
    # Histogram
    ax = axes[i, 0]
    df[var].hist(bins=30, ax=ax, color="steelblue", edgecolor="white")
    ax.set_xlabel(var)
    ax.set_ylabel("Frequency")
    ax.set_title(f"Histogram of {var}")
    
    # Boxplot by sex
    ax = axes[i, 1]
    df.boxplot(column=var, by="sex", ax=ax)
    ax.set_xlabel("Sex")
    ax.set_ylabel(var)
    ax.set_title(f"Boxplot of {var} by sex")
    plt.suptitle("")  # Remove automatic title

plt.tight_layout()
plt.show()

print("\n➜ Review the plots and answer Question B2 in your Word document.")

In [None]:
# ============================================================
# Summary statistics for distribution assessment
# ============================================================

print("Summary statistics for key variables:")
print("="*70)

summary_vars = ["ssb", "bmi", "cvd_risk", "sbp", "fruit_veg", "salt_g_d"]
summary_df = df[summary_vars].describe().T
summary_df["skewness"] = df[summary_vars].skew()
summary_df["kurtosis"] = df[summary_vars].kurtosis()

# pandas reports excess kurtosis (a normal distribution has a value of about 0)
display(summary_df[["count", "mean", "std", "min", "25%", "50%", "75%", "max", "skewness", "kurtosis"]])

print("\nInterpretation guide:")
print("  - Skewness near 0: approximately symmetric")
print("  - Skewness > 0.5: right-skewed (consider log-transformation)")
print("  - Skewness < -0.5: left-skewed")

---

# Part B3 – Regression Model and Interpretation

In this section, you will fit a regression model relating **CVD risk** to **SSB intake**, adjusting for potential confounders.

We use a **linear regression model** with `cvd_risk` as the continuous outcome:

$$\text{CVD risk} = \beta_0 + \beta_1 \times \text{SSB} + \beta_2 \times \text{age} + \beta_3 \times \text{sex} + \ldots + \varepsilon$$
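
To see what $\beta_1$ means, compare two participants who differ by one serving of SSB per day but have the same values of every other covariate. Under the model above, the difference in their expected CVD risk scores is:

$$E[\text{CVD risk} \mid \text{SSB} = x + 1, \text{age}, \text{sex}, \ldots] - E[\text{CVD risk} \mid \text{SSB} = x, \text{age}, \text{sex}, \ldots] = \beta_1$$

The intercept and all other covariate terms cancel, so $\beta_1$ is the average difference in CVD risk score per additional daily serving of SSB, holding the other covariates in the model fixed.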

### Question B3 (to answer in Word)

Using the regression output:

1. **Interpret** the coefficient for SSB in plain language
2. **Explain** what happens to the SSB coefficient when you add more covariates
3. **Discuss** the model fit (R-squared) and what it tells you
4. **Identify** which covariates have the strongest associations with CVD risk

Write approximately 200 words in your Word document.

In [None]:
# ============================================================
# Model 1: Crude (unadjusted) association
# ============================================================

print("MODEL 1: Crude association (SSB only)")
print("="*60)

model1 = smf.ols(formula="cvd_risk ~ ssb", data=df).fit()
print(f"CVD risk = {model1.params['Intercept']:.2f} + {model1.params['ssb']:.2f} × SSB")
print(f"\nR-squared: {model1.rsquared:.3f}")
print("\nInterpretation: Each additional serving of SSB per day is associated with a")
print(f"CVD risk score that is {model1.params['ssb']:.2f} points higher on average (95% CI: {model1.conf_int().loc['ssb', 0]:.2f} to {model1.conf_int().loc['ssb', 1]:.2f})")

In [None]:
# ============================================================
# Model 2: Adjusted for demographics
# ============================================================

print("\nMODEL 2: Adjusted for demographics (age, sex)")
print("="*60)

model2 = smf.ols(formula="cvd_risk ~ ssb + age + C(sex)", data=df).fit()
print(f"SSB coefficient: {model2.params['ssb']:.2f} (95% CI: {model2.conf_int().loc['ssb', 0]:.2f} to {model2.conf_int().loc['ssb', 1]:.2f})")
print(f"R-squared: {model2.rsquared:.3f}")

In [None]:
# ============================================================
# Model 3: Fully adjusted model
# ============================================================

print("\nMODEL 3: Fully adjusted model")
print("="*60)

formula = "cvd_risk ~ ssb + age + C(sex) + C(ses) + C(smoking) + C(pa) + bmi"
model3 = smf.ols(formula=formula, data=df).fit()

print("\nFull regression summary:")
display(model3.summary())

In [None]:
# ============================================================
# Summary table comparing models
# ============================================================

print("\nCOMPARISON OF MODELS")
print("="*70)

comparison = pd.DataFrame({
    "Model": ["Crude", "Demographics", "Fully adjusted"],
    # N differs between models because rows with missing BMI are dropped when bmi is included
    "N": [int(model1.nobs), int(model2.nobs), int(model3.nobs)],
    "SSB coefficient": [model1.params['ssb'], model2.params['ssb'], model3.params['ssb']],
    "95% CI lower": [model1.conf_int().loc['ssb', 0], model2.conf_int().loc['ssb', 0], model3.conf_int().loc['ssb', 0]],
    "95% CI upper": [model1.conf_int().loc['ssb', 1], model2.conf_int().loc['ssb', 1], model3.conf_int().loc['ssb', 1]],
    "p-value": [model1.pvalues['ssb'], model2.pvalues['ssb'], model3.pvalues['ssb']],
    "R-squared": [model1.rsquared, model2.rsquared, model3.rsquared]
})

display(comparison.round(3))

print("\n➜ Copy this table into Word and discuss how the SSB coefficient changes.")

---

# Part B4 – DAG-Informed Adjustment Strategy

This section links your **causal diagram (DAG)** from Part A with your regression model.

A simple DAG for the SSB-CVD relationship might look like:

```
         ┌─────── SES ────────┐
         ↓                    ↓
        SSB ───→ BMI ───→ CVD risk
         │                    ↑
         └────────────────────┘
```

Where:
- **SES** is a confounder (affects both SSB intake and CVD risk)
- **BMI** may be a mediator (SSB → BMI → CVD) or a confounder
- There may be a direct effect of SSB on CVD (independent of BMI)
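
If BMI is a mediator, the SSB coefficient without BMI in the model estimates a total effect and the coefficient with BMI estimates a direct effect. In linear regression fitted on the same participants, the gap between the two equals the product of the SSB → BMI coefficient and the BMI → CVD coefficient. The sketch below checks this on simulated data using plain NumPy least squares rather than statsmodels; the helper `ols` is hypothetical, SES is treated as a single continuous variable for simplicity, and the effect sizes only loosely mirror the data-generation cell.

```python
import numpy as np

def ols(y, *columns):
    """Least-squares coefficients for y ~ 1 + columns (intercept first)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(3)
n = 2000
ses = rng.normal(size=n)                              # confounder
ssb = 1.0 - 0.3 * ses + rng.normal(size=n)            # exposure
bmi = 26 + 0.8 * ssb + rng.normal(scale=4, size=n)    # mediator
cvd = 15 + 1.2 * ssb + 0.4 * bmi - 2 * ses + rng.normal(scale=8, size=n)

total = ols(cvd, ssb, ses)[1]         # SSB coefficient, BMI not in the model
direct = ols(cvd, ssb, ses, bmi)[1]   # SSB coefficient, BMI in the model
a = ols(bmi, ssb, ses)[1]             # SSB -> BMI, adjusted for SES
b = ols(cvd, ssb, ses, bmi)[3]        # BMI -> CVD, adjusted for SSB and SES

print(f"total = {total:.3f}, direct = {direct:.3f}")
print(f"total - direct = {total - direct:.3f}, a * b = {a * b:.3f}")
```

The identity only holds when both models use exactly the same rows, which matters here because some BMI values are missing.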

### Question B4 (to answer in Word)

1. Does your regression model include all variables necessary to control for confounding?
2. Identify one variable that you think **should NOT** be adjusted for (because it's a collider or mediator) and explain why
3. If BMI is on the causal pathway, what does adjusting for it tell us?

Write approximately 150 words in your Word document.

In [None]:
# ============================================================
# Compare models with and without BMI (mediator analysis)
# ============================================================

print("MEDIATOR ANALYSIS: Effect of adjusting for BMI")
print("="*60)

# Restrict both models to participants with a BMI measurement, so that the
# two SSB coefficients are estimated on the same sample
df_bmi = df.dropna(subset=["bmi"])

# Model without BMI (total effect)
formula_no_bmi = "cvd_risk ~ ssb + age + C(sex) + C(ses) + C(smoking) + C(pa)"
model_no_bmi = smf.ols(formula=formula_no_bmi, data=df_bmi).fit()

# Model with BMI (direct effect)
formula_with_bmi = "cvd_risk ~ ssb + age + C(sex) + C(ses) + C(smoking) + C(pa) + bmi"
model_with_bmi = smf.ols(formula=formula_with_bmi, data=df_bmi).fit()

print(f"SSB coefficient WITHOUT BMI (total effect): {model_no_bmi.params['ssb']:.3f}")
print(f"SSB coefficient WITH BMI (direct effect):    {model_with_bmi.params['ssb']:.3f}")
print(f"\nDifference: {model_no_bmi.params['ssb'] - model_with_bmi.params['ssb']:.3f}")
print(f"\nThis difference represents the portion of the SSB-CVD association")
print(f"that may be mediated through BMI.")

---

# Optional Bonus – Additional Analyses (up to +5 marks)

If you would like to attempt the optional bonus marks, you can explore:

1. **Interaction effects**: Does the SSB-CVD association differ by sex?
2. **Non-linear effects**: Is the association linear, or does it plateau/accelerate?
3. **Stratified analysis**: Separate models for males and females
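
The example cells below cover interaction and stratification but not non-linear effects. One simple way to probe non-linearity is to add a squared term for the exposure; in the notebook's formula interface the analogous term would be `I(ssb**2)`. The sketch below illustrates the idea on simulated data with plain NumPy least squares; the helper `fit` is hypothetical and the plateau is built into the made-up data.

```python
import numpy as np

def fit(y, *columns):
    """Return least-squares coefficients (intercept first) and R-squared."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    r2 = 1 - resid.var() / y.var()
    return coef, r2

rng = np.random.default_rng(4)
n = 2000
ssb = rng.uniform(0, 6, size=n)
# Made-up outcome whose rise with SSB flattens at higher intakes
cvd = 10 + 4.0 * ssb - 0.4 * ssb**2 + rng.normal(scale=3, size=n)

coef_lin, r2_lin = fit(cvd, ssb)
coef_quad, r2_quad = fit(cvd, ssb, ssb**2)

print(f"Linear model:    slope = {coef_lin[1]:.2f}, R-squared = {r2_lin:.3f}")
print(f"Quadratic model: ssb = {coef_quad[1]:.2f}, ssb^2 = {coef_quad[2]:.2f}, R-squared = {r2_quad:.3f}")
```

A negative coefficient on the squared term suggests the association flattens at higher intakes; a positive one suggests it accelerates.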

### Bonus question (to answer in Word)

- Briefly describe what additional analysis you performed
- Present the key result
- Explain in plain language what this means

Maximum length: 150 words.

In [None]:
# ============================================================
# Example: Interaction between SSB and sex
# ============================================================

print("BONUS EXAMPLE: Interaction between SSB and sex")
print("="*60)

formula_interaction = "cvd_risk ~ ssb * C(sex) + age + C(ses) + C(smoking)"
model_interaction = smf.ols(formula=formula_interaction, data=df).fit()

print("\nInteraction model summary:")
display(model_interaction.summary())

print("\nLook at the 'ssb:C(sex)[T.Male]' coefficient.")
print("If significant, this suggests the SSB-CVD association differs by sex.")
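
With Female as the reference category, the interaction model can be written as follows (other covariates omitted for brevity; the symbols $\gamma_0, \ldots, \gamma_3$ are introduced here for illustration):

$$\text{CVD risk} = \gamma_0 + \gamma_1 \times \text{SSB} + \gamma_2 \times \text{Male} + \gamma_3 \times (\text{SSB} \times \text{Male}) + \ldots + \varepsilon$$

Setting $\text{Male} = 0$ and $\text{Male} = 1$ gives the slope for SSB in each sex:

$$\text{slope}_{\text{Female}} = \gamma_1, \qquad \text{slope}_{\text{Male}} = \gamma_1 + \gamma_3$$

So the `ssb` coefficient in the output is the slope among females, and `ssb:C(sex)[T.Male]` is the difference in slope between males and females.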

In [None]:
# ============================================================
# Stratified analysis by sex
# ============================================================

print("\nSTRATIFIED ANALYSIS BY SEX")
print("="*60)

formula_strat = "cvd_risk ~ ssb + age + C(ses) + C(smoking) + C(pa) + bmi"

model_female = smf.ols(formula=formula_strat, data=df[df["sex"] == "Female"]).fit()
model_male = smf.ols(formula=formula_strat, data=df[df["sex"] == "Male"]).fit()

# Report the number of participants actually used in each model
# (rows with missing BMI are dropped during fitting)
print(f"\nFemales (n={int(model_female.nobs)}):")
print(f"  SSB coefficient: {model_female.params['ssb']:.3f} (95% CI: {model_female.conf_int().loc['ssb', 0]:.3f} to {model_female.conf_int().loc['ssb', 1]:.3f})")

print(f"\nMales (n={int(model_male.nobs)}):")
print(f"  SSB coefficient: {model_male.params['ssb']:.3f} (95% CI: {model_male.conf_int().loc['ssb', 0]:.3f} to {model_male.conf_int().loc['ssb', 1]:.3f})")

---

# End of Notebook

You have completed all the code required for **Part B** of the assignment.

## Checklist – What to include in your Word document

- [ ] **B1:** Table 1 and commentary on male/female differences (~200 words)
- [ ] **B2:** Discussion of distributions and transformations (~150 words)
- [ ] **B3:** Regression results and interpretation (~200 words)
- [ ] **B4:** DAG-informed adjustment discussion (~150 words)
- [ ] **Bonus:** Additional analysis (optional, ~150 words)

## Important reminders

- The emphasis of marking is on your **epidemiological reasoning and interpretation**, not on Python code
- Include your **student ID** on the first page of your Word document
- Do **not** include your name in the document
- Submit via Blackboard by the deadline

---

*Good luck with your assignment!*