# FB2NEP Assignment Notebook – Personal Dataset and Analysis

This notebook supports **Part B** of the FB2NEP coursework assignment:

- It generates a **personal synthetic dataset** for you, based on your student ID.
- It guides you through:
  - Creating a baseline characteristics table ("Table 1").
  - Exploring distributions and considering transformations.
  - Fitting a regression model relating sugar-sweetened beverage (SSB) intake, obesity and CVD risk.
  - Obtaining output that you can copy into your Word document.

You do **not** need to understand Python code to complete this notebook. In most cases, you only need to:

1. Edit one line to add your **student ID**.
2. Run the code cells in order (top to bottom).
3. Copy selected tables and figures into your Word document.
4. Answer the questions in your own words in the Word document.

> **Important:** You may see warnings (often in yellow). These are usually harmless for this assignment. If you see a long red error message, check that you have run all earlier cells in order (top to bottom), then re-run the cell that failed. If the problem persists, ask for help.

## Reminder – Assignment structure

The full coursework consists of:

- **Part A (in Word only):**
  - Short knowledge questions.
  - Drawing and explaining a DAG.
  - Interpreting published results.

- **Part B (this notebook + Word):**
  - **B1:** Table 1 and commentary.
  - **B2:** Distributions and transformations.
  - **B3:** Regression model and interpretation.
  - **B4:** DAG-informed adjustment strategy.
  - Optional bonus.

This notebook is designed to support **Part B**. You will need to copy results and figures from here into your Word document and write your interpretations there.

## Step 1 – Set up Python libraries

Run the cell below once. It loads the Python libraries that this notebook uses.

In [None]:
# ============================================================
# Import required Python libraries
#
# You do not need to change anything in this cell.
# Simply run it once. If you see warnings, that is usually fine.
# ============================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display

# Make plots appear inside the notebook
%matplotlib inline

# For nicer table display (if supported)
pd.set_option("display.max_columns", 50)
pd.set_option("display.precision", 3)

print("Libraries imported successfully.")

## Step 2 – Enter your student ID (personal random seed)

To ensure that **each student receives a different dataset**, we use your student ID (or candidate number) to create a **personal random seed**.

1. Edit the line `student_id = "12345678"` below and replace `"12345678"` with your own student ID (keep the quotation marks).
2. Run the cell.
3. The notebook will convert this into a number and use it to generate your personal dataset.

> Please use the **same ID every time** you run this notebook. This ensures that you always get the same dataset.
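
To see why using the same ID matters, here is a minimal sketch of the idea: the ID is hashed into an integer, and that integer seeds the random number generator. The helper name `seed_from_id` and the example IDs are hypothetical and used only for illustration; the hashing steps mirror the cell below.

```python
import hashlib

import numpy as np


def seed_from_id(student_id: str) -> int:
    """Hash an ID string with SHA-256 and keep the first 8 bytes as an integer."""
    digest = hashlib.sha256(student_id.strip().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


# Hypothetical example IDs (not real students)
seed_a = seed_from_id("12345678")
seed_a_again = seed_from_id(" 12345678 ")  # surrounding spaces are stripped
seed_b = seed_from_id("12345679")

# The same seed always reproduces the same random draws
draws_1 = np.random.default_rng(seed_a).normal(size=3)
draws_2 = np.random.default_rng(seed_a_again).normal(size=3)

print(seed_a == seed_a_again)            # same ID -> same seed
print(seed_a != seed_b)                  # different ID -> different seed
print(np.array_equal(draws_1, draws_2))  # same seed -> identical draws
```

Changing even one digit of the ID gives a different seed, and therefore a different dataset.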

In [None]:
# ============================================================
# Personal identifier -> random seed
#
# INSTRUCTION:
# Replace "12345678" with your own student ID or candidate number.
# Keep the quotation marks.
# Example: student_id = "19004567"
# ============================================================

student_id = "12345678"  # <-- EDIT THIS LINE ONLY

import hashlib

if not isinstance(student_id, str) or len(student_id.strip()) == 0:
    raise ValueError("Please enter your student ID as a non-empty string.")

# Convert the student ID string into a stable integer seed using SHA-256
hasher = hashlib.sha256(student_id.strip().encode("utf-8"))
seed_int = int.from_bytes(hasher.digest()[:8], "big")  # use first 8 bytes

print(f"Student ID: {student_id}")
print(f"Derived random seed: {seed_int}")

## Step 3 – Generate your personal dataset

The cell below will:

- Use your personal random seed.
- Generate a synthetic dataset with variables such as:
  - `age` (years)
  - `sex` (Male/Female)
  - `ses` (socioeconomic status: low/middle/high)
  - `smoking` (never/former/current)
  - `pa` (physical activity: low/moderate/high)
  - `ssb` (sugar-sweetened beverages, servings per day)
  - `bmi` (kg/m²)
  - `obese` (0 = not obese, 1 = obese)
  - `cvd_risk` (continuous CVD risk score)

The dataset will be created as a pandas DataFrame called `df` and also saved to a CSV file named:

`my_fb2nep_assignment_data.csv`

You can download this CSV file from Colab if you wish (for example via the file browser on the left).
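
Two of the variables are derived from others: `obese` is 1 when `bmi` is at least 30, and `cvd_risk` is obtained by passing a linear predictor through a logistic transform scaled to 0–100. The sketch below illustrates both steps with made-up numbers; the names `bmi_example`, `lp_example`, `obese_example` and `risk_example` are hypothetical and are not part of your dataset.

```python
import numpy as np

# Made-up BMI values and linear-predictor values, for illustration only
bmi_example = np.array([22.0, 29.9, 30.0, 41.5])
lp_example = np.array([-5.0, 0.0, 5.0])

# Obesity indicator: 1 if BMI >= 30, otherwise 0
obese_example = (bmi_example >= 30).astype(int)

# Logistic transform rescaled to lie strictly between 0 and 100
risk_example = 100 / (1 + np.exp(-lp_example))

print(obese_example)
print(np.round(risk_example, 1))
```

A linear predictor of 0 maps to a score of 50; large negative values approach 0 and large positive values approach 100.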

In [None]:
# ============================================================
# Generate the synthetic dataset for this assignment
# ============================================================

rng = np.random.default_rng(seed_int)

# Number of participants in the synthetic cohort
n = 2000

# Study ID (1, 2, 3, ...)
study_id = np.arange(1, n + 1)

# Age: roughly 30–85 years (values outside this range are clipped), with a mean around 55
age = rng.normal(loc=55, scale=12, size=n)
age = np.clip(age, 30, 85)

# Sex: approximately 52 % female, 48 % male
sex = rng.choice(["Female", "Male"], size=n, p=[0.52, 0.48])

# Socioeconomic status (ses): low, middle, high
ses = rng.choice(["low", "middle", "high"], size=n, p=[0.30, 0.45, 0.25])

# Smoking status: never, former, current
smoking = rng.choice(["never", "former", "current"], size=n, p=[0.50, 0.30, 0.20])

# Physical activity (pa): low, moderate, high
pa = rng.choice(["low", "moderate", "high"], size=n, p=[0.35, 0.45, 0.20])

# Base SSB intake (servings per day):
# Start with a skewed distribution and then modify by SES, age and sex.
ssb_base = rng.gamma(shape=1.5, scale=0.7, size=n)  # positively skewed

# SES effect: higher intake in low SES, lower in high SES
ses_effect = np.where(ses == "low", 0.6, np.where(ses == "middle", 0.2, -0.3))

# Age effect: slightly lower SSB with increasing age
age_effect = -(age - 55) * 0.01

# Sex effect: assume slightly higher SSB in males
sex_effect = np.where(sex == "Male", 0.2, 0.0)

ssb = ssb_base + ses_effect + age_effect + sex_effect
ssb = np.clip(ssb, 0, None)  # SSB cannot be negative

# BMI: mean around 27, increased by higher SSB and lower physical activity
bmi_base = rng.normal(loc=27, scale=4.5, size=n)

# Effect of SSB on BMI (small positive)
bmi_ssb_effect = ssb * 0.6

# Effect of physical activity on BMI
pa_effect = np.where(pa == "low", 1.5, np.where(pa == "moderate", 0.0, -1.0))

bmi = bmi_base + bmi_ssb_effect + pa_effect
bmi = np.clip(bmi, 16, 55)

# Obesity indicator (binary): BMI >= 30
obese = (bmi >= 30).astype(int)

# CVD risk score: construct a linear predictor and then add noise
# Higher age, male sex, higher BMI, smoking, low SES and higher SSB
# all contribute positively to CVD risk.

lp = (
    -5.0
    + 0.08 * (age - 55)                     # age effect
    + 0.06 * (bmi - 27)                     # BMI effect
    + 0.10 * ssb                            # SSB effect
    + np.where(sex == "Male", 0.8, 0.0)     # male sex
    + np.where(smoking == "former", 0.7, 0.0)
    + np.where(smoking == "current", 1.5, 0.0)
    + np.where(ses == "low", 0.7, 0.0)
    + np.where(ses == "high", -0.4, 0.0)
)

# Add random noise
lp_noisy = lp + rng.normal(loc=0.0, scale=0.8, size=n)

# Convert linear predictor to a risk score between 0 and 100 (logistic transform)
cvd_risk = 100 / (1 + np.exp(-lp_noisy))

# Assemble the DataFrame
df = pd.DataFrame({
    "id": study_id,
    "age": age,
    "sex": sex,
    "ses": ses,
    "smoking": smoking,
    "pa": pa,
    "ssb": ssb,
    "bmi": bmi,
    "obese": obese,
    "cvd_risk": cvd_risk,
})

print("First 5 rows of your personal dataset:")
display(df.head())

# Save to CSV
csv_filename = "my_fb2nep_assignment_data.csv"
df.to_csv(csv_filename, index=False)
print(f"\nDataset saved to: {csv_filename}")
print("You can download this file from the file browser in Colab if needed.")

# Part B1 – Table 1 and commentary

In this section you will:

1. Create a baseline characteristics table ("Table 1"), comparing **obese** and **non-obese** participants.
2. Copy the table into your Word document.
3. Write a brief commentary in your Word document (approximately 150 words).

### Question B1 (to answer in Word)

**B1.** Using the table produced below:

- Comment on the main characteristics of obese vs non-obese participants (age, sex, SES, smoking, SSB intake, CVD risk).
- Highlight any substantial differences that might be epidemiologically relevant.
- Note any anomalies or possible data quality issues you observe.

You do **not** write the answer here; instead, you copy the table into Word and write your commentary there.
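
If you are curious how the numbers in a "Table 1" arise, the following is a minimal sketch on a tiny made-up dataset (the `toy` DataFrame is hypothetical). It uses pandas `groupby` and `crosstab`, which is one possible approach and not necessarily how the cell below builds the table.

```python
import pandas as pd

# Tiny made-up dataset, for illustration only
toy = pd.DataFrame({
    "obese": [0, 0, 0, 1, 1],
    "age": [40.0, 50.0, 60.0, 55.0, 65.0],
    "sex": ["Female", "Male", "Female", "Male", "Male"],
})

# Continuous variable: mean and standard deviation within each group
age_summary = toy.groupby("obese")["age"].agg(["mean", "std"])

# Categorical variable: percentage of each category within each group
sex_percent = pd.crosstab(toy["obese"], toy["sex"], normalize="index") * 100

print(age_summary.round(1))
print(sex_percent.round(1))
```

Percentages are calculated within each obesity group, so each row of `sex_percent` sums to 100.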

In [None]:
# ============================================================
# Create a simple "Table 1" by obesity status
# ============================================================

def make_table1(data, group_var, continuous_vars, categorical_vars):
    """Create a simple Table 1 with means (SD) and counts (%).

    Parameters
    ----------
    data : pandas.DataFrame
        The dataset containing all variables.
    group_var : str
        Name of the grouping variable (for example "obese").
    continuous_vars : list of str
        Names of continuous variables (for example ["age", "bmi"]).
    categorical_vars : list of str
        Names of categorical variables (for example ["sex", "ses"]).
    """

    groups = data[group_var].dropna().unique()
    groups = sorted(groups)
    table = {}

    for g in groups:
        df_g = data[data[group_var] == g]
        col_dict = {}

        # Continuous variables: mean (SD)
        for v in continuous_vars:
            if v in data.columns:
                m = df_g[v].mean()
                s = df_g[v].std()
                col_dict[v] = f"{m:.1f} ({s:.1f})"  # mean (SD)

        # Categorical variables: counts and percentages
        for v in categorical_vars:
            if v in data.columns:
                # sort_index() keeps the category order the same in every group
                vc = df_g[v].value_counts(dropna=False).sort_index()
                total = len(df_g)
                entries = []
                for cat, count in vc.items():
                    percent = 100 * count / total if total > 0 else 0
                    entries.append(f"{cat}: {count} ({percent:.1f}%)")
                col_dict[v] = "; ".join(entries)

        table[g] = col_dict

    return pd.DataFrame(table)

# Define variables for Table 1
continuous_vars = ["age", "bmi", "ssb", "cvd_risk"]
categorical_vars = ["sex", "ses", "smoking", "pa"]

table1 = make_table1(df, group_var="obese", continuous_vars=continuous_vars, categorical_vars=categorical_vars)

print("Table 1 – Baseline characteristics by obesity status (obese = 0/1):")
display(table1)

print("\nPlease copy this table into your Word document and answer Question B1 there.")

# Part B2 – Distributions and transformations

In this section you will:

- Inspect the distributions of three key variables:
  - `ssb` (SSB intake, servings per day)
  - `bmi` (kg/m²)
  - `cvd_risk` (CVD risk score)
- Consider whether any of these variables might require transformation (for example log-transformation or categorisation) before being used in a regression model.

### Question B2 (to answer in Word)

**B2.** Based on the histograms and boxplots:

- Describe the distribution of each variable (for example symmetric, skewed, presence of outliers).
- State whether you would consider any transformation or categorisation for each variable and explain why.

Write your answer in approximately 150 words in your Word document.

> You may copy one or two plots into your Word document as illustration if you wish (this is optional but may help your explanation).
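
As an illustration of what a transformation can do, the sketch below simulates a right-skewed variable and compares its skewness before and after a log transformation. The data are simulated (they are not your dataset), and the names `rng_demo`, `skewed` and `logged` are hypothetical. It uses `log(1 + x)` rather than `log(x)` on the assumption that the variable may contain zeros.

```python
import numpy as np
import pandas as pd

# Simulated right-skewed values, for illustration only (not your dataset)
rng_demo = np.random.default_rng(0)
skewed = pd.Series(rng_demo.gamma(shape=1.5, scale=0.7, size=2000))

# log1p computes log(1 + x), which is defined when x is exactly 0
logged = np.log1p(skewed)

skew_raw = skewed.skew()
skew_log = logged.skew()

print(f"Skewness before: {skew_raw:.2f}")
print(f"Skewness after log(1 + x): {skew_log:.2f}")
```

A skewness near 0 suggests a roughly symmetric distribution; larger positive values indicate a longer right tail.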

In [None]:
# ============================================================
# Plot distributions: histograms and boxplots
# ============================================================

vars_to_plot = ["ssb", "bmi", "cvd_risk"]

for var in vars_to_plot:
    if var not in df.columns:
        continue

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    fig.suptitle(f"Distribution of {var}")

    # Histogram
    axes[0].hist(df[var].dropna(), bins=30)
    axes[0].set_xlabel(var)
    axes[0].set_ylabel("Frequency")
    axes[0].set_title("Histogram")

    # Boxplot
    axes[1].boxplot(df[var].dropna(), vert=True)
    axes[1].set_ylabel(var)
    axes[1].set_title("Boxplot")

    plt.tight_layout()
    plt.show()

print("Review the plots above, then answer Question B2 in your Word document.")

# Part B3 – Regression model and interpretation

In this section you will fit a regression model that relates CVD risk to SSB intake, obesity and other covariates.

We will use a **linear regression model** with `cvd_risk` as the continuous outcome and the following predictors:

- `ssb` (continuous SSB intake)
- `obese` (0/1)
- `age` (years)
- `sex` (categorical)
- `ses` (categorical)
- `smoking` (categorical)

### Question B3 (to answer in Word)

**B3.** Using the regression output:

- Interpret the estimated effect of **SSB intake** on CVD risk (direction, magnitude, and uncertainty).
- Interpret the effect of **obesity** on CVD risk.
- Comment on the effects of age and sex.
- Discuss briefly whether SES and smoking appear to confound the relationship between SSB intake and CVD risk.

Write your answer in approximately 250–350 words in your Word document.

> You should copy the regression table produced below into your Word document to support your interpretation.
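
One common way to look for confounding is to compare the coefficient of the exposure from a crude model with the coefficient from a model that also includes the suspected confounder. The sketch below illustrates this on simulated data with hypothetical variables `x` (exposure), `z` (confounder) and `y` (outcome). It uses NumPy least squares so that it runs on its own; it is not the model fitted in this notebook.

```python
import numpy as np

rng_demo = np.random.default_rng(1)
n_demo = 5000

# Simulated data, for illustration only: z influences both x and y
z = rng_demo.binomial(1, 0.5, size=n_demo)             # confounder (0/1)
x = 1.0 * z + rng_demo.normal(size=n_demo)             # exposure
y = 2.0 * x + 3.0 * z + rng_demo.normal(size=n_demo)   # outcome; true slope for x is 2.0

intercept = np.ones(n_demo)

# Crude model: y ~ x
X_crude = np.column_stack([intercept, x])
beta_crude = np.linalg.lstsq(X_crude, y, rcond=None)[0]

# Adjusted model: y ~ x + z
X_adj = np.column_stack([intercept, x, z])
beta_adj = np.linalg.lstsq(X_adj, y, rcond=None)[0]

print(f"Crude slope for x:    {beta_crude[1]:.2f}")
print(f"Adjusted slope for x: {beta_adj[1]:.2f}")
```

In this simulation the crude slope is pulled away from the true value of 2.0 because `z` is left out, while the adjusted slope is close to it. A noticeable change between crude and adjusted estimates is one sign that the added variable may be acting as a confounder.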

In [None]:
# ============================================================
# Fit a linear regression model for cvd_risk
# ============================================================

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Define the model formula
# C() tells statsmodels to treat the variable as categorical.
formula = "cvd_risk ~ ssb + obese + age + C(sex) + C(ses) + C(smoking)"

model = smf.ols(formula=formula, data=df).fit()

print("Linear regression model fitted. Summary:")
display(model.summary())

In [None]:
# ============================================================
# Create a concise regression table (estimates, CI, p-values)
# ============================================================

params = model.params
conf = model.conf_int()
pvals = model.pvalues

reg_table = pd.DataFrame({
    "beta": params,
    "ci_lower": conf[0],
    "ci_upper": conf[1],
    "p_value": pvals,
})

print("Concise regression table (copy this into Word if you wish):")
display(reg_table)

print("\nNow answer Question B3 in your Word document, using this table and the full summary above.")

# Part B4 – DAG-informed adjustment strategy

This part links your **causal diagram (DAG)** from Part A with your regression model from Part B.

You do **not** need to run any new code here: the question is conceptual.

### Question B4 (to answer in Word)

**B4.** Using the DAG you drew in Part A for the relationship between SSB intake and CVD:

- State whether the regression model you fitted in Part B3 includes all the variables you consider necessary to control for confounding.
- Identify one variable that you think **should not** be adjusted for (for example, because it is a collider or a mediator) and explain why, based on your DAG.

Write your answer in approximately 150 words in your Word document.

> No code is required for this question. Use your DAG and regression results to reason about appropriate adjustment.
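
To illustrate why adjusting for a mediator changes what a coefficient means, the sketch below simulates a hypothetical exposure `x` that affects an outcome `y` both directly and through a mediator `m`. All variables and effect sizes are made up, and the helper `fit_slopes` is a hypothetical convenience function; this is a toy demonstration, not an analysis of your dataset.

```python
import numpy as np

rng_demo = np.random.default_rng(3)
n_demo = 5000

# Simulated data, for illustration only: x affects y directly and through m
x = rng_demo.normal(size=n_demo)                       # exposure
m = 0.5 * x + rng_demo.normal(size=n_demo)             # mediator (on the causal path)
y = 1.0 * x + 2.0 * m + rng_demo.normal(size=n_demo)   # outcome


def fit_slopes(outcome, *predictors):
    """Ordinary least squares with an intercept; returns the slopes only."""
    design = np.column_stack([np.ones(len(outcome)), *predictors])
    coefs = np.linalg.lstsq(design, outcome, rcond=None)[0]
    return coefs[1:]


total_effect = fit_slopes(y, x)[0]       # y ~ x
direct_effect = fit_slopes(y, x, m)[0]   # y ~ x + m

print(f"Slope for x without m (total effect, true value 2.0): {total_effect:.2f}")
print(f"Slope for x with m (direct effect, true value 1.0):   {direct_effect:.2f}")
```

Including the mediator removes the part of the effect that travels through it, so the coefficient for `x` no longer estimates the total effect.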

# Optional bonus – Additional models (up to +5 marks)

If you would like to attempt the optional bonus marks, you can experiment with one of the following ideas:

- Add an **interaction term** between SSB intake and obesity.
- Explore a **non-linear effect** of SSB (for example by categorising SSB or including a squared term).
- Compare this linear model with an alternative model (for example, fitting a logistic regression to a binary outcome that you define, such as high vs low CVD risk).

### Bonus question (to answer in Word)

- Briefly describe what additional model you fitted.
- Present the key result (for example the interaction term or non-linear effect).
- Explain in plain language what this result might mean.

Maximum length: 150 words.

The cell below shows one **example**: adding an interaction between SSB and obesity. You may adapt this or create your own model.
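
If you prefer one of the other ideas, the sketch below shows how the extra variables might be prepared: SSB tertiles, a squared SSB term, and a binary high/low outcome. The data are simulated and the column names `ssb_tertile`, `ssb_sq` and `high_risk` are hypothetical. The median cut-off for "high" risk is an assumption made for illustration; you may choose and justify a different one.

```python
import numpy as np
import pandas as pd

# Simulated values, for illustration only (not your dataset)
rng_demo = np.random.default_rng(2)
toy = pd.DataFrame({
    "ssb": rng_demo.gamma(shape=1.5, scale=0.7, size=300),
    "cvd_risk": rng_demo.uniform(0, 100, size=300),
})

# Categorise SSB into three equally sized groups (tertiles)
toy["ssb_tertile"] = pd.qcut(toy["ssb"], q=3, labels=["low", "medium", "high"])

# Squared term for a simple non-linear (quadratic) effect
toy["ssb_sq"] = toy["ssb"] ** 2

# Binary outcome: 1 if CVD risk is above the median, otherwise 0 (assumed cut-off)
toy["high_risk"] = (toy["cvd_risk"] > toy["cvd_risk"].median()).astype(int)

print(toy["ssb_tertile"].value_counts().sort_index())
print(toy["high_risk"].value_counts().sort_index())
```

Applied to `df`, columns like these could then be used in a model formula, for example with `C(ssb_tertile)` in place of `ssb`, or with `high_risk` as the outcome in `smf.logit(...)`.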

In [None]:
# ============================================================
# Example bonus model: interaction between SSB and obesity
# ============================================================

formula_interaction = "cvd_risk ~ ssb * obese + age + C(sex) + C(ses) + C(smoking)"
model_interaction = smf.ols(formula=formula_interaction, data=df).fit()

print("Interaction model summary (SSB * obese):")
display(model_interaction.summary())

print("\nIf you use this model, focus on the interaction term 'ssb:obese' in your Word document.")

# End of notebook

You have now completed all the code required for **Part B** of the assignment.

Please ensure that your **Word document** includes:

- Table 1 from Part B1 and your commentary.
- A discussion of the distributions and any transformations (Part B2).
- Regression results and your interpretation (Part B3).
- A DAG-informed adjustment discussion (Part B4).
- Bonus analysis and interpretation (if attempted).

Remember that the emphasis of the marking is on your **epidemiological reasoning and interpretation**, not on Python code.