# FB2NEP Workbook 3 – Data Collection and Cleaning

This workbook introduces:

- Data collection pipelines in nutritional epidemiology.
- Identification of implausible or inconsistent values.
- Variable types (continuous, ordinal, categorical, count).
- Handling missing data (MCAR, MAR, MNAR – introduction).
- Simple validation and visual checks.

Run the first two code cells to set up the repository and load the dataset.

In [None]:
import os
import sys
import runpy
import pathlib
import subprocess

REPO_URL = "https://github.com/ggkuhnle/fb2nep-epi.git"
REPO_NAME = "fb2nep-epi"

# 1. If we are in Colab and scripts/bootstrap.py is not present,
#    clone the repository and change into it.
if "google.colab" in sys.modules and not pathlib.Path("scripts/bootstrap.py").exists():
    root = pathlib.Path("/content")
    repo_dir = root / REPO_NAME

    if not repo_dir.exists():
        print(f"Cloning {REPO_URL} …")
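        # Clone explicitly into repo_dir so that os.chdir below succeeds
        # even if the current working directory is not /content.
        subprocess.run(["git", "clone", REPO_URL, str(repo_dir)], check=True)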

    os.chdir(repo_dir)
    print("Changed working directory to:", os.getcwd())

# 2. Now try to locate and run scripts/bootstrap.py
for p in ["scripts/bootstrap.py", "../scripts/bootstrap.py", "../../scripts/bootstrap.py"]:
    if pathlib.Path(p).exists():
        print(f"Bootstrapping via: {p}")
        runpy.run_path(p)
        break
else:
    print("⚠️ scripts/bootstrap.py not found – "
          "please check that the FB2NEP repository is available.")


In [None]:
import pandas as pd

# Load the main synthetic cohort used in all FB2NEP workbooks
df = pd.read_csv("data/synthetic/fb2nep.csv")

# Quick check: first rows
df.head()

## 1. Inspecting the synthetic cohort

The data frame `df` contains the synthetic FB2NEP cohort.
We inspect the first rows and the variable types.

In [None]:
df.head()

In [None]:
df.dtypes

## 2. Data collection pipelines

In nutritional epidemiology, data often come from several sources:

- Surveys (for example, dietary questionnaires).
- Laboratory measurements (for example, biomarkers).
- Registers and administrative data (for example, hospital admissions).

In this synthetic dataset these sources have already been merged into a single cohort table, but
the principle is the same: each source table carries a unique participant identifier, which serves as the key for linking records.
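
As a minimal sketch of such a linkage step, the example below merges two small, made-up source tables with `pandas.DataFrame.merge`. The table contents and the column names `participant_id` and `plasma_vitamin_c` are assumptions for illustration only; they are not taken from the FB2NEP dataset.

```python
import pandas as pd

# Hypothetical source tables; `participant_id` is an assumed key name.
survey = pd.DataFrame({
    "participant_id": [1, 2, 3, 4],
    "fruit_veg_g_d": [310.0, 120.0, 450.0, 280.0],
})
lab = pd.DataFrame({
    "participant_id": [1, 2, 4, 5],
    "plasma_vitamin_c": [52.0, 31.0, 47.0, 60.0],
})

# validate="one_to_one" raises a MergeError if an identifier is duplicated
# in either table; indicator=True records where each row came from.
linked = survey.merge(
    lab,
    on="participant_id",
    how="outer",
    validate="one_to_one",
    indicator=True,
)

# Rows that did not match reveal participants missing from one source.
unmatched = linked[linked["_merge"] != "both"]
print(linked["_merge"].value_counts())
print(unmatched)
```

An outer join keeps unmatched participants visible, so linkage problems show up as explicit rows rather than as silently dropped records.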

## 3. Variable types

We distinguish:

- **Continuous** variables such as `BMI`, `SBP`, or `energy_kcal`.
- **Categorical** variables such as `sex`, `SES_class`, or `smoking_status`.
- **Ordinal** variables such as `IMD_quintile` or ordered activity levels.
- **Count** variables (not prominent here, but common in epidemiology).

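The variable type determines which summaries and plots are appropriate: means, standard deviations, and histograms suit continuous variables, whereas frequency tables and bar charts suit categorical and ordinal ones.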

In [None]:
# Example: list a selection of key variables with their type
key_vars = [
    "age", "sex", "IMD_quintile", "SES_class", "smoking_status",
    "physical_activity", "BMI", "SBP", "energy_kcal", "fruit_veg_g_d",
    "red_meat_g_d", "CVD_incident", "Cancer_incident"
]
[(v, df[v].dtype) for v in key_vars if v in df.columns]
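
The pandas dtype reported for a column does not always match its statistical type; an ordinal variable stored as integers, for instance, looks continuous. The sketch below shows one possible way to declare nominal and ordinal variables explicitly. It uses a small made-up data frame, and the category order for `IMD_quintile` (1 to 5) is an assumption.

```python
import pandas as pd

# Hypothetical mini-cohort; values are made up for illustration.
toy = pd.DataFrame({
    "BMI": [22.4, 27.9, 31.2, 24.8],
    "sex": ["F", "M", "F", "M"],
    "IMD_quintile": [1, 5, 3, 2],
})

# Nominal categories: no inherent order.
toy["sex"] = toy["sex"].astype("category")

# Ordinal categories: declare the order explicitly so that sorting and
# comparisons respect it.
toy["IMD_quintile"] = pd.Categorical(
    toy["IMD_quintile"], categories=[1, 2, 3, 4, 5], ordered=True
)

print(toy.dtypes)
print(toy["BMI"].describe())                           # continuous: mean, SD, quantiles
print(toy["sex"].value_counts())                       # categorical: frequencies
print(toy["IMD_quintile"].value_counts().sort_index()) # ordinal: counts in category order
print(toy["IMD_quintile"].max())                       # valid because the order is declared
```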

## 4. Identifying implausible or inconsistent values

The synthetic data are constructed to be realistic, but it is good practice to check:

- Ranges of key variables.
- Obvious outliers.
- Internal inconsistencies.

In [None]:
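# Example: BMI distribution and simple range check
if "BMI" in df.columns:
    display(df["BMI"].describe())
    # Flag values outside a deliberately wide plausibility window (kg/m²)
    implausible_bmi = df[(df["BMI"] < 10) | (df["BMI"] > 70)]
    print(f"Rows with implausible BMI: {len(implausible_bmi)}")
    # A bare expression inside an `if` block is not echoed by Jupyter,
    # so the result has to be displayed explicitly.
    display(implausible_bmi.head())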

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

if "BMI" in df.columns:
    plt.figure(figsize=(6, 4))
    df["BMI"].hist(bins=30)
    plt.xlabel("BMI (kg/m²)")
    plt.ylabel("Number of participants")
    plt.title("Distribution of BMI – initial check")
    plt.tight_layout()
    plt.show()
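
The same range check can be applied to several variables at once. The following is a minimal sketch using a made-up data frame; the helper `flag_out_of_range` and the limits in `PLAUSIBLE_RANGES` are hypothetical and illustrative, not official cut-offs.

```python
import pandas as pd

# Hypothetical plausibility limits; they are illustrative, not official
# cut-offs, and should be adapted to the study population.
PLAUSIBLE_RANGES = {
    "BMI": (10, 70),             # kg/m²
    "SBP": (60, 260),            # mmHg
    "energy_kcal": (500, 6000),  # kcal/day
}

def flag_out_of_range(data, ranges):
    """Return a boolean DataFrame: True where a value lies outside its range.

    Missing values are not flagged (comparisons with NaN are False).
    """
    flags = pd.DataFrame(index=data.index)
    for column, (low, high) in ranges.items():
        if column in data.columns:
            flags[column] = (data[column] < low) | (data[column] > high)
    return flags

# Made-up values, including one implausible entry per variable.
toy = pd.DataFrame({
    "BMI": [22.4, 8.0, 31.2, None],
    "SBP": [120, 135, 300, 118],
    "energy_kcal": [2100, 1800, 2500, 9000],
})

flags = flag_out_of_range(toy, PLAUSIBLE_RANGES)
print(flags.sum())             # implausible values per variable
print(toy[flags.any(axis=1)])  # rows with at least one flag
```

Flagging rather than deleting keeps the decision about each implausible value (correct, set to missing, or exclude) separate from its detection.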

## 5. Missing data: MCAR, MAR, MNAR (overview)

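The synthetic dataset includes three types of missingness:

- MCAR (missing completely at random) on some biomarker and diet variables: the probability of a value being missing is unrelated to any variable.
- MAR (missing at random) depending on age and deprivation: the probability of a value being missing depends only on observed variables.
- Small MNAR (missing not at random) components for alcohol and BMI: the probability of a value being missing depends on the unobserved value itself.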

We begin by computing the proportion missing in each variable.

In [None]:
missing_fraction = df.isna().mean().sort_values(ascending=False)
missing_fraction.head(20)

In [None]:
plt.figure(figsize=(10, 4))
missing_fraction.head(25).plot(kind="bar")
plt.ylabel("Proportion missing")
plt.title("Proportion of missing values (top 25 variables)")
plt.tight_layout()
plt.show()

In [None]:
# Example: does missing BMI depend on age?
if {"BMI", "age"}.issubset(df.columns):
    df["BMI_missing"] = df["BMI"].isna()
    display(df.groupby("BMI_missing")["age"].describe())
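
To make the three mechanisms concrete, the sketch below simulates them in a made-up cohort. All numbers (sample size, missingness probabilities, the age and BMI relationship) are assumptions chosen for illustration and do not describe how the FB2NEP data were generated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 20_000

# Hypothetical toy cohort: BMI rises slightly with age.
age = rng.uniform(40, 80, size=n)
bmi = 22 + 0.1 * (age - 40) + rng.normal(0, 4, size=n)
toy = pd.DataFrame({"age": age, "BMI_true": bmi})

# MCAR: every participant has the same 20 % chance of a missing value.
mcar = rng.random(n) < 0.20

# MAR: the chance of a missing value depends on age (observed), not on BMI.
p_mar = 0.05 + 0.30 * (age - 40) / 40
mar = rng.random(n) < p_mar

# MNAR: the chance of a missing value depends on the unobserved BMI itself.
p_mnar = np.where(bmi > 30, 0.50, 0.10)
mnar = rng.random(n) < p_mnar

for label, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    observed = toy.loc[~mask]
    missing = toy.loc[mask]
    print(
        f"{label}: {mask.mean():.1%} missing | "
        f"mean age observed {observed['age'].mean():.1f} vs "
        f"missing {missing['age'].mean():.1f} | "
        f"mean observed BMI {observed['BMI_true'].mean():.2f} "
        f"(true {toy['BMI_true'].mean():.2f})"
    )
```

Under MAR, the age difference between participants with and without a value is visible in the observed data, which is what the `groupby` check above looks for. Under MNAR, the bias in the observed BMI cannot be detected from the observed data alone; it is visible here only because the simulation retains the true values.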

## 6. Simple validation checks

- Check distributions of key variables by sex or SES.
- Check consistency (for example, menopausal status only for women).

These checks inform later modelling decisions, for example which observations to exclude and which variables need explicit handling of missing values.
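
Both checks can be sketched in a few lines of pandas. The example uses a made-up data frame; the column name `menopausal_status` and the sex coding `"F"`/`"M"` are assumptions and may differ from the FB2NEP dataset.

```python
import pandas as pd

# Hypothetical mini-cohort; `menopausal_status` is an assumed column name
# and the values are made up for illustration.
toy = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "F", "M"],
    "BMI": [23.1, 29.4, 27.0, 31.5, 25.2, 24.3],
    "menopausal_status": ["post", "pre", None, "post", None, None],
})

# 1. Distribution of a key variable by sex.
print(toy.groupby("sex")["BMI"].describe())

# 2. Consistency: menopausal status should be recorded only for women.
print(pd.crosstab(toy["sex"], toy["menopausal_status"].fillna("not recorded")))

# Rows where a man has a recorded menopausal status are inconsistent.
inconsistent = toy[(toy["sex"] == "M") & toy["menopausal_status"].notna()]
print(f"Inconsistent rows: {len(inconsistent)}")
print(inconsistent)
```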