# FB2NEP Workbook 3 – Data Collection and Cleaning

This workbook introduces:

- Data collection pipelines in nutritional epidemiology.
- Design, administration and analysis of questionnaires.
- Use of registers and public datasets (for example NHANES, NDNS).
- Identification of implausible or inconsistent values.
- Variable types (continuous, ordinal, categorical, count).
- Handling missing data (MCAR, MAR, MNAR – introduction).
- Simple validation and visual checks.

Run the first two code cells to set up the repository and load the dataset.

In [None]:
# FB2NEP bootstrap cell – use in *all* workbooks
#
# This cell:
# - Locates scripts/bootstrap.py starting from this notebook.
# - Executes it with runpy and explicitly calls init().
# - After this, you have in the *notebook* namespace:
#       df          – main FB2NEP synthetic dataset
#       CTX         – context object with repo_root etc.
#       REPO_ROOT   – path to the repository root
#       CSV_REL     – relative path to the CSV
#       IN_COLAB    – True/False

import pathlib
import runpy

bootstrap_candidates = [
    "scripts/bootstrap.py",
    "../scripts/bootstrap.py",
    "../../scripts/bootstrap.py",
]

bootstrap_ns = None

for rel in bootstrap_candidates:
    p = pathlib.Path(rel)
    if p.exists():
        print(f"Loading bootstrap from: {p}")
        # Execute bootstrap.py and capture its global namespace as a dict
        bootstrap_ns = runpy.run_path(str(p))
        break
else:
    raise FileNotFoundError(
        "Could not find scripts/bootstrap.py. "
        "Please check that you are running this notebook inside fb2nep-epi."
    )

if "init" not in bootstrap_ns:
    raise RuntimeError("bootstrap.py does not define an init() function.")

# Call init() from bootstrap.py – this does all the heavy lifting
df, CTX = bootstrap_ns["init"]()

# Expose a few convenience names in the notebook
REPO_ROOT = CTX.repo_root
CSV_REL = CTX.csv_rel
IN_COLAB = CTX.in_colab

print("Repository root:", REPO_ROOT)
print("Main dataset:", CSV_REL)
print("df shape:", df.shape)
print("IN_COLAB:", IN_COLAB)


In [None]:
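import pathlib

import pandas as pd

# The bootstrap cell above has already loaded the main synthetic cohort into `df`.
# This cell reloads it explicitly. The path is built from REPO_ROOT and CSV_REL
# so that the call does not depend on the current working directory.
df = pd.read_csv(pathlib.Path(REPO_ROOT) / CSV_REL)

# Quick check: first rows
df.head()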

## 1. Inspecting the synthetic cohort

The data frame `df` contains the synthetic FB2NEP cohort.

- Each row represents one participant.
- Each column represents a variable (for example age, BMI, blood pressure, diet).

We inspect the first rows and the variable types.

In [None]:
# Show the first 5 rows
df.head()

In [None]:
# Show the data types of all columns
df.dtypes

## 2. Data collection pipelines

In nutritional epidemiology, data usually come from several sources that are linked together:

- **Questionnaires and interviews** (for example, food frequency questionnaires, 24-hour recalls, lifestyle questionnaires).
- **Laboratory measurements** (for example, blood biomarkers, urinary biomarkers).
- **Clinical examinations** (for example, blood pressure, anthropometry).
- **Registers and administrative data** (for example, hospital episode statistics, cancer registries, mortality data).
- **Public datasets and surveys** (for example, NHANES in the United States, NDNS in the United Kingdom).

Typical steps in a data collection pipeline are:

1. **Sampling and recruitment** of participants.
2. **Baseline data collection**:
   - Questionnaires completed on paper, online, or via interview.
   - Clinical and anthropometric measurements.
   - Biospecimen collection for later laboratory analysis.
3. **Follow-up data collection**:
   - Repeat questionnaires or clinic visits.
   - Linkage to health registers to obtain information on disease outcomes.
4. **Data entry, coding and merging**:
   - Scanning or manual entry of questionnaires.
   - Coding of free-text responses.
   - Merging laboratory, questionnaire and register data using a unique participant identifier.

The synthetic FB2NEP dataset represents a **merged cohort** where these steps have already taken place. The underlying logic, however, is the same as in real-world studies such as NHANES or NDNS.
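
The merging step can be sketched with two toy tables. This is a minimal illustration, not the FB2NEP pipeline: the column names `participant_id`, `fruit_veg_g_d` and `vitamin_c_umol_l` and all values are assumptions made for the example. It uses the `validate` and `indicator` options of `DataFrame.merge` to catch duplicated identifiers and to see which participants appear in only one source.

```python
import pandas as pd

# Toy tables; column names and values are hypothetical.
questionnaire = pd.DataFrame(
    {"participant_id": [1, 2, 3], "fruit_veg_g_d": [310.0, 150.0, 420.0]}
)
laboratory = pd.DataFrame(
    {"participant_id": [1, 2, 4], "vitamin_c_umol_l": [55.0, 31.0, 60.0]}
)

# validate="one_to_one" raises an error if an identifier is duplicated on either side.
# indicator=True adds a "_merge" column recording the source of each row.
merged = questionnaire.merge(
    laboratory,
    on="participant_id",
    how="outer",
    validate="one_to_one",
    indicator=True,
)

print(merged)
print(merged["_merge"].value_counts())
```

Rows marked `left_only` or `right_only` are participants missing from one source. They are worth reviewing before any analysis, because a silent inner join would simply drop them.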

### 2.1 Questionnaires: design, administration, analysis

Questionnaires are a central tool in nutritional epidemiology. Important design choices include:

- **Mode of administration**:
  - Self-administered (paper or online).
  - Interviewer-administered (face-to-face or telephone).
- **Question format**:
  - Open questions (free text).
  - Closed questions with predefined response categories.
  - Likert-type scales (for example, “strongly agree” to “strongly disagree”).
- **Recall period**:
  - Short (24-hour recall).
  - Medium (last week).
  - Long (usual intake over the last year, as in many FFQs).

For analysis, questionnaire responses must be:

- **Coded** into numeric or categorical variables (for example, 1–5 for a Likert scale).
- **Mapped** to foods and nutrients using a food composition database.
- **Checked** for internal consistency and plausibility (for example, not more than 24 hours of eating in a day).

In large surveys such as NHANES or NDNS, these steps are documented in detail. For FB2NEP we use a synthetic dataset that mimics the structure and typical issues of such studies.
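
The coding and mapping steps might look like the following sketch. The Likert labels, the foods, the portion sizes and the energy values per 100 g are all invented for illustration; a real study would take them from the questionnaire documentation and a food composition database.

```python
import pandas as pd

# 1. Coding: map Likert labels to the numbers 1-5 (labels are hypothetical).
likert_codes = {
    "strongly disagree": 1,
    "disagree": 2,
    "neither": 3,
    "agree": 4,
    "strongly agree": 5,
}
responses = pd.Series(["agree", "strongly agree", "Agree ", "don't know"])
# Normalise whitespace and case first; labels not in the dictionary become NaN.
coded = responses.str.strip().str.lower().map(likert_codes)
print(coded)

# 2. Mapping: frequency x portion x composition gives daily intake.
ffq = pd.DataFrame(
    {
        "participant_id": [1, 1, 2],
        "food": ["apple", "bread", "apple"],
        "times_per_day": [1.0, 2.0, 0.5],
        "portion_g": [100.0, 40.0, 100.0],
    }
)
composition = pd.DataFrame(
    {"food": ["apple", "bread"], "kcal_per_100g": [50.0, 250.0]}  # invented values
)

intake = ffq.merge(composition, on="food", how="left", validate="many_to_one")
intake["kcal_d"] = (
    intake["times_per_day"] * intake["portion_g"] * intake["kcal_per_100g"] / 100
)
energy_kcal = intake.groupby("participant_id")["kcal_d"].sum()
print(energy_kcal)
```

Unmapped responses such as "don't know" become missing values rather than being forced into a category, so the decision about how to treat them stays visible.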

## 3. Variable types

We distinguish several common variable types:

- **Continuous** variables: can take many numeric values on a scale.
  - Examples: `BMI`, `SBP`, `energy_kcal`, `fruit_veg_g_d`.
- **Categorical (nominal)** variables: categories without inherent order.
  - Examples: `sex`, `SES_class`, `smoking_status`.
- **Ordinal** variables: categories *with* a meaningful order, but without fixed distances.
  - Examples: `IMD_quintile` (1 = most deprived, 5 = least deprived), `physical_activity` (low / moderate / high).
- **Count** variables: non-negative integers (0, 1, 2, …), often event counts.
  - Example: number of GP visits in a year (not present explicitly in this dataset).

The distinction between **ordinal** and **nominal** categorical variables is important:

- Ordinal variables carry information about **ranking** (for example, high > moderate > low).
- Nominal variables do not (for example, “never smoker” is not greater or smaller than “former smoker”).

This affects how we summarise and model the variables.

In [None]:
# Example: list a selection of key variables with their type
key_vars = [
    "age", "sex", "IMD_quintile", "SES_class", "smoking_status",
    "physical_activity", "BMI", "SBP", "energy_kcal", "fruit_veg_g_d",
    "red_meat_g_d", "CVD_incident", "Cancer_incident"
]

[(v, df[v].dtype) for v in key_vars if v in df.columns]

In [None]:
# Example: treat IMD_quintile and physical_activity as ordered categorical variables
import pandas as pd

if "IMD_quintile" in df.columns:
    df["IMD_quintile_ord"] = pd.Categorical(
        df["IMD_quintile"],
        categories=[1, 2, 3, 4, 5],
        ordered=True
    )

if "physical_activity" in df.columns:
    df["physical_activity_ord"] = pd.Categorical(
        df["physical_activity"],
        categories=["low", "moderate", "high"],
        ordered=True
    )

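# Show only the columns that exist, in line with the guards above
ordered_cols = ["IMD_quintile", "IMD_quintile_ord", "physical_activity", "physical_activity_ord"]
df[[c for c in ordered_cols if c in df.columns]].head()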
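
What the ordering makes possible can be seen on a small self-contained example. The values below are invented; only the category labels follow the ones used above. The sketch contrasts order-based operations on an ordered categorical with the frequency-based summaries that remain meaningful for a nominal variable.

```python
import pandas as pd

activity = pd.Series(
    pd.Categorical(
        ["low", "high", "moderate", "high", "low"],
        categories=["low", "moderate", "high"],
        ordered=True,
    )
)
smoking = pd.Series(
    ["never", "former", "never", "current", "never"], dtype="category"
)

# Ordered categorical: comparisons and min/max respect the declared order.
at_least_moderate = activity >= "moderate"
print(at_least_moderate.sum())
print(activity.min(), activity.max())

# A median category can be read from the integer codes
# (this toy series has an odd number of values and no missing data).
median_code = int(activity.cat.codes.median())
print(activity.cat.categories[median_code])

# Nominal categorical: only frequencies and the mode are meaningful.
print(smoking.value_counts())
print(smoking.mode()[0])
```

The same comparison on an unordered categorical raises an error, which is a useful safeguard against ranking categories that have no rank.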

## 4. Identifying implausible or inconsistent values

Even carefully collected data may contain values that are *implausible* or *wrong*.
Examples include:

- Anthropometric values outside physiological limits (for example, BMI < 10 kg/m²).
- Blood pressure values that are extremely low or high.
- Energy intakes that are incompatible with life over the long term.
- Men who are recorded as being “post-menopausal”.

In practice we define **simple rules** to flag such values for review, and in some cases to exclude them from analysis.

In [None]:
# Example 1: BMI range check
#
# These cut-offs are illustrative. In a real study they should be
# chosen with clinical input and clear documentation.

if "BMI" in df.columns:
    print("Summary of BMI:")
    display(df["BMI"].describe())

    implausible_bmi = df[(df["BMI"] < 10) | (df["BMI"] > 70)]
    print(f"Number of participants with BMI < 10 or > 70: {len(implausible_bmi)}")
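    # A bare expression inside an if block is not shown by the notebook, so display it explicitly
    display(implausible_bmi.head())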

In [None]:
# Example 2: SBP (systolic blood pressure) range check

if "SBP" in df.columns:
    print("Summary of SBP:")
    display(df["SBP"].describe())

    implausible_sbp = df[(df["SBP"] < 70) | (df["SBP"] > 260)]
    print(f"Number of participants with SBP < 70 or > 260 mmHg: {len(implausible_sbp)}")
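    # A bare expression inside an if block is not shown by the notebook, so display it explicitly
    display(implausible_sbp.head())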

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

if "BMI" in df.columns:
    plt.figure(figsize=(6, 4))
    df["BMI"].hist(bins=30)
    plt.xlabel("BMI (kg/m²)")
    plt.ylabel("Number of participants")
    plt.title("Distribution of BMI – initial check")
    plt.tight_layout()
    plt.show()
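
When several variables need range checks, the rules can be kept in one table and applied by one function. The sketch below is one possible arrangement, not a prescribed FB2NEP helper: `RANGE_RULES` and `flag_out_of_range` are hypothetical names, the limits repeat the illustrative cut-offs used above, and the data are invented.

```python
import numpy as np
import pandas as pd

# Illustrative limits only; real limits need clinical input and documentation.
RANGE_RULES = {"BMI": (10, 70), "SBP": (70, 260)}


def flag_out_of_range(data, rules):
    """Return a boolean DataFrame that is True where a value is outside its range."""
    flags = pd.DataFrame(index=data.index)
    for column, (low, high) in rules.items():
        if column in data.columns:
            # Comparisons with NaN are False, so missing values are not flagged here.
            flags[f"{column}_implausible"] = (data[column] < low) | (data[column] > high)
    return flags


toy = pd.DataFrame(
    {"BMI": [22.5, 8.0, np.nan, 31.0], "SBP": [120.0, 135.0, 300.0, 110.0]}
)
flags = flag_out_of_range(toy, RANGE_RULES)
print(flags)
print(flags.sum())
```

Flagging instead of deleting keeps the original values available for review, and the rule table doubles as documentation of the cut-offs that were applied.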

### 4.1 Energy intake and the Goldberg cut-off

A common question in nutritional epidemiology is whether reported **energy intake** is
compatible with the amount of energy a person is likely to expend.

The **Goldberg cut-off** is one approach:

- Estimate basal metabolic rate (BMR) from sex, age and body size.
- Estimate total energy expenditure (TEE) as BMR multiplied by a physical activity level.
- Compute the ratio: reported energy intake / estimated TEE (for a weight-stable person, TEE approximates the energy requirement).
- If this ratio is far below 1 (for example, well below 0.8), the report may indicate **under-reporting**.

In practice, more detailed formulas and cut-offs are used. Recent work (for example,
by Speakman and colleagues) has highlighted limitations of the original Goldberg approach,
including the risk of incorrectly excluding valid data and the sensitivity to assumptions
about physical activity.

For FB2NEP we focus on the **principle**: energy intakes can be evaluated in relation to
expected requirements, but exclusion rules must be:

- Transparent and reproducible.
- Justified in the protocol.
- Used with caution, as they can introduce bias if misapplied.
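
The principle can be sketched in a few lines. This is a deliberately simplified illustration and not the Goldberg method itself. BMR is estimated with the Mifflin-St Jeor equation, chosen here only because it is compact; work on the Goldberg cut-off typically relies on other BMR equations. The physical activity level of 1.55 is an assumption, the threshold of 0.8 is the illustrative figure from the text, and the columns `weight_kg`, `height_cm` and the sex code `"F"` are hypothetical.

```python
import numpy as np
import pandas as pd

ASSUMED_PAL = 1.55      # assumed physical activity level
RATIO_THRESHOLD = 0.8   # illustrative threshold, not a validated cut-off

toy = pd.DataFrame(
    {
        "sex": ["M", "F", "F"],
        "weight_kg": [80.0, 65.0, 70.0],
        "height_cm": [180.0, 165.0, 160.0],
        "age": [40, 55, 30],
        "energy_kcal": [2600.0, 1000.0, 2100.0],
    }
)

# Mifflin-St Jeor: 10*weight + 6.25*height - 5*age, then +5 (men) or -161 (women).
sex_term = np.where(toy["sex"] == "M", 5, -161)
toy["bmr_kcal"] = (
    10 * toy["weight_kg"] + 6.25 * toy["height_cm"] - 5 * toy["age"] + sex_term
)

# Estimated total energy expenditure and the reported-to-expected ratio.
toy["tee_kcal"] = toy["bmr_kcal"] * ASSUMED_PAL
toy["ei_tee_ratio"] = toy["energy_kcal"] / toy["tee_kcal"]
toy["possible_under_report"] = toy["ei_tee_ratio"] < RATIO_THRESHOLD

print(toy[["sex", "energy_kcal", "bmr_kcal", "tee_kcal", "ei_tee_ratio", "possible_under_report"]])
```

Changing `ASSUMED_PAL` shifts every ratio, which illustrates the sensitivity to physical activity assumptions mentioned above. The flag marks records for review; it is not in itself a reason for exclusion.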

In [None]:
# Simple exploration of energy intake

if "energy_kcal" in df.columns:
    print("Summary of reported energy intake (kcal/d):")
    display(df["energy_kcal"].describe())

    plt.figure(figsize=(6, 4))
    df["energy_kcal"].hist(bins=30)
    plt.xlabel("Energy intake (kcal/day)")
    plt.ylabel("Number of participants")
    plt.title("Distribution of reported energy intake")
    plt.tight_layout()
    plt.show()

### 4.2 Rare categories and “Prefer not to say”

Questionnaires often include categories such as **“Prefer not to say”** or very rare
responses. These raise practical questions:

- Should they be combined with another category?
- Should they be treated as missing data?
- Do they indicate a problem with the question (for example, perceived sensitivity)?

Before making decisions, we usually examine how frequent such values are.

In [None]:
# Simple frequency tables for key categorical variables

for col in ["sex", "SES_class", "smoking_status", "physical_activity"]:
    if col in df.columns:
        print(f"\nValue counts for {col}:")
        print(df[col].value_counts(dropna=False))
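
Once the frequencies are known, two common options can be tried side by side. The sketch below uses invented responses; the 5 % threshold, the label `"other"` and the category `"occasional"` are assumptions for illustration, and whether either recoding is appropriate depends on the research question.

```python
import pandas as pd

MIN_SHARE = 0.05  # assumed threshold below which a category counts as rare

responses = pd.Series(
    ["never"] * 60
    + ["former"] * 25
    + ["current"] * 12
    + ["Prefer not to say"] * 2
    + ["occasional"] * 1
)

# Option 1: treat "Prefer not to say" as missing.
cleaned = responses.mask(responses == "Prefer not to say")

# Option 2: collapse categories below the threshold into "other".
# value_counts ignores missing values, so shares refer to valid responses only.
shares = cleaned.value_counts(normalize=True)
rare_categories = shares[shares < MIN_SHARE].index
collapsed = cleaned.mask(cleaned.isin(rare_categories), "other")

print(collapsed.value_counts(dropna=False))
```

Keeping the original column alongside the recoded one makes the decision reversible and easy to report.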

## 5. Missing data: MCAR, MAR, MNAR (overview)

Almost all real datasets contain missing values. Three concepts are important:

- **MCAR (Missing Completely At Random)**:
  - The probability of missingness is unrelated to observed or unobserved data.
  - Example: a random sample of blood tubes is lost in the post.
- **MAR (Missing At Random)**:
  - The probability of missingness depends only on observed variables.
  - Example: older participants are less likely to provide a urine sample, but age is recorded.
- **MNAR (Missing Not At Random)**:
  - The probability of missingness depends on the unobserved values themselves, even after accounting for the observed data.
  - Example: participants with very high alcohol intake are less likely to report their intake.
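
A small simulation can make the three mechanisms concrete. Everything in this sketch is invented: a variable `value` that rises with `age`, and three missingness probabilities chosen only to show the pattern. It compares the mean of the observed values under each mechanism with the mean of the complete data, and then shows that under MAR an analysis that uses the observed variable (here a simple stratification by age band) largely removes the bias.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 20_000

age = rng.uniform(40, 80, n)
value = 100 + 0.5 * (age - 60) + rng.normal(0, 10, n)
sim = pd.DataFrame({"age": age, "value": value})

# Probability that `value` is missing under each mechanism.
p_mcar = np.full(n, 0.3)                                  # same for everyone
p_mar = 0.1 + 0.4 * (sim["age"] - 40) / 40                # depends on observed age
p_mnar = 1 / (1 + np.exp(-(sim["value"] - 100) / 10))     # depends on value itself

full_mean = sim["value"].mean()
observed_means = {}
for name, p_missing in [("MCAR", p_mcar), ("MAR", p_mar), ("MNAR", p_mnar)]:
    observed = rng.random(n) > p_missing
    sim[f"observed_{name}"] = observed
    observed_means[name] = sim.loc[observed, "value"].mean()

print(f"Complete data mean: {full_mean:.2f}")
for name, mean in observed_means.items():
    print(f"Observed mean under {name}: {mean:.2f}")

# Under MAR, stratify by age band and weight by the band shares in the full sample.
sim["age_band"] = ((sim["age"] - 40) // 5).astype(int)
band_means = sim[sim["observed_MAR"]].groupby("age_band")["value"].mean()
band_shares = sim["age_band"].value_counts(normalize=True)
mar_adjusted = (band_means * band_shares).sum()
print(f"MAR, age-adjusted mean: {mar_adjusted:.2f}")
```

No comparable adjustment is available for the MNAR case, because the variable that drives the missingness is the one that is missing.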

In the synthetic dataset we have:

- MCAR-type missingness on some biomarker and diet variables.
- MAR-type missingness depending on age and deprivation.
- Small MNAR components for alcohol and BMI.

We start by computing the proportion missing in each variable.

In [None]:
# Proportion of missing values in each variable
missing_fraction = df.isna().mean().sort_values(ascending=False)
missing_fraction.head(20)

In [None]:
plt.figure(figsize=(10, 4))
missing_fraction.head(25).plot(kind="bar")
plt.ylabel("Proportion missing")
plt.title("Proportion of missing values (top 25 variables)")
plt.tight_layout()
plt.show()

In [None]:
# Example: does missing BMI depend on age?

if {"BMI", "age"}.issubset(df.columns):
    df["BMI_missing"] = df["BMI"].isna()
    print("Age distribution by BMI missingness:")
    display(df.groupby("BMI_missing")["age"].describe())

In [None]:
# Example: does missing fruit_veg_g_d depend on IMD_quintile? (MAR pattern)

if {"fruit_veg_g_d", "IMD_quintile"}.issubset(df.columns):
    df["FV_missing"] = df["fruit_veg_g_d"].isna()
    tab = pd.crosstab(df["IMD_quintile"], df["FV_missing"], normalize="index")
    print("Proportion missing fruit_veg_g_d by IMD_quintile:")
    display(tab)

## 6. Simple validation checks

Finally we perform a few simple validation checks:

- Are there any men with a non-"NA" menopausal status?
- Do event dates look plausible relative to baseline?
- Do key variables have reasonable distributions by sex or SES?

These checks will inform later modelling decisions.

In [None]:
# Check that menopausal_status is only set for women

if {"sex", "menopausal_status"}.issubset(df.columns):
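    # pandas reads the literal string "NA" as a missing value by default,
    # so treat both NaN and "NA" as "not applicable".
    status = df["menopausal_status"]
    inconsistent = df[(df["sex"] == "M") & status.notna() & (status != "NA")]
    print(f"Number of men with non-NA menopausal_status: {len(inconsistent)}")
    display(inconsistent.head())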

In [None]:
# Check SBP distribution by sex

if {"SBP", "sex"}.issubset(df.columns):
    print("SBP by sex:")
    display(df.groupby("sex")["SBP"].describe())
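
The date check listed above could be sketched as follows. The column names `baseline_date` and `event_date`, the study end date and all values are assumptions for illustration; they are not taken from the FB2NEP dataset.

```python
import pandas as pd

STUDY_END = pd.Timestamp("2024-12-31")  # assumed end of follow-up

toy = pd.DataFrame(
    {
        "participant_id": [1, 2, 3, 4],
        "baseline_date": ["2010-03-01", "2011-06-15", "2010-09-30", "2012-01-10"],
        "event_date": ["2015-07-20", "2009-12-01", None, "2040-01-01"],
    }
)

# Parse dates; unparseable or absent entries become NaT (missing).
for col in ["baseline_date", "event_date"]:
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

# Comparisons with NaT are False, so participants without an event are not flagged.
before_baseline = toy["event_date"] < toy["baseline_date"]
after_study_end = toy["event_date"] > STUDY_END
toy["date_problem"] = before_baseline | after_study_end

# Follow-up time in years, useful for spotting negative or very long intervals.
toy["followup_years"] = (toy["event_date"] - toy["baseline_date"]).dt.days / 365.25

print(toy)
```

An event recorded before baseline may be a prevalent case instead of an error, so flagged rows call for review against the study protocol.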