# FB2NEP Workbook 2 – Populations, Samples, and Representativeness

**Date:** 09 November 2025

This workbook introduces core ideas in sampling for nutritional epidemiology:

- The distinction between the **general population**, a **study's target population**, and the realised **sample**.
- The notions of **sampling frames**, **representativeness**, and **sampling error** versus **bias**.
- How a **probability sample** such as NHANES can be used as a reference for population structure.
- How to compare simple descriptive statistics between a **reference population** (NHANES) and a **study cohort** (the FB2NEP synthetic dataset).
- How **sample size** affects the precision of estimates of central tendency (here: body mass index, BMI).

We use two datasets:

- **NHANES (National Health and Nutrition Examination Survey)**: a large, nationally representative survey of the non-institutionalised United States population. Here we download a small subset of NHANES 2017–2018 directly from the CDC website via a helper script.
- **FB2NEP synthetic cohort**: the synthetic dataset that we use throughout the module for regression and causal reasoning. Here it plays the role of a *study sample* that we compare to NHANES.

> Hippo cameo (single, pedagogical): later, imagine a very diligent hippo who always volunteers for nutrition surveys. This will help us think about who actually ends up in a sample.


In [None]:
# Bootstrap the FB2NEP repository (works in Colab and locally).
#
# This cell:
# - Clones the fb2nep-epi repository in Google Colab if needed.
# - Locates and runs scripts/bootstrap.py.
# - Ensures that data/ and notebooks/ are available.

import os
import sys
import runpy
import pathlib
import subprocess

REPO_URL = "https://github.com/ggkuhnle/fb2nep-epi.git"
REPO_NAME = "fb2nep-epi"

# 1. If we are in Colab and scripts/bootstrap.py is not present,
#    clone the repository and change into it.
if "google.colab" in sys.modules and not pathlib.Path("scripts/bootstrap.py").exists():
    root = pathlib.Path("/content")
    repo_dir = root / REPO_NAME

    if not repo_dir.exists():
        print(f"Cloning {REPO_URL} …")
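        # Clone explicitly into repo_dir so the chdir below works
        # even if the current working directory is not /content.
        subprocess.run(["git", "clone", REPO_URL, str(repo_dir)], check=True)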

    os.chdir(repo_dir)
    print("Changed working directory to:", os.getcwd())

# 2. Now try to locate and run scripts/bootstrap.py
for p in ["scripts/bootstrap.py", "../scripts/bootstrap.py", "../../scripts/bootstrap.py"]:
    if pathlib.Path(p).exists():
        print(f"Bootstrapping via: {p}")
        runpy.run_path(p)
        break
else:
    print("⚠️ scripts/bootstrap.py not found – "
          "please check that the FB2NEP repository is available.")


In [None]:
# Imports and data loading.
#
# This cell:
# - Imports core libraries (NumPy, pandas, Matplotlib).
# - Imports the NHANES helper from scripts/fetch_nhanes_demo.py.
# - Loads the NHANES subset and the FB2NEP synthetic cohort.
# - Sets a fixed random seed so that sampling results are reproducible.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scripts.fetch_nhanes_demo import load_nhanes_demo

# Fixed random seed for all simulations in this workbook.
RANDOM_SEED = 11088
rng = np.random.default_rng(RANDOM_SEED)

# Load processed NHANES subset (downloaded and tidied by helper script).
nhanes = load_nhanes_demo(cache=True)

# Load the FB2NEP synthetic cohort used throughout the module.
df_fb2nep = pd.read_csv("data/synthetic/fb2nep.csv")

print("NHANES demo shape:", nhanes.shape)
print("FB2NEP synthetic cohort shape:", df_fb2nep.shape)

nhanes.head()

## 1. Concepts and definitions

### 1.1 General population, target population, and sample

- **General population**: The entire group about which we ultimately wish to draw conclusions. Examples: “all adults living in the United States”, or “all adults aged 40 years and older in the United Kingdom”.

- **Target population (study population)**: The subset of the general population that a particular study *intends* to represent. Examples: “all adults aged 40–79 years registered with a general practitioner in England”, or “all post-menopausal women without prior cardiovascular disease at baseline”. This definition is conceptual and exists **before** sampling.

- **Sampling frame**: The operational list or mechanism used to select individuals. Examples: general practice registers, electoral rolls, health insurance lists, employee registers. Individuals who are not in the sampling frame cannot be selected, even if they belong to the target population.

- **Sample**: The actual set of individuals who are recruited and provide data. The sample may differ from the target population because of non-response, exclusion criteria, and practical constraints.

Key questions for any study:

1. **Who did we intend to study?** (target population)
2. **Who did we actually study?** (realised sample)
3. **How different are these groups** with respect to variables that matter for our research question?


### 1.2 Representativeness, sampling error, and bias

A sample is **representative** of a population when the distribution of key characteristics in the sample matches that of the population of interest. Common characteristics:

- Sex.
- Age distribution.
- Socioeconomic position (for example, education, deprivation indices).
- Ethnicity (if available).

Representativeness is mainly about **external validity**: how far we can generalise study findings beyond those who took part.

It is useful to distinguish two sources of difference between sample and population:

- **Sampling error**: Random variation in estimates because we observe only a finite sample rather than the whole population. Sampling error becomes smaller as the sample size increases.
- **Systematic bias**: Systematic differences between sample and population (for example, non-participation of people with poor health or low income) that do **not** disappear when the sample size increases.
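
The difference between the two can be made concrete with a small simulation. The sketch below is illustrative only: the population is simulated (not NHANES or FB2NEP), and the selection mechanism, in which people with higher values are less willing to take part, is an assumption made purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(2024)  # arbitrary seed for this sketch only

# Hypothetical population of 20,000 BMI-like values (simulated, not real data).
population = rng.normal(loc=27.0, scale=5.0, size=20_000)
true_mean = population.mean()

# Assumed selection mechanism: the higher the value, the lower the chance
# that a person is willing to take part at all.
p_take_part = 1.0 / (1.0 + np.exp((population - 27.0) / 5.0))
volunteers = population[rng.random(population.size) < p_take_part]


def summarise_errors(pool, n, n_rep=200):
    """Mean and SD of (sample mean - true mean) over n_rep samples of size n."""
    errors = np.array([
        rng.choice(pool, size=n, replace=False).mean() - true_mean
        for _ in range(n_rep)
    ])
    return errors.mean(), errors.std()


results = {}
for n in (50, 500, 5000):
    results[n] = {
        "random": summarise_errors(population, n),      # sampling error only
        "volunteers": summarise_errors(volunteers, n),  # sampling error + selection
    }
    (r_mean, r_sd), (v_mean, v_sd) = results[n]["random"], results[n]["volunteers"]
    print(f"n = {n:4d} | random sample: error {r_mean:+.2f} (SD {r_sd:.2f}) "
          f"| volunteers only: error {v_mean:+.2f} (SD {v_sd:.2f})")
```

In both columns the SD of the error shrinks as *n* grows. The average error of the random samples stays near zero, whereas the average error of the volunteer-only samples stays clearly negative at every sample size: a larger sample makes the biased estimate more precise, not more correct.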

In this workbook we will:

- Use **NHANES** as a reference survey for the general population.
- Treat the **FB2NEP synthetic cohort** as a “study sample” and compare it to NHANES.
- Use repeated sampling from NHANES to illustrate **sampling error**.


### 1.3 NHANES and the FB2NEP synthetic cohort

- **NHANES** uses a complex, multistage probability sampling design. Once its survey weights are applied, it gives an approximately nationally representative picture of the civilian, non-institutionalised United States population. For this workbook we focus on adults and use a limited set of variables: age, sex, race/ethnicity, education, and BMI.

- The **FB2NEP synthetic cohort** is a simulated cohort used for teaching. It mimics a longitudinal study of adults aged 40 years and older with follow-up for cardiovascular disease and cancer. It is not designed to be representative of any real country, but we can still compare its structure to NHANES.


In [None]:
# Quick look at the FB2NEP synthetic cohort.

fb_cols = [
    "id", "age", "sex", "IMD_quintile", "SES_class", "BMI", "SBP",
    "fruit_veg_g_d", "red_meat_g_d", "CVD_incident", "Cancer_incident"
]

# Use a list comprehension to keep only the columns that are actually present.
fb_cols = [c for c in fb_cols if c in df_fb2nep.columns]

df_fb2nep[fb_cols].head()

## 2. Basic distributions in NHANES

We first inspect the distribution of sex, age, and race/ethnicity in the NHANES subset. We create simple age groups for adults and compute proportion tables.


In [None]:
# Create age groups for NHANES adults.
#
# We restrict to adults aged 20 years and older (already done in the helper script),
# and group into: 20–39, 40–59, 60+ years.

age_bins_nhanes = [20, 40, 60, np.inf]
age_labels_nhanes = ["20–39", "40–59", "60+"]

nhanes = nhanes.copy()
nhanes["age_group"] = pd.cut(
    nhanes["age_years"],
    bins=age_bins_nhanes,
    labels=age_labels_nhanes,
    right=False,
)

nhanes[["age_years", "age_group", "sex", "race_eth", "education", "bmi"]].head()

In [None]:
# Helper function: counts and proportions for a categorical variable.

def proportion_table(data: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return counts and proportions for one categorical column.

    The result has one row per category with:
    - count: number of observations in this category,
    - proportion: fraction of all observations in this category.
    """
    counts = data[column].value_counts(dropna=False)
    props = data[column].value_counts(normalize=True, dropna=False)
    out = pd.DataFrame({
        "count": counts,
        "proportion": props,
    })
    return out

print("NHANES sex distribution:")
display(proportion_table(nhanes, "sex"))

print("\nNHANES age group distribution:")
display(proportion_table(nhanes, "age_group"))

print("\nNHANES race/ethnicity distribution:")
display(proportion_table(nhanes, "race_eth"))

## 3. Representation relative to the United States Census

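NHANES is designed so that, once its survey weights are applied, it is approximately representative of the civilian, non-institutionalised United States population. Some subgroups are deliberately oversampled, so the **unweighted** proportions computed in this workbook can differ from Census values by design. To illustrate how such a comparison works, we compare the unweighted NHANES proportions to approximate adult distributions from the United States Census (values here are simplified for teaching).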

We compute a **representation ratio** for each category:

\begin{equation}
\text{representation ratio} = \frac{\text{NHANES proportion}}{\text{Census proportion}}.
\end{equation}

- A value close to 1 means that NHANES has a similar proportion to the Census.
- Values greater than 1 indicate **over-representation** in NHANES.
- Values less than 1 indicate **under-representation**.


In [None]:
# Approximate United States Census distributions for adults.
# These values are illustrative and are not official statistics.

census_sex = pd.DataFrame({
    "sex": ["Female", "Male"],
    "census_prop": [0.509, 0.491],
})

census_age = pd.DataFrame({
    "age_group": ["20–39", "40–59", "60+"],
    "census_prop": [0.35, 0.33, 0.32],
})

census_race = pd.DataFrame({
    "race_eth": ["White", "Black", "Hispanic", "Asian", "Other"],
    "census_prop": [0.58, 0.12, 0.19, 0.06, 0.05],
})

census_sex

In [None]:
# Helper function: representation table for one variable.

def representation_table(sample_tab: pd.DataFrame, census_tab: pd.DataFrame, key: str) -> pd.DataFrame:
    """Merge sample and Census proportions and compute representation ratios.

    Parameters
    ----------
    sample_tab : DataFrame
        Table with columns [key, 'proportion'] from the sample (here: NHANES).
    census_tab : DataFrame
        Table with columns [key, 'census_prop'] from Census or other reference.
    key : str
        Column name that identifies the categories (for example, 'sex').
    """
    merged = sample_tab.merge(census_tab, on=key, how="outer", validate="one_to_one")
    merged = merged.rename(columns={"proportion": "sample_prop"})
    merged["representation_ratio"] = merged["sample_prop"] / merged["census_prop"]
    return merged

# Compute NHANES proportion tables.
nhanes_sex = proportion_table(nhanes, "sex").reset_index().rename(columns={"index": "sex"})
nhanes_age = proportion_table(nhanes, "age_group").reset_index().rename(columns={"index": "age_group"})
nhanes_race = proportion_table(nhanes, "race_eth").reset_index().rename(columns={"index": "race_eth"})

# Compute representation ratios.
repr_sex = representation_table(nhanes_sex, census_sex, "sex")
repr_age = representation_table(nhanes_age, census_age, "age_group")
repr_race = representation_table(nhanes_race, census_race, "race_eth")

repr_sex

In [None]:
# Simple bar plots of representation ratios.

def plot_representation(df: pd.DataFrame, category_col: str, title: str) -> None:
    """Plot representation ratios for one variable.

    A horizontal line at 1.0 indicates perfect agreement between
    NHANES and the Census margins.
    """
    df = df.copy().sort_values("representation_ratio")

    plt.figure(figsize=(6, 4))
    plt.bar(df[category_col].astype(str), df["representation_ratio"])
    plt.axhline(1.0, linestyle="--")
    plt.ylabel("Representation ratio (NHANES / Census)")
    plt.title(title)
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

plot_representation(repr_sex, "sex", "NHANES vs Census: sex")
plot_representation(repr_age, "age_group", "NHANES vs Census: age group")
plot_representation(repr_race, "race_eth", "NHANES vs Census: race/ethnicity")

### Interpretation

- A representation ratio close to 1.0 indicates that NHANES has a similar proportion to the Census for that category.
- Ratios above 1.0 indicate that the category is **over-represented** in NHANES; ratios below 1.0 indicate **under-representation**.
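- The NHANES proportions used here are unweighted, and the Census values are illustrative. Because NHANES deliberately oversamples some subgroups, a ratio away from 1.0 may reflect the sampling design rather than a failure of representativeness; population estimates require the survey weights. Even after weighting, some groups may remain slightly over- or under-represented.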
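
To show why weighting matters, here is a minimal sketch of a weighted proportion: the sum of the weights in a category divided by the sum of all weights. The data are a toy example, and the column name `weight` and the function `weighted_proportions` are hypothetical. NHANES distributes interview and examination weights with its demographic file, but whether `load_nhanes_demo` retains them is not shown in this workbook.

```python
import pandas as pd

# Toy data (hypothetical, not NHANES): group B is oversampled by design,
# so each B respondent represents fewer people and carries a smaller weight.
toy = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B", "B"],
    "weight": [300.0, 300.0, 100.0, 100.0, 100.0, 100.0],
})


def weighted_proportions(data: pd.DataFrame, column: str, weight_col: str) -> pd.Series:
    """Return the weighted proportion of each category in 'column'."""
    totals = data.groupby(column, observed=True)[weight_col].sum()
    return totals / totals.sum()


unweighted = toy["group"].value_counts(normalize=True).sort_index()
weighted = weighted_proportions(toy, "group", "weight").sort_index()

print(pd.DataFrame({"unweighted": unweighted, "weighted": weighted}))
```

In the toy data, group B makes up two thirds of the respondents but only 40 % of the weighted total. A representation ratio based on unweighted proportions would therefore flag B as over-represented, although this is simply a consequence of the design.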


## 4. Sampling variability and sample size (NHANES BMI example)

We now treat the NHANES subset as our **population** and examine how estimates of mean BMI vary when we repeatedly draw samples of different sizes.

For a numeric variable such as **BMI** we compute:

- The population mean and standard deviation using all NHANES adults.
- The distribution of **sample means** from many repeated samples of size *n*.

We expect that:

- The mean of the sample means will be close to the true population mean.
- The variability of the sample means will decrease as the sample size increases.


In [None]:
# Population statistics for BMI in NHANES.

if "bmi" not in nhanes.columns:
    raise KeyError("The NHANES dataset does not contain a 'bmi' column.")

pop_mean_bmi = nhanes["bmi"].mean()
# ddof=0: we treat the NHANES subset as the whole population, so divide by N.
pop_sd_bmi = nhanes["bmi"].std(ddof=0)

print("NHANES BMI (adults) – population statistics:")
print(f"  Mean: {pop_mean_bmi:5.2f}")
print(f"  SD:   {pop_sd_bmi:5.2f}")

In [None]:
# Functions to simulate sampling distributions of the mean BMI.

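def draw_sample_mean_bmi(data: pd.DataFrame, n: int, rng: np.random.Generator) -> float:
    """Draw a simple random sample of size n and return its mean BMI.

    The sample is drawn without replacement from the rows of 'data' that
    have a non-missing BMI, so every sample mean is based on exactly n values.
    """
    # Drop missing BMI first; otherwise .mean() would silently skip NaN
    # and the effective sample size would be smaller than n.
    bmi = data["bmi"].dropna().to_numpy()
    return float(rng.choice(bmi, size=n, replace=False).mean())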


def simulate_sampling_distribution(
    data: pd.DataFrame, n: int, n_sim: int, rng: np.random.Generator
) -> np.ndarray:
    """Simulate a sampling distribution of the mean BMI.

    Repeats the sampling process 'n_sim' times for a given sample size 'n'
    and returns an array of sample means.
    """
    means = np.empty(n_sim)
    for i in range(n_sim):
        means[i] = draw_sample_mean_bmi(data, n, rng)
    return means


# Choose sample sizes for comparison and number of simulations.
sample_sizes = [100, 500, 2000]
n_sim = 300

sampling_results = {}
for n in sample_sizes:
    sampling_results[n] = simulate_sampling_distribution(nhanes, n, n_sim, rng)

# Inspect the first few simulated means for n = 100.
sampling_results[100][:5]

In [None]:
# Plot histograms of sample means for each sample size.

for n in sample_sizes:
    means = sampling_results[n]

    plt.figure(figsize=(6, 4))
    plt.hist(means, bins=20)
    plt.axvline(pop_mean_bmi, linestyle="--")
    plt.xlabel("Sample mean BMI")
    plt.ylabel("Frequency across simulations")
    plt.title(f"Sampling distribution of mean BMI (n = {n})")
    plt.tight_layout()
    plt.show()

    print(f"n = {n}")
    print(f"  Mean of sample means: {means.mean():6.3f}")
    print(f"  SD of sample means:   {means.std():6.3f}")
    print(f"  Population mean BMI:  {pop_mean_bmi:6.3f}\n")

### Interpretation

- The **centre** of each sampling distribution (mean of the sample means) is close to the true NHANES mean BMI.
- The **spread** of the sampling distribution (standard deviation of the sample means) decreases as the sample size increases.
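- The reduction in spread is roughly proportional to **1/√n**, which is the basis for many ideas in statistical inference (for example, standard errors and confidence intervals). Because we sample *without replacement* from a finite NHANES subset, the spread shrinks somewhat faster than 1/√n once *n* is a sizeable fraction of the population. This is the finite population correction, and it is most visible for the largest sample size.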
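
The 1/√n relationship can be checked numerically. The sketch below uses a simulated population (not NHANES) of assumed size `N = 5000` and compares the simulated SD of the sample means with two formulas: the naive standard error σ/√n, and the same value multiplied by the finite population correction √((N − n)/(N − 1)), which applies when sampling without replacement.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed for this sketch only

# Hypothetical finite population of BMI-like values (simulated, not NHANES).
N = 5000
population = rng.normal(loc=27.0, scale=6.0, size=N)
sigma = population.std()  # population SD (ddof=0)


def simulated_se(n: int, n_sim: int = 2000) -> float:
    """SD of the sample mean over n_sim samples drawn without replacement."""
    means = np.array([
        rng.choice(population, size=n, replace=False).mean()
        for _ in range(n_sim)
    ])
    return means.std()


comparison = {}
for n in (100, 500, 2000):
    naive = sigma / np.sqrt(n)            # ignores the finite population
    fpc = np.sqrt((N - n) / (N - 1))      # finite population correction
    comparison[n] = {"simulated": simulated_se(n), "naive": naive, "corrected": naive * fpc}
    print(f"n = {n:4d} | simulated {comparison[n]['simulated']:.4f} "
          f"| sigma/sqrt(n) {naive:.4f} | with FPC {naive * fpc:.4f}")
```

For small *n* the two formulas are almost identical. For *n* = 2000, which is 40 % of this hypothetical population, the naive formula overstates the spread noticeably, while the corrected value tracks the simulation closely.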


## 5. Comparing the FB2NEP synthetic cohort to NHANES

We now compare the **structure** of the FB2NEP synthetic cohort to the NHANES reference survey. The aim is not to make the synthetic cohort “representative of the United States”, but to illustrate how one might compare a study sample to a reference population.

Because the FB2NEP cohort includes only adults aged 40 years and older, we first create a matching subset of NHANES adults aged 40 years and older.


In [None]:
# Restrict NHANES to adults aged 40+ years and define age groups.

nhanes_40plus = nhanes[nhanes["age_years"] >= 40].copy()

age_bins_40 = [40, 55, 70, np.inf]
age_labels_40 = ["40–54", "55–69", "70+"]

nhanes_40plus["age_group_40"] = pd.cut(
    nhanes_40plus["age_years"],
    bins=age_bins_40,
    labels=age_labels_40,
    right=False,
)

# Define equivalent age groups in the FB2NEP cohort.
df_fb2nep = df_fb2nep.copy()
df_fb2nep["age_group_40"] = pd.cut(
    df_fb2nep["age"],
    bins=age_bins_40,
    labels=age_labels_40,
    right=False,
)

nhanes_40plus[["age_years", "age_group_40", "sex"]].head()

In [None]:
# Helper function: compare distributions between NHANES and FB2NEP.

def compare_two_sources(
    ref: pd.DataFrame, study: pd.DataFrame, column: str, ref_label: str, study_label: str
) -> pd.DataFrame:
    """Create a table comparing a reference dataset and a study dataset.

    The output contains counts and proportions for each category in 'column'
    for both datasets.
    """
    ref_tab = proportion_table(ref, column).rename(columns={
        "count": f"{ref_label}_count",
        "proportion": f"{ref_label}_prop",
    })
    study_tab = proportion_table(study, column).rename(columns={
        "count": f"{study_label}_count",
        "proportion": f"{study_label}_prop",
    })
    merged = ref_tab.merge(study_tab, left_index=True, right_index=True, how="outer")
    return merged

print("Sex distribution: NHANES 40+ vs FB2NEP synthetic cohort")
display(compare_two_sources(nhanes_40plus, df_fb2nep, "sex", "NHANES40", "FB2NEP"))

print("\nAge group (40+): NHANES vs FB2NEP synthetic cohort")
display(compare_two_sources(nhanes_40plus, df_fb2nep, "age_group_40", "NHANES40", "FB2NEP"))

> **Hippo cameo:** imagine that exactly one extremely diligent hippo lives in the catchment
> area and always volunteers for every nutrition study. The hippo will appear in many study
> samples, but this single observation tells us little about the many hippos who do not
> volunteer. This is an example of how selection can be systematic rather than random.


### Interpretation

- The FB2NEP synthetic cohort is not intended to match NHANES exactly, but it is still useful to compare basic characteristics such as sex and age distribution.
- If a real study sample differs strongly from a reference survey (such as NHANES or NDNS), then the study may have limited **external validity** for the general population.
- The comparison does not by itself prove bias in associations, but it is an important step in describing **who we are studying**.
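
One simple way to summarise such a comparison in numbers is sketched below. It is not part of the workbook's own method: the proportions are invented for illustration, and `proportion_differences` is a hypothetical helper. It reports the difference in proportions per category and a single dissimilarity index (half the sum of the absolute differences), which is 0 for identical distributions and 1 for completely non-overlapping ones.

```python
import pandas as pd

# Invented proportions for illustration (not taken from NHANES or FB2NEP).
ref_prop = pd.Series({"40–54": 0.40, "55–69": 0.35, "70+": 0.25})
study_prop = pd.Series({"40–54": 0.30, "55–69": 0.45, "70+": 0.25})


def proportion_differences(ref: pd.Series, study: pd.Series) -> pd.DataFrame:
    """Align two proportion vectors by category and compute study minus reference."""
    table = pd.concat(
        [ref.rename("ref_prop"), study.rename("study_prop")], axis=1
    ).fillna(0.0)  # a category missing from one source has proportion 0 there
    table["difference"] = table["study_prop"] - table["ref_prop"]
    return table


diff_table = proportion_differences(ref_prop, study_prop)
dissimilarity = 0.5 * diff_table["difference"].abs().sum()

print(diff_table)
print(f"Dissimilarity index: {dissimilarity:.2f}")
```

Here the index is 0.10: ten per cent of the study sample would have to move to a different age group for the two distributions to match. Such a summary describes structure only; it says nothing about whether associations estimated in the study are biased.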


## 6. Exercises (for students)

1. **Change the sample size in the BMI simulation**  
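   In the sampling demonstration, add another sample size (for example, `n = 50` or `n = 5000`) to the `sample_sizes` list and rerun the simulation. Because the samples are drawn without replacement, first check that the chosen *n* does not exceed the number of NHANES adults available. How does the spread of the sample mean BMI change? Relate your findings to the idea that standard errors shrink with 1/√n.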

2. **Additional variables for representativeness**  
   Extend the comparison between NHANES and the FB2NEP synthetic cohort to include other variables, for example `SES_class` (in FB2NEP) and `education` (in NHANES). What difficulties arise when variables do not use exactly the same categories?

3. **Alternative age groupings**  
   Define different age groupings (for example, two broad groups 40–64 and 65+) and repeat the comparison. How does the choice of grouping affect your impression of representativeness?

4. **Selection bias thought experiment**  
   Suppose that in a real cohort, people with very poor health are less likely to participate. Describe in a short paragraph how this could bias estimates of the association between physical activity and cardiovascular disease.

5. **Country-specific reference surveys**  
   In the United Kingdom, the **National Diet and Nutrition Survey (NDNS)** is often used as a reference for diet and some health indicators. Look up (outside this notebook) what NDNS is and which population it covers. How might you compare a United Kingdom cohort to NDNS in a similar way to what we have done here with NHANES?


## 7. Summary

- **Population, target population, and sample** are related but distinct concepts. It is essential to be explicit about each before analysing data.
- **Representativeness** concerns the similarity of the sample to the population of interest in terms of key characteristics and is closely linked to **external validity**.
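- **Sampling error** arises because we observe only a finite sample. It decreases as the sample size increases but does not disappear for any sample short of the whole population. **Systematic bias**, by contrast, does not shrink as the sample grows.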
- Using a reference survey such as **NHANES** allows us to compare the structure of a study sample (here: the FB2NEP synthetic cohort) to a broader population.
- Restricted or specialised cohorts can provide excellent information about associations **within** certain groups but are not automatically representative of all adults.
- Comparing your own data to reference surveys is a routine and important step in nutritional epidemiology, both for describing study populations and for assessing the generalisability of findings.
