# FB2NEP Workbook 3 – Populations, Samples, and Representativeness

Version 0.1.0

This workbook introduces core ideas in sampling for nutritional epidemiology:

- The distinction between the **general population**, a **study's target population**, and the realised **sample**.
- The notions of **sampling frames**, **representativeness**, and **sampling error** versus **bias**.
- How a **probability sample** such as NHANES can be used as a reference for population structure.
- How to compare simple descriptive statistics between a **reference population** (NHANES) and a **study cohort** (the FB2NEP synthetic dataset).
- How **sample size** affects the precision of estimates of central tendency (here: body mass index, BMI).

We use two datasets:

- **NHANES (National Health and Nutrition Examination Survey)**: a large, nationally representative survey of the non-institutionalised United States population. Here we download a small subset of NHANES 2017–2018 directly from the [CDC website](https://www.cdc.gov) via a helper script.
- **FB2NEP synthetic cohort**: the synthetic dataset that we use throughout the module for regression and causal reasoning. Here it plays the role of a *study sample* that we compare to NHANES.

> Hippo cameo: imagine a very diligent hippo who always volunteers for nutrition surveys. This will help us think about who actually ends up in a sample.


In [None]:
# ============================================================
# FB2NEP bootstrap cell (works both locally and in Colab)
#
# What this cell does:
# - Ensures that we are inside the fb2nep-epi repository.
# - In Colab: clones the repository from GitHub if necessary.
# - Loads and runs scripts/bootstrap.py.
# - Makes the main dataset available as the variable `df`.
#
# Important:
# - You may see messages printed below (for example from pip
#   or from the bootstrap script). This is expected.
# - You may also see WARNINGS (often in yellow). In most cases
#   these are harmless and can be ignored for this module.
# - The main thing to watch for is a red error traceback
#   (for example FileNotFoundError, ModuleNotFoundError).
#   If that happens, please re-run this cell first. If the
#   error persists, ask for help.
# ============================================================

import os
import sys
import pathlib
import subprocess
import importlib.util

# ------------------------------------------------------------
# Configuration: repository location and URL
# ------------------------------------------------------------
# REPO_URL: address of the GitHub repository.
# REPO_DIR: folder name that will be created when cloning.
REPO_URL = "https://github.com/ggkuhnle/fb2nep-epi.git"
REPO_DIR = "fb2nep-epi"

# ------------------------------------------------------------
# 1. Ensure we are inside the fb2nep-epi repository
# ------------------------------------------------------------
# In local Jupyter, you may already be inside the repository,
# for example in fb2nep-epi/notebooks.
#
# In Colab, the default working directory is /content, so
# we need to clone the repository into /content/fb2nep-epi
# and then change into that folder.
cwd = pathlib.Path.cwd()

# Case A: we are already in the repository (scripts/bootstrap.py exists here)
if (cwd / "scripts" / "bootstrap.py").is_file():
    repo_root = cwd

# Case B: we are outside the repository (for example in Colab)
else:
    repo_root = cwd / REPO_DIR

    # Clone the repository if it is not present yet
    if not repo_root.is_dir():
        print(f"Cloning repository from {REPO_URL} into {repo_root} ...")
        subprocess.run(["git", "clone", REPO_URL, str(repo_root)], check=True)
    else:
        print(f"Using existing repository at {repo_root}")

    # Change the working directory to the repository root
    os.chdir(repo_root)
    repo_root = pathlib.Path.cwd()

print(f"Repository root set to: {repo_root}")

# ------------------------------------------------------------
# 2. Load scripts/bootstrap.py as a module and call init()
# ------------------------------------------------------------
# The shared bootstrap script contains all logic to:
# - Ensure that required Python packages are installed.
# - Ensure that the synthetic dataset exists (and generate it
#   if needed).
# - Load the dataset into a pandas DataFrame.
#
# We load the script as a normal Python module (fb2nep_bootstrap)
# and then call its init() function.
bootstrap_path = repo_root / "scripts" / "bootstrap.py"

if not bootstrap_path.is_file():
    raise FileNotFoundError(
        f"Could not find {bootstrap_path}. "
        "Please check that the fb2nep-epi repository structure is intact."
    )

# Create a module specification from the file
spec = importlib.util.spec_from_file_location("fb2nep_bootstrap", bootstrap_path)
bootstrap = importlib.util.module_from_spec(spec)
sys.modules["fb2nep_bootstrap"] = bootstrap

# Execute the bootstrap script in the context of this module
spec.loader.exec_module(bootstrap)

# The init() function is defined in scripts/bootstrap.py.
# It returns:
# - df   : the main synthetic cohort as a pandas DataFrame.
# - CTX  : a small context object with paths, flags and settings.
df, CTX = bootstrap.init()

# Optionally expose a few additional useful variables from the
# bootstrap module (if they exist). These are not essential for
# most analyses, but can be helpful for advanced use.
for name in ["CSV_REL", "REPO_NAME", "REPO_URL", "IN_COLAB"]:
    if hasattr(bootstrap, name):
        globals()[name] = getattr(bootstrap, name)

print("Bootstrap completed successfully.")
print("The main dataset is available as the variable `df`.")
print("The context object is available as `CTX`.")


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scripts.fetch_nhanes_demo import load_nhanes_demo

# ---------------------------------------------------------------------
# Fixed random seed for all simulations in this workbook.
# ---------------------------------------------------------------------
RANDOM_SEED = 11088
rng = np.random.default_rng(RANDOM_SEED)

# ---------------------------------------------------------------------
# Load processed NHANES subset (downloaded and tidied by helper script).
# ---------------------------------------------------------------------
nhanes = load_nhanes_demo(cache=True)

# ---------------------------------------------------------------------
# Load / reuse the FB2NEP synthetic cohort
# ---------------------------------------------------------------------

# Build the path to the CSV file explicitly. `repo_root` is set in the
# bootstrap cell; `CSV_REL` is exposed from the bootstrap module.
fb2nep_path = repo_root / CSV_REL
print("Loading FB2NEP synthetic cohort from:", fb2nep_path)
df_fb2nep = pd.read_csv(fb2nep_path)

print("NHANES demo shape:", nhanes.shape)
print("FB2NEP synthetic cohort shape:", df_fb2nep.shape)

print("NHANES")
nhanes.head()

In [None]:
print("FB2NEP")
df_fb2nep.head()

## 1. Concepts and definitions

### 1.1 General population, target population, and sample

- **General population**: The entire group about which we ultimately wish to draw conclusions. Examples: “all adults living in the United States”, or “all adults aged 40 years and older in the United Kingdom”.

- **Target population (study population)**: The subset of the general population that a particular study *intends* to represent. Examples: <font color='red'>all adults aged 40–79 years registered with a general practitioner in England</font>, or <font color='red'>all post-menopausal women without prior cardiovascular disease at baseline</font>. This definition is conceptual and exists **before** sampling.

- **Sampling frame**: The operational list or mechanism used to select individuals. Examples: general practice registers, electoral rolls, health insurance lists, employee registers. Individuals who are not in the sampling frame cannot be selected, even if they belong to the target population.

- **Sample**: The actual set of individuals who are recruited and provide data. The sample may differ from the target population because of non-response, exclusion criteria, and practical constraints.
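
The chain from target population to sampling frame to realised sample can be illustrated with a small simulation. This is only a sketch: the population size, the 85% frame coverage, and the 60% response rate are invented for illustration and do not describe any real study.

```python
import numpy as np
import pandas as pd

rng_frame = np.random.default_rng(0)

# Hypothetical target population of 10,000 people (all numbers are made up).
target = pd.DataFrame({"person_id": np.arange(10_000)})

# Assumption: 85% of the target population is listed in the sampling frame
# (for example, a register). The rest can never be selected.
target["in_frame"] = rng_frame.random(len(target)) < 0.85
frame = target[target["in_frame"]]

# Invite a simple random sample of 1,000 people from the frame.
invited = frame.sample(n=1_000, random_state=0)

# Assumption: 60% of invited people respond; they form the realised sample.
responded = rng_frame.random(len(invited)) < 0.60
sample = invited[responded]

print(f"Target population: {len(target)}")
print(f"Sampling frame:    {len(frame)}")
print(f"Invited:           {len(invited)}")
print(f"Realised sample:   {len(sample)}")
```

Here both frame coverage and response are random, so the realised sample is smaller than intended but not systematically different. In real studies, coverage and response usually depend on personal characteristics, which is where bias enters.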

Key questions for any study: 

1. **Who did we intend to study?** (target population)
2. **Who did we actually study?** (realised sample)
3. **How different are these groups** with respect to variables that matter for our research question?


### 1.2 Representativeness, sampling error, and bias

A sample is **representative** of a population when the distribution of key characteristics in the sample matches that of the population of interest. Common characteristics:

- Sex.
- Age distribution.
- Socioeconomic position (for example, education, deprivation indices).
- Ethnicity (if available).

Representativeness is mainly about **external validity**: how far we can generalise study findings beyond those who took part.

It is useful to distinguish two sources of difference between sample and population:

- **Sampling error**: Random variation in estimates because we observe only a finite sample rather than the whole population. Sampling error becomes smaller as the sample size increases.
- **Systematic bias**: Systematic differences between sample and population (for example, non-participation of people with poor health or low income) that do **not** disappear when the sample size increases.
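
The difference between the two can be made concrete with a small simulation. The sketch below uses a simulated population (not NHANES or FB2NEP data) and an assumed selection mechanism in which people with a higher BMI are less likely to take part. The error of a simple random sample shrinks as *n* grows; the error of the selective sample does not.

```python
import numpy as np

rng_demo = np.random.default_rng(1)

# Hypothetical population of BMI values (simulated, not real data).
population_bmi = rng_demo.normal(loc=27.0, scale=5.0, size=200_000)
true_mean = population_bmi.mean()

# Assumed selection mechanism: the higher the BMI, the lower the chance
# of taking part. Probabilities are normalised to sum to 1.
p_take_part = 1.0 / (1.0 + np.exp((population_bmi - 27.0) / 5.0))
p_take_part = p_take_part / p_take_part.sum()

errors = {}
for n in [100, 1_000, 10_000]:
    # Simple random sample: everyone has the same chance of selection.
    random_sample = rng_demo.choice(population_bmi, size=n, replace=False)
    # Selective sample: the chance of selection depends on BMI.
    selective_sample = rng_demo.choice(
        population_bmi, size=n, replace=False, p=p_take_part
    )

    errors[n] = (
        random_sample.mean() - true_mean,
        selective_sample.mean() - true_mean,
    )
    print(f"n = {n:6d}  random: {errors[n][0]:+.2f}  selective: {errors[n][1]:+.2f}")
```

The selective sample underestimates mean BMI by a similar amount at every sample size, because the error comes from who is selected rather than from how many.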

In this workbook we will:

- Use **NHANES** as a reference survey for the general population.
- Treat the **FB2NEP synthetic cohort** as a “study sample” and compare it to NHANES.
- Use repeated sampling from NHANES to illustrate **sampling error**.


### 1.3 NHANES and the FB2NEP synthetic cohort

- **NHANES** 🇺🇸 uses complex probability sampling strategies to obtain an approximately nationally representative sample of the civilian, non-institutionalised United States population. For this workbook we focus on adults and use a limited set of variables: age, sex, race/ethnicity, education, and BMI.

- The **FB2NEP synthetic cohort** is a simulated cohort used for teaching. It mimics a longitudinal study of adults aged 40 years and older with follow-up for cardiovascular disease and cancer. It is not designed to be representative of any real country, but we can still compare its structure to NHANES.


In [None]:
# Quick look at the FB2NEP synthetic cohort.

fb_cols = [
    "id", "age", "sex", "IMD_quintile", "SES_class", "BMI", "SBP",
    "fruit_veg_g_d", "red_meat_g_d", "CVD_incident", "Cancer_incident"
]

# Use list comprehension to keep only columns that are actually present.
fb_cols = [c for c in fb_cols if c in df_fb2nep.columns]

df_fb2nep[fb_cols].head()

## 2. Basic distributions in NHANES

We first inspect the distribution of sex, age, and race/ethnicity in the NHANES subset. We create simple age groups for adults and compute proportion tables.


In [None]:
# Create age groups for NHANES adults.
#
# We restrict to adults aged 20 years and older (already done in the helper script),
# and group into: 20–39, 40–59, 60+ years.

age_bins_nhanes = [20, 40, 60, np.inf]
age_labels_nhanes = ["20–39", "40–59", "60+"]

nhanes = nhanes.copy()
nhanes["age_group"] = pd.cut(
    nhanes["age_years"],
    bins=age_bins_nhanes,
    labels=age_labels_nhanes,
    right=False,
)

nhanes[["age_years", "age_group", "sex", "race_eth", "education", "bmi"]].head()

In [None]:
from scripts.helpers_tables import proportion_table

print("NHANES sex distribution:")
display(proportion_table(nhanes, "sex", dropna=False))

print("\nNHANES age group distribution:")
display(proportion_table(nhanes, "age_group", dropna=False))

print("\nNHANES race/ethnicity distribution:")
display(proportion_table(nhanes, "race_eth", dropna=False))

## 3. Representation relative to the United States Census

NHANES is designed to be approximately representative of the United States population. To illustrate this, we compare NHANES proportions to approximate adult distributions from the United States Census (values here are simplified for teaching).

We compute a **representation ratio** for each category:

\begin{equation}
\text{representation ratio} = \frac{\text{NHANES proportion}}{\text{Census proportion}}.
\end{equation}

- A value close to 1 means that NHANES has a similar proportion to the Census.
- Values greater than 1 indicate **over-representation** in NHANES.
- Values less than 1 indicate **under-representation**.


In [None]:
# Approximate United States Census distributions for adults.
# These values are illustrative and are not official statistics.

census_sex = pd.DataFrame({
    "sex": ["Female", "Male"],
    "census_prop": [0.509, 0.491],
})

census_age = pd.DataFrame({
    "age_group": ["20–39", "40–59", "60+"],
    "census_prop": [0.35, 0.33, 0.32],
})

census_race = pd.DataFrame({
    "race_eth": ["White", "Black", "Hispanic", "Asian", "Other"],
    "census_prop": [0.58, 0.12, 0.19, 0.06, 0.05],
})

census_sex

In [None]:
from scripts.helpers_tables import representation_table


# Compute NHANES proportion tables.
nhanes_sex = proportion_table(nhanes, "sex").reset_index().rename(columns={"index": "sex"})
nhanes_age = proportion_table(nhanes, "age_group").reset_index().rename(columns={"index": "age_group"})
nhanes_race = proportion_table(nhanes, "race_eth").reset_index().rename(columns={"index": "race_eth"})

# Compute representation ratios.
repr_sex = representation_table(nhanes_sex, census_sex, "sex")
repr_age = representation_table(nhanes_age, census_age, "age_group")
repr_race = representation_table(nhanes_race, census_race, "race_eth")

repr_sex

In [None]:
from scripts.helpers_tables import plot_representation

plot_representation(repr_sex, "sex", "NHANES vs Census: sex")
plot_representation(repr_age, "age_group", "NHANES vs Census: age group")
plot_representation(repr_race, "race_eth", "NHANES vs Census: race/ethnicity")

### Interpretation

- A representation ratio close to 1.0 indicates that NHANES has a similar proportion to the Census for that category.
- Ratios above 1.0 indicate that the category is **over-represented** in NHANES; ratios below 1.0 indicate **under-representation**.
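- NHANES is designed to be reasonably close to the United States population, but it is not perfect. The survey deliberately oversamples some subgroups, so unweighted sample proportions can depart from Census figures by design; the survey weights are intended to correct for this. Even after weighting, some groups will be slightly over- or under-represented.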
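
As a worked example of the arithmetic, the sketch below computes representation ratios for three made-up categories. The proportions are hypothetical (they are neither NHANES nor Census values), and the code is a plain pandas illustration rather than the implementation of the `representation_table` helper.

```python
import pandas as pd

# Hypothetical proportions, invented for illustration only.
sample_prop = pd.Series({"A": 0.30, "B": 0.50, "C": 0.20}, name="sample_prop")
reference_prop = pd.Series({"A": 0.25, "B": 0.50, "C": 0.25}, name="reference_prop")

# Align the two sources on the category label and divide.
table = pd.concat([sample_prop, reference_prop], axis=1)
table["representation_ratio"] = table["sample_prop"] / table["reference_prop"]

print(table.round(2))
```

Category A is over-represented (0.30 / 0.25 = 1.2), category B matches the reference (ratio 1.0), and category C is under-represented (0.20 / 0.25 = 0.8).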


## 4. Sampling variability and sample size (NHANES BMI example)

We now treat the NHANES subset as our **population** and examine how estimates of mean BMI vary when we repeatedly draw samples of different sizes.

For a numeric variable such as **BMI** we compute:

- The population mean and standard deviation using all NHANES adults.
- The distribution of **sample means** from many repeated samples of size *n*.

We expect that:

- The mean of the sample means will be close to the true population mean.
- The variability of the sample means will decrease as the sample size increases.


In [None]:
# Population statistics for BMI in NHANES.

if "bmi" not in nhanes.columns:
    raise KeyError("The NHANES dataset does not contain a 'bmi' column.")

pop_mean_bmi = nhanes["bmi"].mean()
pop_sd_bmi = nhanes["bmi"].std()

print("NHANES BMI (adults) – population statistics:")
print(f"  Mean: {pop_mean_bmi:5.2f}")
print(f"  SD:   {pop_sd_bmi:5.2f}")

### 4.1 Setting up the simulation

We now *simulate* what would happen if we repeatedly carried out the same study many times on different random samples from the same population.

The goal is to understand the **sampling distribution of the mean BMI** for different sample sizes.

The key ideas are:

- We treat the NHANES dataset (`nhanes`) as if it were the **true population**.
- For each chosen sample size *n* (for example, 10, 50, 100, 500, 2 000), we:
  1. Draw a simple random sample of size *n* from NHANES (without replacement).
  2. Calculate the mean BMI in that sample.
  3. Repeat this process `n_sim` times (for example, 300 times).
- The resulting collection of mean BMI values shows how much the estimate of the mean BMI **varies from study to study**, purely due to random sampling.
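
The steps above can be sketched in a few lines of NumPy. The function `sketch_sampling_distribution` is a hypothetical stand-in written for this illustration; the helper `simulate_sampling_distribution` used in this workbook may be implemented differently. The toy population is simulated and is not NHANES data.

```python
import numpy as np
import pandas as pd


def sketch_sampling_distribution(data, n, n_sim, rng, column="bmi"):
    """Return n_sim sample means, each from a simple random sample of size n."""
    values = data[column].dropna().to_numpy()
    means = np.empty(n_sim)
    for i in range(n_sim):
        # Step 1: draw a simple random sample without replacement.
        sample = rng.choice(values, size=n, replace=False)
        # Step 2: store the mean of this sample.
        means[i] = sample.mean()
    # Step 3 is the loop itself: the process is repeated n_sim times.
    return means


# Toy population (simulated values, chosen only for illustration).
rng_sketch = np.random.default_rng(42)
toy_population = pd.DataFrame({"bmi": rng_sketch.normal(28.0, 6.0, size=5_000)})

toy_means = sketch_sampling_distribution(toy_population, n=100, n_sim=300, rng=rng_sketch)
print(f"Population mean:      {toy_population['bmi'].mean():.2f}")
print(f"Mean of sample means: {toy_means.mean():.2f}")
print(f"SD of sample means:   {toy_means.std():.2f}")
```

A separate generator (`rng_sketch`) is used so that running the sketch does not change the state of the workbook's own `rng`.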


In [None]:
from scripts.helpers_tables import draw_sample_mean_bmi, simulate_sampling_distribution

# Compute the "true" population mean BMI from the full NHANES dataset.
# In this workbook we treat NHANES as the population.
true_mean_bmi = float(nhanes["bmi"].mean().round(2))

print(f"True (population) mean BMI from NHANES: {true_mean_bmi:.2f}")

# Choose sample sizes for comparison and number of simulations.
sample_sizes = [10, 50, 100, 500, 2000]
n_sim = 300   # number of repeated samples for each n

# Dictionary to store the sampling distributions for each n
sampling_results = {}
for n in sample_sizes:
    sampling_results[n] = simulate_sampling_distribution(nhanes, n, n_sim, rng)

# Inspect the first few simulated means for n = 10
print("\nFirst five simulated sample means (n = 10):")
print(np.round(sampling_results[10][:5], 2))

# Compare the average of simulated means to the true mean
simulated_mean_10 = float(np.mean(sampling_results[10]).round(2))
print(f"\nAverage of simulated means for n = 10: {simulated_mean_10:.2f}")
print(f"Difference from true mean: {simulated_mean_10 - true_mean_bmi:+.2f}")


# Inspect the first few simulated means for n = 500
print("\nFirst five simulated sample means (n = 500):")
print(np.round(sampling_results[500][:5], 2))

# Compare the average of simulated means to the true mean
simulated_mean_500 = float(np.mean(sampling_results[500]).round(2))
print(f"\nAverage of simulated means for n = 500: {simulated_mean_500:.2f}")
print(f"Difference from true mean: {simulated_mean_500 - true_mean_bmi:+.2f}")


In [None]:
# Plot histograms of sample means for each sample size.

for n in sample_sizes:
    means = sampling_results[n]

    plt.figure(figsize=(6, 4))
    plt.hist(means, bins=20)
    plt.axvline(pop_mean_bmi, linestyle="--")
    plt.xlabel("Sample mean BMI")
    plt.ylabel("Frequency across simulations")
    plt.title(f"Sampling distribution of mean BMI (n = {n})")
    plt.tight_layout()
    plt.show()

    print(f"n = {n}")
    print(f"  Mean of sample means: {means.mean():6.3f}")
    print(f"  SD of sample means:   {means.std():6.3f}")
    print(f"  Population mean BMI:  {pop_mean_bmi:6.3f}\n")

### Interpretation

- The **centre** of each sampling distribution (mean of the sample means) is close to the true NHANES mean BMI.
- The **spread** of the sampling distribution (standard deviation of the sample means) decreases as the sample size increases.
- The spread is approximately proportional to **1/√n**: quadrupling the sample size roughly halves the standard deviation of the sample means. This relationship is the basis for many ideas in statistical inference (for example, standard errors and confidence intervals).
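
To check the 1/√n relationship numerically, we can compare the empirical standard deviation of simulated sample means with the theoretical standard error σ/√n. Because the samples are drawn without replacement from a finite population of size *N*, the sketch also shows the value after the finite population correction, √((N − n)/(N − 1)). The population here is simulated, and all numbers are illustrative assumptions.

```python
import numpy as np

rng_se = np.random.default_rng(7)

# Hypothetical finite population (simulated values, not NHANES data).
N = 5_000
population = rng_se.normal(28.0, 6.0, size=N)
sigma = population.std()  # population SD (ddof=0)

results = {}
for n in [10, 100, 1_000]:
    # Empirical sampling distribution: 2,000 sample means per sample size.
    means = np.array([
        rng_se.choice(population, size=n, replace=False).mean()
        for _ in range(2_000)
    ])
    se_simple = sigma / np.sqrt(n)                    # sigma / sqrt(n)
    se_fpc = se_simple * np.sqrt((N - n) / (N - 1))   # with finite population correction
    results[n] = (means.std(), se_simple, se_fpc)
    print(
        f"n = {n:5d}  empirical SD = {means.std():.3f}  "
        f"sigma/sqrt(n) = {se_simple:.3f}  with FPC = {se_fpc:.3f}"
    )
```

For small *n* the correction is negligible. It matters only when the sample is a sizeable fraction of the population, which is why the spread for the largest sample sizes can fall slightly below σ/√n.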


## 5. Comparing the FB2NEP synthetic cohort to NHANES

We now compare the **structure** of the FB2NEP synthetic cohort to the NHANES reference survey. The aim is not to make the synthetic cohort “representative of the United States”, but to illustrate how one might compare a study sample to a reference population.

Because the FB2NEP cohort includes only adults aged 40 years and older, we first create a matching subset of NHANES adults aged 40 years and older.


In [None]:
# Restrict NHANES to adults aged 40+ years and define age groups.

nhanes_40plus = nhanes[nhanes["age_years"] >= 40].copy()

age_bins_40 = [40, 55, 70, np.inf]
age_labels_40 = ["40–54", "55–69", "70+"]

nhanes_40plus["age_group_40"] = pd.cut(
    nhanes_40plus["age_years"],
    bins=age_bins_40,
    labels=age_labels_40,
    right=False,
)

# Define equivalent age groups in the FB2NEP cohort.
df_fb2nep = df_fb2nep.copy()
df_fb2nep["age_group_40"] = pd.cut(
    df_fb2nep["age"],
    bins=age_bins_40,
    labels=age_labels_40,
    right=False,
)

nhanes_40plus[["age_years", "age_group_40", "sex"]].head()

In [None]:
from scripts.helpers_tables import compare_two_sources



print("Sex distribution: NHANES 40+ vs FB2NEP synthetic cohort")
display(compare_two_sources(nhanes_40plus, df_fb2nep, "sex", "NHANES40", "FB2NEP"))

print("\nAge group (40+): NHANES vs FB2NEP synthetic cohort")
display(compare_two_sources(nhanes_40plus, df_fb2nep, "age_group_40", "NHANES40", "FB2NEP"))

### 5.1 Coding differences: `sex` in NHANES vs FB2NEP

> What is going on here? NHANES and FB2NEP code `sex` differently!  
> NHANES uses `"Female"` / `"Male"`, whereas FB2NEP uses `"F"` / `"M"`.

This is a simple but important example of a **coding problem**:

- The two datasets refer to the **same underlying concept** (biological sex at baseline),
- but they use **different labels** for the categories.

If we tabulate or merge without harmonising the codes, we obtain misleading tables:

- Four rows (`F`, `M`, `Female`, `Male`) instead of two,
- missing values in some of the columns because categories do not align.
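
The effect is easy to reproduce with toy data. The sketch below uses two small made-up data frames and plain pandas; it does not use the `compare_two_sources` helper, whose internals are not shown in this workbook.

```python
import pandas as pd

# Toy data with the two coding schemes described above (values are made up).
toy_ref = pd.DataFrame({"sex": ["Female", "Male", "Female", "Male"]})
toy_study = pd.DataFrame({"sex": ["F", "M", "F", "F"]})

# Proportions per source, aligned on the category label.
toy_table = pd.concat(
    [
        toy_ref["sex"].value_counts(normalize=True).rename("ref_prop"),
        toy_study["sex"].value_counts(normalize=True).rename("study_prop"),
    ],
    axis=1,
)

print(toy_table)
```

Because no label appears in both sources, the table has four rows, and every row has a missing value in one of the two columns.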

This kind of problem occurs frequently when combining:

- different surveys,
- registry data and cohort data,
- different waves of the same study.

We now create a *harmonised* version of `sex` in both datasets, with the labels `"Female"` and `"Male"`, and then repeat the comparison.


In [None]:
# -------------------------------------------
# Harmonise the coding of 'sex' in both sets
# -------------------------------------------

def harmonise_sex(series: pd.Series) -> pd.Series:
    """
    Map different encodings of sex to common labels "Female" / "Male".

    Parameters
    ----------
    series : pandas.Series
        Original sex variable (for example 'F'/'M' or 'Female'/'Male').

    Returns
    -------
    pandas.Series
        New series with values "Female" or "Male" (or NaN if unknown).

    Notes
    -----
    - Any unexpected categories are printed so that they can be
      checked manually (for example 'Other', 'Prefer not to say').
    """
    # Define how original codes are to be translated.
    code_map = {
        "F": "Female",
        "M": "Male",
        "Female": "Female",
        "Male": "Male",
    }

    # Apply the mapping; entries not in code_map become NaN.
    mapped = series.map(code_map)

    # Identify any values that were not in the mapping.
    unknown = series[~series.isna() & mapped.isna()].unique()
    if len(unknown) > 0:
        print("Warning: unexpected categories in 'sex':", unknown)

    return mapped


# Apply the harmonisation to both datasets.
nhanes_40plus["sex_harmonised"] = harmonise_sex(nhanes_40plus["sex"])
df_fb2nep["sex_harmonised"] = harmonise_sex(df_fb2nep["sex"])

# Quick check: show the distribution in each dataset.
print("NHANES 40 Plus (harmonised):")
print(nhanes_40plus["sex_harmonised"].value_counts(dropna=False))
print("\nFB2NEP (harmonised):")
print(df_fb2nep["sex_harmonised"].value_counts(dropna=False))


We now repeat the comparison, this time using the harmonised `sex_harmonised` variable:

In [None]:

print("Sex distribution: NHANES 40+ vs FB2NEP synthetic cohort")
display(compare_two_sources(nhanes_40plus, df_fb2nep, "sex_harmonised", "NHANES40", "FB2NEP"))

print("\nAge group (40+): NHANES vs FB2NEP synthetic cohort")
display(compare_two_sources(nhanes_40plus, df_fb2nep, "age_group_40", "NHANES40", "FB2NEP"))

> **Hippo cameo:** imagine that exactly one extremely diligent hippo lives in the catchment
> area and always volunteers for every nutrition study. The hippo will appear in many study
> samples, but this single observation tells us little about the many hippos who do not
> volunteer. This is an example of how selection can be systematic rather than random.


### Interpretation

- The FB2NEP synthetic cohort is not intended to match NHANES exactly, but it is still useful to compare basic characteristics such as sex and age distribution.
- If a real study sample differs strongly from a reference survey (such as NHANES or NDNS), then the study may have limited **external validity** for the general population.
- The comparison does not by itself prove bias in associations, but it is an important step in describing **who we are studying**.


### 5.2 Additional variables for representativeness: SES vs education

We may wish to compare other characteristics between NHANES and the FB2NEP synthetic cohort, for example:

- `education` in NHANES (for example, “≤High school”, “Some college”, “College+”), and  
- `SES_class` in FB2NEP (for example, “ABC1”, “C2DE”).

Both variables describe aspects of **socioeconomic position**, but they are **not the same**:

- `education` is an individual-level measure of highest qualification;
- `SES_class` in FB2NEP is a social grade based on occupation (ABC1 vs C2DE).

Direct comparison of the original categories is therefore not meaningful. Instead, we can:

1. Inspect the category labels and distributions in each dataset.
2. Construct a **very crude harmonised variable** that groups people into “lower” and “higher” socioeconomic position.
3. Compare these simplified variables, while being explicit about the approximation.

This illustrates that harmonisation often involves judgement and information loss. In real analyses, this should be documented and justified.


In [None]:
from scripts.helpers_tables import compare_two_sources

# -------------------------------------------------
# Step 1: inspect the original SES / education codes
# -------------------------------------------------

print("NHANES 'education' categories:")
print(nhanes["education"].value_counts(dropna=False))
print("\nFB2NEP 'SES_class' categories:")
print(df_fb2nep["SES_class"].value_counts(dropna=False))

# -------------------------------------------------
# Step 2: create a very crude harmonised SEP variable
# -------------------------------------------------
# For teaching purposes we define:
#
#   - In NHANES (education):
#       "≤High school"  -> "lower"
#       "Some college"  -> "middle"
#       "College+"      -> "higher"
#
#   - In FB2NEP (SES_class):
#       "C2DE"          -> "lower"
#       "ABC1"          -> "higher"
#
# We then collapse to a simple "lower" vs "higher_or_middle" binary
# variable in both datasets, so that we can compare something roughly
# analogous across the two sources.
#
# This is deliberately crude and should be interpreted with caution.

def make_sep_binary_from_education(education: pd.Series) -> pd.Series:
    """
    Convert NHANES education categories into a crude binary SEP measure.

    Returns "lower" or "higher_or_middle". Explicit "Unknown" values
    are treated as missing (NaN). Any truly unexpected categories are
    reported.
    """
    # First treat the NHANES label "Unknown" as missing.
    education_clean = education.replace("Unknown", pd.NA)

    # Map the remaining education categories to a simplified SEP measure.
    edu_map = {
        "≤High school": "lower",
        "Some college": "higher_or_middle",
        "College+": "higher_or_middle",   # corrected label
    }

    sep = education_clean.map(edu_map)

    # Identify any values that are not missing and were not mapped.
    unknown = education_clean[~education_clean.isna() & sep.isna()].unique()
    if len(unknown) > 0:
        print("Warning: unexpected education categories:", unknown)

    return sep


def make_sep_binary_from_ses_class(ses: pd.Series) -> pd.Series:
    """Convert FB2NEP SES_class into a crude binary SEP measure.

    Returns "lower" or "higher_or_middle".
    """
    # Create dictionary mapping SES to category
    

    ses_map = {
        "C2DE": "lower",
        "ABC1": "higher_or_middle",
    }

    sep = ses.map(ses_map)

    unknown = ses[~ses.isna() & sep.isna()].unique()
    if len(unknown) > 0:
        print("Warning: unexpected SES_class categories:", unknown)

    return sep


# Apply the mappings to create harmonised binary SEP variables.
nhanes["SEP_binary"] = make_sep_binary_from_education(nhanes["education"])
df_fb2nep["SEP_binary"] = make_sep_binary_from_ses_class(df_fb2nep["SES_class"])

print("\nNHANES SEP_binary:")
print(nhanes["SEP_binary"].value_counts(dropna=False))
print("\nFB2NEP SEP_binary:")
print(df_fb2nep["SEP_binary"].value_counts(dropna=False))

# -------------------------------------------------
# Step 3: compare the crude SEP distributions
# -------------------------------------------------

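# FB2NEP includes only adults aged 40+, so we restrict NHANES to the same
# age range here (SEP_binary was added to `nhanes` after `nhanes_40plus`
# was created, so we subset again).
sep_comparison = compare_two_sources(
    ref=nhanes[nhanes["age_years"] >= 40],
    study=df_fb2nep,
    column="SEP_binary",
    ref_label="NHANES40",
    study_label="FB2NEP",
)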

sep_comparison


## 8. Exercises

1. **Change the sample size in the BMI simulation**  
   In the sampling demonstration, add another sample size that is not yet in the list (for example, `n = 25` or `n = 1000`) to the `sample_sizes` list and rerun the simulation. How does the spread of the sample mean BMI change? Relate your findings to the idea that standard errors shrink with 1/√n.

2. **Alternative age groupings**  
   Define different age groupings (for example, two broad groups 40–64 and 65+) and repeat the comparison. How does the choice of grouping affect your impression of representativeness?

3. **Selection bias thought experiment**  
   Suppose that in a real cohort, people with very poor health are less likely to participate. Describe in a short paragraph how this could bias estimates of the association between physical activity and cardiovascular disease.

4. **Country-specific reference surveys**  
   In the United Kingdom, the **National Diet and Nutrition Survey (NDNS)** is often used as a reference for diet and some health indicators. Look up (outside this notebook) what NDNS is and which population it covers. How might you compare a United Kingdom cohort to NDNS in a similar way to what we have done here with NHANES?


## 9. Summary

- **Population, target population, and sample** are related but distinct concepts. It is essential to be explicit about each before analysing data.
- **Representativeness** concerns the similarity of the sample to the population of interest in terms of key characteristics and is closely linked to **external validity**.
- **Sampling error** arises because we observe only a finite sample. It decreases as the sample size increases but never disappears entirely.
- Using a reference survey such as **NHANES** allows us to compare the structure of a study sample (here: the FB2NEP synthetic cohort) to a broader population.
- Restricted or specialised cohorts can provide excellent information about associations **within** certain groups but are not automatically representative of all adults.
- Comparing your own data to reference surveys is a routine and important step in nutritional epidemiology, both for describing study populations and for assessing the generalisability of findings.
