# FB2NEP — Populations, Samples, and Representativeness

**Date:** 09 November 2025

This workbook introduces core ideas in sampling for nutritional epidemiology:

- The distinction between the **general population**, a **study's target population**, and the actual **sample**.
- The notions of **sampling frames**, **representativeness**, and **sampling error** versus **bias**.
- How a **probability sample** such as NHANES can be used as a reference.
- How to compare simple descriptive statistics between a **population** and a **sample**.
- How **sample size** affects the precision of estimates of central tendency.
- Why some famous cohorts (for example, NHS, HPFS) are intentionally **not** representative of all adults.

The main example uses **NHANES** demographic data as an approximately nationally representative survey for the United States. For teaching, we treat NHANES as the population, and draw repeated samples from it.

> Hippo cameo: later, a very conscientious hippo will volunteer for a nutrition survey to help us think about who ends up in a sample.


In [None]:
# Imports. Dependencies are kept minimal so that this notebook can run in Google Colab.
import pathlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Reproducibility: use a fixed random seed across the module.
RANDOM_SEED = 11088
rng = np.random.default_rng(RANDOM_SEED)

# Display options for cleaner tables.
pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)


## 1. Concepts and definitions

### 1.1 General population, target population, and sample

- **General population**: The entire group about which we ultimately wish to draw conclusions. For example, “all adults living in the United States”, or “all adults aged 20 years and older in the United Kingdom”.

- **Target population (study population)**: The subset of the general population that a particular study *intends* to represent. For example, “all female registered nurses aged 30–55 years in the United States at baseline”, or “all men aged 40–75 years who are practising health professionals”. This definition is conceptual and exists *before* sampling.

- **Sampling frame**: The operational list or mechanism used to select individuals (for example, a register of residents, a list of employees, a general practitioner list). People who are not in the sampling frame cannot be selected, even if they belong to the target population.

- **Sample**: The actual set of individuals who are recruited and provide data. The sample may differ from the target population because of non-response, exclusion criteria, and practical constraints.

In nutritional epidemiology, the target population is often broader than the realised sample. The key questions are:

1. *Who* did we intend to study?
2. *Who* did we actually study?
3. How different are these groups with respect to the characteristics that matter for our research question?
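These three questions can be made concrete with a small simulation. The sketch below is illustrative only: the population, the listing probabilities, and the response probabilities are invented assumptions, and the names `toy_target`, `toy_frame`, `toy_invited`, and `toy_sample` are hypothetical. It follows one characteristic (low income) from the target population through the sampling frame to the realised sample.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: none of these numbers come from a real study.
rng_demo = np.random.default_rng(2025)
n_target = 10_000

# Target population: assume about 30 % have low income.
toy_target = pd.DataFrame({
    "person_id": np.arange(n_target),
    "low_income": rng_demo.random(n_target) < 0.30,
})

# Sampling frame: assume low-income people are less likely to be listed.
p_listed = np.where(toy_target["low_income"], 0.60, 0.90)
toy_frame = toy_target[rng_demo.random(n_target) < p_listed]

# Invitation: a simple random draw of 1,000 people from the frame.
toy_invited = toy_frame.sample(n=1_000, random_state=1)

# Realised sample: assume low-income invitees respond less often.
p_respond = np.where(toy_invited["low_income"], 0.40, 0.70)
toy_sample = toy_invited[rng_demo.random(len(toy_invited)) < p_respond]

# Summarise size and proportion with low income at each stage.
stages = {
    "target": toy_target,
    "frame": toy_frame,
    "invited": toy_invited,
    "sample": toy_sample,
}
summary = pd.DataFrame({
    "n": {name: len(df) for name, df in stages.items()},
    "prop_low_income": {name: df["low_income"].mean() for name, df in stages.items()},
})
print(summary.round(3))
```

Under these assumptions the proportion with low income falls at each stage: first because of incomplete frame coverage, then because of non-response.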


### 1.2 Representativeness and bias

A study sample is **representative** of a population when the distribution of key characteristics in the sample matches that of the population of interest. Examples of such characteristics:

- Sex (for example, proportion female).
- Age distribution.
- Race and ethnicity.
- Education and socioeconomic position.

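Representativeness is most important for **external validity** (the ability to generalise findings beyond the study sample). A non-representative sample mainly distorts descriptive estimates such as prevalences and means. It can also lead to **selection bias**, where the association observed in the study differs from the association in the population because of who was included; this typically arises when participation is related to both the exposure and the outcome.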

It is useful to separate:

- **Sampling error**: Random variation in estimates because we observe only a finite sample rather than the whole population. Sampling error becomes smaller as the sample size increases.
- **Systematic bias**: Systematic differences between sample and population (for example, non-participation of people with low income) that do not disappear when the sample size increases.

In this workbook, we focus on sampling error and simple aspects of representativeness. More complex selection mechanisms and bias will be addressed later in the module.
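The difference between sampling error and systematic bias can be seen in a short simulation. This is a sketch under stated assumptions: the population is synthetic (BMI-like values drawn from a normal distribution), and the participation mechanism, in which people with higher values are less likely to take part, is invented for illustration.

```python
import numpy as np

rng_demo = np.random.default_rng(7)

# Hypothetical population of 100,000 BMI-like values.
population = rng_demo.normal(loc=27.0, scale=5.0, size=100_000)
true_mean = population.mean()

# Assumed selection mechanism: the higher the value, the lower the
# probability of being willing to participate.
p_participate = 1.0 / (1.0 + np.exp((population - 27.0) / 5.0))
willing = population[rng_demo.random(population.size) < p_participate]

results = {}
for n in (100, 1_000, 10_000):
    srs_err, sel_err = [], []
    for _ in range(200):
        # Simple random sample from the whole population.
        srs = rng_demo.choice(population, size=n, replace=False)
        # Random sample drawn only from people willing to participate.
        sel = rng_demo.choice(willing, size=n, replace=False)
        srs_err.append(srs.mean() - true_mean)
        sel_err.append(sel.mean() - true_mean)
    results[n] = (np.mean(srs_err), np.std(srs_err), np.mean(sel_err), np.std(sel_err))
    print(
        f"n = {n:6d} | random sample: mean error {results[n][0]:6.3f}, SD {results[n][1]:5.3f}"
        f" | selective sample: mean error {results[n][2]:6.3f}, SD {results[n][3]:5.3f}"
    )
```

In both designs the spread of the errors shrinks as n grows, but the average error of the selective sample stays well away from zero. A larger sample gives a more precise estimate of the wrong quantity.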


## 2. NHANES as a reference survey

**NHANES (National Health and Nutrition Examination Survey)** is a large, ongoing survey in the United States. It uses a complex, multistage probability sampling design so that, once survey weights are applied, estimates are approximately representative of the civilian, non-institutionalised US population.

For teaching purposes, we will treat NHANES as our **reference population**. We will:

1. Load a processed NHANES demographic dataset.
2. Recode age, sex, and race/ethnicity into simple categories.
3. Compare the NHANES distributions to approximate US Census values.

In a real analysis, one would also need to use the complex survey design variables (strata, clusters, and survey weights). Here we deliberately ignore these complications and treat the NHANES dataset as if it were a simple random sample. This is acceptable for the pedagogical purpose of illustrating populations and samples.
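To show what ignoring survey weights means in practice, here is a minimal sketch with invented data. The column name `weight` and all values are hypothetical and do not correspond to actual NHANES variables; the example only illustrates that a weighted proportion sums the weights in each group rather than counting rows.

```python
import pandas as pd

# Hypothetical toy sample, not NHANES data. Group B was deliberately
# oversampled, so each B participant represents fewer people.
toy = pd.DataFrame({
    "group": ["A"] * 6 + ["B"] * 4,
    "weight": [1500.0] * 6 + [250.0] * 4,  # people represented by each row
})

# Unweighted proportion: share of rows in each group.
unweighted = toy["group"].value_counts(normalize=True)

# Weighted proportion: share of the total weight in each group.
weighted = toy.groupby("group")["weight"].sum() / toy["weight"].sum()

comparison = pd.DataFrame({"unweighted": unweighted, "weighted": weighted})
print(comparison.round(3))
```

Group B makes up 40 % of the rows but only 10 % of the weighted total, so an unweighted analysis would overstate its share of the population. Correct standard errors would additionally need the strata and cluster variables, which this sketch does not cover.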


In [None]:
# Helper function: load a processed NHANES demographic CSV.
#
# The recommended workflow for this module is:
# - Store a cleaned NHANES demographic file in the repository, for example:
#     data/nhanes_demo_2017_2018.csv
#   with one row per participant and at least the following columns:
#     - age_years   (numeric, age in years)
#     - sex         ("Female" / "Male")
#     - race_eth    (simplified race/ethnicity categories)
#     - bmi         (numeric, kg/m^2)  [optional but useful]
#
# Optionally, a remote URL can be set so that the data are downloaded when
# the notebook is run in Colab. The local copy then serves as a back-up.

NHANES_REMOTE_URL = None  # e.g. "https://raw.githubusercontent.com/.../nhanes_demo_2017_2018.csv"
NHANES_LOCAL_CANDIDATES = [
    pathlib.Path("data/nhanes_demo_2017_2018.csv"),
    pathlib.Path("../data/nhanes_demo_2017_2018.csv"),
    pathlib.Path("../../data/nhanes_demo_2017_2018.csv"),
]

def load_nhanes_demo():
    """Load a processed NHANES demographic dataset.

    The function first tries to load from NHANES_REMOTE_URL (if set).
    If this fails or the URL is None, it searches for a local CSV
    in NHANES_LOCAL_CANDIDATES.
    """
    # Attempt remote download if a URL is specified.
    if NHANES_REMOTE_URL is not None:
        try:
            print(f"Attempting to download NHANES from {NHANES_REMOTE_URL} ...")
            df_remote = pd.read_csv(NHANES_REMOTE_URL)
            print("Loaded NHANES data from remote URL.")
            return df_remote
        except Exception as exc:  # noqa: BLE001
            print(f"Remote download failed: {exc}")
            print("Falling back to local files.")

    # Fall back to local candidates.
    for path in NHANES_LOCAL_CANDIDATES:
        if path.exists():
            print(f"Loading NHANES data from local file: {path}")
            return pd.read_csv(path)

    raise FileNotFoundError(
        "Could not find a NHANES demo CSV. Please add a processed "
        "file (e.g. nhanes_demo_2017_2018.csv) to the data/ folder."
    )

nhanes = load_nhanes_demo()
nhanes.head()


### 2.1 Recoding variables for teaching

For this workbook we create simple categorical variables:

- **Age group**: 20–39, 40–59, 60+ years.
- **Sex**: Female, Male.
- **Race/ethnicity**: simplified categories, for example White, Black, Hispanic, Asian, Other.

The processed NHANES file is assumed to contain already cleaned columns `age_years`, `sex`, and `race_eth`. If your file uses different column names, please adjust the code below.


In [None]:
# Derive age groups from age in years.
# We restrict to adults aged 20+ years for this workbook.

nhanes = nhanes.copy()
nhanes = nhanes[nhanes["age_years"] >= 20].reset_index(drop=True)

age_bins = [20, 40, 60, np.inf]
age_labels = ["20–39", "40–59", "60+"]
nhanes["age_group"] = pd.cut(
    nhanes["age_years"],
    bins=age_bins,
    labels=age_labels,
    right=False,
)

# Check the first few rows to confirm that the new variable looks sensible.
nhanes[["age_years", "age_group", "sex", "race_eth"]].head()


## 3. Comparing NHANES to US Census distributions

To assess how representative NHANES is, we compare the distributions of age group, sex, and race/ethnicity to approximate values from the **United States Census** (circa 2020). For simplicity, we use a small table of approximate proportions for adults; these are intended for teaching and are not official statistics.

We will: 

1. Compute the NHANES proportions for each category.
2. Provide a small table with approximate Census proportions.
3. Compute a **representation ratio** = NHANES proportion / Census proportion.
4. Plot the representation ratios.


In [None]:
# Helper: tidy proportion table for a categorical column.

def prop_table(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Return a tidy proportion table for a categorical column.

    The output has two columns: the category and the proportion.
    """
    out = (
        df[col]
        .value_counts(normalize=True)
        .rename("proportion")
        .reset_index()
        .rename(columns={"index": col})
    )
    return out

prop_table(nhanes, "sex")


In [None]:
# Approximate US Census distributions for adults (illustrative values).
# The exact numbers are not critical for the teaching purpose.

census_sex = pd.DataFrame({
    "sex": ["Female", "Male"],
    "census_prop": [0.509, 0.491],
})

census_age = pd.DataFrame({
    "age_group": ["20–39", "40–59", "60+"],
    # Roughly equal shares across the three groups for adults.
    "census_prop": [0.35, 0.33, 0.32],
})

census_race = pd.DataFrame({
    "race_eth": ["White", "Black", "Hispanic", "Asian", "Other"],
    # Approximate shares based on 2020 Census summaries.
    "census_prop": [0.58, 0.12, 0.19, 0.06, 0.05],
})

census_sex


In [None]:
# Compute representation ratios for NHANES versus Census.

def representation_table(sample_tab: pd.DataFrame, census_tab: pd.DataFrame, key: str) -> pd.DataFrame:
    """Merge sample and Census proportions and compute representation ratios.

    representation_ratio = sample_proportion / census_proportion
    """
    merged = sample_tab.merge(census_tab, on=key, how="outer", validate="one_to_one")
    merged = merged.rename(columns={"proportion": "sample_prop"})
    merged["representation_ratio"] = merged["sample_prop"] / merged["census_prop"]
    return merged

nhanes_sex = prop_table(nhanes, "sex")
nhanes_age = prop_table(nhanes, "age_group")
nhanes_race = prop_table(nhanes, "race_eth")

repr_sex = representation_table(nhanes_sex, census_sex, "sex")
repr_age = representation_table(nhanes_age, census_age, "age_group")
repr_race = representation_table(nhanes_race, census_race, "race_eth")

repr_sex


In [None]:
# Simple bar plots of representation ratios.

def plot_representation(df: pd.DataFrame, category_col: str, title: str) -> None:
    """Plot representation ratios for one variable.

    A value of 1.0 indicates that the NHANES proportion matches the
    Census proportion exactly.
    """
    df = df.copy()
    df = df.sort_values("representation_ratio")

    plt.figure(figsize=(6, 4))
    plt.bar(df[category_col].astype(str), df["representation_ratio"])
    plt.axhline(1.0, linestyle="--")
    plt.ylabel("Representation ratio (NHANES / Census)")
    plt.title(title)
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

plot_representation(repr_sex, "sex", "NHANES vs Census: sex")
plot_representation(repr_age, "age_group", "NHANES vs Census: age group")
plot_representation(repr_race, "race_eth", "NHANES vs Census: race/ethnicity")


### Interpretation

- A representation ratio close to 1.0 indicates that NHANES has a similar proportion to the Census for that category.
- Ratios above 1.0 indicate **over-representation**; ratios below 1.0 indicate **under-representation**.
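- NHANES is designed so that its *weighted* estimates are close to the US population. The ratios above are unweighted, and NHANES deliberately oversamples some groups (for example, some racial/ethnic groups and older adults), so these groups can appear over-represented here. Applying the survey weights would bring the proportions closer to the Census values, although small differences can remain.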


## 4. Central tendency and dispersion: population vs sample

We now treat the NHANES dataset as if it were the **population** of US adults. This is obviously an approximation, but it allows us to illustrate the difference between population quantities and sample estimates.

For a numeric variable such as **BMI** we will compute:

- **Mean** (average).
- **Median** (50th percentile).
- **Standard deviation** (spread around the mean).
- Selected **quantiles** (for example, 5th, 25th, 75th, 95th percentiles).

Then we will draw a *single* random sample from NHANES and compare the sample statistics with the population values.


In [None]:
# Check that a BMI column is available.

if "bmi" not in nhanes.columns:
    raise KeyError(
        "The NHANES dataset does not contain a 'bmi' column. "
        "Please add BMI (kg/m^2) to the processed file."
    )

# Compute "population" statistics using the full NHANES dataset.

pop_mean_bmi = nhanes["bmi"].mean()
pop_median_bmi = nhanes["bmi"].median()
pop_sd_bmi = nhanes["bmi"].std()
pop_quantiles_bmi = nhanes["bmi"].quantile([0.05, 0.25, 0.75, 0.95])

print("Population (NHANES) BMI statistics")
print(f"  Mean:   {pop_mean_bmi:5.2f}")
print(f"  Median: {pop_median_bmi:5.2f}")
print(f"  SD:     {pop_sd_bmi:5.2f}")
print("  Quantiles:")
print(pop_quantiles_bmi)


In [None]:
# Draw a single simple random sample from NHANES, without replacement.

SAMPLE_SIZE = 400

sample_indices = rng.choice(
    nhanes.index.to_numpy(), size=SAMPLE_SIZE, replace=False
)
sample = nhanes.loc[sample_indices].copy()

samp_mean_bmi = sample["bmi"].mean()
samp_median_bmi = sample["bmi"].median()
samp_sd_bmi = sample["bmi"].std()

print(f"Sample BMI statistics (n = {SAMPLE_SIZE})")
print(f"  Mean:   {samp_mean_bmi:5.2f}")
print(f"  Median: {samp_median_bmi:5.2f}")
print(f"  SD:     {samp_sd_bmi:5.2f}")

print("\nDifferences (sample minus population):")
print(f"  Mean difference:   {samp_mean_bmi - pop_mean_bmi:6.3f}")
print(f"  Median difference: {samp_median_bmi - pop_median_bmi:6.3f}")


The sample statistics will usually be **close**, but not identical, to the population values. The difference reflects **sampling error**.

If we draw a new sample of 400 adults, we obtain slightly different estimates. As we increase the sample size, the variability of the estimates decreases.
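In a real study we usually have only one sample, so the likely size of the sampling error is estimated from that sample itself using the standard error of the mean, SE = s/√n. The sketch below uses synthetic values as a stand-in for a sample of 400 BMI measurements (the normal distribution and its parameters are assumptions; real BMI is right-skewed) and computes an approximate 95 % confidence interval.

```python
import numpy as np

rng_demo = np.random.default_rng(123)

# Hypothetical stand-in for one sample of 400 BMI values.
toy_bmi = rng_demo.normal(loc=28.0, scale=6.0, size=400)

n = toy_bmi.size
mean = toy_bmi.mean()
sd = toy_bmi.std(ddof=1)   # sample standard deviation
se = sd / np.sqrt(n)       # standard error of the mean

# Approximate 95 % confidence interval using the normal quantile 1.96.
ci_low = mean - 1.96 * se
ci_high = mean + 1.96 * se

print(f"n = {n}, mean = {mean:.2f}, SD = {sd:.2f}, SE = {se:.3f}")
print(f"Approximate 95 % CI for the mean: {ci_low:.2f} to {ci_high:.2f}")
```

The same calculation can be applied to the `sample` drawn above by replacing `toy_bmi` with `sample["bmi"].dropna().to_numpy()`.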


## 5. Sampling variability and sample size

We now demonstrate how **sample size** affects the precision of the sample mean. We repeatedly draw samples of different sizes from the NHANES dataset and record the mean BMI each time.

Typical sample sizes:

- Small study: n = 100.
- Medium study: n = 500.
- Larger study: n = 2 000.

For each n we will draw 500 samples and visualise the distribution of sample means.


In [None]:
# Simulate sampling distributions of the mean BMI for different sample sizes.

def sample_mean_bmi(df: pd.DataFrame, n: int, rng: np.random.Generator) -> float:
    """Draw a simple random sample of size n and return its mean BMI."""
    idx = rng.choice(df.index.to_numpy(), size=n, replace=False)
    return df.loc[idx, "bmi"].mean()

def simulate_means(df: pd.DataFrame, n: int, n_sim: int, rng: np.random.Generator) -> np.ndarray:
    """Return an array of sample means from repeated sampling."""
    means = np.empty(n_sim)
    for i in range(n_sim):
        means[i] = sample_mean_bmi(df, n, rng)
    return means

sample_sizes = [100, 500, 2000]
n_sim = 500

sampling_results = {}
for n in sample_sizes:
    sampling_results[n] = simulate_means(nhanes, n, n_sim, rng)

sampling_results[100][:5]


In [None]:
# Plot histograms of sample means for each sample size.

for n in sample_sizes:
    means = sampling_results[n]
    plt.figure(figsize=(6, 4))
    plt.hist(means, bins=20)
    plt.axvline(pop_mean_bmi, linestyle="--")
    plt.title(f"Sampling distribution of mean BMI (n = {n})")
    plt.xlabel("Sample mean BMI")
    plt.ylabel("Frequency (across simulations)")
    plt.tight_layout()
    plt.show()

    print(f"n = {n}")
    print(f"  Mean of sample means:      {means.mean():6.3f}")
    print(f"  SD of sample means:        {means.std():6.3f}")
    print(f"  Population mean (NHANES):  {pop_mean_bmi:6.3f}\n")


### Interpretation

- The **centre** of the sampling distribution (mean of sample means) is close to the true population mean. This is expected when we use simple random sampling.
- The **spread** of the sampling distribution (standard deviation of sample means) decreases as the sample size increases. For large n, different samples give very similar mean estimates.
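- This spread is approximately proportional to **1/√n**, so quadrupling the sample size roughly halves the standard deviation of the sample means. Because we sample without replacement from a finite dataset, the spread shrinks somewhat faster than 1/√n once n is a large fraction of the dataset. The 1/√n relationship is a core idea behind many statistical concepts (for example, standard errors and confidence intervals).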
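The 1/√n relationship can be checked numerically. The sketch below is a self-contained illustration on a synthetic finite population (5,000 gamma-distributed, BMI-like values, chosen arbitrarily); it compares the empirical SD of the sample means with σ/√n and with the same quantity multiplied by the finite population correction √((N − n)/(N − 1)), which applies when sampling without replacement.

```python
import numpy as np

rng_demo = np.random.default_rng(42)

# Hypothetical finite "population" of BMI-like, right-skewed values.
N = 5_000
population = rng_demo.gamma(shape=25.0, scale=1.15, size=N)
sigma = population.std()  # population SD (divisor N)

rows = []
for n in (100, 500, 2_000):
    # Empirical sampling distribution: 500 samples without replacement.
    means = np.array([
        rng_demo.choice(population, size=n, replace=False).mean()
        for _ in range(500)
    ])
    se_simple = sigma / np.sqrt(n)       # ignores the finite population
    fpc = np.sqrt((N - n) / (N - 1))     # finite population correction
    rows.append((n, means.std(), se_simple, se_simple * fpc))

for n, empirical, simple, corrected in rows:
    print(
        f"n = {n:5d}  empirical SD = {empirical:.3f}  "
        f"sigma/sqrt(n) = {simple:.3f}  with FPC = {corrected:.3f}"
    )
```

For small n the two theoretical values are almost identical. At n = 2 000 the sample covers 40 % of this toy population, and the empirical spread follows the corrected value rather than σ/√n.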


## 6. Specific cohorts: NHS and HPFS as restricted populations

Not all studies aim to represent the general population. Two famous examples are:

- **Nurses' Health Study (NHS)**: female registered nurses (the original cohort enrolled women aged 30–55 years).
- **Health Professionals Follow-up Study (HPFS)**: male health professionals (aged 40–75 years at enrolment).

These designs have advantages:

- Participants are relatively homogeneous in education and occupation.
- Outcome ascertainment and long-term follow-up can be easier.
- Response rates can be higher than in a general population sample.

However, they are **not representative** of all adults. The target population for NHS is “female nurses”, not “all adults in the United States”.

We can illustrate this by constructing toy “NHS-like” and “HPFS-like” subsets from NHANES.


In [None]:
# Toy example: construct simple NHS-like and HPFS-like subsamples.
#
# Assumptions for the processed NHANES file:
# - An education variable 'education' exists with categories such as
#   "≤High school", "Some college", "Bachelor+".
# If this is not available, this block can be skipped or adapted.

if "education" in nhanes.columns:
    # NHS-like: female, age 30–55, higher education.
    nhs_like = nhanes[
        (nhanes["sex"] == "Female")
        & (nhanes["age_years"].between(30, 55))
        & (nhanes["education"] == "Bachelor+")
    ].copy()

    # HPFS-like: male, age 40–75, higher education.
    hpfs_like = nhanes[
        (nhanes["sex"] == "Male")
        & (nhanes["age_years"].between(40, 75))
        & (nhanes["education"] == "Bachelor+")
    ].copy()

    print(f"Toy NHS-like sample size:  {len(nhs_like)}")
    print(f"Toy HPFS-like sample size: {len(hpfs_like)}")
else:
    nhs_like = None
    hpfs_like = None
    print(
        "No 'education' column found. Toy NHS/HPFS examples are skipped. "
        "You may add such a column to the NHANES file to enable this section."
    )


In [None]:
# Compare simple distributions for the toy cohorts to the full NHANES dataset.

def compare_to_nhanes(df: pd.DataFrame, label: str, col: str) -> pd.DataFrame:
    """Create a table comparing a cohort to NHANES for one variable."""
    nhanes_tab = prop_table(nhanes, col).rename(columns={"proportion": "nhanes_prop"})
    cohort_tab = prop_table(df, col).rename(columns={"proportion": "cohort_prop"})
    merged = nhanes_tab.merge(cohort_tab, on=col, how="outer")
    merged["cohort"] = label
    return merged

if nhs_like is not None and len(nhs_like) > 0:
    comp_nhs_age = compare_to_nhanes(nhs_like, "NHS-like", "age_group")
    comp_nhs_sex = compare_to_nhanes(nhs_like, "NHS-like", "sex")
    print("NHS-like vs NHANES (age_group):")
    print(comp_nhs_age)
    print("\nNHS-like vs NHANES (sex):")
    print(comp_nhs_sex)

if hpfs_like is not None and len(hpfs_like) > 0:
    comp_hpfs_age = compare_to_nhanes(hpfs_like, "HPFS-like", "age_group")
    comp_hpfs_sex = compare_to_nhanes(hpfs_like, "HPFS-like", "sex")
    print("\nHPFS-like vs NHANES (age_group):")
    print(comp_hpfs_age)
    print("\nHPFS-like vs NHANES (sex):")
    print(comp_hpfs_sex)


### Interpretation

- The toy **NHS-like** and **HPFS-like** subsets deliberately focus on specific groups (e.g. higher education, particular sex and age ranges).
- They are therefore **not representative** of the general adult population, even if the underlying NHANES sample is.
- Such cohorts can still provide very valuable evidence, particularly for questions where the internal validity and detailed exposure assessment are more important than direct generalisation to all adults.

> Hippo cameo: imagine a very diligent hippo who is a nurse and joins the Nurses' Health Study. The hippo helps answer questions about diet and health **in nurses**, but the findings may not tell us directly what happens in hippos that are farmers, engineers, or unemployed.


## 7. Comparing your own study data to NHANES

In the FB2NEP module we will work with a synthetic epidemiological dataset. A natural question is:

> **How similar is our study sample to a reference population such as NHANES?**

The steps are:

1. Make sure your dataset contains variables that can be mapped to those in NHANES (for example, `sex`, `age_group`, and `race_eth`).
2. Compute proportions in your own dataset.
3. Compare them to NHANES using the same helpers as above.

Below is a template. In the FB2NEP repository, `df_study` will typically come from the module bootstrap script.


In [None]:
# Template: replace this with your study dataset.
# For the FB2NEP synthetic dataset, you will usually have a cell such as:
#
#     from scripts.bootstrap import init
#     df_study, ctx = init()
#
# Here we simply copy a small random subset of NHANES as a placeholder.

df_study = nhanes.sample(n=600, random_state=RANDOM_SEED).copy()

study_sex = prop_table(df_study, "sex").rename(columns={"proportion": "study_prop"})
study_age = prop_table(df_study, "age_group").rename(columns={"proportion": "study_prop"})
study_race = prop_table(df_study, "race_eth").rename(columns={"proportion": "study_prop"})

nhanes_sex = prop_table(nhanes, "sex").rename(columns={"proportion": "nhanes_prop"})
nhanes_age = prop_table(nhanes, "age_group").rename(columns={"proportion": "nhanes_prop"})
nhanes_race = prop_table(nhanes, "race_eth").rename(columns={"proportion": "nhanes_prop"})

comp_sex = nhanes_sex.merge(study_sex, on="sex", how="outer")
comp_age = nhanes_age.merge(study_age, on="age_group", how="outer")
comp_race = nhanes_race.merge(study_race, on="race_eth", how="outer")

print("Study vs NHANES: sex")
print(comp_sex)
print("\nStudy vs NHANES: age_group")
print(comp_age)
print("\nStudy vs NHANES: race_eth")
print(comp_race)


## 8. Exercises (for students)

1. **Population vs sample**  
   Change the `SAMPLE_SIZE` in Section 4 (for example, to 100, 1 000, 5 000) and observe how the sample mean BMI differs from the NHANES population mean. Summarise what you observe.

2. **Sampling variability**  
   In Section 5, extend the list of sample sizes to include `50` and `10000` (if NHANES is large enough; sampling without replacement fails when n exceeds the number of rows). How does the spread of sample means change? Relate your findings to the idea that standard errors shrink with 1/√n.

3. **Representativeness of NHANES**  
   Modify the `census_race` table to reflect a different set of categories (for example, merging “Other” and “Asian” or splitting “Hispanic” into subgroups). How sensitive are your representation ratios to such recoding decisions?

4. **Specific cohorts**  
   Adapt the toy NHS-like and HPFS-like definitions. For example, restrict to non-smokers or older age groups. Discuss how these design choices might improve internal validity but reduce representativeness of the general population.

5. **Your own dataset**  
   Replace `df_study` with the synthetic FB2NEP dataset. Compare sex, age group, and race/ethnicity distributions to NHANES. Write a short paragraph on how similar or different your study population is, and what that implies for generalising results to the US adult population.


## 9. Summary

- **Population, target population, and sample** are distinct concepts. Clarity about each is essential before analysing data.
- **Representativeness** concerns the similarity of the sample to the population of interest in terms of key characteristics. It is mainly about external validity.
- **Sampling error** decreases as the sample size increases. Even unbiased samples will vary from the population just by chance.
- **NHANES** provides a useful example of an approximately nationally representative survey for adults in the United States, against which we can compare other samples.
- Famous cohorts such as **NHS** and **HPFS** are intentionally restricted to specific professional groups. They can be excellent for internal validity but are not representative of all adults.
- Comparing your own study data to a reference survey helps you understand **who you are studying** and how far you can generalise your findings.
