# FB2NEP — Sampling, Representativeness, and Bias

**Date:** 09 November 2025

This notebook introduces core ideas in sampling for nutritional epidemiology: the difference between the **general population**, a **study's target (sample) population**, and an actual **study sample**; common **recruitment biases**; and approaches to **addressing bias** (e.g., post-stratification weights). We will simulate data to compare well-known United States cohorts (by design archetype) with a United States reference survey (NHANES-like), and demonstrate how bias can distort estimates and how weighting can partially correct them.

> **Note:** The cohort labels below are illustrative archetypes (e.g., “NHS-like”, “HPFS-like”, “WHI-like”), generated using simulated data for teaching purposes.

## Learning outcomes
By the end of this notebook, students should be able to:
- Distinguish clearly between **general population**, **sample population**, and **study sample**.
- Explain what **representativeness** means and why it matters for inference.
- Identify typical **recruitment biases** (e.g., “worried well”, sex imbalance, education skew).
- Apply simple **post-stratification (weighting)** to address imbalance and evaluate its effect.
- Compare large observational studies to the general public and discuss **under-representation**.
- Reflect on what this means **globally**, including issues of transferability beyond the US.
- Discuss why a global multi-country study may fail when using **US-centric dietary assessment** methods (e.g., limited food lists, portion sizes, and misclassification across diverse cuisines).

> Hippo cameo: only one, later, and strictly pedagogical.


In [None]:

# Imports. We keep dependencies minimal for Colab friendliness.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# For reproducibility across sessions (as requested for the module).
RANDOM_SEED = 11088
rng = np.random.default_rng(RANDOM_SEED)

# Display options for cleaner tables.
pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)


## 1. Concepts and definitions

- **General population**: The entire group of interest (for example, all US adults aged 20+).
- **Sample population (target population)**: The subset of the general population that a study *intends* to represent (for example, all registered nurses aged 30–55 in the United States at baseline).
- **Study sample**: The actual individuals recruited and measured. This may deviate from the intended target population due to recruitment processes, non-response, and eligibility criteria.

### Representativeness
A study is **representative** when the distribution of key characteristics in the sample matches that of the population of interest (for example, sex, age, ethnicity, education, socioeconomic position). Lack of representativeness can bias estimates and reduce external validity (generalisation).

### Recruitment bias (examples)
- **“Worried well”**: People who are more health-conscious are more likely to join studies.
- **Occupational cohorts**: Often healthier, with higher education and different behaviours than the general public.
- **Sex imbalance**: Some studies include only women (e.g., women’s health initiatives) or only men (e.g., certain professional cohorts).
- **Under-representation**: Minority groups or lower-income strata may be under-represented.

### Addressing bias
- **Design stage**: Sampling frames, stratified sampling, oversampling under-represented groups.
- **Analysis stage**: **Post-stratification weighting** to align sample distributions to a reference population (e.g., NHANES). Weighting helps for variables captured in the weighting scheme; it cannot fix unmeasured differences.
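
As a minimal sketch of the analysis-stage idea, the weight for a stratum can be taken as its population share divided by its sample share. The toy data below are invented for illustration and are not part of the simulation that follows.

```python
import pandas as pd

# Hypothetical education shares in a reference population (invented values).
pop_share = pd.Series({"≤High school": 0.40, "Some college": 0.35, "Bachelor+": 0.25})

# Hypothetical skewed sample of 10 people: graduates are heavily over-represented.
sample = pd.DataFrame({
    "education": ["Bachelor+"] * 6 + ["Some college"] * 3 + ["≤High school"] * 1,
    "fv_day":    [4.0] * 6 + [3.0] * 3 + [2.0] * 1,
})

# Post-stratification weight = population share / sample share of the stratum.
sample_share = sample["education"].value_counts(normalize=True)
stratum_weight = pop_share / sample_share          # aligned on category labels
sample["weight"] = sample["education"].map(stratum_weight)

unweighted = sample["fv_day"].mean()
weighted = (sample["fv_day"] * sample["weight"]).sum() / sample["weight"].sum()
print(f"unweighted mean: {unweighted:.2f}, weighted mean: {weighted:.2f}")
```

The single ≤High school respondent carries a weight of 4.0 here, which shows how weighting shifts influence towards under-represented strata, at the cost of leaning heavily on very few people.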


## 2. Simulated NHANES-like reference population

We simulate a large reference dataset with distributions loosely inspired by modern US adult demographics. These are **synthetic** and for teaching only. Variables include:
- Sex (Female/Male)
- Age group (20–39 / 40–59 / 60+)
- Race/ethnicity (simplified categories)
- Education (≤High school / Some college / Bachelor+)
- Income band (Low / Middle / High)
- Body mass index (BMI) (continuous, generated with group-specific means)
- Current smoking (Yes/No)
- “Fruit & Vegetables” (F&V) portions per day (simulated dietary proxy)

We will treat this reference as the “population” for assessing representativeness.


In [None]:

# --------------------------
# Helper: categorical sampler
# --------------------------
def sample_categorical(rng, categories, probs, size):
    """Sample from named categories with specified probabilities."""
    return rng.choice(categories, size=size, p=np.array(probs))

N_POP = 150_000  # big enough to act as a pseudo-population

# Define marginal distributions for the synthetic US adult population.
sex_cats = ["Female", "Male"]
sex_p = [0.51, 0.49]

age_cats = ["20–39", "40–59", "60+"]
age_p = [0.34, 0.34, 0.32]

race_cats = ["White", "Black", "Hispanic", "Asian", "Other"]
race_p = [0.58, 0.12, 0.19, 0.06, 0.05]

edu_cats = ["≤High school", "Some college", "Bachelor+"]
edu_p = [0.40, 0.35, 0.25]

inc_cats = ["Low", "Middle", "High"]
inc_p = [0.35, 0.45, 0.20]

# Generate independent marginals. In reality, these are correlated;
# we keep it simple for pedagogy, then induce outcome differences by group.
pop = pd.DataFrame({
    "sex": sample_categorical(rng, sex_cats, sex_p, N_POP),
    "age_group": sample_categorical(rng, age_cats, age_p, N_POP),
    "race_eth": sample_categorical(rng, race_cats, race_p, N_POP),
    "education": sample_categorical(rng, edu_cats, edu_p, N_POP),
    "income": sample_categorical(rng, inc_cats, inc_p, N_POP),
})

# Create a simple binary current-smoking variable with group differences.
base_smoke = 0.16  # baseline prevalence
smoke = np.full(N_POP, base_smoke)

# Adjust smoking by education (lower education -> higher smoking)
smoke += np.where(pop["education"].eq("≤High school"), 0.06, 0.0)
smoke += np.where(pop["education"].eq("Bachelor+"), -0.05, 0.0)

# Adjust smoking by age (younger slightly higher than 60+ here, for illustration)
smoke += np.where(pop["age_group"].eq("20–39"), 0.02, 0.0)
smoke += np.where(pop["age_group"].eq("60+"), -0.02, 0.0)

# Clip to [0,1] and sample
smoke = np.clip(smoke, 0, 1)
pop["smoker"] = rng.binomial(1, smoke).astype(bool)

# Simulate BMI with group-level means (education & age effects) and random noise.
# This is purely pedagogical; do not interpret as real surveillance values.
bmi_mean = 28.0
bmi = np.full(N_POP, bmi_mean, dtype=float)
bmi += np.where(pop["education"].eq("Bachelor+"), -1.0, 0.0)
bmi += np.where(pop["education"].eq("≤High school"), +0.7, 0.0)
bmi += np.where(pop["age_group"].eq("60+"), +0.5, 0.0)
bmi += rng.normal(0, 3.5, N_POP)  # individual noise
pop["bmi"] = bmi

# Simulate Fruit & Veg portions/day (F&V) as a simple diet quality proxy.
fv = np.full(N_POP, 3.2, dtype=float)  # baseline
fv += np.where(pop["education"].eq("Bachelor+"), +0.8, 0.0)
fv += np.where(pop["education"].eq("≤High school"), -0.4, 0.0)
fv += np.where(pop["smoker"], -0.5, 0.0)
fv += rng.normal(0, 0.7, N_POP)
pop["fv_day"] = np.clip(fv, 0, None)

pop.head()


**Check basic distributions in the simulated population.**

In [None]:

def prop_table(df, col):
    """Return a tidy proportion table for a categorical column."""
    out = (df[col].value_counts(normalize=True)
             .rename("proportion")
             .reset_index()
             .rename(columns={"index": col}))
    return out

summary_demog = {
    "sex": prop_table(pop, "sex"),
    "age_group": prop_table(pop, "age_group"),
    "race_eth": prop_table(pop, "race_eth"),
    "education": prop_table(pop, "education"),
    "income": prop_table(pop, "income"),
}
summary_demog


## 3. Study archetypes and samples

We now define *archetypal* United States cohorts (for teaching only) and draw samples from the same underlying population with **selection probabilities** that mimic key design features:

- **NHS-like (Nurses' Health Study)**: Predominantly women with higher education. Under-representation of some minority groups is typical of this design, although the simulation below applies no race/ethnicity tilt.
- **HPFS-like (Health Professionals Follow-up Study)**: Predominantly men with higher education.
- **WHI-like (Women’s Health Initiative)**: Women aged 50–79 at enrolment (we approximate using our 40–59 and 60+ groups, but apply stronger selection weight for 60+).

For contrast, we also create a **Convenience online sample**: over-represents health-conscious non-smokers with higher F&V intake and higher education (a “worried well” pattern).
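
To see how relative selection weights translate into sample composition, the sketch below computes the expected share of women under NHS-like weights and checks it by simulation. It uses the sex shares (0.51 / 0.49) and weights (4.0 / 0.2) from this notebook, but draws with replacement for simplicity, which is an assumption that differs slightly from the sampler defined below.

```python
import numpy as np

pop_share = {"Female": 0.51, "Male": 0.49}     # population shares
sel_weight = {"Female": 4.0, "Male": 0.2}      # relative selection weights

# Expected sample share is proportional to (population share x selection weight).
unnormalised = {k: pop_share[k] * sel_weight[k] for k in pop_share}
total = sum(unnormalised.values())
expected_share = {k: v / total for k, v in unnormalised.items()}

# Check by simulation: draw individuals with probability proportional to their weight.
rng = np.random.default_rng(0)
sex = rng.choice(list(pop_share), size=200_000, p=list(pop_share.values()))
w = np.where(sex == "Female", sel_weight["Female"], sel_weight["Male"])
drawn = rng.choice(sex, size=5_000, replace=True, p=w / w.sum())
simulated_female_share = float(np.mean(drawn == "Female"))

print(f"expected: {expected_share['Female']:.3f}, simulated: {simulated_female_share:.3f}")
```

A 20-fold difference in selection weight is enough to push the expected female share above 95%, even though women make up only about half of the population.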


In [None]:

# --------------------------
# Helper: weighted sampler
# --------------------------
def draw_weighted_sample(df, p_select, n_draw, rng):
    """Draw n_draw rows from df with probability proportional to p_select."""
    p = p_select / p_select.sum()
    idx = rng.choice(df.index.values, size=n_draw, replace=False, p=p)
    return df.loc[idx].copy()

# Selection weights for each archetype
def selection_weights_nhs_like(df):
    w = np.ones(len(df))
    w *= np.where(df["sex"].eq("Female"), 4.0, 0.2)          # mostly women
    w *= np.where(df["education"].eq("Bachelor+"), 2.0, 1.0) # higher education
    w *= np.where(df["race_eth"].eq("Asian"), 1.0, 1.0)      # placeholder: no race/ethnicity tilt applied (lower the first value to under-represent a group)
    return w

def selection_weights_hpfs_like(df):
    w = np.ones(len(df))
    w *= np.where(df["sex"].eq("Male"), 4.0, 0.2)            # mostly men
    w *= np.where(df["education"].eq("Bachelor+"), 2.5, 1.0) # higher education
    return w

def selection_weights_whi_like(df):
    w = np.ones(len(df))
    w *= np.where(df["sex"].eq("Female"), 5.0, 0.01)         # women only (virtually)
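    # Enrolment was at ages 50–79: favour 60+, keep some 40–59, virtually exclude 20–39.
    w *= np.select([df["age_group"].eq("60+"), df["age_group"].eq("40–59")], [3.5, 0.8], default=0.01)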
    return w

def selection_weights_convenience(df):
    w = np.ones(len(df))
    w *= np.where(df["education"].eq("Bachelor+"), 2.5, 1.0) # over-represent higher education
    w *= np.where(df["smoker"].eq(False), 1.8, 0.5)          # fewer smokers
    # Small tilt towards higher F&V consumers
    w *= 1.0 + (df["fv_day"] - df["fv_day"].mean()).to_numpy() * 0.05  # keep w a plain NumPy array
    return np.clip(w, 0.01, None)

# Draw samples (n ~ 5,000 for stable comparisons).
N_SAMPLE = 5_000
nhs = draw_weighted_sample(pop, selection_weights_nhs_like(pop), N_SAMPLE, rng)
hpfs = draw_weighted_sample(pop, selection_weights_hpfs_like(pop), N_SAMPLE, rng)
whi = draw_weighted_sample(pop, selection_weights_whi_like(pop), N_SAMPLE, rng)
conv = draw_weighted_sample(pop, selection_weights_convenience(pop), N_SAMPLE, rng)

for name, df in [("NHS_like", nhs), ("HPFS_like", hpfs), ("WHI_like", whi), ("Convenience_online", conv)]:
    df["cohort"] = name

cohorts = pd.concat([nhs, hpfs, whi, conv], ignore_index=True)
cohorts.head()


**Compare key distributions** across population and cohorts.

In [None]:

def compare_distribution(pop_df, sample_df, col):
    pop_prop = prop_table(pop_df, col).set_index(col)
    samp_prop = prop_table(sample_df, col).set_index(col)
    # Align categories
    all_idx = pop_prop.index.union(samp_prop.index)
    pop_prop = pop_prop.reindex(all_idx).fillna(0.0)
    samp_prop = samp_prop.reindex(all_idx).fillna(0.0)
    comp = pd.concat([pop_prop, samp_prop], axis=1)
    comp.columns = ["population_prop", "sample_prop"]
    comp["representation_ratio"] = comp["sample_prop"] / comp["population_prop"].replace(0, np.nan)
    return comp.reset_index()

def cohort_tables(col):
    out = []
    for name, df in [("NHS_like", nhs), ("HPFS_like", hpfs), ("WHI_like", whi), ("Convenience_online", conv)]:
        comp = compare_distribution(pop, df, col)
        comp.insert(0, "cohort", name)
        out.append(comp)
    return pd.concat(out, ignore_index=True)

tables_sex = cohort_tables("sex")
tables_age = cohort_tables("age_group")
tables_race = cohort_tables("race_eth")
tables_edu = cohort_tables("education")

tables_sex, tables_age.head(10)


### Visualise representation ratios

A **representation ratio** = (sample proportion) / (population proportion).  
- ≈ 1.0 means the group is proportionally represented.  
- < 1.0 means **under-represented**.  
- > 1.0 means **over-represented**.
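
A small worked example, using invented proportions, shows how the ratio is computed and read:

```python
import pandas as pd

# Hypothetical proportions, invented for illustration.
population = pd.Series({"Female": 0.51, "Male": 0.49})
sample = pd.Series({"Female": 0.95, "Male": 0.05})

# Representation ratio = sample proportion / population proportion.
ratio = (sample / population).round(2)
print(ratio)
```

Here women are over-represented by a factor of roughly 1.9, while men appear at about one tenth of their population share.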


In [None]:

def plot_representation(comp_df, col, cohort_name):
    # Filter to one cohort and plot the representation ratios
    df = comp_df[comp_df["cohort"].eq(cohort_name)].copy()
    df = df.sort_values("representation_ratio")
    plt.figure(figsize=(8, 4.5))
    plt.bar(df[col].astype(str), df["representation_ratio"])
    plt.axhline(1.0, linestyle="--")
    plt.title(f"Representation ratios — {cohort_name} ({col})")
    plt.ylabel("Representation ratio (sample / population)")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

for c in ["NHS_like", "HPFS_like", "WHI_like", "Convenience_online"]:
    plot_representation(tables_sex, "sex", c)
    plot_representation(tables_age, "age_group", c)
    plot_representation(tables_race, "race_eth", c)
    plot_representation(tables_edu, "education", c)


## 4. Post-stratification weighting: a worked example

We demonstrate how **weights** can partially correct bias for an estimator. Suppose the estimand is **mean F&V portions/day** in US adults. We will:
1. Compute the true mean in the simulated population.
2. Compute the (biased) mean in a skewed cohort (e.g., the “Convenience online” sample).
3. Construct **raking-style** weights (iterative proportional fitting) that match the population *margins* for **sex, age group, and education** one variable at a time, rather than the full joint cells.
4. Recompute the **weighted mean** and compare with the truth.
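
Step 3 can be sketched on a toy sample before applying it to the cohort. The function name `rake`, the eight-person sample, and the target margins are all hypothetical. The sketch assumes unique index labels and that every category in the sample appears in the targets.

```python
import pandas as pd

def rake(df, targets, n_iter=100, tol=1e-10):
    """Iterative proportional fitting: rescale weights one margin at a time."""
    w = pd.Series(1.0, index=df.index)
    for _ in range(n_iter):
        w_prev = w.copy()
        for col, target in targets.items():
            current = w.groupby(df[col]).sum() / w.sum()   # current weighted shares
            factor = pd.Series(target) / current            # target / current per category
            w = w * df[col].map(factor)
        if (w - w_prev).abs().max() < tol:                 # weights have stopped changing
            break
    return w

# Hypothetical toy sample: too many women and too many "High" education respondents.
toy = pd.DataFrame({
    "sex": ["F", "F", "F", "M", "F", "F", "M", "F"],
    "edu": ["High", "High", "Low", "High", "High", "Low", "Low", "High"],
})
targets = {"sex": {"F": 0.5, "M": 0.5}, "edu": {"Low": 0.6, "High": 0.4}}

w = rake(toy, targets)
sex_share = w.groupby(toy["sex"]).sum() / w.sum()
edu_share = w.groupby(toy["edu"]).sum() / w.sum()
print(sex_share.round(3).to_dict(), edu_share.round(3).to_dict())
```

After convergence both weighted margins equal their targets, but nothing constrains the joint sex-by-education cells. This is the main difference from post-stratification on fully crossed strata.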

> Limitation: Weighting corrects only for variables included in the weighting scheme and measured without error. It does not eliminate bias from unmeasured differences or selection on outcomes.


In [None]:

# Helper: compute marginal distributions (proportions) for weighting
def marginal_props(df, col):
    vc = df[col].value_counts(normalize=True)
    return vc.to_dict()

pop_margins = {
    "sex": marginal_props(pop, "sex"),
    "age_group": marginal_props(pop, "age_group"),
    "education": marginal_props(pop, "education"),
}

# Simple iterative proportional fitting (raking) for three margins.
def rake_weights(df, cols, target_margins, max_iter=30, tol=1e-7):
    w = np.ones(len(df), dtype=float)
    for _ in range(max_iter):
        w_old = w.copy()
        for col in cols:
            # current weighted shares by category; group positionally, because
            # df may keep population index labels that do not match positions in w
            shares = pd.Series(w).groupby(df[col].to_numpy()).sum()
            shares = shares / shares.sum()
            # compute adjustment factors for each category
            adj = {}
            for k, target in target_margins[col].items():
                current = shares.get(k, 0.0)
                if current == 0:
                    adj[k] = 1.0  # avoid division by zero; no members in sample
                else:
                    adj[k] = target / current
            # apply
            factors = df[col].map(adj).to_numpy()
            w = w * factors
        # convergence check
        if np.max(np.abs(w - w_old)) < tol:
            break
    return w

# Choose the Convenience sample (most skewed) for the demo
samp = conv.copy()
true_pop_mean = pop["fv_day"].mean()
unweighted_mean = samp["fv_day"].mean()

weights = rake_weights(samp, ["sex", "age_group", "education"], pop_margins)
samp["weight"] = weights
weighted_mean = np.average(samp["fv_day"], weights=samp["weight"])

true_pop_mean, unweighted_mean, weighted_mean


**Interpretation.**  
- The **true** population mean F&V is given by the (simulated) NHANES-like reference.  
- The **unweighted** convenience sample mean is expected to be **higher** (over-representation of health-conscious participants).  
- The **weighted** estimate should move **towards** the population truth but is unlikely to reach it: selection into this sample also depends on smoking and on F&V intake itself, neither of which is included in the weighting scheme.
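
The partial nature of the correction can be illustrated in isolation. In the sketch below, all parameter values are invented. Selection depends on education, which is used for weighting, and on the outcome itself, which is not.

```python
import numpy as np

rng = np.random.default_rng(42)
n_pop = 100_000

# Hypothetical population: 25% graduates, who eat 0.8 more portions/day.
graduate = rng.random(n_pop) < 0.25
fv = 3.0 + 0.8 * graduate + rng.normal(0, 0.7, n_pop)

# Selection depends on education AND directly on the outcome (health-conscious joiners).
sel = np.where(graduate, 3.0, 1.0) * np.exp(0.5 * (fv - fv.mean()))
idx = rng.choice(n_pop, size=5_000, replace=False, p=sel / sel.sum())
s_grad, s_fv = graduate[idx], fv[idx]

# Weight on education only: population share / sample share.
pop_share, samp_share = graduate.mean(), s_grad.mean()
w = np.where(s_grad, pop_share / samp_share, (1 - pop_share) / (1 - samp_share))

true_mean = fv.mean()
unweighted_mean = s_fv.mean()
weighted_mean = np.average(s_fv, weights=w)
print(f"true {true_mean:.2f} | unweighted {unweighted_mean:.2f} | weighted {weighted_mean:.2f}")
```

Weighting removes the part of the bias that runs through education. The part caused by selection on the outcome within each education group remains, because no measured weighting variable captures it.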


## 5. Under-representation: who is missing?

We identify groups with representation ratio < 0.8 in each cohort. Persistent under-representation can weaken external validity and mask important subgroup effects.


In [None]:

def underrepresented_groups(comp_df, threshold=0.8):
    return comp_df[comp_df["representation_ratio"] < threshold].copy()

under_sex = {c: underrepresented_groups(tables_sex[tables_sex["cohort"].eq(c)]) for c in tables_sex["cohort"].unique()}
under_age = {c: underrepresented_groups(tables_age[tables_age["cohort"].eq(c)]) for c in tables_age["cohort"].unique()}
under_race = {c: underrepresented_groups(tables_race[tables_race["cohort"].eq(c)]) for c in tables_race["cohort"].unique()}
under_edu = {c: underrepresented_groups(tables_edu[tables_edu["cohort"].eq(c)]) for c in tables_edu["cohort"].unique()}

# Display a compact summary for race/ethnicity
summary_under_race = []
for c, dfc in under_race.items():
    for _, r in dfc.iterrows():
        summary_under_race.append((c, r["race_eth"], float(r["representation_ratio"])))
summary_under_race = pd.DataFrame(summary_under_race, columns=["cohort","race_eth","repr_ratio"]).sort_values(["cohort","repr_ratio"])
summary_under_race.head(20)


### Consequences for effect estimates (simulation)

In the simulated population, **smokers eat fewer F&V portions**, and **BMI is lower among graduates** (BMI does not depend on smoking in this simulation). If smokers are under-represented and graduates over-represented, unweighted estimates will tend to be biased upward for F&V and downward for BMI. Weighting helps but cannot resolve *all* structural differences.
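
The direction and rough size of the F&V bias follow from simple mixture arithmetic. The sketch below uses the smoker effect of -0.5 portions from the simulation above. The prevalences and the non-smoker mean are hypothetical.

```python
# Overall mean = mean among non-smokers + (smoking prevalence x smoker effect).
def overall_mean(p_smoker, mean_nonsmoker, smoker_effect):
    return mean_nonsmoker + p_smoker * smoker_effect

mean_nonsmoker = 3.3   # hypothetical F&V portions/day among non-smokers
smoker_effect = -0.5   # smokers eat 0.5 fewer portions, as in the simulation
p_population = 0.17    # hypothetical smoking prevalence in the population
p_sample = 0.06        # hypothetical prevalence in a "worried well" sample

bias = (overall_mean(p_sample, mean_nonsmoker, smoker_effect)
        - overall_mean(p_population, mean_nonsmoker, smoker_effect))
print(f"bias in mean F&V from missing smokers: {bias:+.3f} portions/day")
```

The bias equals the smoker effect multiplied by the gap in prevalence, so it grows with both how different smokers are and how badly they are under-represented.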


In [None]:

def cohort_estimates(df, label):
    return pd.Series({
        "mean_BMI": df["bmi"].mean(),
        "mean_FV": df["fv_day"].mean(),
        "smoking_prev": df["smoker"].mean(),
        "n": len(df),
        "label": label,
    })

ests = pd.DataFrame([
    cohort_estimates(pop, "Population (NHANES-like)"),
    cohort_estimates(nhs, "NHS-like"),
    cohort_estimates(hpfs, "HPFS-like"),
    cohort_estimates(whi, "WHI-like"),
    cohort_estimates(conv, "Convenience"),
])

# Weighted estimate for the convenience cohort
w_mean_bmi = np.average(conv["bmi"], weights=weights)
w_mean_fv  = np.average(conv["fv_day"], weights=weights)
est_weighted_conv = pd.Series({
    "mean_BMI": w_mean_bmi,
    "mean_FV": w_mean_fv,
    "smoking_prev": np.average(conv["smoker"].astype(float), weights=weights),
    "n": len(conv),
    "label": "Convenience (weighted)",
})
ests = pd.concat([ests, est_weighted_conv.to_frame().T], ignore_index=True)
ests


## 6. Beyond the US: global transferability

Representativeness within the US does not imply **global** representativeness. Dietary patterns, food composition, preparation methods, and portion sizes vary widely. A **US-centric dietary assessment** (for example, food lists and portion images tuned to US cuisine) can cause **systematic misclassification** when applied elsewhere. This may bias exposure distributions and attenuate or distort associations.
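
The attenuation mentioned above can be sketched with the simplest case, purely random (classical) measurement error in the exposure. All values are invented. Systematic or differential error, which a poorly adapted instrument may well produce, can distort associations in either direction and is not covered by this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

true_fv = rng.normal(4.0, 1.0, n)                          # true intake (portions/day)
outcome = 30.0 - 0.5 * true_fv + rng.normal(0, 2.0, n)     # hypothetical linear effect of -0.5
measured_fv = true_fv + rng.normal(0, 1.0, n)              # noisy instrument, error SD = 1

# Least-squares slopes of the outcome on true and on measured intake.
slope_true = np.polyfit(true_fv, outcome, 1)[0]
slope_measured = np.polyfit(measured_fv, outcome, 1)[0]
print(f"slope with true intake: {slope_true:.2f}, with measured intake: {slope_measured:.2f}")
```

With error variance equal to the variance of true intake, the expected slope is halved: the attenuation factor is var(true) / (var(true) + var(error)).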

### Mini simulation: “Global” vs “US” distributions
We simulate a contrasting “Global” population with:
- Different age structure,
- Different education and income distributions,
- Different baseline diet proxy (F&V),
to show that a US-derived cohort can be a poor proxy for other regions.


In [None]:

# Construct a different 'Global' reference with altered margins and diet levels.
g_age_p = [0.45, 0.35, 0.20]    # younger age structure
g_edu_p = [0.55, 0.30, 0.15]    # lower overall formal education
g_inc_p = [0.55, 0.35, 0.10]    # more low-income

G_POP = 120_000
gpop = pd.DataFrame({
    "sex": sample_categorical(rng, sex_cats, sex_p, G_POP),
    "age_group": sample_categorical(rng, age_cats, g_age_p, G_POP),
    "race_eth": sample_categorical(rng, race_cats, race_p, G_POP),  # keep simple
    "education": sample_categorical(rng, edu_cats, g_edu_p, G_POP),
    "income": sample_categorical(rng, inc_cats, g_inc_p, G_POP),
})

# Diet proxy and BMI with different baselines
g_smoke = np.full(G_POP, 0.18)
g_smoke += np.where(gpop["education"].eq("≤High school"), 0.05, 0.0)
g_smoke = np.clip(g_smoke, 0, 1)
gpop["smoker"] = rng.binomial(1, g_smoke).astype(bool)

gbmi = np.full(G_POP, 26.5, dtype=float)  # different baseline
gbmi += np.where(gpop["education"].eq("Bachelor+"), -0.8, 0.0)
gbmi += rng.normal(0, 3.8, G_POP)
gpop["bmi"] = gbmi

gfv = np.full(G_POP, 3.8, dtype=float)    # different baseline F&V
gfv += np.where(gpop["education"].eq("Bachelor+"), +0.6, 0.0)
gfv += np.where(gpop["smoker"], -0.4, 0.0)
gfv += rng.normal(0, 0.8, G_POP)
gpop["fv_day"] = np.clip(gfv, 0, None)

# Compare a US cohort (e.g., NHS-like) to this global 'population'
comp_global = compare_distribution(gpop, nhs, "education")
comp_global


**Observation.**  
Even if a cohort is well-characterised and internally valid, its external validity depends on the target population. When dietary assessment tools are not adapted (for example, **US-centric FFQs** applied in countries with different staple foods and cooking methods), **measurement error** and **differential misclassification** can invalidate comparisons and pooled analyses.

> Hippo cameo (pedagogical): imagine a diligent hippo completing an FFQ with “peanut butter and jelly sandwich” options but no entries for fermented cassava or millet ugali. The form can be filled in, but the **construct validity** for that dietary context is poor.


## 7. Exercises (for students)

1. **Compute representation ratios** for **income** and **smoking** across the cohorts. Which cohort deviates most from the population? Provide a brief written interpretation.
2. **Design a weighting scheme** that uses **sex, age group, education, and smoking**. Compare weighted and unweighted means for **BMI** and **F&V** in the **NHS-like** cohort.
3. **Sensitivity**: Change the selection weights in the `selection_weights_convenience` function to produce a *more extreme* “worried well” sample. How do the biases change?
4. **Global generalisation**: Create a second “Global” reference with a different education profile and test whether US cohorts approximate it after reweighting. Where does reweighting fail? Why?


## 8. Summary

- **Population vs sample**: Analytical conclusions depend on who was studied. Be explicit about the target population and who ended up in the sample.
- **Representativeness matters**: Distributions of sex, age, ethnicity, education, and socioeconomic position influence both exposures and outcomes.
- **Recruitment bias is common**: “Worried well”, occupational cohorts, and single-sex studies can be highly non-representative of the general public.
- **Weighting helps, within limits**: Post-stratification can reduce bias for variables that are included and measured well. It does not fix unmeasured selection or poor measurement.
- **Global relevance is not automatic**: Methods and tools must be adapted to local diets and contexts. Applying **US-centric** dietary instruments elsewhere risks misclassification and misleading inference.


## Appendix: helper functions (optional)

For convenience, we reprint key helpers used above.


In [None]:

import inspect

print(inspect.getsource(prop_table))
print(inspect.getsource(compare_distribution))
print(inspect.getsource(rake_weights))
