# 01 · Introduction to Nutritional Epidemiology

> **Learning objectives**
- Define epidemiology in the context of nutrition.
- Recognise challenges: confounding, measurement error, missingness.
- Load and inspect the FB2NEP synthetic cohort (N≈25k).
---

In [1]:
# Imports & settings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(11088)
plt.rcParams['figure.dpi'] = 130
PATH = 'data/synthetic/fb2nep.csv'

In [2]:
# Ensure the dataset exists; generate it if it is missing.
# Both paths are relative to the working directory, so run this from the repository root.
import os, subprocess, sys
if not os.path.exists(PATH):
    print("Dataset missing — generating via scripts/generate_dataset.py ...")
    # sys.executable runs the script with the same interpreter as this kernel
    ret = subprocess.run([sys.executable, "scripts/generate_dataset.py"])
    if ret.returncode != 0:
        raise SystemExit("Generation failed. Check scripts/generate_dataset.py output.")
df = pd.read_csv(PATH)
df.head(3)

Dataset missing — generating via scripts/generate_dataset.py ...


/Users/gunter/.pyenv/versions/3.10.14/bin/python: can't open file '/Users/gunter/Documents/fb2nep-epi/notebooks/scripts/generate_dataset.py': [Errno 2] No such file or directory


SystemExit: Generation failed. Check scripts/generate_dataset.py output.

## First look

In [1]:
%run ../notebooks/_bootstrap.py
# now df is loaded; CSV_REL/REPO_ROOT/IN_COLAB are available
df.head()

Generating dataset…
> python scripts/generate_dataset.py
Wrote data/synthetic/fb2nep.csv with shape (25000, 27)
   id baseline_date  follow_up_years  age  ... CVD_incident    CVD_date  Cancer_incident Cancer_date
0   1    2011-05-27             6.44   59  ...            1  2016-11-14                1  2015-10-01
1   2    2010-08-14             7.50   60  ...            0                            0            
2   3    2012-04-28             7.57   54  ...            0                            0            
3   4    2015-01-20             5.71   67  ...            0                            0            
4   5    2013-04-10             6.30   70  ...            0                            0            

[5 rows x 27 columns]
Generated: data/synthetic/fb2nep.csv ✅
(25000, 27) — dataset ready


| | id | baseline_date | follow_up_years | age | sex | menopausal_status | IMD_quintile | SES_class | smoking_status | physical_activity | ... | ssb_ml_d | fibre_g_d | alcohol_units_wk | salt_g_d | plasma_vitC_umol_L | urinary_sodium_mmol_L | CVD_incident | CVD_date | Cancer_incident | Cancer_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-05-27 | 6.44 | 59 | M | | 5 | ABC1 | current | low | ... | 159.0 | 15.2 | 0.0 | 4.1 | 40.3 | 94.6 | 1 | 2016-11-14 | 1 | 2015-10-01 |
| 1 | 2 | 2010-08-14 | 7.5 | 60 | M | | 4 | ABC1 | never | low | ... | 206.0 | 11.2 | 0.0 | 5.0 | 36.1 | 104.4 | 0 | | 0 | |
| 2 | 3 | 2012-04-28 | 7.57 | 54 | F | post | 4 | ABC1 | former | moderate | ... | 233.0 | 26.4 | 0.0 | 7.3 | 46.3 | 130.9 | 0 | | 0 | |
| 3 | 4 | 2015-01-20 | 5.71 | 67 | F | post | 3 | ABC1 | never | low | ... | 399.0 | 10.5 | 8.0 | 4.9 | 31.4 | 78.0 | 0 | | 0 | |
| 4 | 5 | 2013-04-10 | 6.3 | 70 | M | | 1 | C2DE | current | moderate | ... | 600.0 | 25.9 | 0.0 | 9.1 | 33.1 | 154.1 | 0 | | 0 | |


In [None]:
# Summary statistics for every column, numeric and categorical
df.describe(include='all')
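
A summary table reports non-missing counts, but missingness is often easier to read as a fraction per column. The following is a minimal sketch on a small toy frame: the values are hypothetical, the column names mirror those in the cohort, and `toy` and `missing_summary` are illustrative names rather than part of the course code.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and fraction of missing values per column, most missing first."""
    out = pd.DataFrame({
        "n_missing": frame.isna().sum(),
        "frac_missing": frame.isna().mean().round(3),
    })
    return out.sort_values("frac_missing", ascending=False)

# Toy data with hypothetical values; replace `toy` with `df` in the notebook.
toy = pd.DataFrame({
    "age": [59, 60, 54, 67],
    "menopausal_status": [np.nan, np.nan, "post", "post"],
    "CVD_date": ["2016-11-14", np.nan, np.nan, np.nan],
})

summary = missing_summary(toy)
print(summary)
```

Not every blank is a data-quality problem. In the rows shown earlier, `CVD_date` is empty whenever `CVD_incident` is 0 and `menopausal_status` is empty for men, so some missingness may be structural rather than accidental.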

### Measurement error (discussion)
- Self-reported diet vs biomarkers (e.g. fruit/veg vs plasma vitamin C).
- Day-to-day variation and systematic bias.
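
The two bullets above behave differently in a regression, and a small simulation can make that concrete. This sketch assumes a classical error model: self-reported intake equals true intake plus random day-to-day noise plus a constant over-report. Every number here (means, variances, the 0.05 slope, the 50 g bias) is an assumption chosen for illustration, not a property of the FB2NEP cohort.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical "true" long-term intake (g/day) and a biomarker that depends on it.
true_intake = rng.normal(400, 100, n)
biomarker = 20 + 0.05 * true_intake + rng.normal(0, 5, n)

# Self-report = truth + random day-to-day error (SD 100) + constant bias (+50 g).
self_report = true_intake + rng.normal(0, 100, n) + 50

def slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slope_true = slope(true_intake, biomarker)
slope_reported = slope(self_report, biomarker)

# Under classical error the slope shrinks by var(true) / (var(true) + var(error)).
expected_attenuation = 100**2 / (100**2 + 100**2)

print(f"slope using true intake:     {slope_true:.3f}")
print(f"slope using self-report:     {slope_reported:.3f}")
print(f"expected attenuation factor: {expected_attenuation:.2f}")
```

In this setup the random error roughly halves the estimated slope, while the constant bias shifts only the intercept. Real dietary error is rarely this tidy (bias often depends on intake level or on participant characteristics), so treat the simulation as a starting point for discussion.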

### TODO · Quintiles of fruit/veg vs plasma vitamin C

In [None]:
# Compute mean vitamin C by quintiles of fruit_veg_g_d and assert monotonicity.
q = pd.qcut(df['fruit_veg_g_d'], 5, duplicates='drop')
# observed=True groups only on bins that actually occur and keeps behaviour stable across pandas versions
res = df.groupby(q, observed=True)['plasma_vitC_umol_L'].agg(['mean', 'std', 'count']).round(2)
# Expect a monotone increase in mean vitamin C across quintiles
assert res['mean'].is_monotonic_increasing, "Mean vit C should increase across fruit/veg quintiles"
res

### Checkpoint
- Note any odd ranges or surprising values you want to revisit later.
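
One way to make the checkpoint systematic is to count values that fall outside plausibility bounds. This is a minimal sketch on toy data: the bounds in `BOUNDS` are hypothetical placeholders (take real limits from a codebook or the literature), and `flag_out_of_range` is an illustrative helper, not part of the course code.

```python
import pandas as pd

# Hypothetical plausibility bounds (low, high); these are assumptions for illustration.
BOUNDS = {
    "age": (18, 110),
    "salt_g_d": (0, 30),
    "follow_up_years": (0, 20),
}

def flag_out_of_range(frame: pd.DataFrame, bounds: dict) -> pd.Series:
    """Count values outside [low, high] per column; missing values are not flagged."""
    counts = {}
    for col, (low, high) in bounds.items():
        values = frame[col]
        counts[col] = int(((values < low) | (values > high)).sum())
    return pd.Series(counts, name="n_out_of_range")

# Toy data with two deliberately implausible values and one missing value.
toy = pd.DataFrame({
    "age": [59, 60, 154, 67],
    "salt_g_d": [4.1, -1.0, 7.3, None],
    "follow_up_years": [6.44, 7.50, 7.57, 5.71],
})

flags = flag_out_of_range(toy, BOUNDS)
print(flags)
```

In the notebook, passing `df` instead of `toy` would give one count per checked column; a non-zero count is a prompt to look at the rows, not proof of an error.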