# 01 · Introduction to the dataset (light touch)

> **Purpose**: load the synthetic cohort (N≈25k), orient yourself, and do minimal integrity checks.
No modelling yet — just get a feel for the data.

> **Learning objectives**
- Load the dataset reproducibly (Colab/local) and locate key files.
- Inspect variables, units, and basic distributions.
- Run a few lightweight integrity checks (flags ↔ dates; plausible ranges; monotone signals).

---

In [None]:
# Bootstrap: ensure repo on path, then load df via scripts/bootstrap.py
import sys, pathlib
# Add repo root (parent of notebooks/) to sys.path; this assumes the working directory is notebooks/
sys.path.append(str(pathlib.Path.cwd().parent))
from scripts.bootstrap import init
df, ctx = init()
df.head(3)
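
The bootstrap cell relies on the working directory being `notebooks/`. If the notebook is launched from somewhere else, a small upward search for the bootstrap script is one way to locate the repo root. This is a minimal sketch, not part of the repository: `find_repo_root` is a hypothetical helper, and using `scripts/bootstrap.py` as the marker file is an assumption based on the import above.

```python
import sys
from pathlib import Path


def find_repo_root(start=None, marker="scripts/bootstrap.py"):
    """Walk upward from `start` until a directory containing `marker` is found.

    Returns the matching directory, or None if no ancestor contains the marker.
    """
    here = (Path.cwd() if start is None else Path(start)).resolve()
    for candidate in (here, *here.parents):
        if (candidate / marker).is_file():
            return candidate
    return None


# Only touch sys.path when a root was actually found and is not already listed.
root = find_repo_root()
if root is not None and str(root) not in sys.path:
    sys.path.insert(0, str(root))
```

If `root` comes back as `None`, the notebook is probably running outside the repository, and the import of `scripts.bootstrap` would fail regardless.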

## What’s in here?
- **metadata/data_dictionary.csv** — variable types, units, coding.
- **metadata/provenance.md** — how the data were generated (assumptions, seed=11088).
- **scripts/generate_dataset.py** — transparent generator (re-run if needed).
- **scripts/validate_dataset.py** — basic checks (ranges, monotone relations, incidence).

In [None]:
df.shape, sorted(df.columns)[:12], sorted(df.columns)[12:24]

## Quick summaries (numeric & categorical)

In [None]:
# include="number" also catches int32, float32 and nullable numeric dtypes
num_desc = df.select_dtypes(include="number").describe().T.round(2)
cat_desc = df.select_dtypes(include=["object","category"]).describe().T
num_desc.head(12), cat_desc.head(12)
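
Alongside `describe()`, a one-row-per-variable overview can help with orientation. The sketch below is illustrative only: `variable_overview` is a hypothetical helper and `toy` is a made-up frame, not the cohort. In the notebook the function would be called on `df`.

```python
import pandas as pd


def variable_overview(frame):
    """One row per column: dtype, missing count, missing percentage, distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "pct_missing": (frame.isna().mean() * 100).round(1),
        "n_unique": frame.nunique(dropna=True),
    })


# Tiny made-up frame, used here only to show the shape of the result
toy = pd.DataFrame({"age": [52, 61, None, 47], "sex": ["F", "M", "M", None]})
overview = variable_overview(toy)
print(overview)
```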

## Distributions: age, BMI, SBP (orientation, not inference)

In [None]:
import matplotlib.pyplot as plt
for col in ["age","BMI","SBP"]:
    if col in df:  # skip any column that is absent
        x = df[col].dropna()
        plt.figure()
        plt.hist(x, bins=40, alpha=0.9)
        plt.xlabel(col)
        plt.ylabel("count")
        plt.title(col)
        plt.show()

## Lightweight integrity checks
These are deliberately simple — just enough to catch glaring problems.

In [None]:
# Cohort age range
assert df['age'].min() >= 40, "Cohort should be age ≥ 40"

# Required columns exist
required = {
    'id','baseline_date','age','sex','BMI','SBP','IMD_quintile','SES_class','smoking_status',
    'fruit_veg_g_d','red_meat_g_d','salt_g_d','energy_kcal',
    'plasma_vitC_umol_L','urinary_sodium_mmol_L',
    'CVD_incident','CVD_date','Cancer_incident','Cancer_date'
}
missing = required - set(df.columns)
assert not missing, f"Missing columns: {missing}"

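# Binary incident flags are 0/1, and every incident=1 row has an event date
for flag, date in [("CVD_incident","CVD_date"),("Cancer_incident","Cancer_date")]:
    vals = set(df[flag].dropna().unique().tolist())
    assert vals.issubset({0,1}), f"{flag} should be 0/1"
    is_case = df[flag].eq(1)  # missing flags compare as False, so NaN does not raise
    # A date counts as present if it is neither missing nor an empty string
    has_date = df[date].notna() & df[date].astype(str).str.strip().ne("")
    assert (~is_case | has_date).all(), f"{date}: incident=1 rows must have a date"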
print("Integrity checks passed ✅")
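
The cell above checks only the lower age bound. A broader plausible-range check could be sketched as follows. The bounds are illustrative placeholders, not values from `metadata/data_dictionary.csv`, and `out_of_range_counts` is a hypothetical helper. In the notebook it would be called on `df` with bounds taken from the data dictionary.

```python
import pandas as pd

# Illustrative bounds only (assumption); replace with limits from the data dictionary.
BOUNDS = {"age": (40, 110), "BMI": (10, 70), "SBP": (60, 260)}


def out_of_range_counts(frame, bounds):
    """Count non-missing values outside [lo, hi] for each listed column that exists."""
    counts = {}
    for col, (lo, hi) in bounds.items():
        if col not in frame.columns:
            continue  # columns absent from the frame are skipped, not treated as errors
        values = frame[col].dropna()
        counts[col] = int((~values.between(lo, hi)).sum())
    return counts


# Made-up rows to show the behaviour; not cohort data
toy = pd.DataFrame({"age": [45, 39, 70], "BMI": [24.5, None, 81.0]})
result = out_of_range_counts(toy, BOUNDS)
print(result)
```

Returning counts rather than asserting makes it easier to see how many rows are affected before deciding whether a failure is a data problem or an overly tight bound.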

## Construct-validity spot checks (signal amidst noise)
- Fruit & veg → plasma vitamin C should be **monotone increasing** on average.
- Salt intake → urinary sodium should be **monotone increasing** on average.

In [None]:
import pandas as pd
# Bin each intake into quintiles, then average the biomarker within each bin
q_fv = pd.qcut(df['fruit_veg_g_d'], 5, duplicates='drop')
m_vitc = df.groupby(q_fv, observed=True)['plasma_vitC_umol_L'].mean().round(2)
q_salt = pd.qcut(df['salt_g_d'], 5, duplicates='drop')
m_urna = df.groupby(q_salt, observed=True)['urinary_sodium_mmol_L'].mean().round(2)
display(m_vitc, m_urna)
assert m_vitc.is_monotonic_increasing, "Vitamin C means should increase across fruit/veg quintiles"
assert m_urna.is_monotonic_increasing, "Urinary sodium means should increase across salt quintiles"
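
The quintile-mean assertions are all-or-nothing. A rank correlation gives a graded view of the same monotone relationship and may be a useful companion when the signal is noisy. The sketch below computes a Spearman-style coefficient as the Pearson correlation of ranks. `rank_correlation` is a hypothetical helper, and the toy data are simulated with arbitrary parameters rather than drawn from the cohort.

```python
import numpy as np
import pandas as pd


def rank_correlation(frame, x, y):
    """Spearman-style correlation: Pearson correlation of the ranks of x and y."""
    pair = frame[[x, y]].dropna()  # use complete pairs only
    return float(pair[x].rank().corr(pair[y].rank()))


# Simulated intake/biomarker pair with a positive trend plus noise (arbitrary values)
rng = np.random.default_rng(0)
intake = rng.uniform(0, 800, size=500)
marker = 20 + 0.05 * intake + rng.normal(0, 10, size=500)
toy = pd.DataFrame({"fruit_veg_g_d": intake, "plasma_vitC_umol_L": marker})

rho = rank_correlation(toy, "fruit_veg_g_d", "plasma_vitC_umol_L")
print(f"rank correlation: {rho:.2f}")
```

In the notebook, the same function could be called on `df` for both pairs (fruit and veg with vitamin C, salt with urinary sodium). A clearly positive coefficient is consistent with the expected direction, though it says nothing about the shape of the relationship.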

## Very crude incidences (orientation only)
We’ll revisit definitions later; for now just confirm ballpark magnitudes.

In [None]:
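# Convert to plain floats so rounding and printing behave the same across NumPy versions
p_cvd = round(float(df['CVD_incident'].mean()), 4)
p_cancer = round(float(df['Cancer_incident'].mean()), 4)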
print({"CVD": p_cvd, "Cancer": p_cancer})
assert 0.06 <= p_cvd <= 0.20 and 0.06 <= p_cancer <= 0.20, "Incidences should be plausible in this synthetic cohort"

> ## Key takeaways
>
> - You can reliably **load** the dataset in Colab or locally using the bootstrap.
> - Run a handful of **simple checks** up front (ranges, flags↔dates, monotone relations).
> - Keep today’s view **strictly descriptive** — modelling begins later.
>
> **Next:** produce a defensible **description of the population** (Table 1) and explore **missing data**.