# FB2NEP Workbook 4 – Data Exploration and “Table 1”

Version 0.0.5

This workbook introduces:

- Descriptive statistics and visual inspection.
- How to construct a baseline characteristics table (“Table 1”).
- Simple group comparisons: *t*-tests, χ²-tests, and one-way ANOVA.
- Using visualisation for first insights.
- Distinguishing **statistical significance** from **practical relevance**.

Run the bootstrap cell first.

In [None]:
# ============================================================
# FB2NEP bootstrap cell (works both locally and in Colab)
#
# What this cell does:
# - Ensures that we are inside the fb2nep-epi repository.
# - In Colab: clones the repository from GitHub if necessary.
# - Loads and runs scripts/bootstrap.py.
# - Makes the main dataset available as the variable `df`.
#
# Important:
# - You may see messages printed below (for example from pip
#   or from the bootstrap script). This is expected.
# - You may also see WARNINGS (often in yellow). In most cases
#   these are harmless and can be ignored for this module.
# - The main thing to watch for is a red error traceback
#   (for example FileNotFoundError, ModuleNotFoundError).
#   If that happens, please re-run this cell first. If the
#   error persists, ask for help.
# ============================================================

import os
import sys
import pathlib
import subprocess
import importlib.util

# ------------------------------------------------------------
# Configuration: repository location and URL
# ------------------------------------------------------------
# REPO_URL: address of the GitHub repository.
# REPO_DIR: folder name that will be created when cloning.
REPO_URL = "https://github.com/ggkuhnle/fb2nep-epi.git"
REPO_DIR = "fb2nep-epi"

# ------------------------------------------------------------
# 1. Ensure we are inside the fb2nep-epi repository
# ------------------------------------------------------------
# In local Jupyter, you may already be inside the repository,
# for example in fb2nep-epi/notebooks.
#
# In Colab, the default working directory is /content, so
# we need to clone the repository into /content/fb2nep-epi
# and then change into that folder.
cwd = pathlib.Path.cwd()

# Case A: we are already in the repository (scripts/bootstrap.py exists here)
if (cwd / "scripts" / "bootstrap.py").is_file():
    repo_root = cwd

# Case B: we are outside the repository (for example in Colab)
else:
    repo_root = cwd / REPO_DIR

    # Clone the repository if it is not present yet
    if not repo_root.is_dir():
        print(f"Cloning repository from {REPO_URL} into {repo_root} ...")
        subprocess.run(["git", "clone", REPO_URL, str(repo_root)], check=True)
    else:
        print(f"Using existing repository at {repo_root}")

    # Change the working directory to the repository root
    os.chdir(repo_root)
    repo_root = pathlib.Path.cwd()

print(f"Repository root set to: {repo_root}")

# ------------------------------------------------------------
# 2. Load scripts/bootstrap.py as a module and call init()
# ------------------------------------------------------------
# The shared bootstrap script contains all logic to:
# - Ensure that required Python packages are installed.
# - Ensure that the synthetic dataset exists (and generate it
#   if needed).
# - Load the dataset into a pandas DataFrame.
#
# We load the script as a normal Python module (fb2nep_bootstrap)
# and then call its init() function.
bootstrap_path = repo_root / "scripts" / "bootstrap.py"

if not bootstrap_path.is_file():
    raise FileNotFoundError(
        f"Could not find {bootstrap_path}. "
        "Please check that the fb2nep-epi repository structure is intact."
    )

# Create a module specification from the file
spec = importlib.util.spec_from_file_location("fb2nep_bootstrap", bootstrap_path)
bootstrap = importlib.util.module_from_spec(spec)
sys.modules["fb2nep_bootstrap"] = bootstrap

# Execute the bootstrap script in the context of this module
spec.loader.exec_module(bootstrap)

# The init() function is defined in scripts/bootstrap.py.
# It returns:
# - df   : the main synthetic cohort as a pandas DataFrame.
# - CTX  : a small context object with paths, flags and settings.
df, CTX = bootstrap.init()

# Optionally expose a few additional useful variables from the
# bootstrap module (if they exist). These are not essential for
# most analyses, but can be helpful for advanced use.
for name in ["CSV_REL", "REPO_NAME", "REPO_URL", "IN_COLAB"]:
    if hasattr(bootstrap, name):
        globals()[name] = getattr(bootstrap, name)

print("Bootstrap completed successfully.")
print("The main dataset is available as the variable `df`.")
print("The context object is available as `CTX`.")


## 1. Why we start with descriptive exploration

Before fitting any model, we need a sense of:

- what the data look like,
- whether variables behave as expected,
- where outliers or odd patterns might be,
- how different groups differ at baseline.

In epidemiology and clinical trials, the default tool for this is **“Table 1”**:

- a structured summary of baseline characteristics,
- split by relevant groups (e.g. sex, exposure, case/control),
- combining categorical and continuous variables.

A good Table 1 is *not* a formal hypothesis test — it is **context**.

In this workbook we learn how to:

- explore variables descriptively,
- construct a simple baseline Table 1 from the FB2NEP dataset,
- and run a few basic comparisons between groups.

### 1.1 First look at the dataset (`df.head()` and `dtypes`)

Before doing anything more complex, we want to answer a few basic questions:

- What are the **variables** called?
- What is their **type** (number, text, date)?
- Do the **values look plausible** (e.g. BMI, blood pressure, age)?
- Are there obvious **missing values** or strange codes?

In pandas, two very simple tools already help a lot:

- `df.head()`  → first few rows of the dataset.
- `df.dtypes`  → how pandas currently represents each variable.

When you run the next cell, look specifically for:

- Does `baseline_date` look like a date, or like plain text?
- Are things like `sex`, `smoking_status`, `SES_class` stored as *object* (string) variables?
- Are continuous variables (e.g. `age`, `BMI`, `SBP`, `energy_kcal`) stored as numeric (`int64` or `float64`)?

> **Mini-task:**  
> After running the next cell, write down one thing that looks as you would expect, and one thing that surprises you.


In [None]:
# 1.1 Quick look at the dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display

display(df.head())
df.dtypes.head(20)

### 1.2 Quick reflection

- Are there any variables you thought would be continuous but appear as text, or vice versa?
- Do you see any variables where the **name** is unclear and might need a better label in a paper?
- Do you notice any **obvious missing values** in the first few rows?

You do **not** need to fix anything yet – the goal is just to get a feeling for the data.
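
A compact overview of types and missing values can support this reflection. The sketch below is only an illustration: it uses a small made-up DataFrame called `toy` (not the FB2NEP data) and does not change any values; it only reports what is there.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a cohort (all values are made up for illustration).
toy = pd.DataFrame({
    "age": [54, 61, 47, 70],
    "BMI": [24.1, np.nan, 31.5, 27.8],
    "sex": ["F", "M", "F", None],
    "baseline_date": ["2010-03-01", "2010-04-15", "2011-01-20", "2010-11-05"],
})

# One row per variable: storage type, number and percentage of missing values.
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
})
print(overview)

# Check whether a text column could be read as dates, without modifying `toy`.
# errors="coerce" turns unparseable entries into NaT so that they can be counted.
parsed = pd.to_datetime(toy["baseline_date"], errors="coerce")
print("Entries that could not be parsed as dates:", int(parsed.isna().sum()))
```

The same pattern should work on the real dataset by replacing `toy` with `df`, assuming the column names match.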


## 2. Descriptive statistics: first pass

We start with quick summaries for selected variables:

- **Continuous variables** via `.describe()`  
  → number of observations, mean, standard deviation, quartiles, min and max.
- **Categorical variables** via `.value_counts()`  
  → counts (and, if we want, proportions) for each category.

From this we can already check:

- Are ranges **plausible**? (e.g. nobody with BMI 3 or age 300.)
- Is there evidence of **skewness** (very long right tail for energy intake)?
- How big are the **groups** (e.g. current vs former vs never smokers)?
- Are categories **balanced**, or is one group very small?

Run the next cell and look at:

- `age`, `BMI`, `SBP`, `energy_kcal` – do the numbers look reasonable for a UK cohort?
- `sex`, `SES_class`, `smoking_status`, `physical_activity` – do the group sizes fit your expectations?


In [None]:
# 2.1 Summary statistics for key variables

continuous_vars = ["age", "BMI", "SBP", "DBP", "energy_kcal"]
categorical_vars = ["sex", "SES_class", "smoking_status", "physical_activity"]

for v in continuous_vars:
    if v in df.columns:
        print(f"\n{v} (continuous)")
        display(df[v].describe())

for v in categorical_vars:
    if v in df.columns:
        print(f"\n{v} (categorical)")
        print(df[v].value_counts(dropna=False))

### 2.1 Interpreting these summaries

When you look at the `.describe()` output for continuous variables, ask:

- Are the **minimum and maximum** values realistic in human populations?
- Is the **standard deviation** small or large compared with the mean?
- Does the **IQR** (25%–75%) suggest most people are in a narrow band, or very spread out?

For categorical variables:

- Are there any **rare categories** (very small counts)?
- Are the group sizes **very imbalanced** (e.g. 95% never smokers)?
- Do you see any unexpected categories (e.g. misspellings)?

> **Mini-task:**  
> Pick one continuous and one categorical variable.  
> For each, write one short sentence describing what you see (e.g. “Most participants are in their 50s and 60s, with ages ranging from 40 to 90.”).
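
The checks above can also be made numerically. The following sketch, on made-up data (`toy` is a hypothetical stand-in, not the cohort), shows counts next to percentages for a categorical variable, and the IQR plus a mean-versus-median comparison as a quick indication of skewness for a continuous one.

```python
import pandas as pd

# Made-up data: one very high energy intake creates a long right tail.
toy = pd.DataFrame({
    "energy_kcal": [1600, 1800, 1900, 2000, 2100, 2200, 2400, 4500],
    "smoking_status": ["never", "never", "never", "never",
                       "former", "former", "current", "never"],
})

# Categorical: counts and percentages side by side.
counts = toy["smoking_status"].value_counts(dropna=False)
percent = (toy["smoking_status"].value_counts(normalize=True, dropna=False) * 100).round(1)
print(pd.DataFrame({"n": counts, "percent": percent}))

# Continuous: interquartile range (75th minus 25th percentile).
q1, q3 = toy["energy_kcal"].quantile([0.25, 0.75])
iqr = q3 - q1
print("IQR:", iqr)

# A mean clearly above the median suggests a right-skewed distribution.
mean_kcal = toy["energy_kcal"].mean()
median_kcal = toy["energy_kcal"].median()
skewness = toy["energy_kcal"].skew()
print("Mean:", mean_kcal, " Median:", median_kcal, " Skewness:", round(skewness, 2))
```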


## 3. Visual inspection

Tables are useful, but plots help us see patterns that numbers can hide:

- **Skewness** or **long tails** (e.g. a few people with very high intake).
- **Multimodal** distributions (two “peaks” → possibly two subgroups).
- **Spikes** that might indicate coding or rounding (e.g. lots of values at 150 mmHg).
- **Group differences** that may or may not be important.

Here we use BMI as an example:

1. A **histogram** to see the overall shape of the distribution.
2. A **boxplot by sex** to see whether men and women differ in BMI.

When you look at the plots, consider:

- Does BMI look roughly **normal**, slightly skewed, or very skewed?
- Are there obvious **outliers** (points far away from the rest)?
- Do men and women have **similar medians** and spreads, or is one group heavier?


In [None]:
# 3.1 Histogram of BMI

%matplotlib inline

if "BMI" in df.columns:
    plt.figure(figsize=(6, 4))
    df["BMI"].hist(bins=30)
    plt.xlabel("BMI (kg/m²)")
    plt.ylabel("Number of participants")
    plt.title("BMI distribution")
    plt.tight_layout()
    plt.show()

In [None]:
# 3.2 Boxplot of BMI by sex

if {"BMI", "sex"}.issubset(df.columns):
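    # Let pandas create the figure: with `by=`, df.boxplot() opens its own figure,
    # so a preceding plt.figure() call would leave an empty extra figure.
    df.boxplot(column="BMI", by="sex", figsize=(6, 4))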
    plt.title("BMI by sex")
    plt.suptitle("")  # remove automatic super-title
    plt.xlabel("Sex")
    plt.ylabel("BMI (kg/m²)")
    plt.tight_layout()
    plt.show()

### 3.3 Quick questions

After looking at the histogram and boxplot:

- Would you be comfortable using methods that assume an approximately **normal** distribution for BMI?
- If you had to highlight **one potential issue** with BMI in this dataset (e.g. outliers, skew), what would it be?
- Do you think the difference between men and women in BMI, if any, is likely to be **meaningful** in practice?

You do not need exact numbers yet – this is about developing an *eye* for the data.
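
If you want to back up a visual impression of outliers with numbers, one common convention is the 1.5 × IQR rule, which is also the default whisker rule in matplotlib boxplots. The sketch below applies it to a made-up series of BMI values; flagged points are candidates for a closer look, not automatically errors.

```python
import pandas as pd

# Made-up BMI values with one implausibly high entry.
bmi = pd.Series([19.5, 22.0, 23.4, 24.8, 25.1, 26.3, 27.0, 28.9, 31.2, 52.0], name="BMI")

# Quartiles and interquartile range.
q1, q3 = bmi.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences at 1.5 x IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged for inspection.
flagged = bmi[(bmi < lower) | (bmi > upper)]
print(f"Fences: [{lower:.2f}, {upper:.2f}]")
print("Flagged values:", flagged.tolist())
```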


## 4. Building a baseline “Table 1”

In most papers, **Table 1** summarises baseline characteristics by a key grouping variable:

- in trials: intervention vs control,
- in cohorts: exposed vs unexposed, or cases vs non-cases,
- very commonly: **by sex**.

We will:

- choose `sex` as the grouping variable,
- summarise continuous variables as *mean ± SD*,
- summarise categorical variables as counts and percentages.

The aim is clarity, not yet hypothesis testing.
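
To make the mechanics concrete, here is a minimal sketch of how such a summary could be assembled with plain pandas. It is not necessarily how the course helper `make_table1` works; the toy data and all variable names in the block are made up for illustration.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "sex": ["F", "F", "F", "M", "M", "M"],
    "age": [52, 60, 58, 55, 63, 49],
    "smoking_status": ["never", "former", "never", "current", "never", "never"],
})

# Continuous variable: mean and SD per group, formatted as "mean ± SD".
stats = toy.groupby("sex")["age"].agg(["mean", "std"])
age_row = stats.apply(lambda r: f"{r['mean']:.1f} ± {r['std']:.1f}", axis=1)

# Categorical variable: counts and column percentages (within each sex).
counts = pd.crosstab(toy["smoking_status"], toy["sex"])
percent = pd.crosstab(toy["smoking_status"], toy["sex"], normalize="columns") * 100
smoking_rows = counts.astype(str) + " (" + percent.round(1).astype(str) + "%)"

print("age, mean ± SD")
print(age_row)
print("\nsmoking_status, n (%)")
print(smoking_rows)
```

Percentages are computed within each column (sex), which is the usual choice when the columns of Table 1 are the groups being described.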

In [None]:
# 4.1 Helper function to create a simple Table 1

from scripts.helpers_tables import make_table1

continuous_vars = ["age", "BMI", "SBP", "DBP"]
categorical_vars = ["SES_class", "smoking_status", "physical_activity"]

if "sex" in df.columns:
    table1 = make_table1(df, group="sex",
                         continuous=continuous_vars,
                         categorical=categorical_vars)
    print("Baseline characteristics by sex (simple Table 1):")
    display(table1)
else:
    print("Variable 'sex' not found in dataset.")

### 4.2 Interpreting a simple Table 1

Look at the Table 1 output:

- **Continuous rows** (age, BMI, SBP, DBP) show *mean ± SD* by sex.
- **Categorical rows** (SES, smoking, physical activity) show counts and percentages by sex.

Questions to ask:

- Are age and BMI **similar** between men and women, or is there a noticeable difference?
- Do men and women differ in **SES distribution** (e.g. more ABC1 in one group)?
- Are there clear differences in **smoking patterns** (more current smokers in one sex)?
- Is **physical activity** similarly distributed, or does one group report more “high” activity?

> **Mini-task:**  
> Write two short sentences summarising Table 1, for example:
> - “Men and women in this cohort are of similar age and BMI.”  
> - “Smoking and SES distributions are also similar between sexes, with only small differences.”


## 5. Comparing groups: *t*-tests, χ²-tests, ANOVA

Once we have a baseline table, we might ask whether observed differences are **larger than we would expect by chance**.

Here we introduce three basic tools:

- **Two-sample *t*-test** — difference in means between two groups (e.g. BMI in men vs women).
- **χ²-test of independence** — association between two categorical variables (e.g. smoking vs SES).
- **One-way ANOVA** — difference in means across three or more groups (e.g. SBP by physical activity category).

These are *introductory* tools; later workbooks will use regression models that generalise these ideas.

In [None]:
# 5.1 Two-sample t-test: BMI in men vs women

from scipy.stats import ttest_ind

if {"BMI", "sex"}.issubset(df.columns):
    bmi_m = df[df["sex"] == "M"]["BMI"].dropna()
    bmi_f = df[df["sex"] == "F"]["BMI"].dropna()

    if len(bmi_m) > 1 and len(bmi_f) > 1:
        stat, p = ttest_ind(bmi_m, bmi_f, equal_var=False)
        print("T-test: BMI (M vs F)")
        print("n (M) =", len(bmi_m), "  n (F) =", len(bmi_f))
        print("Statistic:", stat)
        print("p-value :", p)
    else:
        print("Not enough data in one of the groups for t-test.")
else:
    print("BMI or sex not found in dataset.")

### 5.1.1 Interpreting the BMI t-test (M vs F)

The t-test compares **mean BMI** between men and women under the assumption that:

- observations are **independent**,
- BMI is **approximately normal** within each group,
- the two groups have equal variances (this applies to the classic Student's *t*-test only; we used `equal_var=False`, i.e. Welch's *t*-test, which does not require it).

When you look at the output:

- The **statistic** tells you how far the observed difference is from zero, measured in standard errors of the difference (not in SD units).
- The **p-value** tells you how compatible the data are with the null hypothesis of *no difference*.

However, even if the p-value is very small (statistically significant), always also check:

- **Effect size**: what is the actual difference in mean BMI (in kg/m²)?
- **Practical relevance**: would a difference of, say, 0.3 kg/m² matter clinically or for public health?

> **Mini-task:**  
> Use the Table 1 means to compute (by hand) the difference in BMI between men and women.  
> Then decide: *even if this were “significant”, would it be important?*
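
The effect-size check can also be done in code. The sketch below uses simulated values (not the FB2NEP data) to compute the raw difference in means and, as one common standardised measure, Cohen's *d* based on the pooled SD. The group means, SD and sample sizes are arbitrary assumptions chosen for illustration.

```python
import numpy as np

# Simulated BMI values for two groups (hypothetical parameters).
rng = np.random.default_rng(42)
bmi_m_toy = rng.normal(loc=27.3, scale=4.5, size=500)
bmi_f_toy = rng.normal(loc=27.0, scale=4.5, size=500)

# Raw effect size: difference in means, in kg/m².
diff = bmi_m_toy.mean() - bmi_f_toy.mean()

# Pooled variance: weighted average of the two sample variances.
n_m, n_f = len(bmi_m_toy), len(bmi_f_toy)
pooled_var = (
    (n_m - 1) * bmi_m_toy.var(ddof=1) + (n_f - 1) * bmi_f_toy.var(ddof=1)
) / (n_m + n_f - 2)
pooled_sd = np.sqrt(pooled_var)

# Standardised effect size: difference expressed in pooled-SD units.
cohens_d = diff / pooled_sd

print(f"Difference in means: {diff:.2f} kg/m²")
print(f"Pooled SD          : {pooled_sd:.2f} kg/m²")
print(f"Cohen's d          : {cohens_d:.2f}")
```

Cohen's *d* is unit-free, which helps when comparing across variables, but the raw difference in kg/m² is usually easier to judge for practical relevance.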


In [None]:
# 5.2 χ²-test: smoking_status vs SES_class

from scipy.stats import chi2_contingency

if {"smoking_status", "SES_class"}.issubset(df.columns):
    tab = pd.crosstab(df["smoking_status"], df["SES_class"])
    print("Contingency table: smoking_status × SES_class")
    display(tab)

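    # Run the test; `exp` holds the counts expected if the two variables were independent
    chi2, p, dof, exp = chi2_contingency(tab)
    print("χ² statistic:", chi2)
    print("Degrees of freedom:", dof)
    print("p-value:", p)

    # Rule of thumb: the χ² approximation is less reliable when expected counts fall below 5
    if exp.min() < 5:
        print("Some expected counts are below 5; interpret the χ²-test with caution.")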
else:
    print("smoking_status or SES_class not found in dataset.")

### 5.2.1 Interpreting the χ²-test: smoking vs SES

The χ²-test of independence asks:

> Is there evidence that **smoking status** and **SES class** are associated,  
> or could the observed differences be due to chance alone?

Steps:

1. The contingency table (`pd.crosstab`) shows **counts** in each combination of categories.
2. The χ²-test compares these to **expected counts** if smoking and SES were independent.
3. A small p-value suggests that the pattern of smoking differs by SES group.

Points to check:

- Are there any **very small expected counts** (a common rule of thumb is below 5)? These can make the χ²-test less reliable.
- If there is an association, is it **strong** (e.g. current smoking concentrated in one SES group), or subtle?

> **Mini-task:**  
> Look at the contingency table and describe in words how smoking patterns differ between SES classes (if at all).
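
Two additions often help when reading a contingency table: the expected counts under independence, and a measure of the strength of association. The sketch below uses a made-up table (the category labels and counts are assumptions, not FB2NEP results) and computes Cramér's V, one common effect size for χ²-tests that ranges from 0 (no association) to 1.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up contingency table: smoking status (rows) by SES class (columns).
tab = pd.DataFrame(
    [[120, 90], [60, 70], [20, 40]],
    index=["never", "former", "current"],
    columns=["ABC1", "C2DE"],
)

chi2, p, dof, expected = chi2_contingency(tab)

# Expected counts if smoking and SES were independent, with labels restored.
expected = pd.DataFrame(expected, index=tab.index, columns=tab.columns)
print(expected.round(1))
print("Smallest expected count:", round(expected.values.min(), 1))

# Row percentages make the pattern easier to describe in words.
row_pct = tab.div(tab.sum(axis=1), axis=0) * 100
print(row_pct.round(1))

# Cramér's V: sqrt(chi2 / (n * (min(rows, columns) - 1))).
n = tab.values.sum()
cramers_v = np.sqrt(chi2 / (n * (min(tab.shape) - 1)))
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}, Cramér's V = {cramers_v:.2f}")
```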


In [None]:
# 5.3 One-way ANOVA: SBP by physical_activity

from scipy.stats import f_oneway

if {"SBP", "physical_activity"}.issubset(df.columns):
    levels = ["low", "moderate", "high"]
    groups = []

    for lev in levels:
        if lev in df["physical_activity"].unique():
            g = df[df["physical_activity"] == lev]["SBP"].dropna()
            if len(g) > 1:
                groups.append(g)
                print(f"Level '{lev}': n =", len(g))

    if len(groups) >= 2:
        stat, p = f_oneway(*groups)
        print("\nANOVA: SBP by physical_activity")
        print("F-statistic:", stat)
        print("p-value   :", p)
    else:
        print("Not enough groups with data for ANOVA.")
else:
    print("SBP or physical_activity not found in dataset.")

### 5.3.1 ANOVA: what does a “significant” F-test mean here?

The one-way ANOVA tests:

> Do mean SBP values differ across the physical activity groups (low, moderate, high)?

If the p-value is small:

- It tells you that **at least one** group mean is different.
- It does **not** tell you *which* groups differ, or by how much.

Always combine:

- The ANOVA result (F-statistic, p-value) with
- The **group means and SDs** (you can compute them with `df.groupby("physical_activity")["SBP"].agg(["mean", "std"])`).

> **Mini-task:**  
> Compute mean SBP by physical activity group and consider whether the size of the differences is likely to be relevant in practice.
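
As a sketch of how the ANOVA result and the group summaries fit together, the block below uses made-up SBP values (not the cohort data) to compute group means and SDs, and η² (eta squared), the proportion of total variance explained by group membership. η² is one possible effect size for one-way ANOVA; it is shown here as an illustration.

```python
import pandas as pd
from scipy.stats import f_oneway

# Made-up SBP values in three activity groups.
toy = pd.DataFrame({
    "physical_activity": ["low"] * 4 + ["moderate"] * 4 + ["high"] * 4,
    "SBP": [138, 142, 135, 145, 130, 134, 128, 136, 124, 129, 122, 125],
})

# Group size, mean and SD.
summary = toy.groupby("physical_activity")["SBP"].agg(["count", "mean", "std"])
print(summary)

# Eta squared = between-group sum of squares / total sum of squares.
grand_mean = toy["SBP"].mean()
ss_between = (summary["count"] * (summary["mean"] - grand_mean) ** 2).sum()
ss_total = ((toy["SBP"] - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

# The F-test on the same groups, for comparison.
groups = [g["SBP"].values for _, g in toy.groupby("physical_activity")]
f_stat, p = f_oneway(*groups)

print(f"F = {f_stat:.2f}, p = {p:.4f}, eta squared = {eta_squared:.2f}")
```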


## 5.4 Effect sizes and confidence intervals

So far we have looked at *p*-values for group comparisons.  
However, a *p*-value alone does **not** tell us:

- how big the difference is,
- how uncertain the estimate is,
- whether the effect might plausibly be “small but important” or “large but irrelevant”.

A **confidence interval (CI)** addresses the first two points directly and helps us judge the third.

### What is a confidence interval?

A 95% CI for a mean (or a difference in means) is a range computed by a procedure that:

> would produce intervals containing the true population value in 95% of repeated samples  
> (if we could repeat the study infinitely many times under identical conditions).

It is **not**:
- a 95% probability that the true value lies inside *this particular* interval,  
- nor a statement about any individual.

It is a **statement about the method**.
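
The "repeated samples" idea can be checked by simulation. The sketch below draws many samples from a population with a known mean (hypothetical values chosen for illustration), computes a 95% CI for the mean from each sample, and counts how often the interval contains the true value.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(2024)

# Hypothetical population values and simulation settings.
true_mean, true_sd = 27.0, 4.5
n, n_sim = 50, 2000

# Critical value of the t distribution for a 95% CI with n - 1 degrees of freedom.
t_crit = t.ppf(0.975, df=n - 1)

covered = 0
for _ in range(n_sim):
    sample = rng.normal(true_mean, true_sd, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    low = sample.mean() - t_crit * se
    high = sample.mean() + t_crit * se
    # Count the interval if it contains the true mean.
    covered += int(low <= true_mean <= high)

coverage = covered / n_sim
print(f"Intervals containing the true mean: {coverage:.1%}")
```

The proportion should be close to 95%, although it will vary a little with the random seed. Any single interval either contains the true mean or does not; the 95% describes the long-run behaviour of the procedure.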

### Why CIs matter more than p-values

A very small *p*-value could correspond to:

- a **tiny** effect estimated with high precision (large N),
- a **large** effect estimated with only moderate precision (smaller N),
- or something in between.

The CI tells you which one is happening.

Examples:
- A BMI difference of **0.2 (95% CI: 0.1 to 0.3)** → tiny, precise.
- A BMI difference of **2.5 (95% CI: –1.0 to 6.0)** → large estimate, but very uncertain; the interval includes 0, so the data are also compatible with no difference.

### CI rule of thumb

- If a CI for a difference **crosses 0** → compatible with “no difference”.
- If a CI is **very narrow** → estimate is precise.
- If a CI is **wide** → estimate is uncertain, even if the p-value is small.

The next cell computes a 95% CI for the difference in mean BMI (men vs women), using Welch's method, which does not assume equal variances.
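
In symbols, a Welch-type interval for a difference in means can be written as follows, where $\bar{x}$, $s^2$ and $n$ denote the sample mean, sample variance and sample size of each group (subscripts $M$ and $F$ for men and women), and $t_{0.975,\,\nu}$ is the 97.5th percentile of the *t* distribution with $\nu$ degrees of freedom:

$$
\hat{\Delta} = \bar{x}_M - \bar{x}_F,
\qquad
SE = \sqrt{\frac{s_M^2}{n_M} + \frac{s_F^2}{n_F}}
$$

$$
\nu = \frac{\left(\dfrac{s_M^2}{n_M} + \dfrac{s_F^2}{n_F}\right)^2}
{\dfrac{\left(s_M^2 / n_M\right)^2}{n_M - 1} + \dfrac{\left(s_F^2 / n_F\right)^2}{n_F - 1}}
$$

$$
95\%\ \text{CI} = \hat{\Delta} \pm t_{0.975,\,\nu} \times SE
$$

The second expression is the Welch–Satterthwaite approximation for the degrees of freedom.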


In [None]:
# 5.4 Confidence interval for difference in means (BMI: M vs F)

import numpy as np
from scipy.stats import t

# Extract data
bmi_m = df[df["sex"] == "M"]["BMI"].dropna().values
bmi_f = df[df["sex"] == "F"]["BMI"].dropna().values

# Group means and their difference
mean_m = bmi_m.mean()
mean_f = bmi_f.mean()
diff = mean_m - mean_f

# Squared standard error of each group mean (sample variance / n)
v_m = bmi_m.var(ddof=1) / len(bmi_m)
v_f = bmi_f.var(ddof=1) / len(bmi_f)

# Standard error of the difference (Welch)
se = np.sqrt(v_m + v_f)

# Degrees of freedom (Welch–Satterthwaite)
df_welch = (v_m + v_f) ** 2 / (
    v_m ** 2 / (len(bmi_m) - 1) + v_f ** 2 / (len(bmi_f) - 1)
)

# 95% CI
t_crit = t.ppf(0.975, df_welch)
ci_low = diff - t_crit * se
ci_high = diff + t_crit * se

print(f"Difference in mean BMI (M − F): {diff:.3f}")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")



### How to use this in practice

When reporting group differences (e.g. in Table 1 or in papers):

- Always include the **effect size** (difference in means).
- Always include the **95% CI**.
- Treat the p-value as supportive, not decisive.
- Use the CI to judge **precision** and **relevance**.

CIs are often more informative than p-values because they show:
- the *range* of plausible values, and  
- whether the difference is likely to matter in practice.


## 6. Statistical vs practical significance

When we run a hypothesis test (for example a *t*-test or χ²-test), we usually get:

- a **test statistic** (t, χ², F, …), and  
- a **p-value**.

### 6.1 What does the p-value mean?

The p-value is:

> The probability of observing a result **at least as extreme as the one in your data**,  
> **if** the null hypothesis were true.

Important implications:

- It is **not** the probability that the null hypothesis is true.
- It is **not** the probability that your result is “just due to chance”.
- It depends on:
  - the observed **effect size** (how different the groups are in the sample),
  - the **sample size** (more people → smaller p-values for the same effect),
  - and the **variability** in the data.

With large samples (such as the FB2NEP synthetic cohort), even very small differences
can produce **tiny p-values**.

Examples:

- A BMI difference of 0.3 kg/m² between two groups may be strongly “statistically significant”.
- It is unlikely to be **clinically** or **public-health relevant** on its own.
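
The dependence on sample size can be shown without any data by computing the test from summary statistics. In the sketch below, the mean difference (0.3 kg/m²) and the SD (4.5 kg/m²) are held fixed at hypothetical values, and only the number of participants per group changes.

```python
from scipy.stats import ttest_ind_from_stats

# Hypothetical summary statistics: same means and SD in every scenario.
mean_a, mean_b, sd = 27.3, 27.0, 4.5

p_by_n = {}
for n in [100, 1000, 25000]:
    # Welch t-test computed from means, SDs and group sizes only.
    res = ttest_ind_from_stats(mean_a, sd, n, mean_b, sd, n, equal_var=False)
    p_by_n[n] = res.pvalue
    print(f"n per group = {n:>6}: t = {res.statistic:6.2f}, p = {res.pvalue:.3g}")
```

The difference in means is identical in all three rows; only the precision changes, and with it the p-value.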

### 6.2 Why p = 0.05 is arbitrary (and not magic)

In many papers, p < 0.05 is used as a cut-off to call a result **“statistically significant”**.

- This threshold is largely **historical and conventional**, not a scientific law.
- Using 0.05 means we **accept** that, *if there were no true effect*, we would falsely
  declare a difference in about **5%** of studies (Type I error).
- Some fields use stricter thresholds (e.g. 0.01 or 0.001); others avoid hard cut-offs
  and report p-values as a **continuum of evidence**.

Better interpretation:

- p = 0.049 and p = 0.051 are **essentially the same** in terms of evidence.
- Instead of “significant / not significant”, think:
  - **Smaller p-value** → data are less compatible with “no effect”.
  - **Larger p-value** → data are more compatible with “no effect”.
- Always combine p-values with:
  - **effect sizes**,
  - **confidence intervals**, and
  - **subject-matter knowledge**.

### 6.3 False positives and the role of sample size

If we repeatedly compare two groups that are **truly identical** (no real difference):

- and we use p < 0.05 as our rule,
- we will still find “significant” differences in about **5%** of comparisons, **purely by chance**.
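
A direct consequence is that the chance of at least one false positive grows quickly with the number of comparisons. As a small arithmetic sketch, assuming the tests are independent and no true differences exist, that probability is 1 − (1 − α)^k for k tests:

```python
alpha = 0.05

# Probability of at least one "significant" result among k independent tests
# when there is no true difference in any of them.
prob_any_false_positive = {}
for k in [1, 5, 20, 100]:
    prob_any_false_positive[k] = 1 - (1 - alpha) ** k
    print(f"{k:>3} tests: {prob_any_false_positive[k]:.1%}")
```

In a Table 1 with many rows, tests are rarely fully independent, so these numbers are a rough guide rather than exact values.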

In large datasets:

- even tiny, practically irrelevant differences can give very small p-values.
- this is why “statistically significant” does **not** automatically mean “important”.

> **Example using FB2NEP:**  
> Suppose you find that BMI is 0.4 kg/m² higher in men than in women, with p < 0.001.  
> This is *statistically* significant in 25,000 people, but is unlikely to change clinical practice on its own.

### 6.4 Practical significance: what actually matters?

Always ask:

1. **Effect size** – How large is the difference or association in real units?
   - Is a 0.4 kg/m² difference in BMI meaningful?
   - Is a 0.2 mmol/L difference in cholesterol important?
2. **Uncertainty** – What do confidence intervals look like?
   - Is the interval narrow (precise estimate) or wide (uncertain)?
3. **Context** – Would this difference change practice, policy, or interpretation?
   - Would a clinician treat patients differently?
   - Would a policymaker change guidelines?

Plots and descriptive summaries are often as important as p-values when deciding
whether a finding is **meaningful**.

In FB2NEP and in real studies:

- use p-values as **one piece of evidence**,
- avoid treating 0.05 as a magic border between “true” and “false”,
- and always relate results back to **public-health and clinical relevance**.



In [None]:
# 6.5 Simulation: how often do we see "significant" differences by chance?

import numpy as np
from scipy.stats import ttest_ind
import matplotlib.pyplot as plt

# For reproducibility
np.random.seed(11088)

# Parameters – students can play with these
n_sim = 1000          # number of simulated experiments
n_per_group = 100     # sample size in each group
alpha = 0.05          # significance threshold

p_values = []

for i in range(n_sim):
    # Generate two groups from the *same* distribution (no true difference)
    x = np.random.normal(loc=0, scale=1, size=n_per_group)
    y = np.random.normal(loc=0, scale=1, size=n_per_group)

    # Two-sample t-test (Welch)
    stat, p = ttest_ind(x, y, equal_var=False)
    p_values.append(p)

p_values = np.array(p_values)

# How many "significant" results did we get?
n_sig = (p_values < alpha).sum()
prop_sig = n_sig / n_sim * 100

print(f"Number of simulations       : {n_sim}")
print(f"Sample size per group       : {n_per_group}")
print(f"Alpha (significance level)  : {alpha}")
print(f"Significant results (p < α) : {n_sig} ({prop_sig:.1f}%)")

# Optional: look at the distribution of p-values
plt.figure(figsize=(6, 4))
plt.hist(p_values, bins=20)
plt.xlabel("p-value")
plt.ylabel("Count")
plt.title("Distribution of p-values when there is no true difference")
plt.tight_layout()
plt.show()


**Questions to consider:**

- How close is the percentage of “significant” results to the chosen α (0.05)?
- What happens if you:
  - change `alpha` to 0.01?
  - increase or decrease `n_per_group`?
- How does this simulation illustrate the idea of **Type I error** and the arbitrariness of the 0.05 threshold?


## 7. Summary

In this workbook you:

- Performed **descriptive exploration** of key continuous and categorical variables.
- Used **histograms** and **boxplots** to check distributions and group differences.
- Built a simple baseline **Table 1** summarising characteristics by sex.
- Ran basic group comparisons using *t*-tests, χ²-tests and one-way ANOVA.
- Reflected on the difference between **statistical** and **practical** significance.

These steps are part of routine data exploration in nutritional epidemiology
and provide the foundation for more advanced modelling in later workbooks.