# Data Handling and Basic Analysis (FB2NRP)
*Version 0.0.3*

This workbook introduces the foundations of **data handling and basic analysis**, using a small **synthetic RCT dataset** that mimics some nutrition trials:

- Blood pressure change after different amounts of coffee  
- Blood glucose after different cereals  
- Appetite VAS after different test foods  

The data are **simulated** and include **age** and **sex**.  
The dataset is made available as a pandas DataFrame called `df` by the **setup cells below** (bootstrap, library imports, and data generation).

The workbook is designed to work both as:

- a **lecture resource** (with explanations and examples), and  
- a **practical notebook** (with commented Python code cells).

By the end of the workbook you should be able to:

- Distinguish between **categorical**, **ordinal**, and **continuous** variables  
- Explore a dataset with `df.info()` and `df.describe()`  
- Compute and report **mean**, **SD**, **median**, **IQR** for continuous data  
- Understand common **distributions** (normal, log-normal, *t*, Poisson) at a light-touch level  
- Explore **distributions** and use Q–Q plots to assess normality  
- Create **contingency tables** for categorical data  
- Describe data appropriately for publication (e.g. a simple **Table 1**)  
- Understand the basics of **NHST** (H0 vs H1), **p-values**, and **95% CIs**  
- See why 0.05 is not a magical threshold for p-values (we will use **α = 0.0314**)  
- Compare two or more groups using **parametric** and **non-parametric** tests  
- Remember that **statistics are tools, not an oracle**


## 0. From raw data to analysis: the basic flow

In almost every quantitative study, the **flow of analysis** is:

1. **View the data**  
   - Load the dataset into your software (here: pandas DataFrame `df`).  
   - Look at a few rows (`df.head()`).  
   - Check variable names, data types, and obvious issues (`df.info()`).

2. **Clean the data**  
   - Handle missing values (decide when to impute, when to drop).  
   - Detect obviously impossible values (e.g. age = −5, VAS > 100).  
   - Fix coding problems (e.g. "Male" vs "M" vs "m").

3. **Standardise the data**  
   - Ensure variables use **consistent units** (e.g. all blood pressure in mmHg, all glucose in mmol/L).  
   - Recode categories in a consistent way (e.g. `F`/`M`, or 0/1 with clear labels).

4. **Analyse the data**  
   - Start with **descriptive statistics** (means, medians, counts, percentages).  
   - Present a clear **Table 1** of baseline characteristics.  
   - Move to **statistical inference** (p-values, confidence intervals, models) only once you understand the data.

In this workbook we follow the same structure: first **understand and describe**, then **compare and infer**.
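
The view, clean, and standardise steps can be sketched on a tiny made-up table. Everything in this example (the `raw` table, its column names, and the plausibility limits for age) is invented for illustration and is not part of the workbook dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with typical problems (not the workbook's `df`)
raw = pd.DataFrame({
    "sex": ["Male", "F", "m", "female", "M"],
    "age": [23, -5, 31, 27, 45],
    "glucose_mg_dl": [90.0, 108.0, np.nan, 99.0, 126.0],
})

# 1. View: look at the first rows
print(raw.head())

# 2. Clean: harmonise the sex coding and set impossible ages to missing
clean = raw.copy()
clean["sex"] = clean["sex"].str.strip().str.upper().str[0]
clean["age"] = clean["age"].where(clean["age"].between(0, 120))

# 3. Standardise: convert glucose from mg/dL to mmol/L (divide by 18.016)
clean["glucose_mmol_l"] = clean["glucose_mg_dl"] / 18.016

print(clean)
```

Whether an impossible value is set to missing, corrected, or the row dropped is a study-specific decision; the sketch only shows one option.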


In [None]:
# ============================================================
# FB2NRP bootstrap cell (works both locally and in Colab)
#
# What this cell does:
# - Locally: expects you to open the notebook from *inside*
#   the fb2nrp-datahandling repository (e.g. repo/notebooks).
#   It walks up the directory tree to find scripts/bootstrap.py.
# - In Colab: if the repo is not found, it clones it from GitHub
#   into /content/fb2nrp-datahandling.
# - Loads and runs scripts/bootstrap.py.
# - Calls init() from bootstrap.py and keeps the returned context as `CTX`.
#   The dataset `df` itself is created in the data-generation cell further down.
# ============================================================

import os
import sys
import pathlib
import subprocess
import importlib.util

REPO_URL = "https://github.com/ggkuhnle/fb2nrp-datahandling.git"
REPO_DIR = "fb2nrp-datahandling"

def in_colab() -> bool:
    """Return True if running inside Google Colab."""
    try:
        import google.colab  # type: ignore  # noqa: F401
        return True
    except ImportError:
        return False

# Make sure the process cwd is valid
try:
    cwd = pathlib.Path.cwd()
except FileNotFoundError:
    raise RuntimeError(
        "Current working directory no longer exists.\n"
        "Please restart the kernel from inside the fb2nrp-datahandling repository "
        "(e.g. open the notebook from repo/notebooks and try again)."
    )

# Try to find the repo root by walking up the directory tree
repo_root = None
for parent in [cwd] + list(cwd.parents):
    if (parent / "scripts" / "bootstrap.py").is_file():
        repo_root = parent
        break

if repo_root is not None:
    # We are somewhere inside an existing clone (local or Colab)
    os.chdir(repo_root)
    repo_root = pathlib.Path.cwd()
    print(f"Repository root detected at: {repo_root}")
else:
    # Repo not found by walking up
    if in_colab():
        # In Colab: clone into /content/fb2nrp-datahandling
        base_dir = pathlib.Path("/content")
        os.chdir(base_dir)
        repo_root = base_dir / REPO_DIR
        if not repo_root.is_dir():
            print(f"Cloning repository from {REPO_URL} into {repo_root} ...")
            subprocess.run(["git", "clone", REPO_URL, str(repo_root)], check=True)
        else:
            print(f"Using existing repository at {repo_root}")
        os.chdir(repo_root)
        repo_root = pathlib.Path.cwd()
        print(f"Repository root set to: {repo_root}")
    else:
        # Local but not inside the repo: fail with a clear message
        raise RuntimeError(
            "Could not find fb2nrp-datahandling repository root.\n"
            "Please make sure you open this notebook from inside the "
            "`fb2nrp-datahandling` repository (e.g. repo/notebooks) and "
            "then re-run this cell."
        )

# ------------------------------------------------------------
# Load scripts/bootstrap.py as a module and call init()
# ------------------------------------------------------------

bootstrap_path = repo_root / "scripts" / "bootstrap.py"

if not bootstrap_path.is_file():
    raise FileNotFoundError(
        f"Could not find {bootstrap_path}. "
        "Please check that the fb2nrp-datahandling repository structure is intact."
    )

spec = importlib.util.spec_from_file_location("fb2nrp_bootstrap", bootstrap_path)
bootstrap = importlib.util.module_from_spec(spec)
sys.modules["fb2nrp_bootstrap"] = bootstrap
spec.loader.exec_module(bootstrap)

# CTX will contain paths and settings defined in bootstrap.py
CTX = bootstrap.init()

for name in ["REPO_NAME", "REPO_URL"]:
    if hasattr(bootstrap, name):
        globals()[name] = getattr(bootstrap, name)

print("Bootstrap completed successfully.")
print("The context object is available as `CTX`.")


In [None]:
# ============================================================
# Setup: scientific Python libraries and plotting style
#
# Assumes the bootstrap cell above has already created:
#   - CTX : context object with paths and settings
# ============================================================

# Data handling and numerical computing
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical tests and distributions
import scipy.stats as st

# Display options (optional but helpful)
pd.set_option("display.max_rows", 20)
pd.set_option("display.max_columns", 20)

# Plot style
sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

print("Libraries loaded.")


In [None]:
# ============================================================
# Generate the synthetic dataset for this workbook
# ============================================================

# The helper function simulate_practical_data() returns a
# small DataFrame that mimics some nutrition RCT practicals.
from scripts.helpers import simulate_practical_data, VARIABLES

# Use a fixed seed for reproducibility
df = simulate_practical_data(seed=11088)

print(f"Dataset loaded with {len(df)} rows and {df.shape[1]} columns.")


## 1. Study variables and data types

Before running any analysis, we need to understand **what kind of variables** we have.  
Different types require different summaries and different statistical tests.

### 1.1 Overview of variable types

| Variable type | What it means | Examples | How to summarise | Appropriate analyses |
|--------------|---------------|----------|------------------|----------------------|
| **Categorical (nominal)** | Distinct labels with **no natural order**. Values are names only. | Sex (F/M), cereal arm (bran/cornflakes/muesli), test food, favourite animal (hippo optional). | Counts and percentages. | Chi-squared tests, Fisher’s exact test, logistic/multinomial regression. |
| **Ordinal** | Categories **with a natural order**, but **unknown (not necessarily equal) spacing** between levels. | Likert scale 1–5, hunger rating (low/medium/high), symptom severity. | Counts/percentages; sometimes median (IQR) of coded scores with justification. | Mann–Whitney test, Kruskal–Wallis, ordinal logistic regression. |
| **Continuous (or approx. continuous)** | Numerical values where **differences and averages are meaningful**. Often many possible values. | Age (years), BP change (mmHg), glucose, VAS (0–100, often treated as continuous). | Mean ± SD (if symmetric), or median (IQR) if skewed. | t-tests, ANOVA, correlation, linear regression; non-parametric alternatives if needed. |

A few reminders:

- Coding categorical data as numbers **does not** turn them into continuous variables.  
- Ordinal scales can sometimes be treated as continuous, but **only** if they have many levels and a well-behaved distribution.  
- VAS scores (0–100) occupy a grey zone: technically ordinal, often acceptable to treat as continuous in nutrition.
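
As a small illustration of the difference between labels and order, pandas can store an ordinal variable as an *ordered* categorical. The `dose` values below are made up; the sketch only assumes that the intended order is low < medium < high.

```python
import pandas as pd

# Hypothetical ordered labels (not taken from the workbook's `df`)
dose = pd.Series(["low", "high", "medium", "low", "high"])

# Plain string sorting is alphabetical and ignores the intended order
alphabetical = sorted(dose.unique())

# An ordered categorical makes the order explicit without inventing numbers
dose_ord = pd.Categorical(dose, categories=["low", "medium", "high"], ordered=True)
counts = pd.Series(dose_ord).value_counts().sort_index()

print(alphabetical)
print(counts)
print(dose_ord.min(), dose_ord.max())
```

The categories can now be ranked and counted in order, but there is still no meaningful mean.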


### 1.2 Variables in our synthetic dataset

Our synthetic dataset `df` contains (one row per participant):

| Variable | Type | Description |
|----------|------|-------------|
| `sex` | Categorical (nominal) | Participant sex (F/M) |
| `age` | Continuous | Age in years |
| `coffee_arm` | Categorical (ordered levels; used here as group labels) | Intervention: low / medium / high coffee |
| `cereal_arm` | Categorical (nominal) | Cereal: bran / cornflakes / muesli |
| `food_arm` | Categorical (nominal) | Test food: apple / biscuit / yoghurt |
| `bp_change` | Continuous | Change in blood pressure (mmHg) |
| `glucose` | Continuous | Postprandial blood glucose (arbitrary units) |
| `appetite_vas` | Continuous / ordinal | VAS 0–100; treated here as approx. continuous |

For completeness, we can also display the helper metadata `VARIABLES` that describes each column.


In [None]:
VARIABLES


## 2. First look at the dataset

We start with a **quick overview** of `df`:

- `df.head()` shows the first few rows (useful to spot obvious coding issues).  
- `df.info()` summarises variables, data types, and missing values.

This is the **"view"** step of the analysis flow.


In [None]:
# First few rows of the dataset
df.head()


In [None]:
# Overall structure of the DataFrame (types, missingness)
df.info()


### 2.1 Missing values and impossible values

We should also check for **missing values** and obviously **impossible values** (e.g. negative age, VAS > 100, glucose = 0 in a living participant).

Our simulator does not generate missing or impossible values, but in real data these checks are essential and sometimes the longest part of the analysis.
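
A range check for impossible values could look like the sketch below. The `limits` dictionary and the `toy` table are assumptions made for this example; sensible limits depend on the study and its inclusion criteria.

```python
import pandas as pd

# Hypothetical plausibility limits; adjust them to your own study
limits = {"age": (18, 100), "appetite_vas": (0, 100)}

# Made-up data containing one impossible value per column
toy = pd.DataFrame({"age": [25, -5, 40], "appetite_vas": [55, 101, 70]})

n_bad = {}
for col, (lo, hi) in limits.items():
    # Count non-missing values outside the allowed range
    out_of_range = toy[col].notna() & ~toy[col].between(lo, hi)
    n_bad[col] = int(out_of_range.sum())
    print(f"{col}: {n_bad[col]} value(s) outside [{lo}, {hi}]")
```

Missing values are deliberately not counted here, because they are reported separately by `isna()`.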


In [None]:
# Count of missing values per variable
df.isna().sum()


## 3. Describing continuous variables

For continuous variables (age, BP change, glucose, VAS) we want to describe:

1. **Where the values tend to lie** (central tendency).  
2. **How much they vary** (dispersion).  
3. **What the distribution looks like** (shape).

In this section we first look at **distributions**, then define **central tendency and dispersion**, and finally compute appropriate **summary statistics**.


### 3.1 Distributions and how to look at them

Many statistical methods assume that variables follow certain **distributions**.  
For this workbook, four are particularly useful:

- **Normal distribution** (bell-shaped, symmetric).  
- **Log-normal distribution** (skewed; log of the variable is normal).  
- **t-distribution** (like normal, but with heavier tails; used in t-tests).  
- **Poisson distribution** (for **counts**, especially of rare events).

We do not need the formulas; we just need to recognise their shapes and know when they are plausible models.

We usually look at distributions in two ways:

- **Histograms/density plots**: show the shape of the data.  
- **Q–Q plots (Quantile–Quantile plots)**: compare the quantiles of the data to those of a reference distribution (often normal).


#### 3.1.1 Normal and log-normal distributions

- A variable is **approximately normal** when its histogram is symmetric and bell-shaped.  
  - Example: adult height, measurement error, often blood pressure in reasonably homogeneous groups.
- A variable is **log-normal** when its **logarithm** is approximately normal.  
  - Example: many biomarkers and concentrations, where values are strictly positive and skewed to the right.

Below we simulate data from a normal distribution to illustrate the shape.


In [None]:
# Simulated example: normal distribution
rng = np.random.default_rng(11088)
normal_sample = rng.normal(loc=0, scale=1, size=2000)

sns.histplot(normal_sample, kde=True)
plt.title("Simulated normal distribution (mean = 0, SD = 1)")
plt.xlabel("Value")
plt.ylabel("Count")
plt.show()


Below we simulate data from a log-normal distribution. Notice the **right-skewed** shape: many observations near the lower end, with a long tail of higher values.


In [None]:
# Simulated example: log-normal distribution
lognormal_sample = rng.lognormal(mean=0, sigma=0.6, size=2000)

sns.histplot(lognormal_sample, kde=True)
plt.title("Simulated log-normal distribution")
plt.xlabel("Value")
plt.ylabel("Count")
plt.show()
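

The defining property of a log-normal variable (its logarithm is approximately normal) can be checked numerically. The sketch below draws its own sample with an arbitrary seed and compares the skewness before and after taking logs; a skewness near 0 indicates a roughly symmetric distribution.

```python
import numpy as np
import scipy.stats as st

# Independent demo sample (seed chosen arbitrarily for this sketch)
rng_demo = np.random.default_rng(0)
sample = rng_demo.lognormal(mean=0, sigma=0.6, size=2000)

skew_raw = st.skew(sample)          # clearly positive: right-skewed
skew_log = st.skew(np.log(sample))  # close to 0: roughly symmetric

print(f"Skewness of raw values: {skew_raw:.2f}")
print(f"Skewness of log values: {skew_log:.2f}")
```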


#### 3.1.2 t-distribution

The **t-distribution** appears when we:

- estimate means from **small samples**, and  
- use the **sample SD** instead of the true population SD.

It looks similar to the normal distribution but has **heavier tails**, especially with **small degrees of freedom** (abbreviated df, not to be confused with our DataFrame `df`).  
This matters for **t-tests** and confidence intervals based on small samples.

Below we plot t-distributions with different degrees of freedom and compare them to the standard normal.


In [None]:
# t-distributions vs standard normal
x = np.linspace(-4, 4, 400)
pdf_normal = st.norm.pdf(x, loc=0, scale=1)
pdf_t3 = st.t.pdf(x, df=3)
pdf_t10 = st.t.pdf(x, df=10)
pdf_t30 = st.t.pdf(x, df=30)

plt.plot(x, pdf_normal, label="Normal")
plt.plot(x, pdf_t3, linestyle="--", label="t, df=3")
plt.plot(x, pdf_t10, linestyle=":", label="t, df=10")
plt.plot(x, pdf_t30, linestyle="-.", label="t, df=30")
plt.title("Normal vs t-distributions")
plt.xlabel("Value")
plt.ylabel("Density")
plt.legend()
plt.show()
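

The practical consequence of the heavier tails is a larger critical value. As a rough sketch, the code below compares the two-sided 95% critical value of the standard normal with that of t-distributions for a few arbitrarily chosen degrees of freedom.

```python
import scipy.stats as st

# Two-sided 95% critical value of the standard normal (about 1.96)
z_crit = st.norm.ppf(0.975)
print(f"normal:      {z_crit:.3f}")

# Corresponding critical values of the t-distribution
t_crit = {}
for dof in (3, 10, 30, 1000):
    t_crit[dof] = st.t.ppf(0.975, df=dof)
    print(f"t, df={dof:<5}: {t_crit[dof]:.3f}")
```

With few degrees of freedom the interval has to be noticeably wider; with large samples the t-distribution and the normal are practically identical.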


#### 3.1.3 Poisson distribution

The **Poisson distribution** is a model for **counts** of events in a fixed time or space, especially when events are:

- **independent**, and  
- individually **rare**.

Examples:

- Number of adverse events per participant in a trial.  
- Number of emergency admissions per day in a small hospital.  
- Number of typing errors per page in a report (for some of us).

It has a single parameter **λ (lambda)**, which is both the **mean** and the **variance** of the distribution.

Below we show the probabilities for a Poisson distribution with λ = 2.5.


In [None]:
# Poisson distribution example (lambda = 2.5)
lam = 2.5
k_values = np.arange(0, 11)  # 0 to 10 events
pmf = st.poisson.pmf(k_values, mu=lam)

# Note: the old `use_line_collection` argument was removed from recent matplotlib
plt.stem(k_values, pmf)
plt.title("Poisson(λ = 2.5) distribution")
plt.xlabel("Number of events (k)")
plt.ylabel("Probability P(X = k)")
plt.show()
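

The statement that λ is both the mean and the variance can be checked in two ways: from the theoretical moments and from a simulated sample. This is only a sketch; the seed and the sample size are arbitrary choices.

```python
import numpy as np
import scipy.stats as st

lam_demo = 2.5

# Theoretical mean and variance of Poisson(lambda)
mean_theory, var_theory = st.poisson.stats(mu=lam_demo, moments="mv")

# Empirical mean and variance from simulated counts
draws = np.random.default_rng(0).poisson(lam_demo, size=100_000)

print(f"Theory:     mean = {float(mean_theory):.2f}, variance = {float(var_theory):.2f}")
print(f"Simulation: mean = {draws.mean():.2f}, variance = {draws.var(ddof=1):.2f}")
```

In real count data the variance is often larger than the mean (overdispersion), which is a hint that a simple Poisson model may not fit.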


#### 3.1.4 Looking at our data: BP change

Now we return to our dataset and look at the distribution of **blood pressure change** (`bp_change`).

We use:

- a **histogram** with a smooth density estimate, and  
- a **Q–Q plot** against the normal distribution.


In [None]:
# Histogram and density for blood pressure change
sns.histplot(df["bp_change"], kde=True)
plt.title("Distribution of BP change")
plt.xlabel("BP change (mmHg)")
plt.ylabel("Count")
plt.show()


A **Q–Q plot** compares the quantiles of our data to those of a perfect normal distribution.

- If the points lie roughly on a straight line, the data are not wildly inconsistent with normality.  
- Systematic curves (S-shape, heavy tails) suggest deviations such as skewness or outliers.


In [None]:
# Q–Q plot to assess normality of BP change
st.probplot(df["bp_change"], dist="norm", plot=plt)
plt.title("Q–Q plot of BP change")
plt.show()
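

Without a plot, `st.probplot` also returns the correlation *r* between the ordered data and the theoretical normal quantiles, which summarises how straight the Q–Q line is. The sketch below uses two simulated samples (not the workbook data) to show how *r* drops for a skewed variable; it is a rough numerical companion to the plot, not a formal normality test.

```python
import numpy as np
import scipy.stats as st

rng_demo = np.random.default_rng(0)
normal_x = rng_demo.normal(size=500)
skewed_x = rng_demo.lognormal(mean=0, sigma=0.8, size=500)

# probplot returns ((theoretical, ordered), (slope, intercept, r))
(_, _), (_, _, r_normal) = st.probplot(normal_x, dist="norm")
(_, _), (_, _, r_skewed) = st.probplot(skewed_x, dist="norm")

print(f"r for normal sample: {r_normal:.3f}")
print(f"r for skewed sample: {r_skewed:.3f}")
```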


### 3.2 Central tendency and dispersion

Once we have a sense of the **shape** of a distribution, we can talk about:

- **Central tendency** – where the values tend to lie.  
- **Dispersion (spread)** – how much they vary around the centre.

Common choices:

- **Mean** (average) and **standard deviation (SD)**  
  - Most useful when the distribution is not too skewed.  
- **Median** and **interquartile range (IQR)**  
  - More robust to skewed distributions and outliers.

Choice of summary should be guided by the **distributional shape**, not by habit.


### 3.3 Mean, SD, median, and IQR

Definitions:

- **Mean**: add up all observations and divide by the number of observations.  
- **Standard deviation (SD)**: describes the typical distance of observations from the mean (roughly, the square root of the average squared deviation).  
- **Median**: the middle value when the data are ordered (50% below, 50% above).  
- **Interquartile range (IQR)**: difference between the 75th percentile (Q3) and 25th percentile (Q1).

Rules of thumb:

- If the distribution is **roughly symmetric** → report *mean ± SD*.  
- If the distribution is **clearly skewed** → report *median (IQR)*.

In practice, many papers report both, at least in supplementary material.
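
To see why the median and IQR are called robust, the following sketch adds a single extreme value to a small made-up sample and recomputes all four summaries.

```python
import numpy as np

def summarise(x):
    """Return mean, SD (ddof=1), median and IQR of a 1-D array."""
    q1, q3 = np.percentile(x, [25, 75])
    return {
        "mean": x.mean(),
        "sd": x.std(ddof=1),
        "median": np.median(x),
        "iqr": q3 - q1,
    }

values = np.array([4.0, 5.0, 5.0, 6.0, 7.0])  # made-up observations
with_outlier = np.append(values, 40.0)        # same data plus one extreme value

summary_clean = summarise(values)
summary_outlier = summarise(with_outlier)

for name in ("mean", "sd", "median", "iqr"):
    print(f"{name:>6}: {summary_clean[name]:6.2f} -> {summary_outlier[name]:6.2f}")
```

The mean and SD change substantially, whereas the median and IQR move only slightly.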


### 3.4 Sample vs population and the standard error of the mean (SEM)

In practice we almost never observe the **entire population**. We observe a **sample** and use it to say something about the population.

- **Population mean (μ)**: the true average in the entire population (usually unknown).  
- **Sample mean (x̄)**: the average in our sample.

If we repeatedly took new samples of the same size and calculated the mean each time, those sample means would vary.

- The **standard deviation (SD)** describes variability **between individuals**.  
- The **standard error of the mean (SEM)** describes variability **between sample means**.

For a sample of size *n*, and sample SD = *s*, a common estimate is:

$$\text{SEM} \approx \frac{s}{\sqrt{n}}.$$

SEM is mainly used when constructing **confidence intervals** and performing **hypothesis tests**, not for describing raw data in a Table 1.
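
The formula can be checked with a short simulation. The sketch below assumes a normal population with mean 50 and SD 10 (arbitrary values) and samples of size 25, so the SD of the sample means should be close to 10 / √25 = 2.

```python
import numpy as np
import scipy.stats as st

rng_demo = np.random.default_rng(0)
population_sd, n = 10.0, 25

# SEM estimated from one sample: s / sqrt(n)
sample = rng_demo.normal(loc=50, scale=population_sd, size=n)
sem_manual = sample.std(ddof=1) / np.sqrt(n)
sem_scipy = st.sem(sample)  # same calculation (ddof=1 by default)

# SD of many sample means: should be close to sigma / sqrt(n) = 2
sample_means = rng_demo.normal(50, population_sd, size=(5000, n)).mean(axis=1)

print(f"SEM from one sample:       {sem_manual:.2f} (scipy: {sem_scipy:.2f})")
print(f"SD of 5000 sample means:   {sample_means.std(ddof=1):.2f}")
print(f"Theoretical sigma/sqrt(n): {population_sd / np.sqrt(n):.2f}")
```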


In [None]:
# Basic summary statistics for numeric variables
# (mean, SD, min, max, quartiles)
df.describe()


### 3.5 Descriptive statistics for key continuous outcomes

Let us compute both mean/SD and median/IQR for the three main continuous outcomes:

- `bp_change`  
- `glucose`  
- `appetite_vas`

This table is close to what you might include in a **results section** or a **Table 1**.


In [None]:
cont_vars = ["bp_change", "glucose", "appetite_vas"]
rows = []

for var in cont_vars:
    series = df[var].dropna()
    mean = series.mean()
    sd = series.std()
    median = series.median()
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    rows.append({
        "variable": var,
        "mean": mean,
        "sd": sd,
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr
    })

summary_cont = pd.DataFrame(rows)
summary_cont


## 4. Categorical variables: nominal vs ordinal

For **categorical variables** we usually report **counts and percentages**.

Two subclasses are important:

- **Nominal** (unordered)  
  - Examples: sex (F/M), cereal arm, country, favourite snack.  
  - The labels have no inherent ranking.

- **Ordinal** (ordered)  
  - Examples: Likert scales, symptom severity (mild/moderate/severe).  
  - There is a natural order, but the distance between categories is not known.

### 4.1 Why not just convert categories into numbers and treat them as continuous?

It is tempting to code categories as numbers (e.g. *apple* = 1, *biscuit* = 2, *yoghurt* = 3) and then compute a mean.

This is **usually a bad idea** because:

- The numerical codes are **arbitrary labels**, not real quantities.  
- The differences between codes (2 − 1 vs 3 − 2) have **no scientific meaning**.  
- Treating them as continuous in a t-test or regression can give misleading results.

Instead, we summarise categorical variables using **counts**, **percentages**, and **contingency tables**.
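
A short sketch makes the problem concrete: two equally arbitrary codings of the same made-up food labels give different "means", while the counts stay the same.

```python
import pandas as pd

# Made-up test-food labels
foods = pd.Series(["apple", "biscuit", "yoghurt", "yoghurt", "apple", "yoghurt"])

# Two equally arbitrary numeric codings of the same labels
coding_a = {"apple": 1, "biscuit": 2, "yoghurt": 3}
coding_b = {"yoghurt": 1, "apple": 2, "biscuit": 3}

mean_a = foods.map(coding_a).mean()
mean_b = foods.map(coding_b).mean()
counts = foods.value_counts()

print(f"'Mean food' with coding A: {mean_a:.2f}")
print(f"'Mean food' with coding B: {mean_b:.2f}")
print(counts)
```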


In [None]:
# Distribution of participants by coffee arm (counts)
df["coffee_arm"].value_counts().to_frame(name="count")


In [None]:
# Contingency table: sex by coffee arm
pd.crosstab(df["sex"], df["coffee_arm"], margins=True)
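

Counts are usually reported together with percentages. As a sketch on a made-up table (the `toy` data and the column name `arm` are invented here), `pd.crosstab` can produce column percentages through its `normalize` argument.

```python
import pandas as pd

# Made-up data: sex and study arm for eight participants
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "F", "M", "M", "F", "M"],
    "arm": ["low", "low", "high", "high", "low", "high", "high", "high"],
})

# Counts per cell
counts = pd.crosstab(toy["sex"], toy["arm"])

# Percentages within each arm (each column sums to 100)
col_pct = pd.crosstab(toy["sex"], toy["arm"], normalize="columns") * 100

print(counts)
print(col_pct.round(1))
```

Use `normalize="index"` for row percentages; which one is appropriate depends on the question being asked.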


## 5. Data description as "Table 1"

Most clinical and nutrition papers include a **Table 1** that describes the **baseline characteristics** of the study sample.

Typical elements:

- A column for **"All participants"**.  
- Additional columns for **treatment arms** (or exposure groups).  
- Rows for key variables: age, sex, BMI, main outcomes, etc.  
- Continuous variables shown as *mean ± SD* or *median (IQR)*.  
- Categorical variables shown as *n (%)*.

Below we build a **very simple Table 1** describing age and sex by coffee arm.  
This is for illustration only – in real work you would usually format the table more nicely (e.g. for LaTeX or Word).


In [None]:
# Simple Table 1: age and sex by coffee arm

group_var = "coffee_arm"
continuous_vars = ["age"]
categorical_vars = ["sex"]

# Sorted list of arm labels (alphabetical order, ignoring missing values)
arms = sorted(df[group_var].dropna().unique())

table1_rows = []

# Continuous variables: report mean ± SD
for var in continuous_vars:
    row = {"variable": var, "type": "continuous"}
    for arm in arms:
        sub = df[df[group_var] == arm][var].dropna()
        m = sub.mean()
        s = sub.std()
        row[arm] = f"{m:.1f} ± {s:.1f}"
    table1_rows.append(row)

# Categorical variables: report n (%)
for var in categorical_vars:
    levels = sorted(df[var].dropna().unique())
    for level in levels:
        row = {
            "variable": f"{var} = {level}",
            "type": "categorical"
        }
        for arm in arms:
            sub = df[df[group_var] == arm]
            n = (sub[var] == level).sum()
            total = len(sub)
            perc = 100 * n / total if total > 0 else np.nan
            row[arm] = f"{n} ({perc:.1f}%)"
        table1_rows.append(row)

table1 = pd.DataFrame(table1_rows)
table1
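

The table above has one column per arm but no "All participants" column. One possible way to add it is sketched below on a tiny made-up dataset; the helper `mean_sd` and the `toy` table are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Made-up data: study arm and age for five participants
toy = pd.DataFrame({
    "arm": ["low", "low", "high", "high", "high"],
    "age": [20.0, 24.0, 30.0, 22.0, 26.0],
})

def mean_sd(series: pd.Series) -> str:
    """Format a continuous variable as 'mean ± SD' with one decimal."""
    return f"{series.mean():.1f} ± {series.std():.1f}"

# Overall column first, then one column per arm
row = {"variable": "age", "All participants": mean_sd(toy["age"])}
for arm, sub in toy.groupby("arm"):
    row[arm] = mean_sd(sub["age"])

table1_toy = pd.DataFrame([row])
print(table1_toy)
```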


### 5.1 Example text for methods/results

Using a table like the one above, you might write in a paper:

- *"Participants had a mean age of 22.1 ± 3.1 years."*  
- *"Overall, 40% of participants were male (n = 72/180)."*  
- *"Baseline blood pressure did not differ meaningfully between coffee arms."*

The exact wording depends on the study, but the principle is always:

- Describe **who** was studied.  
- Use **appropriate summaries** for each variable type.  
- Make it possible for the reader to judge how well the sample represents the population of interest.


## 6. Statistical inference and NHST

So far we have described the **sample**. Statistical inference is about what we can reasonably say about the **underlying population**.

In classical **null hypothesis significance testing (NHST)** we:

1. Formulate a **null hypothesis (H0)**, usually "no difference" or "no effect".  
2. Formulate an **alternative hypothesis (H1)**, e.g. "there is a difference".  
3. Choose a test statistic (e.g. a t-statistic) and compute it from the data.  
4. Compute a **p-value**: the probability (under H0) of observing a result *at least as extreme* as the one we saw.  
5. Compare the p-value to a threshold **α** (alpha) to decide whether the result is *compatible* with H0.

Example in this workbook:

- H0: mean BP change is the same in **low** and **high** coffee arms.  
- H1: mean BP change is different in the two arms.

We never prove H0 or H1; we simply assess how **compatible** the data are with H0.
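
Steps 3 and 4 can be made concrete for a two-group comparison. The sketch below computes a Welch t-statistic and its two-sided p-value by hand on simulated data (arbitrary seed, means, and SDs) and compares the result with `scipy.stats.ttest_ind`.

```python
import numpy as np
import scipy.stats as st

rng_demo = np.random.default_rng(0)
a = rng_demo.normal(0.0, 1.0, 30)  # simulated group A
b = rng_demo.normal(0.5, 1.5, 30)  # simulated group B

# Step 3: test statistic = difference in means / standard error
va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation for the degrees of freedom
dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))

# Step 4: two-sided p-value = probability of |t| at least this large under H0
p_manual = 2 * st.t.sf(abs(t_manual), df=dof)

t_scipy, p_scipy = st.ttest_ind(a, b, equal_var=False)
print(f"By hand: t = {t_manual:.3f}, p = {p_manual:.4f}")
print(f"SciPy:   t = {t_scipy:.3f}, p = {p_scipy:.4f}")
```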


### 6.1 p-values and why 0.05 is not magical

- A **p-value** is *not* the probability that H0 is true.  
- It is the probability of the observed data (or more extreme) *if H0 were true*.

Common misunderstandings:

- p = 0.04 does **not** mean there is a 96% chance that the effect is real.  
- p = 0.06 does **not** mean "no effect".

The widely used threshold **α = 0.05** is **just a convention**:

- 0.049 and 0.051 are essentially the same in terms of evidence.  
- Treating them as "significant" vs "non-significant" can be misleading.  
- In reality, we should look at **effect size**, **uncertainty**, and **context**.

In this workbook we deliberately use an unusual threshold **α = 0.0314** to emphasise that the choice of α is arbitrary and should be justified, not blindly copied.


### 6.2 Confidence intervals (CIs)

A **95% confidence interval (CI)** for a parameter (e.g. difference in means) is constructed such that, in repeated samples, **95% of such intervals would contain the true parameter**.

In practice:

- If a 95% CI for a difference **excludes 0**, the corresponding two-sided test at α = 0.05 is "statistically significant".  
- The CI gives information about **precision** (width of the interval) and **effect size** (where the interval lies).  
- CIs are usually more informative than a bare p-value.


In [None]:
# Example: difference in mean BP change between low and high coffee arms

bp_low = df[df["coffee_arm"] == "low"]["bp_change"].dropna()
bp_high = df[df["coffee_arm"] == "high"]["bp_change"].dropna()

mean_low = bp_low.mean()
mean_high = bp_high.mean()
diff = mean_high - mean_low

# Standard error for difference in means (Welch t-test style)
se_diff = np.sqrt(bp_low.var(ddof=1)/len(bp_low) + bp_high.var(ddof=1)/len(bp_high))

# 95% CI using normal approximation (for teaching; in practice use statsmodels)
z = 1.96
ci_lower = diff - z * se_diff
ci_upper = diff + z * se_diff

print(f"Mean BP change (low coffee):  {mean_low:6.2f} mmHg")
print(f"Mean BP change (high coffee): {mean_high:6.2f} mmHg")
print(f"Difference (high - low):      {diff:6.2f} mmHg")
print(f"Approx. 95% CI: [{ci_lower:6.2f}, {ci_upper:6.2f}] mmHg")
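

The normal approximation (z = 1.96) is slightly too narrow for small samples. A hedged alternative is to take the critical value from a t-distribution with Welch–Satterthwaite degrees of freedom, as sketched below. The function `welch_ci` is a hypothetical helper written for this example, and the data are simulated rather than taken from `df`.

```python
import numpy as np
import scipy.stats as st

def welch_ci(a, b, alpha=0.05):
    """(1 - alpha) confidence interval for mean(b) - mean(a), Welch style."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    se = np.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    t_crit = st.t.ppf(1 - alpha / 2, df=dof)
    diff = b.mean() - a.mean()
    return diff - t_crit * se, diff + t_crit * se

rng_demo = np.random.default_rng(0)
low = rng_demo.normal(1.0, 5.0, 20)   # simulated "low" arm
high = rng_demo.normal(4.0, 5.0, 20)  # simulated "high" arm

ci_95 = welch_ci(low, high, alpha=0.05)
ci_alpha = welch_ci(low, high, alpha=0.0314)

print(f"95% CI:    [{ci_95[0]:.2f}, {ci_95[1]:.2f}]")
print(f"96.86% CI: [{ci_alpha[0]:.2f}, {ci_alpha[1]:.2f}]")
```

A smaller α gives a wider interval: the interval matching α = 0.0314 is a 96.86% CI, not a 95% CI.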


### 6.3 Simulating p-values under the null

To see what p-values look like when **there is no true effect**, we can simulate many small RCTs where both groups come from the same distribution.

If H0 is true and we repeat the experiment many times:

- p-values are roughly **uniformly distributed** between 0 and 1.  
- The proportion of p-values below α is **approximately α** (e.g. about 5% below 0.05).


In [None]:
# Simulate 10 000 null experiments (no true difference between groups)
rng = np.random.default_rng(11088)
p_values = []

n_per_group = 30
n_sim = 10000

for _ in range(n_sim):
    x = rng.normal(0, 1, n_per_group)
    y = rng.normal(0, 1, n_per_group)
    _, p = st.ttest_ind(x, y, equal_var=False)
    p_values.append(p)

alpha_1 = 0.05
alpha_2 = 0.0314

sns.histplot(p_values, bins=30)
plt.axvline(alpha_1, linestyle="--", label="0.05")
plt.axvline(alpha_2, linestyle=":", label="0.0314")
plt.title("Distribution of p-values when there is NO true effect")
plt.xlabel("p-value")
plt.ylabel("Count")
plt.legend()
plt.show()


In [None]:
# Proportion of p-values below each alpha threshold
p_values_array = np.array(p_values)
prop_005 = np.mean(p_values_array < alpha_1)
prop_0314 = np.mean(p_values_array < alpha_2)

print(f"Proportion of p-values < 0.05:   {prop_005:5.3f}")
print(f"Proportion of p-values < 0.0314: {prop_0314:5.3f}")


**Reflection**

- Roughly what proportion of p-values fall below 0.05 when H0 is true?  
- What happens when we tighten the threshold to 0.0314?  
- What does this tell you about treating p = 0.049 and p = 0.051 as fundamentally different?


## 7. Basic applications: parametric vs non-parametric tests

A **parametric test** makes assumptions about the distribution of the data (e.g. normality, similar variances).  
A **non-parametric test** usually works on **ranks** and makes fewer distributional assumptions.

We now compare **BP change** between two coffee arms (e.g. low vs high):

- **Parametric test**: independent-samples t-test (Welch).  
- **Non-parametric test**: Mann–Whitney U test (Wilcoxon rank-sum).

We will use **α = 0.0314** as our decision threshold.


In [None]:
# Select two arms for comparison: low vs high coffee
bp_low = df[df["coffee_arm"] == "low"]["bp_change"].dropna()
bp_high = df[df["coffee_arm"] == "high"]["bp_change"].dropna()

# Independent-samples t-test (Welch, unequal variances)
# Note: with this argument order, t has the sign of (low - high)
t_stat, p_t = st.ttest_ind(bp_low, bp_high, equal_var=False)

# Mann–Whitney U test (non-parametric)
u_stat, p_u = st.mannwhitneyu(bp_low, bp_high, alternative="two-sided")

alpha = 0.0314

print(f"t-test:       t = {t_stat:6.3f}, p = {p_t:6.4f}")
print(f"Mann–Whitney: U = {u_stat:6.1f}, p = {p_u:6.4f}")
print(f"Using alpha = {alpha}")


**Questions**

- Do the conclusions from the t-test and Mann–Whitney test agree at α = 0.0314?  
- Would your conclusion change if you (arbitrarily) switched to α = 0.05?  
- Looking back at the distributions, does a parametric or non-parametric test seem more appropriate?


## 8. Comparing more than two groups

Now consider **blood glucose** across the three cereal arms:

- bran  
- cornflakes  
- muesli

We can use:

- **One-way ANOVA** (parametric): compares mean values across groups.  
- **Kruskal–Wallis test** (non-parametric): compares distributions using ranks.


In [None]:
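# One array of glucose values per cereal arm (missing values dropped, as in earlier cells)
groups_glucose = [group["glucose"].dropna().to_numpy() for _, group in df.groupby("cereal_arm")]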

# One-way ANOVA
f_stat, p_anova = st.f_oneway(*groups_glucose)

# Kruskal–Wallis test
h_stat, p_kw = st.kruskal(*groups_glucose)

print("One-way ANOVA:")
print(f"  F = {f_stat:6.3f}, p = {p_anova:6.4f}")
print("Kruskal–Wallis:")
print(f"  H = {h_stat:6.3f}, p = {p_kw:6.4f}")


In [None]:
# Visualise glucose values by cereal arm
sns.boxplot(data=df, x="cereal_arm", y="glucose")
plt.title("Blood glucose by cereal arm")
plt.xlabel("Cereal arm")
plt.ylabel("Glucose (arbitrary units)")
plt.show()
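

To see what "compares mean values across groups" means in practice, the F-statistic of a one-way ANOVA can be computed by hand as the ratio of between-group to within-group variability. The three small groups below are made up for this sketch.

```python
import numpy as np
import scipy.stats as st

# Made-up measurements for three groups
groups = [
    np.array([5.1, 5.4, 5.0, 5.6]),
    np.array([6.2, 6.0, 6.5, 6.1]),
    np.array([5.5, 5.9, 5.7, 5.8]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), len(all_values)

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = st.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = st.f_oneway(*groups)
print(f"By hand: F = {f_manual:.3f}, p = {p_manual:.4f}")
print(f"SciPy:   F = {f_scipy:.3f}, p = {p_scipy:.4f}")
```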


## 9. Comparing categories: chi-squared tests and a warning

Sometimes we want to know whether **two categorical variables are associated**, for example:

- Is the proportion of participants with **high appetite** different across **test foods**?

For this we can use a **chi-squared test of independence** on a contingency table.

⚠️ **Important reminder:**  
- **Categorical** and **ordinal** data should **not** be analysed as if they were continuous without careful justification.  
- For example, treating a 5-point Likert scale as if it were a continuous variable and running a t-test can be misleading.


In [None]:
# Create a simple categorical outcome: high vs not-high appetite
high_cutoff = 70
df["appetite_high"] = (df["appetite_vas"] >= high_cutoff).astype(int)

# Contingency table: appetite_high by food_arm
table = pd.crosstab(df["appetite_high"], df["food_arm"])
table


In [None]:
# Chi-squared test of independence
chi2, p_chi, dof, expected = st.chi2_contingency(table)

print(f"Chi-squared test: chi2 = {chi2:6.3f}, df = {dof}, p = {p_chi:6.4f}")
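

The chi-squared statistic compares the observed counts with the counts expected under independence (row total × column total / grand total). The sketch below reproduces this on a made-up 2 × 3 table and also looks at the smallest expected count, since a common rule of thumb asks for expected counts of at least 5.

```python
import numpy as np
import scipy.stats as st

# Made-up 2 x 3 contingency table of counts
observed = np.array([[20, 30, 25],
                     [10, 15, 20]])

# Expected counts under independence
row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
expected_manual = row_tot @ col_tot / observed.sum()

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2_scipy, p_scipy, dof_scipy, expected_scipy = st.chi2_contingency(observed)

print(f"By hand: chi2 = {chi2_manual:.3f}")
print(f"SciPy:   chi2 = {chi2_scipy:.3f}, df = {dof_scipy}, p = {p_scipy:.4f}")
print(f"Smallest expected count: {expected_scipy.min():.2f}")
```

If some expected counts are small, Fisher's exact test (for 2 × 2 tables) or merging sparse categories may be more appropriate.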


**Questions**

- Does the chi-squared test suggest a difference in the proportion of high appetite across test foods (using α = 0.0314)?  
- How would you **report** this result in words?  
- Why is a chi-squared test more appropriate for the dichotomised outcome (`appetite_high`, coded 0/1) than a t-test on those 0/1 values?


## 10. Statistics: tools, not an oracle

Finally, a reminder:

- Statistical methods help us **summarise uncertainty** and **quantify evidence**.  
- They do **not** replace judgement about study design, data quality, or plausibility.  
- A "significant" p-value does not guarantee truth, and a "non-significant" result does not prove there is no effect.

When using these tools in real research, always consider:

- Are the data appropriate for the method?  
- Are the assumptions at least approximately met?  
- Do the results make sense in the context of other evidence?  
- If a friendly statistician or methodologist (or a small hippo) looked over your analysis, would they recognise the decisions you took and why?
