# Data Handling and Basic Analysis (Part 2 Nutrition)
*Version 0.0.1*

This workbook introduces the foundations of **data handling**, using a small **synthetic RCT dataset** that mimics the Part 2 practicals:

- Blood pressure change after different amounts of coffee  
- Blood glucose after different cereals  
- Appetite VAS after different test foods  

We *simulate* the data rather than use student data; the simulated dataset also includes each participant's **age** and **sex**.

By the end of the workbook, you should be able to:

- Generate and inspect data  
- Identify missing or impossible values  
- Explore distributions  
- Decide whether parametric or non-parametric tests are appropriate  
- Understand the randomness of *p*-values (including why 0.05 is arbitrary)  
- Compare intervention vs control  
- Perform one-way comparisons with multiple intervention arms  
- Present results graphically

Run the first code cell to configure the environment and load helper functions.


In [None]:
import os
import sys
import runpy
import pathlib

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st

# Optional: adjust figure style (can be removed if you prefer defaults)
plt.style.use("ggplot")

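# Load the helper function that simulates the practical data.
# The file sim_helpers.py must be present in the helpers/ folder one level above this notebook.
helpers_path = pathlib.Path("../helpers/sim_helpers.py")
# run_path returns the script's globals as a dict; it does not add them to
# this namespace, so pull out the function we need explicitly.
helpers = runpy.run_path(str(helpers_path))
simulate_practical_data = helpers["simulate_practical_data"]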


## 1. Generate the synthetic dataset

We now generate a synthetic dataset that mimics the Part 2 practicals.

- Coffee intervention with three arms: low, medium, high.  
- Cereal intervention with three arms: bran, cornflakes, muesli.  
- Test foods for appetite VAS: apple, biscuit, yoghurt.  

We also include **age** and **sex** for each participant.


In [None]:
# Generate the dataset
df = simulate_practical_data(seed=11088)

# Display the first few rows
df.head()
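
If the helper file is not available, the following is a rough, hypothetical stand-in that produces a table with the same column names used later in this workbook. It is not the code in `sim_helpers.py`: the function name `toy_practical_data`, the sample size, the age range, and every effect size are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd


def toy_practical_data(n=90, seed=0):
    """Hypothetical stand-in for the helper; all effect sizes are invented."""
    rng = np.random.default_rng(seed)
    toy = pd.DataFrame({
        "age": rng.integers(19, 30, size=n),  # whole years, 19 to 29
        "sex": rng.choice(["F", "M"], size=n),
        "coffee_arm": rng.choice(["low", "medium", "high"], size=n),
        "cereal_arm": rng.choice(["bran", "cornflakes", "muesli"], size=n),
        "food_arm": rng.choice(["apple", "biscuit", "yoghurt"], size=n),
    })
    # Outcomes: roughly normal BP change and glucose, skewed appetite VAS.
    shift = toy["coffee_arm"].map({"low": 0.0, "medium": 2.0, "high": 4.0})
    toy["bp_change"] = shift + rng.normal(0, 5, size=n)
    toy["glucose"] = rng.normal(6.0, 1.0, size=n)
    toy["appetite_vas"] = 100 * rng.beta(2, 5, size=n)  # bounded 0 to 100
    return toy


toy_df = toy_practical_data(seed=1)
print(toy_df.head())
```

The result is stored in `toy_df` so that it does not overwrite the `df` created by the real helper.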


## 2. Inspecting the data

Before any analysis, we need to understand the structure of the dataset and
check for obvious problems:

- Are all variables present?  
- Are data types sensible (numeric vs categorical)?  
- Are there missing values?  
- Are there any clearly impossible values?
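
The last two checks in the list above can be sketched on a tiny made-up table. The plausible ranges in `limits` are assumptions chosen for this illustration, not rules taken from the practicals, and `demo` is deliberately separate from `df`.

```python
import numpy as np
import pandas as pd

# Tiny made-up table with one missing and two implausible entries.
demo = pd.DataFrame({
    "age": [21, 22, -3, 20, 23],
    "appetite_vas": [35.0, np.nan, 60.0, 140.0, 12.0],
})

# Plausible ranges are assumptions chosen for this illustration.
limits = {"age": (16, 100), "appetite_vas": (0, 100)}

missing = demo.isna().sum()
print(missing)

flags = {}
for column, (low, high) in limits.items():
    # between() is False for NaN, so exclude missing values explicitly;
    # otherwise a missing value would also be reported as out of range.
    out_of_range = demo[column].notna() & ~demo[column].between(low, high)
    flags[column] = demo.index[out_of_range].tolist()
    print(column, "rows with impossible values:", flags[column])
```

Keeping missing and impossible values as separate checks matters because they usually call for different decisions: a missing value is simply absent, whereas an impossible value suggests a recording error that may need to be traced.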


In [None]:
# Overall structure of the DataFrame
df.info()


In [None]:
# Summary statistics (including categorical variables)
df.describe(include="all")


In [None]:
# Count of missing values for each variable
df.isna().sum()


In [None]:
# Numerical summary (range, mean, quartiles) for numeric variables
df.describe()


## 3. Exploring distributions

The choice of statistical test depends strongly on how the outcome is distributed.

Here we:

- Plot histograms and density curves.
- Inspect normality with a Q–Q plot.
- Compare distributions across intervention arms using boxplots.


In [None]:
# Histogram and density for blood pressure change
sns.histplot(df["bp_change"], kde=True)
plt.title("Distribution of BP change")
plt.xlabel("BP change (mmHg)")
plt.ylabel("Count")
plt.show()


In [None]:
# Q–Q plot to assess normality of BP change
st.probplot(df["bp_change"], dist="norm", plot=plt)
plt.title("Q–Q plot of BP change")
plt.show()
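
A Q–Q plot can also be summarised numerically. As a minimal sketch on made-up samples (not the workbook data), `st.probplot` called without `plot=` returns the Q–Q points together with a straight-line fit, and the correlation `r` of that fit moves away from 1 as the points bend away from the line. The sample sizes and distributions below are assumptions for illustration.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
normal_sample = rng.normal(0, 1, 200)
skewed_sample = rng.exponential(1.0, 200)

# Without plot=..., probplot returns the Q-Q points and a least-squares fit;
# r is the correlation between the ordered data and the theoretical quantiles.
(_, _), (_, _, r_normal) = st.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = st.probplot(skewed_sample, dist="norm")

print(f"normal sample: r = {r_normal:.3f}")
print(f"skewed sample: r = {r_skewed:.3f}")
```

The value of `r` is only a rough guide and does not replace looking at the plot itself, since the shape of the departure (curved tails, a single outlier) tells you more than a single number.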


In [None]:
# Boxplot of BP change by coffee arm
sns.boxplot(data=df, x="coffee_arm", y="bp_change")
plt.title("BP change by coffee intervention arm")
plt.xlabel("Coffee arm")
plt.ylabel("BP change (mmHg)")
plt.show()


## 4. Comparing two arms: parametric and non-parametric tests

We now compare **BP change** between two coffee arms (e.g. low vs high).

- Parametric test: independent-samples *t*-test (assumes approximate normality; the code below uses Welch's version, which does not assume equal variances).  
- Non-parametric test: Mann–Whitney U test (uses ranks, does not assume normality).

We also start using a slightly unusual significance threshold:  
**α = 0.0314** (to emphasise that 0.05 is an arbitrary convention).
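
To see what "uses ranks" means in practice, here is a small sketch with invented numbers (not the workbook data). One value in `group_b` is changed from 9.4 to 94.0, as a decimal-point slip might do. Because 9.4 was already the largest value overall, the ranks do not change, so the Mann–Whitney result stays the same while the *t*-test result moves.

```python
import numpy as np
import scipy.stats as st

# Invented values for illustration; group_b holds the overall maximum (9.4).
group_a = np.array([-3.1, 0.4, 1.8, -5.2, 2.9, -0.7, 4.1, -2.4])
group_b = np.array([3.5, 6.2, 1.1, 7.8, 4.9, 9.4, 2.2, 5.6])

# Simulate a decimal-point slip: 9.4 recorded as 94.0. Ranks are unchanged.
group_b_slip = np.where(group_b == 9.4, 94.0, group_b)

results = {}
for label, b in [("original", group_b), ("with slip", group_b_slip)]:
    welch = st.ttest_ind(group_a, b, equal_var=False)
    mwu = st.mannwhitneyu(group_a, b, alternative="two-sided")
    results[label] = (welch.pvalue, mwu.statistic, mwu.pvalue)
    print(f"{label:>9}: Welch p = {welch.pvalue:.4f}, "
          f"U = {mwu.statistic:.1f}, Mann-Whitney p = {mwu.pvalue:.4f}")
```

This robustness to extreme values is one reason to consider a rank-based test when the data contain outliers or are clearly skewed.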


In [None]:
# Select two arms for comparison: low vs high coffee
bp_low = df[df["coffee_arm"] == "low"]["bp_change"]
bp_high = df[df["coffee_arm"] == "high"]["bp_change"]

# Independent-samples t-test (Welch's version, because equal_var=False)
t_stat, p_t = st.ttest_ind(bp_low, bp_high, equal_var=False)

# Mann–Whitney U test (non-parametric)
u_stat, p_u = st.mannwhitneyu(bp_low, bp_high, alternative="two-sided")

alpha = 0.0314

print(f"t-test:        t = {t_stat:.3f}, p = {p_t:.4f}")
print(f"Mann–Whitney:  U = {u_stat:.1f}, p = {p_u:.4f}")
print(f"Using alpha = {alpha}")


**Questions for you**

- Do the conclusions from the *t*-test and Mann–Whitney test agree?  
- How would the conclusion change if we used α = 0.05 instead of 0.0314?
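
The mechanics of comparing a *p*-value with a threshold can be sketched as follows. The helper `decide` and the three *p*-values are hypothetical, chosen for illustration; they are not results from this workbook.

```python
def decide(p_value, alpha):
    """Return the conventional verdict for a p-value at a given threshold."""
    return "reject H0" if p_value < alpha else "do not reject H0"


# Hypothetical p-values chosen for illustration, not results from this workbook.
for p_value in (0.020, 0.040, 0.060):
    print(f"p = {p_value:.3f}: "
          f"alpha 0.0314 -> {decide(p_value, 0.0314)}; "
          f"alpha 0.05 -> {decide(p_value, 0.05)}")
```

Only *p*-values that fall between the two thresholds lead to different verdicts, even though the data and the evidence are exactly the same in both cases.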


## 5. Comparing multiple arms

Now consider **blood glucose** across different cereal arms:

- bran  
- cornflakes  
- muesli  

We can use:

- ANOVA (parametric, assumes approximate normality and equal variances).  
- Kruskal–Wallis test (non-parametric, based on ranks).
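
To show what the ANOVA *F* statistic measures, the sketch below computes it by hand as the between-group mean square divided by the within-group mean square, and compares the result with `st.f_oneway`. The three groups are made-up stand-ins for cereal arms; their means, spread, and sizes are assumptions for illustration.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3)
# Three made-up groups standing in for cereal arms (values are invented).
groups = [rng.normal(mu, 1.0, 25) for mu in (5.5, 6.0, 6.5)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), all_values.size

# Between-group and within-group sums of squares.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within.
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_scipy = st.f_oneway(*groups)
print(f"manual F = {f_manual:.4f}, scipy F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```

A large *F* means the group means differ by more than the scatter within groups would suggest. The within-group term pools the variance across arms, which is why ANOVA assumes the arms have similar variances.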


In [None]:
# Prepare lists of glucose values by cereal arm
groups_glucose = [group["glucose"].values for _, group in df.groupby("cereal_arm")]

# One-way ANOVA
f_stat, p_anova = st.f_oneway(*groups_glucose)

# Kruskal–Wallis test
h_stat, p_kw = st.kruskal(*groups_glucose)

print("One-way ANOVA:")
print(f"  F = {f_stat:.3f}, p = {p_anova:.4f}")
print("Kruskal–Wallis test:")
print(f"  H = {h_stat:.3f}, p = {p_kw:.4f}")


In [None]:
# Visualise glucose values by cereal arm
sns.boxplot(data=df, x="cereal_arm", y="glucose")
plt.title("Blood glucose by cereal arm")
plt.xlabel("Cereal arm")
plt.ylabel("Glucose (arbitrary units)")
plt.show()


## 6. Why p = 0.05 is not a magical threshold

In many papers, p < 0.05 is treated as “significant” and p ≥ 0.05 as “not significant”.

Here we simulate **10 000 RCTs with no true difference** between two groups.
For each simulated experiment, we:

- Draw two random samples from the same normal distribution.  
- Run an independent-samples *t*-test.  
- Store the resulting *p*-value.

If there is really **no effect** and the test's assumptions hold, *p*-values are (approximately) uniformly distributed between 0 and 1, so about **5%** of them will still fall below 0.05 just by chance.


In [None]:
# Simulate 10 000 null experiments
rng = np.random.default_rng(11088)
p_values = []

n_per_group = 30
n_sim = 10000

for _ in range(n_sim):
    x = rng.normal(0, 1, n_per_group)
    y = rng.normal(0, 1, n_per_group)
    _, p = st.ttest_ind(x, y, equal_var=False)
    p_values.append(p)

# Plot the distribution of p-values
sns.histplot(p_values, bins=30)
plt.axvline(0.05, color="red", linestyle="--", label="0.05")
plt.title("Distribution of p-values when there is NO true effect")
plt.xlabel("p-value")
plt.ylabel("Count")
plt.legend()
plt.show()


**Reflection**

- Roughly what proportion of *p*-values fall below 0.05?  
- What does this tell you about using p < 0.05 as a hard decision rule?  
- How might scientific conclusions be distorted if we only publish or believe
  “significant” findings?
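
To check your answer to the first reflection question, the proportion can be computed directly. This is a minimal, self-contained sketch that repeats the null simulation in vectorised form (one row per simulated experiment); it uses its own seed, so the exact numbers will differ slightly from the histogram above.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(4)
n_sim, n_per_group = 10_000, 30

# Each row is one simulated null experiment; both groups share one distribution.
x = rng.normal(0, 1, size=(n_sim, n_per_group))
y = rng.normal(0, 1, size=(n_sim, n_per_group))
p_null = st.ttest_ind(x, y, axis=1, equal_var=False).pvalue

for threshold in (0.05, 0.0314):
    share = np.mean(p_null < threshold)
    print(f"share of p-values below {threshold}: {share:.3f}")
```

Whatever threshold is chosen, roughly that fraction of null experiments will cross it, which is the sense in which the threshold sets the false-positive rate rather than marking a natural boundary.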


## 7. Presenting results graphically

Well-designed figures often communicate results more clearly than tables alone.

For approximately normal outcomes, we can show **means with confidence intervals**.
For skewed outcomes (such as VAS scores), boxplots or violin plots can be more informative.
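
For reference, a 95% confidence interval for a mean can be computed by hand from the *t* distribution, as sketched below on a made-up sample. Note that this is a different method from the one seaborn uses by default in `pointplot`, which estimates the interval by bootstrapping, so the two will usually agree closely but not exactly.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(5)
sample = rng.normal(3.0, 5.0, 30)  # made-up BP changes in mmHg

mean = sample.mean()
sem = st.sem(sample)  # standard error of the mean (uses ddof=1)

# t-based interval: mean +/- t_crit * SEM with n - 1 degrees of freedom.
ci_low, ci_high = st.t.interval(0.95, df=sample.size - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f} mmHg, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

The *t*-based interval relies on the sample mean being approximately normally distributed, which is why it suits the roughly normal outcomes mentioned above better than the skewed ones.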


In [None]:
# Example: mean BP change with 95% confidence intervals by coffee arm
# (seaborn >= 0.12 uses errorbar=("ci", 95); on older versions use ci=95 instead)
sns.pointplot(data=df, x="coffee_arm", y="bp_change", errorbar=("ci", 95))
plt.title("Mean BP change with 95% CI by coffee arm")
plt.xlabel("Coffee arm")
plt.ylabel("BP change (mmHg)")
plt.show()


In [None]:
# Example: appetite VAS by test food (distribution often skewed)
sns.boxplot(data=df, x="food_arm", y="appetite_vas")
plt.title("Appetite VAS by test food")
plt.xlabel("Test food")
plt.ylabel("Appetite VAS (0–100)")
plt.show()


## 8. Exercise

Use the tools from this workbook to complete the following tasks:

1. For each outcome (**bp_change**, **glucose**, **appetite_vas**), explore its distribution
   (histograms, Q–Q plots, boxplots).
2. Decide whether a **parametric** or **non-parametric** test is more appropriate
   for comparing intervention groups.
3. Perform the comparison and report:
   - The test used and why.
   - The test statistic and *p*-value.
   - Your conclusion using **α = 0.0314**.
4. Create **one clear figure** that summarises the main result for one of the
   interventions (coffee, cereal, or test food).
