# Data Handling and Basic Analysis (Part 2 Nutrition)
*Version 0.1.0*

This workbook introduces the foundations of **data handling and basic analysis**, using a small **synthetic RCT dataset** that mimics the Part 2 practicals:

- Blood pressure change after different amounts of coffee  
- Blood glucose after different cereals  
- Appetite VAS after different test foods  

The data are **simulated** (not real student data) and include **age** and **sex**.  
The dataset is made available as a pandas DataFrame called `df` by the **bootstrap cell above**.

By the end of the workbook you should be able to:

- Distinguish between **categorical**, **ordinal**, and **continuous** variables  
- Explore a dataset with `df.info()` and `df.describe()`  
- Compute and report **mean**, **SD**, **median**, **IQR** for continuous data  
- Explore **distributions** and use Q–Q plots to assess normality  
- Create **contingency tables** for categorical data  
- Describe data appropriately for publication  
- Understand the basics of **NHST** (H0 vs H1), **p-values**, and **95% CIs**  
- See why p = 0.05 is not a magical threshold (we will use **α = 0.0314**)  
- Compare two or more groups using **parametric** and **non-parametric** tests  
- Remember that **statistics are tools, not an oracle**


In [None]:
# ============================================================
# Setup: scientific Python libraries and plotting style
#
# Assumes the bootstrap cell above has already created:
#   - df  : the synthetic dataset (pandas DataFrame)
#   - CTX : context object with paths and settings
# ============================================================

# Data handling and numerical computing
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical tests
import scipy.stats as st

# Display options (optional but helpful)
pd.set_option("display.max_rows", 20)
pd.set_option("display.max_columns", 20)

# Plot style
sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

print("Libraries loaded. DataFrame `df` is ready for analysis.")


## 1. Study variables and data types

Before running any analysis, we should understand **what kind of variables** we have.

Common types:

- **Categorical (nominal)**: labels with no natural order (e.g. sex, treatment arm).  
- **Ordinal**: categories with a natural order, but unknown distance between levels (e.g. Likert scales).  
- **Continuous (or approximately continuous)**: numeric values where differences and averages make sense (e.g. age, blood pressure, VAS scores).
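
One way to make these distinctions explicit in pandas is through categorical dtypes. The sketch below uses a small made-up table (`toy` and its `satisfaction` item are hypothetical and not part of `df`) to show a nominal and an ordinal variable side by side.

```python
import pandas as pd

# Hypothetical toy data (not the workbook's `df`)
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "satisfaction": ["low", "high", "medium", "low"],  # Likert-style item
    "age": [21.0, 23.5, 20.0, 22.0],
})

# Nominal: categories without any order
toy["sex"] = pd.Categorical(toy["sex"], categories=["F", "M"])

# Ordinal: ordered categories, so sorting and min/max respect the order
toy["satisfaction"] = pd.Categorical(
    toy["satisfaction"], categories=["low", "medium", "high"], ordered=True
)

print(toy.dtypes)
print("Lowest level: ", toy["satisfaction"].min())
print("Highest level:", toy["satisfaction"].max())
```

Marking a variable as ordered lets pandas sort and compare its levels, but it still does not define distances between them.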

Our synthetic dataset `df` contains (one row per participant):


In [None]:
variables = pd.DataFrame(
    [
        {"variable": "sex",          "type": "categorical", "description": "Participant sex (F/M)"},
        {"variable": "age",          "type": "continuous",  "description": "Age (years)"},
        {"variable": "coffee_arm",   "type": "categorical", "description": "Coffee intervention: low / medium / high"},
        {"variable": "cereal_arm",   "type": "categorical", "description": "Cereal: bran / cornflakes / muesli"},
        {"variable": "food_arm",     "type": "categorical", "description": "Test food: apple / biscuit / yoghurt"},
        {"variable": "bp_change",    "type": "continuous",  "description": "Change in blood pressure (mmHg)"},
        {"variable": "glucose",      "type": "continuous",  "description": "Postprandial blood glucose (arbitrary units)"},
        {"variable": "appetite_vas", "type": "continuous",  "description": "Appetite VAS (0–100)"}
    ]
)
variables


## 2. First look at the dataset

We start with a **quick overview**:

- `df.head()` shows the first few rows.  
- `df.info()` summarises the variables and data types.  
- `df.describe()` provides basic summary statistics for numeric variables.


In [None]:
# First few rows
df.head()


In [None]:
# Overall structure of the DataFrame
df.info()


In [None]:
# Summary statistics for numeric variables
df.describe()


### 2.1 Missing values and impossible values

We should also check for **missing values** and obviously impossible values (e.g. negative age, VAS > 100).  

Our simulator does not generate missing or impossible values, but in real data these checks are essential.
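
A range check could be sketched as follows. The toy data and the plausible limits (age 0 to 120 years, VAS 0 to 100) are assumptions for illustration only; in a real study the limits should come from the protocol.

```python
import pandas as pd

# Hypothetical toy data with deliberately planted problems
toy = pd.DataFrame({
    "age": [21, -3, 24, 22],
    "appetite_vas": [55.0, 102.0, 40.0, None],
})

# Assumed plausible ranges (inclusive) for each variable
limits = {"age": (0, 120), "appetite_vas": (0, 100)}

report = {}
for col, (lo, hi) in limits.items():
    n_missing = int(toy[col].isna().sum())
    # Drop missing values first so they are not also counted as out of range
    present = toy[col].dropna()
    n_bad = int((~present.between(lo, hi)).sum())
    report[col] = (n_missing, n_bad)
    print(f"{col}: {n_missing} missing, {n_bad} outside [{lo}, {hi}]")
```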


In [None]:
# Count of missing values per variable
df.isna().sum()


## 3. Descriptive statistics for continuous variables

For continuous variables, we often report:

- **Mean and standard deviation (SD)** if the distribution is roughly symmetric.  
- **Median and interquartile range (IQR)** if the distribution is skewed.

Here we compute both for the three main continuous outcomes:

- `bp_change`  
- `glucose`  
- `appetite_vas`


In [None]:
cont_vars = ["bp_change", "glucose", "appetite_vas"]
rows = []

for var in cont_vars:
    series = df[var].dropna()
    mean = series.mean()
    sd = series.std()  # sample SD (pandas default: ddof=1)
    median = series.median()
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    rows.append({
        "variable": var,
        "mean": mean,
        "sd": sd,
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr
    })

summary_cont = pd.DataFrame(rows)
summary_cont


## 4. Exploring distributions

The choice of statistical test depends strongly on the **shape of the distribution**.

For each continuous outcome we can:

- Plot **histograms** and **density curves** (to see skew, multimodality, etc.).  
- Use a **Q–Q plot** to compare the data to a perfect normal distribution.  
- Compare distributions across arms using **boxplots**.


In [None]:
# Histogram and density for blood pressure change
sns.histplot(df["bp_change"], kde=True)
plt.title("Distribution of BP change")
plt.xlabel("BP change (mmHg)")
plt.ylabel("Count")
plt.show()


### 4.1 What is a Q–Q plot?

A **Quantile–Quantile (Q–Q) plot** is a way to check whether a variable follows a **normal distribution**.

The idea is:

- Every dataset has its own **quantiles** (for example the 10th, 50th, 90th percentile).  
- A normal distribution also has well-defined quantiles.  
- A Q–Q plot compares the quantiles of your data with the quantiles of a perfect normal distribution.

If the data are approximately normal, the points in the Q–Q plot fall roughly on a straight line, especially in the middle of the distribution.

Deviations can indicate:

- **Curved (bow-shaped) pattern** → skewed distribution  
- **S-shape** → tails heavier or lighter than normal (heavy tails mean more extreme values than expected)  
- **Outliers** → isolated points far from the main pattern

Q–Q plots are often more informative than histograms, especially in smaller samples.
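
To make the idea concrete, the points of a Q–Q plot can be computed by hand. This is a minimal sketch on simulated values (the `sample` array is hypothetical), using one common choice of plotting positions; `st.probplot` uses a slightly different formula, so its points will not match exactly.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=5.0, size=50)  # simulated, hypothetical values

# Sample quantiles: simply the sorted observations
observed = np.sort(sample)

# Theoretical quantiles: standard-normal values at evenly spaced probabilities
n = len(observed)
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = st.norm.ppf(probs)

# For roughly normal data the two sets of quantiles are close to a straight line,
# so their correlation is close to 1
r = np.corrcoef(theoretical, observed)[0, 1]
print(f"Correlation between theoretical and observed quantiles: {r:.3f}")
```

Plotting `observed` against `theoretical` gives the Q–Q plot itself; the correlation is only a rough numerical summary of how straight that line is.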


In [None]:
# Q–Q plot to assess normality of BP change
st.probplot(df["bp_change"], dist="norm", plot=plt)
plt.title("Q–Q plot of BP change")
plt.show()


In [None]:
# Boxplot of BP change by coffee arm
sns.boxplot(data=df, x="coffee_arm", y="bp_change")
plt.title("BP change by coffee intervention arm")
plt.xlabel("Coffee arm")
plt.ylabel("BP change (mmHg)")
plt.show()


## 5. Descriptive statistics for categorical variables

For categorical variables, we usually report **counts and percentages**.

Examples:

- How many participants are in each **coffee_arm**?  
- What is the distribution of **sex**?  
- How many participants are in each **combination** (e.g. coffee arm × sex)?


In [None]:
# Distribution of participants by coffee arm
df["coffee_arm"].value_counts().to_frame(name="count")


In [None]:
# Contingency table: sex by coffee arm
pd.crosstab(df["sex"], df["coffee_arm"], margins=True)
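
Counts are usually reported together with percentages. As a sketch on made-up data (the `toy` table is hypothetical), `pd.crosstab` can return proportions directly through its `normalize` argument:

```python
import pandas as pd

# Hypothetical toy data (not the workbook's `df`)
toy = pd.DataFrame({
    "sex":        ["F", "F", "M", "F", "M", "M", "F", "F"],
    "coffee_arm": ["low", "high", "low", "low", "high", "high", "high", "low"],
})

counts = pd.crosstab(toy["sex"], toy["coffee_arm"])

# normalize="columns": proportions within each arm, so every column sums to 1
col_pct = pd.crosstab(toy["sex"], toy["coffee_arm"], normalize="columns") * 100

print(counts)
print(col_pct.round(1))
```

Whether to normalise by row, by column, or over the whole table depends on the question; here each column answers "what is the sex distribution within this arm?".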


### 5.1 Describing data properly for publication

In a methods or results section, we might write:

- **Continuous, roughly normal** (e.g. BP change):  
  *"BP change was 1.8 ± 5.2 mmHg (mean ± SD)."*  
- **Continuous, skewed** (e.g. VAS):  
  *"Appetite VAS was 62 (48–75) units (median, IQR)."*  
- **Categorical**:  
  *"40% of participants were male (n = 72/180)."*

The choice between mean ± SD and median (IQR) should be guided by **distributional shape**, not habit.
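
Reporting strings like these can be generated rather than typed by hand. The helpers below (`mean_sd`, `median_iqr`, `n_pct`) are hypothetical names for a minimal sketch, shown on made-up values.

```python
import pandas as pd

def mean_sd(x: pd.Series, digits: int = 1) -> str:
    """Format as 'mean ± SD' using the sample SD (ddof=1)."""
    return f"{x.mean():.{digits}f} ± {x.std(ddof=1):.{digits}f}"

def median_iqr(x: pd.Series, digits: int = 0) -> str:
    """Format as 'median (Q1–Q3)'."""
    q1, q3 = x.quantile([0.25, 0.75])
    return f"{x.median():.{digits}f} ({q1:.{digits}f}–{q3:.{digits}f})"

def n_pct(flag: pd.Series) -> str:
    """Format a boolean variable as 'percent (n = count/total)'."""
    n, total = int(flag.sum()), int(flag.notna().sum())
    return f"{100 * n / total:.0f}% (n = {n}/{total})"

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
is_male = pd.Series([True, False, False, True, False])

print(mean_sd(values))     # 3.0 ± 1.6
print(median_iqr(values))  # 3 (2–4)
print(n_pct(is_male))      # 40% (n = 2/5)
```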


## 6. Statistical inference and NHST

So far we have described the **sample**. Statistical inference is about what we can reasonably say about the **underlying population**.

In classical **null hypothesis significance testing (NHST)** we:

1. Formulate a **null hypothesis (H0)**, usually "no difference" or "no effect".  
2. Formulate an **alternative hypothesis (H1)**, e.g. "there is a difference".  
3. Choose a test statistic (e.g. a *t*-statistic) and compute it from the data.  
4. Compute a **p-value**: the probability (under H0) of observing a result *at least as extreme* as the one we saw.  
5. Decide whether the result is "compatible" with H0, often using a threshold α.
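
Steps 3 to 5 can be sketched by hand for a two-group comparison. The example below uses simulated data (groups `a` and `b` are hypothetical) and the Welch form of the *t*-statistic.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(1)
a = rng.normal(0.0, 5.0, 30)  # hypothetical group A
b = rng.normal(3.0, 5.0, 30)  # hypothetical group B

# Step 3: test statistic = difference in means / standard error of the difference
va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch–Satterthwaite approximation to the degrees of freedom
dof = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))

# Step 4: two-sided p-value = P(|T| >= |t_manual|) if H0 were true
p_manual = 2 * st.t.sf(abs(t_manual), dof)

# Step 5: compare with the chosen threshold
alpha = 0.0314
print(f"t = {t_manual:.3f}, df = {dof:.1f}, p = {p_manual:.4f}")
print("Reject H0" if p_manual < alpha else "Do not reject H0")
```

The same numbers should come out of `st.ttest_ind(a, b, equal_var=False)`, which is what the later cells rely on.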


### 6.1 p-values and 95% confidence intervals

- A **p-value** is *not* the probability that H0 is true. It is the probability of obtaining data at least as extreme as those observed *if H0 were true*.
- A **95% confidence interval (CI)** for a parameter (e.g. difference in means) is an interval constructed such that, in repeated samples, 95% of such intervals would contain the true parameter.

In practice:

- If a 95% CI for a difference **excludes 0**, the corresponding two-sided test at α = 0.05 is "statistically significant".  
- The CI also gives a sense of **precision** and **effect size**, not just a yes/no decision.


In [None]:
# Example: difference in mean BP change between low and high coffee arms
bp_low = df[df["coffee_arm"] == "low"]["bp_change"].dropna()
bp_high = df[df["coffee_arm"] == "high"]["bp_change"].dropna()

mean_low = bp_low.mean()
mean_high = bp_high.mean()
diff = mean_high - mean_low

# Standard error for difference in means (Welch t-test style)
se_diff = np.sqrt(bp_low.var(ddof=1)/len(bp_low) + bp_high.var(ddof=1)/len(bp_high))

# 95% CI using the normal approximation (for teaching; in practice use a t-based interval, e.g. from statsmodels)
z = 1.96
ci_lower = diff - z * se_diff
ci_upper = diff + z * se_diff

print(f"Mean BP change (low coffee):  {mean_low:6.2f} mmHg")
print(f"Mean BP change (high coffee): {mean_high:6.2f} mmHg")
print(f"Difference (high - low):      {diff:6.2f} mmHg")
print(f"Approx. 95% CI: [{ci_lower:6.2f}, {ci_upper:6.2f}] mmHg")
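
The cell above uses the normal critical value z = 1.96. In small samples a *t*-based critical value gives a slightly wider interval. The following is a sketch of one such interval, using the Welch–Satterthwaite degrees of freedom; the helper `welch_ci` and the simulated arms are hypothetical and not part of the workbook's dataset.

```python
import numpy as np
import scipy.stats as st

def welch_ci(x, y, level=0.95):
    """CI for mean(y) - mean(x) with Welch–Satterthwaite degrees of freedom."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    se = np.sqrt(vx + vy)
    dof = (vx + vy) ** 2 / (vx**2 / (len(x) - 1) + vy**2 / (len(y) - 1))
    # For level=0.95 this is always somewhat larger than 1.96
    t_crit = st.t.ppf(0.5 + level / 2, dof)
    diff = y.mean() - x.mean()
    return diff - t_crit * se, diff + t_crit * se

rng = np.random.default_rng(2)
low = rng.normal(1.0, 5.0, 15)   # hypothetical low-coffee arm
high = rng.normal(4.0, 5.0, 15)  # hypothetical high-coffee arm

lo_t, hi_t = welch_ci(low, high)
print(f"t-based 95% CI (high - low): [{lo_t:.2f}, {hi_t:.2f}] mmHg")
```

By construction, this interval excludes 0 exactly when the two-sided Welch *t*-test gives p < 0.05.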


## 7. Why p = 0.05 is not a magical threshold

In many papers, p < 0.05 is treated as "significant" and p ≥ 0.05 as "not significant", as if there were a sharp cliff.

In reality:

- 0.05 is a **convention**, not a law of nature.  
- A p-value of 0.049 is not fundamentally different from 0.051.  
- Decisions should also consider **effect size**, **uncertainty**, and **context**.

In this workbook we deliberately use an unusual threshold **α = 0.0314** to emphasise that the choice of α is arbitrary and should be justified, not blindly copied.

To see this more clearly, we simulate many RCTs with **no true difference** and look at the distribution of p-values.


In [None]:
# Simulate 10 000 null experiments (no true difference between groups)
rng = np.random.default_rng(11088)
p_values = []

n_per_group = 30
n_sim = 10000

for _ in range(n_sim):
    x = rng.normal(0, 1, n_per_group)
    y = rng.normal(0, 1, n_per_group)
    _, p = st.ttest_ind(x, y, equal_var=False)
    p_values.append(p)

alpha_1 = 0.05
alpha_2 = 0.0314

sns.histplot(p_values, bins=30)
plt.axvline(alpha_1, linestyle="--", label="0.05")
plt.axvline(alpha_2, linestyle=":", label="0.0314")
plt.title("Distribution of p-values when there is NO true effect")
plt.xlabel("p-value")
plt.ylabel("Count")
plt.legend()
plt.show()

prop_005 = np.mean(np.array(p_values) < alpha_1)
prop_0314 = np.mean(np.array(p_values) < alpha_2)

print(f"Proportion of p-values < 0.05:   {prop_005:5.3f}")
print(f"Proportion of p-values < 0.0314: {prop_0314:5.3f}")


**Reflection**

- Roughly what proportion of p-values fall below 0.05 when H0 is true?  
- What happens when we tighten the threshold to 0.0314?  
- What does this tell you about treating p = 0.049 and p = 0.051 as fundamentally different?


## 8. Comparing two groups: parametric and non-parametric tests

We now compare **BP change** between two coffee arms (e.g. low vs high).

- **Parametric test**: independent-samples *t*-test (assumes approximately normal data within each group; the Welch version used below does not assume equal variances).  
- **Non-parametric test**: Mann–Whitney U test (uses ranks; does not assume normality).

We will use **α = 0.0314** as our decision threshold.
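
What "uses ranks" means in practice can be sketched with simulated, skewed data (the arrays `x` and `y` are hypothetical). A log transformation changes the values but not their order, so the Mann–Whitney result is unchanged while the *t*-test result is not.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=1.0, size=25)  # skewed, strictly positive values
y = rng.lognormal(mean=0.5, sigma=1.0, size=25)

# Mann–Whitney depends only on the ordering of the pooled observations
p_u_raw = st.mannwhitneyu(x, y, alternative="two-sided").pvalue
p_u_log = st.mannwhitneyu(np.log(x), np.log(y), alternative="two-sided").pvalue

# The t-test depends on the actual values, so the transformation matters
p_t_raw = st.ttest_ind(x, y, equal_var=False).pvalue
p_t_log = st.ttest_ind(np.log(x), np.log(y), equal_var=False).pvalue

print(f"Mann–Whitney p: raw = {p_u_raw:.4f}, log = {p_u_log:.4f}")
print(f"Welch t-test p: raw = {p_t_raw:.4f}, log = {p_t_log:.4f}")
```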


In [None]:
# Select two arms for comparison: low vs high coffee
bp_low = df[df["coffee_arm"] == "low"]["bp_change"].dropna()
bp_high = df[df["coffee_arm"] == "high"]["bp_change"].dropna()

# Independent-samples t-test (Welch: does not assume equal variances)
# Note: the sign of t follows the argument order (low minus high)
t_stat, p_t = st.ttest_ind(bp_low, bp_high, equal_var=False)

# Mann–Whitney U test (non-parametric)
u_stat, p_u = st.mannwhitneyu(bp_low, bp_high, alternative="two-sided")

alpha = 0.0314

print(f"t-test:       t = {t_stat:6.3f}, p = {p_t:6.4f}")
print(f"Mann–Whitney: U = {u_stat:6.1f}, p = {p_u:6.4f}")
print(f"Using alpha = {alpha}")


**Questions**

- Do the conclusions from the *t*-test and Mann–Whitney test agree at α = 0.0314?  
- Would your conclusion change if you (arbitrarily) switched to α = 0.05?  
- Looking back at the distributions, does a parametric or non-parametric test seem more appropriate?


## 9. Comparing more than two groups

Now consider **blood glucose** across the three cereal arms:

- bran  
- cornflakes  
- muesli

We can use:

- **One-way ANOVA** (parametric): compares mean values across groups.  
- **Kruskal–Wallis test** (non-parametric): compares distributions using ranks.
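
As a sketch of what one-way ANOVA computes, the F statistic can be built by hand from between-group and within-group variability. The three groups below are simulated and hypothetical.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(4)
# Hypothetical glucose-like values for three arms
groups = [rng.normal(mu, 1.0, 20) for mu in (5.0, 5.5, 6.2)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), len(all_values)

# Between-group sum of squares: how far each group mean lies from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = between-group mean square / within-group mean square
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = st.f.sf(f_manual, k - 1, n_total - k)

print(f"F = {f_manual:.3f}, p = {p_manual:.4f}")
```

A large F means the group means differ by more than the within-group scatter would suggest; `st.f_oneway(*groups)` should return the same values.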


In [None]:
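# One array of glucose values per cereal arm (missing values dropped; NaN would make both tests return NaN)
groups_glucose = [group["glucose"].dropna().values for _, group in df.groupby("cereal_arm")]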

# One-way ANOVA
f_stat, p_anova = st.f_oneway(*groups_glucose)

# Kruskal–Wallis test
h_stat, p_kw = st.kruskal(*groups_glucose)

print("One-way ANOVA:")
print(f"  F = {f_stat:6.3f}, p = {p_anova:6.4f}")
print("Kruskal–Wallis:")
print(f"  H = {h_stat:6.3f}, p = {p_kw:6.4f}")


In [None]:
# Visualise glucose values by cereal arm
sns.boxplot(data=df, x="cereal_arm", y="glucose")
plt.title("Blood glucose by cereal arm")
plt.xlabel("Cereal arm")
plt.ylabel("Glucose (arbitrary units)")
plt.show()


## 10. Comparing categories: chi-squared tests and a warning

Sometimes we want to know whether **two categorical variables are associated**, for example:

- Is the proportion of participants with **high appetite** different across **test foods**?

For this we can use a **chi-squared test of independence** on a contingency table.
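
The test compares the observed counts with the counts expected if the two variables were independent. A minimal sketch on a made-up 2 × 3 table (the counts are hypothetical):

```python
import numpy as np
import scipy.stats as st

# Hypothetical table: rows = appetite not-high / high, columns = three test foods
observed = np.array([[30, 20, 25],
                     [10, 20, 15]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected count under independence = row total x column total / grand total
expected = row_totals * col_totals / n

# Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = st.chi2.sf(chi2_manual, dof)

print(expected)
print(f"chi2 = {chi2_manual:.3f}, df = {dof}, p = {p_manual:.4f}")
```

For a table larger than 2 × 2, `st.chi2_contingency(observed)` should give the same statistic; for 2 × 2 tables it applies a continuity correction by default, so results would differ slightly.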

⚠️ **Important reminder:**  
- **Categorical** and **ordinal** data should **not** be analysed as if they were continuous without careful justification.  
- For example, treating a 5-point Likert scale as if it were a continuous variable and running a *t*-test can be misleading.


In [None]:
# Create a simple categorical outcome: high vs not-high appetite
high_cutoff = 70
df["appetite_high"] = (df["appetite_vas"] >= high_cutoff).astype(int)

# Contingency table: appetite_high by food_arm
table = pd.crosstab(df["appetite_high"], df["food_arm"])
print("Contingency table (rows: appetite_high, columns: food_arm):")
display(table)

# Chi-squared test of independence
chi2, p_chi, dof, expected = st.chi2_contingency(table)

print(f"Chi-squared test: chi2 = {chi2:6.3f}, df = {dof}, p = {p_chi:6.4f}")


**Questions**

- Does the chi-squared test suggest a difference in the proportion of high appetite across test foods (using α = 0.0314)?  
- How would you **report** this result in words?  
- Why is a chi-squared test more appropriate here than a *t*-test once the VAS scores have been split into two categories?


## 11. Statistics: tools, not an oracle

Finally, a reminder:

- Statistical methods help us **summarise uncertainty** and **quantify evidence**.  
- They do **not** replace judgement about study design, data quality, or plausibility.  
- A "significant" p-value does not guarantee truth, and a "non-significant" result does not prove there is no effect.

When using these tools in real research, always consider:

- Are the data appropriate for the method?  
- Are the assumptions at least approximately met?  
- Do the results make sense in the context of other evidence?
