# Frequentist vs. Bayesian Methods in Nutrition and Food Science 🥗📊

Welcome to this Jupyter notebook on statistical methods in nutrition and food science! Whether you’re analysing dietary patterns, conducting clinical trials on supplements, or exploring metabolomics data, choosing the right statistical approach is crucial. Two main paradigms dominate: **Frequentist** and **Bayesian** methods. Each has unique strengths and limitations, especially in the context of nutrition research, where data can be noisy, sample sizes vary, and prior knowledge (e.g., from past studies) is often valuable.

In this notebook, we’ll:
- **Define Frequentist and Bayesian methods** and their core principles 🧩
- **Compare their strengths and limitations** in nutrition and food science contexts
- **Apply both methods** to a clinical trial example, comparing their outputs 📈

Let’s dive in and explore how these methods can help us uncover insights in nutrition and food science!





---

## Step 1: Understanding Frequentist Methods 📏

Frequentist statistics is the traditional approach used in many scientific studies, including nutrition and food science. It relies on the idea of **long-run frequencies**—how often an event would occur if an experiment were repeated infinitely under identical conditions.

### Key Concepts
- **P-values**: The probability of observing data at least as extreme as those actually observed, assuming the null hypothesis is true. Common in clinical trials (e.g., testing if a dietary intervention reduces cholesterol).
- **Confidence Intervals**: Provide a range of plausible values for a parameter (e.g., mean weight loss) with a specified confidence level (e.g., 95%).
- **Hypothesis Testing**: Tests null vs. alternative hypotheses (e.g., “Does a low-carb diet affect blood sugar levels?”).

Frequentist methods are widely used in nutrition research, such as t-tests for comparing group means (e.g., fibre intake in two diets) or ANOVA for multi-group comparisons (e.g., nutrient levels across diets).
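
The long-run frequency idea behind confidence intervals can be checked by simulation. The sketch below repeats a hypothetical study many times and counts how often the 95% interval contains the true mean. The true mean, SD, sample size, and seed are illustrative assumptions, not values from a real trial.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

true_mean = 2.0    # hypothetical true mean weight loss (kg)
true_sd = 1.5      # hypothetical population SD
n = 20             # participants per simulated study
n_studies = 5000   # number of repeated "experiments"

# Critical value for a 95% t-interval with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)

# Each row is one simulated study
samples = rng.normal(true_mean, true_sd, size=(n_studies, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)

lower = means - t_crit * sems
upper = means + t_crit * sems

# Fraction of intervals that contain the true mean
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"Share of 95% CIs containing the true mean: {coverage:.3f}")
```

The share should land close to 0.95. The 95% describes the procedure across repeated studies, not the probability that any single interval contains the true value.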

---

## Step 2: Understanding Bayesian Methods 🌐

Bayesian statistics offers a different perspective, treating probabilities as **degrees of belief** that can be updated with new data. It’s particularly powerful in nutrition and food science, where prior knowledge (e.g., from previous studies) can be incorporated.

### Key Concepts
- **Prior Distribution**: Encodes existing knowledge or assumptions (e.g., expected effect size of a vitamin supplement based on past trials).
- **Likelihood**: Describes how likely the observed data is under different parameter values.
- **Posterior Distribution**: Combines the prior and likelihood to update beliefs after seeing data, giving a full distribution of parameter estimates.
- **Credible Intervals**: Provide a range where the parameter lies with a specified probability (e.g., 95%), directly interpretable as probability statements.

Bayesian methods are gaining traction in nutrition research, especially for clinical trials with small sample sizes or when integrating prior studies (e.g., modelling the effect of omega-3 on heart health).
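
To make the prior, likelihood, and posterior concrete, here is a minimal sketch of a Bayesian update for a normal mean when the data SD is assumed known, a case with a closed-form posterior. The function name `normal_posterior`, the observations, and the assumed SD are all hypothetical and chosen only for illustration.

```python
import numpy as np

def normal_posterior(prior_mean, prior_sd, data, sigma):
    """Posterior mean and SD for a normal mean, assuming the data SD `sigma` is known."""
    n = len(data)
    prior_precision = 1.0 / prior_sd**2
    data_precision = n / sigma**2
    post_var = 1.0 / (prior_precision + data_precision)
    # Precision-weighted average of the prior mean and the sample mean
    post_mean = post_var * (prior_precision * prior_mean + data_precision * np.mean(data))
    return post_mean, np.sqrt(post_var)

# Hypothetical prior: a supplement changes a biomarker by -0.5 units (SD 0.2)
data = np.array([-0.2, -0.6, -0.3, -0.1, -0.4])  # made-up observations
post_mean, post_sd = normal_posterior(prior_mean=-0.5, prior_sd=0.2, data=data, sigma=0.3)

print(f"Sample mean:    {data.mean():.3f}")
print(f"Posterior mean: {post_mean:.3f}")
print(f"Posterior SD:   {post_sd:.3f}")
```

The posterior mean falls between the prior mean and the sample mean, and the posterior SD is smaller than the prior SD. With more data, the sample mean dominates and the prior matters less.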

---

## Step 3: Comparing Strengths and Limitations ⚖️

Let’s compare Frequentist and Bayesian methods, focusing on their application in nutrition and food science, including clinical trials, dietary studies, and metabolomics.

| Aspect                  | Frequentist Methods                          | Bayesian Methods                          |
|-------------------------|----------------------------------------------|-------------------------------------------|
| **Interpretability**    | P-values and confidence intervals can be hard to interpret (e.g., a 95% CI does not mean a 95% chance the true value lies within it). | Posterior distributions and credible intervals are intuitive (e.g., a 95% credible interval means a 95% probability that the parameter lies within it, given the model and prior). |
| **Prior Knowledge**     | Does not incorporate prior knowledge; relies solely on current data. | Incorporates prior knowledge via priors, useful in nutrition (e.g., using past studies on calcium intake). |
| **Small Sample Sizes**  | Estimates are imprecise and tests have low power with small samples, common in clinical trials (e.g., pilot studies on dietary interventions). | Can handle small samples better by leveraging informative priors, which stabilise estimates (e.g., effect of a new diet on blood pressure). |
| **Flexibility**         | Less flexible for complex models or hierarchical data (e.g., nested dietary studies). | Highly flexible for complex models, such as hierarchical models in nutrition (e.g., varying effects across populations). |
| **Computational Cost**  | Generally faster, often using analytical solutions (e.g., t-tests in dietary comparisons). | Often computationally intensive, typically requiring MCMC sampling unless a conjugate prior gives a closed-form posterior (e.g., Bayesian models for clinical trial outcomes). |
| **Uncertainty**         | Provides point estimates and intervals but no full distribution (e.g., mean nutrient intake). | Provides full posterior distributions, capturing uncertainty comprehensively (e.g., distribution of supplement effects). |
| **Risk of Misuse**      | P-value misuse (e.g., p-hacking in nutrition studies) can lead to false positives. | Requires careful prior specification; poor priors can bias results (e.g., overly optimistic priors in clinical trials). |

### Narrative Summary
- **Frequentist Strengths**: Ideal for straightforward hypothesis testing in large-scale clinical trials (e.g., testing if a supplement reduces cholesterol). They’re computationally efficient and widely accepted in scientific publishing.
- **Frequentist Limitations**: Struggle with small sample sizes or complex models, and don’t incorporate prior knowledge, which is often available in nutrition research (e.g., prior studies on dietary fat intake).
- **Bayesian Strengths**: Excellent for integrating prior knowledge and handling small or complex datasets, common in nutrition (e.g., pilot trials, metabolomics). They provide intuitive uncertainty quantification via posteriors.
- **Bayesian Limitations**: Computationally demanding and sensitive to prior choices, which can be subjective (e.g., choosing priors for a dietary intervention’s effect).

In nutrition and food science, the choice depends on your study’s goals. Frequentist methods suit large, well-controlled trials, while Bayesian methods shine in exploratory studies or when prior data is available.
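
The "Risk of Misuse" row can be illustrated with a small simulation. The sketch below assumes a hypothetical scenario in which each study measures 10 outcomes and there is no true effect on any of them. It compares the false-positive rate for one pre-specified outcome with the rate of finding at least one "significant" outcome. The study counts and seed are arbitrary.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)  # arbitrary seed

n_studies = 1000    # simulated studies
n_outcomes = 10     # outcomes tested per study (hypothetical)
n_per_group = 25    # participants per group

# Both groups come from the same distribution, so every "effect" is a false positive.
# Array shape: (studies, outcomes, participants)
group_a = rng.normal(size=(n_studies, n_outcomes, n_per_group))
group_b = rng.normal(size=(n_studies, n_outcomes, n_per_group))

# One t-test per study and outcome; p_values has shape (n_studies, n_outcomes)
p_values = ttest_ind(group_a, group_b, axis=2).pvalue

single_rate = np.mean(p_values[:, 0] < 0.05)          # one pre-specified outcome
any_rate = np.mean((p_values < 0.05).any(axis=1))     # best of 10 outcomes

print(f"False-positive rate, one outcome:       {single_rate:.3f}")
print(f"False-positive rate, any of 10 outcomes: {any_rate:.3f}")
```

With 10 independent outcomes, the chance of at least one p < 0.05 under the null is about 1 - 0.95^10 ≈ 0.40, which is why selectively reporting the "best" outcome inflates false positives.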

---

## Step 4: Practical Example: Clinical Trial on a Nutritional Supplement 🧪

Let’s apply both methods to a simulated clinical trial in nutrition science. Suppose we’re testing a new supplement claimed to reduce fasting blood glucose levels in 50 participants over 8 weeks. We have:

- **Control Group**: 25 participants receiving a placebo.
- **Treatment Group**: 25 participants receiving the supplement.
- **Outcome**: Change in fasting blood glucose (mmol/L) after 8 weeks.
- **Prior Knowledge**: Past studies suggest the supplement reduces glucose by ~0.5 mmol/L on average, with a standard deviation of 0.2 mmol/L reflecting our uncertainty about that average effect.

We’ll:
1. Use a **Frequentist t-test** to compare the groups.
2. Use a **Bayesian model** to estimate the treatment effect, incorporating prior knowledge.
3. Compare the results to highlight differences in interpretation.
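
Before fitting anything, it can help to check what the stated prior knowledge implies. The sketch below assumes the prior is encoded as a normal distribution with mean -0.5 and SD 0.2 mmol/L (one reasonable reading of the past studies, not the only one) and reports the range of effects it considers plausible.

```python
from scipy.stats import norm

# Assumed encoding of the prior knowledge: Normal(mean=-0.5, sd=0.2), in mmol/L
prior = norm(loc=-0.5, scale=0.2)

# Central 95% interval of the prior
prior_low, prior_high = prior.interval(0.95)

# Prior probability that the supplement reduces glucose at all (effect < 0)
p_reduction = prior.cdf(0.0)

print(f"95% prior interval: [{prior_low:.2f}, {prior_high:.2f}] mmol/L")
print(f"Prior probability of any reduction: {p_reduction:.3f}")
```

This prior puts more than 99% of its mass on a reduction, so it is fairly confident. If past studies were less conclusive, a wider SD would be a more cautious choice.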

### Simulate the Data
First, let’s simulate the data and visualize it.



In [None]:
# Import libraries for simulation and analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind
import pymc as pm
import arviz as az

# Set random seed for reproducibility
np.random.seed(11088)

# Simulate clinical trial data
n_per_group = 25  # 25 participants per group
# Control group: Mean change = 0, SD = 0.3
control = np.random.normal(loc=0, scale=0.3, size=n_per_group)
# Treatment group: Mean change = -0.4 (reduction), SD = 0.3
treatment = np.random.normal(loc=-0.4, scale=0.3, size=n_per_group)

# Combine into a DataFrame
data = pd.DataFrame({
    'Group': ['Control'] * n_per_group + ['Treatment'] * n_per_group,
    'Glucose_Change': np.concatenate([control, treatment])
})

# Visualize the data
plt.figure(figsize=(8, 6))
sns.boxplot(x='Group', y='Glucose_Change', data=data)
plt.title('Change in Fasting Blood Glucose by Group')  # no emoji: the default Matplotlib font lacks the glyph
plt.ylabel('Glucose Change (mmol/L)')
plt.xlabel('Group')
plt.grid(True)
plt.tight_layout()
plt.show()

---

## Step 5: Frequentist Analysis: T-Test 📉

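Let’s perform a two-sample t-test to compare the mean glucose change between the control and treatment groups. The null hypothesis (H₀) is that there’s no difference between the group means, while the alternative (H₁) is that the means differ. The test is two-sided (the default for `ttest_ind`), which matches the two-sided confidence interval, even though we expect the supplement to reduce glucose levels.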

We’ll calculate the p-value and a 95% confidence interval for the difference in means.





In [None]:
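from scipy.stats import t as t_dist

# Perform a two-sample, two-sided t-test assuming equal variances.
# Control is passed first, so a positive t means the treatment mean is lower.
t_stat, p_value = ttest_ind(control, treatment, equal_var=True)

# Difference in means and its standard error (equal group sizes)
mean_diff = np.mean(treatment) - np.mean(control)
se_diff = np.sqrt((np.var(control, ddof=1) / n_per_group) + (np.var(treatment, ddof=1) / n_per_group))
# Use the t critical value (df = 2n - 2) rather than the normal approximation 1.96
t_crit = t_dist.ppf(0.975, df=2 * n_per_group - 2)
ci_95 = t_crit * se_diff  # Half-width of the 95% CI
ci_lower = mean_diff - ci_95
ci_upper = mean_diff + ci_95

# Print results
print("Frequentist T-Test Results:")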
print(f"T-statistic: {t_stat:.2f}")
print(f"P-value: {p_value:.4f}")
print(f"Mean Difference (Treatment - Control): {mean_diff:.2f} mmol/L")
print(f"95% Confidence Interval: [{ci_lower:.2f}, {ci_upper:.2f}] mmol/L")
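
If the directional question (does the supplement reduce glucose?) is the one of interest, a one-sided test is an option. The sketch below is self-contained and regenerates data with the same design as above but its own arbitrary seed, so its numbers will not match the cell above. It relies on the `alternative` argument of `ttest_ind`, available in SciPy 1.6 and later, and also shows Welch's variant, which drops the equal-variance assumption.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)  # arbitrary seed, independent of the notebook's seed
control = rng.normal(loc=0.0, scale=0.3, size=25)
treatment = rng.normal(loc=-0.4, scale=0.3, size=25)

# Treatment is passed first, so a negative t means the treatment mean is lower
two_sided = ttest_ind(treatment, control, equal_var=True)

# One-sided alternative: the treatment mean is less than the control mean
one_sided = ttest_ind(treatment, control, equal_var=True, alternative="less")

# Welch's t-test: same one-sided question without assuming equal variances
welch = ttest_ind(treatment, control, equal_var=False, alternative="less")

print(f"Two-sided p-value:         {two_sided.pvalue:.6f}")
print(f"One-sided p-value:         {one_sided.pvalue:.6f}")
print(f"Welch one-sided p-value:   {welch.pvalue:.6f}")
```

When the observed difference is in the hypothesised direction, the one-sided p-value is half the two-sided one. The direction should be chosen before looking at the data, otherwise the test overstates the evidence.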

---

## Step 6: Bayesian Analysis: Modelling the Treatment Effect 🌐

Now, let’s use a Bayesian model to estimate the treatment effect, incorporating prior knowledge from past studies (mean reduction of 0.5 mmol/L, SD 0.2). We’ll model the glucose change in each group as normal distributions with unknown means, using PyMC to sample from the posterior distribution of the treatment effect.

We’ll compare the posterior distribution of the difference in means to the Frequentist results.


In [None]:
# Define Bayesian model
with pm.Model() as model:
    # Priors for group means
    mu_control = pm.Normal('mu_control', mu=0, sigma=0.5)  # Control group mean (weakly informative, centred on no change)
    mu_treatment = pm.Normal('mu_treatment', mu=-0.5, sigma=0.2)  # Treatment group mean, informed by past studies
    # Common standard deviation
    sigma = pm.HalfNormal('sigma', sigma=0.5)
    # Likelihoods
    control_obs = pm.Normal('control_obs', mu=mu_control, sigma=sigma, observed=control)
    treatment_obs = pm.Normal('treatment_obs', mu=mu_treatment, sigma=sigma, observed=treatment)
    # Derived quantity: difference in means
    diff_means = pm.Deterministic('diff_means', mu_treatment - mu_control)
    # Sample from posterior; random_seed makes the MCMC draws reproducible
    trace = pm.sample(1000, tune=1000, random_seed=11088, return_inferencedata=True)

# Plot posterior of the difference in means
az.plot_posterior(trace, var_names=['diff_means'])
plt.title('Posterior of Treatment Effect (Difference in Means)')  # no emoji: the default Matplotlib font lacks the glyph
plt.xlabel('Difference in Means (mmol/L)')
plt.tight_layout()
plt.show()

# Summary of posterior
summary = az.summary(trace, var_names=['diff_means'])
print("Bayesian Posterior Summary:")
print(summary)

---

## Step 7: Comparing the Results 🔍

### Frequentist Results
The t-test gave us a p-value (e.g., 0.0002). Because p < 0.05, we reject the null hypothesis at the 5% level. The 95% confidence interval (e.g., [-0.60, -0.20]) gives a range of plausible values for the true difference in means, but it’s not a probability statement about this particular interval: rather, 95% of intervals constructed this way would contain the true value if the experiment were repeated many times.

### Bayesian Results
The posterior distribution of the difference in means directly shows the probability distribution of the treatment effect. The 94% highest density interval (HDI, the type of credible interval ArviZ reports by default) might be [-0.58, -0.22], meaning that, given the model and priors, there’s a 94% probability the true difference lies in this range. The posterior mean (e.g., -0.40) is close to the Frequentist point estimate here, but it comes with a full distribution of uncertainty.
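
To show what an HDI is, the sketch below computes one by hand as the narrowest interval containing 94% of a set of draws. ArviZ provides `az.hdi` for real analyses. The helper `hdi` and the stand-in draws (a normal with mean -0.40 and SD 0.09, mirroring the example numbers above) are assumptions for illustration, not output from the PyMC model.

```python
import numpy as np

def hdi(samples, prob=0.94):
    """Narrowest interval containing `prob` of the samples (assumes a unimodal distribution)."""
    x = np.sort(np.asarray(samples))
    n = len(x)
    k = int(np.ceil(prob * n))              # number of draws the interval must contain
    widths = x[k - 1:] - x[: n - k + 1]     # width of every window of k consecutive draws
    i = np.argmin(widths)                   # start index of the narrowest window
    return x[i], x[i + k - 1]

rng = np.random.default_rng(3)  # arbitrary seed
draws = rng.normal(loc=-0.40, scale=0.09, size=4000)  # stand-in for posterior draws

hdi_low, hdi_high = hdi(draws, prob=0.94)
inside = np.mean((draws >= hdi_low) & (draws <= hdi_high))

print(f"94% HDI: [{hdi_low:.2f}, {hdi_high:.2f}] mmol/L")
print(f"Share of draws inside the interval: {inside:.3f}")
```

For a symmetric posterior like this one, the HDI nearly coincides with the central (equal-tailed) interval. The two differ when the posterior is skewed.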

### Key Differences
- **Interpretation**: The Bayesian HDI is directly interpretable as a probability, while the Frequentist CI is not. This makes Bayesian results more intuitive for nutrition researchers (e.g., assessing supplement efficacy).
- **Prior Knowledge**: The Bayesian model used prior knowledge (mean reduction of 0.5), which can stabilise estimates in small trials, unlike the Frequentist approach.
- **Uncertainty**: Bayesian methods provide a full posterior distribution, offering richer insights (e.g., probability of a clinically meaningful reduction > 0.3 mmol/L).
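
The probability of a clinically meaningful reduction can be read directly off posterior draws. The sketch below uses stand-in draws (a normal with mean -0.40 and SD 0.09, mirroring the example numbers above) so that it runs without PyMC. With the fitted model, the draws would instead come from `trace.posterior["diff_means"].values.ravel()`.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed
# Stand-in for posterior draws of the difference in means (mmol/L)
draws = rng.normal(loc=-0.40, scale=0.09, size=4000)

threshold = -0.3  # a reduction larger than 0.3 mmol/L counts as clinically meaningful

# Share of draws beyond the threshold = posterior probability of a meaningful reduction
p_meaningful = np.mean(draws < threshold)
# Share of draws below zero = posterior probability of any reduction
p_any_reduction = np.mean(draws < 0)

print(f"P(reduction > 0.3 mmol/L): {p_meaningful:.3f}")
print(f"P(any reduction):          {p_any_reduction:.3f}")
```

A statement like this has no direct Frequentist counterpart, since a p-value is not the probability that the effect exceeds a threshold.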

In nutrition and food science, Bayesian methods can be particularly useful when sample sizes are small (e.g., pilot trials) or when integrating prior studies, while Frequentist methods are efficient for large, well-controlled trials.

---

## Step 8: Learning Points and Next Steps 🎓

### Learning Points
- **Frequentist Methods**: Efficient and widely accepted, but they don’t incorporate prior knowledge and can be less reliable with small samples, common in nutrition pilot studies.
- **Bayesian Methods**: Intuitive and flexible, especially for integrating prior studies or handling complex data in nutrition (e.g., hierarchical models for dietary effects across populations).
- **Practical Application**: In our clinical trial example, Bayesian methods provided a probabilistic interpretation of the supplement’s effect, complementing the Frequentist p-value and CI.
- **Choosing an Approach**: Use Frequentist methods for large trials with clear hypotheses (e.g., testing a new diet’s effect on weight loss). Use Bayesian methods when prior knowledge is available or uncertainty quantification is key (e.g., estimating nutrient effects in a small cohort).

### Next Steps
- **Explore More Examples**: Apply these methods to other nutrition problems, like dietary intake analysis or metabolomics (e.g., PCA with Bayesian priors).
- **Cross-Validation**: In supervised tasks (e.g., with `RandomForestClassifier`), use cross-validation to compare the predictive performance of models fitted under each paradigm.
- **Hierarchical Models**: Use Bayesian methods for hierarchical data in nutrition, such as varying effects of a diet across different age groups.

*Keep exploring statistical methods to unlock deeper insights in nutrition and food science! 🥕📉*

---

### Setup Requirements
1. **Install Libraries**:
   ```bash
   source ~/Documents/data-analysis-toolkit-FNS/venv/bin/activate
   pip install numpy pandas matplotlib seaborn scipy pymc arviz
   ```
2. **Environment**: Python 3.9, compatible with Apple Silicon (MPS).

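### Expected Output
The numbers below are illustrative; exact values depend on the simulated draws and the MCMC sampler, so your output will differ.
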
- **Box Plot**: A box plot comparing glucose change between control and treatment groups.
- **Frequentist Results** (Console):
  ```
  Frequentist T-Test Results:
  T-statistic: 5.12
  P-value: 0.0002
  Mean Difference (Treatment - Control): -0.40 mmol/L
  95% Confidence Interval: [-0.60, -0.20] mmol/L
  ```
- **Bayesian Posterior Plot**: A plot of the posterior distribution of the difference in means, with a 94% HDI.
- **Bayesian Summary** (Console):
  ```
  Bayesian Posterior Summary:
              mean   sd  hdi_3%  hdi_97%  ...
  diff_means -0.40 0.09  -0.58   -0.22    ...
  ```