# 📈 5.1 Bayesian Methods in Nutrition Research

This notebook introduces Bayesian methods for nutrition data analysis, focusing on comparing two groups (e.g., Female vs. Male) using PyMC. Bayesian statistics let us model uncertainty explicitly and update our beliefs as data arrive, which suits nutrition studies where variability between people is common.

**Objectives**:
- Learn the basics of Bayesian inference.
- Apply Bayesian methods to compare nutrient intake between groups.
- Visualise posterior distributions using ArviZ.

**Context**: In nutrition research, we often compare groups (e.g., vitamin levels in females vs. males). Bayesian methods provide a flexible framework to estimate parameters and quantify uncertainty.

<details><summary>Fun Fact</summary>
Bayesian methods are like a hippo updating its diet plan: it starts with a guess (the prior) and refines it with new data (the posterior)! 🦛
</details>
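
To make the prior-to-posterior idea concrete before any sampling is involved, here is a minimal sketch of the simplest case: estimating a mean when the observation standard deviation is assumed to be known, so the update has a closed form. The function name `normal_posterior`, the readings, and the value `known_sd=4.0` are illustrative assumptions, not part of the course dataset.

```python
import numpy as np

def normal_posterior(prior_mean, prior_sd, data, known_sd):
    """Closed-form posterior for a normal mean when the observation SD is known."""
    data = np.asarray(data, dtype=float)
    n = data.size
    prior_precision = 1.0 / prior_sd**2   # how strongly the prior pulls
    data_precision = n / known_sd**2      # how strongly the data pull
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * data.mean())
    return post_mean, np.sqrt(post_var)

# Hypothetical vitamin D readings in ng/mL (made up for illustration)
readings = np.array([22.0, 25.0, 19.0, 24.0, 20.0])
post_mean, post_sd = normal_posterior(prior_mean=0.0, prior_sd=10.0,
                                      data=readings, known_sd=4.0)
print(f'Posterior mean: {post_mean:.2f}, posterior SD: {post_sd:.2f}')
```

The posterior mean is a precision-weighted average of the prior mean and the sample mean, so it sits between the two and moves towards the data as more observations arrive.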

In [None]:
# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

# Define the module and dataset for this notebook
MODULE = '05_advanced'
DATASET = 'vitamin_trial.csv'  # must match the file loaded in the data cell
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

# Step 1: Attempt to clone the repository (automatic method)
# Note: If you encounter a cloning error (e.g., 'fatal: destination path already exists'),
#       reset the runtime (Runtime > Restart runtime) and run this cell again.
try:
    print('Attempting to clone repository...')
    if os.path.exists(BASE_PATH):
        print('Repository already exists, skipping clone.')
    else:
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    
    # Debug: Print directory structure
    print('Listing repository contents:')
    !ls {BASE_PATH}
    print('Listing notebooks directory contents:')
    !ls {BASE_PATH}/notebooks
    
    # Check if the module directory exists
    if not os.path.exists(MODULE_PATH):
        raise FileNotFoundError(f'Module directory {MODULE_PATH} not found. Check the repository structure.')
    
    # Set working directory to the notebook's folder
    os.chdir(MODULE_PATH)
    
    # Verify dataset is accessible
    if os.path.exists(DATASET_PATH):
        print(f'Dataset found: {DATASET_PATH} 🦛')
    else:
        raise FileNotFoundError(f'Dataset {DATASET} not found after cloning.')
except Exception as e:
    print(f'Automatic setup failed: {e}')
    print('Falling back to manual upload option...')

    # Step 2: Manual upload option
    print(f'Please upload {DATASET} manually.')
    print('1. Click the "Choose Files" button below.')
    print(f'2. Select {DATASET} from your local machine.')
    print(f'3. The file will be saved as {DATASET_PATH} in the current working directory.')
    
    # Create the data directory if it doesn't exist
    os.makedirs('data', exist_ok=True)
    
    # Prompt user to upload the dataset
    uploaded = files.upload()
    
    # Check if the dataset was uploaded
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} to {DATASET_PATH} 🦛')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')

# Install required packages for this notebook (PyMC, ArviZ and Matplotlib are used in later cells)
%pip install pandas numpy pymc arviz matplotlib
print('Python environment ready.')

## 📊 Load and Explore Data

We’ll use `vitamin_trial.csv` to compare vitamin levels between females and males. The dataset includes:
- `vitamin_level`: The measured vitamin level (e.g., Vitamin D in ng/mL).
- `sex`: Group indicator (0 for Female, 1 for Male).
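
Because the model later uses the `sex` codes to index into a vector of group means, it can help to confirm that the two columns look as described before fitting anything. The following is a rough sketch of such a check; the helper name `check_trial_columns` and the demo rows are hypothetical, and the real file may need different handling of missing values.

```python
import numpy as np
import pandas as pd

def check_trial_columns(df):
    """Return (y, group) after basic sanity checks on the two expected columns."""
    missing = {'vitamin_level', 'sex'} - set(df.columns)
    if missing:
        raise KeyError(f'Missing columns: {sorted(missing)}')
    # Drop rows where either value is absent (an assumption; imputation is another option)
    clean = df.dropna(subset=['vitamin_level', 'sex'])
    codes = set(clean['sex'].unique())
    if not codes <= {0, 1}:
        raise ValueError(f'Unexpected sex codes: {sorted(codes)}')
    y = clean['vitamin_level'].to_numpy(dtype=float)
    group = clean['sex'].to_numpy(dtype=int)  # integer codes are needed for indexing
    return y, group

# Hypothetical rows, made up for illustration
demo = pd.DataFrame({'vitamin_level': [21.5, 18.2, np.nan, 25.1],
                     'sex': [0, 1, 0, 1]})
y_demo, group_demo = check_trial_columns(demo)
print(pd.Series(group_demo).map({0: 'Female', 1: 'Male'}).value_counts())
```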

In [None]:
# Import libraries for data handling
import pandas as pd
import numpy as np

# Load the dataset
data = pd.read_csv('data/vitamin_trial.csv')

# Extract vitamin levels and group indicators
y = data['vitamin_level'].values  # Response variable
group = data['sex'].values        # 0 for Female, 1 for Male

# Quick summary
print('Mean vitamin levels:')
print(f'Female: {np.mean(y[group == 0]):.1f}')
print(f'Male: {np.mean(y[group == 1]):.1f}')

## 🧠 Bayesian Model

We’ll model the vitamin levels as normally distributed with different means for each group:

- **Priors**:
  - `mu[0]`: Mean vitamin level for Females ~ Normal(0, 10)
  - `mu[1]`: Mean vitamin level for Males ~ Normal(0, 10)
  - `sigma`: Standard deviation ~ HalfNormal(1)
- **Likelihood**:
  - `vitamin_level` ~ Normal(mu[group], sigma)
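
One way to see what these priors imply before looking at the data is a prior predictive simulation: draw parameters from the priors, then draw a vitamin level from the likelihood for each parameter draw. The sketch below does this with NumPy alone for a single group; the seed and number of draws are arbitrary choices, and a HalfNormal(1) draw is obtained as the absolute value of a standard normal draw.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
n_draws = 5000

mu_draws = rng.normal(loc=0.0, scale=10.0, size=n_draws)            # mu ~ Normal(0, 10)
sigma_draws = np.abs(rng.normal(loc=0.0, scale=1.0, size=n_draws))  # sigma ~ HalfNormal(1)
y_sim = rng.normal(loc=mu_draws, scale=sigma_draws)                 # one simulated level per draw

low, high = np.percentile(y_sim, [2.5, 97.5])
print(f'95% of prior-predictive vitamin levels fall in [{low:.1f}, {high:.1f}]')
print(f'Share of simulated levels below zero: {np.mean(y_sim < 0):.2f}')
```

Since the prior on `mu` is centred on zero, about half of the simulated levels come out negative, which is not possible for a concentration. With enough data the likelihood tends to dominate, but this is a useful thing to keep in mind when experimenting with alternative priors in the exercises.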

In [None]:
# Import Bayesian libraries
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt

# Check PyMC version for compatibility
print(f'PyMC version: {pm.__version__}')

# Define the Bayesian model
with pm.Model() as model:
    # Priors for the means of Female (group 0) and Male (group 1)
    mu = pm.Normal('mu', mu=0, sigma=10, shape=2)  # Mean for each group
    
    # Prior for standard deviation
    sigma = pm.HalfNormal('sigma', sigma=1)
    
    # Likelihood: vitamin levels are normally distributed
    y_obs = pm.Normal('y_obs', mu=mu[group], sigma=sigma, observed=y)
    
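    # Sample from the posterior: 1000 tuning steps, then 1000 kept draws per chain
    # (random_seed is set only so that re-runs give the same draws)
    trace = pm.sample(1000, tune=1000, random_seed=42, return_inferencedata=True)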
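
Before interpreting the draws, it is worth checking that the chains agree with each other. `az.summary(trace)` reports an `r_hat` column for this purpose. To show the idea behind it, here is a sketch of the classic Gelman-Rubin statistic in plain NumPy, applied to made-up chains; ArviZ uses a more refined rank-normalised split version, so its numbers will differ slightly. The function name `classic_rhat` is an illustrative assumption.

```python
import numpy as np

def classic_rhat(draws):
    """Classic Gelman-Rubin R-hat for an array of shape (n_chains, n_draws)."""
    draws = np.asarray(draws, dtype=float)
    n_chains, n_draws = draws.shape
    chain_means = draws.mean(axis=1)
    between = n_draws * chain_means.var(ddof=1)   # between-chain variance B
    within = draws.var(axis=1, ddof=1).mean()     # average within-chain variance W
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(pooled / within)

rng = np.random.default_rng(0)
mixed = rng.normal(20.0, 1.0, size=(4, 1000))            # four chains exploring the same region
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])   # two chains shifted elsewhere
print(f'Well mixed: {classic_rhat(mixed):.3f}, poorly mixed: {classic_rhat(stuck):.3f}')
```

Values close to 1 suggest the chains are sampling the same distribution; values clearly above 1 suggest they have not converged and the posterior summaries should not be trusted yet.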

## 📉 Posterior Analysis

Let’s calculate the posterior means for each group and visualise the posterior distributions of `mu`.

In [None]:
# Calculate posterior means for Female and Male
mu_posterior = trace.posterior['mu'].mean(dim=['chain', 'draw'])
print(f'Posterior means: Female={round(float(mu_posterior[0]), 1)}, Male={round(float(mu_posterior[1]), 1)}')

# Visualise posterior distributions
az.plot_posterior(trace, var_names=['mu'])  # Density plot for each group mean, annotated with its HDI
plt.show()  # Display plot
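
The interval that `az.plot_posterior` draws on each panel is a highest density interval (HDI): the narrowest interval containing a given share of the posterior draws (94% by default in the ArviZ releases this sketch assumes). For a single-peaked posterior it can be approximated directly from sorted draws, as in the sketch below. The helper `hdi_from_samples` and the simulated draws are illustrative assumptions; in practice `az.hdi` is the tool to use.

```python
import numpy as np

def hdi_from_samples(samples, prob=0.94):
    """Narrowest interval containing `prob` of the samples (single-peaked case)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    window = int(np.ceil(prob * n))  # number of points the interval must contain
    if window < 2 or window > n:
        raise ValueError('Not enough samples for this probability.')
    # Width of every candidate interval that spans `window` consecutive sorted points
    widths = x[window - 1:] - x[:n - window + 1]
    start = int(np.argmin(widths))
    return x[start], x[start + window - 1]

# Simulated stand-in for posterior draws of one group mean (not real results)
rng = np.random.default_rng(1)
fake_mu_draws = rng.normal(21.0, 0.5, size=4000)
low, high = hdi_from_samples(fake_mu_draws, prob=0.94)
print(f'94% HDI: [{low:.2f}, {high:.2f}]')
```

Unlike an equal-tailed percentile interval, the HDI always contains the most credible values, which matters when a posterior is skewed.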

## 🧪 Exercises

1. **Change the Prior**: Modify the prior for `mu` to `Normal(5, 5)` and re-run the analysis. How do the posterior means change? Write your observations in a Markdown cell.

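2. **Add a Parameter**: Extend the model to include a different `sigma` for each group (e.g., `sigma = pm.HalfNormal('sigma', sigma=1, shape=2)`), and index it in the likelihood with `sigma=sigma[group]` so that each observation uses its own group's standard deviation. Re-run the sampling and plot the posteriors for both `mu` and `sigma`.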

3. **Compare Groups**: Calculate the posterior difference between `mu[1]` (Male) and `mu[0]` (Female) and plot its distribution using `az.plot_posterior`.

**Guidance**: Use the code above as a starting point. Experiment with priors and parameters to see how they affect the results!

**Your Answers**:

**Exercise 1: Change the Prior**  
[Write your observations here]

**Exercise 2: Add a Parameter**  
[Write your code and results here]

**Exercise 3: Compare Groups**  
[Write your code and results here]

## Conclusion

You’ve applied Bayesian methods to compare vitamin levels between females and males, calculating posterior means and visualising distributions. Bayesian approaches are powerful for nutrition research, allowing you to incorporate prior knowledge and quantify uncertainty.

**Next Steps**: Explore workflow automation in `5.2_workflow_automation.ipynb`.

**Resources**:
- [PyMC Documentation](https://www.pymc.io/)
- [ArviZ Documentation](https://arviz-devs.github.io/arviz/)
- Repository: [github.com/ggkuhnle/data-analysis-projects](https://github.com/ggkuhnle/data-analysis-projects)