# 📊 5.1 Bayesian Inference

This notebook introduces Bayesian inference for analysing nutrient intakes, an approach well suited to nutrition research such as National Diet and Nutrition Survey (NDNS) studies. Bayesian methods combine prior knowledge with observed data to estimate parameters and to quantify the uncertainty around those estimates.

**Objectives**:
- Understand Bayesian principles: priors, likelihoods, and posteriors.
- Apply PyMC to model nutrient intake data.
- Visualise and interpret posterior distributions.

**Context**: Bayesian approaches are valuable for small datasets or when prior information (e.g., Recommended Dietary Allowances) is available. We will model iron intakes using `hippo_nutrients.csv`.
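
To make the idea of combining a prior with data concrete, here is a minimal sketch of the textbook normal-normal update, in which the posterior mean is a precision-weighted average of the prior mean and the sample mean. It assumes the standard deviation of the data is known (a simplification; the PyMC model in this notebook estimates it instead), and the intake values and the `normal_posterior` helper are hypothetical, chosen only for illustration.

```python
import math


def normal_posterior(prior_mean, prior_sd, data, known_sd):
    """Posterior for a normal mean when the data sd is treated as known."""
    n = len(data)
    sample_mean = sum(data) / n
    prior_precision = 1.0 / prior_sd**2   # how much the prior "counts"
    data_precision = n / known_sd**2      # how much the data "count"
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean
                 + data_precision * sample_mean) / post_precision
    post_sd = math.sqrt(1.0 / post_precision)
    return post_mean, post_sd


# Hypothetical iron intakes in mg (illustrative values, not from the dataset)
intakes = [8.2, 8.5, 7.9, 9.1, 8.4]
post_mean, post_sd = normal_posterior(prior_mean=8.0, prior_sd=2.0,
                                      data=intakes, known_sd=1.0)
print(f'Posterior mean: {post_mean:.2f} mg, sd: {post_sd:.2f} mg')
# Posterior mean: 8.40 mg, sd: 0.44 mg
```

With only five observations the posterior mean (8.40) already sits close to the sample mean (8.42), because the prior standard deviation of 2 makes the prior weak relative to the data. A tighter prior would pull the estimate further towards 8.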

<details>
<summary>Advanced Note</summary>
Unlike the frequentist approaches covered in 4.4, Bayesian methods express uncertainty directly as a probability distribution over each parameter.
</details>

<details>
<summary>Fun Fact</summary>
Tracking nutrients is like a hippo logging its daily diet—precision matters! 🦛
</details>

In [None]:
# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

# Define the module and dataset for this notebook
MODULE = '05_advanced'
DATASET = 'large_food_log.csv'
DATASET_PATH = os.path.join('data', DATASET)

# Step 1: Attempt to clone the repository (automatic method)
try:
    print('Attempting to clone repository...')
    !git clone https://github.com/ggkuhnle/data-analysis-toolkit-FNS.git
    os.chdir(f'/content/data-analysis-toolkit-FNS/notebooks/{MODULE}')
    if os.path.exists(DATASET_PATH):
        print(f'Dataset found: {DATASET_PATH} 🦛')
    else:
        # Raise with a message so the except block reports the actual cause
        raise FileNotFoundError(f'Dataset {DATASET} not found after cloning.')
except Exception as e:
    print(f'Cloning failed: {e}')
    print('Falling back to manual upload option...')

    # Step 2: Manual upload option
    print(f'Please upload {DATASET} manually.')
    print('1. Click the "Choose Files" button below.')
    print(f'2. Select {DATASET} from your local machine.')
    print(f'3. Ensure the file is placed in notebooks/{MODULE}/data/')
    
    # Create the data directory if it doesn't exist
    os.makedirs('data', exist_ok=True)
    
    # Prompt user to upload the dataset
    uploaded = files.upload()
    
    # Check if the dataset was uploaded
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} to {DATASET_PATH} 🦛')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')

# Install required packages for this notebook
%pip install pandas numpy
print('Python environment ready.')

In [1]:
# Install required packages
# (mainly needed on Colab)
%pip install pymc pandas numpy matplotlib arviz
import pymc as pm
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import arviz as az
print('Bayesian analysis environment ready.')

## Data Preparation

The `hippo_nutrients.csv` dataset contains simulated nutrient intakes, inspired by NDNS. We will extract iron data for analysis.

In [2]:
df = pd.read_csv('../data_handling/data/hippo_nutrients.csv')
# Select the iron rows and the Value column in one step, then drop missing values
iron_data = df.loc[df['Nutrient'] == 'Iron', 'Value'].dropna()
print(df.head(2))

   ID Nutrient  Year  Value  Age Sex
0  H1     Iron  2024    8.2   25   F
1  H1     Iron  2025    8.5   26   F
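
Before modelling, it can help to check how many records survive the filter and what their spread looks like. The sketch below does this on a small hypothetical DataFrame that mirrors the columns shown in the output above; the rows in `df_demo` are invented for illustration and are not taken from `hippo_nutrients.csv`.

```python
import pandas as pd

# Hypothetical rows mirroring the dataset's columns (not the real data)
df_demo = pd.DataFrame({
    'ID': ['H1', 'H1', 'H2', 'H2', 'H3'],
    'Nutrient': ['Iron', 'Iron', 'Calcium', 'Iron', 'Iron'],
    'Year': [2024, 2025, 2024, 2024, 2025],
    'Value': [8.2, 8.5, 1200.0, None, 7.9],
    'Age': [25, 26, 31, 31, 40],
    'Sex': ['F', 'F', 'M', 'M', 'F'],
})

# Same filter as in the cell above: iron rows only, Value column only
iron_values = df_demo.loc[df_demo['Nutrient'] == 'Iron', 'Value']
n_missing = int(iron_values.isna().sum())
iron_demo = iron_values.dropna()

print(f'{len(iron_demo)} iron records kept, {n_missing} missing value(s) dropped')
print(iron_demo.describe())
```

The same three lines can be run on `df` itself to confirm that `iron_data` contains the expected number of observations before it is passed to the model.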


## Bayesian Model

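We model iron intake as normally distributed with an unknown mean (`mu`) and standard deviation (`sigma`). The prior for `mu` is a normal distribution centred on 8 mg (based on RDA) with a standard deviation of 2 mg, and `sigma` has a half-normal prior with scale 1. The model estimates the posterior distribution of both parameters.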
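
To see what the priors used in this section's model cell imply before any data are observed, the sketch below computes where 95% of the prior mass lies for `mu ~ Normal(8, 2)` and `sigma ~ HalfNormal(1)`. It uses only the standard library's `statistics.NormalDist` and relies on the fact that a half-normal variable is the absolute value of a zero-centred normal one. It is an illustration alongside the notebook, not part of its original workflow.

```python
from statistics import NormalDist

# Prior on the mean: Normal(mean=8, sd=2)
prior_mu = NormalDist(mu=8, sigma=2)
mu_lower = prior_mu.inv_cdf(0.025)
mu_upper = prior_mu.inv_cdf(0.975)

# Prior on the sd: HalfNormal(scale=1) is |Z| with Z ~ Normal(0, 1),
# so P(sigma < q) = 0.95 when q is the 97.5th percentile of Z
sigma_upper = NormalDist(mu=0, sigma=1).inv_cdf(0.975)

print(f'95% of prior mass for mu: {mu_lower:.2f} to {mu_upper:.2f} mg')
print(f'95% of prior mass for sigma: below {sigma_upper:.2f} mg')
# 95% of prior mass for mu: 4.08 to 11.92 mg
# 95% of prior mass for sigma: below 1.96 mg
```

These ranges are a quick sanity check: if mean intakes outside roughly 4 to 12 mg, or standard deviations above about 2 mg, were plausible for the population, wider priors would be worth considering.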

**Exercise 1**: Run the model to estimate iron intake parameters.

In [3]:
with pm.Model() as model:
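    # Prior on the mean intake: Normal(mean=8, sd=2)
    mu = pm.Normal('mu', mu=8, sigma=2)
    # Prior on the standard deviation: HalfNormal(scale=1), positive values only
    sigma = pm.HalfNormal('sigma', sigma=1)
    # Likelihood: observed iron intakes are normal around mu with sd sigma
    obs = pm.Normal('obs', mu=mu, sigma=sigma, observed=iron_data)
    # Draw 1000 posterior samples per chain
    trace = pm.sample(1000, return_inferencedata=True)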

# Visualise posterior distributions
az.plot_posterior(trace, var_names=['mu', 'sigma'], hdi_prob=0.95)
plt.show()

## Parameter Summary

The table below summarises the estimated parameters, giving the posterior mean and 95% highest density interval (HDI) for each.

In [4]:
summary = az.summary(trace, var_names=['mu', 'sigma'], hdi_prob=0.95)
# Map ArviZ column names to readable labels, then show only those columns
column_names = {'mean': 'Mean', 'hdi_2.5%': '95% HDI Lower', 'hdi_97.5%': '95% HDI Upper'}
print(summary[list(column_names)].rename(columns=column_names))

        Mean  95% HDI Lower  95% HDI Upper
mu      8.3           7.8           8.8
sigma   1.1           0.9           1.3
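
The 95% HDI reported above is the narrowest interval that contains 95% of the posterior draws. A minimal sketch of that definition, for a unimodal posterior, is given below. The `hdi` helper is a hypothetical stand-in written for illustration (ArviZ has its own implementation), and the draws are simulated from a normal distribution rather than taken from `trace`.

```python
import math
import random


def hdi(samples, prob=0.95):
    """Narrowest interval containing `prob` of the samples (unimodal case)."""
    ordered = sorted(samples)
    n = len(ordered)
    n_included = math.ceil(prob * n)
    # Slide a window of n_included consecutive draws and keep the narrowest one
    best_start = min(
        range(n - n_included + 1),
        key=lambda i: ordered[i + n_included - 1] - ordered[i],
    )
    return ordered[best_start], ordered[best_start + n_included - 1]


# Simulated stand-in for posterior draws of mu (not the real trace)
rng = random.Random(42)
draws = [rng.gauss(8.3, 0.25) for _ in range(4000)]

lower, upper = hdi(draws, prob=0.95)
coverage = sum(lower <= d <= upper for d in draws) / len(draws)
print(f'95% HDI: {lower:.2f} to {upper:.2f} (covers {coverage:.1%} of draws)')
```

For a symmetric posterior the HDI is close to the equal-tailed 2.5% to 97.5% percentile interval; the two differ more noticeably for skewed posteriors such as that of a standard deviation.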


## Exercise 2: Interpret Results

Examine the posterior plots and table. Document the estimated mean iron intake and its 95% HDI in a Markdown cell.

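**Guidance**: Report the peak of the posterior for `mu` as the most probable value, and use the HDI bounds to describe the range that contains 95% of the posterior probability.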

**Answer**:

The estimated mean iron intake is approximately...

## Conclusion

This notebook demonstrated Bayesian inference using PyMC to estimate nutrient intakes with uncertainty. You have learned how to:
- Specify priors and likelihoods.
- Fit a model and visualise its posteriors.
- Interpret parameter estimates for nutrition research.

**Next Steps**: Explore workflow automation in 5.2.

**Resources**:
- [PyMC Documentation](https://www.pymc.io/)
- [Bayesian Methods for Nutrition](https://statswithr.com/book/bayesian-basics.html)
- Repository: [github.com/ggkuhnle/data-analysis-toolkit-FNS](https://github.com/ggkuhnle/data-analysis-toolkit-FNS)