# 🦛 3.1 Principles of Tidy Data

In this notebook, we’ll explore **four principles of tidy data** based on Hadley Wickham’s work, see why they matter for nutrition research, and apply them to our `hippo_nutrients.csv` dataset.

## 📋 The Four Principles of Tidy Data

1. **Each variable is a column**  
2. **Each observation is a row**  
3. **Each type of observational unit forms a table**  
4. **Each cell contains a single value**

Tidy data makes analysis straightforward because filtering, grouping, and plotting all work directly on columns: no hippo-chasing through messy tables!

## 🐘 From Messy to Tidy: A Visual Example

**Messy data** (wide format, multiple values in a cell):

| ID | Iron_2024;2025 | Calcium_2024;2025 |
|----|----------------|--------------------|
| H1 | 8.2; 8.5       | 1200; 1250         |

**Tidy data** (long format, one value per cell):

| ID | Year | Nutrient | Value |
|----|------|----------|-------|
| H1 | 2024 | Iron     | 8.2   |
| H1 | 2025 | Iron     | 8.5   |
| H1 | 2024 | Calcium  | 1200  |
| H1 | 2025 | Calcium  | 1250  |
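To make the conversion concrete, here is a minimal sketch that rebuilds the tidy table from the messy one. It uses a hand-typed toy copy of the example above (not the real dataset) and assumes every header follows the `Nutrient_Year;Year` pattern with values separated by `;`.

```python
import pandas as pd

# Toy copy of the messy table above (typed by hand, not the real dataset).
messy = pd.DataFrame({
    "ID": ["H1"],
    "Iron_2024;2025": ["8.2; 8.5"],
    "Calcium_2024;2025": ["1200; 1250"],
})

records = []
for _, row in messy.iterrows():
    for col in messy.columns.drop("ID"):
        nutrient, years = col.split("_")   # e.g. "Iron", "2024;2025"
        values = row[col].split(";")       # e.g. ["8.2", " 8.5"]
        # Pair each year in the header with the value in the same position.
        for year, value in zip(years.split(";"), values):
            records.append({
                "ID": row["ID"],
                "Year": int(year),
                "Nutrient": nutrient,
                "Value": float(value),     # float() ignores surrounding spaces
            })

tidy = pd.DataFrame(records)
print(tidy)
```

Each packed cell becomes one row per year, so the single messy row yields four tidy rows.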

In [None]:
# Setup environment (Colab/local)
%run ../../bootstrap.py  # installs requirements + editable package

import pandas as pd
import numpy as np

print("Environment ready. pandas:", pd.__version__, "numpy:", np.__version__)

## 🔍 Load & Preview `hippo_nutrients.csv`

Let’s load the dataset and inspect its initial structure.

In [None]:
df = pd.read_csv('hippo_nutrients.csv')
df.head()

## 🔄 Reshaping with `melt()`

We’ll transform from wide to long format, making each nutrient-year combination its own row.

In [None]:
# Assumed columns: ID, Age, Sex, Iron_2024, Iron_2025, Calcium_2024, Calcium_2025
# Every column not listed in id_vars is melted into a (Nutrient_Year, Value) pair
df_melted = df.melt(id_vars=['ID', 'Age', 'Sex'],
                    var_name='Nutrient_Year',
                    value_name='Value')
# Split 'Iron_2024' into 'Iron' and '2024' once, then reuse both parts
parts = df_melted['Nutrient_Year'].str.split('_', expand=True)
df_tidy = df_melted.assign(
    Nutrient=parts[0],
    Year=parts[1].astype(int)
).drop(columns='Nutrient_Year')
df_tidy.head()
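If your copy of the file does not match the assumed wide layout, the same reshaping can be tried on a small stand-in table first. The sketch below uses two made-up hippos (`wide_demo` and its values are hypothetical) and checks that the row count comes out as expected.

```python
import pandas as pd

# Hypothetical two-hippo table using the assumed wide layout.
wide_demo = pd.DataFrame({
    "ID": ["H1", "H2"],
    "Age": [25, 30],
    "Sex": ["F", "M"],
    "Iron_2024": [8.2, 7.9],
    "Iron_2025": [8.5, 8.1],
    "Calcium_2024": [1200, 1150],
    "Calcium_2025": [1250, 1180],
})

long_demo = wide_demo.melt(id_vars=["ID", "Age", "Sex"],
                           var_name="Nutrient_Year",
                           value_name="Value")

# expand=True returns one column per piece of the split string.
parts = long_demo["Nutrient_Year"].str.split("_", expand=True)
tidy_demo = (long_demo
             .assign(Nutrient=parts[0], Year=parts[1].astype(int))
             .drop(columns="Nutrient_Year"))

# 2 hippos x 2 nutrients x 2 years should give 8 rows.
print(tidy_demo.shape)
print(tidy_demo.sort_values(["ID", "Nutrient", "Year"]))
```

A quick row-count check like this is a cheap way to confirm that no measurement was lost or duplicated during the melt.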

## 🔄 Back to Wide with `pivot_table()`

To show the inverse operation, we pivot the Iron values back to wide format, with one column per year.

In [None]:
# pivot_table() aggregates duplicate index/column pairs (mean by default)
df_pivot = (
    df_tidy[df_tidy['Nutrient'] == 'Iron']
    .pivot_table(index=['ID', 'Age', 'Sex'],
                 columns='Year',
                 values='Value')
    .reset_index()
)
df_pivot.head()
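One detail worth seeing in action: `pivot_table()` averages duplicate entries by default, while `pivot()` raises an error instead. The sketch below illustrates both on a made-up Iron table (all values are hypothetical) and also tidies the column labels after pivoting.

```python
import pandas as pd

# Hypothetical tidy Iron measurements for two hippos.
tidy_iron = pd.DataFrame({
    "ID": ["H1", "H1", "H2", "H2"],
    "Year": [2024, 2025, 2024, 2025],
    "Nutrient": ["Iron"] * 4,
    "Value": [8.2, 8.5, 7.9, 8.1],
})

wide_iron = (tidy_iron
             .pivot_table(index="ID", columns="Year", values="Value")
             .reset_index())
wide_iron.columns.name = None  # drop the leftover "Year" label on the column axis
wide_iron = wide_iron.rename(columns={2024: "Iron_2024", 2025: "Iron_2025"})
print(wide_iron)

# Add a second, conflicting H1/2024 measurement.
dup = pd.concat([tidy_iron, tidy_iron.iloc[[0]].assign(Value=9.2)],
                ignore_index=True)

# pivot_table() silently averages the two H1/2024 values...
averaged = dup.pivot_table(index="ID", columns="Year", values="Value")
print(averaged)

# ...whereas pivot() refuses to reshape duplicated index/column pairs.
try:
    dup.pivot(index="ID", columns="Year", values="Value")
except ValueError as err:
    print("pivot() raised:", err)
```

If silent averaging would hide a data-entry problem, `pivot()` may be the safer choice because it fails loudly.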

## 🐾 Exercises

1. **Filter Iron Data**: Use the tidy DataFrame `df_tidy` to filter only Iron entries.  
2. **Compute Average Calcium**: Group by `ID` and compute the average Calcium intake across years.

*Hint*: Use `df_tidy[df_tidy['Nutrient']=='Iron']` and `groupby`.
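If you get stuck, here is one possible approach on a small made-up table. The values are illustrative and not taken from `hippo_nutrients.csv`; with the real data you would start from `df_tidy` instead of `tidy_demo`.

```python
import pandas as pd

# Hypothetical tidy table with two hippos, two nutrients, two years.
tidy_demo = pd.DataFrame({
    "ID": ["H1", "H1", "H1", "H1", "H2", "H2", "H2", "H2"],
    "Year": [2024, 2025] * 4,
    "Nutrient": ["Iron", "Iron", "Calcium", "Calcium"] * 2,
    "Value": [8.2, 8.5, 1200, 1250, 7.9, 8.1, 1150, 1180],
})

# Exercise 1: keep only the Iron rows with a boolean mask.
iron = tidy_demo[tidy_demo["Nutrient"] == "Iron"]
print(iron)

# Exercise 2: average Calcium per hippo across years.
mean_calcium = (tidy_demo[tidy_demo["Nutrient"] == "Calcium"]
                .groupby("ID")["Value"]
                .mean())
print(mean_calcium)
```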

## ❓ Quick Quiz

**True or False**: In tidy data, you can have multiple years combined in one column header.  

<details><summary>Answer</summary>  
False. Each variable (e.g., Year) must be its own column.  
</details>

## 🎬 Conclusion

- We covered the **four tidy data principles**.  
- Converted wide tables to tidy long format with **melt**, and back to wide with **pivot_table**.  

Next: Dive into **3.2 Importing & Connecting** to pull in external data sources!