# 🦛 3.1 Data Types and Structures — *Wide vs Long (Tidy)*

This notebook explores core tabular data layouts used in nutrition research, with a focus on **tidy data** and how to move between **wide** and **long** forms using pandas.

**Objectives**:
- Understand vectors, tables, and the difference between **wide** and **long** (tidy) data.
- Use `pandas.pivot` (long → wide) and `pandas.melt` (wide → long) appropriately.
- Apply tidy principles to `hippo_nutrients.csv`.

**Context**: Tidy data is critical for efficient analysis of nutrition datasets. 🦛

<details><summary>Fun Fact</summary>
Tidy data is like a hippo’s lunch tray—neat and ready to munch!
</details>

In [None]:
# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

# Define the module and dataset for this notebook
MODULE = '03_data_handling'  # e.g., '01_infrastructure'
DATASET = 'hippo_nutrients.csv'  # e.g., 'hippo_diets.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    print('Attempting to clone repository...')
    if os.path.exists(BASE_PATH):
        print('Repository already exists, skipping clone.')
    else:
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git

    # Debug: Print directory structure
    print('Listing repository contents:')
    !ls {BASE_PATH}
    print('Listing notebooks directory contents:')
    !ls {BASE_PATH}/notebooks

    if not os.path.exists(MODULE_PATH):
        raise FileNotFoundError(f'Module directory {MODULE_PATH} not found. Check the repository structure.')

    os.chdir(MODULE_PATH)

    if os.path.exists(DATASET_PATH):
        print(f'Dataset found: {DATASET_PATH} ✅')
    else:
        # Raise with a message so the fallback below reports what actually went wrong
        raise FileNotFoundError(f'Dataset {DATASET} not found at {DATASET_PATH} after cloning.')
except Exception as e:
    print(f'Automatic setup failed: {e}')
    print('Falling back to manual upload option...')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} to {DATASET_PATH} ✅')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')

%pip install pandas numpy -q
print('Python environment ready.')

In [None]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 20)
print('Data handling environment ready.')

## Wide vs Long (Tidy): the essential distinction

**Long/tidy data** (what most modelling/plotting functions prefer):
- One **row per observation**.
- One **column per variable**.
- Example here: each row gives (`ID`, `Nutrient`, `Year`, `Value`, `Age`, `Sex`).

**Wide data** (often what you get from spreadsheets):
- One row per **entity/time-point**, with **many columns** for different measures.
- Example: one row per (`ID`, `Year`) and separate columns `Iron`, `Calcium`, `Zinc`, ...

- 👉 Use **`pivot`** to go from **long → wide**.
- 👉 Use **`melt`** to go from **wide → long**.

The trick is to correctly identify:
- **ID variables** (stay as columns and define the row): e.g., `ID`, `Year`, `Age`, `Sex`.
- **Measured variables** (become columns in wide; become rows in long): e.g., nutrient names like `Iron`, `Calcium`.
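
To make the two layouts concrete, here is a minimal sketch that writes the same four measurements in both forms. The IDs and values are hypothetical toy data, not rows from `hippo_nutrients.csv`, and only `ID` and `Year` are used as ID variables to keep it short.

```python
import pandas as pd

# Hypothetical toy values for illustration only
long_df = pd.DataFrame({
    'ID': ['H1', 'H1', 'H2', 'H2'],
    'Year': [2024, 2024, 2024, 2024],
    'Nutrient': ['Iron', 'Calcium', 'Iron', 'Calcium'],
    'Value': [8.0, 1200.0, 9.5, 1100.0],
})

# The same measurements in wide form: one row per (ID, Year)
wide_df = pd.DataFrame({
    'ID': ['H1', 'H2'],
    'Year': [2024, 2024],
    'Iron': [8.0, 9.5],
    'Calcium': [1200.0, 1100.0],
})

# Everything that is not an ID variable is a measured variable
id_vars = ['ID', 'Year']
measured = [c for c in wide_df.columns if c not in id_vars]

print('Measured variables:', measured)
print('Long rows:', len(long_df), '| wide rows x measured columns:', len(wide_df) * len(measured))
```

Each wide row expands into one long row per measured column, which is why the two counts agree.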


## Load and inspect
Load `hippo_nutrients.csv` and inspect its structure.

In [None]:
df = pd.read_csv('data/hippo_nutrients.csv')
df.head(6)

### Is the dataset already tidy (long)?
We check the columns and a few rows. If we have `Nutrient` **as a column of names** and `Value` **as a single measurement column**, the data is already **long/tidy**.

In [None]:
print('Columns:', list(df.columns))
print('\nDistinct Nutrients:', df['Nutrient'].unique()[:5], '...')
print('\nSample:')
display(df.sample(min(5, len(df))))

## Long → Wide with `pivot`
Our dataset is already long/tidy. To see the **wide** form (each nutrient as its own column), we can **pivot**:

- **Index** (row identifiers): `['ID', 'Year', 'Age', 'Sex']`
- **Columns** (new wide columns): `'Nutrient'`
- **Values** (cells): `'Value'`
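
One caveat worth checking first: `pivot` needs every combination of index and column values to be unique, and it raises a `ValueError` otherwise. The sketch below uses hypothetical toy data with a deliberate duplicate. It assumes that averaging duplicates is acceptable and uses `pivot_table` for that, which may not suit every study design.

```python
import pandas as pd

# Hypothetical toy data: (H1, 2024, Iron) appears twice
dup = pd.DataFrame({
    'ID': ['H1', 'H1', 'H1'],
    'Year': [2024, 2024, 2024],
    'Nutrient': ['Iron', 'Iron', 'Calcium'],
    'Value': [8.0, 10.0, 1200.0],
})

# Count repeated key combinations before reshaping
n_dupes = int(dup.duplicated(['ID', 'Year', 'Nutrient']).sum())
print('Duplicated key rows:', n_dupes)

try:
    dup.pivot(index=['ID', 'Year'], columns='Nutrient', values='Value')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot failed:', err)

# Assumption: averaging duplicates is scientifically acceptable here
wide_mean = dup.pivot_table(
    index=['ID', 'Year'], columns='Nutrient', values='Value', aggfunc='mean'
).reset_index()
print(wide_mean)
```

If the duplicate check reports zero rows for your data, plain `pivot` is the safer choice because it never aggregates silently.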

In [None]:
df_wide = (
    df.pivot(index=['ID','Year','Age','Sex'], columns='Nutrient', values='Value')
      .reset_index()
)
df_wide.head(6)

## Wide → Long with `melt`
If we receive data in **wide** form (e.g., columns `Iron`, `Calcium`, ...), we convert back to **long/tidy** with `melt`.

Key idea: **`id_vars`** are the identifier columns to keep as-is; **`value_vars`** are the columns to unpivot into rows. Here, the value columns are the nutrient names (e.g., `'Iron'`, `'Calcium'`, ...).
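
As a sanity check on the choice of `id_vars`, the following sketch does a full round trip (long → wide → long) on hypothetical toy data and confirms that the original rows come back. The `dropna` step is an assumption that matters for real data: `pivot` fills missing combinations with `NaN`, and `melt` would otherwise turn them into extra rows.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'ID': ['H1', 'H1', 'H2', 'H2'],
    'Year': [2024, 2024, 2024, 2024],
    'Nutrient': ['Iron', 'Calcium', 'Iron', 'Calcium'],
    'Value': [8.0, 1200.0, 9.5, 1100.0],
})
keys = ['ID', 'Year', 'Nutrient']

# Long -> wide
wide = toy.pivot(index=['ID', 'Year'], columns='Nutrient', values='Value').reset_index()

# Wide -> long; with value_vars omitted, every non-id column is unpivoted
back = wide.melt(id_vars=['ID', 'Year'], var_name='Nutrient', value_name='Value')
back = back.dropna(subset=['Value'])  # drop combinations that were never measured

# Compare the two long tables row by row after sorting on the keys
a = toy.sort_values(keys)[keys + ['Value']].to_dict('records')
b = back.sort_values(keys)[keys + ['Value']].to_dict('records')
round_trip_ok = (a == b)
print('Round trip preserved the data:', round_trip_ok)
```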

In [None]:
# Work from the wide table we just created
id_cols = ['ID','Year','Age','Sex']
value_cols = [c for c in df_wide.columns if c not in id_cols]

df_long_again = df_wide.melt(
    id_vars=id_cols,
    value_vars=value_cols,
    var_name='Nutrient',
    value_name='Value'
)
df_long_again.sort_values(['ID','Year','Nutrient']).head(8)

### Why calling `melt` on already-long data is wrong
If you call:

```python
df.melt(id_vars=['ID','Age','Sex'], var_name='Nutrient', value_name='value')
```

on the **already-long** `df`, pandas unpivots *every column that isn’t* in `id_vars`, i.e. it stacks the columns `['Nutrient','Year','Value']`. The new `'Nutrient'` column then contains the **former column names** (`'Nutrient'`, `'Year'`, `'Value'`), and the new `'value'` column mixes nutrient names, years, and measurements, which is not what we want.

✅ **Rule of thumb**:
- If your data already has a single measurement column (e.g. `'Value'`) and a variable name column (e.g. `'Nutrient'`), it’s **already long** — you don’t need `melt`.
- Use `melt` only when you have separate **value columns** (like `Iron`, `Calcium`, …) that you want to gather into rows.
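
The effect of melting an already-long table is easy to reproduce on a small frame. The sketch below uses two hypothetical rows with the same column layout as described above. The values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows, already in long form
toy = pd.DataFrame({
    'ID': ['H1', 'H1'],
    'Nutrient': ['Iron', 'Calcium'],
    'Year': [2024, 2024],
    'Value': [8.0, 1200.0],
    'Age': [25, 25],
    'Sex': ['F', 'F'],
})

# The mistaken call: Nutrient, Year and Value all get stacked
wrong = toy.melt(id_vars=['ID', 'Age', 'Sex'], var_name='Nutrient', value_name='value')

# The 'Nutrient' column now holds former column names, not nutrients
stacked_names = sorted(wrong['Nutrient'].unique())
print('Labels in the new Nutrient column:', stacked_names)
print('Rows before:', len(toy), '| rows after:', len(wrong))
```

The row count triples because three columns were stacked, and the `value` column ends up with mixed types.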

## Exercise 1: Filter Tidy Data

Filter the **long/tidy** data to show only iron intakes and describe the result in a Markdown cell.

**Guidance**: Use either the original `df` (already tidy) or `df_long_again`:

```python
df_iron = df[df['Nutrient'] == 'Iron']
df_iron.sort_values(['ID','Year']).head()
```

**Answer**:

The filtered iron data shows...

## Conclusion

You’ve learned how to recognise and transform between **wide** (many value columns) and **long/tidy** (one measurement column + variable name column). Use `pivot` to widen and `melt` to tidy.

**Resources**:
- [Tidy Data Paper](https://vita.had.co.nz/papers/tidy-data.pdf)
- [pandas `melt` docs](https://pandas.pydata.org/docs/reference/api/pandas.melt.html)
- [pandas `pivot` docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html)
- Repository: [github.com/ggkuhnle/data-analysis-projects](https://github.com/ggkuhnle/data-analysis-projects)