# 📊 3.1 What Are Data? Types and Structures

Welcome to the world of **data**! Think of data as the ingredients in a recipe: they come in different forms, and how you organize them matters for cooking up insights. In this notebook, we'll unpack the basics of data structures and formats, setting you up to handle real-world datasets like a pro. 🧑‍🍳

**What we'll cover:**
- **Vectors**: The simplest building blocks of data (like a list of numbers).
- **Tables**: Organized grids of data (think spreadsheets).
- **Tidy data**: A way to structure data so it’s easy to analyze.
- **Wide vs. Long formats**: Two ways to arrange your data table.
- **Why this matters**: How these concepts connect to tools like Excel and pandas.

**Why care?** In nutrition, data might be nutrient intakes, survey responses, or lab results. Understanding their structure helps you clean, analyze, and visualize them effectively.

Let’s dive in! 🚀

<details>
<summary>🤔 Advanced: Beyond Tables</summary>
Data isn’t always tabular! In advanced work, you might encounter time-series, geospatial, or graph data. For now, we’ll focus on tables since they’re common in nutrition research, but keep an eye out for these other types later!
</details>

In [1]:
# Let’s load our trusty tools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # For a quick visual
%matplotlib inline

## 🧬 Vectors: The Simplest Data

A **vector** is a sequence of values of the same type, like a single column in a spreadsheet. For example, the iron intake (in mg) of three participants could be the vector `[8.2, 9.1, 7.5]`.

In Python, a vector can be stored as a **list** or, more commonly for numeric data, as a **NumPy array**, which keeps a single data type for all of its elements. Let’s create one:

In [2]:
iron_intake = np.array([8.2, 9.1, 7.5])
print(f'Iron intake vector: {iron_intake}')
print(f'Type: {type(iron_intake)}')

Iron intake vector: [8.2 9.1 7.5]
Type: <class 'numpy.ndarray'>
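
To make the "same type" idea concrete, here is a minimal sketch that reuses the iron values from the cell above. The names `iron_in_micrograms`, `mean_intake`, and `mixed` are illustrative and not part of the original notebook.

```python
import numpy as np

# Same values as the iron_intake vector above (mg).
iron_intake = np.array([8.2, 9.1, 7.5])

# A NumPy array stores one dtype for all of its elements.
print(iron_intake.dtype)

# Arithmetic applies to every element at once (vectorized).
iron_in_micrograms = iron_intake * 1000  # mg -> micrograms
mean_intake = iron_intake.mean()
print(iron_in_micrograms)
print(round(float(mean_intake), 2))

# A plain list may mix types, so it is not guaranteed to behave like a vector.
mixed = [8.2, "9.1", None]
print({type(v).__name__ for v in mixed})
```

Because every element shares one type, operations such as unit conversion or averaging work on the whole vector without a loop.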


## 📋 Tables: Organizing Vectors

A **table** combines multiple vectors into a grid, with rows and columns. In nutrition, a table might store participant data, where each row is a person and each column is a measurement (like iron intake or age).

In **pandas**, tables are called **DataFrames**. They’re like Excel spreadsheets but supercharged for coding. Let’s create a simple table:

In [3]:
df = pd.DataFrame({
    'Participant': ['P1', 'P2', 'P3'],
    'Iron_2020': [8.2, 9.1, 7.5],
    'Iron_2021': [8.5, 9.3, 7.8]
})
df

|   | Participant | Iron_2020 | Iron_2021 |
|---|-------------|-----------|-----------|
| 0 | P1          | 8.2       | 8.5       |
| 1 | P2          | 9.1       | 9.3       |
| 2 | P3          | 7.5       | 7.8       |
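
The idea that a table is a collection of vectors can be checked directly. The sketch below rebuilds the same DataFrame and pulls one column back out; the variable names `col` and `as_array` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Participant': ['P1', 'P2', 'P3'],
    'Iron_2020': [8.2, 9.1, 7.5],
    'Iron_2021': [8.5, 9.3, 7.8]
})

# Selecting one column returns a pandas Series, which is a labeled vector.
col = df['Iron_2020']
print(type(col).__name__)

# The underlying values can be viewed as a NumPy array.
as_array = col.to_numpy()
print(as_array)

# shape is (number of rows, number of columns).
print(df.shape)

# Each column has its own data type.
print(df.dtypes)
```

Each column carries its own type, which is why a DataFrame can hold text (`Participant`) next to numbers (`Iron_2020`).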


## 🧼 Tidy Data: The Golden Rule

**Tidy data** is a way to organize tables so they’re easy to analyze. The rules are simple:
1. Each **variable** is a column.
2. Each **observation** is a row.
3. Each **value** has its own cell.

Our table above isn’t tidy! Why? Because the column names `Iron_2020` and `Iron_2021` hide a second variable (the year) inside their labels, and the values of a single variable (iron intake) are split across two columns. Instead, we want one column for `Year` and one for `Iron`.

This brings us to **wide vs. long formats**.

<details>
<summary>🔍 Learn More: Tidy Data</summary>
Read Hadley Wickham’s [Tidy Data paper](https://www.jstatsoft.org/article/view/v059i10) for a deeper dive. It’s a classic in data science!
</details>

## 📏 Wide vs. Long Formats

- **Wide format**: One row per participant, with the values of a single variable spread across several columns (like `Iron_2020`, `Iron_2021`).
- **Long format**: One row per observation, with a single column for the values (like `Iron`) and another column saying which category each value belongs to (like `Year`).

Wide is great for reading or reporting, but long is better for analysis because it’s tidy. Let’s convert our table to long format.

<details>
<summary>🛠️ Advanced: When to Use Wide</summary>
Wide formats are useful for pivot tables or heatmaps. You’ll see them in dashboard apps or when presenting summary stats.
</details>
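
As a rough illustration of why long format suits analysis, the sketch below computes the mean iron intake per year in both layouts. The long table is typed out by hand here, and the names `wide_df`, `long_df`, `wide_means`, and `long_means` are illustrative.

```python
import pandas as pd

# Wide layout: one row per participant, one column per year.
wide_df = pd.DataFrame({
    'Participant': ['P1', 'P2', 'P3'],
    'Iron_2020': [8.2, 9.1, 7.5],
    'Iron_2021': [8.5, 9.3, 7.8],
})

# Long layout: the same six measurements, one row per observation.
long_df = pd.DataFrame({
    'Participant': ['P1', 'P2', 'P3', 'P1', 'P2', 'P3'],
    'Year': ['2020', '2020', '2020', '2021', '2021', '2021'],
    'Iron': [8.2, 9.1, 7.5, 8.5, 9.3, 7.8],
})

# Wide: the year lives in the column names, so each column must be listed.
wide_means = wide_df[['Iron_2020', 'Iron_2021']].mean()

# Long: the year is data, so one groupby covers any number of years.
long_means = long_df.groupby('Year')['Iron'].mean()

print(wide_means.round(2))
print(long_means.round(2))
```

Both approaches give the same means, but the long version keeps working unchanged if more years are added, while the wide version needs its column list updated.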

## 🧪 Exercise: Convert Wide to Long Format

Let’s make our data tidy by converting it from wide to long format using pandas’ `melt` function. Follow the code below, then try it yourself!

**Task**: Convert the DataFrame to long format, so we have columns: `Participant`, `Year`, and `Iron`.

In [4]:
df_melted = df.melt(id_vars='Participant', 
                    value_vars=['Iron_2020', 'Iron_2021'],
                    var_name='Year', 
                    value_name='Iron')
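# Clean up the Year column: keep only the digits (e.g. 'Iron_2020' -> '2020').
# expand=False returns a Series rather than a one-column DataFrame.
df_melted['Year'] = df_melted['Year'].str.extract(r'(\d+)', expand=False)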
df_melted

|   | Participant | Year | Iron |
|---|-------------|------|------|
| 0 | P1          | 2020 | 8.2  |
| 1 | P2          | 2020 | 9.1  |
| 2 | P3          | 2020 | 7.5  |
| 3 | P1          | 2021 | 8.5  |
| 4 | P2          | 2021 | 9.3  |
| 5 | P3          | 2021 | 7.8  |


**What happened?**
- `id_vars`: Columns to keep as is (`Participant`).
- `value_vars`: Columns to melt into one (`Iron_2020`, `Iron_2021`).
- `var_name`: Name of the new column for categories (`Year`).
- `value_name`: Name of the new column for values (`Iron`).
- We used `str.extract` to keep just the year digits from each label (the result is still stored as text, not as a number).

Now our data is **tidy**! Each row is an observation (one participant’s iron intake in a specific year), and each variable (`Participant`, `Year`, `Iron`) has its own column.
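
After a reshape it can help to confirm that nothing was lost or duplicated. The checks below are one possible set of sanity tests, not a pandas convention; the names `expected_rows`, `has_duplicates`, and `totals_match` are illustrative.

```python
import math
import pandas as pd

df = pd.DataFrame({
    'Participant': ['P1', 'P2', 'P3'],
    'Iron_2020': [8.2, 9.1, 7.5],
    'Iron_2021': [8.5, 9.3, 7.8]
})

df_melted = df.melt(id_vars='Participant',
                    value_vars=['Iron_2020', 'Iron_2021'],
                    var_name='Year',
                    value_name='Iron')
df_melted['Year'] = df_melted['Year'].str.extract(r'(\d+)', expand=False)

# 3 participants x 2 year columns should give 6 rows.
expected_rows = len(df) * 2
print(df_melted.shape, expected_rows)

# Each Participant/Year pair should appear exactly once.
has_duplicates = df_melted.duplicated(subset=['Participant', 'Year']).any()
print(has_duplicates)

# The total of all iron values should be unchanged by the reshape.
totals_match = math.isclose(
    df_melted['Iron'].sum(),
    df[['Iron_2020', 'Iron_2021']].sum().sum(),
)
print(totals_match)
```

If any of these checks fails on a real dataset, the usual suspects are duplicated participant IDs or a column missing from `value_vars`.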

## 🎯 Your Turn: Try It Yourself!

Here’s a new dataset with calcium intakes. Convert it from wide to long format.

**Task**: Create a tidy DataFrame with columns `Participant`, `Year`, and `Calcium`.

In [5]:
df_calcium = pd.DataFrame({
    'Participant': ['A1', 'A2', 'A3'],
    'Calcium_2019': [1200, 1100, 1300],
    'Calcium_2020': [1150, 1050, 1250]
})

# Your code here:
# df_calcium_melted = ...

## 📈 Bonus: Visualize the Difference

Let’s plot our tidy iron data to see why long format is great for analysis. We’ll make a line plot of iron intake over time for each participant.

In [6]:
plt.figure(figsize=(8, 5))
for participant in df_melted['Participant'].unique():
    subset = df_melted[df_melted['Participant'] == participant]
    plt.plot(subset['Year'], subset['Iron'], marker='o', label=participant)
plt.xlabel('Year')
plt.ylabel('Iron Intake (mg)')
plt.title('Iron Intake Over Time')
plt.legend()
plt.show()

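[Figure: line plot titled "Iron Intake Over Time", with Year on the x-axis, Iron Intake (mg) on the y-axis, and one line per participant]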

## 🎉 Wrap-Up

You’ve learned:
- **Vectors** are sequences of values of the same type, like a single column of data.
- **Tables** organize vectors into rows and columns.
- **Tidy data** makes analysis easier with one variable per column and one observation per row.
- **Wide vs. Long**: Long format is tidy and great for coding, while wide is better for reading.

**What’s next?** In the next notebook, we’ll load real datasets and start exploring them!

**Resources**:
- [Pandas Melt Documentation](https://pandas.pydata.org/docs/reference/api/pandas.melt.html)
- [Tidy Data in Python](https://www.dataschool.io/tidy-data-python-pandas/)
- Try this notebook on [Google Colab](https://colab.research.google.com/) or fork it from our [GitHub repo](https://github.com/your-repo-link)!

<details>
<summary>🏆 Challenge: Go Deeper</summary>
Try converting the tidy data back to wide format using `pivot` or `pivot_table`. Hint: Check the [pandas pivot docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html).
</details>