# 📊 3.1 What Are Data? Types and Structures

This notebook introduces foundational concepts in how data are structured.

We will cover:
- What **vectors**, **tables**, and **tidy datasets** are
- The difference between **wide** and **long** formats
- How these formats appear in tools like **Excel**, **R**, and **pandas**

Understanding data structure is essential for cleaning, transforming, and analysing data efficiently.

## 🔢 What Are Data?

**Data** are recorded representations of information, such as measurements or observations. For analysis they are typically organised in a tabular format.

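- **Vector**: A single ordered sequence of values, usually of one type, e.g. `[1, 2, 3, 4]`
- **Table (DataFrame)**: Values arranged in rows and columns, like an Excel sheet; each column is itself a vector
- **Tidy data**: A table in which each row is an observation, each column is a variable, and each cell holds a single value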
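
The three ideas above can be sketched in pandas as follows. This is a minimal illustration, assuming a `Series` stands in for a vector and a `DataFrame` for a table; the variable names `weights` and `table` are made up for this example, and the values are the weights used later in this notebook.

```python
import pandas as pd

# A vector: one ordered sequence of values
weights = pd.Series([70, 71, 80, 79], name="Weight")

# A table: several equal-length vectors stored side by side as columns.
# This one is also tidy: one observation per row, one variable per column.
table = pd.DataFrame({
    "Subject": ["A", "A", "B", "B"],
    "Day": ["Day1", "Day2", "Day1", "Day2"],
    "Weight": weights,
})

# Selecting a single column of a table gives back a vector (a Series)
print(type(table["Weight"]).__name__)
print(table.shape)  # (rows, columns)
```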

### 📐 Wide vs Long Format

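- **Wide format**: Each subject occupies one row, and repeated measurements are spread across columns (e.g. one column per timepoint), giving more columns and fewer rows
- **Long format**: Each measurement occupies its own row, with one column identifying the timepoint and another holding the value, giving fewer columns and more rows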

**Example (Wide):**

| Subject | Weight_Day1 | Weight_Day2 |
|---------|-------------|-------------|
| A       | 70          | 71          |
| B       | 80          | 79          |

**Same data (Long):**

| Subject | Day   | Weight |
|---------|-------|--------|
| A       | Day1  | 70     |
| A       | Day2  | 71     |
| B       | Day1  | 80     |
| B       | Day2  | 79     |

In [None]:
# Example using pandas to reshape from wide to long format
import pandas as pd

df_wide = pd.DataFrame({
    'Subject': ['A', 'B'],
    'Weight_Day1': [70, 80],
    'Weight_Day2': [71, 79]
})

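# Stack the Weight_* columns into (Day, Weight) pairs, one row per measurement
df_long = pd.melt(df_wide, id_vars='Subject',
                  var_name='Day', value_name='Weight')
# Strip the literal 'Weight_' prefix so Day holds 'Day1' / 'Day2'
df_long['Day'] = df_long['Day'].str.replace('Weight_', '', regex=False)
# melt stacks column by column; sort so each subject's rows sit together
df_long = df_long.sort_values(['Subject', 'Day']).reset_index(drop=True)
df_long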
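
The reshape also works in the other direction. The sketch below goes from long back to wide using `DataFrame.pivot`; it rebuilds the long table by hand so that it runs on its own, and the name `df_wide_again` is made up for this example.

```python
import pandas as pd

# Long-format table, matching the "Same data (Long)" example above
df_long = pd.DataFrame({
    "Subject": ["A", "A", "B", "B"],
    "Day": ["Day1", "Day2", "Day1", "Day2"],
    "Weight": [70, 71, 80, 79],
})

# One row per Subject, one column per Day, cells filled from Weight
df_wide_again = df_long.pivot(index="Subject", columns="Day", values="Weight")

# Restore the original column names and turn Subject back into a column
df_wide_again.columns = [f"Weight_{day}" for day in df_wide_again.columns]
df_wide_again = df_wide_again.reset_index()
df_wide_again
```

Note that `pivot` requires each (`Subject`, `Day`) pair to appear only once; with duplicate pairs it raises an error, and an aggregating reshape such as `pivot_table` would be needed instead.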

## 🧪 Exercise: Convert Wide to Long

Try this yourself with a new example:
- Create a wide-format dataset showing fruit consumption across days.
- Use `pd.melt()` to convert it to long format.
- Name the resulting columns so that each one describes a single variable (see the `var_name` and `value_name` arguments).
- Add a comment on why long format is helpful for plotting or analysis.

## 🧠 Advanced: Why Tidy Format Matters
<details><summary>Click to expand</summary>

**Tidy format** simplifies operations like:

- Grouping and summarising (e.g. with `groupby`)
- Filtering and faceting plots
- Fitting regression models (where variables must be clearly identified)

Many R and Python libraries, such as ggplot2 and seaborn, work most naturally with long/tidy data as input.
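
As a rough illustration of the grouping and filtering operations listed above, the sketch below uses the long-format weight table from earlier in this notebook. The names `mean_by_subject` and `day2` are made up for this example.

```python
import pandas as pd

# Long-format table: one row per (Subject, Day) measurement
df_long = pd.DataFrame({
    "Subject": ["A", "A", "B", "B"],
    "Day": ["Day1", "Day2", "Day1", "Day2"],
    "Weight": [70, 71, 80, 79],
})

# Grouping and summarising: mean weight per subject in a single call,
# regardless of how many days were recorded
mean_by_subject = df_long.groupby("Subject")["Weight"].mean()

# Filtering: selecting one timepoint is an ordinary row filter on Day
day2 = df_long[df_long["Day"] == "Day2"]

print(mean_by_subject)
print(day2)
```

With the wide table, the same summary would need the `Weight_Day1` and `Weight_Day2` columns to be named explicitly, and the code would have to change whenever a new day is added.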

</details>