# 🔄 3.4 Data Transformation

This notebook explores data transformation techniques to prepare nutrition datasets for analysis.

**Objectives**:
- Filter and group data for insights.
- Pivot data for alternative views.
- Transform `hippo_nutrients.csv` for analysis.

**Context**: Filtering, grouping, and pivoting turn raw nutrition records into comparable summaries, such as mean nutrient intake by sex or by year.

<details><summary>Fun Fact</summary>
Transforming data is like a hippo rearranging its snacks—same stuff, better view! 🦛
</details>

In [None]:
# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

# Define the module and dataset for this notebook
MODULE = '03_data_handling'  # e.g., '01_infrastructure'
DATASET = 'hippo_nutrients.csv'  # e.g., 'hippo_diets.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

# Step 1: Attempt to clone the repository (automatic method)
# Note: If you encounter a cloning error (e.g., 'fatal: destination path already exists'),
#       reset the runtime (Runtime > Restart runtime) and run this cell again.
try:
    print('Attempting to clone repository...')
    if os.path.exists(BASE_PATH):
        print('Repository already exists, skipping clone.')
    else:
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    
    # Debug: Print directory structure
    print('Listing repository contents:')
    !ls {BASE_PATH}
    print('Listing notebooks directory contents:')
    !ls {BASE_PATH}/notebooks
    
    # Check if the module directory exists
    if not os.path.exists(MODULE_PATH):
        raise FileNotFoundError(f'Module directory {MODULE_PATH} not found. Check the repository structure.')
    
    # Set working directory to the notebook's folder
    os.chdir(MODULE_PATH)
    
    # Verify dataset is accessible
    if os.path.exists(DATASET_PATH):
        print(f'Dataset found: {DATASET_PATH} 🦛')
    else:
        # Raise with a message so the except block reports what went wrong
        raise FileNotFoundError(f'Dataset {DATASET} not found after cloning.')
except Exception as e:
    print(f'Cloning failed: {e}')
    print('Falling back to manual upload option...')

    # Step 2: Manual upload option
    print(f'Please upload {DATASET} manually.')
    print('1. Click the "Choose Files" button below.')
    print(f'2. Select {DATASET} from your local machine.')
    print(f'3. Ensure the file is placed in notebooks/{MODULE}/data/')
    
    # Create the data directory if it doesn't exist
    os.makedirs('data', exist_ok=True)
    
    # Prompt user to upload the dataset
    uploaded = files.upload()
    
    # Check if the dataset was uploaded
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} to {DATASET_PATH} 🦛')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')

# Install required packages for this notebook
%pip install pandas numpy
print('Python environment ready.')

In [1]:
# Install pandas (ensures compatibility in Colab; a no-op if the setup cell already ran)
%pip install pandas
import pandas as pd  # For data manipulation
print('Data transformation environment ready.')

Data transformation environment ready.


## Data Preparation

Load `hippo_nutrients.csv` and inspect its structure.

In [2]:
# Load the dataset
df = pd.read_csv('data/hippo_nutrients.csv')  # Path relative to notebook
print(df.head(2))  # Display first two rows

   ID Nutrient  Year  Value  Age Sex
0  H1     Iron  2024    8.2   25   F
1  H1     Iron  2025    8.5   26   F


## Filtering Data

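Keep only the rows where `Sex` is `F` and `Nutrient` is `Iron`. Each condition goes in its own parentheses and the two are combined with `&`, because `&` binds more tightly than `==` in Python.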

In [3]:
# Filter for female hippos and Iron
df_female_iron = df[(df['Sex'] == 'F') & (df['Nutrient'] == 'Iron')]
print(df_female_iron.head(2))  # Display filtered data

   ID Nutrient  Year  Value  Age Sex
0  H1     Iron  2024    8.2   25   F
1  H1     Iron  2025    8.5   26   F
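
The same filter can also be written with `DataFrame.query`, which some readers find easier to scan. The sketch below is illustrative only: it uses a small hypothetical `sample` frame (made-up values, same column names as `hippo_nutrients.csv`) so that it runs without the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the values are invented for illustration
sample = pd.DataFrame({
    'ID': ['H1', 'H1', 'H2', 'H2'],
    'Nutrient': ['Iron', 'Calcium', 'Iron', 'Iron'],
    'Year': [2024, 2024, 2024, 2025],
    'Value': [8.2, 1200.0, 7.9, 8.1],
    'Age': [25, 25, 30, 31],
    'Sex': ['F', 'F', 'M', 'M'],
})

# Boolean-mask filter, in the same style as the cell above
mask_result = sample[(sample['Sex'] == 'F') & (sample['Nutrient'] == 'Iron')]

# Equivalent filter expressed as a query string
query_result = sample.query("Sex == 'F' and Nutrient == 'Iron'")

print(query_result)
# Both approaches select the same rows
print(mask_result.equals(query_result))
```

Either form works; the boolean mask is more flexible when conditions are built programmatically, while `query` keeps short filters compact.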


## Grouping Data

Group by `Nutrient` and calculate mean `Value`.

In [4]:
# Group by nutrient and compute mean
mean_values = df.groupby('Nutrient')['Value'].mean()
print(mean_values)  # Display mean values

Nutrient
Calcium     1150.0
Iron           8.0
Vitamin_D     10.5
Name: Value, dtype: float64
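
A mean on its own can hide skew or small group sizes, so it is often worth computing several statistics at once. As a minimal sketch, assuming a hypothetical `sample` frame with invented values, `agg` accepts a list of function names:

```python
import pandas as pd

# Hypothetical sample rows; the values are invented for illustration
sample = pd.DataFrame({
    'Nutrient': ['Iron', 'Iron', 'Calcium', 'Calcium'],
    'Value': [8.0, 9.0, 1100.0, 1200.0],
    'Sex': ['F', 'M', 'F', 'M'],
})

# One row per nutrient, one column per summary statistic
summary = sample.groupby('Nutrient')['Value'].agg(['mean', 'median', 'count'])
print(summary)
```

The `count` column is a quick check that each group contains enough rows for its mean to be meaningful.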


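## Pivoting Data

Pivot the data so that each row is a `Nutrient` and each column is a `Year`. When several rows share the same `Nutrient` and `Year`, `pivot_table` combines them with `aggfunc`, here the mean of `Value`.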

In [5]:
# Pivot data
df_pivot = df.pivot_table(values='Value', index='Nutrient', columns='Year', aggfunc='mean')
print(df_pivot)  # Display pivoted data

Year      2024  2025
Nutrient            
Calcium   1150  1140
Iron         8     8
Vitamin_D   10    11
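
To make the aggregation step concrete, the sketch below pivots a small hypothetical `sample` frame (invented values) in which two hippos report Iron in 2024, then uses `melt` to return the wide table to long format. The column name `MeanValue` is an illustrative choice, not something defined in this notebook.

```python
import pandas as pd

# Hypothetical sample rows; the values are invented for illustration
sample = pd.DataFrame({
    'ID': ['H1', 'H2', 'H1', 'H1', 'H1'],
    'Nutrient': ['Iron', 'Iron', 'Iron', 'Calcium', 'Calcium'],
    'Year': [2024, 2024, 2025, 2024, 2025],
    'Value': [8.0, 9.0, 8.5, 1100.0, 1200.0],
})

# Long to wide: the two Iron rows for 2024 are averaged into one cell
wide = sample.pivot_table(values='Value', index='Nutrient',
                          columns='Year', aggfunc='mean')
print(wide)

# Wide back to long: one row per (Nutrient, Year) pair
long = wide.reset_index().melt(id_vars='Nutrient', var_name='Year',
                               value_name='MeanValue')
print(long)
```

Note that the round trip does not recover the original rows: once values have been averaged, the per-hippo detail is gone.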


## Exercise 1: Transform Data

Filter for Vitamin_D data in 2024, group by `Sex`, and compute median `Value`. Document your code.

**Guidance**: Use `df[(df['Nutrient'] == 'Vitamin_D') & (df['Year'] == 2024)]` and `groupby('Sex')['Value'].median()`.

**Answer**:

My transformation code is...

## Conclusion

You’ve learned to transform nutrition data through filtering, grouping, and pivoting.

**Next Steps**: Explore data aggregation in 3.5.

**Resources**:
- [Pandas GroupBy](https://pandas.pydata.org/docs/user_guide/groupby.html)
- [Pandas Reshaping and Pivot Tables](https://pandas.pydata.org/docs/user_guide/reshaping.html)
- Repository: [github.com/ggkuhnle/data-analysis-projects](https://github.com/ggkuhnle/data-analysis-projects)