# 🧹 3.3 Data Cleaning

This notebook covers data cleaning techniques to prepare nutrition datasets for analysis.

**Objectives**:
- Handle missing values and duplicates.
- Validate data consistency.
- Clean `hippo_nutrients.csv` for analysis.

**Context**: Clean data is essential for reliable nutrition research, such as analyses of the National Diet and Nutrition Survey (NDNS).

<details><summary>Fun Fact</summary>
Cleaning data is like a hippo tidying its pond—clear water, clear insights! 🦛
</details>

In [None]:
# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

# Define the module and dataset for this notebook
MODULE = '03_data_handling'
DATASET = 'hippo_nutrients.csv'
DATASET_PATH = os.path.join('data', DATASET)

# Step 1: Attempt to clone the repository (automatic method)
try:
    print('Attempting to clone repository...')
    !git clone https://github.com/ggkuhnle/data-analysis-toolkit-FNS.git
    os.chdir(f'/content/data-analysis-toolkit-FNS/notebooks/{MODULE}')
    if os.path.exists(DATASET_PATH):
        print(f'Dataset found: {DATASET_PATH} 🦛')
    else:
        # Raise with a message so the except block below reports the actual cause
        raise FileNotFoundError(f'Dataset {DATASET} not found after cloning.')
except Exception as e:
    print(f'Cloning failed: {e}')
    print('Falling back to manual upload option...')

    # Step 2: Manual upload option
    print(f'Please upload {DATASET} manually.')
    print('1. Click the "Choose Files" button below.')
    print(f'2. Select {DATASET} from your local machine.')
    print(f'3. Ensure the file is placed in notebooks/{MODULE}/data/')
    
    # Create the data directory if it doesn't exist
    os.makedirs('data', exist_ok=True)
    
    # Prompt user to upload the dataset
    uploaded = files.upload()
    
    # Check if the dataset was uploaded
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} to {DATASET_PATH} 🦛')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')

# Install required packages for this notebook
%pip install pandas numpy
print('Python environment ready.')

In [1]:
# Install required packages (ensures compatibility in Colab)
# The comment sits on its own line so it is not passed to pip as arguments
%pip install pandas numpy
import pandas as pd  # For data manipulation
import numpy as np  # For numerical operations
print('Data cleaning environment ready.')

Data cleaning environment ready.


## Data Preparation

Load `hippo_nutrients.csv` and check for issues like missing values.

In [2]:
# Load the dataset
df = pd.read_csv('data/hippo_nutrients.csv')  # Path relative to notebook

# Check for missing values
print('Missing values:')
print(df.isnull().sum())  # Display count of missing values per column

Missing values:
ID          0
Nutrient    0
Year        0
Value       8
Age         0
Sex         0
dtype: int64
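
Raw counts are easier to judge alongside the share of rows affected and the groups in which the gaps fall. The block below is a minimal sketch on a small toy DataFrame; the values are made up for illustration and are not taken from `hippo_nutrients.csv`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from hippo_nutrients.csv)
toy = pd.DataFrame({
    'Nutrient': ['Iron', 'Iron', 'Calcium', 'Calcium'],
    'Value': [8.0, np.nan, 1200.0, 1100.0],
})

# Fraction of missing entries per column (mean of a boolean mask)
missing_share = toy.isnull().mean()

# Number of missing Value entries within each nutrient
missing_by_nutrient = toy['Value'].isnull().groupby(toy['Nutrient']).sum()

print(missing_share)
print(missing_by_nutrient)
```

If the gaps cluster in one nutrient, a group-wise fill is likely to be more sensible than a single overall fill value.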


## Handling Missing Values

Fill missing `Value` entries with the mean of the observed values for the same nutrient, so that each imputed value stays on that nutrient's scale.

In [3]:
# Group by nutrient and fill missing values with mean
df['Value'] = df.groupby('Nutrient')['Value'].transform(lambda x: x.fillna(x.mean()))

# Verify no missing values
print(f'Missing values after filling: {df["Value"].isnull().sum()}')  # Check Value column

Missing values after filling: 0
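
Group-wise mean filling has one edge case worth knowing: a nutrient with no observed values has no mean, so its gaps stay missing. The sketch below shows this on toy data (hypothetical values; the `Imputed` flag column is an addition for illustration, not part of the dataset) and records which rows were filled so they can be identified later.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); 'Zinc' has no observed value at all
toy = pd.DataFrame({
    'Nutrient': ['Iron', 'Iron', 'Iron', 'Calcium', 'Calcium', 'Zinc'],
    'Value': [8.0, 10.0, np.nan, 1000.0, np.nan, np.nan],
})

# Flag rows that were missing before filling (illustrative helper column)
toy['Imputed'] = toy['Value'].isnull()

# Mean of the observed values within each nutrient, aligned to every row
group_means = toy.groupby('Nutrient')['Value'].transform('mean')

# Fill only the gaps; observed values are left unchanged
toy['Value'] = toy['Value'].fillna(group_means)

print(toy)
print(f"Still missing: {toy['Value'].isnull().sum()}")
```

Keeping a flag like this makes it possible to rerun an analysis without the imputed rows and compare the results.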


## Removing Duplicates

Count rows that are identical in every column, then drop them, keeping the first occurrence of each.

In [4]:
# Check for duplicates
duplicates = df.duplicated().sum()  # Count duplicate rows
print(f'Duplicates: {duplicates}')

# Remove duplicates if any
df = df.drop_duplicates()  # Drop duplicate rows

Duplicates: 0
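
`df.duplicated()` only flags rows that match in every column. Two records for the same individual, nutrient and year with different values would pass unnoticed, which is where simple consistency checks help. The block below is a sketch on toy data; the choice of key columns, the rule that `Value` must be non-negative, and the allowed `Sex` codes `'F'` and `'M'` are assumptions for illustration, not documented properties of `hippo_nutrients.csv`.

```python
import pandas as pd

# Toy data (hypothetical values and codes)
toy = pd.DataFrame({
    'ID': ['H1', 'H1', 'H2', 'H3'],
    'Nutrient': ['Iron', 'Iron', 'Iron', 'Iron'],
    'Year': [2024, 2024, 2024, 2024],
    'Value': [8.0, 8.5, -1.0, 9.0],
    'Sex': ['F', 'F', 'M', 'x'],
})

# Rows that are identical in every column
full_dupes = toy.duplicated()

# Rows that share the assumed key columns, even if Value differs
key_cols = ['ID', 'Nutrient', 'Year']
key_dupes = toy.duplicated(subset=key_cols, keep=False)

# Consistency checks under the assumed rules
negative_values = toy['Value'] < 0
unexpected_sex = ~toy['Sex'].isin(['F', 'M'])

print(f'Fully duplicated rows: {full_dupes.sum()}')
print(f'Rows sharing a key: {key_dupes.sum()}')
print(f'Negative values: {negative_values.sum()}')
print(f'Unexpected Sex codes: {unexpected_sex.sum()}')
```

Flagged rows are best inspected before anything is dropped, because a key clash can be a genuine repeat measurement and not an error.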


## Exercise 1: Clean and Validate

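Filter for Calcium data, fill missing `Value` entries with the Calcium median, and check for duplicates. Document your code.

**Guidance**: Reload the CSV first, because the missing `Value` entries in `df` were already filled above. Filter with `calcium = df[df['Nutrient'] == 'Calcium'].copy()`, then fill with `calcium['Value'].fillna(calcium['Value'].median())` so that the median is computed from the Calcium rows only, not from all nutrients.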

**Answer**:

My cleaning code is...

## Conclusion

You’ve learned to clean nutrition data by handling missing values and duplicates.

**Next Steps**: Explore data transformation in 3.4.

**Resources**:
- [Pandas Data Cleaning](https://pandas.pydata.org/docs/user_guide/missing_data.html)
- [Data Cleaning Guide](https://www.datacamp.com/community/tutorials/data-preparation-with-pandas)
- Repository: [github.com/ggkuhnle/data-analysis-toolkit-FNS](https://github.com/ggkuhnle/data-analysis-toolkit-FNS)