# 📊 4.2 Exploratory Data Analysis

This notebook introduces exploratory data analysis (EDA) techniques to uncover patterns in nutrition datasets.

**Objectives**:
- Summarise data with descriptive statistics.
- Visualise distributions and relationships.
- Apply EDA to `vitamin_trial.csv`.

**Context**: EDA is the first step in working with nutrition data such as vitamin D trial outcomes: it shows how values are distributed and how groups differ before any formal analysis.

<details><summary>Fun Fact</summary>
EDA is like a hippo sniffing out the best snacks—exploring leads to discoveries! 🦛
</details>

In [None]:
# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

# Define the module and dataset for this notebook
MODULE = '04_data_analysis'  # e.g., '01_infrastructure'
DATASET = 'vitamin_trial.csv'  # e.g., 'hippo_diets.csv'
BASE_PATH = '/content/data-analysis-toolkit-FNS'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

# Step 1: Attempt to clone the repository (automatic method)
# Note: If you encounter a cloning error (e.g., 'fatal: destination path already exists'),
#       reset the runtime (Runtime > Restart runtime) and run this cell again.
try:
    print('Attempting to clone repository...')
    if os.path.exists(BASE_PATH):
        print('Repository already exists, skipping clone.')
    else:
        !git clone https://github.com/ggkuhnle/data-analysis-toolkit-FNS.git
    
    # Debug: Print directory structure
    print('Listing repository contents:')
    !ls {BASE_PATH}
    print('Listing notebooks directory contents:')
    !ls {BASE_PATH}/notebooks
    
    # Check if the module directory exists
    if not os.path.exists(MODULE_PATH):
        raise FileNotFoundError(f'Module directory {MODULE_PATH} not found. Check the repository structure.')
    
    # Set working directory to the notebook's folder
    os.chdir(MODULE_PATH)
    
    # Verify dataset is accessible
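    if os.path.exists(DATASET_PATH):
        print(f'Dataset found: {DATASET_PATH} 🦛')
    else:
        # Raise with a message so the fallback below reports what went wrong
        raise FileNotFoundError(f'Dataset {DATASET} not found after cloning.')
except Exception as e:
    # Covers clone failures as well as a missing module directory or dataset
    print(f'Automatic setup failed: {e}')
    print('Falling back to manual upload option...')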

    # Step 2: Manual upload option
    print(f'Please upload {DATASET} manually.')
    print('1. Click the "Choose Files" button below.')
    print(f'2. Select {DATASET} from your local machine.')
    print(f'3. Ensure the file is placed in notebooks/{MODULE}/data/')
    
    # Create the data directory if it doesn't exist
    os.makedirs('data', exist_ok=True)
    
    # Prompt user to upload the dataset
    uploaded = files.upload()
    
    # Check if the dataset was uploaded
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} to {DATASET_PATH} 🦛')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')

# Install required packages for this notebook
%pip install pandas numpy
print('Python environment ready.')

In [1]:
# Install required packages (ensures compatibility in Colab)
%pip install pandas numpy matplotlib seaborn
import pandas as pd  # For data manipulation
import numpy as np  # For numerical operations
import matplotlib.pyplot as plt  # For plotting
import seaborn as sns  # For enhanced visualizations
print('EDA environment ready.')

EDA environment ready.


## Data Preparation

Load `vitamin_trial.csv`, a dataset of vitamin D trial outcomes, and inspect its structure.

In [2]:
# Load the dataset
df = pd.read_csv('data/vitamin_trial.csv')  # Path relative to notebook

# Display descriptive statistics
print(df[['Vitamin_D']].describe())  # Summarise Vitamin_D column

       Vitamin_D
count  200.000000
mean    12.750000
std      2.950000
min      9.500000
25%     10.200000
50%     12.750000
75%     15.300000
max     16.200000
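
The cell above summarises a single column. To inspect the structure of the whole table, a minimal sketch might look like the following. It uses a small made-up frame (`toy`) instead of the real file so that it runs anywhere. The column names `Group`, `Vitamin_D`, `Time` and `Outcome` are the ones used elsewhere in this notebook, but the values, group labels and data types are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for vitamin_trial.csv; every value here is invented.
toy = pd.DataFrame({
    'Group': ['Control', 'Control', 'Treatment', 'Treatment'],
    'Vitamin_D': [10.0, np.nan, 15.5, 16.0],
    'Time': [0, 1, 0, 1],
    'Outcome': ['Normal', 'Normal', 'Improved', 'Improved'],
})

print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # data type of each column
print(toy.head())   # first few rows

# Count missing values per column
missing = toy.isna().sum()
print(missing)
```

With the real data, the same calls could be applied to `df` in place of `toy`.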


## Visualizing Distributions

Create a histogram of Vitamin D levels, coloured by `Group`, with a kernel density estimate (KDE) overlaid on each group.

In [3]:
# Plot histogram
sns.histplot(data=df, x='Vitamin_D', hue='Group', bins=20, kde=True)
plt.xlabel('Vitamin D (µg)')  # Label x-axis
plt.ylabel('Count')  # Label y-axis
plt.title('Distribution of Vitamin D by Group')  # Plot title
plt.grid(True, alpha=0.3)  # Add light grid
plt.show()  # Display plot
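
A histogram split by group pairs naturally with per-group summary statistics. The sketch below shows one way to compute them with `groupby`. It uses an invented toy frame: the `Group` labels and `Vitamin_D` values are assumptions, not taken from `vitamin_trial.csv`.

```python
import pandas as pd

# Toy data; group labels and values are invented for illustration.
toy = pd.DataFrame({
    'Group': ['Control', 'Control', 'Control',
              'Treatment', 'Treatment', 'Treatment'],
    'Vitamin_D': [9.5, 10.0, 10.5, 15.0, 15.5, 16.0],
})

# One row per group, one column per statistic
group_stats = toy.groupby('Group')['Vitamin_D'].agg(['count', 'mean', 'median', 'std'])
print(group_stats)
```

Comparing the mean and median within each group gives a quick numerical check on any skew visible in the histogram.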

## Exploring Relationships

Create a boxplot to compare Vitamin D levels by `Group` and `Outcome`.

In [4]:
# Plot boxplot
sns.boxplot(data=df, x='Group', y='Vitamin_D', hue='Outcome')
plt.xlabel('Group')  # Label x-axis
plt.ylabel('Vitamin D (µg)')  # Label y-axis
plt.title('Vitamin D Levels by Group and Outcome')  # Plot title
plt.show()  # Display plot
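
To read a boxplot it helps to see the numbers behind it: the box spans the first and third quartiles, the line inside marks the median, and by default seaborn draws points beyond 1.5 × IQR from the box as outliers. The sketch below computes these quantities per group. The helper `box_summary` is a hypothetical name introduced here, and the toy values are invented; it groups by `Group` only to keep the example short.

```python
import pandas as pd

# Toy data; group labels and values are invented for illustration.
toy = pd.DataFrame({
    'Group': ['Control'] * 5 + ['Treatment'] * 5,
    'Vitamin_D': [9.0, 10.0, 11.0, 12.0, 20.0,
                  14.0, 15.0, 16.0, 17.0, 18.0],
})

def box_summary(values: pd.Series) -> pd.Series:
    """Quartiles, IQR and count of points outside the 1.5 * IQR fences."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((values < lower) | (values > upper)).sum())
    return pd.Series({'q1': q1, 'median': median, 'q3': q3,
                      'iqr': iqr, 'n_outliers': n_outliers})

# Build one summary row per group
summary = pd.DataFrame({
    name: box_summary(values)
    for name, values in toy.groupby('Group')['Vitamin_D']
}).T
print(summary)
```

In this toy example the value 20.0 in the `Control` group lies above the upper fence, so it is counted as an outlier.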

## Exercise 1: Perform EDA

Create a scatter plot of `Vitamin_D` vs. `Time` colored by `Group`. Describe patterns in a Markdown cell.

**Guidance**: Use `sns.scatterplot()` with `x='Time'`, `y='Vitamin_D'`, `hue='Group'`.

**Answer**:

My scatter plot code and observations are...

## Conclusion

You’ve applied EDA to explore nutrition data, uncovering patterns through visualizations.

**Next Steps**: Explore correlation analysis in 4.3.

**Resources**:
- [Seaborn Documentation](https://seaborn.pydata.org/)
- [Pandas EDA](https://pandas.pydata.org/docs/user_guide/basics.html#descriptive-statistics)
- Repository: [github.com/ggkuhnle/data-analysis-toolkit-FNS](https://github.com/ggkuhnle/data-analysis-toolkit-FNS)