# 02 Data Preprocessing

## Overview:
This notebook handles data preprocessing for the **Animal Subspecies Classification** project. It includes:
1. **Cleaning**: Removing corrupted and non-image files.
2. **Standardizing**: Resizing all images to a consistent size.
3. **Splitting**: Dividing the dataset into training, validation, and testing subsets.

## Goals:
- Ensure the dataset is clean, standardized, and ready for model training.
- Organize the data into folders for easy loading during model development.

## Outputs:
- Cleaned and standardized dataset, organized into training, validation, and testing subsets.

In [6]:
# General imports
import os
import shutil
from hashlib import md5

# Image processing
from PIL import Image

# Data splitting
from sklearn.model_selection import train_test_split

# Data Cleaning

## Purpose:
This block cleans the raw dataset by removing corrupted or non-image files.

## How It Works:
- Each file is opened with `PIL.Image.open` and checked with `verify()`; any file that raises an exception is deleted from disk.
- A log line is printed for every removed file, including the exception message that explains why.

## Output:
The `data/raw` folder, cleaned in place and ready for further processing.
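
The validity check described above can be tried in isolation. The following is a minimal sketch, not the notebook's own cell: the helper name `is_valid_image` and the temporary file names are hypothetical, and nothing is deleted.

```python
import os
import tempfile

from PIL import Image


def is_valid_image(path):
    """Return True if Pillow can open and verify the file at `path`."""
    try:
        with Image.open(path) as img:
            img.verify()  # raises if the file is not a readable image
        return True
    except Exception:
        return False


# Build one real image and one text file in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.png")
    bad = os.path.join(tmp, "notes.txt")
    Image.new("RGB", (8, 8), color="white").save(good)
    with open(bad, "w") as fh:
        fh.write("this is not an image")

    results = {"good": is_valid_image(good), "bad": is_valid_image(bad)}
    print(results)
```

Note that `verify()` does not decode the full pixel data, so some truncated files may still pass; fully loading the image in a fresh `Image.open` call would be a stricter (and slower) check.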

In [7]:
# Data Cleaning
def clean_data(source_dir):
    """
    Removes corrupted or non-image files.
    """
    for class_name in os.listdir(source_dir):
        class_path = os.path.join(source_dir, class_name)
        if not os.path.isdir(class_path):
            continue

        for img_name in os.listdir(class_path):
            img_path = os.path.join(class_path, img_name)
            try:
                # Verify image can be opened
                with Image.open(img_path) as img:
                    img.verify()  # Check if it's a valid image
            except Exception as e:
                print(f"Removing corrupted or non-image file: {img_path} ({e})")
                os.remove(img_path)

# Run data cleaning
print("Cleaning raw data...")
clean_data("data/raw")
print("Data cleaning complete!")

Cleaning raw data...
Data cleaning complete!


# Image Standardization

## Purpose:
This block resizes all images to a consistent size (`224x224`) for compatibility with neural network models.

## How It Works:
- Each image in the `data/raw` folder is opened, resized to the target dimensions (without preserving the aspect ratio), and saved under the same file name in the `data/processed` folder.
- This step ensures that all input images are standardized, reducing the preprocessing overhead during model training.

## Output:
Standardized images saved in the `data/processed` folder.
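
The effect of the resize step on a single image can be sketched as follows. The helper `standardize` is hypothetical, and the conversion to RGB is an assumption that the notebook's own function does not make; it is included here because formats such as JPEG cannot store an alpha channel.

```python
from PIL import Image


def standardize(img, image_size=(224, 224)):
    """Convert to RGB and resize to `image_size` (aspect ratio is not preserved)."""
    return img.convert("RGB").resize(image_size)


# A wide, semi-transparent test image created in memory
wide = Image.new("RGBA", (640, 360), color=(255, 0, 0, 128))
out = standardize(wide)
print(out.size, out.mode)
```

Because `resize` stretches the image to the requested size, a 16:9 input comes out square. Whether that distortion is acceptable depends on the model; padding or cropping first would be an alternative.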

In [8]:
# Function to standardize images (resize to 224x224)
def standardize_images(source_dir, target_dir, image_size=(224, 224)):
    for class_name in os.listdir(source_dir):
        class_path = os.path.join(source_dir, class_name)
        if not os.path.isdir(class_path):
            continue
        target_class_dir = os.path.join(target_dir, class_name)
        os.makedirs(target_class_dir, exist_ok=True)

        for img_name in os.listdir(class_path):
            img_path = os.path.join(class_path, img_name)
            try:
                with Image.open(img_path) as img:
                    img = img.resize(image_size)
                    img.save(os.path.join(target_class_dir, img_name))
            except Exception as e:
                print(f"Error processing {img_path}: {e}")

# Standardize images
print("Standardizing images...")
os.makedirs("data/processed", exist_ok=True)
standardize_images("data/raw", "data/processed")
print("Image standardization complete!")

Standardizing images...
Image standardization complete!


# Data Splitting

## Purpose:
This block splits the standardized dataset into three subsets:
1. **Training Set (70%)**: Used to train the neural network models.
2. **Validation Set (20%)**: Used to tune hyperparameters and monitor overfitting.
3. **Testing Set (10%)**: Used to evaluate the model's final performance.

## How It Works:
- Images are grouped by class.
- Each class is split into training, validation, and testing subsets using `train_test_split` from `sklearn`.
- Because every class is split separately, each subset keeps roughly the same class proportions as the full dataset; the `stratify` argument is not used.
- Images are copied (not moved) into their respective folders (`data/train`, `data/val`, `data/test`), so `data/processed` stays intact.

## Output:
1. Training data in `data/train`.
2. Validation data in `data/val`.
3. Testing data in `data/test`.
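
The arithmetic behind a two-stage 70/20/10 split can be sketched as follows. This is a minimal illustration rather than the notebook's `split_data`: the helper `three_way_split` and the file names are hypothetical, and it passes integer counts to `train_test_split` to sidestep floating-point rounding of fractional sizes.

```python
from sklearn.model_selection import train_test_split


def three_way_split(items, train_ratio=0.7, val_ratio=0.2, seed=42):
    """Split `items` into train/val/test lists using two calls to train_test_split."""
    n = len(items)
    n_train = round(n * train_ratio)
    n_val = round(n * val_ratio)
    n_test = n - n_train - n_val  # whatever remains goes to the test set

    # Stage 1: carve off the training set
    train, temp = train_test_split(items, train_size=n_train, random_state=seed)
    # Stage 2: divide the remainder; the test set is n_test of the n_val + n_test left
    val, test = train_test_split(temp, test_size=n_test, random_state=seed)
    return train, val, test


files = [f"img_{i}.jpg" for i in range(100)]
train, val, test = three_way_split(files)
print(len(train), len(val), len(test))
```

The point to notice is the second stage: the test set is 10% of the whole dataset but one third of the 30% left after training data is removed, so the fraction passed to the second split must be computed relative to the remainder.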

In [9]:
# Function to split data into train, val, test
def split_data(source_dir, train_dir, val_dir, test_dir, train_ratio=0.7, val_ratio=0.2):
    for class_name in os.listdir(source_dir):
        class_path = os.path.join(source_dir, class_name)
        if not os.path.isdir(class_path):
            continue

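        # Sort the listing so the seeded split is reproducible across runs
        images = sorted(os.listdir(class_path))
        train, temp = train_test_split(images, train_size=train_ratio, random_state=42)
        # The test share of the remainder is test_ratio / (1 - train_ratio), e.g. 0.1 / 0.3
        test_fraction = (1 - train_ratio - val_ratio) / (1 - train_ratio)
        val, test = train_test_split(temp, test_size=test_fraction, random_state=42)

        # Copy images to train, val, test folders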
        for img_name in train:
            os.makedirs(os.path.join(train_dir, class_name), exist_ok=True)
            shutil.copy(os.path.join(class_path, img_name), os.path.join(train_dir, class_name, img_name))

        for img_name in val:
            os.makedirs(os.path.join(val_dir, class_name), exist_ok=True)
            shutil.copy(os.path.join(class_path, img_name), os.path.join(val_dir, class_name, img_name))

        for img_name in test:
            os.makedirs(os.path.join(test_dir, class_name), exist_ok=True)
            shutil.copy(os.path.join(class_path, img_name), os.path.join(test_dir, class_name, img_name))

# Split data into train, val, test
print("Splitting data into train, val, test...")
os.makedirs("data/train", exist_ok=True)
os.makedirs("data/val", exist_ok=True)
os.makedirs("data/test", exist_ok=True)
split_data("data/processed", "data/train", "data/val", "data/test")
print("Data splitting complete!")

Splitting data into train, val, test...
Data splitting complete!
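
Once the split has run, a quick per-class count is one way to confirm that each folder received files. The sketch below uses a hypothetical helper `count_images` and demonstrates it on a throwaway directory with made-up class names; pointing it at `data/train`, `data/val`, or `data/test` is left to the reader.

```python
import os
import tempfile


def count_images(split_dir):
    """Return {class_name: file_count} for one split directory."""
    counts = {}
    for class_name in sorted(os.listdir(split_dir)):
        class_path = os.path.join(split_dir, class_name)
        if os.path.isdir(class_path):
            counts[class_name] = len(os.listdir(class_path))
    return counts


# Demo on a temporary tree with hypothetical class names and empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for class_name, n_files in [("class_a", 3), ("class_b", 2)]:
        class_dir = os.path.join(tmp, "train", class_name)
        os.makedirs(class_dir)
        for i in range(n_files):
            open(os.path.join(class_dir, f"img_{i}.jpg"), "w").close()

    demo_counts = count_images(os.path.join(tmp, "train"))
    print(demo_counts)
```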
