# Datasets Exploration and Visualization

This notebook explores and visualizes the datasets used in our fine-grained image classification tasks. Dataset preparation itself is handled by dedicated Python modules in the `src/data` directory.

## Datasets Overview
1. CUB-200-2011: 200 bird species, 11,788 images
2. FGVC Aircraft: 100 aircraft variants, 10,000 images
3. Stanford Dogs: 120 dog breeds, 20,580 images
4. Stanford Cars: 196 car models, 16,185 images
5. Oxford 102 Flowers: 102 flower species, 8,189 images
6. NABirds: 555 visual categories (400 bird species), 48,562 images

Each dataset is organized in a standardized structure:
```
dataset/
├── train/
│   ├── class_1/
│   │   ├── image_00001.jpg
│   │   ├── image_00007.jpg
│   ├── class_2/
│   ...
├── test/
│   ├── class_1/
│   ├── class_2/
│   ...
└── validation/
    ├── class_1/
    ├── class_2/
    ...
```
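
Because every dataset shares this layout, per-class counts can be read straight from the directory tree. The sketch below illustrates the idea; `count_images_per_class` and `IMAGE_EXTENSIONS` are hypothetical names introduced here, and the project's own `count_images` helper may be implemented differently.

```python
from pathlib import Path

# Hypothetical set of extensions treated as images (an assumption, not taken from src/data).
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images_per_class(split_dir):
    """Return {class_name: image_count} for a split laid out as split_dir/class_name/<images>."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore loose files next to the class folders
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

An empty class folder shows up with a count of 0, which makes missing data easy to spot.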

## Setup and Imports

We import utility functions from our data modules and visualization libraries to analyze and display dataset statistics.

In [None]:
import os
import matplotlib.pyplot as plt
import numpy as np

# Import our dataset utilities
from src.data.utils_visualization import (
    count_elements_in_subdirs,
    count_images,
    plot_directory_sizes,
    plot_image_counts,
)

# Import dataset processors
from src.data.cub200_dataset import process_dataset as process_cub200
from src.data.fgvc_aircraft_dataset import process_dataset as process_aircraft
from src.data.stanford_dogs_dataset import process_dataset as process_dogs
from src.data.stanford_cars_dataset import process_dataset as process_cars
from src.data.oxford_flowers_dataset import process_dataset as process_flowers
from src.data.nabirds_dataset import process_dataset as process_nabirds

## Dataset Analysis

We'll analyze each dataset individually, looking at:
1. Distribution of images across splits (train/test/validation)
2. Distribution of images per class
3. Class balance and potential data quality issues
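
As a complement to the plots, class balance can also be summarized numerically. The following is a minimal sketch that assumes per-class counts are available as a `{class_name: count}` dictionary; whether `count_images` returns exactly that structure is an assumption, and `summarize_class_balance` is a hypothetical helper.

```python
def summarize_class_balance(counts):
    """Summarize a {class_name: image_count} mapping (assumed input format)."""
    sizes = sorted(counts.values())
    if not sizes:
        raise ValueError("counts is empty")
    smallest, largest = sizes[0], sizes[-1]
    return {
        "num_classes": len(sizes),
        "total_images": sum(sizes),
        "min_per_class": smallest,
        "max_per_class": largest,
        "mean_per_class": sum(sizes) / len(sizes),
        # Ratio of the largest to the smallest class; infinite if any class is empty.
        "imbalance_ratio": largest / smallest if smallest else float("inf"),
        # Classes without any images usually point to a preprocessing problem.
        "empty_classes": sorted(name for name, n in counts.items() if n == 0),
    }
```

An imbalance ratio close to 1 indicates evenly sized classes; a large ratio or a non-empty `empty_classes` list is worth investigating before training.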

### CUB-200-2011 Dataset Analysis

The Caltech-UCSD Birds 200-2011 dataset ([CUB-200-2011](https://www.vision.caltech.edu/datasets/cub_200_2011/)) contains:
- 200 bird species
- 11,788 total images
- Original split: 5,994 training and 5,794 test images
- We hold out 20% of the training data as a validation set
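
One way to hold out 20% of the training data is to sample per class, so that every class keeps roughly the same train/validation ratio. The sketch below assumes this stratified approach and a fixed random seed; the actual logic in `src/data` may differ, and `split_train_validation` is a hypothetical name.

```python
import random


def split_train_validation(files_by_class, val_fraction=0.2, seed=0):
    """Split {class_name: [file, ...]} into train and validation mappings, class by class."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = {}, {}
    for class_name, files in files_by_class.items():
        shuffled = sorted(files)  # sort first so the result does not depend on input order
        rng.shuffle(shuffled)
        n_val = round(len(shuffled) * val_fraction)
        val[class_name] = shuffled[:n_val]
        train[class_name] = shuffled[n_val:]
    return train, val
```

Sampling per class rather than over the whole training set avoids validation splits in which small classes are missing entirely.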

In [None]:
# Define paths for CUB-200-2011
CUB_BASE_DIR = "./../data/processed/CUB-200-2011"
cub_train_dir = os.path.join(CUB_BASE_DIR, "train")
cub_test_dir = os.path.join(CUB_BASE_DIR, "test")
cub_val_dir = os.path.join(CUB_BASE_DIR, "validation")

# Plot split sizes
plot_directory_sizes(
    cub_train_dir,
    cub_test_dir,
    cub_val_dir,
    ('Train', 'Test', 'Validation', 'Total')
)

# Plot class distributions
cub_train_counts = count_images(cub_train_dir)
cub_test_counts = count_images(cub_test_dir)
cub_val_counts = count_images(cub_val_dir)

plot_image_counts("CUB-200-2011", cub_train_counts, cub_test_counts, cub_val_counts)

### FGVC Aircraft Dataset Analysis

The [FGVC-Aircraft](https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/) dataset contains:
- 10,000 total images
- 100 aircraft variants for fine-grained classification
- Pre-defined splits: 3,334 train, 3,333 test, 3,333 validation
- Multiple granularity levels available (variant/family/manufacturer)

The three official splits are nearly equal in size, so we use them as provided and create no additional split.
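
The documented split sizes (3,334 + 3,333 + 3,333 = 10,000) double as a sanity check on the processed directories. The sketch below assumes per-class counts are dictionaries of the form `{class_name: count}`; `total_images` and `check_total` are hypothetical helpers, not part of `src/data`.

```python
def total_images(*split_counts):
    """Sum per-class image counts over any number of splits."""
    return sum(sum(counts.values()) for counts in split_counts)


def check_total(name, expected_total, *split_counts):
    """Raise ValueError if the splits do not add up to the documented dataset size."""
    actual = total_images(*split_counts)
    if actual != expected_total:
        raise ValueError(f"{name}: expected {expected_total} images, found {actual}")
    return actual
```

Under that assumption, a call such as `check_total("FGVC Aircraft", 10_000, aircraft_train_counts, aircraft_test_counts, aircraft_val_counts)` would flag images lost or duplicated during preprocessing.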

In [None]:
# Define paths for FGVC Aircraft
AIRCRAFT_BASE_DIR = "./../data/processed/FGVC-Aircraft"
aircraft_train_dir = os.path.join(AIRCRAFT_BASE_DIR, "train")
aircraft_test_dir = os.path.join(AIRCRAFT_BASE_DIR, "test")
aircraft_val_dir = os.path.join(AIRCRAFT_BASE_DIR, "validation")

# Plot split sizes
plot_directory_sizes(
    aircraft_train_dir,
    aircraft_test_dir,
    aircraft_val_dir,
    ('Train', 'Test', 'Validation', 'Total')
)

# Plot class distributions
aircraft_train_counts = count_images(aircraft_train_dir)
aircraft_test_counts = count_images(aircraft_test_dir)
aircraft_val_counts = count_images(aircraft_val_dir)

plot_image_counts(
    "FGVC Aircraft",
    aircraft_train_counts,
    aircraft_test_counts,
    aircraft_val_counts
)

### Stanford Dogs Dataset Analysis

The [Stanford Dogs](http://vision.stanford.edu/aditya86/ImageNetDogs/) dataset contains:
- 120 dog breeds for fine-grained classification
- 20,580 total images
- Pre-defined train/test splits provided in .mat files
- We hold out 20% of the training data as a validation set

This dataset is a subset of ImageNet and focuses on fine-grained visual categorization of dog breeds.

In [None]:
# Define paths for Stanford Dogs
DOGS_BASE_DIR = "./../data/processed/Stanford Dogs"
dogs_train_dir = os.path.join(DOGS_BASE_DIR, "train")
dogs_test_dir = os.path.join(DOGS_BASE_DIR, "test")
dogs_val_dir = os.path.join(DOGS_BASE_DIR, "validation")

# Plot split sizes
plot_directory_sizes(
    dogs_train_dir,
    dogs_test_dir,
    dogs_val_dir,
    ('Train', 'Test', 'Validation', 'Total')
)

# Plot class distributions
dogs_train_counts = count_images(dogs_train_dir)
dogs_test_counts = count_images(dogs_test_dir)
dogs_val_counts = count_images(dogs_val_dir)

plot_image_counts(
    "Stanford Dogs",
    dogs_train_counts,
    dogs_test_counts,
    dogs_val_counts
)

### Stanford Cars Dataset Analysis

The [Stanford Cars](http://ai.stanford.edu/~jkrause/cars/car_dataset.html) dataset contains:
- 196 car models (make, model, year)
- 16,185 total images
- Original split: 8,144 training and 8,041 test images
- We hold out 20% of the training data as a validation set

The dataset focuses on fine-grained classification of car models, including make, model, and year of manufacture.

In [None]:
# Define paths for Stanford Cars
CARS_BASE_DIR = "./../data/processed/Stanford Cars"
cars_train_dir = os.path.join(CARS_BASE_DIR, "train")
cars_test_dir = os.path.join(CARS_BASE_DIR, "test")
cars_val_dir = os.path.join(CARS_BASE_DIR, "validation")

# Plot split sizes
plot_directory_sizes(
    cars_train_dir,
    cars_test_dir,
    cars_val_dir,
    ('Train', 'Test', 'Validation', 'Total')
)

# Plot class distributions
cars_train_counts = count_images(cars_train_dir)
cars_test_counts = count_images(cars_test_dir)
cars_val_counts = count_images(cars_val_dir)

plot_image_counts(
    "Stanford Cars",
    cars_train_counts,
    cars_test_counts,
    cars_val_counts
)

### Oxford 102 Flowers Dataset Analysis

The [Oxford 102 Flowers](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/) dataset contains:
- 102 flower categories
- 8,189 total images
- 40-258 images per class
- Pre-defined train/test/validation splits in .mat files

Note: In the original dataset the test split is much larger than the train split (6,149 vs. 1,020 images), the reverse of the usual convention. We therefore swap the two so that the split sizes are consistent with the other datasets.
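
The swap itself amounts to exchanging the train and test ID lists. The sketch below works on an in-memory dictionary and leaves out how the IDs are read from the `.mat` files; `swap_train_test` is a hypothetical helper and not necessarily how `oxford_flowers_dataset.py` does it. The example uses the official split sizes (1,020 train, 1,020 validation, 6,149 test) with placeholder IDs.

```python
def swap_train_test(split_ids):
    """Return a copy of split_ids with the 'train' and 'test' entries exchanged."""
    swapped = dict(split_ids)  # shallow copy; the input mapping is left untouched
    swapped["train"], swapped["test"] = split_ids["test"], split_ids["train"]
    return swapped


# Placeholder image IDs with the official split sizes.
original = {
    "train": list(range(0, 1020)),
    "validation": list(range(1020, 2040)),
    "test": list(range(2040, 8189)),
}
swapped = swap_train_test(original)
print({name: len(ids) for name, ids in swapped.items()})
```

After the swap, the larger split is used for training and the validation split is unchanged.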

In [None]:
# Define paths for Oxford 102 Flowers
FLOWERS_BASE_DIR = "./../data/processed/Oxford 102 Flower"
flowers_train_dir = os.path.join(FLOWERS_BASE_DIR, "train")
flowers_test_dir = os.path.join(FLOWERS_BASE_DIR, "test")
flowers_val_dir = os.path.join(FLOWERS_BASE_DIR, "validation")

# The processed directories already contain the swapped splits; the swap itself
# is handled in the oxford_flowers_dataset.py module. Passing the test and train
# directories in reversed order reproduces the original layout for comparison.
print("Distribution before train/test swap:")
plot_directory_sizes(
    flowers_test_dir,
    flowers_train_dir,
    flowers_val_dir,
    ('Train', 'Test', 'Validation', 'Total')
)

# Plot split sizes after swap
print("\nDistribution after train/test swap:")
plot_directory_sizes(
    flowers_train_dir,
    flowers_test_dir,
    flowers_val_dir,
    ('Train', 'Test', 'Validation', 'Total')
)

# Plot class distributions
flowers_train_counts = count_images(flowers_train_dir)
flowers_test_counts = count_images(flowers_test_dir)
flowers_val_counts = count_images(flowers_val_dir)

plot_image_counts(
    "Oxford 102 Flowers",
    flowers_train_counts,
    flowers_test_counts,
    flowers_val_counts
)

### NABirds Dataset Analysis

The [NABirds](https://dl.allaboutbirds.org/nabirds) dataset contains:
- 555 visual categories covering 400 North American bird species
- 48,562 total images
- Pre-defined train/test splits
- We hold out 20% of the training data as a validation set

This dataset required special handling due to:
1. Nested directory structure with "/" in class names
2. Need for empty folder cleanup
3. Variable number of images per class
4. Complex class hierarchy

The preprocessing steps are handled in the `nabirds_dataset.py` module, which includes:
1. Directory structure flattening
2. Empty folder removal
3. Class consolidation
4. Validation split creation
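
The first two preprocessing steps can be sketched as follows. This is an illustration only, not the code in `nabirds_dataset.py`: `flatten_class_dirs` and the `_` separator are assumptions, and class consolidation and validation split creation are not covered.

```python
import shutil
from pathlib import Path


def flatten_class_dirs(split_dir, separator="_"):
    """Move images from nested class folders into single-level folders, then prune empty ones.

    A class name containing "/" ends up as nested folders (e.g. A/B/); this joins the
    folder names into one (A_B/). Files with identical names in the same target
    folder are not handled in this sketch.
    """
    split_dir = Path(split_dir)
    # Collect files first so that moving them does not disturb the traversal.
    files = [p for p in split_dir.rglob("*") if p.is_file()]
    for path in files:
        parts = path.relative_to(split_dir).parent.parts
        if len(parts) <= 1:
            continue  # already at split_dir/class_name/ (or loose in split_dir)
        target_dir = split_dir / separator.join(parts)
        target_dir.mkdir(exist_ok=True)
        shutil.move(str(path), str(target_dir / path.name))
    # Remove directories that are now empty, deepest first.
    directories = sorted(
        (p for p in split_dir.rglob("*") if p.is_dir()),
        key=lambda p: len(p.parts),
        reverse=True,
    )
    for directory in directories:
        if not any(directory.iterdir()):
            directory.rmdir()
```

Removing the deepest directories first means a parent folder is only checked after its emptied children are gone, so whole empty branches are pruned in one pass.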

In [None]:
# Define paths for NABirds
NABIRDS_BASE_DIR = "./../data/processed/NABirds"
nabirds_train_dir = os.path.join(NABIRDS_BASE_DIR, "train")
nabirds_test_dir = os.path.join(NABIRDS_BASE_DIR, "test")
nabirds_val_dir = os.path.join(NABIRDS_BASE_DIR, "validation")

# Plot split sizes
plot_directory_sizes(
    nabirds_train_dir,
    nabirds_test_dir,
    nabirds_val_dir,
    ('Train', 'Test', 'Validation', 'Total')
)

# Plot class distributions
nabirds_train_counts = count_images(nabirds_train_dir)
nabirds_test_counts = count_images(nabirds_test_dir)
nabirds_val_counts = count_images(nabirds_val_dir)

plot_image_counts(
    "NABirds",
    nabirds_train_counts,
    nabirds_test_counts,
    nabirds_val_counts
)