# CSC420 Assignment 1 - CNN Dog Breed Classification Report

This notebook contains the analysis and code for the CNN Dog Breed Classification assignment.

## Task I - Inspection (20 marks)

**Objective:** Look at the images in both datasets (DBIsubset and SDDsubset), and briefly explain if you observe any systematic differences between images in one dataset vs. the other.

In [None]:
import os
import logging
import random
from pathlib import Path
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

In [None]:
# Define dataset paths
DBI_PATH = Path("DBIsubset")
SDD_PATH = Path("SDDsubset")

# Get breed folders
breeds = sorted([d.name for d in DBI_PATH.iterdir() if d.is_dir()])
logger.info(f"Breeds: {breeds}")

In [None]:
def get_sample_images(dataset_path, breed, n_samples=3):
    """Get random sample images from a breed folder."""
    breed_path = dataset_path / breed
    # Sort first: glob order is filesystem-dependent, which would defeat random.seed()
    images = sorted(breed_path.glob("*.jpg"))
    return random.sample(images, min(n_samples, len(images)))

def display_comparison(breeds_to_show, n_samples=2):
    """Display side-by-side comparison of DBI vs SDD images."""
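    # squeeze=False keeps axes 2-D even when only one breed is shown
    fig, axes = plt.subplots(len(breeds_to_show), n_samples * 2, figsize=(16, 4 * len(breeds_to_show)), squeeze=False)
    # Hide every axis up front so panels stay blank when a breed has fewer than n_samples images
    for ax in axes.ravel():
        ax.axis("off")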
    
    for i, breed in enumerate(breeds_to_show):
        dbi_samples = get_sample_images(DBI_PATH, breed, n_samples)
        sdd_samples = get_sample_images(SDD_PATH, breed, n_samples)
        
        for j, img_path in enumerate(dbi_samples):
            img = Image.open(img_path)
            axes[i, j].imshow(img)
            axes[i, j].set_title(f"DBI - {breed}\n{img.size[0]}x{img.size[1]}")
            axes[i, j].axis("off")
        
        for j, img_path in enumerate(sdd_samples):
            img = Image.open(img_path)
            axes[i, n_samples + j].imshow(img)
            axes[i, n_samples + j].set_title(f"SDD - {breed}\n{img.size[0]}x{img.size[1]}")
            axes[i, n_samples + j].axis("off")
    
    plt.tight_layout()
    plt.show()

# Set random seed for reproducibility
random.seed(42)

# Display comparison for all breeds
display_comparison(breeds, n_samples=2)

In [None]:
def analyze_image_statistics(dataset_path, dataset_name):
    """Analyze image statistics for a dataset."""
    widths = []
    heights = []
    aspect_ratios = []
    file_sizes = []
    
    for breed in breeds:
        breed_path = dataset_path / breed
        for img_path in breed_path.glob("*.jpg"):
            # The context manager closes the file handle after reading the size
            with Image.open(img_path) as img:
                w, h = img.size
            widths.append(w)
            heights.append(h)
            aspect_ratios.append(w / h)
            file_sizes.append(os.path.getsize(img_path) / 1024)  # KB
    
    logger.info(f"\n{dataset_name} Statistics:")
    logger.info(f"  Total images: {len(widths)}")
    logger.info(f"  Width - Min: {min(widths)}, Max: {max(widths)}, Mean: {np.mean(widths):.1f}")
    logger.info(f"  Height - Min: {min(heights)}, Max: {max(heights)}, Mean: {np.mean(heights):.1f}")
    logger.info(f"  Aspect Ratio - Min: {min(aspect_ratios):.2f}, Max: {max(aspect_ratios):.2f}, Mean: {np.mean(aspect_ratios):.2f}")
    logger.info(f"  File Size (KB) - Min: {min(file_sizes):.1f}, Max: {max(file_sizes):.1f}, Mean: {np.mean(file_sizes):.1f}")
    
    return {
        "widths": widths,
        "heights": heights,
        "aspect_ratios": aspect_ratios,
        "file_sizes": file_sizes
    }

dbi_stats = analyze_image_statistics(DBI_PATH, "DBIsubset")
sdd_stats = analyze_image_statistics(SDD_PATH, "SDDsubset")

In [None]:
# Visualize image size distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# (stats key, x-axis label, panel title) for each of the four panels
panels = [
    ("widths", "Width (pixels)", "Image Width Distribution"),
    ("heights", "Height (pixels)", "Image Height Distribution"),
    ("aspect_ratios", "Aspect Ratio (width/height)", "Aspect Ratio Distribution"),
    ("file_sizes", "File Size (KB)", "File Size Distribution"),
]

for ax, (key, xlabel, title) in zip(axes.ravel(), panels):
    # Shared bin edges so the DBI and SDD bars are directly comparable
    bins = np.histogram_bin_edges(dbi_stats[key] + sdd_stats[key], bins=30)
    ax.hist(dbi_stats[key], bins=bins, alpha=0.7, label="DBI", color="blue")
    ax.hist(sdd_stats[key], bins=bins, alpha=0.7, label="SDD", color="orange")
    ax.set_xlabel(xlabel)
    ax.set_ylabel("Frequency")
    ax.set_title(title)
    ax.legend()

plt.tight_layout()
plt.show()

In [None]:
# Scatter plot of image dimensions
fig, ax = plt.subplots(figsize=(10, 8))

ax.scatter(dbi_stats["widths"], dbi_stats["heights"], alpha=0.5, label="DBI", color="blue", s=20)
ax.scatter(sdd_stats["widths"], sdd_stats["heights"], alpha=0.5, label="SDD", color="orange", s=20)

ax.set_xlabel("Width (pixels)")
ax.set_ylabel("Height (pixels)")
ax.set_title("Image Dimensions: DBI vs SDD")
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Task I - Analysis and Observations

After inspecting images from both datasets, I observe the following **systematic differences** between DBIsubset (Dog Breed Identification) and SDDsubset (Stanford Dogs Dataset):

#### 1. Image Quality and Composition

**DBIsubset:**
- Images appear to be **professionally photographed** or curated stock photos
- Dogs are typically **well-centered** and prominently featured in the frame
- **Clean, uncluttered backgrounds** (often studio-style or simple indoor/outdoor settings)
- Higher overall **image quality** with good lighting and focus
- Dogs are often posed in **standard positions** (standing, sitting) showing full body or clear face

**SDDsubset:**
- Images appear to be **user-generated** or scraped from the web (ImageNet-style)
- Dogs are in **natural, candid poses** with varying positions in the frame
- **Cluttered, real-world backgrounds** with other objects, people, furniture, etc.
- More **variable image quality** - some blurry, different lighting conditions
- Dogs may be **partially occluded** or only partially visible in the frame
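
The "some blurry" observation can be made measurable with a focus measure such as the variance of the Laplacian: sharper images produce stronger second-derivative responses. The sketch below is not part of the original analysis. `laplacian_variance` is a hypothetical helper that takes a grayscale image as a list of rows, and the two toy images are made up for illustration.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    `gray` is a 2-D list of intensities (rows of equal length, at least 3x3).
    Higher values suggest more edges and fine detail, i.e. a sharper image.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    return pvariance(responses)


# Toy inputs (made up): a checkerboard has hard edges, a flat image has none.
sharp = [[255 * ((x + y) % 2) for x in range(8)] for y in range(8)]
flat = [[128] * 8 for _ in range(8)]

sharp_score = laplacian_variance(sharp)
flat_score = laplacian_variance(flat)
```

On real files the input could be obtained with something like `np.asarray(Image.open(path).convert("L")).tolist()`. Comparing the score distributions of the two datasets would then support or weaken the blur claim.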

#### 2. Image Resolution and Size

**DBIsubset:**
- **Highly variable resolutions** - ranging from small thumbnails (~180x275) to large images (1000x1000)
- Larger average file sizes due to some high-resolution images
- More variation in aspect ratios

**SDDsubset:**
- More **consistent resolution** - typically around 300-500 pixels on the longer side
- Smaller, more uniform file sizes
- More standardized aspect ratios
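
To back the "variable vs. consistent" claim with a single number, one option is the coefficient of variation (standard deviation divided by mean) of the longer image side. This is a minimal sketch: `longer_side_spread` is a hypothetical helper, and the toy sizes are illustrative values, not measurements from either dataset.

```python
from statistics import mean, pstdev


def longer_side_spread(widths, heights):
    """Return (mean, coefficient of variation) of the longer image side."""
    longer = [max(w, h) for w, h in zip(widths, heights)]
    avg = mean(longer)
    return avg, pstdev(longer) / avg


# Hypothetical sizes for illustration only; in the notebook these would be
# dbi_stats["widths"], dbi_stats["heights"], and the SDD equivalents.
toy_variable = longer_side_spread([180, 1000, 640], [275, 1000, 480])
toy_uniform = longer_side_spread([500, 500, 375], [375, 333, 500])
```

A larger coefficient of variation means a wider spread of resolutions relative to the typical image size, so the two datasets can be compared even if their mean sizes differ.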

#### 3. Subject Framing

**DBIsubset:**
- Dog typically occupies a **large portion** of the image
- Often shows the **entire dog** or a clear portrait-style shot
- Minimal distracting elements

**SDDsubset:**
- Dog may occupy a **smaller portion** of the image
- More **environmental context** visible
- May include other subjects (people, other animals, objects)
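
The framing difference could be quantified if bounding boxes for the dogs were available. The full Stanford Dogs release ships bounding-box annotations, but whether the subsets used here include them is an assumption, and DBI boxes would have to come from manual labelling or a detector. The sketch below uses a hypothetical helper, `subject_area_fraction`, with made-up box coordinates.

```python
def subject_area_fraction(image_size, box):
    """Fraction of the image area covered by a bounding box.

    image_size: (width, height) in pixels.
    box: (xmin, ymin, xmax, ymax) in pixels; clipped to the image bounds.
    """
    width, height = image_size
    xmin, ymin, xmax, ymax = box
    # Clip the box so annotations that spill past the border do not inflate the ratio
    xmin, xmax = max(0, xmin), min(width, xmax)
    ymin, ymax = max(0, ymin), min(height, ymax)
    box_area = max(0, xmax - xmin) * max(0, ymax - ymin)
    return box_area / (width * height)


# Made-up examples: a tightly framed portrait vs. a small dog in a wide scene.
portrait = subject_area_fraction((400, 400), (40, 20, 360, 400))
wide_scene = subject_area_fraction((500, 375), (200, 150, 300, 250))
```

Averaging this fraction per dataset would turn "large portion" and "smaller portion" into comparable numbers.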

#### 4. Implications for CNN Training

These differences have important implications for training and evaluating CNN models:

1. **Domain Shift:** A model trained on one dataset may not generalize well to the other due to the systematic differences in image characteristics.

2. **DBI-trained models** may struggle with:
   - Cluttered backgrounds
   - Partially visible dogs
   - Lower quality images
   - Non-standard poses

3. **SDD-trained models** may struggle with:
   - Very high-resolution images
   - Studio-style photography
   - Perfectly centered subjects

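4. **Data augmentation** strategies should account for these differences when training models intended to generalize across both datasets, for example random crops and rescaling to vary subject framing, and brightness or blur perturbations to vary image quality.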
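
As one illustration of the augmentation point, the sketch below samples random crop boxes with a varying area and aspect ratio, the idea behind the "random resized crop" transforms found in common vision libraries. It is a standalone sketch, not the pipeline used in this assignment. `sample_crop_box` and its default `scale` and `ratio` ranges are assumptions chosen for illustration.

```python
import math
import random


def sample_crop_box(width, height, scale=(0.5, 1.0), ratio=(3 / 4, 4 / 3), rng=random):
    """Sample a random crop box (left, top, right, bottom) inside the image.

    scale: range for the crop area as a fraction of the image area.
    ratio: range for the crop aspect ratio (width / height).
    Falls back to the full image if no valid box is found in 10 tries.
    """
    area = width * height
    for _ in range(10):
        target_area = area * rng.uniform(*scale)
        # Sample the aspect ratio log-uniformly so 3:4 and 4:3 are equally likely
        log_ratio = rng.uniform(math.log(ratio[0]), math.log(ratio[1]))
        aspect = math.exp(log_ratio)
        crop_w = round(math.sqrt(target_area * aspect))
        crop_h = round(math.sqrt(target_area / aspect))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            left = rng.randint(0, width - crop_w)
            top = rng.randint(0, height - crop_h)
            return left, top, left + crop_w, top + crop_h
    return 0, 0, width, height


# Seeded generator so the sampled boxes are repeatable
rng = random.Random(0)
boxes = [sample_crop_box(1000, 1000, rng=rng) for _ in range(5)]
```

Each box can be passed to PIL's `Image.crop(box)` and the result resized to the network input size. This shifts the dog off-center and changes how much of the frame it fills, which moves centered DBI-style images somewhat closer to the more varied SDD framing.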