<a href="https://colab.research.google.com/github/christophergaughan/PyTorch/blob/main/PyTorch_Custom_Datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## PyTorch domain libraries for custom datasets

**Remember:** PyTorch has several domain libraries, and each one contains data loading functions for a different kind of data source. You'll want to look into each of these PyTorch domain libraries for existing data loading functions and customizable data loading functions:

* Visual- `torchvision.datasets`
* Text - `torchtext.datasets`
* Audio - `torchaudio.datasets`
* Recommendation system - `torchrec.datasets`

We've used some datasets with PyTorch so far.

BUT, how do you get your own data into PyTorch?

One way to do this is via *custom datasets*.





## Importing PyTorch and setting up device agnostic code

In [None]:
import torch
from torch import nn

torch.__version__

In [None]:
# set up device agnostic code
device = "cuda" if torch.cuda.is_available() else "cpu"
device
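
As a small illustration of what device-agnostic code buys us, the sketch below moves a tensor and a layer to whichever device was selected. The toy tensor and the `nn.Linear(4, 2)` layer are placeholders chosen only for this example; they are not part of the notebook's model.

```python
import torch
from torch import nn

# Pick the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical toy data and layer, used only to illustrate .to(device)
x = torch.rand(8, 4).to(device)
layer = nn.Linear(in_features=4, out_features=2).to(device)

# Both live on the same device, so the forward pass works on either backend
out = layer(x)
print(out.shape, out.device)
```

The same script runs unchanged on a CPU-only machine and on a Colab GPU runtime, which is the point of setting `device` once at the top.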

In [None]:
!nvidia-smi

## Get data: food images from the Food-101 dataset
* We'll start off with just three categories of food and use just 10% of the data.
* Our dataset is a subset of the full Food-101 dataset.
* 3 classes, 100 images/class
* When starting out on ML projects, it's important to try things on a small scale and only *then* scale the dataset up.
* At this point experiments run faster because the dataset is smaller.

In [None]:
import requests
import zipfile
from pathlib import Path

# Setup path to data folder
data_path = Path("data/")
image_path = data_path / "pizza_steak_sushi"

# If the image folder doesn't exist, download it and prepare it...
if image_path.is_dir():
    print(f"{image_path} directory exists.")
else:
    print(f"Did not find {image_path} directory, creating one...")
    image_path.mkdir(parents=True, exist_ok=True)

    # Download pizza, steak, sushi data
    with open(data_path / "pizza_steak_sushi.zip", "wb") as f:
        request = requests.get("https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip")
        print("Downloading pizza, steak, sushi data...")
        f.write(request.content)

    # Unzip pizza, steak, sushi data
    with zipfile.ZipFile(data_path / "pizza_steak_sushi.zip", "r") as zip_ref:
        print("Unzipping pizza, steak, sushi data...")
        zip_ref.extractall(image_path)

## Becoming one with the data (data prep and data exploration)

In [None]:
import os
def walk_through_dir(dir_path):
    """
    Walks through dir_path and prints how many directories and files it finds at each level.

    Args:
        dir_path (str or pathlib.Path): target directory

    Uses os.walk to traverse the directory tree.
    """
    for dirpath, dirnames, filenames in os.walk(dir_path):
        print(f"There are {len(dirnames)} directories and {len(filenames)} images in '{dirpath}'")

In [None]:
walk_through_dir(image_path)

In [None]:
# Setup our training and testing paths
train_dir = image_path / "train"
test_dir = image_path / "test"

train_dir, test_dir

### Visualizing an image

Let's write some code to:
1. Get all the image paths.
2. Pick a random image path using Python's `random.choice()`.
3. Get the image class name using `pathlib.Path.parent.stem`.
4. Open the image with Python's PIL, since we're working with images.
5. Show the image and print its metadata.

In [None]:
import random
from PIL import Image

# Set seed
random.seed(42)

# 1. Get all image paths (* means "any combination")
image_path_list = list(image_path.glob("*/*/*.jpg"))

# 2. Get random image path
random_image_path = random.choice(image_path_list)

# 3. Get image class from path name (the image class is the name of the directory where the image is stored)
image_class = random_image_path.parent.stem

# 4. Open image
img = Image.open(random_image_path)

# 5. Print metadata
print(f"Random image path: {random_image_path}")
print(f"Image class: {image_class}")
print(f"Image height: {img.height}")
print(f"Image width: {img.width}")
img

## Visualize with matplotlib

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Turn the image into an array
img_as_array = np.asarray(img)

# Plot the image with matplotlib
plt.figure(figsize=(10, 7))
plt.imshow(img_as_array)
plt.title(f"Image class: {image_class} | Image shape: {img_as_array.shape} -> (height, width, color_channels)")
plt.axis(False);

## Transform all the images into torch.tensors

Before we can use our image data with PyTorch, we need to:
1. Turn the target data into tensors.
2. Turn it into a `torch.utils.data.Dataset` and subsequently a `torch.utils.data.DataLoader`; we'll call those `Dataset` and `DataLoader`.

NOTE: we will be using `ImageFolder` from `torchvision.datasets`, which accepts a `transform` parameter.

In [None]:
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

### Transforming data with `torchvision.transforms`
Turn JPEG images into `torch.Tensor`s.

In [None]:
# Write a transform for image
data_transform = transforms.Compose([
    # Resize the images to 64x64
    transforms.Resize(size=(64, 64)),
    # Flip the images randomly on the horizontal (data augmentation)
    transforms.RandomHorizontalFlip(p=0.5),
    # Turn the image into a torch.Tensor (scales pixel values from [0, 255] to [0.0, 1.0])
    transforms.ToTensor()
])

In [None]:
data_transform(img).shape

### Visualizing transformed image
Transforms help you get your images ready to be used with a model and to perform *data augmentation*:
https://pytorch.org/vision/stable/transforms.html

In [None]:
image_path_list[:5]

In [None]:
def plot_transformed_images(image_paths: list, transform, n=3, seed=42):
    """Plots a series of random images from image_path_list.

    Will open n image paths from image_paths, transform them
    with transform and plot them side by side.

    Args:
        image_paths (list): List of target image paths.
        transform (PyTorch Transforms): Transforms to apply to images.
        n (int, optional): Number of images to plot. Defaults to 3.
        seed (int, optional): Random seed for the random generator. Defaults to 42.
    """
    random.seed(seed)
    random_image_paths = random.sample(image_paths, k=n)
    for image_path in random_image_paths:
        with Image.open(image_path) as f:
            fig, ax = plt.subplots(nrows=1, ncols=2) # 1 row 2 cols
            ax[0].imshow(f)
            ax[0].set_title(f"Original\nSize: {f.size}")
            ax[0].axis(False)

            # Transform and plot target image
            # Note: permute() will change shape of image to suit matplotlib
            # (PyTorch default is [C, H, W] but Matplotlib is [H, W, C])
            transformed_image = transform(f).permute(1, 2, 0) # note we need to change the shape for matplotlib (C, H, W) -> (H, W, C)
            ax[1].imshow(transformed_image)
            ax[1].set_title(f"Transformed\nShape: {transformed_image.shape}")
            ax[1].axis("off") # can use False

            fig.suptitle(f"Class: {image_path.parent.stem}", fontsize=16)

plot_transformed_images(image_path_list,
                        transform=data_transform,
                        n=3)

### Note the original is better in quality than the transformed image
There is less information encoded in the transformed image, so we may expect some loss
in performance. However, the image size can be tuned as a **hyperparameter**.
Note that we are following the `CNN explainer` setup, so this is where we are
getting these image parameters from.

**Note as well** that some of our images are flipped horizontally, since `RandomHorizontalFlip(p=0.5)` flips each image with probability 0.5.

**Note also** that our images are now in tensor format, so they are ready to be passed to a model.

## Option 1: Loading image data using `ImageFolder`

Remember that each of the PyTorch domain libraries has built-in functions to help you load data. Here, using `ImageFolder`, we note:

```
class torchvision.datasets.ImageFolder(
    root: Union[str, pathlib.Path],
    transform: Optional[Callable] = None,
    target_transform: Optional[Callable] = None,
    loader: Callable[[str], Any] = default_loader,
    is_valid_file: Optional[Callable[[str], bool]] = None,
    allow_empty: bool = False,
)
```

This class helps us load images stored in the format we have specified. It is a pre-built `datasets` class, and with our transform applied, the images are loaded as tensors.

* We can load image classification data using `torchvision.datasets.ImageFolder`.
* One of the advantages of using a PyTorch pre-built `Dataset` is that there is less custom code for us to write and debug.

https://pytorch.org/vision/0.20/generated/torchvision.datasets.ImageFolder.html#torchvision.datasets.ImageFolder

In [None]:
# Use ImageFolder to create dataset(s)
from torchvision import datasets
train_data = datasets.ImageFolder(root=train_dir, # target folder of images
                                  transform=data_transform, # transforms to perform on data (images) we write above
                                  target_transform=None) # transforms to perform on labels/target (if necessary)

test_data = datasets.ImageFolder(root=test_dir,
                                 transform=data_transform)

print(f"Train data:\n{train_data}\nTest data:\n{test_data}")

In [None]:
train_dir, test_dir

In [None]:
# Get class names as list
class_names = train_data.classes
class_names

In [None]:
# Get class names as a dictionary (class names mapped to integer indices)
class_dict = train_data.class_to_idx
class_dict

In [None]:
# Check the lengths of our datasets
len(train_data), len(test_data)

In [None]:
train_data.samples[:10]

So we have used `datasets.ImageFolder` to turn all our images into tensors, and we did that with the help of `data_transform`. We resized our images using `transforms.Resize` and randomly flipped them along the horizontal using `transforms.RandomHorizontalFlip` (which we don't necessarily need, but used to show what happens to an image when you pass it through a transform pipeline). Then, most importantly, we used `transforms.ToTensor()` to allow our images to be *used with a PyTorch model*.

In [None]:
# Index on the train_data Dataset to get a single image and label
img, label = train_data[0][0], train_data[0][1]
print(f"Image tensor:\n {img}")
print(f"Image shape: {img.shape}")
print(f"Image datatype: {img.dtype}")
print(f"Image label: {label}")
print(f"Label datatype: {type(label)}")


### This is like an accounting step: we check what form our dataset is in, to help with any errors we might encounter downstream in our model

In [None]:
class_names[label]

**Let's look at our data with matplotlib, keeping in mind that matplotlib likes the color channels last (whereas PyTorch puts them first).**
* We'll continue with our 'book-keeping', making sure shapes and datatypes match.

In [None]:
# Rearrange the order of dimensions
img_permute = img.permute(1, 2, 0)

# Print out different shapes
print(f"Original shape: {img.shape} -> [color_channels, height, width]")
print(f"Image permuted -> [height, width, color_channels]: {img_permute.shape}")

# Plot the image
plt.figure(figsize=(10, 7))
plt.imshow(img_permute)
plt.axis("off")
plt.title(class_names[label], fontsize=14)
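
A quick sanity check of what `permute(1, 2, 0)` does, sketched on a random tensor rather than a real image (the 3x64x64 size is only an assumption matching our resize): the axes are reordered, but every pixel value stays where it logically belongs.

```python
import torch

chw = torch.rand(3, 64, 64)   # [color_channels, height, width]
hwc = chw.permute(1, 2, 0)    # [height, width, color_channels]

print(chw.shape, "->", hwc.shape)

# permute only reorders the axes; the value at (channel 2, row 10, col 20)
# is the same element as (row 10, col 20, channel 2) after permuting
print(torch.equal(chw[2, 10, 20], hwc[10, 20, 2]))
```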

# PyTorch `DataLoader`: we've been working with these, but a formal definition can only help. We turn our image datasets into `DataLoader`s

## What is a `DataLoader`?
The `DataLoader` in PyTorch is a utility that efficiently loads and iterates over datasets, particularly for deep learning tasks. It provides key functionalities such as:

- **Batching**: Splits the dataset into mini-batches for more efficient training. Think of the situation where you have a large dataset and have to load it all in at once: you run into *memory* problems, even when using GPUs. Batching lets us feed the data in chunks that our system can digest without exceeding its memory constraints.
- **Shuffling**: Randomly shuffles data at the start of each epoch to improve model generalization.
- **Parallel Loading**: Uses multiple worker processes (`num_workers`) to load data in parallel, improving performance.
- **Automatic Batching**: Allows automatic collation of data samples into batches.
- **Custom Sampling**: Supports custom samplers to define how data should be accessed.

## Syntax/ hyperparameters
```python
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,          # The dataset to load
    batch_size=32,    # Number of samples per batch- this is variable depending on context
    shuffle=True,     # Whether to shuffle data
    num_workers=4     # Number of parallel worker processes for loading data (useful for much larger datasets, e.g. in the cloud)
)
```
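
To see how `batch_size` determines the number of batches, here is a minimal sketch on a made-up dataset of 100 random samples (the feature size and label range are arbitrary assumptions). It also shows `drop_last`, a `DataLoader` argument that discards the final incomplete batch.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset: 100 samples with 3 features each, plus integer labels
features = torch.rand(100, 3)
labels = torch.randint(0, 3, (100,))
toy_dataset = TensorDataset(features, labels)

loader = DataLoader(toy_dataset, batch_size=32, shuffle=True)
loader_drop = DataLoader(toy_dataset, batch_size=32, shuffle=True, drop_last=True)

print(len(loader), math.ceil(100 / 32))  # the final batch holds the 4 leftover samples
print(len(loader_drop), 100 // 32)       # drop_last=True discards the incomplete batch

# Collect the size of every batch the first loader yields
batch_sizes = [x.shape[0] for x, _ in loader]
print(batch_sizes)
```

In general `len(dataloader)` is the number of batches, not the number of samples, which is worth keeping in mind when averaging losses per epoch.
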
# Why Do We Not Shuffle the Test Dataset in PyTorch?

## Explanation
When training a deep learning model, we typically set `shuffle=True` for the **training dataset** to improve generalization by preventing the model from learning the order of data. However, for the **test dataset**, we usually set `shuffle=False`. The reasons for this are:

1. **Consistency in Evaluation**:  
   - We want to evaluate the model on the same fixed order of test samples to ensure reproducibility.
   
2. **Sequential Dependencies**:  
   - If the test data has a natural order (e.g., *time-series data*), shuffling it would disrupt the sequence, leading to incorrect evaluations.
   
3. **Batch-wise Predictions**:  
   - In some applications, predictions need to be aligned with the original dataset order for post-processing.
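
The alignment point can be sketched with a toy test set in which sample `i` simply holds the value `i`. The "model" here is a stand-in that doubles its input, an assumption made purely for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical test set: sample i simply holds the value i
test_values = torch.arange(10, dtype=torch.float32)
toy_test_loader = DataLoader(TensorDataset(test_values), batch_size=4, shuffle=False)

# Stand-in "model": doubles its input (an assumption for illustration only)
predictions = torch.cat([batch * 2 for (batch,) in toy_test_loader])

# Because the loader is not shuffled, predictions[i] belongs to sample i
print(torch.equal(predictions, test_values * 2))
```

With `shuffle=True` the concatenated predictions would follow the shuffled order instead, so they could no longer be matched to the dataset by position.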

## Example: Loading Training and Test Datasets

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Custom dataset example- an OOP implementation is super-useful here
class CustomDataset(Dataset):
    def __init__(self):
        self.data = torch.arange(100)  # Example dataset with 100 samples

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

# Instantiate datasets
train_dataset = CustomDataset()
test_dataset = CustomDataset()

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=10, shuffle=True)  # Shuffle for training
test_loader = DataLoader(test_dataset, batch_size=10, shuffle=False)   # No shuffle for testing

# Iterating over train_loader (Shuffled)
print("Training batches:")
for batch in train_loader:
    print(batch)

# Iterating over test_loader (Not shuffled)
print("\nTest batches:")
for batch in test_loader:
    print(batch)
```


In [None]:
import os
os.cpu_count()

In [None]:
# Turn train and test datasets into DataLoader's
from torch.utils.data import DataLoader

BATCH_SIZE=1
train_dataloader = DataLoader(dataset=train_data,
                              batch_size=BATCH_SIZE, # how many samples per batch?
                              num_workers=1, # how many subprocesses to use for data loading (more can help, up to the number of CPU cores)
                              shuffle=True) # we don't want our model to memorize the order of the data and lose generalizability; shuffling reduces the chance of that

test_dataloader = DataLoader(dataset=test_data,
                             batch_size=BATCH_SIZE,
                             num_workers=1,
                             shuffle=False) # if we want to look at our dataset in the future, our test data is always in the same order

train_dataloader, test_dataloader

*so we have two instances of our dataloader*

In [None]:
print(len(train_dataloader), len(test_dataloader)) # note output as we change BATCH_SIZE
print(len(train_data), len(test_data))


Let's iterate through our train dataloader. This code snippet gets a single batch of data (in this case, a single image and its corresponding label) from the `train_dataloader` and then prints the shape of the image and the label.

In [None]:
img, label = next(iter(train_dataloader))

# BATCH_SIZE is currently 1; you can change this parameter depending on the size of the model
print(f"Image shape: {img.shape} -> [batch_size, color_channels, height, width]")
print(f"Label shape: {label.shape} -> batch_size")

#### What if we didn't have these convenient data loading tools at our disposal?

How would we load our data without them? In particular, what if we didn't have the `ImageFolder` class? How could we load our image dataset so that it's compatible with the `DataLoader`? Let's pretend we didn't have the `torchvision.datasets.ImageFolder` helper. We could *still* replicate that functionality!
* We want to get the class names as a list from our loaded data.
* We want to get our class names as a dictionary as well.

## Loading Image Data with a Custom `Dataset`

Pros:
* Can create a `Dataset` out of almost anything
* Not limited to PyTorch pre-built `Dataset` functions

Cons:
* Even though you could create a `Dataset` out of almost anything, it doesn't mean it will work
* Using a custom `Dataset` often results in us writing more code, which could be prone to errors and/or performance issues

In [None]:
import os
import pathlib
import torch

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms
from typing import Tuple, Dict, List

In [None]:
# Instance of torchvision.datasets.ImageFolder()
train_data.classes, train_data.class_to_idx

We will map our image names to their class by passing a file path to a function.

## Creating a helper function to get class names

We want to:
1. Get class names using `os.scandir()` to traverse a target directory (ideally the directory is in standard image classification format).
2. Raise an error if the class names aren't found (if this happens, there might be something wrong with the directory structure).
3. Turn the class names into a dict and a list, and return them.

In [None]:
# Set up path for target directory
target_directory = train_dir
print(f"Target dir: {target_directory}")

# Get the class names from the target directory
class_names_found = sorted([entry.name for entry in list(os.scandir(target_directory))])
class_names_found



In [None]:
list(os.scandir(target_directory))

In [None]:
def find_classes(directory: str) -> Tuple[List[str], Dict[str, int]]:
    """Finds the class folder names in a target directory.

    Assumes the target directory is in standard image classification format.

    Args:
        directory (str): Target directory to scan.

    Returns:
        Tuple[List[str], Dict[str, int]]: (list of class names, dict mapping each class name to its index)
    """
    # 1. Get the class names by scanning the target directory
    classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())


    # 2. Raise an error if class names not found
    if not classes:
        raise FileNotFoundError(f"Couldn't find any classes in {directory}.")

    # 3. Create a dictionary of index labels (computers prefer numbers rather than strings as labels)
    class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}

    return classes, class_to_idx

In [None]:
find_classes(train_dir)
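
To see why the `entry.is_dir()` filter and the empty-result check matter, here is a small sketch that repeats the same scanning logic on a throwaway directory. The folder and file names are invented for the demonstration.

```python
import os
import tempfile
from pathlib import Path

# Throwaway directory with two class folders and one stray file
root = Path(tempfile.mkdtemp())
(root / "sushi").mkdir()
(root / "pizza").mkdir()
(root / "notes.txt").write_text("not a class folder")

# Same scanning logic as find_classes: keep directories only, sorted by name
classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
class_to_idx = {name: i for i, name in enumerate(classes)}
print(classes, class_to_idx)

# An empty directory yields no class folders, which is the case find_classes rejects
empty_root = Path(tempfile.mkdtemp())
empty_classes = sorted(e.name for e in os.scandir(empty_root) if e.is_dir())
print(empty_classes)
```

Sorting matters here: it makes the class-to-index mapping deterministic, so the same folder always gets the same integer label.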

### Create a custom `Dataset` to replicate `ImageFolder`

To create our own custom dataset, we want to:
1. Subclass `torch.utils.data.Dataset`.
2. Init our subclass with a target directory (the directory we'd like to get data from) as well as a transform if we'd like to transform the data.
3. Create several attributes:
    * paths - paths of our images
    * transform - the transform we'd like to use
    * classes - a list of the target classes
    * class_to_idx - dict of the target classes mapped to integer labels
4. Create a function `load_image()` that opens an image.
5. Overwrite the `__len__()` method to return the length of the dataset.
6. Overwrite the `__getitem__()` method to return a given sample when passed an index.

In [None]:
# Write a custom dataset class (inherits from torch.utils.data.Dataset)
from torch.utils.data import Dataset

# 1. Subclass torch.utils.data.Dataset
class ImageFolderCustom(Dataset):

    # 2. Initialize with a targ_dir and transform (optional) parameter
    def __init__(self, targ_dir: str, transform=None) -> None:

        # 3. Create class attributes
        # Get all image paths
        self.paths = list(pathlib.Path(targ_dir).glob("*/*.jpg")) # note: you'd have to update this if you've got .png's or .jpeg's
        # Setup transforms
        self.transform = transform
        # Create classes and class_to_idx attributes
        self.classes, self.class_to_idx = find_classes(targ_dir)

    # 4. Make function to load images
    def load_image(self, index: int) -> Image.Image:
        "Opens an image via a path and returns it."
        image_path = self.paths[index]
        return Image.open(image_path)

    # 5. Overwrite the __len__() method (optional but recommended for subclasses of torch.utils.data.Dataset)
    def __len__(self) -> int:
        "Returns the total number of samples."
        return len(self.paths)

    # 6. Overwrite the __getitem__() method (required for subclasses of torch.utils.data.Dataset)
    def __getitem__(self, index: int) -> Tuple[torch.Tensor, int]:
        "Returns one sample of data, data and label (X, y)."
        img = self.load_image(index)
        class_name  = self.paths[index].parent.name # expects path in data_folder/class_name/image.jpeg
        class_idx = self.class_to_idx[class_name]

        # Transform if necessary
        if self.transform:
            return self.transform(img), class_idx # return data, label (X, y)
        else:
            return img, class_idx # return data, label (X, y)
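
The `glob("*/*.jpg")` pattern above only picks up `.jpg` files. One possible way to accept several extensions is sketched below; `IMAGE_SUFFIXES` and `list_image_paths` are hypothetical names introduced for this example, not part of the class above or of any library.

```python
import tempfile
from pathlib import Path

# Hypothetical set of extensions to accept (an assumption, adjust as needed)
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def list_image_paths(targ_dir):
    """Returns sorted paths of files one level below targ_dir with an accepted suffix."""
    return sorted(p for p in Path(targ_dir).glob("*/*")
                  if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES)

# Tiny demonstration on a throwaway directory with empty placeholder files
demo_root = Path(tempfile.mkdtemp())
(demo_root / "pizza").mkdir()
for name in ["a.jpg", "b.PNG", "c.jpeg", "readme.txt"]:
    (demo_root / "pizza" / name).touch()

found_names = [p.name for p in list_image_paths(demo_root)]
print(found_names)
```

Sorting the paths also gives the dataset a stable index order across runs, which plain `glob` does not promise.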

In [None]:
# Create a transform
from torchvision import transforms
train_transforms = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor()
])

# Don't augment test data, only resize
test_transforms = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor()
])

In [None]:
train_data_custom = ImageFolderCustom(targ_dir=train_dir,
                                      transform=train_transforms)
test_data_custom = ImageFolderCustom(targ_dir=test_dir,
                                     transform=test_transforms)
train_data_custom, test_data_custom


In [None]:
len(train_data_custom), len(test_data_custom)

In [None]:
train_data_custom.classes

In [None]:
train_data

In [None]:
len(train_data_custom), len(test_data_custom), len(train_data), len(test_data)

In [None]:
train_data_custom.classes, train_data_custom.class_to_idx

In [None]:
# Check for equality amongst our custom Dataset and ImageFolder Dataset
print((len(train_data_custom) == len(train_data)) and (len(test_data_custom) == len(test_data)))
print(train_data_custom.classes == train_data.classes)
print(train_data_custom.class_to_idx == train_data.class_to_idx)

### Create a function to display random images

1. Take a `Dataset` and a number of other parameters, such as class names and how many images to visualize.
2. To prevent the display getting out of hand, let's cap the number of images at 10.
3. Set the random seed for reproducibility.
4. Get a list of random indices from the target dataset.
5. Set up a matplotlib plot.
6. Make sure the dimensions of our images line up with Matplotlib (HWC) as opposed to PyTorch (CHW).

In [None]:
# 1. Take in a Dataset as well as a list of class names
def display_random_images(dataset: torch.utils.data.dataset.Dataset,
                          classes: List[str] = None,
                          n: int = 10,
                          display_shape: bool = True,
                          seed: int = None):

    # 2. Adjust display if n too high
    if n > 10:
        n = 10
        display_shape = False
        print(f"For display purposes, n shouldn't be larger than 10, setting to 10 and removing shape display.")

    # 3. Set random seed
    if seed is not None:
        random.seed(seed)

    # 4. Get random sample indexes
    random_samples_idx = random.sample(range(len(dataset)), k=n)

    # 5. Setup plot
    plt.figure(figsize=(16, 8))

    # 6. Loop through samples and display random samples
    for i, targ_sample in enumerate(random_samples_idx):
        targ_image, targ_label = dataset[targ_sample][0], dataset[targ_sample][1]

        # 7. Adjust image tensor shape for plotting: [color_channels, height, width] -> [height, width, color_channels]
        targ_image_adjust = targ_image.permute(1, 2, 0)

        # Plot adjusted samples
        plt.subplot(1, n, i+1)
        plt.imshow(targ_image_adjust)
        plt.axis("off")
        if classes:
            title = f"class: {classes[targ_label]}"
            if display_shape:
                title = title + f"\nshape: {targ_image_adjust.shape}"
            # Only set a title when class names were provided (title is undefined otherwise)
            plt.title(title)


In [None]:
# Display random images from ImageFolder created Dataset
display_random_images(train_data,
                      n=5,
                      classes=class_names,
                      seed=42)

And now with the Dataset we created with our own ImageFolderCustom.



In [None]:
# Display random images from ImageFolderCustom Dataset
display_random_images(train_data_custom,
                      n=5,
                      classes=class_names,
                      seed=42) # Try setting the seed for reproducible images

## Turn our custom loaded images into a DataLoader

In [None]:
from torch.utils.data import DataLoader
BATCH_SIZE = 64
NUM_WORKERS = os.cpu_count()
train_dataloader_custom = DataLoader(dataset=train_data_custom,
                                     batch_size=BATCH_SIZE,
                                     num_workers=NUM_WORKERS,
                                     shuffle=True)

test_dataloader_custom = DataLoader(dataset=test_data_custom,
                                    batch_size=BATCH_SIZE,
                                    num_workers=NUM_WORKERS,
                                    shuffle=False)

train_dataloader_custom, test_dataloader_custom

In [None]:
# Get image and label from custom dataloader
img_custom, label_custom = next(iter(train_dataloader_custom))

# Print out shapes
img_custom.shape, label_custom.shape