<a href="https://colab.research.google.com/github/bitswired/bitsof-ai/blob/main/projects/torch-data-introduction/TorchData_Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build Better Data Loading Pipelines for PyTorch with TorchData

## Download the data

In [None]:
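# Note: the DataPipes API used below was removed from torchdata in version 0.10,
# so an older release (for example torchdata<0.10) may be needed if the import fails.
!pip install kaggle torchdata torchvision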



In [None]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle (2).json


{'kaggle.json': b'{"username":"jimzer","key":"<redacted>"}'}

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets download puneet6060/intel-image-classification

intel-image-classification.zip: Skipping, found more recently modified local copy (use --force to force download)


In [None]:
!unzip intel-image-classification.zip -d data

In [None]:
!ls data/seg_test/seg_test

buildings  forest  glacier  mountain  sea  street


## General utilities

In [None]:
import glob
import itertools as it
from pathlib import Path

import torch
import torchvision

In [None]:
# Convert split name to folder path
split_to_path = {
    "train": "data/seg_train/seg_train",
    "test": "data/seg_test/seg_test"
}

# Convert class name to int label
name_to_label = {
    "buildings": 0,
    "forest": 1,
    "glacier": 2,
    "mountain": 3,
    "sea": 4,
    "street": 5,
}


# Image transformations to get all images at size 150 x 150
transforms = torch.nn.Sequential(
    torchvision.transforms.Resize((150, 150))
)

def img_path_to_label(path: str):
    """Function to get the class from the file path"""
    name = Path(path).parents[0].stem
    return name_to_label[name]
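
As a quick sanity check, here is a minimal sketch of how the label lookup behaves on a single path. The file name `example.jpg` is hypothetical and only used for illustration; the mapping is repeated so the snippet runs on its own, and `Path(path).parent.name` is used as an equivalent of `parents[0].stem` for these folder names.

```python
from pathlib import Path

# Same mapping as above, repeated so the snippet is self-contained
name_to_label = {
    "buildings": 0, "forest": 1, "glacier": 2,
    "mountain": 3, "sea": 4, "street": 5,
}

def img_path_to_label(path: str) -> int:
    """Map an image path to its integer label via the parent folder name."""
    return name_to_label[Path(path).parent.name]

# Hypothetical file name, used only for illustration
example_path = "data/seg_train/seg_train/forest/example.jpg"
example_label = img_path_to_label(example_path)
print(example_label)
```

The label comes only from the folder that directly contains the image, so the scheme relies on the one-folder-per-class layout shown by `ls` above.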

## Traditional Way: Dataset and DataLoader

In [None]:
from torch.utils.data import Dataset, DataLoader

In [None]:
class IntelDataset(Dataset):
    """Class to represent the Intel Image Classification as a Dataset"""
    def __init__(self, split: str):
        # Get the split path (train or test) from the split name.
        self.path = split_to_path[split]
        # List the files once, sorted so that an index always maps to the same file
        self.files = self._list_files()

    def _list_files(self):
        """List all images (one sub-folder per class)"""
        return sorted(glob.glob(f"{self.path}/*/*.jpg"))

    def __len__(self):
        """Get the length of the dataset"""
        return len(self.files)

    def __getitem__(self, idx: int):
        """Method to access a tuple (input, label) per index"""
        # Get the file path at the received index
        file_path = self.files[idx]
        # Load the image
        image = torchvision.io.read_image(file_path)
        # Get the label from the image path
        label = img_path_to_label(file_path)
        # Return the transformed image with its label
        return transforms(image), label

In [None]:
# Create the Dataset for the train split
ds = IntelDataset("train")
# Create the DataLoader with shuffling and batching
dl = DataLoader(ds, batch_size=10, shuffle=True)

In [None]:
# Iterate over the first 5 batches
for X, y in it.islice(dl, 5):
    print(f"X batch length: {len(X)}, y batch length: {len(y)}, labels: {y}")

X batch length: 10, y batch length: 10, labels: tensor([1, 1, 3, 1, 2, 3, 5, 4, 4, 2])
X batch length: 10, y batch length: 10, labels: tensor([4, 4, 4, 2, 4, 4, 5, 5, 2, 1])
X batch length: 10, y batch length: 10, labels: tensor([4, 3, 4, 3, 3, 2, 4, 1, 1, 4])
X batch length: 10, y batch length: 10, labels: tensor([1, 4, 0, 4, 4, 5, 1, 3, 5, 0])
X batch length: 10, y batch length: 10, labels: tensor([3, 0, 5, 5, 1, 1, 2, 2, 1, 0])
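
To make the shuffling and batching less magical, here is a rough plain-Python sketch of the index handling behind `DataLoader(ds, batch_size=10, shuffle=True)`. It is not PyTorch's implementation; `iter_batches` and its `seed` parameter are hypothetical names used for illustration only.

```python
import random

def iter_batches(n_items: int, batch_size: int, seed: int = 0):
    """Yield lists of shuffled indices, batch_size at a time."""
    indices = list(range(n_items))
    # Shuffle the indices once per pass over the data
    random.Random(seed).shuffle(indices)
    for start in range(0, n_items, batch_size):
        # The last batch may be smaller when n_items is not a multiple of batch_size
        yield indices[start:start + batch_size]

batches = list(iter_batches(n_items=25, batch_size=10))
print([len(b) for b in batches])
```

Each index in a batch would then be passed to `__getitem__`, and the resulting samples collated into tensors.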


## The New Way: TorchData DataPipes

### Quick introduction

TorchData is a library containing modular, composable data loading primitives to build flexible and performant data loading pipelines.

These are called DataPipes.

There are two main types of DataPipes:
- `IterDataPipe`
- `MapDataPipe`

#### IterDataPipe
These DataPipes are an updated version of `IterableDataset` from `torch.utils.data`.
They are well-suited for streaming datasets, where random reads are expensive.
They behave like an iterator: you can iterate over them, but you can't access items individually based on an index.

In [None]:
import torchdata.datapipes as dp

# IterDataPipe of the first 10 integers grouped in 2 batches: even and odd
pipe = (
    # Wrap the range into an IterDataPipe wrapper
    dp.iter.IterableWrapper(range(10))
    # Group by parity: one batch for even and one batch for odd numbers
    .groupby(lambda x: x % 2)
)
# We can iterate over the items
print("Complete iteration:", list(pipe))
# But we can't access them individually based on an index
# pipe[0] would raise an Exception



Complete iteration: [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]
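
For intuition, the grouping step can be sketched in plain Python. This is only an analogy, not how TorchData implements `groupby`; the helper name `group_by_key` is hypothetical.

```python
from collections import defaultdict

def group_by_key(items, key_fn):
    """Collect items into lists keyed by key_fn, in first-seen key order."""
    groups = defaultdict(list)
    for item in items:
        groups[key_fn(item)].append(item)
    # Dicts keep insertion order, so groups come out in the order their key first appeared
    return list(groups.values())

grouped = group_by_key(range(10), lambda x: x % 2)
print(grouped)
```

The sketch consumes the whole input before returning anything, which is acceptable for ten integers but is exactly the kind of cost to keep in mind with large streams.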


#### MapDataPipe
These DataPipes are an updated version of the map-style `Dataset` from `torch.utils.data`.
They are well-suited for key-value datasets, where random reads are cheap.
They behave like a dict: you can iterate over the values, and you can also access items individually based on their index.

In [None]:
# MapDataPipe of the first 10 integers multiplied by 2 and shuffled
pipe = (
    # Wrap the range into a MapDataPipe wrapper
    dp.map.SequenceWrapper(range(10))
    # Multiply every number by 2
    .map(lambda x: x * 2)
    # Shuffle
    # Note: in some later versions, shuffle() on a MapDataPipe returns an
    # IterDataPipe, in which case the index access below is not available
    .shuffle()
)

# We can iterate over the values
print("Complete iteration:", list(pipe))
# We can also access items individually based on their index
print("Index based access:", pipe[0], pipe[9])


Complete iteration: [4, 8, 18, 10, 14, 12, 16, 6, 0, 2]
Index based access: 4 2
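
The key idea of a map-style pipe is that a transformation is applied when an item is requested by index, not up front. Here is a toy plain-Python sketch of that behaviour (without shuffling); `LazyMappedSequence` is a hypothetical class for illustration, not part of TorchData.

```python
class LazyMappedSequence:
    """Toy map-style container: applies fn on access instead of up front."""

    def __init__(self, data, fn):
        self.data = data
        self.fn = fn

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # The function only runs for the item that is actually requested
        return self.fn(self.data[idx])

doubled = LazyMappedSequence(range(10), lambda x: x * 2)
first, last = doubled[0], doubled[9]
all_values = [doubled[i] for i in range(len(doubled))]
print(first, last, all_values)
```

Because each access is independent, random reads stay cheap as long as the underlying data supports them.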


### Load Intel Image Classification data with TorchData DataPipes

In [None]:
import torchdata.datapipes as dp
from torch.utils.data import default_collate

In [None]:
def build_datapipes(split: str):
    """Function to return the DataPipe based on the split name""" 

    # Get the split path (train or test) from the split name.
    path = split_to_path[split]

    return (
        # Iterate over all file paths
        dp.iter.FileLister(path, recursive=True)
        # Transform paths to tuples (path, label)
        .map(lambda x: (x, img_path_to_label(x)))
        # We need a key to transform our IterDataPipe into a MapDataPipe
        # Enumerate will yield: (index, (path, label))
        .enumerate()
        # Get a MapDataPipe, it's like a dictionary with key-based access
        .to_map_datapipe()
        # Read the image and yield (image tensor, label)
        .map(lambda x: (torchvision.io.read_image(x[0]), x[1]))
        # Resize the image using our transform and yield (transformed image, label)
        .map(lambda x: (transforms(x[0]), x[1]))
        # Shuffle the DataPipe
        .shuffle()
        # Get batches of 10
        .batch(10)
        # Collate the batches. Transforms [(image, label)] to
        # (images, labels)
        .map(default_collate)
    )
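
The last step of the pipeline, going from a list of `(image, label)` pairs to a pair `(images, labels)`, can be sketched in plain Python. This is a simplified stand-in: `default_collate` additionally stacks tensors into a single batched tensor, which the hypothetical `simple_collate` below does not do, and the strings stand in for image tensors.

```python
def simple_collate(batch):
    """Turn [(x0, y0), (x1, y1), ...] into ([x0, x1, ...], [y0, y1, ...])."""
    inputs, labels = zip(*batch)
    return list(inputs), list(labels)

# Placeholder strings stand in for image tensors
batch = [("img_a", 5), ("img_b", 4), ("img_c", 0)]
images, labels = simple_collate(batch)
print(images, labels)
```

This is why the loop over the pipe can unpack each batch directly as `X, y`.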

In [None]:
pipe = build_datapipes("train")
# Iterate over the first 5 batches
for X, y in it.islice(pipe, 5):
    print(f"X batch length: {len(X)}, y batch length: {len(y)}, labels: {y}")

  "Data from prior DataPipe are loaded to get length of"


X batch length: 10, y batch length: 10, labels: tensor([5, 4, 0, 3, 5, 2, 1, 3, 1, 5])
X batch length: 10, y batch length: 10, labels: tensor([3, 4, 3, 0, 5, 0, 0, 0, 3, 5])
X batch length: 10, y batch length: 10, labels: tensor([4, 4, 1, 4, 2, 5, 2, 1, 3, 5])
X batch length: 10, y batch length: 10, labels: tensor([3, 5, 2, 0, 4, 1, 3, 0, 5, 3])
X batch length: 10, y batch length: 10, labels: tensor([3, 5, 4, 3, 5, 1, 2, 4, 5, 4])
