<a href="https://colab.research.google.com/github/YuanChenhang/USAAIO/blob/main/PyTorch_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Import necessary libraries**

In [None]:
import torch
from torch.utils.data import Dataset
import numpy as np

**Define a Dataset class**

Three built-in (special) methods to implement:

* Initialization: ```__init__```
* Length: ```__len__```
* Indexing and slicing: ```__getitem__```
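
As a minimal sketch of how these three methods fit together, the toy class below stores the integers `0, ..., n - 1` and their squares. The name `SquaresDataset` and its contents are hypothetical and only serve as an illustration.

```python
import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i**2)."""

    def __init__(self, n):
        # Initialization: store (or precompute) the data once
        self.x = torch.arange(n)
        self.y = self.x ** 2

    def __len__(self):
        # Length: called by len(dataset)
        return self.x.shape[0]

    def __getitem__(self, index):
        # Indexing and slicing: called by dataset[index]
        return self.x[index], self.y[index]

toy = SquaresDataset(5)
print(len(toy))   # 5
print(toy[3])     # (tensor(3), tensor(9))
print(toy[1:3])   # (tensor([1, 2]), tensor([1, 4]))
```

Slicing works here only because `__getitem__` forwards the index to tensors, which accept slices themselves.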

**Example: Build a dataset class to construct an image dataset**

* Raw dataset
    * Image data as a NumPy array with shape ```(num_samples, height, width, num_channels)``` and integer pixel values from 0 to 255
    * Labels

* Dataset to construct
    * Image data as a tensor with shape ```(num_samples, num_channels, height, width)```, normalized to the range [0, 1]
    * Labels

In [None]:
# Build a raw dataset

num_samples = 10
size = (4, 8, 3) # shape of each sample (height, width, num_channels)
num_classes = 5

images = np.random.randint(low = 0, high = 256, size = (num_samples, *size), dtype = np.uint8)
labels = np.random.randint(low = 0, high = num_classes, size = (num_samples,), dtype = np.int64)

In [None]:
class MyDataset(Dataset):
    def __init__(self, images_numpy, labels):
        # Convert uint8 pixel values to float32 and scale them to [0, 1]
        self.images_tensor = torch.from_numpy(images_numpy).to(torch.float32) / 255
        # Reorder axes: (N, H, W, C) -> (N, C, H, W)
        self.images_tensor = self.images_tensor.permute(0, 3, 1, 2)
        self.labels = torch.from_numpy(labels).to(torch.int64)

    def __len__(self):
        return self.images_tensor.shape[0]

    def __getitem__(self, index):
        image = self.images_tensor[index]
        label = self.labels[index]
        # Method 1 of return: Separate items
        return image, label

        # Method 2 of return: Dictionary
        # return {'image': image, 'label': label}



In [None]:
# Construct dataset

dataset = MyDataset(images, labels)
print(len(dataset))
print(dataset[3])

10
(tensor([[[0.3294, 0.1765, 0.7098, 0.3020, 0.7294, 0.6078, 0.1137, 0.1176],
         [0.2471, 0.8157, 0.2784, 0.3137, 0.2392, 0.4706, 0.9843, 0.7804],
         [0.4353, 0.5490, 0.4275, 0.0235, 0.2588, 0.0353, 0.2078, 0.7882],
         [0.1373, 0.0549, 0.4353, 0.0863, 0.0353, 0.6627, 0.1255, 0.5216]],

        [[0.3490, 0.9333, 1.0000, 0.2353, 0.0157, 0.6510, 0.4941, 0.8353],
         [0.4549, 0.9686, 0.5922, 0.9490, 0.5922, 0.1059, 0.0078, 0.6784],
         [0.1373, 0.5529, 0.4275, 0.1529, 0.4902, 0.3137, 0.5569, 0.7725],
         [0.1137, 0.4745, 0.2824, 0.7529, 0.2275, 0.6863, 0.7686, 0.3569]],

        [[0.9686, 0.0000, 0.9725, 0.9961, 0.0471, 0.4588, 0.1176, 0.4078],
         [0.2000, 0.6118, 0.8588, 0.6706, 0.8431, 0.4039, 0.6863, 0.2745],
         [0.8039, 0.7686, 0.8039, 0.5059, 0.6588, 0.6196, 0.3333, 0.8824],
         [0.6510, 0.5882, 0.9608, 0.2588, 0.6039, 0.6588, 0.2078, 0.6039]]]), tensor(0))
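
A dataset like this is typically consumed in slices or in batches. The sketch below repeats the same logic under the hypothetical name `ImageDatasetSketch` so that it runs on its own, then shows slicing and batching with `torch.utils.data.DataLoader`. The batch size of 4 is an arbitrary choice for illustration.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ImageDatasetSketch(Dataset):
    """Same logic as MyDataset, repeated so this snippet runs on its own."""

    def __init__(self, images_numpy, labels):
        images = torch.from_numpy(images_numpy).to(torch.float32) / 255
        self.images_tensor = images.permute(0, 3, 1, 2)  # (N, H, W, C) -> (N, C, H, W)
        self.labels = torch.from_numpy(labels).to(torch.int64)

    def __len__(self):
        return self.images_tensor.shape[0]

    def __getitem__(self, index):
        return self.images_tensor[index], self.labels[index]

# Random stand-in data with the same shapes as the raw dataset above
images_demo = np.random.randint(0, 256, size=(10, 4, 8, 3), dtype=np.uint8)
labels_demo = np.random.randint(0, 5, size=(10,), dtype=np.int64)
demo = ImageDatasetSketch(images_demo, labels_demo)

# Slicing returns several samples at once
image_slice, label_slice = demo[2:5]
print(image_slice.shape)   # torch.Size([3, 3, 4, 8])

# A DataLoader groups samples into batches (here: 4, 4, and 2 samples)
loader = DataLoader(demo, batch_size=4, shuffle=False)
for batch_images, batch_labels in loader:
    print(batch_images.shape, batch_labels.shape)
```

Because every image has the same shape, the default collation of `DataLoader` can stack the samples without any extra code.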


# **Exercise: Build a dataset for next-sentence prediction (NSP)**

* **Raw dataset**
    * Each sample consists of two sentences that follow a logical relation.
    * Each sample's data format: a list with two items. Each item is a 1-dim tensor of token IDs, one ID per token.
    * The end of each sentence is a special token with ID 2.
    * While generating tokens, avoid using special token IDs 0, 1, 2.

* **Dataset to construct**
    * For each sample, fix the first sentence and create multiple subsamples, each with a different second sentence. The second sentences include the ground-truth one and other sentences randomly drawn from the dataset.
    * Each pair of sentences is labeled ```True``` if the second sentence is the ground-truth continuation of the first and ```False``` otherwise. A random draw can land on the ground-truth sentence, in which case the pair is still labeled ```True```.
    * Format of each sentence pair:
    ```torch.cat([torch.tensor([1]), sen1, sen2], dim = 0)```
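
To make the pair format concrete, here is a small sketch with made-up token IDs. The leading ID 1 appears to act as a start-of-sequence marker; that reading is an assumption, since only the format itself is specified.

```python
import torch

# Made-up token IDs; each sentence ends with the special token ID 2
sen1 = torch.tensor([7, 12, 5, 2])
sen2 = torch.tensor([9, 4, 2])

# Prepend ID 1, then concatenate both sentences into one 1-dim tensor
pair = torch.cat([torch.tensor([1]), sen1, sen2], dim=0)
print(pair.tolist())   # [1, 7, 12, 5, 2, 9, 4, 2]

# The positions of ID 2 mark where each sentence ends
end_positions = torch.nonzero(pair == 2).flatten()
print(end_positions.tolist())   # [4, 7]
```

Since ordinary tokens never use ID 2, the first occurrence of ID 2 is enough to recover the boundary between the two sentences.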

* **Key ideas**
    * Negative sampling
    * Data augmentation
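
The dataset class in this exercise draws second sentences uniformly from all candidates, so a draw may coincide with the ground truth. As a variant (not the approach used in this notebook), the sketch below excludes the ground-truth index by giving it zero weight. The values of `num_candidates`, `true_index`, and `num_negatives` are hypothetical.

```python
import torch

num_candidates = 6   # hypothetical number of second sentences
true_index = 2       # hypothetical index of the ground-truth second sentence
num_negatives = 3    # hypothetical number of negatives to draw

# Uniform weights over all candidates, with the ground-truth index zeroed out
weights = torch.ones(num_candidates)
weights[true_index] = 0.0

# torch.multinomial accepts unnormalized, non-negative weights
negative_indices = torch.multinomial(weights, num_samples=num_negatives, replacement=True)
print(negative_indices)   # three indices, none equal to true_index
```

With this variant every sampled pair is a true negative, at the cost of never producing duplicate positive pairs.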

**Build a raw dataset**

In [None]:
num_samples = 10
vocab_size = 30 # token IDs: 0, 1, ..., vocab_size - 1
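length_LB = 5 # lower bound of sentence length (inclusive), excluding the end token
length_UB = 12 # upper bound of sentence length (exclusive), excluding the end token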

raw_dataset = []
for n in range(num_samples):
    # Generate sentence 1
    length_sen1 = torch.randint(low = length_LB, high = length_UB, size = ())
    sen1 = torch.randint(low = 3, high = vocab_size, size = (length_sen1,))
    sen1 = torch.cat([sen1, torch.tensor([2])], dim = 0)

    # Generate sentence 2
    length_sen2 = torch.randint(low = length_LB, high = length_UB, size = ())
    sen2 = torch.randint(low = 3, high = vocab_size, size = (length_sen2,))
    sen2 = torch.cat([sen2, torch.tensor([2])], dim = 0)

    # Put the two sentences into a list
    raw_dataset.append([sen1, sen2])

**Define Dataset class**

In [None]:
class MyNSPDataset(Dataset):
    def __init__(self, raw_dataset, num_noisy_samples):
        num_samples_raw = len(raw_dataset) # Number of samples in the raw dataset
        self.num_samples = num_samples_raw * (1 + num_noisy_samples) # Number of samples in the dataset that we are constructing
        self.inputs = [] # List of all paired sentences
        self.labels = [] # List of labels of all paired sentences

        # Extract sentence 1s and 2s, respectively, from the raw dataset
        sen1_raw_list = []
        sen2_raw_list = []
        for n in range(num_samples_raw):
            sen1_raw_list.append(raw_dataset[n][0])
            sen2_raw_list.append(raw_dataset[n][1])

        # Compute probability distribution of sentence 2s that are randomly drawn
        prob_sen2_indices = 1/num_samples_raw * torch.ones(num_samples_raw)

        for n in range(num_samples_raw):
            # Add the nth ground-truth pair of sentences to the dataset
            self.inputs.append(torch.cat([torch.tensor([1]), sen1_raw_list[n], sen2_raw_list[n]]))
            self.labels.append(torch.tensor(True))

            # Randomly draw sentence 2 indices (uniformly, with replacement; a draw may equal n)
            sen2_indices = torch.multinomial(input = prob_sen2_indices, num_samples = num_noisy_samples, replacement = True)

            # Add the nth sentence 1 and each randomly selected sentence 2 to the dataset
            for m in range(num_noisy_samples):
                self.inputs.append(torch.cat([torch.tensor([1]), sen1_raw_list[n], sen2_raw_list[sen2_indices[m]]]))
                # The label is True only if the drawn sentence 2 is the ground-truth one
                self.labels.append(sen2_indices[m] == n)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, index):
        sentence_pair = self.inputs[index] # avoid shadowing the built-in input()
        label = self.labels[index]
        return {'input': sentence_pair, 'label': label}


**Construct dataset**

In [None]:
dataset_NSP = MyNSPDataset(raw_dataset, 2)
print(len(dataset_NSP))

30
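
The sentence pairs have different lengths, so the default collation of a `DataLoader` cannot stack them into one tensor. One possible workaround, sketched below, pads each batch with `torch.nn.utils.rnn.pad_sequence`. Using ID 0 as the padding token is an assumption: ID 0 is reserved as a special token here, but its purpose is not stated. The helper `pad_collate` and the stand-in samples are hypothetical.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Stand-in samples in the same {'input', 'label'} format as MyNSPDataset
samples = [
    {'input': torch.tensor([1, 5, 6, 2, 7, 2]), 'label': torch.tensor(True)},
    {'input': torch.tensor([1, 8, 2, 9, 4, 3, 2]), 'label': torch.tensor(False)},
    {'input': torch.tensor([1, 3, 2, 4, 2]), 'label': torch.tensor(False)},
]

def pad_collate(batch, pad_id=0):
    # pad_id = 0 is an assumption about the role of the reserved token ID 0
    inputs = pad_sequence([s['input'] for s in batch], batch_first=True, padding_value=pad_id)
    labels = torch.stack([s['label'] for s in batch])
    return {'input': inputs, 'label': labels}

# A plain list works as a map-style dataset for this illustration
loader = DataLoader(samples, batch_size=3, collate_fn=pad_collate)
batch = next(iter(loader))
print(batch['input'].shape)      # torch.Size([3, 7])
print(batch['input'][2].tolist())   # [1, 3, 2, 4, 2, 0, 0]
print(batch['label'].tolist())   # [True, False, False]
```

Every sequence in the batch is padded to the length of the longest one, so the padded length can differ from batch to batch.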


**Copyright Beaver-Edge AI Institute. All Rights Reserved. No part of this document may be copied or reproduced without the written permission of Beaver-Edge AI Institute.**