Old Documentation:

- [`import`](https://docs.python.org/3/reference/simple_stmts.html#the-import-statement)
- [`len`](https://docs.python.org/3/library/functions.html#len)
- [`numpy`](https://numpy.org/doc/1.19/user/whatisnumpy.html)
- [`numpy.array`](https://numpy.org/doc/stable/reference/generated/numpy.array.html)
- [numpy indexing](https://numpy.org/doc/stable/reference/arrays.indexing.html)
- [`torch`](https://pytorch.org/docs/stable/index.html)
- [`torch.Tensor`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor)
- [`torch.utils.data`](https://pytorch.org/docs/stable/data.html#torch.utils.data)
- [`torch.utils.data.Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset)

Import the NumPy (`numpy`) and PyTorch (`torch`) modules.

In [1]:
import numpy as np
import torch

Create the sample dataset to be used in this notebook.

In [2]:
X = np.array([ np.array([[ 2,  3,  4],
                         [ 4,  6,  8],
                         [ 6,  9, 12],
                         [ 8, 12, 16]]),
               np.array([[10, 15, 20],
                         [12, 18, 24]]) ], dtype=object)

**Step 1:** Code to be executed once at the beginning for initialization

In [3]:
def init_(X):
    
    # Create a global variable of X to be accessed in other functions.
    global global_X
    global_X = X
    
    # Create a global variable to store the list of all (sample, row) index pairs.
    global global_index_map
    global_index_map = []
    
    # Use two for loops to create a list of all (sample, row) index pairs.
    for i, x in enumerate(X):
        for j, _ in enumerate(x):
            index_pair = (i, j)
            global_index_map.append(index_pair)
    
    # Create a global variable of the length of the list of pairs to be accessed in other functions
    global global_length
    global_length = len(global_index_map)
    
    return None
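
As a quick check of the two loops above, the sketch below builds the same list of (sample, row) pairs for a dataset with the same row counts as `X` (four rows, then two). It uses plain nested lists in place of the NumPy arrays, an assumption made only to keep the snippet dependency-free, and the helper name `build_index_map` is hypothetical.

```python
def build_index_map(samples):
    """Return every (sample, row) index pair, in order."""
    return [(i, j) for i, sample in enumerate(samples) for j in range(len(sample))]

# Same values as X above, written as nested lists (illustrative stand-in).
samples = [
    [[2, 3, 4], [4, 6, 8], [6, 9, 12], [8, 12, 16]],
    [[10, 15, 20], [12, 18, 24]],
]

index_map = build_index_map(samples)
print(index_map)       # [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1)]
print(len(index_map))  # 6
```

The dataset length is therefore the total number of rows across all samples, not the number of samples.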

**Step 2:** Code to return the number of items as length

In [4]:
def len_():

    # Return the global variable of the length of the list of (sample, row) pairs
    return global_length

**Step 3:** Code to return the x item at sample i, row j

In [5]:
def getitem_(index):
    
    # Get the sample index i and row index j that corresponds to the pair index
    i, j = global_index_map[index]
    
    # Index the global variable X using the sample index i and row index j
    x_item = global_X[i][j, :]
    
    return x_item
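
The lookup list stores one pair per row, so it grows with the total number of rows. One possible alternative, shown here only as a sketch and not as part of this tutorial's method, keeps a running total of row counts per sample and recovers (i, j) with a binary search. The helper names `build_offsets` and `locate` are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

def build_offsets(row_counts):
    """Cumulative row counts, e.g. [4, 2] -> [4, 6]."""
    return list(accumulate(row_counts))

def locate(offsets, index):
    """Map a flat index to its (sample, row) pair."""
    if not 0 <= index < offsets[-1]:
        raise IndexError(index)
    i = bisect_right(offsets, index)        # first sample whose cumulative count exceeds index
    start = offsets[i - 1] if i > 0 else 0  # flat index of that sample's first row
    return i, index - start

offsets = build_offsets([4, 2])  # row counts of the two samples in X
print([locate(offsets, k) for k in range(6)])
# [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1)]
```

This trades a small amount of work per lookup for memory proportional to the number of samples rather than the number of rows. For a dataset as small as `X`, the explicit list is simpler and works just as well.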

**Step 4:** Code to return the collated list of items

In [6]:
def collate_fn_(batch):
    
    # Stack the list of row arrays and convert the result to a single tensor
    batch = torch.as_tensor(np.stack(batch))
    
    return batch

Example of how to use init, len, getitem, and collate as functions. Here, we create a batch of size 4.

In [7]:
# Initialize the dataset X
init_(X)

# Get the size of the dataset X
dataset_size = len_()

# Create a random sample of 4 indices from the dataset X
sample_size = 4
sample_indices = np.random.choice(dataset_size, sample_size, replace=False)

# Create an empty list to store batch items from X
batch = []

# Use a for loop to get all batch items
for i in sample_indices:
    batch.append(getitem_(i))
    
# Collate the list of batch items into a usable form
batch = collate_fn_(batch)
print(batch)

tensor([[ 8, 12, 16],
        [10, 15, 20],
        [12, 18, 24],
        [ 6,  9, 12]])


Example of how to create a Dataset class using init, len, getitem, and collate.

In [8]:
class ExampleDataset(torch.utils.data.Dataset):
    
    def __init__(self, X):
        
        ### Code to be executed once at the beginning for initialization
        self.X = X
        
        self.index_map = []
        for i, x in enumerate(X):
            for j, _ in enumerate(x):
                index_pair = (i, j)
                self.index_map.append(index_pair)
        
        self.length = len(self.index_map)
        
    def __len__(self):
        
        ### Return the number of items as length
        return self.length
    
    def __getitem__(self, index):
        
        ### Return one item at sample i, row j
        i, j = self.index_map[index]
        x_item = self.X[i][j, :]

        return x_item
    
    @staticmethod
    def collate_fn(batch):
        
        ### Specify how to collate list of items and what to return
        batch = torch.as_tensor(np.stack(batch))

        return batch

Example of how to use the dataset class and create a batch of size 4.

In [9]:
# Instantiate the dataset class object
dataset = ExampleDataset(X)

# Create a random sample of 4 indices from the dataset
sample_size = 4
sample_indices = np.random.choice(len(dataset), sample_size, replace=False)

# Create an empty list to store batch items from the dataset
batch = []

# Use a for loop to get all batch items
for i in sample_indices:
    batch.append(dataset[i])
    
# Collate the list of batch items into a usable form
batch = ExampleDataset.collate_fn(batch)
print(batch)

tensor([[ 8, 12, 16],
        [ 2,  3,  4],
        [ 6,  9, 12],
        [ 4,  6,  8]])


Note: The example provided loads items one at a time in a single process. When a dataset class is combined with a data loader class (`torch.utils.data.DataLoader`), batches can be loaded in parallel by multiple worker processes, which can substantially speed up data loading. The data loader class is covered in a future tutorial of this series.
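
As a preview, here is a minimal sketch of how a dataset of this kind might be passed to `torch.utils.data.DataLoader`. The class `RowDataset` and the function `collate_rows` are hypothetical stand-ins for the dataset class and collate function above, redefined here so that the snippet runs on its own.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class RowDataset(Dataset):
    """Hypothetical stand-in for the dataset class above: one item per row."""

    def __init__(self, X):
        self.X = X
        self.index_map = [(i, j) for i, x in enumerate(X) for j in range(len(x))]

    def __len__(self):
        return len(self.index_map)

    def __getitem__(self, index):
        i, j = self.index_map[index]
        return self.X[i][j, :]

def collate_rows(batch):
    # Stack the list of row arrays into one (batch_size, 3) tensor.
    return torch.as_tensor(np.stack(batch))

X = [np.array([[2, 3, 4], [4, 6, 8], [6, 9, 12], [8, 12, 16]]),
     np.array([[10, 15, 20], [12, 18, 24]])]

# num_workers=0 loads in the main process; a positive value starts worker processes.
loader = DataLoader(RowDataset(X), batch_size=4, shuffle=True,
                    num_workers=0, collate_fn=collate_rows)

shapes = [tuple(batch.shape) for batch in loader]
print(shapes)  # [(4, 3), (2, 3)]
```

With six rows and a batch size of 4, one pass yields a batch of four rows and a final batch of two. Setting `num_workers` above 0 enables parallel loading; on platforms that start worker processes by spawning (Windows, for example), the loading code generally needs to sit under an `if __name__ == "__main__":` guard.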