## How to use dataset and custom dataloaders

A quick tutorial on how to use DataLoaders with PyTorch.

Original code: https://towardsdatascience.com/how-to-use-datasets-and-dataloader-in-pytorch-for-custom-text-data-270eed7f7c00

Wrapping your data in a PyTorch Dataset and reading it through a DataLoader keeps storage separate from batching, shuffling, and iteration.

Definitions:
- Dataset stores all your data
- DataLoader iterates through the data, manages batches, and can transform the data

In [1]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

### Create a Custom Dataset Class

In [2]:
class CustomTextDataset(Dataset):
    def __init__(self, text, labels):
        """
        When initializing the class you need to pass in two variables
        Variables:
        - text
        - labels
        """
        self.labels = labels
        self.text = text

    def __len__(self):
        """
        returns the length of the labels when called
        """
        return len(self.labels)
    
    def __getitem__(self, idx):
        """
        Used by the DataLoader to fetch a single sample
        from the dataset.

        idx is an integer index identifying which data instance
        to return.

        Returns a dictionary holding the text and label of that instance.
        The DataLoader later collates these dictionaries into batches.
        """
        label = self.labels[idx]
        text = self.text[idx]
        sample = {"Text": text, "Class": label}
        return sample

### Initialise the CustomTextDataset class

Example pandas DataFrame

In [3]:
text = ['Happy', 'Amazing', 'Sad', 'Unhapy', 'Glum']
labels = ['Positive', 'Positive', 'Negative', 'Negative', 'Negative']

text_labels_df = pd.DataFrame({'Text': text, 'Labels': labels})

Define the dataset object

In [4]:
TD = CustomTextDataset(text_labels_df['Text'], text_labels_df['Labels'])

The dataset is initialised and ready to be used.

### Dataset Example

Show how the data is stored within the dataset

In [5]:
print('\nFirst iteration of data set: ', next(iter(TD)), '\n')


First iteration of data set:  {'Text': 'Happy', 'Class': 'Positive'} 



In [6]:
print('Length of data set: ', len(TD), '\n')

Length of data set:  5 



In [7]:
print('Entire data set: ', list(DataLoader(TD)), '\n')

Entire data set:  [{'Text': ['Happy'], 'Class': ['Positive']}, {'Text': ['Amazing'], 'Class': ['Positive']}, {'Text': ['Sad'], 'Class': ['Negative']}, {'Text': ['Unhapy'], 'Class': ['Negative']}, {'Text': ['Glum'], 'Class': ['Negative']}] 
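Each value in the output above is wrapped in a list because DataLoader batches by default with batch_size=1. The sketch below contrasts that default with batch_size=None, which turns automatic batching off. PairDataset is a hypothetical stand-in for CustomTextDataset, used only to keep the example self-contained.

```python
from torch.utils.data import DataLoader, Dataset


class PairDataset(Dataset):
    """Hypothetical stand-in that returns the same dict shape as CustomTextDataset."""

    def __init__(self, text, labels):
        self.text = text
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {"Text": self.text[idx], "Class": self.labels[idx]}


samples = PairDataset(["Happy", "Sad"], ["Positive", "Negative"])

# Default batch_size=1: every field is collated into a one-element list.
batched = list(DataLoader(samples))

# batch_size=None disables automatic batching: samples come back unchanged.
unbatched = list(DataLoader(samples, batch_size=None))

print(batched[0])
print(unbatched[0])
```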



### How to preprocess data using collate_fn

In ML and DL, text needs to be cleaned and turned into vectors prior to training. 
DataLoader has a handy parameter called collate_fn. This parameter lets you supply your own processing function, which is applied to each batch of samples before the DataLoader outputs it.

In [8]:
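def collate_batch(batch):
    # Placeholder tensors standing in for a real word vector and class label.
    word_tensor = torch.tensor([[1.], [0.], [45.]])
    label_tensor = torch.tensor([[1.]])

    text_list, classes = [], []

    # Each sample is a dict; its text and label are read but not yet used.
    for sample in batch:
        _text, _class = sample["Text"], sample["Class"]
        text_list.append(word_tensor)
        classes.append(label_tensor)

    # Concatenate the word tensors along the first dimension.
    text = torch.cat(text_list)
    # Works because each label tensor holds exactly one element.
    classes = torch.tensor(classes)

    return text, classes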


In this example, two placeholder tensors are created to represent the word and the class.

In practice, these could be word vectors passed in through another function. 
The batch is then unpacked, and the word and label tensors are added to lists.

The word tensors are then concatenated, and the list of class tensors (each holding a 1 in this case) is combined into a single tensor. The function then returns processed text data ready for training.
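As a rough sketch of what real preprocessing inside a collate function could look like, the example below maps each word and label to an integer ID. VOCAB, LABEL_TO_ID, TinyTextDataset, and encode_batch are hypothetical names invented for this illustration; a real pipeline would more likely use a tokenizer and embedding lookups.

```python
import torch
from torch.utils.data import DataLoader, Dataset

# Hypothetical lookup tables, invented for this sketch only.
VOCAB = {"Happy": 0, "Amazing": 1, "Sad": 2, "Unhapy": 3, "Glum": 4}
LABEL_TO_ID = {"Negative": 0, "Positive": 1}


class TinyTextDataset(Dataset):
    """Hypothetical stand-in that returns the same dict shape as CustomTextDataset."""

    def __init__(self, text, labels):
        self.text = text
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {"Text": self.text[idx], "Class": self.labels[idx]}


def encode_batch(batch):
    # batch is a list of sample dictionaries, one per item in the batch.
    text_ids = [VOCAB[sample["Text"]] for sample in batch]
    label_ids = [LABEL_TO_ID[sample["Class"]] for sample in batch]
    return torch.tensor(text_ids), torch.tensor(label_ids)


dataset = TinyTextDataset(
    ["Happy", "Amazing", "Sad", "Unhapy", "Glum"],
    ["Positive", "Positive", "Negative", "Negative", "Negative"],
)
loader = DataLoader(dataset, batch_size=2, collate_fn=encode_batch)
batches = list(loader)

for text_ids, label_ids in batches:
    print(text_ids, label_ids)
```

Unlike the placeholder version, each batch here depends on the samples it was given, so shuffling the DataLoader would change the tensors that come out.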

In [9]:
# DL_DS = DataLoader(TD, batch_size=2, collate_fn=collate_batch)

#### How to iterate through the dataset when training a model

We will iterate through the Dataset without using collate_fn because it's easier to see how the words and classes are being output by the DataLoader. 

In [10]:
DL_DS = DataLoader(TD, batch_size=2, shuffle=True)

for (idx, batch) in enumerate(DL_DS):
    # print the text data of the batch
    print(idx, 'text data:', batch['Text'])

    # print the class data of batch
    print(idx, 'Class data: ', batch['Class'], '\n')

0 text data: ['Sad', 'Unhapy']
0 Class data:  ['Negative', 'Negative'] 

1 text data: ['Amazing', 'Happy']
1 Class data:  ['Positive', 'Positive'] 

2 text data: ['Glum']
2 Class data:  ['Negative'] 
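The last batch above holds a single sample because five items do not divide evenly by a batch size of two. If uneven batches are a problem for training, DataLoader's drop_last parameter discards the incomplete batch. A minimal sketch, using a plain list of integers as stand-in data:

```python
from torch.utils.data import DataLoader

# Stand-in data: a plain list works as a map-style dataset.
data = [0, 1, 2, 3, 4]

# Default behaviour keeps the smaller final batch.
keep_last = [b.tolist() for b in DataLoader(data, batch_size=2)]

# drop_last=True discards the final batch when it is incomplete.
drop_last = [b.tolist() for b in DataLoader(data, batch_size=2, drop_last=True)]

print(keep_last)
print(drop_last)
```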



In [11]:
DL_DS_CL = DataLoader(TD, batch_size=2, collate_fn=collate_batch, shuffle=True)

for (idx, batch) in enumerate(DL_DS_CL):
    print(idx, batch)

0 (tensor([[ 1.],
        [ 0.],
        [45.],
        [ 1.],
        [ 0.],
        [45.]]), tensor([1., 1.]))
1 (tensor([[ 1.],
        [ 0.],
        [45.],
        [ 1.],
        [ 0.],
        [45.]]), tensor([1., 1.]))
2 (tensor([[ 1.],
        [ 0.],
        [45.]]), tensor([1.]))


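The data does not look correct: every sample comes out identical because collate_batch ignores the batch contents and returns the same placeholder tensors, and torch.cat merges the word tensors into a single column with no batch dimension. Additional work is needed to review this.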
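To see what the concatenation step does to the shapes, here is a minimal sketch comparing torch.cat with torch.stack on the same placeholder tensor. Using torch.stack is a suggestion for the review, not something the original code does.

```python
import torch

# Placeholder word vector with the same shape used in collate_batch: (3, 1).
word_tensor = torch.tensor([[1.0], [0.0], [45.0]])
batch_of_two = [word_tensor, word_tensor]

# torch.cat joins along the existing first dimension, so samples run together.
concatenated = torch.cat(batch_of_two)

# torch.stack adds a new leading batch dimension, one slice per sample.
stacked = torch.stack(batch_of_two)

print(concatenated.shape)  # torch.Size([6, 1])
print(stacked.shape)       # torch.Size([2, 3, 1])
```

With the stacked layout, the first dimension matches the batch size, which makes it easier to line up each word tensor with its label.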
