# Creating an NLP Data Loader

## Importing libraries

In [2]:
import pandas as pd
from torch.utils.data import Dataset, DataLoader
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import multi30k, Multi30k
from typing import Iterable, List
from torch.nn.utils.rnn import pad_sequence
from torchdata.datapipes.iter import IterableWrapper, Mapper
import torchtext

import torch
import torch.nn as nn
import torch.optim as optim

import numpy as np
import random
import spacy

## Dataset

### Data loader

A data loader in PyTorch is responsible for loading and batching data from a data set. In NLP applications, the data loader often does more than serve samples from the data set: it can also process and transform the text data.

Key parameters include the data set, the batch size, and whether to shuffle the samples.
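As a rough illustration of these parameters, the sketch below builds a `DataLoader` over a toy data set of ten integers. The toy data and the use of `drop_last` are assumptions added for illustration and do not come from the notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: ten integer "samples", 0 to 9
dataset = TensorDataset(torch.arange(10))

# batch_size sets how many samples are grouped per batch.
# shuffle=False keeps the original order.
# drop_last=True discards the final, smaller batch (10 samples in
# batches of 4 would otherwise leave a batch of 2).
loader = DataLoader(dataset, batch_size=4, shuffle=False, drop_last=True)

# Each batch is a list with one tensor, because the data set has one field
batches = [batch[0].tolist() for batch in loader]
print(batches)
```

Setting `shuffle=True` would reorder the samples at the start of every pass over the loader.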

**Data loaders also provide an iterator interface, making it easy to iterate over batches of data during training.**

An iterator is an object that can be looped over. It typically contains two methods: `__iter__()` and `__next__()`. When there are no more elements to iterate over, it raises a `StopIteration` exception.
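The iterator protocol described above can be sketched with a small hand-written class. `SentenceIterator` is a hypothetical name used only for this illustration; it is not part of PyTorch.

```python
class SentenceIterator:
    """Minimal iterator over a list of sentences."""

    def __init__(self, sentences):
        self._sentences = sentences
        self._position = 0

    def __iter__(self):
        # An iterator returns itself from __iter__
        return self

    def __next__(self):
        # Signal the end of iteration once every sentence has been returned
        if self._position >= len(self._sentences):
            raise StopIteration
        sentence = self._sentences[self._position]
        self._position += 1
        return sentence


iterator = SentenceIterator(["Fame's a fickle friend, Harry.", "You are awesome!"])
first = next(iterator)      # fetch one element explicitly
remaining = list(iterator)  # consume the rest until StopIteration
print(first)
print(remaining)
```

A `for` loop calls `__iter__()` once and then `__next__()` repeatedly, stopping when `StopIteration` is raised.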

This feature is useful because the data handled in natural language processing tasks are often very large. By using an iterator we can traverse large data sets without loading all elements into memory simultaneously, making the process more memory-efficient.
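The memory argument can be illustrated with a generator that produces batches lazily. This is a plain-Python sketch, not how `DataLoader` is implemented; `batched_lines` and `fake_corpus` are hypothetical helpers.

```python
from itertools import islice


def batched_lines(lines, batch_size):
    """Yield lists of up to batch_size items, reading the source lazily."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_corpus(n):
    """Stand-in for a large text source; sentences are created on demand."""
    for i in range(n):
        yield f"sentence {i}"


# Only one batch is held in memory at a time while iterating
batches = list(batched_lines(fake_corpus(5), batch_size=2))
print(batches)
```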

The data loader converts input data and labels into batches of tensors with the same shape for deep learning models to interpret.
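A minimal sketch of this behaviour, assuming a toy data set of six two-dimensional feature vectors with binary labels (the values are invented for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical inputs (6 samples, 2 features each) and labels
features = torch.arange(12, dtype=torch.float32).reshape(6, 2)
labels = torch.tensor([0, 1, 0, 1, 0, 1])

loader = DataLoader(TensorDataset(features, labels), batch_size=3)

# Every batch is a pair of tensors: inputs and the matching labels
shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
print(shapes)
```

Each batch keeps inputs and labels aligned, so row `i` of the input tensor corresponds to entry `i` of the label tensor.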

A data loader can also be used for tasks such as tokenising, sequencing, converting samples to the same size, and transforming the data into tensors that a model can understand.

### Custom data set and data loader in PyTorch

In this part we learn how to create a custom data set and use the DataLoader class in PyTorch.

We start by defining a custom data set called CustomDataset. This data set inherits from the `torch.utils.data.Dataset` class and is initialised with a list of sentences. 

We then create a DataLoader by providing this custom data set and batch size to the `torch.utils.data.DataLoader` class. 

Lastly, we iterate through the DataLoader to demonstrate how data is loaded in batches. 

In [2]:
sentences = [
    "If you want to know what a man's like, take a good look at how he treats his inferiors, not his equals.",
    "Fame's a fickle friend, Harry.",
    "It is our choices, Harry, that show what we truly are, far more than our abilities.",
    "Soon we must all face the choice between what is right and what is easy.",
    "Youth can not know how age thinks and feels. But old men are guilty if they forget what it was to be young.",
    "You are awesome!"
]

In [3]:
# Define a custom dataset

class CustomDataset(Dataset):
    def __init__(self, sentences):
        self.sentences = sentences
    
    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        return self.sentences[idx]

In [4]:
# Create an instance of the custom dataset

custom_dataset = CustomDataset(sentences)

In [5]:
# Define batch size

batch_size = 2

In [6]:
# Create a DataLoader

dataloader = DataLoader(custom_dataset, batch_size=batch_size, shuffle=True)

In [7]:
# Iterate through the DataLoader

for batch in dataloader:
    print(batch)

["Fame's a fickle friend, Harry.", 'Soon we must all face the choice between what is right and what is easy.']
['Youth can not know how age thinks and feels. But old men are guilty if they forget what it was to be young.', 'You are awesome!']
['It is our choices, Harry, that show what we truly are, far more than our abilities.', "If you want to know what a man's like, take a good look at how he treats his inferiors, not his equals."]


### Creating tensors for custom data set

The custom data set can also tokenise each sentence and convert the tokens into a tensor of vocabulary indices.

In [12]:
class CustomDataset(Dataset):
    def __init__(self, sentences, tokenizer, vocab):
        self.sentences = sentences
        self.tokenizer = tokenizer
        self.vocab = vocab
    
    def __len__(self):
        return len(self.sentences)
    
    def __getitem__(self, idx):
        tokens = self.tokenizer(self.sentences[idx])
        
        # Convert tokens to tensor indices using vocab
        tensor_indices = [self.vocab[token] for token in tokens]

        return torch.tensor(tensor_indices)

In [13]:
# Tokenizer
tokenizer = get_tokenizer("basic_english")

# Build vocabulary
vocab = build_vocab_from_iterator(map(tokenizer, sentences))

# Create an instance of the custom data set
custom_dataset = CustomDataset(sentences, tokenizer, vocab)

In [10]:
print("Custom Dataset Length:", len(custom_dataset))
print("Sample Items:")
for i in range(6):
    sample_item = custom_dataset[i]
    print(f"Item {i+1}: {sample_item}")

Custom Dataset Length: 6
Sample Items:
Item 1: tensor([11, 19, 63, 17, 13,  2,  3, 47,  6, 16, 45,  0, 55,  3, 41, 46, 24, 10,
        43, 61,  9, 44,  0, 14,  9, 33,  1])
Item 2: tensor([35,  6, 16,  3, 38, 40,  0,  8,  1])
Item 3: tensor([12,  5, 15, 31,  0,  8,  0, 57, 53,  2, 18, 62,  4,  0, 36, 49, 56, 15,
        21,  1])
Item 4: tensor([54, 18, 50, 23, 34, 58, 30, 27,  2,  5, 52,  7,  2,  5, 32,  1])
Item 5: tensor([66, 29, 14, 13, 10, 22, 60,  7, 37,  1, 28, 51, 48,  4, 42, 11, 59, 39,
         2, 12, 64, 17, 26, 65,  1])
Item 6: tensor([19,  4, 25, 20])


In [14]:
dataloader = DataLoader(custom_dataset, batch_size=batch_size, shuffle=True)

for batch in dataloader:
    print(batch)

RuntimeError: stack expects each tensor to be equal size, but got [4] at entry 0 and [20] at entry 1

The above error arises because the tensors have different lengths, so they cannot be stacked into a single batch tensor. We need to define a custom collate function that pads the sequences so that they all have the same length.
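The failure and the fix can be reproduced in isolation. The sketch below uses two of the index tensors printed earlier (items 6 and 2) and contrasts `torch.stack`, which is what the default collation attempts, with `pad_sequence`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

short = torch.tensor([19, 4, 25, 20])
longer = torch.tensor([35, 6, 16, 3, 38, 40, 0, 8, 1])

# Stacking tensors of different lengths raises a RuntimeError
try:
    torch.stack([short, longer])
    stack_failed = False
except RuntimeError:
    stack_failed = True

# pad_sequence fills the shorter tensor with padding_value up to the longest length
padded = pad_sequence([short, longer], batch_first=True, padding_value=0)
print(stack_failed, tuple(padded.shape))
print(padded[0].tolist())
```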

In [19]:
# Create a custom collate function

def collate_fn(batch):
    """
    Pad sequences within the batch to have equal lengths.
    `padding_value` specifies the value to use for padding.
    """
    padded_batch = pad_sequence(batch, batch_first=True, padding_value=0)
    return padded_batch

In [20]:
dataloader = DataLoader(custom_dataset, batch_size=batch_size, collate_fn=collate_fn)

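# Look up the index-to-token list once instead of on every access
itos = vocab.get_itos()

for batch in dataloader:
    for row in batch:
        # Map each index in the padded row back to its token
        words = [itos[idx] for idx in row]
        print(words)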

['if', 'you', 'want', 'to', 'know', 'what', 'a', 'man', "'", 's', 'like', ',', 'take', 'a', 'good', 'look', 'at', 'how', 'he', 'treats', 'his', 'inferiors', ',', 'not', 'his', 'equals', '.']
['fame', "'", 's', 'a', 'fickle', 'friend', ',', 'harry', '.', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',']
['it', 'is', 'our', 'choices', ',', 'harry', ',', 'that', 'show', 'what', 'we', 'truly', 'are', ',', 'far', 'more', 'than', 'our', 'abilities', '.']
['soon', 'we', 'must', 'all', 'face', 'the', 'choice', 'between', 'what', 'is', 'right', 'and', 'what', 'is', 'easy', '.', ',', ',', ',', ',']
['youth', 'can', 'not', 'know', 'how', 'age', 'thinks', 'and', 'feels', '.', 'but', 'old', 'men', 'are', 'guilty', 'if', 'they', 'forget', 'what', 'it', 'was', 'to', 'be', 'young', '.']
['you', 'are', 'awesome', '!', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',']
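In the output above, the padded positions decode to `','` because `padding_value=0` and index 0 in this vocabulary happens to belong to the comma. One common way to avoid the collision is to reserve a dedicated padding token. With torchtext this would typically be done through the `specials` argument of `build_vocab_from_iterator`; the sketch below instead uses a small hand-written vocabulary (an assumption, so that it depends only on PyTorch).

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical tiny vocabulary; "<pad>" is reserved at index 0 so that
# no real token shares the padding index.
itos = ["<pad>", ",", ".", "fame", "harry", "you"]
stoi = {token: index for index, token in enumerate(itos)}
PAD_IDX = stoi["<pad>"]

sequences = [
    torch.tensor([stoi[t] for t in ["fame", ",", "harry", "."]]),
    torch.tensor([stoi[t] for t in ["you", "."]]),
]
padded = pad_sequence(sequences, batch_first=True, padding_value=PAD_IDX)

# Decode each row, skipping the padded positions
decoded = [[itos[i] for i in row.tolist() if i != PAD_IDX] for row in padded]
print(padded.tolist())
print(decoded)
```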


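Setting `batch_first=False` changes the shape of the padded batch from (batch size, sequence length) to (sequence length, batch size), so iterating over a batch yields one time step at a time.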
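The two layouts can be compared directly on a toy input (the values below are invented for illustration):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]

batch_major = pad_sequence(sequences, batch_first=True)   # (batch, max_len)
time_major = pad_sequence(sequences, batch_first=False)   # (max_len, batch)

print(tuple(batch_major.shape), batch_major.tolist())
print(tuple(time_major.shape), time_major.tolist())
```

The two results are transposes of each other; only the order of the dimensions changes.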

In [17]:
# Create a custom collate function

def collate_fn(batch):
    """
    Pad sequences within the batch to have equal lengths.
    `padding_value` specifies the value to use for padding.
    """
    padded_batch = pad_sequence(batch, batch_first=False, padding_value=0)
    return padded_batch

In [18]:
dataloader = DataLoader(custom_dataset, batch_size=batch_size, collate_fn=collate_fn)

itos = vocab.get_itos()

for batch in dataloader:
    # With batch_first=False, each "row" is one time step across the batch
    for step in batch:
        words = [itos[idx] for idx in step]
        print(words)

['if', 'fame']
['you', "'"]
['want', 's']
['to', 'a']
['know', 'fickle']
['what', 'friend']
['a', ',']
['man', 'harry']
["'", '.']
['s', ',']
['like', ',']
[',', ',']
['take', ',']
['a', ',']
['good', ',']
['look', ',']
['at', ',']
['how', ',']
['he', ',']
['treats', ',']
['his', ',']
['inferiors', ',']
[',', ',']
['not', ',']
['his', ',']
['equals', ',']
['.', ',']
['it', 'soon']
['is', 'we']
['our', 'must']
['choices', 'all']
[',', 'face']
['harry', 'the']
[',', 'choice']
['that', 'between']
['show', 'what']
['what', 'is']
['we', 'right']
['truly', 'and']
['are', 'what']
[',', 'is']
['far', 'easy']
['more', '.']
['than', ',']
['our', ',']
['abilities', ',']
['.', ',']
['youth', 'you']
['can', 'are']
['not', 'awesome']
['know', '!']
['how', ',']
['age', ',']
['thinks', ',']
['and', ',']
['feels', ',']
['.', ',']
['but', ',']
['old', ',']
['men', ',']
['are', ',']
['guilty', ',']
['if', ',']
['they', ',']
['forget', ',']
['what', ',']
['it', ',']
['was', ',']
['to', ',']
['be', ',']
['young', ',']
['.', ',']

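Observe that all sequences within a batch are padded to the same length (the length of the longest sequence in that batch), while different batches can have different lengths. The output below was produced with the `batch_first=True` collate function defined earlier.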

In [22]:
for batch in dataloader:
    print(batch)
    print("Length of sequences in the batch:", batch.shape[1])

tensor([[11, 19, 63, 17, 13,  2,  3, 47,  6, 16, 45,  0, 55,  3, 41, 46, 24, 10,
         43, 61,  9, 44,  0, 14,  9, 33,  1],
        [35,  6, 16,  3, 38, 40,  0,  8,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0]])
Length of sequences in the batch: 27
tensor([[12,  5, 15, 31,  0,  8,  0, 57, 53,  2, 18, 62,  4,  0, 36, 49, 56, 15,
         21,  1],
        [54, 18, 50, 23, 34, 58, 30, 27,  2,  5, 52,  7,  2,  5, 32,  1,  0,  0,
          0,  0]])
Length of sequences in the batch: 20
tensor([[66, 29, 14, 13, 10, 22, 60,  7, 37,  1, 28, 51, 48,  4, 42, 11, 59, 39,
          2, 12, 64, 17, 26, 65,  1],
        [19,  4, 25, 20,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0]])
Length of sequences in the batch: 25


The collate function can also handle tasks such as tokenisation, converting tokens to vocabulary indices, and transforming the result into a tensor.

In [47]:
# Define a custom data set
class CustomDataset(Dataset):
    def __init__(self, sentences):
        self.sentences = sentences

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        return self.sentences[idx]

custom_dataset = CustomDataset(sentences)

In [48]:
def collate_fn(batch):
    # Tokenize each sample in the batch using the specified tokenizer
    tensor_batch = []
    for sample in batch:
        tokens = tokenizer(sample)
        # Convert tokens to vocabulary indices and create a tensor for each sample
        tensor_batch.append(torch.tensor([vocab[token] for token in tokens]))

    # Pad sequences within the batch to have equal lengths using pad_sequence
    # batch_first=True ensures that the tensors have shape (batch_size, max_sequence_length)
    padded_batch = pad_sequence(tensor_batch, batch_first=True)

    # Return the padded batch
    return padded_batch

In [49]:
dataloader = DataLoader(
    dataset=custom_dataset,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=collate_fn
)

In [50]:
for batch in dataloader:
    print(batch)

tensor([[54, 18, 50, 23, 34, 58, 30, 27,  2,  5, 52,  7,  2,  5, 32,  1],
        [35,  6, 16,  3, 38, 40,  0,  8,  1,  0,  0,  0,  0,  0,  0,  0]])
tensor([[19,  4, 25, 20,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0],
        [66, 29, 14, 13, 10, 22, 60,  7, 37,  1, 28, 51, 48,  4, 42, 11, 59, 39,
          2, 12, 64, 17, 26, 65,  1]])
tensor([[12,  5, 15, 31,  0,  8,  0, 57, 53,  2, 18, 62,  4,  0, 36, 49, 56, 15,
         21,  1,  0,  0,  0,  0,  0,  0,  0],
        [11, 19, 63, 17, 13,  2,  3, 47,  6, 16, 45,  0, 55,  3, 41, 46, 24, 10,
         43, 61,  9, 44,  0, 14,  9, 33,  1]])


## Exercise

Create a data loader with a collate function that processes batches of French text (provided below). Sort the data set by sequence length, then tokenize, numericalize and pad the sequences. Sorting the sequences minimizes the number of `<PAD>` tokens added to each batch, which reduces the computation spent on padding. Prepare the data in batches of size 4 and print them.
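The effect of sorting on padding can be checked with a short plain-Python sketch. It assumes whitespace tokenization (not the spaCy tokenizer used in the solution) and a batch size of 2 on four sentences taken from the corpus; `padding_needed` is a hypothetical helper.

```python
def padding_needed(token_lists, batch_size):
    """Count the pad tokens required when each batch is padded to its longest item."""
    total = 0
    for start in range(0, len(token_lists), batch_size):
        batch = token_lists[start:start + batch_size]
        longest = max(len(tokens) for tokens in batch)
        total += sum(longest - len(tokens) for tokens in batch)
    return total


sample = [
    "Bonjour !",
    "Je suis en train d'apprendre le français.",
    "Merci beaucoup !",
    "Il y a beaucoup de monde dans cette ville.",
]
tokens = [sentence.split() for sentence in sample]  # assumed whitespace tokenizer

unsorted_pads = padding_needed(tokens, batch_size=2)
sorted_pads = padding_needed(sorted(tokens, key=len), batch_size=2)
print(unsorted_pads, sorted_pads)
```

Grouping sentences of similar length means each batch is padded only up to a nearby maximum rather than to the length of an unrelated long sentence.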

In [3]:
corpus = [
    "Ceci est une phrase.",
    "C'est un autre exemple de phrase.",
    "Voici une troisième phrase.",
    "Il fait beau aujourd'hui.",
    "J'aime beaucoup la cuisine française.",
    "Quel est ton plat préféré ?",
    "Je t'adore.",
    "Bon appétit !",
    "Je suis en train d'apprendre le français.",
    "Nous devons partir tôt demain matin.",
    "Je suis heureux.",
    "Le film était vraiment captivant !",
    "Je suis là.",
    "Je ne sais pas.",
    "Je suis fatigué après une longue journée de travail.",
    "Est-ce que tu as des projets pour le week-end ?",
    "Je vais chez le médecin cet après-midi.",
    "La musique adoucit les mœurs.",
    "Je dois acheter du pain et du lait.",
    "Il y a beaucoup de monde dans cette ville.",
    "Merci beaucoup !",
    "Au revoir !",
    "Je suis ravi de vous rencontrer enfin !",
    "Les vacances sont toujours trop courtes.",
    "Je suis en retard.",
    "Félicitations pour ton nouveau travail !",
    "Je suis désolé, je ne peux pas venir à la réunion.",
    "À quelle heure est le prochain train ?",
    "Bonjour !",
    "C'est génial !"
]

In [6]:
# Define a custom data set
class CustomDataset(Dataset):
    def __init__(self, sentences):
        self.sentences = sentences
    
    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        return self.sentences[idx]

tokenizer = get_tokenizer('spacy', language='fr_core_news_sm')
vocab = build_vocab_from_iterator(map(tokenizer, corpus))
sorted_data = sorted(corpus, key=lambda x:len(tokenizer(x)))

# Define a collate function
def collate_fn(batch):
    tensor_batch = []
    for sample in batch:
        tokens = tokenizer(sample)

        # Convert tokens to vocabulary indices and create a tensor for each sample
        tensor_batch.append(torch.tensor([vocab[token] for token in tokens]))
    
    # Pad sequences within the batch to have equal lengths using pad_sequence
    padded_batch = pad_sequence(tensor_batch, batch_first=True)

    return padded_batch

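# Create a data loader from the length-sorted sentences so that each batch
# groups sentences of similar length (the output shown below was produced
# with the unsorted `corpus`)
custom_dataset = CustomDataset(sorted_data)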
dataloader = DataLoader(
    dataset=custom_dataset,
    batch_size=4,
    shuffle=False,
    collate_fn=collate_fn
)

for batch in dataloader:
    print(batch)

tensor([[ 28,   4,  10,   9,   0,   0,   0,   0],
        [ 11,   4, 111,  50,  68,   5,   9,   0],
        [ 38,  10, 107,   9,   0,   0,   0,   0],
        [ 12,  69,  51,  49,   0,   0,   0,   0]])
tensor([[ 31,  43,   8,  15,  57,  73,   0],
        [ 37,   4,  19,  92,  95,   7,   0],
        [  1, 105,  41,   0,   0,   0,   0],
        [ 26,  45,   2,   0,   0,   0,   0]])
tensor([[  1,   3,  14,  20,  58,  44,   6,  72,   0],
        [ 36,  62,  90, 110,  60,  83,   0,   0,   0],
        [  1,   3,  76,   0,   0,   0,   0,   0,   0],
        [ 33,  71, 122, 117,  52,   2,   0,   0,   0]])
tensor([[  1,   3,  82,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
        [  1,  16, 103,  17,   0,   0,   0,   0,   0,   0,   0,   0,   0],
        [  1,   3,  70,  46,  10,  81,  78,   5,  21,   0,   0,   0,   0],
        [ 29,  24,  96, 109,  48,  61,  94,  18,   6, 118,  23,  65,   7]])
tensor([[  1, 113,  55,   6,  86,  53,  47,   0,   0,   0],
        [ 32,  85,  42,  80,  87,   

## Data loader for German-English translation task

- Data set configuration and language definition
- Tokenizer setup
- Token generation
- Special symbols
- Vocabulary building
- Default token handling
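The special-symbol, vocabulary-building and default-token steps listed above can be sketched in plain Python. The symbol names, the toy German sentences and the helper functions are assumptions made for illustration; with torchtext the default index would typically be set through `vocab.set_default_index`.

```python
from collections import Counter

# Assumed special symbols; their positions fix the reserved indices
SPECIAL_SYMBOLS = ["<unk>", "<pad>", "<bos>", "<eos>"]
UNK_IDX = SPECIAL_SYMBOLS.index("<unk>")


def tokenize(sentence):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return sentence.lower().split()


def build_vocab(token_lists):
    """Map tokens to indices, special symbols first, then by descending frequency."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    return {token: index for index, token in enumerate(SPECIAL_SYMBOLS + ordered)}


def numericalize(tokens, stoi):
    """Convert tokens to indices; unknown tokens fall back to UNK_IDX."""
    return [stoi.get(token, UNK_IDX) for token in tokens]


# Hypothetical German training sentences
german = ["Ein Hund rennt .", "Ein Mann liest ."]
stoi = build_vocab([tokenize(sentence) for sentence in german])

# "katze" never appeared in training, so it maps to the <unk> index
indices = numericalize(tokenize("Ein Katze rennt ."), stoi)
print(indices)
```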

In [48]:
# Override the data set URLs because the links to the original data set are broken

multi30k.URL["train"] = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0205EN-SkillsNetwork/training.tar.gz"
multi30k.URL["valid"] = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0205EN-SkillsNetwork/validation.tar.gz"

In [50]:
SRC_LANGUAGE, TGT_LANGUAGE = "de", "en"  # source: German, target: English
train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))

In [51]:
data_set = iter(train_iter)

In [52]:
for n in range(5):
    # Getting the next pair of source and target sentences from the training data set
    src, tgt = next(data_set)

    # Printing the source (German) and target (English) sentences
    print(f"sample {str(n+1)}")
    print(f"Source ({SRC_LANGUAGE}): {src}\nTarget ({TGT_LANGUAGE}): {tgt}")

AttributeError: 'NoneType' object has no attribute 'Lock'
This exception is thrown by __iter__ of _MemoryCellIterDataPipe(remember_elements=1000, source_datapipe=_ChildDataPipe)
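An error of this form is commonly reported when the optional `portalocker` package, which torchdata uses for file locking while caching downloads, is not installed. That diagnosis is an assumption about this environment rather than something the notebook confirms. A quick way to check, sketched with the hypothetical helper `missing_packages`:

```python
import importlib.util


def missing_packages(names):
    """Return the package names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "portalocker" is the package assumed to be missing here
missing = missing_packages(["portalocker"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("portalocker is installed")
```

If the package is reported as missing, installing it and restarting the kernel before recreating the `Multi30k` iterator may resolve the error.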

In [45]:
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"

In [49]:
SRC_LANGUAGE = "de"
TGT_LANGUAGE = "en"

In [37]:
# Initialise the training data iterator for the Multi30k dataset

train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))

In [47]:
for i, (de, en) in enumerate(train_iter):
    if i == 5:
        break
    print(f"index: {i}, German: {de}, English: {en}")




AttributeError: 'NoneType' object has no attribute 'Lock'
This exception is thrown by __iter__ of _MemoryCellIterDataPipe(remember_elements=1000, source_datapipe=_ChildDataPipe)

In [19]:
# Create an iterator for the training data set

data_set = iter(train_iter)

In [20]:
# Print out the first five pairs of source and target sentences from the training data set

for n in range(5):
    src, tgt = next(data_set)

    print(f"sample {str(n+1)}")
    print(f"Source ({SRC_LANGUAGE}): {src}\nTarget ({TGT_LANGUAGE}): {tgt}")

AttributeError: 'NoneType' object has no attribute 'Lock'
This exception is thrown by __iter__ of _MemoryCellIterDataPipe(remember_elements=1000, source_datapipe=_ChildDataPipe)