## Dataset

This section describes how to load the dataset used to train and test the model. The dataset I am using in this project is just as stupid as the network. The only goal is to learn more about recurrent neural networks.

In [None]:
import pandas as pd

In [None]:
dataset_path = "../../dataset/data.csv"
data = pd.read_csv(dataset_path, header=0)
data.head()
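
The loading code in this section reads two columns from the CSV file, `question` and `answer`. As a rough sketch of that layout, the snippet below parses a hypothetical two-row file from a string; the rows are made up for illustration and are not taken from `data.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv: a header row plus two made-up rows.
toy_csv_text = "question,answer\nhi,hello\nhow are you?,fine\n"

# header=0 tells pandas that the first line holds the column names.
toy_data = pd.read_csv(io.StringIO(toy_csv_text), header=0)
toy_columns = toy_data.columns.tolist()
print(toy_columns)
print(len(toy_data))
```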

This dataset is too small to really need the PyTorch `Dataset` utility class. I will use it anyway, since it plugs directly into `DataLoader`.

In [None]:
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset


class StupidBotDataset(Dataset):
    def __init__(self, csv_path):
        self.data = pd.read_csv(csv_path, header=0)
        self.questions = self.data["question"]
        self.answers = self.data["answer"]
        self.data_len = len(self.data.index)
        
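        # Unique characters in the dataset, plus the space used for padding.
        # Sorting keeps the character-to-int mapping the same on every run.
        self.unique_characters = sorted(set("".join(self.questions + self.answers)) | {" "})
        self.unique_characters_length = len(self.unique_characters)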
        # Map int to character.
        self.int2char = dict(enumerate(self.unique_characters))
        # Map character to int.
        self.char2int = {char: i for i, char in self.int2char.items()}
        
        # Length of the longest question.
        longer_question_length = len(max(self.questions, key=len))
        # Length of the longest answer.
        longer_answer_length = len(max(self.answers, key=len))
        
        # Right-pad with spaces so every question (and every answer) has the same length.
        self.questions = self.questions.str.pad(longer_question_length, side="right")
        self.answers = self.answers.str.pad(longer_answer_length, side="right")

    def __getitem__(self, index):
        # Positional lookup, so it does not depend on the DataFrame index labels.
        x = self.questions.iloc[index]
        # Map text to int.
        x = self.text2int(x)
        # One-hot encode x.
        x = self.one_hot_encode(x)
        x = torch.tensor(x)

        y = self.answers.iloc[index]
        # Map text to int.
        y = self.text2int(y)
        # One-hot encode y.
        y = self.one_hot_encode(y)
        y = torch.tensor(y)
        return x, y

    def __len__(self):
        return self.data_len
    
    def text2int(self, text):
        """
        Convert text to a list of integers, one per character.
        """
        return [self.char2int[c] for c in text]
    
    def one_hot_encode(self, sequence):
        """
        Convert a sequence of integers to a one-hot encoded matrix of shape
        (number of unique characters, sequence length).
        """
        encoded = np.zeros([self.unique_characters_length, len(sequence)], dtype=int)
        for i, character in enumerate(sequence):
            encoded[character][i] = 1
        return encoded
    
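    def one_hot_decode(self, sequence):
        """
        Convert a one-hot encoded matrix back to a list of integers.

        sequence: PyTorch tensor of shape (number of unique characters, sequence length).
        """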
        return [np.argmax(x) for x in sequence.numpy().T]

The cell below shows an example of how to use the `StupidBotDataset` class.

In [None]:
dataset = StupidBotDataset(dataset_path)
dataset[0]
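
To make the encoding easier to follow, here is a minimal sketch of the same idea on a toy string, independent of the CSV file. The `toy_` names are hypothetical and exist only for this illustration; the matrix layout (one row per unique character, one column per position) mirrors `one_hot_encode`.

```python
import numpy as np

toy_text = "hi there"

# Sorted so the character-to-int mapping is the same on every run.
toy_vocabulary = sorted(set(toy_text))
toy_char2int = {char: i for i, char in enumerate(toy_vocabulary)}
toy_int2char = {i: char for char, i in toy_char2int.items()}

# Map text to int.
toy_indices = [toy_char2int[c] for c in toy_text]

# One row per unique character, one column per position in the text.
toy_encoded = np.zeros((len(toy_vocabulary), len(toy_indices)), dtype=int)
toy_encoded[toy_indices, np.arange(len(toy_indices))] = 1

# argmax along the rows recovers the integer stored in each column.
toy_decoded = "".join(toy_int2char[i] for i in toy_encoded.argmax(axis=0))
print(toy_encoded.shape)
print(toy_decoded)
```

Each column contains exactly one `1`, so decoding with `argmax` gives back the original text.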

## Divide dataset into training and testing

The next step is to divide the dataset into training and test sets. To do this, I will use the tools provided by PyTorch.

The whole dataset is loaded into memory and its indices are shuffled. With a large dataset this can be a problem, but this dataset is small, so the approach is fine here.

In [None]:
from torch.utils.data.sampler import SubsetRandomSampler

Load the dataset and define the parameters used to split it:

In [None]:
dataset = StupidBotDataset(dataset_path)
dataset_size = len(dataset)
dataset_indices = list(range(dataset_size))

batch_size = 1
test_split = int(np.floor(0.2 * dataset_size))  # 20%
# Shuffle dataset indices.
np.random.shuffle(dataset_indices)

Split the dataset indices:

In [None]:
train_indices, test_indices = (
    dataset_indices[test_split:],
    dataset_indices[:test_split],
)
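
As a quick sanity check of the slicing above, the sketch below repeats the split on a hypothetical dataset of 10 samples. It uses a seeded NumPy generator instead of `np.random.shuffle`, which is a swapped-in choice to make the illustration reproducible, not what the notebook does.

```python
import numpy as np

toy_size = 10  # hypothetical dataset size
toy_rng = np.random.default_rng(seed=0)  # seeded so the shuffle is repeatable
toy_indices = toy_rng.permutation(toy_size).tolist()

toy_test_split = int(np.floor(0.2 * toy_size))  # 20%
toy_train_indices = toy_indices[toy_test_split:]
toy_test_indices = toy_indices[:toy_test_split]

# The two subsets never share an index and together cover the whole dataset.
toy_is_disjoint = set(toy_train_indices).isdisjoint(toy_test_indices)
print(len(toy_train_indices), len(toy_test_indices), toy_is_disjoint)
```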

Create the train and test data loaders:

In [None]:
train_sampler = SubsetRandomSampler(train_indices)
test_sampler = SubsetRandomSampler(test_indices)

train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, sampler=train_sampler)
test_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, sampler=test_sampler)
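
To see what a loader built this way yields, here is a small sketch on a hypothetical `TensorDataset` of zero tensors standing in for the real dataset. The shapes (5 samples, 4 unique characters, lengths 3 and 2) are assumptions chosen only for the illustration.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Hypothetical stand-in data: 5 samples shaped (unique characters, sequence length).
toy_x = torch.zeros(5, 4, 3, dtype=torch.long)
toy_y = torch.zeros(5, 4, 2, dtype=torch.long)
toy_dataset = TensorDataset(toy_x, toy_y)

# The sampler visits only the listed indices, in random order.
toy_sampler = SubsetRandomSampler([0, 2, 4])
toy_loader = DataLoader(toy_dataset, batch_size=1, sampler=toy_sampler)

# Each batch gains a leading batch dimension of size batch_size.
toy_batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in toy_loader]
print(toy_batch_shapes)
```

With `batch_size=1`, every `x` and `y` carries an extra leading dimension of size 1, which matters when feeding the batches to a recurrent layer.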