## Data loader

This notebook walks through the code used to load the dataset. The corresponding implementation lives in `src/lib/data_loader.py`.

### Import dependencies

As dependencies we have the following:

- `os`: used to check whether the CSV exists or not;
- `pandas`: used to load the CSV file as a DataFrame;
- `numpy`: used to shuffle the row indexes when splitting the dataset;
- `nltk.tokenize`: used to transform a sentence into a list of tokens.

In [8]:
import os
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize


nltk.download("punkt")

[nltk_data] Downloading package punkt to /home/water/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### `__init__`

The `__init__` function takes the following parameters:
- `csv_path`: path to the CSV file that will be used as the dataset;
- `split_percentages`: the fractions of the dataset assigned to the training, validation and test data, in that order.

In [9]:
csv_path = "../dataset/sample.csv"
split_percentages = [0.7, 0.15, 0.15]

In [10]:
assert sum(split_percentages) == 1, "The sum of `split_percentages` must be 1."
assert os.path.exists(csv_path), "The argument `csv_path` is invalid."
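
Because `split_percentages` holds floating-point values, a sum of valid fractions may differ from 1 by a rounding error, in which case an exact comparison would reject it. A minimal sketch of a tolerance-based check follows, assuming such a tolerance is acceptable here; the helper name `check_split_percentages` is hypothetical and not part of `data_loader.py`.

```python
import math


def check_split_percentages(split_percentages):
    """Hypothetical helper: validate that the split fractions sum to 1."""
    total = sum(split_percentages)
    # Compare with a tolerance instead of `==` to absorb rounding error.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError("The sum of `split_percentages` must be 1.")
    return True


# Floating-point sums of valid fractions can differ from 1.0 by a tiny
# rounding error; the tolerance-based check accepts them either way.
print(check_split_percentages([0.7, 0.15, 0.15]))
print(check_split_percentages([0.7, 0.2, 0.1]))
```

An invalid split such as `[0.5, 0.4]` still raises a `ValueError`.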

The dataset is loaded using [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), provided by `pandas`. The `header=0` argument indicates that the first row of the CSV file contains the column names.

In [11]:
dataset = pd.read_csv(csv_path, header=0)

After the dataset is loaded, we split it into training, validation and testing sets. Holding out data that the model never sees during training lets us measure how well it performs on unseen examples.

The validation data is used after each training epoch, while the test data is used once training is complete.

In [12]:
csv_indexes = list(range(len(dataset)))
# [0, 1, 2] becomes [2, 0, 1], for example.
np.random.shuffle(csv_indexes)
validation_count = int(len(csv_indexes) * split_percentages[1])
testing_count = int(len(csv_indexes) * split_percentages[2])

validation_indexes = csv_indexes[:validation_count]
testing_indexes = csv_indexes[
    validation_count: validation_count + testing_count
]
training_indexes = csv_indexes[validation_count + testing_count:]

In [13]:
assert sum(
    [
        len(validation_indexes),
        len(testing_indexes),
        len(training_indexes),
    ]
) == len(csv_indexes), "An error occurred while splitting the dataset."
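
The split above can be sketched on plain Python lists. This is a minimal illustration, not the notebook's implementation: the helper name `split_indexes` and the `seed` parameter are assumptions, and the standard-library `random` module stands in for `np.random.shuffle` so that the example has no dependencies.

```python
import random


def split_indexes(n_rows, split_percentages, seed=0):
    """Hypothetical helper mirroring the split above on plain Python lists."""
    indexes = list(range(n_rows))
    # A seeded generator makes the shuffle reproducible (an assumption;
    # the notebook itself calls `np.random.shuffle` without a seed).
    random.Random(seed).shuffle(indexes)

    validation_count = int(n_rows * split_percentages[1])
    testing_count = int(n_rows * split_percentages[2])

    validation = indexes[:validation_count]
    testing = indexes[validation_count:validation_count + testing_count]
    # Training receives the remainder, so truncation by `int` never loses a row.
    training = indexes[validation_count + testing_count:]
    return training, validation, testing


training, validation, testing = split_indexes(20, [0.7, 0.15, 0.15])
print(len(training), len(validation), len(testing))  # 14 3 3
```

Because the training set takes whatever is left after the validation and test sets are cut, the three parts always cover every row exactly once.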

The neural network must know where a sentence begins and ends; otherwise, there would be no way to tell when it has finished generating a sentence. Therefore, we define the token **<SOS\>** to indicate the beginning of a sentence and the token **<EOF\>** to indicate the end of a sentence.

When using batches to train the model, it is possible to have sentences with different sizes. Hence, all sentences in a batch are padded to the same length. The token **<PAD\>** is used to pad.

The `token2index` dictionary maps a token to its index in the vocabulary. The `index2token` dictionary does the reverse: it maps an index to a token. The `token_count` dictionary stores how many times each token occurs. The `number_tokens` variable stores the size of the vocabulary, that is, the number of unique tokens in the dataset plus the three special tokens.

In [14]:
start_sentence_token = "<SOS>"  # start of sentence
end_sentence_token = "<EOF>"  # end of sentence
pad_sentence_token = "<PAD>"  # sentence pad

token2index = {
    start_sentence_token: 0,
    end_sentence_token: 1,
    pad_sentence_token: 2,
}
index2token = {
    0: start_sentence_token,
    1: end_sentence_token,
    2: pad_sentence_token,
}
token_count = {}
number_tokens = 3

In the next cell, we define three functions that help us create the token dictionary. The `tokenize_sentence` function takes a sentence as a string and breaks it into a list of tokens. The tokenization is done using the [`word_tokenize`](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktLanguageVars.word_tokenize) function, provided by the NLTK package.

The `add_sentence` function tokenizes the sentence and adds each token to the dictionary using the `add_token` function. `add_token` checks whether the token is already in the dictionary: if it is not, it creates the token mapping; otherwise, it only increments the token's count.

In [15]:
def tokenize_sentence(sentence):
    return word_tokenize(sentence)


def add_sentence(sentence):
    for token in tokenize_sentence(sentence):
        add_token(token)


def add_token(token):
    global number_tokens
    if token not in token2index:
        token2index[token] = number_tokens
        index2token[number_tokens] = token
        token_count[token] = 1
        number_tokens += 1
    else:
        token_count[token] += 1

In [16]:
test_sentence = "Hello, World!"
print(tokenize_sentence(test_sentence))

['Hello', ',', 'World', '!']


Finally, we go through all the sentences in the dataset and create the token dictionary.

In [17]:
for x in range(len(dataset)):
    row = dataset.iloc[x]
    add_sentence(row["question"])
    add_sentence(row["answer"])
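
As an alternative to module-level variables and `global`, the same bookkeeping could be grouped in a small class. The sketch below is only an illustration under stated assumptions: the `Vocabulary` class is hypothetical, and `str.split` stands in for NLTK's `word_tokenize` so that the example runs without extra dependencies.

```python
class Vocabulary:
    """Hypothetical container for the token maps, avoiding `global` state."""

    def __init__(self, special_tokens=("<SOS>", "<EOF>", "<PAD>")):
        self.token2index = {}
        self.index2token = {}
        self.token_count = {}
        for token in special_tokens:
            self._register(token)

    def _register(self, token):
        # The next free index is the current size of the vocabulary.
        index = len(self.token2index)
        self.token2index[token] = index
        self.index2token[index] = token

    def add_token(self, token):
        if token not in self.token2index:
            self._register(token)
        self.token_count[token] = self.token_count.get(token, 0) + 1

    def add_sentence(self, sentence, tokenize=str.split):
        # `str.split` is a stand-in tokenizer (an assumption); the notebook
        # uses NLTK's `word_tokenize` instead.
        for token in tokenize(sentence):
            self.add_token(token)

    def __len__(self):
        return len(self.token2index)


vocab = Vocabulary()
vocab.add_sentence("how are you")
vocab.add_sentence("you are fine")
print(len(vocab))                # 7: three special tokens plus four words
print(vocab.token2index["you"])  # 5
print(vocab.token_count["are"])  # 2
```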

### `get_batches`

We define three helper functions that are used to build the batches. The `get_data_indices` function takes `data_type` as a parameter and returns the row indexes of the corresponding split (training, validation or testing).

The `get_encoded_sentence` function takes a sentence, tokenizes it and maps each token to its index. When the `reverse` parameter is true, the encoded sentence is reversed. This follows [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215), which reports that reversing the source sentences improved the results.

The `pad_batch_sequence` function returns the batch with all sequences padded to the same length. Questions and answers are padded separately, each to the length of its longest sequence in the batch, using the previously defined **<PAD\>** token. The same function inserts **<SOS\>** at the beginning and **<EOF\>** at the end of each sequence. Note that **<EOF\>** is appended after the padding, so a padded sequence has the layout **<SOS\>**, tokens, padding, **<EOF\>**.

In [18]:
def get_data_indices(data_type):
    """
        data_type: 0 for training data; 1 for validation data; 2 for testing data.
    """
    if data_type == 0:
        return training_indexes
    elif data_type == 1:
        return validation_indexes
    elif data_type == 2:
        return testing_indexes
    else:
        raise ValueError("Invalid `data_type`.")


def get_encoded_sentence(sentence, reverse=False):
    encoded = []
    for token in tokenize_sentence(sentence):
        encoded.append(token2index[token])
    if reverse:
        encoded = encoded[::-1]
    return encoded


def pad_batch_sequence(x, y):
    assert len(x) == len(y), "`x` and `y` must be the same length."

    longer_x = len(max(x, key=len))
    longer_y = len(max(y, key=len))
    for i in range(len(x)):
        # Append pad token.
        for _ in range(longer_x - len(x[i])):
            x[i].append(token2index[pad_sentence_token])
        for _ in range(longer_y - len(y[i])):
            y[i].append(token2index[pad_sentence_token])
        # Insert SOS token.
        x[i].insert(0, token2index[start_sentence_token])
        y[i].insert(0, token2index[start_sentence_token])
        # Append EOF token.
        x[i].append(token2index[end_sentence_token])
        y[i].append(token2index[end_sentence_token])
    return x, y
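
To make the resulting layout concrete, here is a minimal sketch of the padding step for one side of a batch. It is a hypothetical, non-mutating variant (`pad_sequences` is not part of the notebook) and it hard-codes the special-token indexes 0, 1 and 2 from `token2index`.

```python
SOS, EOF, PAD = 0, 1, 2  # same indexes as the special tokens in `token2index`


def pad_sequences(sequences):
    """Hypothetical non-mutating variant of the padding step."""
    longest = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        padding = [PAD] * (longest - len(seq))
        # Same layout as `pad_batch_sequence`: SOS, tokens, padding, EOF.
        padded.append([SOS] + list(seq) + padding + [EOF])
    return padded


batch = [[5, 6, 7], [8]]
print(pad_sequences(batch))  # [[0, 5, 6, 7, 1], [0, 8, 2, 2, 1]]
print(batch)                 # [[5, 6, 7], [8]] (inputs are left untouched)
```

Unlike `pad_batch_sequence`, which modifies the lists it receives, this variant builds new lists, which can be easier to reason about when the same encoded sentences are reused.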

The `get_batches` function creates the batches used to train, validate or test the model. It takes three parameters:

- `batch_size`: the number of examples in each batch fed to the model;
- `data_type`: selects which split the batches are drawn from: 0 returns training data, 1 returns validation data and 2 returns testing data;
- `drop_last`: the last batch may be smaller than `batch_size` when the number of examples is not a multiple of it. When `drop_last` is True, this smaller batch is discarded; when False (the default), it is returned regardless of its size.

In [19]:
def get_batches(batch_size, data_type, drop_last=False):
    """
        batch_size: The batch size used as input in the model.
        data_type: 0 for training data; 1 for validation data; 2 for testing data.
        drop_last: When True, the last batch is dropped if it is smaller than `batch_size`.
    """
    indexes = get_data_indices(data_type)
    batch_index = -1
    count = 0
    while count < len(indexes) and batch_size <= len(indexes):
        x = []
        y = []
        for i in range(batch_size):
            row = dataset.iloc[indexes[i + count]]
            x.append(get_encoded_sentence(row["question"]))
            y.append(get_encoded_sentence(row["answer"]))
        batch_index += 1
        yield batch_index, pad_batch_sequence(x, y)
        count += batch_size
        if count + batch_size > len(indexes):
            break

    if not drop_last and count < len(indexes):
        x = []
        y = []
        for i in range(len(indexes) - count):
            row = dataset.iloc[indexes[i + count]]
            x.append(get_encoded_sentence(row["question"]))
            y.append(get_encoded_sentence(row["answer"]))
        batch_index += 1
        yield batch_index, pad_batch_sequence(x, y)
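
The batching logic, including the effect of `drop_last`, can be illustrated independently of the dataset. The following is a minimal sketch under the assumption that only the batch boundaries matter; the helper `iter_batch_bounds` is hypothetical and does not appear in the notebook.

```python
def iter_batch_bounds(n_items, batch_size, drop_last=False):
    """Hypothetical helper: yield (start, stop) bounds for each batch."""
    for start in range(0, n_items, batch_size):
        stop = min(start + batch_size, n_items)
        # A batch is partial when fewer than `batch_size` items remain.
        if stop - start < batch_size and drop_last:
            return
        yield start, stop


print(list(iter_batch_bounds(10, 3)))
# [(0, 3), (3, 6), (6, 9), (9, 10)]
print(list(iter_batch_bounds(10, 3, drop_last=True)))
# [(0, 3), (3, 6), (6, 9)]
```

With 10 items and a batch size of 3, the final batch holds a single item, which is kept by default and discarded when `drop_last=True`.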

## Test

Test the `get_batches` function:

In [36]:
for batch_index, (x, y) in get_batches(batch_size=3, data_type=0, drop_last=False):
    print(x, y)

[[0, 14, 4, 5, 15, 16, 17, 18, 19, 2, 2, 2, 2, 2, 1], [0, 64, 65, 126, 17, 127, 26, 17, 128, 57, 2, 2, 2, 2, 1], [0, 91, 38, 17, 45, 92, 93, 26, 17, 94, 95, 46, 96, 57, 1]] [[0, 17, 45, 46, 47, 48, 49, 38, 50, 51, 2, 2, 2, 2, 2, 2, 1], [0, 20, 129, 130, 38, 131, 17, 132, 133, 13, 2, 2, 2, 2, 2, 2, 1], [0, 97, 46, 98, 99, 100, 17, 101, 92, 102, 103, 104, 105, 106, 107, 13, 1]]
[[0, 52, 4, 53, 23, 24, 54, 55, 56, 38, 17, 18, 57, 1], [0, 108, 82, 109, 110, 111, 2, 2, 2, 2, 2, 2, 2, 1], [0, 3, 4, 5, 6, 2, 2, 2, 2, 2, 2, 2, 2, 1]] [[0, 58, 59, 60, 61, 62, 63, 13, 1], [0, 112, 2, 2, 2, 2, 2, 2, 1], [0, 7, 8, 9, 10, 11, 12, 13, 1]]
[[0, 119, 120, 121, 66, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1], [0, 77, 78, 79, 44, 80, 81, 7, 82, 83, 84, 85, 86, 57, 2, 2, 2, 1], [0, 64, 65, 17, 45, 35, 25, 66, 67, 68, 7, 17, 69, 70, 9, 71, 57, 1]] [[0, 53, 122, 82, 123, 24, 124, 125, 1], [0, 87, 88, 60, 89, 26, 74, 90, 1], [0, 72, 73, 74, 75, 48, 76, 13, 1]]
[[0, 64, 35, 17, 113, 26, 17, 18, 57, 1]] [[0, 114, 