## Data loader

This notebook walks through the code that we will use to load the dataset. The corresponding implementation is [here](../src/lib/data_loader.py).

### Import dependencies

The loader depends on the following modules:

- `os`: used to check whether the CSV file exists;
- `pandas`: used to load the CSV file as a DataFrame;
- `numpy`: used to shuffle the row indexes when splitting the dataset;
- `nltk.tokenize`: used to split a sentence into a list of tokens.

In [29]:
import os
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize

### `__init__`

The `__init__` function takes the following parameters:

- `csv_path`: path to the CSV file that will be used as the dataset;
- `split_percentages`: the fractions of the dataset assigned to training, validation and test data, in that order.

In [30]:
csv_path = "../dataset/sample.csv"
split_percentages = [0.7, 0.15, 0.15]

In [31]:
# Compare with a tolerance: floating-point sums are not always exactly 1.
assert np.isclose(sum(split_percentages), 1.0), "The sum of `split_percentages` must be 1."
assert os.path.exists(csv_path), "The argument `csv_path` is invalid."

The dataset is loaded using [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), provided by `pandas`. The `header=0` argument tells `read_csv` that the first row of the CSV file contains the column names.

In [32]:
dataset = pd.read_csv(csv_path, header=0)

After the dataset is loaded, we split it into training, validation and test sets. Holding out part of the data lets us measure how the model performs on examples that it did not see during training.

The validation data will be used after each training epoch, while the test data will be used after the model has been fully trained.

In [33]:
csv_indexes = list(range(len(dataset)))
# [0, 1, 2] becomes [2, 0, 1], for example.
np.random.shuffle(csv_indexes)
validation_count = int(len(csv_indexes) * split_percentages[1])
testing_count = int(len(csv_indexes) * split_percentages[2])

validation_indexes = csv_indexes[:validation_count]
testing_indexes = csv_indexes[
    validation_count : validation_count + testing_count
]
training_indexes = csv_indexes[validation_count + testing_count :]

In [34]:
assert sum(
    [
        len(validation_indexes),
        len(testing_indexes),
        len(training_indexes),
    ]
) == len(csv_indexes), "An error occurred while splitting the dataset."
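As a minimal sketch, the same split can be wrapped in a function and seeded so that it is reproducible between runs. The function name `split_indexes` and the `seed` parameter are hypothetical and are not part of the original loader. The sketch also uses Python's standard `random` module instead of `np.random.shuffle`, only so that it runs without extra dependencies.

```python
import random


def split_indexes(row_count, split_percentages, seed=None):
    """Return (training, validation, testing) lists of row indexes.

    Hypothetical helper: it mirrors the slicing used in the cell above,
    but shuffles with a seeded generator so the split can be repeated.
    """
    indexes = list(range(row_count))
    random.Random(seed).shuffle(indexes)

    # int() truncates, so any leftover rows end up in the training set.
    validation_count = int(row_count * split_percentages[1])
    testing_count = int(row_count * split_percentages[2])

    validation = indexes[:validation_count]
    testing = indexes[validation_count:validation_count + testing_count]
    training = indexes[validation_count + testing_count:]
    return training, validation, testing


training, validation, testing = split_indexes(10, [0.7, 0.15, 0.15], seed=42)
print(len(training), len(validation), len(testing))  # 8 1 1
```

With 10 rows, 15% truncates to a single row for validation and for testing, so the training set receives 8 rows rather than 7. This truncation is worth keeping in mind for small datasets.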

The neural network must know where a sentence begins and where it ends. Without this information, there would be no way to tell when the network has finished producing a sentence. Therefore, we define the token `<SOS>` to mark the beginning of a sentence and the token `<EOF>` to mark its end.

When the model is trained in batches, the sentences in a batch can have different lengths. Hence, all sentences in a batch are padded to the same length using the token `<PAD>`.
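The padding step can be sketched as follows. This is only an illustration: the helpers `wrap_sentence` and `pad_batch` are hypothetical names, and the indexes 0, 1 and 2 are assumed to be the vocabulary indexes of `<SOS>`, `<EOF>` and `<PAD>`.

```python
# Assumed vocabulary indexes of the special tokens.
SOS_INDEX, EOF_INDEX, PAD_INDEX = 0, 1, 2


def wrap_sentence(indexes):
    """Surround a list of token indexes with the <SOS> and <EOF> indexes."""
    return [SOS_INDEX] + list(indexes) + [EOF_INDEX]


def pad_batch(batch, pad_index=PAD_INDEX):
    """Right-pad every sentence to the length of the longest one."""
    max_length = max((len(sentence) for sentence in batch), default=0)
    return [
        sentence + [pad_index] * (max_length - len(sentence))
        for sentence in batch
    ]


batch = [wrap_sentence([3, 4, 5]), wrap_sentence([6])]
padded = pad_batch(batch)
print(padded)  # [[0, 3, 4, 5, 1], [0, 6, 1, 2, 2]]
```

The padding is appended after `<EOF>`, so the end of the real sentence remains visible to the model.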

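The `word2token` dictionary maps a token to its index in the vocabulary. The `index2token` dictionary does the reverse: it maps an index to its token. The `token_count` dictionary stores how many times each token appears in the dataset. The `number_tokens` variable stores the size of the vocabulary, that is, the number of unique tokens in the dataset plus the three special tokens.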

In [35]:
start_sentence_token = "<SOS>"  # start of sentence
end_sentence_token = "<EOF>"  # end of sentence
pad_sentence_token = "<PAD>"  # sentence pad

word2token = {
    start_sentence_token: 0,
    end_sentence_token: 1,
    pad_sentence_token: 2,
}
index2token = {
    0: start_sentence_token,
    1: end_sentence_token,
    2: pad_sentence_token,
}
token_count = {}
number_tokens = 3

In the next cell, we define three functions that will help us create the token dictionary. The `tokenize_sentence` function takes a sentence as a string and breaks it into a list of tokens. The tokenization is done using the [`word_tokenize`](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktLanguageVars.word_tokenize) function, provided by the NLTK package.

The `add_sentence` function tokenizes the sentence and adds each token to the dictionary using the `add_token` function. `add_token` checks whether the token is already in the dictionary: if it is not, it creates the token mapping; otherwise, it only increments the token's count.

In [45]:
def tokenize_sentence(sentence):
    return word_tokenize(sentence)

def add_sentence(sentence):
    for token in tokenize_sentence(sentence):
        add_token(token)

def add_token(token):
    global number_tokens
    if token not in word2token:
        word2token[token] = number_tokens
        index2token[number_tokens] = token
        token_count[token] = 1
        number_tokens += 1
    else:
        token_count[token] += 1

In [37]:
test_sentence = "Hello, World!"
print(tokenize_sentence(test_sentence))

['Hello', ',', 'World', '!']


Finally, we go through all the sentences in the dataset and create the token dictionary.

In [46]:
# Iterate over both columns in parallel instead of indexing row by row.
for question, answer in zip(dataset["question"], dataset["answer"]):
    add_sentence(question)
    add_sentence(answer)
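Once the dictionary is built, a sentence can be converted into a list of indexes and back. The sketch below gathers the pieces above into a small class. The class name `Vocabulary` and the methods `encode` and `decode` are hypothetical and are not taken from `data_loader.py`. To stay self-contained, it uses `str.split` as a stand-in tokenizer; `word_tokenize` could be passed in its place.

```python
class Vocabulary:
    """Hypothetical container for the token mappings built in this notebook."""

    def __init__(self, tokenizer=str.split):
        # str.split is a whitespace stand-in for nltk's word_tokenize.
        self.tokenizer = tokenizer
        self.word2token = {"<SOS>": 0, "<EOF>": 1, "<PAD>": 2}
        self.index2token = {i: t for t, i in self.word2token.items()}
        self.token_count = {}

    @property
    def number_tokens(self):
        return len(self.word2token)

    def add_sentence(self, sentence):
        for token in self.tokenizer(sentence):
            if token not in self.word2token:
                index = len(self.word2token)
                self.word2token[token] = index
                self.index2token[index] = token
            self.token_count[token] = self.token_count.get(token, 0) + 1

    def encode(self, sentence):
        # Raises KeyError for tokens that were never added.
        body = [self.word2token[t] for t in self.tokenizer(sentence)]
        return [self.word2token["<SOS>"]] + body + [self.word2token["<EOF>"]]

    def decode(self, indexes):
        return [self.index2token[i] for i in indexes]


vocabulary = Vocabulary()
vocabulary.add_sentence("how are you ?")
vocabulary.add_sentence("fine , and you ?")

encoded = vocabulary.encode("how are you ?")
print(encoded)  # [0, 3, 4, 5, 6, 1]
print(vocabulary.decode(encoded))  # ['<SOS>', 'how', 'are', 'you', '?', '<EOF>']
```

Keeping the mappings on an object avoids the `global` statement needed in the notebook cells, at the cost of a little more structure.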