# Understanding the SQuAD dataset 

We are going to fine-tune [BERT implemented by HuggingFace](https://huggingface.co/bert-base-uncased) for the text extraction task, using the questions and answers of the [SQuAD (Stanford Question Answering Dataset)](https://rajpurkar.github.io/SQuAD-explorer/).
The data consists of a set of questions and the corresponding paragraphs that contain the answers.
The model will be trained to locate the answer in the context by predicting the positions where the answer starts and ends.

In this notebook we are going to see how the data is set up for training.

More info:
- [Glossary - HuggingFace docs](https://huggingface.co/transformers/glossary.html#model-inputs)
- [BERT NLP — How To Build a Question Answering Bot](https://towardsdatascience.com/bert-nlp-how-to-build-a-question-answering-bot-98b1d1594d7b)

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer
from rich.pretty import pprint

In [None]:
from datasets.utils import disable_progress_bar
from datasets import disable_caching


disable_progress_bar()
disable_caching()

## The raw data

In [None]:
hf_dataset = load_dataset('squad')

In [None]:
hf_dataset

In [None]:
for i, _squad_example in enumerate(hf_dataset['train']):
    pprint(_squad_example)
    if i > 5:
        break

In [None]:
for i, _squad_example in enumerate(hf_dataset['validation']):
    pprint(_squad_example)
    if i > 5:
        break

In [None]:
len(hf_dataset['train']['title'])

In [None]:
len(hf_dataset['validation']['title'])

In [None]:
len(set(hf_dataset['train']['title']))

In [None]:
len(set(hf_dataset['validation']['title']))

In [None]:
squad_ex = hf_dataset['train'].select([20584])

In [None]:
squad_ex['title']

In [None]:
squad_ex['context']

In [None]:
squad_ex['question']

In [None]:
squad_ex['answers']
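
Each record stores the answer as text plus a character offset into the context. The following is a minimal sketch with a hand-written record (the sentence and question are made up for illustration and are not taken from SQuAD) showing how `text` and `answer_start` relate to the context:

```python
# Hypothetical record that mimics the layout of a SQuAD example.
example = {
    "context": "The Eiffel Tower is located in Paris, France.",
    "question": "Where is the Eiffel Tower located?",
    "answers": {"text": ["Paris"], "answer_start": [31]},
}

answer_text = example["answers"]["text"][0]
char_start = example["answers"]["answer_start"][0]
char_end = char_start + len(answer_text)  # exclusive end offset

# Slicing the context with the character span recovers the answer text.
recovered = example["context"][char_start:char_end]
print(recovered)
```

These offsets count characters, not tokens, which is why they have to be converted before training.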

## The tokenizer

Before the data can be fed to the model, it has to be processed.
The idea is to replace the words (and some word parts) with numbers using the tokenizer loaded below, and to organize the training data as pairs of paragraphs and questions.
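
To make "words and word parts" concrete, here is a toy sketch of the greedy longest-match splitting that BERT-style WordPiece tokenizers are built around. The vocabulary `VOCAB` and the helper `wordpiece` are hypothetical and tiny; the real tokenizer has its own vocabulary and may split the same text differently.

```python
# Toy vocabulary (hypothetical); a real one has tens of thousands of entries.
VOCAB = {"[UNK]": 0, "let": 1, "'": 2, "s": 3, "token": 4,
         "##ize": 5, "some": 6, "##thing": 7, "?": 8}


def wordpiece(word, vocab=VOCAB):
    """Split one word into sub-word pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


words = ["let", "'", "s", "tokenize", "something", "?"]
tokens = [piece for word in words for piece in wordpiece(word)]
ids = [VOCAB[token] for token in tokens]
print(tokens)
print(ids)
```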

In [None]:
hf_model = 'google/mobilebert-uncased'

In [None]:
tokenizer = AutoTokenizer.from_pretrained(hf_model)

In [None]:
encoding = tokenizer("Let's tokenize something?")

In [None]:
pprint(encoding)

In [None]:
tokenizer.convert_ids_to_tokens(encoding['input_ids'])

In [None]:
tokenizer.decode(encoding['input_ids'], skip_special_tokens=True)

## Processing the data

In [None]:
MAX_SEQ_LEN = 300

def tokenize_dataset(squad_example, tokenizer=tokenizer):
    """Tokenize the text in the dataset and convert
    the start and ending positions of the answers
    from text to tokens"""
    max_len = MAX_SEQ_LEN
    context = squad_example['context']
    answer_start = squad_example['answers']['answer_start'][0]
    answer = squad_example['answers']['text'][0]
    squad_example_tokenized = tokenizer(
        context, squad_example['question'],
        padding='max_length',
        max_length=max_len,
        truncation=True,
    )
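    # The prefix includes the first answer character, so this count is one more
    # than the number of tokens before the answer. The extra one accounts for
    # the `[CLS]` token at position 0 (assuming the answer starts a new token).
    token_start = len(tokenizer.tokenize(context[:answer_start + 1]))
    # Exclusive end index: start index plus the number of answer tokens
    token_end = len(tokenizer.tokenize(answer)) + token_start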

    squad_example_tokenized['start_token_idx'] = token_start
    squad_example_tokenized['end_token_idx'] = token_end

    return squad_example_tokenized


def filter_samples_by_max_seq_len(squad_example):
    """Filter out the samples where the answers are
    not within the first `MAX_SEQ_LEN` tokens"""
    max_len = MAX_SEQ_LEN
    answer_start = squad_example['answers']['answer_start'][0]
    answer = squad_example['answers']['text'][0]
    token_start = len(tokenizer.tokenize(squad_example['context'][:answer_start]))
    token_end = len(tokenizer.tokenize(answer)) + token_start
    # Return an explicit boolean so that `filter` always receives True or False
    return token_end < max_len
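
The two functions above locate the answer by counting the tokens of the text that precedes it. Another common approach, shown here as a sketch and not as what this notebook does, maps character offsets to token indices. The regex tokenizer and the helper names `tokenize_with_offsets` and `char_span_to_token_span` are made up for illustration; with a fast Hugging Face tokenizer the offsets would come from `return_offsets_mapping=True` instead.

```python
import re


def tokenize_with_offsets(text):
    """Toy tokenizer: words and punctuation with (start, end) character offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_span_to_token_span(offsets, char_start, char_end):
    """Return (first, last_exclusive) indices of tokens overlapping the span."""
    covered = [i for i, (_, start, end) in enumerate(offsets)
               if start < char_end and end > char_start]
    if not covered:
        return None
    return covered[0], covered[-1] + 1


context = "The Eiffel Tower is located in Paris, France."
answer = "Paris"
char_start = context.index(answer)

offsets = tokenize_with_offsets(context)
token_start, token_end = char_span_to_token_span(
    offsets, char_start, char_start + len(answer)
)
print(token_start, token_end)
```

The indices here are relative to the context tokens only. In the model input, the leading `[CLS]` token shifts them by one.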

In [None]:
hf_dataset

In [None]:
dataset_filtered = hf_dataset.filter(
    filter_samples_by_max_seq_len,
    num_proc=12,
)
dataset_filtered
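
One caveat about the filter is worth keeping in mind. The context shares the `MAX_SEQ_LEN` budget with the question and the special tokens, so an answer that ends before token 300 of the context can still be cut off by truncation. The sketch below is a back-of-the-envelope check, assuming three special tokens for a pair and assuming the context is the longer sequence, so that truncation trims only the context. The helpers `context_budget` and `answer_survives` are hypothetical.

```python
MAX_SEQ_LEN = 300
NUM_SPECIAL_TOKENS = 3  # assumed layout: [CLS] context [SEP] question [SEP]


def context_budget(num_question_tokens, max_len=MAX_SEQ_LEN):
    """Tokens left for the context after the question and special tokens."""
    return max(max_len - NUM_SPECIAL_TOKENS - num_question_tokens, 0)


def answer_survives(num_tokens_before_answer, num_answer_tokens,
                    num_question_tokens):
    """True if the whole answer fits inside the truncated context."""
    token_end = num_tokens_before_answer + num_answer_tokens
    return token_end <= context_budget(num_question_tokens)


print(context_budget(12))
print(answer_survives(280, 4, 12))
print(answer_survives(290, 4, 12))
```

In the last call the answer ends at context token 294, below `MAX_SEQ_LEN`, yet it does not fit once a 12-token question is accounted for.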

In [None]:
dataset_tok = dataset_filtered.map(
    tokenize_dataset,
    remove_columns=hf_dataset['train'].column_names,
    num_proc=12,
)
dataset_tok.set_format('pt')
dataset_tok

## The training set

In [None]:
train_dataset = dataset_tok["train"]
train_dataset

# eval_dataset = dataset_tok["validation"]

In [None]:
train_sample = train_dataset.select([20299])[0]
pprint(train_sample)

## The model input

In [None]:
(
    train_sample['input_ids'].shape,
    train_sample['token_type_ids'].shape,
    train_sample['attention_mask'].shape
)

In [None]:
train_sample['input_ids']

In [None]:
tokenizer.decode(train_sample['input_ids'])

## [Attention masks](https://huggingface.co/transformers/glossary.html#attention-mask)
To create batches for training, the text needs to be padded to a common length. The attention mask marks which tokens are text (1) and which are padding (0).

In [None]:
train_sample['attention_mask']

In [None]:
context_encoded = train_sample['input_ids'][train_sample['attention_mask'] == 1]
tokenizer.decode(context_encoded)
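
The same idea in plain Python, as a minimal sketch: the token ids are arbitrary numbers, and `PAD_ID` and `pad_with_mask` are names chosen for this example only.

```python
PAD_ID = 0  # assumed padding id for this toy example


def pad_with_mask(token_ids, max_len, pad_id=PAD_ID):
    """Pad (or cut) a list of ids to `max_len` and build its attention mask."""
    token_ids = token_ids[:max_len]
    num_pad = max_len - len(token_ids)
    input_ids = token_ids + [pad_id] * num_pad
    attention_mask = [1] * len(token_ids) + [0] * num_pad
    return input_ids, attention_mask


input_ids, attention_mask = pad_with_mask([101, 7592, 2088, 102], max_len=6)

# Keeping only the positions where the mask is 1 removes the padding again.
real_tokens = [t for t, m in zip(input_ids, attention_mask) if m == 1]
print(input_ids, attention_mask, real_tokens)
```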

## [Token type ids](https://huggingface.co/transformers/glossary.html#token-type-ids)
Token type ids differentiate the two segments of the input: the tokens that belong to the paragraph (type 0) and the ones that belong to the question (type 1).

In [None]:
train_sample['token_type_ids']

In [None]:
paragraph_encoded = train_sample['input_ids'][train_sample['token_type_ids'] == 0]
tokenizer.decode(paragraph_encoded, skip_special_tokens=True)

In [None]:
question_encoded = train_sample['input_ids'][train_sample['token_type_ids'] == 1]
tokenizer.decode(question_encoded, skip_special_tokens=True)
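
As a sketch of how such a pair is laid out in BERT-style models, the helper below (hypothetical, working on word-level tokens in place of real ids) joins a context and a question and assigns the segment ids:

```python
def build_pair(context_tokens, question_tokens):
    """Join two segments BERT-style and mark each token with its segment id."""
    tokens = (["[CLS]"] + context_tokens + ["[SEP]"]
              + question_tokens + ["[SEP]"])
    # [CLS], the context and the first [SEP] get 0; the rest gets 1.
    token_type_ids = ([0] * (len(context_tokens) + 2)
                      + [1] * (len(question_tokens) + 1))
    return tokens, token_type_ids


tokens, token_type_ids = build_pair(
    ["paris", "is", "in", "france"],
    ["where", "is", "paris", "?"],
)
first_segment = [t for t, s in zip(tokens, token_type_ids) if s == 0]
second_segment = [t for t, s in zip(tokens, token_type_ids) if s == 1]
print(first_segment)
print(second_segment)
```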

In [None]:
train_sample['start_token_idx']

In [None]:
train_sample['end_token_idx']
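
The two indices can be read as a slice over the input tokens. The sketch below uses a hand-written token list (not real tokenizer output) and follows the convention of `tokenize_dataset`, where the end index is the start index plus the number of answer tokens, that is, an exclusive end:

```python
# Hypothetical model input: [CLS] context [SEP] question [SEP]
tokens = ["[CLS]", "the", "tower", "is", "in", "paris", ",", "france", "[SEP]",
          "where", "is", "it", "?", "[SEP]"]

num_tokens_before_answer = 4  # "the", "tower", "is", "in"
num_answer_tokens = 1         # "paris"

# The [CLS] token at position 0 shifts the answer by one position.
start_token_idx = num_tokens_before_answer + 1
end_token_idx = start_token_idx + num_answer_tokens  # exclusive

answer_tokens = tokens[start_token_idx:end_token_idx]
print(start_token_idx, end_token_idx, answer_tokens)
```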