# Understanding how the SQuAD dataset is set up for our task with BERT
  
We are going to fine-tune BERT for the text-extraction task using a dataset of questions and answers. Each question is about a given paragraph (the *context*) that contains the answer. The model is trained to locate the answer in the context by predicting the positions where the answer starts and ends.
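
To make the start/end formulation concrete, here is a minimal sketch of how a pair of per-token scores can be turned into an answer span. The function `best_span`, the `max_answer_len` parameter, the toy tokens, and the score values are all hypothetical illustrations, not part of this notebook or of any library.

```python
import numpy as np

def best_span(start_scores, end_scores, max_answer_len=30):
    """Return the (start, end) token pair with the highest combined score.

    The end index is inclusive and is constrained to be >= start.
    """
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_scores)):
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Toy tokens and made-up scores (one start score and one end score per token)
tokens = ["the", "plateau", "has", "about", "450", "people", "per", "km2", "."]
start_scores = np.array([0.1, 0.0, 0.0, 0.2, 3.0, 0.1, 0.0, 0.5, 0.0])
end_scores = np.array([0.0, 0.1, 0.0, 0.0, 0.3, 0.2, 0.1, 2.5, 0.0])

start, end = best_span(start_scores, end_scores)
answer_tokens = tokens[start:end + 1]  # +1 because `end` is inclusive
print(start, end, answer_tokens)
```

Constraining `end >= start` avoids empty or reversed spans, which an independent argmax over each score vector could produce.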

In this notebook we see how the dataset is set up for training.

This notebook is based on [BERT (from HuggingFace Transformers) for Text Extraction](https://keras.io/examples/nlp/text_extraction_with_bert/).

 More info:
  * [BERT NLP — How To Build a Question Answering Bot](https://towardsdatascience.com/bert-nlp-how-to-build-a-question-answering-bot-98b1d1594d7b)

In [1]:
import os
import json
import dataset_utils as du
from tensorflow import keras
from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizer

## 1. The tokenizer

We are going to use [Hugging Face's tokenizers](https://huggingface.co/transformers/main_classes/tokenizer.html).
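
BERT's tokenizer splits words into WordPiece sub-tokens. The following is a simplified sketch of the greedy longest-match-first idea behind WordPiece, using a tiny hypothetical vocabulary (`toy_vocab`) and a hypothetical helper (`wordpiece`). It is not the Hugging Face implementation, and the real `bert-base-uncased` vocabulary may split these words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-tokens.

    Pieces that do not start the word carry the "##" continuation prefix.
    """
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix of the remainder is in the vocabulary
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"the", "gr", "##otto", "km", "##2"}

print(wordpiece("the", toy_vocab))
print(wordpiece("grotto", toy_vocab))
print(wordpiece("km2", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

Because one word can map to several sub-tokens, token positions in the model input do not line up with word or character positions in the raw text.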

In [2]:
slow_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased",
                                               cache_dir='./_bert_tokenizer')
save_path = "bert_tokenizer/"
os.makedirs(save_path, exist_ok=True)

slow_tokenizer.save_pretrained(save_path)

# Load the fast tokenizer from the saved vocabulary file
tokenizer = BertWordPieceTokenizer(os.path.join(save_path, "vocab.txt"), lowercase=True)


## 2. The data

In [3]:
train_data_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json"
eval_data_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json"
train_path = keras.utils.get_file("train.json", train_data_url, cache_dir="./")
eval_path = keras.utils.get_file("eval.json", eval_data_url, cache_dir="./")

Downloading data from https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
Downloading data from https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json


In [4]:
with open(train_path) as f:
    raw_train_data = json.load(f)

with open(eval_path) as f:
    raw_eval_data = json.load(f)

In [6]:
raw_train_data.keys()

dict_keys(['data', 'version'])

In [7]:
raw_train_data['data'][91]['title']

'Alps'

In [8]:
raw_train_data['data'][91]['paragraphs'][0]['context']

'The Alps (/ælps/; Italian: Alpi [ˈalpi]; French: Alpes [alp]; German: Alpen [ˈʔalpm̩]; Slovene: Alpe [ˈáːlpɛ]) are the highest and most extensive mountain range system that lies entirely in Europe, stretching approximately 1,200 kilometres (750 mi) across eight Alpine countries: Austria, France, Germany, Italy, Liechtenstein, Monaco, Slovenia, and Switzerland. The Caucasus Mountains are higher, and the Urals longer, but both lie partly in Asia. The mountains were formed over tens of millions of years as the African and Eurasian tectonic plates collided. Extreme shortening caused by the event resulted in marine sedimentary rocks rising by thrusting and folding into high mountain peaks such as Mont Blanc and the Matterhorn. Mont Blanc spans the French–Italian border, and at 4,810 m (15,781 ft) is the highest mountain in the Alps. The Alpine region area contains about a hundred peaks higher than 4,000 m (13,123 ft), known as the "four-thousanders".'

In [9]:
raw_train_data['data'][0]['paragraphs'][0]['qas']

[{'answers': [{'answer_start': 515, 'text': 'Saint Bernadette Soubirous'}],
  'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
  'id': '5733be284776f41900661182'},
 {'answers': [{'answer_start': 188, 'text': 'a copper statue of Christ'}],
  'question': 'What is in front of the Notre Dame Main Building?',
  'id': '5733be284776f4190066117f'},
 {'answers': [{'answer_start': 279, 'text': 'the Main Building'}],
  'question': 'The Basilica of the Sacred heart at Notre Dame is beside to which structure?',
  'id': '5733be284776f41900661180'},
 {'answers': [{'answer_start': 381,
    'text': 'a Marian place of prayer and reflection'}],
  'question': 'What is the Grotto at Notre Dame?',
  'id': '5733be284776f41900661181'},
 {'answers': [{'answer_start': 92,
    'text': 'a golden statue of the Virgin Mary'}],
  'question': 'What sits on top of the Main Building at Notre Dame?',
  'id': '5733be284776f4190066117e'}]
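
The nesting shown above (`data` → `paragraphs` → `qas` → `answers`) can be flattened into one record per question. The sketch below does this on a small hand-written dictionary that mimics the SQuAD v1.1 layout. The helper `flatten_squad`, the toy title, id, and texts are hypothetical and are not taken from the real dataset.

```python
def flatten_squad(raw):
    """Yield one dict per question, using the first listed answer."""
    for article in raw["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answer = qa["answers"][0]
                start_char = answer["answer_start"]
                yield {
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    "answer_text": answer["text"],
                    "start_char": start_char,
                    # Exclusive end offset of the answer in the context
                    "end_char": start_char + len(answer["text"]),
                }

# Hand-written toy data in the same shape as the SQuAD JSON
toy_squad = {
    "version": "1.1",
    "data": [{
        "title": "Toy_Article",
        "paragraphs": [{
            "context": "It is a replica of the grotto at Lourdes, France.",
            "qas": [{
                "id": "toy-0001",
                "question": "Where is the original grotto?",
                "answers": [{"answer_start": 33, "text": "Lourdes, France"}],
            }],
        }],
    }],
}

records = list(flatten_squad(toy_squad))
rec = records[0]
# `answer_start` is a character offset into the context
print(rec["context"][rec["start_char"]:rec["end_char"]])
```

Note that `answer_start` counts characters, not tokens, so it has to be converted to token positions before it can serve as a training target.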

## 3.  The training set

In [10]:
max_len = 384

train_squad_examples = du.create_squad_examples(raw_train_data, max_len, tokenizer)
x_train, y_train = du.create_inputs_targets(train_squad_examples)
print(f"{len(train_squad_examples)} training points created.")

87599 training points created.


In [11]:
len(x_train)

3

In [12]:
x_train[0].shape, x_train[1].shape, x_train[2].shape

((86136, 384), (86136, 384), (86136, 384))

In [13]:
x_train[0][0]

array([  101,  6549,  2135,  1010,  1996,  2082,  2038,  1037,  3234,
        2839,  1012, 10234,  1996,  2364,  2311,  1005,  1055,  2751,
        8514,  2003,  1037,  3585,  6231,  1997,  1996,  6261,  2984,
        1012,  3202,  1999,  2392,  1997,  1996,  2364,  2311,  1998,
        5307,  2009,  1010,  2003,  1037,  6967,  6231,  1997,  4828,
        2007,  2608,  2039, 14995,  6924,  2007,  1996,  5722,  1000,
        2310,  3490,  2618,  4748,  2033, 18168,  5267,  1000,  1012,
        2279,  2000,  1996,  2364,  2311,  2003,  1996, 13546,  1997,
        1996,  6730,  2540,  1012,  3202,  2369,  1996, 13546,  2003,
        1996, 24665, 23052,  1010,  1037, 14042,  2173,  1997,  7083,
        1998,  9185,  1012,  2009,  2003,  1037, 15059,  1997,  1996,
       24665, 23052,  2012, 10223, 26371,  1010,  2605,  2073,  1996,
        6261,  2984, 22353,  2135,  2596,  2000,  3002, 16595,  9648,
        4674,  2061, 12083,  9711,  2271,  1999,  8517,  1012,  2012,
        1996,  2203,

In [14]:
# The training sample is the text plus the question
#
# The padding zeroes are discarded by the tokenizer's decoding
tokenizer.decode(x_train[0][0])

'architecturally, the school has a catholic character. atop the main building\'s gold dome is a golden statue of the virgin mary. immediately in front of the main building and facing it, is a copper statue of christ with arms upraised with the legend " venite ad me omnes ". next to the main building is the basilica of the sacred heart. immediately behind the basilica is the grotto, a marian place of prayer and reflection. it is a replica of the grotto at lourdes, france where the virgin mary reputedly appeared to saint bernadette soubirous in 1858. at the end of the main drive ( and in a direct line that connects through 3 statues and the gold dome ), is a simple, modern stone statue of mary. to whom did the virgin mary allegedly appear in 1858 in lourdes france?'

In [15]:
# `x_train[1][0]` is one for the elements that represent the question:
x_train[1][0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [16]:
# `x_train[1][0]==0` selects the context and `x_train[1][0]==1`, the question:
tokenizer.decode(x_train[0][0][x_train[1][0]==1])

'to whom did the virgin mary allegedly appear in 1858 in lourdes france?'

In [17]:
# `x_train[2][0]` (the attention mask) is one for the elements that represent the text:
x_train[2][0]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [18]:
# `x_train[2][0]==1` selects the part of the array that has text.
# The rest, `x_train[2][0]==0`, corresponds to the zeros added as padding
tokenizer.decode(x_train[0][0][x_train[2][0]==1])

'architecturally, the school has a catholic character. atop the main building\'s gold dome is a golden statue of the virgin mary. immediately in front of the main building and facing it, is a copper statue of christ with arms upraised with the legend " venite ad me omnes ". next to the main building is the basilica of the sacred heart. immediately behind the basilica is the grotto, a marian place of prayer and reflection. it is a replica of the grotto at lourdes, france where the virgin mary reputedly appeared to saint bernadette soubirous in 1858. at the end of the main drive ( and in a direct line that connects through 3 statues and the gold dome ), is a simple, modern stone statue of mary. to whom did the virgin mary allegedly appear in 1858 in lourdes france?'
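
The three arrays in `x_train` can be summarized with a small sketch that builds them by hand for one sample. It assumes the layout follows the Keras example this notebook is based on (context tokens first, then question tokens, then zero padding), which matches the decoded outputs above. The function `build_inputs` and the toy token ids are hypothetical; `dataset_utils` may differ in detail.

```python
import numpy as np

def build_inputs(context_ids, question_ids, max_len):
    """Build (input_ids, token_type_ids, attention_mask) padded to max_len.

    Returns None when the sample does not fit in max_len tokens.
    """
    input_ids = context_ids + question_ids
    # 0 marks context tokens, 1 marks question tokens
    token_type_ids = [0] * len(context_ids) + [1] * len(question_ids)
    # 1 marks real tokens, 0 marks padding
    attention_mask = [1] * len(input_ids)

    padding = max_len - len(input_ids)
    if padding < 0:
        return None
    input_ids = input_ids + [0] * padding
    token_type_ids = token_type_ids + [0] * padding
    attention_mask = attention_mask + [0] * padding
    return (np.array(input_ids),
            np.array(token_type_ids),
            np.array(attention_mask))

# Toy ids: 101 and 102 stand in for [CLS] and [SEP]; the others are made up
context_ids = [101, 7, 8, 9, 102]
question_ids = [5, 6, 102]

input_ids, token_type_ids, attention_mask = build_inputs(
    context_ids, question_ids, max_len=10)

print(input_ids)
print(token_type_ids)
print(attention_mask)
# Boolean masks recover each part, as in the cells above
print(input_ids[token_type_ids == 1])   # question tokens
print(input_ids[attention_mask == 1])   # everything except padding
```

Padding positions have token type 0 as well, so `token_type_ids == 0` alone selects the context plus the padding; combining it with `attention_mask == 1` isolates the context.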

In [19]:
len(y_train)

2

In [20]:
# `y_train[0]` and `y_train[1]` are the positions in the text where the answer starts and ends, respectively.
y_train[0].shape, y_train[1].shape

((86136,), (86136,))
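
The targets are token positions, while SQuAD stores character offsets, so a conversion step is needed. The sketch below shows one way to do it from per-token character offsets. It uses whitespace tokenization as a stand-in for WordPiece, and the helper `char_span_to_token_span` is hypothetical. In the Keras example this notebook is based on, samples whose answer cannot be aligned to tokens or that exceed `max_len` are skipped, which would be consistent with 86136 rows coming from 87599 examples, assuming `dataset_utils` follows the same logic.

```python
import re

def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span [start_char, end_char) to token indices.

    Returns (first_token, last_token), both inclusive, or None when no
    token overlaps the character span.
    """
    hits = [i for i, (s, e) in enumerate(offsets)
            if s < end_char and e > start_char]
    if not hits:
        return None
    return hits[0], hits[-1]

context = "populated with about 450 people per km2 and"
answer = "450 people per km2"
start_char = context.find(answer)
end_char = start_char + len(answer)

# Stand-in tokenizer: whitespace tokens plus a leading [CLS] with empty offsets
matches = list(re.finditer(r"\S+", context))
tokens = ["[CLS]"] + [m.group() for m in matches]
offsets = [(0, 0)] + [m.span() for m in matches]

span = char_span_to_token_span(offsets, start_char, end_char)
first, last = span
print(span)
print(tokens[first:last])      # drops the last answer token
print(tokens[first:last + 1])  # full answer, since `last` is inclusive
```

With an inclusive end index, a Python slice needs `last + 1` to cover the whole answer.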

A training sample looks like this:

In [21]:
squad_ex = 51198
train_ex = du.get_training_example_from_squad(train_squad_examples, squad_ex)

print('\n * CONTEXT:                   \n', train_squad_examples[squad_ex].context)
print('\n * QUESTION:                  \n', train_squad_examples[squad_ex].question)
print('\n * ANSWER (REFERENCE):        \n', train_squad_examples[squad_ex].answer_text)
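# Note: this prints "450 people per km" rather than "450 people per km2", which suggests
# that y_train[1] is the index of the last answer token (inclusive); slicing up to
# y_train[1][train_ex] + 1 would include it.
print('\n * ANSWER IN CONTEXT:         \n', tokenizer.decode(x_train[0][train_ex][y_train[0][train_ex]:y_train[1][train_ex]]))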

print('\n\n === TRAINING SAMPLE ===')
print('\n * CONTEXT & QUESTION:        \n', tokenizer.decode(x_train[0][train_ex]))
print('\n * POSITION IN CONTEXT:       \n', (y_train[0][train_ex], y_train[1][train_ex]))


 * CONTEXT:                   
 Switzerland has a dense network of cities, where large, medium and small cities are complementary. The plateau is very densely populated with about 450 people per km2 and the landscape continually shows signs of human presence. The weight of the largest metropolitan areas, which are Zürich, Geneva–Lausanne, Basel and Bern tend to increase. In international comparison the importance of these urban areas is stronger than their number of inhabitants suggests. In addition the two main centers of Zürich and Geneva are recognized for their particularly great quality of life.

 * QUESTION:                  
 What is the population density of the plateau?

 * ANSWER (REFERENCE):        
 450 people per km2

 * ANSWER IN CONTEXT:         
 450 people per km


 === TRAINING SAMPLE ===

 * CONTEXT & QUESTION:        
 switzerland has a dense network of cities, where large, medium and small cities are complementary. the plateau is very densely populated with about 