## Purpose:
- this notebook illustrates how preprocessing is done for question answering tasks
- source: https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb

### 1. Declare initial variables

In [1]:
# This flag is the difference between SQUAD v1 or 2 (if you're using another dataset, it indicates if impossible
# answers are allowed or not).
squad_v2 = False
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

### 2. Load dataset
- in this case SQuAD v1 is used, which has no unanswerable questions (every question has an answer in its context)

In [2]:
from datasets import load_dataset, load_metric

In [3]:
# load squad dataset
datasets = load_dataset("squad_v2" if squad_v2 else "squad")

Reusing dataset squad (C:\Users\tanch\.cache\huggingface\datasets\squad\plain_text\1.0.0\1244d044b266a5e4dbd4174d23cb995eead372fbca31a03edc3f8a132787af41)


In [4]:
# the dataset has already been split into training and validation sets
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [5]:
# each row is retrieved as a single dictionary
datasets["train"][0]


{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'}

In [6]:
# multiple rows are also retrieved as a single dictionary, whose values are lists (one entry per row)
datasets["train"][1,2,3]


{'answers': [{'answer_start': [188], 'text': ['a copper statue of Christ']},
  {'answer_start': [279], 'text': ['the Main Building']},
  {'answer_start': [381],
   'text': ['a Marian place of prayer and reflection']}],
 'context': ['Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
  'Architecturally, the school has a Catholic character. Atop the Ma
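
The two access patterns above return different shapes: one row is a dict of values, while several rows come back as one dict whose values are lists. A minimal plain-Python sketch of that column-oriented layout follows. The helper name `rows_to_columns` and the toy rows are made up for illustration, and this is not how the datasets library is implemented internally.

```python
def rows_to_columns(rows):
    """Turn a list of row dicts into one dict of column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            # each column collects the values of that key, in row order
            columns.setdefault(key, []).append(value)
    return columns


# hypothetical toy rows, standing in for dataset rows
toy_rows = [
    {"id": "a", "question": "Who?"},
    {"id": "b", "question": "When?"},
]
batch = rows_to_columns(toy_rows)
print(batch)
```

This batched shape is also what `prepare_train_features` receives later on, which is why it indexes `examples["question"]` and `examples["context"]` as lists.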

In [7]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML
def show_random_elements(dataset, num_examples = 10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # draw num_examples distinct row indices at random
    picks = random.sample(range(len(dataset)), num_examples)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [8]:
# visualise the data in a table
show_random_elements(datasets["train"])

Unnamed: 0,answers,context,id,question,title
0,"{'answer_start': [490], 'text': ['Brazil']}","They can also be armed with non-lethal (more accurately known as ""less than lethal"" or ""less-lethal"") weaponry, particularly for riot control. Non-lethal weapons include batons, tear gas, riot control agents, rubber bullets, riot shields, water cannons and electroshock weapons. Police officers often carry handcuffs to restrain suspects. The use of firearms or deadly force is typically a last resort only to be used when necessary to save human life, although some jurisdictions (such as Brazil) allow its use against fleeing felons and escaped convicts. A ""shoot-to-kill"" policy was recently introduced in South Africa, which allows police to use deadly force against any person who poses a significant threat to them or civilians. With the country having one of the highest rates of violent crime, president Jacob Zuma states that South Africa needs to handle crime differently from other countries.",5732bcead6dcfa19001e8a9c,Where can police shoot fleeing convicts?,Police
1,"{'answer_start': [455], 'text': ['fix carbon from the air']}","A large percentage of herbivores have mutualistic gut flora that help them digest plant matter, which is more difficult to digest than animal prey. This gut flora is made up of cellulose-digesting protozoans or bacteria living in the herbivores' intestines. Coral reefs are the result of mutualisms between coral organisms and various types of algae that live inside them. Most land plants and land ecosystems rely on mutualisms between the plants, which fix carbon from the air, and mycorrhyzal fungi, which help in extracting water and minerals from the ground.",56de22074396321400ee25d3,How do plants contribute to terrestrial ecosystems?,Symbiosis
2,"{'answer_start': [163], 'text': ['Fast Patrol Craft']}","During his tour on the guided missile frigate USS Gridley, Kerry requested duty in South Vietnam, listing as his first preference a position as the commander of a Fast Patrol Craft (PCF), also known as a ""Swift boat."" These 50-foot (15 m) boats have aluminum hulls and have little or no armor, but are heavily armed and rely on speed. ""I didn't really want to get involved in the war"", Kerry said in a book of Vietnam reminiscences published in 1986. ""When I signed up for the swift boats, they had very little to do with the war. They were engaged in coastal patrolling and that's what I thought I was going to be doing."" However, his second choice of billet was on a river patrol boat, or ""PBR"", which at the time was serving a more dangerous duty on the rivers of Vietnam.",572aa3c1111d821400f38c6a,What was the formal name of 'swift boats'?,John_Kerry
3,"{'answer_start': [47], 'text': ['Sunni branch']}","Christianity, Judaism, Zoroastrianism, and the Sunni branch of Islam are officially recognized by the government, and have reserved seats in the Iranian Parliament. But the Bahá'í Faith, which is said to be the largest non-Muslim religious minority in Iran, is not officially recognized, and has been persecuted during its existence in Iran since the 19th century. Since the 1979 Revolution, the persecution of Bahais has increased with executions, the denial of civil rights and liberties, and the denial of access to higher education and employment.",57303660947a6a140053d2a8,What other branch of Islam is recognized by the Iranian government?,Iran
4,"{'answer_start': [237], 'text': ['BBC Radio 1']}","Later in 2013, West launched a tirade on Twitter directed at talk show host Jimmy Kimmel after his ABC program Jimmy Kimmel Live! ran a sketch on September 25 involving two children re-enacting West's recent interview with Zane Lowe for BBC Radio 1 in which he calls himself the biggest rock star on the planet. Kimmel reveals the following night that West called him to demand an apology shortly before taping.",56d4672e2ccc5a1400d8314a,"On what radio station did Kanye West deem himself ""the biggest rockstar on the planet""?",Kanye_West
5,"{'answer_start': [711], 'text': ['St. John's United Methodist Church']}","Beyoncé attended St. Mary's Elementary School in Fredericksburg, Texas, where she enrolled in dance classes. Her singing talent was discovered when dance instructor Darlette Johnson began humming a song and she finished it, able to hit the high-pitched notes. Beyoncé's interest in music and performing continued after winning a school talent show at age seven, singing John Lennon's ""Imagine"" to beat 15/16-year-olds. In fall of 1990, Beyoncé enrolled in Parker Elementary School, a music magnet school in Houston, where she would perform with the school's choir. She also attended the High School for the Performing and Visual Arts and later Alief Elsik High School. Beyoncé was also a member of the choir at St. John's United Methodist Church as a soloist for two years.",56d443ef2ccc5a1400d830df,What choir did Beyoncé sing in for two years?,Beyoncé
6,"{'answer_start': [174], 'text': ['Dust and scratches']}","Vinyl records do not break easily, but the soft material is easily scratched. Vinyl readily acquires a static charge, attracting dust that is difficult to remove completely. Dust and scratches cause audio clicks and pops. In extreme cases, they can cause the needle to skip over a series of grooves, or worse yet, cause the needle to skip backwards, creating a ""locked groove"" that repeats over and over. This is the origin of the phrase ""like a broken record"" or ""like a scratched record"", which is often used to describe a person or thing that continually repeats itself. Locked grooves are not uncommon and were even heard occasionally in radio broadcasts.",5727e4fd2ca10214002d98eb,What is the cause of lock grooves on vinyl records?,Gramophone_record
7,"{'answer_start': [304], 'text': ['Tartu']}","Estonia co-operates with Latvia and Lithuania in several trilateral Baltic defence co-operation initiatives, including Baltic Battalion (BALTBAT), Baltic Naval Squadron (BALTRON), Baltic Air Surveillance Network (BALTNET) and joint military educational institutions such as the Baltic Defence College in Tartu. Future co-operation will include sharing of national infrastructures for training purposes and specialisation of training areas (BALTTRAIN) and collective formation of battalion-sized contingents for use in the NATO rapid-response force. In January 2011 the Baltic states were invited to join NORDEFCO, the defence framework of the Nordic countries.",5728c1523acd2414000dfda9,Where is the Baltic Defence College located?,Estonia
8,"{'answer_start': [310], 'text': ['Kathmandu International Theater Festival']}","Kathmandu is home to Nepali cinema and theaters. The city contains several theaters, including the National Dance Theatre in Kanti Path, the Ganga Theatre, the Himalayan Theatre and the Aarohan Theater Group founded in 1982. The M. Art Theater is based in the city. The Gurukul School of Theatre organizes the Kathmandu International Theater Festival, attracting artists from all over the world. A mini theater is also located at the Hanumandhoka Durbar Square, established by the Durbar Conservation and Promotion Committee.",5735c421dc94161900571ffd,What gathering is the work of the Gurukul School of Theatre?,Kathmandu
9,"{'answer_start': [3], 'text': ['1945']}","In 1945, the British entrepreneur J. Arthur Rank, hoping to expand his American presence, bought into a four-way merger with Universal, the independent company International Pictures, and producer Kenneth Young. The new combine, United World Pictures, was a failure and was dissolved within one year. Rank and International remained interested in Universal, however, culminating in the studio's reorganization as Universal-International. William Goetz, a founder of International, was made head of production at the renamed Universal-International Pictures Inc., which also served as an import-export subsidiary, and copyright holder for the production arm's films. Goetz, a son-in-law of Louis B. Mayer decided to bring ""prestige"" to the new company. He stopped the studio's low-budget production of B movies, serials and curtailed Universal's horror and ""Arabian Nights"" cycles. Distribution and copyright control remained under the name of Universal Pictures Company Inc.",56e161c3e3433e1400422e30,In what year was United World Pictures founded?,Universal_Studios


### 3. Instantiate tokenizers:
- the tokenizer splits text into subword tokens and maps each token to its ID in the model's vocabulary

In [9]:
# instantiate the tokenizer
# note that different models require different tokenizers
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [10]:
# check that the tokenizer we instantiated is a fast tokenizer, because we need its special features (e.g. offset mappings)
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

- these models have fast tokenizers
    - https://huggingface.co/transformers/index.html#bigtable

In [11]:
# we can tokenise a question and answer pair in a single call
# different tokenizers will give different tokens
tokenized_output = tokenizer("What is your name?", "My name is Sylvain.")
tokenized_output

{'input_ids': [101, 2054, 2003, 2115, 2171, 1029, 102, 2026, 2171, 2003, 25353, 22144, 2378, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [12]:
# we can also tokenise one sentence
# notice the word is split into multiple subwords
tokenizer("Sylvain")

{'input_ids': [101, 25353, 22144, 2378, 102], 'attention_mask': [1, 1, 1, 1, 1]}
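
To make the subword split above concrete, here is a minimal sketch of greedy longest-match-first splitting, the strategy WordPiece-style tokenizers use at inference time. The vocabulary `toy_vocab` and the function name `wordpiece` are hypothetical; the real tokenizer has its own vocabulary and normalisation steps, so its pieces for "Sylvain" may differ from the ones shown here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercased word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# hypothetical vocabulary, for illustration only
toy_vocab = {"name", "sy", "##lva", "##in"}
print(wordpiece("sylvain", toy_vocab))
print(wordpiece("name", toy_vocab))
```

A word that is in the vocabulary stays whole, while a rare word is covered by several smaller pieces, each of which has its own ID.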

In [13]:
# in decoding, we see that special tokens were automatically added - though we can specify otherwise
tokenizer.decode(tokenized_output['input_ids'])

'[CLS] what is your name? [SEP] my name is sylvain. [SEP]'

### 4a. Splitting long documents
- long documents need to be split into smaller passages so that each input fits within the model's maximum input length (max_length)

In [14]:
max_length = 384 # maximum number of TOKENS (not characters) in one feature, question and context combined
doc_stride = 128 # number of tokens shared by consecutive passages, so an answer cut by one split can still appear whole in a neighbouring passage
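
How `max_length` and `doc_stride` interact can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation: the function name `split_context` is made up, and it assumes three special tokens per feature ([CLS] and two [SEP]) with the question kept whole in every window.

```python
def split_context(context_tokens, question_len, max_length, stride, num_special=3):
    """Split context tokens into windows that overlap by `stride` tokens."""
    # room left for context once the question and special tokens are counted
    capacity = max_length - question_len - num_special
    if capacity <= stride:
        raise ValueError("stride must be smaller than the context capacity")
    windows = []
    start = 0
    while True:
        end = min(start + capacity, len(context_tokens))
        windows.append(context_tokens[start:end])
        if end == len(context_tokens):
            break
        # step forward, keeping the last `stride` tokens of this window
        start += capacity - stride
    return windows


# toy numbers: 20 context tokens, 2 question tokens, max_length 13, stride 3
windows = split_context(list(range(20)), question_len=2, max_length=13, stride=3)
for window in windows:
    print(window)
```

Each window holds at most 8 context tokens here, and every window starts with the last 3 tokens of the one before it.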

In [15]:
# the following is an example of a long document that needs to be split into smaller passages
for i, example in enumerate(datasets["train"]):
    # tokenize once and compare against max_length instead of a hard-coded 384
    num_tokens = len(tokenizer(example["question"], example["context"])["input_ids"])
    if num_tokens > max_length:
        print("Number of tokens ", num_tokens)
        break
example = datasets["train"][i]
example

Number of tokens  396


{'answers': {'answer_start': [30], 'text': ['over 1,600']},
 'context': "The men's basketball team has over 1,600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 NCAA tournaments. Former player Austin Carr holds the record for most points scored in a single game of the tournament with 61. Although the team has never won the NCAA Tournament, they were named by the Helms Athletic Foundation as national champions twice. The team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending UCLA's record 88-game winning streak in 1974. The team has beaten an additional eight number-one teams, and those nine wins rank second, to UCLA's 10, all-time in wins against the top team. The team plays in newly renovated Purcell Pavilion (within the Edmund P. Joyce Center), which reopened for the beginning of the 2009–2010 season. The team is coached by Mike Brey, who, as of the 2014–15 season, his fifteenth at Notre Dame, has ac

In [16]:
# with truncation="only_second" only the context (the second sequence) is truncated; notice the closing phrase "the most by the Fighting Irish team since 1908-09." was cut off
truncated_example = tokenizer(example["question"], example["context"], max_length=max_length, truncation="only_second")["input_ids"]
print("Number of tokens: ", len(truncated_example))
print(tokenizer.decode(truncated_example))

Number of tokens:  384
[CLS] how many wins does the notre dame men's basketball team have? [SEP] the men's basketball team has over 1, 600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 ncaa tournaments. former player austin carr holds the record for most points scored in a single game of the tournament with 61. although the team has never won the ncaa tournament, they were named by the helms athletic foundation as national champions twice. the team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending ucla's record 88 - game winning streak in 1974. the team has beaten an additional eight number - one teams, and those nine wins rank second, to ucla's 10, all - time in wins against the top team. the team plays in newly renovated purcell pavilion ( within the edmund p. joyce center ), which reopened for the beginning of the 2009 – 2010 season. the team is coached by mike brey, who, as of the 2014 – 15 season

In [17]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride
)

In [18]:
# this long document was split into two shorter documents, each with at most "max_length" tokens
# notice the length of the overlap is indeed "doc_stride"/"stride"
for ids in tokenized_example['input_ids']:
    print("Length: ",len(ids))
    print(tokenizer.decode(ids))
    print()
overlap = "championship. the 2010 – 11 team concluded its regular season ranked number seven in the country, with a record of 25 – 5, brey's fifth straight 20 - win season, and a second - place finish in the big east. during the 2014 - 15 season, the team went 32 - 6 and won the acc conference tournament, later advancing to the elite 8, where the fighting irish lost on a missed buzzer - beater against then undefeated kentucky. led by nba draft picks jerian grant and pat connaughton, the fighting irish beat the eventual national champion duke blue devils twice during the season. the 32 wins were"
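# count only the overlap's own tokens, without the [CLS] and [SEP] tokens the tokenizer would add
print("Length of overlap: ", len(tokenizer(overlap, add_special_tokens=False)['input_ids']))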
overlap

Length:  384
[CLS] how many wins does the notre dame men's basketball team have? [SEP] the men's basketball team has over 1, 600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 ncaa tournaments. former player austin carr holds the record for most points scored in a single game of the tournament with 61. although the team has never won the ncaa tournament, they were named by the helms athletic foundation as national champions twice. the team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending ucla's record 88 - game winning streak in 1974. the team has beaten an additional eight number - one teams, and those nine wins rank second, to ucla's 10, all - time in wins against the top team. the team plays in newly renovated purcell pavilion ( within the edmund p. joyce center ), which reopened for the beginning of the 2009 – 2010 season. the team is coached by mike brey, who, as of the 2014 – 15 season, his fift

"championship. the 2010 – 11 team concluded its regular season ranked number seven in the country, with a record of 25 – 5, brey's fifth straight 20 - win season, and a second - place finish in the big east. during the 2014 - 15 season, the team went 32 - 6 and won the acc conference tournament, later advancing to the elite 8, where the fighting irish lost on a missed buzzer - beater against then undefeated kentucky. led by nba draft picks jerian grant and pat connaughton, the fighting irish beat the eventual national champion duke blue devils twice during the season. the 32 wins were"

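### 4b. prepare_train_features
- This function splits long documents into overlapping features and labels each feature with the start and end token positions of the answer (or with the [CLS] index when the answer lies outside that feature)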

In [19]:
def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    pad_on_right = tokenizer.padding_side == "right"
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
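
The labelling step inside the function can be tried in isolation on hand-made data. The sketch below repeats the same search over offsets for a single feature, assuming the context is the second sequence. The name `char_span_to_token_span`, the toy offsets and the toy `sequence_ids` are made up for illustration; in the real function they come from the tokenizer.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char, cls_index=0):
    """Map an answer's character span to (start_token, end_token) indices."""
    # first and last token that belong to the context (sequence id 1)
    token_start = 0
    while sequence_ids[token_start] != 1:
        token_start += 1
    token_end = len(sequence_ids) - 1
    while sequence_ids[token_end] != 1:
        token_end -= 1

    # answer not fully inside this feature: label it with the CLS index
    if not (offsets[token_start][0] <= start_char and offsets[token_end][1] >= end_char):
        return cls_index, cls_index

    # walk inwards until the tokens tightly enclose the answer
    while token_start < len(offsets) and offsets[token_start][0] <= start_char:
        token_start += 1
    while offsets[token_end][1] >= end_char:
        token_end -= 1
    return token_start - 1, token_end + 1


# toy feature: [CLS] who ? [SEP] it was saint bern ##adette [SEP]
# context string: "it was saint bernadette", answer "saint bernadette" at chars 7..23
toy_offsets = [(0, 0), (0, 3), (3, 4), (0, 0),
               (0, 2), (3, 6), (7, 12), (13, 17), (17, 23), (0, 0)]
toy_sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, 1, None]

span = char_span_to_token_span(toy_offsets, toy_sequence_ids, start_char=7, end_char=23)
outside = char_span_to_token_span(toy_offsets, toy_sequence_ids, start_char=30, end_char=35)
print(span, outside)
```

Tokens 6 to 8 are "saint", "bern" and "##adette", so the span covers exactly the answer. The second call asks for characters that this feature does not contain, so both positions fall back to the CLS index.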

In [22]:
# after applying the function, notice the answer is still inside the context, which is what we need
# now we have
# 1. tokenized ids
# 2. start position (token index of the first answer token)
# 3. end position (token index of the last answer token)
print("Answer:",datasets['train'][0:1]['answers'][0]['text'])
tokenized_example = prepare_train_features(datasets['train'][0:1])

print("start_positions:",tokenized_example["start_positions"])
print("end_positions:",tokenized_example["end_positions"])


tokenizer.decode(tokenized_example['input_ids'][0])


Answer: ['Saint Bernadette Soubirous']
start_positions: [130]
end_positions: [137]


'[CLS] to whom did the virgin mary allegedly appear in 1858 in lourdes france? [SEP] architecturally, the school has a catholic character. atop the main building\'s gold dome is a golden statue of the virgin mary. immediately in front of the main building and facing it, is a copper statue of christ with arms upraised with the legend " venite ad me omnes ". next to the main building is the basilica of the sacred heart. immediately behind the basilica is the grotto, a marian place of prayer and reflection. it is a replica of the grotto at lourdes, france where the virgin mary reputedly appeared to saint bernadette soubirous in 1858. at the end of the main drive ( and in a direct line that connects through 3 statues and the gold dome ), is a simple, modern stone statue of mary. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD

In [23]:
# notice one long document can produce more than one feature (sample)
# so we have more training samples after applying the function
i = 249 
print("Answer:",datasets['train'][i:i+1]['answers'][0]['text'])
# note: the two prints below run before prepare_train_features is called on this example,
# so they still show the positions computed in the previous cell
print("start_positions:",tokenized_example["start_positions"])
print("end_positions:",tokenized_example["end_positions"])

tokenized_example = prepare_train_features(datasets['train'][i:i+1])

print(tokenizer.decode(tokenized_example['input_ids'][0]))
print(tokenizer.decode(tokenized_example['input_ids'][1]))


Answer: ['over 1,600']
start_positions: [130]
end_positions: [137]
[CLS] how many wins does the notre dame men's basketball team have? [SEP] the men's basketball team has over 1, 600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 ncaa tournaments. former player austin carr holds the record for most points scored in a single game of the tournament with 61. although the team has never won the ncaa tournament, they were named by the helms athletic foundation as national champions twice. the team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending ucla's record 88 - game winning streak in 1974. the team has beaten an additional eight number - one teams, and those nine wins rank second, to ucla's 10, all - time in wins against the top team. the team plays in newly renovated purcell pavilion ( within the edmund p. joyce center ), which reopened for the beginning of the 2009 – 2010 season. the team is coached b

In [24]:
# more samples have been produced due to the splitting function
# the datasets library caches the result of map - the processing runs only once and subsequent runs load the cached data
tokenized_datasets = datasets.map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)
tokenized_datasets

Loading cached processed dataset at C:\Users\tanch\.cache\huggingface\datasets\squad\plain_text\1.0.0\1244d044b266a5e4dbd4174d23cb995eead372fbca31a03edc3f8a132787af41\cache-8e21f5a34da7220b.arrow
Loading cached processed dataset at C:\Users\tanch\.cache\huggingface\datasets\squad\plain_text\1.0.0\1244d044b266a5e4dbd4174d23cb995eead372fbca31a03edc3f8a132787af41\cache-2d9c358a11c9b795.arrow


DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'end_positions', 'input_ids', 'start_positions'],
        num_rows: 88524
    })
    validation: Dataset({
        features: ['attention_mask', 'end_positions', 'input_ids', 'start_positions'],
        num_rows: 10784
    })
})
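
The row counts before and after preprocessing show how many extra features the splitting produced. A small arithmetic check, using only the `num_rows` values printed in this notebook:

```python
# num_rows reported by the raw and the tokenized DatasetDict above
rows_before = {"train": 87599, "validation": 10570}
rows_after = {"train": 88524, "validation": 10784}

# every extra row is an additional feature created from a long context
extra = {split: rows_after[split] - rows_before[split] for split in rows_before}
print(extra)
```

So only a small fraction of the examples were long enough to be split into more than one feature.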

- using prepare_train_features(), the QA pairs are now in the input format expected by BERT-style models:

In [25]:
# the resulting output after applying prepare_train_features()
# most important features are
# 1. tokenized ids
# 2. start position
# 3. end position
show_random_elements(tokenized_datasets["train"],3)

Unnamed: 0,attention_mask,end_positions,input_ids,start_positions
0,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]",33,"[101, 2073, 2001, 1996, 5871, 9841, 5361, 1029, 102, 1037, 5871, 2598, 2276, 2007, 1037, 1021, 1012, 1020, 1011, 7924, 1006, 2423, 3027, 1007, 5871, 9841, 5361, 1999, 2960, 2012, 1996, 7987, 2401, 2869, 2003, 1996, 2069, 2248, 4434, 4346, 5871, 6971, 2083, 13420, 16846, 3963, 2581, 2000, 18071, 2479, 1998, 1996, 2142, 2983, 1012, 2144, 2035, 2248, 7026, 1998, 4274, 4806, 2024, 18345, 2006, 2023, 2309, 5871, 4957, 2119, 4274, 1998, 7026, 2326, 2024, 3395, 2000, 3103, 2041, 13923, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]",30
1,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]",86,"[101, 2040, 4520, 3265, 3014, 5918, 2012, 7855, 1029, 102, 1996, 2176, 1011, 2095, 1010, 2440, 1011, 2051, 8324, 2565, 8681, 1996, 3484, 1997, 10316, 2015, 2012, 1996, 2118, 1998, 20618, 7899, 1999, 1996, 2840, 1998, 4163, 1010, 4606, 1996, 22797, 1997, 3330, 1010, 8083, 1010, 4807, 1010, 2189, 1010, 1998, 2495, 1012, 2348, 1037, 3192, 1999, 1996, 4314, 2840, 1998, 4163, 2003, 3223, 1999, 2035, 15279, 1010, 2045, 2003, 2053, 3223, 2691, 4563, 8882, 1025, 3265, 3014, 5918, 2024, 2275, 2011, 1996, 4513, 1997, 2169, 2082, 1012, 7855, 1005, 1055, 2440, 1011, 2051, 8324, 1998, 4619, 3454, 5452, 2006, ...]",82
2,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]",96,"[101, 2129, 2001, 1996, 2607, 1997, 8387, 2628, 1029, 102, 2045, 2003, 2788, 2019, 12407, 2005, 1037, 3563, 8720, 1997, 2019, 16514, 4005, 2069, 2043, 2107, 8720, 2064, 4681, 1999, 1996, 3949, 2030, 9740, 1997, 1996, 4295, 1010, 2030, 2000, 5083, 3716, 1997, 1996, 2607, 1997, 2019, 7355, 3188, 2000, 1996, 2458, 1997, 4621, 17261, 2030, 4652, 8082, 5761, 1012, 2005, 2742, 1010, 1999, 1996, 2220, 3865, 1010, 3188, 2000, 1996, 3311, 1997, 17207, 2102, 2005, 1996, 3949, 1997, 8387, 1010, 1996, 2607, 1997, 1996, 4295, 2001, 4876, 2628, 2011, 8822, 1996, 5512, 1997, 5776, 2668, 8168, 1010, 2130, 2295, ...]",90
