In [1]:
!pip install transformers
!pip install datasets


In [2]:
import operator
import pandas as pd
import tensorflow as tf
import transformers

from datasets import load_dataset
from tensorflow import keras
from transformers import AutoTokenizer
from transformers import pipeline
from transformers import TFAutoModelForQuestionAnswering

We'll use the **pipeline** (note the singular) abstraction, which wraps all the other task-specific pipelines. Put simply, it'll be our interface for a range of NLP tasks.

Using the **pipeline** abstraction is easy. We can instantiate a pipeline with a particular task, and it'll automatically download a suitable tokenizer and model behind the scenes for us and take care of the input and output operations.<br>
https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline<br>



Here, we're retrieving a pipeline for text-classification.

In [3]:
classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

Note the warning message about how no model was supplied. When we instantiate a pipeline for a task without specifying a particular model to perform the task, **Transformers** uses a default model. This is good enough for prototyping but for production, we'll want to specify which model to use for the task since the default can change. We'll see how to do this further below.

We can use the pipeline immediately to classify some text. Tokenization, vectorization, etc. are taken care of behind the scenes.

In [4]:
classifier("Alice was excited to go the island but it didn't live up to the hype.")

[{'label': 'NEGATIVE', 'score': 0.9993934631347656}]

In [5]:
classifier("Bob doesn't do well in group situations but he said it wasn't bad.")

[{'label': 'POSITIVE', 'score': 0.9946909546852112}]

Next, we retrieve a pipeline for summarization.

In [6]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [7]:
text = """
Hans Niemann is launching a counterattack in his dispute with chess world
champion Magnus Carlsen, filing a federal lawsuit that accuses Carlsen of
maliciously colluding with others to defame the 19-year-old grandmaster and
ruin his career.

It's the latest move in a scandal that has injected unprecedented levels of
drama into the world of elite chess since early September, when Carlsen
suggested Niemann's upset victory over him at the Sinquefield Cup tournament
in St. Louis was the result of cheating.

Niemann wants a federal court in Missouri's eastern district to award him at
least $100 million in damages. Defendants in the lawsuit include Carlsen, his
company Play Magnus Group, the online platform Chess.com and its leader, Danny
Rensch, along with grandmaster Hikaru Nakamura.
"""

In [8]:
summarizer(text)

[{'summary_text': ' Chess grandmaster Hans Niemann files federal lawsuit against Magnus Carlsen . He accuses Carlsen of colluding with others to defame the 19-year-old grandmaster and ruin his career . Defendants in the lawsuit include Carlsen, his company Play Magnus Group, the online platform Chess.com and its leader .'}]

And here's a pipeline for question answering.

In [9]:
qa = pipeline("question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [10]:
context="""
Hugging Face was founded in 2016 by Clément Delangue, Julien Chaumond, and
Thomas Wolf originally as a company that developed a chatbot app targeted at
teenagers.[2] After open-sourcing the model behind the chatbot, the company
pivoted to focus on being a platform for democratizing machine learning. In March
2021, Hugging Face raised $40 million in a Series B funding round.
"""

question = "Who are the Hugging Face founders?"

qa(question=question, context=context)

{'score': 0.9919217228889465,
 'start': 37,
 'end': 87,
 'answer': 'Clément Delangue, Julien Chaumond, and\nThomas Wolf'}

Extractive question-answering models work well for certain domains, document structures, and questions. But questions that require reasoning or more complex parsing, or that contain ambiguity, can trip them up, as the low score in the next example shows.

In [11]:
question = "What does Hugging Face do?"
qa(question=question, context=context)

{'score': 0.08730394393205643,
 'start': 117,
 'end': 162,
 'answer': 'developed a chatbot app targeted at\nteenagers'}
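
The low score on the second answer suggests a simple guard: treat low-confidence answers as unanswered. This is a minimal sketch that assumes result dictionaries shaped like the ones printed above. The helper name *accept_answer* and the 0.5 threshold are hypothetical choices for illustration, not something the library prescribes.

```python
def accept_answer(result, min_score=0.5):
    """Return the answer text if its score clears the threshold, else None.

    `result` is a dict shaped like the question-answering output above.
    The default threshold of 0.5 is an arbitrary, illustrative value.
    """
    if result["score"] >= min_score:
        return result["answer"]
    return None


# Results copied from the two question-answering cells above.
founders = {"score": 0.9919217228889465, "start": 37, "end": 87,
            "answer": "Clément Delangue, Julien Chaumond, and\nThomas Wolf"}
activity = {"score": 0.08730394393205643, "start": 117, "end": 162,
            "answer": "developed a chatbot app targeted at\nteenagers"}

print(accept_answer(founders))  # keeps the high-confidence answer
print(accept_answer(activity))  # rejects the low-confidence answer
```

A suitable threshold depends on the model and the data, so it is worth tuning on held-out examples rather than trusting a fixed value.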

There are ready-made pipelines for a number of tasks:<br>
https://huggingface.co/docs/transformers/main/en/quicktour#pipeline

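To use a specific model, pass its name to **pipeline**. Here, we load a BERT model fine-tuned for named entity recognition (NER).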

In [12]:
ner = pipeline(model="dslim/bert-base-NER")


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [13]:
text = "Panic ensues in Redmond as love child of Microsoft and OpenAI declares humanity obsolete."
ner(text)

[{'entity': 'B-PER',
  'score': 0.9993875,
  'index': 6,
  'word': 'Red',
  'start': 16,
  'end': 19},
 {'entity': 'I-PER',
  'score': 0.80496955,
  'index': 7,
  'word': '##mond',
  'start': 19,
  'end': 23},
 {'entity': 'B-ORG',
  'score': 0.9980654,
  'index': 12,
  'word': 'Microsoft',
  'start': 41,
  'end': 50},
 {'entity': 'B-ORG',
  'score': 0.9985505,
  'index': 14,
  'word': 'Open',
  'start': 55,
  'end': 59},
 {'entity': 'I-ORG',
  'score': 0.98842865,
  'index': 15,
  'word': '##A',
  'start': 59,
  'end': 60},
 {'entity': 'I-ORG',
  'score': 0.9739822,
  'index': 16,
  'word': '##I',
  'start': 60,
  'end': 61}]
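
The raw output tags individual word pieces ("Red", "##mond") rather than whole entities. As a rough sketch of how the pieces could be merged, the hypothetical helper below starts a new group at every B- tag, extends it on I- tags, and slices the entity text out of the original string using the character offsets. It is a simplification for illustration, not the library's own grouping logic; the pipeline also accepts an *aggregation_strategy* argument that can do this grouping for you.

```python
def group_entities(tokens, text):
    """Merge B-/I- tagged word pieces into (label, entity_text) pairs.

    A new group starts at every B- tag; an I- tag extends the current group.
    The group's label is taken from the tag that opened it.
    """
    groups = []
    for tok in tokens:
        prefix, label = tok["entity"].split("-", 1)
        if prefix == "B" or not groups:
            groups.append({"label": label, "start": tok["start"], "end": tok["end"]})
        else:
            groups[-1]["end"] = tok["end"]
    return [(g["label"], text[g["start"]:g["end"]]) for g in groups]


text = "Panic ensues in Redmond as love child of Microsoft and OpenAI declares humanity obsolete."

# Entity tags and character offsets copied from the output above.
ner_output = [
    {"entity": "B-PER", "start": 16, "end": 19},
    {"entity": "I-PER", "start": 19, "end": 23},
    {"entity": "B-ORG", "start": 41, "end": 50},
    {"entity": "B-ORG", "start": 55, "end": 59},
    {"entity": "I-ORG", "start": 59, "end": 60},
    {"entity": "I-ORG", "start": 60, "end": 61},
]

entities = group_entities(ner_output, text)
print(entities)
```

Grouping makes the model's mistake easier to spot: "Redmond" comes back labelled as a person rather than a location.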

The **Transformers** library also provides helper classes for training models. Beyond the model hub, Hugging Face hosts datasets, provides *spaces* where you can host your app, and offers services such as cloud hardware and inference endpoints to help deploy your model.<br>
Datasets: https://huggingface.co/datasets<br>
Spaces: https://huggingface.co/spaces<br>

With Hugging Face, you can build an ML app prototype within minutes and iterate quickly from there.<br>
https://huggingface.co/docs<br>

Learn more about how to build with Hugging Face through their free course and fantastic book:<br>
https://huggingface.co/course<br>
https://www.oreilly.com/library/view/natural-language-processing/9781098136789/


### Fine-tuning a pre-trained model

In [14]:
data = load_dataset("squad")


In [15]:
data

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [16]:
pd.DataFrame(data['train'][0, 1, 2, 100, 101, 102],
             columns=["context", "question", "answers"])

| | context | question | answers |
|---|---|---|---|
| 0 | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | {'text': ['Saint Bernadette Soubirous'], 'answ... |
| 1 | Architecturally, the school has a Catholic cha... | What is in front of the Notre Dame Main Building? | {'text': ['a copper statue of Christ'], 'answe... |
| 2 | Architecturally, the school has a Catholic cha... | The Basilica of the Sacred heart at Notre Dame... | {'text': ['the Main Building'], 'answer_start'... |
| 3 | One of the main driving forces in the growth o... | In what year did the team lead by Knute Rockne... | {'text': ['1925'], 'answer_start': [354]} |
| 4 | One of the main driving forces in the growth o... | How many years was Knute Rockne head coach at ... | {'text': ['13'], 'answer_start': [251]} |
| 5 | One of the main driving forces in the growth o... | How many national titles were won when Knute R... | {'text': ['three'], 'answer_start': [274]} |


Here's what we need to do:
1. Choose a pre-trained model based on what we want to accomplish and our constraints.
2. Download the appropriate tokenizer for the pre-trained model.
3. Tokenize and vectorize our dataset.
4. Mark where each answer starts and ends in our vectorized dataset.
5. Download the pre-trained model.
6. Fine-tune the pre-trained model with the vectorized dataset.

We'll use the AutoTokenizer class to get the right tokenizer for *distilroberta-base*.

In [17]:
model_name = 'distilroberta-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [18]:
t = "Where can I find a pizzeria?"
print(tokenizer.encode(t))

[0, 13841, 64, 38, 465, 10, 26432, 6971, 116, 2]


The *encode* method returns only the token ids. To get the full encoding, we call the tokenizer object directly (i.e. using \_\_call\_\_).

This returns the token ids and an attention mask in a BatchEncoding object.

In [19]:
encoded_t = tokenizer(t)
print(encoded_t)

{'input_ids': [0, 13841, 64, 38, 465, 10, 26432, 6971, 116, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


Converting the ids back to tokens, as in the next cell, shows that the tokenizer added a start-of-sequence token (\<s\>) and an end-of-sequence token (\</s\>), and that it uses Ġ to signal that a word has preceding whitespace. Keep in mind that what you're seeing here is the output from the *distilroberta-base* tokenizer. Other tokenizers may work differently.

In [20]:
print(tokenizer.convert_ids_to_tokens(encoded_t['input_ids']))

['<s>', 'Where', 'Ġcan', 'ĠI', 'Ġfind', 'Ġa', 'Ġpizz', 'eria', '?', '</s>']
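
To make the role of Ġ concrete, here is a small sketch that rebuilds the original sentence from the token strings above. The helper *tokens_to_text* is hypothetical and handles only the space marker; the tokenizer's own *decode* method deals with the full byte-level mapping and should be preferred in practice.

```python
def tokens_to_text(tokens, special_tokens=("<s>", "</s>", "<pad>")):
    """Rebuild text from byte-level BPE tokens where 'Ġ' marks a leading space.

    Special tokens are dropped; everything else is concatenated as-is.
    """
    pieces = [t for t in tokens if t not in special_tokens]
    return "".join(pieces).replace("Ġ", " ")


# Token strings copied from the output above.
tokens = ['<s>', 'Where', 'Ġcan', 'ĠI', 'Ġfind', 'Ġa', 'Ġpizz', 'eria', '?', '</s>']
restored = tokens_to_text(tokens)
print(restored)
```

Note how "Ġpizz" and "eria" join without a space: only tokens that start a new word carry the marker.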


In [21]:
encoded_pair = tokenizer("this is a question", "this is the context")
print(encoded_pair)

{'input_ids': [0, 9226, 16, 10, 864, 2, 2, 9226, 16, 5, 5377, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


When given a pair of texts, the *distilroberta-base* tokenizer separates them with a double \</s\>\</s\>.

In [22]:
print(tokenizer.convert_ids_to_tokens(encoded_pair['input_ids']))

['<s>', 'this', 'Ġis', 'Ġa', 'Ġquestion', '</s>', '</s>', 'this', 'Ġis', 'Ġthe', 'Ġcontext', '</s>']
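
The layout of an encoded pair can be sketched in a few lines. The helper *build_pair* below is hypothetical and works on token strings rather than ids; it only mirrors the layout visible in the output above. The second list it returns marks which text each token came from (None for special tokens), in the style of the *sequence_ids* method on the tokenizer's output.

```python
def build_pair(first, second, bos="<s>", eos="</s>"):
    """Lay out two token lists the way the output above shows:
    <s> first </s></s> second </s>
    """
    tokens = [bos] + first + [eos, eos] + second + [eos]
    # None marks special tokens, 0 the first text, 1 the second text.
    sequence_ids = ([None] + [0] * len(first) + [None, None]
                    + [1] * len(second) + [None])
    return tokens, sequence_ids


question_tokens = ['this', 'Ġis', 'Ġa', 'Ġquestion']
context_tokens = ['this', 'Ġis', 'Ġthe', 'Ġcontext']

pair_tokens, pair_sequence_ids = build_pair(question_tokens, context_tokens)
print(pair_tokens)
print(pair_sequence_ids)
```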


Most of the tokenizers in the Transformers library come in two versions: a Python implementation and a faster Rust implementation. When available, AutoTokenizer will download the fast version.

In [23]:
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [24]:
context = "Sarah went to The Mirthless Cafe last night to meet her friend."
question = "Where did Sarah go?"

# The answer span and the answer's starting character position in the context.
answer = "The Mirthless Cafe"
answer_start = 14

In [25]:
x = tokenizer(question, context)
x

{'input_ids': [0, 13841, 222, 4143, 213, 116, 2, 2, 33671, 439, 7, 20, 256, 24208, 1672, 16542, 94, 363, 7, 972, 69, 1441, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [26]:
tokenizer.batch_decode(x['input_ids'])

['<s>',
 'Where',
 ' did',
 ' Sarah',
 ' go',
 '?',
 '</s>',
 '</s>',
 'Sarah',
 ' went',
 ' to',
 ' The',
 ' M',
 'irth',
 'less',
 ' Cafe',
 ' last',
 ' night',
 ' to',
 ' meet',
 ' her',
 ' friend',
 '.',
 '</s>']

When we tokenize our dataset, some question/context pairs will probably exceed our model's maximum sequence length. In RoBERTa's case, that's 512 tokens.

Let's say the maximum sequence length we can handle is 15, so we truncate the context.

In [27]:
example_max_length = 15
x = tokenizer(question, context, max_length=example_max_length,
              truncation="only_second")
x

{'input_ids': [0, 13841, 222, 4143, 213, 116, 2, 2, 33671, 439, 7, 20, 256, 24208, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The problem here is that the answer span gets chopped off by truncation. In other situations, the answer may not be included at all.

In [28]:
tokenizer.batch_decode(x['input_ids'])

['<s>',
 'Where',
 ' did',
 ' Sarah',
 ' go',
 '?',
 '</s>',
 '</s>',
 'Sarah',
 ' went',
 ' to',
 ' The',
 ' M',
 'irth',
 '</s>']

To ensure we tokenize all context tokens while respecting a maximum length, we can set *return_overflowing_tokens* to **True**. The effect is to split the input into multiple question/context sequences, with each context sequence continuing where the previous one left off. Since the last one may be shorter than the maximum length, we also set *padding* to "max_length" so that every sequence is padded on the right to the same length.<br>

In [29]:
x = tokenizer(question, context, max_length=example_max_length,
              truncation="only_second", return_overflowing_tokens=True,
              padding="max_length")
x

{'input_ids': [[0, 13841, 222, 4143, 213, 116, 2, 2, 33671, 439, 7, 20, 256, 24208, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 1672, 16542, 94, 363, 7, 972, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 69, 1441, 4, 2, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]], 'overflow_to_sample_mapping': [0, 0, 0]}

In [30]:
len(x['input_ids'])

3

In [31]:
tokenizer.batch_decode(x['input_ids'])

['<s>Where did Sarah go?</s></s>Sarah went to The Mirth</s>',
 '<s>Where did Sarah go?</s></s>less Cafe last night to meet</s>',
 '<s>Where did Sarah go?</s></s> her friend.</s><pad><pad><pad>']
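
Padding and the attention mask go hand in hand: padded positions get a mask value of 0 so the model ignores them. The sketch below reproduces the last sequence from the output above. The helper *pad_to_length* is hypothetical, and the pad id of 1 is read off that output rather than taken from the tokenizer's configuration.

```python
def pad_to_length(input_ids, max_length, pad_id=1):
    """Right-pad a list of token ids and build the matching attention mask."""
    n_pad = max_length - len(input_ids)
    if n_pad < 0:
        raise ValueError("sequence is longer than max_length")
    padded = input_ids + [pad_id] * n_pad
    # 1 for real tokens, 0 for padding.
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    return padded, attention_mask


# Unpadded ids of the third sequence, copied from the output above.
last_sequence = [0, 13841, 222, 4143, 213, 116, 2, 2, 69, 1441, 4, 2]

padded_ids, mask = pad_to_length(last_sequence, max_length=15)
print(padded_ids)
print(mask)
```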

If we tokenize two question/context pairs, *overflow_to_sample_mapping* reflects which pair each sequence came from.

In [32]:
tokenizer(['question 1', 'question 2'],
          ['context 1', 'context 2'],
          return_overflowing_tokens=True)

{'input_ids': [[0, 40018, 112, 2, 2, 46796, 112, 2], [0, 40018, 132, 2, 2, 46796, 132, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]], 'overflow_to_sample_mapping': [0, 1]}
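
One way to read *overflow_to_sample_mapping* is as a lookup from each output sequence back to the example that produced it. A minimal sketch, with the hypothetical helper *chunks_per_sample* inverting that lookup for the two mappings seen so far:

```python
from collections import defaultdict


def chunks_per_sample(overflow_to_sample_mapping):
    """Group output sequence indices by the input example they came from."""
    groups = defaultdict(list)
    for chunk_idx, sample_idx in enumerate(overflow_to_sample_mapping):
        groups[sample_idx].append(chunk_idx)
    return dict(groups)


# One long pair split into three sequences (the earlier example).
single_pair = chunks_per_sample([0, 0, 0])
# Two short pairs, one sequence each (the output above).
two_pairs = chunks_per_sample([0, 1])

print(single_pair)
print(two_pairs)
```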

But there's still a problem with the split sequences from our example: none of them contains the full answer ("The Mirthless Cafe"). Right now, the answer is split across two sequences.

To counter this, we can tokenize our question/context pair into overlapping sequences by setting a stride length.

In [33]:
stride = 5
x = tokenizer(question, context, max_length=example_max_length,
              truncation="only_second", return_overflowing_tokens=True,
              stride=stride, padding="max_length")

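The stride is the number of tokens that consecutive context sequences share. With a stride of 5 and room for only 6 context tokens per sequence, each sequence overlaps the previous one by 5 tokens and therefore starts just one token further along.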

In [34]:
tokenizer.batch_decode(x['input_ids'])

['<s>Where did Sarah go?</s></s>Sarah went to The Mirth</s>',
 '<s>Where did Sarah go?</s></s> went to The Mirthless</s>',
 '<s>Where did Sarah go?</s></s> to The Mirthless Cafe</s>',
 '<s>Where did Sarah go?</s></s> The Mirthless Cafe last</s>',
 '<s>Where did Sarah go?</s></s> Mirthless Cafe last night</s>',
 '<s>Where did Sarah go?</s></s>irthless Cafe last night to</s>',
 '<s>Where did Sarah go?</s></s>less Cafe last night to meet</s>',
 '<s>Where did Sarah go?</s></s> Cafe last night to meet her</s>',
 '<s>Where did Sarah go?</s></s> last night to meet her friend</s>',
 '<s>Where did Sarah go?</s></s> night to meet her friend.</s>']
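
The windowing can be imitated in plain Python. The sketch below is not the tokenizer's implementation; the hypothetical helper *sliding_windows* only mimics the behaviour seen in the outputs above, where 6 context tokens fit in each sequence and the stride is the overlap between consecutive windows.

```python
def sliding_windows(tokens, window_size, stride=0):
    """Split tokens into windows of at most `window_size` tokens.

    Consecutive windows share `stride` tokens, so each window starts
    `window_size - stride` tokens after the previous one.
    """
    if not 0 <= stride < window_size:
        raise ValueError("stride must satisfy 0 <= stride < window_size")
    step = window_size - stride
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break
        start += step
    return windows


# Context tokens as decoded earlier in the notebook.
context_tokens = ["Sarah", " went", " to", " The", " M", "irth", "less", " Cafe",
                  " last", " night", " to", " meet", " her", " friend", "."]

no_overlap = sliding_windows(context_tokens, window_size=6)
with_stride = sliding_windows(context_tokens, window_size=6, stride=5)

print(len(no_overlap), len(with_stride))
print(["".join(w) for w in with_stride])
```

With no overlap the context splits into 3 windows, and with a stride of 5 into 10, matching the sequence counts in the notebook.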

In [35]:
print(x.keys(), '\n')
x

dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping']) 



{'input_ids': [[0, 13841, 222, 4143, 213, 116, 2, 2, 33671, 439, 7, 20, 256, 24208, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 439, 7, 20, 256, 24208, 1672, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 7, 20, 256, 24208, 1672, 16542, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 20, 256, 24208, 1672, 16542, 94, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 256, 24208, 1672, 16542, 94, 363, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 24208, 1672, 16542, 94, 363, 7, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 1672, 16542, 94, 363, 7, 972, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 16542, 94, 363, 7, 972, 69, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 94, 363, 7, 972, 69, 1441, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 363, 7, 972, 69, 1441, 4, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 

To fine-tune a model for question answering, our pre-trained *distilroberta-base* model expects this object to contain two more pieces of information:
- *start_positions*: the token positions where answers begin.
- *end_positions*: the token positions where answers end.<br>

All we have in our example (and the SQuAD dataset) is the position of the starting character of the answer.

In [36]:
print(answer_start)
print(context[answer_start:answer_start+len(answer)])

14
The Mirthless Cafe


We need to use this to locate the token positions where each answer starts and ends in every input_ids sequence. In some cases, the complete answer may not be in a particular sequence. We need to handle those cases as well.

To do this, we'll get more information by setting return_offsets_mapping to True in the tokenizer.

In [37]:
x = tokenizer(question, context, max_length=example_max_length,
              truncation="only_second", return_overflowing_tokens=True,
              stride=stride, return_offsets_mapping=True,
              padding="max_length")
x

{'input_ids': [[0, 13841, 222, 4143, 213, 116, 2, 2, 33671, 439, 7, 20, 256, 24208, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 439, 7, 20, 256, 24208, 1672, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 7, 20, 256, 24208, 1672, 16542, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 20, 256, 24208, 1672, 16542, 94, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 256, 24208, 1672, 16542, 94, 363, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 24208, 1672, 16542, 94, 363, 7, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 1672, 16542, 94, 363, 7, 972, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 16542, 94, 363, 7, 972, 69, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 94, 363, 7, 972, 69, 1441, 2], [0, 13841, 222, 4143, 213, 116, 2, 2, 363, 7, 972, 69, 1441, 4, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 

In [38]:
print(len(x['input_ids']))
print(len(x['offset_mapping']))

10
10


In [39]:
print(x['input_ids'][0])
print(x['offset_mapping'][0])

[0, 13841, 222, 4143, 213, 116, 2, 2, 33671, 439, 7, 20, 256, 24208, 2]
[(0, 0), (0, 5), (6, 9), (10, 15), (16, 18), (18, 19), (0, 0), (0, 0), (0, 5), (6, 10), (11, 13), (14, 17), (18, 19), (19, 23), (0, 0)]


In [40]:
print("First non-special input_id converted to token:")
print(tokenizer.convert_ids_to_tokens(x['input_ids'][0][1]), "\n")

offset = x['offset_mapping'][0][1]
print(f"Span extracted from question using corresponding offset_mapping {offset}:")
print(question[offset[0]:offset[1]])

First non-special input_id converted to token:
Where 

Span extracted from question using corresponding offset_mapping (0, 5):
Where


Since we know the character position where the answer starts, we can use that and offset_mapping to get the start and end token positions of the answer span.

In [41]:
print(x['offset_mapping'][0])
print(x['offset_mapping'][1])

[(0, 0), (0, 5), (6, 9), (10, 15), (16, 18), (18, 19), (0, 0), (0, 0), (0, 5), (6, 10), (11, 13), (14, 17), (18, 19), (19, 23), (0, 0)]
[(0, 0), (0, 5), (6, 9), (10, 15), (16, 18), (18, 19), (0, 0), (0, 0), (6, 10), (11, 13), (14, 17), (18, 19), (19, 23), (23, 27), (0, 0)]


In [42]:
# Sequence ids tell us whether a token belongs to the question (0) or the context (1).
print(x['input_ids'][0])
print(x.sequence_ids(0))

[0, 13841, 222, 4143, 213, 116, 2, 2, 33671, 439, 7, 20, 256, 24208, 2]
[None, 0, 0, 0, 0, 0, None, None, 1, 1, 1, 1, 1, 1, None]


In [43]:
# We can calculate the answer end character position using the answer length.
answer_end = answer_start + len(answer)

print("Answer start character position:", answer_start)
print("Answer end character position:", answer_end)
print("Answer pulled from context:", context[answer_start:answer_end])

Answer start character position: 14
Answer end character position: 32
Answer pulled from context: The Mirthless Cafe


In [44]:
tokenizer.batch_decode(x['input_ids'])

['<s>Where did Sarah go?</s></s>Sarah went to The Mirth</s>',
 '<s>Where did Sarah go?</s></s> went to The Mirthless</s>',
 '<s>Where did Sarah go?</s></s> to The Mirthless Cafe</s>',
 '<s>Where did Sarah go?</s></s> The Mirthless Cafe last</s>',
 '<s>Where did Sarah go?</s></s> Mirthless Cafe last night</s>',
 '<s>Where did Sarah go?</s></s>irthless Cafe last night to</s>',
 '<s>Where did Sarah go?</s></s>less Cafe last night to meet</s>',
 '<s>Where did Sarah go?</s></s> Cafe last night to meet her</s>',
 '<s>Where did Sarah go?</s></s> last night to meet her friend</s>',
 '<s>Where did Sarah go?</s></s> night to meet her friend.</s>']

In [45]:
input_ids = x['input_ids'][0]
offset_mapping = x['offset_mapping'][0]
seq_ids = x.sequence_ids(0)

In [46]:
print("Sequence IDs: ", seq_ids)

Sequence IDs:  [None, 0, 0, 0, 0, 0, None, None, 1, 1, 1, 1, 1, 1, None]


In [47]:
context_pos_start = seq_ids.index(1)

In [48]:
# Utility function to find the index of the *last* occurrence of a value in a list.
def rindex(lst, value):
    return len(lst) - operator.indexOf(reversed(lst), value) - 1

# Get the end index position (i.e. the last occurrence of 1).
context_pos_end = rindex(seq_ids, 1)

In [49]:
print("Context tokens begin at position", context_pos_start)
print("Context tokens end at position", context_pos_end)

Context tokens begin at position 8
Context tokens end at position 13


Now that we know which tokens are part of the context, we can look at their corresponding offset mappings to check whether the start and end character positions are within the offsets.

In [50]:
context_offsets = offset_mapping[context_pos_start:context_pos_end+1]
print(context_offsets)

[(0, 5), (6, 10), (11, 13), (14, 17), (18, 19), (19, 23)]


In [51]:
print("Is the lowest offset value lower than or equal to the starting character position?")
print("Answer starting character position:", answer_start)
print("First offset:", context_offsets[0])

print(context_offsets[0][0] <= answer_start)

Is the lowest offset value lower than or equal to the starting character position?
Answer starting character position: 14
First offset: (0, 5)
True


In [52]:
print("Is the highest offset value higher than or equal to the ending character position?")
print("Answer ending character position:", answer_end)
print("Last offset:", context_offsets[-1])

print(context_offsets[-1][1] >= answer_end)

Is the highest offset value higher than or equal to the ending character position?
Answer ending character position: 32
Last offset: (19, 23)
False


So the first sequence contains a part of the answer but the full answer gets truncated. This matches a visual inspection:

In [53]:
print(tokenizer.batch_decode(input_ids))

['<s>', 'Where', ' did', ' Sarah', ' go', '?', '</s>', '</s>', 'Sarah', ' went', ' to', ' The', ' M', 'irth', '</s>']


In [54]:
input_ids = x['input_ids'][2]
offset_mapping = x['offset_mapping'][2]
seq_ids = x.sequence_ids(2)

context_pos_start = seq_ids.index(1)
context_pos_end = rindex(seq_ids, 1)

context_offsets = offset_mapping[context_pos_start:context_pos_end+1]

print("Is the lowest offset value lower than or equal to the starting character position?")
print("Answer starting character position:", answer_start)
print("First offset:", context_offsets[0])

print(context_offsets[0][0] <= answer_start)

print("Is the highest offset value higher than or equal to the ending character position?")
print("Answer ending character position:", answer_end)
print("Last offset:", context_offsets[-1])

print(context_offsets[-1][1] >= answer_end)


Is the lowest offset value lower than or equal to the starting character position?
Answer starting character position: 14
First offset: (11, 13)
True
Is the highest offset value higher than or equal to the ending character position?
Answer ending character position: 32
Last offset: (28, 32)
True


Now that we've confirmed the third sequence contains the full answer, we need to identify where the answer starts and ends in the input_ids. We can do this by scanning the offset_mapping from the left to find the start, and from the right to find the end.

In [55]:
s = e = 0

# Finding the starting token position
i = context_pos_start
while offset_mapping[i][0] < answer_start:
  i += 1
if offset_mapping[i][0] == answer_start:
  s = i
else:
  s = i - 1

# Finding the ending token position
j = context_pos_end
while offset_mapping[j][1] > answer_end:
  j -= 1
if offset_mapping[j][1] == answer_end:
  e = j
else:
  e = j + 1

In [56]:
print("Answer start token position in context:", s)
print("Answer end token position in context:", e)

Answer start token position in context: 9
Answer end token position in context: 13


In [57]:
print("Answer lifted from context:")
tokenizer.batch_decode(input_ids[s:e+1])

Answer lifted from context:


[' The', ' M', 'irth', 'less', ' Cafe']
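
The same lookup can also be phrased without explicit loops: the answer tokens are the context tokens whose character offsets overlap the answer's character range. The sketch below is an alternative formulation for illustration, with the hypothetical helper *find_token_span*; for the example sequences it should agree with the scanning approach above. The offsets are assembled from the values printed earlier in the notebook.

```python
def find_token_span(offsets, seq_ids, answer_start, answer_end):
    """Return (start_token, end_token) for the answer, or (0, 0) if the
    sequence does not contain the whole answer."""
    context_idx = [i for i, s in enumerate(seq_ids) if s == 1]
    if not context_idx:
        return (0, 0)
    first, last = context_idx[0], context_idx[-1]
    # The answer must lie entirely inside this sequence's context window.
    if offsets[first][0] > answer_start or offsets[last][1] < answer_end:
        return (0, 0)
    # Context tokens whose character range overlaps the answer's range.
    overlapping = [i for i in context_idx
                   if offsets[i][0] < answer_end and offsets[i][1] > answer_start]
    if not overlapping:
        return (0, 0)
    return (overlapping[0], overlapping[-1])


seq_ids = [None, 0, 0, 0, 0, 0, None, None, 1, 1, 1, 1, 1, 1, None]
question_offsets = [(0, 0), (0, 5), (6, 9), (10, 15), (16, 18), (18, 19), (0, 0), (0, 0)]

# First sequence: "Sarah went to The Mirth" (answer is cut off).
first_offsets = question_offsets + [(0, 5), (6, 10), (11, 13), (14, 17), (18, 19), (19, 23), (0, 0)]
# Third sequence: " to The Mirthless Cafe" (answer fully present).
third_offsets = question_offsets + [(11, 13), (14, 17), (18, 19), (19, 23), (23, 27), (28, 32), (0, 0)]

print(find_token_span(first_offsets, seq_ids, 14, 32))
print(find_token_span(third_offsets, seq_ids, 14, 32))
```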

Let's encapsulate the entire logic in a single function that we can map over the dataset.

In [58]:
def prepare_dataset(examples):
  # Some tokenizers don't strip spaces. If there happens to be question text
  # with excessive leading spaces, the context may not get encoded at all.
  # The context is left untouched: answer_start is a character index into it,
  # so stripping it would shift the answer positions.
  examples["question"] = [q.lstrip() for q in examples["question"]]

  # Tokenize.
  tokenized_examples = tokenizer(
      examples['question'],
      examples['context'],
      truncation="only_second",
      max_length = max_length,
      stride=stride,
      return_overflowing_tokens=True,
      return_offsets_mapping=True,
      padding="max_length"
  )

  # We'll collect a list of starting positions and ending positions.
  tokenized_examples['start_positions'] = []
  tokenized_examples['end_positions'] = []

  # Work through every sequence.
  for seq_idx in range(len(tokenized_examples['input_ids'])):
    seq_ids = tokenized_examples.sequence_ids(seq_idx)
    offset_mappings = tokenized_examples['offset_mapping'][seq_idx]

    cur_example_idx = tokenized_examples['overflow_to_sample_mapping'][seq_idx]
    answer = examples['answers'][cur_example_idx]
    answer_text = answer['text'][0]
    answer_start = answer['answer_start'][0]
    answer_end = answer_start + len(answer_text)

    context_pos_start = seq_ids.index(1)
    context_pos_end = rindex(seq_ids, 1)

    s = e = 0
    if (offset_mappings[context_pos_start][0] <= answer_start and
        offset_mappings[context_pos_end][1] >= answer_end):
      i = context_pos_start
      while offset_mappings[i][0] < answer_start:
        i += 1
      if offset_mappings[i][0] == answer_start:
        s = i
      else:
        s = i - 1

      j = context_pos_end
      while offset_mappings[j][1] > answer_end:
        j -= 1
      if offset_mappings[j][1] == answer_end:
        e = j
      else:
        e = j + 1

    tokenized_examples['start_positions'].append(s)
    tokenized_examples['end_positions'].append(e)

  return tokenized_examples

In [59]:
# Increasing the max length crashes Colab.
max_length = 400
stride = 100
batch_size = 32

In [60]:
# map is a method provided by the Hugging Face Datasets library.
tokenized_datasets = data.map(
  prepare_dataset,
  batched=True,
  remove_columns=data["train"].column_names,
  num_proc=2,
)


In [61]:
tokenized_datasets.column_names

{'train': ['input_ids',
  'attention_mask',
  'offset_mapping',
  'overflow_to_sample_mapping',
  'start_positions',
  'end_positions'],
 'validation': ['input_ids',
  'attention_mask',
  'offset_mapping',
  'overflow_to_sample_mapping',
  'start_positions',
  'end_positions']}

In [62]:
# We don't need these columns either.
data = tokenized_datasets.remove_columns(["offset_mapping",
                                          "overflow_to_sample_mapping"])


In [63]:
data.column_names

{'train': ['input_ids', 'attention_mask', 'start_positions', 'end_positions'],
 'validation': ['input_ids',
  'attention_mask',
  'start_positions',
  'end_positions']}

In [64]:
# Convert the Hugging Face datasets to tf.data.Dataset objects.
train_set = data["train"].to_tf_dataset(batch_size=batch_size)
validation_set = data["validation"].to_tf_dataset(batch_size=batch_size)

In [65]:
# Loading the model
model = TFAutoModelForQuestionAnswering.from_pretrained(model_name)


All PyTorch model weights were used when initializing TFRobertaForQuestionAnswering.

Some weights or buffers of the TF 2.0 model TFRobertaForQuestionAnswering were not initialized from the PyTorch model and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [66]:
def get_answer(tokenizer, model, question, context):
  inputs = tokenizer([question], [context], return_tensors="np")
  outputs = model(inputs)
  start_position = tf.argmax(outputs.start_logits, axis=1)
  end_position = tf.argmax(outputs.end_logits, axis=1)
  answer = inputs["input_ids"][0, int(start_position) : int(end_position) + 1]
  return tokenizer.decode(answer).strip()

In [67]:
# This is without fine-tuning the model
c = "Sarah went to The Mirthless Cafe last night to meet her friend."
q = "Where did Sarah go?"
get_answer(tokenizer, model, q, c)

''

In [68]:
# https://www.tensorflow.org/guide/mixed_precision
keras.mixed_precision.set_global_policy("mixed_float16")

# Use a learning rate recommended by the BERT authors.
# https://github.com/google-research/bert
model.compile(optimizer=keras.optimizers.Adam(learning_rate=3e-5))

In [69]:
model.fit(train_set, validation_data=validation_set, epochs=1)


<tf_keras.src.callbacks.History at 0x7f72f498dc00>

In [70]:
c = "Sarah went to The Mirthless Cafe last night to meet her friend."
q = "Where did Sarah go?"
get_answer(tokenizer, model, q, c)

'The Mirthless Cafe'

In [71]:
q = "Who did Sarah meet?"
get_answer(tokenizer, model, q, c)

'her friend'

In [72]:
q = "When did Sarah meet her friend?"
get_answer(tokenizer, model, q, c)

'last night'

In [73]:
q = "Who went to the restaurant?"
get_answer(tokenizer, model, q, c)

'Sarah'

In [74]:
# Asking a logic teaser question is difficult despite the
# answer being available. To be fair, there is ambiguity here.
q = "Who did Sarah's friend meet?"
get_answer(tokenizer, model, q, c)

'her friend'

In [75]:
# The model can't determine when a question can't be
# answered. Some question answering datasets explicitly
# train for this.
q = "How did Sarah get to the restaurant?"
get_answer(tokenizer, model, q, c)

'The Mirthless Cafe last night to meet her friend'

In [76]:
# The model isn't generative, either.
q = "What is a possible reason for why Sarah met her friend?"
get_answer(tokenizer, model, q, c)

'<s>'