# **Question Answering**
Question answering tasks return an answer given a question. If you’ve ever asked a virtual assistant like Alexa, Siri or Google what the weather is, then you’ve used a question answering model before.

There are two common types of question answering tasks:
- **Extractive**: extract the answer from the given context.
- **Abstractive**: generate an answer that correctly answers the question, rather than copying a span from the context.
    - **Open Generative QA**: the model generates free text based on the provided context (a text generation task).
    - **Closed Generative QA**: no context is provided; the answer is generated entirely by the model.

In [1]:
!pip install transformers datasets evaluate

Collecting datasets
  Downloading datasets-3.0.0-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.0-py3-none-any.whl (474 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)


In [2]:
from huggingface_hub import notebook_login

notebook_login()


## **Load Dataset**
Start by loading the **SQuAD** dataset from the 🤗 Datasets library, then take a smaller subset of it. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.

In [3]:
from datasets import load_dataset

squad = load_dataset("squad")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [4]:
squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [5]:
squad10k = load_dataset("squad", split="train[:10000]")

In [6]:
squad10k

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10000
})

Split the subset into train and test sets with `train_test_split`, holding out 20% of the rows for testing:

In [7]:
squad10k = squad10k.train_test_split(test_size=0.2)
squad10k

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 8000
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 2000
    })
})

In [8]:
squad10k["train"][0]

{'id': '56d116d717492d1400aab8f6',
 'title': 'The_Legend_of_Zelda:_Twilight_Princess',
 'context': 'At the time of its release, Twilight Princess was considered the greatest entry in the Zelda series by many critics, including writers for 1UP.com, Computer and Video Games, Electronic Gaming Monthly, Game Informer, GamesRadar, IGN, and The Washington Post. It received several Game of the Year awards, and was the most critically acclaimed game of 2006. In 2011, the Wii version was rereleased under the Nintendo Selects label. A high-definition port for the Wii U, The Legend of Zelda: Twilight Princess HD, will be released in March 2016.',
 'question': 'What year was Twilight Princess the most critically acclaimed game?',
 'answers': {'text': ['2006'], 'answer_start': [349]}}

There are several important fields here:
- `answers`: the answer text and the character position in the `context` where the answer starts (`answer_start` is a character offset, not a token index).
- `context`: background information from which the model needs to extract the answer.
- `question`: the question a model should answer.
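To make the role of `answer_start` concrete, here is a minimal sketch showing that slicing the context by character offset recovers the answer text. The record below is a shortened, hypothetical example in the SQuAD layout, not a row from the dataset.

```python
# Hypothetical record in the SQuAD layout, shortened for illustration.
context = (
    "It received several Game of the Year awards, "
    "and was the most critically acclaimed game of 2006."
)
answer_text = "2006"
example = {
    "context": context,
    "answers": {
        "text": [answer_text],
        # answer_start is a character offset into the context
        "answer_start": [context.index(answer_text)],
    },
}

start = example["answers"]["answer_start"][0]
end = start + len(example["answers"]["text"][0])  # one past the last character

# Slicing the context by character offsets gives back the answer text.
recovered = example["context"][start:end]
print(start, end, repr(recovered))
```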

In [9]:
print("Context: ", squad10k["train"][0]["context"])
print("Question: ", squad10k["train"][0]["question"])
print("Answer: ", squad10k["train"][0]["answers"])

Context:  At the time of its release, Twilight Princess was considered the greatest entry in the Zelda series by many critics, including writers for 1UP.com, Computer and Video Games, Electronic Gaming Monthly, Game Informer, GamesRadar, IGN, and The Washington Post. It received several Game of the Year awards, and was the most critically acclaimed game of 2006. In 2011, the Wii version was rereleased under the Nintendo Selects label. A high-definition port for the Wii U, The Legend of Zelda: Twilight Princess HD, will be released in March 2016.
Question:  What year was Twilight Princess the most critically acclaimed game?
Answer:  {'text': ['2006'], 'answer_start': [349]}


## **Preprocess**
We will use `DistilBERT` as our model, so load its tokenizer to preprocess the `question` and `context` fields:

In [10]:
from transformers import AutoTokenizer

model_checkpoint = "distilbert/distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.





There are a few preprocessing steps particular to question answering tasks you should be aware of:

1. Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. **To deal with longer sequences, truncate only the `context` by setting `truncation="only_second"`**: when the question and its context together are too long, only the second sequence of the pair (the context) is cut.
2. Next, **map the start and end positions of the answer to the original `context`** by setting `return_offsets_mapping=True`.
3. With the mapping in hand, you can now **find the start and end tokens of the answer**. Use the `sequence_ids` method to find which part of the offset corresponds to the `question` and which corresponds to the `context`.
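The three steps above can be sketched without downloading a model. The `toy_encode` helper below is a hypothetical whitespace tokenizer, not part of 🤗 Transformers; it only imitates the two pieces of information a fast tokenizer provides: a character offset pair per token, and a sequence id (`None` for special tokens, `0` for the question, `1` for the context).

```python
def toy_encode(question, context):
    """Hypothetical whitespace tokenizer that mimics offsets and sequence ids."""
    tokens, offsets, sequence_ids = ["[CLS]"], [(0, 0)], [None]
    for seq_id, text in enumerate((question, context)):
        pos = 0
        for word in text.split():
            start = text.index(word, pos)  # character offset within its own text
            end = start + len(word)
            tokens.append(word)
            offsets.append((start, end))
            sequence_ids.append(seq_id)
            pos = end
        # Special tokens get a (0, 0) offset and no sequence id
        tokens.append("[SEP]")
        offsets.append((0, 0))
        sequence_ids.append(None)
    return tokens, offsets, sequence_ids


question = "What year was it released?"
context = "The game was released in 2006 worldwide."
answer_text = "2006"
start_char = context.index(answer_text)
end_char = start_char + len(answer_text)

tokens, offsets, sequence_ids = toy_encode(question, context)

# Only tokens with sequence id 1 belong to the context
context_tokens = [i for i, s in enumerate(sequence_ids) if s == 1]

# Token containing the first answer character, and token containing the last one
start_token = next(i for i in context_tokens if offsets[i][0] <= start_char < offsets[i][1])
end_token = next(i for i in context_tokens if offsets[i][0] < end_char <= offsets[i][1])

print(start_token, end_token, tokens[start_token : end_token + 1])
```

A real subword tokenizer splits text differently, but the lookup works the same way: restrict the search to context tokens, then compare character offsets against the answer span.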

Create a function to **truncate and map the start and end tokens of the `answer` to the `context`**:

In [11]:
def preprocess_function(examples):
    # Some questions carry extra leading or trailing whitespace, which is not useful and can make
    # the truncation of the context fail (the tokenized question would take up a lot of space),
    # so we strip it.
    questions = [q.strip() for q in examples["question"]]

    # Tokenize each question/context pair, truncating only the context and padding to max_length.
    # No stride or overflow is requested here, so each example yields exactly one feature and any
    # context beyond max_length is dropped.
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = inputs.pop("offset_mapping") # get the offset mapping out
    answers = examples["answers"] # the answer of the question

    start_positions = []  # token index where each answer starts
    end_positions = []  # token index where each answer ends

    # Loop to get the start and end token positions of each answer
    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        # Character span of the answer in the original context. Answers that fall outside
        # the tokenized context are labeled with the index of the [CLS] token (0) further below.
        start_char = answer["answer_start"][0]  # first character of the answer
        end_char = answer["answer_start"][0] + len(answer["text"][0])  # one past the last character

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = inputs.sequence_ids(i) # get/find the indices that start and end in the context

        # Find the start and end of the context
        # Start token index of the current span in the text.
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx

        # End token index of the current span in the text
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

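        # If the answer is not fully inside the (possibly truncated) context, label it (0, 0).
        # Comparing start with start and end with end also catches answers cut off by truncation.
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char: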
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

To apply the **preprocessing** function over the entire dataset, use the 🤗 Datasets `map` function. Speed it up by setting `batched=True` to process multiple elements of the dataset at once, and remove any columns you don't need with `remove_columns`.

In [12]:
tokenized_squad = squad10k.map(preprocess_function, batched=True, remove_columns=squad10k["train"].column_names)


In [13]:
tokenized_squad

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 8000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 2000
    })
})
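With `batched=True`, the mapped function receives a whole batch in column-wise form (one list per column) rather than a single row, which is why `preprocess_function` iterates over `examples["question"]`. A minimal sketch of that layout, using a hypothetical two-row batch and a hypothetical `strip_questions` helper:

```python
# Hypothetical two-row batch in the column-wise layout used when batched=True.
batch = {
    "question": ["  What year?  ", "Who made it? "],
    "context": ["Released in 2006.", "Made by Nintendo."],
}


def strip_questions(examples):
    # Mirrors the first step of preprocess_function: work on a whole column at once
    return {"question": [q.strip() for q in examples["question"]]}


result = strip_questions(batch)
print(result["question"])
```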

## **Train**

### PyTorch

Load the `DistilBERT` model with `AutoModelForQuestionAnswering`:

In [14]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)


Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


At this point, only three steps remain:

1. Define your training hyperparameters in `TrainingArguments`. The only required parameter is `output_dir` which specifies where to save your model. You’ll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model).
2. Pass the training arguments to `Trainer` along with the model, dataset, tokenizer, and data collator.
3. Call `train()` to finetune your model.

In [15]:
training_args = TrainingArguments(
    output_dir="my_awesome_10k_qa_model",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

**Create the Data Collator**

In [16]:
from transformers import DefaultDataCollator

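# DefaultDataCollator only batches the examples; it applies no extra preprocessing such as padding
# (the inputs were already padded to max_length during tokenization)
data_collator = DefaultDataCollator()  # PyTorch
# data_collator = DefaultDataCollator(return_tensors="tf")  # TensorFlow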

In [17]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [18]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 2.7294        | 1.6258          |
| 2     | 1.4128        | 1.405135        |
| 3     | 1.0856        | 1.396977        |


TrainOutput(global_step=1500, training_loss=1.7425937906901041, metrics={'train_runtime': 936.3815, 'train_samples_per_second': 25.631, 'train_steps_per_second': 1.602, 'total_flos': 2351755800576000.0, 'train_loss': 1.7425937906901041, 'epoch': 3.0})
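The `global_step=1500` in the output follows from the dataset size and the training arguments. A quick check of the arithmetic, assuming a single device and no gradient accumulation (both are assumptions about this run, not values shown above):

```python
import math

train_rows = 8000   # rows in tokenized_squad["train"]
batch_size = 16     # per_device_train_batch_size
epochs = 3          # num_train_epochs
devices = 1         # assumption: a single GPU or CPU
grad_accum = 1      # assumption: no gradient accumulation

# One optimizer step per effective batch
steps_per_epoch = math.ceil(train_rows / (batch_size * devices * grad_accum))
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 500 1500
```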

In [19]:
trainer.save_model("test-squad-trained")


In [20]:
trainer.push_to_hub()

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/ditherr/my_awesome_qa_model/commit/bce060e21f13f4b1c3960411fc5d7a2f040490ef', commit_message='End of training', commit_description='', oid='bce060e21f13f4b1c3960411fc5d7a2f040490ef', pr_url=None, pr_revision=None, pr_num=None)

### TensorFlow

To finetune a model in TensorFlow, start by **setting up an optimizer function, learning rate schedule, and some training hyperparameters**:

In [None]:
from transformers import create_optimizer

batch_size = 16
num_epochs = 3  # keep in sync with the `epochs` value passed to `model.fit` below
total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=0,
    num_train_steps=total_train_steps,
)
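The total step count matters because the learning rate schedule is stretched over it. The sketch below illustrates a linear decay from `init_lr` to zero with no warmup, which is an assumption about how the default schedule behaves rather than a copy of the library's implementation. `linear_decay_lr` and the step count of 1000 are hypothetical, chosen only for illustration.

```python
def linear_decay_lr(step, init_lr=2e-5, total_steps=1000, end_lr=0.0):
    """Hypothetical helper: learning rate after `step` updates under linear decay."""
    step = min(step, total_steps)  # hold the final value once the schedule has ended
    remaining = 1.0 - step / total_steps
    return (init_lr - end_lr) * remaining + end_lr


total_steps = 1000  # hypothetical value for illustration
for step in (0, 250, 500, 1000):
    print(step, linear_decay_lr(step, total_steps=total_steps))
```

If training runs for more steps than `num_train_steps`, a schedule like this spends the extra steps at the final learning rate, so the step count and the number of epochs actually trained should agree.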

In [None]:
from transformers import TFAutoModelForQuestionAnswering

model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased")

Convert your datasets to the `tf.data.Dataset` format with `prepare_tf_dataset()`

In [None]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_squad["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_squad["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

Configure the model for training with `compile`

In [None]:
import tensorflow as tf

model.compile(optimizer=optimizer)

The last thing to set up before you start training is a way to push your model to the Hub. This can be done by specifying where to push your model and tokenizer in the `PushToHubCallback`:

In [None]:
from transformers.keras_callbacks import PushToHubCallback

callback = PushToHubCallback(
    output_dir="my_awesome_qa_model",
    tokenizer=tokenizer,
)

Finally, you’re ready to start training your model! Call `fit` with your training and validation datasets, the number of epochs, and your callback to finetune the model:

In [None]:
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=[callback])

## **Evaluate**
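Evaluating a question answering model requires a significant amount of postprocessing to turn the predicted start and end logits back into answer text, so this notebook skips it. See the [Full Code](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb#scrollTo=f-EZY8mXT0X9) for the complete evaluation step.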

In [None]:
...
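Once predictions have been converted to text, they are usually scored with exact match and token-level F1. The following is a simplified sketch in the style of the SQuAD metrics, written in plain Python. The helper names are hypothetical, the reference answer `"13"` is an assumption for illustration, and the sketch does not replace the postprocessing in the linked notebook.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return normalize_answer(prediction) == normalize_answer(reference)


def f1_score(prediction, reference):
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The 2006", "2006"))
print(f1_score("46 languages natural languages and 13", "13"))
```

F1 gives partial credit when the predicted span contains the reference answer along with extra tokens, whereas exact match only rewards a span that is identical after normalization.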

## **Inference**

In [21]:
question = "How many programming languages does BLOOM support?"
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."

In [27]:
from transformers import pipeline

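# Note: `model_checkpoint` is the base DistilBERT checkpoint, whose QA head is newly initialized
# (see the warning in the output), so this answer and its low score are not meaningful.
# Pass the fine-tuned model pushed to the Hub instead to get a useful prediction.
question_answerer = pipeline("question-answering", model=model_checkpoint)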
question_answerer(question=question, context=context)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


{'score': 0.0033551272936165333, 'start': 0, 'end': 5, 'answer': 'BLOOM'}

**PyTorch**

In [23]:
from transformers import AutoTokenizer

model_check = 'ditherr/my_awesome_qa_model'
tokenizer = AutoTokenizer.from_pretrained(model_check)
inputs = tokenizer(question, context, return_tensors="pt")

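Pass your inputs to the model and get the start and end `logits`, then take the most likely start and end token positions and decode the tokens between them: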

In [24]:
import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(model_check)
with torch.no_grad():
    outputs = model(**inputs)

In [25]:
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

In [26]:
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)

'46 languages natural languages and 13'
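Taking the two argmaxes independently does not guarantee a sensible span: the end index can land before the start index, or the span can be far longer than any plausible answer. A common refinement is to score every valid (start, end) pair jointly and keep the best one. The sketch below does this in plain Python; `best_span`, `max_answer_len` and the logits are hypothetical and not taken from the model above.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start + end logit with start <= end."""
    best = (None, None, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Hypothetical logits where the independent argmaxes give end < start
start_logits = [0.1, 0.2, 3.0, 0.5, 2.5]
end_logits = [0.3, 2.8, 0.1, 0.2, 2.6]

naive_start = start_logits.index(max(start_logits))
naive_end = end_logits.index(max(end_logits))
start, end, score = best_span(start_logits, end_logits)

print("independent argmax:", naive_start, naive_end)
print("best valid span:", start, end)
```

In practice the search would also be restricted to context tokens (sequence id `1`) so that the answer cannot include the question or special tokens.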

**TensorFlow**

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
inputs = tokenizer(question, context, return_tensors="tf")

In [None]:
from transformers import TFAutoModelForQuestionAnswering

model = TFAutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
outputs = model(**inputs)

In [None]:
answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0])
answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0])

In [None]:
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)