# Assignment 7: Question Answering
_Word Representations and Language Models (WS 24/25)_

Group Members: Alexander Weyhe, Buket Sak, Ludmila Bajuk
***

In this assignment, you'll work with a BERT model fine-tuned for Question Answering. We ask you to evaluate the model on SQuAD 1.1 as described in the original paper and to compare BERT's performance with that of the paper's logistic regression model. Then, we ask you to experiment with the model on our news dataset, manually generate question/answer pairs, and evaluate the model's performance qualitatively.

<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#FFF2CC;border-color:#D6B656;color:#856404">
<b>How to Submit the Assignment</b>

Please work on this assignment in groups of two or three. Make sure to add your names to this file's header. After completion, share this assignment with me (<b>Julian Schelb - <a target="blank" href="https://www.kaggle.com/julianschelb">https://www.kaggle.com/julianschelb</a></b>) by Wednesday, 8th January, 12:00. Use the upper-right share button as instructed in the tutorial. In ILIAS, submit this notebook as your response to Assignment 07. You can download this notebook using the "Download Notebook" option in the "File" menu.
</div>

In [4]:
from transformers import BertTokenizer, BertForQuestionAnswering
import transformers
import torch
import numpy as np
import json
import string
import re

## About the Stanford Question Answering Dataset (SQuAD)

SQuAD is a reading comprehension dataset containing approximately 100k question-answer pairs derived from 536 Wikipedia articles. Each question is associated with a specific text segment that serves as context, and the answer is a contiguous text span extracted directly from that segment.

The dataset was developed by selecting a set of Wikipedia articles and employing crowdworkers to formulate questions answerable by specific text segments. The crowdworkers also annotated the exact positions of the answers within the text.

For further information:

- [The Stanford Question Answering Dataset](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/)
- [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://arxiv.org/abs/1606.05250)
- [Dataset on the Hugging Face Hub](https://huggingface.co/datasets/rajpurkar/squad)

## About Question Answering with BERT

BERT can be used for question answering by concatenating the context and question into a single input sequence, separated by a `[SEP]` token. For this task the model is fine-tuned to predict two separate probability distributions: one for the likelihood of each token being the start of the answer, and another for the likelihood of each token being the end. The tokens with the highest probabilities from each distribution, along with all tokens in between, are considered the predicted answer.
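
Taking the two highest-scoring positions independently can produce an end position that lies before the start position, which yields an empty span. A common refinement is to search for the best *valid* pair instead. The following is a minimal sketch of that idea on plain Python lists; the function name `best_span` and the `max_answer_len` limit are assumptions made for illustration and are not part of the assignment code.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return the (start, end) pair with the highest combined score.

    Only pairs with start <= end < start + max_answer_len are considered;
    the returned end index is inclusive.
    """
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Toy scores: the independent argmaxes are start=3 and end=1 (an invalid span).
start_scores = [0.1, 2.0, 0.3, 2.5]
end_scores = [0.2, 3.0, 1.0, 0.5]
print(best_span(start_scores, end_scores))  # (1, 1)
```

With real model outputs, the scores would be the start and end logits of one example, and positions belonging to the question would typically be excluded as well.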



## Task 1: Question Answering with BERT

Task 1 is about loading the SQuAD 1.1 dataset and fine-tuned BERT model (both from huggingface) as well as computing the model's Exact Match and F1 score as introduced in the SQuAD paper. We already pre-define the code for downloading the data and model. We use a BERT model that is fine-tuned on the train set of the SQuAD corpus. Therefore, we only need validation data for evaluation. Description of dataset downloaded from huggingface: https://huggingface.co/datasets/squad

Please implement the following steps:
1. Inspect the data to get an overview of what you'll need to predict answers for.
2. For all questions, generate answer predictions with BERT (using a GPU helps a lot here and reduces the runtime to only a few minutes).
3. Since predicted and true answers might differ not only in content but also in punctuation, whitespace, etc., please implement the following preprocessing steps for all predicted and true answers: lower-case the answers, remove punctuation (the `string` package might be useful here), remove articles (a regex might be useful to remove "a", "an", "the"), and standardize whitespace.
4. Use the pre-processed predicted answers and the true answers included to calculate the Exact Match and F1-score.

In [1]:
# Read data: we only need the validation set since the model is already fine-tuned on the train set
#!pip install datasets
from datasets import load_dataset
df_test = load_dataset('squad',split = 'validation')


In [5]:
# Load tokenizer and model: we take a BERT model already fine-tuned on the Squad dataset
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

#tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') # Optionally try out non-fine-tuned BERT model to see how important fine-tuning is.
#model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
# Print example data
print(df_test[4])

{'id': '56be4db0acb8001400a502f0', 'title': 'Super_Bowl_50', 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.', 'question': 'What color was used to emphasize the 50th anniversary of the Super Bowl?', 'answers': {'text': ['gold', 'gold', 'gold'], 'answer_start'

In [7]:
# Extract data
contexts = df_test['context']
#print(contexts[1])
questions = df_test['question']
#print(questions[3])
true_answers = [answer['text'][0] for answer in df_test['answers']]
print(true_answers[4])

gold


In [8]:
# Connect to the GPU (if available) and push the model to it
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

In [9]:
# Iterate over all questions/paragraphs and predict answer --> use GPU to speed-up process (should take a few minutes to get output for all 10570 question/answer pairs)
predicted_answers = []

for i in range(len(questions)): 
    context = contexts[i]
    question = questions[i]
    
    # Tokenize inputs
    inputs = tokenizer(question, context, return_tensors="pt", max_length=512, truncation=True)
    inputs = {key: value.to(device) for key, value in inputs.items()} 

    
    with torch.no_grad():
        outputs = model(**inputs)

    start_logits = outputs.start_logits
    end_logits = outputs.end_logits

    # Get the most probable start and end positions
    start_idx = torch.argmax(start_logits)
    end_idx = torch.argmax(end_logits) + 1

    # Decode the predicted span into text
    input_ids = inputs['input_ids'].squeeze()
    predicted_answer = tokenizer.decode(input_ids[start_idx:end_idx])
    predicted_answers.append(predicted_answer)

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.

In [10]:
# Pre-process predicted and true answers for the Exact Match and F1 computation:
# lower-case, remove articles and punctuation, and standardize whitespace

def preprocess(text):
    text = text.lower()
    # Remove articles
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    # Remove punctuation
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    # Collapse extra whitespace
    text = ' '.join(text.split())
    return text

predicted_answers = [preprocess(answer) for answer in predicted_answers]
true_answers = [preprocess(answer) for answer in true_answers]

In [11]:
# Compute proportion of exact matches between predictions and true answers
def exact_match(prediction, ground_truth):
    return int(prediction == ground_truth)

exact_matches = [exact_match(pred, true) for pred, true in zip(predicted_answers, true_answers)]
em_score = np.mean(exact_matches)

In [12]:
print("Mean of exact matches:", em_score)

Mean of exact matches: 0.6546830652790918


In [13]:
# Compute F1 score based on shared words between prediction and true answer

def f1_score(prediction, ground_truth):
    prediction_tokens = prediction.split()
    ground_truth_tokens = ground_truth.split()
    common = set(prediction_tokens) & set(ground_truth_tokens)
    num_same = len(common)

    if num_same == 0:
        return 0.0

    precision = num_same / len(prediction_tokens)
    recall = num_same / len(ground_truth_tokens)
    return 2 * (precision * recall) / (precision + recall)

f1_scores = [f1_score(pred, true) for pred, true in zip(predicted_answers, true_answers)]
mean_f1 = np.mean(f1_scores)

# Print Results
print(f"Exact Match (EM) Score: {em_score * 100:.2f}%")
print(f"F1 Score: {mean_f1 * 100:.2f}%")

Exact Match (EM) Score: 65.47%
F1 Score: 79.49%
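
The evaluation above compares each prediction with the first reference answer only and counts shared words as a set. The official SQuAD v1.1 evaluation differs in two ways: it counts token overlap as a multiset, and it scores each prediction against all reference answers for a question and keeps the maximum. The sketch below illustrates both points on already pre-processed strings; the names `f1_multiset` and `best_over_references` are hypothetical, and the numbers reported above were not recomputed with it.

```python
from collections import Counter


def f1_multiset(prediction, ground_truth):
    """Token-level F1 where repeated tokens are counted (multiset overlap)."""
    pred_tokens = prediction.split()
    true_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(true_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction, references):
    """Score a prediction against every reference and keep the best value."""
    return max(metric(prediction, ref) for ref in references)


# Illustrative strings (not taken from the model output).
references = ["broncos", "denver broncos"]
print(f1_multiset("denver broncos", references[0]))                      # first reference only
print(best_over_references(f1_multiset, "denver broncos", references))  # best reference
```

In the notebook, the list of references for example `i` would be `df_test[i]['answers']['text']`, with each entry passed through `preprocess` first.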


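Reference results reported in the SQuAD paper:
- Logistic regression model (dev set): EM = 40.0%, F1 = 51%
- Human performance: EM = 77%, F1 = 86%

With the evaluation above, BERT (EM = 65.47%, F1 = 79.49%) clearly outperforms the logistic regression baseline but stays below the reported human performance.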

## Task 2: Experiment with BERT

Task 2 is about experimenting with the BERT model on our news data. Please implement the following steps:

1. Select one article from the news data on which you want to generate question/answer pairs.
2. Generate exemplary question/answer pairs and check whether the model can predict the answers correctly. Try to generate both easy and more difficult questions. Are there any questions the model cannot answer correctly?

In [14]:
# Read news data
with open("/kaggle/input/relevant_articles.json") as d:
    articles = json.load(d)

In [15]:
# Display example article
example_art = articles[3]
example_art

{'_id': 198038,
 'url': 'http://www.bbc.co.uk/news/uk-scotland-35210821#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa',
 'title': "'Very drunk' patient numbers revealed",
 'feed': 'bbc',
 'type': 'politics',
 'pub': {'$date': '2016-01-02T00:42:46.000+0000'},
 'ret': {'$date': '2016-01-02T00:45:47.000+0000'},
 'lang': 'en',
 'refs': ['http://www.bbc.co.uk/news/uk-scotland-35097230'],
 'sum': 'Ambulances attend more than 60 incidents on average every day where a patient is so drunk that it has to be formally noted by crews.',
 'body': 'Paramedics treated about 12,000 people who were so drunk it was noted on Scottish Ambulance Service systems in the six months to the end of September.\nThe figures were obtained by the Scottish Conservatives under freedom of information laws.\nThe ambulance service said alcohol had a significant impact on its operations.\nIt comes after a recent internal Scottish Ambulance Service survey showed alcohol was a factor in more than half of all call-outs ambulanc

In [16]:
# Generate question/answer pairs
questions = [ "How many drunk people did the paramedics treated?", 
            "What country is the text about",
            "What drove the increase in demand over the festive period?", 
             "What is the text about?", 
             "Which country is leading in the tackling alcohol misuse?"
            ] 

answers = ["12,000",
          "Scotland",
           "Alcohol",
           "Alcohol issues in Scotland.",
           "Scotland"
          ]

In [17]:
# Iterate over all questions/paragraphs and predict answer
predicted_answers = []

for question in questions:
    inputs = tokenizer(question, example_art['text'], return_tensors="pt", max_length=512, truncation=True)
    inputs = {key: value.to(device) for key, value in inputs.items()}  

    with torch.no_grad():
        outputs = model(**inputs)

    # Extract start and end logits
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits

    # Get the most probable start and end positions
    start_idx = torch.argmax(start_logits)
    end_idx = torch.argmax(end_logits) + 1

    # Decode the predicted span
    input_ids = inputs["input_ids"].squeeze()
    predicted_answer = tokenizer.decode(input_ids[start_idx:end_idx])
    predicted_answers.append(predicted_answer)

# Display Results
for i, question in enumerate(questions):
    print(f"Question: {question}")
    print(f"Expected Answer: {answers[i]}")
    print(f"Predicted Answer: {predicted_answers[i]}")
    print("-" * 50)

Question: How many drunk people did the paramedics treated?
Expected Answer: 12,000
Predicted Answer: 12, 000
--------------------------------------------------
Question: What country is the text about
Expected Answer: Scotland
Predicted Answer: 
--------------------------------------------------
Question: What drove the increase in demand over the festive period?
Expected Answer: Alcohol
Predicted Answer: alcohol
--------------------------------------------------
Question: What is the text about?
Expected Answer: Alcohol issues in Scotland.
Predicted Answer: [CLS]
--------------------------------------------------
Question: Which country is leading in the tackling alcohol misuse?
Expected Answer: Scotland
Predicted Answer: scotland
--------------------------------------------------
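
Some of these predictions differ from the expected answer only in casing or in a space that appears when the word pieces are decoded back into text (`12, 000` versus `12,000`). For a qualitative check, it can help to compare answers in a form that ignores such differences. The following is a minimal sketch; `compact` is a hypothetical helper, and dropping all whitespace is a deliberately loose diagnostic rather than the SQuAD normalization.

```python
import string


def compact(text):
    """Lower-case the text and drop all punctuation and whitespace."""
    drop = set(string.punctuation) | set(string.whitespace)
    return "".join(ch for ch in text.lower() if ch not in drop)


# (expected, predicted) pairs taken from the output above.
pairs = [("12,000", "12, 000"), ("Scotland", "scotland"), ("Scotland", "")]
for expected, predicted in pairs:
    print(f"{expected!r} vs {predicted!r}: {compact(expected) == compact(predicted)}")
```

Under this comparison the first two pairs match, while the empty prediction remains a genuine miss.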


## Task 3: Experiment with Chat Models

In this task, we will explore using generative chat models for Question Answering by passing both the document and the question to the model as a structured prompt using system, user, and assistant messages. Unlike BERT, the model will not be fine-tuned on our dataset.

In [18]:
!pip install -q transformers accelerate bitsandbytes


<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#dae8fc;border-color:#6c8ebf;color:#0c5460">
<b>Step 1: Load the Model</b> 
</div>

Load the ["HuggingFaceTB/SmolLM-1.7B-Instruct"](https://huggingface.co/HuggingFaceTB/SmolLM-1.7B-Instruct) model, which was fine-tuned to function as a chatbot and supports prompts designed for chat formats. You can explore many other chat models here: [Hugging Face Chat Models](https://huggingface.co/models?other=conversational).

In [19]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

In [20]:
# Load the model
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-1.7B-Instruct")
chat_model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-1.7B-Instruct")



<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#dae8fc;border-color:#6c8ebf;color:#0c5460">
<b>Step 2: Define the Prompt Template</b> 
</div>

Complete the `compile_qa_prompt()` function to generate a QA-prompt using a chat template containing the following messages (as shown in the figure below):
   1. **System Message:** Define the model's behavior (e.g., "You are a helpful assistant trained to answer questions accurately using the provided context.").
   2. **User Message:** Include the news article as context, followed by the question.
   3. **Assistant Message:** Add a special token at the very end of the prompt template to signal the model to generate a response.

   *Note:* The `apply_chat_template` method in Hugging Face's Transformers library can be used to format these messages into a single string that aligns with the expected input format of chat models. This method ensures that the conversation history is structured correctly for models fine-tuned for chat interactions. For more details, refer to the [Chat Templates documentation](https://huggingface.co/docs/transformers/main/chat_templating).

**Prompt structure used with chat models:**


In [21]:
#from transformers import Conversation, apply_chat_template
# Function to create a prompt

def compile_qa_prompt(context, question):
    """Generates a prompt for a question-answering task using the provided context and question."""
    messages = [
        {"role": "system_message", "content": "You are a helpful assistant trained to answer questions accurately using the provided context."},
        {"role": "user_message", "content": f"Context: {context}\nQuestion: {question}"}
    ]

    input_text=tokenizer.apply_chat_template(messages, tokenize=False)
    print(input_text)


    return input_text
    

In [22]:
# Output a prompt in text form
prompt = compile_qa_prompt(example_art['text'], questions[0])
print(prompt)

<|im_start|>system_message
You are a helpful assistant trained to answer questions accurately using the provided context.<|im_end|>
<|im_start|>user_message
Context: Ambulances attend more than 60 incidents on average every day where a patient is so drunk that it has to be formally noted by crews.

 

Paramedics treated about 12,000 people who were so drunk it was noted on Scottish Ambulance Service systems in the six months to the end of September.
The figures were obtained by the Scottish Conservatives under freedom of information laws.
The ambulance service said alcohol had a significant impact on its operations.
It comes after a recent internal Scottish Ambulance Service survey showed alcohol was a factor in more than half of all call-outs ambulance staff dealt with at weekends.
The latest figures showed Scotland's largest health board, NHS Greater Glasgow and Clyde, had the highest number of alcohol-related 999 call-outs in the six month period at 3,849. It was followed by NHS Lot
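
The printed prompt shows that the template copies the role strings verbatim, so the model sees `system_message` and `user_message` rather than the conventional role names `system` and `user`. The prompt also does not end with an assistant header, because `add_generation_prompt` was not set. The sketch below reproduces the layout visible in the output above in plain Python to show what a prompt with conventional roles and a trailing assistant header would look like. `render_chatml` is a hypothetical stand-in; the real template of the tokenizer may differ in details.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Format chat messages in the <|im_start|>role ... <|im_end|> layout."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Context: ...\nQuestion: ..."},
]
print(render_chatml(messages))
```

With the actual tokenizer, the corresponding call would be `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` with the roles set to `system` and `user`.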

<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#dae8fc;border-color:#6c8ebf;color:#0c5460">
<b>Step 3: Generate Answers</b> 
</div>

Use the model and the function you implemented earlier to generate answers for the same context and questions used in Task 2. Repeat this process for all your questions, and compare the generated answers with those obtained from BERT.

In [23]:
# Generate Answers
inputs = tokenizer.encode(prompt, return_tensors="pt")
outputs = chat_model.generate(inputs, max_new_tokens=50, temperature=0.2, top_p=0.9, do_sample=True)

# Decode and print the model response
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Assistant's answer:")
print(answer)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Assistant's answer:
system_message
You are a helpful assistant trained to answer questions accurately using the provided context.
user_message
Context: Ambulances attend more than 60 incidents on average every day where a patient is so drunk that it has to be formally noted by crews.

 

Paramedics treated about 12,000 people who were so drunk it was noted on Scottish Ambulance Service systems in the six months to the end of September.
The figures were obtained by the Scottish Conservatives under freedom of information laws.
The ambulance service said alcohol had a significant impact on its operations.
It comes after a recent internal Scottish Ambulance Service survey showed alcohol was a factor in more than half of all call-outs ambulance staff dealt with at weekends.
The latest figures showed Scotland's largest health board, NHS Greater Glasgow and Clyde, had the highest number of alcohol-related 999 call-outs in the six month period at 3,849. It was followed by NHS Lothian, with 1,9
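
The decoded text starts with the full prompt because, for decoder-only models, `generate` returns the prompt tokens followed by the newly generated tokens. With `max_new_tokens=50`, the actual answer is only the short tail of this text. A minimal sketch of separating the two parts, using plain lists of token ids and a hypothetical helper `split_completion`:

```python
def split_completion(output_ids, prompt_len):
    """Return only the token ids that were generated after the prompt."""
    return output_ids[prompt_len:]


# Toy ids standing in for a tokenized prompt and a generated continuation.
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 2]
print(split_completion(output_ids, len(prompt_ids)))  # [42, 43, 2]

# With the notebook's tensors, the same idea would read (not executed here):
# answer = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
```

Applying this inside a loop over all questions from Task 2 would give one short answer per question that can be compared directly with BERT's predictions.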

**Your interpretation:**