# Biomedical Question Answering with PubMedQA and PubMedBERT

This project implements a biomedical question answering pipeline using the PubMedQA dataset and PubMedBERT.


## Environment Setup

The code was developed and tested in a Unix-based environment (Ubuntu/macOS), consistent with the class VM configuration. All dependencies required to run the notebook are listed in the accompanying requirements.txt file. These include transformers, datasets, nltk, scikit-learn, sentence-transformers, and rank_bm25, as well as accelerate (version ≥ 0.26.0). To set up the environment, run pip install -r requirements.txt.

In [2]:
# importing necessary libraries

from datasets import load_dataset
from rank_bm25 import BM25Okapi        
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from nltk.tokenize import TreebankWordTokenizer
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
import torch



from sentence_transformers import SentenceTransformer, util
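
The version requirement on accelerate can be checked before running the rest of the notebook. The snippet below is a minimal sketch using only the standard library; the helper names version_tuple and meets_minimum are hypothetical, and the parsing deliberately ignores pre-release suffixes, so it is a rough sanity check and not a full version comparison.

```python
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '0.26.1' or '0.27.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(package: str, minimum: str):
    """Return True/False, or None when the package is not installed."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return version_tuple(installed) >= version_tuple(minimum)


print("accelerate >= 0.26.0:", meets_minimum("accelerate", "0.26.0"))
```

A result of None means the package is missing entirely, which usually indicates that the requirements file has not been installed yet.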

## Dataset


The dataset used is PubMedQA, a biomedical question-answering dataset designed for reasoning over biomedical research texts. This project uses the PubMedQA Labeled (PQA-L) subset, which contains 1,000 expert-annotated instances, each labeled yes, no, or maybe.


#### Loading and preparing the dataset

First, the pqa_labeled configuration of the pubmed_qa dataset is loaded with the Hugging Face datasets library, selecting the "train" split. We then shuffle the data using a fixed seed (42) so that the ordering is identical across runs.

Next, we apply a train-test split, allocating 80% of the data for training and 20% for testing. Finally, we extract the train and test datasets from the split dictionary, storing them separately for later use in model training and evaluation.


In [3]:
# ——— load & split
dataset = load_dataset("pubmed_qa", "pqa_labeled")["train"].shuffle(seed=42)
split = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset, test_dataset = split["train"], split["test"]
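
To illustrate why a fixed seed makes the split reproducible, here is a minimal pure-Python sketch of a seeded 80/20 split. It does not reproduce the datasets library's internal shuffling, so the specific examples that land in each partition will differ from the notebook's; shuffle_and_split is a hypothetical helper used only for illustration.

```python
import random


def shuffle_and_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of items with a seeded RNG, then cut off a test slice."""
    rng = random.Random(seed)          # private RNG, so global state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# 1,000 example indices, mirroring the size of PQA-L
train_ids, test_ids = shuffle_and_split(range(1000))
print(len(train_ids), len(test_ids))

# Re-running with the same seed gives the same partition
assert shuffle_and_split(range(1000)) == (train_ids, test_ids)
```

With 1,000 examples and test_size=0.2, this yields 800 training and 200 test indices, the same sizes the notebook's split produces.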


## BM25 retriever

BM25 is a probabilistic ranking function commonly used in information retrieval. Here it selects the context passed to the classifier for each question.

1. Prepare the corpus: the context passages of each training example are joined into a single document.
2. Initialize the tokenizer: a word tokenizer (TreebankWordTokenizer) splits each document into individual tokens for BM25 processing.
3. Create the BM25 index: BM25Okapi builds an index from the tokenized documents so that queries can be scored against them.
4. Define retrieve_with_bm25: this function returns the top-k contexts for a given query (k=1 by default).

The top-ranked document is paired with the question as input to the question-answering model. Note that the index contains only training-set contexts, so test questions are also matched against the training corpus and not against their own abstracts.

In [4]:
# BM25 retriever over the joined training contexts
corpus = [' '.join(e['contexts']) for e in train_dataset["context"]]
tokenizer_bm25 = TreebankWordTokenizer()
bm25 = BM25Okapi([tokenizer_bm25.tokenize(doc) for doc in corpus])
def retrieve_with_bm25(q, k=1):
    tokens = tokenizer_bm25.tokenize(q)
    scores = bm25.get_scores(tokens)
    idxs = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [corpus[i] for i in idxs]
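
For readers unfamiliar with how BM25 scores a document, the following is a minimal from-scratch sketch of Okapi-style scoring. It is not the rank_bm25 implementation: the IDF variant shown here (the always-positive log(1 + ...) form) and the k1 and b values are assumptions chosen for illustration, so the numeric scores will not match those of BM25Okapi.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with an Okapi-style BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: long documents are penalized via b
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "aspirin reduces risk of stroke".split(),
    "statins lower cholesterol levels".split(),
    "aspirin and statins in elderly patients".split(),
]
scores = bm25_scores("aspirin stroke".split(), docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print("best document index:", best)
```

The first toy document matches both query terms and therefore ranks highest, while the second shares no terms with the query and scores zero.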

## Preprocessing

To prepare biomedical question-answering data for model training, the preprocessing stage involves encoding categorical labels into numerical values for efficient classification.

A dictionary (label2id) maps answers "yes," "no," and "maybe" to numeric identifiers (1, 0, and 2, respectively). An inverse mapping (id2label) is also created to convert predicted numeric labels back into their original categorical format, ensuring interpretability when reviewing model predictions.

A domain-specific tokenizer is initialized using the pretrained PubMedBERT model, which is optimized for biomedical texts.


In [5]:
# ——— preprocessing
label2id = {'no': 0, 'yes': 1, 'maybe': 2}
id2label = {v:k for k,v in label2id.items()}
tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")

The preprocess function prepares the question-answering data for model training. For each question it retrieves the top BM25 context and encodes the question-context pair with the PubMedBERT tokenizer, truncating or padding to 512 tokens. If no context is retrieved, the entry is skipped. The function collects input IDs, attention masks, and numerical labels in the format the model expects, and it is applied to both the train and test datasets.

In [6]:
def preprocess(examples):
    in_ids, attn, labs = [], [], []
    for q, lbl in zip(examples['question'], examples['final_decision']):
        docs = retrieve_with_bm25(q)
        if not docs: continue
        enc = tokenizer(q, docs[0], truncation=True, padding='max_length', max_length=512)
        in_ids.append(enc['input_ids'])
        attn.append(enc['attention_mask'])
        labs.append(label2id[lbl.lower()])
    return {'input_ids':in_ids, 'attention_mask':attn, 'labels':labs}

train_enc = preprocess(train_dataset)
test_enc  = preprocess(test_dataset)
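
The layout of an encoded question-context pair can be sketched without the real tokenizer. The toy function below splits on whitespace instead of using WordPiece, and it trims only the context when the pair is too long, whereas the Hugging Face tokenizer applies its own truncation strategy. encode_pair and its small max_length are hypothetical and exist only to show how special tokens, padding, and the attention mask line up.

```python
def encode_pair(question, context, max_length=16, pad="[PAD]"):
    """Build a [CLS] question [SEP] context [SEP] sequence of fixed length."""
    q_tokens = question.split()
    c_tokens = context.split()

    # Reserve room for [CLS] and two [SEP] markers, then trim the context.
    budget = max_length - len(q_tokens) - 3
    c_tokens = c_tokens[:max(budget, 0)]

    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + c_tokens + ["[SEP]"]
    tokens = tokens[:max_length]  # safety cut for very long questions

    # 1 marks real tokens, 0 marks padding the model should ignore
    attention_mask = [1] * len(tokens)
    n_pad = max_length - len(tokens)
    return tokens + [pad] * n_pad, attention_mask + [0] * n_pad


tokens, mask = encode_pair("does aspirin help ?", "aspirin reduced stroke risk")
print(tokens)
print(mask)
```

In the notebook the same idea applies at max_length=512, with integer token IDs in place of the token strings shown here.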

The PubMedQADataset class is a custom wrapper that subclasses PyTorch's Dataset class. It takes the preprocessed, tokenized data (enc) as input and makes it compatible with PyTorch's data-loading pipeline.

1. The __init__ method initializes the dataset by storing the encoded input dictionary.
2. The __len__ method returns the number of samples based on the length of the labels list, ensuring proper iteration during training.
3. The __getitem__ method retrieves a specific data sample as a dictionary, converting each element (input IDs, attention masks, labels) into PyTorch tensors for model compatibility.

Finally, train_ds and test_ds instances are created, preparing data for model training and evaluation.

In [7]:
class PubMedQADataset(torch.utils.data.Dataset):
    def __init__(self, enc): self.enc = enc
    def __len__(self): return len(self.enc['labels'])
    def __getitem__(self, i): return {k:torch.tensor(v[i]) for k,v in self.enc.items()}

train_ds, test_ds = PubMedQADataset(train_enc), PubMedQADataset(test_enc)



## Evaluation
The compute_metrics function evaluates the performance of the question-answering model. It obtains predicted labels by selecting the highest-scoring class for each example with np.argmax(). It then computes accuracy, which measures overall correctness, and the macro F1-score, which averages the per-class F1 scores so that each class counts equally regardless of its frequency.

In [8]:
# ——— classification metrics
def compute_metrics(pred):
    preds = np.argmax(pred.predictions, axis=1)
    acc   = accuracy_score(pred.label_ids, preds)
    f1    = f1_score(pred.label_ids, preds, average='macro')
    return {'accuracy':acc, 'f1':f1}
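
The difference between accuracy and macro F1 is easiest to see on a small hand-made example. The sketch below recomputes both metrics from scratch; the label lists are invented for illustration and are not taken from the notebook's predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and reference agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Unweighted mean of per-class F1 scores (0.0 when a class has no support)."""
    f1_per_class = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_per_class.append(2 * tp / denom if denom else 0.0)
    return sum(f1_per_class) / len(f1_per_class)


# Invented labels: 0 = no, 1 = yes, 2 = maybe
y_true = [1, 1, 1, 1, 0, 0, 2, 2]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1]   # a classifier that always answers "yes"

print(f"accuracy: {accuracy(y_true, y_pred):.4f}")   # 0.5000
print(f"macro F1: {macro_f1(y_true, y_pred):.4f}")   # 0.2222
```

A classifier that always predicts the most frequent label reaches 50% accuracy here, but its macro F1 is much lower because two of the three classes are never predicted correctly.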


### Semantic Similarity Evaluation with Sentence-BERT

#### Setting the Random Seed
Before defining the evaluation function, we set a fixed PyTorch random seed (32) to support reproducibility.

#### Defining the Semantic Similarity Evaluation Function

The function evaluate_semantic_similarity calculates how similar the predicted responses are to the reference answers. It takes two arguments:

1. preds: A list of predicted answers from the model.

2. refs: A list of reference (ground truth) answers.

Embeddings come from Sentence-BERT (all-MiniLM-L6-v2), a pretrained sentence-embedding model.

The function iterates over the prediction-reference pairs, encodes each text into an embedding tensor, and computes the cosine similarity between the two. Each score is appended to a list, and the average over all pairs is printed once every pair has been processed.

This metric quantifies how closely the model's predictions align with the expected answers in embedding space. Because both predictions and references here are the single-word labels yes, no, and maybe, the score is determined entirely by which label pairs occur, so it complements accuracy and macro F1 and does not replace them.

In [9]:
torch.manual_seed(32)
def evaluate_semantic_similarity(preds, refs):
    model_Sentence = SentenceTransformer("all-MiniLM-L6-v2")
    sim_scores = []
    for pred, ref in zip(preds, refs):
        sim = util.cos_sim(
            model_Sentence.encode(pred, convert_to_tensor=True),
            model_Sentence.encode(ref, convert_to_tensor=True)
        ).item()
        sim_scores.append(sim)
    avg_sim = sum(sim_scores) / len(sim_scores)
    print(f"Average Semantic Similarity: {avg_sim:.2f}")
    return avg_sim  # returned so callers can store the score
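
The cosine similarity at the core of this function can be written out directly. The sketch below uses invented three-dimensional vectors as stand-ins for the label embeddings (real all-MiniLM-L6-v2 embeddings have 384 dimensions), so the resulting numbers are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, NOT produced by Sentence-BERT
toy_embedding = {
    "yes":   [1.0, 0.0, 0.0],
    "no":    [0.0, 1.0, 0.0],
    "maybe": [1.0, 1.0, 0.0],
}

preds = ["yes", "no", "maybe"]
refs  = ["yes", "yes", "maybe"]

sims = [cosine_similarity(toy_embedding[p], toy_embedding[r])
        for p, r in zip(preds, refs)]
avg_sim = sum(sims) / len(sims)
print(f"average similarity: {avg_sim:.4f}")
```

Identical labels always score 1.0, so the average depends only on how often each mismatched label pair occurs and on how close those two label words are in embedding space.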

### Pretrained Baseline

This step assesses the baseline performance of the PubMedBERT model before fine-tuning. The classification head is newly initialized at this point (as the warning in the output below confirms), so the baseline reflects an untrained classifier and not any learned question-answering ability.

First, the pretrained model is loaded using AutoModelForSequenceClassification, specifying three classification labels (yes, no, maybe). The Trainer is then initialized, linking the model with the evaluation dataset (test_ds) and the metric function (compute_metrics). The evaluation is executed using .evaluate(), which returns accuracy and macro F1 scores.

After the initial assessment, the script obtains raw predictions by calling .predict() on the test dataset. The predicted labels are determined using np.argmax(), which selects the highest-scoring class for each instance. Reference labels are retrieved for comparison. Both predicted and reference labels are converted back to human-readable text using the predefined id2label mapping.

Finally, the script evaluates the semantic similarity between the predicted and reference label texts using evaluate_semantic_similarity(). This gives a third reference point, alongside accuracy and macro F1, for comparison with the fine-tuned model.


In [10]:
# — PART 1: pretrained evaluation
print("=== PART 1: Pretrained evaluation ===")
pre_model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
    num_labels=3, id2label=id2label, label2id=label2id
)
pre_trainer = Trainer(model=pre_model, compute_metrics=compute_metrics, eval_dataset=test_ds)
pre_res = pre_trainer.evaluate()
print(pre_res)

# get raw preds & refs as texts
pred_out = pre_trainer.predict(test_ds)
pred_ids = np.argmax(pred_out.predictions, axis=1)
ref_ids  = pred_out.label_ids
pred_texts = [id2label[i] for i in pred_ids]
ref_texts  = [id2label[i] for i in ref_ids]

print("Semantic similarity (pretrained):")
evaluate_semantic_similarity(pred_texts, ref_texts)

=== PART 1: Pretrained evaluation ===


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'eval_loss': 1.2734869718551636, 'eval_model_preparation_time': 0.0041, 'eval_accuracy': 0.175, 'eval_f1': 0.15578002244668912, 'eval_runtime': 141.1092, 'eval_samples_per_second': 1.417, 'eval_steps_per_second': 0.177}




Semantic similarity (pretrained):
Average Semantic Similarity: 0.59
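
The prediction step applies argmax to the model's raw logits. The sketch below, with made-up logits, shows that this selects the same class as argmax over softmax probabilities, and it computes the cross-entropy loss of a uniform three-class guess for comparison with the eval_loss reported above. The logit values are hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


id2label = {0: "no", 1: "yes", 2: "maybe"}

logits = [0.2, 1.3, -0.4]          # hypothetical model output for one example
probs = softmax(logits)

# Softmax is monotonic, so both argmax calls agree
print(id2label[argmax(logits)], id2label[argmax(probs)])

# Cross-entropy of a uniform guess over three classes: ln(3)
uniform_loss = -math.log(1 / 3)
print(f"uniform-guess loss: {uniform_loss:.4f}")
```

The uniform-guess loss is about 1.0986, so the pretrained eval_loss of roughly 1.27 is consistent with a classification head that has not yet learned anything useful.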


### Fine-tuning

The fine-tuning and re-evaluation process aims to improve the biomedical question-answering model.


First, we reload the pretrained model and define the number of classification labels (num_labels=3). Training hyperparameters, namely learning rate (2e-5), batch size (8), number of epochs (3), and weight decay (0.01), are configured using TrainingArguments. The small learning rate is typical for fine-tuning BERT-style models, and weight decay adds regularization to limit overfitting on the small training set.


Next, the model is fine-tuned using the Trainer class, incorporating both the training dataset (train_ds) and evaluation metrics (compute_metrics). The .train() function initiates the fine-tuning process, adjusting model weights based on biomedical question-answering examples. Once training is complete, .evaluate() is used to assess model performance on the test dataset (test_ds), providing accuracy and macro F1-score results.

After evaluation, predictions are extracted using .predict(), converting raw model outputs into categorical labels using np.argmax(). These predictions are mapped back to their textual representations (yes, no, maybe) via the id2label dictionary. Finally, semantic similarity between model predictions and reference answers is calculated using evaluate_semantic_similarity(), assessing how well the fine-tuned model captures meaning compared to ground truth responses.

In [11]:
# — PART 2: finetune & re-evaluate
print("\n=== PART 2: Finetune & evaluate ===")
ft_model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
    num_labels=3, id2label=id2label, label2id=label2id
)
training_args = TrainingArguments(
    output_dir='./results', learning_rate=2e-5,
    per_device_train_batch_size=8, per_device_eval_batch_size=8,
    num_train_epochs=3, weight_decay=0.01, logging_steps=50
)
ft_trainer = Trainer(
    model=ft_model, args=training_args,
    train_dataset=train_ds, compute_metrics=compute_metrics
)
ft_trainer.train()
ft_res = ft_trainer.evaluate(eval_dataset=test_ds)
print(ft_res)

# get finetuned preds & refs
pred_out2 = ft_trainer.predict(test_ds)
pred_ids2 = np.argmax(pred_out2.predictions, axis=1)
ref_ids2  = pred_out2.label_ids
pred_texts2 = [id2label[i] for i in pred_ids2]
ref_texts2  = [id2label[i] for i in ref_ids2]

print("Semantic similarity (finetuned):")
evaluate_semantic_similarity(pred_texts2, ref_texts2)



=== PART 2: Finetune & evaluate ===


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
50,0.9751
100,0.9596
150,0.922
200,0.8573
250,0.7603
300,0.7829




{'eval_loss': 1.138144850730896, 'eval_accuracy': 0.475, 'eval_f1': 0.3110209601081812, 'eval_runtime': 132.0452, 'eval_samples_per_second': 1.515, 'eval_steps_per_second': 0.189, 'epoch': 3.0}




Semantic similarity (finetuned):
Average Semantic Similarity: 0.83
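
The training log above ends at step 300, which can be checked against the hyperparameters. The arithmetic below assumes a single device, no gradient accumulation, and that no training examples were skipped during preprocessing; none of these is stated explicitly in the notebook.

```python
import math

n_examples = 1000        # size of PQA-L
test_fraction = 0.2
batch_size = 8
epochs = 3
logging_steps = 50

train_size = n_examples - round(n_examples * test_fraction)   # 800
steps_per_epoch = math.ceil(train_size / batch_size)          # 100
total_steps = steps_per_epoch * epochs                        # 300

# Steps at which a training loss is logged
logged = list(range(logging_steps, total_steps + 1, logging_steps))
print(total_steps, logged)
```

Under these assumptions the model sees 300 optimizer steps in total, and the six logged steps match the six rows of the training-loss table.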


## Results

The final comparison shows the improvements achieved through fine-tuning on the held-out test set.

Accuracy rises from 0.1750 to 0.4750, so the fine-tuned model labels far more test questions correctly.

The macro F1-score also improves, from 0.1558 to 0.3110. It remains well below accuracy, which indicates that performance is still uneven across the three labels.

Additionally, the average semantic similarity between predicted and reference labels increases from 0.59 to 0.83, which is consistent with the fine-tuned model predicting the reference label, or a label close to it in embedding space, more often.





In [12]:
# — FINAL comparison
print("\n=== Comparison on test set ===")
print(f"Accuracy → pretrained: {pre_res['eval_accuracy']:.4f}, finetuned: {ft_res['eval_accuracy']:.4f}")
print(f"   F1    → pretrained: {pre_res['eval_f1']:.4f}, finetuned: {ft_res['eval_f1']:.4f}")

print("Average semantic similarity -> pretrained:")
semantic_pretrained = evaluate_semantic_similarity(pred_texts, ref_texts)

print("Average semantic similarity -> finetuned:")
semantic_finetuned = evaluate_semantic_similarity(pred_texts2, ref_texts2)





=== Comparison on test set ===
Accuracy → pretrained: 0.1750, finetuned: 0.4750
   F1    → pretrained: 0.1558, finetuned: 0.3110
Average semantic similarity -> pretrained:
Average Semantic Similarity: 0.59
Average semantic similarity -> finetuned:
Average Semantic Similarity: 0.83
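
To put the reported numbers in perspective, the sketch below computes the absolute gains and a rough binomial standard error for the fine-tuned accuracy. The test-set size of 200 is an assumption derived from the 20% split of 1,000 examples, and the standard-error formula treats test examples as independent.

```python
import math

# Metrics copied from the comparison output above
pretrained = {"accuracy": 0.1750, "macro_f1": 0.1558, "semantic_similarity": 0.59}
finetuned = {"accuracy": 0.4750, "macro_f1": 0.3110, "semantic_similarity": 0.83}

gains = {name: finetuned[name] - pretrained[name] for name in pretrained}
for name, gain in gains.items():
    print(f"{name}: +{gain:.4f}")


def accuracy_standard_error(acc, n):
    """Binomial standard error of an accuracy estimated from n examples."""
    return math.sqrt(acc * (1 - acc) / n)


n_test = 200  # assumed: 20% of the 1,000 PQA-L examples
se = accuracy_standard_error(finetuned["accuracy"], n_test)
print(f"fine-tuned accuracy: {finetuned['accuracy']:.3f} +/- {se:.3f}")
```

With a standard error of roughly 0.035, the 0.30 gain in accuracy is large relative to sampling noise, although a single run with one seed cannot show how much the result varies across training runs.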
