In [2]:
from datasets import load_dataset
from rank_bm25 import BM25Okapi
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from nltk.tokenize import TreebankWordTokenizer
import nltk

<h2>Loading the dataset</h2>

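We load PubMedQA, specifically the pqa_labeled subset, which consists of 1,000 samples. The dataset is shuffled and split into train, development (dev), and test sets.

The test set consists of 500 samples and the dev set of 50 samples. The dev samples are drawn from the remaining 500 training samples. Because the code leaves split_dataset["train"] unchanged, train_dataset still contains all 500 rows, including the 50 dev samples. Holding those out would leave 450 samples for training.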

In [4]:

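# Load the labeled subset and shuffle it reproducibly
dataset = load_dataset("pubmed_qa", "pqa_labeled")["train"].shuffle(seed=42)
# Hold out 500 samples for testing; the other 500 stay in split_dataset["train"]
split_dataset = dataset.train_test_split(test_size=500, seed=42)
# Draw 50 dev samples from the training split (split_dataset["train"] itself is not reduced)
split_dataset["dev"] = split_dataset["train"].train_test_split(test_size=50, seed=42)["test"]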

# Three subsets of data for model training and evaluation
train_dataset = split_dataset["train"]
dev_dataset = split_dataset["dev"]
test_dataset = split_dataset["test"]
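
If the intent is a fully disjoint 450/50/500 split, the key step is to keep both halves of the second split. Here is a minimal sketch of that logic on plain index lists, using only the standard library rather than the `datasets` API. The function name `three_way_split` is hypothetical and not part of the notebook.

```python
import random

def three_way_split(n_samples, test_size, dev_size, seed=42):
    """Return disjoint (train, dev, test) lists of sample indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    test_idx = indices[:test_size]
    dev_idx = indices[test_size:test_size + dev_size]
    train_idx = indices[test_size + dev_size:]  # everything not held out
    return train_idx, dev_idx, test_idx

train_idx, dev_idx, test_idx = three_way_split(1000, test_size=500, dev_size=50)
print(len(train_idx), len(dev_idx), len(test_idx))
```

With `datasets`, the analogous step would be to assign both the `"train"` and the `"test"` part returned by the second `train_test_split` call, instead of only the `"test"` part.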


The dataset supports question-answering (QA) tasks over biomedical literature. Each entry has five columns:

1. pubid: The PubMed identifier of the source article, unique to each record.
2. question: The medical or scientific question posed.
3. context: Background information related to the question, stored as a nested record whose contexts field holds a list of passages.
4. long_answer: A detailed response to the question based on the provided context.
5. final_decision: The short answer label: yes, no, or maybe.

In [8]:
train_dataset

Dataset({
    features: ['pubid', 'question', 'context', 'long_answer', 'final_decision'],
    num_rows: 500
})

<h2>BM25 for document retrieval</h2>

We prepare a corpus from the training set for use with BM25, a ranking function for information retrieval that extends TF-IDF-style term weighting with term-frequency saturation and document-length normalization. BM25 ranks documents by their relevance to a given query, which makes it useful for retrieving supporting passages in QA tasks.

The code below does the following:
- Retrieves the context field from the training dataset (train_dataset).
- Joins the multiple context passages within each entry into a single string, producing a list of strings (corpus).
- Initializes NLTK's TreebankWordTokenizer, which follows Penn Treebank conventions for English text.
- Tokenizes each document in the corpus into a list of words.
- Creates a BM25 index from the tokenized corpus.

At query time, BM25 scores each document by how often the query's words occur in it and how distinctive those words are across the corpus.
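
To make the ranking step concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is an illustration under stated assumptions, not the `rank_bm25` implementation: the function `bm25_scores` and the toy records are hypothetical, whitespace splitting stands in for the Treebank tokenizer, and the IDF smoothing (adding 1 inside the logarithm) is one common variant that may differ from what `rank_bm25` computes internally. The defaults `k1=1.5` and `b=0.75` match the defaults of `BM25Okapi`.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic Okapi BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized, controlled by b
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher weight (smoothed, always positive)
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates as tf grows, controlled by k1
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical records shaped like the dataset's "context" field
records = [
    {"contexts": ["aspirin dose", "and stroke outcome"]},
    {"contexts": ["statin therapy", "and cholesterol level"]},
    {"contexts": ["aspirin combined", "with statin therapy"]},
]
corpus = [" ".join(record["contexts"]) for record in records]
tokenized = [doc.split() for doc in corpus]  # stand-in for the Treebank tokenizer

scores = bm25_scores("aspirin stroke".split(), tokenized)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, corpus[best])
```

The first record matches both query terms, so it ranks highest. The second record matches neither term and scores zero.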




In [17]:
# Prepare corpus for BM25 using training set


# Extract just the 'context' field from the train set
corpus_data = train_dataset["context"]
corpus = [' '.join(entry['contexts']) for entry in corpus_data]

# Tokenize the corpus
tokenizer = TreebankWordTokenizer()
tokenized_corpus = [tokenizer.tokenize(doc) for doc in corpus]

# Initialize BM25
bm25 = BM25Okapi(tokenized_corpus)


<rank_bm25.BM25Okapi at 0x7ff5b3561d00>