In this notebook we will compute sentence embeddings for our PubMed dataset. After computing the embeddings, we will save them, one per paragraph, in a Postgres database. We will later query that database with a question to find the paragraphs most relevant to it.

In [2]:
from datasets import load_dataset, concatenate_datasets

In [3]:
dataset_id = "pubmed_qa"
columns_to_use = ['pubid', 'question', 'context', 'long_answer']

In [4]:
unlabeled_dataset = load_dataset(dataset_id,  "pqa_unlabeled")
labeled_dataset = load_dataset(dataset_id,  "pqa_labeled")

In [5]:
def explode_context(examples):
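    """
    Flatten the paragraphs of every record in a batch into a single list.

    Each record's `context` column is a dict whose `contexts` key holds a list of paragraphs.

    Args:
        examples (dict): a batch of records, as passed by `Dataset.map` with `batched=True`.

    Returns:
        dict: a `context` column with one paragraph per row.
    """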
    contexts = []
    all_contexts = examples.get("context")
    for context_dict in all_contexts:
        contexts.extend(context_dict.get("contexts"))
    return {"context": contexts}
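
To make the effect of `explode_context` concrete, here is a small sketch on a hand-made batch. The two records are invented for illustration and only mimic the shape of a `pubmed_qa` batch (a `context` column whose entries hold a `contexts` list); they are not real rows.

```python
def explode_context(examples):
    """Flatten the per-record `contexts` lists of a batch into one list."""
    contexts = []
    for context_dict in examples.get("context"):
        contexts.extend(context_dict.get("contexts"))
    return {"context": contexts}


# Hypothetical batch with two records holding 2 and 1 paragraphs.
toy_batch = {
    "context": [
        {"contexts": ["paragraph A1", "paragraph A2"]},
        {"contexts": ["paragraph B1"]},
    ]
}

exploded = explode_context(toy_batch)
print(exploded["context"])
```

The batch goes in with two rows and comes out with three, which is why the `map` calls in this notebook pass `remove_columns`: the original columns would no longer line up with the exploded one.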

In [6]:
unlabeled_dataset

DatasetDict({
    train: Dataset({
        features: ['pubid', 'question', 'context', 'long_answer'],
        num_rows: 61249
    })
})

In [7]:
unlabeled_dataset["train"].shape

(61249, 4)

In [8]:
unlabeled_context_dataset =  unlabeled_dataset['train'].map(explode_context,
     remove_columns=['question', 'long_answer', "context", "pubid"], 
     batched=True)

In [18]:
labeled_context_dataset = labeled_dataset["train"].map(explode_context,
     remove_columns=['question', 'long_answer', "context", "pubid", "final_decision"],
     batched=True)

In [19]:
context_dataset = concatenate_datasets([unlabeled_context_dataset, labeled_context_dataset])

Let us check one split of the dataset

With the contexts collected into a single dataset, we can now encode them and save the encodings in the database.

Now that the dataset is ready, let us embed the first two sentences to check that the embedding model works.

### Testing the embedding model.

We will use a sentence-transformers model to compute sentence embeddings of our text.

In [22]:
from sentence_transformers import SentenceTransformer

In [23]:
embedding_model_name = 'michiyasunaga/BioLinkBERT-large'

# Load the BERT model
model = SentenceTransformer(embedding_model_name)

# Display the max_sequence_length of the model
max_sequence_length = model.max_seq_length
print("Max Sequence Length:", max_sequence_length)

No sentence-transformers model found with name /Users/esp.py/.cache/torch/sentence_transformers/michiyasunaga_BioLinkBERT-large. Creating a new one with MEAN pooling.


Max Sequence Length: 512
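
The warning above says that the checkpoint is not a native sentence-transformers model, so the library wraps it with mean pooling: the sentence vector is the average of the token vectors, with padding positions left out. A minimal pure-Python sketch of that idea follows. The 2-dimensional vectors and the mask are made up for illustration, and `mean_pool` is a hypothetical helper, not the library's own function.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                totals[i] += value
    # Guard against an all-padding input.
    return [total / max(kept, 1) for total in totals]


# Made-up token vectors; the last position is padding and must not count.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```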


Since our model has a max_sequence_length of 512, we need to split each context into chunks of at most 512 tokens.
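
The splitting step could be sketched as below. This is only an illustration under a loud assumption: tokens are approximated by whitespace-separated words, which is not how the model's own tokenizer counts them. The helpers `chunk_tokens` and `chunk_text` are hypothetical names. In practice the `tokenize` argument would be swapped for the model's tokenizer, and the budget would probably need to sit a little below 512 to leave room for special tokens.

```python
def chunk_tokens(tokens, max_tokens=512):
    """Split a token list into consecutive chunks of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


def chunk_text(text, max_tokens=512, tokenize=str.split):
    """Chunk text; `tokenize` defaults to whitespace splitting as a stand-in."""
    return [" ".join(chunk) for chunk in chunk_tokens(tokenize(text), max_tokens)]


# 1100 placeholder words give chunks of 512, 512 and 76 words.
sample = " ".join(f"w{i}" for i in range(1100))
chunks = chunk_text(sample)
print([len(c.split()) for c in chunks])  # [512, 512, 76]
```

Without a step like this, inputs longer than the limit are truncated by the model, so the tail of a long paragraph would not contribute to its embedding.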

In [26]:
def extract_embeddings(examples):
    """
    Take a batch of examples and compute an embedding for each context.

    Returns a new column named `embedding`, which `Dataset.map` adds to the batch.
    """
    # `examples` is a plain batch (a dict of lists) with no `save_to_disk` method,
    # so saving has to be done on the Dataset that `map` returns.
    return {"embedding": model.encode(examples["context"], show_progress_bar=True)}

In [28]:
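data_with_embeddings = context_dataset.map(extract_embeddings, batched=True, batch_size=16)
# Save the dataset, including the new `embedding` column, to a local directory.
data_with_embeddings.save_to_disk("embeddings_pubmed_qa")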
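
Once the embeddings exist, the store-then-query plan from the top of the notebook could look roughly like the sketch below. It is a stand-in, not the final setup: an in-memory SQLite database replaces Postgres so the example runs anywhere, the `paragraphs` table and its columns are hypothetical, embeddings are stored as JSON text, and the 3-dimensional vectors are made up rather than produced by the model.

```python
import json
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# In-memory SQLite stands in for Postgres; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE paragraphs (id INTEGER PRIMARY KEY, context TEXT, embedding TEXT)")

# Made-up 3-dimensional vectors in place of real model output.
rows = [
    ("paragraph about aspirin", [0.9, 0.1, 0.0]),
    ("paragraph about surgery", [0.0, 0.2, 0.9]),
]
conn.executemany(
    "INSERT INTO paragraphs (context, embedding) VALUES (?, ?)",
    [(text, json.dumps(vector)) for text, vector in rows],
)


def most_similar(question_vector, top_k=1):
    """Return the top_k stored paragraphs ranked by cosine similarity."""
    scored = [
        (cosine(question_vector, json.loads(embedding)), context)
        for context, embedding in conn.execute("SELECT context, embedding FROM paragraphs")
    ]
    return [context for _, context in sorted(scored, reverse=True)[:top_k]]


print(most_similar([1.0, 0.0, 0.0]))
```

Here the ranking happens in Python after loading every row, which is fine for a toy table but not for the full dataset. With Postgres, a vector extension such as pgvector could do the similarity ranking inside the database instead.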

In [30]:
import os

In [31]:
os.environ.get("HF_DATASETS_CACHE")

This code needs to be run on a GPU. I still need to find a way to connect to a Colab GPU from my local machine.

That will be fun for another day.