In this notebook we will compute sentence embeddings for our PubMed dataset. We will then save the paragraph embeddings in a Postgres database, which we will later query to find the paragraphs most relevant to a question.

In [1]:
from datasets import load_dataset



In [2]:
dataset_id = "pubmed_qa"

In [3]:
dataset = load_dataset(dataset_id,  "pqa_unlabeled")

Let us inspect the contexts of one random example from the train split.


In [4]:
from random import randrange

random_id = randrange(len(dataset["train"]))
print(dataset["train"][random_id].get("context").get("contexts"))

['An increased incidence of acute urinary retention has been reported after interstitial prostate radiation therapy when the anterior base of the prostate gland receives 100% of the prescription dose. The frequency of prostate cancer in this location as a function of the pre-treatment prostate specific antigen (PSA), biopsy Gleason score, and 1992 American Joint Commission on Cancer Staging (AJCC) was determined.', "One hundred four men treated at the Brigham and Women's Hospital with radical prostatectomy for clinically localized prostate cancer between 1995-1996 comprised the study population. Prostatectomy specimens were whole mounted and the location of each tumor foci enumerated.", 'Of 269 foci of prostate cancer found in 39 low-risk prostate cancer patients (PSA<10 ng/ml, biopsy Gleason score<or = 6, and 1992 AJCC clinical stage T1c,2a), a single focus (0.37%) was noted in the anterior base. Conversely, 20/355 (5.6%) and 18/251 (7.2%) tumor foci were noted in the anterior base in

Now that the dataset is downloaded, let us extract only the contexts, which we will load into our database.

In [5]:
dataset = dataset.flatten()

In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['pubid', 'question', 'context.contexts', 'context.labels', 'context.meshes', 'long_answer'],
        num_rows: 61249
    })
})

In [8]:
pandas_dataset = dataset.get('train').to_pandas()

In [9]:
pandas_dataset.shape

(61249, 6)

In [10]:
pandas_dataset.head()

Unnamed: 0,pubid,question,context.contexts,context.labels,context.meshes,long_answer
0,14499029,Is naturopathy as effective as conventional th...,[Although the use of alternative medicine in t...,"[BACKGROUND, OBJECTIVE, DESIGN, SETTING, PATIE...","[Anxiety, Cohort Studies, Confidence Intervals...",Naturopathy appears to be an effective alterna...
1,14499049,Can randomised trials rely on existing electro...,"[To estimate the feasibility, utility and reso...","[OBJECTIVES, DATA SOURCES, REVIEW METHODS, RES...","[Arthroplasty, Replacement, Knee, Bias, Blood ...",Routine data have the potential to support hea...
2,14499672,Is laparoscopic radical prostatectomy better t...,[To compare morbidity in two groups of patient...,"[OBJECTIVE, PATIENTS AND METHODS, RESULTS]","[Aged, Follow-Up Studies, Humans, Italy, Lapar...",The results of our non-randomized study show t...
3,14499773,Does bacterial gastroenteritis predispose peop...,[Irritable bowel syndrome (IBS) might develop ...,"[OBJECTIVES, METHODS, RESULTS]","[Adolescent, Adult, Age Distribution, Aged, Ag...",Symptoms consistent with IBS and functional di...
4,14499777,Is early colonoscopy after admission for acute...,[Urgent colonoscopy has been proposed for the ...,"[OBJECTIVES, METHODS, RESULTS]","[Acute Disease, Aged, Aged, 80 and over, Cohor...",No significant association is apparent between...


In [11]:
pandas_dataset['context.contexts'].explode().to_frame()

Unnamed: 0,context.contexts
0,Although the use of alternative medicine in th...
0,To compare naturopathic therapy with conventio...
0,"A retrospective cohort study, using abstracted..."
0,One natural medicine and six conventional medi...
0,Women aged 40 years of age or more with a diag...
...,...
61247,To evaluate failures and to investigate the ne...
61247,In October 1992 a prospective study on primary...
61247,After a median follow-up time of 36 months in ...
61248,The aim of this study is the comparison betwee...


In [12]:
all_contexts = pandas_dataset['context.contexts'].explode().to_frame().drop_duplicates()

In [13]:
from datasets import Dataset, Features, DatasetInfo

In [14]:
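from datasets import Value

# Feature types must be datasets feature objects such as Value("string"), not Python types like str
context_features = Features({"context": Value("string")})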

In [15]:
Dataset

datasets.arrow_dataset.Dataset

In [16]:
context_dataset = Dataset.from_pandas(all_contexts)

In [17]:
context_dataset

Dataset({
    features: ['context.contexts', '__index_level_0__'],
    num_rows: 201201
})

With the contexts collected into their own dataset, we can now encode them and save the resulting embeddings in the database.
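As a rough illustration of the storage step, the sketch below prepares rows for insertion into Postgres. It assumes the pgvector extension is used for the embedding column and that a psycopg-style driver with `%s` placeholders runs the statements; neither is confirmed by this notebook. The table name `contexts`, its columns, and the helper names are hypothetical. The block only builds the SQL strings and row tuples and does not connect to a database.

```python
from typing import List, Sequence, Tuple

# Hypothetical schema, assuming the pgvector extension is available.
# 1024 is the embedding width of the model used in this notebook.
CREATE_TABLE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS contexts (
    id BIGSERIAL PRIMARY KEY,
    context TEXT NOT NULL,
    embedding vector(1024)
);
"""

# Assumes a psycopg-style driver that uses %s placeholders.
INSERT_SQL = "INSERT INTO contexts (context, embedding) VALUES (%s, %s::vector)"


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format an embedding in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_rows(
    texts: Sequence[str], embeddings: Sequence[Sequence[float]]
) -> List[Tuple[str, str]]:
    """Pair each context with its embedding literal, ready for executemany."""
    if len(texts) != len(embeddings):
        raise ValueError("texts and embeddings must have the same length")
    return [(text, to_vector_literal(emb)) for text, emb in zip(texts, embeddings)]


# Toy 2-dimensional vectors, used only to show the row format.
rows = build_rows(
    ["first paragraph", "second paragraph"],
    [[0.1, 0.2], [0.3, 0.4]],
)
print(rows[0])
```

With a real connection, the rows could be passed to something like `cursor.executemany(INSERT_SQL, rows)`, using embeddings converted to plain lists of floats.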

Once we have created our dataset, let us embed a few sentences to check that the embedding model works.

### Testing the embedding model.

We will use a sentence-transformers model to compute sentence embeddings of our text.

In [18]:
from sentence_transformers import SentenceTransformer

In [19]:
embedding_model_name = 'michiyasunaga/BioLinkBERT-large'

# Load the BERT model
model = SentenceTransformer(embedding_model_name)

# Display the max_sequence_length of the model
max_sequence_length = model.max_seq_length
print("Max Sequence Length:", max_sequence_length)

No sentence-transformers model found with name /Users/esp.py/.cache/torch/sentence_transformers/michiyasunaga_BioLinkBERT-large. Creating a new one with MEAN pooling.


Max Sequence Length: 512


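Since our model has a max_seq_length of 512, any context longer than 512 tokens needs to be split into chunks of at most 512 tokens; otherwise sentence-transformers truncates the input. As a rough first check, we look at the character length of each context.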
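The splitting step could look roughly like the sketch below. It is a minimal illustration, not the notebook's actual method: the helper `chunk_tokens` and its `overlap` parameter are hypothetical, and a whitespace split stands in for the model's real tokenizer. With the real tokenizer, the window size would likely need to be slightly below 512 to leave room for special tokens.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[str], max_tokens: int, overlap: int = 0
) -> List[List[str]]:
    """Split a token sequence into windows of at most max_tokens tokens.

    Consecutive windows share `overlap` tokens so that text cut at a
    boundary still appears with some context in the next chunk.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + max_tokens]))
        # Stop once a window has reached the end of the sequence.
        if start + max_tokens >= len(tokens):
            break
    return chunks


# Stand-in tokenizer (assumption): whitespace split instead of the model tokenizer.
text = "one two three four five six seven"
tokens = text.split()
chunks = chunk_tokens(tokens, max_tokens=3, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk can then be encoded separately, with the chunk embeddings stored individually or averaged per context, depending on how retrieval should behave.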

In [20]:
# Note: .str.len() counts characters, not tokens
all_contexts["sentence_length"] = all_contexts['context.contexts'].str.len()
all_contexts.sort_values(by="sentence_length", ascending=False)

Unnamed: 0,context.contexts,sentence_length
30233,"From 1995-1999, local multidisciplinary child ...",4772
282,Thirty-two dogs underwent 1 hour of sustained ...,3207
30679,In vitro ischemia lasting 10 minutes produced ...,2668
6600,We studied 98 consecutive patients who were ad...,2668
18718,The patients' Acute Physiology and Chronic Hea...,2578
...,...,...
26964,8,1
34596,•,1
58536,♦,1
26964,7,1


In [24]:
context_dataset.shape

(201201, 2)

In [30]:
sentence_embeddings = model.encode(
    context_dataset['context.contexts'], convert_to_tensor=True, show_progress_bar=True, batch_size=128)

Batches:   0%|          | 1/1572 [01:34<41:04:00, 94.11s/it]

In [28]:
sentence_embeddings.shape

torch.Size([5, 1024])

This code needs to be run on a GPU; we need to find a way to connect to a Colab GPU from the local machine.

This will be fun for another day.