In [None]:
%pip install sentence_transformers polars

# Part 1 (Sean Davis)
## The problem

We have a set of terms (sometimes multi-word), each of which represents a meaning. We want to map those terms to terms from an ontology or controlled vocabulary, and we want to do so by meaning rather than by text matching alone. "Encodings" (also called embeddings) transform words or sentences into vectors of numbers such that points near each other in N-dimensional space have similar meanings. We use that idea here to map

uncurated term ---> curated term

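We do this by computing encodings for both sets of terms and then matching each uncurated term to the curated terms with the most similar encodings: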

curated term ---> encodings

uncurated term ---> encodings
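
The mapping idea can be sketched with hand-made vectors standing in for real encodings. This is only an illustration: the three-dimensional vectors, the `curated` dictionary, and the helper names `cosine` and `nearest_curated` are invented here and are not part of the notebook or of any library.

```python
import numpy as np

# Toy 3-dimensional "encodings"; the numbers are invented for illustration.
curated = {
    "Lung": np.array([0.9, 0.1, 0.0]),
    "Brain": np.array([0.0, 0.2, 0.9]),
    "Ovary": np.array([0.1, 0.9, 0.1]),
}

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_curated(query_vec, curated_vecs):
    """Return the (term, score) pair whose vector is most similar to the query."""
    scored = [(term, cosine(query_vec, vec)) for term, vec in curated_vecs.items()]
    return max(scored, key=lambda pair: pair[1])

# Pretend this is the encoding of an uncurated term such as "LEFT LUNG".
uncurated_vec = np.array([0.8, 0.2, 0.1])
best_term, best_score = nearest_curated(uncurated_vec, curated)
print(best_term, round(best_score, 4))
```

With real encodings the vectors have hundreds of dimensions, but the lookup step is the same: score the query against every curated vector and keep the best match.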


## Sentence transformers

Sentence transformers are a type of natural language processing (NLP) model designed for transforming sentences or text snippets into fixed-dimensional vectors, usually with the goal of capturing semantic similarity. These models use deep learning, typically a Transformer encoder trained in a Siamese (twin-network) setup.

The primary objective of sentence transformers is to generate embeddings or representations of sentences in a way that the distance or similarity between these embeddings reflects the semantic meaning of the corresponding sentences. This makes them useful for various NLP tasks such as sentence similarity, clustering, and information retrieval.

Commonly used architectures for sentence transformers include BERT (Bidirectional Encoder Representations from Transformers), RoBERTa (Robustly optimized BERT approach), and DistilBERT, among others. Pre-trained transformer models can be fine-tuned on specific tasks or datasets to create sentence embeddings tailored to a particular application.

Sentence embeddings obtained from these models can be useful in a variety of applications, including semantic search, document retrieval, and sentiment analysis, where understanding the underlying semantic relationships between sentences is crucial.
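
To make "fixed-dimensional vector" concrete, here is a minimal sketch of mean pooling, a common way of collapsing a variable number of token vectors into one sentence vector. Whether the models used in this notebook pool in exactly this way is an assumption; the function name `mean_pool` and the random token vectors are invented for illustration.

```python
import numpy as np

def mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    mask = attention_mask[:, :, None].astype(token_vectors.dtype)
    summed = (token_vectors * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Two "sentences" padded to 4 tokens, each token a 3-dimensional vector.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 4, 3))
mask = np.array([[1, 1, 1, 0],   # 3 real tokens, 1 padding
                 [1, 1, 0, 0]])  # 2 real tokens, 2 padding
sentence_vectors = mean_pool(tokens, mask)
print(sentence_vectors.shape)  # (2, 3)
```

Both sentences end up with a vector of the same length regardless of how many tokens they contain, which is what allows them to be compared directly.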







In [2]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('pritamdeka/S-PubMedBert-MS-MARCO')
embeddings = model.encode(sentences)
print(embeddings)

[[-0.5002643  -0.50500387 -0.47107634 ...  0.1224565   0.05091011
   0.47183713]
 [-0.43048763 -0.16165991 -0.46010908 ...  0.10756782  0.16894086
   0.5752676 ]]


In [5]:
import polars as pl

In [5]:
# You may need to adjust the URL to be the correct "raw" URL since this
# uses a token. Your token may be different.
df = pl.read_csv('https://github.com/cBioPortal/GSoC/files/14504294/curated_bodysite.csv')
df.filter(pl.col('curated_bodysite') != "NA")

| curation_id | original_bodysite | curated_bodysite | curated_bodysite_ontology_term_id | curated_bodysite_source |
| --- | --- | --- | --- | --- |
| str | str | str | str | str |
| "acyc_fmi_2014:… | "SALIVARY GLAND… | "Salivary Gland… | "NCIT:C12426" | "TUMOR_TISSUE_S… |
| "acyc_fmi_2014:… | "LUNG" | "Lung" | "NCIT:C12468" | "TUMOR_TISSUE_S… |
| "acyc_fmi_2014:… | "LUNG" | "Lung" | "NCIT:C12468" | "TUMOR_TISSUE_S… |
| "acyc_fmi_2014:… | "BUCCAL MUCOSA" | "Buccal Mucosa" | "NCIT:C12505" | "TUMOR_TISSUE_S… |
| "acyc_fmi_2014:… | "TUMOR EXTENTIO… | "Bone" | "NCIT:C12366" | "TUMOR_TISSUE_S… |
| … | … | … | … | … |
| "luad_tcga_pan_… | "LUNG" | "Lung" | "NCIT:C12468" | "TUMOR_TISSUE_S… |
| "luad_tcga_pan_… | "LUNG" | "Lung" | "NCIT:C12468" | "TUMOR_TISSUE_S… |
| "luad_tcga_pan_… | "LUNG" | "Lung" | "NCIT:C12468" | "TUMOR_TISSUE_S… |
| "luad_tcga_pan_… | "LUNG" | "Lung" | "NCIT:C12468" | "TUMOR_TISSUE_S… |


Here, we simply filter out the rows where there is no curated value.

In [6]:
filt_df = df.filter(df['curated_bodysite']!="NA").select(['original_bodysite', 'curated_bodysite']).unique()

In the code below, we are creating a set of "original" (uncurated) values and a set of "curated" values.

In [7]:
orig = []
for x in filt_df['original_bodysite'].to_list():
    orig.extend(x.split('<;>'))
orig = list(set(orig))
cura = []
for x in filt_df['curated_bodysite'].to_list():
    cura.extend(x.split('<;>'))
cura = list(set(cura))
print("first 'uncurated' results: ", orig[:10])
print("first 'curated' results: ", cura[:10])

first 'uncurated' results:  ['BRAIN STEM- PONS,BRAIN STEM-MEDULLA,CEREBELLUM/POSTERIOR FOSSA,SPINAL CORD- CERVICAL', 'DURA', 'OVARY', 'DIFFUSE', 'HEAD AND NECKREGIONAL LYMPH NODE', 'CNS', 'SMALL BOWEL MELANOMA', 'ABDOMEN/DIAPHRAGM', 'MEDIASTINAL', 'FRONTAL, PARIETAL AND TEMPORAL LOBE']
first 'curated' results:  ['Pelvic Mass', 'Esophagus', 'Muscle', 'Right Lobe of the Liver', 'Axillary Lymph Node', 'Psoas Muscle', 'Greater Curvature of the Stomach', 'Urethra', 'Above', 'Perineum']
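
One caveat with `list(set(...))` is that the order of the resulting list is not guaranteed to be the same from run to run, so `orig[:10]` may show different terms each time. A minimal order-preserving alternative is sketched below; the helper name `split_unique` and the sample values are invented for illustration, and only the `<;>` separator comes from the data above.

```python
def split_unique(values, sep="<;>"):
    """Split each value on `sep` and return unique parts in first-seen order."""
    parts = []
    for value in values:
        parts.extend(value.split(sep))
    # dict.fromkeys keeps insertion order, so results are reproducible across runs.
    return list(dict.fromkeys(parts))

# Invented sample values in the same "<;>"-delimited style as the data.
sample = ["LUNG<;>LIVER", "LUNG", "BONE<;>LUNG"]
print(split_unique(sample))  # ['LUNG', 'LIVER', 'BONE']
```

In the notebook this could be called as `split_unique(filt_df['original_bodysite'].to_list())`, assuming the column contains only strings.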


Our task is to use the "curated" set, which corresponds to our controlled vocabulary or ontology terms, as a dictionary of sorts. We want to provide an index to that dictionary that allows us to look up words by **their meaning**. The embeddings are the "meaning", and we can then use those as our index.

In [8]:
# embed the curated results (which would, more generally, be the set of ontology terms of interest)
cura_embed = model.encode(cura)

In [10]:
# For the first 10 "uncurated" or "original" terms
# 1. Embed term
# 2. Find top 5 most similar vectors from curated term embeddings
# 3. Report back the curated terms and scores

from sentence_transformers import util
import torch

top_k = 5

queries = orig[0:10]
corpus = cura
corpus_embeddings = cura_embed

for query in queries:
    # embed each uncurated term
    query_embedding = model.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    # Search in the corpus embeddings (the ontology terms that we embedded above)
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n======================\n")
    print("Query:", query)
    print("Top 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print("   ", corpus[idx], "(Score: {:.4f})".format(score))

    """
    # Alternatively, we can also use util.semantic_search to perform cosine similarity + topk
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)
    hits = hits[0]      #Get the hits for the first query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
    """



Query: BRAIN STEM- PONS,BRAIN STEM-MEDULLA,CEREBELLUM/POSTERIOR FOSSA,SPINAL CORD- CERVICAL
Top 5 most similar sentences in corpus:
    Brain Stem (Score: 0.9223)
    Cervical Spinal Cord (Score: 0.9196)
    Brain (Score: 0.9081)
    Central Nervous System (Score: 0.9038)
    Cerebellum (Score: 0.9026)


Query: DURA
Top 5 most similar sentences in corpus:
    Dura Mater (Score: 0.9757)
    Spinal Cord Dura Mater (Score: 0.9354)
    Skull (Score: 0.9081)
    Skin (Score: 0.8967)
    Meninges (Score: 0.8960)


Query: OVARY
Top 5 most similar sentences in corpus:
    Ovary (Score: 1.0000)
    Testis (Score: 0.9143)
    Spleen (Score: 0.8985)
    Uterus (Score: 0.8975)
    Pancreas (Score: 0.8941)


Query: DIFFUSE
Top 5 most similar sentences in corpus:
    Diffuse (Score: 1.0000)
    Distant (Score: 0.8687)
    Bilateral (Score: 0.8517)
    Alveolar (Score: 0.8482)
    Multifocal (Score: 0.8479)


Query: HEAD AND NECKREGIONAL LYMPH NODE
Top 5 most similar sentences in corpus:
    Head a

In summary: the `sentence-transformers` [util](https://www.sbert.net/docs/package_reference/util.html) module has functions for cosine similarity, paraphrase mining, and semantic search.
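
To show what "cosine similarity + top-k" amounts to, here is a small NumPy sketch that returns hits in the same shape used in the loop above (a list of dictionaries with `corpus_id` and `score` for each query). It is a re-implementation for illustration, not the library's code; the function name `semantic_search_sketch` and the toy vectors are assumptions.

```python
import numpy as np

def semantic_search_sketch(query_vecs, corpus_vecs, top_k=5):
    """Cosine similarity + top-k for each query; returns lists of hit dicts."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = q @ c.T  # shape: (n_queries, n_corpus)
    k = min(top_k, corpus_vecs.shape[0])
    results = []
    for row in scores:
        best = np.argsort(-row)[:k]  # indices of the k highest scores
        results.append([{"corpus_id": int(i), "score": float(row[i])} for i in best])
    return results

# Toy 2-dimensional vectors, invented for illustration.
corpus_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_vecs = np.array([[2.0, 0.1]])
hits = semantic_search_sketch(query_vecs, corpus_vecs, top_k=2)
print([hit["corpus_id"] for hit in hits[0]])  # [0, 2]
```

Normalizing both sides first means a single matrix product gives every query-to-corpus cosine score at once, which is why all queries can be scored in one call rather than one at a time.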

# Part 2 (Tyrone Lee)
We will attempt the same kind of query on a QA sample of 10 Bioconductor support threads. The embedding model differs from the one above because the source material has changed: instead of medical ontology terms, we are ingesting the top 10 questions and answers from the [Bioconductor Forum](https://support.bioconductor.org/). We considered [codebert](https://huggingface.co/mchochlov/codebert-base-cd-ft), but it probably makes more sense to use the [multi-QA pretrained](https://www.sbert.net/docs/pretrained_models.html#multi-qa-models) encoders.

In [3]:
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
embeddings = model.encode(sentences)
print(embeddings)

[[ 2.32228190e-02  1.49951935e-01  1.80716366e-02  6.45313561e-02
   4.18474153e-03  5.07423542e-02  7.09168389e-02 -4.55774218e-02
   5.10215089e-02 -2.57678218e-02  9.00099054e-02 -2.18730960e-02
   8.43797326e-02 -6.58102110e-02 -2.66505815e-02 -8.88935402e-02
   2.96451505e-02  2.01866925e-02 -3.38864475e-02 -1.30975619e-03
   1.87437152e-04  3.30534726e-02  1.27255137e-03  3.21808569e-02
  -3.58300023e-02 -2.87955329e-02  2.54739225e-02  5.04682250e-02
   6.56116828e-02 -3.47291455e-02 -7.74945542e-02  1.40688559e-02
  -2.01698691e-02  5.18850461e-02 -1.04643069e-02  5.99336214e-02
   8.09594318e-02  9.55686420e-02 -4.56609093e-02  2.75027826e-02
   9.46429092e-03 -8.58922303e-02  2.92307157e-02  3.21322978e-02
   2.20695604e-02 -5.48757724e-02  4.88309227e-02 -3.58557813e-02
  -1.37343332e-02 -1.08075269e-01 -6.05810471e-02 -1.38984695e-02
  -8.91551301e-02  4.36695106e-02  1.58954635e-02  5.08062951e-02
   1.40990159e-02  3.98281366e-02  6.52970513e-03 -4.16178852e-02
   1.18416

Load the CSV with questions and responses.

In [6]:
df = pl.read_csv('bioc_qa.csv')
df

| AID | QID | Question | Response |
| --- | --- | --- | --- |
| str | str | str | str |
| "answer1" | "question1" | "I am a bit con… | "The thing to u… |
| "answer2" | "question2" | "I am working o… | "Just to be cle… |
| "answer3" | "question3" | "I am new in th… | "There is no go… |
| "answer4" | "question4" | "I am testing s… | "To answer your… |
| "answer5" | "question5" | "In all RNA-seq… | "The most compl… |
| "answer6" | "question6" | "I know findOve… | "From the discu… |
| "answer7" | "question7" | "I have just do… | "I wrote two he… |
| "answer8" | "question8" | "How can I filt… | "If you want to… |
| "answer9" | "question9" | "I am analysing… | "You can use th… |
| "answer10" | "question10" | "How do I merge… | "Merge is a pre… |


In [7]:
questions = df['Question'].to_list()
answers = df['Response'].to_list()
qID = df['QID'].to_list()
aID = df['AID'].to_list()
print("first question: ", qID[:1], questions[:1])
print("first response: ", aID[:1], answers[:1])

first question:  ['question1'] ['I am a bit confused about the concepts of the 3 things: FDR, FDR adjusted p-value and q-value, which I initially thought I was clear about. Are FDR adjusted p-value the same as q-value? (my understanding is that FDR adjusted p-value = original p-value * number of genes/rank of the gene, is that right?) When people say xxx genes are differentially expressed with an FDR cutoff of 0.05, does that mean xxx genes have an FDR adjusted p-value smaller than 0.05?']
first response:  ['answer1'] ['The thing to understand is that terms like FDR and q-value were defined in specific ways by their original inventors but are used in more generic ways by later researchers who adapt, modify or use the ideas.The term "false discovery rate (FDR)" was created by Benjamini and Hochberg in their 1995 paper. They gave a particular definition of what they meant by FDR.  Their procedure accepted or rejected hypotheses, but did not produce adjusted p-values.Benjamini and Yekutie

Embed ONLY the answers.

In [8]:
# embed the curated responses ('answers' in this case)
ans_embed = model.encode(answers)

Same demo as above, only now the queries (the questions) are not themselves in the corpus (the answers). The demo returns the corpus texts most similar to each query.

In [9]:
# For the first 10 questions
# 1. Embed the question
# 2. Find the top 5 most similar vectors among the answer embeddings
# 3. Report back the start of each matching answer and its score

from sentence_transformers import util
import torch

top_k = 5

queries = questions[0:10]
corpus = answers
corpus_embeddings = ans_embed

for query in queries:
    # embed each question
    query_embedding = model.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    # Search in the corpus embeddings (the answers that we embedded above)
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n======================\n")
    print("Query:", query)
    print("Top 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print("   ", corpus[idx][:25], "(Score: {:.4f})".format(score))

    """
    # Alternatively, we can also use util.semantic_search to perform cosine similarity + topk
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)
    hits = hits[0]      #Get the hits for the first query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
    """



Query: I am a bit confused about the concepts of the 3 things: FDR, FDR adjusted p-value and q-value, which I initially thought I was clear about. Are FDR adjusted p-value the same as q-value? (my understanding is that FDR adjusted p-value = original p-value * number of genes/rank of the gene, is that right?) When people say xxx genes are differentially expressed with an FDR cutoff of 0.05, does that mean xxx genes have an FDR adjusted p-value smaller than 0.05?
Top 5 most similar sentences in corpus:
    The thing to understand i (Score: 0.7044)
    If you want to filter out (Score: 0.2630)
    The most complete explana (Score: 0.2588)
    You can use the ensembldb (Score: 0.2270)
    To answer your questions: (Score: 0.2042)


Query: I am working on RNA-Seq data. I'm using DESeq2 for my analysis. I have 20 samples from 3 batches. I am testing for 2 conditions, cond1 and cond2.dds <- DESeqDataSetFromMatrix(countData = countTable3, colData = coldata, design = ~cond1 * cond2). When i 

This demo prints the row index of each answer in front of its text, so we can check at a glance whether the highest-scoring answer comes from the same row as the question.

In [10]:
# Same search as above, but using util.semantic_search to perform cosine similarity + topk
top_k = 5

queries = questions[0:10]
corpus = answers
corpus_embeddings = ans_embed

# enumerate gives each query its row index, which should match the top hit's corpus_id
for queryID, query in enumerate(queries):
    query_embedding = model.encode(query, convert_to_tensor=True)
    print("\n======================\n")
    print("Query:", query)
    print("ID:", queryID)
    print("Top 5 most similar sentences in corpus:")

    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]      # Get the hits for the first (and only) query
    for hit in hits:
        print(hit['corpus_id'], corpus[hit['corpus_id']][:50], "(Score: {:.4f})".format(hit['score']))



Query: I am a bit confused about the concepts of the 3 things: FDR, FDR adjusted p-value and q-value, which I initially thought I was clear about. Are FDR adjusted p-value the same as q-value? (my understanding is that FDR adjusted p-value = original p-value * number of genes/rank of the gene, is that right?) When people say xxx genes are differentially expressed with an FDR cutoff of 0.05, does that mean xxx genes have an FDR adjusted p-value smaller than 0.05?
ID: 0
Top 5 most similar sentences in corpus:
0 The thing to understand is that terms like FDR and (Score: 0.7044)
7 If you want to filter out genes with low expressio (Score: 0.2630)
4 The most complete explanation of what the dispersi (Score: 0.2588)
8 You can use the ensembldb package to do the mappin (Score: 0.2270)
3 To answer your questions:1) scaledTPM is TPM's sca (Score: 0.2042)


Query: I am working on RNA-Seq data. I'm using DESeq2 for my analysis. I have 20 samples from 3 batches. I am testing for 2 conditions, co
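
Rather than checking row indexes by eye, the same comparison can be reduced to a single number. The sketch below computes top-1 accuracy from a question-by-answer score matrix, assuming that question `i` belongs with answer `i` (as in the rows of `bioc_qa.csv`). The function name `top1_accuracy` and the score values are invented for illustration.

```python
import numpy as np

def top1_accuracy(similarity):
    """Fraction of rows whose highest-scoring column is the diagonal entry."""
    predicted = similarity.argmax(axis=1)
    expected = np.arange(similarity.shape[0])
    return float((predicted == expected).mean())

# Invented 3x3 score matrix: rows are questions, columns are answers.
scores = np.array([
    [0.70, 0.26, 0.20],
    [0.15, 0.55, 0.30],
    [0.40, 0.35, 0.25],  # question 2 is matched to answer 0, a miss
])
print(top1_accuracy(scores))
```

In the notebook, a matrix like `scores` could be obtained by embedding all questions at once and passing them with `ans_embed` to `util.cos_sim`, then converting the result to a NumPy array.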

As demonstrated, cosine similarity appears to be a "good enough" metric for judging whether an embedded document matches a given query. We are now trying to use it to compare answers generated by an LLM.