<a href="https://colab.research.google.com/github/alisonmitchell/Biomedical-Knowledge-Graph/blob/main/02_Exploratory_Data_Analysis/Text_Embeddings.ipynb"
   target="_parent">
   <img src="https://colab.research.google.com/assets/colab-badge.svg"
      alt="Open in Colab">
</a>

# Representation Learning - Text Embeddings

## 1. Introduction

Text embeddings, like word embeddings, are dense vectors that capture semantic meaning. Instead of individual words, however, they encode sentences and documents, which makes it possible to compare larger bodies of text. They are the most common kind of embedding stored in vector databases.

The best text embedding models are built using transformers, which leverage the self-attention mechanism introduced in Vaswani et al.'s ["Attention is All You Need"](https://arxiv.org/abs/1706.03762) paper (2017). This marked a turning point, and the transformer became the state-of-the-art (SOTA) architecture in NLP. The self-attention mechanism captures every word's interactions with every other word in a sequence, meaning that word (token) embeddings are no longer static but dynamic and context-aware. Positional embeddings convey word order by adding information about a word's relative or absolute position within the sequence to its word embedding, allowing the transformer to take word order into account when modelling the relationships between words.
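
To make the idea of "every word interacting with every other word" concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: the queries, keys and values are all the raw token vectors (a real transformer applies learned projections and multiple heads), positional embeddings are omitted, and the two-dimensional token vectors are made up for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned projections).

    X is a list of token vectors (one list of floats per token).
    Returns the attention weights and the contextualised token vectors.
    """
    d = len(X[0])
    # Score every token against every other token, scaled by sqrt(d)
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in X] for qi in X]
    # Turn each row of scores into weights that sum to 1
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of all token vectors
    outputs = [[sum(w * v[c] for w, v in zip(row, X)) for c in range(d)] for row in weights]
    return weights, outputs

# Hypothetical 2-dimensional static embeddings for a three-token sequence
tokens = ["river", "bank", "loan"]
X = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

weights, contextual = self_attention(X)
for tok, row in zip(tokens, weights):
    print(tok, [round(w, 3) for w in row])
```

Each output vector mixes in information from the other tokens, which is why the same word can end up with different embeddings in different contexts.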

## 2. Install/import libraries

In [None]:
!pip install embedding-explorer embetter[text]

In [None]:
import pandas as pd
import json
import pickle
import warnings
warnings.filterwarnings("ignore")

from sklearn.feature_extraction.text import CountVectorizer
from embetter.text import SentenceEncoder
from embedding_explorer import show_network_explorer
from embedding_explorer import show_clustering

## 3. Sentence Transformers

Transformers allow the same 'core' model to be reused for different use cases by swapping and fine-tuning the last few layers without retraining the core model. This led to the rise of generalist pretrained language models (PLMs) leveraging the idea of transfer learning, and supervised fine-tuning on domain/task-specific data. One of the first and most popular PLMs was [Google's BERT](https://arxiv.org/abs/1810.04805) (Bidirectional Encoder Representations from Transformers).

Initially, getting useful sentence embeddings from BERT was problematic and relied on one of two approaches: averaging values across all token embeddings output by BERT, or using the output of the first `[CLS]` token embedding that is trained to encode the entire input text and is used in classification tasks. However, the resulting sentence embeddings often proved to be worse than averaged GloVe embeddings.
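
The two pooling approaches can be sketched in a few lines of plain Python. This is an illustration only: the token vectors are hypothetical three-dimensional values rather than real BERT outputs, and `mean_pool` and `cls_pool` are names introduced here, not functions from any library.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions (mask value 0)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[c] for vec in kept) / len(kept) for c in range(dim)]

def cls_pool(token_embeddings):
    """Use the vector at position 0, where BERT places the [CLS] token."""
    return list(token_embeddings[0])

# Hypothetical 3-dimensional outputs for "[CLS] drug repurposing [SEP] [PAD]"
token_embeddings = [
    [0.2, 0.1, 0.0],   # [CLS]
    [0.6, 0.3, 0.9],   # drug
    [0.4, 0.5, 0.3],   # repurposing
    [0.0, 0.3, 0.0],   # [SEP]
    [9.9, 9.9, 9.9],   # [PAD] (masked out, so it must not influence the mean)
]
attention_mask = [1, 1, 1, 1, 0]

mean_vec = mean_pool(token_embeddings, attention_mask)
cls_vec = cls_pool(token_embeddings)
print("mean pooling:", [round(v, 3) for v in mean_vec])
print("CLS pooling: ", cls_vec)
```

Either way, a variable-length sequence of token vectors is reduced to a single fixed-length vector for the whole text.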

The breakthrough came in 2019 with the introduction of [Sentence-BERT (SBERT)](https://arxiv.org/abs/1908.10084) and the [`sentence-transformers`](https://www.sbert.net/) library. SBERT was the first transformer built to create a single vector embedding for sentences or paragraphs, and outperformed the previous SOTA models for semantic textual similarity tasks. Sentence transformers became the industry standard for embedding text with many more models having been built since the original SBERT.






## 4. Exploring Corpora with Dynamic Embedding Models

We will use the [embedding-explorer](https://centre-for-humanities-computing.github.io/embedding-explorer/) "network explorer" app to explore semantic networks in our corpus with a pretrained language model (PLM) from the sentence-transformers library.


### 4.1 Networks of N-grams with Sentence Transformers

Here we will look at n-grams in our corpus, specifically four-grams, and use an embedding model to learn context-aware representations of the four-grams.

In [None]:
# Load the dataset
with open('2024-03-02_pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test.pickle', 'rb') as f:
    pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test = pickle.load(f)

In [None]:
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   article_id     20 non-null     object
 1   published      20 non-null     object
 2   revised        20 non-null     object
 3   title          20 non-null     object
 4   title_cleaned  20 non-null     object
 5   journal        20 non-null     object
 6   authors        20 non-null     object
 7   doi            20 non-null     object
 8   pdf_url        20 non-null     object
 9   text           20 non-null     object
 10  text_cleaned   20 non-null     object
 11  word_count     20 non-null     int64 
 12  sent_count     20 non-null     int64 
dtypes: int64(2), object(11)
memory usage: 2.2+ KB


In [None]:
# Define training data
corpus = pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test.text_cleaned

We will use `CountVectorizer` to extract the 4000 most frequent four-grams.

In [None]:
# Create a CountVectorizer for four-grams
feature_extractor = CountVectorizer(ngram_range=(4, 4), max_features=4000)
X = feature_extractor.fit_transform(corpus)

# Get the vectoriser's vocabulary (four-grams)
four_grams = feature_extractor.get_feature_names_out()

In [None]:
four_grams

array(['00 97 98 nm', '12 dasabuvir nsp 12', '12 nsp 12 dasabuvir', ...,
       'years billion investment de', 'yellow green m36 floral',
       'youden statistic determine optimal'], dtype=object)

In [None]:
len(four_grams)

4000
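
As a rough illustration of what the four-gram extraction is doing, the sketch below counts word n-grams with the standard library and keeps the most frequent ones. It is not `CountVectorizer` itself: it assumes plain whitespace tokenisation, whereas `CountVectorizer` by default lowercases the text and keeps only tokens of two or more alphanumeric characters, so results on real text will differ. The helper names and the toy corpus are made up for this example.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return all contiguous word n-grams in a text (whitespace tokenisation)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(corpus, n, max_features):
    """Count n-grams across all documents and keep the most frequent ones."""
    counts = Counter()
    for doc in corpus:
        counts.update(word_ngrams(doc, n))
    most_common = [gram for gram, _ in counts.most_common(max_features)]
    # Sort alphabetically, as in the feature names printed above
    return sorted(most_common)

# Hypothetical two-document corpus
toy_corpus = [
    "rna dependent rna polymerase inhibitors",
    "rna dependent rna polymerase activity",
]

top = top_ngrams(toy_corpus, n=4, max_features=1)
print(top)
```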

We will then use the [embetter](https://github.com/koaning/embetter) package, which implements scikit-learn compatible embeddings, to load the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) pretrained sentence transformers model, a BERT variant. It is one of the smallest pretrained models but is stable, fast, and still offers good-quality embeddings.

In [None]:
!pip install embetter[text]

In [None]:
# load the pretrained embedding model
from embetter.text import SentenceEncoder

encoder = SentenceEncoder("all-MiniLM-L6-v2")

In [None]:
# load the model and the n-grams into embedding-explorer and launch network explorer app
show_network_explorer(four_grams, vectorizer=encoder)

We can specify seeds of arbitrary length, including whole sentences, instead of just the four-grams in our corpus. The resulting network is still meaningful because the encoder embeds the seeds and the four-grams in the same vector space. The screenshots below show two pairs of seed sentences and the semantic networks arising from the four-grams around them.

'Cambridge scientists have identified 200 approved drugs predicted to work against COVID-19' and 'Remdesivir is effective against many RNA viruses'

![network explorer four-grams example 1](images/sent_trf_four_grams.png)

The first pair of sentences is connected by the phrases RNA-dependent RNA polymerase (RdRp) inhibitors and COVID-19 patients remdesivir, which appear in the middle, between the two sentence clusters.

'Remdesivir is effective against many RNA viruses' again, and 'The pandemic led to remarkable efforts to quickly develop new therapeutics'

![network explorer four-grams example 2](images/sent_trf_four_grams_v2.png)

RNA-dependent RNA polymerase (RdRp) inhibitors and COVID-19 patients remdesivir appear in the middle again, but there is an additional node for 'outbreak detect treat avoid'. It is colour-coded for 'Remdesivir is effective against many RNA viruses' but located next to 'outbreak great efforts therapeutic', a first-level association of 'The pandemic led to remarkable efforts to quickly develop new therapeutics'.
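
One way to think about how associations around a seed arise is as nearest neighbours by cosine similarity in the embedding space. The sketch below illustrates that idea only. It is an assumption about the general approach, not embedding-explorer's actual implementation, and the three-dimensional vectors are invented rather than produced by the encoder.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbours(seed_vec, vocab_vecs, k):
    """Return the k vocabulary terms most similar to the seed, best first."""
    scored = [(term, cosine_similarity(seed_vec, vec)) for term, vec in vocab_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical 3-dimensional embeddings for three four-grams and one seed sentence
vocab_vecs = {
    "covid 19 patients remdesivir": [0.9, 0.1, 0.2],
    "rna dependent rna polymerase": [0.6, 0.6, 0.1],
    "wos core collection database": [0.0, 0.1, 0.9],
}
seed_vec = [1.0, 0.2, 0.1]

neighbours = nearest_neighbours(seed_vec, vocab_vecs, k=2)
for term, score in neighbours:
    print(f"{term}: {score:.3f}")
```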

### 4.2 Investigating Corpus-Level Semantic Structure with Document Embeddings

We can also investigate semantic representations at the document level. One approach to this is topic modelling which we will address separately. Here we will continue using the four-grams and sentence transformers with the same embedding model, but associate them with the indices and titles of the articles we extracted them from.

In [None]:
metadata_list = []

# Convert the sparse count matrix to a dense array once, instead of on every check
X_dense = X.toarray()
titles = pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test['title_cleaned']

# Loop through each four-gram and summarise the documents that contain it
for i, four_gram in enumerate(four_grams):
    doc_indices = []
    doc_titles = []

    # Loop through each document
    for index in range(len(pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test)):
        if X_dense[index, i] != 0:  # Check for presence of the four-gram in the doc
            doc_indices.append(index)
            doc_titles.append(titles.iloc[index])

    # Combine document information for each four-gram
    metadata_list.append({
        'four_gram': four_gram,
        'index': ', '.join(map(str, doc_indices)),
        'title': ', '.join(doc_titles)
    })

# Convert the list to a DataFrame
metadata_df = pd.DataFrame(metadata_list)

In [None]:
metadata_df

| | four_gram | index | title |
|---|---|---|---|
| 0 | 00 97 98 nm | 8 | Structural Homology-Based Drug Repurposing App... |
| 1 | 12 dasabuvir nsp 12 | 8 | Structural Homology-Based Drug Repurposing App... |
| 2 | 12 nsp 12 dasabuvir | 8 | Structural Homology-Based Drug Repurposing App... |
| 3 | 12 ribavirin nsp 12 | 8 | Structural Homology-Based Drug Repurposing App... |
| 4 | 12 sars cov ns5b | 8 | Structural Homology-Based Drug Repurposing App... |
| ... | ... | ... | ... |
| 3995 | world health organization january | 5, 7 | Novel Drug Design for Treatment of COVID-19: A... |
| 3996 | wos core collection database | 0 | Drug repositioning: A bibliometric analysis. |
| 3997 | years billion investment de | 7 | A comprehensive review of artificial intellige... |
| 3998 | yellow green m36 floral | 4 | Drug Repurposing Using Gene Co-Expression and ... |


In [None]:
len(four_grams)

4000

In [None]:
with open('2024-09-29_four_gram_metadata_df.pickle', 'wb') as f:
  pickle.dump(metadata_df, f)

The DataFrame has a row for each four-gram with some appearing in multiple documents as indicated in the index and title columns.

### 4.3 Projection and clustering

We can also use [embedding-explorer's "show clustering" app](https://centre-for-humanities-computing.github.io/embedding-explorer/projection_clustering.html) for projecting whole embedding spaces into two dimensions and investigating the natural clusters that arise in the data. Various parameters can be selected such as dimensionality reduction method and number of dimensions to reduce embeddings to, clustering method and number of clusters to find, and projection method to 2D space.
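
To give a feel for the clustering step, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python on toy two-dimensional points. It is illustrative only and not the app's implementation: the starting centroids are supplied by hand for reproducibility, and the points are invented rather than reduced four-gram embeddings.

```python
import math

def kmeans(points, centroids, n_iter=10):
    """Plain k-means (Lloyd's algorithm) from given starting centroids."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid unchanged
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D points forming two well-separated groups
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
start = [[0, 0], [10, 10]]

labels, centroids = kmeans(points, start)
print(labels)
print(centroids)
```

In the app, a step like this runs on the embeddings after dimensionality reduction, and the resulting cluster labels are then shown on the separate 2D projection.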

In [None]:
from embedding_explorer import show_clustering

In [None]:
show_clustering(
    four_grams,  # The four-grams as the input for clustering
    vectorizer=encoder,  # The sentence transformer encoder
    metadata=metadata_df,  # Summarised metadata DataFrame with four-gram, indices, and titles
    hover_data=['four_gram', 'index', 'title']  # Tooltip data to display on hover
)

#### 4.3.1  SVD - K-Means - SVD

Screenshot of SVD for dimensionality reduction to 10 dimensions, K-means clustering for 10 clusters, and SVD for 2D projection.

![SVD - K-Means - SVD](images/SVD_Kmeans_SVD.png)

#### 4.3.2 SVD - K-Means - TSNE

Screenshot of SVD for dimensionality reduction to 10 dimensions, K-means clustering for 10 clusters, and t-SNE for 2D projection.

![SVD - K-Means - TSNE](images/SVD_Kmeans_TSNE_tooltip.png)

#### 4.3.3 UMAP - K-Means - UMAP

Screenshot of UMAP for dimensionality reduction to 10 dimensions, K-means clustering for 10 clusters, and UMAP for 2D projection.

![UMAP - K-Means - UMAP](images/UMAP_Kmeans_UMAP_tooltip.png)

#### 4.3.4 UMAP - HDBSCAN - TSNE

Screenshot of UMAP for dimensionality reduction to 10 dimensions, HDBSCAN (no maximum cluster size), and t-SNE for 2D projection.

![UMAP - HDBSCAN - TSNE](images/UMAP_HDBSCAN_TSNE_tooltip.png)

And another showing a data point associated with two articles:


![UMAP - HDBSCAN - TSNE two articles](images/UMAP_HDBSCAN_TSNE_tooltip_2.png)

## 5. Choosing an embedding model

Encoder-only models like BERT would seem the logical choice for text embeddings, given their bidirectional understanding of context when transforming raw input text into contextualised dense vector representations. However, due to the rich representations learned during training, GPT-style decoder models can also produce text embeddings, despite their primary function being generation rather than encoding.

OpenAI's `text-embedding-ada-002`, released in December 2022, is now outperformed by `text-embedding-3-small` and `text-embedding-3-large`, which the [embeddings guide](https://platform.openai.com/docs/guides/embeddings) describes as the newest and most performant embedding models.

We will try an example sentence with `text-embedding-3-small`.

In [None]:
!pip install openai

In [None]:
import openai

from openai import OpenAI

In [None]:
with open("api_keys.json") as f:
    data = json.load(f)

In [None]:
openai.api_key = data['keys']["OPENAI_API_KEY"]

In [None]:
client = OpenAI(api_key=openai.api_key)

def get_embedding(text, model="text-embedding-3-small"):
    # Request an embedding for a single text and return it as a list of floats
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding

In [None]:
embedding = get_embedding("Cambridge scientists have identified 200 approved drugs predicted to work against COVID-19.")

In [None]:
with open('2024-09-30_openai_text_embedding_3-sm.pickle', 'wb') as f:
  pickle.dump(embedding, f)

In [None]:
print(embedding)

[-0.007050425745546818, -0.047010645270347595, 0.01667741872370243, 0.027381885796785355, -0.04302867874503136, -0.008783753030002117, -0.030614307150244713, 0.03654041141271591, -0.026655763387680054, -0.006622949615120888, 0.021912535652518272, -0.03269898518919945, -0.013339593075215816, 0.04178724065423012, 0.006154483184218407, -0.049189016222953796, -0.05574755370616913, 0.010440954007208347, -0.014299949631094933, 0.05223404988646507, 0.008473393507301807, 0.027592696249485016, -0.02599990926682949, 0.0044709304347634315, 0.0208467748016119, -0.02532063238322735, 0.005024306941777468, -0.001714295824058354, -0.006576103158295155, -0.003856067545711994, 0.036680951714515686, -0.019078312441706657, -0.008742761798202991, 0.0367746464908123, -0.03480708599090576, -0.029888182878494263, -0.006658084690570831, -0.003487149951979518, -0.03541609272360802, -0.002150555606931448, -0.03511158749461174, 0.002393572824075818, 0.012683738954365253, 0.03518185764551163, -0.09561408311128616,

By default, the length of the embedding vector is 1536 dimensions for `text-embedding-3-small`.

In [None]:
len(embedding)

1536
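
The default length is not the only option. As far as we understand the embeddings guide linked above, the `text-embedding-3` models support shorter embeddings through a `dimensions` request parameter, and a full-length vector can also be shortened manually by truncating it and rescaling it to unit length. The sketch below shows only the manual route on a made-up four-dimensional vector. `shorten_embedding` is a hypothetical helper, not part of the OpenAI client.

```python
import math

def shorten_embedding(embedding, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    cut = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in cut))
    # Guard against an all-zero slice, which cannot be normalised
    return [x / norm for x in cut] if norm > 0 else cut

# Hypothetical unit-length vector standing in for a real 1536-dimensional embedding
toy_embedding = [0.6, 0.0, 0.8, 0.0]

short = shorten_embedding(toy_embedding, 2)
print(len(short), short)
```

Shorter vectors reduce storage and comparison costs in a vector database, at the price of some loss of information.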

To address the lack of a single all-purpose embedding model, [Muennighoff et al.](https://arxiv.org/abs/2210.07316) (2022) introduced the Massive Text Embedding Benchmark ([MTEB](https://huggingface.co/spaces/mteb/leaderboard)) leaderboard, which is a good place to start when selecting a model. It spans eight embedding tasks covering multiple datasets and languages, which are constantly updated and against which model performance is evaluated. It can be used to help select the most suitable model for fine-tuning for a specific domain and/or task.


Muennighoff et al.'s comprehensive study concluded that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks. [Hongliu Cao](https://arxiv.org/abs/2406.01607) (2024) reviews recent advances in universal text embedding models with a focus on the top-performing text embeddings on MTEB. Four different eras of text embedding are identified: count-based embeddings (with dimensionality reduction techniques), static dense word embeddings, contextualised embeddings, and finally universal text embeddings.

<table>
  <tr><td>
    <img src="https://raw.githubusercontent.com/alisonmitchell/Biomedical-Knowledge-Graph/main/02_Exploratory_Data_Analysis/images/2406.01607_Recent_advances_in_text_embedding_H_Cao_2024.png"
         alt="The 4 different eras of text embeddings"  width="100%" height="auto">
  </td></tr>
  <tr><td align="center">
    <b>Fig 1.</b> The 4 different eras of text embeddings. 1st era: Count-based Embeddings (with dimension reduction techniques);<br/> 2nd era: Static dense word embeddings; 3rd era: Contextualized embeddings; 4th era: Universal text embeddings.<br/> <a href="https://arxiv.org/pdf/2406.01607">([2406.01607] Recent advances in text embedding, H Cao, 2024)</a>.<br/>&nbsp;
  </td></tr>
</table>





A universal text embedding is defined as 'a unified comprehensive text embedding model that can address a multitude of input text length, downstream tasks, domains and languages'. Such models improve overall performance on the MTEB English benchmarks, especially on Retrieval, Reranking, Clustering and Pair Classification tasks. However, there is still no single universal model that achieves SOTA performance on all benchmarks, and further improvements are needed in summarisation tasks, language universality, and domain diversity.




### References

* https://medium.com/@kirudang/language-model-history-before-and-after-transformer-the-ai-revolution-bedc7948a130

* https://viso.ai/deep-learning/representation-learning/

* https://towardsdatascience.com/text-embeddings-comprehensive-guide-afd97fce8fb5

* https://medium.com/mantisnlp/text-embedding-models-how-to-choose-the-right-one-fd6bdb7ee1fd

* https://towardsdatascience.com/explore-semantic-relations-in-corpora-with-embedding-models-0a6d64c3ec7f

* https://github.com/centre-for-humanities-computing/embedding-explorer

* https://centre-for-humanities-computing.github.io/embedding-explorer/

* https://realpython.com/chromadb-vector-database/

* https://www.pinecone.io/learn/series/nlp/sentence-embeddings/

* https://www.pinecone.io/learn/series/nlp/dense-vector-embeddings-nlp/

* https://platform.openai.com/docs/guides/embeddings

* Vaswani, A. et al. (2017). Attention Is All You Need. [arXiv:1706.03762](https://arxiv.org/pdf/1706.03762)

* Devlin, J. et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. [arXiv:1810.04805](https://arxiv.org/pdf/1810.04805)

* Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. [arXiv:1908.10084](https://arxiv.org/pdf/1908.10084)

* Naseem, U. et al. (2020). A Comprehensive Survey on Word Representation Models: From Classical to State-Of-The-Art Word Representation Language Models. [arXiv:2010.15036](https://arxiv.org/pdf/2010.15036)

* Muennighoff, N. et al. (2022). MTEB: Massive Text Embedding Benchmark. [arXiv:2210.07316](https://arxiv.org/abs/2210.07316)

* Su, H. et al. (2022). One Embedder, Any Task: Instruction-Finetuned Text Embeddings. [arXiv:2212.09741](https://arxiv.org/pdf/2212.09741)

* Cao, H. (2024). Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark. [arXiv:2406.01607](https://arxiv.org/pdf/2406.01607)