### Semantic Search with SBERT


There are two types of semantic search: _symmetric_ semantic search and _asymmetric_ semantic search.

#### Symmetric Semantic Search

In symmetric semantic search, the query and the entries in the text corpus are of roughly the same length and carry about the same amount of content. An example is searching for similar questions: your query could be "How to learn Python online?" and you want to find an entry like "How to learn Python on the web?". For symmetric tasks, you could swap the query and the corpus entries and the task would still make sense.

Suitable models for symmetric semantic search can be found among the pre-trained sentence embedding models, for example `all-mpnet-base-v2` (420 MB) and `all-distilroberta-v1` (290 MB).

#### Asymmetric Semantic Search

In this case we have a short query (such as a question or a few keywords) and we want to find a longer paragraph that answers it. An example is a query like "What is Python" for which we want to find a paragraph like "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy is ...". For asymmetric tasks, swapping the query and the entries in the text corpus does not make sense.

Suitable models for asymmetric semantic search are the pre-trained MS MARCO models. Among those tuned for cosine similarity, one can choose `msmarco-distilbert-base-v4` or `msmarco-roberta-base-v3`.
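To make "tuned for cosine similarity" concrete, here is a minimal pure-Python sketch of the score itself. The vectors are hypothetical toy values, not model output, and `cosine_similarity` is an illustrative helper rather than the library's implementation. It shows that the score depends only on the direction of the vectors, not on their length.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (hypothetical values, not model output).
query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # same direction as the query, twice the length
orthogonal = [0.0, 0.0, 3.0]      # shares no component with the query

score_same = cosine_similarity(query, same_direction)
score_orth = cosine_similarity(query, orthogonal)

print(f"same direction: {score_same:.4f}")
print(f"orthogonal:     {score_orth:.4f}")
```

Scaling a vector leaves its cosine score unchanged, which is why a longer corpus entry is not favored merely for having a larger embedding norm.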



In [4]:
!pip install -U sentence-transformers datasets fsspec


Collecting fsspec
  Using cached fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)


In [5]:
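# Encode sentences into dense embeddings.
# Note: paraphrase-MiniLM-L6-v2 produces 384-dimensional vectors, so the "768 dim"
# wording in the first sample sentence is just example text, not the actual size.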

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

sentences = ['Generate 768 dim embeddings for each sentence', 'Sentences are passed as a list of strings']

embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, embeddings):
  print("My sentence: ", sentence)
  print("Sentence Transformer Embedding: ", embedding)
  print("")

My sentence:  Generate 768 dim embeddings for each sentence
Sentence Transformer Embedding:  [-0.24625525  0.6436813   0.30080056 -0.10090573 -0.18973285 -0.03768769
 -0.13906462 -0.34431723 -0.23384398  0.13244423 -0.05336657  0.24885814
  0.24511687 -0.25617316 -0.08420948  0.33067372 -0.27076212  0.6782078
 -0.7065085  -0.46104592  0.27874663  0.23837136 -0.40843928 -0.27558264
  0.46099934 -0.07821565 -0.29660904 -0.25337937  0.6728783  -0.34621838
  0.11565675  0.5246502   0.47243908  0.33126864 -0.06534381 -0.08121847
 -0.2695944   0.3282064  -0.05759253 -0.16452837 -0.30180246  0.17743246
  0.06289521  0.06998185 -0.03683375 -0.05886895 -0.47789574  0.3925636
  0.11040414  0.4663116  -0.10027011  0.12904194 -0.7454573   0.20695008
 -0.08739292 -0.29257065 -0.09386834 -0.289376   -0.15315506 -0.0508649
 -0.28702986  0.04288907 -0.07932635  0.23078537  0.13428289  0.02654775
  0.01984073 -0.03557086 -0.43110615  1.1263238  -0.39067203 -0.00258689
 -0.1613418   0.31628454 -0.176274

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
from datasets import load_dataset

# Load the STSB dataset, specifying the split (train, validation, or test)
train_dataset = load_dataset("sentence-transformers/stsb", split="train")
eval_dataset = load_dataset("sentence-transformers/stsb", split="validation")
test_dataset = load_dataset("sentence-transformers/stsb", split="test")

# The dataset contains the following columns:
# 'sentence1': The first sentence in the pair.
# 'sentence2': The second sentence in the pair.
# 'score': The similarity score between the two sentences.

README.md:   0%|          | 0.00/1.50k [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/471k [00:00<?, ?B/s]

data/validation-00000-of-00001.parquet:   0%|          | 0.00/142k [00:00<?, ?B/s]

data/test-00000-of-00001.parquet:   0%|          | 0.00/108k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/5749 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1379 [00:00<?, ? examples/s]

In [8]:
test_dataset[:5]['sentence1']

['A girl is styling her hair.',
 'A group of men play soccer on the beach.',
 "One woman is measuring another woman's ankle.",
 'A man is cutting up a cucumber.',
 'A man is playing a harp.']

In [9]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
device


'cuda'

In [10]:
!sudo apt install nvidia-utils-575

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libnvidia-cfg1-575 libnvidia-compute-575 libnvidia-decode-575
  libnvidia-gpucomp-575 nvidia-persistenced
Suggested packages:
  nvidia-driver-575
The following NEW packages will be installed:
  libnvidia-cfg1-575 libnvidia-compute-575 libnvidia-decode-575
  libnvidia-gpucomp-575 nvidia-persistenced nvidia-utils-575
0 upgraded, 6 newly installed, 0 to remove and 35 not upgraded.
Need to get 75.3 MB of archives.
After this operation, 425 MB of additional disk space will be used.
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libnvidia-cfg1-575 575.57.08-0ubuntu1 [145 kB]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libnvidia-decode-575 575.57.08-0ubuntu1 [2,555 kB]
Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libnvidia-gpucomp-575 

In [11]:
!nvidia-smi

Mon Jun  9 01:08:39 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   31C    P0             52W /  400W |     717MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

Paraphrase mining is the task of finding paraphrases (texts with identical or similar meaning) in a large corpus of sentences. It compares every sentence against every other sentence and returns a list of the pairs with the highest cosine similarity scores.
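The all-against-all comparison can be sketched in a few lines of plain Python. This is a brute-force illustration under stated assumptions: `mine_pairs` is a hypothetical helper, the 2-dimensional embeddings are made-up toy values, and the library function is more elaborate than this. The `(score, i, j)` tuple shape mirrors how the results are unpacked in the next cell.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mine_pairs(embeddings, top_k=3):
    """Score every unordered pair (i < j) and return the top_k as (score, i, j)."""
    scored = [
        (cosine_similarity(embeddings[i], embeddings[j]), i, j)
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    # Highest similarity first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]

# Toy data: hypothetical 2-dimensional embeddings, not model output.
toy_sentences = [
    "A man is dancing.",
    "A man dances.",
    "Two dogs play in the grass.",
    "A man is dancing.",
]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]

pairs = mine_pairs(toy_embeddings)
for score, i, j in pairs:
    print(f"i={i} j={j} {toy_sentences[i]!r} {toy_sentences[j]!r} Score: {score:.4f}")
```

With n sentences this scores n(n-1)/2 pairs, so the brute-force version only suits small corpora.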

In [12]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import paraphrase_mining

# Paraphrase Mining

sentences1 = test_dataset['sentence1']
paraphrases = paraphrase_mining(model, sentences1)
print(len(paraphrases))
for paraphrase in paraphrases[0:10]:
  score, i, j = paraphrase
  print("i={} \t\t j={} \t\t {} \t\t {} \t\t Score: {:.4f}".format(i, j, sentences1[i], sentences1[j], score))


83418
i=25 		 j=159 		 A  man is dancing. 		 A man is dancing. 		 Score: 1.0000
i=25 		 j=179 		 A  man is dancing. 		 A man is dancing. 		 Score: 1.0000
i=159 		 j=179 		 A man is dancing. 		 A man is dancing. 		 Score: 1.0000
i=131 		 j=178 		 The man is playing the guitar. 		 The man is playing the guitar. 		 Score: 1.0000
i=276 		 j=351 		 A woman riding a brown horse. 		 A woman riding a brown horse. 		 Score: 1.0000
i=385 		 j=424 		 Two dogs play in the grass. 		 Two dogs play in the grass. 		 Score: 1.0000
i=635 		 j=750 		 It depends on what you want to have in your tank. 		 It depends on what you want to have in your tank. 		 Score: 1.0000
i=642 		 j=647 		 Unfortunately the answer to your question is we simply do not know. 		 Unfortunately the answer to your question is we simply do not know. 		 Score: 1.0000
i=642 		 j=743 		 Unfortunately the answer to your question is we simply do not know. 		 Unfortunately the answer to your question is we simply do not know. 		 Score: 1

In [13]:
from sentence_transformers import SentenceTransformer, util
import torch

# Symmetric Semantic Search

symmetric_embedder = SentenceTransformer('all-mpnet-base-v2')

corpus = train_dataset['sentence1']
corpus_embeddings = symmetric_embedder.encode(corpus, convert_to_tensor=True,
                                              show_progress_bar = True,
                                              batch_size = 128)

queries = ['Climate change and strategies to prevent global warming',
           'Quark plasma distribution at high energies']

# find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))

for query in queries:
  query_embedding = symmetric_embedder.encode(query, convert_to_tensor=True)
  cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
  top_results = torch.topk(cos_scores, k=top_k)

  print(f"\n\nQuery: {query}")
  print('------------------------------------')
  print('\nTop 5 most similar sentences in corpus:')

  for score, idx in zip(top_results[0], top_results[1]):
    print(corpus[idx], "(Score: {:.4f})".format(score))




Query: Climate change and strategies to prevent global warming
------------------------------------

Top 5 most similar sentences in corpus:
Pope calls for action on climate change in draft encyclical (Score: 0.5091)
No, none of those factors account for the recent warming. (Score: 0.5027)
It is an excellent indicator of climate change. (Score: 0.4652)
NASA satellite images show that Arctic ice has been shrinking at the rate of nearly 10 percent a decade. (Score: 0.4261)
regional and international non-proliferation issues should be addressed through dialogue and negotiations.  (Score: 0.3738)


Query: Quark plasma distribution at high energies
------------------------------------

Top 5 most similar sentences in corpus:
Gunman kills 6 in shooting at Wisconsin Sikh temple (Score: 0.2327)
The large brown dog is jumping through the tall grass. (Score: 0.2270)
The black dog is walking through the tall grass. (Score: 0.2127)
A brown dog is running through green grass. (Score: 0.2071)
Two 

In [14]:
# Asymmetric Semantic Search

asymmetric_embedder = SentenceTransformer('msmarco-distilbert-base-v4')

queries = ['Climate change and strategies to prevent global warming', 'Quark plasma distribution at high energies']

corpus = train_dataset['sentence1']
corpus_embeddings = asymmetric_embedder.encode(corpus, convert_to_tensor=True,
                                               show_progress_bar=True,
                                               batch_size = 128)

top_k = min(5, len(corpus))

for query in queries:
  query_embedding = asymmetric_embedder.encode(query, convert_to_tensor=True)

  cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
  top_results = torch.topk(cos_scores, k=top_k)

  print(f"\n\nQuery: {query}")
  print('------------------------------------')
  print('\nTop 5 most similar sentences in corpus:')

  for score, idx in zip(top_results[0], top_results[1]):
    print(corpus[idx], "(Score: {:.4f})".format(score))




Query: Climate change and strategies to prevent global warming
------------------------------------

Top 5 most similar sentences in corpus:
goff stated that this presents a major threat to global security.  (Score: 0.3905)
Pope calls for action on climate change in draft encyclical (Score: 0.3737)
the resolution requires all 192 united nations member states to adopt laws to prevent terrorists, black marketeers and other non-state actors from manufacturing, acquiring or trafficking in nuclear, biological or chemical weapons or the materials to make them.  (Score: 0.3724)
It is an excellent indicator of climate change. (Score: 0.3588)
No, none of those factors account for the recent warming. (Score: 0.3560)


Query: Quark plasma distribution at high energies
------------------------------------

Top 5 most similar sentences in corpus:
Stony Brook University launched the study in 1996, after earlier studies indicated a possible connection between electromagnetic fields and cancer. (Sco

In [16]:
# Semantic search using util.semantic_search

query = queries[0]
query_embedding = asymmetric_embedder.encode(query, convert_to_tensor=True)

hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)
hits = hits[0] # get the hits for the first query

for hit in hits:
  print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))

goff stated that this presents a major threat to global security.  (Score: 0.3905)
Pope calls for action on climate change in draft encyclical (Score: 0.3737)
the resolution requires all 192 united nations member states to adopt laws to prevent terrorists, black marketeers and other non-state actors from manufacturing, acquiring or trafficking in nuclear, biological or chemical weapons or the materials to make them.  (Score: 0.3724)
It is an excellent indicator of climate change. (Score: 0.3588)
No, none of those factors account for the recent warming. (Score: 0.3560)
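
The results match the manual `cos_sim` plus `torch.topk` loop from the previous cell, which suggests the helper performs the same ranking. As a rough sketch of that idea, and not the library's actual implementation, the following pure-Python version scores each corpus entry against the query and keeps the best `top_k`. The name `top_k_search` and the toy embeddings are assumptions made for illustration; the `corpus_id` and `score` keys mirror how the hits are read in the cell above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_search(query_embedding, corpus_embeddings, top_k=5):
    """Rank corpus entries by cosine similarity to the query; return the best top_k."""
    hits = [
        {"corpus_id": idx, "score": cosine_similarity(query_embedding, emb)}
        for idx, emb in enumerate(corpus_embeddings)
    ]
    # Highest similarity first.
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits[:top_k]

# Toy data: hypothetical 2-dimensional embeddings, not model output.
toy_corpus = ["climate report", "dog in grass", "warming trend"]
toy_corpus_embeddings = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.3]]
toy_query_embedding = [1.0, 0.0]

toy_hits = top_k_search(toy_query_embedding, toy_corpus_embeddings, top_k=2)
for hit in toy_hits:
    print(toy_corpus[hit["corpus_id"]], "(Score: {:.4f})".format(hit["score"]))
```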
