
**Author:** Carolina Gonçalves, carolina.goncalves@research.fchampalimaud.org

**Scope**: This notebook aims to adress the basic aspect of lexical (keyword) search and semantic search through embeddings, comparing the two approaches.

**Introduction**

> In typical search, one starts with a collection of documents and then tries
to rank documents based on how well they match some user’s query [1]. In tradicional search, this is mainly done through *keyword matching*. Indexed documents or web pages containing the query's keywords are, not only returned as macthes, but ranked according to how well they match the query. Amongst the many ranking algorithms [[2]](https://www.geeksforgeeks.org/keyword-searching-algorithms-for-search-engines/), one of the most popular ones is **BM25**. This is the algorithm will be using here.


> We'll start by downloading a simple dataset and implement BM25 to retrieve the most similar/related sentences to another sentence (query). We'll then move on to using embeddings computed from a transformer model for the same purpose. There's a final simple example, before we can draw some conclusions about the advantages and caveats of both methods.


[1] "AI-Powered Search", Trey Grainger, Doug Turnbull, Max Irwin


First, we need to install the libraries we'll be using:
1. *datasets* from Hugging Face: has many available datasets for mutiple tasks that we can use.
2. *rank_bm25*: python package that implements BM25
3. *sentence-transformers*: python module to access, use, and training state-of-the-art text and image embedding models.

In [1]:
!pip install -U sentence-transformers rank_bm25
!pip install datasets

Collecting sentence-transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m227.1/227.1 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25, sentence-transformers
Successfully installed rank_bm25-0.2.2 sentence-transformers-3.0.1
Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x8

In [2]:
import random
import torch

# Set a random seed
random_seed = 42
random.seed(random_seed)

# Set a random seed for PyTorch (for GPU as well)
torch.manual_seed(random_seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(random_seed)

# Load Dataset

This dataset contains pairs of questions from Quora, labeled as either duplicate (semantically similar) or not.

In [None]:
# import libraries needed to run this code
from datasets import load_dataset
import pandas as pd

In [None]:
ds = load_dataset("HHousen/quora", split="validation") #https://huggingface.co/datasets/HHousen/quora

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading data:   0%|          | 0.00/52.0M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.32M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.32M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/384348 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [None]:
ds

Dataset({
    features: ['label', 'sentence1', 'sentence2', 'instance_id'],
    num_rows: 10000
})

In [None]:
#data = ds.to_pandas().loc[:, ["title", "text"]]
data = ds.to_pandas().loc[:, ["label", "sentence1", "sentence2"]]
data.tail()

Unnamed: 0,label,sentence1,sentence2
9995,0,Which were some major programming breakthrough...,Prove it using integration - e ^ b > 1 + b + b...
9996,1,Would an audio recording of someone admitting ...,Does admission of consent recorded on video st...
9997,0,What are some of the questions on Quora that m...,What are the most annoying types of questions ...
9998,1,What are some popular method to do suicide ?,What is the easiest pain free method of commit...
9999,0,"Prove that among any K consecutive integers , ...",Are most Americsns so brainwashed to not see t...


In [None]:
data.iloc[0, 1], data.iloc[0, 2]

('How do I get funding for my web based startup idea ?',
 'How do I get seed funding pre product ?')

In [None]:
data.iloc[1, 1], data.iloc[1, 2]

('Is honey a viable alternative to sugar for diabetics ?',
 "How would you compare the United States ' euthanasia laws to Denmark ?")

In [None]:
data["label"].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
1,5000
0,5000


# Lexical Search: BM25


BM25 (Best Matching 25) is a ranking algorithm used in information retrieval systems, particularly in search engines, to rank documents based on their relevance to a given query. It is an extension of the probabilistic retrieval model and works by scoring each document according to the **frequency of query terms within the document** while considering factors such as document length and term saturation. BM25 improves retrieval effectiveness by balancing term frequency, document length normalization, and **inverse document frequency**, making it one of the most popular and effective algorithms for ranking search results. If you want to better understand how BM25 works, see [[3]](https://kmwllc.com/index.php/2020/03/20/understanding-tf-idf-and-bm-25/).

<img src="https://media.geeksforgeeks.org/wp-content/uploads/20240124120825/BM25.webp" alt="BM25 Algorithm" width="800"/>

(taken from [[2]](https://www.geeksforgeeks.org/keyword-searching-algorithms-for-search-engines/))

One common step before keyword-based ranking is to **pre-process** the text data and queries. Words can have multiple variations, without changing the core meaning:
* Lower vs uppercase
* Punctuation
* Verb conjugations
* (...)

Pre-processing them to a common simple form increases the sensitivity of the algorithm, reducing the risk of not retrieving true document matches because of wording variations.

In [None]:
import re
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from rank_bm25 import BM25Okapi
import numpy as np
import time
import pickle

In [None]:
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))  # nlkt
print(stop_words)

porter_stemmer = PorterStemmer()

def preprocess(text, stem=False, print_tokens=False):
  # Tokenization and lowercasing; also ignore punctuation
  tokens = re.findall(r"\w+", text.lower()) #separating words

  # Remove noisy terms including stopwords from indexing
  filtered_tokens = [
      token
      for token in tokens
      if token not in stop_words
  ]
  # Converting words to their cannonical form
  if stem:
    filtered_tokens = [
        porter_stemmer.stem(token) for token in filtered_tokens
    ]
  if print_tokens:
    print(filtered_tokens)
  return filtered_tokens

{'will', 'other', 'you', 'himself', 'into', 'very', 'but', 'should', 'here', 'now', 'each', 'and', 'having', 'between', 're', 'they', "couldn't", 'isn', 'hadn', 'do', 'we', 'myself', 'ain', "shan't", 'have', 'with', 'my', 'there', 'it', 'ourselves', 'doing', "won't", 'if', 'yourselves', 'her', 'against', 'wouldn', 've', "mustn't", 'its', 'don', 'itself', 'both', "haven't", 'up', 'be', 'that', 'is', "you'd", 'where', 'most', 'm', 'theirs', 'd', 'why', 'me', 'until', 'hers', 'yourself', 'their', "you'll", "doesn't", "wouldn't", 'your', 'ma', 'yours', 'him', 'hasn', 'own', "don't", 'didn', 'had', 'under', 'below', 'haven', 'off', "wasn't", 'few', 'am', 'he', 'so', 'such', 'out', 'before', 'our', 'were', 'doesn', 'whom', 'any', 'in', 'again', 'weren', 'won', 'just', 'mustn', 'same', "that'll", 'a', "should've", "hadn't", 'an', 'been', "isn't", 'not', 'no', 'what', 'of', 'wasn', 'then', 'them', "she's", 'because', 'over', 'only', 'when', 'shan', 'or', 'about', 'was', 'more', 'herself', 'mig

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
passages = data["sentence2"].tolist() #get list of all "sentence2" as the
          # corpus of passages to search from
tini = time.time()
tokenized_corpus = [preprocess(passage) for passage in passages[:500]]
print(f"Total time spent pre-processing: {time.time()-tini}s")

Total time spent pre-processing: 0.011415243148803711s


In [None]:
# To visualize what this pre-processing does
print("Original text: ")
print(passages[2])
print("Tokenized and filtered text: ")
print(tokenized_corpus[2])
print("Stemmed text: ")
print([porter_stemmer.stem(token) for token in tokenized_corpus[2]])

Original text: 
What can I do to stop being depressed ?
Tokenized and filtered text: 
['stop', 'depressed']
Stemmed text: 
['stop', 'depress']


In [None]:
def search_wBM25(query, processed_docs, original_docs, stem=False, print_keywords=False, k=1.2, b=0.75):
    '''
    Implements BM25 algortihm for ranking documents based on their relevance to a given query.
    k and b are hyperparameters that control for term frequency saturation and document length normalization, respectively.
    Returns the top-3 most relevant documents to the query, order by relevance (BM25 score).
    '''
    tini = time.time()
    # Initialize BM25 with the tokenized and pre-processed corpus
    # IDF will be computed based on this corpus
    bm25 = BM25Okapi(processed_docs, k1=k, b=b)
    # Pre-process the query as the processed_docs (stem=True if the docs were stemmed)
    # Then, use BM25 to score each document based on its similarity to the query
    bm25_scores = bm25.get_scores(preprocess(query, stem=stem, print_tokens=print_keywords))
    top_n = np.argpartition(bm25_scores, -3)[-3:] #select the top 3 most matched docs to the query
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True) #sorting the passages by their score
    tout = time.time()

    print("Print top-3 lexical search (BM25) results")
    print_results(bm25_hits, original_docs)
    print(f"Total time spent searching: {tout-tini}s")

def print_results(hits, original_docs):
  for result in hits[0:3]:
    print("{}: {:.3f}\t{}".format(result['corpus_id'], result['score'], original_docs[result['corpus_id']].replace("\n", " ")))

In [None]:
pick_qid = 1
query = data.loc[pick_qid, "sentence1"]
print(f"{query}\t It has {data.iloc[pick_qid, 0]} semantic twin sentence.")
search_wBM25(query, tokenized_corpus, passages)

Is honey a viable alternative to sugar for diabetics ?	 It has 0 semantic twin sentence.
Print top-3 lexical search (BM25) results
377: 6.058	What are the health effects , if any , of honey and lemon water ?
103: 4.924	Does putting sugar in a car 's gas tank really ruin the car ?
499: 0.000	Assume a flat , infinite earth but with no atmosphere . If I pointed a laser parallel to its surface , how far would it travel before hitting the ground ?
Total time spent searching: 0.017023801803588867s


In [None]:
pick_qid = 2
query = data.loc[pick_qid, "sentence1"]
print(f"{query}\t It has {data.iloc[pick_qid, 0]} semantic twin sentence.")
search_wBM25(query, tokenized_corpus, passages)

How can I stop my depression ?	 It has 1 semantic twin sentence.
Print top-3 lexical search (BM25) results
2: 7.175	What can I do to stop being depressed ?
278: 6.561	How do I treat depression without medication ?
340: 4.788	How do I stop my Shepherd-Husky mix puppy from humping my furniture ?
Total time spent searching: 0.012544631958007812s


In [None]:
pick_qid = 497
query = data.loc[pick_qid, "sentence1"]
print(f"{query}\t It has {data.iloc[pick_qid, 0]} semantic twin sentence.")
search_wBM25(query, tokenized_corpus, passages)

How does iron change from solid to liquid and gas ?	 It has 0 semantic twin sentence.
Print top-3 lexical search (BM25) results
497: 21.920	How does gold change from solid to liquid and gas ?
227: 6.561	Where are the Avengers in Iron Man 3 ?
237: 5.170	How can cows produce less methane gas ?
Total time spent searching: 0.013359546661376953s


# Semantic Search: Embeddings

> Semantic search aims to improve the relevance and accuracy of search results by understanding the meaning and context behind the query, rather than relying solely on exact keyword matches. Unlike traditional keyword-matching algorithms like BM25, which rank documents based on the presence and frequency of specific words, semantic search uses embeddings and more advanced natural language processing (NLP) techniques to capture the nuances of language, such as synonyms, context, and intent.

> Embeddings, which are dense vector representations of words or phrases, allow semantic search systems to recognize that different words or phrases with similar meanings should yield similar search results, even if the exact keywords do not appear in the documents.

> Here, we levarege transformer models from *sentence-transformers*. These models were already trained for semantic similarity tasks. By embedding each sentence, paragraph, chunk of text and then comparing both embedding through similarity metrics (e.g: cosine similarity), they can measure the semantic similarity to a specific query.

In [3]:
from sentence_transformers import SentenceTransformer, util
import time

  from tqdm.autonotebook import tqdm, trange


In [4]:
model = SentenceTransformer('all-MiniLM-L6-v2') #loads embedding model
    # this model is uncased, meaning there's no difference between "English" and "english"
    # depending on your dataset and task, you might want to use a cased model.
# More embeddings models here: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]



1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [None]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [None]:
tini = time.time()
# passages = data["sentence2"].tolist()
embeddings = model.encode(passages[:500], convert_to_tensor=True)#.to('cuda') #needs GPU to be fast
print(f"Total time spent embedding: {time.time()-tini}s")

print(embeddings.shape)

Total time spent embedding: 8.096527338027954s
torch.Size([500, 384])


In [None]:
pick_qid = 1
query = data.loc[pick_qid, "sentence1"]
print(f"{query}\t It has {data.iloc[pick_qid, 0]} semantic twin sentence.")

tini = time.time()
query_embeddings = model.encode(query, convert_to_tensor=True)#.to('cuda')
similarity_function = util.cos_sim #use cosine similarity as the ranking metric
results = util.semantic_search(
    query_embeddings,
    embeddings,
    score_function=similarity_function,
    top_k=3,
)
tout = time.time()

print_results(results[0], passages)
print(f"Total time spent searching: {tout-tini}s")

Is honey a viable alternative to sugar for diabetics ?	 It has 0 semantic twin sentence.
377: 0.508	What are the health effects , if any , of honey and lemon water ?
103: 0.400	Does putting sugar in a car 's gas tank really ruin the car ?
194: 0.339	Which alcoholic drink is good for health ?
Total time spent searching: 0.04375958442687988s


In [None]:
pick_qid = 2
query = data.loc[pick_qid, "sentence1"]
print(f"{query}\t It has {data.iloc[pick_qid, 0]} semantic twin sentence.")

tini = time.time()
query_embeddings = model.encode(query, convert_to_tensor=True)#.to('cuda')
similarity_function = util.cos_sim
results = util.semantic_search(
    query_embeddings,
    embeddings,
    score_function=similarity_function,
    top_k=3,
)
tout = time.time()

print_results(results[0], passages)
print(f"Total time spent searching: {tout-tini}s")

How can I stop my depression ?	 It has 1 semantic twin sentence.
2: 0.916	What can I do to stop being depressed ?
278: 0.714	How do I treat depression without medication ?
226: 0.579	How do I quit any kind of addiction ?
Total time spent searching: 0.028767108917236328s


In [None]:
pick_qid = 497
query = data.loc[pick_qid, "sentence1"]
print(f"{query}\t It has {data.iloc[pick_qid, 0]} semantic twin sentence.")

tini = time.time()
query_embeddings = model.encode(query, convert_to_tensor=True)#.to('cuda')
similarity_function = util.cos_sim
results = util.semantic_search(
    query_embeddings,
    embeddings,
    score_function=similarity_function,
    top_k=3,
)
tout = time.time()

print_results(results[0], passages)
print(f"Total time spent searching: {tout-tini}s")

How does iron change from solid to liquid and gas ?	 It has 0 semantic twin sentence.
497: 0.735	How does gold change from solid to liquid and gas ?
79: 0.337	What are the only two elements that are liquid at 25 ° C -LRB- room temperature -RRB- ? In a periodic table
110: 0.330	What is the process of distillation ?
Total time spent searching: 0.04038548469543457s


# Example comparing BM25 (with and without stemming) and embeddings

In [None]:
my_own_query = "How can I stop my depression?"

my_own_corpus = [
                 "What can I do to stop being depressed?",
                 "How do I stop my Shepherd-Husky mix puppy from humping my furniture?",
                 "How do I treat depression without medication?",
                 "How can I improve my mood?",
                 "Eat healthy, well-balanced meals."
                 "How do I quit any kind of addiction?"
                 ]

In [None]:
## BM25
tokenized_corpus = [preprocess(passage) for passage in my_own_corpus]
search_wBM25(my_own_query, tokenized_corpus, original_docs=my_own_corpus, print_keywords=True)

##BM25 w/ stemming
print("\nSearch with BM25, but including stemming in the preprocessing:")
tokenized_corpus = [preprocess(passage, stem=True) for passage in my_own_corpus]
search_wBM25(my_own_query, tokenized_corpus, original_docs=my_own_corpus, stem=True, print_keywords=True)

['stop', 'depression']
Print top-3 lexical search (BM25) results
2: 1.161	How do I treat depression without medication?
0: 0.438	What can I do to stop being depressed?
1: 0.277	How do I stop my Shepherd-Husky mix puppy from humping my furniture?
Total time spent searching: 0.0013012886047363281s

Search with BM25, but including stemming in the preprocessing:
['stop', 'depress']
Print top-3 lexical search (BM25) results
0: 0.875	What can I do to stop being depressed?
2: 0.355	How do I treat depression without medication?
1: 0.277	How do I stop my Shepherd-Husky mix puppy from humping my furniture?
Total time spent searching: 0.000423431396484375s


In [None]:
## Embeddings
print("\nSearch with embeddings:")
corpus_embeddings = model.encode(my_own_corpus, convert_to_tensor=True) #corpus mebeddings

tini = time.time()
query_embeddings = model.encode(my_own_query, convert_to_tensor=True)#.to('cuda')
print(query_embeddings.shape)

similarity_function = util.cos_sim
results = util.semantic_search(
    query_embeddings,
    corpus_embeddings, #corpus mebeddings
    score_function=similarity_function,
    top_k=3,
)
tout = time.time()

print_results(results[0], my_own_corpus)
print(f"Total time spent searching: {tout-tini}s")


Search with embeddings:
torch.Size([384])
0: 0.916	What can I do to stop being depressed?
2: 0.714	How do I treat depression without medication?
3: 0.583	How can I improve my mood?
Total time spent searching: 0.027634382247924805s


In [None]:
my_own_query = "What are Transformers in Natural Language Processing?"

my_own_corpus = ["Transformers is a series of science fiction action films based on the Transformers franchise.",
                 "Transformers: Revenge of the Fallen",
                 "A Transformer is a novel architecture that aims to solve sequence-to-sequence tasks.",
                 "Transformers are used to process and comprehend text in and end-to-end fashion.",
                 "It's natural to be confused."]

In [None]:
## BM25
tokenized_corpus = [preprocess(passage) for passage in my_own_corpus]
search_wBM25(my_own_query, tokenized_corpus, original_docs=my_own_corpus, print_keywords=True)

##BM25 w/ stemming
print("\nSearch with BM25, but including stemming in the preprocessing:")
tokenized_corpus = [preprocess(passage, stem=True) for passage in my_own_corpus]
search_wBM25(my_own_query, tokenized_corpus, original_docs=my_own_corpus, stem=True, print_keywords=True)

['transformers', 'natural', 'language', 'processing']
Print top-3 lexical search (BM25) results
4: 1.511	It's natural to be confused.
1: 0.327	Transformers: Revenge of the Fallen
0: 0.314	Transformers is a series of science fiction action films based on the Transformers franchise.
Total time spent searching: 0.0008490085601806641s

Search with BM25, but including stemming in the preprocessing:
['transform', 'natur', 'languag', 'process']
Print top-3 lexical search (BM25) results
4: 1.511	It's natural to be confused.
3: 1.188	Transformers are used to process and comprehend text in and end-to-end fashion.
1: 0.317	Transformers: Revenge of the Fallen
Total time spent searching: 0.0007281303405761719s


In [None]:
## Embeddings
print("\nSearch with embeddings:")
corpus_embeddings = model.encode(my_own_corpus, convert_to_tensor=True) #corpus mebeddings

tini = time.time()
query_embeddings = model.encode(my_own_query, convert_to_tensor=True)#.to('cuda')
print(query_embeddings.shape)

similarity_function = util.cos_sim
results = util.semantic_search(
    query_embeddings,
    corpus_embeddings, #corpus mebeddings
    score_function=similarity_function,
    top_k=3,
)
tout = time.time()

print_results(results[0], my_own_corpus)
print(f"Total time spent searching: {tout-tini}s")


Search with embeddings:
torch.Size([384])
3: 0.794	Transformers are used to process and comprehend text in and end-to-end fashion.
2: 0.512	A Transformer is a novel architecture that aims to solve sequence-to-sequence tasks.
0: 0.404	Transformers is a series of science fiction action films based on the Transformers franchise.
Total time spent searching: 0.020563602447509766s


**Conclusions**

* Pre-processing makes keyword search more robust to *wording variations* and terms that are general and irrelevant, independently of your data corpus.
* Stemming helps identifying more possible matches, but might also include false positives.
* BM25 assigns less importance to words that appear more frequently across different documents in your database (because there are probably general words, like pronouns or prepositions). But this also means that this decision will be dependent on the quality and variety of your dataset.
* Embeddings capture the semantic meaning of the query, not only keywords (*"search on things, not strings"*).

*Takeaways*
1. Lexical Search (BM25) works best for scenarios where you must ensure certain keywords appear in the search results. It is also used as a first stage for fast retrieval.
2. Semantic Search with Embeddings although more expensive, provides a better representation of the query/sentence as one unit of meaning.
3. Leveraging embedding models is useful for complex or ambiguous queries, as it enables the search engine to retrieve more contextually relevant results. Nonetheless, it's important to note that these lack transparency and their notion of semantic similarity/relatedness depends on the training dataset.