<a href="https://colab.research.google.com/github/dimitarpg13/transformer_examples/blob/main/notebooks/bert/Semantic_Search_with_SBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Semantic Search with SBERT

This notebook shows a simple application of sentence embeddings: semantic search.

Given a corpus of sentences and a query sentence, we want to find the corpus sentences
that are most similar in meaning to the query.

For each query, the notebook prints the 5 most similar sentences in the corpus, ranked by cosine similarity.

In [1]:
!pip install sentence-transformers langchain-community


In [2]:
import torch
import numpy as np

from langchain_community.embeddings import SentenceTransformerEmbeddings

embedder = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")



Corpus with example documents

In [4]:

corpus = [
    "Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.",
    "Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning.",
    "Neural networks are computing systems vaguely inspired by the biological neural networks that constitute animal brains.",
    "Mars rovers are robotic vehicles designed to travel on the surface of Mars to collect data and perform experiments.",
    "The James Webb Space Telescope is the largest optical telescope in space, designed to conduct infrared astronomy.",
    "SpaceX's Starship is designed to be a fully reusable transportation system capable of carrying humans to Mars and beyond.",
    "Global warming is the long-term heating of Earth's climate system observed since the pre-industrial period due to human activities.",
    "Renewable energy sources include solar, wind, hydro, and geothermal power that naturally replenish over time.",
    "Carbon capture technologies aim to collect CO2 emissions before they enter the atmosphere and store them underground.",
]
# embed_documents returns one embedding (a list of floats) per corpus sentence
corpus_embeddings = embedder.embed_documents(corpus)

# Stack the embeddings into a 2D array of shape (num_sentences, embedding_dim)
corpus_embeddings_np = np.array(corpus_embeddings)
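
As a quick illustration of the conversion above, the sketch below uses a hand-made nested list as a stand-in for the output of `embed_documents`. The values and the name `toy_embeddings` are made up for illustration and are not produced by the model. `np.array` turns a list of equal-length vectors into a 2D array with one row per sentence.

```python
import numpy as np

# Hypothetical stand-in for embedder.embed_documents(...): one list of floats per sentence
toy_embeddings = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]

toy_embeddings_np = np.array(toy_embeddings)

# Rows index sentences, columns index embedding dimensions
print(toy_embeddings_np.shape)  # (3, 4)
print(toy_embeddings_np.dtype)  # float64
```

With `all-MiniLM-L6-v2`, each embedding has 384 dimensions, so the real `corpus_embeddings_np` should have shape `(9, 384)` for the 9-sentence corpus.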

Construct the query sentences

In [5]:
queries = [
    "How do artificial neural networks work?",
    "What technology is used for modern space exploration?",
    "How can we address climate change challenges?",
]

Define a function that computes the cosine similarity between a single embedding vector and each vector in a list of embedding vectors, returning one score per list entry
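
For reference, the cosine similarity between a query embedding $q$ and a corpus embedding $d_i$ is the dot product of the two vectors after each is scaled to unit length. The symbols $q$ and $d_i$ are notation introduced here, not names from the notebook.

```latex
\operatorname{cos\_sim}(q, d_i) = \frac{q \cdot d_i}{\lVert q \rVert \, \lVert d_i \rVert}
```

Scores lie between $-1$ and $1$; a higher score means the two vectors point in more similar directions, regardless of their lengths.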

In [None]:
def calculate_cosine_similarity_with_list(single_embedding, list_of_embeddings):
    """
    Calculates the cosine similarity between a single embedding vector
    and a list of embedding vectors using NumPy.

    Args:
        single_embedding (np.ndarray): A 1D NumPy array representing the single embedding vector.
        list_of_embeddings (np.ndarray): A 2D NumPy array where each row represents an embedding vector.

    Returns:
        np.ndarray: A 1D NumPy array containing the cosine similarity scores
                    between the single embedding and each embedding in the list.
    """
    # Normalize the single embedding vector
    norm_single_embedding = single_embedding / np.linalg.norm(single_embedding)

    # Normalize the list of embedding vectors
    norm_list_of_embeddings = list_of_embeddings / np.linalg.norm(list_of_embeddings, axis=1, keepdims=True)

    # Calculate the dot product (cosine similarity)
    cosine_similarities = np.dot(norm_list_of_embeddings, norm_single_embedding)

    return cosine_similarities
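
A small sanity check can make the behavior of this function easier to trust. The sketch below repeats the same computation under a shorter, hypothetical name (`cosine_similarity_with_list`) so that it runs on its own, and applies it to hand-made 2D vectors whose angles to the query are known in advance. The toy vectors are illustrative and are not model embeddings.

```python
import numpy as np

def cosine_similarity_with_list(single_embedding, list_of_embeddings):
    # Same computation as the notebook function, repeated so this sketch runs on its own
    unit_single = single_embedding / np.linalg.norm(single_embedding)
    unit_rows = list_of_embeddings / np.linalg.norm(list_of_embeddings, axis=1, keepdims=True)
    return np.dot(unit_rows, unit_single)

toy_query = np.array([1.0, 0.0])
toy_corpus = np.array([
    [2.0, 0.0],   # same direction as the query, different length
    [0.0, 3.0],   # orthogonal to the query
    [-1.0, 0.0],  # opposite direction
    [1.0, 1.0],   # 45 degrees away from the query
])

toy_scores = cosine_similarity_with_list(toy_query, toy_corpus)
print(np.round(toy_scores, 4))  # approximately [1, 0, -1, 0.7071]
```

Because the function divides by the vector norms, an all-zero embedding would produce NaN scores; real sentence embeddings are not expected to be all zeros, but a guard may be worth adding if inputs come from elsewhere.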

Encode the queries and find the closest 5 sentences of the corpus for each query sentence based on cosine similarity

In [8]:
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.embed_query(query)

    # Convert to NumPy arrays for easier calculation
    query_embedding_np = np.array(query_embedding)


    # Calculate cosine similarity
    cosine_similarity_scores = calculate_cosine_similarity_with_list(query_embedding_np, corpus_embeddings_np)


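    # torch.topk expects a tensor, so wrap the NumPy scores before selecting the top_k highest
    # (embedder.similarity is not available on the LangChain wrapper, so the next line stays disabled)
    #similarity_scores = embedder.similarity(query_embedding, corpus_embeddings)[0]
    scores, indices = torch.topk(torch.from_numpy(cosine_similarity_scores), k=top_k)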

    print("\nQuery:", query)
    print(f"Top {top_k} most similar sentences in corpus:")

    for score, idx in zip(scores, indices):
        print(f"(Score: {score:.4f})", corpus[idx])

    """
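    # Alternatively, util.semantic_search (from sentence_transformers import util) performs
    # cosine similarity + top-k in one call; it expects tensors, so the lists are converted first
    hits = util.semantic_search(torch.tensor(query_embedding), torch.tensor(corpus_embeddings), top_k=5)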
    hits = hits[0]      #Get the hits for the first query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
    """

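Note: the commented-out `embedder.similarity(...)` call fails with `AttributeError: 'HuggingFaceEmbeddings' object has no attribute 'similarity'`. The LangChain embedding wrapper does not expose that method, which is why this cell computes cosine similarity with NumPy instead.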
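
The top-k step can also be done in NumPy, which avoids converting the scores to a PyTorch tensor. The sketch below is one possible approach using `np.argsort`; the scores are made up for illustration and `toy_scores` is a hypothetical name, not a variable from the notebook.

```python
import numpy as np

# Made-up similarity scores, one per corpus sentence (for illustration only)
toy_scores = np.array([0.12, 0.87, 0.45, 0.91, 0.30, 0.66])
top_k = min(3, len(toy_scores))

# argsort sorts in ascending order, so reverse it and keep the first top_k indices
top_indices = np.argsort(toy_scores)[::-1][:top_k]
top_scores = toy_scores[top_indices]

for score, idx in zip(top_scores, top_indices):
    print(f"(Score: {score:.4f}) sentence index {idx}")
```

For a corpus this small a full sort is cheap. For much larger corpora, `np.argpartition` could select the top-k candidates without sorting every score, at the cost of a second, small sort of the selected entries.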