# Calculating similarity with embedding models, including retrieval

This notebook uses the sentences from the UN general debate which were segmented in the [last notebook](10-prepare-data.ipynb). 

We will use different models for vectorizing the sentences (i.e. calculating the embeddings):
* multi-qa-MiniLM-L6-cos-v1 is recommended by [SBERT](https://sbert.net)
* embeddinggemma-300m is a small but powerful model from Google
* snowflake-arctic-embed-l-v2.0 is ranked quite high on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard
* Qwen3-Embedding-0.6B also ranks really well

The actual calculation can take from seconds to minutes, depending on the hardware. To save this time later, we save the embeddings in `numpy` format.

After this, the retrieval takes place. The retrieval function is commented step by step. Note the different ways in which questions can be distinguished from possible answers: some models expect a text prefix in the query, others a named prompt!

## Load data

In [None]:
import json
with open("sentences.json") as f:
    sentences = json.load(f)

## Encode sentences

Sentence-BERT (the `sentence-transformers` library) can be found at https://sbert.net

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

In [None]:
# can take a minute or two depending on CPU/GPU configuration
sembeddings = model.encode(sentences, show_progress_bar=True, normalize_embeddings=True)

In [None]:
len(sembeddings)

In [None]:
sembeddings.shape

In [None]:
import numpy as np
with open("sentences-mqa.npy", "wb") as f:
    np.save(f, sembeddings)
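
The saved file can be read back in a later session instead of encoding the sentences again. The following is a minimal sketch of that round trip; it uses a small random array and a temporary file (both are stand-ins chosen for illustration, not part of the notebook's data) so that it runs without a model:

```python
import os
import tempfile

import numpy as np

# Toy stand-in for `sembeddings`: 4 vectors with 3 dimensions each
toy_embeddings = np.random.default_rng(0).normal(size=(4, 3)).astype(np.float32)

# Write to a temporary directory so the real .npy files stay untouched
path = os.path.join(tempfile.mkdtemp(), "toy-embeddings.npy")
np.save(path, toy_embeddings)

# np.load restores shape, dtype and values
restored = np.load(path)
print(restored.shape, restored.dtype)
```

With the real data, `np.load("sentences-mqa.npy")` should return the array that was stored above, provided the file was written by the same notebook run.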

Many more models are available on Hugging Face.

Benchmark of models: https://huggingface.co/spaces/mteb/leaderboard

Search for all sentence similarity models: https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=trending

In [None]:
# superfast alternative using Model2Vec, speedup 400x on CPU to 25x on GPU:
# you can try it, but it is more focused on lexical than semantic retrieval
model_fast = SentenceTransformer("minishlab/potion-base-8M", device="cpu")
sembeddings_fast = model_fast.encode(sentences, show_progress_bar=True,
                                     normalize_embeddings=True)

## Alternative model

In [None]:
# optional: truncate_dim=<dimensions> shortens the embeddings
# optional on CPU: backend="openvino"
model2 = SentenceTransformer('google/embeddinggemma-300m')
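
The `truncate_dim` option mentioned in the comment shortens the embeddings. Conceptually, this means keeping only the leading components of each vector and rescaling it to unit length again. The sketch below shows that idea on random toy vectors with plain `numpy`; the helper name `truncate_and_normalize` is made up for this illustration, and whether truncated vectors still retrieve well depends on the model having been trained for it, which the model card has to confirm.

```python
import numpy as np

def truncate_and_normalize(embeddings, dim):
    """Keep the first `dim` components and rescale each row to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Toy data: 5 normalized vectors with 8 dimensions (not real embeddings)
rng = np.random.default_rng(42)
toy = rng.normal(size=(5, 8))
toy /= np.linalg.norm(toy, axis=1, keepdims=True)

small = truncate_and_normalize(toy, 4)
print(small.shape)
```

Renormalizing matters because the retrieval further down relies on normalized vectors when comparing scores.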

In [None]:
# can take a minute or two depending on CPU/GPU configuration
sembeddings2 = model2.encode(sentences, show_progress_bar=True,
                             normalize_embeddings=True)

In [None]:
sembeddings2.shape

In [None]:
# if we wanted, we could now quantize the embeddings to save space and improve performance:
from sentence_transformers.quantization import quantize_embeddings
binary_embeddings2 = quantize_embeddings(sembeddings2, precision="ubinary")
binary_embeddings2.shape
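
To make the shape of the quantized array easier to interpret: as far as I understand it, `"ubinary"` keeps one bit per dimension (set when the value is positive) and packs eight of these bits into one `uint8`, so the second dimension shrinks by a factor of 8. The following `numpy` sketch reproduces this idea on two hand-written toy vectors; the helpers `to_ubinary` and `hamming_distance` are illustrative names, not library functions.

```python
import numpy as np

def to_ubinary(embeddings):
    """One bit per dimension (1 if the value is positive), packed into uint8."""
    bits = embeddings > 0
    return np.packbits(bits, axis=-1)

def hamming_distance(packed_a, packed_b):
    """Number of differing bits between two packed vectors."""
    return int(np.unpackbits(np.bitwise_xor(packed_a, packed_b)).sum())

# Two toy vectors with 8 dimensions; they differ in sign only in the last one
toy = np.array([
    [0.5, -0.1, 0.3, -0.7, 0.2, 0.9, -0.4, -0.6],
    [0.4, -0.2, 0.1, -0.5, 0.3, 0.8, -0.1,  0.2],
])

packed = to_ubinary(toy)
distance = hamming_distance(packed[0], packed[1])
print(packed.shape, packed.dtype)  # (2, 1) uint8
print(distance)                    # 1
```

Binary vectors are usually compared by Hamming distance (fewer differing bits means more similar) rather than by the dot product used in the rest of this notebook.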

In [None]:
with open("sentences-gemma.npy", "wb") as f:
    np.save(f, sembeddings2)

## One more alternative

In [None]:
model3 = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0", trust_remote_code=True)

In [None]:
sembeddings3 = model3.encode(sentences, show_progress_bar=True, normalize_embeddings=True)

In [None]:
sembeddings3.shape

In [None]:
with open("sentences-arctic.npy", "wb") as f:
    np.save(f, sembeddings3)

## `Qwen/Qwen3-Embedding-0.6B` ranks really well

In [None]:
model4 = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)

In [None]:
sembeddings4 = model4.encode(sentences, show_progress_bar=True, normalize_embeddings=True)

In [None]:
sembeddings4.shape

In [None]:
with open("sentences-qwen.npy", "wb") as f:
    np.save(f, sembeddings4)

## Retrieval

In [None]:
import pandas as pd

def search(query, text, corpus_embeddings, model, query_prompt_name=None, top=20):
    # Encode the query with the same model that produced the corpus embeddings;
    # query_prompt_name selects one of the model's predefined prompts (if any)
    question_embedding = model.encode(query, normalize_embeddings=True, prompt_name=query_prompt_name)

    # Determine similarity between the query and every sentence (vectors are normalized)
    sim = model.similarity(question_embedding, corpus_embeddings)[0].numpy()
    # Alternative: sim = np.dot(corpus_embeddings, question_embedding)

    # Sort by descending score and keep the `top` most similar sentences
    hits = [{"id": i, "text": text[i], "score": sim[i]}
            for i in sim.argsort()[::-1][:top]]

    # Return as dataframe
    return pd.DataFrame(hits)
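
The core of the function (dot product on normalized vectors, then sorting) can be tried without any model. The sketch below uses three hand-written two-dimensional unit vectors as a toy corpus; `top_k` is a hypothetical helper that mirrors the alternative `np.dot` line from the comment above.

```python
import numpy as np

def top_k(query_vec, corpus, k=2):
    """Return (index, score) pairs of the k rows most similar to query_vec."""
    # For unit-length vectors the dot product equals the cosine similarity
    scores = corpus @ query_vec
    # argsort is ascending, so reverse it and keep the first k indices
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy corpus of three unit vectors and one unit-length query
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query = np.array([0.8, 0.6])

hits = top_k(query, corpus)
print(hits)
```

Row 2 comes first with a score of about 0.96, followed by row 0 with 0.8; row 1 (0.6) is cut off by `k=2`.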

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', 0)

In [None]:
m1df = search("Is the climate crisis worse for poorer countries?", sentences, sembeddings, model)
m1df

In [None]:
m2adf = search("Is the climate crisis worse for poorer countries?", sentences, sembeddings2, model2)
m2adf

In [None]:
m2bdf = search("task: search result | query: Is the climate crisis worse for poorer countries?", sentences, sembeddings2, model2)
m2bdf

In [None]:
# difference is big
set(m2bdf["id"]).symmetric_difference(set(m2adf["id"]))
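
The symmetric difference lists the ids that appear in only one of the two result sets. If a single number is easier to compare across models, the Jaccard overlap is one possible choice. This is a small sketch with made-up id lists (they are not results from this notebook), and `jaccard` is a helper defined only for this example.

```python
def jaccard(ids_a, ids_b):
    """Share of ids that both result lists have in common (1.0 = identical sets)."""
    a, b = set(ids_a), set(ids_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Made-up hit ids for illustration only
ids_without_prompt = [3, 7, 11, 42, 56]
ids_with_prompt = [7, 11, 42, 90, 101]

only_in_one = set(ids_without_prompt).symmetric_difference(ids_with_prompt)
overlap = jaccard(ids_without_prompt, ids_with_prompt)
print(sorted(only_in_one), round(overlap, 3))
```

With the dataframes from this notebook, the call would be `jaccard(m2adf["id"], m2bdf["id"])`.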

In [None]:
m3adf = search("Is the climate crisis worse for poorer countries?", sentences, sembeddings3, model3)
m3adf

In [None]:
m3bdf = search("Is the climate crisis worse for poorer countries?", sentences, sembeddings3, model3, 
               query_prompt_name="query")
m3bdf

In [None]:
# again a big difference in matches
set(m3bdf["id"]).symmetric_difference(set(m3adf["id"]))

In [None]:
m4adf = search("Is the climate crisis worse for poorer countries?", sentences, sembeddings4, model4)
m4adf

In [None]:
m4bdf = search("Is the climate crisis worse for poorer countries?", sentences, sembeddings4, model4, 
               query_prompt_name="query")
m4bdf

In [None]:
# only a minor difference in matches
set(m4bdf["id"]).symmetric_difference(set(m4adf["id"]))