# Calculating similarity with an embedding model including retrieval

This notebook uses the sentences from the UN general debate that were segmented in the [previous notebook](10-prepare-data.ipynb).

We will use different models for vectorizing the sentences (i.e. calculating the embeddings):
* `multi-qa-MiniLM-L6-cos-v1` is recommended by [SBERT](https://sbert.net)
* `embeddinggemma-300m` is a small but powerful model from Google
* `snowflake-arctic-embed-l-v2.0` ranks quite high on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard
* `Qwen3-Embedding-0.6B` also ranks really well

Computing the embeddings can take anywhere from seconds to minutes, depending on the hardware. To avoid recomputing them later, we save the embeddings in `numpy` format.

After this, the retrieval takes place. The retrieval function is documented with extensive comments. Notice the different ways in which queries can be distinguished from possible answers, for example by a prompt name or by a task prefix in the query text!

## Load data

In [1]:
import json
with open("sentences.json") as f:
    sentences = json.load(f)

## Encode sentences

Sentence-BERT (SBERT) can be found at https://sbert.net

In [2]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/multi-qa-MiniLM-L6-cos-v1
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


In [3]:
# can take a minute or two depending on CPU/GPU configuration
sembeddings = model.encode(sentences, show_progress_bar=True, normalize_embeddings=True)

Batches:   0%|          | 0/574 [00:00<?, ?it/s]

In [4]:
len(sembeddings)

18342

In [5]:
sembeddings.shape

(18342, 384)

In [8]:
import numpy as np
with open("sentences-mqa.npy", "wb") as f:
    np.save(f, sembeddings)
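
In a later session the embeddings can be read back instead of being recomputed. The following is a minimal sketch of that round trip: the small random array only stands in for `sembeddings`, and the file is written to a temporary directory (an assumption made here so that the real file is not overwritten).

```python
import os
import tempfile

import numpy as np

# Stand-in for `sembeddings`: 5 unit-length vectors with 384 dimensions
rng = np.random.default_rng(0)
demo = rng.normal(size=(5, 384)).astype(np.float32)
demo /= np.linalg.norm(demo, axis=1, keepdims=True)

# Write to a temporary directory so the real file is not overwritten
path = os.path.join(tempfile.mkdtemp(), "sentences-mqa.npy")
np.save(path, demo)

# Later: load the array instead of calling model.encode() again
loaded = np.load(path)
print(loaded.shape, loaded.dtype)
```

The order of the rows is the only link between an embedding and its sentence, so `sentences.json` and the `.npy` file have to be kept in sync.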

Many more models are available on Hugging Face.

Benchmark of models: https://huggingface.co/spaces/mteb/leaderboard

Search for all sentence similarity models: https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=trending

In [9]:
# very fast alternative using Model2Vec (static embeddings),
# speedup from about 400x on CPU to about 25x on GPU:
# you can try it, but it is more focused on lexical than semantic retrieval
model_fast = SentenceTransformer("minishlab/potion-base-8M", device="cpu")
sembeddings_fast = model_fast.encode(sentences, show_progress_bar=True,
                                     normalize_embeddings=True)

Batches:   0%|          | 0/574 [00:00<?, ?it/s]

### Alternative model: `google/embeddinggemma-300m`

In [10]:
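# option: truncate_dim=dimensions
# option for cpu: backend="openvino"
# note: this is a gated model; request access on its Hugging Face page and
# authenticate first, otherwise loading fails with the 401 error shown below
model2 = SentenceTransformer('google/embeddinggemma-300m')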

No sentence-transformers model found with name google/embeddinggemma-300m. Creating a new one with mean pooling.


OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/google/embeddinggemma-300m.
401 Client Error. (Request ID: Root=1-698e7486-6000cfc56d400e933a812604;1bddb74e-cb15-4e87-9e9e-0e1ad262dccb)

Cannot access gated repo for url https://huggingface.co/google/embeddinggemma-300m/resolve/main/config.json.
Access to model google/embeddinggemma-300m is restricted. You must have access to it and be authenticated to access it. Please log in.
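
The error above means that the session is not authenticated for the gated repository. One way to fix this, after access has been granted on the model page, is to log in with an access token. The sketch below assumes the token is stored in an environment variable called `HF_TOKEN`; the helper name `login_from_env` is hypothetical, while `huggingface_hub.login` is the library call that performs the login.

```python
import os


def login_from_env(environ=os.environ, var="HF_TOKEN"):
    """Log in to Hugging Face if a token is available; return True on login."""
    token = environ.get(var)
    if not token:
        print(f"{var} is not set; gated models will fail with a 401 error.")
        return False
    # Imported lazily so the check above works without the package installed
    from huggingface_hub import login
    login(token=token)
    return True
```

Calling `login_from_env()` once before `SentenceTransformer('google/embeddinggemma-300m')` should be enough; the cells below depend on `model2` and fail with a `NameError` as long as loading does not succeed.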

In [11]:
# can take a minute or two depending on CPU/GPU configuration
sembeddings2 = model2.encode(sentences, show_progress_bar=True, 
                             normalize_embeddings=True)

NameError: name 'model2' is not defined

In [None]:
sembeddings2.shape
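
The `truncate_dim` option mentioned above shortens the vectors, which only makes sense for models trained so that the leading dimensions carry most of the information (Matryoshka-style training). Assuming that holds for the model in use, the same effect can be sketched after encoding: keep the first dimensions and renormalize. The random array stands in for `sembeddings2`, and the size 768 is an assumption about the model's output dimension.

```python
import numpy as np


def truncate_and_renormalize(embeddings, dim):
    """Keep the first `dim` dimensions and rescale every row to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms


# Stand-in for `sembeddings2`: 4 normalized vectors with 768 dimensions
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 768)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_and_renormalize(full, 256)
print(small.shape)
```

Renormalizing matters because the retrieval below relies on unit-length vectors.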

In [12]:
# if we wanted, we could now quantize the embeddings to save space and speed up retrieval:
from sentence_transformers.quantization import quantize_embeddings
binary_embeddings2 = quantize_embeddings(sembeddings2, precision="ubinary")
binary_embeddings2.shape

NameError: name 'sembeddings2' is not defined
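
Since the cell above could not run here, the idea behind `"ubinary"` quantization can be illustrated with plain `numpy`. This is a rough sketch of the principle (one sign bit per dimension, eight bits packed into one byte), not the library's implementation; the random array stands in for the real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 768)).astype(np.float32)

# One bit per dimension (1 if the value is positive), packed 8 bits per byte
packed = np.packbits(emb > 0, axis=1)
print(packed.shape, packed.dtype)   # (4, 96) uint8


def hamming_distances(query_bits, corpus_bits):
    """Number of differing bits between one packed query and each corpus row."""
    xor = np.bitwise_xor(corpus_bits, query_bits)
    return np.unpackbits(xor, axis=1).sum(axis=1)


# Distance of the first vector to all vectors (0 to itself)
print(hamming_distances(packed[0], packed))
```

Each vector shrinks from 768 `float32` values to 96 bytes, a factor of 32, and similarity becomes a Hamming distance (smaller is more similar).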

In [None]:
with open("sentences-gemma.npy", "wb") as f:
    np.save(f, sembeddings2)

### One more alternative: `Snowflake/snowflake-arctic-embed-l-v2.0`

In [17]:
model3 = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0", trust_remote_code=True)

In [18]:
sembeddings3 = model3.encode(sentences, show_progress_bar=True, normalize_embeddings=True)

Batches:   0%|          | 0/574 [00:00<?, ?it/s]

In [19]:
sembeddings3.shape

(18342, 1024)

In [None]:
with open("sentences-arctic.npy", "wb") as f:
    np.save(f, sembeddings3)

### `Qwen/Qwen3-Embedding-0.6B` ranks really well

In [13]:
model4 = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)

In [14]:
sembeddings4 = model4.encode(sentences, show_progress_bar=True, normalize_embeddings=True)

Batches:   0%|          | 0/574 [00:00<?, ?it/s]

In [15]:
sembeddings4.shape

(18342, 1024)

In [16]:
with open("sentences-qwen.npy", "wb") as f:
    np.save(f, sembeddings4)

## Retrieval

In [21]:
import pandas as pd

def search(query, text, corpus_embeddings, model, query_prompt_name=None, top=20):
    # Encode the query with the same model that produced the corpus embeddings.
    # Some models distinguish queries from documents by a prompt, which is
    # selected here via prompt_name (None means: encode the query as is)
    question_embedding = model.encode(query, normalize_embeddings=True,
                                      prompt_name=query_prompt_name)

    # Determine the similarity of the query to every sentence (vectors are normalized)
    sim = model.similarity(question_embedding, corpus_embeddings)[0].numpy()
    # Alternative: sim = np.dot(corpus_embeddings, question_embedding)

    # Get the `top` most similar sentences by sorting, highest score first
    hits = [{"id": i, "text": text[i], "score": sim[i]}
            for i in sim.argsort()[::-1][:top]]

    # Return as dataframe
    return pd.DataFrame(hits)
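
The `np.dot` alternative mentioned in the function works because all vectors have length 1, so the dot product equals the cosine similarity. The following sketch shows this on random stand-in data and also uses `np.argpartition`, which avoids sorting the whole score array; that is a possible optimization, not what `search` does.

```python
import numpy as np

# Stand-in corpus: 1000 normalized vectors with 384 dimensions
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 384)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Query: a slightly perturbed copy of row 42, normalized again
query = corpus[42] + 0.01 * rng.normal(size=384).astype(np.float32)
query /= np.linalg.norm(query)

# Dot product = cosine similarity, because all vectors have length 1
sim = corpus @ query

top = 5
# argpartition finds the `top` largest scores without sorting everything ...
candidates = np.argpartition(sim, -top)[-top:]
# ... and only these few candidates are then sorted, highest score first
best = candidates[np.argsort(sim[candidates])[::-1]]
print(best, sim[best])
```

For 18342 sentences a full `argsort` is fast enough; the partial sort only pays off for much larger corpora.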

In [20]:
import pandas as pd
pd.set_option('display.max_colwidth', 0)

In [22]:
m1df = search("Is the climate crisis worse for poorer countries?", sentences, sembeddings, model)
m1df

Unnamed: 0,id,text,score
0,3353,Nowhere is that more critical than the accelerating climate crisis.,0.724767
1,5374,Concerning the climate crisis.,0.720952
2,2931,"Despite having contributed the least to climate change, it is the poorest and most vulnerable parts of the world that suffer the most devastating consequences.",0.699785
3,2474,"And yet, the climate crisis is wreaking havoc.",0.683676
4,9937,"Poor, vulnerable, climate-distressed and resource-challenged developing countries are absolutely fed up and insulted by the unfulfilled perennial promises of the developed world on climate financing.",0.68083
5,8765,We know that those who have done the least to cause the climate crisis are those most vulnerable to its effects.,0.679045
6,2926,The climate emergency is worsening.,0.67823
7,14348,"Developing countries, such as Cote d’Ivoire, which are only marginally responsible for climate change, are disproportionately affected and are suffering the most from its consequences.",0.675712
8,17405,Climate is far from the only crisis the world faces.,0.673427
9,3981,"The climate crisis is indeed impacting health security, food security, water security, economic security and peace security.",0.67158


In [23]:
m2adf = search("Is the climate crisis worse for poorer countries?", sentences, sembeddings2, model2)
m2adf

NameError: name 'sembeddings2' is not defined

In [24]:
m2bdf = search("task: search result | query: Is the climate crisis worse for poorer countries?", sentences, sembeddings2, model2)
m2bdf

NameError: name 'sembeddings2' is not defined
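
The two calls above differ only in whether the query text carries a task prefix. Passing `prompt_name` to `encode` has the same effect: the prompt registered under that name is prepended to the text before encoding. The sketch below mimics this with plain strings. The prompt table is hypothetical: the query prefix is copied from the cell above, the document prefix is an assumption, and a loaded model exposes its actual table as `model.prompts`.

```python
# Hypothetical prompt table; a loaded SentenceTransformer exposes its own
# mapping as `model.prompts`
PROMPTS = {
    "query": "task: search result | query: ",
    "document": "title: none | text: ",
}


def apply_prompt(text, prompt_name=None, prompts=PROMPTS):
    """Prepend the named prompt, mimicking encode(..., prompt_name=...)."""
    if prompt_name is None:
        return text
    return prompts[prompt_name] + text


print(apply_prompt("Is the climate crisis worse for poorer countries?", "query"))
```

If a model was trained with such prompts, encoding queries without them compares vectors the model never learned to align, which is one plausible reason for the differing results.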

In [None]:
# difference is big
set(m2bdf["id"]).symmetric_difference(set(m2adf["id"]))

In [25]:
m3adf = search("Is the climate crisis worse for poorer countries?", sentences, sembeddings3, model3)
m3adf

Unnamed: 0,id,text,score
0,2931,"Despite having contributed the least to climate change, it is the poorest and most vulnerable parts of the world that suffer the most devastating consequences.",0.716447
1,8765,We know that those who have done the least to cause the climate crisis are those most vulnerable to its effects.,0.699156
2,8330,We remain concerned that countries that have contributed less to the global emission of greenhouse gases continue to be disproportionately affected by the impacts of climate change.,0.678113
3,10331,"Developing countries have made progress in reducing carbon emissions, but we continue to be the most affected by climate disasters.",0.67678
4,1593,"While our countries are the smallest contributors to global climate change, we find ourselves on the front lines of the crisis.",0.671569
5,15596,Those rich countries have failed to deliver on their own promise to finance climate change adaptation.,0.670178
6,3137,Too late for us to save as many as we can from the climate crisis?,0.665923
7,8922,Another problem is that most climate finance is provided to low- and middle-income countries in the form of loans.,0.659857
8,10486,"Climate change exacerbates social and economic inequalities, too.",0.659016
9,12476,"We know that many States here, in particular the most vulnerable — those that have contributed the least to global warming and have burned the least fossil fuels — are the ones that have suffered the most from the climate crisis.",0.649419


In [26]:
m3bdf = search("Is the climate crisis worse for poorer countries?", sentences, sembeddings3, model3, 
               query_prompt_name="query")
m3bdf

Unnamed: 0,id,text,score
0,2931,"Despite having contributed the least to climate change, it is the poorest and most vulnerable parts of the world that suffer the most devastating consequences.",0.613402
1,461,More than half of the world’s top 50 most climate-vulnerable countries are home to 40 per cent of people living in extreme poverty.,0.590028
2,10331,"Developing countries have made progress in reducing carbon emissions, but we continue to be the most affected by climate disasters.",0.582705
3,14416,"In the end, the most affected are always the poorest countries and peoples of the world, who are suffering from inflation, food shortages and high fuel prices.",0.569932
4,18320,"The effects of climate change are causing suffering to the most vulnerable communities, especially small island developing States, least developed countries and those affected by conflict.",0.565447
5,14348,"Developing countries, such as Cote d’Ivoire, which are only marginally responsible for climate change, are disproportionately affected and are suffering the most from its consequences.",0.561062
6,9937,"Poor, vulnerable, climate-distressed and resource-challenged developing countries are absolutely fed up and insulted by the unfulfilled perennial promises of the developed world on climate financing.",0.557159
7,12303,"Developing countries, particularly least developed countries, are currently the most vulnerable to the severe consequences of climate change, natural disasters and diseases.",0.556326
8,7636,"These crises are hitting hardest those who are least responsible for their creation — vulnerable populations, women and children and the world’s poorest peoples.",0.556109
9,576,"It is also no secret that those who are least responsible for climate change are the ones suffering the most from its effects, particularly small island developing States.",0.555903


In [29]:
# again a big difference in matches
set(m3bdf["id"]).symmetric_difference(set(m3adf["id"]))

{461,
 502,
 1407,
 1593,
 3137,
 4863,
 5233,
 7636,
 9606,
 10030,
 10158,
 10486,
 11513,
 12303,
 12650,
 14257,
 14348,
 14416,
 15596,
 16366,
 17405,
 18121}

In [27]:
m4adf = search("Is the climate crisis worse for poorer countries?", sentences, sembeddings4, model4)
m4adf

Unnamed: 0,id,text,score
0,2931,"Despite having contributed the least to climate change, it is the poorest and most vulnerable parts of the world that suffer the most devastating consequences.",0.675578
1,8765,We know that those who have done the least to cause the climate crisis are those most vulnerable to its effects.,0.656746
2,5233,The climate crisis is another challenge that exacerbates the economic divide between nations and impedes humankind’s sustainable development.,0.652542
3,14348,"Developing countries, such as Cote d’Ivoire, which are only marginally responsible for climate change, are disproportionately affected and are suffering the most from its consequences.",0.647372
4,9606,It is unjust that the countries that paid the highest human price are bearing the heaviest climate burden.,0.641751
5,16366,"The enrichment of those countries comes at the cost of misery for others that contribute less to pollution and are, coincidentally, among the very poorest.",0.636018
6,18320,"The effects of climate change are causing suffering to the most vulnerable communities, especially small island developing States, least developed countries and those affected by conflict.",0.634239
7,11513,"We also acknowledge that our people, the people of the small island developing States, those who are least culpable for the climate crisis, are the ones who continue to be most disproportionately affected.",0.633259
8,12476,"We know that many States here, in particular the most vulnerable — those that have contributed the least to global warming and have burned the least fossil fuels — are the ones that have suffered the most from the climate crisis.",0.632228
9,9937,"Poor, vulnerable, climate-distressed and resource-challenged developing countries are absolutely fed up and insulted by the unfulfilled perennial promises of the developed world on climate financing.",0.62008


In [28]:
m4bdf = search("Is the climate crisis worse for poorer countries?", sentences, sembeddings4, model4, 
               query_prompt_name="query")
m4bdf

Unnamed: 0,id,text,score
0,2931,"Despite having contributed the least to climate change, it is the poorest and most vulnerable parts of the world that suffer the most devastating consequences.",0.674338
1,5233,The climate crisis is another challenge that exacerbates the economic divide between nations and impedes humankind’s sustainable development.,0.658558
2,14348,"Developing countries, such as Cote d’Ivoire, which are only marginally responsible for climate change, are disproportionately affected and are suffering the most from its consequences.",0.657744
3,9937,"Poor, vulnerable, climate-distressed and resource-challenged developing countries are absolutely fed up and insulted by the unfulfilled perennial promises of the developed world on climate financing.",0.656781
4,8765,We know that those who have done the least to cause the climate crisis are those most vulnerable to its effects.,0.65247
5,11513,"We also acknowledge that our people, the people of the small island developing States, those who are least culpable for the climate crisis, are the ones who continue to be most disproportionately affected.",0.652062
6,16366,"The enrichment of those countries comes at the cost of misery for others that contribute less to pollution and are, coincidentally, among the very poorest.",0.640291
7,12476,"We know that many States here, in particular the most vulnerable — those that have contributed the least to global warming and have burned the least fossil fuels — are the ones that have suffered the most from the climate crisis.",0.637314
8,18320,"The effects of climate change are causing suffering to the most vulnerable communities, especially small island developing States, least developed countries and those affected by conflict.",0.633459
9,14058,"The cross-border financial impacts of crises, such as climate change and the pandemic, are impeding the ability of smaller indebted countries, such as mine, to make progress on the SDGs and climate adaptation and mitigation.",0.633324


In [None]:
# only a minor difference in matches
set(m4bdf["id"]).symmetric_difference(set(m4adf["id"]))
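
The symmetric difference lists the ids that differ, but a single number is easier to compare across models. A minimal sketch using the Jaccard index (shared ids divided by all ids): the lists below are copied from the ten rows displayed in the tables above, so they cover the top 10 only, whereas the cells compare the top 20.

```python
def jaccard(ids_a, ids_b):
    """Share of ids that occur in both result lists (1.0 = identical sets)."""
    a, b = set(ids_a), set(ids_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Top-10 ids as displayed above (without / with query prompt)
m3a_top10 = [2931, 8765, 8330, 10331, 1593, 15596, 3137, 8922, 10486, 12476]
m3b_top10 = [2931, 461, 10331, 14416, 18320, 14348, 9937, 12303, 7636, 576]
m4a_top10 = [2931, 8765, 5233, 14348, 9606, 16366, 18320, 11513, 12476, 9937]
m4b_top10 = [2931, 5233, 14348, 9937, 8765, 11513, 16366, 12476, 18320, 14058]

print(f"arctic: {jaccard(m3a_top10, m3b_top10):.2f}")   # arctic: 0.11
print(f"qwen:   {jaccard(m4a_top10, m4b_top10):.2f}")   # qwen:   0.82
```

With the dataframes from above, the same function can be called as `jaccard(m4adf["id"], m4bdf["id"])`.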