# Embedding models

> ***Important:*** *We won't need a GPU in this notebook and GPU usage is limited, so make sure you are* ***not*** *using a GPU runtime. Click on `Runtime` > `Change runtime type`, select `CPU`, and save.*

> *Also, to make sure there are no older sessions running, click on `Runtime` > `Manage sessions` > `Terminate other sessions`*

While we covered the absolute basics for text embeddings and common metrics in the last notebook, in this notebook we will learn how to use embedding models in the [SentenceTransformers](https://www.sbert.net/index.html) library and understand the difference between bi-encoders and cross-encoders. This foundation is essential before we can effectively fine-tune embedding models with synthetic data.

In [None]:
!pip install huggingface_hub[hf_xet]
!wget "https://drive.google.com/uc?export=download&id=1kTbWY9JJf0fFoqZGh6d-DRHQel6sT-9Y" -O ./sample_data.csv


In [None]:
import numpy as np
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer, CrossEncoder

## Load sample data

First of all, we will load the same example query and candidate passages from the last notebook

In [None]:
query = "Under what conditions does the AI Act exempt individual users from compliance when utilizing AI systems for personal, non-commercial purposes?"

In [None]:
df = pd.read_csv("sample_data.csv")

In [None]:
df.head()

| | passage_id | passage | relevant |
|---|---|---|---|
| 0 | 0 | Chapter I - GENERAL PROVISIONS\n\nArticle 2 - ... | True |
| 1 | 1 | Preamble\n\n(13)The notion of ‘deployer’ refer... | True |
| 2 | 2 | ANNEX IV\n\n(b) the design specifications of t... | False |
| 3 | 3 | Chapter IX - POST-MARKET MONITORING, INFORMATI... | False |
| 4 | 4 | Chapter IX - POST-MARKET MONITORING, INFORMATI... | False |


In [None]:
passages = df.passage.tolist()
len(passages)

100

We will also collect our relevancy labels

In [None]:
df.relevant.value_counts()

| relevant | count |
|---|---|
| False | 98 |
| True | 2 |


In [None]:
relevant_ids = [0, 1]  # only 2 out of 100 passages are relevant

We will also use a simplified version of our compute metrics function from the last notebook

In [None]:
def compute_metrics(ranked_indices, relevant_indices, k=10):
    """
    Calculate retrieval metrics: Accuracy@k, Precision@k, Recall@k, and nDCG@k.

    Args:
        ranked_indices: List or array of passage indices sorted by relevance score (descending)
        relevant_indices: List or array of indices that are considered relevant
        k: Cutoff point for calculating metrics (default: 10)

    Returns:
        Dictionary containing accuracy@k, precision@k, recall@k, and ndcg@k
    """
    import numpy as np

    # Ensure k is not larger than the number of ranked passages
    k = min(k, len(ranked_indices))

    # Get the top k passage indices
    top_k_indices = ranked_indices[:k]

    # Count relevant documents in top k
    relevant_in_top_k = sum(1 for idx in top_k_indices if idx in relevant_indices)

    # Calculate basic metrics
    accuracy_at_k = 1 if relevant_in_top_k > 0 else 0
    precision_at_k = relevant_in_top_k / k if k > 0 else 0
    # len() works for lists and NumPy arrays alike (truth-testing an array would raise)
    recall_at_k = relevant_in_top_k / len(relevant_indices) if len(relevant_indices) > 0 else 0

    # Calculate nDCG@k
    # For binary relevance, relevant documents have a gain of 1, irrelevant have 0
    gains = [1.0 if idx in relevant_indices else 0.0 for idx in top_k_indices]

    # Calculate DCG
    dcg = gains[0] if gains else 0.0  # First element has no discount
    for i in range(1, len(gains)):
        dcg += gains[i] / np.log2(i + 2)  # +2 because log_2(2) = 1

    # Calculate ideal DCG - create an ideal ranking with all relevant docs at the top
    # The number of relevant docs to consider is min(k, total number of relevant docs)
    num_relevant_to_consider = min(k, len(relevant_indices))
    ideal_gains = [1.0] * num_relevant_to_consider + [0.0] * (k - num_relevant_to_consider)
    ideal_gains = ideal_gains[:k]  # Ensure we only have k elements

    idcg = ideal_gains[0] if ideal_gains else 0.0
    for i in range(1, len(ideal_gains)):
        idcg += ideal_gains[i] / np.log2(i + 2)

    # Calculate nDCG
    ndcg_at_k = dcg / idcg if idcg > 0 else 0.0

    return {
        f"accuracy@{k}": accuracy_at_k,
        f"precision@{k}": precision_at_k,
        f"recall@{k}": recall_at_k,
        # float() handles both NumPy scalars and plain Python floats (e.g. when k=1 or idcg is 0)
        f"ndcg@{k}": float(ndcg_at_k)
    }
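
To make the nDCG arithmetic concrete, here is a minimal worked sketch on a hypothetical top-10 ranking in which the two relevant passages sit at ranks 1 and 6. The helper name `dcg` is an assumption for illustration; it applies the same binary gains and `log2(position + 2)` discount as `compute_metrics`.

```python
import math

def dcg(gains):
    # Position 0 is divided by log2(2) = 1, so the top rank is not discounted
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))

# Hypothetical top-10 ranking: relevant passages at ranks 1 and 6
gains = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
# Ideal ranking: all relevant passages moved to the top
ideal_gains = sorted(gains, reverse=True)

ndcg = dcg(gains) / dcg(ideal_gains)
print(round(ndcg, 4))  # 0.8316
```

Sorting the observed gains only yields the ideal ranking here because both relevant passages are inside the top 10; `compute_metrics` instead builds the ideal ranking from the total number of relevant passages.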

## Retrievers and rerankers

Most production-grade retrieval systems don't just consist of a single text embedding model. Rather, they combine a multitude of models working together to get the best speed-accuracy trade-off possible.



A usual setup looks like this:

<img src="https://drive.google.com/uc?export=view&id=1z53KpLXxKIG0arQ_p20pilNad7PlYcQ8" alt="Evaluation metrics" width="900">

1. **Retrieval stage**:
  - quickly retrieve a set of candidate passages given a search query
  - often combines:
    - keyword-based retrieval: BM25 (a powerful baseline that serves as a strong complement to embedding models in real-world applications, though not covered in this workshop)
    - embedding-based retrieval (the focus of this workshop)
2. **Reranking stage**:
  - reorder previously retrieved candidates with a slower and stronger model
  - can use encoder- or decoder-based classifiers or even LLMs
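
The two stages above can be sketched with plain Python. This is only an illustration under stated assumptions: `retrieve_then_rerank`, `word_overlap`, and `bigram_overlap` are hypothetical names, and the two toy scorers merely stand in for a fast retriever and a slower, stronger reranker.

```python
def word_overlap(query, passage):
    # Cheap stand-in for the retriever: number of shared words, order ignored
    return len(set(query.lower().split()) & set(passage.lower().split()))

def bigrams(text):
    words = text.lower().split()
    return set(zip(words, words[1:]))

def bigram_overlap(query, passage):
    # Stand-in for the reranker: shared adjacent word pairs, so word order matters
    return len(bigrams(query) & bigrams(passage))

def retrieve_then_rerank(query, corpus, fast_score, strong_score, top_k):
    # Stage 1: score every passage with the cheap scorer, keep top_k candidates
    ids = sorted(range(len(corpus)), key=lambda i: fast_score(query, corpus[i]), reverse=True)
    candidates = ids[:top_k]
    # Stage 2: rescore only the candidates with the expensive scorer
    return sorted(candidates, key=lambda i: strong_score(query, corpus[i]), reverse=True)

corpus = [
    "the cat sat on the mat",
    "use personal systems for ai today",
    "exempt ai systems for personal use",
    "stock prices fell sharply today",
]
ranked = retrieve_then_rerank(
    "ai systems for personal use", corpus, word_overlap, bigram_overlap, top_k=2
)
print(ranked)  # [2, 1]
```

Both candidates tie on word overlap, so the first stage cannot separate them; the second stage only has to score two passages instead of four and moves the better match to the top.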




## Bi-encoders

We start with bi-encoders, which is the architecture that most common embedding models are based on. Bi-encoders encode query and passage *separately*. This has the advantage that you can encode your entire knowledge base or corpus *offline and ahead of time*.

When using a bi-encoder for retrieval:
1. The passages in your corpus are encoded once and stored
2. When a query arrives, only the query needs to be encoded
3. The system then compares the query embedding with all passage embeddings using cosine similarity
4. Finally, passages are ranked from most to least similar

This approach is computationally efficient for large-scale retrieval since passage encoding is a one-time cost, and similarity calculation is fast even across millions of documents.
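
The four steps can be sketched with NumPy and made-up 3-dimensional vectors in place of real model outputs. The vectors are assumptions for illustration only; real embeddings would come from an embedding model.

```python
import numpy as np

# Step 1 (offline): passage embeddings are computed once and stored
passage_embs = np.array([[1.0, 0.0, 0.0],
                         [0.6, 0.8, 0.0],
                         [0.0, 0.0, 2.0]])
# Normalize once so that a dot product equals cosine similarity
passage_embs = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)

# Step 2 (query time): only the query is encoded
query_emb = np.array([1.0, 2.0, 0.0])
query_emb = query_emb / np.linalg.norm(query_emb)

# Step 3: one cosine score per passage via a single matrix-vector product
scores = passage_embs @ query_emb

# Step 4: rank passages from most to least similar
ranking = np.argsort(-scores).tolist()
print(ranking)  # [1, 0, 2]
```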

<details>
<summary><b>Click to view:</b> Diagram showing Bi-encoder's <b>separate</b> encoding of query and passage</summary>

<img src="https://drive.google.com/uc?export=view&id=14udYJnpAu5QYxeUd6eMaUAGoNG51ceTA" alt="Evaluation metrics" width="700">


</details>

In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")


Encoding a text with sentence transformers is as simple as this

In [None]:
query_emb = model.encode(query)
query_emb.shape

(384,)

In [None]:
query_emb[:10]

array([-0.02686959, -0.07888777, -0.01452074, -0.0716766 ,  0.02457212,
        0.03749802,  0.02059894,  0.00908657, -0.01796497, -0.00741236],
      dtype=float32)

We can also comfortably encode our list of candidate documents. This will take a few seconds to run

In [None]:
passage_embs = model.encode(passages)
passage_embs.shape

(100, 384)

We now have 100 vectors of 384 elements each. Next, we compute a similarity score for each query-passage pair

In [None]:
sims = model.similarity(query_emb, passage_embs)
sims.shape

torch.Size([1, 100])

In [None]:
sims

tensor([[0.7717, 0.5385, 0.4007, 0.5307, 0.4191, 0.5779, 0.3568, 0.5476, 0.5744,
         0.4548, 0.5159, 0.5565, 0.5175, 0.6209, 0.4088, 0.4672, 0.5983, 0.5893,
         0.5396, 0.3241, 0.5006, 0.5651, 0.5951, 0.5738, 0.5249, 0.6356, 0.4256,
         0.3525, 0.5025, 0.4571, 0.4683, 0.4796, 0.5881, 0.4217, 0.5294, 0.5329,
         0.5606, 0.5141, 0.5300, 0.5373, 0.5025, 0.4914, 0.4263, 0.5185, 0.4848,
         0.2944, 0.1869, 0.5220, 0.4948, 0.4210, 0.5452, 0.6115, 0.4732, 0.5236,
         0.3919, 0.6101, 0.3740, 0.2704, 0.5606, 0.6755, 0.4973, 0.2274, 0.6020,
         0.5554, 0.4715, 0.2727, 0.2960, 0.3810, 0.2514, 0.5048, 0.4612, 0.5003,
         0.4791, 0.4850, 0.3907, 0.3682, 0.1849, 0.5342, 0.5009, 0.2394, 0.5669,
         0.4717, 0.3463, 0.5619, 0.4878, 0.5378, 0.5586, 0.5474, 0.4921, 0.4608,
         0.5225, 0.4953, 0.5008, 0.4249, 0.4252, 0.5500, 0.5578, 0.5085, 0.2062,
         0.4432]])

We sort the scores (and IDs) in descending order, starting with the most relevant passage

In [None]:
sorted_sims, sorted_ids = torch.sort(sims, dim=1, descending=True)
sorted_ids = sorted_ids.squeeze().tolist()

And we compute our metrics

In [None]:
compute_metrics(sorted_ids, relevant_ids, 10)

{'accuracy@10': 1,
 'precision@10': 0.1,
 'recall@10': 0.5,
 'ndcg@10': 0.6131471927654584}

## Cross-encoders

In contrast to bi-encoders, cross-encoders process the query and passage *together* as a pair. This architecture allows the model to capture complex interactions between the query and passage text, leading to more accurate relevance judgments.

When using a cross-encoder for reranking:
1. The model takes both the query and a candidate passage as input, typically formatted as: `[CLS] query [SEP] passage [SEP]`
2. The model processes the entire sequence through self-attention layers, allowing tokens from the query to directly attend to tokens in the passage and vice versa
3. The final representation from the `[CLS]` token is passed through a classification layer to produce a relevance score
4. This process is repeated for each query-passage pair that needs to be ranked

Cross-encoders typically achieve higher accuracy than bi-encoders because they can model the direct interaction between query and passage. However, this comes at a computational cost - you must run the model for every query-passage pair, making it impractical as a first-stage retriever for large corpora.
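
A rough way to see the cost difference is to count model forward passes. The sketch below is a simplification under stated assumptions: the function names are hypothetical, the 1,000 queries are a made-up workload, and pass counts ignore sequence length and batching.

```python
def bi_encoder_passes(num_passages, num_queries):
    # Passages are encoded once offline; each query adds a single encoder pass
    return num_passages + num_queries

def cross_encoder_passes(num_passages, num_queries):
    # Every query-passage pair needs its own forward pass
    return num_passages * num_queries

def two_stage_passes(num_passages, num_queries, top_k):
    # Bi-encoder retrieval, then cross-encoder reranking of only top_k candidates
    return bi_encoder_passes(num_passages, num_queries) + num_queries * top_k

n_passages, n_queries = 100, 1_000
print(bi_encoder_passes(n_passages, n_queries))           # 1100
print(cross_encoder_passes(n_passages, n_queries))        # 100000
print(two_stage_passes(n_passages, n_queries, top_k=10))  # 11100
```

The gap widens with corpus size: the cross-encoder count grows with the product of queries and passages, while the bi-encoder count grows with their sum.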


<details>
<summary><b>Click to view:</b> Diagram showing Cross-encoder's <b>joint</b> processing of query and passage</summary>

<img src="https://drive.google.com/uc?export=view&id=1E7uWEA112WjvWhaaTe2n2CH9NMmp0Wij" alt="Evaluation metrics" width="700">

Try comparing this diagram with the bi-encoder diagram above.
</details>

In [None]:
reranker = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L-6-v2", default_activation_function=torch.nn.Sigmoid()
)


In [None]:
rerank_results = reranker.rank(query, passages)

In [None]:
rerank_results[:10]

[{'corpus_id': 0, 'score': np.float32(0.97436655)},
 {'corpus_id': 96, 'score': np.float32(0.90323603)},
 {'corpus_id': 55, 'score': np.float32(0.9008539)},
 {'corpus_id': 50, 'score': np.float32(0.873167)},
 {'corpus_id': 13, 'score': np.float32(0.86190397)},
 {'corpus_id': 1, 'score': np.float32(0.7287527)},
 {'corpus_id': 95, 'score': np.float32(0.63713706)},
 {'corpus_id': 44, 'score': np.float32(0.6318005)},
 {'corpus_id': 33, 'score': np.float32(0.5952161)},
 {'corpus_id': 16, 'score': np.float32(0.53559625)}]

In [None]:
ranks = [res["corpus_id"] for res in rerank_results]

In [None]:
compute_metrics(ranks, relevant_ids, 10)

{'accuracy@10': 1,
 'precision@10': 0.2,
 'recall@10': 1.0,
 'ndcg@10': 0.8315546295836225}

The reranker found both relevant passages within the top 10, which raises recall@10 from 0.5 to 1.0 and nDCG@10 from 0.61 to 0.83.

However, while the bi-encoder's query encoding and similarity scoring were nearly instantaneous, running the cross-encoder over all 100 query-passage pairs takes roughly 30-40 seconds.

## Bonus: PyTorch implementations

While we used the convenient [SentenceTransformers](https://www.sbert.net/index.html) library above, it hides some interesting details that help explain how bi-encoders and cross-encoders differ.

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

### Bi-encoders

We start with bi-encoders

In [None]:
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

In [None]:
passage = passages[0]

First we need to tokenize our inputs

In [None]:
query_tokens = tokenizer(query, padding=True, truncation=True, return_tensors='pt')
passage_tokens = tokenizer(passage, padding=True, truncation=True, return_tensors='pt')

In [None]:
query_tokens

{'input_ids': tensor([[  101,  2104,  2054,  3785,  2515,  1996,  9932,  2552, 11819,  3265,
          5198,  2013, 12646,  2043, 16911,  9932,  3001,  2005,  3167,  1010,
          2512,  1011,  3293,  5682,  1029,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1]])}

In [None]:
query_tokens["input_ids"].shape, passage_tokens["input_ids"].shape

(torch.Size([1, 26]), torch.Size([1, 42]))

Next, we process query and passage *separately* through the model

In [None]:
model.eval()
with torch.no_grad():
    query_outputs = model(**query_tokens)
    passage_outputs = model(**passage_tokens)

In [None]:
query_outputs.last_hidden_state.shape, passage_outputs.last_hidden_state.shape

(torch.Size([1, 26, 384]), torch.Size([1, 42, 384]))

We get one embedding vector *per token*. However, we need to convert these token-level embeddings to a single document-level embedding. The way this is done depends on the model, but a common method is to apply mean pooling.

In [None]:
def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    # Attention mask is applied to ignore padding tokens
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
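
To see what the attention mask does, here is a small NumPy sketch of the same masked mean on toy numbers. The values are made up for illustration; the last token position plays the role of padding.

```python
import numpy as np

# Toy batch: 1 sequence, 3 token positions, 2 dimensions; the last position is padding
token_embeddings = np.array([[[1.0, 2.0],
                              [3.0, 4.0],
                              [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])

mask = attention_mask[..., None].astype(float)   # shape (1, 3, 1), broadcasts over dimensions
summed = (token_embeddings * mask).sum(axis=1)   # padding positions contribute zero
counts = np.clip(mask.sum(axis=1), 1e-9, None)   # number of real tokens, guarded against 0
pooled = summed / counts
print(pooled)  # [[2. 3.]]
```

In this notebook each text is tokenized on its own, so no padding occurs and the mask is all ones; the mask starts to matter once texts of different lengths are batched together.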

In [None]:
query_emb = mean_pooling(query_outputs, query_tokens['attention_mask'])
passage_emb = mean_pooling(passage_outputs, passage_tokens['attention_mask'])

In [None]:
query_emb.shape, passage_emb.shape

(torch.Size([1, 384]), torch.Size([1, 384]))

We now have a single vector representing each document as a whole

Finally, we can compute the cosine similarity of these two *separately* encoded documents to get a single score which describes the relevancy of the passage given the query

In [None]:
def cosine_similarity(embedding1, embedding2):
    # Normalize the embeddings to unit length
    embedding1_normalized = F.normalize(embedding1.squeeze(), p=2, dim=0)
    embedding2_normalized = F.normalize(embedding2.squeeze(), p=2, dim=0)

    # Compute cosine similarity
    similarity = torch.dot(embedding1_normalized, embedding2_normalized)

    return similarity.item()

In [None]:
cosine_similarity(query_emb, passage_emb)

0.7716524600982666

### Cross-encoders

In [None]:
from transformers import AutoModelForSequenceClassification

In [None]:
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/ms-marco-MiniLM-L6-v2')
model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/ms-marco-MiniLM-L6-v2')

In contrast to above, we now tokenize and process each query-passage pair *together*

In [None]:
model_inputs = tokenizer(query, passage, padding=True, truncation=True, return_tensors='pt')

In [None]:
model_inputs

{'input_ids': tensor([[  101,  2104,  2054,  3785,  2515,  1996,  9932,  2552, 11819,  3265,
          5198,  2013, 12646,  2043, 16911,  9932,  3001,  2005,  3167,  1010,
          2512,  1011,  3293,  5682,  1029,   102,  3127,  1045,  1011,  2236,
          8910,  3720,  1016,  1011,  9531,  2184,  1012,  2023,  7816,  2515,
          2025,  6611,  2000, 14422,  1997, 21296,  2545,  2040,  2024,  3019,
          5381,  2478,  9932,  3001,  1999,  1996,  2607,  1997,  1037, 11850,
          3167,  2512,  1011,  2658,  4023,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1

In [None]:
tokenizer.decode(model_inputs["input_ids"][0])

'[CLS] under what conditions does the ai act exempt individual users from compliance when utilizing ai systems for personal, non - commercial purposes? [SEP] chapter i - general provisions article 2 - scope 10. this regulation does not apply to obligations of deployers who are natural persons using ai systems in the course of a purely personal non - professional activity. [SEP]'

The `[SEP]` token separates query and passage in the encoded input. The final hidden state of the `[CLS]` token is passed to the classification head, which returns a raw relevance logit. Applying a sigmoid maps this logit to a score between 0 and 1.
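
The pair layout and the `token_type_ids` from the tokenizer output can be mimicked in plain Python. `build_pair_input` is a hypothetical helper operating on pre-split words; the real tokenizer additionally applies WordPiece tokenization and maps tokens to IDs.

```python
def build_pair_input(query_tokens, passage_tokens):
    # Layout: [CLS] query [SEP] passage [SEP]
    tokens = ["[CLS]", *query_tokens, "[SEP]", *passage_tokens, "[SEP]"]
    # Segment 0 covers [CLS], the query and the first [SEP]; segment 1 covers the rest
    num_first_segment = len(query_tokens) + 2
    token_type_ids = [0] * num_first_segment + [1] * (len(passage_tokens) + 1)
    return tokens, token_type_ids

tokens, type_ids = build_pair_input(["ai", "act"], ["chapter", "i", "scope"])
print(tokens)    # ['[CLS]', 'ai', 'act', '[SEP]', 'chapter', 'i', 'scope', '[SEP]']
print(type_ids)  # [0, 0, 0, 0, 1, 1, 1, 1]
```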

In [None]:
model.eval()
with torch.no_grad():
    model_outputs = model(**model_inputs).logits

In [None]:
score = torch.sigmoid(model_outputs).item()
score

0.9743664264678955
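
Because the sigmoid is strictly increasing, it changes the scale of the scores but not their order. The short sketch below checks this on hypothetical logits (the numbers are made up, not model outputs).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw cross-encoder logits for three passages
logits = [3.64, -1.2, 0.5]
scores = [sigmoid(x) for x in logits]

order_by_logit = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
order_by_score = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(order_by_logit == order_by_score)  # True
```

For ranking alone the raw logits would therefore suffice; the sigmoid is mainly useful when scores should be readable as values between 0 and 1.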