<a href="https://colab.research.google.com/github/fubotz/IR_2025S/blob/main/Dense_Retrieval_Example_SCHAMBECK_Fabian.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dense Retrieval Using DPR

Install the required packages: `datasets` and `faiss-cpu`.

In [20]:
%%capture
%pip install datasets faiss-cpu

In [21]:
import torch

from datasets import load_dataset
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

## Data Preprocessing and Indexing

Load the DPR **context** encoder and its tokenizer from the Hugging Face Hub.  
They are used to encode the contexts, i.e. the documents to be indexed.

In [22]:
torch.set_grad_enabled(False)

ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

Some weights of the model checkpoint at facebook/dpr-ctx_encoder-single-nq-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokeniz

Load the first 100 lines of a dataset (*Crime and Punishment*):

In [23]:
ds = load_dataset('crime_and_punish', split='train[:100]'); ds

Dataset({
    features: ['line'],
    num_rows: 100
})

In [24]:
ds[:20]

{'line': ['CRIME AND PUNISHMENT\r\n',
  '\r\n',
  '\r\n',
  '\r\n',
  '\r\n',
  'PART I\r\n',
  '\r\n',
  '\r\n',
  '\r\n',
  'CHAPTER I\r\n',
  '\r\n',
  'On an exceptionally hot evening early in July a young man came out of\r\n',
  'the garret in which he lodged in S. Place and walked slowly, as though\r\n',
  'in hesitation, towards K. bridge.\r\n',
  '\r\n',
  'He had successfully avoided meeting his landlady on the staircase. His\r\n',
  'garret was under the roof of a high, five-storied house and was more\r\n',
  'like a cupboard than a room. The landlady who provided him with garret,\r\n',
  'dinners, and attendance, lived on the floor below, and every time\r\n',
  'he went out he was obliged to pass her kitchen, the door of which\r\n']}

Prepare the data for indexing: convert each line to an *embedding vector*

In [25]:
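# Encode every line with the context encoder.
# [0] selects the pooled output, [0] again the single sequence in the batch.
ds_with_embeddings = ds.map(
    lambda example: {
        'embeddings': ctx_encoder(**ctx_tokenizer(example["line"], return_tensors="pt"))[0][0].numpy()
    }
)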
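The preview of `ds[:20]` shows that many rows contain only `'\r\n'`, so a good share of the embeddings encode empty text. A minimal sketch of a cleaning step that could run before encoding is shown below; `clean_lines` and `sample` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
def clean_lines(lines):
    """Strip surrounding whitespace and drop lines that are empty afterwards."""
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if line]


# A few rows in the same shape as the dataset preview above.
sample = ['CRIME AND PUNISHMENT\r\n', '\r\n', 'PART I\r\n', '\r\n']
cleaned = clean_lines(sample)
print(cleaned)
```

On the dataset itself, the same condition could presumably be applied with `ds.filter(lambda example: bool(example["line"].strip()))` before calling `map`. Note that filtering changes the row numbers, so indices from a filtered index no longer match the original 100 rows.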

... and index them

In [26]:
ds_with_embeddings.add_faiss_index(column='embeddings')

Dataset({
    features: ['line', 'embeddings'],
    num_rows: 100
})

In [27]:
ds_with_embeddings

Dataset({
    features: ['line', 'embeddings'],
    num_rows: 100
})

## Retrieval

Load the DPR **question** encoder and its tokenizer from the Hugging Face Hub.  
They are used to encode the query.

In [28]:
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

Some weights of the model checkpoint at facebook/dpr-question_encoder-single-nq-base were not used when initializing DPRQuestionEncoder: ['question_encoder.bert_model.pooler.dense.bias', 'question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRQuestionEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRQuestionEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Tokenize and encode a query:

In [29]:
question = "Is it serious ?"
question_embedding = q_encoder(**q_tokenizer(question, return_tensors="pt"))[0][0].numpy()

Then retrieve the five nearest lines:

In [30]:
scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', question_embedding, k=5)
retrieved_examples["line"]

['_that_ serious? It is not serious at all. It’s simply a fantasy to amuse\r\n',
 'and complaints, and to rack his brains for excuses, to prevaricate, to\r\n',
 'CRIME AND PUNISHMENT\r\n',
 'for him. But to be stopped on the stairs, to be forced to listen to her\r\n',
 'trivial, irrelevant gossip, to pestering demands for payment, threats\r\n']

In [31]:
scores

array([ 94.92376 ,  95.6583  ,  98.723434, 103.86665 , 106.68605 ],
      dtype=float32)
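The scores above increase from the best match to the worst, which is consistent with a distance rather than a similarity: a flat FAISS index with the L2 metric reports squared Euclidean distances, so smaller means closer. The sketch below reproduces that ranking with NumPy on toy vectors. `l2_search` and the sample data are illustrative assumptions, not part of `datasets` or FAISS.

```python
import numpy as np


def l2_search(doc_vectors, query_vector, k):
    """Brute-force search: return the k smallest squared L2 distances and their row indices."""
    diffs = doc_vectors - query_vector        # broadcast the query over all rows
    dists = np.sum(diffs * diffs, axis=1)     # squared Euclidean distance per row
    order = np.argsort(dists)[:k]             # ascending: closest first
    return dists[order], order


# Toy 2-dimensional "embeddings" (real DPR vectors are much larger).
docs = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]], dtype=np.float32)
query = np.array([1.0, 1.0], dtype=np.float32)

distances, indices = l2_search(docs, query, k=2)
print(distances, indices)
```

Here row 1 is the closest (squared distance 1), followed by row 0 (squared distance 2), mirroring the ascending order of `scores` in the notebook.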

Although the index was created through the Hugging Face dataset, the underlying FAISS index is still accessible through the `.faiss_index` property.  
It supports [more operations and methods](https://github.com/facebookresearch/faiss/wiki/Special-operations-on-indexes), such as [range search](https://github.com/facebookresearch/faiss/wiki/Special-operations-on-indexes#range-search).

In [32]:
faiss_index = ds_with_embeddings.get_index('embeddings').faiss_index
faiss_index

<faiss.swigfaiss_avx2.IndexFlat; proxy of <Swig Object of type 'faiss::IndexFlat *' at 0x791fb97031e0> >

In [33]:
# access the index with .faiss_index
faiss_index = ds_with_embeddings.get_index('embeddings').faiss_index

question = "murder"
question_embedding = q_encoder(**q_tokenizer(question, return_tensors="pt"))[0][0].numpy()

# use it for more complicated methods - e.g., range_search
limits, distances, indices = faiss_index.range_search(x=question_embedding.reshape(1, -1), thresh=0.5)

limits, distances, indices

(array([0, 0], dtype=uint64), array([], dtype=float32), array([], dtype=int64))
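The empty result is expected here: the distances reported elsewhere in this notebook are in the range of roughly 80 to 110, so no vector falls within a threshold of `0.5`. The three returned arrays use a flat layout in which the hits for query `i` occupy positions `limits[i]` to `limits[i + 1]`. A minimal sketch for unpacking that layout follows; `split_range_results` is a hypothetical helper written for this illustration.

```python
def split_range_results(limits, distances, indices):
    """Group flat range-search output into one list of (index, distance) pairs per query."""
    per_query = []
    for i in range(len(limits) - 1):
        start, end = int(limits[i]), int(limits[i + 1])
        hits = list(zip(indices[start:end], distances[start:end]))
        per_query.append(hits)
    return per_query


# Same shape as the output above: one query, no hits.
empty = split_range_results([0, 0], [], [])

# Made-up values for two queries with two hits and one hit respectively.
example = split_range_results([0, 2, 3], [0.1, 0.4, 0.2], [7, 3, 5])
print(empty, example)
```

Raising `thresh` above the smallest observed distance should make `range_search` return at least one hit for this query.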

In [34]:
distances, indices = faiss_index.search(x=question_embedding.reshape(1, -1), k=3)
distances, indices

(array([[ 82.41716,  99.00274, 104.89004]], dtype=float32),
 array([[ 0, 60, 34]]))
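Unlike `get_nearest_examples`, the raw FAISS `search` call returns only row indices, so the matching text has to be looked up separately. The sketch below shows one way to do that; `lookup_lines` and the stand-in `rows` list are assumptions for illustration. It skips negative indices because FAISS pads with `-1` when fewer than `k` results are available.

```python
def lookup_lines(rows, indices):
    """Map row indices from a search back to the 'line' field of each row."""
    return [rows[int(i)]["line"] for i in indices if int(i) >= 0]


# Stand-in for the dataset: any sequence of {'line': ...} records works.
rows = [{"line": "first"}, {"line": "second"}, {"line": "third"}]

found = lookup_lines(rows, [2, 0, -1])
print(found)
```

In the notebook, a call such as `lookup_lines(ds_with_embeddings, indices[0])` should return the lines for rows 0, 60 and 34.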

## Disk operations

How to save and load the index:

In [35]:
# to save:
ds_index_path = 'my_index.faiss'
ds_index_name = 'embeddings'

ds_with_embeddings.save_faiss_index(ds_index_name, ds_index_path)

In [36]:
# to load: reload the dataset, then attach the saved index under the same name
ds = load_dataset('crime_and_punish', split='train[:100]')
ds.load_faiss_index(ds_index_name, ds_index_path)