# Indexing: ColBERT (Contextualized Late Interaction over BERT)

![ColBERT](../images/images-ColBERT.png)

**ColBERT** is a retrieval model that improves search efficiency and accuracy by combining the deep language understanding of BERT with a late interaction mechanism that compares queries and documents at the token level.

**How ColBERT Works:**

1. **Independent Encoding:**
   - Both queries and documents are independently processed through BERT to generate token-level embeddings. 

2. **Late Interaction Mechanism:**
   - Instead of combining query and document embeddings early, ColBERT delays their interaction. It computes similarity scores between each query token embedding and all document token embeddings, capturing fine-grained relationships. 

3. **MaxSim Operation:**
   - For each query token, ColBERT selects the maximum similarity score (MaxSim) from the document tokens, emphasizing the most relevant matches. 

4. **Aggregation:**
   - The model sums these maximum similarity scores over all query tokens to produce a final relevance score for ranking documents.
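
Steps 2 to 4 can be written as a single scoring formula. The symbols below are chosen here for illustration: $E_q$ and $E_d$ are the matrices of token embeddings for query $q$ and document $d$, and the embeddings are assumed to be normalized so that the dot product acts as a cosine similarity.

$$
S(q, d) = \sum_{i=1}^{|E_q|} \; \max_{j = 1, \dots, |E_d|} \; E_{q_i} \cdot E_{d_j}
$$

The inner maximum is the MaxSim operation for query token $i$, and the outer sum is the aggregation step.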

**Benefits of ColBERT:**

- **Efficiency:** By precomputing document embeddings offline, ColBERT reduces online query processing time, enabling scalable BERT-based search over large text collections in tens of milliseconds. 

- **Effectiveness:** The fine-grained token-level interactions allow ColBERT to capture nuanced semantic relationships, improving retrieval accuracy. 

- **Scalability:** ColBERT's architecture supports efficient indexing and retrieval, making it suitable for large-scale information retrieval tasks. 

**Example in Practice:**

Imagine you're searching for information on "climate change impacts on agriculture." ColBERT processes your query and the documents as follows:

- **Query Processing:** Each token in your query is encoded into a vector that captures its contextual meaning.

- **Document Processing:** Each document is independently encoded into a matrix of token-level embeddings, one row per token, representing the contextual meaning of each token.

- **Similarity Calculation:** ColBERT computes similarity scores between each query token and all document tokens, identifying the most relevant matches.

- **Ranking:** Documents are ranked based on the aggregated similarity scores, with higher scores indicating more relevant content.

This process enables ColBERT to deliver precise search results by understanding the context and nuances of both the query and the documents.
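
The scoring and ranking steps can be sketched in a few lines of plain Python. This is a minimal illustration, not ColBERT's actual implementation: hand-made 2-dimensional vectors stand in for BERT token embeddings, and the names `normalize`, `maxsim_score`, `rank_documents`, and the toy documents are assumptions made for the example.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def maxsim_score(query_embs, doc_embs):
    """Late-interaction score: for each query token take the best-matching
    document token (MaxSim), then sum those maxima over the query."""
    q = [normalize(v) for v in query_embs]
    d = [normalize(v) for v in doc_embs]
    return sum(
        max(sum(a * b for a, b in zip(q_tok, d_tok)) for d_tok in d)
        for q_tok in q
    )


def rank_documents(query_embs, docs):
    """Return (name, score) pairs sorted from most to least relevant."""
    scored = [(name, maxsim_score(query_embs, embs)) for name, embs in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy "token embeddings": two query tokens and three small documents.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # exact match for each query token
    "doc_b": [[1.0, 1.0]],                           # one token partially matching both
    "doc_c": [[-1.0, 0.0]],                          # points away from the first query token
}

ranking = rank_documents(query, docs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In a real deployment the document matrices would be computed once, offline, and stored in the index; only the query is encoded at search time.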

For a more detailed understanding, refer to the ColBERTv2 research paper listed under Links below.

By implementing ColBERT, search systems can achieve a balance between the deep understanding provided by BERT and the efficiency required for large-scale retrieval tasks. 

The [RAGatouille](https://python.langchain.com/docs/integrations/retrievers/ragatouille/) library includes support for ColBERT.

Links:

- [ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction](https://arxiv.org/pdf/2112.01488) paper

## Setup

In [1]:
%run "../Z - Common/setup.ipynb"

!pip install -qU numpy==1.26.0 tokenizers transformers==4.36.2 faiss-cpu torch ragatouille 


Stored 'enable_langsmith' (bool)


USER_AGENT environment variable not set, consider setting it to identify your requests.


To create an index, we first need to load a trained model. This can be one of your own or a pretrained one from the Hugging Face hub. Here we load a pretrained model and then index a couple of Wikipedia pages with it.

In [2]:
from ragatouille import RAGPretrainedModel
from ragatouille.utils import get_wikipedia_page

rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
docs = [get_wikipedia_page("Hayao_Miyazaki"), get_wikipedia_page("Studio_Ghibli")]
colbert_index_path = rag.index(
    collection=docs,
    index_name="my_index", 
    max_document_length=180,
    split_documents=True,
)



[Dec 31, 13:58:06] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




This is a behaviour change from RAGatouille 0.8.0 onwards.
This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations.
If you're confident with FAISS working on your machine, pass use_faiss=True to revert to the FAISS-using behaviour.
--------------------


[Dec 31, 13:58:07] #> Note: Output directory .ragatouille/colbert/indexes/my_index already exists


[Dec 31, 13:58:07] #> Will delete 10 files already at .ragatouille/colbert/indexes/my_index in 20 seconds...




[Dec 31, 13:58:28] [0] 		 #> Encoding 178 passages..


100%|██████████| 6/6 [00:12<00:00,  2.01s/it]

[Dec 31, 13:58:40] [0] 		 avg_doclen_est = 131.20787048339844 	 len(local_sample) = 178
[Dec 31, 13:58:40] [0] 		 Creating 2,048 partitions.
[Dec 31, 13:58:40] [0] 		 *Estimated* 23,355 embeddings.
[Dec 31, 13:58:40] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/my_index/plan.json ..





used 17 iterations (1.9021s) to cluster 22188 items into 2048 clusters
[0.036, 0.037, 0.038, 0.033, 0.03, 0.035, 0.032, 0.034, 0.032, 0.033, 0.032, 0.035, 0.033, 0.034, 0.033, 0.037, 0.03, 0.031, 0.032, 0.035, 0.033, 0.033, 0.033, 0.034, 0.034, 0.03, 0.036, 0.033, 0.033, 0.033, 0.033, 0.037, 0.036, 0.033, 0.032, 0.032, 0.033, 0.032, 0.034, 0.037, 0.033, 0.036, 0.033, 0.031, 0.034, 0.033, 0.032, 0.034, 0.035, 0.033, 0.032, 0.034, 0.032, 0.034, 0.034, 0.034, 0.035, 0.037, 0.038, 0.03, 0.032, 0.034, 0.031, 0.033, 0.035, 0.033, 0.034, 0.036, 0.03, 0.032, 0.033, 0.031, 0.032, 0.035, 0.035, 0.033, 0.033, 0.036, 0.033, 0.036, 0.033, 0.036, 0.031, 0.037, 0.031, 0.032, 0.035, 0.033, 0.032, 0.04, 0.032, 0.034, 0.033, 0.035, 0.035, 0.033, 0.037, 0.033, 0.034, 0.035, 0.036, 0.038, 0.035, 0.033, 0.036, 0.034, 0.033, 0.032, 0.034, 0.03, 0.033, 0.034, 0.034, 0.031, 0.035, 0.033, 0.034, 0.033, 0.035, 0.036, 0.031, 0.031, 0.034, 0.035, 0.032, 0.035, 0.035, 0.035]


0it [00:00, ?it/s]

[Dec 31, 13:58:42] [0] 		 #> Encoding 178 passages..


100%|██████████| 6/6 [00:12<00:00,  2.14s/it]
1it [00:13, 13.07s/it]
100%|██████████| 1/1 [00:00<00:00, 537.66it/s]

[Dec 31, 13:58:55] #> Optimizing IVF to store map from centroids to list of pids..
[Dec 31, 13:58:55] #> Building the emb2pid mapping..
[Dec 31, 13:58:55] len(emb2pid) = 23355



100%|██████████| 2048/2048 [00:00<00:00, 92815.99it/s]

[Dec 31, 13:58:55] #> Saved optimized IVF to .ragatouille/colbert/indexes/my_index/ivf.pid.pt
Done indexing!
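
The numbers in the indexing log above can be sanity-checked with a little arithmetic. The estimated embedding count is the number of passages times the average passage length in tokens. The partition count appears to follow a power-of-two heuristic of the form `2 ** floor(log2(16 * sqrt(n)))`; that formula is an assumption based on the ColBERT indexing code, not something the log states, so treat this as a sketch.

```python
import math

# Values copied from the indexing log above.
num_passages = 178
avg_doclen_est = 131.20787048339844

# Estimated number of token embeddings in the whole collection.
num_embeddings_est = num_passages * avg_doclen_est

# Assumed heuristic: largest power of two not exceeding 16 * sqrt(n).
num_partitions = 2 ** math.floor(math.log2(16 * math.sqrt(num_embeddings_est)))

print(f"estimated embeddings: {round(num_embeddings_est):,}")
print(f"partitions: {num_partitions:,}")
```

Both values agree with the `*Estimated* 23,355 embeddings` and `Creating 2,048 partitions` lines in the log.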





In [3]:
results = rag.search(query="What animation studio did Miyazaki found?", k=3)
results

Loading searcher for index my_index for the first time... This may take a few seconds
[Dec 31, 13:58:57] #> Loading codec...
[Dec 31, 13:58:57] #> Loading IVF...
[Dec 31, 13:58:57] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




[Dec 31, 13:58:57] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 4429.04it/s]

[Dec 31, 13:58:57] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 196.54it/s]

[Dec 31, 13:58:57] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...





[Dec 31, 13:58:57] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . What animation studio did Miyazaki found?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2054,  7284,  2996,  2106,  2771,  3148, 18637,  2179,
         1029,   102,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])





[{'content': '=== Studio Ghibli ===\n\n\n==== Early films (1985–1995) ====\nFollowing the success of Nausicaä of the Valley of the Wind, Miyazaki and Takahata founded the animation production company Studio Ghibli on June 15, 1985, as a subsidiary of Tokuma Shoten, with offices in Kichijōji designed by Miyazaki. Miyazaki named the studio after the Caproni Ca.309 and the Italian word meaning "a hot wind that blows in the desert"; the name had been registered a year earlier.',
  'score': 25.885944366455078,
  'rank': 1,
  'document_id': 'ad85c72e-6e24-423b-9ecf-75b9fbe31273',
  'passage_id': 42},
 {'content': 'Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, [mijaꜜzaki hajao]; born January 5, 1941) is a Japanese animator, filmmaker, and manga artist. He co-founded Studio Ghibli and serves as its honorary chairman. Over the course of his career, Miyazaki has attained international acclaim as a masterful storyteller and creator of Japanese animated feature films, and is widely regarded as one
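
Each hit is a plain dictionary with the keys shown in the output: `content`, `score`, `rank`, `document_id`, and `passage_id`. A small helper makes the list easier to scan. This is a sketch: the helper name `summarize_results` is hypothetical, and `sample_results` is a made-up stand-in with the same keys, not real search output.

```python
def summarize_results(results, preview_chars=60):
    """Return one line per hit: rank, score, passage id, and a short text preview."""
    lines = []
    for hit in results:
        # Flatten newlines so each hit stays on a single line.
        preview = hit["content"].replace("\n", " ")[:preview_chars]
        lines.append(
            f"#{hit['rank']}  score={hit['score']:.2f}  "
            f"passage={hit['passage_id']}  {preview}"
        )
    return lines


# Made-up stand-in data with the same keys as the dictionaries in the search output.
sample_results = [
    {"content": "Placeholder passage text\nabout a studio.", "score": 12.5,
     "rank": 1, "document_id": "doc-a", "passage_id": 7},
    {"content": "Another placeholder passage.", "score": 9.25,
     "rank": 2, "document_id": "doc-b", "passage_id": 3},
]

for line in summarize_results(sample_results):
    print(line)
```

With the real index, the same helper could be called as `summarize_results(results)`.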

The indexed model converts easily to a LangChain retriever. Any keyword arguments, such as `k`, can be passed when creating it:



In [4]:
retriever = rag.as_langchain_retriever(k=3)

retriever.invoke("What animation studio did Miyazaki found?")



[Document(metadata={}, page_content='=== Studio Ghibli ===\n\n\n==== Early films (1985–1995) ====\nFollowing the success of Nausicaä of the Valley of the Wind, Miyazaki and Takahata founded the animation production company Studio Ghibli on June 15, 1985, as a subsidiary of Tokuma Shoten, with offices in Kichijōji designed by Miyazaki. Miyazaki named the studio after the Caproni Ca.309 and the Italian word meaning "a hot wind that blows in the desert"; the name had been registered a year earlier.'),
 Document(metadata={}, page_content='Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, [mijaꜜzaki hajao]; born January 5, 1941) is a Japanese animator, filmmaker, and manga artist. He co-founded Studio Ghibli and serves as its honorary chairman. Over the course of his career, Miyazaki has attained international acclaim as a masterful storyteller and creator of Japanese animated feature films, and is widely regarded as one of the most accomplished filmmakers in the history of animation.\nBorn in 

We can easily combine this retriever into a chain.

In [5]:
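from pprint import pprint

from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Pull a ready-made RAG prompt from the LangChain Hub.
prompt = hub.pull("rlm/rag-prompt")
# pprint(prompt)

# `llm` is not defined in this cell; it is assumed to come from the setup notebook run earlier.
chain = (
    # The retriever fills "context"; the raw question passes through unchanged.
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)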
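
The chain above hands the retriever's list of `Document` objects directly to the prompt. A common LangChain pattern is to join the passage texts into one string first. The sketch below shows such a helper; the name `format_docs` is an assumption for illustration, and `SimpleNamespace` objects stand in for LangChain documents so the example runs without any external library.

```python
from types import SimpleNamespace


def format_docs(docs):
    """Join the text of retrieved documents into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)


# Stand-ins for LangChain Document objects; only `.page_content` is needed here.
sample_docs = [
    SimpleNamespace(page_content="First retrieved passage."),
    SimpleNamespace(page_content="Second retrieved passage."),
]

context = format_docs(sample_docs)
print(context)

# In the chain, the helper would be composed with the retriever, for example:
#   {"context": retriever | format_docs, "question": RunnablePassthrough()}
```

Whether this step improves answers depends on the prompt and model; it mainly keeps `Document(...)` wrapper text and metadata out of the prompt.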




In [6]:
chain.invoke("What animation studio did Miyazaki found?")



'Miyazaki co-founded Studio Ghibli on June 15, 1985, along with Isao Takahata, as a subsidiary of Tokuma Shoten. The studio was named after the Caproni Ca.309 and the Italian word meaning "a hot wind that blows in the desert."'