# ColBERTv2: Indexing & Search Notebook

If you're working in Google Colab, we recommend selecting "GPU" as your hardware accelerator in the runtime settings.

First, we'll clone the ColBERT repository and install its dependencies. We'll then import the relevant classes. Note that `Checkpoint` and `ColBERTConfig` are the key actors here.

In [1]:
!git -C ColBERT/ pull || git clone https://github.com/stanford-futuredata/ColBERT.git
import sys; sys.path.insert(0, 'ColBERT/')


fatal: cannot change to 'ColBERT/': No such file or directory
Cloning into 'ColBERT'...
remote: Enumerating objects: 2860, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 2860 (delta 0), reused 0 (delta 0), pack-reused 2859 (from 1)
Receiving objects: 100% (2860/2860), 2.07 MiB | 7.75 MiB/s, done.
Resolving deltas: 100% (1800/1800), done.


In [2]:
try: # When on google Colab, let's install all dependencies with pip.
    import google.colab
    !pip install -U pip
    !pip install -e ColBERT/['faiss-cpu']
except Exception:
    import sys; sys.path.insert(0, 'ColBERT/')
    try:
        from colbert.modeling.checkpoint import Checkpoint
    except Exception:
        print("If you're running outside Colab, please make sure you install ColBERT in conda following the instructions in our README.")
        assert False

Collecting pip
  Downloading pip-25.3-py3-none-any.whl.metadata (4.7 kB)
Downloading pip-25.3-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 22.4 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.3
Obtaining file:///content/ColBERT
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting bitarray (from colbert-ai==0.2.22)
  Downloading bitarray-3.8.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (34 kB)
Collecting ninja (from colbert-ai==0.2.22)
  Downloading ninja-1.13.0-py3-none-manyl

In [3]:
!pip install datasets
!pip install torch



In [4]:
import torch
import numpy as np
from tqdm import tqdm
import os

In [5]:
from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

We will use the **FiQA** dataset, which we'll download from HuggingFace datasets (`mteb/fiqa`). It provides a corpus of documents and a separate set of queries.

We start with the corpus, joining each document's title (when it has one) and its text into a single string.

In [6]:
from datasets import load_dataset

# Load FiQA corpus (documents)
corpus_dataset = load_dataset("mteb/fiqa", "corpus")
corpus_data = corpus_dataset['corpus']

# Extract corpus texts - combining title and text
corpus_ids = [item['_id'] for item in corpus_data]
corpus_texts = []
for item in corpus_data:
    title = item.get('title', '') or ''
    text = item.get('text', '') or ''
    # Combine title and text with a separator if title exists
    if title:
        combined = f"{title} {text}"
    else:
        combined = text
    corpus_texts.append(combined)

print(f'Loaded {len(corpus_texts):,} corpus documents')

README.md: 0.00B [00:00, ?B/s]

corpus.jsonl:   0%|          | 0.00/47.0M [00:00<?, ?B/s]

Generating corpus split:   0%|          | 0/57638 [00:00<?, ? examples/s]

Loaded 57,638 corpus documents


In [11]:
lens = [len(t) for t in corpus_texts]
np.histogram(lens, bins=30)

(array([31103, 15664,  6150,  2376,  1078,   527,   278,   166,    92,
           70,    37,    30,    25,    11,     8,     6,     8,     3,
            3,     0,     0,     0,     0,     1,     1,     0,     0,
            0,     0,     1]),
 array([    0.        ,   566.33333333,  1132.66666667,  1699.        ,
         2265.33333333,  2831.66666667,  3398.        ,  3964.33333333,
         4530.66666667,  5097.        ,  5663.33333333,  6229.66666667,
         6796.        ,  7362.33333333,  7928.66666667,  8495.        ,
         9061.33333333,  9627.66666667, 10194.        , 10760.33333333,
        11326.66666667, 11893.        , 12459.33333333, 13025.66666667,
        13592.        , 14158.33333333, 14724.66666667, 15291.        ,
        15857.33333333, 16423.66666667, 16990.        ]))

This loaded 57,638 documents, most of which (about 81%) are under 1,133 characters long. Next, we'll load the queries, then inspect one query and one document to verify we have done so correctly.

In [7]:
# Load FiQA queries
queries_dataset = load_dataset('mteb/fiqa', 'queries')
queries_data = queries_dataset['queries']

# Extract query texts
query_ids = [item['_id'] for item in queries_data]
query_texts = [item['text'] for item in queries_data]

print(f'Loaded {len(query_texts):,} queries')

queries.jsonl:   0%|          | 0.00/600k [00:00<?, ?B/s]

Generating queries split:   0%|          | 0/6648 [00:00<?, ? examples/s]

Loaded 6,648 queries


In [8]:
print("Sample Query:")
print(f"ID: {query_ids[0]}")
print(f"Text: {query_texts[0]}")
print()
print("Sample Document:")
print(f"ID: {corpus_ids[0]}")
print(f"Text: {corpus_texts[0][:500]}...")

Sample Query:
ID: 0
Text: What is considered a business expense on a business trip?

Sample Document:
ID: 3
Text: I'm not saying I don't like the idea of on-the-job training too, but you can't expect the company to do that. Training workers is not their job - they're building software. Perhaps educational systems in the U.S. (or their students) should worry a little about getting marketable skills in exchange for their massive investment in education, rather than getting out with thousands in student debt and then complaining that they aren't qualified to do anything....


## Indexing

For an efficient search, we can pre-compute the ColBERT representation of each passage and store it.

Below, `Checkpoint` loads a pretrained model, which we then use to encode every document and query into per-token embeddings.
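
To make "ColBERT representation" concrete: each passage becomes a matrix with one vector per token, and a query is scored against it by late interaction, summing each query token's best match over the passage tokens (MaxSim). The sketch below illustrates this with random unit vectors in NumPy. The helper names `maxsim_score` and `unit_rows` and the toy shapes are ours for illustration, not part of the ColBERT API.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction score: for each query token, take its best
    dot product over all document tokens, then sum over query tokens."""
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

def unit_rows(x: np.ndarray) -> np.ndarray:
    """Scale every row to unit length, so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
query = unit_rows(rng.standard_normal((32, 128)))
doc_a = unit_rows(rng.standard_normal((300, 128)))
# doc_b contains the query's own token vectors, so every query token
# finds a perfect match (cosine similarity 1.0) in it.
doc_b = np.vstack([query, doc_a[:10]])

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(f"random doc: {score_a:.2f}, doc containing the query tokens: {score_b:.2f}")
```

With unit-length vectors, the score is bounded above by the number of query tokens, which is why the second document reaches the maximum here.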

In [9]:
# Configuration
doc_maxlen = 300   # Maximum document length in tokens
query_maxlen = 32  # Maximum query length in tokens
checkpoint_name = 'colbert-ir/colbertv2.0'

# Initialize the ColBERT checkpoint
config = ColBERTConfig(
    doc_maxlen=doc_maxlen,
    query_maxlen=query_maxlen,
)

ckpt = Checkpoint(checkpoint_name, colbert_config=config)
print(f"Loaded ColBERT checkpoint: {checkpoint_name}")

artifact.metadata: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/405 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Loaded ColBERT checkpoint: colbert-ir/colbertv2.0


  self.scaler = torch.cuda.amp.GradScaler()


To save space and time, you can encode only a subset of the data: set `max_corpus_docs` and `max_queries` below to cap the number of documents and queries. Leaving both as `None`, as we do here, processes everything.

In [10]:
# Set to None to process all documents, or set a number to limit for testing
max_corpus_docs = None  # e.g., 1000 for testing
max_queries = None      # e.g., 100 for testing

# Batch size for encoding
batch_size = 32

# Create output directory
output_dir = 'data/fiqa'
os.makedirs(output_dir, exist_ok=True)

Now encode the corpus. Assuming the use of only one GPU, this cell should take about four minutes to finish running.

In [11]:
def encode_documents_batched(texts, ids, ckpt, batch_size=32, max_docs=None):
    """
    Encode documents in batches and return embeddings.

    Returns:
        embeddings_list: List of numpy arrays, each of shape (num_tokens, 128)
        processed_ids: List of document IDs
    """
    if max_docs is not None:
        texts = texts[:max_docs]
        ids = ids[:max_docs]

    embeddings_list = []
    processed_ids = []

    num_batches = (len(texts) + batch_size - 1) // batch_size

    for i in tqdm(range(0, len(texts), batch_size), total=num_batches, desc="Encoding documents"):
        batch_texts = texts[i:i+batch_size]
        batch_ids = ids[i:i+batch_size]

        # Generate embeddings using docFromText.
        # With bsize set, it returns a tuple whose first element is the
        # embeddings tensor of shape (batch, max_tokens, dim).
        with torch.no_grad():
            batch_embs = ckpt.docFromText(batch_texts, bsize=batch_size)

        # Convert to numpy once for the whole batch
        embs_np = batch_embs[0].cpu().numpy()

        # Then iterate through the batch (j, so the outer loop's i is not shadowed)
        for j, doc_id in enumerate(batch_ids):
            doc_emb = embs_np[j]  # Shape: (max_tokens, 128)
            embeddings_list.append(doc_emb)
            processed_ids.append(doc_id)

    return embeddings_list, processed_ids

print("Generating corpus embeddings...")
corpus_embeddings, corpus_processed_ids = encode_documents_batched(
    corpus_texts, corpus_ids, ckpt, batch_size=batch_size, max_docs=max_corpus_docs
)

Generating corpus embeddings...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
Encoding documents: 100%|██████████| 1802/1802 [04:02<00:00,  7.42it/s]


In [12]:
print(f"Generated embeddings for {len(corpus_embeddings)} documents")
print(f"Sample embedding shape: {corpus_embeddings[0].shape}")

Generated embeddings for 57638 documents
Sample embedding shape: (300, 128)
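
The sample embedding above has the full `doc_maxlen` of 300 rows, which suggests that shorter documents are padded. If padded positions are all-zero rows (our assumption here, worth verifying on your own output before relying on it), they can be dropped before saving to reduce storage. A minimal NumPy sketch, with a hypothetical helper `drop_zero_rows`:

```python
import numpy as np

def drop_zero_rows(emb: np.ndarray) -> np.ndarray:
    """Keep only the token rows that contain at least one non-zero value."""
    keep = np.any(emb != 0, axis=1)
    return emb[keep]

# Toy example: 5 real token vectors followed by 3 rows of padding.
rng = np.random.default_rng(0)
padded = np.zeros((8, 128), dtype=np.float32)
padded[:5] = rng.standard_normal((5, 128))

trimmed = drop_zero_rows(padded)
print(padded.shape, "->", trimmed.shape)
```

After trimming, documents no longer share one shape, so they have to be stored as a list of arrays rather than a single stacked tensor.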


In [13]:
def encode_queries_batched(texts, ids, ckpt, batch_size=32, max_queries=None):
    """
    Encode queries in batches and return embeddings.

    Returns:
        embeddings_list: List of numpy arrays, each of shape (num_tokens, 128)
        processed_ids: List of query IDs
    """
    if max_queries is not None:
        texts = texts[:max_queries]
        ids = ids[:max_queries]

    embeddings_list = []
    processed_ids = []

    num_batches = (len(texts) + batch_size - 1) // batch_size

    for i in tqdm(range(0, len(texts), batch_size), total=num_batches, desc="Encoding queries"):
        batch_texts = texts[i:i+batch_size]
        batch_ids = ids[i:i+batch_size]

        # Generate embeddings using queryFromText
        with torch.no_grad():
            batch_embs = ckpt.queryFromText(batch_texts, bsize=batch_size)

        # Convert to numpy and store
        for emb, query_id in zip(batch_embs, batch_ids):
            if isinstance(emb, torch.Tensor):
                emb_np = emb.cpu().numpy()
            else:
                emb_np = np.array(emb)
            embeddings_list.append(emb_np)
            processed_ids.append(query_id)

    return embeddings_list, processed_ids

print("Generating query embeddings...")
query_embeddings, query_processed_ids = encode_queries_batched(
    query_texts, query_ids, ckpt, batch_size=batch_size, max_queries=max_queries
)
print(f"Generated embeddings for {len(query_embeddings)} queries")
print(f"Sample query embedding shape: {query_embeddings[0].shape}")

Generating query embeddings...


Encoding queries: 100%|██████████| 208/208 [00:03<00:00, 53.31it/s]

Generated embeddings for 6648 queries
Sample query embedding shape: (32, 128)





## Saving and Reloading the Embeddings

Having encoded the corpus and the queries, we can now save the embeddings to disk, along with metadata that records how they were produced.

The corpus embeddings are written in batches of 1,000 documents per file; the query embeddings go into a single file. We then reload a file of each kind to check the round trip.

In [14]:
import json

# Save metadata about the embeddings
metadata = {
    'dataset': 'mteb/fiqa',
    'model': checkpoint_name,
    'doc_maxlen': doc_maxlen,
    'query_maxlen': query_maxlen,
    'embedding_dim': 128,
    'num_corpus_docs': len(corpus_embeddings),
    'num_queries': len(query_embeddings),
    'corpus_ids': corpus_processed_ids,
    'query_ids': query_processed_ids,
}

metadata_path = os.path.join(output_dir, 'metadata.json')
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)
print(f"Saved metadata to {metadata_path}")

def save_embeddings(embeddings_list, ids_list, filepath):
    """
    Save embeddings to a compressed numpy file.

    Since embeddings have variable lengths (different number of tokens),
    we save them as a list of arrays along with their IDs.
    """
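    # Convert IDs to numpy array
    ids_array = np.array(ids_list, dtype=object)

    # Store each embedding as one element of a 1-D object array. Building it
    # with np.array(..., dtype=object) would instead expand equally shaped
    # embeddings into a 3-D array of Python objects.
    embeddings_array = np.empty(len(embeddings_list), dtype=object)
    for k, emb in enumerate(embeddings_list):
        embeddings_array[k] = emb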

    np.savez_compressed(
        filepath,
        ids=ids_array,
        embeddings=embeddings_array
    )
    print(f"Saved embeddings to {filepath}")

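# Number of documents per saved file (a separate name, so the encoding batch_size is not overwritten)
save_batch_size = 1_000

# Save corpus embeddings
for b in range(0, len(corpus_embeddings), save_batch_size):
    corpus_embeddings_batch = corpus_embeddings[b:b+save_batch_size]
    corpus_processed_ids_batch = corpus_processed_ids[b:b+save_batch_size]
    blabel = b // save_batch_size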
    corpus_emb_path = os.path.join(output_dir, f"corpus_embeddings_{blabel}.npz")
    save_embeddings(corpus_embeddings_batch, corpus_processed_ids_batch, corpus_emb_path)

# Save query embeddings
query_emb_path = os.path.join(output_dir, 'query_embeddings.npz')
save_embeddings(query_embeddings, query_processed_ids, query_emb_path)

Saved metadata to data/fiqa/metadata.json
Saved embeddings to data/fiqa/corpus_embeddings_0.npz
Saved embeddings to data/fiqa/corpus_embeddings_1.npz
Saved embeddings to data/fiqa/corpus_embeddings_2.npz
Saved embeddings to data/fiqa/corpus_embeddings_3.npz
Saved embeddings to data/fiqa/corpus_embeddings_4.npz
Saved embeddings to data/fiqa/corpus_embeddings_5.npz
Saved embeddings to data/fiqa/corpus_embeddings_6.npz
Saved embeddings to data/fiqa/corpus_embeddings_7.npz
Saved embeddings to data/fiqa/corpus_embeddings_8.npz
Saved embeddings to data/fiqa/corpus_embeddings_9.npz
Saved embeddings to data/fiqa/corpus_embeddings_10.npz
Saved embeddings to data/fiqa/corpus_embeddings_11.npz
Saved embeddings to data/fiqa/corpus_embeddings_12.npz
Saved embeddings to data/fiqa/corpus_embeddings_13.npz
Saved embeddings to data/fiqa/corpus_embeddings_14.npz
Saved embeddings to data/fiqa/corpus_embeddings_15.npz
Saved embeddings to data/fiqa/corpus_embeddings_16.npz
Saved embeddings to data/fiqa/cor

In [49]:
# Load corpus embeddings (corpus_emb_path points at the last batch file written above)
corpus_data = np.load(corpus_emb_path, allow_pickle=True)
loaded_corpus_ids = corpus_data['ids'].tolist()
loaded_corpus_embs = [np.array(emb) for emb in corpus_data['embeddings']]

# Load query embeddings
query_data = np.load(query_emb_path, allow_pickle=True)
loaded_query_ids = query_data['ids'].tolist()
loaded_query_embs = [np.array(emb) for emb in query_data['embeddings']]

print(f"Loaded {len(loaded_corpus_embs)} corpus embeddings")
print(f"Loaded {len(loaded_query_embs)} query embeddings")
print(f"\nFirst corpus embedding shape: {loaded_corpus_embs[0].shape}")
print(f"First query embedding shape: {loaded_query_embs[0].shape}")

Loaded 1000 corpus embeddings
Loaded 100 query embeddings

First corpus embedding shape: (300, 128)
First query embedding shape: (32, 128)
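
The cell above reads only one corpus file, since `corpus_emb_path` holds the last path assigned in the save loop. To reload the whole corpus, one option is to iterate over the `corpus_embeddings_{n}.npz` files in numeric order. The sketch below assumes that naming pattern and the `ids` / `embeddings` keys used by `save_embeddings`; `load_all_corpus_batches` is a hypothetical helper, demonstrated here on small temporary files rather than the real output directory.

```python
import os
import re
import tempfile
import numpy as np

def load_all_corpus_batches(directory):
    """Load every corpus_embeddings_<n>.npz in `directory`, ordered by n."""
    pattern = re.compile(r"corpus_embeddings_(\d+)\.npz$")
    numbered = []
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            numbered.append((int(match.group(1)), name))

    all_ids, all_embs = [], []
    for _, name in sorted(numbered):  # numeric sort, so 10 comes after 9
        with np.load(os.path.join(directory, name), allow_pickle=True) as data:
            all_ids.extend(data["ids"].tolist())
            all_embs.extend(np.asarray(e) for e in data["embeddings"])
    return all_ids, all_embs

# Demo on toy files written in the same layout (12 files, 2 documents each).
demo_dir = tempfile.mkdtemp()
for n in range(12):
    embs = np.empty(2, dtype=object)
    for k in range(2):
        embs[k] = np.full((3, 4), n, dtype=np.float32)
    ids = np.array([f"doc{n}_{k}" for k in range(2)], dtype=object)
    np.savez_compressed(os.path.join(demo_dir, f"corpus_embeddings_{n}.npz"),
                        ids=ids, embeddings=embs)

demo_ids, demo_embs = load_all_corpus_batches(demo_dir)
print(len(demo_ids), demo_embs[0].shape)
```

Sorting on the parsed integer matters: a plain string sort would place `corpus_embeddings_10.npz` before `corpus_embeddings_2.npz` and scramble the document order.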


In [17]:
import shutil

# Create zip of the entire output directory
shutil.make_archive('fiqa_colbert_embeddings', 'zip', root_dir=output_dir)

OSError: [Errno 28] No space left on device
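
The error is less surprising once the sizes are worked out. As a rough estimate, assuming every document is stored as a padded 300 × 128 matrix of 4-byte floats (the actual dtype may be smaller, and compression will help), the corpus embeddings alone approach 9 GB, and zipping the directory writes a second copy. A quick check along these lines can be run before archiving:

```python
import shutil

num_docs = 57_638        # corpus size reported above
tokens_per_doc = 300     # doc_maxlen; assumes every document is padded to this length
dim = 128
bytes_per_value = 4      # assumes float32; float16 would halve the estimate

estimated_bytes = num_docs * tokens_per_doc * dim * bytes_per_value
free_bytes = shutil.disk_usage(".").free

print(f"Estimated uncompressed embeddings: {estimated_bytes / 1e9:.2f} GB")
print(f"Free space on this volume:         {free_bytes / 1e9:.2f} GB")
if free_bytes < estimated_bytes:
    print("Probably not enough room for an archive; consider a smaller max_corpus_docs.")
```

Lowering `max_corpus_docs` shrinks the estimate proportionally.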

In [None]:
# Download
# from google.colab import files
# files.download('fiqa_colbert_embeddings.zip')