# ColBERTv2: Indexing & Search Notebook
Welcome to DSPy's official ColBERT notebook! This notebook will teach you how to install the ColBERT library and download a LoTTE dataset, run the `Indexer` on LoTTE passages with a pretrained ColBERTv2 checkpoint, and use the `Searcher` to retrieve the most relevant passages for various queries. You can test your own custom queries once you finish running all cells.

If you're working in Google Colab, we recommend selecting "GPU" as your hardware accelerator in the runtime settings.

First, we'll install the necessary dependencies. Next, we'll import the relevant classes. Note that `Indexer` and `Searcher` are the key actors here.

In [None]:
!pip install "colbert-ir[faiss-gpu, torch]"

Collecting colbert-ir[faiss-gpu,torch]
  Downloading colbert_ir-0.2.14-py3-none-any.whl (107 kB)
Collecting bitarray (from colbert-ir[faiss-gpu,torch])
  Downloading bitarray-2.8.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (287 kB)
Collecting datasets (from colbert-ir[faiss-gpu,torch])
  Downloading datasets-2.14.6-py3-none-any.whl (493 kB)
Collecting git-python (from colbert-ir[faiss-gpu,torch])
  Downloading git_python-1.0.3-py2.py3-none-any.whl (1.9 kB)
Collecting python-dotenv (from colbert-ir[faiss-gpu,torch])
  Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Collecting ninja (from co

In [None]:
import colbert

In [None]:
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert.data import Queries, Collection

We will use the dev set of the **LoTTE benchmark**, which we recently introduced in the ColBERTv2 paper. In summary, the [LoTTE dataset](https://huggingface.co/colbertv2) is designed for evaluating IR systems on long-tail topics that may not be well covered by entity-centric knowledge bases like Wikipedia. It consists of 12 test sets, each with 500-2000 queries and 100k-2M passages. The test sets are organized into 5 topics, and each has a related validation (dev) set. Passages are kept disjoint to encourage realistic out-of-domain testing, and a "pooled" setting aggregates passages and queries across all topics for more diverse testing. Queries come from search (GooAQ) and forum (StackExchange) sources. Evaluation focuses on retrieval solely from answer post text, using the success@5 metric.
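To make the success@5 metric concrete, here is a minimal sketch in plain Python. It assumes the common reading of the metric: a query counts as a success if at least one of its relevant passage ids appears among the top 5 retrieved ids. The function names and the toy rankings are hypothetical and are not part of the ColBERT library.

```python
def success_at_k(ranked_pids, answer_pids, k=5):
    """Return 1.0 if any of the top-k retrieved passage ids is relevant, else 0.0."""
    relevant = set(answer_pids)
    return float(any(pid in relevant for pid in ranked_pids[:k]))


def mean_success_at_k(rankings, answers, k=5):
    """Average success@k over a batch of queries."""
    scores = [success_at_k(r, a, k) for r, a in zip(rankings, answers)]
    return sum(scores) / len(scores)


# Toy data (made up for illustration): ranked passage ids for three queries,
# and the relevant passage ids for each of those queries.
rankings = [[7, 3, 9, 1, 4, 8], [2, 5, 6, 0, 11, 12], [10, 13, 14, 15, 16, 17]]
answers = [[4], [12], [13, 99]]

print(mean_success_at_k(rankings, answers, k=5))
```

In this toy data the second query misses because its only relevant passage sits at rank 6, just outside the cutoff.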

We'll download one such dev set corpus from HuggingFace, namely lifestyle:dev.

For the purposes of a quick demo, we will only run the `Indexer` on the first 10,000 passages. As we do this, let's also remove the queries whose relevant passages are all outside this small set of passages.

In [None]:
from datasets import load_dataset

dataset = 'lifestyle'
datasplit = 'dev'

collection_dataset = load_dataset("colbertv2/lotte_passages", dataset)
collection = [x['text'] for x in collection_dataset[datasplit + '_collection']]

queries_dataset = load_dataset("colbertv2/lotte", dataset)
queries = [x['query'] for x in queries_dataset['search_' + datasplit]]

f'Loaded {len(queries)} queries and {len(collection):,} passages'

'Loaded 417 queries and 268,893 passages'

This loaded 417 queries and about 269k passages. Let's inspect one query and one passage to verify we have done so correctly.

In [None]:
print(queries[24])
print()
print(collection[19929])
print()

are blossom end rot tomatoes edible?

I think the spraying thing is not after, its during. The cold will freeze the mist, keeping the air around the trees at (but not below) freezing. See http://www.ehow.com/how_5805520_use-freeze-damage-fruit-trees.html for example which recommends a sprinkler. The releases heat thing is kind of an oversimplification, but basically as long as you have any liquid water around, it will keep things at zero. The sap of your tree is not pure water, and therefore freezes somewhat below zero. By having the water freeze instead you stay away from the temps that would damage your plants. That said, http://www.ehow.com/how-does_5245655_spraying-frost-protect-fruit-freezing_.html is total gibberish since evaporation doesnt generate heat, quite the opposite. There is a better explanation at http://www.gardenguides.com/135830-spray-water-plants-during-frost.html This is a picture from a blog entry that gives you details from the citrus farmers point of view.



## Indexing

For an efficient search, we can pre-compute the ColBERT representation of each passage and index them.
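As a rough illustration of why per-passage representations can be pre-computed, the sketch below scores a passage with ColBERT-style late interaction (often called MaxSim): every query token embedding is matched to its most similar passage token embedding, and these maxima are summed. This is a toy version in plain Python with made-up 2-dimensional vectors; the real model produces the embeddings with a BERT encoder, and the helper names here are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best cosine similarity to any passage token."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


# Toy token embeddings (made up for illustration)
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a close match for each query token
doc_b = [[1.0, 1.0]]                          # only a partial match for each query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because the passage-side vectors do not depend on the query, they can be computed once and stored, which is the work the indexing step takes care of.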

Below, the `Indexer` takes a model checkpoint and writes a (compressed) index to disk. We then prepare a `Searcher` for retrieval from this index.

To save space and time, we will only run the `Indexer` on the first 10,000 passages. To do so, we will keep only the queries that have at least one relevant passage with an id below 10,000.

In [None]:
max_id = 10000  # index only the first 10,000 passages

# Keep only the queries with at least one relevant passage inside the indexed subset
answer_pids = [x['answers']['answer_pids'] for x in queries_dataset['search_' + datasplit]]
filtered_queries = [q for q, apids in zip(queries, answer_pids) if any(x < max_id for x in apids)]

nbits = 2         # encode each dimension with 2 bits
doc_maxlen = 300  # truncate passages at 300 tokens

index_name = f'{dataset}.{datasplit}.{nbits}bits'

f'Filtered down to {len(filtered_queries)} queries'

'Filtered down to 20 queries'
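To give some intuition for the `nbits = 2` setting, the sketch below shows one simplified way to store each dimension of a vector in 2 bits: subtract a nearby centroid, then map each residual value to one of `2**nbits` buckets. This only illustrates the general idea of residual compression. The cutoffs, bucket values, and helper names are hypothetical, and the library's actual codec (which derives its buckets from the data and packs the codes into bytes) is not reproduced here.

```python
from bisect import bisect_right


def quantize_residual(vector, centroid, cutoffs):
    """Map each dimension of (vector - centroid) to a bucket index."""
    return [bisect_right(cutoffs, v - c) for v, c in zip(vector, centroid)]


def dequantize_residual(codes, centroid, bucket_values):
    """Approximate the original vector from the centroid and the bucket codes."""
    return [c + bucket_values[code] for code, c in zip(codes, centroid)]


nbits = 2
cutoffs = [-0.1, 0.0, 0.1]                  # 2**nbits - 1 boundaries (hypothetical)
bucket_values = [-0.15, -0.05, 0.05, 0.15]  # one representative value per bucket (hypothetical)

centroid = [0.5, -0.2, 0.1]   # toy centroid
vector = [0.62, -0.25, 0.1]   # toy token embedding assigned to that centroid

codes = quantize_residual(vector, centroid, cutoffs)
approx = dequantize_residual(codes, centroid, bucket_values)
print(codes)
print(approx)
```

Each code fits in `nbits` bits, so a larger `nbits` gives a closer approximation at the cost of a bigger index.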

We have already built the index for you on the collection subset to save additional time. Download it from the [colbert-ir HuggingFace repository](https://huggingface.co/colbert-ir/indexes/tree/main) below.

However, if you wish to manually run the `Indexer` on the collection subset, you can update the `APPLY_INDEXING` flag. Assuming the use of only one GPU, this cell should take about six minutes to finish running.

In [None]:
APPLY_INDEXING = False  # set to True to build the index yourself instead of downloading it

if APPLY_INDEXING:
    checkpoint = 'colbert-ir/colbertv2.0'

    # nranks specifies the number of GPUs to use
    with Run().context(RunConfig(nranks=1, experiment='notebook')):
        # kmeans_niters specifies the number of iterations of k-means clustering;
        # 4 is a good and fast default. Consider larger numbers for small datasets.
        config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits, kmeans_niters=4)

        indexer = Indexer(checkpoint=checkpoint, config=config)
        indexer.index(name=index_name, collection=collection[:max_id], overwrite=True)
else:
    from huggingface_hub import snapshot_download

    # Download the prebuilt index; snapshot_download returns the local directory path
    !mkdir -p "index"
    index_dir = snapshot_download(repo_id="colbert-ir/indexes", local_dir="index")
    index_name = index_dir + "/intro_colbert"


Fetching 11 files:   0%|          | 0/11 [00:00<?, ?it/s]

Downloading (…)cf32e/.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

Downloading (…)bert/0.metadata.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)olbert/metadata.json:   0%|          | 0.00/2.71k [00:00<?, ?B/s]

Downloading (…)lbert/doclens.0.json:   0%|          | 0.00/33.0k [00:00<?, ?B/s]

Downloading (…)ro_colbert/plan.json:   0%|          | 0.00/2.74k [00:00<?, ?B/s]

Downloading 0.codes.pt:   0%|          | 0.00/3.49M [00:00<?, ?B/s]

Downloading buckets.pt:   0%|          | 0.00/999 [00:00<?, ?B/s]

Downloading 0.residuals.pt:   0%|          | 0.00/27.9M [00:00<?, ?B/s]

Downloading avg_residual.pt:   0%|          | 0.00/762 [00:00<?, ?B/s]

Downloading ivf.pid.pt:   0%|          | 0.00/2.12M [00:00<?, ?B/s]

Downloading centroids.pt:   0%|          | 0.00/2.10M [00:00<?, ?B/s]

## Search

Having downloaded (or built) the index, we can now prepare our `Searcher` and search for individual query strings.

We can use the `queries` set we loaded earlier, or you can supply your own questions. Feel free to get creative! But keep in mind that the index covers only the first 10,000 of the ~300k lifestyle passages, so it can only answer a small, focused set of questions!

In [None]:
# To create the searcher using its relative name (i.e., not a full path), set
# experiment=value_used_for_indexing in the RunConfig.
with Run().context(RunConfig(experiment='notebook')):
    searcher = Searcher(index=index_name, collection=collection)


# If you want to customize the search latency/quality tradeoff, you can also supply a
# config=ColBERTConfig(ncells=.., centroid_score_threshold=.., ndocs=..) argument.
# The default settings with k <= 10 (1, 0.5, 256) give the fastest search,
# but you can get a more extensive search by setting larger values of k or by
# manually specifying more conservative ColBERTConfig settings (e.g. (4, 0.4, 4096)).

[Nov 12, 04:26:56] #> Loading codec...
[Nov 12, 04:26:56] #> Loading IVF...
[Nov 12, 04:26:56] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 774.57it/s]

[Nov 12, 04:26:56] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 59.96it/s]


In [None]:
# Try in-range queries or supply your own
queries = [
    filtered_queries[13],
    filtered_queries[5],
    filtered_queries[3],
    "How do I move files from desktop to virtual machine?",
]

for query in queries:
    print(f"#> {query}")

    # Find the top-3 passages for this query
    results = searcher.search(query, k=3)

    # Print out the top-k retrieved passages
    for passage_id, passage_rank, passage_score in zip(*results):
        print(f"\t [{passage_rank}] \t\t {passage_score:.1f} \t\t {searcher.collection[passage_id]}")

#> are some cats just skinny?

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . are some cats just skinny?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2024,  2070,  8870,  2074, 15629,  1029,   102,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])

	 [1] 		 15.6 		 It could be an indication of a fever, or it could be a sign of some other medical problem, such as an ear infection. It is probably a good idea to get your dog checked out by a vet.
	 [2] 		 15.6 		 So I was at the feed store today getting a couple bales of hay, and some wood pellet litter. I picked up a 10 pound bag of whole safflower seed. When I got home I spread some out in th