# ColBERTv2: Indexing & Search Notebook

If you're working in Google Colab, we recommend selecting "GPU" as your hardware accelerator in the runtime settings.

First, we'll download the necessary dependencies. Then we'll import the relevant classes; `Indexer` and `Searcher` are the key actors here.

In [3]:
!git -C ColBERT/ pull || git clone https://github.com/stanford-futuredata/ColBERT.git
import sys; sys.path.insert(0, 'ColBERT/')


fatal: cannot change to 'ColBERT/': No such file or directory
Cloning into 'ColBERT'...
remote: Enumerating objects: 2576, done.
remote: Counting objects: 100% (1089/1089), done.
remote: Compressing objects: 100% (335/335), done.
remote: Total 2576 (delta 853), reused 798 (delta 754), pack-reused 1487
Receiving objects: 100% (2576/2576), 2.01 MiB | 5.50 MiB/s, done.
Resolving deltas: 100% (1606/1606), done.


In [4]:
try:  # When on Google Colab, let's install all dependencies with pip.
    import google.colab
    !pip install -U pip
    !pip install -e ColBERT/['faiss-gpu','torch']
except Exception:
    import sys; sys.path.insert(0, 'ColBERT/')
    try:
        from colbert import Indexer, Searcher
    except Exception:
        print("If you're running outside Colab, please make sure you install ColBERT in conda following the instructions in our README. You can also install (as above) with pip but it may install slower or less stable faiss or torch dependencies. Conda is recommended.")
        assert False

Collecting pip
  Downloading pip-23.3.2-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.3.2
Obtaining file:///content/ColBERT
  Preparing metadata (setup.py) ... done
Collecting bitarray (from colbert-ai==0.2.17)
  Downloading bitarray-2.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (34 kB)
Collecting datasets (from colbert-ai==0.2.17)
  Downloading datasets-2.16.1-py3-none-any.whl.metadata (20 kB)
Collecting git-python (from colbert-ai==0.2.17)
  Downloading git_python-1.0.3-py2.py3-none-any.whl (1.9 kB)
Collecting python-dotenv (from colbert-ai==0.2.17)
  Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Collecting ninja (from colbert

In [5]:
import colbert

In [6]:
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert.data import Queries, Collection

The ColBERTv2 paper introduced the **LoTTE benchmark**, whose dev and test sets contain several domain-specific corpora; lifestyle:dev is the smallest dev set corpus. The cell below, however, does not download LoTTE from HuggingFace datasets. It reads a document collection (`doc_col.tsv`) and a query set (`queries_20.tsv`) from tab-separated files, wraps each in a HuggingFace `Dataset`, and prints one query and one document as a sanity check.

For a quick demo on a large corpus such as LoTTE, one can run the `Indexer` on only the first 10,000 passages and remove the queries whose relevant passages are all outside that small set. The cell below applies no such truncation.
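The truncate-and-filter step described above can be sketched in plain Python. This is a minimal illustration, not code from ColBERT or LoTTE: the function `subset_for_demo`, the toy data, and the assumption that a passage id equals its position in the collection are all hypothetical.

```python
def subset_for_demo(passages, queries, relevant, max_passages):
    """Keep the first `max_passages` passages and only the queries that
    still have at least one relevant passage inside that subset.

    Assumes (hypothetically) that a passage id is its position in `passages`.
    """
    small_collection = passages[:max_passages]
    kept_queries = {
        qid: text
        for qid, text in queries.items()
        if any(pid < max_passages for pid in relevant.get(qid, ()))
    }
    return small_collection, kept_queries


# Toy data: four passages, two queries, and their relevant passage ids.
passages = ["p0", "p1", "p2", "p3"]
queries = {"q1": "first question", "q2": "second question"}
relevant = {"q1": [0, 3], "q2": [2, 3]}

small, kept = subset_for_demo(passages, queries, relevant, max_passages=2)
print(small)  # ['p0', 'p1']
print(kept)   # {'q1': 'first question'}
```

Query `q2` is dropped because both of its relevant passages (ids 2 and 3) fall outside the two-passage subset.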

In [16]:
#!git clone https://github.com/NickVoulg02/Information-Retrieval.git
import pandas as pd
from datasets import Dataset

dataset = 'test'  # used below to name the index

# Load the documents and the queries from tab-separated files; the first column becomes the index.
df1 = pd.read_csv("Information-Retrieval/colbert_test/doc_col.tsv", delimiter='\t', index_col=0)
df2 = pd.read_csv("Information-Retrieval/colbert_test/queries_20.tsv", delimiter='\t', index_col=0)

# Wrap both frames in HuggingFace Dataset objects.
collection = Dataset.from_pandas(df1)
query = Dataset.from_pandas(df2)

# Sanity check: print the sixth query and the sixth document.
print(query['query'][5])
print(collection['doc'][5])

WHAT IS THE EFFECT OF WATER OR OTHER THERAPEUTIC AGENTS ON THE PHYSICAL PROPERTIES VISCOSITY ELASTICITY OF SPUTUM OR BRONCHIAL SECRETIONS FROM CF PATIENTS
PSEUDOMONAS AERUGINOSA INFECTION IN CYSTIC FIBROSIS DISTRIBUTION OF B AND T LYMPHOCYTES IN RELATION TO THE HUMORAL IMMUNE RESPONSE THE INFLUENCE OF CHRONIC PS AERUGINOSA INFECTION ON THE OCCURRENCE OF BONE MARROWDERIVED LYMPHOCYTES B CELLS AND THYMUSDERIVED LYMPHOCYTES T CELLS IN PERIPHERAL BLOOD HAVE BEEN STUDIED IN 2 GROUPS OF PATIENTS WITH CYSTIC FIBROSIS ONE GROUP 9 PATIENTS SUFFERED FROM CHRONIC PS AERUGINOSA INFECTION IN THE RESPIRATORY TRACT AND PRODUCED MULTIPLE PS AERUGINOSA PRECIPITINS WHICH WERE DEMONSTRATED BY MEANS OF CROSSED IMMUNOELECTROPHORESIS THE OTHER GROUP 9 PATIENTS HAD NEVER HARBOURED PS AERUGINOSA IN THE RESPIRATORY TRACT AND PRESENTED NO DEMONSTRABLE PRECIPITINS AGAINST THIS BACTERIA THE LYMPHOCYTES WERE EXAMINED FOR THE PRESENCE OF 2 SURFACE MARKERS FOR B CELLS RECEPTOR FOR C3 COMPLEMENT COMPONENT EAC ROSETTE

## Indexing

For efficient search, we can pre-compute the ColBERT representation of each passage and index these representations.

Below, the `Indexer` takes a model checkpoint and writes a (compressed) index to disk. We then prepare a `Searcher` for retrieval from this index.
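To see why per-passage representations are worth pre-computing, recall that ColBERT represents a passage as one vector per token and scores it against a query by late interaction: each query token is matched to its most similar passage token, and those maxima are summed. The sketch below illustrates that scoring rule on toy 2-dimensional vectors. It is not the library's implementation, and the function names and vectors are made up for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings (real ones come from the ColBERT encoder).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5]]

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 1.0
```

Because the passage vectors do not depend on the query, they can be encoded once at indexing time and reused for every search.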

In [17]:
nbits = 2   # encode each dimension with 2 bits
doc_maxlen = 300 # truncate passages at 300 tokens

index_name = f'{dataset}.{nbits}bits'
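As a rough guide to what `nbits` buys, the sketch below estimates the storage per token embedding. It assumes ColBERTv2's residual compression scheme (a centroid id plus `nbits` per dimension for the residual), a 128-dimensional embedding, and a 4-byte centroid id. The helper names are hypothetical, and the actual on-disk layout may include additional metadata.

```python
def bytes_per_embedding(dim=128, nbits=2, centroid_id_bytes=4):
    """Approximate bytes for one compressed token embedding."""
    residual_bytes = dim * nbits // 8  # nbits per dimension, packed into bytes
    return centroid_id_bytes + residual_bytes


def bytes_uncompressed(dim=128, bytes_per_value=2):
    """Bytes for one embedding stored at 16-bit precision."""
    return dim * bytes_per_value


print(bytes_per_embedding(nbits=2))  # 36
print(bytes_per_embedding(nbits=1))  # 20
print(bytes_uncompressed())          # 256
```

Under these assumptions, `nbits = 2` stores each token vector in roughly a seventh of the space of a 16-bit representation.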

Now run the `Indexer` on the collection. Assuming the use of only one GPU, this cell should take about six minutes to finish running.

In [18]:
checkpoint = 'colbert-ir/colbertv2.0'

with Run().context(RunConfig(nranks=1, experiment='notebook')):  # nranks specifies the number of GPUs to use
    # kmeans_niters specifies the number of iterations of k-means clustering; 4 is a good and fast default.
    # Consider larger numbers for small datasets.
    config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits, kmeans_niters=4)

    indexer = Indexer(checkpoint=checkpoint, config=config)
    indexer.index(name=index_name, collection=collection["doc"], overwrite=True)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


[Jan 18, 17:42:39] #> Creating directory /content/experiments/notebook/indexes/test.2bits 


#> Starting...
#> Joined...


In [19]:
indexer.get_index() # You can get the absolute path of the index, if needed.

'/content/experiments/notebook/indexes/test.2bits'
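The path printed above appears to follow the pattern `<root>/<experiment>/indexes/<index_name>`. The sketch below reconstructs it from the values used in this notebook. The pattern is inferred from this one output, not taken from ColBERT's source, and `expected_index_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def expected_index_path(root, experiment, index_name):
    """Compose the index directory from the pattern observed in the output above."""
    return str(PurePosixPath(root) / experiment / "indexes" / index_name)


dataset, nbits = "test", 2
index_name = f"{dataset}.{nbits}bits"

print(expected_index_path("/content/experiments", "notebook", index_name))
# /content/experiments/notebook/indexes/test.2bits
```

When in doubt, prefer the value returned by `indexer.get_index()` over a hand-built path.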

## Search

Having built the index, we can now prepare a `Searcher` and search for individual query strings.

We can use the `query` set we loaded earlier, or you can supply your own questions. Feel free to get creative! But keep in mind that this collection can only answer a small, focused set of questions.

In [20]:
# To create the searcher using its relative name (i.e., not a full path), set
# experiment=value_used_for_indexing in the RunConfig.
with Run().context(RunConfig(experiment='notebook')):
    searcher = Searcher(index=index_name, collection=collection["doc"])

[Jan 18, 17:47:11] #> Loading codec...
[Jan 18, 17:47:11] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jan 18, 17:47:11] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jan 18, 17:47:11] #> Loading IVF...
[Jan 18, 17:47:11] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 3477.86it/s]

[Jan 18, 17:47:11] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 164.75it/s]
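The search cell that follows unpacks its results with `zip(*results)`, which implies that `searcher.search` returns three parallel sequences: passage ids, ranks, and scores. The sketch below shows that unpacking on a hand-written stand-in for such a result. The values are invented for illustration and do not come from a real search.

```python
# Hypothetical stand-in for the value returned by searcher.search(question, k=3):
# three parallel sequences of passage ids, ranks, and scores.
results = ([12, 7, 40], [1, 2, 3], [23.0, 21.5, 19.8])

# zip(*results) regroups the parallel sequences into one tuple per hit.
hits = [
    {"passage_id": pid, "rank": rank, "score": score}
    for pid, rank, score in zip(*results)
]

for hit in hits:
    print(f"[{hit['rank']}] {hit['score']:.1f} passage {hit['passage_id']}")
```

Each passage id is a position in the collection passed to the `Searcher`, so it can index both the passage text and any parallel column, such as a document id.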


In [30]:
question = query["query"][0] # try with an in-range query or supply your own
print(f"#> {question}")

# Find the top-15 passages for this query
results = searcher.search(question, k=15)

# Print out the top-k retrieved passages
for passage_id, passage_rank, passage_score in zip(*results):
    print(f"\t [{passage_rank}] \t\t {passage_score:.1f} \t\t {searcher.collection[passage_id]}")
    print(f"{collection['doc_id'][passage_id]}")


#> WHAT ARE THE EFFECTS OF CALCIUM ON THE PHYSICAL PROPERTIES OF MUCUS FROM CF PATIENTS
	 [1] 		 23.0 		 EFFECTS OF CALCIUM ON INTESTINAL MUCIN IMPLICATIONS FOR CYSTIC FIBROSIS A MAJOR FEATURE OF THE DISEASE CYSTIC FIBROSIS IS THE EXCESSIVE CONCENTRATION OF MUCUS WITHIN DUCTS AND GLANDS OF MUCOUSPRODUCING ORGANS SOME MUCOUS SECRETIONS ALSO SHOW AN ELEVATION IN CALCIUM CONCENTRATION USING PURIFIED RAT INTESTINAL GOBLET CELL MUCIN AS A MODEL MUCIN WE HAVE INVESTIGATED THE EFFECT OF MILLIMOLAR ADDITIONS 125 MM OF CACL2 ON THE PHYSICAL PROPERTIES OF THE MUCIN ISOTONICITY OF INCUBATION MEDIA WAS PRESERVED IN ORDER TO MIMIC IN VIVO CONDITIONS CACL2 815MM CAUSED A 1533 DECREASE IN VISCOSITY NO CHANGE IN ELECTROPHORETIC MOBILITY IN ACRYLAMIDE GELS AND A 2030 DECREASE IN SOLUBILITY OF THE MUCIN SOLUBILITY CHANGES WERE REVERSED BY THE ADDITION OF EDTA 20 MM TO INCUBATIONS INSOLUBILITY WAS ALSO PRODUCED IN INCUBATIONS OF MUCIN WITH A MIXTURE OF SOLUBLE INTESTINAL CONTENTS NACL WASHINGS THESE FIND