# JHU JSALT Summer School IR Laboratory -- Part 4.2

This notebook is mainly borrowed from the series of Colab notebooks created for the SIGIR 2023 Tutorial entitled '**Neural Methods for CLIR Tutorial**'. For more information, please visit their [repository](https://github.com/hltcoe/clir-tutorial).

In this notebook, we will go through a quick demonstration of using PLAID-X (an extension of [ColBERT-X](https://arxiv.org/abs/2201.08471)) for running a CLIR experiment over a subset of the NeuCLIR Chinese collection.

The overall run time of this notebook is about 30 minutes on a Colab T4 GPU, most of which is spent on indexing. Please remember to select a Colab runtime with a GPU (`Runtime > Change runtime type`).

Since we are operating in a VM, **the results will not be saved** after the session is disconnected, which can happen, for example, because of an idle timeout. If you would like to save the index in your Google Drive, you can mount your Google Drive onto the VM from the left panel and point the `index_root` variable to your selected directory.

## Get Started

The following cell checks whether this notebook has GPU access. Upon execution, you should see a table with Nvidia GPU information. If you see a command error instead, you are running on either a CPU or a TPU VM. In that case, switch to a GPU using `Runtime > Change runtime type`.

In [1]:
!nvidia-smi

Mon May 27 20:40:39 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

Let's install the packages!
The following command will install Huggingface `datasets`, Google Translate (for presentation), and [PLAID-X](https://github.com/hltcoe/ColBERT-X/tree/plaid-x).

In [2]:
!pip install -U --progress-bar on datasets googletrans==3.1.0a0 PLAID-X

Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting googletrans==3.1.0a0
  Downloading googletrans-3.1.0a0.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Collecting PLAID-X
  Downloading PLAID_X-0.3.1-py3-none-any.whl (107 kB)
Collecting httpx==0.13.3 (from googletrans==3.1.0a0)
  Downloading httpx-0.13.3-py3-none-any.whl (55 kB)
Collecting hstspreload (from httpx==0.13.3->googletrans==3.1.0a0)
  Downloading hstspreload-2024.5.1-py3-none-any.whl (1.1 MB)

After installation, let's download the dataset. The [NeuCLIR 1 Collection](https://huggingface.co/datasets/neuclir/neuclir1) is publicly available on Huggingface Datasets! Topics and qrels are available on the TREC website, from which we will download them directly.

However, working with the entire NeuCLIR Chinese collection will take too much indexing time. For this demonstration, we'll just use the first 40k documents.

In [3]:
# Download topics and qrels from NIST
!wget -q --show-progress https://trec.nist.gov/data/neuclir/topics.0720.utf8.jsonl
!wget -q --show-progress https://trec.nist.gov/data/neuclir/2022-qrels.zho

import json
import pandas as pd
from tqdm.auto import tqdm

import ir_measures as irms
from datasets import load_dataset

# Only loading the first 40k docs from HF Datasets
ds = load_dataset('neuclir/neuclir1', split='zho', streaming=True) # total 3179209
doc_subset = [ o for i, o in zip(tqdm(range(40_000), desc='Loading first 40k docs from NeuCLIR Chinese Collection'), ds) ]
subset_doc_ids = set([ d['id'] for d in doc_subset ])

use_topic = '66' # use topic 66 as demo -- expecting to have 9 relevant docs

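# Keep only the judgments for our topic whose documents are in the 40k subset
qrels = pd.DataFrame([ l for l in irms.read_trec_qrels('2022-qrels.zho') if l.query_id == use_topic and l.doc_id in subset_doc_ids ])
# Read the topic file (one JSON object per line) and keep only our topic
with open("topics.0720.utf8.jsonl") as f:
    topics = [ t for t in map(json.loads, f) if t['topic_id'] == use_topic ]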



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/1.31k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/1.48k [00:00<?, ?B/s]

Loading first 40k docs from NeuCLIR Chinese Collection:   0%|          | 0/40000 [00:00<?, ?it/s]
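To make the subsetting step above more concrete, here is a minimal sketch of the same idea on toy data: take a prefix of a stream, then keep only the relevance judgments that point into that prefix. Everything in it (`fake_stream`, the `doc…` ids, the toy judgments) is made up for illustration and is not part of NeuCLIR or the `datasets` API.

```python
from itertools import islice

def fake_stream():
    # Hypothetical stand-in for a streaming dataset: yields one document at a time.
    for i in range(10):
        yield {"id": f"doc{i}", "text": f"text {i}"}

# Take only the first 4 documents without materializing the rest of the stream.
doc_subset_demo = list(islice(fake_stream(), 4))
subset_ids_demo = {d["id"] for d in doc_subset_demo}

# Toy judgments as (query_id, doc_id, relevance); doc7 lies outside the subset.
qrels_demo = [("66", "doc1", 3), ("66", "doc3", 1), ("66", "doc7", 3), ("12", "doc2", 1)]

# Keep judgments for our topic only, and only for documents we actually indexed.
kept = [q for q in qrels_demo if q[0] == "66" and q[1] in subset_ids_demo]
print(kept)
```

Filtering the qrels this way matters for the evaluation later on: a relevant document that was never indexed could not be retrieved, so leaving it in the judgments would lower recall for reasons unrelated to the model.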

## Indexing

In this tutorial, we use a Multilingual ColBERT-X model (`hltcoe/plaidx-large-neuclir-mtd-mix-passages-mt5xxl-engeng`) that is trained on Chinese, Persian, and Russian for the NeuCLIR Track. If you are interested in the details of the model, please refer to [Yang et al. (2024)](https://arxiv.org/abs/2405.00977).

In [4]:
from colbert.infra import ColBERTConfig

from colbert.data import Collection
from colbert import Indexer, Searcher

Since System RAM on Colab VM is quite limited, let's precompile C++ extensions to avoid peak memory usage exceeding the limit. This process will take around 3 minutes.

In [5]:
from colbert.indexing.codecs.residual import ResidualCodec
ResidualCodec.try_load_torch_extensions(True)

[May 27, 20:43:03] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...


If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].


[May 27, 20:44:49] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...


If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].


Alright, let's start actually indexing the collection.
We first create a collection object and the indexer. To avoid running out of memory, we cap the batch size at 64. If you are running on your own machine, you can potentially increase the batch size to speed up indexing. For this experiment, we use just 1 bit per coordinate (`nbits=1`) to store the residual between each vector and its closest cluster centroid.

If you wish to put the index in a different directory, you can specify that by passing the path to the `index_root` parameter in `ColBERTConfig`, e.g., `ColBERTConfig(bsize=64, index_root="./drive/MyDrive/plaid/")`.

In [6]:
collection = Collection.cast([ l['text'] for l in doc_subset ])
indexer = Indexer(checkpoint='hltcoe/plaidx-large-neuclir-mtd-mix-passages-mt5xxl-engeng', config=ColBERTConfig(bsize=64, nbits=1))

artifact.metadata:   0%|          | 0.00/2.79k [00:00<?, ?B/s]
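If the `nbits=1` setting feels abstract, the following toy sketch may help. It stores a vector as the id of its closest centroid plus one bit per coordinate for the residual. This is only an illustration under simplifying assumptions: the function names and the fixed `step` constant are made up, and the codec in the package is more involved than this.

```python
def nearest_centroid(vec, centroids):
    # Index of the centroid with the highest dot product with vec.
    scores = [sum(v * c for v, c in zip(vec, cen)) for cen in centroids]
    return max(range(len(centroids)), key=scores.__getitem__)

def compress(vec, centroids):
    # Store the centroid id plus one bit per coordinate: the sign of the residual.
    cid = nearest_centroid(vec, centroids)
    residual = [v - c for v, c in zip(vec, centroids[cid])]
    bits = [1 if r >= 0 else 0 for r in residual]
    return cid, bits

def decompress(cid, bits, centroids, step):
    # Rebuild an approximation by nudging the centroid up or down by `step`
    # (a made-up constant for this sketch).
    return [c + (step if b else -step) for c, b in zip(centroids[cid], bits)]

centroids = [[1.0, 0.0], [0.0, 1.0]]
vec = [0.9, 0.2]

cid, bits = compress(vec, centroids)
approx = decompress(cid, bits, centroids, step=0.15)
print(cid, bits, approx)
```

With more bits per coordinate, the residual can be placed into finer buckets, so the reconstruction gets closer to the original vector at the cost of a larger index.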

Indexing is broken into **three** parts. The preparation step first calculates the cluster centroids; the encoding step encodes the collection according to those centroids; the finalize step creates the inverted lookup file for the centroids and the passage ids.

In [7]:
# This command will run for about 10 minutes
indexer.prepare(name='neuclir.zho.40k', collection=collection, overwrite=True)



[May 27, 20:46:23] #> Creating directory /content/experiments/default/indexes/neuclir.zho.40k 


{
    "query_token_id": "[unused0]",
    "doc_token_id": "[unused1]",
    "query_token": "[Q]",
    "doc_token": "[D]",
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "only_approx": false,
    "index_path": null,
    "nbits": 1,
    "kmeans_niters": 4,
    "resume": false,
    "max_sampled_pid": -1,
    "max_num_partitions": -1,
    "use_lagacy_build_ivf": false,
    "reuse_centroids_from": null,
    "similarity": "cosine",
    "bsize": 32,
    "accumsteps": 1,
    "lr": 5e-6,
    "maxsteps": 400000,
    "save_every": null,
    "resume_optimizer": false,
    "fix_broken_optimizer_state": false,
    "warmup": null,
    "warmup_bert": null,
    "relu": false,
    "nway": 6,
    "n_query_alternative": 1,
    "use_ib_negatives": false,
    "kd_loss": "KLD",
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "model_name": "xl



config.json:   0%|          | 0.00/616 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/698 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/418 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

[May 27, 20:46:54] [0] 		 # of sampled PIDs = 35055 	 sampled_pids[:3] = [22257, 7091, 33814]
[May 27, 20:46:54] [0] 		 #> Encoding 35055 passages..
[May 27, 20:54:36] [0] 		 avg_doclen_est = 206.5302276611328 	 len(local_sample) = 35,055
[May 27, 20:54:46] [0] 		 Creaing 32,768 partitions.
[May 27, 20:54:46] [0] 		 *Estimated* 8,261,209 embeddings.
[May 27, 20:54:46] [0] 		 #> Saving the indexing plan to /content/experiments/default/indexes/neuclir.zho.40k/plan.json ..
{
    "query_token_id": "[unused0]",
    "doc_token_id": "[unused1]",
    "query_token": "[Q]",
    "doc_token": "[D]",
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "only_approx": false,
    "index_path": null,
    "nbits": 1,
    "kmeans_niters": 4,
    "resume": false,
    "max_sampled_pid": -1,
    "max_num_partitions": -1,
    "use_lagacy_build_ivf": false,
    "reuse_centroids_from": null,
    "similarity": "cosine",
    "bsize": 32,
    "accumsteps": 1,
    "lr": 5e-6,
    "maxs

'/content/experiments/default/indexes/neuclir.zho.40k'

In [8]:
# This command takes about 7 minutes.
indexer.encode(name='neuclir.zho.40k', collection=collection)

{
    "query_token_id": "[unused0]",
    "doc_token_id": "[unused1]",
    "query_token": "[Q]",
    "doc_token": "[D]",
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "only_approx": false,
    "index_path": null,
    "nbits": 1,
    "kmeans_niters": 4,
    "resume": true,
    "max_sampled_pid": -1,
    "max_num_partitions": -1,
    "use_lagacy_build_ivf": false,
    "reuse_centroids_from": null,
    "similarity": "cosine",
    "bsize": 32,
    "accumsteps": 1,
    "lr": 5e-6,
    "maxsteps": 400000,
    "save_every": null,
    "resume_optimizer": false,
    "fix_broken_optimizer_state": false,
    "warmup": null,
    "warmup_bert": null,
    "relu": false,
    "nway": 6,
    "n_query_alternative": 1,
    "use_ib_negatives": false,
    "kd_loss": "KLD",
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "model_name": "xlm-roberta-large",
    "force_resize_embeddings": true,
    "shuffle_passages": true,
    "sampling_m

0it [00:00, ?it/s]

[May 27, 20:57:01] [0] 		 #> Encoding 25000 passages..
[May 27, 21:02:39] [0] 		 #> Saving chunk 0: 	 25,000 passages and 5,167,426 embeddings. From #0 onward.


1it [05:42, 342.48s/it]

[May 27, 21:02:44] [0] 		 #> Encoding 15000 passages..
[May 27, 21:06:03] [0] 		 #> Saving chunk 1: 	 15,000 passages and 3,092,486 embeddings. From #25,000 onward.


2it [09:05, 272.56s/it]


'/content/experiments/default/indexes/neuclir.zho.40k'

In [9]:
indexer.finalize(name='neuclir.zho.40k', collection=collection)

{
    "query_token_id": "[unused0]",
    "doc_token_id": "[unused1]",
    "query_token": "[Q]",
    "doc_token": "[D]",
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "only_approx": false,
    "index_path": null,
    "nbits": 1,
    "kmeans_niters": 4,
    "resume": true,
    "max_sampled_pid": -1,
    "max_num_partitions": -1,
    "use_lagacy_build_ivf": false,
    "reuse_centroids_from": null,
    "similarity": "cosine",
    "bsize": 64,
    "accumsteps": 1,
    "lr": 5e-6,
    "maxsteps": 400000,
    "save_every": null,
    "resume_optimizer": false,
    "fix_broken_optimizer_state": false,
    "warmup": null,
    "warmup_bert": null,
    "relu": false,
    "nway": 6,
    "n_query_alternative": 1,
    "use_ib_negatives": false,
    "kd_loss": "KLD",
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "model_name": "xlm-roberta-large",
    "force_resize_embeddings": true,
    "shuffle_passages": true,
    "sampling_m

clean up local ivf files: 100%|██████████| 3/3 [00:00<00:00, 509.53it/s]

[May 27, 21:07:42] [0] 		 #> Saving the indexing metadata to /content/experiments/default/indexes/neuclir.zho.40k/metadata.json ..





'/content/experiments/default/indexes/neuclir.zho.40k'
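To give a feel for what the finalize step produces, here is a toy sketch of an inverted lookup from centroid ids to passage ids. The centroid assignments are invented for illustration, and the on-disk format used by the package is different; the point is only the shape of the mapping.

```python
from collections import defaultdict

# Hypothetical centroid id for every token embedding, listed per passage id.
passage_codes = {
    0: [2, 0, 2],
    1: [1, 2],
    2: [0, 0, 3],
}

# Invert the mapping: for each centroid, which passages contain a token assigned to it?
ivf = defaultdict(set)
for pid, codes in passage_codes.items():
    for cid in codes:
        ivf[cid].add(pid)
ivf = {cid: sorted(pids) for cid, pids in sorted(ivf.items())}
print(ivf)

# At search time, the centroids closest to the query tokens select candidate passages.
query_centroids = [1, 3]
candidates = sorted(set().union(*(ivf[c] for c in query_centroids)))
print(candidates)
```

In this toy case passage 0 is never looked at, because none of its tokens fall into the centroids selected by the query. Skipping passages like this is what keeps search over a large collection fast.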

And we are done! The default index location is `experiments/default/indexes/neuclir.zho.40k/`, but you can modify this by providing `index_root` to the `ColBERTConfig` object.

In [10]:
!ls ./experiments/default/indexes/neuclir.zho.40k/

0.codes.pt	 1.codes.pt	  avg_residual.pt  doclens.0.json  metadata.json
0.metadata.json  1.metadata.json  buckets.pt	   doclens.1.json  plan.json
0.residuals.pt	 1.residuals.pt   centroids.pt	   ivf.pid.pt	   sample.0.pt


## Searching

Finally, we search our index with a query. In this tutorial, we use topic `66` (COVID-19 vaccination rate in China) as an example.

In [11]:
searcher = Searcher(index='neuclir.zho.40k', collection=collection)

[May 27, 21:07:43] [0] 		 Loading model hltcoe/plaidx-large-neuclir-mtd-mix-passages-mt5xxl-engeng...
[May 27, 21:07:55] #> Loading codec...
[May 27, 21:07:55] #> Loading IVF...
[May 27, 21:07:55] #> Loading doclens...


100%|██████████| 2/2 [00:00<00:00, 302.22it/s]

[May 27, 21:07:55] #> Loading codes and residuals...



100%|██████████| 2/2 [00:00<00:00,  2.25it/s]


In [12]:
raw_scores = searcher.search_all({ t['topic_id']: t['topics'][0]['topic_title'] for t in topics }, k=2500)


#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . COVID-19 vaccination rate in China, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([     0, 250002,      5,   8244,  74116,   8363,  51294,   2320,  34515,
            23,   9098,      2, 250001, 250001, 250001, 250001, 250001, 250001,
        250001, 250001, 250001, 250001, 250001, 250001, 250001, 250001, 250001,
        250001, 250001, 250001, 250001, 250001])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])



100%|██████████| 1/1 [00:00<00:00,  2.88it/s]


We assemble the search results into the format that `ir_measures` likes to use and evaluate the search results with nDCG@20 and R@100.

In [13]:
run = {
    qid: {
        doc_subset[didx]['id']: score
        for didx, _, score in ranking
    }
    for qid, ranking in raw_scores.items()
}
irms.calc_aggregate([irms.nDCG@20, irms.R@100], qrels, run)

{nDCG@20: 0.8613562398590289, R@100: 0.8888888888888888}
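For readers new to these measures, here is a minimal sketch of how recall@k and nDCG@k can be computed by hand. It assumes the common formulation with linear gains and a `log2(rank + 1)` discount; `ir_measures` may differ in details such as tie handling, so treat this as an illustration rather than a reimplementation. The function names and toy data are made up.

```python
import math

def recall_at_k(ranked_ids, relevance, k):
    # Fraction of all relevant documents that appear in the top k.
    relevant = {d for d, r in relevance.items() if r > 0}
    hits = sum(1 for d in ranked_ids[:k] if d in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked_ids, relevance, k):
    def dcg(gains):
        # Gains at lower ranks are discounted logarithmically.
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))
    gains = [relevance.get(d, 0) for d in ranked_ids[:k]]
    ideal_dcg = dcg(sorted(relevance.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

relevance = {"a": 3, "b": 1, "c": 1}   # toy graded judgments
ranking = ["a", "x", "b", "y"]          # toy system output, best first

print(recall_at_k(ranking, relevance, k=4))  # 2 of the 3 relevant documents are retrieved
print(ndcg_at_k(ranking, relevance, k=4))
```

Read this way, the R@100 of about 0.889 reported above is consistent with 8 of the 9 relevant documents for topic 66 appearing in the top 100.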

Let's pull out the top five documents and see how good they are!

In [14]:
top5 = [
    {**doc_subset[didx], 'score': score, 'rank': rank}
    for didx, rank, score in raw_scores.data['66'][:5]
]

top5

[{'id': '857abc3e-5f0b-4c59-83e4-aadee93e63f2',
  'cc_file': 'crawl-data/CC-NEWS/2021/07/CC-NEWS-20210714102729-00668.warc.gz',
  'time': '2021-07-14T18:21:30+00:00',
  'title': '沒打疫苗步步難行 未接種者在中國多地出行受限',
  'text': '中國的COVID-19疫苗接種已逾14億劑次，隨著疫苗施打的普及，多地政府相繼對未接種者作出限制，過去一週，江西、浙江、安徽都發布告知說，沒接種疫苗將影響出行。\n\n據中國國家衛生健康委員會官網，截至13日，中國大陸2019冠狀病毒疾病（COVID-19）疫苗接種累計14億201萬9000劑次。\n\n隨著疫苗施打的普及，各地也對未接種民眾作出限制。\n\n綜合澎湃新聞等陸媒報導，浙江省寧波市寧海縣衛生和計畫生育局官方微信公眾號公布，疫情防控辦公室11日發布通知：25日起，原則上不允許未接種疫苗者進入醫療機構住院部、養老院、學校（幼兒園、托兒所、校外培訓機構）、圖書館、博物館、監所等重點場所。\n\n浙江麗水市青田縣8日也公布了類似通知：21日起，不允許未接種者進入醫療機構住院部、養老院、托兒所、學校（幼兒園、校外培訓機構）、圖書館、博物館、監所等重點場所。\n\n江西省撫州市崇仁縣的最新通知則說，將在商場、景區、車站、影院等公共場所實行掃「贛通碼」查看疫苗接種紀錄，居民若未接種疫苗，將對生活和出行帶來不便。\n\n另據江西贛州定南縣的通告，26日起，不允許未接種疫苗者進入超市、醫院、學校、車站等；贛州的安遠縣也發通告指，26日起不允許未接種疫苗者進入超市、醫院、學校、車站等重點公共場所。\n\n安徽省黃山市休寧縣衛健委昨天也通知，8月1日起將在全縣範圍內，對出入超市、市場、銀行、賓館酒店、電影院、醫院、藥店、理髮店、政務大廳等各類公共場所人員和乘坐公車的市民展開疫苗接種查驗（安康碼）。',
  'url': 'https://udn.com/news/story/121707/5601597',
  'score': 27.65625,
  'rank': 1},
 {'id': '02c17cc0-d97f-4640-a07e-9cdc6

Well, you might not be able to read Chinese (exactly why we need CLIR!). But we can use Google Translate!

In [15]:
from googletrans import Translator
translator = Translator()

def translate(text):
    # Translate Chinese text into English for display purposes only
    return translator.translate(text, src='zh-tw', dest='en').text

[
    {**d, 'title': translate(d['title']), 'text': translate(d['text'])}
    for d in top5
]

[{'id': '857abc3e-5f0b-4c59-83e4-aadee93e63f2',
  'cc_file': 'crawl-data/CC-NEWS/2021/07/CC-NEWS-20210714102729-00668.warc.gz',
  'time': '2021-07-14T18:21:30+00:00',
  'title': 'It’s difficult to move around without vaccination. Travel for unvaccinated people is restricted in many places in China.',
  'text': 'China has administered more than 1.4 billion doses of the COVID-19 vaccine. As vaccination becomes more popular, governments in many places have successively imposed restrictions on those who have not been vaccinated. In the past week, Jiangxi, Zhejiang, and Anhui have all issued notices saying that those who have not been vaccinated will Affect travel.\n\nAccording to the official website of the National Health Commission of China, as of the 13th, a total of 1.42019 million doses of coronavirus disease 2019 (COVID-19) vaccinations have been administered in mainland China.\n\nAs vaccination becomes more popular, various places have also imposed restrictions on unvaccinated peopl

# Looking into ColBERT Model...

Let's look at the model and see how ColBERT matches queries and documents.

In [20]:
import torch

In [21]:
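# We first get the model. The searcher has already loaded the checkpoint, so we reuse it.
checkpoint = searcher.checkpoint
# Alternatively, load the checkpoint directly:
# from colbert.modeling.checkpoint import Checkpoint
# checkpoint = Checkpoint("hltcoe/plaidx-large-neuclir-mtd-mix-passages-mt5xxl-engeng", colbert_config=ColBERTConfig())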

Print out the query and the document that we use as an example.

In [22]:
print(f"Query: {topics[0]['topics'][0]['topic_title']}")
print(f"Doc: {doc_subset[0]['title']}")

Query: COVID-19 vaccination rate in China
Doc: 欧科云链链上大师重磅上线，一起来用“链上Bloomberg”听听行业脉搏跳动


We then encode them using the `.queryFromText` and `.docFromText` methods.

In [23]:
Q = checkpoint.queryFromText([topics[0]['topics'][0]['topic_title']])
D = checkpoint.docFromText([doc_subset[0]['title']])

print(f"Q --> {Q}")
print(f"D --> {D}")

Q --> tensor([[[ 0.1107,  0.0228,  0.0166,  ...,  0.0965,  0.1255, -0.0602],
         [ 0.0743, -0.0375, -0.0051,  ...,  0.0833,  0.1296, -0.0697],
         [ 0.1155, -0.0507, -0.0123,  ...,  0.1050,  0.1711, -0.1309],
         ...,
         [ 0.0731,  0.0374, -0.0623,  ...,  0.0700,  0.1603,  0.0215],
         [ 0.0772,  0.0159, -0.0497,  ...,  0.0774,  0.1268,  0.0376],
         [ 0.0658, -0.0356, -0.1141,  ...,  0.0390,  0.1180,  0.0027]]],
       device='cuda:0')
D --> tensor([[[ 0.0589, -0.0119,  0.1035,  ...,  0.0645,  0.1656, -0.2026],
         [ 0.0608, -0.0355,  0.0547,  ...,  0.0817,  0.1698, -0.1680],
         [ 0.0948, -0.0541,  0.0352,  ...,  0.0768,  0.1760, -0.1558],
         ...,
         [ 0.0714, -0.0683,  0.1262,  ..., -0.0561,  0.0774, -0.2094],
         [ 0.0142,  0.0054,  0.0979,  ...,  0.0014,  0.1449, -0.2135],
         [ 0.1066, -0.0522,  0.0349,  ...,  0.0771,  0.1899, -0.1544]]],
       device='cuda:0', dtype=torch.float16)


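Let's use `torch` to implement the MaxSim operator: for each query token, we take the maximum similarity over all document tokens and then sum these maxima over the query tokens.
We drop the first dimension (the batch dimension) for convenience.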

In [24]:
(D[0] @ Q.half()[0].T).max(axis=0).values.sum()

tensor(20.5000, device='cuda:0', dtype=torch.float16)
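If the one-liner above is a bit dense, here is the same MaxSim computation spelled out in plain Python on a tiny made-up example (two query tokens, three document tokens, two dimensions). The vectors are invented for illustration and have nothing to do with the model's real embeddings.

```python
def maxsim(Q, D):
    # For every query token, keep its best-matching document token, then sum.
    total = 0.0
    for q in Q:
        total += max(sum(qi * di for qi, di in zip(q, d)) for d in D)
    return total

Q_toy = [[1.0, 0.0], [0.0, 1.0]]                 # 2 query tokens
D_toy = [[0.5, 0.5], [1.0, 0.0], [0.0, -1.0]]    # 3 document tokens

# First query token matches the second document token best (similarity 1.0);
# second query token matches the first document token best (similarity 0.5).
print(maxsim(Q_toy, D_toy))  # 1.5
```

Note that the maximum is taken over document tokens and the sum over query tokens, which is why the `torch` version reduces with `max` along the document axis before calling `sum`.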

We can also use the `colbert_score` function implemented in the package to verify the score. The function expects a `mask` argument that tells it which document tokens are padding introduced by batching. Here, we provide a dummy mask of all ones since no tokens are masked in this simple example.
The next cell should give exactly the same value.

In [25]:
from colbert.modeling.colbert import colbert_score
colbert_score(Q, D, torch.ones_like(D)[:, :, [0]])

tensor([20.5000], device='cuda:0', dtype=torch.float16)
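To see why the mask matters once documents are padded to a common length, consider this small sketch. It is a plain-Python illustration with made-up vectors, not the package's implementation; `masked_maxsim` is a hypothetical helper name.

```python
def masked_maxsim(Q, D, mask):
    # Same as MaxSim, but document tokens whose mask entry is 0 are ignored.
    total = 0.0
    for q in Q:
        sims = [sum(qi * di for qi, di in zip(q, d)) for d, keep in zip(D, mask) if keep]
        total += max(sims)
    return total

Q_toy = [[-1.0, 0.0]]                  # one query token
D_padded = [[1.0, 0.0], [0.0, 0.0]]    # one real document token, one all-zero padding row

print(masked_maxsim(Q_toy, D_padded, mask=[1, 1]))  # 0.0: the padding row wrongly wins the max
print(masked_maxsim(Q_toy, D_padded, mask=[1, 0]))  # -1.0: only the real token is considered
```

Without the mask, a padding row with similarity 0 can beat a real token whose similarity is negative, which would inflate the score of short documents in a batch.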

# Practice

Now let's try experimenting with a translate-trained model! We also released our Chinese translate-trained CLIR ColBERT-X model on Huggingface `eugene-yang/colbertx-xlmr-large-tt-zho`. Can you index the 40k subset with 4 bits for the residuals and run the same evaluation?

You may need to restart the runtime to avoid going over the VM RAM limit.

In [None]:
# Your solution



In [None]:
#@title Solution

# indexing
collection = Collection.cast([ l['text'] for l in doc_subset ])
indexer = Indexer(checkpoint='eugene-yang/colbertx-xlmr-large-tt-zho', config=ColBERTConfig(bsize=64, nbits=4))
indexer.prepare(name='neuclir.zho.40k.tt.4bits', collection=collection, overwrite=True)
indexer.encode(name='neuclir.zho.40k.tt.4bits', collection=collection)
indexer.finalize(name='neuclir.zho.40k.tt.4bits', collection=collection)

# searching
searcher = Searcher(index='neuclir.zho.40k.tt.4bits', collection=collection)  # search the index we just built
raw_scores = searcher.search_all({ t['topic_id']: t['topics'][0]['topic_title'] for t in topics }, k=2500)
run = {
    qid: {
        doc_subset[didx]['id']: score
        for didx, _, score in ranking
    }
    for qid, ranking in raw_scores.items()
}
irms.calc_aggregate([irms.nDCG@20, irms.R@100], qrels, run)

# And there you go!

You've learned how to run a CLIR experiment with PLAID-X!