# JHU JSALT Summer School IR Laboratory -- Part 3.1

This notebook borrows mainly from the series of Colab notebooks created for the SIGIR 2023 tutorial '**Neural Methods for CLIR Tutorial**'. For more information, please visit the tutorial's [repository](https://github.com/hltcoe/clir-tutorial).

In this notebook, we walk through a cross-language information retrieval (CLIR) example that uses a Translate-trained cross-encoder to rerank an existing ranked list on the NeuCLIR Chinese collection.

## Get Started

The following cell checks whether this notebook has GPU access. After running it, you should see a table with Nvidia GPU information. If you see a command error instead, you are running on either a CPU or a TPU VM. In that case, switch to a GPU using Runtime > Change runtime type.
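
The same check can also be done from Python. The sketch below is not part of the original tutorial and is only a heuristic: it assumes that a working `nvidia-smi` binary on the `PATH` means a GPU is attached, and `gpu_visible` is a hypothetical helper name.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Heuristic check: True if `nvidia-smi` is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No driver tooling found: most likely a CPU or TPU runtime.
        return False
    try:
        # `nvidia-smi -L` lists the attached GPUs, one per line.
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU visible:", gpu_visible())
```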

In [None]:
!nvidia-smi

Thu Jul  6 20:03:26 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Let's install the packages!
The following command installs `ir_measures`, Huggingface Transformers, Huggingface `datasets`, and Google Translate (for presentation).

In [None]:
!pip install -q -U --progress-bar on ir_measures transformers datasets googletrans==3.1.0a0


After installation, let's download the dataset. The [NeuCLIR 1 Collection](https://huggingface.co/datasets/neuclir/neuclir1) is publicly available on Huggingface Datasets! Topics and qrels are available on the TREC website, from which we will download them directly.

We are going to rerank a baseline BM25 search result provided by the NeuCLIR organizers; this is also available on the TREC website.

However, working with the entire NeuCLIR Chinese collection will take too much indexing time. For this demonstration, we'll just use the first 40k documents.
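
Taking only the first N records from a stream is what keeps this cheap: nothing past the Nth record is ever read. Here is a minimal sketch of that idea using only the standard library. `fake_stream` is a hypothetical stand-in for the streaming dataset, and its records are made up for illustration.

```python
from itertools import islice
from typing import Dict, Iterator


def fake_stream(n_total: int) -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for a streaming dataset: yields records lazily."""
    for i in range(n_total):
        yield {"id": f"doc-{i}", "title": f"title {i}", "text": f"text {i}"}


stream = fake_stream(1_000_000)

# islice stops pulling from the stream after 5 records,
# so the remaining 999,995 records are never generated.
doc_subset_demo = list(islice(stream, 5))
subset_ids_demo = {d["id"] for d in doc_subset_demo}

print(len(doc_subset_demo), sorted(subset_ids_demo))
```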

In [None]:
# Download topics and qrels from NIST
!wget -q --show-progress https://trec.nist.gov/data/neuclir/topics.0720.utf8.jsonl
!wget -q --show-progress https://trec.nist.gov/data/neuclir/2022-qrels.zho
!wget -q --show-progress https://trec.nist.gov/data/neuclir/zho-base-run-results.txt

import json
import pandas as pd
from tqdm.auto import tqdm

import ir_measures as irms
from datasets import load_dataset

# Only loading the first 40k docs from HF Datasets
ds = load_dataset('neuclir/neuclir1', split='zho', streaming=True) # total 3179209
doc_subset = [ o for i, o in zip(tqdm(range(40_000), desc='Loading first 40k docs from NeuCLIR Chinese Collection'), ds) ]
subset_doc_ids = set([ d['id'] for d in doc_subset ])

use_topic = '66' # use topic 66 as demo -- expecting to have 9 relevant docs

qrels = pd.DataFrame([ l for l in irms.read_trec_qrels('2022-qrels.zho') if l.query_id == use_topic and l.doc_id in subset_doc_ids ])
topics = [ t for t in map(json.loads, open("topics.0720.utf8.jsonl")) if t['topic_id'] == use_topic ]




Here we create helper functions so we can obtain the query and document text more conveniently.

In [None]:
topic_id_idx = { t['topic_id']: i for i, t in enumerate(topics) }
def get_query_by_topic_id(topic_id, query_type='title'):
    return topics[ topic_id_idx[topic_id] ]['topics'][0][f'topic_{query_type}']

doc_id_to_idx = { d['id']: i for i, d in enumerate(doc_subset) }
def get_doc_text_by_doc_id(doc_id):
    doc = doc_subset[ doc_id_to_idx[doc_id] ]
    return doc['title'] + ' ' + doc['text']
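
To make the lookup above easier to follow, here is a toy version of the query helper. The record is hypothetical: it has only the shape that the helper's indexing implies (a `topic_id`, plus a `topics` list whose first entry holds `topic_title`), and its values are made up. The real topic file may carry more fields.

```python
# Hypothetical toy record, shaped the way the helper above indexes into it.
toy_topics = [
    {
        "topic_id": "0",
        "topics": [
            {"topic_title": "example title", "topic_description": "example description"},
        ],
    }
]

# Map each topic ID to its position in the list for constant-time lookup.
toy_topic_idx = {t["topic_id"]: i for i, t in enumerate(toy_topics)}


def toy_get_query(topic_id: str, query_type: str = "title") -> str:
    """Return one field (title, description, ...) of the first entry of a topic."""
    return toy_topics[toy_topic_idx[topic_id]]["topics"][0][f"topic_{query_type}"]


print(toy_get_query("0"))
print(toy_get_query("0", query_type="description"))
```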

## Cross-Encoder Reranker

A cross-encoder is a straight-up Transformer model that performs sequence pair classification: the query and the document are fed into the model together, and the classification output is used as the relevance score.
The following creates a `CrossEncoderReranker` class that makes it easier to feed query-document pairs into the model. You could also use the generic sequence classification interface from Huggingface directly to do the same thing!

In [None]:
import torch
import torch.nn.functional as F

from tqdm.auto import tqdm
from typing import List, Tuple, Union

from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer,
                          PreTrainedModel,
                          PreTrainedTokenizer)
from transformers.modeling_outputs import SequenceClassifierOutput
from transformers.tokenization_utils import BatchEncoding

class CrossEncoderReranker(torch.nn.Module):
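    def __init__(self, model: PreTrainedModel, tokenizer: Union[str, PreTrainedTokenizer, None] = None):
        super().__init__()

        self.model = model
        # Accept a tokenizer object or a tokenizer name; otherwise fall back to
        # the checkpoint the model was loaded from (model_type alone, e.g.
        # "xlm-roberta", is generally not a loadable checkpoint name).
        if tokenizer is None or isinstance(tokenizer, str):
            tokenizer = AutoTokenizer.from_pretrained(tokenizer or model.config.name_or_path)
        self.tokenizer = tokenizer
        self.eval()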

    @property
    def config(self):
        return self.model.config

    @property
    def device(self):
        return self.model.device

    @staticmethod
    def _logits_to_scores(logits: torch.Tensor) -> torch.Tensor:
        return F.log_softmax(logits, dim=1)[:, 1].contiguous()

    def prepare_rerank(self, query: str, candidates: List[str]):
        return self.tokenize_pairs([ (query, c) for c in candidates ])

    def rerank(self, query: str, candidates: List[str], batch_size=100, progress_bar=True):
        return torch.concat([
            self.forward(
                **self.prepare_rerank(query, candidates[i: i+batch_size]).to(self.device)
            )
            for i in tqdm(range(0, len(candidates), batch_size),
                          disable=not progress_bar)
        ])

    def tokenize_pairs(self, batch: List[Tuple[str, str]]) -> BatchEncoding:
        return self.tokenizer(
            batch,
            return_tensors='pt',
            max_length=256,
            padding='max_length',
            truncation='longest_first'
        )

    def forward(self, *args, **kwargs):
        scores: SequenceClassifierOutput = self.model(*args, **kwargs)
        if scores.logits.shape[1] == 1:
            return scores.logits[:, 0].contiguous()
        elif scores.logits.shape[1] == 2:
            return self._logits_to_scores(scores.logits)

        raise ValueError(f"Unrecognized logit shape {scores.logits.shape}")

    @classmethod
    def load(cls, model_name_or_path, tokenizer_name=None):
        return cls(
            AutoModelForSequenceClassification.from_pretrained(model_name_or_path),
            AutoTokenizer.from_pretrained(tokenizer_name or model_name_or_path, use_fast=True)
        )
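
For two-logit models, `_logits_to_scores` uses the log-probability of the positive class as the relevance score. The standard-library sketch below (no PyTorch; `log_softmax_positive` is a hypothetical name, and the logits are made up) shows what that computes, and that ranking by this score matches ranking by the margin between the two logits.

```python
import math
from typing import Tuple


def log_softmax_positive(logit_pair: Tuple[float, float]) -> float:
    """Log-probability of the positive class (index 1) for a pair of logits."""
    neg, pos = logit_pair
    # Subtract the max before exponentiating for numerical stability.
    m = max(neg, pos)
    log_norm = m + math.log(math.exp(neg - m) + math.exp(pos - m))
    return pos - log_norm


# Made-up (negative, positive) logits for three candidate documents.
pairs = [(2.0, -1.0), (0.0, 0.0), (-1.5, 3.0)]
scores = [log_softmax_positive(p) for p in pairs]

# The score is a monotonic function of (pos - neg), so both orderings agree.
order_by_score = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
order_by_margin = sorted(range(len(pairs)), key=lambda i: pairs[i][1] - pairs[i][0], reverse=True)

print(scores)
print(order_by_score, order_by_margin)
```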


The model ([`eugene-yang/ce-xlmr-large-clir-eng.zho`](https://huggingface.co/eugene-yang/ce-xlmr-large-clir-eng.zho)) we use here is trained with English queries and translated documents from the MS-MARCO dataset.

Let's load the model into the GPU.

In [None]:
reranker = CrossEncoderReranker.load('eugene-yang/ce-xlmr-large-clir-eng.zho').to('cuda')


Now let's load the baseline ranked list. Since we are only working with the 40k subset, we also need to drop from the initial ranked list any documents that fall outside those 40k.
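
As a rough illustration of that filtering step, the sketch below parses lines in the standard six-column TREC run format (`query_id Q0 doc_id rank score tag`) and keeps only one topic's documents that are in the subset. `RunLine` and `parse_run` are hypothetical names and the run lines are made up; the notebook itself relies on `irms.read_trec_run` instead.

```python
from typing import Iterable, Iterator, NamedTuple


class RunLine(NamedTuple):
    query_id: str
    doc_id: str
    score: float


def parse_run(lines: Iterable[str]) -> Iterator[RunLine]:
    """Parse whitespace-separated TREC run lines, skipping malformed ones."""
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue
        query_id, _q0, doc_id, _rank, score, _tag = parts
        yield RunLine(query_id, doc_id, float(score))


# Made-up run lines and subset, for illustration only.
toy_run = [
    "66 Q0 doc-a 1 12.5 bm25",
    "66 Q0 doc-b 2 11.0 bm25",
    "67 Q0 doc-a 1 9.0 bm25",
]
toy_subset = {"doc-a"}

# Keep only topic 66 and only documents inside the subset.
kept = [r for r in parse_run(toy_run) if r.query_id == "66" and r.doc_id in toy_subset]
print(kept)
```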

Scored against the filtered qrels, this BM25 result is just OK, with an nDCG@20 of about 0.17.
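
For intuition about what nDCG@20 measures, here is a small sketch of one common definition: linear gain with a log2 rank discount, normalized by the ideal ordering. This is an assumption for illustration; `ir_measures` may handle details such as ties or negative judgments differently, and the toy judgments below are made up.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_doc_ids: List[str], qrels: Dict[str, int], k: int = 20) -> float:
    """DCG of the top-k ranking divided by the DCG of the ideal top-k ranking."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal_gains = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Made-up judgments: d1 is highly relevant, d2 somewhat relevant, d3 not relevant.
toy_qrels = {"d1": 3, "d2": 1, "d3": 0}
good_ranking = ["d1", "d2", "d3", "d4"]
bad_ranking = ["d4", "d3", "d2", "d1"]

print(ndcg_at_k(good_ranking, toy_qrels))
print(ndcg_at_k(bad_ranking, toy_qrels))
```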

In [None]:
to_rerank = pd.DataFrame([ l for l in irms.read_trec_run('zho-base-run-results.txt') if l.query_id == use_topic and l.doc_id in subset_doc_ids ])

irms.calc_aggregate([irms.nDCG@20, irms.AP], qrels, to_rerank)

{AP: 0.07383512544802867, nDCG@20: 0.17465294461817227}

Let's rerank this run with the cross-encoder.

You can see that we get much better search results!
Now the nDCG@20 has jumped up to 0.47!  

In [None]:
rerank_scores = {}

with torch.inference_mode():
    for query_id, d in to_rerank.groupby('query_id'):
        raw_scores = reranker.rerank(
            query=get_query_by_topic_id(query_id),
            candidates=d.doc_id.map(get_doc_text_by_doc_id).tolist()
        ).cpu().tolist()

        rerank_scores[query_id] = dict(zip(d.doc_id, raw_scores))

irms.calc_aggregate([irms.nDCG@20, irms.AP], qrels, rerank_scores)

{AP: 0.30090090090090094, nDCG@20: 0.4742708106036133}
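
The `rerank_scores` dictionary maps each query ID to a `{doc_id: score}` dictionary, which is unordered. To look at the reranked list itself, sort the documents by score. Here is a minimal sketch with made-up scores; `top_k` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple


def top_k(scores: Dict[str, float], k: int = 20) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Made-up log-probability scores for three documents.
toy_scores = {"doc-a": -3.2, "doc-b": -0.1, "doc-c": -1.7}

for rank, (doc_id, score) in enumerate(top_k(toy_scores, k=2), start=1):
    print(rank, doc_id, score)
```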

## Exercise
There are other cross-encoders out there that can do CLIR reranking.

`cross-encoder/mmarco-mMiniLMv2-L12-H384-v1` from Sentence-Transformers is an example. You can try using this model to rerank as well!

Try it out yourself here:

In [None]:
# Your solution



In [None]:
#@title See Solution

sbert = CrossEncoderReranker.load('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1').to('cuda')
rerank_scores = {}

with torch.inference_mode():
    for query_id, d in to_rerank.groupby('query_id'):
        raw_scores = sbert.rerank(
            query=get_query_by_topic_id(query_id),
            candidates=d.doc_id.map(get_doc_text_by_doc_id).tolist()
        ).cpu().tolist()

        rerank_scores[query_id] = dict(zip(d.doc_id, raw_scores))

irms.calc_aggregate([irms.nDCG@20, irms.AP], qrels, rerank_scores)


## And there you go!

You've learned how to run a simple cross-encoder reranker for CLIR!