# JHU JSALT Summer School IR Laboratory -- Part 4.1

In this notebook, we walk through a cross-language information retrieval (CLIR) example using a dense passage retrieval (DPR) model on the NeuCLIR Chinese collection.

## Get Started

The following cell checks whether this notebook has GPU access. Upon execution you should see a table with Nvidia GPU information. If you see a command error instead, you are running on either a CPU or a TPU VM. In that case, switch to a GPU using Runtime > Change runtime type.

In [None]:
!nvidia-smi

Sun Jun 16 19:35:45 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   65C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

Let's install the packages!
The following command installs `ir_measures`, Hugging Face `datasets` and `transformers`, Google Translate (for presentation), `more-itertools`, and `faiss-gpu`.

In [None]:
!pip install -q -U --progress-bar on ir_measures transformers datasets googletrans==3.1.0a0 more-itertools faiss-gpu


After installation, let's download the dataset. The [NeuCLIR 1 Collection](https://huggingface.co/datasets/neuclir/neuclir1) is publicly available on Hugging Face Datasets! Topics and qrels are available on the TREC website, from which we will download them directly.

We are going to rerank a baseline BM25 search result provided by the NeuCLIR organizers; this is also available on the TREC website.

However, working with the entire NeuCLIR Chinese collection will take too much indexing time. For this demonstration, we'll just use the first 40k documents.

In [None]:
# Download topics and qrels from NIST
!wget -q --show-progress https://trec.nist.gov/data/neuclir/topics.0720.utf8.jsonl
!wget -q --show-progress https://trec.nist.gov/data/neuclir/2022-qrels.zho

import json
import pandas as pd
from tqdm.auto import tqdm

import ir_measures as irms
from datasets import load_dataset

# Only loading the first 40k docs from HF Datasets
ds = load_dataset('neuclir/neuclir1', split='zho', streaming=True) # total 3179209
doc_subset = [ o for i, o in zip(tqdm(range(40_000), desc='Loading first 40k docs from NeuCLIR Chinese Collection'), ds) ]
subset_doc_ids = set([ d['id'] for d in doc_subset ])

use_topic = '66' # use topic 66 as demo -- expecting to have 9 relevant docs

qrels = pd.DataFrame([ l for l in irms.read_trec_qrels('2022-qrels.zho') if l.query_id == use_topic and l.doc_id in subset_doc_ids ])
topics = [ t for t in map(json.loads, open("topics.0720.utf8.jsonl")) if t['topic_id'] == use_topic ]




Here we create helper functions so we can obtain the query and document text more conveniently.

In [None]:
topic_id_idx = { t['topic_id']: i for i, t in enumerate(topics) }
def get_query_by_topic_id(topic_id, query_type='title'):
    return topics[ topic_id_idx[topic_id] ]['topics'][0][f'topic_{query_type}']

doc_id_to_idx = { d['id']: i for i, d in enumerate(doc_subset) }
def get_doc_text_by_doc_id(doc_id):
    doc = doc_subset[ doc_id_to_idx[doc_id] ]
    return doc['title'] + ' ' + doc['text']

# Dense Retrieval with One Vector Per Sequence

Now we are ready to start working on encoding our documents.

Let's first load a model that was pretrained with text in multiple languages and capable of encoding the meaning of the text into a vector.

In this example, we use `eugene-yang/dpr-xlm-align-C3-zhen`, but there are a lot of models out there that can do the job.

In [None]:
import torch
from transformers import AutoModel, AutoTokenizer

import faiss

from more_itertools import batched

model_name = 'eugene-yang/dpr-xlm-align-C3-zhen'

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModel.from_pretrained(model_name)
model = model.half().to('cuda')


After loading the model, let's define the `mean_pooling` function suggested by [`sentence-transformers`](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2). The pooling function varies from model to model, so check which one a pretrained model expects before using it.

In [None]:
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
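
To make the pooling step concrete, here is a minimal pure-Python sketch of the same idea on a toy sequence. The helper `masked_mean` and the toy vectors are hypothetical and only illustrate the arithmetic; the notebook itself uses the `torch` version above.

```python
def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1 (real tokens)."""
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for j, x in enumerate(vec):
                sums[j] += x
    # Same guard against division by zero as torch.clamp(..., min=1e-9)
    n_real = max(sum(attention_mask), 1e-9)
    return [s / n_real for s in sums]

# Toy sequence: two real tokens followed by one padding token.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean(token_vectors, attention_mask)
first_token = token_vectors[0]  # first-token pooling, the alternative to averaging

print(pooled)       # [2.0, 3.0]
print(first_token)  # [1.0, 2.0]
```

The padding vector never contributes to the average, which is why the attention mask is passed to the pooling function.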


We also want fast search, for which we rely on `faiss` to do vector nearest-neighbor search. In this demo, we use an index setup that performs exact search, which is slow for large collections. If you would like something faster, you could use other setups, e.g. `PQ384x4fs`. See https://github.com/facebookresearch/faiss/wiki/The-index-factory for more.

The first argument a faiss index needs is the dimension of the vectors, which we can get by doing a mini test run with an empty string.

Here, since `faiss` only maintains a running integer index of the vectors, we need to manually maintain a mapping between the running id and the actual document ids. You could also use `IndexIDMap` provided by `faiss`, but for simplicity we just use a Python list here.
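
As a rough illustration of what an exact index plus a manual id mapping does, here is a toy pure-Python stand-in. `TinyFlatIndex` and the `toy_*` names are hypothetical and are not part of `faiss`; the sketch assumes squared L2 distance and simply scans every stored vector.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class TinyFlatIndex:
    """Toy exact index: stores vectors in insertion order and scans them all."""
    def __init__(self):
        self.vectors = []

    def add(self, vectors):
        self.vectors.extend(vectors)

    def search(self, query, k):
        # Sort by distance; the second tuple item is the running position.
        scored = sorted((squared_l2(query, v), i) for i, v in enumerate(self.vectors))
        top = scored[:k]
        return [d for d, _ in top], [i for _, i in top]

toy_index = TinyFlatIndex()
toy_docid_mapping = []  # position in the index -> document id

batch = [("doc-a", [0.0, 0.0]), ("doc-b", [1.0, 1.0]), ("doc-c", [3.0, 0.0])]
docids, vectors = zip(*batch)
toy_index.add(vectors)
toy_docid_mapping += docids  # must be extended in the same order as the vectors

distances, positions = toy_index.search([2.0, 0.0], k=2)
print(distances)                                  # [1.0, 2.0]
print([toy_docid_mapping[i] for i in positions])  # ['doc-c', 'doc-b']
```

The mapping only works if vectors and document ids are appended in lockstep, which is why the indexing loop adds both in the same iteration.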



In [None]:
with torch.no_grad():
    dimension = model(**tokenizer("", return_tensors='pt').to(model.device)).last_hidden_state.shape[-1]
print(dimension)

768


In [None]:
index = faiss.index_factory(dimension, "Flat")
docid_mapping = []

Now we are ready to encode the documents.

Here we use a very simple data loader that batches the documents. You could certainly use a PyTorch `DataLoader`, but we keep things simple here.

A trick when working with padding is to sort the input texts from short to long, so that batches of shorter texts are not padded excessively.
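
The effect of sorting can be checked with a small pure-Python sketch. The helper `padded_tokens` and the token counts are hypothetical; it assumes each batch is padded to its longest item, as `padding='longest'` does.

```python
def padded_tokens(lengths, batch_size):
    """Total padding tokens when every batch is padded to its longest item."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total

lengths = [5, 120, 8, 110, 6, 115]  # made-up token counts, short and long mixed

unsorted_padding = padded_tokens(lengths, batch_size=3)
sorted_padding = padded_tokens(sorted(lengths), batch_size=3)

print(unsorted_padding)  # 341
print(sorted_padding)    # 20
```

With mixed batches, every short text is padded up to a long neighbor; after sorting, short texts share batches with other short texts.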

In [None]:
batch_size = 256
loader = tqdm(
    batched(sorted(map(lambda x: (x['id'], f"{x['title']}\n{x['text']}"), doc_subset), key=lambda x: len(x[1])),
            batch_size),
    total=len(doc_subset)/batch_size
)

with torch.no_grad():
    for batch in loader:
        docids, text = list(zip(*batch))
        inputs = tokenizer(text, padding='longest', max_length=512, truncation=True, return_tensors='pt')
        model_output = model(**inputs.to(model.device))
        # text_embeddings = mean_pooling(model_output, inputs['attention_mask']).cpu()
        text_embeddings = model_output.last_hidden_state[:, 0].cpu()

        index.add(text_embeddings.numpy().astype('float32'))
        docid_mapping += docids



We can verify how many vectors we have added to the index by using the following command.

In [None]:
index.ntotal

40000

And we are ready for some searching!
We first need to get the query text and encode it.

In [None]:
query_text = get_query_by_topic_id(use_topic) + ' ' + get_query_by_topic_id(use_topic, query_type='description')

print(query_text)

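with torch.no_grad():
    inputs = tokenizer(query_text, padding=True, truncation=True, return_tensors='pt')
    # Note: the query is pooled with mean_pooling here, while the documents above
    # were indexed with the first-token vector (last_hidden_state[:, 0]).
    query_embeddings = mean_pooling(model(**inputs.to(model.device)), inputs['attention_mask']).cpu()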

print(query_embeddings.shape)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


COVID-19 vaccination rate in China I am interested in finding articles that provide information about the COVID-19 vaccination rate in China.
torch.Size([1, 768])


Then we can search the `faiss` index with the query embeddings.
It returns two `numpy` arrays: the distances (squared L2 for this `Flat` index, so smaller means closer) and the running indices of the matching vectors.

In [None]:
D, I = index.search(query_embeddings.numpy(), k=1000)

We can then map them to the original document id by using the `docid_mapping` that we maintained during indexing.

And finally, we can evaluate the performance.

In [None]:
ranking = {
    docid_mapping[int(i)]: float(v)
    for i, v in zip(I[0], -D[0])
}

irms.calc_aggregate([irms.nDCG@20, irms.AP, irms.R@100], qrels, {use_topic: ranking})

{nDCG@20: 0.07075576072108061,
 R@100: 0.5555555555555556,
 AP: 0.05835123228926886}

Well, that's not so great, is it? However, recall is decent, so we could pair this with a reranker (though most of the time you would just want to use BM25...).
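
For intuition about two of the measures reported above, here is a small pure-Python sketch of recall at a cutoff and average precision on a toy ranking. It assumes binary relevance and uses made-up scores; the helper names are hypothetical, and the notebook itself relies on `ir_measures` for the real numbers.

```python
def recall_at_k(ranked_doc_ids, relevant, k):
    """Fraction of relevant documents found in the top k."""
    hits = sum(1 for d in ranked_doc_ids[:k] if d in relevant)
    return hits / len(relevant)

def average_precision(ranked_doc_ids, relevant):
    """Mean of precision values taken at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked_doc_ids, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Scores are negated distances, so a higher score means a closer document.
scores = {"d1": -0.2, "d2": -0.9, "d3": -0.5, "d4": -0.1}
relevant = {"d1", "d2"}

ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)                               # ['d4', 'd1', 'd3', 'd2']
print(recall_at_k(ranked, relevant, k=2))   # 0.5
print(average_precision(ranked, relevant))  # 0.5
```

Recall only asks whether relevant documents appear above the cutoff, while average precision also rewards placing them near the top, which is why the two can differ so much.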

# Exercise
Well, you might have already realized that the tokenizer was truncating the text to a maximum length of 512 tokens. We can actually do better than that!

There's a commonly used technique called `MaxP`, which splits the documents into passages, searches the passage collection, and uses the maximum passage score of a document as its document score.

See more: https://arxiv.org/pdf/1905.09217
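
As a hint for the aggregation step only, here is a toy sketch of taking the maximum passage score per document. The function name `maxp` and the passage hits are made up for illustration, and the scores assume that higher means better.

```python
def maxp(passage_hits):
    """Keep, for each document, the best score among its retrieved passages."""
    doc_scores = {}
    for doc_id, score in passage_hits:
        if doc_id not in doc_scores or score > doc_scores[doc_id]:
            doc_scores[doc_id] = score
    return doc_scores

# (document id, passage score) pairs as they might come back from a passage search
passage_hits = [("doc-a", 0.2), ("doc-b", 0.7), ("doc-a", 0.9), ("doc-b", 0.4)]

doc_scores = maxp(passage_hits)
print(sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True))
# [('doc-a', 0.9), ('doc-b', 0.7)]
```

Splitting the documents, encoding the passages, and searching the passage index are left for the exercise.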

Let's implement this and see if it actually improves the result!

In [None]:
# Your solution



In [None]:
#@title See Solution

from itertools import chain

def split_documents(doc, length=180, stride=180):
    doc_text = f"{doc['title']}\n{doc['text']}"
    tokens = tokenizer.encode(doc_text, add_special_tokens=False)
    for i, offset in enumerate(range(0, len(tokens), stride)):
        yield (doc['id'], i, tokenizer.decode(tokens[offset:offset+length]))

batch_size = 256
estimate_npassages = sum([ len(d['title'] + d['text'])//180+1 for d in doc_subset ])
loader = tqdm(
    batched(chain.from_iterable(map(split_documents, doc_subset)), batch_size),
    total=estimate_npassages//batch_size
)

maxp_index = faiss.index_factory(dimension, "Flat")
maxp_docid_mapping = []

with torch.no_grad():
    for batch in loader:
        docids, passage_ids, text = list(zip(*batch))
        inputs = tokenizer(text, padding='longest', max_length=512, truncation=True, return_tensors='pt')
        model_output = model(**inputs.to(model.device))
        # text_embeddings = mean_pooling(model_output, inputs['attention_mask']).cpu()
        text_embeddings = model_output.last_hidden_state[:, 0].cpu()

        maxp_index.add(text_embeddings.numpy().astype('float32'))
        maxp_docid_mapping += docids

# since we are using the same model, let me reuse the query embeddings
D, I = maxp_index.search(query_embeddings.numpy(), k=1000)

# MaxP
maxp_ranking = {}
for idx, val in zip(I[0], -D[0]):
    docid = maxp_docid_mapping[idx]
    if docid not in maxp_ranking or maxp_ranking[docid] < val:
        maxp_ranking[docid] = float(val)

irms.calc_aggregate([irms.nDCG@20, irms.AP, irms.R@100], qrels, {use_topic: maxp_ranking})
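
The solution uses `length=180` and `stride=180`, so passages do not overlap. As a sketch of how these two parameters interact, the hypothetical helper below lists the token spans that such a loop would produce; a stride smaller than the length gives overlapping windows.

```python
def window_offsets(n_tokens, length=180, stride=180):
    """(start, end) token spans produced by stepping `stride` tokens at a time."""
    return [(start, min(start + length, n_tokens))
            for start in range(0, n_tokens, stride)]

no_overlap = window_offsets(400)
half_overlap = window_offsets(400, length=180, stride=90)

print(no_overlap)    # [(0, 180), (180, 360), (360, 400)]
print(half_overlap)  # [(0, 180), (90, 270), (180, 360), (270, 400), (360, 400)]
```

Overlapping windows make it less likely that a relevant span is cut in half at a passage boundary, at the cost of more passages to encode and index. Note that the last overlapping window can be fully contained in the previous one.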

# And there you go!

You've learned how to run a DPR model for CLIR!