# JHU JSALT Summer School IR Laboratory -- Part 5

This notebook is mainly borrowed from the series of Colab notebooks created for the SIGIR 2023 Tutorial entitled '**Neural Methods for CLIR Tutorial**'. For more information, please visit their [repository](https://github.com/hltcoe/clir-tutorial).

In this notebook, we walk through a CLIR example that uses BLADE, a Translate-trained bi-encoder model, to produce a ranked list on the NeuCLIR Chinese collection.


## Setup
This setup follows the steps from the official Anserini [notebook](https://github.com/castorini/anserini-notebooks/blob/master/anserini_robust04_demo.ipynb).

We first download the precompiled fatjar from Maven. The original code was written in 2023 and relies on functions that were deprecated and later removed from newer Anserini releases, so this notebook pins Anserini 0.21.0.

In [None]:
!wget https://repo1.maven.org/maven2/io/anserini/anserini/0.21.0/anserini-0.21.0-fatjar.jar

--2024-05-27 20:42:07--  https://repo1.maven.org/maven2/io/anserini/anserini/0.21.0/anserini-0.21.0-fatjar.jar
Resolving repo1.maven.org (repo1.maven.org)... 199.232.192.209, 199.232.196.209, 2a04:4e42:4c::209, ...
Connecting to repo1.maven.org (repo1.maven.org)|199.232.192.209|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 159044007 (152M) [application/java-archive]
Saving to: ‘anserini-0.21.0-fatjar.jar’


2024-05-27 20:42:15 (26.3 MB/s) - ‘anserini-0.21.0-fatjar.jar’ saved [159044007/159044007]



Let's install the packages!
The following command installs `ir_measures`, Huggingface `datasets`, Google Translate (for presentation), and Huggingface Transformers.

In [None]:
!pip install -q -U --progress-bar on ir_measures transformers datasets googletrans==3.1.0a0


After installation, let's download the dataset. The NeuCLIR 1 Collection is publicly available on Huggingface Datasets! Topics and qrels are available on the TREC website, from which we will download them directly.

Working with the entire NeuCLIR Chinese collection will take too much indexing time. For this demonstration, we'll just use the first 40k documents.

In [None]:
# Download topics and qrels from NIST
!wget -q --show-progress https://trec.nist.gov/data/neuclir/topics.0720.utf8.jsonl
!wget -q --show-progress https://trec.nist.gov/data/neuclir/2022-qrels.zho

import json
import pandas as pd
from tqdm.auto import tqdm

import ir_measures as irms
from datasets import load_dataset

# Only loading the first 40k docs from HF Datasets
ds = load_dataset('neuclir/neuclir1', split='zho', streaming=True) # total 3179209
doc_subset = [ o for i, o in zip(tqdm(range(40_000), desc='Loading first 40k docs from NeuCLIR Chinese Collection'), ds) ]
subset_doc_ids = set([ d['id'] for d in doc_subset ])

use_topic = '66' # use topic 66 as demo -- expecting to have 9 relevant docs

qrels = pd.DataFrame([ l for l in irms.read_trec_qrels('2022-qrels.zho') if l.query_id == use_topic and l.doc_id in subset_doc_ids ])
topics = [ t for t in map(json.loads, open("topics.0720.utf8.jsonl")) if t['topic_id'] == use_topic ]



Downloading builder script:   0%|          | 0.00/1.31k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/1.48k [00:00<?, ?B/s]

Loading first 40k docs from NeuCLIR Chinese Collection:   0%|          | 0/40000 [00:00<?, ?it/s]

Here we create helper functions so we can obtain the query and document text more conveniently.


In [None]:
topic_id_idx = { t['topic_id']: i for i, t in enumerate(topics) }
def get_query_by_topic_id(topic_id, query_type='title', lang="eng"):
    for topic in topics[ topic_id_idx[topic_id] ]['topics']:
      if topic["lang"] == lang:
        return topic[f'topic_{query_type}']

doc_id_to_idx = { d['id']: i for i, d in enumerate(doc_subset) }
def get_doc_text_by_doc_id(doc_id):
    doc = doc_subset[ doc_id_to_idx[doc_id] ]
    return doc['title'] + ' ' + doc['text']

## BLADE model

BLADE is a transformer-based model, initialized with a pruned bilingual language model, that generates sparse vectors for queries and documents.
The following cell creates a `Blade` class that makes it easier to feed a query (or document) into the model and obtain a sparse vector whose dimensions span the bilingual vocabulary of the underlying language model.

In [None]:
import json
import torch

from transformers import AutoModelForMaskedLM

class Blade(torch.nn.Module):

    def __init__(self, model_type_or_dir):
        super().__init__()
        self.transformer = AutoModelForMaskedLM.from_pretrained(model_type_or_dir)

    def forward(self, **kwargs):
        out = self.transformer(**kwargs)["logits"] # output (logits) of MLM head, shape (bs, pad_len, voc_size)
        values, _ = torch.max(torch.log(1 + torch.relu(out)) * kwargs["attention_mask"].unsqueeze(-1), dim=1)
        return values
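
To make the pooling in `forward` concrete, here is a minimal pure-Python sketch of the same computation on toy numbers: each logit goes through `log(1 + relu(x))`, padded positions are zeroed by the attention mask, and the maximum over token positions is kept for each vocabulary entry. The function name `blade_pool` and the toy values are illustrative assumptions, not part of the tutorial code.

```python
import math

def blade_pool(logits, attention_mask):
    """Max-pool log(1 + relu(logit)) over unmasked token positions.

    logits: list of seq_len rows, each a list of vocab_size floats.
    attention_mask: list of seq_len ints (1 = real token, 0 = padding).
    """
    vocab_size = len(logits[0])
    pooled = [0.0] * vocab_size  # all transformed weights are >= 0
    for row, keep in zip(logits, attention_mask):
        for j, logit in enumerate(row):
            weight = math.log1p(max(logit, 0.0)) * keep
            pooled[j] = max(pooled[j], weight)
    return pooled

# Toy input: 3 token positions, vocabulary of 3 entries (hypothetical values).
toy_logits = [
    [2.0, -1.0, 0.0],
    [0.5, 3.0, -2.0],
    [9.0, 9.0, 9.0],  # padding position, ignored because its mask is 0
]
toy_mask = [1, 1, 0]
toy_vector = blade_pool(toy_logits, toy_mask)
print(toy_vector)  # roughly [1.0986, 1.3863, 0.0]
```

Negative logits and padded positions contribute zero, which is why most of the vocabulary dimensions end up empty and the resulting vector is sparse.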

## Indexing

Indexing with BLADE is a two-step process.


1.   We generate the sparse weights for a subset of the documents using the fine-tuned model.
2.   We store the generated weights into a sparse index using Anserini.




In [None]:
!mkdir -p collection

The helper function below takes a batch of documents and generates one sparse vector per document, keeping only the top-k highest weights.

In [None]:
def process_text(docs, ids, model, tokenizer, device, reverse_voc, max_length, top_k):
  with torch.inference_mode():
    features = tokenizer(
        docs, return_tensors = "pt", max_length = max_length,
        padding = True, truncation = True
    )
    features = {key: val.to(device) for key, val in features.items()}
    doc_reps = model(**features)

  cols = [torch.nonzero(x).squeeze().cpu().tolist() for x in torch.unbind(doc_reps, dim = 0)]

  res = {}
  for col, doc_rep, id_ in zip(cols, doc_reps, ids):
    weights = doc_rep[col].cpu().tolist()

    if type(col) == list:
      weights_dict = {k : v for k, v in zip(col, weights)}
    else:
      weights_dict = {col : weights}

    tokids = heapq.nlargest(top_k, weights_dict, key = weights_dict.__getitem__)
    tokids = set(tokids)

    dict_blade = {
      reverse_voc[k]: round(v * 100)
      for k, v in weights_dict.items() if  k in tokids and round(v * 100) > 0
    }

    dict_blade = dict(sorted(dict_blade.items(), key = operator.itemgetter(1), reverse = True))

    if len(dict_blade.keys()) == 0:
      print("empty input =>", id_)
      dict_blade['"[unused993]"'] = 1

    res[id_] = dict_blade
  return res
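
The pruning and quantization inside `process_text` can be illustrated in isolation. The sketch below, on a hypothetical four-token weight dictionary, keeps the `top_k` largest weights, scales them by 100, rounds to integers, and drops anything that rounds to zero. `prune_and_quantize` and the toy tokens are assumptions made for illustration only.

```python
import heapq

def prune_and_quantize(weights, top_k, scale=100):
    """Keep the top_k largest weights, then convert them to positive integers."""
    keep = set(heapq.nlargest(top_k, weights, key=weights.__getitem__))
    quantized = {
        token: round(weight * scale)
        for token, weight in weights.items()
        if token in keep and round(weight * scale) > 0
    }
    # Sort by descending integer weight, as process_text does.
    return dict(sorted(quantized.items(), key=lambda kv: kv[1], reverse=True))

toy_weights = {"china": 1.386, "trade": 1.0986, "the": 0.004, "tax": 0.52}
pruned = prune_and_quantize(toy_weights, top_k=3)
print(pruned)  # {'china': 139, 'trade': 110, 'tax': 52}
```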

Loading the BLADE model. Make sure to change the runtime type to include the free GPU (T4).

In [None]:
import json
import math
import heapq
import torch
import operator

from transformers import AutoTokenizer

model_name = "srnair/blade-en-zh"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = Blade(model_name)
device = torch.device("cuda")
model.to(device)
model.eval()


Blade(
  (transformer): BertForMaskedLM(
    (bert): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(35225, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0-11): 12 x BertLayer(
            (attention): BertAttention(
              (self): BertSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=768, bias=True)
     

Processing document identifiers and document text.

In [None]:
ids, docs = [], []
for doc_id in tqdm(doc_id_to_idx, total = len(doc_id_to_idx)):
  content = get_doc_text_by_doc_id(doc_id)
  ids.append(doc_id)
  docs.append(content.lower())

  0%|          | 0/40000 [00:00<?, ?it/s]

### First stage
Run the BLADE model on the document texts. This step takes approximately 16 minutes on a T4 GPU with a batch size of 32 documents.

In [None]:
reverse_voc = {v : k for k, v in tokenizer.vocab.items()}
top_k = int(len(tokenizer) * 0.01) # Keep at most top_k weights per text, where top_k is 1% of the vocabulary size
max_length = 256
batch_size = 32

with open("collection/zho_blade_subset.jsonl", "w") as f:
    for i in tqdm(range(0, len(docs), batch_size), total = math.ceil(len(docs) / batch_size)):
        doc = docs[i:i+batch_size]
        pid = ids[i:i+batch_size]
        res = process_text(doc, pid, model, tokenizer, device, reverse_voc, max_length, top_k)
        for id_, text in zip(pid, doc):
            dict_ = dict(id=id_, content=text, vector=res[id_])
            f.write(json.dumps(dict_, ensure_ascii=False)+"\n")

  0%|          | 0/1250 [00:00<?, ?it/s]
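
Each line written to `collection/zho_blade_subset.jsonl` is one JSON object with `id`, `content`, and `vector` fields. A minimal sketch of one such record, using a made-up document (the id, text, and weights are hypothetical):

```python
import json

# Hypothetical record in the same shape as the ones written above.
record = {
    "id": "toy-doc-0",
    "content": "中国 贸易 新闻",
    "vector": {"china": 139, "trade": 110, "贸易": 87},
}

# ensure_ascii=False keeps the Chinese text readable in the file.
line = json.dumps(record, ensure_ascii=False) + "\n"
parsed = json.loads(line)
print(line, end="")
```

The `vector` field maps vocabulary tokens to the integer weights produced by the quantization step.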

### Second stage
Index the generated weights using Anserini. When indexing finishes, the log should report 40,000 documents indexed.

In [None]:
!java -cp anserini-0.21.0-fatjar.jar io.anserini.index.IndexCollection \
  -collection JsonVectorCollection \
  -generator DefaultLuceneDocumentGenerator \
  -threads 9 \
  -input collection \
  -index indexes/zho_neuclir_subset_blade \
  -impact \
  -pretokenized

2023-07-15 21:53:26,567 INFO  [main] index.IndexCollection (IndexCollection.java:380) - Setting log level to INFO
2023-07-15 21:53:26,570 INFO  [main] index.IndexCollection (IndexCollection.java:383) - Starting indexer...
2023-07-15 21:53:26,570 INFO  [main] index.IndexCollection (IndexCollection.java:385) - DocumentCollection path: collection
2023-07-15 21:53:26,573 INFO  [main] index.IndexCollection (IndexCollection.java:386) - CollectionClass: JsonVectorCollection
2023-07-15 21:53:26,574 INFO  [main] index.IndexCollection (IndexCollection.java:387) - Generator: DefaultLuceneDocumentGenerator
2023-07-15 21:53:26,575 INFO  [main] index.IndexCollection (IndexCollection.java:388) - Threads: 9
2023-07-15 21:53:26,576 INFO  [main] index.IndexCollection (IndexCollection.java:389) - Language: en
2023-07-15 21:53:26,576 INFO  [main] index.IndexCollection (IndexCollection.java:390) - Stemmer: porter
2023-07-15 21:53:26,581 INFO  [main] index.IndexCollection (IndexCollection.java:391) - Keep s

## Retrieval

After indexing the Chinese documents, we want to generate a ranked list for a given English query using the BLADE model. As with indexing, retrieval is a two-step process:


1.   We generate the sparse weights for a query using the fine-tuned model.
2.   We perform retrieval using the generated query weights with Anserini.



In [None]:
!mkdir -p runs

Get the English query for a specific topic_id (66).

In [None]:
topic_text = get_query_by_topic_id(use_topic, lang="eng")

### First stage
Running BLADE on query text

In [None]:
max_length = 32 # set maximum query length to 32 tokens
query_dict = process_text([topic_text], [use_topic], model, tokenizer, device, reverse_voc, max_length, top_k)

Helper function to create query vectors in Anserini format

In [None]:
def get_anserini_query(id_, query_vector):
  exp_query = " ".join([" ".join([key]*val) for key, val in query_vector.items()])
  return f"{id_}\t{exp_query}\n"
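
The helper encodes each integer weight by repeating the token that many times, so the term frequency in the query string carries the weight. A small sketch of the same logic on a hypothetical two-token query vector (`expand_query` is an illustrative name):

```python
def expand_query(query_id, query_vector):
    """Build a tab-separated topic line where each token is repeated `weight` times."""
    repeated = [" ".join([token] * weight) for token, weight in query_vector.items()]
    return f"{query_id}\t{' '.join(repeated)}\n"

toy_line = expand_query("66", {"china": 2, "trade": 1})
print(repr(toy_line))  # '66\tchina china trade\n'
```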

In [None]:
with open("zho_blade_topics.txt", "w") as f:
  f.write(get_anserini_query(use_topic, query_dict[use_topic]))

### Second stage
Perform retrieval using Anserini

In [None]:
!java -cp anserini-0.21.0-fatjar.jar io.anserini.search.SearchCollection \
  -index indexes/zho_neuclir_subset_blade \
  -topics zho_blade_topics.txt \
  -topicreader TsvInt \
  -output runs/zho_neuclir_subset_blade.title.txt \
  -impact \
  -pretokenized

2023-07-15 21:56:27,880 INFO  [main] search.SearchCollection (SearchCollection.java:951) - Index: indexes/zho_neuclir_subset_blade
2023-07-15 21:56:28,127 INFO  [main] search.SearchCollection (SearchCollection.java:955) - Fields: []
2023-07-15 21:56:28,154 INFO  [main] search.SearchCollection (SearchCollection.java:1227) - runtag: Anserini
2023-07-15 21:56:29,318 INFO  [pool-2-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:883) - ranker: impact(), reranker: default: 1 queries processed in 00:00:01 = ~0.91 q/s
2023-07-15 21:56:29,346 INFO  [main] search.SearchCollection (SearchCollection.java:1439) - Total run time: 00:00:01
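
Conceptually, impact-based retrieval ranks documents by the dot product between the query's term weights and each document's stored weights. The sketch below reproduces that idea in plain Python on hypothetical vectors; it illustrates the scoring idea under that assumption and is not Anserini's implementation.

```python
def impact_score(query_vec, doc_vec):
    """Sparse dot product: sum of query weight times document weight over shared tokens."""
    return sum(weight * doc_vec.get(token, 0) for token, weight in query_vec.items())

# Hypothetical quantized vectors.
toy_query = {"china": 2, "trade": 1}
toy_docs = {
    "d1": {"china": 139, "tax": 52},
    "d2": {"trade": 110, "china": 40},
    "d3": {"sports": 90},
}

toy_scores = {doc_id: impact_score(toy_query, vec) for doc_id, vec in toy_docs.items()}
toy_ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
print(toy_scores)   # {'d1': 278, 'd2': 190, 'd3': 0}
print(toy_ranking)  # ['d1', 'd2', 'd3']
```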


Scoring against the filtered qrels, BLADE outperforms BM25 on this topic: nDCG@20 is 0.5531, compared to 0.1483 for BM25.

In [None]:
to_rerank = pd.DataFrame([ l for l in irms.read_trec_run("runs/zho_neuclir_subset_blade.title.txt")])

irms.calc_aggregate([irms.nDCG@20, irms.AP], qrels, to_rerank)

{AP: 0.3504142956973146, nDCG@20: 0.5531909174508711}
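
For intuition about the nDCG@k number, the sketch below computes one common formulation (linear gain, log2 rank discount) on a toy ranking. The function names, document ids, and relevance grades are hypothetical, and `ir_measures` may differ in details, so treat this as an illustration rather than a re-implementation.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))

def ndcg_at_k(ranked_doc_ids, relevance, k):
    """nDCG@k: DCG of the ranking divided by the DCG of an ideal ranking."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0

toy_relevance = {"a": 3, "b": 1}   # graded judgments for one topic
toy_run = ["b", "x", "a"]          # system ranking, best first
toy_ndcg = ndcg_at_k(toy_run, toy_relevance, k=3)
print(round(toy_ndcg, 4))  # 0.6885
```

Placing the highly relevant document `a` at rank 3 instead of rank 1 is what pulls the score below 1.0.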

# Exercise
Perform retrieval using a different topic id.

For generating a score, refer to this [cell](https://colab.research.google.com/drive/1u_8ESzz_f26toFy45m17UQRZXGVqMt0B#scrollTo=PI64O_uLCK_o&line=19&uniqifier=1) on how to filter qrels to only include the chosen topic id.
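
The filtering pattern is the same one used when loading topic 66 earlier in this notebook: keep only judgments whose `query_id` matches the chosen topic and whose `doc_id` is in the 40k-document subset. A minimal sketch on toy data follows; the `Qrel` namedtuple is a local stand-in for the records returned by `irms.read_trec_qrels`, and all ids are made up.

```python
from collections import namedtuple

# Local stand-in for the qrel records; the field names mirror those used above.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])

def filter_qrels(all_qrels, topic_id, allowed_doc_ids):
    """Keep judgments for one topic, restricted to documents we actually indexed."""
    return [
        q for q in all_qrels
        if q.query_id == topic_id and q.doc_id in allowed_doc_ids
    ]

toy_qrels = [
    Qrel("66", "doc-1", 3),
    Qrel("66", "doc-9", 1),   # not in the indexed subset
    Qrel("70", "doc-1", 0),   # different topic
]
toy_subset_ids = {"doc-1", "doc-2"}
toy_filtered = filter_qrels(toy_qrels, "66", toy_subset_ids)
print(toy_filtered)
```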

Try it out yourself here:

In [None]:
# Your solution
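use_topic = ...  # TODO: set this to a different topic id (as a string)
qrels = ...  # TODO: filter the qrels to the chosen topic id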