# JHU JSALT Summer School IR Laboratory -- Part 2.5

This notebook is mainly borrowed from the series of Colab notebooks created for the SIGIR 2023 Tutorial entitled '**Neural Methods for CLIR Tutorial**'. For more information, please visit their [repository](https://github.com/hltcoe/clir-tutorial).

In this notebook, we walk through a cross-language information retrieval (CLIR) example using Anserini, which is a wrapper around Lucene. As an example, we use a BM25 model with query translation to generate a ranked list on the NeuCLIR Chinese collection.



## Setup
This setup replicates the steps from the official Anserini [notebook](https://github.com/castorini/anserini-notebooks/blob/master/anserini_robust04_demo.ipynb).

We first need to upgrade the Java version on Google Colab
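
Before upgrading, it can help to check which Java version the runtime already has. The following is a minimal sketch, not part of the original tutorial: `MIN_JAVA_MAJOR = 11` is an assumption about what the Anserini fatjar needs and should be checked against the Anserini release notes, and the helper names are hypothetical.

```python
import re
import shutil
import subprocess

MIN_JAVA_MAJOR = 11  # assumption: check the Anserini release notes for the real minimum


def parse_java_major(version_output):
    """Extract the major version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    major = int(match.group(1))
    # Java 8 and older report themselves as "1.8.0_...": use the second number
    if major == 1 and match.group(2) is not None:
        major = int(match.group(2))
    return major


def installed_java_major():
    """Return the major version of the `java` on PATH, or None if unavailable."""
    if shutil.which("java") is None:
        return None
    # `java -version` prints to stderr rather than stdout
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    try:
        return parse_java_major(result.stderr + result.stdout)
    except ValueError:
        return None


major = installed_java_major()
if major is None:
    print("No usable java found on PATH")
elif major < MIN_JAVA_MAJOR:
    print(f"Java {major} found, but at least {MIN_JAVA_MAJOR} is assumed to be needed")
else:
    print(f"Java {major} found")
```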

Next, we download the precompiled Anserini fatjar from Maven. The code in this notebook was written in 2023 and uses some functions that were deprecated and later removed from newer releases, so we pin Anserini 0.21.0.

In [1]:
!wget https://repo1.maven.org/maven2/io/anserini/anserini/0.21.0/anserini-0.21.0-fatjar.jar

--2024-05-27 20:42:07--  https://repo1.maven.org/maven2/io/anserini/anserini/0.21.0/anserini-0.21.0-fatjar.jar
Resolving repo1.maven.org (repo1.maven.org)... 199.232.192.209, 199.232.196.209, 2a04:4e42:4c::209, ...
Connecting to repo1.maven.org (repo1.maven.org)|199.232.192.209|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 159044007 (152M) [application/java-archive]
Saving to: ‘anserini-0.21.0-fatjar.jar’


2024-05-27 20:42:15 (26.3 MB/s) - ‘anserini-0.21.0-fatjar.jar’ saved [159044007/159044007]



Let's install the packages!
The following command will install `ir_measures`, Hugging Face `datasets`, Google Translate (for presentation), and Hugging Face Transformers.

In [2]:
!pip install -q -U --progress-bar on ir_measures transformers datasets googletrans==3.1.0a0


After installation, let's download the dataset. The NeuCLIR 1 Collection is publicly available on Hugging Face Datasets! Topics and qrels are available on the TREC website, from which we will download them directly.

Working with the entire NeuCLIR Chinese collection will take too much indexing time. For this demonstration, we'll just use the first 40k documents.
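
Taking only a prefix of a streamed collection can be sketched in isolation as follows. This is an illustrative sketch, not the notebook's own cell: `fake_stream` is a hypothetical stand-in for the streaming dataset, and `take_first` is a hypothetical helper built on `itertools.islice`.

```python
from itertools import islice


def take_first(iterable, n):
    """Return the first n items of any iterable without reading the rest."""
    return list(islice(iterable, n))


def fake_stream():
    # Hypothetical stand-in for a streaming dataset: yields documents forever
    i = 0
    while True:
        yield {"id": f"doc-{i}", "title": "", "text": ""}
        i += 1


subset = take_first(fake_stream(), 5)
# Keep the ids in a set so relevance judgments can later be filtered to the subset
subset_ids = {d["id"] for d in subset}
print(len(subset), "documents loaded")
```

Because the subset is simply whatever comes first in the stream, the relevance judgments have to be filtered to the same document ids before scoring, which is what the next cell does.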

In [3]:
# Download topics and qrels from NIST
!wget -q --show-progress https://trec.nist.gov/data/neuclir/topics.0720.utf8.jsonl
!wget -q --show-progress https://trec.nist.gov/data/neuclir/2022-qrels.zho

import json
import pandas as pd
from tqdm.auto import tqdm

import ir_measures as irms
from datasets import load_dataset

# Only loading the first 40k docs from HF Datasets
ds = load_dataset('neuclir/neuclir1', split='zho', streaming=True) # total 3179209
doc_subset = [ o for i, o in zip(tqdm(range(40_000), desc='Loading first 40k docs from NeuCLIR Chinese Collection'), ds) ]
subset_doc_ids = set([ d['id'] for d in doc_subset ])

use_topic = '66' # use topic 66 as demo -- expecting to have 9 relevant docs

qrels = pd.DataFrame([ l for l in irms.read_trec_qrels('2022-qrels.zho') if l.query_id == use_topic and l.doc_id in subset_doc_ids ])
topics = [ t for t in map(json.loads, open("topics.0720.utf8.jsonl")) if t['topic_id'] == use_topic ]



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.



Here we create helper functions so we can obtain the query and document text more conveniently.




In [4]:
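# Map each topic id to its position in `topics` for direct lookup
topic_id_idx = { t['topic_id']: i for i, t in enumerate(topics) }
def get_query_by_topic_id(topic_id, query_type='title', lang="eng"):
    # Each topic holds one entry per language; return the requested field
    for topic in topics[ topic_id_idx[topic_id] ]['topics']:
        if topic["lang"] == lang:
            return topic[f'topic_{query_type}']

# Map each document id to its position in `doc_subset`
doc_id_to_idx = { d['id']: i for i, d in enumerate(doc_subset) }
def get_doc_text_by_doc_id(doc_id):
    # Concatenate title and body into a single text field
    doc = doc_subset[ doc_id_to_idx[doc_id] ]
    return doc['title'] + ' ' + doc['text']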
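
To see what the query helper expects, here is a small self-contained sketch on toy data. The record layout is inferred from the helper above (a `topics` list with one entry per language and `topic_<type>` fields); the topic id and titles are hypothetical, and `get_query` is a hypothetical variant that takes the topic list as an argument instead of reading a global.

```python
def get_query(topics, topic_id, query_type="title", lang="eng"):
    """Return the requested query field for one topic, or None if the language is missing."""
    by_id = {t["topic_id"]: t for t in topics}
    for variant in by_id[topic_id]["topics"]:
        if variant["lang"] == lang:
            return variant[f"topic_{query_type}"]
    return None


# Hypothetical topic record mimicking the layout the helper relies on
toy_topics = [{
    "topic_id": "0",
    "topics": [
        {"lang": "eng", "topic_title": "hypothetical English title"},
        {"lang": "zho", "topic_title": "假设的中文标题"},
    ],
}]

print(get_query(toy_topics, "0", lang="eng"))
print(get_query(toy_topics, "0", lang="zho"))
```

Note that topic ids are compared as strings here, matching `use_topic = '66'` in the loading cell.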

## Indexing

We first index the NeuCLIR Chinese document subset using Anserini


In [5]:
!mkdir -p collection

We write the subset as a JSONL file for Anserini: one JSON object per line, with an `id` field and a `contents` field.

In [6]:
import json
with open("collection/zho_neuclir_subset.jsonl", "w") as f:
  for doc_id in tqdm(doc_id_to_idx, total = len(doc_id_to_idx)):
    content = get_doc_text_by_doc_id(doc_id)
    text = json.dumps({"id": doc_id, "contents": content})
    f.write(text+"\n")
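
The one-record-per-line layout written by the loop above can be checked on a single record. This is a minimal sketch with a hypothetical document; it also shows that `json.dumps` escapes embedded newlines, so a multi-line document still occupies exactly one line, and that the default ASCII-escaped output decodes to the same record as the UTF-8 form.

```python
import json

# Hypothetical document whose text contains a line break
record = {"id": "hypothetical-doc-1", "contents": "标题\n正文"}

line_ascii = json.dumps(record)                     # default: non-ASCII escaped as \uXXXX
line_utf8 = json.dumps(record, ensure_ascii=False)  # keeps the Chinese characters readable

print(line_ascii)
print(line_utf8)

# Both encodings decode back to the same record
round_trip_ok = json.loads(line_ascii) == json.loads(line_utf8) == record
print(round_trip_ok)
```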


Now we run the indexer over the Chinese documents. When indexing finishes, you should see 40,000 documents indexed.

In [9]:
!java -cp anserini-0.21.0-fatjar.jar io.anserini.index.IndexCollection \
  -collection JsonCollection \
  -generator DefaultLuceneDocumentGenerator \
  -threads 9 \
  -input collection \
  -index indexes/zho_neuclir_subset_bm25 \
  -storePositions \
  -storeDocvectors \
  -storeRaw \
  -language zh

2024-05-27 20:45:02,873 INFO  [main] index.IndexCollection (IndexCollection.java:380) - Setting log level to INFO
2024-05-27 20:45:02,877 INFO  [main] index.IndexCollection (IndexCollection.java:383) - Starting indexer...
2024-05-27 20:45:02,878 INFO  [main] index.IndexCollection (IndexCollection.java:385) - DocumentCollection path: collection
2024-05-27 20:45:02,878 INFO  [main] index.IndexCollection (IndexCollection.java:386) - CollectionClass: JsonCollection
2024-05-27 20:45:02,879 INFO  [main] index.IndexCollection (IndexCollection.java:387) - Generator: DefaultLuceneDocumentGenerator
2024-05-27 20:45:02,879 INFO  [main] index.IndexCollection (IndexCollection.java:388) - Threads: 9
2024-05-27 20:45:02,880 INFO  [main] index.IndexCollection (IndexCollection.java:389) - Language: zh
2024-05-27 20:45:02,880 INFO  [main] index.IndexCollection (IndexCollection.java:390) - Stemmer: porter
2024-05-27 20:45:02,880 INFO  [main] index.IndexCollection (IndexCollection.java:391) - Keep stopwor
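
Outside a notebook, the same indexing call can be assembled as an argument list and handed to `subprocess`. The sketch below only builds and prints the command; the flags are exactly those used in the cell above, while the function name `build_index_command` is hypothetical.

```python
import shlex


def build_index_command(jar, input_dir, index_dir, language, threads=9):
    """Assemble the IndexCollection call used above as an argument list."""
    return [
        "java", "-cp", jar, "io.anserini.index.IndexCollection",
        "-collection", "JsonCollection",
        "-generator", "DefaultLuceneDocumentGenerator",
        "-threads", str(threads),
        "-input", input_dir,
        "-index", index_dir,
        "-storePositions", "-storeDocvectors", "-storeRaw",
        "-language", language,
    ]


cmd = build_index_command(
    "anserini-0.21.0-fatjar.jar", "collection", "indexes/zho_neuclir_subset_bm25", "zh"
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when paths contain spaces.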

## Retrieval

With the Chinese documents indexed, we now generate a ranked list for a translated Chinese query using the BM25 model.
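
For intuition, the contribution of a single query term to a BM25 score can be sketched as below. This is the common textbook form with a Lucene-style smoothed IDF, not a copy of Anserini's or Lucene's code, whose implementation may differ in constant factors; the defaults `k1=0.9` and `b=0.4` are the values Anserini reports in its search log in this notebook.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=0.9, b=0.4):
    """Textbook BM25 weight of one query term in one document."""
    # Rare terms (small doc_freq) get a larger inverse document frequency
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalisation: longer-than-average documents are penalised via b
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates: each extra occurrence adds less than the previous one
    return idf * tf * (k1 + 1) / (tf + norm)


# Hypothetical statistics: a term occurring twice in an average-length document
print(bm25_term_score(tf=2, doc_len=100, avg_doc_len=100, n_docs=40_000, doc_freq=50))
```

A document's score for a query is the sum of this weight over the query terms it contains.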

In [10]:
!mkdir -p runs

Get the translated Chinese query for a specific topic_id (66). See the [cell](https://colab.research.google.com/drive/1u_8ESzz_f26toFy45m17UQRZXGVqMt0B#scrollTo=PI64O_uLCK_o&line=19&uniqifier=1) for more details

In [11]:
topic_text = get_query_by_topic_id(use_topic, lang="zho")

Write the topic to a text file in TSV format: one line per topic, with the topic id and the query text separated by a tab.

In [12]:
with open("zho_topics.txt", "w") as f:
  f.write(f"{use_topic}\t{topic_text}\n")
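
When writing several topics, it is worth guarding against stray tabs or newlines in the query text, since either would break the one-line-per-topic layout. The sketch below is a hypothetical helper, assuming (as the reader name `TsvInt` used in the next cell suggests) that topic ids must be integers.

```python
def topic_to_tsv_line(topic_id, text):
    """Format one topic as '<id><TAB><query>' followed by a newline."""
    int(topic_id)  # raises ValueError if the id is not an integer (assumed requirement)
    # Collapse tabs, newlines and repeated spaces so the query stays on one line
    clean = " ".join(text.split())
    return f"{topic_id}\t{clean}\n"


# Hypothetical query text containing a tab and a line break
print(repr(topic_to_tsv_line("66", "first part\tsecond\npart")))
```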

Perform retrieval using Anserini's BM25 with default hyperparameters (the search log below reports `k1=0.9` and `b=0.4`).

In [13]:
!java -cp anserini-0.21.0-fatjar.jar io.anserini.search.SearchCollection \
  -index indexes/zho_neuclir_subset_bm25 \
  -topics zho_topics.txt \
  -topicreader TsvInt \
  -output runs/zho_neuclir_subset_bm25.title.txt \
  -bm25 \
  -language zh

2024-05-27 20:47:08,694 INFO  [main] search.SearchCollection (SearchCollection.java:929) - Index: indexes/zho_neuclir_subset_bm25
2024-05-27 20:47:08,972 INFO  [main] search.SearchCollection (SearchCollection.java:933) - Fields: []
2024-05-27 20:47:08,973 INFO  [main] search.SearchCollection (SearchCollection.java:682) - Using language-specific analyzer
2024-05-27 20:47:08,974 INFO  [main] search.SearchCollection (SearchCollection.java:683) - Language: zh
2024-05-27 20:47:09,022 INFO  [main] search.SearchCollection (SearchCollection.java:1208) - runtag: Anserini
2024-05-27 20:47:09,699 INFO  [pool-2-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:861) - ranker: bm25(k1=0.9,b=0.4), reranker: default: 1 queries processed in 00:00:00 = ~1.59 q/s
2024-05-27 20:47:09,725 INFO  [main] search.SearchCollection (SearchCollection.java:1420) - Total run time: 00:00:01
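
The output file is a standard TREC run file with six whitespace-separated columns per line: query id, the literal `Q0`, document id, rank, score, and run tag. A minimal parsing sketch follows; the sample line is hypothetical (only the run tag `Anserini` is taken from the log above), and in the notebook itself `irms.read_trec_run` does this job.

```python
from typing import NamedTuple


class RunEntry(NamedTuple):
    query_id: str
    doc_id: str
    rank: int
    score: float
    run_tag: str


def parse_run_line(line):
    """Parse one line of a TREC run file: qid Q0 docid rank score tag."""
    query_id, _q0, doc_id, rank, score, run_tag = line.split()
    return RunEntry(query_id, doc_id, int(rank), float(score), run_tag)


# Hypothetical line; the document id and score are made up for illustration
sample = "66 Q0 hypothetical-doc-id 1 12.345600 Anserini"
entry = parse_run_line(sample)
print(entry)
```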


Scored against the filtered qrels, this BM25 run is only modest: the evaluation below gives an nDCG@20 of 0.1483 and an AP of 0.0684.

In [14]:
!pwd

/content


In [15]:
to_rerank = pd.DataFrame([ l for l in irms.read_trec_run("runs/zho_neuclir_subset_bm25.title.txt")])

irms.calc_aggregate([irms.nDCG@20, irms.AP], qrels, to_rerank)

{nDCG@20: 0.1482972305701491, AP: 0.06837054789182448}
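
To make the nDCG@20 number less of a black box, here is a small hand-rolled sketch of the metric. It assumes one common variant (linear gain, log2 rank discount, ideal ranking built from all judged relevant documents); the toolkit's exact definition may differ, and the toy judgments and ranking are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: the gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_doc_ids, relevance, k=20):
    """nDCG@k of one ranking, given a dict of doc id -> graded relevance."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    # Ideal ranking: all relevant documents sorted by decreasing grade
    ideal = sorted((r for r in relevance.values() if r > 0), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments and ranking: the best document is only found at rank 3
toy_relevance = {"d1": 3, "d2": 1}
toy_ranking = ["d2", "dX", "d1"]
print(round(ndcg_at_k(toy_ranking, toy_relevance), 4))
```

Relevant documents that fall outside the 40k subset are already removed from the qrels, so they do not count against the ideal ranking here.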

## Exercise
Perform retrieval using a different topic id.

To compute a score, refer to this [cell](https://colab.research.google.com/drive/1u_8ESzz_f26toFy45m17UQRZXGVqMt0B#scrollTo=PI64O_uLCK_o&line=19&uniqifier=1) for how to filter the qrels so that they include only the chosen topic id.

Try it out yourself here:

In [None]:
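# Your solution
use_topic = ...  # TODO: choose a different topic id (a string, like '66' above)
qrels = ...      # TODO: filter the qrels to this topic and the document subset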

## And there you go!

You've learned how to run a simple BM25 retrieval model using query translation for CLIR!