# Information Retrival Project

Authors: Delia Mennitti - 19610, Letizia Meroi - , Sara Napolitano - 

For this project, we use the **SWIM-IR dataset**, which is described in detail in the paper *“Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval”* by Nandan Thakur, Jianmo Ni, Gustavo Hernández Ábrego, John Wieting, Jimmy Lin, and Daniel Cer.

## Task Definition

We focus on a **cross-lingual Information Retrieval (IR) task** using the SWIM-IR dataset.

Given an English query, the objective is to **retrieve the relevant Wikipedia passage written in another language**. Each query has exactly one associated relevant passage, enabling **automatic and reproducible evaluation** of retrieval performance.


In [4]:
import os
import json
import gzip
from collections import defaultdict
import numpy as np
import pandas as pd
from tqdm import tqdm
import ast
from rank_bm25 import BM25Okapi
import jieba

# Exploratory Data Analysis

Some stats, bar charts, how is the dataset structured etc...

In [5]:
# Base data directory
BASE_DATA_DIR = "data/swim_ir_v1/swim_ir_v1"

In [6]:
# Full path to Chinese cross-lingual train file
zh_path = os.path.join(BASE_DATA_DIR, "cross_lingual", "zh", "train.jsonl")

with open(zh_path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(line.strip())
        if i >= 5:
            break


{'_id': '18', 'lang': 'Chinese', 'code': 'zh', 'query': '1869 年，哪个国家发生了一起抢劫案？', 'title': '!Kora Wars', 'text': 'Jan Kivido and Piet Rooi formed a partnership and were the most consistent raiders. The first recorded significant incident between the !Kora people and the colonial government occurred in 1869, when a Griqua and Scottish trader were robbed along the southern bank of the Orange River. Piet Rooi, the leader of another nomadic !Kora group, was held responsible for the robbery, and as punishment was lashed and committed to three months hard labour. He was subsequently released on account of insufficient evidence against him. The treatment he received did not sit well with many of the !Ikora raiders, and this'}
{'_id': '39', 'lang': 'Chinese', 'code': 'zh', 'query': '电影《女性艺术革命》是关于什么的？', 'title': '!Women Art Revolution', 'text': 'historians for over 4 decades about their individual and group efforts to help women succeed in the art world and society by helping them overcome obstac

# Full BM25 baseline for all languages Available

In [7]:
LANGUAGES = [d for d in os.listdir(BASE_DATA_DIR) if os.path.isdir(os.path.join(BASE_DATA_DIR, d))]
MAX_ITEMS = 1000  # use None for full dataset
K = 10  # top-K retrieval

# Robust JSONL loader
def load_jsonl_robust(path, max_items=None):
    data = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if max_items and i >= max_items:
                break
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                obj = ast.literal_eval(line)
            data.append(obj)
    return data

# Tokenizer
def tokenize(text, lang_code):
    if lang_code == "zh":
        return list(jieba.cut(text))
    else:
        return text.lower().split()

# Evaluation
def evaluate(retrieved, qrels, K=10):
    recalls = []
    rr_list = []
    for qid, top_docs in retrieved.items():
        relevant_doc = qrels[qid]
        recalls.append(1.0 if relevant_doc in top_docs[:K] else 0.0)
        try:
            rank = top_docs.index(relevant_doc) + 1
            rr_list.append(1.0 / rank)
        except ValueError:
            rr_list.append(0.0)
    return np.mean(recalls), np.mean(rr_list)

# Store results
results = []

for lang_code in LANGUAGES:
    print(f"\nProcessing language: {lang_code}")
    lang_path = os.path.join(BASE_DATA_DIR, lang_code, "train.jsonl")
    data = load_jsonl_robust(lang_path, max_items=MAX_ITEMS)

    # Build documents, queries, qrels
    documents = {}
    queries = {}
    qrels = {}
    for item in data:
        doc_id = f"{lang_code}_{item['_id']}"
        documents[doc_id] = {"text": item["title"] + " " + item["text"], "lang": item["code"]}
        queries[doc_id] = item["query"]
        qrels[doc_id] = doc_id

    # Tokenize corpus
    doc_ids = list(documents.keys())
    tokenized_corpus = [tokenize(documents[doc_id]["text"], documents[doc_id]["lang"]) for doc_id in doc_ids]
    bm25 = BM25Okapi(tokenized_corpus)

    # Tokenize queries
    tokenized_queries = {qid: tokenize(q, documents[qid]["lang"]) for qid, q in queries.items()}

    # Retrieve top-K
    retrieved = {}
    for qid, query_tokens in tokenized_queries.items():
        scores = bm25.get_scores(query_tokens)
        top_indices = scores.argsort()[-K:][::-1]
        retrieved[qid] = [doc_ids[i] for i in top_indices]

    # Evaluate
    recall, mrr = evaluate(retrieved, qrels, K=K)
    results.append({"language": lang_code, "Recall@10": recall, "MRR@10": mrr})
    print(f"Recall@{K}: {recall:.4f}, MRR@{K}: {mrr:.4f}")

# Display summary table
df_results = pd.DataFrame(results)
df_results


Processing language: monolingual


FileNotFoundError: [Errno 2] No such file or directory: 'data/swim_ir_v1/swim_ir_v1/monolingual/train.jsonl'

# Dense Multilingual Retrieval Using Embeddings
Next step: use LaBSE, mSBERT, or XLM-R embeddings to encode queries and passages.

Build a dense vector index (FAISS or similar) and retrieve top-K passages.

# Hybrid Approach (BM25 + Dense Retrieval)
Combine BM25 scores and dense retrieval scores

Test weighted combination or re-ranking

Compare improvements over BM25-only or dense-only

# Neural Reranking on Top-K Results

Optional but nice to have

Use cross-encoder models to rerank top-K retrieved passages

Improves semantic matching on hard queries

# Results and Conclusions