# 🧪 ImpliRet · Hands-on Notebook

This notebook is a **hands-on companion** to the ImpliRet benchmark  
([📄 paper](https://arxiv.org/abs/2506.14407) • [💻 GitHub](https://github.com/ZeinabTaghavi/ImpliRet)).


**In the next ~10&nbsp;minutes you will**

1. 🔄 **Load** any of the six ImpliRet corpus slices  
2. 🔍 **Run a single retriever**, then score the results<br>(BM25, ColBERT-v2, Contriever, Dragon+, HippoRAG 2, ReasonIR-8B)  
3. 📚 **RAG-Style**: pass either the retrieved documents or the oracle (gold) document to an LLM (Llama-3, GPT-4.1, …)  

> **Tip:** switch Colab to a **GPU** runtime (`Runtime → Change runtime type`) for a roughly 3× speed-up. We tested this notebook with RTX 2080 Ti GPUs.

🔍 First, let's check the number of available GPUs


In [None]:
# Configure CUDA to utilize all available GPUs
import torch
import os

num_gpus = torch.cuda.device_count()
if num_gpus > 0:
    gpu_ids = list(range(num_gpus))
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, gpu_ids))
    print(f"Using {num_gpus} GPUs: {gpu_ids}")
else:
    print("No GPUs detected, using CPU")

🔑 Hugging Face Login Required

To use Llama models (in HippoRAG, ReasonIR, or RAG-Style experiments), you must first authenticate with Hugging Face.

This allows access to model weights and tokenizers from the Hugging Face Hub.

In [None]:
import os
from huggingface_hub import login

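# Authenticate with Hugging Face using a token.
# Set the HF_TOKEN environment variable, or paste your token in place of the empty string.
hf_token = os.environ.get("HF_TOKEN", "")
if hf_token:
    login(token=hf_token)
else:
    print("Warning: HF_TOKEN environment variable not found")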


🛠️ Clone the repository and install requirements

First, we'll clone the ImpliRet repository and install the required dependencies


In [4]:
# Install dependencies (≈ 1 min)
# Python version should be >= 3.9
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

DEVICE = "cuda"  # Options: ["cuda", "cpu"]
!git clone https://github.com/ZeinabTaghavi/ImpliRet.git
os.chdir("ImpliRet")
# Note: each `!` command runs in its own subshell, so a virtualenv activated
# there would not persist; install into the notebook kernel's environment instead.
%pip install -r requirements.txt

import os, json, pathlib, itertools, pprint, time


## 1 · 🔄 **Load a subset of the dataset**

Use 🤗 **datasets** to load any subset on the fly.

In [5]:
from datasets import load_dataset

SPLIT = "arithmetic"       #@param ["arithmetic", "wknow", "temporal"]
STYLE = "multispeaker"     #@param ["multispeaker", "unispeaker"]

ds = load_dataset("zeinabTaghavi/ImpliRet", name=STYLE, split=SPLIT)
print(f"Loaded {len(ds):,} examples  |  Columns → {list(ds.features.keys())}")
print("\nSample question →", ds[0]["question"])
print("Implicit document snippet →", ds[0]["pos_document"][:200], "...")

Loaded 1,500 examples  |  Columns → ['id', 'tuple_set_id', 'forum_question', 'pos_document', 'question', 'answer', 'explicit_hint']

Sample question → What brand and model of what was the brand and model of the smartphone that cost $1950? were priced at $1,950?
Implicit document snippet → 2024-12-15 09:01, Zoe: I’m happy to share my thoughts on this. As a wildlife photographer constantly on the move, I understand the need for a reliable smartphone, and I recently upgraded myself with a ...
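
The six slices mentioned in the introduction are the cross product of the two discourse styles and the three reasoning categories. A minimal sketch for enumerating them, reusing the `load_dataset` call from the cell above; `ALL_SLICES` and `load_slice` are hypothetical helper names, not part of the ImpliRet repository:

```python
from itertools import product

STYLES = ["multispeaker", "unispeaker"]        # dataset configs (the `name` argument)
SPLITS = ["arithmetic", "wknow", "temporal"]   # reasoning categories (the `split` argument)

# All (style, split) combinations: 2 x 3 = 6 slices
ALL_SLICES = list(product(STYLES, SPLITS))

def load_slice(style, split):
    """Hypothetical helper: load one slice with the same call used above."""
    from datasets import load_dataset  # imported lazily so the listing below needs no network
    return load_dataset("zeinabTaghavi/ImpliRet", name=style, split=split)

for style, split in ALL_SLICES:
    print(f"{style:>12} / {split}")
```

Calling `load_slice("unispeaker", "temporal")`, for example, should return the same kind of object as `ds` above, just for a different slice.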


## 2 · 🔍 Run a single retriever

You can **either** run the full baseline matrix via a helper **bash** script  
*or* call the **Python CLI** for a single configuration. 

### **Quick & Single-shot (bash)** 

In [None]:
# Run the retrieval script
! bash retrieve.sh

### **Step by Step**

In [None]:
from Retrieval.retrieve_indexing import save_retriever_indices

OUTPUT = pathlib.Path("Retrieval/results")
OUTPUT.mkdir(parents=True, exist_ok=True)

RETRIEVER = "bm25"      #@param ["bm25", "colbert", "contriever", "dragon_plus", "hipporag2", "reasonir8b"]

run_file = save_retriever_indices(
    output_folder=str(OUTPUT),
    category=SPLIT,
    discourse=STYLE,
    retriever_name=RETRIEVER,
)
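
To give a feel for what a lexical retriever such as BM25 does with a query, here is a toy, self-contained Okapi BM25 ranker. It is only an illustrative sketch: `bm25_rank`, the whitespace tokenizer, the `k1`/`b` defaults, and the toy documents are assumptions made here, and this is not the implementation behind `save_retriever_indices`.

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank document indices by Okapi BM25 score (toy whitespace tokenizer)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

toy_docs = [
    "the phone cost 1950 dollars",                             # states the price explicitly
    "i paid 2000 for the phone and got a 50 dollar discount",  # implies 1950 via arithmetic
    "my cat likes to sleep all day",                           # unrelated
]
toy_ranking = bm25_rank("phone that cost 1950", toy_docs)
print(toy_ranking)
```

In this toy corpus the second document only gets credit for the shared word "phone", because term matching cannot see that 2000 minus 50 equals 1950.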

### 📊 Evaluate retrieval results

Now let's go over the evaluation!

The evaluation result files will also be stored in `ImpliRet/Retrieval/reports`.

In [7]:
# Evaluate retrieval results and get recall, MRR and NDCG metrics
from Retrieval.reporting import evaluate_run

recall_results, mrr_results, ndcg_results = evaluate_run()

arithmetic | multispeaker | bm25: 1500 items
Recall@10: 0.1627
MRR@10: 0.0561
nDCG@10: 0.0807
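
If each question has exactly one relevant document (the `pos_document` column suggests this, but treat it as an assumption), the three metrics above reduce to simple functions of the gold document's rank. The sketch below shows that reduction on toy data; `score_run` and `rank_of_gold` are hypothetical names, and `evaluate_run` may compute the metrics differently.

```python
import math

def rank_of_gold(ranked_ids, gold_id, k=10):
    """Return the 1-based rank of gold_id within the top-k, or None if it is absent."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == gold_id:
            return rank
    return None

def score_run(runs, golds, k=10):
    """Average Recall@k, MRR@k and nDCG@k, assuming one relevant document per query."""
    recall = mrr = ndcg = 0.0
    for ranked_ids, gold_id in zip(runs, golds):
        rank = rank_of_gold(ranked_ids, gold_id, k)
        if rank is not None:
            recall += 1.0
            mrr += 1.0 / rank
            # With a single relevant document the ideal DCG is 1, so nDCG equals DCG
            ndcg += 1.0 / math.log2(rank + 1)
    n = len(golds)
    return {"recall": recall / n, "mrr": mrr / n, "ndcg": ndcg / n}

# Toy data: gold found at rank 2, at rank 1, and not found at all
toy_runs = [[3, 1, 2], [0, 5, 4], [9, 8, 7]]
toy_golds = [1, 0, 6]
toy_scores = score_run(toy_runs, toy_golds, k=3)
print(toy_scores)
```

Under this reading, Recall@10 counts how often the gold document appears in the top 10 at all, while MRR@10 and nDCG@10 additionally reward placing it near the top.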


## 3 · 📚 RAG-Style

You can **either** run the full baseline matrix via a helper **bash** script  
*or* call the **Python CLI** for a single configuration. 

> Note: This example uses the lightweight [Llama 3.2 3B](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) model by default. You can configure other models in the setup.

### **Quick & Single-shot (bash)** 

In [6]:
# Run the full baseline matrix via bash script
! bash ./sync_run_tests.sh

Starting the job
[run_tests] Parsed CLI / YAML arguments
[run_tests] Found model-config.
[run_tests] Initialising ExperimentTester …

 ----------- [STEP 1] Initialization -----------
Loading dataset configurations...
README.md: 7.00kB [00:00, 18.7MB/s]
A_Multi.csv: 1.65MB [00:00, 7.85MB/s]
W_Multi.csv: 1.67MB [00:00, 16.9MB/s]
T_Multi.csv: 1.45MB [00:00, 19.8MB/s]
Generating arithmetic split: 100%|█| 1500/1500 [00:00<00:00, 46455.75 examples/s
Generating wknow split: 100%|█████| 1500/1500 [00:00<00:00, 40474.36 examples/s]
Generating temporal split: 100%|██| 1500/1500 [00:00<00:00, 44631.02 examples/s]
[Init] Loaded dataset with 1500 rows

 ----------- [STEP 2] Processing Data -----------
Building conversation dictionaries...
1500
--------------------------------
[Init] ExperimentTester initialisation complete. Results will be written to ./RAG_Style/results/arithmetic_multispeaker/llama_3.2_3b/arithmetic_multispeaker_llama_3.2_3b_1_1752586253.json

 ----------- [STEP 4] Running Evaluat

### **Step by Step**

In [12]:
from RAG_Style.scripts.syncr.sync_run_tests import run_experiment
CONFIG = "RAG_Style/experiment_configs/bm/A_Multi_llama_bm_10.yaml"

# Configure experiment parameters
category = "arithmetic"              #@param ["arithmetic", "wknow", "temporal"] # Reasoning categories
discourse_type = "multispeaker"      #@param ["multispeaker", "unispeaker"] # Forum vs chat style discourse

# Configure model settings
model_name = "llama_3.2_3b"         # Model name from vLLM-API config
model_configs_dir = "RAG_Style/model_configs"

# Configure evaluation settings
metric = "EM,contains,rouge-recall"  # Comma-separated, no spaces
k = 1                               # Number of documents to retrieve
use_retrieval = False               # Use retriever output (True) or oracle documents (False)
seed = 42
output_folder = "RAG_Style/results/"

# Run experiment with configured parameters
result_path = run_experiment([
    "--model_name",      model_name,
    "--model_configs_dir", model_configs_dir,
    "--category",        category,
    "--discourse",       discourse_type,
    "--metric",          metric,
    "--k",               str(k),                       # "1"
    "--use_retrieval",   str(use_retrieval).lower(),   # "false"
    "--seed",            str(seed),                    # "42"
    "--output_folder",   output_folder
])
# Alternative: Use config file
# result_path = run_experiment(["--config", CONFIG])

print("Saved generations →", result_path)

[run_tests] Parsed CLI / YAML arguments
[run_tests] Found model-config.
[run_tests] Initialising ExperimentTester …

 ----------- [STEP 1] Initialization -----------
Loading dataset configurations...
[Init] Loaded dataset with 1500 rows

 ----------- [STEP 2] Processing Data -----------
Building conversation dictionaries...
1500
--------------------------------
[Init] ExperimentTester initialisation complete. Results will be written to RAG_Style/results/arithmetic_multispeaker/llama_3.2_3b/arithmetic_multispeaker_llama_3.2_3b_1_1750853307.json

 ----------- [STEP 4] Running Evaluation -----------
Processing 1500 examples...

 ----------- [STEP 5] Generating Responses -----------
Loading model and passing prompts to it...

 ----------- [STEP 1] Checking vLLM Installation -----------

 ----------- [STEP 2] Setting Up Model Configuration -----------
[ModelLoader] Loading meta-llama/Llama-3.2-3B-Instruct via local vLLM (TP=1, util=0.9, dir=None)

 ----------- [STEP 3] Loading Model -------

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 06-25 14:08:52 model_runner.py:1099] Loading model weights took 6.0160 GB
INFO 06-25 14:08:53 worker.py:241] Memory profiling takes 0.90 seconds
INFO 06-25 14:08:53 worker.py:241] the current vLLM instance can use total_gpu_memory (10.57GiB) x gpu_memory_utilization (0.90) = 9.51GiB
INFO 06-25 14:08:53 worker.py:241] model weights take 6.02GiB; non_torch_memory takes 0.10GiB; PyTorch activation peak memory takes 1.18GiB; the rest of the memory reserved for KV Cache is 2.21GiB.
INFO 06-25 14:08:53 gpu_executor.py:76] # GPU blocks: 1294, # CPU blocks: 2340
INFO 06-25 14:08:53 gpu_executor.py:80] Maximum concurrency for 1024 tokens per request: 20.22x
INFO 06-25 14:08:56 model_runner.py:1415] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilizatio

Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:18<00:00,  1.90it/s]

INFO 06-25 14:09:15 model_runner.py:1535] Graph capturing finished in 18 secs, took 0.19 GiB
INFO 06-25 14:09:15 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 22.54 seconds





Model loaded successfully with dtype='half'.

 ----------- [STEP 4] Loading Tokenizer -----------
Tokenizer loaded successfully.
--------------------------------
Sending prompts to model...
INFO 06-25 14:09:15 chat_utils.py:333] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.


Processed prompts:  22%|██▏       | 336/1500 [00:27<01:25, 13.61it/s, est. speed input: 5039.65 toks/s, output: 346.08 toks/s]



Processed prompts: 100%|██████████| 1500/1500 [02:00<00:00, 12.45it/s, est. speed input: 5113.30 toks/s, output: 441.50 toks/s]



 ----------- [STEP 6] Computing Scores -----------

 ----------- [STEP 7] Saving Results -----------

 ----------- Evaluation Complete -----------
Results saved to: RAG_Style/results/arithmetic_multispeaker/llama_3.2_3b/arithmetic_multispeaker_llama_3.2_3b_1_1750853307.json
[run_tests] Experiment completed.
Saved generations → RAG_Style/results/arithmetic_multispeaker/llama_3.2_3b
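
The `use_retrieval` flag switches between feeding the LLM documents returned by a retriever and feeding it the oracle (gold) document. The sketch below illustrates that choice only. `select_context`, `build_prompt`, and the prompt layout are hypothetical, and it assumes that oracle mode passes just the gold document; the repository's actual prompt construction and its handling of `k > 1` may differ.

```python
from typing import Dict, List, Sequence

def select_context(example: Dict[str, str], retrieved_docs: Sequence[str],
                   k: int = 1, use_retrieval: bool = False) -> List[str]:
    """Pick the documents shown to the LLM (assumed behaviour, for illustration)."""
    if use_retrieval:
        return list(retrieved_docs[:k])   # top-k documents from a retriever run
    return [example["pos_document"]]      # oracle setting: the gold document

def build_prompt(question: str, context_docs: Sequence[str]) -> str:
    """Hypothetical prompt layout: numbered documents followed by the question."""
    context = "\n\n".join(
        f"[Document {i}]\n{doc}" for i, doc in enumerate(context_docs, start=1)
    )
    return f"{context}\n\nQuestion: {question}\nAnswer:"

toy_example = {"question": "Which phone cost $1,950?", "pos_document": "gold document text"}
toy_retrieved = ["retrieved doc A", "retrieved doc B"]

oracle_prompt = build_prompt(
    toy_example["question"], select_context(toy_example, toy_retrieved)
)
retrieved_prompt = build_prompt(
    toy_example["question"],
    select_context(toy_example, toy_retrieved, k=2, use_retrieval=True),
)
print(oracle_prompt)
```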


### 📊 Evaluate RAG-Style results

Now let's go over the evaluation!

The evaluation result files will also be stored in `ImpliRet/RAG_Style/reports`.

In [1]:
import os
os.chdir("ImpliRet")
from RAG_Style.reporting import reporting

# Configure paths and parameters
result_path = "RAG_Style/results"
report_output_folder = "RAG_Style/reports"
metrics = "EM , rouge-recall"
reporting(result_path, metrics, report_output_folder, warn=False)

# Print evaluation tables
print("\nEM Table:")
with open("RAG_Style/reports/EM_table.tex", "r") as f:
    print(f.read())

print("\nROUGE-1 Recall Table:")
with open("RAG_Style/reports/rouge-1-recall_table.tex", "r") as f:
    print(f.read())

Parsed entries:
Total entries: 2
model_categories: dict_keys(['gpt-4.1-2025-04-14', 'llama_3.2_3b'])
len(model_categories[list(model_categories.keys())[0]]): 1
Result path: RAG_Style/results
Metrics: ['EM', 'rouge-recall']
Report output folder: RAG_Style/reports

EM Table:
Experiment | arithmetic | wknow | temporal | Uni Avg | Multi Avg \\
K | unispeaker & multispeaker | unispeaker & multispeaker | unispeaker & multispeaker |  |  \\
gpt-4.1-2025-04-14 & 1 & - & 9.00 & - & - & - & - & - & 9.00 \\
llama_3.2_3b & 1 & - & 0.00 & - & - & - & - & - & 0.00 \\

Average between unispeaker and multispeaker per k
Experiment & K & Avg(Uni+Multi) \\
gpt-4.1-2025-04-14 & 1 & 9.00 \\
llama_3.2_3b & 1 & 0.00 \\


ROUGE-1 Recall Table:
Experiment | arithmetic | wknow | temporal | Uni Avg | Multi Avg \\
K | unispeaker & multispeaker | unispeaker & multispeaker | unispeaker & multispeaker |  |  \\
gpt-4.1-2025-04-14 & 1 & - & 93.87 & - & - & - & - & - & 93.87 \\
llama_3.2_3b & 1 & - & 54.50 & - & - & - &
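
The tables report EM and ROUGE-1 recall, and the experiment cell also requests a `contains` metric. A rough sketch of what such answer-level metrics typically compute is given below, showing how the same answer can score 0 on EM yet 1 on the other two. The normalization rules and function names are assumptions made for illustration; the scoring code in `RAG_Style` may normalize and tokenize differently.

```python
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def contains(pred, gold):
    return float(normalize(gold) in normalize(pred))

def rouge1_recall(pred, gold):
    """Fraction of gold unigrams that also appear in the prediction."""
    gold_counts = Counter(normalize(gold).split())
    pred_counts = Counter(normalize(pred).split())
    if not gold_counts:
        return 0.0
    overlap = sum(min(count, pred_counts[tok]) for tok, count in gold_counts.items())
    return overlap / sum(gold_counts.values())

# Toy example with a made-up product name
toy_pred = "The phone was an Acme X1."
toy_gold = "Acme X1"
toy_metrics = {
    "EM": exact_match(toy_pred, toy_gold),
    "contains": contains(toy_pred, toy_gold),
    "rouge-recall": rouge1_recall(toy_pred, toy_gold),
}
print(toy_metrics)
```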

### 🎉 Thank you!
 
Thanks for following along with this notebook! Hope it was helpful! 👋

If you need help or think this notebook should be updated, feel free to email me at zeinabtaghavi1377@gmail.com 📧

