# Retrieval Baseline Notebook

## Retrieval Baseline for SciSumm-RAG

In this notebook we:
- Load the FAISS index and the chunk texts (`chunks.jsonl`)
- Demonstrate retrieval, both direct (FAISS only) and hybrid (FAISS plus cross-encoder reranking)
- Generate a summary from the retrieved chunks

In [1]:
import os, sys
from pathlib import Path

root = Path(os.getcwd()).parent
sys.path.insert(0, str(root))
sys.path.insert(0, str(root / "src"))

In [2]:
import json
from pathlib import Path
from typing import List, Tuple

import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer, CrossEncoder

In [3]:
from src.retriever.embed import embed_texts
from src.retriever.index import (
    normalize_embeddings,
    search,
    hybrid_search,
    load_embeddings
)
from src.generator.hf_summarizer import HFSummarizer

In [4]:
# Auto-detect project_root: walk up until we find data/clean/embeddings.npy
root = Path.cwd()
while not (root / "data" / "clean" / "embeddings.npy").exists():
    # if we reach the filesystem root, exit with an error
    if root.parent == root:
        raise RuntimeError("Could not find data/clean/embeddings.npy in the current directory or any parent")
    root = root.parent

project_root = root
print("project_root:", project_root)

# Paths
clean_data = project_root / "data" / "clean" / "metadata_clean.csv"
clean_dir = project_root / "data" / "clean"
index_dir = project_root / "data" / "index" / "faiss"

emb_path = clean_dir / "embeddings.npy"
ids_path = clean_dir / "ids.json"

flat_index_path     = index_dir / "flat_index.index"
flat_ids_path       = index_dir / "flat_index_ids.json"
hnsw_index_path     = index_dir / "hnsw_index.index"
hnsw_ids_path       = index_dir / "hnsw_index_ids.json"
ivfpq_index_path    = index_dir / "ivfpq_index.index"
ivfpq_ids_path      = index_dir / "ivfpq_index_ids.json"
opqivfpq_index_path = index_dir / "opqivfpq_index.index"
opqivfpq_ids_path   = index_dir / "opqivfpq_index_ids.json"

# Make `src/...` importable
sys.path.append(str(project_root))

print("embeddings exists:", emb_path.exists(), emb_path)
print("ids exists:       ", ids_path.exists(), ids_path)

project_root: D:\SciSumm-RAG
embeddings exists: True D:\SciSumm-RAG\data\clean\embeddings.npy
ids exists:        True D:\SciSumm-RAG\data\clean\ids.json
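
The upward search in the cell above can also be written as a small reusable helper. This is only a sketch: the name `find_project_root` and its `marker` parameter are illustrative and not part of the project code.

```python
from pathlib import Path


def find_project_root(start: Path, marker: Path) -> Path:
    """Return the first directory at or above `start` that contains `marker`.

    `find_project_root` is a hypothetical helper, not a function from `src/`.
    """
    start = start.resolve()
    # Check `start` itself first, then each parent up to the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise RuntimeError(f"Could not find {marker} at or above {start}")
```

In this notebook it would be called as `find_project_root(Path.cwd(), Path("data/clean/embeddings.npy"))`. Iterating over `Path.parents` removes the need for the explicit `root.parent == root` termination check.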


In [5]:
def load_index_and_ids(
    index_file: Path,
    ids_file: Path
) -> Tuple[faiss.Index, List[Tuple[str,str,str]]]:
    # 1) read the FAISS index from the .index file
    idx = faiss.read_index(str(index_file))
    # 2) read metadata from *_ids.json
    with open(ids_file, 'r', encoding='utf-8') as f:
        raw = json.load(f)
    # JSON stores lists, let's convert them to tuples
    ids = [tuple(x) for x in raw]
    return idx, ids
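
Because FAISS returns row positions rather than IDs, the loaded ID list is only useful if it lines up row for row with the index. A minimal sanity check, assuming the IDs are `(paper_id, section, chunk_id)` tuples as above, might look like the following; `check_alignment` is a hypothetical helper and not part of `src/`.

```python
from typing import Sequence, Tuple

ChunkKey = Tuple[str, str, str]


def check_alignment(n_vectors: int, ids: Sequence[ChunkKey]) -> None:
    """Raise ValueError if `ids` cannot be a row-for-row map of the index.

    Hypothetical helper: `n_vectors` is the number of vectors in the index.
    """
    if n_vectors != len(ids):
        raise ValueError(
            f"index holds {n_vectors} vectors but {len(ids)} IDs were loaded"
        )
    if len(set(ids)) != len(ids):
        raise ValueError("ID list contains duplicate (paper_id, section, chunk_id) keys")
```

With a FAISS index the call would be `check_alignment(idx.ntotal, ids)`, since `ntotal` is the number of vectors stored in the index.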

In [6]:
ids, vecs = load_embeddings(emb_path, ids_path)
# turn each [paper_id, section, chunk_id] into a tuple
ids = [tuple(x) for x in ids]
vecs = vecs.astype('float32')  

index_flat, ids_flat = load_index_and_ids(flat_index_path, flat_ids_path)

In [7]:
# Load the chunk-text mapping from the JSONL file
chunks_file = clean_dir / "chunks.jsonl"
chunk_texts = {}
with open(chunks_file, "r", encoding="utf-8") as f:
    for line in f:
        pid, section, cid, txt = json.loads(line)
        chunk_texts[(pid, section, cid)] = txt
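
The loop above assumes every line is a JSON array of exactly four elements. A slightly more defensive variant, sketched here under that same assumption, skips blank lines and reports the offending line number; `parse_chunks` is a hypothetical name and the sample records are made up for illustration.

```python
import json
from typing import Dict, Iterable, Tuple

ChunkKey = Tuple[str, str, str]


def parse_chunks(lines: Iterable[str]) -> Dict[ChunkKey, str]:
    """Build {(paper_id, section, chunk_id): text} from JSONL lines."""
    chunk_texts: Dict[ChunkKey, str] = {}
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        if not isinstance(record, list) or len(record) != 4:
            raise ValueError(f"line {line_no}: expected a 4-element JSON array")
        pid, section, cid, txt = record
        chunk_texts[(pid, section, cid)] = txt
    return chunk_texts


# Illustrative input only; the real file is data/clean/chunks.jsonl.
sample = [
    '["paper_a", "para_0", "para_0__0", "first chunk"]',
    "",
    '["paper_a", "para_0", "para_0__1", "second chunk"]',
]
chunks = parse_chunks(sample)
```

Since the function accepts any iterable of strings, an open file handle can be passed directly: `parse_chunks(open(chunks_file, encoding="utf-8"))`.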

## Preparing the reranker and summarizer

In [8]:
import torch 

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [9]:
# Cross-encoder reranker and HF summarizer
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device=DEVICE)
summarizer = HFSummarizer()

Device set to use cuda:0


In [10]:
queries = [
    "What is a mechanism for generating notebook interfaces for DSLs?",  # CV/PL
    "How to stabilize corium during severe nuclear accident?",           # Nuclear
    "What methods exist for probabilistic verification of software?"   # ML/verification
]

# Embed and normalize all queries
q_embs = embed_texts(queries)
q_embs = normalize_embeddings(q_embs.astype(np.float32))

2025-07-12 13:32:07,063 - INFO - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
Batches: 100%|██████████| 1/1 [00:00<00:00,  2.32it/s]
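
A note on why the query embeddings are normalized: for unit-length vectors the inner product equals the cosine similarity, so an inner-product search over normalized vectors ranks by cosine. The sketch below illustrates that identity in plain Python. It assumes `normalize_embeddings` performs row-wise L2 normalization, which this notebook does not show, and the helper names are illustrative.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale `vec` to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
# After normalization the plain inner product matches the cosine similarity.
assert math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b))
```

The stored chunk vectors need the same normalization for this to hold; whether the saved index was built that way is not visible from this notebook.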


In [11]:
# True: hybrid search (dense FAISS retrieval + cross-encoder rerank); False: plain FAISS search
use_hybrid = True

In [12]:
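results = []
# Position i in chunk_keys must correspond to vector i in index_flat.
# This assumes chunks.jsonl is in the same order the index was built in.
chunk_keys = list(chunk_texts.keys())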

for q, q_emb in zip(queries, q_embs):
    if use_hybrid:
        res = hybrid_search(
            coarse_idx    = index_flat,
            ids           = chunk_keys,
            queries       = q_emb[np.newaxis, :],
            query_texts   = [q],
            chunks_path   = clean_dir / "chunks.jsonl",
            rerank_model  = reranker,
            top_k_coarse  = 100,
            top_k         = 5
        )[0]
    else:
        # Direct FAISS search: top-5 nearest neighbours of the query vector
        distances, indices = index_flat.search(q_emb[np.newaxis, :], 5)
        # Pair each hit's chunk key with its FAISS score
        res = [
            (chunk_keys[i], float(score))
            for i, score in zip(indices[0], distances[0])
        ]
    results.append(res)

Batches: 100%|██████████| 4/4 [00:00<00:00,  9.79it/s]
Batches: 100%|██████████| 4/4 [00:00<00:00, 29.18it/s]
Batches: 100%|██████████| 4/4 [00:00<00:00, 26.93it/s]
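
The hybrid path follows a retrieve-then-rerank pattern: a cheap first stage shortlists `top_k_coarse` candidates, and a more expensive scorer reorders only that shortlist. The implementation of `hybrid_search` is not shown in this notebook, so the following is a toy sketch of the general pattern rather than the project's code. `two_stage_search` and `word_overlap` are hypothetical, and word overlap stands in for the cross-encoder.

```python
from typing import Callable, Dict, List, Tuple


def word_overlap(query: str, text: str) -> float:
    """Toy relevance score: number of distinct words shared by query and text."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def two_stage_search(
    query: str,
    coarse_scores: Dict[str, float],
    texts: Dict[str, str],
    rerank_fn: Callable[[str, str], float],
    top_k_coarse: int,
    top_k: int,
) -> List[Tuple[str, float]]:
    """Shortlist by coarse score, then reorder the shortlist with `rerank_fn`."""
    # Stage 1: keep the top_k_coarse documents by the cheap score.
    shortlist = sorted(coarse_scores, key=coarse_scores.get, reverse=True)[:top_k_coarse]
    # Stage 2: rescore only the shortlist with the expensive scorer.
    rescored = [(doc_id, rerank_fn(query, texts[doc_id])) for doc_id in shortlist]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]


# Made-up documents and coarse scores, for illustration only.
texts = {
    "a": "nuclear reactor cooling",
    "b": "notebook interfaces for dsls",
    "c": "notebook dsls",
}
coarse = {"a": 0.9, "b": 0.8, "c": 0.1}
hits = two_stage_search("notebook interfaces dsls", coarse, texts, word_overlap, 3, 2)
```

One consequence of this design is that the final scores come from the second stage, not from FAISS. That is consistent with the hybrid scores printed further down, which fall outside the range of cosine similarities.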


In [13]:
for q, res in zip(queries, results):
    print(f"\nQUERY: {q}")
    for rank, (key, score) in enumerate(res, start=1):
        pid, section, cid = key
        snippet = chunk_texts[key][:200].replace("\n", " ")
        print(f"{rank}. {pid} | {section} | {cid} (score={score:.3f})")
        print("   ", snippet, "…")


QUERY: What is a mechanism for generating notebook interfaces for DSLs?
1. 2002.06180 | para_0 | para_0__1 (score=8.897)
    interaction between end - users and dsls. approach : in this paper, we present bacat \ ' a, a mechanism for generating notebook interfaces for dsls in a language parametric fashion. we designed this m …
2. 2002.06180 | para_0 | para_0__0 (score=4.598)
    context : computational notebooks are a contemporary style of literate programming, in which users can communicate and transfer knowledge by interleaving executable code, output, and prose in a single …
3. 2002.06180 | para_0 | para_0__2 (score=2.609)
    ), sweeterjs ( an extended version of javascript ), and ql ( a dsl for questionnaires ). additionally, it is relevant to generate notebook implementations rather than implementing them manually. we me …
4. 2005.09028 | para_0 | para_0__1 (score=-2.127)
    ranging from krishnamurthi ' s classic automata dsl to a sound synthesis dsl and a probabilistic programm

In [14]:
for q, res in zip(queries, results):
    top_key = res[0][0]    # (paper_id, section, chunk_id) of the best hit
    pid     = top_key[0]   # paper_id of the best hit

    # collect every top-5 chunk that belongs to this paper
    passages = [
        chunk_texts[key]
        for key, _ in res
        if key[0] == pid
    ]
    combined = "\n\n".join(passages)

    summary = summarizer.summarize(
        combined,
        max_length=150,
        min_length=30
    )
    print(f"\nSUMMARY for {pid}:\n", summary)


SUMMARY for 2002.06180:
 bacat \ ' a is a mechanism for generating notebook interfaces for dsls in a language parametric fashion . The tool can be used to generate notebooks for halide, sweeterjs, and ql for questionnaires . It can be easily generated with little manual configuration .

SUMMARY for 2408.15290:
 in - vessel retention ( ivr ) strategy for nuclear reactors in case of a severe accident ( sa ) intends to stabilize and retain the corium in the vessel by using the vessel wall as a heat exchanger with an external water loop . This strategy relies on simple actions to be passively taken as soon as sa signal is raised : vessel depressurization and reactor pit flooding .

SUMMARY for 2304.13519:
 A counterfeit - proof label composed of randomly distributed gold nanospheres or rods in a semi-transparent material . The characteristic positioning of the label ' s elements can be precisely measured using a smartphone ' s camera and additional technologies .
