# 🔎 Multilingual Retrieval & Ranking Pipeline (Colab)
**Hybrid BM25 + Dense (FAISS) + Fusion + Learning-to-Rank (LightGBM) + optional Cross-Encoder rerank**  
Dataset: **Amazon Reviews Multi** (multilingual, medium-sized). Language can be switched in config.

This notebook demonstrates a production-style search ranking pipeline:
1. **Data build** (documents + queries + qrels) from a real dataset
2. **Lexical retrieval** with BM25
3. **Dense retrieval** with Sentence Transformers + FAISS
4. **Fusion** (RRF) to build a candidate set
5. **Learning-to-Rank (GBDT)** features and model training
6. **Optional cross-encoder rerank** for top-N
7. **Evaluation** (NDCG@10, MRR, Recall@100) and **latency sampling**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 0) Setup & Installs

In [1]:
%%capture
!pip -q install datasets==3.6.0 transformers sentence-transformers faiss-cpu rank-bm25 lightgbm langdetect unidecode pandas scikit-learn

In [2]:
import os, sys, math, random, gc, time
from dataclasses import dataclass
from typing import List, Dict, Tuple
import numpy as np
import pandas as pd
from tqdm.auto import tqdm

import torch
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer, CrossEncoder
import faiss

from datasets import load_dataset
from rank_bm25 import BM25Okapi

import lightgbm as lgb

from sklearn.model_selection import train_test_split
from langdetect import detect as lang_detect
from unidecode import unidecode

SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

## 1) Config — dataset, language, and sizes

In [3]:

CONFIG = {
    "language": "en",                  # Choose from: 'en','de','fr','es','ja','zh'
    "N_DOCS": 50000,                   # number of product docs to index
    "N_QUERIES": 5000,                 # number of queries (review titles) to use
    "TOPK_BM25": 200,
    "TOPK_ANN": 200,
    "FUSION_K": 300,                   # candidates to feed the ranker
    "RERANK_TOPN": 50,                 # optional cross-encoder budget (set 0 to disable)

    # Models
    "dense_model": "intfloat/multilingual-e5-base",   # multilingual dense encoder
    "ce_model": "cross-encoder/mmarco-mMiniLMv2-L12-H384-v1",  # multilingual cross-encoder
    "batch_size": 128,

    # Ranker
    "train_frac": 0.8,
    "ndcg_k": 10
}
CONFIG

{'language': 'en',
 'N_DOCS': 50000,
 'N_QUERIES': 5000,
 'TOPK_BM25': 200,
 'TOPK_ANN': 200,
 'FUSION_K': 300,
 'RERANK_TOPN': 50,
 'dense_model': 'intfloat/multilingual-e5-base',
 'ce_model': 'cross-encoder/mmarco-mMiniLMv2-L12-H384-v1',
 'batch_size': 128,
 'train_frac': 0.8,
 'ndcg_k': 10}

## 2) Load & Construct a Retrieval Dataset from Amazon Reviews Multi
We will:
- Load `amazon_reviews_multi` for the chosen language.
- Build a **product document** per `product_id` by concatenating up to 10 of its review titles and up to 5 of its review bodies.
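- Build **queries** from review titles, with relevance pointing to the corresponding `product_id`. Because a product's document includes some of its review titles, a query can appear verbatim in its relevant document.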
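
The three steps above can be sketched on toy data with only the standard library. The review rows and the helper name `build_docs_and_queries` are invented for illustration; the notebook itself does the same thing with pandas in the next cell.

```python
from collections import defaultdict

# Hypothetical review rows: (product_id, review_title, review_body)
reviews = [
    ("p1", "Great tent", "Kept us dry all weekend."),
    ("p1", "Too heavy", "Fine for car camping only."),
    ("p2", "Sturdy mug", "Survived a drop onto concrete."),
]

def build_docs_and_queries(rows, max_titles=10, max_bodies=5):
    titles, bodies = defaultdict(list), defaultdict(list)
    for pid, title, body in rows:
        titles[pid].append(title)
        bodies[pid].append(body)
    # One document per product: first titles joined by " | ", then first bodies
    docs = {
        pid: (" | ".join(titles[pid][:max_titles]) + " " + " ".join(bodies[pid][:max_bodies])).strip()
        for pid in titles
    }
    # One query per review title; its single relevant document is the reviewed product
    queries = [(title, pid) for pid, title, _ in rows]
    return docs, queries

docs, queries = build_docs_and_queries(reviews)
print(docs["p1"])
print(queries[0])
```

Each query carries exactly one relevant product id, which is the qrels structure the evaluation cells rely on later.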

In [7]:
def load_amazon_reviews_multi(lang="en", n_docs=50000, n_queries=5000, seed=SEED):
    # Alternative loader via the Hugging Face hub, kept for reference:
    #   ds = load_dataset("amazon_reviews_multi", lang, split="train")
    #   df = ds.to_pandas()
    # This run reads a CSV copy from Drive. The path is hardcoded to the 'en' split,
    # so point it at the matching file before changing CONFIG["language"].
    df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Datasets/amazon_reviews_multi/en/train.csv")
    df = df[df["language"] == lang]
    # Keep only the fields we need
    df = df[["review_id", "product_id", "review_title", "review_body", "stars"]].dropna()
    # Build product docs by grouping reviews per product
    grp = df.groupby("product_id")
    prod_stats = grp.agg({
        "review_title": lambda s: " | ".join(s.astype(str).head(10).tolist()),  # first 10 titles
        "review_body": lambda s: " ".join(s.astype(str).head(5).tolist()),      # first 5 bodies
        "stars": "mean"
    }).reset_index()
    prod_stats["doc_text"] = (prod_stats["review_title"].fillna("") + " " + prod_stats["review_body"].fillna("")).str.strip()
    prod_stats = prod_stats[prod_stats["doc_text"].str.len() > 10]
    # Sample N docs
    prod_stats = prod_stats.sample(frac=1.0, random_state=seed)
    prod_stats = prod_stats.head(n_docs).reset_index(drop=True)
    product_ids = set(prod_stats["product_id"].tolist())

    # Queries from review titles mapped to product ids in our sampled set
    q = df[df["product_id"].isin(product_ids)][["review_title", "product_id"]].dropna()
    q = q.rename(columns={"review_title": "query", "product_id": "relevant_pid"})
    q = q[q["query"].str.len() > 2].drop_duplicates().sample(frac=1.0, random_state=seed)
    q = q.head(n_queries).reset_index(drop=True)

    return prod_stats[["product_id", "doc_text", "stars"]], q

docs_df, queries_df = load_amazon_reviews_multi(CONFIG["language"], CONFIG["N_DOCS"], CONFIG["N_QUERIES"])
len(docs_df), len(queries_df), docs_df.head(2), queries_df.head(2)

(50000,
 5000,
            product_id                                           doc_text  \
 0  product_en_0980094  Calming and Peaceful I absolutely LOVE this Oi...   
 1  product_en_0781467  Sturdy I use this for hiking and open water sw...   
 
    stars  
 0    5.0  
 1    3.0  ,
                                               query        relevant_pid
 0                                     As advertised  product_en_0268689
 1  Slides all over the place needs a rubber backing  product_en_0402232)

## 3) Tokenization utilities for BM25 (space and non-space languages)

In [8]:

def simple_tokenize(text: str, lang_hint: str = "en"):
    text = str(text).lower()
    # Basic normalization
    text = text.replace("\n", " ").strip()
    if lang_hint in {"ja","zh","ko","th"}:
        # Character unigrams + bigrams for non-space languages
        chars = [c for c in text if not c.isspace()]
        bigrams = [chars[i] + chars[i+1] for i in range(len(chars)-1)]
        return chars + bigrams
    else:
        # Space languages: basic whitespace + alphanum filter
        tokens = []
        for tok in text.split():
            tok = "".join(ch for ch in tok if ch.isalnum() or ch in "-_/")
            if tok:
                tokens.append(tok)
        return tokens

# Pre-tokenize corpus for BM25
tokenized_corpus = [simple_tokenize(t, CONFIG["language"]) for t in tqdm(docs_df["doc_text"].tolist(), desc="Tokenizing corpus")]
len(tokenized_corpus[0]), tokenized_corpus[0][:15]

(37,
 ['calming',
  'and',
  'peaceful',
  'i',
  'absolutely',
  'love',
  'this',
  'oil',
  'i',
  'defuse',
  'it',
  'at',
  'night',
  'while',
  'we'])
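
The output above only exercises the whitespace branch. As a small illustration of the other branch, here is a standalone copy of the character unigram + bigram logic applied to a short Japanese string; `char_ngrams` is a name introduced here, not a function in the notebook.

```python
def char_ngrams(text: str):
    # Same idea as the non-space branch of simple_tokenize:
    # drop whitespace, then emit character unigrams followed by bigrams
    chars = [c for c in text.lower() if not c.isspace()]
    bigrams = [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]
    return chars + bigrams

tokens = char_ngrams("寝袋 軽い")
print(tokens)
```

Whitespace is removed before bigrams are formed, so a bigram such as `袋軽` can span what was originally a gap between two words.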

## 4) Lexical Retrieval — BM25

In [9]:

bm25 = BM25Okapi(tokenized_corpus)

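def bm25_search(queries: List[str], topk: int, lang_hint: str):
    """Return one (doc_indices, scores) pair per query, best match first."""
    results = []
    for q in queries:
        toks = simple_tokenize(q, lang_hint)
        scores = bm25.get_scores(toks)  # numpy array with one score per corpus doc
        k = min(topk, len(scores))      # argpartition raises if k exceeds the corpus size
        # Partial sort to find the top-k, then order just those k by descending score
        top_idx = np.argpartition(scores, -k)[-k:]
        top_idx = top_idx[np.argsort(-scores[top_idx])]
        results.append((top_idx, scores[top_idx]))
    return results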

# Quick smoke test
sample = queries_df["query"].head(3).tolist()
bm_out = bm25_search(sample, topk=5, lang_hint=CONFIG["language"])
[(len(idx), idx[:3]) for idx,_ in bm_out]

[(5, array([20504, 15157, 44007])),
 (5, array([27821,  6452, 18749])),
 (5, array([17627, 39110, 28989]))]
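
For intuition about what the scores behind these indices measure, here is a minimal, textbook-style Okapi BM25 scorer in plain Python. It is a sketch and not the `rank_bm25` implementation: the smoothed IDF and the `k1` / `b` values are assumptions, and libraries differ in how they handle IDF for very common terms.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    n_docs = len(corpus_tokens)
    avg_len = sum(len(d) for d in corpus_tokens) / n_docs
    # Document frequency: number of docs containing each term
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative (an assumption; libraries differ here)
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores

corpus = [
    ["sturdy", "tent", "for", "hiking"],
    ["calming", "oil", "for", "sleep"],
    ["tent", "tent", "poles", "bent"],
]
scores = bm25_scores(["sturdy", "tent"], corpus)
print(scores)
```

The first document wins because it matches both query terms, including the rarer one; the third matches only `tent`, and repeating that term yields diminishing returns.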

## 5) Dense Retrieval — Sentence Transformers + FAISS

In [10]:

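# Note: the E5 model card recommends "query: " / "passage: " input prefixes; this notebook encodes raw text as-is.
dense = SentenceTransformer(CONFIG["dense_model"], device=device)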

def encode_texts(texts: List[str], batch_size=128):
    vecs = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Encoding"):
        batch = texts[i:i+batch_size]
        emb = dense.encode(batch, batch_size=batch_size, convert_to_numpy=True, show_progress_bar=False, normalize_embeddings=True)
        vecs.append(emb)
    return np.vstack(vecs)

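# Build document embeddings & FAISS HNSW index.
# IndexHNSWFlat defaults to the L2 metric, so search returns squared L2 distances
# (smaller = closer). On normalized vectors this ranks identically to cosine similarity;
# pass faiss.METRIC_INNER_PRODUCT as a third constructor argument for inner-product scores instead.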
doc_embeddings = encode_texts(docs_df["doc_text"].tolist(), batch_size=CONFIG["batch_size"]).astype("float32")
index = faiss.IndexHNSWFlat(doc_embeddings.shape[1], 32)
index.hnsw.efConstruction = 200
index.hnsw.efSearch = 128
index.add(doc_embeddings)

def ann_search(queries: List[str], topk: int):
    q_emb = encode_texts(queries, batch_size=CONFIG["batch_size"]).astype("float32")
    # With the default L2 metric these "scores" are squared distances, sorted ascending (best first)
    scores, idx = index.search(q_emb, topk)
    return idx, scores

# Smoke test
ann_idx, ann_scores = ann_search(sample, topk=5)
ann_idx[0][:5], ann_scores[0][:5]

(array([15157, 20504,  7569, 46562, 22528]),
 array([0.17396608, 0.17396608, 0.18446824, 0.2015819 , 0.20352012],
       dtype=float32))
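
The scores printed above increase from left to right, which is what squared L2 distances look like when sorted best-first. Assuming the index uses FAISS's default L2 metric and the embeddings are unit-normalized, the two notions are linked by d² = 2 − 2·cos. The sketch below checks that identity on toy vectors; all names in it are illustrative.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_from_squared_l2(d2):
    # Valid only when both vectors have unit length: d^2 = 2 - 2*cos
    return 1.0 - d2 / 2.0

a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 3.0])
d2 = squared_l2(a, b)
cos = sum(x * y for x, y in zip(a, b))
print(d2, cos, cosine_from_squared_l2(d2))
```

Under that reading, the best distance above (about 0.174) corresponds to a cosine similarity of roughly 0.913.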

## 6) Fusion — Reciprocal Rank Fusion (RRF)

In [11]:

def rrf_fuse(bm_idx, bm_scores, ann_idx, ann_scores, k_fuse=300, K=60):
    # bm_idx / ann_idx: per-query arrays of doc indices, best first.
    # RRF uses ranks only, so the score arguments are accepted but unused.
    fused_indices = []
    fused_scores = []
    for q in range(len(bm_idx)):
        # Build rank dicts
        ranks = {}
        # BM25
        for r, doc_id in enumerate(bm_idx[q]):
            ranks.setdefault(doc_id, 0.0)
            ranks[doc_id] += 1.0 / (K + r + 1)
        # ANN
        for r, doc_id in enumerate(ann_idx[q]):
            ranks.setdefault(doc_id, 0.0)
            ranks[doc_id] += 1.0 / (K + r + 1)
        # Top k_fuse
        items = sorted(ranks.items(), key=lambda x: -x[1])[:k_fuse]
        fused_indices.append(np.array([it[0] for it in items], dtype=int))
        fused_scores.append(np.array([it[1] for it in items], dtype=float))
    return fused_indices, fused_scores
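
A worked example makes the fusion rule concrete: each list contributes 1 / (K + rank) per document, with ranks starting at 1, and the contributions are summed. The toy rankings and the compact `rrf` helper below are illustrative only.

```python
def rrf(rankings, K=60, top_k=None):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Same contribution as 1 / (K + r + 1) with a 0-based position r
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (K + rank)
    fused = sorted(scores.items(), key=lambda item: -item[1])
    return fused[:top_k] if top_k else fused

bm25_ranking = ["a", "b", "c"]
ann_ranking = ["b", "d", "a"]
fused = rrf([bm25_ranking, ann_ranking])
order = [doc_id for doc_id, _ in fused]
print(order)
```

Documents `a` and `b` appear in both lists and therefore outrank `d` and `c`, which each appear once; `b` edges out `a` because its two ranks (2 and 1) are better than `a`'s (1 and 3).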

## 7) Candidate Generation — BM25 + ANN + Fusion

In [12]:

def build_candidates(queries_df, topk_bm25, topk_ann, k_fuse, lang_hint):
    qs = queries_df["query"].tolist()
    # BM25
    bm_out = bm25_search(qs, topk=topk_bm25, lang_hint=lang_hint)
    bm_idx = [o[0] for o in bm_out]
    bm_scores = [o[1] for o in bm_out]
    # ANN
    ann_idx, ann_scores = ann_search(qs, topk=topk_ann)
    # Fuse
    fused_idx, fused_scores = rrf_fuse(bm_idx, bm_scores, ann_idx, ann_scores, k_fuse=k_fuse)
    return bm_idx, bm_scores, ann_idx, ann_scores, fused_idx, fused_scores

bm_idx, bm_scores, ann_idx, ann_scores, fused_idx, fused_scores = build_candidates(
    queries_df, CONFIG["TOPK_BM25"], CONFIG["TOPK_ANN"], CONFIG["FUSION_K"], CONFIG["language"]
)
len(fused_idx), len(fused_idx[0])

(5000, 300)
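
Before training a ranker on these candidates, it can help to check how often the relevant product made it into the fused list at all, since no reranker can recover a document that retrieval missed. A minimal sketch, with a hypothetical helper name and toy ids:

```python
def candidate_recall(candidate_lists, relevant_ids):
    # Fraction of queries whose relevant doc appears anywhere in its candidate list
    hits = sum(1 for cands, rel in zip(candidate_lists, relevant_ids) if rel in set(cands))
    return hits / len(relevant_ids)

toy_candidates = [[3, 7, 1], [4, 2, 9], [5, 8, 6], [0, 1, 2]]
toy_relevant = [7, 9, 0, 2]
recall = candidate_recall(toy_candidates, toy_relevant)
print(recall)
```

In the notebook's terms, the candidate lists would be `fused_idx` and the relevant ids would be the row positions in `docs_df` of each query's `relevant_pid`.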

## 8) Learning-to-Rank — Features for a fast ranker (LightGBM)

In [14]:
# Build a mapping for convenience
pid_list = docs_df["product_id"].tolist()

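def build_feature_row(q_text, q_id, cand_idx, bm_ranks, bm_s, ann_ranks, ann_s):
    doc_text = docs_df.loc[cand_idx, "doc_text"]
    # "Popularity" proxy: the product's mean star rating (0.0 if missing)
    stars = docs_df.loc[cand_idx, "stars"]
    pop = 0.0 if np.isnan(stars) else float(stars)

    # Length features (whitespace token counts)
    q_len = len(q_text.split())
    d_len = len(doc_text.split())

    # Retriever scores and ranks, used raw (no normalization is applied).
    # A candidate missing from one retriever's list gets rank 9999 and score 0.0 for it.
    # Caveat: if ann_score is an L2 distance (smaller = better), the 0.0 default reads
    # as a perfect match; ann_rank = 9999 is the signal that the candidate was missing.
    bm_rank = bm_ranks.get(cand_idx, 9999)
    ann_rank = ann_ranks.get(cand_idx, 9999)
    bm_score = bm_s.get(cand_idx, 0.0)
    ann_score = ann_s.get(cand_idx, 0.0)

    return [bm_score, ann_score, bm_rank, ann_rank, pop, q_len, d_len]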

feature_names = ["bm25_score","ann_score","bm25_rank","ann_rank","doc_popularity","query_len","doc_len"]

def build_ltr_dataset(queries_df, fused_idx, bm_idx, bm_scores, ann_idx, ann_scores, train=True):
    X = []
    y = []
    q_groups = []
    qids = []
    for qi in range(len(queries_df)):
        q_text = queries_df.loc[qi, "query"]
        rel_pid = queries_df.loc[qi, "relevant_pid"]
        # Build rank dicts for features
        bm_rank_map = {int(doc): r for r, doc in enumerate(bm_idx[qi])}
        ann_rank_map = {int(doc): r for r, doc in enumerate(ann_idx[qi])}
        bm_score_map = {int(doc): float(bm_scores[qi][r]) for r, doc in enumerate(bm_idx[qi])}
        ann_score_map = {int(doc): float(ann_scores[qi][r]) for r, doc in enumerate(ann_idx[qi])}

        # Candidate set
        cands = fused_idx[qi]
        local_X, local_y = [], []
        for doc_idx in cands:
            feats = build_feature_row(q_text, qi, int(doc_idx), bm_rank_map, bm_score_map, ann_rank_map, ann_score_map)
            local_X.append(feats)
            label = 1 if docs_df.loc[int(doc_idx), "product_id"] == rel_pid else 0
            local_y.append(label)

        # Skip queries whose relevant product is missing from the candidate set: the ranker
        # cannot learn from them, and the metrics computed later exclude them as well.
        if sum(local_y) == 0:
            continue

        X.extend(local_X)
        y.extend(local_y)
        q_groups.append(len(local_y))
        qids.append(qi)

    X = np.array(X, dtype=np.float32)
    y = np.array(y, dtype=np.int64)
    return X, y, q_groups, qids

# Train/Dev split on queries
train_q, dev_q = train_test_split(queries_df, test_size=1-CONFIG["train_frac"], random_state=SEED, shuffle=True)
# Reset index after splitting
train_q = train_q.reset_index(drop=True)
dev_q = dev_q.reset_index(drop=True)

# Rebuild candidates per split for realism (optional optimization: reuse)
bm_idx_tr, bm_scores_tr, ann_idx_tr, ann_scores_tr, fused_idx_tr, fused_scores_tr = build_candidates(
    train_q, CONFIG["TOPK_BM25"], CONFIG["TOPK_ANN"], CONFIG["FUSION_K"], CONFIG["language"]
)
bm_idx_dev, bm_scores_dev, ann_idx_dev, ann_scores_dev, fused_idx_dev, fused_scores_dev = build_candidates(
    dev_q, CONFIG["TOPK_BM25"], CONFIG["TOPK_ANN"], CONFIG["FUSION_K"], CONFIG["language"]
)

X_tr, y_tr, g_tr, qids_tr = build_ltr_dataset(train_q, fused_idx_tr, bm_idx_tr, bm_scores_tr, ann_idx_tr, ann_scores_tr)
X_dev, y_dev, g_dev, qids_dev = build_ltr_dataset(dev_q, fused_idx_dev, bm_idx_dev, bm_scores_dev, ann_idx_dev, ann_scores_dev)

X_tr.shape, X_dev.shape, sum(g_tr), sum(g_dev), feature_names

((1022533, 7),
 (256970, 7),
 1022533,
 256970,
 ['bm25_score',
  'ann_score',
  'bm25_rank',
  'ann_rank',
  'doc_popularity',
  'query_len',
  'doc_len'])
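
The row counts above match `sum(g_tr)` and `sum(g_dev)` because the ranker expects a flat feature matrix plus a list of group sizes, with each query's rows stored contiguously. A small sketch of how such a layout is sliced back into per-query chunks (the helper name and toy labels are made up for illustration):

```python
def split_by_group(flat_rows, group_sizes):
    # Rows of one query are contiguous; group_sizes gives the length of each run
    assert sum(group_sizes) == len(flat_rows), "group sizes must cover every row"
    out, offset = [], 0
    for size in group_sizes:
        out.append(flat_rows[offset:offset + size])
        offset += size
    return out

labels = [0, 1, 0, 1, 0, 0, 0, 0, 1]
groups = [3, 2, 4]
per_query = split_by_group(labels, groups)
print(per_query)
```

The evaluation cell later walks `X_dev` and `y_dev` with the same running offset.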

## 9) Train a LightGBM Learning-to-Rank model (LambdaMART)

In [16]:

ranker = lgb.LGBMRanker(
    objective="lambdarank",
    n_estimators=400,
    learning_rate=0.05,
    num_leaves=63,
    max_depth=-1,
    subsample=0.8,          # row bagging only takes effect when subsample_freq > 0
    colsample_bytree=0.8,
    random_state=SEED
)

ranker.fit(
    X_tr, y_tr,
    group=g_tr,
    eval_set=[(X_dev, y_dev)],
    eval_group=[g_dev],
    eval_at=[CONFIG["ndcg_k"]]
)

def ranker_predict_scores(X):
    return ranker.predict(X)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.026671 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1216
[LightGBM] [Info] Number of data points in the train set: 1022533, number of used features: 7


In [20]:
import warnings
warnings.filterwarnings('ignore', category=UserWarning)

## 10) Evaluation: NDCG@10, MRR, Recall@100

In [21]:

def dcg_at_k(rels, k):
    rels = np.array(rels)[:k]
    gains = (2**rels - 1) / np.log2(np.arange(2, len(rels)+2))
    return gains.sum()

def ndcg_at_k(gt_rels, scores, k=10):
    # gt_rels: list of 0/1; scores: model scores aligned
    order = np.argsort(-scores)
    rels_sorted = np.array(gt_rels)[order]
    dcg = dcg_at_k(rels_sorted, k)
    ideal = dcg_at_k(sorted(gt_rels, reverse=True), k)
    return dcg / ideal if ideal > 0 else 0.0

def mrr_at_k(gt_rels, scores, k=10):
    order = np.argsort(-scores)[:k]
    rels_sorted = np.array(gt_rels)[order]
    for i, r in enumerate(rels_sorted, start=1):
        if r > 0:
            return 1.0 / i
    return 0.0

def recall_at_k(gt_rels, scores, k=100):
    # Each query has at most one relevant product here, so recall@k reduces to a 0/1 hit indicator
    order = np.argsort(-scores)[:k]
    rels_sorted = np.array(gt_rels)[order]
    return float(rels_sorted.sum() > 0)

def evaluate_ranker(dev_q, fused_idx_dev, X_dev, y_dev, g_dev, k=10):
    ndcgs, mrrs, recs = [], [], []
    offset = 0
    for grp_size in g_dev:
        scores = ranker_predict_scores(X_dev[offset:offset+grp_size])
        labels = y_dev[offset:offset+grp_size]
        ndcgs.append(ndcg_at_k(labels, scores, k=k))
        mrrs.append(mrr_at_k(labels, scores, k=k))
        recs.append(recall_at_k(labels, scores, k=max(100, k)))
        offset += grp_size
    return np.mean(ndcgs), np.mean(mrrs), np.mean(recs)

ndcg10, mrr10, rec100 = evaluate_ranker(dev_q, fused_idx_dev, X_dev, y_dev, g_dev, k=CONFIG["ndcg_k"])
print(f"NDCG@{CONFIG['ndcg_k']}: {ndcg10:.4f} | MRR@{CONFIG['ndcg_k']}: {mrr10:.4f} | Recall@100: {rec100:.4f}")

NDCG@10: 0.7037 | MRR@10: 0.6778 | Recall@100: 0.9441
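
Since every query in this setup has a single relevant product, both metrics reduce to simple functions of the rank at which that product appears: NDCG@k becomes 1 / log2(1 + rank) and MRR@k becomes 1 / rank, each dropping to 0 beyond the cutoff. The helpers below are a sketch of those closed forms, not code from the notebook.

```python
import math

def ndcg_single_relevant(rank, k=10):
    # With one binary-relevant doc the ideal DCG is 1, so NDCG@k = 1 / log2(1 + rank)
    return 1.0 / math.log2(1 + rank) if rank <= k else 0.0

def reciprocal_rank(rank, k=10):
    return 1.0 / rank if rank <= k else 0.0

for rank in (1, 2, 3, 11):
    print(rank, round(ndcg_single_relevant(rank), 4), round(reciprocal_rank(rank), 4))
```

Per query, NDCG is never below the reciprocal rank under these assumptions, which is consistent with NDCG@10 sitting slightly above MRR@10 in the results above.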


## 11) Optional Cross-Encoder Rerank on top-N (budgeted)

In [18]:

USE_CE = CONFIG["RERANK_TOPN"] > 0
if USE_CE:
    ce = CrossEncoder(CONFIG["ce_model"], device=device)
    def ce_rerank(q_texts: List[str], cand_indices: List[np.ndarray], topN: int):
        new_scores = []
        pairs = []
        # Build pairs for a batch CE scoring
        for qi, cands in enumerate(cand_indices):
            cands = cands[:topN]
            for d_idx in cands:
                doc = docs_df.loc[int(d_idx), "doc_text"]
                pairs.append((q_texts[qi], doc))
        # Score all pairs
        scores = ce.predict(pairs, batch_size=64, show_progress_bar=True)
        # Re-split
        pos = 0
        for qi, cands in enumerate(cand_indices):
            cands = cands[:topN]
            sc = scores[pos:pos+len(cands)]
            new_scores.append((cands, sc))
            pos += len(cands)
        return new_scores

    # Evaluate CE rerank on dev split
    q_texts_dev = dev_q["query"].tolist()
    ce_top = CONFIG["RERANK_TOPN"]
    ce_scored = ce_rerank(q_texts_dev, fused_idx_dev, ce_top)

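    # Compute metrics over the CE-reranked top-N only; candidates beyond top-N are dropped, not appended.
    # Every dev query is scored here (a miss counts as 0), whereas the LTR evaluation skips queries with
    # no positive among the candidates, so the two sets of numbers are not directly comparable.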
    ndcgs_ce, mrrs_ce, recs_ce = [], [], []
    for qi in range(len(dev_q)):
        # Build labels aligned to reranked cands
        rer_cands, rer_scores = ce_scored[qi]
        labels = [1 if docs_df.loc[int(idx), "product_id"] == dev_q.loc[qi, "relevant_pid"] else 0 for idx in rer_cands]
        ndcgs_ce.append(ndcg_at_k(labels, rer_scores, k=CONFIG["ndcg_k"]))
        mrrs_ce.append(mrr_at_k(labels, rer_scores, k=CONFIG["ndcg_k"]))
        recs_ce.append(recall_at_k(labels, rer_scores, k=max(100, CONFIG["ndcg_k"])))
    print(f"[CE Rerank Top-{ce_top}] NDCG@{CONFIG['ndcg_k']}: {np.mean(ndcgs_ce):.4f} | MRR@{CONFIG['ndcg_k']}: {np.mean(mrrs_ce):.4f}")
else:
    print("Cross-Encoder rerank disabled. Set CONFIG['RERANK_TOPN'] > 0 to enable.")

[CE Rerank Top-50] NDCG@10: 0.6393 | MRR@10: 0.6199
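
The cell above scores and evaluates only the top-N candidates. If a full ranked list is needed downstream, one option is to reorder the head by cross-encoder score and keep the tail in its fused order. A minimal sketch with a hypothetical helper and made-up scores:

```python
def merge_reranked(candidates, head_scores, top_n):
    # Reorder only the first top_n candidates by reranker score (descending);
    # everything after top_n keeps its original fused order.
    head = candidates[:top_n]
    assert len(head_scores) == len(head), "one score per reranked candidate"
    order = sorted(range(len(head)), key=lambda i: -head_scores[i])
    return [head[i] for i in order] + list(candidates[top_n:])

fused = ["d1", "d2", "d3", "d4", "d5"]
ce_scores = [0.2, 0.9, 0.5]  # hypothetical cross-encoder scores for d1..d3
final = merge_reranked(fused, ce_scores, top_n=3)
print(final)
```

This keeps the cross-encoder budget fixed at `top_n` pairs per query while still returning every candidate.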


## 12) Latency sampling (rough per-query mean)

In [22]:

import time
def time_stage(fn, *args, **kwargs):
    # perf_counter is monotonic and better suited to short intervals than time.time()
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    return (time.perf_counter() - t0), out

sample_qs = dev_q["query"].head(50).tolist()

t_bm, bm_out = time_stage(bm25_search, sample_qs, 50, CONFIG["language"])
# The ANN timing includes query encoding as well as the index lookup
t_ann, ann_out = time_stage(ann_search, sample_qs, 50)
bm_idx_s = [o[0] for o in bm_out]
bm_sc_s = [o[1] for o in bm_out]
ann_idx_s, ann_sc_s = ann_out
t_fuse, fuse_out = time_stage(rrf_fuse, bm_idx_s, bm_sc_s, ann_idx_s, ann_sc_s, 100)
print(f"BM25 (50 queries): {t_bm:.3f}s | ANN: {t_ann:.3f}s | Fusion: {t_fuse:.3f}s")
print("Per-query (approx): BM25 {:.1f} ms | ANN {:.1f} ms | Fusion {:.1f} ms".format(1000*t_bm/len(sample_qs), 1000*t_ann/len(sample_qs), 1000*t_fuse/len(sample_qs)))

BM25 (50 queries): 2.913s | ANN: 0.054s | Fusion: 0.003s
Per-query (approx): BM25 58.3 ms | ANN 1.1 ms | Fusion 0.1 ms
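
The figures above are batch totals divided by the number of queries, so they are means rather than percentiles. To get an actual P50 or P95, each query has to be timed on its own. The sketch below does that with a stand-in workload; `latency_percentiles` and `toy_search` are placeholders, not notebook functions.

```python
import math
import time

def latency_percentiles(fn, inputs, percentiles=(50, 95)):
    samples_ms = []
    for item in inputs:
        t0 = time.perf_counter()
        fn(item)
        samples_ms.append(1000.0 * (time.perf_counter() - t0))
    samples_ms.sort()
    out = {}
    for p in percentiles:
        # Nearest-rank percentile on the sorted samples
        idx = max(0, math.ceil(p * len(samples_ms) / 100) - 1)
        out[f"p{p}_ms"] = samples_ms[idx]
    return out

def toy_search(query):
    # Stand-in workload; swap in a single-query retrieval call to measure a real stage
    return sorted(query * 200)

stats = latency_percentiles(toy_search, ["slides all over", "as advertised", "sturdy"] * 10)
print(stats)
```

With only a few dozen samples the tail percentiles are noisy, so larger query samples give more stable numbers.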


## 13) Save Artifacts (optional)

In [23]:

!mkdir -p artifacts
np.save("artifacts/doc_embeddings.npy", doc_embeddings)
docs_df.to_csv("artifacts/docs.csv", index=False)
dev_q.to_csv("artifacts/dev_queries.csv", index=False)
print("Saved to ./artifacts")

Saved to ./artifacts


## 14) Next Steps
- Swap `language` in `CONFIG` to another (e.g., `'de'`, `'fr'`, `'ja'`, `'zh'`), point the CSV path in `load_amazon_reviews_multi` at the matching split, and re-run.
- Increase `N_DOCS` / `N_QUERIES` as Colab resources allow.
- Add more features to the ranker (price, brand match, attribute overlap).
- Replace LightGBM ranker with a **distilled cross-encoder** in the main path if latency budget allows.
- Export the learned ranker and serve it behind a low-latency API (e.g., FastAPI + ONNXRuntime for feature scoring).