# Similarity & Reranking Playground

1. Build the pipeline step by step.

2. Experiment with dense, sparse, and hybrid retrieval.

3. Add a CrossEncoder reranker.

4. Package the reusable logic into clean functions you can later copy into query.py.

In [53]:
#1 - Setup & imports

#Purpose: load essentials, set up logging, and make the BM25/Ensemble imports robust across LangChain versions.
import os
import sys

# Add project root to sys.path
repo_root = os.path.abspath("..")   # assuming notebook is in /notebooks
if repo_root not in sys.path:       # avoid duplicate entries when the cell is re-run
    sys.path.append(repo_root)

# Logging
import logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

# Core
from langchain.schema import Document

# Robust imports for BM25 + Ensemble (LangChain has moved these modules across versions)
try:
    from langchain.retrievers import BM25Retriever, EnsembleRetriever
except ImportError:  # catch only import failures, not unrelated errors
    from langchain_community.retrievers import BM25Retriever
    from langchain.retrievers import EnsembleRetriever  # still here in many versions

# Reranker
from sentence_transformers import CrossEncoder

# Your ingestion & index
import importlib
import src.ingestion.video_loader as video_loader
from src.retrieval.index import TextIndexer


In [54]:
#2 - Reload your local module (avoid stale imports)

#Purpose: Jupyter caches imported modules; this forces it to pick up your latest video_loader.py.

importlib.reload(video_loader)
from src.ingestion.video_loader import fetch_video_info, fetch_transcript, build_docs_from_video


2025-09-29 17:38:28,359 [INFO] Use pytorch device_name: mps
2025-09-29 17:38:28,360 [INFO] Load pretrained SentenceTransformer: BAAI/bge-small-en
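
A note on why the `from ... import ...` line is repeated after the reload: `importlib.reload` re-executes the module, but names that were already bound with `from module import name` keep pointing at the old objects until you import them again. The sketch below demonstrates this with a throwaway module; `demo_loader` and its `version` function are hypothetical stand-ins, not part of this project.

```python
import importlib
import sys
import tempfile
from pathlib import Path

sys.dont_write_bytecode = True  # demo only: avoid any stale .pyc interfering

# Create a throwaway module on disk (hypothetical stand-in for video_loader.py)
tmp_dir = Path(tempfile.mkdtemp())
module_path = tmp_dir / "demo_loader.py"
module_path.write_text("def version():\n    return 'old'\n")

sys.path.insert(0, str(tmp_dir))
importlib.invalidate_caches()
import demo_loader
from demo_loader import version

# Simulate editing the file in your editor, then reloading in the notebook
module_path.write_text("def version():\n    return 'edited'\n")
importlib.reload(demo_loader)

stale = version()                # bound before the reload: still the old function
fresh = demo_loader.version()    # attribute lookup on the module: new function
from demo_loader import version  # re-import to rebind the bare name
rebound = version()

print(stale, fresh, rebound)
```

This is why the cell above calls `importlib.reload(video_loader)` first and only then re-imports `fetch_video_info`, `fetch_transcript`, and `build_docs_from_video`.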


In [55]:
#3 - Config

#Purpose: put constants in one place; easy to tweak.

URL = "https://www.youtube.com/watch?v=j9w7hEfeIbE"
CHROMA_DIR = "data/chroma"
MODEL_NAME = "BAAI/bge-small-en"

# Retrieval knobs
DENSE_K = 5
SPARSE_K = 5
HYBRID_WEIGHTS = [0.7, 0.3]   # [dense, sparse]
RERANK_TOP_K = 5              # final results after reranking


In [56]:
#4 - Fetch video data

#Purpose: get metadata + transcript with helpful prints.
meta = fetch_video_info(URL)
transcript = fetch_transcript(URL, lang="en")

print("Video ID:", meta["video_id"])
print("Title   :", meta["title"])
print("#Segments (transcript):", len(transcript.get("segments", [])))
print("Has chapters? ", bool(meta.get("chapters")))



2025-09-29 17:38:31,160 [INFO] 📺 Fetching video info for https://www.youtube.com/watch?v=j9w7hEfeIbE
2025-09-29 17:38:33,251 [INFO] ✅ Metadata fetched for video j9w7hEfeIbE (How to do a One-Way Goodness of Fit Chi-Square in JASP (15-10))
2025-09-29 17:38:33,252 [INFO] 📝 Fetching transcript for https://www.youtube.com/watch?v=j9w7hEfeIbE (lang=en)
2025-09-29 17:38:35,762 [INFO] ✅ 186 transcript segments fetched.


Video ID: j9w7hEfeIbE
Title   : How to do a One-Way Goodness of Fit Chi-Square in JASP (15-10)
#Segments (transcript): 186
Has chapters?  True


In [57]:
#5 - Build documents (description + transcript)

#Purpose: convert to query-ready Documents.

docs = build_docs_from_video(meta, transcript)
print(f"✅ Prepared {len(docs)} documents")
print("First doc:", docs[0].metadata.get("type"), "|", docs[0].page_content[:180], "...")


2025-09-29 17:38:35,770 [INFO] ✅ Produced 14 description chunks.
2025-09-29 17:38:35,774 [INFO] ⏩ Splitting transcript by 4 chapters


2025-09-29 17:38:38,338 [INFO] ✅ Produced 8 transcript chunks.
2025-09-29 17:38:38,339 [INFO] 📦 Built 22 docs (desc=14, trans=8).


✅ Prepared 22 documents
First doc: description | We learn how to calculate a One-Way Chi-Square goodness of fit test in JASP using the setting for Multinomial Test. For the null hypothesis, we assume that the observed values in o ...


In [58]:
#6 - Quick dataset stats (chunk counts & sizes)

#Purpose: sanity-check your splits and sizes by type.

from collections import Counter, defaultdict

counts = Counter(d.metadata["type"] for d in docs)
lens = defaultdict(list)  # chunk lengths (in characters) grouped by document type
for d in docs:
    lens[d.metadata["type"]].append(len(d.page_content))

def summarize(kind):
    """Return a one-line size summary for all chunks of the given type."""
    arr = lens.get(kind, [])
    if not arr:
        return "0 chunks"
    return f"{len(arr)} chunks | avg {sum(arr)//len(arr)} chars | min {min(arr)} | max {max(arr)}"

print("Counts:", dict(counts))
print("Description:", summarize("description"))
print("Transcript :", summarize("transcript"))


Counts: {'description': 14, 'transcript': 8}
Description: 14 chunks | avg 115 chars | min 16 | max 587
Transcript : 8 chunks | avg 939 chars | min 226 | max 1197


In [59]:
#7 - Build dense index (Chroma)

#Purpose: embed + persist; use a collection name tied to this video id to avoid collisions.

import importlib
import src.retrieval.index as index_module
importlib.reload(index_module)
from src.retrieval.index import TextIndexer

# Sanity check: the reloaded class should expose as_retriever
indexer = TextIndexer(CHROMA_DIR, "test_collection", MODEL_NAME)
print(hasattr(indexer, "as_retriever"))  # should be True


# Build a collection name tied to this video ID to avoid mixing across videos
collection_name = f"jasp_text_{meta['video_id']}"

# Initialize your indexer
indexer = TextIndexer(
    chroma_dir=CHROMA_DIR,   # parameter name matches your class
    collection_name=collection_name,
    model_name=MODEL_NAME
)

# Insert docs (upsert)
indexer.upsert_documents(docs, source_prefix=collection_name)

# Create a LangChain retriever wrapper (needed for hybrid retrieval)
dense_retriever = indexer.as_retriever(search_kwargs={"k": DENSE_K})

print("✅ Dense retriever ready:", collection_name)




2025-09-29 17:38:38,413 [INFO] Use pytorch device_name: mps
2025-09-29 17:38:38,413 [INFO] Load pretrained SentenceTransformer: BAAI/bge-small-en
2025-09-29 17:38:40,549 [INFO] Use pytorch device_name: mps
2025-09-29 17:38:40,549 [INFO] Load pretrained SentenceTransformer: BAAI/bge-small-en


True


2025-09-29 17:38:43,123 [INFO] ✅ Inserted 22 docs into Chroma collection 'jasp_text_j9w7hEfeIbE'.


✅ Dense retriever ready: jasp_text_j9w7hEfeIbE
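
An upsert only avoids duplicates when each chunk gets the same ID on every run. How `TextIndexer.upsert_documents` builds its IDs from `source_prefix` is not shown in this notebook, so the following is only a sketch of one common approach, not the project's actual implementation: hash the chunk text and prepend the prefix. The helper name `make_doc_id` is hypothetical.

```python
import hashlib

def make_doc_id(prefix: str, text: str) -> str:
    """Derive a deterministic ID from a prefix and the chunk text (illustrative only)."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}:{digest}"

# Same text -> same ID across runs; different text -> different ID
id_a = make_doc_id("jasp_text_j9w7hEfeIbE", "chunk text")
id_b = make_doc_id("jasp_text_j9w7hEfeIbE", "chunk text")
id_c = make_doc_id("jasp_text_j9w7hEfeIbE", "different chunk")

print(id_a == id_b, id_a == id_c)
```

With content-derived IDs, re-running the indexing cell overwrites existing entries instead of appending copies, at the cost of treating two identical chunks as one.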


In [60]:
#8 - Build BM25 (sparse) retriever

#Purpose: exact keyword matching, great for rare terms like "ANOVA".

bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = SPARSE_K
print("✅ BM25 retriever ready (k =", SPARSE_K, ")")



✅ BM25 retriever ready (k = 5 )
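
One caveat on "exact keyword matching": in the `langchain_community` versions we are aware of, `BM25Retriever` tokenizes with a plain whitespace split by default, so `ANOVA,` and `anova` count as different terms. It is worth verifying this in your installed version. The sketch below shows a simple normalizing tokenizer; `simple_tokenize` is a hypothetical helper, and the sample string is made up from the video title.

```python
import re

def simple_tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphanumeric runs, preserving inner hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

tokens = simple_tokenize("One-Way Chi-Square, in JASP (15-10)?")
print(tokens)
# → ['one-way', 'chi-square', 'in', 'jasp', '15-10']
```

If your version of `BM25Retriever.from_documents` accepts a `preprocess_func` argument, you can pass the tokenizer as `BM25Retriever.from_documents(docs, preprocess_func=simple_tokenize)`; the same function is then applied to queries as well.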


In [61]:
#9 - Hybrid retriever (dense + sparse)

#Purpose: combine strengths of semantics + keywords.

hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=HYBRID_WEIGHTS
)
print("✅ Hybrid retriever ready with weights", HYBRID_WEIGHTS)


✅ Hybrid retriever ready with weights [0.7, 0.3]
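
What the weights mean: `EnsembleRetriever` does not average raw similarity scores (dense and BM25 scores live on different scales). To our understanding it fuses the two ranked lists with weighted Reciprocal Rank Fusion, where each document earns `weight / (rank + c)` from every list it appears in. The sketch below is a plain-Python illustration of that idea, not LangChain's code; the constant `c=60` is assumed to match the library default, and the document IDs are invented.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of doc IDs with weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]    # hypothetical dense ranking, best first
sparse = ["d1", "d9", "d3"]   # hypothetical BM25 ranking, best first
fused = weighted_rrf([dense, sparse], weights=[0.7, 0.3])
print(fused)
```

Because only ranks matter, a document that appears in both lists (here `d3` and `d1`) tends to beat one that tops a single list, and raising the dense weight shifts ties toward the dense ordering.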
