# Quickstart Notebook for Hybrid Retrieval System

This notebook walks through the **complete hybrid retrieval pipeline**, from storing curated interactions to comparing sparse, dense, and hybrid results.

---

**Features Covered:**
- Knowledge Base Setup
- Data Curation & Storage
- Sparse (BM25), Dense (FAISS), and Hybrid (RRF) Retrieval
- Method Comparison
- Statistics & Cleanup

---

In [None]:
# Cell 1: Setup and Imports
from uuid import uuid4
from kb.kb_enhanced import EnhancedKnowledgeBase
from retrieval.sparse_retrieval import BM25Retriever, MongoFTSRetriever
from retrieval.dense_retrieval import FAISSRetriever
from retrieval.hybrid_fusion import HybridRetriever
from curator.orchestrator import CuratorLLM, RetrievalOrchestrator

print("All imports successful!")

ModuleNotFoundError: No module named 'retrieval'
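
A `ModuleNotFoundError` like the one above usually means the directory containing the `kb`, `retrieval`, and `curator` packages is not on Python's import path, for example when the notebook is launched from a subdirectory. The following is a minimal sketch of one workaround under that assumption. The helper names `find_project_root` and `add_project_root_to_path` are illustrative and not part of the project.

```python
import sys
from pathlib import Path


def find_project_root(start, markers=("kb", "retrieval", "curator")):
    """Walk upward from `start` until a directory containing all marker dirs is found.

    Returns the matching Path, or None if no ancestor qualifies.
    The marker names are an assumption based on the imports in Cell 1.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if all((candidate / name).is_dir() for name in markers):
            return candidate
    return None


def add_project_root_to_path(start="."):
    """Prepend the detected project root to sys.path (no-op if missing or already present)."""
    root = find_project_root(start)
    if root is not None and str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root
```

Calling `add_project_root_to_path()` in a cell before Cell 1 should make the imports resolvable if the layout matches. Installing the project as a package is the more durable option.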

In [3]:
# Cell 2: Initialize System
print("Initializing Knowledge Base...")
kb = EnhancedKnowledgeBase(db_name="hyper_kb_quickstart")

print("Initializing Curator LLM...")
curator_llm = CuratorLLM(kb)

print("System initialized successfully!")

Initializing Knowledge Base...

NameError: name 'EnhancedKnowledgeBase' is not defined

In [None]:
# Cell 3: Add Sample Data (Single Interaction)
session_id = str(uuid4())
print(f"Session ID: {session_id}\n")

# Example Q&A pair
query = "What is the difference between supervised and unsupervised learning?"
response = (
    "Supervised learning uses labeled data where each example has an input and corresponding "
    "correct output. The model learns to map inputs to outputs. Unsupervised learning works "
    "with unlabeled data, finding patterns and structure without predefined labels. "
    "Common supervised tasks include classification and regression, while unsupervised tasks "
    "include clustering and dimensionality reduction."
)

interaction_id = curator_llm.curate_and_store(
    query_text=query,
    response_text=response,
    session_id=session_id
)
print(f"Stored interaction: {interaction_id}")

In [None]:
# Cell 4: Add More Interactions (Batch Processing)
interactions = [
    {
        "query": "How do neural networks learn?",
        "response": (
            "Neural networks learn through backpropagation and gradient descent. During training, "
            "the network makes predictions, compares them to true labels, calculates error, and "
            "propagates this error backward through the network to update weights."
        ),
        "dialogue_act": "question"
    },
    {
        "query": "What is overfitting?",
        "response": (
            "Overfitting occurs when a model learns the training data too well, including noise "
            "and outliers, resulting in poor performance on new data. It happens when the model "
            "is too complex relative to the amount of training data."
        ),
        "dialogue_act": "question",
        "topic_shift_score": 0.3
    }
]

interaction_ids = curator_llm.curate_batch(interactions, session_id)
print(f"Stored {len(interaction_ids)} additional interactions")

In [None]:
# Cell 5: Initialize Retrievers
print("Initializing retrievers...\n")

# Sparse Retriever (BM25)
sparse_retriever = BM25Retriever(kb, k1=1.5, b=0.75)
sparse_retriever.index_documents()
print("  - BM25 sparse retriever indexed")

# Dense Retriever (FAISS)
dense_retriever = FAISSRetriever(kb, dimension=384, index_type='flat')
dense_retriever.index_embeddings()
print("  - FAISS dense retriever indexed")

# Hybrid Retriever (Reciprocal Rank Fusion)
hybrid_retriever = HybridRetriever(
    sparse_retriever=sparse_retriever,
    dense_retriever=dense_retriever,
    fusion_method='rrf',
    fusion_params={'k': 60}
)
print("  - Hybrid retriever (RRF fusion) ready\n")

print("All retrievers initialized!")
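
To make the `k1=1.5` and `b=0.75` arguments above concrete, here is a minimal sketch of Okapi BM25 scoring over pre-tokenized documents. It is a generic textbook formulation with a non-negative IDF variant, not the project's `BM25Retriever`, whose tokenization and IDF details may differ. The function name `bm25_scores` is illustrative.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Return one BM25 score per tokenized document in `corpus_tokens`.

    k1 controls term-frequency saturation; b controls length normalization.
    """
    n_docs = len(corpus_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        # Length normalization; falls back to 1.0 when every document is empty.
        norm = (1 - b + b * len(doc) / avg_len) if avg_len else 1.0
        score = 0.0
        # Each distinct query term contributes once (an assumption of this sketch).
        for term in set(query_tokens):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * norm)
        scores.append(score)
    return scores
```

A larger `k1` lets repeated terms keep adding to the score for longer, while `b=0` disables length normalization entirely.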

In [None]:
# Cell 6: Test Hybrid Retrieval
retrieval_orch = RetrievalOrchestrator(kb, hybrid_retriever)

test_query = "How does a machine learning model prevent overfitting?"
print(f"\nQuery: {test_query}\n")

# Get query embedding
query_embedding = curator_llm.get_embedding_for_query(test_query)

# Perform hybrid retrieval
results = retrieval_orch.retrieve(
    query=test_query,
    query_embedding=query_embedding,
    session_id=session_id,
    top_k=3,
    method='hybrid'
)

print("Top 3 Hybrid Retrieval Results:")
print("-" * 60)
for result in results:
    print(f"\nRank {result.rank} | Score: {result.score:.4f}")
    print(f"Query: {result.query_text}")
    print(f"Response: {result.response_text[:200].strip()}...\n")
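
The hybrid results come from Reciprocal Rank Fusion with `k=60`, as configured in Cell 5. Below is a minimal sketch of the standard RRF formula, in which each document scores the sum of `1 / (k + rank)` over the lists it appears in. It operates on plain lists of document IDs and is not the project's `HybridRetriever`. The function name and the tie-breaking rule are assumptions of this sketch.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=None):
    """Fuse several ranked lists of document IDs into a list of (doc_id, score).

    Ranks start at 1, so the top item of each list contributes 1 / (k + 1).
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); ties are broken by doc_id for determinism.
    ordered = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return ordered if top_k is None else ordered[:top_k]


# Hypothetical rankings from a sparse and a dense retriever.
sparse_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d4", "d3"]
fused_ranking = reciprocal_rank_fusion([sparse_ranking, dense_ranking], k=60)
```

Because RRF uses only ranks, it needs no calibration between BM25 scores and embedding similarities, which live on different scales.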

In [None]:
# Cell 7: Compare Retrieval Methods
print("Comparing retrieval methods...\n")
methods = ['sparse', 'dense', 'hybrid']

for method in methods:
    print(f"--- {method.upper()} RETRIEVAL (Top 2) ---")
    results = retrieval_orch.retrieve(
        query=test_query,
        query_embedding=query_embedding,
        session_id=session_id,
        top_k=2,
        method=method
    )
    for result in results:
        preview = result.query_text[:60].replace('\n', ' ')
        print(f"  Rank {result.rank}: {result.score:.4f} - {preview}...")
    print("")
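
Printed scores from different methods are not directly comparable, so it can help to quantify how much the result sets agree. The sketch below computes the Jaccard overlap between the top-k lists of each pair of methods. It assumes each result can be reduced to a hashable identifier, for instance `result.query_text`, the only identifying field shown in this notebook. The function names are illustrative.

```python
from itertools import combinations


def top_k_overlap(ids_a, ids_b):
    """Jaccard overlap between two result lists, treated as sets of identifiers."""
    set_a, set_b = set(ids_a), set(ids_b)
    if not set_a and not set_b:
        return 1.0  # two empty result lists are trivially identical
    return len(set_a & set_b) / len(set_a | set_b)


def pairwise_overlap(results_by_method):
    """Map each (method_a, method_b) pair, in sorted order, to its Jaccard overlap."""
    return {
        (a, b): top_k_overlap(results_by_method[a], results_by_method[b])
        for a, b in combinations(sorted(results_by_method), 2)
    }
```

An overlap near 1.0 suggests the retrievers agree on this query, while a low overlap indicates that fusion is combining genuinely different candidates.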

In [None]:
# Cell 8: View Knowledge Base Statistics
stats = kb.get_stats()

print("\nKnowledge Base Statistics:")
print("=" * 50)
for key, value in stats.items():
    if key != 'most_accessed':
        formatted_key = key.replace('_', ' ').title()
        print(f"{formatted_key:<25}: {value}")
print("=" * 50)

In [None]:
# Cell 9: Cleanup and Shutdown
kb.close()
print("\nSession complete! Knowledge base connection closed.")
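
If an earlier cell raises, `kb.close()` in Cell 9 is never reached. One way to guard against that in script form is the standard library's `contextlib.closing`, which calls `close()` on exit even when an exception occurs. The sketch below uses a hypothetical stand-in class `DemoKB`, and the `total_interactions` key is made up for illustration. Only the existence of a `close()` method is assumed about the real knowledge base.

```python
from contextlib import closing


class DemoKB:
    """Hypothetical stand-in for the knowledge base; only close() matters here."""

    def __init__(self):
        self.closed = False

    def get_stats(self):
        # Illustrative payload; the real get_stats() keys may differ.
        return {"total_interactions": 3}

    def close(self):
        self.closed = True


def run_session(kb):
    """Use `kb` inside a `closing` block so close() runs even if an error is raised."""
    with closing(kb) as handle:
        return handle.get_stats()
```

In a notebook, where state spans many cells, a `try`/`finally` within a single cell or an explicit final cleanup cell like the one above remains the simpler option.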

---
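**You're all set!**

You now have a hybrid retrieval pipeline with:
- Curated knowledge storage (`CuratorLLM` writing to `EnhancedKnowledgeBase`)
- Two retrieval engines: sparse (BM25) and dense (FAISS)
- Reciprocal Rank Fusion (RRF) of the two rankings
- A side-by-side comparison of the sparse, dense, and hybrid methods

Try modifying the queries or adding more data to explore further.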

---