# Vector Store Notebook

This notebook builds the FAISS vector store from the knowledge base and tests retrieval with sample prompts using the **Retriever** class with hybrid scoring.

## Purpose
1.  **Load Knowledge Base**: Use `KnowledgeBase` class to load entries from `data/knowledge_base/`.
2.  **Build Vector Index**: Index entries using `KnowledgeBase.index_entries()`.
3.  **Use Retriever**: Import and use the `Retriever` class from `src.rag` with hybrid scoring.
4.  **Test Retrieval**: Run sample prompts and display top-5 retrieved entries.
5.  **Save Index**: Persist the vector store for later use.

## Hybrid Scoring (from src.rag.retriever)
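**Hybrid Score = Semantic Score + Topic Boost**
- Semantic Score: embedding similarity between the query and the entry (0-1)
- Topic Boost: +0.15 for each query keyword that matches one of the entry's topics

Because the boost is added on top of the similarity, a hybrid score can exceed 1 (for example, 1.0448 in Test 1).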
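The formula above can be sketched as a small function. This is a minimal illustration, not the code in `src.rag.retriever`: the name `hybrid_score` and its parameters are hypothetical, and the example numbers are taken from the first result of Test 1 later in this notebook.

```python
DEFAULT_TOPIC_BOOST = 0.15  # the default value printed by this notebook


def hybrid_score(semantic_score, matching_topics, topic_boost=DEFAULT_TOPIC_BOOST):
    """Semantic similarity plus a fixed boost per matching topic keyword."""
    return semantic_score + topic_boost * matching_topics


# Semantic score 0.7448 with two matching keywords (a +0.30 boost)
score = hybrid_score(0.7448, matching_topics=2)
print(f"{score:.4f}")  # 1.0448
```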


In [1]:
# Setup and Imports
import sys
import json
from pathlib import Path
import numpy as np

# Add project root to path
project_root = Path("..").resolve()
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

from config import get_config, get_config_loader

# Import RAG components from src
from src.rag import (
    VectorStore, 
    EmbeddingModel, 
    KnowledgeBase,
    Retriever,
    extract_query_keywords,
    DEFAULT_TOPIC_BOOST,
)

print("✅ Imports successful!")
print(f"Default Topic Boost: {DEFAULT_TOPIC_BOOST}")

✅ Imports successful!
Default Topic Boost: 0.15


## 1. Initialize Knowledge Base and Load Entries

Using the `KnowledgeBase` class to properly load entries so they are available for search.

In [2]:
# Define paths
KNOWLEDGE_BASE_DIR = project_root / "data" / "knowledge_base"
VECTOR_INDEX_PATH = project_root / "data" / "models" / "vector_index"

# Initialize Knowledge Base with the path
knowledge_base = KnowledgeBase(
    knowledge_base_path=KNOWLEDGE_BASE_DIR,
)

# Load entries from directory (this populates entry_by_id)
num_loaded = knowledge_base.load_from_directory()

print(f"\n✅ Loaded {num_loaded} entries from knowledge base.")
print(f"Entries indexed by ID: {len(knowledge_base.entry_by_id)}")

2025-12-05 23:09:56.753 | INFO     | config_loader:load:93 - ✅ Loaded configuration from D:\University\Uni\Semester 7\Generative AI\Project\Cirq-RAG-Code-Assistant\config\config.dev.json
2025-12-05 23:09:56.754 | DEBUG    | config_loader:create_directories:236 - Created all necessary directories
2025-12-05 23:09:56.755 | INFO     | src.rag.embeddings:__init__:97 - Loading embedding model: BAAI/bge-base-en-v1.5
2025-12-05 23:09:56.756 | INFO     | src.rag.embeddings:__init__:98 - Using device: cpu
2025-12-05 23:10:00.950 | INFO     | src.rag.embeddings:__init__:106 - ✅ Embedding model loaded successfully
2025-12-05 23:10:00.951 | INFO     | src.rag.embeddings:__init__:113 - 


✅ Loaded 102 entries from knowledge base.
Entries indexed by ID: 102


## 2. View Sample Entries

In [3]:
# Display sample entries
print("Sample entries:")
for i, entry in enumerate(knowledge_base.entries[:3]):
    print(f"\n--- Entry {i+1} ---")
    print(f"ID: {entry.get('id', 'N/A')}")
    print(f"Difficulty: {entry.get('difficulty', 'N/A')}")
    print(f"Topics: {entry.get('topics', [])}")
    print(f"Task: {entry.get('task', 'N/A')[:100]}...")

Sample entries:

--- Entry 1 ---
ID: design_basic_adder_template
Difficulty: advanced
Topics: ['arithmetic', 'adder', 'toffoli']
Task: Design a reusable Cirq gate or function implementing a ripple-carry adder for n-bit integers using T...

--- Entry 2 ---
ID: design_basic_adder_template_v2
Difficulty: advanced
Topics: ['arithmetic', 'adder']
Task: Show how to use the RippleCarryAdder gate to add two specific 3-bit classical numbers by initializin...

--- Entry 3 ---
ID: design_bb84_round
Difficulty: intermediate
Topics: ['bb84', 'cryptography', 'basis_encoding']
Task: Write a Cirq circuit that implements one round of the BB84 protocol on a single qubit: Alice chooses...


## 3. Index Entries in Vector Store

Using `KnowledgeBase.index_entries()` to generate embeddings and build the vector store.
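To make the indexing step concrete, here is a brute-force sketch of what a similarity search over stored embeddings does conceptually. It assumes the embeddings are compared by cosine similarity, which this notebook does not confirm; the FAISS index type and metric used by `VectorStore` are not shown here. The names `normalize`, `search`, and the toy entry IDs are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def search(index, query, top_k=5):
    """Return (entry_id, similarity) pairs for the top_k most similar vectors."""
    q = normalize(query)
    scored = [
        (entry_id, sum(a * b for a, b in zip(q, vec)))
        for entry_id, vec in index.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy 3-dimensional vectors standing in for real embeddings
index = {
    "entry_a": normalize([1.0, 0.0, 0.0]),
    "entry_b": normalize([0.7, 0.7, 0.0]),
    "entry_c": normalize([0.0, 0.0, 1.0]),
}
results = search(index, [1.0, 0.1, 0.0], top_k=2)
print([entry_id for entry_id, _ in results])  # ['entry_a', 'entry_b']
```

A library such as FAISS performs the same kind of nearest-neighbour lookup with optimized index structures, which matters once the collection grows well beyond the 102 entries used here.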

In [4]:
# Index all entries (generates embeddings and adds to vector store)
print("Indexing entries... (this may take a moment)")
knowledge_base.index_entries(batch_size=16)

print(f"\n✅ Vector store size: {knowledge_base.vector_store.size()}")

2025-12-05 23:10:00.974 | INFO     | src.rag.knowledge_base:index_entries:181 - Indexing 102 entries in vector store
2025-12-05 23:10:00.975 | DEBUG    | src.rag.embeddings:encode:166 - Generating embeddings for 102 texts


Indexing entries... (this may take a moment)


Batches:   0%|          | 0/7 [00:00<?, ?it/s]

2025-12-05 23:10:17.766 | DEBUG    | src.rag.vector_store:add:204 - Added 102 embeddings to vector store
2025-12-05 23:10:17.767 | INFO     | src.rag.knowledge_base:index_entries:248 - ✅ Indexed 102 entries



✅ Vector store size: 102


## 4. Initialize Retriever with Hybrid Scoring

Using the `Retriever` class from `src.rag` with **hybrid scoring enabled**.

In [5]:
# Create Retriever with hybrid scoring
retriever = Retriever(
    knowledge_base=knowledge_base,
    top_k=5,
    similarity_threshold=0.3,   # Lower threshold to see more results
    topic_boost=0.15,           # Boost per matching topic
    use_hybrid_scoring=True,    # Enable topic boosting
)

print("✅ Retriever initialized with HYBRID SCORING enabled!")
print(f"   Topic boost per match: {retriever.topic_boost}")
print(f"   Similarity threshold: {retriever.similarity_threshold}")

2025-12-05 23:10:17.779 | INFO     | src.rag.retriever:__init__:170 - Initialized Retriever with top_k=5, threshold=0.3, topic_boost=0.15, hybrid_scoring=True


✅ Retriever initialized with HYBRID SCORING enabled!
   Topic boost per match: 0.15
   Similarity threshold: 0.3


## 5. Test Retrieval with Sample Prompts

Using the `Retriever.retrieve_with_metadata()` method which applies hybrid scoring automatically.
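The keyword extraction and topic matching inside `src.rag` are not shown in this notebook. The sketch below is a guess that happens to reproduce the keyword sets and boost values printed in the outputs here; the function names and the matching rule (a keyword counts if it equals a topic or one of its underscore-separated parts) are assumptions, not the library's actual logic.

```python
import re


def extract_keywords(query):
    """Lowercase the query and keep word-like tokens longer than one character."""
    return {tok for tok in re.findall(r"[a-z0-9_]+", query.lower()) if len(tok) > 1}


def count_topic_matches(keywords, topics):
    """Count keywords equal to a topic or to one of its underscore-separated parts."""
    topic_terms = set()
    for topic in topics:
        topic_terms.add(topic.lower())
        topic_terms.update(topic.lower().split("_"))
    return len(keywords & topic_terms)


keywords = extract_keywords("Create a bell state circuit")
print(sorted(keywords))  # ['bell', 'circuit', 'create', 'state']

# 'bell' and 'state' both match the topic 'bell_state': two matches
print(count_topic_matches(keywords, ["bell_state", "entanglement", "hadamard", "cnot"]))  # 2

# only 'state' matches 'ghz_state': one match
print(count_topic_matches(keywords, ["ghz_state", "entanglement", "superposition"]))  # 1
```

Under this assumed rule, two matches at 0.15 each give the +0.30 boost reported for the Bell-state entries, and one match gives the +0.15 reported for the GHZ entry.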

In [6]:
# Define test prompts
test_prompts = [
    "Create a Bell state circuit with two qubits",
    "Implement Grover's search algorithm",
    "Build a QAOA circuit for MaxCut optimization",
    "Quantum phase estimation example",
    "How to implement quantum teleportation",
    "bell_state entanglement",  # Topic-focused query
    "qpe phase_estimation",     # Topic-focused query
]

print(f"Testing with {len(test_prompts)} prompts...")
print("📌 Using HYBRID SCORING from src.rag.Retriever!\n")

Testing with 7 prompts...
📌 Using HYBRID SCORING from src.rag.Retriever!



In [7]:
# Run tests for each prompt
for i, prompt in enumerate(test_prompts, 1):
    print("=" * 80)
    print(f"Test {i}: \"{prompt}\"")
    print("=" * 80)
    
    # Show keywords extracted
    keywords = extract_query_keywords(prompt)
    print(f"Query Keywords: {keywords}")
    
    # Use Retriever from src.rag
    results = retriever.retrieve_with_metadata(prompt, top_k=5)
    
    if results:
        print(f"\nTop 5 Retrieved Entries (Hybrid Scoring):")
        for rank, res in enumerate(results, 1):
            entry = res.get('entry', {})
            entry_id = entry.get('id', res.get('id', 'N/A'))
            hybrid_score = res.get('score', 0)
            semantic_score = res.get('semantic_score', hybrid_score)
            topic_boost = res.get('topic_boost', 0)
            difficulty = entry.get('difficulty', 'N/A')
            topics = entry.get('topics', [])
            task_preview = entry.get('task', entry.get('description', 'N/A'))[:60]
            
            boost_str = f" +{topic_boost:.2f}" if topic_boost > 0 else ""
            print(f"\n  #{rank} [Score: {hybrid_score:.4f} = {semantic_score:.4f}{boost_str}]")
            print(f"      ID: {entry_id}")
            print(f"      Difficulty: {difficulty}")
            print(f"      Topics: {', '.join(topics[:5]) if topics else 'N/A'}")
            print(f"      Task: {task_preview}...")
    else:
        print("  No results found.")
    
    print()

2025-12-05 23:10:17.804 | DEBUG    | src.rag.embeddings:encode:166 - Generating embeddings for 1 texts
2025-12-05 23:10:17.837 | DEBUG    | src.rag.retriever:retrieve:277 - Retrieved 5 results for query: Create a Bell state circuit with two qubits... (hybrid=True)
2025-12-05 23:10:17.839 | DEBUG    | src.rag.embeddings:encode:166 - Generating embeddings for 1 texts
2025-12-05 23:10:17.868 | DEBUG    | src.rag.retriever:retrieve:277 - Retrieved 5 results for query: Implement Grover's search algorithm... (hybrid=True)
2025-12-05 23:10:17.869 | DEBUG    | src.rag.embeddings:encode:166 - Generating embeddings for 1 texts
2025-12-05 23:10:17.896 | DEBUG    | s

Test 1: "Create a Bell state circuit with two qubits"
Query Keywords: {'two', 'state', 'bell', 'create', 'qubits', 'with', 'circuit'}

Top 5 Retrieved Entries (Hybrid Scoring):

  #1 [Score: 1.0448 = 0.7448 +0.30]
      ID: design_bell_state
      Difficulty: beginner
      Topics: bell_state, entanglement, hadamard, cnot
      Task: Design a Cirq circuit that prepares the Bell state (|00> + |...

  #2 [Score: 0.9961 = 0.6961 +0.30]
      ID: design_bell_state_v2
      Difficulty: beginner
      Topics: bell_state, entanglement, hadamard, cnot
      Task: Create a Cirq function that returns a Bell-state preparation...

  #3 [Score: 0.9841 = 0.6841 +0.30]
      ID: design_bell_state_v3
      Difficulty: beginner
      Topics: bell_state, entanglement
      Task: Design a Bell-state circuit that uses an explicit Moment str...

  #4 [Score: 0.8443 = 0.6943 +0.15]
      ID: designer_teleportation_circuit
      Difficulty: intermediate
      Topics: teleportation, entanglement, bell_measure

2025-12-05 23:10:17.971 | DEBUG    | src.rag.embeddings:encode:166 - Generating embeddings for 1 texts



Top 5 Retrieved Entries (Hybrid Scoring):

  #1 [Score: 0.9499 = 0.6499 +0.30]
      ID: design_bell_state
      Difficulty: beginner
      Topics: bell_state, entanglement, hadamard, cnot
      Task: Design a Cirq circuit that prepares the Bell state (|00> + |...

  #2 [Score: 0.9276 = 0.6276 +0.30]
      ID: design_bell_state_v3
      Difficulty: beginner
      Topics: bell_state, entanglement
      Task: Design a Bell-state circuit that uses an explicit Moment str...

  #3 [Score: 0.9036 = 0.6036 +0.30]
      ID: design_bell_state_v2
      Difficulty: beginner
      Topics: bell_state, entanglement, hadamard, cnot
      Task: Create a Cirq function that returns a Bell-state preparation...

  #4 [Score: 0.7812 = 0.6312 +0.15]
      ID: designer_teleportation_circuit
      Difficulty: intermediate
      Topics: teleportation, entanglement, bell_measurement
      Task: Implement the Quantum Teleportation circuit to transfer a st...

  #5 [Score: 0.7644 = 0.6144 +0.15]
      ID: design

2025-12-05 23:10:17.995 | DEBUG    | src.rag.retriever:retrieve:277 - Retrieved 5 results for query: qpe phase_estimation... (hybrid=True)



Top 5 Retrieved Entries (Hybrid Scoring):

  #1 [Score: 0.9565 = 0.6565 +0.30]
      ID: design_qpe_rz
      Difficulty: advanced
      Topics: qpe, phase_estimation, rz_gate
      Task: Estimate the phase of an Rz(theta) gate where theta=pi/3 usi...

  #2 [Score: 0.9512 = 0.6512 +0.30]
      ID: design_qpe_t_gate
      Difficulty: advanced
      Topics: qpe, phase_estimation, t_gate
      Task: Estimate the phase of a T gate (phase = 1/8) using 2 countin...

  #3 [Score: 0.9415 = 0.6415 +0.30]
      ID: design_qpe_general
      Difficulty: advanced
      Topics: qpe, phase_estimation, general
      Task: Create a general Quantum Phase Estimation function in Cirq t...

  #4 [Score: 0.9380 = 0.6380 +0.30]
      ID: designer_quantum_phase_estimation_t_gate
      Difficulty: advanced
      Topics: qpe, phase_estimation, t_gate
      Task: Estimate the phase of a T gate (phase pi/4) using 3 precisio...

  #5 [Score: 0.7590 = 0.6090 +0.15]
      ID: designer_shor_order_finding_circuit
    

## 6. Save Vector Store

In [8]:
# Save the index
knowledge_base.save_index(VECTOR_INDEX_PATH)

print(f"✅ Vector store saved to: {VECTOR_INDEX_PATH}")
print(f"   Total entries indexed: {knowledge_base.vector_store.size()}")

2025-12-05 23:10:18.014 | INFO     | src.rag.vector_store:save:435 - Saved FAISS index to D:\University\Uni\Semester 7\Generative AI\Project\Cirq-RAG-Code-Assistant\data\models\vector_index
2025-12-05 23:10:18.015 | INFO     | src.rag.knowledge_base:save_index:375 - Saved knowledge base index to D:\University\Uni\Semester 7\Generative AI\Project\Cirq-RAG-Code-Assistant\data\models\vector_index


✅ Vector store saved to: D:\University\Uni\Semester 7\Generative AI\Project\Cirq-RAG-Code-Assistant\data\models\vector_index
   Total entries indexed: 102


## 7. Verify Load

In [9]:
# Create new KnowledgeBase and load from disk
loaded_kb = KnowledgeBase(knowledge_base_path=KNOWLEDGE_BASE_DIR)
loaded_kb.load_from_directory()  # Load entries first
loaded_kb.load_index(VECTOR_INDEX_PATH)  # Then load vector index

# Create Retriever
loaded_retriever = Retriever(
    knowledge_base=loaded_kb,
    use_hybrid_scoring=True,
    similarity_threshold=0.3,
)

print("✅ Loaded from disk.")
print(f"   Entries: {len(loaded_kb.entries)}")
print(f"   Vector store size: {loaded_kb.vector_store.size()}")

# Test
test_query = "Create a bell state circuit"
print(f"\nVerification Query: \"{test_query}\"")
print(f"Query Keywords: {extract_query_keywords(test_query)}")
print(f"\nTop 5 Results (Hybrid Scoring):")

verify_results = loaded_retriever.retrieve(test_query, top_k=5)
for res in verify_results:
    entry = res.get('entry', {})
    topics = entry.get('topics', [])
    boost = res.get('topic_boost', 0)
    boost_str = f" [+{boost:.2f} boost]" if boost > 0 else ""
    print(f"  - {entry.get('id', 'N/A')} (Score: {res['score']:.4f}){boost_str} | Topics: {', '.join(topics[:4]) if topics else 'N/A'}")

2025-12-05 23:10:18.029 | INFO     | src.rag.embeddings:__init__:97 - Loading embedding model: BAAI/bge-base-en-v1.5
2025-12-05 23:10:18.030 | INFO     | src.rag.embeddings:__init__:98 - Using device: cpu
2025-12-05 23:10:21.825 | INFO     | src.rag.embeddings:__init__:106 - ✅ Embedding model loaded successfully
2025-12-05 23:10:21.826 | INFO     | src.rag.embeddings:__init__:113 - Embedding dimension: 768
2025-12-05 23:10:21.827 | INFO     | src.rag.vector_store:_init_faiss:139 - Initialized FAISS index
2025-12-05 23:10:21.827 | INFO     | src.rag.vector_store:__init__:120 - Initialized VectorStore with faiss backend
2025-12-05 23:10:21.828 | INFO     | s

✅ Loaded from disk.
   Entries: 102
   Vector store size: 102

Verification Query: "Create a bell state circuit"
Query Keywords: {'bell', 'state', 'create', 'circuit'}

Top 5 Results (Hybrid Scoring):
  - design_bell_state (Score: 0.9665) [+0.30 boost] | Topics: bell_state, entanglement, hadamard, cnot
  - design_bell_state_v2 (Score: 0.9596) [+0.30 boost] | Topics: bell_state, entanglement, hadamard, cnot
  - design_bell_state_v3 (Score: 0.9397) [+0.30 boost] | Topics: bell_state, entanglement
  - designer_teleportation_circuit (Score: 0.8116) [+0.15 boost] | Topics: teleportation, entanglement, bell_measurement
  - design_ghz_state_3_qubit (Score: 0.7897) [+0.15 boost] | Topics: ghz_state, entanglement, superposition


## Summary

This notebook has:
1. ✅ Loaded entries using `KnowledgeBase.load_from_directory()`
2. ✅ Indexed entries using `KnowledgeBase.index_entries()`
3. ✅ Used `Retriever` from `src.rag` with **hybrid scoring**
4. ✅ Tested retrieval with sample prompts
5. ✅ Saved and loaded the index

### Hybrid Scoring (from src.rag.retriever)
```python
Hybrid Score = Semantic Score + (TOPIC_BOOST × Matching Topic Count)

# Default: TOPIC_BOOST = 0.15
```

Each matching topic keyword adds 0.15 to an entry's score, which is enough to lift it above entries with slightly higher semantic similarity.
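The effect on ranking can be reproduced with two results from Test 1. The semantic scores below are copied from that output; the match counts are inferred from the reported boosts (+0.30 and +0.15 at 0.15 per match), and the sorting code is an illustrative sketch rather than the retriever's implementation.

```python
TOPIC_BOOST = 0.15

# (entry ID, semantic score, matching keyword count) from Test 1
candidates = [
    ("design_bell_state_v3", 0.6841, 2),
    ("designer_teleportation_circuit", 0.6943, 1),
]

# Rank by semantic similarity alone, then by semantic similarity plus boost
by_semantic = sorted(candidates, key=lambda c: c[1], reverse=True)
by_hybrid = sorted(candidates, key=lambda c: c[1] + TOPIC_BOOST * c[2], reverse=True)

print([c[0] for c in by_semantic])  # ['designer_teleportation_circuit', 'design_bell_state_v3']
print([c[0] for c in by_hybrid])    # ['design_bell_state_v3', 'designer_teleportation_circuit']
```

On semantic similarity alone the teleportation entry would rank first; with the boost applied, the Bell-state entry overtakes it (0.9841 versus 0.8443).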