<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); 
            color: white; 
            padding: 25px; 
            border-radius: 10px; 
            border-left: 6px solid #ffd700;
            margin: 25px 0;
            box-shadow: 0 4px 6px rgba(0,0,0,0.1);">
<h2 style="margin: 0 0 10px 0; color: white; font-size: 22px;">🎯 DAT409: Hybrid Search with Aurora PostgreSQL for MCP Retrieval</h2>
<p style="margin: 0 0 12px 0; font-size: 15px; opacity: 0.95;">
<strong>Workshop Lab:</strong> Production-ready hybrid search implementation combining semantic vectors, full-text search, and rank fusion
</p>
<p style="margin: 0 0 8px 0; font-size: 14px;">
⏱️ <strong>Duration:</strong> 60 minutes | <strong>Level:</strong> 400 (Expert)
</p>
<p style="margin: 0; font-size: 14px; opacity: 0.9;">
🛠️ <strong>Your Task:</strong> Complete 6 TODO sections (2 per search method) to build enterprise-grade search architecture
</p>
</div>

---

### 📋 What You'll Implement

**TODO 1: Fuzzy Search (7 min)** - Trigram-based typo tolerance with pg_trgm  
**TODO 2: Semantic Search (7 min)** - Vector similarity with pgvector and Cohere embeddings  
**TODO 3: Hybrid RRF (7 min)** - Reciprocal Rank Fusion eliminating score normalization challenges

### 🎯 Learning Objectives

1. Understand fundamental differences between keyword, fuzzy, semantic, and hybrid search approaches
2. Implement production PostgreSQL search patterns using tsvector, pg_trgm, and pgvector
3. Compare RRF vs weighted fusion for handling heterogeneous score distributions
4. Optimize vector indexes (HNSW) for sub-100ms query performance
5. Design search strategies aligned with query patterns and accuracy requirements

### 📊 Dataset: E-commerce Product Catalog

**21,704 Amazon products** with pre-generated Cohere embeddings (1024-dim)  
Categories include electronics, home goods, apparel, and more  
Rich metadata: descriptions, prices, ratings, reviews, images

---

## ⚙️ Step 0: Select Python Kernel (START HERE!)

<div style="background: #fff3cd; border-left: 5px solid #ff9800; padding: 15px; margin: 15px 0; border-radius: 5px; color: #000;">
<strong>⚠️ IMPORTANT FIRST STEP</strong><br><br>
Before running any code, select the correct Python kernel:
<ol style="margin: 10px 0;">
<li>Look at the <strong>top-right corner</strong> of Jupyter</li>
<li>Click the kernel selector (should show "Python 3.13.3")</li>
<li>If not showing Python 3.13.3, select it from the dropdown</li>
</ol>
</div>

✅ **Run the cell below to verify** you're using Python 3.13

In [1]:
import sys
version = sys.version.split()[0]
print(f"🐍 Python version: {version}")
if version.startswith('3.13'):
    print("✅ Correct! You're using Python 3.13")
else:
    print(f"⚠️  WARNING: Expected Python 3.13, but found {version}")
    print("   Please change the kernel: Click top-right → Select Kernel → Python 3.13.3")

## 📦 Step 1: Environment & Database Setup

In [None]:
# ============================================================
# ENVIRONMENT & DATABASE SETUP
# ============================================================

import sys
import os
import warnings
warnings.filterwarnings('ignore')

# Import required libraries
import boto3
import json
import psycopg
from pgvector.psycopg import register_vector
import pandas as pd
import numpy as np
from pathlib import Path
from typing import Optional
from dotenv import load_dotenv
from IPython.display import display, HTML, clear_output
import ipywidgets as widgets

# Load environment variables
env_path = Path('/workshop/.env')
if env_path.exists():
    load_dotenv(env_path, override=True)
    print("✅ Environment loaded")
else:
    print("⚠️  .env file not found - check bootstrap")

# Database configuration
dbhost = os.getenv('DB_HOST')
dbport = os.getenv('DB_PORT', '5432')
dbuser = os.getenv('DB_USER')
dbpass = os.getenv('DB_PASSWORD')
dbname = os.getenv('DB_NAME', 'workshop_db')
aws_region = os.getenv('AWS_REGION', 'us-west-2')

# Verify credentials
if not all([dbhost, dbuser, dbpass]):
    print("❌ Missing credentials - check .env file")
    sys.exit(1)

print(f"\n📍 Configuration:")
print(f"   Database: {dbuser}@{dbhost}:{dbport}/{dbname}")
print(f"   AWS Region: {aws_region}")

# Initialize Bedrock client
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name=aws_region
)

# Test database connection
try:
    with psycopg.connect(
        host=dbhost, port=dbport, user=dbuser,
        password=dbpass, dbname=dbname, autocommit=True
    ) as conn:
        register_vector(conn)
        
        # Verify PostgreSQL and extensions
        pg_version = conn.execute("SELECT version()").fetchone()[0].split(',')[0]
        pgvector_version = conn.execute(
            "SELECT extversion FROM pg_extension WHERE extname = 'vector'"
        ).fetchone()
        
        print(f"   PostgreSQL: {pg_version}")
        print(f"   pgvector: v{pgvector_version[0]}" if pgvector_version else "   ⚠️  pgvector not installed")
        
        # Check data
        result = conn.execute("""
            SELECT COUNT(*) as count, COUNT(embedding) as with_embeddings 
            FROM bedrock_integration.product_catalog
        """).fetchone()
        
        if result and result[0] > 0:
            print(f"   Products: {result[0]:,} ({result[1]:,} with embeddings)")
        else:
            print("   ⚠️  No data found - run parallel-fast-loader.py")
            
except Exception as e:
    print(f"❌ Database connection failed: {e}")
    sys.exit(1)

# Embedding generation function
def generate_embedding(text: str, input_type: str = "search_query") -> Optional[list]:
    """Generate embeddings using Cohere Embed v3 via Bedrock"""
    if not text:
        return None
    
    try:
        response = bedrock_runtime.invoke_model(
            modelId='cohere.embed-english-v3',
            body=json.dumps({
                "texts": [text],
                "input_type": input_type,
                "embedding_types": ["float"]
            })
        )
        result = json.loads(response['body'].read())
        return result['embeddings']['float'][0]
    except Exception as e:
        print(f"❌ Embedding generation failed: {e}")
        return None

# Test embedding generation
test_embedding = generate_embedding("wireless bluetooth headphones", "search_query")
if test_embedding:
    print(f"\n🤖 Bedrock Models:")
    print(f"   Cohere Embed v3 (1024-dim): ✅")
    print(f"   Cohere Rerank v3.5: Available")
else:
    print("\n⚠️  Bedrock embedding test failed")

print("\n✅ Setup complete - proceed to Step 2: Data Overview & Verification!")

## 📊 Step 2: Data Overview & Verification

In [None]:
# ============================================================
# DATA OVERVIEW & VERIFICATION
# ============================================================

import psycopg
from pgvector.psycopg import register_vector
import pandas as pd
from IPython.display import display, HTML

# Connect and verify data
with psycopg.connect(
    host=dbhost, port=dbport, user=dbuser,
    password=dbpass, dbname=dbname, autocommit=True
) as conn:
    register_vector(conn)
    
    # Check if table exists
    exists = conn.execute("""
        SELECT EXISTS (
            SELECT 1 FROM information_schema.tables 
            WHERE table_schema = 'bedrock_integration' 
            AND table_name = 'product_catalog'
        );
    """).fetchone()[0]
    
    if not exists:
        print("❌ Data not found. Please run: python parallel-fast-loader.py")
    else:
        # Get statistics
        stats = conn.execute("""
            SELECT 
                COUNT(*) as total,
                COUNT(embedding) as with_embeddings,
                COUNT(DISTINCT category_name) as categories,
                AVG(price)::NUMERIC(10,2) as avg_price
            FROM bedrock_integration.product_catalog;
        """).fetchone()
        
        print("📊 DATA OVERVIEW")
        print("-" * 40)
        print(f"Total Products: {stats[0]:,}")
        # Guard against an empty table to avoid dividing by zero
        print(f"With Embeddings: {stats[1]:,} ({(stats[1] / stats[0] * 100 if stats[0] else 0):.0f}%)")
        print(f"Categories: {stats[2]}")
        print(f"Avg Price: ${stats[3]}")
        
        # Show top categories
        print("\n📦 TOP CATEGORIES")
        print("-" * 40)
        categories = conn.execute("""
            SELECT category_name, COUNT(*) as count
            FROM bedrock_integration.product_catalog
            GROUP BY category_name
            ORDER BY count DESC
            LIMIT 5;
        """).fetchall()
        
        for cat, count in categories:
            print(f"  • {cat}: {count:,}")
        
        # Show indexes
        print("\n🔍 INDEXES")
        print("-" * 40)
        indexes = conn.execute("""
            SELECT indexname FROM pg_indexes
            WHERE schemaname = 'bedrock_integration'
            AND tablename = 'product_catalog';
        """).fetchall()
        
        for idx in indexes:
            name = idx[0]
            if 'embedding' in name:
                print(f"  • {name} (Vector search)")
            elif 'fts' in name:
                print(f"  • {name} (Full-text search)")
            elif 'trgm' in name:
                print(f"  • {name} (Fuzzy search)")
            elif 'pkey' in name:
                print(f"  • {name} (Primary key)")
            elif 'category' in name:
                print(f"  • {name} (Category filter)")
            elif 'price' in name:
                print(f"  • {name} (Price range)")
            else:
                print(f"  • {name}")
        
        print("\n✅ Database ready for hybrid search!")

## 🔍 Step 3: Implement Search Methods

You'll implement three search methods with **6 TODO sections** (2 per method).

In [13]:
# ============================================================
# 1. KEYWORD SEARCH - FULL-TEXT (PROVIDED)
# ============================================================

def keyword_search(query: str, limit: int = 10) -> list[dict]:
    """
    Full-text search using PostgreSQL tsvector and ts_rank.
    Fast exact and near-exact matches, but no typo tolerance.
    """
    with psycopg.connect(
        host=dbhost, port=dbport, user=dbuser,
        password=dbpass, dbname=dbname, autocommit=True
    ) as conn:
        results = conn.execute("""
            SELECT 
                "productId",
                product_description,
                category_name,
                price,
                stars,
                reviews,
                imgurl,
                ts_rank(to_tsvector('english', product_description), query) AS score
            FROM bedrock_integration.product_catalog, to_tsquery('english', %(query)s) query
            WHERE to_tsvector('english', product_description) @@ query
            ORDER BY score DESC
            LIMIT %(limit)s;
        """, {'query': ' & '.join(query.split()), 'limit': limit}).fetchall()
        
        return [{
            'productId': r[0],
            'description': r[1][:200] + '...',
            'category': r[2],
            'price': float(r[3]) if r[3] else 0,
            'stars': float(r[4]) if r[4] else 0,
            'reviews': int(r[5]) if r[5] else 0,
            'imgUrl': r[6],
            'score': float(r[7]) if r[7] else 0,
            'method': 'Keyword'
        } for r in results]
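
`keyword_search()` builds its `to_tsquery` input by joining the raw query words with ` & `. The sketch below isolates that step and adds one assumption: a hypothetical helper, `build_and_tsquery` (not part of the workshop code), that also drops characters `to_tsquery` parses as operators, such as `&`, `|`, `!`, `(`, `)` and `:`. Left in place, those characters can trigger a syntax error on free-form user input.

```python
import re


def build_and_tsquery(user_query: str) -> str:
    """Hypothetical helper: turn free text into an AND-ed to_tsquery string."""
    # Keep only runs of ASCII letters and digits; punctuation is discarded.
    terms = re.findall(r"[A-Za-z0-9]+", user_query)
    return " & ".join(terms)


print(build_and_tsquery("wireless bluetooth headphones"))  # wireless & bluetooth & headphones
print(build_and_tsquery("USB-C cable (2m)"))               # USB & C & cable & 2m
```

The helper returns an empty string when no terms survive, so a caller would need to skip the search in that case. PostgreSQL's built-in `plainto_tsquery` is a server-side alternative that also ANDs the terms of unformatted text.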

In [None]:
# ============================================================
# 2. FUZZY SEARCH - TYPO TOLERANCE (🔨 TODO 1: YOUR TURN!)
# ============================================================

def fuzzy_search(query: str, limit: int = 10) -> list[dict]:
    """
    🎯 TODO 1 of 3: Implement Fuzzy Search with pg_trgm
    
    ⏱️ Estimated time: 7 minutes
    
    WHAT YOU'RE BUILDING:
    - Handles typos and misspellings ("wireles headphon" → "wireless headphones")
    - Uses trigram matching to find similar text
    - Great for user-generated queries with spelling errors
    
    HOW TRIGRAMS WORK:
    "bluetooth" → trigrams: "  b", " bl", "blu", "lue", "uet", "eto", "too", "oot", "oth", "th "
    "blutooth" → trigrams: "  b", " bl", "blu", "lut", "uto", "too", "oot", "oth", "th "
    Similarity = (shared trigrams) / (total unique trigrams) = 7/12 ≈ 0.58
    
    THE similarity() FUNCTION:
    - Returns a score between 0 and 1 (1 = perfect match)
    - Example: similarity('bluetooth', 'blutooth') ≈ 0.58
    - Example: similarity('bluetooth', 'blue') ≈ 0.36
    
    THE %% OPERATOR:
    - Filters results by similarity threshold (default: 0.3)
    - Only returns matches with similarity > threshold
    - More efficient than calculating similarity for every row
    
    YOUR TASK: Complete the 2 marked sections below
    """
    
    with psycopg.connect(
        host=dbhost, port=dbport, user=dbuser,
        password=dbpass, dbname=dbname, autocommit=True
    ) as conn:
        results = conn.execute("""
            SELECT 
                "productId",
                product_description,
                category_name,
                price,
                stars,
                reviews,
                imgurl,
                -- ✏️ TODO 1.1: Calculate similarity score
                -- 
                -- WHAT THIS DOES:
                -- Calculates how similar product_description is to the search query
                -- Returns a decimal between 0.0 (no match) and 1.0 (perfect match)
                -- 
                -- THE SIMILARITY FUNCTION:
                -- SYNTAX: similarity(text_column, search_text)
                -- 
                -- WHY WE USE lower():
                -- Makes search case-insensitive ("Bluetooth" matches "bluetooth")
                -- 
                -- EXAMPLE:
                -- similarity(lower(product_description), lower(%(query)s)) AS sim
                --
                -- PARAMETERS:
                -- %(query)s is the search query passed from Python
                -- 
                ___
                
            FROM bedrock_integration.product_catalog
            
            -- ✏️ TODO 1.2: Filter by similarity threshold using %% operator
            -- 
            -- WHAT THIS DOES:
            -- Filters to only return products with similarity above threshold
            -- The %% operator is pg_trgm's "similar to" operator
            -- 
            -- THE %% OPERATOR:
            -- SYNTAX: text_column %% search_text
            -- Default threshold: 0.3 (configurable with pg_trgm.similarity_threshold)
            -- 
            -- WHY USE %%:
            -- - Can pre-filter rows with a trigram GIN index built on the same expression (fast!)
            -- - Avoids calculating similarity() for every row
            -- - Returns only reasonably similar matches
            -- 
            -- EXAMPLE:
            -- WHERE lower(product_description) %% lower(%(query)s)
            --
            -- 💡 TIP: Only products with similarity > 0.3 will pass this filter
            --
            ___
            
            ORDER BY sim DESC
            LIMIT %(limit)s;
        """, {'query': query, 'limit': limit}).fetchall()
        
        return [{
            'productId': r[0],
            'description': r[1][:200] + '...',
            'category': r[2],
            'price': float(r[3]) if r[3] else 0,
            'stars': float(r[4]) if r[4] else 0,
            'reviews': int(r[5]) if r[5] else 0,
            'imgUrl': r[6],
            'score': float(r[7]) if r[7] else 0,
            'method': 'Fuzzy'
        } for r in results]

# =================================================================
# SOLUTION (for reference - don't peek until you've tried!)
# =================================================================
"""
TODO 1.1:
similarity(lower(product_description), lower(%(query)s)) AS sim

TODO 1.2:
WHERE lower(product_description) %% lower(%(query)s)
"""

# =================================================================
# UNDERSTANDING FUZZY SEARCH
# =================================================================
"""
REAL-WORLD EXAMPLE:

User types: "wireles hedphones"
Without fuzzy search: 0 results (exact match fails)
With fuzzy search: Finds "wireless headphones" products!

HOW IT WORKS:
1. Break text into 3-character chunks (trigrams)
2. Compare trigram overlap between query and products
3. Score based on % of shared trigrams
4. Return products above similarity threshold

WHEN TO USE FUZZY SEARCH:
✓ User-generated queries (likely to have typos)
✓ Mobile/voice input (autocorrect issues)
✓ International users (spelling variations)
✗ Conceptual search ("gift for coffee lover")
✗ When exact matches required (SKU/part numbers)
"""

<div style="background: #e3f2fd; padding: 12px; border-radius: 5px; margin: 15px 0; border-left: 4px solid #2196F3; color: #000;">
<strong>💡 PROGRESS CHECKPOINT</strong><br>
After completing TODO 1, run the test cell below to verify your fuzzy search implementation.<br>
Expected: Function should find products even with typos in the query.<br>
If your results don't match, revisit the TODO and check the hints!
</div>

In [None]:
# 🧪 TEST 1: Fuzzy Search - Run this after completing TODO 1
print("🔍 Testing fuzzy_search() with typos...\n")

test_results = fuzzy_search("wireles headphon", limit=3)

if len(test_results) > 0:
    print("✅ TODO 1 WORKING!")
    print(f"   Found {len(test_results)} products matching 'wireles headphon'")
    print(f"   Top result: {test_results[0]['description'][:70]}...")
    print(f"   Similarity score: {test_results[0]['score']:.3f}")
else:
    print("❌ No results found. Debug checklist:")
    print("   1. Did you complete both TODO 1.1 and TODO 1.2?")
    print("   2. Check for syntax errors (missing commas, parentheses)")
    print("   3. Verify the query parameter name is exactly: %(query)s")

<div style="background: #d4edda; padding: 10px; border-radius: 5px; margin: 10px 0; color: #000;">
<strong>Progress:</strong> [■□□] TODO 1 of 3: Fuzzy Search Complete
</div>
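
The trigram arithmetic described in the `fuzzy_search` docstring can be reproduced in plain Python. This is a rough sketch rather than pg_trgm's actual implementation: it assumes ASCII text, splits words on non-alphanumeric characters, and pads each word with two leading spaces and one trailing space. The names `trigrams` and `trigram_similarity` are illustrative only.

```python
import re


def trigrams(text: str) -> set[str]:
    """Approximate pg_trgm: lowercase, split into words, pad, slide a 3-char window."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams found in either string."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(round(trigram_similarity("bluetooth", "blutooth"), 2))  # 0.58
print(round(trigram_similarity("bluetooth", "blue"), 2))      # 0.36
```

Because the denominator counts every distinct trigram in both strings, comparing a short query against a long description tends to produce low scores, which is worth keeping in mind when choosing a threshold.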

In [None]:
# ============================================================
# 3. SEMANTIC SEARCH - VECTOR SIMILARITY (🔨 TODO 2: YOUR TURN!)
# ============================================================

def semantic_search(query: str, limit: int = 10) -> list[dict]:
    """
    🎯 TODO 2 of 3: Implement Semantic Search with pgvector
    
    ⏱️ Estimated time: 7 minutes
    
    WHAT YOU'RE BUILDING:
    - Finds conceptually related products (not just keyword matches)
    - Uses vector embeddings to understand meaning
    - Works great for natural language queries
    
    HOW VECTOR SEARCH WORKS:
    1. Convert query to 1024-dimensional vector (embedding)
    2. Compare query vector to product vectors using distance metrics
    3. Return products with most similar vectors
    
    DISTANCE METRICS:
    - Cosine Distance (<=>): Based on the angle between vectors (0 = same direction, 2 = opposite direction)
    - L2 Distance (<->): Euclidean distance (straight-line distance)
    - Negative Inner Product (<#>): Dot product multiplied by -1, so smaller still means closer (used less often)
    
    WHY COSINE DISTANCE:
    - Compares direction only, so vector length does not affect the result
    - Range: 0.0 (same direction) to 2.0 (opposite direction)
    - Cohere embeddings are optimized for cosine distance
    
    CONVERTING DISTANCE TO SIMILARITY:
    - Distance: 0.0 = perfect match, higher = less similar
    - Similarity: 1.0 = perfect match, lower = less similar
    - Formula: similarity = 1 - distance
    - Example: distance=0.15 → similarity=0.85 (85% similar)
    
    EXAMPLE QUERY:
    Query: "gift for coffee lover"
    Finds: "espresso machine", "coffee subscription", "ceramic mug set"
    Why: Vectors capture conceptual relationships!
    
    YOUR TASK: Complete the 2 marked sections below
    """
    
    # Generate embedding for query
    query_embedding = generate_embedding(query, "search_query")
    if not query_embedding:
        print("❌ Failed to generate query embedding")
        return []
    
    with psycopg.connect(
        host=dbhost, port=dbport, user=dbuser,
        password=dbpass, dbname=dbname, autocommit=True
    ) as conn:
        register_vector(conn)
        
        results = conn.execute("""
            SELECT 
                "productId",
                product_description,
                category_name,
                price,
                stars,
                reviews,
                imgurl,
                -- ✏️ TODO 2.1: Convert cosine distance to similarity score
                -- 
                -- WHAT THIS DOES:
                -- Calculates how similar product embeddings are to query embedding
                -- 
                -- THE <=> OPERATOR:
                -- This is pgvector's cosine distance operator
                -- Returns: 0.0 (identical vectors) to 2.0 (opposite vectors)
                -- Lower distance = more similar
                -- 
                -- CONVERTING TO SIMILARITY:
                -- We want higher = more similar, with 1.0 as a perfect match
                -- Formula: similarity = 1 - distance (ranges from 1.0 down to -1.0)
                -- 
                -- SYNTAX: 1 - (vector_column <=> query_vector::vector)
                -- 
                -- WHY ::vector CAST:
                -- PostgreSQL needs to know %(embedding)s is a vector type
                -- The ::vector cast tells PostgreSQL to treat it as a vector
                -- 
                -- FULL EXAMPLE:
                -- (1 - (embedding <=> %(embedding)s::vector)) AS similarity
                --
                -- 💡 WALKTHROUGH EXAMPLE:
                --   Query: "wireless headphones"
                --   Product A: "Bluetooth headphones" (distance: 0.12)
                --   Product B: "Coffee maker" (distance: 0.85)
                --   Similarity A: 1 - 0.12 = 0.88 (88% similar)
                --   Similarity B: 1 - 0.85 = 0.15 (15% similar)
                --
                ___
                
            FROM bedrock_integration.product_catalog
            WHERE embedding IS NOT NULL
            
            -- ✏️ TODO 2.2: Order by distance (ascending = closest first)
            -- 
            -- WHAT THIS DOES:
            -- Sorts results by similarity to query (most similar first)
            -- 
            -- WHY ORDER BY DISTANCE (not similarity):
            -- PostgreSQL can use HNSW index to speed up distance-based ordering
            -- The HNSW index is optimized for nearest-neighbor search
            -- 
            -- SYNTAX: ORDER BY vector_column <=> query_vector::vector
            -- 
            -- ASCENDING ORDER:
            -- We want ascending (ASC) because lower distance = more similar
            -- ASC is default, so you don't need to specify it
            -- 
            -- EXAMPLE:
            -- ORDER BY embedding <=> %(embedding)s::vector
            --
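            -- 💡 INDEX USAGE:
            --   With HNSW index: approximate nearest-neighbor lookup, often <100ms even for millions of vectors
            --   Without index: sequential scan that computes the distance for every row (can take 1000ms+)
            --   To check: prefix your query with EXPLAIN ANALYZE and look for an index scan in the plan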
            --
            ___
            
            LIMIT %(limit)s;
        """, {'embedding': query_embedding, 'limit': limit}).fetchall()
        
        return [{
            'productId': r[0],
            'description': r[1][:200] + '...',
            'category': r[2],
            'price': float(r[3]) if r[3] else 0,
            'stars': float(r[4]) if r[4] else 0,
            'reviews': int(r[5]) if r[5] else 0,
            'imgUrl': r[6],
            'score': float(r[7]) if r[7] else 0,
            'method': 'Semantic'
        } for r in results]

# =================================================================
# SOLUTION (for reference - don't peek until you've tried!)
# =================================================================
"""
TODO 2.1:
(1 - (embedding <=> %(embedding)s::vector)) AS similarity

TODO 2.2:
ORDER BY embedding <=> %(embedding)s::vector
"""

# =================================================================
# UNDERSTANDING SEMANTIC SEARCH
# =================================================================
"""
WHY IT'S POWERFUL:

Query: "gift for coffee lover"
Semantic search finds:
  ✓ "premium espresso machine"
  ✓ "artisan coffee subscription"
  ✓ "ceramic coffee mug set"

Keyword search would miss these because they don't contain "gift" or "lover"!

The embeddings capture CONCEPTS:
- "coffee lover" ≈ "coffee enthusiast" ≈ "caffeine fan"
- "gift" ≈ "present" ≈ "perfect for"

WHEN TO USE SEMANTIC SEARCH:
✓ Natural language queries
✓ Conceptual similarity ("gift for X")
✓ Cross-language search (with multilingual embeddings)
✓ When users don't know exact terminology
✗ Exact SKU/part numbers
✗ When precision is critical (legal, medical)
"""

<div style="background: #e3f2fd; padding: 12px; border-radius: 5px; margin: 15px 0; border-left: 4px solid #2196F3; color: #000;">
<strong>💡 PROGRESS CHECKPOINT</strong><br>
After completing TODO 2, run the test cell below to verify your semantic search implementation.<br>
Expected: Function should find conceptually related products (not just keyword matches).<br>
If your results don't match, revisit the TODO and check the hints!
</div>

In [None]:
# 🧪 TEST 2: Semantic Search - Run this after completing TODO 2
print("🧠 Testing semantic_search() with conceptual query...\n")

# Test with semantic query (not exact keywords)
test_results = semantic_search("affordable laptop for students", limit=3)

if len(test_results) > 0:
    print("✅ TODO 2 WORKING!")
    print(f"   Found {len(test_results)} semantically similar products")
    print(f"   Top result: {test_results[0]['description'][:70]}...")
    print(f"   Similarity score: {test_results[0]['score']:.3f}")
    print()
    print("   💡 NOTICE: Results may include 'notebook computer' or 'budget-friendly'")
    print("              even though query used 'laptop' and 'affordable'!")
else:
    print("❌ No results found. Debug checklist:")
    print("   1. Did you complete both TODO 2.1 and TODO 2.2?")
    print("   2. Check the %(embedding)s parameter name is exact")
    print("   3. Verify ::vector cast is included")
    print("   4. Make sure query_embedding was generated (check cell above)")

<div style="background: #d4edda; padding: 10px; border-radius: 5px; margin: 10px 0; color: #000;">
<strong>Progress:</strong> [■■□] TODO 2 of 3: Semantic Search Complete
</div>
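
To make the distance-to-similarity conversion from TODO 2.1 concrete, here is a small standard-library sketch of cosine distance on toy 2-dimensional vectors. The function name `cosine_distance` and the vectors are illustrative assumptions; real query and product embeddings have 1024 dimensions and are compared inside PostgreSQL by the `<=>` operator.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cos(angle between a and b); assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
examples = [
    ("same direction", [2.0, 0.0]),
    ("orthogonal", [0.0, 3.0]),
    ("opposite", [-1.0, 0.0]),
]
for label, vec in examples:
    distance = cosine_distance(query, vec)
    print(f"{label}: distance={distance:.2f}, similarity={1 - distance:.2f}")
```

The three cases give distances of 0, 1 and 2, so `1 - distance` runs from 1 down to -1. The first example also shows that vector length plays no role, since `[2.0, 0.0]` is treated as identical in direction to the query.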

In [None]:
# ============================================================
# 4. HYBRID SEARCH - WEIGHTED FUSION (PROVIDED)
# ============================================================

def hybrid_search(
    query: str,
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3,
    limit: int = 10
) -> list[dict]:
    """
    Hybrid Search combining semantic and keyword approaches using weighted score fusion.
    
    IMPORTANT: This implementation intentionally does NOT normalize scores to demonstrate
    a common production challenge. Different search methods produce vastly different score ranges.
    """
    
    # Normalize weights
    total = semantic_weight + keyword_weight
    semantic_weight = semantic_weight / total
    keyword_weight = keyword_weight / total
    
    # Get results from both methods
    semantic_results = semantic_search(query, limit * 2)
    keyword_results = keyword_search(query, limit * 2)
    
    # Combine and score
    product_scores = {}
    product_data = {}
    
    # Process semantic results
    for result in semantic_results:
        pid = result['productId']
        product_scores[pid] = result['score'] * semantic_weight
        product_data[pid] = result
    
    # Process keyword results
    for result in keyword_results:
        pid = result['productId']
        if pid in product_scores:
            product_scores[pid] += result['score'] * keyword_weight
        else:
            product_scores[pid] = result['score'] * keyword_weight
            product_data[pid] = result
    
    # Sort and return top results
    sorted_products = sorted(product_scores.items(), key=lambda x: x[1], reverse=True)[:limit]
    
    results = []
    for pid, score in sorted_products:
        product = product_data[pid].copy()
        product['score'] = score
        product['method'] = 'Hybrid'
        results.append(product)
    
    return results
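
The scale mismatch that `hybrid_search()` is meant to expose can be seen offline with a few invented scores. The sketch below is an illustration under stated assumptions, not workshop data: the product IDs and score values are made up, `weighted_fusion` mirrors the weighted sum used above, and `rrf_fusion` is a plain-Python rendering of the rank-based formula `1 / (k + rank)` with `k = 60` and a fallback rank of 1000 for products missing from one list.

```python
def weighted_fusion(semantic: dict, keyword: dict,
                    w_sem: float = 0.7, w_kw: float = 0.3) -> list:
    """Rank product IDs by the weighted sum of raw scores (no normalization)."""
    fused = {}
    for pid, score in semantic.items():
        fused[pid] = fused.get(pid, 0.0) + w_sem * score
    for pid, score in keyword.items():
        fused[pid] = fused.get(pid, 0.0) + w_kw * score
    return sorted(fused, key=fused.get, reverse=True)


def rrf_fusion(semantic: dict, keyword: dict,
               k: int = 60, missing_rank: int = 1000) -> list:
    """Rank product IDs by summed 1 / (k + rank); absent items get missing_rank."""
    rank_maps = []
    for scores in (semantic, keyword):
        ordered = sorted(scores, key=scores.get, reverse=True)
        rank_maps.append({pid: pos for pos, pid in enumerate(ordered, start=1)})
    all_ids = set(semantic) | set(keyword)
    fused = {
        pid: sum(1.0 / (k + ranks.get(pid, missing_rank)) for ranks in rank_maps)
        for pid in all_ids
    }
    return sorted(fused, key=fused.get, reverse=True)


# Invented scores: cosine-style similarities are near 1, ts_rank-style scores near 0.
semantic_scores = {"A": 0.82, "B": 0.80, "C": 0.70}
keyword_scores = {"C": 0.09, "D": 0.08, "B": 0.01}

print(weighted_fusion(semantic_scores, keyword_scores))  # ['A', 'B', 'C', 'D']
print(rrf_fusion(semantic_scores, keyword_scores))       # ['C', 'B', 'A', 'D']
```

With raw scores, the fused order simply repeats the semantic order because the keyword scores are too small to matter. With ranks, product C (first in keyword, third in semantic) moves to the top, and products found by both methods outrank those found by only one.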

In [None]:
# ============================================================
# 5. HYBRID SEARCH - RRF (🔨 TODO 3: YOUR TURN!)
# ============================================================

def hybrid_search_rrf(query: str, k: int = 60, limit: int = 10) -> list[dict]:
    """
    🎯 TODO 3 of 3: Implement Reciprocal Rank Fusion (RRF)
    
    ⏱️ Estimated time: 7 minutes
    
    WHAT YOU'RE BUILDING:
    - Combines semantic + keyword search intelligently
    - Uses RANKS (positions) instead of raw scores
    - No need to normalize scores from different methods!
    
    THE RRF FORMULA:
    score = 1/(k + rank)  where k=60
    
    WHY IT WORKS:
    - Product ranking #1: score = 1/(60+1) = 0.0164
    - Product ranking #10: score = 1/(60+10) = 0.0143
    - Product ranking #100: score = 1/(60+100) = 0.0063
    
    EXAMPLE:
    Product "wireless headphones":
    - Ranks #1 in semantic search → RRF score: 1/(60+1) = 0.0164
    - Ranks #5 in keyword search → RRF score: 1/(60+5) = 0.0154
    - Combined RRF score: 0.0164 + 0.0154 = 0.0318
    
    KEY INSIGHT:
    - Rank-based scoring means we don't need to normalize!
    - Products appearing in BOTH methods naturally score higher
    - Products appearing in only ONE method get lower scores
    
    YOUR TASK: Complete the 2 marked sections below
    """
    
    # Generate query embedding (provided)
    query_embedding = generate_embedding(query, "search_query")
    if not query_embedding:
        return []
    
    with psycopg.connect(
        host=dbhost, port=dbport, user=dbuser,
        password=dbpass, dbname=dbname, autocommit=True
    ) as conn:
        register_vector(conn)
        
        results = conn.execute("""
            -- Step 1: Semantic search with rankings
            WITH semantic_results AS (
                SELECT 
                    "productId",
                    -- ✏️ TODO 3.1: Add ROW_NUMBER() to rank semantic results
                    -- 
                    -- WHAT THIS DOES:
                    -- Assigns a rank (1, 2, 3...) to each product based on similarity
                    -- 
                    -- THE ROW_NUMBER() FUNCTION:
                    -- Assigns sequential integers to rows in result set
                    -- Requires ORDER BY clause to determine ranking
                    -- 
                    -- SYNTAX: ROW_NUMBER() OVER (ORDER BY column_name [ASC|DESC])
                    -- 
                    -- WHAT TO ORDER BY:
                    -- We want products with LOWEST distance ranked first
                    -- Lower distance = more similar
                    -- So we ORDER BY distance ASC (ascending)
                    -- 
                    -- FULL EXAMPLE:
                    -- ROW_NUMBER() OVER (ORDER BY embedding <=> %(embedding)s::vector) AS rank
                    --
                    -- 💡 WALKTHROUGH:
                    --   Product A: distance 0.10 → rank 1 (most similar)
                    --   Product B: distance 0.15 → rank 2
                    --   Product C: distance 0.20 → rank 3
                    --   Product D: distance 0.25 → rank 4
                    --
                    ___
                FROM bedrock_integration.product_catalog
                WHERE embedding IS NOT NULL
                ORDER BY embedding <=> %(embedding)s::vector
                LIMIT 50
            ),
            -- Step 2: Keyword search with rankings (provided)
            keyword_results AS (
                SELECT 
                    "productId",
                    ROW_NUMBER() OVER (ORDER BY ts_rank(to_tsvector('english', product_description), query) DESC) AS rank
                FROM bedrock_integration.product_catalog, to_tsquery('english', %(keyword_query)s) query
                WHERE to_tsvector('english', product_description) @@ query
                ORDER BY rank  -- keep the 50 best-ranked rows, not an arbitrary 50
                LIMIT 50
            )
            -- Step 3: Combine using RRF
            SELECT 
                COALESCE(s."productId", k."productId") AS "productId",
                p.product_description,
                p.category_name,
                p.price,
                p.stars,
                p.reviews,
                p.imgurl,
                -- ✏️ TODO 3.2: Calculate combined RRF score
                -- 
                -- WHAT THIS DOES:
                -- Sums the RRF scores from BOTH search methods
                -- 
                -- THE RRF FORMULA:
                -- score = 1 / (k + rank)  where k=60
                -- 
                -- HANDLING MISSING RANKS:
                -- - If product appears in both methods: use actual ranks
                -- - If product appears in only ONE method: use rank=1000 for missing method
                -- - COALESCE(rank, 1000) handles this automatically
                -- 
                -- WHAT TO CALCULATE:
                -- RRF from semantic: 1.0 / (60 + COALESCE(s.rank, 1000))
                -- RRF from keyword:  1.0 / (60 + COALESCE(k.rank, 1000))
                -- Combined: Add them together!
                -- 
                -- FULL EXAMPLE:
                -- (1.0 / (60 + COALESCE(s.rank, 1000))) + (1.0 / (60 + COALESCE(k.rank, 1000))) AS rrf_score
                --
                -- 💡 WALKTHROUGH EXAMPLE:
                --   Product "wireless headphones":
                --   - s.rank = 2 (2nd in semantic)
                --   - k.rank = 5 (5th in keyword)
                --   - Semantic RRF: 1/(60+2) = 0.0161
                --   - Keyword RRF: 1/(60+5) = 0.0154
                --   - Combined: 0.0161 + 0.0154 = 0.0315
                --
                --   Product "bluetooth speakers" (only in semantic):
                --   - s.rank = 10 (10th in semantic)
                --   - k.rank = NULL (not in keyword results)
                --   - Semantic RRF: 1/(60+10) = 0.0143
                --   - Keyword RRF: 1/(60+1000) = 0.0009 (very low!)
                --   - Combined: 0.0143 + 0.0009 = 0.0152
                --
                ___
                
            FROM semantic_results s
            FULL OUTER JOIN keyword_results k ON s."productId" = k."productId"
            JOIN bedrock_integration.product_catalog p ON COALESCE(s."productId", k."productId") = p."productId"
            ORDER BY rrf_score DESC
            LIMIT %(limit)s;
        """, {
            'embedding': query_embedding,
            'keyword_query': ' & '.join(query.split()),
            'limit': limit
        }).fetchall()
        
        return [{
            'productId': r[0],
            'description': (r[1] or '')[:200] + '...',  # guard against NULL descriptions
            'category': r[2],
            'price': float(r[3]) if r[3] else 0,
            'stars': float(r[4]) if r[4] else 0,
            'reviews': int(r[5]) if r[5] else 0,
            'imgUrl': r[6],
            'score': float(r[7]) if r[7] else 0,
            'method': 'Hybrid-RRF'
        } for r in results]

# =================================================================
# SOLUTION (for reference - don't peek until you've tried!)
# =================================================================
"""
TODO 3.1:
ROW_NUMBER() OVER (ORDER BY embedding <=> %(embedding)s::vector) AS rank

TODO 3.2:
(1.0 / (60 + COALESCE(s.rank, 1000))) + (1.0 / (60 + COALESCE(k.rank, 1000))) AS rrf_score
"""

# =================================================================
# WHY RRF IS BETTER THAN WEIGHTED SCORES
# =================================================================
"""
PROBLEM WITH WEIGHTED SCORES:
- Semantic search scores: 0.85, 0.82, 0.79, ...
- Keyword search scores (ts_rank): 0.09, 0.06, 0.03, ...
- How do you combine these? They're on different scales!

RRF SOLUTION:
- Convert to ranks: #1, #2, #3, ...
- Ranks are always on the same scale!
- RRF formula: 1/(k+rank) creates comparable scores
- Just add the RRF scores together!

REAL-WORLD BENEFIT:
- No manual tuning of weights
- Works reliably across different queries
- Products in BOTH top results automatically score higher
- k=60 is the constant used in the original RRF paper (Cormack, Clarke, and Büttcher, 2009) and is a common default

WHEN TO USE RRF vs WEIGHTED:
✓ RRF: Different score scales, no domain knowledge for weights
✓ Weighted: Similar score scales, strong intuition about weights
✓ RRF: Production systems (less maintenance)
✓ Weighted: Specialized domains with known query patterns
"""

<div style="background: #e3f2fd; padding: 12px; border-radius: 5px; margin: 15px 0; border-left: 4px solid #2196F3; color: #000;">
<strong>💡 PROGRESS CHECKPOINT</strong><br>
After completing TODO 3, run the test cell below to verify your hybrid RRF search implementation.<br>
Expected: Function should combine semantic and keyword results using rank-based fusion.<br>
If your results don't match, revisit the TODO and check the hints!
</div>

In [None]:
# 🧪 TEST 3: Hybrid RRF Search - Run this after completing TODO 3
print("⚖️ Testing hybrid_search_rrf() with complex query...\n")

test_results = hybrid_search_rrf("affordable wireless bluetooth headphones", limit=5)

if len(test_results) > 0:
    print("✅ TODO 3 WORKING!")
    print(f"   Found {len(test_results)} products using hybrid RRF")
    print()
    print("   Top 3 results:")
    for i, result in enumerate(test_results[:3], 1):
        print(f"   {i}. {result['description'][:60]}...")
        print(f"      RRF Score: {result['score']:.4f}")
    print()
    print("   💡 NOTICE: Results balance semantic meaning + keyword matching!")
else:
    print("❌ No results found. Debug checklist:")
    print("   1. Did you complete both TODO 3.1 and TODO 3.2?")
    print("   2. Check parameter names: %(embedding)s and %(keyword_query)s")
    print("   3. Verify parentheses are balanced in RRF calculation")
    print("   4. Make sure COALESCE syntax is correct")
    print("   5. Check that you're ADDING the two RRF scores together")

<div style="background: #d4edda; padding: 10px; border-radius: 5px; margin: 10px 0; color: #000;">
<strong>Progress:</strong> [■■■] All 3 TODOs Complete!
</div>
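
`hybrid_search_rrf` builds its `keyword_query` parameter with `' & '.join(query.split())`, which passes punctuation straight through to `to_tsquery`. Characters that have meaning in tsquery syntax (for example `'`, `(`, `:` or `&`) can then cause a syntax error. The sketch below shows one possible way to keep only word characters before joining. `build_tsquery` is a hypothetical helper, not part of the lab code.

```python
import re

def build_tsquery(query: str) -> str:
    """Join the word tokens of a query with ' & ' (AND) for use with to_tsquery."""
    words = re.findall(r"\w+", query)  # drops quotes, parentheses, $ and similar
    return " & ".join(words)

print(build_tsquery("affordable wireless bluetooth headphones"))
# affordable & wireless & bluetooth & headphones

print(build_tsquery("kids' headphones (under $200)"))
# kids & headphones & under & 200
```

PostgreSQL also provides `plainto_tsquery` and `websearch_to_tsquery`, which accept unformatted text directly. Switching to one of those would be an alternative to cleaning the string in Python.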

In [None]:
# ============================================================
# 4. COHERE RERANK FUNCTION (PROVIDED)
# ============================================================

def rerank_results(query: str, results: list[dict], top_n: int = 10) -> list[dict]:
    """Rerank search results using Cohere Rerank v3.5"""
    if not results:
        return results
    
    try:
        documents = [r['description'] for r in results]
        
        response = bedrock_runtime.invoke_model(
            modelId='cohere.rerank-v3-5:0',
            body=json.dumps({
                "api_version": 2,
                "query": query,
                "documents": documents,
                "top_n": min(top_n, len(documents))
            })
        )
        
        rerank_response = json.loads(response['body'].read())
        reranked = []
        
        for item in rerank_response['results']:
            idx = item['index']
            result = results[idx].copy()
            result['rerank_score'] = item['relevance_score']
            reranked.append(result)
        
        return reranked
    except Exception as e:
        print(f"Rerank failed: {e}")
        return results

print("✅ Rerank function ready")

## 🎮 Step 4: Interactive Search Interface

Now let's create an interactive interface to explore and compare different search methods.

In [None]:
# ============================================================
# INTERACTIVE SEARCH INTERFACE
# ============================================================

def create_search_interface():
    """Create an interactive search interface with proper product display"""
    import ipywidgets as widgets
    from IPython.display import display, HTML
    
    # Professional style definitions
    style = """
    <style>
        .search-container { padding: 20px; background: #f8f9fa; border-radius: 10px; }
        .result-card { 
            margin: 15px 0; padding: 20px; background: white; 
            border-radius: 8px; border: 1px solid #e3e6e8;
            transition: all 0.3s; position: relative;
            box-shadow: 0 1px 2px rgba(0,0,0,0.05);
        }
        .result-card:hover { 
            box-shadow: 0 8px 20px rgba(0,0,0,0.12); 
            transform: translateY(-2px);
            border-color: #ff9900;
        }
        .method-badge {
            position: absolute; top: 15px; right: 15px;
            padding: 5px 12px; border-radius: 20px;
            font-size: 11px; font-weight: bold;
            text-transform: uppercase;
        }
        .keyword { background: #e3f2fd; color: #1565c0; }
        .fuzzy { background: #fce4ec; color: #c2185b; }
        .semantic { background: #e8f5e9; color: #2e7d32; }
        .hybrid { background: #fff3e0; color: #e65100; }
        
        .product-content { display: flex; gap: 20px; }
        .product-image {
            flex-shrink: 0; width: 150px; height: 150px;
            object-fit: contain; border: 1px solid #e3e6e8;
            border-radius: 4px; padding: 10px; background: white;
        }
        .product-details { flex-grow: 1; }
        .product-title {
            font-size: 16px; color: #0066c0; text-decoration: none;
            font-weight: 500; line-height: 1.4; display: block; margin-bottom: 8px;
        }
        .product-title:hover { color: #c7511f; text-decoration: underline; }
        .product-price {
            font-size: 21px; color: #B12704; font-weight: 500; margin: 8px 0;
        }
        .product-rating {
            display: flex; align-items: center; gap: 8px; margin: 8px 0;
        }
        .stars { color: #ff9900; }
        .product-category { color: #565959; font-size: 12px; margin-top: 8px; }
        .score-info {
            margin-top: 12px; padding-top: 12px; border-top: 1px solid #e3e6e8;
            display: flex; justify-content: space-between; align-items: center;
        }
        .score-bar {
            height: 6px; background: #e9ecef; border-radius: 3px;
            overflow: hidden; flex-grow: 1; margin-right: 10px; max-width: 200px;
        }
        .score-fill {
            height: 100%; background: linear-gradient(90deg, #ff9900, #ff6600);
            transition: width 0.5s;
        }
        .score-text { color: #565959; font-size: 12px; font-weight: 500; }
        .comparison-grid {
            display: grid; grid-template-columns: repeat(auto-fit, minmax(400px, 1fr));
            gap: 20px; margin-top: 20px;
        }
        .no-results {
            padding: 40px; text-align: center; color: #565959;
            background: #f7f8f8; border-radius: 8px;
        }
    </style>
    """
    
    # Widget definitions
    query_input = widgets.Text(
        value='',
        placeholder='Try "Apple AirPods" or "coffee maker" or "laptop bag"...',
        description='Search:',
        style={'description_width': '80px'},
        layout=widgets.Layout(width='700px')
    )
    
    search_method = widgets.RadioButtons(
        options=[
            ('Keyword (Exact Match)', 'keyword'),
            ('Fuzzy (Typo Tolerance)', 'fuzzy'),
            ('Semantic (Conceptual)', 'semantic'),
            ('Hybrid (Combined)', 'hybrid'),
            ('Hybrid-RRF (Rank Fusion)', 'hybrid_rrf'),
            ('🔍 Compare All Methods', 'compare')
        ],
        value='compare',
        description='Method:',
        style={'description_width': '80px'}
    )
    
    # Hybrid search weight sliders
    semantic_weight = widgets.FloatSlider(
        value=0.7, min=0, max=1, step=0.1,
        description='Semantic:',
        style={'description_width': '80px'},
        layout=widgets.Layout(width='350px')
    )
    
    keyword_weight = widgets.FloatSlider(
        value=0.3, min=0, max=1, step=0.1,
        description='Keyword:',
        style={'description_width': '80px'},
        layout=widgets.Layout(width='350px')
    )
    
    results_limit = widgets.IntSlider(
        value=3, min=1, max=10, step=1,
        description='Results:',
        style={'description_width': '80px'},
        layout=widgets.Layout(width='300px')
    )
    
    search_button = widgets.Button(
        description='🔍 Search Products',
        button_style='primary',
        layout=widgets.Layout(width='200px', height='40px')
    )
    
    rerank_checkbox = widgets.Checkbox(
        value=False,
        description='Use Cohere Rerank',
        style={'description_width': 'initial'}
    )
    
    results_output = widgets.Output()
    
    # Example queries that demonstrate real differences
    example_queries = [
        # Exact keyword matches
        ("wireless bluetooth headphones", "Common Terms", "keyword"),
        ("stainless steel water bottle", "Product Type", "keyword"),
        
        # Conceptual searches
        ("something to keep coffee hot all day", "Problem Solving", "semantic"),
        ("gift for someone who loves cooking", "Gift Ideas", "semantic"),
        
        # Typo tolerance
        ("wireles blutooth hedphones", "With Typos", "fuzzy"),
        ("stainles steel watter botle", "Misspellings", "fuzzy"),
        
        # Balanced hybrid (RRF excels here)
        ("durable laptop backpack with USB charging", "Multi-Feature", "hybrid_rrf"),
        ("ergonomic office chair under 300 dollars", "Specs + Price", "hybrid_rrf"),
        
        # Mixed queries
        ("organic sustainable water bottle", "Features + Product", "hybrid"),
        ("affordable noise canceling headphones under 200", "Specs + Budget", "hybrid"),
        
        # Activity based
        ("equipment for home yoga practice", "Activity Based", "semantic"),
        ("tools for remote work from home", "Use Case", "semantic")
    ]
    
    def format_result(result: dict, method_class: str = '') -> str:
        """Format a single search result with full product display"""
        # Extract product details
        product_id = result.get('productId', 'Unknown')
        description = result.get('description', 'No description available')
        price = result.get('price', 0)
        stars = result.get('stars', 0)
        reviews = result.get('reviews', 0)
        category = result.get('category', 'Unknown Category')
        score = result.get('score', 0)
        rerank_score = result.get('rerank_score', None)
        img_url = result.get('imgUrl', '')  # key is 'imgUrl' (capital U), as returned by the search functions
        
        # Create star display
        star_display = '★' * int(stars) + '☆' * (5 - int(stars))
        
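        # Generate Amazon search link (URL-encode the first five words of the description)
        from urllib.parse import quote_plus
        search_terms = description.split()[:5]
        link_url = f"https://www.amazon.com/s?k={quote_plus(' '.join(search_terms))}"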
        
        # Calculate score percentage for visual bar
        display_score = rerank_score if rerank_score is not None else score
        score_percent = min(display_score * 100, 100) if display_score > 0 else 0
        
        # Score label
        score_label = "Rerank Score" if rerank_score is not None else "Relevance"
        
        # Simple direct image embed exactly like Part 2 notebook
        return f"""
        <div class="result-card">
            <div class="method-badge {method_class}">{result.get('method', 'Unknown')}</div>
            
            <div class="product-content">
                <img src="{img_url}" style="width: 150px; height: 150px; object-fit: contain; border: 1px solid #e3e6e8; border-radius: 4px; padding: 10px; background: white;">
                
                <div class="product-details">
                    <a href="{link_url}" target="_blank" class="product-title">
                        {description}
                    </a>
                    
                    <div class="product-price">${price:.2f}</div>
                    
                    <div class="product-rating">
                        <span class="stars">{star_display}</span>
                        <span style="color: #007185; font-size: 14px;">({reviews:,} reviews)</span>
                    </div>
                    
                    <div class="product-category">Category: {category}</div>
                    
                    <div class="score-info">
                        <div style="display: flex; align-items: center; flex-grow: 1;">
                            <div class="score-bar">
                                <div class="score-fill" style="width: {score_percent}%"></div>
                            </div>
                            <span class="score-text">{score_label}: {display_score:.3f}</span>
                        </div>
                        <a href="{link_url}" target="_blank" style="color: #ff9900; text-decoration: none; font-size: 13px;">
                            View on Amazon →
                        </a>
                    </div>
                </div>
            </div>
        </div>
        """
    
    def set_example_query(query: str, method: str | None = None):
        """Set an example query and optionally the search method"""
        query_input.value = query
        if method:
            search_method.value = method
    
    # Create example buttons
    example_buttons = []
    for query, label, best_method in example_queries:
        btn = widgets.Button(
            description=f"{label}: {query[:30]}..." if len(query) > 30 else f"{label}: {query}",
            layout=widgets.Layout(width='auto', margin='2px'),
            tooltip=f"Best with: {best_method}"
        )
        btn.on_click(lambda b, q=query, m=best_method: set_example_query(q, m))
        example_buttons.append(btn)
    
    def on_search_clicked(b):
        """Handle search button click"""
        results_output.clear_output()
        
        with results_output:
            display(HTML(style))
            
            query = query_input.value
            method = search_method.value
            limit = results_limit.value
            use_rerank = rerank_checkbox.value
            
            if not query:
                display(HTML('<div class="no-results">Please enter a search query!</div>'))
                return
            
            display(HTML(f'<h3 style="color: #0f1111;">🔍 Results for: "{query}"</h3>'))
            
            if method == 'compare':
                # Compare all methods
                methods_to_compare = [
                    ('Keyword (Exact)', keyword_search, 'keyword'),
                    ('Fuzzy (Typos)', fuzzy_search, 'fuzzy'),
                    ('Semantic (Cohere)', semantic_search, 'semantic'),
                    ('Hybrid (70/30)', lambda q, l: hybrid_search(q, 0.7, 0.3, l), 'hybrid'),
                    ('Hybrid-RRF (k=60)', lambda q, l: hybrid_search_rrf(q, 60, l), 'hybrid')
                ]
                
                # Method colors
                method_colors = {
                    'keyword': '1565c0',
                    'fuzzy': 'c2185b',
                    'semantic': '2e7d32',
                    'hybrid': 'e65100'
                }
                
                html_output = '<div class="comparison-grid">'
                
                for method_name, func, css_class in methods_to_compare:
                    border_color = method_colors.get(css_class, '666666')
                    html_output += f'<div><h4 style="color: #0f1111; border-bottom: 2px solid #{border_color}; padding-bottom: 8px; margin-bottom: 15px;">{method_name}</h4>'
                    
                    try:
                        import time
                        start = time.time()
                        results = func(query, limit)
                        elapsed = time.time() - start
                        
                        # Apply reranking if enabled
                        if use_rerank and results:
                            results = rerank_results(query, results, min(len(results), 2))
                        
                        if results:
                            html_output += f'<p style="color: #565959; font-size: 12px;">Found {len(results)} results in {elapsed:.3f}s</p>'
                            for result in results[:2]:  # Show top 2 per method
                                html_output += format_result(result, css_class)
                        else:
                            html_output += '<div class="no-results">No results found with this method</div>'
                            
                    except Exception as e:
                        html_output += f'<div class="no-results">Error: {str(e)}</div>'
                    
                    html_output += '</div>'
                
                html_output += '</div>'
                display(HTML(html_output))
                
            else:
                # Single method search
                try:
                    import time
                    start = time.time()
                    
                    if method == 'keyword':
                        results = keyword_search(query, limit)
                        css_class = 'keyword'
                        method_name = 'Keyword (Exact Match)'
                    elif method == 'fuzzy':
                        results = fuzzy_search(query, limit)
                        css_class = 'fuzzy'
                        method_name = 'Fuzzy (Typo Tolerance)'
                    elif method == 'semantic':
                        results = semantic_search(query, limit)
                        css_class = 'semantic'
                        method_name = 'Semantic Search (Cohere)'
                    elif method == 'hybrid':
                        results = hybrid_search(
                            query, 
                            semantic_weight.value,
                            keyword_weight.value,
                            limit
                        )
                        css_class = 'hybrid'
                        method_name = f'Hybrid (S:{semantic_weight.value:.1f}/K:{keyword_weight.value:.1f})'
                    elif method == 'hybrid_rrf':
                        results = hybrid_search_rrf(query, 60, limit)
                        css_class = 'hybrid'
                        method_name = 'Hybrid-RRF (k=60)'
                    
                    elapsed = time.time() - start
                    
                    # Apply reranking if enabled
                    if use_rerank and results:
                        rerank_start = time.time()
                        results = rerank_results(query, results, len(results))
                        rerank_time = time.time() - rerank_start
                        total_time = elapsed + rerank_time
                        
                        display(HTML(f'''
                            <p style="color: #565959;">
                                Method: <strong>{method_name}</strong> | 
                                Search: <strong>{elapsed:.3f}s</strong> | 
                                Rerank: <strong>{rerank_time:.3f}s</strong> |
                                Total: <strong>{total_time:.3f}s</strong> | 
                                Results: <strong>{len(results)}</strong>
                            </p>
                        '''))
                    else:
                        display(HTML(f'''
                            <p style="color: #565959;">
                                Method: <strong>{method_name}</strong> | 
                                Time: <strong>{elapsed:.3f}s</strong> | 
                                Results: <strong>{len(results)}</strong>
                            </p>
                        '''))
                    
                    if results:
                        for result in results:
                            display(HTML(format_result(result, css_class)))
                    else:
                        display(HTML('<div class="no-results">No products found. Try a different search term or method.</div>'))
                        
                except Exception as e:
                    display(HTML(f'<div class="no-results">Error: {str(e)}</div>'))
                    import traceback
                    print(traceback.format_exc())
    
    search_button.on_click(on_search_clicked)
    
    # Create status display for weights
    weight_status = widgets.HTML(
        value="<div style='padding: 5px; font-size: 0.9em; color: #2E8B57;'>✓ Weights sum to 1.0</div>"
    )

    def validate_and_update_weights(change):
        current_sum = round(semantic_weight.value + keyword_weight.value, 1)
        
        if current_sum > 1:
            # If semantic weight was changed
            if change.owner == semantic_weight:
                keyword_weight.value = max(0, round(1 - semantic_weight.value, 1))
            # If keyword weight was changed
            else:
                semantic_weight.value = max(0, round(1 - keyword_weight.value, 1))
            
            current_sum = round(semantic_weight.value + keyword_weight.value, 1)
        
        # Update status display
        if current_sum > 1:
            weight_status.value = f"<div style='padding: 5px; font-size: 0.9em; color: #DC143C;'>⚠️ Sum exceeds 1 (Current: {current_sum})</div>"
        elif current_sum == 1:
            weight_status.value = f"<div style='padding: 5px; font-size: 0.9em; color: #2E8B57;'>✓ Weights sum to {current_sum}</div>"
        else:
            weight_status.value = f"<div style='padding: 5px; font-size: 0.9em; color: #DAA520;'>ℹ️ Sum is {current_sum}</div>"

    # Observe changes in both sliders
    semantic_weight.observe(validate_and_update_weights, names='value')
    keyword_weight.observe(validate_and_update_weights, names='value')
    
    # Layout
    display(HTML("""
        <style>
            .adaptive-title {
                color: #000000;
            }
            @media (prefers-color-scheme: dark) {
                .adaptive-title { color: #ffffff; }
            }
            body.vscode-dark .adaptive-title,
            body.vscode-high-contrast .adaptive-title,
            .jp-Notebook-dark .adaptive-title {
                color: #ffffff;
            }
        </style>
        <h2 class="adaptive-title">🛍️ Amazon Product Search Comparison</h2>
        <div style="background: #f7f8f8; padding: 15px; border-radius: 8px; margin: 15px 0;">
            <h4 style="color: #0f1111; margin-top: 0;">Search Method Strengths:</h4>
            <ul style="color: #565959; margin: 10px 0;">
                <li><strong style="color: #1565c0;">Keyword:</strong> Perfect for exact product names, SKUs, brand searches</li>
                <li><strong style="color: #c2185b;">Fuzzy:</strong> Handles typos and misspellings</li>
                <li><strong style="color: #2e7d32;">Semantic:</strong> Understands intent and concepts using Cohere embeddings</li>
                <li><strong style="color: #e65100;">Hybrid:</strong> Best overall - combines keyword matching with semantic understanding</li>
            </ul>
           <div style="border-left: 4px solid #4CAF50; padding-left: 10px; margin-top: 10px; color: black;"> 
                <strong>🤖 Cohere Models:</strong> embed-english-v3 (embeddings) • rerank-v3-5:0 (re-ranking)
            </div>
            <div style="background: #fff3e0; border-left: 4px solid #e65100; padding: 12px; margin-top: 15px; border-radius: 4px;">
                <h4 style="color: #2e7d32; margin-top: 0; margin-bottom: 8px;">💡 Understanding Hybrid Search Approaches</h4>
                <p style="color: #1b5e20; margin: 8px 0; font-size: 13px;">
                    <strong>Challenge:</strong> Different search methods produce vastly different score ranges (semantic: 0.7-1.0, keyword: 0.01-0.1), causing one method to dominate weighted combinations.
                </p>
                <p style="color: #1b5e20; margin: 8px 0; font-size: 13px;">
                    <strong>Solutions Demonstrated:</strong>
                </p>
                <ul style="color: #1b5e20; margin: 8px 0; font-size: 13px; padding-left: 20px;">
                    <li><strong>Hybrid (70/30):</strong> Weighted score fusion - simple but requires careful tuning</li>
                    <li><strong>Hybrid-RRF:</strong> Rank-based fusion - robust, no normalization needed ✨</li>
                    <li><strong>Cohere Rerank:</strong> ML-based re-ranking - most sophisticated approach</li>
                </ul>
                <div style="background: #e8f5e9; border-left: 4px solid #4caf50; padding: 10px; margin: 8px 0; font-size: 13px; color: #1b5e20;">
                    <strong>💡 Try the examples below</strong> to see how each method handles different query types!
                </div>
            </div>
        </div>
    """))
    
    # Display interface
    display(widgets.VBox([
        widgets.HTML('<h4 style="color: #0f1111; margin: 15px 0;">📝 Example Searches (Click to Try):</h4>'),
        widgets.GridBox(
            example_buttons,
            layout=widgets.Layout(
                grid_template_columns='repeat(3, 1fr)',
                grid_gap='5px'
            )
        ),
        widgets.HTML('<hr style="margin: 20px 0; border-color: #e3e6e8;">'),
        query_input,
        search_method,
        widgets.HTML('<h4 style="color: #0f1111;">⚙️ Options:</h4>'),
        widgets.HBox([
            widgets.VBox([
                widgets.HTML('<strong>Hybrid Weights:</strong>'),
                semantic_weight,
                keyword_weight,
                weight_status
            ]),
            widgets.VBox([
                results_limit,
                rerank_checkbox
            ])
        ]),
        search_button,
        results_output
    ]))

# Create and display the interface
create_search_interface()
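
The interface text above notes that raw score ranges differ between methods (semantic: 0.7-1.0, keyword: 0.01-0.1), so one method can dominate a weighted combination. The sketch below illustrates that effect with invented scores in those ranges. `weighted_fusion` is a hypothetical helper and is not the lab's `hybrid_search` implementation.

```python
def weighted_fusion(semantic, keyword, w_sem=0.7, w_kw=0.3):
    """Rank product ids by w_sem * semantic + w_kw * keyword (missing score = 0)."""
    ids = set(semantic) | set(keyword)
    combined = {
        pid: w_sem * semantic.get(pid, 0.0) + w_kw * keyword.get(pid, 0.0)
        for pid in ids
    }
    return sorted(combined, key=combined.get, reverse=True)

# Illustrative scores only: semantic prefers A, keyword prefers C
semantic = {"A": 0.95, "B": 0.85, "C": 0.75}
keyword = {"C": 0.10, "B": 0.05, "A": 0.01}

print(weighted_fusion(semantic, keyword))            # ['A', 'B', 'C']
print(weighted_fusion(semantic, keyword, 0.5, 0.5))  # ['A', 'B', 'C']
```

Even at equal weights the final order follows the semantic scores, because their spread is larger than the whole keyword range. Rank-based fusion avoids this by discarding the raw magnitudes.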

## 🔒 Step 5: Beyond Search - RLS & MCP in Action

<div style="background: #e8f5e9; border-left: 5px solid #4caf50; padding: 15px; margin: 15px 0; color: #000;">
<strong>🎯 Quick Preview</strong><br>
Your hybrid search is production-ready! Here's how it powers AI agents with secure, role-based access.
</div>

⏱️ **Estimated time:** 2 minutes (read-only)

---

### Three Personas, One Database

The demo app shows three user roles accessing the `bedrock_integration.knowledge_base` table:

| Persona | Icon | Access Levels | Example Query |
|---------|------|---------------|---------------|
| **Customer** | 👤 | `product_faq` only | "How do I set up my camera?" |
| **Support Agent** | 🎧 | `product_faq`, `support_ticket`, `internal_note` | "Recent complaints about vacuums" |
| **Product Manager** | 👔 | All content + `analytics` | "Sales trends by category" |

---

### How Security Works: Application-Level RLS

Instead of database-enforced row-level security policies, the agent connects through a single **trusted connection**, and the access rules are written into its system prompt:
```python
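# Illustrative only: the agent connects with one trusted (read-only admin) credential
# mcp_client = MCPClient(admin_credentials)

# Example values for the Support Agent persona (see the persona table above)
persona = "support_agent"
allowed_content_types = ["product_faq", "support_ticket", "internal_note"]
denied_content_types = ["analytics"]

# Security enforced in system prompt
system_prompt = f"""
SECURITY: Only query WHERE '{persona}' = ANY(persona_access)
ALLOWED: {allowed_content_types}
DENIED: {denied_content_types}
"""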
```

**Data Structure:**

| id | content | content_type | persona_access |
|----|---------|--------------|--------------------|
| 1 | 'Setup guide...' | product_faq | {customer, support_agent, product_manager} |
| 2 | 'Ticket #1234...' | support_ticket | {support_agent, product_manager} |
| 3 | 'Q4 sales data...' | analytics | {product_manager} |

**Why This Pattern?**
- ✅ Connection pooling (efficient)
- ✅ Works with Aurora Data API (serverless)
- ✅ Flexible security rules in code
- ⚠️ Agent must be trusted (has READ-ONLY admin access)
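
The effect of the `persona = ANY(persona_access)` filter can be simulated on the three sample rows from the Data Structure table. This is a minimal in-memory sketch that uses Python sets in place of the PostgreSQL array column. `visible_rows` is a hypothetical helper, not part of the demo app.

```python
rows = [
    {"id": 1, "content_type": "product_faq",
     "persona_access": {"customer", "support_agent", "product_manager"}},
    {"id": 2, "content_type": "support_ticket",
     "persona_access": {"support_agent", "product_manager"}},
    {"id": 3, "content_type": "analytics",
     "persona_access": {"product_manager"}},
]

def visible_rows(rows, persona):
    """Python equivalent of: WHERE persona = ANY(persona_access)."""
    return [row for row in rows if persona in row["persona_access"]]

for persona in ("customer", "support_agent", "product_manager"):
    print(persona, [row["id"] for row in visible_rows(rows, persona)])
# customer [1]
# support_agent [1, 2]
# product_manager [1, 2, 3]
```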

---

### Model Context Protocol (MCP) = Natural Language → SQL

**Direct Search (what you built):**
```python
results = hybrid_search_rrf("wireless headphones")
# You write SQL → Returns products
```

**MCP Agent Search (demo app):**
```python
results = strands_agent_search(
    "Find budget headphones with good reviews and no complaints",
    persona="support_agent"
)
# Agent writes SQL → Queries products + knowledge_base → Synthesizes answer
```

**Under the hood:**
1. **User Query:** "Find wireless headphones under $100 with no battery complaints"
2. **Agent Analyzes** the natural language query
3. **Tool Execution:**
   - Tool 1: Hybrid search for "wireless headphones" with `WHERE price < 100`
   - Tool 2: Query `knowledge_base` with `WHERE content_type = 'support_ticket' AND 'support_agent' = ANY(persona_access)`
4. **Agent Response:** "Found 5 models. Based on tickets, avoid models X and Y (battery issues). Recommend model Z - 4.5 stars, no complaints."
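
The flow above can be mimicked with plain Python stand-ins for the two tools. Everything here is hypothetical: the product and ticket records are toy data invented for illustration, and `search_products`, `search_tickets`, and `recommend` are placeholder names rather than the demo app's actual tools.

```python
# Toy data, invented for illustration; not taken from the workshop dataset.
PRODUCTS = [
    {"name": "Model X", "price": 79.0, "stars": 4.2},
    {"name": "Model Y", "price": 59.0, "stars": 4.0},
    {"name": "Model Z", "price": 89.0, "stars": 4.5},
    {"name": "Model Pro", "price": 149.0, "stars": 4.8},
]
TICKETS = [
    {"content": "Model X battery drains in an hour", "content_type": "support_ticket",
     "persona_access": ["support_agent", "product_manager"]},
    {"content": "Model Y battery will not charge", "content_type": "support_ticket",
     "persona_access": ["support_agent", "product_manager"]},
]


def search_products(max_price):
    """Stand-in for Tool 1 (hybrid search plus a price filter)."""
    return [p for p in PRODUCTS if p["price"] < max_price]


def search_tickets(persona, keyword):
    """Stand-in for Tool 2 (knowledge_base lookup restricted by persona)."""
    return [
        t for t in TICKETS
        if t["content_type"] == "support_ticket"
        and persona in t["persona_access"]
        and keyword in t["content"].lower()
    ]


def recommend(persona, max_price, complaint_keyword):
    """Combine both tool results: drop products named in matching tickets."""
    candidates = search_products(max_price)
    tickets = search_tickets(persona, complaint_keyword)
    flagged = {
        p["name"] for p in candidates
        if any(p["name"].lower() in t["content"].lower() for t in tickets)
    }
    clean = [p for p in candidates if p["name"] not in flagged]
    return sorted(clean, key=lambda p: p["stars"], reverse=True), sorted(flagged)


recommended, avoided = recommend("support_agent", 100, "battery")
print([p["name"] for p in recommended], avoided)  # ['Model Z'] ['Model X', 'Model Y']
```

Calling `recommend("customer", 100, "battery")` flags nothing, because the customer persona cannot see support tickets. The same question therefore yields a different answer for each persona.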

**MCP Resources:**
- **Aurora MCP Server**: [awslabs.postgres-mcp-server](https://awslabs.github.io/mcp/servers/postgres-mcp-server) - Provides `get_table_schema` and `run_query` tools
- **Strands SDK**: [strandsagents.com](https://strandsagents.com/latest/) - Agent framework with MCP support

---

### 🚀 Try the Live Demo

**Launch from Terminal:**
```bash
cd demo-app
streamlit run streamlit_app.py
```

**What to Try:**
1. **Switch personas** (sidebar) → See different data access
2. **Tab 1: MCP Agent** → Ask natural language questions
3. **Tab 2: Search Comparison** → Compare all methods side-by-side

**Sample Queries by Persona:**
- 👤 Customer: "How do I troubleshoot my device?"
- 🎧 Support Agent: "What are common issues with vacuums?"
- 👔 Product Manager: "Show me sales trends by category"

---

## 💡 Key Takeaways

### Search Method Comparison

| Method | Best For | When to Use |
|--------|----------|-------------|
| **Keyword** | Exact terms, known terminology | SKU searches, technical docs |
| **Fuzzy** | Typo tolerance, spelling variants | User input, mobile queries |
| **Semantic** | Natural language, conceptual matches | Product discovery, support |
| **Hybrid** | Production systems, mixed queries | E-commerce, knowledge bases |
| **Cohere Rerank** | Final top-K refinement | Critical accuracy scenarios |

### Domain-Specific Tuning

| Use Case | Semantic Weight | Keyword Weight | Why |
|----------|----------|---------|-----|
| **E-commerce** | 70% | 30% | Natural discovery + exact brand matches |
| **Support Tickets** | 80% | 20% | Intent matters most |
| **Technical Docs** | 40% | 60% | Precise terminology critical |
| **Part Numbers** | 10% | 90% | Exact matches required |
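
One way to apply weights like these is a weighted form of Reciprocal Rank Fusion, sketched below. This rests on assumptions: the weights scale each method's reciprocal-rank contribution, `k=60` is the commonly used RRF constant, and `weighted_rrf` is a hypothetical helper name. The lab's own weighted fusion may instead combine normalized scores.

```python
def weighted_rrf(semantic_ids, keyword_ids, semantic_weight, keyword_weight, k=60):
    """Fuse two ranked ID lists; each list contributes weight / (k + rank)."""
    scores = {}
    weighted_lists = (
        (semantic_weight, semantic_ids),
        (keyword_weight, keyword_ids),
    )
    for weight, ranked in weighted_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties fall back to the ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical ranked results from each method (best match first).
semantic = ["A", "B", "C"]
keyword = ["C", "D", "A"]

ecommerce = weighted_rrf(semantic, keyword, 0.7, 0.3)
part_numbers = weighted_rrf(semantic, keyword, 0.1, 0.9)
print(ecommerce)     # ['A', 'C', 'B', 'D']
print(part_numbers)  # ['C', 'A', 'D', 'B']
```

With the e-commerce weights the semantic top hit `A` wins. With the part-number weights the keyword top hit `C` moves to first place and the keyword-only result `D` overtakes `B`.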

---

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); 
            color: white; 
            padding: 30px; 
            border-radius: 15px; 
            margin: 30px 0;
            box-shadow: 0 6px 12px rgba(0,0,0,0.2);">
<h2 style="margin: 0 0 15px 0; color: white; font-size: 24px;">🎉 Lab Complete!</h2>
<p style="margin: 0 0 15px 0; font-size: 16px; line-height: 1.8;">
You've successfully implemented production-ready hybrid search combining fuzzy matching, semantic vectors, rank fusion, and ML-based reranking.
</p>
<hr style="border: 0; border-top: 2px solid rgba(255,255,255,0.3); margin: 20px 0;">
<p style="margin: 0 0 10px 0; font-size: 15px;">
<strong>✅ What You Built:</strong>
</p>
<ul style="margin: 10px 0; font-size: 14px; line-height: 1.8;">
<li><strong>Fuzzy Search</strong> with pg_trgm for typo tolerance</li>
<li><strong>Semantic Search</strong> with pgvector and Cohere embeddings</li>
<li><strong>Hybrid RRF</strong> eliminating score normalization challenges</li>
<li><strong>Cohere Rerank</strong> for ML-based result refinement</li>
<li><strong>Interactive Tools</strong> for experimentation and weight tuning</li>
</ul>
<hr style="border: 0; border-top: 2px solid rgba(255,255,255,0.3); margin: 20px 0;">
<p style="margin: 0; font-size: 15px;">
🚀 <strong>Next:</strong> Explore the Streamlit demo to see MCP integration and RLS in action!
</p>
</div>

**Questions?** Flag down an instructor or check the workshop materials.

---