# DAT409: Implement hybrid search with Aurora PostgreSQL for MCP retrieval

---

### 🎯 Workshop Learning Objectives

In this hands-on workshop, you will:

1. **Understand** the fundamental differences between keyword, semantic, and hybrid search
2. **Implement** multiple search strategies using PostgreSQL and pgvector
3. **Compare** tsvector full-text search with pg_trgm trigram matching for keyword search performance
4. **Build** an enterprise-ready hybrid search system with Cohere embeddings
5. **Visualize** search results interactively to understand trade-offs

### 📊 Our Dataset: E-commerce Product Catalog

We're working with a real-world product catalog containing:
- **21,704** products across multiple categories
- Rich product descriptions for semantic understanding
- Metadata including prices, ratings, and reviews
- Pre-generated embeddings for immediate experimentation

### 🔍 Search Methods We'll Explore

| Search Type | Method | Strengths | Limitations |
|------------|--------|-----------|-------------|
| **Keyword Search** | Exact/fuzzy matching | Fast, precise for known terms | Misses semantic meaning |
| **Semantic Search** | Vector embeddings | Understands context & intent | May miss exact matches |
| **Hybrid Search** | Combined approach | Best of both worlds | Requires tuning |

---

Let's begin! 🚀

## 📦 Step 1: Environment Setup & Dependencies

First, let's install and import all necessary libraries.

**Note**: The workshop environment should already have all dependencies installed via the bootstrap script.

In [None]:
# Step 1: Environment Setup & Dependencies

import sys
import os
import subprocess
import warnings
warnings.filterwarnings('ignore')

# Check Python version
print(f"🐍 Python version: {sys.version.split()[0]}")

# Verify we're using Python 3.13 as configured in bootstrap
if not sys.version.startswith('3.13'):
    print(f"⚠️ Warning: Expected Python 3.13, but running {sys.version.split()[0]}")

# Check for requirements.txt in the correct location
requirements_path = '/workshop/lab1-hybrid-search/requirements.txt'
if os.path.exists(requirements_path):
    # Check if packages are already installed
    try:
        import pgvector
        import psycopg
        import pandas
        import numpy
        import boto3
        from dotenv import load_dotenv
        print("✅ All required packages already installed from bootstrap")
    except ImportError as e:
        print(f"📥 Installing dependencies from {requirements_path}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "--user", "-q", "-r", requirements_path])
        print("✅ Dependencies installed")
else:
    print(f"⚠️ Requirements file not found at {requirements_path}")
    print("   Packages should already be installed from bootstrap")

# Import all required libraries
import boto3
import json
import psycopg
from pgvector.psycopg import register_vector
import pandas as pd
import numpy as np
from IPython.display import display, HTML, clear_output
import ipywidgets as widgets
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import time
from typing import Optional
from pathlib import Path

# Load environment variables from .env file created by bootstrap
from dotenv import load_dotenv

env_path = Path('/workshop/.env')
if env_path.exists():
    load_dotenv(env_path, override=True)
    print(f"✅ Loaded environment from {env_path}")
else:
    print("⚠️ No .env file found at /workshop/.env")
    print("   Bootstrap may not have completed successfully")

# Verify database environment is set
db_vars = ['DB_HOST', 'DB_USER', 'DB_PASSWORD', 'DB_NAME']
db_status = []
for var in db_vars:
    value = os.getenv(var)
    if value:
        if var == 'DB_PASSWORD':
            db_status.append(f"  {var}: {'*' * 8}")
        else:
            db_status.append(f"  {var}: {value}")
    else:
        db_status.append(f"  {var}: ❌ Missing")

print("📊 Database configuration:")
print('\n'.join(db_status))

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("\n✅ Environment setup complete!")

## 🔐 Step 2: Database Connection & Model Setup

Let's connect to our Aurora PostgreSQL database and initialize the Amazon Bedrock runtime client used to call the Cohere models.

In [None]:
# Step 2: Database Connection & Model Setup

import os
from pathlib import Path
from dotenv import load_dotenv
import boto3
import json
import psycopg
from pgvector.psycopg import register_vector

# Load environment variables from the correct path
env_path = Path('/workshop/.env')
if env_path.exists():
    load_dotenv(env_path, override=True)
    print(f"✅ Loaded environment from {env_path}")
else:
    print("⚠️ Warning: /workshop/.env not found, using environment variables")

# Configuration
dbhost = os.getenv('DB_HOST')
dbport = os.getenv('DB_PORT', '5432')
dbuser = os.getenv('DB_USER')
dbpass = os.getenv('DB_PASSWORD')
dbname = os.getenv('DB_NAME', 'workshop_db')
region = os.getenv('AWS_REGION', 'us-west-2')

# Verify we have all required credentials
if not all([dbhost, dbuser, dbpass]):
    print("❌ Missing database credentials. Please check your .env file")
    print(f"   DB_HOST: {'✓' if dbhost else '✗'}")
    print(f"   DB_USER: {'✓' if dbuser else '✗'}")
    print(f"   DB_PASSWORD: {'✓' if dbpass else '✗'}")
else:
    print("✅ All database credentials loaded")

# Initialize Bedrock client
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name=region
)

print("\n🔧 Configuration:")
print(f"   • Database: {dbuser}@{dbhost}:{dbport}/{dbname}")
print(f"   • AWS Region: {region}")

# Test database connection
try:
    with psycopg.connect(
        host=dbhost, 
        port=dbport, 
        user=dbuser,
        password=dbpass, 
        dbname=dbname, 
        autocommit=True
    ) as conn:
        register_vector(conn)
        
        # Get PostgreSQL version
        result = conn.execute("SELECT version()").fetchone()
        print(f"   • PostgreSQL: {result[0].split(',')[0]}")
        
        # Check pgvector extension
        result = conn.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector'").fetchone()
        if result:
            print(f"   • pgvector: v{result[0]}")
        
        # Check data is loaded
        result = conn.execute("""
            SELECT COUNT(*) as count, 
                   COUNT(embedding) as with_embeddings 
            FROM bedrock_integration.product_catalog
        """).fetchone()
        if result and result[0] > 0:
            print(f"   • Data: {result[0]:,} products ({result[1]:,} with embeddings)")
        else:
            print("   ⚠️ No data found in product_catalog table")
            
except Exception as e:
    print(f"❌ Database connection failed: {e}")
    print("   Check that the database is running and credentials are correct")

# Embedding function for Cohere via Bedrock
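def generate_embedding_cohere(text: str, input_type: str = "search_query") -> list[float] | None:
    """Generate an embedding with Cohere via Amazon Bedrock, falling back to Titan on error.

    Note: Titan and Cohere vectors come from different models. If the stored catalog
    embeddings were generated with Cohere, a Titan fallback vector is not directly
    comparable with them, so treat fallback results with caution.
    """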
    if not text:
        return None
        
    try:
        body = json.dumps({
            "texts": [text],
            "input_type": input_type,
            "embedding_types": ["float"],
            "truncate": "END"
        })
        
        response = bedrock_runtime.invoke_model(
            modelId='cohere.embed-english-v3',
            body=body,
            accept='application/json',
            contentType='application/json'
        )
        
        response_body = json.loads(response['body'].read())
        if 'embeddings' in response_body and 'float' in response_body['embeddings']:
            return response_body['embeddings']['float'][0]
        elif 'embeddings' in response_body:
            return response_body['embeddings'][0]
            
    except Exception as e:
        print(f"   Cohere embedding failed: {e}, trying Titan...")
        # Fallback to Titan
        try:
            payload = json.dumps({'inputText': text[:8000]})
            response = bedrock_runtime.invoke_model(
                body=payload,
                modelId='amazon.titan-embed-text-v2:0',
                accept="application/json",
                contentType="application/json"
            )
            return json.loads(response.get("body").read()).get("embedding")
        except Exception as e2:
            print(f"   Titan embedding also failed: {e2}")
            return None

# Display models being used
print("\n🤖 Models via Amazon Bedrock:")
print("   • cohere.embed-english-v3 (1024-dim embeddings)")
print("   • amazon.titan-embed-text-v2:0 (fallback embeddings)")
print("   • cohere.rerank-v3-5:0 (result re-ranking)")

print("\n✅ Database and models ready for hybrid search!")
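The embedding function above accepts two response layouts: `embeddings` as a dictionary keyed by embedding type, or `embeddings` as a plain list of vectors. A minimal sketch of that parsing step in isolation follows. The helper name `extract_first_embedding` is hypothetical and not part of the workshop code, and the sample payloads are made up for illustration.

```python
from __future__ import annotations


def extract_first_embedding(response_body: dict) -> list[float] | None:
    """Return the first embedding vector from a parsed response body, or None.

    Handles the two layouts the notebook code checks for:
      {"embeddings": {"float": [[...], ...]}}
      {"embeddings": [[...], ...]}
    """
    embeddings = response_body.get("embeddings")
    if isinstance(embeddings, dict):
        # Keyed by embedding type; only the "float" vectors are used here.
        vectors = embeddings.get("float") or []
    elif isinstance(embeddings, list):
        vectors = embeddings
    else:
        return None
    return vectors[0] if vectors else None


# Illustrative payloads (not real model output)
print(extract_first_embedding({"embeddings": {"float": [[0.1, 0.2]]}}))
print(extract_first_embedding({"embeddings": [[0.3, 0.4]]}))
print(extract_first_embedding({}))
```

Keeping the parsing separate from the network call makes it possible to check both layouts without calling Bedrock.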

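Once text is embedded, the search functions in this notebook compare vectors with pgvector's `<=>` operator, which returns cosine distance, and convert it to a similarity with `1 - distance`. The sketch below shows the same arithmetic in plain Python. The function names and the toy two-dimensional vectors are assumptions for illustration only; real embeddings here have 1024 dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Similarity is undefined for a zero vector; return 0.0 in this sketch.
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance, the quantity pgvector's <=> operator computes."""
    return 1.0 - cosine_similarity(a, b)


query = [1.0, 0.0]
same_direction = [1.0, 0.0]
orthogonal = [0.0, 1.0]

print(cosine_similarity(query, same_direction))  # identical direction
print(cosine_distance(query, orthogonal))        # unrelated direction
```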
## 📊 Step 3: Data Overview & Verification

Let's verify that our pre-loaded product catalog is ready for searching.

In [None]:
# Step 3: Data Overview & Verification

import psycopg
from pgvector.psycopg import register_vector
import pandas as pd
from IPython.display import display, HTML

# Connect and verify data
with psycopg.connect(
    host=dbhost, port=dbport, user=dbuser,
    password=dbpass, dbname=dbname, autocommit=True
) as conn:
    register_vector(conn)
    
    # Check if table exists
    exists = conn.execute("""
        SELECT EXISTS (
            SELECT 1 FROM information_schema.tables 
            WHERE table_schema = 'bedrock_integration' 
            AND table_name = 'product_catalog'
        );
    """).fetchone()[0]
    
    if not exists:
        print("❌ Data not found. Please run: python parallel-fast-loader.py")
    else:
        # Get statistics
        stats = conn.execute("""
            SELECT 
                COUNT(*) as total,
                COUNT(embedding) as with_embeddings,
                COUNT(DISTINCT category_name) as categories,
                AVG(price)::NUMERIC(10,2) as avg_price
            FROM bedrock_integration.product_catalog;
        """).fetchone()
        
        print("📊 DATA OVERVIEW")
        print("-" * 40)
        print(f"Total Products: {stats[0]:,}")
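        # Guard against an empty table to avoid dividing by zero
        pct_embedded = stats[1] / stats[0] * 100 if stats[0] else 0
        print(f"With Embeddings: {stats[1]:,} ({pct_embedded:.0f}%)")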
        print(f"Categories: {stats[2]}")
        print(f"Avg Price: ${stats[3]}")
        
        # Show top categories
        print("\n📦 TOP CATEGORIES")
        print("-" * 40)
        categories = conn.execute("""
            SELECT category_name, COUNT(*) as count
            FROM bedrock_integration.product_catalog
            GROUP BY category_name
            ORDER BY count DESC
            LIMIT 5;
        """).fetchall()
        
        for cat, count in categories:
            print(f"  • {cat}: {count:,}")
        
        # Show indexes
        print("\n🔍 INDEXES")
        print("-" * 40)
        indexes = conn.execute("""
            SELECT indexname FROM pg_indexes
            WHERE schemaname = 'bedrock_integration'
            AND tablename = 'product_catalog';
        """).fetchall()
        
        for idx in indexes:
            name = idx[0]
            if 'embedding' in name:
                print(f"  • {name} (Vector search)")
            elif 'fts' in name:
                print(f"  • {name} (Full-text search)")
            elif 'trgm' in name:
                print(f"  • {name} (Fuzzy search)")
            elif 'pkey' in name:
                print(f"  • {name} (Primary key)")
            elif 'category' in name:
                print(f"  • {name} (Category filter)")
            elif 'price' in name:
                print(f"  • {name} (Price range)")
            else:
                print(f"  • {name}")
        
        print("\n✅ Database ready for hybrid search!")

## 🔧 Step 4: Implementing Search Functions

Now let's implement different search methods and compare their performance.

In [None]:
# Step 4: Search Function Implementations

# ============================================================
# EMBEDDING GENERATION
# ============================================================

def generate_embedding(text: str, input_type: str = "search_query") -> list[float] | None:
    """Generate embeddings using Cohere embed-english-v3"""
    try:
        body = json.dumps({
            "texts": [text],
            "input_type": input_type,
            "embedding_types": ["float"],
            "truncate": "END"
        })
        
        response = bedrock_runtime.invoke_model(
            modelId='cohere.embed-english-v3',
            body=body,
            accept='application/json',
            contentType='application/json'
        )
        
        response_body = json.loads(response['body'].read())
        
        if 'embeddings' in response_body and 'float' in response_body['embeddings']:
            return response_body['embeddings']['float'][0]
        elif 'embeddings' in response_body:
            return response_body['embeddings'][0]
        return None
            
    except Exception as e:
        print(f"Error generating embedding: {e}")
        return None

# ============================================================
# 1. KEYWORD SEARCH - FULL TEXT
# ============================================================

def keyword_search(query: str, limit: int = 10) -> list[dict]:
    """PostgreSQL Full-Text Search using TSVector"""
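    # Pass dbname explicitly; otherwise the connection falls back to the default database
    with psycopg.connect(
        host=dbhost, port=dbport, user=dbuser,
        password=dbpass, dbname=dbname, autocommit=True
    ) as conn: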
        results = conn.execute("""
            SELECT 
                "productId",
                product_description,
                category_name,
                price,
                stars,
                reviews,
                imgurl as "imgUrl",
                ts_rank_cd(
                    to_tsvector('english', product_description), 
                    plainto_tsquery('english', %s)
                ) as rank
            FROM bedrock_integration.product_catalog
            WHERE to_tsvector('english', product_description) 
                  @@ plainto_tsquery('english', %s)
            ORDER BY rank DESC
            LIMIT %s;
        """, (query, query, limit)).fetchall()
        
        return [{
            'productId': r[0],
            'description': r[1][:200] + '...',
            'category': r[2],
            'price': float(r[3]) if r[3] else 0,
            'stars': float(r[4]) if r[4] else 0,
            'reviews': int(r[5]) if r[5] else 0,
            'imgUrl': r[6],
            'score': float(r[7]) if r[7] else 0,
            'method': 'Keyword'
        } for r in results]

# ============================================================
# 2. FUZZY SEARCH - TYPO TOLERANCE
# ============================================================

def fuzzy_search(query: str, limit: int = 10) -> list[dict]:
    """PostgreSQL Trigram Search for typo tolerance"""
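    # Pass dbname explicitly; otherwise the connection falls back to the default database
    with psycopg.connect(
        host=dbhost, port=dbport, user=dbuser,
        password=dbpass, dbname=dbname, autocommit=True
    ) as conn: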
        conn.execute("SET pg_trgm.similarity_threshold = 0.1;")
        
        results = conn.execute("""
            SELECT 
                "productId",
                product_description,
                category_name,
                price,
                stars,
                reviews,
                imgurl as "imgUrl",
                similarity(lower(product_description), lower(%s)) as sim
            FROM bedrock_integration.product_catalog
            WHERE lower(product_description) %% lower(%s)
            ORDER BY sim DESC
            LIMIT %s;
        """, (query, query, limit)).fetchall()
        
        return [{
            'productId': r[0],
            'description': r[1][:200] + '...',
            'category': r[2],
            'price': float(r[3]) if r[3] else 0,
            'stars': float(r[4]) if r[4] else 0,
            'reviews': int(r[5]) if r[5] else 0,
            'imgUrl': r[6],
            'score': float(r[7]) if r[7] else 0,
            'method': 'Fuzzy'
        } for r in results]

# ============================================================
# 3. SEMANTIC SEARCH - VECTOR SIMILARITY
# ============================================================

def semantic_search(query: str, limit: int = 10) -> list[dict]:
    """Semantic Search using Cohere embeddings"""
    
    # Generate query embedding
    query_embedding = generate_embedding(query, "search_query")
    if not query_embedding:
        return []
    
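    # Pass dbname explicitly; otherwise the connection falls back to the default database
    with psycopg.connect(
        host=dbhost, port=dbport, user=dbuser,
        password=dbpass, dbname=dbname, autocommit=True
    ) as conn: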
        register_vector(conn)
        
        results = conn.execute("""
            SELECT 
                "productId",
                product_description,
                category_name,
                price,
                stars,
                reviews,
                imgurl as "imgUrl",
                1 - (embedding <=> %s::vector) as similarity
            FROM bedrock_integration.product_catalog
            WHERE embedding IS NOT NULL
            ORDER BY embedding <=> %s::vector
            LIMIT %s;
        """, (query_embedding, query_embedding, limit)).fetchall()
        
        return [{
            'productId': r[0],
            'description': r[1][:200] + '...',
            'category': r[2],
            'price': float(r[3]) if r[3] else 0,
            'stars': float(r[4]) if r[4] else 0,
            'reviews': int(r[5]) if r[5] else 0,
            'imgUrl': r[6],
            'score': float(r[7]) if r[7] else 0,
            'method': 'Semantic'
        } for r in results]

# ============================================================
# 4. HYBRID SEARCH - BEST OF BOTH
# ============================================================

def hybrid_search(
    query: str,
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3,
    limit: int = 10
) -> list[dict]:
    """
    Hybrid Search combining semantic and keyword approaches using weighted score fusion.
    
    IMPORTANT: This implementation intentionally does NOT normalize scores to demonstrate
    a common production challenge. Different search methods produce vastly different score ranges:
    - Semantic (cosine similarity): typically 0.7-1.0 for good matches
    - Keyword (ts_rank_cd): typically 0.01-0.1 even for good matches
    
    This causes semantic scores to dominate the final ranking even with equal weights.
    
    Example with 70/30 weights:
      Semantic: 0.90 × 0.7 = 0.63
      Keyword:  0.05 × 0.3 = 0.015
      Combined: 0.645 (semantic dominates!)
    
    Production solutions:
    1. Cohere Rerank (ML-based, no normalization needed) - demonstrated in this notebook
    2. Reciprocal Rank Fusion (RRF) - rank-based, no normalization needed
    3. Min-Max normalization - scale each method's scores to [0,1] before weighting
    """
    
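    # Normalize weights so they sum to 1 (fall back to the defaults if both are 0)
    total = semantic_weight + keyword_weight
    if total <= 0:
        semantic_weight, keyword_weight, total = 0.7, 0.3, 1.0
    semantic_weight = semantic_weight / total
    keyword_weight = keyword_weight / total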
    
    # Get results from both methods
    semantic_results = semantic_search(query, limit * 2)
    keyword_results = keyword_search(query, limit * 2)
    
    # Combine and score
    product_scores = {}
    product_data = {}
    
    # Process semantic results
    for result in semantic_results:
        pid = result['productId']
        product_scores[pid] = result['score'] * semantic_weight
        product_data[pid] = result
    
    # Process keyword results
    for result in keyword_results:
        pid = result['productId']
        if pid in product_scores:
            product_scores[pid] += result['score'] * keyword_weight
        else:
            product_scores[pid] = result['score'] * keyword_weight
            product_data[pid] = result
    
    # Sort and return top results
    sorted_products = sorted(product_scores.items(), key=lambda x: x[1], reverse=True)[:limit]
    
    results = []
    for pid, score in sorted_products:
        product = product_data[pid].copy()
        product['score'] = score
        product['method'] = 'Hybrid'
        results.append(product)
    
    return results

# ============================================================
# 5. HYBRID SEARCH - RRF (Reciprocal Rank Fusion)
# ============================================================

def hybrid_search_rrf(
    query: str,
    k: int = 60,
    limit: int = 10
) -> list[dict]:
    """
    Hybrid Search using Reciprocal Rank Fusion (RRF).
    
    RRF is a rank-based fusion method that does NOT require score normalization.
    Instead of combining raw scores, it combines ranks using the formula:
    
    RRF_score = sum(1 / (k + rank)) for each method
    
    Where k is a smoothing constant (typically 60): larger values shrink the gap between
    top-ranked and lower-ranked results, so no single method's top hit dominates.
    
    Benefits:
    - No score normalization needed
    - Robust to different score scales
    - Simple and effective
    - Used by major search engines
    """
    
    query_embedding = generate_embedding(query, "search_query")
    if not query_embedding:
        return []
    
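    # Pass dbname explicitly; otherwise the connection falls back to the default database
    with psycopg.connect(
        host=dbhost, port=dbport, user=dbuser,
        password=dbpass, dbname=dbname, autocommit=True
    ) as conn: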
        register_vector(conn)
        
        results = conn.execute("""
            WITH semantic_search AS (
                SELECT 
                    "productId",
                    product_description,
                    category_name,
                    price,
                    stars,
                    reviews,
                    imgurl,
                    RANK() OVER (ORDER BY embedding <=> %s::vector) AS rank
                FROM bedrock_integration.product_catalog
                WHERE embedding IS NOT NULL
                ORDER BY embedding <=> %s::vector
                LIMIT 20
            ),
            keyword_search AS (
                SELECT 
                    "productId",
                    product_description,
                    category_name,
                    price,
                    stars,
                    reviews,
                    imgurl,
                    RANK() OVER (ORDER BY ts_rank_cd(to_tsvector('english', product_description), query) DESC) AS rank
                FROM bedrock_integration.product_catalog, plainto_tsquery('english', %s) query
                WHERE to_tsvector('english', product_description) @@ query
                ORDER BY rank
                LIMIT 20
            )
            SELECT
                COALESCE(s."productId", k."productId") AS product_id,
                COALESCE(s.product_description, k.product_description) AS description,
                COALESCE(s.category_name, k.category_name) AS category,
                COALESCE(s.price, k.price) AS price,
                COALESCE(s.stars, k.stars) AS stars,
                COALESCE(s.reviews, k.reviews) AS reviews,
                COALESCE(s.imgurl, k.imgurl) AS imgurl,
                (COALESCE(1.0 / (%s + s.rank), 0.0) + COALESCE(1.0 / (%s + k.rank), 0.0)) AS rrf_score
            FROM semantic_search s
            FULL OUTER JOIN keyword_search k ON s."productId" = k."productId"
            ORDER BY rrf_score DESC
            LIMIT %s
        """, (query_embedding, query_embedding, query, k, k, limit)).fetchall()
        
        return [{
            'productId': r[0],
            'description': r[1][:200] + '...',
            'category': r[2],
            'price': float(r[3]) if r[3] else 0,
            'stars': float(r[4]) if r[4] else 0,
            'reviews': int(r[5]) if r[5] else 0,
            'imgUrl': r[6],
            'score': float(r[7]) if r[7] else 0,
            'method': 'Hybrid-RRF'
        } for r in results]

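# ============================================================
# COHERE RERANK FUNCTION
# ============================================================

def rerank_results(query: str, results: list[dict], top_k: int = 5) -> list[dict]: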
    """Re-rank search results using Cohere Rerank model"""
    if not results:
        return []
    
    try:
        # Prepare documents for reranking
        documents = [r['description'] for r in results]
        
        body = json.dumps({
            "query": query,
            "documents": documents,
            "top_n": top_k,
            "api_version": 2
        })
        
        response = bedrock_runtime.invoke_model(
            modelId='cohere.rerank-v3-5:0',
            body=body,
            accept='application/json',
            contentType='application/json'
        )
        
        response_body = json.loads(response['body'].read())
        
        # Reorder results based on rerank scores
        reranked = []
        for item in response_body.get('results', []):
            idx = item['index']
            result = results[idx].copy()
            result['rerank_score'] = item['relevance_score']
            reranked.append(result)
        
        return reranked
        
    except Exception as e:
        print(f"Reranking failed: {e}")
        return results[:top_k]

print("✅ Search functions loaded successfully!")
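The `hybrid_search` docstring lists min-max normalization as one way to stop semantic scores from dominating the weighted sum. A minimal sketch of that idea follows, assuming each method's results are available as a `{productId: score}` dictionary. The helper names `min_max_normalize` and `weighted_fusion` and the sample scores are hypothetical and not part of the workshop code.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Scale one method's scores to [0, 1] so methods become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every result as a full-strength match.
        return {pid: 1.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}


def weighted_fusion(
    semantic: dict[str, float],
    keyword: dict[str, float],
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3,
) -> list[tuple[str, float]]:
    """Normalize each score set, then combine with weights that sum to 1."""
    total = semantic_weight + keyword_weight
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    sw, kw = semantic_weight / total, keyword_weight / total
    sem_n, kw_n = min_max_normalize(semantic), min_max_normalize(keyword)
    fused = {
        pid: sw * sem_n.get(pid, 0.0) + kw * kw_n.get(pid, 0.0)
        for pid in set(sem_n) | set(kw_n)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative scores on very different scales (made-up values)
semantic_scores = {"A": 0.90, "B": 0.80}
keyword_scores = {"B": 0.10, "C": 0.02}
for pid, score in weighted_fusion(semantic_scores, keyword_scores):
    print(pid, round(score, 3))
```

One trade-off to keep in mind: min-max scaling maps the weakest result in each list to 0, so it is sensitive to outliers and to how many candidates each method returns.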

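The RRF formula in the `hybrid_search_rrf` docstring, `sum(1 / (k + rank))`, can also be sketched in plain Python to see how ranks combine without looking at raw scores. The function name `reciprocal_rank_fusion` and the sample product IDs are assumptions for illustration; the notebook itself computes RRF inside SQL.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of product IDs with Reciprocal Rank Fusion.

    Each list is ordered best-first. An item's fused score is the sum of
    1 / (k + rank) over every list it appears in, with rank starting at 1.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, pid in enumerate(ranking, start=1):
            scores[pid] = scores.get(pid, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up rankings: "B" appears in both lists, so it rises to the top
semantic_ranking = ["A", "B", "C"]
keyword_ranking = ["B", "D"]

for pid, score in reciprocal_rank_fusion([semantic_ranking, keyword_ranking]):
    print(pid, round(score, 5))
```

Because only ranks are used, a product found by both methods outranks one that tops a single list, regardless of how different the underlying score scales are.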
## 🎮 Step 5: Interactive Search Interface

Now let's create an interactive interface to explore and compare different search methods.

In [None]:
# Step 5: Interactive Search Interface with Product Display

def create_search_interface():
    """Create an interactive search interface with proper product display"""
    import ipywidgets as widgets
    from IPython.display import display, HTML
    
    # Professional style definitions
    style = """
    <style>
        .search-container { padding: 20px; background: #f8f9fa; border-radius: 10px; }
        .result-card { 
            margin: 15px 0; padding: 20px; background: white; 
            border-radius: 8px; border: 1px solid #e3e6e8;
            transition: all 0.3s; position: relative;
            box-shadow: 0 1px 2px rgba(0,0,0,0.05);
        }
        .result-card:hover { 
            box-shadow: 0 8px 20px rgba(0,0,0,0.12); 
            transform: translateY(-2px);
            border-color: #ff9900;
        }
        .method-badge {
            position: absolute; top: 15px; right: 15px;
            padding: 5px 12px; border-radius: 20px;
            font-size: 11px; font-weight: bold;
            text-transform: uppercase;
        }
        .keyword { background: #e3f2fd; color: #1565c0; }
        .fuzzy { background: #fce4ec; color: #c2185b; }
        .semantic { background: #e8f5e9; color: #2e7d32; }
        .hybrid { background: #fff3e0; color: #e65100; }
        
        .product-content { display: flex; gap: 20px; }
        .product-image {
            flex-shrink: 0; width: 150px; height: 150px;
            object-fit: contain; border: 1px solid #e3e6e8;
            border-radius: 4px; padding: 10px; background: white;
        }
        .product-details { flex-grow: 1; }
        .product-title {
            font-size: 16px; color: #0066c0; text-decoration: none;
            font-weight: 500; line-height: 1.4; display: block; margin-bottom: 8px;
        }
        .product-title:hover { color: #c7511f; text-decoration: underline; }
        .product-price {
            font-size: 21px; color: #B12704; font-weight: 500; margin: 8px 0;
        }
        .product-rating {
            display: flex; align-items: center; gap: 8px; margin: 8px 0;
        }
        .stars { color: #ff9900; }
        .product-category { color: #565959; font-size: 12px; margin-top: 8px; }
        .score-info {
            margin-top: 12px; padding-top: 12px; border-top: 1px solid #e3e6e8;
            display: flex; justify-content: space-between; align-items: center;
        }
        .score-bar {
            height: 6px; background: #e9ecef; border-radius: 3px;
            overflow: hidden; flex-grow: 1; margin-right: 10px; max-width: 200px;
        }
        .score-fill {
            height: 100%; background: linear-gradient(90deg, #ff9900, #ff6600);
            transition: width 0.5s;
        }
        .score-text { color: #565959; font-size: 12px; font-weight: 500; }
        .comparison-grid {
            display: grid; grid-template-columns: repeat(auto-fit, minmax(400px, 1fr));
            gap: 20px; margin-top: 20px;
        }
        .no-results {
            padding: 40px; text-align: center; color: #565959;
            background: #f7f8f8; border-radius: 8px;
        }
    </style>
    """
    
    # Widget definitions
    query_input = widgets.Text(
        value='',
        placeholder='Try "Apple AirPods" or "coffee maker" or "laptop bag"...',
        description='Search:',
        style={'description_width': '80px'},
        layout=widgets.Layout(width='700px')
    )
    
    search_method = widgets.RadioButtons(
        options=[
            ('Keyword (Exact Match)', 'keyword'),
            ('Fuzzy (Typo Tolerance)', 'fuzzy'),
            ('Semantic (Conceptual)', 'semantic'),
            ('Hybrid (Combined)', 'hybrid'),
            ('Hybrid-RRF (Rank Fusion)', 'hybrid_rrf'),
            ('🔍 Compare All Methods', 'compare')
        ],
        value='compare',
        description='Method:',
        style={'description_width': '80px'}
    )
    
    # Hybrid search weight sliders
    semantic_weight = widgets.FloatSlider(
        value=0.7, min=0, max=1, step=0.1,
        description='Semantic:',
        style={'description_width': '80px'},
        layout=widgets.Layout(width='350px')
    )
    
    keyword_weight = widgets.FloatSlider(
        value=0.3, min=0, max=1, step=0.1,
        description='Keyword:',
        style={'description_width': '80px'},
        layout=widgets.Layout(width='350px')
    )
    
    results_limit = widgets.IntSlider(
        value=3, min=1, max=10, step=1,
        description='Results:',
        style={'description_width': '80px'},
        layout=widgets.Layout(width='300px')
    )
    
    search_button = widgets.Button(
        description='🔍 Search Products',
        button_style='primary',
        layout=widgets.Layout(width='200px', height='40px')
    )
    
    rerank_checkbox = widgets.Checkbox(
        value=False,
        description='Use Cohere Rerank',
        style={'description_width': 'initial'}
    )
    
    results_output = widgets.Output()
    
    # Example queries that demonstrate real differences
    example_queries = [
        # Exact keyword matches
        ("wireless bluetooth headphones", "Common Terms", "keyword"),
        ("stainless steel water bottle", "Product Type", "keyword"),
        
        # Conceptual searches
        ("something to keep coffee hot all day", "Problem Solving", "semantic"),
        ("gift for someone who loves cooking", "Gift Ideas", "semantic"),
        
        # Typo tolerance
        ("wireles blutooth hedphones", "With Typos", "fuzzy"),
        ("stainles steel watter botle", "Misspellings", "fuzzy"),
        
        # Balanced hybrid (RRF excels here)
        ("durable laptop backpack with USB charging", "Multi-Feature", "hybrid_rrf"),
        ("ergonomic office chair under 300 dollars", "Specs + Price", "hybrid_rrf"),
        
        # Mixed queries
        ("organic sustainable water bottle", "Features + Product", "hybrid"),
        ("affordable noise canceling headphones under 200", "Specs + Budget", "hybrid"),
        
        # Activity based
        ("equipment for home yoga practice", "Activity Based", "semantic"),
        ("tools for remote work from home", "Use Case", "semantic")
    ]
    
    def format_result(result: dict, method_class: str = '') -> str:
        """Format a single search result with full product display"""
        # Extract product details
        product_id = result.get('productId', 'Unknown')
        description = result.get('description', 'No description available')
        price = result.get('price', 0)
        stars = result.get('stars', 0)
        reviews = result.get('reviews', 0)
        category = result.get('category', 'Unknown Category')
        score = result.get('score', 0)
        rerank_score = result.get('rerank_score', None)
        img_url = result.get('imgUrl', '')
        
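        # Create star display (treat a missing rating as 0 and clamp to the 0-5 range)
        filled_stars = max(0, min(5, int(stars or 0)))
        star_display = '★' * filled_stars + '☆' * (5 - filled_stars)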
        
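        # Generate Amazon search link; URL-encode the terms so punctuation
        # in a description cannot break the query string
        from urllib.parse import quote_plus
        search_terms = ' '.join(description.split()[:5])
        link_url = f"https://www.amazon.com/s?k={quote_plus(search_terms)}"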
        
        # Calculate score percentage for visual bar
        display_score = rerank_score if rerank_score is not None else score
        score_percent = min(display_score * 100, 100) if display_score > 0 else 0
        
        # Score label
        score_label = "Rerank Score" if rerank_score is not None else "Relevance"
        
        # Simple direct image embed exactly like Part 2 notebook
        return f"""
        <div class="result-card">
            <div class="method-badge {method_class}">{result.get('method', 'Unknown')}</div>
            
            <div class="product-content">
                <img src="{img_url}" style="width: 150px; height: 150px; object-fit: contain; border: 1px solid #e3e6e8; border-radius: 4px; padding: 10px; background: white;">
                
                <div class="product-details">
                    <a href="{link_url}" target="_blank" class="product-title">
                        {description}
                    </a>
                    
                    <div class="product-price">${price:.2f}</div>
                    
                    <div class="product-rating">
                        <span class="stars">{star_display}</span>
                        <span style="color: #007185; font-size: 14px;">({reviews:,} reviews)</span>
                    </div>
                    
                    <div class="product-category">Category: {category}</div>
                    
                    <div class="score-info">
                        <div style="display: flex; align-items: center; flex-grow: 1;">
                            <div class="score-bar">
                                <div class="score-fill" style="width: {score_percent}%"></div>
                            </div>
                            <span class="score-text">{score_label}: {display_score:.3f}</span>
                        </div>
                        <a href="{link_url}" target="_blank" style="color: #ff9900; text-decoration: none; font-size: 13px;">
                            View on Amazon →
                        </a>
                    </div>
                </div>
            </div>
        </div>
        """
    
    def set_example_query(query: str, method: str | None = None):
        """Set an example query and optionally the search method"""
        query_input.value = query
        if method:
            search_method.value = method
    
    # Create example buttons
    example_buttons = []
    for query, label, best_method in example_queries:
        btn = widgets.Button(
            description=f"{label}: {query[:30]}..." if len(query) > 30 else f"{label}: {query}",
            layout=widgets.Layout(width='auto', margin='2px'),
            tooltip=f"Best with: {best_method}"
        )
        btn.on_click(lambda b, q=query, m=best_method: set_example_query(q, m))
        example_buttons.append(btn)
    
    def on_search_clicked(b):
        """Handle search button click"""
        results_output.clear_output()
        
        with results_output:
            display(HTML(style))
            
            query = query_input.value
            method = search_method.value
            limit = results_limit.value
            use_rerank = rerank_checkbox.value
            
            if not query:
                display(HTML('<div class="no-results">Please enter a search query!</div>'))
                return
            
            display(HTML(f'<h3 style="color: #0f1111;">🔍 Results for: "{query}"</h3>'))
            
            if method == 'compare':
                # Compare all methods
                methods_to_compare = [
                    ('Keyword (Exact)', keyword_search, 'keyword'),
                    ('Fuzzy (Typos)', fuzzy_search, 'fuzzy'),
                    ('Semantic (Cohere)', semantic_search, 'semantic'),
                    ('Hybrid (70/30)', lambda q, l: hybrid_search(q, 0.7, 0.3, l), 'hybrid'),
                    ('Hybrid-RRF (k=60)', lambda q, l: hybrid_search_rrf(q, 60, l), 'hybrid')
                ]
                
                # Method colors
                method_colors = {
                    'keyword': '1565c0',
                    'fuzzy': 'c2185b',
                    'semantic': '2e7d32',
                    'hybrid': 'e65100'
                }
                
                html_output = '<div class="comparison-grid">'
                
                for method_name, func, css_class in methods_to_compare:
                    border_color = method_colors.get(css_class, '666666')
                    html_output += f'<div><h4 style="color: #0f1111; border-bottom: 2px solid #{border_color}; padding-bottom: 8px; margin-bottom: 15px;">{method_name}</h4>'
                    
                    try:
                        import time
                        start = time.time()
                        results = func(query, limit)
                        elapsed = time.time() - start
                        
                        # Apply reranking if enabled
                        if use_rerank and results:
                            results = rerank_results(query, results, min(len(results), 2))
                        
                        if results:
                            html_output += f'<p style="color: #565959; font-size: 12px;">Found {len(results)} results in {elapsed:.3f}s</p>'
                            for result in results[:2]:  # Show top 2 per method
                                html_output += format_result(result, css_class)
                        else:
                            html_output += '<div class="no-results">No results found with this method</div>'
                            
                    except Exception as e:
                        html_output += f'<div class="no-results">Error: {str(e)}</div>'
                    
                    html_output += '</div>'
                
                html_output += '</div>'
                display(HTML(html_output))
                
            else:
                # Single method search
                try:
                    import time
                    start = time.time()
                    
                    if method == 'keyword':
                        results = keyword_search(query, limit)
                        css_class = 'keyword'
                        method_name = 'Keyword (Exact Match)'
                    elif method == 'fuzzy':
                        results = fuzzy_search(query, limit)
                        css_class = 'fuzzy'
                        method_name = 'Fuzzy (Typo Tolerance)'
                    elif method == 'semantic':
                        results = semantic_search(query, limit)
                        css_class = 'semantic'
                        method_name = 'Semantic Search (Cohere)'
                    elif method == 'hybrid':
                        results = hybrid_search(
                            query, 
                            semantic_weight.value,
                            keyword_weight.value,
                            limit
                        )
                        css_class = 'hybrid'
                        method_name = f'Hybrid (S:{semantic_weight.value:.1f}/K:{keyword_weight.value:.1f})'
                    elif method == 'hybrid_rrf':
                        results = hybrid_search_rrf(query, 60, limit)
                        css_class = 'hybrid'
                        method_name = 'Hybrid-RRF (k=60)'
                    
                    elapsed = time.time() - start
                    
                    # Apply reranking if enabled
                    if use_rerank and results:
                        rerank_start = time.time()
                        results = rerank_results(query, results, len(results))
                        rerank_time = time.time() - rerank_start
                        total_time = elapsed + rerank_time
                        
                        display(HTML(f'''
                            <p style="color: #565959;">
                                Method: <strong>{method_name}</strong> | 
                                Search: <strong>{elapsed:.3f}s</strong> | 
                                Rerank: <strong>{rerank_time:.3f}s</strong> |
                                Total: <strong>{total_time:.3f}s</strong> | 
                                Results: <strong>{len(results)}</strong>
                            </p>
                        '''))
                    else:
                        display(HTML(f'''
                            <p style="color: #565959;">
                                Method: <strong>{method_name}</strong> | 
                                Time: <strong>{elapsed:.3f}s</strong> | 
                                Results: <strong>{len(results)}</strong>
                            </p>
                        '''))
                    
                    if results:
                        for result in results:
                            display(HTML(format_result(result, css_class)))
                    else:
                        display(HTML('<div class="no-results">No products found. Try a different search term or method.</div>'))
                        
                except Exception as e:
                    display(HTML(f'<div class="no-results">Error: {str(e)}</div>'))
                    import traceback
                    print(traceback.format_exc())
    
    search_button.on_click(on_search_clicked)
    
    # Create status display for weights
    weight_status = widgets.HTML(
        value="<div style='padding: 5px; font-size: 0.9em; color: #2E8B57;'>✓ Weights sum to 1.0</div>"
    )

    def validate_and_update_weights(change):
        current_sum = round(semantic_weight.value + keyword_weight.value, 1)
        
        if current_sum > 1:
            # If semantic weight was changed
            if change.owner == semantic_weight:
                keyword_weight.value = max(0, round(1 - semantic_weight.value, 1))
            # If keyword weight was changed
            else:
                semantic_weight.value = max(0, round(1 - keyword_weight.value, 1))
            
            current_sum = round(semantic_weight.value + keyword_weight.value, 1)
        
        # Update status display
        if current_sum > 1:
            weight_status.value = f"<div style='padding: 5px; font-size: 0.9em; color: #DC143C;'>⚠️ Sum exceeds 1 (Current: {current_sum})</div>"
        elif current_sum == 1:
            weight_status.value = f"<div style='padding: 5px; font-size: 0.9em; color: #2E8B57;'>✓ Weights sum to {current_sum}</div>"
        else:
            weight_status.value = f"<div style='padding: 5px; font-size: 0.9em; color: #DAA520;'>ℹ️ Sum is {current_sum}</div>"

    # Observe changes in both sliders
    semantic_weight.observe(validate_and_update_weights, names='value')
    keyword_weight.observe(validate_and_update_weights, names='value')
    
    # Layout
    display(HTML("""
        <style>
            .adaptive-title { 
                color: #0f1111; 
            }
            @media (prefers-color-scheme: dark) {
                .adaptive-title { color: #e3e6e8; }
            }
            body.vscode-dark .adaptive-title,
            body.vscode-high-contrast .adaptive-title,
            .jp-Notebook-dark .adaptive-title {
                color: #e3e6e8;
            }
        </style>
        <h2 class="adaptive-title">🛍️ Amazon Product Search Comparison</h2>
        <div style="background: #f7f8f8; padding: 15px; border-radius: 8px; margin: 15px 0;">
            <h4 style="color: #0f1111; margin-top: 0;">Search Method Strengths:</h4>
            <ul style="color: #565959; margin: 10px 0;">
                <li><strong style="color: #1565c0;">Keyword:</strong> Perfect for exact product names, SKUs, brand searches</li>
                <li><strong style="color: #c2185b;">Fuzzy:</strong> Handles typos and misspellings</li>
                <li><strong style="color: #2e7d32;">Semantic:</strong> Understands intent and concepts using Cohere embeddings</li>
                <li><strong style="color: #e65100;">Hybrid:</strong> Best overall - combines keyword matching with semantic understanding</li>
            </ul>
           <div style="border-left: 4px solid #4CAF50; padding-left: 10px; margin-top: 10px; color: black;"> 
                <strong>🤖 Cohere Models:</strong> embed-english-v3 (embeddings) • rerank-v3-5:0 (re-ranking)
            </div>
            <div style="background: #fff3e0; border-left: 4px solid #e65100; padding: 12px; margin-top: 15px; border-radius: 4px;">
                <h4 style="color: #2e7d32; margin-top: 0; margin-bottom: 8px;">💡 Understanding Hybrid Search Approaches</h4>
                <p style="color: #1b5e20; margin: 8px 0; font-size: 13px;">
                    <strong>Challenge:</strong> Different search methods produce vastly different score ranges (semantic: 0.7-1.0, keyword: 0.01-0.1), causing one method to dominate weighted combinations.
                </p>
                <p style="color: #1b5e20; margin: 8px 0; font-size: 13px;">
                    <strong>Solutions Demonstrated:</strong>
                </p>
                <ul style="color: #1b5e20; margin: 8px 0; font-size: 13px; padding-left: 20px;">
                    <li><strong>Hybrid (70/30):</strong> Weighted score fusion - simple but requires careful tuning</li>
                    <li><strong>Hybrid-RRF:</strong> Rank-based fusion - robust, no normalization needed ✨</li>
                    <li><strong>Cohere Rerank:</strong> ML-based re-ranking - most sophisticated approach</li>
                </ul>
                <div style="background: #e8f5e9; border-left: 4px solid #4caf50; padding: 10px; margin: 8px 0; font-size: 13px; color: #1b5e20;">
                    <strong>💡 Try the examples below</strong> to see how each method handles different query types!
                </div>
            </div>
        </div>
    """))
    
    # Display interface
    display(widgets.VBox([
        widgets.HTML('<h4 style="color: #0f1111; margin: 15px 0;">📝 Example Searches (Click to Try):</h4>'),
        widgets.GridBox(
            example_buttons,
            layout=widgets.Layout(
                grid_template_columns='repeat(3, 1fr)',
                grid_gap='5px'
            )
        ),
        widgets.HTML('<hr style="margin: 20px 0; border-color: #e3e6e8;">'),
        query_input,
        search_method,
        widgets.HTML('<h4 style="color: #0f1111;">⚙️ Options:</h4>'),
        widgets.HBox([
            widgets.VBox([
                widgets.HTML('<strong>Hybrid Weights:</strong>'),
                semantic_weight,
                keyword_weight,
                weight_status
            ]),
            widgets.VBox([
                results_limit,
                rerank_checkbox
            ])
        ]),
        search_button,
        results_output
    ]))

# Create and display the interface
create_search_interface()
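
The interface above contrasts weighted score fusion with rank-based fusion. As a rough illustration of why rank-based fusion needs no score normalization, here is a minimal sketch of Reciprocal Rank Fusion (RRF). It is independent of the notebook's `hybrid_search_rrf` function, whose body is not shown in this section; the function name `reciprocal_rank_fusion` and the product IDs are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of IDs (each ordered best-first) into one ranking.

    Each list contributes 1 / (k + rank) per item, so only positions matter,
    not the raw scores produced by each search method.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical product IDs, ordered best-first by each method
keyword_ranked = ["B01", "B02", "B03"]
semantic_ranked = ["B03", "B01", "B04"]

fused = reciprocal_rank_fusion([keyword_ranked, semantic_ranked], k=60)
print([doc_id for doc_id, _ in fused])
```

Items that appear near the top of both lists accumulate the largest fused score, which is why RRF tends to favor results that several methods agree on.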

## 📈 Step 6: Performance Analysis & Insights (OPTIONAL)

Let's analyze the performance characteristics of different search methods.

In [None]:
# Step 6: Performance Analysis & Optimization Insights

import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
from IPython.display import display, HTML

def performance_analysis():
    """Analyze and visualize performance of different search methods"""
    
    # Test queries representing different search scenarios
    test_queries = [
        # Exact match scenarios
        ("Apple AirPods", "Exact Product"),
        ("wireless headphones", "Product Category"),
        
        # Semantic scenarios
        ("gift for coffee lover", "Intent-based"),
        ("eco-friendly water bottle", "Attribute-focused"),
        
        # Typo scenarios
        ("wireles hedphones", "Spelling Errors"),
        ("bose quitcomfort", "Brand Typos"),
        
        # Complex scenarios
        ("best camera under 500", "Budget Constraint"),
        ("lightweight laptop for travel", "Multi-attribute")
    ]
    
    results_data = []
    
    print("🔄 Running performance analysis...")
    print("-" * 50)
    
    for query, scenario in tqdm(test_queries, desc="Testing queries"):
        # Test each method
        methods = [
            ('Keyword', lambda q: keyword_search(q, 5)),
            ('Fuzzy', lambda q: fuzzy_search(q, 5)),
            ('Semantic', lambda q: semantic_search(q, 5)),
            ('Hybrid', lambda q: hybrid_search(q, semantic_weight=0.7, keyword_weight=0.3, limit=5))
        ]
        
        for method_name, method_func in methods:
            try:
                # Measure performance
                start_time = time.time()
                results = method_func(query)
                elapsed = time.time() - start_time
                
                # Calculate metrics
                results_data.append({
                    'Query': query[:25] + '...' if len(query) > 25 else query,
                    'Scenario': scenario,
                    'Method': method_name,
                    'Results': len(results),
                    'Avg Score': np.mean([r['score'] for r in results]) if results else 0,
                    'Max Score': max([r['score'] for r in results]) if results else 0,
                    'Time (ms)': elapsed * 1000,
                    'Success': len(results) > 0
                })
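            except Exception as e:
                # Report the failure and record NaN latency so failed calls
                # do not pull the timing averages toward zero
                print(f"⚠️ {method_name} failed for '{query}': {e}")
                results_data.append({
                    'Query': query[:25] + '...' if len(query) > 25 else query,
                    'Scenario': scenario,
                    'Method': method_name,
                    'Results': 0,
                    'Avg Score': 0,
                    'Max Score': 0,
                    'Time (ms)': np.nan,
                    'Success': False
                })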
    
    df = pd.DataFrame(results_data)
    
    # Create comprehensive visualizations
    fig = plt.figure(figsize=(18, 10))
    gs = fig.add_gridspec(2, 3, hspace=0.25, wspace=0.3)
    
    # 1. Success Rate by Scenario
    ax1 = fig.add_subplot(gs[0, 0])
    success_pivot = df.pivot_table(
        values='Success',
        index='Scenario',
        columns='Method',
        aggfunc='mean'
    ) * 100
    success_pivot.plot(kind='bar', ax=ax1, width=0.8)
    ax1.set_title('Success Rate by Scenario (%)', fontsize=12, fontweight='bold')
    ax1.set_ylabel('Success Rate (%)')
    ax1.set_xlabel('')
    ax1.legend(title='Method', loc='lower right')
    ax1.tick_params(axis='x', rotation=45)
    ax1.grid(True, alpha=0.3, axis='y')
    
    # 2. Average Score Heatmap
    ax2 = fig.add_subplot(gs[0, 1])
    score_pivot = df.pivot_table(
        values='Avg Score',
        index='Scenario',
        columns='Method'
    )
    sns.heatmap(score_pivot, annot=True, fmt='.3f', cmap='RdYlGn', ax=ax2, vmin=0, vmax=1, cbar_kws={'label': 'Score'})
    ax2.set_title('Average Relevance Score', fontsize=12, fontweight='bold')
    ax2.set_xlabel('Method')
    ax2.set_ylabel('')
    
    # 3. Best Method by Scenario
    ax3 = fig.add_subplot(gs[0, 2])
    
    # Calculate best method for each scenario
    scenario_matrix = pd.crosstab(df['Scenario'], df['Method'], values=df['Avg Score'], aggfunc='mean')
    
    # Create a normalized version for visualization
    scenario_norm = scenario_matrix.div(scenario_matrix.max(axis=1), axis=0)
    
    sns.heatmap(scenario_norm, annot=scenario_matrix.round(2), fmt='', cmap='YlOrRd', ax=ax3, vmin=0, vmax=1, cbar_kws={'label': 'Score'})
    ax3.set_title('Best Method by Scenario', fontsize=12, fontweight='bold')
    ax3.set_xlabel('Method')
    ax3.set_ylabel('')
    
    # 4. Method Performance Radar Chart
    ax4 = fig.add_subplot(gs[1, 0], projection='polar')
    
    # Calculate aggregate metrics
    metrics = df.groupby('Method').agg({
        'Time (ms)': lambda x: 1 / (1 + x.mean()/100),  # Inverse for better = higher, normalized
        'Results': lambda x: x.mean() / 5,  # Normalize by max results
        'Avg Score': 'mean',
        'Success': 'mean'
    })
    
    categories = ['Speed', 'Coverage', 'Relevance', 'Reliability']
    angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
    angles += angles[:1]
    
    colors = {'Keyword': '#1f77b4', 'Fuzzy': '#ff7f0e', 'Semantic': '#2ca02c', 'Hybrid': '#d62728'}
    
    for method in metrics.index:
        values = metrics.loc[method].values.tolist()
        values += values[:1]
        ax4.plot(angles, values, 'o-', linewidth=2, label=method, color=colors.get(method, 'gray'))
        ax4.fill(angles, values, alpha=0.15, color=colors.get(method, 'gray'))
    
    ax4.set_xticks(angles[:-1])
    ax4.set_xticklabels(categories, size=11)
    ax4.set_ylim(0, 1)
    ax4.set_title('Normalized Performance Comparison', fontsize=12, fontweight='bold', pad=20)
    ax4.legend(loc='upper right', bbox_to_anchor=(1.2, 1.1))
    ax4.grid(True)
    
    # 5. Speed vs Accuracy Trade-off
    ax5 = fig.add_subplot(gs[1, 1])
    
    colors_scatter = {'Keyword': '#1f77b4', 'Fuzzy': '#ff7f0e', 'Semantic': '#2ca02c', 'Hybrid': '#d62728'}
    
    for method in df['Method'].unique():
        method_data = df[df['Method'] == method]
        ax5.scatter(method_data['Time (ms)'], method_data['Avg Score'], 
                   label=method, alpha=0.7, s=120, color=colors_scatter.get(method, 'gray'),
                   edgecolors='black', linewidth=0.5)
    
    ax5.set_xlabel('Response Time (ms)', fontsize=11)
    ax5.set_ylabel('Average Score', fontsize=11)
    ax5.set_title('Speed vs Accuracy Trade-off', fontsize=12, fontweight='bold')
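    ax5.grid(True, alpha=0.3)
    
    # Shade the target region (score 0.4-0.6, latency under 2500 ms) before
    # building the legend so the 'Optimal Zone' label is included
    ax5.axhspan(0.4, 0.6, alpha=0.1, color='green', label='Optimal Zone')
    ax5.axvspan(0, 2500, alpha=0.1, color='green')
    ax5.legend(loc='best')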
    
    # 6. Method Rankings
    ax6 = fig.add_subplot(gs[1, 2])
    
    # Calculate rankings
    rankings = []
    for scenario in df['Scenario'].unique():
        scenario_data = df[df['Scenario'] == scenario]
        scenario_ranked = scenario_data.sort_values('Avg Score', ascending=False)
        for i, (_, row) in enumerate(scenario_ranked.iterrows(), 1):
            rankings.append({
                'Scenario': scenario,
                'Method': row['Method'],
                'Rank': i
            })
    
    ranking_df = pd.DataFrame(rankings)
    ranking_pivot = ranking_df.pivot_table(values='Rank', index='Method', aggfunc='mean')
    
    # Create bar chart with colors; flatten the pivot values into a 1-D array for ax.bar
    bars = ax6.bar(ranking_pivot.index, ranking_pivot.values.flatten(), 
                   color=[colors_scatter.get(m, 'gray') for m in ranking_pivot.index])
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax6.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.2f}', ha='center', va='bottom')
    
    ax6.set_title('Average Ranking Across Scenarios', fontsize=12, fontweight='bold')
    ax6.set_ylabel('Average Rank (Lower is Better)', fontsize=11)
    ax6.set_xlabel('Method', fontsize=11)
    ax6.set_ylim(0.5, 4.5)
    ax6.invert_yaxis()
    ax6.grid(True, alpha=0.3, axis='y')
    
    plt.suptitle('Search Method Performance Analysis', fontsize=14, fontweight='bold', y=0.98)
    plt.tight_layout()
    plt.show()
    
    # Display summary statistics
    print("\n📊 Performance Summary Statistics:")
    print("="*60)
    
    summary = df.groupby('Method').agg({
        'Time (ms)': ['mean', 'std', 'min', 'max'],
        'Results': ['mean', 'std'],
        'Avg Score': ['mean', 'std'],
        'Success': lambda x: f"{x.mean()*100:.1f}%"
    }).round(2)
    
    display(summary)
    
    # Recommendations based on analysis
    print("\n💡 Method Recommendations by Use Case:")
    print("="*60)
    
    recommendations = {
        "Exact Product Search": "Keyword (TSVector) - Fastest and most accurate for exact matches",
        "Typo Tolerance": "Fuzzy (pg_trgm) - Best at handling misspellings",
        "Conceptual Search": "Semantic - Understanding intent and context",
        "General Purpose": "Hybrid - Balanced performance across all scenarios",
        "Speed Critical": "Keyword - Lowest latency",
        "Accuracy Critical": "Hybrid/Semantic - Highest relevance scores"
    }
    
    for use_case, rec in recommendations.items():
        print(f"• {use_case}: {rec}")
    
    return df

# Additional optimization insights function
def optimization_insights(df: pd.DataFrame):
    """Generate specific optimization recommendations"""
    
    print("\n🔧 Optimization Insights:")
    print("="*60)
    
    # Calculate insights
    avg_times = df.groupby('Method')['Time (ms)'].mean()
    avg_scores = df.groupby('Method')['Avg Score'].mean()
    
    # Speed analysis
    fastest = avg_times.idxmin()
    slowest = avg_times.idxmax()
    speed_diff = (avg_times[slowest] - avg_times[fastest]) / avg_times[fastest] * 100
    
    print(f"\n⚡ Speed Analysis:")
    print(f"  • Fastest: {fastest} ({avg_times[fastest]:.1f}ms)")
    print(f"  • Slowest: {slowest} ({avg_times[slowest]:.1f}ms)")
    print(f"  • Performance gap: {speed_diff:.0f}% slower")
    
    # Accuracy analysis
    most_accurate = avg_scores.idxmax()
    least_accurate = avg_scores.idxmin()
    
    print(f"\n🎯 Accuracy Analysis:")
    print(f"  • Most accurate: {most_accurate} (avg score: {avg_scores[most_accurate]:.3f})")
    print(f"  • Least accurate: {least_accurate} (avg score: {avg_scores[least_accurate]:.3f})")
    
    # SQL Optimization Commands for DBAs
    print(f"\n📝 SQL OPTIMIZATION COMMANDS FOR DBAs:")
    print("="*60)
    
    print("\n-- 1. KEYWORD SEARCH OPTIMIZATION (TSVector)")
    print("""
    -- Check if GIN index exists on tsvector
    SELECT indexname, indexdef 
    FROM pg_indexes 
    WHERE tablename = 'product_catalog' 
    AND indexdef LIKE '%gin%tsvector%';
    
    -- Create optimized GIN index if missing
    CREATE INDEX CONCURRENTLY IF NOT EXISTS product_catalog_fts_gin_idx 
    ON bedrock_integration.product_catalog 
    USING GIN (to_tsvector('english', product_description));
    
    -- Analyze table statistics
    ANALYZE bedrock_integration.product_catalog;
    """)
    
    print("\n-- 2. FUZZY SEARCH OPTIMIZATION (pg_trgm)")
    print("""
    -- Check trigram extension
    SELECT * FROM pg_extension WHERE extname = 'pg_trgm';
    
    -- Create trigram GIN index
    CREATE INDEX CONCURRENTLY IF NOT EXISTS product_catalog_trgm_idx 
    ON bedrock_integration.product_catalog 
    USING GIN (product_description gin_trgm_ops);
    
    -- Optimize similarity threshold
    SET pg_trgm.similarity_threshold = 0.3;  -- Adjust based on requirements
    
    -- Check current threshold
    SHOW pg_trgm.similarity_threshold;
    """)
    
    print("\n-- 3. VECTOR SEARCH OPTIMIZATION (pgvector)")
    print("""
    -- Check HNSW index parameters
    SELECT indexname, indexdef 
    FROM pg_indexes 
    WHERE tablename = 'product_catalog' 
    AND indexdef LIKE '%hnsw%';
    
    -- Create optimized HNSW index for Cohere embeddings
    CREATE INDEX CONCURRENTLY IF NOT EXISTS product_catalog_embedding_hnsw_idx 
    ON bedrock_integration.product_catalog 
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
    
    -- Optimize work_mem for vector operations
    SET work_mem = '256MB';  -- Increase for large vector operations
    
    -- Check vector dimension statistics
    SELECT 
        COUNT(*) as total_products,
        COUNT(embedding) as products_with_embeddings,
        AVG(vector_dims(embedding)) as avg_dimensions
    FROM bedrock_integration.product_catalog;
    """)
    
    print("\n-- 4. QUERY PERFORMANCE ANALYSIS")
    print("""
    -- Enable query timing
    \\timing on
    
    -- Analyze slow queries
    SELECT 
        query,
        calls,
        mean_exec_time,
        max_exec_time,
        total_exec_time
    FROM pg_stat_statements
    WHERE query LIKE '%product_catalog%'
    ORDER BY mean_exec_time DESC
    LIMIT 10;
    
    -- Check table bloat (pg_stat_user_tables exposes relname, not tablename)
    SELECT 
        schemaname,
        relname AS tablename,
        pg_size_pretty(pg_total_relation_size(relid)) as size,
        n_live_tup,
        n_dead_tup,
        round(n_dead_tup::numeric/NULLIF(n_live_tup,0), 2) as dead_ratio
    FROM pg_stat_user_tables
    WHERE schemaname = 'bedrock_integration'
    ORDER BY n_dead_tup DESC;
    """)
    
    print("\n-- 5. MAINTENANCE COMMANDS")
    print("""
    -- Vacuum and analyze for optimal performance
    VACUUM (ANALYZE, VERBOSE) bedrock_integration.product_catalog;
    
    -- Reindex if fragmented
    REINDEX TABLE CONCURRENTLY bedrock_integration.product_catalog;
    
    -- Update table statistics
    ANALYZE bedrock_integration.product_catalog (product_description, embedding);
    
    -- Monitor index usage (pg_stat_user_indexes exposes relname and indexrelname)
    SELECT 
        schemaname,
        relname AS tablename,
        indexrelname AS indexname,
        idx_scan,
        idx_tup_read,
        idx_tup_fetch
    FROM pg_stat_user_indexes
    WHERE schemaname = 'bedrock_integration'
    ORDER BY idx_scan DESC;
    """)
    
    # Performance recommendations based on actual timings
    print(f"\n🎯 SPECIFIC RECOMMENDATIONS BASED ON ANALYSIS:")
    print("="*60)
    
    if avg_times['Semantic'] > 2500:
        print("\n⚠️ VECTOR SEARCH NEEDS OPTIMIZATION:")
        print("  • Lower hnsw.ef_search to trade recall for latency; raise HNSW 'm' (e.g. 32) only if recall is the bottleneck")
        print("  • Set maintenance_work_mem = '512MB' before index creation")
        print("  • Consider using IVFFlat for datasets > 1M rows")
    
    if avg_times['Keyword'] > 1500:
        print("\n⚠️ KEYWORD SEARCH NEEDS OPTIMIZATION:")
        print("  • Ensure GIN index exists on tsvector column")
        print("  • Consider partial indexes for frequently searched categories")
        print("  • Use ts_stat() to analyze term frequency")
    
    if avg_times['Fuzzy'] > 2000:
        print("\n⚠️ FUZZY SEARCH NEEDS OPTIMIZATION:")
        print("  • Verify pg_trgm GIN index exists")
        print("  • Consider raising similarity_threshold so fewer candidate rows are rechecked")
        print("  • Use word_similarity() for better partial matching")
    
    # Cost-benefit analysis
    print(f"\n💰 Cost-Benefit Analysis:")
    
    # Note: scores from different methods are not on a common scale, so treat
    # the score comparison as indicative only
    hybrid_benefit = (avg_scores['Hybrid'] - avg_scores['Keyword']) / avg_scores['Keyword'] * 100
    hybrid_cost = (avg_times['Hybrid'] - avg_times['Keyword']) / avg_times['Keyword'] * 100
    
    print(f"  • Hybrid vs Keyword:")
    print(f"    - Average score change: {hybrid_benefit:+.1f}%")
    print(f"    - Latency change: {hybrid_cost:+.1f}%")
    if hybrid_cost > 0:
        print(f"    - ROI: {hybrid_benefit/hybrid_cost:.2f}% score change per 1% added latency")

# Run the analysis
print("Starting comprehensive performance analysis...")
perf_df = performance_analysis()

# Generate optimization insights
optimization_insights(perf_df)
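
The analysis above times each query once with `time.time()`, so a single cold cache or network hiccup can dominate a data point. A minimal sketch of a steadier measurement follows, assuming a warm-up call plus the median of several runs is acceptable for your workload. `time_search` and `fake_search` are hypothetical names used only for illustration; in the notebook you would pass something like `lambda q: keyword_search(q, 5)` instead.

```python
import statistics
import time

def time_search(search_fn, query, repeats=5, warmup=1):
    """Return (median latency in ms, last results) for search_fn(query)."""
    results = None
    # Warm-up calls populate caches and are not timed
    for _ in range(warmup):
        results = search_fn(query)
    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        results = search_fn(query)
        samples_ms.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples_ms), results

# Stand-in for a real search function so the sketch runs on its own
def fake_search(query):
    return [{"score": 0.5, "description": query}]

median_ms, results = time_search(fake_search, "wireless headphones")
print(f"median latency: {median_ms:.3f} ms, results: {len(results)}")
```

The median is less sensitive to outliers than the mean, which matters when semantic search latency includes a remote embedding call.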

## 🎓 Key Takeaways & Best Practices

### Search Method Deep Dive

| Method | Best For | Limitations | Index Strategy |
|--------|----------|-------------|----------------|
| **TSVector** | • Exact lexical matches<br>• Known domain terminology<br>• Boolean queries (AND/OR/NOT)<br>• Phrase searching | • No semantic understanding<br>• Misses synonyms/variants<br>• Language-specific configs<br>• Stopword dependencies | GIN/GiST indexes<br>Partial index patterns |
| **pg_trgm** | • Fuzzy matching (trigram similarity)<br>• Typo tolerance<br>• Substring searches<br>• LIKE pattern acceleration | • Limited to character n-grams<br>• No semantic context<br>• Memory intensive at scale<br>• Similarity threshold needs tuning | GIN with gin_trgm_ops<br>Composite for selectivity<br>Expression indexes |
| **Semantic** | • Natural language queries<br>• Conceptual similarity<br>• Intent understanding<br>• Cross-language search | • Requires embedding models<br>• May miss exact terms<br>• Latency for embedding<br>• Storage overhead (1536-dim) | HNSW (pgvector)<br>Partitioned indexes |
| **Hybrid** | • Production search systems<br>• User-facing applications<br>• Mixed query patterns<br>• Evolving requirements | • Tuning complexity<br>• Multiple index maintenance<br>• Query planning overhead<br>• Cache invalidation | Combined strategy<br>Parallel index scans<br>Cost-based optimization |
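
To make the pg_trgm row more concrete, here is a minimal pure-Python sketch of trigram similarity. It approximates what pg_trgm does (lowercase, pad each word with two leading spaces and one trailing space, then compare sets of three-character substrings); it is not the extension's actual implementation. The ASCII-only word splitting and the helper names `trigrams` and `trigram_similarity` are assumptions made for illustration.

```python
import re


def trigrams(text: str) -> set[str]:
    """Approximate pg_trgm trigram extraction (ASCII-only simplification)."""
    grams: set[str] = set()
    # Treat runs of letters/digits as words; everything else is a separator.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams in both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # A typo still shares most trigrams with the intended word.
    print(trigram_similarity("wireless", "wirelss"))
    # Unrelated words share none.
    print(trigram_similarity("wireless", "keyboard"))
```

Because the score depends only on overlapping character sequences, a misspelling still scores well, while a synonym with different spelling scores near zero. That is the "no semantic context" limitation listed in the table.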

### Enterprise-Grade Weight Configuration Matrix

| Use Case | Semantic Weight | Keyword Weight | Rationale | Index Priority |
|----------|----------------|----------------|-----------|----------------|
| **E-commerce Catalog** | 70% | 30% | Users describe products naturally; exact SKUs handled separately | HNSW for vectors, partial GIN for categories |
| **Technical Documentation** | 40% | 60% | Precise terminology critical; concepts secondary for accuracy | GIN with custom dictionaries, smaller HNSW |
| **Customer Support Tickets** | 80% | 20% | Intent matters more than exact wording; emotional context crucial | Large HNSW index, basic text search fallback |
| **SKU/Part Number Search** | 10% | 90% | Exact matches required; minimal semantic benefit | B-tree for exact, pg_trgm for fuzzy |
| **Legal Document Repository** | 35% | 65% | Precise legal terms essential; some conceptual linking helpful | Full-text with phrase search, auxiliary vectors |
| **Knowledge Base Articles** | 65% | 35% | Balance between natural queries and technical terms | Dual indexing with equal maintenance priority |
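
The weights in this matrix can be applied as a simple weighted sum of normalized scores. The sketch below assumes each search method returns a mapping from product ID to raw score and that min-max normalization is acceptable for putting both on a common scale. The function names, the normalization choice, and the sample scores are illustrative assumptions, not necessarily how the hybrid search function earlier in this lab combines results.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale raw scores to [0, 1] so different methods are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal; treat every hit as equally strong.
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_rank(
    semantic: dict[str, float],
    keyword: dict[str, float],
    semantic_weight: float = 0.7,
) -> list[tuple[str, float]]:
    """Combine two score maps with weights that sum to 1, best match first."""
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must be between 0 and 1")
    keyword_weight = 1.0 - semantic_weight
    sem = min_max_normalize(semantic)
    kw = min_max_normalize(keyword)
    combined = {
        # A document missing from one result set contributes 0 for that method.
        doc_id: semantic_weight * sem.get(doc_id, 0.0)
        + keyword_weight * kw.get(doc_id, 0.0)
        for doc_id in sem.keys() | kw.keys()
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical scores: cosine similarity and a text-search rank.
    semantic_scores = {"A": 0.9, "B": 0.5, "C": 0.1}
    keyword_scores = {"B": 3.0, "D": 1.0}
    print(hybrid_rank(semantic_scores, keyword_scores, semantic_weight=0.7))
    print(hybrid_rank(semantic_scores, keyword_scores, semantic_weight=0.1))
```

With the e-commerce weighting (70/30) the strongest semantic hit ranks first, while the SKU-style weighting (10/90) promotes the strongest keyword hit. Treat the percentages in the matrix as starting points to validate against your own queries.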

### ✅ Congratulations! You've completed Lab 1!

You've successfully:
- Implemented multiple search methods (TSVector, pg_trgm, pgvector)
- Compared different search approaches and their trade-offs
- Built a configurable hybrid search system with dynamic weighting
- Analyzed index strategies for each search method
- Learned enterprise-grade weight configurations for various use cases

### 🚀 Ready for Lab 2: MCP & Strands Integration!

Next up: Build intelligent agents with the Model Context Protocol (MCP) and the Strands framework.