# Demo 4: RAG with Re-ranking Pipeline (Bedrock Knowledge Base)
Pattern: Advanced Modular RAG with Quality Improvement using Bedrock Knowledge Base

**Pipeline:**
1. Store documents in S3 and create Bedrock Knowledge Base
2. Initial retrieval (top 20 candidates using vector search)
3. Re-ranker model (refine to top 5)
4. Generation with high-quality context
5. Relevance scoring throughout
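
The stages above can be sketched independently of any AWS service. The following is a minimal sketch, not the notebook's implementation: `two_stage_retrieve`, `retrieve_fn`, and `score_fn` are hypothetical names, and the toy retriever and word-overlap scorer stand in for vector search and the re-ranker model.

```python
from typing import Callable, Dict, List


def two_stage_retrieve(
    query: str,
    retrieve_fn: Callable[[str, int], List[Dict]],
    score_fn: Callable[[str, Dict], float],
    initial_k: int = 20,
    final_k: int = 5,
) -> List[Dict]:
    """Retrieve a broad candidate set, then keep the final_k best by re-rank score."""
    candidates = retrieve_fn(query, initial_k)
    # Score copies so the retriever's own records are left unmodified.
    scored = [{**c, "rerank_score": score_fn(query, c)} for c in candidates]
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:final_k]


# Toy stand-ins (hypothetical): a fixed corpus and a word-overlap scorer.
corpus = [
    {"id": "lambda_costs", "content": "optimize lambda costs by right-sizing memory"},
    {"id": "lambda_vpc", "content": "lambda functions can access vpc resources"},
    {"id": "lambda_pricing", "content": "lambda pricing is based on requests"},
]


def toy_retrieve(query: str, k: int) -> List[Dict]:
    return corpus[:k]


def toy_score(query: str, doc: Dict) -> float:
    query_words = set(query.lower().split())
    return float(len(query_words & set(doc["content"].split())))


top = two_stage_retrieve("optimize lambda costs", toy_retrieve, toy_score, final_k=2)
print([d["id"] for d in top])
```

Passing the retriever and scorer as functions keeps the two stages swappable, which is the point of the modular pattern.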

In [None]:
import boto3
import json
import numpy as np
import time
from typing import List, Dict, Tuple
import re

In [None]:
# Initialize clients
bedrock_runtime = boto3.client('bedrock-runtime')
s3vectors = boto3.client('s3vectors')

# Configuration
VECTOR_BUCKET = f"reranking-vectors-{int(time.time())}"
VECTOR_INDEX = "reranking-index"
EMBEDDING_MODEL = "amazon.titan-embed-text-v1"
GENERATION_MODEL = "amazon.nova-pro-v1:0"

In [None]:
# Create S3 Vectors bucket and index
# Note: boto3 s3vectors operations take lowerCamelCase parameter names
s3vectors.create_vector_bucket(vectorBucketName=VECTOR_BUCKET)
print(f"Created vector bucket: {VECTOR_BUCKET}")

# Create vector index
s3vectors.create_index(
    vectorBucketName=VECTOR_BUCKET,
    indexName=VECTOR_INDEX,
    dataType="float32",
    dimension=1536,  # Output dimension of amazon.titan-embed-text-v1
    distanceMetric="cosine"
)
print(f"Created vector index: {VECTOR_INDEX}")

In [None]:
# Extended document collection for re-ranking demo
documents = [
    {"id": "lambda_pricing", "title": "AWS Lambda Pricing", "content": "AWS Lambda pricing is based on requests and compute time. You pay $0.20 per 1M requests and $0.0000166667 per GB-second. Free tier includes 1M requests monthly."},
    {"id": "lambda_memory", "title": "Lambda Memory Configuration", "content": "Configure Lambda memory from 128 MB to 10,240 MB. CPU power scales with memory allocation. Higher memory improves performance but increases cost."},
    {"id": "lambda_timeout", "title": "Lambda Timeout Settings", "content": "Lambda maximum execution time is 15 minutes (900 seconds). Default timeout is 3 seconds. Configure based on function requirements."},
    {"id": "lambda_coldstart", "title": "Lambda Cold Start Optimization", "content": "Cold starts add latency when Lambda initializes execution environments. Use provisioned concurrency and optimize package size to reduce cold starts."},
    {"id": "lambda_vpc", "title": "Lambda VPC Configuration", "content": "Lambda functions can access VPC resources like RDS databases. VPC configuration adds cold start latency. Use VPC endpoints for AWS services."},
    {"id": "lambda_monitoring", "title": "Lambda Monitoring", "content": "Monitor Lambda with CloudWatch metrics: Duration, Invocations, Errors, Throttles. Enable X-Ray tracing for distributed systems."},
    {"id": "lambda_security", "title": "Lambda Security", "content": "Use IAM roles with least privilege. Store secrets in AWS Secrets Manager. Enable encryption and validate input data."},
    {"id": "lambda_deployment", "title": "Lambda Deployment", "content": "Deploy Lambda using blue/green, canary, or all-at-once strategies. Use AWS CodeDeploy for automated deployments."},
    {"id": "lambda_layers", "title": "Lambda Layers", "content": "Lambda layers allow sharing code and dependencies across functions. Layers reduce deployment package size and enable code reuse."},
    {"id": "lambda_triggers", "title": "Lambda Triggers", "content": "Lambda functions can be triggered by S3 events, API Gateway, DynamoDB streams, SQS queues, and many other AWS services."},
    {"id": "lambda_env_vars", "title": "Lambda Environment Variables", "content": "Environment variables store configuration data. Maximum size is 4 KB. Use Systems Manager Parameter Store for larger configurations."},
    {"id": "lambda_concurrency", "title": "Lambda Concurrency", "content": "Lambda automatically scales to handle concurrent executions. Set reserved concurrency to limit scaling. Use provisioned concurrency for consistent performance."},
    {"id": "lambda_errors", "title": "Lambda Error Handling", "content": "Handle errors with try-catch blocks, dead letter queues, and retry policies. Monitor error rates and set up CloudWatch alarms."},
    {"id": "lambda_performance", "title": "Lambda Performance Optimization", "content": "Optimize Lambda performance by right-sizing memory, minimizing cold starts, using connection pooling, and efficient code practices."},
    {"id": "lambda_costs", "title": "Lambda Cost Optimization", "content": "Optimize Lambda costs by right-sizing memory allocation, reducing execution time, using ARM processors, and monitoring usage patterns."},
    {"id": "lambda_best_practices", "title": "Lambda Best Practices", "content": "Follow Lambda best practices: separate business logic from handler, use environment variables, implement proper logging, and design for idempotency."},
    {"id": "lambda_testing", "title": "Lambda Testing", "content": "Test Lambda functions locally using SAM CLI, implement unit tests, integration tests, and use AWS X-Ray for debugging distributed applications."},
    {"id": "lambda_scaling", "title": "Lambda Scaling", "content": "Lambda automatically scales from zero to thousands of concurrent executions. Understand scaling limits and configure reserved concurrency as needed."},
    {"id": "lambda_integration", "title": "Lambda Integration Patterns", "content": "Integrate Lambda with other AWS services using event-driven patterns, API Gateway for REST APIs, and Step Functions for workflows."},
    {"id": "lambda_troubleshooting", "title": "Lambda Troubleshooting", "content": "Troubleshoot Lambda issues using CloudWatch Logs, X-Ray tracing, and monitoring key metrics like duration, errors, and throttles."}
]

print(f"Loaded {len(documents)} documents for re-ranking demo")

In [None]:
def get_embedding(text: str) -> List[float]:
    """Get embedding using Titan model"""
    response = bedrock_runtime.invoke_model(
        modelId=EMBEDDING_MODEL,
        body=json.dumps({"inputText": text})
    )
    return json.loads(response['body'].read())['embedding']

# Generate embeddings and store them in S3 Vectors
print("Generating embeddings and storing in S3 Vectors...")

vectors_to_put = []
for doc in documents:
    embedding = get_embedding(doc["content"])

    # S3 Vectors expects lowercase keys and the vector wrapped as {"float32": [...]}
    vectors_to_put.append({
        "key": doc["id"],
        "data": {"float32": embedding},
        "metadata": {
            "title": doc["title"],
            "content": doc["content"]
        }
    })
    print(f"Prepared vector for {doc['id']}")
    time.sleep(0.1)  # Brief pause between embedding calls to reduce throttling risk

# Insert all vectors in one batch
s3vectors.put_vectors(
    vectorBucketName=VECTOR_BUCKET,
    indexName=VECTOR_INDEX,
    vectors=vectors_to_put
)
print("All vectors stored in S3 Vectors")
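
For collections larger than this demo's 20 documents, a single `put_vectors` call may exceed the service's per-request limit. A minimal sketch of client-side batching follows. `chunked` is a hypothetical helper, and the batch size of 500 is an assumption that should be checked against the current S3 Vectors quotas.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Example with placeholder records; in the notebook each batch would be
# passed to one put_vectors call.
records = [{"key": f"doc_{i}"} for i in range(1200)]
batch_sizes = [len(batch) for batch in chunked(records, 500)]
print(batch_sizes)  # [500, 500, 200]
```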

In [None]:
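import os

# Knowledge Base retrieval goes through the bedrock-agent-runtime client
bedrock_agent = boto3.client('bedrock-agent-runtime')
# Assumption: the ID of an existing Knowledge Base is supplied through the
# BEDROCK_KB_ID environment variable (hypothetical name); this notebook does not create one
kb_id = os.environ["BEDROCK_KB_ID"]

def initial_retrieval(query: str, top_k: int = 20) -> List[Dict]:
    """Stage 1: Initial retrieval using Bedrock Knowledge Base"""
    response = bedrock_agent.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={'text': query},
        retrievalConfiguration={
            'vectorSearchConfiguration': {
                'numberOfResults': top_k
            }
        }
    )
    
    candidates = []
    for result in response['retrievalResults']:
        text = result['content']['text']
        source_uri = result['metadata']['x-amz-bedrock-kb-source-uri']
        # The parsing below assumes chunks shaped like "Title: <title>\n\n<content>"
        title = text.split('\n')[0].replace('Title: ', '')
        content = text.split('\n\n', 1)[1] if '\n\n' in text else text
        candidates.append({
            # Document ID is the source file name without the .txt extension
            "doc_id": source_uri.split('/')[-1].replace('.txt', ''),
            "document": {"title": title, "content": content},
            "initial_score": result['score']
        })
    
    return candidates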
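
The documents embedded earlier in this notebook live in the S3 Vectors index, which can also be queried directly for Stage 1. The sketch below is an alternative under stated assumptions, not the Knowledge Base path above. `retrieve_from_s3_vectors` is a hypothetical name, and the client and embedding function are passed in. It assumes a cosine index and converts the returned `distance` to a similarity as `1 - distance`. The parameter names follow the boto3 `s3vectors` `query_vectors` operation and should be verified against the current SDK reference.

```python
from typing import Any, Callable, Dict, List


def retrieve_from_s3_vectors(
    client: Any,
    embed_fn: Callable[[str], List[float]],
    query: str,
    bucket: str,
    index: str,
    top_k: int = 20,
) -> List[Dict]:
    """Query an S3 Vectors index and return candidates in the pipeline's shape."""
    response = client.query_vectors(
        vectorBucketName=bucket,
        indexName=index,
        queryVector={"float32": embed_fn(query)},
        topK=top_k,
        returnMetadata=True,
        returnDistance=True,
    )
    candidates = []
    for match in response.get("vectors", []):
        metadata = match.get("metadata", {})
        candidates.append({
            "doc_id": match["key"],
            "document": {
                "title": metadata.get("title", ""),
                "content": metadata.get("content", ""),
            },
            # Assumes cosine distance: smaller distance means higher similarity.
            "initial_score": 1.0 - match["distance"],
        })
    return candidates
```

In this notebook the call would look like `retrieve_from_s3_vectors(s3vectors, get_embedding, query, VECTOR_BUCKET, VECTOR_INDEX)`. The returned dictionaries use the same keys as `initial_retrieval`, so the later stages can consume them unchanged.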

In [None]:
def rerank_with_llm(query: str, candidates: List[Dict], top_k: int = 5) -> List[Dict]:
    """Stage 2: Re-rank candidates using LLM for relevance scoring"""
    
    # Prepare candidates for re-ranking
    candidate_texts = []
    for i, candidate in enumerate(candidates):
        doc = candidate["document"]
        candidate_texts.append(f"Document {i+1}: {doc['title']}\n{doc['content']}")
    
    # Create re-ranking prompt
    candidates_text = "\n\n".join(candidate_texts)
    
    rerank_prompt = f"""You are a relevance scoring system. Given a query and a list of documents, score each document's relevance to the query on a scale of 0-100.

Query: {query}

Documents:
{candidates_text}

Provide relevance scores in JSON format:
{{"scores": [score1, score2, score3, ...]}}

Consider:
- Direct relevance to the query
- Completeness of information
- Specificity to the question asked

JSON Response:"""
    
    # Get re-ranking scores from Nova
    response = bedrock_runtime.invoke_model(
        modelId=GENERATION_MODEL,
        body=json.dumps({
            "messages": [{
                "role": "user",
                "content": [{"text": rerank_prompt}]
            }],
            "inferenceConfig": {
                "maxTokens": 500,
                "temperature": 0.1
            }
        })
    )
    
    result = json.loads(response['body'].read())
    rerank_response = result['output']['message']['content'][0]['text']
    
    # Fall back to the initial similarity scores (scaled to 0-100) if parsing fails
    fallback_scores = [candidate["initial_score"] * 100 for candidate in candidates]
    try:
        # Extract the JSON object from the response
        json_match = re.search(r'\{.*\}', rerank_response, re.DOTALL)
        scores_data = json.loads(json_match.group()) if json_match else {}
        # Coerce to float so later sorting and formatting cannot fail on strings
        scores = [float(s) for s in scores_data.get("scores", fallback_scores)]
    except (json.JSONDecodeError, AttributeError, TypeError, ValueError):
        scores = fallback_scores
    
    # Add re-ranking scores to candidates
    for i, candidate in enumerate(candidates):
        if i < len(scores):
            candidate["rerank_score"] = scores[i]
        else:
            candidate["rerank_score"] = candidate["initial_score"] * 100
    
    # Sort by re-ranking score and return top_k
    candidates.sort(key=lambda x: x["rerank_score"], reverse=True)
    return candidates[:top_k]
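
The score-extraction step is the most fragile part of LLM-based re-ranking, since the model may return fewer scores than documents, non-numeric values, or values outside 0-100. A minimal sketch of a stricter parser follows. `parse_rerank_scores` is a hypothetical helper, and returning `None` to signal "fall back to the initial scores" is a design assumption rather than the notebook's behavior.

```python
import json
import re
from typing import List, Optional


def parse_rerank_scores(response_text: str, expected: int) -> Optional[List[float]]:
    """Return `expected` scores clamped to [0, 100], or None if parsing fails."""
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match is None:
        return None
    try:
        raw = json.loads(match.group()).get("scores")
    except (json.JSONDecodeError, AttributeError):
        return None
    # Reject responses that do not score every document exactly once.
    if not isinstance(raw, list) or len(raw) != expected:
        return None
    try:
        return [min(100.0, max(0.0, float(s))) for s in raw]
    except (TypeError, ValueError):
        return None


print(parse_rerank_scores('Here you go: {"scores": [95, 120, -3]}', 3))
print(parse_rerank_scores('{"scores": [95, 40]}', 3))
print(parse_rerank_scores("no JSON here", 3))
```

Rejecting a response with the wrong number of scores avoids silently mixing model scores with scaled similarity scores in one ranking.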

In [None]:
def generate_answer_with_reranked_context(query: str, reranked_docs: List[Dict]) -> str:
    """Stage 3: Generate answer using high-quality re-ranked context"""
    
    context_parts = []
    for doc_data in reranked_docs:
        doc = doc_data["document"]
        score = doc_data["rerank_score"]
        context_parts.append(f"[Relevance: {score:.1f}] {doc['title']}: {doc['content']}")
    
    context = "\n\n".join(context_parts)
    
    prompt = f"""Based on the following high-quality, re-ranked context, provide a comprehensive and accurate answer.

Context (with relevance scores):
{context}

Question: {query}

Answer:"""
    
    response = bedrock_runtime.invoke_model(
        modelId=GENERATION_MODEL,
        body=json.dumps({
            "messages": [{
                "role": "user",
                "content": [{"text": prompt}]
            }],
            "inferenceConfig": {
                "maxTokens": 400,
                "temperature": 0.1
            }
        })
    )
    
    result = json.loads(response['body'].read())
    return result['output']['message']['content'][0]['text']

In [None]:
def reranking_rag_pipeline(query: str) -> Dict:
    """Complete re-ranking RAG pipeline"""
    
    print(f"\n{'='*60}")
    print(f"RE-RANKING RAG PIPELINE")
    print(f"{'='*60}")
    print(f"Query: {query}\n")
    
    # Stage 1: Initial retrieval (top 20)
    print("Stage 1: Initial Retrieval (top 20)")
    print("-" * 40)
    initial_candidates = initial_retrieval(query, top_k=20)
    
    print("Top 5 initial candidates:")
    for i, candidate in enumerate(initial_candidates[:5], 1):
        doc = candidate["document"]
        score = candidate["initial_score"]
        print(f"  {i}. {doc['title']} (similarity: {score:.3f})")
    
    print(f"\nRetrieved {len(initial_candidates)} candidates for re-ranking\n")
    
    # Stage 2: Re-ranking (top 5)
    print("Stage 2: Re-ranking (top 5)")
    print("-" * 40)
    reranked_candidates = rerank_with_llm(query, initial_candidates, top_k=5)
    
    print("Re-ranked results:")
    for i, candidate in enumerate(reranked_candidates, 1):
        doc = candidate["document"]
        initial_score = candidate["initial_score"]
        rerank_score = candidate["rerank_score"]
        print(f"  {i}. {doc['title']}")
        print(f"     Initial: {initial_score:.3f} → Re-ranked: {rerank_score:.1f}")
    
    # Stage 3: Generation
    print(f"\nStage 3: Answer Generation")
    print("-" * 40)
    answer = generate_answer_with_reranked_context(query, reranked_candidates)
    
    print(f"Final Answer: {answer}")
    print(f"\n{'='*60}\n")
    
    return {
        "query": query,
        "initial_candidates": len(initial_candidates),
        "reranked_candidates": len(reranked_candidates),
        "final_sources": [c["document"]["title"] for c in reranked_candidates],
        "rerank_scores": [c["rerank_score"] for c in reranked_candidates],
        "answer": answer
    }

In [None]:
# Test re-ranking RAG pipeline
test_questions = [
    "How can I optimize Lambda costs?",
    "What are Lambda cold start issues and solutions?",
    "How to monitor Lambda function performance?",
    "Lambda security best practices?"
]

results = []
for question in test_questions:
    result = reranking_rag_pipeline(question)
    results.append(result)

In [None]:
# Compare with standard RAG (no re-ranking)
def standard_rag(query: str, top_k: int = 5) -> Dict:
    """Standard RAG without re-ranking for comparison"""
    
    # Direct retrieval without re-ranking
    candidates = initial_retrieval(query, top_k=top_k)
    
    # Generate answer directly
    context_parts = []
    for candidate in candidates:
        doc = candidate["document"]
        context_parts.append(f"{doc['title']}: {doc['content']}")
    
    context = "\n\n".join(context_parts)
    
    prompt = f"""Based on the following context, answer the question:

Context:
{context}

Question: {query}

Answer:"""
    
    response = bedrock_runtime.invoke_model(
        modelId=GENERATION_MODEL,
        body=json.dumps({
            "messages": [{
                "role": "user",
                "content": [{"text": prompt}]
            }],
            "inferenceConfig": {
                "maxTokens": 400,
                "temperature": 0.1
            }
        })
    )
    
    result = json.loads(response['body'].read())
    answer = result['output']['message']['content'][0]['text']
    
    return {
        "sources": [c["document"]["title"] for c in candidates],
        "scores": [c["initial_score"] for c in candidates],
        "answer": answer
    }

In [None]:
# Comparison: Re-ranking vs Standard RAG
print("COMPARISON: RE-RANKING vs STANDARD RAG")
print("="*50)

sample_query = "How to reduce Lambda costs?"
print(f"Query: {sample_query}\n")

# Standard RAG
print("Standard RAG (no re-ranking):")
standard_result = standard_rag(sample_query)
print("Sources:")
for i, (source, score) in enumerate(zip(standard_result["sources"], standard_result["scores"]), 1):
    print(f"  {i}. {source} (similarity: {score:.3f})")
print(f"\nAnswer: {standard_result['answer'][:150]}...\n")

# Re-ranking RAG
print("Re-ranking RAG:")
rerank_result = reranking_rag_pipeline(sample_query)
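
To see what re-ranking actually changed for this query, the two source lists can be compared directly. A minimal sketch follows. `compare_rankings` is a hypothetical helper, and the example lists are illustrative placeholders, not output from a real run.

```python
from typing import Dict, List


def compare_rankings(standard: List[str], reranked: List[str]) -> Dict:
    """Summarize overlap and position changes between two ranked title lists."""
    standard_pos = {title: i for i, title in enumerate(standard, 1)}
    shared = [t for t in reranked if t in standard_pos]
    union = set(standard) | set(reranked)
    return {
        "jaccard": len(shared) / len(union) if union else 1.0,
        "only_reranked": [t for t in reranked if t not in standard_pos],
        # Positive shift: the document moved up after re-ranking.
        "shifts": {
            t: standard_pos[t] - i
            for i, t in enumerate(reranked, 1)
            if t in standard_pos
        },
    }


# Illustrative placeholder lists, not real pipeline output.
standard = ["AWS Lambda Pricing", "Lambda Cost Optimization", "Lambda Memory Configuration"]
reranked = ["Lambda Cost Optimization", "AWS Lambda Pricing", "Lambda Performance Optimization"]
summary = compare_rankings(standard, reranked)
print(summary)
```

With the results above, the call would be `compare_rankings(standard_result["sources"], rerank_result["final_sources"])`.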

## Re-ranking Pipeline Benefits with Bedrock Knowledge Base

### Bedrock Knowledge Base Advantages:
✅ **Managed vector search**: No infrastructure to manage  
✅ **Built-in chunking**: Automatic document processing  
✅ **Optimized retrieval**: AWS-optimized vector search  
✅ **Scalable**: Handle millions of documents seamlessly  

### Standard RAG Limitations:
❌ **Single-stage retrieval** may miss nuanced relevance  
❌ **Vector similarity alone** doesn't capture query intent perfectly  
❌ **No quality refinement** of retrieved context  

### Re-ranking Pipeline Advantages:
✅ **Two-stage refinement**: Broad retrieval → Precise re-ranking  
✅ **LLM-based relevance scoring**: Better understanding of query intent  
✅ **Quality-focused context**: Only the most relevant documents for generation  
✅ **Improved answer quality**: Higher precision with focused context  
✅ **Managed infrastructure**: Bedrock handles vector operations  

### Pipeline Stages:
1. **Document Storage**: Upload documents to S3, create Knowledge Base
2. **Initial Retrieval (20 candidates)**: Use Bedrock vector search
3. **Re-ranking (5 best)**: LLM evaluates true relevance to query
4. **Generation**: High-quality context produces better answers

### When to Use Re-ranking with Knowledge Base:
- **High-quality requirements**: When answer accuracy is critical
- **Large document collections**: Bedrock scales automatically
- **Complex queries**: Multi-faceted questions needing precise context
- **Managed solution**: Want AWS to handle vector infrastructure

### Trade-offs:
- **Knowledge Base setup**: Initial configuration required
- **Higher processing cost and latency**: One additional LLM call per query for re-ranking
- **Better quality**: More relevant context for generation, at the price of that extra call
- **Managed scaling**: No infrastructure management needed
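
The extra cost of the re-ranking call can be measured rather than assumed. A minimal sketch of a timing wrapper follows. `timed` is a hypothetical helper, and `slow_stage` is a stand-in for a pipeline stage such as the re-ranker.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def slow_stage(x: int) -> int:
    time.sleep(0.05)  # stand-in for a model call
    return x * 2


value, seconds = timed(slow_stage, 21)
print(value, f"{seconds:.3f}s")
```

Wrapping each stage this way, for example `timed(rerank_with_llm, query, candidates)`, gives a per-query latency figure to weigh against any quality gain.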

In [None]:
# Performance analysis
print("RE-RANKING PIPELINE PERFORMANCE SUMMARY")
print("="*50)

total_rerank_score = 0
for result in results:
    avg_rerank_score = sum(result["rerank_scores"]) / len(result["rerank_scores"])
    total_rerank_score += avg_rerank_score
    
    print(f"Query: {result['query'][:40]}...")
    print(f"  Initial candidates: {result['initial_candidates']}")
    print(f"  Final candidates: {result['reranked_candidates']}")
    print(f"  Avg re-rank score: {avg_rerank_score:.1f}")
    print()

print(f"Overall average re-rank score: {total_rerank_score/len(results):.1f}")
print(f"\nDemo complete! Vector bucket: {VECTOR_BUCKET}")
print("The re-ranking pipeline refines retrieved context in two stages before generation.")
print("S3 storage enables cost-effective scaling for large document collections.")
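
The average re-rank score above is assigned by the same model that does the ranking, so it cannot by itself show that answers improved. With a small hand-labeled set of relevant document IDs per query, retrieval quality can be checked independently. The sketch below assumes such labels exist; `precision_at_k` and the example labels are hypothetical.

```python
from typing import Iterable, List


def precision_at_k(ranked_ids: List[str], relevant_ids: Iterable[str], k: int = 5) -> float:
    """Fraction of the top-k ranked IDs that appear in the labeled relevant set."""
    if k < 1:
        raise ValueError("k must be at least 1")
    top = ranked_ids[:k]
    if not top:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


# Hypothetical labels for the query "How can I optimize Lambda costs?"
relevant = {"lambda_costs", "lambda_pricing", "lambda_memory"}
ranked = ["lambda_costs", "lambda_vpc", "lambda_pricing", "lambda_layers", "lambda_memory"]
print(precision_at_k(ranked, relevant, k=5))
print(precision_at_k(ranked, relevant, k=1))
```

Here `ranked_ids` would come from the `doc_id` field of the candidates returned by each pipeline variant, so the standard and re-ranked lists can be scored against the same labels.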