# S3 Vectors Hybrid Search Demo
**Pattern:** Hybrid search combining S3 Vectors (semantic) + keyword search (lexical)

**Components:**
- Amazon S3 Vectors for semantic search
- BM25 keyword search
- Reciprocal Rank Fusion (RRF)
- Amazon Bedrock for embeddings and generation

In [1]:
import boto3
import json
import numpy as np
import time
from typing import List, Dict, Tuple
from collections import Counter
import math
import re

In [2]:
# Configuration
REGION = "us-east-1"
VECTOR_BUCKET = f"hybrid-vectors-{int(time.time())}"
INDEX_NAME = "documents"
EMBEDDING_MODEL = "amazon.titan-embed-text-v2:0"
GENERATION_MODEL = "amazon.nova-pro-v1:0"

# Initialize clients in the configured region
bedrock_runtime = boto3.client('bedrock-runtime', region_name=REGION)
s3_vectors = boto3.client('s3vectors', region_name=REGION)

## 1. Set Up S3 Vector Bucket and Index

In [6]:
def create_vector_infrastructure():
    """Create S3 vector bucket and index"""
    try:
        # Create vector bucket
        s3_vectors.create_vector_bucket(
            vectorBucketName=VECTOR_BUCKET
        )
        print(f"Created vector bucket: {VECTOR_BUCKET}")
        
        # Create vector index
        s3_vectors.create_index(
            vectorBucketName=VECTOR_BUCKET,
            indexName=INDEX_NAME,
            dimension=1024,  # Titan v2 dimensions
            dataType="float32",
            distanceMetric="cosine",
            metadataConfiguration={"nonFilterableMetadataKeys": ["content", "source"]}
        )
        print(f"Created index: {INDEX_NAME}")
        
    except Exception as e:
        print(f"Setup error: {e}")

create_vector_infrastructure()

Created vector bucket: hybrid-vectors-1767740568
Created index: documents


## 2. Document Processing and Embedding

In [7]:
def get_embedding(text: str) -> List[float]:
    """Generate embedding using Bedrock"""
    response = bedrock_runtime.invoke_model(
        modelId=EMBEDDING_MODEL,
        body=json.dumps({"inputText": text})
    )
    return json.loads(response['body'].read())['embedding']

def chunk_document(text: str, chunk_size: int = 500) -> List[str]:
    """Simple text chunking"""
    words = text.split()
    return [' '.join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]
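
To make the behaviour of `chunk_document` concrete, here is a small self-contained sketch. It repeats the function from the cell above and adds a hypothetical `chunk_with_overlap` variant (not part of the original notebook) that carries a few words over between neighbouring chunks, a common way to avoid losing context at a chunk boundary. The `overlap` parameter and the tiny sizes used here are assumptions chosen for illustration only.

```python
from typing import List


def chunk_document(text: str, chunk_size: int = 500) -> List[str]:
    """Word-based chunking, same logic as the notebook cell."""
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Hypothetical variant: consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


sample = "one two three four five six seven"
plain_chunks = chunk_document(sample, chunk_size=3)
overlap_chunks = chunk_with_overlap(sample, chunk_size=3, overlap=1)
print(plain_chunks)    # ['one two three', 'four five six', 'seven']
print(overlap_chunks)  # ['one two three', 'three four five', 'five six seven']
```

The sample documents in this demo are single sentences, so they are stored whole; chunking only matters once longer texts are indexed.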

In [9]:
# Sample documents
documents = [
    "Amazon S3 Vectors provides cost-effective vector storage for AI applications with sub-second query performance.",
    "Machine learning models require vector embeddings to represent text data in high-dimensional space.",
    "Retrieval Augmented Generation combines vector search with large language models for better responses.",
    "Hybrid search merges semantic similarity with keyword matching for comprehensive document retrieval.",
    "Amazon Bedrock offers foundation models for embedding generation and text completion tasks."
]

# Process and store documents
def store_documents():
    vectors_to_insert = []
    
    for i, doc in enumerate(documents):
        embedding = get_embedding(doc)
        
        vector_data = {
            "key": f"doc_{i}",
            "data": {"float32": embedding},
            "metadata": {
                "content": doc,
                "source": f"document_{i}",
                "type": "text"
            }
        }
        vectors_to_insert.append(vector_data)
    
    # Insert vectors into S3 Vectors
    s3_vectors.put_vectors(
        vectorBucketName=VECTOR_BUCKET,
        indexName=INDEX_NAME,
        vectors=vectors_to_insert
    )
    print(f"Stored {len(vectors_to_insert)} vectors")
    return documents

stored_docs = store_documents()

Stored 5 vectors
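
`store_documents` sends all vectors in a single `put_vectors` call, which is fine for five documents. For larger collections the request would probably need to be split, since the service limits how many vectors one call may carry. The sketch below shows one way to batch the records; the helper name `batched` and the batch size of 3 are assumptions for illustration, and the actual per-request limit should be taken from the S3 Vectors documentation.

```python
from typing import Dict, Iterator, List


def batched(items: List[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of `items` with at most `batch_size` entries."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in records shaped like the entries built in store_documents()
fake_vectors = [
    {"key": f"doc_{i}", "data": {"float32": [0.0] * 4}, "metadata": {"content": f"text {i}"}}
    for i in range(7)
]

batches = list(batched(fake_vectors, batch_size=3))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)  # [3, 3, 1]

# In the notebook, each batch would then be written with:
# s3_vectors.put_vectors(vectorBucketName=VECTOR_BUCKET, indexName=INDEX_NAME, vectors=batch)
```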


## 3. BM25 Keyword Search Implementation

In [10]:
class BM25:
    def __init__(self, documents: List[str], k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.documents = documents
        self.doc_count = len(documents)
        
        # Tokenize and build index
        self.tokenized_docs = [self._tokenize(doc) for doc in documents]
        self.doc_lengths = [len(doc) for doc in self.tokenized_docs]
        self.avg_doc_length = sum(self.doc_lengths) / self.doc_count
        
        # Build term frequency and document frequency
        self.term_frequencies = []
        self.document_frequencies = Counter()
        
        for doc in self.tokenized_docs:
            tf = Counter(doc)
            self.term_frequencies.append(tf)
            for term in tf.keys():
                self.document_frequencies[term] += 1
    
    def _tokenize(self, text: str) -> List[str]:
        return re.findall(r'\b\w+\b', text.lower())
    
    def search(self, query: str, top_k: int = 5) -> List[Tuple[int, float]]:
        query_terms = self._tokenize(query)
        scores = []
        
        for doc_idx in range(self.doc_count):
            score = 0
            doc_length = self.doc_lengths[doc_idx]
            
            for term in query_terms:
                if term in self.term_frequencies[doc_idx]:
                    tf = self.term_frequencies[doc_idx][term]
                    df = self.document_frequencies[term]
                    idf = math.log((self.doc_count - df + 0.5) / (df + 0.5))
                    
                    score += idf * (tf * (self.k1 + 1)) / (
                        tf + self.k1 * (1 - self.b + self.b * doc_length / self.avg_doc_length)
                    )
            
            scores.append((doc_idx, score))
        
        return sorted(scores, key=lambda x: x[1], reverse=True)[:top_k]

# Initialize BM25
bm25 = BM25(stored_docs)
print("BM25 index created")

BM25 index created
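
One property of the IDF term used in `search` is worth seeing in numbers. The expression `log((N - df + 0.5) / (df + 0.5))` turns negative as soon as a term occurs in more than half of the documents. In the five sample documents the token `vector` occurs in three, so a document that matches only `vector` ends up with a negative score and ranks below documents that match nothing at all. A common alternative adds 1 inside the logarithm, which keeps every IDF positive. The sketch below compares the two; the function names are hypothetical and the smoothed variant is offered as an option, not as what the notebook implements.

```python
import math


def idf_classic(doc_count: int, df: int) -> float:
    """IDF as written in the BM25 class above."""
    return math.log((doc_count - df + 0.5) / (df + 0.5))


def idf_smoothed(doc_count: int, df: int) -> float:
    """Variant with +1 inside the log; always positive."""
    return math.log(1 + (doc_count - df + 0.5) / (df + 0.5))


N = 5  # number of sample documents
for df in range(1, N + 1):
    print(f"df={df}  classic={idf_classic(N, df):+.3f}  smoothed={idf_smoothed(N, df):+.3f}")
```

With `N = 5`, the classic value is positive for `df` of 1 or 2 and negative from `df = 3` upward, while the smoothed value stays above zero and still decreases as the term becomes more common.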


## 4. Hybrid Search Implementation

In [17]:
def vector_search(query: str, top_k: int = 5) -> List[Tuple[str, float, str]]:
    """Semantic search using S3 Vectors"""
    query_embedding = get_embedding(query)
    
    response = s3_vectors.query_vectors(
        vectorBucketName=VECTOR_BUCKET,
        indexName=INDEX_NAME,
        queryVector={"float32": query_embedding},
        topK=top_k,
        returnDistance=True,
        returnMetadata=True
    )
    
    results = []
    for result in response['vectors']:
        key = result['key']
        distance = result['distance']
        content = result['metadata']['content']
        similarity = 1 - distance  # Convert distance to similarity
        results.append((key, similarity, content))
    
    return results

def reciprocal_rank_fusion(vector_results: List, bm25_results: List, k: int = 60) -> List[Tuple[str, float]]:
    """Combine rankings using RRF"""
    scores = {}
    
    # Process vector results
    for rank, (doc_key, similarity, content) in enumerate(vector_results):
        doc_idx = int(doc_key.split('_')[1])
        scores[doc_idx] = scores.get(doc_idx, 0) + 1 / (k + rank + 1)
    
    # Process BM25 results
    for rank, (doc_idx, bm25_score) in enumerate(bm25_results):
        scores[doc_idx] = scores.get(doc_idx, 0) + 1 / (k + rank + 1)
    
    # Sort by combined score
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

def hybrid_search(query: str, top_k: int = 3) -> List[Dict]:
    """Perform hybrid search combining vector and keyword search"""
    # Get results from both methods
    vector_results = vector_search(query, top_k)
    bm25_results = bm25.search(query, top_k)
    
    print(f"Vector results: {[(r[0], f'{r[1]:.3f}') for r in vector_results]}")
    print(f"BM25 results: {[(r[0], f'{r[1]:.3f}') for r in bm25_results]}")
    
    # Combine using RRF
    fused_results = reciprocal_rank_fusion(vector_results, bm25_results)
    
    # Format final results
    final_results = []
    for doc_idx, rrf_score in fused_results[:top_k]:
        final_results.append({
            'document': stored_docs[doc_idx],
            'doc_id': doc_idx,
            'rrf_score': rrf_score
        })
    
    return final_results
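
The fusion step can be checked by hand. RRF ignores the raw scores and gives each document `1 / (k + rank + 1)` per list, with `rank` starting at 0, then sums across lists. The sketch below is a stripped-down version of `reciprocal_rank_fusion` that works on plain lists of document indices; `rrf_scores` is a hypothetical helper, and the two rankings are the ones recorded in this notebook's test run for the query "machine learning embeddings".

```python
from typing import Dict, List


def rrf_scores(rankings: List[List[int]], k: int = 60) -> Dict[int, float]:
    """Sum 1 / (k + rank + 1) over every ranking a document appears in."""
    scores: Dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return scores


vector_ranking = [1, 4, 2]  # doc_1, doc_4, doc_2 from the vector search
bm25_ranking = [1, 0, 2]    # document indices from BM25

scores = rrf_scores([vector_ranking, bm25_ranking])
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for doc_id, score in ranked:
    print(doc_id, f"{score:.4f}")
# 1 0.0328   (1/61 + 1/61)
# 2 0.0317   (1/63 + 1/63)
# 4 0.0161   (1/62)
# 0 0.0161   (1/62)
```

A document that tops both lists gets 2/61, and a document that appears in only one list can never score above 1/61, which is why agreement between the two retrievers dominates the fused order.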

## 5. Test Hybrid Search

In [18]:
# Test queries
test_queries = [
    "vector storage cost effective",
    "machine learning embeddings",
    "hybrid search semantic keyword"
]

for query in test_queries:
    print(f"\n=== Query: '{query}' ===")
    results = hybrid_search(query)
    
    for i, result in enumerate(results, 1):
        print(f"{i}. [Score: {result['rrf_score']:.4f}] {result['document'][:100]}...")


=== Query: 'vector storage cost effective' ===
Vector results: [('doc_0', '0.535'), ('doc_1', '0.238'), ('doc_2', '0.152')]
BM25 results: [(0, '2.722'), (3, '0.000'), (4, '0.000')]
1. [Score: 0.0328] Amazon S3 Vectors provides cost-effective vector storage for AI applications with sub-second query p...
2. [Score: 0.0161] Machine learning models require vector embeddings to represent text data in high-dimensional space....
3. [Score: 0.0161] Hybrid search merges semantic similarity with keyword matching for comprehensive document retrieval....

=== Query: 'machine learning embeddings' ===
Vector results: [('doc_1', '0.644'), ('doc_4', '0.303'), ('doc_2', '0.196')]
BM25 results: [(1, '3.231'), (0, '0.000'), (2, '0.000')]
1. [Score: 0.0328] Machine learning models require vector embeddings to represent text data in high-dimensional space....
2. [Score: 0.0317] Retrieval Augmented Generation combines vector search with large language models for better response...
3. [Score: 0.0161] Amazon
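
In the first query's output above, BM25 lists documents 3 and 4 with a score of 0.000, meaning they share no term with the query. Because RRF only looks at rank positions, those entries still earn `1 / (k + rank + 1)`, which is how document 3 reaches the fused top three. One possible refinement, sketched here under the assumption that non-matching documents should contribute nothing, is to drop them before fusion. `drop_non_matches` is a hypothetical helper, not part of the notebook.

```python
from typing import List, Tuple


def drop_non_matches(bm25_results: List[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Keep only documents with a positive BM25 score, preserving order."""
    return [(doc_idx, score) for doc_idx, score in bm25_results if score > 0]


# BM25 results recorded for the query 'vector storage cost effective'
recorded = [(0, 2.722), (3, 0.0), (4, 0.0)]
filtered = drop_non_matches(recorded)
print(filtered)  # [(0, 2.722)]
```

Inside `hybrid_search`, the filter would be applied to the output of `bm25.search(query, top_k)` before it is passed to `reciprocal_rank_fusion`.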

## 6. RAG with Hybrid Search

In [27]:
def generate_response(query: str, context_docs: List[str]) -> str:
    """Generate response using Bedrock"""
    context = "\n\n".join([f"Document {i+1}: {doc}" for i, doc in enumerate(context_docs)])
    
    prompt = f"""Based on the following context, answer the question.

Context:
{context}

Question: {query}

Answer:"""
    
    response = bedrock_runtime.invoke_model(
        modelId=GENERATION_MODEL,
        body=json.dumps({
            "schemaVersion": "messages-v1",
            "messages": [{"role": "user", "content": [{"text": prompt}]}],
            "inferenceConfig": {"maxTokens": 500, "temperature": 0.1}
        })
    )
    
    return json.loads(response['body'].read())['output']['message']['content'][0]['text']

def hybrid_rag(query: str) -> Tuple[str, List[Dict]]:
    """Complete RAG pipeline with hybrid search"""
    # Retrieve relevant documents
    search_results = hybrid_search(query)
    context_docs = [result['document'] for result in search_results]
    
    # Generate response
    response = generate_response(query, context_docs)
    
    return response, search_results

In [28]:
# Test RAG pipeline
question = "How does S3 Vectors help with cost-effective AI applications?"

answer, sources = hybrid_rag(question)

print(f"Question: {question}")
print(f"\nAnswer: {answer}")
print(f"\nSources used:")
for i, source in enumerate(sources, 1):
    print(f"{i}. [RRF: {source['rrf_score']:.4f}] {source['document'][:80]}...")

Vector results: [('doc_0', '0.803'), ('doc_1', '0.316'), ('doc_4', '0.242')]
BM25 results: [(0, '5.753'), (1, '0.000'), (4, '0.000')]
Question: How does S3 Vectors help with cost-effective AI applications?

Answer: Amazon S3 Vectors helps with cost-effective AI applications by providing a specialized storage solution designed for vector data, which is commonly used in machine learning models. Here's how it achieves cost-effectiveness:

1. **Efficient Storage**: S3 Vectors is optimized for storing vector embeddings, which are essential for representing text data in high-dimensional space as mentioned in Document 2. This specialized storage likely reduces the overhead and inefficiency associated with storing such data in general-purpose storage solutions.

2. **Sub-second Query Performance**: As highlighted in Document 1, S3 Vectors offers sub-second query performance. This rapid access to vector data means that AI applications can retrieve the necessary embeddings quickly, reducing la

## 7. Cleanup Resources

In [30]:
def cleanup():
    """Clean up S3 Vectors resources"""
    try:
        # Delete index
        s3_vectors.delete_index(
            vectorBucketName=VECTOR_BUCKET,
            indexName=INDEX_NAME
        )
        print(f"Deleted index: {INDEX_NAME}")
        
        # Delete vector bucket
        s3_vectors.delete_vector_bucket(
            vectorBucketName=VECTOR_BUCKET
        )
        print(f"Deleted vector bucket: {VECTOR_BUCKET}")
        
    except Exception as e:
        print(f"Cleanup error: {e}")

# Delete the demo resources (comment this call out to keep them)
cleanup()

Deleted index: documents
Deleted vector bucket: hybrid-vectors-1767740568
