# Vector Database Comparison Analysis
## Comparing Popular Vector Databases for Semantic Search

### Introduction
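This notebook compares popular vector databases: Pinecone, AWS OpenSearch, Milvus, Qdrant, Chroma, and Weaviate. We implement the same semantic search use case on each platform and time setup and query to evaluate performance, ease of use, and specific strengths. Implementations are included for Pinecone, AWS OpenSearch, and Milvus; the benchmark harness also lists Qdrant, Chroma, and Weaviate, whose setup and search functions are not defined in this notebook.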

### Setup and Dependencies



In [None]:
# Core dependencies
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
import time
from typing import List, Dict
import os
from dotenv import load_dotenv

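# Vector DB clients
# NOTE: the code below uses the legacy module-level Pinecone API (pinecone.init),
# available in pinecone-client 2.x, and the legacy `milvus` module shipped with
# pymilvus 1.x. Newer releases of both clients expose different APIs.
import pinecone
from opensearchpy import OpenSearch
from milvus import Milvus, IndexType, MetricType
from qdrant_client import QdrantClient
import chromadb
import weaviate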


In [None]:
# Load environment variables
load_dotenv()
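The setup functions below read their credentials with `os.getenv`, which silently returns `None` for unset variables. A minimal sketch of an early check, assuming the variable names used later in this notebook (the helper `missing_env_vars` is illustrative, not part of any client library):

```python
import os
from typing import Iterable, List, Mapping, Optional

# Variable names taken from the setup functions in this notebook
REQUIRED_ENV_VARS = [
    "PINECONE_API_KEY", "PINECONE_ENV",
    "OPENSEARCH_HOST", "OPENSEARCH_USER", "OPENSEARCH_PASS",
    "MILVUS_HOST", "MILVUS_PORT",
]

def missing_env_vars(names: Iterable[str],
                     env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Reporting every missing name at once tends to be friendlier than failing on the first database that needs one.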



### Data Preparation
We'll use a small sample of five product descriptions for our comparison:


In [None]:
# Sample product descriptions
products = [
    "High-performance wireless gaming mouse with RGB lighting",
    "Mechanical keyboard with Cherry MX switches",
    "27-inch 4K HDR gaming monitor with 144Hz refresh rate",
    "Noise-cancelling wireless headphones with 30-hour battery life",
    "Ergonomic office chair with lumbar support",
    # Add more products as needed
]

# Initialize sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings: a NumPy array with one 384-dimensional row per product
embeddings = model.encode(products)
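Before involving any database, a brute-force search over the embeddings gives a reference ranking to check each database's results against. A minimal sketch on plain Python lists with toy 3-dimensional vectors (the function name and the toy data are illustrative; the real embeddings are 384-dimensional):

```python
import math
from typing import List, Sequence, Tuple

def brute_force_top_k(query: Sequence[float],
                      vectors: Sequence[Sequence[float]],
                      top_k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) pairs for the top_k most similar vectors."""
    def cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        # Treat a zero vector as having no similarity to anything
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy data for illustration only
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
print(brute_force_top_k([1.0, 0.1, 0.0], toy_vectors, top_k=2))
```

With only a handful of products, this exhaustive scan is exact and fast, which makes it a convenient correctness check for the approximate indexes used below.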


### Implementation for Each Database

#### 1. Pinecone Implementation



In [None]:
def setup_pinecone():
    # Legacy pinecone-client 2.x API
    pinecone.init(
        api_key=os.getenv('PINECONE_API_KEY'),
        environment=os.getenv('PINECONE_ENV')
    )
    
    # Create index if it doesn't exist (384 = all-MiniLM-L6-v2 embedding size)
    if 'product-search' not in pinecone.list_indexes():
        pinecone.create_index('product-search', dimension=384, metric='cosine')
    
    index = pinecone.Index('product-search')
    
    # Insert all vectors in one request instead of one request per vector
    index.upsert(vectors=[
        (str(i), vector.tolist(), {'text': text})
        for i, (text, vector) in enumerate(zip(products, embeddings))
    ])
    
    return index

def search_pinecone(index, query_vector, top_k=3):
    # Return the top_k nearest matches together with their stored text metadata
    results = index.query(vector=query_vector.tolist(), top_k=top_k, include_metadata=True)
    return results
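For catalogs much larger than this sample, upserts are usually sent in batches rather than one vector or one giant request at a time. A minimal sketch of a batching helper, assuming records shaped like the `(id, values, metadata)` tuples used above (the helper `batched` and the batch size of 100 are illustrative choices, not Pinecone requirements):

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, List[float], dict]

def batched(records: Iterable[Record], batch_size: int = 100) -> Iterator[List[Record]]:
    """Yield successive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical usage with the index returned by setup_pinecone():
# for batch in batched(records, batch_size=100):
#     index.upsert(vectors=batch)
```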


#### 2. AWS OpenSearch Implementation


In [None]:
def setup_opensearch():
    client = OpenSearch(
        hosts=[{'host': os.getenv('OPENSEARCH_HOST'), 'port': 443}],
        http_auth=(os.getenv('OPENSEARCH_USER'), os.getenv('OPENSEARCH_PASS')),
        use_ssl=True,
        verify_certs=True
    )
    
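    # Create index with mapping; k-NN search requires index.knn to be enabled
    index_name = 'product-search'
    mapping = {
        'settings': {'index': {'knn': True}},
        'mappings': {
            'properties': {
                'vector': {'type': 'knn_vector', 'dimension': 384},
                'text': {'type': 'text'}
            }
        }
    }
    
    # Skip creation on re-runs, since creating an existing index raises an error
    if not client.indices.exists(index=index_name):
        client.indices.create(index=index_name, body=mapping)
    
    # Insert documents
    for i, (text, vector) in enumerate(zip(products, embeddings)):
        doc = {
            'vector': vector.tolist(),
            'text': text
        }
        client.index(index=index_name, id=str(i), body=doc)
    
    # Refresh so the new documents are visible to the search that follows
    client.indices.refresh(index=index_name)
    
    return client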

def search_opensearch(client, query_vector, top_k=3):
    query = {
        'size': top_k,
        'query': {
            'knn': {
                'vector': {
                    'vector': query_vector.tolist(),
                    'k': top_k
                }
            }
        }
    }
    response = client.search(index='product-search', body=query)
    return response
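Each client returns results in its own shape, so comparing rankings across databases is easier after converting them to a common form. A minimal sketch for the OpenSearch response, which nests matches under `hits.hits` (the helper `extract_hits` is illustrative, and the sample response is hand-written with made-up scores, not real output):

```python
from typing import Any, Dict, List, Tuple

def extract_hits(response: Dict[str, Any]) -> List[Tuple[str, float, str]]:
    """Convert a search response into (id, score, text) tuples."""
    hits = response.get('hits', {}).get('hits', [])
    return [(hit['_id'], hit['_score'], hit['_source']['text']) for hit in hits]

# Hand-written response fragment for illustration only
sample_response = {
    'hits': {
        'hits': [
            {'_id': '0', '_score': 0.9,
             '_source': {'text': 'High-performance wireless gaming mouse with RGB lighting'}},
            {'_id': '1', '_score': 0.5,
             '_source': {'text': 'Mechanical keyboard with Cherry MX switches'}},
        ]
    }
}

for doc_id, score, text in extract_hits(sample_response):
    print(doc_id, score, text)
```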


#### 3. Milvus Implementation


In [None]:
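def setup_milvus():
    # Uses the legacy pymilvus 1.x client API
    client = Milvus(host=os.getenv('MILVUS_HOST'), port=os.getenv('MILVUS_PORT'))
    
    # Create collection
    collection_name = 'product_search'
    dimension = 384
    
    collection_param = {
        'collection_name': collection_name,
        'dimension': dimension,
        'index_file_size': 1024,
        # Inner product matches cosine ranking only for unit-length vectors
        'metric_type': MetricType.IP
    }
    
    # Skip creation on re-runs if the collection already exists
    status, exists = client.has_collection(collection_name)
    if not exists:
        client.create_collection(collection_param)
    
    # Insert vectors
    status, ids = client.insert(collection_name=collection_name,
                              records=embeddings.tolist(),
                              ids=list(range(len(products))))
    # Flush so the inserted vectors are persisted before indexing and search
    client.flush([collection_name])
    
    # Create index; nlist (number of clusters) is kept small for this tiny sample
    index_param = {
        'nlist': 128
    }
    client.create_index(collection_name, IndexType.IVF_FLAT, index_param)
    
    return client

def search_milvus(client, query_vector, top_k=3):
    # nprobe is the number of clusters scanned per query
    status, results = client.search(
        collection_name='product_search',
        query_records=[query_vector.tolist()],
        top_k=top_k,
        params={'nprobe': 16}
    )
    return results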
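The Milvus collection above uses the inner product metric (`MetricType.IP`). Inner product ranks results the same way as cosine similarity only when the vectors have unit length, so it may be worth verifying that for the embedding model in use, or normalizing before insertion. A minimal pure-Python sketch of the relationship (the helper names are illustrative and not part of the Milvus client):

```python
import math
from typing import List, Sequence

def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(vector, vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

a, b = [3.0, 4.0], [1.0, 0.0]
raw_ip = inner_product(a, b)                                # depends on vector length
unit_ip = inner_product(l2_normalize(a), l2_normalize(b))  # equals the cosine similarity
print(raw_ip, unit_ip, cosine_similarity(a, b))
```

With the real embeddings, the same normalization could be applied row by row before calling `client.insert`, and to the query vector before searching.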


### Performance Comparison



In [None]:
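def run_performance_comparison(query_text: str = "gaming peripheral with RGB"):
    results = {}
    query_vector = model.encode([query_text])[0]
    
    # Look up each database's functions by name so that databases without an
    # implementation in this notebook are reported instead of raising NameError
    dbs = {
        'Pinecone': ('setup_pinecone', 'search_pinecone'),
        'OpenSearch': ('setup_opensearch', 'search_opensearch'),
        'Milvus': ('setup_milvus', 'search_milvus'),
        'Qdrant': ('setup_qdrant', 'search_qdrant'),
        'Chroma': ('setup_chroma', 'search_chroma'),
        'Weaviate': ('setup_weaviate', 'search_weaviate')
    }
    
    for db_name, (setup_name, search_name) in dbs.items():
        setup_fn = globals().get(setup_name)
        search_fn = globals().get(search_name)
        if setup_fn is None or search_fn is None:
            results[db_name] = {'status': 'Not implemented'}
            continue
        
        try:
            # Setup (connection, index creation, and data upload)
            setup_start = time.perf_counter()
            client = setup_fn()
            setup_time = time.perf_counter() - setup_start
            
            # Search (a single query, so the timing includes network latency)
            search_start = time.perf_counter()
            _ = search_fn(client, query_vector)
            search_time = time.perf_counter() - search_start
            
            results[db_name] = {
                'setup_time': setup_time,
                'search_time': search_time,
                'status': 'Success'
            }
            
        except Exception as e:
            results[db_name] = {
                'status': f'Failed: {e}'
            }
    
    # One row per database; fixed columns so later cells work even if every run fails
    return pd.DataFrame.from_dict(results, orient='index').reindex(
        columns=['setup_time', 'search_time', 'status']
    )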
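A single timed query is noisy, since it can include connection setup and cold caches. One common remedy is to repeat the call and report the median. A minimal sketch, assuming a zero-argument callable wraps the search (the helper `time_repeated` and its default repeat counts are illustrative):

```python
import statistics
import time
from typing import Callable, Dict

def time_repeated(fn: Callable[[], object],
                  repeats: int = 20,
                  warmup: int = 2) -> Dict[str, float]:
    """Call fn repeatedly and summarize the wall-clock time per call in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        'median': statistics.median(samples),
        'min': min(samples),
        'max': max(samples),
    }

# Stand-in workload; in this notebook the callable would be something like
# lambda: search_fn(client, query_vector)
stats = time_repeated(lambda: sum(range(1000)), repeats=5)
print(stats)
```

The median is less sensitive than the mean to an occasional slow request, which matters when the database is reached over a network.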


### Results Analysis



In [2]:
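# Run comparison
comparison_results = run_performance_comparison()

# Create visualization
import matplotlib.pyplot as plt

# Keep only the databases that completed both setup and search
successful_dbs = comparison_results[comparison_results['status'] == 'Success']

if successful_dbs.empty:
    # Nothing to plot; show the status messages instead
    print(comparison_results)
else:
    plt.figure(figsize=(12, 6))

    # Timings can be stored with object dtype, so cast to float before plotting
    plt.subplot(1, 2, 1)
    successful_dbs['setup_time'].astype(float).plot(kind='bar')
    plt.title('Setup Time Comparison')
    plt.ylabel('Time (seconds)')

    plt.subplot(1, 2, 2)
    successful_dbs['search_time'].astype(float).plot(kind='bar')
    plt.title('Search Time Comparison')
    plt.ylabel('Time (seconds)')

    plt.tight_layout()
    plt.show()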



### Key Findings

Based on the implementations above and the timings produced by `run_performance_comparison()`:

1. **Ease of Setup**:
   - Pinecone: Simplest setup with minimal configuration
   - AWS OpenSearch: Requires more complex configuration but integrates well with AWS
   - Milvus: More setup overhead but offers fine-grained control

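2. **Performance**:
   - Query latency: the `search_time` column above; with only five vectors and a single query, it is likely dominated by connection and network overhead rather than index speed
   - Scalability considerations (not measured here; requires a larger dataset)
   - Memory usage patterns (not measured here)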

3. **Feature Comparison**:
   - Real-time update capabilities
   - Supported distance metrics
   - Additional functionality (filtering, metadata storage)

4. **Cost Considerations**:
   - Managed vs self-hosted options
   - Pricing models
   - Operational overhead

### Conclusion

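Each vector database has its strengths. The notes on Qdrant, Chroma, and Weaviate reflect their general positioning rather than tests run in this notebook: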

- Pinecone: Best for quick deployment and scaling
- AWS OpenSearch: Ideal for AWS-integrated applications
- Milvus: Strong choice for custom deployments
- Qdrant: Excellent for real-time updates
- Chroma: Great for rapid prototyping
- Weaviate: Powerful for complex data models with GraphQL

The choice depends on specific use case requirements, infrastructure preferences, and scaling needs.
