# Similarity Search with pgvector and Amazon Aurora PostgreSQL

## Learning Objectives

1. Use HuggingFace's sentence transformer model [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) to generate embeddings
2. Store and query vector embeddings using pgvector in Aurora PostgreSQL  
3. Implement semantic search using LangChain's vector store capabilities
4. Calculate similarity scores between text documents

## Install Dependencies

Install required Python libraries for the setup.

In [1]:
# Install sentencepiece for tokenization (required by transformer models)
# Suppress conda output
!conda install -c conda-forge sentencepiece -y > /dev/null 2>&1
print("✅ Sentencepiece installed")

✅ Sentencepiece installed


In [2]:
%%writefile requirements1.txt
# First set of dependencies
langchain==0.2.16
langchain-community==0.2.17
langchain-postgres==0.0.15
psycopg2-binary==2.9.10
pgvector==0.2.5
python-dotenv==1.0.0

Overwriting requirements1.txt


In [3]:
%%writefile requirements2.txt
# Second set of dependencies  
sentence-transformers==2.2.2
pandas==2.0.3
numpy==1.24.3
torch
transformers

Overwriting requirements2.txt


In [4]:
# Install packages in two steps to avoid conflicts
# Suppress pip output for cleaner display
!pip install -r requirements1.txt -q 2>/dev/null
!pip install -r requirements2.txt -q 2>/dev/null

print("✅ Installation complete!")

✅ Installation complete!


## Open-source extension pgvector for PostgreSQL

[pgvector](https://github.com/pgvector/pgvector) is an open-source extension for PostgreSQL that allows you to store and search vector embeddings for exact and approximate nearest neighbor search.

Key features:
- Store embeddings alongside regular data
- Exact and approximate nearest neighbor search
- L2, inner product, and cosine distance metrics
- IVFFlat and HNSW indexes for fast search

In [5]:
# Import required libraries and setup environment
import warnings
import os
import logging

# Suppress all warnings before imports
warnings.filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # Suppress TensorFlow warnings
os.environ['CUDA_VISIBLE_DEVICES'] = ''  # Disable CUDA to avoid GPU warnings

# Suppress specific library warnings
import sys
if 'ipykernel' in sys.modules:
    # Suppress tqdm warnings in notebook
    import tqdm
    tqdm.tqdm = tqdm.std.tqdm

# Suppress transformers and torch warnings
os.environ["TRANSFORMERS_VERBOSITY"] = "error"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Disable torchvision beta warnings
try:
    import torchvision
    torchvision.disable_beta_transforms_warning()
except:
    pass

# Set logging level to ERROR only
logging.getLogger().setLevel(logging.ERROR)
logging.getLogger('InstructorEmbedding').setLevel(logging.ERROR)
logging.getLogger('sentence_transformers').setLevel(logging.ERROR)

# Now import the rest of the libraries
from dotenv import load_dotenv
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_postgres.vectorstores import PGVector
from langchain_postgres import PGVector
from langchain.docstore.document import Document

# Load environment variables
load_dotenv()

# Try to use HuggingFaceInstructEmbeddings, fall back to regular HuggingFaceEmbeddings if not available
try:
    from langchain_community.embeddings import HuggingFaceInstructEmbeddings
    
    # Suppress the INSTRUCTOR_Transformer loading message
    import io
    from contextlib import redirect_stdout, redirect_stderr
    
    with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
        embeddings = HuggingFaceInstructEmbeddings(
            model_name="sentence-transformers/all-mpnet-base-v2",
            model_kwargs={'device': 'cpu'}
        )
    print("✅ Using HuggingFaceInstructEmbeddings")
    
except (ImportError, Exception) as e:
    # Fallback to regular HuggingFaceEmbeddings which works with the same model
    from langchain_community.embeddings import HuggingFaceEmbeddings
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-mpnet-base-v2",
        model_kwargs={'device': 'cpu'},
        encode_kwargs={'normalize_embeddings': True}
    )
    print("✅ Using HuggingFaceEmbeddings (fallback)")

print("✅ Environment setup complete!")
print(f"📊 Using embedding model: all-mpnet-base-v2")

✅ Using HuggingFaceInstructEmbeddings
✅ Environment setup complete!
📊 Using embedding model: all-mpnet-base-v2


In [6]:
# Database connection configuration
# Using environment variables from .env file

DB_HOST = os.getenv('PGVECTOR_HOST')
DB_PORT = os.getenv('PGVECTOR_PORT', '5432')
DB_NAME = os.getenv('PGVECTOR_DATABASE')
DB_USER = os.getenv('PGVECTOR_USER')
DB_PASSWORD = os.getenv('PGVECTOR_PASSWORD')
DB_DRIVER = os.getenv('PGVECTOR_DRIVER', 'psycopg2')

# Build connection string
CONNECTION_STRING = f"postgresql+{DB_DRIVER}://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}"

# Collection name for vector store
COLLECTION_NAME = "hotel_reviews_langchain"

# Display configuration (masking password)
display_connection = CONNECTION_STRING.replace(DB_PASSWORD, "****")
print(f"📊 Database Configuration:")
print(f"   Host: {DB_HOST}")
print(f"   Database: {DB_NAME}")
print(f"   Collection: {COLLECTION_NAME}")

📊 Database Configuration:
   Host: apgpg-pgvector.cluster-cng8i4a88jth.us-west-2.rds.amazonaws.com
   Database: postgres
   Collection: hotel_reviews_langchain


## Load Test Data

Load hotel reviews data from CSV file. The file should have a 'comments' column with review text.

In [7]:
# Load data from CSV file
import pandas as pd
import os

# Check for the actual data file
data_file = './fictitious_hotel_reviews_trimmed_500.csv'
if not os.path.exists(data_file):
    # Try alternative path
    data_file = './data/fictitious_hotel_reviews_trimmed_500.csv'
    
if not os.path.exists(data_file):
    print("⚠️ Data file not found. Creating sample data...")
    # Create more diverse sample data if file doesn't exist
    os.makedirs('./data', exist_ok=True)
    
    sample_reviews = [
        "Excellent service and beautiful rooms. The staff was very helpful and the breakfast was amazing.",
        "Great location near the beach. Pool area was fantastic! Very family friendly.",
        "Amazing mountain views. Perfect for a peaceful getaway. Very quiet and relaxing.",
        "Convenient location but rooms were a bit small. Good value for money though.",
        "Beautiful lake views. Restaurant food was delicious. Will definitely come back.",
        "The room was spotlessly clean and the bed was very comfortable. Great night's sleep.",
        "Staff went above and beyond to help us. Really appreciated their hospitality.",
        "Loved the spa facilities. Very relaxing atmosphere throughout the hotel.",
        "Business center was well equipped. Perfect for work trips.",
        "Kids loved the pool and game room. Great family vacation spot.",
        "Room service was prompt and the food quality was excellent.",
        "The concierge helped us plan our entire itinerary. Very knowledgeable.",
        "Gym facilities were modern and well-maintained. Appreciated the 24-hour access.",
        "The rooftop bar had amazing views of the city. Great cocktails too.",
        "Breakfast buffet had lots of options including healthy choices.",
        "Location was perfect - walking distance to all major attractions.",
        "The hotel shuttle service to the airport was very convenient.",
        "Loved the boutique feel of this hotel. Very unique decor.",
        "Conference facilities were excellent for our business meeting.",
        "The pet-friendly policy was great. Our dog was well taken care of."
    ]
    
    # Create more varied data
    import random
    comments = []
    for _ in range(100):
        comments.append(random.choice(sample_reviews))
    
    sample_data = pd.DataFrame({'comments': comments})
    data_file = './data/fictitious_hotel_reviews_trimmed_500.csv'
    sample_data.to_csv(data_file, index=False)
    print(f"✅ Created sample data file with {len(comments)} reviews")

# Load data using LangChain's CSVLoader
# The CSVLoader will treat each row as a document
loader = CSVLoader(
    file_path=data_file,
    encoding='utf-8',
    csv_args={'delimiter': ','}
)
data = loader.load()

print(f"✅ Loaded {len(data)} documents from {data_file}")
print(f"\nFirst 3 reviews:")
for i, doc in enumerate(data[:3], 1):
    print(f"\n{i}. {doc.page_content[:150]}...")

✅ Loaded 500 documents from ./data/fictitious_hotel_reviews_trimmed_500.csv

First 3 reviews:

1. ﻿comments: nice hotel expensive parking got good deal stay hotel anniversary, arrived late evening took advice previous reviews did valet parking, che...

2. ﻿comments: ok nothing special charge diamond member hilton decided chain shot 20th anniversary seattle, start booked suite paid extra website descript...

3. ﻿comments: nice rooms not 4 star experience hotel monaco seattle good hotel n't 4 star level.positives large bathroom mediterranean suite comfortable ...


## Split Text into Chunks

Split documents into smaller chunks for better retrieval performance.

In [8]:
# Initialize text splitter
# For hotel reviews, we might not need to split if reviews are already short
# But we'll keep this for consistency with the original notebook
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)

# Split documents into chunks
docs = text_splitter.split_documents(data)

print(f"✅ Split {len(data)} documents into {len(docs)} chunks")
print(f"Average chunk size: {sum(len(d.page_content) for d in docs) / len(docs):.0f} characters")

✅ Split 500 documents into 500 chunks
Average chunk size: 528 characters


## Create Collection

Create pgvector collection and store document embeddings in Aurora PostgreSQL.

In [9]:
from typing import List, Tuple

# Create PGVector instance and store documents
# This will:
# 1. Connect to Aurora PostgreSQL
# 2. Create necessary tables if they don't exist
# 3. Generate embeddings for all documents
# 4. Store embeddings in the database

print("🚀 Creating vector store collection...")
print("⏳ This may take a minute...")

try:
    db = PGVector.from_documents(
        documents=docs,
        embedding=embeddings,
        collection_name=COLLECTION_NAME,
        connection=CONNECTION_STRING,
        pre_delete_collection=True  # Clean start - delete if exists
    )
    
    print(f"✅ Vector store created successfully!")
    print(f"📊 Collection: {COLLECTION_NAME}")
    print(f"📝 Documents indexed: {len(docs)}")
    
except Exception as e:
    print(f"❌ Error creating vector store: {e}")
    print("\nTroubleshooting:")
    print("1. Check database connection settings in .env file")
    print("2. Ensure pgvector extension is installed: CREATE EXTENSION IF NOT EXISTS vector;")
    print("3. Verify database user has necessary permissions")

🚀 Creating vector store collection...
⏳ This may take a minute...
✅ Vector store created successfully!
📊 Collection: hotel_reviews_langchain
📝 Documents indexed: 500


## Similarity Search with Score

Perform similarity search and retrieve documents with their similarity scores.

In [10]:
# Define search query
query = "What do some of the positive reviews say?"

# Perform similarity search with scores
# Returns documents with their cosine similarity scores (0-1, higher is better)
# Increase k to get more results, then filter duplicates
docs_with_score = db.similarity_search_with_score(query, k=10)

print(f"🔍 Query: '{query}'")
print(f"📊 Found {len(docs_with_score)} total matches")
print("="*60)

🔍 Query: 'What do some of the positive reviews say?'
📊 Found 10 total matches


In [11]:
# Display search results with scores
# Show unique results with different content
seen_content = set()
unique_results = []

for doc, score in docs_with_score:
    # Get first 100 chars of content for comparison
    content_key = doc.page_content[:100]
    if content_key not in seen_content:
        seen_content.add(content_key)
        unique_results.append((doc, score))

print(f"\n📊 Showing {len(unique_results)} unique results:")
print("="*60)

for i, (doc, score) in enumerate(unique_results, 1):
    print(f"\nResult {i}:")
    print(f"📈 Similarity Score: {score:.4f}")
    print(f"📄 Content: {doc.page_content[:300]}...")
    print("-" * 60)


📊 Showing 10 unique results:

Result 1:
📈 Similarity Score: 0.4225
📄 Content: ﻿comments: decent expensive pros enjoyable stay, rooms bath clean beds crisp sheets, room appointed amenities like dvd audio players nice touch including furniture, minibar stocked items including toiletries, cons service good better price, great included items like free internet connectivity simple...
------------------------------------------------------------

Result 2:
📈 Similarity Score: 0.4233
📄 Content: ﻿comments: better previous reviews suggest stayed downtown hilton recently nights conference, apprehensive stay given negative reviews travellers, positive experience stay did not use exercise facility business features, suggestion requested upper floor room view puget sound, quite pleasant, room cl...
------------------------------------------------------------

Result 3:
📈 Similarity Score: 0.4306
📄 Content: ﻿comments: poor service good reviews andra gets makes wonder stayed place.my wife spent night

## Calculate Cosine Similarity

Use cosine distance strategy for similarity calculations and create a retriever.

In [12]:
from langchain_postgres.vectorstores import DistanceStrategy

# Create a new vector store with cosine distance strategy
db_cosine = PGVector(
    embeddings=embeddings,
    collection_name=COLLECTION_NAME,
    connection=CONNECTION_STRING,
    distance_strategy=DistanceStrategy.COSINE  # Use cosine similarity
)

# Create a retriever for use in chains
# This can be integrated with LangChain chains and agents
retriever = db_cosine.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}  # Return top 4 results
)

print("✅ Created retriever with cosine similarity")
print("📊 Retriever will return top 4 most similar documents")

✅ Created retriever with cosine similarity
📊 Retriever will return top 4 most similar documents


In [13]:
# Test the retriever
query = 'What do some of the positive reviews say?'
retrieved_docs = retriever.invoke(query)

print(f"🔍 Query: '{query}'")
print(f"📊 Retrieved {len(retrieved_docs)} documents\n")

# Display first two results
for i, doc in enumerate(retrieved_docs[:2], 1):
    print(f"Document {i}:")
    print(f"{doc.page_content[:200]}...\n")

🔍 Query: 'What do some of the positive reviews say?'
📊 Retrieved 4 documents

Document 1:
﻿comments: decent expensive pros enjoyable stay, rooms bath clean beds crisp sheets, room appointed amenities like dvd audio players nice touch including furniture, minibar stocked items including toi...

Document 2:
﻿comments: better previous reviews suggest stayed downtown hilton recently nights conference, apprehensive stay given negative reviews travellers, positive experience stay did not use exercise facilit...



## Additional Search Methods

Explore different search methods available in LangChain with better handling of the results.

In [14]:
# 1. Basic similarity search (without scores)
print("1️⃣ Basic Similarity Search:")
basic_results = db.similarity_search("excellent service", k=3)
print(f"Found {len(basic_results)} results")

# Show first unique result
if basic_results:
    print(f"Sample: {basic_results[0].page_content[:150]}...\n")

# 2. Maximum Marginal Relevance (MMR) search
# Returns diverse results by balancing relevance and diversity
print("2️⃣ MMR Search (for diverse results):")
mmr_results = db.max_marginal_relevance_search(
    "hotel amenities",
    k=3,
    fetch_k=10,  # Fetch more candidates for diversity
    lambda_mult=0.5  # Balance between relevance and diversity
)
print(f"Found {len(mmr_results)} diverse results")

# Show unique MMR results
seen = set()
for i, doc in enumerate(mmr_results, 1):
    content_key = doc.page_content[:50]
    if content_key not in seen:
        print(f"  {i}. {doc.page_content[:100]}...")
        seen.add(content_key)

print()

# 3. Similarity search with different queries
print("3️⃣ Testing different query types:")
test_queries = [
    "breakfast quality",
    "room cleanliness", 
    "staff friendliness"
]

for test_query in test_queries:
    results = db.similarity_search_with_score(test_query, k=1)
    if results:
        doc, score = results[0]
        print(f"  Query: '{test_query}' - Best match (score: {score:.3f})")
        print(f"    → {doc.page_content[:80]}...")

1️⃣ Basic Similarity Search:
Found 3 results
Sample: ﻿comments: excellent staff, housekeeping quality hotel chocked staff make feel home, experienced exceptional service desk staff concierge door men mai...

2️⃣ MMR Search (for diverse results):
Found 3 diverse results
  1. ﻿comments: basic hotel basic needs hotel perfect young travellers just need place sleep no needs far...
  2. ﻿comments: comfortable pleasant stay stayed hotel nights attend conference, room comfortable hotel l...
  3. ﻿comments: fair room quality unfriendly staff make sure reservation traveling group team, motel not ...

3️⃣ Testing different query types:
  Query: 'breakfast quality' - Best match (score: 0.545)
    → ﻿comments: decent expensive pros enjoyable stay, rooms bath clean beds crisp she...
  Query: 'room cleanliness' - Best match (score: 0.439)
    → ﻿comments: n't bother dump, door room warped room hallway stunk big time, beddin...
  Query: 'staff friendliness' - Best match (score: 0.455)
    → ﻿comments:

## Summary

In this notebook, we demonstrated:

✅ **Vector Embeddings**: Generated 768-dimensional embeddings using all-mpnet-base-v2  
✅ **pgvector Storage**: Stored embeddings in Aurora PostgreSQL with pgvector extension  
✅ **Similarity Search**: Retrieved semantically similar documents  
✅ **Score Calculation**: Computed cosine similarity scores  
✅ **LangChain Integration**: Created retrievers for use in chains  

### Next Steps:
- Scale to larger datasets
- Integrate with LLMs for question-answering (RAG)
- Optimize with IVFFlat or HNSW indexes
- Experiment with different embedding models