# Similarity Search with pgvector and Amazon Aurora PostgreSQL

## Learning Objectives

1. Use Hugging Face's sentence-transformer model [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) to generate embeddings
2. Store and query vector embeddings using pgvector in Aurora PostgreSQL  
3. Implement semantic search using LangChain's vector store capabilities
4. Calculate similarity scores between text documents

## Install Dependencies

Install the Python libraries this notebook depends on.

In [1]:
# Install sentencepiece for tokenization (required by transformer models)
!conda install -c conda-forge sentencepiece -y > /dev/null 2>&1
print("✅ Sentencepiece installed")

✅ Sentencepiece installed


In [2]:
%%writefile requirements.txt
# Core dependencies
langchain==0.2.16
langchain-community==0.2.17
langchain-postgres==0.0.15
psycopg2-binary==2.9.10
pgvector==0.2.5
python-dotenv==1.0.0
sentence-transformers>=2.5.0
huggingface-hub>=0.20.0
pandas==2.0.3
numpy==1.24.3
torch
transformers>=4.36.0

Writing requirements.txt


In [3]:
# Install all packages
!pip install -r requirements.txt -q
print("✅ Installation complete!")

✅ Installation complete!


## Setup Environment and Import Libraries

Import required libraries and initialize the embedding model.

In [4]:
# Import required libraries and setup environment
import warnings
import os
import logging

# Suppress warnings
warnings.filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
os.environ["TRANSFORMERS_VERBOSITY"] = "error"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
logging.getLogger().setLevel(logging.ERROR)

# Import core libraries
from dotenv import load_dotenv
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_postgres.vectorstores import PGVector
from langchain.docstore.document import Document
from langchain_community.embeddings import HuggingFaceEmbeddings

# Load environment variables
load_dotenv()

# Initialize embeddings
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)

print("✅ Environment setup complete!")
print(f"📊 Using embedding model: all-mpnet-base-v2")

✅ Environment setup complete!
📊 Using embedding model: all-mpnet-base-v2
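
The `normalize_embeddings=True` setting above asks the model to return unit-length vectors. The following is a minimal sketch of what that normalization means, using toy 2-dimensional vectors in place of real embeddings; `l2_normalize` and `dot` are hypothetical helper names and are not part of sentence-transformers or LangChain.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for real embeddings (illustrative values only).
a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

print(a)          # [0.6, 0.8]
print(dot(a, a))  # squared length of a unit vector, approximately 1
print(dot(a, b))  # for unit vectors the dot product equals the cosine similarity
```

Because every stored vector has length 1, the dot product and cosine similarity give the same ranking.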


## Configure Database Connection

Read the connection parameters for Aurora PostgreSQL with pgvector from environment variables, falling back to local defaults when a variable is not set.

In [5]:
# Database connection configuration
import os

DB_HOST = os.getenv('PGVECTOR_HOST', 'localhost')
DB_PORT = os.getenv('PGVECTOR_PORT', '5432')
DB_NAME = os.getenv('PGVECTOR_DATABASE', 'postgres')
DB_USER = os.getenv('PGVECTOR_USER', 'postgres')
DB_PASSWORD = os.getenv('PGVECTOR_PASSWORD', 'password')
DB_DRIVER = 'psycopg2'

# Build connection string
CONNECTION_STRING = f"postgresql+{DB_DRIVER}://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}"

# Collection name for vector store
COLLECTION_NAME = "hotel_reviews_langchain"

# Masked copy of the connection string, safe to print when debugging
display_connection = CONNECTION_STRING.replace(DB_PASSWORD, "****")

# Display configuration (the password is never printed)
print("📊 Database Configuration:")
print(f"   Host: {DB_HOST}")
print(f"   Database: {DB_NAME}")
print(f"   Collection: {COLLECTION_NAME}")

📊 Database Configuration:
   Host: apgpg-pgvector.cluster-c7kkeakuk3cl.us-west-2.rds.amazonaws.com
   Database: postgres
   Collection: hotel_reviews_langchain
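
A password containing characters such as `@`, `/`, or `:` can break a URL assembled with a plain f-string. The sketch below percent-encodes the credentials first, assuming the same `postgresql+psycopg2` URL shape used above. `build_connection_string` and `mask_password` are hypothetical helpers, and the credentials are placeholders.

```python
from urllib.parse import quote_plus


def build_connection_string(user, password, host, port, database, driver="psycopg2"):
    """Assemble a SQLAlchemy-style PostgreSQL URL, percent-encoding the credentials."""
    return (
        f"postgresql+{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


def mask_password(connection_string):
    """Hide the password portion of a URL so it is safe to print or log."""
    scheme, rest = connection_string.split("://", 1)
    credentials, location = rest.rsplit("@", 1)
    user = credentials.split(":", 1)[0]
    return f"{scheme}://{user}:****@{location}"


# Placeholder values, not real credentials.
url = build_connection_string("postgres", "p@ss/word", "localhost", "5432", "postgres")
print(mask_password(url))
```

Masking by position rather than by string replacement also avoids hiding the wrong part when the password matches another part of the URL, such as the user name.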


## Load and Prepare Data

Load the hotel reviews from a CSV file. If the file is missing, the cell generates a small sample dataset instead: ten reviews, each repeated ten times.

In [12]:
# Load data from CSV file
import pandas as pd
import os

# Check for data file
data_file = './data/fictitious_hotel_reviews_trimmed_500.csv'
if not os.path.exists(data_file):
    print("⚠️ Data file not found. Creating sample data...")
    os.makedirs('./data', exist_ok=True)
    
    # Create sample hotel reviews
    sample_reviews = [
        "Excellent service and beautiful rooms. The staff was very helpful and the breakfast was amazing.",
        "Great location near the beach. Pool area was fantastic! Very family friendly.",
        "Amazing mountain views. Perfect for a peaceful getaway. Very quiet and relaxing.",
        "Convenient location but rooms were a bit small. Good value for money though.",
        "Beautiful lake views. Restaurant food was delicious. Will definitely come back.",
        "The room was spotlessly clean and the bed was very comfortable. Great night's sleep.",
        "Staff went above and beyond to help us. Really appreciated their hospitality.",
        "Loved the spa facilities. Very relaxing atmosphere throughout the hotel.",
        "Business center was well equipped. Perfect for work trips.",
        "Kids loved the pool and game room. Great family vacation spot.",
    ]
    
    # Create DataFrame and save
    sample_data = pd.DataFrame({'comments': sample_reviews * 10})  # Duplicate for more data
    data_file = './data/hotel_reviews.csv'
    sample_data.to_csv(data_file, index=False)
    print(f"✅ Created sample data file with {len(sample_data)} reviews")

# Load data using LangChain's CSVLoader
loader = CSVLoader(
    file_path=data_file,
    encoding='utf-8',
    csv_args={'delimiter': ','}
)
data = loader.load()

print(f"✅ Loaded {len(data)} documents")
print(f"\nSample reviews:")
for i, doc in enumerate(data[:3], 1):
    print(f"{i}. {doc.page_content[:100]}...")

⚠️ Data file not found. Creating sample data...
✅ Created sample data file with 100 reviews
✅ Loaded 100 documents

Sample reviews:
1. comments: Excellent service and beautiful rooms. The staff was very helpful and the breakfast was am...
2. comments: Great location near the beach. Pool area was fantastic! Very family friendly....
3. comments: Amazing mountain views. Perfect for a peaceful getaway. Very quiet and relaxing....


## Split Documents into Chunks

In [7]:
# Initialize text splitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)

# Split documents into chunks
docs = text_splitter.split_documents(data)

print(f"✅ Split {len(data)} documents into {len(docs)} chunks")
print(f"Average chunk size: {sum(len(d.page_content) for d in docs) / len(docs):.0f} characters")

✅ Split 100 documents into 100 chunks
Average chunk size: 86 characters
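
Every review is far shorter than `chunk_size=1000`, so each document stays whole and 100 documents produce 100 chunks. The sketch below shows how `chunk_size` and `chunk_overlap` interact on longer text. It is a simplified fixed-window splitter and does not reproduce LangChain's `CharacterTextSplitter`, which splits on the separator and then merges the pieces. `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if len(text) <= chunk_size:
        return [text]  # short inputs pass through as a single chunk
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


short_review = "Great location near the beach. Pool area was fantastic!"
long_text = "x" * 2500  # synthetic text, long enough to need several windows

print(len(sliding_chunks(short_review)))            # 1
print([len(c) for c in sliding_chunks(long_text)])  # [1000, 1000, 900]
```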


## Create Vector Store and Index Documents

In [8]:
# Create PGVector instance and store documents
print("🚀 Creating vector store collection...")
print("⏳ This may take a minute...")

try:
    db = PGVector.from_documents(
        documents=docs,
        embedding=embeddings,
        collection_name=COLLECTION_NAME,
        connection=CONNECTION_STRING,
        pre_delete_collection=True  # Clean start
    )
    
    print(f"✅ Vector store created successfully!")
    print(f"📊 Collection: {COLLECTION_NAME}")
    print(f"📝 Documents indexed: {len(docs)}")
    
except Exception as e:
    print(f"❌ Error creating vector store: {e}")
    print("\nTroubleshooting:")
    print("1. Check database connection settings")
    print("2. Ensure pgvector extension is installed: CREATE EXTENSION IF NOT EXISTS vector;")
    print("3. Verify database user has necessary permissions")
    raise

🚀 Creating vector store collection...
⏳ This may take a minute...
✅ Vector store created successfully!
📊 Collection: hotel_reviews_langchain
📝 Documents indexed: 100


## Perform Similarity Search

In [9]:
# Define search query
query = "What do some of the positive reviews say?"

# Perform similarity search with scores
docs_with_score = db.similarity_search_with_score(query, k=5)

print(f"🔍 Query: '{query}'")
print(f"📊 Found {len(docs_with_score)} matches")
print("="*60)

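# Display results
# Note: PGVector reports a distance here (cosine distance by default),
# so a lower score means a closer match
for i, (doc, score) in enumerate(docs_with_score, 1):
    print(f"\nResult {i}:")
    print(f"📈 Similarity Score: {score:.4f}")
    print(f"📄 Content: {doc.page_content[:200]}...")
    print("-" * 60)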

🔍 Query: 'What do some of the positive reviews say?'
📊 Found 5 matches

Result 1:
📈 Similarity Score: 0.5726
📄 Content: comments: Convenient location but rooms were a bit small. Good value for money though....
------------------------------------------------------------

Result 2:
📈 Similarity Score: 0.5726
📄 Content: comments: Convenient location but rooms were a bit small. Good value for money though....
------------------------------------------------------------

Result 3:
📈 Similarity Score: 0.5726
📄 Content: comments: Convenient location but rooms were a bit small. Good value for money though....
------------------------------------------------------------

Result 4:
📈 Similarity Score: 0.5726
📄 Content: comments: Convenient location but rooms were a bit small. Good value for money though....
------------------------------------------------------------

Result 5:
📈 Similarity Score: 0.5726
📄 Content: comments: Convenient location but rooms were a bit small. Good value for money t
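
The sketch below shows how to read these scores. It assumes the store is using its cosine strategy (the retriever cell later sets this explicitly), in which case the score is a cosine distance: 0 means the same direction, and smaller is better. The helper names and the sample values are illustrative and are not results from the database.

```python
def rank_by_distance(scored_docs):
    """Sort (text, distance) pairs so the closest match comes first."""
    return sorted(scored_docs, key=lambda pair: pair[1])


def distance_to_similarity(distance):
    """Convert a cosine distance in [0, 2] to a cosine similarity in [-1, 1]."""
    return 1.0 - distance


# Illustrative (text, distance) pairs; assumption: scores are cosine distances.
scored = [
    ("rooms were a bit small", 0.60),
    ("staff went above and beyond", 0.25),
    ("beautiful lake views", 0.45),
]

for text, distance in rank_by_distance(scored):
    similarity = distance_to_similarity(distance)
    print(f"distance={distance:.2f}  similarity={similarity:.2f}  {text}")
```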

## Create a Retriever for Chain Integration

In [10]:
from langchain_postgres.vectorstores import DistanceStrategy

# Create vector store with cosine distance strategy
db_cosine = PGVector(
    embeddings=embeddings,
    collection_name=COLLECTION_NAME,
    connection=CONNECTION_STRING,
    distance_strategy=DistanceStrategy.COSINE
)

# Create a retriever
retriever = db_cosine.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

print("✅ Created retriever with cosine similarity")

# Test the retriever
test_query = "excellent service"
retrieved_docs = retriever.invoke(test_query)

print(f"\n🔍 Test Query: '{test_query}'")
print(f"📊 Retrieved {len(retrieved_docs)} documents\n")

for i, doc in enumerate(retrieved_docs[:2], 1):
    print(f"Document {i}: {doc.page_content[:150]}...\n")

✅ Created retriever with cosine similarity

🔍 Test Query: 'excellent service'
📊 Retrieved 4 documents

Document 1: comments: Staff went above and beyond to help us. Really appreciated their hospitality....

Document 2: comments: Staff went above and beyond to help us. Really appreciated their hospitality....
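
The repeated documents are expected. The sample data stores each of the ten reviews ten times, and identical text yields identical embeddings. The sketch below filters duplicates after retrieval. `Doc` is a stand-in for LangChain's `Document` so the snippet runs without a database, and `dedupe_by_content` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a LangChain Document; only page_content is used here."""
    page_content: str


def dedupe_by_content(docs):
    """Keep the first occurrence of each distinct page_content, preserving rank order."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


retrieved = [
    Doc("comments: Staff went above and beyond to help us."),
    Doc("comments: Staff went above and beyond to help us."),
    Doc("comments: Excellent service and beautiful rooms."),
]

for doc in dedupe_by_content(retrieved):
    print(doc.page_content)
```

With a real retriever, one option is to request a larger `k` and deduplicate afterwards, so that enough distinct documents remain.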



## Explore Different Search Methods

In [11]:
# 1. Basic similarity search
print("1️⃣ Basic Similarity Search:")
basic_results = db.similarity_search("excellent service", k=3)
print(f"Found {len(basic_results)} results")
if basic_results:
    print(f"Sample: {basic_results[0].page_content[:150]}...\n")

# 2. Maximal Marginal Relevance (MMR) search
print("2️⃣ MMR Search (for diverse results):")
mmr_results = db.max_marginal_relevance_search(
    "hotel amenities",
    k=3,
    fetch_k=10,
    lambda_mult=0.5
)
print(f"Found {len(mmr_results)} diverse results")
for i, doc in enumerate(mmr_results, 1):
    print(f"  {i}. {doc.page_content[:100]}...")

print()

# 3. Test different query types
print("3️⃣ Testing different query types:")
test_queries = [
    "breakfast quality",
    "room cleanliness", 
    "staff friendliness"
]

for test_query in test_queries:
    results = db.similarity_search_with_score(test_query, k=1)
    if results:
        doc, score = results[0]
        print(f"  Query: '{test_query}' - Best match (score: {score:.3f})")
        print(f"    → {doc.page_content[:80]}...")

1️⃣ Basic Similarity Search:
Found 3 results
Sample: comments: Staff went above and beyond to help us. Really appreciated their hospitality....

2️⃣ MMR Search (for diverse results):
Found 3 diverse results
  1. comments: Loved the spa facilities. Very relaxing atmosphere throughout the hotel....
  2. comments: Loved the spa facilities. Very relaxing atmosphere throughout the hotel....
  3. comments: Loved the spa facilities. Very relaxing atmosphere throughout the hotel....

3️⃣ Testing different query types:
  Query: 'breakfast quality' - Best match (score: 0.514)
    → comments: Excellent service and beautiful rooms. The staff was very helpful and ...
  Query: 'room cleanliness' - Best match (score: 0.566)
    → comments: The room was spotlessly clean and the bed was very comfortable. Great ...
  Query: 'staff friendliness' - Best match (score: 0.430)
    → comments: Staff went above and beyond to help us. Really appreciated their hospi...
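
MMR picks each next result by trading relevance to the query against similarity to the results already chosen. With `lambda_mult=1.0` only relevance counts; lower values favor diversity. The sketch below implements the standard MMR formulation on toy 2-dimensional vectors. It is not LangChain's code, and `cosine` and `mmr_select` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, candidate_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    relevance = [cosine(query_vec, c) for c in candidate_vecs]
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            # Redundancy: highest similarity to anything already selected.
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.2],   # 0: most relevant
    [1.0, 0.2],   # 1: exact duplicate of 0
    [1.0, -0.4],  # 2: slightly less relevant, different direction
    [0.0, 1.0],   # 3: unrelated to the query
]

print(mmr_select(query, candidates, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # [0, 1]
```

One plausible explanation for the three identical MMR results above: each review is stored ten times, so with `fetch_k=10` the candidate pool can consist entirely of copies of one review, which leaves nothing different to select. Raising `fetch_k` or removing duplicates before indexing should give the algorithm real alternatives.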


## Summary

In this notebook, we demonstrated:

✅ **Vector Embeddings**: Generated 768-dimensional embeddings using all-mpnet-base-v2  
✅ **pgvector Storage**: Stored embeddings in Aurora PostgreSQL with the pgvector extension  
✅ **Similarity Search**: Retrieved semantically similar documents  
✅ **Score Calculation**: Retrieved a cosine distance score for each match (lower is closer)  
✅ **LangChain Integration**: Created retrievers for use in chains  

### Next Steps:
- Scale to larger datasets
- Integrate with LLMs for question-answering (RAG)
- Optimize with IVFFlat or HNSW indexes
- Experiment with different embedding models
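
As a starting point for the indexing step, the sketch below builds pgvector `CREATE INDEX` statements for HNSW and IVFFlat with cosine operators. The table and column names (`langchain_pg_embedding`, `embedding`) are assumptions about how langchain-postgres lays out its storage, so verify them against your schema. pgvector also requires the vector column to have a fixed dimension before it can be indexed, which may mean creating the store with an explicit embedding length. The function names are hypothetical.

```python
def hnsw_index_sql(table, column, m=16, ef_construction=64, ops="vector_cosine_ops"):
    """Build a pgvector CREATE INDEX statement for an HNSW index.

    Identifiers are interpolated directly, so pass only trusted names.
    """
    return (
        f"CREATE INDEX IF NOT EXISTS {table}_{column}_hnsw_idx "
        f"ON {table} USING hnsw ({column} {ops}) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )


def ivfflat_index_sql(table, column, lists=100, ops="vector_cosine_ops"):
    """Build a pgvector CREATE INDEX statement for an IVFFlat index."""
    return (
        f"CREATE INDEX IF NOT EXISTS {table}_{column}_ivfflat_idx "
        f"ON {table} USING ivfflat ({column} {ops}) "
        f"WITH (lists = {lists});"
    )


# Assumed table and column names; check your own schema before running the SQL.
print(hnsw_index_sql("langchain_pg_embedding", "embedding"))
print(ivfflat_index_sql("langchain_pg_embedding", "embedding"))
```

The generated statements can be run with any PostgreSQL client. IVFFlat is usually built after the data is loaded, because its lists are derived from the existing rows.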