# Build LLM powered hybrid vector search with Amazon Aurora PostgreSQL, Amazon OpenSearch, Langchain and Bedrock
_**Using a pretrained LLM on Bedrock for similarity search on Amazon product reviews using pgvector and OpenSearch**_

---


## Contents


1. [Background](#Background)
1. [Setup](#Setup)
1. [Downloading Amazon Fine Food Reviews data](#Downloading-Amazon-Fine-Food-Reviews-data)
1. [Store the vectors in Aurora PostgreSQL](#Store-the-vectors-in-Aurora-PostgreSQL)
1. [Store the vectors in Amazon OpenSearch](#Store-the-vectors-in-Amazon-OpenSearch)
1. [Perform hybrid search using vectors stored in Aurora PostgreSQL and Amazon OpenSearch](#Perform-hybrid-search-using-vectors-stored-in-Aurora-PostgreSQL-and-Amazon-OpenSearch)
1. [Conclusion](#Conclusion)

## Background

Vector search is a powerful technique that enables efficient and accurate retrieval of relevant data from large datasets. It represents data as high-dimensional vectors in a vector space, allowing for similarity comparisons based on vector distances. This approach is particularly useful when dealing with complex data structures, such as text, images, or numerical data, where traditional keyword-based search methods may fall short. 
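
To make "similarity comparisons based on vector distances" concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and document names are made up for illustration; real embeddings have hundreds or thousands of dimensions, and a vector database would use an index rather than this brute-force scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy "embeddings" (hypothetical values, for illustration only)
doc_vecs = {
    "dog food review": [0.9, 0.1, 0.0],
    "cat food review": [0.7, 0.3, 0.1],
    "coffee review": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]

print(top_k(query_vec, doc_vecs, k=2))
# ['dog food review', 'cat food review']
```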

In the context of AWS, there are several vector database options available for performing vector search. One option is Amazon Aurora for PostgreSQL with the pgvector extension, which provides vector support for structured data stored in relational databases. This extension enables vector operations, such as similarity searches, clustering, and nearest neighbor queries, directly within the PostgreSQL database. 

Another option is Amazon OpenSearch, a managed service that combines the power of OpenSearch with the scalability and security of AWS. Amazon OpenSearch supports vector search out-of-the-box, making it well-suited for unstructured data, such as text documents, product descriptions, or customer reviews.

Hybrid search, which combines both structured and unstructured data sources, is becoming increasingly important as organizations grapple with data that spans multiple formats and sources. By querying structured data in relational databases together with unstructured data in search engines or document repositories, organizations can retrieve relevant results that a single source would miss.

In this notebook, we perform vector search on structured data using Amazon Aurora for PostgreSQL with the pgvector extension, and on unstructured data using Amazon OpenSearch. We then combine the results from both sources using Langchain's EnsembleRetriever, which takes a list of retrievers as input, ensembles their results, and re-ranks them with the Reciprocal Rank Fusion (RRF) algorithm. RRF is a widely used technique for combining the results of multiple retrieval systems. It scores each document by summing the reciprocal of its rank in each retriever's result list, optionally scaled by a per-retriever weight, so documents that rank highly in several lists rise to the top of the combined list. Because RRF uses only ranks, it does not require the similarity scores of the two systems to be comparable.
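
Reciprocal Rank Fusion can be sketched in a few lines of plain Python. Each document receives `weight / (rank + c)` from every result list it appears in, and the totals decide the final order. This is an illustrative sketch, not Langchain's implementation; the constant `c = 60` is a commonly used default and is an assumption here, as are the document ids.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, weights=None, c=60):
    """Fuse several ranked lists of document ids into one ranking.

    ranked_lists: one list of ids per retriever, best match first.
    weights: optional per-retriever weights (defaults to equal weights).
    c: smoothing constant that dampens the influence of top ranks.
    """
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from two retrievers
pgvector_hits = ["A", "B", "C"]
opensearch_hits = ["B", "D", "A"]

fused = reciprocal_rank_fusion([pgvector_hits, opensearch_hits], weights=[0.5, 0.5])
print(fused)
# ['B', 'A', 'D', 'C']
```

Documents `A` and `B` appear in both lists, so they outrank `C` and `D`, which each appear in only one.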

Here are the steps we'll follow to build this hybrid search. After some initial setup, we'll generate feature vectors for the Amazon Fine Food Reviews dataset from *__Kaggle__*. Those feature vectors will be stored in both Amazon Aurora for PostgreSQL and Amazon OpenSearch. Next, we'll run some sample text queries using Anthropic Claude on Bedrock and review the results.

## Setup
Install the required Python libraries for the workshop.

In [None]:
!pip install langchain pgvector opensearch-py psycopg2-binary boto3 pandas

Import the required modules and classes from Langchain for document loading, text splitting, embedding generation, vector storage, and access to Bedrock.

In [None]:
from langchain.vectorstores.pgvector import PGVector
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import BedrockEmbeddings
import boto3
from langchain.vectorstores import OpenSearchVectorSearch
from langchain.llms.bedrock import Bedrock

### Downloading Amazon Fine Food Reviews data

This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain text review. It also includes reviews from all other Amazon categories.

**Downloading food reviews from Amazon data**: Data originally from here: https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews 

 **Citation:** <br>
 http://i.stanford.edu/~julian/pdfs/www13.pdf <br>
 *J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. WWW, 2013.* <br>

In [None]:
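# Read the CSV file from the downloaded dataset

import pandas as pd
import numpy as np

# We are using only a subset of the ~300 MB file from Kaggle. Point this at the full file to consume all the data.
df = pd.read_csv('data/Reviews_small.csv')

# We are only interested in the Summary and Text columns. Join them with a space
# and treat missing values as empty strings so that no review becomes NaN.
df['reviews'] = df['Summary'].fillna('') + ' ' + df['Text'].fillna('')

# Preview the first rows (placed last so the notebook displays the result)
df.head()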

After reading the CSV file from the downloaded dataset, split the text data into smaller chunks using Langchain's RecursiveCharacterTextSplitter.


In [None]:
#Load documents from Pandas dataframe for insertion into database
from langchain.document_loaders import DataFrameLoader

text_splitter = RecursiveCharacterTextSplitter(
  chunk_size = 1000,
  chunk_overlap = 20,
  length_function = len,
  is_separator_regex = False,
)

# page_content_column is the name of the dataframe column that contains the text we'll create embeddings for
loader = DataFrameLoader(df, page_content_column = 'reviews')

pages = loader.load_and_split(text_splitter=text_splitter)
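
To illustrate what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified fixed-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which tries to split on separators such as paragraphs and sentences first; the function name and sample text are hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=20):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters so that a sentence
    cut at a boundary still appears intact in at least one chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=2)
print(chunks)
# ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk starts with the last two characters of the previous one, which is the overlap.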

### Store the vectors in Aurora PostgreSQL

Retrieve the Amazon Aurora PostgreSQL credentials from AWS Secrets Manager and construct the connection string. Aurora PostgreSQL with the pgvector extension will be used for storing and retrieving structured data vectors. 

In [None]:
from botocore.exceptions import ClientError
import json 

region = "us-east-1" #Replace it with your region

def get_secret(secret_name, region_name):

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        # For a list of exceptions thrown, see
        # https://docs.aws.amazon.com/secretsmanager/latest/apireference/API_GetSecretValue.html
        raise e

    secret = get_secret_value_response['SecretString'] 
    return secret

#Replace 'demo-secret' with the name of your secret
database_secrets = json.loads(get_secret('demo-secret', region))
dbhost = database_secrets['host']
dbport = database_secrets['port']
dbuser = database_secrets['username']
dbpass = database_secrets['password']

CONNECTION_STRING = PGVector.connection_string_from_db_params(                                                  
    driver = 'psycopg2',
    user = dbuser,                                      
    password = dbpass,                                  
    host = dbhost,                                            
    port = dbport,                                          
    database = 'postgres' ) #Replace it with your database name
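
The connection string is a SQLAlchemy-style URL. The sketch below builds one by hand from a secret shaped like the one read above (`host`, `port`, `username`, `password`) and URL-escapes the credentials, which matters when a generated password contains characters such as `@` or `/`. Whether the Langchain helper escapes these characters for you is worth checking against your installed version; the function name and the example secret are hypothetical.

```python
import json
from urllib.parse import quote_plus

def build_pg_url(secret_json, database="postgres", driver="psycopg2"):
    """Build a SQLAlchemy-style PostgreSQL URL from a Secrets Manager JSON string."""
    secret = json.loads(secret_json)
    user = quote_plus(secret["username"])
    password = quote_plus(secret["password"])  # escape '@', '/', ':' and similar
    host = secret["host"]
    port = secret["port"]
    return f"postgresql+{driver}://{user}:{password}@{host}:{port}/{database}"

# Made-up secret for illustration; never hard-code real credentials
example_secret = json.dumps({
    "host": "example-cluster.local",
    "port": 5432,
    "username": "demo_user",
    "password": "p@ss/word",
})

print(build_pg_url(example_secret))
# postgresql+psycopg2://demo_user:p%40ss%2Fword@example-cluster.local:5432/postgres
```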

Store the document chunks in Amazon Aurora PostgreSQL using Langchain's BedrockEmbeddings, which utilizes the Titan Embeddings model (amazon.titan-embed-text-v1) for generating vector representations of the text data. 

In [None]:
COLLECTION_NAME = "review_collection" #Replace it with your collection name

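# Initialize the text embedding model (Titan Embeddings, 1536-dimensional vectors)
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", region_name=region)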

db = PGVector.from_documents(
                                documents=pages,
                                embedding=embeddings,
                                collection_name=COLLECTION_NAME,
                                connection_string=CONNECTION_STRING
                            )

Run a sample similarity search query on Amazon Aurora PostgreSQL to ensure that the vector embeddings are properly inserted and can be retrieved. 

In [None]:
from langchain.schema import Document

# Query for which we want to find semantically similar documents
query = "Tell me about Vitality canned dog food?"

#Fetch the k=3 most similar documents
docs = db.similarity_search(query, k=3)

doc = docs[0]
# Access the document's content
doc_content = doc.page_content

print("Content snippet:" + doc_content[:500])

### Store the vectors in Amazon OpenSearch

Retrieve the Amazon OpenSearch credentials from AWS Secrets Manager. Amazon OpenSearch will be used for storing and retrieving unstructured data vectors.

In [None]:
opensearch_secrets = json.loads(get_secret('demo-secret-opensearch', region)) #Replace it with the name of your secret
osurl = opensearch_secrets['opensearch_endpoint']
osuser = opensearch_secrets['opensearch_user']
ospass = opensearch_secrets['opensearch_password']


Create an index in Amazon OpenSearch if one does not already exist. This index will be used to store the vector embeddings of the unstructured data. 

In [None]:
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{'host': osurl, 'port': 443}],
    http_auth=(osuser, ospass),
    use_ssl=True,
    verify_certs=True
)

index_name = "food-review" #Replace it with your index name
indexBody = {
    "settings": {
        "index.knn": True
    },
    "mappings": {
        "properties": {
            "vector_field": {
                "type": "knn_vector",
                # Must match the embedding size of amazon.titan-embed-text-v1
                "dimension": 1536,
                "method": {
                    "engine": "faiss",
                    "name": "hnsw"
                }
            }
        }
    }
}

try:
    create_response = client.indices.create(index_name, body=indexBody)
    print('\nCreating index:')
    print(create_response)
except Exception as e:
    print(e)
    print("(Index likely already exists?)")
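
Catching every exception can hide real problems such as bad credentials or a malformed mapping. An alternative sketch is to check for the index explicitly before creating it. The helper name `ensure_index` is hypothetical; it relies on the `indices.exists` and `indices.create` methods of the `opensearch-py` client. A small stand-in client is used here so the example runs without a cluster.

```python
def ensure_index(client, index_name, body):
    """Create the index only if it is missing; return True if it was created."""
    if client.indices.exists(index=index_name):
        return False
    client.indices.create(index=index_name, body=body)
    return True

# Stand-in for opensearchpy.OpenSearch, for illustration only.
# In the notebook you would pass the real `client` created above.
class _FakeIndices:
    def __init__(self):
        self.created = {}

    def exists(self, index):
        return index in self.created

    def create(self, index, body=None):
        self.created[index] = body

class _FakeClient:
    def __init__(self):
        self.indices = _FakeIndices()

fake = _FakeClient()
first = ensure_index(fake, "food-review", {"settings": {"index.knn": True}})
second = ensure_index(fake, "food-review", {"settings": {"index.knn": True}})
print(first, second)
# True False
```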

Store the document chunks in Amazon OpenSearch using Langchain's BedrockEmbeddings, which utilizes the Titan Embeddings model (amazon.titan-embed-text-v1) for generating vector representations of the text data. 

In [None]:
from langchain.vectorstores import OpenSearchVectorSearch
vector_store = OpenSearchVectorSearch.from_documents(
        documents=pages,
        embedding=embeddings,
        opensearch_url=f"https://{osurl}",
        http_auth=(osuser, ospass),
        use_ssl=True,
        verify_certs=True,
        index_name='food-review' #Replace it with your index name
    )

Run a sample similarity search query on Amazon OpenSearch to ensure that the vector embeddings are properly inserted and can be retrieved.

In [None]:
# Query for which we want to find semantically similar documents
query = "Tell me about Vitality canned dog food?"

#Fetch the k=3 most similar documents
docs = vector_store.similarity_search(query, k=3)

doc = docs[0]
# Access the document's content
doc_content = doc.page_content

print("Content snippet:" + doc_content[:500])

### Perform hybrid search using vectors stored in Aurora PostgreSQL and Amazon OpenSearch

Combine the search results from Amazon OpenSearch (unstructured data) and Amazon Aurora PostgreSQL (structured data) using Langchain's EnsembleRetriever. The EnsembleRetriever takes a list of retrievers as input, ensembles their results, and re-ranks them based on the Reciprocal Rank Fusion algorithm.

In [None]:
from langchain.retrievers import EnsembleRetriever

pgvector_retriever = db.as_retriever(search_kwargs={"k":3})
opensearch_retriever = vector_store.as_retriever(search_kwargs={"k":3})
ensemble_retriever = EnsembleRetriever(retrievers=[pgvector_retriever, opensearch_retriever], weights=[0.5, 0.5])

Verify that the EnsembleRetriever returns relevant results by running a sample similarity search query.

In [None]:
# Query for which we want to find semantically similar documents
query = "Tell me about Vitality canned dog food?"

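# Fetch the fused results. Each underlying retriever contributes up to k=3 documents
# (set via search_kwargs above), so the combined list can hold more than 3.
docs = ensemble_retriever.get_relevant_documents(query)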

doc = docs[0]
# Access the document's content
doc_content = doc.page_content

print("Content snippet:" + doc_content[:500])



Perform a query using Anthropic Claude on Bedrock as the language model (llm) and the EnsembleRetriever created in the previous step as the retriever. The retriever supplies review chunks from both Aurora PostgreSQL and Amazon OpenSearch, and the model answers the question using those chunks as context.

In [None]:
llm = Bedrock(
                model_id='anthropic.claude-v2', #Replace it with model you want to use
                model_kwargs={'max_tokens_to_sample': 4096}
            )

from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler

# Set up the retrieval chain with the language model and the ensemble retriever
chain = RetrievalQA.from_chain_type(
                                        llm=llm,
                                        retriever=ensemble_retriever,
                                        verbose=True
                                    )

# Initialize the output callback handler
handler = StdOutCallbackHandler()

# Run the retrieval chain with a query
chain.run(
            'how is Vitality canned dog food?',
            callbacks=[handler]
        )
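
By default, `RetrievalQA.from_chain_type` uses the "stuff" chain type, which places the retrieved documents into a single prompt together with the question. The sketch below shows that idea in plain Python. The prompt wording, the `max_chars` budget, and the function name are hypothetical and do not reproduce Langchain's actual template.

```python
def build_stuffed_prompt(question, documents, max_chars=4000):
    """Concatenate retrieved documents into one prompt, within a size budget."""
    context_parts = []
    used = 0
    for doc in documents:
        if used + len(doc) > max_chars:
            break  # stop before exceeding the context budget
        context_parts.append(doc)
        used += len(doc)
    context = "\n\n".join(context_parts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Made-up retrieved chunks for illustration
retrieved = ["Great dog food.", "My dog loves it."]
prompt = build_stuffed_prompt("How is Vitality canned dog food?", retrieved)
print(prompt)
```

Because every retrieved chunk is sent to the model, the number of documents returned by the retrievers directly affects prompt size and cost.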

## Conclusion

By combining vector search on structured and unstructured data sources using Langchain's `EnsembleRetriever`, this notebook demonstrates how to leverage the strengths of different retrieval systems and provide more accurate and relevant search results. 

This approach is particularly beneficial for organizations dealing with hybrid data environments, where data is spread across multiple formats and sources. By unlocking the potential of hybrid search, organizations can gain a more holistic understanding of their data and uncover valuable insights that might have been missed by relying on a single retrieval system.