# Supercharging Vector Similarity Search with Amazon Aurora and pgvector
In this Jupyter Notebook, you'll explore how to store vector embeddings in a vector database using [Amazon Aurora](https://aws.amazon.com/es/rds/aurora/) and the pgvector extension. This approach is particularly useful for applications that require efficient similarity searches on high-dimensional data, such as natural language processing, image recognition, and recommendation systems.

[Amazon Aurora](https://aws.amazon.com/es/rds/aurora/) is a fully managed relational database service provided by Amazon Web Services (AWS). It is compatible with PostgreSQL and supports the [pgvector](https://github.com/pgvector/pgvector) extension, which adds a `vector` data type and query operators for vector similarity search. pgvector speeds up these searches with approximate nearest-neighbor indexes (IVFFlat and, in newer versions, HNSW). You can store vectors with up to 16,000 dimensions and index vectors with up to 2,000 dimensions.
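
To make the query operators mentioned above more concrete, here is a minimal pure-Python sketch of the three distance measures that pgvector exposes as `<->` (Euclidean distance), `<#>` (negative inner product), and `<=>` (cosine distance). The function names are illustrative and are not part of pgvector; in the database these calculations run inside PostgreSQL.

```python
import math


def l2_distance(a, b):
    """Euclidean distance, the quantity pgvector's `<->` operator returns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_inner_product(a, b):
    """Negative dot product, the quantity pgvector's `<#>` operator returns."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity, the quantity pgvector's `<=>` operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Two toy 2-dimensional vectors (real embeddings have hundreds of dimensions)
query_vector = [1.0, 0.0]
stored_vector = [0.0, 1.0]
print(l2_distance(query_vector, stored_vector))
print(negative_inner_product(query_vector, stored_vector))
print(cosine_distance(query_vector, stored_vector))
```

Smaller values mean "closer" for all three operators, which is why similarity queries order results in ascending order.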

For developers and data engineers with experience in relational databases and PostgreSQL, Amazon Aurora with pgvector offers a powerful and familiar solution for managing vector datastores, especially when dealing with structured datasets. Alternatively, Amazon Relational Database Service (RDS) for PostgreSQL is also a suitable option, particularly if you require specific PostgreSQL versions.

Both Amazon Aurora and Amazon RDS for PostgreSQL offer horizontal scaling capabilities for read queries, with a maximum of 15 replicas. Additionally, Amazon Aurora PostgreSQL provides a Serverless v2 option, which automatically scales compute and memory resources based on your application's demand, simplifying operations and capacity planning.

To get started with storing embeddings in a vector database using Amazon Aurora and pgvector, follow these steps:

In [None]:
# !pip install psycopg2
# !pip install pgvector
# !pip install langchain_postgres
# !pip install sqlalchemy

**1- Set up an Amazon Aurora instance:** Ensure that you have an Amazon Aurora instance configured and running. Add all the necessary connection details, such as the endpoint, username, and password, to your application's environment variables or a .env file.

> [Follow steps here. ](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.CreateInstance.html)

In order for you to connect to the Aurora instance from your computer using this notebook, you must allow public access.

> Learn more in [How do I configure a provisioned Amazon Aurora DB cluster to be publicly accessible?](https://repost.aws/knowledge-center/aurora-mysql-connect-outside-vpc)

![Aurora public](aurora_public.jpg)

Then add an inbound rule for your IP address to the instance's [security group](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-security-groups.html):

![Security Group](security_group.jpg)

In [None]:
PGVECTOR_DRIVER='psycopg2'
PGVECTOR_USER='<<Username>>'
PGVECTOR_PASSWORD='<<Password>>'
PGVECTOR_HOST='<<Aurora DB cluster host>>'
PGVECTOR_PORT=5432
PGVECTOR_DATABASE='<<DBName>>'
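
The values above have the shape of a `.env` file, while the next cell reads them with `os.getenv`. One way to bridge the two, sketched here under the assumption that the values are saved in a file named `.env` next to the notebook, is a small loader that copies `KEY=VALUE` lines into the process environment. `load_env_file` is a hypothetical helper written for this example; the `python-dotenv` package offers a more complete alternative.

```python
import os


def load_env_file(path=".env"):
    """Hypothetical helper: copy KEY=VALUE lines from a file into os.environ."""
    loaded = {}
    with open(path, encoding="utf-8") as env_file:
        for raw_line in env_file:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop surrounding single or double quotes from the value
            loaded[key.strip()] = value.strip().strip("'\"")
    os.environ.update(loaded)
    return loaded
```

Calling `load_env_file()` once before reading the settings makes `os.getenv("PGVECTOR_HOST")` and the other lookups return the stored values. Note that every value is loaded as a string, including the port.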

In [None]:
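import os

# Read the connection settings from environment variables
# (for example, values loaded from a .env file)
driver = os.getenv("PGVECTOR_DRIVER")
user = os.getenv("PGVECTOR_USER")
password = os.getenv("PGVECTOR_PASSWORD")
host = os.getenv("PGVECTOR_HOST")
port = os.getenv("PGVECTOR_PORT", "5432")
database = os.getenv("PGVECTOR_DATABASE")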

In [None]:
import psycopg2
from sqlalchemy import create_engine

# Establish the connection to the database
conn = psycopg2.connect(
    host=host,
    port=port,
    database=database,
    user=user,
    password=password,
)
# Create a cursor to run queries
cur = conn.cursor()
# Create the SQLAlchemy engine used later by the vector store
engine = create_engine(f"postgresql://{user}:{password}@{host}:{port}/{database}")
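
A connection URL built with an f-string can break when the username or password contains characters such as `@` or `/`. The sketch below shows one way to escape them with the standard library before handing the URL to `create_engine`. The helper name `build_connection_url` and the example host are assumptions made for illustration.

```python
from urllib.parse import quote_plus


def build_connection_url(user, password, host, database, port=5432, driver="postgresql"):
    """Hypothetical helper: build a database URL with escaped credentials."""
    return (
        f"{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Example with a password that contains reserved URL characters
print(build_connection_url("admin", "p@ss/word", "db.example.com", "vectors"))
```

The resulting string can be passed to `create_engine` in place of the f-string used above.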


**2- Enable the [pgvector extension](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraPostgreSQLReleaseNotes/AuroraPostgreSQL.Extensions.html?sc_channel=el&sc_campaign=genai&sc_geo=mult&sc_country=mult&sc_outcome=acq&sc_content=vector-embeddings-and-rag-demystified-2):** Once connected to your Aurora instance, enable the pgvector extension by running the following SQL command:

In [None]:
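cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.commit()  # psycopg2 opens a transaction implicitly; commit to persist the change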

**3- Create a table to store embeddings:** Define a table schema to store your vector embeddings and any associated metadata.

This table includes columns for a unique identifier (id), the original text (text), and the vector embedding (embedding) with a dimensionality of 1536.

In [None]:
table_name = "embeddings"
query = f"""CREATE TABLE IF NOT EXISTS {table_name} (
    id SERIAL PRIMARY KEY,
    text TEXT,
    embedding VECTOR(1536)
);"""
cur.execute(query)
conn.commit()  # persist the new table
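
As a rough illustration of how rows could be written to and queried from a table with this schema using plain SQL, the sketch below formats a Python list as a pgvector text literal (`[v1,v2,...]`) and defines parameterized statements for psycopg2. The helper `to_vector_literal` and the two SQL constants are assumptions for this example; the rest of the notebook uses LangChain, which manages its own tables.

```python
def to_vector_literal(values, expected_dim=None):
    """Hypothetical helper: format numbers as a pgvector text literal."""
    numbers = [float(v) for v in values]
    if expected_dim is not None and len(numbers) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(numbers)}")
    return "[" + ",".join(repr(n) for n in numbers) + "]"


# Parameterized statements for the `embeddings` table defined above
INSERT_SQL = "INSERT INTO embeddings (text, embedding) VALUES (%s, %s::vector);"
SEARCH_SQL = (
    "SELECT id, text, embedding <=> %s::vector AS distance "
    "FROM embeddings ORDER BY distance LIMIT %s;"
)

# Usage with the cursor created earlier (not executed in this sketch):
# cur.execute(INSERT_SQL, ("some text", to_vector_literal(vector, expected_dim=1536)))
# cur.execute(SEARCH_SQL, (to_vector_literal(query_vector, expected_dim=1536), 5))
print(to_vector_literal([1, 0.5, -2]))
```

Checking the dimension before inserting gives a clearer error than letting the database reject a vector that does not match `VECTOR(1536)`.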

**4- Insert embeddings into the table using LangChain:**

In [None]:
from langchain_community.embeddings import BedrockEmbeddings # to create embeddings for the documents.
from langchain_experimental.text_splitter import SemanticChunker # to split documents into smaller chunks.
from langchain.docstore.document import Document
import boto3
from langchain_postgres import PGVector
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.chat_models import BedrockChat
from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler

In [None]:
bedrock_client = boto3.client("bedrock-runtime")
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", client=bedrock_client)
bedrock_embeddings_image = BedrockEmbeddings(model_id="amazon.titan-embed-image-v1", client=bedrock_client)
llm = BedrockChat(model_id="anthropic.claude-3-sonnet-20240229-v1:0", client=bedrock_client)

In [None]:
# Function to create a vector store
def create_vectorstore(embeddings, collection_name, conn):
    # `conn` can be a SQLAlchemy engine or a connection string
    vectorstore = PGVector(
        embeddings=embeddings,
        collection_name=collection_name,
        connection=conn,
        use_jsonb=True,
    )
    return vectorstore

In [None]:
### Retrieve information using Amazon Bedrock
def retrieve_information(llm, vectordb, query):
    # Set up the retrieval chain with the language model and database retriever
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=vectordb.as_retriever(),
        verbose=True,
    )

    # Initialize the output callback handler
    handler = StdOutCallbackHandler()

    # Run the retrieval chain with a query
    chain_value = chain.run(query, callbacks=[handler])
    return chain_value
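
`RetrievalQA.from_chain_type` uses the "stuff" chain type by default: the retrieved chunks are concatenated into a single prompt that is sent to the model together with the question. The sketch below illustrates that idea in plain Python. The function `build_stuffed_prompt` and its prompt wording are assumptions for illustration and are not LangChain's actual template.

```python
def build_stuffed_prompt(question, chunks):
    """Illustrative only: place all retrieved chunks into one prompt."""
    # Join the non-empty chunks, separated by blank lines
    context = "\n\n".join(chunk.strip() for chunk in chunks if chunk.strip())
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


retrieved_chunks = ["A prompt is the input text sent to a model.", "Models respond to prompts."]
print(build_stuffed_prompt("What is a prompt?", retrieved_chunks))
```

Because every retrieved chunk ends up in one prompt, the number and size of the chunks must fit within the model's context window.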

## PDF File

In [None]:
# Load and process PDF documents
file_name = "Amazon_Bedrock_User_Guide.pdf"
path_file = "demo-files"
file_path = f"{path_file}/{file_name}"

In [None]:
def load_and_split_pdf_semantic(file_path, embeddings):
    text_splitter = SemanticChunker(embeddings, breakpoint_threshold_amount=80)
    loader = PyPDFLoader(file_path)
    docs = loader.load_and_split(text_splitter)
    print(f"docs:{len(docs)}")
    return docs
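
The idea behind semantic chunking can be sketched without any library: embed each sentence, measure the cosine distance between neighboring sentences, and start a new chunk wherever that distance exceeds a percentile threshold (80 here, mirroring `breakpoint_threshold_amount=80`). This is a simplified illustration under the assumption that the threshold is interpreted as a percentile; `SemanticChunker` itself does more work, and the toy 2-dimensional vectors stand in for real embeddings.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, q):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    low, high = math.floor(position), math.ceil(position)
    return ordered[low] + (ordered[high] - ordered[low]) * (position - low)


def semantic_chunks(sentences, vectors, threshold_percentile=80):
    """Group consecutive sentences, splitting where neighbors are far apart."""
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    if not distances:
        return [" ".join(sentences)] if sentences else []
    threshold = percentile(distances, threshold_percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # large jump in meaning: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = ["Cats purr.", "Cats nap.", "Stocks fell.", "Bonds rose."]
toy_vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
print(semantic_chunks(sentences, toy_vectors))
```

A higher percentile produces fewer, larger chunks; a lower one splits more aggressively.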

In [None]:
docs = load_and_split_pdf_semantic(file_path, bedrock_embeddings)

In [None]:
collection_name_text = "text_collection"
vectorstore = create_vectorstore(bedrock_embeddings,collection_name_text,engine)

In [None]:
# Add documents to the vectorstore
vectorstore.add_documents(docs)

### Vector retriever
More information: 
- [pgvector](https://python.langchain.com/docs/integrations/vectorstores/pgvector/)
- [Retrievers](https://python.langchain.com/docs/modules/data_connection/retrievers/)


In [None]:
vectorstore.similarity_search("what is a prompt", k=5)

In [None]:
vectorstore.similarity_search_with_relevance_scores("what is a prompt", k=5)
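
The two search calls above report similarity on different scales. With cosine distance, which is assumed here to be the distance strategy in use, a relevance score can be derived as `1 - distance`, so a higher score means a closer match while a lower distance means the same thing. The sketch below shows that conversion with made-up distances; the function names are illustrative.

```python
def cosine_relevance_score(distance):
    """Map a cosine distance to a relevance score (higher is more similar)."""
    return 1.0 - distance


def rank_by_relevance(distances_by_name):
    """Return (name, score) pairs sorted from most to least relevant."""
    scored = [(name, cosine_relevance_score(d)) for name, d in distances_by_name.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up distances for three chunks
print(rank_by_relevance({"chunk_a": 0.5, "chunk_b": 0.25, "chunk_c": 0.75}))
```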

### Retrieve information using Amazon Bedrock

In [None]:
query = "what is a prompt?"
response = retrieve_information(llm, vectorstore, query)
print(response)


Learn more: 
- [Leverage pgvector and Amazon Aurora PostgreSQL for Natural Language Processing, Chatbots and Sentiment Analysis](https://aws.amazon.com/es/blogs/database/leverage-pgvector-and-amazon-aurora-postgresql-for-natural-language-processing-chatbots-and-sentiment-analysis/)

## Image File

In [None]:
import json
import base64
import os
from PIL import Image
import uuid

In [None]:
#calls Bedrock to get a vector from either an image, text, or both
def get_multimodal_vector(input_image_base64=None, input_text=None):
    request_body = {}
    if input_text:
        request_body["inputText"] = input_text
        
    if input_image_base64:
        request_body["inputImage"] = input_image_base64
    
    body = json.dumps(request_body)
    response = bedrock_client.invoke_model(
        body=body,
        modelId="amazon.titan-embed-image-v1",
        accept="application/json",
        contentType="application/json",
    )

    response_body = json.loads(response.get('body').read())
    
    embedding = response_body.get("embedding")
    
    return embedding
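
The function above sends an empty request body if neither an image nor text is supplied, which the service would reject. A possible guard, sketched here as a separate hypothetical helper named `build_multimodal_body`, validates the inputs and builds the JSON body before any call to Bedrock is made.

```python
import json


def build_multimodal_body(input_image_base64=None, input_text=None):
    """Hypothetical helper: build the JSON request body for the embedding call."""
    if not input_image_base64 and not input_text:
        raise ValueError("Provide an image, text, or both")
    request_body = {}
    if input_text:
        request_body["inputText"] = input_text
    if input_image_base64:
        request_body["inputImage"] = input_image_base64
    return json.dumps(request_body)


print(build_multimodal_body(input_text="a woodpecker"))
```

Failing early with a clear message avoids spending a network round trip on a request that cannot succeed.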

In [None]:
# Convert an image to base64 and get its embedding vector
def image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
    vector = get_multimodal_vector(input_image_base64=encoded_string)
    return encoded_string, vector

In [None]:
def check_size_image(file_path):
    # The maximum supported image size is 2048 x 2048 pixels
    max_side = 2048
    image = Image.open(file_path)  # Open the image
    width, height = image.size  # Width and height of the image in pixels
    if width > max_side or height > max_side:
        print(f"Big file: {file_path}, width: {width}, height: {height} px")
        # Scale both sides by the same factor so the longer side fits within 2048 px
        scale = max_side / max(width, height)
        new_width = int(width * scale)
        new_height = int(height * scale)
        print(f"New file: {file_path}, width: {new_width}, height: {new_height} px")

        new_image = image.resize((new_width, new_height))
        # Overwrite the original file with the resized image
        new_image.save(file_path)
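
The resizing rule can be isolated from file handling, which makes it easy to check on its own. The sketch below computes the target dimensions that keep the aspect ratio while fitting the longer side within the limit. `fit_within` is a hypothetical helper for illustration and does not touch any image file.

```python
def fit_within(width, height, max_side=2048):
    """Return (width, height) scaled down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # Already small enough, leave unchanged
    scale = max_side / longest
    # Clamp to guard against floating-point rounding; never return a zero-sized side
    new_width = min(max_side, max(1, round(width * scale)))
    new_height = min(max_side, max(1, round(height * scale)))
    return new_width, new_height


print(fit_within(4096, 2048))
print(fit_within(1000, 800))
```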

In [None]:
def get_image_vectors_from_directory(path_name):
    documents = []
    embeddings = []
    # Walk the directory tree and embed every .jpg file found
    for root, _dirs, files in os.walk(path_name):
        for file_name in files:
            if file_name.endswith('.jpg'):
                file_path = os.path.join(root, file_name)
                check_size_image(file_path)
                image_base64, image_embedding = image_to_base64(file_path)
                documents.append({"page_content": image_base64, "file_path": file_path})
                embeddings.append(image_embedding)
            else:
                print("Not a .jpg file:", file_name)

    return documents, embeddings
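
The directory walk above only matches the lowercase `.jpg` extension. If your dataset also contains files such as `photo.JPG` or `photo.jpeg`, a case-insensitive filter may be useful. The sketch below shows one way to list them with `pathlib`; the helper name `iter_image_paths` and the accepted extensions are assumptions for this example.

```python
from pathlib import Path


def iter_image_paths(root, extensions=(".jpg", ".jpeg")):
    """Yield image files under `root`, matching extensions case-insensitively."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            yield path


# Usage (assumes the directory exists):
# for image_path in iter_image_paths("animals/animals"):
#     print(image_path)
```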

In [None]:
path_file = "animals/animals"
path_name = f"{path_file}"
documents, embeddings = get_image_vectors_from_directory(path_name)

In [None]:
#image_vectorstore.drop_tables()
collection_name_image = "image_collection"

In [None]:
image_vectorstore = create_vectorstore(bedrock_embeddings_image,collection_name_image,engine)

In [None]:
texts = [d.get("file_path") for d in documents]
metadata = [{"file_path": d.get("file_path")} for d in documents]

In [None]:
image_vectorstore.add_embeddings(texts=texts, embeddings=embeddings, metadatas=metadata)

In [None]:
similarity_results = image_vectorstore.similarity_search_with_relevance_scores("a woodpecker")

In [None]:
for n in similarity_results:
    print("Path:", n[0].page_content)
    print("Relevance Score:", n[1])
    

### Retrieve information using Amazon Bedrock

In [None]:
query = "a woodpecker"
response = retrieve_information(llm, image_vectorstore,query) 
print(response)


## Retriever by Image

In [None]:
image_path = "animals/animals/whale/3e8b0a420a.jpg"
image_base64, image_embedding = image_to_base64(image_path)


In [None]:
similarity_by_vector = image_vectorstore.similarity_search_with_score_by_vector(image_embedding)

In [None]:
for n in similarity_by_vector:
    print("Path:", n[0].page_content)
    print("Distance between vectors:", n[1])

## Delete vectorDB

In [83]:
vectorstore.drop_tables()
image_vectorstore.drop_tables()
