# Ingest data into a vector DB (Amazon Aurora PostgreSQL with pgvector)
**_Use of Amazon Aurora PostgreSQL with pgvector as a vector database to store embeddings_**

This notebook works well with the `Data Science 2.0` kernel on a SageMaker Studio `ml.t3.medium` instance.

The following packages are used in this notebook:
```
!pip freeze | grep -E -w "ipython-sql|langchain|pgvector|psycopg2|pypdf|SQLAlchemy"
-----------------------------------------------------------------------------------
ipython-sql==0.5.0
langchain==0.2.5
langchain-aws==0.1.6
langchain-community==0.2.4
langchain-core==0.2.7
langchain-text-splitters==0.2.1
pgvector==0.2.5
psycopg2-binary==2.9.6
pypdf==4.2.0
SQLAlchemy==2.0.28
```

## Step 1: Set up
Install the required packages.

In [None]:
!pip install -U langchain==0.2.5
!pip install -U langchain-aws==0.1.6
!pip install -U langchain-community==0.2.4
!pip install -U SQLAlchemy==2.0.28
!pip install -U pgvector==0.2.5
!pip install -U psycopg2-binary==2.9.6
!pip install -U pypdf==4.2.0
!pip install -U ipython-sql==0.5.0

In [None]:
!pip list | grep -E -w "ipython-sql|langchain|pgvector|psycopg2|pypdf|SQLAlchemy"

## Step 2: Download the data from the web

In this step, we use `wget` to download the PDF version of the Amazon Aurora User Guide.

**The download may take a few minutes.**

In [None]:
%%sh
mkdir -p data
wget -O data/aurora-ug.pdf https://docs.aws.amazon.com/pdfs/AmazonRDS/latest/AuroraUserGuide/aurora-ug.pdf
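
Before parsing, it can be worth confirming that the download produced a real PDF rather than, say, an HTML error page. The following is a minimal sketch, assuming the file was saved to `data/aurora-ug.pdf` as in the cell above; `looks_like_pdf` is a hypothetical helper written for this notebook, not part of any library.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file exists and starts with the PDF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        # Every PDF file begins with the ASCII marker "%PDF-"
        return f.read(5) == b"%PDF-"


# Path assumed from the download cell above
print(looks_like_pdf("data/aurora-ug.pdf"))
```

If this prints `False`, re-running the download cell is a reasonable first step.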

## Step 3: Load data into Amazon Aurora PostgreSQL using pgvector

In [None]:
import boto3

aws_region = boto3.Session().region_name

In [None]:
import json
from typing import Dict

def get_cfn_outputs(stackname: str, region_name: str) -> Dict[str, str]:
    cfn = boto3.client('cloudformation', region_name=region_name)
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

def get_secret_name(stack_name: str, region_name: str = 'us-east-1'):
    cf_client = boto3.client('cloudformation', region_name=region_name)
    response = cf_client.describe_stacks(StackName=stack_name)
    outputs = response["Stacks"][0]["Outputs"]

    # Use .get() because outputs that are not exported have no 'ExportName' key
    secrets = [e for e in outputs if e.get('ExportName') == 'VectorDBSecret'][0]
    secret_name = secrets['OutputValue']
    return secret_name

def get_secret(secret_name: str, region_name: str = 'us-east-1'):
    client = boto3.client('secretsmanager', region_name=region_name)
    get_secret_value_response = client.get_secret_value(SecretId=secret_name)
    secret = get_secret_value_response['SecretString']

    return json.loads(secret)

def get_credentials(secret_id: str, region_name: str) -> dict:
    client = boto3.client('secretsmanager', region_name=region_name)
    response = client.get_secret_value(SecretId=secret_id)
    secrets_value = json.loads(response['SecretString'])
    return secrets_value

In [None]:
import urllib.parse  # import the submodule explicitly; `import urllib` alone does not guarantee it is loaded

CFN_STACK_NAME = "RAGPgVectorStack" # name of CloudFormation stack

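# Pass the session region so the lookups do not silently fall back to 'us-east-1'
secret_name = get_secret_name(CFN_STACK_NAME, region_name=aws_region)
secret = get_secret(secret_name, region_name=aws_region)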

db_username = secret['username']
db_password = urllib.parse.quote_plus(secret['password'])
db_port = secret['port']
db_host = secret['host']

driver = 'psycopg2'

connection_string = f"postgresql+{driver}://{db_username}:{db_password}@{db_host}:{db_port}/"
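
The password is passed through `quote_plus` because characters such as `@`, `/`, and `:` would otherwise be misread as URL delimiters. The sketch below shows the effect on made-up credentials; `build_connection_string` is a hypothetical helper, and the username, password, and host values are placeholders rather than anything from the stack. Unlike the cell above, it also quotes the username, which is an extra precaution and not something the original code does.

```python
from urllib.parse import quote_plus


def build_connection_string(username: str, password: str, host: str, port: int,
                            driver: str = "psycopg2", database: str = "") -> str:
    """Build a SQLAlchemy-style PostgreSQL URL with URL-safe credentials."""
    return (
        f"postgresql+{driver}://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder values for illustration only
example = build_connection_string("postgres", "p@ss/w:rd", "example-host", 5432)
print(example)
# postgresql+psycopg2://postgres:p%40ss%2Fw%3Ard@example-host:5432/
```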

In [None]:
%load_ext sql

In [None]:
%sql $connection_string

In [None]:
%%sql

CREATE EXTENSION IF NOT EXISTS vector;

In [None]:
%%sql

SELECT typname
FROM pg_type
WHERE typname = 'vector';

##### Load the embeddings into the Aurora PostgreSQL DB cluster

In [None]:
from langchain_community.document_loaders.pdf import PyPDFLoader
from langchain_text_splitters.character import RecursiveCharacterTextSplitter

In [None]:
pdf_path = './data/aurora-ug.pdf'

loader = PyPDFLoader(file_path=pdf_path)

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ".", " "],
    chunk_size=1000,
    chunk_overlap=100
)
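
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below splits a string into fixed-size windows that share a few characters with their neighbor. It is a deliberate simplification: `RecursiveCharacterTextSplitter` also tries to break on the listed separators, which this sketch ignores. `sliding_chunks` is a hypothetical helper, not a LangChain function.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not fully contained in the previous one
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so text that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.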

In [None]:
%%time
chunks = loader.load_and_split(text_splitter)

CPU times: user 3min 50s, sys: 778 ms, total: 3min 51s
Wall time: 4min 40s


In [None]:
from langchain_community.embeddings import BedrockEmbeddings

embeddings = BedrockEmbeddings(
    model_id='amazon.titan-embed-text-v1',
    region_name=aws_region
)

In [None]:
import numpy as np


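# Split the chunks into shards of at most `max_docs_per_put` documents
# so that each call to the vector store embeds and inserts a small batch.
max_docs_per_put = 10
num_shards = (len(chunks) // max_docs_per_put) + 1
shards = np.array_split(chunks, num_shards)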
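
The same batching idea can be sketched without NumPy. Note the difference in behavior: `np.array_split` produces shards of nearly equal size, while the hypothetical `batched` helper below yields fixed-size batches with a smaller final one. Either approach keeps each batch at or below the limit.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Toy data standing in for the list of document chunks
batches = list(batched(list(range(25)), 10))
print([len(b) for b in batches])
# [10, 10, 5]
```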

In [None]:
from langchain_community.vectorstores import PGVector


collection_name = 'llm_rag_embeddings'

vectordb = PGVector.from_existing_index(
    embedding=embeddings,
    collection_name=collection_name,
    connection_string=connection_string)

In [None]:
%%time

vectordb.add_documents(documents=shards[0])

CPU times: user 44.7 ms, sys: 10 ms, total: 54.8 ms
Wall time: 3.48 s


In [None]:
%%time
import time


for shard in shards[1:]:
    vectordb.add_documents(documents=shard)
    time.sleep(0.3)

CPU times: user 31.8 s, sys: 1.54 s, total: 33.3 s
Wall time: 27min 15s
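
The fixed `time.sleep(0.3)` spaces out requests but does not recover if a single call fails partway through a long ingest. One possible refinement, sketched below, is to retry a failed call with exponential backoff. `call_with_backoff` is a hypothetical helper, and catching every `Exception` is an assumption made for brevity; in practice you would likely narrow it to the throttling or connection errors you actually observe.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T], max_attempts: int = 5,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on failure with delays of base_delay * 1, 2, 4, ..."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:  # assumption: any error is worth retrying
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")


# Possible use with the loop above (not executed here):
# for shard in shards[1:]:
#     call_with_backoff(lambda: vectordb.add_documents(documents=shard))
```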


## Step 4: Run a similarity search for user input against the documents (embeddings) in Amazon Aurora PostgreSQL using pgvector

In [None]:
%%time

query = "What is the company's strategy for generative AI?"

results = vectordb.similarity_search(query)
results

CPU times: user 13.4 ms, sys: 372 µs, total: 13.8 ms
Wall time: 314 ms


[Document(page_content="SageMaker each support diﬀerent machine learning use cases. Amazon Comprehend is a natural \nlanguage processing  (NLP) service that's used to extract insights from documents. By using Aurora \nmachine learning with Amazon Comprehend, you can determine the sentiment of text in your \ndatabase tables. SageMaker is a full-featured machine learning  service. Data scientists use Amazon \nSageMaker to build, train, and test machine learning models for a variety of inference tasks, such \nAurora machine learning 71", metadata={'source': './data/aurora-ug.pdf', 'page': 102}),
 Document(page_content='•Using Machine Learning\nLearn about Aurora Machine Learning.\nTutorials and sample code in GitHub\nThe following tutorials and sample code in GitHub show you how to perform common tasks with\nAmazon Aurora:\n•Creating an Aurora Serverless v2 lending library\nTutorials and sample code in GitHub 274', metadata={'source': './data/aurora-ug.pdf', 'page': 305}),
 Document(page_
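
Conceptually, a similarity search embeds the query and ranks the stored vectors by a distance metric. The sketch below illustrates this with cosine distance, one of the metrics pgvector supports, on toy two-dimensional vectors in plain Python. The vectors, document names, and the `top_k` helper are made up for illustration; the real ranking happens inside PostgreSQL on the embeddings stored earlier.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query: Sequence[float], store: Dict[str, List[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Rank stored vectors by cosine distance to the query and keep the k closest."""
    scored = [(name, cosine_distance(query, vec)) for name, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "embeddings" standing in for the vectors stored in the database
store = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [0.7, 0.7]}
nearest = top_k([1.0, 0.1], store)
print([name for name, _ in nearest])
# ['doc_a', 'doc_c']
```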

## Clean up

To avoid incurring future charges, delete the resources. You can do this by deleting the CloudFormation stack used to create the IAM role and SageMaker notebook.

## Conclusion

In this notebook, we used Amazon Bedrock to generate embeddings, ingested them into Amazon Aurora PostgreSQL using pgvector, and ran a similarity search for user input against the stored documents (embeddings). We used LangChain as an abstraction layer to talk to both Amazon Bedrock and Amazon Aurora PostgreSQL.