# Preparing Aurora PostgreSQL to be used as a Knowledge Base for Amazon Bedrock

This notebook provides sample code for a data pipeline that ingests documents (typically stored in Amazon S3) into a knowledge base, that is, a vector database such as Amazon Aurora PostgreSQL with the pgvector extension.

This notebook works well with the `Data Science 3.0` kernel on a SageMaker Studio `ml.t3.medium` instance.

The notebook uses the following packages:
```
!pip list | grep -E -w "boto3|ipython-sql|psycopg|SQLAlchemy"
----------------------------------------------------------------------------------------
boto3                                1.34.127
ipython-sql                          0.5.0
psycopg                              3.1.19
psycopg-binary                       3.1.19
psycopg-pool                         3.2.2
SQLAlchemy                           2.0.28
```

# Prerequisites

The following IAM policies need to be attached to the SageMaker execution role that you use to run this notebook:

- AmazonSageMakerFullAccess
- AWSCloudFormationReadOnlyAccess
- AmazonS3FullAccess
- AmazonRDSReadOnlyAccess
- An inline policy for Amazon Bedrock:
  ```
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Action": [
                  "bedrock:ListDataSources",
                  "bedrock:ListFoundationModelAgreementOffers",
                  "bedrock:ListFoundationModels",
                  "bedrock:ListIngestionJobs",
                  "bedrock:ListKnowledgeBases",
                  "bedrock:ListModelInvocationJobs"
              ],
              "Resource": "*",
              "Effect": "Allow",
              "Sid": "BedrockList"
          },
          {
              "Action": [
                  "bedrock:GetDataSource",
                  "bedrock:GetFoundationModel",
                  "bedrock:GetFoundationModelAvailability",
                  "bedrock:GetIngestionJob",
                  "bedrock:GetKnowledgeBase",
                  "bedrock:GetModelInvocationJob",
                  "bedrock:InvokeModel",
                  "bedrock:InvokeModelWithResponseStream",
                  "bedrock:ListTagsForResource",
                  "bedrock:Retrieve"
              ],
              "Resource": "*",
              "Effect": "Allow",
              "Sid": "BedrockRead"
          },
          {
              "Action": [
                  "bedrock:CreateFoundationModelAgreement",
                  "bedrock:CreateModelInvocationJob",
                  "bedrock:CreateProvisionedModelThroughput",
                  "bedrock:DeleteFoundationModelAgreement",
                  "bedrock:DeleteModelInvocationLoggingConfiguration",
                  "bedrock:DeleteProvisionedModelThroughput",
                  "bedrock:PutModelInvocationLoggingConfiguration",
                  "bedrock:RetrieveAndGenerate",
                  "bedrock:StartIngestionJob",
                  "bedrock:UpdateDataSource",
                  "bedrock:UpdateKnowledgeBase"
              ],
              "Resource": "*",
              "Effect": "Allow",
              "Sid": "BedrockWrite"
          },
          {
              "Action": [
                  "bedrock:TagResource",
                  "bedrock:UntagResource"
              ],
              "Resource": "*",
              "Effect": "Allow",
              "Sid": "BedrockTagging"
          }
      ]
  }
  ```


# Data Ingestion

## Step 1: Setup
Install the required packages.

In [None]:
!pip install -Uq pip

!pip install -U "boto3>=1.26.159"
!pip install -U ipython-sql==0.5.0
!pip install -U "psycopg[binary]==3.1.19"
!pip install -U SQLAlchemy==2.0.28

In [2]:
!pip list | grep -E -w "boto3|ipython-sql|psycopg|SQLAlchemy"

boto3                                1.34.127
ipython-sql                          0.5.0
psycopg                              3.1.19
psycopg-binary                       3.1.19
psycopg-pool                         3.2.2
SQLAlchemy                           2.0.28


## Step 2: Log in to the database with your master user

In [3]:
import boto3

aws_region = boto3.Session().region_name
aws_region

'us-east-1'

In [4]:
from utils import (
    get_secret_name,
    get_secret
)
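
The `utils` helpers imported above are not shown in this notebook. As a rough idea of what a helper like `get_secret` might do, the sketch below reads a secret from AWS Secrets Manager and parses it as JSON. The function name `fetch_secret_sketch`, the optional `client` parameter, and the assumption that the secret is a JSON document with keys such as `username`, `password`, `host`, and `port` are illustrative and not taken from the actual `utils` module.

```python
import json


def fetch_secret_sketch(secret_id, client=None):
    """Hypothetical helper: fetch a secret and parse its JSON payload into a dict."""
    if client is None:
        # Imported lazily so the function can also be exercised with a stub client.
        import boto3
        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    # Assumption: the secret is stored as JSON text in SecretString.
    return json.loads(response["SecretString"])
```

Passing a `client` explicitly makes it possible to try the function without AWS credentials, for example with a small stub object that returns a canned `SecretString`.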

In [6]:
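import urllib.parse  # import the submodule explicitly; `import urllib` alone does not guarantee `urllib.parse` is loaded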

CFN_STACK_NAME = "BedrockKBAuroraPgVectorStack" # name of CloudFormation stack

secret_id = get_secret_name(CFN_STACK_NAME)
secret = get_secret(secret_id)

db_username = secret['username']
db_password = urllib.parse.quote_plus(secret['password'])
db_port = secret['port']
db_host = secret['host']

driver = 'psycopg'

connection_string = f"postgresql+{driver}://{db_username}:{db_password}@{db_host}:{db_port}/"

secret_id, connection_string

('BedrockKBAuroraPgVectorStac-Px90gROKnvtJ',
 'postgresql+psycopg://postgres:7z3-fggCXLG_rYR%3DqCA-URu2%3DDAajo@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/')
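
The password is passed through `quote_plus` because characters such as `=`, `@`, or `/` would otherwise be misread as URL delimiters. The sketch below wraps the same idea in a small function. The name `build_connection_string` and the sample credentials are made up for illustration.

```python
from urllib.parse import quote_plus


def build_connection_string(username, password, host, port, database="", driver="psycopg"):
    """Assemble a SQLAlchemy-style URL, percent-encoding the credentials."""
    return (
        f"postgresql+{driver}://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Made-up credentials and host, for illustration only.
url = build_connection_string("postgres", "p@ss=w/rd", "db.example.com", 5432, "postgres")
print(url)
```

With these sample values, the `@`, `=`, and `/` in the password are emitted as `%40`, `%3D`, and `%2F`, so the only unescaped `@` in the URL is the one that separates the credentials from the host.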

#### Load the `ipython-sql` extension to query the database from IPython

In [7]:
%load_ext sql

In [58]:
%sql $connection_string

In [61]:
%%sql

SELECT datname FROM pg_database;

   postgresql+psycopg://bedrock_user:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/postgres
 * postgresql+psycopg://postgres:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/
4 rows affected.


datname
template0
rdsadmin
template1
postgres


In [59]:
%%sql

SELECT current_database();

   postgresql+psycopg://bedrock_user:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/postgres
 * postgresql+psycopg://postgres:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/
1 rows affected.


current_database
postgres


## Step 3: Set up pgvector

In [9]:
%%sql

CREATE EXTENSION IF NOT EXISTS vector;

 * postgresql+psycopg://postgres:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/
Done.


[]

In [10]:
%%sql

SELECT typname
FROM pg_type
WHERE typname = 'vector';

 * postgresql+psycopg://postgres:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/
1 rows affected.


typname
vector


(Optional) Use the following query to check which version of `pgvector` is installed:

In [11]:
%%sql

SELECT extversion
FROM pg_extension
WHERE extname = 'vector';

 * postgresql+psycopg://postgres:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/
1 rows affected.


extversion
0.4.1
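
One reason to check the version is that index types depend on it: pgvector added HNSW indexes in release 0.5.0, so a 0.4.x installation only offers IVFFlat. The sketch below compares version strings numerically. The helper names `parse_version` and `supports_hnsw` are hypothetical, and the parser assumes a plain dotted numeric version such as the one returned above.

```python
def parse_version(text):
    """Turn a dotted version string such as '0.4.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def supports_hnsw(extversion):
    """Return True if this pgvector version is 0.5.0 or newer."""
    return parse_version(extversion) >= (0, 5, 0)


print(supports_hnsw("0.4.1"))
print(supports_hnsw("0.5.0"))
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as treating `0.10.0` as older than `0.5.0`.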


## Step 4: Create a specific schema that Bedrock can use to query the data

In [40]:
%%sql

CREATE SCHEMA bedrock_integration;

 * postgresql+psycopg://postgres:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/
Done.


[]

## Step 5: Create a new role that Bedrock can use to query the database

In [None]:
%%sql

CREATE ROLE bedrock_user WITH PASSWORD '{secret["password"]}' LOGIN;
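
The cell above places the password directly inside a single-quoted SQL literal, which breaks if the password itself contains a single quote. A minimal sketch of one way to guard against that is shown below. It assumes PostgreSQL's default `standard_conforming_strings = on`, where doubling the quote is sufficient, and the helper name `quote_sql_literal` and the sample password are made up. With a direct `psycopg` connection, the `psycopg.sql` module would be a more robust way to compose such a statement.

```python
def quote_sql_literal(value: str) -> str:
    """Wrap a string in single quotes, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


# Made-up password containing a single quote, for illustration only.
example_password = "pa'ss"
statement = (
    f"CREATE ROLE bedrock_user WITH PASSWORD {quote_sql_literal(example_password)} LOGIN;"
)
print(statement)
```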

## Step 6: Grant the user permission to manage the schema

Grant `bedrock_user` permission to manage the `bedrock_integration` schema so that it can create tables and indexes in it.

In [43]:
%%sql

GRANT ALL ON SCHEMA bedrock_integration to bedrock_user;

 * postgresql+psycopg://postgres:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/
Done.


[]

## Step 7: Log in as `bedrock_user` and create a table in the `bedrock_integration` schema

In [48]:
new_db_username = 'bedrock_user'
new_connection_string = f"postgresql+{driver}://{new_db_username}:{db_password}@{db_host}:{db_port}/postgres"
new_connection_string

'postgresql+psycopg://bedrock_user:7z3-fggCXLG_rYR%3DqCA-URu2%3DDAajo@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/postgres'

In [49]:
%sql $new_connection_string

In [54]:
%%sql

CREATE TABLE IF NOT EXISTS bedrock_integration.bedrock_kb (
    id uuid PRIMARY KEY,
    embedding vector(1536),
    chunks text,
    metadata json
);

 * postgresql+psycopg://bedrock_user:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/postgres
   postgresql+psycopg://postgres:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/
Done.


[]
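
To make the table layout more concrete, the sketch below builds one row in the shape the four columns expect: a UUID, a 1536-dimensional embedding in pgvector's text form (`[x1,x2,...]`), the chunk text, and JSON metadata. It is illustrative only, since rows are normally written by the knowledge base ingestion job. The function name `make_row` and the metadata key `source` are assumptions.

```python
import json
import uuid

EMBEDDING_DIM = 1536  # must match the vector(1536) column definition


def make_row(chunk, embedding, metadata):
    """Build a dict whose values match the bedrock_kb column types."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(embedding)}")
    return {
        "id": str(uuid.uuid4()),
        # pgvector accepts vectors as text in the form '[1.0,2.0,3.0]'.
        "embedding": "[" + ",".join(repr(float(x)) for x in embedding) + "]",
        "chunks": chunk,
        "metadata": json.dumps(metadata),
    }


# Placeholder embedding of zeros and a hypothetical metadata key.
row = make_row("hello world", [0.0] * EMBEDDING_DIM, {"source": "s3://example-bucket/doc.txt"})
print(row["id"], row["embedding"][:20])
```

The dimension check mirrors what the database enforces: inserting a vector of any other length into a `vector(1536)` column is rejected.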

In [60]:
%%sql

SELECT *
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND
    schemaname != 'information_schema';

   postgresql+psycopg://bedrock_user:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/postgres
 * postgresql+psycopg://postgres:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/
1 rows affected.


schemaname,tablename,tableowner,tablespace,hasindexes,hasrules,hastriggers,rowsecurity
bedrock_integration,bedrock_kb,bedrock_user,,True,False,False,False


## References

  * [Using Aurora PostgreSQL as a Knowledge Base for Amazon Bedrock](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.VectorDB.html)
    * [Preparing Aurora PostgreSQL to be used as a Knowledge Base for Amazon Bedrock](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.VectorDB.html#AuroraPostgreSQL.VectorDB.PreparingKB)
  * [PostgreSQL Tutorial](https://www.postgresqltutorial.com/)