# Preparing Aurora PostgreSQL to be used as a Knowledge Base for Amazon Bedrock

This notebook provides sample code for preparing a vector database (Amazon Aurora PostgreSQL with the pgvector extension) so that it can serve as the vector store of a Knowledge Base for Amazon Bedrock, which ingests documents that are typically stored in Amazon S3.

This notebook works well with the `Data Science 3.0` kernel on a SageMaker Studio `ml.t3.medium` instance.

The notebook uses the following packages:
```
!!pip list | grep -E -w "boto3|ipython-sql|psycopg|SQLAlchemy"
--------------------------------------------------------------
boto3                                1.34.127
ipython-sql                          0.5.0
psycopg                              3.1.19
psycopg-binary                       3.1.19
psycopg-pool                         3.2.2
SQLAlchemy                           2.0.28
```

# Prerequisites

The following IAM policies need to be attached to the SageMaker execution role that you use to run this notebook:

- AmazonSageMakerFullAccess
- AWSCloudFormationReadOnlyAccess
- AmazonRDSReadOnlyAccess

# Data Ingestion

## Step 1: Setup
Install the required packages.

In [None]:
%%capture --no-stderr

!pip install -Uq pip

!pip install -U "boto3>=1.26.159"
!pip install -U ipython-sql==0.5.0
!pip install -U "psycopg[binary]==3.1.19"
!pip install -U SQLAlchemy==2.0.28

In [None]:
!pip list | grep -E -w "boto3|ipython-sql|psycopg|SQLAlchemy"

## Step 2: Create a database used for a Knowledge Base for Amazon Bedrock

#### Get connection info out of your database secret

In [None]:
import boto3

aws_region = boto3.Session().region_name
aws_region

In [None]:
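import urllib.parse  # import the submodule explicitly; a bare `import urllib` does not guarantee `urllib.parse` is loaded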

from utils import (
    get_cfn_outputs,
    get_secret_name,
    get_secret
)

CFN_STACK_NAME = "BedrockKBAuroraPgVectorStack" # name of CloudFormation stack

secret_id = get_secret_name(CFN_STACK_NAME)
secret = get_secret(secret_id)

db_username = secret['username']
db_password = urllib.parse.quote_plus(secret['password'])
db_port = secret['port']
db_host = secret['host']
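
The `get_secret` helper comes from the notebook's `utils` module, which is not shown here. As a minimal sketch, assuming it returns the parsed JSON of the secret, the following checks that the four fields read above are present before they are used. The `parse_db_secret` function and the sample values are hypothetical, for illustration only.

```python
import json

# Hypothetical secret payload; the field names mirror the keys read above.
sample_secret_string = json.dumps({
    "username": "postgres",
    "password": "example-password",
    "host": "db.example.com",
    "port": 5432,
})

REQUIRED_KEYS = ("username", "password", "host", "port")


def parse_db_secret(secret_string):
    """Parse a JSON secret and fail early if a connection field is missing."""
    parsed = json.loads(secret_string)
    missing = [key for key in REQUIRED_KEYS if key not in parsed]
    if missing:
        raise KeyError(f"secret is missing keys: {missing}")
    return parsed


example = parse_db_secret(sample_secret_string)
print(example["host"], example["port"])
```

Failing early on a missing key gives a clearer error than a connection failure further down the notebook.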

#### Create a database to be used as a data source of a Knowledge Base for Amazon Bedrock

In [None]:
bedrock_vector_database_name = 'bedrock_vector_db'

In [None]:
%store bedrock_vector_database_name

In [None]:
import psycopg

conn = psycopg.connect(
    host=db_host,
    port=db_port,
    user=db_username,
    password=secret['password'],
    autocommit=True
)

with conn, conn.cursor() as cur:
    try:
        cur.execute(f"CREATE DATABASE {bedrock_vector_database_name}")
    except psycopg.errors.DuplicateDatabase:
        pass
    cur.execute(f"GRANT ALL PRIVILEGES ON DATABASE {bedrock_vector_database_name} TO {db_username}")
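
`CREATE DATABASE` and `GRANT` take identifiers rather than bind parameters, so the cell above interpolates the names into the SQL text. That is fine for a constant defined in the notebook, but if the name ever comes from user input it should be validated first. The sketch below uses a hypothetical `quote_identifier` helper built on the standard library only; psycopg also ships a `psycopg.sql` module (`sql.SQL`, `sql.Identifier`) for composing such statements.

```python
import re

# Accept only plain identifiers: a letter or underscore, then letters, digits, underscores.
_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_identifier(name):
    """Return name as a double-quoted SQL identifier, rejecting unexpected characters."""
    if not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return '"' + name + '"'


statement = f"CREATE DATABASE {quote_identifier('bedrock_vector_db')}"
print(statement)
```

Note that a double-quoted identifier is case-sensitive in PostgreSQL, so this sketch assumes lowercase names such as `bedrock_vector_db`.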

In [None]:
driver = 'psycopg'
connection_string = f"postgresql+{driver}://{db_username}:{db_password}@{db_host}:{db_port}/{bedrock_vector_database_name}"
connection_string
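
The password is passed through `urllib.parse.quote_plus` before it is placed in the connection string because characters such as `@`, `:`, `/`, and `#` have special meaning in a URL. The following sketch, which uses a hypothetical password and host name, shows how an unencoded password changes the host that a URL parser sees.

```python
from urllib.parse import quote_plus, urlsplit

raw_password = "p@ss:w/rd#1"  # hypothetical password with URL-reserved characters
host = "db.example.com"       # hypothetical host name

encoded = quote_plus(raw_password)
good_url = f"postgresql+psycopg://postgres:{encoded}@{host}:5432/bedrock_vector_db"
bad_url = f"postgresql+psycopg://postgres:{raw_password}@{host}:5432/bedrock_vector_db"

print(encoded)
print(urlsplit(good_url).hostname)  # the intended host
print(urlsplit(bad_url).hostname)   # not the intended host
```

When connecting with `psycopg.connect(...)` keyword arguments, as in the earlier cell, the raw password is used as-is because no URL parsing is involved.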

#### Load `ipython-sql` library to access RDBMS via IPython

In [None]:
%load_ext sql

In [None]:
%sql $connection_string

In [None]:
%%sql

SELECT datname FROM pg_database;

 * postgresql+psycopg://postgres:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/bedrock_vector_db
5 rows affected.


datname
template0
rdsadmin
template1
postgres
bedrock_vector_db


In [None]:
%%sql

SELECT current_database();

 * postgresql+psycopg://postgres:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/bedrock_vector_db
1 rows affected.


current_database
bedrock_vector_db


## Step 3: Set up pgvector

In [None]:
%%sql

CREATE EXTENSION IF NOT EXISTS vector;

In [None]:
%%sql

SELECT typname
FROM pg_type
WHERE typname = 'vector';

 * postgresql+psycopg://postgres:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/bedrock_vector_db
1 rows affected.


typname
vector


(Optional) Use the following command to check the installed version of `pgvector`:

In [None]:
%%sql

SELECT extversion
FROM pg_extension
WHERE extname = 'vector';

 * postgresql+psycopg://postgres:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/bedrock_vector_db
1 rows affected.


extversion
0.7.0


## Step 4: Create a specific schema that Bedrock can use to query the data

In [None]:
%%sql

CREATE SCHEMA bedrock_integration;

## Step 5: Create a new role that Bedrock can use to query the database

In [None]:
%%sql

CREATE ROLE bedrock_user WITH PASSWORD '{secret["password"]}' LOGIN;

## Step 6: Grant the user permission to manage the schema

Grant `bedrock_user` permission to manage the `bedrock_integration` schema so that it can create tables and indexes in it.

In [None]:
%%sql

GRANT ALL ON SCHEMA bedrock_integration to bedrock_user;

## Step 7: Log in as the `bedrock_user` and create a table in the `bedrock_integration` schema

In [None]:
bedrock_vectordb_username = 'bedrock_user'
vectordb_connection_string = f"postgresql+{driver}://{bedrock_vectordb_username}:{db_password}@{db_host}:{db_port}/{bedrock_vector_database_name}"
vectordb_connection_string

In [None]:
%sql $vectordb_connection_string

In [None]:
%%sql

CREATE TABLE IF NOT EXISTS bedrock_integration.bedrock_kb (
    id uuid PRIMARY KEY,
    embedding vector(1536),
    chunks text,
    metadata json,
    file_name varchar(255),
    year int
);

COMMENT ON COLUMN bedrock_integration.bedrock_kb.file_name IS 'source file name used for metadata filtering';
COMMENT ON COLUMN bedrock_integration.bedrock_kb.year IS 'file creation year used for metadata filtering';
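
To make the table layout concrete, the following sketch builds one row shaped like the columns above. It is illustrative only: the `to_pgvector_literal` helper and all sample values are hypothetical. The helper checks that an embedding has the 1536 dimensions declared by `vector(1536)` and formats it in pgvector's text form (`[x1,x2,...]`).

```python
import uuid

EMBEDDING_DIM = 1536  # must match vector(1536) in the table definition


def to_pgvector_literal(embedding, dim=EMBEDDING_DIM):
    """Format a sequence of numbers as a pgvector text literal, e.g. [1.0,2.5]."""
    if len(embedding) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(embedding)}")
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"


# Hypothetical row; keys follow the column names of bedrock_integration.bedrock_kb.
row = {
    "id": str(uuid.uuid4()),
    "embedding": to_pgvector_literal([0.0] * EMBEDDING_DIM),
    "chunks": "example chunk text",
    "metadata": "{}",
    "file_name": "example.pdf",
    "year": 2024,
}
print(sorted(row))
```

An embedding with a different length would be rejected by the `vector(1536)` column, so the dimension must match the embedding model that you configure for the knowledge base.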

In [None]:
%%sql

SELECT *
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND
    schemaname != 'information_schema';

 * postgresql+psycopg://bedrock_user:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/bedrock_vector_db
   postgresql+psycopg://postgres:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/bedrock_vector_db
1 rows affected.


schemaname,tablename,tableowner,tablespace,hasindexes,hasrules,hastriggers,rowsecurity
bedrock_integration,bedrock_kb,bedrock_user,,True,False,False,False


## (Recommended) Step 8: Create an index with the cosine operator for Bedrock to query the data

In [None]:
import psycopg

conn = psycopg.connect(
    host=db_host,
    port=db_port,
    user=db_username,
    password=secret['password'],
    dbname=bedrock_vector_database_name,
    autocommit=True
)

with conn, conn.cursor() as cur:
    cur.execute(
        "CREATE INDEX ON bedrock_integration.bedrock_kb "
        "USING hnsw (embedding vector_cosine_ops);"
    )
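
The `vector_cosine_ops` operator class indexes the column for pgvector's cosine distance operator (`<=>`), which is 1 minus the cosine similarity. As a rough sketch of what the index speeds up, the following pure-Python code ranks a few hypothetical two-dimensional vectors by cosine distance using an exact scan. An HNSW index returns approximate nearest neighbors instead of scanning every row.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, rows, k=2):
    """Exact top-k by cosine distance over (name, vector) pairs."""
    return sorted(rows, key=lambda row: cosine_distance(query, row[1]))[:k]


# Hypothetical data; real embeddings in this notebook have 1536 dimensions.
rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
top = nearest([1.0, 0.1], rows)
print([name for name, _ in top])
```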

#### List indexes using `pg_indexes` view

In [None]:
%%sql

SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'bedrock_integration'
ORDER BY tablename, indexname;

## (Optional) Clean up

If you don't need the vector database anymore, you can clean up all resources using the following commands.

#### Drop table

In [None]:
%%sql

DROP TABLE IF EXISTS bedrock_integration.bedrock_kb;

In [None]:
%%sql

SELECT *
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND
    schemaname != 'information_schema';

#### Drop database

In [None]:
import psycopg

conn = psycopg.connect(
    host=db_host,
    port=db_port,
    user=db_username,
    password=secret['password'],
    autocommit=True
)

with conn, conn.cursor() as cur:
    cur.execute(f"DROP DATABASE IF EXISTS {bedrock_vector_database_name}")

#### Drop schema

In [None]:
%%sql

DROP SCHEMA IF EXISTS bedrock_integration;

In [None]:
%%sql

SELECT *
FROM pg_catalog.pg_namespace
ORDER BY nspname;

#### Drop role

In [None]:
%%sql

DROP ROLE IF EXISTS bedrock_user;

In [None]:
%%sql

SELECT usename AS role_name,
  CASE
    WHEN usesuper AND usecreatedb THEN
      CAST('superuser, create database' AS pg_catalog.text)
    WHEN usesuper THEN
      CAST('superuser' AS pg_catalog.text)
    WHEN usecreatedb THEN
      CAST('create database' AS pg_catalog.text)
    ELSE
      CAST('' AS pg_catalog.text)
  END role_attributes
FROM pg_catalog.pg_user
ORDER BY role_name desc;

## References

  * [Using Aurora PostgreSQL as a Knowledge Base for Amazon Bedrock](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.VectorDB.html)
    * [Preparing Aurora PostgreSQL to be used as a Knowledge Base for Amazon Bedrock](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.VectorDB.html#AuroraPostgreSQL.VectorDB.PreparingKB)
  * [(Workshop) Generative AI Use Cases with Aurora PostgreSQL and pgvector](https://catalog.workshops.aws/pgvector/en-US/)
  * [PostgreSQL Tutorial](https://www.postgresqltutorial.com/)