# Build an Interactive Question Answering App with pgvector on Aurora PostgreSQL
_**Using a pretrained LLM and PostgreSQL extension `pgvector` for similarity search on product documentation**_

---

## Contents


1. [Background](#Background)
1. [Setup](#Setup)
1. [SageMaker Model Hosting](#SageMaker-Model-Hosting)
1. [Load data into Amazon Aurora PostgreSQL](#Open-source-extension-pgvector-for-PostgreSQL)
1. [Evaluate Aurora PostgreSQL vector search results](#Evaluate-Aurora-PostgreSQL-vector-search-results)
1. [Cleanup](#Cleanup)

## Background

Our use case focuses on answering questions over specific documents, relying solely on the information within those documents to generate accurate, context-aware answers. By combining semantic search with a pretrained language model from the sentence transformers family, we will demonstrate how to build a document Q&A system.

A core component of searching for textually similar items is a fixed-length sentence or word embedding, that is, a “feature vector” that represents the text. The reference embeddings are typically generated offline and must be stored so that they can be searched efficiently. In this use case, we use the pretrained SentenceTransformer model `all-mpnet-base-v2` from [Hugging Face](https://huggingface.co/sentence-transformers/all-mpnet-base-v2).

To enable efficient searches for textually similar items, we'll use Amazon SageMaker to generate fixed-length sentence embeddings (“feature vectors”) and run nearest neighbor search in Amazon Aurora PostgreSQL using the `pgvector` extension. The `pgvector` extension lets you store points in vector space and find the nearest neighbors of a given point. Use cases include text generation, search, chatbots, personalized recommendations (for example, the "You may also like..." items on the Amazon shopping website), and fraud detection.

Here are the steps we'll follow to build the question answering app: After some initial setup, we'll host the pretrained language model in SageMaker. Then we'll generate feature vectors for the Aurora User Guide, which is available as a PDF. Those feature vectors will be stored in Amazon Aurora PostgreSQL in a column of the `vector` data type. Finally, we'll run some sample text queries and explore the results.

## Setup
Install the Python libraries required for the setup.

In [None]:
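# PyMuPDF already provides the `fitz` module; the separate `fitz` package on PyPI is unrelated and can conflict with it
!pip install -U psycopg2-binary pgvector tqdm boto3 requests gensim PyMuPDF==1.19.0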

### Enter document URL (as a PDF)

We've used the [Aurora User Guide](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_AuroraOverview.html) as an example. Replace this with any PDF available in your enterprise’s knowledge corpus.

In [None]:
# Runtime: Approx. 2 mins

import requests
from io import BytesIO
import fitz
import re

pages_text = []

doc_source = "https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-ug.pdf"

with BytesIO( requests.get(doc_source).content ) as data:
    with fitz.open(stream=data, filetype="pdf") as doc:
        for page in doc:
            t = page.get_text().replace('\n', ' ').replace('\\n', ' ').replace('  ', ' ')
            t = re.sub(r"(\(p. [0-9]+\)|[0-9]+\s*$)", "", t)
            pages_text.append( t )

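# Skip the first 19 pages (assumed here to be front matter such as the title page and table of contents)
del pages_text[0:19]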

In [None]:
# Runtime: Approx. < 1 mins

from gensim.utils import tokenize

n_tokens = [ len(list(tokenize(x))) for x in pages_text ]
max_tokens = 150
pages_text_final = []
tokens_final = []

In [None]:
# Runtime: Approx. < 1 mins

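def split_text(text, mtoken):
    # Greedily pack sentences into chunks of fewer than `mtoken` tokens.
    sentences = text.split('. ')
    counts = [ len(list(tokenize(s))) for s in sentences ]
    tcnt = 0
    chunk = ""
    for n, s in zip(counts, sentences):
        if chunk and tcnt + n >= mtoken:
            # The current chunk is full: store it and start a new one with this sentence.
            pages_text_final.append(chunk)
            tokens_final.append( len(list(tokenize(chunk))) )
            chunk = s
            tcnt = n
        else:
            # Avoid a leading ". " when the chunk is still empty.
            chunk = chunk + ". " + s if chunk else s
            tcnt += n
    if chunk:
        pages_text_final.append(chunk)
        tokens_final.append( len(list(tokenize(chunk))) )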

In [None]:
# Runtime: Approx. < 1 mins

for token, text in zip(n_tokens, pages_text):
    if token > max_tokens:
        split_text(text, max_tokens)
    else:
        tokens_final.append(token)
        pages_text_final.append(text)
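
The chunking step above can be tried in isolation on a small input. The following is a minimal sketch, not the notebook's exact code: `count_tokens` is a hypothetical stand-in for gensim's `tokenize` (it counts runs of letters), and `chunk_sentences` is a hypothetical side-effect-free variant that returns the chunks instead of appending to global lists.

```python
import re

def count_tokens(text):
    # Stand-in for gensim's tokenize (assumption): count runs of letters
    return len(re.findall(r"[^\W\d_]+", text))

def chunk_sentences(text, max_tokens):
    """Greedily pack '. '-separated sentences into chunks under max_tokens."""
    chunks, current, current_tokens = [], [], 0
    for sentence in text.split('. '):
        n = count_tokens(sentence)
        # Close the current chunk if adding this sentence would reach the limit
        if current and current_tokens + n >= max_tokens:
            chunks.append('. '.join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append('. '.join(current))
    return chunks

sample = ("Aurora is a relational database. It is compatible with PostgreSQL. "
          "It scales storage automatically.")
for chunk in chunk_sentences(sample, max_tokens=10):
    print(chunk)
```

With a limit of 10 tokens, the first sentence forms its own chunk and the remaining two sentences are packed together.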

In [None]:
# Runtime: Approx. < 1 mins

from multiprocessing import cpu_count
from tqdm.contrib.concurrent import process_map

documents = {}

workers = 1 * cpu_count()

chunksize = 32

def generate_tokens(data):
    r = {}
    r['description'] = data
    r['description_tokenized'] = '. '.join( [ ' '.join(list(tokenize(x))) for x in data.split('. ') ]  )
    return r

# Normalize each chunk into tokenized text (the embeddings are generated later)
document_list = process_map(generate_tokens, pages_text_final, max_workers=workers, chunksize=chunksize)


## SageMaker Model Hosting

In this section, we will deploy the pretrained [`all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) Hugging Face SentenceTransformer model to SageMaker and generate 768-dimensional vector embeddings for our Aurora product documentation.

In [None]:
# Runtime: Approx. 1 mins

import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

In [None]:
# Runtime: Approx. 4 mins

from sagemaker.huggingface.model import HuggingFaceModel

# https://www.sbert.net/docs/pretrained_models.html
# Hub Model configuration. <https://huggingface.co/models>
hub = {
  'HF_MODEL_ID': 'sentence-transformers/all-mpnet-base-v2',
  'HF_TASK': 'feature-extraction'
}

# Deploy Hugging Face Model 
predictor = HuggingFaceModel(
               env=hub, # configuration for loading model from Hub
               role=role, # iam role with permissions to create an Endpoint
               transformers_version='4.26',
               pytorch_version='1.13',
               py_version='py39',
            ).deploy(
               initial_instance_count=1,
               instance_type="ml.m5.xlarge",
               endpoint_name="apg-pg-vector"
            )
print("Hugging Face Model has been deployed successfully to SageMaker")


In [None]:
# Runtime: Approx. 1 mins
    
# https://github.com/UKPLab/sentence-transformers/issues/1107
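def cls_pooling(model_output):
    # The endpoint returns a nested list indexed as [sequence][token][dimension].
    # Return the embedding of the first token (the CLS position) of the first sequence.
    return [sublist[0] for sublist in model_output][0]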

data = { "inputs": document_list[9].get('description_tokenized') }

result = cls_pooling( predictor.predict(data=data) )
print (len(result))


In [None]:
result
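
To make the pooling step concrete, here is a minimal sketch on a toy response. It assumes the endpoint output is a nested list indexed as `[sequence][token][dimension]`; `cls_pool` and `mean_pool` are hypothetical names for this example. Mean pooling is a common alternative and is shown only for comparison; it is not what this notebook uses, and this simple version assumes a single, unpadded sequence.

```python
def cls_pool(model_output):
    # Embedding of the first token of the first input sequence
    return model_output[0][0]

def mean_pool(model_output):
    # Average over all tokens of the first input sequence, per dimension
    tokens = model_output[0]
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]

# Toy response: 1 sequence, 3 tokens, 2 dimensions (the real model returns 768)
toy_output = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]

print(cls_pool(toy_output))   # [1.0, 2.0]
print(mean_pool(toy_output))  # [3.0, 4.0]
```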

In [None]:
# Runtime: Approx. 40 mins. Generates embeddings for every chunk using real-time inference.

import urllib.request
import os
import json
import boto3

def generate_embeddings(data):
    r = {}
    r['description'] = data.get('description')
    inp = {'inputs' : data.get('description_tokenized') }
    try:
        r['descriptions_embeddings'] = cls_pooling( predictor.predict(data = inp) )
    except Exception:
        # Keep going if a request fails; the chunk is stored without an embedding
        r['descriptions_embeddings'] = None
    return r

workers = 1 * cpu_count()

chunksize = 128

#Generate Embeddings
data_embeddings = process_map(generate_embeddings, document_list, max_workers=workers, chunksize=chunksize)

document_list = data_embeddings.copy()
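
Because `generate_embeddings` stores `None` when a request fails, it can be useful to check how many chunks ended up without an embedding before loading them into the database. The helper below is a minimal sketch; `split_by_embedding` is a hypothetical name and the toy records stand in for `document_list`.

```python
def split_by_embedding(records, key="descriptions_embeddings"):
    # Separate records that received an embedding from those that did not
    succeeded = [r for r in records if r.get(key) is not None]
    failed = [r for r in records if r.get(key) is None]
    return succeeded, failed

# Toy records standing in for document_list
toy_records = [
    {"description": "chunk one", "descriptions_embeddings": [0.1, 0.2]},
    {"description": "chunk two", "descriptions_embeddings": None},
]

succeeded, failed = split_by_embedding(toy_records)
print(f"{len(succeeded)} embedded, {len(failed)} failed")
```

The failed records could then be retried or skipped, depending on how complete the index needs to be.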

## Open-source extension pgvector for PostgreSQL

pgvector is an open-source extension for PostgreSQL that allows you to store and search vector embeddings for exact and approximate nearest neighbors. It is designed to work seamlessly with other PostgreSQL features, including indexing and querying.

One of the most common applications of generative AI and large language models (LLMs) in an enterprise environment is answering questions based on the enterprise’s knowledge corpus. Pretrained foundation models (FMs) perform well at natural language understanding (NLU) tasks such as summarization, text generation, and question answering on a broad variety of topics. A key benefit of pgvector for this kind of application is that it lets you run similarity searches over large datasets quickly and efficiently. It supports exact and approximate nearest neighbor search using L2 distance, inner product, and cosine distance.
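
As a rough illustration of the three distance measures, the sketch below computes them for two small vectors in plain Python. In pgvector they correspond to the `<->` (L2 distance), `<#>` (negative inner product), and `<=>` (cosine distance) operators. The function names are hypothetical and exist only for this example.

```python
import math

def l2_distance(a, b):
    # Euclidean distance, what pgvector's <-> operator returns
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    # pgvector's <#> operator returns the inner product multiplied by -1
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity, what pgvector's <=> operator returns
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a = [1.0, 0.0]
b = [0.0, 2.0]

print(l2_distance(a, b))
print(negative_inner_product(a, b))
print(cosine_distance(a, b))
```

For these two orthogonal vectors the inner product is zero and the cosine distance is 1, while the L2 distance still depends on their lengths.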

To further optimize your searches, you can also use pgvector's indexing features. By creating indexes on your vector data, you can speed up your searches and reduce the amount of time it takes to find the nearest neighbors to a given vector.
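
To give an intuition for how an inverted-file (IVF) index trades accuracy for speed, here is a simplified NumPy sketch. It is not pgvector's implementation: centroids are sampled at random instead of being trained with k-means, and `build_ivf` and `search_ivf` are hypothetical names. The `n_lists` and `probes` arguments play roles similar to pgvector's `lists` index option and `ivfflat.probes` setting.

```python
import numpy as np

def build_ivf(vectors, n_lists, seed=0):
    # Pick n_lists vectors as centroids (a real index would run k-means)
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=n_lists, replace=False)]
    # Assign every vector to its nearest centroid
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)
    lists = [np.where(assignments == i)[0] for i in range(n_lists)]
    return centroids, lists

def search_ivf(query, vectors, centroids, lists, probes, k):
    # Scan only the `probes` lists whose centroids are closest to the query
    nearest_lists = np.argsort(np.linalg.norm(centroids - query, axis=1))[:probes]
    candidates = np.concatenate([lists[i] for i in nearest_lists])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

rng = np.random.default_rng(42)
vectors = rng.normal(size=(1000, 8))
query = rng.normal(size=8)

centroids, lists = build_ivf(vectors, n_lists=10)
approx = search_ivf(query, vectors, centroids, lists, probes=2, k=5)
exact = search_ivf(query, vectors, centroids, lists, probes=10, k=5)
print("approximate:", approx)
print("exact:      ", exact)
```

Probing every list reproduces the exact result, while probing fewer lists scans less data and may miss some true neighbors.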

In this step, we'll take the text chunks from the Aurora User Guide together with their embeddings and store them in Amazon Aurora PostgreSQL, using a column of the `vector` data type for the embeddings.

In [None]:
# Runtime: 2 minutes

import psycopg2
from pgvector.psycopg2 import register_vector
import boto3 
import json 

client = boto3.client('secretsmanager')

response = client.get_secret_value(
    SecretId='apgpg-vector-secret'
)
database_secrets = json.loads(response['SecretString'])

dbhost = database_secrets['host']
dbport = database_secrets['port']
dbuser = database_secrets['username']
dbpass = database_secrets['password']

dbconn = psycopg2.connect(host=dbhost, user=dbuser, password=dbpass, port=dbport, connect_timeout=10)
dbconn.set_session(autocommit=True)

cur = dbconn.cursor()
cur.execute("set maintenance_work_mem='2GB';")
cur.execute("set work_mem='1GB';")
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
register_vector(dbconn)
cur.execute("DROP TABLE IF EXISTS documentation;")
cur.execute("""CREATE TABLE IF NOT EXISTS documentation(
                  docsource text, 
                  n_tokens bigint, 
                  doctext text, 
                  embeddings vector(768) 
                  );""")

for x in document_list:
    n_tokens = len(list(tokenize( x.get('description') )))
    cur.execute("""INSERT INTO documentation
                      (docsource, n_tokens, doctext, embeddings) 
                  VALUES(%s, %s, %s, %s);""", 
                  (doc_source, n_tokens, x.get('description'), x.get('descriptions_embeddings') ))

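# Use the cosine operator class so the index matches the <=> (cosine distance) queries run against this table
cur.execute("""CREATE INDEX ON documentation 
               USING ivfflat (embeddings vector_cosine_ops) WITH (lists = 500);""")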
cur.execute("VACUUM ANALYZE documentation;")

cur.close()
dbconn.close()
print ("Vector embeddings have been successfully loaded into PostgreSQL")

## Evaluate Aurora PostgreSQL vector search results

In [None]:
import numpy as np

data = {"inputs": "Aurora Global Database Write Forwarding"}

res1 = cls_pooling( predictor.predict(data=data) )

client = boto3.client('secretsmanager')

response = client.get_secret_value(
    SecretId='apgpg-vector-secret'
)
database_secrets = json.loads(response['SecretString'])

dbhost = database_secrets['host']
dbport = database_secrets['port']
dbuser = database_secrets['username']
dbpass = database_secrets['password']

dbconn = psycopg2.connect(host=dbhost, user=dbuser, password=dbpass, port=dbport, connect_timeout=10)
dbconn.set_session(autocommit=True)
cur = dbconn.cursor()

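# Rank chunks by cosine distance (<=>) to the query embedding and keep the
# closest ones whose cumulative token count stays within 600 tokens.
cur.execute("""SELECT docsource, doctext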
  FROM (
    SELECT docsource, doctext, n_tokens, embeddings,
            (embeddings <=> %s) AS distances,
            SUM(n_tokens) OVER (ORDER BY (embeddings <=> %s)) AS cum_n_tokens
    FROM documentation
    ) subquery
  WHERE cum_n_tokens <= %s
  ORDER BY distances ASC;""", 
            (np.array(res1),np.array(res1),600,))

r = cur.fetchall()

for x in r:
    print (x[1])

cur.close()
dbconn.close()
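
The query above combines ranking by distance with a running token total. The same selection logic can be sketched in plain Python on toy data, which may help when reasoning about what the window function returns. The names `select_context` and `toy_rows` are hypothetical, and the sketch ignores how SQL treats rows with identical distances.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def select_context(query_vec, rows, token_budget):
    """rows: list of (doctext, n_tokens, embedding) tuples."""
    # Closest chunks first, like ORDER BY (embeddings <=> query)
    ranked = sorted(rows, key=lambda r: cosine_distance(query_vec, r[2]))
    selected, running_total = [], 0
    for doctext, n_tokens, _ in ranked:
        # Running total, like SUM(n_tokens) OVER (ORDER BY distance)
        running_total += n_tokens
        if running_total > token_budget:
            break
        selected.append(doctext)
    return selected

toy_rows = [
    ("chunk a", 300, [1.0, 0.0]),
    ("chunk b", 250, [0.9, 0.1]),
    ("chunk c", 200, [0.0, 1.0]),
]

print(select_context([1.0, 0.0], toy_rows, token_budget=600))
```

Here the two closest chunks total 550 tokens and are kept; adding the third would exceed the 600-token budget, so it is dropped.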

## Cleanup

In [None]:
# Cleanup
predictor.delete_model()
predictor.delete_endpoint()