# Data Ingestion to Knowledge Base for Amazon Bedrock
**_Use of Knowledge Bases for Amazon Bedrock with Amazon Aurora PostgreSQL, using the pgvector extension as the vector store for embeddings_**

This notebook provides sample code for a data pipeline that ingests documents (typically stored in Amazon S3) into a knowledge base, that is, a vector database such as Amazon Aurora PostgreSQL with pgvector.

This notebook works well with the `Data Science 3.0` kernel on a SageMaker Studio `ml.t3.medium` instance.

Here is a list of packages that are used in this notebook.
```
!pip list | grep -E -w "boto3|ipython-sql|langchain|langchainhub|psycopg|SQLAlchemy|tenacity"
---------------------------------------------------------------------------------------------
boto3                                1.34.127
ipython-sql                          0.5.0
langchain                            0.2.5
langchain-aws                        0.1.6
langchain-community                  0.2.4
langchain-core                       0.2.7
langchainhub                         0.1.20
psycopg                              3.1.19
psycopg-binary                       3.1.19
psycopg-pool                         3.2.2
SQLAlchemy                           2.0.28
tenacity                             8.2.3
```

# Prerequisites

The following IAM policies need to be attached to the SageMaker execution role that you use to run this notebook:

- AmazonSageMakerFullAccess
- AWSCloudFormationReadOnlyAccess
- AmazonS3FullAccess
- AmazonRDSReadOnlyAccess
- inline policy for Amazon Bedrock
  ```
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Action": [
                  "bedrock:ListDataSources",
                  "bedrock:ListFoundationModelAgreementOffers",
                  "bedrock:ListFoundationModels",
                  "bedrock:ListIngestionJobs",
                  "bedrock:ListKnowledgeBases",
                  "bedrock:ListModelInvocationJobs"
              ],
              "Resource": "*",
              "Effect": "Allow",
              "Sid": "BedrockList"
          },
          {
              "Action": [
                  "bedrock:GetDataSource",
                  "bedrock:GetFoundationModel",
                  "bedrock:GetFoundationModelAvailability",
                  "bedrock:GetIngestionJob",
                  "bedrock:GetKnowledgeBase",
                  "bedrock:GetModelInvocationJob",
                  "bedrock:InvokeModel",
                  "bedrock:InvokeModelWithResponseStream",
                  "bedrock:ListTagsForResource",
                  "bedrock:Retrieve"
              ],
              "Resource": "*",
              "Effect": "Allow",
              "Sid": "BedrockRead"
          },
          {
              "Action": [
                  "bedrock:CreateFoundationModelAgreement",
                  "bedrock:CreateModelInvocationJob",
                  "bedrock:CreateProvisionedModelThroughput",
                  "bedrock:DeleteFoundationModelAgreement",
                  "bedrock:DeleteModelInvocationLoggingConfiguration",
                  "bedrock:DeleteProvisionedModelThroughput",
                  "bedrock:PutModelInvocationLoggingConfiguration",
                  "bedrock:RetrieveAndGenerate",
                  "bedrock:StartIngestionJob",
                  "bedrock:UpdateDataSource",
                  "bedrock:UpdateKnowledgeBase"
              ],
              "Resource": "*",
              "Effect": "Allow",
              "Sid": "BedrockWrite"
          },
          {
              "Action": [
                  "bedrock:TagResource",
                  "bedrock:UntagResource"
              ],
              "Resource": "*",
              "Effect": "Allow",
              "Sid": "BedrockTagging"
          }
      ]
  }
  ```


# Data Ingestion

## Step 1: Setup
Install the required packages.

In [None]:
%%capture --no-stderr

!pip install -Uq pip

!pip install -U langchain==0.2.5
!pip install -U "boto3>=1.26.159" langchain-aws==0.1.6
!pip install -U langchain-community==0.2.4
!pip install -U langchainhub==0.1.20
!pip install -U SQLAlchemy==2.0.28
!pip install -U tenacity==8.2.3
!pip install -U "psycopg[binary]==3.1.19"
!pip install -U ipython-sql==0.5.0

In [None]:
!pip list | grep -E -w "boto3|ipython-sql|langchain|langchainhub|psycopg|SQLAlchemy|tenacity"

## Step 2: Check if Aurora PostgreSQL is ready to be used as a Knowledge Base for Amazon Bedrock

In [None]:
import boto3

aws_region = boto3.Session().region_name
aws_region

In [None]:
import urllib.parse  # import the submodule explicitly; `import urllib` alone does not guarantee it is loaded

from utils import (
    get_cfn_outputs,
    get_secret_name,
    get_secret
)


CFN_STACK_NAME = "BedrockKBAuroraPgVectorStack" # name of CloudFormation stack

secret_id = get_secret_name(CFN_STACK_NAME)
secret = get_secret(secret_id)

db_username = secret['username']
db_password = urllib.parse.quote_plus(secret['password'])
db_port = secret['port']
db_host = secret['host']

#### Restore variables

In [None]:
%store -r bedrock_vector_database_name
%store -r table_name

In [None]:
driver = 'psycopg'
connection_string = f"postgresql+{driver}://{db_username}:{db_password}@{db_host}:{db_port}/{bedrock_vector_database_name}?autocommit=true"
connection_string
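
The password is URL-encoded before it is placed in the connection string because characters such as `@`, `/`, or `:` would otherwise be read as URL delimiters. The following is a minimal sketch of that step in isolation; the helper name `build_connection_string` and the sample credentials are hypothetical and not part of the notebook.

```python
from urllib.parse import quote_plus


def build_connection_string(username, password, host, port, database, driver="psycopg"):
    """Build a SQLAlchemy-style PostgreSQL URL with a URL-encoded password.

    Hypothetical helper for illustration; the notebook builds the string inline.
    """
    safe_password = quote_plus(password)
    return (
        f"postgresql+{driver}://{username}:{safe_password}"
        f"@{host}:{port}/{database}?autocommit=true"
    )


# Made-up credentials: the '@' and '/' in the password must not leak into the URL.
example = build_connection_string(
    "postgres", "p@ss/word", "db.example.com", 5432, "bedrock_vector_db"
)
print(example)
```

Without the encoding step, the `@` in the sample password would be taken as the separator between credentials and host.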

In [None]:
%load_ext sql

In [None]:
%sql $connection_string

In [None]:
%%sql

SELECT *
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND
    schemaname != 'information_schema';

 * postgresql+psycopg://postgres:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/bedrock_vector_db
1 rows affected.


schemaname,tablename,tableowner,tablespace,hasindexes,hasrules,hastriggers,rowsecurity
bedrock_integration,bedrock_kb,bedrock_user,,True,False,False,False


In [None]:
%%sql

SELECT
   table_name,
   column_name,
   data_type
FROM
   information_schema.columns
WHERE
   table_name = '{table_name}';

 * postgresql+psycopg://postgres:***@rag-pgvector-demo.cluster-cnrh6fettief.us-east-1.rds.amazonaws.com:5432/bedrock_vector_db
4 rows affected.


table_name,column_name,data_type
bedrock_kb,id,uuid
bedrock_kb,embedding,USER-DEFINED
bedrock_kb,metadata,json
bedrock_kb,year,integer
bedrock_kb,chunks,text
bedrock_kb,file_name,character varying


## Step 3: Download and prepare dataset

### Dataset

In this example, you will use several years of Amazon's Letter to Shareholders as a text corpus to perform Q&A on.

In [None]:
from pathlib import Path
from urllib.request import urlretrieve

data_root_dir = Path('./data')
data_root_dir.mkdir(parents=True, exist_ok=True)

urls = [
    'https://s2.q4cdn.com/299287126/files/doc_financials/2020/ar/2019-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2021/ar/Amazon-2020-Shareholder-Letter-and-1997-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf',
]

filenames = [
    'AMZN-2019-Shareholder-Letter.pdf',
    'AMZN-2020-Shareholder-Letter.pdf',
    'AMZN-2021-Shareholder-Letter.pdf',
    'AMZN-2022-Shareholder-Letter.pdf',
]

# Download each letter and save it under a uniform local file name
for url, filename in zip(urls, filenames):
    urlretrieve(url, data_root_dir / filename)

### (Optional) Metadata

To use the metadata filtering feature, provide a metadata file alongside each source data file. The metadata file must have the same name as the source data file, with a `.metadata.json` suffix appended.

For more information, see [here](https://docs.aws.amazon.com/bedrock/latest/userguide/kb-test-config.html).

ℹ️ In PostgreSQL, only `keys` defined as table columns can be added to `metadataAttributes`.

In this example, `file_name` and `year` should be in the PostgreSQL table.

In [None]:
import json

for name in filenames:
    metadata_file = f"{name}.metadata.json"
    with open(data_root_dir / metadata_file, "w", encoding="utf-8") as f:
        content = {
            "metadataAttributes": {
                "file_name": name,
                "year": int(name.split('-')[1])
            }
        }
        content = json.dumps(content)
        f.write(content)
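
To make the result concrete, the sketch below rebuilds the metadata for one file and checks its keys against the table columns listed in Step 2. The helper names `build_metadata` and `unknown_keys` are hypothetical, and the column set is copied by hand from the query output above rather than read from the database.

```python
import json

# Columns of the vector table as listed in Step 2 (copied here for illustration).
TABLE_COLUMNS = {"id", "embedding", "metadata", "year", "chunks", "file_name"}


def build_metadata(file_name):
    """Return the sidecar metadata for a file named like 'AMZN-2019-...pdf'."""
    return {
        "metadataAttributes": {
            "file_name": file_name,
            "year": int(file_name.split("-")[1]),
        }
    }


def unknown_keys(metadata, columns=TABLE_COLUMNS):
    """Return metadata keys that have no matching table column."""
    return sorted(set(metadata["metadataAttributes"]) - set(columns))


sidecar = build_metadata("AMZN-2019-Shareholder-Letter.pdf")
print(json.dumps(sidecar, indent=2))
print(unknown_keys(sidecar))
```

A non-empty list from `unknown_keys` would point to an attribute that has no corresponding column in the PostgreSQL table.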

## Step 4: Upload data to S3 Bucket

In [None]:
CFN_STACK_NAME = "BedrockKnowledgeBaseStack"
cfn_stack_outputs = get_cfn_outputs(CFN_STACK_NAME, aws_region)

knowledge_base_id = cfn_stack_outputs['KnowledgeBaseId']
data_source_name = cfn_stack_outputs['DataSourceName']

knowledge_base_id, data_source_name

In [None]:
bedrock_agent_client = boto3.client(
    'bedrock-agent',
    region_name=aws_region
)

In [None]:
# Get DataSourceId

response = bedrock_agent_client.list_data_sources(
    knowledgeBaseId=knowledge_base_id
)

data_source_id = response['dataSourceSummaries'][0]['dataSourceId']
data_source_id

In [None]:
# Get DataSource

response = bedrock_agent_client.get_data_source(
    knowledgeBaseId=knowledge_base_id,
    dataSourceId=data_source_id
)

ds_info = response['dataSource']
ds_info

In [None]:
data_source_s3_bucket_arn = ds_info['dataSourceConfiguration']['s3Configuration']['bucketArn']
data_source_s3_bucket_name = data_source_s3_bucket_arn.split(':')[-1]
data_source_s3_bucket_arn, data_source_s3_bucket_name
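
The bucket name is the last colon-separated field of the bucket ARN. As a hedged alternative to the bare `split`, the sketch below also rejects strings that are not S3 ARNs; the function name `bucket_name_from_arn` and the sample ARNs are made up for illustration.

```python
def bucket_name_from_arn(bucket_arn):
    """Extract the bucket name from an S3 bucket ARN (arn:<partition>:s3:::<bucket>)."""
    parts = bucket_arn.split(":")
    # An S3 bucket ARN has six fields; region and account ID are empty.
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3" or not parts[5]:
        raise ValueError(f"Not an S3 bucket ARN: {bucket_arn!r}")
    return parts[5]


print(bucket_name_from_arn("arn:aws:s3:::example-bucket"))
```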

#### Upload data into S3

In [None]:
from sagemaker.s3 import S3Uploader

bucket, prefix = data_source_s3_bucket_name, 'data' # Replace prefix with yours

dataset_s3_path = S3Uploader.upload(
    local_path=str(data_root_dir), desired_s3_uri=f"s3://{bucket}/{prefix}"
)

dataset_s3_path

## Step 5: Start ingestion job

Once the Knowledge Base and Data Source have been created by deploying the CDK stacks, you can start the ingestion job. During ingestion, the Knowledge Base fetches the documents from the data source, pre-processes them to extract text, splits the text into chunks based on the configured chunk size, creates an embedding for each chunk, and writes the embeddings to the vector database, in this case Amazon Aurora PostgreSQL with pgvector.

In [None]:
import pprint
import time

pp = pprint.PrettyPrinter(indent=2)

In [None]:
# Start an ingestion job

start_job_response = bedrock_agent_client.start_ingestion_job(
    knowledgeBaseId=knowledge_base_id,
    dataSourceId=data_source_id
)

job = start_job_response["ingestionJob"]
pp.pprint(job)

In [None]:
while (job['status'] not in ['COMPLETE', 'FAILED']):
    get_job_response = bedrock_agent_client.get_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
        ingestionJobId=job["ingestionJobId"]
    )

    job = get_job_response["ingestionJob"]
    if job['status'] not in ['COMPLETE', 'FAILED']:
        pp.pprint(job)
        time.sleep(30)

pp.pprint(job)
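
The loop above keeps polling until the job reports `COMPLETE` or `FAILED`. A variant that also gives up after a time limit can be sketched as follows. The helper `wait_for_job`, its parameters, and the fake status sequence are assumptions made for illustration; the job lookup is passed in as a callable so the sketch runs without AWS access.

```python
import time

TERMINAL_STATUSES = {"COMPLETE", "FAILED"}


def wait_for_job(fetch_job, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll fetch_job() until the job reaches a terminal status or time runs out.

    fetch_job is any zero-argument callable returning a dict with a 'status' key.
    """
    waited = 0
    while True:
        job = fetch_job()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still {job['status']} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for the real API call: reports STARTING, IN_PROGRESS, then COMPLETE.
_fake_statuses = iter(["STARTING", "IN_PROGRESS", "COMPLETE"])
final_job = wait_for_job(
    lambda: {"status": next(_fake_statuses)},
    poll_seconds=0,
    sleep=lambda seconds: None,
)
print(final_job["status"])
```

In the notebook, `fetch_job` could be a lambda that wraps the `get_ingestion_job` call shown above and returns its `ingestionJob` entry.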

# Test the knowledge base

## Using Knowledge Bases for Amazon Bedrock APIs

### RetrieveAndGenerate API

Behind the scenes, the RetrieveAndGenerate API converts the query into an embedding, searches the knowledge base, augments the foundation model (FM) prompt with the search results as context, and returns the FM-generated response. For multi-turn conversations, Knowledge Bases manages short-term memory of the conversation to provide more contextual results.

The output of the RetrieveAndGenerate API includes the generated response, the source attribution, and the retrieved text chunks.

In [None]:
bedrock_agent_runtime_client = boto3.client(
    "bedrock-agent-runtime",
    region_name=aws_region
)

In [None]:
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
model_arn = f"arn:aws:bedrock:{aws_region}::foundation-model/{model_id}"

model_arn

In [None]:
query = "What is Amazon doing in the field of generative AI?"

response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        'text': query
    },
    retrieveAndGenerateConfiguration={
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': knowledge_base_id,
            'modelArn': model_arn
        }
    },
)

generated_text = response['output']['text']
pp.pprint(generated_text)

('Amazon is investing heavily in Large Language Models (LLMs) and Generative '
 'AI, which it believes will transform and improve virtually every customer '
 'experience across its consumer, seller, brand, and creator offerings. Amazon '
 'has been working on its own LLMs for a while and sees Generative AI as a '
 'technology that will significantly accelerate machine learning adoption. '
 'Amazon is democratizing Generative AI technology through AWS, offering '
 'price-performant machine learning chips like Trainium and Inferentia so that '
 'companies of all sizes can afford to train and run their LLMs in production. '
 'AWS also enables companies to choose from various LLMs and build '
 'applications with AWS security, privacy and other features. One example is '
 "AWS's CodeWhisperer, which uses Generative AI to revolutionize developer "
 'productivity by generating code suggestions in real-time.')


In [None]:
# Print the source attributions (citations) from the original documents to check that the generated response is grounded in the retrieved context.

citations = response["citations"]
contexts = []
for citation in citations:
    retrievedReferences = citation["retrievedReferences"]
    for reference in retrievedReferences:
        contexts.append(reference["content"]["text"])

pp.pprint(contexts)

[ 'One final investment area that I’ll mention, that’s core to setting Amazon '
  'up to invent in every area of our business for many decades to come, and '
  'where we’re investing heavily is Large Language Models (“LLMs”) and '
  'Generative AI. Machine learning has been a technology with high promise for '
  'several decades, but it’s only been the last five to ten years that it’s '
  'started to be used more pervasively by companies. This shift was driven by '
  'several factors, including access to higher volumes of compute capacity at '
  'lower prices than was ever available. Amazon has been using machine '
  'learning extensively for 25 years, employing it in everything from '
  'personalized ecommerce recommendations, to fulfillment center pick paths, '
  'to drones for Prime Air, to Alexa, to the many machine learning services '
  'AWS offers (where AWS has the broadest machine learning functionality and '
  'customer base of any cloud provider). More recently, a newer form 
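
Besides the chunk text, each retrieved reference also records where the chunk came from. The sketch below collects the distinct source URIs from a response-shaped dictionary. The helper `cited_sources` and the sample data are hypothetical; the nested `location` / `s3Location` / `uri` keys are assumed to match what the API returns for an S3 data source.

```python
def cited_sources(response):
    """Collect the distinct S3 URIs cited in a RetrieveAndGenerate-style response."""
    uris = []
    for citation in response.get("citations", []):
        for reference in citation.get("retrievedReferences", []):
            uri = reference.get("location", {}).get("s3Location", {}).get("uri")
            if uri and uri not in uris:
                uris.append(uri)
    return uris


# Hand-written sample shaped like the API response; bucket and text are made up.
sample_response = {
    "citations": [
        {
            "retrievedReferences": [
                {
                    "content": {"text": "first chunk"},
                    "location": {
                        "type": "S3",
                        "s3Location": {"uri": "s3://example-bucket/data/a.pdf"},
                    },
                },
                {
                    "content": {"text": "second chunk"},
                    "location": {
                        "type": "S3",
                        "s3Location": {"uri": "s3://example-bucket/data/a.pdf"},
                    },
                },
            ]
        },
        {
            "retrievedReferences": [
                {
                    "content": {"text": "third chunk"},
                    "location": {
                        "type": "S3",
                        "s3Location": {"uri": "s3://example-bucket/data/b.pdf"},
                    },
                }
            ]
        },
    ]
}

print(cited_sources(sample_response))
```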

### Retrieve API

The Retrieve API converts the user query into an embedding, searches the knowledge base, and returns the relevant results, giving you more control to build custom workflows on top of the semantic search results. The output of the Retrieve API includes the retrieved text chunks, the location type and URI of the source data, and the relevance score of each retrieval.

In [None]:
# Retrieve API: fetch only the relevant context.

relevant_documents = bedrock_agent_runtime_client.retrieve(
    retrievalQuery= {
        'text': query
    },
    knowledgeBaseId=knowledge_base_id,
    retrievalConfiguration= {
        'vectorSearchConfiguration': {
            'numberOfResults': 3  # fetch the top 3 chunks that most closely match the query
        }
    }
)

pp.pprint(relevant_documents["retrievalResults"])

[ { 'content': { 'text': 'Amazon Business launched in 2015 and today drives '
                         'roughly $35B in annualized gross sales. More than '
                         'six million active customers, including 96 of the '
                         'global Fortune 100 companies, are enjoying Amazon '
                         'Business’ one-stop shopping, real-time analytics, '
                         'and broad selection on hundreds of millions of '
                         'business supplies. We believe that we’ve only '
                         'scratched the surface of what’s possible to date, '
                         'and plan to keep building the features our business '
                         'customers tell us they need and want.   While many '
                         'brands and merchants successfully sell their '
                         'products on Amazon’s marketplace, there are also a '
                         'large number of brands and sellers who have la
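
Because each result carries a relevance score and a source location next to the text, the raw output can be condensed into a short overview. The following is a minimal sketch that works on hand-written sample data; the helper `summarize_results` is hypothetical, and the `score` and `location` keys are assumed to follow the response shape described above.

```python
def summarize_results(results, preview_chars=60):
    """Reduce Retrieve-style results to score, source URI, and a short text preview."""
    rows = []
    for result in results:
        rows.append(
            {
                "score": result.get("score", 0.0),
                "uri": result.get("location", {}).get("s3Location", {}).get("uri"),
                "preview": result["content"]["text"][:preview_chars],
            }
        )
    # Highest relevance first.
    return sorted(rows, key=lambda row: row["score"], reverse=True)


# Made-up results; the scores and URIs are placeholders, not real output.
sample_results = [
    {
        "content": {"text": "Amazon Business launched in 2015."},
        "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/data/a.pdf"}},
        "score": 0.41,
    },
    {
        "content": {"text": "We are investing heavily in Generative AI."},
        "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/data/b.pdf"}},
        "score": 0.57,
    },
]

for row in summarize_results(sample_results):
    print(f"{row['score']:.2f}  {row['uri']}  {row['preview']}")
```

In the notebook, the same function could be applied to `relevant_documents["retrievalResults"]`.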

### (Optional) Metadata filtering

With metadata filters, retrieval returns not just semantically relevant chunks but a well-defined subset of them, selected by the metadata keys and values that you filter on.

For more information, see [here](https://aws.amazon.com/blogs/machine-learning/knowledge-bases-for-amazon-bedrock-now-supports-metadata-filtering-to-improve-retrieval-accuracy/).

In [None]:
# Retrieve API: fetch only the relevant context, restricted by a metadata filter.

relevant_documents = bedrock_agent_runtime_client.retrieve(
    retrievalQuery= {
        'text': query
    },
    knowledgeBaseId=knowledge_base_id,
    retrievalConfiguration= {
        'vectorSearchConfiguration': {
            'filter': {
                'lessThanOrEquals': {
                    "key": "year",
                    "value": 2020
                }
            },
            'numberOfResults': 3  # fetch the top 3 chunks that most closely match the query
        }
    }
)

pp.pprint(relevant_documents["retrievalResults"])
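
Single conditions like the one above can be combined. The sketch below builds a filter that restricts results to a range of years, assuming the `andAll`, `greaterThanOrEquals`, and `lessThanOrEquals` operators of the Retrieve API filter syntax; the helper name `year_range_filter` is hypothetical, and it is worth checking the API reference for the operators that your vector store supports.

```python
def year_range_filter(start_year, end_year):
    """Build a Retrieve-style filter matching start_year <= year <= end_year."""
    if start_year > end_year:
        raise ValueError("start_year must not be greater than end_year")
    return {
        "andAll": [
            {"greaterThanOrEquals": {"key": "year", "value": start_year}},
            {"lessThanOrEquals": {"key": "year", "value": end_year}},
        ]
    }


# The resulting dictionary has the same shape as `retrievalConfiguration` above.
retrieval_configuration = {
    "vectorSearchConfiguration": {
        "numberOfResults": 3,
        "filter": year_range_filter(2019, 2020),
    }
}
print(retrieval_configuration)
```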

## Using LangChain Integration with AWS

### Using the Knowledge Bases Retriever (AmazonKnowledgeBasesRetriever)

In [None]:
from langchain_aws import AmazonKnowledgeBasesRetriever


retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id=knowledge_base_id,
    retrieval_config={
        "vectorSearchConfiguration": {
            "numberOfResults": 3
        }
    },
    region_name=aws_region
)

In [None]:
query = "What is Amazon doing in the field of Generative AI?"

retrieved_docs = retriever.invoke(query)
pp.pprint(retrieved_docs)

[ Document(page_content='Amazon Business launched in 2015 and today drives roughly $35B in annualized gross sales. More than six million active customers, including 96 of the global Fortune 100 companies, are enjoying Amazon Business’ one-stop shopping, real-time analytics, and broad selection on hundreds of millions of business supplies. We believe that we’ve only scratched the surface of what’s possible to date, and plan to keep building the features our business customers tell us they need and want.   While many brands and merchants successfully sell their products on Amazon’s marketplace, there are also a large number of brands and sellers who have launched their own direct-to-consumer websites. One of the challenges for these merchants is driving conversion from views to purchases. We invented Buy with Prime to help with this challenge. Buy with Prime allows third-party brands and sellers to offer their products on their own websites to our large Amazon Prime membership, and offer

### Q&A with RAG using LangChain RetrievalQA

In [None]:
from langchain_aws import ChatBedrock as BedrockChat


llm = BedrockChat(
    model_id=model_id,
    model_kwargs={
        "max_tokens": 512,
        "temperature": 0,
        "top_p": 0.9
    }
)

In [None]:
from langchain.prompts import PromptTemplate


PROMPT_TEMPLATE = """
Human: You are a financial advisor AI system that answers questions using fact-based and statistical information when possible.
Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>

<question>
{question}
</question>

The response should be specific and use statistics or numbers when possible.

Assistant:"""
claude_prompt = PromptTemplate(template=PROMPT_TEMPLATE,
                               input_variables=["context", "question"])

In [None]:
from langchain.chains import RetrievalQA


qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": claude_prompt}
)

In [None]:
answer = qa.invoke(query)
pp.pprint(answer)

{ 'query': 'What is Amazon doing in the field of Generative AI?',
  'result': 'According to the context provided, Amazon is investing heavily in '
            'Large Language Models (LLMs) and Generative AI. Some key points '
            "about Amazon's efforts in this field:\n"
            '\n'
            '- Amazon has been working on developing its own LLMs for a while '
            'now. LLMs are trained on up to hundreds of billions of parameters '
            'across vast datasets.\n'
            '\n'
            '- Amazon believes Generative AI will transform and significantly '
            'improve virtually every customer experience across its consumer, '
            'seller, brand, and creator offerings. \n'
            '\n'
            '- Amazon is democratizing this technology through AWS so that '
            'companies of all sizes can leverage Generative AI. AWS offers '
            'machine learning chips like Trainium and Inferentia to enable '
            'affordable 

### Q&A with RAG using LCEL (LangChain Expression Language) Chains

In [None]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain import hub


retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")
combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)
retrieval_qa_chain = create_retrieval_chain(retriever, combine_docs_chain)

In [None]:
answer = retrieval_qa_chain.invoke({'input': query})
pp.pprint(answer)

{ 'answer': 'According to the context, Amazon is investing heavily in large '
            'language models (LLMs) and generative AI. Some key points '
            'mentioned:\n'
            '\n'
            '1. Amazon believes generative AI based on very large language '
            'models will significantly accelerate machine learning adoption '
            'and transform virtually every customer experience.\n'
            '\n'
            '2. Amazon has been working on developing its own large language '
            'models for a while now.\n'
            '\n'
            '3. Amazon plans to continue investing substantially in these '
            'large language models across all of its consumer, seller, brand, '
            'and creator experiences.\n'
            '\n'
            '4. Similar to how AWS has democratized other technologies, Amazon '
            'is making generative AI available through AWS so that companies '
            'of all sizes can leverage it.\n'
          

## Cleanup

To avoid incurring future charges, delete the resources you created. You can do this by deleting the CloudFormation stack that was used to create the IAM role and SageMaker notebook.

---

## Conclusion

In this notebook, we saw how to use models available on Amazon Bedrock to generate embeddings, ingest those embeddings into Amazon Aurora PostgreSQL, and run a similarity search that matches user input against the documents (embeddings) stored there. We used LangChain as an abstraction layer to talk to both Amazon Bedrock and a Knowledge Base for Amazon Bedrock backed by Amazon Aurora PostgreSQL.

## References

  * [Amazon Bedrock Knowledge Base - Samples for building RAG workflows](https://github.com/aws-samples/amazon-bedrock-samples/tree/main/knowledge-bases) - This repository contains examples for customers to get started using the Amazon Bedrock Service.
  * [(AWS Machine Learning Blog) Knowledge Bases for Amazon Bedrock now supports metadata filtering to improve retrieval accuracy (2024-04-08)](https://aws.amazon.com/blogs/machine-learning/knowledge-bases-for-amazon-bedrock-now-supports-metadata-filtering-to-improve-retrieval-accuracy/)
  * [Build a powerful question answering bot with Amazon SageMaker, Amazon OpenSearch Service, Streamlit, and LangChain](https://aws.amazon.com/blogs/machine-learning/build-a-powerful-question-answering-bot-with-amazon-sagemaker-amazon-opensearch-service-streamlit-and-langchain/)
  * [Using the Amazon SageMaker Studio Image Build CLI to build container images from your Studio notebooks](https://aws.amazon.com/blogs/machine-learning/using-the-amazon-sagemaker-studio-image-build-cli-to-build-container-images-from-your-studio-notebooks/)
  * [LangChain](https://python.langchain.com/docs/get_started/introduction.html) - A framework for developing applications powered by language models.
  * [LangChain-AWS](https://python.langchain.com/v0.1/docs/integrations/platforms/aws/) - The `LangChain` integrations related to `Amazon AWS` platform.
  * [LangChain > Components > Chains](https://python.langchain.com/v0.1/docs/modules/chains/) - Chains refer to sequences of calls - whether to an LLM, a tool, or a data preprocessing step. The primary supported way to do this is with [LCEL](https://python.langchain.com/v0.1/docs/expression_language/).
  * [LangChain Use cases > Q&A with RAG](https://python.langchain.com/v0.1/docs/use_cases/question_answering/)
  * [PostgreSQL Tutorial](https://www.postgresqltutorial.com/)