# Retrieval Augmented Generation (RAG) using Foundation Models in SageMaker

In this notebook we demonstrate how to use Retrieval Augmented Generation (RAG) to build a question-and-answer chatbot to converse with the **SEC Schedule 14A document** using Foundation Models in SageMaker.

Foundation models are usually trained offline, which leaves the model unaware of any data created after it was trained. Additionally, foundation models are trained on very general domain corpora, making them less effective for domain-specific tasks. Retrieval Augmented Generation (RAG) is used to retrieve data from outside a foundation model and augment your prompts by adding the relevant retrieved data in context. For more information about RAG model architectures, see [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401).

With RAG, the external data used to augment your prompts can come from multiple data sources, such as document repositories, databases, or APIs. The first step is to convert your documents and any user queries into a compatible format to perform relevancy search. To make the formats compatible, a document collection, or knowledge library, and user-submitted queries are converted to numerical representations using embedding language models. Embedding is the process by which text is given a numerical representation in a vector space. RAG model architectures compare the embedding of each user query against the embeddings in the knowledge library. The original user prompt is then appended with relevant context from similar documents within the knowledge library, and this augmented prompt is sent to the foundation model. You can update knowledge libraries and their embeddings asynchronously.
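The retrieve-then-augment flow described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's implementation: the three-dimensional vectors are hypothetical stand-ins for real embeddings (which have far more dimensions), brute-force cosine similarity stands in for the OpenSearch k-NN search used later (which may use a different distance metric), and the prompt template is made up for the example.

```python
import math
from typing import List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: List[float], library: List[Tuple[str, List[float]]], k: int) -> List[Tuple[str, float]]:
    """Rank every (text, vector) pair by similarity to the query and keep the best k."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in library]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def augment_prompt(question: str, context_chunks: List[str]) -> str:
    """Prepend the retrieved chunks to the user's question (hypothetical template)."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Hypothetical knowledge library: (chunk text, toy embedding) pairs
library = [
    ("Chunk describing the director nominees.", [0.9, 0.1, 0.0]),
    ("Chunk describing executive compensation.", [0.1, 0.9, 0.1]),
    ("Chunk describing the meeting logistics.", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # toy embedding of "Who are the directors?"

hits = top_k(query_vec, library, k=1)
prompt = augment_prompt("Who are the directors?", [text for text, _ in hits])
print(prompt)
```

Only the best-matching chunk reaches the prompt; in the notebook, OpenSearch performs the ranking step and LangChain performs the prompt assembly.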

In the previous sections of this workshop, you deployed the **FLAN-T5** Foundation Model to SageMaker Endpoints and used it for various Natural Language Processing (NLP) tasks such as text summarization, common sense reasoning, translation, and question answering. In this section, we use the deployed SageMaker endpoints together: a text embedding endpoint creates vector embeddings that are stored in Amazon OpenSearch, and the FLAN-T5 endpoint answers questions in a RAG-based question-and-answer chatbot. The diagram below depicts this architecture.

We will also use **LangChain**, an open-source framework for developing and interfacing with applications powered by language models.

![Rag Architecture](../images/10-architecture.png)

## Prerequisites

The following are the prerequisites for this notebook:
1. Deploy the SageMaker JumpStart model called `FLAN-T5-XL` ***OR*** run the Jupyter Notebook titled `01-deploy-text2text-model.ipynb`. 
2. Deploy the SageMaker JumpStart text embedding model called `GPT-J 6B Embedding FP16`.
3. [Not required if you do step 2.] Run the Jupyter Notebook titled `02-deploy-text2emb-model.ipynb`. This notebook deploys the `gpt-j-6b-fp16` text embedding model to a SageMaker Endpoint.
4. [Non-AWS Event] Run the Jupyter Notebook titled `03-create-vector-store.ipynb`. This notebook creates an Amazon OpenSearch cluster and the required index for the vector database. This notebook is not required if you are running an AWS Event.


## Install Required Python Libraries

_*IMPORTANT*_: Ensure you are running Python 3.9+.

### 1. Set Up Kernel and Required Dependencies

First, check that the correct kernel is chosen.

<img src="img/kernel_set_up_03.png" width="300"/>

You can click on that to see and check the details of the image, kernel, and instance type.

<img src="img/w3_kernel_and_instance_type_03.png" width="600"/>

# NOTE: YOU CANNOT CONTINUE UNTIL THE KERNEL IS STARTED

**Please wait until the kernel has started before continuing.**

# Use `Shift+Enter` to Run Each Cell

Use `Shift+Enter` on the cell below to see the output.

# Click `Kernel` => `Restart Kernel and Run All Cells` to Run All Cells
![](img/restart-kernel-and-run-all-cells.png)

In [2]:
import sys

# Fail fast if the kernel is older than Python 3.9.
# Comparing the full version tuple avoids false failures on a future major
# version such as 4.0, whose minor version would be below 9.
if sys.version_info < (3, 9):
    raise RuntimeError(
        f"Python 3.9+ is required, found {sys.version_info.major}.{sys.version_info.minor}."
    )

print("Python version is 3.9 or later.")


Python version is 3.9 or later.


Begin by installing the required Python libraries.

## _==> Please ignore all WARNINGs and ERRORs from the `pip install` below. <==_

In [3]:
!pip install -r ../requirements.txt

[notice] A new release of pip is available: 23.0 -> 23.2.1
[notice] To update, run: pip install --upgrade pip


## Set Up the Environment

In [4]:
# Setup SageMaker Session
import sagemaker, boto3, json
from sagemaker.session import Session

sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

In [5]:
# Import langchain 
from langchain.document_loaders import UnstructuredHTMLLoader,BSHTMLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter,CharacterTextSplitter
from langchain.llms.sagemaker_endpoint import ContentHandlerBase
from langchain.vectorstores import OpenSearchVectorSearch

## Go to Home > Deployments > Endpoints and copy the names of the text embedding and Text2Text endpoints you deployed in Lab 2.

![Endpoint](../images/50-navigate-to-endpoints.png)

### They will be something like: "jumpstart-dft-hf-textembedding-yyyy-mm-dd-hh-min-sec-msc" and "jumpstart-example-huggingface-text2text-yyyy-mm-dd-hh-min-sec-msc"

<img src="../images/50-confirm-endpoints.png" alt="drawing" width="1000"/>

In [22]:
# Replace the name of the endpoint here
llm_endpoint_name ="jumpstart-example-huggingface-text2text-2023-10-06-19-56-29-081"
print('Your Text2Text Model is: ' ,llm_endpoint_name)

# Set embeddings model endpoint to jumpstart model endpoint
# Replace the name of the endpoint here
embeddings_model_endpoint_name = "jumpstart-example-huggingface-textembed-2023-10-06-20-10-28-812"
print('Your Text Embedding Model is: ' ,embeddings_model_endpoint_name)



Your Text2Text Model is:  jumpstart-example-huggingface-text2text-2023-10-06-19-56-29-081
Your Text Embedding Model is:  jumpstart-example-huggingface-textembed-2023-10-06-20-10-28-812


In [14]:
# Set variables for Amazon OpenSearch

import boto3
cfn_client = boto3.client('cloudformation')
stack_name = "genai-rag-workshop-studio"

response = cfn_client.describe_stacks(
    StackName = stack_name,
)

outputs = response['Stacks'][0]['Outputs'] 

opensearch_host_id= next(output['OutputValue'] for output in outputs
        if output['OutputKey'] == 'Opensearchhostid')

#Please confirm the name of the OpenSearch Index
_aos_host = opensearch_host_id
_aos_index = 'fsi-demo-knn'

print(_aos_host)

search-genaiopensearch-ugzhlx2nc4yikjjyskp7cux7wy.us-east-1.es.amazonaws.com


## Chunk your Data and Load into Amazon OpenSearch

In this section, we chunk the data into smaller documents. Chunking splits a large text into smaller pieces. It is an important step because the embedding and text generation models accept only limited input lengths, and smaller, focused chunks make the results of a similarity search more relevant to the query, which in turn improves the quality of the chatbot's answers.
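The idea behind chunk size and overlap can be sketched with a fixed-size sliding window over characters. This is a simplified stand-in, not how LangChain's `RecursiveCharacterTextSplitter` (used below) works: that splitter first tries to break on natural boundaries such as paragraph breaks and newlines, so its chunks vary in length. The function name `chunk_text` is hypothetical.

```python
from typing import List

def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each window repeats the last character of the previous one, so text near a boundary appears in two neighboring chunks rather than being cut off from its context.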

In [15]:
loader = BSHTMLLoader("../data/14A/0000003153-20-000004.html")
data = loader.load()

In [16]:
print (f'You have {len(data)} document(s) in your data')
print (f'There are {len(data[0].page_content)} characters in your document')

You have 1 document(s) in your data
There are 153880 characters in your document



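### Next, select the chunk size and overlap.

Both values are measured in characters, because `RecursiveCharacterTextSplitter` uses Python's `len` as its default length function. A 200-character overlap means consecutive chunks share some text, so content near a chunk boundary is less likely to be separated from its context.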

In [17]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1600, chunk_overlap=200)
docs = text_splitter.split_documents(data)

print (f'Now you have {len(docs)} documents')

Now you have 110 documents
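As a rough sanity check, a fixed-size window that advances by `chunk_size - chunk_overlap` = 1,400 characters needs about 110 windows to cover the 153,880 characters loaded above, in line with the count reported by the splitter. Treat the agreement as approximate: `RecursiveCharacterTextSplitter` cuts at separators, so its chunk count can differ from this estimate. The helper name is hypothetical.

```python
import math

def estimate_chunk_count(n_chars: int, chunk_size: int, chunk_overlap: int) -> int:
    """Windows needed to cover n_chars (> 0) when each window advances by chunk_size - chunk_overlap."""
    if n_chars <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return math.ceil((n_chars - chunk_overlap) / step)

print(estimate_chunk_count(153_880, chunk_size=1600, chunk_overlap=200))  # 110
```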


In [18]:
# Helper function to process document

import regex as re

def postproc(s):
    s = s.replace(u'\xa0', u' ') # no-break space 
    s = s.replace('\n', ' ') # new-line
    s = re.sub(r'\s+', ' ', s) # collapse runs of whitespace into a single space
    return s

In [19]:
for doc in docs:
    doc.page_content = postproc(doc.page_content)
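To see what `postproc` does, here is the same cleanup applied to a short sample string. This sketch uses the standard-library `re` module, which should treat `\s+` the same way as the `regex` package for this pattern. In Python 3, `\s` already matches the no-break space and newlines in `str` patterns, so the two `replace` calls make the intent explicit rather than being strictly necessary.

```python
import re

def postproc(s: str) -> str:
    s = s.replace("\xa0", " ")  # no-break space -> regular space
    s = s.replace("\n", " ")    # newline -> space
    s = re.sub(r"\s+", " ", s)  # collapse runs of whitespace into one space
    return s

sample = "ALABAMA\xa0POWER\n\nCOMPANY   Proxy\tStatement"
print(postproc(sample))  # ALABAMA POWER COMPANY Proxy Statement
```

Note that the helper does not strip leading or trailing whitespace; a single space can remain at either end of a chunk.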

In [20]:
# Review the first document for correctness
docs[0]

Document(page_content='UNITED STATESSECURITIES AND EXCHANGE COMMISSIONWASHINGTON, D.C. 20549SCHEDULE 14A INFORMATIONProxy Statement Pursuant To Section 14(a)of the Securities Exchange Act of 1934xFiled by the RegistrantoFiled by a party other than the RegistrantCheck the appropriate box:oPreliminary proxy statementoConfidential, for use of the Commission only (as permitted by Rule 14a-6(e)(2))xDefinitive proxy statementoDefinitive additional materialsoSoliciting material under Rule 14a-12ALABAMA POWER COMPANY(Name of Registrant as Specified in Its Charter)(Name of Person(s) Filing Proxy Statement, if Other Than the Registrant)Payment of Filing Fee (Check the appropriate box):xNo fee required. oFee computed on table below per Exchange Act Rules 14a-6(i)(1) and 0-11. (1)Title of each class of securities to which transaction applies: (2)Aggregate number of securities to which transaction applies: (3)Per unit price or other underlying value of transaction computed pursuant to Exchange Act 

In [21]:
# Limit the number of total chunks to 1000
MAX_DOCS = 1000
if len(docs) > MAX_DOCS:
    docs = docs[:MAX_DOCS]

### Before populating the vector store, set up the embeddings function and check that computing embeddings works without exceptions.

We begin by extending the LangChain `SagemakerEndpointEmbeddings` class to create a custom embeddings function.

In [23]:
import time
import json
import logging
from typing import List
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler

logger = logging.getLogger(__name__)

# extend the SagemakerEndpointEmbeddings class from langchain to provide a custom embedding function
class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):
    def embed_documents(
        self, texts: List[str], chunk_size: int = 5
    ) -> List[List[float]]:
        """Compute doc embeddings using a SageMaker Inference Endpoint.

        Args:
            texts: The list of texts to embed.
            chunk_size: How many input texts are grouped together into a
                single request to the endpoint.

        Returns:
            List of embeddings, one for each text.
        """
        results = []
        # Never send more texts per request than there are, and at least one
        # (an empty list would otherwise give range() a step of zero).
        _chunk_size = max(1, min(chunk_size, len(texts)))
        st = time.time()
        for i in range(0, len(texts), _chunk_size):
            response = self._embedding_func(texts[i:i + _chunk_size])
            results.extend(response)
        time_taken = time.time() - st
        logger.info(f"got results for {len(texts)} in {time_taken}s, length of embeddings list is {len(results)}")
        return results


# class for serializing/deserializing requests/responses to/from the embeddings model
class ContentHandler(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: List[str], model_kwargs=None) -> bytes:
        # Serialize the batch of texts (plus any model kwargs) as the JSON request body
        input_str = json.dumps({"text_inputs": prompt, **(model_kwargs or {})})
        return input_str.encode('utf-8') 

    def transform_output(self, output: bytes) -> List[List[float]]:
        # The endpoint returns {"embedding": [[...], ...]}, one vector per input text
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["embedding"]
    

def create_sagemaker_embeddings_from_js_model(embeddings_model_endpoint_name: str, aws_region: str) -> SagemakerEndpointEmbeddingsJumpStart:
    # all set to create the objects for the ContentHandler and 
    # SagemakerEndpointEmbeddingsJumpStart classes
    content_handler = ContentHandler()

    # note the name of the embeddings SageMaker endpoint; this is the model
    # used to generate the embeddings
    embeddings = SagemakerEndpointEmbeddingsJumpStart( 
        endpoint_name=embeddings_model_endpoint_name,
        region_name=aws_region, 
        content_handler=content_handler
    )
    return embeddings
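The serialization logic of `ContentHandler` can be checked offline, without calling the endpoint. The sketch below copies the two transform methods as standalone functions (no LangChain base class) and feeds `transform_output` an `io.BytesIO` object, which stands in for the streaming body that the SageMaker runtime returns (that is why the handler calls `.read()`). The response body is hypothetical and only mirrors the `{"embedding": [...]}` shape the handler expects.

```python
import io
import json
from typing import Dict, List, Optional

def transform_input(texts: List[str], model_kwargs: Optional[Dict] = None) -> bytes:
    # Same request shape as ContentHandler.transform_input
    return json.dumps({"text_inputs": texts, **(model_kwargs or {})}).encode("utf-8")

def transform_output(output) -> List[List[float]]:
    # Same parsing as ContentHandler.transform_output
    response_json = json.loads(output.read().decode("utf-8"))
    return response_json["embedding"]

request = transform_input(["first chunk", "second chunk"])
print(request)  # b'{"text_inputs": ["first chunk", "second chunk"]}'

# Hypothetical response body; a real endpoint returns much longer vectors
fake_body = io.BytesIO(json.dumps({"embedding": [[0.1, 0.2], [0.3, 0.4]]}).encode("utf-8"))
vectors = transform_output(fake_body)
print(len(vectors))  # 2
```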

Next, we create the embeddings object. It computes the document embeddings in batches when the vector store is populated.
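`embed_documents` sends `chunk_size` texts per endpoint request (5 by default in the class above). Assuming the vector store calls it without overriding that value, embedding the 110 chunks takes 22 requests. The sketch uses the same slicing pattern as the class; `batch` is a hypothetical helper for illustration.

```python
from typing import List

def batch(texts: List[str], batch_size: int) -> List[List[str]]:
    """Group texts into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

texts = [f"chunk {n}" for n in range(110)]
batches = batch(texts, batch_size=5)
print(len(batches))  # 22
```

Larger batches mean fewer round trips to the endpoint, at the cost of bigger request payloads per call.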

In [24]:
embeddings = create_sagemaker_embeddings_from_js_model(embeddings_model_endpoint_name, aws_region)

### Create embeddings of your documents to get ready for semantic search



In [25]:
from opensearchpy import RequestsHttpConnection
from requests_aws4auth import AWS4Auth

# Sign every request to Amazon OpenSearch Service ("es") with the
# current AWS credentials using Signature Version 4
service = "es"
credentials = boto3.Session().get_credentials()

awsauth = AWS4Auth(
    credentials.access_key,
    credentials.secret_key,
    aws_region,
    service,
    session_token=credentials.token,
)

# Embed each chunk with the SageMaker embeddings endpoint and write the
# vectors, chunk texts, and metadata to the OpenSearch index
docsearch = OpenSearchVectorSearch.from_texts(
    texts=[d.page_content for d in docs],
    embedding=embeddings,
    metadatas=[d.metadata for d in docs],
    opensearch_url=[{'host': _aos_host, 'port': 443}],
    index_name=_aos_index,
    http_auth=awsauth,
    use_ssl=True,
    pre_delete_index=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=100000,
)

## Question answering over Documents 

So far, we have chunked a large document into smaller ones, created vector embeddings, and stored them in an OpenSearch vector database. Now we can answer questions over this document data.

Because we have created an index over the data, we can run a semantic search over the documents, so only the chunks most relevant to the question are passed via the prompt to the Large Language Model (LLM). This saves time and money compared with passing every document to the LLM, and the full filing would far exceed the model's input length anyway.
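To put the savings in numbers, here is a rough comparison under two assumptions: the search returns 4 chunks (LangChain's default `k` for `similarity_search`), and each chunk is at most 1,600 characters, the `chunk_size` chosen above.

```python
# Rough comparison of how much text reaches the LLM with and without retrieval
document_chars = 153_880        # characters in the loaded filing
k, chunk_size = 4, 1_600        # assumed number of hits and maximum chunk length
retrieved_chars = k * chunk_size
fraction = retrieved_chars / document_chars
print(f"at most {retrieved_chars} of {document_chars} characters ({fraction:.1%})")
# at most 6400 of 153880 characters (4.2%)
```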

We use LangChain's **question_answering** chain with the `stuff` chain type in this example; it "stuffs" all retrieved documents into a single prompt. Further details on document chains can be found in the LangChain [documentation, here](https://python.langchain.com/docs/modules/chains/document/).
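Conceptually, the `stuff` chain formats all retrieved chunks into one prompt and makes a single LLM call. The sketch below shows only the prompt-building step; the template text and the `stuff_prompt` name are hypothetical, not LangChain's actual default prompt.

```python
from typing import List

# Hypothetical QA template; LangChain ships its own default prompt
TEMPLATE = (
    "Answer the question based on the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

def stuff_prompt(chunks: List[str], question: str) -> str:
    """Concatenate ("stuff") every retrieved chunk into a single prompt."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)

prompt = stuff_prompt(["Chunk one.", "Chunk two."], "Who are the directors?")
print(prompt)
```

Because everything goes into one call, the combined chunks must fit within the model's input length; other chain types split the work across several calls instead.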

In [26]:
from typing import Dict

from langchain import PromptTemplate, SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain.chains.question_answering import load_qa_chain
import json

class SageMakerLLMContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        # Send the prompt under "text_inputs", with generation parameters
        # (e.g. temperature) alongside it
        input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # The response body holds a list of generated texts; return the first
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json['generated_texts'][0]


sagemaker_llm_content_handler= SageMakerLLMContentHandler()

chain = load_qa_chain(
    llm=SagemakerEndpoint(
        endpoint_name=llm_endpoint_name,
        # credentials_profile_name="credentials-profile-name",
        region_name=aws_region,
        model_kwargs={"temperature": 1e-10},
        content_handler=sagemaker_llm_content_handler,
    ),
    chain_type="stuff"
)
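The request and response logic of `SageMakerLLMContentHandler` can be checked offline without invoking the FLAN-T5 endpoint. The standalone functions below copy its two transform methods; `io.BytesIO` stands in for the streaming response body, and the fake response is hypothetical, mirroring only the `{"generated_texts": [...]}` shape that `transform_output` reads.

```python
import io
import json
from typing import Dict

def transform_input(prompt: str, model_kwargs: Dict) -> bytes:
    # Same request shape as SageMakerLLMContentHandler.transform_input
    return json.dumps({"text_inputs": prompt, **model_kwargs}).encode("utf-8")

def transform_output(output) -> str:
    # Same parsing as SageMakerLLMContentHandler.transform_output
    response_json = json.loads(output.read().decode("utf-8"))
    return response_json["generated_texts"][0]

payload = transform_input("Who are the directors?", {"temperature": 1e-10})
print(payload)

fake_body = io.BytesIO(json.dumps({"generated_texts": ["Example answer"]}).encode("utf-8"))
print(transform_output(fake_body))  # Example answer
```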


In [27]:
query = "Who are the directors?"
sim_docs = docsearch.similarity_search(query, include_metadata=True)
chain.run(input_documents=sim_docs, question=query)

'Phillip M. Webb'

In [28]:
query = "Who are the nominees?"
sim_docs = docsearch.similarity_search(query, include_metadata=True)
chain.run(input_documents=sim_docs, question=query)

'Phillip M. Webb'

In [29]:
for person in ['Mark A. Crosswhite', 'Phillip M. Webb']:
    for query_template in [
                    "How old is {PERSON}?",
                    "What is {PERSON} current position and what is the name of the organization he/she currently works for?"
                 ]:
    
        query = query_template.format(PERSON=person)
        print('Q:', query)

        sim_docs = docsearch.similarity_search(query, include_metadata=True)
        answer = chain.run(input_documents=sim_docs, question=query)    
        print('A:', answer)
        print('\n---\n')

Q: How old is Mark A. Crosswhite?
A: unanswerable

---

Q: What is Mark A. Crosswhite current position and what is the name of the organization he/she currently works for?
A: not enough information

---

Q: How old is Phillip M. Webb?
A: 62

---

Q: What is Phillip M. Webb current position and what is the name of the organization he/she currently works for?
A: President of Webb Concrete and Building Materials

---



#### Cleanup