# Retrieval-Augmented Generation: Question Answering based on Custom Dataset with Open-sourced [LangChain](https://python.langchain.com/en/latest/index.html) Library

This is an edited version of the LLM RAG notebook, customized for a pre-deployed endpoint. The original notebook can be found [here](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart-foundation-models/question_answerIng_retrieval_augmented_generation_jumpstart/question_answerIng_langchain_jumpstart.ipynb).

When running in Studio, please make sure you are using Python 3.9 or above (the `Data Science 3.0` image has been tested).

**This notebook serves as a template, so you can easily replace the example dataset with your own to build a custom question answering application.**

## Step 1. Deploy large language model (LLM) and embedding model in SageMaker JumpStart

We will use a pre-deployed SageMaker endpoint that hosts a running LLM, and deploy a SageMaker JumpStart GPT-J embedding model for text embeddings. To do that, you need to modify the `_MODEL_CONFIG_` Python dictionary defined below.

In [4]:
!pip install --upgrade sagemaker --quiet
!pip install ipywidgets==7.0.0 --quiet
!pip install langchain --quiet
!pip install faiss-cpu --quiet


In [5]:
import langchain
langchain.__version__

'0.0.141'

In [6]:
import time
import sagemaker, boto3, json
from sagemaker.session import Session
from sagemaker.model import Model
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base
from typing import Any, Dict, List, Optional
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.llms.sagemaker_endpoint import ContentHandlerBase

sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name='us-west-2'))
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = sagemaker_session.boto_region_name
sess = sagemaker.Session()
model_version = "*"

In [7]:
aws_role

'arn:aws:iam::249762235316:role/mod-6297809195fe4845-SageMakerExecutionRole-3FQK6YXK34VN'

In [8]:
def query_endpoint_with_json_payload(encoded_json, endpoint_name, content_type="application/json"):
    client = boto3.client("runtime.sagemaker",region_name='us-west-2')
    response = client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType=content_type, Body=encoded_json
    )
    return response


def parse_response_model(query_response):
    model_predictions = json.loads(query_response["Body"].read())
    return model_predictions

Fill in the endpoint name of the pre-deployed LLM here:

In [9]:
endpoint_name = "oa-pythia12b-5fc5bf5d-0a34-48e6-885f-5e5d63a547b6"

Deploy the SageMaker endpoint for the GPT-J 6B embedding model. The pre-deployed LLM is marked with `"deploy": "no"`, so the deployment loop skips it. If you want to compare several LLMs, add further entries to the dictionary below.

In [10]:
_MODEL_CONFIG_ = {
    "deployed_llm": {
        "deploy": "no",
        "endpoint_name":endpoint_name,
        "parse_function": parse_response_model,
        "prompt": """Answer based on context:\n\n{context}\n\n{question}""",
    },
    "huggingface-textembedding-gpt-j-6b-fp16": {
        "instance type": "ml.g5.xlarge",
        "env": {"TS_DEFAULT_WORKERS_PER_MODEL": "1"},
    },
}
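The skip rule used by the deployment loop can be written as a small helper. This is only a sketch: `models_to_deploy` and `example_config` are hypothetical names introduced for illustration, and the endpoint name in the example is a placeholder.

```python
from typing import Any, Dict, List


def models_to_deploy(config: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return the model ids that still need an endpoint.

    An entry is skipped when it carries "deploy": "no", which is how the
    pre-deployed LLM is marked in the configuration above.
    """
    return [
        model_id
        for model_id, settings in config.items()
        if settings.get("deploy") != "no"
    ]


# Trimmed-down copy of the configuration, used only for this illustration.
example_config = {
    "deployed_llm": {"deploy": "no", "endpoint_name": "my-llm-endpoint"},
    "huggingface-textembedding-gpt-j-6b-fp16": {"instance type": "ml.g5.xlarge"},
}

pending = models_to_deploy(example_config)
print(pending)
```

With the configuration used in this notebook, only the embedding model is left to deploy.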

In [12]:
newline, bold, unbold = "\n", "\033[1m", "\033[0m"
for model_id in _MODEL_CONFIG_:
    if 'deploy' in _MODEL_CONFIG_[model_id] and _MODEL_CONFIG_[model_id]['deploy'] == 'no':
        continue
    endpoint_name = name_from_base(f"RAGResearch-{model_id}")
    inference_instance_type = _MODEL_CONFIG_[model_id]["instance type"]

    # Retrieve the inference container uri. This is the base HuggingFace container image for the default model above.
    deploy_image_uri = image_uris.retrieve(
        region="us-west-2",
        framework=None,  # automatically inferred from model_id
        image_scope="inference",
        model_id=model_id,
        model_version=model_version,
        instance_type=inference_instance_type,
    )
    # Retrieve the model uri.
    model_uri = model_uris.retrieve(
        region="us-west-2",
        model_id=model_id, model_version=model_version, model_scope="inference"
    )
    model_inference = Model(
        sagemaker_session=sagemaker_session,
        image_uri=deploy_image_uri,
        model_data=model_uri,
        role=aws_role,
        predictor_cls=Predictor,
        name=endpoint_name,
        env=_MODEL_CONFIG_[model_id]["env"],
    )
    model_predictor_inference = model_inference.deploy(
        initial_instance_count=1,
        instance_type=inference_instance_type,
        predictor_cls=Predictor,
        endpoint_name=endpoint_name,
    )
    print(f"{bold}Model {model_id} has been deployed successfully.{unbold}{newline}")
    _MODEL_CONFIG_[model_id]["endpoint_name"] = endpoint_name

----------!Model huggingface-textembedding-gpt-j-6b-fp16 has been deployed successfully.



## Step 2. Ask a question to LLM without providing the context

To illustrate why we need a retrieval-augmented generation (RAG) based approach to solve the question answering problem, let's ask the model a question directly and see how it responds.

In [13]:
question = "<|prompter|>Which instances can I use with Managed Spot Training in SageMaker?<|endoftext|><|assistant|>"

In [14]:
payload = {
    "inputs": question,
    "max_new_tokens": 1000,
    "num_return_sequences": 1,
    "temperature":0.8,
    "top_k": 1000,
    "top_p": 0.1,
    "do_sample": True,
}

list_of_LLMs = list(_MODEL_CONFIG_.keys())
list_of_LLMs.remove("huggingface-textembedding-gpt-j-6b-fp16")  # remove the embedding model


for model_id in list_of_LLMs:
    endpoint_name = _MODEL_CONFIG_[model_id]["endpoint_name"]
    query_response = query_endpoint_with_json_payload(
        json.dumps(payload).encode("utf-8"), endpoint_name=endpoint_name
    )
    generated_texts = _MODEL_CONFIG_[model_id]["parse_function"](query_response)
    print(f"For model: {model_id}, the generated output is: {generated_texts[0]}\n")

For model: deployed_llm, the generated output is: {'generated_text': '<|prompter|>Which instances can I use with Managed Spot Training in SageMaker?<|endoftext|><|assistant|>Managed Spot Training can be used with a variety of SageMaker instances, including:\n\n'}



You can see that the generated answer is not useful: it breaks off before naming any instance.

## Step 3. Improve the answer to the same question using **prompt engineering** with insightful context


To answer the question better, we provide extra contextual information, combine it with a prompt, and send it to the model together with the question. Below is an example.

In [15]:
context = """Managed Spot Training can be used with all instances supported in Amazon SageMaker. Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available."""

In [16]:
parameters = {
    "max_new_tokens": 1000,
    "num_return_sequences": 1,
    "top_k": 1000,
    "top_p": 0.1,
    "do_sample": True,
}

for model_id in list_of_LLMs:
    endpoint_name = _MODEL_CONFIG_[model_id]["endpoint_name"]

    prompt = _MODEL_CONFIG_[model_id]["prompt"]

    text_input = prompt.replace("{context}", context)
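    # Note: `question` (In [13]) already contains the <|prompter|> ... <|assistant|> markers,
    # so they end up in the prompt twice, as the output below shows.
    text_input = "<|prompter|>"+text_input.replace("{question}", question)+"<|endoftext|><|assistant|>"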
    payload = {"inputs": text_input, **parameters}

    query_response = query_endpoint_with_json_payload(
        json.dumps(payload).encode("utf-8"), endpoint_name=endpoint_name
    )
    generated_texts = _MODEL_CONFIG_[model_id]["parse_function"](query_response)
    print(
        f"{bold}For model: {model_id}, the generated output is: {generated_texts[0]}{unbold}{newline}"
    )

For model: deployed_llm, the generated output is: {'generated_text': '<|prompter|>Answer based on context:\n\nManaged Spot Training can be used with all instances supported in Amazon SageMaker. Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available.\n\n<|prompter|>Which instances can I use with Managed Spot Training in SageMaker?<|endoftext|><|assistant|><|endoftext|><|assistant|>Managed Spot Training is available in all instances supported in Amazon SageMaker. This includes instances of'}



The output from step 3 shows that the chance of getting a correct response depends heavily on the context you send to the LLM.
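The prompt assembly from step 3 can also be sketched as a standalone helper that applies the chat markers exactly once, even when the question already carries them. The markers are the ones used in this notebook; `strip_markers` and `build_prompt` are hypothetical names, and whether a single set of markers changes the model's answer has not been tested here.

```python
PROMPTER, END, ASSISTANT = "<|prompter|>", "<|endoftext|>", "<|assistant|>"


def strip_markers(text: str) -> str:
    """Remove any chat markers already present in the text."""
    for marker in (PROMPTER, END, ASSISTANT):
        text = text.replace(marker, "")
    return text.strip()


def build_prompt(template: str, context: str, question: str) -> str:
    """Fill the template and wrap the result in the chat markers once."""
    body = template.replace("{context}", context).replace(
        "{question}", strip_markers(question)
    )
    return f"{PROMPTER}{body}{END}{ASSISTANT}"


template = "Answer based on context:\n\n{context}\n\n{question}"
context = "Managed Spot Training can be used with all instances supported in Amazon SageMaker."
question = (
    "<|prompter|>Which instances can I use with Managed Spot Training "
    "in SageMaker?<|endoftext|><|assistant|>"
)

text_input = build_prompt(template, context, question)
print(text_input)
```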

**<span style="color:red">Now, the question becomes: where can I find the insightful context for a given user query? The answer is to use a pre-stored knowledge database with retrieval-augmented generation, as shown in step 4 below</span>.**

## Step 4. Use RAG based approach with [LangChain](https://python.langchain.com/en/latest/index.html) and SageMaker endpoints to build a simplified question and answering application.


We plan to use document embeddings to fetch the most relevant documents in our document knowledge library and combine them with the prompt that we provide to the LLM.

To achieve that, we will do the following.

1. **Generate embeddings for each document in the knowledge library with the SageMaker GPT-J-6B embedding model.**
2. **Identify the top K most relevant documents based on the user query.**
    - 2.1 **For a query of your interest, generate the embedding of the query using the same embedding model.**
    - 2.2 **Search for the indexes of the top K most relevant documents in the embedding space using in-memory Faiss search.**
    - 2.3 **Use the indexes to retrieve the corresponding documents.**
3. **Combine the retrieved documents with the prompt and question and send them to the SageMaker LLM.**
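The retrieval part of these steps can be illustrated with a toy example that needs no endpoint. The two-dimensional vectors below are made up for illustration, and Euclidean distance is an assumption; the Faiss index built later in the notebook may use a different metric or search strategy.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(
    query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k closest document vectors."""
    scored = [(i, l2_distance(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy knowledge library with hand-written stand-in embeddings.
documents = ["doc about spot training", "doc about pricing", "doc about spot instances"]
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query_vector = [1.0, 0.05]  # stand-in for the embedded user query

hits = top_k(query_vector, doc_vectors, k=2)
retrieved = [documents[index] for index, _ in hits]
print(retrieved)
```

The indexes returned by the search are used to look up the original documents, which is what step 2.3 describes.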



Note: The retrieved document/text should be large enough to contain enough information to answer a question, but small enough to fit into the LLM prompt (maximum sequence length of 1024 tokens).
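One way to respect this limit is to keep ranked documents only while they fit into a token budget. The sketch below is an assumption-laden illustration: `rough_token_count` counts whitespace-separated words, which is only a crude proxy for the model's real tokenizer, and `pack_context` is a hypothetical helper that is not part of LangChain.

```python
from typing import List


def rough_token_count(text: str) -> int:
    """Crude proxy: one token per whitespace-separated word."""
    return len(text.split())


def pack_context(docs: List[str], question: str, max_tokens: int = 1024) -> List[str]:
    """Keep documents, in ranked order, until the budget is used up."""
    budget = max_tokens - rough_token_count(question)
    kept = []
    for doc in docs:
        cost = rough_token_count(doc)
        if cost > budget:
            break  # the next document would overflow the prompt
        kept.append(doc)
        budget -= cost
    return kept


# Three synthetic documents of 600, 300 and 300 words.
ranked_docs = ["alpha " * 600, "beta " * 300, "gamma " * 300]
selected = pack_context(ranked_docs, "Which instances can I use?", max_tokens=1024)
print(len(selected))
```

A real application would count tokens with the tokenizer of the deployed model and leave room for the generated answer as well.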

---
To build a simplified QA application with LangChain, we need to:
1. Wrap our SageMaker endpoints for the embedding model and the LLM into `langchain.embeddings.SagemakerEndpointEmbeddings` and `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`. That requires a small override of the `SagemakerEndpointEmbeddings` class to make it compatible with the SageMaker embedding model.
2. Prepare the dataset to build the knowledge database.

---

First, we wrap the SageMaker endpoint for the embedding model into `langchain.embeddings.SagemakerEndpointEmbeddings`. That requires a small override of the `SagemakerEndpointEmbeddings` class to make it compatible with the SageMaker embedding model.

In [17]:
class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):
    def embed_documents(self, texts: List[str], chunk_size: int = 5) -> List[List[float]]:
        """Compute doc embeddings using a SageMaker Inference Endpoint.

        Args:
            texts: The list of texts to embed.
            chunk_size: The chunk size defines how many input texts will
                be grouped together in one request. Defaults to 5.

        Returns:
            List of embeddings, one for each text.
        """
        results = []
        _chunk_size = len(texts) if chunk_size > len(texts) else chunk_size

        for i in range(0, len(texts), _chunk_size):
            response = self._embedding_func(texts[i : i + _chunk_size])
            results.extend(response)
        return results


class ContentHandler(ContentHandlerBase):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        embeddings = response_json["embedding"]
        if len(embeddings) == 1:
            return embeddings[0]
        return embeddings


content_handler = ContentHandler()

embeddings = SagemakerEndpointEmbeddingsJumpStart(
    endpoint_name=_MODEL_CONFIG_["huggingface-textembedding-gpt-j-6b-fp16"]["endpoint_name"],
    region_name=aws_region,
    content_handler=content_handler,
)
embeddings.client = boto3.client("runtime.sagemaker",region_name='us-west-2')
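The batching behaviour of `embed_documents` can be checked without an endpoint. In this sketch `embed_in_batches` mirrors the loop above, while `fake_embed` is a made-up stand-in for the call to the embedding endpoint that returns a one-dimensional "embedding" per text.

```python
from typing import Callable, List


def embed_in_batches(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    chunk_size: int = 5,
) -> List[List[float]]:
    """Send texts to the embedder in groups of at most chunk_size."""
    results: List[List[float]] = []
    for start in range(0, len(texts), chunk_size):
        results.extend(embed_batch(texts[start : start + chunk_size]))
    return results


calls: List[int] = []  # records the size of every request


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in for the endpoint: the 'embedding' is just the text length."""
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


texts = [f"document {i}" for i in range(12)]
vectors = embed_in_batches(texts, fake_embed, chunk_size=5)
print(calls)
```

Twelve texts with a chunk size of 5 result in three requests, and the output keeps one vector per input text in the original order.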

Next, we wrap up our SageMaker endpoints for LLM into `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`. 

In [70]:
from langchain.llms.sagemaker_endpoint import SagemakerEndpoint

parameters = {
    "max_new_tokens": 2000,
    "num_return_sequences": 1,
    "top_k": 100,
    "top_p": 0.2,
    "temperature":0.8,
    "do_sample": True,
}


class ContentHandler(ContentHandlerBase):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        input_str = json.dumps({"inputs":  "<|prompter|>"+prompt+"<|endoftext|><|assistant|>", **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json[0]["generated_text"]


content_handler = ContentHandler()

sm_llm = SagemakerEndpoint(
    endpoint_name=_MODEL_CONFIG_["deployed_llm"]["endpoint_name"],
    region_name=aws_region,
    model_kwargs=parameters,
    content_handler=content_handler,
)
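The request and response handling of this content handler can be tried out offline. The sketch below assumes the response shape seen in the outputs of this notebook, a list with one `generated_text` entry that echoes the prompt; `to_request_body` and `extract_answer` are hypothetical helpers, and cutting the text at the last `<|assistant|>` marker is a design choice of this sketch rather than something the handler above does.

```python
import io
import json
from typing import Any, Dict

ASSISTANT = "<|assistant|>"


def to_request_body(prompt: str, model_kwargs: Dict[str, Any]) -> bytes:
    """Build the JSON request body, wrapping the prompt in chat markers."""
    payload = {"inputs": f"<|prompter|>{prompt}<|endoftext|>{ASSISTANT}", **model_kwargs}
    return json.dumps(payload).encode("utf-8")


def extract_answer(response_body: io.BytesIO) -> str:
    """Parse the response and keep only the text after the last assistant marker."""
    response_json = json.loads(response_body.read().decode("utf-8"))
    generated = response_json[0]["generated_text"]
    return generated.rsplit(ASSISTANT, 1)[-1].strip()


request = to_request_body("Hi", {"top_p": 0.2})

# Hand-written response standing in for the endpoint's streaming body.
fake_body = io.BytesIO(
    json.dumps(
        [{"generated_text": "<|prompter|>Hi<|endoftext|><|assistant|>Hello there"}]
    ).encode("utf-8")
)
answer = extract_answer(fake_body)
print(answer)
```

Returning only the part after the marker keeps the echoed prompt out of the final answer.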

Now, let's download the example data and prepare it for demonstration. We will use [Amazon SageMaker FAQs](https://aws.amazon.com/sagemaker/faqs/) as the knowledge library. The data are formatted as a CSV file with two columns, `Question` and `Answer`. We use the `Answer` column as the documents of the knowledge library, from which relevant documents are retrieved based on a query.

**For your purpose, you can replace the example dataset with your own to build a custom question answering application.**

In [19]:
aws_region

'us-west-2'

In [20]:
original_data = "s3://jumpstart-cache-prod-us-east-2/training-datasets/Amazon_SageMaker_FAQs/"

!mkdir -p rag_data
!aws s3 cp --recursive $original_data rag_data

download: s3://jumpstart-cache-prod-us-east-2/training-datasets/Amazon_SageMaker_FAQs/Amazon_SageMaker_FAQs.csv to rag_data/Amazon_SageMaker_FAQs.csv


In case your data is saved in multiple subsets, the following code reads all files that end with `.csv` and concatenates them. Please ensure each `csv` file has the same format.

In [21]:
import glob
import os
import pandas as pd

all_files = glob.glob(os.path.join("rag_data/", "*.csv"))

df_knowledge = pd.concat(
    (pd.read_csv(f, header=None, names=["Question", "Answer"]) for f in all_files),
    axis=0,
    ignore_index=True,
)
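Before concatenating, it can help to check that every record really has two fields. The following sketch uses only the standard library; `records_with_wrong_width` is a hypothetical helper, and the temporary file exists purely to make the example self-contained.

```python
import csv
import os
import tempfile
from typing import List


def records_with_wrong_width(path: str, expected_columns: int = 2) -> List[int]:
    """Return the 1-based record numbers whose field count is unexpected."""
    bad = []
    with open(path, newline="", encoding="utf-8") as handle:
        for record_number, row in enumerate(csv.reader(handle), start=1):
            if len(row) != expected_columns:
                bad.append(record_number)
    return bad


# Write a tiny sample file: the second record is missing its answer.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "faq.csv")
    with open(sample_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["What is SageMaker?", "A managed ML service."])
        writer.writerow(["Only one field"])
    problems = records_with_wrong_width(sample_path)

print(problems)
```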

Drop the `Question` column as it is not used in this demonstration.

In [22]:
df_knowledge.drop(["Question"], axis=1, inplace=True)

In [23]:
df_knowledge.head(5)

|   | Answer |
|---|--------|
| 0 | Amazon SageMaker is a fully managed service to... |
| 1 | For a list of the supported Amazon SageMaker A... |
| 2 | Amazon SageMaker is designed for high availabi... |
| 3 | Amazon SageMaker stores code in ML storage vol... |
| 4 | Amazon SageMaker ensures that ML model artifac... |


In [24]:
df_knowledge.to_csv("rag_data/processed_data.csv", header=True, index=False)

In [25]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import Chroma, AtlasDB, FAISS
from langchain.text_splitter import CharacterTextSplitter
from langchain import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
from langchain.document_loaders.csv_loader import CSVLoader

Use LangChain to read the `csv` data. LangChain has multiple built-in loaders for different file formats such as `txt`, `html`, and `pdf`. For details, see [LangChain document loaders](https://python.langchain.com/en/latest/modules/indexes/document_loaders.html).

In [26]:
loader = CSVLoader(file_path="rag_data/processed_data.csv")

In [27]:
documents = loader.load()
# If you load a plain text file with langchain.document_loaders.TextLoader instead,
# uncomment the following lines to split the text:
# text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)
# texts = text_splitter.split_documents(documents)
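Judging from the documents printed later in this notebook, each CSV row becomes one document whose text is `Answer: ...` and whose metadata holds the source file and row number. The sketch below imitates that shape with the standard library only; it is an approximation for illustration, not LangChain's `CSVLoader` implementation, and `rows_to_documents` is a hypothetical name.

```python
import csv
import io
from typing import Any, Dict, List


def rows_to_documents(csv_text: str, source: str) -> List[Dict[str, Any]]:
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        # One "column: value" line per column of the row.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_index}}
        )
    return documents


sample_csv = (
    "Answer\n"
    "Managed Spot Training can be used with all instances supported in Amazon SageMaker.\n"
)
sample_docs = rows_to_documents(sample_csv, "rag_data/processed_data.csv")
print(sample_docs[0])
```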

**Now, we can build a QA application. <span style="color:red">LangChain makes it extremely simple with the following few lines of code</span>.**

Based on the question below, we can achieve the points in Step 4 with just a few lines of code, as shown below.

In [64]:
question = "What are all the sampling methods supported by SageMaker Feature Store?"

In [65]:
index_creator = VectorstoreIndexCreator(
    vectorstore_cls=FAISS,
    embedding=embeddings,
    text_splitter=CharacterTextSplitter(chunk_size=300, chunk_overlap=0),
)
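As a rough illustration of what `chunk_size=300` with `chunk_overlap=0` means, the sketch below splits a text on blank lines and greedily merges the pieces into chunks of at most 300 characters. It is a simplified stand-in and not the actual `CharacterTextSplitter` code; the blank-line separator and the `split_text` name are assumptions of this sketch.

```python
from typing import List


def split_text(text: str, chunk_size: int = 300, separator: str = "\n\n") -> List[str]:
    """Greedily merge separator-delimited pieces into chunks of limited size."""
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Still fits, or it is a single oversized piece that is kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Three paragraphs of 120 characters each, separated by blank lines.
text = "\n\n".join(["a" * 120, "b" * 120, "c" * 120])
chunks = split_text(text, chunk_size=300)
print([len(chunk) for chunk in chunks])
```

The first two paragraphs fit into one chunk, while adding the third would exceed 300 characters, so it starts a new chunk.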

In [66]:
index = index_creator.from_loaders([loader])

In [71]:
print(index.query(question=question, llm=sm_llm))

<|prompter|>Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Answer: There are no fixed limits to the size of the dataset you can use for training models with Amazon SageMaker.

Answer: Not at this time. The best model tuning performance and experience is within Amazon SageMaker.

Answer: You can find the Regions where Amazon SageMaker Studio is supported in the documentation here.

Answer: Managed Spot Training can be used with all instances supported in Amazon SageMaker.

Question: What are all the sampling methods supported by SageMaker Feature Store?
Helpful Answer:<|endoftext|><|assistant|>The sampling methods supported by SageMaker Feature Store include:

- Random Sampling



## Step 5. Customize the QA application above with different prompt.

We have now seen how simple it is to build a question answering application with LangChain in just a few lines of code. Let's break down the above `VectorstoreIndexCreator` and see what's happening under the hood. Furthermore, we will see how to incorporate a customized prompt rather than using the default prompt of `VectorstoreIndexCreator`.

Firstly, we **generate embeddings for each document in the knowledge library with the SageMaker GPT-J-6B embedding model.**

In [33]:
docsearch = FAISS.from_documents(documents, embeddings)

In [58]:
question = "Which sampling method does SageMaker Feature Store support?"

Based on the question above, we then **identify the top K most relevant documents based on the user query, where K = 10 in this setup**.

In [59]:
docs = docsearch.similarity_search(question, k=10)

Print out the retrieved documents as below.

In [45]:
docs

[Document(page_content='Answer: The difference between Savings Plans for Amazon SageMaker and Savings Plans for EC2 is in the services they include. SageMaker Savings Plans apply only to SageMaker ML Instance usage.', metadata={'source': 'rag_data/processed_data.csv', 'row': 152}),
 Document(page_content='Answer: There is no additional charge for using Amazon SageMaker Components for Kubeflow Pipelines.', metadata={'source': 'rag_data/processed_data.csv', 'row': 38}),
 Document(page_content='Answer: Data scientists can access Amazon SageMaker Inference Recommender from SageMaker Studio, AWS SDK for Python (Boto3), or AWS CLI. They can get deployment recommendations within SageMaker Studio in the SageMaker model registry for registered model versions. Data scientists can search and filter the recommendations through SageMaker Studio, AWS SDK, or AWS CLI.', metadata={'source': 'rag_data/processed_data.csv', 'row': 120}),
 Document(page_content='Answer: The components available through Am

Finally, we **combine the retrieved documents with prompt and question and send them into SageMaker LLM.** 

We define a customized prompt as below.

In [53]:
prompt_template = """Answer based on context:\n\n{context}\n\n{question}"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

In [54]:
PROMPT

PromptTemplate(input_variables=['context', 'question'], output_parser=None, partial_variables={}, template='Answer based on context:\n\n{context}\n\n{question}', template_format='f-string', validate_template=True)

In [55]:
chain = load_qa_chain(llm=sm_llm, prompt=PROMPT)
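Judging from the result printed further down, the chain joins the retrieved documents with blank lines and places them in the `{context}` slot of the prompt. That behaviour can be sketched without LangChain as follows; `stuff_prompt` is a hypothetical helper, the blank-line separator is inferred from the output, and the two documents are shortened examples.

```python
from typing import List


def stuff_prompt(
    template: str, documents: List[str], question: str, separator: str = "\n\n"
) -> str:
    """Join all documents into one context string and fill the template."""
    context = separator.join(documents)
    return template.format(context=context, question=question)


template = "Answer based on context:\n\n{context}\n\n{question}"
retrieved = [
    "Answer: There is no additional charge for using Amazon SageMaker Components for Kubeflow Pipelines.",
    "Answer: Managed Spot Training can be used with all instances supported in Amazon SageMaker.",
]
question = "Which sampling method does SageMaker Feature Store support?"

prompt_text = stuff_prompt(template, retrieved, question)
print(prompt_text)
```

Because every retrieved document is pasted into one prompt, a large K can push the prompt beyond the model's maximum sequence length.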

Send the retrieved documents and the question to the LLM to get an answer.

In [56]:
result = chain({"input_documents": docs, "question": question}, return_only_outputs=True)[
    "output_text"
]

Print the final answer from LLM as below, which is accurate.

In [57]:
print(result)

<|prompter|>Answer based on context:

Answer: The difference between Savings Plans for Amazon SageMaker and Savings Plans for EC2 is in the services they include. SageMaker Savings Plans apply only to SageMaker ML Instance usage.

Answer: There is no additional charge for using Amazon SageMaker Components for Kubeflow Pipelines.

Answer: Data scientists can access Amazon SageMaker Inference Recommender from SageMaker Studio, AWS SDK for Python (Boto3), or AWS CLI. They can get deployment recommendations within SageMaker Studio in the SageMaker model registry for registered model versions. Data scientists can search and filter the recommendations through SageMaker Studio, AWS SDK, or AWS CLI.

Answer: The components available through Amazon SageMaker Studio, including Amazon SageMaker Amazon Clarify, Amazon SageMaker Data Wrangler, Amazon SageMaker Feature Store, Amazon SageMaker Experiments, Amazon SageMaker Debugger, and Amazon SageMaker Model Monitor, can be added to SageMaker Pipeli

## Moving to deployment

In short, RAG solutions require:
1. **Embedding Model**: We need a way to convert text to vector embeddings. In this tutorial, we used GPT-J-6B as the embedding model. JumpStart supports other packaged embedding models including BLOOM 7B1, BERT and RoBERTa.
2. **Vector Store**: We need to have a data store to index embeddings for fast queries. In this tutorial, we used a local-machine based [FAISS](https://github.com/facebookresearch/faiss) deployment. To scale this up, you can consider OpenSearch. Check out OpenSearch's K-NN [documentation](https://opensearch.org/docs/latest/search-plugins/knn/approximate-knn/) and also LangChain's integration with [OpenSearch](https://python.langchain.com/en/latest/ecosystem/opensearch.html).
3. **Large Language Model**: We need an LLM to run question answering. In this case, we just point LangChain to a SageMaker endpoint with the `endpoint_name` configuration.
4. **Running LangChain**: Lastly, we need to run LangChain somewhere. In this case, we are running LangChain in a Jupyter notebook. Moving beyond experimentation, you can deploy LangChain on Lambda and trigger the chain when a request comes in. Check out this [GitHub repo](https://github.com/3coins/langchain-aws-template) for a sample deployment with Lambda + API Gateway for LangChain, complete with CDK code.
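To give an idea of the Lambda option, here is a minimal handler sketch for an API Gateway proxy request. It is not taken from the linked repository: `make_handler` and `answer_question` are hypothetical names, and in a real deployment `answer_question` would run the retrieval and the chain shown in this notebook.

```python
import json
from typing import Any, Callable, Dict


def make_handler(answer_question: Callable[[str], str]):
    """Build a Lambda handler around a function that answers a question."""

    def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        try:
            body = json.loads(event.get("body") or "{}")
        except json.JSONDecodeError:
            body = None
        question = body.get("question") if isinstance(body, dict) else None
        if not question:
            return {
                "statusCode": 400,
                "body": json.dumps({"error": "expected a JSON body with a 'question' field"}),
            }
        return {
            "statusCode": 200,
            "body": json.dumps({"answer": answer_question(question)}),
        }

    return lambda_handler


# Stub standing in for the real RAG pipeline.
handler = make_handler(lambda q: f"echo: {q}")
ok_response = handler({"body": json.dumps({"question": "Hi"})}, None)
bad_response = handler({"body": "not json"}, None)
print(ok_response["statusCode"], bad_response["statusCode"])
```

Passing the answering function in as an argument keeps the handler testable without any deployed endpoint.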

Also, do try out [Amazon Kendra](https://aws.amazon.com/kendra/), which effectively covers the functionality of **1** and **2**. Kendra also comes with connectors that allow users to easily ingest documents from enterprise SaaS (e.g. Confluence, SharePoint, Microsoft Exchange, Google Drive) and Amazon S3.