# Retrieval-Augmented Generation: Question Answering based on Custom Dataset with Open-sourced [LangChain](https://python.langchain.com/en/latest/index.html) Library


!!!This notebook should be run using the Data Science 2.0 Python kernel!!!

In this notebook we demonstrate how to use a **Falcon 7B Instruct BF16** model to answer questions using a library of documents as a reference, by using document embeddings and retrieval. The embeddings are generated by the **GPT-J-6B-FP16** embedding model.

**This notebook serves as a template, so you can easily replace the example dataset with your own to build a custom question answering application.**

## Step 1. Deploy large language model (LLM) and embedding model in SageMaker JumpStart


In [2]:
!pip install --upgrade sagemaker --quiet
!pip install ipywidgets==7.0.0 --quiet
!pip install langchain==0.0.148 --quiet
!pip install faiss-cpu --quiet


In [3]:
import time
import sagemaker, boto3, json
from sagemaker.session import Session
from sagemaker.model import Model
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base
from typing import Any, Dict, List, Optional
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.llms.sagemaker_endpoint import ContentHandlerBase

sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()
model_version = "*"



Deploy SageMaker endpoints for the large language model and the GPT-J 6B embedding model. If you want to deploy multiple LLMs to compare their performance, uncomment the corresponding entries below.

More information [here](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.jumpstart.model.JumpStartModel)

In [4]:
llm_endpoint_name = "falcon-7b-instruct-bf16-endpoint"
embeddings_endpoint_name = "gpt-j-6b-endpoint"

In [5]:
%%time
from sagemaker.jumpstart.model import JumpStartModel

model_llm = JumpStartModel(
    model_id="huggingface-llm-falcon-7b-instruct-bf16",
    instance_type="ml.g5.2xlarge"
)
predictor_llm = model_llm.deploy(
    endpoint_name=llm_endpoint_name
)

-----------------!
CPU times: user 275 ms, sys: 33.2 ms, total: 308 ms
Wall time: 9min 4s


In [6]:
%%time

model_embedding = JumpStartModel(
    model_id="huggingface-textembedding-gpt-j-6b-fp16",
    instance_type="ml.p3.2xlarge"
)
predictor_embedding = model_embedding.deploy(
    endpoint_name=embeddings_endpoint_name
)

-------------!
CPU times: user 118 ms, sys: 18.6 ms, total: 137 ms
Wall time: 7min 3s


## Step 2. Ask a question to LLM without providing the context

To illustrate why we need a retrieval-augmented generation (RAG) based approach to solve the question answering problem, let's directly ask the model a question and see how it responds.

Using the SageMaker Python SDK

In [36]:
prompt = "Which instances can I use with Managed Spot Training in SageMaker?"

payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.01,
        "max_new_tokens": 1024,
        "stop": ["<|endoftext|>", "</s>"]
    }
}

response = predictor_llm.predict(payload)
print(response[0]["generated_text"])


Managed Spot Training in SageMaker can be used for a variety of instances, including:

1. Training a model on a large dataset
2. Training a model on a small dataset
3. Training a model on a dataset with a large number of classes
4. Training a model on a dataset with a small number of classes
5. Training a model on a dataset with a large number of classes
6. Training a model on a dataset with a small number of classes
7. Training a model on a dataset with a large number of classes
8. Training a model on a dataset with a small number of classes
9. Training a model on a dataset with a large number of classes
10. Training a model on a dataset with a small number of classes
11. Training a model on a dataset with a large number of classes
12. Training a model on a dataset with a small number of classes
13. Training a model on a dataset with a large number of classes
14. Training a model on a dataset with a small number of classes
15. Training a model on a dataset with a large number of clas

Using the Boto3 library

In [8]:
prompt = "Which instances can I use with Managed Spot Training in SageMaker?"
# prompt = PROMPT.format(context=docs, question=question)

payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.01,
        "max_new_tokens": 1024,
        "early_stopping": True,
        "stop": ["<|endoftext|>", "</s>"]
    }
}

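def query_endpoint_with_json_payload(
    encoded_json,
    endpoint_name,
    content_type="application/json"
):
    """Invoke a SageMaker endpoint with a JSON payload and return the generated text."""
    # "sagemaker-runtime" is the Boto3 service name for SageMaker Runtime.
    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,
        Body=encoded_json
    )
    # The response body is a stream: read it, decode it, and parse the JSON list.
    model_output = json.loads(response["Body"].read().decode("utf-8"))
    generated_text = model_output[0]["generated_text"]

    return generated_text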


query_response = query_endpoint_with_json_payload(
    encoded_json=json.dumps(payload).encode("utf-8"), 
    endpoint_name=llm_endpoint_name
    )

print(prompt)
print(query_response)


Which instances can I use with Managed Spot Training in SageMaker?

Managed Spot Training in SageMaker can be used for a variety of instances, including:

1. Training a model on a large dataset
2. Training a model on a small dataset
3. Training a model on a dataset with a large number of classes
4. Training a model on a dataset with a small number of classes
5. Training a model on a dataset with a large number of classes
6. Training a model on a dataset with a small number of classes
7. Training a model on a dataset with a large number of classes
8. Training a model on a dataset with a small number of classes
9. Training a model on a dataset with a large number of classes
10. Training a model on a dataset with a small number of classes
11. Training a model on a dataset with a large number of classes
12. Training a model on a dataset with a small number of classes
13. Training a model on a dataset with a large number of classes
14. Training a model on a dataset with a small number of cl

You can see the generated answer is wrong or doesn't make much sense. 

## Step 3. Improve the answer to the same question using **prompt engineering** with insightful context


To answer the question well, we provide extra contextual information, combine it with a prompt, and send it to the model together with the question. Below is an example.

In [9]:
prompt = """Answer QUESTION based on CONTEXT:\n\nCONTEXT:{context}\n\nQUESTION:{question}\n\nANSWER:"""
question = "Which instances can I use with Managed Spot Training in SageMaker?"
context = """Managed Spot Training can be used with all instances supported in Amazon SageMaker. Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available."""

In [10]:
text_input = prompt.replace("{context}", context)
text_input = text_input.replace("{question}", question)

print(text_input)

payload = {
    "inputs": text_input,
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.01,
        "max_new_tokens": 1024,
        "stop": ["<|endoftext|>", "</s>"]
    }
}

response = query_endpoint_with_json_payload(
    encoded_json=json.dumps(payload).encode("utf-8"), 
    endpoint_name=llm_endpoint_name
    )


print(response)


Answer QUESTION based on CONTEXT:

CONTEXT:Managed Spot Training can be used with all instances supported in Amazon SageMaker. Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available.

QUESTION:Which instances can I use with Managed Spot Training in SageMaker?

ANSWER:


You can use Managed Spot Training with all instances supported in Amazon SageMaker.


The output from Step 3 shows that the chance of getting a correct response depends strongly on the context you send to the LLM.

**<span style="color:red">Now, the question becomes: where can we find relevant context for a given user query? The answer is to use a pre-stored knowledge base with retrieval-augmented generation, as shown in Step 4 below</span>.**

## Step 4. Use RAG based approach with [LangChain](https://python.langchain.com/en/latest/index.html) and SageMaker endpoints to build a simplified question and answering application.


We plan to use document embeddings to fetch the most relevant documents in our document knowledge library and combine them with the prompt that we provide to the LLM.

To achieve that, we will do the following:

1. **Generate embeddings for each document in the knowledge library with the SageMaker GPT-J-6B FP16 embedding model.**
2. **Identify the top K most relevant documents based on the user query.**
    - 2.1 **For a query of your interest, generate the embedding of the query using the same embedding model.**
    - 2.2 **Search for the indexes of the top K most relevant documents in the embedding space using in-memory Faiss search.**
    - 2.3 **Use the indexes to retrieve the corresponding documents.**
3. **Combine the retrieved documents with the prompt and question and send them to the SageMaker LLM.**
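
The three steps above can be sketched end to end without any endpoint. This is only an illustration: `toy_embed` is a hypothetical bag-of-words stand-in for the GPT-J embedding model, cosine similarity replaces the Faiss search, and the three library sentences are made up for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def toy_embed(text: str) -> Dict[str, float]:
    """Hypothetical stand-in for the embedding endpoint: word counts as a sparse vector."""
    cleaned = text.lower().replace("?", " ").replace(".", " ")
    return dict(Counter(cleaned.split()))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, library: List[str], k: int) -> List[str]:
    # Step 1: embed every document in the knowledge library.
    doc_vectors = [toy_embed(doc) for doc in library]
    # Step 2.1: embed the query with the same embedding function.
    query_vector = toy_embed(query)
    # Step 2.2: rank document indexes by similarity to the query.
    ranked = sorted(
        range(len(library)),
        key=lambda i: cosine(query_vector, doc_vectors[i]),
        reverse=True,
    )
    # Step 2.3: use the top K indexes to fetch the documents.
    return [library[i] for i in ranked[:k]]


def build_prompt(query: str, docs: List[str]) -> str:
    # Step 3: combine the retrieved documents with the prompt and the question.
    context = "\n\n".join(docs)
    return f"Answer QUESTION based on CONTEXT:\n\nCONTEXT:{context}\n\nQUESTION:{query}\n\nANSWER:"


library = [
    "Managed Spot Training can be used with all instances supported in Amazon SageMaker.",
    "Amazon SageMaker is a fully managed service to build, train, and deploy ML models.",
    "Delete endpoints you no longer use because they incur cost per hour.",
]
user_query = "Which instances can I use with Managed Spot Training in SageMaker?"
top_docs = retrieve(user_query, library, k=1)
rag_prompt = build_prompt(user_query, top_docs)
print(rag_prompt)
```

In the notebook itself, the embedding call goes to the SageMaker endpoint and the ranking is done by Faiss; only the control flow is the same.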



Note: The retrieved document/text should be large enough to contain enough information to answer a question, but small enough to fit into the LLM prompt, which has a maximum sequence length of 1024 tokens.
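
One way to respect such a limit is to keep retrieved documents, in ranked order, only while they fit a budget. The sketch below is an assumption-laden illustration: `pack_context` is a hypothetical helper, and it counts whitespace-separated words as a crude proxy for tokens, which a real tokenizer would count differently.

```python
from typing import List


def pack_context(docs: List[str], max_words: int) -> List[str]:
    """Keep documents in ranked order while the running word count fits the budget."""
    kept: List[str] = []
    used = 0
    for doc in docs:
        n_words = len(doc.split())
        if used + n_words > max_words:
            # Stop at the first document that would overflow the budget.
            break
        kept.append(doc)
        used += n_words
    return kept


ranked_docs = ["one two three", "four five", "six seven eight nine"]
packed = pack_context(ranked_docs, max_words=5)
print(packed)
```

Stopping at the first overflow, rather than skipping ahead to a shorter document, keeps the most relevant documents together; that is a design choice of this sketch, not something the notebook prescribes.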

---
To build a simplified QA application with LangChain, we need to:
1. Wrap our SageMaker endpoints for the embedding model and the LLM in `langchain.embeddings.SagemakerEndpointEmbeddings` and `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`. This requires a small override of the `SagemakerEndpointEmbeddings` class to make it compatible with the SageMaker embedding model.
2. Prepare the dataset to build the knowledge base.

---

First, wrap the SageMaker endpoint for the embedding model in `langchain.embeddings.SagemakerEndpointEmbeddings`. This requires a small override of the `SagemakerEndpointEmbeddings` class to make it compatible with the SageMaker embedding model.

In [11]:
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
from langchain.llms.sagemaker_endpoint import LLMContentHandler


class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):
    def embed_documents(self, texts: List[str], chunk_size: int = 5) -> List[List[float]]:
        """Compute doc embeddings using a SageMaker Inference Endpoint.

        Args:
            texts: The list of texts to embed.
            chunk_size: The chunk size defines how many input texts will
                be grouped together as one request.

        Returns:
            List of embeddings, one for each text.
        """
        results = []
        if not texts:
            # Nothing to embed; this also avoids a zero step in range() below.
            return results
        # Never request more texts per call than there are texts to embed.
        _chunk_size = min(chunk_size, len(texts))

        for i in range(0, len(texts), _chunk_size):
            # Each call sends one batch of texts to the endpoint.
            response = self._embedding_func(texts[i : i + _chunk_size])
            results.extend(response)
        return results


class ContentHandler(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"

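    def transform_input(self, prompt: List[str], model_kwargs: Dict = {}) -> bytes:
        # The embedding endpoint receives the batch of texts under "text_inputs".
        input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: Any) -> List[List[float]]:
        # The response body is a stream; the vectors are stored under "embedding".
        response_json = json.loads(output.read().decode("utf-8"))
        embeddings = response_json["embedding"]
        return embeddings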


content_handler = ContentHandler()

embeddings = SagemakerEndpointEmbeddingsJumpStart(
    endpoint_name=embeddings_endpoint_name,
    region_name=aws_region,
    content_handler=content_handler,
)
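
To make the request and response shapes concrete, here is a minimal offline sketch of the same JSON round trip. The key names `text_inputs` and `embedding` come from the handler above; `encode_request`, `decode_response`, and the response values are hypothetical and only stand in for a real endpoint call.

```python
import io
import json
from typing import Dict, List


def encode_request(texts: List[str], model_kwargs: Dict) -> bytes:
    """Serialize a batch of texts the same way the handler's transform_input does."""
    return json.dumps({"text_inputs": texts, **model_kwargs}).encode("utf-8")


def decode_response(body) -> List[List[float]]:
    """Read a file-like response body and pull out the list of vectors."""
    return json.loads(body.read().decode("utf-8"))["embedding"]


request_bytes = encode_request(["first text", "second text"], {})

# A made-up response: one two-dimensional vector per input text.
fake_body = io.BytesIO(
    json.dumps({"embedding": [[0.1, 0.2], [0.3, 0.4]]}).encode("utf-8")
)
vectors = decode_response(fake_body)

print(json.loads(request_bytes))
print(vectors)
```

`io.BytesIO` is used here because, like the endpoint's response body, it exposes a `read()` method.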

Next, we wrap our SageMaker LLM endpoint in `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`.

In [12]:
from langchain.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint


parameters = {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.01,
    "max_new_tokens": 1024,
    "early_stopping": True,
    "stop": ["<|endoftext|>", "</s>"]
}


class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"
    
    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        # Keep only the first 1023 characters of the prompt (characters, not tokens). See
        # https://github.com/aws-samples/amazon-kendra-langchain-extensions/blob/main/kendra_retriever_samples/kendra_chat_falcon_40b.py
        prompt = prompt[:1023]
        input_str = json.dumps({"inputs": prompt, "parameters": model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json[0]["generated_text"]


content_handler = ContentHandler()

sm_llm = SagemakerEndpoint(
    endpoint_name=llm_endpoint_name,
    region_name=aws_region,
    model_kwargs=parameters,
    content_handler=content_handler,
)

print(sm_llm)

SagemakerEndpoint
Params: {'endpoint_name': 'falcon-7b-instruct-bf16-endpoint', 'model_kwargs': {'do_sample': True, 'top_p': 0.9, 'temperature': 0.01, 'max_new_tokens': 1024, 'early_stopping': True, 'stop': ['<|endoftext|>', '</s>']}}
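
Because the handler above slices the prompt with `prompt[:1023]`, it is worth checking what survives the cut. The sketch below uses the prompt template from Step 3 with made-up context strings; `question_survives` is a hypothetical helper. Since the template places the question after the context, a long context can push the question past the slice.

```python
TEMPLATE = "Answer QUESTION based on CONTEXT:\n\nCONTEXT:{context}\n\nQUESTION:{question}\n\nANSWER:"
MAX_CHARS = 1023  # same slice length as in the handler


def truncate(prompt: str, max_chars: int = MAX_CHARS) -> str:
    """Keep only the first max_chars characters of the prompt."""
    return prompt[:max_chars]


def question_survives(prompt: str, question: str) -> bool:
    """Check whether the full question is still present after truncation."""
    return f"QUESTION:{question}" in truncate(prompt)


question_text = "Which instances can I use with Managed Spot Training in SageMaker?"

# Made-up contexts: one short, one long enough to exceed the slice.
short_prompt = TEMPLATE.format(context="x" * 200, question=question_text)
long_prompt = TEMPLATE.format(context="x" * 2000, question=question_text)

print(len(short_prompt), question_survives(short_prompt, question_text))
print(len(long_prompt), question_survives(long_prompt, question_text))
```

If the question is lost, one option is to shorten the context before formatting the prompt instead of cutting the formatted prompt at the end.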


Now, let's download the example data and prepare it for demonstration. We will use [Amazon SageMaker FAQs](https://aws.amazon.com/sagemaker/faqs/) as the knowledge library. The data are formatted as a CSV file with two columns, Question and Answer. We use the Answer column as the documents of the knowledge library, from which relevant documents are retrieved based on a query.

**For your purpose, you can replace the example dataset with your own to build a custom question answering application.**

In [13]:
original_data = "s3://jumpstart-cache-prod-us-east-2/training-datasets/Amazon_SageMaker_FAQs/"

!mkdir -p rag_data
!aws s3 cp --recursive $original_data rag_data

download: s3://jumpstart-cache-prod-us-east-2/training-datasets/Amazon_SageMaker_FAQs/Amazon_SageMaker_FAQs.csv to rag_data/Amazon_SageMaker_FAQs.csv


If your data is saved in multiple subsets, the following code reads all files that end with `.csv` and concatenates them. Please ensure each `csv` file has the same format.

In [14]:
import glob
import os
import pandas as pd

all_files = glob.glob(os.path.join("rag_data/", "*.csv"))

df_knowledge = pd.concat(
    (pd.read_csv(f, header=None, names=["Question", "Answer"]) for f in all_files),
    axis=0,
    ignore_index=True,
)

Drop the `Question` column as it is not used in this demonstration.

In [15]:
df_knowledge.drop(["Question"], axis=1, inplace=True)

In [16]:
df_knowledge.head(5)

|   | Answer |
|---|--------|
| 0 | Amazon SageMaker is a fully managed service to... |
| 1 | For a list of the supported Amazon SageMaker A... |
| 2 | Amazon SageMaker is designed for high availabi... |
| 3 | Amazon SageMaker stores code in ML storage vol... |
| 4 | Amazon SageMaker ensures that ML model artifac... |


In [17]:
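# Note: the file is written without a header row. A loader that treats the first
# row as column names will therefore use the first answer as the field name.
df_knowledge.to_csv("rag_data/processed_data.csv", header=False, index=False)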

In [18]:
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import Chroma, AtlasDB, FAISS
from langchain.text_splitter import CharacterTextSplitter
from langchain import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
from langchain.document_loaders.csv_loader import CSVLoader

Use LangChain to read the `csv` data. LangChain has multiple built-in loaders for different file formats such as `txt`, `html`, and `pdf`. For details, see [LangChain document loaders](https://python.langchain.com/en/latest/modules/indexes/document_loaders.html).

In [19]:
loader = CSVLoader(
    file_path="rag_data/processed_data.csv",
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
    },  
)
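
The processed file has no header row, which matters for header-based CSV readers. The sketch below uses only the standard library and assumes the loader reads rows with `csv.DictReader` and renders each field as `column: value`; that assumption matches the `page_content` strings printed later in this notebook, where every document starts with the text of the first answer. `load_rows` and the sample rows are hypothetical.

```python
import csv
import io
from typing import List


def load_rows(csv_text: str) -> List[str]:
    """Render each CSV row as 'column: value' lines, using the first row as header."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=",", quotechar='"')
    return ["\n".join(f"{key}: {value}" for key, value in row.items()) for row in reader]


# Made-up rows standing in for the Answer column.
no_header = '"First answer."\n"Second answer."\n"Third answer."\n'
with_header = "Answer\n" + no_header

docs_no_header = load_rows(no_header)
docs_with_header = load_rows(with_header)

print(docs_no_header)
print(docs_with_header)
```

Without a header, the first row is consumed as the column name and never becomes a document of its own. Writing the file with a header row would avoid that.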

In [20]:
documents = loader.load()
# If you load a plain text file with langchain.document_loaders.TextLoader instead,
# uncomment the following lines to split the text into chunks.
# text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)
# texts = text_splitter.split_documents(documents)
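
To give an intuition for what a character-based splitter with `chunk_size` and no overlap does, here is a simplified sketch. It is not LangChain's implementation: `split_text` is a hypothetical function that splits on a separator and greedily merges neighbouring pieces while the merged chunk stays within `chunk_size` characters.

```python
from typing import List


def split_text(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Split on separator, then merge adjacent pieces up to chunk_size characters."""
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the merged chunk still fits, or this is the first piece of a
            # new chunk; an oversized single piece is kept whole rather than cut.
            current = candidate
        else:
            # The merged chunk would be too long: close the current one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc\n\ndddd"
chunks = split_text(sample, chunk_size=10)
print(chunks)
```

Smaller chunks give more precise retrieval but carry less context per document, so the chunk size is a trade-off against the prompt length limit.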

**Now, we can build a QA application. <span style="color:red">LangChain makes it extremely simple with the following few lines of code</span>.**

Based on the question below, we can achieve the points in Step 4 with just a few lines of code, as shown below.

In [21]:
question

'Which instances can I use with Managed Spot Training in SageMaker?'

In [22]:
index_creator = VectorstoreIndexCreator(
    vectorstore_cls=FAISS,
    embedding=embeddings,
    text_splitter=CharacterTextSplitter(chunk_size=300, chunk_overlap=0),
)

In [23]:
index = index_creator.from_loaders([loader])

In [37]:
index.query(question=question, llm=sm_llm)

'\n<p>Managed Spot Training in SageMaker is currently available for the following instance types:</p>\n\n<ul>\n<li>m5.xlarge</li>\n<li>c5.18xlarge</li>\n<li>x1e12.2xlarge</li>\n<li>x1e12.18xlarge</li>\n<li>x1e12.24xlarge</li>\n<li>x1e12.36xlarge</li>\n<li>x1e12.48xlarge</li>\n<li>x1e12.54xlarge</li>\n<li>x1e12.72xlarge</li>\n<li>x1e12.96xlarge</li>\n<li>x1e12.128xlarge</li>\n<li>x1e12.256xlarge</li>\n<li>x1e12.512xlarge</li>\n<li>x1e12.768xlarge</li>\n<li>x1e12.864xlarge</li>\n<li>x1e12.912xlarge</li>\n<li>x1e12.924xlarge</li>\n<li>x1e12.928xlarge</li>\n<li>x1e12.932xlarge</li>\n<li>x1e12.936xlarge</li>\n<li>x1e12.952xlarge</li>\n<li>x1e12.960xlarge</li>\n<li>x1e12.964xlarge</li>\n<li>x1e12.968xlarge</li>\n<li>x1e12.980xlarge</li>\n<li>x1e12.992xlarge</li>\n<li>x1e12.994xlarge</li>\n<li>x1e12.996xlarge</li>\n<li>x1e12.998xlarge</li>\n<li>x1e12.999xlarge</li>\n<li>x1e12.9999xlarge</li>\n<li>x1e12.9999xlarge</li>\n<li>x1e12.9999xlarge</li>\n<li>x1e12.9999xlarge</li>\n<li>x1e12.9999xlarge

## Step 5. Customize the QA application above with a different prompt.

We have seen how simple it is to build a question answering application with LangChain in just a few lines of code. Let's break down the above `VectorstoreIndexCreator` and see what's happening under the hood. Furthermore, we will see how to incorporate a customized prompt rather than using the default prompt of `VectorstoreIndexCreator`.

First, we **generate embeddings for each document in the knowledge library with the SageMaker GPT-J-6B FP16 embedding model.**

In [25]:
%%time
docsearch = FAISS.from_documents(documents, embeddings)

CPU times: user 367 ms, sys: 44.3 ms, total: 412 ms
Wall time: 7.36 s


In [26]:
question

'Which instances can I use with Managed Spot Training in SageMaker?'

Based on the question above, we then **identify the top K most relevant documents based on the user query, where K = 3 in this setup**.

In [27]:
docs = docsearch.similarity_search(question, k=3)
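
Conceptually, a top-K similarity search compares the query vector with every stored vector and keeps the K closest ones. The sketch below assumes exact (brute-force) search with squared L2 distance; the index type that the vector store actually uses is not shown in this notebook, and the vectors are made up.

```python
import heapq
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(
    query: Sequence[float], vectors: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k vectors closest to the query."""
    scored = ((i, squared_l2(query, v)) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Made-up two-dimensional "embeddings".
stored_vectors = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0]]
hits = top_k([1.0, 1.0], stored_vectors, k=3)
hit_ids = [index for index, _ in hits]
print(hit_ids)
```

The returned indexes play the same role as in Step 4: they are used to look up the corresponding documents.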

Print out the top 3 most relevant documents as below.

In [28]:
docs

[Document(page_content='Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.: Once a Managed Spot Training job is completed, you can see the savings in the AWS Management Console and also calculate the cost savings as the percentage difference between the duration for which the training job ran and the duration for which you were billed. Regardless of how many times your Managed Spot Training jobs are interrupted, you are charged only once for the duration for which the data was downloaded.', metadata={'source': 'rag_data/processed_data.csv', 'row': 88}),
 Document(page_content='Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.: Managed Spot Training uses Amazon EC2 Spot instances for training, and these insta

Finally, we **combine the retrieved documents with prompt and question and send them into SageMaker LLM.** 

We define a customized prompt as below.

In [29]:
# prompt_template = """Answer based on context:\n\n{context}\n\n{question}"""
prompt_template = """Answer QUESTION based on CONTEXT:\n\nCONTEXT:{context}\n\nQUESTION:{question}\n\nANSWER:"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

In [30]:
PROMPT.format(context=docs, question=question)

"Answer QUESTION based on CONTEXT:\n\nCONTEXT:[Document(page_content='Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.: Once a Managed Spot Training job is completed, you can see the savings in the AWS Management Console and also calculate the cost savings as the percentage difference between the duration for which the training job ran and the duration for which you were billed. Regardless of how many times your Managed Spot Training jobs are interrupted, you are charged only once for the duration for which the data was downloaded.', metadata={'source': 'rag_data/processed_data.csv', 'row': 88}), Document(page_content='Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.: Managed Spot Training uses Amazon EC2

In [31]:
chain = load_qa_chain(
    llm=sm_llm, 
    prompt=PROMPT, 
    verbose=True, 
    chain_type="stuff"
)
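
The verbose chain output in this notebook shows the retrieved documents joined by blank lines inside the prompt, whereas `PROMPT.format(context=docs, ...)` embeds the printed form of the whole list, including metadata. The sketch below contrasts the two. `Doc` is a hypothetical stand-in for a LangChain document and `stuff_prompt` is an illustrative helper, not the chain's actual code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document with text and metadata."""
    page_content: str
    metadata: Dict = field(default_factory=dict)


TEMPLATE = "Answer QUESTION based on CONTEXT:\n\nCONTEXT:{context}\n\nQUESTION:{question}\n\nANSWER:"


def stuff_prompt(docs: List[Doc], question: str) -> str:
    """Join only the document texts with blank lines and fill the template."""
    context = "\n\n".join(doc.page_content for doc in docs)
    return TEMPLATE.format(context=context, question=question)


docs_demo = [Doc("First passage.", {"row": 1}), Doc("Second passage.", {"row": 2})]

# Text-only context, one passage per paragraph.
stuffed = stuff_prompt(docs_demo, "What is stated?")
# Formatting the list directly inserts its repr, metadata included.
raw = TEMPLATE.format(context=docs_demo, question="What is stated?")

print(stuffed)
print(raw)
```

The text-only form is shorter, which leaves more room for the question under a fixed prompt length.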

In [32]:
chain.prep_inputs

<bound method Chain.prep_inputs of StuffDocumentsChain(memory=None, callback_manager=<langchain.callbacks.shared.SharedCallbackManager object at 0x7f5fc02befd0>, verbose=True, input_key='input_documents', output_key='output_text', llm_chain=LLMChain(memory=None, callback_manager=<langchain.callbacks.shared.SharedCallbackManager object at 0x7f5fc02befd0>, verbose=True, prompt=PromptTemplate(input_variables=['context', 'question'], output_parser=None, partial_variables={}, template='Answer QUESTION based on CONTEXT:\n\nCONTEXT:{context}\n\nQUESTION:{question}\n\nANSWER:', template_format='f-string', validate_template=True), llm=SagemakerEndpoint(cache=None, verbose=False, callback_manager=<langchain.callbacks.shared.SharedCallbackManager object at 0x7f5fc02befd0>, client=<botocore.client.SageMakerRuntime object at 0x7f5f8a5fe220>, endpoint_name='falcon-7b-instruct-bf16-endpoint', region_name='us-east-1', credentials_profile_name=None, content_handler=<__main__.ContentHandler object at 0x

Send the top 3 most relevant documents and the question to the LLM to get an answer.

In [33]:
result = chain(
    {"input_documents": docs, "question": question}, 
    return_only_outputs=True)["output_text"]



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Answer QUESTION based on CONTEXT:

CONTEXT:Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.: Once a Managed Spot Training job is completed, you can see the savings in the AWS Management Console and also calculate the cost savings as the percentage difference between the duration for which the training job ran and the duration for which you were billed. Regardless of how many times your Managed Spot Training jobs are interrupted, you are charged only once for the duration for which the data was downloaded.

Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.: Managed Spot Trai

Print the final answer from the LLM as below and compare it with the retrieved context.

In [34]:
result

'\n\nYou can use the following instances with Managed Spot Training in SageMaker:\n\n1. Amazon EC2 Spot instances\n2. Amazon EC2 Reserved instances\n3. Amazon EC2 On-Demand instances\n4. Amazon EC2 Auto Scaling instances\n5. Amazon EC2 Batch instances\n6. Amazon EC2 Spot instances with a minimum bid price of $0.50 per hour\n7. Amazon EC2 Spot instances with a minimum bid price of $0.25 per hour\n\nNote:\n\n1. Managed Spot Training is only available for Amazon SageMaker training jobs.\n2. Managed Spot Training is only available for Amazon SageMaker training jobs that are using Amazon EC2 Spot instances.\n3. Managed Spot Training is only available for Amazon SageMaker training jobs that are using Amazon EC2 Spot instances with a minimum bid price of $0.50 per hour.\n4. Managed Spot Training is only available for Amazon SageMaker training jobs that are using Amazon EC2 Spot instances with a minimum bid price of $0.25 per hour.\n\nFor more information on Managed Spot Training, please refer

In [35]:
prompt = PROMPT.format(context=docs, question=question)

payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.01,
        "max_new_tokens": 1024,
        "early_stopping": True,
        "stop": ["<|endoftext|>", "</s>"]
    }
}

query_response = query_endpoint_with_json_payload(
    encoded_json=json.dumps(payload).encode("utf-8"), 
    endpoint_name=llm_endpoint_name
    )

print(prompt)
print(query_response)


Answer QUESTION based on CONTEXT:

CONTEXT:[Document(page_content='Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.: Once a Managed Spot Training job is completed, you can see the savings in the AWS Management Console and also calculate the cost savings as the percentage difference between the duration for which the training job ran and the duration for which you were billed. Regardless of how many times your Managed Spot Training jobs are interrupted, you are charged only once for the duration for which the data was downloaded.', metadata={'source': 'rag_data/processed_data.csv', 'row': 88}), Document(page_content='Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.: Managed Spot Training uses Amazon EC2 Sp

---
## Clean up the environment
Delete the endpoints if you no longer use them, because they incur cost per hour!

In [38]:
# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name=aws_region)

# Delete endpoint
print('Deleting endpoint:', llm_endpoint_name)
sagemaker_client.delete_endpoint(EndpointName=llm_endpoint_name)
sagemaker_client.delete_endpoint_config(EndpointConfigName=llm_endpoint_name)

# Delete endpoint
print('Deleting endpoint:', embeddings_endpoint_name)
sagemaker_client.delete_endpoint(EndpointName=embeddings_endpoint_name)
sagemaker_client.delete_endpoint_config(EndpointConfigName=embeddings_endpoint_name)


Deleting endpoint: falcon-7b-instruct-bf16-endpoint
Deleting endpoint: gpt-j-6b-endpoint


{'ResponseMetadata': {'RequestId': 'fcceb0c2-0285-4757-852b-1bbe4c625c09',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'fcceb0c2-0285-4757-852b-1bbe4c625c09',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Thu, 24 Aug 2023 03:30:53 GMT'},
  'RetryAttempts': 1}}