## Retrieval-Augmented Question Answering with a SageMaker JumpStart Foundation Model using LangChain and Amazon OpenSearch Serverless
### Introduction
Q&A assistants powered by generative AI are designed to hold natural conversations and answer questions on a wide range of topics. They use a large language model (LLM) foundation model to understand questions and generate relevant, helpful responses. With generative AI capabilities, a Q&A assistant can create unique responses instead of pulling from a database of pre-written responses. Overall, the goal is to have more human-like conversations that educate, assist, and improve user productivity.

While Q&A assistants powered by generative AI are helpful across general topics, they struggle to provide information or assistance that involves domain-specific knowledge, such as enterprise data that was not part of the model's training data. Two approaches are generally used to make a Q&A assistant understand enterprise data and provide useful responses:

* Fine-tune the LLM with enterprise data;
* Integrate the LLM with enterprise knowledge through external databases (e.g. a vector database). This approach is also referred to as RAG (Retrieval Augmented Generation).

Previously we showed how you could build a movie assistant RAG chatbot using Knowledge Bases for Bedrock. In this lab, we are going to explore another option: building a RAG chatbot using an open source LLM (Meta Llama 2) hosted in Amazon SageMaker through SageMaker JumpStart. For knowledge base integration, we'll set up a vector database using Amazon OpenSearch Serverless and integrate it with the LLM through the LangChain framework.


### Architecture
![qna-rag](images/langchain-sagemaker-qa-rag.png)
  
#### Ask question
![Question](./images/chatbot_lang.png)

Once the document index is prepared, you are ready to ask questions, and relevant documents will be fetched based on the question being asked. The following steps will be executed:
- Create an embedding of the input question
- Compare the question embedding with the embeddings in the index
- Fetch the (top N) relevant document chunks
- Add those chunks as part of the context in the prompt
- Send the prompt to the model hosted in SageMaker
- Get the contextual answer based on the documents retrieved
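
The steps above can be sketched without any AWS services. The snippet below is a minimal illustration, not the lab's implementation: `embed` is a hypothetical keyword counter standing in for a real embedding model, the dot product stands in for the vector database's similarity score, and the movie chunks are made-up sample text.

```python
import re

# Hypothetical vocabulary for the toy embedding; a real model learns its own features.
VOCABULARY = ["toy", "jungle", "game", "cowboy"]

def embed(text):
    """Toy stand-in for an embedding model: counts occurrences of a few keywords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCABULARY]

def score(a, b):
    """Dot product used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))

def retrieve(question, chunks, top_n=2):
    """Embed the question, compare it with every chunk, and keep the top N chunks."""
    question_vector = embed(question)
    ranked = sorted(chunks, key=lambda c: score(question_vector, embed(c)), reverse=True)
    return ranked[:top_n]

def build_prompt(question, context_chunks):
    """Add the retrieved chunks to the prompt as context."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Made-up sample chunks for illustration only.
chunks = [
    "Jumanji: two children find a magical board game that releases jungle hazards.",
    "Toy Story: a cowboy toy feels threatened by a new spaceman toy.",
    "Heat: a detective pursues a crew of professional thieves.",
]
question = "Which movie is about a jungle board game?"
top_chunks = retrieve(question, chunks, top_n=1)
prompt = build_prompt(question, top_chunks)
print(prompt)  # this prompt is what would be sent to the model endpoint
```

In the lab itself, the embedding model, the similarity search, and the prompt assembly are handled by Amazon Titan, OpenSearch Serverless, and LangChain respectively.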

## Use case
#### Dataset
To explain this architecture pattern, we are using a few documents from the MovieLens dataset. These documents cover topics such as:
- Movie synopses
- Release dates
- Cast members
  

#### Persona
Let's assume a persona of a user who is looking for information about movies/shows. 

In [None]:
%pip install opensearch-py==2.4.2 langchain==0.1.9 ipywidgets==8.0.4 boto3 scikit-learn matplotlib langchain_experimental -q

Setting up environment

In [None]:
import boto3
import uuid
import json
import time
import os
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
import glob
from langchain.schema import Document
from langchain_community.vectorstores import OpenSearchVectorSearch
from typing import Any, Dict, List, Optional

In [None]:
random_id = str(uuid.uuid4().hex)[:5]
vectordb_name="sm-llm-vector-db"
vector_store_name = f'{vectordb_name}-{random_id}'
index_name = f"{vectordb_name}-index-{random_id}"
encryption_policy_name = f"{vectordb_name}-sp-{random_id}"
network_policy_name = f"{vectordb_name}-np-{random_id}"
access_policy_name = f"{vectordb_name}-ap-{random_id}"
kb_role_name = f"{vectordb_name}-role-{random_id}"
knowledge_base_name = f"{vectordb_name}-{random_id}"

## Data Preparation
In the following section, we're going to prepare our knowledge base store using an Amazon OpenSearch Serverless collection. The dataset is provided in the 'data' folder in this project and is ready to be ingested. We'll use the LangChain framework to simplify the data ingestion process.

The main steps for data ingestion workflow are:

1. Create an OpenSearch Serverless collection
2. Load the documents from the data folder
3. Create a numerical vector representation of each document using the Amazon Bedrock Titan Embeddings model
4. Create an OpenSearch index and ingest the document content and the corresponding embeddings into the collection.

![Embeddings](./images/embeddings_lang.png)

In [None]:
def create_opensearch_serverless_collection(vector_store_name, 
                                            index_name, 
                                            encryption_policy_name, 
                                            network_policy_name, 
                                            access_policy_name):
    identity = boto3.client('sts').get_caller_identity()['Arn']

    aoss_client = boto3.client('opensearchserverless')

    security_policy = aoss_client.create_security_policy(
        name = encryption_policy_name,
        policy = json.dumps(
            {
                'Rules': [{'Resource': ['collection/' + vector_store_name],
                'ResourceType': 'collection'}],
                'AWSOwnedKey': True
            }),
        type = 'encryption'
    )

    network_policy = aoss_client.create_security_policy(
        name = network_policy_name,
        policy = json.dumps(
            [
                {'Rules': [{'Resource': ['collection/' + vector_store_name],
                'ResourceType': 'collection'}],
                'AllowFromPublic': True}
            ]),
        type = 'network'
    )

    collection = aoss_client.create_collection(name=vector_store_name,type='VECTORSEARCH')

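    # Poll until the collection leaves the CREATING state.
    while True:
        status = aoss_client.list_collections(collectionFilters={'name':vector_store_name})['collectionSummaries'][0]['status']
        if status in ('ACTIVE', 'FAILED'):
            break
        time.sleep(10)
    # Stop early instead of continuing with a collection that could not be created.
    if status == 'FAILED':
        raise RuntimeError(f"Collection {vector_store_name} failed to create")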

    access_policy = aoss_client.create_access_policy(
        name = access_policy_name,
        policy = json.dumps(
            [
                {
                    'Rules': [
                        {
                            'Resource': ['collection/' + vector_store_name],
                            'Permission': [
                                'aoss:CreateCollectionItems',
                                'aoss:DeleteCollectionItems',
                                'aoss:UpdateCollectionItems',
                                'aoss:DescribeCollectionItems'],
                            'ResourceType': 'collection'
                        },
                        {
                            'Resource': ['index/' + vector_store_name + '/*'],
                            'Permission': [
                                'aoss:CreateIndex',
                                'aoss:DeleteIndex',
                                'aoss:UpdateIndex',
                                'aoss:DescribeIndex',
                                'aoss:ReadDocument',
                                'aoss:WriteDocument'],
                            'ResourceType': 'index'
                        }],
                    'Principal': [identity],
                    'Description': 'Easy data policy'}
            ]),
        type = 'data'
    )
    collection_id = collection['createCollectionDetail']['id']
    collection_arn = collection['createCollectionDetail']['arn']
    # Take the region from the client: os.environ.get("AWS_DEFAULT_REGION") returns None when the variable is unset.
    host = collection['createCollectionDetail']['id'] + '.' + aoss_client.meta.region_name + '.aoss.amazonaws.com'
    time.sleep(60) # Give the collection and the data access policy time to propagate so that we don't run into permission issues.
    return host, collection_id, collection_arn

In [None]:
host, collection_id, collection_arn = create_opensearch_serverless_collection(vector_store_name,
                                                                              index_name,
                                                                              encryption_policy_name,
                                                                              network_policy_name,
                                                                              access_policy_name)
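
Polling for a resource status, as the collection creation does, is easier to reason about with an explicit timeout and a clear failure path. The helper below is a generic sketch under that assumption: `wait_for_status` and its parameters are hypothetical names, and the status sequence is simulated rather than read from `list_collections`.

```python
import time

def wait_for_status(get_status, success="ACTIVE", failure="FAILED",
                    timeout_s=600.0, poll_s=10.0):
    """Call get_status() repeatedly until it reports success, failure, or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == success:
            return status
        if status == failure:
            raise RuntimeError(f"Resource entered status {failure}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status} after {timeout_s} seconds")
        time.sleep(poll_s)

# Simulated status sequence standing in for the real API call.
fake_statuses = iter(["CREATING", "CREATING", "ACTIVE"])
final_status = wait_for_status(lambda: next(fake_statuses), poll_s=0.01)
print(final_status)
```

Passing the status lookup in as a callable keeps the helper independent of any particular AWS client, so the same loop could wrap other long-running operations.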

## Document Preparation
After the OpenSearch collection has been created, we are going to prepare the documents for the ingestion process. For this lab, we'll be using the dataset provided in the `data` folder. These files are a subset of the MovieLens dataset, formatted and prepared to work with this lab. We'll parse the text files and create a [Document](https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/documents/base.py) for each one so that they can be ingested, using the LangChain framework, into the vector datastore that we created.

In [None]:
docs = []
for file in glob.glob("data/*.txt"):
    with open(file, "r") as f:
        lines = f.readlines()
    # Each line has the form "<label>: <value>". Split on the first colon only,
    # so that values containing a colon (for example in a title) are kept whole.
    movie_id = lines[0].split(":", 1)[1].strip()
    title = lines[1].split(":", 1)[1].strip()
    genres = lines[2].split(":", 1)[1].strip()
    spoken_languages = lines[3].split(":", 1)[1].strip()
    release_date = lines[4].split(":", 1)[1].strip()
    rating = lines[5].split(":", 1)[1].strip()
    if rating == "nan":
        rating = "0"  # treat a missing rating as 0
    cast = lines[6].split(":", 1)[1].strip()
    overview = lines[7].split(":", 1)[1].strip()
    # The full file text becomes the searchable content; parsed fields go into metadata.
    doc = Document(
        page_content="".join(lines),
        metadata={
            "title" : title,
            "movie_id": movie_id,
            "rating": float(rating),
            "genres": genres.split(","),
            "spoken_languages": spoken_languages.split(","),
            "release_date": release_date,
            "cast" : cast.split(",")}
        )
    docs.append(doc)
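
Because every field is stored as a `label: value` line, a value that itself contains a colon needs some care when parsing. The sketch below uses `str.partition`, which splits on the first colon only. The `parse_fields` helper, the field labels, and the sample lines are assumptions for illustration; they are not taken from the files in the `data` folder.

```python
def parse_fields(lines):
    """Turn 'label: value' lines into a dict, splitting on the first colon only."""
    fields = {}
    for line in lines:
        key, separator, value = line.partition(":")
        if separator:  # skip lines that contain no colon at all
            fields[key.strip().lower()] = value.strip()
    return fields

# Hypothetical sample lines in the same general shape as the lab's text files.
sample_lines = [
    "Title: Star Wars: Episode IV\n",
    "Release Date: 1977-05-25\n",
    "Rating: nan\n",
]
fields = parse_fields(sample_lines)
# A missing rating is stored as "nan"; map it to 0.0 before converting.
rating = 0.0 if fields["rating"] == "nan" else float(fields["rating"])
print(fields["title"], rating)
```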

# Langchain Integration 
<img src="images/langchain-logo.png" alt="langchain" style="width: 400px;"/>
LangChain is a framework for developing applications powered by LLMs. At a high level, LangChain enables applications that are:

* Data-aware: connect a language model to other sources of data
* Agentic: allow a language model to interact with its environment

The main advantages of using LangChain are:

* It provides framework abstractions for working with language models, along with a collection of implementations for each abstraction.
* Its modular design lets you combine any LangChain components to build an application.
* It provides many off-the-shelf chains that make it easy to get started.

LangChain has a robust set of features with SageMaker support. In this lab, we'll use the following LangChain components to integrate the LLM deployed in SageMaker with an Amazon Titan embedding model and build a simple Q&A application.


* [Langchain SageMaker Endpoint](https://python.langchain.com/docs/integrations/providers/sagemaker_endpoint)
* [Amazon Titan embedding model](https://python.langchain.com/docs/integrations/text_embedding/bedrock) through Bedrock
* [Langchain ConversationalRetrievalChain](https://api.python.langchain.com/en/latest/chains/langchain.chains.conversational_retrieval.base.ConversationalRetrievalChain.html)

## Create documents and ingest them into the OpenSearch Serverless collection
First, we iterate through the documents in the 'data' folder and create a Document object for each txt file.
Then we feed the documents to OpenSearch Serverless for ingestion using an `OpenSearchVectorSearch` object provided by the LangChain framework.

In [None]:
boto3_credentials = boto3.Session().get_credentials() # needed for authenticating against opensearch cluster for index creation
region = boto3.client("sts").meta.region_name
service = "aoss"
auth = AWSV4SignerAuth(boto3_credentials, region, service)

Define an embedding model and an LLM. In our example, we'll use the Amazon Titan Embeddings model as the embedding model, and the Llama2-7b chat model hosted in Amazon SageMaker as the LLM.

In [None]:
from langchain_community.embeddings import BedrockEmbeddings

embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")

In [None]:
vectorstore = OpenSearchVectorSearch.from_documents(
    docs,
    embeddings,
    index_name="movielens-index",
    opensearch_url=f"{host}:443",
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    timeout = 100,
    engine="faiss")
time.sleep(60) # Sleep for a short interval for the index to persist completely. 

# Vector Embeddings
What are vector embeddings? Vector embeddings help a search engine take a user query and return relevant pages, recommend articles, correct misspelled words in the query, and suggest related queries that the user might find helpful.

A vector embedding is a numerical representation of data that captures semantic relationships and similarities, making it possible to perform mathematical operations and comparisons on the data for various tasks like text analysis and recommendation systems.
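
One common comparison between two embeddings is cosine similarity, which measures the angle between the vectors regardless of their length. The following is a minimal sketch with hypothetical 3-dimensional vectors; real embedding models such as Titan produce vectors with far more dimensions, and the numbers below are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means a similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: the two animated films point in a similar direction.
toy_story = [0.9, 0.1, 0.2]
toy_story_2 = [0.8, 0.2, 0.1]
heat = [0.1, 0.9, 0.3]

sim_sequel = cosine_similarity(toy_story, toy_story_2)
sim_crime = cosine_similarity(toy_story, heat)
print(f"Toy Story vs Toy Story 2: {sim_sequel:.3f}")
print(f"Toy Story vs Heat:        {sim_crime:.3f}")
```

The pair with the higher score is the pair a similarity search would treat as more closely related.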

Here's an image that visualizes the semantic relationships between embeddings in the vector space.

![vector_embeddings](images/vector-embedding-3d.jpeg)




In the same way, let's visualize the movies dataset and its relationships in 3D space!

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3) # PCA projects the embeddings onto their 3 highest-variance directions for plotting; some information is lost
embedding_docs = embeddings.embed_documents( [ x.page_content for x in docs ])

Extract the titles from the documents for visualization purposes.

In [None]:
titles = [ x.metadata['title'] for x in docs ]

In [None]:
vis_dims = pca.fit_transform(embedding_docs)
vis_dims_list = vis_dims.tolist()

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def plot_vector_embeddings(titles, vectors):
    fig = plt.figure(figsize=(10, 5))
    ax = fig.add_subplot(projection='3d')
    cmap = plt.get_cmap("tab20")

    # Plot each sample category individually such that we can set label name.
    for i, cat in enumerate(titles):
        sub_matrix = np.array([vectors[i]])
        x=sub_matrix[:,0]
        y=sub_matrix[:, 1]
        z=sub_matrix[:, 2]
        colors = [cmap(i/len(titles))] * len(sub_matrix)
        ax.scatter(x, y, zs=z, zdir='z', c=colors, label=cat)

    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('z')
    ax.set_title("Movielens data embeddings in 3D")
    ax.legend(bbox_to_anchor=(1.1, 1))

In [None]:
%matplotlib inline

plot_vector_embeddings(titles, vis_dims_list)

Let's add a query and visualize it in 3D space with all other movies

In [None]:
query = "I want to watch an action movie with friendship and murder"

In [None]:
embedding = embeddings.embed_documents([query])[0]

In [None]:
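# Project the query with the PCA fitted on the movies so that it lands in the same 3D space;
# appending the raw embedding would plot only its first three raw dimensions.
vis_dims_list.append(pca.transform([embedding])[0].tolist())
titles.append("QUERY")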

In [None]:
%matplotlib inline

plot_vector_embeddings(titles, vis_dims_list)

Since we have ingested the movie data into the vector database (OpenSearch Serverless collection), we can start using the vector database to perform fast retrieval and similarity search.

In [None]:
response = vectorstore.similarity_search_with_score(query, k=3)

Print the similarity scores for the best-matching documents.

In [None]:
for doc in response:
    print(f"Title: {doc[0].metadata['title']}, similarity score: {doc[1]}")

Another way to run a semantic similarity search is to use the vector store as a retriever, as shown in the following:

In [None]:
relevant_documents = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2}).get_relevant_documents(query)

In [None]:
for doc in relevant_documents:
    print("=======")
    print(doc)

## Deploy open source Llama2 model from SageMaker Jumpstart

Amazon SageMaker JumpStart is a machine learning (ML) hub that can help you accelerate your ML journey. 

With SageMaker JumpStart, you can evaluate, compare, and select foundation models quickly based on pre-defined quality and responsibility metrics to perform tasks like article summarization and image generation. In addition, you can easily deploy them into production with the user interface or SDK. There is a wide range of LLMs available in SageMaker JumpStart, from open source models such as those from Hugging Face to proprietary models such as the AI21 Jurassic-2 model variants.

Here's a quick view of some of the FMs available directly from JumpStart:

![jumpstart models](images/sagemaker-jumpstart-llms.png)

In the following section, we'll guide you step by step through deploying an open source LLM (llama2-7b-chat) using the SageMaker JumpStart UI. After that, we'll use the model, along with the vector database integrated via the LangChain framework, to build a simple chatbot.

1. Navigate to SageMaker Jumpstart Launcher screen and select "JumpStart" from the left pane:

<img src="images/sm-studio-home.png" alt="sm studio launcher" style="width: 700px;"/>


2. In the search bar, search for: **llama 2 7B chat** and select the "Llama2 7B Chat" model (Note: not the neuron version)

<img src="images/llama2-7b-chat-search.png" alt="search llama2-7b chat" style="width: 800px;"/>

3. Spend a minute or two reading through the details about this model. When you are done, click the "Deploy" button in the top right-hand corner to start the model deployment.

<img src="images/jumpstart-llama2-deploy.png" alt="sm jumpstart llama2 deploy" style="width: 700px;"/> 

4. Accept the EULA by clicking the checkbox, give a unique endpoint name (e.g. [your username]-genai-workshop-llama2-7b), then click **Deploy** in the lower right hand corner.

<img src="images/jumpstart-deploy-details.png" alt="sm jumpstart llama2 deploy detail" style="width: 700px;"/>

The deployment should start within a few seconds. You should see the deployment messages as shown in the following:

<img src="images/jumpstart-deploy-started.png" alt="sm jumpstart llama2 deploy started" style="width: 700px;"/>


To optimize the compute, SageMaker Jumpstart integrates with SageMaker Inference Component in the LLM deployment process. This feature helps optimize the use of the compute resources, and allows for multiple models served behind a single endpoint. For more information about inference components, please refer to this [link](https://aws.amazon.com/blogs/aws/amazon-sagemaker-adds-new-inference-capabilities-to-help-reduce-foundation-model-deployment-costs-and-latency/).

The following diagram shows the components of SageMaker Inference Component:

<img src="images/sm-inference-component.png" alt="sm inference component" style="width: 700px;"/>

You can check the deployment status either in the JumpStart UI directly or with boto3. The deployment process can take 5-10 minutes.

In the following, we are going to use boto3 client to check for status.

In [None]:
sm_client = boto3.client("sagemaker")
llm_endpoint_name = "*********" # replace the endpoint name with the one that you created via SM Jumpstart UI. For example, "weteh-genai-workshop-llama2-7b"

In [None]:
def wait_for_deployment_complete(endpoint_name):
    response = sm_client.list_inference_components(
        SortBy='CreationTime',
        SortOrder='Descending',
        EndpointNameEquals=endpoint_name
    )
    inference_component_name = response['InferenceComponents'][0]['InferenceComponentName']
    
    response = sm_client.describe_inference_component(
        InferenceComponentName=inference_component_name
    )
    
    inference_component_status = response['InferenceComponentStatus']
    
    while inference_component_status not in ['InService', 'Failed']:
        time.sleep(10) # sleeps 10 seconds and check the status again
        response = sm_client.describe_inference_component(
            InferenceComponentName=inference_component_name
        )
        inference_component_status = response['InferenceComponentStatus']
    print(f"Inference Component: {inference_component_name} deployment is complete with status: {inference_component_status}")
    return inference_component_name

Invoke the following with the endpoint name that you picked at the deployment step. The cell runs until the deployment is complete.
If you see the status "InService", the model deployment is complete and the endpoint is ready to be used.

In [None]:
llm_inference_component_name = wait_for_deployment_complete(endpoint_name=llm_endpoint_name) 

## SageMaker Langchain Integration

When using the SagemakerEndpoint class, one essential requirement is to provide a ContentHandler object. This object is responsible for converting inputs and outputs into the required formats for the specific LLM in use. While the default LLMContentHandler object bundled with the Langchain library serves some models effectively, it may not be universally applicable, particularly for models deployed via the HuggingFace LLM Inference Container.

In the following code, we'll format the prompt to match the Llama prompt example described in this [example script](https://github.com/facebookresearch/llama/blob/main/example_chat_completion.py):


In [None]:
from langchain_community.llms.sagemaker_endpoint import LLMContentHandler

class SMLLMContentHandler(LLMContentHandler):
        content_type = "application/json"
        accepts = "application/json"

        def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
            input_data = json.dumps([[{"role" : "system", "content" : "You are a movie assistant."},
                                    {"role" : "user", "content" : prompt}]])
            input_str = json.dumps({"inputs" : input_data, "parameters" : {**model_kwargs}})
            return input_str.encode('utf-8')

        def transform_output(self, output: bytes) -> str:
            response_json = json.loads(output.read().decode("utf-8"))
            return response_json[0]["generated_text"]
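
To see what such a handler exchanges with the endpoint, the two transformations can be run on their own. The sketch below mirrors the handler's logic in two hypothetical standalone functions, `build_payload` and `parse_response`, and feeds them a fabricated response body; the response shape (`[{"generated_text": ...}]`) is the one the handler assumes, not something verified against a live endpoint.

```python
import io
import json

def build_payload(prompt, model_kwargs):
    """Serialize the system and user messages, then wrap them with the generation parameters."""
    messages = json.dumps([[{"role": "system", "content": "You are a movie assistant."},
                            {"role": "user", "content": prompt}]])
    return json.dumps({"inputs": messages, "parameters": {**model_kwargs}}).encode("utf-8")

def parse_response(body):
    """Read the streamed response body and pull out the generated text."""
    return json.loads(body.read().decode("utf-8"))[0]["generated_text"]

payload = build_payload("Who directed Jumanji?", {"max_new_tokens": 64})
decoded = json.loads(payload)
# The message list is serialized before being embedded, so "inputs" is a JSON-encoded string.
print(type(decoded["inputs"]).__name__)

# Fabricated response body standing in for the endpoint's streaming output.
fake_body = io.BytesIO(json.dumps([{"generated_text": "I don't know"}]).encode("utf-8"))
answer = parse_response(fake_body)
print(answer)
```

If an endpoint expects `inputs` as a nested list rather than a string, the inner `json.dumps` call would need to be dropped; which form applies depends on the serving container behind the endpoint.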

In [None]:
from langchain_community.llms import SagemakerEndpoint

region_name = sm_client.meta.region_name
model_params = { 
                    "do_sample": True,
                    "top_p": 0.9,
                    "temperature": 0.1,
                    "max_new_tokens": 1000,
                    "stop": ["<|endoftext|>", "</s>"],
                    "repetition_penalty": 1.1
               }

llm = SagemakerEndpoint(
    endpoint_name=llm_endpoint_name,
    region_name=region_name,
    content_handler = SMLLMContentHandler(),
    model_kwargs = model_params,
    endpoint_kwargs = {"InferenceComponentName" : llm_inference_component_name})

In [None]:
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain_experimental.chat_models import Llama2Chat

In [None]:
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    MessagesPlaceholder,
    SystemMessagePromptTemplate
)
from langchain_core.messages import SystemMessage
from langchain.chains.conversational_retrieval.base import ConversationalRetrievalChain

system_template = """Given the following context: 

Context:
{context}

Answer the question as truthfully as possible. Your answer must come only from the context given above. If the answer is not found in the given context, say "I don't know".
Your answer must be a direct, concise summary. It's critical that you only use the context given to you in answering the question.
"""

template_messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
prompt_template = ChatPromptTemplate.from_messages(template_messages)
model = Llama2Chat(llm=llm)

vectorstore_retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})
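
Llama 2 chat models expect the system and user messages to be wrapped in special markers, which is the job of a chat wrapper such as `Llama2Chat`. The sketch below shows the commonly documented single-turn Llama 2 layout; `format_llama2_prompt` is a hypothetical helper, the system template is abbreviated, and the context text is invented, so treat the exact string as an approximation of what the wrapper sends.

```python
# Abbreviated version of a system template with a {context} placeholder.
SYSTEM_TEMPLATE = (
    "Given the following context:\n\n"
    "Context:\n{context}\n\n"
    "Answer the question using only the context given above."
)

def format_llama2_prompt(system, user):
    """Wrap a system message and one user message in the Llama 2 chat markers."""
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

# Hypothetical retrieved context, standing in for the chunks returned by the retriever.
context = "Title: Jumanji\nGenres: Adventure, Fantasy"
system_message = SYSTEM_TEMPLATE.format(context=context)
llama_prompt = format_llama2_prompt(system_message, "What genre is Jumanji?")
print(llama_prompt)
```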

In [None]:
import ipywidgets as ipw
from IPython.display import display, clear_output

class ChatUX:
    """ A chat UX using IPWidgets
    """
    def __init__(self):
        memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True, output_key='answer')
        vectorstore_retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})
        self.qa = ConversationalRetrievalChain.from_llm(llm=model,
                                                      memory=memory,
                                                      retriever=vectorstore_retriever, 
                                                      return_source_documents=True,
                                                      combine_docs_chain_kwargs={"prompt": prompt_template})
        self.name = None
        self.b=None
        self.out = ipw.Output()

    def start_chat(self):
        print("Let's chat!")
        display(self.out)
        self.chat(None)

    def chat(self, _):
        if self.name is None:
            prompt = ""
        else:
            prompt = self.name.value
        if 'q' == prompt or 'quit' == prompt or 'Q' == prompt:
            print("Thank you , that was a nice chat !!")
            return
        elif len(prompt) > 0:
            with self.out:
                thinking = ipw.Label(value=f"Thinking...")
                display(thinking)
                try:
                    response = self.qa.invoke({"question" : prompt})
                    result = response['answer']

                except Exception as e:
                    print(e)
                    result = "No answer"
                thinking.value=""
                print(f"AI: {result}")
                self.name.disabled = True
                self.b.disabled = True
                self.name = None

        if self.name is None:
            with self.out:
                self.name = ipw.Text(description="You: ", placeholder='q to quit')
                self.b = ipw.Button(description="Send")
                self.b.on_click(self.chat)
                display(ipw.Box(children=(self.name, self.b)))

## Sample Questions
* What's the movie "Jumanji" all about?
* When was this movie released?
* Who were the actors in this movie?
* What other movies were released in the same year?
* Tell me more about Woody in Toy Story.
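
Several of these questions ("this movie", "the same year") only make sense with the earlier turns in view. A conversational retrieval chain typically handles this by first asking the LLM to rewrite the follow-up as a standalone question using the chat history, and only then running retrieval. The sketch below builds such a rewrite prompt; the template wording approximates that idea and is not the library's exact prompt, and the helper names and the sample answer are hypothetical.

```python
CONDENSE_TEMPLATE = (
    "Given the following conversation and a follow up question, "
    "rephrase the follow up question to be a standalone question.\n\n"
    "Chat History:\n{chat_history}\n"
    "Follow Up Input: {question}\n"
    "Standalone question:"
)

def format_history(turns):
    """Render (question, answer) pairs as alternating Human/Assistant lines."""
    return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in turns)

def build_condense_prompt(turns, question):
    """Combine the chat history and the follow-up into one rewrite prompt."""
    return CONDENSE_TEMPLATE.format(chat_history=format_history(turns), question=question)

# Hypothetical first turn; the answer text is invented for illustration.
history = [('What\'s the movie "Jumanji" all about?', "It is about a magical board game.")]
condense_prompt = build_condense_prompt(history, "When was this movie released?")
print(condense_prompt)
```

The rewritten question, for example one that names Jumanji explicitly, is what gets embedded and matched against the index.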

In [None]:
chat = ChatUX()
chat.start_chat()

# Conclusion
In this notebook, we demonstrated how to build a Q&A chatbot using an open source LLM hosted in SageMaker and a vector database built on an Amazon OpenSearch Serverless collection. We used the LangChain framework to integrate the vector database and the LLM behind a simple chatbot interface.

First, we ingested the documents provided in the 'data' folder into the OpenSearch Serverless collection. We showed the contextual similarities between the documents by visualizing their associated embeddings. Then, we deployed an open source LLM (llama2-7b chat) using SageMaker JumpStart. Next, we used the LangChain library's ConversationalRetrievalChain object to orchestrate the workflow between the user query and the LLM hosted in SageMaker. Finally, we built a simple Q&A interface that allows users to ask the chatbot questions and get responses from the LLM.