# Build a Knowledge-Based Vector Database Using Amazon OpenSearch and a Knowledge Base Chatbot with a Llama 2 Model Hosted in SageMaker

> *This notebook should work well with the **`Data Science 3.0`** kernel in SageMaker Studio*

In this notebook, we will build a chatbot using a fine-tuned Llama 2 model hosted in Amazon SageMaker.

## Overview

Conversational interfaces such as chatbots and virtual assistants can be used to enhance the user experience for your customers. Chatbots use natural language processing (NLP) and machine learning algorithms to understand and respond to user queries. Chatbots can be used in a variety of applications, such as customer service, sales, and e-commerce, to provide quick and efficient responses to users. They can be accessed through various channels such as websites, social media platforms, and messaging apps.


## Chatbot using Llama 2 model hosted in Amazon SageMaker

<img src="images/chatbot_sagemaker.png" width="800">

## Lab Content
In this lab, we will develop a chatbot that performs a range of tasks. These tasks include:  

1. **Chatbot (Basic)** - Zero-shot chatbot with a foundation model (FM)
2. **Chatbot using prompt** - Chatbot with some context provided in the prompt template
3. **Chatbot with persona** - Chatbot with defined roles, e.g. career coach and human interactions
4. **Context-aware chatbot** - Passing in context from external files by generating embeddings.

## Setup

In [None]:
!pip install sagemaker --quiet --upgrade --force-reinstall
!pip install ipywidgets==7.0.0 --quiet
!pip install --quiet langchain==0.0.309

In [None]:
import warnings
warnings.filterwarnings('ignore')

# Langchain Integration 
<img src="images/langchain-logo.png" alt="langchain" style="width: 400px;"/>
LangChain is a framework for developing applications powered by LLMs. At a high level, LangChain enables applications that are:

* Data-aware: connect a language model to other sources of data
* Agentic: allow a language model to interact with its environment

The main advantages of using LangChain are:

* Provides framework abstractions for working with language models, along with a collection of implementations for each abstraction. 
* Modular design principle promotes flexibility to use any LangChain components to build an application 
* Provides many off-the-shelf chains that make it easy to get started. 

LangChain also has robust SageMaker support. In this workshop, we'll use the following LangChain components to integrate with the LLM and the embeddings model deployed in SageMaker to build a simple Q&A application.


* [Langchain SageMaker Endpoint](https://python.langchain.com/docs/integrations/providers/sagemaker_endpoint)
* [Langchain SageMaker Endpoint Embeddings](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.sagemaker_endpoint.SagemakerEndpointEmbeddings.html)


Setting up environment

In [None]:
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
from typing import Any, Dict, List, Optional
import json
from urllib.request import urlretrieve
import os
import sys
import boto3

Restore the variables stored in the previous lab, then define a ContentHandler class for the LangChain LLM integration

In [None]:
%store -r

In [None]:
# Retrieves the embedding endpoint name deployed in the previous lab
embedding_endpoint_name

In [None]:
# uncomment the line below to use an endpoint name that's different from the one created in the previous lab.
# llm_endpoint_name="<your endpoint name>" # Change this value to the llama2 model endpoint deployed in your environment.
region_name = boto3.Session().region_name

In [None]:
from langchain.llms.sagemaker_endpoint import LLMContentHandler

class SMLLMContentHandler(LLMContentHandler):
        content_type = "application/json"
        accepts = "application/json"

        def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
            input_str = json.dumps({"text": prompt, "properties" : model_kwargs})
            return input_str.encode('utf-8')

        def transform_output(self, output: bytes) -> str:
            response_json = json.loads(output.read().decode("utf-8"))
            response = response_json['outputs'][0]["generated_text"].strip()
            if response.rfind('[/INST]') != -1:
                cleaned_response = response[response.rfind('[/INST]')+len('[/INST]'):]
            else:
                cleaned_response = response
            return cleaned_response

In [None]:
from langchain import SagemakerEndpoint

model_params = { 
        
            "do_sample": True,
            "top_p": 0.9,
            "temperature": 0.01,
            "top_k": 100,
            "max_new_tokens": 512,
            "repetition_penalty": 1.03,
    }


llm = SagemakerEndpoint(
    endpoint_name=llm_endpoint_name,
    region_name=region_name,
    content_handler = SMLLMContentHandler(),
    model_kwargs = model_params)

## Chatbot (Basic - without context)

We use [ConversationChain](https://python.langchain.com/en/latest/modules/models/llms/integrations/bedrock.html?highlight=ConversationChain#using-in-a-conversation-chain) from LangChain to start the conversation. We also use [ConversationBufferMemory](https://python.langchain.com/en/latest/modules/memory/types/buffer.html) to store the messages. We can also get the history as a list of messages (this is very useful in a chat model).

Chatbots need to remember previous interactions. Conversational memory allows us to do that. There are several ways to implement conversational memory. In the context of LangChain, they are all built on top of the ConversationChain.
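
To make the idea of conversational memory concrete, here is a minimal sketch in plain Python of what a buffer-style memory does: it keeps every turn and replays the transcript as text in the next prompt. The `BufferMemory` class and its method names are hypothetical stand-ins for illustration, not the LangChain implementation; the `Human`/`AI` prefixes mirror the ones used later in this notebook.

```python
# Hypothetical stand-in for a buffer-style conversation memory (not the LangChain class).
class BufferMemory:
    def __init__(self, human_prefix="Human", ai_prefix="AI"):
        self.human_prefix = human_prefix
        self.ai_prefix = ai_prefix
        self.turns = []  # list of (speaker, text) tuples, oldest first

    def add_user_message(self, text):
        self.turns.append((self.human_prefix, text))

    def add_ai_message(self, text):
        self.turns.append((self.ai_prefix, text))

    def render(self):
        # The whole transcript is replayed into the next prompt as plain text.
        return "\n".join(f"{speaker}: {text}" for speaker, text in self.turns)


memory_sketch = BufferMemory()
memory_sketch.add_user_message("Hi there!")
memory_sketch.add_ai_message("Hello! How can I help?")
print(memory_sketch.render())
```

Because the full transcript is replayed on every call, the prompt grows with each turn, which is why long conversations can eventually exceed the model's context window.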

**Note:** The model outputs are non-deterministic

In [None]:
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()
conversation = ConversationChain(
    llm=llm, verbose=True, memory=memory
)

print(conversation.predict(input="Hi there!"))

What happens here? We said "Hi there!" and the model spat out several conversations. This is because the default prompt used by the LangChain ConversationChain is not well designed for Llama 2. A section in Meta's official Llama 2 GitHub repository on [llama2 chat completion](https://github.com/facebookresearch/llama#fine-tuned-chat-models) contains descriptions and examples of how to format the prompt for this particular model to optimize the response. Let's fix this problem.
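
As an illustration of the prompt layout that Llama 2 chat models expect, the sketch below builds a single-turn prompt with a system instruction wrapped in `<<SYS>>` tags and the user message placed before `[/INST]`. The helper name `build_llama2_prompt` and the example system text are assumptions made for this sketch; they are not part of the workshop code.

```python
def build_llama2_prompt(system: str, user: str) -> str:
    # Single-turn layout: system text inside <<SYS>> tags, user text before [/INST].
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


example_prompt = build_llama2_prompt(
    "You are a helpful assistant.",  # placeholder system instruction
    "Hi there!",
)
print(example_prompt)
```

The model is expected to continue the text after `[/INST]`, so a prompt that lacks these markers gives it no clear signal about where the user turn ends.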

## Chatbot using prompt template (Langchain)

LangChain provides several classes and functions to make constructing and working with prompts easy. We are going to use the [PromptTemplate](https://python.langchain.com/en/latest/modules/prompts/getting_started.html) class to construct the prompt from an f-string template. 

In [None]:
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate

# turn verbose to true to see the full logs and documents
conversation= ConversationChain(
    llm=llm, verbose=False, memory=ConversationBufferMemory(ai_prefix="AI", human_prefix="Human", input_key="input") #memory_chain
)

prompt_template = """<s>[INST] <<SYS>>
Given the following context, answer the question as accurately as possible:
<</SYS>>

### Conversation History
{history}

### Question
{input}

### Context
{context}[/INST] """

stop = ["\[INST\]", "\[/INST\]", "Human:", "<\|im_sep\|>", "</s>", "<INST>"]

# langchain prompts do not always work with all the models. This prompt is tuned for Llama 2
llama2_prompt = PromptTemplate.from_template(prompt_template)

conversation.prompt = llama2_prompt
history = []
q = "Who is Albert Einstein?"
response = conversation.predict(input=q, context=None)

In [None]:
print(response)

#### New Questions

The model has responded with an initial message. Let's ask a few follow-up questions.

In [None]:
q2 = "When was he born?"
response = conversation.predict(input=q2, context=None, stop=stop)

In [None]:
print(response)

## Chatbot with persona

The AI assistant will play the role of a career coach. Role-play dialogue requires the user message to be set before starting the chat. ConversationBufferMemory is used to pre-populate the dialog.

In [None]:
# store previous interactions using ConversationBufferMemory and add custom prompts to the chat.
memory = ConversationBufferMemory(ai_prefix="AI", human_prefix="Human", input_key="input")
memory.chat_memory.add_user_message("You will be acting as a career coach. Your goal is to give career advice to users")
memory.chat_memory.add_ai_message("I am a career coach and give career advice")
conversation = ConversationChain(
     llm=llm, verbose=False, memory=memory
)

conversation.prompt = llama2_prompt

response = conversation.predict(input="What are the career options in AI?", context=None, stop=stop)

In [None]:
print(response)

In [None]:
response = conversation.predict(input="What these people really do? Is it fun?", context=None, stop=stop)
print(response)

##### Let's ask a question outside this persona's specialty. The model shouldn't answer it and should give a reason why.

In [None]:
conversation.verbose = False
print(conversation.predict(input="How to fix my car?", context=None, stop=stop))

## Building Chatbot with Context - Key Elements
Enterprise search has shown tremendous value in helping people in an organization find the information they need to perform their jobs. 
With the rise of AI-based enterprise search (intelligent search), the new paradigm enables organizations to gain better insights and offer employees a more dynamic experience.
For example, rather than using standard keyword search, users can use natural language queries to find more accurate and semantically relevant results, which drastically improves customer and employee experiences.
An intelligent search system requires a knowledge repository, typically a vector database, to allow fast and accurate similarity search and retrieval of data based on vector distance or similarity.

In our previous lab, we created an embedding model and hosted it on SageMaker. In this lab, we'll focus on converting the knowledge source (i.e. documents) into vector representations using the embedding model, and ingesting the vectors into an Amazon OpenSearch Serverless collection. 


### Pattern
In this notebook we walk through the steps to convert the sample documents into embeddings, and persist those documents into an OpenSearch Serverless collection. 

#### Step 1 Prepare documents
![Embeddings](./images/Embeddings_lang.png)

Before the chatbot can answer questions, the documents must be processed and stored in a document store index:
- Load the documents
- Process and split them into smaller chunks
- Create a numerical vector representation of each chunk using SageMaker embedding model
- Create an index using the chunks and the corresponding embeddings

#### Step 2 Process User Query
The second process is user request orchestration: handling the interaction, invoking the models, and returning the results.

![Chatbot](./images/chatbot_lang.png)

## RAG Architecture
<img src="images/context-aware-chatbot.png" width="800">

<!-- ## Building a Chatbot with Context 
In this use case we will ask the Chatbot to answer question from some external corpus it has likely never seen before. To do this we apply a pattern called RAG (Retrieval Augmented Generation): the idea is to index the corpus in chunks, then look up which sections of the corpus might be relevant to provide an answer by using semantic similarity between the chunks and the question. Finally the most relevant chunks are aggregated and passed as context to the ConversationChain, similar to providing a history.

We will take a csv file and use **Titan Embeddings Model** to create vectors for each line of the csv. This vector is then stored in FAISS, an open source library providing an in-memory vector datastore. When the chatbot is asked a question, we query FAISS with the question and retrieve the text which is semantically closest. This will be our answer.  -->

### Dataset
For this lab, we provide some sample documents that are synthetically generated. These are news articles across different genres, as follows:

- Politics
- Media
- Sports

We'll be using these documents as the basis for the chatbot to help us answer questions. 

## Setup
Before running the rest of this notebook, you'll need to run the cells below to ensure the necessary libraries are installed and to set the AWS region.
In this notebook, we'll also need some extra dependencies:

- [OpenSearch Python Client](https://pypi.org/project/opensearch-py/), to store vector embeddings

In [None]:
%pip install --no-build-isolation --force-reinstall \
    "boto3>=1.28.57" \
    "awscli>=1.29.57" \
    "botocore>=1.31.57" -q

In [None]:
%pip install -U opensearch-py==2.3.1 langchain==0.0.309 \
    apache-beam \
    datasets \
    tiktoken -q

In [None]:
import json
import os
import sys

import boto3

module_path = ".."
sys.path.append(os.path.abspath(module_path))

os.environ["AWS_DEFAULT_REGION"] = region_name

Define a ContentHandler class for the LangChain embeddings integration

In [None]:
class SMEmbeddingContentHandler(EmbeddingsContentHandler):
        content_type = "application/x-text"
        accepts = "application/json"        

        def transform_input(self, prompts: List[str], model_kwargs: Dict) -> bytes:
            return prompts[0].encode('utf-8')

        def transform_output(self, output: bytes) -> List[List[float]]:
            # `output` is the streaming response body; decode it and parse the JSON payload
            model_predictions = json.loads(output.read().decode("utf-8"))
            # the endpoint returns the vectors under the "embedding" key
            return model_predictions["embedding"]

In [None]:
class LangchainSagemakerEndpointEmbeddings(SagemakerEndpointEmbeddings):
    def __init__(self, endpoint_name, region_name, content_handler):
        super().__init__(endpoint_name=endpoint_name,
                         region_name=region_name,
                         content_handler=content_handler)

    def embed_documents(self, texts: List[str], chunk_size: int = 1
    ) -> List[List[float]]:
        return super().embed_documents(texts, chunk_size)


In [None]:
embeddings = LangchainSagemakerEndpointEmbeddings(
                endpoint_name=embedding_endpoint_name,
                region_name=region_name,
                content_handler=SMEmbeddingContentHandler())

## Data Preparation
Let's first load the files that will build our document store. For this example we use the provided Hugging Face dataset, extracted as txt files for easier consumption.

In [None]:
import glob, os

In [None]:
def load_articles():
    titles = []
    contents = []
    files = []
    for file in glob.glob("data/*.txt"):
        with open(file, "r") as f:
            article = f.readlines()
            start_content_tag_pos = -99
            end_content_tag_pos = -99
            content = []
            for line in article:
                if "<title>" in line:
                    start_tag_pos = line.find("<title>")
                    end_tag_pos = line.rfind("</title>")
                    title = line[start_tag_pos+len("<title>"):end_tag_pos]
                elif "<content>" in line:
                    start_content_tag_pos = line.find("<content>")
                    content.append(line[start_content_tag_pos+len("<content>"):].strip())
                elif "</content>" in line:
                    end_content_tag_pos = line.rfind("</content>")
                    content.append(line[:end_content_tag_pos].strip())
                else:
                    content.append(line.strip())

            content_str = "".join(content) 
            contents.append(content_str)
            titles.append(title)
            files.append(file)
    return titles, contents, files
            

In [None]:
titles, contents, files = load_articles()

After loading the articles, we can turn them into LangChain [documents](https://python.langchain.com/en/latest/reference/modules/document_loaders.html) and split them into smaller chunks using LangChain's RecursiveCharacterTextSplitter.

Note: The retrieved document/text should be large enough to contain enough information to answer a question, but small enough to fit into the LLM prompt. Also, the embeddings model limits its input to 4196 tokens, which roughly translates to ~16,000 characters. For the sake of this use case we are creating chunks of roughly 500 characters with an overlap of 100 characters using [RecursiveCharacterTextSplitter](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html).
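
To show what chunk size and overlap mean in practice, here is a deliberately simplified splitter in plain Python. It is only a sketch: unlike RecursiveCharacterTextSplitter it cuts at fixed character offsets and ignores paragraph and sentence boundaries, and the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 100):
    # Simplified fixed-width splitter; it does not look for natural separators.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# Synthetic 1200-character text so that overlaps are easy to verify.
sample_text = "".join(str(i % 10) for i in range(1200))
sample_chunks = split_with_overlap(sample_text)
print([len(chunk) for chunk in sample_chunks])
```

Each chunk starts 400 characters after the previous one, so the last 100 characters of one chunk reappear at the start of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from at least one chunk.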

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(        
    # separator = "\n",
    chunk_size =500,
    chunk_overlap=100,
    length_function = len
)

metadatas = []
for i, title in enumerate(titles):
    metadata = { "title" : title , "file" : files[i]}
    metadatas.append(metadata)

documents = text_splitter.create_documents(contents, metadatas=metadatas)
print(f"number of documents: {len(documents)}")

In [None]:
avg_char_count_pre_length = lambda documents: sum([len(doc) for doc in documents])//len(documents)
avg_doc_length = lambda documents: sum([len(doc.page_content) for doc in documents])//len(documents)
avg_char_count_pre = avg_char_count_pre_length(contents)
avg_char_count_post = avg_doc_length(documents)
print(f'Average length among {len(contents)} documents loaded is {avg_char_count_pre} characters.')
print(f'After the split we have {len(documents)} documents, compared to original {len(contents)}.')
print(f'Average length among {len(documents)} documents (after split) is {avg_char_count_post} characters.')

In [None]:
import numpy as np
sample_embedding = np.array(embeddings.embed_query(documents[0].page_content))
print("Sample embedding of a document chunk: ", sample_embedding)
print("Size of the embedding: ", sample_embedding.shape)

Following a similar pattern, embeddings can be generated for the entire corpus and stored in a vector store.

First, we have to create a vector store. In this workshop we will use ***Amazon OpenSearch Serverless.***

Amazon OpenSearch Serverless is a serverless option in Amazon OpenSearch Service. As a developer, you can use OpenSearch Serverless to run petabyte-scale workloads without configuring, managing, and scaling OpenSearch clusters. You get the same interactive millisecond response times as OpenSearch Service with the simplicity of a serverless environment. Pay only for what you use by automatically scaling resources to provide the right amount of capacity for your application—without impacting data ingestion. 

Please visit this [link](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-getting-started.html) for more information about setting up and operating Amazon OpenSearch Serverless in your AWS environment.

In [None]:
import boto3
import time
import uuid

rand_id = str(uuid.uuid4())[:5]
vector_store_name = f'bedrock-workshop-rag-{rand_id}'
index_name = f"bedrock-workshop-rag-index-{rand_id}"
encryption_policy_name = f"bedrock-workshop-rag-sp-{rand_id}"
network_policy_name = f"bedrock-workshop-rag-np-{rand_id}"
access_policy_name = f'bedrock-workshop-rag-ap-{rand_id}'
identity = boto3.client('sts').get_caller_identity()['Arn']

aoss_client = boto3.client('opensearchserverless')

security_policy = aoss_client.create_security_policy(
    name = encryption_policy_name,
    policy = json.dumps(
        {
            'Rules': [{'Resource': ['collection/' + vector_store_name],
            'ResourceType': 'collection'}],
            'AWSOwnedKey': True
        }),
    type = 'encryption'
)

network_policy = aoss_client.create_security_policy(
    name = network_policy_name,
    policy = json.dumps(
        [
            {'Rules': [{'Resource': ['collection/' + vector_store_name],
            'ResourceType': 'collection'}],
            'AllowFromPublic': True}
        ]),
    type = 'network'
)

collection = aoss_client.create_collection(name=vector_store_name,type='VECTORSEARCH')

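while True:
    status = aoss_client.list_collections(collectionFilters={'name':vector_store_name})['collectionSummaries'][0]['status']
    if status in ('ACTIVE', 'FAILED'): break
    time.sleep(10)

# stop here rather than continuing with a collection that never became active
if status == 'FAILED':
    raise RuntimeError(f"Collection {vector_store_name} failed to create")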

access_policy = aoss_client.create_access_policy(
    name = access_policy_name,
    policy = json.dumps(
        [
            {
                'Rules': [
                    {
                        'Resource': ['collection/' + vector_store_name],
                        'Permission': [
                            'aoss:CreateCollectionItems',
                            'aoss:DeleteCollectionItems',
                            'aoss:UpdateCollectionItems',
                            'aoss:DescribeCollectionItems'],
                        'ResourceType': 'collection'
                    },
                    {
                        'Resource': ['index/' + vector_store_name + '/*'],
                        'Permission': [
                            'aoss:CreateIndex',
                            'aoss:DeleteIndex',
                            'aoss:UpdateIndex',
                            'aoss:DescribeIndex',
                            'aoss:ReadDocument',
                            'aoss:WriteDocument'],
                        'ResourceType': 'index'
                    }],
                'Principal': [identity],
                'Description': 'Easy data policy'}
        ]),
    type = 'data'
)

host = collection['createCollectionDetail']['id'] + '.' + os.environ.get("AWS_DEFAULT_REGION", None) + '.aoss.amazonaws.com:443'

In [None]:
%store access_policy_name
%store network_policy_name
%store index_name
%store vector_store_name
%store encryption_policy_name

Here is the detailed information about the OpenSearch Serverless vector store we just created. Make a note of this information, as we'll use it later in the lab.

In [None]:
print(f"vector store host URL: https://{host}")
print(f"vector store name: {vector_store_name}")
print(f"vector store index name: {index_name}")
print(f"vector store encryption policy name: {encryption_policy_name}")
print(f"vector store network policy name: {network_policy_name}")
print(f"vector store access policy name: {access_policy_name}")

Now we are ready to ingest our documents into the vector store. This can be easily done using the [OpenSearch](https://python.langchain.com/docs/integrations/vectorstores/opensearch) implementation inside [LangChain](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/faiss.html), which takes the embeddings model and the documents as input to create the entire vector store.

In [None]:
# Uncomment this following block for testing with existing index.

# from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
# from langchain.vectorstores import OpenSearchVectorSearch

# # host="https://ah3o7jn8lhwd5fv1qd0k.us-east-1.aoss.amazonaws.com:443"
# # index_name = "bedrock-workshop-rag-index-f9692"
# index_name = "bedrock-workshop-rag-index-74f28"
# service = 'aoss'
# credentials = boto3.Session().get_credentials()
# auth = AWSV4SignerAuth(credentials, os.environ.get("AWS_DEFAULT_REGION", None), service)

# docsearch = OpenSearchVectorSearch(
#     opensearch_url=host,
#     embedding_function=embeddings,
#     http_auth=auth,
#     timeout = 100,
#     use_ssl = True,
#     verify_certs = True,
#     connection_class=RequestsHttpConnection,
#     index_name=index_name,
#     engine="faiss",
#     bulk_size=1000
# )

In [None]:
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
from langchain.vectorstores import OpenSearchVectorSearch

service = 'aoss'
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, os.environ.get("AWS_DEFAULT_REGION", None), service)

docsearch = OpenSearchVectorSearch.from_documents(
    documents,
    embeddings,
    opensearch_url=host,
    http_auth=auth,
    timeout = 100,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    index_name=index_name,
    engine="faiss",
    bulk_size=len(documents)
)

## LangChain Vector Store and Querying

#### We can use the similarity search method to make a query and return the chunks of text without any LLM generating the response.

It takes a few seconds for documents to become available in the index. If you get an empty output in the next cell, just wait a little bit and retry. 

In [None]:
query = "What did Dr. Aiden Smith, a leading AI researcher at Stanford University announced?"

results = docsearch.similarity_search_with_score(query, k=3)  # our search query  # return 3 most relevant docs
for i, result in enumerate(results):
    print(f"#{i}: {result}\n")

#### Amazon OpenSearch as VectorStore

In order to be able to use embeddings for search, we need a store that can efficiently perform vector similarity searches. In this notebook we use OpenSearch Serverless. 

The LangChain VectorStore APIs are available [here](https://python.langchain.com/en/harrison-docs-refactor-3-24/reference/modules/vectorstore.html)

#### Semantic search

We can use a wrapper class provided by LangChain to query the vector database and return the relevant documents. Behind the scenes, this just runs a RetrievalQA chain.
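
Conceptually, a retrieval QA chain of type `stuff` takes the retrieved chunks, joins them into one context string, and fills the prompt template with that context and the question. The plain-Python sketch below illustrates only that combining step; `stuff_documents`, the shortened template, the separator, and the placeholder chunks are all assumptions made for the illustration, not LangChain internals.

```python
def stuff_documents(question: str, docs: list, template: str, separator: str = "\n\n") -> str:
    # "Stuffing" = put every retrieved chunk into a single context string.
    context = separator.join(docs)
    return template.format(question=question, context=context)


sketch_template = "### Question\n{question}\n\n### Context\n{context}"
retrieved = ["Chunk about topic A.", "Chunk about topic B."]  # placeholder chunks
stuffed_prompt = stuff_documents("What is topic A?", retrieved, sketch_template)
print(stuffed_prompt)
```

Since every retrieved chunk ends up in a single prompt, the number of chunks returned by the retriever and the chunk size together determine how much of the context window is consumed.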

In [None]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt_template = """<s>[INST] <<SYS>>
Given the following context, answer the question as accurately as possible:
<</SYS>>

### Question
{question}

### Context
{context}[/INST] """

stop = ["\[INST\]", "\[/INST\]", "Human:", "<\|im_sep\|>", "</s>", "<INST>"]


PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

qa_prompt = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT},
)

query ="What did Dr. Aiden Smith, a leading AI researcher at Stanford University announced?"
result = qa_prompt({"query": query})
print(result["result"])

Let's see how the semantic search works:
1. First we calculate the embeddings vector for the query, and
2. then we use this vector to do a similarity search on the store
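
The two steps above can be sketched with toy vectors in plain Python. This is an illustration only: the three-dimensional vectors stand in for real embeddings, `cosine_similarity` and `top_k` are hypothetical helpers, and cosine similarity is an assumed metric (the score returned by the actual vector store depends on how the index is configured).

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=3):
    # store: list of (text, vector) pairs; returns the k best (text, score) pairs
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "index": each entry pairs a chunk label with a made-up embedding.
toy_store = [
    ("sports article", [1.0, 0.0, 0.0]),
    ("politics article", [0.0, 1.0, 0.0]),
    ("media article", [0.0, 0.6, 0.8]),
]
toy_query = [0.0, 0.9, 0.1]  # stands in for the embedded question
toy_results = top_k(toy_query, toy_store, k=2)
for text, score in toy_results:
    print(f"{text}: {score:.3f}")
```

A production vector store avoids scoring every stored vector by using an approximate nearest-neighbor index, but the ranking idea is the same.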

In [None]:
query = "What did the robots do to take over the world?"
results = docsearch.similarity_search_with_score(query, k=3)
for r in results:
    print(f"Content: {r[0].page_content}, Similarity Score: {r[1]}")
    print('----')

#### Memory
In any chatbot we will need a QA Chain with various options which are customized by the use case. But in a chatbot we will always need to keep the history of the conversation so the model can take it into consideration to provide the answer. In this example we use the [ConversationalRetrievalChain](https://python.langchain.com/docs/modules/chains/popular/chat_vector_db) from LangChain, together with a ConversationBufferMemory to keep the history of the conversation.

Source: https://python.langchain.com/docs/modules/chains/popular/chat_vector_db

Set `verbose` to `True` to see everything that is going on behind the scenes.

In [None]:
condense_question_prompt_template = """<s>
[INST] Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.

### Chat History
{chat_history}

### Follow Up Input: {question}

Standalone question:[/INST] """
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(condense_question_prompt_template)

#### Parameters used for ConversationalRetrievalChain
* **retriever**: We used `VectorStoreRetriever`, which is backed by a `VectorStore`. To retrieve text, there are two search types you can choose: `"similarity"` or `"mmr"`. `search_type="similarity"` uses similarity search in the retriever object where it selects text chunk vectors that are most similar to the question vector.

* **memory**: Memory Chain to store the history 

* **condense_question_prompt**: Given a question from the user, we use the previous conversation and that question to make up a standalone question

* **chain_type**: Controls how the retrieved documents are combined into the prompt. `stuff` places all of them in a single prompt; if that doesn't fit the model's context window, the other options are `refine`, `map_reduce`, and `map_rerank`

If the question asked is outside the scope of context, then the model will reply it doesn't know the answer

**Note**: if you are curious how the chain works, keep `verbose=True` in the cell below; set it to `False` to hide the logs.
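
To see what the condense step sends to the model, the sketch below fills a copy of the condense-question instruction with a short chat history and a follow-up question using plain string formatting. The chat history text is a placeholder invented for this illustration, and the Llama 2 `[INST]` markers are left out to keep the example short.

```python
condense_template_sketch = (
    "Given the following conversation and a follow up question, "
    "rephrase the follow up question to be a standalone question.\n\n"
    "### Chat History\n{chat_history}\n\n"
    "### Follow Up Input: {question}\n\n"
    "Standalone question:"
)

# Placeholder history; in the chain this comes from the memory object.
chat_history_sketch = "Human: Who is Albert Einstein?\nAI: A theoretical physicist."
condense_input = condense_template_sketch.format(
    chat_history=chat_history_sketch,
    question="When was he born?",
)
print(condense_input)
```

The follow-up question on its own ("When was he born?") carries no searchable subject. The rewritten standalone question is what gets embedded and sent to the retriever, which is why this step matters for retrieval quality.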

In [None]:
# turn verbose to true to see the full logs and documents
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

memory_chain = ConversationBufferMemory(memory_key="chat_history", ai_prefix="AI", human_prefix="Human", input_key="question", return_messages=True)
qa = ConversationalRetrievalChain.from_llm(
    llm=llm, 
    retriever=docsearch.as_retriever(), 
    memory=memory_chain,
    condense_question_prompt=CONDENSE_QUESTION_PROMPT,
    verbose=True, 
    chain_type='stuff', # 'refine',
    combine_docs_chain_kwargs = { "prompt" : PROMPT}
)

Let's chat! Ask the chatbot some questions about the sample articles, like:
1. The Main Street Bakery is run by who?
2. What did NASA scientists discover that could have major implications for the search for life outside our solar system?

In [None]:
question = "The Main Street Bakery is run by who?"
qa.run({'question': question })

In [None]:
question = "What did NASA scientists discover that could have major implications for the search for life outside our solar system?"
qa.run({'question': question })

#### Do some prompt engineering

You can "tune" your prompt to get more or less verbose answers. For example, try adding an instruction that limits the number of sentences, or removing such an instruction altogether. You might also need to change `max_new_tokens` in `model_params` (e.g. 1000 or 2000) to get the full answer.

### In this demo we used the Llama 2 LLM to create a conversational interface with the following patterns:

1. Chatbot (Basic - without context)

2. Chatbot using prompt template (LangChain)

3. Chatbot with personas

4. Chatbot with context

# Next Step
In the next step, we'll combine everything that we've built so far to focus on a fully functional chatbot application using [streamlit](https://streamlit.io/). Please follow the 2nd part of the instructions provided in the workshop [here](https://catalog.us-east-1.prod.workshops.aws/workshops/958877b7-af54-434e-8133-15bbb7693947/en-US/labs/lab5#instructions) to deploy the application in your SageMaker Studio environment. 

# Clean up
We provide a notebook to [clean up](cleanup.ipynb) the endpoints deployed in this workshop. This notebook will remove the SageMaker endpoints provisioned in the previous labs, and the Amazon OpenSearch Serverless collection that we created in this lab.

Please note that the infrastructure created through SageMaker projects and the S3 bucket are not deleted. 
To delete the infrastructure created in the SageMaker project, locate the corresponding CloudFormation stacks and remove them by clicking the ```delete``` button.