# Conversational Search

---


In this lab, we use the LangChain framework to implement a Retrieval Augmented Generation (RAG) solution. RAG retrieves data from outside the language model (non-parametric memory) and augments the prompt by adding the relevant retrieved data as context. RAG models were introduced by Lewis et al. in 2020 as a model in which the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever.

In RAG, the external data can come from multiple sources, such as a document repository, databases, or APIs. The first step is to convert the documents and the user query into a common format so that they can be compared in a relevancy search. To do this, the document collection (knowledge library) and the user-submitted query are converted to numerical representations using an embedding language model. The embeddings are essentially numerical representations of the concepts in the text. Next, based on the embedding of the user query, the relevant text is identified in the document collection by a similarity search in the embedding space. The retrieved text is then appended to the user's prompt as context. Finally, the augmented prompt is sent to the LLM. Because the context now contains relevant external data along with the original prompt, the model output is more likely to be relevant and accurate.
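
The flow described above can be sketched in a few lines of plain Python. This is only an illustration: the bag-of-words `embed` function stands in for a real embedding model, and the sample sentences, function names, and prompt layout are all made up for this sketch rather than taken from the lab.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list, k: int = 2) -> list:
    """Return the k documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vector, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context_docs: list) -> str:
    """Augment the user query with the retrieved context."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Sample sentences invented for illustration only.
documents = [
    "Choose a shard size that fits your workload.",
    "Use dedicated master nodes for production domains.",
    "Bananas are rich in potassium.",
]
query = "How large should a shard be?"
top_docs = retrieve(query, documents, k=1)
prompt = build_prompt(query, top_docs)
print(prompt)
```

In the lab itself, the embedding step is performed by a SageMaker endpoint and the similarity search by an OpenSearch kNN index.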

For more information about RAG with LangChain, please refer to: https://python.langchain.com/docs/use_cases/question_answering/

---

The lab includes the following steps:
1. [Step 1: Initialize](#Step-1:-Initialize)
2. [Step 2: Verify deployed endpoint for embedding and content generation model](#Step-2:-Verify-deployed-endpoint-for-embedding-and-content-generation-model)
3. [Step 3: Test LLM without context information](#Step-3:-Test-LLM-without-context-information)
4. [Step 4: Load documents with LangChain document loader and store vector into OpenSearch](#Step-4:-Load-documents-with-LangChain-document-loader-and-store-vector-into-OpenSearch)
5. [Step 5: Retrieval Augmented Generation](#Step-5:-Retrieval-Augmented-Generation)
6. [Step 6: Conversational search by memorizing the history](#Step-6:-Conversational-search-by-memorizing-the-history)


## Step 1: Initialize

Install the required libraries, such as the OpenSearch client library and LangChain.

In [None]:
!pip install --upgrade sagemaker 
!pip install opensearch-py
!pip install unstructured 
!pip install transformers
!pip install langchain==0.3.1
!pip install langchain-aws==0.2.1
!pip install langchain-community==0.3.1
!pip install beautifulsoup4

Initialize SageMaker, Boto3

In [None]:
import sagemaker, boto3, json
from sagemaker.session import Session

sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

Get CloudFormation stack output variables

We also need to grab some key values from the infrastructure we provisioned using CloudFormation. To do this, we list the outputs from the stack and store them in `outputs` to be used later.

You can ignore any "PythonDeprecationWarning" warnings.

In [None]:
region = aws_region

cfn = boto3.client('cloudformation')

def get_cfn_outputs(stackname):
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

# Set up variables to use for the rest of the demo
cloudformation_stack_name = "semantic-search"

outputs = get_cfn_outputs(cloudformation_stack_name)
aos_host = outputs['OpenSearchDomainEndpoint']

outputs

## Step 2: Verify deployed endpoint for embedding and content generation model

As shown in the architecture diagram, we use two models in this lab:
- Embedding model: converts text into vectors
- Text generation LLM: generates content

---

We will verify that the two endpoints are ready before running the lab.

### Get endpoint for embedding

---
This is a SageMaker endpoint hosting the GPT-J 6B parameter model, which converts text into vectors.


In [None]:
embedding_endpoint_name=outputs['EmbeddingEndpointName']
print(embedding_endpoint_name)

Verify embedding endpoint is ready

In [None]:
import time

sm_client = boto3.client("sagemaker", aws_region)

describe_embedding_endpoint_response = sm_client.describe_endpoint(EndpointName=embedding_endpoint_name)

while describe_embedding_endpoint_response["EndpointStatus"] == 'Creating':
    time.sleep(15)
    print('.', end='')
    describe_embedding_endpoint_response = sm_client.describe_endpoint(EndpointName=embedding_endpoint_name)
print('embedding endpoint created')
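
The loop above stops polling as soon as the status is no longer `Creating`, so it prints the same message whether the endpoint ended up `InService` or `Failed`. Below is a minimal sketch of a stricter wait, assuming a hypothetical helper named `wait_for_endpoint` that receives the describe call as a function. `InService` and `Failed` are real SageMaker endpoint statuses; the helper name, the `max_polls` limit, and the fake status sequence are assumptions made for illustration.

```python
import time
from typing import Callable


def wait_for_endpoint(describe: Callable[[], dict],
                      poll_seconds: float = 15,
                      max_polls: int = 120) -> str:
    """Poll until the endpoint is InService; raise if it fails or times out."""
    for _ in range(max_polls):
        status = describe()["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("endpoint creation failed")
        time.sleep(poll_seconds)
    raise TimeoutError("endpoint did not become ready in time")


# Fake describe call so the sketch runs without AWS access.
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_endpoint(
    lambda: {"EndpointStatus": next(fake_statuses)}, poll_seconds=0
)
print(final_status)
```

In the notebook, the same helper could be called with `lambda: sm_client.describe_endpoint(EndpointName=embedding_endpoint_name)`.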

### Get endpoint for content generation

---
This is a SageMaker endpoint hosting the Falcon 7B parameter model, which generates text.


In [None]:
llm_endpoint_name=outputs['LLMEndpointName']
print(llm_endpoint_name)

Verify LLM endpoint is ready

In [None]:
sm_client = boto3.client("sagemaker", aws_region)

describe_llm_endpoint_response = sm_client.describe_endpoint(EndpointName=llm_endpoint_name)

while describe_llm_endpoint_response["EndpointStatus"] == 'Creating':
    time.sleep(15)
    print('.', end='')
    describe_llm_endpoint_response = sm_client.describe_endpoint(EndpointName=llm_endpoint_name)
print('LLM endpoint created')

### Test embedding endpoint

In [None]:
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
from langchain.embeddings import SagemakerEndpointEmbeddings


class TestContentHandler(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        embeddings = response_json["embedding"]
        if len(embeddings) == 1:
            return [embeddings[0]]
        return embeddings


test_content_handler = TestContentHandler()

test_embeddings = SagemakerEndpointEmbeddings(
    endpoint_name=embedding_endpoint_name,
    region_name=aws_region,
    content_handler=test_content_handler,
)

In [None]:
print(test_embeddings.embed_documents(["Hello World"])[0][:5])
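
To make the content handler easier to follow, here is a small offline sketch of the JSON that `transform_input` builds and that `transform_output` parses. The `text_inputs` and `embedding` keys are taken from the handler above; the function names, the three-element vector, and the `io.BytesIO` body are stand-ins so the example runs without calling the endpoint.

```python
import io
import json


def build_request(texts, model_kwargs=None) -> bytes:
    """Serialize the texts the same way the handler's transform_input does."""
    return json.dumps({"text_inputs": texts, **(model_kwargs or {})}).encode("utf-8")


def parse_response(body) -> list:
    """Read a streaming body and return the list of embedding vectors."""
    return json.loads(body.read().decode("utf-8"))["embedding"]


request = build_request(["Hello World"])

# Fake response body; a real embedding has many more dimensions.
fake_body = io.BytesIO(json.dumps({"embedding": [[0.1, 0.2, 0.3]]}).encode("utf-8"))
vectors = parse_response(fake_body)

print(request)
print(vectors)
```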

### Test LLM endpoint

In [None]:
def query_endpoint_with_json_payload(encoded_json, endpoint_name, content_type="application/json"):
    client = boto3.client("runtime.sagemaker")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType=content_type, Body=encoded_json
    )
    return response

#method used to parse the inference model's response. we pass it as part of the model's config
def parse_response_model(query_response):
    model_predictions = json.loads(query_response["Body"].read())
    return [gen["generated_text"] for gen in model_predictions]


In [None]:
question = "How to determine shard and data node counts for OpenSearch?"

In [None]:
payload = {
    "inputs": question,
    "parameters":{
        "max_new_tokens": 1024,
        "num_return_sequences": 1,
        "top_k": 100,
        "top_p": 0.95,
        "do_sample": False,
        "return_full_text": True,
        "temperature": 0.9
    }
}



In [None]:
query_response = query_endpoint_with_json_payload(
    json.dumps(payload).encode("utf-8"), endpoint_name=llm_endpoint_name
)

generated_texts = parse_response_model(query_response)

print(f"The generated output is: {generated_texts[0]}\n")

## Step 3: Test LLM without context information

---

This step illustrates why we need a retrieval-augmented generation (RAG) approach to solve the question-answering problem. An LLM can generate an answer without any context information; however, the generated content may be a hallucination.

In [None]:
from uuid import uuid4
from typing import Dict
from langchain.memory import ConversationBufferMemory
from langchain.memory import DynamoDBChatMessageHistory
from langchain.memory import ConversationBufferWindowMemory
from langchain import PromptTemplate, SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain.chains import RetrievalQA


class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"
    
    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        input_str = json.dumps({"inputs": prompt, "parameters": model_kwargs})
        #print("Prompt Input:\n" + input_str)
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        #print("LLM generated text:\n" + response_json[0]["generated_text"])
        return response_json[0]["generated_text"]
    

content_handler = ContentHandler()


In [None]:
params = {
        "max_length": 4096,
        "max_new_tokens": 1024,
        "num_return_sequences": 1,
        "top_k": 100,
        "top_p": 0.95,
        "do_sample": False,
        "return_full_text": False,
        "temperature": 0.9
        }

In [None]:
llm_hallucination=SagemakerEndpoint(
        endpoint_name=llm_endpoint_name,
        region_name=aws_region,
        model_kwargs=params,
        content_handler=content_handler,
)

---

Let's ask the model a question directly and see how it responds.

In [None]:
print("Question is:" + question)
generated_result = llm_hallucination(question)
print(generated_result)

Ask the same question again to compare the results.

In [None]:
print("Question is:" + question)
generated_result = llm_hallucination(question)
print(generated_result)

---

The generated results are based only on the LLM's training data. They sound plausible but may be inaccurate, and with sampling enabled (`do_sample` set to `True`) the generated content can differ on every run. This is the disadvantage of LLM hallucination.
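
Whether repeated calls return different text depends largely on the decoding parameters, such as `do_sample` and `temperature` in `params`. The toy sketch below shows the general idea with a made-up three-token vocabulary and made-up logits: greedy decoding always picks the highest-scoring token, while sampling draws from a temperature-scaled softmax. It is a simplified illustration, not the endpoint's actual decoding code.

```python
import math
import random


def softmax(logits, temperature):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(tokens, logits, do_sample, temperature=1.0, rng=random):
    """Greedy pick when do_sample is False, otherwise sample from the softmax."""
    if not do_sample:
        return tokens[logits.index(max(logits))]
    return rng.choices(tokens, weights=softmax(logits, temperature), k=1)[0]


# Made-up vocabulary and scores for illustration.
tokens = ["shards", "nodes", "bananas"]
logits = [2.0, 1.5, 0.1]

greedy = {pick_token(tokens, logits, do_sample=False) for _ in range(20)}
sampled = [pick_token(tokens, logits, True, 0.9, random.Random(seed)) for seed in range(20)]

print(greedy)        # a single token, because greedy decoding is deterministic
print(set(sampled))  # sampling may produce several different tokens
```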

---


## Step 4: Load documents with LangChain document loader and store vector into OpenSearch

Use document loaders to load data from a source as `Document` objects. A `Document` is a piece of text with associated metadata. For example, there are document loaders for loading a simple .txt file, for loading the text contents of any web page, or even for loading a transcript of a YouTube video.

---

The following data flow diagram shows how documents are loaded and their vectors stored in OpenSearch.

![retriever](./image/module8/document-loader.png)


Document loaders expose a `load` method for loading data as documents from a configured source. Here, we use `RecursiveUrlLoader` to load the OpenSearch best practices web page.

In [None]:
from langchain_community.document_loaders import RecursiveUrlLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 100)

loader = RecursiveUrlLoader(
    "https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp.html",
    max_depth=1,
    # use_async=False,
    # extractor=None,
    # metadata_extractor=None,
    # exclude_dirs=(),
    # timeout=10,
    # check_response_status=True,
    # continue_on_failure=True,
    # prevent_outside=True,
    # base_url=None,
    # ...
)
url_texts = loader.load_and_split(text_splitter=text_splitter)

Show one example document

In [None]:
all_splits = url_texts
print(url_texts[0])
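
To see how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified sketch that cuts fixed-size windows over the characters of a string. `RecursiveCharacterTextSplitter` is more sophisticated (it prefers to split on separators such as paragraph and line breaks), so treat the hypothetical `split_with_overlap` function below as an illustration of the two parameters only.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not fully contained in the previous one.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample_text = "abcdefghijklmnopqrstuvwxy"  # 25 characters
chunks = split_with_overlap(sample_text, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last two characters of the previous one, which is the role `chunk_overlap = 100` plays for the 1000-character chunks in this lab.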

Create an OpenSearch cluster connection.
Next, we'll use the Python API to set up a connection with the Amazon OpenSearch Service domain.

In [None]:
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
import boto3
import json

kms = boto3.client('secretsmanager')
aos_credentials = json.loads(kms.get_secret_value(SecretId=outputs['OpenSearchSecret'])['SecretString'])



#credentials = boto3.Session().get_credentials()
#auth = AWSV4SignerAuth(credentials, region)
auth = (aos_credentials['username'], aos_credentials['password'])

index_name = 'nlp_pqa'

aos_client = OpenSearch(
    hosts = [{'host': aos_host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection
)

### LangChain embedding endpoint

To build a simplified QA application with LangChain, we need to wrap our SageMaker endpoints for the embedding model and the LLM in `langchain.embeddings.SagemakerEndpointEmbeddings` and `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`. This requires overriding methods of the `SagemakerEndpointEmbeddings` class to make it compatible with the SageMaker embedding model.

---

The embedding language model is GPT-J, and the endpoint name is stored in `embedding_endpoint_name`.

In [None]:
from typing import Any, Dict, Iterable, List, Optional, Tuple, Callable
import json
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
from langchain.schema import Document

class BulkSagemakerEndpointEmbeddings(SagemakerEndpointEmbeddings):
        def embed_documents(
            self, texts: List[str], chunk_size: int = 5
        ) -> List[List[float]]:
            results = []
            _chunk_size = len(texts) if chunk_size > len(texts) else chunk_size

            for i in range(0, len(texts), _chunk_size):
                response = self._embedding_func(texts[i:i + _chunk_size])
                results.extend(response)
            return results
        
class EmbeddingContentHandler(EmbeddingsContentHandler):
        content_type = "application/json"
        accepts = "application/json"

        def transform_input(self, prompt: str, model_kwargs={}) -> bytes:

            input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
            return input_str.encode('utf-8') 

        def transform_output(self, output: bytes) -> str:

            response_json = json.loads(output.read().decode("utf-8"))
            embeddings = response_json["embedding"]
            if len(embeddings) == 1:
                return [embeddings[0]]
            return embeddings

print(embedding_endpoint_name)
embeddings = BulkSagemakerEndpointEmbeddings( 
            endpoint_name=embedding_endpoint_name,
            region_name=aws_region, 
            content_handler=EmbeddingContentHandler())


### OpenSearch vector store

Use `OpenSearchVectorSearch` in LangChain to ingest vectors into OpenSearch. You can specify additional parameters to create a kNN index with specific properties, for example:

- `engine`: "nmslib", "faiss", or "lucene"; default: "nmslib"
- `space_type`: "l2", "l1", "cosinesimil", "linf", or "innerproduct"; default: "l2"
- `ef_search`: Size of the dynamic list used during k-NN searches. Higher values lead to more accurate but slower searches; default: 512
- `ef_construction`: Size of the dynamic list used during k-NN graph creation. Higher values lead to a more accurate graph but slower indexing speed; default: 512
- `m`: Number of bidirectional links created for each new element. Has a large impact on memory consumption. Between 2 and 100; default: 16
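
As a rough sketch of where these parameters end up, the function below builds an index body in the shape used by the OpenSearch k-NN plugin for a `knn_vector` field with an HNSW method. The helper name `knn_index_body`, the field name `vector_field`, and the dimension of 4096 are assumptions made for this illustration; check the index actually created by LangChain with `aos_client.indices.get` for the real values.

```python
import json


def knn_index_body(dimension: int,
                   engine: str = "nmslib",
                   space_type: str = "l2",
                   ef_search: int = 512,
                   ef_construction: int = 512,
                   m: int = 16,
                   vector_field: str = "vector_field") -> dict:
    """Build a kNN index body; defaults mirror the parameter list above."""
    return {
        "settings": {
            "index": {"knn": True, "knn.algo_param.ef_search": ef_search}
        },
        "mappings": {
            "properties": {
                vector_field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "space_type": space_type,
                        "engine": engine,
                        "parameters": {"ef_construction": ef_construction, "m": m},
                    },
                }
            }
        },
    }


# Assumed embedding dimension; adjust to the output size of your embedding model.
index_body = knn_index_body(dimension=4096, engine="faiss",
                            space_type="innerproduct", ef_construction=256, m=48)
print(json.dumps(index_body, indent=2))
```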



In [None]:
from langchain.vectorstores import OpenSearchVectorSearch

os_domain_ep = 'https://'+aos_host

embedding_index_name = 'opensearch_kb_vector'

# Ingest in batches of 500 documents. Slicing past the end of a list is
# safe in Python, so the final batch simply holds the remaining documents.
batch_size = 500
for start in range(0, len(all_splits), batch_size):
    end = min(start + batch_size, len(all_splits))
    docs = all_splits[start:end]
    OpenSearchVectorSearch.from_documents(
        index_name=embedding_index_name,
        documents=docs,
        embedding=embeddings,
        opensearch_url=os_domain_ep,
        http_auth=auth
    )
    print(f"ingested documents {start} to {end}")
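
Batching logic like this is easy to get wrong at the boundaries, so here is a small standalone sketch. The hypothetical `batched` helper relies on the fact that Python slices may extend past the end of a list, which means the final, shorter batch needs no special handling and no document is dropped.

```python
def batched(items: list, batch_size: int):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for all_splits: 1203 fake documents.
fake_documents = [f"doc-{i}" for i in range(1203)]
batch_sizes = [len(batch) for batch in batched(fake_documents, 500)]
print(batch_sizes)

# Every document appears exactly once across the batches.
flattened = [doc for batch in batched(fake_documents, 500) for doc in batch]
print(flattened == fake_documents)
```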

In [None]:
aos_client.indices.get(index=embedding_index_name)

In [None]:
#aos_client.indices.delete(index=embedding_index_name)

When you use LangChain `OpenSearchVectorSearch` to store embeddings in an OpenSearch kNN index, you can specify parameters to choose between different Approximate Nearest Neighbor (ANN) algorithms. For more information, please refer to the OpenSearch kNN documentation: https://opensearch.org/docs/latest/search-plugins/knn/knn-index/

In [None]:
customized_embedding_index_name = 'customized_opensearch_kb_vector'

OpenSearchVectorSearch.from_documents(
            index_name = customized_embedding_index_name,
            documents=all_splits,
            embedding=embeddings,
            opensearch_url=os_domain_ep,
            http_auth=auth,
            engine="faiss",
            space_type="innerproduct",
            ef_construction=256,
            m=48,
        )
print(f"ingest documents into customized knn index")

In [None]:
aos_client.indices.get(index=customized_embedding_index_name)

We can use `OpenSearchVectorSearch` directly as the vector store, or we can extend the class and define a new function that calculates the document relevance score if we want to filter documents by relevance score.

In [None]:
class SimiliarOpenSearchVectorSearch(OpenSearchVectorSearch):
    
    def relevance_score(self, distance: float) -> float:
        return distance
    
    def _select_relevance_score_fn(self) -> Callable[[float], float]:
        return self.relevance_score
    

open_search_vector_store = SimiliarOpenSearchVectorSearch(
                                    index_name=embedding_index_name,
                                    embedding_function=embeddings,
                                    opensearch_url=os_domain_ep,
                                    http_auth=auth
                                    ) 

Show the documents that are similar to the question "How to determine shard and data node counts for OpenSearch?". By default, 4 documents are returned. You can change this with the `k` parameter. See the [doc](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.opensearch_vector_search.OpenSearchVectorSearch.html#langchain.vectorstores.opensearch_vector_search.OpenSearchVectorSearch.similarity_search_with_score) for more information.


In [None]:
docs_ = open_search_vector_store.similarity_search_with_score(question, k=5)

print("found document number:" + str(len(docs_)))

print("opensearch results:\n")
for doc in docs_:
    print(doc)
    print("\n-----------------")

## Step 5: Retrieval Augmented Generation

---

To mitigate LLM hallucination, we can provide some context to the LLM and let it generate the answer from that context. The following diagram shows the RAG data flow:

![rag](./image/module8/rag.png)

---


In this lab, we use the OpenSearch vector store as the retriever to get documents similar to the query. We can also specify a similarity score threshold so that only highly relevant documents are returned. Use `k` to limit how many documents are returned.



In [None]:
retriever = open_search_vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        'k': 5,
        'score_threshold': 0.62
    }
)
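
The effect of `k` and `score_threshold` can be illustrated offline with a few hand-written (document, score) pairs. The hypothetical `filter_by_score` function below is a simplified sketch of the idea (take the top `k` results, then drop anything below the threshold), not LangChain's actual implementation, and the scores are invented.

```python
def filter_by_score(scored_docs: list, k: int, score_threshold: float) -> list:
    """Keep the k highest-scoring documents, then drop those below the threshold."""
    top_k = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:k]
    return [doc for doc, score in top_k if score >= score_threshold]


# Invented relevance scores for illustration.
scored = [("doc A", 0.91), ("doc B", 0.70), ("doc C", 0.55), ("doc D", 0.64)]

relevant = filter_by_score(scored, k=5, score_threshold=0.62)
print(relevant)

# With a smaller k, fewer documents survive even if they pass the threshold.
top_two = filter_by_score(scored, k=2, score_threshold=0.62)
print(top_two)
```

If the threshold is set too high, the retriever may return no documents at all, so it is worth checking the scores printed by `similarity_search_with_score` before choosing a value.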

Define LLM SageMaker endpoint and `RetrievalQA` Chain

---

In [None]:
params = {
        "max_length": 2048,
        "max_new_tokens": 512,
        "num_return_sequences": 1,
        "top_k": 200,
        "top_p": 0.9,
        "do_sample": False,
        "return_full_text": False,
        "temperature": 0.0001
        }

llm=SagemakerEndpoint(
        endpoint_name=llm_endpoint_name,
        region_name=aws_region,
        model_kwargs=params,
        content_handler=content_handler,
)

In [None]:
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff" #stuff, refine, map_reduce, and map_rerank
)

Use RAG to generate an answer to the same question as before. Compare the content generated with RAG against the LLM output without context.

---

In [None]:
print("Question is:" + question)
result = qa({"query": question})

print("result:" + result["result"])
  

### Use customized prompt for RAG

---
The content generated by RAG is much better than the answer without any context. However, some of the generated bullets may be duplicated. You can improve the generated content by optimizing the prompt sent to the LLM. The following is an example of a customized prompt.

---

In [None]:
template2 = """Answer the question as truthfully as possible by using the provided information in >>CONTEXT<<. If the answer is not contained within the >>CONTEXT<<, respond with "I can't answer that".

>>CONTEXT<<:
{context}

>>QUESTION<<:
{question}

>>Answer<<:
"""


prompt_template2 = PromptTemplate(
    input_variables=["question", "context"],
    template=template2
)

qa_with_prompt = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt_template2}
)
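
With the `stuff` chain type, the retrieved documents are concatenated and placed into the `{context}` slot of the prompt, and the user query goes into `{question}`. The sketch below reproduces that idea with plain string formatting; the shortened template, the sample chunks, the blank-line separator, and the `stuff_prompt` name are assumptions for illustration.

```python
# Shortened template in the same style as the customized prompt above.
template = """Answer the question using the information in >>CONTEXT<<.

>>CONTEXT<<:
{context}

>>QUESTION<<:
{question}

>>Answer<<:
"""


def stuff_prompt(template: str, docs: list, question: str, separator: str = "\n\n") -> str:
    """Join all retrieved chunks and fill both placeholders of the template."""
    return template.format(context=separator.join(docs), question=question)


# Invented chunks standing in for retrieved documents.
retrieved_chunks = ["first chunk", "second chunk"]
final_prompt = stuff_prompt(template, retrieved_chunks, "How many shards do I need?")
print(final_prompt)
```

Because every retrieved chunk is stuffed into one prompt, the values of `k` and `chunk_size` together determine how much of the model's context window is consumed.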


Run the QA chain with the question. You can set `langchain.debug = True` if you want to see the debug information.

---

In [None]:
import langchain

#langchain.debug = True
langchain.debug = False

print("Question is:" + question)
result = qa_with_prompt({"query": question})

print("\n### Generated result:" + result["result"])


### Return source documents

You can also return the source documents to help you locate the original knowledge base documents that are related to the question.

---

In [None]:
qa_with_source = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt_template2}
)

In [None]:
print("Question is:" + question)
result = qa_with_source({"query": question})

print("result:" + result["result"])
print("\n\n===========================")
print("\nsource documents:")
for doc in result["source_documents"]:
    print(doc)
    print("---------------------------\n")

## Step 6: Conversational search by memorizing the history 

### LangChain Memory with Amazon DynamoDB as data store

In the above example, you can ask the system any question, but there is no relation among the questions. In a typical search system, you may want to implement conversational search. An essential component of a conversation is being able to refer to information introduced earlier in the conversation. LangChain provides many utilities for adding memory to a system. These utilities can be used by themselves or incorporated into a chain. In this lab, we use [Amazon DynamoDB](https://aws.amazon.com/dynamodb/) as the data store for the message history.

---
The data flow of conversational search with memory is as follows:

![rag](./image/module8/rag-with-memory.png)

---
Here we create a new session and use DynamoDB as the backend to store the conversation history.

In [None]:
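# messages_to_dict is used below when copying an existing session's history
from langchain_core.messages import messages_to_dict

ddb_table_name = "conversation-history-memory"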
session_id = str(uuid4())
chat_memory = DynamoDBChatMessageHistory(
        table_name=ddb_table_name,
        session_id=session_id
    )

messages = chat_memory.messages

# Maintains immutable sessions
# If previous session was present, create
# a new session and copy messages, and 
# generate a new session_id 
if messages:
    session_id = str(uuid4())
    chat_memory = DynamoDBChatMessageHistory(
        table_name=ddb_table_name,
        session_id=session_id
    )
    # This is a workaround at the moment. Ideally, this should
    # be added to the DynamoDBChatMessageHistory class
    try:
        messages = messages_to_dict(messages)
        chat_memory.table.put_item(
            Item={"SessionId": session_id, "History": messages}
        )
    except Exception as e:
        print(e)

memory = ConversationBufferMemory(chat_memory=chat_memory, memory_key="chat_history", return_messages=True)
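
To show what a session-scoped history store does, here is a minimal in-memory sketch. A plain dictionary stands in for the DynamoDB table, and each item uses the `SessionId` and `History` keys seen in the `put_item` call above; the class name, method names, and message fields are hypothetical and much simpler than `DynamoDBChatMessageHistory`.

```python
from uuid import uuid4

# Hypothetical in-memory stand-in for the DynamoDB table: SessionId -> item.
fake_table = {}


class InMemoryChatHistory:
    """Stores one item per session, shaped like {"SessionId": ..., "History": [...]}."""

    def __init__(self, table: dict, session_id: str):
        self.table = table
        self.session_id = session_id

    @property
    def messages(self) -> list:
        """Return a copy of this session's messages, or an empty list."""
        item = self.table.get(self.session_id, {})
        return list(item.get("History", []))

    def add_message(self, role: str, content: str) -> None:
        """Append one message and write the whole item back to the table."""
        history = self.messages + [{"type": role, "content": content}]
        self.table[self.session_id] = {"SessionId": self.session_id, "History": history}


session_a = InMemoryChatHistory(fake_table, str(uuid4()))
session_a.add_message("human", "How many shards do I need?")
session_a.add_message("ai", "It depends on your index size.")

# A new session id starts with an empty history.
session_b = InMemoryChatHistory(fake_table, str(uuid4()))
print(len(session_a.messages), len(session_b.messages))
```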


---

Create a `ConversationalRetrievalChain`, which combines the chat history and the question into a standalone question, then looks up relevant documents from the retriever, and finally passes those documents and the question to a question-answering chain to return a response.

---

In [None]:
from langchain.chains import ConversationalRetrievalChain

params = {
        "max_length": 2048,
        "max_new_tokens": 512,
        "num_return_sequences": 1,
        "top_k": 200,
        "top_p": 0.99,
        "do_sample": False,
        "return_full_text": False,
        "temperature": 0.0001
        }

llm=SagemakerEndpoint(
        endpoint_name=llm_endpoint_name,
        region_name=aws_region,
        model_kwargs=params,
        content_handler=content_handler,
)

condense_template = """system: generate one standalone question.
Given the following conversation between <chat-history> and </chat-history> 
and follow up question between <follow-up-question> and </follow-up-question>, 
rephrase the follow up question to be a standalone question in its original language. 
The standalone question will only contain one sentence and it must end with '?'

<chat-history>
{chat_history}
</chat-history>

<follow-up-question>
{question}
</follow-up-question>

standalone question:
"""

CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(condense_template)


qa_with_memory = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    condense_question_prompt=CONDENSE_QUESTION_PROMPT,
    combine_docs_chain_kwargs={"prompt": prompt_template2},
    verbose=True)

---

For the first question, there is no history, so it is just the standard RAG process.

---

In [None]:
result = qa_with_memory(question)


In [None]:
#print("result:" + str(result))
print("\nAnswer:\n" + str(result["answer"]))

### Second question
Try asking one more question. `ConversationalRetrievalChain` will use the first question, the first question's answer, and the second question as the prompt for the LLM to generate a new standalone question. The prompt to the LLM looks like the following:

```python

_template = """system: generate one standalone question.
Given the following conversation between <chat-history> and </chat-history> 
and follow up question between <follow-up-question> and </follow-up-question>, 
rephrase the follow up question to be a standalone question in its original language. 
The standalone question will only contain one sentence and it must end with '?'

<chat-history>
{chat_history}
</chat-history>

<follow-up-question>
{question}
</follow-up-question>

standalone question:
"""
```

After getting the new question from the LLM, the chain searches the OpenSearch vector store for relevant documents, then combines the new question and the relevant documents into a prompt that goes through the RAG process. The prompt to the LLM looks like the following:

```python
prompt_template = """Answer the question as truthfully as possible by using the provided information in >>CONTEXT<<. If the answer is not contained within the >>CONTEXT<<, respond with "I can't answer that".

>>CONTEXT<<:
{context}

>>QUESTION<<:
{question}

>>Answer<<:
"""
```

In summary, `ConversationalRetrievalChain` calls the LLM twice:
1. Use the previous questions, the previous answers, and the latest question as the prompt to generate a new standalone question.
2. Use the new question generated in the first step to query relevant documents, then combine the relevant documents and the new question as the prompt for the LLM to generate the answer.
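
The two steps above can be sketched with stub functions in place of the LLM and the retriever. Everything here (the `conversational_answer` function, the stubs, and the sample questions) is hypothetical and only mirrors the control flow: with an empty history the condense step is skipped, otherwise the LLM is called once to rewrite the question and once to answer it.

```python
def conversational_answer(question, chat_history, condense_llm, retrieve, answer_llm):
    """Condense the follow-up question if there is history, then run RAG."""
    if chat_history:
        standalone = condense_llm(chat_history, question)  # first LLM call
    else:
        standalone = question
    docs = retrieve(standalone)
    answer = answer_llm(docs, standalone)  # second LLM call
    return standalone, answer


calls = []  # records which steps ran, in order


def fake_condense(history, question):
    calls.append("condense")
    return f"{history[-1][0]} Consider: {question}?"


def fake_retrieve(standalone_question):
    calls.append("retrieve")
    return ["doc about shards"]


def fake_answer(docs, standalone_question):
    calls.append("answer")
    return f"answer using {len(docs)} document(s)"


first_q, first_a = conversational_answer(
    "How many shards?", [], fake_condense, fake_retrieve, fake_answer)
first_calls = list(calls)
calls.clear()

second_q, second_a = conversational_answer(
    "if my data growth is very fast", [("How many shards?", first_a)],
    fake_condense, fake_retrieve, fake_answer)
second_calls = list(calls)

print(first_calls)
print(second_calls)
print(second_q)
```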

You can also see this in the verbose output, as follows:

---

### First call to LLM:

![generate new question](./image/module8/conversation-new-question.png)

---

### Second call to LLM:

![generate final answer](./image/module8/conversation-final-answer.png)

---


In [None]:
second_following_question = 'if my data growth is very fast'
second_result = qa_with_memory(second_following_question)


In [None]:
print("second answer:" + str(second_result["answer"]))

### Return source document

---

We can also include source documents so that we know where the content comes from.

In [None]:
session_id = str(uuid4())
chat_memory = DynamoDBChatMessageHistory(
        table_name=ddb_table_name,
        session_id=session_id
    )

memory_for_source = ConversationBufferMemory(chat_memory=chat_memory,memory_key="chat_history")

qa_with_memory_and_source = ConversationalRetrievalChain.from_llm(
    llm=llm, 
    retriever=retriever,
    condense_question_prompt=CONDENSE_QUESTION_PROMPT,
    combine_docs_chain_kwargs={"prompt": prompt_template2},
    return_source_documents=True, 
    verbose=True)


In [None]:
chat_history = []
result = qa_with_memory_and_source({"question": question,"chat_history": chat_history})

In [None]:
print("\nAnswer:\n" + str(result["answer"]))
print("\nSource Documents:\n")
for doc in result["source_documents"]:
    print(str(doc))
    print("--------------------------------")

In [None]:
chat_history = [(question, result["answer"])]
second_query = second_following_question
second_result = qa_with_memory_and_source({"question": second_query, "chat_history": chat_history})

In [None]:
print("\nAnswer:\n" + str(second_result["answer"]))
print("\nSource Documents:\n")
for doc in second_result["source_documents"]:
    print(str(doc))
    print("--------------------------------")