# Retrieval Augmented Generation and Chatbot Application

LangChain is a framework for developing applications powered by language models. Its key features let us augment large language models so that they can perform the tasks our use cases require. At a high level, LangChain provides two capabilities:

- **Data**: Connect a language model to other sources of data
- **Agent**: Allow a language model to interact with its environment

LangChain can be used in two major ways:

<li>Individual Components: LangChain provides modular abstractions for the components necessary to work with language models. LangChain also has collections of implementations for all these abstractions. The components are designed to be easy to use, regardless of whether you are using the rest of the LangChain framework or not.

<li>Use-Case Specific Chains: Chains can be thought of as assembling these components in particular ways in order to best accomplish a particular use case. These are intended to be a higher level interface through which people can easily get started with a specific use case. These chains are also designed to be customizable.

## Topics covered:

In this notebook we will be covering the below topics:

- **LLM** Examine running an LLM in bare form to check its output
- **Vector DB** Examine vector databases like FAISS or Chroma and use them to produce better results with RAG
- **Prompt template** Examine the use of a prompt template
- **Question Answering** Retrieval Augmented Generation (RAG)
- **Chatbot** Build an interactive chatbot with memory

## Key points for consideration

1. Long documents that exceed the token limit? Ability to chain: Map-Reduce, Refine, Map-Rerank
2. Cost per token: minimize the tokens and send only the relevant ones to the model
3. Which model to use:
    - Cohere, AI21, Hugging Face Hub, Manifest, Goose AI, Writer, Banana, Modal, StochasticAI, Cerebrium, Petals, Forefront AI, Anthropic, DeepInfra, and self-hosted models.
    - Example LLM: `cohere = Cohere(model='command-xlarge')`
    - Example LLM: `flan = HuggingFaceHub(repo_id="google/flan-t5-xl")`
4. Input data sources: PDF, web pages, CSV, S3, EFS
5. Orchestration with external tasks
    - External tasks: agent tools such as SerpApi and other search engines
    - Math calculator
6. Conversation management and history
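
The first point (handling documents longer than the token limit) can be illustrated without any library. The following is only a minimal sketch of the map-reduce idea, not LangChain's implementation: `fake_llm` is a hypothetical stand-in for a real model call, and a whitespace word count is used as a crude proxy for tokens.

```python
from typing import Callable, List


def split_by_budget(text: str, max_words: int) -> List[str]:
    """Split text into pieces of at most max_words words (a crude token proxy)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def map_reduce(text: str, llm: Callable[[str], str], max_words: int = 50) -> str:
    """Map: run the model on each piece. Reduce: run it once more on the partial outputs."""
    partials = [llm(f"Summarize: {piece}") for piece in split_by_budget(text, max_words)]
    return llm("Combine these summaries: " + " | ".join(partials))


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model: returns the first five words after the instruction."""
    body = prompt.split(": ", 1)[1]
    return " ".join(body.split()[:5])


long_text = " ".join(f"word{i}" for i in range(120))
print(len(split_by_budget(long_text, 50)), "pieces")
print(map_reduce(long_text, fake_llm))
```

Each model call sees at most `max_words` words of the source, which is what keeps every request under the limit; the price is one extra call to combine the partial results.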

### Key components of LangChain

Let us examine the key components of LangChain. At the heart and center is the large language model.

There are several main modules that LangChain provides support for. For each module we provide some examples to get started, how-to guides, reference docs, and conceptual guides. These modules are, in increasing order of complexity:

**Models**: The various model types and model integrations LangChain supports.

<img src='./images/models.png' width ="500"/>

    
**Prompts**: This includes prompt management, prompt optimization, and prompt serialization.
    
<img src="images/prompt.png" width="500"/>
    
**Memory**: Memory is the concept of persisting state between calls of a chain/agent. LangChain provides a standard interface for memory, a collection of memory implementations, and examples of chains/agents that use memory.

    
**Indexes**: Language models are often more powerful when combined with your own text data - this module covers best practices for doing exactly that.
    
<img src="images/vectorstore.png" width="500"/>

**Chains**: Chains go beyond just a single LLM call, and are sequences of calls (whether to an LLM or a different utility). LangChain provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.

<img src="images/chains.png" width="500"/>

**Agents**: Agents involve an LLM making decisions about which Actions to take, taking that Action, seeing an Observation, and repeating that until done. LangChain provides a standard interface for agents, a selection of agents to choose from, and examples of end to end agents.
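
The action/observation loop described for agents can be sketched in a few lines. This is an illustrative toy under stated assumptions, not LangChain's agent API: `scripted_policy` is a hypothetical stand-in for the LLM's decision step, and the single `add` tool is made up for the example.

```python
from typing import Callable, Dict, List, Tuple

# Tools the agent may call; the name and behavior are illustrative only.
TOOLS: Dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
}


def scripted_policy(question: str, history: List[Tuple[str, str, str]]) -> Tuple[str, str]:
    """Stand-in for the LLM's decision: returns (action, action_input)."""
    if not history:
        return "add", question          # first step: call the tool
    return "finish", history[-1][2]     # next step: answer with the last observation


def run_agent(question: str, max_steps: int = 5) -> str:
    """Repeat decide -> act -> observe until the policy says it is done."""
    history: List[Tuple[str, str, str]] = []
    for _ in range(max_steps):
        action, action_input = scripted_policy(question, history)
        if action == "finish":
            return action_input
        observation = TOOLS[action](action_input)
        history.append((action, action_input, observation))
    return "stopped: step limit reached"


print(run_agent("2+3"))
```

The step limit matters in practice: a real model may never emit a final answer, so the loop needs its own exit.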


    
**Callbacks**: It can be difficult to track all that occurs inside a chain or agent. Callbacks help add a level of observability and introspection.
 
    

### Chat Bot key elements

The first process in a chat bot is to generate embeddings. Typically, an ingestion process runs your documents through an embedding model and stores the resulting embeddings in a vector store. In this example we are using a GPT-J embeddings model for this.

<img src="images/Embeddings_lang.png" width="600"/>

The second process is user request orchestration: interaction, invoking the model, and returning the results.

<img src="images/Chatbot_lang.png" width="600"/>

For processes that need deeper analysis or a long conversation history, we need to summarize every interaction to keep it succinct. For that we can follow the flow below, which uses Pinecone as an example.

For the various tools which are available:

<img src="images/chatbot_internet.jpg" width="800"/>

In [None]:
!pip install --upgrade pip
!pip install --upgrade sagemaker
!pip install langchain --quiet
!pip install pypdf==3.8.1
!pip install transformers==4.24.0
!pip install sentence_transformers==2.2.2
!pip install faiss-cpu==1.7.4

## Note
You must restart the kernel here for the installations to take effect. After restarting the kernel, run the following cells.

In [None]:
import sagemaker
from sagemaker.session import Session
import boto3
import os

sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

print(f"Region is {aws_region}, Role is {aws_role}")

In [None]:
# List all the available text generation models in JumpStart
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models
filter_value = "task == llm"

print("===== Available text generation models =====")
text_generation_models = list_jumpstart_models(filter=filter_value)
for model in text_generation_models:
    print(model)

In [None]:
model_id = 'huggingface-llm-falcon-7b-instruct-bf16'

We will now deploy this model to a SageMaker endpoint for inference.

In [None]:
from sagemaker.jumpstart.model import JumpStartModel

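try:
    model = JumpStartModel(model_id=model_id, instance_type="ml.g5.2xlarge")
    predictor = model.deploy()
except Exception as e:
    # Later cells need `predictor`, so report the failure and stop instead of continuing silently
    print(str(e))
    raise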

In [None]:
endpoint_name = predictor.endpoint_name
region = aws_region

In [None]:
print(f"SageMaker Endpoint with Falcon-7b instruct deployed: {endpoint_name}")

## Simple Q&A with Falcon 
---

To use our model endpoint with LangChain, we wrap it in `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`, which is LangChain's built-in support for SageMaker endpoints.

In [None]:
import json
import re
from langchain import SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain import PromptTemplate, LLMChain

class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"
    
    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        input_str = json.dumps({"inputs": prompt,  "parameters": model_kwargs}) 
        return input_str.encode('utf-8')
    
    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json[0]["generated_text"]


content_handler = ContentHandler()

sm_llm=SagemakerEndpoint(
        endpoint_name=endpoint_name, 
        region_name=aws_region,
        model_kwargs={"do_sample": True,
                                    "top_p": 0.9,
                                    "temperature": 0.5,
                                    "max_new_tokens":  200,
                                    "stop": ["<|endoftext|>", "</s>"]},
        content_handler=content_handler
    )
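
To make the content handler's job concrete, the sketch below reproduces its two transformations with the standard library only, without calling an endpoint. The response shape (a JSON list holding one `generated_text` object, delivered as a readable byte stream) is assumed from the `transform_output` code above; the response text itself is made up.

```python
import io
import json

prompt = "What is RAG?"
model_kwargs = {"max_new_tokens": 200, "temperature": 0.5}

# Request body, built the same way as transform_input above.
request_body = json.dumps({"inputs": prompt, "parameters": model_kwargs}).encode("utf-8")

# Simulated response stream; io.BytesIO stands in for the object the endpoint returns.
fake_response = io.BytesIO(
    json.dumps([{"generated_text": "An example answer."}]).encode("utf-8")
)

# Same parsing steps as transform_output above.
generated_text = json.loads(fake_response.read().decode("utf-8"))[0]["generated_text"]

print(json.loads(request_body))
print(generated_text)
```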


Next, we will use LangChain PromptTemplate and LLMChain to create a prompt and invoke the model endpoint to get a response. We will use this method, or methods similar to this using LangChain throughout the rest of this notebook.

In [None]:
template = """
Question: {question}
Answer:"""

# define the prompt template
prompt = PromptTemplate(template=template, input_variables=["question"])

# define an LLMChain
llm_chain = LLMChain(prompt=prompt, llm=sm_llm)

# Run the chain
output = llm_chain.run(question=  "What is the plot of 'The Expanse'?", stop=["Question:","\n"])

print(output.strip())
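
The two ideas used in this cell, filling a template and cutting the completion at a stop sequence, can be shown in plain Python. This is a hedged sketch rather than LangChain's internals: `cut_at_stop` is a hypothetical helper, and `raw_completion` is made-up text standing in for model output.

```python
from typing import List

demo_template = """
Question: {question}
Answer:"""

# Filling the placeholder is plain string substitution.
prompt_text = demo_template.format(question="What is the plot of 'The Expanse'?")


def cut_at_stop(text: str, stop: List[str]) -> str:
    """Hypothetical helper: keep only the text before the earliest stop sequence."""
    cut = len(text)
    for sequence in stop:
        index = text.find(sequence)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# Made-up completion that runs on into an unwanted follow-up question.
raw_completion = " First line of a made-up completion.\nQuestion: an unwanted follow-up"

print(prompt_text)
print(cut_at_stop(raw_completion, ["Question:", "\n"]).strip())
```

Stop sequences such as `"Question:"` keep a completion model from continuing the pattern and inventing the next question itself.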

## Contextual Q&A with Falcon-7b instruct
---

Given a context, ask Falcon-7b instruct to answer only from within that context. Let's create a prompt template for that first.

In [None]:
template = """You are a helpful assistant. Given a document, answer the 'Question'. Keep your answers strictly from within the document. 
If the answer to the question is not in the document, simply say "I do not know", do not make up an answer.

Document: {document}
Question: {question}
Answer:"""

# define the prompt template
qa_prompt = PromptTemplate(template=template, input_variables=["document","question"])

In [None]:
document="""The Expanse is a science fiction television series based on the novel series of the same name by James S. A. Corey (Daniel Abraham and Ty Franck). \
It was developed by Mark Fergus and Hawk Ostby, who served as executive producers alongside Naren Shankar, Andrew Kosove, Broderick Johnson, Laura Lancaster, \
Sean Daniel, Jason Brown, and Sharon Hall. The first season premiered on December 14, 2015, with the second season following on February 1, 2017. The third \
season premiered on April 11, 2018.
"""

# define an LLMChain
llm_chain = LLMChain(prompt=qa_prompt, llm=sm_llm)

# Run the chain
output = llm_chain.run(document=document, question="When did the first season of 'The Expanse' premiere?", stop=["Question:","\n"])

print(output.strip())

### Let's ask it something completely outside of the document
---

In [None]:
document="""The Expanse is a science fiction television series based on the novel series of the same name by James S. A. Corey (Daniel Abraham and Ty Franck). \
It was developed by Mark Fergus and Hawk Ostby, who served as executive producers alongside Naren Shankar, Andrew Kosove, Broderick Johnson, Laura Lancaster, \
Sean Daniel, Jason Brown, and Sharon Hall. The first season premiered on December 14, 2015, with the second season following on February 1, 2017. The third \
season premiered on April 11, 2018.
"""

# define an LLMChain
llm_chain = LLMChain(prompt=qa_prompt, llm=sm_llm)

# Run the chain
output = llm_chain.run(document=document, question="When was 'Breaking Bad' made?", stop=["Question:","\n"])

print(output.strip())

We may see the model respond with an answer, which may be correct after all, even though the context doesn't include any details about the question asked. We can mitigate this with few-shot prompting.

## Few shot Q&A
---

In this section we will perform "few shot" Q&A with the model. We will show it a few examples and then ask it a question to be answered based on a given document.

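In [None]:
# Assign the few-shot prompt to `template`; without the assignment the previous template would be reused
template = """Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I do not know"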

Context: Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote 
what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, 
just characters with strong feelings, which I imagined made them deep.
Question: What are the two things the author worked outside of school?
Answer: Writing and programming
===
Context: The prevalence of malnutrition among elementary school aged children in tehran varied from 6% to 16% .
Anthropometric study of elementary school students in shiraz revealed that 16% of them suffer from malnutrition and low body weight .
Question: What steps did the ministry of education take to address the issue?
Answer: I do not know
===
Context: {document}
Question: {question}
Answer:"""

# define the prompt template
qa_prompt = PromptTemplate(template=template, input_variables=["document","question"])
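
When the number of examples grows, it can be easier to assemble a few-shot prompt from data than to hand-edit one long string. The sketch below shows one way to do that under stated assumptions: `build_few_shot_prompt` is a hypothetical helper, and the toy examples are invented for illustration.

```python
from typing import List, Tuple

INSTRUCTION = (
    "Answer the question using only the provided context. "
    'If the answer is not in the context, say "I do not know".'
)


def build_few_shot_prompt(
    examples: List[Tuple[str, str, str]], document: str, question: str
) -> str:
    """Join (context, question, answer) examples with '===' and append the open query."""
    blocks = [f"Context: {c}\nQuestion: {q}\nAnswer: {a}" for c, q, a in examples]
    blocks.append(f"Context: {document}\nQuestion: {question}\nAnswer:")
    return INSTRUCTION + "\n\n" + "\n===\n".join(blocks)


# One answerable and one unanswerable example, mirroring the pattern used above.
examples = [
    ("The sky is blue.", "What color is the sky?", "Blue"),
    ("The sky is blue.", "How deep is the ocean?", "I do not know"),
]
few_shot_prompt = build_few_shot_prompt(examples, "Cats purr.", "What do cats do?")
print(few_shot_prompt)
```

Including at least one example whose answer is "I do not know" is what teaches the model the refusal pattern.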

In [None]:
document="""In the quiet town of Willowbrook, elderly Ms. Agatha discovered a mysterious old key while tending to her garden. 
Curiosity piqued, she recalled an ancient wooden chest in her attic, untouched for decades. Climbing the creaky steps, she unlocked the chest 
to reveal a collection of letters penned by her grandmother. These letters unveiled stories of a hidden world filled with magical creatures and 
enchanted forests. As she read, the wind outside picked up, carrying whispers of the adventures her ancestors had once embarked upon. Willowbrook, 
it seemed, was not as ordinary as she had always believed.
"""

# define an LLMChain
llm_chain = LLMChain(prompt=qa_prompt, llm=sm_llm)

# Run the chain
output = llm_chain.run(document=document, question="What did Ms. Agatha find in her attic?", stop=["Question:","\n"])

print(output.strip())

## Retrieval Augmented Generation
---

In the previous sections we saw a couple of things.

- First, we did simple Q&A with the model
- Second, we did some contextual QA with the model, where we gave it a piece of text (Document) and asked the model to answer questions from it.
- Third, we went a bit further with the mechanism where we show some examples to the model as "few shot" and ask the question to the model.

In the subsequent sections we will implement a RAG mechanism step by step. RAG stands for "Retrieval-Augmented Generation", a method from natural language processing (NLP) and information retrieval. RAG combines a retriever (for example, a pre-trained encoder such as BERT) with a generator (for example, a sequence-to-sequence model such as BART or T5) to answer questions: it retrieves relevant passages from a corpus and then generates a response based on those passages.

To facilitate this, we will also look at vector databases. We will chunk an entire document into smaller parts, generate embeddings of those chunks, and load them into the vector DB. We will then run a relevancy search on the vector DB to get the text relevant to our query, which gives us the basis of the context for the model. Specifically, we will:

- Explore vector databases
- Learn basics of QA exploring simple chains
- Learn basics of chatbot
- Build prompt templates for our chat bot
- Explore various Chains useful for RAG

### Read the document

Our final goal is to perform Q&A with the sample document `sagemaker-faqs.pdf`. First we need to read the text from the document for which we will use PyPDFLoader. We will then split this document into chunks, convert into embeddings and use with LangChain and SageMaker LLM for inference. 

In [None]:
from langchain.document_loaders import TextLoader
from langchain.document_loaders.csv_loader import CSVLoader

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("sagemaker-faqs.pdf")
documents_aws = loader.load() # -- gives 2 docs
documents_split = loader.load_and_split() # - gives 22 docs

We have split the document into smaller chunks. We will now perform a couple of things-

- Generate embeddings of these chunks
- Store these embeddings into a vector database
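
To give a feel for what chunking does, here is a minimal character-window splitter with overlap. It is a sketch only and does not reproduce how `load_and_split` splits text; `chunk_text` and its default sizes are assumptions made for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Slide a window of chunk_size characters, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(500))
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.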

### Vector store indexer

This is what stores and matches the embeddings. This notebook showcases FAISS and will be transient and in memory. FAISS (Facebook AI Similarity Search) is a library that allows developers to quickly search for embeddings of multimedia documents that are similar to each other. It solves limitations of traditional query search engines that are optimized for hash-based searches, and provides more scalable similarity search functions. The VectorStore APIs that use FAISS within LangChain are available [here](https://python.langchain.com/en/harrison-docs-refactor-3-24/reference/modules/vectorstore.html). You can read up about FAISS in memory vector store [here](https://arxiv.org/pdf/1702.08734.pdf).

Some other notable Vector databases are

- [Chroma](https://www.trychroma.com/) is a super simple vector search database. The core-API consists of just four functions, allowing users to build an in-memory document-vector store. By default Chroma uses the Hugging Face transformers library to vectorize documents.
- [Weaviate](https://github.com/weaviate/weaviate) is a very polished tool. Not only does it offer a GraphQL API with support for vector search, it also allows users to vectorize their content using Weaviate's built-in modules or custom modules.

We will use `HuggingFaceEmbeddings`, available via LangChain, to generate embeddings of the text chunks from the previous step. FAISS (or Chroma) stores these embeddings in memory and uses them whenever the user runs a query.

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings()
vector_db = FAISS.from_documents(documents=documents_split, embedding=embeddings)

We have loaded our vector db with the document, now let's run a query.

In [None]:
query = "How am I charged for sagemaker?"
docs = vector_db.similarity_search(query)

In [None]:
docs

The query returns the chunks from the document that are most similar to the `query`; by default it returns the top 4. Let's see how to return just the top 2 together with their scores. With the default FAISS index the score is a distance, so a lower score means a closer match.

In [None]:
docs = vector_db.similarity_search_with_score(query, k = 2)
docs
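
Conceptually, a scored similarity search ranks stored vectors by their distance to the query vector. The brute-force sketch below shows that idea on toy data; it is not how FAISS is implemented, and the 2-D vectors and chunk names are invented for illustration.

```python
import math
from typing import Dict, List, Tuple


def l2_distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_with_score(
    query_vec: List[float], index: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Brute-force nearest neighbours: score every stored vector, return the k closest."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy 2-D "embeddings"; real embedding vectors have hundreds of dimensions.
toy_index = {
    "pricing chunk": [1.0, 0.0],
    "security chunk": [0.0, 1.0],
    "billing chunk": [0.9, 0.1],
}
hits = search_with_score([1.0, 0.0], toy_index, k=2)
print(hits)
```

Because the score is a distance, the best match has the smallest value, which is the opposite of a similarity or confidence score.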

### Vector store-backed retriever
---

According to LangChain documentation-

> A vector store retriever is a retriever that uses a vector store to retrieve documents. It is a lightweight wrapper around the vector store class to make it conform to the retriever interface. It uses the search methods implemented by a vector store, like similarity search and MMR, to query the texts in the vector store.

Wrapping our vector db in a retriever wrapper is going to be useful when we use it in the Q&A chain for our chatbot in subsequent sections. But let's take a look at how it works. The functionality is pretty similar to before (i.e. querying) with a slightly different interface.

We first define a retriever with search type `mmr` (Maximal Marginal Relevance); the other option is `similarity`. Note that the available `search_type` values depend on which vector DB you are using; some vector DBs may not support `mmr`.

> MMR considers the similarity of keywords/keyphrases with the document, along with the similarity of already selected keywords and keyphrases. This results in a selection of keywords that maximize their within diversity with respect to the document.

We also define how many top results to return, in this case 2. Finally we query the retriever using `get_relevant_documents` by passing in the query.

In [None]:
query = "How do I cost optimize sagemaker?"

retriever = vector_db.as_retriever(search_type='mmr', search_kwargs={"k": 2})  # return the top 2 results
relevant_docs = retriever.get_relevant_documents(query)   
relevant_docs
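
The trade-off that MMR makes between relevance and diversity can be seen in a small standalone implementation. This is a minimal sketch of the general technique on toy vectors, not the code the vector store runs; the `mmr` function and the weight name `lambda_mult` are this example's own choices.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query: List[float], candidates: List[List[float]], k: int = 2,
        lambda_mult: float = 0.5) -> List[int]:
    """Return indices of k candidates, trading relevance to the query against redundancy."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 0.3]
vectors = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]  # the first two are near-duplicates
print(mmr(query_vec, vectors, k=2, lambda_mult=0.5))  # balanced: skips the near-duplicate
print(mmr(query_vec, vectors, k=2, lambda_mult=1.0))  # relevance only: keeps both duplicates
```

With the balanced weight the second pick is the less similar but more diverse vector; with relevance only, both near-duplicates are returned and the context carries little new information.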

## Build context from retrieved documents
---

We now have the two relevant pieces of text that "contain" the answer to our question, but we are not quite there yet. We will use the technique from earlier to build a context and ask the question to the Falcon-7b instruct model. In this case, we create the context by simply concatenating the two text chunks we retrieved from the vector db.

In [None]:
full_context = str()
for doc in relevant_docs:
    full_context += doc.page_content+" "
    
print(full_context.strip(".").strip())

In [None]:
sm_llm=SagemakerEndpoint(
        endpoint_name=endpoint_name, 
        region_name=aws_region,
        model_kwargs={"do_sample": False,
                                    "top_p": 0.9,
                                    "temperature": 0.5,
                                    "max_new_tokens":  200,
                                    "stop": ["<|endoftext|>", "</s>"]},
        content_handler=content_handler
    )

In [None]:
# template = """Answer the question as truthfully as possible using the provided text. If the answer is not contained within the text below, say "I don't know", do not make up an answer.  
# Text: {document}
# Question: {question}
# Answer:"""

template = """>>INTRODUCTION<<Answer the question as truthfully as possible strictly using only the provided text, and if the answer is not contained within the text, say "I don't know". Make sure your answer is verbatim from the provided text. 
>>SUMMARY<<{document}
>>QUESTION<<{question}
>>ANSWER<<"""


# define the prompt template
qa_prompt = PromptTemplate(template=template, input_variables=["document","question"])

# define an LLMChain
llm_chain = LLMChain(prompt=qa_prompt, llm=sm_llm)

query = "How do I optimize sagemaker?"

# Run the chain with document as full_context and question as query we defined earlier
output = llm_chain.run(document=full_context, question=query)

print(output.strip())

That's a much better and concise answer. Let's try another question.

In [None]:
# define the prompt template
qa_prompt = PromptTemplate(template=template, input_variables=["document","question"])

# define an LLMChain
llm_chain = LLMChain(prompt=qa_prompt, llm=sm_llm)

query_1="How do I share models?"
output = llm_chain.run(document=full_context, question=query_1)

print(output.strip())

The model is unable to answer this specific question. That is because our `full_context` doesn't have any information related to the question. So we would again have to do a similarity search on the vector database to get the relevant chunks of text, build the context with those chunks, and then ask the question to the LLM with that context. That is a lot of repeated steps, and we can certainly write reusable functions to do it. However, there is a much easier way to achieve this using the "QA Chain" available in LangChain, with just a few lines of code. So let's see how that works.
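
For reference, the reusable function mentioned above could look roughly like the following. It is a sketch under stated assumptions: `answer_with_retrieval` is a hypothetical helper that takes the retriever and the model as plain callables, and `fake_retrieve` and `fake_llm` are stand-ins so the example runs without an endpoint or a vector store.

```python
from typing import Callable, List

PROMPT = "Text: {document}\nQuestion: {question}\nAnswer:"


def answer_with_retrieval(
    question: str,
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
) -> str:
    """Retrieve chunks for this question, build the context, then ask the model."""
    context = " ".join(retrieve(question))
    return llm(PROMPT.format(document=context, question=question)).strip()


def fake_retrieve(question: str) -> List[str]:
    """Stand-in retriever: keyword lookup in a two-entry store."""
    store = {"share": "chunk about sharing", "cost": "chunk about cost"}
    return [text for key, text in store.items() if key in question.lower()]


def fake_llm(prompt: str) -> str:
    """Stand-in model: echoes the context line so the data flow is visible."""
    return prompt.splitlines()[0].replace("Text: ", "")


print(answer_with_retrieval("How do I share models?", fake_retrieve, fake_llm))
```

The key difference from the cells above is that the context is rebuilt for every question instead of being reused from an earlier query.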

### Performing Q&A with RAG with `RetrievalQA`
---

We first define a question. The retriever then searches the vector database to find relevant pieces of information from the document, and these pieces are passed on to the model so that it can answer the question. We will use LangChain's `RetrievalQA` chain to perform Q&A with the model. With `chain_type="stuff"`, the chain places all retrieved chunks into a single prompt, so it handles both the prompt creation and the context generation with help from the vector database.

In [None]:
# from langchain.chains.question_answering import load_qa_chain
from langchain.chains import RetrievalQA



retriever = vector_db.as_retriever(search_type='mmr', search_kwargs={"k": 2})

# template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

# {context}

# Question: {question}
# Answer:"""

template = """>>INTRODUCTION<<Answer the question as truthfully as possible strictly using only the provided text, and if the answer is not contained within the text, say "I don't know". Make sure your answer is verbatim from the provided text. 
>>SUMMARY<<{context}
>>QUESTION<<{question}
>>ANSWER<<"""

# define the prompt template
qa_prompt = PromptTemplate(template=template, input_variables=["context","question"])

chain_type_kwargs = { "prompt": qa_prompt }

qa = RetrievalQA.from_chain_type(
    llm=sm_llm, 
    chain_type="stuff", 
    retriever=retriever,
    chain_type_kwargs=chain_type_kwargs
)

question="What are SageMaker Model cards?"

result = qa.run(question)
print(result.strip())

## Chatbot application

#### For the chatbot we need context management, history, vector stores, and many other things. We will start with a `ConversationalRetrievalChain`

This chain uses conversation memory together with a retrieval QA chain, which allows chat history to be passed in and used for follow-up questions. Source: https://python.langchain.com/en/latest/modules/chains/index_examples/chat_vector_db.html

Set `verbose` to `True` to see what is going on behind the scenes.

**We use a custom prompt template to fine-tune the output responses**

In [None]:
from langchain import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.chains import LLMChain
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT


def create_prompt_template():
    _template = """
    Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question. Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you do not know, do not try to make up an answer.
        Chat History:
        {chat_history}
        Follow Up Input: {question}
        Standalone question:
    """
    CONVO_QUESTION_PROMPT = PromptTemplate.from_template(_template)
    return CONVO_QUESTION_PROMPT
memory_chain = ConversationBufferMemory(memory_key="chat_history", input_key="question", return_messages=True)
chat_history=[]
qa = ConversationalRetrievalChain.from_llm(
    llm=sm_llm, 
    #retriever=vectorstore_faiss_aws.as_retriever(), 
    retriever=retriever,
    memory=memory_chain,
    #verbose=True,
    condense_question_prompt=create_prompt_template(), #CONDENSE_QUESTION_PROMPT, # use the condense prompt template
    #chain_type='map_reduce',
    max_tokens_limit=100
    #combine_docs_chain_kwargs=key_chain_args,

)
print("Starting chat bot")
while True:
    query = input("Enter your query, q to quit: ")
    if query in ('q', 'Q', 'quit'):
        print("Breaking")
        break
    else:
        result = qa.run({'question':query, 'chat_history':chat_history} )
        # Show the answer directly instead of folding it into the next input prompt
        print(f"Question:{query}\nAI:Answer:{result}")

print("Thank you , that was a nice chat !!")
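
The first thing a conversational retrieval flow does with a follow-up question is rewrite it into a standalone question using the chat history; retrieval and answering then run on the rewritten question. The sketch below only builds that rewriting prompt. `format_history` and `build_condense_prompt` are hypothetical helpers, and the placeholder answer in `turns` is not real model output.

```python
from typing import List, Tuple

CONDENSE_TEMPLATE = (
    "Given the following conversation and a follow up question, "
    "rephrase the follow up question to be a standalone question.\n"
    "Chat History:\n{chat_history}\n"
    "Follow Up Input: {question}\n"
    "Standalone question:"
)


def format_history(turns: List[Tuple[str, str]]) -> str:
    """Flatten (human, ai) turns into the text block the template expects."""
    return "\n".join(f"Human: {human}\nAssistant: {ai}" for human, ai in turns)


def build_condense_prompt(turns: List[Tuple[str, str]], follow_up: str) -> str:
    """Fill the template with the formatted history and the new question."""
    return CONDENSE_TEMPLATE.format(chat_history=format_history(turns), question=follow_up)


turns = [("What are SageMaker Model Cards?", "(previous answer)")]
condense_prompt = build_condense_prompt(turns, "How do I create one?")
print(condense_prompt)
```

Without this step, a follow-up such as "How do I create one?" would be sent to the retriever as is, and "one" carries no information to search on.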

#### Refine as chain type with the default similarity-search retriever

In [None]:
from langchain import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.chains import LLMChain
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT


def create_prompt_template():
    

    _template = """
    Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question. Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you do not know, do not try to make up an answer.
        Chat History:
        {chat_history}
        Follow Up Input: {question}
        Standalone question:
    """
    CONVO_QUESTION_PROMPT = PromptTemplate.from_template(_template)
    return CONVO_QUESTION_PROMPT
memory_chain = ConversationBufferMemory(memory_key="chat_history", input_key="question", return_messages=True)
chat_history=[]
qa = ConversationalRetrievalChain.from_llm(
        llm=sm_llm, 
        retriever=vector_db.as_retriever(), 
        memory=memory_chain,
        #verbose=True,
        condense_question_prompt=create_prompt_template(), #CONDENSE_QUESTION_PROMPT, create_prompt_template(), # use the condense prompt template
        chain_type='refine', #'map_rerank', #'refine', # s(['stuff', 'map_reduce', 'refine', 'map_rerank'])
        max_tokens_limit=100,
        get_chat_history=lambda h : h,
)  
print("Starting Refine chat bot")
while True:
    query = input("Enter your query, q to quit: ")
    if query in ('q', 'Q', 'quit'):
        print("Breaking")
        break
    else:
        result = qa.run({'question':query, 'chat_history':chat_history} )
        # Show the answer directly instead of folding it into the next input prompt
        print(f"Question:{query}\nAI:Answer:{result}")

print("Thank you , that was a nice chat !!")
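
The refine strategy differs from stuffing all chunks into one prompt: the model answers from the first chunk and then revises that answer once for every further chunk. The sketch below shows the control flow only; the prompt wording is invented, and `counting_llm` is a hypothetical stand-in that records prompts instead of calling a model.

```python
from typing import Callable, List


def refine_answer(question: str, chunks: List[str], llm: Callable[[str], str]) -> str:
    """Answer from the first chunk, then revise the answer once per remaining chunk."""
    if not chunks:
        return ""
    answer = llm(f"Context: {chunks[0]}\nQuestion: {question}\nAnswer:")
    for chunk in chunks[1:]:
        answer = llm(
            f"Existing answer: {answer}\nNew context: {chunk}\n"
            f"Question: {question}\nRefined answer:"
        )
    return answer


calls: List[str] = []


def counting_llm(prompt: str) -> str:
    """Stand-in model: records the prompt and returns a numbered draft."""
    calls.append(prompt)
    return f"draft {len(calls)}"


final_answer = refine_answer("example question", ["chunk A", "chunk B", "chunk C"], counting_llm)
print(final_answer, "after", len(calls), "model calls")
```

Refine makes one model call per chunk and the calls are sequential, so it costs more tokens and latency than `stuff`, but each prompt stays small.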


## Let's delete the model and the endpoint to clean up resources

In [None]:
predictor.delete_model()
predictor.delete_endpoint()