# Implementing a chatbot with Retrieval-Augmented Generation (RAG) using Llama-2-13b-chat, vLLM, LangChain, and the Gradio UI
- [**LangChain**](https://python.langchain.com/docs/get_started/introduction) is a framework for rapidly prototyping applications powered by language models. In this example, we replace Gradio's chatbot memory with LangChain's `ConversationBufferMemory`, which is attached to the `RetrievalQA` chain.
- [**vLLM**](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. In this example, we use LangChain's `VLLMOpenAI` class to access the vLLM OpenAI API-compatible service.
- [**Gradio**](https://www.gradio.app/) is a fast way to demo machine learning models with a friendly web interface.

### Introduction
Retrieval-Augmented Generation (RAG) systems combine the capabilities of large language models and information retrieval systems to improve the quality of generated responses. In a RAG system, the process generally involves two main steps:

1. **Retrieval**: Given a query, the system retrieves a set of relevant documents or passages from a pre-existing knowledge base or corpus. This retrieval step is usually performed using traditional information retrieval techniques or learned embeddings.

2. **Generation**: The retrieved documents are then used to augment the input to a generative model, like Llama 2. The model uses this additional context to generate more informed, relevant, and accurate responses.

By combining retrieval and generation, RAG systems aim to provide answers that are not only contextually coherent but also factually accurate, even when the information is not explicitly present in the model's pre-training data.
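
The two steps above can be sketched without any LLM or vector database. The following is a minimal illustration, assuming a bag-of-words "embedding" and cosine similarity in place of learned embeddings; the knowledge base sentences and all function names (`embed`, `cosine`, `retrieve_passages`, `augment`) are hypothetical and are not part of this notebook's pipeline.

```python
import math
import re
from collections import Counter

# Toy knowledge base; the sentences are illustrative placeholders.
KNOWLEDGE_BASE = [
    "The library opens at nine in the morning on weekdays.",
    "The cafeteria serves lunch between noon and two in the afternoon.",
    "Parking permits are issued by the front office.",
]

def embed(text):
    """Bag-of-words stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_passages(query, corpus, k=1):
    """Step 1 (retrieval): return the k passages most similar to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

def augment(query, passages):
    """Step 2 (input side of generation): prepend retrieved passages to the question."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

toy_query = "When does the library open?"
top_passages = retrieve_passages(toy_query, KNOWLEDGE_BASE)
augmented_prompt = augment(toy_query, top_passages)
print(augmented_prompt)
```

In a real system, the augmented prompt is what gets sent to the generative model; the notebook below does the same with Sentence Transformers embeddings, a Chroma vector store, and Llama 2.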

**In this RAG example, we extend Llama 2's knowledge with a knowledge base consisting of three NY Times articles (serialized as PDF documents) about Otis, the Category 5 hurricane that struck the Pacific coast of Mexico at the end of October 2023. This knowledge base contains data that postdates the datasets used to pre-train Llama 2 and fine-tune Llama 2 Chat, which predate July 2023.**

### Requirements
Please follow the instructions provided by the README.md file from the 
[LLM-fine-tuning-example](https://github.com/vmware-ai-labs/VMware-generative-ai-reference-architecture/tree/main/Examples/LLM-fine-tuning-example) directory to create a conda virtual environment with all the libraries used in this notebook.

To use Llama 2 from the Transformers library, you need approval from Meta and Hugging Face.
- Please follow these [instructions provided by Hugging Face](https://huggingface.co/meta-llama) before continuing.
- Then [create a Hugging Face user access token](https://huggingface.co/docs/hub/security-tokens) to be able to download Llama 2 models from the Hugging Face Hub.
- Once you have created an access token, log in from the shell terminal using either of these options:
    ```
    # Using an interactive command
    huggingface-cli login
  
    # or using an environment variable
    huggingface-cli login --token $HUGGINGFACE_TOKEN
    ```
- You can serve Llama-2-13B-chat on an A100 (40GB) GPU using vLLM by running the following command:
`python -m vllm.entrypoints.openai.api_server --model="meta-llama/Llama-2-13b-chat-hf"`
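
Before wiring the server into LangChain, it can help to check that it answers requests. The sketch below builds a request for the OpenAI-compatible `/completions` endpoint using only the standard library. It assumes the server listens on `http://localhost:8000/v1` (the same URL used later in this notebook); the helper names `build_completion_request` and `send_request` are hypothetical.

```python
import json
import urllib.request

# Assumed base URL: the vLLM server started above, reachable on the local machine.
BASE_URL = "http://localhost:8000/v1"

def build_completion_request(prompt_text, base_url=BASE_URL,
                             model="meta-llama/Llama-2-13b-chat-hf",
                             max_tokens=32):
    """Build (without sending) a POST request for the /completions endpoint."""
    payload = {"model": model, "prompt": prompt_text, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send_request(request, timeout=60):
    """Send the request and decode the JSON body; requires a running server."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

smoke_request = build_completion_request("[INST] Say hello. [/INST]")
print(smoke_request.get_method(), smoke_request.full_url)
# With the server running, uncomment the next line to send the request:
# print(send_request(smoke_request)["choices"][0]["text"])
```

If the commented-out call returns generated text, the service is ready for the rest of the notebook.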

Now you should be all set to get started!

## Imports section

In [None]:
import gradio as gr
from langchain.llms import VLLMOpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.vectorstores import Chroma
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.prompts.prompt import PromptTemplate

## Generation parameter initialization and helper function definitions


In [None]:
%%time
# URL where vLLM serves the Llama-2-13b-chat LLM
INFERENCE_SRV_URL = "http://localhost:8000/v1"

def setup_chat_llm(max_tokens=500, temperature=0.1, top_p=.9):
    """
    Initializes the LLM object that gives LangChain chains access
    to the Llama 2 LLM service.
    :param max_tokens: Max num. of tokens to generate.
    :param temperature: Sampling temperature; higher values give more varied output.
    :param top_p: Cumulative probability threshold for nucleus sampling of the next token.
    :return: the llm service callable object
    """
    llm = VLLMOpenAI(
        openai_api_key = "EMPTY",
        openai_api_base = INFERENCE_SRV_URL,
        model_name = "meta-llama/Llama-2-13b-chat-hf",
        max_tokens = max_tokens,
        temperature = temperature,
        top_p = top_p,
    )
    return llm

# Initialize LLM service
llm = setup_chat_llm()

# >>> Set up the embedding encoder (Sentence Transformers)
# Note that CPUs can be used to encode text from knowledge bases.
model_name = "all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embeddings = SentenceTransformerEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

# >>> Load all PDF documents containing the NY Times articles
# about hurricane Otis from the "files" directory
loader = PyPDFDirectoryLoader(path="./files")

# Split the documents into chunks.
doc_splits = loader.load_and_split()

# >>> Encode the document splits using the embeddings encoder.
# The encoded splits get stored into a Chroma vector database
# which serves as a retriever of chunks of text related to a user's query.
retriever = Chroma.from_documents(documents=doc_splits, embedding=embeddings).as_retriever()

# >>> Define a Llama 2 RAG prompt that instructs the LLM to generate answers to users' queries
# using the context provided by the retriever and the chat history (kept in the chain's memory).
prompt_template = (
"""[INST]<<SYS>> 
You are an assistant for question-answering tasks. 
If you don't know the answer, just say that you don't know. Keep the answer concise.
Use the following context delimited by <CTX></CTX>, and the chat history delimited by <HS></HS> to answer the question.<</SYS>>  

<CTX>
{context} 
</CTX>

<HS>
{history}
</HS>

Question: {question}

Answer: [/INST]""")

# >>> Create the prompt template object including the explicit declaration of
# input variables that get dynamically entered as a conversation evolves.
prompt = PromptTemplate(
    input_variables=["history", "context", "question"],
    template=prompt_template,
)

# Initialize the memory object that tracks a user's conversation with the LLM
memory = ConversationBufferMemory(memory_key="history",
                                  input_key="question",
                                  return_messages=True)

# >>> RetrievalQA chain initialization. This object coordinates actions among the LLM service, the context retriever
# and the conversation history (memory) to generate responses to users' queries about a knowledge base. 
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=retriever,
    verbose=True,
    chain_type_kwargs={
        "verbose": True,
        "prompt": prompt,
        "memory": memory,
    }
)

def respond(user_message, chatbot, conversation=qa_chain, mem=memory):
    """
    Generates a response to a new user message using the context fetched by
    the RetrievalQA chain's retriever and the conversation history kept in memory.
    :param user_message: New message entered by the user
    :param chatbot: Gradio's chatbot history. Only used to detect a reset and to update the UI
    :param conversation: The RetrievalQA object that generates responses
    :param mem: The memory object attached to the RetrievalQA chain
    :return: An empty string (to clear the textbox) and the updated chatbot history
    """
    # >>> Clear the memory when the chat history is empty (e.g. after "Clear chat" is pressed).
    # This is a workaround: the memory cannot be reset through the RetrievalQA object itself.
    if not chatbot:
        mem.clear()
    # Run the RetrievalQA chain 
    response = conversation.run(user_message)
    # Store user's message and the LLM response in chatbot history
    chatbot.append((user_message, response))
    # Return the expanded dialog history
    return "", chatbot
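
To see roughly what the LLM receives on each turn, the prompt template defined above can be filled in by hand. This is a minimal sketch using plain `str.format` on a copy of the template. It assumes retrieved chunks are joined with blank lines and that history is rendered as `Human:`/`AI:` lines; the exact rendering inside the chain depends on the chain and memory configuration. The helper `fill_prompt` and all placeholder inputs are hypothetical.

```python
# Copy of the notebook's RAG prompt template (trailing spaces omitted).
TEMPLATE = """[INST]<<SYS>>
You are an assistant for question-answering tasks.
If you don't know the answer, just say that you don't know. Keep the answer concise.
Use the following context delimited by <CTX></CTX>, and the chat history delimited by <HS></HS> to answer the question.<</SYS>>

<CTX>
{context}
</CTX>

<HS>
{history}
</HS>

Question: {question}

Answer: [/INST]"""

def fill_prompt(chunks, turns, question):
    """Join retrieved chunks and past turns, then substitute them into the template."""
    context = "\n\n".join(chunks)  # assumption: chunks separated by blank lines
    history = "\n".join(f"Human: {q}\nAI: {a}" for q, a in turns)  # assumed rendering
    return TEMPLATE.format(context=context, history=history, question=question)

# Placeholder inputs, invented for illustration only.
example_chunks = ["<retrieved chunk 1>", "<retrieved chunk 2>"]
example_turns = [("<earlier question>", "<earlier answer>")]
filled_prompt = fill_prompt(example_chunks, example_turns, "<new question>")
print(filled_prompt)
```

Printing the filled prompt this way makes it easier to judge how much of the model's context window is consumed by retrieved text versus conversation history.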

In [None]:
# Set up the Gradio client session
with gr.Blocks() as demo:
    # Set the size of the chat window
    chatbot = gr.Chatbot(height=500, show_copy_button=True, interactive=True)
    # Set the text box for users to enter text.
    msg = gr.Textbox(label="Prompt", interactive=True, show_copy_button=True)
    # Expose 'submit' button
    btn = gr.Button("Submit")
    # Expose "clear console" button
    clear = gr.ClearButton(components=[msg, chatbot], 
                           value="Clear chat")
    # Submit the prompt completion request on a click event
    btn.click(respond, inputs=[msg, chatbot], outputs=[msg, chatbot])
    # Submit the prompt completion request on an Enter key press event
    msg.submit(respond, inputs=[msg, chatbot], outputs=[msg, chatbot])

# Close pre-existing gradio clients
gr.close_all()

# Launch Gradio demo on a new browser tab
demo.launch(
    inline=False,
    inbrowser=True,
    share=False,
    show_error=True,
)
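
The callback wired to the button and the textbox can also be exercised without Gradio or a live LLM. The sketch below mirrors the control flow of `respond` with its dependencies passed in explicitly; `StubChain`, `StubMemory`, and `respond_sketch` are hypothetical stand-ins, not LangChain or Gradio classes.

```python
class StubChain:
    """Hypothetical stand-in for the RetrievalQA chain: echoes the question."""
    def run(self, question):
        return f"(stub answer to: {question})"

class StubMemory:
    """Hypothetical stand-in for the conversation memory: counts clear() calls."""
    def __init__(self):
        self.clear_calls = 0

    def clear(self):
        self.clear_calls += 1

def respond_sketch(user_message, chat_history, conversation, mem):
    """Same control flow as respond(), with the chain and memory passed in."""
    if not chat_history:          # empty history means the chat was just cleared
        mem.clear()
    response = conversation.run(user_message)
    chat_history.append((user_message, response))
    # "" empties the textbox; the list refreshes the chat window
    return "", chat_history

stub_chain, stub_mem = StubChain(), StubMemory()
textbox_value, dialog = respond_sketch("first question", [], stub_chain, stub_mem)
textbox_value, dialog = respond_sketch("second question", dialog, stub_chain, stub_mem)
print(textbox_value == "", len(dialog), stub_mem.clear_calls)
```

The memory is cleared only on the first call, when the history is empty, which is the behavior the "Clear chat" button relies on.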

## Test
You can try the following dialog with the chatbot by entering the prompts shown after `Human:` into the UI launched in your web browser. <br><br>

### Example
Human: <blockquote>Where and when hurricane Otis made landfall?</blockquote>
AI: <blockquote>Hurricane Otis made landfall near the coastal city of Acapulco in southern Mexico on Wednesday, October 25, 2023, around 1 a.m. local time.</blockquote>

Human: <blockquote>What is hurricane rapid intensification?</blockquote>
AI: <blockquote>Sure, I'd be happy to help! Hurricane rapid intensification refers to a situation where a hurricane's wind speed increases by at least 35 miles per hour within a 24-hour period. This is considered an extreme event, and it is difficult to predict with certainty. In the case of Hurricane Otis, it rapidly intensified from a tropical storm to a Category 5 hurricane in less than a day, with wind speeds increasing by 115 miles per hour in just 24 hours.</blockquote>

Human: <blockquote>Summarize your previous responses using a single sentence.</blockquote>
AI: <blockquote>AI: Sure! Here's a summary of my previous responses:

Hurricane Otis made landfall in Mexico, specifically in the states of Guerrero and Oaxaca, on Wednesday, October 25, 2023, at 1 a.m. local time, and it rapidly intensified from a tropical storm to a Category 5 hurricane in less than a day, with wind speeds increasing by 115 miles per hour in just 24 hours.</blockquote>
