# Implementing a chatbot using Llama-2-13b-chat, vLLM, LangChain, and the Gradio UI
- [**LangChain**](https://python.langchain.com/docs/get_started/introduction) is a framework for quickly prototyping applications powered by language models. In this example, we replace Gradio's chatbot memory with LangChain's `ConversationBufferMemory`, which is attached to an `LLMChain`.
- [**vLLM**](https://vllm.readthedocs.io/en/latest/) is a fast and easy-to-use library for LLM inference and serving. In this example, we use LangChain's `VLLMOpenAI` class to access the vLLM OpenAI API-compatible service.
- [**Gradio**](https://www.gradio.app/) is a fast way to demo machine learning models with a friendly web interface.

### Requirements
Please follow the instructions provided by the README.md file from the 
[LLM-fine-tuning-example](https://github.com/vmware-ai-labs/VMware-generative-ai-reference-architecture/tree/main/Examples/LLM-fine-tuning-example) directory to create a conda virtual environment with all the libraries used in this notebook.

To use Llama 2 from the Transformers library, you need approval from Meta and Hugging Face.
- Please follow these [instructions provided by Hugging Face](https://huggingface.co/meta-llama) before continuing.
- Then [create a Hugging Face user access token](https://huggingface.co/docs/hub/security-tokens) to be able to download Llama 2 models from the Hugging Face Hub.
- Once you have created an access token, log in to Hugging Face from a shell terminal using either of these options:
    ```
    # Using an interactive command
    huggingface-cli login
  
    # or using an environment variable
    huggingface-cli login --token $HUGGINGFACE_TOKEN
    ```
You can serve Llama-2-13B-chat on an A100 (40GB) GPU using vLLM by running the following command:
`python -m vllm.entrypoints.openai.api_server --model="meta-llama/Llama-2-13b-chat-hf"`

Now, you should be all set to get started!
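
Before moving on, it may help to confirm that the server is reachable. The following is a minimal sketch, assuming the server listens on `http://localhost:8000` and exposes the OpenAI-style `/v1/models` route. The helper names (`models_url`, `parse_model_ids`, `list_served_models`) are hypothetical and not part of vLLM or LangChain.

```python
import json
from urllib.request import urlopen


def models_url(base_url):
    """Build the model-listing URL from an OpenAI-style base URL."""
    return base_url.rstrip("/") + "/models"


def parse_model_ids(payload):
    """Extract the model ids from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url="http://localhost:8000/v1", timeout=5):
    """Ask the server which models it serves (needs a running server)."""
    with urlopen(models_url(base_url), timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

With the server from the command above running, `list_served_models()` should return a list that includes `meta-llama/Llama-2-13b-chat-hf`. A connection error usually means the server is still loading the model or is listening on a different port.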

## Imports section

In [1]:
import gradio as gr
from langchain.llms import VLLMOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
    MessagesPlaceholder,
)
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory

## Generation parameter initialization and helper function definitions


In [2]:
# URL where vLLM serves the Llama-2-13b-chat LLM
INFERENCE_SRV_URL = "http://localhost:8000/v1"

def setup_chat_llm(max_tokens=500, temperature=0.1, top_p=.9):
    """
    Initializes the LLM object that gives LangChain chains access
    to the Llama 2 LLM service.
    :param max_tokens: Max num. of tokens to generate.
    :param temperature: Determines how creative the model should be.
    :param top_p: Cumulative probability threshold for selecting the next token.
    :return: the LLM service callable object
    """
    llm = VLLMOpenAI(
        openai_api_key = "EMPTY",
        openai_api_base = INFERENCE_SRV_URL,
        model_name = "meta-llama/Llama-2-13b-chat-hf",
        max_tokens = max_tokens,
        temperature = temperature,
        top_p = top_p,
    )
    return llm

# >> Prompt template definitions for Llama 2 chat
# We follow the prompt formatting guidelines from Meta's paper about Llama 2 chat.
# For more info, see https://huggingface.co/meta-llama/Llama-2-13b-chat-hf

# System prompt template
sys_prompt_template = (
"""<s>[INST] <<SYS>>
You are a helpful, respectful, and honest assistant.
Always provide helpful, truthful, and safe answers. Safety must be your highest priority.
Your answers must not contain harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Please ensure that your responses are socially unbiased and positive in tone and nature.
If an instruction doesn't make sense or is not coherent, inform the user instead of trying to answer.
Don't provide false information if you don't know the answer to a question.
<</SYS>>
""")
# Initialize the system (partial) prompt object for LangChain
system_message_prompt = SystemMessagePromptTemplate.from_template(sys_prompt_template)

# Human (user) prompt section template definition
human_template = "{delimiter}{user_message} [/INST]"
# LangChain's human (partial) prompt message object initialization.
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

# >> Prompt template initialization. Notice the placeholder for the "chat_history" key
# that LangChain will use to give the LLM context about its dialog with the user.
prompt = ChatPromptTemplate(
    messages=[
        system_message_prompt,
        MessagesPlaceholder(variable_name="chat_history"),
        human_message_prompt,
    ]
)

# >> Initialize LangChain's ConversationBufferMemory used for conversation bookkeeping
# memory_key and input_key must match the variable names used in the prompt.
memory = ConversationBufferMemory(memory_key="chat_history",
                                  input_key="user_message",
                                  return_messages=True)
# Initialize the LLM service object
llm = setup_chat_llm()

# >>> Create an LLMChain out of the llm, memory and prompt objects previously created.
# This chain is designed to run queries against LLMs
conversation = LLMChain(
    llm=llm,
    prompt=prompt,
    verbose=False,
    memory=memory
)

def respond(user_message, chatbot, conversation=conversation):
    """
    Generates a completion for a new user message, using the previous
    conversation context (if any) and the formatted prompt.
    :param user_message: New message from the user
    :param chatbot: Gradio's chatbot history. Not needed by LangChain; used to detect a reset and to update the UI
    :param conversation: LLMChain whose ConversationBufferMemory keeps the conversation history
    :return: An empty string (clears the text box) and the updated chatbot history
    """
    # Clear memory in the event of a reset button press
    if not chatbot:
        conversation.memory.clear()
    # Pull chat memory contents
    conv_w_context = conversation.memory.chat_memory.messages

    # Has the conversation already started?
    if not conv_w_context:
        delimiter = "" # If not, no need to complete the prompt template.
    else:
        delimiter = "</s><s> [INST] " # If it has, complete the prompt.

    # Run the LLMChain 
    response = conversation.predict(delimiter=delimiter,
                                    user_message=user_message)
    # Store the user's message and the LLM response in the chatbot history
    chatbot.append((user_message, response))
    # Return the expanded dialog history
    return "", chatbot
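
The `delimiter` logic above aims at the multi-turn Llama 2 chat layout, in which each finished exchange is closed with `</s>` and the next user message opens a new `[INST]` block. The following plain-Python sketch shows that target layout as a single string. It is an illustration only: the function name `build_llama2_prompt` is hypothetical, the exact whitespace is an assumption, and the string LangChain actually sends may differ because LangChain serializes chat messages in its own way.

```python
def build_llama2_prompt(system_prompt, turns, user_message):
    """Assemble a multi-turn Llama 2 chat prompt as a single string.

    turns is a list of (user, assistant) pairs from earlier exchanges.
    """
    # The system prompt lives inside the first [INST] block
    prompt = f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    for user, assistant in turns:
        # Close each finished exchange with </s> and open a new [INST] block
        prompt += f"{user} [/INST] {assistant} </s><s>[INST] "
    # The new message is left open for the model to complete
    return prompt + f"{user_message} [/INST]"


first_turn = build_llama2_prompt("Be brief.", [], "Hi")
second_turn = build_llama2_prompt("Be brief.", [("Hi", "Hello!")], "Bye")
```

On the first turn there is no earlier exchange, so nothing is inserted between the system block and the user message. This corresponds to the empty `delimiter` used when the conversation has not started yet.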

In [3]:
# Set up the Gradio client session
with gr.Blocks() as demo:
    # Set the size of the chat window
    chatbot = gr.Chatbot(height=500, show_copy_button=True, interactive=True)
    # Set the text box for users to enter text.
    msg = gr.Textbox(label="Prompt", interactive=True, show_copy_button=True)
    # Expose 'submit' button
    btn = gr.Button("Submit")
    # Expose "clear console" button
    clear = gr.ClearButton(components=[msg, chatbot], 
                           value="Clear chat")
    # Submit the prompt completion request on a click event
    btn.click(respond, inputs=[msg, chatbot], outputs=[msg, chatbot])
    # Submit the prompt completion request on an Enter key press event
    msg.submit(respond, inputs=[msg, chatbot], outputs=[msg, chatbot])

# Close pre-existing gradio clients
gr.close_all()

# Launch Gradio demo on a new browser tab
demo.launch(
    inline=False,
    inbrowser=True,
    share=False,
    show_error=True,
)

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
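
The event wiring above relies on a simple contract: the handler receives the values of `inputs` as arguments and returns one value per entry in `outputs`. The sketch below exercises that contract without Gradio or a running LLM. It assumes a stand-in `predict` callable in place of the LLMChain, and `make_responder` is a hypothetical helper, not part of Gradio or LangChain.

```python
def make_responder(predict):
    """Wrap a text-generation callable in a Gradio-style event handler."""
    def respond(user_message, chatbot):
        # Ask the stand-in model for a reply
        response = predict(user_message)
        # Build the updated history without mutating the input list
        history = list(chatbot) + [(user_message, response)]
        # First value clears the text box, second value updates the chat window
        return "", history
    return respond


# Stand-in "model" that simply upper-cases the user's message
fake_respond = make_responder(lambda text: text.upper())
box, history = fake_respond("hello", [])
box, history = fake_respond("again", history)
```

Returning an empty string as the first value is what clears the prompt text box after each submission, while the second value replaces the contents of the chat window.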




## Test
You can try the following dialog with the chatbot by entering the prompts shown after `Human:` in the UI launched in your web browser.

### Example
Human: <blockquote>Describe the four seasons in Earth's northern hemisphere</blockquote>
AI: <blockquote>Hello! I'd be happy to help you with that! The four seasons in the northern hemisphere are: 
- Spring: This season usually starts in March and lasts until May. During this time, the weather starts to warm up, and plants begin to grow and bloom. It's a great time to go outside and enjoy nature!
- Summer: Summer typically starts in June and ends in August. This is the warmest season of the year, with long days and plenty of sunshine. It's perfect for spending time outdoors, whether you're going to the beach, having a picnic, or just enjoying the sunshine.
- Autumn (or Fall): Autumn starts in September and ends in November. During this season, the leaves on the trees start to change color and fall off, creating beautiful scenery. The weather can be a bit cooler, but it's still a great time to enjoy the outdoors.
- Winter: Winter usually starts in December and ends in February. This is the coldest season of the year, with shorter days and less sunlight. It's a great time to cozy up indoors with a good book or a warm cup of tea.
I hope that helps! Let me know if you have any other questions.</blockquote>

Human: <blockquote>And what happens in the southern hemisphere during that time?</blockquote>
AI: <blockquote>Hello! I'd be happy to help you with that! 
In the southern hemisphere, the seasons are reversed compared to the northern hemisphere. So, while it's winter in the north, it's summer in the south, and vice versa...</blockquote>

Human: <blockquote>What about the Equator region of the planet? Are there any seasons? </blockquote>
AI: <blockquote>AI:  Hello! I'd be happy to help you with that!
The Equator region of the planet experiences a unique climate known as the equatorial climate. Since the Equator is located at zero degrees latitude, the sun's rays hit the region directly, resulting in high temperatures and high humidity throughout the year.</blockquote>
