# Implementing a chatbot with Llama-2-13b-chat served by vLLM, through the Gradio UI
- [**vLLM**](https://vllm.readthedocs.io/en/latest/) is a fast and easy-to-use library for LLM inference and serving.
- [**Gradio**](https://www.gradio.app/) is a fast way to demo machine learning models with a friendly web interface.

### Requirements
Please follow the instructions provided by the `README.md` file from the 
[LLM-fine-tuning-example](https://github.com/vmware-ai-labs/VMware-generative-ai-reference-architecture/tree/main/Examples/LLM-fine-tuning-example) directory to create a conda virtual environment with all the libraries used in this notebook.

To use Llama 2 from the Transformers library, you need approval from Meta and Hugging Face.
- Please follow these [instructions provided by Hugging Face](https://huggingface.co/meta-llama) before continuing.
- Then [create a Hugging Face user access token](https://huggingface.co/docs/hub/security-tokens) to be able to download Llama 2 models from the HF Hub.
- Once you have created an access token, log in to the Hugging Face Hub from a shell terminal using either of these options:
    ```
    # Using an interactive command
    huggingface-cli login
  
    # or using an environment variable
    huggingface-cli login --token $HUGGINGFACE_TOKEN
    ```
You can serve Llama-2-13B-chat on an A100 (40GB) GPU using vLLM by running the following command:
    `python -m vllm.entrypoints.openai.api_server --model="meta-llama/Llama-2-13b-chat-hf"`
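
Before moving on, it can help to confirm that the server is reachable. The following is a minimal sketch, assuming the server listens on the default address `http://localhost:8000` and that its `/v1/models` route returns the OpenAI-style list payload (`{"data": [{"id": ...}]}`) that this notebook reads later. The helper names `models_url`, `served_model_ids`, and `fetch_models` are hypothetical and not part of vLLM.

```python
import json
from urllib.request import urlopen


def models_url(api_base: str) -> str:
    """Build the URL of the model-listing route from an OpenAI-style base URL."""
    return api_base.rstrip("/") + "/models"


def served_model_ids(payload: dict) -> list:
    """Extract the model ids from an OpenAI-style model list payload."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_models(api_base: str = "http://localhost:8000/v1", timeout: float = 5.0) -> list:
    """Query the server and return the ids of the models it serves.

    Raises urllib.error.URLError if the server cannot be reached.
    """
    with urlopen(models_url(api_base), timeout=timeout) as response:
        return served_model_ids(json.load(response))
```

Calling `fetch_models()` while the server is running should return a list that contains the served model name; a connection error suggests the server has not finished loading the model yet.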

Now, you should be all set to get started!

## Imports section

In [None]:
# >> We won't use OpenAI's LLMs, but thanks to vLLM
# we can consume open-source LLMs via the OpenAI Python API
import openai
import gradio as gr

## Generation parameters initialization and helper functions definitions

In [None]:
# >> Prompt template for Llama 2 chat
# We follow the prompt formatting guidelines from Meta's paper about Llama 2 chat.
# For more info, see https://huggingface.co/meta-llama/Llama-2-13b-chat-hf

P_TEMPLATE = """<s>[INST] <<SYS>>
You are a helpful, respectful, and honest assistant. 
Always provide helpful, truthful, and safe answers. Safety must be your highest priority.
Your answers must not contain harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. 
Please ensure that your responses are socially unbiased and positive in tone and nature.
If an instruction doesn't make sense or is not coherent, inform the user instead of trying to answer. 
Don't provide false information if you don't know the answer to a question.
<</SYS>>

{user_message} [/INST]"""

# >> Set the initialization parameters for the OpenAI API client.
# Note: `openai.api_base`, `openai.Model`, and `openai.Completion` belong to the pre-1.0 `openai` package interface.
openai.api_key = "EMPTY" # Not using the OpenAI service
openai.api_base = "http://localhost:8000/v1" # Default vLLM serving endpoint
models = openai.Model.list() # List models API
model = models["data"][0]["id"] # Get the Llama 2 model name from the vLLM service

# Query the OpenAI-compatible API from a vLLM server
def generate(user_message, max_tokens=500, temperature=0.1, top_p=.9):
    """
    Generates a prompt completion using an LLM service.
    :param user_message: Formatted prompt to get completed by the LLM
    :param max_tokens: Max num. of tokens to generate
    :param temperature: Sampling temperature; higher values make the output more random.
    :param top_p: Nucleus sampling threshold; only the most likely tokens whose cumulative probability reaches top_p are considered.
    :return: Prompt completion as a string
    """
    completion = openai.Completion.create(
        model=model, # vLLM service name for Llama 2
        prompt=user_message.strip(), 
        echo=False, 
        n=1, # Return a single completion
        max_tokens=max_tokens, 
        stream=False, # Streaming complicates content moderation
        repetition_penalty=1.1, # Discourage the LLM from repeating itself
        temperature=temperature, 
        top_p=top_p, # Exclude low-probability tokens from sampling
    )
    return completion["choices"][0]["text"]
    
# Format new prompts including previous dialog history.
def format_chat_prompt(new_message, conversation, prompt=P_TEMPLATE):
    """
    Formats a new user message to comply with the Llama 2 chat prompt template:
    
    <s>[INST] <<SYS>>
    {your_system_message}
    <</SYS>>
    
    {user_message_1} [/INST] {model_reply_1} </s><s>[INST] {user_message_2} [/INST] {model_reply_2} </s>...
    
    :param new_message: New message from the user in a conversation
    :param conversation: The conversation history between a user and the LLM
    :param prompt: Prompt template to format
    :return: A formatted prompt (conversation context included)
    """
    if len(conversation) == 0:
        # Use format for the first user message
        return prompt.format(user_message=new_message)

    for i, turn in enumerate(conversation): # Iterate over conversation history
        user, assistant = turn # Gradio keeps interactions as a list of (user, assistant) pairs
        if i:
            # Later turns: append each completed exchange as its own <s>...</s> segment
            prompt = prompt + f"<s>[INST] {user} [/INST] {assistant} </s>"
        else:
            # First turn: fill the system-prompt template with the first user message
            prompt = prompt.format(user_message=user)
            prompt = prompt + f" {assistant} </s>"
    
    # Open a new <s>[INST] segment for the new message
    return prompt + f"<s>[INST] {new_message} [/INST]"

def respond(message, conversation, max_tokens=500, temperature=0.1, top_p=0.9):
    """
    Generates a completion for a new user message, using the previous
    conversation (if any) as context.
    :param message: New, unformatted message from the user
    :param conversation: Conversation history
    :param max_tokens: Max num. of tokens to generate
    :param temperature: Sampling temperature; higher values make the output more random.
    :param top_p: Nucleus sampling threshold (cumulative probability cutoff).
    :return: An empty string (to clear the text box) and the updated conversation for the chatbot UI
    """
    formatted_prompt = format_chat_prompt(message, conversation)
    bot_message = generate(formatted_prompt, max_tokens, temperature, top_p)
    conversation.append((message, bot_message))
    return "", conversation
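
To make the multi-turn layout concrete, here is a minimal, self-contained sketch of the same idea. It assumes the conversation is a list of `(user, assistant)` pairs, as in the cell above. The function name `build_llama2_prompt` and the shortened system message are hypothetical and only serve the illustration.

```python
def build_llama2_prompt(system_message, conversation, new_message):
    """Assemble a Llama 2 chat prompt from a system message, past turns, and a new message."""
    # The system block is embedded in the first [INST] segment only.
    system_block = f"<<SYS>>\n{system_message}\n<</SYS>>\n\n"
    messages = [user for user, _ in conversation] + [new_message]
    replies = [assistant for _, assistant in conversation]

    prompt = ""
    for i, user in enumerate(messages):
        prefix = system_block if i == 0 else ""
        prompt += f"<s>[INST] {prefix}{user} [/INST]"
        if i < len(replies):
            # Completed turns carry the model reply and a closing </s>.
            prompt += f" {replies[i]} </s>"
    return prompt


history = [("Hi", "Hello!")]
example = build_llama2_prompt("Be brief.", history, "Name a season.")
print(example)
```

Each completed exchange is wrapped in its own `<s>...</s>` segment, and the prompt ends with an open `[INST] ... [/INST]` segment that the model is expected to complete.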

## Set up the Gradio UI to expose the Llama 2 chatbot

In [None]:
# Set up the Gradio client session
with gr.Blocks() as demo:
    # Set the size of the chat window
    chatbot = gr.Chatbot(height=500, show_copy_button=True, interactive=True)
    # Set the text box for users to enter text.
    msg = gr.Textbox(label="Prompt", interactive=True, show_copy_button=True)
    # Display knobs that control LLM generation options
    with gr.Accordion(label="Generation Options",open=False):
        temp = gr.Slider(label="Temperature", minimum=0.1, maximum=1, value=0.1, step=0.1)
        top_p = gr.Slider(label="Top_p", minimum=0.1, maximum=1, value=0.9, step=0.1)
        max_tokens = gr.Slider(label="Max # of Tokens", minimum=128, maximum=1024, value=512, step=32)
    # Expose 'submit' button
    btn = gr.Button("Submit")
    # Expose "clear console" button
    clear = gr.ClearButton(components=[msg, chatbot], value="Clear Console")
    # Submit the prompt completion request on a click event
    btn.click(respond, inputs=[msg, chatbot, max_tokens, temp, top_p], outputs=[msg, chatbot])
    # Submit the prompt completion request on an Enter key press event
    msg.submit(respond, inputs=[msg, chatbot, max_tokens, temp, top_p], outputs=[msg, chatbot])

# Close pre-existing gradio clients
gr.close_all()

# Launch the Gradio demo in a new browser tab
demo.launch(
    inline=False,
    inbrowser=True,
    share=False,
    show_error=True,
)
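
The Temperature and Top_p sliders control how the next token is sampled. The sketch below illustrates the general idea in plain Python on a toy distribution. It is an illustration of the technique under simplifying assumptions, not vLLM's implementation, and `apply_temperature` and `top_p_filter` are hypothetical helper names.

```python
import math


def apply_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # Subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1
    return {i: probs[i] / norm for i in kept}


probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(probs, 0.9))  # Token indices 0, 1, and 2 survive
print(top_p_filter(probs, 0.5))  # Only token index 0 survives
```

A lower temperature concentrates probability on the most likely token, and a lower top_p shrinks the set of candidate tokens, so both settings make the output more deterministic.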

## Test
You can try the following dialog with the chatbot by entering the prompts shown after `Human:` in the UI launched in your web browser.

### Example
Human: <blockquote>Describe the four seasons in Earth's northern hemisphere</blockquote>
AI: <blockquote>Hello! I'd be happy to help you with that! The four seasons in the northern hemisphere are:
- Spring: This season usually starts in March and lasts until May. During this time, the weather starts to warm up, and plants begin to grow and bloom. It's a great time to go outside and enjoy nature!
- Summer: Summer typically starts in June and ends in August. This is the warmest season of the year, with long days and plenty of sunshine. It's perfect for spending time outdoors, whether you're going to the beach, having a picnic, or just enjoying the sunshine.
- Autumn (or Fall): Autumn starts in September and ends in November. During this season, the leaves on the trees start to change color and fall off, creating beautiful scenery. The weather can be a bit cooler, but it's still a great time to enjoy the outdoors.
- Winter: Winter usually starts in December and ends in February. This is the coldest season of the year, with shorter days and less sunlight. It's a great time to cozy up indoors with a good book or a warm cup of tea.
I hope that helps! Let me know if you have any other questions.</blockquote>

Human: <blockquote>And what happens in the southern hemisphere during that time?</blockquote>
AI: <blockquote>Hello! I'd be happy to help you with that!
In the southern hemisphere, the seasons are reversed compared to the northern hemisphere. So, while it's winter in the north, it's summer in the south, and vice versa...</blockquote>

Human: <blockquote>What about the Equator region of the planet? Are there any seasons? </blockquote>
AI: <blockquote>Hello! I'd be happy to help you with that!
The Equator region of the planet experiences a unique climate known as the equatorial climate. Since the Equator is located at zero degrees latitude, the sun's rays hit the region directly, resulting in high temperatures and high humidity throughout the year.</blockquote>
