<a href="https://colab.research.google.com/github/acastellanos-ie/NLP-MBD-EN/blob/main/qa_practice_dl/LLama2_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chatbot Creation with Gradio and HuggingFace

In this practice notebook, we will build an interactive chatbot. It will be powered by [Llama 2](https://ai.meta.com/resources/llama-2/), an openly available large language model from Meta AI, and it will have a user interface built with [Gradio](https://gradio.app/). This combination lets us put a state-of-the-art natural language processing (NLP) model behind a user-friendly chat interface.

This practice is based on the following blog: [How to Create LLaMa 2 Chatbot with Gradio and Hugging Face in Free Colab](https://pub.towardsai.net/how-to-create-llama-2-chatbot-with-gradio-and-hugging-face-in-free-colab-729fe4fb5734)

## Overview of Tools
- **[HuggingFace](https://huggingface.co/)**: A platform dedicated to hosting and sharing transformer models. We'll be selecting and utilizing a pre-trained LLM from HuggingFace's vast Model Hub.
- **[Gradio](https://gradio.app/)**: An easy-to-use library for creating interactive machine learning demos with just a few lines of code. It will serve as the interface through which users can interact with our chatbot.
- **[Llama 2](https://ai.meta.com/resources/llama-2/)**: The second generation of Meta's large language model, trained on 2 trillion tokens and available for both research and commercial use.

## Objectives
- Select and load a pre-trained LLM from HuggingFace.
- Create a function to process user inputs and generate chatbot responses using the LLM.
- Set up a Gradio interface for interactive user engagement with the chatbot.
- Test, iterate, and refine the chatbot to improve its performance and user experience.

By the end of this practice, you will have a functioning chatbot that you can interact with in real time and share with others. Now, let's dive into the initial setup and start building our chatbot!

---




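# Installing the Necessary Libraries
Before diving into the code, we need to install the essential libraries. We'll use `transformers` (together with `torch` and `accelerate`) to access the Llama 2 model and tokenizer, and `gradio` to create the interactive interface. The imports themselves appear in the cells where each library is first used.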

In [1]:
!pip install -Uqqq transformers torch accelerate
!pip install -Uqqq --upgrade gradio
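
Because the install commands above pull the newest releases, the exact versions will differ from run to run, and some arguments used later in this notebook behave differently across versions. The following is a minimal sketch for recording what was installed. The helper name `installed_versions` is our own, not part of any library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a {package: version-or-None} mapping (hypothetical helper)."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in this environment
            found[name] = None
    return found


versions = installed_versions(["transformers", "torch", "accelerate", "gradio"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

Keeping this output alongside the notebook makes it easier to reproduce a run later.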



# Authentication with Hugging Face

Before we can proceed with creating our chatbot, it's essential to authenticate ourselves with Hugging Face. Hugging Face hosts a plethora of pre-trained models and datasets which will be invaluable for our project. By logging in, we ensure a seamless interaction with the Hugging Face platform, allowing us to fetch necessary resources for our chatbot.

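Run the command below. You'll be prompted to enter a Hugging Face access token, which you can generate at https://huggingface.co/settings/tokens. If you don't have an account yet, [create a HuggingFace account](https://hf.co/join) first. Llama 2 is a gated model, so the account must also have been granted access on the model's page.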


In [2]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
The token `sd_training` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `sd_training`
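
As an alternative to the interactive prompt, the login can also be done from Python. The sketch below assumes the token has been stored in an environment variable named `HF_TOKEN` (our assumption; any secure storage would do) and uses `login` from the `huggingface_hub` package. The helper `login_from_env` is hypothetical.

```python
import os


def login_from_env(var_name: str = "HF_TOKEN") -> bool:
    """Log in with a token read from an environment variable (hypothetical helper).

    Returns False without contacting the Hub when the variable is unset.
    """
    token = os.environ.get(var_name)
    if not token:
        return False
    # Imported lazily so the function can be defined without the package installed
    from huggingface_hub import login

    login(token=token)
    return True
```

Calling `login_from_env()` at the top of the notebook avoids pasting the token into a cell, where it could end up saved in the notebook's output.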


# Loading the LLaMa 2 Model and Tokenizer

In this step, we select the Llama 2 checkpoint and load its tokenizer using the `transformers` library. The model we are using is `meta-llama/Llama-2-7b-chat-hf`, which is hosted on Hugging Face's Model Hub. The model weights themselves are downloaded in the next step, when the pipeline is created.

1. **Model**: The chosen model, `meta-llama/Llama-2-7b-chat-hf`, is a large language model trained specifically for chat-based interactions. It will serve as the core engine for generating responses to user queries in our chatbot.
2. **Tokenizer**: Tokenizers are crucial for preparing our text data into a format that the model can understand. In this case, we are using `AutoTokenizer` which will automatically select the correct tokenizer based on the model we specify.

The `token=True` argument in the `AutoTokenizer.from_pretrained` function tells `transformers` to use the token saved by `huggingface-cli login`, which is required to download gated models such as Llama 2. Older tutorials pass `use_auth_token=True` instead; recent `transformers` releases deprecate that argument in favor of `token`.

**Be aware that this code might take a while, since it needs to download files from Hugging Face.**

**You might need a Colab Pro account to run this notebook.**

In [3]:
from transformers import AutoTokenizer

# Hub ID of the chat-tuned 7B checkpoint; the weights are loaded later by the pipeline
model = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model, token=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

# Setting Up the LLaMa Pipeline

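Now that we have loaded the tokenizer, the next step is to set up a pipeline for text generation. The `pipeline` function from the `transformers` library provides a high-level, easy-to-use API for applying transformer models to various tasks. In our case, we use the `text-generation` task, since we aim to generate responses for our chatbot. Creating the pipeline downloads and loads the model weights: `torch_dtype=torch.float16` loads them in half precision to reduce memory use, and `device_map="auto"` lets `accelerate` place them on the available GPU.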


In [4]:
from transformers import pipeline
import torch

llama_pipeline = pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto"
)

config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

Device set to use cuda:0
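
The two weight shards listed in the download log above are 9.98 GB and 3.50 GB, about 13.5 GB in total. The short calculation below shows where that figure comes from and why half precision matters. It assumes the model has roughly 6.74 billion parameters, a number that is not stated in this notebook, and it counts only the weights, not activations or the generation cache.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 6.74e9  # assumed parameter count for the 7B checkpoint

fp16_gb = weight_memory_gb(NUM_PARAMS, 2)  # float16: 2 bytes per parameter
fp32_gb = weight_memory_gb(NUM_PARAMS, 4)  # float32: 4 bytes per parameter

print(f"float16: {fp16_gb:.2f} GB, float32: {fp32_gb:.2f} GB")
```

Under this assumption, the float16 estimate matches the combined shard size, while float32 would need about twice as much memory. This is a likely reason the notebook can require a GPU with more memory than some Colab runtimes provide.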


# Preparing Input Data for LLaMa Model

In order to effectively interact with the LLaMa model, we need to format the input data in a specific manner. This includes crafting a system prompt and formatting the conversation history along with the current message to fit the model's expectations. Here's what the code does:

1. **Defining System Prompt**:
    - We create a constant `SYSTEM_PROMPT` that sets the stage for the interaction, instructing the model to behave as a helpful bot providing clear and concise answers.

2. **Formatting Function**:
    - The `format_message` function is designed to format the current message and past conversation history into a string that can be fed to the LLaMa model.
    - Parameters:
        - `message (str)`: The current message to send to the model.
        - `history (list)`: A list containing past interactions.
        - `memory_limit (int)`: A limit on how many past interactions to consider (default is 3).

    - The function ensures that the history is within the specified `memory_limit`.
    - If the conversation history is empty, it simply attaches the current message to the `SYSTEM_PROMPT`.
    - If there's existing conversation history, it formats the history and the current message together in a way that the model can understand the context of the conversation.

For more details, please refer to the original [blog post](https://pub.towardsai.net/how-to-create-llama-2-chatbot-with-gradio-and-hugging-face-in-free-colab-729fe4fb5734).
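
The formatting function in this notebook expects each entry of `history` to be a `(user_message, model_answer)` pair. Depending on the Gradio version and on how the chat interface is configured, the history may instead arrive as a list of dictionaries with `role` and `content` keys. The sketch below converts that second shape into pairs. The helper `pairs_from_messages` is hypothetical, and it assumes that `content` is a plain string and that user and assistant turns alternate.

```python
def pairs_from_messages(messages: list) -> list:
    """Convert role/content dicts into (user, assistant) pairs (hypothetical helper)."""
    pairs = []
    pending_user = None
    for msg in messages:
        role, content = msg.get("role"), msg.get("content")
        if role == "user":
            # Remember the question until its answer shows up
            pending_user = content
        elif role == "assistant" and pending_user is not None:
            pairs.append((pending_user, content))
            pending_user = None
    return pairs


example = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "What is Gradio?"},
    {"role": "assistant", "content": "A UI library."},
]
print(pairs_from_messages(example))
```

A user message that has not been answered yet is dropped, which is the desired behavior here because the current message is passed to the formatting function separately.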

In [5]:
SYSTEM_PROMPT = """<s>[INST] <<SYS>>
You are a helpful bot. Your answers are clear and concise.
<</SYS>>

"""

# Formatting function for message and history
def format_message(message: str, history: list, memory_limit: int = 3) -> str:
    """
    Formats the message and history for the Llama model.

    Parameters:
        message (str): Current message to send.
        history (list): Past conversation history.
        memory_limit (int): Limit on how many past interactions to consider.

    Returns:
        str: Formatted message string
    """
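    # Keep only the most recent `memory_limit` exchanges.
    # Guard against memory_limit == 0, where history[-0:] would keep everything.
    if len(history) > memory_limit:
        history = history[-memory_limit:] if memory_limit > 0 else []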

    if len(history) == 0:
        return SYSTEM_PROMPT + f"{message} [/INST]"

    formatted_message = SYSTEM_PROMPT + f"{history[0][0]} [/INST] {history[0][1]} </s>"

    # Handle conversation history
    for user_msg, model_answer in history[1:]:
        formatted_message += f"<s>[INST] {user_msg} [/INST] {model_answer} </s>"

    # Handle the current message
    formatted_message += f"<s>[INST] {message} [/INST]"

    return formatted_message
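
To make the prompt layout concrete, here is a compact, standalone restatement of the same formatting rules, so the resulting string can be inspected without loading the model. The names `SYSTEM` and `build_prompt` are ours and exist only for this illustration; the history truncation step is left out.

```python
SYSTEM = (
    "<s>[INST] <<SYS>>\n"
    "You are a helpful bot. Your answers are clear and concise.\n"
    "<</SYS>>\n\n"
)


def build_prompt(message: str, history: list, system: str = SYSTEM) -> str:
    """Assemble a Llama 2 chat prompt from (user, answer) pairs and a new message."""
    prompt = system
    for i, (user_msg, answer) in enumerate(history):
        # The first exchange continues the [INST] block opened by the system prompt
        prefix = "" if i == 0 else "<s>[INST] "
        prompt += f"{prefix}{user_msg} [/INST] {answer} </s>"
    # The current message is left open so the model writes the next answer
    prompt += f"<s>[INST] {message} [/INST]" if history else f"{message} [/INST]"
    return prompt


print(build_prompt("What is Gradio?", [("Hi", "Hello!")]))
```

Each completed exchange is closed with `</s>`, while the final `[/INST]` has no answer after it, which is what prompts the model to generate one.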

# Generating Responses with the LLaMa Model

Now that we have structured the input data, the next step is to define a function to generate conversational responses from the LLaMa model based on the user's input and the conversation history.

Here's a breakdown of the `get_llama_response` function:

1. **Function Definition**:
    - `get_llama_response` is defined with two parameters: `message` (the user's input message) and `history` (the past conversation history).
    - The function returns the generated response from the LLaMa model as a string.

2. **Formatting the Input**:
    - The `format_message` function is called to format the `message` and `history` into a string `query` that can be fed to the LLaMa model.

3. **Generating a Response**:
    - The `llama_pipeline` is called with several parameters to control the text generation process:
        - `do_sample=True`: Enables stochastic sampling during generation.
        - `top_k=10`: Limits the sampling pool to the top 10 most probable tokens at each step.
        - `num_return_sequences=1`: Specifies that only one sequence should be returned.
        - `eos_token_id=tokenizer.eos_token_id`: Specifies the end-of-sequence (EOS) token ID, which stops generation when the model emits it.
        - `max_length=1024`: Caps the total sequence length in tokens, counting the prompt as well as the generated text.

4. **Extracting the Generated Text**:
    - The generated text is extracted from the `sequences` object, and the prompt (`query`) is removed from the beginning of the generated text to isolate the model's response.

5. **Outputting and Returning the Response**:
    - The generated response is printed to the console and returned by the function.
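
To illustrate what `do_sample=True` with `top_k` means, the sketch below applies top-k sampling to a toy list of scores using only the standard library. It shows the idea, not the `transformers` implementation, which works on tensors over the full vocabulary. The function names and the example scores are invented for the illustration.

```python
import math
import random


def top_k_probs(logits, k):
    """Keep the k highest-scoring token ids and renormalize with a softmax."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    weights = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def sample_top_k(logits, k, rng):
    """Draw one token id at random from the renormalized top-k distribution."""
    probs = top_k_probs(logits, k)
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]


logits = [2.0, 0.5, 1.0, -1.0, 3.0]  # toy scores for a 5-token vocabulary
probs = top_k_probs(logits, k=2)
token = sample_top_k(logits, k=2, rng=random.Random(0))
print(probs, token)
```

With `k=2`, only the two best-scoring tokens can ever be drawn. A smaller `top_k` makes the output more predictable, and a larger one makes it more varied.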



In [6]:
# Generate a response from the Llama model
def get_llama_response(message: str, history: list) -> str:
    """
    Generates a conversational response from the Llama model.

    Parameters:
        message (str): User's input message.
        history (list): Past conversation history.

    Returns:
        str: Generated response from the Llama model.
    """
    query = format_message(message, history)
    response = ""

    sequences = llama_pipeline(
        query,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=1024,
    )

    generated_text = sequences[0]['generated_text']
    response = generated_text[len(query):]  # Remove the prompt from the output

    print("Chatbot:", response.strip())
    return response.strip()
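
The prompt-removal step can be checked in isolation. The sketch below mocks the pipeline's return value, a list of dictionaries with a `generated_text` key that holds the prompt followed by the continuation, and applies the same slicing. The example strings are invented.

```python
query = "<s>[INST] Hi [/INST]"

# Mocked pipeline output: the prompt followed by the model's continuation
sequences = [{"generated_text": query + " Hello! How can I help?"}]

generated_text = sequences[0]["generated_text"]
response = generated_text[len(query):].strip()  # drop the prompt, then trim whitespace

print(response)
```

The slicing relies on the returned text starting with the exact prompt string, which is why the prompt length is measured in characters and not in tokens.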

# Creating and Launching the Chat Interface with Gradio

With the Llama model set up and ready to generate responses, the final step is to create a user interface for interacting with the chatbot.

In this step, we'll use Gradio's `ChatInterface`, which calls our function with the current message and the conversation history, and then launch it.


In [7]:
import gradio as gr

gr.ChatInterface(get_llama_response).launch()



Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://a89af170559d896a5e.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
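
When iterating on the interface, it can help to check the UI wiring without waiting for the model to load. The sketch below defines a stand-in function with the same `(message, history)` signature as `get_llama_response`. The name `echo_response` is ours and exists only for this illustration.

```python
def echo_response(message: str, history: list) -> str:
    """Stand-in for get_llama_response: same signature, no model required."""
    return f"You said: {message} (history entries: {len(history)})"


print(echo_response("Hello", []))
```

Passing `echo_response` to `gr.ChatInterface` in place of `get_llama_response` should bring up the same chat window, which makes layout or sharing problems easier to separate from model problems.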




# Conclusion

We have successfully built an interactive chatbot leveraging the LLaMa 2 model for natural language processing and Gradio for user interface creation.

With the skills acquired, you are now equipped to further explore and experiment with different large language models, refine the chatbot's performance, or even deploy it in real-world applications. Keep experimenting, and continue expanding your knowledge in the realm of conversational AI and user interface design!
