# Homework - Chatbots with Gradio

Gradio is a web app framework designed to facilitate the development and deployment of ML and DL apps. Have a look at [their website](https://www.gradio.app).

The following adapts their [Quickstart Guide](https://www.gradio.app/guides/quickstart).

Notebook by [Jérémie C. Wegner](https://jeremiewenger.com/about/).

In [6]:
!pip install gradio



### Example using a local, open-source LLM with Hugging Face

See [this part](https://www.gradio.app/guides/creating-a-chatbot-fast#example-using-a-local-open-source-llm-with-hugging-face).

The model we use is [RedPajama](https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3B-v1) (pretty big, best run on Colab!), see also [this post for the dataset](https://www.together.ai/blog/redpajama-data-v2). Chat models are regular language models finetuned on specific chat datasets (especially, they include markers for "user input" and "assistant responses", as well as, sometimes, overall directives like "system prompt" (defining the overall identity of the bot). In Huggingface, you would recognise them as having a "-chat" identifier, for instance for the [Llama 2 family](https://huggingface.co/meta-llama).

Docs:
- [StopingCriteria](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.StoppingCriteria) \(see also [this nice post](https://discuss.huggingface.co/t/implimentation-of-stopping-criteria-list/20040/2)\)
- [TextIteratorStreamer](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.TextIteratorStreamer)

In [4]:
import torch

# Get cpu, gpu or mps device for training.
# See: https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html#creating-models
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
from transformers import StoppingCriteria
from transformers import StoppingCriteriaList
from transformers import TextIteratorStreamer

from threading import Thread

#### Note

This chatbot uses almost all the memory of a free Colab instance. Unfortunately, I haven't been able to free the memory so that I would be able to restart this app for debugging without restarting the runtime (and re-downloading the model) 😬.

The upside is: it is quite powerful! Try speak to it in different languages, or ask it code questions!

In [2]:
from pathlib import Path
from huggingface_hub import notebook_login
if not (Path.home()/'.huggingface'/'token').exists():
    notebook_login()


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [7]:
MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
model = model.to(device) # move model to GPU

class StopOnTokens(StoppingCriteria):
    """
    Class used `stopping_criteria` in `generate_kwargs` that provides an additional
    way of stopping the generation loop (if this class returns `True` on a token,
    the generation is stopped)).
    """
    # note: Python now supports type hints, see this: https://realpython.com/lessons/type-hinting/
    #       (for the **kwargs see also: https://realpython.com/python-kwargs-and-args/)
    # this could also be written: def __call__(self, input_ids, scores, **kwargs):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        stop_ids = [29, 0] # see the cell below to understand where these come from
        for stop_id in stop_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False

def predict(message, history):

    history_transformer_format = history + [[message, ""]]
    stop = StopOnTokens()

    # useful to debug
    # msg = "history"
    # print(msg)
    # print(*history_transformer_format, sep="\n")
    # print("***")

    # at each step, we feed the entire history in string format,
    # restoring the format used in their dataset with new lines
    # and <human>: or <bot>: added before the messages
    messages = "".join(
        ["".join(
            ["\n<human>:"+item[0], "\n<bot>:"+item[1]]
         )
        for item in history_transformer_format]
    )
    # to see what we feed to our net:
    # msg = "string prompt"
    # print(msg)
    # print("-" * len(msg))
    # print(messages)
    # print("-" * 40)

    # convert the string into tensors & move to GPU
    model_inputs = tokenizer([messages], return_tensors="pt").to(device)

    streamer = TextIteratorStreamer(
        tokenizer,
        # timeout=30.,    # without the timeout, if there's an issue the bot will hang indefinitely
        skip_prompt=True, # (haven't implemented the error handling yet 🙈)
        skip_special_tokens=True
    )

    generate_kwargs = dict(
        model_inputs,
        streamer=streamer,
        max_new_tokens=1024,
        do_sample=True,
        top_p=0.95,
        top_k=1000,
        temperature=1.0,
        pad_token_id=tokenizer.eos_token_id, # mute annoying warning: https://stackoverflow.com/a/71397707
        num_beams=1,  # this is for beam search (disabled), see: https://huggingface.co/blog/how-to-generate#beam-search
        stopping_criteria=StoppingCriteriaList([stop])
    )
    t = Thread(target=model.generate, kwargs=generate_kwargs)
    t.start()

    partial_message  = ""
    for new_token in streamer:
        # seen the format <human>: and \n<bot> above (when 'messages' is defined)?
        # we stream the message *until* we encounter '<', which is by the end
        if new_token != '<':
            partial_message += new_token
            yield partial_message


gr.ChatInterface(predict).queue().launch(debug=True)

tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacity of 8.00 GiB of which 0 bytes is free. Of the allocated memory 7.28 GiB is allocated by PyTorch, and 1.71 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

How do we know what the stop words are? (This is in part a design choice!)

In [8]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
print("The model stop words are:")
for tok in [29, 0]:
    print(f"  - `{tokenizer.decode([tok])}`")

print("If you wanted to know what token was associated with `<`, you'd do the opposite:")
print("`<` encoded as:", tokenizer.encode("<"))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


The model stop words are:
  - `<`
  - `<|endoftext|>`
If you wanted to know what token was associated with `<`, you'd do the opposite:
`<` encoded as: [29]


---

## Experiments

- You could try to modify this code to work with the latest Llama models by Meta (you must register on [their site](https://ai.meta.com/llama/), then on Huggingface once you get permission, to be able to download the code). After that (same as with various restricted models/datasets/etc. on the Hub), you would need to log into HF:
```python
from pathlib import Path
from huggingface_hub import notebook_login
if not (Path.home()/'.huggingface'/'token').exists():
    notebook_login()
```
- Another example that would allow you to play with the cutting-edge LLMs is the [OpenAI example](https://www.gradio.app/guides/creating-a-chatbot-fast#a-streaming-example-using-openai) in the Gradio tutorial. You would first need to register (with credit card) and get an API key on [their website](https://platform.openai.com/)...

- Gradio ships with a [`Flagging`](https://www.gradio.app/guides/key-features#styling) logic, that allows you to harvest data from your users for free! You can also implement [`likes`](https://www.gradio.app/guides/creating-a-custom-chatbot-with-blocks#liking-disliking-chat-messages), that could be interesting!

- The current trend these days is to work with multimodality (systems that are able to handle more than one type of data: text and images, for instance, or text and music). See [this last part](https://www.gradio.app/guides/creating-a-custom-chatbot-with-blocks#adding-markdown-images-audio-or-videos) of the Gradio Chatbot tutorial for examples, as well as the two apps they recommend [project-baize/Baize-7B](https://huggingface.co/spaces/project-baize/chat-with-baize) and [MAGAer13/mPLUG-Owl](https://huggingface.co/spaces/MAGAer13/mPLUG-Owl) (and as said you could clone these projects, study the code, and transform them into your own project)!
