# Notebook 5.1: ChatBot

You can use BigDL-LLM to load any Hugging Face *transformers* model for acceleration on your laptop. With BigDL-LLM, PyTorch models (in FP16/BF16/FP32) hosted on Hugging Face can be loaded and optimized automatically with low-bit quantizations (supported precisions include INT4/NF4/INT5/INT8).

This notebook dives into the detailed usage of the BigDL-LLM `transformers`-style API. In Section 5.1.2, you will learn how to load a transformers model for different situations. Section 5.1.3 will walk you through the process of building a chatbot using the loaded model. You'll start from a simple form, and then add capabilities step by step, e.g. history management (for multi-turn chat) and streaming.

## 5.1.1 Install BigDL-LLM

First of all, install BigDL-LLM in your prepared environment. For best practices of environment setup, refer to [Chapter 2](../ch_2_Environment_Setup/README.md) in this tutorial.


In [None]:
!pip install bigdl-llm[all]

## 5.1.2 Load Model


Now let's load the model. We'll use [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example.

### 5.1.2.0 Download Llama 2 (7B)

To download the [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model from Hugging Face, you will need to obtain access granted by Meta. Please follow the instructions provided [here](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main) to request access to the model.

After receiving the access, download the model with your Hugging Face token:

In [None]:
from huggingface_hub import snapshot_download

model_path = snapshot_download(repo_id='meta-llama/Llama-2-7b-chat-hf',
                              token='hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') # change it to your own Hugging Face access token


> **Note**
>
> The model will by default be downloaded to `HF_HOME='~/.cache/huggingface'`.
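
If you are unsure where the download landed, the minimal sketch below resolves the default location from the environment. It assumes the usual Hugging Face convention that downloaded repositories sit in a `hub` subfolder of `HF_HOME`, and it ignores any other override variables. The helper name `default_hf_hub_cache` is hypothetical and not part of any library; the `model_path` returned by `snapshot_download` remains the authoritative location.

```python
import os

def default_hf_hub_cache(environ=None):
    """Best-effort guess of the Hugging Face hub cache directory (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    # Assumption: HF_HOME falls back to ~/.cache/huggingface when it is unset.
    hf_home = environ.get("HF_HOME", os.path.join("~", ".cache", "huggingface"))
    # Assumption: downloaded model snapshots are stored under the "hub" subfolder.
    return os.path.join(os.path.expanduser(hf_home), "hub")

print(default_hf_hub_cache())
```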

### 5.1.2.1 Load Model in Low Precision

One common use case is to load a Hugging Face *transformers* model in low precision, i.e. to perform **implicit** quantization while loading.

For Llama 2 (7B), you can simply import `bigdl.llm.transformers.AutoModelForCausalLM` instead of `transformers.AutoModelForCausalLM`, and pass either `load_in_4bit=True` or the `load_in_low_bit` parameter to the `from_pretrained` function. Compared to the Hugging Face *transformers* API, only minor code changes are required.

**For INT4 Optimizations (with `load_in_4bit=True`):**

In [1]:
from bigdl.llm.transformers import AutoModelForCausalLM

model_in_4bit = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="../chat-7b-hf/",
                                                     load_in_4bit=True)

2023-11-17 09:48:45,428 - INFO - Converting the current model to sym_int4 format......


> **Note**
>
> BigDL-LLM supports `AutoModel`, `AutoModelForCausalLM`, `AutoModelForSpeechSeq2Seq` and `AutoModelForSeq2SeqLM`.

**For INT8 Optimizations (with `load_in_low_bit="sym_int8"`):**

```python
# note that the AutoModelForCausalLM here is imported from bigdl.llm.transformers
model_in_8bit = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path="meta-llama/Llama-2-7b-chat-hf",
    load_in_low_bit="sym_int8",
)
```

> **Note**
>
> * Currently, `load_in_low_bit` supports options `'sym_int4'`, `'asym_int4'`, `'sym_int5'`, `'asym_int5'` or `'sym_int8'`, in which 'sym' and 'asym' differentiate between symmetric and asymmetric quantization. Option `'nf4'` is also supported, referring to 4-bit NormalFloat.
>
> *  `load_in_4bit=True` is equivalent to `load_in_low_bit='sym_int4'`.
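
To make the 'sym' and 'asym' distinction concrete, here is a minimal sketch of the two schemes on a short list of weights. It is an illustration only, not BigDL-LLM's actual quantization code (which this notebook does not show); the function names are hypothetical, and a single scale is used for the whole list for simplicity.

```python
def quantize_sym(values, bits=4):
    """Symmetric: integer levels are centered on zero, so only a scale is stored."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize_sym(q, scale):
    return [x * scale for x in q]

def quantize_asym(values, bits=4):
    """Asymmetric: levels span [min, max], so a scale and an offset are stored."""
    levels = 2 ** bits - 1                          # 15 steps for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo

def dequantize_asym(q, scale, lo):
    return [x * scale + lo for x in q]

weights = [0.1, 0.4, 0.9, 1.5]
q_sym, s_sym = quantize_sym(weights)
q_asym, s_asym, offset = quantize_asym(weights)
print("sym :", q_sym, dequantize_sym(q_sym, s_sym))
print("asym:", q_asym, dequantize_asym(q_asym, s_asym, offset))
```

In this toy case the asymmetric variant uses a smaller step because the weights are all positive, at the cost of storing an extra offset.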



### 5.1.2.2 Load Tokenizer 

A tokenizer is also needed for LLM inference. It encodes input text into tensors that are fed to the LLM, and decodes the LLM's output tensors back into text. You can use the [Hugging Face transformers](https://huggingface.co/docs/transformers/index) API to load the tokenizer directly. It can be used seamlessly with models loaded by BigDL-LLM. For Llama 2, the corresponding tokenizer class is `LlamaTokenizer`.


In [2]:
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_name_or_path="../chat-7b-hf/")

### 5.1.2.3 Save & Load Low-Precision Model (Optional)

`from_pretrained` includes a conversion/quantization step, which can be particularly time-consuming or memory-intensive for some large models. To expedite this process, you can use the `save_low_bit` API to store the converted model after it has been loaded for the first time with `from_pretrained`. In subsequent uses, you can call `load_low_bit` instead of `from_pretrained`, which loads the pre-converted model directly and speeds up the process. Saving and loading can be done on different machines.
  
**Save Low-Precision Model**

Let's take the `model_in_4bit` in Section 5.1.2.1 as an example. After loading Llama 2 (7B) in 4 bit, we can use the `save_low_bit` function to save the optimized model:

In [3]:
save_directory='./llama-2-7b-bigdl-llm-4-bit'

model_in_4bit.save_low_bit(save_directory)
del model_in_4bit  # drop the in-memory model; it is reloaded from disk below

We recommend saving the tokenizer in the same directory as the optimized model to simplify the subsequent loading process:

In [None]:
tokenizer.save_pretrained(save_directory)

**Load Low-Precision Model**

We can load the optimized low-bit model through the `load_low_bit` function, and load the tokenizer from the same saved directory:

In [4]:
# note that the AutoModelForCausalLM here is imported from bigdl.llm.transformers
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer
save_directory='./llama-2-7b-bigdl-llm-4-bit'
model_in_4bit = AutoModelForCausalLM.load_low_bit(save_directory)
tokenizer = LlamaTokenizer.from_pretrained(save_directory)

2023-11-17 09:56:00,824 - INFO - Converting the current model to sym_int4 format......
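
The "convert once, reuse afterwards" pattern described above can be wrapped in a small helper. The sketch below is one possible arrangement, not part of BigDL-LLM: `load_or_convert` and its three callable parameters are hypothetical names, chosen so that the caching logic can be tried without loading a real model.

```python
import os

def load_or_convert(save_directory, convert, save, load):
    """Reuse a saved low-bit model if one exists, otherwise convert and save it.

    `convert`, `save` and `load` are caller-supplied callables (hypothetical
    names), which keeps this helper independent of any particular library.
    """
    if os.path.isdir(save_directory) and os.listdir(save_directory):
        return load(save_directory)      # fast path: already converted
    model = convert()                    # slow path: quantize while loading
    save(model, save_directory)          # store the result for next time
    return model
```

With the calls used in this section, `convert` could be `lambda: AutoModelForCausalLM.from_pretrained("../chat-7b-hf/", load_in_4bit=True)`, `save` could be `lambda model, path: model.save_low_bit(path)`, and `load` could be `AutoModelForCausalLM.load_low_bit`.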


## 5.1.3 Run Model

A BigDL-LLM optimized *transformers* model typically runs faster than the original full-precision model. [Chapter 3 Basic Application Develop](../ch_3_AppDev_Basic/) introduces the basics of using an optimized model for direct text completion. In this section we introduce some more advanced usage.



In [9]:
prompt = "what is AI?"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
# predict the next tokens (generation stops at EOS or after max_new_tokens)
output_ids = model_in_4bit.generate(input_ids,
                                max_new_tokens=120)
output_str = tokenizer.decode(output_ids[0][len(input_ids[0]):], # skip prompt in generated tokens
                                  skip_special_tokens=True)

print(output_str)


 Einzeln 2018-12-06 at 15:06

Artificial intelligence (AI) is a branch of computer science that focuses on creating intelligent machines that can perform tasks that typically require human intelligence, such as understanding natural language, recognizing images, making decisions, and solving problems. AI research involves developing algorithms and statistical models that enable computers to perform these tasks, as well as creating systems that can learn from experience and improve their performance over time.
There are several subfields of AI, including:
1.


In [None]:
output_ids
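
The slice passed to `tokenizer.decode` above works because, for decoder-only models such as Llama 2, `generate` returns the prompt tokens followed by the newly generated tokens. A minimal sketch with made-up token IDs (plain lists stand in for tensors) shows the idea:

```python
# Made-up token IDs for illustration only; real IDs come from the tokenizer.
input_ids = [[1, 825, 338, 319, 29902]]                  # a batch with one prompt
output_ids = [[1, 825, 338, 319, 29902, 4101, 928, 2]]   # prompt + new tokens

prompt_length = len(input_ids[0])
new_tokens = output_ids[0][prompt_length:]   # drop the echoed prompt

print(new_tokens)
```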

### 5.1.3.1 Chat

One common application of large language models is the chatbot, where LLMs engage in interactive conversations.
Chatbot interaction is no magic: it still relies on the LLM predicting and generating the next tokens.
To make LLMs chat, we need to properly format the prompts into a conversation format, for example:
```
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant, who always answers as helpfully as possible, while being safe.
<</SYS>>

What is AI? [/INST]

```
Further, to enable a multi-turn chat experience, you need to append the new dialog input to the previous conversation to make a new prompt for the model, for example: 

```
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant, who always answers as helpfully as possible, while being safe.
<</SYS>>

What is AI? [/INST] AI is a term used to describe the development of computer systems that can perform tasks that typically require human intelligence, such as understanding natural language, recognizing images. </s><s>[INST] Is it dangerous? [/INST]
```

Now we show a multi-turn chat example using the official `transformers` API with the BigDL-LLM optimized Llama 2 (7B) model.

First, define the conversation context format <sup>[1]</sup> for the model to complete:

In [6]:
SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."
prompt = [f'<s>[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n']

input_str = "one plus one?"
input_str = input_str.strip()
prompt.append(f'{input_str} [/INST]')
prompt = ''.join(prompt)
print(prompt)

<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant.
<</SYS>>

one plus one? [/INST]


In [8]:
input_ids = tokenizer.encode(prompt, return_tensors="pt")
# predict the next tokens (generation stops at EOS or after max_new_tokens)
output_ids = model_in_4bit.generate(input_ids,
                                max_new_tokens=120)
output_str = tokenizer.decode(output_ids[0][len(input_ids[0]):], # skip prompt in generated tokens
                                 skip_special_tokens=True)


print(output_str)

 Of course! 2 + 1 = 3. How can I assist you further?


In [10]:
def format_prompt(input_str, chat_history):
    SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."
    prompt = [f'<s>[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n']
    do_strip = False
    for history_input, history_response in chat_history:
        history_input = history_input.strip() if do_strip else history_input
        do_strip = True
        prompt.append(f'{history_input} [/INST] {history_response.strip()} </s><s>[INST] ')
    input_str = input_str.strip() if do_strip else input_str
    prompt.append(f'{input_str} [/INST]')
    # print(''.join(prompt))  # uncomment to inspect the full prompt sent to the model
    return ''.join(prompt)

> <sup>[1]</sup> The conversation context format is referenced from [here](https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat/blob/323df5680706d388eff048fba2f9c9493dfc0152/model.py#L20) and [here](https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat/blob/323df5680706d388eff048fba2f9c9493dfc0152/app.py#L9).
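
To see how earlier turns are folded into the prompt, the sketch below rebuilds the same layout in a compact, standalone form. The function name `build_llama2_prompt` and its `system_prompt` parameter are hypothetical, introduced here only for illustration; unlike `format_prompt`, this variant strips every user input, including the first one.

```python
def build_llama2_prompt(system_prompt, chat_history, input_str):
    """Join the system prompt, earlier (user, assistant) turns and the new input."""
    parts = [f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"]
    for user_msg, bot_msg in chat_history:
        # each finished turn is closed with </s> and a new [INST] block is opened
        parts.append(f"{user_msg.strip()} [/INST] {bot_msg.strip()} </s><s>[INST] ")
    parts.append(f"{input_str.strip()} [/INST]")
    return "".join(parts)

history = [("hello", " Hello there! How can I help?")]
print(build_llama2_prompt("You are a helpful assistant.", history, "one plus one?"))
```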

Next, define the `chat` function, which appends each model output to the chat history. This ensures that the conversation context is properly formatted for the next round of generation:

In [11]:
def chat(model, tokenizer, input_str, chat_history):
    # format conversation context as prompt through chat history
    prompt = format_prompt(input_str, chat_history)
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # predict the next tokens (generation stops at EOS or after max_new_tokens)
    output_ids = model.generate(input_ids,
                                max_new_tokens=128)

    output_str = tokenizer.decode(output_ids[0][len(input_ids[0]):], # skip prompt in generated tokens
                                  skip_special_tokens=True)
    print(f"Response: {output_str.strip()}")

    # add model output to the chat history
    chat_history.append((input_str, output_str))

> **Note**
>
> BigDL-LLM optimized low-bit models are compatible with all Hugging Face *transformers* APIs. Therefore, in addition to using the `generate` function for token prediction, you can also utilize other methods such as the [`TextGenerationPipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextGenerationPipeline).

Here we go! Let's do interactive, multi-turn chat with LLM:

In [12]:
import torch

chat_history = []

while True:
    with torch.inference_mode():
        user_input = input("Input:")
        if user_input == "stop": # stop the conversation when the user inputs "stop"
            print("Chat with Llama 2 (7B) stopped.")
            break
        chat(model=model_in_4bit,
             tokenizer=tokenizer,
             input_str=user_input,
             chat_history=chat_history)

Input:hello
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant.
<</SYS>>

hello [/INST]
Response: Hello there! It's nice to meet you. Is there anything I can help you with or any questions you have? I'm here to assist you in any way I can. Please let me know how I can help.
Input:one plus one?
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant.
<</SYS>>

hello [/INST] Hello there! It's nice to meet you. Is there anything I can help you with or any questions you have? I'm here to assist you in any way I can. Please let me know how I can help. </s><s>[INST] one plus one? [/INST]
Response: Great, let's do some basic arithmetic! The answer to "one plus one" is 2.
Input:stop
Chat with Llama 2 (7B) stopped.
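
The control flow of this loop (a stop word plus a growing history) can be exercised without a model. The following sketch replaces `input()` with a scripted list and the model call with a stub; `echo_bot` and `run_chat` are hypothetical names used only for this illustration.

```python
def echo_bot(input_str, chat_history):
    """Hypothetical stand-in for the model call: replies and records the turn."""
    response = f"You said: {input_str}"
    chat_history.append((input_str, response))
    return response

def run_chat(user_inputs, respond, stop_word="stop"):
    """Feed scripted inputs to `respond` until the stop word appears."""
    chat_history = []
    for user_input in user_inputs:
        if user_input == stop_word:
            break
        respond(user_input, chat_history)
    return chat_history

history = run_chat(["hello", "one plus one?", "stop", "never reached"], echo_bot)
print(history)
```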


### 5.1.3.2 Stream Chat

Stream chat can be considered an advanced chatbot feature, where the response is shown piece by piece as the model generates it, rather than all at once after generation finishes. Here, we first try `transformers.TextIteratorStreamer` on a plain prompt, and then use it to define the `stream_chat` function:


In [13]:
# note that the AutoModelForCausalLM here is imported from bigdl.llm.transformers
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import TextIteratorStreamer
from threading import Thread
from transformers import LlamaTokenizer

save_directory='./llama-2-7b-bigdl-llm-4-bit'
model_in_4bit = AutoModelForCausalLM.load_low_bit(save_directory)

tokenizer = LlamaTokenizer.from_pretrained(save_directory)
inputs = tokenizer(["An increasing sequense: one,"], return_tensors='pt')
streamer = TextIteratorStreamer(tokenizer)

2023-11-17 10:21:49,596 - INFO - Converting the current model to sym_int4 format......


In [14]:
generation = dict(inputs, streamer=streamer, max_new_tokens=120)
thread = Thread(target=model_in_4bit.generate, kwargs=generation)
thread.start()  
output_str = []

print("Response: ", end="")
for stream_output in streamer:
    output_str.append(stream_output)
    print(stream_output, end="")

Response: <s>An increasing sequense: one, two, three, four, five, six, seven, eight, nine, ten. Unterscheidung between "one" and "on" is not always clear-cut, but generally "one" refers to the number and "on" is an adverb meaning "at or near". For example: "Can you pass me one book from the shelf?" vs. "The dog is running on the  field.".</s>
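
The thread in the cell above is needed because `generate` blocks until it finishes, while the streamer is consumed in the main thread. Conceptually this is a producer/consumer pattern. The sketch below imitates it with the standard library only; `QueueStreamer` and `fake_generate` are hypothetical stand-ins and not the actual `TextIteratorStreamer` implementation.

```python
import queue
import threading

class QueueStreamer:
    """Hypothetical stand-in for a text streamer: a thread-safe iterator."""
    _END = object()                      # sentinel marking the end of the stream

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()         # blocks until the producer adds something
        if item is self._END:
            raise StopIteration
        return item

def fake_generate(streamer, pieces):
    """Plays the role of model.generate: pushes text pieces, then signals the end."""
    for piece in pieces:
        streamer.put(piece)
    streamer.end()

pieces = ["one, ", "two, ", "three"]
streamer = QueueStreamer()
thread = threading.Thread(target=fake_generate, args=(streamer, pieces))
thread.start()
received = [chunk for chunk in streamer]  # consume while the producer is running
thread.join()
print("".join(received))
```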

In [None]:
from transformers import TextIteratorStreamer

def stream_chat(model, tokenizer, input_str, chat_history):
    # format conversation context as prompt through chat history
    prompt = format_prompt(input_str, chat_history)
    input_ids = tokenizer([prompt], return_tensors='pt')

    streamer = TextIteratorStreamer(tokenizer,
                                    skip_prompt=True, # skip prompt in the generated tokens
                                    skip_special_tokens=True)

    generate_kwargs = dict(
        input_ids,
        streamer=streamer,
        max_new_tokens=128
    )
    
    # to ensure non-blocking access to the generated text, the generation process should be run in a separate thread
    from threading import Thread
    
    thread = Thread(target=model.generate, kwargs=generate_kwargs)
    thread.start()

    output_str = []
    print("Response: ", end="")
    for stream_output in streamer:
        output_str.append(stream_output)
        print(stream_output, end="")

    # add model output to the chat history
    chat_history.append((input_str, ''.join(output_str)))

> **Note**
>
> To observe the text streaming behavior in standard output, set the environment variable `PYTHONUNBUFFERED=1` so that the standard output streams are sent directly to the terminal without being buffered first.
>
> The [Hugging Face *transformers* streamer classes](https://huggingface.co/docs/transformers/main/generation_strategies#streaming) are currently under development and are subject to future changes.
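
If setting an environment variable is inconvenient, a possible alternative is to flush after each printed chunk using the `flush=True` argument of Python's built-in `print`. The helper below is a minimal sketch under that assumption; `print_stream` is a hypothetical name, and any iterable of strings (such as a streamer) can be passed in.

```python
import sys

def print_stream(chunks, out=None):
    """Print chunks as they arrive, flushing after each one, and return the full text."""
    out = sys.stdout if out is None else out
    pieces = []
    for chunk in chunks:
        print(chunk, end="", flush=True, file=out)   # flush pushes the chunk out immediately
        pieces.append(chunk)
    print(file=out)                                   # finish the line
    return "".join(pieces)

full_text = print_stream(["Hello", ", ", "world"])
```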

We can then achieve interactive, multi-turn stream chat between humans and the bot by allowing continuous user input as before:

In [None]:
chat_history = []

while True:
    with torch.inference_mode():
        user_input = input("Input:")
        if user_input == "stop": # stop the conversation when the user inputs "stop"
            print("Stream Chat with Llama 2 (7B) stopped.")
            break
        stream_chat(model=model_in_4bit,
                    tokenizer=tokenizer,
                    input_str=user_input,
                    chat_history=chat_history)

## 5.1.4 What's Next?

In the next tutorial, we will guide you through a speech recognition pipeline that incorporates BigDL-LLM INT4 optimizations.