# Notebook 4.1: Run Transformer Models

BigDL-LLM supports the optimization of any Hugging Face *transformers* model, allowing for efficient inference with significantly reduced latency. With BigDL-LLM, PyTorch models (in FP16/BF16/FP32) from Hugging Face can be loaded with implicit quantization, so that the heavy operations in Transformer models are sped up through low precision (such as INT4, INT5 or INT8).

In this tutorial, we will dive into the main usage of the BigDL-LLM Transformers-style API for low-precision optimizations.

## 4.1.1 Install BigDL-LLM

Follow the instructions in [Chapter 2](../ch_2_Environment_Setup/) to set up your environment if you haven't done so. Then install `bigdl-llm`:

In [None]:
!pip install bigdl-llm[all]

## 4.1.2 Load Model in INT4 and Conduct Inference

One common use case for BigDL-LLM is to load a Hugging Face *transformers* model in INT4 precision and conduct inference with optimizations. Compared to the Hugging Face *transformers* API, only minor code changes are required.

For illustration purposes, let's take model [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example.

### 4.1.2.1 Download Llama 2 (7B)

To download the [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model from Hugging Face, you first need to be granted access by Meta. Please follow the instructions provided [here](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main) to request access to the model.

After receiving access, download the model with your Hugging Face token:

In [None]:
from huggingface_hub import snapshot_download

# repo_id has the form '<organization>/<model name>', without a leading slash
model_path = snapshot_download(repo_id='meta-llama/Llama-2-7b-chat-hf',
                               token='hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') # change it to your own Hugging Face access token

> **Note**
>
> The model will by default be downloaded to `HF_HOME='~/.cache/huggingface'`.


### 4.1.2.2 Load Model in INT4

To load the model with BigDL-LLM INT4 optimizations, you could simply import `bigdl.llm.transformers.AutoModelForCausalLM` instead of `transformers.AutoModelForCausalLM`, and specify `load_in_4bit=True` in the `from_pretrained` function:

In [None]:
from bigdl.llm.transformers import AutoModelForCausalLM

model_in_4bit = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="meta-llama/Llama-2-7b-chat-hf",
                                                     load_in_4bit=True,)

> **Note**
>
> BigDL-LLM currently supports `AutoModel`, `AutoModelForCausalLM`, `AutoModelForSpeechSeq2Seq` and `AutoModelForSeq2SeqLM`.

### 4.1.2.3 Load Tokenizer and Conduct Inference

You can then load the corresponding tokenizer of Llama 2 (7B) and conduct inference with BigDL-LLM INT4 optimizations in exactly the same way as with the Hugging Face *transformers* API. A human-bot conversation is constructed here for the model to complete:

In [None]:
# Load tokenizer of Llama 2 (7B)
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_name_or_path="meta-llama/Llama-2-7b-chat-hf")

# Generate predicted tokens
import torch
import time

prompt = """### HUMAN:

What is AI?

### RESPONSE:
"""

with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    st = time.time()
    output = model_in_4bit.generate(input_ids,
                                    max_new_tokens=32)
    end = time.time()
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f'Inference time: {end-st} s')
    print('-'*20, 'Output', '-'*20)
    print(output_str)

### 4.1.2.4 Conduct Streaming Multi-Turn Chat

One common application of large language models is as chatbots. There is no magic involved in chatbot interactions: LLMs simply predict and generate text based on formatted, incomplete conversations, such as the example prompt we previously showcased:

```
### HUMAN:

What is AI?

### RESPONSE:
```

In multi-turn chat, the text generated by the LLM is appended to the existing conversation, followed by the next user input, as follows:

```
### HUMAN:

What is AI?

### RESPONSE:

AI is a term used to describe the development of computer systems that can perform tasks that typically require human intelligence, such as understanding natural language, recognizing images.

### HUMAN:

Is it dangerous

### RESPONSE:
```

Here is a streaming multi-turn chat example that uses the official `transformers` API with the BigDL-LLM optimized Llama 2 (7B) model. First, we need to define the conversation format for the model to complete:

In [None]:
# format conversation with chat history
chat_history = []

HUMAN_ID = "### HUMAN:\n\n"
BOT_ID = "### RESPONSE:\n\n"

def format_prompt(input_str, chat_history):
    prompt = ""
    for history_input, history_response in chat_history:
        prompt += f"{HUMAN_ID}{history_input}\n\n{BOT_ID}{history_response}\n\n"
    prompt += f"{HUMAN_ID}{input_str}\n\n{BOT_ID}"
    return prompt
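
To make the format concrete, the sketch below builds the prompt for a second turn. It repeats the definitions from the cell above so that it runs on its own; the one-turn history and the names `example_history` and `example_prompt` are made up purely for illustration.

```python
# Copy of the formatting logic above, repeated so this block is self-contained.
HUMAN_ID = "### HUMAN:\n\n"
BOT_ID = "### RESPONSE:\n\n"

def format_prompt(input_str, chat_history):
    prompt = ""
    # Replay every finished (question, answer) pair in order
    for history_input, history_response in chat_history:
        prompt += f"{HUMAN_ID}{history_input}\n\n{BOT_ID}{history_response}\n\n"
    # End with the new question and an open response marker for the model to complete
    prompt += f"{HUMAN_ID}{input_str}\n\n{BOT_ID}"
    return prompt

# Hypothetical history with a single finished turn (illustrative text only)
example_history = [("What is AI?", "AI refers to computer systems that perform tasks requiring human intelligence.")]

example_prompt = format_prompt("Is it dangerous?", example_history)
print(example_prompt)
```

The printed prompt contains two `### HUMAN:` blocks and ends with an empty `### RESPONSE:` block, which is the part the model is expected to fill in.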

Then we could define the function for streaming chat with the help of `transformers.TextIteratorStreamer`:

In [None]:
from transformers import TextIteratorStreamer
from transformers.tools.agents import StopSequenceCriteria
from transformers.generation.stopping_criteria import StoppingCriteriaList
from threading import Thread

def stream_chat(model, tokenizer, input_str, chat_history):
    prompt = format_prompt(input_str, chat_history)
    input_ids = tokenizer([prompt], return_tensors='pt')

    streamer = TextIteratorStreamer(tokenizer,
                                    skip_prompt=True, # skip prompt in the generated tokens
                                    skip_special_tokens=True)
    
    # make sure that the generated tokens stop at ###, i.e. prevent the model from asking itself the next question
    stop_word = "###"
    stopping_criteria = StoppingCriteriaList([StopSequenceCriteria(stop_word, tokenizer)])

    generate_kwargs = dict(
        input_ids,
        streamer=streamer,
        max_new_tokens=128,
        stopping_criteria=stopping_criteria
    )
    
    # To ensure non-blocking access to the generated text, the generation process should be run in a separate thread.
    thread = Thread(target=model.generate, kwargs=generate_kwargs)
    thread.start()

    output_str = []
    print("Response: ", end="")
    for stream_output in streamer:
        output_str.append(stream_output)
        print(stream_output.replace(stop_word, ""), end="")

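    # store the user's question (input_str, not the built-in `input`) together with the full response
    chat_history.append((input_str, ''.join(output_str).replace(stop_word, "")))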
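
The reason generation runs in a separate thread can be seen in a small standard-library sketch. This is not how `TextIteratorStreamer` is implemented; `ToyStreamer` and `fake_generate` are hypothetical stand-ins that only mimic the general producer/consumer pattern: one thread pushes text chunks into a queue while the main thread iterates over them as they arrive.

```python
import queue
import threading

_END = object()  # sentinel that marks the end of the stream

class ToyStreamer:
    """Hypothetical, simplified stand-in for a text streamer: an iterable queue."""

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(_END)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # blocks until the producer has put something
        if item is _END:
            raise StopIteration
        return item

def fake_generate(streamer, chunks):
    """Hypothetical producer that plays the role of model.generate."""
    for chunk in chunks:
        streamer.put(chunk)
    streamer.end()

streamer = ToyStreamer()
chunks = ["AI ", "is ", "a ", "field ", "of ", "computer ", "science."]

# The producer runs in the background, so the loop below can consume chunks as they arrive
thread = threading.Thread(target=fake_generate, kwargs=dict(streamer=streamer, chunks=chunks))
thread.start()

received = []
for piece in streamer:
    received.append(piece)
    print(piece, end="", flush=True)
thread.join()
print()
```

If the producer were called directly on the main thread instead, the consuming loop could not start until the producer had finished, and nothing would be shown incrementally.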


> **Note**
>
> The [`transformers` streamer classes](https://huggingface.co/docs/transformers/main/generation_strategies#streaming) are currently under development and subject to future changes.

We can then facilitate interactive, multi-turn streaming chat between humans and the bot by allowing for continuous user input:

In [None]:
while True:
    with torch.inference_mode():
        user_input = input("Input: ")
        if user_input == "stop": # stop the conversation when the user inputs "stop"
            print("Chat with Llama 2 (7B) stopped.")
            break
        stream_chat(model=model_in_4bit,
                    tokenizer=tokenizer,
                    input_str=user_input,
                    chat_history=chat_history)

## 4.1.3 Save & Load INT4 Model

When loading a model with `load_in_4bit=True`, BigDL-LLM implicitly converts the linear layers in the model into INT4 format. In theory, a model with *X* billion parameters stored in 16-bit or 32-bit precision requires approximately 2*X* or 4*X* GB of memory to load, respectively. Since the original weights have to be loaded before they are converted, loading extremely large models such as the 40B Falcon, 70B Llama 2 or 176B Bloom with the implicit INT4 quantization of BigDL-LLM can be both resource-intensive and time-consuming, and may even become impossible on memory-limited machines.
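
As a back-of-the-envelope check of these figures, the sketch below computes weight-only memory for a few model sizes. The helper name `estimate_weight_memory_gb` is made up for this illustration, and 1 GB is taken as 10^9 bytes. The 4-bit figure ignores the extra metadata (such as scale factors) that quantized formats typically store, as well as activations and other runtime buffers, so actual usage will be somewhat higher.

```python
def estimate_weight_memory_gb(num_params_billion, bits_per_param):
    """Approximate memory needed for the weights alone, with 1 GB = 10**9 bytes."""
    bytes_per_param = bits_per_param / 8
    return num_params_billion * bytes_per_param

# Parameter counts (in billions) of the models mentioned in the text
models = [("Llama 2 (7B)", 7), ("Falcon (40B)", 40), ("Llama 2 (70B)", 70), ("Bloom (176B)", 176)]

for name, size_b in models:
    row = ", ".join(
        f"{bits}-bit: {estimate_weight_memory_gb(size_b, bits):.1f} GB"
        for bits in (32, 16, 4)
    )
    print(f"{name}: {row}")
```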

To address this issue, BigDL-LLM provides support for saving *transformers* models in BigDL-LLM INT4 format. Once the model is optimized and saved in this format, it can be loaded directly for subsequent inference, eliminating the need for repeated quantization. The saving and loading process can be completed on different machines.

### 4.1.3.1 Save INT4 Model

Let's continue with the example in section [4.1.2](#412-load-model-in-int4-and-conduct-inference) for Llama 2 (7B). After loading the model in 4-bit, we can use the `save_low_bit` function to save the optimized model:

In [None]:
save_directory='./llama-2-7b-bigdl-llm-4-bit'

model_in_4bit.save_low_bit(save_directory)

We recommend saving the tokenizer in the same directory as the optimized model to simplify the subsequent loading process:

In [None]:
tokenizer.save_pretrained(save_directory)

### 4.1.3.2 Load INT4 Model

We can load the optimized INT4 model through the `load_low_bit` function, and load the tokenizer from the same saved directory:

In [None]:
# note that the AutoModelForCausalLM here is imported from bigdl.llm.transformers
model_loaded = AutoModelForCausalLM.load_low_bit(save_directory)

tokenizer_loaded = LlamaTokenizer.from_pretrained(save_directory)

Inference can then be done in the same way as with the Hugging Face *transformers* API:

In [None]:
from transformers import TextGenerationPipeline

pipeline = TextGenerationPipeline(model=model_loaded, tokenizer=tokenizer_loaded, max_new_tokens=32)
output_str = pipeline(prompt)[0]["generated_text"]
print('-'*20, 'Output', '-'*20)
print(output_str)

## 4.1.4 Use INT8 and other Low Precision Format

In addition to INT4, BigDL-LLM also supports other low-precision techniques such as INT8 and INT5. 

### 4.1.4.1 Load Model in Low Precision

To load the model with BigDL-LLM low-precision optimizations, specify the `load_in_low_bit` parameter accordingly in the `from_pretrained` function. Let's take INT8 as an example here:


In [None]:
# note that the AutoModelForCausalLM here is imported from bigdl.llm.transformers
model_in_8bit = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="meta-llama/Llama-2-7b-chat-hf",
                                                     load_in_low_bit="sym_int8")

> **Note**
>
> Currently, `load_in_low_bit` supports the options `'sym_int4'`, `'asym_int4'`, `'sym_int5'`, `'asym_int5'` and `'sym_int8'`, where 'sym' and 'asym' stand for symmetric and asymmetric quantization, respectively.
>
> It is worth mentioning that `load_in_4bit=True` is equivalent to `load_in_low_bit='sym_int4'`.

Saving and loading INT8 or other low-precision models supported by BigDL-LLM follows the same usage as described in section [4.1.3](#413-save--load-int4-model).
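
For readers unfamiliar with the 'sym' and 'asym' distinction mentioned in the note above, here is a minimal pure-Python sketch of the textbook definitions. It is an illustration only and does not reproduce BigDL-LLM's actual kernels, which may group weights into blocks and use different integer ranges. All function names and the sample weights are made up for this example.

```python
def quantize_symmetric(values, bits):
    """Map values to signed integers in [-qmax, qmax]; zero always maps to 0."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize_symmetric(q, scale):
    return [x * scale for x in q]

def quantize_asymmetric(values, bits):
    """Map values to unsigned integers in [0, 2**bits - 1] using an offset (the minimum)."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [max(0, min(levels, round((v - lo) / scale))) for v in values]
    return q, scale, lo

def dequantize_asymmetric(q, scale, lo):
    return [x * scale + lo for x in q]

def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))

# Made-up, all-positive sample weights: a skewed range where the two schemes differ
weights = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 1.00]

q_sym, s_sym = quantize_symmetric(weights, bits=4)
q_asym, s_asym, lo = quantize_asymmetric(weights, bits=4)

err_sym = max_abs_error(weights, dequantize_symmetric(q_sym, s_sym))
err_asym = max_abs_error(weights, dequantize_asymmetric(q_asym, s_asym, lo))

print(f"symmetric  INT4 codes: {q_sym}, max error: {err_sym:.4f}")
print(f"asymmetric INT4 codes: {q_asym}, max error: {err_asym:.4f}")
```

In this toy setting the symmetric scheme needs only a scale per group of values, while the asymmetric scheme also stores an offset. In exchange, the asymmetric scheme can spend all of its integer levels on the range the values actually occupy, which tends to help when the values are not centered around zero.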

### 4.1.4.2 Load Tokenizer and Conduct Inference

The following process is the same as using the Hugging Face *transformers* API:

In [None]:
pipeline = TextGenerationPipeline(model=model_in_8bit, tokenizer=tokenizer, max_new_tokens=32)
output_str = pipeline(prompt)[0]["generated_text"]
print('-'*20, 'Output', '-'*20)
print(output_str)

## 4.1.5 What's Next?

In the next tutorial, we will guide you through a speech recognition pipeline that incorporates BigDL-LLM INT4 optimizations.
