# Notebook 4.2: Baichuan-13B

## 4.2.1 Overview

This notebook shows how to run Chinese-language inference with [Baichuan-13B](https://github.com/baichuan-inc/Baichuan-13B) on low-cost PCs, without a discrete GPU, using [IPEX-LLM](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm) APIs. Baichuan-13B is an open-source, commercially usable large language model developed by Baichuan Intelligent Technology as the successor to [Baichuan-7B](https://github.com/baichuan-inc/baichuan-7B). The chat variant used in this notebook is also available among the [Hugging Face models](https://huggingface.co/models) as [Baichuan-13B-Chat](https://huggingface.co/baichuan-inc/Baichuan-13B-Chat).

## 4.2.2 Installation

First, install IPEX-LLM in your prepared environment. For environment setup best practices, see [Chapter 2](../ch_2_Environment_Setup/README.md) of this tutorial.

In [None]:
!pip install --pre --upgrade ipex-llm[all]

# Additional package that Baichuan-13B-Chat requires for generation
!pip install -U transformers_stream_generator

The `[all]` option installs the other packages that IPEX-LLM requires.
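
As a quick sanity check after installation, the sketch below reports which of the modules used in this notebook cannot be found in the current environment. The helper name `missing_modules` and the `REQUIRED_MODULES` list are illustrative and not part of IPEX-LLM; the list assumes the pip packages are importable under the module names shown.

```python
import importlib.util

# Top-level module names that this notebook (or the model's remote code) imports.
# Assumption: the pip package "ipex-llm" is imported as "ipex_llm".
REQUIRED_MODULES = ["ipex_llm", "transformers", "torch", "transformers_stream_generator"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```

If the list is not empty, rerunning the install cell above in the same environment is usually the first thing to try.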

## 4.2.3 Load Model and Tokenizer

### 4.2.3.1 Load Model

Load the Baichuan model with low-bit (INT4) optimization using IPEX-LLM APIs to reduce resource usage. Setting `load_in_4bit=True` converts the relevant layers of the model to INT4 format.

> **Note**
>
> You can set the `model_path` argument to either a Hugging Face repo id or a local model path.

In [None]:
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "baichuan-inc/Baichuan-13B-Chat"
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
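
To get a feel for why INT4 lowers the resource cost, the following back-of-the-envelope sketch compares the approximate weight storage at 16 bits and at 4 bits per weight. It assumes roughly 13 billion parameters (as the model name suggests) and that every weight is converted; it ignores quantization scales, layers left unconverted, activations, and the key/value cache, so real memory usage will differ. The helper `weight_memory_gb` is illustrative only.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: about 13 billion parameters, all stored at the given precision.
NUM_PARAMS = 13e9

for label, bits in [("FP16", 16), ("INT4", 4)]:
    print(f"{label}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights shrink from about 26 GB to about 6.5 GB, which is what brings the model within reach of an ordinary PC's memory.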

### 4.2.3.2 Load Tokenizer

A tokenizer is also needed for LLM inference. It encodes input text into tensors that are fed to the LLM, and decodes the LLM's output tensors back into text. You can load the tokenizer directly with the [Hugging Face transformers](https://huggingface.co/docs/transformers/index) API; it works seamlessly with models loaded by IPEX-LLM.

In [7]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)
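
The encode/decode contract described above can be illustrated with a small round-trip helper. `roundtrip` and `ToyTokenizer` are hypothetical names introduced here; `ToyTokenizer` is only a stand-in (one id per character) so that the sketch runs without downloading anything, and it is not how the Baichuan tokenizer splits text.

```python
def roundtrip(tokenizer, text):
    """Encode text to token ids, decode them back, and return both."""
    ids = tokenizer.encode(text)
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return ids, decoded


class ToyTokenizer:
    """Hypothetical stand-in: one token id per character (its Unicode code point)."""

    def encode(self, text):
        return [ord(ch) for ch in text]

    def decode(self, ids, skip_special_tokens=True):
        return "".join(chr(i) for i in ids)


ids, decoded = roundtrip(ToyTokenizer(), "AI是什么？")
print(ids)
print(decoded)
```

The same helper should accept the `tokenizer` loaded above, since it also exposes `encode` and `decode`. With a real tokenizer the decoded text may differ slightly from the input, for example in whitespace.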

## 4.2.4 Inference

### 4.2.4.1 Create Prompt Template

Before generating, you need to create a prompt template. The example below is a template for question answering; you can adjust the prompt to suit your own model.

In [5]:
BAICHUAN_PROMPT_FORMAT = "<human>{prompt} <bot>"
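
To see what the template produces, the sketch below fills it with a sample question. The helper name `build_prompt` is illustrative; the template string is the one defined above.

```python
BAICHUAN_PROMPT_FORMAT = "<human>{prompt} <bot>"


def build_prompt(question):
    """Wrap a user question in the question-answering template."""
    return BAICHUAN_PROMPT_FORMAT.format(prompt=question)


print(build_prompt("AI是什么？"))  # <human>AI是什么？ <bot>
```

The text after `<bot>` is left empty so that the model continues from there with its answer.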

### 4.2.4.2 Generate

You can now generate output with the loaded model and tokenizer.

> **Note**
>
> The `max_new_tokens` parameter of the `generate` function sets the maximum number of tokens to generate.

In [8]:
import torch

prompt = "AI是什么？"
n_predict = 32
with torch.inference_mode():
    prompt = BAICHUAN_PROMPT_FORMAT.format(prompt=prompt)
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    # if your selected model is capable of utilizing previous key/value attentions
    # to enhance decoding speed, but has `"use_cache": false` in its model config,
    # it is important to set `use_cache=True` explicitly in the `generate` function
    # to obtain optimal performance with IPEX-LLM INT4 optimizations
    output = model.generate(input_ids,
                            max_new_tokens=n_predict)
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
    print('-'*20, 'Output', '-'*20)
    print(output_str)

-------------------- Output --------------------
<human>AI是什么？ <bot>人工智能(Artificial Intelligence，简称AI)是指由人制造出来的系统所表现出的智能，通常是通过计算机程序和传感器实现的
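
As the output above shows, the decoded string starts with the prompt itself. If you only want the model's reply, one simple option is to strip the prompt prefix from the decoded text. The sketch below is one possible approach; `extract_answer` is a hypothetical helper, and the sample strings are shortened from the output above.

```python
def extract_answer(output_str, prompt):
    """Return only the model's reply, dropping the echoed prompt if present."""
    if output_str.startswith(prompt):
        return output_str[len(prompt):].strip()
    # Fall back to the full text if the prompt was not reproduced exactly.
    return output_str.strip()


prompt = "<human>AI是什么？ <bot>"
output_str = prompt + "人工智能(Artificial Intelligence，简称AI)是指由人制造出来的系统所表现出的智能"
print(extract_answer(output_str, prompt))
```

This relies on the decoded text reproducing the prompt character for character, which is not guaranteed for every tokenizer; when it does not match, the helper returns the full text unchanged.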
