## Qwen2.5 inference

In [1]:
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


In [2]:
response

'Sure! A large language model (LLM) is a type of artificial intelligence that can generate human-like text based on the input it receives. These models are designed to mimic natural language understanding and generation capabilities, making them useful for a variety of applications such as chatbots, virtual assistants, language translation, and more.\n\nLarge language models are typically composed of multiple layers or "layers" of neural networks, each responsible for processing different aspects of language understanding and generation. This allows these models to learn from vast amounts of data and improve their performance over time.\n\nThe term "large" refers to the fact that these models have been trained using massive datasets, which means they have access to an enormous amount of information about language patterns and contexts. The size of these datasets also influences the complexity and effectiveness of the models\' ability to understand and generate human-like language.\n\nI
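
The list comprehension in cell [1] is needed because, for a decoder-only model, `generate` returns the prompt tokens followed by the newly generated tokens. A minimal sketch with invented token ids (the numbers are illustrative, not real Qwen ids) shows the slicing on plain Python lists:

```python
# Toy stand-ins for model_inputs.input_ids and the output of model.generate().
# The ids are invented for illustration only.
input_ids_batch = [[101, 7, 8, 9]]
generated_batch = [[101, 7, 8, 9, 42, 43, 44]]  # prompt tokens + new tokens

# Drop the first len(prompt) ids from each row so that only the reply remains.
new_tokens_batch = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(input_ids_batch, generated_batch)
]

print(new_tokens_batch)  # [[42, 43, 44]]
```

Without this step, decoding would return the system prompt and the user message together with the reply.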

## Tokenizer test

A chat model takes a list of messages as input, for example:

```
[
    {"role": "user", "content": "Hey there!"},
    {"role": "assistant", "content": "Nice to meet you!"}
]
```
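This message sequence must first be converted into a single text string before it can be tokenized and fed to the model. The problem is that there are many ways to do the conversion. For example, you could convert the message list into an "instant messenger" format: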

```
User: Hey there!
Bot: Nice to meet you!
```
Or you could add special tokens to mark the roles:

```
[USER] Hey there! [/USER]
[ASST] Nice to meet you! [/ASST]
```
Or you could add tokens that mark the boundaries between messages and insert the role information as plain strings:


```
<|im_start|>user
Hey there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
```
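There are many options, and none of them is clearly the best or the most correct. As a result, different models have been trained with very different formats. I did not make these examples up: each of them is real and has been used by at least one existing model. Once a model has been trained with a given format, however, future inputs must use the same format, otherwise the resulting distribution shift can degrade performance.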
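
The three formats can be sketched as plain Python functions. The function names are hypothetical and chosen only for this illustration; each one reproduces one of the layouts shown above for a two-message conversation.

```python
messages = [
    {"role": "user", "content": "Hey there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
]

def to_instant_message(messages):
    """Render each message as a 'User: ...' or 'Bot: ...' line."""
    names = {"user": "User", "assistant": "Bot"}
    return "".join(f"{names[m['role']]}: {m['content']}\n" for m in messages)

def to_role_tags(messages):
    """Wrap each message in [USER]...[/USER] or [ASST]...[/ASST] tags."""
    tags = {"user": "USER", "assistant": "ASST"}
    return "".join(
        f"[{tags[m['role']]}] {m['content']} [/{tags[m['role']]}]\n"
        for m in messages
    )

def to_chatml(messages):
    """Mark message boundaries with tokens and insert the role as a string."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

for render in (to_instant_message, to_role_tags, to_chatml):
    print(render(messages))
```

The same conversation yields three different strings, and a model trained on one of them has never seen the other two.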

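### Templates: a way to save format information
The current situation is this: if you are lucky, the format you need is correctly documented somewhere in the model card. If you are unlucky, it is not, and good luck using the model. In extreme cases, we have even put the entire prompt format in the blog post for the model so that users cannot miss it. But even in the best case, you have to find the template information and hand-code it into your fine-tuning or inference pipeline. We think this is especially dangerous, because using the wrong chat format is a silent error: there is no explicit failure and no Python exception to tell you what went wrong. The model simply performs noticeably worse than it would with the correct format, and the cause is very hard to debug.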

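This is the problem that chat templates are meant to solve. A chat template is a Jinja template string that is saved and loaded with the tokenizer. It contains all the information needed to turn a list of chat messages into a correctly formatted input string for the model. Below are three chat template strings, one for each of the three message formats above. They are spread over several lines and indented for readability; in a real template this whitespace would need to be stripped so that it does not end up in the output.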

```
{% for message in messages %}
    {% if message['role'] == 'user' %}
        {{ "User: " }}
    {% else %}
        {{ "Bot: " }}
    {% endif %}
    {{ message['content'] + '\n' }}
{% endfor %}
```

```
{% for message in messages %}
    {% if message['role'] == 'user' %}
        {{ "[USER] " + message['content'] + " [/USER]" }}
    {% else %}
        {{ "[ASST] " + message['content'] + " [/ASST]" }}
    {% endif %}
    {{ '\n' }}
{% endfor %}
```

```
{% for message in messages %}
    {{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}
{% endfor %}
```
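
Because a chat template is ordinary Jinja, the ChatML-style template can be rendered directly with the `jinja2` package. The following is a sketch that assumes `jinja2` is installed; it is not how the tokenizer renders templates internally. The template is written without line breaks or indentation so that no stray whitespace reaches the output.

```python
from jinja2 import Template

# Raw strings, so that Jinja (not Python) interprets the '\n' escapes.
chatml_template = (
    r"{% for message in messages %}"
    r"{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}"
    r"{% endfor %}"
)

messages = [
    {"role": "user", "content": "Hey there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
]

# Render the template with the message list as its only variable.
rendered = Template(chatml_template).render(messages=messages)
print(rendered)
```

The template is plain data: swapping in a different template string changes the output format without touching the Python code.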

Reference: [Chat Templates: An End to the Silent Performance Killer](https://github.com/huggingface/blog/blob/main/zh/chat-templates.md)

In [4]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
 
chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
 
tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False) # for training

"<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing great. How can I help you today?<|im_end|>\n<|im_start|>user\nI'd like to show off how chat templating works!<|im_end|>\n"

In [5]:
tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True) # for inference

"<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing great. How can I help you today?<|im_end|>\n<|im_start|>user\nI'd like to show off how chat templating works!<|im_end|>\n<|im_start|>assistant\n"

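The latter appends the tokens that mark the start of the assistant's reply. This ensures that when the model generates text it writes a reply, rather than doing something unexpected such as continuing the user's message.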
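
The effect of `add_generation_prompt` can be imitated in a few lines of plain Python. `render_chatml` is a hypothetical helper written for this illustration; it mimics only the plain-chat behaviour visible in the two outputs above (a default system message plus ChatML markers) and ignores everything else the real template handles, such as tools.

```python
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."

def render_chatml(messages, add_generation_prompt=False):
    """Hand-written imitation of the plain-chat output shown above."""
    # Insert the default system message when the chat does not start with one.
    if messages[0]["role"] != "system":
        messages = [{"role": "system", "content": DEFAULT_SYSTEM}] + messages
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    # For inference, open an assistant turn so the model writes the reply.
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

train_text = render_chatml(chat)
infer_text = render_chatml(chat, add_generation_prompt=True)

# The two strings differ only by the opened assistant turn at the end.
print(repr(infer_text[len(train_text):]))  # '<|im_start|>assistant\n'
```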

In [8]:
# Inspect the current template
template = tokenizer.chat_template
print(template)

{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0]['role'] == 'system' %}
        {{- messages[0]['content'] }}
    {%- else %}
        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
    {%- endif %}
    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0]['role'] == 'system' %}
        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
    {%- else %}
        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba C

In [10]:
print(type(template))

<class 'str'>


In [11]:
# Modify the current template and set it again
template = template.replace("You are Qwen, created by Alibaba Cloud.", "You are Qwen, created by Rongsheng.")  # Change the default system prompt
tokenizer.chat_template = template  # Set the new template

In [12]:
print(tokenizer.chat_template)

{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0]['role'] == 'system' %}
        {{- messages[0]['content'] }}
    {%- else %}
        {{- 'You are Qwen, created by Rongsheng. You are a helpful assistant.' }}
    {%- endif %}
    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0]['role'] == 'system' %}
        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
    {%- else %}
        {{- '<|im_start|>system\nYou are Qwen, created by Rongsheng. Yo
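
The edit in cell [11] relies on `str.replace`, which returns the string unchanged when the search text is not found, so a typo in the search text would leave the template silently unmodified. A small guard makes that failure visible. `replace_in_template` is a hypothetical helper, and `toy_template` is a shortened stand-in for the real template string.

```python
def replace_in_template(template: str, old: str, new: str) -> str:
    """Replace `old` with `new`, failing loudly if `old` is absent."""
    if template.count(old) == 0:
        raise ValueError(f"{old!r} not found in template; nothing was changed")
    return template.replace(old, new)

# Shortened stand-in for tokenizer.chat_template (illustration only).
toy_template = (
    "{{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. "
    "You are a helpful assistant.<|im_end|>\\n' }}"
)

patched = replace_in_template(
    toy_template, "created by Alibaba Cloud.", "created by Rongsheng."
)
print(patched)
```

With a real tokenizer, the patched string would then be assigned to `tokenizer.chat_template` as in cell [11].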

## Applying the chat_template to a training dataset

In [14]:
!pip install datasets

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



In [17]:
from transformers import AutoTokenizer
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

chat1 = [
    {"role": "user", "content": "Which is bigger, the moon or the sun?"},
    {"role": "assistant", "content": "The sun."}
]
chat2 = [
    {"role": "user", "content": "Which is bigger, a virus or a bacterium?"},
    {"role": "assistant", "content": "A bacterium."}
]

dataset = Dataset.from_dict({"chat": [chat1, chat2]})
dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)})
print(dataset['formatted_chat'][0])
print("===============")
print(dataset['formatted_chat'][1])


<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Which is bigger, the moon or the sun?<|im_end|>
<|im_start|>assistant
The sun.<|im_end|>

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Which is bigger, a virus or a bacterium?<|im_end|>
<|im_start|>assistant
A bacterium.<|im_end|>



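Some tokenizers add special tokens such as `<bos>` and `<eos>`. A chat template normally already contains all the special tokens it needs, so adding them again is usually incorrect or redundant and can hurt model performance. If you format text with `apply_chat_template(tokenize=False)`, set `add_special_tokens=False` when you tokenize that text afterwards, so that the special tokens are not added a second time.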
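
The duplication can be illustrated with a toy example. Everything below is invented for the illustration: `toy_template` and `toy_tokenize` are not part of `transformers`, and the `<bos>` handling only mimics the behaviour described above.

```python
BOS = "<bos>"

def toy_template(messages):
    """Toy chat template that already emits the BOS marker itself."""
    body = "".join(f"[{m['role']}] {m['content']}\n" for m in messages)
    return BOS + body

def toy_tokenize(text, add_special_tokens=True):
    """Toy whitespace tokenizer that prepends BOS when asked to."""
    # Treat the BOS marker as its own token, then split on whitespace.
    tokens = text.replace(BOS, f" {BOS} ").split()
    return [BOS] + tokens if add_special_tokens else tokens

formatted = toy_template([{"role": "user", "content": "Hi"}])

duplicated = toy_tokenize(formatted)                       # BOS appears twice
clean = toy_tokenize(formatted, add_special_tokens=False)  # BOS appears once

print(duplicated)  # ['<bos>', '<bos>', '[user]', 'Hi']
print(clean)       # ['<bos>', '[user]', 'Hi']
```

With a real tokenizer, the corresponding call is `tokenizer(formatted_text, add_special_tokens=False)` on text that was produced by `apply_chat_template(tokenize=False)`.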

Reference: [Essential basics for building a framework: fully understanding chat templates](https://www.guyuehome.com/detail?id=1888166611628642305)