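# Vietnamese Dialogue Model Inference Example

This example shows how to deploy and run a Vietnamese dialogue system based on the Qwen1.5-0.5B-Chat model on the Orange Pi AIpro (香橙派AIPRO).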

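## Project Overview

- **Base model**: Qwen1.5-0.5B-Chat
- **Fine-tuning method**: LoRA (Low-Rank Adaptation)
- **Training data**: Vietnamese portion of the 万卷丝绸 (WanJuan Silk) dataset
- **Deployment platform**: Orange Pi AIpro
- **Inference framework**: MindSpore + MindNLP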

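## Features

- ✅ LoRA weight loading and integration with the base model
- ✅ Streaming text generation
- ✅ Conversation history management
- ✅ LoRA layer activation check
- ✅ Inference on an edge device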

## Quick Start

1. Prepare the trained LoRA weights.
2. Load the base model and the LoRA adapter.
3. Run the interactive chat loop.

# Environment Setup

## Version Requirements

This inference example was run with the following software versions:

- **MindSpore**: 2.6.0
- **MindNLP**: 0.4.1
- **CANN**: 8.1.RC1
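
The Python package versions above can also be checked programmatically. The following is a minimal sketch using only the standard library; the helper name `check_versions` is hypothetical, and CANN is left out because it is not a pip package.

```python
from importlib import metadata

# Versions listed in this example; adjust if your environment differs.
REQUIRED = {"mindspore": "2.6.0", "mindnlp": "0.4.1"}

def check_versions(required, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in required.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

if __name__ == "__main__":
    for name, (expected, installed) in check_versions(REQUIRED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means both packages match the versions listed here; other versions may still work but were not used in this example.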

In [1]:
!pip list |grep mind

mindnlp                   0.4.1
mindpet                   1.0.4
mindspore                 2.6.0


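# NPU Configuration

Use the `npu-smi info` command to inspect the Ascend NPU, including its model, health status, power draw, temperature, and memory usage.

## Hardware Information

- **NPU model**: Ascend 310B4
- **Memory**: 15.6 GB (15610 MB)
- **Hugepages**: 15/15 pages in use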

In [None]:
!npu-smi info

+--------------------------------------------------------------------------------------------------------+
| npu-smi 25.2.0                                   Version: 25.2.0                                       |
+-------------------------------+-----------------+------------------------------------------------------+
| NPU     Name                  | Health          | Power(W)     Temp(C)           Hugepages-Usage(page) |
| Chip    Device                | Bus-Id          | AICore(%)    Memory-Usage(MB)                        |
| 0       310B4                 | OK/Alarm        | 0.0          51                15    / 15            |
| 0       0                     | NA              | 0            3681 / 15610                            |
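
To check free device memory before loading a model, the usage figures can be pulled out of this text. The sketch below is only an illustration: the function name `parse_usage_pairs` is hypothetical, and it assumes the single-device layout shown above, where the first `used / total` pair is hugepages and the second is memory in MB.

```python
import re

def parse_usage_pairs(npu_smi_output):
    """Extract the (used, total) pairs from single-device `npu-smi info` text.

    Assumes the first pair is hugepage usage and the second is memory usage in MB.
    Returns None if fewer than two pairs are found.
    """
    pairs = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\s*/\s*(\d+)", npu_smi_output)]
    if len(pairs) < 2:
        return None
    return {"hugepages": pairs[0], "memory_mb": pairs[1]}

sample = """
| 0       310B4                 | OK/Alarm        | 0.0          51                15    / 15            |
| 0       0                     | NA              | 0            3681 / 15610                            |
"""
usage = parse_usage_pairs(sample)
used_mb, total_mb = usage["memory_mb"]
print(f"free memory: {total_mb - used_mb} MB")
```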


In [None]:
import mindspore
from mindnlp.transformers import AutoModelForCausalLM, AutoTokenizer
from mindnlp.transformers import TextIteratorStreamer
from threading import Thread



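# Model Loading and Initialization

## Loading the Base Model

Load the pretrained model with the `AutoTokenizer` and `AutoModelForCausalLM` classes provided by MindNLP:

```python
# AutoTokenizer: detects and loads the tokenizer that matches the model
# AutoModelForCausalLM: loads the causal language model architecture
```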
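
Because the model is loaded with `ms_dtype=mindspore.float16`, each weight takes 2 bytes. As a rough back-of-the-envelope check that the weights fit on the device, the sketch below estimates weight memory. The parameter count of 500 million is an assumption based on the model's nominal size; the real count differs somewhat, and activations and the KV cache need additional memory.

```python
def weight_memory_mib(num_params, bytes_per_param=2):
    """Approximate memory for model weights in MiB (float16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / (1024 ** 2)

NOMINAL_PARAMS = 500_000_000  # assumed nominal size, not an exact parameter count

fp16_mib = weight_memory_mib(NOMINAL_PARAMS)                      # float16 weights
fp32_mib = weight_memory_mib(NOMINAL_PARAMS, bytes_per_param=4)   # float32 for comparison
print(f"float16: {fp16_mib:.0f} MiB, float32: {fp32_mib:.0f} MiB")
```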


In [3]:
path = "./Model/Qwen1.5-0.5B-Chat/"
# The tokenizer holds no weights, so it does not need a dtype argument.
tokenizer = AutoTokenizer.from_pretrained(path)
omodel = AutoModelForCausalLM.from_pretrained(path, ms_dtype=mindspore.float16)


Qwen2ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`.`PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Sliding Window Attention is enabled but not implemented for `eager`; unexpected results may be encountered.


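# Loading the LoRA Adapter

## Parameter-Efficient Fine-Tuning (PEFT)

Load the trained LoRA adapter with MindNLP's PEFT (Parameter-Efficient Fine-Tuning) module:

```python
# PeftModel: loads and manages PEFT fine-tuned weights
# from_pretrained: loads the LoRA adapter config and weights from the given path
# .eval(): switches the model to evaluation mode, disabling Dropout and other training-only layers
```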
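
For intuition about what the adapter adds, here is a minimal pure-Python sketch of the LoRA computation. It is not MindNLP's implementation. It assumes a row-vector convention (`x` is 1×d_in, `W` is d_in×d_out, `A` is d_in×r, `B` is r×d_out) and shows that the unmerged path `xW + (alpha/r)(xA)B` gives the same result as a single merged weight `W + (alpha/r)AB`. All names and values are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every element of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]

def lora_forward(x, W, A, B, alpha, r):
    """Unmerged path: base output plus the scaled low-rank update."""
    return add(matmul(x, W), scale(matmul(matmul(x, A), B), alpha / r))

def merge(W, A, B, alpha, r):
    """Fold the low-rank update into the base weight."""
    return add(W, scale(matmul(A, B), alpha / r))

# Toy values: d_in = d_out = 2, rank r = 1.
W = [[1, 2], [3, 4]]
A = [[1], [2]]
B = [[3, 4]]
x = [[1, 1]]

y_unmerged = lora_forward(x, W, A, B, alpha=2, r=1)
y_merged = matmul(x, merge(W, A, B, alpha=2, r=1))
print(y_unmerged, y_merged)
```

Keeping the adapter unmerged, as `PeftModel.from_pretrained` does in this example, leaves the base weights untouched, at the cost of the extra low-rank matrix products in each adapted layer.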

In [4]:
from mindnlp.peft import PeftModel  # only PeftModel is needed for inference

In [5]:
lora_path = "./Model/lora-train_2026-01-07-16-29-48"
loramodel = PeftModel.from_pretrained(omodel, lora_path)


In [6]:
model=loramodel.eval()


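# LoRA Layer Activation Check

## Purpose
Verify that the PEFT fine-tuned weights are actually integrated and exercised during inference, so that the low-rank adapters are known to be working in the edge deployment. The resulting call count gives a concrete number to check against.

## How It Works
The check relies on the `register_forward_hook` mechanism to observe LoRA modules while the model runs.
It walks the modules of the `PeftModel` and registers a callback on every layer that has a `lora_A` attribute (the callback itself also accepts layers with `lora_B`).
During the forward pass, the hook fires for each adapted layer (for example LoRA linear or embedding layers) and appends the layer's class name to the `accessed_params` list. The length of that list is the number of LoRA layer calls, not the number of fine-tuned parameters.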
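
The hook pattern can be tried without MindSpore. The sketch below uses a hypothetical `ToyModule` class that only mimics the shape of the forward-hook interface (a callback receiving module, inputs, and output). It is a stand-in for illustration, not the framework's implementation.

```python
from collections import Counter

class ToyModule:
    """Hypothetical stand-in for a framework layer that supports forward hooks."""

    def __init__(self, name, has_lora=False):
        self.name = name
        self._hooks = []
        if has_lora:
            # Real LoRA layers hold adapter weights here; placeholders suffice for the demo.
            self.lora_A = object()
            self.lora_B = object()

    def register_forward_hook(self, fn):
        self._hooks.append(fn)
        return lambda: self._hooks.remove(fn)  # calling this removes the hook

    def __call__(self, x):
        out = x  # identity "forward pass"
        for fn in self._hooks:
            fn(self, (x,), out)
        return out

calls = Counter()

def record(module, inputs, output):
    calls[module.name] += 1

layers = [ToyModule("q_proj", has_lora=True), ToyModule("norm"), ToyModule("v_proj", has_lora=True)]
removers = [m.register_forward_hook(record) for m in layers if hasattr(m, "lora_A")]

x = 1.0
for m in layers:       # first pass: hooks fire on the two LoRA layers
    x = m(x)
for remove in removers:
    remove()
for m in layers:       # second pass: hooks are gone, counts stay unchanged
    x = m(x)
print(dict(calls))
```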

In [7]:
accessed_params = []

def hook(module, input, output):
    if hasattr(module, 'lora_A') or hasattr(module, 'lora_B'):
        accessed_params.append(module.__class__.__name__)

# Register the hook on every LoRA layer
hooks = []
for module in model.modules():
    if hasattr(module, 'lora_A'):
        hook_handle = module.register_forward_hook(hook)
        hooks.append(hook_handle)

In [8]:
test_input = tokenizer("测试", return_tensors="ms")

_ = model(**test_input)

# Remove the hooks
for h in hooks:
    h.remove()


We detected that you are passing `past_key_values` as a tuple and this is deprecated. Please use an appropriate `Cache` class
2026-01-08 16:11:13.233289: E external/org_tensorflow/tensorflow/core/framework/node_def_util.cc:676] NodeDef mentions attribute is_closed which is not in the op definition: Op<name=Range; signature=start:Tidx, limit:Tidx, delta:Tidx -> output:Tidx; attr=Tidx:type,default=DT_INT32,allowed=[DT_BFLOAT16, DT_HALF, DT_FLOAT, DT_DOUBLE, DT_INT8, DT_INT16, DT_INT32, DT_INT64, DT_UINT16, DT_UINT32]> This may be expected if your graph generating binary is newer  than this binary. Unknown attributes will be ignored. NodeDef: {{node Range2}}


In [9]:

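print(f"LoRA layers invoked during the forward pass: {len(accessed_params)}")
print(f"Sample of invoked layer classes: {set(accessed_params[:8])}")

LoRA layers invoked during the forward pass: 168
Sample of invoked layer classes: {'Linear'}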


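# Building the Dialogue System

## System Prompt and History Management

### System Prompt
```python
system_prompt = "You are a helpful and friendly chatbot"
```
- Defines the model's basic identity and behavior
- Is injected as the system message at the start of every request
- Steers the model toward the expected style of reply

### History Formatting Function
The `build_input_from_chat_history()` function converts the conversation history into the message format the model expects: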

```python
# Input format:
#   chat_history: [(user message 1, assistant reply 1), (user message 2, assistant reply 2), ...]
#   msg: current user input

# Output format:
#   messages: [
#     {'role': 'system', 'content': system_prompt},
#     {'role': 'user', 'content': user message 1},
#     {'role': 'assistant', 'content': assistant reply 1},
#     ...
#     {'role': 'user', 'content': current input}
#   ]
```

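### Qwen1.5 Chat Format
Qwen1.5 models use the following special tokens:
- `<|im_start|>`: start of a message
- `<|im_end|>`: end of a message
- `<|endoftext|>`: end of text

`tokenizer.apply_chat_template()` converts the message list into text in this format automatically.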
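
To make the format concrete, the sketch below renders a message list by hand in the ChatML layout these tokens imply. `render_chatml` is a hypothetical helper and only an approximation; the template shipped with the tokenizer is authoritative and may differ in details such as default system messages.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages as <|im_start|>role\\ncontent<|im_end|>\\n blocks."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        text += "<|im_start|>assistant\n"
    return text

demo_messages = [
    {"role": "system", "content": "You are a helpful and friendly chatbot"},
    {"role": "user", "content": "Xin chào"},
]
rendered = render_chatml(demo_messages)
print(rendered)
```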


In [10]:

system_prompt = "You are a helpful and friendly chatbot"
def build_input_from_chat_history(chat_history, msg: str):
    messages = [{'role': 'system', 'content': system_prompt}]
    for user_msg, ai_msg in chat_history:
        messages.append({'role': 'user', 'content': user_msg})
        messages.append({'role': 'assistant', 'content': ai_msg})
    messages.append({'role': 'user', 'content': msg})
    return messages
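
Since every past turn is re-sent to the model, long conversations grow the prompt and slow down inference on an edge device. One possible mitigation, not part of the original example, is to cap the history before building the messages. `truncate_history` and `max_turns` are hypothetical names.

```python
def truncate_history(chat_history, max_turns=4):
    """Keep only the most recent `max_turns` (user, assistant) pairs."""
    if max_turns <= 0:
        return []  # guard: a slice of [-0:] would return the whole list
    return list(chat_history[-max_turns:])

demo_history = [("q1", "a1"), ("q2", "a2"), ("q3", "a3")]
recent = truncate_history(demo_history, max_turns=2)
print(recent)
```

The truncated list can then be passed to `build_input_from_chat_history` in place of the full history.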

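# Streaming Inference Function

## Overview
Implements streaming text generation: the user input and conversation history are converted into a format the model can process, and output is emitted token by token as it is generated.

## Processing Steps
1. **Input formatting**: call `build_input_from_chat_history` to turn the conversation history into the chat message list
2. **Tokenization**: use the tokenizer's `apply_chat_template` method to convert the message list into the model's input tensor
3. **Streamer setup**: create a `TextIteratorStreamer` with a timeout and with options for skipping the prompt and special tokens
4. **Generation parameters**: configure sampling (top-p = 0.9, temperature = 0.1) and the repetition penalty (1.2)
5. **Asynchronous generation**: start generation in a new thread so tokens can be streamed out as they arrive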

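## Key Parameters
- **max_new_tokens**: upper bound on the number of generated tokens (1024)
- **do_sample**: enables sampling instead of greedy decoding
- **top_p**: nucleus sampling threshold (0.9); only the most likely tokens whose cumulative probability reaches 0.9 are candidates
- **temperature**: a low value (0.1) sharpens the distribution, so output stays close to deterministic even with sampling enabled
- **repetition_penalty**: penalizes already-generated tokens (1.2) to avoid looping replies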
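
How these three parameters act on the model's logits can be sketched in plain Python. This is a simplified illustration of the usual definitions, not the code MindNLP runs; the function names are hypothetical.

```python
import math

def apply_repetition_penalty(logits, seen_ids, penalty):
    """Make already-generated tokens less likely: shrink positive logits, push negative ones lower."""
    out = list(logits)
    for i in set(seen_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize to probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Indices of the smallest high-probability set whose cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

demo_logits = [2.0, 1.0, 0.0]
kept_cold = top_p_filter(softmax_with_temperature(demo_logits, 0.1), 0.9)
kept_warm = top_p_filter(softmax_with_temperature(demo_logits, 1.0), 0.9)
print(kept_cold, kept_warm)
```

With these toy logits, temperature 0.1 leaves a single candidate token while temperature 1.0 leaves two, which is why the settings in this example behave almost like greedy decoding.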

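## Characteristics
- Streams the response token by token for a responsive, interactive feel
- Runs generation in a separate thread so the consumer loop is not blocked
- Balances generation quality against response latency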
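
The producer/consumer pattern behind this kind of streaming can be shown with the standard library alone. In the sketch below, `stream_from_worker` and `fake_generate` are hypothetical stand-ins for the streamer and for `model.generate`: a worker thread pushes text pieces into a queue and the caller iterates over them as they arrive.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream

def stream_from_worker(produce, timeout=5.0):
    """Run produce(emit) in a background thread and yield emitted items as they arrive."""
    q = queue.Queue()

    def worker():
        try:
            produce(q.put)
        finally:
            q.put(_DONE)  # always signal completion, even if produce() raises

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    while True:
        item = q.get(timeout=timeout)  # raises queue.Empty if the worker stalls
        if item is _DONE:
            break
        yield item
    t.join()

def fake_generate(emit):
    """Stand-in for a model that produces text piece by piece."""
    for piece in ["Xin", " ", "chào"]:
        emit(piece)

streamed = "".join(stream_from_worker(fake_generate))
print(streamed)
```

The timeout plays the same role as the `timeout` argument given to the streamer in this example: it keeps the consumer from waiting forever if generation hangs.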

In [11]:
# Generate a streamed reply for `message`, given the prior `history`.
def predict(message, history):
    # Build the chat message list (system prompt + history + new message).
    messages = build_input_from_chat_history(history, message)
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="ms",
        tokenize=True
    )
    # skip_prompt=True drops the echoed prompt; skip_special_tokens=False keeps
    # markers such as <|im_end|> in the streamed text.
    streamer = TextIteratorStreamer(tokenizer, timeout=300, skip_prompt=True, skip_special_tokens=False)
    generate_kwargs = dict(
        input_ids=input_ids,
        streamer=streamer,
        max_new_tokens=1024,
        do_sample=True,
        top_p=0.9,
        temperature=0.1,
        num_beams=1,
        repetition_penalty=1.2
    )
    # Run generation in a separate thread so the streamer can be consumed here.
    t = Thread(target=model.generate, kwargs=generate_kwargs)
    t.start()
    for new_token in streamer:
        yield new_token
    t.join()  # make sure the generation thread has finished


# Interactive Chat

Start the command-line chat loop, which supports multi-turn conversation and streamed responses. Type "exit" or "quit" to leave.

In [None]:

history = []

while True:
    message = input("You> ").strip()
    if message in ("exit", "quit"):
        break
    ans = ""
    print("Bot> ", end="", flush=True)
    for tok in predict(message, history):
        print(tok, end="", flush=True)
        ans += tok
    print()
    history.append([message, ans.strip()])


You>  Nguồn gốc tục thờ cúng tổ tiên của người Việt?


Bot> 

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


.Trong lịch sử, Cúng là một trong những công việc quan trọng nhất cho dân tộc Việt Nam. Nó được xây dựng từ năm 1902 đến năm 1934 và đã trở thành sự kiện lớn trong cuộc chiến chống đế quốc Mỹ. Trung tâm Cúng là nơi các hoạt động văn hóa truyền thống như phán hùng vĩ, trang phục nguy hiểm, và các hội thảo xã hội.<|im_end|>
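
The reply above ends with a literal `<|im_end|>` because the streamer is created with `skip_special_tokens=False`, so that marker is also stored in `history`. One way to clean a reply before storing it is sketched below. `strip_end_markers` is a hypothetical helper, not part of the original example.

```python
END_MARKERS = ("<|im_end|>", "<|endoftext|>")

def strip_end_markers(text, markers=END_MARKERS):
    """Cut the reply at the first end marker, if any, and trim surrounding whitespace."""
    cut = len(text)
    for marker in markers:
        pos = text.find(marker)
        if pos != -1:
            cut = min(cut, pos)
    return text[:cut].strip()

cleaned = strip_end_markers("Xin chào.<|im_end|>")
print(cleaned)
```

In the chat loop, this could be applied as `history.append([message, strip_end_markers(ans)])`.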
