# Trainer control
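Building on [fine-tuning with the trainer](./trainer_finetune_dataset2deploy.ipynb), the Qianfan Python SDK also provides flexible event callbacks and a resumable trainer. This notebook demonstrates both: it creates a training task, registers an `EventHandler`, and calls `resume()` after an error interrupts the run.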

In [None]:
! pip install "qianfan>=0.2.2" -U

In [1]:
import qianfan
qianfan.__version__

'0.2.2'

## Prerequisites
- Set the Qianfan access key (AK) and secret key (SK) used for authentication

In [2]:
import os 

os.environ["QIANFAN_ACCESS_KEY"] = "your_ak"
os.environ["QIANFAN_SECRET_KEY"] = "your_sk"

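#### Import dependencies
- `qianfan.trainer.consts`: constants used by the trainer
- `qianfan.resources.console.consts`: field constants defined at the API level
- `qianfan.trainer.configs`: config data classes required by the trainer
- `qianfan.trainer.LLMFinetune`: the trainer implementation for large language model fine-tuning tasks
- `qianfan.trainer.Service`: the service class that represents a model service on the platform; it can be obtained from `trainer.result`
- `qianfan.dataset.Dataset`: the Qianfan dataset class, used to import and export datasets from the Qianfan platform, local files, and third-party sources, and for operations such as data cleaning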

In [None]:
from qianfan.trainer.consts import ActionState
from qianfan.model.consts import ServiceType
from qianfan.resources.console import consts as console_consts
from qianfan.trainer.configs import TrainConfig
from qianfan.model.configs import DeployConfig
from qianfan.resources import QfMessages
from qianfan.trainer import LLMFinetune, Service
from qianfan.dataset import Dataset
from typing import cast
from qianfan.utils import enable_log
import logging

enable_log(logging.INFO)

## EventHandler

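To monitor the state of each node in every stage of training, register an event callback. The `action_state` carried by each event reports how the current action is running, so the handler can trigger the matching business callback or insert custom logic.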

In [None]:
from qianfan.model import Model
from qianfan.dataset import Dataset

# Load the dataset first; a platform preset dataset is used here as an example:
ds = Dataset.load(qianfan_dataset_id=15074, is_download_to_local=False)
trainer = LLMFinetune(
    train_type="ERNIE-Bot-turbo-0725",
    train_config=TrainConfig(
        epoch=1,
        learning_rate=0.0003,
        max_seq_len=4096,
        peft_type="LoRA",
    ),
    dataset=ds,
)

In [None]:
from qianfan.trainer.event import Event, EventHandler

testset: Dataset = Dataset.load(data_file="./data/fin_cqa_test.jsonl")
# Define a custom EventHandler and implement its dispatch method
class InferAfterSFT(EventHandler):
    target_action: str
    def __init__(self, target_action: str) -> None:
        super().__init__()
        self.target_action = target_action

    def dispatch(self, event: Event) -> None:
        print("receive: <", event)
        if self.target_action == event.action_id and event.action_state == ActionState.Done:
            svc = cast(Service, event.data["service"])
            print("svc", svc)
            for row in testset.list():
                msgs = QfMessages()
                msgs.append(row[0][0]["prompt"], "user")
                # Pass the QfMessages object itself, not the string "msgs"
                resp = svc.exec({"messages": msgs})
                print("row infer result", resp)
            

eh = InferAfterSFT(target_action=trainer.ppls[0].id)
trainer.register_event_handler(eh)
trainer.run()
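
The filtering logic in `dispatch` can be tried offline before starting a real training job. The following is a minimal sketch under stated assumptions: `FakeState`, `FakeEvent`, and `RecordingHandler` are hypothetical stand-ins that only mirror the `action_id`, `action_state`, and `data` fields used above; they are not part of the qianfan SDK.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List


class FakeState(str, Enum):
    # Hypothetical stand-in for qianfan.trainer.consts.ActionState
    Running = "Running"
    Done = "Done"
    Error = "Error"


@dataclass
class FakeEvent:
    # Hypothetical stand-in exposing only the fields the handler reads
    action_id: str
    action_state: FakeState
    data: Dict[str, Any] = field(default_factory=dict)


class RecordingHandler:
    """Collects the payload of every event where the target action is Done."""

    def __init__(self, target_action: str) -> None:
        self.target_action = target_action
        self.finished: List[Dict[str, Any]] = []

    def dispatch(self, event: FakeEvent) -> None:
        # Same condition as InferAfterSFT: matching action id and Done state
        if (
            event.action_id == self.target_action
            and event.action_state == FakeState.Done
        ):
            self.finished.append(event.data)


handler = RecordingHandler(target_action="train")
events = [
    FakeEvent("train", FakeState.Running),
    FakeEvent("deploy", FakeState.Done, {"service": "other"}),
    FakeEvent("train", FakeState.Done, {"service": "svc-placeholder"}),
]
for ev in events:
    handler.dispatch(ev)

print(handler.finished)
```

Only the last event passes both checks, so the handler ignores intermediate states and events from other actions.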

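### Resuming a task

For failures that retries cannot cover, such as network interruptions or an unstable service, the SDK provides `resume()` to pick the training process back up. The example below resumes an `LLMFinetune` run after it was interrupted: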

In [None]:
trainer.run()

[INFO] [12-07 21:54:28] data_source.py:1044 [t:139789057857344]: data releasing, keep rolling
[INFO] [12-07 21:54:30] data_source.py:1044 [t:139789057857344]: data releasing, keep rolling
[INFO] [12-07 21:54:33] data_source.py:1044 [t:139789057857344]: data releasing, keep rolling
[INFO] [12-07 21:54:38] data_source.py:1044 [t:139789057857344]: data releasing, keep rolling
[INFO] [12-07 21:54:41] data_source.py:1053 [t:139789057857344]: data releasing succeeded
[INFO] [12-07 21:54:44] actions.py:352 [t:139789057857344]: [train_action] fine-tune running... current status: RUNNING, check vdl report in https://console.bce.baidu.com/qianfan/visualdl/index?displayToken=eyJydW5JZCI6InJ1bi10MnlzaWQ3NjE1Z3N0Zm11In0=
[INFO] [12-07 21:55:14] actions.py:352 [t:139789057857344]: [train_action] fine-tune running... current status: RUNNING, check vdl report in https://console.bce.baidu.com/qianfan/visualdl/index?displayToken=eyJydW5JZCI6InJ1bi10MnlzaWQ3NjE1Z3N0Zm11In0=
[INFO] [12-07 21:55:46] action

APIError: api return error, code: 500002, msg: auth failed, no access

In [None]:
trainer.resume()

[INFO] [12-07 22:00:58] actions.py:390 [t:139789057857344]: [train_action] resume from created job 17304/9077
[INFO] [12-07 22:00:58] actions.py:352 [t:139789057857344]: [train_action] fine-tune running... current status: RUNNING, check vdl report in https://console.bce.baidu.com/qianfan/visualdl/index?displayToken=eyJydW5JZCI6InJ1bi10MnlzaWQ3NjE1Z3N0Zm11In0=
[INFO] [12-07 22:01:29] actions.py:352 [t:139789057857344]: [train_action] fine-tune running... current status: RUNNING, check vdl report in https://console.bce.baidu.com/qianfan/visualdl/index?displayToken=eyJydW5JZCI6InJ1bi10MnlzaWQ3NjE1Z3N0Zm11In0=
[INFO] [12-07 22:02:00] actions.py:352 [t:139789057857344]: [train_action] fine-tune running... current status: RUNNING, check vdl report in https://console.bce.baidu.com/qianfan/visualdl/index?displayToken=eyJydW5JZCI6InJ1bi10MnlzaWQ3NjE1Z3N0Zm11In0=
[INFO] [12-07 22:02:30] actions.py:352 [t:139789057857344]: [train_action] fine-tune running... current status: RUNNING, check vdl rep

[INFO] [12-07 22:36:20] model.py:199 [t:139789057857344]: model publishing keep polling, current status FINISH
[INFO] [12-07 22:36:20] model.py:233 [t:139789057857344]: model ready to publish
[INFO] [12-07 22:36:21] model.py:239 [t:139789057857344]: check model publish status: Creating
[INFO] [12-07 22:36:51] model.py:239 [t:139789057857344]: check model publish status: Ready
[INFO] [12-07 22:36:51] model.py:241 [t:139789057857344]: model 10248/12701 published successfully


<qianfan.trainer.finetune.LLMFinetune at 0x7f22c4dd0210>
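
The run-then-resume steps above can also be wrapped in a small helper. This is a sketch, not an SDK feature: `run_with_resume` and `FlakyTrainer` are hypothetical names, the helper relies only on the `run()` and `resume()` methods shown above, and catching a broad `Exception` by default is an assumption that should be narrowed to the errors you actually expect. It will not help when the cause (for example, invalid credentials) has to be fixed by hand first.

```python
import time
from typing import Any, Tuple, Type


def run_with_resume(
    trainer: Any,
    max_resumes: int = 3,
    wait_seconds: float = 0.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> Any:
    """Call trainer.run(); on failure call trainer.resume() up to max_resumes times."""
    try:
        return trainer.run()
    except retry_on as err:
        last_error = err
    for _ in range(max_resumes):
        time.sleep(wait_seconds)  # give the service time to recover
        try:
            return trainer.resume()
        except retry_on as err:
            last_error = err
    # Every attempt failed: surface the most recent error
    raise last_error


class FlakyTrainer:
    """Hypothetical stand-in that fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = []

    def _step(self) -> "FlakyTrainer":
        if self.failures > 0:
            self.failures -= 1
            raise RuntimeError("simulated interruption")
        return self

    def run(self) -> "FlakyTrainer":
        self.calls.append("run")
        return self._step()

    def resume(self) -> "FlakyTrainer":
        self.calls.append("resume")
        return self._step()


demo = FlakyTrainer(failures=2)
result = run_with_resume(demo)
print(demo.calls)
```

With a real trainer the call would be `run_with_resume(trainer, wait_seconds=30)`; the stand-in is used here only so the control flow can be checked without starting a job.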