# Training Pipeline
[run_training_dpo_pipeline.ipynb](https://github.com/shibing624/MedicalGPT/blob/main/run_training_dpo_pipeline.ipynb)    | [Open In Colab](https://colab.research.google.com/github/shibing624/MedicalGPT/blob/main/run_training_dpo_pipeline.ipynb)

# Stage 1: Continue Pretraining

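Stage 1: PT (Continue PreTraining), incremental pretraining. The GPT model is pretrained a second time on a large corpus of domain text so that it adapts to the domain's data distribution.

Notes:
1. This stage is optional. If you do not have a large domain corpus, skip it and go straight to supervised fine-tuning in the SFT stage.
2. In my experiments, SFT was more efficient than PT for injecting domain knowledge, so the PT stage can also be skipped.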

| Stage 1: Continue Pretraining   |  [pretraining.py](https://github.com/shibing624/MedicalGPT/blob/main/pretraining.py) | [run_pt.sh](https://github.com/shibing624/MedicalGPT/blob/main/run_pt.sh)    |

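#### Description:
To verify quickly that the training code works, the notebook/Colab code below uses a small generative model and a small sample dataset. In real use, switch to a larger model and dataset for better results.

1. Generative model: Bloom's `bigscience/bloomz-560m` (the runs recorded below pass a local `Qwen1.5-1.8B-Chat` directory to `--model_name_or_path` instead)
2. Dataset: the PT stage uses excerpts from the Chinese novel 天龙八部 (`tianlongbabu.txt`) and from English books, located in the `data/pretrain` folder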

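## Set up the runtime environment

When running locally, you can comment out the environment setup commands below. When running on Colab, uncomment them to set up the environment.

On Colab, a T4 GPU is recommended. To select it: `Runtime -> Change runtime type -> Runtime type: Python3, Hardware accelerator: GPU, GPU type: T4 -> Save`

Steps:
1. Download the latest code to the local machine
2. Install the dependencies

The dependencies are listed below; keep them at their latest versions: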

```
loguru
transformers
sentencepiece
datasets
tensorboard
tqdm
peft
trl
```
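
Before training, it can help to confirm which of these packages are actually installed. The snippet below is a minimal sketch that uses only the standard library; the helper name `installed_versions` is made up for this example and is not part of MedicalGPT.

```python
from importlib import metadata

# Package names copied from the dependency list above.
REQUIRED = ["loguru", "transformers", "sentencepiece", "datasets",
            "tensorboard", "tqdm", "peft", "trl"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` should be covered by `pip install -r requirements.txt` in the next cell.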

In [1]:
#!git clone --depth 1 https://github.com/shibing624/MedicalGPT.git
%cd MedicalGPT
%ls
!pip install -r requirements.txt

[WinError 2] The system cannot find the file specified: 'MedicalGPT'
d:\llm\whole_process\MedicalGPT
 Volume in drive D is New Volume
 Volume Serial Number is B4E8-CC63

 Directory of d:\llm\whole_process\MedicalGPT

06/17/2024  04:52 PM    <DIR>          .
06/14/2024  09:35 PM    <DIR>          ..
06/12/2024  07:20 PM    <DIR>          .github
06/12/2024  07:20 PM             1,936 .gitignore
06/13/2024  12:57 PM    <DIR>          __pycache__
06/12/2024  07:20 PM                26 _config.yml
06/12/2024  07:20 PM             2,127 build_domain_tokenizer.py
06/15/2024  01:36 PM    <DIR>          cache
06/12/2024  07:20 PM            20,272 chatpdf.py
06/12/2024  07:20 PM               317 CITATION.cff
06/12/2024  07:20 PM               473 CONTRIBUTING.md
06/12/2024  07:20 PM             2,673 convert_dataset.py
06/12/2024  07:20 PM    <DIR>          data
06/12/2024  07:20 PM             1,171 deepspeed_zero_stage2_config.json
06/12/2024  07:20 PM             1,277 deepspeed_zero_stage

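## Stage 1: let's get started

The training steps are:

1. Check the training set
2. Run the training script

The training script works as follows:
1. Import the dependencies
2. Set the parameters
3. Define the functions and load the training set
4. Load the model and tokenizer
5. Train and evaluate
6. Inspect the training results

**The parameters below can be adjusted to match your GPU. The current values are configured for a single Colab T4 GPU (16 GB of VRAM).**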
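
One of those parameters is the learning-rate schedule. The Stage 1 command in this notebook passes `--learning_rate 2e-4` and `--warmup_ratio 0.05` and does not select a scheduler, so the sketch below assumes the linear warmup followed by linear decay that the Hugging Face `Trainer` uses by default. `TOTAL_STEPS = 834` is not stated anywhere in the source: it is inferred from the learning rates in the recorded log and should be read as an assumption.

```python
import math

PEAK_LR = 2e-4        # --learning_rate
WARMUP_RATIO = 0.05   # --warmup_ratio
TOTAL_STEPS = 834     # assumption: inferred from the logged learning rates


def linear_schedule_lr(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS,
                       warmup_ratio=WARMUP_RATIO):
    """Learning rate after `step` scheduler steps: linear warmup, then linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (1, 10, 42, 49, 834):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the function gives about `4.7619e-06` after 1 step and `4.7619e-05` after 10, the first two learning rates in the recorded Stage 1 log. Later entries in that log sit one scheduler step behind the logging step (the value logged at step 50 equals the schedule at 49), which would be consistent with an optimizer step skipped under `--fp16`; that explanation is a guess.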

In [2]:
%ls data\pretrain

 Volume in drive D is New Volume
 Volume Serial Number is B4E8-CC63

 Directory of d:\llm\whole_process\MedicalGPT\data\pretrain

06/12/2024  07:20 PM    <DIR>          .
06/12/2024  07:20 PM    <DIR>          ..
06/12/2024  07:20 PM            27,992 en_article_tail500.txt
06/12/2024  07:20 PM           352,651 fever.txt
06/12/2024  07:20 PM           853,842 tianlongbabu.txt
               3 File(s)      1,234,485 bytes
               2 Dir(s)  446,528,978,944 bytes free


In [13]:
!python pretraining.py \
    --model_type auto \
    --model_name_or_path Qwen1.5-1.8B-Chat \
    --train_file_dir data\pretrain \
    --validation_file_dir data\pretrain \
    --per_device_train_batch_size 3 \
    --per_device_eval_batch_size 3 \
    --do_train \
    --do_eval \
    --use_peft True \
    --seed 42 \
    --fp16 \
    --max_train_samples 20000 \
    --max_eval_samples 10 \
    --num_train_epochs 1 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --block_size 128 \
    --group_by_length True \
    --output_dir outputs-pt-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True

trainable params: 7,495,680 || all params: 1,844,324,352 || trainable%: 0.4064187512284173
{'loss': 4.2204, 'grad_norm': 1.4093387126922607, 'learning_rate': 4.7619047619047615e-06, 'epoch': 0.0}
{'loss': 3.7902, 'grad_norm': 1.3315517902374268, 'learning_rate': 4.761904761904762e-05, 'epoch': 0.01}
{'loss': 3.7574, 'grad_norm': 1.3139597177505493, 'learning_rate': 9.047619047619048e-05, 'epoch': 0.02}
{'loss': 3.619, 'grad_norm': 1.4987647533416748, 'learning_rate': 0.0001380952380952381, 'epoch': 0.04}
{'loss': 3.4705, 'grad_norm': 1.748633623123169, 'learning_rate': 0.00018571428571428572, 'epoch': 0.05}
{'loss': 3.5659, 'grad_norm': 2.238861560821533, 'learning_rate': 0.00019823232323232324, 'epoch': 0.06}
{'eval_loss': 3.651273012161255, 'eval_accuracy': 0.3590551181102362, 'eval_runtime': 0.184, 'eval_samples_per_second': 54.337, 'eval_steps_per_second': 21.735, 'epoch': 0.06}
{'loss': 3.4443, 'grad_norm': 1.8516820669174194, 'learning_rate': 0.0001957070707070707, 'epoch': 0.07}

2024-08-21 20:26:01.779863: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-08-21 20:26:02.881 | INFO     | __main__:main:377 - Model args: ModelArguments(model_type='auto', model_name_or_path='Qwen1.5-1.8B-Chat', tokenizer_name_or_path=None, load_in_8bit=False, load_in_4bit=False, cache_dir=None, model_revision='main', hf_hub_token=None, use_fast_tokenizer=False, torch_dtype='float16', device_map='auto', trust_remote_code=True)

In [9]:
# !python pretraining.py \
#     --model_type auto \
#     --model_name_or_path Qwen/Qwen1.5-0.5B-Chat \
#     --train_file_dir data\pretrain \
#     --validation_file_dir data\pretrain \
#     --per_device_train_batch_size 3 \
#     --per_device_eval_batch_size 3 \
#     --do_train \
#     --do_eval \
#     --use_peft True \
#     --seed 42 \
#     --fp16 \
#     --max_train_samples 20000 \
#     --max_eval_samples 10 \
#     --num_train_epochs 1 \
#     --learning_rate 2e-4 \
#     --warmup_ratio 0.05 \
#     --weight_decay 0.01 \
#     --logging_strategy steps \
#     --logging_steps 10 \
#     --eval_steps 50 \
#     --evaluation_strategy steps \
#     --save_steps 500 \
#     --save_strategy steps \
#     --save_total_limit 3 \
#     --gradient_accumulation_steps 1 \
#     --preprocessing_num_workers 1 \
#     --block_size 128 \
#     --group_by_length True \
#     --output_dir outputs-pt-v1 \
#     --overwrite_output_dir \
#     --ddp_timeout 30000 \
#     --logging_first_step True \
#     --target_modules all \
#     --lora_rank 8 \
#     --lora_alpha 16 \
#     --lora_dropout 0.05 \
#     --torch_dtype float16 \
#     --device_map auto \
#     --report_to tensorboard \
#     --ddp_find_unused_parameters False \
#     --gradient_checkpointing True

trainable params: 3,784,704 || all params: 467,772,416 || trainable%: 0.8091
{'loss': 5.1886, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 4.7897, 'grad_norm': 2.5664875507354736, 'learning_rate': 3.3333333333333335e-05, 'epoch': 0.01}
{'loss': 4.5736, 'grad_norm': 1.6772146224975586, 'learning_rate': 7.619047619047618e-05, 'epoch': 0.02}
{'loss': 4.5652, 'grad_norm': 1.6379257440567017, 'learning_rate': 0.0001238095238095238, 'epoch': 0.04}
{'loss': 4.2609, 'grad_norm': 1.7567120790481567, 'learning_rate': 0.00015714285714285716, 'epoch': 0.05}
{'loss': 4.2746, 'grad_norm': 1.9959819316864014, 'learning_rate': 0.00019974747474747474, 'epoch': 0.06}
{'eval_loss': 4.292125225067139, 'eval_accuracy': 0.3094488188976378, 'eval_runtime': 0.1572, 'eval_samples_per_second': 63.623, 'eval_steps_per_second': 25.449, 'epoch': 0.06}
{'loss': 4.0705, 'grad_norm': 2.0355987548828125, 'learning_rate': 0.00019722222222222225, 'epoch': 0.07}
{'loss': 4.0627, 'grad_norm': 2.772348165

2024-06-13 13:28:13.221214: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-13 13:28:14.327 | INFO     | __main__:main:377 - Model args: ModelArguments(model_type='auto', model_name_or_path='Qwen/Qwen1.5-0.5B-Chat', tokenizer_name_or_path=None, load_in_8bit=False, load_in_4bit=False, cache_dir=None, model_revision='main', hf_hub_token=None, use_fast_tokenizer=False, torch_dtype='float16', device_map='auto', trust_remote_code=Tru

In [5]:
%ls outputs-pt-v1

 Volume in drive D is New Volume
 Volume Serial Number is B4E8-CC63

 Directory of d:\llm\whole_process\MedicalGPT\outputs-pt-v1

08/21/2024  07:27 PM    <DIR>          .
06/17/2024  04:52 PM    <DIR>          ..
08/21/2024  07:27 PM               762 adapter_config.json
08/21/2024  07:27 PM        83,945,296 adapter_model.safetensors
08/21/2024  07:27 PM                55 added_tokens.json
08/21/2024  07:27 PM               486 all_results.json
08/21/2024  07:26 PM    <DIR>          checkpoint-1000
06/13/2024  01:29 PM    <DIR>          checkpoint-500
08/21/2024  07:27 PM               271 eval_results.json
06/13/2024  01:30 PM         1,823,241 merges.txt
08/21/2024  07:27 PM             5,113 README.md
08/21/2024  07:22 PM    <DIR>          runs
08/21/2024  07:27 PM               443 special_tokens_map.json
08/21/2024  07:27 PM           493,443 tokenizer.model
08/21/2024  07:27 PM             1,715 tokenizer_config.json
08/21/2024  07


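Training results:
- When the model is trained with LoRA, the saved LoRA weights are `adapter_model.safetensors` (older `peft` releases wrote `adapter_model.bin`) and the LoRA configuration file is `adapter_config.json`. See `merge_peft_adapter.py` for how to merge them into the base model
- Logs are saved under `output_dir/runs` and can be viewed with TensorBoard, started like this: `tensorboard --logdir output_dir/runs --host 0.0.0.0 --port 8009`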
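
To make "merging into the base model" concrete: in the standard LoRA formulation, each adapted weight matrix `W` becomes `W + (lora_alpha / lora_rank) * B @ A`, after which the adapter is no longer needed. The sketch below shows that arithmetic on a toy matrix in plain Python. It is not the code of `merge_peft_adapter.py` (whose log in this notebook only shows that it calls `merge_and_unload`), and the numbers are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, lora_rank):
    """Return W + (alpha / r) * (B @ A), the dense weight after merging."""
    scaling = lora_alpha / lora_rank
    delta = matmul(lora_b, lora_a)  # shape: out_features x in_features
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy example: a layer with 3 inputs and 2 outputs, and a rank-1 adapter.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]      # lora_rank x in_features
B = [[0.5],
     [0.0]]                # out_features x lora_rank

merged = merge_lora(W, A, B, lora_alpha=2, lora_rank=1)
print(merged)
```

With `--lora_rank 8 --lora_alpha 16`, as used in this notebook, the scaling factor `lora_alpha / lora_rank` is 2.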

To merge the LoRA weights into the base model, run the command below; the merged model is saved in the `--output_dir` directory:

In [14]:
!python merge_peft_adapter.py --model_type auto \
    --base_model Qwen1.5-1.8B-Chat --lora_model outputs-pt-v1 --output_dir merged-pt/

Namespace(model_type='auto', base_model='Qwen1.5-1.8B-Chat', tokenizer_path=None, lora_model='outputs-pt-v1', resize_emb=False, output_dir='merged-pt/', hf_hub_model_id='', hf_hub_token=None)
Base model: Qwen1.5-1.8B-Chat
LoRA model: outputs-pt-v1
Loading LoRA for causal language model
Merging with merge_and_unload...
Saving to Hugging Face format...
Done! model saved to merged-pt/


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [15]:
%ls merged-pt


In [8]:
# `%cat` is not available on Windows; reading the file in Python works on any platform
print(open('merged-pt/config.json', encoding='utf-8').read())


Stage 1 incremental pretraining is complete.

# Stage 2: Supervised FineTuning

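Stage 2: SFT (Supervised Fine-tuning). An instruction fine-tuning dataset is constructed and the pretrained model is fine-tuned on it, to align the model with the intent of instructions and to inject domain knowledge.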
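
A common way to build a training example for instruction tuning is to concatenate the prompt and the response and to compute the loss only on the response tokens, by setting the prompt positions of `labels` to `-100`. The sketch below illustrates that idea with hypothetical token ids; whether `supervised_finetuning.py` masks prompts in exactly this way is not shown in this notebook.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss ignores by default


def build_sft_example(prompt_ids, response_ids, eos_id):
    """Concatenate prompt and response; supervise only the response tokens."""
    input_ids = list(prompt_ids) + list(response_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id]
    return {"input_ids": input_ids, "labels": labels}


# Hypothetical token ids, chosen only for illustration.
example = build_sft_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], eos_id=2)
print(example)
```

The two lists have the same length, so every input position has a label, and the three prompt positions contribute nothing to the loss.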

| Stage 2: Supervised Fine-tuning | [supervised_finetuning.py](https://github.com/shibing624/MedicalGPT/blob/main/supervised_finetuning.py) | [run_sft.sh](https://github.com/shibing624/MedicalGPT/blob/main/run_sft.sh)  |

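#### Description:
To verify quickly that the training code works, the notebook/Colab code below uses a small generative model and a small sample dataset. In real use, switch to a larger model and dataset for better results.

1. Generative model: Bloom's `bigscience/bloomz-560m`, or the pretrained model produced by Stage 1 (the run recorded below uses `merged-pt`)
2. Dataset: the SFT stage uses a 1,000-example sample of Belle data, located in the `data/finetune` folder

## Stage 2: let's get started

The training steps are:

1. Check the training set
2. Run the training script

The training script works as follows:
1. Import the dependencies
2. Set the parameters
3. Define the functions and load the training set
4. Load the model and tokenizer
5. Train and evaluate
6. Inspect the training results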

In [16]:
%ls merged-pt

 Volume in drive D is New Volume
 Volume Serial Number is B4E8-CC63

 Directory of d:\llm\whole_process\MedicalGPT\merged-pt

08/21/2024  08:34 PM    <DIR>          .
08/21/2024  08:24 PM    <DIR>          ..
08/21/2024  08:34 PM                85 added_tokens.json
08/21/2024  08:34 PM               729 config.json
08/21/2024  08:34 PM               217 generation_config.json
08/21/2024  08:34 PM         1,671,853 merges.txt
08/21/2024  08:34 PM     3,673,690,400 model.safetensors
08/21/2024  07:29 PM            24,248 model.safetensors.index.json
08/21/2024  08:34 PM               387 special_tokens_map.json
08/21/2024  08:34 PM         7,028,015 tokenizer.json
08/21/2024  07:29 PM           493,443 tokenizer.model
08/21/2024  08:34 PM             1,342 tokenizer_config.json
08/21/2024  08:34 PM         2,776,833 vocab.json
              11 File(s)  3,685,687,552 bytes
               2 Dir(s)  436,107,190,272 bytes free


In [17]:
%ls data\finetune

 Volume in drive D is New Volume
 Volume Serial Number is B4E8-CC63

 Directory of d:\llm\whole_process\MedicalGPT\data\finetune

06/12/2024  07:20 PM    <DIR>          .
06/12/2024  07:20 PM    <DIR>          ..
06/12/2024  07:20 PM           766,815 medical_sft_1K_format.jsonl
06/12/2024  07:20 PM         4,082,858 sharegpt_zh_1K_format.jsonl
               2 File(s)      4,849,673 bytes
               2 Dir(s)  436,107,190,272 bytes free


In [18]:
!python supervised_finetuning.py \
    --model_type auto \
    --model_name_or_path merged-pt \
    --train_file_dir ./data/finetune \
    --validation_file_dir ./data/finetune \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft True \
    --fp16 \
    --max_train_samples 1000 \
    --max_eval_samples 10 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.05 \
    --weight_decay 0.05 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --output_dir outputs-sft-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True

trainable params: 7,495,680 || all params: 1,844,324,352 || trainable%: 0.4064187512284173
{'loss': 2.763, 'grad_norm': 0.9128183722496033, 'learning_rate': 1.5384615384615387e-06, 'epoch': 0.0}
{'loss': 2.3173, 'grad_norm': 2.5105676651000977, 'learning_rate': 1.5384615384615387e-05, 'epoch': 0.04}
{'loss': 2.4568, 'grad_norm': 1.6755942106246948, 'learning_rate': 1.9409282700421944e-05, 'epoch': 0.08}
{'loss': 2.6113, 'grad_norm': 1.8691452741622925, 'learning_rate': 1.856540084388186e-05, 'epoch': 0.12}
{'loss': 2.1971, 'grad_norm': 1.1665077209472656, 'learning_rate': 1.7721518987341772e-05, 'epoch': 0.16}
{'loss': 2.2834, 'grad_norm': 1.2559572458267212, 'learning_rate': 1.687763713080169e-05, 'epoch': 0.2}
{'eval_loss': 2.512449264526367, 'eval_runtime': 0.1816, 'eval_samples_per_second': 55.076, 'eval_steps_per_second': 16.523, 'epoch': 0.2}
{'loss': 2.158, 'grad_norm': 1.4853975772857666, 'learning_rate': 1.6033755274261603e-05, 'epoch': 0.24}
{'loss': 2.2105, 'grad_norm': 0.75

2024-08-21 20:34:29.765689: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-08-21 20:34:30.818 | INFO     | __main__:main:500 - Model args: ModelArguments(model_type='auto', model_name_or_path='merged-pt', load_in_8bit=False, load_in_4bit=False, tokenizer_name_or_path=None, cache_dir=None, model_revision='main', hf_hub_token=None, use_fast_tokenizer=False, torch_dtype='float16', device_map='auto', trust_remote_code=True, rope_scali

In [19]:
%ls outputs-sft-v1

 Volume in drive D is New Volume
 Volume Serial Number is B4E8-CC63

 Directory of d:\llm\whole_process\MedicalGPT\outputs-sft-v1

06/13/2024  01:37 PM    <DIR>          .
08/21/2024  08:24 PM    <DIR>          ..
08/21/2024  08:35 PM               746 adapter_config.json
08/21/2024  08:35 PM        30,026,872 adapter_model.safetensors
08/21/2024  08:35 PM                85 added_tokens.json
08/21/2024  08:35 PM               446 all_results.json
08/21/2024  08:35 PM               231 eval_results.json
08/21/2024  08:35 PM         1,821,636 merges.txt
08/21/2024  08:35 PM             5,097 README.md
08/21/2024  08:34 PM    <DIR>          runs
08/21/2024  08:35 PM               417 special_tokens_map.json
08/21/2024  08:35 PM             1,377 tokenizer_config.json
08/21/2024  08:35 PM               199 train_results.json
08/21/2024  08:35 PM             6,060 trainer_state.json
08/21/2024  08:35 PM         3,535,052 vocab.json


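Training results:
- When the model is trained with LoRA, the saved LoRA weights are `adapter_model.safetensors` (older `peft` releases wrote `adapter_model.bin`) and the LoRA configuration file is `adapter_config.json`. See `merge_peft_adapter.py` for how to merge them into the base model
- Logs are saved under `output_dir/runs` and can be viewed with TensorBoard, started like this: `tensorboard --logdir output_dir/runs --host 0.0.0.0 --port 8009`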
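
The `eval_loss` reported during training is easier to interpret as a perplexity, which is `exp(eval_loss)` when the loss is the mean per-token cross-entropy in nats. The following sketch computes it from the value in the Stage 2 log; the helper `load_metrics` and the assumption that `eval_results.json` contains an `eval_loss` key are illustrative, not confirmed by the source.

```python
import json
import math
from pathlib import Path


def perplexity(metrics):
    """exp(eval_loss); assumes eval_loss is the mean per-token cross-entropy in nats."""
    return math.exp(metrics["eval_loss"])


def load_metrics(output_dir):
    """Read eval_results.json from a training output directory."""
    path = Path(output_dir) / "eval_results.json"
    return json.loads(path.read_text(encoding="utf-8"))


# eval_loss copied from the Stage 2 log line in this notebook.
print(perplexity({"eval_loss": 2.512449264526367}))

# Against a real run (assumption: the file contains an `eval_loss` key):
# print(perplexity(load_metrics("outputs-sft-v1")))
```

With only `--max_eval_samples 10`, the resulting number is a smoke test rather than a reliable quality measure.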

To merge the LoRA weights into the base model, run the command below; the merged model is saved in the `--output_dir` directory:

In [20]:
!python merge_peft_adapter.py --model_type auto \
    --base_model merged-pt --lora_model outputs-sft-v1 --output_dir ./merged-sft

Namespace(model_type='auto', base_model='merged-pt', tokenizer_path=None, lora_model='outputs-sft-v1', resize_emb=False, output_dir='./merged-sft', hf_hub_model_id='', hf_hub_token=None)
Base model: merged-pt
LoRA model: outputs-sft-v1
Loading LoRA for causal language model
Merging with merge_and_unload...
Saving to Hugging Face format...
Done! model saved to ./merged-sft


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [21]:
%ls merged-sft


In [22]:
# `%cat` is not available on Windows; reading the file in Python works on any platform
print(open('merged-sft/config.json', encoding='utf-8').read())


Stage 2 SFT training is complete.

# Stage 3: DPO(Direct Preference Optimization)

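Stage 3: DPO (Direct Preference Optimization). DPO controls the model's behavior by optimizing the language model directly on preference data, without a complex reinforcement learning setup, and it can still learn human preferences effectively. Compared with RLHF, DPO is easier to implement and to train, and it can give better results.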
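
The objective behind this can be written down in a few lines. For one preference pair, the DPO loss is `-log(sigmoid(margin))`, where the margin is the difference between the implicit rewards of the chosen and the rejected response, and each implicit reward is `beta` times the log-probability gap between the policy and a frozen reference model. The sketch below is a plain Python illustration, not the code of `dpo_training.py`; `beta=0.1` is an illustrative value.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair; beta=0.1 is an illustrative value."""
    # Implicit rewards: how far the policy has moved away from the reference model.
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Before any update the policy equals the reference, so the margin is 0 and
# the loss is log(2). The log-probabilities are taken from the first DPO log
# line recorded in this notebook.
print(dpo_loss(-40.92887878417969, -139.6123504638672,
               -40.92887878417969, -139.6123504638672))
```

This is why the first DPO steps in the recorded log show `'loss': 0.6931` together with rewards of exactly zero. The loss drops below `log(2)` once the policy favors the chosen response more strongly than the reference does.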

| Stage 3: Direct Preference Optimization        |  [dpo_training.py](https://github.com/shibing624/MedicalGPT/blob/main/dpo_training.py) | [run_dpo.sh](https://github.com/shibing624/MedicalGPT/blob/main/run_dpo.sh)    |

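#### Description:
To verify quickly that the training code works, the notebook/Colab code below uses a small generative model and a small sample dataset. In real use, switch to a larger model and dataset for better results.

1. Generative model: Bloom's `bigscience/bloomz-560m`, or the SFT model produced by Stage 2 (the run recorded below trains on `Qwen1.5-1.8B-Chat`)
2. Dataset: the DPO stage uses medical reward data, a sample of 500 examples, located in the `data/reward` folder (the run recorded below reads `orca_rlhf_mod.jsonl` from that folder)

## Stage 3: let's get started

The training steps are:

1. Check the training set
2. Run the training script

The training script works as follows:
1. Import the dependencies
2. Set the parameters
3. Define the functions and load the training set
4. Load the model and tokenizer
5. Train and evaluate
6. Inspect the training results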

In [23]:
%ls Qwen1.5-1.8B-Chat

 Volume in drive D is New Volume
 Volume Serial Number is B4E8-CC63

 Directory of d:\llm\whole_process\MedicalGPT\Qwen1.5-1.8B-Chat

08/21/2024  08:24 PM    <DIR>          .
08/21/2024  08:24 PM    <DIR>          ..
04/30/2024  03:49 PM             1,554 .gitattributes
08/21/2024  08:24 PM    <DIR>          __pycache__
08/21/2024  08:24 PM    <DIR>          chatgpt
04/30/2024  03:49 PM               689 config.json
04/30/2024  03:49 PM                55 configuration.json
04/30/2024  03:49 PM               219 generation_config.json
04/30/2024  03:49 PM             7,335 LICENSE
04/30/2024  07:58 PM             5,856 main.py
04/30/2024  03:49 PM         1,823,226 merges.txt
04/30/2024  07:06 PM               960 mess_test.py
04/30/2024  07:38 PM               822 message_test.py
04/30/2024  03:52 PM     3,673,690,696 model.safetensors
04/26/2024  08:22 PM             3,196 openai_api_request.py
04/30/2024  03:49 PM             4,344 README.md
08/21/2024  08:24 PM    <DIR>          tes

In [25]:
import json

# Path to the JSONL file (one JSON object per line)
json_file_path = r'data\reward\orca_rlhf_mod.jsonl'

# Parse every line into a dict
with open(json_file_path, 'r', encoding='utf-8') as file:
    data = [json.loads(line) for line in file]

# `data` is a list of dicts; print the first record
print(data[0])

{'system': '', 'history': [], 'question': "You will be given a definition of a task first, then some input of the task.\nThis task is about using the specified sentence and converting the sentence to Resource Description Framework (RDF) triplets of the form (subject, predicate object). The RDF triplets generated must be such that the triplets accurately capture the structure and semantics of the input sentence. The input is a sentence and the output is a list of triplets of the form [subject, predicate, object] that capture the relationships present in the sentence. When a sentence has more than 1 RDF triplet possible, the output must contain all of them.\n\nAFC Ajax (amateurs)'s ground is Sportpark De Toekomst where Ajax Youth Academy also play.\nOutput:", 'response_chosen': '[\n  ["AFC Ajax (amateurs)", "has ground", "Sportpark De Toekomst"],\n  ["Ajax Youth Academy", "plays at", "Sportpark De Toekomst"]\n]', 'response_rejected': " Sure, I'd be happy to help! Here are the RDF triplet

In [10]:
import json

# Define the path to the input and output JSONL files
input_jsonl_file_path = r'D:\llm\whole_process\zuoye\orca_rlhf.jsonl'
output_jsonl_file_path = r'data\reward\orca_rlhf_mod.jsonl'

def modify_keys_and_add_history(json_obj):
    # Ensure the JSON object contains the required keys
    if 'system' in json_obj and 'question' in json_obj:
        # Create a new ordered dictionary to maintain the order of keys
        modified_obj = {}
        for key, value in json_obj.items():
            if key == 'chosen':
                modified_obj['response_chosen'] = value
            elif key == 'rejected':
                modified_obj['response_rejected'] = value
            elif key == 'system':
                modified_obj[key] = value
                modified_obj['history'] = []
            else:
                modified_obj[key] = value
        return modified_obj
    return json_obj


# Read, modify, and write the JSONL file (explicit UTF-8, so the Windows default encoding is not used)
with open(input_jsonl_file_path, 'r', encoding='utf-8') as infile, \
        open(output_jsonl_file_path, 'w', encoding='utf-8') as outfile:
    for line in infile:
        json_obj = json.loads(line)
        modified_obj = modify_keys_and_add_history(json_obj)
        outfile.write(json.dumps(modified_obj) + '\n')
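
To see what the conversion does to a single record, here is a compact restatement of the same renaming applied to a made-up example. The function name `convert_record` and the sample record are hypothetical; the key names are the ones used in the cell above.

```python
import json

# Old key -> new key, as in the conversion cell above.
RENAMES = {"chosen": "response_chosen", "rejected": "response_rejected"}


def convert_record(record):
    """Rename the preference keys and add an empty `history` right after `system`."""
    if "system" not in record or "question" not in record:
        return record  # leave unexpected records untouched
    converted = {}
    for key, value in record.items():
        converted[RENAMES.get(key, key)] = value
        if key == "system":
            converted["history"] = []
    return converted


# Made-up record, for illustration only.
sample = {"system": "", "question": "1 + 1 = ?", "chosen": "2", "rejected": "3"}
print(json.dumps(convert_record(sample)))
```

The resulting key order (`system`, `history`, `question`, `response_chosen`, `response_rejected`) matches the first record printed from `orca_rlhf_mod.jsonl` earlier in this notebook.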



In [26]:
from trl import AutoModelForCausalLMWithValueHead

model_name_or_path = './Qwen1.5-1.8B-Chat'  # Path to your model
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name_or_path)

print(model)

AutoModelForCausalLMWithValueHead(
  (pretrained_model): Qwen2ForCausalLM(
    (model): Qwen2Model(
      (embed_tokens): Embedding(151936, 2048)
      (layers): ModuleList(
        (0-23): 24 x Qwen2DecoderLayer(
          (self_attn): Qwen2SdpaAttention(
            (q_proj): Linear(in_features=2048, out_features=2048, bias=True)
            (k_proj): Linear(in_features=2048, out_features=2048, bias=True)
            (v_proj): Linear(in_features=2048, out_features=2048, bias=True)
            (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
            (rotary_emb): Qwen2RotaryEmbedding()
          )
          (mlp): Qwen2MLP(
            (gate_proj): Linear(in_features=2048, out_features=5504, bias=False)
            (up_proj): Linear(in_features=2048, out_features=5504, bias=False)
            (down_proj): Linear(in_features=5504, out_features=2048, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): Qwen2RMSNorm()
          (post_atte
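
The layer shapes in this printout are enough to reproduce the `trainable params` figures that the training logs report. A LoRA adapter on a `Linear(in, out)` layer adds two matrices, `A` (rank x in) and `B` (out x rank), that is `rank * (in + out)` parameters. The sketch below assumes that `--target_modules all` adapts exactly the seven projection types listed here; this is inferred from the matching totals and is not stated in the source.

```python
# (in_features, out_features) of the Linear layers in one Qwen2DecoderLayer,
# read from the model printout above.
LAYER_SHAPES = {
    "q_proj": (2048, 2048), "k_proj": (2048, 2048),
    "v_proj": (2048, 2048), "o_proj": (2048, 2048),
    "gate_proj": (2048, 5504), "up_proj": (2048, 5504),
    "down_proj": (5504, 2048),
}
NUM_LAYERS = 24  # "(0-23): 24 x Qwen2DecoderLayer"


def lora_trainable_params(target_modules, rank, shapes=LAYER_SHAPES,
                          num_layers=NUM_LAYERS):
    """Each adapted Linear adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (shapes[m][0] + shapes[m][1]) for m in target_modules)
    return per_layer * num_layers


print(lora_trainable_params(LAYER_SHAPES, rank=8))          # 7495680
print(lora_trainable_params(["q_proj", "k_proj"], rank=8))  # 1572864
```

The first number equals the `7,495,680` trainable parameters logged by the PT and SFT runs, and the second equals the `1572864` logged by the DPO run with `--target_modules q_proj,k_proj`.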

In [27]:
%ls data\reward\

 Volume in drive D is New Volume
 Volume Serial Number is B4E8-CC63

 Directory of d:\llm\whole_process\MedicalGPT\data\reward

06/15/2024  10:42 AM    <DIR>          .
06/12/2024  07:20 PM    <DIR>          ..
06/15/2024  10:43 AM        36,747,287 orca_rlhf_mod.jsonl
               1 File(s)     36,747,287 bytes
               2 Dir(s)  433,346,654,208 bytes free


In [28]:
import torch

# Clear CUDA memory cache
torch.cuda.empty_cache()
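
`torch.cuda.empty_cache()` only releases memory that PyTorch has cached but that no tensor references any more, so objects that are still alive (for example the `model` loaded in an earlier cell) keep their memory until they are deleted. A slightly more thorough cleanup could look like the sketch below; the helper name `release_cuda_cache` is made up for this example, and it does nothing harmful when `torch` or a GPU is absent.

```python
import gc


def release_cuda_cache():
    """Collect garbage, then ask PyTorch to release cached GPU blocks.

    Returns True if a CUDA cache was emptied, False otherwise.
    """
    gc.collect()  # drop objects that are no longer referenced
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True


# In the notebook one would first drop the reference created earlier, e.g.
#     del model
# and then call:
print(release_cuda_cache())
```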

In [29]:
!python dpo_training.py \
    --model_type auto \
    --model_name_or_path Qwen1.5-1.8B-Chat \
    --train_file_dir data\reward \
    --validation_file_dir data\reward \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --do_eval \
    --use_peft True \
    --max_train_samples 1000 \
    --max_eval_samples 5 \
    --max_steps 100 \
    --eval_steps 20 \
    --save_steps 50 \
    --max_source_length 512 \
    --max_target_length 256 \
    --output_dir outputs-dpo-qwen1.5 \
    --target_modules q_proj,k_proj \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float16 \
    --fp16 True \
    --device_map auto \
    --report_to tensorboard \
    --remove_unused_columns False \
    --gradient_checkpointing True \
    --cache_dir ./cache

trainable params: 1572864 || all params: 1838401536 || trainable%: 0.08555606428736143
{'loss': 0.6931, 'grad_norm': 2.6533761024475098, 'learning_rate': 5e-06, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -139.6123504638672, 'logps/chosen': -40.92887878417969, 'logits/rejected': -2.5189437866210938, 'logits/chosen': -1.860050916671753, 'epoch': 0.04}
{'loss': 0.6931, 'grad_norm': 2.7892494201660156, 'learning_rate': 1e-05, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -177.3958740234375, 'logps/chosen': -83.04783630371094, 'logits/rejected': -1.9823558330535889, 'logits/chosen': -1.9620705842971802, 'epoch': 0.08}
{'loss': 0.6945, 'grad_norm': 2.251851797103882, 'learning_rate': 1.5e-05, 'rewards/chosen': -0.00027399061946198344, 'rewards/rejected': 0.002441120333969593, 'rewards/accuracies': 0.25, 'rewards/margins': -0.002715111244469881, 'log

2024-08-21 20:46:42.044115: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-08-21 20:46:44.043 | INFO     | __main__:main:217 - Parse args: ScriptArguments(model_type='auto', model_name_or_path='Qwen1.5-1.8B-Chat', tokenizer_name_or_path=None, load_in_8bit=False, load_in_4bit=False, cache_dir='./cache', use_fast_tokenizer=False, torch_dtype='float16', device_map='auto', trust_remote_code=True, dataset_name=None, dataset_config_name

In [31]:
%ls outputs-dpo-qwen1.5

 Volume in drive D is New Volume
 Volume Serial Number is B4E8-CC63

 Directory of d:\llm\whole_process\MedicalGPT\outputs-dpo-qwen1.5

08/21/2024  08:48 PM    <DIR>          .
08/21/2024  08:46 PM    <DIR>          ..
08/21/2024  08:48 PM               672 adapter_config.json
08/21/2024  08:48 PM         6,304,096 adapter_model.safetensors
08/21/2024  08:48 PM                85 added_tokens.json
08/21/2024  08:48 PM               749 all_results.json
08/21/2024  08:48 PM    <DIR>          checkpoint-100
08/21/2024  08:47 PM    <DIR>          checkpoint-50
08/21/2024  08:48 PM               572 eval_results.json
08/21/2024  08:48 PM         1,821,636 merges.txt
08/21/2024  08:48 PM             5,091 README.md
08/21/2024  08:46 PM    <DIR>          runs
08/21/2024  08:48 PM               417 special_tokens_map.json
08/21/2024  08:48 PM             1,350 tokenizer_config.json
08/21/2024  08:48 PM               200 train_results.json
08/21/2


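Training results:
- When the model is trained with LoRA, the saved LoRA weights are `adapter_model.safetensors` (older `peft` releases wrote `adapter_model.bin`) and the LoRA configuration file is `adapter_config.json`. See `merge_peft_adapter.py` for how to merge them into the base model
- Logs are saved under `output_dir/runs` and can be viewed with TensorBoard, started like this: `tensorboard --logdir output_dir/runs --host 0.0.0.0 --port 8009`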
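
Besides TensorBoard, the metric lines that the training cell prints can be read programmatically, because each one is a Python dict literal. The sketch below parses such a line with the standard library; `parse_log_line` is a hypothetical helper, and lines that are truncated or contain non-literal values such as `nan` are skipped.

```python
import ast


def parse_log_line(line):
    """Parse one Trainer log line (a Python dict literal); None if not parseable."""
    try:
        record = ast.literal_eval(line.strip())
    except (ValueError, SyntaxError):
        return None  # truncated lines, or values such as `nan`
    return record if isinstance(record, dict) else None


# Shortened copy of a DPO log line from this notebook.
line = "{'loss': 0.6931, 'learning_rate': 5e-06, 'rewards/margins': 0.0, 'epoch': 0.04}"
record = parse_log_line(line)
print(record["loss"], record["rewards/margins"])
```

For DPO, `rewards/margins` and `rewards/accuracies` are the quantities worth tracking: a growing margin means the policy increasingly prefers the chosen responses over the rejected ones.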

To merge the LoRA weights into the base model, run the command below; the merged model is saved in the `--output_dir` directory:

In [32]:
!python merge_peft_adapter.py --model_type auto \
    --base_model merged-sft --lora_model outputs-dpo-qwen1.5 --output_dir merged-dpo-qwen1.5/

Namespace(model_type='auto', base_model='merged-sft', tokenizer_path=None, lora_model='outputs-dpo-qwen1.5', resize_emb=False, output_dir='merged-dpo-qwen1.5/', hf_hub_model_id='', hf_hub_token=None)
Base model: merged-sft
LoRA model: outputs-dpo-qwen1.5
Loading LoRA for causal language model
Merging with merge_and_unload...
Saving to Hugging Face format...
Done! model saved to merged-dpo-qwen1.5/


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [34]:
%ls merged-dpo-qwen1.5

 Volume in drive D is New Volume
 Volume Serial Number is B4E8-CC63

 Directory of d:\llm\whole_process\MedicalGPT\merged-dpo-qwen1.5

08/21/2024  08:50 PM    <DIR>          .
08/21/2024  08:50 PM    <DIR>          ..
08/21/2024  08:50 PM                85 added_tokens.json
08/21/2024  08:50 PM               722 config.json
08/21/2024  08:50 PM               217 generation_config.json
08/21/2024  08:50 PM         1,671,853 merges.txt
08/21/2024  08:50 PM     3,673,690,400 model.safetensors
08/21/2024  08:50 PM               387 special_tokens_map.json
08/21/2024  08:50 PM         7,028,015 tokenizer.json
08/21/2024  08:50 PM             1,342 tokenizer_config.json
08/21/2024  08:50 PM         2,776,833 vocab.json
               9 File(s)  3,685,169,854 bytes
               2 Dir(s)  429,597,249,536 bytes free


In [7]:
# `%cat` is not available on Windows; reading the file in Python works on any platform
print(open('merged-dpo-qwen1.5/config.json', encoding='utf-8').read())


Stage 3: the first round of preference modeling training is complete.

**This completes the demonstration of a full training pipeline.**

In [37]:
!python inference.py --model_type auto --base_model Qwen1.5-1.8B-Chat


Namespace(model_type='auto', base_model='Qwen1.5-1.8B-Chat', lora_model='', tokenizer_path=None, template_name='vicuna', repetition_penalty=1.0, max_new_tokens=512, data_file=None, interactive=False, single_tune=False, temperature=0.7, output_file='./predictions_result.jsonl', eval_batch_size=4, resize_emb=False, load_in_8bit=False, load_in_4bit=False)
Qwen2TokenizerFast(name_or_path='Qwen1.5-1.8B-Chat', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

  attn_output = torch.nn.functional.scaled_dot_product_attention(

Generating outputs: 100%|██████████| 1/1 [00:08<00:00,  8.27s/it]


# Test

In [36]:
!python inference.py --model_type auto --base_model merged-dpo-qwen1.5
# Or run it in a shell
# python inference.py --model_type auto --base_model merged-dpo-qwen1.5 --interactive

Namespace(model_type='auto', base_model='merged-dpo-qwen1.5', lora_model='', tokenizer_path=None, template_name='vicuna', repetition_penalty=1.0, max_new_tokens=512, data_file=None, interactive=False, single_tune=False, temperature=0.7, output_file='./predictions_result.jsonl', eval_batch_size=4, resize_emb=False, load_in_8bit=False, load_in_4bit=False)
Qwen2TokenizerFast(name_or_path='merged-dpo-qwen1.5', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151645: AddedToken("<|im_end|>", rstrip=False, lstrip=Fal

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

  attn_output = torch.nn.functional.scaled_dot_product_attention(

Generating outputs: 100%|██████████| 1/1 [00:08<00:00,  8.42s/it]


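Input: 介绍下南京 (Tell me about Nanjing)
Response:  南京市位于江苏省西南部，是全国首批历史文化名城、国家中心城市和自由贸易试验区。 (Nanjing is located in the southwest of Jiangsu Province and is one of China's first national historical and cultural cities, a national central city, and a pilot free trade zone.)

The end.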
