MLM Pretraining of a BERT Model

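Project Description

Trained on a Taiwanese Traditional Chinese dataset: https://huggingface.co/datasets/botp/yentinglin-zh_TW_c4
Multi-GPU parallel training.
Pretrains a BERT model with the masked language modeling (MLM) objective.
Continued-pretraining model: https://huggingface.co/Azion/bert-based-chinese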

Model Performance

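5 million Traditional Chinese samples; batch size 16 per GPU * 4 RTX 3090 GPUs = effective batch size 64; 10 epochs.
About 800,000 iterations in total, taking 4 days 11 hours 37 minutes 20 seconds.
Comparison of Chinese BERT models on the MLM task: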

| Dataset \ Pretrained BERT model | bert-based-chinese | ckiplab | GufoLab |
| --- | --- | --- | --- |
| 5000 Traditional Chinese Dataset | 0.7183 | 0.6989 | 0.8081 |
| 10000 Sol-Idea Dataset | 0.7874 | 0.7913 | 0.8025 |
| All datasets | 0.7694 | 0.7678 | 0.8038 |

Datasets

datasets/train.txt: training text; it can be generated with write_data.ipynb.
datasets/dev.txt: validation data, evaluated at the end of every training epoch.
datasets/test.txt: test data.

Environment

Ubuntu 20.04

CUDA Version: 11.7
GeForce RTX 3090 * 4
python version: 3.10.11

torch==1.13.1
transformers==4.29.2

NVIDIA/apex is required. It has to be cloned from Git and installed from source:

$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ python setup.py install

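Debugging the Original Example Code

The original open-source code could only train on a single GPU, and it stalled while initializing the distributed processes.

Change to the launch command: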

python -m torch.distributed.launch --nproc_per_node=4 --master_port='29301' --use_env main.py

Change to the GPU process-group initialization, which fixes the stall.
NCCL (NVIDIA Collective Communications Library) is a library provided by NVIDIA, designed for efficient collective communication on GPU clusters.

torch.distributed.init_process_group(backend="nccl")

When running multiple processes, pass each process's local_rank to the model wrapper. This fixes the problem of every process loading the model onto GPU 0.

model = torch.nn.parallel.DistributedDataParallel(model,
                                                  device_ids=[local_rank],
                                                  output_device=local_rank)
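A minimal sketch of how these changes could fit together, assuming the script is launched with --use_env so that each process receives LOCAL_RANK, RANK, and WORLD_SIZE as environment variables. The helper names read_dist_env and wrap_model_for_ddp are hypothetical and are not taken from the repository.

```python
import os


def read_dist_env(env=None):
    """Read the per-process variables exported by torch.distributed.launch --use_env."""
    env = os.environ if env is None else env
    return {
        "local_rank": int(env.get("LOCAL_RANK", 0)),   # GPU index on this node
        "rank": int(env.get("RANK", 0)),               # global process index
        "world_size": int(env.get("WORLD_SIZE", 1)),   # total number of processes
    }


def wrap_model_for_ddp(model, env=None):
    """Pin this process to its own GPU, join the NCCL group, and wrap the model."""
    import torch  # imported lazily so read_dist_env works without torch installed

    local_rank = read_dist_env(env)["local_rank"]
    torch.cuda.set_device(local_rank)                     # avoid everything landing on GPU 0
    torch.distributed.init_process_group(backend="nccl")  # reads MASTER_ADDR/PORT from env
    model = model.to(torch.device("cuda", local_rank))
    return torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank
    )
```

Calling torch.cuda.set_device before creating the process group is a common precaution so that each process only touches its own GPU.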

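Getting Started

Single-GPU mode (testing)

Edit self.path_model_predict in Config.py.

Choose the model saved after the n-th training epoch, then run the test command.

For example, to use the model parameters trained through epoch 9, set self.path_model_predict = os.path.join(self.path_model_save, 'epoch_9')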

python main.py test

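Multi-GPU mode (training)

If you are lucky enough to have several GPUs, congratulations: you can switch to takeoff mode. 🚀🚀

Edit self.num_epochs, self.batch_size, and self.sen_max_length in Config.py, then run the training command.

To train for 10 epochs, set self.num_epochs = 10
To set the BERT sequence length (<=512), for example the maximum, set self.sen_max_length = 512
To set the per-GPU batch size (as large as GPU memory allows), for example 16, set self.batch_size = 16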
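As an illustration, these settings might sit together in a configuration class like the one sketched here. The attribute names come from this README, but the class body, the validate helper, and the "checkpoint" directory value are assumptions, not the repository's actual Config.py.

```python
import os


class Config:
    """Illustrative subset of the settings named in this README."""

    def __init__(self):
        self.num_epochs = 10        # number of training epochs
        self.sen_max_length = 512   # BERT sequence length, at most 512
        self.batch_size = 16        # per-GPU batch size
        # Assumption: the save directory name is hypothetical.
        self.path_model_save = "checkpoint"
        # Model used by test mode: the checkpoint saved after epoch 9.
        self.path_model_predict = os.path.join(self.path_model_save, "epoch_9")

    def validate(self):
        """Reject values that BERT or the training loop cannot use."""
        if not 1 <= self.sen_max_length <= 512:
            raise ValueError("sen_max_length must be between 1 and 512")
        if self.batch_size < 1 or self.num_epochs < 1:
            raise ValueError("batch_size and num_epochs must be positive")
        return self
```

With 4 GPUs, a per-GPU batch size of 16 gives the effective batch size of 64 reported above.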

python -m torch.distributed.launch --nproc_per_node=4 --master_port='29301' --use_env main.py train

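Multi-GPU training uses torch's nn.parallel.DistributedDataParallel module.
master_port: the port of the master node. master_addr and master_port are set to the same values on every node and are used for communication between processes; I use '29301'.
nproc_per_node: the number of processes to launch on one node, normally one per GPU. I have 4 GPUs, so I set it to 4.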

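Reading Very Large Datasets (training) [changed 2023/08/17]

If the dataset is too large to fit in CPU RAM, first split it into several files and write them to ./datasets/train_shard

Set self.huge_data_file_data_length in Config.py to the number of samples in each file. I split the data into files of 160000 samples each, so I set it to 160000.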
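One possible way to produce such shards is sketched here, assuming the training text holds one sample per line. The function name split_into_shards and the shard_00000.txt naming scheme are hypothetical; the repository may split its data differently (for example in write_data.ipynb).

```python
import os


def split_into_shards(src_path, out_dir, lines_per_shard=160000):
    """Split a one-sample-per-line text file into shard files; return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = []
    out = None
    with open(src_path, encoding="utf-8") as src:
        # Stream line by line so the whole file never has to fit in RAM.
        for i, line in enumerate(src):
            if i % lines_per_shard == 0:
                if out is not None:
                    out.close()
                # Hypothetical naming scheme: shard_00000.txt, shard_00001.txt, ...
                path = os.path.join(out_dir, f"shard_{len(shard_paths):05d}.txt")
                shard_paths.append(path)
                out = open(path, "w", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return shard_paths
```

Note that the last shard may hold fewer samples than the others, in which case a step count derived from a fixed per-file size is a slight overestimate.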

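My dataset has well over a million samples: https://huggingface.co/datasets/botp/yentinglin-zh_TW_c4 contains 5 million samples and 5 billion tokens in total.
Running with torch.distributed.launch has one advantage and one disadvantage.

Advantage: multiple processes can build DistributedSampler(tokenized_datasets) quickly.
Disadvantage: the same file is read several times before being arranged into DistributedSampler().

With 4 GPUs (world_size=4), the same data file is read four times at once. Even my 128 GB of CPU RAM was not enough.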

python -m torch.distributed.launch --nproc_per_node=4 --master_port='29301' --use_env main.py huge_train

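A new method, huge_data_train(), was added to Trainer.py. During training it reads the next file and converts it into a new DataLoader, whereas previously all of the training data was converted into a single DataLoader up front.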

# Trainer.py
def huge_data_train(self, local_rank, world_size):
    ...
    for epoch in range(self.config.num_epochs):
        # Load one shard at a time instead of the whole training set
        for shard in file_list:
            file_name = '/train_shard/' + shard
            # Build a fresh DataLoader for the current shard
            train_loader = dm.data_process(file_name, self.tokenizer)
    ...

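Repeatedly reading a new file during training and creating a new DataLoader each time raises one issue.
The number of training steps passed to the optimizer's learning-rate scheduler has to be computed explicitly, because no single DataLoader covers the whole dataset.
That is why self.huge_data_file_data_length in Config.py must hold the number of samples in each file.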

num_training_steps = int(self.config.num_epochs * self.config.huge_data_file_data_length * len(file_list) / (self.config.batch_size*world_size))
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=self.config.num_warmup_steps,
    num_training_steps=num_training_steps
)
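The step arithmetic can be checked with a short standalone sketch. The function names are hypothetical, and the shard count of 32 is an assumption chosen for illustration (32 files of 160000 samples is slightly more than 5 million samples); steps_per_shard additionally assumes the default drop_last=False behavior of DistributedSampler and DataLoader.

```python
import math


def approx_training_steps(num_epochs, samples_per_file, num_files, batch_size, world_size):
    """Same arithmetic as the num_training_steps expression in Trainer.py."""
    return int(num_epochs * samples_per_file * num_files / (batch_size * world_size))


def steps_per_shard(num_samples, batch_size, world_size):
    """Optimizer steps for one shard when partial batches are kept (drop_last=False)."""
    per_rank = math.ceil(num_samples / world_size)  # samples each GPU sees
    return math.ceil(per_rank / batch_size)         # batches each GPU runs


# Assumed setup: 10 epochs, 32 shards of 160000 samples, batch size 16 on 4 GPUs.
total = approx_training_steps(10, 160000, 32, 16, 4)
print(total)  # 800000
```

Under these assumptions each shard contributes 2500 steps, and the total lands on the roughly 800,000 iterations reported in the performance section. If shards differ in size, summing steps_per_shard over the actual shard sizes gives a closer estimate than the fixed-size formula.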

Added Resumable Training [2023/08/24]

The optimizer state is now saved, so it can be loaded again when training starts. Set the starting point in Config.py: with self.train_start = epoch_9, training resumes from epoch_9.
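A minimal sketch of what saving and restoring the optimizer state could look like with torch.save and torch.load. The file names model.pt and optimizer.pt, the directory layout, and all three function names are assumptions made for illustration; the repository's own checkpoint format may differ.

```python
import os


def checkpoint_files(save_dir, tag):
    """Return the (hypothetical) file paths for one checkpoint, e.g. tag='epoch_9'."""
    ckpt_dir = os.path.join(save_dir, tag)
    return {
        "model": os.path.join(ckpt_dir, "model.pt"),
        "optimizer": os.path.join(ckpt_dir, "optimizer.pt"),
    }


def save_checkpoint(model, optimizer, save_dir, tag):
    """Write model weights and optimizer state so training can resume later."""
    import torch  # imported lazily so checkpoint_files works without torch installed

    files = checkpoint_files(save_dir, tag)
    os.makedirs(os.path.dirname(files["model"]), exist_ok=True)
    torch.save(model.state_dict(), files["model"])
    torch.save(optimizer.state_dict(), files["optimizer"])


def load_checkpoint(model, optimizer, save_dir, tag, device="cpu"):
    """Restore model weights and optimizer state in place."""
    import torch

    files = checkpoint_files(save_dir, tag)
    model.load_state_dict(torch.load(files["model"], map_location=device))
    optimizer.load_state_dict(torch.load(files["optimizer"], map_location=device))
```

Restoring the optimizer matters because optimizers such as AdamW keep per-parameter moment estimates; starting from fresh state would change the update sizes right after resuming. When the model is wrapped in DistributedDataParallel, its state_dict keys carry a "module." prefix, so save and load should use the same wrapping.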

Training

Training uses cross-entropy as the loss function, with perplexity and loss as the evaluation metrics.
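The relationship between the two metrics is that perplexity is the exponential of the mean cross-entropy loss (with natural logarithms). A small standalone sketch, using hypothetical helper names and made-up probabilities rather than real model output:

```python
import math


def cross_entropy(target_probs):
    """Mean negative log-likelihood of the probabilities given to the correct tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def perplexity(mean_loss):
    """Perplexity is exp(mean cross-entropy)."""
    return math.exp(mean_loss)


# Made-up example: the model gives each correct masked token a probability of 0.25.
loss = cross_entropy([0.25, 0.25, 0.25, 0.25])
ppl = perplexity(loss)
```

A model that assigns probability 0.25 to every correct token has a perplexity of about 4, which reads as being as uncertain as a uniform choice among four tokens.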

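Test

Results are saved to datasets/output/pred_data.csv, which contains three columns:

src: the original input
pred: the model prediction
mask: the model input (including tokens such as mask and pad)

Example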

src:  [CLS] art education and first professional work [SEP]
pred: [CLS] art education and first class work [SEP]
mask: [CLS] art education and first [MASK] work [SEP] [PAD] [PAD] [PAD] ...

Reference

【Bert】https://arxiv.org/pdf/1810.04805.pdf

【transformers】https://github.com/huggingface/transformers

【datasets】https://huggingface.co/datasets/botp/yentinglin-zh_TW_c4


fredericklee602/BERT-Pretrain-Multi-GPU
