Hi,
I've been trying DeepSpeed recently, and CPU offload doesn't seem to work properly. My experiments train a 6-layer GPT-2 on a 1 GB corpus. With CPU offload enabled, the training loss gets stuck at around 4.0. I tried changing the lr, batch size, fp16 settings, sequence length, activation_checkpointing, and torch Adam, but nothing helped. However, after switching off CPU offload, everything went well: both the training loss and the eval loss dropped below 2.0.
Can anyone shed some light on this? Are there any args or configurations that should be updated together with CPU offload?
-------- running configurations with CPU offload enabled -------------
{
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 32,
  "steps_per_print": 1,
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "reduce_bucket_size": 50000000
  },
  "zero_allow_untested_optimizer": false,
  "gradient_clipping": 1.0,
  "tensorboard": {
    "enabled": true,
    "output_path": "xxxx",
    "job_name": "xxxxx"
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 4096,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": true
  },
  "wall_clock_breakdown": true
}
----- args ----
--num-layers 6
--hidden-size 512
--num-attention-heads 4
--seq-length 128
--max-position-embeddings 1024
--train-iters 20000
--resume-dataloader
--train-data webtext
--lazy-loader
--tokenizer-type GPT2BPETokenizer
--split 949,50,1
--distributed-backend nccl
--lr 0.00005
--no-load-optim
--lr-decay-style cosine
--weight-decay 1e-2
--clip-grad 1.0
--warmup .01
--checkpoint-activations
--deepspeed-activation-checkpointing
--fp16
--log-interval 50
--vocab-size 50257
--eval-interval 100
--cpu-optimizer
--cpu_torch_adam
------ training scripts ------
I use the Megatron-LM training scripts from DeepSpeedExamples.
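
To make the comparison easy to reproduce, here is a minimal sketch of how the two configs can be kept identical except for the offload flag. The helper names `with_cpu_offload` and `diff_keys` are hypothetical (they are not part of DeepSpeed), and `base` is only an abbreviated copy of the config above.

```python
import copy

def with_cpu_offload(config, enabled):
    """Return a deep copy of a config dict with zero_optimization.cpu_offload set."""
    new_config = copy.deepcopy(config)
    new_config.setdefault("zero_optimization", {})["cpu_offload"] = enabled
    return new_config

def diff_keys(a, b, prefix=""):
    """List dotted key paths whose values differ between two nested dicts."""
    diffs = []
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        va, vb = a.get(key), b.get(key)
        if isinstance(va, dict) and isinstance(vb, dict):
            diffs.extend(diff_keys(va, vb, path + "."))
        elif va != vb:
            diffs.append(path)
    return diffs

# Abbreviated copy of the config posted above (illustration only).
base = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 32,
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": True,
        "reduce_bucket_size": 50000000,
    },
    "gradient_clipping": 1.0,
    "fp16": {"enabled": True, "loss_scale": 4096},
}

no_offload = with_cpu_offload(base, False)

# Confirm the two runs differ only in the offload flag.
print(diff_keys(base, no_offload))  # ['zero_optimization.cpu_offload']
```

This only covers the JSON config. The command-line args (for example --cpu-optimizer and --cpu_torch_adam) are passed separately and are not touched by this helper.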