CPU Offload is harmful to training convergence? #493

@robinn37

Hi,

I've been trying DeepSpeed recently, and CPU offload doesn't seem to work properly. My experiments train a 6-layer GPT-2 on a 1 GB corpus. With CPU offload enabled, the training loss gets stuck at around 4.0. I tried changing the lr, batch size, fp16 settings, sequence length, activation_checkpointing, and switching to torch adam, but nothing helped. After switching CPU offload off, however, everything went well: both the training loss and the eval loss dropped below 2.0.

Can anyone shed some light on this? Are there any args or configuration options that should be changed together with CPU offload?

-------- running configurations with CPU offload enabled -------------
{
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 32,
  "steps_per_print": 1,
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "reduce_bucket_size": 50000000
  },
  "zero_allow_untested_optimizer": false,
  "gradient_clipping": 1.0,
  "tensorboard": {
    "enabled": true,
    "output_path": "xxxx",
    "job_name": "xxxxx"
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 4096,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": true
  },
  "wall_clock_breakdown": true
}
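
To make the comparison between the two runs reproducible, here is a minimal sketch of how the "offload on" and "offload off" configs could be derived from one base so that they differ in exactly one key. It uses only the Python standard library and does not call DeepSpeed. The helper names `with_cpu_offload` and `diff_keys` are hypothetical, and only a subset of the config above is copied in for brevity. The launch args (for example `--cpu-optimizer`) are not touched by this sketch and may need to be adjusted separately.

```python
import copy
import json

# Subset of the DeepSpeed config from this report (values copied as given).
BASE_CONFIG = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 32,
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": True,
        "reduce_bucket_size": 50000000,
    },
    "gradient_clipping": 1.0,
    "fp16": {"enabled": True, "loss_scale": 4096},
}

_MISSING = object()  # sentinel for keys present in only one of the two dicts


def with_cpu_offload(config, enabled):
    """Return a deep copy of config with zero_optimization.cpu_offload set."""
    variant = copy.deepcopy(config)
    variant["zero_optimization"]["cpu_offload"] = enabled
    return variant


def diff_keys(a, b, prefix=""):
    """List dotted key paths whose values differ between two nested dicts."""
    paths = []
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        left, right = a.get(key, _MISSING), b.get(key, _MISSING)
        if isinstance(left, dict) and isinstance(right, dict):
            paths.extend(diff_keys(left, right, prefix=path + "."))
        elif left != right:
            paths.append(path)
    return paths


offload_on = with_cpu_offload(BASE_CONFIG, True)
offload_off = with_cpu_offload(BASE_CONFIG, False)

# Confirm the two runs differ only in the setting under test.
print(diff_keys(offload_on, offload_off))  # ['zero_optimization.cpu_offload']

# Each variant can be serialized with json.dumps and saved as its own config file.
offload_off_json = json.dumps(offload_off, indent=2)
```

Running both variants with the same seed and data order would show whether the loss plateau follows the `cpu_offload` flag alone or depends on another setting that changed at the same time.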

----- args ----
--num-layers 6
--hidden-size 512
--num-attention-heads 4
--seq-length 128
--max-position-embeddings 1024
--train-iters 20000
--resume-dataloader
--train-data webtext
--lazy-loader
--tokenizer-type GPT2BPETokenizer
--split 949,50,1
--distributed-backend nccl
--lr 0.00005
--no-load-optim
--lr-decay-style cosine
--weight-decay 1e-2
--clip-grad 1.0
--warmup .01
--checkpoint-activations
--deepspeed-activation-checkpointing
--fp16
--log-interval 50
--vocab-size 50257
--eval-interval 100
--cpu-optimizer
--cpu_torch_adam

------ training scripts ------
I used the Megatron-LM training scripts from DeepSpeedExamples.
