CPU Offload is harmful to training convergence?

Hi,

I've tried Deepspeed these days and it seems CPU offload doesn't work properly. My experiments are based on 1GB corpus and training GPT-2 with 6 layers. When CPU offload is enabled, the training loss stuck at around 4.0. I tried to update the lr, batch size, fp16, sequene len,  activation_checkpointing and torch adam, but nothing helped. However, after switched off CPU offload, everything went well. The training loss and eval loss can reached below 2.0.

Anyone can shed some lights on this? Any args or configurations should be updated together with CPU offload?

-------- running configurations with CPU offload enabled -------------
{
  "train_batch_size":32,
  "train_micro_batch_size_per_gpu": 32,
  "steps_per_print": 1,
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "reduce_bucket_size": 50000000
  },
  "zero_allow_untested_optimizer": false,
  "gradient_clipping": 1.0,
  "tensorboard": {
    "enabled": true,
    "output_path": "xxxx",
    "job_name": "xxxxx"
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 4096,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": true
  },
  "wall_clock_breakdown": true
}

----- args ----
       --num-layers 6 \
       --hidden-size 512 \
       --num-attention-heads 4 \
       --seq-length 128 \
       --max-position-embeddings 1024 \
       --train-iters 20000 \
       --resume-dataloader \
       --train-data webtext \
       --lazy-loader \
       --tokenizer-type GPT2BPETokenizer \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.00005 \
       --no-load-optim \
       --lr-decay-style cosine \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --warmup .01 \
       --checkpoint-activations \
       --deepspeed-activation-checkpointing \
       --fp16 \
       --log-interval 50 \
       --vocab-size 50257 \
       --eval-interval 100 \
       --cpu-optimizer \
       --cpu_torch_adam

------ training scripts ------
Use the training scripts in Megatron-LM from DeepSpeedExamples.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CPU Offload is harmful to training convergence? #493

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

CPU Offload is harmful to training convergence? #493

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions