Not getting convergence in Bert Large Training. #830

@manchandasahil

Hey,
I have been trying out DeepSpeed for a while. Thank you for all the cool features.
I am trying to train BERT-Large from scratch on 335 million token sequences. After accounting for gradient accumulation and the number of GPUs involved, the effective batch size comes out to 1920.
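For reference, the effective batch size is the product of the per-GPU micro-batch size, the gradient accumulation steps, and the number of GPUs. A minimal sketch of that arithmetic follows. The three input values are hypothetical, chosen only so that the product is 1920. They are not the actual setup, and `effective_batch_size` is an illustrative helper, not a DeepSpeed function.

```python
def effective_batch_size(micro_batch_per_gpu: int,
                         grad_accum_steps: int,
                         num_gpus: int) -> int:
    """Number of samples contributing to one optimizer step."""
    return micro_batch_per_gpu * grad_accum_steps * num_gpus


# Hypothetical values (assumption): 8 samples per GPU, 15 accumulation steps, 16 GPUs.
batch = effective_batch_size(micro_batch_per_gpu=8, grad_accum_steps=15, num_gpus=16)
print(batch)  # 1920
```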

The loss graph looks something like this:

[image: training loss graph]

Note that with small batch sizes training converges well and the loss stays between 0 and 1, but it takes too long, and we have more data coming in.

My config is as follows:

{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "allgather_bucket_size": 100000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 100000000,
    "contiguous_gradients": true,
    "cpu_offload": false
  },
  "optimizer": {
    "type": "Lamb",
    "params": {
      "lr": 0.001,
      "weight_decay": 0.01,
      "bias_correction": false,
      "max_coeff": 0.3,
      "min_coeff": 0.01
    }
  },
  "zero_allow_untested_optimizer": true,
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.0003,
      "warmup_num_steps": 200
    }
  }
}
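One detail that may be worth checking in this config is that the optimizer's `lr` (0.001) and the scheduler's `warmup_max_lr` (0.0003) differ, so it is not obvious from the file alone which peak learning rate is in effect. The sketch below only detects that mismatch. It copies the two relevant sections from the config above, and `peak_lr_mismatch` is a hypothetical helper, not part of DeepSpeed. Which value actually wins at runtime is an assumption to verify against the scheduler documentation.

```python
import json

# Only the optimizer and scheduler sections of the config above.
CONFIG_TEXT = """
{
  "optimizer": {
    "type": "Lamb",
    "params": {"lr": 0.001, "weight_decay": 0.01}
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {"warmup_min_lr": 0, "warmup_max_lr": 0.0003, "warmup_num_steps": 200}
  }
}
"""


def peak_lr_mismatch(config: dict):
    """Return (optimizer_lr, warmup_max_lr) if the two differ, else None."""
    optimizer_lr = config["optimizer"]["params"]["lr"]
    warmup_max_lr = config["scheduler"]["params"]["warmup_max_lr"]
    if optimizer_lr == warmup_max_lr:
        return None
    return (optimizer_lr, warmup_max_lr)


config = json.loads(CONFIG_TEXT)
mismatch = peak_lr_mismatch(config)
if mismatch is not None:
    print(f"optimizer lr {mismatch[0]} != scheduler warmup_max_lr {mismatch[1]}")
```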
