Description
Describe the bug
When I tried to use DeepSpeed to accelerate my T5-XL pretraining with Hugging Face models, but without Hugging Face's Trainer, the following error occurred:
[2022-03-16 09:38:47,187] [INFO] [stage3.py:2553:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 4294967296
Traceback (most recent call last):
File "main.py", line 140, in
train(args, model, train_dataset, ds_config)
File "main.py", line 64, in train
model_engine.step()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1855, in step
self._take_model_step(lr_kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1767, in _take_model_step
if self.quantizer:
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DeepSpeedEngine' object has no attribute 'quantizer'
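The AttributeError itself comes from `torch.nn.Module.__getattr__`, which raises whenever an attribute was never assigned on the module (rather than returning `None`). A minimal pure-Python sketch mimicking that behavior, with hypothetical class names, reproduces the same failure mode as the `if self.quantizer:` check in `_take_model_step`:

```python
class ModuleLike:
    """Mimics torch.nn.Module.__getattr__: raises AttributeError for
    attributes that were never registered on the instance."""
    def __getattr__(self, name):
        raise AttributeError(
            "'{}' object has no attribute '{}'".format(type(self).__name__, name))

class Engine(ModuleLike):
    """Stand-in for an engine that only sets `quantizer` on some init paths."""
    def __init__(self, enable_quantizer=False):
        if enable_quantizer:
            self.quantizer = object()

engine = Engine()  # `quantizer` was never assigned
try:
    if engine.quantizer:  # same access pattern as in _take_model_step
        pass
except AttributeError as e:
    print(e)  # 'Engine' object has no attribute 'quantizer'
```

So any engine init path that skips assigning `quantizer` will crash on the first `model_engine.step()` that reads it.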
To Reproduce
The full code is too long, so I will only show the key segments; if you want to see the whole code, I can paste it.
First, in the main function:
if args.local_rank == -1:
    device = torch.device("cuda")
else:
    torch.cuda.set_device(args.local_rank)
    device = torch.device("cuda", args.local_rank)
    # torch.distributed.init_process_group(backend="nccl")
    deepspeed.init_distributed()
args.device = device
args.n_gpu = len(args.cuda.split(","))
set_seed(args)

# Model and dataset
with open('./ds_config.json') as f:
    ds_config = json.load(f)
dschf = HfDeepSpeedConfig(ds_config)

print('init model')
with deepspeed.zero.Init():
    config = AutoConfig.from_pretrained("./")
    model = T5ForConditionalGeneration(config)

print('init dataset')
train_dataset = FinT5_Dataset(args)

# Barrier to make sure all processes start training the model simultaneously.
if args.local_rank != -1:
    torch.distributed.barrier()

train(args, model, train_dataset, ds_config)
Secondly, ds_config =
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": model_hidden_size * model_hidden_size,
    "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
    "stage3_param_persistence_threshold": 10 * model_hidden_size
  },
  "optimizer": {"type": "AdamW", "params": {"lr": 0.0001, "betas": [0.8, 0.999], "eps": 1e-08, "weight_decay": 3e-07}},
  "scheduler": {"type": "WarmupLR", "params": {"warmup_min_lr": 0, "warmup_max_lr": 0.0001, "warmup_num_steps": 1000}},
  "steps_per_print": 200,
  "train_batch_size": 256,
  "train_micro_batch_size_per_gpu": 32,
  "gradient_accumulation_steps": 1,
  "wall_clock_breakdown": false
}
(model_hidden_size for T5-XL is 2048; these settings follow huggingface/transformers#15399 (comment); actually I don't know what they mean...)
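For context, the `model_hidden_size` placeholders above are filled in programmatically before the config is passed to DeepSpeed. A small sketch (assuming hidden size 2048 for T5-XL, as noted above) showing the arithmetic behind the ZeRO-3 bucket values:

```python
import json

model_hidden_size = 2048  # T5-XL hidden size, per the note above

zero_opt = {
    "stage": 3,
    "overlap_comm": True,
    "contiguous_gradients": True,
    # number of gradient elements reduced in one bucket: hidden^2
    "reduce_bucket_size": model_hidden_size * model_hidden_size,
    # parameter elements prefetched ahead of use: 0.9 * hidden^2
    "stage3_prefetch_bucket_size": int(0.9 * model_hidden_size * model_hidden_size),
    # parameters smaller than this many elements are kept resident on the GPU
    "stage3_param_persistence_threshold": 10 * model_hidden_size,
}
print(json.dumps(zero_opt, indent=2))
```

With hidden size 2048 this gives a reduce bucket of 4,194,304 elements, a prefetch bucket of 3,774,873 elements, and a persistence threshold of 20,480 elements.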
Thirdly, in the train function:
def train(args, model, train_dataset, ds_config):
    train_sampler = data.distributed.DistributedSampler(train_dataset) if args.local_rank != -1 else data.RandomSampler(train_dataset)
    params = {"batch_size": args.batch_size_per_gpu, "sampler": train_sampler}
    train_dataloader = data.DataLoader(train_dataset, **params)

    # DeepSpeed training
    model_engine, optimizer, _, _ = deepspeed.initialize(model=model, config_params=ds_config)

    print("Begin train...")
    global_step = 0
    start_time = time.time()
    for i in range(args.max_epoch):
        if args.local_rank != -1:
            train_sampler.set_epoch(i)
        for step, batch in enumerate(train_dataloader):
            global_step += 1
            # forward pass
            loss = model_engine(
                input_ids=batch[0].to('cuda'),
                attention_mask=batch[1].to('cuda'),
                labels=batch[2].to('cuda')).loss
            # print(loss)
            # backpropagation
            model_engine.backward(loss)
            # weight update
            model_engine.step()  # error occurs here
            if global_step % args.save_interval == 0:
                model_engine.save_checkpoint(args.save_dir, global_step)
            if global_step == args.max_step:
                return
ds_report output
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.11.0a0+b6df043
torch cuda version ............... 11.5
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.6.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5, hip 0.0
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: one machine with 8x A100s
- Python version: 3.8
Launcher context
deepspeed --num_gpus=8 main.py
Docker context
nvcr.io/nvidia/pytorch:21.12-py3 (pip install deepspeed)