
[BUG]AttributeError: 'DeepSpeedEngine' object has no attribute 'quantizer' #1837

@AQA6666

Description


Describe the bug
When I tried to use DeepSpeed to accelerate my T5-XL pretraining with Hugging Face models, but without Hugging Face's Trainer, the following error occurred:
[2022-03-16 09:38:47,187] [INFO] [stage3.py:2553:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 4294967296

Traceback (most recent call last):
File "main.py", line 140, in
train(args, model, train_dataset, ds_config)
File "main.py", line 64, in train
model_engine.step()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1855, in step
self._take_model_step(lr_kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1767, in _take_model_step
if self.quantizer:
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DeepSpeedEngine' object has no attribute 'quantizer'
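The failure mode itself can be reproduced without DeepSpeed: torch.nn.Module overrides __getattr__, so reading an attribute that was never assigned raises AttributeError instead of returning None, which is why the bare `if self.quantizer:` check in engine.py blows up. A minimal, dependency-free sketch (the Engine class and its method names here are illustrative, not DeepSpeed's actual code):

```python
class ModuleLike:
    # Mimics torch.nn.Module.__getattr__: any attribute that was never
    # assigned raises AttributeError rather than returning None.
    def __getattr__(self, name):
        raise AttributeError(
            "'{}' object has no attribute '{}'".format(type(self).__name__, name))


class Engine(ModuleLike):
    # Note: self.quantizer is never set on this code path.

    def step_unsafe(self):
        if self.quantizer:  # mirrors the failing check in engine.py
            pass

    def step_safe(self):
        if getattr(self, "quantizer", None):  # tolerant lookup with a default
            pass


engine = Engine()
try:
    engine.step_unsafe()
except AttributeError as e:
    print(e)        # 'Engine' object has no attribute 'quantizer'
engine.step_safe()  # completes without raising
```

The `getattr(..., None)` form is safe because the builtin catches the AttributeError raised by `__getattr__` and returns the default instead.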

To Reproduce
The full code is too long, so I will just show the key segments; if you want to see the whole code, I can paste it.

First, in the main function:

if args.local_rank == -1:
    device = torch.device("cuda")
else:
    torch.cuda.set_device(args.local_rank)
    device = torch.device("cuda", args.local_rank)
    # torch.distributed.init_process_group(backend="nccl")
    deepspeed.init_distributed()
args.device = device
args.n_gpu = len(args.cuda.split(","))
set_seed(args)

# Model and dataset
with open('./ds_config.json') as f:
    ds_config = json.load(f)
dschf = HfDeepSpeedConfig(ds_config)
print('init model')
with deepspeed.zero.Init():
    config = AutoConfig.from_pretrained("./")
    model = T5ForConditionalGeneration(config)
print('init dataset')
train_dataset = FinT5_Dataset(args)
# Barrier to make sure all processes train the model simultaneously.
if args.local_rank != -1:
    torch.distributed.barrier()
train(args, model, train_dataset, ds_config)

Second, the ds_config is:

{
  "fp16": {"enabled": true, "loss_scale": 0, "initial_scale_power": 32, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1},
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": model_hidden_size * model_hidden_size,
    "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
    "stage3_param_persistence_threshold": 10 * model_hidden_size
  },
  "optimizer": {"type": "AdamW", "params": {"lr": 0.0001, "betas": [0.8, 0.999], "eps": 1e-08, "weight_decay": 3e-07}},
  "scheduler": {"type": "WarmupLR", "params": {"warmup_min_lr": 0, "warmup_max_lr": 0.0001, "warmup_num_steps": 1000}},
  "steps_per_print": 200,
  "train_batch_size": 256,
  "train_micro_batch_size_per_gpu": 32,
  "gradient_accumulation_steps": 1,
  "wall_clock_breakdown": false
}

(model_hidden_size for T5-XL is 2048; these settings follow huggingface/transformers#15399 (comment). Honestly, I don't know exactly what they mean.)
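For concreteness, here is what those config expressions work out to with model_hidden_size = 2048 (T5-XL), plus two consistency checks against the log and the launcher. This is quick arithmetic, not DeepSpeed code:

```python
model_hidden_size = 2048  # T5-XL, per the note above

# ZeRO-3 bucket sizes from the config expressions
reduce_bucket_size = model_hidden_size * model_hidden_size       # 4194304
stage3_prefetch_bucket_size = int(0.9 * model_hidden_size ** 2)  # 3774873
stage3_param_persistence_threshold = 10 * model_hidden_size      # 20480

# "initial_scale_power": 32 means the first fp16 loss scale is 2**32,
# matching "Attempted loss scale: 4294967296" in the OVERFLOW log line.
initial_loss_scale = 2 ** 32                                     # 4294967296

# DeepSpeed requires train_batch_size ==
# micro_batch_per_gpu * gradient_accumulation_steps * world_size,
# which holds here with --num_gpus=8:
assert 256 == 32 * 1 * 8
```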

Third, in the train function:

def train(args, model, train_dataset, ds_config):

    train_sampler = data.distributed.DistributedSampler(train_dataset) if args.local_rank != -1 else data.RandomSampler(train_dataset)
    params = {"batch_size": args.batch_size_per_gpu, "sampler": train_sampler}
    train_dataloader = data.DataLoader(train_dataset, **params)

    # DeepSpeed training
    model_engine, optimizer, _, _ = deepspeed.initialize(model=model, config_params=ds_config)

    print("Begin train...")

    global_step = 0

    start_time = time.time()

    for i in range(args.max_epoch):
        if args.local_rank != -1:
            train_sampler.set_epoch(i)
        for step, batch in enumerate(train_dataloader):
            global_step += 1
            # forward() method
            loss = model_engine(
                input_ids=batch[0].to('cuda'),
                attention_mask=batch[1].to('cuda'),
                labels=batch[2].to('cuda')).loss
            # print(loss)

            # runs backpropagation
            model_engine.backward(loss)

            # weight update
            model_engine.step()  # error occurs here
            if global_step % args.save_interval == 0:
                model_engine.save_checkpoint(args.save_dir, global_step)
            if global_step == args.max_step:
                return
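The loop's checkpoint/stop bookkeeping can be exercised in isolation (the save_interval and max_step values below are placeholders; the real ones come from args):

```python
def simulate_steps(save_interval, max_step):
    # Mirrors the global_step / save_interval / max_step logic of train():
    # record a checkpoint every save_interval steps, stop at max_step.
    saves, global_step = [], 0
    while True:
        global_step += 1
        # (forward, backward and optimizer step would run here)
        if global_step % save_interval == 0:
            saves.append(global_step)
        if global_step == max_step:
            return saves


print(simulate_steps(1000, 3000))  # [1000, 2000, 3000]
```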

ds_report output

JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.11.0a0+b6df043
torch cuda version ............... 11.5
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.6.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5, hip 0.0

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: one machine with 8x A100s
  • Python version: 3.8

Launcher context
deepspeed --num_gpus=8 main.py
Docker context
nvcr.io/nvidia/pytorch:21.12-py3 (pip install deepspeed)
