This repository was archived by the owner on Oct 23, 2024. It is now read-only.

DDP training gets terminated in the middle of the training because of some SIGKILL received by a PID (forked child process) #31

@rushi-the-neural-arch

Description

Describe the bug

traceback : Signal 9 (SIGKILL) received by PID 10398

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 10401 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 10398) of binary: /anaconda/envs/py37_default/bin/python
Traceback (most recent call last):
  File "/anaconda/envs/py37_default/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda/envs/py37_default/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/..../torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/.../torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/.../torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/.../torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/.../torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jupyte.../torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,

torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
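
For reading the log above: `exitcode: -9` follows the usual subprocess convention in which a negative exit code `-N` means the child was terminated by signal `N`, so the worker was killed by SIGKILL rather than exiting through a Python exception. The output does not say who sent the signal. On Linux a common sender is the kernel's out-of-memory killer, which can be checked in the kernel log (for example with `dmesg`), but that is an assumption here and not something the traceback confirms. A minimal sketch of the decoding, where `describe_exit_code` is a hypothetical helper name:

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Translate a subprocess-style exit code into a readable description."""
    if exitcode < 0:
        # Negative codes mean "terminated by signal N", where N = -exitcode.
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            name = f"unknown signal {-exitcode}"
        return f"terminated by {name}"
    return f"exited with status {exitcode}"


# The value reported for local_rank 0 in the log above.
print(describe_exit_code(-9))  # terminated by SIGKILL
```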

To Reproduce

Just a normal Stoke training script:

stoke_model = Stoke(
    model=model,
    verbose=True,
    optimizer=optimizer,
    loss=loss,
    batch_size_per_device=opt.batchSize,
    gpu=True,
    fp16=None,
    distributed=DistributedOptions.ddp.value,
    fairscale_oss=True,
    fairscale_sddp=True,
    grad_accum_steps=1,
    configs=[amp_config, ddp_config, oss_config],
    grad_clip=ClipGradNormConfig(max_norm=opt.grad_clip, norm_type=2.0),
)

def train(train_dataloader, stoke_model: Stoke, scheduler1, scheduler2, epoch: int):
    
    example_ct = 0  # number of examples seen
    batch_ct = 0
    sum_loss = 0
    
    stoke_model.print_on_devices(f"Starting Epoch {epoch + 1}")
    stoke_model.model_access.train()
    
    for idx, (inputs, targets) in enumerate(train_dataloader):
        
        # Call the model through the stoke object interface
        outputs = stoke_model.model(inputs)
        train_loss = stoke_model.loss(outputs, targets)
        
        stoke_model.print_ema_loss(prepend_msg=f"Step {idx+1} -- EMA Loss")
        
        # Call backward through the stoke object interface
        stoke_model.backward(loss=train_loss)
        
        # Call step through the stoke object interface
        stoke_model.step()
        scheduler1.step()
        scheduler2.step()
        
        sum_loss += train_loss

        example_ct +=  len(inputs)
        batch_ct += 1

        # Report metrics every 50th batch
        if ((batch_ct + 1) % 50) == 0:
            train_log(train_loss, example_ct, epoch)
            #print(train_loss,  example_ct, epoch)

    avg_loss = sum_loss / len(train_dataloader)
    
    return avg_loss
    
for epoch in tqdm(range(epochs), leave=True):
    train_loss = train(train_dataloader, stoke_model, scheduler1, scheduler2, epoch)
    val_loss = validate(val_dataloader, stoke_model, epoch)
    save_checkpoint(stoke_model, epoch, train_loss, val_loss)
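
One thing worth ruling out in the loop above: `sum_loss += train_loss` adds the loss object itself. If `stoke_model.loss` returns a tensor (an assumption, not verified against Stoke), the running sum stays a tensor and can keep references to each step's autograd history instead of a plain number. Whether this contributes to the SIGKILL here is not established. A minimal sketch of accumulating plain Python floats instead, where `LossMeter` and `FakeLoss` are hypothetical names and `FakeLoss` only stands in for a one-element tensor so the example runs without torch:

```python
class LossMeter:
    """Accumulates scalar losses as plain Python floats."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def update(self, loss) -> None:
        # One-element tensors expose .item(), which returns a Python number;
        # plain ints and floats are converted directly.
        value = loss.item() if hasattr(loss, "item") else float(loss)
        self.total += value
        self.count += 1

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0


class FakeLoss:
    """Stand-in for a one-element tensor (hypothetical, for illustration only)."""

    def __init__(self, value: float) -> None:
        self._value = value

    def item(self) -> float:
        return self._value


meter = LossMeter()
for loss in (FakeLoss(0.5), FakeLoss(1.5), 1.0):
    meter.update(loss)
print(meter.average)  # 1.0
```

In the training loop this would replace `sum_loss += train_loss` with `meter.update(train_loss)` and the final division with `meter.average`.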

The actual script is posted here - https://gist.github.com/rushi-the-neural-arch/bee47ba87e5ddabf0cb47def9bc0b013

  1. Ran config as - env CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 Stoke-DDP.py --projectName "Stoke-4K-2X-DDP" --batchSize 18 --nEpochs 2 --lr 1e-3 --weight_decay 1e-4 --grad_clip 0.1

  2. Error produced is - WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 10401 closing signal SIGTERM

Expected behavior

I know this is more of a PyTorch DDP concern than a Stoke issue: many users have hit this problem, and there doesn't seem to be a definitive solution apart from downgrading the torch version. For example, in pytorch/pytorch#67538 (a workaround posted just 2 days ago) the user downgraded torch from 1.10 to 1.8, which solved this particular issue. Since Stoke requires a torch version greater than 1.8.1, I guess this would not be possible for us. Torch 1.10 was only recently rolled out, so they may not have fixed this on their end yet, but do you happen to know of any alternative approach or solution?

To give a bit more context: I first trained a very lightweight sample network and could train it easily, even with larger batch sizes. After a few experiments I switched to a heavier network with more parameters (~4.5M), and that is when this error started occurring. Initially I thought it might be due to more load on the RAM, so I decreased the batch size to 1, removed the gradient accumulation step, and played around with the num_workers parameter, but none of this solved the error. In fact, what I have noticed is that the error occurs exactly after 125 steps, which seems weird because there is no code that performs some operation after 125 steps or after any specific number of steps.

EDIT - I tried the FP16 training and the error still persists but it's after 145 steps now.
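
Since the failure shows up after a roughly fixed number of steps (125, then 145 with FP16), one way to test the RAM hypothesis is to log host memory as training progresses and see whether it climbs steadily until the kill. The sketch below uses only the standard library. `peak_rss_mib`, `log_memory`, and the 25-step interval are hypothetical choices for illustration, not part of Stoke or PyTorch.

```python
import resource
import sys


def peak_rss_mib() -> dict:
    """Return peak resident set size in MiB for this process and its children."""
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    own = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / divisor
    # RUSAGE_CHILDREN only covers children that have already terminated and
    # been waited for, so live DataLoader workers are not included here.
    children = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss / divisor
    return {"self_mib": own, "children_mib": children}


def log_memory(step: int, every: int = 25) -> None:
    """Print peak memory every `every` steps (the interval is arbitrary)."""
    if step % every == 0:
        usage = peak_rss_mib()
        print(
            f"step {step}: peak RSS self={usage['self_mib']:.1f} MiB, "
            f"children={usage['children_mib']:.1f} MiB",
            flush=True,
        )


# Stand-in for the training loop: call log_memory(idx + 1) once per batch.
for step in range(1, 51):
    log_memory(step)
```

A peak that keeps rising across steps on every rank would point toward a leak in the loop or data pipeline. A flat peak would suggest looking elsewhere, for example at GPU memory or at the launcher.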

Screenshots/Code Snippets

image

Environment:

  • OS: Ubuntu 18.04.5
  • Python version: 3.7.7
  • PyTorch version: 1.10
  • Deepspeed version: 0.5.4
  • Horovod version: 0.23
  • Fairscale version: 0.4.0
  • CUDA/cuDNN version: 11.2 / 7.6.2
  • Stoke configuration: 0.2.0
