DDP training gets terminated in the middle of the training because of some SIGKILL received by a PID (forked child process)

## Describe the bug
`  traceback : Signal 9 (SIGKILL) received by PID 10398`

> WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 10401 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 10398) of binary: /anaconda/envs/py37_default/bin/python
Traceback (most recent call last):
  File "/anaconda/envs/py37_default/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda/envs/py37_default/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/..../torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/.../torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/.../torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/.../torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/.../torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jupyte.../torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,

`torch.distributed.elastic.multiprocessing.errors.ChildFailedError:`
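
For anyone reading the log above: as far as I understand, `exitcode: -9` follows the usual POSIX convention where a negative return code `-N` means the child was terminated by signal `N`, so the worker did not crash with a Python exception but was killed from outside. Below is a minimal sketch of that convention using only the standard library. The helper name `describe_exit` and the self-killing child process are hypothetical stand-ins for illustration, not part of my training script or of torch.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative return code -N means the child was ended by signal N.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


# Hypothetical stand-in for a worker: a child process that sends SIGKILL to itself.
child = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)
print(describe_exit(child.returncode))
```

Since SIGKILL cannot be caught by the process itself, the worker has no chance to print its own traceback, which would explain why only the launcher's `ChildFailedError` shows up.
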
## To Reproduce
1. Just a normal Stoke training script:

```python

from stoke import ClipGradNormConfig, DistributedOptions, Stoke
from tqdm import tqdm

# Wrap the model, optimizer, and loss in a Stoke object (DDP + fairscale OSS/SDDP)
stoke_model = Stoke(
    model=model,
    verbose=True,
    optimizer=optimizer,
    loss=loss,
    batch_size_per_device=opt.batchSize,
    gpu=True,
    fp16=None,
    distributed=DistributedOptions.ddp.value,
    fairscale_oss=True,
    fairscale_sddp=True,
    grad_accum_steps=1,
    configs=[amp_config, ddp_config, oss_config],
    grad_clip=ClipGradNormConfig(max_norm=opt.grad_clip, norm_type=2.0),
)

def train(train_dataloader, stoke_model: Stoke, scheduler1, scheduler2, epoch: int):
    
    example_ct = 0  # number of examples seen
    batch_ct = 0
    sum_loss = 0
    
    stoke_model.print_on_devices(f"Starting Epoch {epoch + 1}")
    stoke_model.model_access.train()
    
    for idx, (inputs, targets) in enumerate(train_dataloader):
        
        # Call the model through the stoke object interface
        outputs = stoke_model.model(inputs)
        train_loss = stoke_model.loss(outputs, targets)
        
        stoke_model.print_ema_loss(prepend_msg=f"Step {idx+1} -- EMA Loss")
        
        # Call backward through the stoke object interface
        stoke_model.backward(loss=train_loss)
        
        # Call step through the stoke object interface
        stoke_model.step()
        scheduler1.step()
        scheduler2.step()
        
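        # Detach so the running sum does not hold on to autograd history
        # (assumes train_loss is a single tensor)
        sum_loss += train_loss.detach()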

        example_ct +=  len(inputs)
        batch_ct += 1

        # Report metrics every 50th batch
        if ((batch_ct + 1) % 50) == 0:
            train_log(train_loss, example_ct, epoch)
            #print(train_loss,  example_ct, epoch)

    avg_loss = sum_loss / len(train_dataloader)
    
    return avg_loss
    
for epoch in tqdm(range(epochs), leave=True):
    train_loss = train(train_dataloader, stoke_model, scheduler1, scheduler2, epoch)
    val_loss = validate(val_dataloader, stoke_model, epoch)
    save_checkpoint(stoke_model, epoch, train_loss, val_loss)
```

The actual script is posted here: https://gist.github.com/rushi-the-neural-arch/bee47ba87e5ddabf0cb47def9bc0b013

2. Ran it with: `env CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 Stoke-DDP.py --projectName "Stoke-4K-2X-DDP" --batchSize 18 --nEpochs 2 --lr 1e-3 --weight_decay 1e-4 --grad_clip 0.1`

3. The error produced is: `WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 10401 closing signal SIGTERM`

## Expected behavior
I know this is more of a PyTorch DDP concern than a Stoke issue: many users have hit the same problem, and there doesn't seem to be a definitive solution apart from downgrading torch. For example, in https://github.com/pytorch/pytorch/issues/67538 (a workaround posted just 2 days ago), the user downgraded torch from 1.10 to 1.8, which solved this particular issue. Since Stoke requires a torch version greater than 1.8.1, I guess that is not an option for us. torch 1.10 was rolled out only recently, so this may simply not have been fixed upstream yet. Do you happen to know of any alternative approach or solution?

For a bit more context: I first trained a very lightweight sample network and could train it easily, even with larger batch sizes. After a few experiments I switched to a heavier network with more parameters (~4.5M), and that is when this error started occurring. Initially I thought it might be caused by extra load on the RAM, so I decreased the batch size to 1, removed the gradient accumulation step, and played around with the `num_workers` parameter, but none of this solved the error. What I have noticed is that the error occurs exactly after 125 steps, which seems weird because no code performs any operation after 125 steps or after any specific number of steps.

EDIT: I tried FP16 training and the error still persists, but it now occurs after 145 steps.
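
To test the RAM theory, I could log the host memory of each worker during training and check whether it climbs steadily until the kill. This is only a guess on my side: a common source of an unexplained SIGKILL on Linux is the kernel's out-of-memory killer, but I have not confirmed that this is what happens here. Below is a minimal sketch using only the standard library. The helpers `parse_vm_rss_kb`, `current_rss_mb`, and `log_memory` are hypothetical names I made up for illustration, and the sketch assumes a Linux `/proc` filesystem and that the launcher sets the `LOCAL_RANK` environment variable.

```python
import os
import re
from typing import Optional


def parse_vm_rss_kb(status_text: str) -> Optional[int]:
    """Return the VmRSS value (in kB) from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def current_rss_mb() -> Optional[float]:
    """Resident set size of this process in MiB (Linux only; None elsewhere)."""
    try:
        with open("/proc/self/status") as fh:
            rss_kb = parse_vm_rss_kb(fh.read())
    except OSError:
        return None
    return None if rss_kb is None else rss_kb / 1024.0


def log_memory(step: int, every: int = 10) -> None:
    """Print the host memory of this worker every `every` steps."""
    if step % every != 0:
        return
    # LOCAL_RANK is assumed to be set by the launcher; fall back to "?" otherwise.
    rank = os.environ.get("LOCAL_RANK", "?")
    rss = current_rss_mb()
    rss_text = "n/a" if rss is None else f"{rss:.0f} MiB"
    print(f"[rank {rank}] step {step}: RSS = {rss_text}", flush=True)
```

Calling `log_memory(idx)` inside the training loop would show whether the resident memory keeps growing from step to step. If it does, the kernel log (for example `dmesg`) may also contain an out-of-memory entry for the killed PID.
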

## Screenshots/Code Snippets
![image](https://user-images.githubusercontent.com/34182074/141239004-a039c220-8cba-487f-8979-6b6280be82a9.png)



## Environment:
  - OS: Ubuntu 18.04.5
  - Python version: 3.7.7
  - PyTorch Version: 1.10
  - Deepspeed Version: 0.5.4
  - Horovod Version: 0.23
  - Fairscale Version: 0.4.0
  - CUDA/cuDNN version: 11.2 / 7.6.2
  - Stoke configuration: 0.2.0



