DDP training gets terminated in the middle of the training because of some SIGKILL received by a PID (forked child process) #31
Description
Describe the bug
Traceback: Signal 9 (SIGKILL) received by PID 10398
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 10401 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 10398) of binary: /anaconda/envs/py37_default/bin/python
Traceback (most recent call last):
File "/anaconda/envs/py37_default/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/anaconda/envs/py37_default/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/..../torch/distributed/launch.py", line 193, in
main()
File "/home/.../torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/.../torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/.../torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/.../torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/jupyte.../torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
To Reproduce
Just a normal Stoke training script:
stoke_model = Stoke(
    model=model,
    verbose=True,
    optimizer=optimizer,
    loss=loss,
    batch_size_per_device=opt.batchSize,
    gpu=True,
    fp16=None,
    distributed=DistributedOptions.ddp.value,
    fairscale_oss=True,
    fairscale_sddp=True,
    grad_accum_steps=1,
    configs=[amp_config, ddp_config, oss_config],
    grad_clip=ClipGradNormConfig(max_norm=opt.grad_clip, norm_type=2.0),
)
def train(train_dataloader, stoke_model: Stoke, scheduler1, scheduler2, epoch: int):
    example_ct = 0  # number of examples seen
    batch_ct = 0
    sum_loss = 0
    stoke_model.print_on_devices(f"Starting Epoch {epoch + 1}")
    stoke_model.model_access.train()
    for idx, (inputs, targets) in enumerate(train_dataloader):
        # Call the model through the stoke object interface
        outputs = stoke_model.model(inputs)
        train_loss = stoke_model.loss(outputs, targets)
        stoke_model.print_ema_loss(prepend_msg=f"Step {idx+1} -- EMA Loss")
        # Call backward through the stoke object interface
        stoke_model.backward(loss=train_loss)
        # Call step through the stoke object interface
        stoke_model.step()
        scheduler1.step()
        scheduler2.step()  # was missing the parentheses, so this scheduler never stepped
        # .item() keeps the running sum a plain float instead of a tensor
        # that drags autograd history along across steps
        sum_loss += train_loss.item()
        example_ct += len(inputs)
        batch_ct += 1
        # Report metrics every 50th batch
        if ((batch_ct + 1) % 50) == 0:
            train_log(train_loss, example_ct, epoch)
            # print(train_loss, example_ct, epoch)
    avg_loss = sum_loss / len(train_dataloader)
    return avg_loss
for epoch in tqdm(range(epochs), leave=True):
    train_loss = train(train_dataloader, stoke_model, scheduler1, scheduler2, epoch)
    val_loss = validate(val_dataloader, stoke_model, epoch)
    save_checkpoint(stoke_model, epoch, train_loss, val_loss)

The actual script is posted here - https://gist.github.com/rushi-the-neural-arch/bee47ba87e5ddabf0cb47def9bc0b013
Ran the config as -
env CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 Stoke-DDP.py --projectName "Stoke-4K-2X-DDP" --batchSize 18 --nEpochs 2 --lr 1e-3 --weight_decay 1e-4 --grad_clip 0.1
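In case it helps triage, the same run can be repeated with PyTorch's standard distributed debug env vars enabled (NCCL_DEBUG and TORCH_DISTRIBUTED_DEBUG both exist in torch 1.10; nothing here is Stoke-specific) to see which rank dies first and with what context -
env NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 Stoke-DDP.py --projectName "Stoke-4K-2X-DDP" --batchSize 18 --nEpochs 2 --lr 1e-3 --weight_decay 1e-4 --grad_clip 0.1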
Error produced is -
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 10401 closing signal SIGTERM
Expected behavior
Okay, so I know this issue is more of a PyTorch DDP concern than a Stoke issue: many users face this problem and there doesn't seem to be any definitive solution apart from downgrading the torch version. You can see a workaround from just 2 days ago in pytorch/pytorch#67538, where the user downgraded torch from 1.10 to 1.8, which solved this particular issue. But since Stoke requires a torch version greater than 1.8.1, I guess that is not an option for us. Torch 1.10 was rolled out only recently, so they might not have fixed this on their end yet, but do you happen to know any alternative approach/solution for this?
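In the meantime, one thing worth trying (just a sketch reusing the exact same arguments as the config above, with the fairscale extras toggled off) is checking whether plain DDP without OSS/SDDP still crashes, to narrow down which layer is involved:

stoke_model = Stoke(
    model=model,
    verbose=True,
    optimizer=optimizer,
    loss=loss,
    batch_size_per_device=opt.batchSize,
    gpu=True,
    fp16=None,
    distributed=DistributedOptions.ddp.value,
    fairscale_oss=False,   # turn off optimizer state sharding
    fairscale_sddp=False,  # turn off the sharded-DDP wrapper
    grad_accum_steps=1,
    configs=[ddp_config],  # amp_config and oss_config dropped along with their features
    grad_clip=ClipGradNormConfig(max_norm=opt.grad_clip, norm_type=2.0),
)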
Also, to give you a bit more context: I first trained a very lightweight sample network and could train it easily at larger batch sizes. After a few experiments, I switched to a heavier network (~4.5M parameters) for training, and that is when this error started occurring. Initially I thought it might be due to the extra load on RAM, so I decreased the batch size to 1, removed the gradient accumulation step, and played around with the num_workers parameter, but none of that solved the error. In fact, what I have noticed is that the error occurs in the middle of training, exactly after 125 steps, which seems weird, as there is no code that does anything special after a specific number of steps.
EDIT - I tried FP16 training and the error still persists, but now it hits after 145 steps.
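Since exitcode -9 means the process was SIGKILLed from outside, and on Linux that is very often the kernel OOM killer (checking dmesg -T | grep -i 'killed process' right after a crash should confirm it), a small per-step memory probe could settle whether something is growing until the limit. This is only a sketch; psutil and the log_memory helper are my additions, not part of the actual script:

import os

import psutil  # assumed installed: pip install psutil
import torch


def log_memory(step: int) -> None:
    # Host RSS for this process, in GiB
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    # CUDA caching-allocator stats for the current device, in GiB
    alloc_gb = torch.cuda.memory_allocated() / 1024 ** 3
    reserved_gb = torch.cuda.memory_reserved() / 1024 ** 3
    print(
        f"step {step}: host RSS {rss_gb:.2f} GiB | "
        f"CUDA allocated {alloc_gb:.2f} GiB | reserved {reserved_gb:.2f} GiB"
    )

Calling log_memory(idx) at the end of every iteration in train() should show whether host or GPU memory climbs roughly linearly; if it does, a fixed crash step makes sense, and it would also explain why FP16 only moves the crash from step 125 to 145 - smaller per-step growth, more steps before the limit.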
Screenshots/Code Snippets
Environment:
- OS: Ubuntu 18.04.5
- Python version: 3.7.7
- PyTorch version: 1.10
- Deepspeed version: 0.5.4
- Horovod version: 0.23
- Fairscale version: 0.4.0
- CUDA/cuDNN version: 11.2 / 7.6.2
- Stoke version: 0.2.0
