Using PP with ZeRO stage 1 (z1) results in copious logs being dumped. Please consider the scale of a 64-, 256-, or 512-GPU job.
- A lot of logging on saving:
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L2035-L2037
I don't know why `_copy_recovery_script` gets called on every GPU, despite the guard:
`if self.global_rank == 0:`
The guard works correctly without PP, but under PP the condition never seems to be true. The logger call on the next line fires on every GPU as well.
- Then on loading, each GPU dumps the full list of checkpoint files:
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/state_dict_factory.py#L165
and then again here:
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/state_dict_factory.py#L85
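To make the issue concrete, here is a minimal sketch of the rank-0-gating pattern these call sites presumably intend, so a message is emitted once per job rather than once per GPU. The helper names are hypothetical (not DeepSpeed APIs), and the rank lookup is stubbed with the `RANK` env var in place of `torch.distributed.get_rank()` to keep the sketch self-contained:

```python
import logging
import os

logger = logging.getLogger("ds_sketch")
logger.setLevel(logging.INFO)

def get_global_rank() -> int:
    # Stand-in for torch.distributed.get_rank(); reads the RANK env var
    # (set by most launchers) so the sketch runs without a process group.
    return int(os.environ.get("RANK", "0"))

def log_rank0(message: str) -> bool:
    # Emit `message` only on global rank 0; return whether it was emitted,
    # so a 512-GPU job prints the line once instead of 512 times.
    if get_global_rank() == 0:
        logger.info(message)
        return True
    return False
```

If the `global_rank` attribute checked under PP is a pipeline-local rank rather than the true global rank, a guard like this would silently stop matching, which would be consistent with the behavior described above.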
These logs make it very difficult to follow the progress of training, since each dump is hundreds of lines long. They should be debug-level logs rather than printed by default, as they contribute nothing beyond 1 line x n_gpus of output.
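As a sketch of the suggestion, demoting the checkpoint-list dump to `DEBUG` with a one-line `INFO` summary would keep normal runs readable while preserving the detail for anyone who opts in. The function name and messages below are illustrative, not DeepSpeed code:

```python
import logging

logger = logging.getLogger("ds_ckpt_sketch")
logger.setLevel(logging.INFO)  # typical default training-run verbosity

def announce_checkpoint_files(files) -> bool:
    # The per-file dump goes to DEBUG, so it vanishes from a normal run;
    # a single INFO line still records that loading happened.
    logger.debug("checkpoint files: %s", files)
    logger.info("loading %d checkpoint shards", len(files))
    # Report whether the detailed dump would actually be emitted.
    return logger.isEnabledFor(logging.DEBUG)
```

At the default `INFO` level the detailed list is suppressed; raising the logger to `DEBUG` restores it for troubleshooting.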
Thank you!
@tjruwase