Using PP with ZeRO stage 1 (z1) results in copious logs being dumped. Please consider the scale of a 64-, 256-, or 512-GPU job.
- A lot of logging on saving:
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L2035-L2037
I don't know why `_copy_recovery_script` gets called on every GPU, despite the guard:
`if self.global_rank == 0:`
The guard works correctly without PP, but under PP the condition never seems to be true. The logger call on the next line fires on every GPU as well.
- Then on loading, each GPU dumps the full list of checkpoint files:
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/state_dict_factory.py#L165
and then again here:
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/state_dict_factory.py#L85
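To make the issue concrete, here is a minimal sketch of the rank-0-gating pattern these call sites presumably intend, so a message is emitted once per job rather than once per GPU. The helper names are hypothetical (not DeepSpeed APIs), and the rank lookup is stubbed with the `RANK` env var in place of `torch.distributed.get_rank()` to keep the sketch self-contained:

```python
import logging
import os

logger = logging.getLogger("ds_sketch")
logger.setLevel(logging.INFO)

def get_global_rank() -> int:
    # Stand-in for torch.distributed.get_rank(); reads the RANK env var
    # (set by most launchers) so the sketch runs without a process group.
    return int(os.environ.get("RANK", "0"))

def log_rank0(message: str) -> bool:
    # Emit `message` only on global rank 0; return whether it was emitted,
    # so a 512-GPU job prints the line once instead of 512 times.
    if get_global_rank() == 0:
        logger.info(message)
        return True
    return False
```

If the `global_rank` attribute checked under PP is a pipeline-local rank rather than the true global rank, a guard like this would silently stop matching, which would be consistent with the behavior described above.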
These logs make it very difficult to follow the progress of training, since each dump is hundreds of lines long. They should be debug-level logs rather than printed by default, as they contribute nothing beyond 1 line x n_gpus of output.
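As a sketch of the suggestion, demoting the checkpoint-list dump to `DEBUG` with a one-line `INFO` summary would keep normal runs readable while preserving the detail for anyone who opts in. The function name and messages below are illustrative, not DeepSpeed code:

```python
import logging

logger = logging.getLogger("ds_ckpt_sketch")
logger.setLevel(logging.INFO)  # typical default training-run verbosity

def announce_checkpoint_files(files) -> bool:
    # The per-file dump goes to DEBUG, so it vanishes from a normal run;
    # a single INFO line still records that loading happened.
    logger.debug("checkpoint files: %s", files)
    logger.info("loading %d checkpoint shards", len(files))
    # Report whether the detailed dump would actually be emitted.
    return logger.isEnabledFor(logging.DEBUG)
```

At the default `INFO` level the detailed list is suppressed; raising the logger to `DEBUG` restores it for troubleshooting.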
Thank you!
@tjruwase