excessive logging in checkpoints load/save #1269

@stas00

Description

Using PP with z1 results in copious logs being dumped. Please consider this at the scale of 64, 256, or 512 GPUs.

  1. A lot of logging on saving:

https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L2035-L2037

I don't know why _copy_recovery_script gets called for each GPU, despite:

if self.global_rank == 0:

It works correctly without PP, but under PP the condition never seems to be true.

And the next logger line gets called for each GPU as well.
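For illustration, here is a minimal sketch of the kind of rank-0 guard I would expect to hold regardless of PP. This is not DeepSpeed's code: `get_global_rank`, `rank_zero_only`, and the stand-in `copy_recovery_script` are hypothetical names, and reading the rank from the `RANK` environment variable is an assumption about how the job was launched.

```python
import functools
import os


def get_global_rank(default=0):
    """Return the global rank, assuming the launcher exports RANK."""
    return int(os.environ.get("RANK", default))


def rank_zero_only(fn):
    """Run fn only on global rank 0; every other rank returns None."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if get_global_rank() == 0:
            return fn(*args, **kwargs)
        return None

    return wrapper


@rank_zero_only
def copy_recovery_script(save_dir):
    # Hypothetical stand-in for the per-checkpoint side effect.
    return f"copied recovery script to {save_dir}"
```

With a guard like this, the side effect and its log line would happen once per save instead of once per GPU.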

  2. Then on loading, each GPU dumps the list of checkpoints:

https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/state_dict_factory.py#L165

and then again here:

https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/state_dict_factory.py#L85

These logs make it very difficult to follow the progress of the training, since each dump is hundreds of lines long.

These should be debug-level logs rather than printed by default, as they contribute nothing other than 1 line x n_gpus of logs.
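To make the request concrete, here is a rough sketch using only the standard `logging` module. It is an illustration under my own assumptions, not a proposal for the exact DeepSpeed change: `RankZeroInfoFilter` and `log_checkpoint_list` are hypothetical names, and the rank is passed in explicitly rather than taken from the engine.

```python
import logging


class RankZeroInfoFilter(logging.Filter):
    """Drop records below WARNING on every rank except rank 0."""

    def __init__(self, rank):
        super().__init__()
        self.rank = rank

    def filter(self, record):
        # Rank 0 logs everything; other ranks only log warnings and errors.
        return self.rank == 0 or record.levelno >= logging.WARNING


def log_checkpoint_list(logger, ckpt_files):
    # One-line summary at INFO; the full list is only emitted at DEBUG.
    logger.info("loading %d checkpoint files", len(ckpt_files))
    logger.debug("checkpoint files: %s", ckpt_files)
```

Attaching the filter to the handler keeps the normal output to a single summary line on rank 0, while the full checkpoint list is still available to anyone who lowers the level to DEBUG.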

Thank you!

@tjruwase
