removing per node repeated debug/info prints on distributed setup #1304

@stas00

DeepSpeed generates an insane amount of noise on a distributed setup. Running on 64 nodes, I end up with probably 5k lines of repeated debug/info output on each node. This makes it very difficult to find the important information, since the signal-to-noise ratio (SNR) is close to 0 here.

Here is a very small sample; I think this one is mainly JIT build info:

async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/gpfswork/rech/six/commun/conda/tr1-13B/lib/python3.8/site-packages/torch']
torch version .................... 1.8.1
torch cuda version ............... 11.1
nvcc version ..................... 11.2
deepspeed install path ........... ['/gpfsssd/worksf/projects/rech/six/commun/code/tr1-13B/DeepSpeed-big-science/deepspeed']
deepspeed info ................... 0.4.2+bc17042, bc17042, big-science
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1
 [WARNING]  async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
 [WARNING]  async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
 [WARNING]  async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
 [WARNING]  async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
quantizer .............. [NO] ....... [OKAY]
async_io ............... [NO] ....... [NO]
--------------------------------------------------
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
/bin/sh: line 0: type: git: not found
**** Git info for Megatron: git_hash=unknown git_branch=unknown ****
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
/bin/sh: line 0: type: git: not found
/bin/sh: line 0: type: git: not found
 [WARNING]  async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
**** Git info for Megatron: git_hash=unknown git_branch=unknown ****
async_io ............... [NO] ....... [NO]
**** Git info for Megatron: git_hash=unknown git_branch=unknown ****
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
 [WARNING]  async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
/bin/sh: line 0: type: git: not found
 [WARNING]  async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
**** Git info for Megatron: git_hash=unknown git_branch=unknown ****
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------

If you want to see the huge dump, it's part of https://huggingface.co/bigscience/tr1-13B-logs/resolve/main/main_log.txt
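
To make the request concrete, here is a minimal sketch of the kind of filtering that would help. It is not DeepSpeed code: it uses only Python's standard `logging` module, and it assumes the launcher exports a `RANK` environment variable for each process. The names `RankZeroFilter` and `make_logger` are hypothetical, made up for this illustration.

```python
import logging
import os


class RankZeroFilter(logging.Filter):
    """Pass DEBUG/INFO records only on rank 0; pass WARNING and above on every rank."""

    def __init__(self, rank: int):
        super().__init__()
        self.rank = rank

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # warnings and errors stay visible everywhere
        return self.rank == 0  # informational output comes from rank 0 only


def make_logger(name: str, rank: int) -> logging.Logger:
    """Hypothetical helper: build a logger whose handler applies the rank filter."""
    logger = logging.getLogger(f"{name}.rank{rank}")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # avoid duplicate output through the root logger
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.addFilter(RankZeroFilter(rank))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    # Assumption: the launcher sets RANK; fall back to 0 for a single process.
    rank = int(os.environ.get("RANK", "0"))
    log = make_logger("demo", rank)
    log.info("environment report (emitted on rank 0 only)")
    log.warning("emitted on every rank")
```

Note that this sketch deliberately keeps warnings on every rank, so something like the repeated `async_io` warning above would still show up once per process. Whether such warnings should also be deduplicated is a separate decision.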

Thank you!

@tjruwase
