DeepSpeed generates an insane amount of noise in a distributed setup. Running on 64 nodes, I end up with probably 5k lines of repeated debug/info output on each node. This makes it very difficult to find important information, since the SNR is close to 0 here.
Here is a very small sample; I think this one is mainly JIT build info:
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/gpfswork/rech/six/commun/conda/tr1-13B/lib/python3.8/site-packages/torch']
torch version .................... 1.8.1
torch cuda version ............... 11.1
nvcc version ..................... 11.2
deepspeed install path ........... ['/gpfsssd/worksf/projects/rech/six/commun/code/tr1-13B/DeepSpeed-big-science/deepspeed']
deepspeed info ................... 0.4.2+bc17042, bc17042, big-science
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
quantizer .............. [NO] ....... [OKAY]
async_io ............... [NO] ....... [NO]
--------------------------------------------------
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
/bin/sh: line 0: type: git: not found
**** Git info for Megatron: git_hash=unknown git_branch=unknown ****
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
/bin/sh: line 0: type: git: not found
/bin/sh: line 0: type: git: not found
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
**** Git info for Megatron: git_hash=unknown git_branch=unknown ****
async_io ............... [NO] ....... [NO]
**** Git info for Megatron: git_hash=unknown git_branch=unknown ****
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
/bin/sh: line 0: type: git: not found
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
**** Git info for Megatron: git_hash=unknown git_branch=unknown ****
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
If you want to see the huge dump, it's part of https://huggingface.co/bigscience/tr1-13B-logs/resolve/main/main_log.txt
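To get a feel for how much of a log like this is pure repetition, the duplicates can be collapsed after the fact. The following is only a minimal sketch, assuming the log has already been saved to a local text file; the helper name `collapse_repeats` and the command-line usage are hypothetical and not part of DeepSpeed or Megatron. It does not change what DeepSpeed prints, it only post-processes the output.

```python
import sys
from collections import Counter
from typing import Iterable, List, Tuple


def collapse_repeats(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return each distinct non-blank line once, in first-seen order, with its count.

    Hypothetical helper for illustration; it only post-processes saved log text.
    """
    counts: Counter = Counter()
    for line in lines:
        text = line.rstrip("\n")
        if text.strip():  # skip blank lines
            counts[text] += 1
    # Counter is a dict subclass, so insertion (first-seen) order is preserved.
    return list(counts.items())


if __name__ == "__main__":
    # Usage (assumed): python collapse_log.py path/to/saved_log.txt
    if len(sys.argv) == 2:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
            for text, n in collapse_repeats(fh):
                print(f"{n:>6}x  {text}")
```

Lines that differ only by rank or timestamp would still show up separately, so this is a rough measure of the repetition rather than a fix for it.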
Thank you!