Description
It is a known issue that when using Horovod in a Docker container, a stream of stderr messages like the following is printed (you can reproduce it by running SageMaker's built-in example tensorflow_script_mode_horovod.ipynb on a multi-GPU instance):
- [1,1]:[algo-1:00062] Read -1, expected 36864, errno = 1
- [1,3]:[algo-1:00064] Read -1, expected 36864, errno = 1
- [1,2]:[algo-1:00063] Read -1, expected 36864, errno = 1
- [1,1]:[algo-1:00062] Read -1, expected 36864, errno = 1
- [1,3]:[algo-1:00064] Read -1, expected 36864, errno = 1
- [1,2]:[algo-1:00063] Read -1, expected 36864, errno = 1
Most of the time, we can simply ignore the messages or filter them out in CloudWatch. But sometimes they are printed so frequently that training slows down significantly (I observed a ~50% decrease in training speed on p3.16xlarge).
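For reference, the after-the-fact filtering I mean is roughly the sketch below. The pattern is only inferred from the lines quoted above, and `drop_read_errors` is a hypothetical helper name, not part of SageMaker or Horovod. It cleans up logs that have already been written, so it would not recover the lost training speed.

```python
import re

# Pattern inferred from the messages quoted above (an assumption, not an
# official format), e.g. "[1,1]:[algo-1:00062] Read -1, expected 36864, errno = 1"
NOISE = re.compile(r"Read -1, expected \d+, errno = 1\s*$")


def drop_read_errors(lines):
    """Yield only the lines that do not end with the 'Read -1' message."""
    for line in lines:
        if not NOISE.search(line):
            yield line


if __name__ == "__main__":
    import sys

    # Example: pipe a downloaded log file through this script.
    for kept in drop_read_errors(sys.stdin):
        sys.stdout.write(kept)
```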
One way to disable the message is to pass the --privileged flag when calling nvidia-docker run, as suggested in the official Horovod GitHub repository.
Is there a way to add the --privileged flag in our container, or to suppress the message in some other way, so that the training speed isn't affected by the printing?