Disable Horovod stderr in sagemaker container

It is a known issue that when using Horovod in a Docker container, a stream of stderr messages like the following is printed (you can reproduce it by running SageMaker's built-in example on a multi-GPU instance: tensorflow_script_mode_horovod.ipynb):
- [1,1]<stderr>:[algo-1:00062] Read -1, expected 36864, errno = 1 
- [1,3]<stderr>:[algo-1:00064] Read -1, expected 36864, errno = 1
- [1,2]<stderr>:[algo-1:00063] Read -1, expected 36864, errno = 1
- [1,1]<stderr>:[algo-1:00062] Read -1, expected 36864, errno = 1
- [1,3]<stderr>:[algo-1:00064] Read -1, expected 36864, errno = 1
- [1,2]<stderr>:[algo-1:00063] Read -1, expected 36864, errno = 1

Most of the time, we can simply ignore the messages or filter them out in CloudWatch. But sometimes they are printed so frequently that training speed is affected significantly (I observed a ~50% decrease in training speed when using p3.16xlarge).
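For the filtering case, here is a minimal sketch of dropping these lines from logs that have already been collected. The regular expression is inferred only from the sample lines above, and `drop_horovod_noise` is a hypothetical helper name, not part of any SageMaker or Horovod API. Note that this only cleans up the output after the fact; it does nothing about the slowdown.

```python
import re

# Pattern inferred from the sample log lines in this issue; the exact
# message format is an assumption and may differ between MPI versions.
NOISE = re.compile(r"<stderr>:\[[^\]]+\] Read -1, expected \d+, errno = 1\s*$")


def drop_horovod_noise(lines):
    """Yield only the lines that do not match the noisy stderr pattern."""
    for line in lines:
        if not NOISE.search(line):
            yield line


if __name__ == "__main__":
    sample = [
        "[1,1]<stderr>:[algo-1:00062] Read -1, expected 36864, errno = 1",
        "[1,0]<stdout>:Epoch 1/10",
    ]
    for line in drop_horovod_noise(sample):
        print(line)
```

The helper accepts any iterable of strings, so it can wrap an open file object or a list of log events fetched elsewhere.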

One way to disable the messages is to pass the `--privileged` flag when calling `nvidia-docker run`, as suggested [in the official Horovod GitHub docs](https://github.com/horovod/horovod/blob/master/docs/docker.md#running-on-a-single-machine).

Is there a way to add the `--privileged` flag to our container, or to suppress the messages in some other way, so that the training speed isn't affected by the printing?


 


