Disable Horovod stderr in sagemaker container #800

@vbvg2008

It is a known issue that when using Horovod in a Docker container, a stream of stderr messages like the following gets printed (you can reproduce it by running SageMaker's built-in example tensorflow_script_mode_horovod.ipynb on a multi-GPU instance):

  • [1,1]:[algo-1:00062] Read -1, expected 36864, errno = 1
  • [1,3]:[algo-1:00064] Read -1, expected 36864, errno = 1
  • [1,2]:[algo-1:00063] Read -1, expected 36864, errno = 1
  • [1,1]:[algo-1:00062] Read -1, expected 36864, errno = 1
  • [1,3]:[algo-1:00064] Read -1, expected 36864, errno = 1
  • [1,2]:[algo-1:00063] Read -1, expected 36864, errno = 1

Most of the time, we can simply ignore these messages or filter them out in CloudWatch. But sometimes they are printed so frequently that training slows down significantly (I observed a ~50% decrease in training speed on a p3.16xlarge).
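For reference, here is a minimal sketch of what filtering these lines after the fact could look like, assuming the logs have already been exported as plain text. The regex is inferred only from the sample output above, and the function name and the non-noise sample line are made up for illustration.

```python
import re
from typing import Iterable, Iterator

# Pattern inferred from the sample output above (an assumption, not an
# official format): "...] Read -1, expected <bytes>, errno = 1"
NOISE_PATTERN = re.compile(r"\]\s*Read -1, expected \d+, errno = 1\s*$")


def drop_horovod_noise(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that do not match the noise pattern."""
    for line in lines:
        if not NOISE_PATTERN.search(line):
            yield line


sample = [
    "[1,1]:[algo-1:00062] Read -1, expected 36864, errno = 1",
    "[1,0]:step 100, loss 0.42",  # hypothetical training log line
    "[1,3]:[algo-1:00064] Read -1, expected 36864, errno = 1",
]
kept = list(drop_horovod_noise(sample))
print(kept)
```

Note that this only cleans up the logs for reading. The messages are still written during training, so it does nothing about the slowdown described above.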

One way to disable the messages is to pass the --privileged flag to nvidia-docker run, as suggested in the official Horovod GitHub repository.

Is there a way to add the --privileged flag to our container, or to suppress the messages some other way, so that training speed isn't affected by the printing?
