Training hang #4

aleSuglia · 2021-07-20T09:33:57Z

I'm trying to train a model for VideoQA but I obtain the following error:

[1,0]<stderr>:Stalled ranks:
[1,0]<stderr>:1: [allgather.noname.1]
[1,0]<stderr>:[2021-07-20 09:23:22.936726: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[1,0]<stderr>:07/20/2021 09:31:42 - INFO - __main__ -   122039 samples loaded
[1,0]<stderr>:A process has executed an operation involving a call
[1,0]<stderr>:to the fork() system call to create a child process.
[1,0]<stderr>:
[1,0]<stderr>:As a result, the libfabric EFA provider is operating in
[1,0]<stderr>:a condition that could result in memory corruption or
[1,0]<stderr>:other system errors.
[1,0]<stderr>:
[1,0]<stderr>:For the libfabric EFA provider to work safely when fork()
[1,0]<stderr>:is called, the application must handle memory registrations
[1,0]<stderr>:(FI_MR_LOCAL) and you will need to set the following environment
[1,0]<stderr>:variables:
[1,0]<stderr>:          RDMAV_FORK_SAFE=1
[1,0]<stderr>:MPI applications do not support this mode.
[1,0]<stderr>:
[1,0]<stderr>:However, this setting can result in signficant performance
[1,0]<stderr>:impact to your application due to increased cost of memory
[1,0]<stderr>:registration.
[1,0]<stderr>:
[1,0]<stderr>:You may want to check with your application vendor to see
[1,0]<stderr>:if an application-level alternative (of not using fork)
[1,0]<stderr>:exists.
[1,0]<stderr>:
[1,0]<stderr>:Please refer to https://github.com/ofiwg/libfabric/issues/6332
[1,0]<stderr>:for more information.
[1,0]<stderr>:
[1,0]<stderr>:Your job will now abort.

This happens immediately after the script loads the data. I can see the logging info [1,0]<stderr>:07/20/2021 09:31:42 - INFO - __main__ - 122039 samples loaded. Can you please advise?

Otherwise, would you have a trained model for VideoQA that I can test?

UPDATE:
I've also tried with a single GPU (by removing horovodrun), and the same error occurs.


linjieli222 · 2021-07-20T18:59:22Z

I have never seen this error before. Perhaps follow this link (ofiwg/libfabric#6332) for more information.

aleSuglia · 2021-07-21T08:48:23Z

Yeah, very weird one. Not sure where this is coming from, to be honest. I'll leave this open in case somebody else comes across it.

aleSuglia · 2021-07-22T12:14:39Z

I've tried setting the environment variable FI_EFA_FORK_SAFE=1 but it still produces the same error.
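For anyone reproducing this, here is a minimal sketch of setting the variable from inside the training script rather than in the shell. It assumes the variable only needs to be present in each rank's environment before libfabric is initialized (for example, before Horovod is imported and initialized); the helper name `enable_efa_fork_safe` is hypothetical and not part of any library.

```python
import os

# Variable name taken from this thread; "1" turns the setting on.
FORK_SAFE_VAR = "FI_EFA_FORK_SAFE"


def enable_efa_fork_safe(env=os.environ):
    """Set FI_EFA_FORK_SAFE=1 unless a value is already present.

    `env` is any dict-like mapping (defaults to the process environment).
    Returns the value that is in effect afterwards.
    """
    return env.setdefault(FORK_SAFE_VAR, "1")


if __name__ == "__main__":
    value = enable_efa_fork_safe()
    print(f"{FORK_SAFE_VAR}={value}")
    # Assumption: imports that may initialize libfabric (Horovod, MPI
    # bindings) and the training code itself would go after this point.
```

Because `setdefault` is used, a value exported in the shell takes precedence over the in-script default, which makes it easy to confirm which setting a given run actually used.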

wzamazon · 2021-09-29T14:24:36Z

Hi, I just noticed this issue. If you are still having this problem, you need to not only set the environment variable FI_EFA_FORK_SAFE=1 but also use a newer version of the EFA installer (the current version is 1.13.0).
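A rough way to check the installer version against the 1.13.0 mentioned above is sketched below. It rests on two assumptions that are not confirmed in this thread: that the EFA installer records its version in `/opt/amazon/efa_installed_packages`, and that the file contains a line of the form `EFA installer version: X.Y.Z`. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: the EFA installer writes its version to this file.
EFA_PACKAGES_FILE = Path("/opt/amazon/efa_installed_packages")
MIN_VERSION = (1, 13, 0)  # version named in this thread


def parse_installer_version(text):
    """Return the installer version as a tuple of ints, or None if absent.

    Assumes a line such as '# EFA installer version: 1.13.0'.
    """
    match = re.search(r"EFA installer version:\s*(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def installer_is_recent(text, minimum=MIN_VERSION):
    """True if the parsed version is at least `minimum`."""
    version = parse_installer_version(text)
    return version is not None and version >= minimum


if __name__ == "__main__":
    if EFA_PACKAGES_FILE.exists():
        contents = EFA_PACKAGES_FILE.read_text()
        print("installer recent enough:", installer_is_recent(contents))
    else:
        print(f"{EFA_PACKAGES_FILE} not found; check the installer version manually")
```

Versions are compared as integer tuples rather than strings, so that 1.9.0 is correctly treated as older than 1.13.0.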

linjieli222 added the "help wanted" label (extra attention is needed) · Jul 22, 2021
