Training hang #4

Open
aleSuglia opened this issue Jul 20, 2021 · 4 comments
Labels: help wanted (extra attention is needed)

Comments

@aleSuglia

aleSuglia commented Jul 20, 2021

Hello @linjieli222,

I'm trying to train a model for VideoQA but I obtain the following error:

[1,0]<stderr>:Stalled ranks:
[1,0]<stderr>:1: [allgather.noname.1]
[1,0]<stderr>:[2021-07-20 09:23:22.936726: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[1,0]<stderr>:07/20/2021 09:31:42 - INFO - __main__ -   122039 samples loaded
[1,0]<stderr>:A process has executed an operation involving a call
[1,0]<stderr>:to the fork() system call to create a child process.
[1,0]<stderr>:
[1,0]<stderr>:As a result, the libfabric EFA provider is operating in
[1,0]<stderr>:a condition that could result in memory corruption or
[1,0]<stderr>:other system errors.
[1,0]<stderr>:
[1,0]<stderr>:For the libfabric EFA provider to work safely when fork()
[1,0]<stderr>:is called, the application must handle memory registrations
[1,0]<stderr>:(FI_MR_LOCAL) and you will need to set the following environment
[1,0]<stderr>:variables:
[1,0]<stderr>:          RDMAV_FORK_SAFE=1
[1,0]<stderr>:MPI applications do not support this mode.
[1,0]<stderr>:
[1,0]<stderr>:However, this setting can result in signficant performance
[1,0]<stderr>:impact to your application due to increased cost of memory
[1,0]<stderr>:registration.
[1,0]<stderr>:
[1,0]<stderr>:You may want to check with your application vendor to see
[1,0]<stderr>:if an application-level alternative (of not using fork)
[1,0]<stderr>:exists.
[1,0]<stderr>:
[1,0]<stderr>:Please refer to https://github.com/ofiwg/libfabric/issues/6332
[1,0]<stderr>:for more information.
[1,0]<stderr>:
[1,0]<stderr>:Your job will now abort.

This happens immediately after the script loads the data; the last log line I see is [1,0]<stderr>:07/20/2021 09:31:42 - INFO - __main__ - 122039 samples loaded. Can you please advise?

Otherwise, would you have a trained model for VideoQA that I can test?

UPDATE:
I've also tried with a single GPU (by removing horovodrun) and the same error happens.

@linjieli222
Contributor

I have never seen this error before. Perhaps follow this link (ofiwg/libfabric#6332) for more information.

@aleSuglia
Author

Yeah, a very weird one. I'm not sure where this is coming from, to be honest. I'll leave it here in case somebody else comes across this.

@aleSuglia
Author

I've tried setting the environment variable FI_EFA_FORK_SAFE=1 but it still produces the same error.
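For anyone trying the same thing, here is a minimal sketch of setting the variable from inside the training script rather than from the shell. The helper name `enable_efa_fork_safe` is hypothetical, and the assumption is that the variable has to be in the environment before anything initialises MPI or libfabric. When launching through horovodrun, the variable would presumably also need to reach every rank, and how it is forwarded depends on the launcher.

```python
import os

# Variable name taken from the comment above; whether it resolves the abort
# may depend on the installed libfabric/EFA version.
FORK_SAFE_VAR = "FI_EFA_FORK_SAFE"


def enable_efa_fork_safe(env=None):
    """Set FI_EFA_FORK_SAFE=1 in the given mapping (os.environ by default)."""
    target = os.environ if env is None else env
    target[FORK_SAFE_VAR] = "1"
    return target


# Assumption: call this at the very top of the training script, before
# importing anything that may initialise MPI or libfabric (e.g. Horovod).
enable_efa_fork_safe()
```

Printing `os.environ.get("FI_EFA_FORK_SAFE")` right before the data loader is created is a quick way to confirm that the process actually sees the value.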

linjieli222 added the help wanted label Jul 22, 2021
@wzamazon

Hi, I just noticed this issue. If you are still having the problem, you need to not only set the environment variable FI_EFA_FORK_SAFE=1 but also use a newer version of the EFA installer (the current version is 1.13.0).
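A small sketch for comparing an installed EFA installer version against the one mentioned above. The helper names are hypothetical, and 1.13.0 is used only because it is the version quoted in this thread, not as a confirmed minimum. The installed version string is assumed to be something you look up yourself on the machine.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.13.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_at_least(installed, reference="1.13.0"):
    """Return True if `installed` is the same as or newer than `reference`.

    The default reference is the version quoted in the thread; it is an
    illustrative value, not a documented minimum requirement.
    """
    return parse_version(installed) >= parse_version(reference)
```

Comparing tuples of integers rather than raw strings matters here: as plain strings, "1.9.4" would sort after "1.13.0", which is the wrong answer.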
