RGCN generates nan in PyTorch 1.8 but not in PyTorch 1.7.x #2760
Some more information: it does not seem to occur when running on the CPU, and I've only been able to reproduce it on the ogbn-mag dataset.
I tested it with CUDA 10.1. The training progress looks good.
@zheng-da and I observed the same phenomenon: PyTorch 1.8 yields nan while PyTorch 1.7.x does not have this problem. In my case it is not related to SparseEmbedding or OGBN-MAG; NaNs occur even when node features are available.
I think I've tracked this down to the
It seems PyTorch does not check the data access order. Is this a breaking change in PyTorch 1.8?
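To make the failure mode concrete: if an optimizer step reads its accumulated state before the kernel writing it has finished, the read can see stale or garbage memory. The sketch below is a hypothetical CPU analogue (not DGL's actual kernel) of an Adagrad-style update, showing how a negative "garbage" state value turns into nan via the square root and then poisons the parameter row:

```python
import numpy as np

def adagrad_step(param, state_sum, grad, lr=0.01, eps=1e-10):
    # Adagrad-style sparse update: accumulate squared gradients, then
    # scale the step by the inverse root of the accumulated state.
    state_sum += grad * grad
    return param - lr * grad / (np.sqrt(state_sum) + eps)

param = np.array([1.0])

# Correct state: the update behaves normally.
good = adagrad_step(param, np.array([0.0]), np.array([0.5]))

# "Garbage" state, as a race could produce: sqrt of a negative value
# is nan, and the nan propagates into the updated parameter.
bad = adagrad_step(param, np.array([-3.0]), np.array([0.5]))

print(good, bad)  # bad contains nan
```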
…rseAdagrad (#3971)

* Fixed race condition bug in distributed/optim/pytorch/sparse_optim.py's SparseAdam::update, corresponding with the bug fixed in the non-distributed version in #3013, though using the newer Event-based approach from that corresponding function. The race condition would often result in NaNs, like the previously fixed bug. #2760
* Fixed race condition bug in SparseAdagrad::update corresponding with the one fixed in SparseAdam::update in the previous commit. Same info applies.
* Fixed a typo in all copies of a repeatedly-copied comment near the bug fixed 3 commits ago, checking all nearby implementations for a corresponding bug. (All of them appear to have been fixed as of 2 commits ago.)
* Removed trailing whitespace

Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
Co-authored-by: Rhett Ying <85214957+Rhett-Ying@users.noreply.github.com>
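The commit describes an Event-based fix: the reader is made to wait until the writer has signalled that the state is fully written. A minimal CPU analogue using Python threads (assumed names, not DGL's implementation; the real fix uses CUDA events between streams) looks like this:

```python
import threading

# Shared optimizer-like state written by one thread and read by another.
state = {"grad": None}
ready = threading.Event()

def producer():
    # Write the gradient, then signal that it is safe to read.
    state["grad"] = [0.1, 0.2, 0.3]
    ready.set()

def consumer(out):
    # Without this wait, the consumer could read state["grad"] while it
    # is still None -- the read-before-write race the commit fixes.
    ready.wait()
    out.append(sum(state["grad"]))

out = []
t_read = threading.Thread(target=consumer, args=(out,))
t_write = threading.Thread(target=producer)
t_read.start()
t_write.start()
t_write.join()
t_read.join()
print(out[0])  # the gradient sum, ~0.6
```

The same pattern applies on the GPU: `torch.cuda.Event` recorded on the writing stream and waited on by the reading stream serializes the two kernels without a full device synchronization.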
🐛 Bug

Running /examples/pytorch/rgcn/entity_classify_mp.py results in the training accuracy staying near zero and the loss becoming nan.

To Reproduce

Steps to reproduce the behavior:

Expected behavior

However, when pytorch's implementation is used:

Environment

How you installed DGL (conda, pip, source): source
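When bisecting a regression like this, it helps to fail fast at the first non-finite loss rather than noticing many epochs later that accuracy flatlined. A small helper (hypothetical; not part of the example script) might look like:

```python
import math

def check_finite(loss, step):
    # Abort training the moment the loss stops being a finite number,
    # so the offending step can be inspected immediately.
    if math.isnan(loss) or math.isinf(loss):
        raise RuntimeError(f"loss became {loss} at step {step}")
    return loss

check_finite(0.42, step=0)  # a finite loss passes through unchanged
try:
    check_finite(float("nan"), step=7)
except RuntimeError as e:
    print(e)  # loss became nan at step 7
```

Calling this on the scalar loss each iteration of the training loop pinpoints exactly where the NaNs first appear.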