
RGCN generates nan in PyTorch 1.8 but not in PyTorch 1.7.x #2760

Closed
nv-dlasalle opened this issue Mar 18, 2021 · 5 comments · Fixed by #3013
Labels: bug:confirmed (Something isn't working), help wanted (Need helps from the community)

@nv-dlasalle (Collaborator)

🐛 Bug

Running /examples/pytorch/rgcn/entity_classify_mp.py results in the training accuracy staying near zero and the loss becoming nan.

To Reproduce

Steps to reproduce the behavior:

  1. Build from source, and run the example according to the readme:
$ OMP_NUM_THREADS=1 python3 entity_classify_mp.py -d ogbn-mag --testing --fanout='30,30' --batch-size 1024 --n-hidden 128 --lr 0.01 --num-worker 1 --eval-batch-size 8 --low-mem --gpu 0 --dropout 0.7 --use-self-loop --n-bases 2 --n-epochs 3 --dgl-sparse --sparse-lr 0.08
Using backend: pytorch
Namespace(batch_size=1024, dataset='ogbn-mag', dgl_sparse=True, dropout=0.7, eval_batch_size=8, fanout='30,30', gpu='0', l2norm=0, layer_norm=False, low_mem=True, lr=0.01, n_bases=2, n_epochs=3, n_hidden=128, n_layers=2, node_feats=False, num_workers=1, sparse_lr=0.08, use_self_loop=True, validation=False)
num_tasks = 1
task_type = multiclass classification
eval_metric = acc
num_classes = 349
is_hetero = True
Opening 'dataset/ogbn_mag_dgl/processed/dgl_data_processed'.
Number of relations: 8
Number of class: 349
Number of train: 629571
Number of valid: 64879
Number of test: 41939
0:author
1:field_of_study
2:institution
3:paper
start training...
Train Accuracy: 0.0117 | Train Loss: 5.9985
Train Accuracy: 0.0107 | Train Loss: nan
Train Accuracy: 0.0029 | Train Loss: nan
Train Accuracy: 0.0029 | Train Loss: nan
Train Accuracy: 0.0010 | Train Loss: nan
Train Accuracy: 0.0020 | Train Loss: nan
Train Accuracy: 0.0029 | Train Loss: nan
Train Accuracy: 0.0029 | Train Loss: nan
Train Accuracy: 0.0039 | Train Loss: nan
Train Accuracy: 0.0039 | Train Loss: nan
Train Accuracy: 0.0020 | Train Loss: nan
Train Accuracy: 0.0010 | Train Loss: nan
Train Accuracy: 0.0029 | Train Loss: nan
Train Accuracy: 0.0039 | Train Loss: nan
Train Accuracy: 0.0078 | Train Loss: nan
Train Accuracy: 0.0020 | Train Loss: nan
Train Accuracy: 0.0039 | Train Loss: nan
Train Accuracy: 0.0029 | Train Loss: nan
Train Accuracy: 0.0020 | Train Loss: nan
Train Accuracy: 0.0049 | Train Loss: nan
Train Accuracy: 0.0000 | Train Loss: nan
Train Accuracy: 0.0029 | Train Loss: nan
Train Accuracy: 0.0020 | Train Loss: nan
Train Accuracy: 0.0020 | Train Loss: nan
Train Accuracy: 0.0039 | Train Loss: nan
Train Accuracy: 0.0039 | Train Loss: nan
Train Accuracy: 0.0049 | Train Loss: nan
Train Accuracy: 0.0039 | Train Loss: nan

Expected behavior

Training should converge. It does when PyTorch's sparse embedding implementation is used instead, i.e. the same command without --dgl-sparse:

$ OMP_NUM_THREADS=1 python3 entity_classify_mp.py -d ogbn-mag --testing --fanout='30,30' --batch-size 1024 --n-hidden 128 --lr 0.01 --num-worker 1 --eval-batch-size 8 --low-mem --gpu 0 --dropout 0.7 --use-self-loop --n-bases 2 --n-epochs 3 --sparse-lr 0.08
Using backend: pytorch
Namespace(batch_size=1024, dataset='ogbn-mag', dgl_sparse=False, dropout=0.7, eval_batch_size=8, fanout='30,30', gpu='0', l2norm=0, layer_norm=False, low_mem=True, lr=0.01, n_bases=2, n_epochs=3, n_hidden=128, n_layers=2, node_feats=False, num_workers=1, sparse_lr=0.08, use_self_loop=True, validation=False)
num_tasks = 1
task_type = multiclass classification
eval_metric = acc
num_classes = 349
is_hetero = True
Opening 'dataset/ogbn_mag_dgl/processed/dgl_data_processed'.
Number of relations: 8
Number of class: 349
Number of train: 629571
Number of valid: 64879
Number of test: 41939
0:author
1:field_of_study
2:institution
3:paper
start training...
Train Accuracy: 0.0107 | Train Loss: 5.9672
Train Accuracy: 0.0166 | Train Loss: 5.7614
Train Accuracy: 0.0215 | Train Loss: 5.5110
Train Accuracy: 0.0332 | Train Loss: 5.4334
Train Accuracy: 0.0361 | Train Loss: 5.2588
Train Accuracy: 0.0469 | Train Loss: 5.1994
Train Accuracy: 0.0557 | Train Loss: 5.1170
Train Accuracy: 0.0518 | Train Loss: 5.0814
Train Accuracy: 0.0674 | Train Loss: 5.0409
Train Accuracy: 0.0791 | Train Loss: 4.9510
Train Accuracy: 0.0615 | Train Loss: 5.0149
Train Accuracy: 0.0693 | Train Loss: 5.0005
Train Accuracy: 0.0801 | Train Loss: 4.7365
Train Accuracy: 0.0928 | Train Loss: 4.8522
Train Accuracy: 0.0840 | Train Loss: 4.8094
Train Accuracy: 0.1055 | Train Loss: 4.7392
Train Accuracy: 0.0967 | Train Loss: 4.8593
Train Accuracy: 0.0830 | Train Loss: 4.8212
Train Accuracy: 0.0957 | Train Loss: 4.6933
Train Accuracy: 0.1094 | Train Loss: 4.6382
Train Accuracy: 0.1055 | Train Loss: 4.6573
Train Accuracy: 0.1016 | Train Loss: 4.6128
Train Accuracy: 0.1211 | Train Loss: 4.6512

Environment

  • DGL Version (e.g., 1.0): master (09ef2c2)
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.8
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed DGL (conda, pip, source): source
  • Build command you used (if compiling from source): cmake .. -DUSE_CUDA=ON -DBUILD_TORCH=ON -DBUILD_CPP_TEST=ON -DCMAKE_BUILD_TYPE=Debug -DCMAKE_VERBOSE_MAKEFILE=1 && make -j13
  • Python version: 3.6
  • CUDA/cuDNN version (if applicable): 11.1
  • GPU models and configuration (e.g. V100): TitanV
  • Any other relevant information:
@jermainewang added the bug:unconfirmed (May be a bug. Need further investigation.) label on Mar 18, 2021
@nv-dlasalle (Collaborator, Author)

Additional information: this does not seem to occur when running on the CPU, and I've only been able to reproduce it on the ogbn-mag dataset.

@classicsong (Contributor)

I tested it with CUDA 10.1 and the training progresses normally.
I will check CUDA 11.1 later.

@BarclayII (Collaborator)

@zheng-da and I observed the same phenomenon: PyTorch 1.8 yields nan, while PyTorch 1.7.x does not have this problem.

In my case it is not related to SparseEmbedding or OGBN-MAG; NaNs occur even when node features are available.

@BarclayII changed the title from "DGL's sparse embedding in the RGCN example does not converge" to "RGCN generates nan in PyTorch 1.8 but not in PyTorch 1.7.x" on Apr 28, 2021
@jermainewang added the bug:confirmed (Something isn't working) and help wanted (Need helps from the community) labels and removed the bug:unconfirmed (May be a bug. Need further investigation.) label on May 10, 2021
@nv-dlasalle (Collaborator, Author)

I think I've tracked this down to sparse_optim.py copying data from the GPU to the CPU with non_blocking=True and then running CPU operations before the copy has finished.
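
For illustration, a minimal sketch of the suspected pattern (this is not the actual sparse_optim.py code; the tensors and shapes are made up): a device-to-host copy issued with non_blocking=True returns control immediately, so CPU code that reads the destination right away can race with the in-flight copy unless it waits for the copy to complete first.

```python
import torch

# A gradient on the GPU and a pinned host buffer to receive it.
grad_gpu = torch.randn(1024, 128, device="cuda")
host_buf = torch.empty(grad_gpu.shape, dtype=grad_gpu.dtype, pin_memory=True)

# Asynchronous device-to-host copy: returns immediately, the copy is only
# enqueued on the current CUDA stream.
host_buf.copy_(grad_gpu, non_blocking=True)

# BUG: reading host_buf here races with the in-flight copy and can observe
# stale or partially written values, which then corrupt the optimizer state.
# bad = host_buf.sum()

# Fix: wait for the copy before any CPU-side use, e.g. by synchronizing the
# stream the copy was enqueued on.
torch.cuda.current_stream().synchronize()
good = host_buf.sum()  # safe: the copy is guaranteed to have finished
```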

@classicsong (Contributor)

> I think I've tracked this down to sparse_optim.py copying data from the GPU to the CPU with non_blocking=True and then running CPU operations before the copy has finished.

It seems PyTorch does not enforce the data access order here. Is this a behavior change in PyTorch 1.8?

nv-dlasalle added a commit that referenced this issue Jun 15, 2021
…2760) (#3013)

* Fix sparse optimizer to wait on copies to the CPU

* Fix linting

* Fix typo

Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
ndickson-nvidia added a commit to ndickson-nvidia/dgl that referenced this issue May 4, 2022
…py's SparseAdam::update, corresponding with the bug fixed in the non-distributed version in dmlc#3013 , though using the newer Event-based approach from that corresponding function.  The race condition would often result in NaNs, like the previously fixed bug. dmlc#2760
nv-dlasalle pushed a commit that referenced this issue May 9, 2022
…rseAdagrad (#3971)

* * Fixed race condition bug in distributed/optim/pytorch/sparse_optim.py's SparseAdam::update, corresponding with the bug fixed in the non-distributed version in #3013 , though using the newer Event-based approach from that corresponding function.  The race condition would often result in NaNs, like the previously fixed bug. #2760

* * Fixed race condition bug in SparseAdagrad::update corresponding with the one fixed in SparseAdam::update in the previous commit.  Same info applies.

* * Fixed typo in all copies of a repeatedly-copied comment near bug fixed 3 commits ago, checking all implementations nearby for a corresponding bug.  (All of them appear to have been fixed as of 2 commits ago.)

* * Removed trailing whitespace

Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
Co-authored-by: Rhett Ying <85214957+Rhett-Ying@users.noreply.github.com>
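
For reference, a minimal sketch of the Event-based wait pattern these commit messages describe (hypothetical, not DGL's actual implementation; copy_to_host_async and the shapes are made up): an event is recorded right after each asynchronous device-to-host copy, and the CPU-side update synchronizes on that event only just before it reads the host buffer, deferring the wait to the point where the data is actually needed instead of blocking the whole stream immediately.

```python
import torch

def copy_to_host_async(t: torch.Tensor):
    """Start a GPU->CPU copy and return the host buffer plus an event that
    completes once the copy has finished."""
    host = torch.empty(t.shape, dtype=t.dtype, pin_memory=True)
    host.copy_(t, non_blocking=True)
    done = torch.cuda.Event()
    done.record(torch.cuda.current_stream())
    return host, done

# Several gradient tensors on the GPU (made-up shapes).
grads = [torch.randn(2048, 128, device="cuda") for _ in range(4)]

# Issue all copies up front; they stay in flight while other work proceeds.
pending = [copy_to_host_async(g) for g in grads]

for host, done in pending:
    done.synchronize()   # wait only for this tensor's copy to land
    host.mul_(0.9)       # CPU-side update is now safe to touch the buffer
```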
VibhuJawa pushed a commit to VibhuJawa/dgl that referenced this issue Jun 17, 2022
…rseAdagrad (dmlc#3971)