
RGCN generates nan in PyTorch 1.8 but not in PyTorch 1.7.x #2760

Closed
nv-dlasalle opened this issue Mar 18, 2021 · 5 comments · Fixed by #3013
Labels: bug:confirmed (Something isn't working), help wanted (Need helps from the community)

@nv-dlasalle (Collaborator)

🐛 Bug

Running /examples/pytorch/rgcn/entity_classify_mp.py results in the training accuracy staying near zero and the loss becoming nan.

To Reproduce

Steps to reproduce the behavior:

  1. Build from source, and run the example according to the readme:
$ OMP_NUM_THREADS=1 python3 entity_classify_mp.py -d ogbn-mag --testing --fanout='30,30' --batch-size 1024 --n-hidden 128 --lr 0.01 --num-worker 1 --eval-batch-size 8 --low-mem --gpu 0 --dropout 0.7 --use-self-loop --n-bases 2 --n-epochs 3 --dgl-sparse --sparse-lr 0.08
Using backend: pytorch
Namespace(batch_size=1024, dataset='ogbn-mag', dgl_sparse=True, dropout=0.7, eval_batch_size=8, fanout='30,30', gpu='0', l2norm=0, layer_norm=False, low_mem=True, lr=0.01, n_bases=2, n_epochs=3, n_hidden=128, n_layers=2, node_feats=False, num_workers=1, sparse_lr=0.08, use_self_loop=True, validation=False)
num_tasks = 1
task_type = multiclass classification
eval_metric = acc
num_classes = 349
is_hetero = True
Opening 'dataset/ogbn_mag_dgl/processed/dgl_data_processed'.
Number of relations: 8
Number of class: 349
Number of train: 629571
Number of valid: 64879
Number of test: 41939
0:author
1:field_of_study
2:institution
3:paper
start training...
Train Accuracy: 0.0117 | Train Loss: 5.9985
Train Accuracy: 0.0107 | Train Loss: nan
Train Accuracy: 0.0029 | Train Loss: nan
Train Accuracy: 0.0029 | Train Loss: nan
Train Accuracy: 0.0010 | Train Loss: nan
Train Accuracy: 0.0020 | Train Loss: nan
Train Accuracy: 0.0029 | Train Loss: nan
Train Accuracy: 0.0029 | Train Loss: nan
Train Accuracy: 0.0039 | Train Loss: nan
Train Accuracy: 0.0039 | Train Loss: nan
Train Accuracy: 0.0020 | Train Loss: nan
Train Accuracy: 0.0010 | Train Loss: nan
Train Accuracy: 0.0029 | Train Loss: nan
Train Accuracy: 0.0039 | Train Loss: nan
Train Accuracy: 0.0078 | Train Loss: nan
Train Accuracy: 0.0020 | Train Loss: nan
Train Accuracy: 0.0039 | Train Loss: nan
Train Accuracy: 0.0029 | Train Loss: nan
Train Accuracy: 0.0020 | Train Loss: nan
Train Accuracy: 0.0049 | Train Loss: nan
Train Accuracy: 0.0000 | Train Loss: nan
Train Accuracy: 0.0029 | Train Loss: nan
Train Accuracy: 0.0020 | Train Loss: nan
Train Accuracy: 0.0020 | Train Loss: nan
Train Accuracy: 0.0039 | Train Loss: nan
Train Accuracy: 0.0039 | Train Loss: nan
Train Accuracy: 0.0049 | Train Loss: nan
Train Accuracy: 0.0039 | Train Loss: nan

Expected behavior

Training should converge. It does when PyTorch's sparse embedding implementation is used instead, i.e. the same command without --dgl-sparse:

$ OMP_NUM_THREADS=1 python3 entity_classify_mp.py -d ogbn-mag --testing --fanout='30,30' --batch-size 1024 --n-hidden 128 --lr 0.01 --num-worker 1 --eval-batch-size 8 --low-mem --gpu 0 --dropout 0.7 --use-self-loop --n-bases 2 --n-epochs 3 --sparse-lr 0.08
Using backend: pytorch
Namespace(batch_size=1024, dataset='ogbn-mag', dgl_sparse=False, dropout=0.7, eval_batch_size=8, fanout='30,30', gpu='0', l2norm=0, layer_norm=False, low_mem=True, lr=0.01, n_bases=2, n_epochs=3, n_hidden=128, n_layers=2, node_feats=False, num_workers=1, sparse_lr=0.08, use_self_loop=True, validation=False)
num_tasks = 1
task_type = multiclass classification
eval_metric = acc
num_classes = 349
is_hetero = True
Opening 'dataset/ogbn_mag_dgl/processed/dgl_data_processed'.
Number of relations: 8
Number of class: 349
Number of train: 629571
Number of valid: 64879
Number of test: 41939
0:author
1:field_of_study
2:institution
3:paper
start training...
Train Accuracy: 0.0107 | Train Loss: 5.9672
Train Accuracy: 0.0166 | Train Loss: 5.7614
Train Accuracy: 0.0215 | Train Loss: 5.5110
Train Accuracy: 0.0332 | Train Loss: 5.4334
Train Accuracy: 0.0361 | Train Loss: 5.2588
Train Accuracy: 0.0469 | Train Loss: 5.1994
Train Accuracy: 0.0557 | Train Loss: 5.1170
Train Accuracy: 0.0518 | Train Loss: 5.0814
Train Accuracy: 0.0674 | Train Loss: 5.0409
Train Accuracy: 0.0791 | Train Loss: 4.9510
Train Accuracy: 0.0615 | Train Loss: 5.0149
Train Accuracy: 0.0693 | Train Loss: 5.0005
Train Accuracy: 0.0801 | Train Loss: 4.7365
Train Accuracy: 0.0928 | Train Loss: 4.8522
Train Accuracy: 0.0840 | Train Loss: 4.8094
Train Accuracy: 0.1055 | Train Loss: 4.7392
Train Accuracy: 0.0967 | Train Loss: 4.8593
Train Accuracy: 0.0830 | Train Loss: 4.8212
Train Accuracy: 0.0957 | Train Loss: 4.6933
Train Accuracy: 0.1094 | Train Loss: 4.6382
Train Accuracy: 0.1055 | Train Loss: 4.6573
Train Accuracy: 0.1016 | Train Loss: 4.6128
Train Accuracy: 0.1211 | Train Loss: 4.6512

Environment

  • DGL Version (e.g., 1.0): master (09ef2c2)
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.8
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed DGL (conda, pip, source): source
  • Build command you used (if compiling from source): cmake .. -DUSE_CUDA=ON -DBUILD_TORCH=ON -DBUILD_CPP_TEST=ON -DCMAKE_BUILD_TYPE=Debug -DCMAKE_VERBOSE_MAKEFILE=1 && make -j13
  • Python version: 3.6
  • CUDA/cuDNN version (if applicable): 11.1
  • GPU models and configuration (e.g. V100): TitanV
  • Any other relevant information:
@jermainewang added the bug:unconfirmed (May be a bug. Need further investigation.) label on Mar 18, 2021
@nv-dlasalle (Collaborator, Author)

Additional information: this does not seem to occur when running on the CPU, and I've only been able to reproduce it on the ogbn-mag dataset.

@classicsong (Contributor)

I tested it with CUDA 10.1 and the training progresses normally.
I will check CUDA 11.1 later.

@BarclayII (Collaborator)

@zheng-da and I observed the same phenomenon: PyTorch 1.8 yields nan, while PyTorch 1.7.x does not have this problem.

In my case it is not related to SparseEmbedding or OGBN-MAG; NaNs occur even when node features are available.

@BarclayII changed the title from "DGL's sparse embedding in the RGCN example does not converge" to "RGCN generates nan in PyTorch 1.8 but not in PyTorch 1.7.x" on Apr 28, 2021
@jermainewang added the bug:confirmed (Something isn't working) and help wanted (Need helps from the community) labels and removed the bug:unconfirmed (May be a bug. Need further investigation.) label on May 10, 2021
@nv-dlasalle (Collaborator, Author)

I think I've tracked this down to sparse_optim.py copying data from the GPU to the CPU with non_blocking=True and then running CPU operations before the copy has finished.
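
For illustration, a minimal sketch of the suspected pattern (this is not the actual sparse_optim.py code; the tensors and shapes are made up): a device-to-host copy issued with non_blocking=True returns control immediately, so CPU code that reads the destination right away can race with the in-flight copy unless it waits for the copy to complete first.

```python
import torch

# A gradient on the GPU and a pinned host buffer to receive it.
grad_gpu = torch.randn(1024, 128, device="cuda")
host_buf = torch.empty(grad_gpu.shape, dtype=grad_gpu.dtype, pin_memory=True)

# Asynchronous device-to-host copy: returns immediately, the copy is only
# enqueued on the current CUDA stream.
host_buf.copy_(grad_gpu, non_blocking=True)

# BUG: reading host_buf here races with the in-flight copy and can observe
# stale or partially written values, which then corrupt the optimizer state.
# bad = host_buf.sum()

# Fix: wait for the copy before any CPU-side use, e.g. by synchronizing the
# stream the copy was enqueued on.
torch.cuda.current_stream().synchronize()
good = host_buf.sum()  # safe: the copy is guaranteed to have finished
```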

@classicsong (Contributor)

> I think I've tracked this down to sparse_optim.py copying data from the GPU to the CPU with non_blocking=True and then running CPU operations before the copy has finished.

It seems PyTorch does not enforce the data access order here. Is this a behavior change in PyTorch 1.8?

nv-dlasalle added a commit that referenced this issue Jun 15, 2021
…2760) (#3013)

* Fix sparse optimizer to wait on copies to the CPU

* Fix linting

* Fix typo

Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
ndickson-nvidia added a commit to ndickson-nvidia/dgl that referenced this issue May 4, 2022
…py's SparseAdam::update, corresponding with the bug fixed in the non-distributed version in dmlc#3013 , though using the newer Event-based approach from that corresponding function.  The race condition would often result in NaNs, like the previously fixed bug. dmlc#2760
nv-dlasalle pushed a commit that referenced this issue May 9, 2022
…rseAdagrad (#3971)

* * Fixed race condition bug in distributed/optim/pytorch/sparse_optim.py's SparseAdam::update, corresponding with the bug fixed in the non-distributed version in #3013 , though using the newer Event-based approach from that corresponding function.  The race condition would often result in NaNs, like the previously fixed bug. #2760

* * Fixed race condition bug in SparseAdagrad::update corresponding with the one fixed in SparseAdam::update in the previous commit.  Same info applies.

* * Fixed typo in all copies of a repeatedly-copied comment near bug fixed 3 commits ago, checking all implementations nearby for a corresponding bug.  (All of them appear to have been fixed as of 2 commits ago.)

* * Removed trailing whitespace

Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
Co-authored-by: Rhett Ying <85214957+Rhett-Ying@users.noreply.github.com>
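
For reference, a minimal sketch of the Event-based wait pattern these commit messages describe (hypothetical, not DGL's actual implementation; copy_to_host_async and the shapes are made up): an event is recorded right after each asynchronous device-to-host copy, and the CPU-side update synchronizes on that event only just before it reads the host buffer, deferring the wait to the point where the data is actually needed instead of blocking the whole stream immediately.

```python
import torch

def copy_to_host_async(t: torch.Tensor):
    """Start a GPU->CPU copy and return the host buffer plus an event that
    completes once the copy has finished."""
    host = torch.empty(t.shape, dtype=t.dtype, pin_memory=True)
    host.copy_(t, non_blocking=True)
    done = torch.cuda.Event()
    done.record(torch.cuda.current_stream())
    return host, done

# Several gradient tensors on the GPU (made-up shapes).
grads = [torch.randn(2048, 128, device="cuda") for _ in range(4)]

# Issue all copies up front; they stay in flight while other work proceeds.
pending = [copy_to_host_async(g) for g in grads]

for host, done in pending:
    done.synchronize()   # wait only for this tensor's copy to land
    host.mul_(0.9)       # CPU-side update is now safe to touch the buffer
```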
VibhuJawa pushed a commit to VibhuJawa/dgl that referenced this issue Jun 17, 2022
…rseAdagrad (dmlc#3971)