Single-node Multi-GPU training throws CUDA failure: an illegal memory access was encountered. #61
Additionally, we also tried to use:

```python
glt_dataset.init_node_features(
    node_feature_data=igbh_dataset.feat_dict,
    with_gpu=False,
    # split_ratio=0.15 * min(num_gpus, 4),
    # device_group_list=[
    #     glt.data.DeviceGroup(idx, group)
    #     for idx, group in enumerate(gpu_groups)
    # ]
)
```

and the following error is thrown:

```
Traceback (most recent call last):
  File "min_rep.py", line 158, in <module>
    torch.multiprocessing.spawn(run,
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/workspace/repository/min_rep.py", line 75, in run
    for batch in train_loader:
  File "/usr/local/lib/python3.8/dist-packages/graphlearn_torch/loader/neighbor_loader.py", line 104, in __next__
    result = self._collate_fn(out)
  File "/usr/local/lib/python3.8/dist-packages/graphlearn_torch/loader/node_loader.py", line 99, in _collate_fn
    x_dict = {ntype : self.data.get_node_feature(ntype)[ids] for ntype, ids in sampler_out.node.items()}
  File "/usr/local/lib/python3.8/dist-packages/graphlearn_torch/loader/node_loader.py", line 99, in <dictcomp>
    x_dict = {ntype : self.data.get_node_feature(ntype)[ids] for ntype, ids in sampler_out.node.items()}
  File "/usr/local/lib/python3.8/dist-packages/graphlearn_torch/data/feature.py", line 145, in __getitem__
    return self.cpu_get(ids)
  File "/usr/local/lib/python3.8/dist-packages/graphlearn_torch/data/feature.py", line 163, in cpu_get
    return self.feature_tensor[ids]
IndexError: index 1004440 is out of bounds for dimension 0 with size 1000000
```

We have checked the train indices and they are all within 1000000 (min 0, max 599999), so we're not sure where the index 1004440 is coming from.
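For anyone debugging a similar out-of-bounds lookup, a quick sanity check is to validate the sampled node IDs against the feature tensor's row count before indexing. This is a minimal sketch using plain PyTorch tensors; the helper name `check_ids_in_bounds` is ours for illustration, not part of the graphlearn-torch API:

```python
import torch

def check_ids_in_bounds(ids: torch.Tensor, num_rows: int) -> None:
    """Fail early with a clear message if any sampled ID is out of range."""
    bad = ids[(ids < 0) | (ids >= num_rows)]
    if bad.numel() > 0:
        raise IndexError(
            f"{bad.numel()} sampled IDs fall outside [0, {num_rows}); "
            f"first offender: {bad[0].item()}"
        )

# Reproduces the shape of the failure above: one ID past the 1,000,000-row table.
ids = torch.tensor([0, 599999, 1004440])
try:
    check_ids_in_bounds(ids, 1_000_000)
except IndexError as e:
    print(e)
```

Running a check like this inside the training loop (or on `sampler_out.node` values) can tell whether the bad IDs come from the sampler itself or from how the feature table was partitioned.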
I think this has been solved by #62. Would you try it again?
🐛 Describe the bug
Hello, we're trying to modify `examples/igbh/train_rgnn.py` so that it supports single-node multi-GPU training. However, when following the OGBN single-node multi-GPU training example, we encountered

```
CUDA failure /workspace/graphlearn/graphlearn_torch/csrc/cuda/unified_tensor.cu:351: 'an illegal memory access was encountered'
```

errors when loading the first batch of data. Here is a minimal reproduction for the issue (we removed the validation & test datasets for simplicity):

After placing the above code under `examples/igbh` and running it with the command

```
python3 min_rep.py --path /data --dataset_size small --num_classes 2983 --epochs 3 --log_every 1
```

it outputs the following error messages:

Could you look into this issue and share some insights on fixing it? We have been using the default `dataset.py` and `rgnn.py` provided under `examples/igbh` for running this code.

Environment