RuntimeError: copy_if failed to synchronize: device-side assert triggered #658

yxchng · 2019-04-10T06:42:03Z

🐛 Bug

...
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [0,0,0], thread: [41,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
...

Traceback (most recent call last):
  File "tools/train_net.py", line 174, in <module>
    main()
  File "tools/train_net.py", line 167, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/engine/trainer.py", line 66, in do_train
    loss_dict = model(images, targets)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/modeling/rpn/rpn.py", line 159, in forward
    return self._forward_train(anchors, objectness, rpn_box_regression, targets)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/modeling/rpn/rpn.py", line 175, in _forward_train
    anchors, objectness, rpn_box_regression, targets
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/modeling/rpn/inference.py", line 138, in forward
    sampled_boxes.append(self.forward_for_single_feature_map(a, o, b))
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/modeling/rpn/inference.py", line 113, in forward_for_single_feature_map
    boxlist = remove_small_boxes(boxlist, self.min_size)
  File "/run/mount/sdd1/maskrcnn_wrapper/env/lib/python3.5/site-packages/maskrcnn_benchmark-0.1-py3.5-linux-x86_64.egg/maskrcnn_benchmark/structures/boxlist_ops.py", line 47, in remove_small_boxes
    (ws >= min_size) & (hs >= min_size)
RuntimeError: copy_if failed to synchronize: device-side assert triggered

This may be similar to #229, but the message is slightly different: #229 reports "an illegal memory access was encountered", whereas what I get is "device-side assert triggered".
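A note on reading this traceback: CUDA kernels are launched asynchronously, so a device-side assert is reported at the next call that synchronizes with the GPU (here the boolean mask in remove_small_boxes), which is not necessarily the call that launched the failing kernel. The assertion itself comes from IndexKernel.cu, which points to an indexing operation that received an out-of-range index. A minimal sketch of how to request synchronous launches so the Python traceback lands on the failing operation, assuming the variable is set before CUDA is initialized; the helper name `enable_blocking_launches` is hypothetical:

```python
import os


def enable_blocking_launches(env=None):
    """Set CUDA_LAUNCH_BLOCKING=1 in the given environment mapping.

    Hypothetical helper for illustration. With this variable set, CUDA
    kernel launches run synchronously, so errors surface at the call
    that caused them.
    """
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Call this before anything initializes CUDA in the process.
enable_blocking_launches()
```

The same effect can be had by exporting CUDA_LAUNCH_BLOCKING=1 in the shell before starting training. Synchronous launches slow training down, so this is only meant for debugging runs.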

I have changed the NUM_CLASSES as well.
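Since NUM_CLASSES was changed, one thing worth ruling out is a label that falls outside the configured range, which is a common source of "index out of bounds" asserts in general. This thread does not confirm that as the cause here. A minimal sketch, assuming labels are plain integers and that NUM_CLASSES counts the background class as index 0; the function name `out_of_range_labels` is hypothetical:

```python
def out_of_range_labels(labels, num_classes):
    """Return the sorted distinct labels that fall outside [0, num_classes).

    Hypothetical helper: `labels` is any iterable of integer class ids
    collected from the dataset annotations.
    """
    return sorted({int(label) for label in labels if not 0 <= int(label) < num_classes})


# Example: with 4 classes (background + 3), the valid ids are 0..3.
bad = out_of_range_labels([1, 2, 5, 3, 5], num_classes=4)
print(bad)
```

Running a check like this once over the dataset on the CPU is cheap, and an empty result rules this cause out.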

To Reproduce

Steps to reproduce the behavior:

Run training code

Expected behavior

No error

Environment

PyTorch version: 1.0.0.dev20190409
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 16.04.4 LTS
GCC version: (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010
CMake version: version 3.5.1

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: TITAN X (Pascal)

Nvidia driver version: 418.39
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
Pillow (6.0.0)


yxchng · 2019-04-10T08:55:49Z

According to #275, a learning rate that is too large may cause this problem, although the error message does not seem related at all. I am still verifying this.

yxchng · 2019-04-11T00:49:06Z

A learning rate that is too large is indeed the problem. Lowering the learning rate solves it.
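The thread does not explain why a large learning rate ends in an indexing assert. One plausible reading, offered as an assumption, is that training diverges and produces non-finite values that later reach an indexing operation. A minimal sketch of a guard that catches divergence earlier, assuming the losses come as a dict of scalar values like the `loss_dict` in the traceback; the function name and the key names in the example are illustrative:

```python
import math


def non_finite_losses(loss_dict):
    """Return the names of losses whose value is NaN or infinite.

    Hypothetical helper. Values may be Python floats or anything that
    float() accepts, such as a single-element tensor.
    """
    return [name for name, value in loss_dict.items() if not math.isfinite(float(value))]


# Illustrative values only, not taken from a real run.
diverged = non_finite_losses({"loss_objectness": 0.7, "loss_rpn_box_reg": float("nan")})
print(diverged)
```

Stopping at the first iteration where this list is non-empty gives a clearer signal than the CUDA assert that follows later.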

dk21121 · 2019-09-17T13:57:39Z

> Having learning rate that is too large is indeed the problem. Lowering the learning rate solves the problem.

I am hitting the same issue. Can you tell me how you reduced your learning rate? Is it just a matter of experience, lowering the value a little from the starting point? Thanks.
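One common starting point for choosing a lower value is to scale the learning rate linearly with the batch size. A minimal sketch, assuming a reference schedule of learning rate 0.02 at 16 images per batch; these defaults are an assumption about the project's stock configs and should be checked against the config actually in use, and the function name `scaled_lr` is hypothetical:

```python
def scaled_lr(batch_size, reference_lr=0.02, reference_batch_size=16):
    """Scale a reference learning rate linearly with the batch size.

    The reference values are assumptions for illustration; substitute
    the ones from the config the model was originally tuned with.
    """
    return reference_lr * batch_size / reference_batch_size


# Example: training with 2 images per batch instead of 16.
lr = scaled_lr(2)
print(lr)
```

If training still diverges at the scaled value, lowering it further by trial is reasonable. In this project the value is normally set through the SOLVER.BASE_LR config option.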

PyNancy · 2020-05-27T14:20:59Z

A batch size that is too large can also cause this issue.

Enn29 · 2020-09-15T07:49:27Z

Hello, I have hit the same issue. I reduced the learning rate, but that did not resolve it. Could you help me resolve the issue? Thanks!

The error is below:
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\maskrcnn-benchmark\maskrcnn_benchmark\engine\trainer.py", line 88, in do_train
    loss_dict = model(images, targets)
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\apex-0.1-py3.7-win-amd64.egg\apex\amp_initialize.py", line 194, in new_fwd
    **applier(kwargs, input_caster))
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\maskrcnn-benchmark\maskrcnn_benchmark\modeling\detector\generalized_rcnn.py", line 60, in forward
    x, result, detector_losses = self.roi_heads(features, proposals, targets)
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\maskrcnn-benchmark\maskrcnn_benchmark\modeling\roi_heads\roi_heads.py", line 26, in forward
    x, detections, loss_box = self.box(features, proposals, targets)
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\maskrcnn-benchmark\maskrcnn_benchmark\modeling\roi_heads\box_head\box_head.py", line 56, in forward
    [class_logits], [box_regression]
  File "C:\Anaconda3\envs\maskrcnn_benchmark\lib\site-packages\maskrcnn-benchmark\maskrcnn_benchmark\modeling\roi_heads\box_head\loss.py", line 151, in __call__
    sampled_pos_inds_subset = torch.nonzero(labels > 0).squeeze(1)
RuntimeError: copy_if failed to synchronize: device-side assert triggered

yxchng closed this as completed Apr 11, 2019

rusty1s mentioned this issue Sep 5, 2019

The data required by GCNConv pyg-team/pytorch_geometric#669

Closed

duanzhiihao mentioned this issue Sep 15, 2020

confused about the format of your labeled datasets CEPDOF duanzhiihao/RAPiD#8

Open
