This repository has been archived by the owner on Nov 21, 2023. It is now read-only.

Can not run test case and inference #260

Open
ghost opened this issue Mar 9, 2018 · 28 comments

Comments

@ghost

ghost commented Mar 9, 2018

I installed Detectron, and everything seemed fine until I ran the test case:

=============================
python test_spatial_narrow_as_op.py

It failed with the following message:

Found Detectron ops lib: /home/xxx/anaconda2/lib/libcaffe2_detectron_ops_gpu.so
I0308 22:07:29.731431 29562 operator.cc:173] Operator with engine CUDNN is not available for operator SpatialNarrowAs.
FI0308 22:07:30.126979 29562 operator.cc:173] Operator with engine CUDNN is not available for operator SpatialNarrowAs.
I0308 22:07:30.382014 29562 operator.cc:173] Operator with engine CUDNN is not available for operator SpatialNarrowAs.
.I0308 22:07:30.383860 29562 operator.cc:173] Operator with engine CUDNN is not available for operator SpatialNarrowAs.
I0308 22:07:30.384814 29562 operator.cc:173] Operator with engine CUDNN is not available for operator SpatialNarrowAs.
I0308 22:07:30.385095 29562 operator.cc:173] Operator with engine CUDNN is not available for operator SpatialNarrowAsGradient.
E

ERROR: test_small_forward_and_gradient (__main__.SpatialNarrowAsOpTest)

Traceback (most recent call last):
File "test_spatial_narrow_as_op.py", line 59, in test_small_forward_and_gradient
self._run_test(A, B, check_grad=True)
File "test_spatial_narrow_as_op.py", line 49, in _run_test
res, grad, grad_estimated = gc.CheckSimple(op, [A, B], 0, [0])

success = RunOperatorOnce(op)

File "/home/xxxx/anaconda2/lib/python2.7/site-packages/caffe2/python/workspace.py", line 179, in RunOperatorOnce
return C.run_operator_once(StringifyProto(operator))
RuntimeError: [enforce fail at context_gpu.h:171] . Encountered CUDA error: no kernel image is available for execution on the device Error from operator:
input: "A" input: "B" input: "C_grad" output: "A_grad" name: "" type: "SpatialNarrowAsGradient" device_option { device_type: 1 cuda_gpu_id: 0 } is_gradient_op: true

======================================================================
FAIL: test_large_forward (__main__.SpatialNarrowAsOpTest)

Traceback (most recent call last):
File "test_spatial_narrow_as_op.py", line 68, in test_large_forward
self._run_test(A, B)
File "test_spatial_narrow_as_op.py", line 54, in _run_test
np.testing.assert_allclose(C, C_ref, rtol=1e-5, atol=1e-08)
File "/home/xxx/anaconda2/lib/python2.7/site-packages/numpy/testing/nose_tools/utils.py", line 1396, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)

raise AssertionError(msg)

AssertionError:
Not equal to tolerance rtol=1e-05, atol=1e-08

(mismatch 100.0%)
x: array([[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],...
y: array([[[[ 3.099715e-01, -1.291913e+00, -2.825952e-01, ...,
-2.258663e-01, -8.814982e-01, 4.408140e-01],
[ 1.377446e+00, 1.170039e+00, 1.164714e-01, ...,...


Ran 3 tests in 1.078s

FAILED (failures=1, errors=1)

=======================================================


When I run inference with:

python2 tools/infer_simple.py \
    --cfg configs/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml \
    --output-dir /tmp/detectron-visualizations \
    --image-ext jpg \
    --wts https://s3-us-west-2.amazonaws.com/detectron/35861858/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml.02_32_51.SgT4y1cO/output/train/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl \
    demo

I got the following error:

I0308 22:03:05.297256 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.297796 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.298099 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.298406 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.298660 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.298704 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.299007 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.299317 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.299623 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.299666 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.299965 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.300297 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.300607 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.300649 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.300714 31934 operator.cc:173] Operator with engine CUDNN is not available for operator StopGradient.
I0308 22:03:05.300990 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.301300 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.301609 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.301867 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.301910 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.302211 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.302521 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.302832 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.302876 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.303180 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.303493 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.303802 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.303844 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.304164 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.304476 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.304787 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.304831 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.305145 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.305461 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.

I0308 22:03:05.460626 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sigmoid.
I0308 22:03:05.460695 31934 net_dag_utils.cc:118] Operator graph pruning prior to chain compute took: 1.714e-05 secs
I0308 22:03:05.460738 31934 net_dag.cc:61] Number of parallel execution chains 5 Number of operators = 18
INFO infer_simple.py: 111: Processing demo/16004479832_a748d55f21_k.jpg -> /tmp/detectron-visualizations/16004479832_a748d55f21_k.jpg.pdf
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
what(): [enforce fail at context_gpu.h:171] . Encountered CUDA error: no kernel image is available for execution on the device Error from operator:
input: "gpu_0/res2_0_branch2c_bn" input: "gpu_0/res2_0_branch1_bn" output: "gpu_0/res2_0_branch2c_bn" name: "" type: "Sum" device_option { device_type: 1 cuda_gpu_id: 0 } debug_info: " File "tools/infer_simple.py", line 147, in \n main(args)\n File
*** Aborted at 1520575410 (unix time) try "date -d @1520575410" if you are using GNU date ***
PC: @ 0x7f1d09bad428 gsignal
*** SIGABRT (@0x3e800007cbe) received by PID 31934 (TID 0x7f1cba292700) from PID 31934; stack trace: ***
@ 0x7f1d0a663390 (unknown)
@ 0x7f1d09bad428 gsignal
@ 0x7f1d09baf02a abort
@ 0x7f1d031bdb39 __gnu_cxx::__verbose_terminate_handler()
@ 0x7f1d031bc1fb __cxxabiv1::__terminate()
@ 0x7f1d031bc234 std::terminate()
@ 0x7f1d031d7c8a execute_native_thread_routine_compat
@ 0x7f1d0a6596ba start_thread
@ 0x7f1d09c7f41d clone
Aborted

  • Operating system: Ubuntu
  • Compiler version: gcc
  • CUDA version: 9.0
  • cuDNN version: 7.0
  • NVIDIA driver version: ?
  • GPU models (for all devices if they are not all the same): TITAN
  • PYTHONPATH environment variable: ?
  • python --version output: ?
  • Anything else that seems relevant: ?
@mlprt

mlprt commented Mar 12, 2018

Similar errors:

  1. On python tests/test_spatial_narrow_as_op.py:

Found Detectron ops lib: /home/xxxx/anaconda3/envs/detectron/lib/libcaffe2_detectron_ops_gpu.so
F.E

ERROR: test_small_forward_and_gradient (__main__.SpatialNarrowAsOpTest)

Traceback (most recent call last):
File "tests/test_spatial_narrow_as_op.py", line 59, in test_small_forward_and_gradient
self._run_test(A, B, check_grad=True)
File "tests/test_spatial_narrow_as_op.py", line 49, in _run_test
res, grad, grad_estimated = gc.CheckSimple(op, [A, B], 0, [0])
File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/gradient_checker.py", line 284, in CheckSimple
outputs_with_grads
File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/gradient_checker.py", line 201, in GetLossAndGrad
workspace.RunOperatorsOnce(grad_ops)
File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/workspace.py", line 184, in RunOperatorsOnce
success = RunOperatorOnce(op)
File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/workspace.py", line 179, in RunOperatorOnce
return C.run_operator_once(StringifyProto(operator))
RuntimeError: [enforce fail at context_gpu.h:171] . Encountered CUDA error: no kernel image is available for execution on the device Error from operator:
input: "A" input: "B" input: "C_grad" output: "A_grad" name: "" type: "SpatialNarrowAsGradient" device_option { device_type: 1 cuda_gpu_id: 0 } is_gradient_op: true

======================================================================
FAIL: test_large_forward (__main__.SpatialNarrowAsOpTest)

Traceback (most recent call last):
File "tests/test_spatial_narrow_as_op.py", line 68, in test_large_forward
self._run_test(A, B)
File "tests/test_spatial_narrow_as_op.py", line 54, in _run_test
np.testing.assert_allclose(C, C_ref, rtol=1e-5, atol=1e-08)
File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/numpy/testing/nose_tools/utils.py", line 1396, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/numpy/testing/nose_tools/utils.py", line 779, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-05, atol=1e-08

(mismatch 100.0%)
x: array([[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],...
y: array([[[[-1.243985, -2.407127, 1.165339, ..., -0.023202, -0.096644,
-0.096511],
[-0.640857, -0.977031, 0.745425, ..., -0.049333, -1.520961,...

Ran 3 tests in 0.519s

FAILED (failures=1, errors=1)


  2. On python2 tools/infer_simple.py --cfg configs/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml --output-dir /tmp/detectron-visualizations --image-ext jpg --wts https://s3-us-west-2.amazonaws.com/detectron/35861858/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml.02_32_51.SgT4y1cO/output/train/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl demo

WARNING cnn.py: 40: [====DEPRECATE WARNING====]: you are creating an object from CNNModelHelper class which will be deprecated soon. Please use ModelHelper object with brew module. For more information, please refer to caffe2.ai and python/brew.py, python/brew_test.py for more information.
INFO net.py: 57: Loading weights from: /tmp/detectron-download-cache/35861858/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml.02_32_51.SgT4y1cO/output/train/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl
I0312 13:09:24.344396 378 net_dag_utils.cc:118] Operator graph pruning prior to chain compute took: 0.000140145 secs
I0312 13:09:24.344605 378 net_dag.cc:61] Number of parallel execution chains 63 Number of operators = 402
I0312 13:09:24.362937 378 net_dag_utils.cc:118] Operator graph pruning prior to chain compute took: 0.000125812 secs
I0312 13:09:24.363134 378 net_dag.cc:61] Number of parallel execution chains 30 Number of operators = 358
I0312 13:09:24.364900 378 net_dag_utils.cc:118] Operator graph pruning prior to chain compute took: 8.807e-06 secs
I0312 13:09:24.364929 378 net_dag.cc:61] Number of parallel execution chains 5 Number of operators = 18
INFO infer_simple.py: 111: Processing demo/24274813513_0cfd2ce6d0_k.jpg -> /tmp/detectron-visualizations/24274813513_0cfd2ce6d0_k.jpg.pdf
E0312 13:09:24.806742 393 net_dag.cc:203] Exception from operator '' (type 'Sum'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:171] . Encountered CUDA error: no kernel image is available for execution on the device Error from operator:
input: "gpu_0/res2_0_branch2c_bn" input: "gpu_0/res2_0_branch1_bn" output: "gpu_0/res2_0_branch2c_bn" name: "" type: "Sum" device_option { device_type: 1 cuda_gpu_id: 0 } debug_info: " File "tools/infer_simple.py", line 147, in \n main(args)\n File "tools/infer_simple.py", line 99, in main\n model = infer_engine.initialize_model_from_cfg()\n File "/home/xxxx/opt/detectron/lib/core/test_engine.py", line 266, in initialize_model_from_cfg\n model = model_builder.create(cfg.MODEL.TYPE, train=False, gpu_id=gpu_id)\n File "/home/xxxx/opt/detectron/lib/modeling/model_builder.py", line 124, in create\n return get_func(model_type_func)(model)\n File "/home/xxxx/opt/detectron/lib/modeling/model_builder.py", line 89, in generalized_rcnn\n freeze_conv_body=cfg.TRAIN.FREEZE_CONV_BODY\n File "/home/xxxx/opt/detectron/lib/modeling/model_builder.py", line 229, in build_generic_detection_model\n optim.build_data_parallel_model(model, _single_gpu_build_func)\n File "/home/xxxx/opt/detectron/lib/modeling/optimizer.py", line 54, in build_data_parallel_model\n single_gpu_build_func(model)\n File "/home/xxxx/opt/detectron/lib/modeling/model_builder.py", line 169, in _single_gpu_build_func\n blob_conv, dim_conv, spatial_scale_conv = add_conv_body_func(model)\n File "/home/xxxx/opt/detectron/lib/modeling/FPN.py", line 62, in add_fpn_ResNet101_conv5_body\n model, ResNet.add_ResNet101_conv5_body, fpn_level_info_ResNet101_conv5\n File "/home/xxxx/opt/detectron/lib/modeling/FPN.py", line 103, in add_fpn_onto_conv_body\n conv_body_func(model)\n File "/home/xxxx/opt/detectron/lib/modeling/ResNet.py", line 46, in add_ResNet101_conv5_body\n return add_ResNet_convX_body(model, (3, 4, 23, 3))\n File "/home/xxxx/opt/detectron/lib/modeling/ResNet.py", line 101, in add_ResNet_convX_body\n s, dim_in = add_stage(model, 'res2', p, n1, dim_in, 256, dim_bottleneck, 1)\n File "/home/xxxx/opt/detectron/lib/modeling/ResNet.py", line 83, in add_stage\n inplace_sum=i < n - 1\n File 
"/home/xxxx/opt/detectron/lib/modeling/ResNet.py", line 187, in add_residual_block\n s = model.net.Sum([tr, sc], tr)\n File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/core.py", line 2047, in \n op_type, *args, **kwargs)\n File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/core.py", line 2024, in _CreateAndAddToSelf\n op = CreateOperator(op_type, inputs, outputs, **kwargs)\n"
Original python traceback for operator 14 in network generalized_rcnn in exception above (most recent call last):
File "tools/infer_simple.py", line 147, in
File "tools/infer_simple.py", line 99, in main
File "/home/xxxx/opt/detectron/lib/core/test_engine.py", line 266, in initialize_model_from_cfg
File "/home/xxxx/opt/detectron/lib/modeling/model_builder.py", line 124, in create
File "/home/xxxx/opt/detectron/lib/modeling/model_builder.py", line 89, in generalized_rcnn
File "/home/xxxx/opt/detectron/lib/modeling/model_builder.py", line 229, in build_generic_detection_model
File "/home/xxxx/opt/detectron/lib/modeling/optimizer.py", line 54, in build_data_parallel_model
File "/home/xxxx/opt/detectron/lib/modeling/model_builder.py", line 169, in _single_gpu_build_func
File "/home/xxxx/opt/detectron/lib/modeling/FPN.py", line 62, in add_fpn_ResNet101_conv5_body
File "/home/xxxx/opt/detectron/lib/modeling/FPN.py", line 103, in add_fpn_onto_conv_body
File "/home/xxxx/opt/detectron/lib/modeling/ResNet.py", line 46, in add_ResNet101_conv5_body
File "/home/xxxx/opt/detectron/lib/modeling/ResNet.py", line 101, in add_ResNet_convX_body
File "/home/xxxx/opt/detectron/lib/modeling/ResNet.py", line 83, in add_stage
File "/home/xxxx/opt/detectron/lib/modeling/ResNet.py", line 187, in add_residual_block
Traceback (most recent call last):
File "tools/infer_simple.py", line 147, in
main(args)
File "tools/infer_simple.py", line 117, in main
model, im, None, timers=timers
File "/home/xxxx/opt/detectron/lib/core/test.py", line 65, in im_detect_all
scores, boxes, im_scales = im_detect_bbox(model, im, box_proposals)
File "/home/xxxx/opt/detectron/lib/core/test.py", line 154, in im_detect_bbox
workspace.RunNet(model.net.Proto().name)
File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/workspace.py", line 230, in RunNet
StringifyNetName(name), num_iter, allow_fail,
File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/workspace.py", line 192, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [enforce fail at context_gpu.h:171] . Encountered CUDA error: no kernel image is available for execution on the device Error from operator:
input: "gpu_0/res2_0_branch2c_bn" input: "gpu_0/res2_0_branch1_bn" output: "gpu_0/res2_0_branch2c_bn" name: "" type: "Sum" device_option { device_type: 1 cuda_gpu_id: 0 } debug_info: " File "tools/infer_simple.py", line 147, in \n main(args)\n File "tools/infer_simple.py", line 99, in main\n model = infer_engine.initialize_model_from_cfg()\n File "/home/xxxx/opt/detectron/lib/core/test_engine.py", line 266, in initialize_model_from_cfg\n model = model_builder.create(cfg.MODEL.TYPE, train=False, gpu_id=gpu_id)\n File "/home/xxxx/opt/detectron/lib/modeling/model_builder.py", line 124, in create\n return get_func(model_type_func)(model)\n File "/home/xxxx/opt/detectron/lib/modeling/model_builder.py", line 89, in generalized_rcnn\n freeze_conv_body=cfg.TRAIN.FREEZE_CONV_BODY\n File "/home/xxxx/opt/detectron/lib/modeling/model_builder.py", line 229, in build_generic_detection_model\n optim.build_data_parallel_model(model, _single_gpu_build_func)\n File "/home/xxxx/opt/detectron/lib/modeling/optimizer.py", line 54, in build_data_parallel_model\n single_gpu_build_func(model)\n File "/home/xxxx/opt/detectron/lib/modeling/model_builder.py", line 169, in _single_gpu_build_func\n blob_conv, dim_conv, spatial_scale_conv = add_conv_body_func(model)\n File "/home/xxxx/opt/detectron/lib/modeling/FPN.py", line 62, in add_fpn_ResNet101_conv5_body\n model, ResNet.add_ResNet101_conv5_body, fpn_level_info_ResNet101_conv5\n File "/home/xxxx/opt/detectron/lib/modeling/FPN.py", line 103, in add_fpn_onto_conv_body\n conv_body_func(model)\n File "/home/xxxx/opt/detectron/lib/modeling/ResNet.py", line 46, in add_ResNet101_conv5_body\n return add_ResNet_convX_body(model, (3, 4, 23, 3))\n File "/home/xxxx/opt/detectron/lib/modeling/ResNet.py", line 101, in add_ResNet_convX_body\n s, dim_in = add_stage(model, 'res2', p, n1, dim_in, 256, dim_bottleneck, 1)\n File "/home/xxxx/opt/detectron/lib/modeling/ResNet.py", line 83, in add_stage\n inplace_sum=i < n - 1\n File 
"/home/xxxx/opt/detectron/lib/modeling/ResNet.py", line 187, in add_residual_block\n s = model.net.Sum([tr, sc], tr)\n File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/core.py", line 2047, in \n op_type, *args, **kwargs)\n File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/core.py", line 2024, in _CreateAndAddToSelf\n op = CreateOperator(op_type, inputs, outputs, **kwargs)\n"

  • OS: Ubuntu 17.04
  • CUDA 9.0
  • cuDNN 7
  • NVIDIA Driver 390.12
  • GPU: TITAN Xp
  • $PYTHONPATH: empty
  • python --version: Python 2.7.14 :: Anaconda, Inc.

@gecong

gecong commented Mar 13, 2018

same here

@anatlin

anatlin commented Mar 18, 2018

any update on this?

@xmengli

xmengli commented Mar 19, 2018

@gecong @anatlin

RuntimeError: [enforce fail at context_gpu.h:171] . Encountered CUDA error: no kernel image is available for execution on the device Error from operator:
input: "A" input: "B" input: "C_grad" output: "A_grad" name: "" type: "SpatialNarrowAsGradient" device_option { device_type: 1 cuda_gpu_id: 0 } is_gradient_op: true

I solved this by adding export PYTHONPATH=$PYTHONPATH:/home/user/caffe2/build to my .bashrc file.

@ghost
Author

ghost commented Mar 20, 2018

@xmengli999

I do not seem to have a caffe2/build directory on my machine.

I installed Caffe2 with Anaconda, and I have the following directories under the package folder:

~/anaconda2/pkgs/caffe2-cuda9.0-cudnn7-0.8.dev-py27h4e2c0f2_0$ ls -ltr
total 24
drwxrwxr-x 3 gcong gcong 4096 Mar 7 22:04 share
drwxrwxr-x 6 gcong gcong 4096 Mar 7 22:04 include
drwxrwxr-x 2 gcong gcong 4096 Mar 7 22:04 bin
drwxrwxr-x 2 gcong gcong 4096 Mar 7 22:04 test
drwxrwxr-x 4 gcong gcong 4096 Mar 7 22:04 lib
drwxrwxr-x 4 gcong gcong 4096 Mar 7 22:04 info

Could you let me know how I can change the PYTHONPATH?

Thanks a lot
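
Before changing PYTHONPATH, it can help to see which copy of a package Python would actually import. The following is a minimal sketch using only the standard library, written for Python 3; the helper names `locate_package` and `pythonpath_entries` are made up for illustration, and `caffe2` is simply the package name from this thread. On the Python 2.7 setups discussed here, `import caffe2; print(caffe2.__file__)` should give the same information.

```python
import importlib.util
import os


def locate_package(name):
    """Return the file Python would import `name` from, or None if not found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    return None if spec is None else spec.origin


def pythonpath_entries():
    """Return the non-empty entries of the PYTHONPATH environment variable."""
    raw = os.environ.get("PYTHONPATH", "")
    return [entry for entry in raw.split(os.pathsep) if entry]


if __name__ == "__main__":
    # "caffe2" is the package discussed in this thread; any name works.
    print("caffe2 resolves to:", locate_package("caffe2"))
    print("PYTHONPATH entries:", pythonpath_entries())
```

If the reported location is inside the conda environment, the conda package is the one in use and there is no separate build directory to add.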

@anatlin

anatlin commented Mar 21, 2018

@rbgirshick is this a caffe2 issue?

@mihaifieraru

I experience the same issue, any update?

@ghost

ghost commented Apr 3, 2018

same issue here!

@xmengli

xmengli commented Apr 4, 2018

@deeprun I built from source. You can give that a try.

@olegantonyan

Same problem. I haven't tried building Caffe2 from source yet.
openSUSE Tumbleweed, CUDA 9.0, cuDNN 7.1

@gzaripov

The same. cuda 9.0 cudnn 7.1

@rafagjordana

Same here, cuda 9.0 and cudnn 7.1.2

@AgrawalAmey

Facing the same issue on an Azure Data Science VM running Ubuntu 16.04, Anaconda 2, and a Tesla P40. The build directory is included in PYTHONPATH.

@apli

apli commented Apr 20, 2018

Same problem

@wuharvey

wuharvey commented May 2, 2018

Bump

@fengyicoder

Same problem, any update?

@macsermkiat

macsermkiat commented May 3, 2018

Me too
EDIT: I fixed this as follows:

  • I uninstalled CUDA 9.1, because my GPU is a Quadro M4000, which supports only CUDA 8.0
sudo apt-get remove cuda-9.1
sudo apt-get install cuda-8.0
  • Make sure to install the matching cuDNN version (7.1.2) and NCCL for CUDA 8.0
  • Uninstall Caffe2 and reinstall it with conda install -c caffe2 caffe2_cuda8.0_cudnn7
  • Fix .bashrc so that it points to the right directories
export PATH="$HOME/anaconda3/bin:/usr/local/cuda-8.0/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH"

So you have to make sure your GPU supports the CUDA version you install.

@BanuSelinTosun

@AgrawalAmey
I have been circling around the same issue for almost 1-1.5 weeks now on Azure Linux Ubuntu 16.04 DSVM. Did you come up with a resolution?

@gadcam
Contributor

gadcam commented Jul 11, 2018

It looks like PR pytorch/pytorch#7062 is trying to solve this problem (or at least part of it). Can you still reproduce the error with a version of Caffe2/PyTorch that includes this commit?

Another lead to follow could be fireice-uk/xmr-stak-nvidia#159 (comment) (see http://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/ to get the correct CUDA_ARCH number).
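
To make the architecture matching concrete, here is a minimal sketch of the compatibility rule as I understand it from the CUDA documentation: machine code built for `sm_XY` runs only on devices with the same major version and an equal or higher minor version, while embedded PTX (`compute_XY`) can be JIT-compiled for any equal or newer capability. The helper names `parse_arch` and `can_run` and the arch lists in the usage note are hypothetical, not the actual contents of any Caffe2 package.

```python
def parse_arch(tag):
    """Split a tag such as 'sm_61' or 'compute_61' into (kind, major, minor)."""
    kind, _, digits = tag.partition("_")
    if kind not in ("sm", "compute") or not digits.isdigit() or len(digits) < 2:
        raise ValueError("unrecognized arch tag: %r" % (tag,))
    return kind, int(digits[:-1]), int(digits[-1])


def can_run(device_cc, built_archs):
    """Return True if a device with capability (major, minor) can run the build."""
    dev_major, dev_minor = device_cc
    for tag in built_archs:
        kind, major, minor = parse_arch(tag)
        if kind == "sm":
            # Machine code: same major version, compiled minor <= device minor.
            if major == dev_major and minor <= dev_minor:
                return True
        else:
            # PTX: the driver can JIT it for any equal or newer capability.
            if (major, minor) <= (dev_major, dev_minor):
                return True
    return False
```

For example, `can_run((6, 1), ["sm_35", "sm_52"])` is false for a compute capability 6.1 device such as the TITAN Xp, which is the situation that produces "no kernel image is available for execution on the device". Adding `"compute_52"` or `"sm_61"` to the list makes it true.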

@gadcam
Contributor

gadcam commented Jul 11, 2018

@AgrawalAmey

@BanuSelinTosun Sorry, I couldn't find any solution for the issue.

@BanuSelinTosun

BanuSelinTosun commented Jul 12, 2018

@gadcam
Yes, that was the very first thing I tried. The problem with Azure DSVMs is that they already come with CUDA 9 and cuDNN 7, whereas Detectron wants CUDA 8 and cuDNN 6. There is an existing Caffe2 installation built for CUDA 9 and cuDNN 7, and it a) does not work with Detectron because of the version, and is installed for Python 3 rather than Python 2, and b) conflicts with new Caffe2 installations once CUDA 8 and cuDNN 6 are installed, even if I create everything in a new environment.

@AgrawalAmey
I tried to file an issue about this with Azure computing before July 4th, and they are not taking it very seriously. I even talked to one of the Principal Managers at Azure. He just suggested new approaches, which did not work.

@gadcam
Contributor

gadcam commented Jul 12, 2018

@BanuSelinTosun I had to install Detectron in a very similar environment.

What I would do (I do not know if it is possible in your environment):

  • Uninstall Caffe2
  • Uninstall CUDA & cuDNN
  • Reinstall the correct versions of CUDA & cuDNN (you may also have to switch the graphics card driver version in some setups, if I recall correctly)
  • Build Caffe2 again, specifying a CUDA_ARCH (you may also have to check that it links against the correct Python via PYTHON_EXECUTABLE / PYTHON_INCLUDE_DIR / PYTHON_LIBRARY)
  • Run the tests
  • If everything goes well up to this point, then you should be able to install Detectron

I think it will not be that easy but maybe it will raise some new errors which will give us some new hints to go further.

BTW I do not think Detectron needs CUDA 8: my install is with CUDA Version 9.0.176 & cuDNN 7.0.5.

EDIT : maybe this can be of some help https://docs.microsoft.com/en-US/azure/virtual-machines/linux/n-series-driver-setup#ubuntu-1604-lts

@BanuSelinTosun

BanuSelinTosun commented Jul 13, 2018 via email

@BanuSelinTosun

Ok. I have a working detectron now.
My solution was using the docker image path. It works. does not matter whether you have cuda 9 or 8, cudnn 7 or 6, whatever caffe2 version... it works!
And @AgrawalAmey it works in Azure Linux DSVM. :D

@remcova

remcova commented Oct 10, 2018

> Ok. I have a working detectron now.
> My solution was using the docker image path. It works. does not matter whether you have cuda 9 or 8, cudnn 7 or 6, whatever caffe2 version... it works!
> And @AgrawalAmey it works in Azure Linux DSVM. :D

Could you explain your solution in more detail? I have this problem and I am having a hard time solving it.

@paritoshgote

@BanuSelinTosun : Could you please explain your solution using docker in more detail?

@yfzon

yfzon commented Nov 20, 2018

I hit a similar error when using a TensorFlow op compiled with nvcc: Could not launch cub::DeviceSegmentedRadixSort::SortPairsDescending to sort input, temp_storage_bytes: 599295, status: no kernel image is available for execution on the device. I found this issue and realized that the error is caused by the GPU compute capability. I use a Tesla40 and added -gencode arch=compute_61,code=compute_61 to my compile file, which finally solved it. Hope this helps.
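
For build scripts that need to produce such flags, a small helper along these lines may be handy. It is only a sketch: `gencode_flags` is a made-up name, and it assumes you already know your card's compute capability as a "major.minor" string (for example from the table linked earlier in the thread). It emits a machine-code target and, optionally, a PTX target like the one in the flag above.

```python
def gencode_flags(compute_capability, include_ptx=True):
    """Build nvcc -gencode arguments for a capability string such as '6.1'."""
    major, _, minor = compute_capability.partition(".")
    if not (major.isdigit() and minor.isdigit()):
        raise ValueError("expected 'major.minor', got %r" % (compute_capability,))
    arch = major + minor
    # Machine code for exactly this architecture.
    flags = ["-gencode", "arch=compute_%s,code=sm_%s" % (arch, arch)]
    if include_ptx:
        # PTX that the driver can JIT-compile on this or newer architectures.
        flags += ["-gencode", "arch=compute_%s,code=compute_%s" % (arch, arch)]
    return flags


if __name__ == "__main__":
    print(" ".join(gencode_flags("6.1")))
```

The returned list can be appended to an nvcc command line as is; pass `include_ptx=False` to leave out the PTX target.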