Can not run test case and inference #260
Comments
Similar errors:
Found Detectron ops lib: /home/xxxx/anaconda3/envs/detectron/lib/libcaffe2_detectron_ops_gpu.so
|
same here |
any update on this? |
RuntimeError: [enforce fail at context_gpu.h:171] . Encountered CUDA error: no kernel image is available for execution on the device Error from operator:
I solved this by adding export PYTHONPATH=$PYTHONPATH:/home/user/caffe2/build to my .bashrc file. |
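For anyone checking whether a PYTHONPATH change like this took effect, here is a minimal sketch that reports where Python would load `caffe2` from. The helper name `locate_module` is made up for illustration and is not part of Caffe2 or Detectron. It is written for Python 3, whereas the environments in this thread are Python 2.7, so treat it as illustrative only.

```python
import importlib.util
import os
import sys


def locate_module(name):
    """Return the file or directory a module would be imported from, or None."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. "caffe2" for "caffe2.python") is missing.
        return None
    if spec is None:
        return None
    if spec.origin not in (None, "built-in", "frozen"):
        return spec.origin
    # Namespace packages have no single origin; fall back to their first search path.
    locations = list(spec.submodule_search_locations or [])
    return locations[0] if locations else spec.origin


if __name__ == "__main__":
    print("PYTHONPATH =", os.environ.get("PYTHONPATH", "<unset>"))
    print("caffe2 resolves to:", locate_module("caffe2"))
```

If the second line prints `None`, the directory added to PYTHONPATH is probably not the one that contains the `caffe2` package.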
@xmengli999 I do not seem to have a caffe2/build directory on my machine. I installed Caffe2 with anaconda, and I have the following directories under ~/anaconda2/pkgs/caffe2-cuda9.0-cudnn7-0.8.dev-py27h4e2c0f2_0$ ls -ltr Could you let me know how I can change the PYTHONPATH? Thanks a lot |
@rbgirshick is this a caffe2 issue? |
I experience the same issue, any update? |
same issue here! |
@deeprun I build from source. You can have a try. |
same problem. haven't tried to build caffe from sources |
The same. cuda 9.0 cudnn 7.1 |
Same here, cuda 9.0 and cudnn 7.1.2 |
Facing the same issue on azure data science vm, running on ubuntu 16.04, anaconda 2 and tesla p40. The build directory is included in |
Same problem |
Bump |
Same problem, any update? |
Me too
So it comes down to making sure your GPU supports the version of CUDA you are using |
@AgrawalAmey |
Looks like this PR is trying to solve the problem (or at least part of it): pytorch/pytorch#7062. Can you still reproduce it with a version of Caffe2/PyTorch that includes this commit? Another track to follow could be fireice-uk/xmr-stak-nvidia#159 (comment) (see http://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/ to get the correct CUDA_ARCH number) |
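A "no kernel image is available" error usually means the binaries were not compiled for the architecture of the GPU in use. As a rough sketch of the CUDA_ARCH matching mentioned above, the snippet below turns a compute capability into the corresponding nvcc `-gencode` flag. The table and the helper name `gencode_flag` are illustrative assumptions; only a few cards are listed, and how the flag is passed to a Caffe2 build depends on the build setup.

```python
# Compute capabilities for a few cards; the Tesla P40 is mentioned in this thread.
COMPUTE_CAPABILITY = {
    "Tesla K40": "3.5",
    "Tesla P100": "6.0",
    "Tesla P40": "6.1",
    "Tesla V100": "7.0",
}


def gencode_flag(capability):
    """Turn a compute capability such as "6.1" into an nvcc -gencode flag."""
    major, minor = capability.split(".")
    sm = "{}{}".format(int(major), int(minor))
    return "-gencode arch=compute_{0},code=sm_{0}".format(sm)


if __name__ == "__main__":
    for card, cap in sorted(COMPUTE_CAPABILITY.items()):
        print("{:<12} {}".format(card, gencode_flag(cap)))
```

For a card that is not in the table, look up its compute capability first (for example on the page linked above) and pass it to `gencode_flag`.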
@BanuSelinTosun Just a side question: did you try this? https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro#caffe2 |
@BanuSelimTosun Sorry, I couldn't find any solution for the issue. |
@gadcam @AgrawalAmey |
@BanuSelinTosun I had to install the Detectron in a very similar environment. What I would do (I do not know if it is possible in your environment)
I think it will not be that easy but maybe it will raise some new errors which will give us some new hints to go further. BTW I do not think Detectron needs CUDA 8: my install is with CUDA Version 9.0.176 & cuDNN 7.0.5. EDIT : maybe this can be of some help https://docs.microsoft.com/en-US/azure/virtual-machines/linux/n-series-driver-setup#ubuntu-1604-lts |
@gadcam, thank you for helping with this issue.
I had working caffe2 in the Azure DSVMs with Cuda 9 and cudnn 7
It should be working with those if it worked for you.
I can install Detectron (run the makefile) without a problem. But when I run the Detectron test, it fails with failures=1, errors=1.
It always circles back to the same problem as it seems. :-(
I did not try running with inference, should I try that first?
…On Thu, Jul 12, 2018 at 3:37 PM, Camille Barneaud ***@***.***> wrote:
@BanuSelinTosun <https://github.com/BanuSelinTosun> I had to install the
Detectron in a very similar environment.
What I would do (I do not know if it is possible in your environment)
- Uninstall Caffe2
- Uninstall CUDA & cuDNN
- Reinstall correct versions of CUDA & cuDNN (you could also have to
switch the GC driver version in some setup if I recall correctly)
- Build Caffe2 again specifying a CUDA_ARCH (you could also have to
check that it links to correct and for Python PYTHON_EXECUTABLE /
PYTHON_INCLUDE_DIR / PYTHON_LIBRARY
- Run the tests
- If everything goes well up to this point then you should be able to
install the Detectron
I think it will not be that easy but maybe it will raise some new errors
which will give us some new hints to go further.
BTW I do not think Detectron needs CUDA 8: my install is with CUDA Version
9.0.176 & cuDNN 7.0.5.
|
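For the PYTHON_EXECUTABLE / PYTHON_INCLUDE_DIR / PYTHON_LIBRARY step quoted above, a small sketch like the following prints candidate values taken from the interpreter that is currently running. The helper name `python_cmake_hints` is hypothetical, and the printed `-D...` lines are only suggestions to check by hand; whether the Caffe2 build picks them up in this form is an assumption.

```python
import os
import sys
import sysconfig


def python_cmake_hints():
    """Collect candidate values for the three Python variables named above."""
    libdir = sysconfig.get_config_var("LIBDIR") or ""
    ldlibrary = sysconfig.get_config_var("LDLIBRARY") or ""
    return {
        "PYTHON_EXECUTABLE": sys.executable,
        "PYTHON_INCLUDE_DIR": sysconfig.get_paths()["include"],
        # LIBDIR/LDLIBRARY can be empty or name a static library on some builds.
        "PYTHON_LIBRARY": os.path.join(libdir, ldlibrary) if ldlibrary else "",
    }


if __name__ == "__main__":
    for key, value in python_cmake_hints().items():
        print("-D{}={}".format(key, value))
```

Run it with the same interpreter you intend to use for Detectron (for example the one inside the Anaconda environment), so the paths do not point at a different Python install.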
Ok. I have a working detectron now. |
Could you explain your solution in a bit more detail? I have this problem and I am having a hard time solving it. |
@BanuSelinTosun : Could you please explain your solution using docker in more detail? |
I hit a similar error when using a TensorFlow op compiled with nvcc: Could not launch cub::DeviceSegmentedRadixSort::SortPairsDescending to sort input, temp_storage_bytes: 599295, status: no kernel image is available for execution on the device. I found this issue and learned that it is caused by the GPU compute capability. I use Tesla40 and add |
I installed Detectron, and everything seemed to be fine until I ran the test case
=============================
python test_spatial_narrow_as_op.py
It failed with the following message:
Found Detectron ops lib: /home/xxx/anaconda2/lib/libcaffe2_detectron_ops_gpu.so
I0308 22:07:29.731431 29562 operator.cc:173] Operator with engine CUDNN is not available for operator SpatialNarrowAs.
FI0308 22:07:30.126979 29562 operator.cc:173] Operator with engine CUDNN is not available for operator SpatialNarrowAs.
I0308 22:07:30.382014 29562 operator.cc:173] Operator with engine CUDNN is not available for operator SpatialNarrowAs.
.I0308 22:07:30.383860 29562 operator.cc:173] Operator with engine CUDNN is not available for operator SpatialNarrowAs.
I0308 22:07:30.384814 29562 operator.cc:173] Operator with engine CUDNN is not available for operator SpatialNarrowAs.
I0308 22:07:30.385095 29562 operator.cc:173] Operator with engine CUDNN is not available for operator SpatialNarrowAsGradient.
E
ERROR: test_small_forward_and_gradient (main.SpatialNarrowAsOpTest)
Traceback (most recent call last):
File "test_spatial_narrow_as_op.py", line 59, in test_small_forward_and_gradient
self._run_test(A, B, check_grad=True)
File "test_spatial_narrow_as_op.py", line 49, in _run_test
res, grad, grad_estimated = gc.CheckSimple(op, [A, B], 0, [0])
File "/home/xxxx/anaconda2/lib/python2.7/site-packages/caffe2/python/workspace.py", line 179, in RunOperatorOnce
return C.run_operator_once(StringifyProto(operator))
RuntimeError: [enforce fail at context_gpu.h:171] . Encountered CUDA error: no kernel image is available for execution on the device Error from operator:
input: "A" input: "B" input: "C_grad" output: "A_grad" name: "" type: "SpatialNarrowAsGradient" device_option { device_type: 1 cuda_gpu_id: 0 } is_gradient_op: true
======================================================================
FAIL: test_large_forward (main.SpatialNarrowAsOpTest)
Traceback (most recent call last):
File "test_spatial_narrow_as_op.py", line 68, in test_large_forward
self._run_test(A, B)
File "test_spatial_narrow_as_op.py", line 54, in _run_test
np.testing.assert_allclose(C, C_ref, rtol=1e-5, atol=1e-08)
File "/home/xxx/anaconda2/lib/python2.7/site-packages/numpy/testing/nose_tools/utils.py", line 1396, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
AssertionError:
Not equal to tolerance rtol=1e-05, atol=1e-08
(mismatch 100.0%)
x: array([[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],...
y: array([[[[ 3.099715e-01, -1.291913e+00, -2.825952e-01, ...,
-2.258663e-01, -8.814982e-01, 4.408140e-01],
[ 1.377446e+00, 1.170039e+00, 1.164714e-01, ...,...
Ran 3 tests in 1.078s
FAILED (failures=1, errors=1)
=======================================================
If I run with the inference
python2 tools/infer_simple.py \
--cfg configs/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml \
--output-dir /tmp/detectron-visualizations \
--image-ext jpg \
--wts https://s3-us-west-2.amazonaws.com/detectron/35861858/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml.02_32_51.SgT4y1cO/output/train/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl \
demo
I got the following error:
I0308 22:03:05.297256 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.297796 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.298099 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.298406 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.298660 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.298704 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.299007 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.299317 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.299623 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.299666 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.299965 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.300297 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.300607 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.300649 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.300714 31934 operator.cc:173] Operator with engine CUDNN is not available for operator StopGradient.
I0308 22:03:05.300990 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.301300 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.301609 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.301867 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.301910 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.302211 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.302521 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.302832 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.302876 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.303180 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.303493 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.303802 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.303844 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.304164 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.304476 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.304787 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.304831 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.305145 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.305461 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.460626 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sigmoid.
I0308 22:03:05.460695 31934 net_dag_utils.cc:118] Operator graph pruning prior to chain compute took: 1.714e-05 secs
I0308 22:03:05.460738 31934 net_dag.cc:61] Number of parallel execution chains 5 Number of operators = 18
INFO infer_simple.py: 111: Processing demo/16004479832_a748d55f21_k.jpg -> /tmp/detectron-visualizations/16004479832_a748d55f21_k.jpg.pdf
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
what(): [enforce fail at context_gpu.h:171] . Encountered CUDA error: no kernel image is available for execution on the device Error from operator:
input: "gpu_0/res2_0_branch2c_bn" input: "gpu_0/res2_0_branch1_bn" output: "gpu_0/res2_0_branch2c_bn" name: "" type: "Sum" device_option { device_type: 1 cuda_gpu_id: 0 } debug_info: " File "tools/infer_simple.py", line 147, in \n main(args)\n File
*** Aborted at 1520575410 (unix time) try "date -d @1520575410" if you are using GNU date ***
PC: @ 0x7f1d09bad428 gsignal
*** SIGABRT (@0x3e800007cbe) received by PID 31934 (TID 0x7f1cba292700) from PID 31934; stack trace: ***
@ 0x7f1d0a663390 (unknown)
@ 0x7f1d09bad428 gsignal
@ 0x7f1d09baf02a abort
@ 0x7f1d031bdb39 __gnu_cxx::__verbose_terminate_handler()
@ 0x7f1d031bc1fb __cxxabiv1::__terminate()
@ 0x7f1d031bc234 std::terminate()
@ 0x7f1d031d7c8a execute_native_thread_routine_compat
@ 0x7f1d0a6596ba start_thread
@ 0x7f1d09c7f41d clone
Aborted
PYTHONPATH environment variable: ?
python --version output: ?