InternalError (see above for traceback): Blas SGEMM launch failed with Demo.py #38

Open
jeffreynghm opened this issue Mar 19, 2017 · 6 comments

@jeffreynghm
jeffreynghm commented Mar 19, 2017

Thanks for the great effort in sharing this user-friendly version. I have run it on Python 2 and see the same error. After spending 2 days of spare time porting it to Python 3, I still see the same error... Any idea why I am seeing "Blas SGEMM launch failed"?

~/TFFRCNN3/faster_rcnn$ python3 demo.py --model "/home/jeffreynghm/TFFRCNN3/data/pretrain_model/VGGnet_fast_rcnn_iter_70000.ckpt"
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce 940M
major: 5 minor: 0 memoryClockRate (GHz) 1.176
pciBusID 0000:01:00.0
Total memory: 1.96GiB
Free memory: 1.63GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce 940M, pci bus id: 0000:01:00.0)
Tensor("Placeholder:0", shape=(?, ?, ?, 3), dtype=float32)
Tensor("conv5_3/Relu:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("rpn_conv/3x3/Relu:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("rpn_cls_score/BiasAdd:0", shape=(?, ?, ?, 18), dtype=float32)
Tensor("rpn_cls_prob:0", shape=(?, ?, ?, ?), dtype=float32)
Tensor("Reshape_2:0", shape=(?, ?, ?, 18), dtype=float32)
Tensor("rpn_bbox_pred/BiasAdd:0", shape=(?, ?, ?, 36), dtype=float32)
Tensor("Placeholder_1:0", shape=(?, 3), dtype=float32)
Tensor("conv5_3/Relu:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("rois:0", shape=(?, 5), dtype=float32)
[<tf.Tensor 'conv5_3/Relu:0' shape=(?, ?, ?, 512) dtype=float32>, <tf.Tensor 'rois:0' shape=(?, 5) dtype=float32>]
Tensor("fc7/fc7:0", shape=(?, 4096), dtype=float32)
Loading network VGGnet_test...   done.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 791.02MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 554.92MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 1.08GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 1.08GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
E tensorflow/stream_executor/cuda/cuda_blas.cc:372] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
W tensorflow/stream_executor/stream.cc:1390] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1022, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Blas SGEMM launch failed : m=1369, n=18, k=512
	 [[Node: rpn_cls_score/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](rpn_conv/3x3/Relu, rpn_cls_score/weights/read)]]
	 [[Node: bbox_pred/bbox_pred/_89 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_292_bbox_pred/bbox_pred", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "demo.py", line 132, in <module>
    _, _ = im_detect(sess, net, im)
  File "/home/jeffreynghm/TFFRCNN3/lib/fast_rcnn/test.py", line 176, in im_detect
    feed_dict=feed_dict)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas SGEMM launch failed : m=1369, n=18, k=512
	 [[Node: rpn_cls_score/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](rpn_conv/3x3/Relu, rpn_cls_score/weights/read)]]
	 [[Node: bbox_pred/bbox_pred/_89 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_292_bbox_pred/bbox_pred", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Caused by op 'rpn_cls_score/Conv2D', defined at:
  File "demo.py", line 122, in <module>
    net = get_network(args.demo_net)
  File "/home/jeffreynghm/TFFRCNN3/lib/networks/factory.py", line 27, in get_network
    return VGGnet_test()
  File "/home/jeffreynghm/TFFRCNN3/lib/networks/VGGnet_test.py", line 16, in __init__
    self.setup()
  File "/home/jeffreynghm/TFFRCNN3/lib/networks/VGGnet_test.py", line 46, in setup
    .conv(1, 1, len(anchor_scales) * 3 * 2, 1, 1, padding='VALID', relu=False, name='rpn_cls_score'))
  File "/home/jeffreynghm/TFFRCNN3/lib/networks/network.py", line 38, in layer_decorated
    layer_output = op(self, layer_input, *args, **kwargs)
  File "/home/jeffreynghm/TFFRCNN3/lib/networks/network.py", line 148, in conv
    conv = convolve(input, kernel)
  File "/home/jeffreynghm/TFFRCNN3/lib/networks/network.py", line 138, in <lambda>
    convolve = lambda i, k: tf.nn.conv2d(i, k, [1, s_h, s_w, 1], padding=padding)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 396, in conv2d
    data_format=data_format, name=name)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()

InternalError (see above for traceback): Blas SGEMM launch failed : m=1369, n=18, k=512
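The `m=1369, n=18, k=512` in this message can be decoded: TensorFlow runs a 1x1, stride-1 convolution such as `rpn_cls_score` as a single SGEMM, with m = batch x height x width, k = input channels and n = output channels. The Python/NumPy sketch below reproduces those numbers under stated assumptions (a batch of one 37 x 37 conv5_3 feature map, since 37 x 37 = 1369, and three anchor scales, which matches the 18 channels of `rpn_cls_score/BiasAdd` in the log). It shows the GEMM operands total only about 2.8 MiB, small next to the 791 MiB to 1.08 GiB requests the BFC allocator reported just before `failed to create cublas handle`. That suggests the failure lies in setting up cuBLAS, not in the size of this particular multiplication.

```python
import numpy as np


def conv1x1_gemm_dims(batch, height, width, c_in, c_out):
    """(m, n, k) for a 1x1, stride-1, VALID NHWC convolution lowered to one GEMM."""
    m = batch * height * width  # one GEMM row per output pixel
    n = c_out                   # one GEMM column per output channel
    k = c_in                    # the reduction runs over the input channels
    return m, n, k


def gemm_footprint_bytes(m, n, k, itemsize=4):
    """Bytes held by A (m x k), B (k x n) and C (m x n); itemsize 4 = float32."""
    return (m * k + k * n + m * n) * itemsize


# Assumptions: batch of 1, a 37 x 37 conv5_3 map (37 * 37 = 1369), 512 input
# channels, and 3 anchor scales, so len(anchor_scales) * 3 * 2 = 18 outputs.
m, n, k = conv1x1_gemm_dims(batch=1, height=37, width=37, c_in=512, c_out=3 * 3 * 2)
print(m, n, k)                                 # 1369 18 512
print(gemm_footprint_bytes(m, n, k) / 2**20)   # about 2.8 MiB

# Sanity check on random data: the 1x1 convolution equals reshape + matmul.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 37, 37, 512), dtype=np.float32)   # NHWC input
w = rng.standard_normal((1, 1, 512, 18), dtype=np.float32)    # HWIO kernel
direct = np.einsum("bhwc,co->bhwo", x, w[0, 0])
as_gemm = (x.reshape(m, k) @ w.reshape(k, n)).reshape(1, 37, 37, n)
print(np.allclose(direct, as_gemm, rtol=1e-4, atol=1e-3))     # True
```

The NumPy check runs on the CPU only; it illustrates the shape bookkeeping and does not exercise cuBLAS.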
@jeffreynghm jeffreynghm changed the title InternalError (see above for traceback): Blas SGEMM launch failed : m=1369, n=18, k=512 InternalError (see above for traceback): Blas SGEMM launch failed Mar 19, 2017
@jeffreynghm jeffreynghm changed the title InternalError (see above for traceback): Blas SGEMM launch failed InternalError (see above for traceback): Blas SGEMM launch failed with Demo.py Mar 19, 2017
@714586886
I have the same problem! Can you tell me how to solve it, @jeffreynghm?

@jeffreynghm
Author

@714586886 No solution yet... I am sure it is not a TensorFlow problem, as I tried running other programs and they work well.

@714586886
@CharlesShang We are hitting this error, and I have spent 2 days on it. Can you help me?
I changed TensorFlow from 0.11.0 to 1.0.0, but it still does not work.

@BStudent
BStudent commented Apr 12, 2017

Bad install...
I updated something from the CUDA toolkit and didn't realize that everything you don't install (explicitly check off) gets uninstalled rather than left alone.

UPDATED:
When I ignore all the messages cascading out of Python and look at the original trace,
this seems pretty self-explanatory (maybe that's too strong a word):

So, for some reason, a BLAS component is failing to allocate what appears to be a very small array, which it does not fail to allocate at other times...

I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 950, pci bus id: 0000:01:00.0)
Initialized
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\cuda\cuda_blas.cc:372] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\stream.cc:1390] attempting to perform BLAS operation using StreamExecutor without BLAS support

Traceback (most recent call last):
  File "C:\Users\BStudent\Anaconda4_3_1\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1022, in _do_call
    return fn(*args)
  File "C:\Users\BStudent\Anaconda4_3_1\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1004, in _run_fn
    status, run_metadata)
  File "C:\Users\BStudent\Anaconda4_3_1\envs\tensorflow-gpu\lib\contextlib.py", line 66, in __exit__
    next(self.gen)
  File "C:\Users\BStudent\Anaconda4_3_1\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Blas SGEMM launch failed : a.shape=(500, 2), b.shape=(2, 64), m=500, n=64, k=2
	 [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](_recv_Placeholder_1_0/7, Variable/read)]]
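To put "a very small array" in numbers: with no transposes, the failing MatMul computes C = A @ B with A of shape (m, k) and B of shape (k, n), so the shapes in the message give m=500, n=64, k=2, or roughly 129 KiB of float32 operands in total. The Python/NumPy sketch below (NumPy is an assumption here; it runs on the CPU and never touches cuBLAS) just does that arithmetic. Given how small the operands are, the `CUBLAS_STATUS_ALLOC_FAILED` line just before the traceback, where cuBLAS could not create its handle, looks like the more informative message.

```python
import numpy as np


def sgemm_dims(a_shape, b_shape):
    """(m, n, k) for C = A @ B with A of shape (m, k) and B of shape (k, n), no transposes."""
    m, k_a = a_shape
    k_b, n = b_shape
    if k_a != k_b:
        raise ValueError(f"inner dimensions differ: {k_a} != {k_b}")
    return m, n, k_a


def operand_bytes(m, n, k, itemsize=4):
    """Total bytes of A, B and C; itemsize 4 = float32."""
    return (m * k + k * n + m * n) * itemsize


# Shapes copied from the error message: a.shape=(500, 2), b.shape=(2, 64).
m, n, k = sgemm_dims((500, 2), (2, 64))
print(m, n, k)                  # 500 64 2
print(operand_bytes(m, n, k))   # 132512

# The same product on the CPU, only to confirm the output shape and dtype.
a = np.ones((500, 2), dtype=np.float32)
b = np.ones((2, 64), dtype=np.float32)
c = a @ b
print(c.shape, c.dtype)         # (500, 64) float32
```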

ORIGINAL:
I have been running the ToyGANs example code, which can be found online (and which required running the upgrade-to-TF-v1.0 script).
I have gotten it to run several times in a row with high performance, and then the problem suddenly starts happening. The one factor that might be correlated is multiple processes trying to hit the GPU via TensorFlow at the same time, which only happens by accident in my world.
The problem does not affect CUDA demo suite programs like nbody.
The problem also "survives" shutting down all Python processes on the system and restarting them.
So it's maaayyybe possible that it's a cuDNN problem, if one of those DLLs is somehow stateful. Restarting seemed to fix it earlier, but I had made a lot of other changes as well.

@JaneTakanashi
Use nvidia-smi to see if the GPU is overloaded, and use 'kill -9 <pid>' to kill the offending process.
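That check can be scripted. The Python sketch below (an illustration, not something posted in this thread) asks nvidia-smi which compute processes currently hold GPU memory, using its `--query-compute-apps` query in CSV form, and prints a Linux `kill -9` command for each one instead of killing anything itself. It assumes nvidia-smi is on PATH; some drivers report the per-process memory as a non-numeric placeholder such as `[N/A]`, which the parser maps to `None`.

```python
import shutil
import subprocess

QUERY = ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader,nounits"]


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into (pid, name, used_MiB or None)."""
    apps = []
    for line in csv_text.strip().splitlines():
        pid, rest = line.split(",", 1)     # pid is the first field
        name, used = rest.rsplit(",", 1)   # memory is the last field; names may contain commas
        used = used.strip()
        apps.append((int(pid), name.strip(), int(used) if used.isdigit() else None))
    return apps


def list_gpu_processes():
    """Run nvidia-smi and return the parsed list of GPU compute processes."""
    out = subprocess.run(QUERY, check=True, stdout=subprocess.PIPE,
                         universal_newlines=True).stdout
    return parse_compute_apps(out)


if __name__ == "__main__" and shutil.which("nvidia-smi"):
    for pid, name, used in list_gpu_processes():
        used_txt = "n/a" if used is None else f"{used} MiB"
        # Only print the suggestion; killing a process is left to the user.
        print(f"{pid:>8}  {used_txt:>10}  {name}  ->  kill -9 {pid}")
```

Printing rather than killing keeps the script safe to run on a shared machine, where another user's job may legitimately be holding the GPU.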

@BStudent
Thanks, Jane.
I actually wound up solving the problem with a more global strategy: never run TensorFlow on Windows, and use Ubuntu instead. It can be a tremendous headache to install the CUDA dev kit and the correct Nvidia drivers on Ubuntu (or any Linux system), but once you do, there are far fewer mysterious problems with TF than on Windows.
Windows users are probably better off using MATLAB or, more recently, CNTK instead of TensorFlow, and if they insist on TensorFlow they should probably use the R version, which is not officially supported by Google but is very well maintained by the RStudio people.


4 participants