RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid #11

Closed
najmehs opened this issue Jan 13, 2021 · 2 comments


najmehs commented Jan 13, 2021

Hi,
Thanks for sharing the images and comprehensive instructions! They are very helpful!

When I use the image datamachines/cudnn_tensorflow_opencv:10.2_2.3.1_4.5.0-20201204, TensorFlow fails to initialize the GPU. When I run the following two lines in Python,
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
I get this error:

  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/test_util.py", line 131, in gpu_device_name
    for x in device_lib.list_local_devices():
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/device_lib.py", line 43, in list_local_devices
    _convert(s) for s in _pywrap_device_lib.list_devices(serialized_config)
RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid
@mmartial
Contributor

I am able to reproduce this issue. It seems to be related to tensorflow/tensorflow#41990, but the workaround proposed there (tensorflow/tensorflow#41990 (comment)) did not solve the problem from within the container.

I am in the process of releasing updated versions (20210211), and this error no longer occurs with a later TF.
A tool that runs similar commands is available in test/tf_hw.py:

docker run --rm -it --gpus all -v `pwd`:/dmc datamachines/cudnn_tensorflow_opencv:10.2_2.4.1_3.4.13-20210211 python3 /dmc/test/tf_hw.py
2021-02-17 22:01:54.242594: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
*** Tensorflow version   :  2.4.1
*** Tensorflow Keras     :  2.4.0
*** TF Builf with cuda   :  True
*** TF compile flags     :  ['-I/usr/local/lib/python3.8/dist-packages/tensorflow/include', '-D_GLIBCXX_USE_CXX11_ABI=1']
*** TF include           :  /usr/local/lib/python3.8/dist-packages/tensorflow/include
*** TF lib               :  /usr/local/lib/python3.8/dist-packages/tensorflow
*** TF link flags        :  ['-L/usr/local/lib/python3.8/dist-packages/tensorflow', '-l:libtensorflow_framework.so.2']
*** Keras version        :  2.4.3
*** PyTorch version      :  1.7.1
*** pandas version       :  1.2.2
*** scikit-learn version :  0.24.1

(!! the following is build device specific, and here only to confirm hardware availability, ignore !!)
2021-02-17 22:01:55.537022: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-02-17 22:01:55.538548: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-02-17 22:01:55.551334: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-17 22:01:55.551894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:08:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-02-17 22:01:55.551911: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2021-02-17 22:01:55.552995: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-02-17 22:01:55.553019: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-02-17 22:01:55.554014: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-02-17 22:01:55.554186: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-02-17 22:01:55.555275: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-02-17 22:01:55.555782: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-02-17 22:01:55.557944: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-02-17 22:01:55.558043: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-17 22:01:55.558610: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-17 22:01:55.559106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-02-17 22:01:55.559127: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2021-02-17 22:07:15.635601: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-17 22:07:15.635630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2021-02-17 22:07:15.635640: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2021-02-17 22:07:15.635835: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-17 22:07:15.636365: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-17 22:07:15.636862: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-17 22:07:15.637338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:0 with 21897 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3090, pci bus id: 0000:08:00.0, compute capability: 8.6)
--- All seen hardware    :
 [name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 9698607629912424656
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 22960798464
locality {
  bus_id: 1
  links {
  }
}
incarnation: 645222317647545331
physical_device_desc: "device: 0, name: GeForce RTX 3090, pci bus id: 0000:08:00.0, compute capability: 8.6"
]
2021-02-17 22:07:15.637877: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-17 22:07:15.638993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:08:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-02-17 22:07:15.639030: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2021-02-17 22:07:15.639079: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-02-17 22:07:15.639115: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-02-17 22:07:15.639143: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-02-17 22:07:15.639170: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-02-17 22:07:15.639189: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-02-17 22:07:15.639206: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-02-17 22:07:15.639223: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-02-17 22:07:15.639326: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-17 22:07:15.640391: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-17 22:07:15.641351: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
--- TF GPU Available     :
 [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

The only glitch is that the run hangs for a few minutes after a "Successfully opened dynamic library libcudart.so.10.2" step (in the log above, between 22:01:55 and 22:07:15), which seems to be a different problem: tensorflow/tensor2tensor#1643
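
The length of the hang can be read off the log timestamps. The following is a minimal sketch, not part of the repository: the helper names `parse_timestamp` and `largest_gap_seconds` are hypothetical, and the two log lines are copied from the output above, one on each side of the stall.

```python
from datetime import datetime

# Two consecutive timestamped lines copied from the log above,
# immediately before and after the stall.
LOG_LINES = [
    "2021-02-17 22:01:55.559127: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2",
    "2021-02-17 22:07:15.635601: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:",
]


def parse_timestamp(line):
    """Return the leading log timestamp, or None if the line does not start with one."""
    try:
        # The timestamp "YYYY-MM-DD HH:MM:SS.ffffff" is 26 characters long.
        return datetime.strptime(line[:26], "%Y-%m-%d %H:%M:%S.%f")
    except ValueError:
        return None


def largest_gap_seconds(lines):
    """Return the largest gap, in seconds, between consecutive timestamped lines."""
    stamps = [t for t in map(parse_timestamp, lines) if t is not None]
    gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return max(gaps, default=0.0)


print(f"largest gap: {largest_gap_seconds(LOG_LINES):.1f} s")
```

For these two lines the gap is a little over 320 seconds, that is, roughly five minutes and twenty seconds. Lines without a leading timestamp (for example the `pciBusID:` lines) are skipped.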

That script is also run at the end of a build to confirm that the image "works". Note that a timeout (as opposed to a crash) still counts as a pass.
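
As an illustration of the "timeout passes, crash fails" policy described above, here is a hedged sketch using only the Python standard library. It is not the repository's actual build logic: `run_check` and `build_accepts` are hypothetical names, and the command shown is a stand-in for running test/tf_hw.py inside the container.

```python
import subprocess
import sys


def run_check(cmd, timeout_s):
    """Run cmd and classify the outcome as 'pass', 'timeout', or 'fail'."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run before this is raised.
        return "timeout"
    return "pass" if result.returncode == 0 else "fail"


def build_accepts(status):
    """Assumed policy for this sketch: only a crash (non-zero exit) fails the build."""
    return status in ("pass", "timeout")


if __name__ == "__main__":
    # Stand-in command; a real build would invoke the hardware check script instead.
    status = run_check([sys.executable, "-c", "print('ok')"], timeout_s=30)
    print(status, build_accepts(status))
```

The design choice here is to keep the classification (`run_check`) separate from the policy (`build_accepts`), so that a stricter build could reject timeouts without changing how the check is run.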

Note that the soon-to-be-available CUDA 11 based container (datamachines/cudnn_tensorflow_opencv:11.2.0_2.4.1_3.4.13-20210211) does not have this hang.

@mmartial
Contributor

It took me less extra time than I expected: the latest containers are now available.
From my testing, datamachines/cudnn_tensorflow_opencv:11.2.0_2.4.1_3.4.13-20210211 appears to be functional, so I will close this issue.
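
For readers comparing the image tags mentioned in this thread, the sketch below splits a tag into its components. The field order (CUDA, TensorFlow, OpenCV, build date) is an assumption inferred from the image name and from the versions printed in the log above; `ImageTag` and `parse_tag` are hypothetical names, not part of the project.

```python
from typing import NamedTuple


class ImageTag(NamedTuple):
    cuda: str
    tensorflow: str
    opencv: str
    build_date: str


def parse_tag(tag):
    """Split '<cuda>_<tensorflow>_<opencv>-<date>' into its parts (assumed layout)."""
    versions, build_date = tag.rsplit("-", 1)
    cuda, tensorflow, opencv = versions.split("_")
    return ImageTag(cuda, tensorflow, opencv, build_date)


# Tags quoted in this issue.
for tag in (
    "10.2_2.3.1_4.5.0-20201204",
    "10.2_2.4.1_3.4.13-20210211",
    "11.2.0_2.4.1_3.4.13-20210211",
):
    print(parse_tag(tag))
```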
