
"NCHW" data_format in Conv not working with latest CUDA #83

Closed
mil-ad opened this issue Nov 17, 2020 · 6 comments

mil-ad commented Nov 17, 2020

I'm not able to use the NCHW data format in conv layers:

import os
import numpy as np
import jax
import haiku as hk

# os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
os.environ["XLA_FLAGS"] = "--xla_gpu_cuda_data_dir=/opt/cuda"

def net(x):
    model = hk.Sequential([hk.Conv2D(2, 5, padding="VALID", data_format="NCHW")])
    return model(x)

key = jax.random.PRNGKey(42)
net_transformed = hk.without_apply_rng(hk.transform(net))
params = net_transformed.init(key, np.zeros((1, 1, 28, 28)))

The snippet above works fine on the CPU, but on the GPU it produces the TensorFlow-style spew of errors below. The problem goes away if I change data_format to NHWC (a working variant is shown after the version list). I'm running fairly recent versions of the NVIDIA driver and CUDA, and the same snippet runs fine on older versions (according to a few people I sent it to), so I'm pretty sure it's related to those. My versions are:

cuda 11.1.0-2
nvidia driver: 455.38
jax 0.2.5
jaxlib 0.1.57+cuda111 
dm-haiku 0.0.2
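
For completeness, here is the NHWC variant that runs fine for me on the same machine (only data_format and the input layout change; the net_nhwc name is just for illustration):

def net_nhwc(x):
    model = hk.Sequential([hk.Conv2D(2, 5, padding="VALID", data_format="NHWC")])
    return model(x)

net_nhwc_transformed = hk.without_apply_rng(hk.transform(net_nhwc))
# Same zero image, but laid out as (batch, height, width, channels).
params = net_nhwc_transformed.init(key, np.zeros((1, 28, 28, 1)))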

Error:

2020-11-17 12:18:03.717098: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:349] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-11-17 12:18:03.718623: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:349] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-11-17 12:18:03.719796: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:349] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-11-17 12:18:03.719969: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:772] Failed to determine best cudnn convolution algorithm: Internal: All algorithms tried for convolution %custom-call = (f32[1,20,24,24]{3,2,1,0}, u8[0]{0}) custom-call(f32[1,1,28,28]{3,2,1,0} %parameter.1, f32[5,5,1,20]{1,0,2,3} %copy.1), window={size=5x5}, dim_labels=bf01_01io->bf01, custom_call_target="__cudnn$convForward", metadata={op_type="conv_general_dilated" op_name="conv_general_dilated[ batch_group_count=1\n                      dimension_numbers=ConvDimensionNumbers(lhs_spec=(0, 1, 2, 3), rhs_spec=(3, 2, 0, 1), out_spec=(0, 1, 2, 3))\n                      feature_group_count=1\n                      lhs_dilation=(1, 1)\n                      lhs_shape=(1, 1, 28, 28)\n                      padding=((0, 0), (0, 0))\n                      precision=None\n                      rhs_dilation=(1, 1)\n                      rhs_shape=(5, 5, 1, 20)\n                      window_strides=(1, 1) ]"}, backend_config="{\"algorithm\":\"0\",\"tensor_ops_enabled\":false,\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}" failed. Falling back to default algorithm. 

Convolution performance may be suboptimal.
2020-11-17 12:18:03.800681: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:349] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-11-17 12:18:03.800721: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_client.cc:1809] Execution of replica 0 failed: Unimplemented: DNN library is not found.
Traceback (most recent call last):
  File "scratch.py", line 17, in <module>
    params = net_transformed.init(key, np.zeros((1, 1, 28, 28)))
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/haiku/_src/transform.py", line 111, in init_fn
    params, state = f.init(*args, **kwargs)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/haiku/_src/transform.py", line 277, in init_fn
    f(*args, **kwargs)
  File "scratch.py", line 12, in net
    return model(x)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/haiku/_src/module.py", line 406, in wrapped
    out = f(*args, **kwargs)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/haiku/_src/module.py", line 263, in run_interceptors
    return bound_method(*args, **kwargs)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/haiku/_src/basic.py", line 124, in __call__
    out = layer(out, *args, **kwargs)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/haiku/_src/module.py", line 406, in wrapped
    out = f(*args, **kwargs)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/haiku/_src/module.py", line 263, in run_interceptors
    return bound_method(*args, **kwargs)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/haiku/_src/conv.py", line 195, in __call__
    out = lax.conv_general_dilated(inputs,
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/jax/_src/lax/lax.py", line 571, in conv_general_dilated
    return conv_general_dilated_p.bind(
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/jax/core.py", line 266, in bind
    out = top_trace.process_primitive(self, tracers, params)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/jax/core.py", line 576, in process_primitive
    return primitive.impl(*tracers, **params)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/jax/interpreters/xla.py", line 234, in apply_primitive
    return compiled_fun(*args)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/jax/interpreters/xla.py", line 349, in _execute_compiled_primitive
    out_bufs = compiled.execute(input_bufs)
RuntimeError: Unimplemented: DNN library is not found.
@tomhennigan (Collaborator) commented:

Hey @mil-ad, do you have cuDNN installed? I think you need to install it separately from the GPU drivers.

I've tested your snippet in Google Colab on an Nvidia P100 GPU and it seems to work fine: https://colab.research.google.com/gist/tomhennigan/6fd38842b05d46b8418cf44a4083be9d/nchw-test.ipynb
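
As a quick sanity check (a minimal sketch, not Haiku-specific; the library name libcudnn.so.8 is an assumption about your cuDNN version), you could confirm that JAX sees the GPU and that the dynamic loader can find cuDNN at all:

import ctypes
import jax

# Should list a GPU device if the CUDA backend initialised correctly.
print(jax.devices())

# If this raises OSError, the loader cannot locate cuDNN, which would
# be consistent with the CUDNN_STATUS_INTERNAL_ERROR messages above.
ctypes.CDLL("libcudnn.so.8")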

mil-ad commented Nov 17, 2020

I do have cuDNN installed, but perhaps, similar to CUDA, I should explicitly point JAX to it?

mil-ad commented Nov 17, 2020

Also, the NHWC version works fine. Does that one not need cuDNN?

@hawkinsp (Contributor) commented:

See google/jax#4920. Can you try setting LD_LIBRARY_PATH?

My guess about NHWC vs. NCHW is that XLA may be able to lower that convolution to a matrix multiplication without using cuDNN.
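
As a rough sketch of the first suggestion (the /opt/cuda/lib64 path is just a placeholder; point it at the directory that actually contains libcudnn.so.8): the dynamic loader typically reads LD_LIBRARY_PATH when the process starts, so set it in the launching environment rather than via os.environ inside the script.

import os
import subprocess

# Placeholder path; replace with the directory containing libcudnn.so.8.
cudnn_dir = "/opt/cuda/lib64"

env = dict(os.environ)
env["LD_LIBRARY_PATH"] = cudnn_dir + ":" + env.get("LD_LIBRARY_PATH", "")

# Relaunch the repro script with cuDNN on the loader's search path.
subprocess.run(["python", "scratch.py"], env=env, check=True)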

mil-ad commented Nov 17, 2020

Thanks @hawkinsp, that does make it go further, although it still fails:

2020-11-17 14:46:54.615564: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2020-11-17 14:46:54.645871: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2020-11-17 14:46:54.838008: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560d34013d90 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2020-11-17 14:46:54.838028: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Interpreter, <undefined>
2020-11-17 14:46:54.852763: I external/org_tensorflow/tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3594550000 Hz
2020-11-17 14:46:54.853134: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560d33f81c10 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-11-17 14:46:54.853147: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-11-17 14:46:54.853861: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2020-11-17 14:46:54.925446: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-17 14:46:54.925833: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560d33f71230 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-11-17 14:46:54.925847: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1660 SUPER, Compute Capability 7.5
2020-11-17 14:46:54.926142: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/nvidia_gpu_device.cc:119] XLA backend allocating 3926340403 bytes on device 0 for BFCAllocator.
2020-11-17 14:46:54.926262: I external/org_tensorflow/tensorflow/stream_executor/tpu/tpu_platform_interface.cc:74] No TPU platform found.
2020-11-17 14:46:55.522465: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2020-11-17 14:46:55.824222: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2020-11-17 14:46:55.824629: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2020-11-17 14:46:56.303814: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:344] Loaded cuDNN version 8004
2020-11-17 14:46:56.325354: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:772] Failed to determine best cudnn convolution algorithm: Internal: All algorithms tried for convolution %custom-call = (f32[1,20,24,24]{3,2,1,0}, u8[0]{0}) custom-call(f32[1,1,28,28]{3,2,1,0} %parameter.1, f32[5,5,1,20]{1,0,2,3} %copy.1), window={size=5x5}, dim_labels=bf01_01io->bf01, custom_call_target="__cudnn$convForward", metadata={op_type="conv_general_dilated" op_name="conv_general_dilated[ batch_group_count=1\n                      dimension_numbers=ConvDimensionNumbers(lhs_spec=(0, 1, 2, 3), rhs_spec=(3, 2, 0, 1), out_spec=(0, 1, 2, 3))\n                      feature_group_count=1\n                      lhs_dilation=(1, 1)\n                      lhs_shape=(1, 1, 28, 28)\n                      padding=((0, 0), (0, 0))\n                      precision=None\n                      rhs_dilation=(1, 1)\n                      rhs_shape=(5, 5, 1, 20)\n                      window_strides=(1, 1) ]"}, backend_config="{\"algorithm\":\"0\",\"tensor_ops_enabled\":false,\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}" failed. Falling back to default algorithm. 

Convolution performance may be suboptimal.
2020-11-17 14:46:56.386934: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_client.cc:1809] Execution of replica 0 failed: Unknown: CUDNN_STATUS_EXECUTION_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(3118): 'cudnnConvolutionForward( cudnn.handle(), alpha, input_nd.handle(), input_data.opaque(), filter_nd.handle(), filter_data.opaque(), conv.handle(), ToConvForwardAlgo(algorithm_desc), scratch_memory.opaque(), scratch_memory.size(), beta, output_nd.handle(), output_data.opaque())'
Traceback (most recent call last):
  File "scratch.py", line 17, in <module>
    params = net_transformed.init(key, np.zeros((1, 1, 28, 28)))
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/haiku/_src/transform.py", line 111, in init_fn
    params, state = f.init(*args, **kwargs)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/haiku/_src/transform.py", line 277, in init_fn
    f(*args, **kwargs)
  File "scratch.py", line 12, in net
    return model(x)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/haiku/_src/module.py", line 406, in wrapped
    out = f(*args, **kwargs)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/haiku/_src/module.py", line 263, in run_interceptors
    return bound_method(*args, **kwargs)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/haiku/_src/basic.py", line 124, in __call__
    out = layer(out, *args, **kwargs)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/haiku/_src/module.py", line 406, in wrapped
    out = f(*args, **kwargs)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/haiku/_src/module.py", line 263, in run_interceptors
    return bound_method(*args, **kwargs)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/haiku/_src/conv.py", line 195, in __call__
    out = lax.conv_general_dilated(inputs,
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/jax/_src/lax/lax.py", line 571, in conv_general_dilated
    return conv_general_dilated_p.bind(
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/jax/core.py", line 266, in bind
    out = top_trace.process_primitive(self, tracers, params)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/jax/core.py", line 576, in process_primitive
    return primitive.impl(*tracers, **params)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/jax/interpreters/xla.py", line 234, in apply_primitive
    return compiled_fun(*args)
  File "/home/milad/miniconda3/envs/my_project/lib/python3.8/site-packages/jax/interpreters/xla.py", line 349, in _execute_compiled_primitive
    out_bufs = compiled.execute(input_bufs)
RuntimeError: Unknown: CUDNN_STATUS_EXECUTION_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(3118): 'cudnnConvolutionForward( cudnn.handle(), alpha, input_nd.handle(), input_data.opaque(), filter_nd.handle(), filter_data.opaque(), conv.handle(), ToConvForwardAlgo(algorithm_desc), scratch_memory.opaque(), scratch_memory.size(), beta, output_nd.handle(), output_data.opaque())'

@tomhennigan (Collaborator) commented:

Let's keep the discussion in the JAX bug; I think this is not Haiku-specific.
