
double backward always returns nan when dtype is float16 and cudnn is enabled. #6251

Closed
crcrpar opened this issue Feb 15, 2019 · 7 comments
Labels
cat:bug Bug report or fix. prio:high High priority. Urgent and needs to be worked on as soon as possible.

Comments

@crcrpar
Contributor

crcrpar commented Feb 15, 2019

When F.reshape and F.batch_normalization are used together under the condition that dtype is float16 and use_cudnn='always', the double backward of the pair is so unstable that it returns nan with high probability.

One use case of this pair is F.group_normalization:

x = reshape.reshape(x, (1, batch_size * groups, -1, 1))
with cuda.get_device_from_array(x.array):
    dummy_gamma = xp.ones(batch_size * groups).astype(xp.float32)
    dummy_beta = xp.zeros(batch_size * groups).astype(xp.float32)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        x = batch_normalization.batch_normalization(
            x, dummy_gamma, dummy_beta, eps=eps)
x = reshape.reshape(x, original_shape)
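
For reference, a minimal sketch that exercises the same path through the public API (the shapes, groups value, and inputs here are illustrative assumptions, not taken from this report):

import cupy as cp
import chainer.functions as F

# Illustrative float16 input on GPU, normalized over 3 groups of 2 channels.
x = cp.random.uniform(-1, 1, (5, 6, 4, 4)).astype(cp.float16)
gamma = cp.ones(6, dtype=cp.float16)
beta = cp.zeros(6, dtype=cp.float16)
y = F.group_normalization(x, groups=3, gamma=gamma, beta=beta)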

  • Conditions
Platform: Linux-4.4.0-98-generic-x86_64-with-debian-stretch-sid
Chainer: 6.0.0b2
NumPy: 1.15.4
CuPy:
  CuPy Version          : 6.0.0b2
  CUDA Root             : /usr/local/cuda
  CUDA Build Version    : 9020
  CUDA Driver Version   : 9020
  CUDA Runtime Version  : 9020
  cuDNN Build Version   : 7201
  cuDNN Version         : 7201
  NCCL Build Version    : None
iDeep: 2.0.0.post3
  • Code to reproduce
import cupy as cp
from chainer import gradient_check
import chainer.functions as F
import numpy


def reshape_and_bn(x, gamma, beta):
    # Mimic F.group_normalization: fold (batch, channel) into a single
    # "channel" axis, batch-normalize with dummy scale/shift, then restore
    # the original shape and apply the real gamma/beta.
    x_shape = x.shape
    expander = [None, Ellipsis, None, None]
    x_ = F.reshape(x, (1, x_shape[0] * x_shape[1], -1, 1))
    dummy_g = cp.ones(x_.shape[1], dtype=x_.dtype)
    dummy_b = cp.zeros(x_.shape[1], dtype=x_.dtype)
    x_normalized = F.batch_normalization(x_, dummy_g, dummy_b)
    x_normalized = F.reshape(x_normalized, x_shape)
    # Broadcast gamma/beta over (N, H, W) and apply the affine transform.
    gamma = gamma[expander]
    beta = beta[expander]
    return x_normalized * gamma + beta


def run():
    x = cp.random.uniform(-1, 1, (5, 3, 4, 4)).astype(cp.float16)
    gy = cp.random.uniform(-1, 1, (5, 3, 4, 4)).astype(cp.float16)
    ggx = cp.random.uniform(-1, 1, (5, 3, 4, 4)).astype(cp.float16)
    gamma = cp.random.uniform(-1, 1, 3).astype(cp.float16)
    beta = cp.random.uniform(-1, 1, 3).astype(cp.float16)
    ggamma = cp.random.uniform(-1, 1, 3).astype(cp.float16)
    gbeta = cp.random.uniform(-1, 1, 3).astype(cp.float16)

    print('Backward')
    gradient_check.check_backward(
        reshape_and_bn, (x, gamma, beta,), (gy,), dtype=numpy.float64,
        atol=1e-2, rtol=1e-3
    )

    print('Double Backward')
    gradient_check.check_double_backward(
        reshape_and_bn, (x, gamma, beta), (gy,),
        (ggx, ggamma, gbeta), dtype=numpy.float64,
        atol=1e-2, rtol=1e-3
    )


if __name__ == '__main__':
    run()
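
The report states the failure occurs with use_cudnn='always'. The snippet above does not set that configuration explicitly; a sketch of forcing it via Chainer's configuration context (an assumption about how the script was run):

import chainer

# Run the checks with the cuDNN implementation of batch normalization forced.
with chainer.using_config('use_cudnn', 'always'):
    run()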
  • Error messages, stack traces, or logs

For backward,

gradients (numeric):  0.6409951020032167
gradients (backward): -0.5768083848859537


Not equal to tolerance rtol=0.001, atol=0.01

Mismatch: 100%
Max absolute difference: 1.21780349
Max relative difference: 2.1112791
 x: array(0.640995)
 y: array(-0.576808)

assert_allclose failed:
  shape: () ()
  dtype: float64 float64
  i: (0,)
  x[i]: 0.6409951020032167
  y[i]: -0.5768083848859537
  relative error[i]: 2.1112790985691965
  absolute error[i]: 1.2178034868891703
x: 0.6409951
y: -0.57680838

For double backward,

gradients (numeric):  1.773324329406023
gradients (backward): nan


Not equal to tolerance rtol=0.001, atol=0.01

x and y nan location mismatch:
 x: array(1.773324)
 y: array(nan)

assert_allclose failed:
  shape: () ()
  dtype: float64 float64
  i: (0,)
  x[i]: 1.773324329406023
  y[i]: nan
  relative error[i]: nan
  absolute error[i]: nan
x: 1.77332433
y: nan
@toslunar
Member

With NumPy, the errors are also too large.
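
For reference, a CPU-only variant of the same check is a small adaptation of the snippet above (a sketch; the NumPy-backed reshape_and_bn_np helper is hypothetical and not from this thread):

import numpy as np
import chainer.functions as F
from chainer import gradient_check


def reshape_and_bn_np(x, gamma, beta):
    # Same computation as reshape_and_bn, but with NumPy-backed dummy arrays.
    x_shape = x.shape
    expander = (None, Ellipsis, None, None)
    x_ = F.reshape(x, (1, x_shape[0] * x_shape[1], -1, 1))
    dummy_g = np.ones(x_.shape[1], dtype=x_.dtype)
    dummy_b = np.zeros(x_.shape[1], dtype=x_.dtype)
    x_normalized = F.batch_normalization(x_, dummy_g, dummy_b)
    x_normalized = F.reshape(x_normalized, x_shape)
    return x_normalized * gamma[expander] + beta[expander]


x = np.random.uniform(-1, 1, (5, 3, 4, 4)).astype(np.float16)
gy = np.random.uniform(-1, 1, (5, 3, 4, 4)).astype(np.float16)
gamma = np.random.uniform(-1, 1, 3).astype(np.float16)
beta = np.random.uniform(-1, 1, 3).astype(np.float16)
gradient_check.check_backward(
    reshape_and_bn_np, (x, gamma, beta), (gy,), dtype=np.float64,
    atol=1e-2, rtol=1e-3)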

@kmaehashi kmaehashi added the cat:bug Bug report or fix. label Feb 26, 2019
@kmaehashi kmaehashi added the prio:high High priority. Urgent and needs to be worked on as soon as possible. label Feb 26, 2019
@takagi
Member

takagi commented Feb 26, 2019

I will check if this issue is related to #6323.

@crcrpar
Contributor Author

crcrpar commented Mar 14, 2019

@takagi cupy/cupy#2072 and #6497 seem to fix this issue, at least for the above snippet.

@toslunar
Member

I guess cupy/cupy#2060, #6323, and #5924 did.

@grafi-tt
Contributor

@crcrpar Could you try with the current master branch?

@crcrpar
Contributor Author

crcrpar commented Mar 18, 2019

@grafi-tt I tried the snippet with current master, and it worked.

@takagi
Member

takagi commented Mar 19, 2019

I've also confirmed that the snippet works.

Platform: Linux-4.15.0-46-generic-x86_64-with-debian-buster-sid
Chainer: 6.0.0b3
NumPy: 1.16.2
CuPy:
  CuPy Version          : 6.0.0b3
  CUDA Root             : /usr/local/cuda
  CUDA Build Version    : 10000
  CUDA Driver Version   : 10000
  CUDA Runtime Version  : 10000
  cuDNN Build Version   : 7401
  cuDNN Version         : 7401
  NCCL Build Version    : None
  NCCL Runtime Version  : None
iDeep: Not Available

@takagi takagi closed this as completed Mar 19, 2019