
double backward always returns nan when dtype is float16 and cudnn is enabled. #6251

Closed
crcrpar opened this issue Feb 15, 2019 · 7 comments
Labels
cat:bug Bug report or fix. prio:high High priority. Urgent and needs to be worked on as soon as possible.

Comments

@crcrpar
Contributor

crcrpar commented Feb 15, 2019

When F.reshape and F.batch_normalization are used together under the condition that dtype is float16 and use_cudnn='always', the double backward of the pair is so unstable that it returns nan with high probability.

One use case of this pair is F.group_normalization:

x = reshape.reshape(x, (1, batch_size * groups, -1, 1))
with cuda.get_device_from_array(x.array):
    dummy_gamma = xp.ones(batch_size * groups).astype(xp.float32)
    dummy_beta = xp.zeros(batch_size * groups).astype(xp.float32)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        x = batch_normalization.batch_normalization(
            x, dummy_gamma, dummy_beta, eps=eps)
x = reshape.reshape(x, original_shape)
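
For reference, a minimal sketch that exercises the same path through the public API (the shapes, groups value, and inputs here are illustrative assumptions, not taken from this report):

import cupy as cp
import chainer.functions as F

# Illustrative float16 input on GPU, normalized over 3 groups of 2 channels.
x = cp.random.uniform(-1, 1, (5, 6, 4, 4)).astype(cp.float16)
gamma = cp.ones(6, dtype=cp.float16)
beta = cp.zeros(6, dtype=cp.float16)
y = F.group_normalization(x, groups=3, gamma=gamma, beta=beta)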

  • Conditions
Platform: Linux-4.4.0-98-generic-x86_64-with-debian-stretch-sid
Chainer: 6.0.0b2
NumPy: 1.15.4
CuPy:
  CuPy Version          : 6.0.0b2
  CUDA Root             : /usr/local/cuda
  CUDA Build Version    : 9020
  CUDA Driver Version   : 9020
  CUDA Runtime Version  : 9020
  cuDNN Build Version   : 7201
  cuDNN Version         : 7201
  NCCL Build Version    : None
iDeep: 2.0.0.post3
  • Code to reproduce
import cupy as cp
from chainer import gradient_check
import chainer.functions as F
import numpy


def reshape_and_bn(x, gamma, beta):
    # Mimic F.group_normalization: fold (batch, channel) into a single
    # "channel" axis, batch-normalize with dummy scale/shift, then restore
    # the original shape and apply the real gamma/beta.
    x_shape = x.shape
    expander = [None, Ellipsis, None, None]
    x_ = F.reshape(x, (1, x_shape[0] * x_shape[1], -1, 1))
    dummy_g = cp.ones(x_.shape[1], dtype=x_.dtype)
    dummy_b = cp.zeros(x_.shape[1], dtype=x_.dtype)
    x_normalized = F.batch_normalization(x_, dummy_g, dummy_b)
    x_normalized = F.reshape(x_normalized, x_shape)
    # Broadcast gamma/beta over (N, H, W) and apply the affine transform.
    gamma = gamma[expander]
    beta = beta[expander]
    return x_normalized * gamma + beta


def run():
    x = cp.random.uniform(-1, 1, (5, 3, 4, 4)).astype(cp.float16)
    gy = cp.random.uniform(-1, 1, (5, 3, 4, 4)).astype(cp.float16)
    ggx = cp.random.uniform(-1, 1, (5, 3, 4, 4)).astype(cp.float16)
    gamma = cp.random.uniform(-1, 1, 3).astype(cp.float16)
    beta = cp.random.uniform(-1, 1, 3).astype(cp.float16)
    ggamma = cp.random.uniform(-1, 1, 3).astype(cp.float16)
    gbeta = cp.random.uniform(-1, 1, 3).astype(cp.float16)

    print('Backward')
    gradient_check.check_backward(
        reshape_and_bn, (x, gamma, beta,), (gy,), dtype=numpy.float64,
        atol=1e-2, rtol=1e-3
    )

    print('Double Backward')
    gradient_check.check_double_backward(
        reshape_and_bn, (x, gamma, beta), (gy,),
        (ggx, ggamma, gbeta), dtype=numpy.float64,
        atol=1e-2, rtol=1e-3
    )


if __name__ == '__main__':
    run()
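
The report states the failure occurs with use_cudnn='always'. The snippet above does not set that configuration explicitly; a sketch of forcing it via Chainer's configuration context (an assumption about how the script was run):

import chainer

# Run the checks with the cuDNN implementation of batch normalization forced.
with chainer.using_config('use_cudnn', 'always'):
    run()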
  • Error messages, stack traces, or logs

For backward,

gradients (numeric):  0.6409951020032167
gradients (backward): -0.5768083848859537


Not equal to tolerance rtol=0.001, atol=0.01

Mismatch: 100%
Max absolute difference: 1.21780349
Max relative difference: 2.1112791
 x: array(0.640995)
 y: array(-0.576808)

assert_allclose failed:
  shape: () ()
  dtype: float64 float64
  i: (0,)
  x[i]: 0.6409951020032167
  y[i]: -0.5768083848859537
  relative error[i]: 2.1112790985691965
  absolute error[i]: 1.2178034868891703
x: 0.6409951
y: -0.57680838

For double backward,

gradients (numeric):  1.773324329406023
gradients (backward): nan


Not equal to tolerance rtol=0.001, atol=0.01

x and y nan location mismatch:
 x: array(1.773324)
 y: array(nan)

assert_allclose failed:
  shape: () ()
  dtype: float64 float64
  i: (0,)
  x[i]: 1.773324329406023
  y[i]: nan
  relative error[i]: nan
  absolute error[i]: nan
x: 1.77332433
y: nan
@toslunar
Member

With NumPy, the errors are also too large.
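
For reference, a CPU-only variant of the same check is a small adaptation of the snippet above (a sketch; the NumPy-backed reshape_and_bn_np helper is hypothetical and not from this thread):

import numpy as np
import chainer.functions as F
from chainer import gradient_check


def reshape_and_bn_np(x, gamma, beta):
    # Same computation as reshape_and_bn, but with NumPy-backed dummy arrays.
    x_shape = x.shape
    expander = (None, Ellipsis, None, None)
    x_ = F.reshape(x, (1, x_shape[0] * x_shape[1], -1, 1))
    dummy_g = np.ones(x_.shape[1], dtype=x_.dtype)
    dummy_b = np.zeros(x_.shape[1], dtype=x_.dtype)
    x_normalized = F.batch_normalization(x_, dummy_g, dummy_b)
    x_normalized = F.reshape(x_normalized, x_shape)
    return x_normalized * gamma[expander] + beta[expander]


x = np.random.uniform(-1, 1, (5, 3, 4, 4)).astype(np.float16)
gy = np.random.uniform(-1, 1, (5, 3, 4, 4)).astype(np.float16)
gamma = np.random.uniform(-1, 1, 3).astype(np.float16)
beta = np.random.uniform(-1, 1, 3).astype(np.float16)
gradient_check.check_backward(
    reshape_and_bn_np, (x, gamma, beta), (gy,), dtype=np.float64,
    atol=1e-2, rtol=1e-3)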

@kmaehashi kmaehashi added the cat:bug Bug report or fix. label Feb 26, 2019
@kmaehashi kmaehashi added the prio:high High priority. Urgent and needs to be worked on as soon as possible. label Feb 26, 2019
@takagi
Member

takagi commented Feb 26, 2019

I will check if this issue is related to #6323.

@crcrpar
Contributor Author

crcrpar commented Mar 14, 2019

@takagi cupy/cupy#2072 and #6497 seem to fix this issue, at least for the above snippet.

@toslunar
Member

I guess cupy/cupy#2060, #6323, and #5924 did.

@grafi-tt
Contributor

@crcrpar Could you try with the current master branch?

@crcrpar
Contributor Author

crcrpar commented Mar 18, 2019

@grafi-tt I tried the snippet with current master, and it worked.

@takagi
Member

takagi commented Mar 19, 2019

I've also confirmed that the snippet works.

Platform: Linux-4.15.0-46-generic-x86_64-with-debian-buster-sid
Chainer: 6.0.0b3
NumPy: 1.16.2
CuPy:
  CuPy Version          : 6.0.0b3
  CUDA Root             : /usr/local/cuda
  CUDA Build Version    : 10000
  CUDA Driver Version   : 10000
  CUDA Runtime Version  : 10000
  cuDNN Build Version   : 7401
  cuDNN Version         : 7401
  NCCL Build Version    : None
  NCCL Runtime Version  : None
iDeep: Not Available

@takagi takagi closed this as completed Mar 19, 2019