@mitmul mitmul released this Oct 17, 2017 · 600 commits to v3 since this release


This is a major release of Chainer v3.0.0. All the updates from the previous major version (v2.0.0) are found in the release notes below:

The biggest change is the introduction of new-style differentiable functions and resulting support for double backward (gradient of gradient) in many functions. The details are linked below:

As for backward compatibility, most users of v2.x are not affected by the introduction of the new-style function class FunctionNode, because the conventional Function is still supported in v3 (and will be in future versions). Even if you use custom functions written with Function, you can keep running the same code with Chainer v3.0.0. You need to rewrite such custom functions only when you want to use features available exclusively to new-style functions, e.g. double backprop.

The backward compatibility of the overall APIs is slightly broken, though most users are not affected. See the above release notes for the details of broken compatibility.

Examples of grad of grad in Chainer

Usage of the grad function

You can compute gradients of any variables in a computational graph w.r.t. any other variables in the graph using the chainer.grad function with the enable_double_backprop=True option.

import numpy as np
import chainer

x = chainer.Variable(np.array([7.0], dtype=np.float32))
y = x * x * x / 3  # Construct a computational graph

gx, = chainer.grad([y], [x], enable_double_backprop=True)
ggx, = chainer.grad([gx], [x], enable_double_backprop=True)

Here, the above computation of ggx is equivalent to:

gx.backward()
x.grad_var  # => This is equal to the above ggx

Of course, one more differentiation gives us 2:

gggx, = chainer.grad([ggx], [x], enable_double_backprop=True)

print(gggx)  #=> variable([ 2.])
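The same chain of derivatives can be checked without Chainer: since y = x**3 / 3, the successive derivatives are x**2, 2*x, and the constant 2. Below is a minimal sketch in plain Python using nested central differences; the helper d and the step size H are ours for illustration, not part of the Chainer API (and because y is a cubic, the central differences here are exact up to rounding):

```python
# y = x**3 / 3, so dy/dx = x**2, d2y/dx2 = 2*x, d3y/dx3 = 2.

H = 1e-2  # step size; exact here up to rounding because y is a cubic

def d(f, x, h=H):
    """Central-difference derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

def y(x):
    return x ** 3 / 3

def gx(x):    # first derivative, ~ x**2
    return d(y, x)

def ggx(x):   # second derivative, ~ 2*x
    return d(gx, x)

def gggx(x):  # third derivative, ~ 2
    return d(ggx, x)

print(gggx(1.0))  # ~ 2.0, matching the Chainer result above
```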

The loss function of WGAN-GP

WGAN-GP (which stands for Wasserstein GAN with Gradient Penalty [1]) is one example of a GAN that uses gradients of gradients when computing the loss. It penalizes the gradient norm to enforce the Lipschitz constraint, where the gradient is taken at a random interpolation x_hat between a generated sample x_tilde and a real example x. The loss, including the penalty term, is then differentiated again w.r.t. the trainable parameters of the model, so the discriminator actually performs double backward. The code below shows how to implement it using the backward() method with the enable_double_backprop=True option:

# G (generator) and D (discriminator) should be implemented somewhere else

x_tilde = G(z)
x_hat = x + u * (x_tilde - x)

# 1st diff
D(x_hat).backward(enable_double_backprop=True)

# "lambda" is a reserved word in Python, so the coefficient is named lam here
gradient_penalty = lam * (x_hat.grad_var - 1) ** 2
loss = D(x_tilde) - D(x) + gradient_penalty

model.cleargrads()  # to clear the 1st diff of the params
loss.backward()     # 2nd diff

You can also implement it using grad(), which may be faster because it omits the computation of gradients w.r.t. parameters.

x_tilde = G(z)
x_hat = x + u * (x_tilde - x)

# 1st diff
gx_hat, = chainer.grad([D(x_hat)], [x_hat], enable_double_backprop=True)

gradient_penalty = lam * (gx_hat - 1) ** 2
loss = D(x_tilde) - D(x) + gradient_penalty

model.cleargrads()  # to clear the 1st diff of the params
loss.backward()     # 2nd diff
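To see why the penalty term forces a second differentiation, it helps to work a toy case by hand. The sketch below is ours, not part of the example above: it uses a hypothetical one-parameter discriminator D(x; w) = w * x**2, for which grad_x D = 2*w*x. The penalty depends on w through that gradient, so d(penalty)/dw requires differentiating the gradient again; the hand-derived result is checked with a finite difference:

```python
# Toy discriminator D(x; w) = w * x**2 (hypothetical; stands in for a network).
# grad_x D = 2*w*x, so the penalty lam*(grad_x D - 1)**2 depends on w
# through a gradient, and its derivative w.r.t. w needs double backward:
# d(penalty)/dw = 4*lam*x*(2*w*x - 1).

lam = 10.0           # penalty coefficient ("lambda" in the WGAN-GP paper)
w, x_hat = 0.7, 1.3  # toy parameter and interpolated point

grad_x = 2 * w * x_hat                               # 1st diff: grad of D w.r.t. x_hat
penalty = lam * (grad_x - 1) ** 2                    # gradient penalty
dpenalty_dw = 4 * lam * x_hat * (2 * w * x_hat - 1)  # 2nd diff, derived by hand

# Finite-difference check of the hand-derived second backward
h = 1e-6
def penalty_of(w_):
    return lam * (2 * w_ * x_hat - 1) ** 2
fd = (penalty_of(w + h) - penalty_of(w - h)) / (2 * h)
print(abs(fd - dpenalty_dw) < 1e-4)  # True
```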

[1]: I. Gulrajani et al., “Improved Training of Wasserstein GANs,” https://arxiv.org/abs/1704.00028

Here are some simple comparisons of grad of grad in Chainer and other frameworks:
https://gist.github.com/delta2323/9bbca950ee32c523c7aec2e02ad7f85a

New features

  • Add F.flip function (#3532)
  • Functions with double-backprop support: F.swapaxes (#3480), F.permutate (#3481), F.transpose_sequence (#3525)

Bug fixes

  • Workaround for NumPy dot operation bug on non-contiguous arrays (#3478)
  • Fix KeyError when using evaluator without target 'main' (#3460)
  • Fix AttributeError for missing inv_std in F.fixed_batch_normalization backward (#3479, thanks @zaburo-ch!)

Improvements

  • Remove unused invoke_before_training argument from Trainer.extend (#3516)
  • Improve performance of MultiprocessIterator for non-tuple/dict datasets (#3413, thanks @yuyu2172!)
  • Type check in chainer.grad (#3514)

Documentation

  • Document deprecation of stream option of to_gpu (#3519)
  • Add documentation for ParameterStatistics extension (#3323)
  • Fix typos (#3414, thanks @knorth55!; #3455, thanks @HusainZafar!)
  • Fix source links for functions defined with contextlib.contextmanager (#3567)
  • Improve or fix documentation: F.swapaxes, F.squeeze, F.transpose (#3415, thanks @naoto0804!), F.separate, F.select_item, and F.permutate (#3417, thanks @naoto0804!), Constant initializer (#3560), init_scope (#3520), F.reshape (#3515), ConvNet tutorial (#3509)
  • Add documentation of links for framework compatibility (#3476)
  • Fix documentation warnings (#3490)
  • Introduce docstring checker and fix markup of “returns” sections (#3510)
  • Remove obsolete statement about copy between devices in to_gpu (#3517)
  • Fix type-check reference (#3521)
  • Improve style of deprecation notification (#3522)
  • Avoid horizontal scroll of tables (#3538)
  • Add/modify supported versions of dependencies in the installation guide (#3580)

Tests

  • Skip multiprocess interrupt tests (#3412)
  • Add tests for __delattr__ in Link and Chain (#3416, thanks @naoto0804!)
  • Improve numerical_grad accuracy (#3495)
  • Improve test mode of VAE example (#3431)
  • Delete redundant test settings for F.get_item (#3469, thanks @yuyu2172!)
  • Avoid unwanted output of assert_allclose failure (#3518)
  • Stabilize stochastic numerical errors