
Second-order score gradients #875

Open

Bonnevie opened this issue Mar 25, 2018 · 7 comments
@Bonnevie

Bonnevie commented Mar 25, 2018

This is admittedly an esoteric issue.
The score gradient is cleverly implemented in Edward with the step

q_grads = tf.gradients(
      -(tf.reduce_mean(q_log_prob * tf.stop_gradient(losses)) - reg_penalty),
      q_vars)

ensuring that the result is of the (pseudo-code) form mean(score*losses).
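
For concreteness, a minimal self-contained sketch of that pattern (toy Normal q with made-up loc/scale variables and a made-up per-sample loss, TF 1.x graph-mode API assumed; not Edward's actual code) would be something like:

import tensorflow as tf

loc = tf.Variable(0.0)
scale = tf.Variable(1.0)
q = tf.distributions.Normal(loc=loc, scale=scale)
q_vars = [loc, scale]

# treat the draws as constants: the score-function estimator differentiates
# only through log q(z), not through the samples themselves
z = tf.stop_gradient(q.sample(50))
q_log_prob = q.log_prob(z)           # log q(z_i) for each sample
losses = tf.square(z - 3.0)          # stand-in for the per-sample loss f(z_i)

# gradient of mean(log q(z) * stop_gradient(f(z))) w.r.t. the variational
# parameters, i.e. the score-function estimator mean(score * losses)
q_grads = tf.gradients(
    -tf.reduce_mean(q_log_prob * tf.stop_gradient(losses)),
    q_vars)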

Now say that I want to define an operation which takes the score gradient as an input. If I try to take the gradient of this derived expression, the result will be wrong due to the stop_gradient op. Is there a clever idiomatic way to define the score gradient without compromising its derivative?

Note that the score could be computed prior to taking the product with losses, but since tf.gradients only returns the gradient of a (summed) scalar rather than per-sample Jacobians, this would involve unstacking and looping over q_log_prob.
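
A sketch of that per-sample loop, reusing the toy setup above (one tf.gradients call per sample, so it scales poorly with the number of samples):

# per-sample alternative: compute each score separately, then multiply by the
# (non-stopped) losses, so the result stays differentiable
per_sample_log_prob = tf.unstack(q_log_prob)    # list of scalars log q(z_i)
per_sample_loss = tf.unstack(losses)

weighted_scores = []
for lp, loss in zip(per_sample_log_prob, per_sample_loss):
    score = tf.gradients(lp, q_vars)            # d/dtheta log q(z_i), one tensor per variable
    weighted_scores.append([loss * s for s in score])

# average over samples (negated to mirror the objective above);
# no stop_gradient remains in the expression
q_grads = [-tf.add_n(list(per_var)) / float(len(weighted_scores))
           for per_var in zip(*weighted_scores)]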

Thinking it over, maybe the simplest way to attack the problem is to use graph modification to swap the tf.stop_gradient(losses) node for the plain losses tensor?

edit: for those who don't find this an interesting intellectual pursuit in and of itself, I can note that it becomes quite relevant if one wants to calculate the variance gradient for the REBAR and RELAX estimators used in discrete variational approximations.

@dustinvtran
Member

dustinvtran commented Mar 25, 2018

Is there a clever idiomatic way to define the score gradient without compromising its derivative?

Yes, there is! I was just chatting with Jakob Foerster last week about getting DiCE (https://arxiv.org/abs/1802.05098) into Edward. I don't know his GitHub handle, so cc'ing @alshedivat and @rockt, who also worked on it. Contributions are welcome.

@Bonnevie
Author

@dustinvtran Ah, I did see DiCE when it came out, and I looked it over again this Friday hoping it would solve my problem in an instant, but I think this might be a different problem? With DiCE the goal is to build unbiased estimators of higher-order derivatives, while here the goal is to take the derivative of an existing first-order estimator. I can see how my title might be a tad misleading in that respect.

@dustinvtran
Member

dustinvtran commented Mar 25, 2018

Right, it depends on what you're taking derivatives of: exact first-order gradients (which DiCE solves) or the first-order gradient estimator itself.

For the latter, have you seen Edward2's klqp implementation? It avoids tf.stop_gradient altogether by building a "scale factor", which is local to the stochastic node and not global like tf.stop_gradient.

https://github.com/blei-lab/edward/blob/feature/2.0/edward/inferences/klqp.py#L36

@Bonnevie
Author

Bonnevie commented Mar 25, 2018

That's a slightly dense implementation; I might need a few pointers. Is the idea to have the surrogate_loss do an implicit stop_gradient by swapping in the probability calculated using x.value instead of x? Doesn't that lock the node to x.value at definition time?

edit: For the record, using graph_editor seems to work, although it is less elegant.
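
One possible shape of that trick, as a sketch only (it assumes TF 1.x and tf.contrib.graph_editor's graph_replace; the q_log_prob, losses, and q_vars names are from the toy setup above and not necessarily what was actually used here):

from tensorflow.contrib import graph_editor as ge

stopped_losses = tf.stop_gradient(losses)
objective = -tf.reduce_mean(q_log_prob * stopped_losses)
q_grads = tf.gradients(objective, q_vars)

# copy the gradient subgraph, rerouting the stop_gradient output back to the
# raw losses tensor, so the copied gradients can be differentiated further
differentiable_q_grads = ge.graph_replace(q_grads, {stopped_losses: losses})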

@jakobnicolaus

Yes, so DiCE lets you define an objective such that the gradient of the objective is an estimator of the gradient. This holds for arbitrary orders of derivatives, so you don't have to worry about how to differentiate the estimator.
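
For reference, the core of the paper is the MagicBox operator; for the single sample dimension in the toy setup above it can be sketched as follows (toy names reused, not library code):

def magic_box(log_prob):
    # evaluates to 1, but its gradient w.r.t. theta is (d/dtheta log q) * magic_box,
    # so differentiation keeps producing the score term instead of being blocked
    return tf.exp(log_prob - tf.stop_gradient(log_prob))

dice_objective = tf.reduce_mean(magic_box(q_log_prob) * losses)
# tf.gradients(dice_objective, q_vars) recovers the score-function estimator,
# and there is no stop_gradient on losses blocking further differentiation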

I think I understand your use case though and I agree that it's not obvious that DiCE would solve this out of the box.

@Bonnevie
Author

Bonnevie commented Apr 13, 2018

Thanks for the pointer!
From this:

        for model_trace, guide_trace in self._get_traces(model, guide, *args, **kwargs):
            elbo_particle = _compute_dice_elbo(model_trace, guide_trace)
            if is_identically_zero(elbo_particle):
                continue

            elbo += elbo_particle.item() / self.num_particles

it would appear that they don't have a clever way of vectorizing over samples.
