
Second-order score gradients #875

Open

Bonnevie opened this issue Mar 25, 2018 · 7 comments
@Bonnevie

Bonnevie commented Mar 25, 2018

This is admittedly an esoteric issue.
The score gradient is cleverly implemented in Edward with the step

q_grads = tf.gradients(
      -(tf.reduce_mean(q_log_prob * tf.stop_gradient(losses)) - reg_penalty),
      q_vars)

ensuring that the result is of the (pseudo-code) form mean(score*losses).
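
For concreteness, a minimal self-contained sketch of that pattern (toy Normal q with made-up loc/scale variables and a made-up per-sample loss, TF 1.x graph-mode API assumed; not Edward's actual code) would be something like:

import tensorflow as tf

loc = tf.Variable(0.0)
scale = tf.Variable(1.0)
q = tf.distributions.Normal(loc=loc, scale=scale)
q_vars = [loc, scale]

# treat the draws as constants: the score-function estimator differentiates
# only through log q(z), not through the samples themselves
z = tf.stop_gradient(q.sample(50))
q_log_prob = q.log_prob(z)           # log q(z_i) for each sample
losses = tf.square(z - 3.0)          # stand-in for the per-sample loss f(z_i)

# gradient of mean(log q(z) * stop_gradient(f(z))) w.r.t. the variational
# parameters, i.e. the score-function estimator mean(score * losses)
q_grads = tf.gradients(
    -tf.reduce_mean(q_log_prob * tf.stop_gradient(losses)),
    q_vars)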

Now say that I want to define an operation which takes the score gradient as an input. If I try to take the gradient of this derived expression, the result will be wrong due to the stop_gradient op. Is there a clever idiomatic way to define the score gradient without compromising its derivative?

Note that the score could be computed prior to taking the product with losses, but since tf.gradients only returns the gradient of a (summed) scalar rather than per-sample Jacobians, this would involve unstacking and looping over q_log_prob.
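
A sketch of that per-sample loop, reusing the toy setup above (one tf.gradients call per sample, so it scales poorly with the number of samples):

# per-sample alternative: compute each score separately, then multiply by the
# (non-stopped) losses, so the result stays differentiable
per_sample_log_prob = tf.unstack(q_log_prob)    # list of scalars log q(z_i)
per_sample_loss = tf.unstack(losses)

weighted_scores = []
for lp, loss in zip(per_sample_log_prob, per_sample_loss):
    score = tf.gradients(lp, q_vars)            # d/dtheta log q(z_i), one tensor per variable
    weighted_scores.append([loss * s for s in score])

# average over samples (negated to mirror the objective above);
# no stop_gradient remains in the expression
q_grads = [-tf.add_n(list(per_var)) / float(len(weighted_scores))
           for per_var in zip(*weighted_scores)]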

Thinking it over, maybe the simplest way to attack the problem is to use graph modification to swap the tf.stop_gradient(losses) node for the plain losses tensor?

edit: for those who don't find this an interesting intellectual pursuit in and of itself, I can note that it becomes quite relevant if one wants to calculate the variance gradient for the REBAR and RELAX estimators used in discrete variational approximations.

@dustinvtran
Member

dustinvtran commented Mar 25, 2018

Is there a clever idiomatic way to define the score gradient without compromising its derivative?

Yes, there is! I was just chatting with Jakob Foerster last week about getting DiCE (https://arxiv.org/abs/1802.05098) into Edward. I don't know his GitHub handle, so cc'ing @alshedivat and @rockt, who also worked on it. Contributions are welcome.

@Bonnevie
Author

@dustinvtran Ah, I did see DiCE when it came out, and I looked it over again this Friday hoping it would solve my problem in an instant, but I think this might be a different problem? With DiCE the goal is to build unbiased estimators of higher-order derivatives, while here the goal is to take the derivative of an existing first-order estimator. I can see how my title might be a tad misleading in that respect.

@dustinvtran
Member

dustinvtran commented Mar 25, 2018

Right, it depends on what you're taking derivatives of: exact first-order gradients (which DiCE solves) or the first-order gradient estimator itself.

For the latter, have you seen Edward2's klqp implementation? It avoids tf.stop_gradient altogether by building a "scale factor", which is local to the stochastic node and not global like tf.stop_gradient.

https://github.com/blei-lab/edward/blob/feature/2.0/edward/inferences/klqp.py#L36

@Bonnevie
Author

Bonnevie commented Mar 25, 2018

That's a slightly dense implementation; I might need a few pointers. Is the idea to have the surrogate_loss do an implicit stop_gradient by swapping in the probability calculated using x.value instead of x? Doesn't that lock the node to x.value at definition time?

edit: For the record, using graph_editor seems to work, although it is less elegant.
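
One possible shape of that trick, as a sketch only (it assumes TF 1.x and tf.contrib.graph_editor's graph_replace; the q_log_prob, losses, and q_vars names are from the toy setup above and not necessarily what was actually used here):

from tensorflow.contrib import graph_editor as ge

stopped_losses = tf.stop_gradient(losses)
objective = -tf.reduce_mean(q_log_prob * stopped_losses)
q_grads = tf.gradients(objective, q_vars)

# copy the gradient subgraph, rerouting the stop_gradient output back to the
# raw losses tensor, so the copied gradients can be differentiated further
differentiable_q_grads = ge.graph_replace(q_grads, {stopped_losses: losses})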

@jakobnicolaus

Yes, so DiCE lets you define an objective such that the gradient of the objective is an estimator of the gradient. This holds for arbitrary orders of derivatives, so you don't have to worry about how to differentiate the estimator.
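
For reference, the core of the paper is the MagicBox operator; for the single sample dimension in the toy setup above it can be sketched as follows (toy names reused, not library code):

def magic_box(log_prob):
    # evaluates to 1, but its gradient w.r.t. theta is (d/dtheta log q) * magic_box,
    # so differentiation keeps producing the score term instead of being blocked
    return tf.exp(log_prob - tf.stop_gradient(log_prob))

dice_objective = tf.reduce_mean(magic_box(q_log_prob) * losses)
# tf.gradients(dice_objective, q_vars) recovers the score-function estimator,
# and there is no stop_gradient on losses blocking further differentiation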

I think I understand your use case though and I agree that it's not obvious that DiCE would solve this out of the box.

@Bonnevie
Author

Bonnevie commented Apr 13, 2018

Thanks for the pointer!
From this:

        for model_trace, guide_trace in self._get_traces(model, guide, *args, **kwargs):
            elbo_particle = _compute_dice_elbo(model_trace, guide_trace)
            if is_identically_zero(elbo_particle):
                continue

            elbo += elbo_particle.item() / self.num_particles

it would appear that they don't have a clever way of vectorizing over samples.
