
think about what it means to "default" to reparameterization gradient #38

Closed
akucukelbir opened this issue Mar 7, 2016 · 21 comments

@akucukelbir
Contributor

we currently default to the reparameterization gradient if the Variational class implements reparam.

however, if the Inference class does not support reparameterization gradients (e.g., KLpq), then it doesn't matter whether the Variational class implements it or not.
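
roughly, the current dispatch amounts to something like this (the hasattr check and method names are illustrative, not the actual code):

def build_loss(inference, variational):
    # Defaults to the reparameterization gradient whenever the
    # variational family supports it, even if the inference method
    # (e.g., KLpq) has no reparameterized objective to build.
    if hasattr(variational, 'reparam'):
        return inference.build_reparam_loss()
    return inference.build_score_loss()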

@dustinvtran
Member

@mariru is working on MAP, which is another case where we don't necessarily need this score vs. reparam dichotomy. We also need to think about how the class should later incorporate sampling methods (e.g., do we just treat it as an "optimization"?).

@akucukelbir
Contributor Author

how about having a hierarchical method structure, like in Stan?

@dustinvtran
Member

You mean for specifying the inference method? E.g., Inference(method="MFVI")?

@akucukelbir
Contributor Author

hmm. now that i think about it, i'm not sure.

perhaps we have some sort of added hierarchy within Inference.

i don't know how to communicate this so bear with me:

 +-----------+
 | Inference +------------+------------------+
 +-----+-----+            |                  |
       |                  |                  |
       |                  |                  |
+------+------+    +------+-------+    +-----+------+
| Variational |    | Optimization |    |  Sampling  |
+------+------+    +--------------+    +------------+
       |
       |
       +
 MFVI/KLpq/etc.

so the reparam/score loss stuff happens at the variational level (in its implementation of run). perhaps Inference doesn't even need to implement run anymore.

does that make sense?

@dustinvtran
Member

I like the ASCII! This makes sense. I would also put optimization inside variational.

@akucukelbir
Contributor Author

optimization with the score function estimator? is that useful?


@dustinvtran
Member

For example, MAP (and by extension, MLE) is variational inference with a point mass variational family. This is how Maja is currently implementing it.
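
a minimal sketch of the idea, with a hypothetical PointMass class (not Maja's actual code): since every "sample" is the point itself, the expectation in the ELBO collapses to log p(x, z) at the point estimate, so maximizing it is MAP.

import numpy as np

class PointMass:
    """Degenerate variational family: all mass at a single point."""
    def __init__(self, num_vars):
        self.params = np.zeros(num_vars)  # the point estimate itself

    def sample(self, size=1):
        # "Sampling" a point mass just returns its location.
        return np.tile(self.params, (size, 1))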

@akucukelbir
Contributor Author

what does sampling from a point mass mean?

the way i view it: variational inference in this library is basically (by choice) based on stochastic optimization techniques.

MAP and MLE do not need to be based on stochastic optimization. so doesn't it make more sense to separate them?

(i could be missing something here.)


@mariru
Contributor

mariru commented Mar 13, 2016

"sampling" for the point mass means simply returning its value.

If you checkout branch feature/map I implemented a variational family PMGaussian for modeling unconstrained parameters using a point estimate. It should probably get a better name. But I wanted to make the distinction that like MFGaussian the transform for the mean parameter is the identity.

So I think it can be useful to have run() in the variational/optimization parent class but then have methods within run() that get overwritten by the child classes: e.g. call build_loss() within run() in the parent class and then overwrite build_loss() in the child class to call one of build_score_loss() or build_reparam_loss() or build_"other"_loss(). These method specific loss functions can be implemented in the parent class or if a modification is needed they can also be overwritten for a specific inference method.
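
a tiny runnable sketch of this template-method idea (class names follow the discussion; the optimization loop is stubbed out with a print):

class VariationalInference:
    """Parent owns the driver; children fill in the loss."""
    def run(self):
        loss = self.build_loss()    # hook, overwritten by children
        print("optimizing:", loss)  # stand-in for the gradient loop

    def build_loss(self):
        raise NotImplementedError()

class MFVI(VariationalInference):
    def build_loss(self):
        # Would pick build_score_loss() vs. build_reparam_loss() here.
        return "reparam_loss"

MFVI().run()  # prints: optimizing: reparam_loss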

@dustinvtran
Member

Yup, that's a great idea. So right now, Inference would have build_loss(), which just raises NotImplementedError(). Then MFVI would write build_loss() as an if-else chain and return the score or reparam loss. For KLpq, it would just be a single loss, because there is no reparameterization gradient. For MAP, it can just return log p(x, z).

@akucukelbir
Contributor Author

so what's the full spec here? and what would be the best way of making this change? (we should be considerate of stuff happening in other branches.)

@dustinvtran
Member

class Inference:
    def __init__(self, model, data):
        self.model = model
        self.data = data

class MonteCarlo(Inference):
    def __init__(self, *args, **kwargs):
        Inference.__init__(self, *args, **kwargs)

    # not sure what will go here

class VariationalInference(Inference):
    def __init__(self, model, variational, data):
        Inference.__init__(self, model, data)
        self.variational = variational

    def run(self):
        pass

    def initialize(self):
        pass

    def update(self):
        pass

    def build_loss(self):
        raise NotImplementedError()

    def print_progress(self):
        pass

class MFVI(VariationalInference):
    def __init__(self, *args, **kwargs):
        VariationalInference.__init__(self, *args, **kwargs)

    def build_loss(self):
        if ...:
            return self.build_score_loss()
        else:
            return self.build_reparam_loss()

    def build_score_loss(self):
        pass

    def build_reparam_loss(self):
        pass

class KLpq(VariationalInference):
    def __init__(self, *args, **kwargs):
        VariationalInference.__init__(self, *args, **kwargs)

    def build_loss(self):
        pass

class MAP(VariationalInference):
    def __init__(self, model, data):
        variational = PointMass(...)
        VariationalInference.__init__(self, model, variational, data)

    def build_loss(self):
        pass
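
Presumably every method would then be driven the same way (model, variational, and data are placeholders for user objects):

# Hypothetical usage: pick the method by picking the class; all of
# them expose the same run() entry point.
inference = MFVI(model, variational, data)
inference.run()

inference = MAP(model, data)  # builds its own PointMass internally
inference.run()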

@dustinvtran
Member

As for how to implement this, I suggest we do this broad refactor at as early a stage as possible to avoid incurring debt. So we write this in a branch and then individually deal with any merge conflicts in each branch once the pull request is made.

@akucukelbir
Contributor Author

very nice.

wouldn't it be more flexible to have

class MAP(Inference):

again, i'm not entirely following why we want to go with this PointMass approach. is it just to avoid reimplementing some code?

@mariru
Contributor

mariru commented Mar 14, 2016

By doing variational inference with a point mass, you are reusing the gradient descent routine from run() in (variational) inference. Plus, you can use the PointMass objects to encode constraints in the parameter space, but then still do the same optimization as defined in run() in the unconstrained space.
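
For instance (a hypothetical sketch, not the feature/map code), a point mass over a positivity-constrained parameter can keep its free parameter unconstrained and transform it on the way out:

import numpy as np

class PMPositive:
    """Point mass over a positive parameter, optimized unconstrained."""
    def __init__(self, num_vars):
        self.unconstrained = np.zeros(num_vars)  # what run() updates

    def sample(self, size=1):
        # softplus maps R -> R+, so plain gradient descent on
        # `unconstrained` can never leave the constrained space.
        constrained = np.log1p(np.exp(self.unconstrained))
        return np.tile(constrained, (size, 1))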


@dustinvtran
Member

Broadly, I see inference as derived from two paradigms: optimization (variational inference) and sampling (Monte Carlo methods). There are two reasons to include techniques such as MLE, MAP, MML, and MPO as part of the variational inference class:

  1. Conceptually. I personally view variational inference as an umbrella term for any posterior inference method that is formulated as an optimization problem. All these estimation techniques are crude approximate methods based on the mode. Viewing them as approximations justifies, and makes clear the use case for, other approximations such as KL(p||q). (E.g., I don't think it's reasonable to distinguish between inference via approximate posterior means and inference via exact or approximate posterior modes.)
  2. Practically. All optimization-based methods share many defaults: the same optimization routine (e.g., learning rate, gradient descent method) via update(), print_progress() of the iteration and the loss function's value, initialize(), and a general wrapper of all of these in run(). Any of these methods can overwrite one of the defaults or add onto it; see the sketch below.
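
A rough, self-contained sketch of those shared defaults (method bodies are stubs, not the real optimization code):

class VariationalInference:
    """Shared defaults; subclasses overwrite any single piece."""
    def initialize(self):
        pass

    def update(self):
        # One gradient step on the subclass's loss; stubbed here.
        return 0.0

    def print_progress(self, t, loss):
        print("iter {:d} loss {:.3f}".format(t, loss))

    def run(self, n_iter=3):
        self.initialize()
        for t in range(n_iter):
            loss = self.update()
            self.print_progress(t, loss)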

@akucukelbir
Contributor Author

hmm. not to be pedantic here, but i don't think i agree with either point. (also, I don't know what MPO is.)

  1. interpreting MLE, for instance, as a posterior inference method is confusing.
  2. why should all optimization-based methods share the same optimization routine? why would i want to do stochastic gradient ascent instead of conjugate gradient or BFGS if i have exact gradients of my log prob? (see the sketch below.)

a broader point of 1 is, i guess, this: did we decide to frame blackbox as a Bayesian toolbox?

i also didn't follow some of maja's comments. perhaps this is easier to figure out over coffee :)
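
for instance, with exact gradients of the log joint, one could hand the problem straight to a deterministic optimizer (a sketch using SciPy, outside the library; the toy model is mine):

import numpy as np
from scipy.optimize import minimize

# Toy MAP problem with exact gradients: standard normal prior on z,
# N(z, 1) likelihood for each observation.
x = np.array([1.2, 0.8, 1.1])

def neg_log_joint(z):
    return 0.5 * z[0]**2 + 0.5 * np.sum((x - z[0])**2)

def grad(z):
    return np.array([z[0] - np.sum(x - z[0])])

result = minimize(neg_log_joint, x0=np.zeros(1), jac=grad, method='BFGS')
# result.x is the posterior mode; no stochastic optimization involved.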


@dustinvtran
Member

Well, let's agree to disagree then. :)

MPO: marginal posterior optimization

All optimization methods default to gradient descent (data subsampling is optional). Latent variable sampling is currently used, e.g., in MFVI and KLpq, but it's not a necessary distinction. For example, we would ideally have coordinate ascent MFVI if someone wrote down an exponential family graphical model with VIBES-like metadata. (@heywhoah and I are interested in this.)

@akucukelbir
Contributor Author

agree to disagree? what kind of strange proposal is that? :)

let's chat in person. i think i'm missing some things here. (e.g., preferring coordinate ascent? much strangeness abounds :) )

@dustinvtran
Member

I wrote it in the MAP branch. Here's what it looks like: https://github.com/Blei-Lab/blackbox/blob/af3f0528fd116be3dbcfc6d3871ac9119648abce/blackbox/inferences.py

@akucukelbir
Contributor Author

nice work! (i'm not saying that what you and maja propose won't work btw.)

okay, let's discuss today if you both (@dustinvtran @mariru) are around!
