
think about what it means to "default" to reparameterization gradient #38

Closed
akucukelbir opened this issue Mar 7, 2016 · 21 comments

@akucukelbir
Contributor

we currently default to the reparameterization gradient if the Variational class implements reparam.

however, if the Inference class does not support reparameterization gradients (e.g., KLpq), then it doesn't matter whether the Variational class implements it or not.
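
roughly, the current dispatch amounts to something like this (the hasattr check and method names are illustrative, not the actual code):

def build_loss(inference, variational):
    # Defaults to the reparameterization gradient whenever the
    # variational family supports it, even if the inference method
    # (e.g., KLpq) has no reparameterized objective to build.
    if hasattr(variational, 'reparam'):
        return inference.build_reparam_loss()
    return inference.build_score_loss()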

@dustinvtran
Member

@mariru is working on MAP, which is another case where we don't necessarily need this score vs. reparam dichotomy. We also need to think about how the class should later incorporate sampling methods (e.g., do we just treat it as an "optimization"?).

@akucukelbir
Contributor Author

how about having a hierarchical method structure, like in Stan?

@dustinvtran
Member

You mean for specifying the inference method? E.g., Inference(method="MFVI")?

@akucukelbir
Contributor Author

hmm. now that i think about it, i'm not sure.

perhaps we have some sort of added hierarchy within Inference.

i don't know how to communicate this so bear with me:

 +-----------+
 | Inference +------------+------------------+
 +-----+-----+            |                  |
       |                  |                  |
       |                  |                  |
+------+------+    +------+-------+    +-----+------+
| Variational |    | Optimization |    |  Sampling  |
+------+------+    +--------------+    +------------+
       |
       |
       +
 MFVI/KLpq/etc.

so the reparam/score loss stuff happens at the variational level (in its implementation of run). perhaps Inference doesn't even need to implement run anymore.

does that make sense?

@dustinvtran
Member

I like the ASCII! This makes sense. I would also put optimization inside variational.

@akucukelbir
Contributor Author

optimization with the score function estimator? is that useful?


@dustinvtran
Member

For example, MAP (and by extension, MLE) is variational inference with a point mass variational family. This is how Maja is currently implementing it.
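
a minimal sketch of the idea, with a hypothetical PointMass class (not Maja's actual code): since every "sample" is the point itself, the expectation in the ELBO collapses to log p(x, z) at the point estimate, so maximizing it is MAP.

import numpy as np

class PointMass:
    """Degenerate variational family: all mass at a single point."""
    def __init__(self, num_vars):
        self.params = np.zeros(num_vars)  # the point estimate itself

    def sample(self, size=1):
        # "Sampling" a point mass just returns its location.
        return np.tile(self.params, (size, 1))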

@akucukelbir
Contributor Author

what does sampling from a point mass mean?

the way i view it: variational inference in this library is basically (by choice) based on stochastic optimization techniques.

MAP and MLE do not need to be based on stochastic optimization. so doesn't it make more sense to separate them?

(i could be missing something here.)


@mariru
Contributor

mariru commented Mar 13, 2016

"sampling" for the point mass means simply returning its value.

If you checkout branch feature/map I implemented a variational family PMGaussian for modeling unconstrained parameters using a point estimate. It should probably get a better name. But I wanted to make the distinction that like MFGaussian the transform for the mean parameter is the identity.

So I think it can be useful to have run() in the variational/optimization parent class but then have methods within run() that get overwritten by the child classes: e.g. call build_loss() within run() in the parent class and then overwrite build_loss() in the child class to call one of build_score_loss() or build_reparam_loss() or build_"other"_loss(). These method specific loss functions can be implemented in the parent class or if a modification is needed they can also be overwritten for a specific inference method.
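
a tiny runnable sketch of this template-method idea (class names follow the discussion; the optimization loop is stubbed out with a print):

class VariationalInference:
    """Parent owns the driver; children fill in the loss."""
    def run(self):
        loss = self.build_loss()    # hook, overwritten by children
        print("optimizing:", loss)  # stand-in for the gradient loop

    def build_loss(self):
        raise NotImplementedError()

class MFVI(VariationalInference):
    def build_loss(self):
        # Would pick build_score_loss() vs. build_reparam_loss() here.
        return "reparam_loss"

MFVI().run()  # prints: optimizing: reparam_loss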

@dustinvtran
Member

Yup, that's a great idea. So right now, Inference would have build_loss(), which just raises NotImplementedError(). Then MFVI would write build_loss() as an if-else chain and return the score or reparam loss. For KLpq, it would just be a single loss, because there is no reparameterization gradient. For MAP, it can just return log p(x, z).

@akucukelbir
Contributor Author

so what's the full spec here? and what would be the best way of making this change? (we should be considerate of stuff happening in other branches.)

@dustinvtran
Member

class Inference:
    def __init__(self, model, data):
        self.model = model
        self.data = data

class MonteCarlo(Inference):
    def __init__(self, *args, **kwargs):
        Inference.__init__(self, *args, **kwargs)

    # not sure what will go here

class VariationalInference(Inference):
    def __init__(self, model, variational, data):
        Inference.__init__(self, model, data)
        self.variational = variational

    def run(self):
        pass

    def initialize(self):
        pass

    def update(self):
        pass

    def build_loss(self):
        raise NotImplementedError()

    def print_progress(self):
        pass

class MFVI(VariationalInference):
    def __init__(self, *args, **kwargs):
        VariationalInference.__init__(self, *args, **kwargs)

    def build_loss(self):
        if ...:
            return self.build_score_loss()
        else:
            return self.build_reparam_loss()

    def build_score_loss(self):
        pass

    def build_reparam_loss(self):
        pass

class KLpq(VariationalInference):
    def __init__(self, *args, **kwargs):
        VariationalInference.__init__(self, *args, **kwargs)

    def build_loss(self):
        pass

class MAP(VariationalInference):
    def __init__(self, model, data):
        variational = PointMass(...)
        VariationalInference.__init__(self, model, variational, data)

    def build_loss(self):
        pass
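
Presumably every method would then be driven the same way (model, variational, and data are placeholders for user objects):

# Hypothetical usage: pick the method by picking the class; all of
# them expose the same run() entry point.
inference = MFVI(model, variational, data)
inference.run()

inference = MAP(model, data)  # builds its own PointMass internally
inference.run()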

@dustinvtran
Member

As for how to implement this, I suggest we do this broad refactor at as early a stage as possible to avoid incurring debt. So we write this in a branch and then individually deal with any merge conflicts in each branch once the pull request is made.

@akucukelbir
Contributor Author

very nice.

wouldn't it be more flexible to have

class MAP(Inference):

again, i'm not entirely following why we want to go with this PointMass approach. is it just to avoid reimplementing some code?

@mariru
Contributor

mariru commented Mar 14, 2016

By doing variational inference with a point mass, you are reusing the gradient descent routine from run() in (variational) inference. Plus, you can use the PointMass objects to encode constraints in the parameter space, but then still do the same optimization as defined in run() in the unconstrained space.
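
For instance (a hypothetical sketch, not the feature/map code), a point mass over a positivity-constrained parameter can keep its free parameter unconstrained and transform it on the way out:

import numpy as np

class PMPositive:
    """Point mass over a positive parameter, optimized unconstrained."""
    def __init__(self, num_vars):
        self.unconstrained = np.zeros(num_vars)  # what run() updates

    def sample(self, size=1):
        # softplus maps R -> R+, so plain gradient descent on
        # `unconstrained` can never leave the constrained space.
        constrained = np.log1p(np.exp(self.unconstrained))
        return np.tile(constrained, (size, 1))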


@dustinvtran
Member

Broadly, I see inference as derived from two paradigms: optimization (variational inference) and sampling (Monte Carlo methods). There are two reasons to include techniques such as MLE, MAP, MML, and MPO as part of the variational inference class:

  1. Conceptually. I personally view variational inference as an umbrella term for any posterior inference method that is formulated as an optimization problem. All these estimation techniques are crude approximate methods based on the mode. Viewing them as approximations justifies, and makes clear the use case for, other approximations such as KL(p||q). (E.g., I don't think it's reasonable to distinguish between inference via approximate posterior means and inference via exact or approximate posterior modes.)
  2. Practically. All optimization-based methods share many defaults: the same optimization routine (e.g., learning rate, gradient descent method) via update(), print_progress() of the iteration and the loss function's value, initialize(), and a general wrapper of all of these in run(). Any of these methods can overwrite one of the defaults or add onto it; see the sketch below.
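
A rough, self-contained sketch of those shared defaults (method bodies are stubs, not the real optimization code):

class VariationalInference:
    """Shared defaults; subclasses overwrite any single piece."""
    def initialize(self):
        pass

    def update(self):
        # One gradient step on the subclass's loss; stubbed here.
        return 0.0

    def print_progress(self, t, loss):
        print("iter {:d} loss {:.3f}".format(t, loss))

    def run(self, n_iter=3):
        self.initialize()
        for t in range(n_iter):
            loss = self.update()
            self.print_progress(t, loss)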

@akucukelbir
Contributor Author

hmm. not to be pedantic here, but i don't think i agree with either point. (also, I don't know what MPO is.)

  1. interpreting MLE, for instance, as a posterior inference method is confusing.
  2. why should all optimization-based methods share the same optimization routine? why would i want to do stochastic gradient ascent instead of conjugate gradient or BFGS if i have exact gradients of my log prob? (see the sketch below.)

a broader point of 1 is, i guess, this: did we decide to frame blackbox as a Bayesian toolbox?

i also didn't follow some of maja's comments. perhaps this is easier to figure out over coffee :)
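
for instance, with exact gradients of the log joint, one could hand the problem straight to a deterministic optimizer (a sketch using SciPy, outside the library; the toy model is mine):

import numpy as np
from scipy.optimize import minimize

# Toy MAP problem with exact gradients: standard normal prior on z,
# N(z, 1) likelihood for each observation.
x = np.array([1.2, 0.8, 1.1])

def neg_log_joint(z):
    return 0.5 * z[0]**2 + 0.5 * np.sum((x - z[0])**2)

def grad(z):
    return np.array([z[0] - np.sum(x - z[0])])

result = minimize(neg_log_joint, x0=np.zeros(1), jac=grad, method='BFGS')
# result.x is the posterior mode; no stochastic optimization involved.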


@dustinvtran
Member

Well, let's agree to disagree then. :)

MPO: marginal posterior optimization

All optimization methods default to gradient descent (data subsampling is optional). Latent variable sampling is currently used, e.g., in MFVI and KLpq, but it's not a necessary distinction. For example, we would ideally have coordinate ascent MFVI if someone wrote down an exponential family graphical model with VIBES-like metadata. (@heywhoah and I are interested in this.)

@akucukelbir
Contributor Author

agree to disagree? what kind of strange proposal is that? :)

let's chat in person. i think i'm missing some things here. (e.g., preferring coordinate ascent? much strangeness abounds :) )

@dustinvtran
Member

I wrote it in the MAP branch. Here's what it looks like: https://github.com/Blei-Lab/blackbox/blob/af3f0528fd116be3dbcfc6d3871ac9119648abce/blackbox/inferences.py

@akucukelbir
Contributor Author

nice work! (i'm not saying that what you and maja propose won't work btw.)

okay, let's discuss today if you both (@dustinvtran @mariru) are around!
