
tensorboard for variational inference #598

Merged (12 commits) on May 28, 2017

Conversation

akshaykhatri639 (Contributor)

Add loss as a scalar and gradients and weights as histograms to the summary.
@dustinvtran (Member) left a comment

Great work so far! You should track the norm of all the gradients too.

Also, see the Travis failure: it's from not adhering to PEP 8.


var_list.update(get_variables(qz, collection=trainables))

for x, qx in six.iteritems(self.data):
  if isinstance(x, RandomVariable) and \
      not isinstance(qx, RandomVariable):
    var_list.update(get_variables(x, collection=trainables))
    data_var_list.update(get_variables(x, collection=trainables))

var_list = list(var_list)
Member

To make this more efficient, I think you can rewrite var_list to be the concatenation of latent_var_list and data_var_list.
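A minimal, hypothetical sketch of that restructuring (latent_var_list and data_var_list are assumed to be built up as in the snippet above; this is not code from the PR):

# build the two sets separately while looping over latent_vars and data,
# then form var_list as their concatenation instead of updating it in parallel
var_list = list(latent_var_list) + list(data_var_list)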

trainables = tf.trainable_variables()
for z, qz in six.iteritems(self.latent_vars):
  if isinstance(z, RandomVariable):
    var_list.update(get_variables(z, collection=trainables))
    latent_var_list.update(get_variables(z, collection=trainables))

  var_list.update(get_variables(qz, collection=trainables))

for x, qx in six.iteritems(self.data):
Member

A good practice is to instantiate variables closest to where they're used, e.g., instantiate data_var_list above this line.

if self.logging:
  # TODO: when var_list is not None
  tf.summary.scalar("loss", self.loss)
  for var in latent_var_list:
Member

These two double for loops could be made more efficient and readable. For example:

for grad, var in grads_and_vars:
  if var in latent_var_list:
    name = "variational"
  elif var in data_var_list:
    name = "model"
  else:
    name = ""

  with tf.name_scope(name):
    tf.summary.histogram("parameter_" + var.name, var)
    tf.summary.histogram("gradient_" + var.name, grad)

This also solves your TODO above.

Some other tips I added to the above:

  1. Use "variational" over "inference", as they're variational parameters in general; "inference" is specific to inference networks in VAEs.

  2. Use name scopes if you notice you're using a prefix everywhere.

Also, is there a reason you used CamelCase?

Contributor Author

No particular reason for CamelCase. I tried both and decided to stick with it just for the aesthetics. Using name scopes is a better idea though.

Contributor Author

On running initialize(), TensorFlow prints these warnings for all variables/gradients:

INFO:tensorflow:Summary name parameter_dense_1/bias:0 is illegal; using parameter_dense_1/bias_0 instead.
INFO:tensorflow:Summary name gradient_gradients/inference_4608532368/0/dense_1/BiasAdd_grad/BiasAddGrad:0 is illegal; using gradient_gradients/inference_4608532368/0/dense_1/BiasAdd_grad/BiasAddGrad_0 instead.

Should we do something about it?

Member

Yes, try enumerating them with _0, _1, etc. for each unique name?

Also, can you check that the variables are given their appropriate summary names? parameter_dense_1/bias:0 sounds like a model parameter, but it's not in the model name scope. Same for the second line.
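A hypothetical sketch of that suffixing idea (the helper name and counter are placeholders, not part of this PR):

from collections import defaultdict

name_counts = defaultdict(int)

def unique_summary_name(name):
  # drop the illegal ':' and append _0, _1, ... per repeated base name
  base = name.replace(':', '_')
  suffix = '_%d' % name_counts[base]
  name_counts[base] += 1
  return base + suffix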

@akshaykhatri639 (Contributor Author), Apr 9, 2017

Should I simply replace ':' with '_'? The names include the scope when they are displayed on TensorBoard; they don't print the full names there.

[screenshot: TensorBoard summary names]

But as seen in the screenshot, there is one other problem: not all variables are grouped under scope 'model'. TensorFlow uses 'model_1' when tf.name_scope(scope_name) is entered a second time.

Should I add all the variables and gradients from one scope in one go? Something like:

with tf.name_scope('model'):
  for grad, var in grads_and_vars:
    if var in data_var_list:
      tf.summary.histogram("parameter_" + var.name, var)
      tf.summary.histogram("gradient_" + grad.name, grad)

and similarly for inference.

Member

To your two questions: yes and yes.

tf.summary.histogram("ModelWeight_" + var.name, var)
tf.summary.histogram("ModelGradient_" + var.name, grad_var_tuple[0])

self.summarize = tf.summary.merge_all()
Member

You define self.summarize here, but it's also defined when using the parent method's initialize (Inference).

You're right that we need to merge the summaries after they're defined and not before. It sounds like we need to decide where to properly put the merge summary op; not sure if it's as simple as moving the call to the parent method after VariationalInference finishes its initialize instead of before.

Contributor Author

Yes, it doesn't work when self.summarize is defined before the summary ops are defined.
I tried moving the call to the parent's initialize() to the end of the function, but we use variables like self.logging that are initialized in the parent call. Here it would be easy to modify, but that may not be the case for all classes that inherit from Inference.

Or we could call summary.merge_all() in each of the child classes and remove it from the parent. For now, we can let it be defined in both until we modify the other classes.

Contributor

Maybe it would even make sense to have the summary writer object outside the inference class? Or else how does it work with composed inferences (http://edwardlib.org/api/inference-compositionality)?

Member

@mariru. Good point. It sounds like self.summary shouldn't merge all summaries in the TensorFlow graph, but only summaries added inside its class. That way self.summary is localized and thus compatible with compositions.

Contributor Author

I think summary.merge_all() will merge all the summaries in the graph. We could instead use tf.summary.merge(), which takes as input a list of summary ops. Something like:
self.summarize = tf.summary.merge(summary_objects)

@dustinvtran (Member), Apr 12, 2017

tf.summary.merge requires storing all the summary objects. However, we build summaries in various methods of the Inference class, so this is not very TensorFlow-y.

We can use a unique graph collection for each inference; then we can merge all summaries within that collection. For example:

# summaries are added like this
summary_key = 'summaries_' + str(id(inference))
tf.summary.scalar(..., collections=[summary_key])
tf.summary.scalar(..., collections=[summary_key])

# when forming the merged summary
summary_key = 'summaries_' + str(id(inference))
self.summarize = tf.summary.merge_all(summary_key)
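For context, a minimal hypothetical sketch of how such a per-inference collection could be consumed during training (the log directory, inference, and train_op names are placeholders, not from this PR):

import tensorflow as tf

summary_key = 'summaries_' + str(id(inference))
merged = tf.summary.merge_all(summary_key)
writer = tf.summary.FileWriter('/tmp/log', tf.get_default_graph())

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  for t in range(1000):
    # train_op is whatever op the inference builds; merged evaluates only this
    # inference's summaries because of the dedicated collection key
    _, summary = sess.run([train_op, merged])
    writer.add_summary(summary, global_step=t)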

@dustinvtran (Member)

Bump. It would be great to get this merged in before you start working on the Monte Carlo / more specific variational method stuff.

@@ -101,6 +101,12 @@ def initialize(self, optimizer=None, optimizer_d=None,
self.train_d = optimizer_d.apply_gradients(grads_and_vars_d,
                                           global_step=global_step_d)

if self.logging:
  summary_key = 'summaries_' + str(id(self))
  tf.summary.scalar('loss_fake', self.loss_d, collections=[summary_key])
Member

Can you rename these to 'loss_discriminative' and 'loss_generative' respectively?

Contributor Author

Done.

@dustinvtran (Member)

Are you including the summary ops for the VI losses here too?

@akshaykhatri639 (Contributor Author)

Yes.

@dustinvtran (Member)

@akshaykhatri639 I'm not sure if I follow: do you mean they will be in this PR?

tf.summary.scalar("gradient_norm_" +
                  grad.name.replace(':', '_'),
                  tf.norm(grad), collections=[summary_key])
# replace ':' with '_' because tf does not allow ':' in summary names
Contributor

@akshaykhatri639 Really looking forward to this feature!
I tried it using MAP inference, but the naming turns out weird.

[screenshot: TensorBoard showing gradient summaries named after gradient ops]

I would replace grad.name.replace(':', '_') with var.name.replace(':', '_') everywhere, because the gradients are named after the operations. By prefacing it with gradient_, it will still be clear from the name that it's a gradient.

I have been using tf.name_scope('model') for the actual model.
[screenshot: TensorBoard graph with the 'model' name scope]

Maybe rename to tf.name_scope('training') for logging? But that's minor. I can just rename the scope for my model.
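A one-line sketch of the renaming suggested above (grad and var are assumed to come from iterating grads_and_vars, as elsewhere in this PR):

# name the gradient-norm summary after the variable rather than the gradient op
tf.summary.scalar("gradient_norm_" + var.name.replace(':', '_'),
                  tf.norm(grad), collections=[summary_key])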

Contributor

  1. Naming of gradients is messy: see comment above.

  2. Tracking gradients is super useful, but it might be even more useful to look at gradients*learning_rate or (parameters - gradients*learning_rate)/parameters.
    Especially with adaptive learning rates, when the user doesn't know the current learning rate, it is impossible to tell from the histograms whether the gradient is in the right ballpark. (A rough sketch of what this could look like is at the end of this comment.)

  3. When training blows up and some parameters become NaN or inf (e.g. because the learning rate is too large), I get an error message that is hard to decipher.

InvalidArgumentError (see above for traceback): Nan in summary histogram for: model_1/parameter_model/embeddings/alpha_0
         [[Node: model_1/parameter_model/embeddings/alpha_0 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](model_1/parameter_model/embeddings/alpha_0/tag, model/embeddings/alpha/read)]]

Maybe there's a nicer way to catch this? I guess that's more an idiosyncrasy of TensorFlow than of Edward. But still, catching it in a more readable way would be nice.
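A rough, hypothetical sketch of point 2 above (assumes dense gradients and a known, fixed learning_rate; adaptive optimizers scale their updates differently, which is exactly the complication discussed in the replies):

for grad, var in grads_and_vars:
  # relative update size: ||learning_rate * grad|| / ||var||
  relative_update = tf.norm(learning_rate * grad) / (tf.norm(var) + 1e-8)
  tf.summary.scalar('relative_update_' + var.name.replace(':', '_'),
                    relative_update, collections=[summary_key])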

@akshaykhatri639 (Contributor Author), May 7, 2017

  1. You are right about the names of the summary ops. I will make the required changes.

  2. Won't this become a bit complicated? Each optimizer has a slightly different way of modifying the weights and may store one or more cache/momentum matrices for each gradient. Should we implement it differently for each optimizer? Otherwise, the values in the summary still won't be close to the values that the weights were actually updated with.

  3. Should I check if any values become NaN before writing the summary ops?

Member

There's a self.debug member in Edward inferences for checking whether certain ops blow up. I agree with Maja that it would be nice if we could somehow raise a more informative error (but without running a check every iteration just to raise that error).
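On the NaN question, one possible shape for a more readable failure (tf.check_numerics is a standard TensorFlow op; wiring it into the summary code like this is only a sketch, not what this PR implements):

# validate each parameter before summarizing so a NaN/Inf raises an error
# that names the offending variable instead of a summary-histogram node
for grad, var in grads_and_vars:
  checked = tf.check_numerics(var, 'NaN/Inf in parameter ' + var.name)
  tf.summary.histogram('parameter_' + var.name.replace(':', '_'),
                       checked, collections=[summary_key])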

@dustinvtran (Member) commented May 28, 2017

Merging this now. I'll personally make changes that revise this slightly in another PR.
