
TensorBoard logging batch level metrics #6692

Closed
sameermanek opened this issue May 19, 2017 · 27 comments

@sameermanek

It'd be useful to have some batch-level logging in TensorBoard when using the TensorBoard callback (as defined in keras/callbacks.py). I think this'd be generally useful for keeping track of stats between epochs.

As an example, there could be a new boolean argument write_batch_performance in the __init__() method, and a new on_batch_end method, something like:

    def on_batch_end(self, batch, logs=None):
        logs = logs or {}

        if self.write_batch_performance:
            for name, value in logs.items():
                # 'batch' and 'size' are bookkeeping entries, not metrics
                if name in ['batch', 'size']:
                    continue
                summary = tf.Summary()
                summary_value = summary.value.add()
                summary_value.simple_value = value.item()
                summary_value.tag = name
                self.writer.add_summary(summary, self.seen)
            self.writer.flush()

        self.seen += self.batch_size

I have a basic version of this locally; I'd need to clean it up slightly and incorporate it into the unit tests. Happy to do so if it makes sense to incorporate this into Keras. I couldn't find any matching outstanding feature requests.

Thanks!

@sxs4337

sxs4337 commented Jun 22, 2017

+1
This would be very useful indeed!
Especially when training on very large datasets.

@Barfknecht

I have been looking for something like this for a week. @sameermanek, does your implementation perform well?

@sameermanek (Author)

I haven't explicitly tested (computational) performance, but it has been useful to me (e.g., when testing things locally on CPU).
The modifications I made locally are here.
In terms of output, this logs the batch-level performance, so I can see whether anything's going haywire relatively early on (clearly, though, I should've stopped this one a little earlier than I did):
[example screenshot]

@Barfknecht

Thanks! This looks really great, and I agree, it's so much easier to see what is happening between epochs. I am currently having problems with a large dataset myself. So this is just a modification of the TensorBoard callback? I will then use the callback as normal. Should I see live graphing of the batches, or will that still only show every epoch?

@sameermanek (Author)

@Barfknecht Correct -- this is just a modification of the TensorBoard callback. I probably gave you the wrong link (sorry about that -- that was just a commit, not the entire file). The file is here, and you can see the added write_batch_performance argument for the TensorBoard callback (line 635).

You should see live graphing of the batches (example screenshot below; one epoch is 5,000 steps in this run, so you can see there are values well before we get to an epoch):
[example screenshot]
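Usage would then be just (a minimal sketch, assuming the patched callback with the proposed write_batch_performance argument; model and data names are illustrative):

# Sketch: enable per-batch logging with the patched callback from this
# thread. `model`, `x_train`, `y_train` are illustrative placeholders.
from keras.callbacks import TensorBoard

tb = TensorBoard(log_dir='./logs', write_batch_performance=True)
model.fit(x_train, y_train, epochs=5, batch_size=32, callbacks=[tb])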

@winni2k

winni2k commented Aug 2, 2017

This would be really useful! Any ETA on when this will be merged into master? I don't see a PR?

@BrikerMan

This is what I needed, thanks a lot.

@sameermanek (Author)

@wkretzsch I just submitted a PR; let's see whether it's accepted or if there's any feedback. Thanks.

@stale

stale bot commented Nov 10, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

stale bot added the stale label Nov 10, 2017
stale bot closed this as completed Dec 10, 2017
@JulesGM

JulesGM commented Dec 20, 2017

Bump. I'd really like to see this make it to master, and it has no negative effect on current users of the TensorBoard callback.

@PeterPanUnderhill

A quick note for those who got only one dot in the graph: remember to set write_batch_performance to True!

@jpcenteno80

Hopefully the PR will be approved soon... Since I didn't want to modify the keras/callbacks.py file, I tried to implement this by subclassing the TensorBoard(Callback) class:

import tensorflow as tf

class TensorBoard_and_write_batch_performance(TensorBoard):
    '''
    Writes batch performance to TensorBoard
    '''
    def on_batch_end(self, batch, logs=None):
        self.seen = 0

        for name, value in logs.items():
            if name in ['batch', 'size']:
                continue
            summary = tf.Summary()
            summary_value = summary.value.add()
            summary_value.simple_value = value.item()
            summary_value.tag = name
            self.writer.add_summary(summary, self.seen)
        self.writer.flush()
        self.seen += self.batch_size
        
        super(TensorBoard_and_write_batch_performance, self).on_batch_end(batch, logs)

However, when I run this, I can only view the batch performance if I use 'RELATIVE':
[screenshot]
I also lose the 'STEP' tracker, so I don't know which epoch I am in.
Finally, I don't have access to the global variable tf, which is why I need to import it in my module.

Does anyone have any tips on how to bring back the epoch tracker with this subclassed-TensorBoard version of write_batch_performance?

@VladislavZavadskyy

@jpcenteno80, try this:

import tensorflow as tf
from keras import callbacks

class TB(callbacks.TensorBoard):
    def __init__(self, log_every=1, **kwargs):
        super().__init__(**kwargs)
        self.log_every = log_every
        self.counter = 0

    def on_batch_end(self, batch, logs=None):
        # A monotonically increasing counter, never reset, serves as the step.
        self.counter += 1
        if self.counter % self.log_every == 0:
            for name, value in logs.items():
                if name in ['batch', 'size']:
                    continue
                summary = tf.Summary()
                summary_value = summary.value.add()
                summary_value.simple_value = value.item()
                summary_value.tag = name
                self.writer.add_summary(summary, self.counter)
            self.writer.flush()

        super().on_batch_end(batch, logs)

Your seen counter is reset at the start of every batch, so the step is always zero.

@RJVisee44

Where exactly does this change have to occur? Tried changing it in:

C:\Users\rc\AppData\Local\conda\conda\envs\tensorflow\Lib\site-packages\tensorflow\contrib\keras\python\keras\callbacks.py

but the code I am running still does not recognise write_batch_performance as an argument.

@JulesGM

JulesGM commented Apr 11, 2018

@RyanCodes44 don't change the code in callbacks.py; just instantiate the class in your own code and use it as a callback, as in https://keras.io/callbacks/#example-recording-loss-history
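For example (a minimal sketch, assuming the TB subclass posted by @VladislavZavadskyy above; model and data names are illustrative):

# Sketch: use the TB subclass as a drop-in callback, without touching
# keras/callbacks.py. `model`, `x_train`, `y_train` are placeholders.
tb = TB(log_every=10, log_dir='./logs')  # write scalars every 10th batch
model.fit(x_train, y_train, epochs=5, batch_size=32, callbacks=[tb])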

@RJVisee44

RJVisee44 commented Apr 11, 2018

Okay thanks!

@bersbersbers

bersbersbers commented Apr 20, 2018

Regarding the code examples by @sameermanek, @jpcenteno80, and @VladislavZavadskyy (I have tried the last one): it seems to work fine, with two issues:

  1. loss and the other metrics are written out for each batch, but val_loss and the val metrics are not (only once per epoch). Is this intended?
  2. with on_epoch_end still active (I need it for val_loss etc., see 1), data whose step number is the running batch number gets interspersed with data whose step number is the epoch number. This leads to artifacts: for example, with 2 batches per epoch, this is what I get:
    [screenshot]
    One can clearly see the trend of both the per-epoch (left hull) and per-batch (right hull) curves, but it would be easier without the per-epoch data (or with the step numbers of the per-epoch data properly corrected).

@bersbersbers

  3. Another issue is the comparison of runs that use different batch sizes: this becomes difficult because the same number of epochs no longer translates into the same number of steps. In view of issues 2 and 3, I believe it would make sense to write per-batch and per-epoch output to separate scalar streams (if that is the correct term) rather than to the same one; see the sketch below.
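One way to realize that separation (a rough illustrative sketch, not code from this thread; it assumes the TF 1.x summary API used above, and the class name and 'batch'/'epoch' subdirectories are assumptions):

import os
import tensorflow as tf
from keras import callbacks

class SplitStreamTensorBoard(callbacks.TensorBoard):
    # Writes per-batch and per-epoch scalars to separate event streams,
    # which show up as two separate runs in TensorBoard.
    def set_model(self, model):
        super().set_model(model)
        self.batch_writer = tf.summary.FileWriter(os.path.join(self.log_dir, 'batch'))
        self.epoch_writer = tf.summary.FileWriter(os.path.join(self.log_dir, 'epoch'))
        self.seen = 0

    def _write_logs(self, writer, logs, step):
        for name, value in (logs or {}).items():
            if name in ['batch', 'size']:
                continue
            summary = tf.Summary()
            summary_value = summary.value.add()
            summary_value.simple_value = value.item()
            summary_value.tag = name
            writer.add_summary(summary, step)
        writer.flush()

    def on_batch_end(self, batch, logs=None):
        self.seen += 1
        self._write_logs(self.batch_writer, logs, self.seen)

    def on_epoch_end(self, epoch, logs=None):
        # Epoch-level metrics (incl. val_*) go to their own stream,
        # so the step axes of the two streams never mix.
        self._write_logs(self.epoch_writer, logs, epoch)

    def on_train_end(self, logs=None):
        super().on_train_end(logs)
        self.batch_writer.close()
        self.epoch_writer.close()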

@achatrian

In answer to 2:
The logs dictionary fed to the on_batch_end method does not contain validation metrics (see https://github.com/keras-team/keras/blob/master/keras/callbacks.py, line 168).

@GuillaumeDesforges

Bump, the PR seems idle. This feature would be really useful!

@jayanthc

@bersbersbers: As a workaround to the second issue you point out, I tend to redefine on_epoch_end in the derived class as:

    def on_epoch_end(self, epoch, logs=None):
        pass

This prevents the superclass' on_epoch_end from being called (which, by the way, may have consequences that you care about), thereby preventing the batch count from being interspersed with the epoch count in the step numbers.

@DmitriiDenisov

@jayanthc But in that case no val metrics will be written at all.

@DmitriiDenisov

DmitriiDenisov commented Aug 16, 2018

I actually suggest adding this method to the class TB(callbacks.TensorBoard):

    def on_epoch_end(self, epoch, logs=None):
        for name, value in logs.items():
            # Keep only validation metrics; train metrics are already
            # written per batch by on_batch_end.
            if (name in ['batch', 'size']) or ('val' not in name):
                continue
            summary = tf.Summary()
            summary_value = summary.value.add()
            summary_value.simple_value = value.item()
            summary_value.tag = name
            self.writer.add_summary(summary, epoch)
        self.writer.flush()

So it will write all the val metrics and skip writing the train metrics (which are already logged per batch).

@jayanthc

jayanthc commented Aug 16, 2018

@DmitriiDenisov: You are right. I ended up adding something similar to what you posted, to on_epoch_end: Gist.

@gabrieldemarmiesse (Contributor)

This feature has been implemented and merged into master.
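For reference, the merged feature is exposed through the callback's update_freq argument (a minimal sketch using the tf.keras API; model and data names are illustrative):

# Sketch: per-batch logging via the merged update_freq argument.
# update_freq accepts 'batch', 'epoch', or an integer; `model`,
# `x_train`, `y_train` are illustrative placeholders.
import tensorflow as tf

tb = tf.keras.callbacks.TensorBoard(log_dir='./logs', update_freq='batch')
model.fit(x_train, y_train, epochs=5, callbacks=[tb])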

@ybagdasa

When I set the update_freq parameter of tf.keras.callbacks.TensorBoard to 128 (update every 128 batches), it only affects the training loss/metrics. The validation loss/metrics are still plotted per epoch. This is my only enabled callback.
[screenshot from 2020-04-21]
I'm using tf version 2.2.0-dev20200331 and tb version 2.3.0a20200331 from a recent nightly release.

@GF-Huang

GF-Huang commented Feb 10, 2021

So how do I show the metrics per batch rather than per epoch?

[screenshot]

[screenshot]
