
TensorBoard logging batch level metrics #6692

Closed
sameermanek opened this issue May 19, 2017 · 27 comments

@sameermanek

It'd be useful to have some batch-level logging in TensorBoard when using the TensorBoard callback (as defined in keras/callbacks.py). I think this'd be generally useful for keeping track of stats between epochs.

As an example, there could be a new boolean argument write_batch_performance in the __init__() method, and a new on_batch_end method, something like:

    def on_batch_end(self, batch, logs=None):
        logs = logs or {}

        if self.write_batch_performance:
            for name, value in logs.items():
                # 'batch' and 'size' are bookkeeping entries, not metrics
                if name in ['batch', 'size']:
                    continue
                summary = tf.Summary()
                summary_value = summary.value.add()
                summary_value.simple_value = value.item()
                summary_value.tag = name
                self.writer.add_summary(summary, self.seen)
            self.writer.flush()

        self.seen += self.batch_size

I have a basic version of this locally; I'd need to clean it up slightly and incorporate it into the unit tests. Happy to do so if it makes sense to incorporate this into Keras. I couldn't find any matching outstanding feature requests.

Thanks!

@sxs4337

sxs4337 commented Jun 22, 2017

+1
This would be very useful indeed!
Especially when training on very large datasets.

@Barfknecht

I have been looking for something like this for a week. @sameermanek, does your implementation perform well?

@sameermanek (Author)

I haven't explicitly tested (computational) performance, but it has been useful to me (e.g., when testing things locally on CPU).
The modifications I made locally are here.
In terms of output, this logs the batch-level performance, so I can see whether anything's going haywire relatively early on (clearly, though, I should've stopped this one a little earlier than I did):
[example screenshot]

@Barfknecht

Thanks! This looks really great, and I agree, it's so much easier to see what is happening between epochs. I am currently having problems with a large dataset myself. So this is just a modification of the TensorBoard callback? I will then use the callback as normal. Should I see live graphing of the batches, or will that still only show every epoch?

@sameermanek (Author)

@Barfknecht Correct -- this is just a modification of the TensorBoard callback. I probably gave you the wrong link (sorry about that -- that was just a commit, not the entire file). The file is here, and you can see the added write_batch_performance argument for the TensorBoard callback (line 635).

You should see live graphing of the batches (example screenshot below; one epoch is 5,000 steps in this run, so you can see there are values well before we get to an epoch):
[example screenshot]
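Usage would then be just (a minimal sketch, assuming the patched callback with the proposed write_batch_performance argument; model and data names are illustrative):

# Sketch: enable per-batch logging with the patched callback from this
# thread. `model`, `x_train`, `y_train` are illustrative placeholders.
from keras.callbacks import TensorBoard

tb = TensorBoard(log_dir='./logs', write_batch_performance=True)
model.fit(x_train, y_train, epochs=5, batch_size=32, callbacks=[tb])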

@winni2k

winni2k commented Aug 2, 2017

This would be really useful! Any ETA on when this will be merged into master? I don't see a PR?

@BrikerMan

This is what I needed, thanks a lot.

@sameermanek (Author)

@wkretzsch I just submitted a PR; let's see whether it's accepted or if there's any feedback. Thanks.

@stale

stale bot commented Nov 10, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

stale bot added the stale label Nov 10, 2017
stale bot closed this as completed Dec 10, 2017
@JulesGM

JulesGM commented Dec 20, 2017

Bump. I'd really like to see this make it to master, and it has no negative effect on current users of the TensorBoard callback.

@PeterPanUnderhill

A quick note for those who got only one dot in the graph: remember to set write_batch_performance to True!

@jpcenteno80

Hopefully the PR will be approved soon... Since I didn't want to modify the keras/callbacks.py file, I tried to implement this by subclassing the TensorBoard(Callback) class:

import tensorflow as tf

class TensorBoard_and_write_batch_performance(TensorBoard):
    '''
    Writes batch performance to TensorBoard
    '''
    def on_batch_end(self, batch, logs=None):
        self.seen = 0

        for name, value in logs.items():
            if name in ['batch', 'size']:
                continue
            summary = tf.Summary()
            summary_value = summary.value.add()
            summary_value.simple_value = value.item()
            summary_value.tag = name
            self.writer.add_summary(summary, self.seen)
        self.writer.flush()
        self.seen += self.batch_size
        
        super(TensorBoard_and_write_batch_performance, self).on_batch_end(batch, logs)

However, when I run this, I can only view the batch performance if I use 'RELATIVE':
[screenshot]
I also lose the 'STEP' tracker, so I don't know which epoch I am in.
Finally, I don't have access to the global variable tf, which is why I need to import it in my module.

Does anyone have any tips on how to bring back the epoch tracker with this subclassed-TensorBoard version of write_batch_performance?

@VladislavZavadskyy

@jpcenteno80, try this:

import tensorflow as tf
from keras import callbacks

class TB(callbacks.TensorBoard):
    def __init__(self, log_every=1, **kwargs):
        super().__init__(**kwargs)
        self.log_every = log_every
        self.counter = 0

    def on_batch_end(self, batch, logs=None):
        # A monotonically increasing counter, never reset, serves as the step.
        self.counter += 1
        if self.counter % self.log_every == 0:
            for name, value in logs.items():
                if name in ['batch', 'size']:
                    continue
                summary = tf.Summary()
                summary_value = summary.value.add()
                summary_value.simple_value = value.item()
                summary_value.tag = name
                self.writer.add_summary(summary, self.counter)
            self.writer.flush()

        super().on_batch_end(batch, logs)

Your seen counter is reset at the start of every batch, so the step is always zero.

@RJVisee44

Where exactly does this change have to occur? Tried changing it in:

C:\Users\rc\AppData\Local\conda\conda\envs\tensorflow\Lib\site-packages\tensorflow\contrib\keras\python\keras\callbacks.py

but the code I am running still does not recognise write_batch_performance as an argument.

@JulesGM

JulesGM commented Apr 11, 2018

@RyanCodes44 don't change the code in callbacks.py; just instantiate the class in your own code and use it as a callback, as in https://keras.io/callbacks/#example-recording-loss-history
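For example (a minimal sketch, assuming the TB subclass posted by @VladislavZavadskyy above; model and data names are illustrative):

# Sketch: use the TB subclass as a drop-in callback, without touching
# keras/callbacks.py. `model`, `x_train`, `y_train` are placeholders.
tb = TB(log_every=10, log_dir='./logs')  # write scalars every 10th batch
model.fit(x_train, y_train, epochs=5, batch_size=32, callbacks=[tb])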

@RJVisee44

RJVisee44 commented Apr 11, 2018

Okay thanks!

@bersbersbers

bersbersbers commented Apr 20, 2018

Regarding the code examples by @sameermanek, @jpcenteno80, and @VladislavZavadskyy (I have tried the last one): it seems to work fine, with two issues:

  1. loss and the other metrics are written out for each batch, but val_loss and the val metrics are not (only once per epoch). Is this intended?
  2. with on_epoch_end still active (I need it for val_loss etc., see 1), data whose step number is the running batch number gets interspersed with data whose step number is the epoch number. This leads to artifacts: for example, with 2 batches per epoch, this is what I get:
    [screenshot]
    One can clearly see the trend of both the per-epoch (left hull) and per-batch (right hull) curves, but it would be easier without the per-epoch data (or with the step numbers of the per-epoch data properly corrected).

@bersbersbers

  3. Another issue is the comparison of runs that use different batch sizes: this becomes difficult because the same number of epochs no longer translates into the same number of steps. In view of issues 2 and 3, I believe it would make sense to write per-batch and per-epoch output to separate scalar streams (if that is the correct term) rather than to the same one; see the sketch below.
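One way to realize that separation (a rough illustrative sketch, not code from this thread; it assumes the TF 1.x summary API used above, and the class name and 'batch'/'epoch' subdirectories are assumptions):

import os
import tensorflow as tf
from keras import callbacks

class SplitStreamTensorBoard(callbacks.TensorBoard):
    # Writes per-batch and per-epoch scalars to separate event streams,
    # which show up as two separate runs in TensorBoard.
    def set_model(self, model):
        super().set_model(model)
        self.batch_writer = tf.summary.FileWriter(os.path.join(self.log_dir, 'batch'))
        self.epoch_writer = tf.summary.FileWriter(os.path.join(self.log_dir, 'epoch'))
        self.seen = 0

    def _write_logs(self, writer, logs, step):
        for name, value in (logs or {}).items():
            if name in ['batch', 'size']:
                continue
            summary = tf.Summary()
            summary_value = summary.value.add()
            summary_value.simple_value = value.item()
            summary_value.tag = name
            writer.add_summary(summary, step)
        writer.flush()

    def on_batch_end(self, batch, logs=None):
        self.seen += 1
        self._write_logs(self.batch_writer, logs, self.seen)

    def on_epoch_end(self, epoch, logs=None):
        # Epoch-level metrics (incl. val_*) go to their own stream,
        # so the step axes of the two streams never mix.
        self._write_logs(self.epoch_writer, logs, epoch)

    def on_train_end(self, logs=None):
        super().on_train_end(logs)
        self.batch_writer.close()
        self.epoch_writer.close()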

@achatrian

In answer to 2:
The logs dictionary fed to the on_batch_end method does not contain validation metrics (see https://github.com/keras-team/keras/blob/master/keras/callbacks.py, line 168).

@GuillaumeDesforges

Bump, the PR seems idle. This feature would be really useful!

@jayanthc

@bersbersbers: As a workaround to the second issue you point out, I tend to redefine on_epoch_end in the derived class as:

    def on_epoch_end(self, epoch, logs=None):
        pass

This prevents the superclass' on_epoch_end from being called (which, by the way, may have consequences that you care about), thereby preventing the batch count from being interspersed with the epoch count in the step numbers.

@DmitriiDenisov

@jayanthc But in that case no val metrics will be written at all.

@DmitriiDenisov

DmitriiDenisov commented Aug 16, 2018

I actually suggest adding this method to the class TB(callbacks.TensorBoard):

    def on_epoch_end(self, epoch, logs=None):
        for name, value in logs.items():
            # Keep only validation metrics; train metrics are already
            # written per batch by on_batch_end.
            if (name in ['batch', 'size']) or ('val' not in name):
                continue
            summary = tf.Summary()
            summary_value = summary.value.add()
            summary_value.simple_value = value.item()
            summary_value.tag = name
            self.writer.add_summary(summary, epoch)
        self.writer.flush()

So it will write all the val metrics and skip writing the train metrics (which are already logged per batch).

@jayanthc

jayanthc commented Aug 16, 2018

@DmitriiDenisov: You are right. I ended up adding something similar to what you posted, to on_epoch_end: Gist.

@gabrieldemarmiesse (Contributor)

This feature has been implemented and merged into master.
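For reference, the merged feature is exposed through the callback's update_freq argument (a minimal sketch using the tf.keras API; model and data names are illustrative):

# Sketch: per-batch logging via the merged update_freq argument.
# update_freq accepts 'batch', 'epoch', or an integer; `model`,
# `x_train`, `y_train` are illustrative placeholders.
import tensorflow as tf

tb = tf.keras.callbacks.TensorBoard(log_dir='./logs', update_freq='batch')
model.fit(x_train, y_train, epochs=5, callbacks=[tb])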

@ybagdasa

When I set the update_freq parameter of tf.keras.callbacks.TensorBoard to 128 (update every 128 batches), it only affects the training loss/metrics. The validation loss/metrics are still plotted per epoch. This is my only enabled callback.
[screenshot from 2020-04-21]
I'm using tf version 2.2.0-dev20200331 and tb version 2.3.0a20200331 from a recent nightly release.

@GF-Huang

GF-Huang commented Feb 10, 2021

So how do I show the metrics per batch rather than per epoch?

[screenshot]

[screenshot]
