This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add tensorboard support in Speedometer. #5345

Merged
merged 6 commits into apache:master on Mar 24, 2017

Conversation

@zihaolucky
Member

@piiswrong

Add TensorBoard logging support in Speedometer via an optional import. Any other suggestions for the callback functions? Should I change the Python setup.py and make dmlc/tensorboard a requirement?
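
For context, the optional-import pattern under discussion looks roughly like this. A minimal sketch, assuming the dmlc/tensorboard package exposes a SummaryWriter class:

    import logging

    try:
        # Optional dependency: only needed when TensorBoard logging is requested.
        from tensorboard import SummaryWriter
    except ImportError:
        SummaryWriter = None
        logging.warning('tensorboard is not installed; '
                        'TensorBoard logging will be disabled.')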

@zihaolucky
Member Author

demo

@zihaolucky
Member Author

zihaolucky commented Mar 14, 2017

@piiswrong I added this feature to Speedometer as it's the most commonly used callback function, and it also logs the evaluation metrics, which is convenient.

@piiswrong
Contributor

I think it's better to make a new file and add another callback. Maybe add it back to Speedometer after things have stabilized.

@zihaolucky
Member Author

Agreed. Can I create a tensorboard_callback.py and put this Speedometer (or rename it) there? BatchEndParam only provides scalar values, for logging training speed and evaluation metrics.
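
For reference, BatchEndParam is a plain namedtuple, so a callback only ever sees scalar (name, value) pairs through its eval_metric field. A minimal sketch, mirroring the namedtuple defined in mxnet/model.py:

    from collections import namedtuple

    # Mirrors the definition in mxnet/model.py.
    BatchEndParam = namedtuple('BatchEndParams',
                               ['epoch', 'nbatch', 'eval_metric', 'locals'])

    def print_metrics_callback(param):
        """A batch-end callback receives scalar (name, value) metric pairs."""
        if param.eval_metric is None:
            return
        for name, value in param.eval_metric.get_name_value():
            print('epoch %d, batch %d: %s=%f'
                  % (param.epoch, param.nbatch, name, value))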

@piiswrong
Contributor

Simply tensorboard.py is fine as the name. Also, it shouldn't log anything via the logging module. We need better documentation on the parameters.

@zihaolucky
Member Author

Refactored to log metrics only; it no longer mimics Speedometer. Added an example, with more detailed documentation about the API and TensorBoard as well.
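
The refactored callback ends up shaped roughly like the following. A sketch reconstructed from this thread (optional import, metrics-only logging, no output via the logging module), not a verbatim copy of the merged file:

    import logging

    class LogMetricsCallback(object):
        """Log evaluation metrics to a TensorBoard event file.

        Parameters
        ----------
        logging_dir : str
            Directory where TensorBoard event files are written.
        prefix : str, optional
            Prepended to each metric name, e.g. to distinguish train/eval runs.
        """
        def __init__(self, logging_dir, prefix=None):
            self.prefix = prefix
            try:
                from tensorboard import SummaryWriter
                self.summary_writer = SummaryWriter(logging_dir)
            except ImportError:
                logging.error('You can install tensorboard via '
                              '`pip install tensorboard`.')

        def __call__(self, param):
            """Invoked with a BatchEndParam; writes each metric as a scalar."""
            if param.eval_metric is None:
                return
            for name, value in param.eval_metric.get_name_value():
                if self.prefix is not None:
                    name = '%s-%s' % (self.prefix, name)
                self.summary_writer.add_scalar(name, value)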

@zihaolucky
Member Author

@piiswrong how about now?

@piiswrong
Contributor

Let's move this to mx.contrib

@zihaolucky
Member Author

Yep. Why is the Jenkins build failing?

@piiswrong
Contributor

lint:

    pylint python/mxnet --rcfile=/workspace/tests/ci_build/pylintrc
    ************* Module mxnet.contrib.init
    C: 10, 0: Final newline missing (missing-final-newline)
    ************* Module mxnet.contrib.tensorboard
    C: 29, 0: Line too long (101/100) (line-too-long)
    make: *** [pylint] Error 16

@zihaolucky
Member Author

All passed now. Any further ideas? I'm now working on graph and embedding visualization, which might take some time.

piiswrong merged commit 1550f17 into apache:master on Mar 24, 2017
@piiswrong
Contributor

Thanks. Merged.

@jmerkow
Contributor

jmerkow commented Mar 27, 2017

Can you provide example source code for this? I am attempting to use it with the image_classification examples and I don't get the graphs one would expect.

I used PyPI to install tensorboard:

    $ pip freeze | grep tensorboard
    tensorboard==1.0.0a6

I added/changed the following in common/fit.py in the image classification example to get a dummy test running.
(FYI, there is an error in the docstring for LogMetricsCallback: mx.tensorboard should be mx.contrib.tensorboard.)

    # starting around line 170 in image_classification/common/fit.py
    evaluation_log = 'logs/eval'
    training_log = 'logs/train'
    eval_end_callbacks = [mx.contrib.tensorboard.LogMetricsCallback(evaluation_log)]
    batch_end_callbacks += [mx.contrib.tensorboard.LogMetricsCallback(training_log)]
    # run
    model.fit(train,
        begin_epoch        = args.load_epoch if args.load_epoch else 0,
        num_epoch          = args.num_epochs,
        eval_data          = val,
        eval_metric        = eval_metrics,
        kvstore            = kv,
        optimizer          = args.optimizer,
        optimizer_params   = optimizer_params,
        initializer        = initializer,
        arg_params         = arg_params,
        aux_params         = aux_params,
        batch_end_callback = batch_end_callbacks, # This was updated
        eval_end_callback  = eval_end_callbacks, # This was added
        epoch_end_callback = checkpoint,
        allow_missing      = True,
        monitor            = monitor)

It appears that 'Step' is not being recorded properly. Attached are some screenshots.

[screenshot: TensorBoard scalar graph in step mode]

If you look at the relative graph it becomes clearer what's happening:

[screenshot: TensorBoard scalar graph in relative mode]

[screenshot: TensorBoard scalar graph in relative mode]

@zihaolucky
Member Author

@jmerkow thanks for your feedback!

To clarify, did you use tensorboard --logdir=logs/train or tensorboard --logdir=logs/? As far as I know, with tensorboard --logdir=logs/ the two runs should show up as two different colors in the graph.

You can use the prefix argument of LogMetricsCallback to plot the train and eval metrics in separate graphs:

    eval_end_callbacks = [mx.contrib.tensorboard.LogMetricsCallback(evaluation_log, prefix='eval')]
    batch_end_callbacks += [mx.contrib.tensorboard.LogMetricsCallback(training_log, prefix='train')]

Otherwise they end up in one graph, as in your case, and we then have to use relative mode rather than step mode. Any suggestions are welcome; let's make this better.

@jmerkow
Contributor

jmerkow commented Mar 28, 2017

I used logs/. It just never got to the eval callback; I killed it before it finished. I am just testing TensorBoard before using it to train on my data.
I solved it in my branch by adding the step manually, i.e. changing

    self.summary_writer.add_scalar(name, value)

to

    self.summary_writer.add_scalar(name, value, global_step=param.nbatch)
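
In callback form, the fix looks roughly like this. A sketch; note that param.nbatch resets at each epoch, so a monotonically increasing global counter may serve better as the step:

    def __call__(self, param):
        if param.eval_metric is None:
            return
        for name, value in param.eval_metric.get_name_value():
            if self.prefix is not None:
                name = '%s-%s' % (self.prefix, name)
            # Pass an explicit step so TensorBoard can order the points.
            self.summary_writer.add_scalar(name, value, global_step=param.nbatch)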

@ysh329
Contributor

ysh329 commented Jul 30, 2017

@jmerkow I ran into the same problem as you: I never got the eval callback curve, and the eval and train curves had the same color. I didn't understand what you meant; could you explain your code change below more clearly? Thanks a lot!

    self.summary_writer.add_scalar(name, value)

to

    self.summary_writer.add_scalar(name, value, global_step=param.nbatch)

Guneet-Dhillon pushed a commit to Guneet-Dhillon/mxnet that referenced this pull request Sep 13, 2017
* Add tensorboard support in Speedometer.

* fix pylint.

* Add tensorboard_callback.

* Refactor.

* fix lint.