Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add register_op_hook for gluon #15839

Merged
merged 5 commits into from Sep 18, 2019

Conversation

anirudh2290
Copy link
Member

@anirudh2290 anirudh2290 commented Aug 10, 2019

Description

This PR adds a register_op_hook for Block. One can register_op_hook to monitor all the ops that are called by the HybridBlock after hybridization. For example:

def mon_callback(node_name, opr_name, arr):
      print(py_str(node_name))
      print(py_str(opr_name))
model = mx.gluon.nn.HybridSequential(prefix="dense_")
with model.name_scope():
     model.add(mx.gluon.nn.Dense(2))
model.initialize()
model.hybridize()
model.register_op_hook(mon_callback, monitor_all=True)
model(mx.nd.ones((2, 3, 4)))

This will print the following :

dense_dense0_fwd_data
FullyConnected
dense_dense0_fwd_weight
FullyConnected
dense_dense0_fwd_bias
FullyConnected
dense_dense0_fwd_output
FullyConnected

Setting monitor_all=False will print only the output :

dense_dense0_fwd_output
FullyConnected

What is not supported

  1. register_op_hook on non hybridized blocks is not supported. If a block is not a hybrid block and register_op_hook is called on it, it will be registered only on HybridBlocks which are children of the blocks. The callback will only be invoked after hybridization.
  2. Calback is not supported when static_shape=True for call to hybridize.

cc @Vikas-kum

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

Copy link
Contributor

@Vikas-kum Vikas-kum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

src/imperative/imperative_utils.cc Outdated Show resolved Hide resolved
src/imperative/imperative_utils.cc Outdated Show resolved Hide resolved
@anirudh2290 anirudh2290 force-pushed the monitor_callback_gluon branch 3 times, most recently from f4a24cd to 1adbffa Compare August 19, 2019 23:26
@anirudh2290 anirudh2290 changed the title [WIP] [DONT REVIEW] set_monitor_callback for gluon set_monitor_callback for gluon Aug 19, 2019
@anirudh2290 anirudh2290 added the pr-awaiting-review PR is waiting for code review label Aug 19, 2019
@anirudh2290
Copy link
Member Author

It looks like my changed files are not being picked up and probably cached files are being picked up for CI. @marcoabreu would you know how I can force clear the cache ?

@anirudh2290 anirudh2290 changed the title set_monitor_callback for gluon Add register_op_hook for gluon Aug 27, 2019
Copy link
Contributor

@Vikas-kum Vikas-kum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!! Looks good !

src/common/utils.cc Show resolved Hide resolved
python/mxnet/gluon/block.py Outdated Show resolved Hide resolved
src/common/utils.cc Show resolved Hide resolved
@apeforest
Copy link
Contributor

Have we measured how much performance overhead this might introduce?

@anirudh2290
Copy link
Member Author

@apeforest no we haven't measured perf overhead since This is only used for monitoring. This should not be used in production or for perf intensive use cases.

@anirudh2290
Copy link
Member Author

Thank you @Vikas-kum, @rahul003, @apeforest for the review!!

@anirudh2290 anirudh2290 merged commit b777a69 into apache:master Sep 18, 2019
larroy pushed a commit to larroy/mxnet that referenced this pull request Sep 28, 2019
* Set monitor callback basic support

* Trigger CI

* Add base.pyi and ndarray.pyx

* Change not supported to experimental and check for both static_shape and static_alloc
joapolarbear added a commit to joapolarbear/incubator-mxnet that referenced this pull request Dec 19, 2019
@joapolarbear
Copy link

@anirudh2290 It seems this pr may affect the trace results, below shows the visualized results using chrome://tracing when training a bert model.
image
In contrast, before I use this pr, I get the following dense traces. Many forward and backpropagation operators miss in the above traces.
image
FYI, I encounter the following warning when using this pr. I am not sure whether this warning has any effect.
image

@anirudh2290
Copy link
Member Author

@joapolarbear do you just see much spacing out or is it that ops are missing in one but not in other. registering an op hook is expensive and called at the beginning and end of each op execution and may kill performance. It is meant to be used for debugging and should be commented out during profiling.

@eric-haibin-lin
Copy link
Member

registering an op hook is expensive and called at the beginning and end of each op execution and may kill performance.

@anirudh2290 did we measure how much overhead it is to invoke a dummy hook?

@anirudh2290
Copy link
Member Author

@eric-haibin-lin i didnt measure that, since this was intended only for debugging not to be used in production use cases. If I had to make a guesstimate i would say it would be very similar to invoking a dummy hook with set_monitor_callback in module.

joapolarbear added a commit to joapolarbear/incubator-mxnet that referenced this pull request Dec 31, 2019
assert self._cached_op, "cached op is not None"
if self._callback:
self._cached_op._register_op_hook(self._callback, self._monitor_all)
if len(self._flags) >= 2 and (self._flags[1] or self._flags[0]):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self._flags = list(kwargs.items())
The condition is true even when it is not supposed to be. For example if
self._flags = [('static_alloc', False), ('static_shape', False), ('inline_limit', 2)]

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants