[DEBUG]Support a debug framework for TVM Runtime #1378
@merrymercy @Huyuwei @yzhliu can you please review this?
Did a quick review, so far the debug graph runtime looks good to me. Not quite familiar with the curses UI APIs though.
docs/dev/tvmdbg.md
Outdated
https://github.com/tensorflow/tensorflow
https://github.com/tensorflow/tensorboard
https://github.com/awslabs/mxboard
Break this into three lines.
docs/dev/tvmdbg.md
Outdated
2. Make tvm so that it will make the `libtvm_runtime.so`

3. In the graph build file instead of `from tvm.contrib import graph_runtime` import the debug_runtime `from tvm.contrib.debugging import debug_runtime as graph_runtime`
``` |
use ```python to highlight the py keywords.
docs/dev/tvmdbg.md
Outdated
The above two modifications will bring up the debug UI during run.

The HOME page of tvmdbg looks like below.
![](https://raw.githubusercontent.com/siju-samuel/tvmdbg/master/docs/dev/_images/tvm_dbg1.png)
@tqchen do we have some 'official' place to store these images?
dmlc/web-data can be used for all content.
Let us use https://github.com/tvmai/tvmai.github.io for this
#menu.append(
# ui_common.MenuItem("home", "home"))
#menu.append(
# ui_common.MenuItem("help", "help"))
remove unused lines?
ValueError: on failure to parse the input `size_str`.
"""

size_str = size_str.strip()
Better to normalize with size_str.upper() and then check only the upper-case K, B, M, G, etc.
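To illustrate the suggestion, here is a minimal sketch of a size parser that upper-cases the input once and then matches only upper-case units. The function name `parse_size_str`, the unit table, and the accepted syntax are assumptions for illustration; the helper in the PR may differ.

```python
import re

# Hypothetical unit table; the helper under review may accept a different set.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size_str(size_str):
    """Convert a string such as "10kB" or "1.5 G" into a byte count."""
    # Normalize once so that only upper-case units need to be checked.
    normalized = size_str.strip().upper()
    match = re.fullmatch(r"([0-9]*\.?[0-9]+)\s*([KMG]?B?)", normalized)
    if match is None:
        raise ValueError("Failed to parse size string: %r" % size_str)
    number, unit = match.groups()
    if unit in ("K", "M", "G"):
        unit += "B"  # treat "K" as shorthand for "KB", and so on
    return int(float(number) * _UNITS[unit or "B"])
```

With this shape, `parse_size_str("10kB")` and `parse_size_str("10KB")` take the same code path, so no lower-case branches are needed.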
""" | ||
dbg_out_buffer_list = []
for i in range(len(shapes_list[1])):
    dbg_out_buffer_list.append(nd.empty(shapes_list[1][i], dltype_list[1][i]))
Just to make sure I understand correctly: these buffers always stay on the CPU and are used to store the network outputs for debugging?
Yes, they always stay on the CPU. They store the outputs of each fused op and are updated after each op executes.
Parameters
----------
tensor: The tensor to be displayed, as a numpy ndarray or other
Please follow the parameter format described at https://docs.tvm.ai/contribute/document.html#document-python
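For reference, a numpydoc-style docstring of the kind the linked guide describes might look like the sketch below. The function `format_tensor` and its arguments are hypothetical and only illustrate the `name : type` layout; sections that would be empty are simply left out.

```python
def format_tensor(tensor, name="tensor"):
    """Render a tensor as a short, human-readable string.

    Parameters
    ----------
    tensor : numpy.ndarray or list
        The tensor to be displayed.

    name : str, optional
        Label printed before the values.

    Returns
    -------
    text : str
        The formatted representation.
    """
    return "%s: %s" % (name, tensor)
```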
Thanks, @siju-samuel, for the contribution, and sorry for the delayed review. Looking at the current PR, I think we can split it into two logical parts. The debugger currently works by running the debug runtime to collect all the data we need, and then invoking a visualizer to step through the collected result. So logically, we can draw a picture like
Let us change this PR to only contain We can bring this in first. Once this PR gets merged, users can readily try out the debugger feature if we wrap the curses visualizer as a separate Python package that you maintain. I would recommend we collect feedback from users and improve the curses UX before we stabilize it and merge it back to master. The tensorboard support can also be done in concurrent PRs. I recommend this path because it will force us to think carefully about the separation of UX and data logging, and it also ensures high-quality components, as @yzhliu and I can review the data logging components but are less familiar with the UX part.
@srkreddy1238 @PariksheetPinjari909 can you also please help review?
@tqchen the review comments are addressed and all the curses functions now follow the TVM guidelines.
If we keep it as a separate package, users won't use this feature extensively, so I suggest we keep it as part of the tvm package in a separate folder; I will take complete maintenance responsibility for tvmdbg. This has no impact on the actual runtime, as it is disabled by default. I want users to use this and give their feedback so that it can be improved further.
@siju-samuel I understand your concerns here. Let us aim to merge this in, but still do it as a two-step PR (raise the UX part in a separate PR). The first PR should contain
The reason to push for this is to force us to make a clear API separation and pave the path for tensorboard integration. I think it would also help a lot in getting reviewers to review the UX part once the exchange format is clear.
d75c6be to d06b19a
@tqchen @yzhliu
The second PR, covering ncurses, will be based on the review comments received for this one.
@siju-samuel can you remove the usage section in tvmdbg.md for now (as it is UI related) and instead add a section about the Debug Exchange Format that documents the complete exchange file format (how each field of the JSON corresponds to each piece of information)?
@tqchen updated as per your comment.
Thanks for the update. We still need to improve the specification so that it gives all the details instead of only examples. Please refer to the NNVM JSON spec (https://docs.tvm.ai/dev/nnvm_json_spec.html). As for the tensor storage format, one thing we might do is provide a base64 serialization format for the tensor so that it can be embedded into the JSON; would that be helpful? See the related PR #1452.
Thanks for the review. I updated the JSON format used in the exchange. Please let me know if I need to update it further.
If we store the tensors along with the graph dump, the file will become very large for big networks. When I checked tensorflow/mxboard, they also create separate files for tensors. Another point is that the graph won't change, but the tensors can keep changing based on the input data. So if the user runs the graph again and again with different inputs, we would need to rewrite the complete graph dump file each time. The numpy dump format is simple and won't consume much memory, and tensorboard and curses can load it directly without any conversion.
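To make the trade-off in this exchange concrete, here is a small standard-library sketch of what embedding a tensor in JSON via base64 could look like. The field names (`dtype`, `shape`, `data`) and the float32 little-endian layout are assumptions for illustration only; they are not the format from this PR or from #1452.

```python
import base64
import json
import struct


def encode_tensor(values, shape):
    """Pack float32 values into a JSON-friendly dict (hypothetical layout)."""
    raw = struct.pack("<%df" % len(values), *values)
    return {
        "dtype": "float32",
        "shape": list(shape),
        "data": base64.b64encode(raw).decode("ascii"),
    }


def decode_tensor(entry):
    """Inverse of encode_tensor; returns (values, shape)."""
    raw = base64.b64decode(entry["data"])
    count = len(raw) // 4  # a float32 occupies 4 bytes
    return list(struct.unpack("<%df" % count, raw)), entry["shape"]


entry = encode_tensor([1.0, 2.5, -3.0, 0.0], (2, 2))
text = json.dumps({"tensor": entry})
values, shape = decode_tensor(json.loads(text)["tensor"])

raw_size = 4 * 4                   # bytes of raw float32 data
encoded_size = len(entry["data"])  # characters after base64 encoding
```

Base64 encodes every 3 bytes as 4 characters, so the embedded payload is roughly one third larger than the raw data, and the whole JSON file has to be rewritten whenever any tensor changes. Separate per-tensor files avoid both costs, at the price of managing more files.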
docs/dev/tvmdbg.md
Outdated
@@ -0,0 +1,105 @@
### TVMDBG
Change to a level-one title: # Debugger
docs/dev/tvmdbg.md
Outdated
TVM Debugger (TVMDBG) is an interface for debugging TVM's computation graph execution. It helps to provide access to graph structures and tensor values at the TVM runtime.

### Debug Exchange Format
## Exchange Format
docs/dev/tvmdbg.md
Outdated
TVM Debugger (TVMDBG) is an interface for debugging TVM's computation graph execution. It helps to provide access to graph structures and tensor values at the TVM runtime.

### Debug Exchange Format
**1. Graph information**
### Computational Graph
docs/dev/tvmdbg.md
Outdated
``` | ||
**2. Tensor dumping**
### Tensor Data
docs/dev/tvmdbg.md
Outdated
Each node in the graph will be dumped to individual files, in the dump folder. These files will be created after execution of each node.

### How to use TVMDBG?
## Enable Debugger
GRAPH_DUMP_FILE_NAME = '_tvmdbg_graph_dump.json'

class DebugResult():
class DebugResult(object):
This function will get called before run performs.
GraphRuntime copy the execution out to the allocated memory for each nodes.

Parameters
If there are no parameters, do not include a Parameters section.
docs/dev/tvmdbg.md
Outdated
@@ -0,0 +1,105 @@
### TVMDBG
Maybe I am asking too much, but is it possible to change the doc to reStructuredText? It will make it easier to refer to specific sections of the document.
Parameters
----------

cli_obj : obj
parameters do not match the signature
"""Exits the dump folder and all its contents""" | ||
self._remove_dump_root()

class DebugGraphUXWrapper(object):
Remove this class for now, as it is unclear how it can be used. Alternatively, give an example of how to build a simple UX out of the class.
@yzhliu @srkreddy1238 please take another look at this PR
The context to deploy the module, can be local or remote.

dbg_ux : str
To select which ux user needs, Exampel, curses/tensorboard/None.
Exampel ??
dump_root : str
To select which folder the outputs should be kept.
None will make a temp folder in /tmp and does the dumping
Suggest using /tmp/tvmdbg-<random number / a unique session id> instead of just /tmp
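One way to get such a per-session folder is `tempfile.mkdtemp`, sketched below. The helper name `make_dump_root` and the `tvmdbg-` prefix are assumptions for illustration, not the code from this PR.

```python
import os
import tempfile


def make_dump_root(dump_root=None):
    """Return a directory for debug dumps, creating it if necessary."""
    if dump_root is None:
        # mkdtemp appends a random suffix, so concurrent sessions get
        # separate folders under the system temp directory.
        return tempfile.mkdtemp(prefix="tvmdbg-")
    os.makedirs(dump_root, exist_ok=True)
    return dump_root
```

Because each session owns its folder, cleanup can remove the whole directory without touching dumps from other runs.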
points to the name of PackedFunc in the libmod.

dbg_ux : str
To select which ui user needs, curses, tensorboard, etc
curses, tensorboard, etc ??
Just keep the supported options and specify the default value.
dump_root : str
To select which folder the outputs should be kept.
None will make a temp folder in /tmp and does the dumping
Use sub folder under temp as explained above.
"""update the nodes_list with name, shape and data type, | ||
for temporarily storing the output.

Parameters
Remove parameters if there is nothing (Exported docs/tutorials may not look clean).
out_stats: list
Contains the list of output tensors

Returns
Remove Returns if there is nothing (Exported docs/tutorials may not look clean).
json formatted NNVM graph contain list of each node's
name, shape and type.

Returns
Remove Returns if there is nothing (Exported docs/tutorials may not look clean).
Node index to be executed now. Only the op corresponding to this index will be executed
This will be mainly used for stepping each node and finding the output

Returns
Remove Returns section if nothing.
Please check empty sections across patchset.
docs/dev/tvmdbg.md
Outdated
The Graph JSON format is explained below
1. nodes
Nodes are either placeholders or computational nodes in NNVM graph. The nodes are stored as a list. A node contains the below information
`op` - operation type, `null` means its a placeholder/variable/input node and `tvm_op` means this node can be executed
its -> it is
""" | ||
return self._shapes_list

def get_graph_node_dltypes(self):
I feel it is better to name it 'dtypes', to keep the naming convention consistent across the project.
eid += 1
order += time
key = node['name'] + "_" + str(j) + "__" + str(order) + ".npy"
dump_file = str(self._dump_path + key.replace("/", "_"))
Use os.sep instead of "/"?
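A possible shape for this, as a sketch only: the helper `dump_file_path` below is hypothetical and mirrors the key layout from the snippet under review. It replaces both "/" (which graph node names tend to contain regardless of platform) and `os.sep`, and joins the result with `os.path.join` rather than string concatenation.

```python
import os


def dump_file_path(dump_path, node_name, output_index, order):
    """Build the .npy dump path for one node output (hypothetical helper)."""
    key = "%s_%d__%s.npy" % (node_name, output_index, order)
    # Node names may contain "/" on any platform; os.sep covers the
    # platform separator when it differs (for example on Windows).
    for sep in {"/", os.sep}:
        key = key.replace(sep, "_")
    return os.path.join(dump_path, key)
```

Replacing only `os.sep` would leave "/" inside file names on Windows, which is why the sketch handles both characters.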
For storing tensor data, can we reuse https://docs.tvm.ai/api/python/nnvm/compiler.html?highlight=save_param_dict#nnvm.compiler.save_param_dict and load_param_dict? That way we can avoid having to store too many files.
@siju-samuel @tqchen Out of curiosity, is it necessary to also show the start and end timestamp of each operator so that we can see the "idle time" in between (e.g. this kind of data could be exported to some UI tools conveniently)? This might also be helpful for debugging.
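As a rough illustration of the idea, if the runtime recorded a start and end timestamp per operator, the idle time between consecutive operators could be derived as below. The `(op_name, start, end)` tuple layout and the sample numbers are made up for this sketch and are not output from the debugger.

```python
def idle_gaps(timeline):
    """Compute idle time between consecutive ops.

    timeline is a list of (op_name, start, end) tuples sorted by start time.
    Returns a list of (previous_op, next_op, idle_seconds).
    """
    gaps = []
    for (prev_name, _, prev_end), (next_name, next_start, _) in zip(timeline, timeline[1:]):
        # Clamp at zero in case timestamps overlap slightly.
        gaps.append((prev_name, next_name, max(0.0, next_start - prev_end)))
    return gaps


# Illustrative timestamps only.
timeline = [("conv", 0.0, 2.0), ("relu", 2.5, 3.0), ("dense", 3.0, 5.0)]
gaps = idle_gaps(timeline)
```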
I had another look at the new changes. @zhiics already covered the heterogeneous-related changes. We could add another test case (maybe later) to demonstrate debugging tensors in a heterogeneous setup with an intermediate node on a non-CPU context.
@tqchen @zhiics @srkreddy1238 @merrymercy @yzhliu Thanks for the reviews.
@siju-samuel LGTM.
Thanks @siju-samuel for being patient over the review process, and thanks @zhiics @merrymercy @srkreddy1238 @yzhliu for the reviews. This is now merged.
RFC Issue: #1315
A debug framework for TVM Runtime.
The current version supports the following:
Design
debug_runtime.py: This is the extension of graph_runtime.py. If the user enables debugging, all the graph_runtime interfaces are controlled from here, and data dumping is managed from here. Dumping data to the different UXs, whether curses or tensorboard, is handled in this file.
graph_runtime_debug.cc: This is the extension of the graph_runtime.cc file. To compile this file, set USE_GRAPH_RUNTIME_DEBUG to ON in the makefile.
curses: This folder contains the curses framework. curses does the data parsing and visualisation for features such as the analyser, stepper and profiler.
The current implementation takes care of the following.
Currently this PR supports only graph visualisation using the curses UX; profiling, stepping and tensorboard support will be added later.