
[DEBUG]Support a debug framework for TVM Runtime #1378

Merged
merged 17 commits into apache:master on Oct 4, 2018

Conversation

siju-samuel
Member

@siju-samuel siju-samuel commented Jul 3, 2018

RFC Issue
#1315

A debug framework for TVM Runtime.
The current version supports the following:

  • Show fused graph summary in curses UI
  • Perform debug run and show node details including input & output tensors
  • Provide flexibility to run without the debug option

Design

  1. debug_runtime.py : This is an extension of graph_runtime.py. If the user enables debugging, all the graph_runtime interfaces are controlled from here and the data dumping is managed from here. Dumping the data to the different UX options, whether curses or tensorboard, is handled in this file (a usage sketch is shown after this list).

  2. graph_runtime_debug.cc : This is an extension of graph_runtime.cc. To compile this file, set USE_GRAPH_RUNTIME_DEBUG to ON in the make file.

  3. curses : This folder contains the curses framework. curses does the data parsing and visualisation for features such as the analyser, stepper & profiler.
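
As a rough usage sketch (module path and graph_runtime-style interface follow the description in this PR; exact names may differ in the final code):

```python
# Sketch only: `graph`, `lib` and `params` come from a regular nnvm/tvm build step.
import numpy as np
import tvm

# instead of: from tvm.contrib import graph_runtime
from tvm.contrib.debugging import debug_runtime as graph_runtime

m = graph_runtime.create(graph, lib, tvm.cpu(0))
m.set_input(**params)
m.set_input("data", np.random.uniform(size=(1, 3, 224, 224)).astype("float32"))
m.run()  # with USE_GRAPH_RUNTIME_DEBUG=ON, per-node data is dumped for the UX
```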

The current implementation takes care of the following.

  • No memory impact for the normal run-time since the existing data structures are kept intact.
  • No lib size impact for the normal run-time, as all the debug modifications are protected by the compilation macro USE_GRAPH_RUNTIME_DEBUG.
  • No execution time impact for the normal run-time as there are no changes to the existing functions.
  • Test case added for debug module creation and partial validation (run can't be tested since the curses UI will come up and block the execution).
  • The data logging and UX work independently; output data is logged into a temporary folder and the UX reads this data and proceeds with visualisation.
  • A tutorial on how to use it.

Currently this PR supports only graph visualisation using the curses UX; support for profiling, stepping and tensorboard will be added later.

@siju-samuel siju-samuel changed the title [DEBUG]Support a debug framework for TVM Runtime [WIP][DEBUG]Support a debug framework for TVM Runtime Jul 3, 2018
@siju-samuel siju-samuel changed the title [WIP][DEBUG]Support a debug framework for TVM Runtime [DEBUG]Support a debug framework for TVM Runtime Jul 4, 2018
@tqchen
Member

tqchen commented Jul 4, 2018

@merrymercy @Huyuwei @yzhliu can you please review this?

Member

@yzhliu yzhliu left a comment


Did a quick review, so far the debug graph runtime looks good to me. Not quite familiar with the curses UI APIs though.


https://github.com/tensorflow/tensorflow
https://github.com/tensorflow/tensorboard
https://github.com/awslabs/mxboard
Member

break to three lines.

2. Make tvm so that it will make the `libtvm_runtime.so`

3. In the graph build file instead of `from tvm.contrib import graph_runtime` import the debug_runtime `from tvm.contrib.debugging import debug_runtime as graph_runtime`
```
Member

use ```python to highlight the py keywords.

The above two modifications will bring up the debug UI during run.

The HOME page of tvmdbg looks like below.
![](https://raw.githubusercontent.com/siju-samuel/tvmdbg/master/docs/dev/_images/tvm_dbg1.png)
Member

@tqchen do we have some 'official' place to store these images?

Contributor

dmlc/web-data can be used for all content.

Member

#menu.append(
# ui_common.MenuItem("home", "home"))
#menu.append(
# ui_common.MenuItem("help", "help"))
Member

remove unused lines?

ValueError: on failure to parse the input `size_str`.
"""

size_str = size_str.strip()
Member

better to normalize by size_str.upper() and check the upper case K, B, M, G, etc.
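
For illustration, the suggested normalization could look like this (the helper name and the accepted suffixes are assumptions, not taken from the PR):

```python
def parse_readable_size_str(size_str):
    """Parse a human-readable size such as '2K', '1.5MB' or '100' into bytes."""
    size_str = size_str.strip().upper()
    # Check the longer suffixes first so 'KB' is not consumed as a bare 'B'.
    multipliers = (("KB", 1024), ("MB", 1024 ** 2), ("GB", 1024 ** 3),
                   ("K", 1024), ("M", 1024 ** 2), ("G", 1024 ** 3), ("B", 1))
    for suffix, mult in multipliers:
        if size_str.endswith(suffix):
            return int(float(size_str[:-len(suffix)]) * mult)
    return int(float(size_str))  # plain number of bytes; float() raises ValueError on bad input
```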

"""
dbg_out_buffer_list = []
for i in range(len(shapes_list[1])):
dbg_out_buffer_list.append(nd.empty(shapes_list[1][i], dltype_list[1][i]))
Member

Just to make sure I understand correctly: these buffers always stay on CPU, and are for storing the network outputs for debug.

Member Author

Yes, they always stay on CPU, for storing the outputs of each fused op after execution.
They are updated after each op execution.


Parameters
----------
tensor: The tensor to be displayed, as a numpy ndarray or other
Member

@tqchen
Member

tqchen commented Jul 8, 2018

Thanks, @siju-samuel for the contribution and sorry for the delayed review. By looking into the current PR I think we can split it into two logical parts.

The current workflow of the debugger is to run the debug runtime to collect all the data we need, and then invoke a visualizer to step through the collected result.

So logically, we can draw a picture like

tvm.contrib.debugger.DebugGraphRuntime -> tvm.contrib.debugger.DebugResult(currently GraphModuleDebugDumpDatum) -> UX

Let us change this PR to only contain tvm.contrib.debugger.DebugGraphRuntime -> tvm.contrib.debugger.DebugResult(currently GraphModuleDebugDumpDatum) and document the debug result exchange format clearly.

We can bring this in first. Once this PR gets merged, users can readily try out the debugger feature if we wrap the curses visualizer as a separate python package that you maintain. I would recommend we collect feedback from users and improve the curses UX before we stabilize it and merge it back to master.

The tensorboard support can also be done in concurrent PRs. I recommend this path because it will force us to think carefully about the separation of UX and data logging, and also ensure high-quality components, as @yzhliu and I can review the data logging components but are less familiar with the UX part.
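
For illustration only, the separation could look roughly like this (every name below is hypothetical, not the actual API):

```python
class DebugResult(object):
    """Hypothetical container for the dumped graph JSON plus per-node output files."""
    def __init__(self, graph_json, dump_path):
        self.graph_json = graph_json
        self.dump_path = dump_path


class DebugGraphRuntime(object):
    """Hypothetical runtime wrapper: it only runs the graph and produces data."""
    def __init__(self, module, dump_path):
        self._module = module
        self._dump_path = dump_path

    def run_and_dump(self):
        self._module.run()  # normal graph execution
        # ... write the graph json and per-node tensors under self._dump_path ...
        return DebugResult("graph.json", self._dump_path)


# Any UX (curses, tensorboard, ...) only ever consumes a DebugResult.
```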

@tqchen
Member

tqchen commented Jul 8, 2018

@srkreddy1238 @PariksheetPinjari909 can you also please help review?

@siju-samuel
Member Author

@tqchen the review comments are addressed and all the curses functions now conform to the TVM guidelines.
debug_runtime.py & debug_result.py handle the interfaces with graph_runtime.

Users can readily try out the debugger feature if we wrap the curses visualizer as a separate python package that you maintain. I would recommend we collect feedback from users and improve the curses UX before we stabilize it and merge it back to master.

If we keep it as a separate package, users won't use this feature extensively, so I suggest we keep it as part of the tvm package in a separate folder; I will handle the complete maintenance responsibility of tvmdbg. This has no impact on the actual runtime as it is disabled by default. I want users to use this and give their feedback so that it can be improved further.

@tqchen
Member

tqchen commented Jul 13, 2018

@siju-samuel I understand your concerns here. Let us aim to merge this in, but still do it in two PRs (raise the UX part in a separate PR).

The first PR should contain

  • tvm.contrib.debugger.DebugGraphRuntime -> tvm.contrib.debugger.DebugResult(currently GraphModuleDebugDumpDatum)
  • Document on the debugger exchange format, on how a visualizer can read these in and visualize the result
  • Testcases on verifying the exchange format logging.

The reason to push for this is to force us to make a clear API separation and pave the path for tensorboard integration. I think it would also help a lot to get reviewers to review the UX part once the exchange format is clear.

@siju-samuel siju-samuel changed the title [DEBUG]Support a debug framework for TVM Runtime [WIP][DEBUG]Support a debug framework for TVM Runtime Jul 14, 2018
@siju-samuel siju-samuel force-pushed the tvm_debug branch 2 times, most recently from d75c6be to d06b19a on July 19, 2018 15:40
@siju-samuel siju-samuel changed the title [WIP][DEBUG]Support a debug framework for TVM Runtime [DEBUG]Support a debug framework for TVM Runtime Jul 23, 2018
@siju-samuel
Member Author

@tqchen @yzhliu
The first PR is updated. Please review and give your opinion.

  • tvm.contrib.debugger.DebugGraphRuntime -> tvm.contrib.debugger.DebugResult
  • Documentation on the debugger exchange format
  • Test cases verifying the data dumps are added

The second PR about ncurses will be done based on the review comments received for this one.
TIA

@tqchen
Member

tqchen commented Jul 23, 2018

@siju-samuel can you remove the usage section in tvmdbg.md for now (as it is UI related) and instead add a section about the Debug Exchange Format, documenting the exchange file format (how each field of the json corresponds to which information).

@siju-samuel
Member Author

@tqchen updated as per your comment.

@tqchen
Member

tqchen commented Jul 24, 2018

Thanks for the update, we still need to improve the specification to give all the details, instead of examples. Please refer to https://docs.tvm.ai/dev/nnvm_json_spec.html

As for the tensor storage format, one thing we might be able to do is provide a base64 serialization format of the tensor, so that it can be embedded into the json. Would that be helpful?

See related PR here #1452
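
For illustration, embedding a tensor in json via base64 could look roughly like this (a sketch of the suggestion, not the format the PR adopts):

```python
import base64
import json
import numpy as np

tensor = np.arange(6, dtype="float32").reshape(2, 3)
blob = json.dumps({
    "shape": list(tensor.shape),
    "dtype": str(tensor.dtype),
    "data_b64": base64.b64encode(tensor.tobytes()).decode("ascii"),
})

# Decoding on the visualizer side
rec = json.loads(blob)
restored = np.frombuffer(base64.b64decode(rec["data_b64"]),
                         dtype=rec["dtype"]).reshape(rec["shape"])
```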

@siju-samuel
Member Author

siju-samuel commented Jul 25, 2018

Thanks for the update, we still need to improve the specification to give all the details, instead of examples. Please refer to https://docs.tvm.ai/dev/nnvm_json_spec.html

Thanks for the review. I updated the json format used in the exchange. Please suggest if I need to update further.

As for the tensor storage format, one thing we might be able to do is provide a base64 serialization format of the tensor, so that it can be embedded into the json. Would that be helpful?

If we store the tensor data along with the graph dump, the file will become very big for big networks. When I checked tensorflow/mxboard, they also create separate files for tensors. Another point is that the graph won't change, but the tensors can keep changing based on the input data, so if the user runs the graph again and again with different inputs we would need to update the complete graph dump file. The numpy dump format is easy and won't consume much memory, and tensorboard/curses can directly load the numpy format without any conversion.
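
For illustration, the per-node numpy dump approach amounts to something like this (folder and file names here are just examples):

```python
import os
import numpy as np

dump_root = "/tmp/tvmdbg_example"      # example dump folder
os.makedirs(dump_root, exist_ok=True)

# The graph JSON is written once; each node output gets its own .npy file.
output = np.random.rand(1, 1000).astype("float32")   # stand-in for an op output
np.save(os.path.join(dump_root, "fused_softmax_0.npy"), output)

# curses / tensorboard can reload it directly, without any conversion step.
restored = np.load(os.path.join(dump_root, "fused_softmax_0.npy"))
```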

@@ -0,0 +1,105 @@
### TVMDBG
Member

change to level one title # Debugger


TVM Debugger (TVMDBG) is an interface for debugging TVM's computation graph execution. It helps to provide access to graph structures and tensor values at the TVM runtime.

### Debug Exchange Format
Member

## Exchange Format

TVM Debugger (TVMDBG) is an interface for debugging TVM's computation graph execution. It helps to provide access to graph structures and tensor values at the TVM runtime.

### Debug Exchange Format
**1. Graph information**
Member

### Computational Graph


```

**2. Tensor dumping**
Member

`### Tensor Data`

Each node in the graph will be dumped to individual files, in the dump folder. These files will be created after execution of each node.


### How to use TVMDBG?
Member

## Enable Debugger


GRAPH_DUMP_FILE_NAME = '_tvmdbg_graph_dump.json'

class DebugResult():
Member

class DebugResult(object):

This function will get called before run performs.
GraphRuntime copy the execution out to the allocated memory for each nodes.

Parameters
Member

If there are no parameters, do not include a Parameters section.

@@ -0,0 +1,105 @@
### TVMDBG
Member

Maybe I am asking too much, but is it possible to change the doc to reStructuredText? It will make it easier to refer to a specific section of the document.

Parameters
----------

cli_obj : obj
Member

parameters do not match the signature

"""Exits the dump folder and all its contents"""
self._remove_dump_root()

class DebugGraphUXWrapper(object):
Member

Remove this class for now as it is unclear how it can be used; alternatively, give an example of how to make a simple UX out of the class.

@tqchen
Member

tqchen commented Jul 25, 2018

@yzhliu @srkreddy1238 please take another look at this PR

The context to deploy the module, can be local or remote.

dbg_ux : str
To select which ux user needs, Exampel, curses/tensorboard/None.
Contributor

Exampel ??


dump_root : str
To select which folder the outputs should be kept.
None will make a temp folder in /tmp and does the dumping
Contributor

Suggest using /tmp/tvmdbg-<random number / a unique session id> instead of just /tmp.
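
For illustration, a unique per-session dump folder can be created with the standard library (the prefix and helper name are assumptions):

```python
import tempfile

def make_dump_root(dump_root=None):
    # None -> e.g. /tmp/tvmdbg_k2j4x1 rather than dumping straight into /tmp
    return dump_root if dump_root else tempfile.mkdtemp(prefix="tvmdbg_")
```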

points to the name of PackedFunc in the libmod.

dbg_ux : str
To select which ui user needs, curses, tensorboard, etc
Contributor

curses, tensorboard, etc ??

Just keep the supported options and specify the default value.


dump_root : str
To select which folder the outputs should be kept.
None will make a temp folder in /tmp and does the dumping
Contributor

Use a sub-folder under temp as explained above.

"""update the nodes_list with name, shape and data type,
for temporarily storing the output.

Parameters
Contributor

Remove the Parameters section if there is nothing to put in it (otherwise the exported docs/tutorials may not look clean).

out_stats: list
Contains the list of output tensors

Returns
Contributor

Remove the Returns section if there is nothing to put in it (otherwise the exported docs/tutorials may not look clean).

json formatted NNVM graph contain list of each node's
name, shape and type.

Returns
Contributor

Remove the Returns section if there is nothing to put in it (otherwise the exported docs/tutorials may not look clean).

Node index to be executed now. Only the op corresponding to this index will be executed
This will be mainly used for stepping each node and finding the output

Returns
Contributor

Remove the Returns section if there is nothing to put in it.

Please check for empty sections across the patchset.

The Graph JSON format is explained below
1. nodes
Nodes are either placeholders or computational nodes in NNVM graph. The nodes are stored as a list. A node contains the below information
`op` - operation type, `null` means its a placeholder/variable/input node and `tvm_op` means this node can be executed
Member

its -> it is
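
For reference, an illustrative fragment of the nodes list described in this section (values invented for the example, not part of the spec):

```json
{
  "nodes": [
    {"op": "null", "name": "data", "inputs": []},
    {"op": "null", "name": "conv_weight", "inputs": []},
    {"op": "tvm_op", "name": "fused_conv2d",
     "attrs": {"func_name": "fused_conv2d", "num_inputs": "2",
               "num_outputs": "1", "flatten_data": "0"},
     "inputs": [[0, 0, 0], [1, 0, 0]]}
  ]
}
```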

"""
return self._shapes_list

def get_graph_node_dltypes(self):
Member

I feel it is better to name it 'dtypes', to keep the naming convention the same across the project.

eid += 1
order += time
key = node['name'] + "_" + str(j) + "__" + str(order) + ".npy"
dump_file = str(self._dump_path + key.replace("/", "_"))
Member

use os.sep instead of / ?
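
For illustration, an os-neutral version of that path construction (the helper name is hypothetical):

```python
import os

def dump_file_path(dump_path, node_name, out_idx, order):
    # '/' inside node names is not a path separator, so flatten it to '_'
    key = "%s_%d__%d.npy" % (node_name.replace("/", "_"), out_idx, order)
    return os.path.join(dump_path, key)
```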

@tqchen
Member

tqchen commented Jul 26, 2018

For storing tensor data, can we reuse https://docs.tvm.ai/api/python/nnvm/compiler.html?highlight=save_param_dict#nnvm.compiler.save_param_dict and load_param_dict? This way we can avoid having to store too many files.
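
For illustration, the save_param_dict idea would bundle many named tensors into a single blob roughly like this (assuming the nnvm.compiler API referenced above; the output names are examples):

```python
import nnvm.compiler
import numpy as np
import tvm

outputs = {
    "fused_conv2d_0": tvm.nd.array(np.random.rand(1, 64, 56, 56).astype("float32")),
    "fused_softmax_0": tvm.nd.array(np.random.rand(1, 1000).astype("float32")),
}
blob = nnvm.compiler.save_param_dict(outputs)   # single serialized byte blob
with open("/tmp/tvmdbg_outputs.params", "wb") as f:
    f.write(blob)

# Reload on the visualizer side
with open("/tmp/tvmdbg_outputs.params", "rb") as f:
    loaded = nnvm.compiler.load_param_dict(bytearray(f.read()))
```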

@zhiics
Member

zhiics commented Oct 2, 2018

@siju-samuel @tqchen Out of curiosity, is it necessary to also show the start and end timestamps of each operator so that we can see the "idle time" in between (e.g. this kind of data could be exported to some UI tools conveniently)? This might also be helpful for debugging.

@srkreddy1238
Contributor

I had another look at the new changes. @zhiics already covered the heterogeneous-execution related changes.

We could add another test case (maybe later) to demonstrate debug tensors over heterogeneous execution with an intermediate node on a non-CPU context.

@siju-samuel
Member Author

@tqchen @zhiics @srkreddy1238 @merrymercy @yzhliu Thanks for the reviews.
I have updated the code based on review comments. Please check once again. Thanks.

@zhiics
Member

zhiics commented Oct 4, 2018

@siju-samuel LGTM.

@tqchen tqchen merged commit d713d63 into apache:master Oct 4, 2018
@tqchen
Member

tqchen commented Oct 4, 2018

Thanks @siju-samuel for being patient over the review process, and thanks @zhiics @merrymercy @srkreddy1238 @yzhliu for the reviews. This is now merged.
