
[DEBUG]Support a debug framework for TVM Runtime #1378

Merged
merged 17 commits into apache:master on Oct 4, 2018

Conversation

siju-samuel
Member

@siju-samuel siju-samuel commented Jul 3, 2018

RFC Issue
#1315

A debug framework for TVM Runtime.
The current version supports the following:

  • Show fused graph summary in curses UI
  • Perform debug run and show node details including input & output tensors
  • Provide flexibility to run without the debug option

Design

  1. debug_runtime.py : This is an extension of graph_runtime.py. If the user enables debugging, all the graph_runtime interfaces are controlled from here and the data dumping is managed from here. Dumping the data to the different UX options, whether curses or tensorboard, is handled in this file (a usage sketch is shown after this list).

  2. graph_runtime_debug.cc : This is an extension of graph_runtime.cc. To compile this file, set USE_GRAPH_RUNTIME_DEBUG to ON in the make file.

  3. curses : This folder contains the curses framework. curses does the data parsing and visualisation for features such as the analyser, stepper & profiler.
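
As a rough usage sketch (module path and graph_runtime-style interface follow the description in this PR; exact names may differ in the final code):

```python
# Sketch only: `graph`, `lib` and `params` come from a regular nnvm/tvm build step.
import numpy as np
import tvm

# instead of: from tvm.contrib import graph_runtime
from tvm.contrib.debugging import debug_runtime as graph_runtime

m = graph_runtime.create(graph, lib, tvm.cpu(0))
m.set_input(**params)
m.set_input("data", np.random.uniform(size=(1, 3, 224, 224)).astype("float32"))
m.run()  # with USE_GRAPH_RUNTIME_DEBUG=ON, per-node data is dumped for the UX
```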

The current implementation takes care of the following.

  • No memory impact for the normal run-time since the existing data structures are kept intact.
  • No lib size impact for the normal run-time, as all the debug modifications are protected by the compilation macro USE_GRAPH_RUNTIME_DEBUG.
  • No execution time impact for the normal run-time as there are no changes to the existing functions.
  • Test case added for debug module creation and partial validation (run can't be tested since the curses UI will come up and block the execution).
  • The data logging and UX work independently; output data is logged into a temporary folder and the UX reads this data and proceeds with visualisation.
  • A tutorial on how to use it.

Currently this PR supports only graph visualisation using the curses UX; support for profiling, stepping and tensorboard will be added later.

@siju-samuel siju-samuel changed the title [DEBUG]Support a debug framework for TVM Runtime [WIP][DEBUG]Support a debug framework for TVM Runtime Jul 3, 2018
@siju-samuel siju-samuel changed the title [WIP][DEBUG]Support a debug framework for TVM Runtime [DEBUG]Support a debug framework for TVM Runtime Jul 4, 2018
@tqchen
Member

tqchen commented Jul 4, 2018

@merrymercy @Huyuwei @yzhliu can you please review this?

Member

@yzhliu yzhliu left a comment


Did a quick review, so far the debug graph runtime looks good to me. Not quite familiar with the curses UI APIs though.


https://github.com/tensorflow/tensorflow
https://github.com/tensorflow/tensorboard
https://github.com/awslabs/mxboard
Member

break to three lines.

2. Make tvm so that it will make the `libtvm_runtime.so`

3. In the graph build file instead of `from tvm.contrib import graph_runtime` import the debug_runtime `from tvm.contrib.debugging import debug_runtime as graph_runtime`
```
Member

use ```python to highlight the py keywords.

The above two modifications will bring up the debug UI during run.

The HOME page of tvmdbg looks like below.
![](https://raw.githubusercontent.com/siju-samuel/tvmdbg/master/docs/dev/_images/tvm_dbg1.png)
Member

@tqchen do we have some 'official' place to store these images?

Contributor

dmlc/web-data can be used for all content.

Member

#menu.append(
# ui_common.MenuItem("home", "home"))
#menu.append(
# ui_common.MenuItem("help", "help"))
Member

remove unused lines?

ValueError: on failure to parse the input `size_str`.
"""

size_str = size_str.strip()
Member

better to normalize by size_str.upper() and check the upper case K, B, M, G, etc.
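
For illustration, the suggested normalization could look like this (the helper name and the accepted suffixes are assumptions, not taken from the PR):

```python
def parse_readable_size_str(size_str):
    """Parse a human-readable size such as '2K', '1.5MB' or '100' into bytes."""
    size_str = size_str.strip().upper()
    # Check the longer suffixes first so 'KB' is not consumed as a bare 'B'.
    multipliers = (("KB", 1024), ("MB", 1024 ** 2), ("GB", 1024 ** 3),
                   ("K", 1024), ("M", 1024 ** 2), ("G", 1024 ** 3), ("B", 1))
    for suffix, mult in multipliers:
        if size_str.endswith(suffix):
            return int(float(size_str[:-len(suffix)]) * mult)
    return int(float(size_str))  # plain number of bytes; float() raises ValueError on bad input
```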

"""
dbg_out_buffer_list = []
for i in range(len(shapes_list[1])):
dbg_out_buffer_list.append(nd.empty(shapes_list[1][i], dltype_list[1][i]))
Member

Just to make sure I understand correctly: these buffers always stay on CPU, and are for storing the network outputs for debug.

Member Author

Yes, they always stay on CPU, for storing the outputs of each fused op after execution.
They are updated after each op execution.


Parameters
----------
tensor: The tensor to be displayed, as a numpy ndarray or other
Member

@tqchen
Member

tqchen commented Jul 8, 2018

Thanks, @siju-samuel for the contribution and sorry for the delayed review. By looking into the current PR I think we can split it into two logical parts.

The current workflow of the debugger is to run the debug runtime to collect all the data we need, and then invoke a visualizer to step through the collected result.

So logically, we can draw a picture like

tvm.contrib.debugger.DebugGraphRuntime -> tvm.contrib.debugger.DebugResult(currently GraphModuleDebugDumpDatum) -> UX

Let us change this PR to only contain tvm.contrib.debugger.DebugGraphRuntime -> tvm.contrib.debugger.DebugResult(currently GraphModuleDebugDumpDatum) and document the debug result exchange format clearly.

We can bring this in first. Once this PR gets merged, users can readily try out the debugger feature if we wrap the curses visualizer as a separate python package that you maintain. I would recommend we collect feedback from users and improve the curses UX before we stabilize it and merge it back to master.

The tensorboard support can also be done in concurrent PRs. I recommend this path because it will force us to think carefully about the separation of UX and data logging, and also ensure high-quality components, as @yzhliu and I can review the data logging components but are less familiar with the UX part.
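
For illustration only, the separation could look roughly like this (every name below is hypothetical, not the actual API):

```python
class DebugResult(object):
    """Hypothetical container for the dumped graph JSON plus per-node output files."""
    def __init__(self, graph_json, dump_path):
        self.graph_json = graph_json
        self.dump_path = dump_path


class DebugGraphRuntime(object):
    """Hypothetical runtime wrapper: it only runs the graph and produces data."""
    def __init__(self, module, dump_path):
        self._module = module
        self._dump_path = dump_path

    def run_and_dump(self):
        self._module.run()  # normal graph execution
        # ... write the graph json and per-node tensors under self._dump_path ...
        return DebugResult("graph.json", self._dump_path)


# Any UX (curses, tensorboard, ...) only ever consumes a DebugResult.
```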

@tqchen
Member

tqchen commented Jul 8, 2018

@srkreddy1238 @PariksheetPinjari909 can you also please help review?

@siju-samuel
Member Author

@tqchen the review comments are addressed and all the curses functions now conform to the TVM guidelines.
debug_runtime.py & debug_result.py handle the interfaces with graph_runtime.

Users can readily try out the debugger feature if we wrap the curses visualizer as a separate python package that you maintain. I would recommend we collect feedback from users and improve the curses UX before we stabilize it and merge it back to master.

If we keep it as a separate package, users won't use this feature extensively, so I suggest we keep it as part of the tvm package in a separate folder; I will handle the complete maintenance responsibility of tvmdbg. This has no impact on the actual runtime as it is disabled by default. I want users to use this and give their feedback so that it can be improved further.

@tqchen
Member

tqchen commented Jul 13, 2018

@siju-samuel I understand your concerns here. Let us aim to merge this in, but still do it in two PRs (raise the UX part in a separate PR).

The first PR should contain

  • tvm.contrib.debugger.DebugGraphRuntime -> tvm.contrib.debugger.DebugResult(currently GraphModuleDebugDumpDatum)
  • Document on the debugger exchange format, on how a visualizer can read these in and visualize the result
  • Testcases on verifying the exchange format logging.

The reason to push for this is to force us to make a clear API separation and pave the path for tensorboard integration. I think it would also help a lot to get reviewers to review the UX part once the exchange format is clear.

@siju-samuel siju-samuel changed the title [DEBUG]Support a debug framework for TVM Runtime [WIP][DEBUG]Support a debug framework for TVM Runtime Jul 14, 2018
@siju-samuel siju-samuel force-pushed the tvm_debug branch 2 times, most recently from d75c6be to d06b19a on July 19, 2018 15:40
@siju-samuel siju-samuel changed the title [WIP][DEBUG]Support a debug framework for TVM Runtime [DEBUG]Support a debug framework for TVM Runtime Jul 23, 2018
@siju-samuel
Member Author

@tqchen @yzhliu
The first PR is updated. Please review and give your opinion.

  • tvm.contrib.debugger.DebugGraphRuntime -> tvm.contrib.debugger.DebugResult
  • Documentation on the debugger exchange format
  • Test cases verifying the data dumps are added

The second PR about ncurses will be done based on the review comments received for this one.
TIA

@tqchen
Member

tqchen commented Jul 23, 2018

@siju-samuel can you remove the usage section in tvmdbg.md for now (as it is UI related) and instead add a section about the Debug Exchange Format, documenting the exchange file format (how each field of the json corresponds to which information).

@siju-samuel
Member Author

@tqchen updated as per your comment.

@tqchen
Member

tqchen commented Jul 24, 2018

Thanks for the update, we still need to improve the specification to give all the details, instead of examples. Please refer to https://docs.tvm.ai/dev/nnvm_json_spec.html

As for the tensor storage format, one thing we might be able to do is provide a base64 serialization format of the tensor, so that it can be embedded into the json. Would that be helpful?

See related PR here #1452
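
For illustration, embedding a tensor in json via base64 could look roughly like this (a sketch of the suggestion, not the format the PR adopts):

```python
import base64
import json
import numpy as np

tensor = np.arange(6, dtype="float32").reshape(2, 3)
blob = json.dumps({
    "shape": list(tensor.shape),
    "dtype": str(tensor.dtype),
    "data_b64": base64.b64encode(tensor.tobytes()).decode("ascii"),
})

# Decoding on the visualizer side
rec = json.loads(blob)
restored = np.frombuffer(base64.b64decode(rec["data_b64"]),
                         dtype=rec["dtype"]).reshape(rec["shape"])
```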

@siju-samuel
Member Author

siju-samuel commented Jul 25, 2018

Thanks for the update, we still need to improve the specification to give all the details, instead of examples. Please refer to https://docs.tvm.ai/dev/nnvm_json_spec.html

Thanks for the review. I updated the json format used in the exchange. Please suggest if I need to update further.

As for the tensor storage format, one thing we might be able to do is provide a base64 serialization format of the tensor, so that it can be embedded into the json. Would that be helpful?

If we store the tensor data along with the graph dump, the file will become very big for big networks. When I checked tensorflow/mxboard, they also create separate files for tensors. Another point is that the graph won't change, but the tensors can keep changing based on the input data, so if the user runs the graph again and again with different inputs we would need to update the complete graph dump file. The numpy dump format is easy and won't consume much memory, and tensorboard/curses can directly load the numpy format without any conversion.
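
For illustration, the per-node numpy dump approach amounts to something like this (folder and file names here are just examples):

```python
import os
import numpy as np

dump_root = "/tmp/tvmdbg_example"      # example dump folder
os.makedirs(dump_root, exist_ok=True)

# The graph JSON is written once; each node output gets its own .npy file.
output = np.random.rand(1, 1000).astype("float32")   # stand-in for an op output
np.save(os.path.join(dump_root, "fused_softmax_0.npy"), output)

# curses / tensorboard can reload it directly, without any conversion step.
restored = np.load(os.path.join(dump_root, "fused_softmax_0.npy"))
```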

@@ -0,0 +1,105 @@
### TVMDBG
Member

change to level one title # Debugger


TVM Debugger (TVMDBG) is an interface for debugging TVM's computation graph execution. It helps to provide access to graph structures and tensor values at the TVM runtime.

### Debug Exchange Format
Member

## Exchange Format

TVM Debugger (TVMDBG) is an interface for debugging TVM's computation graph execution. It helps to provide access to graph structures and tensor values at the TVM runtime.

### Debug Exchange Format
**1. Graph information**
Member

### Computational Graph


```

**2. Tensor dumping**
Member

`### Tensor Data`

Each node in the graph will be dumped to individual files, in the dump folder. These files will be created after execution of each node.


### How to use TVMDBG?
Member

## Enable Debugger


GRAPH_DUMP_FILE_NAME = '_tvmdbg_graph_dump.json'

class DebugResult():
Member

class DebugResult(object):

This function will get called before run performs.
GraphRuntime copy the execution out to the allocated memory for each nodes.

Parameters
Member

If there are no parameters, do not include a Parameters section.

@@ -0,0 +1,105 @@
### TVMDBG
Member

Maybe I am asking too much, but is it possible to change the doc to reStructuredText? It will make it easier to refer to a specific section of the document.

Parameters
----------

cli_obj : obj
Member

parameters do not match the signature

"""Exits the dump folder and all its contents"""
self._remove_dump_root()

class DebugGraphUXWrapper(object):
Member

Remove this class for now as it is unclear how it can be used; alternatively, give an example of how to make a simple UX out of the class.

@tqchen
Member

tqchen commented Jul 25, 2018

@yzhliu @srkreddy1238 please take another look at this PR

The context to deploy the module, can be local or remote.

dbg_ux : str
To select which ux user needs, Exampel, curses/tensorboard/None.
Contributor

Exampel ??


dump_root : str
To select which folder the outputs should be kept.
None will make a temp folder in /tmp and does the dumping
Contributor

Suggest using /tmp/tvmdbg-<random number / a unique session id> instead of just /tmp.
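
For illustration, a unique per-session dump folder can be created with the standard library (the prefix and helper name are assumptions):

```python
import tempfile

def make_dump_root(dump_root=None):
    # None -> e.g. /tmp/tvmdbg_k2j4x1 rather than dumping straight into /tmp
    return dump_root if dump_root else tempfile.mkdtemp(prefix="tvmdbg_")
```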

points to the name of PackedFunc in the libmod.

dbg_ux : str
To select which ui user needs, curses, tensorboard, etc
Contributor

curses, tensorboard, etc ??

Just keep the supported options and specify the default value.


dump_root : str
To select which folder the outputs should be kept.
None will make a temp folder in /tmp and does the dumping
Contributor

Use a sub-folder under temp as explained above.

"""update the nodes_list with name, shape and data type,
for temporarily storing the output.

Parameters
Contributor

Remove the Parameters section if there is nothing to put in it (otherwise the exported docs/tutorials may not look clean).

out_stats: list
Contains the list of output tensors

Returns
Contributor

Remove the Returns section if there is nothing to put in it (otherwise the exported docs/tutorials may not look clean).

json formatted NNVM graph contain list of each node's
name, shape and type.

Returns
Contributor

Remove the Returns section if there is nothing to put in it (otherwise the exported docs/tutorials may not look clean).

Node index to be executed now. Only the op corresponding to this index will be executed
This will be mainly used for stepping each node and finding the output

Returns
Contributor

Remove the Returns section if there is nothing to put in it.

Please check for empty sections across the patchset.

The Graph JSON format is explained below
1. nodes
Nodes are either placeholders or computational nodes in NNVM graph. The nodes are stored as a list. A node contains the below information
`op` - operation type, `null` means its a placeholder/variable/input node and `tvm_op` means this node can be executed
Member

its -> it is
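
For reference, an illustrative fragment of the nodes list described in this section (values invented for the example, not part of the spec):

```json
{
  "nodes": [
    {"op": "null", "name": "data", "inputs": []},
    {"op": "null", "name": "conv_weight", "inputs": []},
    {"op": "tvm_op", "name": "fused_conv2d",
     "attrs": {"func_name": "fused_conv2d", "num_inputs": "2",
               "num_outputs": "1", "flatten_data": "0"},
     "inputs": [[0, 0, 0], [1, 0, 0]]}
  ]
}
```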

"""
return self._shapes_list

def get_graph_node_dltypes(self):
Member

I feel it is better to name it 'dtypes', to keep the naming convention the same across the project.

eid += 1
order += time
key = node['name'] + "_" + str(j) + "__" + str(order) + ".npy"
dump_file = str(self._dump_path + key.replace("/", "_"))
Member

use os.sep instead of / ?
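
For illustration, an os-neutral version of that path construction (the helper name is hypothetical):

```python
import os

def dump_file_path(dump_path, node_name, out_idx, order):
    # '/' inside node names is not a path separator, so flatten it to '_'
    key = "%s_%d__%d.npy" % (node_name.replace("/", "_"), out_idx, order)
    return os.path.join(dump_path, key)
```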

@tqchen
Member

tqchen commented Jul 26, 2018

For storing tensor data, can we reuse https://docs.tvm.ai/api/python/nnvm/compiler.html?highlight=save_param_dict#nnvm.compiler.save_param_dict and load_param_dict? This way we can avoid having to store too many files.
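
For illustration, the save_param_dict idea would bundle many named tensors into a single blob roughly like this (assuming the nnvm.compiler API referenced above; the output names are examples):

```python
import nnvm.compiler
import numpy as np
import tvm

outputs = {
    "fused_conv2d_0": tvm.nd.array(np.random.rand(1, 64, 56, 56).astype("float32")),
    "fused_softmax_0": tvm.nd.array(np.random.rand(1, 1000).astype("float32")),
}
blob = nnvm.compiler.save_param_dict(outputs)   # single serialized byte blob
with open("/tmp/tvmdbg_outputs.params", "wb") as f:
    f.write(blob)

# Reload on the visualizer side
with open("/tmp/tvmdbg_outputs.params", "rb") as f:
    loaded = nnvm.compiler.load_param_dict(bytearray(f.read()))
```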

@zhiics
Member

zhiics commented Oct 2, 2018

@siju-samuel @tqchen Out of curiosity, is it necessary to also show the start and end timestamps of each operator so that we can see the "idle time" in between (e.g. this kind of data could be exported to some UI tools conveniently)? This might also be helpful for debugging.

@srkreddy1238
Contributor

I had another look at the new changes. @zhiics already covered the heterogeneous-execution related changes.

We could add another test case (maybe later) to demonstrate debug tensors over heterogeneous execution with an intermediate node on a non-CPU context.

@siju-samuel
Member Author

@tqchen @zhiics @srkreddy1238 @merrymercy @yzhliu Thanks for the reviews.
I have updated the code based on review comments. Please check once again. Thanks.

@zhiics
Member

zhiics commented Oct 4, 2018

@siju-samuel LGTM.

@tqchen tqchen merged commit d713d63 into apache:master Oct 4, 2018
@tqchen
Member

tqchen commented Oct 4, 2018

Thanks @siju-samuel for being patient over the review process, and thanks @zhiics @merrymercy @srkreddy1238 @yzhliu for the reviews. This is now merged.
