This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Dynamic custom operator GPU support #17270

Merged
16 commits merged into apache:master on Jan 31, 2020

Conversation

@rondogency (Contributor) commented Jan 11, 2020

Description

Add GPU support for custom operators.
This is a continuation of the custom operators project; initial CPU support was implemented in #15921.

Design

Working backward from the user, here is an example of a custom operator forward function for GPU.

Notice that the function interface is the same as for a CPU operator. The input and output tensors are already on the GPU, so you don't need to memcpy them to the GPU yourself.

You need the CUDA stream from OpResource in order to launch your CUDA kernel on the correct GPU, and mx_stream_t is defined as cudaStream_t when you compile the code with NVCC.
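As a rough illustration of that last point, the stream type can simply be switched on the NVCC macro. This is a sketch only, not the verbatim lib_api.h definition; the actual guard and fallback type may differ:

#if defined(__NVCC__)
  typedef cudaStream_t mx_stream_t;   // real CUDA stream when compiled with nvcc
#else
  typedef void* mx_stream_t;          // opaque placeholder for CPU-only builds
#endif

With that in place, the forward function itself looks like this: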

MXReturnValue forwardGPU(std::map<std::string, std::string> attrs,
                         std::vector<MXTensor> inputs,
                         std::vector<MXTensor> outputs,
                         OpResource res) {
    ...
    // get the CUDA stream MXNet assigned to this operator
    mx_stream_t cuda_stream = res.get_cuda_stream();
    // launch the custom kernel on that stream; the data pointers come from the input/output tensors
    gpu_forward<<<grid, block, 0, cuda_stream>>>(out_data, in_data, N);
    return MX_SUCCESS;
}
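For context, here is a minimal sketch of the gpu_forward kernel launched above. The names gpu_forward, in_data, out_data, and N come from the snippet; the element-wise ReLU body is only an assumption for illustration, and the PR's real example lives in lib_custom_op/relu.cu (see Changes below):

__global__ void gpu_forward(float* out_data, const float* in_data, int64_t N) {
    // one thread per element; grid and block sizes are chosen by the caller
    int64_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out_data[i] = in_data[i] > 0 ? in_data[i] : 0.0f;
}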

Then the user registers a single operator with both CPU and GPU compute functions by specifying the device type in the register call. Because the device type is just a string, this registration mechanism works for any context.

REGISTER_OP(my_op)
.setForward(forwardCPU, "cpu")
.setForward(forwardGPU, "gpu")
.setBackward(backwardCPU, "cpu")
.setBackward(backwardGPU, "gpu");

Then the user compiles the library the same way as a CPU operator library. Python usage is exactly the same as for CPU custom operators:

import os
import mxnet as mx
mx.library.load(os.path.abspath('libmy_op_lib.so'))
a = mx.nd.array([[1,2,3],[4,5,6]], ctx=mx.gpu())
b = mx.nd.array([[7],[8],[9]], ctx=mx.gpu())
mx.nd.my_op(a,b)

This PR is not

  • Supporting a CPU function in one library and a GPU function in another library; both CPU and GPU functions have to be in the same library.
  • Making inferShape and inferType context aware
  • Supporting dynamic loading of custom context

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Add FCompute registration and pass the NDArray context to the custom library in c_api.cc
  • Add context info to the MXTensor class in lib_api.h (a sketch follows this list)
  • Add the lib_custom_op/relu.cu example file containing the full registration of the custom operator "my_relu", with both CPU and GPU kernel functions in that file
  • Modify lib_custom_op/Makefile to compile the .cu file with nvcc into the custom library
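To make the MXTensor change concrete, here is a rough sketch of the kind of context information a tensor could carry. The field names are assumptions for illustration; see lib_api.h in this PR for the actual definition:

struct MXContext {
  std::string dev_type;  // "cpu" or "gpu", matching the strings used in REGISTER_OP
  int dev_id;            // device ordinal, e.g. 0 for gpu(0)
};

With a context like this attached to each MXTensor, the library can tell whether its data lives on the host or on a particular GPU and pick the matching compute function.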

Comments

@eric-haibin-lin (Member) left a comment:


Some interfaces require more discussion.

@rondogency (Contributor, Author) commented:

@ptrendx thanks for your comments! I have resolved them, and I would appreciate it if you could take another quick look and approve this.

@samskalicky (Contributor) commented:

@rondogency looks like the Windows build/test is working now with those CMake changes:

test_operator_gpu.test_custom_op_gpu ...
MXNet version 10600 supported
[10:16:08] C:\jenkins_slave\workspace\build-gpu\src\c_api\c_api.cc:286: Found 2 operators in library
[10:16:08] C:\jenkins_slave\workspace\build-gpu\src\c_api\c_api.cc:350: 	Op[0] my_relu
[10:16:08] C:\jenkins_slave\workspace\build-gpu\src\c_api\c_api.cc:350: 	Op[1] my_state_relu
[10:16:08] C:\jenkins_slave\workspace\build-gpu\src\c_api\c_api.cc:785: Found 0 partitioners in library
ok (0.6834s)

Now we just need to work through the flaky tests

@wkcn (Member) left a comment:


Great! LGTM. Thank you!

@wkcn added the pr-awaiting-merge label (Review and CI is complete. Ready to Merge) and removed the pr-awaiting-review label (PR is waiting for code review) on Jan 31, 2020
@rondogency (Contributor, Author) commented:

@wkcn Thanks! Can you help me merge it?

@wkcn merged commit a726c40 into apache:master on Jan 31, 2020
@wkcn (Member) commented Jan 31, 2020

Merged. : )

@rondogency deleted the custom_op_gpu branch on February 3, 2020
zheyuye pushed a commit to zheyuye/incubator-mxnet that referenced this pull request Feb 19, 2020
* poc gpu customop end to end

* add backward and device id

* clear up customop makefile

* new fcomp register

* new setforward to pass custom context to c_api

* resolve sam comment: add cond register and fix setforward char

* tmp stateful op

* passing ctx of stateful op

* add gpu alloc and refactor all fcomp

* resolve sam comments and refactor alloc

* add gpu check to pass cpu build

* add unittest and resolve ptrend comments

* add cmake and jenkins

* fix windows

* windows gpu cmake build fix

* remove verbose
leezu pushed a commit that referenced this pull request Apr 8, 2020
Add random number generator support for custom operator libraries.

Design: MXNet passes the initialized and seeded random states, located on CPU and GPU, to the custom library, so the user can use them to generate deterministic values from a given seed passed to MXNet. Basically this workflow:

mx.random.seed(128)
r1 = mx.nd.some_custom_random_op(data)
mx.random.seed(128)
r2 = mx.nd.some_custom_random_op(data)
assert (r1 == r2)

This PR does not make the custom library generate exactly the same sequence of random numbers as MXNet.

This is a continuation of the custom operator project #15921 and #17270
samskalicky pushed a commit to samskalicky/incubator-mxnet that referenced this pull request Apr 15, 2020
pengzhao-intel pushed a commit that referenced this pull request Apr 16, 2020
…18069)

* Dynamic subgraph compile support (#17623)

This PR adds support for passing the NDArrays from the existing optimize_for API down to the reviewSubgraph function in an external library. It also adds a new API for HybridBlock called optimize_for that can partition the model without running a forward pass.

Feature changes

    Adds new API to HybridBlock optimize_for that partitions the model but does not call the cachedOp
    Modifies the subgraph library example to optionally require args to be provided
    Adds annotation on subgraph inputs for the name of the original param so that inputs can be mapped and passes annotations to input nodes of subgraphs
    Adds support for tensors in MKLDNN format, calls Reorder2Default

New tests

    Adds a new test to partition operators that directly consume params
    add a new model to test where ops to be partitioned have args/params

Bug Fixes

    fixes bug in passing ids vector by value instead of by reference
    fixes bug in passing copies of attributes instead of by reference
    fixes bug where _cached_graph was not updated after partitioning
    fixes memory leak where user-specified attributes on subgraph ops were not freed if subgraph was rejected
    fixes problem incorrectly indexing into shape/dtype maps when annotating the graph

Docs

    Updates the README doc with the latest changes described above

* Adding sparse support to MXTensor for custom operators (#17569)

* Added enum for sparse storage

* Add structure for Dense and Sparse

* redesign the data structure for MXSparse

* pull out aux data from sparse NDArray

* Added more sparse arguments to API interface

* Passed sparse from c_api to lib_api.h and set in MXTensor

* Fix indent

* fix segfault

* Fix NDArray to MXTensor errors

* Add a sample of sparse(CSR) transpose

* Make CSR transpose temporarily work by hardcoding

* Fixed sparse output size(Refined)

* Add tests for symbolic and stateful ops

* Added a sample for row sparse transpose

* Added real row sparse transpose

* Fix output size issue by adding lambda for CheckAndAlloc()

* Fix mixed storage formats error

* Added infer storage type function

* resolve comments

* Set inferSType as optional function

* Resolve comments

* Add error messages

* Resolve comments

* verify transpose ops results

* fix sanity check

* update MX_LIBRARY_VERSION to 5

* Custom Operator Random Number Generator Support (#17762)

Add random number generator support for custom operator libraries.

Design: MXNet passes the initialized and seeded random states, located on CPU and GPU, to the custom library, so the user can use them to generate deterministic values from a given seed passed to MXNet. Basically this workflow:

mx.random.seed(128)
r1 = mx.nd.some_custom_random_op(data)
mx.random.seed(128)
r2 = mx.nd.some_custom_random_op(data)
assert (r1 == r2)

This PR does not make the custom library generate exactly the same sequence of random numbers as MXNet.

This is a continuation of the custom operator project #15921 and #17270

Co-authored-by: guanxinq <58794120+guanxinq@users.noreply.github.com>
Co-authored-by: Ziyi Mu <ziyi.mu@columbia.edu>
Labels: pr-awaiting-merge (Review and CI is complete. Ready to Merge)
Projects: None yet
Linked issues: None yet
6 participants