[SPIRV] Incorrect Vulkan Result on Mobile GPU #2355

Open
tqchen opened this Issue Dec 30, 2018 · 13 comments

tqchen commented Dec 30, 2018

There have been several reports of incorrect results when deploying to mobile GPUs via Vulkan. We will need to look into this.

More specifically, it would be really useful to produce a minimal reduced example from the end-to-end ones: use debug_runtime to get the output of each intermediate step and compare it with the CPU or OpenCL version. Then we can look into which specific setup causes the problem.

We would love to have volunteers from the community to look into this issue.

Related threads:

cc @eqy

Possible steps to debug the issue:

  • Get an end-to-end example that produces the wrong result
  • Use the debug runtime to run up to k steps and dump the outputs to local files
  • Dump the output of each step using a different backend (OpenCL or CPU)
  • Check at which specific step the result goes wrong, compare the generated code, and use the dumped data to create a minimal reproducible example
  • Reduce the schedule of the failing code to a minimal one to see if there is anything wrong

tqchen changed the title from [CODGEN] Vulkan on Mobile GPU to [CODGEN] Incorrect Vulkan Result on Mobile GPU on Dec 30, 2018

tqchen changed the title from [CODGEN] Incorrect Vulkan Result on Mobile GPU to [SPIRV] Incorrect Vulkan Result on Mobile GPU on Dec 30, 2018

headupinclouds commented Dec 31, 2018

I reported this issue in the post you mentioned here.

Get an end to end example that produces the wrong result

I can post a small sample to reproduce this for review and discussion in a github repo.

Use debug runtime to run up to k-steps, dump to local files ...

I was looking for mechanisms to support this from C++ (asked here). Can you share any pointers or examples that might help illustrate how to perform such logging from C++?

tqchen commented Dec 31, 2018

You can set up an RPC server on the mobile device and use https://docs.tvm.ai/api/python/graph_runtime.html#tvm.contrib.graph_runtime.GraphModule.debug_get_output to get the k-th output. Then you can log the output using numpy's usual serialization mechanism.
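
For reference, a minimal sketch of that flow, assuming TVM was built with the graph runtime debug support. graph, lib, and params come from the nnvm.compiler.build call shown further down in this thread; the RPC address, file names, the input array x, out_shape, and the node index k are all placeholders:

import numpy as np
import tvm
from tvm import rpc
from tvm.contrib import graph_runtime, ndk

# graph, lib, params come from nnvm.compiler.build (as in the snippets below);
# the RPC address and file names are placeholders
lib.export_library("net_vulkan.so", ndk.create_shared)
remote = rpc.connect("192.168.1.42", 9090)
remote.upload("net_vulkan.so")
rlib = remote.load_module("net_vulkan.so")
ctx = remote.context("vulkan", 0)

module = graph_runtime.create(graph, rlib, ctx)
module.set_input(**params)
module.set_input("data", tvm.nd.array(x))  # x: preprocessed input as a float32 numpy array
module.run()

# dump the k-th intermediate output and serialize it with numpy for offline comparison
k = 3
out = module.debug_get_output(k, tvm.nd.empty(out_shape, ctx=ctx))
np.save("android_vulkan_node_%03d.npy" % k, out.asnumpy())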

headupinclouds commented Jan 2, 2019

Thanks. That pointed me in the right direction. I'm able to use the same mechanism from C++ with the tvm.graph_runtime_debug.create and get_output_by_layer packed function calls.

https://gist.github.com/headupinclouds/d87df2dcc9603589b3b9b9cb00f26011#file-tvm_deploy_gpu_sample-cpp-L521

If I use opt_level=0 the example does return the correct result on the same Android device. It is only when opt_level > 0 that things break down.

with nnvm.compiler.build_config(opt_level=0):
    graph, lib, params = nnvm.compiler.build(sym, target, shape_dict, params=params, target_host=target_host)

The same code with opt_level=3 (or any value >= 1) works on Android with llvm (cpu) and on the Ubuntu host for all tested back-ends: llvm (cpu), OpenGL, Vulkan, OpenCL, and gpu (cuda).

I'll see if I can understand the root cause.

headupinclouds commented Jan 6, 2019

I've posted an initial C++ example to help reproduce the issue here:

https://github.com/headupinclouds/tvm_cpp_test

In addition to the observations noted above (the same example works in every tested configuration except Android w/ Vulkan), I've noticed the error:

  • is reproducible across runs (consistent but wrong)
  • occurs in the first actual nnvm fused layer (fuse_conv2d_broadcast_add_relu below)
  • is eliminated by using opt_level=0

head of compiled graph from_mxnet.json w/ opt_level=3 where output is incorrect (fusion)

{
  "nodes": [
    {
      "op": "null", 
      "name": "data", 
      "inputs": []
    }, 
    {
      "op": "null", 
      "name": "resnetv10_conv0_weight_sc", 
      "inputs": []
    }, 
    {
      "op": "null", 
      "name": "batch_norm0_add_beta_expand", 
      "inputs": []
    }, 
    {
      "op": "tvm_op", 
      "name": "relu0", 
      "attrs": {
        "flatten_data": "0", 
        "func_name": "fuse_conv2d_broadcast_add_relu", 
        "num_inputs": "3", 
        "num_outputs": "1"
      }, 
      "inputs": [[0, 0, 0], [1, 0, 0], [2, 0, 0]]
    }, 

head of compiled graph from_mxnet.json w/ opt_level=0 where output is okay (no fusion)

{
  "nodes": [
    {
      "op": "null", 
      "name": "data", 
      "inputs": []
    }, 
    {
      "op": "null", 
      "name": "resnetv10_conv0_weight", 
      "inputs": []
    }, 
    {
      "op": "tvm_op", 
      "name": "conv2d0", 
      "attrs": {
        "flatten_data": "0", 
        "func_name": "fuse_conv2d", 
        "num_inputs": "2", 
        "num_outputs": "1"
      }, 
      "inputs": [[0, 0, 0], [1, 0, 0]]
    }, 
    {
      "op": "null", 
      "name": "resnetv10_batchnorm0_running_var", 
      "inputs": []
    }, 
    {
      "op": "tvm_op", 
      "name": "batch_norm0_add_eps", 
      "attrs": {
        "flatten_data": "1", 
        "func_name": "fuse___add_scalar__", 
        "num_inputs": "1", 
        "num_outputs": "1"
      }, 
      "inputs": [[3, 0, 1]]
    }, 

For convenience, I've saved the flattened-tensor ASCII logs from the tests outlined in the repository to GitHub release storage for that repository here.

You can download and inspect the Ubuntu vs. Android Vulkan differences in the first fused layer as follows:

wget https://github.com/headupinclouds/tvm_cpp_test/releases/download/v0.0.0/vulkan.tar.gz
tar zxvf vulkan.tar.gz
name=tvm_0003_relu0.txt;  paste -d' ' android/${name} <(awk '{print $NF}' ubuntu/${name}) | awk '{ s1=$(NF); s2=$(NF-1); d=(s1 > s2) ? (s1-s2) : (s2-s1); if(d/(s2+s1+1e-6f) > 0.05) { print " '$name' " NR " " $0 " (" d ")" } }' | less
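
A rough Python equivalent of that comparison, assuming each dump is a text file whose last whitespace-separated column is the flattened tensor value (paths as in the extracted archive above):

import numpy as np

def load_dump(path):
    # each line ends with the flattened tensor value; keep only that column
    return np.array([float(line.split()[-1]) for line in open(path)])

a = load_dump("android/tvm_0003_relu0.txt")
u = load_dump("ubuntu/tvm_0003_relu0.txt")

rel = np.abs(a - u) / (np.abs(a) + np.abs(u) + 1e-6)
bad = np.where(rel > 0.05)[0]
print("%d / %d elements differ by more than 5%%" % (len(bad), len(a)))
for i in bad[:20]:
    print("%d  %g  %g  (%g)" % (i, a[i], u[i], rel[i]))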

The error seems to be related to the codegen fusion step (opt_level > 0). The codegen details are still fairly opaque to me at this point, so any additional pointers or direction on how to proceed would be appreciated. There is some overhead associated with setting up the experiment, so if there are additional tests or logging experiments I can perform, please let me know.

@tqchen , @eqy ☝️

It is worth mentioning that the C++ executable in that example runs through to completion (with the wrong result) but hangs on exit here. It terminates properly in all other tested configurations, so I suspect something is blocking in a teardown step somewhere. Please let me know if you think I should file a separate issue for that.

eqy commented Jan 9, 2019

Thanks for your work on this! Were you able to verify that it is the fusion pass in particular that produces the error (e.g., by disabling/enabling the other passes, as in https://discuss.tvm.ai/t/different-output-values-when-setting-opt-level-3-in-nnvm-compiler-build-config/1392)?

headupinclouds commented Jan 9, 2019

Were you able to verify that it was the fusion pass in particular that produces the error

I don't know enough to say that specifically, though opt_level=2 does trigger an error (see below).

[EDIT: Although, since the output is incorrect for all opt_level != 0 (on Android), and OpFusion seems to be the only pass that runs with opt_level=1, I think it is safe to say that OpFusion is sufficient to trigger the issue on Android with this configuration for this case. As mentioned previously, building for and running with Vulkan on an Ubuntu host works fine for all opt_level >= 0, so I'm not sure it is OpFusion per se that is actually causing the issue, as opposed to uncovering a bug elsewhere in the Vulkan back-end. (I have little familiarity with TVM, and many parts of the framework are still a black box to me, so I want to be careful about making strong statements based on these tests 😄).]

Here is the output for opt_level=0,1,2,3 using the default OPT_PASS_LEVEL settings

OPT_PASS_LEVEL = {
    "SimplifyInference": 0,
    "PrecomputePrune": 2,
    "OpFusion": 1,
    "FoldScaleAxis": 3,
    "AlterOpLayout": 3,
}

opt_level=3

The maximum position in output vector is: 669
Expected 282 but got: 669

opt_level=2

error: [21:36:32] /dl/mxnet/3rdparty/tvm/apps/howto_deploy/../../src/runtime/module_util.cc:53: Check failed: ret == 0 (-1 vs. 0) [21:36:32] /dl/mxnet/3rdparty/tvm/apps/howto_deploy/../../src/runtime/vulkan/vulkan_module.cc:328: Check failed: __e == VK_SUCCESS Vulan Error, code=-3: VK_ERROR_INITIALIZATION_FAILED
terminating with uncaught exception of type dmlc::Error: [21:36:32] /dl/mxnet/3rdparty/tvm/apps/howto_deploy/../../src/runtime/vulkan/vulkan_device_api.cc:345: Check failed: __e == VK_SUCCESS Vulan Error, code=-3: VK_ERROR_INITIALIZATION_FAILED
Aborted 

opt_level=1

The maximum position in output vector is: 574
Expected 282 but got: 574

opt_level=0

The maximum position in output vector is: 282

Please let me know if there are other tests that could be useful.

Is it possible to debug this at the level of individual Vulkan commands somehow?

eqy commented Jan 9, 2019

Maybe a way to simplify the debugging is to see if the behavior changes for smaller pieces of the graph. As it is, it may be difficult to trace exactly where the error is when doing a full end-to-end inference. Can you try checking what happens when running, say, only the first layer, then adding layers until the problem resurfaces?

headupinclouds commented Jan 10, 2019

Thanks for the feedback. In this case it actually does seem to occur directly in the first real layer of the fused/optimized network, fuse_conv2d_broadcast_add_relu in the post above. I was hoping there might be some way to "lower" to human-readable Vulkan instructions that I could test sequentially, but I don't know enough to tell whether that is possible.

You can copy and paste this one liner to see just the diff for that layer:

name=tvm_0003_relu0; wget https://github.com/headupinclouds/tvm_cpp_test/releases/download/v0.0.0/vulkan.tar.gz && tar zxvf vulkan.tar.gz && diff vulkan/android/${name}.txt vulkan/ubuntu/${name}.txt
eqy commented Jan 10, 2019

Thanks for this; it is useful to know that the problem can be triggered with a single fused conv2d layer.
I think the next step is to check if the problem persists when using a default schedule for the operator.
For example, we could manually define a direct 2d convolution with minimal scheduling (e.g., just bind threads to a single dimension) while inlining the next operation (to simulate fusion).
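
A rough, untested sketch of what that could look like, using the direct NCHW conv2d compute with a hand-written minimal schedule. The conv2d_nchw parameter names and the arm64 target_host string are assumptions that may need adjusting for the TVM version in this repo:

import tvm
import topi

data = tvm.placeholder((1, 3, 224, 224), name="data")
kernel = tvm.placeholder((10, 3, 5, 5), name="kernel")

# direct NCHW convolution followed by relu; no TOPI schedule is used
conv = topi.nn.conv2d_nchw(data, kernel, stride=1, padding=2, dilation=1)
out = topi.nn.relu(conv)

s = tvm.create_schedule(out.op)
# minimal schedule: bind the output channel and the fused spatial loop to GPU threads
n, c, h, w = s[out].op.axis
s[out].bind(c, tvm.thread_axis("blockIdx.y"))
hw = s[out].fuse(h, w)
bx, tx = s[out].split(hw, factor=64)
s[out].bind(bx, tvm.thread_axis("blockIdx.x"))
s[out].bind(tx, tvm.thread_axis("threadIdx.x"))
# compute the convolution inside the thread loop so conv2d + relu form a single kernel
s[conv].compute_at(s[out], tx)

print(tvm.lower(s, [data, kernel, out], simple_mode=True))
mod = tvm.build(s, [data, kernel, out], target="vulkan",
                target_host="llvm -target=arm64-linux-android")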

headupinclouds commented Jan 11, 2019

I think the next step is to check if the problem persists when using a default schedule for the operator.

Okay, this sounds good, although I'll need help with the details (unless you already have an example close to what you are suggesting). It sounds like this would be a small unit test entirely in TVM (no NNVM or end-to-end CNN) that runs something like conv2d + relu on a simple test input tensor such as ones(64, 64, 3)? We would write a small Python script, build it for Android + Vulkan, and run the test as a C++ application on the device. Right?

Maybe the following TVM tutorial is a reasonable starting point?

https://docs.tvm.ai/tutorials/topi/intro_topi.html#fusing-convolutions

Are there other generic unit tests we should also be running to help test the Vulkan back-end on Android? Maybe these: https://github.com/dmlc/tvm/tree/master/tests/cpp?

[EDIT: It looks like packed_func_test is the most relevant one, since it is limited to run time functionality: #include <tvm/runtime/*>. Perhaps we can build on that?]

eqy commented Jan 11, 2019

Yes, your understanding is correct. That TVM tutorial should be almost exactly what we need. In this case, we would manually schedule the fused convolution (as minimally as possible) instead of relying on a built-in TOPI schedule.
If you have a C++ function deploy flow on device, we can use that, but otherwise it would be simpler to just use the RPC server so that we can make changes quickly without rebuilding anything manually for the device.

I am not sure we need the packed function test, because more basic functions (e.g., vector-add, or even an un-fused conv2d) would tell us whether the runtime is working at a more primitive level.

Basically the purpose of this is to see if the problem lies within the schedule somewhere, or occurs any time we invoke fusion. I hope it is not a corner case where a specific schedule + fusion are needed to trigger it.
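
For the vector-add sanity check, something like the following sketch should be enough (untested; the RPC address, file name, and arm64 target_host string are placeholders):

import numpy as np
import tvm
from tvm import rpc
from tvm.contrib import ndk

n = 1024
A = tvm.placeholder((n,), name="A")
B = tvm.placeholder((n,), name="B")
C = tvm.compute((n,), lambda i: A[i] + B[i], name="C")

s = tvm.create_schedule(C.op)
bx, tx = s[C].split(C.op.axis[0], factor=64)
s[C].bind(bx, tvm.thread_axis("blockIdx.x"))
s[C].bind(tx, tvm.thread_axis("threadIdx.x"))

fadd = tvm.build(s, [A, B, C], target="vulkan",
                 target_host="llvm -target=arm64-linux-android")
fadd.export_library("vadd.so", ndk.create_shared)

remote = rpc.connect("192.168.1.42", 9090)  # phone running the TVM RPC app
remote.upload("vadd.so")
f = remote.load_module("vadd.so")
ctx = remote.context("vulkan", 0)

a = tvm.nd.array(np.random.uniform(size=n).astype("float32"), ctx)
b = tvm.nd.array(np.random.uniform(size=n).astype("float32"), ctx)
c = tvm.nd.array(np.zeros(n, dtype="float32"), ctx)
f(a, b, c)
np.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy(), rtol=1e-5)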

headupinclouds commented Jan 20, 2019

I made a smaller test project tvm_intro_topi containing just the conv2d + relu chain along with a C++ test executable.

data = tvm.placeholder((1, 3, 224, 224))
kernel = tvm.placeholder((10, 3, 5, 5))

with tvm.target.create("cuda"):
    conv = topi.nn.conv2d(data, kernel, strides=1, padding=2, dilation=1)
    out = topi.nn.relu(conv)
    sconv = topi.generic.nn.schedule_conv2d_nchw(out)
    print(tvm.lower(sconv, [data, kernel], simple_mode=True))

This is based on the original TVM TOPI tutorial intro_topi.py we discussed above.

Both the C++ and Vulkan versions run correctly on the same Android test device.

[EDIT: I'm seeing minor floating point differences between the two examples, but they are very close.]

we would manually schedule the fused convolution (as minimally as possible)

What modifications do you recommend to the topi_intro.py script to achieve this?

Thanks for the help.

headupinclouds commented Jan 20, 2019

The from_mxnet.py example is using resnet18_v1:

block = get_model('resnet18_v1', pretrained=True)

Since we seem to see issues in the first block, we can probably replicate that from the gluon-cv definition here. I guess we want some (or all) of this part:

                self.features.add(nn.Conv2D(channels[0], 7, 2, 3, use_bias=False))
                self.features.add(norm_layer(**({} if norm_kwargs is None else norm_kwargs)))
                self.features.add(nn.Activation('relu'))
                self.features.add(nn.MaxPool2D(3, 2, 1))

The details of how to replicate the NNVM fusion (in particular, replicating OpFusion) are not currently clear to me. I'll try to read through that code in more detail.
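
One possible sketch for getting just that stem through the same NNVM fusion path, assuming nnvm.frontend.from_mxnet accepts a gluon HybridBlock the same way it does the full resnet18_v1 model in from_mxnet.py (channels[0] is 64 for resnet18_v1; the target strings are the ones used elsewhere in this thread):

import mxnet as mx
from mxnet.gluon import nn
import nnvm
import nnvm.compiler

# just the stem of resnet18_v1: 7x7/2 conv + batchnorm + relu + 3x3/2 maxpool
net = nn.HybridSequential()
with net.name_scope():
    net.add(nn.Conv2D(64, 7, 2, 3, use_bias=False))
    net.add(nn.BatchNorm())
    net.add(nn.Activation('relu'))
    net.add(nn.MaxPool2D(3, 2, 1))
net.initialize()
net(mx.nd.zeros((1, 3, 224, 224)))  # run once so the parameters are materialized

sym, params = nnvm.frontend.from_mxnet(net)
shape_dict = {"data": (1, 3, 224, 224)}

# opt_level=1 enables only OpFusion, which in the runs above was enough to
# produce the wrong output on Android + Vulkan
target = "vulkan"
target_host = "llvm -target=arm64-linux-android"
with nnvm.compiler.build_config(opt_level=1):
    graph, lib, params = nnvm.compiler.build(sym, target, shape_dict,
                                             params=params, target_host=target_host)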
