
[VTA][OpenCL] Cloud FPGA support #5842

Closed
wants to merge 46 commits from the feature/opencl branch

Conversation

@zhanghaohit (Contributor) commented Jun 18, 2020

This PR, together with its companion PR on the tvm-vta repo, is a basic implementation of RFC #5840.

Some notes:

  • this is just a basic version without much performance optimization. We've done some optimization and achieved significant improvement, but that part of the code is not ready yet; it still has to be organized and cleaned up further.

  • some experimental features are also included in this PR:

    • syncing all instructions once per model (instead of once per layer); a rough illustration of this idea is sketched below
    • static auto-tuning based on profiling results

    This code is not well styled (it relies on some environment variables), and we still have to think about how to implement it nicely. But we think these features are useful, so we committed them as well. We can discuss whether to keep this code in this PR or leave it for other PRs.
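
    For illustration, here is a toy, self-contained Python sketch of the difference between per-layer and per-model synchronization. FakeDevice, run_per_layer, and run_per_model are hypothetical names used only for this sketch and are not part of the VTA runtime.

    ```python
    # Illustration only: a stand-in "device" that counts how many host/device
    # synchronizations happen; none of these names exist in the VTA runtime.

    class FakeDevice:
        def __init__(self):
            self.queue = []       # pending instructions
            self.sync_calls = 0   # number of blocking waits on the device

        def push(self, instrs):
            self.queue.append(instrs)

        def sync(self):
            # Block until the device has drained everything queued so far.
            self.queue.clear()
            self.sync_calls += 1

    def run_per_layer(device, layers):
        """Baseline: flush and wait after every layer."""
        for layer in layers:
            device.push(layer)
            device.sync()

    def run_per_model(device, layers):
        """Experimental: enqueue all layers, then synchronize once."""
        for layer in layers:
            device.push(layer)
        device.sync()

    layers = ["conv1", "conv2", "dense"]

    dev = FakeDevice()
    run_per_layer(dev, layers)
    print(dev.sync_calls)   # 3 host/device round-trips

    dev = FakeDevice()
    run_per_model(dev, layers)
    print(dev.sync_calls)   # 1 host/device round-trip
    ```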

@tmoreau89 (Contributor)

Thanks @zhanghaohit. I can generate the new VTA design on Pynq to verify that it still passes with the new ISA. I will let you know if this passes. Installation documentation would be highly welcome.

@tmoreau89 (Contributor)

@zhanghaohit good news: the end-to-end tests ran successfully on VTA hardware with your changes. Given that this PR brings modifications to the ISA spec, we need to bump the HW_VER string from 0.0.1 to 0.0.2. I can upload the Pynq bitstream so it's ready for use. I will need some help from @liangfu to re-generate the DE-10 bitstream as well.

@tmoreau89 (Contributor)

In addition, were you able to put a guide together? Which Intel FPGA have you tested this on?

@zhanghaohit (Contributor, Author)

> @zhanghaohit good news: the end-to-end tests ran successfully on VTA hardware with your changes. Given that this PR brings modifications to the ISA spec, we need to bump the HW_VER string from 0.0.1 to 0.0.2. I can upload the Pynq bitstream so it's ready for use. I will need some help from @liangfu to re-generate the DE-10 bitstream as well.

Thanks @tmoreau89 for the help with the tests.

> In addition, were you able to put a guide together? Which Intel FPGA have you tested this on?

Sure. What guide are you referring to? The guide on HW_VER?

We're using Intel Arria 10.

@tmoreau89 (Contributor)

@zhanghaohit I was referring to an install guide like the one shown here: https://tvm.apache.org/docs/vta/install.html

The source file is found here: https://github.com/apache/incubator-tvm/blob/master/docs/vta/install.rst

@tmoreau89 (Contributor)

Note that I've also uploaded the Pynq bitstream given the ISA changes here: https://github.com/uwsampl/vta-distro/tree/master/bitstreams/pynq/0.0.2

It would be great, @liangfu, if you had the bandwidth to synthesize the DE10 bitstream as well and submit a pull request. Let me know if you are tight on time and need some help.

Finally, I will add an ultra96 variant in there as well. It would be great @zhanghaohit if you could also pre-generate the Arria10 bitstream and add it to the vta-distro repo.

@zhanghaohit (Contributor, Author) commented Jul 17, 2020

> Note that I've also uploaded the Pynq bitstream given the ISA changes here: https://github.com/uwsampl/vta-distro/tree/master/bitstreams/pynq/0.0.2
>
> It would be great, @liangfu, if you had the bandwidth to synthesize the DE10 bitstream as well and submit a pull request. Let me know if you are tight on time and need some help.
>
> Finally, I will add an ultra96 variant in there as well. It would be great @zhanghaohit if you could also pre-generate the Arria10 bitstream and add it to the vta-distro repo.

Thanks @tmoreau89 for the help. Regarding the Arria10 bitstream: the FPGA device we are currently using is from another vendor (not Intel), so it would not be easy for others to use. We're working on adapting to the FPGA devices on AWS, which are more standard. After that, we can upload the aocx/bitstream.

@zhanghaohit (Contributor, Author)

@tmoreau89 @vegaluisjose @huajsj @pasqoc
Any comments are welcome. Thanks.

@liangfu (Member) commented Jul 17, 2020

Thanks for the update @zhanghaohit; please rebase on the latest master branch and resolve the conflicts.

@tmoreau89 (Contributor)

@zhanghaohit thanks for the follow-up. I echo @liangfu's recommendation to rebase so the PR can be in a mergeable state. In addition, if the target is a custom FPGA board, I think we can leave it as a generic target for now. Having it ported to AWS F1 would be a great addition. I understand that setup instructions would not make the most sense at this time.

@@ -0,0 +1,555 @@
// Copyright (C) 2013-2018 Altera Corporation, San Jose, California, USA. All rights reserved.

Contributor:

@tqchen will this copyright work with Apache licensing rules? This file will end up in 3rdparty.

@@ -186,6 +189,16 @@ def __init__(self,
                 timeout=10, n_parallel=None,
                 number=4, repeat=3, min_repeat_ms=0, cooldown_interval=0.1,
                 check_correctness=False):
        static_tune = os.getenv("TVM_STATIC_TUNE_EXPERIMENTAL")

Contributor:

Can you elaborate on what TVM_STATIC_TUNE_EXPERIMENTAL is being used for? I'm a little wary of inserting additional execution modes into AutoTVM dictated by environment variables. Either:

(1) we add additional flags to the AutoTVM config, or
(2) we remove this experimental mode if it's not really necessary for the OpenCL variant of VTA.

Contributor Author:

TVM_STATIC_TUNE_EXPERIMENTAL is used for static auto-tuning. As you advised, we can split the PR into separate PRs. The interface for static auto-tuning is not nicely done yet, so I think maybe we should remove this part first.
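
For illustration, here is a hypothetical sketch contrasting the current environment-variable toggle with an explicit flag in the runner configuration (option (1) above). MyRunner and its static_tune argument are made-up names for this sketch, not existing AutoTVM API.

```python
import os

# Hypothetical runner used only to contrast the two approaches discussed above;
# this is not the actual AutoTVM measurement API.
class MyRunner:
    def __init__(self, timeout=10, static_tune=False):
        # Option (1): an explicit, documented flag in the runner's configuration.
        self.static_tune = static_tune
        # Experimental approach in this PR: the mode is toggled globally by an
        # environment variable, which is easy to miss when reproducing results.
        if os.getenv("TVM_STATIC_TUNE_EXPERIMENTAL"):
            self.static_tune = True
        self.timeout = timeout

# Explicit and discoverable:
runner = MyRunner(static_tune=True)

# Implicit and global:
os.environ["TVM_STATIC_TUNE_EXPERIMENTAL"] = "1"
runner = MyRunner()
```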

@@ -323,8 +331,10 @@ def compute_conv2d_transpose(attrs, inputs, out_dtype):
    out = topi_compute(
        inputs[0], inputs[1], strides, padding, out_dtype)
    output_padding = get_const_tuple(attrs.output_padding)
    out = topi.nn.pad(out, [0, 0, 0, 0],
                      [0, 0, output_padding[0], output_padding[1]])
    if output_padding[0] != 0 or output_padding[1] != 0:

Contributor:

Which operators was this breaking on previously?

Contributor Author:

This was causing errors when running DCGAN on VTA. There are two issues here (see the sketch after this list):

  1. if output_padding is all zeros, there is no need to pad;
  2. if out.shape does not have exactly 4 dimensions, the pad call above causes problems.
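
For illustration, a simplified sketch of a guard addressing both issues (assuming a 4-D NCHW output and a TVM version where TOPI is importable as tvm.topi); maybe_pad_output is a made-up helper name, not the exact code from this PR.

```python
from tvm import te, topi

def maybe_pad_output(out, output_padding):
    # Issue 1: skip the pad entirely when output_padding is all zeros.
    if output_padding[0] == 0 and output_padding[1] == 0:
        return out
    # Issue 2: the [0, 0, h, w] pad widths below only make sense for a
    # 4-D (e.g. NCHW) tensor, so check the rank before padding.
    if len(out.shape) != 4:
        raise ValueError("expected a 4-D output, got shape %s" % str(out.shape))
    return topi.nn.pad(out, [0, 0, 0, 0],
                       [0, 0, output_padding[0], output_padding[1]])

# Example: a dummy NCHW tensor with output_padding = (1, 1).
x = te.placeholder((1, 16, 14, 14), name="x")
y = maybe_pad_output(x, (1, 1))
print(y.shape)  # [1, 16, 15, 15]

# With zero output_padding the tensor is returned unchanged.
z = maybe_pad_output(x, (0, 0))
assert z is x
```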

Contributor Author:

After rebasing, these changes are no longer needed.

@tmoreau89 (Contributor)

@zhanghaohit I took a second look at the changes in this PR, and I believe it would be best to break it down into separate, smaller PRs:

(1) modifications to VTA for OpenCL support (runtime, operator support, etc.)
(2) quantization support for conv2d_transpose, which is device-agnostic
(3) modifications to device_annotation.cc
(4) once (3) is merged and approved, updating the GraphPack pass to use device annotation

Let me know what you think. This will make the PRs easier to review and merge.

@tmoreau89 (Contributor) left a review:

Please consider breaking the PR down into smaller PRs.

@tmoreau89 requested a review from @liangfu on July 17, 2020 22:03
@tmoreau89 (Contributor)

I'm thinking that the static auto-tuning using profiling results can also be separated out into a standalone PR.

@zhanghaohit (Contributor, Author)

> @zhanghaohit I took a second look at the changes in this PR, and I believe it would be best to break it down into separate, smaller PRs:
>
> (1) modifications to VTA for OpenCL support (runtime, operator support, etc.)
> (2) quantization support for conv2d_transpose, which is device-agnostic
> (3) modifications to device_annotation.cc
> (4) once (3) is merged and approved, updating the GraphPack pass to use device annotation
>
> Let me know what you think. This will make the PRs easier to review and merge.

Thanks @tmoreau89 for the suggestion. I've split it into 3 smaller PRs:
#6124
#6125
#6126

I've also removed the unnecessary code that is not closely related to OpenCL support.

Thanks.

@tqchen closed this Oct 11, 2020
@zhanghaohit deleted the feature/opencl branch December 17, 2020 01:59
@zhanghaohit restored the feature/opencl branch December 17, 2020 01:59
@zhanghaohit deleted the feature/opencl branch December 17, 2020 01:59