
[Hardware][Verilator] Integrating and simulating hardware accelerators in TVM #6971

Merged: 7 commits merged into apache:main on Nov 29, 2020

Conversation

@vegaluisjose (Member) commented Nov 24, 2020

Integrating and simulating hardware accelerators in TVM

This RFC proposes support for integrating hardware designs in TVM, specifically before they are manufactured in silicon or prototyped in an FPGA. Regardless of the hardware target, ML accelerators are built today using the waterfall approach described in the following figure:

[Figure: today's waterfall approach to building ML accelerators]

Starting from an accelerator idea, hardware is built incrementally for several weeks or months until it is prototyped. Once this happens, software teams start working on integrating this new piece of hardware into a particular ML framework of choice, e.g., TVM. One challenge of this approach is that hardware-design decisions are evaluated too late, which is counterproductive. Additionally, the target ML model is likely to change along the way, which makes the efficiency of this process critical.

Motivated by this reality, we believe that ML accelerators should be integrated and, more importantly, evaluated as soon as the hardware design process begins. Notably, this approach gives engineers design feedback and hardware-software continuous integration from day one, as shown in the following figure:

[Figure: proposed approach with hardware-software integration from day one]

We achieve this integration by leveraging the fact that most ML accelerators funnel down to a common hardware language, i.e., Verilog, regardless of the source language. Moreover, efficient open-source Verilog-to-C++ compilers such as Verilator are available today and can be used to interoperate with Verilog from TVM via a simple C interface (CFFI).

Concretely, this RFC consists of a simple codegen based on BYOC, plus an opaque kernel library, runtime, and device interface that engineers can implement to efficiently integrate ML accelerators with TVM regardless of their design complexity. We also provide a small example (demo) that shows how to offload an add instruction from a Relay program to a simple hardware implementation (a scalar adder) written in Verilog. This example is hosted in a third-party repository called hw-widgets that we plan to use for future hardware designs.
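As a rough illustration of the kind of opaque C interface such a kernel library could expose to a TVM runtime module, here is a minimal C++ sketch. The module name (adder), the Verilator-generated class (Vadder), the port names (op_a, op_b, result), and the exported function names are all assumptions for this sketch, not the exact API in this PR.

```cpp
// Sketch only. Build the model first, e.g.: verilator --cc adder.v
// (this generates Vadder.h/Vadder.cpp for a Verilog module named "adder").
#include "Vadder.h"
#include "verilated.h"

extern "C" {

// Opaque handle to the cycle-accurate model, returned to the TVM side.
void* VerilatorAlloc() { return new Vadder(); }

void VerilatorDealloc(void* handle) { delete static_cast<Vadder*>(handle); }

// Drive the adder's inputs, evaluate the design, and return its output.
// A purely combinational scalar adder needs a single eval(); a clocked design
// would instead toggle its clock for the required number of cycles.
int VerilatorAdd(void* handle, int a, int b) {
  auto* top = static_cast<Vadder*>(handle);
  top->op_a = a;
  top->op_b = b;
  top->eval();
  return top->result;
}

}  // extern "C"
```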

@tmoreau89 @liangfu @jroesch

@giuseros (Contributor) commented:

Hi @vegaluisjose,

This looks very interesting! I would suggest moving the RFC discussion to https://discuss.tvm.apache.org/. Meanwhile, let me double-check that I understood the design. So, the idea is:
a) Write Verilog
b) Compile it (e.g., with Verilator) into a C++ cycle-accurate model
c) Link that model into TVM
You are using BYOC so that you can split your graph on the operations you want and offload the ones you decided to support (potentially the entire graph) to the C++ model. Is my understanding correct?

Two questions:
a) How hard would it be to open a route in TIR for this (to test smaller hw parts)? In theory call_extern should be sufficient?
b) Why add hw-widgets as a dependency? IIUC this is just an example, right?
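For readers following steps (a)-(c) above, here is a minimal standalone sketch of steps (b) and (c) before any TVM integration. It assumes the same hypothetical combinational module "adder" with ports op_a, op_b, and result, compiled with something like `verilator --cc adder.v --exe tb.cpp --build`.

```cpp
// tb.cpp: exercise the Verilated cycle-accurate model directly from C++.
#include <cstdio>

#include "Vadder.h"
#include "verilated.h"

int main(int argc, char** argv) {
  Verilated::commandArgs(argc, argv);  // forward runtime arguments to the model
  Vadder top;                          // instantiate the cycle-accurate model
  top.op_a = 2;
  top.op_b = 3;
  top.eval();                          // evaluate the combinational logic
  printf("2 + 3 = %d\n", static_cast<int>(top.result));
  return 0;
}
```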

@liangfu (Member) left a comment:

I think this is definitely a great idea. This is the missing piece of the design that leverages BYOC to help HW designers from day one. I agree with having hw-widgets in 3rdparty, since the repo could be a starting point for HW designers.

The implementation looks good.

@vegaluisjose (Member, Author) commented Nov 25, 2020

> Hi @vegaluisjose,
>
> This looks very interesting! I would suggest moving the RFC discussion to https://discuss.tvm.apache.org/. Meanwhile, let me double-check that I understood the design. So, the idea is:
> a) Write Verilog
> b) Compile it (e.g., with Verilator) into a C++ cycle-accurate model
> c) Link that model into TVM
> You are using BYOC so that you can split your graph on the operations you want and offload the ones you decided to support (potentially the entire graph) to the C++ model. Is my understanding correct?
>
> Two questions:
> a) How hard would it be to open a route in TIR for this (to test smaller hw parts)? In theory call_extern should be sufficient?
> b) Why add hw-widgets as a dependency? IIUC this is just an example, right?

Hi @giuseros,

Yes to all of your points (a, b, c). This is an extension of work we started exploring earlier this year. The main difference between the paper and this RFC is that we allow the hardware engineer to specify the level of detail of the simulation object. For example, if you have hardware code for simulating complex protocols, e.g., AXI, then you can implement such protocol transactions using the proposed device interface. On the other hand, if you don't have such models, or you want to bypass them to speed up simulation, then you can use hierarchical paths to access memories or registers in your design and control it the same way I am doing in the current demo. The benefit of this freedom is that interoperability stays independent of any particular hardware protocol or interface and can happen in the most productive way possible.
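To make that flexibility concrete, here is a conceptual C++ sketch of such a device interface; the class and method names are illustrative, not the exact interface in this PR. One implementation could translate these calls into AXI transactions against a bus-functional model, while another could read and write the design's registers and memories directly through hierarchical paths to speed up simulation.

```cpp
#include <cstdint>

// Hypothetical device interface a hardware engineer implements for their design.
class DeviceInterface {
 public:
  virtual ~DeviceInterface() = default;
  virtual void Reset(int cycles) = 0;                  // hold the design in reset
  virtual void Run(int cycles) = 0;                    // advance the clock N cycles
  virtual void WriteReg(int addr, int32_t value) = 0;  // configure / load operands
  virtual int32_t ReadReg(int addr) = 0;               // read back results or status
};
```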

In terms of the other questions,

a) I haven't explored the TIR path yet. However, the benefit of doing so would be leveraging other utilities available in the framework, e.g., AutoTVM. In other words, BYOC could be considered a shallower integration, but it is still useful.

b) The reason we have the hw-widgets repo is that this work is hard to explain without a running example, given that we sit at the hardware-software boundary. I am working on a blog post, to be published after this gets merged, that not only covers all of this but also shows how to offload operators from ML models. For example, we have already tested offloading nn.bias_add() from a quantized MobileNet model to the same scalar adder used in this example. As @liangfu said, this is a starting point for others.

Finally, this approach is not limited to ML accelerators; you can imagine integrating a bare-metal processor in a similar fashion. There are plenty of RISC-V processors out there using Verilator for simulation, so that could happen as well. I am giving a short talk at the TVM conference about this RFC.

@giuseros (Contributor) left a comment:

Hi @vegaluisjose, don't get me wrong, I am very supportive of this work; the questions were mostly to understand it better. I will attend your talk. And this LGTM!

@comaniac (Contributor) left a comment:

Thanks for this interesting idea and implementation! Hopefully we can get this in. Meanwhile, we should let more people review this PR, especially since it introduces a new 3rdparty dependency (hw-widgets). I also agree with @giuseros that it would be good to have an RFC describing how you expect users to make use of this feature, because I didn't get this point after reviewing the PR.

My confusion is what role "hw-widgets" plays in this feature.

  1. According to its README ("These are examples of hardware designs for testing Verilator backend integration to TVM"), is it just examples? If so, I don't think an example really needs to be a 3rd-party dependency. We could even bring the implementation into the TVM code base and only compile it when Verilator is enabled.

  2. What should I do if I have a new design in Verilator? Should I 1) compile my design to a library like you did for hw-widgets, and 2) modify verilator_codegen and verilator_runtime to link that library? If that is the case, then this PR is more like just an example rather than a framework.

  3. The current implementation in this PR executes the graph op by op, but that seems uncommon for an accelerator. Is this just an example, or is it your intention?

cc @zhiics @tqchen

"kernel", /* op_type_ */
inputs, 1 /* num_outputs_ */);
SetCallNodeAttribute(node, call);
return AddNode(node, GetRef<Expr>(cn));
A Contributor commented on the excerpt above:

This is too ad hoc, even if you only support AddNode at the moment. It would be better to check the node type and throw an error when you find other unsupported nodes. That would also make it more flexible to add other nodes in the future.
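A minimal sketch of the suggested check, written as a standalone helper with assumed includes and naming (the actual visitor in this codegen may differ):

```cpp
#include <string>

#include <dmlc/logging.h>
#include <tvm/ir/op.h>
#include <tvm/relay/expr.h>

// Sketch only: validate that a Relay call targets an operator this codegen
// supports, failing loudly otherwise instead of assuming every node is an "add".
static void CheckSupportedOp(const tvm::relay::CallNode* cn) {
  const auto* op_node = cn->op.as<tvm::OpNode>();
  if (op_node == nullptr) {
    LOG(FATAL) << "Verilator codegen only supports operator calls";
  }
  const std::string op_name = op_node->name;
  if (op_name != "add") {  // extend this list as more ops gain hardware support
    LOG(FATAL) << "Unsupported operator for Verilator codegen: " << op_name;
  }
}
```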

@vegaluisjose (Member, Author) replied:

Great point, I was thinking the same. The thing is that we would also need to add the function call in the verilator runtime, right? That leaves the user (in Python) the freedom of marking the ops to be offloaded. I was planning to increase this coverage in a follow-up PR; I did not want this one to be too many lines to review.

@tmoreau89 (Contributor) left a comment:

This PR is a very welcome addition, particularly for HW vendors/researchers.

Indeed, having validation from @tqchen or @zhiics on adding a new 3rd-party package + an extra cmake flag would be great too.

@vegaluisjose (Member, Author) commented Nov 25, 2020

Hi @comaniac, thank you for your feedback -- answers inline.

> Thanks for this interesting idea and implementation! Hopefully we can get this in. Meanwhile, we should let more people review this PR, especially since it introduces a new 3rdparty dependency (hw-widgets). I also agree with @giuseros that it would be good to have an RFC describing how you expect users to make use of this feature, because I didn't get this point after reviewing the PR.
>
> My confusion is what role "hw-widgets" plays in this feature.
>
> 1. According to its README ("These are examples of hardware designs for testing Verilator backend integration to TVM"), is it just examples? If so, I don't think an example really needs to be a 3rd-party dependency. We could even bring the implementation into the TVM code base and only compile it when Verilator is enabled.

The hw-widgets repository is just an example of how to use this "backend" plus a hardware codebase. The reason for having a separate repo with examples is similar to the split that happened with VTA (3rdparty vta-hw).

> 2. What should I do if I have a new design in Verilator? Should I 1) compile my design to a library like you did for hw-widgets, and 2) modify verilator_codegen and verilator_runtime to link that library? If that is the case, then this PR is more like just an example rather than a framework.

I don't think I referred to this as a framework. I think this should be viewed more as another backend for TVM. Similarly to Ethos-N or DNNL, you need to have a precompiled library on your system, but in this case the library is a cycle-accurate version of your hardware, as @giuseros pointed out. However, we need a separate ecosystem because, like other external codegens, Verilator requires certain things to make it work.

For example, building floating-point hardware is far more complicated than integer or fixed-point hardware. Most of the backends in TVM use kernel libraries based on floating point, because hardware support comes for free. But if you are building things from scratch, or if you are in a power-constrained environment, e.g., the edge, you are most likely to start with integer ops. Therefore, we need a mechanism to leverage not only full-blown complex accelerators but also the tiniest pieces of hardware, which allows us to build a more vibrant hardware ecosystem/community compared to other frameworks. The other benefit is being able to build hardware "incrementally" and evaluate your design decisions with TVM over time.

> 3. The current implementation in this PR executes the graph op by op, but that seems uncommon for an accelerator. Is this just an example, or is it your intention?

This backend can mirror what we currently have for DNNL in terms of single and merged (composite) operators; there is no reason to stop here. The reason we have a single one is that we only have one example for this particular operator. Eventually, we hope the community will join this effort and add examples that run more complicated use cases, e.g., CONV2D.

cc @zhiics @tqchen

@comaniac (Contributor) left a comment:

@vegaluisjose thanks for the explanation. It now makes much more sense to me, as I was thinking this is a general backend for everyone. In terms of being an example, this PR looks good to me except for the 3rd-party dependency. To me, this is not the same case as VTA, because VTA is a framework while hw-widgets is just an example. In addition, IIUC, with the 3rd-party dependency, anyone who wants to contribute to this feature needs to first submit a PR to that repo; otherwise it seems impossible to enable more ops such as Conv2D.

btw, we may need a test case for this backend even if it's just an example.

@tmoreau89 (Contributor) commented:

@comaniac thank you for the feedback. The decision to add the Verilog sources in a 3rd-party submodule came from TVM industry contributors' concerns about keeping hardware sources out of the main repo for IP-protection reasons. We are simply making sure all hardware sources are cleaved off from the main TVM repo.

@tmoreau89 (Contributor) commented:

I agree on adding a test case - @vegaluisjose do you mind adding one to make sure this example doesn't bitrot?
I believe the default CPU docker images are already built with Verilator, so hopefully we don't need to rebuild them.

@zhiics (Member) commented Nov 25, 2020

@vegaluisjose thanks for the effort. I think this is a quite interesting implementation and yet another use case of BYOC. It is more like the json runtime example we had before, but in the Verilator scenario. I think we need a test case. In addition, as @comaniac mentioned, this would increase maintenance (mostly from your side, though) because we would need to update both repos for changes such as adding operator support. Do you think it makes sense to contribute the kernels to TVM as well?

@zhiics (Member) commented Nov 25, 2020

ahh, just saw @tmoreau89's comment on the separation.

@vegaluisjose (Member, Author) commented:

> @vegaluisjose thanks for the explanation. It now makes much more sense to me, as I was thinking this is a general backend for everyone. In terms of being an example, this PR looks good to me except for the 3rd-party dependency. To me, this is not the same case as VTA, because VTA is a framework while hw-widgets is just an example. In addition, IIUC, with the 3rd-party dependency, anyone who wants to contribute to this feature needs to first submit a PR to that repo; otherwise it seems impossible to enable more ops such as Conv2D.
>
> btw, we may need a test case for this backend even if it's just an example.

Regarding the test, I do have one located here that I hope we can figure out how to eventually add to CI.

@vegaluisjose (Member, Author) commented:

Alright guys, sounds like we have a plan.

I will move the Python test that I have in hw-widgets to follow what we have with other external codegens, making sure this works on our CI and does not bitrot.

Does this sound reasonable?

Thank you all for your valuable input @giuseros @comaniac @liangfu @tmoreau89 @zhiics

@zhiics (Member) commented Nov 25, 2020

@vegaluisjose sounds good to me. Thanks.

@giuseros (Contributor) commented:

@vegaluisjose, it is ok for me too. Although, I would suggest moving the RFC to the discuss forum to get more visibility and comments.

@comaniac (Contributor) commented:

> @comaniac thank you for the feedback. The decision to add the Verilog sources in a 3rd-party submodule came from TVM industry contributors' concerns about keeping hardware sources out of the main repo for IP-protection reasons. We are simply making sure all hardware sources are cleaved off from the main TVM repo.

I totally understand this concern and it makes sense to me. I completely agree that we should keep the hardware implementation out of the main repo; I was just concerned about the maintenance of this example. IMHO, it would also be fine to bring the hardware example in and mention in the tutorial/blog post/comments that users are encouraged to keep this part in a separate repo. Anyway, I won't be against the decision if everyone else is fine with it.

> I will move the Python test that I have in hw-widgets to follow what we have with other external codegens, making sure this works on our CI and does not bitrot.

Sounds like a plan. Thanks.

@tmoreau89 (Contributor) commented:

As @giuseros pointed out, an RFC about this work would be nice for longer-term conversations. I'm hoping this PR can be merged soon for the TVM conf demo, but we should have a longer-lasting discussion on extensions/applications of this work.

@tqchen (Member) commented Nov 25, 2020

Re the hw-related component, how about we put it into the https://github.com/apache/tvm-vta repo? Given that the infrastructure is already set up for testing the VTA component, the code in that repo can also be tested against the latest version of TVM. We could make it part of vta/apps/hw-widgets.

I agree that having an RFC about the proposal would be really nice.

@tmoreau89 (Contributor) commented:

@tqchen if we go toward putting hardware sources under VTA, we may want to rename that repo to something like tvm-hardware, under which apps and the VTA sources would sit. And hopefully we'll see other designs in there as well. That's something we can do after an RFC discussion.

@tmoreau89 (Contributor) left a comment:

Sorry, but I'll need to gate this PR until we move the submodule contents into the VTA repo and then update the submodule in this PR.

@tmoreau89 (Contributor) commented:

apache/tvm-vta#18 has now been merged as requested by @tqchen.

Please update the submodule and paths to now point to tvm-vta, and bump the submodule version.

@tmoreau89 (Contributor) left a comment:

Thank you for updating the PR @vegaluisjose

@tmoreau89 merged commit 8d1dc7e into apache:main on Nov 29, 2020
@tmoreau89 (Contributor) commented:

Thanks to @vegaluisjose and to @comaniac @liangfu @zhiics @giuseros who reviewed the PR; it's now merged.

trevor-m pushed commits referencing this pull request to trevor-m/tvm (Dec 2 and Dec 4, 2020) and to neo-ai/tvm (Dec 4, 2020):

…s in TVM (apache#6971)

* add files
* update interface
* update
* fix comment
* fix fmt
* remove widget repo
* remove