
Add support for Xilinx FPGA board with SDAccel #1278

Merged
merged 6 commits into apache:master from sdaccel-support
Jun 26, 2018

Conversation

kazum
Contributor

@kazum kazum commented Jun 13, 2018

This PR adds initial support for Xilinx FPGA boards with SDAccel. The following are included:

  • SDAccel backend based on OpenCL runtime
  • Vivado HLS code generation
  • Tutorial to deploy TVM on AWS F1 instance

This PR is only a preliminary implementation, but I think it would be a good starting point for FPGA support.

@kazum kazum force-pushed the sdaccel-support branch 4 times, most recently from 6f7c8d6 to fc8f768 on June 13, 2018 at 20:47
@tqchen
Member

tqchen commented Jun 14, 2018

@tmoreau89 @Huyuwei can you share your input on this?

@tmoreau89
Contributor

Looping in @vegaluisjose, who can add his two cents. I'll try reproducing the tutorial over the next few days on F1 to make sure that this runs correctly. This looks like a promising start to having an OpenCL back-end for FPGAs.

@Huyuwei
Contributor

Huyuwei commented Jun 14, 2018

@seanlatias @comaniac may be interested and have something to share

@comaniac
Contributor

I briefly reviewed the PR and it looks good to me.
I think the programming model for compilation and execution could definitely be improved, but I would prefer to let this PR in first, since the mechanism of registering an AFI and setting up an F1 instance could involve many AWS account issues (e.g., permissions).

While I don't have comments on the runtime part, I'd like to confirm some points about the kernel compilation part.

  1. From reading the tutorial, it seems that in order to run a TVM program on FPGAs, you require users to specify the pipeline binding, much as TVM currently requires users to specify thread binding for OpenCL on GPUs. As far as I know, this user-specified binding helps not only kernel optimization but also host/device splitting. Is this how you identify the FPGA kernel scope?
  2. Could you explain the current limitations of the HLS code generation? I briefly reviewed it, and it seems that for now you only generate the corresponding ap_uint data types and insert interface pragmas. I know other optimization pragmas such as unroll or pipeline could be added in the future, but I'd like to know whether there are any cases you currently cannot support.

Thanks.

@kazum
Contributor Author

kazum commented Jun 15, 2018

From reading the tutorial, it seems that in order to run a TVM program on FPGAs, you require users to specify the pipeline binding, much as TVM currently requires users to specify thread binding for OpenCL on GPUs. As far as I know, this user-specified binding helps not only kernel optimization but also host/device splitting. Is this how you identify the FPGA kernel scope?

Yes, exactly. I think it's the most intuitive way, but any other suggestions would be much appreciated.
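To make the binding concrete, here is a pseudocode-style sketch in the spirit of the PR's tutorial (2018-era TVM API; the `pipeline` thread axis and the `sdaccel` target come from this PR, everything else is illustrative and needs an SDAccel-capable TVM build to actually run):

```
import tvm

n = 1024
A = tvm.placeholder((n,), name="A")
B = tvm.compute((n,), lambda i: A[i] + 1.0, name="B")

s = tvm.create_schedule(B.op)
# Binding the outer axis to the "pipeline" pseudo thread axis marks
# everything under it as the FPGA kernel scope, analogous to
# blockIdx/threadIdx binding for OpenCL on GPUs.
px, x = s[B].split(B.op.axis[0], nparts=1)
s[B].bind(px, tvm.thread_axis("pipeline"))

mod = tvm.build(s, [A, B], target="sdaccel", target_host="llvm")
```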

Could you explain the current limitations of the HLS code generation? I briefly reviewed it, and it seems that for now you only generate the corresponding ap_uint data types and insert interface pragmas. I know other optimization pragmas such as unroll or pipeline could be added in the future, but I'd like to know whether there are any cases you currently cannot support.

From the point of view of code semantics, I guess there is no limitation; Vivado HLS can compile the generated code as long as FPGA resources are available. For optimization, there are many things we have to do, but they would be future work, as you said.

Thanks.

@comaniac
Contributor

comaniac commented Jun 15, 2018

I traced through some more parts based on your clarification and made the following summary of this PR:

  • This PR provides a working flow that compiles TVM IR to HLS C for SDAccel.
  • Unlike any existing TVM-supported backend flow, this flow fully decouples compilation and execution to deal with the long place-and-route time. This is totally reasonable. Although the connection between compilation and execution does not seem clean enough yet, it could be improved later on.
  • The flow provides runtime support for the AWS F1 instance but does not currently target other standalone servers with Xilinx FPGAs.
  • The user has to specify the pipeline binding in order to define the kernel scope, but the pipeline binding currently doesn't affect the performance of the generated HLS C code.
  • This PR does not yet have any performance considerations for the generated HLS kernels.

Please correct me if I misunderstand anything.
Since it is the first HLS flow in TVM and it doesn't affect other backends, it's safe to let it in as an experimental feature, in my opinion.

Thanks.

@kazum
Contributor Author

kazum commented Jun 15, 2018

This PR provides a working flow that compiles TVM IR to HLS C for SDAccel.

Yes.

Unlike any existing TVM-supported backend flow, this flow fully decouples compilation and execution to deal with the long place-and-route time. This is totally reasonable. Although the connection between compilation and execution does not seem clean enough yet, it could be improved later on.

The main reason I decoupled the tutorial script is that we need to create an AFI (Amazon FPGA Image) before execution on the AWS F1 instance. If we use a standalone server with Xilinx FPGAs, I think we can combine those two scripts into one. I confirmed that I could compile and execute the vector addition code with one script for software/hardware emulation, so we can use the same code as with other backends, at least for testing purposes.

The flow provides runtime support for the AWS F1 instance but does not currently target other standalone servers with Xilinx FPGAs.

No, I guess the flow for other standalone servers would be similar, though I haven't tested it yet. The only thing we need to change, I think, is to specify an xclbin binary instead of an awsxclbin file.
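The difference described here can be sketched as a one-line selection (hypothetical helper; not part of the PR):

```python
def kernel_binary_name(base, on_aws_f1):
    """Pick the program binary for the target platform (illustrative).

    On AWS F1 the bitstream is registered as an AFI and referenced
    through an *.awsxclbin file; a standalone Xilinx board loads the
    *.xclbin produced by the SDAccel compiler directly.
    """
    return base + (".awsxclbin" if on_aws_f1 else ".xclbin")

print(kernel_binary_name("myexp", True))   # AWS F1 instance
print(kernel_binary_name("myexp", False))  # standalone Xilinx board
```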

The user has to specify the pipeline binding in order to define the kernel scope, but the pipeline binding currently doesn't affect the performance of the generated HLS C code.

Yes.

This PR does not yet have any performance considerations for the generated HLS kernels.

Yes.

Thanks for your help with the clarification!

@kazum
Contributor Author

kazum commented Jun 16, 2018

@comaniac I've added a tiny test case which shows we can try the SDAccel backend without decoupling the scripts for emulation.

np.testing.assert_allclose(
    b.asnumpy(), np.exp(a.asnumpy()), rtol=1e-5)

check_device("sdaccel")
Contributor

Does the TVM Jenkins environment have SDAccel?
If so, how did you mock the bitstream generation part?
If not, it seems to me that this test will exit at line 29 and nothing will be tested.

Contributor Author

I think the Jenkins environment of this community doesn't have SDAccel, so the test is not run via CI. I checked that the test works on my AWS F1 instance:

$ python -m nose -vs tests/python/integration/test_ewise_fpga.py
test_ewise_fpga.test_exp ... [07:37:36] /home/centos/src/project_data/tvm/src/runtime/opencl/opencl_device_api.cc:232: Initialize OpenCL platform 'Xilinx'
[07:37:36] /home/centos/src/project_data/tvm/src/runtime/opencl/opencl_device_api.cc:259: opencl(0)='xilinx_aws-vu9p-f1_dynamic_5_0' cl_device_id=0x1ad7ef0
#include <ap_int.h>

extern "C" void myexp_kernel0( float* B,  float* A) {
#pragma HLS INTERFACE m_axi port=B  offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=B bundle=control
#pragma HLS INTERFACE m_axi port=A  offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=A bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

  for (int i0_inner = 0; i0_inner < 1024; ++i0_inner) {
    B[i0_inner] = expf(A[i0_inner]);
  }
}



****** xocc v2017.4 (64-bit)
  **** SW Build 2193837 on Tue Apr 10 18:06:59 MDT 2018
    ** Copyright 1986-2017 Xilinx, Inc. All Rights Reserved.

Attempting to get a license: ap_opencl
Feature available: ap_opencl
INFO: [XOCC 60-585] Compiling for software emulation target

(snip)

INFO: [XOCC 60-586] Created /tmp/tmpc5uk9a/output.xclbin
INFO: [XOCC 60-791] Total elapsed time: 0h 0m 11s
ok

----------------------------------------------------------------------
Ran 1 test in 61.040s

OK

I set the environment variable XCL_EMULATION_MODE in the Python test file, so SDAccel doesn't generate a bitstream but runs in software emulation mode; it works like a mock.
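The mocking approach can be sketched as a small wrapper (the wrapper name is hypothetical; the exact value XCL_EMULATION_MODE should take depends on the SDAccel release, with later versions documenting "sw_emu" and "hw_emu"):

```python
import os
import subprocess
import sys

def run_in_emulation(cmd, mode="sw_emu"):
    """Run a command with SDAccel emulation enabled (illustrative).

    Setting XCL_EMULATION_MODE makes the Xilinx OpenCL runtime execute
    kernels in emulation instead of loading a real bitstream, so tests
    can run without hours of place-and-route.
    """
    env = dict(os.environ)
    env["XCL_EMULATION_MODE"] = mode
    return subprocess.run(cmd, env=env, capture_output=True, text=True)
```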

Let me know if I'm missing something.

Contributor

I see. Could we figure out a plan to include SDAccel in Jenkins, or an alternative way to perform unit tests without SDAccel? I'm worried that it would be hard to maintain once other contributors start adding more features.

Contributor Author

Another way is to use the AWS FPGA Developer AMI. It contains SDAccel.

@tqchen, is it possible to launch an AWS instance from Jenkins for FPGA testing purposes? Note that we don't need F1 instances for emulation. Any instance type is okay.

@kazum
Contributor Author

kazum commented Jun 20, 2018

I'll try reproducing the tutorial over the next few days on F1 to make sure that this runs correctly.

@tmoreau89, have you tried the tutorial? Let me know if there is anything I can help you with. :)

@tqchen
Member

tqchen commented Jun 20, 2018

I have been discussing with @tmoreau89 for a while how to introduce SDAccel and other VHLS toolchains into the test infrastructure. The main challenge is that our current test infra is based on Docker, and because setting up the Xilinx toolchain requires some manual steps, it is hard to dockerize these parts.

Given the situation @comaniac described, we might need to look into possible solutions.

@kazum
Contributor Author

kazum commented Jun 21, 2018

I've added a change to create a program binary for the GPU when SDAccel is not available. I think we can test the sdaccel backend in Jenkins with this change.

@tqchen @tmoreau89 @comaniac Let me know your opinion.

os.environ.get("AWS_PLATFORM", "xilinx:kcu1500:dynamic"))
platform = os.environ.get("XCL_PLATFORM", os.environ.get("AWS_PLATFORM"))

if platform is None:
Contributor

@comaniac comaniac commented Jun 21, 2018

I suggest showing a warning like "Failed to set up SDAccel, running on GPU instead." in case a user really wants to run with SDAccel but just forgot to set up the environment.
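The suggested behavior could look like this (a sketch assuming the environment-variable lookup shown in the diff; the helper name is hypothetical and the warning text follows the review comment):

```python
import os
import warnings

def get_sdaccel_platform():
    """Resolve the SDAccel platform from the environment (illustrative).

    Mirrors the lookup in the diff above: XCL_PLATFORM wins, then
    AWS_PLATFORM; if neither is set, warn and let the caller fall
    back to a plain GPU/OpenCL build.
    """
    platform = os.environ.get("XCL_PLATFORM", os.environ.get("AWS_PLATFORM"))
    if platform is None:
        warnings.warn("Failed to set up SDAccel, running on GPU instead.")
    return platform
```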

@@ -121,6 +121,22 @@ def vpi(dev_id=0):
return TVMContext(9, dev_id)


def sdaccel(dev_id=0):
Member

This is not necessary; from the frontend's perspective, sdaccel is just opencl.

}
}

void CodeGenVHLS::AddFunction(LoweredFunc f) {
Member

VHLS -> VivadoHLS

TVM_REGISTER_GLOBAL("tvm.intrin.rule.sdaccel.popcount")
.set_body(DispatchExtern<Direct>);

TVM_REGISTER_GLOBAL("tvm.intrin.rule.sdaccel.tvm_warp_shuffle")
Member

I don't think sdaccel has warp shuffle.

@@ -0,0 +1,151 @@
Deploy to AWS F1 FPGA Instance
Member

Deploy to AWS F1 -> HLS Backend Example

Deploy to AWS F1 FPGA Instance
==============================

TVM supports Xilinx FPGA board with SDAccel. Here is a tutorial for how to deploy TVM to AWS F1 FPGA instance.
Member

Add a note saying that this is still experimental, as we cannot yet use it to deploy end-to-end neural networks.

@@ -13,6 +13,10 @@ if(USE_OPENCL)
file(GLOB RUNTIME_OPENCL_SRCS src/runtime/opencl/*.cc)
list(APPEND TVM_RUNTIME_LINKER_LIBS ${OpenCL_LIBRARIES})
list(APPEND RUNTIME_SRCS ${RUNTIME_OPENCL_SRCS})
if($ENV{XILINX_SDX})
Member

Would this conflict with runtime SDK?

Contributor Author

When we have another SDK, we can still compile tvm and find multiple OpenCL platforms with clGetPlatformInfo(). Currently, tvm cannot handle this case, but that's a separate problem:
https://github.com/dmlc/tvm/blob/1e66d3c/src/runtime/opencl/opencl_device_api.cc#L242
I'm thinking of sending another PR to support it.

Anyway, I'll remove this if condition, since cmake can compile tvm for sdaccel either way. It is not necessary to use the headers and libraries in the Xilinx toolchain at compile time.
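The multi-platform selection problem mentioned above can be sketched in a few lines (hypothetical helper; a real fix would live in the C++ runtime in opencl_device_api.cc):

```python
def pick_platform(platform_names, preferred="Xilinx"):
    """Choose an OpenCL platform by name (illustrative).

    clGetPlatformIDs can return several platforms (e.g. a GPU vendor
    plus 'Xilinx'); the runtime should match by name rather than
    assume the first entry is the one to use.
    """
    for name in platform_names:
        if preferred.lower() in name.lower():
            return name
    # Fall back to the first platform when no name matches.
    return platform_names[0] if platform_names else None
```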

@tqchen
Member

tqchen commented Jun 24, 2018

Thanks for all the reviews. I think the conclusion is that we want to bring this in. However, there are a few things we need to keep in mind, as there are limitations to this approach:

  • The FPGA pipeline is very different from normal codegen.
    • Since every OpenCL kernel will be hardened into real hardware, unless we build an entire pipeline, we cannot make it flow well.
  • Ideally, we would like to build a hardware core (e.g. a tensor core) and an ISA, and use TVM to generate that ISA. @tmoreau89 is going to send something upstream soon, and hopefully that will give a more pragmatic approach to interfacing.
  • HLS itself is still super valuable if we can use TVM to automatically generate these processing cores and figure out how to reuse them with a virtual ISA.

@kazum
Contributor Author

kazum commented Jun 25, 2018

@tqchen Thanks for your comments. I addressed your review comments and rebased onto the current master.

Ideally, we would like to build a hardware core (e.g. a tensor core) and an ISA, and use TVM to generate that ISA. @tmoreau89 is going to send something upstream soon, and hopefully that will give a more pragmatic approach to interfacing.

It sounds very nice! I'm looking forward to it. :)

@tqchen tqchen merged commit 9c64ab2 into apache:master Jun 26, 2018
@tqchen
Member

tqchen commented Jun 26, 2018

Thanks! This is merged.

@kazum kazum deleted the sdaccel-support branch June 27, 2018 19:57
tqchen pushed a commit to tqchen/tvm that referenced this pull request Jul 6, 2018
mnuyens pushed a commit to mnuyens/tvm that referenced this pull request Jul 10, 2018
sergei-mironov pushed a commit to sergei-mironov/tvm that referenced this pull request Aug 8, 2018