
Add support for Xilinx FPGA board with SDAccel #1278

Merged
merged 6 commits into apache:master from sdaccel-support
Jun 26, 2018

Conversation

kazum
Contributor

@kazum kazum commented Jun 13, 2018

This PR adds initial support for Xilinx FPGA boards with SDAccel. The following are included:

  • SDAccel backend based on OpenCL runtime
  • Vivado HLS code generation
  • Tutorial to deploy TVM on AWS F1 instance

This PR is only a preliminary implementation, but I think it would be a good starting point for FPGA support.

@kazum kazum force-pushed the sdaccel-support branch 4 times, most recently from 6f7c8d6 to fc8f768 on June 13, 2018 at 20:47
@tqchen
Member

tqchen commented Jun 14, 2018

@tmoreau89 @Huyuwei can you share your input on this?

@tmoreau89
Contributor

Looping in @vegaluisjose, who can add his two cents. I'll try reproducing the tutorial over the next few days on F1 to make sure that this runs correctly. This looks like a promising start to having an OpenCL back-end for FPGAs.

@Huyuwei
Contributor

Huyuwei commented Jun 14, 2018

@seanlatias @comaniac may be interested and have something to share

@comaniac
Contributor

I briefly reviewed the PR and it looks good to me.
I think the programming model for compilation and execution could definitely be improved, but I would prefer to let this PR in first, since the mechanism of registering an AFI and setting up an F1 instance could involve many AWS account issues (e.g., permissions).

While I don't have comments on the runtime part, I'd like to confirm some points about the kernel compilation part.

  1. From reading the tutorial, it seems that in order to run a TVM program on FPGAs, you require users to specify the pipeline binding, much as TVM currently requires users to specify thread binding for OpenCL on GPUs. As far as I know, this user-specified binding helps not only kernel optimization but also host/device splitting. Is this how you identify the FPGA kernel scope?
  2. Could you explain the current limitations of the HLS code generation? I briefly reviewed it, and it seems that for now you only generate the corresponding ap_uint data types and insert interface pragmas. I know other optimization pragmas such as unroll or pipeline could be added in the future, but I'd like to know whether there are any cases you currently cannot support.

Thanks.

@kazum
Contributor Author

kazum commented Jun 15, 2018

From reading the tutorial, it seems that in order to run a TVM program on FPGAs, you require users to specify the pipeline binding, much as TVM currently requires users to specify thread binding for OpenCL on GPUs. As far as I know, this user-specified binding helps not only kernel optimization but also host/device splitting. Is this how you identify the FPGA kernel scope?

Yes, exactly. I think it's the most intuitive way, but any other suggestions would be much appreciated.
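To make the binding concrete, here is a pseudocode-style sketch in the spirit of the PR's tutorial (2018-era TVM API; the `pipeline` thread axis and the `sdaccel` target come from this PR, everything else is illustrative and needs an SDAccel-capable TVM build to actually run):

```
import tvm

n = 1024
A = tvm.placeholder((n,), name="A")
B = tvm.compute((n,), lambda i: A[i] + 1.0, name="B")

s = tvm.create_schedule(B.op)
# Binding the outer axis to the "pipeline" pseudo thread axis marks
# everything under it as the FPGA kernel scope, analogous to
# blockIdx/threadIdx binding for OpenCL on GPUs.
px, x = s[B].split(B.op.axis[0], nparts=1)
s[B].bind(px, tvm.thread_axis("pipeline"))

mod = tvm.build(s, [A, B], target="sdaccel", target_host="llvm")
```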

Could you explain the current limitations of the HLS code generation? I briefly reviewed it, and it seems that for now you only generate the corresponding ap_uint data types and insert interface pragmas. I know other optimization pragmas such as unroll or pipeline could be added in the future, but I'd like to know whether there are any cases you currently cannot support.

From the point of view of code semantics, I guess there is no limitation; Vivado HLS can compile the generated code as long as FPGA resources are available. For optimization, there are many things we have to do, but they would be future work, as you said.

Thanks.

@comaniac
Contributor

comaniac commented Jun 15, 2018

I traced through some more parts based on your clarification and made the following summary of this PR:

  • This PR provides a working flow that compiles TVM IR to HLS C for SDAccel.
  • Unlike any existing TVM-supported backend flow, this flow fully decouples compilation and execution to deal with the long place-and-route time. This is totally reasonable. Although the connection between compilation and execution does not seem clean enough yet, it could be improved later on.
  • The flow provides runtime support for the AWS F1 instance but does not currently target other standalone servers with Xilinx FPGAs.
  • The user has to specify the pipeline binding in order to define the kernel scope, but the pipeline binding currently doesn't affect the performance of the generated HLS C code.
  • This PR does not yet have any performance considerations for the generated HLS kernels.

Please correct me if I misunderstand anything.
Since it is the first HLS flow in TVM and it doesn't affect other backends, it's safe to let it in as an experimental feature, in my opinion.

Thanks.

@kazum
Contributor Author

kazum commented Jun 15, 2018

This PR provides a working flow that compiles TVM IR to HLS C for SDAccel.

Yes.

Unlike any existing TVM-supported backend flow, this flow fully decouples compilation and execution to deal with the long place-and-route time. This is totally reasonable. Although the connection between compilation and execution does not seem clean enough yet, it could be improved later on.

The main reason I decoupled the tutorial script is that we need to create an AFI (Amazon FPGA Image) before execution on the AWS F1 instance. If we use a standalone server with Xilinx FPGAs, I think we can combine those two scripts into one. I confirmed that I could compile and execute the vector addition code with one script for software/hardware emulation, so we can use the same code as with other backends, at least for testing purposes.

The flow provides runtime support for the AWS F1 instance but does not currently target other standalone servers with Xilinx FPGAs.

No, I guess the flow for other standalone servers would be similar, though I haven't tested it yet. The only thing we need to change, I think, is to specify an xclbin binary instead of an awsxclbin file.
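The difference described here can be sketched as a one-line selection (hypothetical helper; not part of the PR):

```python
def kernel_binary_name(base, on_aws_f1):
    """Pick the program binary for the target platform (illustrative).

    On AWS F1 the bitstream is registered as an AFI and referenced
    through an *.awsxclbin file; a standalone Xilinx board loads the
    *.xclbin produced by the SDAccel compiler directly.
    """
    return base + (".awsxclbin" if on_aws_f1 else ".xclbin")

print(kernel_binary_name("myexp", True))   # AWS F1 instance
print(kernel_binary_name("myexp", False))  # standalone Xilinx board
```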

The user has to specify the pipeline binding in order to define the kernel scope, but the pipeline binding currently doesn't affect the performance of the generated HLS C code.

Yes.

This PR does not yet have any performance considerations for the generated HLS kernels.

Yes.

Thanks for your help with the clarification!

@kazum
Contributor Author

kazum commented Jun 16, 2018

@comaniac I've added a tiny test case which shows we can try the SDAccel backend without decoupling the scripts for emulation.

np.testing.assert_allclose(
    b.asnumpy(), np.exp(a.asnumpy()), rtol=1e-5)

check_device("sdaccel")
Contributor

Does the TVM Jenkins environment have SDAccel?
If so, how did you mock the bitstream generation part?
If not, it seems to me that this test will exit at line 29 and nothing will be tested.

Contributor Author

I think the Jenkins environment of this community doesn't have SDAccel, so the test is not run via CI. I checked that the test works on my AWS F1 instance:

$ python -m nose -vs tests/python/integration/test_ewise_fpga.py
test_ewise_fpga.test_exp ... [07:37:36] /home/centos/src/project_data/tvm/src/runtime/opencl/opencl_device_api.cc:232: Initialize OpenCL platform 'Xilinx'
[07:37:36] /home/centos/src/project_data/tvm/src/runtime/opencl/opencl_device_api.cc:259: opencl(0)='xilinx_aws-vu9p-f1_dynamic_5_0' cl_device_id=0x1ad7ef0
#include <ap_int.h>

extern "C" void myexp_kernel0( float* B,  float* A) {
#pragma HLS INTERFACE m_axi port=B  offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=B bundle=control
#pragma HLS INTERFACE m_axi port=A  offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=A bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

  for (int i0_inner = 0; i0_inner < 1024; ++i0_inner) {
    B[i0_inner] = expf(A[i0_inner]);
  }
}



****** xocc v2017.4 (64-bit)
  **** SW Build 2193837 on Tue Apr 10 18:06:59 MDT 2018
    ** Copyright 1986-2017 Xilinx, Inc. All Rights Reserved.

Attempting to get a license: ap_opencl
Feature available: ap_opencl
INFO: [XOCC 60-585] Compiling for software emulation target

(snip)

INFO: [XOCC 60-586] Created /tmp/tmpc5uk9a/output.xclbin
INFO: [XOCC 60-791] Total elapsed time: 0h 0m 11s
ok

----------------------------------------------------------------------
Ran 1 test in 61.040s

OK

I set the environment variable XCL_EMULATION_MODE in the Python test file, so SDAccel doesn't generate a bitstream but runs in software emulation mode; it works like a mock.
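The mocking approach can be sketched as a small wrapper (the wrapper name is hypothetical; the exact value XCL_EMULATION_MODE should take depends on the SDAccel release, with later versions documenting "sw_emu" and "hw_emu"):

```python
import os
import subprocess
import sys

def run_in_emulation(cmd, mode="sw_emu"):
    """Run a command with SDAccel emulation enabled (illustrative).

    Setting XCL_EMULATION_MODE makes the Xilinx OpenCL runtime execute
    kernels in emulation instead of loading a real bitstream, so tests
    can run without hours of place-and-route.
    """
    env = dict(os.environ)
    env["XCL_EMULATION_MODE"] = mode
    return subprocess.run(cmd, env=env, capture_output=True, text=True)
```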

Let me know if I'm missing something.

Contributor

I see. Could we figure out a plan to include SDAccel in Jenkins, or an alternative way to perform unit tests without SDAccel? I'm worried that it would be hard to maintain once other contributors start adding more features.

Contributor Author

Another way is to use the AWS FPGA Developer AMI. It contains SDAccel.

@tqchen, is it possible to launch an AWS instance from Jenkins for FPGA testing purposes? Note that we don't need F1 instances for emulation. Any instance type is okay.

@kazum
Contributor Author

kazum commented Jun 20, 2018

I'll try reproducing the tutorial over the next few days on F1 to make sure that this runs correctly.

@tmoreau89, have you tried the tutorial? Let me know if there is anything I can help you with. :)

@tqchen
Member

tqchen commented Jun 20, 2018

I have been discussing with @tmoreau89 for a while how to introduce SDAccel and other VHLS toolchains into the test infrastructure. The main challenge is that our current test infra is based on Docker, and because setting up the Xilinx toolchain requires some manual steps, it is hard to dockerize these parts.

Given the situation @comaniac described, we might need to look into possible solutions.

@kazum
Contributor Author

kazum commented Jun 21, 2018

I've added a change to create a program binary for the GPU when SDAccel is not available. I think we can test the sdaccel backend in Jenkins with this change.

@tqchen @tmoreau89 @comaniac Let me know your opinion.

os.environ.get("AWS_PLATFORM", "xilinx:kcu1500:dynamic"))
platform = os.environ.get("XCL_PLATFORM", os.environ.get("AWS_PLATFORM"))

if platform is None:
Contributor

@comaniac comaniac commented Jun 21, 2018

I suggest showing a warning like "Failed to set up SDAccel, running on GPU instead." in case a user really wants to run with SDAccel but just forgot to set up the environment.
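The suggested behavior could look like this (a sketch assuming the environment-variable lookup shown in the diff; the helper name is hypothetical and the warning text follows the review comment):

```python
import os
import warnings

def get_sdaccel_platform():
    """Resolve the SDAccel platform from the environment (illustrative).

    Mirrors the lookup in the diff above: XCL_PLATFORM wins, then
    AWS_PLATFORM; if neither is set, warn and let the caller fall
    back to a plain GPU/OpenCL build.
    """
    platform = os.environ.get("XCL_PLATFORM", os.environ.get("AWS_PLATFORM"))
    if platform is None:
        warnings.warn("Failed to set up SDAccel, running on GPU instead.")
    return platform
```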

@@ -121,6 +121,22 @@ def vpi(dev_id=0):
return TVMContext(9, dev_id)


def sdaccel(dev_id=0):
Member

This is not necessary; from the frontend's perspective, sdaccel is just opencl.

}
}

void CodeGenVHLS::AddFunction(LoweredFunc f) {
Member

VHLS -> VivadoHLS

TVM_REGISTER_GLOBAL("tvm.intrin.rule.sdaccel.popcount")
.set_body(DispatchExtern<Direct>);

TVM_REGISTER_GLOBAL("tvm.intrin.rule.sdaccel.tvm_warp_shuffle")
Member

I don't think sdaccel has warp shuffle.

@@ -0,0 +1,151 @@
Deploy to AWS F1 FPGA Instance
Member

Deploy to AWS F1 -> HLS Backend Example

Deploy to AWS F1 FPGA Instance
==============================

TVM supports Xilinx FPGA board with SDAccel. Here is a tutorial for how to deploy TVM to AWS F1 FPGA instance.
Member

Add a note saying that this is still experimental, as we cannot yet use it to deploy end-to-end neural networks.

@@ -13,6 +13,10 @@ if(USE_OPENCL)
file(GLOB RUNTIME_OPENCL_SRCS src/runtime/opencl/*.cc)
list(APPEND TVM_RUNTIME_LINKER_LIBS ${OpenCL_LIBRARIES})
list(APPEND RUNTIME_SRCS ${RUNTIME_OPENCL_SRCS})
if($ENV{XILINX_SDX})
Member

Would this conflict with runtime SDK?

Contributor Author

When we have another SDK, we can still compile tvm and find multiple OpenCL platforms with clGetPlatformInfo(). Currently, tvm cannot handle this case, but that's a separate problem:
https://github.com/dmlc/tvm/blob/1e66d3c/src/runtime/opencl/opencl_device_api.cc#L242
I'm thinking of sending another PR to support it.

Anyway, I'll remove this if condition, since cmake can compile tvm for sdaccel either way. It is not necessary to use the headers and libraries in the Xilinx toolchain at compile time.
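The multi-platform selection problem mentioned above can be sketched in a few lines (hypothetical helper; a real fix would live in the C++ runtime in opencl_device_api.cc):

```python
def pick_platform(platform_names, preferred="Xilinx"):
    """Choose an OpenCL platform by name (illustrative).

    clGetPlatformIDs can return several platforms (e.g. a GPU vendor
    plus 'Xilinx'); the runtime should match by name rather than
    assume the first entry is the one to use.
    """
    for name in platform_names:
        if preferred.lower() in name.lower():
            return name
    # Fall back to the first platform when no name matches.
    return platform_names[0] if platform_names else None
```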

@tqchen
Member

tqchen commented Jun 24, 2018

Thanks for all the reviews. I think the conclusion is that we want to bring this in. However, there are a few things we need to keep in mind, as there are limitations to this approach:

  • The FPGA pipeline is very different from normal codegen.
    • Since every OpenCL kernel will be hardened into real hardware, unless we build an entire pipeline, we cannot make it flow well.
  • Ideally, we would like to build a hardware core (e.g. a tensor core) and an ISA, and use TVM to generate that ISA. @tmoreau89 is going to send something upstream soon, and hopefully that will give a more pragmatic approach to interfacing.
  • HLS itself is still super valuable if we can use TVM to automatically generate these processing cores and figure out how to reuse them with a virtual ISA.

@kazum
Contributor Author

kazum commented Jun 25, 2018

@tqchen Thanks for your comments. I addressed your review comments and rebased onto the current master.

Ideally, we would like to build a hardware core (e.g. a tensor core) and an ISA, and use TVM to generate that ISA. @tmoreau89 is going to send something upstream soon, and hopefully that will give a more pragmatic approach to interfacing.

It sounds very nice! I'm looking forward to it. :)

@tqchen tqchen merged commit 9c64ab2 into apache:master Jun 26, 2018
@tqchen
Member

tqchen commented Jun 26, 2018

Thanks! This is merged.

@kazum kazum deleted the sdaccel-support branch June 27, 2018 19:57
tqchen pushed a commit to tqchen/tvm that referenced this pull request Jul 6, 2018
mnuyens pushed a commit to mnuyens/tvm that referenced this pull request Jul 10, 2018
sergei-mironov pushed a commit to sergei-mironov/tvm that referenced this pull request Aug 8, 2018