
[VTA] Enable streamlined GEMM execution #4392

Merged: 5 commits into apache:master on Nov 27, 2019

Conversation

@liangfu (Member) commented Nov 21, 2019

This PR fixes an issue in the streamlined GEMM execution by disabling the pipelined adder, which consumes 4 extra cycles (with LOG_BLOCK=4) on top of the single-cycle fused multiply-adder. That latency exceeds what the 4-stage pipeline in the TensorGemm module can absorb, so instead of creating a routine to wait for the pipelined adder, this PR disables it and brings the accumulated results to the output immediately.
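For context, here is a minimal Chisel sketch of the two adder styles being traded off; the module names `PipeAdder` and `Adder` and their ports are illustrative assumptions, not the exact VTA TensorGemm source:

```scala
import chisel3._

// Illustrative sketch only; names and widths are assumptions,
// not the exact VTA TensorGemm source.
class PipeAdder(val bits: Int) extends Module {
  val io = IO(new Bundle {
    val a = Input(SInt(bits.W))
    val b = Input(SInt(bits.W))
    val y = Output(SInt((bits + 1).W))
  })
  // Registered sum: the result appears one cycle after the inputs,
  // so a tree built from four layers of these adds 4 cycles of latency.
  io.y := RegNext(io.a +& io.b)
}

class Adder(val bits: Int) extends Module {
  val io = IO(new Bundle {
    val a = Input(SInt(bits.W))
    val b = Input(SInt(bits.W))
    val y = Output(SInt((bits + 1).W))
  })
  // Combinational sum: the result is available in the same cycle.
  io.y := io.a +& io.b
}
```

With a registered sum at every layer, the 16-input (LOG_BLOCK=4) reduction tree takes 4 cycles; with combinational sums it completes in the same cycle.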

Previously, the SMT schedule for GEMM in test_vta_insn.py passed only because the streamlined GEMM execution did not accumulate along the row, so there was no dependency between stage cycles in the TensorGemm module.

In addition, this PR enables successful evaluation of matrix_multiply.py, matrix_multiply_opt.py, and convolution_opt.py under the tutorials directory.

@vegaluisjose @tmoreau89 Please review.

@vegaluisjose (Member) left a comment

Hey @liangfu,

Just to double-check: is this a fix for an error, or an improvement?

The reason we had the pipelined adder is that it showed better overall timing (after P&R), thanks to register retiming.

Have you tried pushing both versions through the tools and verifying timing?

@tmoreau89 (Contributor) commented

I second Luis' comments on reporting the latency/throughput tradeoffs. @liangfu, thank you for the PR; do you mind pushing the old and new VTA designs through Intel or Xilinx P&R and reporting fmax, area, and cycle count (perhaps on one of the conv2d benchmarks)?

@liangfu (Member, Author) commented Nov 22, 2019

This PR doesn't aim to reduce cycle count or bring any performance improvement.

My main intention is to make the matrix_multiply_opt.py and convolution_opt.py scripts evaluate successfully, so that we get closer to end-to-end support for evaluating resnet18. The reason the evaluation of these scripts failed previously is that:

> This PR fixes an issue in the streamlined GEMM execution by disabling the pipelined adder, which consumes 4 extra cycles (with LOG_BLOCK=4) on top of the single-cycle fused multiply-adder. That latency exceeds what the 4-stage pipeline in the TensorGemm module can absorb.

Here are the benchmarks from Intel's Timing Analyzer (the cycle count and result entries were measured with the matrix_multiply_opt.py script).

| design | area (in ALMs) | fmax (slow, 100C) | fmax (slow, -40C) | fmax (fast, 100C) | fmax (fast, -40C) | cycle count | result |
|---|---|---|---|---|---|---|---|
| PipeAdderX4 | 20,419/41,910 (49%) | 71.74 MHz | 73.52 MHz | 109.12 MHz | 135.46 MHz | 183,038 | fail |
| PipeAdderX2 AdderX2 | 19,811/41,910 (47%) | 66.94 MHz | 69.93 MHz | 100.76 MHz | 125.3 MHz | 183,006 | pass |
| PipeAdderX1 AdderX3 | 19,811/41,910 (47%) | 65.71 MHz | 68.54 MHz | 106.3 MHz | 130.82 MHz | 182,990 | pass |
| AdderX4 | 18,186/41,910 (43%) | 60.72 MHz | 58.82 MHz | 92.43 MHz | 114.97 MHz | 182,990 | pass |

However, I recommend the PipeAdderX1 AdderX3 design, since it guarantees correctness even in larger designs when evaluating matrix_multiply_opt.py.
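As a structural sketch of what the PipeAdderX1 AdderX3 configuration means (building on the illustrative PipeAdder/Adder modules above, and still an assumption rather than the exact VTA source): only the first of the four reduction layers is registered, so the whole 16-input tree adds a single cycle of latency while the remaining three layers stay combinational.

```scala
import chisel3._

// Illustrative "PipeAdderX1 AdderX3" reduction tree for LOG_BLOCK = 4
// (16 inputs): layer 1 uses registered PipeAdders (1 cycle of latency),
// layers 2-4 use combinational Adders. Assumes the PipeAdder/Adder
// sketch above; not the exact VTA TensorGemm source.
class ReduceTree(val bits: Int) extends Module {
  val n = 16
  val io = IO(new Bundle {
    val in  = Input(Vec(n, SInt(bits.W)))
    val out = Output(SInt((bits + 4).W)) // 4 layers, each widens by 1 bit
  })

  // Layer 1: registered adders (the single pipelined layer).
  val layer1: Seq[SInt] = Seq.tabulate(n / 2) { i =>
    val m = Module(new PipeAdder(bits))
    m.io.a := io.in(2 * i)
    m.io.b := io.in(2 * i + 1)
    m.io.y
  }

  // Layers 2..4: combinational pairwise reduction, same-cycle results.
  def reduce(xs: Seq[SInt]): SInt =
    if (xs.size == 1) xs.head
    else reduce(xs.grouped(2).toSeq.map { case Seq(a, b) =>
      val m = Module(new Adder(a.getWidth))
      m.io.a := a
      m.io.b := b
      m.io.y
    })

  io.out := reduce(layer1)
}
```

This keeps one register stage for retiming headroom (the fmax benefit Luis mentioned) while keeping the tree latency short enough for the 4-stage TensorGemm pipeline.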

@liangfu (Member, Author) commented Nov 26, 2019

@vegaluisjose @tmoreau89 I've updated the timing results, along with a change that adds a PipeAdder to the first layer of the adder tree. Please take another look.

@tmoreau89 (Contributor) left a comment

Thank you, Liangfu, for the enhancements and for sharing insights on performance on Intel FPGAs! I left a couple of nits, but it seems good to go.

@liangfu (Member, Author) commented Nov 26, 2019

@tmoreau89 All review comments have been addressed; please take another look.

For the Chisel-based design, I think our target for now is to bring end-to-end support (with sufficient scalability) and reproduce what the HLS design is capable of. After that, it would be more meaningful to consider performance improvements (with correctness guaranteed) and deprecate the HLS-based design along the way.

@tmoreau89 (Contributor) commented

Thanks @liangfu; I left one final comment, and the PR is good to go!

@tmoreau89 (Contributor) left a comment

Thanks, LGTM!

@tmoreau89 tmoreau89 merged commit 3a1c8c5 into apache:master Nov 27, 2019
Leo-arm pushed a commit to Leo-arm/tvm that referenced this pull request Nov 29, 2019
* disable pipelined adder and enable streamlined gemm execution

* pipeline first layer of adder

* explain difference between pipeadder and adder

* add comment for explaining the hard-coded latency
tmoreau89 pushed a commit to tmoreau89/tvm that referenced this pull request Dec 3, 2019
zxy844288792 pushed a commit to zxy844288792/tvm that referenced this pull request Dec 13, 2019
zxy844288792 pushed a commit to neo-ai/tvm that referenced this pull request Dec 13, 2019
tqchen pushed a commit to tqchen/tvm that referenced this pull request Mar 29, 2020
@liangfu deleted the patch-15 branch April 14, 2020 14:31