VTA Chisel Wide memory interface. #32

Merged: 17 commits into apache:main on Sep 9, 2021

Conversation

@aasorokiin (Contributor)

VTA modification for a parameterizable AXI data width.

Performance change (tsim cycle counts from test_benchmark_topi_conv2d.py, limited to resnet-18.C2):

  • Baseline 64-bit data: 192M cycles
  • New 64-bit data transfer: 97M cycles
  • 128-bit data: 58M cycles
  • 256-bit data: 39M cycles
  • 512-bit data: 29M cycles
Code changes:

  • AXI 64/128/256/512 data-bit widths supported, selected by AXIParams->dataBits
  • TensorLoad is modified to replace all VME load operations. Multiple simultaneous requests can be generated. Load is pipelined and separated from request generation. A "wide" implementation of load/store is used when the AXI interface data width is larger than the number of bits in a tensor (see the sketch after this list).
  • TensorStore -> TensorStoreNarrowVME, TensorStoreWideVME. The narrow one is the original implementation.
  • TensorLoad -> TensorLoadSimple (original), TensorLoadWideVME, TensorLoadNarrowVME.
  • LoadUop -> LoadUopSimple (the original one). The new one is based on TensorLoad.
  • Fetch -> FetchVME64, FetchWideVME. Reuses the communication part from TensorLoad. Implemented as a 64-bit tensor with a double tensor read to allow 64-bit address alignment.
  • The DPI interface changed to transfer more than 64 bits. svOpenArrayHandle is used; tsim library compilation now requires the Verilator includes.
  • Compute is changed to use the TensorLoad style of uop load.
  • VME changed to generate/queue/respond to multiple simultaneous load requests.
  • Added SyncQueue with tests. The implementation uses synchronous memory to implement larger queues.
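
To make the wide/narrow split above concrete, here is a minimal, hedged Chisel sketch of the selection principle. Only the names TensorLoadWideVME/TensorLoadNarrowVME and AXIParams->dataBits come from this PR; the *Sketch classes, the IO bundle, and the parameter plumbing below are hypothetical placeholders, not the actual VTA code.

```scala
import chisel3._

// Hypothetical stand-ins for the real VTA parameter case classes.
case class AXIParamsSketch(dataBits: Int = 64)       // AXI bus width in bits
case class TensorParamsSketch(tensorBits: Int = 64)  // bits in one tensor access

// Minimal common interface both load implementations would expose.
class TensorLoadIOSketch(tensorBits: Int) extends Bundle {
  val start = Input(Bool())
  val done  = Output(Bool())
  val data  = Output(UInt(tensorBits.W))
}

// Placeholder "narrow" load: one tensor per AXI beat (original behaviour).
class TensorLoadNarrowSketch(tp: TensorParamsSketch) extends Module {
  val io = IO(new TensorLoadIOSketch(tp.tensorBits))
  io.done := io.start
  io.data := 0.U
}

// Placeholder "wide" load: several tensors per AXI beat.
class TensorLoadWideSketch(ap: AXIParamsSketch, tp: TensorParamsSketch) extends Module {
  val io = IO(new TensorLoadIOSketch(tp.tensorBits))
  val tensorsPerBeat = ap.dataBits / tp.tensorBits   // unpack factor per beat
  io.done := io.start
  io.data := 0.U
}

// Elaboration-time selection: use the wide path only when the AXI interface
// is wider than one tensor, matching the rule stated in the PR description.
class TensorLoadSelect(ap: AXIParamsSketch, tp: TensorParamsSketch) extends Module {
  val io = IO(new TensorLoadIOSketch(tp.tensorBits))
  val impl =
    if (ap.dataBits > tp.tensorBits) Module(new TensorLoadWideSketch(ap, tp))
    else Module(new TensorLoadNarrowSketch(tp))
  io <> impl.io
}
```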

Code contributions to this PR were made by the following individuals (in alphabetical order): @suvadeep89, @stevenmburns, @pasqoc, @adavare, @sjain12intel, @aasorokiin, and @zhenkuny.

@aasorokiin (Contributor, Author)

Adding @tmoreau89. I need help deciding how to proceed further. The checks flow fails because the VTA DPI modifications require changes to TVM. A Verilator include directory is required in order to build the new TSIM DPI communication library with the TVM make flow. Those changes are needed to transfer more than 64 bits of data as SV arrays. There is a pull request in TVM, "VTA cmake change to include Verilator header for building tsim library" #8797. It fails checks because no Verilator is found by cmake.

@tmoreau89 (Contributor)

Thank you @aasorokiin - CC-ing @vegaluisjose

Resolved review threads (now outdated) on:
  • hardware/chisel/src/main/scala/util/SyncQueue.scala
  • src/dpi/module.cc (2)
  • hardware/chisel/src/main/scala/core/TensorUtil.scala (2)
  • hardware/chisel/src/main/scala/core/LoadUopSimple.scala
  • hardware/chisel/src/main/scala/core/FetchWideVME.scala (2)
  • hardware/chisel/src/main/scala/core/Compute.scala
… (apache#33)

* Fix Makefile to use Chisel -o instead of the top name, and .sv instead of .v
* Fix reset to reset.asBool
* Fix SyncQueue's use of the deprecated module.io
* Fix toBools to asBools (see the short illustration below)
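
A minimal illustration, not taken from the VTA source, of the two Chisel API migrations mentioned above (reset.asBool and toBools -> asBools); the module and signal names are made up.

```scala
import chisel3._

class ChiselMigrationExample extends Module {
  val io = IO(new Bundle {
    val in   = Input(UInt(4.W))
    val bits = Output(Vec(4, Bool()))
  })
  // Newer Chisel: reset has the abstract Reset type, so convert explicitly
  // when a Bool is needed (previously it was implicitly a Bool).
  val resetAsBool: Bool = reset.asBool
  // Newer Chisel: split a UInt into bits with asBools (toBools is deprecated).
  io.bits := VecInit(io.in.asBools.map(_ && !resetAsBool))
}
```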
@aasorokiin (Contributor, Author)

Updated this request to comply with the Chisel changes from #33.
I cannot quite figure out why the latest main with #33 doesn't compile the lib for me (make in the chisel directory). It complains that --top-name cannot be used. Proposed Makefile changes are in this PR. @ekiwi

@vegaluisjose (Member)

I think I found the issue with the new Chisel version; I just submitted #34, which should fix it.

@tmoreau89 (Contributor)

#34 was just merged, @aasorokiin please rebase

@aasorokiin (Contributor, Author)

The changes were addressed. The TSIM CI should also pass now. @vegaluisjose

@vegaluisjose (Member)

Could you push an empty commit to see if we can re-trigger CI?

Thanks! @aasorokiin

@aasorokiin (Contributor, Author)

I tried that, but the checks keep failing.
I cannot reproduce the problem. To me it looks like a CI caching error. @vegaluisjose

@vegaluisjose (Member) commented on Aug 27, 2021

I think I found what is causing the issue. Can you try to update the docker image versions in the Jenkinsfile to the following?

The change should be:

ci_lint = "tvmai/ci-lint:v0.67"
ci_cpu = "tvmai/ci-cpu:v0.77"
ci_i386 = "tvmai/ci-i386:v0.73"

Resolved review thread (now outdated) on Jenkinsfile.
@vegaluisjose (Member)

Oh, I see what is going on now. Can you do this docker update in a separate PR? We will merge that before this one, so it can pick up the changes.

@vegaluisjose (Member)

Hey @aasorokiin, now that we have #36 merged, could you please rebase to see how this one goes?

@aasorokiin (Contributor, Author)

Is there any reason the tsim test run got skipped? @vegaluisjose

@vegaluisjose (Member)

Oh yeah, could you please remove the following?

https://github.com/apache/tvm-vta/blob/main/tests/scripts/docker_bash.sh#L70-L74

I believe we fixed this already

@@ -211,7 +211,7 @@ class AXIMaster(params: AXIParams) extends AXIBase(params) {
   ar.bits.qos := params.qosConst.U
   ar.bits.region := params.regionConst.U
   ar.bits.size := params.sizeConst.U
-  ar.bits.id := params.idConst.U
+  //do not override
Member

Btw, what is up with ar.bits.id and w.bits.strb here? By default the strobe mask is all ones. Also, ar.bits.id is not being assigned now.

Contributor Author

id/strb should be handled by VME now. VME can send multiple read requests with different IDs. Strb is used by the TensorStore write to mark the data of valid tensors when the AXI data is wider than the tensor data.
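
For illustration only (a hedged sketch, not the VTA TensorStore code): one way a per-byte write strobe could mark a single valid tensor inside a wider AXI beat. All names and the lane-indexing scheme here are assumptions.

```scala
import chisel3._
import chisel3.util._

// Produces an AXI-style wstrb that enables only the bytes of one tensor lane
// inside a wider data beat; all other byte lanes stay masked off.
class TensorLaneStrobe(axiDataBits: Int, tensorBits: Int) extends Module {
  require(axiDataBits > tensorBits, "only meaningful when AXI is wider than a tensor")
  require(axiDataBits % tensorBits == 0 && tensorBits % 8 == 0)
  val lanes          = axiDataBits / tensorBits   // tensors per AXI beat
  val bytesPerTensor = tensorBits / 8
  val strbBytes      = axiDataBits / 8

  val io = IO(new Bundle {
    val lane = Input(UInt(log2Ceil(lanes).W))     // which tensor slot holds valid data
    val strb = Output(UInt(strbBytes.W))          // per-byte write enable for the beat
  })

  // All-ones mask covering one tensor, shifted to the valid lane's byte offset.
  val laneMask = ((BigInt(1) << bytesPerTensor) - 1).U(strbBytes.W)
  io.strb := (laneMask << (io.lane * bytesPerTensor.U))(strbBytes - 1, 0)
}
```

The narrow path would instead drive all strobe bits high, matching the all-ones default mentioned in the review comment above.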

Member

Sure, but why change the defaults? Isn't it better to let these modules customize (override) these values?

Also, if it is done for reads (ar), why is it not done for writes (aw)?

@aasorokiin (Contributor, Author), Sep 1, 2021

IMHO, those are not defaults. They were set in setConst, which assumed the values would not change.
The VME write path did not change other than supporting wider data; it was not identified as critical for hiding memory access latency. The DE10 config's AXI burst of 16 pulses was short enough to make hiding burst-to-burst read latency important. Now even 1-pulse bursts can potentially transfer read data without gaps.

Member

Sure, the original intent of this method, back when we had support for one small board/configuration, was to use this as a mechanism to derive other params in the protocol. However, things have obviously evolved quite a bit since then.

What if we do the same for both reads/writes in terms of id so future contributors know that they have to handle id and strb?

Maybe we can even add comments on top of that setConst method saying what they should implement and we can point to that documentation later if needed.

For the record, these suggestions are intended to help future contributions. We really appreciate all your efforts on this!

Contributor Author

I moved the write id to VME and added comments. Please check the PR.

Two resolved review threads (now outdated) on hardware/chisel/src/main/scala/shell/VME.scala.
@vegaluisjose (Member) left a comment

Alright, this looks ready to go on my end, thanks a lot @aasorokiin!

Do you have anything to add, @tmoreau89?

@tmoreau89 (Contributor)

Thanks @vegaluisjose for reviewing the PR and ensuring that the CI tests go green, much appreciated. While going through the PR, there were a few things that I'd like to address before we merge:

  • There are still quite a few comments that need to be cleaned up.
  • The comments could use a good spell-check pass; there are many typos in some files. Perhaps it could be automated via the preferred IDE.
  • Overall I think this is not blocking the merge, but we could extend the Scala linter to address consistent comment formatting, etc.

@aasorokiin (Contributor, Author)

Is there anything else expected from me? @tmoreau89 @vegaluisjose

@tmoreau89 (Contributor)

Hi @aasorokiin - thanks for the ping and addressing the comments.

@tmoreau89 merged commit 36a9157 into apache:main on Sep 9, 2021
@tmoreau89 (Contributor)

Thanks @vegaluisjose @aasorokiin and Intel team - the PR has been merged.

jinhongyii pushed a commit that referenced this pull request Sep 5, 2023