VTA Chisel Wide memory interface. #32

Merged: 17 commits into apache:main on Sep 9, 2021

Conversation

@aasorokiin (Contributor)

VTA modification for a parameterizable AXI data width.

Performance change (tsim cycle counts from test_benchmark_topi_conv2d.py, limited to resnet-18.C2):

  • Baseline 64-bit data: 192M cycles
  • New 64-bit data transfer: 97M cycles
  • 128-bit data: 58M cycles
  • 256-bit data: 39M cycles
  • 512-bit data: 29M cycles
Code changes:

  • AXI 64/128/256/512 data-bit widths supported, selected by AXIParams->dataBits
  • TensorLoad is modified to replace all VME load operations. Multiple simultaneous requests can be generated. Load is pipelined and separated from request generation. A "wide" implementation of load/store is used when the AXI interface data width is larger than the number of bits in a tensor (see the sketch after this list).
  • TensorStore -> TensorStoreNarrowVME, TensorStoreWideVME. The narrow one is the original implementation.
  • TensorLoad -> TensorLoadSimple (original), TensorLoadWideVME, TensorLoadNarrowVME.
  • LoadUop -> LoadUopSimple (the original one). The new one is based on TensorLoad.
  • Fetch -> FetchVME64, FetchWideVME. Reuses the communication part from TensorLoad. Implemented as a 64-bit tensor with a double tensor read to allow 64-bit address alignment.
  • The DPI interface changed to transfer more than 64 bits. svOpenArrayHandle is used; tsim library compilation now requires the Verilator includes.
  • Compute is changed to use the TensorLoad style of uop load.
  • VME changed to generate/queue/respond to multiple simultaneous load requests.
  • Added SyncQueue with tests. The implementation uses synchronous memory to implement larger queues.
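
To make the wide/narrow split above concrete, here is a minimal, hedged Chisel sketch of the selection principle. Only the names TensorLoadWideVME/TensorLoadNarrowVME and AXIParams->dataBits come from this PR; the *Sketch classes, the IO bundle, and the parameter plumbing below are hypothetical placeholders, not the actual VTA code.

```scala
import chisel3._

// Hypothetical stand-ins for the real VTA parameter case classes.
case class AXIParamsSketch(dataBits: Int = 64)       // AXI bus width in bits
case class TensorParamsSketch(tensorBits: Int = 64)  // bits in one tensor access

// Minimal common interface both load implementations would expose.
class TensorLoadIOSketch(tensorBits: Int) extends Bundle {
  val start = Input(Bool())
  val done  = Output(Bool())
  val data  = Output(UInt(tensorBits.W))
}

// Placeholder "narrow" load: one tensor per AXI beat (original behaviour).
class TensorLoadNarrowSketch(tp: TensorParamsSketch) extends Module {
  val io = IO(new TensorLoadIOSketch(tp.tensorBits))
  io.done := io.start
  io.data := 0.U
}

// Placeholder "wide" load: several tensors per AXI beat.
class TensorLoadWideSketch(ap: AXIParamsSketch, tp: TensorParamsSketch) extends Module {
  val io = IO(new TensorLoadIOSketch(tp.tensorBits))
  val tensorsPerBeat = ap.dataBits / tp.tensorBits   // unpack factor per beat
  io.done := io.start
  io.data := 0.U
}

// Elaboration-time selection: use the wide path only when the AXI interface
// is wider than one tensor, matching the rule stated in the PR description.
class TensorLoadSelect(ap: AXIParamsSketch, tp: TensorParamsSketch) extends Module {
  val io = IO(new TensorLoadIOSketch(tp.tensorBits))
  val impl =
    if (ap.dataBits > tp.tensorBits) Module(new TensorLoadWideSketch(ap, tp))
    else Module(new TensorLoadNarrowSketch(tp))
  io <> impl.io
}
```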

Code contributions to this PR were made by the following individuals (in alphabetical order): @suvadeep89, @stevenmburns, @pasqoc, @adavare, @sjain12intel, @aasorokiin, and @zhenkuny.

@aasorokiin (Contributor, Author)

Adding @tmoreau89. I need help deciding how to proceed further. The checks flow fails because the VTA DPI modifications require changes to TVM. A Verilator include directory is required in order to build the new TSIM DPI communication library with the TVM make flow. Those changes are needed to transfer more than 64 bits of data as SV arrays. There is a pull request in TVM, "VTA cmake change to include Verilator header for building tsim library" #8797. It fails checks because no Verilator is found by cmake.

@tmoreau89 (Contributor)

Thank you @aasorokiin - CC-ing @vegaluisjose

Resolved review threads (now outdated) on:
  • hardware/chisel/src/main/scala/util/SyncQueue.scala
  • src/dpi/module.cc (2)
  • hardware/chisel/src/main/scala/core/TensorUtil.scala (2)
  • hardware/chisel/src/main/scala/core/LoadUopSimple.scala
  • hardware/chisel/src/main/scala/core/FetchWideVME.scala (2)
  • hardware/chisel/src/main/scala/core/Compute.scala
… (apache#33)

* Fix Makefile to use Chisel -o instead of the top name, and .sv instead of .v
* Fix reset to reset.asBool
* Fix SyncQueue's use of the deprecated module.io
* Fix toBools to asBools (see the short illustration below)
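
A minimal illustration, not taken from the VTA source, of the two Chisel API migrations mentioned above (reset.asBool and toBools -> asBools); the module and signal names are made up.

```scala
import chisel3._

class ChiselMigrationExample extends Module {
  val io = IO(new Bundle {
    val in   = Input(UInt(4.W))
    val bits = Output(Vec(4, Bool()))
  })
  // Newer Chisel: reset has the abstract Reset type, so convert explicitly
  // when a Bool is needed (previously it was implicitly a Bool).
  val resetAsBool: Bool = reset.asBool
  // Newer Chisel: split a UInt into bits with asBools (toBools is deprecated).
  io.bits := VecInit(io.in.asBools.map(_ && !resetAsBool))
}
```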
@aasorokiin (Contributor, Author)

Updated this request to comply with the Chisel changes from #33.
I cannot quite figure out why the latest main with #33 doesn't compile the lib for me (make in the chisel directory). It complains that --top-name cannot be used. Proposed Makefile changes are in this PR. @ekiwi

@vegaluisjose (Member)

I think I found the issue with the new Chisel version; I just submitted #34, which should fix it.

@tmoreau89 (Contributor)

#34 was just merged, @aasorokiin please rebase

@aasorokiin (Contributor, Author)

The changes were addressed. The TSIM CI should also pass now. @vegaluisjose

@vegaluisjose (Member)

Could you push an empty commit to see if we can re-trigger CI?

Thanks! @aasorokiin

@aasorokiin (Contributor, Author)

I tried that, but the checks keep failing.
I cannot reproduce the problem. To me it looks like a CI caching error. @vegaluisjose

@vegaluisjose (Member) commented on Aug 27, 2021

I think I found what is causing the issue. Can you try to update the docker image versions in the Jenkinsfile to the following?

The change should be:

ci_lint = "tvmai/ci-lint:v0.67"
ci_cpu = "tvmai/ci-cpu:v0.77"
ci_i386 = "tvmai/ci-i386:v0.73"

Resolved review thread (now outdated) on Jenkinsfile.
@vegaluisjose (Member)

Oh, I see what is going on now. Can you do this docker update in a separate PR? We will merge that before this one, so it can pick up the changes.

@vegaluisjose (Member)

Hey @aasorokiin, now that we have #36 merged, could you please rebase to see how this one goes?

@aasorokiin (Contributor, Author)

Is there any reason the tsim test run got skipped? @vegaluisjose

@vegaluisjose (Member)

Oh yeah, could you please remove the following?

https://github.com/apache/tvm-vta/blob/main/tests/scripts/docker_bash.sh#L70-L74

I believe we fixed this already

@@ -211,7 +211,7 @@ class AXIMaster(params: AXIParams) extends AXIBase(params) {
   ar.bits.qos := params.qosConst.U
   ar.bits.region := params.regionConst.U
   ar.bits.size := params.sizeConst.U
-  ar.bits.id := params.idConst.U
+  //do not override
Member

Btw, what is up with ar.bits.id and w.bits.strb here? By default the strobe mask is all ones. Also, ar.bits.id is not being assigned now.

Contributor Author

id/strb should be handled by VME now. VME can send multiple read requests with different IDs. Strb is used by the TensorStore write to mark the data of valid tensors when the AXI data is wider than the tensor data.
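
For illustration only (a hedged sketch, not the VTA TensorStore code): one way a per-byte write strobe could mark a single valid tensor inside a wider AXI beat. All names and the lane-indexing scheme here are assumptions.

```scala
import chisel3._
import chisel3.util._

// Produces an AXI-style wstrb that enables only the bytes of one tensor lane
// inside a wider data beat; all other byte lanes stay masked off.
class TensorLaneStrobe(axiDataBits: Int, tensorBits: Int) extends Module {
  require(axiDataBits > tensorBits, "only meaningful when AXI is wider than a tensor")
  require(axiDataBits % tensorBits == 0 && tensorBits % 8 == 0)
  val lanes          = axiDataBits / tensorBits   // tensors per AXI beat
  val bytesPerTensor = tensorBits / 8
  val strbBytes      = axiDataBits / 8

  val io = IO(new Bundle {
    val lane = Input(UInt(log2Ceil(lanes).W))     // which tensor slot holds valid data
    val strb = Output(UInt(strbBytes.W))          // per-byte write enable for the beat
  })

  // All-ones mask covering one tensor, shifted to the valid lane's byte offset.
  val laneMask = ((BigInt(1) << bytesPerTensor) - 1).U(strbBytes.W)
  io.strb := (laneMask << (io.lane * bytesPerTensor.U))(strbBytes - 1, 0)
}
```

The narrow path would instead drive all strobe bits high, matching the all-ones default mentioned in the review comment above.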

Member

Sure, but why change the defaults? Isn't it better to let these modules customize (override) these values?

Also, if it is done for reads (ar), why is it not done for writes (aw)?

@aasorokiin (Contributor, Author), Sep 1, 2021

IMHO, those are not defaults. They were set in setConst, which assumed the values would not change.
The VME write path did not change other than supporting wider data; it was not identified as critical for hiding memory access latency. The DE10 config's AXI burst of 16 pulses was short enough to make hiding burst-to-burst read latency important. Now even 1-pulse bursts can potentially transfer read data without gaps.

Member

Sure, the original intent of this method, back when we had support for one small board/configuration, was to use this as a mechanism to derive other params in the protocol. However, things have obviously evolved quite a bit since then.

What if we do the same for both reads/writes in terms of id so future contributors know that they have to handle id and strb?

Maybe we can even add comments on top of that setConst method saying what they should implement and we can point to that documentation later if needed.

For the record, these suggestions are intended to help future contributions. We really appreciate all your efforts on this!

Contributor Author

I moved the write id to VME and added comments. Please check the PR.

Two resolved review threads (now outdated) on hardware/chisel/src/main/scala/shell/VME.scala.
@vegaluisjose (Member) left a comment

Alright, this looks ready to go on my end, thanks a lot @aasorokiin!

Do you have anything to add, @tmoreau89?

@tmoreau89 (Contributor)

Thanks @vegaluisjose for reviewing the PR and ensuring that the CI tests go green, much appreciated. While going through the PR, there were a few things that I'd like to address before we merge:

  • There are still quite a few comments that need to be cleaned up.
  • The comments could use a good spell-check pass; there are many typos in some files. Perhaps it could be automated via the preferred IDE.
  • Overall I think this is not blocking the merge, but we could extend the Scala linter to address consistent comment formatting, etc.

@aasorokiin (Contributor, Author)

Is there anything else expected from me? @tmoreau89 @vegaluisjose

@tmoreau89 (Contributor)

Hi @aasorokiin - thanks for the ping and addressing the comments.

@tmoreau89 merged commit 36a9157 into apache:main on Sep 9, 2021
@tmoreau89 (Contributor)

Thanks @vegaluisjose @aasorokiin and Intel team - the PR has been merged.

jinhongyii pushed a commit that referenced this pull request Sep 5, 2023