
[RFC][BYOC] Marvell ML/AI Accelerator Integration #48

Open · wants to merge 3 commits into main from byoc-mrvl
Conversation

@ccjoechou:

  • Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.

We have also posted a pre-RFC at https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691.
Plus, we have up-streamed our POC code changes in PR-9730 (apache/tvm#9730). We have resolved a Mrvl.cmake issue, but we are now waiting for tips from the TVM community in order to make the PR's Jenkins task_rust.sh step pass.

Note 1: we have not spent much time on driver/runtime integration and therefore may be missing changes for rust cargo. We are trying to catch up here.
Note 2: we do run TVM-Jenkinsfile-like builds & tests locally, but we skipped the task_rust.sh script during our local runs.

@ccjoechou (Author):

Hello,

How can we request a reviewer to review our RFC?
Also, we have followed the RFC template and listed several unresolved questions, which need help from reviewers and/or the TVM community to resolve.
The first few issues are related to the TVM Jenkins build's rust/cargo failure. That is, we'd like to add two more build config flags (use_mrvl and use_mrvl_runtime) to the tvm/rust/ setup, but we need help updating the crates.io tvm-build package -- BTW, tvm-build seems to be owned by the OctoML GitHub org. There are also other tvm/tests/scripts/task_rust.sh errors. Can a reviewer provide additional information or pointers explaining how we should debug the task_rust.sh run to find and resolve issues caused by the new changes?

Thanks,

  • Joe

@ssdurako:

@areusch @jroesch
Is there some aging process to follow in TVM open source so issues get looked at and advanced without going stale for weeks? There are timelines that are affected by such delays.

Thanks

@areusch self-requested a review January 18, 2022 19:58
@areusch (Contributor) left a comment:

@ssdurako @ccjoechou apologies for the long delay! i think we missed this one since it was mailed during TVMCon and also just before we all took off for the holidays. I'll try to be a bit better about reviewing this.

Overall I have some understanding of your approach with this RFC. I'd like to further discuss some of the rationale behind:

  • device planning, which I think maybe you're doing outside the typical TVM flow?
  • executor, which I think you may have re-implemented here.

I'm a bit low on bandwidth to read your full PoC PR. would you mind clarifying the RFC as a starting point? (or feel free to provide code links into your PoC if that would help me understand--I can do some targeted reading; I'm just too busy for a full read-through right now)

it would also be great to spell out a plan for tests here--it seems like it might be possible to check in your compiler/simulator into our CI, but could you be more explicit about your plans there?

also cc @comaniac @mbs-octoml @Mousius @junrushao1994 for further comments on BYOC, device planning, and support for custom executors

in Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input meta-data in order to generate final instructions,
in model binary file

* Note: Mrvl-ML/AI backend compiler, which does accelerator-specific optimization and code generation, is not included
Contributor:

what's the test plan for this RFC? Would it be possible to add the Marvell backend compiler and simulator to our ci images and run against it in CI?

Author:

For this BYOC-Marvell RFC, the POC PR codebase only contains code to generate JSON meta files. We have up-streamed our test_mrvl test suite, but it only covers JSON codegen. In our next RFC, we will provide runtime & driver hookups. We are working on a Marvell backend package with the Marvell backend code-gen and a Marvell software simulator, which mimics a cycle-approximate Marvell HW accelerator. This package may become available later for external usage.
Currently, we are having problems running TVM rust/cargo and can't find useful documentation to debug the issues -- plus, tvm-build is owned by OctoML (not the apache/tvm GitHub org, right?)

Contributor:
cc @jroesch can we unblock their rust debugging? @ccjoechou i'm not as familiar with the rust stuff in TVM, but we should transfer ownership of any rust packages to a TVM account. apologies for any oversight there.

ball; and it can be used to read in input file(s) and the model binary to run inference for the Mrvl subgraph

* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or int8 quantization mode. For this RFC, we will
focus only on float16 inference run
Contributor:

just checking if this was the end of the sentence here

Author:

yes

Contributor:

ok--suggest to either add a period or maybe reword as "For this RFC, we will focus only on models that use float16 quantization mode."


* We can get to the following one Mrvl subgraph by applying the default strategy.
* in the mrvl.py file: the compute_two_subgraphs() function of the class MrvlIRGraphUtils is used
to create mod_mrvl_subgraph and mod_non_mrvl_subgraph for
Contributor:

could you clarify this sentence?

Author:

I did not know how to include a figure in the RFC file -- but I did include figures at the end of the corresponding pre-RFC on the discuss forum. Please check the end of the pre-RFC and its figures to see whether they help explain the definition of Marvell sub-graphs here: https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691.

Contributor:

you can add them to assets/ and then link similar to https://github.com/apache/tvm-rfcs/blob/main/rfcs/0050-roadmaps.md (see the Raw source for example how to link it).


```
def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] {
%0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* ty=Tensor[(1, 28, 28, 1), float32] */;
```
Contributor:

above, the RFC discusses having exactly one Marvell and one non-Marvell subgraph, but here I see 8 different function calls. do you mean that there are two targets, and you partition the graph into 8 subgraphs, but each subgraph is assigned to one or the other target? (reading further, I can see this is not the case, but it would help reader comprehension to clarify this example)

Author:

We are talking about different definitions of "(sub-)graphs" here. In the TVM partition pass, TVM's graph or sub-graph is a merge-composite IR function, which can contain a pre-defined pattern of original frontend operators. In this BYOC-Marvell RFC's definition, a sub-graph is a connected graph of Marvell-merge-composite functions. For instance, tvmgen_mrvl_main_4 (see below) is a TVM-partition sub-graph: a Marvell merge-composite function containing the frontend operators conv, add, batchnorm, tuple-get-item, and relu. A Marvell sub-graph, by contrast, contains (in the given test case) several Marvell merge-composite functions.

Contributor:

got it. so just to clarify--you're proposing to fuse these merge-composite IR functions at the Relay level into a single e.g. Relay @main? I think another strategy would be to run a TIR-only pass after scheduling. curious if that may work to accomplish the same goals? the benefit there is that you can also operate, at that time, on the logical TIR buffers.
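A minimal sketch of that suggestion, assuming the standard tvm.tir.transform.prim_func_pass decorator (the pass body is illustrative only, not the Marvell flow):

```python
import tvm

@tvm.tir.transform.prim_func_pass(opt_level=0)
def inspect_logical_buffers(func, mod, ctx):
    # Each PrimFunc parameter maps to a logical TIR buffer whose
    # shape/dtype a backend-specific pass could inspect or rewrite.
    for param, buf in func.buffer_map.items():
        print(param, buf.shape, buf.dtype)
    return func  # returned unchanged here; a real pass would rewrite the body
```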

Author:

We have not spent time on the TIR flow and passes yet - we will.
One quick question: can the TIR buffer and its data layout drive how inputs/outputs of Marvell sub-graphs and LLVM (non-Marvell) sub-graphs are communicated at inference time?

Author:

Saw your other feedback regarding RFC-#10 and we will review the RFC. Thanks.

```
return mod_new
```

* Marvell-specific graph executor codegen: we have defined callbacks and extension functions in the following files:
Contributor:

could you motivate this further? it's hard to understand why you need to output your own JSON format without some explanation here.

Author:

Oh, the email version of this question was linking to a different RFC segment (the light green section above), which led me to answer differently in my in-line reply to your email.
Sorry about that; since I can now see the correct light green block above corresponding to your question, let me reply here again properly.

We are using the TVM graph executor codegen to process the Marvell part of the IR sub-graph generated by the BYOC-Marvell relay sequence, which includes Marvell-specific GraphInputNode object(s) & attributes and Marvell-specific GraphOpNode object(s) & attributes. When processing the Marvell sub-graph in the TVM graph executor codegen, we need to specialize the generation code in order to dump extra Marvell-specific attributes to the nodes-JSON file (also in a more readable format). The original code can't do what we need; hence, we are using derived classes and callback functions in C++ to override the defaults here.

Contributor:

ok got it, i think that makes sense to me. i think the main question i have here is the mechanism by which you guys export the Marvell GraphExecutor sub-graph.

Author:

yes

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

* We follow the TVM BYOC framework to enable BYOC Marvell flow without impacting any TVM core features.
Contributor:

it seems like there has been some impact to the GraphExecutor, and I think one point of confusion here is whether it was necessary to do that or whether you could have handled the additional runtime complexity inside a Marvell-specific runtime.Module. could you explain a bit further here?

Author:

The email version of this question was linking to a different RFC segment (the light green section above), which led me to answer differently in my in-line reply to your email. Sorry about that; let me reply here again properly.

Please check my reply to the previous question (i.e., we need Marvell-specific GraphOpNode and GraphInputNode in order to dump Marvell-specific attributes to the nodes-JSON file).
Also, since our Marvell compiler backend component is built outside the typical TVM flow, can do additional compile-time optimizations, and reads in the graph-executor-generated JSON meta data, we currently don't think using runtime.Module to generate the Marvell-specific JSON meta files of the network is the way to go here.

Author:

@areusch: We may not know enough, but we'd like to learn more about the runtime.Module subject (its flow and how to add specialization). Can you provide an example, or an existing tvm/tests-folder suite, as a pointer for us to run an existing runtime from the tvm repo (without relying on specific USE_ flags being ON), if such an example exists?

* For one Mrvl-BYOC relay transformation pass, we have identified a need to inject a (global) expr node ID for the
RelayExprNode class and its derived classes: Tuple and CallNode, so that during the transformation pass, we can
uniquely identify each Tuple or CallNode object. Again, we need help from TVM community to provide
suggestions/guidelines here in order to know whether this is one of the best ways to achieve the Mrvl-BYOC need.
Contributor:

i think it would help to spell out why you guys need to be able to identify each expression here.

Author:

Yes, but not just us: a data-scientist customer using the TVM flow may want to know, for example, the linkages between the runtime performance numbers (provided by the driver and/or hardware) and the corresponding operators in their frontend model (i.e., each expression the customer knows about).

Contributor:

so is the idea that the exported graph contains the en_id and then someone can trace that back to an annotated Relay program? what's the procedure by which en_id could be used?

Author:

Yes, and I have updated the RFC to include more information regarding exprnode_id and its usages.

community has any better or work-in-progress resolution.

* When using TVM RPC code to exercise and run inference on a remote-hosted Mrvl ML/AI HW accelerator for the Mrvl
subgraph, we ran into one minor issue and have made local TVM RPC enhancement so that, when a TVM RPC client sends
Contributor:

could you explain the nature of the problem that requires the client to know the absolute path?

Author:

First, the TVM RPC server chooses the path for any uploaded file under a random tmp directory (which is good for reducing possible security problems). But in our use case, we'd like the TVM RPC client to send a "runtime" command to the RPC server side to pre-process the just-uploaded file, before the file is consumed autonomously by the RPC server using a pre-defined script. We can't find a way, or a TVM example, that shows how this can be done -- unless the client knows the uploaded file's path on the server.

Contributor:

could you try calling tvm.rpc.server.workpath on the RPC server? https://github.com/apache/tvm/blob/main/python/tvm/rpc/server.py#L62
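A minimal sketch of that suggestion (host, port, and file names are assumptions for illustration): the server registers tvm.rpc.server.workpath as a global function, so a client can look it up remotely to resolve the server-side path of a file it just uploaded.

```python
import os
import tvm.rpc

remote = tvm.rpc.connect("rpc-server-host", 9090)  # assumed host/port
local_path = "model.bin"                           # assumed artifact name
remote.upload(local_path)                          # lands in the server's temp workpath

# Resolve the absolute server-side path of the uploaded file,
# which can then be passed to a server-side pre-processing routine.
workpath = remote.get_function("tvm.rpc.server.workpath")
server_path = workpath(os.path.basename(local_path))
print("uploaded file lives at:", server_path)
```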

Author:

Yes, and we will find time to check.

Author:

BTW, in our use case, we need the server path to be known on the client side so that the client is the master that controls activities running on the server side.

Since this is not directly related to this Mrvl-BYOC PR, we will find time to contribute this enhancement back in another
TVM PR soon.

* In order for us to generate the constants-JSON file, we must “NOT” remove external params, which were stored in
Contributor:

why is this? params passed in MetadataModule are meant for consumption only by the runtime.Module which defines them. it seems like perhaps you need to consume them at the executor level. could you explain that?

Author:

We are using relay to generate JSON meta files representing the given network model in a way our (compiler) backend code can process directly at compile time (e.g., only the Marvell-part sub-graph(s)). If we had included 100% of our backend code in the TVM codebase, we would not need to dump constants into a JSON meta file; but because our backend code is built outside the typical TVM flow and can do other compile-time optimizations, including manipulating constants, we need the constants JSON.

Contributor:

it might be possible for you to do this in TIR, if you're able to leverage tir.constant. you would need to use https://github.com/apache/tvm-rfcs/blob/main/rfcs/0010-target-registered-compiler-flow-customisation.md, so I'm not sure if that's appropriate here.

Author:

We will review RFC-#10 to find out. Thanks.


* Do Marvell-specific layout conversions to transform IR graph in order to meet requirements of the accelerator

* Do Marvell-specific composite-merging/fusing to transform IR graph in order to utilize available HW capability
@mbs-octoml (Contributor), Jan 18, 2022:

Hi, thanks for the RFC. My team at OctoML is looking at bringing some training features to the BYOC world (a la https://arxiv.org/pdf/2111.00655.pdf), so I'm looking at this RFC with that future in mind. Can you expand on:

  • Is the fusion using the existing MergeComposite / AnnotateTarget / MergeCompilerRegions (maybe) / PartitionGraph sequence? (see the sketch after this comment)
  • Other than the global layout xform, which necessarily must be done before any fusion etc., are there any other xforms before the above partitioning takes place?
  • Can you explain the need to limit to one kernel for each of your BYOC and the default TVM? Perhaps it's an artifact of how you're later trying to capture the BYOC output in JSON graph form? Ideally the BYOC target.ext.name function could be run multiple times, the resulting runtime::Module would be accumulated in the IRModule, and the runtime::Modules later merged. Perhaps supporting that would actually be easier and would remove the at-most-one-kernel limit?
  • Ideally there'd be a single entry point for 'partition for Marvell', after which the regular TVM build would deal with fusion, lowering, and codegen for everything that's left (i.e., the overall model minus the kernels you already partitioned out). I may not be following the explanation, but it seems you're proposing the driver splits things more explicitly.
  • Like @areusch I'm a bit confused by the special handling of the graph. Perhaps it would be worth going through the tensorrt BYOC integration as a reference example, since it too collects a JSON representation of the to-be-compiled fused sub-graph (we invoke the TensorRT build function at runtime, not compile time), but it does so on top of existing machinery.

Let me know if it would be easier to discuss this on a PR rather than here, then we could come back to here.
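For readers following along, the sequence referred to in the first bullet is the standard BYOC partitioning pipeline; a minimal sketch (the "mrvl" target name is from this RFC, and the pattern-table lookup is an assumption about how the backend registers its patterns):

```python
from tvm import relay, transform

def partition(mod):
    # Pattern table registered by the BYOC backend (assumed name "mrvl").
    pattern_table = relay.op.contrib.get_pattern_table("mrvl")
    seq = transform.Sequential([
        relay.transform.MergeComposite(pattern_table),  # group ops into composite functions
        relay.transform.AnnotateTarget("mrvl"),         # mark supported regions
        relay.transform.MergeCompilerRegions(),         # merge adjacent regions
        relay.transform.PartitionGraph(),               # split out external functions
    ])
    return seq(mod)
```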

Author:

Let me raise a difference here:

  • The TVM partition's sub-graph seems to represent a relay function, which can include multiple frontend operators captured by utilizing the relay merge-composite pattern.
  • The Marvell sub-graph is a connected graph of multiple relay merge-composite functions. I did not know how to include a figure in the RFC file before (now I do), but we did include figures at the end of the corresponding pre-RFC on the discuss forum -- please check the end of the pre-RFC to see whether its figures help explain the definition of Marvell sub-graphs: https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691.

We have also up-streamed TVM GitHub's PR-9730 as a POC (it can be downloaded via git clone https://github.com/ccjoechou/tvm.git; the changes are on the byoc-mrvl branch). Please see the partition_for_mrvl() function's seq setup in the tvm/python/tvm/relay/op/contrib/mrvl.py file.
There is also the test_mrvl suite, which can be run to generate JSON files for the ssd-resnet50 network.

[Using our definition of sub-graph, not the TVM partition's definition] Yes, the limitation of at most one mrvl sub-graph and at most one llvm sub-graph can be relaxed later, once we have the runtime & driver hookups ready and the driver & firmware of our HW accelerator can handle multiple sub-graphs. We will be spending time on this area in the next few months.

Author:

@mbs-octoml: Thanks for replying. Please also see my comments to @areusch's reply below, including several in-line write-ups, since they may provide information relevant to your questions too. Please let me know if anything can be clarified on the TVM GitHub PR-9730 front.
Currently, we are also running parts of the tvm/Jenkinsfile stages and their steps locally using our own Jenkins server. However, we are having problems debugging a rust/cargo issue (the tvm/scripts/task_rust.sh suite). It would be great if you could provide additional information on how to build our "local" tvm-build package (I can git clone the current OctoML GitHub tvm-build repo) and how we can adjust the tvm/rust/Cargo.toml file to use that "local" tvm-build package.
Also, any tips and pointers regarding how to debug the rust/cargo build would help.
Thanks.
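One standard Cargo mechanism for this (not TVM-specific; the local path below is an assumption) is a [patch.crates-io] section, which redirects a crates.io dependency such as tvm-build to a local checkout:

```toml
# In the workspace Cargo.toml (e.g. tvm/rust/Cargo.toml), assuming a
# local clone of tvm-build sits next to the tvm checkout:
[patch.crates-io]
tvm-build = { path = "../../tvm-build" }
```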

@ccjoechou (Author) commented Jan 19, 2022 via email

@areusch (Contributor) left a comment:

@ccjoechou sorry for the delay, I'll try to be a bit quicker replying here.

I read your PoC PR a bit and I see you guys are indeed following the BYOC flow. I think that part makes sense to me. the main question I have is around enshrining the external JSON in BuildResult. I think it's a little unclear how the "external graph JSON" is consumed in your PoC. Do I understand right that you guys want to pass this to an external compiler? I'm wondering if something like Model Library Format is the right thing here (note though, that it doesn't support BYOC right now...so we would be wanting to adopt an approach here that was easy to port forward to that).

* Currently, in order for the following Marvell classes/functions to identify a Mrvl subgraph and a non-Mrvl
subgraph from the layout-converted, composited/fused IR graph, we are utilizing the unique en_id attribute
Contributor:

ok. i'd like to suggest for clarity's sake we use exprnode_id here.


@ccjoechou (Author):

@areusch: Thanks again for your latest responses. Don't worry about timing (we all have our real jobs to do too).
I am going to update the RFC #48 doc based on much of your feedback in the next few days -- including adding figures by following the example you gave, and adding descriptions to clarify many of the reviewers' good questions.
Once that is done, we will reply to your latest responses individually. For changing the en_id name to exprnode_id, we will update our POC up-streamed TVM PR accordingly first.
Btw, in our RFC flow, we mentioned that we need to call the relay DefuseOps pass to convert merge-composited Relay IR back to Relay IR without composition. We found a bug in the Relay DefuseOps pass and have up-streamed a fix and a test case to TVM (see TVM PR-10069 for details).
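For context, that pass is invoked like any other Relay module pass; a minimal sketch:

```python
from tvm import relay

# Convert a fused/composited module back to a module without composition.
mod = relay.transform.DefuseOps()(mod)
```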

@ccjoechou (Author):

@areusch: Forgot to answer your first question.
Yes, for now, we'd like to generate external JSON files, only for the accelerator sub-graph(s), in the BuildResult step, so that we can pass them to our external, accelerator-specific compiler backend program (not yet in the TVM general flow) to do further ahead-of-time, for-inference optimization and generate more optimized ISA code to run inference for the accelerator sub-graph(s).
For the llvm sub-graphs (let me use "llvm" here to distinguish them from the for-accelerator sub-graphs), we will and do follow the TVM general flow to generate the .so lib files.
When the Model Library Format and flow become stable, and once they can be specialized to include extra external accelerator and memory allocations & communications, we would definitely like to see how we can advance the BYOC-Marvell flow to use them (together with the llvm .so lib files).
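The llvm-sub-graph half of that flow is the standard TVM build path; a minimal sketch (the target string and file name are placeholders):

```python
import tvm
from tvm import relay

# mod/params: the non-accelerator (llvm) part of the partitioned model.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)

lib.export_library("model_llvm_part.so")  # the .so lib file mentioned above
```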

@areusch (Contributor) commented Feb 1, 2022:

@ccjoechou great thanks! i'll take another look here when you're done with the updates.

Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
@ccjoechou (Author):

Hi @areusch:
I have updated RFC #48 based on most of your feedback. A new RFC md file and 4 new figures have been uploaded.
Please take a look again (in the document view mode I can see the new figures displayed inside the RFC).
We are taking a couple of your TIR-related suggestions and will start reviewing RFC-0010 and the TVM TIR files.
For our TVM PR-9730 POC code, I have renamed en_id to exprnode_id in our changes and will update that PR soon.

@areusch (Contributor) commented Feb 8, 2022:

@ccjoechou hey I think you may have had a bad merge--I see a bunch of unrelated RFCs listed as changed underneath "Files changed." Could you take a look and rebase/re-merge?

@ccjoechou (Author):

> @ccjoechou hey I think you may have had a bad merge--I see a bunch of unrelated RFCs listed as changed underneath "Files changed." Could you take a look and rebase/re-merge?

Let me check.

@ccjoechou closed this Feb 8, 2022
@ccjoechou reopened this Feb 8, 2022
@ccjoechou (Author):

@areusch: You are correct; I must have done something wrong last time, and now I have lost the linkage between my GitHub-forked byoc-mrvl branch and this tvm-rfcs PR #48. Therefore, the changes I made a week ago are still on my personal GitHub-forked byoc-mrvl branch and did not get pushed to the tvm-rfcs PR #48.

Any suggestions for what to do?
I can see my byoc-mrvl branch can be merged automatically into the tvm-rfcs PR #48, but I do not see any button to click to make that happen.

[screenshot: merge status]

@ccjoechou force-pushed the byoc-mrvl branch 2 times, most recently from 371b5f8 to a28247e on February 8, 2022 22:11
Joe (Chien-Chun) Chou and others added 2 commits February 8, 2022 14:18
Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to bring the TVM ecosystem to Marvell customers.
* [RFC][Runtime] Bring `PackedFunc` into TVM Object System

* Apply suggestions from code review

Fix a typo

Co-authored-by: Xiyou Zhou <xiyou.zhou@gmail.com>

* Apply suggestions from RFC review

Co-authored-by: Xiyou Zhou <xiyou.zhou@gmail.com>
@ccjoechou (Author):

@areusch:
Hello, I believe I have cleaned up the PR's commits now (reset a few commits and re-added the changes).
Sorry about that.
Can you take a look again (my latest changes are in this commit: 8a7fd01)?
Thanks.

@areusch (Contributor) commented Feb 15, 2022:

@ccjoechou sorry for the delay--i've gotten pretty busy with something and will hopefully have some bandwidth towards the end of the week.

cc @jroesch @mbs-octoml in case they have cycles

@ccjoechou (Author):

@areusch: No worries. I saw lots of TVM emails coming from you and others working on other important things. We will wait for your feedback.

@areusch (Contributor) left a comment:

@ccjoechou I took another look. I think I'm still hung up on how you want to use partition_for_mrvl--are you trying to develop a separate compilation flow, or reuse the TVM compilation flow? part of the missing piece is that I think the PoC doesn't demonstrate how the compilation artifacts are passed down to the runtime: https://github.com/apache/tvm/pull/9730/files#diff-7561f47a7ba9997bdd34dec8311c0070985abc9703ced5ce52087fb0639c79fdR206

i can't really tell how this is done from the RFC either.

I would generally prefer if we could avoid subclassing pieces of the core compilation flow. this makes it harder for us to maintain the core compilation flow. I think it'd be helpful to discuss why you guys need the additional attributes in graph.json and see whether we could augment our existing graph.json with those attributes, rather than building a parallel flow. I'd also still like to understand more about your partitioning scheme.

would it help to do something more high-bandwidth, either over discord or zoom?


To exercise the TVM-BYOC-Marvell flow, we have provided a tests/python/contrib/test_mrvl folder with test_mrvl_codegen.py and infrastructure.py files to show how to exercise the TVM-BYOC-Marvell flow for a pre-trained SSD-ResNet50 model. In addition, Marvell is also planning to provide the Marvell backend compiler (mrvl-tvmircomp) and the Marvell HW accelerator software simulator (mlModel), so that they can be used to read in JSON files generated by the TVM-BYOC-Marvell flow and run inference to get results.

In the uploaded apache/tvm PR-9730 branch,
Contributor:

could you finish this sentence or rm?

```
...
```

First, we can download a pre-trained SSD-ResNet50 model from the MXNet-gluoncv site; then, call the mrvl.partition_for_mrvl() function to trigger the TVM-BYOC-Marvell flow; and finally, call the relay.build() function and the mrvl.dump_json_meta_data_files() function to generate a pair of JSON files for each Marvell sub-graph identified by the TVM-BYOC-Marvell flow.
Contributor:

suggest to use a numbered list:

Suggested change -- replace:

First, we can download a pre-trained SSD-ResNet50 model from the MXNet-gluoncv site; then, call the mrvl.partition_for_mrvl() function to trigger the TVM-BYOC-Marvell flow; and finally, call relay.build() function and mrvl.dump_json_meta_data_files() function to generate a pair of JSON files for each Marvell sub-graph identified by the TVM-BYOC-Marvell flow.

with:

The above code snippet does the following:
1. Download a pre-trained SSD-ResNet50 model from the MXNet-gluoncv site
2. Call the `mrvl.partition_for_mrvl()` function to partition the graph into Marvell and non-Marvell pieces and trigger the TVM-BYOC-Marvell flow
3. Call the relay.build() function and the mrvl.dump_json_meta_data_files() function to generate a pair of JSON files for each Marvell sub-graph identified by the TVM-BYOC-Marvell flow.

```
) = keras.datasets.fashion_mnist.load_data()
```

In the code snippet below, we call onnx.load() and relay.frontend.from_onnx() to generate the TVM mod and params. Then, they are used by the mrvl.partition_for_mrvl() function and the mrvl.dump_json_meta_data_files() function provided by the TVM-BYOC-Marvell flow to generate the Nodes-JSON file (nodes_json_filename) and the Constants-JSON file (consts_json_filename).
Contributor:

in the PoC PR, partition_for_mrvl is registered in python/tvm/driver/tvmc/composite_target.py along with the other BYOC partitioners, but its signature differs significantly (from the de-facto partition_func(IRModule) -> IRModule):

    """Partition the graph greedily offloading supported
    operators to Mrvl

    Parameters
    ----------
    mod : Module
        The module to run passes on.
    params : Optional[Dict[str, NDArray]]
        Constant input parameters.

    Returns
    -------
    mod_mrvl : annotated and partitioned module - part 1, the mrvl sub graph
    mod_other : annotated and partitioned module - part 2, if any, the rest sub graph
    params : TBA
    opt_level : TBA
    disabled_pass_list : TBA
    mod : TBA
    mrvl_layers_in_mrvl_subgraph : TBA
    """

what's your intention here? in order to register this function in REGISTERED_CODEGEN, you'll need to make that signature match up. however, i think from my reading, what's happening here is you're invoking a fair bit of the compilation pipeline underneath a hard-coded PassContext, then returning a fair bit of extra information here. some of this information looks fairly specific to the Marvell lowering flow.
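For comparison, the de-facto contract expected by the partitioners registered in composite_target.py looks roughly like this (a sketch; the body is illustrative only):

```python
import tvm
from tvm.relay.build_module import bind_params_by_name

def partition_for_mrvl(mod, params=None, **opts):
    """De-facto BYOC partitioner shape: IRModule in, IRModule out."""
    if params:
        mod["main"] = bind_params_by_name(mod["main"], params)
    # ... run the MergeComposite / AnnotateTarget / MergeCompilerRegions /
    # PartitionGraph sequence here and return the partitioned module ...
    return mod
```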

@ccjoechou (Author):

@areusch:
Yes, using one or two Zoom sessions to go over some questions of yours and some of ours will definitely be very helpful. That way we can sync up quicker.
I am open tomorrow (2/25) as well as Monday (2/28), in the afternoons from 1:30 pm to 6 pm Pacific time.
Will you be available?
My company email address is: cchou1@marvell.com.
Please feel free to schedule a Zoom meeting using my email.

@areusch (Contributor) commented Mar 2, 2022:

@ccjoechou Summarizing our discussion a bit:

  • Marvell is interested in being able to arbitrarily partition a Relay graph into hardware-accelerated and non-hardware-accelerated parts

    • The boundaries between these parts are to be determined by Marvell backend; therefore, some additional control is needed over the default behavior provided by MergeComposite
    • @mbs-octoml suggests that they use the StopFusion annotation to manually enforce the boundaries. These annotations could be added programmatically via a Relay IRModule pass. StopFusion is used in the FuseOps pass to avoid fusion (see the sketch at the end of this summary).
    • Using this approach, the Marvell partitioning pass defined here could be simplified and the existing fusion pass could be used.
  • Marvell needs to be able to determine which:

    • Imported ONNX operator is responsible for a given Relay node
    • Relay node is responsible for a TIR CallNode

    This needs to happen at two times:

    1. At compile time, to serve as a reference to the boundary nodes between a hardware-accelerated and non-hardware-accelerated subgraph
    2. At runtime, to determine which backend operator to call

    A follow-up question here from me: at runtime, couldn’t you just emit the call to the correct backend operator? I wonder if the reason this mapping was needed was due to previous difficulties configuring the TVM partitioner (it would sometimes fuse across a desired boundary). Is it possible to avoid the need for this reference at runtime given the improved partitioning approach mentioned above?

    That doesn't solve the problem of needing to identify a Relay node at compile time. However, if we can reduce this need to a purely compile-time need, perhaps we can develop an alternate way to refer to a Relay node given Relay source code other than adding an id to the IR. cc @tqchen @junrushao1994 in case they have ideas here.

    • Marvell proposes to add a Relay attribute exprnode_id and export this from the compiled artifact to identify the relay nodes which are fused into a particular subgraph
    • More broadly, source maps (e.g. mapping TIR to Relay to frontend operator) would help here.
  • Right now the RFC proposes to create a new GraphExecutorCodegen. It might not be necessary to do this if we could export the exprnode_id for Relay operators passed to BYOC. A suggestion is to create a Marvell-specific runtime::Module modeled after CUDAModule which contains several distinct pieces of generated code. The exprnode_ids could be kept separate from any binary instructions if encoded this way. This pattern is common amongst GPU-offloaded runtime::Module.

    • Additionally, note the SaveToFile override which is invoked when Module.save() is called from Python. This can allow you walk the runtime::Module tree from Python and collect the various exprnode_ids into a single e.g. JSON blob.
  • @jroesch to comment on rust CI failures

  • Marvell would like to contribute a simulator which can run in TVM CI to test their accelerator. We discussed either adding the sim to ci-cpu or a new ci-marvell, the method to do this, and limitations of TVM CI.

  • Marvell runs a patched version of the TVM CI internally. A primary reason why patching is needed is because many tests in the TVM CI require an internet connection to e.g. download models, but their CI is run in a sandbox. It would be particularly helpful to mark such tests e.g. via pytest.mark in order to make these easy to skip. We also discussed pre-populating the download_testdata cache and patching pytest.skip into download_testdata on their internal fork. cc @leandron @driazati @konturn for visibility and in case they have ideas here.
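A minimal sketch of the StopFusion suggestion above, using the existing relay.annotation.stop_fusion operator (where to place the annotation is up to the Marvell boundary logic; the tiny graph below is illustrative):

```python
import tvm
from tvm import relay

x = relay.var("x", shape=(1, 16))
y = relay.nn.relu(x)
y = relay.annotation.stop_fusion(y)  # FuseOps will not fuse across this point
z = relay.nn.relu(y)
mod = tvm.IRModule.from_expr(relay.Function([x], z))
```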

@areusch added the "status: need update (RFC needs update based on feedback)" label Mar 24, 2022