
[RFC] TVM Unified Static Memory Planning #9

Merged
merged 8 commits into apache:main on Oct 4, 2021

Conversation

Contributor

@manupak manupak commented Jul 6, 2021

This commit adds the RFC (.md) for USMP.

pre-RFC on discuss : https://discuss.tvm.apache.org/t/rfc-unified-static-memory-planning/10099

This commits adds the RFC (.md) for USMP
Contributor Author

manupak commented Jul 6, 2021

cc : @areusch @mbaret @tqchen
(I can't seem to tag the original people in the RFC -- working on it)

*Updating the RFC with PR number
Contributor

@tkonolige tkonolige left a comment

I'm in favor of doing memory planning, so thank you for this PR. I've got a couple questions:

  1. Having memory planning would be useful across all executors (graph, vm, aot). What is preventing us from doing USMP for all the executors? It seems like the main problem with non-aot executors is that we do not have a full view of all the buffers.
  2. Can you include a drawbacks section?

Special Parametric Inputs :
* function : The algorithm to be used for planning. From a component PoV, the algorithm is a special input with a defined interface.

The current proposal for the interface is as follows :
Contributor

What is this interface for? The planning algorithm?

Contributor Author

Yes, I'll add it to the text tomorrow :)

--executor=aot
--output-format=mlf
--target=accel,c
--with-workspace-buffer= "name=sram;target=c,accel"
Contributor

Are buffers part of the target specification with this?

Contributor Author

@manupak manupak Jul 6, 2021

These buffers should ideally be compiler flags (and the same goes for flags such as --executor and --runtime). However, we are currently using the host target to hold such flags. Depending on the progress of the activity of taking compiler flags out of the host target, where these flags end up will change :)
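For illustration only, a minimal sketch (plain Python, not TVMC code) of parsing a pool specification in the form shown above; the function name and result keys are assumptions:

```python
# Minimal sketch, not TVMC code: parse a pool spec like "name=sram;target=c,accel".
def parse_pool_spec(spec: str) -> dict:
    pool = {}
    for item in spec.split(";"):
        key, _, value = item.partition("=")
        key, value = key.strip(), value.strip()
        if key == "target":
            # a pool may be accessible from several targets
            pool["targets"] = [t.strip() for t in value.split(",")]
        else:
            pool[key] = value
    return pool

assert parse_pool_spec("name=sram;target=c,accel") == {"name": "sram", "targets": ["c", "accel"]}
```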

Contributor Author

manupak commented Jul 6, 2021

Thanks @tkonolige for taking a look at this.

As per the pre-RFC discussion,
there is nothing preventing us from integrating the unified static memory planner (apart from doing the actual work for it, of course :) ) for the graph executor as long as the Relay "main" function is lowered to TIR before creating the JSON (or maybe not creating the JSON at all -- see the discuss post conversation). Thus, the design does not block such integration.

For the VM, I'm not sure how useful "static" memory planning would be.

Drawbacks -- the Relay "main" function has to be lowered to TIR to use USMP. (I'll add a section.)

* addressing tristan's comments.

Change-Id: Ieb64ae6fc1de12374836c7f754a70b735fe5d379
*Addressing further tristan's comments

Change-Id: I5eabfda362fa85fa4c377d20043f938ffc6de456
```
Array<BufferInfo> (*foo)(Array<BufferInfo> buffers, Map<String, Integer> pool_sizes)
```
The memory planning algorithm is expected to populate the assigned pool_name with the offset and return the updated array of BufferInfo objects. Additionally, the second argument provides size constraints for each pool (if any).
Contributor

Suggested change
The memory planning algorithm is expected to populate the assigned pool_name with the offset and return the updated array of BufferInfo objects. Additionally, the second argument provides size constraints for each pool (if any).
The memory planning algorithm is expected to populate pool_name and pool_offset and return the updated array of BufferInfo objects. Additionally, the second argument provides size constraints for each pool (if any).

Integer size_bytes;
Integer alignment;
Array<BufferInfo> conflicts; //the conflicting bufferinfo objs
Array<Integer> pool_candidates;`
Contributor

Does the memory planning algorithm have to assign each buffer to a pool in pool_candidates? Is this information not used after memory planning?

Contributor Author

@manupak manupak Jul 12, 2021

As of now, once they are assigned with pool_name and offset, pool_candidates may not be used -- except for debugging purposes.

Correction : Array<Integer> pool_candidates --> Array<String> pool_candidates.
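For illustration, a hedged sketch (plain Python rather than the proposed C++ types) of a naive planner that satisfies this interface: it honours the conflicts field and picks the first candidate pool whose size limit is not exceeded. The field names mirror the BufferInfo struct quoted above; the greedy policy itself is only an assumption, not the algorithm the RFC mandates.

```python
# Illustrative sketch only -- not the TVM USMP implementation.
# BufferInfo fields mirror the struct quoted above; the greedy policy is an assumption.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class BufferInfo:
    name: str
    size_bytes: int
    alignment: int
    conflicts: List["BufferInfo"] = field(default_factory=list)
    pool_candidates: List[str] = field(default_factory=list)
    pool_name: Optional[str] = None    # populated by the planner
    pool_offset: Optional[int] = None  # populated by the planner

def _align(offset: int, alignment: int) -> int:
    return (offset + alignment - 1) // alignment * alignment

def greedy_plan(buffers: List[BufferInfo], pool_sizes: Dict[str, int]) -> List[BufferInfo]:
    """Place each buffer after every already-placed conflicting buffer in the
    first candidate pool where it still fits under the pool's size limit."""
    for buf in sorted(buffers, key=lambda b: b.size_bytes, reverse=True):
        for pool in buf.pool_candidates:
            offset = 0
            for other in buf.conflicts:
                if other.pool_name == pool and other.pool_offset is not None:
                    offset = max(offset, other.pool_offset + other.size_bytes)
            offset = _align(offset, buf.alignment)
            if offset + buf.size_bytes <= pool_sizes.get(pool, float("inf")):
                buf.pool_name, buf.pool_offset = pool, offset
                break
    return buffers
```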

Contributor

it does seem like it would be useful to split the pre-memory-planning data structure apart from the post-memory planning data structure. perhaps:

  • an attribute "memory_constraints" is expected on nodes as input to memory planning
  • an attribute "memory_placement" is expected on nodes as an output

one concern with this approach is it may be slower if we need to copy all nodes involved. cc @jroesch for opinions on how to handle this sort of thing in the new TE-compiler world.


The current proposal for the interface of the memory planning algorithm is as follows :
```
struct BufferInfo {

Wouldn't it make sense to have a similar data structure for pools, as well?
something like :

    struct PoolInfo {
        String pool_name; 
        Integer size_bytes;
        Integer alignment;
        Integer pool_offset
    }

In our current understanding, a pool represents a physical memory, e.g. SRAM, DRAM, flash, etc., and therefore also has architectural restrictions, such as size, alignment and possibly an offset in the global memory space.

Contributor Author

It makes sense and aligns with our thinking as well.

This is something that will be passed by the user through the TVMC interface. However, TVM currently lacks support for holding compiler(-wide) flags, so we currently end up posting them as target_host attrs. During a run of the compiler these values would be global constants, therefore we are discussing (internally) the best way to represent them generically and extensibly as more such compiler flags (e.g. --executor is another example here) are added (e.g., possibly as some sort of AttrRegistry), and to hold them during a compilation run.

Another reason for not defining them here is that TVM's schedulers may want access to them in the future.

If you have any plans/ideas to this effect, we'd love to hear them :)

cc : @leandron @Mousius

Contributor

cc @jroesch regarding compiler input and output interfaces.

Member

@jroesch jroesch Jul 28, 2021

I am trying to write down some guidelines on where to store these things, will hopefully send upstream in conjunction with another RFC. My 2c is that we should add Attributes to the IRModule to also use as a per compilation place to store global data instead of keying them off target. Is there any reason this wouldn't work for you? I think we should move in this direction so all the useful state for compilation is readable directly from the compilation unit.

Contributor

@electriclilies electriclilies Aug 11, 2021

I agree with @jroesch on this, I think that it would be great to try to store the information in the IRModule attrs. This would also be consistent with some of the cleanups we are currently doing -- I am starting to try to push the info stored in the TargetMap data structure into function attributes.

Contributor Author

@manupak manupak Aug 17, 2021

Hi @jroesch @electriclilies,

Yes, that sounds like a good idea.

The only concern is that these are constant inputs to the compilation flow -- they are not supposed to get mutated in passes.
But yes, it could still work, and the passes will not need anything more than the IRModule -- that seems attractive. Let me know if you have any more thoughts around this.

If we all agree, yes, we can introduce the 'PoolInfo' (cc : @MichaelJKlaiberBosch) structure here and add it as an Attribute to the IRModule to be passed throughout the compilation flow.

}
```
```
Array<BufferInfo> (*foo)(Array<BufferInfo> buffers, Map<String, Integer> pool_sizes)

In our opinion the memory planner should be able to change the call graph order to any possible topological sort.
It is then the memory planner's job to ensure that the buffer offsets correspond to the liveness of the buffer dependencies.
To reflect this in the memory planner interface it might make sense to add the call_graph:

Array<BufferInfo>, Map<CallGraphNode, CallGraphNode>  (*foo)
(
 Array<BufferInfo> buffers,
 Map<String, Integer> pool_sizes,
 Array<PoolInfo> pools,
 Map<CallGraphNode, CallGraphNode> call_graph
)

The return value Map<CallGraphNode, CallGraphNode> defines the call graph altered by the memory planner. This leaves room for:

  1. Enforcing a fixed order: e.g. Node1 -> Node2 -> … -> NodeN OR
  2. Returning an altered call graph with a certain degree of freedom that is lowered. Example for such a graph
Node1 -> Node2 -> Node3--> ... -->NodeN
   |                                 ^
   |------->Node4----->Node5---------|

Contributor Author

@manupak manupak Jul 19, 2021

One of the assumptions is that before USMP is invoked, the scheduling of the operators is already decided.

However, we do understand the requirement.
To that end, this is something we refer to as memory-aware operator scheduling that should be done prior to USMP.

In most cases, this could be done using 'memory pressure': a Relay pass could use let-bindings to put the program into the A-normal form that results in minimum 'memory pressure'. For advanced situations (in other words, where the compile time is tolerable), that Relay pass could call into the above memory planning interface to retrieve actual allocations.

The reasoning is that mixing scheduling and allocation can easily result in higher time complexity. However, for exceptional cases, schedulers are free to be implemented (in a compile-time-insensitive way) to determine the order by running a complex allocation algorithm in the loop. Therefore, we feel this interface is general enough to support both use cases.
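As a toy illustration of the 'memory pressure' idea (not part of the RFC, and using a hypothetical graph encoding rather than Relay), one can score an operator order by the peak sum of live intermediate-tensor sizes and pick the topological order that minimises it:

```python
# Toy illustration of "memory pressure": peak sum of live tensor sizes for a given
# operator order. The graph encoding is hypothetical, not a Relay/TVM structure.
from itertools import permutations

def peak_memory(order, produces, consumes):
    """order: op names; produces: op -> (tensor, size_bytes); consumes: op -> tensors read."""
    last_use = {}
    for step, op in enumerate(order):
        for t in consumes.get(op, []):
            last_use[t] = step
    live, peak = {}, 0
    for step, op in enumerate(order):
        tensor, size = produces[op]
        live[tensor] = size
        peak = max(peak, sum(live.values()))
        for t in [t for t in live if last_use.get(t, -1) == step]:
            del live[t]  # free tensors whose last consumer has just run
    return peak

def best_order(ops, deps, produces, consumes):
    """Brute-force over topological orders (only sensible for tiny graphs)."""
    def topological(order):
        pos = {op: i for i, op in enumerate(order)}
        return all(pos[a] < pos[b] for a, b in deps)
    orders = [list(p) for p in permutations(ops) if topological(list(p))]
    return min(orders, key=lambda o: peak_memory(o, produces, consumes))
```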

Contributor

we discussed this at the µTVM meetup. The current thinking is what @manupa-arm said above--scheduling operators is considered to be separate from memory planning. However, it is entirely reasonable for an upstream scheduler to invoke parts of the memory planner to determine whether reordering the schedule could be advantageous later on. @jroesch also has plans to enable an upstream pass to run an initial "return value" through a partial set of the downstream passes and examine the result of this. For instance, an upstream scheduler could produce an initial schedule and then invoke the remainder of the passes through Memory Planning. Then, it could examine the result of Memory Planning, potentially do this a few times, and choose the best solution as the true "return value" from scheduling.

@MichaelJKlaiberBosch let us know if this seems like a poor compromise to you guys. We certainly don't have to stick with this approach if there are more advantages to another.

Member

I am working on a set of RFCs which distill all the pain I felt removing the compile engine. I think users should choose ordering before we plan, entangling passes with different concerns creates some of the problems which we are feeling today, for example the current static memory planner invokes device planning directly and there is no (currently) easy way to split them into pieces.

If you want to adjust the scheduling you should write a pass to plan, then analyze the plan, rewrite the program and invoke planning again instead of baking these details into the planner. The one shot mentality has led to a lot of super fragile code that only works under a single phase ordering for a single executor.


Thanks for the detailed answers.
@areusch @manupa-arm
In general we are okay with separating the scheduling and memory planning problems. From our point of view there could be cases where letting the memory planner change the order of the call nodes is advantageous. This, however, depends a lot on the topology of future neural networks. For simple feed-forward NNs this might not be required; for more complicated NNs, e.g. as a result of NAS, this might be more relevant.

Contributor

@areusch areusch left a comment

@manupa-arm apologies for the delay and thanks for bringing this RFC to the table! I've left some initial comments/questions.

Currently, given a ML model primarily TVM will generate two main artifacts :

* A1 : Description of the sequential execution of operators :
1. If the "executor" is "graph", this would be a JSON
Contributor

specify "executor" here (it's in metadata.json in MLF)

Contributor Author

Sorry, I'm not sure that I follow. Do you mean 'define' or 'specify'?

Contributor

sorry i mean that by "executor" here, you mean the value of the "executor" key in MLF metadata.json. However, that's a micro-specific (for now) artifact. To make the RFC easier to follow, suggest either including a pointer so that a reader could determine this, or just saying something less specific like "If building for the graph executor."


Currently, given a ML model primarily TVM will generate two main artifacts :

* A1 : Description of the sequential execution of operators :
Contributor

i call this executor configuration

Contributor Author

Done


Yes, there is.

For A1, the inter-(fused) operator tensors are visible in the "main" relay function. Thus, there exists currently a Relay level pass known as "GraphPlanMemory" that works on the Relay IR to share the space used by tensors which are not live simultaneously and are visible between (fused) operators . Currently, the said pass will use Shared Memory Buffer Object memory planning scheme (See https://blog.tensorflow.org/2020/10/optimizing-tensorflow-lite-runtime.html) to perform the planning.
Contributor

nit: remove "Thus," because there's no deductive reason why it should exist

Contributor Author

Done

}
```

The above TIR snippet shows that two intra operator buffers PaddedInput, DepthwiseConv2d is not visible to Relay Graph Plan Memory to be shared.
Contributor

nit: "are not visible for optimization by the Relay-level GraphPlanMemory approach."

Contributor Author

Done

* src/tir/usmp/usmp.cc -- this is main intergration of USMP that exposes the full TIR --> TIR transformation as described.
* tests/python/unittest/test_tir_usmp_*.py -- this where unittests for each of the passes and pass pipeline for USMP as a component will live.

NOTE 1: All the above passes will have a mirror in the python.
Contributor

could you define "mirror" here? also s/the python/python//

Contributor Author

Done




G1. There would be no TVMBackendAlloc(/Free)Workspace calls generated for tir.allocates that could be evaluated at compile time.

Currently, the TVM codegen and the AoT executor relies on TVMB(A/F)W calls to increment/decrement a pointer of user provided workspace buffer. By the end of this set of work, if the backend uses Unified Static Memory Planning, there should not be TVMB(A/F)W calls rather correct offset in to the user provided buffer should be codegen'd for allocates that could be evaluated at compile time. The dynamically sized allocates will remain untouched, thus will be lowered as usual.
Contributor

could you specify how you'll determine whether a TVMBAW can be eliminated? i think it should roughly be: can the size argument be evaluated at compile-time?
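For what it's worth, a simplified sketch of that criterion (plain Python values standing in for TIR extents; not the actual pass):

```python
# Sketch only: a tir.allocate can be planned statically (and its
# TVMBackendAllocWorkspace call removed) only if its size folds to a
# compile-time constant. Plain ints stand in for constant TIR extents here.
def is_statically_plannable(extents) -> bool:
    return all(isinstance(e, int) for e in extents)

assert is_statically_plannable([1, 56, 56, 64])        # planned by USMP
assert not is_statically_plannable([1, "n", 56, 64])   # dynamic; lowered as usual
```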

Contributor Author

Done

* addressed comments

Change-Id: I12fa85e5ea10eee328be4c5d51c9a481a90dedb5
Contributor

@tkonolige tkonolige left a comment

Thank you for answering my questions @manupa-arm

Contributor

@electriclilies electriclilies left a comment

Overall, looks good to me, found a few typos! Thanks for the detailed write up.


# Motivation

For embedded use-cases, its widely accepted that aggressive memory optimizations are vital. Intially we are looking at enable memory planning for embedded use-cases using the AoT executor.
Contributor

There's a typo here -- "enable" should be "enabling"!


The above TIR snippet shows that two intra operator buffers PaddedInput, DepthwiseConv2d are not visible for optimization by the Relay-level GraphPlanMemory approach.

* Assumption of local optimization : performing sharing inside the operator first and sub-subsequently sharing that workspace with inter-operator tensors, would be sub-optimal.
Contributor

another typo, sub-subsequently -> subsequently


G3. Multiple pool support (including constants)

Ideally, the user would expect to provide these buffers in the granularity of the memories they'd want to pin them to. E.g., if there are two RW memories : DRAM and SRAM, the buffers need to be identified and pooled by the compiler. Similiarly, for constant data, we need to have a mechanism to allow user to pin them to appropriate memories and addresses in the IR would simply be offsets into the constant buffer(s) provided by the user
Contributor

Can you split this into two sentences? "we need to have a mechanism to allow user to pin them to appropriate memories and addresses in the IR would simply be offsets into the constant buffer(s) provided by the user" -> "we need to have a mechanism to allow users to pin them to appropriate memories and addresses. In the IR, they would simply be offsets into the constant buffer(s) provided by the user"




##### Step 3 : Use the updated Map<tir::Var, BufferInfo> to generate Array<BufferInfo>, Map<String, Integer> pool_sizes

##### Step 4 : Call the provided/default algorithm (void (*foo)(Array<ByfferInfo> buffers, Map<String, Integer> pool_sizes) to populate pool_id and pool_offset.
Contributor

another typo: ByfferInfo -> BufferInfo

Contributor

@csullivan csullivan left a comment

I'm a bit worried that if we try to unify a static memory planner before unifying our lowering flow, we widen the division between Graph, VM, and AoT.

This should be a IRModule (TIR) → IRModule (TIR) pass.

Inputs :
* AoT TIR PrimFunc ( the control function describing the call graph to operators)
Contributor

In the unified lowering flow we may choose to introduce explicit storage and tensor allocations into the relay AST, and then hoist memory planning out of the executors into the unified lowering flow. We can then annotate the device/storage scope/memory plan directly on calls so we can re-use across transformations. Currently once memory planning is run, we can no longer transform the program again limiting the ability to customize per target.

With customization in the lowering flow, executors such as AoT can allow per target planning, and then generate code by matching allocation nodes in the relay AST and lowering them directly to TIR when building the main function.

Unified lowering could allow:

[IRModule(relay)] 
-> transforms (incl. mem. planning) -> [IRModule(relay)] 
-> LowerTE (op scheduling) -> [IRModule(relay)+IRModule(TIR)] 
-> transforms (incl. target specific mem. planning) -> [IRModule(relay)+IRModule(TIR)] 
-> ExecutorLowering

with transforms including the ability to customize the transformations applied for different hardware targets. Moreover, transformations are applied at each stage of lowering and one or more of those transforms may include memory planning.

If we push forward with this RFC without co-designing it with unified lower, my concern is that we continue to kick the problem down the road and it will make unifying the lowering and planning for all executors more difficult.

Contributor Author

@manupak manupak Aug 12, 2021

Hi Chris,

In the unified lowering flow we may choose to introduce explicit storage and tensor allocations into the relay AST, and then hoist memory planning out of the executors into the unified lowering flow.

[IRModule(relay)]
-> transforms (incl. mem. planning) -> [IRModule(relay)]

I dont see much value in doing memory planning at this level as the intra-operator allocate nodes are not visible. Can you elaborate ?

With customization in the lowering flow, executors such as AoT can allow per target planning, and then generate code by matching allocation nodes in the relay AST and lowering them directly to TIR when building the main function.

Can you explain what you mean by allocation nodes in the Relay AST?
In our view, relay is a pure functional language that is designed as an IR to represent operator-level info. It feels wrong to do memory planning at this level. @mbrookhart

I guess I'm more interested in knowing why graph and AoT both cannot lower the main function to TIR before performing executor-specific lowering (that is, generating the JSON or the main function).

A side point to this is -- as we have discussed in the discuss post -- "I think going forward graph executor might be able to load a packed function of the tvm_main instead of json – it’ll be less confusing as how the graph executor runtime is positioned as of today which is more of a (a very thin – as its supposed to be :) ) middleware that connect the graph json and the compiled operator library".

Having said that, I can see this work enabling a path (to extend) towards that – though we only plan to create USMP component that is a TIR IRModule → TIR IRModule which we initially test and support for the AoT executor."

https://discuss.tvm.apache.org/t/rfc-unified-static-memory-planning/10099/2

with transforms including the ability to customize the transformations applied for different hardware targets. Moreover, transformations are applied at each stage of lowering and one or more of those transforms may include memory planning.

I dont see why this cannot happen if we have the full program in TIR.

If we push forward with this RFC without co-designing it with unified lower, my concern is that we continue to kick the problem down the road and it will make unifying the lowering and planning for all executors more difficult.

The existence of different executors are for different use-cases, therefore there is only so much unification we could do. IMO, we should only have two executors AoT and JIT (VM), conceptually (that is not to say graph executor is not important). Moreover, I dont feel static memory planning is applicable to the latter. I'd view graph executor as an application that support RPC (and other additional features such launch parallel for loops, loading packed binary runtime modules, etc) for AoT (Graph = AoT++ ?) that could be used in a tuning process (Refer to the earlier comment on the importance of using a main function instead of JSON).

I'd like to hear why designing a static memory planner that works where the full program is expressed in TIR creates a divergence at least between graph and AoT.

@jroesch @areusch @tqchen @mbaret

Contributor

In our view, relay is a pure functional language that is designed as an IR to represent operator-level info. It feels wrong to do memory planning at this level.

I believe @jroesch's RFC on unified lowering can help clarify this point some when it lands soon. My take is that overall we would like to move to a world in which the IR is progressively lowered from initially an operator only representation to something closer to hardware, including to TIR for AoT, with target customizable transformation available at every progressively lowered step.

I dont see much value in doing memory planning at this level as the intra-operator allocate nodes are not visible. Can you elaborate ?

Partial memory planning can be useful at this level. Let's take the constrained resource (SRAM) scheduling as an example, where SRAM is large enough to hold some full weight tensors. Assume weight pinning in SRAM is a special case of prefetching, and consider scheduling prefetch copy nodes that move weight tensors from DDR to SRAM storage. Depending on how many weights are pinned from the beginning and prefetched at various points in the topologically ordered execution, the amount of SRAM available for intra-operator scratch will change. In unified lowering this SRAM "stack" size could be provided to the TECompiler when doing subgraph scheduling. If this subgraph contains, for example, multiple convolutions and a striping technique is employed, the available SRAM scratch can be used as a constraint when an autoscheduler is searching the schedule space of this complicated subgraph. After scheduling another layer of transformations can occur which can include full / unified memory planning of inter- and intra- op storage.

The existence of different executors are for different use-cases, therefore there is only so much unification we could do. IMO, we should only have two executors AoT and JIT (VM), conceptually (that is not to say graph executor is not important).

We can consider a case in which everything is unified behind a MemoryPlan->VM->AoT lowering flow in which the AoT TIR main is generated from the relax VM representation. In this case the resulting AoT TIR also supports dynamism. But even before this we could stage the effort and make early strides towards unifying the lowering flows such that different executors utilize the same common planning infrastructure,

MemoryPlan -> VM
          \-> AoT
          \-> Graph

I dont see why this cannot happen if we have the full program in TIR.

I'm not saying that we shouldn't be able to do final memory planning in a full program TIR. Unified lowering should support this. The argument I'm making is that supporting unified static memory planning (USMP) shouldn't be available to AoT only. My feeling is that the approach to USMP should be co-designed as part of the approach to unified lowering so that all executor paths can benefit from the ability to plan inter-op and intra-op storage together.

Contributor Author

@manupak manupak Aug 13, 2021

Hi Chris,

I believe @jroesch's RFC on unified lowering can help clarify this point some when it lands soon. My take is that overall we would like to move to a world in which the IR is progressively lowered from initially an operator only representation to something closer to hardware, including to TIR for AoT, with target customizable transformation available at every progressively lowered step.

Conceptually this is sensible, but I'm worried we will mix the responsibilities of Relay and TIR, where the former is a functional language and TIR has imperative properties (Stmts) that encapsulate Exprs. I guess my question is why we say TIR is for AoT? In our view, we should not bypass TIR in the lowering, even for the main call-graph lowering that we call specialized executor lowering. In that world, everything will get represented in TIR (not just the AoT executor).

Regarding progressive lowering -- yes, this is a valuable addition, but in doing so I think we should try to avoid bypassing TIR.

cc : @tqchen @mbrookhart

Partial memory planning can be useful at this level. Let's take the constrained resource (SRAM) scheduling as an example, where SRAM is large enough to hold some full weight tensors. Assume weight pinning in SRAM is a special case of prefetching, and consider scheduling prefetch copy nodes that move weight tensors from DDR to SRAM storage. Depending on how many weights are pinned from the beginning and prefetched at various points in the topologically ordered execution, the amount of SRAM available for intra-operator scratch will change. In unified lowering this SRAM "stack" size could be provided to the TECompiler when doing subgraph scheduling. If this subgraph contains, for example, multiple convolutions and a striping technique is employed, the available SRAM scratch can be used as a constraint when an autoscheduler is searching the schedule space of this complicated subgraph. After scheduling another layer of transformations can occur which can include full / unified memory planning of inter- and intra- op storage.

A few queries around this approach:

Isn't this similar to what could be done using https://tvm.apache.org/docs/api/python/te.html#tvm.te.Schedule.cache_read ?
How will Relay do something like double buffering?
What if we want to perform compute_at at a non-reduction loop on a tiled basis that requires only part of the weights to be copied to the SRAM?

I guess my broad question is why we are adding Relay AST nodes to do this when we can do this with fewer changes, and holistically, in TIR.

cc : @mbaret

By doing so, we would need to duplicate most of the logic in scheduling primitives and TIR IR nodes into Relay as well. Do we know a strong reason to perform such duplication? It is not immediately obvious how this aligns with the goals of the unified lowering.

We can consider a case in which everything is unified behind a MemoryPlan->VM->AoT lowering flow in which the AoT TIR main is generated from the relax VM representation. In this case the resulting AoT TIR also supports dynamism. But even before this we could stage the effort and make early strides towards unifying the lowering flows such that different executors utilize the same common planning infrastructure,

MemoryPlan -> VM
          \-> AoT
          \-> Graph

Does supporting dynamism require a Relay-level memory plan?

My argument is that, unlike lowering directly from Relay, if we unify the lowering after the whole program is expressed in TIR, we would avoid duplication -- which I think is one of the goals of the unified lowering refactor.

Relay --> TECompiler --> TIR (main is also in TIR) --> MemoryPlan --> AoT
                                                                 \--> Graph
                                                                 \--> VM

It'd be interesting to know which Relay constructs used in the VM cannot be expressed in TIR. If we can fully express the dynamism in TIR, I'd say we should not bypass the TIR lowering for any executor. If we are not bypassing the TIR lowering for any executor (operators or control code), as stated before, doing a memory plan at the Relay level seems a bit redundant. It's entirely possible I'm missing some info on the VM, so feel free to enlighten me :).

Im not saying that we shouldn't be able to do final memory planning in a full program TIR. Unified lowering should support this. The argument I'm making is that supporting unified static memory planning (USMP) shouldn't be available to AoT only. My feeling is that the approach to USMP should be co-designed as part of the approach to unified lowering so that all executor paths can benefit from the ability to plan inter-op and intra-op storage together.

We agree with this fully, and additionally we are saying we should not bypass 'full program TIR' for any executor, if possible. All we are proposing here is a Full Program TIR --> Full Program TIR transformation that could be attached to each executor's codegen pipeline, if each executor has that stage in the pipeline (currently the graph executor bypasses this; it goes from Relay --> JSON directly) -- and I don't see a strong reason why they should not have it, at least yet. Then it's incremental work to attach it to each executor, in a similar way to the TECompiler refactor (attaching the TECompiler to each executor).

cc : @Mousius @mbaret

Contributor

Hi Manupa- we've been having some conversations internally trying to sync up on the unified lowering flow to understand how the USMP RFC and unified lowering can work together. @areusch will follow up with some comments. Overall I think it sounds like we have the same goals, and if we can arrive at a design in which only incremental refactoring is needed, like with updating AOT to use the TECompiler, then we're good. I'll also follow up directly with some of the questions you raised in response in another post, just didn't want to let the conversation lapse.

Contributor

hi Manupa,

@jroesch @csullivan and I discussed this at length over the last two days in conjunction with @jroesch's plans to unify the Graph, AOT, and VM lowering pipelines. The root of the concern was over four issues:

  1. it may be of interest for some users to explicitly specify which memory pool a Relay operator shall place its results. How shall TVM interpret such requests?
  2. What if memory planning impacts scheduling (e.g. you may want to choose a different schedule because memory does not fit)?
  3. As a side effect of unifying the lowering flows, it will be desirable to also unify memory planning across the three. Given the VM supports additional features over AOT/Graph executors (e.g. dynamic shapes, potentially additional control flow such as recursion), it may not be possible to perform static memory planning on all tensors in such a world. We want to ensure such a unified flow is possible without being overly complex and while still supporting features.
  4. There was some confusion as to how GraphExecutor/AOTExecutor may both use USMP.

Point #1 can be handled with a Relay-level annotation on either the function or variable which can then be moved into TIR to drive the memory planner.
Point #4 is handled by writing a pass to translate the USMP-planned AOT top-level function into a GraphExecutor graph.json when possible. I've clarified this with @jroesch and @csullivan.

To point #2: there are a couple of classes of this problem: 1) when a schedule must be discarded because a single buffer always doesn't fit into RAM; 2) when a particular configuration of a schedule can fit into RAM, but not all. Ultimately, this problem can always be addressed by iteratively running the compilation pipeline starting from scheduling and discarding the results after memory planning if they don't work out. However, this solution is very slow so we prefer to avoid it. There are a couple of ways to do this:

  1. introduce memory constraints at schedule time so that it's obvious when a buffer is far too big for the available memory
  2. We can model allocations before scheduling, like the VM does. this can allow us to discard schedules in class 1 before doing expensive memory planning.
  3. we may be able to leverage AutoTIR to "tweak" a schedule (e.g. reduce the intermediate scratchpad usage) when that usage is proportional to an inner/outer loop nest (this solution re

To point #3: After some discussion, it appears that the planning done in the VM and USMP have something in common: they both create allocate nodes to model the graph-level tensors in TIR. Since USMP relies on building an abstract representation of allocate/free events, it should be possible to continue to build this after unifying the flows based on either the AOT top-level function or a similar function produced by lowering the VM-optimized Relay module to TIR. Therefore, the static tensors used in a TIR IRModule should still be able to be planned using USMP after unifying the flows. The dynamic tensors may need to be handled separately, but in general doing much ahead-of-time planning for tensors with data-dependent lifetime (e.g. when handling recursion) is difficult.

Finally, related: @jroesch also mentioned that in unifying the flows, he wants to move to the world where a conceptual equivalent of the VM buffer/storage nodes are injected pre-scheduling. There will be some work in doing this to ensure that they are fully compatible with the memory pool concept here, but at first glance they appear to be broadly compatible. The VM carries a somewhat similar concept of a per-device dynamic allocator, which is conceptually similar to that which may be required if we were to consider at a later time supporting dynamic models in the C runtime. We think these concepts are similar and can be unified moving forwards.

Here our intent isn't to choose any one implementation over another but purely to ensure we can reduce the conceptual complexity of the compiler over time while retaining flexibility of the compiler. My thoughts are that adopting this approach (and therefore, going ahead with USMP now) should allow others to begin building the static memory planning algorithms while we are working to simplify the compiler, and this will ensure we can still continue to support any memory planning algorithms developed on top of USMP in the medium-long term.

Please let me know if this makes sense to you! I think this should allow everyone to continue working without stepping on each others' toes excessively.

-Andrew
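A rough sketch of the iterative fallback described under point #2 above (all function names here are hypothetical, not existing TVM APIs):

```python
# Hypothetical sketch of "run the pipeline again from scheduling if the plan
# doesn't fit": none of these functions exist in TVM under these names.
def compile_with_retries(candidate_schedules, lower, plan_memory, pool_limits):
    for schedule in candidate_schedules:
        module = lower(schedule)                 # schedule -> full-program TIR
        plan = plan_memory(module, pool_limits)  # e.g. a USMP-style planner; returns pool -> usage
        if all(plan[pool] <= limit for pool, limit in pool_limits.items()):
            return module, plan                  # first plan that fits wins
    raise RuntimeError("no candidate schedule fits the given memory pools")
```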

Contributor Author

Hi @areusch @csullivan,

Thanks for the clarification; it makes sense to us.

Contributor

@areusch areusch left a comment

hi @manupa-arm, just a few more questions. if they're answered satisfactorily by @jroesch @mbs-octoml @tqchen and they approve this feel free to merge without my approval.

* All Operator Functions
* the maximum size for each pool We could use "pinned_memory" (see below) to tag buffers with suggested priority order determined by the scheduler.

The idea is USMP will try to pool them using the preferred "pinned_memory" and fallback whenever the size is exceeding the user provided max size for each pool (if any)
Contributor

could you elaborate on "fallback"? Presumably this would be configurable--if the user doesn't want USMP to fallback when it exceeds memory boundaries, we should be able to configure this, no?

Contributor

can you address this here? some questions i don't see answered:

  1. where are the "fallback" candidate_memory_pools passed in to the runtime?
  2. how should the compiler know which candidate_memory_pools can be fallbacks for a given buffer (given that it does know a particular buffer needs to be accessible per-device)
  3. do we need to add an additional field to PoolInfo to identify which devices can use it? it's fine if this is to be decided in a follow-up RFC, but can you indicate which one (if you have one in mind) or that one hasn't been chosen (if not)?

* All Operator Functions
* the maximum size for each pool We could use "pinned_memory" (see below) to tag buffers with suggested priority order determined by the scheduler.

The idea is USMP will try to pool them using the preferred "pinned_memory" and fallback whenever the size is exceeding the user provided max size for each pool (if any)
Contributor

also can you clarify "fallback" here as well?

<compute>
...
```
##### Step 2 : Run an analysis pass to populate a Map<tir::Var, BufferInfo> that contains buffer information as defined above (See the struct BufferInfo).
Contributor

could you sketch this out briefly?
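Purely as an illustration of what such an analysis could produce (not the actual pass): given liveness intervals for the buffer Vars in execution order, the conflicts field can be derived by marking any two buffers whose intervals overlap. The interval extraction from TIR is elided here, and the inputs are assumptions.

```python
# Illustration only: derive pairwise conflicts from liveness intervals.
# In the real pass the intervals would come from walking the tir.allocate
# scopes of the main function and the operator PrimFuncs.
def find_conflicts(liveness):
    """liveness: buffer var name -> (first_use, last_use) in execution order."""
    conflicts = {name: [] for name in liveness}
    names = list(liveness)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            a_start, a_end = liveness[a]
            b_start, b_end = liveness[b]
            if a_start <= b_end and b_start <= a_end:  # intervals overlap
                conflicts[a].append(b)
                conflicts[b].append(a)
    return conflicts

# e.g. two intra-operator buffers overlapping each other but not a later tensor:
print(find_conflicts({"PaddedInput": (0, 2), "DepthwiseConv2d": (1, 3), "output": (4, 5)}))
```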

```
##### Step 2 : Run an analysis pass to populate a Map<tir::Var, BufferInfo> that contains buffer information as defined above (See the struct BufferInfo).

##### Step 3 : Use the updated Map<tir::Var, BufferInfo> to generate Array<BufferInfo>, Map<String, Integer> pool_sizes
Contributor

how are the Vars assigned to memory pools?

@areusch areusch added the status: need update RFC needs update based on feedback label Aug 28, 2021
Contributor

areusch commented Sep 7, 2021

@manupa-arm could you address my last round of comments? i think we're good to merge after that

*reflecting the partial changes for tir pinned memory representation
*addressing Andrew's comments

Change-Id: I40019ecb8e75ba46b1bf415ea70718bbeab3d26b
Contributor Author

manupak commented Sep 20, 2021

@areusch I got some cycles to spend on this. I've updated the RFC addressing your comments and reflecting the changes discussed here as well : https://github.com/apache/tvm-rfcs/blob/c447cbfbd5abceaa7623a0f90cc492784e6f0c0b/rfcs/0023-adding-annotation-field-to-tir.allocate.md.

PTAL when you get some time. TIA.

@manupak manupak added status: need review RFC needs review status: need update RFC needs update based on feedback and removed status: need update RFC needs update based on feedback status: need review RFC needs review labels Sep 20, 2021
* Each tir.allocate in the IRModule annotated with candidate pools ([Using the annotation field of tir.allocate](https://github.com/apache/tvm-rfcs/blob/c447cbfbd5abceaa7623a0f90cc492784e6f0c0b/rfcs/0023-adding-annotation-field-to-tir.allocate.md))


```
Contributor

er, this is kind of just here without much explanation. it makes sense in light of the comment threads here.. but can you add something more explicit to indicate that this is an input? even if we are not quite sure how this is going to land and just defer to that other RFC, it's helpful to state that here.


Contributor

@areusch areusch left a comment

@manupa-arm i did another pass here. just a couple more clarifying questions on the overall approach. i'm okay with deferring some decisions to follow-on RFCs when they become more clear (e.g. I think progress on @jroesch TargetDevice work may influence some of the ways we encode memory pool accessibility per-device), but can we explicitly make it clear in the RFC when we are deferring something and specify which RFC it'll be decided under, if we know?

*explaining the fallback and candidate_memory_pools

Change-Id: Iab59de953bd931fe44ae77004f8c014e25b126f8
Contributor Author

manupak commented Sep 27, 2021

Hi @areusch ,

I have addressed candidate_memory_pool query now.

For your question around fallback :

where are the "fallback" candidate_memory_pools passed in to the runtime?

The fallback only happens at compile time, as per this RFC. Therefore, by the time USMP is done, one pool will be assigned/decided for each tir.allocate.

how should the compiler know which candidate_memory_pools can be fallbacks for a given buffer (given that it does know a particular buffer needs to be accessible per-device)

I have added a sentence to explain this. The core compiler should assign each tir.allocate in each PrimFunc the pools it can access, because each PrimFunc knows which target it gets compiled to and each pool knows which targets can access it. Initially, it will take the priority order the user provides to the TVMC interface.

do we need to add an additional field to PoolInfo to identify which devices can use it? it's fine if this is to be decided in a follow-up RFC, but can you indicate which one (if you have one in mind) or that one hasn't been chosen (if not)?

Im pretty sure it is listed as Map<Target,String> target_access; // 'rw' or 'ro' in the PoolInfo.
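To illustrate that rule (a hedged sketch: PoolInfo is modelled as a plain dict, and nothing here is actual compiler code), the candidate pools for a PrimFunc's allocations are the pools whose target_access includes that PrimFunc's target, kept in the user-supplied priority order so the planner can fall back down the list when a size limit is exceeded:

```python
# Sketch only: pick candidate pools for allocations in a PrimFunc, based on the
# rule described above. PoolInfo is modelled as a dict; priority order is the
# order the user supplied (e.g. on the TVMC command line).
def candidate_pools(primfunc_target, pools_in_priority_order):
    return [
        pool["name"]
        for pool in pools_in_priority_order
        if primfunc_target in pool["target_access"]  # 'rw' or 'ro' access
    ]

pools = [
    {"name": "sram", "target_access": {"c": "rw", "accel": "rw"}},
    {"name": "dram", "target_access": {"c": "rw"}},
]
assert candidate_pools("accel", pools) == ["sram"]
assert candidate_pools("c", pools) == ["sram", "dram"]
```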

Contributor

@areusch areusch left a comment

@manupa-arm thanks. can you do a pass through the remaining items here (e.g. @electriclilies had some clarifying suggestions to the text).

For your question around fallback :

where are the "fallback" candidate_memory_pools passed in to the runtime?

The fallback only happens at compile time, as per this RFC. Therefore, by the time USMP is done, one pool will be assigned/decided for each tir.allocate.

sorry but i mean; how do we determine a pool to be a fallback pool?

how should the compiler know which candidate_memory_pools can be fallbacks for a given buffer (given that it does know a particular buffer needs to be accessible per-device)

I have added a sentence to explain this. The core compiler should assign each tir.allocate in each PrimFunc the pools it can access, because each PrimFunc knows which target it gets compiled to and each pool knows which targets can access it. Initially, it will take the priority order the user provides to the TVMC interface.

thanks! can you add the ordering bit to the RFC?

do we need to add an additional field to PoolInfo to identify which devices can use it? it's fine if this is to be decided in a follow-up RFC, but can you indicate which one (if you have one in mind) or that one hasn't been chosen (if not)?

Im pretty sure it is listed as Map<Target,String> target_access; // 'rw' or 'ro' in the PoolInfo.

ah--you're correct.


* Improving text
* Adding more specifics to how to handle fallback pool
* Renamed TVMC arugments to be pools instead of buffer
Contributor Author

manupak commented Oct 4, 2021

@electriclilies -- sorry I missed your suggestions -- corrected them now.

@areusch -- addressed the comments now.

Contributor

@areusch areusch left a comment

thanks @manupa-arm !

@areusch areusch merged commit 3834d72 into apache:main Oct 4, 2021
Contributor

areusch commented Oct 4, 2021

@manupa-arm please open a tracking issue for this one. excited to see this land!

MichaelJKlaiber added a commit to MichaelJKlaiber/tvm-rfcs that referenced this pull request Apr 19, 2022
MichaelJKlaiber added a commit to MichaelJKlaiber/tvm-rfcs that referenced this pull request Apr 19, 2022
Merge pull request apache#9 from MichaelJKlaiber/rfc_uma