
[RFC] TVM Unified Static Memory Planning #9

Merged
merged 8 commits into apache:main on Oct 4, 2021

Conversation

Contributor

@manupak manupak commented Jul 6, 2021

This commit adds the RFC (.md) for USMP.

pre-RFC on discuss : https://discuss.tvm.apache.org/t/rfc-unified-static-memory-planning/10099

This commits adds the RFC (.md) for USMP
Contributor Author

manupak commented Jul 6, 2021

cc : @areusch @mbaret @tqchen
(I can't seem to tag the original people in the RFC -- working on it)

*Updating the RFC with PR number
Contributor

@tkonolige tkonolige left a comment

I'm in favor of doing memory planning, so thank you for this PR. I've got a couple questions:

  1. Having memory planning would be useful across all executors (graph, vm, aot). What is preventing us from doing USMP for all the executors? It seems like the main problem with non-aot executors is that we do not have a full view of all the buffers.
  2. Can you include a drawbacks section?

Special Parametric Inputs :
* function : The algorithm to be used for planning. From a component PoV, the algorithm is a special input with a defined interface.

The current proposal for the interface is as follows :
Contributor

What is this interface for? The planning algorithm?

Contributor Author

Yes, I'll add it to the text tomorrow :)

--executor=aot
--output-format=mlf
--target=accel,c
--with-workspace-buffer= "name=sram;target=c,accel"
Contributor

Are buffers part of the target specification with this?

Contributor Author

@manupak manupak Jul 6, 2021

These buffers should ideally be compiler flags (and the same goes for flags such as --executor and --runtime). However, we are currently using the host target to hold such flags. Depending on the progress of the activity of taking compiler flags out of the host target, where these flags end up will change :)
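For illustration only, a minimal sketch (plain Python, not TVMC code) of parsing a pool specification in the form shown above; the function name and result keys are assumptions:

```python
# Minimal sketch, not TVMC code: parse a pool spec like "name=sram;target=c,accel".
def parse_pool_spec(spec: str) -> dict:
    pool = {}
    for item in spec.split(";"):
        key, _, value = item.partition("=")
        key, value = key.strip(), value.strip()
        if key == "target":
            # a pool may be accessible from several targets
            pool["targets"] = [t.strip() for t in value.split(",")]
        else:
            pool[key] = value
    return pool

assert parse_pool_spec("name=sram;target=c,accel") == {"name": "sram", "targets": ["c", "accel"]}
```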

Contributor Author

manupak commented Jul 6, 2021

Thanks @tkonolige for taking a look at this.

As per the pre-RFC discussion,
there is nothing preventing us from integrating the unified static memory planner (apart from doing the actual work for it, of course :) ) for the graph executor as long as the Relay "main" function is lowered to TIR before creating the JSON (or maybe not creating the JSON at all -- see the discuss post conversation). Thus, the design does not block such integration.

For the VM, I'm not sure how useful "static" memory planning would be.

Drawbacks -- the Relay "main" function has to be lowered to TIR to use USMP. (I'll add a section.)

* addressing tristan's comments.

Change-Id: Ieb64ae6fc1de12374836c7f754a70b735fe5d379
*Addressing further tristan's comments

Change-Id: I5eabfda362fa85fa4c377d20043f938ffc6de456
```
Array<BufferInfo> (*foo)(Array<BufferInfo> buffers, Map<String, Integer> pool_sizes)
```
The memory planning algorithm is expected to populate the assigned pool_name with the offset and return the updated array of BufferInfo objects. Additionally, the second argument provides size constraints for each pool (if any).
Contributor

Suggested change
The memory planning algorithm is expected to populate the assigned pool_name with the offset and return the updated array of BufferInfo objects. Additionally, the second argument provides size constraints for each pool (if any).
The memory planning algorithm is expected to populate pool_name and pool_offset and return the updated array of BufferInfo objects. Additionally, the second argument provides size constraints for each pool (if any).

Integer size_bytes;
Integer alignment;
Array<BufferInfo> conflicts; //the conflicting bufferinfo objs
Array<Integer> pool_candidates;`
Contributor

Does the memory planning algorithm have to assign each buffer to a pool in pool_candidates? Is this information not used after memory planning?

Contributor Author

@manupak manupak Jul 12, 2021

As of now, once they are assigned with pool_name and offset, pool_candidates may not be used -- except for debugging purposes.

Correction : Array<Integer> pool_candidates --> Array<String> pool_candidates.
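For illustration, a hedged sketch (plain Python rather than the proposed C++ types) of a naive planner that satisfies this interface: it honours the conflicts field and picks the first candidate pool whose size limit is not exceeded. The field names mirror the BufferInfo struct quoted above; the greedy policy itself is only an assumption, not the algorithm the RFC mandates.

```python
# Illustrative sketch only -- not the TVM USMP implementation.
# BufferInfo fields mirror the struct quoted above; the greedy policy is an assumption.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class BufferInfo:
    name: str
    size_bytes: int
    alignment: int
    conflicts: List["BufferInfo"] = field(default_factory=list)
    pool_candidates: List[str] = field(default_factory=list)
    pool_name: Optional[str] = None    # populated by the planner
    pool_offset: Optional[int] = None  # populated by the planner

def _align(offset: int, alignment: int) -> int:
    return (offset + alignment - 1) // alignment * alignment

def greedy_plan(buffers: List[BufferInfo], pool_sizes: Dict[str, int]) -> List[BufferInfo]:
    """Place each buffer after every already-placed conflicting buffer in the
    first candidate pool where it still fits under the pool's size limit."""
    for buf in sorted(buffers, key=lambda b: b.size_bytes, reverse=True):
        for pool in buf.pool_candidates:
            offset = 0
            for other in buf.conflicts:
                if other.pool_name == pool and other.pool_offset is not None:
                    offset = max(offset, other.pool_offset + other.size_bytes)
            offset = _align(offset, buf.alignment)
            if offset + buf.size_bytes <= pool_sizes.get(pool, float("inf")):
                buf.pool_name, buf.pool_offset = pool, offset
                break
    return buffers
```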

Contributor

it does seem like it would be useful to split the pre-memory-planning data structure apart from the post-memory planning data structure. perhaps:

  • an attribute "memory_constraints" is expected on nodes as input to memory planning
  • an attribute "memory_placement" is expected on nodes as an output

one concern with this approach is it may be slower if we need to copy all nodes involved. cc @jroesch for opinions on how to handle this sort of thing in the new TE-compiler world.


The current proposal for the interface of the memory planning algorithm is as follows :
```
struct BufferInfo {

Wouldn't it make sense to have a similar data structure for pools, as well?
something like :

    struct PoolInfo {
        String pool_name; 
        Integer size_bytes;
        Integer alignment;
        Integer pool_offset
    }

In our current understanding, a pool represents a physical memory, e.g. SRAM, DRAM, flash, etc., and therefore also has architectural restrictions, such as size, alignment and possibly an offset in the global memory space.

Contributor Author

It makes sense and aligns with our thinking as well.

This is something that will be passed by the user through the TVMC interface. However, TVM currently lacks support for holding compiler(-wide) flags, so we currently end up posting them as target_host attrs. During a run of the compiler these values would be global constants, therefore we are discussing (internally) the best way to represent them generically and extensibly as more such compiler flags (e.g. --executor is another example here) are added (e.g., possibly as some sort of AttrRegistry), and to hold them during a compilation run.

Another reason for not defining them here is that TVM's schedulers may want access to them in the future.

If you have any plans/ideas to this effect, we'd love to hear them :)

cc : @leandron @Mousius

Contributor

cc @jroesch regarding compiler input and output interfaces.

Member

@jroesch jroesch Jul 28, 2021

I am trying to write down some guidelines on where to store these things, will hopefully send upstream in conjunction with another RFC. My 2c is that we should add Attributes to the IRModule to also use as a per compilation place to store global data instead of keying them off target. Is there any reason this wouldn't work for you? I think we should move in this direction so all the useful state for compilation is readable directly from the compilation unit.

Contributor

@electriclilies electriclilies Aug 11, 2021

I agree with @jroesch on this, I think that it would be great to try to store the information in the IRModule attrs. This would also be consistent with some of the cleanups we are currently doing -- I am starting to try to push the info stored in the TargetMap data structure into function attributes.

Contributor Author

@manupak manupak Aug 17, 2021

Hi @jroesch @electriclilies,

Yes, that sounds like a good idea.

The only concern is that these are constant inputs to the compilation flow -- they are not supposed to get mutated in passes.
But yes, it could still work, and the passes will not need anything more than the IRModule -- that seems attractive. Let me know if you have any more thoughts around this.

If we all agree, yes, we can introduce the 'PoolInfo' (cc : @MichaelJKlaiberBosch) structure here and add it as an Attribute to the IRModule to be passed throughout the compilation flow.

}
```
```
Array<BufferInfo> (*foo)(Array<BufferInfo> buffers, Map<String, Integer> pool_sizes)

In our opinion the memory planner should be able to change the call graph order to any possible topological sort.
It is then the memory planner's job to ensure that the buffer offsets correspond to the liveness of the buffer dependencies.
To reflect this in the memory planner interface it might make sense to add the call_graph:

Array<BufferInfo>, Map<CallGraphNode, CallGraphNode>  (*foo)
(
 Array<BufferInfo> buffers,
 Map<String, Integer> pool_sizes,
 Array<PoolInfo> pools,
 Map<CallGraphNode, CallGraphNode> call_graph
)

The return value Map<CallGraphNode, CallGraphNode> defines the call graph altered by the memory planner. This leaves room for:

  1. Enforcing a fixed order: e.g. Node1 -> Node2 -> … -> NodeN OR
  2. Returning an altered call graph with a certain degree of freedom that is lowered. Example for such a graph
Node1 -> Node2 -> Node3--> ... -->NodeN
   |                                 ^
   |------->Node4----->Node5---------|

Contributor Author

@manupak manupak Jul 19, 2021

One of the assumptions is that before USMP is invoked, the scheduling of the operators is already decided.

However, we do understand the requirement.
To that end, this is something we refer to as memory-aware operator scheduling that should be done prior to USMP.

In most cases, this could be done using 'memory pressure': a Relay pass could use let-bindings to put the program into the A-normal form that results in minimum 'memory pressure'. For advanced situations (in other words, where the compile time is tolerable), that Relay pass could call into the above memory planning interface to retrieve actual allocations.

The reasoning is that mixing scheduling and allocation can easily result in higher time complexity. However, for exceptional cases, schedulers are free to be implemented (in a compile-time-insensitive way) to determine the order by running a complex allocation algorithm in the loop. Therefore, we feel this interface is general enough to support both use cases.
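As a toy illustration of the 'memory pressure' idea (not part of the RFC, and using a hypothetical graph encoding rather than Relay), one can score an operator order by the peak sum of live intermediate-tensor sizes and pick the topological order that minimises it:

```python
# Toy illustration of "memory pressure": peak sum of live tensor sizes for a given
# operator order. The graph encoding is hypothetical, not a Relay/TVM structure.
from itertools import permutations

def peak_memory(order, produces, consumes):
    """order: op names; produces: op -> (tensor, size_bytes); consumes: op -> tensors read."""
    last_use = {}
    for step, op in enumerate(order):
        for t in consumes.get(op, []):
            last_use[t] = step
    live, peak = {}, 0
    for step, op in enumerate(order):
        tensor, size = produces[op]
        live[tensor] = size
        peak = max(peak, sum(live.values()))
        for t in [t for t in live if last_use.get(t, -1) == step]:
            del live[t]  # free tensors whose last consumer has just run
    return peak

def best_order(ops, deps, produces, consumes):
    """Brute-force over topological orders (only sensible for tiny graphs)."""
    def topological(order):
        pos = {op: i for i, op in enumerate(order)}
        return all(pos[a] < pos[b] for a, b in deps)
    orders = [list(p) for p in permutations(ops) if topological(list(p))]
    return min(orders, key=lambda o: peak_memory(o, produces, consumes))
```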

Contributor

we discussed this at the µTVM meetup. The current thinking is what @manupa-arm said above--scheduling operators is considered to be separate from memory planning. However, it is entirely reasonable for an upstream scheduler to invoke parts of the memory planner to determine whether reordering the schedule could be advantageous later on. @jroesch also has plans to enable an upstream pass to run an initial "return value" through a partial set of the downstream passes and examine the result of this. For instance, an upstream scheduler could produce an initial schedule and then invoke the remainder of the passes through Memory Planning. Then, it could examine the result of Memory Planning, potentially do this a few times, and choose the best solution as the true "return value" from scheduling.

@MichaelJKlaiberBosch let us know if this seems like a poor compromise to you guys. We certainly don't have to stick with this approach if there are more advantages to another.

Member

I am working on a set of RFCs which distill all the pain I felt removing the compile engine. I think users should choose ordering before we plan, entangling passes with different concerns creates some of the problems which we are feeling today, for example the current static memory planner invokes device planning directly and there is no (currently) easy way to split them into pieces.

If you want to adjust the scheduling you should write a pass to plan, then analyze the plan, rewrite the program and invoke planning again instead of baking these details into the planner. The one shot mentality has led to a lot of super fragile code that only works under a single phase ordering for a single executor.


Thanks for the detailed answers.
@areusch @manupa-arm
In general we are okay with separating the scheduling and memory planning problems. From our point of view there could be cases where letting the memory planner change the order of the call nodes is advantageous. This, however, depends a lot on the topology of future neural networks. For simple feed-forward NNs this might not be required; for more complicated NNs, e.g. as a result of NAS, this might be more relevant.

Contributor

@areusch areusch left a comment

@manupa-arm apologies for the delay and thanks for bringing this RFC to the table! I've left some initial comments/questions.

Currently, given a ML model primarily TVM will generate two main artifacts :

* A1 : Description of the sequential execution of operators :
1. If the "executor" is "graph", this would be a JSON
Contributor

specify "executor" here (it's in metadata.json in MLF)

Contributor Author

Sorry, I'm not sure that I follow. Do you mean 'define' or 'specify'?

Contributor

sorry i mean that by "executor" here, you mean the value of the "executor" key in MLF metadata.json. However, that's a micro-specific (for now) artifact. To make the RFC easier to follow, suggest either including a pointer so that a reader could determine this, or just saying something less specific like "If building for the graph executor."


Currently, given a ML model primarily TVM will generate two main artifacts :

* A1 : Description of the sequential execution of operators :
Contributor

i call this executor configuration

Contributor Author

Done


Yes, there is.

For A1, the inter-(fused) operator tensors are visible in the "main" relay function. Thus, there exists currently a Relay level pass known as "GraphPlanMemory" that works on the Relay IR to share the space used by tensors which are not live simultaneously and are visible between (fused) operators . Currently, the said pass will use Shared Memory Buffer Object memory planning scheme (See https://blog.tensorflow.org/2020/10/optimizing-tensorflow-lite-runtime.html) to perform the planning.
Contributor

nit: remove "Thus," because there's no deductive reason why it should exist

Contributor Author

Done

}
```

The above TIR snippet shows that two intra operator buffers PaddedInput, DepthwiseConv2d is not visible to Relay Graph Plan Memory to be shared.
Contributor

nit: "are not visible for optimization by the Relay-level GraphPlanMemory approach."

Contributor Author

Done

* src/tir/usmp/usmp.cc -- this is main intergration of USMP that exposes the full TIR --> TIR transformation as described.
* tests/python/unittest/test_tir_usmp_*.py -- this where unittests for each of the passes and pass pipeline for USMP as a component will live.

NOTE 1: All the above passes will have a mirror in the python.
Contributor

could you define "mirror" here? also s/the python/python//

Contributor Author

Done




G1. There would be no TVMBackendAlloc(/Free)Workspace calls generated for tir.allocates that could be evaluated at compile time.

Currently, the TVM codegen and the AoT executor relies on TVMB(A/F)W calls to increment/decrement a pointer of user provided workspace buffer. By the end of this set of work, if the backend uses Unified Static Memory Planning, there should not be TVMB(A/F)W calls rather correct offset in to the user provided buffer should be codegen'd for allocates that could be evaluated at compile time. The dynamically sized allocates will remain untouched, thus will be lowered as usual.
Contributor

could you specify how you'll determine whether a TVMBAW can be eliminated? i think it should roughly be: can the size argument be evaluated at compile-time?
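For what it's worth, a simplified sketch of that criterion (plain Python values standing in for TIR extents; not the actual pass):

```python
# Sketch only: a tir.allocate can be planned statically (and its
# TVMBackendAllocWorkspace call removed) only if its size folds to a
# compile-time constant. Plain ints stand in for constant TIR extents here.
def is_statically_plannable(extents) -> bool:
    return all(isinstance(e, int) for e in extents)

assert is_statically_plannable([1, 56, 56, 64])        # planned by USMP
assert not is_statically_plannable([1, "n", 56, 64])   # dynamic; lowered as usual
```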

Contributor Author

Done

* addressed comments

Change-Id: I12fa85e5ea10eee328be4c5d51c9a481a90dedb5
Contributor

@tkonolige tkonolige left a comment

Thank you for answering my questions @manupa-arm

Contributor

@electriclilies electriclilies left a comment

Overall, looks good to me, found a few typos! Thanks for the detailed write up.


# Motivation

For embedded use-cases, its widely accepted that aggressive memory optimizations are vital. Intially we are looking at enable memory planning for embedded use-cases using the AoT executor.
Contributor

There's a typo here -- "enable" should be "enabling"!


The above TIR snippet shows that two intra operator buffers PaddedInput, DepthwiseConv2d are not visible for optimization by the Relay-level GraphPlanMemory approach.

* Assumption of local optimization : performing sharing inside the operator first and sub-subsequently sharing that workspace with inter-operator tensors, would be sub-optimal.
Contributor

another typo, sub-subsequently -> subsequently


G3. Multiple pool support (including constants)

Ideally, the user would expect to provide these buffers in the granularity of the memories they'd want to pin them to. E.g., if there are two RW memories : DRAM and SRAM, the buffers need to be identified and pooled by the compiler. Similiarly, for constant data, we need to have a mechanism to allow user to pin them to appropriate memories and addresses in the IR would simply be offsets into the constant buffer(s) provided by the user
Contributor

Can you split this into two sentences? "we need to have a mechanism to allow user to pin them to appropriate memories and addresses in the IR would simply be offsets into the constant buffer(s) provided by the user" -> "we need to have a mechanism to allow users to pin them to appropriate memories and addresses. In the IR, they would simply be offsets into the constant buffer(s) provided by the user"




##### Step 3 : Use the updated Map<tir::Var, BufferInfo> to generate Array<BufferInfo>, Map<String, Integer> pool_sizes

##### Step 4 : Call the provided/default algorithm (void (*foo)(Array<ByfferInfo> buffers, Map<String, Integer> pool_sizes) to populate pool_id and pool_offset.
Contributor

another typo: ByfferInfo -> BufferInfo

Contributor

@csullivan csullivan left a comment

I'm a bit worried that if we try to unify a static memory planner before unifying our lowering flow, we widen the division between Graph, VM, and AoT.

This should be a IRModule (TIR) → IRModule (TIR) pass.

Inputs :
* AoT TIR PrimFunc ( the control function describing the call graph to operators)
Contributor

In the unified lowering flow we may choose to introduce explicit storage and tensor allocations into the relay AST, and then hoist memory planning out of the executors into the unified lowering flow. We can then annotate the device/storage scope/memory plan directly on calls so we can re-use across transformations. Currently once memory planning is run, we can no longer transform the program again limiting the ability to customize per target.

With customization in the lowering flow, executors such as AoT can allow per target planning, and then generate code by matching allocation nodes in the relay AST and lowering them directly to TIR when building the main function.

Unified lowering could allow:

[IRModule(relay)] 
-> transforms (incl. mem. planning) -> [IRModule(relay)] 
-> LowerTE (op scheduling) -> [IRModule(relay)+IRModule(TIR)] 
-> transforms (incl. target specific mem. planning) -> [IRModule(relay)+IRModule(TIR)] 
-> ExecutorLowering

with transforms including the ability to customize the transformations applied for different hardware targets. Moreover, transformations are applied at each stage of lowering and one or more of those transforms may include memory planning.

If we push forward with this RFC without co-designing it with unified lower, my concern is that we continue to kick the problem down the road and it will make unifying the lowering and planning for all executors more difficult.

Contributor Author

@manupak manupak Aug 12, 2021

Hi Chris,

In the unified lowering flow we may choose to introduce explicit storage and tensor allocations into the relay AST, and then hoist memory planning out of the executors into the unified lowering flow.

[IRModule(relay)]
-> transforms (incl. mem. planning) -> [IRModule(relay)]

I dont see much value in doing memory planning at this level as the intra-operator allocate nodes are not visible. Can you elaborate ?

With customization in the lowering flow, executors such as AoT can allow per target planning, and then generate code by matching allocation nodes in the relay AST and lowering them directly to TIR when building the main function.

Can you explain what you mean by allocation nodes in the Relay AST?
In our view, relay is a pure functional language that is designed as an IR to represent operator-level info. It feels wrong to do memory planning at this level. @mbrookhart

I guess I'm more interested in knowing why graph and AoT both cannot lower the main function to TIR before performing executor-specific lowering (that is, generating the JSON or the main function).

A side point to this is -- as we have discussed in the discuss post -- "I think going forward graph executor might be able to load a packed function of the tvm_main instead of json – it’ll be less confusing as how the graph executor runtime is positioned as of today which is more of a (a very thin – as its supposed to be :) ) middleware that connect the graph json and the compiled operator library".

Having said that, I can see this work enabling a path (to extend) towards that – though we only plan to create USMP component that is a TIR IRModule → TIR IRModule which we initially test and support for the AoT executor."

https://discuss.tvm.apache.org/t/rfc-unified-static-memory-planning/10099/2

with transforms including the ability to customize the transformations applied for different hardware targets. Moreover, transformations are applied at each stage of lowering and one or more of those transforms may include memory planning.

I dont see why this cannot happen if we have the full program in TIR.

If we push forward with this RFC without co-designing it with unified lower, my concern is that we continue to kick the problem down the road and it will make unifying the lowering and planning for all executors more difficult.

The existence of different executors are for different use-cases, therefore there is only so much unification we could do. IMO, we should only have two executors AoT and JIT (VM), conceptually (that is not to say graph executor is not important). Moreover, I dont feel static memory planning is applicable to the latter. I'd view graph executor as an application that support RPC (and other additional features such launch parallel for loops, loading packed binary runtime modules, etc) for AoT (Graph = AoT++ ?) that could be used in a tuning process (Refer to the earlier comment on the importance of using a main function instead of JSON).

I'd like to hear why designing a static memory planner that works where the full program is expressed in TIR creates a divergence at least between graph and AoT.

@jroesch @areusch @tqchen @mbaret

Contributor

In our view, relay is a pure functional language that is designed as an IR to represent operator-level info. It feels wrong to do memory planning at this level.

I believe @jroesch's RFC on unified lowering can help clarify this point some when it lands soon. My take is that overall we would like to move to a world in which the IR is progressively lowered from initially an operator only representation to something closer to hardware, including to TIR for AoT, with target customizable transformation available at every progressively lowered step.

I dont see much value in doing memory planning at this level as the intra-operator allocate nodes are not visible. Can you elaborate ?

Partial memory planning can be useful at this level. Let's take the constrained resource (SRAM) scheduling as an example, where SRAM is large enough to hold some full weight tensors. Assume weight pinning in SRAM is a special case of prefetching, and consider scheduling prefetch copy nodes that move weight tensors from DDR to SRAM storage. Depending on how many weights are pinned from the beginning and prefetched at various points in the topologically ordered execution, the amount of SRAM available for intra-operator scratch will change. In unified lowering this SRAM "stack" size could be provided to the TECompiler when doing subgraph scheduling. If this subgraph contains, for example, multiple convolutions and a striping technique is employed, the available SRAM scratch can be used as a constraint when an autoscheduler is searching the schedule space of this complicated subgraph. After scheduling another layer of transformations can occur which can include full / unified memory planning of inter- and intra- op storage.

The existence of different executors are for different use-cases, therefore there is only so much unification we could do. IMO, we should only have two executors AoT and JIT (VM), conceptually (that is not to say graph executor is not important).

We can consider a case in which everything is unified behind a MemoryPlan->VM->AoT lowering flow in which the AoT TIR main is generated from the relax VM representation. In this case the resulting AoT TIR also supports dynamism. But even before this we could stage the effort and make early strides towards unifying the lowering flows such that different executors utilize the same common planning infrastructure,

MemoryPlan -> VM
          \-> AoT
          \-> Graph

I dont see why this cannot happen if we have the full program in TIR.

I'm not saying that we shouldn't be able to do final memory planning in a full program TIR. Unified lowering should support this. The argument I'm making is that supporting unified static memory planning (USMP) shouldn't be available to AoT only. My feeling is that the approach to USMP should be co-designed as part of the approach to unified lowering so that all executor paths can benefit from the ability to plan inter-op and intra-op storage together.

Contributor Author

@manupak manupak Aug 13, 2021

Hi Chris,

I believe @jroesch's RFC on unified lowering can help clarify this point some when it lands soon. My take is that overall we would like to move to a world in which the IR is progressively lowered from initially an operator only representation to something closer to hardware, including to TIR for AoT, with target customizable transformation available at every progressively lowered step.

Conceptually this is sensible, but I'm worried we will mix the responsibilities of Relay and TIR, where the former is a functional language and TIR has imperative properties (Stmts) that encapsulate Exprs. I guess my question is why we say TIR is for AoT? In our view, we should not bypass TIR in the lowering, even for the main call-graph lowering that we call specialized executor lowering. In that world, everything will get represented in TIR (not just the AoT executor).

Regarding progressive lowering -- yes, this is a valuable addition, but in doing so I think we should try to avoid bypassing TIR.

cc : @tqchen @mbrookhart

Partial memory planning can be useful at this level. Let's take the constrained resource (SRAM) scheduling as an example, where SRAM is large enough to hold some full weight tensors. Assume weight pinning in SRAM is a special case of prefetching, and consider scheduling prefetch copy nodes that move weight tensors from DDR to SRAM storage. Depending on how many weights are pinned from the beginning and prefetched at various points in the topologically ordered execution, the amount of SRAM available for intra-operator scratch will change. In unified lowering this SRAM "stack" size could be provided to the TECompiler when doing subgraph scheduling. If this subgraph contains, for example, multiple convolutions and a striping technique is employed, the available SRAM scratch can be used as a constraint when an autoscheduler is searching the schedule space of this complicated subgraph. After scheduling another layer of transformations can occur which can include full / unified memory planning of inter- and intra- op storage.

A few queries around this approach:

Isn't this similar to what could be done using https://tvm.apache.org/docs/api/python/te.html#tvm.te.Schedule.cache_read ?
How will Relay do something like double buffering?
What if we want to perform compute_at at a non-reduction loop on a tiled basis that requires only part of the weights to be copied to the SRAM?

I guess my broad question is why we are adding Relay AST nodes to do this when we can do this with fewer changes, and holistically, in TIR.

cc : @mbaret

By doing so, we would need to duplicate most of the logic in scheduling primitives and TIR IR nodes into Relay as well. Do we know a strong reason to perform such duplication? It is not immediately obvious how this aligns with the goals of the unified lowering.

We can consider a case in which everything is unified behind a MemoryPlan->VM->AoT lowering flow in which the AoT TIR main is generated from the relax VM representation. In this case the resulting AoT TIR also supports dynamism. But even before this we could stage the effort and make early strides towards unifying the lowering flows such that different executors utilize the same common planning infrastructure,

MemoryPlan -> VM
          \-> AoT
          \-> Graph

Does supporting dynamism require a Relay-level memory plan?

My argument is that, unlike lowering directly from Relay, if we unify the lowering after the whole program is expressed in TIR, we would avoid duplication -- which I think is one of the goals of the unified lowering refactor.

Relay --> TECompiler --> TIR (main is also in TIR) --> MemoryPlan --> AoT
                                                                 \--> Graph
                                                                 \--> VM

It'd be interesting to know which Relay constructs used in the VM cannot be expressed in TIR. If we can fully express the dynamism in TIR, I'd say we should not bypass the TIR lowering for any executor. If we are not bypassing the TIR lowering for any executor (operators or control code), as stated before, doing a memory plan at the Relay level seems a bit redundant. It's entirely possible I'm missing some info on the VM, so feel free to enlighten me :).

Im not saying that we shouldn't be able to do final memory planning in a full program TIR. Unified lowering should support this. The argument I'm making is that supporting unified static memory planning (USMP) shouldn't be available to AoT only. My feeling is that the approach to USMP should be co-designed as part of the approach to unified lowering so that all executor paths can benefit from the ability to plan inter-op and intra-op storage together.

We agree with this fully, and additionally we are saying we should not bypass 'full program TIR' for any executor, if possible. All we are proposing here is a Full Program TIR --> Full Program TIR transformation that could be attached to each executor's codegen pipeline, if each executor has that stage in the pipeline (currently the graph executor bypasses this; it goes from Relay --> JSON directly) -- and I don't see a strong reason why they should not have it, at least yet. Then it's incremental work to attach it to each executor, in a similar way to the TECompiler refactor (attaching the TECompiler to each executor).

cc : @Mousius @mbaret

Contributor

Hi Manupa- we've been having some conversations internally trying to sync up on the unified lowering flow to understand how the USMP RFC and unified lowering can work together. @areusch will follow up with some comments. Overall I think it sounds like we have the same goals, and if we can arrive at a design in which only incremental refactoring is needed, like with updating AOT to use the TECompiler, then we're good. I'll also follow up directly with some of the questions you raised in response in another post, just didn't want to let the conversation lapse.

Contributor

hi Manupa,

@jroesch @csullivan and I discussed this at length over the last two days in conjunction with @jroesch's plans to unify the Graph, AOT, and VM lowering pipelines. The root of the concern was over four issues:

  1. it may be of interest for some users to explicitly specify which memory pool a Relay operator shall place its results. How shall TVM interpret such requests?
  2. What if memory planning impacts scheduling (e.g. you may want to choose a different schedule because memory does not fit)?
  3. As a side effect of unifying the lowering flows, it will be desirable to also unify memory planning across the three. Given the VM supports additional features over AOT/Graph executors (e.g. dynamic shapes, potentially additional control flow such as recursion), it may not be possible to perform static memory planning on all tensors in such a world. We want to ensure such a unified flow is possible without being overly complex and while still supporting features.
  4. There was some confusion as to how GraphExecutor/AOTExecutor may both use USMP.

Point #1 can be handled with a Relay-level annotation on either the function or variable which can then be moved into TIR to drive the memory planner.
Point #4 is handled by writing a pass to translate the USMP-planned AOT top-level function into a GraphExecutor graph.json when possible. I've clarified this with @jroesch and @csullivan.

To point #2: there are a couple of classes of this problem: 1) when a schedule must be discarded because a single buffer always doesn't fit into RAM; 2) when a particular configuration of a schedule can fit into RAM, but not all. Ultimately, this problem can always be addressed by iteratively running the compilation pipeline starting from scheduling and discarding the results after memory planning if they don't work out. However, this solution is very slow so we prefer to avoid it. There are a couple of ways to do this:

  1. introduce memory constraints at schedule time so that it's obvious when a buffer is far too big for the available memory
  2. We can model allocations before scheduling, like the VM does. this can allow us to discard schedules in class 1 before doing expensive memory planning.
  3. we may be able to leverage AutoTIR to "tweak" a schedule (e.g. reduce the intermediate scratchpad usage) when that usage is proportional to an inner/outer loop nest (this solution re

To point #3: After some discussion, it appears that the planning done in the VM and USMP have something in common: they both create allocate nodes to model the graph-level tensors in TIR. Since USMP relies on building an abstract representation of allocate/free events, it should be possible to continue to build this after unifying the flows based on either the AOT top-level function or a similar function produced by lowering the VM-optimized Relay module to TIR. Therefore, the static tensors used in a TIR IRModule should still be able to be planned using USMP after unifying the flows. The dynamic tensors may need to be handled separately, but in general doing much ahead-of-time planning for tensors with data-dependent lifetime (e.g. when handling recursion) is difficult.

Finally, related: @jroesch also mentioned that in unifying the flows, he wants to move to the world where a conceptual equivalent of the VM buffer/storage nodes are injected pre-scheduling. There will be some work in doing this to ensure that they are fully compatible with the memory pool concept here, but at first glance they appear to be broadly compatible. The VM carries a somewhat similar concept of a per-device dynamic allocator, which is conceptually similar to that which may be required if we were to consider at a later time supporting dynamic models in the C runtime. We think these concepts are similar and can be unified moving forwards.

Here our intent isn't to choose any one implementation over another but purely to ensure we can reduce the conceptual complexity of the compiler over time while retaining flexibility of the compiler. My thoughts are that adopting this approach (and therefore, going ahead with USMP now) should allow others to begin building the static memory planning algorithms while we are working to simplify the compiler, and this will ensure we can still continue to support any memory planning algorithms developed on top of USMP in the medium-long term.

Please let me know if this makes sense to you! I think this should allow everyone to continue working without stepping on each others' toes excessively.

-Andrew
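A rough sketch of the iterative fallback described under point #2 above (all function names here are hypothetical, not existing TVM APIs):

```python
# Hypothetical sketch of "run the pipeline again from scheduling if the plan
# doesn't fit": none of these functions exist in TVM under these names.
def compile_with_retries(candidate_schedules, lower, plan_memory, pool_limits):
    for schedule in candidate_schedules:
        module = lower(schedule)                 # schedule -> full-program TIR
        plan = plan_memory(module, pool_limits)  # e.g. a USMP-style planner; returns pool -> usage
        if all(plan[pool] <= limit for pool, limit in pool_limits.items()):
            return module, plan                  # first plan that fits wins
    raise RuntimeError("no candidate schedule fits the given memory pools")
```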

Contributor Author

Hi @areusch @csullivan,

Thanks for the clarification; it makes sense to us.

Contributor

@areusch areusch left a comment

hi @manupa-arm, just a few more questions. if they're answered satisfactorily by @jroesch @mbs-octoml @tqchen and they approve this feel free to merge without my approval.

* All Operator Functions
* the maximum size for each pool We could use "pinned_memory" (see below) to tag buffers with suggested priority order determined by the scheduler.

The idea is USMP will try to pool them using the preferred "pinned_memory" and fallback whenever the size is exceeding the user provided max size for each pool (if any)
Contributor

could you elaborate on "fallback"? Presumably this would be configurable--if the user doesn't want USMP to fallback when it exceeds memory boundaries, we should be able to configure this, no?

Contributor

can you address this here? some questions i don't see answered:

  1. where are the "fallback" candidate_memory_pools passed in to the runtime?
  2. how should the compiler know which candidate_memory_pools can be fallbacks for a given buffer (given that it does know a particular buffer needs to be accessible per-device)
  3. do we need to add an additional field to PoolInfo to identify which devices can use it? it's fine if this is to be decided in a follow-up RFC, but can you indicate which one (if you have one in mind) or that one hasn't been chosen (if not)?

* All Operator Functions
* the maximum size for each pool We could use "pinned_memory" (see below) to tag buffers with suggested priority order determined by the scheduler.

The idea is USMP will try to pool them using the preferred "pinned_memory" and fallback whenever the size is exceeding the user provided max size for each pool (if any)
Contributor

also can you clarify "fallback" here as well?

<compute>
...
```
##### Step 2 : Run an analysis pass to populate a Map<tir::Var, BufferInfo> that contains buffer information as defined above (See the struct BufferInfo).
Contributor

could you sketch this out briefly?
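Purely as an illustration of what such an analysis could produce (not the actual pass): given liveness intervals for the buffer Vars in execution order, the conflicts field can be derived by marking any two buffers whose intervals overlap. The interval extraction from TIR is elided here, and the inputs are assumptions.

```python
# Illustration only: derive pairwise conflicts from liveness intervals.
# In the real pass the intervals would come from walking the tir.allocate
# scopes of the main function and the operator PrimFuncs.
def find_conflicts(liveness):
    """liveness: buffer var name -> (first_use, last_use) in execution order."""
    conflicts = {name: [] for name in liveness}
    names = list(liveness)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            a_start, a_end = liveness[a]
            b_start, b_end = liveness[b]
            if a_start <= b_end and b_start <= a_end:  # intervals overlap
                conflicts[a].append(b)
                conflicts[b].append(a)
    return conflicts

# e.g. two intra-operator buffers overlapping each other but not a later tensor:
print(find_conflicts({"PaddedInput": (0, 2), "DepthwiseConv2d": (1, 3), "output": (4, 5)}))
```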

```
##### Step 2 : Run an analysis pass to populate a Map<tir::Var, BufferInfo> that contains buffer information as defined above (See the struct BufferInfo).

##### Step 3 : Use the updated Map<tir::Var, BufferInfo> to generate Array<BufferInfo>, Map<String, Integer> pool_sizes
Contributor

how are the Vars assigned to memory pools?

@areusch areusch added the status: need update RFC needs update based on feedback label Aug 28, 2021
Contributor

areusch commented Sep 7, 2021

@manupa-arm could you address my last round of comments? i think we're good to merge after that

*reflecting the partial changes for tir pinned memory representation
*addressing Andrew's comments

Change-Id: I40019ecb8e75ba46b1bf415ea70718bbeab3d26b
Contributor Author

manupak commented Sep 20, 2021

@areusch I got some cycles to spend on this. I've updated the RFC addressing your comments and reflecting the changes discussed here as well : https://github.com/apache/tvm-rfcs/blob/c447cbfbd5abceaa7623a0f90cc492784e6f0c0b/rfcs/0023-adding-annotation-field-to-tir.allocate.md.

PTAL when you get some time. TIA.

@manupak manupak added status: need review RFC needs review status: need update RFC needs update based on feedback and removed status: need update RFC needs update based on feedback status: need review RFC needs review labels Sep 20, 2021
* Each tir.allocate in the IRModule annotated with candidate pools ([Using the annotation field of tir.allocate](https://github.com/apache/tvm-rfcs/blob/c447cbfbd5abceaa7623a0f90cc492784e6f0c0b/rfcs/0023-adding-annotation-field-to-tir.allocate.md))


```
Contributor

er, this is kind of just here without much explanation. it makes sense in light of the comment threads here.. but can you add something more explicit to indicate that this is an input? even if we are not quite sure how this is going to land and just defer to that other RFC, it's helpful to state that here.


Contributor

@areusch areusch left a comment

@manupa-arm i did another pass here. just a couple more clarifying questions on the overall approach. i'm okay with deferring some decisions to follow-on RFCs when they become more clear (e.g. I think progress on @jroesch TargetDevice work may influence some of the ways we encode memory pool accessibility per-device), but can we explicitly make it clear in the RFC when we are deferring something and specify which RFC it'll be decided under, if we know?

*explaining the fallback and candidate_memory_pools

Change-Id: Iab59de953bd931fe44ae77004f8c014e25b126f8
Contributor Author

manupak commented Sep 27, 2021

Hi @areusch ,

I have addressed candidate_memory_pool query now.

For your question around fallback :

where are the "fallback" candidate_memory_pools passed in to the runtime?

The fallback only happens at compile time, as per this RFC. Therefore, by the time USMP is done, one pool will be assigned/decided for each tir.allocate.

how should the compiler know which candidate_memory_pools can be fallbacks for a given buffer (given that it does know a particular buffer needs to be accessible per-device)

I have added a sentence to explain this. The core compiler should assign each tir.allocate in each PrimFunc the pools it can access, because each PrimFunc knows which target it gets compiled to and each pool knows which targets can access it. Initially, it will take the priority order the user provides to the TVMC interface.

do we need to add an additional field to PoolInfo to identify which devices can use it? it's fine if this is to be decided in a follow-up RFC, but can you indicate which one (if you have one in mind) or that one hasn't been chosen (if not)?

Im pretty sure it is listed as Map<Target,String> target_access; // 'rw' or 'ro' in the PoolInfo.
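To illustrate that rule (a hedged sketch: PoolInfo is modelled as a plain dict, and nothing here is actual compiler code), the candidate pools for a PrimFunc's allocations are the pools whose target_access includes that PrimFunc's target, kept in the user-supplied priority order so the planner can fall back down the list when a size limit is exceeded:

```python
# Sketch only: pick candidate pools for allocations in a PrimFunc, based on the
# rule described above. PoolInfo is modelled as a dict; priority order is the
# order the user supplied (e.g. on the TVMC command line).
def candidate_pools(primfunc_target, pools_in_priority_order):
    return [
        pool["name"]
        for pool in pools_in_priority_order
        if primfunc_target in pool["target_access"]  # 'rw' or 'ro' access
    ]

pools = [
    {"name": "sram", "target_access": {"c": "rw", "accel": "rw"}},
    {"name": "dram", "target_access": {"c": "rw"}},
]
assert candidate_pools("accel", pools) == ["sram"]
assert candidate_pools("c", pools) == ["sram", "dram"]
```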

Contributor

@areusch areusch left a comment

@manupa-arm thanks. can you do a pass through the remaining items here (e.g. @electriclilies had some clarifying suggestions to the text).

For your question around fallback :

where are the "fallback" candidate_memory_pools passed in to the runtime?

The fallback only happens at compile time, as per this RFC. Therefore, by the time USMP is done, one pool will be assigned/decided for each tir.allocate.

sorry but i mean; how do we determine a pool to be a fallback pool?

how should the compiler know which candidate_memory_pools can be fallbacks for a given buffer (given that it does know a particular buffer needs to be accessible per-device)

I have added a sentence to explain this. The core compiler should assign each tir.allocate in each PrimFunc the pools it can access, because each PrimFunc knows which target it gets compiled to and each pool knows which targets can access it. Initially, it will take the priority order the user provides to the TVMC interface.

thanks! can you add the ordering bit to the RFC?

do we need to add an additional field to PoolInfo to identify which devices can use it? it's fine if this is to be decided in a follow-up RFC, but can you indicate which one (if you have one in mind) or that one hasn't been chosen (if not)?

Im pretty sure it is listed as Map<Target,String> target_access; // 'rw' or 'ro' in the PoolInfo.

ah--you're correct.


* Improving text
* Adding more specifics to how to handle fallback pool
* Renamed TVMC arugments to be pools instead of buffer
Contributor Author

manupak commented Oct 4, 2021

@electriclilies -- sorry I missed your suggestions -- corrected them now.

@areusch -- addressed the comments now.

Contributor

@areusch areusch left a comment

thanks @manupa-arm !

@areusch areusch merged commit 3834d72 into apache:main Oct 4, 2021
Contributor

areusch commented Oct 4, 2021

@manupa-arm please open a tracking issue for this one. excited to see this land!

MichaelJKlaiber added a commit to MichaelJKlaiber/tvm-rfcs that referenced this pull request Apr 19, 2022
MichaelJKlaiber added a commit to MichaelJKlaiber/tvm-rfcs that referenced this pull request Apr 19, 2022
Merge pull request apache#9 from MichaelJKlaiber/rfc_uma