Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decoupling AOT from graph memory planner #8096

Merged
merged 7 commits into from Jun 29, 2021
Merged

Conversation

giuseros
Copy link
Contributor

In this PR we are decoupling AOT from the Graph Memory Planner. Since AOT has the runner expressed in TIR we can get rid of the GMP in relay and use the Storage Rewrite Pass to do memory planning on the runner function. This also sorts out the issue mentioned in #8062

@giuseros
Copy link
Contributor Author

@mehrdadh
Copy link
Member

thanks @giuseros for working on this! I tested it and it solves the issue that I had. So I close the issue assuming this will merge sometimes soon.

Copy link
Contributor

@manupak manupak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix @giuseros !

My comments are mostly about documentation and clarification questions.

Also I think you wanted to mention : @mbaret and not Matthew Bentham (@MatthewARM ) :)

src/relay/backend/aot_executor_codegen.cc Outdated Show resolved Hide resolved
src/relay/backend/aot_executor_codegen.cc Outdated Show resolved Hide resolved
src/relay/backend/aot_executor_codegen.cc Outdated Show resolved Hide resolved
src/relay/backend/aot_executor_codegen.cc Outdated Show resolved Hide resolved
src/relay/backend/aot_executor_codegen.cc Outdated Show resolved Hide resolved
@u99127
Copy link
Contributor

u99127 commented May 21, 2021

thanks @giuseros for working on this! I tested it and it solves the issue that I had. So I close the issue assuming this will merge sometimes soon.

I think we should close issues when things merge not before that actually happens.

@giuseros giuseros force-pushed the aot-remove-gpm branch 2 times, most recently from 443e7a8 to 836648e Compare May 21, 2021 14:24
@giuseros
Copy link
Contributor Author

@manupa-arm I applied your comments. Sorry, this was a draft that I quickly turned into a PR, hence the in-elegance of the code. Please have another look and let me know.

Copy link
Contributor

@manupak manupak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broadly looking good!.

Just one more question.

src/relay/backend/aot_executor_codegen.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@manupak manupak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

src/relay/backend/aot_executor_codegen.cc Outdated Show resolved Hide resolved
src/relay/backend/aot_executor_codegen.cc Outdated Show resolved Hide resolved
src/relay/backend/aot_executor_codegen.cc Outdated Show resolved Hide resolved
src/relay/backend/aot_executor_codegen.cc Outdated Show resolved Hide resolved
tests/python/relay/aot/test_crt_aot.py Show resolved Hide resolved
@giuseros
Copy link
Contributor Author

Hi @areusch , @manupa-arm , @mehrdadh ,
Quick update on this. We found an issue in the storage rewrite pass. The problem is due to the fact that the packed functions accept TVMValues and not bare pointers. This means that in this TIR:

  let sid_5_value_1: handle = @tir.tvm_stack_alloca("array", 1, dtype=handle)
   @tir.tvm_struct_set(sid_5_value_1, 0, 1, sid_5, dtype=handle)
   let sid_4_value: handle = @tir.tvm_stack_alloca("array", 1, dtype=handle)
   @tir.tvm_struct_set(sid_4_value, 0, 1, sid_4, dtype=handle)
    {
     @tir.tvm_call_cpacked("fused_transpose", sid_5_value_1, sid_4_value, dtype=int32)
   }

The rewriter is able to identify sid_5_value_1 and sid_4_value as both live variables, and they are not overridden. But it fails to see that those variable relate to sid_5 and sid_4 (respectively).

Early on, in the calls:

@tir.tvm_struct_set(sid_5_value_1, 0, 1, sid_5, dtype=handle)
@tir.tvm_struct_set(sid_4_value, 0, 1, sid_4, dtype=handle)

The variables sid_5 and sid_4 are treated as separate variables (that are NOT live at the same time) and hence get overridden. In the last commit I explicitly create a translation table between the TVMValues and the allocations. This seems to fix the issue. I have added also a further test case test_transpose (disabling the op-fusion in Relay) to test this scenario.

@giuseros
Copy link
Contributor Author

giuseros commented Jun 3, 2021

Hi @areusch , @manupa-arm ,
I was wondering if there was anything else you may want me to address on this PR.

Thanks!

@PhilippvK
Copy link
Contributor

thanks @giuseros for working on this! I tested it and it solves the issue that I had. So I close the issue assuming this will merge sometimes soon.

I think we should close issues when things merge not before that actually happens.

I agree with this. I have experienced the same problems as described in #8062 and thought they should be already fixed. I’ve also tried out the proposed changes and they work for me at least for basic use cases.

@mehrdadh
Copy link
Member

mehrdadh commented Jun 7, 2021

@giuseros there is a lint issue due to black update. You can run "docker/lint.sh python_format" to see the error locally.

@giuseros
Copy link
Contributor Author

giuseros commented Jun 7, 2021

Done! Thanks @mehrdadh

Copy link
Contributor

@manupak manupak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @giuseros! I dont have further comments.

Copy link
Contributor

@areusch areusch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the delay @giuseros , here's a round of comments. this seems overall fine, but i'm curious about the special handling we are adding to StorageRewritePass for the AOT top-level func. i'm also curious whether we intend to eventually abstract this behind the interface from @manupa-arm RFC?

src/relay/backend/aot_executor_codegen.cc Outdated Show resolved Hide resolved
src/tir/transforms/storage_rewrite.cc Outdated Show resolved Hide resolved
src/relay/backend/aot_executor_codegen.cc Outdated Show resolved Hide resolved
src/relay/backend/aot_executor_codegen.cc Outdated Show resolved Hide resolved
src/relay/backend/aot_executor_codegen.cc Outdated Show resolved Hide resolved
src/tir/transforms/storage_rewrite.cc Outdated Show resolved Hide resolved
this->VisitExpr(arg);
}
}
} else if (op->op.same_as(builtin::tvm_struct_set())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these the only two such builtins we need to care about? seems like any access to the data would be affected, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, for called_packed, we don't care about accessing the data (tvm_struct_get), but only setting the data of the TVMValue. But in general it is true that other patterns might be problematic if used.

I am not sure about tvm_struct_get in particular, because you would use the (real) data you extracted, while with tvm_struct_set you use the real data to set the structure and then you pass the structure in the call. This change is only to annotate that the struct is a mere container, and that the rewriter needs to focus on the real data.

I would say that this PR is not trying to fix the rewriter on all the possible corner cases. This particular corner case is affecting AOT, and it needs to be fixed. If there are other corner cases, I would fix them in a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess another way to view this though is that this PR is adding reference tracking, but reference tracking only works with one particular TIR pattern. i agree with you that adding full reference tracking is a lot to ask particularly considering the goals of this PR. However, there's no documentation (by which I mean no unit test) as to what specifically needs to be tracked. can you add a unit test for the changes you've made to StorageRewrite?

tests/python/relay/aot/test_crt_aot.py Show resolved Hide resolved
tests/python/relay/aot/test_crt_aot.py Show resolved Hide resolved
@giuseros
Copy link
Contributor Author

giuseros commented Jun 7, 2021

Hi @areusch ,
I applied your changes.

  • About the storage rewrite changes, I tried to explain more in the code. It's about the fact that in order to call a function, you need to set a struct. The rewriter focuses on the structs (marking them conflicting) but the real allocations are lost (and get reused even when they don't need to be). If you have more questions, feel free to ask.
  • The point about testing is that it's very useful for us to test a quantized mobilenet, and coming up with the edge situation that that network is testing is non-trivial. So I saw having that network test as a two-bird-one-stone situation.
  • About the unified static memory planner, yes. This change will be incorporated by the planner that will replace the storage rewrite pass.

@giuseros
Copy link
Contributor Author

Hi @areusch ,
Any more thoughts on this?

Thanks,
Giuseppe

Copy link
Contributor

@areusch areusch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @giuseros @manupa-arm ,

i chatted with @tqchen about this change and i think one concern we share is that there isn't a good way to delineate between the reference-counting functionality being added to StorageRewrite, which is fairly small, and a full-fledged implementation. @tqchen pointed out that given the lack of a generalized ref-counting implementation and rules for what TIR authors can do with addresses, likely the correct thing to make StorageRewrite robust agains this problem is to add logic to make StorageRewrite a no-op if tvm_struct_set is used to assign a buffer pointer. However, that means that StorageRewrite becomes useless for this use case.

I also think that on further reading, this may not actually solve the original problem from #8062. there, the problem was that the declared size of the output tensor was actually too small because GraphPlanMemory enlarged the storage_id it belonged to. Because StorageRewrite has authority to modify the parameters to the top-level AOT function, it can indeed correct the inaccurate size. However, that's not what we want here, right? We want it to be the case that the output tensor need only be as large as the graph output (e.g. no reuse of that tensor should happen). Apologies for not catching this earlier, but I'm not sure I see that StorageRewrite actually prevents that (or let me know if you see differently).

given this, it seems like this approach might be a bit fragile to keep logic in StorageRewrite that is fairly tailored to AOTExecutorCodegen. at minimum I think we'd need a unit test suite to exercise the AOTExecutorCodegen TIR patterns in StorageRewrite, but I'm concerned that cross-talk between that and other StorageRewrite use cases may lead to difficulties down the road (e.g. multiplying test cases and StorageRewrite becoming quite complex). @tqchen suggested perhaps a better approach is to apply the results of memory planning to AOTExecutorCodegen and handle buffer reuse manually there--this way any address assignments are purposeful.

what are your thoughts on this?

i added a couple other comments from an earlier reading of this PR too, though they may be moot given the above.

src/relay/backend/aot_executor_codegen.cc Show resolved Hide resolved
this->VisitExpr(arg);
}
}
} else if (op->op.same_as(builtin::tvm_struct_set())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess another way to view this though is that this PR is adding reference tracking, but reference tracking only works with one particular TIR pattern. i agree with you that adding full reference tracking is a lot to ask particularly considering the goals of this PR. However, there's no documentation (by which I mean no unit test) as to what specifically needs to be tracked. can you add a unit test for the changes you've made to StorageRewrite?

tests/python/relay/aot/test_crt_aot.py Show resolved Hide resolved
@giuseros
Copy link
Contributor Author

giuseros commented Jun 11, 2021

Hi @areusch,
Thanks for your reply. I think what you are saying is can be summarized in:
a) The StorageRewrite pass does not solve our problem because it can still touch the output buffers (which are parameters of the tir PrimFunc)
b) The solution that we have put in place in StorageRewrite does not address the reference counting in an general way.

About a), I disagree. StorageRewrite only rewrites tir.Allocate nodes, and the output is not allocated in our main runner function. This is why if you run mobilenet quantized with the StorageRewrite pass, it works (and it doesn't work with Graph Memory Planning). So StorageRewrite is a perfectly good candidate to solve the issue we have in #8062 . The problem with StorageRewrite is that it doesn't handle tvm_struct_set correctly, and this brings us to point b)

About b), I agree. The solution we are proposing is not general, but any general solution can be built on top of it. As you said, we are trying to solve a specific case, but when you tackle the general case, the code I wrote is still valid. Especially in a TIR based world (which is the aim of TensorIR, afaiu) I think we will want (at some point) to extend this solution instead of accepting that StorageRewrite does not work with some TIR built-ins.

In other words, our solution is not tailored to AOT at all. It is a partial solution to a generic problem that (only) AOT is hitting (for now). It cannot break anything in the future, because either the user is doing what AOT is doing, and it will work. For more edge cases, StorageRewrite won't work as it didn't use to work before.

Bear also in mind that until we have a static memory planner, without this solution AOT is not usable. If you want, we can try to write a test that stresses the problem we are trying to solve with the change in StorageRewrite.

What do you think?

@areusch
Copy link
Contributor

areusch commented Jun 11, 2021

hi @giuseros,

you're right about a)--my analysis was incorrect of StorageRewrite and it shouldn't modify the parameters appreciably to the memory map. thanks for correcting me on that.

on b), my concern is:

  • do other uses of StorageRewrite exist that may be affected by the narrow approach taken here?
  • how do we document what the narrow approach needs to accomplish using unit tests so that, if others run into problems with StorageRewrite elsewhere, they don't change it and break us?

On the latter point, my main critique is that the unit tests are pretty coarse (e.g. they are integration tests). there are enough moving pieces that i'm not sure it's obvious from the tests how StorageRewrite should work (it probably is today, but things could change over time that would make it less so). possible to add a finer-grained unit test e.g. in tests/python/unittest/test_tir_transform_storage_rewrite.py?

@tqchen also raised a concern on whether it's feasible to push StorageRewrite in the direction of having pointer analysis. The problem posed was: we currently pass buffers to operator implementation as pointers, and the pointer lifetime of those buffers is clear enough from convention (when the operator impl returns, they are invalid; so impl cannot save those pointers). if we start down the path of adding a more general reference counting impl to StorageRewrite, that implies we may be passing user data structs that contain pointers to operator implementations. however, it isn't really clear how we should explain the lifecycle of those pointers.

sort of I think this is a minor point given the current impl, but I also agree that if we are arguing that StorageRewrite's struct_set analysis is in fact a limited application of a general pattern, and implementing the general pattern is also fine, we do need to then elaborate where we are going with it. i do think keeping the impl in AOTExecutorCodegen doesn't carry this problem.

another possible way to alleviate both concerns would be to define an attr on the tir.tvm_stack_alloca node that explicitly instructs StorageRewrite to link that tvm_stack_alloca with a given AllocateNode. what do you think about that?

@giuseros
Copy link
Contributor Author

giuseros commented Jun 14, 2021

Hi @areusch ,
After some thinking we came up with a different solution. The way we are doing things now is the following:
pass-a) Compose the main_func with tvm_set_struct, i.e., codegen ready
pass-b) Storage rewrite modified to take care of the structs
pass-c) tvm::build

A possible alternative can be the following:
pass-a2) Compose the main_func without tvm_set_struct. This means that the packed calls will receive raw pointers. This in turn means that the main_func in TIR is not ready to be code generated
pass-b2) Storage rewrite without any change. This is possible now since we are not using tvm_set_struct yet
pass-c2) transform the packed calls inputs by using tvm_set_struct. We can actually avoid this pass if we are using unpacked signatures.
pass-d2) tvm::build

With pipilene-2 we can leave StorageRewrite unchanged and still address the issues that tvm_set_struct gives us. What do you think?

@manupak
Copy link
Contributor

manupak commented Jun 17, 2021

Hi @areusch ,

a friendly ping!
It'd be good hear your thoughts on the second alternative mentioned here, for us to proceed in that approach.

As @giuseros mentioned here, we figured out adding aliasing (partial / incremental) support for StorageRewrite might not be required for this fix. In the pass-a2, if we did not create the struct early on, StorageRewrite on its own will work on raw pointers (not aliased via structs).

Also, post-StorageRewrite TIR, we could essentially create the structs if packed API was desired, in a subsequent pass (pass-c2).

@areusch
Copy link
Contributor

areusch commented Jun 17, 2021

@manupa-arm @giuseros apologies for the delay!

yeah the alternative approach makes sense to me. i think it's better to avoid adding special-case logic to StorageRewrite when we don't have a clear plan to implement the general case.

one question on the overall plan: i imagine after this, to implement USMP, there would be an additional pass to replace the allocate calls with indexing into workspace buffers. would USMP then run StorageRewrite before memory planning, or would that be left up to the memory planning implementation?

@manupak
Copy link
Contributor

manupak commented Jun 17, 2021

Yes, the overall plan, we'd not like to run StorageRewrite before the USMP passes are run because it will perform non-optimal local sharing inside the primfunc. The plan is to move StorageRewrite pass down after USMP passes if there is anything left out StorageRewrite could help with.

@giuseros
Copy link
Contributor Author

giuseros commented Jun 21, 2021

Hi @manupa-arm , @areusch ,
Here is a second attempt to resolve this. I basically applied what we discussed. I also cleaned a bit the TIR produced by removing all the tir::Evaluate(0) statements. Please, let me know what you guys think!

@giuseros giuseros force-pushed the aot-remove-gpm branch 3 times, most recently from fc7eb40 to 6b010fd Compare June 22, 2021 15:37
@giuseros
Copy link
Contributor Author

Hi @manupa-arm , @areusch ,
Quick update on this. I added a unit test and slightly changed the pass to address some issues. This should be good enough to be reviewed.

Thanks,
Giuseppe

@giuseros
Copy link
Contributor Author

A friendly ping @manupa-arm , @jroesch , @areusch !

Copy link
Contributor

@areusch areusch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @giuseros, i think the new approach makes sense. i would like @manupa-arm and perhaps @jroesch to explicit-approve before we merge this


// Allocate the TVMValues on the stack and define the variables
for (auto v : tvm_values) {
call_stmt = LetStmt(v, StackAlloca("array", 1), call_stmt);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wondering if we may want to eventually create a separate memory pool for these, and whether you'd be open to (in a future PR) converting into tir.allocate optionally

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it is a very good idea, and I would personally like a similar direction. @manupa-arm, what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that sounds good! (not sure we want a seperate pool but I can see them pooled to 'a' workspace buffer)

Copy link
Contributor

@manupak manupak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!


// Allocate the TVMValues on the stack and define the variables
for (auto v : tvm_values) {
call_stmt = LetStmt(v, StackAlloca("array", 1), call_stmt);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that sounds good! (not sure we want a seperate pool but I can see them pooled to 'a' workspace buffer)

Giuseppe Rossini added 6 commits June 28, 2021 18:13
Change-Id: I13888471d4b8927a4012d6a8e749fb7a8935dd77
Change-Id: I7aa12e0217b8a2e1ff2a97a7c5fdda6b7597ae64
Change-Id: If9f1ee190690f9a810fe41eb1933d736f1eb4ec3
Change-Id: I8aa43d3a1b837b03a5cf3c6b32fc760bd78d3436
Change-Id: I5b0d75380ff660dd5a0acf5b14fa84bb992fbec4
Change-Id: I52ceab5cf6e9b54390cb36c18dbb8e22505d8e18
@giuseros
Copy link
Contributor Author

Hi @manupa-arm , @jroesch , @areusch ,
I just rebased and had to change StorageInfo to TempStorageInfo because of a conflict. Please, let me know if this is OK for you!

Thanks,
Giuseppe

@areusch
Copy link
Contributor

areusch commented Jun 28, 2021

hey @giuseros, i think @jroesch was hoping with #8297 that you could re-use that common StorageInfo struct for your purposes here. is it possible to do that? sorry if this was unclear from that PR.

@giuseros
Copy link
Contributor Author

Hi @areusch , I can reuse that in theory, but TempStorageInfo is only a local structure used to have on-demand memory allocation, i.e., I don't need to pass the TempStorageInfo around. This was the reason you suggested to not use struct of arrays here: #8096 (comment). I personally agreed with your comment and would prefer the way it is (i.e., array of structs), because using a struct of arrays makes the code less readable. However, if you guys feel strongly about reusing the StorageInfo struct of arrays, I can surely do it

@jroesch
Copy link
Member

jroesch commented Jun 28, 2021

@giuseros I would prefer to also use an array of structs, but my goal is to just get us to use the same data structure everywhere since there are multiple places where we are storing different versions of the same data without a shared data structure. We could move the Array outside and simplify the struct but it would be good to move memory planning to also use the same structure.

I guess one solution could be merge as is, I can rewrite the MemoryPlanning to use the Array of struct and then you can update to use the same structure.

Change-Id: Ia8b7de1373f167ca7d0d69a99846d417405bbe48
@giuseros
Copy link
Contributor Author

Hi @jroesch,
The solution I picked is to use the common StorageInfo so that later on we can have a PR to move from SoA to AoS for all the users of that structure (i.e., all the memory planning algorithms).

Thanks,
Giuseppe

@areusch areusch merged commit b803bab into apache:main Jun 29, 2021
@areusch
Copy link
Contributor

areusch commented Jun 29, 2021

thanks for working through this one with us @giuseros, the PR is now merged! we will now temporarily prioritize #7518 over further compiler changes. we aim to land that this week, so feel free to send additional PRs based on top of that if you'd like.

ylc pushed a commit to ylc/tvm that referenced this pull request Sep 29, 2021
* Fix an issue with storage-rewrite pass and packed functions

Change-Id: I13888471d4b8927a4012d6a8e749fb7a8935dd77

* Rebasing

Change-Id: I7aa12e0217b8a2e1ff2a97a7c5fdda6b7597ae64

* Addressing comments

Change-Id: If9f1ee190690f9a810fe41eb1933d736f1eb4ec3

* Add a pass to legalize packed calls

Change-Id: I8aa43d3a1b837b03a5cf3c6b32fc760bd78d3436

* Add a unit test for the legalization pass

Change-Id: I5b0d75380ff660dd5a0acf5b14fa84bb992fbec4

* rebasing

Change-Id: I52ceab5cf6e9b54390cb36c18dbb8e22505d8e18

* Use common StorageInfo

Change-Id: Ia8b7de1373f167ca7d0d69a99846d417405bbe48
zxy844288792 pushed a commit to zxy844288792/tvm that referenced this pull request Mar 4, 2022
* Fix an issue with storage-rewrite pass and packed functions

Change-Id: I13888471d4b8927a4012d6a8e749fb7a8935dd77

* Rebasing

Change-Id: I7aa12e0217b8a2e1ff2a97a7c5fdda6b7597ae64

* Addressing comments

Change-Id: If9f1ee190690f9a810fe41eb1933d736f1eb4ec3

* Add a pass to legalize packed calls

Change-Id: I8aa43d3a1b837b03a5cf3c6b32fc760bd78d3436

* Add a unit test for the legalization pass

Change-Id: I5b0d75380ff660dd5a0acf5b14fa84bb992fbec4

* rebasing

Change-Id: I52ceab5cf6e9b54390cb36c18dbb8e22505d8e18

* Use common StorageInfo

Change-Id: Ia8b7de1373f167ca7d0d69a99846d417405bbe48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants