[RFC] Introducing DeclBuffer #70

Merged: 6 commits merged into apache:main on Jun 10, 2022

Conversation

@vinx13 (Member) commented May 10, 2022

This is a follow-up to apache/tvm#9727 and RFC#63. Currently, a buffer can be implicitly declared and then used. This implicit behavior can be error-prone and makes analysis more difficult. This RFC introduces DeclBuffer, a new IR construct that serves as an explicit statement for buffer declaration.

Rendered version: https://github.com/vinx13/tvm-rfcs/blob/decl-buffer/rfcs/0070-introducing-decl-buffer.md

cc @tqchen @Lunderberg @junrushao1994 @csullivan @mbs-octoml @jroesch @areusch @wrongtest @Hzfengsy

Co-authored-by: Eric Lunderberg <Lunderberg@users.noreply.github.com>
@wrongtest-intellif commented:
Thanks a lot! I think we can then handle buffer-related issues in customized passes in a more explicit and robust way.

I have one question on TIR script: for certain algorithms in DL workloads, users may want to write scripts that are not in S-TIR form, like

x = T.allocate((), "int32", "")
x[()] = 0
while x[()] < 128:
    x[()] = x[()] + 1
    # ...

Could the parser still support writing things like that (even though the underlying IR structure has changed), instead of

x_data = T.allocate((), "int32", "")
x = T.decl_buffer(data=x_data,)
x[()] = 0
# ...

@vinx13 (Member, Author) commented May 11, 2022

@wrongtest Thanks for bringing this up. There are a few options for the behavior in TVM script; I'm open to discussion.

  • A1: The original behavior before [TE][TIR] Implement layout transformations, non-flat memory buffers tvm#9727:
    T.allocate returns a Var, which can be later used in T.load / T.store.

  • A2: Current behavior:
    T.allocate returns a buffer (Var is implicitly created inside). The buffer can be accessed via subscripts, which will be translated to BufferLoad / BufferStore.

  • A3: Potential behavior for this RFC:
    T.allocate returns a Var; T.decl_buffer is then used to create the buffer and access it (because T.load / T.store are deprecated). This follows the translation to TIR nodes closely.

  • A4: Potential behavior for this RFC (the one @wrongtest is proposing):
    If there are no buffer aliases (which covers most cases), it is tempting to avoid the boilerplate of T.decl_buffer.
    T.allocate returns a buffer, whose buffer->data is implicitly created and allocated inside. The buffer can be accessed via subscripts. Subsequent buffer aliases can be created by referring to buffer->data. This requires DeclBuffer to be created implicitly; it is acceptable to have some discrepancy between the parser side and the TIR nodes (see the sketch below).
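
To make the difference concrete, here is a rough sketch of how A3 and A4 could look in TVMScript, reusing the example above (hypothetical syntax; the exact T.allocate / T.decl_buffer signatures are assumptions, not a settled API):

```python
# A3: T.allocate returns a Var; the buffer must then be declared explicitly.
x_data = T.allocate((), "int32", "")
x = T.decl_buffer((), "int32", data=x_data)
x[()] = 0

# A4: T.allocate returns a Buffer; its data Var and the DeclBuffer are created implicitly.
x = T.allocate((), "int32", "")
x[()] = 0
# a later alias (if needed) would refer to the underlying data pointer
x_view = T.decl_buffer((), "int32", data=x.data)
```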

@areusch (Contributor) left a comment:

thanks @vinx13, a couple of questions on this one as well

# Unresolved questions
[unresolved-questions]: #unresolved-questions

Should low-level code generators handle buffer aliases? One option would be to remove them in a

Contributor:

is this basically asking whether a codegen should have to track the data variables, or whether we should introduce an explicit TIR node that more readily translates to an index-into-opaque-pointer with type info?

Member Author:

It is asking whether buffer aliasing should be left to codegen (with codegen tracking the data variable), or whether buffer aliasing should be unified in an earlier pass.

However, `T.buffer_decl` doesn’t translate to a node in AST. The AST will be
```
PrimFunc {
buffer_map: {A: Buffer[(16,), "float"},

Contributor:

the A.data member should theoretically appear here, whether or not in repr, right? i think the central challenge here is determining that the data for a buffer is A's data member. is that right?

Member Author:

Yes. It should be `A_data: Buffer(data=A_data, ...)`.

data: A_data(Var(name=...))
extent: ...
body: DeclBuffer {
buffer: A(data=A_data, dtype=..., shape=...),

Contributor:

it's always required to supply data in decl_buffer world, right? so this changes the process of identifying aliases by making it explicit that all backing buffers would come from tir.allocate nodes or from function arguments, right?

Member Author:

Supplying data in decl_buffer is preferred. I mentioned an alternative in line 88, e.g.

DeclBuffer {
  buffer: A(data=A_data(Var(...)), dtype=..., shape=...),
  body: Allocate{
    data=A_data,
    body: ...
  }
}

The question here is whether this should be allowed.

Contributor:

i think it would be great to just be prescriptive here so that people don't try to write 10 different forms of buffer declaration. we can always expand the set of supported use cases if people can supply rationale, but reducing them after people check in a bunch of code without design docs is pretty tricky. to that end, could we clarify where we expect to get a data value from?

I also am not quite following this notation: A_data(Var(...)). Is A_data somehow a function here?

Contributor:

> The question here is whether this should be allowed

I think I'd lean toward removing the alternative form and requiring the definition of the data variable to be defined prior to use in a DeclBuffer node in the PrimFunc's TIR. If a user gets an error message that states A is undefined, they may point at the DeclBuffer node and say that of course it is defined, conflating the definition of the buffer and the definition of the buffer's backing allocation. The confusion is especially likely, because the default behavior of tvm.tir.decl_buffer is to name the BufferNode and the VarNode with the same name. By making it impossible to have a buffer without a defined data variable, we avoid potential confusion about why a buffer is partially undefined.

> that all backing buffers would come from tir.allocate nodes or from function arguments

They could also come from the return values of functions. After MakePackedAPI, the backing buffers are the return value of `@tir.tvm_struct_get`. They could also come from an entirely separate function call, such as `data: T.Ptr[T.int32] = T.call_extern("device_specific_malloc", 1024, dtype="handle")`.
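
For instance, a rough TVMScript-style sketch of that last case (hypothetical; `device_specific_malloc` is only a placeholder extern and the exact `T.decl_buffer` call shown is an assumption):

```python
# backing pointer obtained from an extern call rather than from T.allocate
data: T.Ptr[T.int32] = T.call_extern("device_specific_malloc", 1024, dtype="handle")
# the buffer is then declared explicitly over that pointer before any access
buf = T.decl_buffer((256,), "int32", data=data)
buf[0] = 0
```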

Member Author:

> Is A_data somehow a function here?

@areusch It should be A_data{Var(data = ..., )}. A_data is a variable.

> I think I'd lean toward removing the alternative form and requiring the definition of the data variable to be defined prior to use in a DeclBuffer node in the PrimFunc's TIR.

@Lunderberg That's also what I thought. Since we all agree on this, I'll update the RFC to be prescriptive here.

> They could also come from return values of functions.

Thanks for pointing that out. I'll update it accordingly.

store the physical shapes of the buffers. The change of the information stored in `buffer_map` can
be confusing. These two maps can be unified into a single `buffer_map` that defines the logical
shapes of the input buffers. The buffer access in physical shape, which is an internal behavior of
`PrimFunc` after flattening, can be achieved by using `DeclBuffer` to create buffer aliases in

Contributor:

will there be an easy way to do this implemented as part of this RFC? @Lunderberg mentioned something about a utility function to at least compute the flattened buffer_map.

Member Author:

@Lunderberg has a prototype for this in apache/tvm#10940; it is based on implicit buffer aliasing. This paragraph suggests moving to explicit buffer aliasing using DeclBuffer.

Contributor:

That's correct, the buffer flattening would be done through buffer aliasing. After apache/tvm#10940, the use of N-d/flattened buffers would be as below, depending on where you are in the lowering flow.

  1. Before FlattenBuffer/StorageFlatten: Buffers are declared in the buffer_map, and are not flattened. Buffer access is done using N-d unflattened indices.
  2. After FlattenBuffer/StorageFlatten, but before MakePackedAPI: Buffers are declared in the buffer_map, and are not flattened. Buffer access is done through a buffer alias, where the alias shares the same data pointer, but has a flattened shape and is accessed with flattened indices.
  3. After MakePackedAPI: The buffer_map is empty. Declarations of flattened buffers are done using the handles extracted using tvm_struct_get. These flattened buffers are accessed with flattened indices.

As I understand it, this RFC would impact the TIR during parts 2 and 3. Step 2 would have an explicit DeclBuffer to mark the aliasing, rather than an implicit aliasing based on re-used Var Buffer::data. Step 3 would have an explicit DeclBuffer to mark the use of the T.handle in a buffer, rather than being implicit in the first use.
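
To illustrate step 2, a rough sketch in TVMScript form, loosely based on the elemwise example in the rendered RFC (hypothetical; the alias names and the exact T.decl_buffer signature are assumptions):

```python
from tvm.script import tir as T

@T.prim_func
def elemwise(A: T.Buffer[(16, 16), "float32"], C: T.Buffer[(16, 16), "float32"]):
    # buffer_map still holds the unflattened (16, 16) buffers; the body goes
    # through explicitly declared flattened aliases that share the same data pointers
    A_flattened = T.decl_buffer((256,), "float32", data=A.data)
    C_flattened = T.decl_buffer((256,), "float32", data=C.data)
    for i, j in T.grid(16, 16):
        C_flattened[i * 16 + j] = A_flattened[i * 16 + j]
```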

Contributor:

thanks for elaborating that! could we add these to the RFC?

@Hzfengsy (Member) left a comment:

Thanks, @vinx13. That's a super cool RFC, enabling all buffers to be defined before use. It looks good to me. Looking forward to the follow-up PRs.

@vinx13 (Member, Author) commented May 20, 2022

@wrongtest I've thought about option A3 vs A4. From the perspective of parsing/translating TVM script to TIR, it is acceptable to have T.allocate translated to two nodes, Allocate + DeclBuffer. But it will be tricky for TVMScriptPrinter: we would need to find both the Allocate and DeclBuffer nodes and then print a single T.allocate, and these two nodes do not have to be parent/child of each other. I'm not sure whether this behavior, which breaks the 1-to-1 mapping between TVM script and TIR, is desirable. Alternatively, we could add an option to T.allocate, such as `def allocate(..., return_buffer: bool)`. What do you think?

@tqchen (Member) commented Jun 7, 2022

A gentle ping. It would be great to have a status check-in so we can move forward and make this happen.

@vinx13 (Member, Author) commented Jun 8, 2022

It seems we all agree that introducing DeclBuffer is helpful. The only unresolved question is how TVMScript should be updated, as @wrongtest mentioned. As discussed above, we have these options:

  • B1: In TVMScript, T.allocate and T.decl_buffer strictly map to the corresponding TIR nodes. To allocate and declare a buffer (in lowered TIR), there are two separate steps:
      data = T.allocate(physical_shape)
      buffer = T.decl_buffer(data=data, shape=physical_shape)
  • B2: Provide some syntactic sugar for T.allocate (adding a returns_buffer option), which is translated to AllocateNode + DeclBufferNode in TIR:
      buffer = T.allocate(physical_shape, returns_buffer=True)

When printing from TIR back to TVMScript, we would still print the unsugared B1 form.
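
As a rough sketch of that round trip (hypothetical syntax, reusing the names from B1/B2 above):

```python
# what the user writes with the B2 sugar
buffer = T.allocate(physical_shape, returns_buffer=True)

# what TVMScriptPrinter would emit for the resulting AllocateNode + DeclBufferNode
data = T.allocate(physical_shape)
buffer = T.decl_buffer(data=data, shape=physical_shape)
```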

Note that in B2, it is not feasible to make T.allocate always return the created buffer, because

  1. there might still be a need to directly use the buffer var in lower-level TIR;
  2. when printing TIR to TVMScript, AllocateNode and DeclBufferNode do not always appear in the same place, so it is difficult (and not preferred) to map two TIR nodes into one TVMScript statement.

would love to know what you think @wrongtest @Hzfengsy

@tqchen (Member) commented Jun 8, 2022

We always need the original syntax to preserve the bidirectional property. It does not hurt, though, to enable sugar that gives us a combined form on top. How about we start with the original syntax and then discuss sugar as a separate topic?

On the sugar part, perhaps we can reuse the alloc_buffer keyword.

@wrongtest-intellif commented:
Reusing T.alloc_buffer seems good, as long as there is no ambiguity for the parser implementation :)

@vinx13 (Member, Author) commented Jun 8, 2022

@areusch @Hzfengsy I've updated the RFC. It is ready for another look.

rfcs/0070-introducing-decl-buffer.md (review on an outdated diff, resolved):
@@ -136,11 +138,26 @@ def elemwise(A: T.Buffer[(16, 16), "float32"], C: T.Buffer[(16, 16), "float32"])
C_flattened[i * 16 + j] = A[i * 16 + j]
```

Specifically, the updated flow of buffer flattening using `DeclBuffer` will be:
1. Before `FlattenBuffer/StorageFlatten`: Buffers are declared in the `buffer_map`, and are not flattened. Buffer access is done using N-d unflattened indices.
2. After `FlattenBuffer/StorageFlatten`, but before `MakePackedAPI`: Buffers are declared in the `buffer_map`, and are not flattened. Buffer access is done through a buffer alias explicitly created via `DeclBuffer`, where the alias shares the same data pointer, but has a flattened shape and is accessed with flattened indices.

Contributor:

for some reason i thought our previous discussion was that Buffer would always contain unflattened shape and there would be a separate function that computes the flattened shape on the fly. did that change? if so why?

@vinx13 (Member, Author) commented Jun 9, 2022:

It has not changed. The buffer in buffer_map is always unflattened. After FlattenBuffer/StorageFlatten, we create a flattened view of the buffer using a buffer alias, and in the IR all accesses to the buffer are done via that flattened view. This is because, from the perspective of the PrimFunc calling convention, we always expect the input to have an unflattened shape (e.g. the shape of the DLTensor); but internally in the IR we need to flatten to the physical shape and indices, which is how it is done in codegen and the runtime.

Contributor:

oh sorry, i see now. the text is very clear i just misread it :)

@areusch (Contributor) left a comment:

thanks for bearing with me @vinx13 !

@areusch merged commit 87ff1fa into apache:main on Jun 10, 2022
Lunderberg added a commit to apache/tvm that referenced this pull request Nov 16, 2022
`PrimFuncNode::preflattened_buffer_map` was introduced in
#9727, in order to maintain a record
of the pre-flattened buffer shape until it can be used in
`MakePackedAPI`.  This commit instead maintains the pre-flattened
shapes in `PrimFuncNode::buffer_map`, while the body of the function
uses a flattened buffer alias, as described in
[RFC#70](apache/tvm-rfcs#70)