
GH-33923: [Docs] Tensor canonical extension type specification #33925

Merged Mar 15, 2023 (27 commits)

Changes from 6 commits

Commits
af571cb
Add Fixed size tensor spec to canonical extensions list
AlenkaF Jan 30, 2023
8231150
Apply suggestions from code review
AlenkaF Jan 30, 2023
884d871
Remove implementation-specific metadata
AlenkaF Jan 30, 2023
83edd70
Change order with is_row_major
AlenkaF Jan 30, 2023
16ef6f1
Update docs/source/format/CanonicalExtensions.rst
AlenkaF Jan 30, 2023
4f4ccce
Update metadata part
AlenkaF Jan 30, 2023
92fd7c6
Correct True to true in json
AlenkaF Jan 31, 2023
7873676
Change name from fixed_size_tensor to fixed_shape_tensor
AlenkaF Jan 31, 2023
a4219e3
Add description for ListType parameters
AlenkaF Jan 31, 2023
37e83db
Change the description for ListType parameters
AlenkaF Feb 1, 2023
5c92ff0
Remove is_row_major from the spec
AlenkaF Feb 2, 2023
cb5e2dd
Add dim_names and permutation to optional metadata
AlenkaF Feb 15, 2023
b562b8d
Add notes to the usage of dim_names and permutations metadata
AlenkaF Feb 15, 2023
c44101b
Update docs/source/format/CanonicalExtensions.rst
AlenkaF Feb 15, 2023
24e7c28
Add dim_names and permutation to optional parameters
AlenkaF Feb 15, 2023
333ae67
Add explicit explanation of permutation indices
AlenkaF Feb 15, 2023
4086dfb
Change order with layout
AlenkaF Feb 15, 2023
bd2a515
Rephrase text about absent permutation param
AlenkaF Feb 15, 2023
bc07d7a
Apply suggestions from code review - Joris
AlenkaF Feb 15, 2023
68c6244
Remove redundant sentence in permutations explanation
AlenkaF Feb 16, 2023
3e2bb25
Update value_type description
AlenkaF Feb 22, 2023
a49f14f
Update parameters description
AlenkaF Feb 22, 2023
89d8042
Add a logical layout shape example in the desc of the serialization
AlenkaF Feb 22, 2023
4ff7a65
Update docs/source/format/CanonicalExtensions.rst
AlenkaF Feb 22, 2023
1daf820
Update docs/source/format/CanonicalExtensions.rst
AlenkaF Feb 28, 2023
70059d9
Add note about IPC tensor
AlenkaF Mar 9, 2023
6f44296
Update docs/source/format/CanonicalExtensions.rst
AlenkaF Mar 10, 2023
24 changes: 23 additions & 1 deletion docs/source/format/CanonicalExtensions.rst
@@ -72,4 +72,26 @@ same rules as laid out above, and provide backwards compatibility guarantees.
Official List
=============

No canonical extension types have been standardized yet.
Fixed size tensor
=================

* Extension name: `arrow.fixed_size_tensor`.

* The storage type of the extension: ``FixedSizeList``.

* Extension type parameters:

* **value_type** = Arrow DataType of the tensor elements
* **shape** = shape of the contained tensors as a tuple
* **is_row_major** = boolean indicating the order of elements in memory

Member

In the Zulip discussion we are leaning towards the canonical type always storing row-major data and letting applications store strides in metadata. Any arguments for or against from you or your users would be most welcome at this point!

So you mean removing is_row_major as a parameter? I guess the issue boils down to how fast you want the loading/saving? In torch terminology, making a tensor contiguous (I'm guessing is the same as row major) makes a copy because the underlying memory representation changes. The same would apply to loading.

Member

> So you mean removing is_row_major as a parameter?

Yes. We would just use the physical layout of the source and not change memory layout when going in and out of the extension. We would provide an option to store the layout as metadata.

Member

> In torch terminology, making a tensor contiguous (I'm guessing is the same as row major) makes a copy because the underlying memory representation changes.

My understanding is that contiguous tensors in torch are indeed always row-major, so that also means that if you have such a contiguous tensor, you don't need any copy to put it in the proposed extension TensorArray (or you can get it out without a copy).

@thomasw21 thomasw21 Feb 3, 2023

I don't know if that helps the discussion:

import torch

def get_1d_memory_buffer(tensor):
    # Hex-dump the raw bytes of the tensor's underlying storage.
    return "".join(hex(elt) for elt in tensor.storage().untyped().byte())

x = torch.randn(2, 3)
y = torch.empty(3, 2).transpose(0, 1)

# Fill y with x's data
y[:] = x

assert x.shape == y.shape
assert get_1d_memory_buffer(x) != get_1d_memory_buffer(y)
# If you print `x` and `y`, you'll see that the tensors look the same

Member

> To me, torch has the ability to change the underlying "physical" ordering depending on how you interpret data.

Yes, but both of those use a row-major / C-contiguous memory layout. What you change is the order of the dimensions (C-H-W or H-W-C), and to do that while keeping a row-major layout, the data actually has to be shuffled in memory (and thus requires a copy). But either "physical order" is row-major and can therefore be stored in the proposed FixedShapeTensor array without a copy.
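
A short sketch of that point (assuming PyTorch semantics; the shapes and dimension labels are only illustrative):

import torch

# A freshly allocated channels-first tensor (C, H, W) is C-contiguous (row-major).
chw = torch.randn(3, 4, 5)
assert chw.is_contiguous()

# Merely permuting the dimensions to channels-last (H, W, C) gives a view with
# custom strides, which is no longer row-major ...
hwc_view = chw.permute(1, 2, 0)
assert not hwc_view.is_contiguous()

# ... and making it row-major again shuffles the data in memory, i.e. makes a copy.
hwc = hwc_view.contiguous()
assert hwc.is_contiguous()
assert hwc.data_ptr() != chw.data_ptr()

# Both chw and hwc are row-major for their own dimension order, so either could
# back the proposed FixedShapeTensor storage without a copy.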

Member

> I don't know if that helps the discussion:

For your example, x and y indeed have a different layout of the values in memory, while seemingly representing the same tensor (if you print them). But that's because y is transposed after it was created (which only changed the logical order, not the physical one), and thus has custom strides.

But you could still store both x and y without a copy in a FixedShapeTensor array. The difference is that x would be stored as shape (2, 3), and y would be stored as (3, 2). Assume that in this dummy example the dimensions are called A and B: if you want to keep track of the correct logical order, you would store the dimension names for x as ["A", "B"] and for y as ["B", "A"]. Afterwards, when reading such data, an application that knows it always wants the data in (A, B) order can transpose the data read from the stored y (just as you did when creating y).
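
To make that concrete, a small sketch assuming PyTorch (the "A"/"B" dimension names are just labels for this dummy example):

import torch

x = torch.randn(2, 3)                    # row-major, dims ("A", "B")
y = torch.empty(3, 2).transpose(0, 1)    # logically (2, 3), but laid out as (3, 2) in memory
y[:] = x

# x is stored as-is: shape (2, 3), dim_names ["A", "B"].
assert x.is_contiguous()

# y is stored in its physical row-major shape (3, 2) with dim_names ["B", "A"];
# y.T is exactly that physical view and is already contiguous, so no copy is needed.
assert y.T.is_contiguous()

# A reader that wants the ("A", "B") order simply transposes after reading:
assert torch.equal(y.T.T, x)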

Btw I fixed a bug in my script ...

> The difference is that x would be stored as shape (2, 3), and y would be stored as (3, 2)

I understand that there exists a permutation of dimensions that allows me to get a "row major" format. I think it doesn't change how you store the permutation information, i.e. via dimension names or strides. It felt like a natural concept to me to store strides, as this would just provide a better generalisation IMO. But I do understand if the current extension would focus on pure row_major.

Member

OK, I understand (we might have been talking past each other a bit, as I was assuming you want to have strides to allow zero-copy for all cases, while I tried to convince you that it's not needed).

It's certainly true that we could store strides, but I am not sure it would be a better generalization of (or a full replacement for) dimension names.
Consider for example that you have channels-last physical data (NHWC), but you view it as channels-first logically (NCHW). To store the data with the logical dimension order, this would require a strides parameter. But if you only store the strides in the FixedShapeTensor type and not the dimension names, then when consuming that data you know the strides associated with it, but you still don't know for sure what the dimensions mean (because both NHWC viewed as NCHW, and NCHW viewed as NHWC, would give you custom strides). Of course, if you know where the data is coming from and you know that it's from a pytorch context, then you can assume that the logical order is NCHW (that's how pytorch always shows it: "No matter what the physical order is, tensor shape and stride will always be depicted in the order of NCHW"), and that information combined with the strides ensures you know whether the physical layout is channels-first or last.
But that requires application-specific context. Whereas if you store the dimension names that match the order assuming a row-major layout, then you can infer the same information (and how to transpose it to get your desired logical order) without requiring this application-specific knowledge (assuming different applications use consistent dimension names, so you can recognize those).

So my current understanding is that dimension names are the more generalizable information.

In addition, pushing the strides logic (how to translate between the given dimension order and your desired dimension order) to the application keeps the implementation of the FixedShapeTensorType itself simpler, not requiring every implementation to deal with custom strides.

* Description of the serialization:

The metadata must be a valid JSON object including:

* shape of the contained tensors as an array with key "shape",
* boolean indicating the order of elements in memory with key
"is_row_major".

For example: `{ "shape": [2, 5], "is_row_major": true }`
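
For illustration only, a minimal sketch of how this metadata could be handled with pyarrow's Python ExtensionType API (the class name and defaults below are assumptions for this example, not the official implementation proposed here):

import json
import pyarrow as pa


class FixedSizeTensorType(pa.ExtensionType):
    """Illustrative Python sketch of the proposed extension type."""

    def __init__(self, value_type, shape, is_row_major=True):
        self._shape = list(shape)
        self._is_row_major = is_row_major
        size = 1
        for dim in shape:
            size *= dim
        # Storage type: a FixedSizeList of the element type, one list per tensor.
        super().__init__(pa.list_(value_type, size), "arrow.fixed_size_tensor")

    def __arrow_ext_serialize__(self):
        # The extension metadata is the JSON object described above.
        return json.dumps(
            {"shape": self._shape, "is_row_major": self._is_row_major}
        ).encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        metadata = json.loads(serialized.decode())
        return cls(storage_type.value_type, metadata["shape"], metadata["is_row_major"])


# Registering the type lets it round-trip through IPC, e.g. for 2x5 int32 tensors:
pa.register_extension_type(FixedSizeTensorType(pa.int32(), [2, 5]))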