
GH-33923: [Docs] Tensor canonical extension type specification #33925

Merged: 27 commits into apache:main from spec-canonical-extension-tensor on Mar 15, 2023

Conversation

@AlenkaF (Member) commented Jan 30, 2023

Rationale for this change

There have been quite a lot of discussions connected to tensor support in Arrow Tables/RecordBatches. This PR is a specification proposal to add tensors as a canonical extension type, and it is meant to be sent to the mailing list for discussion and vote.

What changes are included in this PR?

A specification for a canonical extension type for fixed shape tensors.

Open question

Should metadata include the "dim_names" key to (optionally) specify dimension names when creating the Arrow FixedShapeTensorArray?
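For illustration, here is a sketch of what the serialized extension metadata under discussion might look like. The key names follow this proposal's draft; treat the concrete values as hypothetical:

```python
import json

# Hypothetical example of the proposed extension metadata.
# "shape" is required; "dim_names" is the optional key in question.
metadata = json.dumps({"shape": [2, 5], "dim_names": ["x", "y"]})

# This string would travel as the extension type's serialized metadata,
# next to the extension name and a fixed-size list storage type whose
# list size is the product of the shape (here 2 * 5 = 10).
print(metadata)  # {"shape": [2, 5], "dim_names": ["x", "y"]}
```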


@AlenkaF changed the title from "Add Fixed size tensor spec to canonical extensions list" to "GH-33923: [Docs] Tensor canonical type extension specification" on Jan 30, 2023
@github-actions (bot)
⚠️ GitHub issue #33923 has been automatically assigned in GitHub to PR creator.

@rok (Member) left a comment

Thanks for doing this @AlenkaF !
I suggested some changes.
Also: a question I hit on the C++ implementation last week:
what happens when one wants to work with two fixed_size_tensor extensions (let's say they have different shapes) at once? I think this will be a common occurrence, and because of the name collision the two extensions can't be registered at the same time.

AlenkaF and others added 2 commits January 30, 2023 11:44
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@AlenkaF (Member, Author) commented Jan 30, 2023

> Also: a question I hit on the C++ implementation last week:
> what happens when one wants to work with two fixed_size_tensor extensions (let's say they have different shapes) at once? I think this will be a common occurrence, and because of the name collision the two extensions can't be registered at the same time.

When working on the pyarrow implementation locally, this wasn't an issue.
From the Python documentation:

> Registration needs an extension type instance, but then works for any instance of the same subclass regardless of parametrization of the type.

And you can see this case in the tests I uploaded to gist:
https://gist.github.com/AlenkaF/95fb41f461fb792396bb20dd502b4112#file-02_tensor_extension_tests-py-L77-L96

I guess it is the same in C++?
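For reference, a minimal pyarrow sketch of this "register once, parametrize freely" behaviour. The class and the extension name below are hypothetical stand-ins (deliberately not the canonical arrow.* name), not the actual implementation:

```python
import json
import pyarrow as pa

class MyTensorType(pa.ExtensionType):
    """Hypothetical parametrized fixed-shape tensor extension type."""

    def __init__(self, value_type, shape):
        self.shape = shape
        size = 1
        for dim in shape:
            size *= dim
        # Storage is a fixed-size list whose size is the product of the shape.
        super().__init__(pa.list_(value_type, size), "my.fixed_shape_tensor")

    def __arrow_ext_serialize__(self):
        # The per-instance parametrization lives in the serialized metadata,
        # not in the registered name.
        return json.dumps({"shape": self.shape}).encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        metadata = json.loads(serialized.decode())
        return cls(storage_type.value_type, metadata["shape"])

# Register the name once, using any instance...
pa.register_extension_type(MyTensorType(pa.int32(), [2, 2]))

# ...after which differently parametrized instances coexist happily.
a = MyTensorType(pa.int32(), [2, 2])
b = MyTensorType(pa.float64(), [5, 2])
assert a.shape != b.shape
```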

@rok (Member) commented Jan 30, 2023

Only one extension type with a given name can be registered at a time. Multiple instances can coexist, but the registry can only keep one extension per name. I suppose users won't really want to register these types (or is that a requirement for using compute?), but will want an easy way to instantiate them.

Alternatively we could store shape in the extension name, e.g. arrow.fixed_size_tensor<int64,(5,2),row_major>, but I'm not sure that's a good idea.

@jorisvandenbossche (Member)

> Only one extension type with a given name can be registered at a time. Multiple instances can coexist, but the registry can only keep one extension per name.

It's only the name that is registered, along with the methods to serialize/deserialize. The actual metadata (i.e. the only thing that differs between two extension type instances of the same type) isn't part of the type class being registered, so for a single registered (parametrized) type you can have many instances alive at the same time, each with a different parametrization (i.e. different metadata).
That's the goal of having parametrized types, and it's why you don't want those parameters (metadata) in the name of the extension type; otherwise you would have to register every possible parametrization.

@AlenkaF (Member, Author) commented Jan 30, 2023

Thank you for reviewing, Rok!

If Joris agrees, this can be sent to the ML when the C++ implementation is ready. If I remember correctly, I should start a new discussion thread for the tensor canonical extension specification, adding some explanation, a link to the C++ implementation, and a pyarrow example. Please correct me if I am wrong.

I am working on a PyArrow extension example to have it ready for illustration.

@jorisvandenbossche changed the title from "GH-33923: [Docs] Tensor canonical type extension specification" to "GH-33923: [Docs] Tensor canonical extension type specification" on Jan 31, 2023
@pitrou (Member) commented Jan 31, 2023

Hey @lhoestq, you (and/or other people at HuggingFace) might be interested in reviewing this proposed addition. It will also be discussed on the Arrow dev ML.

@mariosasko

Hi! (I'm one of the maintainers of HF Datasets)

The spec looks good to me. I also slightly prefer is_row_major over NumPy's order.

In Datasets, we store fixed-size tensors as variable-length lists, which is suboptimal as it also stores the offsets, so having this natively supported by Arrow would save us the hassle of implementing/maintaining a new extension type.
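As an aside, the offsets overhead is easy to see in pyarrow; a small sketch contrasting the two layouts:

```python
import pyarrow as pa

# A variable-size list array materializes an offsets buffer...
var = pa.array([[1, 2], [3, 4]], type=pa.list_(pa.int32()))
print(var.offsets)  # [0, 2, 4]

# ...while a fixed-size list array stores only the values buffer,
# which is the layout the proposed tensor extension builds on.
fixed = pa.array([[1, 2], [3, 4]], type=pa.list_(pa.int32(), 2))
print(fixed.type)  # fixed_size_list<item: int32>[2]
```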

@jorisvandenbossche (Member) commented Jan 31, 2023

@mariosasko Thanks for the feedback!

> In Datasets, we store fixed-size tensors as variable-length lists, which is suboptimal as it also stores the offsets

I noticed that in the HuggingFace Datasets implementation there is a comment about using a variable-size instead of a fixed-size list array. Is that resolved now, so it's fine to use fixed-size lists?


Related to the is_row_major / order parameter: we were having some discussion in the Zulip chat (this public stream). Summarizing that here: I wonder to what extent the is_row_major / order parameter is actually useful.

  • I think none of the existing custom tensor extension type implementations have this (e.g. neither Ray nor HuggingFace Datasets has it. @mariosasko, do you know if that has come up in HuggingFace?)
  • The first dimension always needs to be the one that matches the length of the array and has the biggest stride (to match our list array layout). So even if you stored each individual tensor as column-major (F-contiguous), the full ndarray representing the whole TensorArray wouldn't be F-contiguous.
  • Not super familiar here, but I think some of the ML frameworks like tensorflow also only deal with row-major data anyway. @rok mentioned an example of a different "channels last" layout in pytorch (https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html). But that specific example also isn't exactly column-major for each individual tensor (after the first dimension), but something custom: strides of (1, 96, 3) for a tensor shape of (3, 32, 32) in their example, so neither (1024, 32, 1) nor (1, 3, 96). So to support this use case we would actually need a strides parameter, not an order/is_row_major parameter. EDIT: that's just a view (with different strides) on the original row-major data, giving a consistent dimension order (channels-first) when the actual data is stored with a different dimension order (channels-last); see the numpy sketch below. So the useful information for an application to pass along would be the names of the dimensions (the data itself is still always row-major).
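To make that concrete, a small numpy sketch (using the shapes from the pytorch tutorial) showing that the (1, 96, 3) strides are just a transposed view of row-major, channels-last data:

```python
import numpy as np

# One 32x32 RGB image stored channels-last (row-major, itemsize 1).
img = np.zeros((32, 32, 3), dtype=np.uint8)
print(img.strides)   # (96, 3, 1): plain C-contiguous layout

# A channels-first *view* of the same buffer: no data moves,
# only the strides change.
view = img.transpose(2, 0, 1)
print(view.shape)    # (3, 32, 32)
print(view.strides)  # (1, 96, 3): neither C- nor F-contiguous
```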

@mariosasko

@jorisvandenbossche

> I noticed that in the HuggingFace Datasets implementation there is a comment about using a variable-size instead of a fixed-size list array. Is that resolved now, so it's fine to use fixed-size lists?

Yes, the fixed-size version seems to work now (requires pyarrow>=10.0.0 to pass the tests).

> I think none of the existing custom tensor extension type implementations have this (e.g. neither Ray nor HuggingFace Datasets has it. @mariosasko, do you know if that has come up in HuggingFace?)

No, that has yet to be requested on our side (so not super important for us right now).

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@extabgrad

MATLAB's Deep Learning Toolbox uses n-dimensional arrays which fit quite well with the proposal. It also has a special datatype called "dlarray" which is responsible for automatic differentiation. This datatype allows labelled dimensions; however, the labels are quite compatible with a permutation. We group dimension labels into categories (spatial/continuous, channel, batch, temporal, unspecified, and potential extensions), where the spatial and unspecified labels (and perhaps others in future) are allowed to be used for multiple dimensions. We find that 'H' and 'W' are very image-centric and do not extend arbitrarily, while 'U' might be used for additional dimensions needed for a variety of purposes in different contexts, such as filter groups, point cloud index and so forth. Consequently we need to keep track of permutations within dimensions that share the same name.

Therefore our main input would be to request that permutation and dim_names are not mutually exclusive.

We would also like to ensure that the format will support complex data.

Questions:

  1. Could you clarify whether shape, dim_names and permutation are listed in left-to-right or right-to-left order? As in, is the contiguous dimension the right-most (C, Python) or the left-most (MATLAB, Fortran) dimension?
  2. Are 1-dimensional and 0-dimensional arrays allowed?
  3. "fixed_shape_array", or a variation on multi-dimensional or nd array, might be a more appropriate name given the mathematical implications of the term "tensor".

Joss Knight
Development Manager
GPU and Deep Learning Team
MathWorks UK

@rok (Member) commented Mar 2, 2023

Thank you for the input and the description of MATLAB's Deep Learning Toolbox, @extabgrad!

> Therefore our main input would be to request that permutation and dim_names are not mutually exclusive.

As per the discussion on the mailing list and this proposal, I believe permutation and dim_names will not be mutually exclusive.

> We would also like to ensure that the format will support complex data.

Just to be clear: by complex you mean diverse, not complex as in complex numbers?

Questions:

> 1. Could you clarify whether shape, dim_names and permutation are listed in left-to-right or right-to-left order? As in, is the contiguous dimension the right-most (C, Python) or the left-most (MATLAB, Fortran) dimension?

shape, dim_names and permutation would map to a row-major (C, python) physical layout of data. Data in the underlying buffer would have row-major layout.

> 2. Are 1-dimensional and 0-dimensional arrays allowed?

This wasn't discussed, but the current language doesn't forbid them, so I suppose they are allowed. Do you think we should explicitly allow them?

> 3. "fixed_shape_array", or a variation on multi-dimensional or nd array, might be a more appropriate name given the mathematical implications of the term "tensor".

That's an interesting point. Do you know of a source where this is discussed/argued? I'd be in favor of array-like naming too. The consideration here is that deep learning frameworks are pushing the "tensor" name and seem to have a lot of momentum.

@jorisvandenbossche (Member) commented Mar 2, 2023

> > We would also like to ensure that the format will support complex data.
>
> Just to be clear: by complex you mean diverse, not complex as in complex numbers?

And if you mean complex numbers: the value_type in this proposal can be any Arrow data type. At the moment there is no "complex" data type defined in the Arrow format spec, though. But there is a proposal to add a canonical extension type for complex data (#10565), and assuming that gets included in Arrow, you could then specify a tensor array that uses complex numbers as the individual tensor elements.

> > 3. "fixed_shape_array", or a variation on multi-dimensional or nd array, might be a more appropriate name given the mathematical implications of the term "tensor".
>
> That's an interesting point. Do you know of a source where this is discussed/argued? I'd be in favor of array-like naming too. The consideration here is that deep learning frameworks are pushing the "tensor" name and seem to have a lot of momentum.

I don't have an opinion on the whole "array vs tensor" debate, but one aspect in favor of "Tensor" is that it avoids confusion about whether "FixedShapeArray" is an actual array or a type class, and the duplication in a name like "FixedShapeArrayArray".

@extabgrad

> Just to be clear: by complex you mean diverse, not complex as in complex numbers?

No, I mean complex numbers, which are increasingly used in AI workflows.

> shape, dim_names and permutation would map to a row-major (C, python) physical layout of data. Data in the underlying buffer would have row-major layout.

I find 'row-major' to be ambiguous. You can have a row-major layout and still list dimensions left-to-right: [row, column, page, batch].

> > 2. Are 1-dimensional and 0-dimensional arrays allowed?
>
> This wasn't discussed, but the current language doesn't forbid them, so I suppose they are allowed. Do you think we should explicitly allow them?

I don't know, I just know they are important in numpy.

> That's an interesting point. Do you know of a source where this is discussed/argued? I'd be in favor of array-like naming too. The consideration here is that deep learning frameworks are pushing the "tensor" name and seem to have a lot of momentum.

[image attachment omitted]

Are you confident it is clear this datatype is only to be used in machine learning contexts? Perhaps it should be called 'machine-learning-tensor' then?

@rok (Member) commented Mar 2, 2023

> No, I mean complex numbers, which are increasingly used in AI workflows.

Interesting! As Joris states there is an independent effort to enable that.

> > shape, dim_names and permutation would map to a row-major (C, python) physical layout of data. Data in the underlying buffer would have row-major layout.
>
> I find 'row-major' to be ambiguous. You can have a row-major layout and still list dimensions left-to-right: [row, column, page, batch].

This proposal currently states: "Elements in a fixed shape tensor extension array are stored in row-major/C-contiguous order." We can amend that to be more general. Could you state what left-to-right means? I assume it's equal to TensorFlow's minor-to-major.

> Are you confident it is clear this datatype is only to be used in machine learning contexts? Perhaps it should be called 'machine-learning-tensor' then?

I hope it'll be used in a wider context. FixedShapeNDArray?

@extabgrad commented Mar 2, 2023

> This proposal currently states: "Elements in a fixed shape tensor extension array are stored in row-major/C-contiguous order." We can amend that to be more general. Could you state what left-to-right means? I assume it's equal to TensorFlow's minor-to-major.

I can see it's confusing, since we are distinguishing between the order dimensions are indexed and the order they are stored. Left-to-right means new higher dimensions are added on the right, and right-to-left means they are added on the left.

The minor_to_major field definition states that the left-most entry refers to the contiguous dimension in memory. The order of indexing is then defined by the numbers in this field, so [0 1 2] means that X(i,j,k) indexes position i + M*j + M*N*k (for an M×N×P array). In typical parlance, where i represents the row index, this is therefore column-major layout. A minor_to_major of [1 0 2] would therefore be a strict row-major layout, so that j can represent the column index. But of course by row-major people typically mean [2 1 0], which means k is actually the column index, since the left-most dimension will be the highest (pages).

If MATLAB data were stored in row-major format but still indexed left-to-right then i would become the column index and no permutation would be needed to translate between MATLAB data and other row-major formats. However, since everyone uses the row,column indexing convention and MATLAB indexes left-to-right, the data is therefore stored in column-major. If 'permutation' is an inherently left-to-right spec then it is [1 0 2] for MATLAB (swap the 'first' two dimensions in memory). But if it's a right-to-left spec then it would be [0 2 1] (swap the 'last' two dimensions in memory).

Hence the ambiguity - so I thought it best to check!

I note that permutation is effectively the inverse of minor_to_major. The former is the memory order relative to the indexing order and the latter is the other way around.
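To pin down the conventions being discussed, here is a small Python sketch of the indexing arithmetic (flat_index is a hypothetical helper, not part of any proposal):

```python
# Flat offset of element `index` in an array of `shape`, where
# minor_to_major lists dimensions from most to least contiguous.
def flat_index(index, shape, minor_to_major):
    offset, stride = 0, 1
    for dim in minor_to_major:
        offset += index[dim] * stride
        stride *= shape[dim]
    return offset

M, N, P = 4, 3, 2
# [0, 1, 2]: X(i, j, k) lands at i + M*j + M*N*k ("column-major" indexing).
assert flat_index((1, 2, 0), (M, N, P), [0, 1, 2]) == 1 + M * 2
# [2, 1, 0]: the usual C/row-major layout, i*N*P + j*P + k.
assert flat_index((1, 2, 0), (M, N, P), [2, 1, 0]) == 1 * N * P + 2 * P
```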

Co-authored-by: David Li <li.davidm96@gmail.com>
@AlenkaF merged commit 8583076 into apache:main on Mar 15, 2023
@AlenkaF added this to the 12.0.0 milestone on Mar 15, 2023
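(For readers arriving later: the extension type shipped with the 12.0.0 milestone referenced above, and pyarrow now exposes it directly. A minimal usage sketch, assuming pyarrow >= 12.0.0:)

```python
import numpy as np
import pyarrow as pa

# The canonical fixed shape tensor type, available in pyarrow >= 12.0.0.
tensor_type = pa.fixed_shape_tensor(pa.int32(), (2, 2), dim_names=["h", "w"])
print(tensor_type.shape, tensor_type.dim_names)  # [2, 2] ['h', 'w']

# Build an extension array from a (length, 2, 2) ndarray; the first
# dimension becomes the length of the Arrow array.
data = np.arange(12, dtype=np.int32).reshape(3, 2, 2)
arr = pa.FixedShapeTensorArray.from_numpy_ndarray(data)
assert np.array_equal(arr.to_numpy_ndarray(), data)
```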
@ursabot commented Mar 15, 2023

Benchmark runs are scheduled for baseline = 3df5ba8 and contender = 8583076. 8583076 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.61% ⬆️0.06%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.66% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 85830764 ec2-t3-xlarge-us-east-2
[Failed] 85830764 test-mac-arm
[Finished] 85830764 ursa-i9-9960x
[Finished] 85830764 ursa-thinkcentre-m75q
[Finished] 3df5ba8b ec2-t3-xlarge-us-east-2
[Finished] 3df5ba8b test-mac-arm
[Finished] 3df5ba8b ursa-i9-9960x
[Finished] 3df5ba8b ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

rtpsw pushed a commit to rtpsw/arrow that referenced this pull request Mar 27, 2023
@AlenkaF deleted the spec-canonical-extension-tensor branch on June 5, 2023