GH-15483: [C++] Add a Fixed Shape Tensor canonical ExtensionType #8510

Merged: 39 commits, Apr 4, 2023

Conversation

rok
Member

@rok rok commented Oct 22, 2020

ARROW-1614: In an Arrow table, we would like to add support for a column whose cells each contain a tensor value, with all tensors having the same dimensions. These would be stored as a binary value, plus some metadata to store type and shape/strides.
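
As a rough illustration of the storage idea described above (not the implementation in this PR), the sketch below packs each tensor's values into a fixed-width binary cell and keeps the element type and shape in separate metadata. The helper names `pack_cell` and `unpack_cell`, the JSON metadata layout, and the little-endian int32 row-major encoding are all assumptions made for the example.

```python
import json
import struct
from math import prod

# Hypothetical metadata layout, chosen only for this example.
METADATA = json.dumps({"dtype": "int32", "shape": [2, 3]})


def pack_cell(flat_values, shape):
    """Pack the row-major values of one tensor into a binary cell."""
    if len(flat_values) != prod(shape):
        raise ValueError("value count does not match shape")
    return struct.pack(f"<{len(flat_values)}i", *flat_values)


def unpack_cell(cell, shape):
    """Recover the flat row-major values of one tensor from its cell."""
    return list(struct.unpack(f"<{prod(shape)}i", cell))


shape = json.loads(METADATA)["shape"]
# A "column" of two tensors that share one shape; every cell has the same width.
column = [
    pack_cell([0, 1, 2, 3, 4, 5], shape),
    pack_cell([6, 7, 8, 9, 10, 11], shape),
]
```

Because every tensor has the same shape, the shape and type need to be stored only once for the whole column rather than once per cell.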


@rok rok marked this pull request as draft October 22, 2020 16:54
@jorisvandenbossche
Member

Currently, only the shape is stored. Is this enough? That does assume a fixed row-major order?

@rok
Member Author

rok commented Oct 30, 2020

> Currently, only the shape is stored. Is this enough? That does assume a fixed row-major order?

I think we either assume that or also store strides / dimension order. I am not sure how dimension order changes are done in other frameworks (TF, PyTorch, etc.), but I would assume they don't reorder tensors in memory. So I would go for storing strides.
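
To make the shape-versus-strides question concrete, here is a minimal sketch of how row-major strides follow from a shape, and how a dimension reorder can be expressed by permuting shape and strides while the underlying buffer stays untouched. The function names `row_major_strides` and `permute` are hypothetical and not taken from Arrow or any other framework.

```python
def row_major_strides(shape, itemsize):
    """Byte strides for a C-contiguous (row-major) tensor."""
    strides = []
    step = itemsize
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def permute(shape, strides, order):
    """Reorder dimensions by permuting shape and strides; memory is not moved."""
    return tuple(shape[i] for i in order), tuple(strides[i] for i in order)


shape = (2, 3, 4)
strides = row_major_strides(shape, itemsize=4)
t_shape, t_strides = permute(shape, strides, (2, 0, 1))
```

If only the shape is stored, a reader has to assume the strides returned by `row_major_strides`; a permuted view such as `(t_shape, t_strides)` cannot be described without storing the strides or the dimension order as well.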

@sjperkins
Contributor

In the context of testing metadata equality within multiple parquet files in a dataset, equality on shape and strides may be a very strict requirement. Would relaxing the equality requirement to only compare the number of tensor dimensions negatively impact the design?

@rok
Member Author

rok commented Oct 5, 2021

> In the context of testing metadata equality within multiple parquet files in a dataset, equality on shape and strides may be a very strict requirement. Would relaxing the equality requirement to only compare the number of tensor dimensions negatively impact the design?

Good point. By tensor dimensions you mean shape, right?
I think it's ok to relax on the strides check. I've pushed a change, see latest commit.

@rok rok marked this pull request as ready for review October 5, 2021 09:36
@sjperkins
Contributor

> > In the context of testing metadata equality within multiple parquet files in a dataset, equality on shape and strides may be a very strict requirement. Would relaxing the equality requirement to only compare the number of tensor dimensions negatively impact the design?
>
> Good point. By tensor dimensions you mean shape, right? I think it's ok to relax on the strides check. I've pushed a change, see latest commit.

I was thinking even looser:

def __eq__(self, other):
    # Types compare equal when they have the same number of dimensions.
    return len(self.shape) == len(other.shape)

@rok
Member Author

rok commented Oct 5, 2021

> I was thinking even looser:
>
> def __eq__(self, other):
>     return len(self.shape) == len(other.shape)

Done.
We could introduce comparison options in case there would be differing requirements here.
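
One way such comparison options could look is sketched below. This is only an illustration of the idea discussed in this thread, not code from the PR: the `Compare` enum, the `TensorTypeSketch` class, and its `equals` method are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Compare(Enum):
    NDIM = "ndim"      # only the number of dimensions must match
    SHAPE = "shape"    # shapes must match exactly
    STRICT = "strict"  # shape and strides must both match


@dataclass(frozen=True)
class TensorTypeSketch:
    shape: tuple
    strides: tuple

    def equals(self, other, mode=Compare.NDIM):
        """Compare two tensor types at the requested level of strictness."""
        if mode is Compare.NDIM:
            return len(self.shape) == len(other.shape)
        if mode is Compare.SHAPE:
            return self.shape == other.shape
        return self.shape == other.shape and self.strides == other.strides


a = TensorTypeSketch(shape=(2, 3), strides=(12, 4))
b = TensorTypeSketch(shape=(4, 5), strides=(20, 4))
c = TensorTypeSketch(shape=(2, 3), strides=(4, 8))
```

Here `a.equals(b)` holds under the loosest mode because both types are two-dimensional, while `a.equals(c, Compare.STRICT)` fails because the strides differ even though the shapes agree.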

@rok
Member Author

rok commented Dec 10, 2021

@jorisvandenbossche @sjperkins @pitrou is there interest in getting this in?
If so, is cpp/src/arrow/extension_type_test.cc a good place to put it?

@pitrou
Member

pitrou commented Dec 13, 2021

Currently we don't ship any standard extension types. I recommend discussing this on the mailing-list.

@Hoeze

Hoeze commented Jan 19, 2022

fyi, the ray project created its own Tensor type:
https://docs.ray.io/en/latest/_modules/ray/data/extensions/tensor_extension.html#ArrowTensorArray

@wesm
Member

wesm commented Jan 19, 2022

Indeed I think having a built-in Tensor value type (implemented using extension arrays) in Arrow/pyarrow would be better than having third party projects rolling their own.

@frreiss

frreiss commented Jan 24, 2022

@wesm would there be interest in folding the Pandas side of these third-party extensions into Pandas also?

@jorisvandenbossche
Member

> would there be interest in folding the Pandas side of these third-party extensions into Pandas also?

That will be something to discuss in the pandas project.
(speaking as a pandas maintainer, for now we mostly encourage creating third-party extensions (that was the whole purpose of formalizing this ExtensionArray interface in pandas), but at some point we should also expand the types in pandas itself. Although I personally think we should rather start with adding a simple List type, than directly a tensor type)

@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label Apr 4, 2023
@AlenkaF
Member

AlenkaF commented Apr 4, 2023

@rok, you are awesome! 👍

Member

@jorisvandenbossche jorisvandenbossche left a comment


The failing CI is unrelated? (it seems the R failures are being worked on, and the C++ failures are related to LLVM update #34768)

cpp/src/arrow/extension/fixed_shape_tensor.h (outdated, resolved)
cpp/src/arrow/extension/fixed_shape_tensor_test.cc (outdated, resolved)
@github-actions github-actions bot added the "awaiting merge" label and removed the "awaiting change review" label Apr 4, 2023
@Batash0

Batash0 commented Apr 4, 2023

Great news!

rok and others added 2 commits April 4, 2023 12:51
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting merge" label Apr 4, 2023
@rok
Member Author

rok commented Apr 4, 2023

> The failing CI is unrelated? (it seems the R failures are being worked on, and the C++ failures are related to LLVM update #34768)

They seem unrelated indeed and I don't think they obscure any new problems as the change was fairly minimal.

@jorisvandenbossche
Member

Merged after 2.5 years ;) Thanks @rok!

@rok
Member Author

rok commented Apr 4, 2023

Thanks for all the input and reviews everyone, very happy to see this merged!

@jorisvandenbossche now let's talk about strides @ #34797 :D

@ursabot

ursabot commented Apr 5, 2023

Benchmark runs are scheduled for baseline = 81c828e and contender = a84a39b. a84a39b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.56% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Failed ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] a84a39b6 ec2-t3-xlarge-us-east-2
[Failed] a84a39b6 test-mac-arm
[Finished] a84a39b6 ursa-i9-9960x
[Failed] a84a39b6 ursa-thinkcentre-m75q
[Finished] 81c828ed ec2-t3-xlarge-us-east-2
[Failed] 81c828ed test-mac-arm
[Finished] 81c828ed ursa-i9-9960x
[Failed] 81c828ed ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

jorisvandenbossche added a commit that referenced this pull request Apr 11, 2023
### Rationale for this change
The fixed shape tensor canonical extension type is implemented in C++ (#8510), so we can add bindings to the extension type in Python.

### What changes are included in this PR?
Binding for fixed shape tensor canonical extension type.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No.
* Closes: #34882

Lead-authored-by: Alenka Frim <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
apache#8510)

> [ARROW-1614](https://issues.apache.org/jira/browse/ARROW-1614): In an Arrow table, we would like to add support for a column that has values cells each containing a tensor value, with all tensors having the same dimensions. These would be stored as a binary value, plus some metadata to store type and shape/strides.
* Closes: apache#15483

Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Rok <rok@mihevc.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Ben Harkins <60872452+benibus@users.noreply.github.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023
apache#8510)

rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023
@rok
Member Author

rok commented Aug 17, 2023

We started a mailing list discussion about a potential VariableShapeTensor extension array; please check it out and give feedback. For more details, there is also a PR: #37166.

Development

Successfully merging this pull request may close these issues.

[C++] Add a Tensor logical value type with constant shape, implemented using ExtensionType