[C++] Add a Tensor logical value type with varying dimensions, implemented using ExtensionType #24868
Comments
Christian Hudon / @chrish42:
|
Joris Van den Bossche / @jorisvandenbossche:
The field storing the values of the actual tensors will be a variable-size binary or list layout, I suppose. That way, since this is a normal Arrow array, you already have access to the start offset of each tensor (without needing to calculate it from all previous ones). See https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout for the layout. As for variable-size binary vs. variable-size list layout, both end up with the same physical storage. But using a list array instead of a binary array might make it a bit easier to work with: the data type of the individual values is then already encoded in the list type, and, e.g., in the Python APIs of pyarrow you can easily access the flat array of values of the ListArray as a single numpy array (from which a part can be sliced and reshaped to get the tensor).
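To make the offset-based access concrete, here is a minimal sketch in plain Python (deliberately not pyarrow) of a flat values buffer with per-row offsets. The names `values`, `offsets`, `shapes`, `reshape`, and `get_tensor` are hypothetical, the example data is made up, and row-major (C) ordering is an assumption rather than something settled in this thread.

```python
from math import prod

# Hypothetical example data: two integer tensors with shapes (2, 3) and (1, 2).
values = [1, 2, 3, 4, 5, 6, 7, 8]  # flat child values, as in a list array
offsets = [0, 6, 8]                # offsets[i]:offsets[i + 1] delimits tensor i
shapes = [(2, 3), (1, 2)]          # per-row shape, stored alongside the values


def reshape(flat, shape):
    """Reshape a flat list into nested lists, assuming row-major (C) order."""
    if len(shape) == 1:
        return list(flat)
    step = prod(shape[1:])
    return [reshape(flat[k * step:(k + 1) * step], shape[1:])
            for k in range(shape[0])]


def get_tensor(i):
    """Return tensor i by slicing the flat values; earlier rows are not scanned."""
    flat = values[offsets[i]:offsets[i + 1]]
    assert len(flat) == prod(shapes[i]), "shape does not match slice length"
    return reshape(flat, shapes[i])
```

With this data, `get_tensor(1)` only reads `values[6:8]`, so the cost of fetching a row does not depend on how many rows precede it.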
Bryan Cutler / @BryanCutler: I also had another thought: if the shape for each tensor had an additional outer dimension representing how many records are in each tensor, that would allow us to use a single tensor extension type for both variable and constant shapes. For example, say you have 10 tensors of shape (2, 3) stacked in a single ndarray of shape (10, 2, 3), then the shape array would have a single entry
Rok Mihevc / @rok: |
Joris Van den Bossche / @jorisvandenbossche:
To clarify, this is only about constant vs. variable dimensions, and not about constant shape? My understanding was that ARROW-1614 is also about constant shape (although the title only says dimension), and in that case I don't see how the two could be combined in the way described?
Bryan Cutler / @BryanCutler: |
…ns, implemented using ExtensionType (#37166)

### Rationale for this change
For use cases where the underlying datatype and the number of dimensions of the tensors are equal but the actual shapes are not, we want to add a `VariableShapeTensorType`. See #24868 and huggingface/datasets#5272

### What changes are included in this PR?
This introduces the definition of the `arrow.variable_shape_tensor` extension, its C++ implementation, and a Python wrapper.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
This introduces a new extension type to the user.

* Closes: #24868

Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
I was going to cherry-pick this since it was previously milestoned for 14.0.0. Should I pick it, @jorisvandenbossche, or leave it for 15.0.0?
Yeah, I was just going to comment on the PR to ask about it. It's the merge script that changed the milestone again to 15.0. @rok how important is it to have this in 14.0? In practice it doesn't matter much, because it's only the extension type specification, which anyone can implement (on any version of Arrow), and the actual C++ implementation will only be in 15.0.
I'm +0 for 14.0.0 for the same reasons as Joris. Since it's just a doc change, I wouldn't expect it to cause issues with the release. I defer to @raulcd as I don't know how much work it is to include.
I'll cherry-pick it
Support for tensors in Table, RecordBatch, etc., where each row is a tensor of a different shape (e.g. images of different sizes) but of the same underlying type (e.g. int32). Implemented as an ExtensionType, so there is no need to change the format.
I don't see a need for each row to be a tensor with a different number of dimensions, so if the implementation for that falls out easily from the use case where each row in the table has a tensor with the same number of dimensions, great. If it adds a lot of complexity, that case would be postponed.
Reporter: Christian Hudon / @chrish42
Watchers: Rok Mihevc / @rok
Related issues:
Note: This issue was originally created as ARROW-8714. Please see the migration documentation for further details.