Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Add a Tensor logical value type with varying dimensions, implemented using ExtensionType #24868

Closed
asfimport opened this issue May 6, 2020 · 10 comments · Fixed by #37166

Comments

@asfimport
Copy link
Collaborator

asfimport commented May 6, 2020

Support for tensor in Table, RecordBatch, etc. where each row is a tensor of a different shape (e.g images of different sizes), but of the same underlying type (e.g. int32). Implemented as an ExtensionType, so no need to change the format. 

I don't see needing each row being a tensor with a different number of dimensions, so if the implementation for that falls out easily of the use case with each row in the table having a tensor with the same number of dimensions, great. If it adds a lot of complexity, that case would be postponed.

Reporter: Christian Hudon / @chrish42
Watchers: Rok Mihevc / @rok

Related issues:

Note: This issue was originally created as ARROW-8714. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Christian Hudon / @chrish42:
Proposed approach: a first column containing the elements from all the tensors (in row-major order), and a second containing a tuple with that tensor's shape. The start offset of the data for the next tensor can be computed from the shape of the previous one. Does that sound like the right approach? Would we also need a separate column containing the pre-computed start index of for each tensor?

 

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
I think a struct with one field with the actual values and one field keeping track of the shape of each tensor sounds good.

The start offset of the data for the next tensor can be computed from the shape of the previous one.

The field storing the values of the actual tensors will be a variable size binary or list layout, I suppose. That way, since this is a normal arrow array, you already have access to the start offset of each tensor (without needing to calculate it from all previous ones), see https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout

For variable size binary vs variable size list layout, in the end both will be same physical storage. But using a list array instead of binary array might make it a bit easier to work with (the data type of the individual values is then already coded in the list type as well, and eg in the python APIs of pyarrow, you can easily access the flat array of values of the ListArray as a single numpy array (from which a part can be sliced and reshaped to get the tensor).

@asfimport
Copy link
Collaborator Author

Bryan Cutler / @BryanCutler:
+1 on the proposal of having a list array for the data (of same type as the tensor) and second array for the shape. For the shape, a list array of ints would work but it could also be possible to modify Tensor.fbs slightly to have a TensorShape message. That might have some benefit to keep the size down for lots of small tensors, but not sure if it's worth the added complexity.

I also had another thought, if the shape for each tensor added an additional outer dimension to represent how many records are in each tensor, that would allow us to use a single tensor extension type for both variable and constant shapes. For example, say you have 10 tensors of shape (2, 3) stacked in a single ndarray of (10, 2, 3), then the shape array would have a single entry [(10, 2, 3)], if you have 10 tensors of varying shapes, then each one will have a 1 added to the outer dimension, so 10 entries with [(1, 2, 3), (1, 5, 3), (1, 4, 3), ...]. It would be a little redundant having the 1's in this case, but this would also allow to combine smaller batches, say 10 tensors where 5 are same shape, as an example would give you [(5, 2, 3), (5, 4, 6). What do you think of this @chrish42 and @jorisvandenbossche ?

@asfimport
Copy link
Collaborator Author

Rok Mihevc / @rok:
I like idea of batching into equally shaped Tensors. It would require a bit extra complexity for keeping original order of elements and taking into account cases where Tensors with equal shapes would have different strides. I suppose this will become clearer with ARROW-1614 resolved.

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:

I also had another thought, if the shape for each tensor added an additional outer dimension to represent how many records are in each tensor, that would allow us to use a single tensor extension type for both variable and constant dimensions.

To clarify, this is only about constant vs variable dimensions, and not about constant shape ?

My understanding was that ARROW-1614 is also about constant shape (although the title only says dimension), and then I don't see how that would be possible to combine in the way described?

@asfimport
Copy link
Collaborator Author

Bryan Cutler / @BryanCutler:
Sorry, I mistyped dimension when I meant shape above (fixed now). I was trying to think of a way to use a single extension type for constant and variable shapes, with a fixed dimension. Although there is a problem with my suggestion in that the arrays won't be able to be sliced without recomputing the shapes, and I don't see a way around that. So I guess it seems better to stay with 2 different extension types, this one for variable shapes and ARROW-1614 for constant shapes.

@AlenkaF AlenkaF added this to the 14.0.0 milestone Oct 11, 2023
jorisvandenbossche added a commit that referenced this issue Oct 11, 2023
…ns, implemented using ExtensionType (#37166)

### Rationale for this change

For use cases where underlying datatype and number of dimensions in tensors are equal but not the actual shape we want to add a `VariableShapeTensorType`.
See #24868 and huggingface/datasets#5272

### What changes are included in this PR?

This introduces definition of `arrow.variable_shape_tensor` extension and it's C++ implementation and a Python wrapper.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

This introduces new extension type to the user.
* Closes: #24868

Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche jorisvandenbossche modified the milestones: 14.0.0, 15.0.0 Oct 11, 2023
@raulcd
Copy link
Member

raulcd commented Oct 11, 2023

I was going to cherry-pick this due to it being milestoned for 14.0.0 before, should I pick it @jorisvandenbossche or should I leave it for 15.0.0?

@jorisvandenbossche
Copy link
Member

Yeah, I was just going to comment on the PR to ask about it. It's the merge script that changed the milestone again to 15.0. @rok how important is it to have this in 14.0? In practice it doesn't matter much because it's only the extension type specification that anyone can implement (on any version of arrow), and the actual C++ implementation will only be for 15.0.
On the other hand, for rendering it in the documentation, it's maybe nice to have for 14.0 (and it's also an easy backport)

@rok
Copy link
Member

rok commented Oct 11, 2023

I'm +0 for 14.0.0 for the same reasons as Joris. Since it's just a doc change it'd not expect it to cause issues with the release. I defer to @raulcd as I don't know how much work it is to include.

@raulcd raulcd modified the milestones: 15.0.0, 14.0.0 Oct 11, 2023
@raulcd
Copy link
Member

raulcd commented Oct 11, 2023

I'll cherry-pick it

raulcd pushed a commit that referenced this issue Oct 11, 2023
…ns, implemented using ExtensionType (#37166)

### Rationale for this change

For use cases where underlying datatype and number of dimensions in tensors are equal but not the actual shape we want to add a `VariableShapeTensorType`.
See #24868 and huggingface/datasets#5272

### What changes are included in this PR?

This introduces definition of `arrow.variable_shape_tensor` extension and it's C++ implementation and a Python wrapper.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

This introduces new extension type to the user.
* Closes: #24868

Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…mensions, implemented using ExtensionType (apache#37166)

### Rationale for this change

For use cases where underlying datatype and number of dimensions in tensors are equal but not the actual shape we want to add a `VariableShapeTensorType`.
See apache#24868 and huggingface/datasets#5272

### What changes are included in this PR?

This introduces definition of `arrow.variable_shape_tensor` extension and it's C++ implementation and a Python wrapper.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

This introduces new extension type to the user.
* Closes: apache#24868

Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…mensions, implemented using ExtensionType (apache#37166)

### Rationale for this change

For use cases where underlying datatype and number of dimensions in tensors are equal but not the actual shape we want to add a `VariableShapeTensorType`.
See apache#24868 and huggingface/datasets#5272

### What changes are included in this PR?

This introduces definition of `arrow.variable_shape_tensor` extension and it's C++ implementation and a Python wrapper.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

This introduces new extension type to the user.
* Closes: apache#24868

Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…mensions, implemented using ExtensionType (apache#37166)

### Rationale for this change

For use cases where underlying datatype and number of dimensions in tensors are equal but not the actual shape we want to add a `VariableShapeTensorType`.
See apache#24868 and huggingface/datasets#5272

### What changes are included in this PR?

This introduces definition of `arrow.variable_shape_tensor` extension and it's C++ implementation and a Python wrapper.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

This introduces new extension type to the user.
* Closes: apache#24868

Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment