Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Add a Tensor logical value type with constant shape, implemented using ExtensionType #15483

Closed
Tracked by #33924
asfimport opened this issue Sep 26, 2017 · 21 comments · Fixed by #8510
Closed
Tracked by #33924

Comments

@asfimport
Copy link
Collaborator

asfimport commented Sep 26, 2017

In an Arrow table, we would like to add support for a column that has values cells each containing a tensor value, with all tensors having the same shape/dimensions. These would be stored as a binary value, plus some metadata to store type and shape/strides.

Reporter: Wes McKinney / @wesm
Assignee: Rok Mihevc / @rok
Watchers: Rok Mihevc / @rok

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-1614. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Wes McKinney / @wesm:
This can be handled via custom_metadata, so we don't actually need to change the format

@asfimport
Copy link
Collaborator Author

Christian Hudon / @chrish42:
This is a blocker for some Arrow use cases for us, so I'd be willing to work on this, with a bit of guidance. The first step would be to agree on the approach to take.

For me, there are two cases I'd need Arrow to support:

  1. Each row is a tensor of a different shape (e.g images of different sizes), but of the same underlying type (e.g. int32). I don't see needing each row being a tensor with a different number of dimensions, so that could be out of scope if desired.

  2. All rows have the same shape (so the whole column could potentially be handed off to e.g. scikit-image, as an array of n images of the same size).

    From what I understand of Arrow, here's how I would implement this:

  3. A first column containing the elements from all the tensors (in row-major order), and a second containing a tuple with that tensor's shape. The start offset of the data for the next tensor can be computed from the shape of the previous one. (Would also need a separate column containing the pre-computed start index of for each tensor?)

  4. Similarly, the data from the tensors would be stored all together in row-major order. The shape (without the first dimension) would be store in the metadata for that column.

    Thoughts?

@asfimport
Copy link
Collaborator Author

Wes McKinney / @wesm:
Sounds like two different extension types. Do you want to split the work into separate PRs for the sake of code review?

@asfimport
Copy link
Collaborator Author

Christian Hudon / @chrish42:
Created ARROW-8714 for the varying dimensions case. Renamed this one to be clearer that it is about the constant dimensions case.

Is the approach proposed sound for this case? What would be a next step in terms of code? Is there an example of another type that's implemented as en ExtensionType that I could have a look at?

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
@chrish42 for experimenting, you could also first implement it in Python (to have a prototype, to experiment with and discuss how to store metadata, etc, that might be easier to play with).
For Python, we have docs here: https://arrow.apache.org/docs/python/extending_types.html#defining-extension-types-user-defined-types (and there are also examples in the pandas codebase (for intervals and period dtypes): https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/_arrow_utils.py, or also in the pyarrow tests: https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_extension_type.py)

For implementing it in C++, probably best source are the test extension types: https://github.com/apache/arrow/blob/master/cpp/src/arrow/testing/extension_type.h and https://github.com/apache/arrow/blob/master/cpp/src/arrow/extension_type_test.cc

@asfimport
Copy link
Collaborator Author

Christian Hudon / @chrish42:
@jorisvandenbossche  Thanks for all the doc pointers. I think I'll start with a Python prototype, then, as that'll be a lot faster to experiment with. Once we're good with the results of that, I'll port it to C++.

@asfimport
Copy link
Collaborator Author

Christopher Osborn:
Thank you @chrish42  for taking this on! I was happy to find this ticket. I also need to store constant-dimension tensors in tables, with the intention of then exposing them in Python as ndarrays and in C++ as Eigen Matrix/Tensor views. I'm still ramping up on Arrow (and so far only in C++), and don't really understand the ExtensionType facility yet, but would be happy to lend a hand if possible. I was about to start picking my way painfully down a similar road.

@asfimport
Copy link
Collaborator Author

Rok Mihevc / @rok:
I'd like to contribute to this work and will have time available next week. @chrish42  could I help out somehow?

@asfimport
Copy link
Collaborator Author

Bryan Cutler / @BryanCutler:
I just wanted to let you all know I have been working on a similar Tensor extension type. I currently have a Pandas extension type for a tensor with conversion to/from an Arrow extension type, just for Python/PyArrow right now, and zero-copy conversion with numpy.ndarrays. It's part of the project Text Extensions for Pandas where we use it for NLP feature vectors, but it's really general purpose. You can check it out at

https://github.com/CODAIT/text-extensions-for-pandas/blob/master/text_extensions_for_pandas/array/tensor.py
https://github.com/CODAIT/text-extensions-for-pandas/blob/master/text_extensions_for_pandas/array/arrow_conversion.py
Or install the package if you like via pip install text-extensions-for-pandas (it's currently in alpha)

We would love to help out with this effort and contribute what we have to Arrow, if it fits the bill!

@asfimport
Copy link
Collaborator Author

Rok Mihevc / @rok:
@BryanCutler  Text Extensions for Pandas looks like a good start for the Python part.  We'd probably want to base it on pyarrow.Tensor instead of np.ndarray? Would you like to do this or shall I start and ask you for review?

@asfimport
Copy link
Collaborator Author

Bryan Cutler / @BryanCutler:
@rok for our purposes, it wasn't necessary to use pyarrow.Tensor, but there are some limitations with it currently so maybe there are some trade-offs. Please go ahead and start if you like and I'd be happy to help review and discuss further.

@asfimport
Copy link
Collaborator Author

Christian Hudon / @chrish42:
I manifestly haven't had time to work on this yet, mostly because my Arrow time is still dedicated to trying to figure out why my other Arrow pull request isn't working. Once that other pull request is done, I'll move to working on this one, but if someone else can start working on this, please do so and open a draft pull request with your work. I'll join you there when I can, in that case.

@asfimport
Copy link
Collaborator Author

Rok Mihevc / @rok:
As proposed by @jorisvandenbossche  I've made a draft PR ( #8510) with python logic prototype. It was heavily inspired by @BryanCutler's text-extensions-for-pandas. Once we agree on the design we can rewrite it to c++.

As this is for the case where all tensors in the array are of the same shape I propose we store the data in a single Tensor. Is there a good reason not to do that?

I assume we should support non-contiguous tensors. I'll add that.

Any comments at this point?

@chrish42  - feel free to jump in any time.

@asfimport
Copy link
Collaborator Author

Bryan Cutler / @BryanCutler:

As this is for the case where all tensors in the array are of the same shape I propose we store the data in a single Tensor. Is there a good reason not to do that?

I believe if there is a single pyarrow.Tensor serialized in the backing binary array, that array will have a length of 1. Then if placing in a Table or RecordBatch, would restrict it to 1 row for the other columns as well.

@asfimport
Copy link
Collaborator Author

Bryan Cutler / @BryanCutler:
@rok I had an idea for making a single extension type that would handle constant and varying dimensions, see comment here and let me know what you think.

@asfimport
Copy link
Collaborator Author

Rok Mihevc / @rok:
@BryanCutler  I've replied in that thread.

I've also started working on the c++ for this I'll report back soon.

@asfimport
Copy link
Collaborator Author

Bryan Cutler / @BryanCutler:
Great, @rok , looking forward to it. I also updated the title and description to reflect this is for tensors with the same fixed shape.

@asfimport
Copy link
Collaborator Author

Wenbing Bai:
Hi there, just want to followup the status of this ticket. Any updates?

@asfimport
Copy link
Collaborator Author

Rok Mihevc / @rok:
Hey [~wenbing-bai], I'm planing to take a look at this next week.

@asfimport
Copy link
Collaborator Author

Todd Farmer / @toddfarmer:
This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

@asfimport
Copy link
Collaborator Author

Apache Arrow JIRA Bot:
This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

jorisvandenbossche added a commit that referenced this issue Apr 4, 2023
> [ARROW-1614](https://issues.apache.org/jira/browse/ARROW-1614): In an Arrow table, we would like to add support for a column that has values cells each containing a tensor value, with all tensors having the same dimensions. These would be stored as a binary value, plus some metadata to store type and shape/strides.
* Closes: #15483

Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Rok <rok@mihevc.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Ben Harkins <60872452+benibus@users.noreply.github.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
apache#8510)

> [ARROW-1614](https://issues.apache.org/jira/browse/ARROW-1614): In an Arrow table, we would like to add support for a column that has values cells each containing a tensor value, with all tensors having the same dimensions. These would be stored as a binary value, plus some metadata to store type and shape/strides.
* Closes: apache#15483

Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Rok <rok@mihevc.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Ben Harkins <60872452+benibus@users.noreply.github.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
apache#8510)

> [ARROW-1614](https://issues.apache.org/jira/browse/ARROW-1614): In an Arrow table, we would like to add support for a column that has values cells each containing a tensor value, with all tensors having the same dimensions. These would be stored as a binary value, plus some metadata to store type and shape/strides.
* Closes: apache#15483

Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Rok <rok@mihevc.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Ben Harkins <60872452+benibus@users.noreply.github.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants