Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-5736: [Format][C++] Support small bit-width indices in sparse tensor #5290

Closed
wants to merge 21 commits into from

Conversation

@mrkn
Copy link
Member

mrkn commented Sep 5, 2019

I'd like to enable all integer types to become the value type of sparse tensor index.

@mrkn mrkn force-pushed the mrkn:ARROW-5736 branch 7 times, most recently from 2a1d5e1 to 7c33e35 Sep 6, 2019
@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Sep 7, 2019

Codecov Report

Merging #5290 into master will increase coverage by <.01%.
The diff coverage is 97.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5290      +/-   ##
==========================================
+ Coverage   88.72%   88.72%   +<.01%     
==========================================
  Files         945      945              
  Lines      124009   124086      +77     
  Branches     1437     1437              
==========================================
+ Hits       110024   110093      +69     
- Misses      13623    13631       +8     
  Partials      362      362
Impacted Files Coverage Δ
cpp/src/arrow/visitor_inline.h 88.78% <ø> (ø) ⬆️
cpp/src/arrow/ipc/metadata_internal.h 100% <ø> (ø) ⬆️
cpp/src/arrow/sparse_tensor.h 100% <100%> (ø) ⬆️
cpp/src/arrow/ipc/reader.cc 89.57% <100%> (+0.18%) ⬆️
cpp/src/arrow/sparse_tensor_test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/ipc/metadata_internal.cc 88.25% <100%> (+0.27%) ⬆️
cpp/src/arrow/ipc/read_write_test.cc 99.85% <100%> (ø) ⬆️
cpp/src/arrow/sparse_tensor.cc 88.27% <88.46%> (-3.4%) ⬇️
cpp/src/arrow/util/thread_pool_test.cc 97.66% <0%> (-0.94%) ⬇️
cpp/src/arrow/compare.cc 81.09% <0%> (-0.15%) ⬇️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fb51ecf...7c33e35. Read the comment docs.

@mrkn mrkn changed the title WIP: ARROW-5736: [Format][C++] Support small bit-width indices in sparse tensor ARROW-5736: [Format][C++] Support small bit-width indices in sparse tensor Sep 7, 2019
@mrkn mrkn requested review from wesm and pitrou Sep 7, 2019
@mrkn mrkn force-pushed the mrkn:ARROW-5736 branch 3 times, most recently from d581597 to c48bf53 Sep 8, 2019
Copy link
Contributor

pitrou left a comment

Thanks for doing this. I wonder if the format changes require a mailing-list discussion. Not necessarily since IPC support for sparse tensors is marked experimental, but I'd like @wesm 's opinion.

Otherwise, here are a bunch of comments.

format/SparseTensor.fbs Outdated Show resolved Hide resolved
cpp/src/arrow/ipc/reader.cc Outdated Show resolved Hide resolved
cpp/src/arrow/ipc/reader.cc Outdated Show resolved Hide resolved
const int64_t elsize = sizeof(int64_t);
std::vector<int64_t> strides({elsize, elsize * non_zero_length});
std::vector<int64_t> indices_shape(
{non_zero_length, static_cast<int64_t>(shape.size())});

This comment has been minimized.

Copy link
@pitrou

pitrou Sep 9, 2019

Contributor

Perhaps not related with this PR, but is non_zero_length validated somewhere? What if it's not consistent with indices_data?

This comment has been minimized.

Copy link
@mrkn

mrkn Sep 10, 2019

Author Member

non_zero_length links to indices_shape[0] in SparseCOOIndex. On the other words, if non_zero_length is inconsistent with indices_data, it means indices_shape is inconsistent with indices_data.

If we need to do validation somewhere, I guess the appropriate place is Tensor's constructor.

And, I think we can remove non_zero_length attribute from the format, and we should do it. I'll file the issue in JIRA.

/// Non-negative byte offsets to advance one value cell along each dimension
indicesStrides: [long];

/// The location and size of the indices matrix's data
indicesBuffer: Buffer;

This comment has been minimized.

Copy link
@pitrou

pitrou Sep 9, 2019

Contributor

A high-level question: if the indices are a tensor, why aren't they represented here as a Tensor object? Is it more convenient to outline its structure explicitly?

This comment has been minimized.

Copy link
@mrkn

mrkn Sep 10, 2019

Author Member

There are two reasons that I didn't employ Tensor object here.

(1) I want to strict the value type of indices tensor to integer types at the format level.

(2) I can drop the shape field of Tensor.

This comment has been minimized.

Copy link
@mrkn

mrkn Sep 10, 2019

Author Member

The previous comment doesn't mean I want to stick to this format. Using Tensor object here is acceptable.
If we use Tensor here, we need a way to strict the value type of the tensor to integer types.

This comment has been minimized.

Copy link
@pitrou

pitrou Sep 10, 2019

Contributor

I think that's your choice. I don't mind keeping things like they are.

@@ -891,11 +905,12 @@ Status ReadSparseCSRIndex(const flatbuf::SparseTensor* sparse_tensor, int64_t nd
RETURN_NOT_OK(
file->ReadAt(indices_buffer->offset(), indices_buffer->length(), &indices_data));

std::vector<int64_t> indptr_shape({ndim + 1});
std::vector<int64_t> indptr_shape({shape[0] + 1});
std::vector<int64_t> indices_shape({non_zero_length});

This comment has been minimized.

Copy link
@pitrou

pitrou Sep 9, 2019

Contributor

Same question here about validating indptr_shape against indptr_data and indices_data against `indices_data...

Perhaps instead of calling the Tensor constructor directly, we need a Status-returning static factory function that validates input for consistency?

This comment has been minimized.

Copy link
@pitrou

pitrou Sep 9, 2019

Contributor

That said, it seems we should then validate the indices data, which is much more expensive...

This comment has been minimized.

Copy link
@mrkn

mrkn Sep 10, 2019

Author Member

You mean that it's better if we have Status-returning factory function for Tensor class and it validates indices data?

If so, it makes sense to me. SparseTensor should have Status-returning factory function, too.

Do you think we need to discuss on mailing-list before changing these APIs?

This comment has been minimized.

Copy link
@pitrou

pitrou Sep 10, 2019

Contributor

Do you think we need to discuss on mailing-list before changing these APIs?

I don't think so. We can keep the public (non-validating) constructors in addition to the (validating) factory functions. So there's no API breakage.

This comment has been minimized.

Copy link
@mrkn

mrkn Sep 10, 2019

Author Member
RETURN_NOT_OK(
AllocateBuffer(sizeof(int64_t) * ndim * nonzero_count, &indices_buffer));
int64_t* indices = reinterpret_cast<int64_t*>(indices_buffer->mutable_data());
RETURN_NOT_OK(AllocateBuffer(indices_elsize * ndim * nonzero_count, &indices_buffer));

This comment has been minimized.

Copy link
@pitrou

pitrou Sep 9, 2019

Contributor

Not related to this PR, but at some point all buffer-allocating constructors should have a variant taking a MemoryPool argument.

This comment has been minimized.

Copy link
@mrkn
Status Convert() {
switch (index_value_type_->id()) {
ARROW_GENERATE_FOR_ALL_INTEGER_TYPES(CALL_TYPE_SPECIFIC_CONVERT);
// LCOV_EXCL_START: The following invalid causes program failure.

This comment has been minimized.

Copy link
@pitrou

pitrou Sep 9, 2019

Contributor

Not sure I understand: how does returning Status::Invalid cause program failure?

This comment has been minimized.

Copy link
@mrkn

mrkn Sep 10, 2019

Author Member

Because the result of Convert function is checked by ARROW_CHECK_OK in MakeSparseTensorFromTensor function, and ARROW_CHECK_OK makes the fatal-level ArrowLog object that makes the program failure.

@pitrou
pitrou approved these changes Sep 10, 2019
@pitrou

This comment has been minimized.

Copy link
Contributor

pitrou commented Sep 10, 2019

The C glib CI failures should be unrelated. @mrkn Do you want to add validation here or as a separate ticket? If the latter, I will merge this PR.

@pitrou pitrou force-pushed the mrkn:ARROW-5736 branch from 50c6094 to b0be574 Sep 10, 2019
@pitrou

This comment has been minimized.

Copy link
Contributor

pitrou commented Sep 10, 2019

I've rebased. Hopefully this will make CI green.

@mrkn

This comment has been minimized.

Copy link
Member Author

mrkn commented Sep 10, 2019

@pitrou I'll add new factory functions with validation in the separated pull-request.

@pitrou pitrou closed this in 0fbaff6 Sep 10, 2019
@pitrou

This comment has been minimized.

Copy link
Contributor

pitrou commented Sep 10, 2019

Thanks @mrkn !

pprudhvi added a commit to pprudhvi/arrow that referenced this pull request Sep 16, 2019
…ensor

I'd like to enable all integer types to become the value type of sparse tensor index.

Closes apache#5290 from mrkn/ARROW-5736 and squashes the following commits:

b0be574 <Kenta Murata> Add check SparseCOOIndex::coords_->ndim() in the constructor
ad725c2 <Kenta Murata> Remove needless include
84e6726 <Kenta Murata> Fix typo
76e5ca7 <Kenta Murata> Increase coverage rate
2226c9a <Kenta Murata> Fix compile errors on MSVC
f1cf53f <Kenta Murata> Insert an explicit down cast
56ffc1b <Kenta Murata> Add #include <limits>
a3fd502 <Kenta Murata> Refactoring
f0222ab <Kenta Murata> Suppor reading and writing
f2a1920 <Kenta Murata> Use fixtures in TestSparseCSRMatrix
4079906 <Kenta Murata> Refactoring of TestSparseCOOTensor
980cf96 <Kenta Murata> Refactoring
becc425 <Kenta Murata> Enable all integer types to become the value type of SparseCOOIndex
46acdb1 <Kenta Murata> Refactor ipc-read-write-test with a typed test case
b0053de <Kenta Murata> Add assertions for dim_names and dim_name
b7b47c7 <Kenta Murata> Refactor sparse_tensor-test with a typed test case
45be5c4 <Kenta Murata> Support reading and writing row-major SparseCOOIndex
33b6d1f <Kenta Murata> Make SparseCOOIndex support row-major index
5f8ba9f <Kenta Murata> Add indicesStrides in SparseTensorIndexCOO
5ec7e65 <Kenta Murata> Modify the comment of SparseTensorIndexCOO
8a0f65b <Kenta Murata> Add type fields of sparse index types

Authored-by: Kenta Murata <mrkn@mrkn.jp>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@mrkn mrkn deleted the mrkn:ARROW-5736 branch Oct 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.