
ARROW-544: [C++] Test writing zero-length record batches, zero-length BinaryArray fixes #333

Closed. Wants to merge 3 commits.

Conversation

@wesm (Member) commented Feb 9, 2017

I believe this should fix the failure reported in the Spark integration work. We'll need to upgrade the conda test packages to verify. cc @BryanCutler

Commit: Change-Id: Ia47370298a009feef2935fc4ab89c3541f7b9a55
Commit: Change-Id: I1841f42c833bd89f2f8ce29ddb88e6353c9e1ac7
An inline review comment was left on this hunk of the diff:

      raw_type_ids_(nullptr),
      value_offsets_(value_offsets),
      raw_value_offsets_(nullptr) {
    if (type_ids) { raw_type_ids_ = reinterpret_cast<const uint8_t*>(type_ids->data()); }
    if (value_offsets) {

A reviewer (Member) commented:
Just wondering if it would be slightly better to check against nullptr here also?

@wesm (Member, Author) replied:
With shared pointers this is the same as value_offsets != nullptr.
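
For reference, a minimal standalone sketch (not from this PR) showing that std::shared_ptr's boolean conversion and the explicit nullptr comparison test the same condition:

#include <cassert>
#include <memory>

int main() {
  std::shared_ptr<int> empty;                            // holds nullptr
  std::shared_ptr<int> full = std::make_shared<int>(7);

  // shared_ptr's explicit operator bool returns get() != nullptr, so
  // `if (ptr)` and `if (ptr != nullptr)` are interchangeable.
  assert(static_cast<bool>(empty) == (empty != nullptr));
  assert(static_cast<bool>(full) == (full != nullptr));
  return 0;
}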

@BryanCutler (Member) commented:

Thanks! I'll try it out.

@BryanCutler (Member) commented:

Still seeing the same error. Here is the stack trace:

#2  0x00007ffff781ebd7 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x7fffd7538eb0 "i < size()", 
    file=file@entry=0x7fffd7538d90 "/home/bryan/git/arrow/cpp/debug/flatbuffers_ep-prefix/src/flatbuffers_ep-install/include/flatbuffers/flatbuffers.h", line=line@entry=299, 
    function=function@entry=0x7fffd7539180 <flatbuffers::Vector<org::apache::arrow::flatbuf::Buffer const*>::Get(unsigned int) const::__PRETTY_FUNCTION__> "flatbuffers::Vector<T>::return_type flatbuffers::Vector<T>::Get(flatbuffers::uoffset_t) const [with T = const org::apache::arrow::flatbuf::Buffer*; flatbuffers::Vector<T>::return_type = const org::apa"...)
    at assert.c:92
        str = 0x13b91e0 "0A\t\001"
        total = 4096
#3  0x00007ffff781ec82 in __GI___assert_fail (assertion=0x7fffd7538eb0 "i < size()", 
    file=0x7fffd7538d90 "/home/bryan/git/arrow/cpp/debug/flatbuffers_ep-prefix/src/flatbuffers_ep-install/include/flatbuffers/flatbuffers.h", line=299, 
    function=0x7fffd7539180 <flatbuffers::Vector<org::apache::arrow::flatbuf::Buffer const*>::Get(unsigned int) const::__PRETTY_FUNCTION__> "flatbuffers::Vector<T>::return_type flatbuffers::Vector<T>::Get(flatbuffers::uoffset_t) const [with T = const org::apache::arrow::flatbuf::Buffer*; flatbuffers::Vector<T>::return_type = const org::apa"...) at assert.c:101
No locals.
#4  0x00007fffd7524385 in flatbuffers::Vector<org::apache::arrow::flatbuf::Buffer const*>::Get (this=0x13a8f02, i=1)
    at /home/bryan/git/arrow/cpp/debug/flatbuffers_ep-prefix/src/flatbuffers_ep-install/include/flatbuffers/flatbuffers.h:299
        __PRETTY_FUNCTION__ = "flatbuffers::Vector<T>::return_type flatbuffers::Vector<T>::Get(flatbuffers::uoffset_t) const [with T = const org::apache::arrow::flatbuf::Buffer*; flatbuffers::Vector<T>::return_type = const org::apa"...
#5  0x00007fffd7523c42 in arrow::ipc::RecordBatchMetadata::RecordBatchMetadataImpl::buffer (this=0x14537a0, i=1)
    at /home/bryan/git/arrow/cpp/src/arrow/ipc/metadata.cc:168
No locals.
#6  0x00007fffd7523506 in arrow::ipc::RecordBatchMetadata::buffer (this=0x14c68b0, i=1)
    at /home/bryan/git/arrow/cpp/src/arrow/ipc/metadata.cc:214
        buffer = 0x7fffffffcbc0
        result = {
          page = 91, 
          offset = -3301766484703332096, 
          length = 0
        }
#7  0x00007fffd74da44d in arrow::ipc::ArrayLoader::GetBuffer (this=0x7fffffffcd90, buffer_index=1, out=0x7fffffffcbd0)
    at /home/bryan/git/arrow/cpp/src/arrow/ipc/adapter.cc:550
        metadata = {
          page = 91,
          offset = -3301766484703332096, 
          length = 0
        }
#8  0x00007fffd74dee7f in arrow::ipc::ArrayLoader::LoadBinary<arrow::StringArray> (this=0x7fffffffcd90)
    at /home/bryan/git/arrow/cpp/src/arrow/ipc/adapter.cc:605
        _s = {
          state_ = 0x0
        }
        field_meta = {
          length = 0, 
          null_count = 0
        }
        null_bitmap = std::shared_ptr (empty) 0x0
        offsets = std::shared_ptr (empty) 0x0
        values = std::shared_ptr (empty) 0x0
#9  0x00007fffd74dadd0 in arrow::ipc::ArrayLoader::Visit (this=0x7fffffffcd90, type=...)
    at /home/bryan/git/arrow/cpp/src/arrow/ipc/adapter.cc:639
No locals.
#10 0x00007fffdcdeb6b6 in arrow::StringType::Accept (this=0x11bffd0, visitor=0x7fffffffcd90)
    at /home/bryan/git/arrow/cpp/src/arrow/type.cc:172
No locals.
#11 0x00007fffd74da1b9 in arrow::ipc::ArrayLoader::Load (this=0x7fffffffcd90, out=0x103e8b0)
    at /home/bryan/git/arrow/cpp/src/arrow/ipc/adapter.cc:527
        _s = {
          state_ = 0x7fffffffccf0 " \315\377\377\377\177"
        }
#12 0x00007fffd74dc149 in arrow::ipc::RecordBatchReader::Read (this=0x7fffffffce20, out=0xa2a300)
    at /home/bryan/git/arrow/cpp/src/arrow/ipc/adapter.cc:755
        _s = {
          state_ = 0x13c5ad0 "\340Z<\001"
        }
        loader = {
          <arrow::TypeVisitor> = {
            _vptr.TypeVisitor = 0x7fffd7764490 <vtable for arrow::ipc::ArrayLoader+16>
          }, 
          members of arrow::ipc::ArrayLoader: 
          field_ = @0x13c5ad0, 
          context_ = 0x7fffffffce20, 
          file_ = 0x7fffffffcf80, 
          result_ = std::shared_ptr (empty) 0x0
        }
        i = 0
        arrays = std::vector of length 3, capacity 3 = {std::shared_ptr (empty) 0x0, std::shared_ptr (empty) 0x0, 
          std::shared_ptr (empty) 0x0}
#13 0x00007fffd74d6143 in arrow::ipc::ReadRecordBatch (metadata=std::shared_ptr (count 2, weak 0) 0x14c68b0, 
    schema=std::shared_ptr (count 2, weak 0) 0x14833e0, max_recursion_depth=64, file=0x7fffffffcf80, out=0xa2a300)
    at /home/bryan/git/arrow/cpp/src/arrow/ipc/adapter.cc:800
        reader = {
          context_ = {
            metadata = 0x14c68b0, 
            buffer_index = 2, 
            field_index = 1, 
            max_recursion_depth = 64
          }, 
          metadata_ = std::shared_ptr (count 2, weak 0) 0x14c68b0, 
          schema_ = std::shared_ptr (count 2, weak 0) 0x14833e0, 
          max_recursion_depth_ = 64, 
          file_ = 0x7fffffffcf80

Seems like the problem happens when loading the BinaryArray, before it's constructed?

@BryanCutler (Member) commented:

This patch seems to do the trick, but I just followed the function above, so I'm not sure if it's correct:

diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc
index 3613ccb..f35bfe4 100644
--- a/cpp/src/arrow/ipc/adapter.cc
+++ b/cpp/src/arrow/ipc/adapter.cc
@@ -602,8 +602,14 @@ class ArrayLoader : public TypeVisitor {
 
     std::shared_ptr<Buffer> offsets;
     std::shared_ptr<Buffer> values;
-    RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &offsets));
-    RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &values));
+    if (field_meta.length > 0) {
+      RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &offsets));
+      RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &values));
+    } else {
+      context_->buffer_index += 2;
+      offsets.reset(new Buffer(nullptr, 0));
+      values.reset(new Buffer(nullptr, 0));
+    }
 
     result_ = std::make_shared<CONTAINER>(
         field_meta.length, offsets, values, null_bitmap, field_meta.null_count);
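
For a self-contained view of what this patch does, here is a minimal sketch with simplified stand-in types and a hypothetical helper name (the real logic lives in ArrayLoader::LoadBinary in cpp/src/arrow/ipc/adapter.cc): when the field is empty, skip the metadata lookups but keep the buffer index in step, and substitute zero-length buffers.

#include <cstdint>
#include <memory>
#include <vector>

// Simplified stand-in for arrow::Buffer: a pointer plus a length.
struct Buffer {
  Buffer(const uint8_t* data, int64_t size) : data(data), size(size) {}
  const uint8_t* data;
  int64_t size;
};

// Hypothetical helper mirroring the guard above. For a length-0 field the
// writer may not have emitted offsets/values entries at all, so looking
// them up would fail (the flatbuffers "i < size()" assertion in the
// backtrace); hand back empty buffers instead.
void LoadBinaryBuffers(const std::vector<Buffer>& metadata_buffers,
                       int64_t field_length, int* buffer_index,
                       std::shared_ptr<Buffer>* offsets,
                       std::shared_ptr<Buffer>* values) {
  if (field_length > 0) {
    *offsets = std::make_shared<Buffer>(metadata_buffers.at((*buffer_index)++));
    *values = std::make_shared<Buffer>(metadata_buffers.at((*buffer_index)++));
  } else {
    *buffer_index += 2;  // keep the buffer accounting consistent
    offsets->reset(new Buffer(nullptr, 0));
    values->reset(new Buffer(nullptr, 0));
  }
}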

@wesm (Member, Author) commented Feb 9, 2017

@BryanCutler this appears to be a bug on the Java side -- the record batch that was written was incomplete or empty. EDIT: which side produced the malformed metadata, the Spark/Arrow integration code or Arrow itself?

@wesm (Member, Author) commented Feb 9, 2017

To be clear, I will add your patch as a stopgap, but it would be very difficult for me to construct a test case.

Commit: Change-Id: If9214960d7f7d05af85a2bf265521aa99190c374
@BryanCutler (Member) commented:
So this is happening when Spark has a Dataset with at least one empty partition, which converts to an empty ArrowRecordBatch. I can inspect the metadata if you like, but it looks like the body size is 0.

From a Dataset with 3 rows across 4 partitions:

17/02/09 15:34:26 DEBUG ArrowWriter: magic written, now at 6
17/02/09 15:34:26 DEBUG ArrowWriter: magic written, now at 6
17/02/09 15:34:26 DEBUG ArrowWriter: magic written, now at 6
17/02/09 15:34:26 DEBUG ArrowWriter: magic written, now at 6
17/02/09 15:34:26 DEBUG ArrowWriter: RecordBatch at 374, metadata: 98, body: 0
17/02/09 15:34:26 DEBUG ArrowWriter: RecordBatch at 374, metadata: 282, body: 56
17/02/09 15:34:26 DEBUG ArrowWriter: RecordBatch at 374, metadata: 282, body: 56
17/02/09 15:34:26 DEBUG ArrowWriter: RecordBatch at 374, metadata: 282, body: 56
17/02/09 15:34:26 DEBUG ArrowWriter: Footer starts at 472, length: 408
17/02/09 15:34:26 DEBUG ArrowWriter: Footer starts at 712, length: 408
17/02/09 15:34:26 DEBUG ArrowWriter: Footer starts at 712, length: 408
17/02/09 15:34:26 DEBUG ArrowWriter: Footer starts at 712, length: 408
17/02/09 15:34:26 DEBUG ArrowWriter: magic written, now at 1130
17/02/09 15:34:26 DEBUG ArrowWriter: magic written, now at 1130
17/02/09 15:34:26 DEBUG ArrowWriter: magic written, now at 890
17/02/09 15:34:26 DEBUG ArrowWriter: magic written, now at 1130

@BryanCutler (Member) commented:
Is it incorrect to have an empty ArrowRecordBatch?

@wesm (Member, Author) commented Feb 9, 2017

From the backtrace it looks like the record batch metadata is empty. Even for a length-0 partition, with a known schema we would expect metadata with all zero-length buffers. I agree that it's not especially useful to generate a bunch of metadata with no value -- I'd be OK with indicating in the specification that for length-0 batches the buffer and field metadata can be empty. @julienledem any thoughts?

@julienledem (Member) commented:
@wesm I agree that for empty RecordBatches we should return FieldNodes with length=0 and null_count=0.
Looking at the Java side, I don't see any special case:
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java
Could you point at the code that calls this?

@wesm (Member, Author) commented Feb 10, 2017

FieldNodes yes, but not necessarily Buffers. Somehow Bryan's code is not sending the buffers (they would all be length zero anyway).
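
To make the FieldNode/Buffer distinction concrete, a minimal sketch: the struct shapes below mirror the fields visible in the backtrace above (length/null_count, page/offset/length) rather than Arrow's actual flatbuffers-generated types.

#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative mirrors of the metadata shapes seen in the backtrace.
struct FieldNode { int64_t length; int64_t null_count; };
struct BufferMeta { int32_t page; int64_t offset; int64_t length; };

struct RecordBatchMeta {
  std::vector<FieldNode> nodes;     // expected: one per field, even if empty
  std::vector<BufferMeta> buffers;  // the empty-partition writer sent none
};

int main() {
  // Length-0 batch as written: FieldNodes present, Buffer entries missing.
  RecordBatchMeta meta;
  meta.nodes = {{0, 0}, {0, 0}, {0, 0}};

  // A reader that unconditionally consumes two buffers per binary field
  // walks past buffers.size(), which is the flatbuffers `i < size()`
  // assertion in the stack trace above.
  std::size_t buffer_index = 0;
  for (const FieldNode& node : meta.nodes) {
    (void)node;          // the per-field metadata itself is fine
    buffer_index += 2;   // offsets + values lookups that don't exist
  }
  return buffer_index > meta.buffers.size() ? 1 : 0;  // 6 > 0: out of bounds
}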

@wesm (Member, Author) commented Feb 10, 2017

@BryanCutler if this is working for you, I can merge and we can create a JIRA about documenting the IPC conventions for length-0 row batches.

@BryanCutler (Member) commented:
@wesm, yeah, it seems to be working for me. I can try to reproduce it outside of Spark also.

@wesm (Member, Author) commented Feb 10, 2017

Thanks. +1

@asfgit closed this in 42b55d9 on Feb 10, 2017
@wesm deleted the ARROW-544 branch on February 10, 2017
wesm added a commit to wesm/arrow that referenced this pull request on Sep 2, 2018 (the same commit was referenced again on Sep 4, 6, 7, and 8, 2018):

Use PIMPL pattern to hide zlib from public API

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#333 from wesm/gzip-pimpl and squashes the following commits:

adbca51 [Wes McKinney] Make the appveyor build matrix a little smaller
7d88204 [Wes McKinney] cpplint
39a85bd [Wes McKinney] Use override for virtuals
1064669 [Wes McKinney] Fix up GZIP pimpl
6de81df [Wes McKinney] WIP converting GZipCodec to PIMPL to hide zlib.h

Change-Id: Ic0d39fb0a6c642220b33e9a36ed4d81b62bccbd1