
ARROW-544: [C++] Test writing zero-length record batches, zero-length BinaryArray fixes #333

Closed. Wants to merge 3 commits.

Conversation

@wesm (Member) commented Feb 9, 2017

I believe this should fix the failure reported in the Spark integration work. We'll need to upgrade the conda test packages to verify. cc @BryanCutler

Commit: Change-Id: Ia47370298a009feef2935fc4ab89c3541f7b9a55
Commit: Change-Id: I1841f42c833bd89f2f8ce29ddb88e6353c9e1ac7
An inline review comment was left on this hunk of the diff:

      raw_type_ids_(nullptr),
      value_offsets_(value_offsets),
      raw_value_offsets_(nullptr) {
    if (type_ids) { raw_type_ids_ = reinterpret_cast<const uint8_t*>(type_ids->data()); }
    if (value_offsets) {

A reviewer (Member) commented:
Just wondering if it would be slightly better to check against nullptr here also?

@wesm (Member, Author) replied:
With shared pointers this is the same as value_offsets != nullptr.
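
For reference, a minimal standalone sketch (not from this PR) showing that std::shared_ptr's boolean conversion and the explicit nullptr comparison test the same condition:

#include <cassert>
#include <memory>

int main() {
  std::shared_ptr<int> empty;                            // holds nullptr
  std::shared_ptr<int> full = std::make_shared<int>(7);

  // shared_ptr's explicit operator bool returns get() != nullptr, so
  // `if (ptr)` and `if (ptr != nullptr)` are interchangeable.
  assert(static_cast<bool>(empty) == (empty != nullptr));
  assert(static_cast<bool>(full) == (full != nullptr));
  return 0;
}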

@BryanCutler (Member) commented:

Thanks! I'll try it out.

@BryanCutler (Member) commented:

Still seeing the same error. Here is the stack trace:

#2  0x00007ffff781ebd7 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x7fffd7538eb0 "i < size()", 
    file=file@entry=0x7fffd7538d90 "/home/bryan/git/arrow/cpp/debug/flatbuffers_ep-prefix/src/flatbuffers_ep-install/include/flatbuffers/flatbuffers.h", line=line@entry=299, 
    function=function@entry=0x7fffd7539180 <flatbuffers::Vector<org::apache::arrow::flatbuf::Buffer const*>::Get(unsigned int) const::__PRETTY_FUNCTION__> "flatbuffers::Vector<T>::return_type flatbuffers::Vector<T>::Get(flatbuffers::uoffset_t) const [with T = const org::apache::arrow::flatbuf::Buffer*; flatbuffers::Vector<T>::return_type = const org::apa"...)
    at assert.c:92
        str = 0x13b91e0 "0A\t\001"
        total = 4096
#3  0x00007ffff781ec82 in __GI___assert_fail (assertion=0x7fffd7538eb0 "i < size()", 
    file=0x7fffd7538d90 "/home/bryan/git/arrow/cpp/debug/flatbuffers_ep-prefix/src/flatbuffers_ep-install/include/flatbuffers/flatbuffers.h", line=299, 
    function=0x7fffd7539180 <flatbuffers::Vector<org::apache::arrow::flatbuf::Buffer const*>::Get(unsigned int) const::__PRETTY_FUNCTION__> "flatbuffers::Vector<T>::return_type flatbuffers::Vector<T>::Get(flatbuffers::uoffset_t) const [with T = const org::apache::arrow::flatbuf::Buffer*; flatbuffers::Vector<T>::return_type = const org::apa"...) at assert.c:101
No locals.
#4  0x00007fffd7524385 in flatbuffers::Vector<org::apache::arrow::flatbuf::Buffer const*>::Get (this=0x13a8f02, i=1)
    at /home/bryan/git/arrow/cpp/debug/flatbuffers_ep-prefix/src/flatbuffers_ep-install/include/flatbuffers/flatbuffers.h:299
        __PRETTY_FUNCTION__ = "flatbuffers::Vector<T>::return_type flatbuffers::Vector<T>::Get(flatbuffers::uoffset_t) const [with T = const org::apache::arrow::flatbuf::Buffer*; flatbuffers::Vector<T>::return_type = const org::apa"...
#5  0x00007fffd7523c42 in arrow::ipc::RecordBatchMetadata::RecordBatchMetadataImpl::buffer (this=0x14537a0, i=1)
    at /home/bryan/git/arrow/cpp/src/arrow/ipc/metadata.cc:168
No locals.
#6  0x00007fffd7523506 in arrow::ipc::RecordBatchMetadata::buffer (this=0x14c68b0, i=1)
    at /home/bryan/git/arrow/cpp/src/arrow/ipc/metadata.cc:214
        buffer = 0x7fffffffcbc0
        result = {
          page = 91, 
          offset = -3301766484703332096, 
          length = 0
        }
#7  0x00007fffd74da44d in arrow::ipc::ArrayLoader::GetBuffer (this=0x7fffffffcd90, buffer_index=1, out=0x7fffffffcbd0)
    at /home/bryan/git/arrow/cpp/src/arrow/ipc/adapter.cc:550
        metadata = {
          page = 91,
          offset = -3301766484703332096, 
          length = 0
        }
#8  0x00007fffd74dee7f in arrow::ipc::ArrayLoader::LoadBinary<arrow::StringArray> (this=0x7fffffffcd90)
    at /home/bryan/git/arrow/cpp/src/arrow/ipc/adapter.cc:605
        _s = {
          state_ = 0x0
        }
        field_meta = {
          length = 0, 
          null_count = 0
        }
        null_bitmap = std::shared_ptr (empty) 0x0
        offsets = std::shared_ptr (empty) 0x0
        values = std::shared_ptr (empty) 0x0
#9  0x00007fffd74dadd0 in arrow::ipc::ArrayLoader::Visit (this=0x7fffffffcd90, type=...)
    at /home/bryan/git/arrow/cpp/src/arrow/ipc/adapter.cc:639
No locals.
#10 0x00007fffdcdeb6b6 in arrow::StringType::Accept (this=0x11bffd0, visitor=0x7fffffffcd90)
    at /home/bryan/git/arrow/cpp/src/arrow/type.cc:172
No locals.
#11 0x00007fffd74da1b9 in arrow::ipc::ArrayLoader::Load (this=0x7fffffffcd90, out=0x103e8b0)
    at /home/bryan/git/arrow/cpp/src/arrow/ipc/adapter.cc:527
        _s = {
          state_ = 0x7fffffffccf0 " \315\377\377\377\177"
        }
#12 0x00007fffd74dc149 in arrow::ipc::RecordBatchReader::Read (this=0x7fffffffce20, out=0xa2a300)
    at /home/bryan/git/arrow/cpp/src/arrow/ipc/adapter.cc:755
        _s = {
          state_ = 0x13c5ad0 "\340Z<\001"
        }
        loader = {
          <arrow::TypeVisitor> = {
            _vptr.TypeVisitor = 0x7fffd7764490 <vtable for arrow::ipc::ArrayLoader+16>
          }, 
          members of arrow::ipc::ArrayLoader: 
          field_ = @0x13c5ad0, 
          context_ = 0x7fffffffce20, 
          file_ = 0x7fffffffcf80, 
          result_ = std::shared_ptr (empty) 0x0
        }
        i = 0
        arrays = std::vector of length 3, capacity 3 = {std::shared_ptr (empty) 0x0, std::shared_ptr (empty) 0x0, 
          std::shared_ptr (empty) 0x0}
#13 0x00007fffd74d6143 in arrow::ipc::ReadRecordBatch (metadata=std::shared_ptr (count 2, weak 0) 0x14c68b0, 
    schema=std::shared_ptr (count 2, weak 0) 0x14833e0, max_recursion_depth=64, file=0x7fffffffcf80, out=0xa2a300)
    at /home/bryan/git/arrow/cpp/src/arrow/ipc/adapter.cc:800
        reader = {
          context_ = {
            metadata = 0x14c68b0, 
            buffer_index = 2, 
            field_index = 1, 
            max_recursion_depth = 64
          }, 
          metadata_ = std::shared_ptr (count 2, weak 0) 0x14c68b0, 
          schema_ = std::shared_ptr (count 2, weak 0) 0x14833e0, 
          max_recursion_depth_ = 64, 
          file_ = 0x7fffffffcf80

Seems like the problem happens when loading the BinaryArray, before it's constructed?

@BryanCutler (Member) commented:

This patch seems to do the trick, but I just followed the function above, so I'm not sure if it's correct:

diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc
index 3613ccb..f35bfe4 100644
--- a/cpp/src/arrow/ipc/adapter.cc
+++ b/cpp/src/arrow/ipc/adapter.cc
@@ -602,8 +602,14 @@ class ArrayLoader : public TypeVisitor {
 
     std::shared_ptr<Buffer> offsets;
     std::shared_ptr<Buffer> values;
-    RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &offsets));
-    RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &values));
+    if (field_meta.length > 0) {
+      RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &offsets));
+      RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &values));
+    } else {
+      context_->buffer_index += 2;
+      offsets.reset(new Buffer(nullptr, 0));
+      values.reset(new Buffer(nullptr, 0));
+    }
 
     result_ = std::make_shared<CONTAINER>(
         field_meta.length, offsets, values, null_bitmap, field_meta.null_count);
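
For a self-contained view of what this patch does, here is a minimal sketch with simplified stand-in types and a hypothetical helper name (the real logic lives in ArrayLoader::LoadBinary in cpp/src/arrow/ipc/adapter.cc): when the field is empty, skip the metadata lookups but keep the buffer index in step, and substitute zero-length buffers.

#include <cstdint>
#include <memory>
#include <vector>

// Simplified stand-in for arrow::Buffer: a pointer plus a length.
struct Buffer {
  Buffer(const uint8_t* data, int64_t size) : data(data), size(size) {}
  const uint8_t* data;
  int64_t size;
};

// Hypothetical helper mirroring the guard above. For a length-0 field the
// writer may not have emitted offsets/values entries at all, so looking
// them up would fail (the flatbuffers "i < size()" assertion in the
// backtrace); hand back empty buffers instead.
void LoadBinaryBuffers(const std::vector<Buffer>& metadata_buffers,
                       int64_t field_length, int* buffer_index,
                       std::shared_ptr<Buffer>* offsets,
                       std::shared_ptr<Buffer>* values) {
  if (field_length > 0) {
    *offsets = std::make_shared<Buffer>(metadata_buffers.at((*buffer_index)++));
    *values = std::make_shared<Buffer>(metadata_buffers.at((*buffer_index)++));
  } else {
    *buffer_index += 2;  // keep the buffer accounting consistent
    offsets->reset(new Buffer(nullptr, 0));
    values->reset(new Buffer(nullptr, 0));
  }
}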

@wesm (Member, Author) commented Feb 9, 2017

@BryanCutler this appears to be a bug on the Java side -- the record batch that was written was incomplete or empty. EDIT: which side produced the malformed metadata, the Spark/Arrow integration code or Arrow itself?

@wesm (Member, Author) commented Feb 9, 2017

To be clear, I will add your patch as a stopgap, but it would be very difficult for me to construct a test case.

Commit: Change-Id: If9214960d7f7d05af85a2bf265521aa99190c374
@BryanCutler (Member) commented:
So this is happening when Spark has a Dataset with at least one empty partition, which converts to an empty ArrowRecordBatch. I can inspect the metadata if you like, but it looks like the body size is 0.

From a Dataset with 3 rows across 4 partitions:

17/02/09 15:34:26 DEBUG ArrowWriter: magic written, now at 6
17/02/09 15:34:26 DEBUG ArrowWriter: magic written, now at 6
17/02/09 15:34:26 DEBUG ArrowWriter: magic written, now at 6
17/02/09 15:34:26 DEBUG ArrowWriter: magic written, now at 6
17/02/09 15:34:26 DEBUG ArrowWriter: RecordBatch at 374, metadata: 98, body: 0
17/02/09 15:34:26 DEBUG ArrowWriter: RecordBatch at 374, metadata: 282, body: 56
17/02/09 15:34:26 DEBUG ArrowWriter: RecordBatch at 374, metadata: 282, body: 56
17/02/09 15:34:26 DEBUG ArrowWriter: RecordBatch at 374, metadata: 282, body: 56
17/02/09 15:34:26 DEBUG ArrowWriter: Footer starts at 472, length: 408
17/02/09 15:34:26 DEBUG ArrowWriter: Footer starts at 712, length: 408
17/02/09 15:34:26 DEBUG ArrowWriter: Footer starts at 712, length: 408
17/02/09 15:34:26 DEBUG ArrowWriter: Footer starts at 712, length: 408
17/02/09 15:34:26 DEBUG ArrowWriter: magic written, now at 1130
17/02/09 15:34:26 DEBUG ArrowWriter: magic written, now at 1130
17/02/09 15:34:26 DEBUG ArrowWriter: magic written, now at 890
17/02/09 15:34:26 DEBUG ArrowWriter: magic written, now at 1130

@BryanCutler (Member) commented:
Is it incorrect to have an empty ArrowRecordBatch?

@wesm (Member, Author) commented Feb 9, 2017

From the backtrace it looks like the record batch metadata is empty. Even for a length-0 partition, with a known schema we would expect metadata with all zero-length buffers. I agree that it's not especially useful to generate a bunch of metadata with no value -- I'd be OK with indicating in the specification that for length-0 batches the buffer and field metadata can be empty. @julienledem any thoughts?

@julienledem (Member) commented:
@wesm I agree that for empty RecordBatches we should return FieldNodes with length=0 and null_count=0.
Looking at the Java side, I don't see any special case:
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java
Could you point at the code that calls this?

@wesm (Member, Author) commented Feb 10, 2017

FieldNodes yes, but not necessarily Buffers. Somehow Bryan's code is not sending the buffers (they would all be length zero anyway).
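
To make the FieldNode/Buffer distinction concrete, a minimal sketch: the struct shapes below mirror the fields visible in the backtrace above (length/null_count, page/offset/length) rather than Arrow's actual flatbuffers-generated types.

#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative mirrors of the metadata shapes seen in the backtrace.
struct FieldNode { int64_t length; int64_t null_count; };
struct BufferMeta { int32_t page; int64_t offset; int64_t length; };

struct RecordBatchMeta {
  std::vector<FieldNode> nodes;     // expected: one per field, even if empty
  std::vector<BufferMeta> buffers;  // the empty-partition writer sent none
};

int main() {
  // Length-0 batch as written: FieldNodes present, Buffer entries missing.
  RecordBatchMeta meta;
  meta.nodes = {{0, 0}, {0, 0}, {0, 0}};

  // A reader that unconditionally consumes two buffers per binary field
  // walks past buffers.size(), which is the flatbuffers `i < size()`
  // assertion in the stack trace above.
  std::size_t buffer_index = 0;
  for (const FieldNode& node : meta.nodes) {
    (void)node;          // the per-field metadata itself is fine
    buffer_index += 2;   // offsets + values lookups that don't exist
  }
  return buffer_index > meta.buffers.size() ? 1 : 0;  // 6 > 0: out of bounds
}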

@wesm (Member, Author) commented Feb 10, 2017

@BryanCutler if this is working for you, I can merge and we can create a JIRA about documenting the IPC conventions for length-0 row batches.

@BryanCutler (Member) commented:
@wesm, yeah, it seems to be working for me. I can try to reproduce it outside of Spark also.

@wesm (Member, Author) commented Feb 10, 2017

Thanks. +1

@asfgit closed this in 42b55d9 on Feb 10, 2017
@wesm deleted the ARROW-544 branch on February 10, 2017
wesm added a commit to wesm/arrow that referenced this pull request on Sep 2, 2018 (the same commit was referenced again on Sep 4, 6, 7, and 8, 2018):

Use PIMPL pattern to hide zlib from public API

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#333 from wesm/gzip-pimpl and squashes the following commits:

adbca51 [Wes McKinney] Make the appveyor build matrix a little smaller
7d88204 [Wes McKinney] cpplint
39a85bd [Wes McKinney] Use override for virtuals
1064669 [Wes McKinney] Fix up GZIP pimpl
6de81df [Wes McKinney] WIP converting GZipCodec to PIMPL to hide zlib.h

Change-Id: Ic0d39fb0a6c642220b33e9a36ed4d81b62bccbd1