ARROW-544: [C++] Test writing zero-length record batches, zero-length BinaryArray fixes #333
Conversation
      raw_type_ids_(nullptr),
      value_offsets_(value_offsets),
      raw_value_offsets_(nullptr) {
  if (type_ids) { raw_type_ids_ = reinterpret_cast<const uint8_t*>(type_ids->data()); }
  if (value_offsets) {
Just wondering if it would be slightly better to check against nullptr here also?
With shared pointers this is the same as value_offsets != nullptr.
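For reference, std::shared_ptr defines its boolean conversion as get() != nullptr, so the truthiness check in the diff and an explicit nullptr comparison are interchangeable. A minimal standalone sketch using only the standard library:

#include <cassert>
#include <memory>

int main() {
  std::shared_ptr<int> empty;                 // default-constructed: holds nullptr
  auto full = std::make_shared<int>(42);

  // std::shared_ptr's explicit operator bool is specified as get() != nullptr,
  // so `if (p)` and `if (p != nullptr)` are equivalent checks.
  assert(!empty && empty == nullptr);
  assert(full && full != nullptr);
  return 0;
}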
Thanks! I'll try it out.
Still seeing the same error. Here is the stack trace
Seems like the problem is when loading the BinaryArray before it's constructed?
This patch seems to do the trick, but I just followed the function above, so I'm not sure if it's correct.
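The patch itself isn't reproduced above. Purely as an illustration of the guard pattern being discussed (hypothetical types and names, modeled on the constructor fragment in the diff), it would look something like:

#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for an Arrow buffer; names here are illustrative only.
struct Buffer {
  const uint8_t* data() const { return bytes.data(); }
  std::vector<uint8_t> bytes;
};

// The guard under discussion: only dereference the buffer when the
// shared_ptr is non-null, mirroring the type_ids check in the diff above.
struct BinaryArrayLike {
  explicit BinaryArrayLike(std::shared_ptr<Buffer> value_offsets)
      : value_offsets_(std::move(value_offsets)), raw_value_offsets_(nullptr) {
    if (value_offsets_) {
      raw_value_offsets_ = reinterpret_cast<const int32_t*>(value_offsets_->data());
    }
  }

  std::shared_ptr<Buffer> value_offsets_;
  const int32_t* raw_value_offsets_;
};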
@BryanCutler this appears to be a bug on the Java side -- the record batch that was written was incomplete or empty. EDIT: is the malformed metadata the fault of the Spark/Arrow interface, or of Arrow itself?
To be clear, I will add your patch as a stopgap, but it would be very difficult for me to construct a test case.
So this is happening when Spark has a Dataset with at least one empty partition, which converts to an empty ArrowRecordBatch. I can inspect the metadata if you like, but it looks like the body size is 0 for a Dataset with 3 rows / 4 partitions.
Is it incorrect to have an empty ArrowRecordBatch?
From the backtrace it looks like the record batch metadata is empty. Even for a length-0 partition, with a known schema we would expect metadata with all zero-length buffers. I agree that it's not especially useful to generate a bunch of metadata with no value -- I'd be OK with indicating in the specification that for length-0 batches the buffer and field metadata can be empty. @julienledem any thoughts?
@wesm I agree that for empty RecordBatches we should return FieldNodes with length=0 and null_count=0.
FieldNodes yes, but not necessarily Buffers. Somehow Bryan's code is not sending the buffers (they would all be length zero anyway).
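To make the proposed convention concrete, here is a hypothetical sketch. The structs are illustrative only, not the actual Arrow flatbuffer metadata types: a length-0 batch still carries one FieldNode per field with length = 0 and null_count = 0, while the buffer list may be left empty entirely.

#include <cstdint>
#include <vector>

// Illustrative stand-ins for the IPC metadata being discussed.
struct FieldNode {
  int64_t length;
  int64_t null_count;
};

struct RecordBatchMetadata {
  int64_t num_rows;
  std::vector<FieldNode> nodes;         // one per flattened field
  std::vector<int64_t> buffer_lengths;  // may be empty when num_rows == 0
};

// The convention proposed in this thread: field nodes are always present,
// buffers may be omitted for a zero-length batch.
RecordBatchMetadata MakeEmptyBatchMetadata(int num_fields) {
  RecordBatchMetadata meta;
  meta.num_rows = 0;
  meta.nodes.assign(num_fields, FieldNode{0, 0});
  // buffer_lengths intentionally left empty
  return meta;
}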
@BryanCutler if this is working for you, I can merge and we can create a JIRA about documenting the IPC conventions for length-0 row batches.
@wesm, yeah it seems to be working for me. I can try to reproduce it outside of Spark also.
Thanks. +1
Use PIMPL pattern to hide zlib from public API

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#333 from wesm/gzip-pimpl and squashes the following commits:

adbca51 [Wes McKinney] Make the appveyor build matrix a little smaller
7d88204 [Wes McKinney] cpplint
39a85bd [Wes McKinney] Use override for virtuals
1064669 [Wes McKinney] Fix up GZIP pimpl
6de81df [Wes McKinney] WIP converting GZipCodec to PIMPL to hide zlib.h
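For context on the PIMPL commit above: the pattern moves the implementation behind an opaque pointer so the public header no longer has to include zlib.h. A minimal hypothetical sketch (CodecLike, Impl, and Compress are illustrative names, not the actual Arrow GZipCodec), with the header and the .cc contents shown together:

#include <memory>
#include <string>

// --- public header: consumers never see zlib.h on their include path ---
class CodecLike {
 public:
  CodecLike();
  ~CodecLike();  // out-of-line so unique_ptr<Impl> destroys a complete type
  std::string Compress(const std::string& input);

 private:
  class Impl;                   // forward-declared; defined only in the .cc
  std::unique_ptr<Impl> impl_;
};

// --- normally in the .cc file, where zlib.h would actually be included ---
class CodecLike::Impl {
 public:
  std::string Compress(const std::string& input) {
    // The real implementation would drive zlib's deflate() here.
    return input;
  }
};

CodecLike::CodecLike() : impl_(new Impl()) {}
CodecLike::~CodecLike() = default;
std::string CodecLike::Compress(const std::string& input) {
  return impl_->Compress(input);
}

The out-of-line destructor is the key detail: it is defined where Impl is complete, which is what lets std::unique_ptr<Impl> live in the header behind only a forward declaration.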
I believe this should fix the failure reported in the Spark integration work. We'll need to upgrade the conda test packages to verify. cc @BryanCutler