
ARROW-1860: [C++] Introduce ipc::PreparedMessage data structure to avoid making multiple passes over record batches #1500

Closed

@wesm (Member) commented Jan 24, 2018

The purpose of this is to decompose a record batch into its serialized Flatbuffer metadata and sequence of buffers that form its body so that the total output size can be computed, e.g. for writing to a shared memory segment. Prior to this, one would have to call GetRecordBatchSize, allocate memory, then WriteRecordBatch which duplicates work.

This also introduces a change to how padding is handled in unaligned streams to make the message contents deterministic. This makes record batches consistent with the way that tensors were already being written, with alignment bytes being written first to move the stream position to a multiple of 8, then beginning to write the metadata and message body.

@xhochy (Member) approved these changes Jan 24, 2018

+1, LGTM

@wesm (Member, Author) commented Jan 24, 2018

Handled alignment on the read side, also. One question this brings up is whether the Java stream writer writes padded / aligned messages. I believe it does (this is checked for in C++: https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L854). But we should make sure. This is a case where having the Spark integration tests would be really helpful (cc @BryanCutler @icexelloss)

@xhochy (Member) left a comment

Failure is due to a segmentation fault.

@wesm (Member, Author) commented Feb 12, 2018

Picking this up again

wesm added some commits Jan 24, 2018

Introduce PreparedMessage data structure to avoid making multiple passes over RecordBatch to size, and then write

Change-Id: I3f00adfb2571356b500e891d190b20f99ca078f1
Account for padding in unaligned stream when reading message
Change-Id: I72e5a6d83f774aa6b1e99257e1360043174ee661
Fix padding logic on tensor read
Change-Id: I7142bc113f638b4e775ccaee3cbb0143d3c26e13

@wesm wesm force-pushed the wesm:ARROW-1860 branch from a94534a to 011c22a Feb 12, 2018

@wesm (Member, Author) commented Feb 12, 2018

So it seems that not all Python file objects support the tell operation, for example the _io.BufferedReader that you get by calling makefile on a socket. That's the last issue here -- I think probably the best way to deal with this is to implement a PyInputStream in https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/io.h and handle the stream position bookkeeping internally, if that sounds good to folks.

@pitrou (Contributor) commented Apr 10, 2018

> So it seems that not all Python file objects support the tell operation, for example the _io.BufferedReader that you get calling makefile on a socket.

Yes, non-seekable files generally don't support tell() either.
I'm a bit surprised that you need this: if writing to a generic output stream, why do you want to pad data? Padding only sounds useful when writing to a true random access file that may be e.g. memory-mapped.

@wesm (Member, Author) commented May 21, 2018

> why do you want to pad data?

I'll need to revisit this patch from a few months ago and remember why I thought this. The basic idea is for messages to be relocatable and deterministic (e.g. if you were to hash the metadata, you would always get the same hash). By always aligning the stream before writing a message, you guarantee that the message metadata and body have the exact same contents.

Given the time that has passed, it'll probably make sense to start this patch over. There's a unit test here that can be carried over, but the rest of the refactoring should probably be restarted from the codebase as it is now.

@wesm (Member, Author) commented Jun 29, 2018

Abandoning this PR for now. I would like to start again on this later in the year.

@wesm wesm closed this Jun 29, 2018

@wesm wesm deleted the wesm:ARROW-1860 branch Jun 29, 2018

@wesm wesm restored the wesm:ARROW-1860 branch Jun 29, 2018
