Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Add data structure to "stage" a sequence of IPC messages from in-memory data #17853

Closed
asfimport opened this issue Nov 26, 2017 · 5 comments

Comments

@asfimport
Copy link

Currently, when you need to pre-allocate space for a record batch or a stream (schema + dictionaries + record batches), you must make multiple passes over the data structures of interest (and use e.g. MockOutputStream to compute the size of the output buffer). It would be useful to make a single pass to "prepare" the IPC payload for both sizing and writing to prevent having to make multiple passes

Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm

Original Issue Attachments:

PRs and other links:

Note: This issue was originally created as ARROW-1860. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Wes McKinney / @wesm:
@xuepanchen I'm going to take a look at this if you don't mind. There are some nuances in the IPC writer around handling sliced bitmaps (see https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L125), and I want to see if we can improve the write performance in microbenchmarks while doing this refactoring

@asfimport
Copy link
Author

Wes McKinney / @wesm:
I'm running into a slight issue with this refactoring around output streams which are currently at a position that is not a multiple of 8.

see https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L903

This results in the entire message payload being non-deterministic depending on the state of the OutputStream. I don't think it will impact any applications, but I suggest we write padding bytes at the start to an 8-byte offset and then begin writing the message. This is the same approach used by WriteTensor in https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L621.

@robertnishihara @pcmoritz @xhochy does this seem reasonable?

@asfimport
Copy link
Author

Wes McKinney / @wesm:
Moving this to 0.10.0

@asfimport
Copy link
Author

Wes McKinney / @wesm:
I will need to revisit this in the context of Flight RPC, where this data structure may be used to coordinate writes onto the GRPC output stream

@asfimport
Copy link
Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants