-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-3479: [R] Support to write record_batch as stream #2727
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you open a JIRA for this and maybe write a unit test that round trips to the R raw type?
r/R/RecordBatch.R
Outdated
@@ -23,6 +23,7 @@ | |||
num_rows = function() RecordBatch__num_rows(self), | |||
schema = function() `arrow::Schema`$new(RecordBatch__schema(self)), | |||
to_file = function(path) invisible(RecordBatch__to_file(self, fs::path_abs(path))), | |||
to_raw = function() RecordBatch__to_raw(self), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "to_stream_raw`?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about to_stream()
, I think this implies that the returned data is raw()
.
36d3456
to
ec1a8c2
Compare
r/src/RecordBatch.cpp
Outdated
std::shared_ptr<arrow::ipc::RecordBatchWriter> mockWriter; | ||
R_ERROR_NOT_OK(arrow::ipc::RecordBatchStreamWriter::Open(mockSink.get(), | ||
batch->schema(), | ||
&mockWriter)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the function arrow::ipc::WriteRecordBatchStream({batch}, &mock_writer)
to save yourself about 3 lines.
r/src/RecordBatch.cpp
Outdated
MemoryPool* pool = default_memory_pool(); | ||
RawVector RecordBatch__to_stream(const std::shared_ptr<arrow::RecordBatch>& batch) { | ||
std::unique_ptr<io::MockOutputStream> mockSink; | ||
mockSink.reset(new io::MockOutputStream()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just declare this on the stack
io::MockOutputStream mock_sink;
r/src/RecordBatch.cpp
Outdated
RawVector res(mockSink->GetExtentBytesWritten()); | ||
|
||
std::unique_ptr<RawVectorOutputStream> sink; | ||
sink.reset(new RawVectorOutputStream(res)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a bit elaborate. Is there a way to get the pointer to the raw memory in res
? Then you can just use FixedSizeBufferWriter
https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/memory.h#L103 cc @romainfrancois
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
res.begin()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, then RawVectorOutputStream
is not needed
r/src/RecordBatch.cpp
Outdated
R_ERROR_NOT_OK(arrow::ipc::RecordBatchStreamWriter::Open(sink.get(), batch->schema(), &writer)); | ||
R_ERROR_NOT_OK(arrow::ipc::RecordBatchStreamWriter::Open(sink.get(), | ||
batch->schema(), | ||
&writer)); | ||
|
||
R_ERROR_NOT_OK(writer->WriteRecordBatch(*batch)); | ||
R_ERROR_NOT_OK(writer->Close()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
I think we should start with simpler batch <- ...
stream(batch, output_stream(...)) |
Here is the JIRA issue: https://issues.apache.org/jira/browse/ARROW-3479 |
bbb2f99
to
0e302a5
Compare
@romainfrancois that makes sense to me; however, I still like this PR as it is to make progress in |
Codecov Report
@@ Coverage Diff @@
## master #2727 +/- ##
==========================================
+ Coverage 87.57% 88.52% +0.94%
==========================================
Files 402 341 -61
Lines 61454 57649 -3805
==========================================
- Hits 53821 51031 -2790
+ Misses 7561 6618 -943
+ Partials 72 0 -72 Continue to review full report at Codecov.
|
LGTM, but I might still change the interface for streaming out, after #2714 is merged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thanks @javierluraschi!
Using this PR as a WIP to efficiently transfer data from R to Spark using Arrow.
This PR might be ultimately closed and not merged, but thought it would be good to give visibility as to what I'm exploring.
Specifically, I'm working on supporting efficient execution of:
Currently, without this PR and without using
arrow
:Using
arrow
is down to:and down to the following while using
record$to_raw()
from this PR instead ofrecord$to_file()
: