
ARROW-3479: [R] Support to write record_batch as stream #2727

Closed
wants to merge 7 commits

Conversation

@javierluraschi (Contributor) commented Oct 8, 2018

Using this PR as a WIP to efficiently transfer data from R to Spark using Arrow.

This PR might be ultimately closed and not merged, but thought it would be good to give visibility as to what I'm exploring.

Specifically, I'm working on supporting efficient execution of:

library(sparklyr)
sc <- spark_connect(master = "local")
system.time({
  tbl_data <- sdf_copy_to(sc, data.frame(y = runif(10^6, 0, 1)), "data", overwrite = TRUE)
})

Currently, without this PR and without using arrow:

system.time({
  tbl_data <- sdf_copy_to(sc, data.frame(y = runif(10^6, 0, 1)), "data", overwrite = TRUE)
})
   user  system elapsed 
  1.120   0.087   3.482 

Using arrow is down to:

library(arrow)
system.time({
  tbl_data <- sdf_copy_to(sc, data.frame(y = runif(10^6, 0, 1)), "data", overwrite = TRUE)
})
   user  system elapsed 
  0.222   0.029   0.641 

and down to the following while using record$to_raw() from this PR instead of record$to_file():

   user  system elapsed 
  0.102   0.007   0.351 

@wesm (Member) left a comment

Can you open a JIRA for this and maybe write a unit test that round trips to the R raw type?
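A round-trip test of the kind requested could follow the pattern sketched below. This is a minimal, self-contained C++ illustration of the round-trip property only; `to_stream`/`from_stream` here are hypothetical stand-ins, and the real unit test would go through the arrow R bindings and the R `raw` type:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical serializer pair standing in for the R binding's
// to-stream conversion and a matching reader.
std::vector<uint8_t> to_stream(const std::string& batch) {
  return std::vector<uint8_t>(batch.begin(), batch.end());
}

std::string from_stream(const std::vector<uint8_t>& raw) {
  return std::string(raw.begin(), raw.end());
}

// The property the unit test should check: decode(encode(x)) == x.
bool round_trips(const std::string& batch) {
  return from_stream(to_stream(batch)) == batch;
}
```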

r/src/RecordBatch.cpp
@@ -23,6 +23,7 @@
num_rows = function() RecordBatch__num_rows(self),
schema = function() `arrow::Schema`$new(RecordBatch__schema(self)),
to_file = function(path) invisible(RecordBatch__to_file(self, fs::path_abs(path))),
to_raw = function() RecordBatch__to_raw(self),
Member:

Maybe `to_stream_raw`?

Contributor Author:

How about `to_stream()`? I think the name implies that the returned data is `raw()`.

std::shared_ptr<arrow::ipc::RecordBatchWriter> mockWriter;
R_ERROR_NOT_OK(arrow::ipc::RecordBatchStreamWriter::Open(mockSink.get(),
batch->schema(),
&mockWriter));
Member:

Use the function arrow::ipc::WriteRecordBatchStream({batch}, &mock_writer) to save yourself about 3 lines.

MemoryPool* pool = default_memory_pool();
RawVector RecordBatch__to_stream(const std::shared_ptr<arrow::RecordBatch>& batch) {
std::unique_ptr<io::MockOutputStream> mockSink;
mockSink.reset(new io::MockOutputStream());
Member:

Just declare this on the stack

io::MockOutputStream mock_sink;

RawVector res(mockSink->GetExtentBytesWritten());

std::unique_ptr<RawVectorOutputStream> sink;
sink.reset(new RawVectorOutputStream(res));
Member:

This seems a bit elaborate. Is there a way to get the pointer to the raw memory in res? Then you can just use FixedSizeBufferWriter https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/memory.h#L103 cc @romainfrancois

Contributor:

res.begin()

Member:

OK, then RawVectorOutputStream is not needed
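The pattern settled on in this thread — count bytes with a mock sink, allocate the R vector, then write directly into its memory — can be sketched without the Arrow API. In this self-contained illustration, `CountingSink` plays the role of `arrow::io::MockOutputStream` and `FixedBufferSink` plays the role of `arrow::io::FixedSizeBufferWriter` wrapped around `res.begin()`; all names here are illustrative, not Arrow's:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Pass 1: a sink that only counts bytes (MockOutputStream analogue).
struct CountingSink {
  size_t written = 0;
  void Write(const void* data, size_t n) {
    (void)data;  // a dry run: only the size matters
    written += n;
  }
};

// Pass 2: a sink that writes into caller-owned memory through a raw
// pointer (FixedSizeBufferWriter over res.begin() analogue).
struct FixedBufferSink {
  uint8_t* pos;
  explicit FixedBufferSink(uint8_t* dest) : pos(dest) {}
  void Write(const void* data, size_t n) {
    std::memcpy(pos, data, n);
    pos += n;
  }
};

// Serialize `payload` with the two-pass pattern and return the buffer.
std::vector<uint8_t> two_pass_stream(const std::string& payload) {
  CountingSink counter;
  counter.Write(payload.data(), payload.size());  // measure
  std::vector<uint8_t> res(counter.written);      // RawVector analogue
  FixedBufferSink sink(res.data());               // res.begin() analogue
  sink.Write(payload.data(), payload.size());     // real write
  return res;
}
```

The payoff of this design is that the destination vector is allocated once at exactly the right size and no intermediate copy or custom stream class is needed.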

R_ERROR_NOT_OK(arrow::ipc::RecordBatchStreamWriter::Open(sink.get(), batch->schema(), &writer));
R_ERROR_NOT_OK(arrow::ipc::RecordBatchStreamWriter::Open(sink.get(),
batch->schema(),
&writer));

R_ERROR_NOT_OK(writer->WriteRecordBatch(*batch));
R_ERROR_NOT_OK(writer->Close());
Member:

Ditto

@romainfrancois (Contributor)

I think we should start with a simpler OutputStream first, e.g. replace $to_file with a write_record_batch <- function(batch, stream){ ... } S3 generic, or maybe even a stream generic with double dispatch on both what is streamed and the stream, so that we can do e.g.:

batch <- ...
stream(batch, output_stream(...))
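The proposed `stream(what, where)` generic dispatches on both arguments. A loose C++ analogue of that double dispatch, using plain overloading, is sketched below; all of the types and the `stream` overloads are hypothetical stand-ins for the R design, not actual Arrow classes:

```cpp
#include <cassert>
#include <string>

// Hypothetical payload and sink types mirroring the proposed R design.
struct RecordBatch { std::string data; };
struct FileStream  { std::string path; std::string contents; };
struct RawStream   { std::string bytes; };

// "Double dispatch" via overloading: one stream() per (payload, sink)
// pair, so adding a new sink type means adding one overload.
std::string stream(const RecordBatch& b, FileStream& out) {
  out.contents = b.data;  // stand-in for writing an IPC file
  return "record_batch->file:" + out.path;
}

std::string stream(const RecordBatch& b, RawStream& out) {
  out.bytes = b.data;     // stand-in for writing an in-memory stream
  return "record_batch->raw";
}
```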

@javierluraschi (Contributor, Author)

Here is the JIRA issue: https://issues.apache.org/jira/browse/ARROW-3479

@javierluraschi javierluraschi changed the title [WIP] Improvements to support R to Spark in socket serialization ARROW-3479: [R] Support to write record_batch as stream Oct 9, 2018
@javierluraschi (Contributor, Author)

@romainfrancois that makes sense to me; however, I'd still like this PR as it is to make progress in sparklyr. I'm not making the sparklyr work public for several months, so feel free to override this function with a more appropriate binding. I can also take a look at this at some point, but first I want to get data from Spark to R implemented, since R to Spark is currently in a decent place.

@codecov-io

Codecov Report

Merging #2727 into master will increase coverage by 0.94%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #2727      +/-   ##
==========================================
+ Coverage   87.57%   88.52%   +0.94%     
==========================================
  Files         402      341      -61     
  Lines       61454    57649    -3805     
==========================================
- Hits        53821    51031    -2790     
+ Misses       7561     6618     -943     
+ Partials       72        0      -72
Impacted Files
rust/src/record_batch.rs
go/arrow/datatype_nested.go
rust/src/util/bit_util.rs
go/arrow/math/uint64_amd64.go
go/arrow/internal/testing/tools/bool.go
go/arrow/internal/bitutil/bitutil.go
go/arrow/memory/memory_avx2_amd64.go
go/arrow/array/null.go
rust/src/lib.rs
rust/src/array.rs
... and 51 more

Last update f4f6269...0e302a5.

@romainfrancois (Contributor)

LGTM, but I might still change the interface for streaming out, after #2714 is merged

@wesm (Member) left a comment

+1, thanks @javierluraschi!
