
ARROW-3311: [R] Wrap MemoryMappedFile class #2714

Closed

Conversation

romainfrancois (Contributor) commented Oct 6, 2018

The functions read_record_batch and read_table become S3 generics, with methods handling:

  • arrow::io::RandomAccessFile: can be created with file_open() or mmap_open()
  • a file path (either a character scalar or an fs::fs_path object): in that case a ReadableFile stream is opened, read from via the arrow::io::RandomAccessFile method, and closed on exit
  • arrow::BufferReader
  • a raw vector, which uses the arrow::BufferReader method

The Buffer, ReadableFile and MemoryMappedFile classes are wrapped in R6 classes.
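
To make the dispatch concrete, here is a minimal usage sketch based on the description above. The file path and the raw vector are placeholders, and it assumes the R6 wrappers expose the corresponding arrow::* class names for S3 dispatch:

# Illustrative usage of the generics described above; "data.arrow" and
# some_raw_vector are placeholders, not real inputs.
tab_from_path <- read_table("data.arrow")               # character path: opens a ReadableFile, closed on exit
tab_from_mmap <- read_table(mmap_open("data.arrow"))    # arrow::io::RandomAccessFile method
tab_from_raw  <- read_table(some_raw_vector)            # raw vector: goes through arrow::BufferReader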

codecov-io commented Oct 6, 2018

Codecov Report

Merging #2714 into master will increase coverage by 15.55%.
The diff coverage is n/a.


@@             Coverage Diff             @@
##           master    #2714       +/-   ##
===========================================
+ Coverage   71.93%   87.48%   +15.55%     
===========================================
  Files          61      402      +341     
  Lines        3805    61401    +57596     
===========================================
+ Hits         2737    53718    +50981     
- Misses        994     7609     +6615     
  Partials       74       74
Impacted Files                         Coverage Δ
python/pyarrow/ipc.pxi                 68.65% <0%> (ø)
cpp/src/arrow/io/hdfs-internal.h       100% <0%> (ø)
cpp/src/parquet/column_page.h          89.28% <0%> (ø)
cpp/src/parquet/arrow/test-util.h      97.77% <0%> (ø)
cpp/src/parquet/bloom_filter-test.cc   99.15% <0%> (ø)
cpp/src/plasma/client.cc               84.43% <0%> (ø)
cpp/src/arrow/io/test-common.h         97.29% <0%> (ø)
cpp/src/arrow/pretty_print.h           100% <0%> (ø)
cpp/src/arrow/io/hdfs-internal.cc      33.62% <0%> (ø)
cpp/src/arrow/visitor_inline.h         93.33% <0%> (ø)
... and 331 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

romainfrancois force-pushed the romainfrancois:ARROW-3350/MemoryMappedFile branch from 1ad3fcc to 70da023 on Oct 8, 2018
romainfrancois (Contributor, Author) commented Oct 8, 2018

@wesm I've updated this PR to follow up on the changes merged in #2711.

I think this can be reviewed as is; otherwise I'll also add commits for using streams in the other direction, e.g. writing a record batch or a Table to an OutputStream, but I guess that can be another PR.

wesm (Member) commented Oct 8, 2018

OK, I will review this sometime today

kou force-pushed the apache:master branch from ef9fd0e to 9448f58 on Oct 8, 2018
romainfrancois force-pushed the romainfrancois:ARROW-3350/MemoryMappedFile branch from 70da023 to 8d1984b on Oct 10, 2018
romainfrancois force-pushed the romainfrancois:ARROW-3350/MemoryMappedFile branch from 8d1984b to 7f8d01b on Oct 10, 2018
wesm (Member) left a comment

This looks reasonable. I noted a couple of typos and left some questions about handling methods that are part of the virtual interfaces in arrow/io/interfaces.h. Let me know what (if anything) you want to do about that in this patch; we can always improve that code in a follow-up patch.

r/R/buffer.R (review comments resolved)
r/R/io.R (outdated):
Tell = function() io___MemoryMappedFile__Tell(self),
Seek = function(position) io___MemoryMappedFile__Seek(self, position),
Resize = function(size) io___MemoryMappedFile__Resize(self, size),
Read = function(nbytes) `arrow::Buffer`$new(io___Readable__Read(self, nbytes))

wesm (Member) commented Oct 11, 2018

Does R6 support diamond inheritance? You probably want to follow the Arrow class hierarchy and put Close, Tell, and Seek in a base class, and Read on InputStream

romainfrancois (Author) commented Oct 11, 2018

I don't think so, ping @wch. r-lib/R6#9

wesm (Member) commented Oct 11, 2018

I see. Well, you could implement a non-diamond inheritance hierarchy for R and at least maximize the amount of code sharing. There might be the issue that the incoming shared_ptr type to each function is different, though, so perhaps it's not worth the effort as long as you internally dispatch to a primary implementation that uses the lowest-level virtual interface.
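
For illustration, a minimal sketch of such a linear (non-diamond) R6 hierarchy, following the suggestion above. Apart from io___Readable__Read and io___MemoryMappedFile__Resize, which appear in the diff above, the class names and io___*__ bindings are assumptions rather than the package's actual exports:

library(R6)

# Illustrative, non-diamond hierarchy: Close/Tell/Seek live in a shared base
# class, Read on InputStream, and MemoryMappedFile only adds what is specific
# to it. The io___FileInterface__* and io___RandomAccessFile__* bindings are
# placeholders, not real exports.
`arrow::io::FileInterface` <- R6Class("arrow::io::FileInterface",
  public = list(
    Close = function() io___FileInterface__Close(self),
    Tell = function() io___FileInterface__Tell(self),
    Seek = function(position) io___FileInterface__Seek(self, position)
  )
)

`arrow::io::InputStream` <- R6Class("arrow::io::InputStream",
  inherit = `arrow::io::FileInterface`,
  public = list(
    Read = function(nbytes) `arrow::Buffer`$new(io___Readable__Read(self, nbytes))
  )
)

`arrow::io::MemoryMappedFile` <- R6Class("arrow::io::MemoryMappedFile",
  inherit = `arrow::io::InputStream`,
  public = list(
    Resize = function(size) io___MemoryMappedFile__Resize(self, size)
  )
)

Each wrapper method would still hand its concrete shared_ptr to a primary implementation on the lowest-level virtual interface, which is the caveat raised above.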

wch commented Oct 11, 2018

That's right, R6 doesn't have multiple inheritance.

romainfrancois (Author) commented Oct 11, 2018

Would something like Java's implements make sense?

wch commented Oct 11, 2018

@romainfrancois want to file an issue on R6? It's worth having a discussion about it.

r/R/io.R (outdated):
public = list(
Close = function() io___ReadableFile__Close(self),
Tell = function() io___ReadableFile__Tell(self),
Seek = function(position) io___ReadableFile__Seek(self, position),

wesm (Member) commented Oct 11, 2018

Same question here re: inheritance

r/src/buffer.cpp (review comment resolved)
romainfrancois (Contributor, Author) commented Oct 11, 2018

I'm adding a few things to it, i.e. something to replace #2741 so that we can do read_record_batch on a raw vector.

wesm (Member) commented Oct 11, 2018

OK, I'll return in a few hours to merge. Thanks!

romainfrancois (Contributor, Author) commented Oct 11, 2018

@wesm I think it's good now.

So we now have both read_table and read_record_batch as S3 generics, with methods for various streams:

#' Read an arrow::Table from a stream
#'
#' @param stream stream. Either a stream created by [file_open()] or [mmap_open()] or a file path.
#'
#' @export
read_table <- function(stream){
  UseMethod("read_table")
}

#' @export
read_table.character <- function(stream){
  assert_that(length(stream) == 1L)
  read_table(fs::path_abs(stream))
}

#' @export
read_table.fs_path <- function(stream) {
  stream <- file_open(stream); on.exit(stream$Close())
  read_table(stream)
}

#' @export
`read_table.arrow::io::RandomAccessFile` <- function(stream) {
  `arrow::Table`$new(read_table_RandomAccessFile(stream))
}

#' @export
`read_table.arrow::io::BufferReader` <- function(stream) {
  `arrow::Table`$new(read_table_BufferReader(stream))
}

#' @export
`read_table.raw` <- function(stream) {
  stream <- buffer_reader(stream); on.exit(stream$Close())
  read_table(stream)
}

So, @javierluraschi, reading from a raw vector is just:

bytes <- ... # from somewhere
batch <- read_record_batch(bytes)

This, I believe, is more coherent with the rest of the API than #2741.

My next move is to do the same for output streams, i.e. instead of having $to_file and $to_stream, we'd have a generic S3 stream function with double dispatch on what we stream (a record batch or a table) and where we stream it (the output stream).

I can either do this as a follow-up patch or right here.
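
For context, a rough sketch of what that double-dispatch write API could look like; every function and class name below is a hypothetical placeholder, not an agreed interface:

# Hypothetical write-side generics: dispatch first on the object being
# written (record batch or table), then on the destination stream.
# All names here are placeholders.
write_arrow <- function(x, stream, ...) {
  UseMethod("write_arrow")
}

`write_arrow.arrow::RecordBatch` <- function(x, stream, ...) {
  write_record_batch_to(stream, x)
}

`write_arrow.arrow::Table` <- function(x, stream, ...) {
  write_table_to(stream, x)
}

# Second dispatch, on the stream class, mirroring read_table() above.
write_record_batch_to <- function(stream, batch) UseMethod("write_record_batch_to")
write_table_to <- function(stream, table) UseMethod("write_table_to")

`write_record_batch_to.arrow::io::OutputStream` <- function(stream, batch) {
  # would call into a hypothetical Rcpp binding; left unimplemented in this sketch
  stop("placeholder: not implemented")
}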

romainfrancois (Contributor, Author) commented Oct 11, 2018

@wesm I'll pick up the next step in the morning, so if you want to merge, I'm good with it.

wesm (Member) left a comment

Some minor comments about API refinement. We can address them in follow-up patches.

r/R/enums.R (review comment resolved)
@@ -63,17 +63,26 @@ List RecordBatch__to_dataframe(const std::shared_ptr<arrow::RecordBatch>& batch)
}

// [[Rcpp::export]]
std::shared_ptr<arrow::RecordBatch> read_record_batch_(std::string path) {
std::shared_ptr<arrow::io::ReadableFile> stream;
std::shared_ptr<arrow::RecordBatch> read_record_batch_RandomAccessFile(

wesm (Member) commented Oct 11, 2018

We need to revisit these APIs -- read_record_batch_RandomAccessFile and read_record_batch_BufferReader do different things. https://issues.apache.org/jira/browse/ARROW-3498

}

// [[Rcpp::export]]
std::shared_ptr<arrow::RecordBatch> read_record_batch_BufferReader(

wesm (Member) commented Oct 11, 2018

This function name does not suggest that it reads a stream (including the schema). The Python version of this does not read a stream

romainfrancois (Author) commented Oct 11, 2018

It reads a record batch from a BufferReader.

But yeah, this is likely to change once more C++ classes can be held on the R side.

Should it just read the record batch and not the schema?

wesm (Member) commented Oct 11, 2018

We need to go carefully through all the modes of operation when it comes to IPC, as listed in https://issues.apache.org/jira/browse/ARROW-3498. We have support for all modes in Python at the moment:

  • Reading a stream
  • Reading a file
  • Reading a schema alone
  • Reading a single record batch alone given a known schema
@@ -95,20 +104,20 @@ int RecordBatch__to_file(const std::shared_ptr<arrow::RecordBatch>& batch,
return offset;
}

// [[Rcpp::export]]
RawVector RecordBatch__to_stream(const std::shared_ptr<arrow::RecordBatch>& batch) {
int64_t RecordBatch_size(const std::shared_ptr<arrow::RecordBatch>& batch) {

wesm (Member) commented Oct 11, 2018

Hm, "stream_size"?

romainfrancois (Author) commented Oct 11, 2018

Btw, is this the way to get the number of bytes a record batch would take?

wesm (Member) commented Oct 11, 2018

It depends on the operational mode per comments in https://issues.apache.org/jira/browse/ARROW-3498. If you are writing a stream with the schema and a single record batch, then this is the right way. But you can write a record batch without the schema in a non-stream.

romainfrancois (Author) commented Oct 11, 2018

That makes sense. I’ve switched to the other side of the problem. I’ll keep this in mind and propose some changes.

}

// [[Rcpp::export]]
std::shared_ptr<arrow::Table> read_table_BufferReader(

wesm (Member) commented Oct 11, 2018

Same here

wesm (Member) commented Oct 11, 2018

+1

wesm closed this in b027f5a on Oct 11, 2018
@@ -3,6 +3,7 @@ Title: R Integration to 'Apache' 'Arrow'
Version: 0.0.0.9000
Authors@R: c(
person("Romain", "François", email = "romain@rstudio.com", role = c("aut", "cre")),
person("Javier", "Luraschi", email = "javier@rstudio.com", role = c("ctb")),

javierluraschi (Contributor) commented Oct 11, 2018

@romainfrancois Lol, I don't think I deserve this yet... so feel free to remove me before going to CRAN, unless I end up contributing something more significant. In any case, thanks!

romainfrancois deleted the romainfrancois:ARROW-3350/MemoryMappedFile branch on Oct 11, 2018
kou changed the title from "ARROW-3350: [R] Wrap MemoryMappedFile class" to "ARROW-3311: [R] Wrap MemoryMappedFile class" on Mar 25, 2019