
ARROW-3311: [R] Wrap MemoryMappedFile class #2714

Closed

Conversation

romainfrancois (Contributor) commented Oct 6, 2018

The functions read_record_batch and read_table become S3 generics, with methods handling:

  • arrow::io::RandomAccessFile: can be created with file_open() or mmap_open()
  • a file path (either a character scalar or an fs::fs_path object): in that case a ReadableFile stream is opened, read from via the arrow::io::RandomAccessFile method, and closed on exit
  • arrow::BufferReader
  • a raw vector, which uses the arrow::BufferReader method

The Buffer, ReadableFile and MemoryMappedFile classes are wrapped in R6 classes.
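
To make the dispatch concrete, here is a minimal usage sketch based on the description above. The file path and the raw vector are placeholders, and it assumes the R6 wrappers expose the corresponding arrow::* class names for S3 dispatch:

# Illustrative usage of the generics described above; "data.arrow" and
# some_raw_vector are placeholders, not real inputs.
tab_from_path <- read_table("data.arrow")               # character path: opens a ReadableFile, closed on exit
tab_from_mmap <- read_table(mmap_open("data.arrow"))    # arrow::io::RandomAccessFile method
tab_from_raw  <- read_table(some_raw_vector)            # raw vector: goes through arrow::BufferReader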

codecov-io commented Oct 6, 2018

Codecov Report

Merging #2714 into master will increase coverage by 15.55%.
The diff coverage is n/a.


@@             Coverage Diff             @@
##           master    #2714       +/-   ##
===========================================
+ Coverage   71.93%   87.48%   +15.55%     
===========================================
  Files          61      402      +341     
  Lines        3805    61401    +57596     
===========================================
+ Hits         2737    53718    +50981     
- Misses        994     7609     +6615     
  Partials       74       74
Impacted Files                         Coverage Δ
python/pyarrow/ipc.pxi                 68.65% <0%> (ø)
cpp/src/arrow/io/hdfs-internal.h       100% <0%> (ø)
cpp/src/parquet/column_page.h          89.28% <0%> (ø)
cpp/src/parquet/arrow/test-util.h      97.77% <0%> (ø)
cpp/src/parquet/bloom_filter-test.cc   99.15% <0%> (ø)
cpp/src/plasma/client.cc               84.43% <0%> (ø)
cpp/src/arrow/io/test-common.h         97.29% <0%> (ø)
cpp/src/arrow/pretty_print.h           100% <0%> (ø)
cpp/src/arrow/io/hdfs-internal.cc      33.62% <0%> (ø)
cpp/src/arrow/visitor_inline.h         93.33% <0%> (ø)
... and 331 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

romainfrancois force-pushed the romainfrancois:ARROW-3350/MemoryMappedFile branch from 1ad3fcc to 70da023 on Oct 8, 2018
romainfrancois (Contributor, Author) commented Oct 8, 2018

@wesm I've updated this PR to follow up on the changes merged in #2711.

I think this can be reviewed as is; otherwise I'll also add commits for using streams in the other direction, e.g. writing a record batch or a Table to an OutputStream, but I guess that can be another PR.

wesm (Member) commented Oct 8, 2018

OK, I will review this sometime today

kou force-pushed the apache:master branch from ef9fd0e to 9448f58 on Oct 8, 2018
romainfrancois force-pushed the romainfrancois:ARROW-3350/MemoryMappedFile branch from 70da023 to 8d1984b on Oct 10, 2018
romainfrancois force-pushed the romainfrancois:ARROW-3350/MemoryMappedFile branch from 8d1984b to 7f8d01b on Oct 10, 2018
wesm (Member) left a comment

This looks reasonable. I noted a couple of typos and left some questions about handling methods that are part of the virtual interfaces in arrow/io/interfaces.h. Let me know what (if anything) you want to do about that in this patch; we can always improve that code in a follow-up patch.

r/R/buffer.R (review comments resolved)
r/R/io.R (outdated):
Tell = function() io___MemoryMappedFile__Tell(self),
Seek = function(position) io___MemoryMappedFile__Seek(self, position),
Resize = function(size) io___MemoryMappedFile__Resize(self, size),
Read = function(nbytes) `arrow::Buffer`$new(io___Readable__Read(self, nbytes))

wesm (Member) commented Oct 11, 2018

Does R6 support diamond inheritance? You probably want to follow the Arrow class hierarchy and put Close, Tell, and Seek in a base class, and Read on InputStream

romainfrancois (Author) commented Oct 11, 2018

I don't think so, ping @wch. r-lib/R6#9

wesm (Member) commented Oct 11, 2018

I see. Well, you could implement a non-diamond inheritance hierarchy for R and at least maximize the amount of code sharing. There might be the issue that the incoming shared_ptr type to each function is different, though, so perhaps it's not worth the effort as long as you internally dispatch to a primary implementation that uses the lowest-level virtual interface.
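
For illustration, a minimal sketch of such a linear (non-diamond) R6 hierarchy, following the suggestion above. Apart from io___Readable__Read and io___MemoryMappedFile__Resize, which appear in the diff above, the class names and io___*__ bindings are assumptions rather than the package's actual exports:

library(R6)

# Illustrative, non-diamond hierarchy: Close/Tell/Seek live in a shared base
# class, Read on InputStream, and MemoryMappedFile only adds what is specific
# to it. The io___FileInterface__* and io___RandomAccessFile__* bindings are
# placeholders, not real exports.
`arrow::io::FileInterface` <- R6Class("arrow::io::FileInterface",
  public = list(
    Close = function() io___FileInterface__Close(self),
    Tell = function() io___FileInterface__Tell(self),
    Seek = function(position) io___FileInterface__Seek(self, position)
  )
)

`arrow::io::InputStream` <- R6Class("arrow::io::InputStream",
  inherit = `arrow::io::FileInterface`,
  public = list(
    Read = function(nbytes) `arrow::Buffer`$new(io___Readable__Read(self, nbytes))
  )
)

`arrow::io::MemoryMappedFile` <- R6Class("arrow::io::MemoryMappedFile",
  inherit = `arrow::io::InputStream`,
  public = list(
    Resize = function(size) io___MemoryMappedFile__Resize(self, size)
  )
)

Each wrapper method would still hand its concrete shared_ptr to a primary implementation on the lowest-level virtual interface, which is the caveat raised above.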

wch commented Oct 11, 2018

That's right, R6 doesn't have multiple inheritance.

romainfrancois (Author) commented Oct 11, 2018

Would something like Java's implements make sense?

wch commented Oct 11, 2018

@romainfrancois want to file an issue on R6? It's worth having a discussion about it.

r/R/io.R (outdated):
public = list(
Close = function() io___ReadableFile__Close(self),
Tell = function() io___ReadableFile__Tell(self),
Seek = function(position) io___ReadableFile__Seek(self, position),

wesm (Member) commented Oct 11, 2018

Same question here re: inheritance

r/src/buffer.cpp (review comment resolved)
romainfrancois (Contributor, Author) commented Oct 11, 2018

I'm adding a few things to it, i.e. something to replace #2741 so that we can do read_record_batch on a raw vector.

wesm (Member) commented Oct 11, 2018

OK, I'll return in a few hours to merge. Thanks!

romainfrancois (Contributor, Author) commented Oct 11, 2018

@wesm I think it's good now.

So we now have both read_table and read_record_batch as S3 generics, with methods for various streams:

#' Read an arrow::Table from a stream
#'
#' @param stream stream. Either a stream created by [file_open()] or [mmap_open()] or a file path.
#'
#' @export
read_table <- function(stream){
  UseMethod("read_table")
}

#' @export
read_table.character <- function(stream){
  assert_that(length(stream) == 1L)
  read_table(fs::path_abs(stream))
}

#' @export
read_table.fs_path <- function(stream) {
  stream <- file_open(stream); on.exit(stream$Close())
  read_table(stream)
}

#' @export
`read_table.arrow::io::RandomAccessFile` <- function(stream) {
  `arrow::Table`$new(read_table_RandomAccessFile(stream))
}

#' @export
`read_table.arrow::io::BufferReader` <- function(stream) {
  `arrow::Table`$new(read_table_BufferReader(stream))
}

#' @export
`read_table.raw` <- function(stream) {
  stream <- buffer_reader(stream); on.exit(stream$Close())
  read_table(stream)
}

So, @javierluraschi, reading from a raw vector is just:

bytes <- ... # from somewhere
batch <- read_record_batch(bytes)

This, I believe, is more coherent with the rest of the API than #2741.

My next move is to do the same for output streams, i.e. instead of having $to_file and $to_stream, we'd have a generic S3 stream function with double dispatch on what we stream (a record batch or a table) and where we stream it (the output stream).

I can either do this as a follow-up patch or right here.
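
For context, a rough sketch of what that double-dispatch write API could look like; every function and class name below is a hypothetical placeholder, not an agreed interface:

# Hypothetical write-side generics: dispatch first on the object being
# written (record batch or table), then on the destination stream.
# All names here are placeholders.
write_arrow <- function(x, stream, ...) {
  UseMethod("write_arrow")
}

`write_arrow.arrow::RecordBatch` <- function(x, stream, ...) {
  write_record_batch_to(stream, x)
}

`write_arrow.arrow::Table` <- function(x, stream, ...) {
  write_table_to(stream, x)
}

# Second dispatch, on the stream class, mirroring read_table() above.
write_record_batch_to <- function(stream, batch) UseMethod("write_record_batch_to")
write_table_to <- function(stream, table) UseMethod("write_table_to")

`write_record_batch_to.arrow::io::OutputStream` <- function(stream, batch) {
  # would call into a hypothetical Rcpp binding; left unimplemented in this sketch
  stop("placeholder: not implemented")
}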

romainfrancois (Contributor, Author) commented Oct 11, 2018

@wesm I'll pick up the next step in the morning, so if you want to merge, I'm good with it.

wesm (Member) left a comment

Some minor comments about API refinement. We can address them in follow-up patches.

r/R/enums.R (review comment resolved)
@@ -63,17 +63,26 @@ List RecordBatch__to_dataframe(const std::shared_ptr<arrow::RecordBatch>& batch)
}

// [[Rcpp::export]]
std::shared_ptr<arrow::RecordBatch> read_record_batch_(std::string path) {
std::shared_ptr<arrow::io::ReadableFile> stream;
std::shared_ptr<arrow::RecordBatch> read_record_batch_RandomAccessFile(

wesm (Member) commented Oct 11, 2018

We need to revisit these APIs -- read_record_batch_RandomAccessFile and read_record_batch_BufferReader do different things. https://issues.apache.org/jira/browse/ARROW-3498

}

// [[Rcpp::export]]
std::shared_ptr<arrow::RecordBatch> read_record_batch_BufferReader(

wesm (Member) commented Oct 11, 2018

This function name does not suggest that it reads a stream (including the schema). The Python version of this does not read a stream

romainfrancois (Author) commented Oct 11, 2018

It reads a record batch from a BufferReader.

But yeah, this is likely to change once more C++ classes can be held on the R side.

Should it just read the record batch and not the schema?

wesm (Member) commented Oct 11, 2018

We need to go carefully through all the modes of operation when it comes to IPC, as listed in https://issues.apache.org/jira/browse/ARROW-3498. We have support for all modes in Python at the moment:

  • Reading a stream
  • Reading a file
  • Reading a schema alone
  • Reading a single record batch alone given a known schema
@@ -95,20 +104,20 @@ int RecordBatch__to_file(const std::shared_ptr<arrow::RecordBatch>& batch,
return offset;
}

// [[Rcpp::export]]
RawVector RecordBatch__to_stream(const std::shared_ptr<arrow::RecordBatch>& batch) {
int64_t RecordBatch_size(const std::shared_ptr<arrow::RecordBatch>& batch) {

wesm (Member) commented Oct 11, 2018

Hm, "stream_size"?

romainfrancois (Author) commented Oct 11, 2018

Btw, is this the way to get the number of bytes a record batch would take?

wesm (Member) commented Oct 11, 2018

It depends on the operational mode per comments in https://issues.apache.org/jira/browse/ARROW-3498. If you are writing a stream with the schema and a single record batch, then this is the right way. But you can write a record batch without the schema in a non-stream.

romainfrancois (Author) commented Oct 11, 2018

That makes sense. I’ve switched to the other side of the problem. I’ll keep this in mind and propose some changes.

}

// [[Rcpp::export]]
std::shared_ptr<arrow::Table> read_table_BufferReader(

wesm (Member) commented Oct 11, 2018

Same here

wesm (Member) commented Oct 11, 2018

+1

wesm closed this in b027f5a on Oct 11, 2018
@@ -3,6 +3,7 @@ Title: R Integration to 'Apache' 'Arrow'
Version: 0.0.0.9000
Authors@R: c(
person("Romain", "François", email = "romain@rstudio.com", role = c("aut", "cre")),
person("Javier", "Luraschi", email = "javier@rstudio.com", role = c("ctb")),

javierluraschi (Contributor) commented Oct 11, 2018

@romainfrancois Lol, I don't think I deserve this yet... so feel free to remove me before going to CRAN, unless I end up contributing something more significant. In any case, thanks!

romainfrancois deleted the romainfrancois:ARROW-3350/MemoryMappedFile branch on Oct 11, 2018
kou changed the title from "ARROW-3350: [R] Wrap MemoryMappedFile class" to "ARROW-3311: [R] Wrap MemoryMappedFile class" on Mar 25, 2019