
ARROW-3311: [R] Wrap MemoryMappedFile class #2714

Closed

Conversation

@romainfrancois (Contributor) commented Oct 6, 2018

The functions read_record_batch and read_table become S3 generics, with methods handling:

  • arrow::io::RandomAccessFile: which can be created with file_open() or mmap_open()
  • a file path (either a character scalar or an fs::fs_path object). In that case a ReadableFile stream is opened, read from using the arrow::io::RandomAccessFile method, and closed on exit.
  • arrow::BufferReader
  • a raw vector, which uses the arrow::BufferReader method

Classes Buffer, ReadableFile and MemoryMappedFile are wrapped in R6 classes.
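A minimal usage sketch of the dispatch described above; "example.arrow" is a placeholder path, and only functions named in this PR (read_table(), read_record_batch(), file_open(), mmap_open()) are used:

library(arrow)

tab <- read_table("example.arrow")                # character path: a ReadableFile is opened, read, and closed on exit
tab <- read_table(fs::path_abs("example.arrow"))  # fs::fs_path method
tab <- read_table(file_open("example.arrow"))     # arrow::io::RandomAccessFile method (ReadableFile)
tab <- read_table(mmap_open("example.arrow"))     # arrow::io::RandomAccessFile method (MemoryMappedFile)

# given a raw vector of serialized bytes obtained elsewhere, the same generics
# dispatch through arrow::BufferReader:
# batch <- read_record_batch(bytes)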

@codecov-io

Codecov Report

Merging #2714 into master will increase coverage by 15.55%.
The diff coverage is n/a.


@@             Coverage Diff             @@
##           master    #2714       +/-   ##
===========================================
+ Coverage   71.93%   87.48%   +15.55%     
===========================================
  Files          61      402      +341     
  Lines        3805    61401    +57596     
===========================================
+ Hits         2737    53718    +50981     
- Misses        994     7609     +6615     
  Partials       74       74
Impacted Files Coverage Δ
python/pyarrow/ipc.pxi 68.65% <0%> (ø)
cpp/src/arrow/io/hdfs-internal.h 100% <0%> (ø)
cpp/src/parquet/column_page.h 89.28% <0%> (ø)
cpp/src/parquet/arrow/test-util.h 97.77% <0%> (ø)
cpp/src/parquet/bloom_filter-test.cc 99.15% <0%> (ø)
cpp/src/plasma/client.cc 84.43% <0%> (ø)
cpp/src/arrow/io/test-common.h 97.29% <0%> (ø)
cpp/src/arrow/pretty_print.h 100% <0%> (ø)
cpp/src/arrow/io/hdfs-internal.cc 33.62% <0%> (ø)
cpp/src/arrow/visitor_inline.h 93.33% <0%> (ø)
... and 331 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e518c4...1ad3fcc. Read the comment docs.

@romainfrancois (Contributor, Author)

@wesm I've updated this PR to follow up on the changes merged in #2711

I think this can be reviewed as is; otherwise, I'll also add commits about using streams in the other direction, e.g. writing a record batch or a Table to an OutputStream, but I guess that can be another PR.

@wesm (Member) commented Oct 8, 2018

OK, I will review this sometime today

@wesm (Member) left a review:

This looks reasonable. I noted a couple of typos and left some questions about handling methods that are part of the virtual interfaces in arrow/io/interfaces.h. Let me know what (if anything) you want to do there in this patch, but we can always keep improving that code in a follow-up patch.

r/R/buffer.R (outdated review thread, resolved)
r/R/buffer.R (outdated review thread, resolved)
r/R/io.R (outdated):
Tell = function() io___MemoryMappedFile__Tell(self),
Seek = function(position) io___MemoryMappedFile__Seek(self, position),
Resize = function(size) io___MemoryMappedFile__Resize(self, size),
Read = function(nbytes) `arrow::Buffer`$new(io___Readable__Read(self, nbytes))
Member:

Does R6 support diamond inheritance? You probably want to follow the Arrow class hierarchy and put Close, Tell, and Seek in a base class, and Read on InputStream

Contributor (Author):

I don't think so, ping @wch. r-lib/R6#9

Member:

I see. Well you could implement a non-diamond inheritance hierarchy for R and at least maximize the amount of code sharing. I guess there might be the issue that the incoming shared_ptr type to each function is different, so perhaps it's not worth the effort as long as you internally dispatch to a primary implementation that uses the lowest-level virtual interface
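A rough sketch of a linearized (non-diamond) hierarchy along those lines, using R6 single inheritance. Apart from io___Readable__Read and the MemoryMappedFile externals quoted above, the io___* names here are hypothetical placeholders for Rcpp exports that would internally dispatch on the lowest-level virtual interface:

library(R6)

# hypothetical base class mirroring arrow::io::FileInterface
`arrow::io::FileInterface` <- R6Class("arrow::io::FileInterface",
  public = list(
    Close = function() io___FileInterface__Close(self),   # hypothetical export
    Tell  = function() io___FileInterface__Tell(self)     # hypothetical export
  )
)

# InputStream adds Read(); io___Readable__Read appears in the PR
`arrow::io::InputStream` <- R6Class("arrow::io::InputStream",
  inherit = `arrow::io::FileInterface`,
  public = list(
    Read = function(nbytes) `arrow::Buffer`$new(io___Readable__Read(self, nbytes))
  )
)

# RandomAccessFile adds Seek()
`arrow::io::RandomAccessFile` <- R6Class("arrow::io::RandomAccessFile",
  inherit = `arrow::io::InputStream`,
  public = list(
    Seek = function(position) io___RandomAccessFile__Seek(self, position)  # hypothetical export
  )
)

# MemoryMappedFile then only needs its own methods, e.g. Resize()
`arrow::io::MemoryMappedFile` <- R6Class("arrow::io::MemoryMappedFile",
  inherit = `arrow::io::RandomAccessFile`,
  public = list(
    Resize = function(size) io___MemoryMappedFile__Resize(self, size)
  )
)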

Reply:

That's right, R6 doesn't have multiple inheritance.

Contributor (Author):

Would something like Java's implements make sense?

Reply:

@romainfrancois want to file an issue on R6? It's worth having a discussion about it.

r/R/io.R (outdated):
public = list(
Close = function() io___ReadableFile__Close(self),
Tell = function() io___ReadableFile__Tell(self),
Seek = function(position) io___ReadableFile__Seek(self, position),
Member:

Same question here re: inheritance

r/src/buffer.cpp (review thread resolved)
@romainfrancois (Contributor, Author)

I'm adding a few things to it, i.e. something to replace #2741 so that we can do read_record_batch on a raw vector.

@wesm (Member) commented Oct 11, 2018
OK, I'll return in a few hours to merge. thanks!

@romainfrancois (Contributor, Author) commented Oct 11, 2018

@wesm I think it's good now.

So we now have both read_table and read_record_batch as S3 generics, with implementations for various streams:

#' Read an arrow::Table from a stream
#'
#' @param stream stream. Either a stream created by [file_open()] or [mmap_open()] or a file path.
#'
#' @export
read_table <- function(stream){
  UseMethod("read_table")
}

#' @export
read_table.character <- function(stream){
  assert_that(length(stream) == 1L)
  read_table(fs::path_abs(stream))
}

#' @export
read_table.fs_path <- function(stream) {
  stream <- file_open(stream); on.exit(stream$Close())
  read_table(stream)
}

#' @export
`read_table.arrow::io::RandomAccessFile` <- function(stream) {
  `arrow::Table`$new(read_table_RandomAccessFile(stream))
}

#' @export
`read_table.arrow::io::BufferReader` <- function(stream) {
  `arrow::Table`$new(read_table_BufferReader(stream))
}

#' @export
`read_table.raw` <- function(stream) {
  stream <- buffer_reader(stream); on.exit(stream$Close())
  read_table(stream)
}

So @javierluraschi, reading from a raw vector is just:

bytes <- ... # from somewhere
batch <- read_record_batch(bytes)

This, I believe, is more consistent with the rest of the API than #2741.

My next move is to do the same for output streams, i.e. instead of having $to_file and $to_stream, we'd have an S3 stream generic with double dispatch on what we stream (a record batch or a table) and where we stream it (the output stream).

I can either do this as a follow-up patch, or right here.
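For reference, a purely hypothetical sketch of that double dispatch; none of these names exist in the PR. The outer generic dispatches on what is written, and the inner UseMethod() call on where it is written:

stream <- function(x, sink, ...) UseMethod("stream")

# first dispatch: what we stream
`stream.arrow::RecordBatch` <- function(x, sink, ...) UseMethod("stream_record_batch", sink)
`stream.arrow::Table`       <- function(x, sink, ...) UseMethod("stream_table", sink)

# second dispatch: where we stream it (a matching stream_table method would follow the same pattern)
`stream_record_batch.arrow::io::OutputStream` <- function(x, sink, ...) {
  # would forward to an Rcpp export that writes the batch to the output stream
  invisible(x)
}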

@romainfrancois (Contributor, Author)

@wesm I'll pick up the next step in the morning, so if you want to merge, I'm good with it.

@wesm (Member) left a review:

Some minor comments about API refinement. We can address them in follow-up patches.

r/R/enums.R (review thread resolved)
@@ -63,17 +63,26 @@ List RecordBatch__to_dataframe(const std::shared_ptr<arrow::RecordBatch>& batch)
 }
 
 // [[Rcpp::export]]
-std::shared_ptr<arrow::RecordBatch> read_record_batch_(std::string path) {
-  std::shared_ptr<arrow::io::ReadableFile> stream;
+std::shared_ptr<arrow::RecordBatch> read_record_batch_RandomAccessFile(
Member:

We need to revisit these APIs -- read_record_batch_RandomAccessFile and read_record_batch_BufferReader do different things. https://issues.apache.org/jira/browse/ARROW-3498

}

// [[Rcpp::export]]
std::shared_ptr<arrow::RecordBatch> read_record_batch_BufferReader(
Member:

This function name does not suggest that it reads a stream (including the schema). The Python version of this does not read a stream

Contributor (Author):

It reads a record batch from a BufferReader.

But yeah, this is likely to change once more C++ classes can be held on the R side.

Should it just read the record batch and not the schema?

Member:

We need to go carefully through all the modes of operation when it comes to IPC, as listed in https://issues.apache.org/jira/browse/ARROW-3498. We have support for all modes in Python at the moment:

  • Reading a stream
  • Reading a file
  • Reading a schema alone
  • Reading a single record batch alone given a known schema

@@ -95,20 +104,20 @@ int RecordBatch__to_file(const std::shared_ptr<arrow::RecordBatch>& batch,
   return offset;
 }
 
 // [[Rcpp::export]]
-RawVector RecordBatch__to_stream(const std::shared_ptr<arrow::RecordBatch>& batch) {
+int64_t RecordBatch_size(const std::shared_ptr<arrow::RecordBatch>& batch) {
Member:

Hm, "stream_size"?

Contributor (Author):

Btw, is this the way to get the number of bytes a record batch would take?

Member:

It depends on the operational mode per comments in https://issues.apache.org/jira/browse/ARROW-3498. If you are writing a stream with the schema and a single record batch, then this is the right way. But you can write a record batch without the schema in a non-stream.

Contributor (Author):

That makes sense. I’ve switched to the other side of the problem. I’ll keep this in mind and propose some changes.

}

// [[Rcpp::export]]
std::shared_ptr<arrow::Table> read_table_BufferReader(
Member:

Same here

@wesm (Member) commented Oct 11, 2018

+1

@wesm closed this in b027f5a on Oct 11, 2018
@@ -3,6 +3,7 @@ Title: R Integration to 'Apache' 'Arrow'
 Version: 0.0.0.9000
 Authors@R: c(
     person("Romain", "François", email = "romain@rstudio.com", role = c("aut", "cre")),
+    person("Javier", "Luraschi", email = "javier@rstudio.com", role = c("ctb")),
Contributor:

@romainfrancois Lol, I don't think I deserve this yet... so feel free to remove me before going to CRAN, unless I end up contributing something more significant. In any case, thanks!

@romainfrancois deleted the ARROW-3350/MemoryMappedFile branch on October 11, 2018 at 20:44
@kou changed the title from "ARROW-3350: [R] Wrap MemoryMappedFile class" to "ARROW-3311: [R] Wrap MemoryMappedFile class" on Mar 25, 2019