-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-3311: [R] Wrap MemoryMappedFile class #2714
ARROW-3311: [R] Wrap MemoryMappedFile class #2714
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2714 +/- ##
===========================================
+ Coverage 71.93% 87.48% +15.55%
===========================================
Files 61 402 +341
Lines 3805 61401 +57596
===========================================
+ Hits 2737 53718 +50981
- Misses 994 7609 +6615
Partials 74 74
Continue to review full report at Codecov.
|
1ad3fcc
to
70da023
Compare
OK, I will review this sometime today |
70da023
to
8d1984b
Compare
With methods for: - character: passed to fs::path_abs - fs_path: opens a ReadableFile stream, dispatch and close the stream on exit - RandomAccessFile: (either ReadableFile or MemoryMappedFile)
8d1984b
to
7f8d01b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks reasonable. I noted a couple typos, and left some questions about handling methods that are part of the virtual interfaces in arrow/io/interfaces.h. Let me know what (if anything) you want to do there in this patch, but we can always manage improving that code in a follow up patch
r/R/io.R
Outdated
Tell = function() io___MemoryMappedFile__Tell(self), | ||
Seek = function(position) io___MemoryMappedFile__Seek(self, position), | ||
Resize = function(size) io___MemoryMappedFile__Resize(self, size), | ||
Read = function(nbytes) `arrow::Buffer`$new(io___Readable__Read(self, nbytes)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does R6 support diamond inheritance? You probably want to follow the Arrow class hierarchy and put Close
, Tell
, and Seek
in a base class, and Read
on InputStream
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, ping @wch. r-lib/R6#9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Well you could implement a non-diamond inheritance hierarchy for R and at least maximize the amount of code sharing. I guess there might be the issue that the incoming shared_ptr type to each function is different, so perhaps it's not worth the effort as long as you internally dispatch to a primary implementation that uses the lowest-level virtual interface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, R6 doesn't have multiple inheritance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would something like java's implements
make sense ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@romainfrancois want to file an issue on R6? It's worth having a discussion about it.
r/R/io.R
Outdated
public = list( | ||
Close = function() io___ReadableFile__Close(self), | ||
Tell = function() io___ReadableFile__Tell(self), | ||
Seek = function(position) io___ReadableFile__Seek(self, position), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question here re: inheritance
I'm adding a few things to it, i.e. something to replace #2741 so that we can do |
OK, I'll return in a few hours to merge. thanks! |
@wesm I think it's good now. So we have both #' Read an arrow::Table from a stream
#'
#' @param stream stream. Either a stream created by [file_open()] or [mmap_open()] or a file path.
#'
#' @export
read_table <- function(stream){
UseMethod("read_table")
}
#' @export
read_table.character <- function(stream){
assert_that(length(stream) == 1L)
read_table(fs::path_abs(stream))
}
#' @export
read_table.fs_path <- function(stream) {
stream <- file_open(stream); on.exit(stream$Close())
read_table(stream)
}
#' @export
`read_table.arrow::io::RandomAccessFile` <- function(stream) {
`arrow::Table`$new(read_table_RandomAccessFile(stream))
}
#' @export
`read_table.arrow::io::BufferReader` <- function(stream) {
`arrow::Table`$new(read_table_BufferReader(stream))
}
#' @export
`read_table.raw` <- function(stream) {
stream <- buffer_reader(stream); on.exit(stream$Close())
read_table(stream)
} so @javierluraschi reading from a raw vector is just: bytes <- ... # from somewhere
batch <- read_record_batch(bytes) This, I believe is more coherent with the rest of the api than #2741 My next move is to do the same for output streams, i.e. instead of having the I can either do this as a follow up patch, or right here. |
@wesm I'll pick the next step in the morning, so if you want to merge, I'm good with it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments about API refinement. We can address in follow up patches
@@ -63,17 +63,26 @@ List RecordBatch__to_dataframe(const std::shared_ptr<arrow::RecordBatch>& batch) | |||
} | |||
|
|||
// [[Rcpp::export]] | |||
std::shared_ptr<arrow::RecordBatch> read_record_batch_(std::string path) { | |||
std::shared_ptr<arrow::io::ReadableFile> stream; | |||
std::shared_ptr<arrow::RecordBatch> read_record_batch_RandomAccessFile( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to revisit these APIs -- read_record_batch_RandomAccessFile and read_record_batch_BufferReader do different things. https://issues.apache.org/jira/browse/ARROW-3498
} | ||
|
||
// [[Rcpp::export]] | ||
std::shared_ptr<arrow::RecordBatch> read_record_batch_BufferReader( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function name does not suggest that it reads a stream (including the schema). The Python version of this does not read a stream
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It reads a record batch from a BufferReader.
But yeah this is likely to change once more C++ classes can be held on the r side.
Should it just read the record batch and not the schema ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to go carefully through all the modes of operation when it comes to IPC, as listed in https://issues.apache.org/jira/browse/ARROW-3498. We have support for all modes in Python at the moment:
- Reading a stream
- Reading a file
- Reading a schema alone
- Reading a single record batch alone given a known schema
@@ -95,20 +104,20 @@ int RecordBatch__to_file(const std::shared_ptr<arrow::RecordBatch>& batch, | |||
return offset; | |||
} | |||
|
|||
// [[Rcpp::export]] | |||
RawVector RecordBatch__to_stream(const std::shared_ptr<arrow::RecordBatch>& batch) { | |||
int64_t RecordBatch_size(const std::shared_ptr<arrow::RecordBatch>& batch) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, "stream_size"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, is this the way to get the number of bytes a record batch would take ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on the operational mode per comments in https://issues.apache.org/jira/browse/ARROW-3498. If you are writing a stream with the schema and a single record batch, then this is the right way. But you can write a record batch without the schema in a non-stream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I’ve switched to the other side of the problem. I’ll keep this in mind and propose some changes.
} | ||
|
||
// [[Rcpp::export]] | ||
std::shared_ptr<arrow::Table> read_table_BufferReader( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
+1 |
@@ -3,6 +3,7 @@ Title: R Integration to 'Apache' 'Arrow' | |||
Version: 0.0.0.9000 | |||
Authors@R: c( | |||
person("Romain", "François", email = "romain@rstudio.com", role = c("aut", "cre")), | |||
person("Javier", "Luraschi", email = "javier@rstudio.com", role = c("ctb")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@romainfrancois Lol, I don't think I deserve this yet... so feel free to remove be before going to CRAN... unless I ended up contributing something more significant, in any case, thanks!
The functions
read_record_batch
andread_table
become S3 generic and methods handle:arrow::io::RandomAccessFile
: that can be made withfile_open()
ormmap_open()
fs::fs_path
object. In that case aReadableFile
stream is open, read from using thearrow::io::RandomAccessFile
method and closed on exit.arrow::BufferReader
arrow::BufferReader
methodClasses
Buffer
,ReadableFile
andMemoryMappedFile
are wrapped in R6 classes.