ARROW-3490: [R] streaming of arrow objects to streams #2749
Conversation
This currently fails because of a documentation/roxygen issue:
Not quite sure what the workaround is.
Independently of this, @javierluraschi I'd like your input on
I'd have used
Hm, I'm concerned the term "stream" might cause confusion with the other kinds of streams. If serialize is unavailable we could use some synonym.
Maybe we can make
Codecov Report

```
@@            Coverage Diff             @@
##           master    #2749      +/-  ##
==========================================
- Coverage   87.57%   87.56%    -0.02%
==========================================
  Files         403      403
  Lines       61483    61483
==========================================
- Hits        53845    53835       -10
- Misses       7568     7574        +6
- Partials       70       74        +4
```

Continue to review the full report at Codecov.
Still interested in input about the functionality, and the choices of what underlying stream is used for each case. And I guess we need a name for both ways; maybe repurpose the
The replacement of
Regarding naming, I like
@romainfrancois one more comment... related to https://github.com/apache/arrow/pull/2714/files#r224517253, and specifically:
I just found out a relevant use case to optimize R/Spark; I still have a few days of work implementing another piece of this. The way I would expect this to work (not sure if Python is the same) would be something like:

```r
schema <- read_schema(raw_schema)
df1 <- read_table(raw_batch_1, schema)
df2 <- read_table(raw_batch_2, schema)
df3 <- read_table(raw_batch_3, schema)
```
Maybe I'll just roll back to write_table and write_record_batch and have them be S3 generics, i.e. drop having a single double-dispatch generic.
@javierluraschi you're one step ahead again. I'll do reading from various streams in the next PR.
I rolled back to:
The previous
Ready to be merged as far as I'm concerned.
Will review later today or tomorrow |
Force-pushed from 893e338 to 91a3103
I have a few more commits, for implementing changes to
@wesm I can either merge them here or create another PR once this one is squashed?
Plan to review this tomorrow (I'm on US/Eastern time, so feel free to add more commits if you are working earlier in the day) |
Perhaps r-lib/withr#80 would give a better syntax.
- write_arrow + stream.data.frame
Force-pushed from e300428 to 338f75f
Alright, I folded in the other branches I had.
Similar dispatch rules for
The test covers various ways to read a record batch (same thing for a table):

```r
test_that("read_record_batch handles various streams (ARROW-3450, ARROW-3505)", {
  tbl <- tibble::tibble(
    int = 1:10, dbl = as.numeric(1:10),
    lgl = sample(c(TRUE, FALSE, NA), 10, replace = TRUE),
    chr = letters[1:10]
  )
  batch <- record_batch(tbl)

  tf <- local_tempfile()
  write_record_batch(batch, tf)
  bytes <- write_record_batch(batch, raw())
  buf_reader <- buffer_reader(bytes)

  batch1 <- read_record_batch(tf)
  batch2 <- read_record_batch(fs::path_abs(tf))

  readable_file <- close_on_exit(file_open(tf))
  batch3 <- read_record_batch(readable_file)

  mmap_file <- close_on_exit(mmap_open(tf))
  batch4 <- read_record_batch(mmap_file)

  batch5 <- read_record_batch(bytes)
  batch6 <- read_record_batch(buf_reader)

  stream_reader <- record_batch_stream_reader(bytes)
  batch7 <- read_record_batch(stream_reader)

  file_reader <- record_batch_file_reader(tf)
  batch8 <- read_record_batch(file_reader)

  expect_equal(batch, batch1)
  expect_equal(batch, batch2)
  expect_equal(batch, batch3)
  expect_equal(batch, batch4)
  expect_equal(batch, batch5)
  expect_equal(batch, batch6)
  expect_equal(batch, batch7)
  expect_equal(batch, batch8)
})
```
@wesm could we merge this one or are we missing anything? From my side, just moved the
I'm pretty waterlogged but reviewing this now |
I have some questions about names and how `shared_ptr<T>(nullptr)` is handled. I opened a few follow-up JIRAs.
I'm going to merge this in the interest of helping things move along but please discuss the naming issues and decide on something unambiguous to distinguish the "write a single record batch to an existing IPC stream" and "write a complete IPC stream" concepts
```r
#' @export
`read_record_batch.arrow::io::RandomAccessFile` <- function(stream, ...) {
  reader <- record_batch_file_reader(stream)
  reader$ReadRecordBatch(0)
}
```
Should this assert that there is only one batch in the stream?
Maybe, but is it not useful to be able to just read the first one?
```r
#' @export
`read_record_batch.arrow::io::BufferReader` <- function(stream, ...) {
  reader <- record_batch_stream_reader(stream)
  reader$ReadNext()
}
```
Should this read a second time and assert that the reader returns null?
same
```r
#' @param ... extra parameters
#'
#' @export
write_record_batch <- function(x, stream, ...) {
```
I'm struggling a bit with the name of the function and what it actually does. This writes a single record batch as the IPC streaming format including the schema. See e.g. some Python versions of similar things
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_ipc.py#L599
See the `RecordBatch.serialize` implementation, and `Schema.serialize`. These write a single IPC message (just the record batch, no schema) in memory, though.
It is true that if you create a stream writer yourself and then call write_record_batch(x, writer) multiple times, you can write multiple messages, but the function that
- creates a stream writer from an output sink
- writes a single batch to it
- closes the writer
should probably be called something different. Can leave to a follow-up patch.
What it does depends on what the stream is: if it's already a `RecordBatchWriter`, then it just serializes the record batch.
```r
#' Write an object to a stream
#'
#' @param x An object to stream
```
Need to make a more precise statement about what these functions do, per comment above. Like "Writes a complete Arrow IPC stream to the output sink including schema to serialize the input object"
It might be useful to describe things as "output sinks" to distinguish from "IPC streams"
Yes. In the meantime, it's just so that the build passes. As above, we need to discuss which function does what.
```cpp
using value_type = typename TypeTraits<Type>::ArrayType::value_type;

auto n = array->length();
auto start = reinterpret_cast<const value_type*>(array->data()->buffers[1]->data()) +
```
This can be null. See `GetValues<T>` as used in the Arrow C++ codebase.
```cpp
case Type::INT64:
  return arrow::r::promotion_Array_to_Vector<REALSXP, arrow::Int64Type>(array);
case Type::UINT64:
  return arrow::r::promotion_Array_to_Vector<REALSXP, arrow::UInt64Type>(array);
```
Should these error on data loss? opened https://issues.apache.org/jira/browse/ARROW-3553
It should. However, I'll probably switch to using the bit64 package in a follow-up PR. It just uses a numeric vector (an array of double) as the host and interprets the bytes as an `int64_t`. This way, there is no loss.
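The bit-reinterpretation trick bit64 relies on can be sketched in Python with the struct module (the value is illustrative; bit64 does the equivalent with R numeric vectors):

```python
import struct

# An int64 value that a double cannot represent exactly.
x = 2**60 + 7

# Store the int64's raw bytes inside a double "carrier"; the bit pattern,
# not the double's numeric value, is what carries the information.
carrier = struct.unpack("<d", struct.pack("<q", x))[0]

# Reinterpreting the carrier's bytes as int64 recovers the exact value.
roundtrip = struct.unpack("<q", struct.pack("<d", carrier))[0]
assert roundtrip == x

# A plain numeric conversion would lose precision; the carrier's numeric
# value is unrelated to x.
assert int(carrier) != x
```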
…tr), use bits64::integer64

Follow-up to #2749 with the following fixes:
- shared_ptr of `nullptr` don't construct R6 objects
- int64 uses `bit64::integer64`, based on r-lib/vctrs#121

Author: Romain Francois <romain@purrple.cat>

Closes #2795 from romainfrancois/ARROW-3553/int64 and squashes the following commits:

49c34bb <Romain Francois> using lower case cpp files
cf35994 <Romain Francois> lint
018092a <Romain Francois> bounds check on RecordBatchFileReader$ReadRecordBatch
3200352 <Romain Francois> - test
089f951 <Romain Francois> additional check
72ccdcb <Romain Francois> replace direct class$new(.) by construct(class, .)
5b10dd1 <Romain Francois> R -> bit64::integer64
3e6bc9a <Romain Francois> using integer64 class for arrow (int64) -> R conversion
This makes `write_record_batch` and `write_table` generic, with dispatch on the stream type. The `stream` argument can be various things for different use cases:

- an `arrow::ipc::RecordBatchWriter`, created either with `record_batch_stream_writer()` or `record_batch_file_writer()`. This is the lowest level; it calls the writer's `$WriteBatch()` or `$WriteTable()` method depending on what is being streamed.
- an `arrow::io::OutputStream`: this first creates an `arrow::ipc::RecordBatchStreamWriter` and streams into it. In particular this does not add the bytes of arrow files.
- an `fs_path` from the 📦 fs package: this opens an `arrow::ipc::RecordBatchFileWriter` and streams to it, so that the file gets the ARROW1 bytes.
- a `character`: we just assert it is of length one and then call the `fs_path` method.
- a `raw()`, which is just used for its type; in that case we stream into a byte buffer and return it as a raw vector.

Some examples:
Created on 2018-10-12 by the reprex package (v0.2.1.9000)