-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-16266: [R] Add StructArray$create() #14922
Conversation
|
r/R/array.R
Outdated
@@ -450,6 +440,44 @@ StructArray <- R6Class("StructArray", | |||
) | |||
) | |||
|
|||
StructArray$create <- function(...) { | |||
dots <- list2(...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would starting this with:
batch <- record_batch(...)
...allow you to avoid everything below? I think in C++ you can create a struct array from:
std::shared_ptr<arrow::RecordBatch> batch;
arrow::StructArray::Make(batch->columns(), batch->schema()->field_names());
``
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to make that work; I tried start off with:
dots <- list2(...)
stopifnot(length(dots) > 0)
if (is.null(names(dots))) {
names(dots) <- rep_len("", length(dots))
}
batch <- record_batch(dots)
but I just end up with
Error: Unknown: only data frames are allowed as unnamed arguments to be auto spliced
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps the solution is just to have some non-empty default names? perhaps x1, x2, x3, ...
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry for forgetting about this thread! I think Will nailed it...the easiest way around record_batch()
's auto-splicing of data.frames from unnamed arguments thing is just to give all the arguments names. Maybe using rlang::dots_list(..., .named = TRUE)
?
library(arrow, warn.conflicts = FALSE)
struct_array <- function(...) {
args <- rlang::dots_list(..., .named = TRUE)
batch <- record_batch(!!! args)
array_ptr <- arrow:::allocate_arrow_array()
schema_ptr <- arrow:::allocate_arrow_schema()
batch$export_to_c(array_ptr, schema_ptr)
Array$import_from_c(array_ptr, schema_ptr)
}
struct_array(Array$create(1), Array$create(2))
#> StructArray
#> <struct<Array$create(1): double, Array$create(2): double>>
#> -- is_valid: all not null
#> -- child 0 type: double
#> [
#> 1
#> ]
#> -- child 1 type: double
#> [
#> 2
#> ]
Created on 2022-12-27 with reprex v2.0.2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't work as it adds names to things which shouldn't be named like data.frames.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you reprex your initial problem? (i.e., what usage caused the Error: Unknown: only data frames are allowed as unnamed arguments to be auto spliced
). I'm wondering if the initial error might be fixed by record_batch(!!!dots)
instead of record_batch(dots)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thanks, but solved it a little differently. I've ended up sticking with the previous non-record batch approach as I wanted to avoid creating additional R6 objects en route, and this approach seems to work now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great!
My concern about using record_batch()
isn't about any love for record_batch()
per se, but I do want to make sure that using both of them is the same user experience because the object they are generating is functionally identical. The ability to re-use existing record_batch()
code would also mean that users of StructArray$create()
would benefit from our existing tests for record_batch()
which covers things like recycling of length-1 R vectors (and maybe some other things neither of us has considered yet). It's fine to not use it; however, I would prefer that the common pieces of code be reused by both functions somehow (or that if different behaviour is desired that it be documented in a comment somewhere so that somebody like me in the future doesn't come along and obliterate the work you've put in here trying to refactor them into a common constructor).
r/R/array.R
Outdated
if (all(map_lgl(dots, ~ inherits(.x, "Array")))) { | ||
# Check for Array length equality | ||
if (!length(unique(lengths(dots))) == 1) { | ||
abort( | ||
message = c( | ||
"All input Arrays must be equal lengths", | ||
x = paste("Array lengths:", paste(lengths(dots), collapse = ", ")) | ||
) | ||
) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can cbind_check_length()
be reused here? It has a pretty good error message for arguments of different lengths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we also have to consider what happens with Scalars/R vectors that are length 1 here?
# pr_fetch(14922)
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
StructArray$create(a = 5L, b = 1:10)
#> Error in eval(expr, envir, enclos): attempt to apply non-function
record_batch(a = 5L, b = 1:10)
#> RecordBatch
#> 10 rows x 2 columns
#> $a <int32>
#> $b <int32>
str(tibble::tibble(a = 5L, b = 1:10))
#> tibble [10 × 2] (S3: tbl_df/tbl/data.frame)
#> $ a: int [1:10] 5 5 5 5 5 5 5 5 5 5
#> $ b: int [1:10] 1 2 3 4 5 6 7 8 9 10
r/R/array.R
Outdated
# if we can convert items in dots to Arrays, then do | ||
if (!all(map_lgl(dots, ~ inherits(.x, "Array")))) { | ||
dots <- map(dots, Array$create) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if dots <- lapply(dots, as_arrow_array)
is more concise?
Finally got this working better with a combination of suggestions, mind taking another look @paleolimbot ? I'm intentionally deferring the need for it to work on every possible input seeing as it's not clear that we'd need them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a suggestion to fix the failing test! The error that the test is failing with is not good and could be improved for both StructArray$create()
and RecordBatch$create()
(no need for a followup ticket unless you're passionate about this one!).
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Benchmark runs are scheduled for baseline = e11ef97 and contender = a06a5d6. a06a5d6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
No description provided.