Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16266: [R] Add StructArray$create() #14922

Merged
merged 17 commits into from
Jan 11, 2023
Merged

Conversation

thisisnic
Copy link
Member

No description provided.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@thisisnic thisisnic changed the title ARROW-16266: [R] Add StructArray$create() [WIP] ARROW-16266: [R] Add StructArray$create() Dec 13, 2022
@thisisnic thisisnic marked this pull request as ready for review December 13, 2022 02:34
r/R/array.R Outdated
@@ -450,6 +440,44 @@ StructArray <- R6Class("StructArray",
)
)

StructArray$create <- function(...) {
dots <- list2(...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would starting this with:

batch <- record_batch(...)

...allow you to avoid everything below? I think in C++ you can create a struct array from:

std::shared_ptr<arrow::RecordBatch> batch;
arrow::StructArray::Make(batch->columns(), batch->schema()->field_names());
``

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to make that work; I tried start off with:

dots <- list2(...)

stopifnot(length(dots) > 0)

if (is.null(names(dots))) {
  names(dots) <- rep_len("", length(dots))
}

batch <- record_batch(dots)

but I just end up with

Error: Unknown: only data frames are allowed as unnamed arguments to be auto spliced

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps the solution is just to have some non-empty default names? perhaps x1, x2, x3, ...?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry for forgetting about this thread! I think Will nailed it...the easiest way around record_batch()'s auto-splicing of data.frames from unnamed arguments thing is just to give all the arguments names. Maybe using rlang::dots_list(..., .named = TRUE)?

library(arrow, warn.conflicts = FALSE)

struct_array <- function(...) {
  args <- rlang::dots_list(..., .named = TRUE)
  batch <- record_batch(!!! args)
  array_ptr <- arrow:::allocate_arrow_array()
  schema_ptr <- arrow:::allocate_arrow_schema()
  batch$export_to_c(array_ptr, schema_ptr)
  Array$import_from_c(array_ptr, schema_ptr)
}

struct_array(Array$create(1), Array$create(2))
#> StructArray
#> <struct<Array$create(1): double, Array$create(2): double>>
#> -- is_valid: all not null
#> -- child 0 type: double
#>   [
#>     1
#>   ]
#> -- child 1 type: double
#>   [
#>     2
#>   ]

Created on 2022-12-27 with reprex v2.0.2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't work as it adds names to things which shouldn't be named like data.frames.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you reprex your initial problem? (i.e., what usage caused the Error: Unknown: only data frames are allowed as unnamed arguments to be auto spliced). I'm wondering if the initial error might be fixed by record_batch(!!!dots) instead of record_batch(dots)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks, but solved it a little differently. I've ended up sticking with the previous non-record batch approach as I wanted to avoid creating additional R6 objects en route, and this approach seems to work now.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great!

My concern about using record_batch() isn't about any love for record_batch() per se, but I do want to make sure that using both of them is the same user experience because the object they are generating is functionally identical. The ability to re-use existing record_batch() code would also mean that users of StructArray$create() would benefit from our existing tests for record_batch() which covers things like recycling of length-1 R vectors (and maybe some other things neither of us has considered yet). It's fine to not use it; however, I would prefer that the common pieces of code be reused by both functions somehow (or that if different behaviour is desired that it be documented in a comment somewhere so that somebody like me in the future doesn't come along and obliterate the work you've put in here trying to refactor them into a common constructor).

r/R/array.R Outdated
Comment on lines 456 to 465
if (all(map_lgl(dots, ~ inherits(.x, "Array")))) {
# Check for Array length equality
if (!length(unique(lengths(dots))) == 1) {
abort(
message = c(
"All input Arrays must be equal lengths",
x = paste("Array lengths:", paste(lengths(dots), collapse = ", "))
)
)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can cbind_check_length() be reused here? It has a pretty good error message for arguments of different lengths.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we also have to consider what happens with Scalars/R vectors that are length 1 here?

# pr_fetch(14922)
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
StructArray$create(a = 5L, b = 1:10)
#> Error in eval(expr, envir, enclos): attempt to apply non-function
record_batch(a = 5L, b = 1:10)
#> RecordBatch
#> 10 rows x 2 columns
#> $a <int32>
#> $b <int32>
str(tibble::tibble(a = 5L, b = 1:10))
#> tibble [10 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ a: int [1:10] 5 5 5 5 5 5 5 5 5 5
#>  $ b: int [1:10] 1 2 3 4 5 6 7 8 9 10

r/R/array.R Outdated
Comment on lines 451 to 454
# if we can convert items in dots to Arrays, then do
if (!all(map_lgl(dots, ~ inherits(.x, "Array")))) {
dots <- map(dots, Array$create)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if dots <- lapply(dots, as_arrow_array) is more concise?

@thisisnic
Copy link
Member Author

Finally got this working better with a combination of suggestions, mind taking another look @paleolimbot ? I'm intentionally deferring the need for it to work on every possible input seeing as it's not clear that we'd need them.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a suggestion to fix the failing test! The error that the test is failing with is not good and could be improved for both StructArray$create() and RecordBatch$create() (no need for a followup ticket unless you're passionate about this one!).

r/tests/testthat/test-Array.R Outdated Show resolved Hide resolved
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
@thisisnic thisisnic merged commit a06a5d6 into apache:master Jan 11, 2023
@ursabot
Copy link

ursabot commented Jan 12, 2023

Benchmark runs are scheduled for baseline = e11ef97 and contender = a06a5d6. a06a5d6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.57% ⬆️0.42%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.26%] ursa-i9-9960x
[Finished ⬇️0.34% ⬆️0.09%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] a06a5d65 ec2-t3-xlarge-us-east-2
[Failed] a06a5d65 test-mac-arm
[Finished] a06a5d65 ursa-i9-9960x
[Finished] a06a5d65 ursa-thinkcentre-m75q
[Finished] e11ef977 ec2-t3-xlarge-us-east-2
[Failed] e11ef977 test-mac-arm
[Finished] e11ef977 ursa-i9-9960x
[Finished] e11ef977 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants