Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-5718: [R] auto splice data frames in record_batch() and table() #4704

Closed

Conversation

romainfrancois
Copy link
Contributor

elements of ... that are unnamed in record_batch(...) or table(...) are automatically spliced:

library(arrow, warn.conflicts = FALSE)
library(tibble)

df <- tibble(x = 1:10, y = letters[1:10])
# auto splicing
batch <- record_batch(df, z = 1:10)
as_tibble(batch)
#> # A tibble: 10 x 3
#>        x y         z
#>    <int> <chr> <int>
#>  1     1 a         1
#>  2     2 b         2
#>  3     3 c         3
#>  4     4 d         4
#>  5     5 e         5
#>  6     6 f         6
#>  7     7 g         7
#>  8     8 h         8
#>  9     9 i         9
#> 10    10 j        10

# same as explicit splicing
batch <- record_batch(!!!df, z = 1:10)
as_tibble(batch)
#> # A tibble: 10 x 3
#>        x y         z
#>    <int> <chr> <int>
#>  1     1 a         1
#>  2     2 b         2
#>  3     3 c         3
#>  4     4 d         4
#>  5     5 e         5
#>  6     6 f         6
#>  7     7 g         7
#>  8     8 h         8
#>  9     9 i         9
#> 10    10 j        10
# auto splicing
tab <- table(df, z = 1:10)
as_tibble(tab)
#> # A tibble: 10 x 3
#>        x y         z
#>    <int> <chr> <int>
#>  1     1 a         1
#>  2     2 b         2
#>  3     3 c         3
#>  4     4 d         4
#>  5     5 e         5
#>  6     6 f         6
#>  7     7 g         7
#>  8     8 h         8
#>  9     9 i         9
#> 10    10 j        10

# same as explicit splicing
tab <- table(!!!df, z = 1:10)
as_tibble(tab)
#> # A tibble: 10 x 3
#>        x y         z
#>    <int> <chr> <int>
#>  1     1 a         1
#>  2     2 b         2
#>  3     3 c         3
#>  4     4 d         4
#>  5     5 e         5
#>  6     6 f         6
#>  7     7 g         7
#>  8     8 h         8
#>  9     9 i         9
#> 10    10 j        10

In particular, this gives us back record_batch(<data.frame>) and table(<data.frame>) :

library(arrow, warn.conflicts = FALSE)
library(tibble)

df <- tibble(x = 1:10, y = letters[1:10])
record_batch(df)
#> arrow::RecordBatch
table(df)
#> arrow::Table

Copy link
Contributor

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of notes. Generally looks good, might like to see a couple more tests.

r/R/RecordBatch.R Show resolved Hide resolved
r/tests/testthat/test-RecordBatch.R Show resolved Hide resolved
@@ -117,54 +117,84 @@ std::shared_ptr<arrow::Table> Table__from_dots(SEXP lst, SEXP schema_sxp) {
return tab;
}

R_xlen_t n = XLENGTH(lst);
std::vector<std::shared_ptr<arrow::Column>> columns(n);
int num_fields;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why the record_batch and table code isn't more shared

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be. record batch only have to handle arrays, where tables have to handle arrays, chunked arrays, columns. but I agree that the structure of the code is similar.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should refactor the flatten-for loop into a small function if they're all the same.

@@ -98,6 +98,7 @@
#' @export
record_batch <- function(..., schema = NULL){
arrays <- list2(...)
# making sure there are always names
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right but why? IIUC this has to do with how the cpp code distinguishes things to autosplice, switching behavior on nchar(name). That's a subtlety I'll probably forget by Monday so it seems worth explaining.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

names(list(a = 1, b = 2))
#> [1] "a" "b"
names(list(a = 1, 2))
#> [1] "a" ""
names(list(1, 2))
#> NULL

otherwise we'd have to treat the NULL case differently internally as we did before, maybe I could do that instead.

arrays[j] = arrow::r::Array__from_vector(x, schema->field(j)->type(), false);
};

for (R_xlen_t i = 0, j = 0; j < num_fields; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you simplify the loop conditions? Make it only iterate over i and XLENGTH(lst) and use an external variable,e.g.

int out_idx = 0;
for (R_xlen_t i = 0; i < XLENGHT(lst); i++) {
  ...
  if (..) {
    for (...)
      fill_array(out_idx++, VECTOR_ELT(x_i, k), STRING_ELT(names_x_i, k));
  } else {
    fill_array(out_idx++, x_i, name_i);
  }
}
assert(out_idx == num_fields);

The loop is now invariant on j (out_idx in example) Easier to follow and review/refactor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bonus point, get rid of all of this and make a C++ iterator over lst that will behave as-if "flattened".

std::vector<std::shared_ptr<arrow::Field>> fields(n_arrays);
for (R_xlen_t i = 0; i < n_arrays; i++) {
fields[i] = std::make_shared<arrow::Field>(std::string(names[i]), arrays[i]->type());
std::vector<std::shared_ptr<arrow::Field>> fields(num_fields);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like Schema::Make(const std::vector<Array>, const std::vector<string>) could be a convenient method to have globally.

@@ -117,54 +117,84 @@ std::shared_ptr<arrow::Table> Table__from_dots(SEXP lst, SEXP schema_sxp) {
return tab;
}

R_xlen_t n = XLENGTH(lst);
std::vector<std::shared_ptr<arrow::Column>> columns(n);
int num_fields;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should refactor the flatten-for loop into a small function if they're all the same.

@wesm
Copy link
Member

wesm commented Jun 27, 2019

What's the status on this? Since we're at the 0.14.0 end game we probably need to get this to merge readiness or defer to next release in the next 12-24 hours

@nealrichardson
Copy link
Contributor

This should go in 0.14, otherwise we'll have to facilitate a sparklyr workaround. Plus it's the right interface to expose.

IMO the feedback from François could be addressed in a followup ticket.

@fsaintjacques
Copy link
Contributor

I also think it's more important to get this in (seems like a very convenient feature), than have a simplified loop.

@nealrichardson
Copy link
Contributor

The appveyor failure was the r-project.org DNS outage from yesterday. I just checked out this PR locally and ran R CMD check and it was clean. Created https://issues.apache.org/jira/browse/ARROW-5761 for the followup. @fsaintjacques would you please merge?

@fsaintjacques
Copy link
Contributor

I will run the docker-compose R image locally, should be completed in 5-10 minutes and will merge if successful.

@fsaintjacques
Copy link
Contributor

docker image failed, but unrelated to this change, I'll open a ticket, but merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants