New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-5718: [R] auto splice data frames in record_batch() and table() #4704
ARROW-5718: [R] auto splice data frames in record_batch() and table() #4704
Conversation
663730f
to
61902ab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of notes. Generally looks good, might like to see a couple more tests.
@@ -117,54 +117,84 @@ std::shared_ptr<arrow::Table> Table__from_dots(SEXP lst, SEXP schema_sxp) { | |||
return tab; | |||
} | |||
|
|||
R_xlen_t n = XLENGTH(lst); | |||
std::vector<std::shared_ptr<arrow::Column>> columns(n); | |||
int num_fields; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious why the record_batch and table code isn't more shared
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be. record batch only have to handle arrays, where tables have to handle arrays, chunked arrays, columns. but I agree that the structure of the code is similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should refactor the flatten-for loop into a small function if they're all the same.
@@ -98,6 +98,7 @@ | |||
#' @export | |||
record_batch <- function(..., schema = NULL){ | |||
arrays <- list2(...) | |||
# making sure there are always names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right but why? IIUC this has to do with how the cpp code distinguishes things to autosplice, switching behavior on nchar(name)
. That's a subtlety I'll probably forget by Monday so it seems worth explaining.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
names(list(a = 1, b = 2))
#> [1] "a" "b"
names(list(a = 1, 2))
#> [1] "a" ""
names(list(1, 2))
#> NULL
otherwise we'd have to treat the NULL case differently internally as we did before, maybe I could do that instead.
arrays[j] = arrow::r::Array__from_vector(x, schema->field(j)->type(), false); | ||
}; | ||
|
||
for (R_xlen_t i = 0, j = 0; j < num_fields; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you simplify the loop conditions? Make it only iterate over i
and XLENGTH(lst) and use an external variable,e.g.
int out_idx = 0;
for (R_xlen_t i = 0; i < XLENGHT(lst); i++) {
...
if (..) {
for (...)
fill_array(out_idx++, VECTOR_ELT(x_i, k), STRING_ELT(names_x_i, k));
} else {
fill_array(out_idx++, x_i, name_i);
}
}
assert(out_idx == num_fields);
The loop is now invariant on j (out_idx in example) Easier to follow and review/refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bonus point, get rid of all of this and make a C++ iterator over lst
that will behave as-if "flattened".
std::vector<std::shared_ptr<arrow::Field>> fields(n_arrays); | ||
for (R_xlen_t i = 0; i < n_arrays; i++) { | ||
fields[i] = std::make_shared<arrow::Field>(std::string(names[i]), arrays[i]->type()); | ||
std::vector<std::shared_ptr<arrow::Field>> fields(num_fields); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like Schema::Make(const std::vector<Array>, const std::vector<string>)
could be a convenient method to have globally.
@@ -117,54 +117,84 @@ std::shared_ptr<arrow::Table> Table__from_dots(SEXP lst, SEXP schema_sxp) { | |||
return tab; | |||
} | |||
|
|||
R_xlen_t n = XLENGTH(lst); | |||
std::vector<std::shared_ptr<arrow::Column>> columns(n); | |||
int num_fields; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should refactor the flatten-for loop into a small function if they're all the same.
What's the status on this? Since we're at the 0.14.0 end game we probably need to get this to merge readiness or defer to next release in the next 12-24 hours |
This should go in 0.14, otherwise we'll have to facilitate a sparklyr workaround. Plus it's the right interface to expose. IMO the feedback from François could be addressed in a followup ticket. |
I also think it's more important to get this in (seems like a very convenient feature), than have a simplified loop. |
The appveyor failure was the r-project.org DNS outage from yesterday. I just checked out this PR locally and ran |
I will run the docker-compose R image locally, should be completed in 5-10 minutes and will merge if successful. |
docker image failed, but unrelated to this change, I'll open a ticket, but merge. |
elements of
...
that are unnamed inrecord_batch(...)
ortable(...)
are automatically spliced:In particular, this gives us back
record_batch(<data.frame>)
andtable(<data.frame>)
: