-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-14919: [R] write_parquet() drops attributes for grouped dataframes #12073
ARROW-14919: [R] write_parquet() drops attributes for grouped dataframes #12073
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good, one comment (which I think is totally fine, but want to call it out none the less)
test_that("grouped_df non-arrow metadata is preserved", { | ||
|
||
simple_tbl <- tibble(a = 1:2, b = 3:4) | ||
attr(simple_tbl, "other_metadata") <- "look I'm still here!" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"look I'm still here!"
😂
r/R/metadata.R
Outdated
@@ -115,7 +115,7 @@ apply_arrow_r_metadata <- function(x, r_metadata) { | |||
remove_attributes <- function(x) { | |||
removed_attributes <- character() | |||
if (identical(class(x), c("tbl_df", "tbl", "data.frame"))) { | |||
removed_attributes <- c("class", "row.names", "names") | |||
removed_attributes <- c("class", "row.names", "names", "groups") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be hard to find a definitive answer if this is ok, but this will remove any groups
attribute whether or not it is from dplyr grouping. It is almost certainly ok, since dplyr itself would overwrite this attribute if something used it, but in this code path, we aren't any more checking inherits(x, "grouped_df")
before deleting that attribute.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Have updated the PR to account for this - let me know if it looks alright to you.
# Regardless, we shouldn't keep groups around | ||
attr(x, "groups") <- NULL | ||
att[[".group_vars"]] <- gv | ||
removed_attributes <- c(removed_attributes, "groups", "class") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"class" should already be in removed_attributes
right? Or am I misunderstanding the flow up there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When remove_attributes()
is called on x
it hits this block of code:
if (identical(class(x), c("tbl_df", "tbl", "data.frame"))) { removed_attributes <- c("class", "row.names", "names") } else if (inherits(x, "data.frame")) { removed_attributes <- c("row.names", "names")
As it's a grouped_df
it fails the first condition and hits the second. That doesn't include "class" for reasons I don't fully understand but its addition makes other tests fail and changes the returned object type. I've tried a few other approaches but they make other tests fail, and I'm going down a bit of a rabbit hole.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aaaah, I missed that identical()
bit there the first time I was reading. I presume one of the branches you tried was something like all(c("tbl_df", "tbl", "data.frame") %in% class(x)))
and that didn't work?
I'm fine to keep it here if that doesn't work (/ causes other issues).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - there are checks for object class in other tests and those began to fail. How about we leave it as is for now, but fix later if other issues arise? I wanted to find a simple solution but it's more complex than I thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
8cf0d0c
to
a7d8d5a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thanks for this
Benchmark runs are scheduled for baseline = d19b0e8 and contender = 4a76d5e. 4a76d5e is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
No description provided.