Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-14919: [R] write_parquet() drops attributes for grouped dataframes #12073

Closed

Conversation

thisisnic
Copy link
Member

No description provided.

@github-actions
Copy link

github-actions bot commented Jan 4, 2022

@github-actions
Copy link

github-actions bot commented Jan 4, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good, one comment (which I think is totally fine, but want to call it out none the less)

test_that("grouped_df non-arrow metadata is preserved", {

simple_tbl <- tibble(a = 1:2, b = 3:4)
attr(simple_tbl, "other_metadata") <- "look I'm still here!"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"look I'm still here!"

😂

r/R/metadata.R Outdated
@@ -115,7 +115,7 @@ apply_arrow_r_metadata <- function(x, r_metadata) {
remove_attributes <- function(x) {
removed_attributes <- character()
if (identical(class(x), c("tbl_df", "tbl", "data.frame"))) {
removed_attributes <- c("class", "row.names", "names")
removed_attributes <- c("class", "row.names", "names", "groups")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be hard to find a definitive answer if this is ok, but this will remove any groups attribute whether or not it is from dplyr grouping. It is almost certainly ok, since dplyr itself would overwrite this attribute if something used it, but in this code path, we aren't any more checking inherits(x, "grouped_df") before deleting that attribute.

Thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Have updated the PR to account for this - let me know if it looks alright to you.

# Regardless, we shouldn't keep groups around
attr(x, "groups") <- NULL
att[[".group_vars"]] <- gv
removed_attributes <- c(removed_attributes, "groups", "class")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"class" should already be in removed_attributes right? Or am I misunderstanding the flow up there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When remove_attributes() is called on x it hits this block of code:
if (identical(class(x), c("tbl_df", "tbl", "data.frame"))) { removed_attributes <- c("class", "row.names", "names") } else if (inherits(x, "data.frame")) { removed_attributes <- c("row.names", "names")
As it's a grouped_df it fails the first condition and hits the second. That doesn't include "class" for reasons I don't fully understand but its addition makes other tests fail and changes the returned object type. I've tried a few other approaches but they make other tests fail, and I'm going down a bit of a rabbit hole.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aaaah, I missed that identical() bit there the first time I was reading. I presume one of the branches you tried was something like all(c("tbl_df", "tbl", "data.frame") %in% class(x))) and that didn't work?

I'm fine to keep it here if that doesn't work (/ causes other issues).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - there are checks for object class in other tests and those began to fail. How about we leave it as is for now, but fix later if other issues arise? I wanted to find a simple solution but it's more complex than I thought.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

@thisisnic thisisnic force-pushed the ARROW-14919_groups_drops_metadata branch from 8cf0d0c to a7d8d5a Compare January 6, 2022 15:44
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks for this

@thisisnic thisisnic closed this in 4a76d5e Jan 6, 2022
@ursabot
Copy link

ursabot commented Jan 6, 2022

Benchmark runs are scheduled for baseline = d19b0e8 and contender = 4a76d5e. 4a76d5e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants