
ARROW-13860: [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame #11315

Closed
wants to merge 9 commits

Conversation

@nealrichardson (Member) commented Oct 5, 2021

  • Table/RecordBatch$create() on grouped_df no longer returns an arrow_dplyr_query, which was the change in the last release. This means these functions are type stable again, and this fixes the user report that write_parquet() doesn't work.
  • Instead of creating an arrow_dplyr_query, the group vars are stored in a special .group_vars attribute in metadata$r. This attribute is used to restore the groups on the round trip back to R, so grouped_df %>% record_batch() %>% as.data.frame() returns a grouped_df (see the sketch after this list).
  • The current dplyr release caches a lot of metadata about groups in a grouped_df, including all row indices matching each group value. This bloated the schema metadata we serialize, so it has been removed here. When converting back to a grouped_df/data.frame, dplyr will recreate this metadata.
  • The group_vars() and ungroup() methods for ArrowTabular read/write this new metadata$r$attributes$.group_vars field, so df %>% group_by() %>% record_batch() %>% group_vars() returns the same as df %>% record_batch() %>% group_by() %>% group_vars(). arrow_dplyr_query() also picks up on it.
  • New helper active binding $r_metadata to wrap the (de)serialization into the Arrow string KeyValueMetadata
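
A minimal sketch of the intended behavior after this change (assuming a build of the R package that includes this patch; the temp file and the cyl grouping are just for illustration):

library(arrow)
library(dplyr)

grouped <- mtcars %>% group_by(cyl)

# Table$create() returns a Table again (not an arrow_dplyr_query); the group
# vars travel in the serialized R metadata
tab <- Table$create(grouped)

# write_parquet() on a grouped_df works again
tf <- tempfile(fileext = ".parquet")
write_parquet(grouped, tf)

# Converting back to R restores the grouping from the .group_vars metadata
group_vars(as.data.frame(tab))   # "cyl"
group_vars(read_parquet(tf))     # "cyl"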

@github-actions bot commented Oct 5, 2021

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@jonkeane (Member) left a comment:

This looks good. I've gone through it once, but I want to think through possible edge cases/oddities a bit more. I'll follow up shortly if I find any.

Comment on lines +138 to +151
# Helper for the R metadata that handles the serialization
# See also method on ArrowTabular
if (missing(new)) {
  out <- self$metadata$r
  if (!is.null(out)) {
    # Can't unserialize NULL
    out <- .unserialize_arrow_r_metadata(out)
  }
  # Returns either NULL or a named list
  out
} else {
  # Set the R metadata
  self$metadata$r <- .serialize_arrow_r_metadata(new)
  self
Member:
Could (should?) we merge this and the tabular method? They look the same to me (except for the comment).

Member Author:
Could, I don't know about should 🤷. They are identical, though I'm not sure it's worth the trouble/indirection to set it up, since it's an active binding that switches based on whether the argument is missing.
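
For context on the pattern being discussed, here is a generic R6 sketch (class and field names are illustrative, not the actual arrow internals) of an active binding that acts as a getter when no value is supplied and as a setter otherwise:

library(R6)

MetadataHolder <- R6Class("MetadataHolder",
  public = list(raw_metadata = NULL),
  active = list(
    r_metadata = function(new) {
      if (missing(new)) {
        # x$r_metadata: deserialize and return (or NULL if unset)
        if (is.null(self$raw_metadata)) NULL else unserialize(self$raw_metadata)
      } else {
        # x$r_metadata <- value: serialize and store
        self$raw_metadata <- serialize(new, NULL)
      }
    }
  )
)

x <- MetadataHolder$new()
x$r_metadata <- list(attributes = list(.group_vars = "cyl"))
x$r_metadata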

r/tests/testthat/test-parquet.R (resolved)
@@ -129,6 +133,19 @@ remove_attributes <- function(x) {
}

arrow_attributes <- function(x, only_top_level = FALSE) {
  if (inherits(x, "grouped_df")) {
    # Keep only the group var names, not the rest of the cached data that
    # dplyr uses, which may be large
Member:
How large? We can do this as a follow-on if we want, but I would be curious whether removing it and having dplyr regenerate it is better or worse, performance-wise, than serializing the large cache.

Member Author:
You basically get a data.frame that is distinct() over the group_vars(), plus a list column of integer vectors of the row indices that match each group value combination:

> mtcars %>% group_by(cyl, hp) %>% attr("groups")
# A tibble: 23 × 3
     cyl    hp       .rows
   <dbl> <dbl> <list<int>>
 1     4    52         [1]
 2     4    62         [1]
 3     4    65         [1]
 4     4    66         [2]
 5     4    91         [1]
 6     4    93         [1]
 7     4    95         [1]
 8     4    97         [1]
 9     4   109         [1]
10     4   113         [1]
# … with 13 more rows

So clearly that gets big both when you have lots of rows and when you have high cardinality in your groups. I don't think it makes sense for us to save it to feather/parquet, and we don't need to because we can recreate it from just group_vars() on the round trip.
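
To make the "recreate it on the round trip" point concrete, here is a small dplyr-only illustration (the column names are just the mtcars example above; the actual restore path in arrow may differ):

library(dplyr)

# Only the group variable names need to survive serialization...
group_var_names <- c("cyl", "hp")

# ...because grouping the plain data.frame again rebuilds the cached
# "groups" attribute, including the potentially large .rows list column
restored <- mtcars %>% group_by(across(all_of(group_var_names)))
identical(group_vars(restored), group_var_names)  # TRUE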

r/tests/testthat/test-metadata.R (outdated; resolved)
self$set_pointer(out$pointer())
self
}
},
Member:
Table$metadata() would now dispatch through the ArrowTabular$metadata() method/binding, correct?

Member Author:
Correct. I factored out the ReplaceSchemaMetadata piece that was different between Table and RecordBatch (because of static typing) and then moved the method to ArrowTabular.
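
A generic R6 sketch of that refactor shape (names here are illustrative only, not the actual arrow classes): the shared logic lives once on a base class and delegates the one differing step to each subclass.

library(R6)

TabularBase <- R6Class("TabularBase",
  public = list(
    meta = NULL,
    # Shared getter/setter logic lives once on the base class...
    metadata = function(new) {
      if (missing(new)) {
        self$meta
      } else {
        # ...and only the piece that differs per subclass is delegated
        self$replace_schema_metadata(new)
        invisible(self)
      }
    },
    replace_schema_metadata = function(new) stop("implemented by subclasses")
  )
)

TableLike <- R6Class("TableLike",
  inherit = TabularBase,
  public = list(
    # a Table-specific implementation would call into C++ here
    replace_schema_metadata = function(new) self$meta <- new
  )
)

RecordBatchLike <- R6Class("RecordBatchLike",
  inherit = TabularBase,
  public = list(
    replace_schema_metadata = function(new) self$meta <- new
  )
)

t <- TableLike$new()
t$metadata(list(r = "serialized attributes"))
t$metadata()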

@jonkeane closed this in 7eba115 on Oct 14, 2021
@ursabot commented Oct 14, 2021

Benchmark runs are scheduled for baseline = 5845556 and contender = 7eba115. 7eba115 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.51% ⬆️0.51%] ursa-i9-9960x
[Finished ⬇️0.04% ⬆️0.04%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True
