-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-13860: [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame #11315
Conversation
|
9d513da
to
82040aa
Compare
82040aa
to
4525074
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, I've gone through it once, but want to think through a bit more of possible edgecases/oddities. I'll follow up in a little bit if I have some or not.
# Helper for the R metadata that handles the serialization | ||
# See also method on ArrowTabular | ||
if (missing(new)) { | ||
out <- self$metadata$r | ||
if (!is.null(out)) { | ||
# Can't unserialize NULL | ||
out <- .unserialize_arrow_r_metadata(out) | ||
} | ||
# Returns either NULL or a named list | ||
out | ||
} else { | ||
# Set the R metadata | ||
self$metadata$r <- .serialize_arrow_r_metadata(new) | ||
self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could? Should? we merge this and the tabular method? They look the same to me (except the comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could, IDK about should 🤷. They are identical, though not sure it's worth the trouble/indirection to set it up since it's an active binding that switches based on whether the argument is missing.
@@ -129,6 +133,19 @@ remove_attributes <- function(x) { | |||
} | |||
|
|||
arrow_attributes <- function(x, only_top_level = FALSE) { | |||
if (inherits(x, "grouped_df")) { | |||
# Keep only the group var names, not the rest of the cached data that dplyr | |||
# uses, which may be large |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How large? We can do this as a follow on if we want, but I would be curious if removing it and then having dplyr regenerate it is worse or better (performance wise) compared to serializing the large cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You basically get a data.frame that is distinct(group_vars())
plus a list column of integer vectors of the row indices that match each condition:
> mtcars %>% group_by(cyl, hp) %>% attr("groups")
# A tibble: 23 × 3
cyl hp .rows
<dbl> <dbl> <list<int>>
1 4 52 [1]
2 4 62 [1]
3 4 65 [1]
4 4 66 [2]
5 4 91 [1]
6 4 93 [1]
7 4 95 [1]
8 4 97 [1]
9 4 109 [1]
10 4 113 [1]
# … with 13 more rows
So clearly that gets big both when you have lots of rows and when you have high cardinality in your groups. I don't think it makes sense for us to save it to feather/parquet, and we don't need to because we can recreate it from just group_vars()
on the round trip.
self$set_pointer(out$pointer()) | ||
self | ||
} | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Table$metadata()
would now dispatch through the ArrowTabular$metadata()
method/binding now, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, I factored out the ReplaceSchemaMetadata piece that was different between Table and RecordBatch (bc static typing) and then moved the method to ArrowTabular
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
Benchmark runs are scheduled for baseline = 5845556 and contender = 7eba115. 7eba115 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Table/RecordBatch$create()
ongrouped_df
no longer returns anarrow_dplyr_query
, which was the change in the last release. This means these functions are type stable again, and this fixes the user report that write_parquet() doesn't work.arrow_dplyr_query
, group vars are stored in a special.group_vars
attribute in themetadata$r
. This attribute is used to restore groups on the round trip back to R, sogrouped_df %>% record_batch() %>% as.data.frame()
returns agrouped_df
grouped_df
, including all row indices matching each group value. This bloated the schema metadata we serialize, so it has been removed here. When converting back to agrouped_df
/data.frame
, dplyr will recreate this metadata.group_vars()
andungroup()
methods forArrowTabular
read/write this newmetadata$r$attributes$.group_vars
field, sodf %>% group_by() %>% record_batch() %>% group_vars()
returns the same asdf %>% record_batch() %>% group_by() %>% group_vars()
.arrow_dplyr_query()
also picks up on it.$r_metadata
to wrap the (de)serialization into the Arrow string KeyValueMetadata