
ARROW-13860: [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame #11315

Closed
wants to merge 9 commits

Conversation

@nealrichardson (Member) commented Oct 5, 2021

  • Table/RecordBatch$create() on grouped_df no longer returns an arrow_dplyr_query, which was the change in the last release. This means these functions are type stable again, and this fixes the user report that write_parquet() doesn't work.
  • Instead of creating an arrow_dplyr_query, the group vars are stored in a special .group_vars attribute in metadata$r. This attribute is used to restore the groups on the round trip back to R, so grouped_df %>% record_batch() %>% as.data.frame() returns a grouped_df (see the sketch after this list).
  • The current dplyr release caches a lot of metadata about groups in a grouped_df, including all row indices matching each group value. This bloated the schema metadata we serialize, so it has been removed here. When converting back to a grouped_df/data.frame, dplyr will recreate this metadata.
  • The group_vars() and ungroup() methods for ArrowTabular read/write this new metadata$r$attributes$.group_vars field, so df %>% group_by() %>% record_batch() %>% group_vars() returns the same as df %>% record_batch() %>% group_by() %>% group_vars(). arrow_dplyr_query() also picks up on it.
  • New helper active binding $r_metadata to wrap the (de)serialization into the Arrow string KeyValueMetadata
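
A minimal sketch of the intended behavior after this change (assuming a build of the R package that includes this patch; the temp file and the cyl grouping are just for illustration):

library(arrow)
library(dplyr)

grouped <- mtcars %>% group_by(cyl)

# Table$create() returns a Table again (not an arrow_dplyr_query); the group
# vars travel in the serialized R metadata
tab <- Table$create(grouped)

# write_parquet() on a grouped_df works again
tf <- tempfile(fileext = ".parquet")
write_parquet(grouped, tf)

# Converting back to R restores the grouping from the .group_vars metadata
group_vars(as.data.frame(tab))   # "cyl"
group_vars(read_parquet(tf))     # "cyl"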

@github-actions bot commented Oct 5, 2021

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@jonkeane (Member) left a comment:

This looks good. I've gone through it once, but I want to think through possible edge cases/oddities a bit more. I'll follow up shortly if I find any.

Comment on lines +138 to +151
# Helper for the R metadata that handles the serialization
# See also method on ArrowTabular
if (missing(new)) {
  out <- self$metadata$r
  if (!is.null(out)) {
    # Can't unserialize NULL
    out <- .unserialize_arrow_r_metadata(out)
  }
  # Returns either NULL or a named list
  out
} else {
  # Set the R metadata
  self$metadata$r <- .serialize_arrow_r_metadata(new)
  self
Member:
Could (should?) we merge this and the tabular method? They look the same to me (except for the comment).

Member Author:
Could, I don't know about should 🤷. They are identical, though I'm not sure it's worth the trouble/indirection to set it up, since it's an active binding that switches based on whether the argument is missing.
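
For context on the pattern being discussed, here is a generic R6 sketch (class and field names are illustrative, not the actual arrow internals) of an active binding that acts as a getter when no value is supplied and as a setter otherwise:

library(R6)

MetadataHolder <- R6Class("MetadataHolder",
  public = list(raw_metadata = NULL),
  active = list(
    r_metadata = function(new) {
      if (missing(new)) {
        # x$r_metadata: deserialize and return (or NULL if unset)
        if (is.null(self$raw_metadata)) NULL else unserialize(self$raw_metadata)
      } else {
        # x$r_metadata <- value: serialize and store
        self$raw_metadata <- serialize(new, NULL)
      }
    }
  )
)

x <- MetadataHolder$new()
x$r_metadata <- list(attributes = list(.group_vars = "cyl"))
x$r_metadata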

r/tests/testthat/test-parquet.R (resolved)
@@ -129,6 +133,19 @@ remove_attributes <- function(x) {
}

arrow_attributes <- function(x, only_top_level = FALSE) {
  if (inherits(x, "grouped_df")) {
    # Keep only the group var names, not the rest of the cached data that
    # dplyr uses, which may be large
Member:
How large? We can do this as a follow-on if we want, but I would be curious whether removing it and having dplyr regenerate it is better or worse, performance-wise, than serializing the large cache.

Member Author:
You basically get a data.frame that is distinct() over the group_vars(), plus a list column of integer vectors of the row indices that match each group value combination:

> mtcars %>% group_by(cyl, hp) %>% attr("groups")
# A tibble: 23 × 3
     cyl    hp       .rows
   <dbl> <dbl> <list<int>>
 1     4    52         [1]
 2     4    62         [1]
 3     4    65         [1]
 4     4    66         [2]
 5     4    91         [1]
 6     4    93         [1]
 7     4    95         [1]
 8     4    97         [1]
 9     4   109         [1]
10     4   113         [1]
# … with 13 more rows

So clearly that gets big both when you have lots of rows and when you have high cardinality in your groups. I don't think it makes sense for us to save it to feather/parquet, and we don't need to because we can recreate it from just group_vars() on the round trip.
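
To make the "recreate it on the round trip" point concrete, here is a small dplyr-only illustration (the column names are just the mtcars example above; the actual restore path in arrow may differ):

library(dplyr)

# Only the group variable names need to survive serialization...
group_var_names <- c("cyl", "hp")

# ...because grouping the plain data.frame again rebuilds the cached
# "groups" attribute, including the potentially large .rows list column
restored <- mtcars %>% group_by(across(all_of(group_var_names)))
identical(group_vars(restored), group_var_names)  # TRUE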

r/tests/testthat/test-metadata.R (outdated; resolved)
self$set_pointer(out$pointer())
self
}
},
Member:
Table$metadata() would now dispatch through the ArrowTabular$metadata() method/binding, correct?

Member Author:
Correct. I factored out the ReplaceSchemaMetadata piece that was different between Table and RecordBatch (because of static typing) and then moved the method to ArrowTabular.
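
A generic R6 sketch of that refactor shape (names here are illustrative only, not the actual arrow classes): the shared logic lives once on a base class and delegates the one differing step to each subclass.

library(R6)

TabularBase <- R6Class("TabularBase",
  public = list(
    meta = NULL,
    # Shared getter/setter logic lives once on the base class...
    metadata = function(new) {
      if (missing(new)) {
        self$meta
      } else {
        # ...and only the piece that differs per subclass is delegated
        self$replace_schema_metadata(new)
        invisible(self)
      }
    },
    replace_schema_metadata = function(new) stop("implemented by subclasses")
  )
)

TableLike <- R6Class("TableLike",
  inherit = TabularBase,
  public = list(
    # a Table-specific implementation would call into C++ here
    replace_schema_metadata = function(new) self$meta <- new
  )
)

RecordBatchLike <- R6Class("RecordBatchLike",
  inherit = TabularBase,
  public = list(
    replace_schema_metadata = function(new) self$meta <- new
  )
)

t <- TableLike$new()
t$metadata(list(r = "serialized attributes"))
t$metadata()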

@jonkeane closed this in 7eba115 on Oct 14, 2021
@ursabot commented Oct 14, 2021

Benchmark runs are scheduled for baseline = 5845556 and contender = 7eba115. 7eba115 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.51% ⬆️0.51%] ursa-i9-9960x
[Finished ⬇️0.04% ⬆️0.04%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True
