Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10415: [R] Support for dplyr::distinct() #11143

Closed
wants to merge 27 commits into from

Conversation

thisisnic
Copy link
Member

No description provided.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@thisisnic thisisnic marked this pull request as ready for review September 14, 2021 11:31

if (.keep_all == TRUE) {
# After ARROW-13767 is merged, we can implement this via e.g.
# iris %>% group_by(Species) %>% slice(1) %>% ungroup()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is accurate

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think ARROW-13767 will help us implement .keep_all = TRUE. To implement that, I think we would need a hash aggregate function hash_first() that returns the first value of a column within each group. However since row order within groups is potentially non-deterministic, this is perhaps better understood as a function that just grabs any one column value from within the group.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened ARROW-13993 for a hash aggregate function that returns the first value (effectively any one arbitrary value) from within each group

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ianmcook !

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thisisnic could you please open a separate Jira to implement .keep_all = TRUE for 7.0.0? Thanks

r/R/dplyr-distinct.R Outdated Show resolved Hide resolved
}

# Call mutate in case any columns are expressions
.data <- dplyr::mutate(.data, ...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need this: group_by will handle this for you, won't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point!

.data$group_by_vars <- gv
}

# Select the columns to return in the correct order
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are columns out of order? group_by should get the order right for you

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order of columns returned by group_by() and then summarise() are the order in which they're supplied to group_by(), whereas the order of columns returned from distinct() are the order in which they occur in the data, even if there's a group_by() call previously.

> tbl %>%
+       group_by(some_grouping, int) %>%
+       summarise() %>%
+       collect()
`summarise()` has grouped output by 'some_grouping'. You can override using the `.groups` argument.
# A tibble: 10 × 2
# Groups:   some_grouping [2]
   some_grouping   int
           <dbl> <int>
 1             1     1
 2             1     3
 3             1     5
 4             1     7
 5             1     9
 6             2     2
 7             2     6
 8             2     8
 9             2    10
10             2    NA
> tbl %>%
+       group_by(some_grouping, int) %>%
+       distinct(lgl) %>%
+       collect()
# A tibble: 10 × 3
# Groups:   some_grouping, int [10]
     int lgl   some_grouping
   <int> <lgl>         <dbl>
 1     1 TRUE              1
 2     2 NA                2
 3     3 TRUE              1
 4    NA FALSE             2
 5     5 TRUE              1
 6     6 NA                2
 7     7 NA                1
 8     8 FALSE             2
 9     9 FALSE             1
10    10 NA                2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, that's worthy of a comment in the code :)

Maybe you can sort the group_by_vars to match their order in the data before calling summarize, then the result from summarize should match?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, I'll do this!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment.

Am no longer using group_by_vars after realising that group_by() has a .add argument, which seems cleaner to take advantage of instead of redoing all of the grouping myself. Therefore, have kept in the ordering code.

.data <- dplyr::group_by(.data, !!!syms(vars_to_group))
.data <- dplyr::summarize(.data)

# Add back in any grouping which existed in the data previously
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect what you're actually hitting here is ARROW-13550, that summarize isn't restoring groups as expected yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point!

})

test_that("distinct() can retain groups", {
skip("ARROW-13777 - internally uses mutate on grouped data; should work after this is merged")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove your mutate call though and these should pass if you let group_by do the work

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will try that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also now have a PR open for ARROW-13777

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: that PR has now merged, so you can rebase and unskip

@thisisnic
Copy link
Member Author

It it reasonable to chuck arrange() into all of the tests as the order of rows returned is different?

@ianmcook
Copy link
Member

It it reasonable to chuck arrange() into all of the tests as the order of rows returned is different?

Yes

arrow_not_supported("`distinct()` with `.keep_all = TRUE`")
}

distinct_groups <- ensure_named_exprs(quos(...))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect a function body something like this:

.data <- dplyr::group_by(.data, ..., .add = TRUE)
gv <- dplyr::group_vars(.data) # this is `names(.data$group_by_vars)`
if (length(gv)) {
  # `data %>% group_by() %>% summarise()` returns cols in order supplied
  # but distinct() returns cols in dataset order, so sort group vars
  vars_in_data <- names(.data)[names(.data) %in% gv]
  .data$group_by_vars <- c(
    .data$group_by_vars[vars_in_data], 
    .data$group_by_vars[!(gv %in% vars_in_data)]
  )
} else {
  # distinct() with no vars specified means distinct across all cols
  .data <- dplyr::group_by(.data, !!!syms(names(.data)))
}
# TODO: use .groups = "keep" (ARROW-13550)
dplyr::summarize(.data)

IIUC the only specialness about distinct() over group_by() %>% summarize() is the order of columns and the meaning of 0 groups. If that's true then we should delegate the rest of the work to group_by and summarize.

Copy link
Member Author

@thisisnic thisisnic Sep 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to get it more clear in my head, but I think this won't work in cases where an expression has been passed into distinct().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's more complicated than that, that's fine. But group_by handles expressions too, so I would expect that it would do the right thing.

Copy link
Member Author

@thisisnic thisisnic Sep 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the nuance is in the difference between calling group_by() which handles expressions before adding things to .data$group_by_vars versus directly adding things to .data$group_by_vars, which only works (AFAIK) when it's just the names of the columns which are not expressions. I think I could do a better job of refactoring some of this to make it a bit tidier still though - will update shortly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion doesn't add things to group_by_vars directly, it only reorders them to make sure they follow dataset order.

Copy link
Member Author

@thisisnic thisisnic Sep 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tbh I'm struggling to follow this a bit. I'm not sure what this bit is doing:

  vars_in_data <- names(.data)[names(.data) %in% gv]
  .data$group_by_vars <- c(
    .data$group_by_vars[vars_in_data], 
    .data$group_by_vars[!(gv %in% vars_in_data)]
  )

Specifically, vars_in_data

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, never mind, I see now. You subset to get the same thing but reordered.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't get it to work that way at all, sorry. I think this is ready for re-review now @nealrichardson as have made some other changes in light of new tests which I think mean that this way no longer makes sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried again, but I end up trying to do horrible things with lists of quosures to rearrange them and then ensure they are named, which I can't get working. Given it works the way is is written now, I'm giving up on this, unless anyone has any very specific code suggestions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was annoying me, so I tried it again today. I got 90% of the way there, but I think ultimately, it requires us to effectively undo and redo any group_by() which happens before the call to distinct()

I ended up with this monstrosity at the start of the code:

gv <- dplyr::group_vars(.data)
gv_as_quo <- quos(!!!map(.data$selected_columns[gv], ~sym(.x$ToString())))

This isn't correct though, as it's pulling out the Expression and not the R expression used to create it, which I guess we can't get back at this point?

@nealrichardson
Copy link
Contributor

It it reasonable to chuck arrange() into all of the tests as the order of rows returned is different?

Yes

There's also a global option you can turn on, if you prefer. See NEWS.md and test-dplyr-summarize.R for an example.

ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
Closes apache#11143 from thisisnic/ARROW-10415_distinct

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Nic <thisisnic@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants