[R] [C++] Implement SQL-alike distinct() for dplyr queries #18714

Description

@asfimport

Hi,

It would be useful to be able to get the unique combinations of a set of columns directly from a dataset query, for example:

```r
open_dataset("sitc-rev2/parquet/",
             partitioning = c("Year", "Trade Flow", "Reporter ISO")) %>%
  select(Year, `Reporter ISO`) %>%
  filter(Year >= 1988 & Year <= 1994) %>%
  distinct() %>%
  collect()
```

However, in the current development version of the Arrow package (installed from GitHub), the `distinct()` step fails with this error:

```
Error in UseMethod("distinct") : 
  no applicable method for 'distinct' applied to an object of class "arrow_dplyr_query"
```

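Calling `collect()` first and then `distinct()` on the resulting data frame works, although all the filtered rows are pulled into memory before deduplication: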

```r
reporters_1 <- open_dataset("sitc-rev2/parquet/",
             partitioning = c("Year", "Trade Flow", "Reporter ISO")) %>%
  select(Year, `Reporter ISO`) %>%
  filter(Year >= 1988 & Year <= 1994) %>%
  collect() %>%
  distinct()
```
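
For anyone without access to the `sitc-rev2` files, the workaround can be tried on a small in-memory Arrow table. This is only a sketch: the toy data and the names `trade`, `reporter_iso` and `reporters_small` are made up for illustration and are not part of the original dataset.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data standing in for the partitioned Parquet dataset
trade <- Table$create(data.frame(
  Year = c(1988L, 1988L, 1990L, 1996L),
  reporter_iso = c("chl", "chl", "per", "chl"),
  stringsAsFactors = FALSE
))

# select() and filter() are evaluated by Arrow; collect() materialises the
# result as a data frame, and dplyr's own distinct() then removes duplicates
reporters_small <- trade %>%
  select(Year, reporter_iso) %>%
  filter(Year >= 1988 & Year <= 1994) %>%
  collect() %>%
  distinct()

print(reporters_small)
```

The deduplication happens in R after the data has been collected, which is exactly the step this issue asks to move into the Arrow query.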

Reporter: Mauricio 'Pachá' Vargas Sepúlveda / @pachadotdev

Related issues:

Note: This issue was originally created as ARROW-13107. Please see the migration documentation for further details.
