Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] count_distinct aggregates incorrectly across row groups #32138

Closed
asfimport opened this issue Jun 10, 2022 · 4 comments
Closed

[C++] count_distinct aggregates incorrectly across row groups #32138

asfimport opened this issue Jun 10, 2022 · 4 comments
Labels
Milestone

Comments

@asfimport
Copy link

asfimport commented Jun 10, 2022

When reading from parquet files with multiple row groups, count_distinct (wrapped by n_distinct in R) returns inaccurate and inconsistent results:

library(dplyr, warn.conflicts = FALSE)

path <- tempfile(fileext = '.parquet')
arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)

ds <- arrow::open_dataset(path)

ds %>% count(sex) %>% collect()
#> # A tibble: 5 × 2
#>   sex                n
#>   <chr>          <int>
#> 1 male              60
#> 2 none               6
#> 3 female            16
#> 4 hermaphroditic     1
#> 5 <NA>               4

ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    19
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17

# correct
ds %>% collect() %>% summarise(n = n_distinct(sex))
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1     5

If the file is stored as a single row group, results are correct. When grouped, results are correct.

I can reproduce this in Python as well using the same file and pyarrow.compute.count_distinct:

import pyarrow as pa
import pyarrow.parquet as pq

pa.__version__
#> 8.0.0

starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chc0000gn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')

pa.compute.count_distinct(starwars.column('sex')).as_py()
#> 15
pa.compute.unique(starwars.column('sex'))
#> [
#>   "male",
#>   "none",
#>   "female",
#>   "hermaphroditic",
#>    null
#> ]

This seems likely to be the same problem in this StackOverflow question: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array which is working from orc files.

Environment: > arrow::arrow_info()
Arrow package version: 8.0.0.9000

Capabilities:

dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 TRUE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip TRUE
brotli TRUE
zstd TRUE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 TRUE
jemalloc TRUE
mimalloc FALSE

Memory:

Allocator jemalloc
Current 37.25 Kb
Max 925.42 Kb

Runtime:

SIMD Level none
Detected SIMD Level none

Build:

C++ Library Version 9.0.0-SNAPSHOT
C++ Compiler AppleClang
C++ Compiler Version 13.1.6.13160021
Git ID d9d7894
Reporter: Edward Visel / @alistaire47
Assignee: Aldrin Montana / @drin

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-16807. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Yibo Cai / @cyb70289:
Looks current count_distinct doesn't handle chunked array. It simply accumulates the distinct counts of each chunk.
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/aggregate_basic.cc#L159
It's wrong if there are duplicated values among chunks.
E.g., for two chunks "1,2,3", "1,2,3", current count_distinct kernel outpus 3+3=6.

@asfimport
Copy link
Author

Aldrin Montana / @drin:
@westonpace found a bug (47dd2ec). I will pick this up tomorrow and will try to add some extra unit tests and check for other bugs in the area.

@asfimport
Copy link
Author

Aldrin Montana / @drin:
clarification: the above commit (47dd2ec) is actually for ARROW-16904.

@asfimport
Copy link
Author

Wes McKinney / @wesm:
Issue resolved by pull request 13583
#13583

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant