[C++] count_distinct aggregates incorrectly across row groups #32138

asfimport · 2022-06-10T17:00:44Z

When reading from parquet files with multiple row groups, count_distinct (wrapped by n_distinct in R) returns inaccurate and inconsistent results:

library(dplyr, warn.conflicts = FALSE)

path <- tempfile(fileext = '.parquet')
arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)

ds <- arrow::open_dataset(path)

ds %>% count(sex) %>% collect()
#> # A tibble: 5 × 2
#>   sex                n
#>   <chr>          <int>
#> 1 male              60
#> 2 none               6
#> 3 female            16
#> 4 hermaphroditic     1
#> 5 <NA>               4

ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    19
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17

# correct
ds %>% collect() %>% summarise(n = n_distinct(sex))
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1     5

If the file is stored as a single row group, results are correct. When grouped, results are correct.

I can reproduce this in Python as well using the same file and pyarrow.compute.count_distinct:

import pyarrow as pa
import pyarrow.parquet as pq

pa.__version__
#> 8.0.0

starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chc0000gn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')

pa.compute.count_distinct(starwars.column('sex')).as_py()
#> 15
pa.compute.unique(starwars.column('sex'))
#> [
#>   "male",
#>   "none",
#>   "female",
#>   "hermaphroditic",
#>    null
#> ]

This seems likely to be the same problem in this StackOverflow question: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array which is working from orc files.

Environment: > arrow::arrow_info()
Arrow package version: 8.0.0.9000

Capabilities:

dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 TRUE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip TRUE
brotli TRUE
zstd TRUE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 TRUE
jemalloc TRUE
mimalloc FALSE

Memory:

Allocator jemalloc
Current 37.25 Kb
Max 925.42 Kb

Runtime:

SIMD Level none
Detected SIMD Level none

Build:

C++ Library Version 9.0.0-SNAPSHOT
C++ Compiler AppleClang
C++ Compiler Version 13.1.6.13160021
Git ID d9d7894
Reporter: Edward Visel / @alistaire47
Assignee: Aldrin Montana / @drin

Related issues:

[C++] min/max not deterministic if Parquet files have multiple row groups (is related to)

PRs and other links:

GitHub Pull Request #13583

_{Note: This issue was originally created as ARROW-16807. Please see the migration documentation for further details.}

The text was updated successfully, but these errors were encountered:

asfimport · 2022-06-27T02:37:37Z

Yibo Cai / @cyb70289:
Looks current count_distinct doesn't handle chunked array. It simply accumulates the distinct counts of each chunk.
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/aggregate_basic.cc#L159
It's wrong if there are duplicated values among chunks.
E.g., for two chunks "1,2,3", "1,2,3", current count_distinct kernel outpus 3+3=6.

asfimport · 2022-06-28T19:47:56Z

Aldrin Montana / @drin:
@westonpace found a bug (47dd2ec). I will pick this up tomorrow and will try to add some extra unit tests and check for other bugs in the area.

asfimport · 2022-06-30T15:56:04Z

Aldrin Montana / @drin:
clarification: the above commit (47dd2ec) is actually for ARROW-16904.

asfimport · 2022-07-16T21:45:53Z

Wes McKinney / @wesm:
Issue resolved by pull request 13583
#13583

asfimport closed this as completed Jul 16, 2022

asfimport mentioned this issue Jan 11, 2023

[C++] min/max not deterministic if Parquet files have multiple row groups #20300

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++] count_distinct aggregates incorrectly across row groups #32138

[C++] count_distinct aggregates incorrectly across row groups #32138

asfimport commented Jun 10, 2022 •

edited

asfimport commented Jun 27, 2022

asfimport commented Jun 28, 2022

asfimport commented Jun 30, 2022

asfimport commented Jul 16, 2022

[C++] count_distinct aggregates incorrectly across row groups #32138

[C++] count_distinct aggregates incorrectly across row groups #32138

Comments

asfimport commented Jun 10, 2022 • edited

Related issues:

PRs and other links:

asfimport commented Jun 27, 2022

asfimport commented Jun 28, 2022

asfimport commented Jun 30, 2022

asfimport commented Jul 16, 2022

asfimport commented Jun 10, 2022 •

edited